neural networks | LIFERAY UI

November 27, 2018 viral

Transfer Learning

Transfer learning does exactly as the name says. The idea is to transfer something learned from one task and apply it to another. Why? Practically speaking, training entire models from scratch every time is inefficient, and its success depends on many factors.

Another important reason is that for certain applications, the datasets that are publicly available are not big enough to train a deep architecture like AlexNet or ResNet without over-fitting, which means failing to generalize. Example applications could be online learning from a few examples given by the user or fine-grained classification, where the variation between the classes is minimal.

A very interesting observation is that networks trained on different tasks, whether detection or classification, end up with early-layer weights that look very similar. As a result, the final layers can be retrained for a different task while all the remaining layers stay frozen.

This leads to the idea of transfer learning. For example, a network trained on ImageNet generalizes so well that its convolutional weights can act as feature extractors, similar to conventional visual representations, and can be used to train a linear classifier for various tasks.

When?

Research has shown that features extracted with convolutional network weights trained on ImageNet outperform conventional feature extraction methods such as SURF, Deformable Part Descriptors (DPDs), Histogram of Oriented Gradients (HOG), and bag of words (BoW).

This means we can use convolutional features wherever we would use conventional visual representations. The only drawback is that deeper architectures might take longer to extract the features.

Consider a deep convolutional neural network trained on ImageNet. Visualizing the convolution filters in the first layers shows that they learn low-level features similar to edge detection filters, whereas the convolution filters in the last layers learn high-level features that capture class-specific information.

Hence, if you extract the features for ImageNet after the first pooling layer and embed them into a 2D space, the visualization will show that the data is disorganized. However, if we do the same at the fully connected layers, we can see that data with the same semantic information gets organized into clusters. This implies that the network generalizes quite well at higher levels, and that it will be possible to transfer this knowledge to unseen classes.

According to transfer learning experiments conducted on datasets with a small degree of similarity to ImageNet, features based on convolutional neural network weights trained on ImageNet perform better than conventional feature extraction methods for the following tasks:

Object recognition: This CNN feature extractor can successfully perform classification tasks on other datasets with unseen classes.
Domain adaptation: This is when the training and testing data come from different distributions, while the labels and number of classes are the same. Different domains can be images captured with different devices, or in different settings and environmental conditions. A linear classifier with CNN features successfully clusters images with the same semantic information across different domains, while SURF features overfit to domain-specific characteristics.
Fine-grained classification: This is when we want to classify between subcategories within the same high-level class. For example, we can distinguish between bird species. CNN features, along with logistic regression, perform better than the baseline approaches even though they were not trained on fine-grained data.
Scene recognition: Here, we need to classify the scene of the entire image. A CNN feature extractor trained on object classification databases, with a simple linear classifier on top, outperforms complex learning algorithms applied to traditional feature extractors on scene recognition data.

Some of the tasks mentioned here are not directly related to image classification, which was the primary goal while training on ImageNet, so one would expect the CNN features to fail to generalize to unseen scenarios. However, those features, combined with a simple linear classifier, outperform the hand-crafted features. This means that the learned weights of a CNN are reusable.

So when should we use transfer learning? When the dataset available for our task is small due to the nature of the problem (such as classifying ants and bees). In this case, we can train our model on a larger dataset that contains similar semantic information and subsequently retrain only the last layer (the linear classifier) with the small dataset.

If we have just enough data available, and there is a larger dataset similar to ours, pretraining on this similar dataset may result in a more robust model. Normally we train models with randomly initialized weights; in this case, they are initialized with the weights trained on the other dataset. This helps the network converge faster and generalize better. In this scenario, it would make sense to fine-tune only a few layers at the top end of the model.
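
The two scenarios above differ mainly in how many layers at the top of the network are retrained. The following is a minimal sketch of that choice, not a recipe from any library: the helper name split_frozen_trainable is hypothetical, and the layer names are borrowed from the encoder used later in this article.

```python
def split_frozen_trainable(layer_names, n_trainable):
    """Split a list of layers (ordered input to output) into (frozen, trainable).

    The last `n_trainable` layers are retrained; every layer before them
    keeps its pretrained weights.
    """
    if not 0 <= n_trainable <= len(layer_names):
        raise ValueError("n_trainable must be between 0 and the number of layers")
    cut = len(layer_names) - n_trainable
    return layer_names[:cut], layer_names[cut:]


# Layer names ordered from input to output (assumed for illustration).
layers = ["conv1", "conv2", "fc1", "logits"]

# Small dataset: retrain only the last layer (the linear classifier).
frozen_small, trainable_small = split_frozen_trainable(layers, 1)

# Just enough data: fine-tune a few layers at the top end of the model.
frozen_medium, trainable_medium = split_frozen_trainable(layers, 2)

print("small dataset  -> train", trainable_small, "freeze", frozen_small)
print("enough data    -> train", trainable_medium, "freeze", frozen_medium)
```

How many layers count as "a few" is a judgment call that depends on the data; the value 2 here is only an example.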

How? An overview

There are two typical ways to go about this.

The first and more common way is to use a pre-trained model, that is, a model that has previously been trained on a large-scale dataset. Such models are readily available across different deep learning frameworks, and collections of them are often referred to as “model zoos”.

The choice of pre-trained model depends largely on the task to be solved and the size of the datasets. Once the model is chosen, we can use all of it, or parts of it, as the initialized model for the actual task that we want to solve.

The other, less common way is to pretrain the model ourselves. This typically occurs when the available pretrained networks are not suitable to solve specific problems, and we have to design the network architecture ourselves.

Obviously, this requires more time and effort to design the model and prepare the dataset. In some cases, the dataset used to pre-train the network can even be synthetic, generated with computer graphics engines such as 3D Studio Max or Unity, or with other convolutional neural networks, such as GANs. A model pre-trained on virtual data can then be fine-tuned on real data, and it can work as well as a model trained solely on real data.

If we want to discriminate between cats and dogs, and we do not have enough data, we can download a network trained on ImageNet from the “model zoo” and use the weights from all but the last of its layers.

The last layer has to be resized to match the number of classes, and its weights have to be reinitialized and trained.

This means we can freeze the layers that are not to be trained by setting the learning rate for these layers to zero, or to a very small number. If a bigger dataset is available, we can train the last three fully connected layers. Sometimes, a pre-trained network is used only to initialize the weights and is then trained normally.
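
The steps above (reuse all but the last layer, rebuild the last layer for the new number of classes, and freeze the rest with a zero learning rate) can be sketched in plain Python. This is only an illustration under stated assumptions: weights are nested lists, the layer names "features" and "classifier" and both helper functions are hypothetical, and the gradients are dummy values rather than the result of backpropagation.

```python
import random


def init_weights(rows, cols, rng):
    """Small random weights for a freshly initialized layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def adapt_pretrained(pretrained, last_layer, num_classes, seed=0):
    """Copy every layer except the last; rebuild the last one for num_classes outputs."""
    rng = random.Random(seed)
    model = {name: [row[:] for row in w]
             for name, w in pretrained.items() if name != last_layer}
    in_features = len(pretrained[last_layer])  # the input size of the last layer is kept
    model[last_layer] = init_weights(in_features, num_classes, rng)
    return model


def sgd_step(model, grads, learning_rates):
    """One SGD update with a per-layer learning rate; a rate of 0.0 freezes the layer."""
    for name, weights in model.items():
        lr = learning_rates[name]
        for r, row in enumerate(weights):
            for c in range(len(row)):
                row[c] -= lr * grads[name][r][c]


# Hypothetical pretrained network: a 4->3 feature layer and a 3->1000 classifier.
rng = random.Random(1)
pretrained = {"features": init_weights(4, 3, rng),
              "classifier": init_weights(3, 1000, rng)}

# Cats vs. dogs: the new last layer has 2 outputs.
model = adapt_pretrained(pretrained, "classifier", num_classes=2)

# Dummy gradients of 1.0 everywhere, only to show which weights move.
grads = {name: [[1.0] * len(row) for row in w] for name, w in model.items()}
before = [row[:] for row in model["features"]]
sgd_step(model, grads, {"features": 0.0, "classifier": 0.01})

print("features unchanged:", model["features"] == before)
print("new classifier shape:", len(model["classifier"]), "x", len(model["classifier"][0]))
```

Using a very small non-zero learning rate instead of 0.0 would let the pretrained layers drift slowly rather than stay fixed.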

Transfer learning works because the features computed at the initial layers are more general and look similar. The features extracted in the top layers become more specific to the problem that we want to solve.

How? Code example

In this section you will learn the practical skills needed to perform transfer learning in TensorFlow. More specifically, we’ll learn how to select layers to be loaded from a checkpoint and also how to instruct your solver to optimize only specific layers while freezing the others.

TensorFlow useful elements

Since transfer learning is about training a network initialized with weights taken from another trained model, we first need to find such a model. In our example, we will use the encoding part of a pretrained convolutional autoencoder. The advantage of using an autoencoder is that we do not need labelled data; it can be trained completely unsupervised.

An autoencoder without the decoder

An encoder (autoencoder without the decoder part) that consists of two convolutional layers and one fully connected layer is presented as follows. The parent autoencoder was trained on the MNIST dataset. Therefore, the network takes as input an image of size 28x28x1 and at latent space, encodes it to a 10-dimensional vector, one dimension for each class:

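import tensorflow as tf

# Only half of the autoencoder changed for classification
class CAE_CNN_Encoder(object):
    # Assumption: __init__ below reads this class attribute, but the original
    # listing elides its definition, so it is declared here.
    __instance = None
    # ...... (remainder of the class is elided in the original listing)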
    def build_graph(self, img_size=28):
        self.__x = tf.placeholder(tf.float32, shape=[None, img_size * img_size], name='IMAGE_IN')
        self.__x_image = tf.reshape(self.__x, [-1, img_size, img_size, 1])
        self.__y_ = tf.placeholder("float", shape=[None, 10], name='Y')

        with tf.name_scope('ENCODER'):
            ##### ENCODER
            # CONV1: Input 28x28x1 after CONV 5x5 P:2 S:2 H_out: 1 + (28+4-5)/2 = 14, 
            # W_out= 1 + (28+4-5)/2 = 14
            self.__conv1_act = tf.layers.conv2d(inputs=self.__x_image, strides=(2, 2), name='conv1',
                              filters=16, kernel_size=[5, 5], padding="same", activation=tf.nn.relu)

            # CONV2: Input 14x14x16 after CONV 5x5 P:2 S:2 H_out: 1 + (14+4-5)/2 = 7,
            # W_out= 1 + (14+4-5)/2 = 7 (division rounded down)
            self.__conv2_act = tf.layers.conv2d(inputs=self.__conv1_act, strides=(2, 2),
                name='conv2', filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu)

        with tf.name_scope('LATENT'):
            # Flatten: input 7x7x32 becomes a vector of 7*7*32 = 1568 values
            self.__enc_out = tf.layers.flatten(self.__conv2_act, name='flatten_conv2')
            self.__dense = tf.layers.dense(inputs=self.__enc_out, units=200, activation=tf.nn.relu,
                                           name='fc1')
            self.__logits = tf.layers.dense(inputs=self.__dense, units=10, name='logits')

    def __init__(self, img_size=28):
        if CAE_CNN_Encoder.__instance is None:
            self.build_graph(img_size)

    @property
    def output(self):
        return self.__logits

    @property
    def labels(self):
        return self.__y_

    @property
    def input(self):
        return self.__x

    @property
    def image_in(self):
        return self.__x_image
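
The shape comments in the encoder follow the usual convolution output formula, 1 + floor((size + 2 * padding - kernel) / stride). A small sketch to check that arithmetic is given below; the function name conv_output_size is hypothetical, and the padding of 2 is taken from the comments in the listing. For padding="same", TensorFlow computes the output size as ceil(size / stride), which gives the same numbers here.

```python
import math


def conv_output_size(size, kernel, padding, stride):
    """Spatial output size of a convolution: 1 + floor((size + 2*padding - kernel) / stride)."""
    return 1 + (size + 2 * padding - kernel) // stride


# CONV1: 28x28 input, 5x5 kernel, padding 2, stride 2
conv1 = conv_output_size(28, kernel=5, padding=2, stride=2)

# CONV2: takes the CONV1 output, same kernel, padding and stride
conv2 = conv_output_size(conv1, kernel=5, padding=2, stride=2)

# Flattened size after CONV2, which has 32 filters
flat = conv2 * conv2 * 32

# Cross-check against the "same" padding rule: ceil(size / stride)
same_conv1 = math.ceil(28 / 2)
same_conv2 = math.ceil(same_conv1 / 2)

print(conv1, conv2, flat)
```

The flattened size matches the first dimension of the fc1 kernel, (1568, 200), printed in the next section.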

Selecting layers

Once the model is defined with model=CAE_CNN_Encoder(), it is important to select the layers that will be initialized with pretrained weights. Note that the structure of both networks must be the same. For example, the following snippet groups the variables by name into convolutional layers and fully connected layers:

from models import CAE_CNN_Encoder
model = CAE_CNN_Encoder()

list_convs = [v for v in tf.global_variables() if "conv" in v.name]
# The last layer is named 'logits', so match on that name to include it
list_fc_linear = [v for v in tf.global_variables() if "fc" in v.name or "logits" in v.name]

Note that those lists are populated from tf.global_variables(); if we choose to print its content, we might observe that it holds all the model variables as shown:

[<tf.Variable 'conv1/kernel:0' shape=(5, 5, 1, 16) dtype=float32_ref>,
 <tf.Variable 'conv1/bias:0' shape=(16,) dtype=float32_ref>,
 <tf.Variable 'conv2/kernel:0' shape=(5, 5, 16, 32) dtype=float32_ref>,
 <tf.Variable 'conv2/bias:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'fc1/kernel:0' shape=(1568, 200) dtype=float32_ref>,
 <tf.Variable 'fc1/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'logits/kernel:0' shape=(200, 10) dtype=float32_ref>,
 <tf.Variable 'logits/bias:0' shape=(10,) dtype=float32_ref>]
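
Because the selection is plain substring matching on variable names, it can be tried out without TensorFlow. The sketch below uses the names from the printout above; the helper select_by_substring is hypothetical. It also shows that a substring matching no variable name, such as "output" for this graph, silently returns an empty list, so it is worth checking the lists before building a saver from them.

```python
# Variable names copied from the printout of tf.global_variables() above.
variable_names = [
    "conv1/kernel:0", "conv1/bias:0",
    "conv2/kernel:0", "conv2/bias:0",
    "fc1/kernel:0", "fc1/bias:0",
    "logits/kernel:0", "logits/bias:0",
]


def select_by_substring(names, *substrings):
    """Return the names that contain at least one of the given substrings."""
    return [n for n in names if any(s in n for s in substrings)]


convs = select_by_substring(variable_names, "conv")
fc_and_logits = select_by_substring(variable_names, "fc", "logits")
no_match = select_by_substring(variable_names, "output")

print("conv variables:", convs)
print("fc and logits variables:", fc_and_logits)
print("matches for 'output':", no_match)
```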

Once the layers of the defined graph are grouped into two lists, convolutional and fully connected, we can use tf.train.Saver to load the weights that we want. First, we need to create a saver object, giving as input the list of variables that we want to load from a checkpoint, as follows:

# Define the saver object to load only the conv variables
saver_load_autoencoder = tf.train.Saver(var_list=list_convs)

In addition to saver_load_autoencoder, we need to create another saver object that will allow us to store all the variables of the network being trained into checkpoints.

# Define saver object to save all the variables during training
saver = tf.train.Saver()

Then, after the graph is initialized with init=tf.global_variables_initializer() and a session is created, we can use saver_load_autoencoder to restore the convolutional layers from a checkpoint as follows:

# Restore only the weights (From AutoEncoder)
saver_load_autoencoder.restore(sess, "../tmp/cae_cnn/model.ckpt-34")

Note that calling restore overrides the values set by global_variables_initializer, and all the selected weights are replaced by the ones from the checkpoint.
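
The order of operations (initialize everything, then overwrite only the selected variables) can be modelled with dictionaries. This is a conceptual sketch, not how tf.train.Saver is implemented: scalars stand in for weight tensors, and the checkpoint contents, including the decoder_out entry, are hypothetical.

```python
import random


def initialize_all(names, seed=0):
    """Stand-in for the global initializer: every variable gets a random value."""
    rng = random.Random(seed)
    return {name: rng.uniform(-1.0, 1.0) for name in names}


def restore_selected(variables, checkpoint, var_list):
    """Stand-in for restore: overwrite only the variables listed in var_list."""
    for name in var_list:
        variables[name] = checkpoint[name]


names = ["conv1/kernel", "conv2/kernel", "fc1/kernel", "logits/kernel"]

# Hypothetical autoencoder checkpoint: encoder convs plus a decoder variable
# that the classifier graph does not have.
checkpoint = {"conv1/kernel": 0.5, "conv2/kernel": -0.25, "decoder_out/kernel": 0.75}

variables = initialize_all(names)
initial = dict(variables)  # snapshot taken right after initialization

conv_names = [n for n in names if "conv" in n]
restore_selected(variables, checkpoint, conv_names)

for name in names:
    source = "checkpoint" if name in conv_names else "initializer"
    print(name, "<-", source)
```

Only the convolutional entries change; fc1 and logits keep their freshly initialized values, and the decoder variable in the checkpoint is never read.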

Training only some layers

Another important part of transfer learning is freezing the weights of the layers that we don’t want to train, while allowing some layers (typically the final ones) to be trained.

In TensorFlow, we can pass to our solver only the layers that we want to optimize (in this example, only the FC layers):

train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=list_fc_linear)
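
To see what restricting the update to var_list does, here is a tiny stand-alone sketch in plain Python. It uses ordinary gradient descent rather than Adam, and a two-parameter toy model in which "conv" plays the frozen pretrained weight and "fc" the trainable one; the model, the data, and the train function are all assumptions made for illustration.

```python
def train(params, var_list, data, learning_rate=0.05, steps=200):
    """Gradient descent on squared error, updating only the names in var_list."""
    for _ in range(steps):
        for x, y in data:
            feature = params["conv"] * x      # output of the frozen feature extractor
            pred = params["fc"] * feature     # output of the trainable layer
            err = pred - y
            # Gradients of err**2 with respect to each parameter
            grads = {
                "conv": 2 * err * params["fc"] * x,
                "fc": 2 * err * feature,
            }
            # Only the selected variables are updated; the others stay frozen
            for name in var_list:
                params[name] -= learning_rate * grads[name]
    return params


params = {"conv": 2.0, "fc": 0.0}                 # "conv" stands for a pretrained weight
data = [(1.0, 6.0), (2.0, 12.0), (-1.0, -6.0)]    # y = 6x, so fc should approach 3 when conv = 2

train(params, var_list=["fc"], data=data)
print("conv:", params["conv"], "fc:", round(params["fc"], 4))
```

Gradients are still computed for the frozen parameter, but they are never applied, which is the effect of passing only the FC variables to the optimizer.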

Complete source

In this example, we will load the weights from an MNIST convolutional autoencoder example. We will restore the weights of the encoder part only and freeze the CONV layers. Then we train the FC layers to perform digit classification:

import os

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

from models import CAE_CNN_Encoder

SAVE_FOLDER = '/tmp/cae_cnn_transfer'

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# The encoder's __init__ shown earlier only accepts img_size, so use the defaults
model = CAE_CNN_Encoder()
model_in = model.input
model_out = model.output
labels_in = model.labels

# Get all convs weights
list_convs = [v for v in tf.global_variables() if "conv" in v.name]
# Get fc1 and logits
list_fc_layers = [v for v in tf.global_variables() if "fc" in v.name or "logits" in v.name]

# Define the saver object to load only the conv variables
saver_load_autoencoder = tf.train.Saver(var_list=list_convs)
# Define saver object to save all the variables during training
saver = tf.train.Saver()

# Define loss for classification
with tf.name_scope("LOSS"):
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(logits=model_out, labels=labels_in))
    correct_prediction = tf.equal(tf.argmax(model_out, 1), tf.argmax(labels_in, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Solver configuration: optimize only the FC layers, the CONV layers stay frozen
with tf.name_scope("Solver"):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=list_fc_layers)

# Initialize variables
init = tf.global_variables_initializer()

# Avoid allocating the whole memory
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.200)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
sess.run(init)

# Restore only the CONV weights (From AutoEncoder)
saver_load_autoencoder.restore(sess, "/tmp/cae_cnn/model.ckpt-34")

# Add some tensors to observe on TensorBoard
tf.summary.image("input_image", model.image_in, 4)
tf.summary.scalar("loss", loss)
merged_summary = tf.summary.merge_all()
writer = tf.summary.FileWriter(SAVE_FOLDER)
writer.add_graph(sess.graph)

##### Train ######
num_epoch = 200
batch_size = 10
for epoch in range(num_epoch):
    for i in range(int(mnist.train.num_examples / batch_size)):
        # Get a batch of batch_size images
        batch = mnist.train.next_batch(batch_size)
        feed = {model_in: batch[0], labels_in: batch[1]}

        # Dump summary
        if i % 5000 == 0:
            # Other summaries
            s = sess.run(merged_summary, feed_dict=feed)
            writer.add_summary(s, i)

        # Train actually here (Also get loss value)
        _, val_loss, t_acc = sess.run((train_step, loss, accuracy), feed_dict=feed)

    # Report the loss of the last batch as a float (%d would truncate it)
    print('Epoch: %d/%d loss:%f' % (epoch, num_epoch, val_loss))
    print('Save model:', epoch)
    saver.save(sess, os.path.join(SAVE_FOLDER, "model.ckpt"), epoch)

If you enjoyed reading this article and want to learn more about convolutional neural networks, you can explore Hands-On Convolutional Neural Networks with TensorFlow, which has an emphatic focus on practical implementation and real-world problems.

Hands-On Convolutional Neural Networks with TensorFlow is a must-read for software engineers and data scientists who want to use CNNs to solve problems.
