Background

  • In this article, we’ll offer a comprehensive overview of fine-tuning, a common practice in Deep Learning.

  • We lay out the rationale behind fine-tuning and the techniques involved.

Why do we fine-tune models?

  • When we are given a Deep Learning task, say, one that involves training a Convolutional Neural Network (CNN) on a dataset of images, our first instinct might be to train the network from scratch. In practice, however, deep neural networks like CNNs have a huge number of parameters, often in the range of millions. Training a CNN on a small dataset (one with far fewer samples than the network has parameters) greatly hurts the CNN’s ability to generalize, often resulting in overfitting.

  • Therefore, in practice, one would more often fine-tune an existing network that has been trained on a large dataset like ImageNet (1.2M labeled images) by continuing to train it (i.e., running back-propagation) on the smaller dataset we have (see the sketch below). Provided that our dataset is not drastically different in context from the original dataset (e.g., ImageNet), the pre-trained model will already have learned features that are relevant to our own classification problem.
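  • As a concrete starting point, below is a minimal sketch (in PyTorch/torchvision, assuming torchvision ≥ 0.13; the choice of ResNet-18 is purely illustrative) of loading a CNN pre-trained on ImageNet, which the guidelines later in this article then adapt to a new dataset:

```python
import torch
from torchvision import models

# Load a CNN whose weights were pre-trained on ImageNet (1.2M labeled images).
# ResNet-18 is used here only as an example; VGG, ResNet-50, etc. work the same way.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# At this point, the final layer still maps to the 1,000 ImageNet categories.
print(model.fc)  # Linear(in_features=512, out_features=1000, bias=True)
```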

When to fine-tune models?

  • In general, if our dataset is not drastically different in context from the dataset the pre-trained model was trained on, we should go for fine-tuning. A network pre-trained on a large and diverse dataset like ImageNet captures universal features such as curves and edges in its early layers, which are relevant and useful to most classification problems.

  • Of course, if our dataset represents a very specific domain, say, medical images or Chinese handwritten characters, and no networks pre-trained on such a domain can be found, we should then consider training the network from scratch.

  • One other concern is that if our dataset is small, fine-tuning the pre-trained network on it might lead to overfitting, especially if the last few layers of the network are fully connected layers, as is the case for the VGG network. In our experience, if we have a few thousand raw samples and the common data augmentation strategies are implemented (translation, rotation, flipping, etc.), fine-tuning will usually get us a better result.

  • If our dataset is really small, say less than a thousand samples, a better approach is to take the output of the intermediate layer prior to the fully connected layers as features (bottleneck features) and train a linear classifier (e.g., an SVM) on top of them (see the sketch after this list). SVMs are particularly good at drawing decision boundaries on a small dataset.
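  • The following is a minimal sketch of this bottleneck-features-plus-SVM approach, assuming PyTorch/torchvision and scikit-learn; the train_loader and val_loader DataLoaders over our small dataset are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

# Pre-trained CNN used purely as a fixed feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()  # drop the ImageNet classifier; keep the 512-d bottleneck features
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    """Run images through the frozen CNN and collect bottleneck features."""
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images))
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# train_loader and val_loader are assumed to iterate over our (<1,000-sample) dataset.
X_train, y_train = extract_features(train_loader)
X_val, y_val = extract_features(val_loader)

# Train a linear SVM on top of the bottleneck features.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print("Validation accuracy:", clf.score(X_val, y_val))
```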

Fine-tuning Guidelines

  • Below are some general guidelines for implementing fine-tuning; a combined code sketch follows the list:
  1. Truncate the last layer (softmax layer): The common practice is to truncate the last layer (softmax layer) of the pre-trained network and replace it with a new softmax layer that is relevant to our own problem. For example, a network pre-trained on ImageNet comes with a softmax layer over 1,000 categories.
    • If our task is classification over 10 categories, the new softmax layer of the network will have 10 categories instead of 1,000. We then run back-propagation on the network to fine-tune the pre-trained weights. Make sure cross-validation is performed so that the network generalizes well.
  2. Use a smaller learning rate to train the network: Since we expect the pre-trained weights to be quite good already compared to randomly initialized weights, we do not want to distort them too quickly or too much. A common practice is to make the initial learning rate 10 times smaller than the one used for training from scratch.

  3. Freeze the weights of the first few layers: It is also common practice to freeze the weights of the first few layers of the pre-trained network. This is because the first few layers capture universal features like curves and edges that are also relevant to our new problem, so we want to keep those weights intact. Instead, we let the network focus on learning dataset-specific features in the subsequent layers.
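  • The sketch below combines the three guidelines, assuming PyTorch/torchvision; the exact layers frozen and the specific learning rate (0.001 vs. 0.01) are illustrative choices, and the train_loader over our small, augmented dataset is assumed to exist:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # our own problem has 10 categories instead of ImageNet's 1,000

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Guideline 1: truncate the ImageNet softmax layer and attach a new head for 10 categories.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Guideline 3: freeze the first few layers, which capture universal features (curves, edges).
# Freezing the stem and the first two residual stages is an illustrative cut-off point.
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for param in module.parameters():
        param.requires_grad = False

# Guideline 2: fine-tune with a learning rate ~10x smaller than a from-scratch run
# (e.g., 0.001 instead of 0.01 for SGD), so the pre-trained weights are not distorted too quickly.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    momentum=0.9,
)
criterion = nn.CrossEntropyLoss()

# One epoch of fine-tuning over our (hypothetical) small dataset; data augmentation
# (random flips/rotations/translations) would be applied in the DataLoader's transforms.
model.train()
for images, targets in train_loader:  # train_loader is assumed to exist
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
```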

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledFinetuningModels,
  title   = {Fine-tuning Models},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}