Enhance Your Machine Learning Models With Transfer Learning

When you’re up against a completely novel task, collecting a hefty amount of data can feel like trying to scale a steep mountain. 

And when you’re working with only a sliver of data, getting your model performance to a satisfactory level—let’s talk about accuracy here—is no walk in the park either. 

It might even feel like you’re trying to squeeze water from a stone.

But don’t despair—there’s a solution that directly targets this conundrum, and it’s known as Transfer Learning.

It might even sound too good to be true, but the concept is genuinely straightforward: even with only a small pool of training data, you can still hone your model to achieve high performance.

Intrigued? Excellent.

Stay with me because we’re about to dive deeply into how it all fits together.

So, what exactly is Transfer Learning?

In essence, transfer learning is a technique in machine learning where we repurpose a pre-trained model as the springboard for a new task.

Here’s a plain-English explanation—when you train a model on a specific task, you can recycle that model for a second, related task.

This approach is an excellent shortcut, facilitating quick progress when modelling the second task.

Applying transfer learning to a new task often results in significantly better performance than training from scratch on a small dataset.

Transfer learning has become such a staple that it’s almost a rarity to train a model from the ground up for image or natural language processing-related tasks. Instead, data scientists and researchers typically prefer to begin with a pre-trained model that’s already a whizz at classifying objects and has picked up general features like shapes and edges in images.

Architectures pre-trained on the ImageNet dataset, such as AlexNet, VGG, and Inception, are standard go-to’s that form the foundation of transfer learning.

So, how does Transfer Learning compare to Traditional Machine Learning?

The advent of transfer learning was a significant boon introduced by deep learning experts to circumvent the constraints of traditional machine learning models.

Let’s unpack the distinctions between these two learning methodologies.

  • Traditional machine learning models necessitate training from the ground up, a process that’s both computationally heavy and demands a hefty volume of data to reach optimal performance.

    On the other hand, transfer learning is far more computationally efficient and still delivers superior results, even when working with a smaller dataset.

  • Traditional ML adopts an insular training approach, wherein each model is trained independently for a specific purpose without relying on previous knowledge.

    In stark contrast, transfer learning capitalizes on the wisdom gleaned from the pre-trained model to continue the task. 

    If you tried to use a model pre-trained on ImageNet directly for biomedical images, you could quickly find yourself at an impasse. Why?

    Because ImageNet contains no images from the biomedical sphere, so the transferred features are far less relevant.

  • Compared to their traditional ML counterparts, transfer learning models converge much faster.

    These models utilize previously trained models’ insights (like features, weights, and more).

    They already have a handle on the features, which makes the process much quicker than training neural networks from scratch.

Classic Strategies of Transfer Learning

Transfer learning strategies and methods are diverse and can be applied based on the nature of the application, the task at hand, and the data available. So, before you plunge into the strategy of transfer learning, it’s important to consider these questions:

  • What knowledge can be shifted from the source to the target to boost the performance of the target task?

  • When is the right time to transfer, and when should it be avoided, to ensure we enhance the performance/results of the target task and not undermine them?

  • How should we transfer the knowledge gleaned from the source model based on our current domain/task?

Traditional transfer learning strategies are usually grouped into three categories, depending on the task domain and on whether labelled or unlabeled data are available.

Let’s delve deeper into these.

Inductive Transfer Learning

In inductive transfer learning, the source and target domains are identical, though the specific tasks the model works on vary.

The algorithms aim to take the knowledge from the source model and apply it to improve the target task. The pre-trained model is already an expert on the features of the domain, so it’s in a better starting position than if we were to train it from the ground up.

Inductive transfer learning can be further classified into two subcategories based on whether or not the source domain contains labelled data. These are multi-task learning and self-taught learning.
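
To make the multi-task flavour concrete, here is a small, hypothetical Keras sketch in which one shared trunk learns features for two labelled tasks from the same domain at once; the task names, layer sizes, and data are invented purely for illustration.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A shared trunk learns features common to both tasks (same input domain)
inputs = keras.Input(shape=(64,))
shared = layers.Dense(32, activation="relu")(inputs)

# Two task-specific heads: a (hypothetical) classification task and a regression task
class_head = layers.Dense(3, activation="softmax", name="topic")(shared)
reg_head = layers.Dense(1, name="score")(shared)

model = keras.Model(inputs, [class_head, reg_head])
model.compile(optimizer="adam",
              loss={"topic": "sparse_categorical_crossentropy", "score": "mse"})

# Synthetic placeholder data for both tasks
x = np.random.rand(100, 64)
y_topic = np.random.randint(0, 3, 100)
y_score = np.random.rand(100, 1)
model.fit(x, {"topic": y_topic, "score": y_score}, epochs=2, batch_size=16, verbose=0)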

Transductive Transfer Learning

Transductive transfer learning is the go-to strategy for scenarios where the domains of the source and target tasks aren’t identical but are related.

It’s possible to identify similarities between the source and target tasks.

These scenarios typically boast a wealth of labelled data in the source domain, while the target domain has only unlabeled data.

Unsupervised Transfer Learning

Unsupervised transfer learning mirrors inductive transfer learning, with one key distinction: the algorithms focus on unsupervised tasks and involve unlabeled datasets in both the source and target tasks.
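
As a toy illustration of the unsupervised flavour, the sketch below pre-trains a small autoencoder on unlabeled source data and then reuses its encoder to embed unlabeled target data; all shapes and data here are synthetic placeholders.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Unlabeled source and target data (synthetic placeholders)
x_source = np.random.rand(500, 30)
x_target = np.random.rand(80, 30)

# Pre-train a small autoencoder on the unlabeled source domain
inputs = keras.Input(shape=(30,))
encoded = layers.Dense(8, activation="relu")(inputs)
decoded = layers.Dense(30, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_source, x_source, epochs=5, batch_size=32, verbose=0)

# Reuse the learned encoder to represent the unlabeled target data
encoder = keras.Model(inputs, encoded)
z_target = encoder.predict(x_target, verbose=0)
print(z_target.shape)  # (80, 8): compact features for downstream unsupervised tasks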

Common Approaches to Transfer Learning

Now, let’s look at another way of categorizing transfer learning strategies: by whether the source and target domains share the same feature space, independent of the type of data samples available for training.

Homogeneous Transfer Learning

Homogeneous transfer learning strategies are designed to manage situations where the domains exist in the same feature space.

In homogeneous transfer learning, domains differ only slightly in marginal distributions.

These strategies adapt the domains by correcting the sample selection bias or covariate shift.

Instance transfer

This covers a simple scenario where there’s a lot of labelled data in the source domain and only a limited amount in the target domain.

The domains and feature spaces only differ in marginal distributions.

For instance, let’s say we need to build a model to diagnose cancer in a specific region where the elderly are the majority.

Only a limited number of target-domain examples are available, but plenty of relevant data exist from another region where young people form the majority. Simply transferring all of that data may be unsuccessful, since the marginal distributions differ: the elderly have a higher risk of cancer than young people.

In this scenario, consider adapting the marginal distributions. Instance-based transfer learning reassigns weights to the source domain instances in the loss function.
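
As a rough illustration of instance weighting, Keras lets you pass per-example weights to model.fit. The down-weighting of the source examples below is a hand-picked value purely for illustration; real instance-transfer methods such as TrAdaBoost learn these weights from the data.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical source data (young-majority region) and a small target set (elderly-majority region)
x_source, y_source = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
x_target, y_target = np.random.rand(100, 20), np.random.randint(0, 2, 100)

# Combine the domains; give source instances a smaller (illustrative) weight than target instances
x_all = np.concatenate([x_source, x_target])
y_all = np.concatenate([y_source, y_target])
sample_weights = np.concatenate([np.full(len(x_source), 0.3),   # down-weight source examples
                                 np.full(len(x_target), 1.0)])  # full weight for target examples

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# sample_weight scales each example's contribution to the loss
model.fit(x_all, y_all, sample_weight=sample_weights, epochs=5, batch_size=32, verbose=0)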

Parameter transfer

Parameter-based transfer learning strategies transfer knowledge at the model/parameter level.

This involves transferring knowledge through the shared parameters of the source and target domain learning models.

One way to transfer the learned knowledge can be by creating multiple source learner models and optimally combining the re-weighted learners—much like ensemble learners—to form an improved target learner.
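
A rough sketch of the ensemble view: combine the predictions of several source learners with fixed weights. The learners below are untrained stand-ins and the combination weights are picked by hand; a real parameter-transfer method would pre-train the learners and estimate the weights on target data.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def make_source_learner(seed):
    """A small stand-in for a source model; in practice these would be pre-trained."""
    tf.random.set_seed(seed)
    return keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

source_learners = [make_source_learner(s) for s in (0, 1, 2)]

# Illustrative combination weights; a real approach would estimate these on target data
learner_weights = np.array([0.5, 0.3, 0.2])

def target_predict(x):
    # The weighted average of the source learners' predictions acts as the target learner
    preds = np.stack([m.predict(x, verbose=0) for m in source_learners], axis=0)
    return np.tensordot(learner_weights, preds, axes=1)

x_target = np.random.rand(5, 10)
print(target_predict(x_target).shape)  # (5, 1)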

Feature-representation transfer

Feature-based methods transform the original features to create a new feature representation.

This method can be divided into two subcategories: asymmetric and symmetric feature-based transfer learning.

Asymmetric approaches transform the source features to match the target ones. In other words, we take the features from the source domain and fit them into the target feature space. Some information loss may occur in this process due to the marginal difference in the feature distribution. 

Symmetric approaches find a common latent feature space and then transform both the source and the target features into this new feature representation.
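
A toy sketch of the symmetric idea, using plain PCA as a stand-in for more principled objectives such as transfer component analysis: learn one latent space from both domains, then project the source and target features into it.

import numpy as np

# Hypothetical source and target features living in the same 50-dimensional space
x_source = np.random.rand(200, 50)
x_target = np.random.rand(40, 50) + 0.5  # a shifted target distribution

# Learn one shared latent space from both domains (plain PCA via SVD)
x_all = np.concatenate([x_source, x_target])
mean = x_all.mean(axis=0)
_, _, vt = np.linalg.svd(x_all - mean, full_matrices=False)
components = vt[:10]  # keep a 10-dimensional latent space

# Transform BOTH domains into the common representation
z_source = (x_source - mean) @ components.T
z_target = (x_target - mean) @ components.T
print(z_source.shape, z_target.shape)  # (200, 10) (40, 10)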

Relational-knowledge transfer

Relational-based transfer learning approaches primarily focus on learning the relationships among the data in the source domain and applying that relational knowledge in the current context.

Such methods transfer the logical relationship or rules learned in the source domain to the target domain.

For example, learning the relationships between different elements of speech from male voices can significantly aid the analysis of sentences spoken in female voices.

Heterogeneous Transfer Learning

Transfer learning involves using representations derived from a previous network to extract meaningful features from new samples for an interrelated task.

However, these methods often overlook the difference in the feature spaces between the source and target domains.

Collecting labelled source domain data with the same feature space as the target domain is often challenging, and heterogeneous transfer learning methods have been developed to overcome these limitations.

This technique addresses the issue of source and target domains having differing feature spaces, differing data distributions, differing label spaces, and other concerns.

Heterogeneous transfer learning is applied in cross-domain tasks such as cross-language text categorization, text-to-image classification, and many others.

Delving into Transfer Learning for Deep Learning

In artificial intelligence, areas like image recognition and natural language processing are particularly ripe for implementing transfer learning.

This approach has driven many models to achieve state-of-the-art performance.

These pre-trained neural networks constitute the foundation of transfer learning in deep learning, and we often refer to this as deep transfer learning.

Leveraging Pre-trained Models as Feature Extractors

To grasp how deep learning models function, we must understand their composition. 

They are built on layered architectures that identify different features at each layer.

Initial layers identify generic, low-level features such as edges and shapes, and the representations become more detailed and task-specific as we delve deeper into the network. 

These layers are eventually connected to a final layer (typically a fully connected layer in supervised learning) to produce the final output. This approach allows us to utilize well-established pre-trained networks (like the Oxford VGG Model, Google Inception Model, Microsoft ResNet Model) as feature extractors for various tasks, minus their final layer.

The primary notion here is to utilize the pre-trained model’s weighted layers for feature extraction without updating the model’s weights during the training phase with new data for a new task. 

The pre-trained models, having been trained on a large, generic dataset, serve as a universal model of the visual world.
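
Here is a minimal sketch of the feature-extractor pattern, assuming an ImageNet-pre-trained VGG16 and a placeholder batch of images; the 224x224 input size and the random data are just for illustration.

import numpy as np
from tensorflow import keras

# Load a pre-trained network without its final classification layer
feature_extractor = keras.applications.VGG16(weights="imagenet",
                                             include_top=False,
                                             pooling="avg",  # global average pooling over the last conv block
                                             input_shape=(224, 224, 3))
feature_extractor.trainable = False  # use it purely as a fixed feature extractor

# Placeholder batch of images; in practice these would be samples from the new task
images = np.random.rand(4, 224, 224, 3) * 255.0
images = keras.applications.vgg16.preprocess_input(images)

# One fixed-length feature vector per image, ready to feed a small classifier
features = feature_extractor.predict(images, verbose=0)
print(features.shape)  # (4, 512)

These vectors can then be fed to any lightweight classifier trained only on the new task’s labels.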

Refining Pre-trained Models

This technique goes a step further. Rather than relying solely on the features extracted from the pre-trained model and replacing the final layer, we also selectively retrain some of the preceding layers.

Deep neural networks are layered structures with numerous tunable parameters. Initial layers capture generic features, while subsequent ones focus more on the specific task.

It makes sense to adjust the higher-order feature representations in the base model to make them more relevant for the particular task by retraining some layers of the model while keeping others frozen during training.

To Freeze or Fine-tune?

A logical next step to further enhance the model’s performance is to fine-tune the weights of the top layers of the pre-trained model in conjunction with the training of the added classifier. 

This adjustment nudges the weights away from the generic feature maps learned on the source task and toward features specific to the new dataset.

Fine-tuning enables the model to apply past knowledge to the target domain and re-learn some aspects.

However, fine-tuning a small number of top layers rather than the entire model is often more beneficial.

The first few layers learn basic and generic features that can be generalized to almost all data types. 

As a result, it’s wise to freeze these layers and reuse the foundational knowledge derived from past training.

As we move up, the features become more specific to the dataset on which the model was trained.

Fine-tuning seeks to adapt these specialized features to work with the new dataset rather than overwrite the generic learning.
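
In Keras, this freeze-versus-fine-tune split usually looks something like the sketch below: freeze everything, then unfreeze only the last convolutional block. The cut-off point chosen here is a judgment call rather than a fixed rule.

from tensorflow import keras

base_model = keras.applications.VGG16(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))

# Allow training in general, then keep only the last convolutional block ("block5") trainable
base_model.trainable = True
for layer in base_model.layers:
    layer.trainable = layer.name.startswith("block5")

print("Fine-tuned layers:", [l.name for l in base_model.layers if l.trainable])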

Transfer Learning in Six Steps

Lastly, let’s outline the process of how transfer learning works in practice:

  1. Choose Pre-trained Model: The first step is to choose the pre-trained model on which we want to base our training, depending on the task. Transfer learning requires a strong correlation between the knowledge of the pre-trained source model and the target task domain for them to be compatible.

  2. Develop a Base Model: The base model is an architecture like ResNet or Xception, which we have chosen in the first step because it closely aligns with our task. We can either download the network weights to save time on additional training or use the network architecture to train our model from scratch.

  3. Freeze Layers: Freezing the initial layers from the pre-trained model is crucial. Doing this ensures that the base layers from the pre-trained model don’t get retrained when training our new model.

  4. Change Structure: The new model’s architecture must align with our requirements for the new task. Often, it’s enough to replace the pre-trained model’s final layer (classification layer) with a new one that suits our task.

  5. Train: Now, the network is trained with the new data, and only the added layers’ weights get updated, while the frozen layers’ weights remain unchanged.

  6. Unfreeze and Fine-tune: We can unfreeze a few layers from the top of the pre-trained model after the new model has learned some initial weights. We can tweak the higher-level feature representations by fine-tuning these layers to suit our task better.

The effectiveness of transfer learning lies in its inherent power to reuse pre-existing knowledge in the face of a new situation. There are also several challenges to consider, which we will come back to after a hands-on example.

Let’s put the six steps into practice with a small Keras example: a VGG16 network pre-trained on ImageNet, used as the base model for CIFAR-10 image classification. First, the necessary imports:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt

Next, we load the CIFAR-10 dataset and preprocess the images.

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Preprocess the data
x_train = tf.keras.applications.vgg16.preprocess_input(x_train)
x_test = tf.keras.applications.vgg16.preprocess_input(x_test)
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

The next step is to create the base model. We will use the VGG16 model, remove the top (or final) layers, and specify that the input shape is (32,32,3). We will freeze the layers in the base model.

# Create base model
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(32,32,3))

# Freeze base model
base_model.trainable = False

Now, we will add new trainable layers on top of the base model.

# Create new model on top of the base model
model = keras.models.Sequential()
model.add(base_model)
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))  # We have 10 classes in the dataset

Next, we will compile and train the model.

# Compile the model
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.CategoricalAccuracy()])

# Train the model
history = model.fit(x_train, y_train, epochs=5, batch_size=32, validation_data=(x_test, y_test))

At this point, we have a working model that was trained by updating only the newly added layers. To improve it further, we will unfreeze the base model and fine-tune the whole network with a very low learning rate.

# Unfreeze the base model
base_model.trainable = True

# It's important to recompile your model after you make any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account
model.compile(optimizer=keras.optimizers.Adam(1e-5),  # Very low learning rate
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.CategoricalAccuracy()])

Now we can fine-tune the model and see how much the weights change. We will grab the weights of the first convolutional layer before and after fine-tuning and compare them. Note that in VGG16 the very first layer is the input layer, so we look up the first convolutional layer, block1_conv1, by name.

# Get the weights of the first convolutional layer of the base model
weights_before_fine_tuning = base_model.get_layer("block1_conv1").get_weights()[0]

# Train end-to-end with fine-tuning. Be careful to stop before you overfit!
history_fine_tuning = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

# Get the weights of the same layer after fine-tuning
weights_after_fine_tuning = base_model.get_layer("block1_conv1").get_weights()[0]

# Visualize the weights
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.title("Weights before fine-tuning")
plt.imshow(weights_before_fine_tuning[:, :, 0, 0], cmap='gray')  # 3x3 kernel: first input channel, first filter

plt.subplot(1, 2, 2)
plt.title("Weights after fine-tuning")
plt.imshow(weights_after_fine_tuning[:, :, 0, 0], cmap='gray')
plt.show()

Challenges and Potential Issues

  1. Negative Transfer: This occurs when the pre-trained model’s knowledge hurts the performance on the target task instead of helping it. This could happen if the source and target tasks are not related or if the features in the source task obstruct learning in the target task. Identifying such scenarios and knowing when to use transfer learning is essential.

  2. Task Relevance: It’s crucial to ascertain the relevance of the source task to the target task. The knowledge transfer may not yield positive results if the tasks are unrelated. The features that a model learns from one task may not necessarily be helpful for a different task. Therefore, choosing an appropriate pre-trained model is critical.

  3. Domain Shift: If the source and target domains are different, it can lead to a phenomenon called “domain shift”, where the model fails to generalize well to the target task. This can be mitigated with domain adaptation techniques, which explicitly model the distribution shift between the source and target domains; one simple example, correlation alignment (CORAL), is sketched after this list.

  4. Fine-tuning Difficulties: Deciding which layers to fine-tune and how much to fine-tune is often more of an art than a science and requires a lot of trial and error. It also adds computational overhead, and over-fine-tuning can lead to overfitting on the target task.

  5. Dataset Bias: Pre-trained models are usually trained on large, diverse datasets. If those datasets contain inherent biases or errors, these can be propagated to the target task. Therefore, understanding the biases in the training data of the pre-trained model is important.
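
To make the domain-shift point in item 3 concrete, here is a minimal sketch of correlation alignment (CORAL), assuming NumPy and SciPy are available and using synthetic placeholder data: it re-colours the source features so that their covariance matches the target domain’s before a model is trained on them.

import numpy as np
from scipy.linalg import fractional_matrix_power

def coral_align(x_source, x_target, eps=1.0):
    # Regularized covariance of each domain
    cov_s = np.cov(x_source, rowvar=False) + eps * np.eye(x_source.shape[1])
    cov_t = np.cov(x_target, rowvar=False) + eps * np.eye(x_target.shape[1])
    # Whiten the source features, then re-colour them with the target covariance
    whiten = fractional_matrix_power(cov_s, -0.5)
    recolour = fractional_matrix_power(cov_t, 0.5)
    return np.real(x_source @ whiten @ recolour)

# Synthetic placeholder data: the target domain is shifted and rescaled
x_source = np.random.rand(300, 10)
x_target = np.random.rand(100, 10) * 2.0 + 1.0

x_source_aligned = coral_align(x_source, x_target)
print(x_source_aligned.shape)  # (300, 10): train on these, evaluate on the target domain

As with everything in transfer learning, whether such an alignment step actually helps depends on how closely related the two domains really are.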
