When you first start doing machine learning, it’s fascinating to see what all the pieces that make up a model are for.
In this article, let’s find out what the activation function is all about.
Join me in this investigation!
To truly understand why activation functions are so critical, let’s first look at a neural network without them.
We’ll use a simple example and implement it in PyTorch.
This way, we can examine the performance of a model with and without activation functions.
import torch
import torch.nn as nn
A Model Without Activation Function
Consider a simple model implemented in PyTorch: a basic neural network built entirely from linear layers.
class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        return self.linear2(x)
The model learns to predict the output from the input using nothing but linear transformations. However, this limits its ability to learn complex relationships in the data: a composition of linear layers is itself just another linear transformation, so the model can only approximate linear relationships between inputs and outputs. The sketch below makes this concrete.
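Here is a minimal sketch (an illustration, not part of the article's model code) showing that two stacked nn.Linear layers can be merged analytically into a single equivalent linear layer:

import torch
import torch.nn as nn

linear1 = nn.Linear(2, 8)
linear2 = nn.Linear(8, 1)

x = torch.randn(5, 2)
stacked = linear2(linear1(x))

# Merge the two layers analytically: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(2, 1)
with torch.no_grad():
    merged.weight.copy_(linear2.weight @ linear1.weight)
    merged.bias.copy_(linear2.weight @ linear1.bias + linear2.bias)

print(torch.allclose(stacked, merged(x), atol=1e-5))  # True

No matter how many linear layers we stack, the result is equivalent to a single linear layer, which is exactly why an activation function is needed.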
Introducing a ReLU
To appreciate the difference in results, we need a second model that includes an activation function. There are many activation functions to choose from, but our choice here is ReLU.
The ReLU function is defined as ReLU(x) = max(0, x).

When the input value x is positive, the function returns x. When the input value x is negative or zero, the function returns 0.
Let’s say we have a neuron with an input value x. The ReLU activation function is applied to the output of the neuron’s weighted sum. If that output is positive, the neuron is considered “activated” and passes the value through unchanged. If it is negative or zero, the neuron is considered “not activated” and its output is 0. The short check below shows this behaviour in code.
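A tiny sketch of this behaviour (a standalone illustration, not part of the models above):

import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# Negative or zero inputs are clamped to 0; positive inputs pass through unchanged.
print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])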
A Model With Activation Function
Now, let’s add the ReLU function to our model, and observe the changes:
class ReLUModel(nn.Module):
    def __init__(self):
        super(ReLUModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)
In this updated model, we added a ReLU activation function between the two linear layers.
As a result, the model can learn to represent more complex relationships between inputs and outputs.
This increased capacity to learn non-linear relationships is crucial for neural networks to excel in various tasks.
Are these exactly the same models?
This section explains the code above for beginners. You can skip it if you already understand it.
Both models inherit from the nn.Module class, which is the base class for all neural network modules in PyTorch. This inheritance provides the models with essential methods and attributes for creating, training, and evaluating neural networks.
class LinearModel(nn.Module):
    ...

class ReLUModel(nn.Module):
    ...
The super(LinearModel, self).__init__() and super(ReLUModel, self).__init__() calls in the __init__ methods of both models initialize the base class nn.Module. This is required to set up the internal state of the base class correctly, allowing the models to function as proper PyTorch neural network modules.
super(LinearModel, self).__init__()
super(ReLUModel, self).__init__()
Both models have the same architecture, containing two linear layers (nn.Linear). The first linear layer maps the input features (2-dimensional) to an 8-dimensional hidden space, and the second linear layer maps the hidden space back to a single output. The only difference between the models is the use of a ReLU activation function in ReLUModel.
self.linear1 = nn.Linear(2, 8)
self.linear2 = nn.Linear(8, 1)
In ReLUModel, the ReLU activation function is added between the two linear layers, introducing non-linearity to the model. This allows ReLUModel to capture more complex patterns in the data.
self.relu = nn.ReLU()
The forward method defines the forward pass of each model, specifying how the input is transformed to produce the output. In both models, the input x is passed through the first linear layer self.linear1. In ReLUModel, however, the output of the first linear layer is then passed through the ReLU activation function self.relu, introducing non-linearity. Finally, in both models, the result (the output of the first layer, or of the ReLU function in the case of ReLUModel) is passed through the second linear layer self.linear2 to produce the final output.
In LinearModel:

def forward(self, x):
    x = self.linear1(x)
    return self.linear2(x)
In ReLUModel:
def forward(self, x):
    x = self.linear1(x)
    x = self.relu(x)
    return self.linear2(x)
In short, LinearModel and ReLUModel have the same architecture except for the ReLU activation function in ReLUModel.
Now, let’s explore the concept of activation functions in more detail.
Prepare the data first
To see how differently our models learn, we will need data.
Let’s import the necessary libraries and functions:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, TensorDataset
Now, we will generate a synthetic dataset using the make_moons function, which creates two moon-shaped clusters that are not linearly separable.
This dataset is appropriate for demonstrating the benefits of using an activation function, as a linear model without an activation function will struggle to separate the clusters accurately.
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
Of course, we split the dataset into training and testing sets using the train_test_split function, with 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then we convert the training and testing data into PyTorch tensors, preparing them for use with our PyTorch models.
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
Then, we create data loaders for the training and testing data. The DataLoader class allows us to load and manage data in batches during training and evaluation.
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
Let’s put the code together:
# Create synthetic dataset
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
The activation function in action: train models
We are going to train our models on the prepared data, which requires a bit more code.
Define the loss function as Binary Cross Entropy with Logits Loss (nn.BCEWithLogitsLoss).
This loss function is well-suited for binary classification problems like the one in this example.
It combines the sigmoid activation and binary cross-entropy loss into a single function for better numerical stability.
criterion = nn.BCEWithLogitsLoss()
If you don’t know what a loss function is, or if these words confuse you, don’t worry: a look at the glossary should clarify things. For now, just understand that it is a function that measures the quality of a prediction. The small sketch below shows how this particular loss relates to the sigmoid mentioned above.
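As a hedged aside (not part of the article's training code), here is a quick check that nn.BCEWithLogitsLoss applied to raw logits matches a sigmoid followed by nn.BCELoss:

import torch
import torch.nn as nn

logits = torch.tensor([[0.8], [-1.2], [2.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy...
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
# ...which matches applying them separately (but is more numerically stable).
loss_split = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_fused, loss_split))  # True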
Next, we define the optimizer: Adam, with a learning rate of 0.01.
The optimizer is responsible for updating the model’s parameters during training to minimize the loss function.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
Now let’s describe how we want to train the model.
for epoch in range(epochs):
    model.train()
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
Let’s discuss what’s going on in this piece of code.
First, we train the model for a certain number of epochs. In each epoch, we iterate over the batches of training data provided by the train_loader.
For each batch, we are going to perform the following steps:
- Zero the gradients of the optimizer using optimizer.zero_grad(). This step is crucial because gradients accumulate with each backward pass, and failing to zero them would result in incorrect parameter updates.
- Pass the input data (batch_X) through the model to obtain the output predictions.
- Compute the loss between the output predictions and the true labels (batch_y) using the criterion defined earlier.
- Perform backpropagation by calling loss.backward(). This step computes the gradients of the loss with respect to the model's parameters.
- Update the model's parameters using the optimizer by calling optimizer.step().
Then, we have to evaluate the model on the test dataset. First, set the model to evaluation mode using model.eval().
This step is essential as it disables any dropout or batch normalization layers in the model that might affect the evaluation results.
Use the torch.no_grad() context to disable gradient computation, which is not needed during evaluation and can save memory.
Compute the predicted labels (y_pred) by passing the test data (X_test_tensor) through the model, applying the sigmoid function to the output, and rounding the result.
Here, the sigmoid function is used to convert the raw model output (logits) into probabilities, which are then rounded to obtain binary predictions (0 or 1).
with torch.no_grad():
    y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
    accuracy = accuracy_score(y_test_tensor, y_pred)
Compute the accuracy of the model on the test dataset using accuracy_score, and return the result.
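Putting the training and evaluation steps together gives the train_and_evaluate helper used below (the same function appears in the full code listing at the end of the article):

def train_and_evaluate(model, train_loader, test_loader, epochs=50):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # Training loop
    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

    # Evaluation on the held-out test set
    model.eval()
    with torch.no_grad():
        y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
        accuracy = accuracy_score(y_test_tensor, y_pred)

    return accuracy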
Next, we instantiate both models (LinearModel and ReLUModel) and train them using the train_and_evaluate function.
linear_model = LinearModel()
linear_accuracy = train_and_evaluate(linear_model, train_loader, test_loader)

relu_model = ReLUModel()
relu_accuracy = train_and_evaluate(relu_model, train_loader, test_loader)
Finally, we print the accuracy of both models on the test dataset.
print("Linear Model Accuracy: {:.2f}".format(linear_accuracy)) print("ReLU Model Accuracy: {:.2f}".format(relu_accuracy))
How does the activation function affect model prediction?
Let's see how well our models have coped with separating the data. To do this, we visualise the data points together with the decision boundary each model has learned.
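The boundaries are drawn with the plot_decision_boundary function from the full code listing at the end of the article, reproduced here with a few explanatory comments:

import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Evaluate the model on every point of the grid
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        Z = torch.sigmoid(model(mesh_input_tensor)).numpy()
    Z = Z.reshape(xx.shape)

    # The 0.5-probability contour is the decision boundary
    plt.figure(figsize=(6, 6))
    plt.contour(xx, yy, Z, levels=[0.5], colors='k', linewidths=2)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
    plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
    plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
    plt.title(title)
    plt.show()

plot_decision_boundary(linear_model, X, y, 'Linear Model Decision Boundary')
plot_decision_boundary(relu_model, X, y, 'ReLU Model Decision Boundary')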
Let’s first look at the ReLU activation function and its properties to understand how the decision boundary was formed.
The ReLU function is defined as:
f(x) = max(0, x)
As you remember, the ReLU function retains positive values while setting negative values to zero. This introduces non-linearity into the neural network.
In our example, we used a simple neural network with a single hidden layer containing eight neurons, followed by a ReLU activation function:
Input -> Linear(2, 8) -> ReLU -> Linear(8, 1) -> Output
The decision boundary is formed as a combination of the individual linear boundaries created by the neurons in the hidden layer. Each neuron in the hidden layer creates a linear decision boundary based on its weights and biases. After the ReLU activation function, these linear boundaries are combined to form a non-linear decision boundary.
Mathematically, this can be represented as:
z_i = w1_i * x1 + w2_i * x2 + b_i
Each neuron in the hidden layer calculates the weighted sum of the input features and adds a bias term. Here, z_i is the output of the ith neuron, w1_i and w2_i are the weights for the ith neuron, x1 and x2 are the input features, and b_i is the bias term for the ith neuron.
In the model with the activation function, the ReLU activation function is applied to the output z_i of each neuron:
a_i = ReLU(z_i) = max(0, z_i)
So, a_i is the activated output of the ith neuron.
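As a small hedged illustration (assuming the relu_model trained above), we can compute z_i and a_i for one input point by hand and check that it matches the layer call:

x = torch.tensor([0.5, -0.25])

W = relu_model.linear1.weight   # shape (8, 2): row i holds (w1_i, w2_i)
b = relu_model.linear1.bias     # shape (8,):   bias b_i for each neuron

with torch.no_grad():
    z = W @ x + b               # z_i = w1_i * x1 + w2_i * x2 + b_i
    a = torch.clamp(z, min=0)   # a_i = ReLU(z_i) = max(0, z_i)
    print(torch.allclose(a, relu_model.relu(relu_model.linear1(x))))  # True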
For the model without an activation function, the output z_i of each neuron is directly passed to the next layer.
The individual linear boundaries created by the neurons in the model without an activation function are not the same, as each neuron has its own weights and biases.
However, the final decision boundary will still be linear, because there is no activation function to introduce non-linearity. The final decision boundary of the model without an activation function is simply a linear combination of the individual linear boundaries created by the neurons in the hidden layer.
Here’s a visualization of the individual linear boundaries for our ReLU model, one per hidden neuron.
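The figure is produced by the plot_individual_boundaries helper from the full code listing at the end of the article; the version below is the same code, lightly commented:

def plot_individual_boundaries(model, X, y):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Pre-activations of the hidden layer over the whole grid
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        hidden_output = model.linear1(mesh_input_tensor)

    # One subplot per hidden neuron: the z_i = 0 contour is its linear boundary
    fig, axes = plt.subplots(2, 4, figsize=(14, 8))
    axes = axes.ravel()
    for i, ax in enumerate(axes):
        contour = hidden_output[:, i].numpy().reshape(xx.shape)
        ax.contour(xx, yy, contour, levels=[0], colors='k', linewidths=1)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
        ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
        ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
        ax.set_title(f'Neuron {i + 1}')
    plt.tight_layout()
    plt.show()

plot_individual_boundaries(relu_model, X, y)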
Remember that this is a simplified explanation of how the decision boundary is formed in a neural network with a single hidden layer and a ReLU activation function.
The exact process of forming the decision boundary will depend on the specific architecture and weights of the neural network, as well as the activation functions used.
Next time we will look at the different activation functions and discuss the pros and cons of using them. We will also talk about modern activation functions, which are used quite rarely, but sometimes help a lot.
Full activation function investigation code
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, TensorDataset


class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        return self.linear2(x)


class ReLUModel(nn.Module):
    def __init__(self):
        super(ReLUModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)


# Create synthetic dataset
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


# Train and evaluate function
def train_and_evaluate(model, train_loader, test_loader, epochs=50):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
        accuracy = accuracy_score(y_test_tensor, y_pred)

    return accuracy


# Train and evaluate models
linear_model = LinearModel()
linear_accuracy = train_and_evaluate(linear_model, train_loader, test_loader)

relu_model = ReLUModel()
relu_accuracy = train_and_evaluate(relu_model, train_loader, test_loader)

# Print results
print("Linear Model Accuracy: {:.2f}".format(linear_accuracy))
print("ReLU Model Accuracy: {:.2f}".format(relu_accuracy))


def plot_decision_boundary(model, X, y, title):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        Z = torch.sigmoid(model(mesh_input_tensor)).numpy()
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(6, 6))
    plt.contour(xx, yy, Z, levels=[0.5], colors='k', linewidths=2)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
    plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
    plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
    plt.title(title)
    plt.show()


plot_decision_boundary(linear_model, X, y, 'Linear Model Decision Boundary')
plot_decision_boundary(relu_model, X, y, 'ReLU Model Decision Boundary')


def plot_individual_boundaries(model, X, y):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        hidden_output = model.linear1(mesh_input_tensor)

    fig, axes = plt.subplots(2, 4, figsize=(14, 8))
    axes = axes.ravel()
    for i, ax in enumerate(axes):
        contour = hidden_output[:, i].numpy().reshape(xx.shape)
        ax.contour(xx, yy, contour, levels=[0], colors='k', linewidths=1)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
        ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
        ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
        ax.set_title(f'Neuron {i + 1}')
    plt.tight_layout()
    plt.show()


plot_individual_boundaries(relu_model, X, y)