When you first start doing machine learning, it’s fascinating to see what all the pieces that make up a model are for.
In this article, let’s find out what the activation function is all about.
Join me in this investigation!
To truly understand why activation functions are so critical, let’s first look at a neural network without them.
We’ll use a simple example and implement it in PyTorch.
This way, we can examine the performance of a model with and without activation functions.
import torch
import torch.nn as nn
A Model Without Activation Function
Consider a simple model implemented in PyTorch: a basic neural network built entirely from linear layers.
class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        return self.linear2(x)
The model learns to predict the output from the input using nothing but linear transformations. However, this limits its ability to learn complex relationships in the data: a composition of linear layers is itself just another linear transformation, so the model can only approximate linear relationships between inputs and outputs. The sketch below makes this concrete.
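Here is a minimal sketch (an illustration, not part of the article's model code) showing that two stacked nn.Linear layers can be merged analytically into a single equivalent linear layer:

import torch
import torch.nn as nn

linear1 = nn.Linear(2, 8)
linear2 = nn.Linear(8, 1)

x = torch.randn(5, 2)
stacked = linear2(linear1(x))

# Merge the two layers analytically: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(2, 1)
with torch.no_grad():
    merged.weight.copy_(linear2.weight @ linear1.weight)
    merged.bias.copy_(linear2.weight @ linear1.bias + linear2.bias)

print(torch.allclose(stacked, merged(x), atol=1e-5))  # True

No matter how many linear layers we stack, the result is equivalent to a single linear layer, which is exactly why an activation function is needed.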
Introducing a ReLU
To appreciate the difference in results, we need a second model that includes an activation function. There are many activation functions to choose from, but our choice here is ReLU.
The ReLU function is defined as ReLU(x) = max(0, x).

When the input value x is positive, the function returns x. When the input value x is negative or zero, the function returns 0.
Let’s say we have a neuron with an input value x. The ReLU activation function is applied to the output of the neuron’s weighted sum. If that output is positive, the neuron is considered “activated” and passes the value through unchanged. If it is negative or zero, the neuron is considered “not activated” and its output is 0. The short check below shows this behaviour in code.
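A tiny sketch of this behaviour (a standalone illustration, not part of the models above):

import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# Negative or zero inputs are clamped to 0; positive inputs pass through unchanged.
print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])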
A Model With Activation Function
Now, let’s add the ReLU function to our model, and observe the changes:
class ReLUModel(nn.Module):
    def __init__(self):
        super(ReLUModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)
In this updated model, we added a ReLU activation function between the two linear layers.
As a result, the model can learn to represent more complex relationships between inputs and outputs.
This increased capacity to learn non-linear relationships is crucial for neural networks to excel in various tasks.
Are these exactly the same models?
This section explains the code above for beginners. You can skip it if you already understand it.
Both models inherit from the nn.Module class, which is the base class for all neural network modules in PyTorch. This inheritance provides the models with essential methods and attributes for creating, training, and evaluating neural networks.
class LinearModel(nn.Module):
    ...

class ReLUModel(nn.Module):
    ...
The super(LinearModel, self).__init__() and super(ReLUModel, self).__init__() calls in the __init__ methods of both models initialize the base class nn.Module. This is required to set up the internal state of the base class correctly, allowing the models to function as proper PyTorch neural network modules.
super(LinearModel, self).__init__()
super(ReLUModel, self).__init__()
Both models have the same architecture, containing two linear layers (nn.Linear). The first linear layer maps the input features (2-dimensional) to an 8-dimensional hidden space, and the second linear layer maps the hidden space back to a single output. The only difference between the models is the use of a ReLU activation function in ReLUModel.
self.linear1 = nn.Linear(2, 8)
self.linear2 = nn.Linear(8, 1)
In ReLUModel, the ReLU activation function is added between the two linear layers, introducing non-linearity to the model. This allows ReLUModel to capture more complex patterns in the data.
self.relu = nn.ReLU()
The forward method defines the forward pass of each model, specifying how the input is transformed to produce the output. In both models, the input x is passed through the first linear layer self.linear1. In ReLUModel, however, the output of the first linear layer is then passed through the ReLU activation function self.relu, introducing non-linearity. Finally, in both models, the result (the output of the first layer, or of the ReLU function in the case of ReLUModel) is passed through the second linear layer self.linear2 to produce the final output.
In LinearModel:

def forward(self, x):
    x = self.linear1(x)
    return self.linear2(x)
In ReLUModel:
def forward(self, x):
    x = self.linear1(x)
    x = self.relu(x)
    return self.linear2(x)
In short, LinearModel and ReLUModel have the same architecture except for the ReLU activation function in ReLUModel.
Now, let’s explore the concept of activation functions in more detail.
Prepare the data first
To see how differently our models learn, we will need data.
Let’s import the necessary libraries and functions:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, TensorDataset
Now, we will generate a synthetic dataset using the make_moons function, which creates two moon-shaped clusters that are not linearly separable.
This dataset is appropriate for demonstrating the benefits of using an activation function, as a linear model without an activation function will struggle to separate the clusters accurately.
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
Of course, we split the dataset into training and testing sets using the train_test_split function, with 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then we convert the training and testing data into PyTorch tensors, preparing them for use with our PyTorch models.
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
Then, we create data loaders for the training and testing data. The DataLoader class allows us to load and manage data in batches during training and evaluation.
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
Let’s put the code together:
# Create synthetic dataset
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
The activation function in action: train models
We are going to train our models on the prepared data, which requires a bit more code.
Define the loss function as Binary Cross Entropy with Logits Loss (nn.BCEWithLogitsLoss).
This loss function is well-suited for binary classification problems like the one in this example.
It combines the sigmoid activation and binary cross-entropy loss into a single function for better numerical stability.
criterion = nn.BCEWithLogitsLoss()
If you don’t know what a loss function is, or if these words confuse you, don’t worry: a look at the glossary should clarify things. For now, just understand that it is a function that measures the quality of a prediction. The small sketch below shows how this particular loss relates to the sigmoid mentioned above.
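As a hedged aside (not part of the article's training code), here is a quick check that nn.BCEWithLogitsLoss applied to raw logits matches a sigmoid followed by nn.BCELoss:

import torch
import torch.nn as nn

logits = torch.tensor([[0.8], [-1.2], [2.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy...
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
# ...which matches applying them separately (but is more numerically stable).
loss_split = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_fused, loss_split))  # True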
Next, we define the optimizer: Adam, with a learning rate of 0.01.
The optimizer is responsible for updating the model’s parameters during training to minimize the loss function.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
Now let’s describe how we want to train the model.
for epoch in range(epochs):
    model.train()
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
Let’s discuss what’s going on in this piece of code.
First, we train the model for a certain number of epochs. In each epoch, we iterate over the batches of training data provided by the train_loader.
For each batch, we are going to perform the following steps:
- Zero the gradients of the optimizer using optimizer.zero_grad(). This step is crucial because gradients accumulate with each backward pass, and failing to zero them would result in incorrect parameter updates.
- Pass the input data (batch_X) through the model to obtain the output predictions.
- Compute the loss between the output predictions and the true labels (batch_y) using the criterion defined earlier.
- Perform backpropagation by calling loss.backward(). This step computes the gradients of the loss with respect to the model's parameters.
- Update the model's parameters using the optimizer by calling optimizer.step().
Then, we have to evaluate the model on the test dataset. First, set the model to evaluation mode using model.eval().
This step is essential as it disables any dropout or batch normalization layers in the model that might affect the evaluation results.
Use the torch.no_grad() context to disable gradient computation, which is not needed during evaluation and can save memory.
Compute the predicted labels (y_pred) by passing the test data (X_test_tensor) through the model, applying the sigmoid function to the output, and rounding the result.
Here, the sigmoid function is used to convert the raw model output (logits) into probabilities, which are then rounded to obtain binary predictions (0 or 1).
with torch.no_grad():
    y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
    accuracy = accuracy_score(y_test_tensor, y_pred)
Compute the accuracy of the model on the test dataset using accuracy_score, and return the result.
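Putting the training and evaluation steps together gives the train_and_evaluate helper used below (the same function appears in the full code listing at the end of the article):

def train_and_evaluate(model, train_loader, test_loader, epochs=50):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # Training loop
    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

    # Evaluation on the held-out test set
    model.eval()
    with torch.no_grad():
        y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
        accuracy = accuracy_score(y_test_tensor, y_pred)

    return accuracy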
Next, we instantiate both models (LinearModel and ReLUModel) and train them using the train_and_evaluate function.
linear_model = LinearModel()
linear_accuracy = train_and_evaluate(linear_model, train_loader, test_loader)

relu_model = ReLUModel()
relu_accuracy = train_and_evaluate(relu_model, train_loader, test_loader)
Finally, we print the accuracy of both models on the test dataset.
print("Linear Model Accuracy: {:.2f}".format(linear_accuracy)) print("ReLU Model Accuracy: {:.2f}".format(relu_accuracy))
How does the activation function affect model prediction?
Let's see how well our models have coped with separating the data. To do this, we visualise the data points together with the decision boundary each model has learned.
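The boundaries are drawn with the plot_decision_boundary function from the full code listing at the end of the article, reproduced here with a few explanatory comments:

import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Evaluate the model on every point of the grid
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        Z = torch.sigmoid(model(mesh_input_tensor)).numpy()
    Z = Z.reshape(xx.shape)

    # The 0.5-probability contour is the decision boundary
    plt.figure(figsize=(6, 6))
    plt.contour(xx, yy, Z, levels=[0.5], colors='k', linewidths=2)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
    plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
    plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
    plt.title(title)
    plt.show()

plot_decision_boundary(linear_model, X, y, 'Linear Model Decision Boundary')
plot_decision_boundary(relu_model, X, y, 'ReLU Model Decision Boundary')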
Let’s first look at the ReLU activation function and its properties to understand how the decision boundary was formed.
The ReLU function is defined as:
f(x) = max(0, x)
As you remember, the ReLU function retains positive values while setting negative values to zero. This introduces non-linearity into the neural network.
In our example, we used a simple neural network with a single hidden layer containing eight neurons, followed by a ReLU activation function:
Input -> Linear(2, 8) -> ReLU -> Linear(8, 1) -> Output
The decision boundary is formed as a combination of the individual linear boundaries created by the neurons in the hidden layer. Each neuron in the hidden layer creates a linear decision boundary based on its weights and biases. After the ReLU activation function, these linear boundaries are combined to form a non-linear decision boundary.
Mathematically, this can be represented as:
z_i = w1_i * x1 + w2_i * x2 + b_i
Each neuron in the hidden layer calculates the weighted sum of the input features and adds a bias term. Here, z_i is the output of the ith neuron, w1_i and w2_i are the weights for the ith neuron, x1 and x2 are the input features, and b_i is the bias term for the ith neuron.
In the model with the activation function, the ReLU activation function is applied to the output z_i of each neuron:
a_i = ReLU(z_i) = max(0, z_i)
So, a_i is the activated output of the ith neuron.
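As a small hedged illustration (assuming the relu_model trained above), we can compute z_i and a_i for one input point by hand and check that it matches the layer call:

x = torch.tensor([0.5, -0.25])

W = relu_model.linear1.weight   # shape (8, 2): row i holds (w1_i, w2_i)
b = relu_model.linear1.bias     # shape (8,):   bias b_i for each neuron

with torch.no_grad():
    z = W @ x + b               # z_i = w1_i * x1 + w2_i * x2 + b_i
    a = torch.clamp(z, min=0)   # a_i = ReLU(z_i) = max(0, z_i)
    print(torch.allclose(a, relu_model.relu(relu_model.linear1(x))))  # True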
For the model without an activation function, the output z_i of each neuron is directly passed to the next layer.
The individual linear boundaries created by the neurons in the model without an activation function are not the same, as each neuron has its own weights and biases.
However, the final decision boundary will still be linear, because there is no activation function to introduce non-linearity. The final decision boundary of the model without an activation function is simply a linear combination of the individual linear boundaries created by the neurons in the hidden layer.
Here’s a visualization of the individual linear boundaries for our ReLU model, one per hidden neuron.
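The figure is produced by the plot_individual_boundaries helper from the full code listing at the end of the article; the version below is the same code, lightly commented:

def plot_individual_boundaries(model, X, y):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Pre-activations of the hidden layer over the whole grid
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        hidden_output = model.linear1(mesh_input_tensor)

    # One subplot per hidden neuron: the z_i = 0 contour is its linear boundary
    fig, axes = plt.subplots(2, 4, figsize=(14, 8))
    axes = axes.ravel()
    for i, ax in enumerate(axes):
        contour = hidden_output[:, i].numpy().reshape(xx.shape)
        ax.contour(xx, yy, contour, levels=[0], colors='k', linewidths=1)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
        ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
        ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
        ax.set_title(f'Neuron {i + 1}')
    plt.tight_layout()
    plt.show()

plot_individual_boundaries(relu_model, X, y)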
Remember that this is a simplified explanation of how the decision boundary is formed in a neural network with a single hidden layer and a ReLU activation function.
The exact process of forming the decision boundary will depend on the specific architecture and weights of the neural network, as well as the activation functions used.
Next time we will look at the different activation functions and discuss the pros and cons of using them. We will also talk about modern activation functions, which are used quite rarely, but sometimes help a lot.
Full activation function investigation code
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, TensorDataset


class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        return self.linear2(x)


class ReLUModel(nn.Module):
    def __init__(self):
        super(ReLUModel, self).__init__()
        self.linear1 = nn.Linear(2, 8)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)


# Create synthetic dataset
X, y = make_moons(n_samples=10000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


# Train and evaluate function
def train_and_evaluate(model, train_loader, test_loader, epochs=50):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        y_pred = torch.round(torch.sigmoid(model(X_test_tensor)))
        accuracy = accuracy_score(y_test_tensor, y_pred)

    return accuracy


# Train and evaluate models
linear_model = LinearModel()
linear_accuracy = train_and_evaluate(linear_model, train_loader, test_loader)

relu_model = ReLUModel()
relu_accuracy = train_and_evaluate(relu_model, train_loader, test_loader)

# Print results
print("Linear Model Accuracy: {:.2f}".format(linear_accuracy))
print("ReLU Model Accuracy: {:.2f}".format(relu_accuracy))


def plot_decision_boundary(model, X, y, title):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        Z = torch.sigmoid(model(mesh_input_tensor)).numpy()
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(6, 6))
    plt.contour(xx, yy, Z, levels=[0.5], colors='k', linewidths=2)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
    plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
    plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
    plt.title(title)
    plt.show()


plot_decision_boundary(linear_model, X, y, 'Linear Model Decision Boundary')
plot_decision_boundary(relu_model, X, y, 'ReLU Model Decision Boundary')


def plot_individual_boundaries(model, X, y):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    mesh_input = np.c_[xx.ravel(), yy.ravel()]
    mesh_input_tensor = torch.tensor(mesh_input, dtype=torch.float32)
    with torch.no_grad():
        hidden_output = model.linear1(mesh_input_tensor)

    fig, axes = plt.subplots(2, 4, figsize=(14, 8))
    axes = axes.ravel()
    for i, ax in enumerate(axes):
        contour = hidden_output[:, i].numpy().reshape(xx.shape)
        ax.contour(xx, yy, contour, levels=[0], colors='k', linewidths=1)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
        ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
        ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
        ax.set_title(f'Neuron {i + 1}')
    plt.tight_layout()
    plt.show()


plot_individual_boundaries(relu_model, X, y)