Box Plot: A Complete Guide + Best Practices

The Box Plot (Box and Whisker Plot) is an incredible way to visualize one-dimensional statistical data, including quartiles, median, minimum and maximum values, and outliers.

It’s super compact and easy to understand!

Plus, you don’t have to make assumptions about your data’s distribution, making it an excellent tool for nonparametric statistics.

The plot was first introduced by John Tukey back in 1969, which is why it’s sometimes called the Tukey diagram. While it might not be as informative as a histogram, it’s much more straightforward and takes up less space.

Basically, the plot consists of a rectangle (the “box”) with lines extending out from the sides (the “whiskers”).

If you’re looking at one dataset, the box is usually horizontal. But if you want to compare multiple datasets, you can show them vertically next to each other.

How to read a Box Plot? 🤔

So, a Box Plot is a way of visualizing numerical data that statisticians came up with to represent the key information about a distribution in a simple picture.

Here’s what a boxplot shows:

The median is the value in the middle of a ranked series.

For example, if you rank all the octopuses by their ratings, the median rating would be the one in the middle. That means half the octopuses on the right rated the probability of buying lower than the median, and the other half (on the left) rated it higher.

The median is less affected by outliers than the mean, so the box plot shows the median in the centre rather than the mean.

The upper quartile is the rating above which only 25% of ratings fall.

The lower quartile is the value below which only 25% of ratings fall.

The interquartile range (IQR) is the difference between the 75th and 25th percentiles. 50% of observations lie within this range.

If the range is narrow (like with octopuses), the subgroup members are unanimous in their ratings. If it’s broad, it means there’s no consensus (like with chicks).

Outliers are atypical observations. What exactly counts as atypical depends on context, but you can use the following calculations:

Outliers are values that fall:

  • below the 25th percentile minus 1.5 times the IQR
  • above the 75th percentile plus 1.5 times the IQR
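
Here is a rough sketch of that rule in Python. The function name `tukey_fences` and the sample ratings are made up for illustration, and the quartiles come from the standard library's `statistics.quantiles`, which is only one of several quartile conventions.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ratings with one suspicious value at the end.
ratings = [4, 5, 5, 6, 6, 6, 7, 7, 20]
low, high = tukey_fences(ratings)
outliers = [r for r in ratings if r < low or r > high]
print(low, high, outliers)  # 2.0 10.0 [20]
```

Whether the flagged value is an error or a genuine extreme rating is still a judgment call that depends on context.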

The significance level has nothing to do with the box, but it’s often helpful to show the results of statistical tests and boxplots together.

The p-value helps you understand whether the differences in ratings (between octopuses and chicks) are real or just random variation that comes from using a sample of observations rather than surveying all the octopuses and chicks.

In short: if the p-value is less than 0.05, the differences between subgroups are unlikely to be explained by chance alone (i.e. the differences between subgroups are conventionally called statistically significant).
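
The article does not say which statistical test produced the p-value, so the following is only a sketch of one possible choice: a simple permutation test on the difference in mean ratings. The ratings, the function name `permutation_p_value`, and the number of permutations are all assumptions made for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # seeded so the result is repeatable

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical ratings: octopuses agree with each other, chicks do not.
octopus_ratings = [8, 9, 8, 9, 8, 9, 9, 8]
chick_ratings = [2, 7, 4, 9, 3, 6, 5, 1]
p = permutation_p_value(octopus_ratings, chick_ratings)
print(f"p-value: {p:.4f}")
```

With these made-up numbers the p-value comes out well below 0.05, so the gap between the two groups would be reported as statistically significant.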

How to build a Box Plot? 🏗️

🖋️ Note:

  • Order the data set from smallest to largest.
  • Calculate the first quartile (Q1), which is the median of the lower half of the data.
  • Calculate the third quartile (Q3), which is the median of the upper half of the data.
  • Calculate the interquartile range (IQR), which is the difference between Q3 and Q1.
  • Draw a box from Q1 to Q3, with a line inside representing the median.
  • Draw whiskers from the box to the minimum and maximum values that are within 1.5 times IQR from the box.
  • Mark outliers outside the whiskers with individual points or symbols.

Let’s look at an example.

Suppose we have the following data set: 12, 15, 18, 20, 22, 25, 28, 30, 35.

  1. Order the data set from smallest to largest: 12, 15, 18, 20, 22, 25, 28, 30, 35.
  2. Calculate the first quartile (Q1), which is the median of the lower half of the data: Q1 = median of {12, 15, 18, 20} = (15 + 18)/2 = 16.5.
  3. Calculate the third quartile (Q3), which is the median of the upper half of the data: Q3 = median of {25, 28, 30, 35} = (28 + 30)/2 = 29.
  4. Calculate the interquartile range (IQR), which is the difference between Q3 and Q1: IQR = Q3 – Q1 = 29 – 16.5 = 12.5.
  5. Draw a box from Q1 to Q3, with a line inside representing the median: draw a box from 16.5 to 29, and draw a line inside it at 22 (the median of the entire data set).
  6. Draw whiskers from the box to the minimum and maximum values within 1.5 times IQR from the box: 1.5 times IQR is 18.75, so the whiskers can reach no further than 16.5 – 18.75 = –2.25 and 29 + 18.75 = 47.75. The minimum value within that range is 12, so draw a line from the Q1 edge of the box to 12. The maximum value within that range is 35, so draw a line from the Q3 edge of the box to 35.
  7. Mark outliers outside the whiskers with individual points or symbols: there are no outliers in this data set, because every value lies between –2.25 and 47.75.

So, the final box plot looks like a box with whiskers stretching from 12 to 35, with a median line at 22 inside the box.
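
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch that follows the same "median of each half" convention as the steps; the function name `box_plot_summary` is made up for this example.

```python
from statistics import median

def box_plot_summary(values):
    """Five-number summary using the median-of-halves quartile convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the median position
    upper = data[len(data) - half:]   # values above the median position
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": median(data), "q1": q1, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

summary = box_plot_summary([12, 15, 18, 20, 22, 25, 28, 30, 35])
print(summary)
```

The result matches the worked example: Q1 = 16.5, Q3 = 29, IQR = 12.5, whiskers at 12 and 35, and no outliers. Keep in mind that quartile conventions differ between textbooks and libraries, so software may report slightly different values of Q1 and Q3 for the same data.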

How to build Box Plot in Python? 🐍

To build a Box Plot using Python, we will use the matplotlib library.

Import the required libraries. We will use numpy and matplotlib.pyplot libraries for this example.

import numpy as np
import matplotlib.pyplot as plt

Create a sample data set. For this example, we will draw 100 random numbers from a normal distribution with a mean of 25 and a standard deviation of 10.

data = np.random.normal(25, 10, 100)

Create a figure and axis object using subplots() function.

fig, ax = plt.subplots()

Create a Box Plot using boxplot() function.

ax.boxplot(data)

Add title and axis labels using the set_title(), set_xlabel() and set_ylabel() methods.

ax.set_title('Box Plot Example')
ax.set_xlabel('Data Set')
ax.set_ylabel('Value')

Display the Box Plot using show() function.

plt.show()

Here is the complete code:

# Step 1: Import the required libraries
import numpy as np
import matplotlib.pyplot as plt

# Step 2: Create a sample data set
data = np.random.normal(25, 10, 100)

# Step 3: Create a figure and axis object
fig, ax = plt.subplots()

# Step 4: Create a Box Plot
ax.boxplot(data)

# Step 5: Add title and axis labels
ax.set_title('Box Plot Example')
ax.set_xlabel('Data Set')
ax.set_ylabel('Value')

# Step 6: Display the Box Plot
plt.show()
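
If you want to see the numbers behind the picture, matplotlib exposes the statistics it draws through `matplotlib.cbook.boxplot_stats`. The sketch below feeds it the nine values from the hand-worked example earlier in the article. Because matplotlib computes quartiles with NumPy's linearly interpolated percentiles rather than the median-of-halves method, it reports Q1 = 18 and Q3 = 28 for this data instead of 16.5 and 29.

```python
from matplotlib import cbook

# The nine values from the hand-worked example earlier in the article.
values = [12, 15, 18, 20, 22, 25, 28, 30, 35]

# boxplot_stats returns one dict per data set; we passed a single one.
stats = cbook.boxplot_stats(values)[0]

for key in ("whislo", "q1", "med", "q3", "whishi"):
    print(key, stats[key])
print("outliers:", list(stats["fliers"]))
```

The whiskers still end at 12 and 35 and there are still no outliers, so the two conventions only change the edges of the box here.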

When to use Box Plots? 👍

Box plots are helpful in visualizing the distribution of a data set and identifying fundamental statistical values.

Here are some scenarios when it’s good to use a Box Plot:

✨ Outliers detection

Box plots are an effective way to identify outliers in a data set.

Outliers are data points significantly different from other data points in the same set.

The Box Plot shows outliers as individual points outside the whiskers, which makes them easy to identify.

✨ Skewed distribution

If a data set has a skewed distribution, a Box Plot can help to visualize the skewness.

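The Box Plot will show the median, the interquartile range (IQR), and the extent of the data distribution. A median that sits closer to one end of the box, or one whisker that is much longer than the other, suggests skew toward the longer side.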
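
One way to put a number on what the box shows is Bowley's quartile skewness, which compares the upper and lower halves of the box. It is not mentioned in the article, so treat this as an optional extra; the function name and both data sets are made up for illustration.

```python
from statistics import quantiles

def quartile_skewness(values):
    """Bowley's quartile skewness: 0 for a centered median, > 0 for right skew."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    # Compare the upper half of the box (q3 - med) with the lower half (med - q1).
    return ((q3 - med) - (med - q1)) / (q3 - q1)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15]

print(quartile_skewness(symmetric))     # 0.0
print(quartile_skewness(right_skewed))  # positive: the upper half of the box is wider
```

The function divides by the IQR, so it will fail on data where Q1 equals Q3.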

✨ Comparing distributions

Box plots are helpful in comparing the distributions of two or more data sets.

Multiple Box Plots can be placed side by side for easy comparison.

✨ Identifying central tendency

Box plots help to identify the central tendency of a data set.

The Box Plot’s median line shows the data set’s central value.

✨ Variation and spread

Box plots help to identify the variation and spread of a data set.

The box size in the Box Plot represents the IQR, which is a measure of the variation of the data set.

✨ Symmetry

Box plots are helpful in identifying the symmetry of a data set.

If the box and whiskers look symmetrical around the median line, the data set is likely to be roughly symmetric.

When not to use them?

While Box plots are a useful visualization tool in many scenarios, there are also some cases when they may not be the best choice.

Here are some scenarios when it’s not recommended to use a Box Plot:

✨ Small sample sizes

Box plots may not be suitable for small sample sizes as they may not accurately represent the data.

In such cases, other visualization tools, such as a histogram or a scatter plot, may be more appropriate.

✨ Non-numeric data

Box plots are designed to display numeric data and therefore may not be appropriate for displaying non-numeric data such as categorical data.

✨ Different scales

If the data sets being compared have different scales or units, Box plots drawn on a shared axis may not be suitable, as the boxes are not directly comparable.

✨ Multiple modes

Box plots may not be able to accurately represent data sets with multiple modes, as the median and quartiles they display do not reveal how many peaks the distribution has.
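
To see why, here is a small sketch with two made-up data sets: one spread evenly, one bunched into two tight clusters. The helper `five_number_summary` is a hypothetical name, and it uses the standard library's interpolated quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum: the numbers a box plot draws."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Hypothetical data: evenly spread values vs. two tight clusters.
spread_out = [0, 0.5, 1, 1.5, 2, 3, 4, 4.5, 5, 5.5, 6, 7, 8, 8.5, 9, 9.5, 10]
two_peaks = [0, 1.9, 2, 2, 2, 2.1, 2.2, 4, 5, 6, 7.8, 7.9, 8, 8, 8.1, 8.2, 10]

print(five_number_summary(spread_out))  # (0, 2.0, 5.0, 8.0, 10)
print(five_number_summary(two_peaks))   # (0, 2.0, 5.0, 8.0, 10)
```

Both data sets produce exactly the same box and whiskers, even though a histogram would show one flat shape and one shape with two clear peaks.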

✨ Extreme values

If a data set has extreme values significantly different from other values in the set, Box plots may not be the best choice for visualization as they tend to focus on the median and quartiles and may not provide enough detail on extreme values.

✨ Detailed information

While Box plots are an excellent way to get an overview of a data set, they do not provide detailed information, such as individual data points or specific statistical measures.

Overall, Box plots are a valuable tool in many scenarios, but may not be suitable for all data sets.

It’s essential to consider the data’s nature and the analysis’s goals when deciding whether to use a Box Plot.

Best Practices ❤️

✨ Use consistent scales

When creating Box Plots, it’s crucial to use a consistent scale on the value axis (the y-axis in the examples here) to ensure they are comparable.

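# Example of using consistent scales for multiple Box Plots
# (data1, data2, data3 are made-up samples; substitute your own arrays)
data1 = np.random.normal(25, 10, 100)
data2 = np.random.normal(30, 5, 100)
data3 = np.random.normal(20, 8, 100)

# One shared axis, so all three boxes use the same value scale
fig, ax = plt.subplots()
ax.boxplot([data1, data2, data3])
ax.set_xticklabels(['Data 1', 'Data 2', 'Data 3'])
ax.set_ylabel('Value')

plt.show()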

✨ Label axes and titles

Always label the x and y-axes and add a title to the Box Plot to provide context for the viewer.

# Example of adding axis labels and a title to a Box Plot

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_title('Box Plot of Data')
ax.set_xlabel('Distribution')
ax.set_ylabel('Value')
plt.show()

✨ Avoid clutter

Avoid cluttering the Box Plot with unnecessary information, such as grid lines or extra data points that are not part of the Box Plot.

# Example of a cluttered Box Plot with grid lines

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_title('Box Plot of Data')
ax.set_xlabel('Distribution')
ax.set_ylabel('Value')
ax.grid(True)
plt.show()

✨ Use appropriate outliers

Points beyond the whiskers may be measurement errors or genuine extreme values, and the Box Plot alone cannot tell you which.

It’s important to understand the data’s nature and decide deliberately whether these points should be shown.

# Example of including outliers in a Box Plot
# (showfliers=True is matplotlib's default; pass showfliers=False to hide them)
fig, ax = plt.subplots()

ax.boxplot(data, showfliers=True)
ax.set_title('Box Plot of Data with Outliers')
ax.set_xlabel('Distribution')
ax.set_ylabel('Value')

plt.show()

✨ Compare Box Plots carefully

When comparing Box Plots, it’s important to consider the scales and the range of the data being compared.

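# Example of comparing two Box Plots
# (data1 and data2 are made-up samples; substitute your own arrays)
data1 = np.random.normal(25, 10, 100)
data2 = np.random.normal(30, 5, 100)

fig, ax = plt.subplots()
ax.boxplot([data1, data2])
ax.set_xticklabels(['Data 1', 'Data 2'])
ax.set_ylabel('Value')

plt.show()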

By following these best practices, you can create Box Plots that are clear, informative, and easy to interpret.

Advanced Techniques 🔥

✨ Customizing Box Plot aesthetics

In addition to changing the color and line style, there are many ways to customize the aesthetics of a Box Plot using Python libraries like Matplotlib or Seaborn.

For example, you can add shading or annotations to highlight specific features of the plot, adjust the width of the boxes, or change the position or size of the whiskers.
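
As a rough illustration of those options in Matplotlib, the sketch below shades the boxes, narrows them, restyles the median and outlier markers, and adds one annotation. The data, colors, and label text are arbitrary choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the figure is repeatable
groups = [rng.normal(25, 10, 100), rng.normal(30, 5, 100)]  # made-up data

fig, ax = plt.subplots()
parts = ax.boxplot(
    groups,
    widths=0.4,          # narrower boxes
    whis=1.5,            # whisker reach in multiples of the IQR
    patch_artist=True,   # draw boxes as patches so they can be filled
    medianprops={"color": "black", "linewidth": 2},
    flierprops={"marker": "x", "markersize": 5},
)
for box, color in zip(parts["boxes"], ["lightsteelblue", "lightsalmon"]):
    box.set_facecolor(color)  # shade each box differently

ax.annotate("narrower spread", xy=(2, 30), xytext=(2.3, 50),
            arrowprops={"arrowstyle": "->"})
ax.set_xticks([1, 2])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Value")
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig(...)` to write it to a file.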

✨ Combining Box Plots with other visualizations

Box Plots are useful for displaying the distribution of a single variable, but they can also be combined with other types of visualizations to explore relationships between variables.

For example, you can use a Box Plot to compare the distribution of a variable across different groups, and then overlay a scatter plot or line plot to show how the relationship between the variables varies across the groups.
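
A common version of this idea is to draw the individual observations on top of each box with a little horizontal jitter. The sketch below assumes two made-up groups; the jitter width and marker size are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
groups = {"Group A": rng.normal(25, 10, 40), "Group B": rng.normal(35, 6, 40)}

fig, ax = plt.subplots()
# Hide the outlier markers: every point is drawn by the scatter calls below.
ax.boxplot(list(groups.values()), showfliers=False)

for position, values in enumerate(groups.values(), start=1):
    # Small random horizontal offsets keep overlapping points visible.
    jitter = rng.uniform(-0.08, 0.08, size=len(values))
    ax.scatter(position + jitter, values, s=12, alpha=0.5)

ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Value")
```

The boxes give the summary and the points show how many observations sit behind it, which is especially helpful when a group is small.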

✨ Using Box Plots for multivariate data

Box Plots can be extended to display multivariate data by using multiple boxes side-by-side or nested within each other.

This allows you to explore the relationships between multiple variables at once, or to compare the distribution of a single variable across multiple subgroups defined by other variables.
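
Here is one way to sketch side-by-side boxes for two subgroups within each category, using the `positions` argument of `boxplot()` to offset the boxes. The category names, subgroup names, and data are all hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
categories = ["Low", "Medium", "High"]
# Two hypothetical subgroups measured in each category.
subgroup_a = [rng.normal(20 + 5 * i, 4, 50) for i in range(3)]
subgroup_b = [rng.normal(24 + 5 * i, 6, 50) for i in range(3)]

centers = np.arange(1, len(categories) + 1)
fig, ax = plt.subplots()
# Shift each subgroup slightly left or right of the category center.
boxes_a = ax.boxplot(subgroup_a, positions=centers - 0.2, widths=0.3,
                     patch_artist=True, boxprops={"facecolor": "lightsteelblue"})
boxes_b = ax.boxplot(subgroup_b, positions=centers + 0.2, widths=0.3,
                     patch_artist=True, boxprops={"facecolor": "lightsalmon"})

ax.set_xticks(centers)
ax.set_xticklabels(categories)
ax.legend([boxes_a["boxes"][0], boxes_b["boxes"][0]], ["Subgroup A", "Subgroup B"])
ax.set_ylabel("Value")
```

If your data lives in a pandas DataFrame, Seaborn's `boxplot` with a `hue` column produces a similar grouped layout with less manual positioning.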

✨ Handling skewed data

Box Plots are less effective at displaying heavily skewed data: they hint at the direction of the skew, but the long tail tends to show up as a crowd of points beyond one whisker.

In these cases, you can transform the data or use alternative visualizations such as violin plots or density plots to better represent the distribution.
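
As a small sketch of the transformation idea, the code below draws a right-skewed sample, takes its logarithm, and counts how many points the 1.5 times IQR rule flags on each scale. The log-normal sample and the helper name `count_flagged` are assumptions made for this example, and a log transform only works for strictly positive data.

```python
import math
import random
from statistics import quantiles

def count_flagged(values, k=1.5):
    """Count points beyond the k * IQR fences."""
    q1, _med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return sum(1 for v in values if v < q1 - k * iqr or v > q3 + k * iqr)

rng = random.Random(0)  # seeded so the sample is repeatable
skewed = [rng.lognormvariate(0, 1) for _ in range(500)]  # right-skewed sample
logged = [math.log(v) for v in skewed]

print("flagged on raw scale:", count_flagged(skewed))
print("flagged on log scale:", count_flagged(logged))
```

On the raw scale the long right tail produces many flagged points, while on the log scale far fewer remain, so a box plot of the transformed values is easier to read.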

✨ Using Box Plots for outlier detection

Box Plots can also be used to identify outliers in a dataset. Any point outside the whiskers can be considered a potential outlier.

If the data has extreme outliers, it may be more appropriate to use a modified Box Plot or a different visualization.

Marva

I share my insights and experiences on how to be a thriving software developer while still leading a fulfilling life.
