Understanding Random Forest Classification and Its Effectiveness

Why random forests and ensemble methods are the underrated heroes of machine learning

Categories: Experiments, Machine Learning

Published: March 7, 2024

A Random Forest is a versatile and robust machine learning algorithm used for both classification and regression tasks. It builds upon the concept of decision trees, improving on their accuracy and overcoming their tendency to overfit by combining the predictions of numerous decision trees constructed on different subsets of the data. We have already experimented with a Random Forest regressor, and in this experiment, we will focus on Random Forest classification.

What are Random Forest models?

A Random Forest operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. It is termed “Random” because each tree is grown on a random subset of the features and data points, which keeps the trees decorrelated and generally results in a more robust overall prediction.

Random Forests have the following key characteristics:

  • Robustness: A Random Forest is less likely to overfit than a single decision tree, because it averages the predictions of many trees.
  • Handling of Unbalanced Data: It can handle unbalanced data in both binary and multiclass classification problems effectively.
  • Feature Importance: It provides insights into which features are most important for the prediction.
  • Explainability: A Random Forest provides good explainability, and isn’t a black box.

The mechanics of the algorithm

The Random Forest algorithm follows these steps (a small code sketch after the list makes them concrete):

  1. Bootstrap Aggregating (Bagging): Random subsets of the data are created for training each tree, sampled with replacement.
  2. Random Feature Selection: When splitting nodes during the formation of trees, only a random subset of features are considered.
  3. Building Trees: Each subset is used to train a decision tree. Trees grow to their maximum length and are not pruned.
  4. Aggregation: For classification tasks, the mode of all tree outputs is considered for the final output.
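
To make these steps concrete, here is a minimal, illustrative sketch of bagging with majority voting built on top of scikit-learn’s DecisionTreeClassifier. It is not how RandomForestClassifier is implemented internally (scikit-learn also averages class probabilities and offers many more options); it is just a toy version of steps 1–4, with dataset and parameter choices picked for illustration.

Show the code
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []

for i in range(n_trees):
    # 1. Bagging: sample rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Random feature selection: max_features limits the features
    #    considered at each split. 3. Trees are grown unpruned.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_train[idx], y_train[idx])
    trees.append(t)

# 4. Aggregation: majority vote across the trees
votes = np.stack([t.predict(X_test) for t in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("Toy-forest accuracy:", (majority == y_test).mean())

Note that the per-split feature sampling of step 2 is delegated to the tree itself through max_features, which is also how scikit-learn’s forests handle it.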

Random Forests typically outperform single decision trees because they reduce variance without increasing bias. This means they are less likely to fit noise in the training data, which generally makes them more accurate. They are also effective when the feature space is large, and robust against overfitting, a common issue in complex models.
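
As a quick, hedged check of that claim on the dataset used later in this post, one can compare the cross-validated accuracy of a single decision tree with that of a forest; the exact numbers will vary from run to run, so none are quoted here.

Show the code
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Cross-validated accuracy of a single (high-variance) tree...
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
# ...versus an ensemble of 100 such trees
forest_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

print(f"Single tree  : {tree_scores.mean():.3f} ± {tree_scores.std():.3f}")
print(f"Random Forest: {forest_scores.mean():.3f} ± {forest_scores.std():.3f}")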

Effectiveness

Since their inception, Random Forests have been shown to be highly effective for a wide range of problems. They are particularly known for their effectiveness in:

  • Handling large data sets with higher dimensionality. They can handle thousands of input variables without variable deletion.
  • Maintaining accuracy even when a large proportion of the data is missing.
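
As an illustration of the first point, a forest can be fitted directly on a very wide dataset without any prior feature selection. The sketch below uses a synthetic dataset with made-up dimensions purely for illustration; it is not a benchmark.

Show the code
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic dataset with 2,000 features, of which only 20 are informative
X_wide, y_wide = make_classification(
    n_samples=1000, n_features=2000, n_informative=20, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_wide, y_wide, random_state=42)

clf_wide = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf_wide.fit(X_tr, y_tr)
print("Accuracy on 2,000-dimensional data:", clf_wide.score(X_te, y_te))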

An example Random Forest classifier

Below is an example demonstrating the implementation of a Random Forest classifier using the scikit-learn library. This example uses the Breast Cancer dataset. Let us start by describing the data.

Show the code
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data
breast_cancer = load_breast_cancer()

df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df
[Output: a preview of the resulting DataFrame — 569 rows × 30 columns, with features ranging from mean radius through worst fractal dimension.]

And let’s get a view into the distribution of the available data.

Show the code
df.describe().drop("count").style.background_gradient(cmap="Greens")
[Output: a colour-shaded table of summary statistics (mean, std, min, 25%, 50%, 75%, max) for all 30 features.]
About Scale Variance

The Random Forest algorithm is not sensitive to feature scaling, so it is not necessary to normalise or standardise the data beforehand; this is one of the advantages of using Random Forests. The algorithm is also reasonably robust to missing values and works with both continuous and categorical data, although a particular implementation may still require imputation or encoding.
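
A quick way to convince yourself of the scaling point: fit the same forest on raw and standardised versions of the features. Because tree splits depend only on the ordering of values, the predictions should come out essentially identical. This is a standalone sketch that loads and splits the data itself; the variable names are ours.

Show the code
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

raw = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
scaled = RandomForestClassifier(random_state=42).fit(scaler.transform(X_tr), y_tr)

# Splits depend only on value order, which standardisation preserves,
# so the two models should agree on (virtually) every test sample.
agreement = np.mean(raw.predict(X_te) == scaled.predict(scaler.transform(X_te)))
print(f"Prediction agreement with/without scaling: {agreement:.3f}")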

Let us build and train a Random Forest model with the data we just loaded.

Show the code
# Split data into features and target
X = breast_cancer.data
y = breast_cancer.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize the Random Forest classifier
clf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Random Forest classifier: {accuracy:.2f}")
Accuracy of Random Forest classifier: 0.97

This is all good and proper, but what do we mean by a “decision tree”? Let us clarify this by visualizing one of the random trees that has been built by the algorithm during the training. Each node in the tree represents a “decision” point and helps to split the data based on the best possible feature and threshold to differentiate the outcomes.

  • Root Node: This is the top-most node of the tree where the first split is made. The split at this node is based on the feature that results in the most significant information gain or the best Gini impurity decrease. Essentially, it chooses the feature and threshold that provide the clearest separation between the classes based on the target variable.

  • Splitting Nodes: These are the nodes where subsequent splits happen. Each splitting node examines another feature and makes a new decision, slicing the dataset into more homogeneous (or pure) subsets. Splitting continues until the algorithm reaches a predefined maximum depth, a minimum number of samples per node, or no further information gain is possible, among other potential stopping criteria.

  • Leaf Nodes: Leaf nodes are the terminal nodes of the tree at which no further splitting occurs. Each leaf node represents a decision outcome or prediction. In classification trees, the leaf node assigns the class that is most frequent among the samples in that node. In regression trees, the leaf usually predicts the mean or median of the targets.

  • Branches: Branches represent the outcome of a test in terms of feature and threshold. Each branch corresponds to one of the possible answers to the question posed at the node: Is the feature value higher or lower than the threshold? This binary splitting makes the structure of a decision tree inherently simple to understand.
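
These quantities can also be read off programmatically from a fitted tree’s tree_ attribute. As a small aside before plotting, here is one way to inspect the root split of the first tree in the forest we just trained; this sketch reuses clf and breast_cancer from earlier, and the attributes used are part of scikit-learn’s documented tree structure.

Show the code
# Peek at the root node of the first tree in the trained forest
t = clf.estimators_[0].tree_

root_feature = breast_cancer.feature_names[t.feature[0]]
root_threshold = t.threshold[0]

print(f"Root split: '{root_feature}' <= {root_threshold:.4f}")
print(f"Gini impurity at the root: {t.impurity[0]:.4f}")
print(f"Training samples reaching the root: {t.n_node_samples[0]}")

With that numeric view in mind, here is the corresponding plot of one tree, truncated to two levels for readability.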

Show the code
import matplotlib.pyplot as plt
from sklearn import tree

# Select the tree that you want to visualize (here, the tree at index 5 in the forest)
estimator = clf.estimators_[5]

# Create a figure for the plot
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(8, 6), dpi=300)

# Visualize the tree using plot_tree function
tree.plot_tree(
    estimator,
    feature_names=breast_cancer.feature_names,
    class_names=breast_cancer.target_names,
    filled=True,
    max_depth=2,  # Limit the depth of the tree for better readability
    ax=axes,
)

# Display the plot
plt.show()

We have seen a single tree, but a Random Forest is an ensemble of many such trees. The final prediction is made by aggregating the predictions of all the trees in the forest, which the quick check below illustrates.
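
As a rough sanity check (reusing clf and X_test from earlier), we can collect each tree’s vote ourselves and compare the result against the forest’s own predictions. Note that scikit-learn’s RandomForestClassifier actually averages the trees’ predicted class probabilities rather than taking a hard majority vote, so the two can disagree in near-tie cases.

Show the code
import numpy as np

# Hard majority vote across all trees in the fitted forest
all_votes = np.stack([t.predict(X_test) for t in clf.estimators_])
majority_vote = (all_votes.mean(axis=0) > 0.5).astype(int)

# Compare with the forest's built-in (probability-averaging) prediction
agreement = np.mean(majority_vote == clf.predict(X_test))
print(f"Agreement between majority vote and clf.predict: {agreement:.3f}")

We can also visualise all or a subset of the trees in the forest to grasp the complexity and diversity of the model.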

Show the code
import random

# Total number of trees in the random forest
total_trees = len(clf.estimators_)

# Number of trees to plot
num_trees_to_plot = 24

# Randomly pick 'num_trees_to_plot' trees from the random forest
selected_trees = random.sample(range(total_trees), num_trees_to_plot)

# Create a figure object and an array of axes objects (subplots),
# using just enough rows of 4 columns to hold every selected tree
n_rows = (num_trees_to_plot + 3) // 4
fig, axes = plt.subplots(
    nrows=n_rows,
    ncols=4,
    figsize=(8, 2 * n_rows),
)

# Flatten the array of axes (for easy iteration if it's 2D due to multiple rows)
axes = axes.flatten()

# Plot each randomly selected tree using a subplot
for i, ax in enumerate(
    axes[:num_trees_to_plot]
):  # Limit axes iteration to number of trees to plot
    tree_index = selected_trees[i]
    tree.plot_tree(
        clf.estimators_[tree_index],
        feature_names=breast_cancer.feature_names,
        class_names=["Malignant", "Benign"],
        filled=True,
        ax=ax,
    )
    ax.set_title(f"Tree {tree_index}", fontsize=9)

# If there are any leftover axes, turn them off (when num_trees_to_plot is not a multiple of 4)
for ax in axes[num_trees_to_plot:]:
    ax.axis("off")

# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

Explainability

We’ve established that Random Forest models offer substantial explainability, unlike many other complex model frameworks that are often considered “black boxes.” To elucidate this aspect, one effective method is visualizing the decision paths used by the trees within the forest when making predictions. This can be accomplished using the dtreeviz library, which provides a detailed and interactive visualization of the decision-making process within a tree.

Using dtreeviz, we can trace the decision path of a single example from the training set across any of the trees in the model. This visualization includes splits made at each node, the criteria for these splits, and the distribution of target classes at each step. Such detailed traceability helps in understanding exactly how the model is arriving at its conclusions, highlighting the individual contributions of features in the decision process.

Show the code
from dtreeviz import model

# Suppress warnings - this is just to shut up warnings about fonts in GitHub Actions
import logging

logging.getLogger("matplotlib.font_manager").setLevel(level=logging.CRITICAL)

# The training sample to visualize
x = X_train[5]

# Define colors for benign and malignant
color_map = {
    "classes": [
        None,  # 0 classes
        None,  # 1 class
        ["#FFAAAA", "#AAFFAA"],  # 2 classes
    ]
}

# Visualizing the selected tree
viz = model(
    estimator,
    X_train,
    y_train,
    target_name="Target",
    feature_names=breast_cancer.feature_names,
    class_names=list(breast_cancer.target_names),
)

viz.view(x=x, colors=color_map)

Another great feature of Random Forests is that they can explain the relative importance of each feature when predicting results. For our Breast Cancer dataset, here is how each feature impacts the model.

Show the code
import numpy as np

features = breast_cancer.feature_names
importances = clf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="b", align="center")
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Now that we know which features are most important, we can use dtreeviz to visualise the classification boundaries for any pair of features. This can help us understand how the model is making decisions. Let us visualise classification boundaries for worst concave points and worst area features.

Show the code
from dtreeviz import decision_boundaries

X_features_for_boundaries = X_train[
    :, [27, 23]
]  # 27 = 'worst concave points', 23 = 'worst area'
new_clf = RandomForestClassifier(random_state=42)
new_clf.fit(X_features_for_boundaries, y_train)

fig, axes = plt.subplots(figsize=(8, 6))
decision_boundaries(
    new_clf,
    X_features_for_boundaries,
    y_train,
    ax=axes,
    feature_names=["worst concave points", "worst area"],
    class_names=breast_cancer.target_names,
    markers=["X", "s"],
    colors=color_map,
)
plt.show()

We can also plot pairs of features and their decision boundaries in a grid, to understand how pairs of features interact in the model. This can help us understand the relationships between features and how they contribute to the model’s predictions. Let us do so for random pairs, just for illustration purposes. In practice, you would choose pairs of features that are most important for your specific problem.

Show the code
# Set a random seed for reproducibility
np.random.seed(42)

# Create a 4x4 subplot grid
fig, axes = plt.subplots(4, 4, figsize=(20, 20))
axes = axes.flatten()  # Flatten the 2D array of axes for easy iteration

# Randomly select and plot decision boundaries for 16 pairs of features (one per subplot)
for ax in axes:
    # Randomly pick two distinct features
    features_idx = np.random.choice(range(X.shape[1]), size=2, replace=False)
    X_features_for_boundaries = X[:, features_idx]

    # Train a new classifier on just this pair of features (a separate
    # variable so we don't overwrite the main clf trained earlier)
    pair_clf = RandomForestClassifier(random_state=42)
    pair_clf.fit(X_features_for_boundaries, y)

    # Plot decision boundaries using dtreeviz
    decision_boundaries(
        pair_clf,
        X_features_for_boundaries,
        y,
        ax=ax,
        feature_names=features[features_idx],
        class_names=breast_cancer.target_names,
        markers=["X", "s"],
        colors=color_map,
    )

    # Set titles for the subplots
    ax.set_title(f"{features[features_idx[0]]} vs {features[features_idx[1]]}")

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Random Forests vs Neural Networks

Comparing Random Forests to neural networks involves considering several factors such as accuracy, training time, interpretability, and scalability across different types of data and tasks. Both algorithms have their unique strengths and weaknesses, making them suitable for specific scenarios.

Performance metrics

Random Forests typically offer strong predictive accuracy with less complexity than deep learning models, particularly on structured datasets. By constructing multiple decision trees and averaging their outputs, Random Forests can capture a variety of signals without overfitting too much, making them competitive for many standard data science tasks. In contrast, neural networks, especially deep learning architectures, are known for their prowess on unstructured data like images, text, or audio, due to their ability to learn intricate feature hierarchies.

When it comes to training, Random Forests are usually quicker on small to medium-sized datasets, since the trees can be built in parallel and there is no iterative, epoch-based optimisation. Neural networks, on the other hand, often require intensive computation over many epochs, relying heavily on GPUs or TPUs to handle large volumes of data. This extra training overhead can pay off if the dataset is big and complex, but it does mean more time and resources are needed.

Interpretability is another key distinction. Because each tree’s splits can be traced, Random Forests offer a more transparent look into how decisions are reached, and feature importance scores can be extracted. Neural networks, however, are often seen as “black boxes”, with hidden layers that make it harder to pinpoint exactly how they arrive at their predictions. This can be challenging in fields that require clear explanations for regulatory or trust reasons.

In terms of robustness, Random Forests mitigate variance by aggregating a large number of individual trees, reducing the chance of overfitting. Neural networks, if not carefully regularized with techniques like dropout or early stopping, can easily overfit. Yet, with proper tuning and enough data, they remain extremely powerful.

Finally, there’s the matter of scalability. Random Forests scale well in parallel settings for both training and inference, making them handy in distributed environments. Neural networks can also scale effectively to handle massive datasets, especially with specialized hardware, but require a more complex setup. That said, their ability to adapt to various input sizes and modalities remains unmatched for certain tasks.
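
To ground this comparison a little, here is a small, hedged sketch pitting RandomForestClassifier against scikit-learn’s MLPClassifier on the same Breast Cancer data. The MLP architecture and hyperparameters below are arbitrary illustrative choices, not a tuned baseline, and the timings and scores will vary by machine.

Show the code
from time import perf_counter

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    # Neural networks, unlike forests, do benefit from feature scaling
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
    ),
}

for name, model in models.items():
    start = perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = perf_counter() - start
    print(f"{name}: accuracy={model.score(X_te, y_te):.3f}, fit time={elapsed:.2f}s")

On a small tabular dataset like this one, the forest typically needs less setup (no scaling, fewer hyperparameters to tune), which echoes the points above.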

Suitability based on data type

Random Forests are particularly well-suited for:

  • Classification and regression on structured data
  • Large datasets, but with a limitation on the input feature space (high-dimensional spaces might lead to slower performance)
  • Applications requiring a balance between accuracy and interpretability

On the other hand, Neural Networks are more appropriate for:

  • High-complexity tasks involving image, text, or audio
  • Unstructured data which requires feature learning
  • Situations where model interpretability is less critical than performance

Example comparisons

In image recognition, neural networks (specifically convolutional neural networks) perform significantly better than random forests due to their ability to hierarchically learn features directly from data.

In tabular data prediction, random forests typically outperform neural networks, especially when the dataset isn’t huge, as they can better leverage the structure within the data without the need for extensive parameter tuning.

Final remarks

In summary, Random Forests are excellent for many traditional machine learning tasks and provide a good mix of accuracy, ease of use, and speed, especially on structured data. Neural networks are preferable for tasks involving complex patterns and large scales of unstructured data, although they require more resources and effort to tune and interpret.

Choosing between the two often depends on the specific requirements of the task, the nature of the data involved, and the computational resources available. In practice, it’s also common to evaluate both types of models along with others to find the best tool for a particular job.

Reuse

This work is licensed under CC BY (View License)