Understanding Random Forest Classification and Its Effectiveness
why random forests and ensemble methods are the underrated heroes of machine learning
Experiments
Machine Learning
Published
March 7, 2024
A Random Forest is a versatile and robust machine learning algorithm used for both classification and regression tasks. It builds upon the concept of decision trees, but improves on their accuracy and overcomes their tendency to overfit by combining the predictions of numerous decision trees constructed on different subsets of the data. We have already experimented with a Random Tree regressor, and in this experiment, we will focus on Random Forest classification.
What are Random Forest models?
A Random Forest operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. It is termed “Random” because each tree is grown on a random subset of the data points and considers a random subset of features at each split, which keeps the trees diverse and generally results in a more robust overall prediction.
Random Forests have the following key characteristics:
Robustness: A Random Forest is less likely to overfit than a single decision tree, because averaging the predictions of many trees yields a more accurate and stable result.
Handling of Unbalanced Data: It can handle unbalanced data in both binary and multiclass classification problems effectively (a small example follows this list).
Feature Importance: It provides insights into which features are most important for the prediction.
Explainability: A Random Forest provides good explainability, and isn’t a black box.
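As a small, hedged illustration of the unbalanced-data point above, scikit-learn's RandomForestClassifier accepts a class_weight option; the synthetic dataset below is our own toy example, not part of this experiment.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A deliberately unbalanced toy problem: roughly 90% of one class, 10% of the other
X_imb, y_imb = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" re-weights samples inversely to class frequencies,
# so the minority class still influences the splits
clf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
clf_balanced.fit(X_imb, y_imb)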
The mechanics of the algorithm
The Random Forest algorithm follows these steps (a minimal code sketch follows the list):
Bootstrap Aggregating (Bagging): Random subsets of the data are created for training each tree, sampled with replacement.
Random Feature Selection: When splitting nodes during the formation of trees, only a random subset of features is considered.
Building Trees: Each subset is used to train a decision tree. Trees grow to their maximum length and are not pruned.
Aggregation: For classification tasks, the mode of all tree outputs is considered for the final output.
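To make these steps concrete, here is a minimal, illustrative sketch of the bagging-plus-random-features recipe, written by hand on top of scikit-learn decision trees. The function names are ours and integer class labels are assumed; in practice you would simply use RandomForestClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, n_trees=100, random_state=0):
    """Illustrative only: bagging + random feature selection done by hand."""
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # 1. Bagging: sample rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # 2./3. Each split considers a random subset of features ("sqrt" of them);
        #       the tree is grown to full depth and never pruned
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31 - 1))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    # 4. Aggregation: majority vote (mode) across all trees, per sample
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])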
Random Forests typically outperform single decision trees because they reduce variance without increasing bias. This means they are less likely to fit noise in the training data, which generally makes them more accurate. They are also effective when the feature space is large, and robust against overfitting, a common issue in complex models.
Effectiveness
Since their inception, Random Forests have proven highly effective for a wide range of problems. They are particularly known for their effectiveness in:
Handling large datasets with high dimensionality: they can handle thousands of input variables without variable deletion.
Maintaining accuracy even when a large proportion of the data is missing.
An example Random Forest classifier
Below is an example demonstrating the implementation of a Random Forest classifier using the scikit-learn library. This example uses the Breast Cancer dataset. Let us start by describing the data.
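A minimal sketch of loading the dataset via scikit-learn's built-in copy (the later cells assume a breast_cancer object obtained this way):

from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset bundled with scikit-learn
breast_cancer = load_breast_cancer()

# 569 samples, 30 numeric features, and two classes (malignant / benign)
print(breast_cancer.data.shape)
print(breast_cancer.target_names)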
The Random Forest algorithm is not sensitive to feature scaling, so it is not necessary to preprocess the data with scale normalization; this is one of the advantages of using Random Forests. The algorithm also tolerates missing values, so imputation is often unnecessary, and it handles both continuous and categorical data.
Let us build and train a Random Forest model with the data we just loaded.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data into features and target
X = breast_cancer.data
y = breast_cancer.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize the Random Forest classifier
clf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Random Forest classifier: {accuracy:.2f}")
Accuracy of Random Forest classifier: 0.97
This is all good and proper, but what do we mean by a “decision tree”? Let us clarify this by visualizing one of the trees the algorithm built during training. Each node in the tree represents a “decision” point that splits the data on the feature and threshold that best differentiate the outcomes.
Root Node: This is the top-most node of the tree where the first split is made. The split at this node is based on the feature that results in the most significant information gain or the best Gini impurity decrease. Essentially, it chooses the feature and threshold that provide the clearest separation between the classes based on the target variable.
Splitting Nodes: These are the nodes where subsequent splits happen. Each splitting node examines another feature and makes a new decision, slicing the dataset into more homogeneous (or pure) subsets. Splitting continues until the algorithm reaches a predefined maximum depth, a minimum number of samples per node, or no further information gain is possible, among other potential stopping criteria.
Leaf Nodes: Leaf nodes are the terminal nodes of the tree at which no further splitting occurs. Each leaf node represents a decision outcome or prediction. In classification trees, the leaf node assigns the class that is most frequent among the samples in that node. In regression trees, the leaf usually predicts the mean or median of the targets.
Branches: Branches represent the outcome of a test in terms of feature and threshold. Each branch corresponds to one of the possible answers to the question posed at the node: Is the feature value higher or lower than the threshold? This binary splitting makes the structure of a decision tree inherently simple to understand.
import matplotlib.pyplot as plt
from sklearn import tree

# Select the tree that you want to visualize (here, the tree at index 5)
estimator = clf.estimators_[5]

# Create a figure for the plot
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(8, 6), dpi=300)

# Visualize the tree using the plot_tree function
tree.plot_tree(
    estimator,
    feature_names=breast_cancer.feature_names,
    class_names=breast_cancer.target_names,
    filled=True,
    max_depth=2,  # Limit the depth of the tree for better readability
    ax=axes,
)

# Display the plot
plt.show()
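Beyond the picture, the same anatomy can be inspected programmatically through the fitted tree's tree_ attribute; a small sketch, reusing estimator and breast_cancer from the cells above:

import numpy as np

t = estimator.tree_

# Root node: the feature and threshold chosen for the very first split
print(f"Root split: {breast_cancer.feature_names[t.feature[0]]} <= {t.threshold[0]:.3f}")

# Leaf nodes have no children (both child indices are -1)
is_leaf = (t.children_left == -1) & (t.children_right == -1)
print(f"{t.node_count} nodes in total, {is_leaf.sum()} of them leaves")

# Gini impurity drops from the root towards the (purer) leaves
print(f"Root impurity: {t.impurity[0]:.3f}, mean leaf impurity: {t.impurity[is_leaf].mean():.3f}")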
We have seen a single tree, but a Random Forest is an ensemble of many trees. The final prediction is made by aggregating the predictions of all the trees in the forest. We can also visualise all or a subset of the trees to grasp the complexity and diversity of the model.
import random

# Total number of trees in the random forest
total_trees = len(clf.estimators_)

# Number of trees to plot
num_trees_to_plot = 24

# Randomly pick 'num_trees_to_plot' trees from the random forest
selected_trees = random.sample(range(total_trees), num_trees_to_plot)

# Create a figure object and an array of axes objects (subplots)
fig, axes = plt.subplots(
    nrows=(num_trees_to_plot // 4) + 1,
    ncols=4,
    figsize=(8, 2 * ((num_trees_to_plot // 4) + 1)),
)

# Flatten the array of axes (for easy iteration if it's 2D due to multiple rows)
axes = axes.flatten()

# Plot each randomly selected tree using a subplot
for i, ax in enumerate(
    axes[:num_trees_to_plot]
):  # Limit axes iteration to number of trees to plot
    tree_index = selected_trees[i]
    tree.plot_tree(
        clf.estimators_[tree_index],
        feature_names=breast_cancer.feature_names,
        class_names=["Malignant", "Benign"],
        filled=True,
        ax=ax,
    )
    ax.set_title(f"Tree {tree_index}", fontsize=9)

# If there are any leftover axes, turn them off (when num_trees_to_plot is not a multiple of 4)
for ax in axes[num_trees_to_plot:]:
    ax.axis("off")

# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()
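The diversity visible across these trees is exactly what the aggregation step exploits. Note that scikit-learn's RandomForestClassifier averages the per-tree class probabilities (“soft” voting) rather than taking a strict majority vote, although the resulting class is usually the same; the sketch below checks the equivalence on our test set, reusing clf and X_test from the training cell.

import numpy as np

# Stack each tree's class-probability estimates: shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack([est.predict_proba(X_test) for est in clf.estimators_])

# Average over trees, then pick the most probable class per sample
avg_proba = per_tree_proba.mean(axis=0)
manual_pred = clf.classes_[avg_proba.argmax(axis=1)]

# This manual aggregation reproduces the forest's own predictions
print((manual_pred == clf.predict(X_test)).all())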
Explainability
We’ve established that Random Forest models offer substantial explainability, unlike many other complex model frameworks that are often considered “black boxes.” To elucidate this aspect, one effective method is visualizing the decision paths used by the trees within the forest when making predictions. This can be accomplished using the dtreeviz library, which provides a detailed and interactive visualization of the decision-making process within a tree.
Using dtreeviz, we can trace the decision path of a single example from the training set across any of the trees in the model. This visualization includes splits made at each node, the criteria for these splits, and the distribution of target classes at each step. Such detailed traceability helps in understanding exactly how the model is arriving at its conclusions, highlighting the individual contributions of features in the decision process.
from dtreeviz import model

# Suppress warnings - this is just to shut up warnings about fonts in GitHub Actions
import logging

logging.getLogger("matplotlib.font_manager").setLevel(level=logging.CRITICAL)

# The training sample to visualize
x = X_train[5]

# Define colors for benign and malignant
color_map = {
    "classes": [
        None,  # 0 classes
        None,  # 1 class
        ["#FFAAAA", "#AAFFAA"],  # 2 classes
    ]
}

# Visualizing the selected tree
viz = model(
    estimator,
    X_train,
    y_train,
    target_name="Target",
    feature_names=breast_cancer.feature_names,
    class_names=list(breast_cancer.target_names),
)
viz.view(x=x, colors=color_map)
Another great feature of Random Forests is that they can explain the relative importance of each feature when predicting results. For our Breast Cancer dataset, here is how each feature impacts the model.
import numpy as np

features = breast_cancer.feature_names
importances = clf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="b", align="center")
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Now that we know which features are most important, we can use dtreeviz to visualise the classification boundaries for any pair of features. This can help us understand how the model is making decisions. Let us visualise classification boundaries for worst concave points and worst area features.
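A sketch of that plot, reusing the decision_boundaries helper from dtreeviz that also appears in the grid example below (the import path is our assumption for recent dtreeviz releases; X, y, features, and color_map come from the earlier cells):

from dtreeviz import decision_boundaries  # assumed import location of the helper

# Indices of the two features we want to inspect
pair = [list(features).index("worst concave points"), list(features).index("worst area")]
X_pair = X[:, pair]

# Train a forest on just these two features so the 2D boundary is meaningful
clf_pair = RandomForestClassifier(random_state=42)
clf_pair.fit(X_pair, y)

fig, ax = plt.subplots(figsize=(8, 6))
decision_boundaries(
    clf_pair,
    X_pair,
    y,
    ax=ax,
    feature_names=features[pair],
    class_names=breast_cancer.target_names,
    markers=["X", "s"],
    colors=color_map,
)
plt.show()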
We can also plot pairs of features and their decision boundaries in a grid, to understand how pairs of features interact in the model. This can help us understand the relationships between features and how they contribute to the model’s predictions. Let us do so for random pairs, just for illustration purposes. In practice, you would choose pairs of features that are most important for your specific problem.
# Set a random seed for reproducibility
np.random.seed(42)

# Create a 4x4 subplot grid
fig, axes = plt.subplots(4, 4, figsize=(20, 20))
axes = axes.flatten()  # Flatten the 2D array of axes for easy iteration

# Randomly select and plot decision boundaries for 16 pairs of features
for ax in axes:
    # Randomly pick two distinct features
    features_idx = np.random.choice(range(X.shape[1]), size=2, replace=False)
    X_features_for_boundaries = X[:, features_idx]

    # Train a new classifier on just this pair of features
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_features_for_boundaries, y)

    # Plot decision boundaries using dtreeviz
    decision_boundaries(
        clf,
        X_features_for_boundaries,
        y,
        ax=ax,
        feature_names=features[features_idx],
        class_names=breast_cancer.target_names,
        markers=["X", "s"],
        colors=color_map,
    )

    # Set titles for the subplots
    ax.set_title(f"{features[features_idx[0]]} vs {features[features_idx[1]]}")

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
Random Forests vs Neural Networks
Comparing Random Forests to neural networks involves considering several factors such as accuracy, training time, interpretability, and scalability across different types of data and tasks. Both algorithms have their unique strengths and weaknesses, making them suitable for specific scenarios.
Performance metrics
Random Forests typically offer strong predictive accuracy with less complexity than deep learning models, particularly on structured datasets. By constructing multiple decision trees and averaging their outputs, Random Forests can capture a variety of signals without overfitting too much, making them competitive for many standard data science tasks. In contrast, neural networks, especially deep learning architectures, are known for their prowess on unstructured data like images, text, or audio, due to their ability to learn intricate feature hierarchies.
When it comes to training, Random Forests are usually quicker on small to medium-sized datasets, thanks to parallel tree building and the lack of iterative tuning. Neural networks, on the other hand, often require intensive computation over multiple epochs, relying heavily on GPUs or TPUs to handle large volumes of data. This extra training overhead can pay off if the dataset is big and complex, but it does mean more time and resources are needed.
Interpretability is another key distinction. Because each tree’s splits can be traced, Random Forests offer a more transparent look into how decisions are reached, and feature importance scores can be extracted. Neural networks, however, are often seen as “black boxes”, with hidden layers that make it harder to pinpoint exactly how they arrive at their predictions. This can be challenging in fields that require clear explanations for regulatory or trust reasons.
In terms of robustness, Random Forests mitigate variance by aggregating a large number of individual trees, reducing the chance of overfitting. Neural networks, if not carefully regularized with techniques like dropout or early stopping, can easily overfit. Yet, with proper tuning and enough data, they remain extremely powerful.
Finally, there’s the matter of scalability. Random Forests scale well in parallel settings for both training and inference, making them handy in distributed environments. Neural networks can also scale effectively to handle massive datasets, especially with specialized hardware, but require a more complex setup. That said, their ability to adapt to various input sizes and modalities remains unmatched for certain tasks.
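Scikit-learn exposes this tree-level parallelism through the n_jobs parameter; a minimal sketch, reusing X_train and y_train from earlier (the 500-tree count is just an illustrative choice):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 builds (and later queries) the trees on all available CPU cores;
# because each tree is independent, training and prediction parallelize well
clf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf_parallel.fit(X_train, y_train)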
Suitability based on data type
Random Forests are particularly well-suited for:
Classification and regression on structured data
Large datasets, but with a limitation on the input feature space (high-dimensional spaces might lead to slower performance)
Applications requiring a balance between accuracy and interpretability
On the other hand, Neural Networks are more appropriate for:
High-complexity tasks involving image, text, or audio
Unstructured data which requires feature learning
Situations where model interpretability is less critical than performance
Example comparisons
In image recognition, neural networks (specifically convolutional neural networks) perform significantly better than random forests due to their ability to hierarchically learn features directly from data.
In tabular data prediction, random forests typically outperform neural networks, especially when the dataset isn’t huge, as they can better leverage the structure within the data without the need for extensive parameter tuning; a quick sketch of such a comparison follows this list.
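The sketch below compares the forest with scikit-learn's MLPClassifier used as a small stand-in for a neural network, reusing the earlier train/test split; exact scores will vary and this is only an illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The forest works on the raw, unscaled features
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Unlike the forest, the network benefits from feature scaling
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
).fit(X_train, y_train)

print(f"Random Forest accuracy: {rf.score(X_test, y_test):.3f}")
print(f"MLP accuracy:           {mlp.score(X_test, y_test):.3f}")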
Final remarks
In summary, Random Forests are excellent for many traditional machine learning tasks and provide a good mix of accuracy, ease of use, and speed, especially on structured data. Neural networks are preferable for tasks involving complex patterns and large scales of unstructured data, although they require more resources and effort to tune and interpret.
Choosing between the two often depends on the specific requirements of the task, the nature of the data involved, and the computational resources available. In practice, it’s also common to evaluate both types of models along with others to find the best tool for a particular job.