Evaluating Dimensionality Reduction - PCA vs t-SNE
reducing dimensions for better insights
Experiments
Machine Learning
Published
February 11, 2024
Evaluating dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) requires an approach tailored to the aims and context of the analysis. Both methods map high-dimensional data into a lower-dimensional space while trying to preserve certain properties of the original data: PCA preserves the directions of greatest variance, while t-SNE preserves local neighborhoods. The right evaluation criteria depend on what the reduction is for, whether visualization, clustering, or improving a classifier. Below, we explore several strategies for assessing PCA and t-SNE, accompanied by Python code examples. Keep in mind that how well each technique works depends heavily on the characteristics of the dataset and the objectives of the reduction.
For this experiment, we will use the scikit-learn Digits dataset, which comprises 1,797 grayscale images of handwritten digits, each 8x8 pixels (64 features per sample).
# Show an image of digits 0 to 9 from the digits dataset
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the digits dataset
digits = datasets.load_digits()

# Create a figure with subplots in a 2x5 grid
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(8, 6))

# Flatten the array of axes
axes = axes.flatten()

for i in range(10):
    # Find the first occurrence of each digit
    index = digits.target.tolist().index(i)
    # Plot on the ith subplot
    axes[i].imshow(digits.images[index], cmap=plt.cm.gray_r, interpolation="nearest")
    axes[i].set_title(f"Digit: {i}")
    axes[i].axis("off")  # Hide the axes

plt.tight_layout()
plt.show()
Visual inspection
One of the simplest ways to evaluate PCA and t-SNE is by visually inspecting the reduced dimensions to see how well they separate different classes or clusters.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

X = digits.data
y = digits.target
print("Before Dimensionality Reduction")
print(X.shape)

# Apply PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)
print("After PCA")
print(X_pca.shape)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
print("After t-SNE")
print(X_tsne.shape)


# Plotting function
def plot_reduction(X, y, title):
    plt.figure(figsize=(8, 6))
    # Define a colormap
    colors = plt.cm.Spectral(np.linspace(0, 1, 10))
    # Plot each digit with a unique color from the colormap
    for i, color in zip(range(10), colors):
        plt.scatter(X[y == i, 0], X[y == i, 1], color=color, label=f"Digit {i}")
    plt.title(title)
    # Generic axis labels, since t-SNE axes are not principal components
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.legend(loc="best", shadow=False, scatterpoints=1)
    plt.axis("equal")  # Equal aspect ratio so both axes share the same scale
    plt.show()


# Plot results
plot_reduction(X_pca, y, "PCA Result")
plot_reduction(X_tsne, y, "t-SNE Result")
Before Dimensionality Reduction
(1797, 64)
After PCA
(1797, 2)
After t-SNE
(1797, 2)
From the plots, t-SNE separates the clusters for each digit far more clearly than PCA. t-SNE’s strength lies in preserving the local relationships between data points, which yields well-defined clusters that are easy to tell apart. This contrasts starkly with PCA, which preserves global variance but, with only two components, overlaps different digits more often. As a result, the PCA clusters are less neatly separated and the groups of digits are harder to distinguish visually. This observation underscores t-SNE’s advantage when preserving local structure is crucial for identifying patterns or clusters in the data.
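The visual impression above can be backed with numbers. The sketch below is one possible way to do that, not part of the original experiment: it reports the variance retained by the two principal components (a global view) and scikit-learn's `trustworthiness` score for both embeddings (a local view). The subsample of 600 images and `n_neighbors=10` are assumptions chosen only to keep the example quick.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Assumption: use the first 600 images only, to keep this sketch fast
X_small = datasets.load_digits().data[:600]

pca_small = PCA(n_components=2, random_state=42)
X_small_pca = pca_small.fit_transform(X_small)
X_small_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_small)

# Global view: fraction of total variance kept by the two principal components
variance_kept = pca_small.explained_variance_ratio_.sum()
print(f"Variance retained by 2 PCA components: {variance_kept:.3f}")

# Local view: trustworthiness lies in [0, 1]; higher means that neighbours in
# the 2D embedding were also neighbours in the original 64-dimensional space
trust_pca = trustworthiness(X_small, X_small_pca, n_neighbors=10)
trust_tsne = trustworthiness(X_small, X_small_tsne, n_neighbors=10)
print(f"PCA trustworthiness:   {trust_pca:.3f}")
print(f"t-SNE trustworthiness: {trust_tsne:.3f}")
```

Explained variance has no counterpart for t-SNE, so the trustworthiness score is the only one of the two that compares both methods on the same footing.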
Quantitative measures for clustering quality
For datasets with labeled classes such as this one, metrics like Silhouette Score can help quantify how well the reduced dimensions separate different classes.
About Silhouette Score
The Silhouette Score is a metric for evaluating the quality of a clustering. For each sample, it compares how close the sample is to the other members of its own cluster (cohesion) with how close it is to the nearest other cluster (separation). The score is calculated for each sample in the dataset, and the average is used to evaluate the overall quality of the clustering. In this experiment, the true digit labels play the role of cluster assignments.
The value of the score ranges from -1 to 1:
A score close to +1 indicates that the sample is far away from the neighboring clusters.
A score of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.
A score close to -1 indicates that the sample is placed in the wrong cluster.
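To make the definition concrete, here is a minimal sketch that computes the per-sample score by hand as s = (b - a) / max(a, b), where a is the mean distance to the other samples in the same cluster and b is the smallest mean distance to the samples of any other cluster. The helper `silhouette_by_hand` and the toy data are hypothetical and exist only for illustration; the result is compared against scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_by_hand(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # a cluster of one sample scores 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        a = dists[i, same].mean()
        # b: smallest mean distance to the members of any other cluster
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data: two tight pairs plus one point that sits between them
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [2.0, 0.5]])
labels_toy = np.array([0, 0, 1, 1, 0])

manual = silhouette_by_hand(X_toy, labels_toy)
reference = silhouette_samples(X_toy, labels_toy)
print(np.round(manual, 3))
print("Matches scikit-learn:", np.allclose(manual, reference))
```

The in-between point receives a lower score than the points in the tight pairs, which is the behavior the value ranges above describe.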
from sklearn.metrics import silhouette_score

# Silhouette Score for PCA
silhouette_pca = silhouette_score(X_pca, y)
print(f"PCA Silhouette Score: {silhouette_pca}")

# Silhouette Score for t-SNE
silhouette_tsne = silhouette_score(X_tsne, y)
print(f"t-SNE Silhouette Score: {silhouette_tsne}")
Another way to evaluate the effectiveness of PCA and t-SNE is to use the reduced dimensions as input for a classifier and compare the classification accuracy. This can help determine if the reduced dimensions capture the essential information needed for classification tasks. In this case, we use a simple Random Forest classifier to compare the classification accuracy of PCA and t-SNE reduced dimensions.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the dataset for PCA and t-SNE results
X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)
X_train_tsne, X_test_tsne, _, _ = train_test_split(
    X_tsne, y, test_size=0.3, random_state=42
)

# Train and evaluate a classifier on PCA results
clf_pca = RandomForestClassifier(random_state=42)
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"PCA Classification Accuracy: {accuracy_pca}")

# Train and evaluate a classifier on t-SNE results
clf_tsne = RandomForestClassifier(random_state=42)
clf_tsne.fit(X_train_tsne, y_train)
y_pred_tsne = clf_tsne.predict(X_test_tsne)
accuracy_tsne = accuracy_score(y_test, y_pred_tsne)
print(f"t-SNE Classification Accuracy: {accuracy_tsne}")
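One caveat about the comparison above: both embeddings were fitted on the full dataset before the train/test split, so the test samples influenced the reduced features. The sketch below shows one possible stricter variant for PCA, in which the raw pixels are split first and a scikit-learn `Pipeline` fits PCA on the training portion only. It is an illustration rather than the method used in this experiment, and the variable names are assumptions. scikit-learn's `TSNE` offers no `transform` method for unseen samples, so it cannot be dropped into the same pipeline.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_raw, y_raw = datasets.load_digits(return_X_y=True)

# Split the raw pixels first, so the test set stays unseen during fitting
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

# PCA is fitted on the training split only, then applied to the test split
pipeline = make_pipeline(
    PCA(n_components=2, random_state=42),
    RandomForestClassifier(random_state=42),
)
pipeline.fit(X_train_raw, y_train_raw)

accuracy_holdout = pipeline.score(X_test_raw, y_test_raw)
print(f"PCA + Random Forest hold-out accuracy: {accuracy_holdout:.3f}")
```

Raising `n_components` in the pipeline is a simple way to see how much accuracy the two-component restriction costs.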
Finally, comparing the time it takes to perform the reduction can be important, especially for large datasets. t-SNE is known to be computationally expensive compared to PCA, so understanding the time complexity of each method can help in choosing the right technique for the task at hand.
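A rough way to measure this is to time each `fit_transform` call with `time.perf_counter`. The helper `time_fit_transform` below is a hypothetical name introduced for this sketch. Timings vary with hardware and library versions, so no expected values are given.

```python
import time

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_digits = datasets.load_digits().data


def time_fit_transform(reducer, X):
    """Return the embedding and the wall-clock seconds spent in fit_transform."""
    start = time.perf_counter()
    embedding = reducer.fit_transform(X)
    elapsed = time.perf_counter() - start
    return embedding, elapsed


emb_pca, seconds_pca = time_fit_transform(
    PCA(n_components=2, random_state=42), X_digits
)
emb_tsne, seconds_tsne = time_fit_transform(
    TSNE(n_components=2, random_state=42), X_digits
)

print(f"PCA:   {seconds_pca:.3f} s")
print(f"t-SNE: {seconds_tsne:.3f} s")
```

A single run gives only a rough figure. Repeating the measurement several times and reporting the median gives a steadier estimate.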
In conclusion, the evaluation of dimensionality reduction techniques like PCA and t-SNE is a multifaceted process that requires a combination of visual inspection, quantitative metrics, and performance evaluation using classification algorithms. The choice of evaluation criteria should be tailored to the specific objectives of the analysis, whether it be for visualization, clustering, or classification tasks. While PCA is useful for preserving global variance and reducing dimensionality, t-SNE excels at maintaining local relationships and forming distinct clusters. Understanding the strengths and limitations of each technique is crucial for selecting the most appropriate method for a given dataset and analysis goal.