import pandas as pd
mnist_train = pd.read_csv("mnist_small_train.csv")
X_train = mnist_train.drop("label", axis=1)
y_train = mnist_train["label"]
mnist_test = pd.read_csv("mnist_test.csv")
X_test = mnist_test.drop("label", axis=1)
y_test = mnist_test["label"]
Lab 10: Principal Component Analysis
These questions were sourced from the excellent textbook Géron (2022), which is available through the UNSW Library’s access to O’Reilly Media texts. The author provided a Google Colab notebook containing the coding solutions.
Questions
Conceptual Questions
(HandsOnML3, Q8.1) ★ What are the main motivations for reducing a dataset’s dimensionality? What are the main drawbacks?
(HandsOnML3, Q8.3) ★ Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
(HandsOnML3, Q8.4) ★ Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
(HandsOnML3, Q8.5) ★ Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
(HandsOnML3, Q8.7) ★ How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Applied Question
- ★ Load the MNIST dataset (uploaded as two CSV files to Moodle), keeping the given smaller training set (only 10,000 of the 60,000 training instances) and test set (another 10,000 instances). Train a random forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new random forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next, evaluate the classifier on the test set. How does it compare to the previous classifier? Try again with an SGDClassifier (if using Python) or a kNN classifier (if using R). How much does PCA help now?
Solutions
Conceptual Questions
The main motivations for dimensionality reduction are:
- To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
- To visualize the data and gain insights on the most important features
- To save space (compression)
The main drawbacks are:
- Some information is lost, possibly degrading the performance of subsequent training algorithms.
- It can be computationally intensive.
- It adds some complexity to your Machine Learning pipelines.
- Transformed features are often hard to interpret.
Once a dataset’s dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE) do not.
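A minimal sketch of the PCA round trip described above, using scikit-learn's `inverse_transform`. The synthetic 3-D data and all variable names are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data (illustrative assumption): most variance lies in a plane.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# Project down to 2 dimensions, then map back to the original 3-D space.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)

# The reconstruction has the original shape but is not identical to X:
# the variance along the dropped component is lost.
reconstruction_mse = np.mean((X - X_recovered) ** 2)
print(X_reduced.shape, X_recovered.shape)
print(f"Reconstruction MSE: {reconstruction_mse:.4f}")

# Keeping all 3 components discards nothing, so the round trip is
# exact up to floating-point error.
pca_full = PCA(n_components=3)
X_full_round_trip = pca_full.inverse_transform(pca_full.fit_transform(X))
print(np.allclose(X, X_full_round_trip))
```

The reconstruction is close to the original only because the dropped direction carried little variance; t-SNE in scikit-learn offers no comparable `inverse_transform` method.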
PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in the Swiss roll dataset—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.
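The Swiss roll point can be checked numerically. The sketch below generates a Swiss roll with scikit-learn's `make_swiss_roll` and looks at how the variance is spread across the three principal components; the sample size and noise level are arbitrary choices for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

# Generate a 3-D Swiss roll; t is the position along the roll (unused here).
X_swiss, t = make_swiss_roll(n_samples=2000, noise=0.1, random_state=42)

# Fit PCA with all 3 components to see how the variance is distributed.
pca = PCA(n_components=3)
pca.fit(X_swiss)
ratios = pca.explained_variance_ratio_
print(ratios)

# Share of variance that would be discarded by projecting onto 2 dimensions.
lost_if_2d = ratios[2]
print(f"Variance lost when keeping 2 components: {lost_if_2d:.1%}")
```

No component carries a negligible share of the variance, so there is no "useless" direction for PCA to drop: projecting onto two dimensions squashes the roll rather than unrolling it.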
That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset’s intrinsic dimensionality.
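The two extremes can be sketched on synthetic data. To keep it quick, this example uses 100 dimensions rather than 1,000; the data-generating choices (a single random direction plus tiny noise, and independent Gaussian values) are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_dims = 2000, 100

# Case 1: points almost perfectly aligned along one random direction.
direction = rng.normal(size=n_dims)
direction /= np.linalg.norm(direction)
X_aligned = rng.normal(scale=10.0, size=(n_samples, 1)) * direction
X_aligned += rng.normal(scale=0.01, size=(n_samples, n_dims))

# Case 2: independent random values in every dimension.
X_random = rng.normal(size=(n_samples, n_dims))

# Ask PCA for the smallest number of components preserving 95% of the variance.
n_aligned = PCA(n_components=0.95).fit(X_aligned).n_components_
n_random = PCA(n_components=0.95).fit(X_random).n_components_
print(n_aligned, n_random)
```

The aligned dataset needs a single component, while the random one needs nearly all of them. In a finite sample the random case typically lands a little below 95% of the dimensions, because sampling noise makes some directions look slightly more variable than others.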
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.
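Both evaluation approaches can be sketched together. The example below uses scikit-learn's bundled 8×8 digits dataset as a small stand-in for MNIST (an assumption for illustration), and the helper `fit_and_score` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit PCA on the training set only, then apply it to both sets.
pca = PCA(n_components=0.95).fit(X_train)
X_train_reduced = pca.transform(X_train)
X_test_reduced = pca.transform(X_test)

# Measure 1: reconstruction error on held-out data.
X_test_recovered = pca.inverse_transform(X_test_reduced)
reconstruction_mse = np.mean((X_test - X_test_recovered) ** 2)

# Measure 2: accuracy of a downstream classifier with and without PCA.
def fit_and_score(X_tr, X_te):
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_tr, y_train)
    return accuracy_score(y_test, clf.predict(X_te))

acc_full = fit_and_score(X_train, X_test)
acc_reduced = fit_and_score(X_train_reduced, X_test_reduced)

print(f"Dimensions: {X_train.shape[1]} -> {X_train_reduced.shape[1]}")
print(f"Reconstruction MSE: {reconstruction_mse:.3f}")
print(f"Accuracy without PCA: {acc_full:.3f}, with PCA: {acc_reduced:.3f}")
```

The reconstruction error is only available because PCA has an inverse transform; the downstream comparison works for any reduction method.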
Applied Question
The solutions for Python and R are below. The Python solution (shown first) is taken from the textbook; the R solution (shown second) was adapted from it.
Python
Exercise: Load the MNIST dataset
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.
from sklearn.ensemble import RandomForestClassifier
from timeit import default_timer as timer
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
start = timer()
rnd_clf.fit(X_train, y_train)
print(timer() - start)
3.0955865829999993
from sklearn.metrics import accuracy_score
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.9505
Exercise: Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
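To see how many dimensions the 95% setting actually keeps, one can inspect the fitted PCA object. The sketch below uses scikit-learn's bundled 8×8 digits dataset as a stand-in (an assumption, since it does not read the MNIST CSV files) and cross-checks the count against the cumulative explained variance of a full PCA.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)

# Let PCA choose the number of components that preserves 95% of the variance.
pca_95 = PCA(n_components=0.95).fit(X_digits)

# Cross-check: fit a full PCA and find the first point at which the
# cumulative explained variance ratio reaches 95%.
full_pca = PCA().fit(X_digits)
cumsum = np.cumsum(full_pca.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1

print(pca_95.n_components_, d)
print(pca_95.explained_variance_ratio_.sum())
```

On the lab data, the same check would be `pca.n_components_` after the `fit_transform` call above.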
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?
rnd_clf_with_pca = RandomForestClassifier(n_estimators=100, random_state=42)
start = timer()
rnd_clf_with_pca.fit(X_train_reduced, y_train)
print(timer() - start)
8.190220457999999
Oh no! Training is actually more than twice as slow now! How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training: it depends on the dataset, the model and the training algorithm. See Figure 8-6 in the textbook. If you try SGDClassifier instead of RandomForestClassifier, you will find that training time is reduced by a factor of 5 when using PCA. Actually, we will do this in a second, but first let’s check the accuracy of the new random forest classifier.
Exercise: Next evaluate the classifier on the test set: how does it compare to the previous classifier?
X_test_reduced = pca.transform(X_test)
y_pred = rnd_clf_with_pca.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
0.9129
It is common for performance to drop slightly when reducing dimensionality, because we do lose some potentially useful signal in the process. However, the performance drop is rather severe in this case. So PCA really did not help: it slowed down training and reduced performance. 😭
Exercise: Try again with an SGDClassifier. How much does PCA help now?
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
start = timer()
sgd_clf.fit(X_train, y_train)
print(timer() - start)
3.955731958000001
y_pred = sgd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.8919
Okay, so the SGDClassifier takes somewhat longer to train on this dataset than the RandomForestClassifier (about 4.0 s versus 3.1 s), and it performs worse on the test set. But that’s not what we are interested in right now: we want to see how much PCA can help SGDClassifier. Let’s train it using the reduced dataset:
sgd_clf_with_pca = SGDClassifier(random_state=42)
start = timer()
sgd_clf_with_pca.fit(X_train_reduced, y_train)
print(timer() - start)
0.7959582080000018
Nice! Reducing dimensionality led to a roughly 5× speedup. :) Let’s check the model’s accuracy:
y_pred = sgd_clf_with_pca.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
0.8965
Great! PCA not only gave us a roughly 5× speed boost, but it also improved accuracy slightly (from 0.8919 to 0.8965).
So there you have it: PCA can give you a formidable speedup, and if you’re lucky a performance boost… but it’s really not guaranteed: it depends on the model and the dataset!
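When PCA is used as a preprocessing step like this, one common way to keep the transform and the model together is a scikit-learn `Pipeline`, so that PCA is fitted on the training data only and reapplied automatically at prediction time. This is a sketch rather than part of the textbook solution; it uses the bundled 8×8 digits dataset as a stand-in for the MNIST CSV files, and the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Chain PCA (95% explained variance) with the classifier.
pca_sgd = make_pipeline(PCA(n_components=0.95),
                        SGDClassifier(random_state=42))
pca_sgd.fit(X_tr, y_tr)

# score() applies the fitted PCA to the test set before predicting.
pipeline_accuracy = pca_sgd.score(X_te, y_te)
n_kept = pca_sgd.named_steps["pca"].n_components_
print(f"Kept {n_kept} of {X.shape[1]} dimensions")
print(f"Test accuracy: {pipeline_accuracy:.3f}")
```

Swapping `SGDClassifier` for `RandomForestClassifier` in the pipeline gives the other half of the comparison without any separate `transform` calls.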
R
library(randomForest)
library(rbenchmark)
library(caret)
library(kknn)
Exercise: Load the MNIST dataset
mnist_train <- read.csv("mnist_small_train.csv")
X_train <- mnist_train[, -1]
y_train <- as.factor(mnist_train[, "label"])
mnist_test <- read.csv("mnist_test.csv")
X_test <- mnist_test[, -1]
y_test <- as.factor(mnist_test[, "label"])
# Some of the columns have zero variance, so we remove them
nonzero_var_cols <- which(apply(X_train, 2, var) != 0)
X_train <- X_train[, nonzero_var_cols]
X_test <- X_test[, nonzero_var_cols]
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.
benchmark(
  rnd_clf <- randomForest(X_train, y_train,
                          ntree=20, seed=42, maxnodes=100),
  replications=1)
y_pred <- predict(rnd_clf, X_test)
accuracy_score <- sum(y_pred == y_test) / length(y_test)
accuracy_score
[1] 0.9108
Exercise: Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
pca <- preProcess(X_train, method=c("pca"), thresh=0.95)
X_train_reduced <- predict(pca, X_train)
X_test_reduced <- predict(pca, X_test)
ncol(X_train_reduced)
[1] 284
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?
benchmark(
  rnd_clf_with_pca <- randomForest(X_train_reduced, y_train,
                                   ntree=20, seed=42, maxnodes=100),
  replications=1
)
Oh no! Training time is actually not hugely different now. How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training: it depends on the dataset, the model and the training algorithm. See Figure 8-6 in the textbook. If you try kNN instead of a random forest, you will find that training time is reduced by a factor of 2 when using PCA. Actually, we will do this in a second, but first let’s check the accuracy of the new random forest classifier.
Exercise: Next evaluate the classifier on the test set: how does it compare to the previous classifier?
y_pred <- predict(rnd_clf_with_pca, X_test_reduced)
accuracy_score <- sum(y_pred == y_test) / length(y_test)
accuracy_score
[1] 0.7865
It is common for performance to drop slightly when reducing dimensionality, because we do lose some potentially useful signal in the process. However, the performance drop is rather severe in this case (from 0.9108 to 0.7865). So PCA really did not help: it did not speed up training, and it reduced performance.
Exercise: Try again with a kNN classifier. How much does PCA help now?
# The train.kknn function takes a formula as input
df_train <- data.frame(y = y_train, X_train)[1:5000,]
# Fit the model
benchmark(
  knn_model <- train.kknn(y ~ ., data = df_train, ks=1:4),
  replications=1
)
# You can then predict with this model using the test data like so:
df_test <- data.frame(X_test)
predictions <- predict(knn_model, newdata = df_test)
accuracy_score <- sum(predictions == y_test) / length(y_test)
accuracy_score
[1] 0.8952
Okay, so the kNN classifier takes much longer to train on this dataset than the random forest, and it performs worse on the test set. But that’s not what we are interested in right now: we want to see how much PCA can help the kNN classifier.
Exercise: Let’s train it using the reduced dataset:
df_train <- data.frame(y = y_train, X_train_reduced)[1:5000,]
benchmark(
  knn_model <- train.kknn(y ~ ., data = df_train, ks=1:4),
  replications=1
)
df_test <- data.frame(X_test_reduced)
predictions <- predict(knn_model, newdata = df_test)
accuracy_score <- sum(predictions == y_test) / length(y_test)
accuracy_score
[1] 0.7935
PCA gave us a speed boost, but accuracy dropped noticeably, from 0.8952 to 0.7935.
So there you have it: PCA can give you a formidable speedup, and if you’re lucky a performance boost… but it’s really not guaranteed: it depends on the model and the dataset!