Linear Algebra for ML Part 2 | Principal Component Analysis
Updated: Jan 24, 2023
In the previous article, we talked about applying linear algebra for data representation in machine learning algorithms, but the application of linear algebra in ML is much broader than that. This article introduces more linear algebra concepts, with the main focus on how they are applied to dimensionality reduction, specifically Principal Component Analysis (PCA). We will also look at how PCA is implemented in practice using scikit-learn.
When to Use PCA?
High-dimensional data is a common issue in machine learning practice, as we typically feed a large number of features into model training. The result is models with lower interpretability and higher complexity, a problem often linked to the curse of dimensionality. PCA is widely applied for dimensionality reduction and can be beneficial when the dataset is high-dimensional (i.e. contains many features).
Additionally, PCA is used to discover hidden relationships among feature variables and to reveal underlying patterns that can be very insightful. PCA attempts to find linear components that capture as much variance in the data as possible. The first principal component (PC1) is the direction of greatest variance, and it is typically dominated by the features that vary together the most. Because PCA never looks at the target variable, high variance does not by itself guarantee importance for model predictions.
How Does PCA Work?
The objective of PCA is to find the principal components that represent the data variance in a lower dimension. We can unfold the process into the following steps:
represent the data variance using the covariance matrix
use eigenvectors and eigenvalues to capture the data variance in a lower dimension
take the eigenvectors of the covariance matrix as the principal components
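The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under assumptions, not how scikit-learn implements PCA internally: the function name pca_sketch and the synthetic data are made up for this example.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X (rows = samples, columns = features) onto its top principal components."""
    # centre each feature so the covariance is measured around the mean
    X_centered = X - X.mean(axis=0)
    # step 1: covariance matrix of the features (rowvar=False: columns are variables)
    cov = np.cov(X_centered, rowvar=False)
    # step 2: eigen-decomposition; eigh is intended for symmetric matrices such as cov
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order, so reverse to put PC1 first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # step 3: keep the leading eigenvectors and project the data onto them
    components = eigenvectors[:, :n_components]
    return X_centered @ components, eigenvalues

# synthetic data (an assumption for this sketch): x1 is roughly 2 * x0 plus noise
rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=100)
x1 = 2 * x0 + rng.uniform(-1, 1, size=100)
X = np.column_stack((x0, x1))

projected, eigenvalues = pca_sketch(X, n_components=1)
print(projected.shape)   # (100, 1)
print(eigenvalues)       # variance along PC1, then along PC2
```

The variance of the projected column equals the largest eigenvalue, which is why the eigenvalues can be read as "variance captured" by each component.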
To understand how PCA works, we need to answer two questions: what is a covariance matrix, and what are eigenvectors and eigenvalues? It also helps to shift our perspective from viewing matrix multiplication as a math operation to viewing it as a visual transformation.
Matrix Transformation
We have previously introduced how the matrix dot product is computed from a math operation perspective. We can also interpret the dot product as a visual transformation, which helps in understanding more complex linear algebra concepts. As illustrated below, let us use a 2x2 matrix as an example. We split the matrix vertically into two column vectors, where the left one represents the basis vector of the x-axis and the right one represents the basis vector of the y-axis. Therefore, a matrix represents a 2D space constructed by the x-axis and y-axis.
It is not hard to understand that an identity matrix has [1,0] as the basis vector on the x-axis and [0,1] as the basis vector on the y-axis, so that the dot product between any vector and the identity matrix returns the vector itself.
Matrix transformation boils down to shifting the scale and direction of the x-axis and y-axis. For example, changing the basis vector of x-axis from [1,0] to [2,0] means that the mapping space has been scaled two times in the x coordinate direction.
We can additionally combine both the x-axis and y-axis for more complicated scaling, rotating or shearing transformations. A typical example is the mirror matrix, where we swap the x and y axes. For a given vector [1,2], we get [2,1] after the mirror transformation.
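Written out by hand, both examples are weighted sums of the matrix columns, with the vector's coordinates as the weights. The first line is the mirror matrix applied to [1,2], the second is the x-scaling matrix [[2,0],[0,1]] applied to the same vector:

```latex
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
= 1 \begin{bmatrix} 0 \\ 1 \end{bmatrix}
+ 2 \begin{bmatrix} 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \end{bmatrix}

\begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
= 1 \begin{bmatrix} 2 \\ 0 \end{bmatrix}
+ 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix}
= \begin{bmatrix} 2 \\ 2 \end{bmatrix}
```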
If you would like to practice these transformations in Python and skip the manual calculations, you can use the following code to perform these dot products and visualize the results with the plt.quiver() function.
import numpy as np
import matplotlib.pyplot as plt
# define matrices and vector
x_scaled_matrix = np.array([[2,0],[0,1]])
mirror_matrix = np.array([[0,1],[1,0]])
v = np.array([1,2])
# matrix transformation
mirrored_v = mirror_matrix.dot(v)
x_scaled_v = x_scaled_matrix.dot(v)
# plot transformed vectors
origin = np.array([[0, 0], [0, 0]])
plt.quiver(*origin, v[0], v[1], color=['black'],scale=10, label='original vector')
plt.quiver(*origin, mirrored_v[0], mirrored_v[1] , color=['#D3E7EE'], scale=10, label='mirrored vector' )
plt.quiver(*origin, x_scaled_v[0], x_scaled_v[1] , color=['#C6A477'], scale=10, label='x_scaled vector')
plt.legend(loc ="lower right")
Covariance Matrix
In Short: the covariance matrix represents the pairwise covariances among a group of variables in matrix form.
The covariance matrix is another critical concept in PCA; it represents the variance in the dataset. To understand it, we first need to know that covariance measures how one random variable varies with another. For two random variables x and y, the covariance is the average product of their deviations from their respective means. A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value near zero indicates little linear relationship. Because covariance depends on the units of the variables, its magnitude is only comparable across pairs once the variables are on the same scale.
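For n paired observations, the covariance of x and y can be written as follows. This is the sample form, which divides by n - 1 and is what np.cov computes by default. Choosing this form over the population version that divides by n is an assumption here.

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here \bar{x} and \bar{y} are the means of x and y, and cov(x, x) is simply the variance of x.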
Given a set of variables (e.g. x1, x2, ... xn) in a dataset, the covariance matrix collects the covariance of every pair of variables in matrix form, with the variance of each variable on the diagonal.
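As a minimal sketch of how such a matrix is assembled, the snippet below builds it by hand from centred data and checks the result against np.cov. The toy numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# hypothetical toy data: 3 variables (rows), 5 observations (columns)
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
])

n = X.shape[1]
# subtract each variable's mean, then average the pairwise products (n - 1 in the denominator)
centered = X - X.mean(axis=1, keepdims=True)
manual_cov = centered @ centered.T / (n - 1)

# np.cov treats rows as variables by default, matching the layout above
assert np.allclose(manual_cov, np.cov(X))
print(np.round(manual_cov, 2))
```

The matrix is symmetric, because cov(x, y) equals cov(y, x), and its diagonal holds the variance of each variable.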
Multiplying a vector by the covariance matrix pulls it towards the direction that captures the main trend of variance in the original dataset.
Let us use a simple example to simulate the effect of this transformation. First, we randomly generate the variables x0 and x1 and then compute the covariance matrix.
# generate random variables x0 and x1
import random
x0 = [round(random.uniform(-1, 1),2) for i in range(0,100)]
x1 = [round(2 * i + random.uniform(-1, 1) ,2) for i in x0]
# compute covariance matrix
X = np.stack((x0, x1), axis=0)
covariance_matrix = np.cov(X)
print('covariance matrix\n', covariance_matrix)
We then transform some random vectors by taking the dot product between each of them and the covariance matrix.
# plot original data points
plt.scatter(x0, x1, color=['#D3E7EE'])
# vectors before transformation
v_original = [np.array([[1,0.2]]), np.array([[-1,1.5]]), np.array([[1.5,-1.3]]), np.array([[1,1.4]])]
# vectors after transformation
for v in v_original:
    v_transformed = v.dot(covariance_matrix)
    origin = np.array([[0, 0], [0, 0]])
    plt.quiver(*origin, v[:, 0], v[:, 1], color=['black'], scale=4)
    plt.quiver(*origin, v_transformed[:, 0], v_transformed[:, 1] , color=['#C6A477'], scale=10)
plt.axis('scaled')
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
The original vectors are shown in black and the transformed vectors in brown. As you can see, the original vectors, which point in different directions, become more aligned with the general trend in the original dataset (i.e. the blue dots) after the transformation. Because of this property, the covariance matrix is important to PCA for describing the relationship between features.
Eigenvalue and Eigenvector
In Short: an eigenvector (v) of a matrix (A) stays on the same line after the matrix transformation, hence Av = λv, where λ represents the corresponding eigenvalue. Representing data using eigenvectors and eigenvalues reduces the dimensionality while preserving as much of the data variance as possible.
To build intuition for this concept, we can use a simple demonstration. Take the matrix [[0,1],[1,0]]: one of its eigenvectors is [1,1], and the corresponding eigenvalue is 1.
From matrix transformation, we know that [[0,1],[1,0]] acts as a mirror matrix that swaps the x and y coordinates of a vector. The vector [1,1] does not change direction under the mirror transformation, so it meets the criterion for being an eigenvector of the matrix. The eigenvalue 1 indicates that the vector keeps the same scale and direction as before the transformation. Consequently, along an eigenvector we can represent the effect of a matrix transformation (i.e. 2 dimensions) using a scalar (i.e. 1 dimension). When the matrix is a covariance matrix, the eigenvalue tells us how much variance is preserved by the eigenvector.
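The mirror-matrix claim is easy to check numerically. The short sketch below verifies Av = λv for [1,1] and also shows the matrix's second eigenvector, [1,-1], which is flipped by the mirror and therefore has eigenvalue -1.

```python
import numpy as np

mirror_matrix = np.array([[0, 1], [1, 0]])

# [1, 1] is unchanged by the swap, so A v = 1 * v
v = np.array([1, 1])
print(mirror_matrix @ v)

# [1, -1] is flipped to [-1, 1], so A w = -1 * w
w = np.array([1, -1])
print(mirror_matrix @ w)

# np.linalg.eig finds both eigenvalues; the eigenvectors come back as unit-length columns
eigenvalues, eigenvectors = np.linalg.eig(mirror_matrix)
print(eigenvalues)
print(eigenvectors)
```

Note that np.linalg.eig makes no promise about the order of the eigenvalues it returns, so they should be sorted explicitly whenever the order matters.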
Let’s continue with the covariance example from the previous section and use this code snippet to overlay the eigenvector with the greatest eigenvalue (in red). As you can see, it is aligned with the direction of greatest data variance.
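from numpy.linalg import eig
eigenvalue,eigenvector = eig(covariance_matrix)
# eig does not sort its output, so pick the column with the largest eigenvalue
top = np.argmax(eigenvalue)
plt.quiver(*origin, eigenvector[0, top], eigenvector[1, top], color=['red'], scale=4, label='eigenvector')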
Principal Components
We have now seen that the covariance matrix represents the data variance when multiple variables are present, and that eigenvectors capture the data variance in a lower dimension. By computing the eigenvectors and eigenvalues of the covariance matrix, we get the principal components. A matrix has more than one eigenvector, and they are typically arranged in descending order of their eigenvalues, denoted PC1, PC2 … PCn. The first principal component (PC1) is the eigenvector with the highest eigenvalue, shown as the red vector in the image, and it explains the maximum variance in the data. Therefore, when using principal components to reduce data dimensionality, we select the ones with higher eigenvalues, as they preserve more information from the original dataset.
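To make the ordering concrete, here is a small sketch that ranks the eigenvectors of a covariance matrix and turns the eigenvalues into shares of total variance. The 2x2 matrix is hypothetical, picked so the result can be checked by hand: its eigenvalues are 3 and 1.

```python
import numpy as np

# hypothetical covariance matrix, chosen so the numbers are easy to verify by hand
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh suits symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# reorder so that PC1 (largest eigenvalue) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# each eigenvalue as a share of the total variance
explained_ratio = eigenvalues / eigenvalues.sum()
pc1 = eigenvectors[:, 0]

print(eigenvalues)       # eigenvalues 3 and 1
print(explained_ratio)   # shares 0.75 and 0.25
print(pc1)               # points along [1, 1] (the sign may be flipped)
```

Keeping only PC1 here would retain 75% of the total variance while halving the number of dimensions.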
PCA Implementation in Machine Learning
We have walked through enough theory, so let us step into the practical part. Luckily, scikit-learn provides an easy implementation of PCA. We will use the public "college major" dataset from the fivethirtyeight GitHub repository.
1. standardize data into the same scale
PCA is sensitive to features on different scales: covariance depends on the units of each feature, so a feature with a larger numeric range would dominate the principal components. Therefore, we need to standardize the data before applying PCA, so that each feature has a mean of zero and a standard deviation of one. We use the following code snippet to perform data standardization. If you wish to know more about data transformation techniques such as normalization and min-max scaling, check out my article on “3 Common Techniques for Data Transformation”.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
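To see what standardization does to each column, the same calculation can be sketched by hand with NumPy: subtract the column mean and divide by the column standard deviation (the population form, ddof=0, which is what StandardScaler uses). The small array below is hypothetical and stands in for two features on very different scales.

```python
import numpy as np

# hypothetical features on very different scales (e.g. a head count and a rate)
X = np.array([
    [1200.0, 0.05],
    [3400.0, 0.08],
    [ 560.0, 0.02],
    [8900.0, 0.11],
])

# z-score per column: zero mean and unit standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))   # approximately 0 for every column
print(X_scaled.std(axis=0))    # 1 for every column
```

After this step both columns contribute on an equal footing to the covariance matrix, regardless of their original units.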
2. apply PCA on the scaled data
We then use PCA from sklearn.decomposition and specify the number of components to generate. The number of components is determined by how much of the data variance we want the principal components to explain. Here we generate 3 components to balance the trade-off between explained variance and dimensionality.
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_data = pca.fit_transform(scaled_df)
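One common way to choose the number of components is to keep the smallest number whose cumulative explained variance reaches a target. The sketch below applies that rule to made-up ratios; both the ratios and the 80% target are assumptions for illustration, not values computed from the dataset.

```python
import numpy as np

# made-up explained-variance ratios, sorted from PC1 downward (they sum to 1)
explained_variance_ratio = np.array([0.60, 0.15, 0.09, 0.08, 0.05, 0.03])
target = 0.80   # assumed threshold: keep at least 80% of the variance

cumulative = np.cumsum(explained_variance_ratio)
# position of the first component at which the cumulative ratio reaches the target
n_components = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 2))
print(n_components)   # 3
```

scikit-learn's PCA also accepts a fraction between 0 and 1 for n_components and then selects the number of components by a similar cumulative-variance rule.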
3. visualize explained variance using scree plot
Some information from the original dataset is lost after shrinking it to a lower dimensionality, hence it is important to keep as much information as possible while limiting the number of principal components. To help with the interpretation, we can visualize the explained variance using a scree plot. The explained variance of a principal component is the amount of data variance in the direction of its eigenvector, which is given by the corresponding eigenvalue; the explained variance ratio used in the code divides it by the total variance. Higher explained variance means that the component preserves more information, and the component with the highest explained variance is the first principal component. We can use the code snippet below to visualize the explained variance and also the cumulative variance (i.e. the running total of explained variance as each principal component is added).
import matplotlib.pyplot as plt
principal_components = ['PC1', 'PC2', 'PC3']
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
plt.figure(figsize=(10, 6))
plt.bar(principal_components, explained_variance, color='#D3E7EE')
plt.plot(principal_components, cumulative_variance, 'o-', linewidth=2, color='#C6A477')
# add cumulative variance as the annotation
for i,j in zip(principal_components, cumulative_variance):
    plt.annotate(str(round(j,2)), xy=(i, j))
The scree plot shows the explained variance when three principal components are generated. The first principal component (PC1) explains 60% of the variance, and the first 3 components together explain 84%.
4. interpret the principal components composition
Principal components also provide some evidence of the importance of the original features. By evaluating the magnitude and direction of the coefficients for the original features, we can tell whether a feature is strongly associated with a component. As shown below, we generate the coefficients of the features with respect to the components.
import pandas as pd
# rows are principal components, columns are the original features
pca_component_df = pd.DataFrame(pca.components_, columns = df.columns)
pca_component_df
Additionally, we can use a heatmap from the seaborn library to highlight the features with high absolute coefficient values. If we interpret PC1 (i.e. row 0), we can see that multiple features have a relatively high association with PC1, such as "Total" (number of enrolled students), "Employed", "Full_time" and "Unemployed", indicating that these features contribute more to the data variance. You may also notice that some features are directly correlated with each other; PCA brings the extra benefit of removing multicollinearity among these features.
import seaborn as sns
# create custom color palette
customPalette = sns.color_palette("blend:#D3E7EE,#C6A477", as_cmap=True)
# create heatmap
plt.figure(figsize=(24,3))
sns.heatmap(pca_component_df, cmap=customPalette, annot=True)
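Besides reading the heatmap, the features can be ranked by the absolute value of their coefficients for a given component. The sketch below does this for PC1 using made-up coefficients; the numbers and the extra "Median" column are hypothetical and are not taken from the fitted model.

```python
import numpy as np

feature_names = ["Total", "Employed", "Full_time", "Unemployed", "Median"]
# made-up PC1 coefficients, for illustration only
pc1_loadings = np.array([0.41, 0.40, 0.39, 0.37, -0.05])

# sort by absolute value so that strong negative coefficients also rank highly
order = np.argsort(-np.abs(pc1_loadings))
for idx in order:
    print(f"{feature_names[idx]:<12}{pc1_loadings[idx]: .2f}")

top_features = [feature_names[i] for i in order[:3]]
print(top_features)
```

The sign of a coefficient only indicates direction along the component, so the absolute value is what matters when judging how strongly a feature is associated with it.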
5. use principal components in ML algorithm
Finally, we have reduced the dimensionality to a handful of principal components, which are ready to be used as the new features in a machine learning algorithm. To do so, we use the transformed dataset output by the PCA process - pca_df. Examining the shape of this dataset gives 173 rows and 3 columns. We then add the label (e.g. “Rank”) back to this dataset alongside the 3 new features derived from PCA, and this becomes the new dataframe for building the ML model.
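pca_df = pd.DataFrame(pca_data)
# label_df is assumed to hold the "Rank" column, set aside before scaling, in the same row order as pca_df
new_df = pd.concat([pca_df,label_df], axis = 1)
new_df.columns = ["PC1", "PC2", "PC3", "Rank"]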
The remaining process follows the standard procedure of a machine learning lifecycle, that is - splitting the dataset into train and test sets, building the model and then evaluating it. Here we won’t dive into the details of building ML models, but if you are interested, please have a look at my article on classification algorithms as a starting point.
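As a rough sketch of that lifecycle, the scaling, PCA and model steps can be chained in a scikit-learn pipeline, so that the scaler and PCA are fitted on the training split only. Everything specific here is an assumption for illustration: the data is synthetic, and LinearRegression is just a placeholder for whichever model suits the label.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the real features and label (173 rows, 10 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(173, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=173)

# hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# standardize -> reduce to 3 components -> fit the placeholder model
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions.shape)
print(model.score(X_test, y_test))   # R^2 on the held-out rows
```

Wrapping the steps in a pipeline means the test rows never influence the scaling parameters or the principal components, which keeps the evaluation honest.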
Take-Home Message
In the previous article, we introduced using linear algebra for data representation in machine learning. In this article, we introduced another common use case of linear algebra in ML, dimensionality reduction with Principal Component Analysis (PCA). We first discussed the theory behind PCA:
represent the data variance using covariance matrix
use eigenvector and eigenvalue to capture data variance in a lower dimension
the principal components are the eigenvectors of the covariance matrix, ranked by their eigenvalues
Furthermore, we utilize scikit-learn to implement PCA through the following procedures:
standardize data into the same scale
apply PCA on the scaled data
visualize explained variance using scree plot
interpret the principal components composition
use principal components in ML algorithm