Let us begin with a simple, tangible example. Imagine a collection of data points scattered across a two-dimensional Cartesian plane. While each point is defined by its X and Y coordinates, the overall structure of the point cloud might reveal a more fundamental organization. The data might be elongated in a particular direction, suggesting a strong relationship between the X and Y variables. Our goal is to discover these intrinsic directions, which are known as the principal axes of variation.
Consider a dataset generated programmatically. We can use the NumPy library to create a set of points that are not aligned with the standard X and Y axes. We will first generate random data and then apply a linear transformation (a random 2x2 matrix) that stretches and tilts the cloud, introducing a correlation between the two dimensions.
import numpy as np

# Generate some 2D data
np.random.seed(42)
X = np.dot(np.random.rand(2, 2), np.random.randn(2, 200)).T
If one were to plot these 200 points, a distinct elliptical shape would emerge, oriented diagonally. The challenge is to find a new coordinate system where the axes align with the major and minor axes of this ellipse. The first axis of this new system should point in the direction of the greatest variance in the data. The second axis, being orthogonal to the first, will point in the direction of the next greatest variance.
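Before going further, it helps to actually see this shape. The following is a quick visualization sketch (not part of the numerical workflow itself) using Matplotlib, which we will use again later; plt.axis('equal') keeps the ellipse from being distorted by unequal axis scales.

import matplotlib.pyplot as plt

# Quick look at the raw point cloud; an equal aspect ratio preserves
# the diagonal, elliptical shape we are about to analyze.
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
plt.axis('equal')
plt.title('Original 2D Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()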
The mathematical tool that encapsulates the variance and covariance of our data is the covariance matrix. For a dataset with n features, the covariance matrix is an n-by-n matrix where the entry in the i-th row and j-th column is the covariance between the i-th and j-th features. The diagonal elements represent the variance of each feature individually. We can compute this matrix directly from our data, but first, we must center the data by subtracting the mean of each feature. This ensures that the analysis focuses on the variation around the central point of the data cloud.
# Center the data
X_centered = X - np.mean(X, axis=0)

# Calculate the covariance matrix
cov_matrix = np.cov(X_centered, rowvar=False)
print("Covariance Matrix:")
print(cov_matrix)
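As a quick sanity check (not part of the original walkthrough), the same matrix can be computed straight from its definition: the matrix product of the centered data with itself, divided by n - 1, which is the unbiased estimator that np.cov uses by default.

# Covariance from the definition: (X_c^T X_c) / (n - 1)
n_samples = X_centered.shape[0]
cov_manual = X_centered.T.dot(X_centered) / (n_samples - 1)
print("Matches np.cov:", np.allclose(cov_manual, cov_matrix))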
The output of this code reveals a 2×2 matrix. The off-diagonal elements are non-zero, confirming the covariance between our two dimensions. The principal axes we seek are, in fact, the eigenvectors of this covariance matrix. The eigenvectors define the directions of the axes, while their corresponding eigenvalues quantify the amount of variance along each of those axes. An eigenvector with a large eigenvalue corresponds to a direction of high variance.
Let us now employ NumPy’s linear algebra functions to find these eigenvectors and eigenvalues.
# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("\nEigenvalues:")
print(eigenvalues)
print("\nEigenvectors (Principal Axes):")
print(eigenvectors)
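To confirm that these vectors really satisfy the defining relation (the covariance matrix times an eigenvector equals the eigenvalue times that same vector), we can add a small verification step; this check is an extra illustration, not part of the original article.

# Each column v of `eigenvectors` should satisfy cov_matrix @ v == eigenvalue * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(f"Eigenpair {i} satisfies the relation:",
          np.allclose(cov_matrix.dot(v), eigenvalues[i] * v))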
The eigenvectors array contains our principal axes as its columns: the first column is the eigenvector corresponding to the first eigenvalue, and the second column corresponds to the second eigenvalue. Note that np.linalg.eig makes no guarantee about ordering, so the eigenvalues are not necessarily returned from largest to smallest. Let's sort them explicitly to be certain.
# Sort eigenvectors by descending eigenvalues
sort_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[sort_indices]
eigenvectors_sorted = eigenvectors[:, sort_indices]

# The first principal axis is the eigenvector with the largest eigenvalue
principal_axis_1 = eigenvectors_sorted[:, 0]
principal_axis_2 = eigenvectors_sorted[:, 1]

print("\nFirst Principal Axis (direction of max variance):")
print(principal_axis_1)
print("\nVariance along first axis (largest eigenvalue):")
print(eigenvalues_sorted[0])
We have now successfully identified the primary direction of variation within our dataset. This vector, principal_axis_1, represents the line onto which the data, if projected, would have the maximum possible spread. The corresponding eigenvalue gives us a precise measure of that spread. The second vector, principal_axis_2, is orthogonal to the first and captures the remaining variance. These two vectors form a new basis, a new coordinate system, that is intrinsically tailored to the structure of our data. This process of finding the eigenvectors of the covariance matrix is the very heart of Principal Component Analysis. It provides a mathematical foundation for reorienting our perspective to better understand the data's inherent structure. The next logical step involves using these axes to transform the original data into this new coordinate system, a process which re-expresses each data point in terms of these principal components.
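The claim that each eigenvalue measures the spread along its axis is easy to verify numerically. The sketch below, added as an illustration, projects the centered data onto each axis and computes the sample variance (ddof=1, to match np.cov); the results should reproduce the sorted eigenvalues.

# Project the centered data onto each principal axis and measure the spread
proj_1 = X_centered.dot(principal_axis_1)
proj_2 = X_centered.dot(principal_axis_2)
print("Variance along first axis: ", np.var(proj_1, ddof=1))   # ~ eigenvalues_sorted[0]
print("Variance along second axis:", np.var(proj_2, ddof=1))   # ~ eigenvalues_sorted[1]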
The mechanics of transformation with scikit-learn
To transform our original dataset using the principal axes we have identified, we need to perform a matrix multiplication between our centered data and the eigenvectors. This operation projects our data points onto the new coordinate system defined by the principal components. The result will be a new dataset where each point is represented in terms of its coordinates along the principal axes.
Let’s execute this transformation using the previously computed eigenvectors. The transformation is straightforward: we multiply the centered data matrix by the eigenvector matrix. The resulting matrix will contain the coordinates of the original data points in the new basis.
# Transform the data using the principal axes
X_transformed = X_centered.dot(eigenvectors_sorted)
print("\nTransformed Data:")
print(X_transformed)
The output will display the transformed coordinates of the original data points. Each point is now expressed in terms of its position along the principal axes. At this stage the transformation is simply a rotation into a new basis; dimensionality reduction follows when we keep only the leading components, which retain the most significant part of the data's structure while discarding the directions of least variance.
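Two quick checks, added here as an illustration, make the effect of this change of basis concrete: the covariance matrix of the transformed data is (numerically) diagonal, with the eigenvalues on the diagonal, and the share of total variance carried by the first component follows directly from the eigenvalues.

# In the new basis the components are uncorrelated: the covariance matrix is diagonal
cov_transformed = np.cov(X_transformed, rowvar=False)
print("Covariance in principal component space:")
print(np.round(cov_transformed, 6))

# Fraction of the total variance captured by the first principal component
print("Variance explained by the first component:",
      eigenvalues_sorted[0] / eigenvalues_sorted.sum())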
In practice, we might want to visualize this transformation to better understand how our data has been reoriented. By plotting the transformed data, we can clearly see the alignment along the principal axes. Using the Matplotlib library, we can create a scatter plot of the transformed data points.
import matplotlib.pyplot as plt

# Create a scatter plot of the transformed data
plt.figure(figsize=(8, 6))
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=0.5)
plt.title('Transformed Data in Principal Component Space')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.axhline(0, color='grey', lw=0.5, ls='--')
plt.axvline(0, color='grey', lw=0.5, ls='--')
plt.grid()
plt.show()
This plot will illustrate how the data is now aligned along the axes of maximum variance, revealing the structure that was previously obscured in the original coordinate system. Each point in the plot corresponds to a point in the original dataset, but now expressed in terms of the principal components.
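Since the heading of this part refers to scikit-learn, it is worth closing with a sketch of how its PCA class wraps these same steps (centering, finding the axes, projecting). This assumes scikit-learn is installed; note that individual components may come out with flipped signs, because an eigenvector's direction is only defined up to sign.

from sklearn.decomposition import PCA

# Fit PCA on the original data; the class centers it internally
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# The projection should match our manual result up to the sign of each axis
print("Matches manual projection (up to sign):",
      np.allclose(np.abs(X_pca), np.abs(X_transformed)))
print("Explained variance:", pca.explained_variance_)          # ~ eigenvalues_sorted
print("Explained variance ratio:", pca.explained_variance_ratio_)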
Source: https://www.pythonlore.com/understanding-principal-component-analysis-with-scikit-learn/