How do you perform PCA on data in Python?
Basic steps:

Normalize the data (the code does not do a full normalization; it simply subtracts the mean from each feature).

Calculate the covariance matrix of the normalized data set.

Calculate the eigenvalues and eigenvectors of the covariance matrix.

Keep the k most important features (k is generally less than n). You can set k yourself, or you can choose a threshold and pick a k for which the sum of the first k eigenvalues minus the sum of the remaining n-k eigenvalues exceeds that threshold.

Find the eigenvectors corresponding to the k largest eigenvalues.

Multiply the m * n data set by the n * k matrix whose columns are these k n-dimensional eigenvectors to get the final dimensionality-reduced data, which is m * k.
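
The steps above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the original code the answer refers to: the function name `pca`, its return values, and the random test data are assumptions made for the example. As in the first step, the data is only centered, not scaled.

```python
import numpy as np

def pca(X, k):
    """Project the m x n array X onto its k leading principal directions."""
    X = np.asarray(X, dtype=float)
    # Step 1: center the data by subtracting each column's mean.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Step 2: n x n covariance matrix (rowvar=False: columns are features).
    cov = np.cov(Xc, rowvar=False)
    # Step 3: eigendecomposition. eigh is meant for symmetric matrices
    # and returns the eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Steps 4-5: sort from largest to smallest, keep the k leading eigenvectors.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    W = eigvecs[:, order[:k]]          # n x k
    # Step 6: (m x n) @ (n x k) -> (m x k) reduced data.
    return Xc @ W, eigvals, W, mean

# Hypothetical data: 200 samples with 5 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, eigvals, W, mean = pca(X, k=2)
print(Z.shape)
```

Here `Z` holds the reduced data, and `eigvals` is returned in full so that k can be chosen afterwards by whatever rule you prefer.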

In essence, PCA diagonalizes the covariance matrix. It is worth explaining why the eigenvalues are sorted from largest to smallest before any are selected. First of all, what does an eigenvalue represent? We have computed eigenvalues countless times in linear algebra, but what do they actually mean? Decomposing an n*n symmetric matrix yields its eigenvalues and eigenvectors; the eigenvectors form n mutually orthogonal n-dimensional basis vectors, each corresponding to one eigenvalue. When the matrix is projected onto these n basis vectors, the magnitude of each eigenvalue represents the length of the matrix's projection onto the corresponding basis vector.
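
The diagonalization claim can be checked numerically. The sketch below uses made-up data and assumed variable names. It verifies that the eigenvectors of a covariance matrix are orthonormal, that the matrix becomes diagonal in that basis, and that the variance of the data projected onto each eigenvector equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 500 samples, 3 correlated features.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)           # symmetric 3 x 3 matrix

eigvals, V = np.linalg.eigh(C)         # columns of V are the eigenvectors

# The eigenvectors form an orthonormal basis: V^T V = I.
orthonormal = np.allclose(V.T @ V, np.eye(3))

# In that basis the covariance matrix is diagonal: V^T C V = diag(eigvals).
D = V.T @ C @ V
diagonalized = np.allclose(D, np.diag(eigvals))

# The variance of the data projected onto each eigenvector
# equals the matching eigenvalue.
proj_var = (Xc @ V).var(axis=0, ddof=1)
variance_matches = np.allclose(proj_var, eigvals)

print(orthonormal, diagonalized, variance_matches)
```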

The larger the eigenvalue, the greater the variance of the data along the corresponding eigenvector, the more spread out the sample points are, the easier they are to distinguish, and the more information that direction carries. The eigenvector with the largest eigenvalue therefore points in the direction that contains the most information. If some eigenvalues are very small, there is very little information in those directions, so we can discard the data along them and keep only the data along the directions with large eigenvalues. The amount of data shrinks, but the useful information is retained. This is the principle behind PCA.
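
To illustrate how little is lost when the small-eigenvalue directions are dropped, the following sketch builds synthetic data with two strong directions plus weak noise. It then measures the relative reconstruction error after keeping only k directions. The helper `choose_k` is hypothetical, and it picks k by cumulative variance fraction. That is a commonly used rule, but it is not the difference-based threshold described in the steps.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 2 strong latent directions embedded in 4 dimensions,
# plus a small amount of noise.
latent = rng.normal(size=(300, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 4))

Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, V = eigvals[::-1], V[:, ::-1]      # largest eigenvalue first

# Fraction of the total variance captured by the first k directions.
ratio = np.cumsum(eigvals) / eigvals.sum()

def choose_k(ratio, keep=0.95):
    """Smallest k whose cumulative variance fraction reaches `keep`."""
    k = int(np.searchsorted(ratio, keep)) + 1
    return min(k, len(ratio))

k = choose_k(ratio, keep=0.99)

# Keep the k leading directions, then map back to n dimensions.
Z = Xc @ V[:, :k]
X_approx = Z @ V[:, :k].T
rel_error = np.linalg.norm(Xc - X_approx) ** 2 / np.linalg.norm(Xc) ** 2

print(k, ratio, rel_error)
```

The relative squared error equals the fraction of variance in the discarded directions, so small eigenvalues translate directly into small information loss.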