Big Data Definition
Big data is defined as a collection of data so large that it becomes difficult to process using traditional techniques. Big data is commonly characterized along three broad aspects: volume, velocity, and variety.
The complexity of working with so many dimensions drives the need for a wide variety of data technologies that filter the information and distill the data into a form better suited to solving problems. Filters reduce dimensionality by removing redundant information from high-dimensional datasets.
We can think of this as information compression of the data, similar to compressing a 1000×1000 image down to 64×64 resolution while still being able to understand what the picture shows.
The core algorithm in big data dimensionality reduction is SVD, short for singular value decomposition. The formula for SVD is:

M = U · Σ · Vh

That is, the original data matrix M is decomposed into the product of three matrices: U, the diagonal matrix Σ built from the vector of singular values s, and Vh.
The key is to understand what s represents. Suppose, for example, that the elements of s sum to 100 and the first value of s is 99: that means 99% of the information is captured by the first column of U and the first row of Vh. You can then happily discard everything after the first column of U (and the first row of Vh) without losing anything important about the data, since only 1% of the information is gone.
The data M to be reduced in this example consists of 4 samples, each with 3 feature values. Here is how we use the svd function of the linalg module to decompose the matrix:
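As a minimal sketch (the article's original numbers are not reproduced here, so the matrix below is hypothetical, and its information shares differ from the percentages quoted next):

```python
import numpy as np

# Hypothetical 4x3 data matrix: 4 samples, 3 feature values each.
M = np.array([[ 1.,  2.,  3.],
              [ 4.,  5.,  6.],
              [ 7.,  8.,  9.],
              [10., 11., 13.]])

# Decompose M into U, s, Vh such that M = U @ np.diag(s) @ Vh.
# full_matrices=False gives the compact shapes (4, 3), (3,), (3, 3).
U, s, Vh = np.linalg.svd(M, full_matrices=False)

print(U.shape, s.shape, Vh.shape)  # (4, 3) (3,) (3, 3)
print(s / s.sum())                 # share of information per component
```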
The values in s show that the first column contains most of the information (more than 80%). The second column carries some of it (about 14%), and the third contributes very little.
Of course the SVD is invertible; that is, the three matrices from the decomposition can be multiplied back together to restore the original matrix. Note that s is returned as a 1-D array of singular values, so the reconstruction must first expand it into a diagonal matrix and use that in the product.
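Continuing the sketch above, the reconstruction might look like this:

```python
# Expand the 1-D s into a diagonal matrix, then multiply the three factors.
back_M = U @ np.diag(s) @ Vh

print(np.allclose(back_M, M))  # True: the decomposition loses nothing
```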
You can see that back_M after the reconstruction is identical to the original matrix M.
Starting with the three matrices output by the SVD, let's remove the third component: U becomes U[:, :2] with shape (4, 2), s becomes s[:2] with shape (2,), and Vh becomes Vh[:2, :] with shape (2, 3).
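A sketch of that truncation, reusing the matrices from above:

```python
# Keep only the two largest singular values and the matching
# columns of U / rows of Vh.
U2, s2, Vh2 = U[:, :2], s[:2], Vh[:2, :]

approx_M = U2 @ np.diag(s2) @ Vh2
print(np.abs(approx_M - M).max())  # small: the dropped component mattered little
```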
You can see that even after dropping the last component, the reconstruction differs from the original only slightly. In other words, the same values can be stored in fewer dimensions with little loss.
Seeing this, you might be a little confused about where the dimensionality was actually reduced: going from a single (4, 3) matrix to three matrices of shapes (4, 3), (3,), and (3, 3) does not shrink the data at all; it actually adds numbers.
But suppose we keep only the first component and ignore the rest, so the three matrices have shapes (4, 1), (1,), and (1, 3): the 4 × 3 = 12 numbers of the original become just 4 + 1 + 3 = 8, so the storage really does drop (see the sketch below). But how should we use these three matrices for machine learning?
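Before answering, a sketch of that count, keeping only the first component:

```python
# Rank-1 truncation: shapes (4, 1), (1,), and (1, 3).
U1, s1, Vh1 = U[:, :1], s[:1], Vh[:1, :]

stored = U1.size + s1.size + Vh1.size
print(stored, "numbers instead of", M.size)  # 8 numbers instead of 12
```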