UMAP for dimensionality reduction analysis of 10X single cell (10X spatial transcriptome)

UMAP, short for Uniform Manifold Approximation and Projection, is built on a theoretical framework drawn from Riemannian geometry and algebraic topology. When dealing with large datasets, UMAP has clear advantages: it runs faster and has a smaller memory footprint. Etienne Becht et al. published an article in Nature Biotechnology in 2019 applying it to biological data and describing the applications and advantages of UMAP for single-cell data.

If you don't know what tSNE is, how it works, and haven't read the revolutionary van der Maaten & Hinton paper from 2008, you can refer to my article 10X single-cell (10X spatial transcriptome) dimensionality reduction analysis with tSNE (algorithmic basics). Although tSNE has had a huge impact on single-cell genomics and on data science in general, it is widely recognized to have some drawbacks that sooner or later need to be addressed. (The shortcomings of tSNE were also covered in detail in the previous article I shared.)

Looking at the graph above, I would say that the t-distribution is supposed to provide global distance information, because it pushes points that are far apart in high-dimensional space even farther apart in low-dimensional space.

However, this good intention is stifled by the choice of the cost function (KL-divergence), and we will see why later.

(1) UMAP does not normalize the high-dimensional (or low-dimensional) probabilities, which can significantly reduce computation time, since summation or integration over all pairs of points is a costly operation. Think of Markov Chain Monte Carlo (MCMC), which essentially tries to approximate the integral in the denominator of Bayes' rule.

(2) Instead of perplexity, UMAP uses the number of nearest neighbors: it defines the number of nearest neighbors k without the log2 function, i.e.
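For reference, a plausible reconstruction of the two definitions (following the standard tSNE and UMAP formulations; the conditional probabilities p are those of Eq. (1), and this is my reconstruction rather than a quote from the original figure):

\mathrm{Perplexity}_i = 2^{-\sum_j p_{j|i}\,\log_2 p_{j|i}} \quad (tSNE)

k = 2^{\sum_j p_{ij}} \quad (UMAP)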

UMAP also uses a slightly different symmetrization of the high-dimensional probabilities:
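A plausible reconstruction of the two symmetrization rules (UMAP's is the probabilistic t-conorm, tSNE's is the familiar average; N is the number of points):

p_{ij} = p_{i|j} + p_{j|i} - p_{i|j}\,p_{j|i} \quad (UMAP)

p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N} \quad (tSNE)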

Symmetrization is necessary because, after UMAP glues together points with locally varying metrics (through the parameter ρ), the weight of the edge from node A to node B of the graph may not equal the weight of the edge from B to A. Why UMAP uses this particular symmetrization rather than the one used by tSNE is not obvious. I will show my experiments with different symmetrization rules in my next post (writing UMAP from scratch); they did not convince me that this is such an important step, since it has only a small impact on the final low-dimensional embedding.

UMAP models the low-dimensional distance probabilities with the family of curves 1 / (1 + a*y^(2b)). This is not exactly the Student's t-distribution, but it is very similar; note that, once again, no normalization is applied:

where, for the default UMAP hyperparameters, a ≈ 1.93 and b ≈ 0.79 (actually, for min_dist = 0.001). In practice, UMAP finds a and b by a nonlinear least-squares fit to a piecewise function defined by the min_dist hyperparameter:

To better understand the behavior of the family of curves 1 / (1+a*y^(2b)), let's draw curves for different values of a and b:
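A minimal matplotlib sketch of such a plot; the particular (a, b) pairs below are just illustrative choices, not values used by UMAP itself:

import numpy as np
import matplotlib.pyplot as plt

y = np.linspace(0, 3, 300)
for a, b in [(1.0, 0.5), (1.93, 0.79), (1.0, 5.0), (5.0, 5.0)]:
    plt.plot(y, 1.0 / (1.0 + a * y ** (2 * b)), label="a=%s, b=%s" % (a, b))
plt.xlabel("y (low-dimensional distance)")
plt.ylabel("Q(y) = 1 / (1 + a*y^(2b))")
plt.legend()
plt.show()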

We can see that the family of curves is very sensitive to the parameter b: for large b and small y, the curves form a plateau. This means that below the UMAP hyperparameter min_dist all data points are equally tightly connected. Since the Q(Y) function then behaves almost like a Heaviside step function, UMAP assigns almost the same low-dimensional coordinates to all points that are close to each other in low-dimensional space. This min_dist is exactly what leads to the ultra-dense clusters often observed in UMAP plots.

To demonstrate how the a and b parameters are found, let's define a simple piecewise function (whose plateau is set by the min_dist parameter) and fit it with the family of functions 1 / (1 + a*y^(2b)) using curve_fit from the Scipy Python library. As a result of the fit, we obtain the optimal values of a and b for the function 1 / (1 + a*y^(2b)).
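A minimal sketch of this fit, assuming a spread of 1 for the exponential tail of the piecewise target (the real implementation may differ in details):

import numpy as np
from scipy.optimize import curve_fit

def piecewise_target(y, min_dist):
    # plateau of height 1 below min_dist, exponential decay beyond it (spread assumed to be 1)
    return np.where(y < min_dist, 1.0, np.exp(-(y - min_dist)))

def q(y, a, b):
    return 1.0 / (1.0 + a * y ** (2 * b))

y = np.linspace(1e-6, 3, 300)
(a, b), _ = curve_fit(q, y, piecewise_target(y, min_dist=0.001))
print(a, b)  # roughly a ~ 1.93, b ~ 0.79 for min_dist = 0.001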

Since we will need the gradient of the cross-entropy to implement gradient descent later on, let's quickly compute it. Ignoring the constant term that contains only p(X), we can rewrite the cross-entropy and differentiate it as follows:
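A plausible reconstruction of this rewriting, writing p for the high-dimensional probability p(X), Q = 1 / (1 + a d^{2b}) for the low-dimensional one, and d = |y_i - y_j| for the low-dimensional distance; differentiating term by term with respect to d (the chain rule then gives the gradient with respect to the coordinates):

CE \approx \sum_{i,j} \left[ p \log\left(1 + a d^{2b}\right) + (1 - p) \log\left(1 + a^{-1} d^{-2b}\right) \right]

\frac{\partial CE}{\partial d} = \frac{2 a b\, p\, d^{2b-1}}{1 + a d^{2b}} - \frac{2 b (1 - p)}{d \left(1 + a d^{2b}\right)}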

Graph Laplacian, spectral clustering, Laplacian eigenmaps, diffusion maps, spectral embedding, etc., all refer to essentially the same interesting methodology that combines matrix factorization with nearest-neighbor graphs to solve dimensionality reduction problems. In this approach, we first build a graph (e.g. a kNN graph), then formalize it with matrix algebra (adjacency and degree matrices) by constructing the Laplacian matrix, and finally factorize the Laplacian, i.e. solve the eigenvalue decomposition problem.

We can easily compute the initial low-dimensional coordinates on a demo dataset (cancer-associated fibroblast (CAF) scRNAseq data) using the SpectralEmbedding function from the scikit-learn Python library:
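A minimal sketch with scikit-learn; the expression file name below is a placeholder, and the n_neighbors value is just an example:

import numpy as np
from sklearn.manifold import SpectralEmbedding

# expr: cells x genes expression matrix of the CAF scRNAseq data (placeholder file name)
expr = np.loadtxt("CAFs_expression.txt")

se = SpectralEmbedding(n_components=2, n_neighbors=30)
init_coords = se.fit_transform(expr)  # initial low-dimensional coordinates for the optimization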

Finally, UMAP uses stochastic gradient descent (SGD) instead of the regular gradient descent (GD) used by tSNE / FItSNE, which both speeds up computation and reduces memory consumption.

Now let us briefly discuss why it is said that tSNE preserves only the local structure of the data. The locality of tSNE can be understood in different ways. First, the σ parameter in Eq. (1) sets how locally data points "feel" each other. Because the pairwise-distance probability decays exponentially with the Euclidean distance, at small σ it is essentially zero for distant points (large X) and grows rapidly only for the nearest neighbors (small X). In contrast, at large σ the probabilities of distant and nearby points become comparable, and in the limit σ → ∞ the probability equals 1 for all distances between any pair of points, i.e. all points become equidistant.

Interestingly, if we expand the high-dimensional pairwise-distance probability in a Taylor series at σ → ∞, we obtain, to second order, a power law:
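A plausible reconstruction of this expansion, with P(X) denoting the (unnormalized) high-dimensional probability:

P(X) = e^{-X^2 / (2\sigma^2)} \approx 1 - \frac{X^2}{2\sigma^2} + O\!\left(\frac{X^4}{\sigma^4}\right), \qquad \sigma \to \infty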

This power law in the pairwise Euclidean distances resembles the cost function of multidimensional scaling (MDS), a method that preserves global structure by trying to preserve the distance between every pair of points, regardless of whether they are far apart or close together. In this sense, at large σ tSNE does account for interactions between distant data points, so it is not entirely correct to say that it can only handle local distances. However, we are usually constrained to finite values of perplexity: Laurens van der Maaten recommends perplexity values between 5 and 50, although a good compromise between local and global information may be achieved by choosing perplexity ≈ N^(1/2), where N is the sample size. In the opposite limit, σ → 0, we end up with extremely "localized" high-dimensional probabilities whose behavior resembles the Dirac δ-function.

Another way to understand the "localization" of tSNE is to consider the KL-divergence function. Suppose X is the distance between points in a high-dimensional space and Y is the distance between points in a low-dimensional space

According to the definition of KL-divergence:
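A plausible reconstruction of the missing formula (presumably the Eq. (9) referenced below), with P(X) = e^{-X²/(2σ²)} for the high-dimensional probability and Q(Y) = 1/(1+Y²) for the low-dimensional one, normalization omitted as in the rest of the article:

KL(X, Y) = \sum_{i,j} P(X) \log\frac{P(X)}{Q(Y)} = \sum_{i,j} e^{-\frac{X^2}{2\sigma^2}} \left[ -\frac{X^2}{2\sigma^2} + \log\left(1 + Y^2\right) \right]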

The first term of equation (9) converges to 0 for both small and large X. For small X it goes to 0 because the exponent goes to 1 and log(1) = 0; for large X it still goes to 0 because the exponential prefactor goes to 0 faster than the logarithm goes to negative infinity. So, to understand KL-divergence intuitively, it suffices to consider only the second term:
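That is, keeping only the second term of the reconstruction above:

KL(X, Y) \approx e^{-\frac{X^2}{2\sigma^2}} \log\left(1 + Y^2\right)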

This is a strange-looking function; let's plot KL(X, Y):
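A minimal matplotlib sketch of that surface, with σ = 1 purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

X, Y = np.meshgrid(np.linspace(0, 3, 100), np.linspace(0, 3, 100))
KL = np.exp(-X**2 / 2.0) * np.log(1 + Y**2)  # second term of the KL-divergence, sigma = 1

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, KL, cmap="viridis")
ax.set_xlabel("X (high-dim distance)")
ax.set_ylabel("Y (low-dim distance)")
ax.set_zlabel("KL(X, Y)")
plt.show()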

The shape of this function is very asymmetric. If the distance X between points in high-dimensional space is small, the exponential factor becomes 1 and the logarithmic term behaves as log(1 + Y^2). This means that if Y, the distance in low-dimensional space, is large, there will be a large penalty, so tSNE tries to reduce Y at small X in order to reduce the penalty. Conversely, for large distances X in high-dimensional space, Y can take essentially any value between 0 and ∞, since the exponential term goes to 0 and always beats the logarithmic term. Thus, points that are far apart in high-dimensional space may end up close to each other in low-dimensional space. In other words, tSNE does not guarantee that points far apart in high-dimensional space remain far apart in low-dimensional space, but it does guarantee that points that are neighbors in high-dimensional space remain neighbors in low-dimensional space. So tSNE is poor at projecting large distances into low dimensions; it only preserves the local structure of the data, provided σ does not go to ∞.

Unlike tSNE, UMAP uses cross-entropy (CE) as the cost function instead of KL-divergence:
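For reference, the cross-entropy written with the same P(X) and Q(Y) as above (my reconstruction of the missing formula):

CE(X, Y) = \sum_{i,j} \left[ P(X) \log\frac{P(X)}{Q(Y)} + \left(1 - P(X)\right) \log\frac{1 - P(X)}{1 - Q(Y)} \right]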

This leads to a large change in the balance between preserving local and global structure. At small values of X, we get the same limit as for tSNE, since the second term vanishes: its prefactor (1 − P) goes to zero faster than its logarithm diverges:
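A plausible reconstruction of this small-X limit (P → 1, so only the first term survives):

CE(X \to 0, Y) \approx \log\left(1 + Y^2\right)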

Thus, to minimize this penalty, the Y-coordinates have to be very small, i.e. Y → 0, exactly the same behavior as for tSNE. However, in the opposite limit of large X, i.e. X → ∞, the first term vanishes and the second term gets a prefactor of 1, yielding:
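A plausible reconstruction of this large-X limit (P → 0, so only the second term survives, with 1 − Q = Y²/(1+Y²)):

CE(X \to \infty, Y) \approx \log\frac{1 + Y^2}{Y^2}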

Here, if Y were very small we would get a very large penalty, because Y sits in the denominator inside the logarithm; Y is therefore encouraged to be large, so that the ratio under the logarithm approaches 1 and the penalty goes to zero. Thus we get Y → ∞ as X → ∞, meaning that large distances in high-dimensional space are preserved as large distances in low-dimensional space, which is exactly what we want. To illustrate this, let's plot the UMAP CE cost function:
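The analogous matplotlib sketch for the cross-entropy surface, again with σ = 1 chosen only for illustration:

import numpy as np
import matplotlib.pyplot as plt

X, Y = np.meshgrid(np.linspace(0.01, 3, 100), np.linspace(0.01, 3, 100))
P = np.exp(-X**2 / 2.0)   # high-dimensional probability, sigma = 1
Q = 1.0 / (1.0 + Y**2)    # low-dimensional probability
CE = P * np.log(P / Q) + (1 - P) * np.log((1 - P) / (1 - Q))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, CE, cmap="viridis")
ax.set_xlabel("X (high-dim distance)")
ax.set_ylabel("Y (low-dim distance)")
ax.set_zlabel("CE(X, Y)")
plt.show()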

Here we can see that the "right" part of the plot looks very similar to the KL-divergence surface above: when X is small, we still want Y to be small in order to minimize the loss. However, when X is large, the distance Y also has to be large, because if it were small the CE(X, Y) loss would be huge. Remember that for the KL(X, Y) surface there was no difference between high and low values of Y when X was large; this is why the CE(X, Y) cost function is able to preserve both global and local distances.

We know that UMAP is faster than tSNE when a) there is a large number of data points, b) the embedding dimension is greater than 2 or 3, and c) the ambient dimensionality of the dataset is large. Here, let us try to understand why UMAP's mathematical and algorithmic implementation is superior to tSNE's.

Both tSNE and UMAP basically consist of two steps: building a graph in high dimensions and computing the weights (probabilities) of its edges, and then optimizing the low-dimensional coordinates by gradient descent.

However, I noticed that the first step of UMAP is much faster than that of tSNE. There are two reasons for this: as discussed above, UMAP drops the normalization of the high-dimensional probabilities, avoiding a costly summation over all pairs of points, and it replaces perplexity with a plain number of nearest neighbors k, with no log2 function involved.

Next, UMAP is actually faster in the second step as well. There are several reasons for this improvement too: it also skips the normalization of the low-dimensional probabilities, it starts the optimization from informative initial coordinates obtained by spectral embedding rather than from a random initialization, and it uses stochastic gradient descent instead of regular gradient descent.

In this article, we learned that although tSNE has served the field of single-cell research for many years, it has drawbacks such as its speed and its lack of global distance preservation. UMAP generally follows the philosophy of tSNE but introduces a number of improvements, such as a different cost function and the absence of normalization of the high- and low-dimensional probabilities.

In addition to running fast and having a small memory footprint, UMAP has a big advantage when dealing with single-cell data: it can reflect the continuity of differentiation and the organization of cell populations. This is explained in more detail below using data from reference 2.

The same dataset, about 300,000 T and NK cells enriched from 8 different tissues, was reduced with tSNE and UMAP respectively, and Phenograph clustering was used to classify the cells into 6 major categories, with each color representing one category. As can be seen from the figure, both UMAP and tSNE separate the different classes of cells well. However, tSNE tends to split the same cell population into more clusters; for example, the CD8 T cells in the black circle are split into more clusters that lie farther apart in the tSNE result.

Coloring the cells of the same dataset by tissue source on the UMAP and t-SNE plots reveals an interesting phenomenon. Compared with UMAP, t-SNE is more inclined to separate cells wholesale by their tissue of origin. UMAP, on the other hand, arranges the cells taking both the cell category and the tissue source into account: as shown in the figure, within the CD4 T-cell and CD8 T-cell populations the cells are arranged with some regularity with respect to their source, running roughly from cord blood (CB) and peripheral blood mononuclear cells (PBMC), to liver (Liver) and spleen (Spleen), and finally to skin (Skin), gut (Gut) and lung (Lung) at the other end.

From the expression distribution of the tissue-resident memory T-cell markers CD69/CD103, the memory T-cell marker CD45RO, and the naïve T-cell marker CCR7 across the clusters, it can be observed that UMAP displays successive stages of T-cell differentiation. In the tSNE results there is also some continuity between these clusters, but without a pronounced axis-like structure. The same phenomenon is observed in the hematopoietic cell system. This shows that UMAP can display the continuity of cell clusters when processing large datasets.

Each of the three datasets (Samusik, Wong, and Han_400k) was randomly subsampled to different sizes between 100 and 200,000 cells to form smaller datasets. The vertical axis shows the correlation between the embedding of the subsampled dataset and that of the original dataset, representing the reproducibility of the dimensionality reduction method at different data sizes. UMAP performs best, and the larger the dataset, the more obvious its advantage.

The figure below compares the effectiveness of UMAP and t-SNE in reducing the 784-dimensional Fashion MNIST dataset to 3 dimensions.

While both algorithms exhibit strong local clustering and group similar classes together, UMAP also separates these groups of similar classes from each other. Additionally, UMAP took 4 minutes for dimensionality reduction, while multicore t-SNE took 27 minutes.

UMAP's two most commonly used parameters, n_neighbors and min_dist, can be used effectively to control the balance between local and global structure in the final result.

The most important parameter is n_neighbors, the number of approximate nearest neighbors. It effectively controls the balance between local and global structure in UMAP. With small values, UMAP pays more attention to local structure; with large values, UMAP tends to represent the big-picture structure at the expense of some detail.

The second parameter is min_dist, the minimum distance allowed between points in the embedding. This parameter controls how tightly UMAP packs points together: smaller values give tighter, denser clusters, while larger values give a looser arrangement and focus on preserving the broad topological structure.
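A minimal usage sketch with the umap-learn package; the input matrix below is random placeholder data, and the parameter values are just examples:

import numpy as np
import umap  # from the umap-learn package

data = np.random.rand(1000, 50)  # placeholder for a cells x features matrix (e.g. PCA-reduced expression data)
reducer = umap.UMAP(n_neighbors=15,   # larger values emphasize global structure
                    min_dist=0.1,     # larger values give a looser, more spread-out embedding
                    n_components=2,
                    random_state=42)
embedding = reducer.fit_transform(data)  # (1000, 2) low-dimensional coordinates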

t-SNE and UMAP behave very similarly in most cases, with the notable exception of the following example: wide, sparse clusters that contain dense clusters within them (as shown in the figure below). UMAP fails to separate the two nested clusters, especially at higher dimensionality.

UMAP's use of local distances in the initial graph construction can explain why the algorithm cannot handle this situation: since distances between high-dimensional points tend to be very similar (the curse of dimensionality), the nested clusters can end up being mixed together.

Algorithms are hard, which is why people who master them seem so impressive.

The sky is the limit