
1. Introduction:


When processing low-dimensional data, computing distances and densities of data points is not a major difficulty for machine learning models. When we handle high-dimensional data, however, we may face what is called the curse of dimensionality, a term coined by Richard E. Bellman when considering problems in dynamic programming. The curse of dimensionality describes how increasing dimensionality makes the data points very sparse, so that computation grows exponentially and common data organization strategies stop being efficient.

One common way to deal with the curse of dimensionality is dimensionality reduction, which transforms the original high-dimensional space into a lower-dimensional subspace in which the samples are denser. The reason we can apply dimensionality reduction to a high-dimensional space is that we can often find an embedding in a lower dimension that still satisfies the machine-learning task defined on the original, higher-dimensional sample space.

2. Ways of Dimensionality Reduction:

2.1 Multidimensional Scaling (MDS)


Multidimensional scaling (MDS) is a technique that creates a map displaying the relative positions of a number of objects, given only a table of the distances between them. The map may consist of one, two, three, or even more dimensions. The method can produce either a metric or a non-metric solution. The table of distances is known as the proximity matrix; it arises either directly from experiments or indirectly, for example as a correlation matrix.

So suppose there is a distance matrix D\in \mathbb{R}^{m\times m} for m samples, where the element d_{ij}\in D is the distance between samples x_i and x_j. With MDS we want to acquire a representation matrix Z\in \mathbb{R}^{n\times m}, where n is smaller than the original dimension, such that for every pair of samples ||z_i-z_j||=d_{ij}.

Let B=Z^TZ\in \mathbb{R}^{m\times m} such that b_{ij} = z_i^Tz_j and we have

d_{ij}^2 = ||z_i||^2+||z_j||^2-2z_i^Tz_j = b_{ii}+b_{jj}-2b_{ij}

For convenience we center the low-dimensional samples so that \sum_{i=1}^m z_i = 0; then every row sum and every column sum of B is 0.

From this we can get
\sum_{i=1}^m d_{ij}^2 = tr(B)+m\cdot b_{jj}
\sum_{j=1}^m d_{ij}^2 = tr(B)+m\cdot b_{ii}
\sum_{i=1}^m\sum_{j=1}^m d_{ij}^2= 2m\cdot tr(B)

We further define the averaged squared distances

d_{i\cdot}^2 = \frac{1}{m}\sum_{j=1}^m d_{ij}^2
d_{\cdot j}^2 = \frac{1}{m}\sum_{i=1}^m d_{ij}^2
d_{\cdot \cdot}^2 = \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m d_{ij}^2

Then we can get B by \[b_{ij}=-\frac{1}{2}(d_{ij}^2-d_{i\cdot}^2-d_{\cdot j}^2+d_{\cdot \cdot}^2)\].

To get Z we can apply eigenvalue decomposition to B: B = V\Lambda V^T, in which \Lambda = diag(\lambda_1,...,\lambda_m) with \lambda_1\geq \lambda_2 \geq...\geq \lambda_m, and V is the matrix of eigenvectors. Taking the n' non-zero eigenvalues to form the diagonal matrix \Lambda_{n'}=diag(\lambda_1,...,\lambda_{n'}) and the corresponding eigenvector matrix V_{n'}, we can obtain Z as
\[Z= \Lambda_{n'}^\frac{1}{2}V_{n'}^T\in \mathbb{R}^{n'\times m}\]
in which each column in Z is a lower dimension vector of a sample.
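To make the procedure concrete, here is a minimal NumPy sketch of the classical MDS steps above (double-center the squared distances to recover B, then eigendecompose it); the function name and the toy example are only illustrative.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS from a pairwise distance matrix D (m x m).

    Follows the derivation above: double-center the squared distances
    to recover B = Z^T Z, then eigendecompose B.
    """
    D2 = D ** 2
    # d_{i.}^2, d_{.j}^2 and d_{..}^2 as defined above
    row_mean = D2.mean(axis=1, keepdims=True)
    col_mean = D2.mean(axis=0, keepdims=True)
    total_mean = D2.mean()
    B = -0.5 * (D2 - row_mean - col_mean + total_mean)

    # Eigendecomposition; keep the largest (positive) eigenvalues
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    L = np.clip(eigvals[idx], 0, None)      # guard against tiny negative values
    V = eigvecs[:, idx]
    Z = np.sqrt(L)[:, None] * V.T           # Z = Lambda^{1/2} V^T, shape (n', m)
    return Z

# Usage: distances between 4 points on a line should be recovered in 1-D
X = np.array([[0.0], [1.0], [3.0], [6.0]])
D = np.abs(X - X.T)
Z = classical_mds(D, n_components=1)
print(Z.shape)  # (1, 4)
```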

2.2 Principal component analysis (PCA)

Principal Component Analysis is one of the most common dimensionality reduction methods. Before considering the details of PCA, however, we should consider the principle behind dimensionality reduction: for data points in an orthogonal space, what kind of hyperplane is the best way to express them?

Such a hyperplane should satisfy two criteria: each data point should be as close to the plane as possible, and the projections of the data points onto the plane should be as spread out as possible.

Suppose the data sample is centered, i.e. \Sigma_i x_i = 0. Let the orthonormal basis after projection be \{w_1,w_2,...,w_d\}, in which ||w_i||_2=1 and w_i^Tw_j=0 for i\neq j. If we want to reduce the dimension to d'<d, then the data point x_i in the lower-dimensional coordinate system becomes z_i = (z_{i1},z_{i2},...,z_{id'}), in which z_{ij} = w_j^Tx_i. From this we can write the reconstruction \hat{x}_i=\Sigma_{j=1}^{d'}z_{ij}w_j.

Considering the training set, the total distance between the original data points x_i and the reconstructed data points \hat{x}_i is
\[\sum_{i=1}^m||\sum_{j=1}^{d'}z_{ij}w_j-x_i||^2 = \sum_{i=1}^mz_i^Tz_i-2\sum_{i=1}^mz_i^TW^Tx_i+const= const - tr(W^T(\sum_{i=1}^mx_ix_i^T)W)\]

So, by the first criterion we proposed, we should minimize this distance. Knowing that the w_j form an orthonormal basis and that \Sigma_i x_ix_i^T is (proportional to) the covariance matrix, we obtain
\[\min_{W} -tr(W^TXX^TW)\] 
such that W^TW = I.

If we start from the second criterion, we should instead maximize the variance of the projected data points. The original data point x_i is projected onto the hyperplane as W^Tx_i, and we want the variance of these projections to be as large as possible:
\[\max_{W} tr(W^TXX^TW)\]
which is equivalent to the objective obtained from the first criterion.

Applying the method of Lagrange multipliers to either objective gives
\[XX^TW = \lambda W\]
so we only need to apply eigenvalue decomposition to XX^T and sort the eigenvalues from large to small, \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_d. The eigenvectors corresponding to the first d' eigenvalues then compose the basis W.
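As a concrete illustration, here is a minimal NumPy sketch of the procedure above (center the data, eigendecompose XX^T, keep the d' leading eigenvectors); the function name and example data are illustrative.

```python
import numpy as np

def pca(X, d_prime):
    """PCA by eigendecomposition of X X^T.

    X is (d, m): each column is a sample, matching the notation above.
    Returns the projection matrix W (d, d') and the projected data Z (d', m).
    """
    # Centre the samples so that sum_i x_i = 0
    X = X - X.mean(axis=1, keepdims=True)
    # Eigendecompose X X^T and keep the d' leading eigenvectors
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    order = np.argsort(eigvals)[::-1][:d_prime]
    W = eigvecs[:, order]          # columns w_1, ..., w_{d'}
    Z = W.T @ X                    # z_i = W^T x_i for every column x_i
    return W, Z

# Usage with random data: 50 samples in 5 dimensions reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))
W, Z = pca(X, d_prime=2)
print(W.shape, Z.shape)  # (5, 2) (2, 50)
```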

2.3 t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is an algorithm mainly used for visualizing and explaining high-dimensional data sets. If we apply t-SNE to an n-dimensional data set, we obtain 2-D or 3-D data that represent the original data reasonably well. t-SNE first builds an affinity matrix S1 from the pairwise Euclidean distances between the original data points, turning the distances into joint probabilities through Gaussian distributions. The data points are then placed randomly in the lower-dimensional space, and the same kind of affinity matrix S2 is built for them, this time using a Student's t-distribution. Finally, the KL divergence between S1 and S2 is used as the loss function, and gradient descent moves the low-dimensional points so that S2 matches S1 as closely as possible.

Suppose we have a data set with x^i\in X such that x^i\in \mathbb{R}^n, and we want to find a lower-dimensional representation y^i\in Y such that y^i\in \mathbb{R}^m where m<n.

We define the conditional probability p_{i|j} as the probability that data point x^j would pick data point x^i as its neighbour. For two data points x^i and x^j, the Euclidean distance between them is
\[D = ||x^i-x^j||_2 = \sqrt{\sum_{k=1}^n (x_k^i-x_k^j)^2}\]

We then convert this distance into a probability density under a Gaussian centered at x^j: we place a Gaussian with mean x^j, put the distance to x^i on the axis, and do the same for all the other points.

This gives the conditional probability p_{i|j} as
\[p_{i|j}=\frac{\exp(-||x^i-x^j||^2/2\sigma_j^2)}{\sum_{k\neq j}\exp(-||x^k-x^j||^2/2\sigma_j^2)}\]

Then we symmetrize the probabilities by writing the distribution P as P=(p_{ij})_{i,j=1}^n where
\[p_{ij}=\frac{p_{i|j}+p_{j|i}}{2n}\]
Similarly, we define the distribution Q in the lower dimension as Q=(q_{ij})^n_{i,j=1}, with
\[q_{ij}=\frac{(1+||y^i-y^j||^2)^{-1}}{\sum_{k\neq l}(1+||y^k-y^l||^2)^{-1}}\]

In this way the neighborhood relation in the low-dimensional space is modeled with a Student's t-distribution (with one degree of freedom).

Our goal of finding the y^i through optimization on P and Q is expressed by the Kullback-Leibler divergence
\[KL(P||Q)=\sum_{i=1}^n\sum_{j=1}^n p_{ij}\log\frac{p_{ij}}{q_{ij}}\]

From the KL divergence we can optimize the target function and retrieve the best Y by gradient descent. The gradient works out to
\[\frac{\partial KL(P||Q)}{\partial y^i} = 4\sum_{j=1}^n(p_{ij}-q_{ij})(y^i-y^j)(1+||y^i-y^j||^2)^{-1}\]

Then we can update Y by
\[Y^{(t)}=Y^{(t-1)}-\alpha \frac{\partial KL(P||Q)}{\partial Y} +\beta(Y^{(t-1)}-Y^{(t-2)})\]
in which \alpha is the learning rate and \beta is the momentum coefficient.
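The following is a compact NumPy sketch of this optimization loop. Note the simplification: it uses one global bandwidth \sigma for every point instead of the per-point bandwidths that a real t-SNE implementation chooses by a perplexity search, so it illustrates the formulas above rather than reproducing a full implementation.

```python
import numpy as np

def tsne_sketch(X, m=2, sigma=1.0, alpha=100.0, beta=0.5, n_iter=500, seed=0):
    """Gradient descent with momentum on KL(P||Q), as derived above.

    Simplification (assumption): one global sigma instead of per-point
    bandwidths chosen by a perplexity search.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances in the original space
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)

    # Conditional probabilities p_{i|j}, then symmetrised p_{ij}
    P_cond = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=0, keepdims=True)   # each column j sums to 1
    P = (P_cond + P_cond.T) / (2 * n)

    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, m))
    Y_prev = Y.copy()

    for _ in range(n_iter):
        # Student-t affinities q_{ij} in the low-dimensional space
        num = 1.0 / (1.0 + np.square(Y[:, None, :] - Y[None, :, :]).sum(-1))
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()

        # Gradient of KL(P||Q) with respect to each y^i
        PQ = (P - Q) * num
        grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y

        # Gradient step with momentum, matching the update rule above
        Y_new = Y - alpha * grad + beta * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y

# Usage: embed 100 points from a 10-D Gaussian into 2-D
X = np.random.default_rng(1).normal(size=(100, 10))
Y = tsne_sketch(X)
print(Y.shape)  # (100, 2)
```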

2.4 UMAP

UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. It provides a very general framework for approaching manifold learning and dimension reduction, but can also provide specific concrete realizations.

UMAP is an algorithm that finds a representation of a given data set D in \mathbb{R}^n in a lower-dimensional space \mathbb{R}^m. We think of the data points as being drawn from some Riemannian manifold M and then mapped into \mathbb{R}^n by some embedding \phi:M\rightarrow \mathbb{R}^n. To find a good map of M we can assume that D is drawn uniformly from M, even though parts of M might be stretched out or compressed under the embedding into \mathbb{R}^n. This assumption means that D approximates M well. (This is also where the 'U' in UMAP comes from.) We will also assume that M is locally connected and that there are enough points in D that no point of D is isolated in its own connected component. These connectivity assumptions imply that every point in D is connected in M to its nearest neighbour in D.

Ideally we want the low-dimensional representation to have a fuzzy topological structure as similar as possible to that of the original data. The first question is how to determine the fuzzy topological structure of a low-dimensional representation, and the second question is how to find a good one.

When constructing the initial fuzzy topological representation we can take a few shortcuts. In practice, since fuzzy set membership strengths decay away to be vanishingly small, we only need to compute them for the nearest neighbors of each point. Ultimately that means we need a way to quickly compute (approximate) nearest neighbors efficiently, even in high dimensional spaces. We can do this by taking advantage of the Nearest-Neighbor-Descent algorithm of Dong et al. The remaining computations are now only dealing with local neighbors of each point and are thus very efficient.

In optimizing the low dimensional embedding we can again take some shortcuts. We can use stochastic gradient descent for the optimization process. To make the gradient descent problem easier it is beneficial if the final objective function is differentiable. We can arrange for that by using a smooth approximation of the actual membership strength function for the low dimensional representation, selecting from a suitably versatile family. In practice UMAP uses the family of curves of the form \frac{1}{1+ax^{2b}}. Equally we don't want to have to deal with all possible edges, so we can use the negative sampling trick (as used by word2vec and LargeVis) to simply sample negative examples as needed. Finally, since the Laplacian of the topological representation is an approximation of the Laplace-Beltrami operator of the manifold, we can use spectral embedding techniques to initialize the low dimensional representation into a good state.
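The parameters a and b of this curve family are usually chosen to match a desired minimum distance between embedded points. The sketch below is an illustration of that idea (not UMAP's own code): it fits a and b with SciPy so that 1/(1+ax^{2b}) approximates a target curve that is flat up to a chosen min_dist and decays exponentially afterwards.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    """Fit a, b in f(x) = 1 / (1 + a * x^(2b)).

    Target (assumption for illustration): membership ~ 1 up to min_dist,
    then exponential decay with scale `spread`.
    """
    def curve(x, a, b):
        return 1.0 / (1.0 + a * x ** (2 * b))

    x = np.linspace(1e-3, 3 * spread, 300)
    target = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    (a, b), _ = curve_fit(curve, x, target)
    return a, b

print(fit_ab(0.1))  # e.g. roughly a ~ 1.6, b ~ 0.9 for these settings
```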

Putting all these pieces together we arrive at an algorithm that is fast and scalable, yet still built out of sound mathematical theory. 
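For practical use, the umap-learn package wraps all of this behind a scikit-learn-style interface. A minimal usage sketch (assuming the package is installed; parameter names follow its documented API) looks like this:

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # 500 samples in 50 dimensions

reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighbourhood used for the fuzzy graph
    min_dist=0.1,      # how tightly points may be packed in the embedding
    n_components=2,    # target dimension
    random_state=42,
)
embedding = reducer.fit_transform(X)
print(embedding.shape)                  # (500, 2)
```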

2.5 Non-negative matrix factorization (NMF)

Non-negative matrix factorization (NMF or NNMF) is another very common method for dimensionality reduction. The main idea of NMF is to decompose the data matrix X into X\approx WH.

Given X \in \mathbb{R}^{d\times n}, if we want to decompose it into H \in \mathbb{R}^{d'\times n} with d'< d, we need to find W\in \mathbb{R}^{d\times d'}, where we require W\geq 0 and H\geq 0 elementwise.

There are many ways to define the loss function; here we use the basic Euclidean (Frobenius) distance, so that every element of WH should be as close as possible to the corresponding element of X:
\[D(X,WH) = ||X-WH||_F^2 = \sum_{i=1}^d\sum_{j=1}^n(X_{ij}-(WH)_{ij})^2\]
Also, we know we can write 

(WH)_{ij}=\sum_{k=1}^{d'}W_{ik}H_{kj}

The problem is then transformed into finding W and H that minimize the distance function D(X,WH):
\[\min_{W,H}||X-WH||_F^2 = \sum_{i=1}^d\sum_{j=1}^n(X_{ij}-\sum_{k=1}^{d'}W_{ik}H_{kj})^2\]

We can use Lagrange multipliers to turn this into an unconstrained optimization problem:
\[\min D = \sum_{i=1}^d\sum_{j=1}^n(X_{ij}-\sum_{k=1}^{d'}W_{ik}H_{kj})^2 - \sum_{i=1}^d\sum_{k=1}^{d'} \lambda_{ik}W_{ik} - \sum_{k=1}^{d'}\sum_{j=1}^n v_{kj}H_{kj}\]

Taking the partial derivatives of D:
\[\frac{\partial D}{\partial W_{ik}}=\sum_{j=1}^{n}2(X_{ij}-\sum_{k=1}^{d'}W_{ik}H_{kj})\cdot(-H_{kj})-\lambda_{ik}\]
\[\frac{\partial D}{\partial H_{kj}}=\sum_{i=1}^{d}2(X_{ij}-\sum_{k=1}^{d'}W_{ik}H_{kj})\cdot(-W_{ik})-v_{kj}\]

Setting the derivatives to zero and applying the KKT complementary-slackness conditions \lambda_{ik}W_{ik}=0 and v_{kj}H_{kj}=0 (i.e. multiplying each stationarity equation by W_{ik} or H_{kj} respectively and dropping the constant factor 2), we retrieve:
\[\sum_{j=1}^n(X-WH)_{ij}\cdot(H^T)_{jk}\cdot W_{ik}=0\]
\[\sum_{i=1}^d(W^T)_{ki}\cdot(X-WH)_{ij}\cdot H_{kj}=0\]

Since W_{ik} does not depend on j, and \Sigma_{i=1}^d(W^T)_{ki}\cdot(X-WH)_{ij} is equal to [W^T(X-WH)]_{kj}, we can reorganize the equations as
\[[(X-WH)H^T]_{ik}W_{ik}=0\]
\[[W^T(X-WH)]_{kj}H_{kj}=0\]

We separate X and WH onto the two sides of each equation:
\[(XH^T)_{ik}W_{ik}=(WHH^T)_{ik}W_{ik}\]
\[(W^TX)_{kj}H_{kj}=(W^TWH)_{kj}H_{kj}\]

From these we obtain the rules for updating W using X and H, and for updating H using X and W:
\[W_{ik}^{new} = \frac{(XH^T)_{ik}}{(W^{old}HH^T)_{ik}}W_{ik}^{old}\]
\[H_{kj}^{new} = \frac{(W^TX)_{kj}}{(W^TWH^{old})_{kj}}H_{kj}^{old}\]
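Alternating these two rules gives the classic multiplicative-update algorithm of Lee and Seung. Below is a minimal NumPy sketch of it; the small epsilon that avoids division by zero is an implementation detail added here.

```python
import numpy as np

def nmf(X, d_prime, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for X ~ W H with W, H >= 0.

    X is (d, n); W is (d, d'); H is (d', n), matching the notation above.
    """
    d, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((d, d_prime))
    H = rng.random((d_prime, n))

    for _ in range(n_iter):
        # Update rules derived above; eps avoids division by zero
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H

# Usage: factor a random non-negative 20 x 100 matrix with inner dimension 5
X = np.abs(np.random.default_rng(1).normal(size=(20, 100)))
W, H = nmf(X, d_prime=5)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative reconstruction error
```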

3. Summary


In general, there are two types of feature extraction methods: linear and non-linear transformations. Linear methods, such as PCA, construct a new set of dimensions or latent variables as linear combinations of the original features. Non-linear methods, such as t-distributed Stochastic Neighbor Embedding (t-SNE), retain local similarities between samples at the cost of not retaining the similarities between dissimilar samples. As a result, t-SNE preserves local structure well, since the embedding is not compressed by the large dissimilarities in the data set. A major advantage is that such non-linear mappings are ideal for low-dimensional representations and visualizations. Linear reduction methods such as PCA, on the other hand, are preferably used to reduce complexity, improve run time, determine feature importance, and, last but not least, mitigate the curse of dimensionality. Notably, PCA can also be very informative when used for visualization purposes.

As for the difference between PCA and NMF: PCA produces new features as combinations of the existing ones, while NMF decomposes the data matrix into two non-negative sub-matrices whose dimensionalities differ from the original. PCA is recommended when you have to transform high dimensions into low dimensions and are willing to lose the original features in the process, since new ones are introduced. PCA is also suitable for linearly separable data, but for complex or non-linear data sets modifications of the PCA algorithm are necessary. Finally, PCA is suitable when we have to establish geometric separability between data tuples, and it is well suited to building classification models. The output of NMF can be viewed as a compressed version of the original data set. While PCA requires modifications for non-linear data sets, NMF does not require any modification irrespective of linear separability. NMF is very good for word and vocabulary recognition, since compressing the original text into smaller data vectors is straightforward. It is also a good choice for image processing, text mining, transcription processes, and encoding/decoding tasks, and it can handle the decomposition of hard-to-interpret data objects such as videos, music, and images.

Copyright notice: This is an original article by the blog author, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/A_L_S/article/details/127992861
