Dimensionality Reduction for Text
Introduction: Text-based queries can be represented as term vectors, which can then be used to search for their nearest neighbors in a document collection. However, for any nontrivial document database, the number of terms $T$ and the number of documents $D$ are usually quite large. Such high dimensionality makes computation inefficient, since the resulting frequency table has size $T \times D$. Furthermore, it leads to very sparse vectors and makes it harder to detect and exploit relationships among terms (e.g., synonymy). To overcome these problems, dimensionality reduction techniques such as latent semantic indexing, probabilistic latent semantic analysis, and locality preserving indexing can be used.
We now briefly introduce these methods. To explain the basic ideas behind latent semantic indexing and locality preserving indexing, we need some matrix and vector notation. In what follows, we use $x_1, \ldots, x_n \in \mathbb{R}^m$ to represent the $n$ documents with $m$ features (words). They can be collected into a term-document matrix $X = [x_1, x_2, \ldots, x_n]$.
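For concreteness, here is a minimal sketch of building such a term-document matrix (it assumes scikit-learn is available; the three-document corpus is purely illustrative):

```python
# Build the term-document matrix X = [x1, ..., xn] from a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()
# fit_transform returns a sparse (documents x terms) count matrix;
# transpose it so that each *column* is a document, matching X = [x1, ..., xn].
X = vectorizer.fit_transform(docs).T.toarray()
print(X.shape)   # (m terms, n documents); here n = 3
```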
Latent Semantic Indexing: Latent semantic indexing (LSI) is one of the most popular algorithms for document dimensionality reduction. It is fundamentally based on the singular value decomposition (SVD). Suppose the rank of the term-document matrix $X$ is $r$; then LSI decomposes $X$ using SVD as follows:
$$X = U S V^T$$
where $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$ are the singular values of $X$, $U = [a_1, \ldots, a_r]$ with $a_i$ called a left singular vector, and $V = [v_1, \ldots, v_r]$ with $v_i$ called a right singular vector. LSI uses the first $k$ vectors in $U$ as the transformation matrix to embed the original documents into a $k$-dimensional subspace. It can easily be checked that the column vectors of $U$ are the eigenvectors of $XX^T$. The basic idea of LSI is to extract the most representative features while minimizing the reconstruction error. Let $a$ be the transformation vector. The objective function of LSI can be stated as follows:

$$a^* = \arg\min_a \|X - a a^T X\|^2 = \arg\max_a \, a^T X X^T a$$
with the constraint,
$$a^T a = 1$$
Since $XX^T$ is symmetric, the basis functions of LSI are orthogonal.
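As an illustration, the following NumPy sketch (with arbitrary matrix sizes and an arbitrary $k$; it is not tied to any particular corpus) performs LSI as a truncated SVD followed by projection onto the top-$k$ left singular vectors:

```python
# LSI sketch: truncated SVD, then projection onto the first k
# left singular vectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 40))          # toy term-document matrix: m=500 terms, n=40 docs

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ np.diag(s) @ Vt

k = 5
Uk = U[:, :k]                      # transformation matrix: first k left singular vectors
X_reduced = Uk.T @ X               # each document becomes a k-dimensional vector
print(X_reduced.shape)             # (5, 40)

# Sanity check: the columns of U are eigenvectors of X X^T,
# with eigenvalues sigma_i^2.
print(np.allclose((X @ X.T) @ Uk, Uk * s[:k] ** 2))   # True
```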
Locality Preserving Indexing: Different from LSI, which aims to extract the most representative features, Locality Preserving Indexing (LPI) aims to extract the most discriminative features. The basic idea of LPI is to preserve locality information; that is, if two documents are near each other in the original document space, LPI tries to keep them close together in the reduced-dimensionality space. Since neighboring documents (data points in a high-dimensional space) probably relate to the same topic, LPI can map documents with the same semantics as close to each other as possible.
Given the document set $x_1, \ldots, x_n \in \mathbb{R}^m$, LPI constructs a similarity matrix $S \in \mathbb{R}^{n \times n}$. The transformation vectors of LPI can be obtained by solving the following minimization problem:

$$a^* = \arg\min_a \sum_{i,j} \left( a^T x_i - a^T x_j \right)^2 S_{ij} = \arg\min_a \, a^T X L X^T a$$
with the constraint,
$$a^T X D X^T a = 1,$$
where $L = D - S$ is the graph Laplacian and $D_{ii} = \sum_j S_{ij}$. $D_{ii}$ measures the local density around $x_i$. LPI constructs the similarity matrix $S$ as

$$S_{ij} = \begin{cases} \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}, & \text{if } x_i \text{ is among the } p \text{ nearest neighbors of } x_j \text{ or } x_j \text{ is among the } p \text{ nearest neighbors of } x_i \\ 0, & \text{otherwise} \end{cases}$$
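The following NumPy/SciPy sketch puts the pieces together. It is an illustration under stated assumptions, not a reference implementation: the sizes, neighborhood size $p$, and the tiny ridge term are arbitrary, and the toy setting uses more documents than terms so that $X D X^T$ is nonsingular (practical LPI implementations typically first project the documents into the SVD subspace for the same reason).

```python
# LPI sketch: cosine p-nearest-neighbor similarity graph, graph
# Laplacian, then the smallest generalized eigenvectors of
# (X L X^T) a = lambda (X D X^T) a.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m, n, p, k = 20, 100, 5, 2          # terms, documents, neighbors, target dim
X = rng.random((m, n))              # toy term-document matrix

# Cosine similarities between all pairs of documents.
Xn = X / np.linalg.norm(X, axis=0)
C = Xn.T @ Xn                       # (n, n)

# Keep only edges to each document's p nearest neighbors.
S = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(C[i])[::-1][1:p + 1]   # skip the document itself
    S[i, nbrs] = C[i, nbrs]
S = np.maximum(S, S.T)              # "x_i near x_j or x_j near x_i"

D = np.diag(S.sum(axis=1))          # D_ii = sum_j S_ij, local density
L = D - S                           # graph Laplacian

# Minimize a^T X L X^T a subject to a^T X D X^T a = 1: take the
# smallest generalized eigenvectors. A tiny ridge keeps the
# right-hand matrix positive definite for the solver.
A = X @ L @ X.T
B = X @ D @ X.T + 1e-9 * np.eye(m)
_, V = eigh(A, B)                   # eigenvalues in ascending order
a = V[:, :k]                        # the k LPI transformation vectors
X_lpi = a.T @ X                     # documents embedded in k dimensions
print(X_lpi.shape)                  # (2, 100)
```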