Any sequence of symbols (or characters) drawn from an alphabet constitutes a text. A large portion of the information available in electronic form around the world is actually text (other popular forms are structured and multimedia information).
Books, journals, articles, newspapers, jurisprudence databases, corporate information, and the Web are all examples of text documents. A text database is a system that keeps track of a large text collection and makes it accessible in a timely and accurate manner.
Textual data is used in an increasing number of business and research situations, and Information Systems (IS) researchers come across it in a variety of settings.
Researchers in the IS discipline, for example, are interested in examining titles, abstracts, or full-text bodies of IS publications in order to identify attributes related to the nature of the research, such as research topics, theories, and methods.
There are two ways to search for information in text documents:
Semantic search: A text search based on the meaning rather than the syntax of natural language text.
Syntactic search: A text search that is based on string patterns found in the text, rather than on meaning.
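The distinction can be illustrated with a minimal sketch of syntactic search: a plain string-pattern match (here using Python's re module on made-up example sentences) retrieves every document containing the pattern, regardless of which sense of the word is meant. A semantic search would instead have to distinguish the financial and riverside senses of "bank".

```python
import re

# Hypothetical example documents; "bank" is used in two different senses.
documents = [
    "The bank raised interest rates on savings accounts.",
    "We sat on the river bank watching the water.",
    "Central banks influence inflation through policy.",
]

# Syntactic search: match the literal string pattern "bank"/"banks",
# with no regard for meaning.
pattern = re.compile(r"\bbanks?\b", re.IGNORECASE)
matches = [doc for doc in documents if pattern.search(doc)]

print(len(matches))  # all three documents match on syntax alone
```

All three documents are returned, even though only two of them concern finance, which is exactly the gap that semantic techniques such as LSA try to close.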
Latent semantic analysis
Latent semantic analysis (LSA) is a mathematical method for computer modelling and simulation of the meaning of words and passages in natural text corpora. Many aspects of human language learning and comprehension are well represented by LSA.
By treating the meaning of complex wholes (passages) as an additive function of the meaning of their component parts (words), LSA can be applied to problems such as information retrieval (one of the text mining techniques), educational technology, and pattern recognition.
Latent Semantic Analysis (LSA) is a type of natural language processing that looks at how documents and the terms they contain are related. It searches unstructured data for hidden relationships between terms and concepts using singular value decomposition, a mathematical technique.
Origin and goal of LSA
Latent Semantic Analysis is an information retrieval technique that was patented in 1988, despite its origins dating back to the 1960s.
Automated document categorization and concept searching are the main applications of LSA. It's also used in software engineering (to decode source code), publishing (text summarization), SEO, and other fields. There are several flaws in Latent Semantic Analysis, the most significant of which is its inability to capture polysemy (multiple meanings of a word).
The vector representation in this case is an average of all the meanings of the word in the corpus. Because of this, comparing documents is difficult. In the world of search engine optimization, Latent Semantic Indexing (LSI) is a term that is frequently used in place of Latent Semantic Analysis.
On-page SEO, according to some SEO experts, can benefit from LSI. However, given that there are more recent and elegant approaches to natural language processing, the effectiveness of LSI in optimising content for search is questioned.
The goal of latent semantic analysis is to generate text representations based on latent features, or topics, while also reducing the dimensionality of the original text-based data set. The first step in latent semantic analysis is to create a document-term matrix; the second is to apply singular value decomposition to that matrix.
The document-term matrix represents each text document as a vector in term space. Singular value decomposition then reduces the text's dimensionality by encoding it in a smaller set of latent features; in latent semantic analysis, these latent features correspond to topics in the original text data.
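The two steps just described, building a document-term matrix and then decomposing it with SVD, can be sketched as follows. This is a minimal illustration with NumPy on a made-up four-document corpus; real pipelines typically apply tf-idf weighting and work with far larger vocabularies.

```python
import numpy as np

# Made-up four-document corpus (illustrative data only).
docs = [
    "human computer interaction",
    "user interface computer system",
    "graph tree network",
    "network graph minors",
]

# Step 1: build the document-term (count) matrix.
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}
A = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        A[i, col[word]] += 1

# Step 2: singular value decomposition, truncated to k latent features.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]  # each document as a point in the 2-D latent space

print(doc_vecs.shape)  # (4, 2): four documents, two latent features (topics)
```

Each document, originally a ten-dimensional count vector, is now described by just two latent features, and documents about the same rough topic land near each other in that reduced space.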
LSA in computational semantics
Latent semantic analysis (LSA) is a method of automatic indexing and retrieval that attempts to solve these issues by mapping documents and terms to a representation in the so-called latent semantic space. LSA usually takes the (high-dimensional) vector space representation of documents based on term frequencies as a starting point and applies a dimension-reducing linear projection. The specific form of this mapping is determined by the given document collection: it is based on a singular value decomposition (SVD, for short) of the corresponding term-document matrix.
The general claim is that similarities between documents, or between a document and a query, can be estimated more reliably in the reduced latent space representation than in the original high-dimensional representation.
The most important consequence is that documents containing frequently co-occurring terms will have similar representations in the latent space, even if they share no terms at all. LSA thus functions as a form of noise reduction and can help find synonyms and words that refer to the same subject, even when the relationship is tenuous, implicit, and not obviously identified (as it would be in ESA).
In short, LSA's main idea is to map documents (and, by symmetry, terms) to a vector space of reduced dimensionality, the latent semantic space. This mapping is obtained by a matrix decomposition.
The decomposition factorizes the term-document matrix N in the canonical form N = UΣVᵀ, where U and V are orthogonal matrices (i.e., UᵀU = VᵀV = I) and the diagonal matrix Σ stores the singular values of N. In latent semantic indexing, the original document vectors are replaced by representations in the low-dimensional latent space, and similarity is calculated based on this representation.
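To make the factorization concrete, the sketch below (with illustrative counts, not real data) computes the SVD of a small term-document matrix N, keeps k = 2 latent dimensions, folds a query into the latent space via q̂ = Σₖ⁻¹Uₖᵀq, and compares documents by cosine similarity there.

```python
import numpy as np

# Illustrative term-document matrix N (terms x documents).
# Rows: "human", "computer", "interface", "graph", "network"
# Cols: doc0 = "human computer", doc1 = "computer interface", doc2 = "graph network"
N = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

# Canonical factorization N = U @ diag(s) @ Vt with orthonormal columns.
U, s, Vt = np.linalg.svd(N, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents in the latent space: the columns of diag(sk) @ Vtk.
doc_latent = (np.diag(sk) @ Vtk).T            # shape (3, k)

# Fold a query into the same space: q_hat = Sigma_k^-1 @ Uk^T @ q.
q = np.array([0, 1, 1, 0, 0], dtype=float)    # query: "computer interface"
q_latent = np.diag(1.0 / sk) @ Uk.T @ q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cosine(q_latent, d) for d in doc_latent]
print([round(x, 3) for x in sims])  # docs 0 and 1 tie; doc 2 is orthogonal
```

Documents 0 and 1 receive identical latent representations even though they share only the term "computer", illustrating the co-occurrence effect described above: in the reduced space, similarity is driven by shared latent topics rather than exact term overlap.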
A related idea comes from the theory of semantic primes: according to this theory, the semantic primes form a natural semantic metalanguage (NSM, for short) and carry predetermined meanings, which are then used to deduce the meanings of other words. Complex ideas are broken down into these primes, and the decomposition is then used to classify the data, creating equivalence classes by encapsulating concepts within their meanings.
Advantages and disadvantages of LSA
Although Latent Semantic Analysis has a lot of potential, it also has some drawbacks. Both sides of LSA must be understood in order to know when to use it and when to try something else.
In LSA, a given concept and all documents related to it are likely to receive similar representations, even when they use different vocabulary.
LSA recovers the latent semantic structure of the space rather than relying on its original term dimensions, and these new dimensions provide a more accurate representation of documents and queries.
By using this reduced representation, LSA can remove some "noise" from the data, where noise means uncommon and insignificant uses of certain terms.
By definition, LSA factors are orthogonal, so data is placed in the reduced space in a way that reflects the correlations in their use across documents and aids retrieval.
LSA vectors require substantial storage. Although electronic storage media have advanced considerably, the loss of sparsity in the reduced representation remains a serious drawback for large data sets.
LSA performs well for long documents thanks to the small number of context vectors used to describe each document. However, for very large collections the additional storage space and computing time required reduce LSA's efficiency.
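The sparsity concern can be demonstrated directly. In the small hand-made sketch below, the original count matrix is mostly zeros, while its rank-k LSA reconstruction fills in previously zero entries and must be stored as a dense matrix.

```python
import numpy as np

# Small hand-made term-document count matrix: mostly zeros, like real text data.
A = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k LSA reconstruction

sparse_nonzeros = int(np.count_nonzero(A))
dense_nonzeros = int(np.count_nonzero(np.abs(A_k) > 1e-12))

# The reconstruction fills in entries that were zero in the original matrix,
# so the reduced representation loses the sparsity of the raw counts.
print(sparse_nonzeros, dense_nonzeros)
```

Real term-document matrices are vastly larger and even sparser, so this loss of sparsity is precisely why LSA's storage and compute costs grow so quickly with collection size.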