Introduction
This tutorial will guide you through understanding and implementing Latent Semantic Indexing (LSI) from scratch. LSI is a powerful technique used in information retrieval and natural language processing to find hidden relationships between words and documents.
By the end of this tutorial, you'll understand:
- How LSI works mathematically
- How to implement LSI using Python
- How to use LSI for document ranking and search
- How to visualize document relationships in concept space
What is Latent Semantic Indexing?
Latent Semantic Indexing (LSI) is a technique that:
- Helps retrieve relevant information from large document collections
- Was developed for information retrieval in the late 1980s
- Determines document similarity based on the contexts in which words appear
- Allows search engines to understand page content beyond exact keyword matching
- Focuses on "Themes" instead of just "Keywords"
Key Benefits
- Finds documents that are conceptually similar even if they don't share exact keywords
- Reduces the dimensionality of the document space
- Handles synonymy (different words with similar meanings)
- Partially addresses polysemy (words with multiple meanings)
Step-by-Step Implementation
Step 1: Import Required Libraries
import pandas as pd
import math
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from string import punctuation
import nltk
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
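Depending on your environment, you may also need to download the NLTK stopwords corpus once before running Step 2; a one-time setup step, assuming a standard NLTK installation:
# Download the stopwords corpus used during preprocessing (only needed once)
nltk.download('stopwords')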
Step 2: Data Loading and Preprocessing
Choose Your Dataset
# Select a news group category for analysis
news_group = "sci.med" # Medical science articles
n_docs = 40 # Number of documents to analyze
# Available categories include:
# 'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
# 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
# 'misc.forsale', 'rec.autos', 'rec.motorcycles',
# 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
# 'sci.electronics', 'sci.med', 'sci.space',
# 'soc.religion.christian', 'talk.politics.guns',
# 'talk.politics.mideast', 'talk.politics.misc',
# 'talk.religion.misc'
Load and Clean the Data
# Load the dataset
categories = [news_group]
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    shuffle=True
)
# Define text cleaning functions
stopwords = nltk.corpus.stopwords.words('english')
def strip_punctuation(s):
    """Remove punctuation from text"""
    return ''.join(c for c in s if c not in punctuation)

def strip_nums(s):
    """Remove numbers from text"""
    return ''.join([i for i in s if not i.isdigit()])
# Process documents: tokenize, clean, and remove stopwords
all_docs = [
    nltk.tokenize.wordpunct_tokenize(
        strip_nums(strip_punctuation(twenty_train.data[i]).lower())
    )
    for i in range(n_docs)
]
# Create bag of words (bow) by removing stopwords
bow = [
    [word for word in doc if word not in stopwords]
    for doc in all_docs
]
# Filter out empty documents
bow = list(filter(None, bow))
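As a quick optional sanity check, assuming the variables defined above, you can confirm how many documents survived cleaning and peek at the first few tokens:
# Optional sanity check: number of cleaned documents and a sample of tokens
print(f"Cleaned documents: {len(bow)}")
print(f"First tokens of document 0: {bow[0][:10]}")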
Step 3: Create Term-Document Matrix
Build Vocabulary
def unique(bow):
    """Create unique vocabulary from all documents"""
    vocabulary = bow[0]
    for i in range(1, len(bow)):
        vocabulary = set(vocabulary).union(set(bow[i]))
    return vocabulary
# Create vocabulary and initialize term-document matrix
wordset = unique(bow)
worddict = [dict.fromkeys(wordset, 0) for i in range(len(bow))]
Calculate Term Frequencies
def term_document_matrix():
    """Create term-document frequency matrix"""
    for bow_i, worddict_i in zip(bow, worddict):
        for word in bow_i:
            worddict_i[word] += 1
    return pd.DataFrame(worddict)
# Create the term-document matrix
docterm = term_document_matrix()
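Note that because each dictionary in worddict describes one document, the resulting DataFrame has one row per document and one column per vocabulary term. A quick, optional check of the orientation:
# Rows correspond to documents, columns to vocabulary terms
print(f"Matrix shape (documents x terms): {docterm.shape}")
print(docterm.iloc[:3, :5])  # Peek at the first few entries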
Step 4: Calculate TF-IDF Matrix
Term Frequency (TF) Calculation
def term_freq(worddict, bow):
    """Calculate term frequency for a document"""
    tfdict = {}
    bowcount = len(bow)
    for word, count in worddict.items():
        tfdict[word] = count / float(bowcount)
    return tfdict
# Calculate TF for all documents
tfbow = [term_freq(worddict_i, bow_i) for worddict_i, bow_i in zip(worddict, bow)]
Inverse Document Frequency (IDF) Calculation
def idf(doclist):
    """Calculate inverse document frequency"""
    n = len(doclist)
    # Initialize IDF dictionary
    idfdict = dict.fromkeys(doclist[0].keys(), 0)
    # Count documents containing each word
    for doc in doclist:
        for word, val in doc.items():
            if val > 0:
                idfdict[word] += 1
    # Calculate IDF: log(total_docs / docs_containing_word)
    for word, val in idfdict.items():
        idfdict[word] = math.log(n / float(val))
    return idfdict
# Calculate IDF values
idfs = idf(worddict)
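For intuition, with n = 40 documents a term that appears in, say, 4 of them gets an IDF of log(40/4) ≈ 2.30, while a term appearing in every document gets log(40/40) = 0. A quick check (the counts here are illustrative, not taken from the dataset):
# Illustrative IDF values (natural log, as used by math.log above)
print(math.log(40 / 4.0))   # ≈ 2.303 for a rarer term
print(math.log(40 / 40.0))  # 0.0 for a term present in every document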
TF-IDF Calculation
def tfidf(tfbow, idfs):
    """Calculate TF-IDF scores"""
    tfidf_dict = {}
    for word, tf_val in tfbow.items():
        tfidf_dict[word] = tf_val * idfs[word]
    return tfidf_dict
# Calculate TF-IDF for all documents
tfidf_scores = [tfidf(tf_doc, idfs) for tf_doc in tfbow]
# Create TF-IDF matrix (terms × documents)
X = pd.DataFrame(tfidf_scores).T
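After the transpose, X has one row per vocabulary term and one column per document, which is the orientation the SVD step below expects; an optional check:
# Rows = terms, columns = documents
print(f"TF-IDF matrix shape (terms x documents): {X.shape}")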
Step 5: Apply Singular Value Decomposition (SVD)
# Perform SVD on the TF-IDF matrix
L, S, R = np.linalg.svd(X)
# L: Left singular vectors (term-concept matrix)
# S: Singular values (concept strengths)
# R: Right singular vectors (concept-document matrix)
Key Concepts Explained
Understanding SVD Components
- L Matrix (Left Singular Vectors):
  - Shape: (terms × concepts)
  - Shows the relationship between terms and concepts
  - Each column represents a concept in term space
- S Array (Singular Values):
  - Shape: (concepts,)
  - The diagonal of Σ, returned by NumPy as a 1-D array of singular values
  - Larger values = more important concepts
- R Matrix (Right Singular Vectors, Vᵀ):
  - Shape: (concepts × documents)
  - Shows the relationship between concepts and documents
  - Each row represents a concept in document space (a quick shape check follows this list)
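One practical note: np.linalg.svd returns the full decomposition by default (full_matrices=True), so L is square (terms × terms) and R is square (documents × documents); only the first len(S) columns of L and rows of R correspond to concepts. A quick shape check, assuming the matrices from Step 5:
# With the default full_matrices=True:
#   L: (terms, terms), S: (min(terms, docs),), R: (docs, docs)
print(f"L: {L.shape}, S: {S.shape}, R: {R.shape}")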
Low-Rank Approximation
LSI uses dimensionality reduction by keeping only the most important concepts:
def top_n_pad_0(U, S, V, n):
    """Create a matrix with top n singular values, padding the rest with zeros"""
    t = list(S[:n])
    for i in range(len(S) - n):
        t.append(0)
    A = np.diag(t)
    # Handle matrix dimension differences
    newrow = [0] * len(S)
    if len(U) > len(V):
        for i in range(len(U) - len(S)):
            A = np.vstack([A, newrow])
    else:
        for i in range(len(V) - len(S)):
            A = np.vstack([A.T, newrow]).T
    return A
def reconstruct(u, s, v, n):
    """Reconstruct matrix using top n concepts"""
    A = top_n_pad_0(u, s, v, n)
    return np.round((u.dot(A)).dot(v), decimals=3)
Choosing Optimal Number of Concepts
def frobenius(a, a2):
    """Relative Frobenius norm of the difference between original and reconstructed matrices"""
    a = np.array(a)
    return (np.sqrt(np.sum((a - a2)**2))) / np.sqrt(np.sum(a**2))

def find_k():
    """Find optimal number of concepts using a Frobenius norm threshold"""
    for i in range(1, len(S)):
        f = frobenius(X, reconstruct(L, S, R, i))
        if f < 0.38:  # Threshold for acceptable reconstruction error
            return i
    return len(S)  # Fall back to full rank if the threshold is never reached
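A minimal usage example, assuming the matrices and functions defined above:
# Pick the number of concepts that first brings the relative error under the threshold
k = find_k()
print(f"Selected number of concepts: {k}")
print(f"Reconstruction error at k={k}: {frobenius(X, reconstruct(L, S, R, k)):.4f}")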
Practical Examples
Document Search and Ranking
def search(query):
    """Search for documents relevant to query"""
    # Clean and tokenize query
    q = strip_punctuation(query)
    q = q.lower().split(" ")
    # Convert query to vector space
    terms = X.index
    query_vector = np.array([1 if term in q else 0 for term in terms])
    if np.count_nonzero(query_vector == 1) == 0:
        print("Keywords don't match with documents")
    else:
        # Calculate similarity scores using LSI space
        k = find_k()  # Optimal number of concepts
        reconstructed_matrix = reconstruct(L, S, R, k)
        scores = query_vector.dot(reconstructed_matrix)
        # Sort documents by relevance score
        ranked_docs = sorted(
            zip(range(1, len(scores) + 1), scores),
            key=lambda x: x[1],
            reverse=True
        )
        print("Documents ranked by relevance:")
        for doc_id, score in ranked_docs:
            print(f"Document-{doc_id} (Score: {score:.3f})")
# Example usage
search("Sky is high.")
Concept Analysis
def print_concepts(n_concepts, n_keywords):
    """Print top keywords for each concept"""
    terms = X.index
    for i, component in enumerate(L.T[:n_concepts]):
        term_weights = zip(terms, component)
        top_terms = sorted(term_weights, key=lambda x: x[1], reverse=True)[:n_keywords]
        print(f"Concept {i+1}:")
        for term, weight in top_terms:
            print(f"  {term} ({weight:.3f})")
        print()
# Example: Show top 5 concepts with 10 keywords each
print_concepts(5, 10)
def print_concept_docs(n_docs):
    """Print documents most associated with each concept"""
    doc_ids = X.columns
    for i, component in enumerate(R):
        doc_weights = zip(doc_ids, component)
        top_docs = sorted(doc_weights, key=lambda x: x[1], reverse=True)[:n_docs]
        print(f"Documents most related to Concept {i+1}:")
        for doc_id, weight in top_docs:
            print(f"  Document-{doc_id} ({weight:.3f})")
        print()
# Example: Show top 2 documents for each concept
print_concept_docs(2)
Advanced Features
Visualization Techniques
2D Document Visualization
def plot_docs_2d():
    """Plot documents in 2D concept space"""
    # Project documents into concept space (rows = documents, columns = concepts)
    doc_concepts = R.T.dot(np.diag(S))
    concept1 = list(doc_concepts[:, 0])  # Each document's coordinate on the first concept
    concept2 = list(doc_concepts[:, 1])  # Each document's coordinate on the second concept
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111)
    # Scatter plot
    ax.scatter(concept1, concept2)
    # Label each point with document ID
    for x, y, doc_id in zip(concept1, concept2, range(len(concept1))):
        ax.text(x, y, str(doc_id))
    plt.xlabel("Concept 1")
    plt.ylabel("Concept 2")
    plt.title("Documents in 2D Concept Space")
    plt.grid(True)
    plt.show()
plot_docs_2d()
3D Document Visualization
def plot_docs_3d():
    """Plot documents in 3D concept space"""
    # Project documents into concept space (rows = documents, columns = concepts)
    doc_concepts = R.T.dot(np.diag(S))
    concept1 = list(doc_concepts[:, 0])
    concept2 = list(doc_concepts[:, 1])
    concept3 = list(doc_concepts[:, 2])
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    # 3D scatter plot
    ax.scatter(concept1, concept2, concept3)
    # Label each point
    for x, y, z, doc_id in zip(concept1, concept2, concept3, range(len(concept1))):
        ax.text(x, y, z, str(doc_id))
    ax.set_xlabel("Concept 1")
    ax.set_ylabel("Concept 2")
    ax.set_zlabel("Concept 3")
    ax.set_title("Documents in 3D Concept Space")
    plt.show()
plot_docs_3d()
Performance Analysis
def calculate_reconstruction_error(original, reconstructed):
    """Calculate how well the reduced model represents original data"""
    return frobenius(original, reconstructed)

def analyze_concept_importance():
    """Analyze the importance of different numbers of concepts"""
    errors = []
    concept_counts = range(1, min(10, len(S)))
    for k in concept_counts:
        reconstructed = reconstruct(L, S, R, k)
        error = frobenius(X, reconstructed)
        errors.append(error)
        print(f"Concepts: {k}, Reconstruction Error: {error:.4f}")
    return concept_counts, errors
# Analyze concept importance
concept_counts, errors = analyze_concept_importance()
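It can also help to visualize how the reconstruction error falls as more concepts are kept; a small optional plot using the values computed above:
# Plot reconstruction error versus number of retained concepts
plt.figure(figsize=(8, 5))
plt.plot(list(concept_counts), errors, marker='o')
plt.xlabel("Number of concepts (k)")
plt.ylabel("Relative reconstruction error")
plt.title("Reconstruction Error vs. Number of Concepts")
plt.grid(True)
plt.show()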
Mathematical Foundation
The LSI Process
- Document Collection: Start with a collection of documents D = {d₁, d₂, ..., dₙ}
- Term-Document Matrix: Create matrix A where:
  - Rows represent terms (words)
  - Columns represent documents
  - A[i,j] = TF-IDF score of term i in document j
- SVD Decomposition: A = U × Σ × Vᵀ where:
  - U: m × r matrix (terms × concepts)
  - Σ: r × r diagonal matrix (concept strengths)
  - Vᵀ: r × n matrix (concepts × documents)
- Dimensionality Reduction: Keep only the top k concepts:
  - A_k = U_k × Σ_k × V_kᵀ
  - k ≪ min(m, n)
- Query Processing: For a query q (a code sketch of this step follows the list):
  - Convert it to a term vector q_v
  - Project it into concept space: q_c = q_v × U_k × Σ_k⁻¹
  - Calculate similarities: similarities = q_c × V_kᵀ
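A minimal sketch of this query-processing step, assuming the L, S, R, X, and find_k() objects defined earlier (the function name fold_in_query is illustrative, not part of the code above):
def fold_in_query(query, k=None):
    """Project a raw text query into the k-dimensional concept space and score documents"""
    if k is None:
        k = find_k()
    # Build the binary term vector q_v over the vocabulary
    terms = X.index
    tokens = strip_punctuation(query).lower().split()
    q_v = np.array([1.0 if term in tokens else 0.0 for term in terms])
    # Project into concept space: q_c = q_v x U_k x Sigma_k^-1
    U_k = L[:, :k]                    # terms x k
    S_k_inv = np.diag(1.0 / S[:k])    # k x k
    q_c = q_v.dot(U_k).dot(S_k_inv)
    # Score documents: similarities = q_c x V_k^T (R already holds V^T)
    V_k_T = R[:k, :]                  # k x documents
    return q_c.dot(V_k_T)

# Example: scores[i] is the relevance of document i to the query
scores = fold_in_query("Sky is high.")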
Conclusion
What You've Learned
- LSI Fundamentals: Understanding how LSI captures semantic relationships between documents and terms
- Implementation Skills: Building LSI from scratch using Python and fundamental linear algebra operations
- Mathematical Concepts: SVD, dimensionality reduction, and matrix operations in information retrieval
- Practical Applications: Document search, ranking, and visualization in reduced concept space
Key Takeaways
- LSI goes beyond keyword matching by finding conceptual relationships
- SVD is the mathematical foundation that enables semantic understanding
- Dimensionality reduction helps focus on the most important concepts
- Visualization techniques help understand document relationships
Next Steps
- Experiment with different datasets to see how LSI performs across domains
- Try different numbers of concepts to optimize performance for your use case
- Compare with other techniques like Latent Dirichlet Allocation (LDA) or modern transformers
- Scale to larger datasets using efficient sparse matrix implementations
Real-World Applications
- Search engines: Improving search result relevance
- Document clustering: Grouping similar documents
- Recommendation systems: Finding related content
- Text mining: Discovering hidden topics in large text collections
This tutorial provides a comprehensive foundation for understanding and implementing Latent Semantic Indexing. The mathematical concepts, practical implementation, and visualization techniques give you the tools to apply LSI to your own text analysis projects.