Build a Latent Semantic Indexing (LSI) Search Engine in Python — From Scratch

Introduction

This tutorial will guide you through understanding and implementing Latent Semantic Indexing (LSI) from scratch. LSI is a powerful technique used in information retrieval and natural language processing to find hidden relationships between words and documents.

By the end of this tutorial, you'll understand:

  • How LSI works mathematically
  • How to implement LSI using Python
  • How to use LSI for document ranking and search
  • How to visualize document relationships in concept space

What is Latent Semantic Indexing?

Latent Semantic Indexing (LSI) is a technique that:

  1. Helps retrieve relevant information from large document collections
  2. Was invented for Information Retrieval in the late 1980s
  3. Determines document similarity based on the contexts in which words appear
  4. Allows search engines to understand page content beyond exact keyword matching
  5. Focuses on "Themes" instead of just "Keywords"

Key Benefits

  • Finds documents that are conceptually similar even if they don't share exact keywords (see the toy example after this list)
  • Reduces the dimensionality of the document space
  • Handles synonymy (different words with similar meanings)
  • Partially mitigates polysemy (words with multiple meanings), though less effectively than synonymy
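
To make the synonymy point concrete, here is a minimal, self-contained toy sketch. The four terms and three tiny documents below are invented purely for illustration (they are not part of the tutorial's dataset): two documents that use different words ("car" vs. "automobile") but share a context word ("engine") end up almost identical in a 2-concept LSI space, while an unrelated document stays far away.

import numpy as np

# Toy counts, invented for illustration.
# Rows (terms): "car", "automobile", "engine", "banana"; columns: three tiny documents.
A = np.array([
    [2, 0, 0],   # "car" appears only in document 0
    [0, 2, 0],   # "automobile" appears only in document 1
    [1, 1, 0],   # "engine" appears in documents 0 and 1
    [0, 0, 3],   # "banana" appears only in document 2
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the top 2 concepts and place each document in concept space
k = 2
docs_k = np.diag(s[:k]).dot(Vt[:k]).T   # rows: documents, columns: concepts

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In raw count space the cosine of documents 0 and 1 is only 0.2;
# in the 2-concept space it rises to 1.0, while document 2 stays at 0.
print(round(cosine(docs_k[0], docs_k[1]), 3))
print(round(cosine(docs_k[0], docs_k[2]), 3))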

Step-by-Step Implementation

Step 1: Import Required Libraries

import pandas as pd
import math
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from string import punctuation
import nltk
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (needed on older Matplotlib versions)
import matplotlib.pyplot as plt

Step 2: Data Loading and Preprocessing

Choose Your Dataset

# Select a news group category for analysis
news_group = "sci.med"  # Medical science articles
n_docs = 40  # Number of documents to analyze

# Available categories include:
# 'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
# 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
# 'misc.forsale', 'rec.autos', 'rec.motorcycles',
# 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
# 'sci.electronics', 'sci.med', 'sci.space',
# 'soc.religion.christian', 'talk.politics.guns',
# 'talk.politics.mideast', 'talk.politics.misc',
# 'talk.religion.misc'

Load and Clean the Data

# Load the dataset
categories = [news_group]
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    shuffle=True
)

# Define text cleaning functions
nltk.download('stopwords', quiet=True)  # make sure the NLTK stopword list is available
stopwords = nltk.corpus.stopwords.words('english')

def strip_punctuation(s):
    """Remove punctuation from text"""
    return ''.join(c for c in s if c not in punctuation)

def strip_nums(s):
    """Remove numbers from text"""
    return ''.join([i for i in s if not i.isdigit()])

# Process documents: tokenize, clean, and remove stopwords
all_docs = [
    nltk.tokenize.wordpunct_tokenize(
        strip_nums(strip_punctuation(twenty_train.data[i]).lower())
    ) 
    for i in range(n_docs)
]

# Create bag of words (bow) by removing stopwords
bow = [
    [word for word in doc if word not in stopwords] 
    for doc in all_docs
]

# Filter out empty documents
bow = list(filter(None, bow))
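
Before building any matrices, it can help to confirm what the cleaning step produced. This is just an optional check on the bow list created above:

# Optional check: how many documents survived cleaning and what the tokens look like
print(f"Documents after cleaning: {len(bow)}")
print(f"Tokens in the first document: {len(bow[0])}")
print(bow[0][:15])   # a peek at the first few tokens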

Step 3: Create Term-Document Matrix

Build Vocabulary

def unique(bow):
    """Create unique vocabulary from all documents"""
    vocabulary = bow[0]
    for i in range(1, len(bow)):
        vocabulary = set(vocabulary).union(set(bow[i]))
    return vocabulary

# Create vocabulary and initialize term-document matrix
wordset = unique(bow)
worddict = [dict.fromkeys(wordset, 0) for i in range(len(bow))]

Count Term Occurrences

def term_document_matrix(bow, worddict):
    """Fill in raw term counts; the result has one row per document
    and one column per vocabulary term (transposed later for the TF-IDF matrix)"""
    for bow_i, worddict_i in zip(bow, worddict):
        for word in bow_i:
            worddict_i[word] += 1
    return pd.DataFrame(worddict)

# Create the document-term count matrix (rows: documents, columns: terms)
docterm = term_document_matrix(bow, worddict)
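
A quick way to inspect the counts before moving on:

# Shape of the count matrix and the ten most frequent terms in the corpus
print(docterm.shape)                                        # (documents, vocabulary size)
print(docterm.sum().sort_values(ascending=False).head(10))  # overall counts, largest first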

Step 4: Calculate TF-IDF Matrix

Term Frequency (TF) Calculation

def term_freq(worddict, bow):
    """Calculate term frequency for a document"""
    tfdict = {}
    bowcount = len(bow)
    for word, count in worddict.items():
        tfdict[word] = count / float(bowcount)
    return tfdict

# Calculate TF for all documents
tfbow = [term_freq(worddict_i, bow_i) for worddict_i, bow_i in zip(worddict, bow)]

Inverse Document Frequency (IDF) Calculation

def idf(doclist):
    """Calculate inverse document frequency"""
    n = len(doclist)
    
    # Initialize IDF dictionary with every vocabulary term
    idfdict = dict.fromkeys(doclist[0].keys(), 0)
    
    # Count documents containing each word
    for doc in doclist:
        for word, val in doc.items():
            if val > 0:
                idfdict[word] += 1
    
    # Calculate IDF: log(total_docs / docs_containing_word)
    for word, val in idfdict.items():
        idfdict[word] = math.log(n / float(val))
    
    return idfdict

# Calculate IDF values
idfs = idf(worddict)

TF-IDF Calculation

def tfidf(tfbow, idfs):
    """Calculate TF-IDF scores"""
    tfidf_dict = {}
    for word, tf_val in tfbow.items():
        tfidf_dict[word] = tf_val * idfs[word]
    return tfidf_dict

# Calculate TF-IDF for all documents
tfidf_scores = [tfidf(tf_doc, idfs) for tf_doc in tfbow]

# Create TF-IDF matrix (terms × documents)
X = pd.DataFrame(tfidf_scores).T
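
A quick check of the orientation, since everything that follows depends on it: X should have one row per vocabulary term and one column per document.

# Confirm the orientation of the TF-IDF matrix
print(X.shape)          # (vocabulary size, number of documents)
print(X.iloc[:5, :3])   # a small corner of the matrix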

Step 5: Apply Singular Value Decomposition (SVD)

# Perform SVD on the TF-IDF matrix
L, S, R = np.linalg.svd(X)

# L: Left singular vectors (term-concept matrix)
# S: Singular values (concept strengths)
# R: Right singular vectors (concept-document matrix)

Key Concepts Explained

Understanding SVD Components

  1. L Matrix (Left Singular Vectors):

    • Shape: (terms × terms) with NumPy's default full_matrices=True; only the first min(terms, documents) columns correspond to concepts
    • Shows the relationship between terms and concepts
    • Each column represents a concept in term space
  2. S (Singular Values):

    • Shape: (min(terms, documents),)
    • A 1-D array holding the diagonal of Σ, sorted from largest to smallest
    • Larger values = more important concepts
  3. R Matrix (Right Singular Vectors):

    • Shape: (documents × documents); only the first min(terms, documents) rows correspond to concepts
    • Shows the relationship between concepts and documents
    • Each row represents a concept in document space
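
These shapes are easy to verify directly. The snippet below also checks that the full decomposition reproduces X; Sigma is a helper name used only for this check.

# Inspect the shapes returned by np.linalg.svd (full_matrices=True by default)
print(L.shape)   # (n_terms, n_terms)
print(S.shape)   # shape (min(n_terms, n_docs),): a 1-D array, largest value first
print(R.shape)   # (n_docs, n_docs)

# Sanity check: the full decomposition reproduces X up to floating-point error
Sigma = np.zeros(X.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(L.dot(Sigma).dot(R), X))   # True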

Low-Rank Approximation

LSI uses dimensionality reduction by keeping only the most important concepts:

def top_n_pad_0(U, S, V, n):
    """Build a diagonal matrix of the top n singular values (the rest zeroed),
    padded with zeros so that U.dot(A).dot(V) has the shape of the original matrix"""
    t = list(S[:n])
    for i in range(len(S) - n):
        t.append(0)
    A = np.diag(t)
    
    # Pad A with zero rows (or columns) until its shape is (len(U), len(V))
    newrow = [0] * len(S)
    if len(U) > len(V):
        for i in range(len(U) - len(S)):
            A = np.vstack([A, newrow])
    else:
        for i in range(len(V) - len(S)):
            A = np.vstack([A.T, newrow]).T
    return A

def reconstruct(u, s, v, n):
    """Reconstruct matrix using top n concepts"""
    A = top_n_pad_0(u, s, v, n)
    return np.round((u.dot(A)).dot(v), decimals=3)

Choosing Optimal Number of Concepts

def frobenius(a, a2):
    """Calculate Frobenius norm between original and reconstructed matrices"""
    a = np.array(a)
    return (np.sqrt(np.sum((a - a2)**2))) / np.sqrt(np.sum(a**2))

def find_k():
    """Find the smallest number of concepts whose reconstruction error falls below a threshold"""
    for i in range(1, len(S)):
        f = frobenius(X, reconstruct(L, S, R, i))
        if f < 0.38:  # threshold for acceptable reconstruction error (tune for your data)
            return i
    return len(S)  # fall back to full rank if the threshold is never reached
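
A short usage example, tying the helpers together (the exact error value depends on your documents and on the 0.38 threshold):

# Choose k and report the reconstruction error it achieves
k = find_k()
print(f"Chosen number of concepts: {k}")
print(f"Relative reconstruction error at k={k}: {frobenius(X, reconstruct(L, S, R, k)):.4f}")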

Practical Examples

Document Search and Ranking

def search(query):
    """Search for documents relevant to query"""
    # Clean and tokenize query
    q = strip_punctuation(query)
    q = q.lower().split(" ")
    
    # Convert query to vector space
    terms = X.index
    query_vector = np.array([1 if term in q else 0 for term in terms])
    
    if np.count_nonzero(query_vector == 1) == 0:
        print("Keywords don't match with documents")
    else:
        # Calculate similarity scores using LSI space
        k = find_k()  # Optimal number of concepts
        reconstructed_matrix = reconstruct(L, S, R, k)
        scores = query_vector.dot(reconstructed_matrix)
        
        # Sort documents by relevance score
        ranked_docs = sorted(
            zip(X.columns, scores),  # label results with the same document ids used elsewhere
            key=lambda x: x[1],
            reverse=True
        )
        
        print("Documents ranked by relevance:")
        for doc_id, score in ranked_docs:
            print(f"Document-{doc_id} (Score: {score:.3f})")

# Example usage
search("Sky is high.")

Concept Analysis

def print_concepts(n_concepts, n_keywords):
    """Print top keywords for each concept"""
    terms = X.index
    
    for i, component in enumerate(L.T[:n_concepts]):
        term_weights = zip(terms, component)
        top_terms = sorted(term_weights, key=lambda x: x[1], reverse=True)[:n_keywords]
        
        print(f"Concept {i+1}:")
        for term, weight in top_terms:
            print(f"  {term} ({weight:.3f})")
        print()

# Example: Show top 5 concepts with 10 keywords each
print_concepts(5, 10)

def print_concept_docs(top_n):
    """Print the documents most associated with each concept"""
    doc_ids = X.columns
    
    for i, component in enumerate(R):
        doc_weights = zip(doc_ids, component)
        top_docs = sorted(doc_weights, key=lambda x: x[1], reverse=True)[:top_n]
        
        print(f"Documents most related to Concept {i+1}:")
        for doc_id, weight in top_docs:
            print(f"  Document-{doc_id} ({weight:.3f})")
        print()

# Example: Show top 2 documents for each concept
print_concept_docs(2)

Advanced Features

Visualization Techniques

2D Document Visualization

def plot_docs_2d():
    """Plot documents in 2D concept space"""
    # Project documents into concept space (rows: documents, columns: concepts)
    doc_concepts = R.T.dot(np.diag(S))
    
    concept1 = list(doc_concepts[:, 0])  # every document's coordinate on concept 1
    concept2 = list(doc_concepts[:, 1])  # every document's coordinate on concept 2
    
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111)
    
    # Scatter plot
    ax.scatter(concept1, concept2)
    
    # Label each point with document ID
    for x, y, doc_id in zip(concept1, concept2, range(len(concept1))):
        ax.text(x, y, str(doc_id))
    
    plt.xlabel("Concept 1")
    plt.ylabel("Concept 2")
    plt.title("Documents in 2D Concept Space")
    plt.grid(True)
    plt.show()

plot_docs_2d()

3D Document Visualization

def plot_docs_3d():
    """Plot documents in 3D concept space"""
    # Project documents into concept space (rows: documents, columns: concepts)
    doc_concepts = R.T.dot(np.diag(S))
    
    concept1 = list(doc_concepts[:, 0])
    concept2 = list(doc_concepts[:, 1])
    concept3 = list(doc_concepts[:, 2])
    
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    
    # 3D scatter plot
    ax.scatter(concept1, concept2, concept3)
    
    # Label each point
    for x, y, z, doc_id in zip(concept1, concept2, concept3, range(len(concept1))):
        ax.text(x, y, z, str(doc_id))
    
    ax.set_xlabel("Concept 1")
    ax.set_ylabel("Concept 2")
    ax.set_zlabel("Concept 3")
    ax.set_title("Documents in 3D Concept Space")
    plt.show()

plot_docs_3d()

Performance Analysis

def calculate_reconstruction_error(original, reconstructed):
    """Calculate how well the reduced model represents original data"""
    return frobenius(original, reconstructed)

def analyze_concept_importance():
    """Analyze the importance of different numbers of concepts"""
    errors = []
    concept_counts = range(1, min(10, len(S)))
    
    for k in concept_counts:
        reconstructed = reconstruct(L, S, R, k)
        error = frobenius(X, reconstructed)
        errors.append(error)
        print(f"Concepts: {k}, Reconstruction Error: {error:.4f}")
    
    return concept_counts, errors

# Analyze concept importance
concept_counts, errors = analyze_concept_importance()
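
If you prefer a visual view of this trade-off, the errors can be plotted against the number of retained concepts (matplotlib is already imported in Step 1):

# Plot reconstruction error versus number of retained concepts
plt.figure(figsize=(8, 5))
plt.plot(list(concept_counts), errors, marker='o')
plt.axhline(0.38, linestyle='--', color='gray', label='find_k() threshold')
plt.xlabel("Number of concepts (k)")
plt.ylabel("Relative reconstruction error")
plt.title("Reconstruction Error vs. Number of Concepts")
plt.legend()
plt.grid(True)
plt.show()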

Mathematical Foundation

The LSI Process

  1. Document Collection: Start with a collection of documents D = {d₁, d₂, ..., dₙ}

  2. Term-Document Matrix: Create matrix A where:

    • Rows represent terms (words)
    • Columns represent documents
    • A[i,j] = TF-IDF score of term i in document j
  3. SVD Decomposition: A = U × Σ × Vᵀ where:

    • U: m × r matrix (terms × concepts)
    • Σ: r × r diagonal matrix (concept strengths)
    • Vᵀ: r × n matrix (concepts × documents)
  4. Dimensionality Reduction: Keep only top k concepts:

    • A_k = U_k × Σ_k × V_kᵀ
    • k << min(m, n)
  5. Query Processing: For query q:

    • Convert to term vector q_v
    • Project to concept space: q_c = q_v × U_k × Σ_k⁻¹
    • Calculate similarities: similarities = q_c × V_kᵀ (a code sketch of this projection follows the list)
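
Note that the search() function in the Practical Examples section scores queries directly against the reconstructed term-document matrix, which is a simplification. Below is a minimal sketch of the folding-in projection from step 5, reusing X, L, S, R, strip_punctuation and find_k from above; the names fold_in_query and lsi_similarities are introduced here for illustration, and the example query is arbitrary.

def fold_in_query(query, k):
    """Project a raw query string into the k-dimensional concept space: q_c = q_v × U_k × Σ_k⁻¹"""
    tokens = strip_punctuation(query).lower().split()
    qv = np.array([1.0 if term in tokens else 0.0 for term in X.index])  # q_v in term space
    return qv.dot(L[:, :k]).dot(np.diag(1.0 / S[:k]))                    # q_c in concept space

def lsi_similarities(query, k):
    """Score every document: similarities = q_c × V_kᵀ (V_kᵀ is the first k rows of R)"""
    qc = fold_in_query(query, k)
    return qc.dot(R[:k])

# Example usage; if none of the query words occur in the vocabulary, all scores are zero
print(np.round(lsi_similarities("doctor patient treatment", find_k()), 3))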

Conclusion

What You've Learned

  1. LSI Fundamentals: Understanding how LSI captures semantic relationships between documents and terms

  2. Implementation Skills: Building LSI from scratch using Python and fundamental linear algebra operations

  3. Mathematical Concepts: SVD, dimensionality reduction, and matrix operations in information retrieval

  4. Practical Applications: Document search, ranking, and visualization in reduced concept space

Key Takeaways

  • LSI goes beyond keyword matching by finding conceptual relationships
  • SVD is the mathematical foundation that enables semantic understanding
  • Dimensionality reduction helps focus on the most important concepts
  • Visualization techniques help understand document relationships

Next Steps

  1. Experiment with different datasets to see how LSI performs across domains
  2. Try different numbers of concepts to optimize performance for your use case
  3. Compare with other techniques like Latent Dirichlet Allocation (LDA) or modern transformers
  4. Scale to larger datasets using efficient sparse matrix implementations (sketched briefly below)
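
As a pointer for step 4, here is a hedged sketch of the same pipeline built on scikit-learn's sparse-friendly tools, TfidfVectorizer and TruncatedSVD. Parameter choices such as n_components=10 are illustrative rather than tuned, and twenty_train and n_docs are reused from Step 2.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

raw_docs = twenty_train.data[:n_docs]                  # same raw documents as above
vectorizer = TfidfVectorizer(stop_words='english')     # sparse documents × terms TF-IDF matrix
tfidf_sparse = vectorizer.fit_transform(raw_docs)

svd = TruncatedSVD(n_components=10, random_state=0)    # truncated (randomized) SVD on sparse input
doc_concepts_sk = svd.fit_transform(tfidf_sparse)      # documents × concepts

print(tfidf_sparse.shape, doc_concepts_sk.shape)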

Real-World Applications

  • Search engines: Improving search result relevance
  • Document clustering: Grouping similar documents
  • Recommendation systems: Finding related content
  • Text mining: Discovering hidden topics in large text collections

This tutorial provides a comprehensive foundation for understanding and implementing Latent Semantic Indexing. The mathematical concepts, practical implementation, and visualization techniques give you the tools to apply LSI to your own text analysis projects.