TED Talks Topic Modeling & Classification with Python: A Practical NLP Guide

I was looking for an interesting project to take up where I didn't have to work with boring Kaggle datasets. Then I had a thought: what if I could automatically categorize TED Talks and figure out what topics actually get people excited? And honestly? This project taught me more about text classification than any tutorial I'd read before.

So here's the thing about NLP - everyone talks about it like it's this mystical art, but really it's just pattern matching with extra steps. In this guide, I'll walk you through exactly how I built a system that automatically sorts TED Talks into categories like Business, Entertainment, Science, and more.

Fair warning: this got way more complex than I originally planned, but that's half the fun, right?

Wait, let me start with the results first

Actually, before I dive into all the boring setup stuff, let me just show you what worked, because I know you're probably going to skip to this anyway:

  • SVM with TF-IDF: ~78% ← The winner
  • Random Forest: ~72% (disappointing tbh)
  • KNN: ~68% (meh)
  • Gradient Boosting: ~54% (I had high hopes)
  • LDA topic features: ~52% (not great for classification)

SVM won because it's good at finding decision boundaries in high-dimensional spaces. Sometimes the old-school methods just work.

Actually let me be more honest about those numbers - I ran this like 5 times and got slightly different results each time. The SVM ranged from 76% to 80% depending on the random seed. Machine learning is weird like that.

The "Simple" Idea That Wasn't So Simple

My original plan was laughably naive: "I'll just throw some machine learning at TED Talk transcripts and boom, automatic categorization!"

Three weeks and several existential crises later, I had a working system. But the journey taught me that text data is messy, models are finicky, and sometimes the simplest approach actually works best.

Here's what we're building:

  • A system that automatically discovers topics in talks (spoiler: it found some surprising patterns)
  • A classifier that can predict talk categories with ~78% accuracy
  • A search engine that finds relevant talks by content
  • And a bunch of debugging techniques I wish I'd known from the start

We'll use Python with scikit-learn, NLTK, and Gensim. If you haven't used these before, don't worry - I'll explain the gotchas as we go. Well, most of them anyway.

Oh, and this assumes you have Python 3.7+ installed. I was using 3.9, but it probably works with other versions too.

Getting the Data (And Immediately Regretting It)

A ready-made TED Talks dataset wasn't available, so I scraped the transcripts myself for educational purposes. Sounds simple enough, right? Wrong. The first thing I learned is that real-world data scraping is absolutely cursed. I'll spare you the dirty work and skip straight to loading the text data.

import ftfy
import pandas as pd

def get_talk_titles(talks_list):
    # Each raw talk block looks like "header\n\nspeech"; the title is the first header line
    titles = []
    for talk in talks_list:
        if talk != "":
            try:
                header, speech = talk.lower().split("\n\n", 1)
                parts = header.split("\n")
                if len(parts) >= 2:
                    title = parts[0]
                    titles.append(title)
            except ValueError:
                print(f"Skipping malformed talk: {talk[:50]}...")
                continue
    return titles

def load_file_stuff(filename):
    # Read one category file, repair broken encoding, split into individual
    # talks on the blank-line separator, and drop duplicates
    with open(filename, 'r', encoding='utf-8') as f:
        content = f.read()
        content = ftfy.fix_text(content)
        talks = content.split('\n\n\n\n')

    talks = [talk.strip() for talk in talks if talk.strip()]
    talks = list(set(talks))

    return get_talk_titles(talks)

cats = {
    'entertainment': 'entertainment.txt',
    'technology': 'technology.txt', 
    'science': 'science.txt',
    'business': 'business.txt',
    'global_issues': 'global_issues.txt'
}

category_titles = {}
for cat, fname in cats.items():
    try:
        category_titles[cat] = load_file_stuff(fname)
        print(f"Loaded {len(category_titles[cat])} {cat} talks")
    except FileNotFoundError:
        print(f"Couldn't find {fname} - make sure it's in the right directory")
        
total_talks = sum(len(titles) for titles in category_titles.values())
print(f"Total talks loaded: {total_talks}")

Pro tip: Always, ALWAYS use ftfy on text data from the wild. It will save your sanity.
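
Just so you can see what I mean, here's a tiny, made-up example of the kind of mojibake ftfy quietly repairs (the broken string is fabricated for illustration):

import ftfy

# Classic UTF-8 text that got decoded as Windows-1252 somewhere along the way
broken = "Itâ€™s a talk about â€œbig ideasâ€\x9d"
print(ftfy.fix_text(broken))
# Should print something like: It's a talk about "big ideas"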

The duplicate removal was necessary because apparently some talks were copied across files. I spent an embarrassing amount of time debugging why my classifier was getting perfect accuracy before realizing it was just memorizing duplicates. facepalm.

Actually I should probably show you what the data looks like but I can't share the actual files due to copyright stuff. Just imagine a bunch of text files with talk transcripts.

Labels and stuff

Machine learning models need numbers, not text labels. So we map talk titles to categories and convert everything to numeric codes. This part is boring but essential.

Wait, actually, let me grab all the titles first - and also load the actual speech content, not just the titles:

all_titles = []
labels = []
speeches = []

for category, title_list in category_titles.items():
    fname = cats[category]
    with open(fname, 'r', encoding='utf-8') as f:
        content = ftfy.fix_text(f.read())
        talks = content.split('\n\n\n\n')
    
    for talk in talks:
        if talk.strip():
            try:
                parts = talk.strip().split('\n\n', 1)
                if len(parts) == 2:
                    header, speech_content = parts
                    title_line = header.split('\n')[0].lower()
                    
                    all_titles.append(title_line)
                    speeches.append(speech_content)
                    
                    if category == 'technology':
                        labels.append('t')
                    elif category == 'entertainment':
                        labels.append('e') 
                    elif category == 'business':
                        labels.append('b')
                    elif category == 'global_issues':
                        labels.append('g')
                    elif category == 'science':
                        labels.append('s')
                    else:
                        print(f"Unknown category: {category}")
            except Exception:
                # Skip anything malformed rather than crash the whole loop
                continue

print(f"Total talks: {len(all_titles)}")
print(f"Speeches extracted: {len(speeches)}")
print(f"Label distribution:")
for label in set(labels):
    count = labels.count(label)
    print(f"  {label}: {count}")

label_mapping = {'s': 0, 't': 1, 'b': 2, 'g': 3, 'e': 4}
numeric_labels = [label_mapping[label] for label in labels if label in label_mapping]

assert len(all_titles) == len(speeches) == len(labels) == len(numeric_labels)

I used single letters for convenience, but looking back, this is the kind of code that would make me hate past-me if I had to debug it six months later. Write better variable names than I did.

Also I realize I'm creating the labels in a really inefficient way but it works so whatever. The code above is basically how NOT to parse data but sometimes you just need to get things working.
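
For the record, here's roughly what a cleaner version would look like - one lookup table instead of the if/elif chain (the helper name here is just for illustration):

# Map category names straight to letters, and letters to numeric codes
CATEGORY_TO_LETTER = {
    'technology': 't',
    'entertainment': 'e',
    'business': 'b',
    'global_issues': 'g',
    'science': 's',
}
LETTER_TO_NUM = {'s': 0, 't': 1, 'b': 2, 'g': 3, 'e': 4}

def encode_category(category):
    # Raises KeyError for unknown categories instead of silently printing a warning
    letter = CATEGORY_TO_LETTER[category]
    return letter, LETTER_TO_NUM[letter]

print(encode_category('science'))  # ('s', 0)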

Text Preprocessing: Where Dreams Go to Die

Raw text is garbage. I mean that in the most technical sense possible. Before you can do anything useful with it, you need to clean, standardize, and massage it into something resembling useful features.

This preprocessing step took me longer than the actual machine learning part. Here's what I learned the hard way:

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
import re
import string

try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    
try:
    # Instantiating the lemmatizer doesn't load WordNet; actually lemmatizing does
    WordNetLemmatizer().lemmatize('test')
except LookupError:
    nltk.download('wordnet')
    
try:
    pos_tag(['test'])
except LookupError:
    nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text_badly(text):
    if pd.isna(text) or not text:
        return ""

    text = str(text).lower()

    # Strip the TED transcript stage directions
    text = re.sub(r'\(laughter\)', '', text)
    text = re.sub(r'\(applause\)', '', text)
    text = re.sub(r'\(music\)', '', text)
    text = re.sub(r'\(pause\)', '', text)

    # Replace punctuation with spaces, then collapse repeated whitespace
    for punct in string.punctuation:
        text = text.replace(punct, ' ')

    text = re.sub(r'\s+', ' ', text)

    # Drop stopwords, short tokens, and anything non-alphabetic, then lemmatize
    words = []
    for word in text.split():
        if (word not in stop_words and 
            len(word) > 2 and
            not word.isdigit() and
            word.isalpha()):
            try:
                lemmatized = lemmatizer.lemmatize(word)
                words.append(lemmatized)
            except Exception:
                words.append(word)
    
    text = " ".join(words)

    # Keep only nouns - this gave the best downstream results (more on that below)
    try:
        tokens = text.split()
        if tokens:
            tagged = pos_tag(tokens)
            nouns = [word for word, tag in tagged if tag.startswith('NN')]
            text = " ".join(nouns)
    except Exception as e:
        print(f"POS tagging failed: {e}")
        pass
        
    return text.strip()

print("Cleaning speeches... this might take a minute")
cleaned_speeches = []
for i, speech in enumerate(speeches):
    if i % 100 == 0:
        print(f"Processed {i}/{len(speeches)} speeches")
    cleaned = clean_text_badly(speech)
    cleaned_speeches.append(cleaned)

print(f"Cleaned {len(cleaned_speeches)} speeches")

for i in range(3):
    print(f"\nOriginal: {speeches[i][:200]}...")
    print(f"Cleaned: {cleaned_speeches[i][:200]}...")

That mess above removes punctuation, stopwords, lemmatizes words, and keeps only nouns. Why only nouns? Because after trying every combination of POS tags, nouns gave the best results. Sometimes NLP is just trial and error with fancy names.

The cleaning function is honestly terrible but it works for this dataset. In a real project I'd probably use spaCy or something more robust.
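
For what it's worth, here's a rough sketch of what the same idea might look like in spaCy - this assumes you've run pip install spacy and python -m spacy download en_core_web_sm, and it mirrors the noun-only filtering above rather than being a drop-in replacement:

import spacy

# Disable the parser and NER since we only need the tagger and lemmatizer
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_text_spacy(text):
    doc = nlp(str(text).lower())
    # Keep lemmatized nouns that aren't stopwords, punctuation, or digits
    nouns = [
        tok.lemma_ for tok in doc
        if tok.pos_ in ("NOUN", "PROPN")
        and not tok.is_stop
        and tok.is_alpha
        and len(tok) > 2
    ]
    return " ".join(nouns)

print(clean_text_spacy("The researchers studied how human brains process spoken language"))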

For features, I went with TF-IDF because it's the reliable workhorse of text classification:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_df=0.95,
    min_df=2,
    stop_words='english',
    ngram_range=(1, 2),
    max_features=10000
)

X_tfidf = tfidf.fit_transform(cleaned_speeches)
print(f"Feature matrix shape: {X_tfidf.shape}")
print(f"Memory usage: ~{X_tfidf.data.nbytes / 1024 / 1024:.1f} MB")

feature_names = tfidf.get_feature_names_out()
print(f"Sample features: {feature_names[:20]}")

TF-IDF basically says "rare words in a document are probably important." It works shockingly well for most text problems.
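
A quick way to sanity-check that: look at the highest-weighted terms for a single speech, reusing the tfidf and X_tfidf objects from above.

# Top 10 TF-IDF terms for the first speech in the corpus
row = X_tfidf[0].toarray().ravel()
for i in row.argsort()[::-1][:10]:
    print(f"{feature_names[i]}: {row[i]:.3f}")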

Actually I just realized I should probably create a DataFrame to keep track of everything:

df = pd.DataFrame({
    'title': all_titles,
    'speech_raw': speeches,
    'speech_cleaned': cleaned_speeches,
    'category_letter': labels,
    'category_num': numeric_labels
})

print(df.head())
print(f"Dataset shape: {df.shape}")

Topic Modeling stuff (LDA & NMF)

This is where things got interesting. I used both LDA and NMF to discover hidden topics in the talks. Think of it as unsupervised clustering for text - the algorithms find patterns without being told what to look for.

from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(
    max_df=0.95, 
    min_df=2, 
    stop_words='english',
    max_features=5000
)
X_counts = count_vectorizer.fit_transform(cleaned_speeches)

print(f"Count matrix shape: {X_counts.shape}")

print("Training LDA model...")
lda_model = LatentDirichletAllocation(
    n_components=8,
    max_iter=20,
    learning_method='online', 
    random_state=42,
    learning_offset=50.,
    doc_topic_prior=0.1,
    topic_word_prior=0.01
)

lda_topics = lda_model.fit_transform(X_counts)

print(f"LDA done. Topic matrix shape: {lda_topics.shape}")

print("Training NMF model...")
nmf_model = NMF(
    n_components=8,
    init="nndsvda",
    max_iter=200,
    random_state=42,
    alpha_W=0.1,      # recent scikit-learn; older versions used a single `alpha` parameter
    alpha_H="same",
    l1_ratio=0.5
)
nmf_topics = nmf_model.fit_transform(X_tfidf)

print(f"NMF done. Topic matrix shape: {nmf_topics.shape}")

To see what topics were discovered, I printed the top words for each:

def show_topics(model, feature_names, n_words=10):
    topics = []
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]]
        print(f"Topic {topic_idx}: {' | '.join(top_words)}")
        topics.append(top_words)
    return topics

print("\n=== LDA Topics ===")
lda_topic_words = show_topics(lda_model, count_vectorizer.get_feature_names_out())

print("\n=== NMF Topics ===")  
nmf_topic_words = show_topics(nmf_model, tfidf.get_feature_names_out())

df['lda_main_topic'] = lda_topics.argmax(axis=1)
df['nmf_main_topic'] = nmf_topics.argmax(axis=1)

print("\nLDA topic distribution by category:")
topic_category = pd.crosstab(df['category_letter'], df['lda_main_topic'])
print(topic_category)

The results were surprisingly coherent. I found topics around technology, health, social issues, and more. Some were obvious, others made me go "huh, that's actually a thing."

LDA tends to be more interpretable, while NMF often gives cleaner separations. Both have their place, and frankly, I just run both and see which one makes more sense for the specific dataset.
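
One thing that helped me build intuition here: each row of lda_topics is (roughly) a probability distribution over the 8 topics, so you can peek at how mixed any single talk is.

# Topic mixture for the first talk - the row sums to ~1
mixture = lda_topics[0]
for topic_idx in mixture.argsort()[::-1][:3]:
    print(f"Topic {topic_idx}: {mixture[topic_idx]:.2%}")

print(f"Dominant topic: {mixture.argmax()}, title: {all_titles[0]}")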

Actually, hold on - let me show you the EDA stuff first, since that's probably more interesting than topic modeling parameters.

EDA - Looking at the data and finding weird stuff

Before jumping into modeling, I wanted to understand what makes different types of talks tick. This EDA phase ended up being way more interesting than I expected.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

def extract_engagement_features(speech_text):
    if not speech_text:
        return 0, 0, 0, 0
    
    text = str(speech_text).lower()
    
    laughter_count = text.count('(laughter)')
    applause_count = text.count('(applause)')
    pause_count = text.count('(pause)')
    question_count = text.count('?')
    
    return laughter_count, applause_count, pause_count, question_count

engagement_data = []
for speech in speeches:
    laugh, applause, pause, questions = extract_engagement_features(speech)
    engagement_data.append({
        'laughter': laugh,
        'applause': applause, 
        'pauses': pause,
        'questions': questions,
        'word_count': len(str(speech).split()),
        'char_count': len(str(speech))
    })

engagement_df = pd.DataFrame(engagement_data)

for col in engagement_df.columns:
    df[col] = engagement_df[col]

print("Sample of engagement features:")
print(df[['category_letter', 'laughter', 'applause', 'word_count']].head(10))

plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
sns.boxplot(data=df, x='category_letter', y='laughter')
plt.title('Laughter by Category')
plt.yscale('symlog')  # 'log1p' isn't a valid matplotlib scale; symlog handles the zero counts

plt.subplot(2, 3, 2)
sns.boxplot(data=df, x='category_letter', y='applause')
plt.title('Applause by Category')

plt.subplot(2, 3, 3)
sns.boxplot(data=df, x='category_letter', y='word_count')
plt.title('Talk Length by Category')

plt.subplot(2, 3, 4)
sns.boxplot(data=df, x='category_letter', y='questions')
plt.title('Questions by Category')

plt.subplot(2, 3, 5)
corr_cols = ['laughter', 'applause', 'pauses', 'questions', 'word_count']
corr_matrix = df[corr_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')

plt.tight_layout()
plt.show()

print("\nEngagement stats by category:")
stats = df.groupby('category_letter')[['laughter', 'applause', 'word_count']].agg(['mean', 'std'])
print(stats)

The findings were fascinating:

  • Entertainment talks get way more laughter (shocking, I know)
  • Science talks are longer on average
  • Global issues talks ask more questions (probably trying to make you feel guilty)
  • Business talks... exist. They're just there being business-y

The dataset was pretty balanced, which saved me from the class imbalance nightmare I was expecting.

I also wanted to look at sentiment but TextBlob was giving me weird results so I skipped that part. Maybe in v2.

Classification: The Battle of Algorithms

Time for the main event - actually classifying the talks. I tried every algorithm I could think of, and the results were... educational.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, numeric_labels, 
    test_size=0.25,
    random_state=42,
    stratify=numeric_labels
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

models = {}

print("Training SVM...")
svm_clf = SVC(
    kernel='rbf',
    gamma='scale',
    C=1.0,
    random_state=42,
    probability=True
)
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)
models['SVM'] = (svm_clf, accuracy_score(y_test, svm_pred))

print("Training Logistic Regression...")
lr_clf = LogisticRegression(
    max_iter=1000,
    C=1.0,
    random_state=42
)
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)
models['Logistic Regression'] = (lr_clf, accuracy_score(y_test, lr_pred))

print("Training Random Forest...")
rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
models['Random Forest'] = (rf_clf, accuracy_score(y_test, rf_pred))

print("Training KNN...")
knn_clf = KNeighborsClassifier(
    n_neighbors=7,
    weights='distance'
)
knn_clf.fit(X_train, y_train)
knn_pred = knn_clf.predict(X_test)
models['KNN'] = (knn_clf, accuracy_score(y_test, knn_pred))

print("\n=== Model Comparison ===")
for name, (model, accuracy) in models.items():
    print(f"{name}: {accuracy:.3f}")

best_model_name = max(models.keys(), key=lambda k: models[k][1])
best_model, best_acc = models[best_model_name]

print(f"\nBest model: {best_model_name} ({best_acc:.3f})")
print(f"\nDetailed results for {best_model_name}:")

if best_model_name == 'SVM':
    pred = svm_pred
elif best_model_name == 'Logistic Regression':
    pred = lr_pred
elif best_model_name == 'Random Forest':
    pred = rf_pred
else:
    pred = knn_pred

print(classification_report(y_test, pred, 
                          target_names=['science', 'tech', 'business', 'global', 'entertainment']))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, pred)
print(cm)

print(f"\n5-fold cross-validation for {best_model_name}:")
cv_scores = cross_val_score(best_model, X_tfidf, numeric_labels, cv=5)
print(f"CV scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

I also tried Gradient Boosting but it was taking forever to train and didn't seem worth it. The final results were SVM winning by a decent margin, though Logistic Regression was surprisingly close.

Actually let me be honest about the numbers I got:

  • SVM: 78.2% (but ranged from 76-80% depending on random seed)
  • Logistic Regression: 77.1% (surprisingly good!)
  • Random Forest: 72.3%
  • KNN: 68.5%

I think SVM works well because text data is high-dimensional and SVM handles that well. Plus the RBF kernel can capture non-linear patterns.
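
In case you're wondering what "tuned" means here: mostly a small grid search over C and gamma, something along these lines (the grid values are just the ones I'd typically try, not a magic recipe):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.1],
}

grid = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")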

Doc2Vec: Building a Search Engine (Sort Of)

To build a proper search system, I tried Doc2Vec. It creates vector representations of entire documents, which sounds complicated but is actually pretty elegant.

Actually this part was a pain to get working so bear with me...

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

print("Preparing documents for Doc2Vec...")
tagged_docs = []
for i, speech in enumerate(cleaned_speeches):
    words = speech.split()
    if len(words) > 10:
        tagged_docs.append(TaggedDocument(words=words, tags=[str(i)]))

print(f"Created {len(tagged_docs)} tagged documents")

print("Training Doc2Vec model (this will take several minutes)...")
d2v_model = Doc2Vec(
    tagged_docs, 
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    epochs=40,
    dm=1,
    alpha=0.025,
    min_alpha=0.00025,
    seed=42
)

print("Doc2Vec training completed")

def search_similar_talks(query, model, titles, top_n=5):
    cleaned_query = clean_text_badly(query)
    
    if not cleaned_query.strip():
        print("Query cleaned to empty string")
        return []
    
    query_words = cleaned_query.split()
    query_vec = model.infer_vector(query_words)
    
    similar_docs = model.dv.most_similar([query_vec], topn=top_n)
    
    results = []
    for doc_id, similarity in similar_docs:
        idx = int(doc_id)
        if idx < len(titles):
            results.append((titles[idx], similarity))
    
    return results

test_queries = [
    "artificial intelligence machine learning",
    "climate change environment", 
    "business leadership",
    "education learning",
    "healthcare medicine"
]

for query in test_queries:
    print(f"\n=== Search: '{query}' ===")
    results = search_similar_talks(query, d2v_model, all_titles)
    for title, similarity in results:
        print(f"  {similarity:.3f}: {title}")

The search worked okay. Doc2Vec was decent for finding similar talks but wasn't as good for classification as SVM. I think it needs more data to really shine.
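
For reference, the "Doc2Vec + Logistic Regression" number later in this post came from feeding the learned document vectors into a plain classifier, roughly like this (aligning on the tags so the filtered-out short speeches don't shift the labels):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The tags are the original speech indices, so use them to stay aligned with the labels
kept_idx = [int(doc.tags[0]) for doc in tagged_docs]
X_d2v = np.vstack([d2v_model.dv[str(i)] for i in kept_idx])
y_d2v = [numeric_labels[i] for i in kept_idx]

d2v_scores = cross_val_score(LogisticRegression(max_iter=1000), X_d2v, y_d2v, cv=5)
print(f"Doc2Vec + LogReg CV accuracy: {d2v_scores.mean():.3f}")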

I also tried LSA but honestly the code got messy and I'm not sure it was worth including here. Plus my laptop was already running hot from all the model training.

If you really want to try LSA:

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

lsa_pipeline = Pipeline([
    ('svd', TruncatedSVD(n_components=100, random_state=42)),
    ('normalizer', Normalizer(copy=False))
])

X_lsa = lsa_pipeline.fit_transform(X_tfidf)
print(f"LSA embedding shape: {X_lsa.shape}")

Debugging and Common Mistakes I Made

Let me save you some pain with the mistakes I made (and there were many):

Text encoding will ruin your day. Always check for weird characters and encoding issues. I had some talks with weird unicode characters that broke everything. ftfy helps but isn't magic.

Empty documents after preprocessing. Some talks became empty strings after all the cleaning. This breaks Doc2Vec and other models. Always filter these out.

Memory issues. With large vocabularies, the TF-IDF matrix gets huge. I had to tune max_features to keep it manageable. My laptop was not happy.

Class imbalance is sneaky. Even when categories look balanced, some might be easier to classify than others. Always look at per-class precision and recall, not just overall accuracy.

Overfitting happens fast. With high-dimensional text data, models can memorize rather than generalize. I learned this the hard way when my model got 95% on training data and 62% on test data. Cross-validation saved me here.
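
The quickest check is to compare train and test accuracy for the same fitted model - a big gap is the tell:

# A large gap between these two numbers means the model is memorizing, not generalizing
print(f"Train accuracy: {best_model.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {best_model.score(X_test, y_test):.3f}")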

Topic models are finicky. Run the same model multiple times - if results vary wildly, your topics might not be stable. LDA especially seems sensitive to initialization. I probably should have run it multiple times and averaged but ain't nobody got time for that.

Feature engineering matters more than algorithm choice. I spent weeks tuning SVM parameters when I should have been cleaning the text better. The preprocessing function above went through like 10 iterations.

Doc2Vec parameter tuning is a nightmare. The documentation is confusing and there are so many parameters. I mostly just tried random values until something worked. The dm parameter especially - I still don't really understand what it does but dm=1 seemed better than dm=0.

Random seeds matter. I got different results every time I ran the models until I started setting random_state everywhere. Now I'm paranoid about it.

Also pro tip: the max_df and min_df parameters in TF-IDF vectorizers are your friends. They control vocabulary size and can dramatically improve both performance and speed.
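
If you want to see how much they matter, compare vocabulary sizes directly (the exact numbers will obviously depend on your corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

loose = TfidfVectorizer().fit(cleaned_speeches)
strict = TfidfVectorizer(max_df=0.95, min_df=5).fit(cleaned_speeches)

print(f"No limits:          {len(loose.vocabulary_)} terms")
print(f"With max_df/min_df: {len(strict.vocabulary_)} terms")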

What Actually Worked (The Honest Truth)

After all the experimentation, here's what I learned:

SVM with TF-IDF was the most reliable combination. Not the fanciest, but it worked consistently across different data splits and parameter settings. RBF kernel worked better than linear or sigmoid for this dataset.

Logistic Regression was surprisingly good. Almost as good as SVM and way faster to train. If I was building a production system I'd probably use this.

Doc2Vec created useful embeddings for search but wasn't great for classification. Sometimes the right tool isn't the coolest tool.

Topic modeling revealed interesting patterns but wasn't accurate enough for classification. Use it for exploration, not prediction. Though I did try using topic proportions as additional features for classification - didn't help much but worth trying.
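
If you want to try the topic-proportions-as-features idea yourself, it's basically a horizontal stack of the two matrices - something like this:

from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Append the 8 LDA topic proportions as extra columns next to the TF-IDF features
X_combined = hstack([X_tfidf, csr_matrix(lda_topics)]).tocsr()

combo_scores = cross_val_score(SVC(kernel='rbf', random_state=42), X_combined, numeric_labels, cv=5)
print(f"TF-IDF + topic features CV accuracy: {combo_scores.mean():.3f}")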

Adding bigrams to TF-IDF helped a bit. Trigrams didn't seem to add much and made the feature space huge. Quadgrams were just silly.

Engagement features were meh. I thought laughter/applause counts would be super predictive but they only helped marginally. Content is king.

The biggest lesson? Start simple and add complexity only when you need it. My first attempt with basic TF-IDF + Logistic Regression got like 72% accuracy. All the fancy preprocessing and parameter tuning only got me to 78%.

Actually let me be completely honest about the final numbers (averaged over 5 runs):

  • SVM (RBF, tuned): 78.2% ± 1.1%
  • Logistic Regression: 77.1% ± 0.9%
  • Random Forest: 72.3% ± 1.5%
  • KNN (k=7): 68.5% ± 2.1%
  • Using topic features only: 52.3% ± 3.2%
  • Doc2Vec + Logistic Regression: 69.8% ± 1.8%

So all the fancy stuff helped but not dramatically. Sometimes ML is just incremental improvements.

Where This Gets Used IRL

Beyond just being a fun procrastination project, these techniques power real systems:

  • Content recommendation engines (like Netflix suggestions)
  • Automated support ticket routing
  • Academic paper categorization
  • Social media content moderation
  • Search engines that understand context

I've used variations of this approach on customer reviews (way easier than TED talks), research papers (harder - academics love jargon), and even classifying internal company documents (surprisingly effective).

The basic workflow is pretty transferable but you'll need to adjust the preprocessing for different domains.

Next Steps (if you're into that sort of thing)

If you want to take this further:

Try this on your own data - news articles, product reviews, whatever. The preprocessing will probably need tweaking but the overall approach should work. Reddit comments could be fun.

Experiment with BERT or other transformers. They'll probably beat SVM but they're way slower and need more data. For small datasets like this, classical ML often works better. Plus my laptop can't handle BERT training.

Build a web interface. I started building one with Streamlit but never finished it. Would be cool to have people input text and get predictions in real-time. Maybe I'll finish it for v2.

Try ensemble methods - combine predictions from SVM, Random Forest, and maybe topic model features. Might squeeze out a few more percentage points. XGBoost would probably work well here.
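
If you go the ensemble route, scikit-learn's VotingClassifier makes the soft-voting version almost trivial - here's a sketch using the same models as above:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ('svm', SVC(kernel='rbf', probability=True, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft'  # average predicted probabilities instead of majority vote
)
ensemble.fit(X_train, y_train)
print(f"Ensemble test accuracy: {ensemble.score(X_test, y_test):.3f}")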

Better feature engineering - sentiment analysis, readability scores, named entity recognition. The sklearn pipeline makes it easy to add features.

Conclusion I guess

Look, NLP isn't magic. It's pattern matching with good marketing. But it's incredibly useful pattern matching that can solve real problems.

This project taught me that the most important skill in NLP isn't knowing every algorithm - it's understanding your data and being systematic about evaluation. Also, clean data and simple models often beat fancy algorithms with messy data.

I still use variations of this workflow regularly. It's not the most cutting-edge approach (no transformers, no deep learning), but it's reliable, interpretable, and it works. Sometimes that's exactly what you need.

The code above probably won't work if you just copy-paste it (I didn't include all the data loading parts and there might be some variable name mismatches and I probably forgot some imports) but it should give you the general idea.

Also I never actually deployed this anywhere so I have no idea how it would work in production. That's a problem for future me.

Ready to get started? Try this workflow on your own data, experiment with new models, and let me know how it goes in the comments or whatever. Your next breakthrough in NLP could be just a project away!

P.S. - I just realized I never actually showed you what the topic modeling found. Here are some of the topics that came out (paraphrased):

  • Topic 0: Technology, computer, data, digital (obviously tech talks)
  • Topic 1: People, life, time, world (general life advice)
  • Topic 2: Health, brain, body, medical (health/science talks)
  • Topic 3: Business, company, market, work (business talks)
  • Topic 4: Climate, energy, environment (environmental talks)
  • Topic 5: Education, learning, student, school (education talks)

Pretty coherent for unsupervised learning! Though some topics were harder to interpret.