
Natural Language Processing

March 30, 2026 Wasil Zafar 40 min read

From bag-of-words to transformers — tracing the evolution of how machines process, understand, and generate human language, with real-world deployment patterns and hands-on code.

Table of Contents

  1. Text Preprocessing & Tokenization
  2. Word Embeddings & Representations
  3. The Transformer Revolution
  4. Core NLP Tasks in Production
  5. Semantic Search & Retrieval
  6. Practical Exercises
  7. Conclusion & Next Steps

About This Series

This is Part 3 of the AI in the Wild: Real-World Applications & Ethics series — a 24-part deep dive covering the complete end-to-end AI journey, from ML foundations through to responsible AI governance.


AI in the Wild: Real-World Applications & Ethics

Your 24-part learning path • Currently on Step 3

  1. AI & ML Landscape Overview: paradigms, ecosystem map, real-world applications at a glance
  2. ML Foundations for Practitioners: supervised learning, bias-variance, model evaluation
  3. Natural Language Processing: tokenization, embeddings, transformers, semantic search (you are here)
  4. Computer Vision in the Real World: CNNs, ViTs, detection, segmentation, deployment patterns
  5. Recommender Systems: collaborative filtering, content-based, two-tower models
  6. Reinforcement Learning Applications: Q-learning, policy gradients, RLHF, real-world deployments
  7. Conversational AI & Chatbots: dialogue systems, intent detection, RAG, production bots
  8. Large Language Models: architecture, scaling laws, capabilities, limitations
  9. Prompt Engineering & In-Context Learning: chain-of-thought, few-shot, structured outputs, prompt patterns
  10. Fine-tuning, RLHF & Model Alignment: LoRA, instruction tuning, DPO, alignment techniques
  11. Generative AI Applications: diffusion models, GANs, image/audio/video generation
  12. Multimodal AI: vision-language models, audio-text, cross-modal retrieval
  13. AI Agents & Agentic Workflows: tool use, planning, memory, multi-agent orchestration
  14. AI in Healthcare & Life Sciences: diagnostics, drug discovery, clinical NLP, regulatory landscape
  15. AI in Finance & Fraud Detection: credit scoring, anomaly detection, algorithmic trading
  16. AI in Autonomous Systems & Robotics: perception, planning, control, sim-to-real transfer
  17. AI Security & Adversarial Robustness: adversarial attacks, poisoning, model extraction, defences
  18. Explainable AI & Interpretability: SHAP, LIME, attention, mechanistic interpretability
  19. AI Ethics & Bias Mitigation: fairness metrics, dataset auditing, debiasing techniques
  20. MLOps & Model Deployment: CI/CD for ML, feature stores, monitoring, drift detection
  21. Edge AI & On-Device Intelligence: quantization, pruning, TFLite, CoreML, embedded inference
  22. AI Infrastructure, Hardware & Scaling: GPUs, TPUs, distributed training, memory hierarchy
  23. Responsible AI Governance: risk frameworks, model cards, auditing, organisational practice
  24. AI Policy, Regulation & Future Directions: EU AI Act, global frameworks, emerging risks, what's next
Text Preprocessing & Tokenization

The fundamental challenge of Natural Language Processing is the gap between human communication and machine computation. Text is inherently unstructured: sequences of characters of arbitrary length, encoding meaning through a combination of word choice, word order, syntactic structure, pragmatic context, and cultural reference. Unlike tabular data where each row has the same fixed-width feature vector, a sentence and a document are both just "text" — yet contain wildly different amounts of information. Converting this raw linguistic signal into numerical representations that machine learning models can process is the first and most consequential engineering decision in any NLP system.

The difficulties stack up quickly. Human vocabulary is enormous — English alone has over 170,000 words in active use, expanding constantly with neologisms, slang, technical jargon, and brand names. The same meaning can be expressed through countless syntactic variations. The same word carries different meanings depending on context ("bank" as financial institution vs. riverbank). And multilingual systems must handle scripts, characters, and morphologies spanning dozens of languages simultaneously. Every representation choice — how to split text, what vocabulary to use, how to handle unknown words — propagates its effects through every downstream component.

Key Insight: Tokenization is not a solved problem, and its choices have consequences far downstream. GPT-4's tokenizer encodes "tokenization" as three tokens; a poorly chosen vocabulary might split it into nine. Longer token sequences mean higher compute cost, effectively shorter context windows, and models that struggle with morphologically complex words. The way you split text fundamentally shapes what your model can learn to represent.

History & Origins of NLP

Natural Language Processing has roots stretching back to the earliest days of computing. Alan Turing's 1950 paper "Computing Machinery and Intelligence" proposed the Turing Test — could a machine converse in a way indistinguishable from a human? — establishing language understanding as a central benchmark for machine intelligence. The Georgetown-IBM experiment of 1954 demonstrated automatic translation of 60 Russian sentences to English, prompting wildly optimistic predictions that machine translation would be solved within five years. Those predictions missed by decades.

The field spent the 1960s through 1980s in the symbolic tradition: hand-crafted grammars (context-free grammars, phrase-structure rules), rule-based parsers, and logic-based semantic representations. ELIZA (1966), the first chatbot, used pattern-matching rules to simulate conversation — fooling users into believing they were speaking with a therapist, despite having no understanding of language at all. The ALPAC report of 1966 concluded that machine translation was slower, less accurate, and twice as expensive as human translation, triggering a decade of reduced funding. The shift to statistical methods in the late 1980s and 1990s — driven by the availability of large text corpora and probabilistic models such as hidden Markov models and n-gram language models — finally cracked practical tasks like speech recognition and part-of-speech tagging. The 2010s brought distributed representations, and the 2017 transformer architecture catalysed the modern era. Each paradigm shift did not replace its predecessor entirely: hybrid systems combining statistical and neural approaches continue to dominate production NLP in specialised, data-scarce domains.

Classical Pipelines

Before the transformer era, NLP systems relied on a preprocessing stack built around heuristics and linguistic knowledge. The canonical pipeline begins with lowercasing and punctuation removal to normalise surface variation, then stopword removal to strip high-frequency function words (the, a, is) that carry little semantic content. Stemming — Porter Stemmer and Snowball being the most common implementations — heuristically strips suffixes to reduce words to their stems ("running", "runs", "ran" all become "run"). Lemmatisation uses vocabulary and morphological analysis (typically WordNet) to reduce words to their canonical dictionary form, producing linguistically valid results where stemming produces roots that may not be real words.

These steps feed into a bag-of-words (BoW) representation: a document becomes a sparse vector where each dimension corresponds to a vocabulary term and its value is the raw count (or binary presence) of that term. TF-IDF (Term Frequency-Inverse Document Frequency) improves on raw counts by upweighting terms that are frequent in the document but rare across the corpus — capturing the intuition that a term that appears everywhere is less informative than one that appears rarely but prominently. These representations remain competitive baselines for document classification, keyword-based search, and any domain where interpretability matters. Their limitations are equally clear: they discard word order entirely, cannot represent polysemy, and require exact match — "automobile" and "car" are unrelated under BoW.

Subword Tokenization

Modern deep learning NLP systems avoid word-level tokenisation entirely, instead breaking text into subword units — fragments smaller than words but larger than characters. The motivation is elegant: whole-word vocabularies can't handle rare or unseen words without an explicit "unknown" token bucket; character-level models generate extremely long sequences and struggle to learn word-level semantics; subword methods find a middle ground where common words are preserved intact and rare words are decomposed into recognisable subunits.

Byte Pair Encoding (BPE), used by GPT-2/3/4 and the Llama family, starts with a character vocabulary and iteratively merges the most frequent adjacent pair of symbols into a new symbol until a target vocabulary size is reached. The result is a vocabulary where common words are single tokens and rare words are composed of frequent subwords. WordPiece, used by BERT and its derivatives, is similar but merges pairs that maximise the likelihood of the training corpus under a language model rather than raw frequency — producing slightly different segmentations that tend to preserve morphological boundaries. SentencePiece, used by T5, Llama, and multilingual models, operates directly on raw text without pre-tokenisation, making it language-agnostic and particularly effective for morphologically rich languages like Finnish, Turkish, or Japanese. Vocabulary size is a critical hyperparameter: GPT-4 uses ~100K tokens; BERT uses 30K; smaller vocabularies produce longer sequences (more compute, shorter effective context) while larger vocabularies have sparse embeddings for rare tokens.
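The BPE merge loop is short enough to sketch in pure Python. This toy implementation is a simplification of what production tokenizers do (no byte-level fallback, no end-of-word markers, no learned merge ranks applied at inference), but it shows how frequent adjacent pairs become vocabulary units:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]  # start at the character level
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for symbols in corpus:  # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [merged]
                else:
                    i += 1
    return merges, corpus

words = ["lower", "lowest", "newer", "newest", "low", "new"]
merges, segmented = bpe_merges(words, num_merges=6)
print(merges[0])   # ('w', 'e'), the most frequent adjacent pair in this corpus
print(segmented)   # each word now segmented into learned subword units
```

After a handful of merges, shared suffixes like "est" start to emerge as single units, which is exactly how real BPE vocabularies come to preserve common morphemes.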

From Tokenization to Dense Embeddings

The following code illustrates the evolution of NLP representations across three levels — from classical spaCy tokenization with part-of-speech tagging and entity recognition, through BERT sub-word tokenization, to semantic sentence embeddings with cosine similarity:

# Evolution of NLP representations: from tokenization to dense embeddings
import spacy
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Basic tokenization
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired Beats Electronics for $3 billion in 2014.")
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:8} | Entity: {token.ent_type_ or 'O'}")
# Apple           | POS: PROPN    | Entity: ORG
# acquired        | POS: VERB     | Entity: O
# ...
# $               | POS: SYM      | Entity: MONEY
# 3               | POS: NUM      | Entity: MONEY
# billion         | POS: NUM      | Entity: MONEY
# 2014            | POS: NUM      | Entity: DATE

# 2. BERT sub-word tokenization
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("uncharacteristically", return_tensors=None)
print(tokenizer.convert_ids_to_tokens(tokens['input_ids']))
# ['[CLS]', 'un', '##char', '##act', '##erist', '##ically', '[SEP]']
# BERT splits rare words into known sub-word pieces

# 3. Semantic embeddings (sentence-transformers)
model = SentenceTransformer('all-MiniLM-L6-v2')  # 22M params, fast
sentences = [
    "The patient has a fever and sore throat.",
    "The customer is running a high temperature.",  # similar meaning, different wording
    "Buy cheap medications online now!"             # unrelated
]
embeddings = model.encode(sentences)  # shape: (3, 384)
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(embeddings)
print(f"Fever vs high temperature: {sim[0,1]:.3f}")  # highest pair despite little lexical overlap
print(f"Fever vs spam:             {sim[0,2]:.3f}")  # lowest: unrelated content

Setting Up Your NLP Environment

Getting started with modern NLP requires installing a handful of packages. The following bash commands set up a complete NLP development environment and demonstrate a working question-answering pipeline in just a few lines:

# Set up a complete NLP development environment
pip install transformers datasets sentence-transformers spacy
pip install faiss-cpu  # or faiss-gpu for CUDA
python -m spacy download en_core_web_sm  # basic English model
python -m spacy download en_core_web_trf  # transformer-based (higher accuracy)

# Quick model download and test
python -c "
from transformers import pipeline
# Auto-downloads the ~500MB roberta-base QA model on first run, cached afterwards
qa = pipeline('question-answering', model='deepset/roberta-base-squad2')
context = 'The Eiffel Tower was completed in 1889. It stands 330 metres tall.'
result = qa(question='How tall is the Eiffel Tower?', context=context)
print(result['answer'])  # '330 metres'
print(f'Confidence: {result[\"score\"]:.2%}')  # Confidence: 98.73%
"

Word Embeddings & Representations

The foundational insight behind word embeddings is the distributional hypothesis, articulated by linguist John Rupert Firth in 1957: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings. This principle implies that we can learn meaningful word representations by analysing statistical co-occurrence patterns across large corpora — no linguistic rules, no hand-crafted ontologies, just patterns in data.

Dense vector representations — embeddings — replace the sparse, high-dimensional one-hot encoding of words (a vector with a single 1 in a vocabulary-sized space) with compact, low-dimensional vectors (typically 50–1000 dimensions) where geometric relationships encode semantic relationships. Words with similar meanings are close in embedding space; relationships are encoded in directions. The famous demonstration: the vector arithmetic king − man + woman ≈ queen shows that gender and royalty are encoded as consistent directions in the learned vector space. This property of geometric regularity made embeddings the universal interface between text and neural networks throughout the 2015–2019 era.

Word2Vec & GloVe

Word2Vec, introduced by Mikolov et al. at Google in 2013, learns embeddings by training a shallow neural network on one of two prediction tasks. The Skip-gram model takes a centre word as input and predicts surrounding context words — learning embeddings that capture syntactic and semantic similarity across wide contexts. The CBOW (Continuous Bag of Words) model takes surrounding context words as input and predicts the centre word — faster to train and performs better on frequent words. Both models learn implicitly from billions of word co-occurrences with negative sampling as a training efficiency trick. GloVe (Global Vectors for Word Representation, Stanford 2014) takes a more direct approach: it factorises the global word co-occurrence matrix, explicitly minimising the difference between the dot product of two word vectors and the logarithm of their co-occurrence probability. GloVe embeddings often outperform Word2Vec on analogy tasks because they incorporate corpus-wide statistics rather than local context windows alone.
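The matrix-factorisation view behind GloVe can be illustrated with a deliberately simplified numpy sketch: count window co-occurrences over a toy corpus, then factorise the matrix with truncated SVD. This is an analogy, not the algorithm — GloVe optimises a weighted log-bilinear objective over log co-occurrences rather than plain SVD — but it shows how dense vectors fall out of co-occurrence statistics alone:

```python
import numpy as np

corpus = ["the cat sat on the mat",
          "the dog sat on the rug",
          "the cat chased the dog"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 token window
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD of the co-occurrence matrix yields dense word vectors
U, S, _ = np.linalg.svd(C)
emb = U[:, :3] * S[:3]  # 3-dimensional embeddings

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words sharing contexts ("cat"/"dog" both follow "the", precede "sat")
# should land near each other in the factorised space
print(cosine(emb[idx["cat"]], emb[idx["dog"]]))
```

On a corpus this small the geometry is noisy, but the recipe — co-occurrence statistics in, dense vectors out — is the distributional hypothesis made operational.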

FastText, released by Facebook AI Research in 2016, extends Word2Vec by representing each word as a bag of character n-grams. The embedding of a word is the sum of its component n-gram embeddings, allowing the model to generate representations for words that never appeared in training by composing their subword embeddings. This makes FastText robust to morphological variation and misspellings — critical for handling user-generated text with non-standard spelling. Pre-trained FastText embeddings in 157 languages democratised NLP for low-resource language communities. The limitation shared by all these static embedding methods is that each word maps to exactly one vector regardless of context — "bank" has a single embedding that must somehow average its financial and geographical senses.

Contextual Embeddings

The limitation of static embeddings — one vector per word, context-independent — was well understood by 2017. The word "fine" has radically different meanings in "I feel fine," "a parking fine," and "fine-grained analysis," yet a static embedding assigns all three usages the same vector. ELMo (Embeddings from Language Models, 2018) addressed this by using a bidirectional LSTM language model to generate contextualised representations: instead of a single embedding per word, ELMo produces a different embedding for each token occurrence, conditioned on the full surrounding sentence. Plugging ELMo embeddings into downstream models produced significant accuracy improvements across NLP benchmarks, establishing contextualisation as the new standard.

The transformer architecture (covered next) made contextual embeddings dramatically more powerful and scalable. BERT's representations — where every token's embedding is computed through 12 or 24 layers of bidirectional self-attention — proved so information-rich that fine-tuning just the final layer on labelled data outperformed task-specific architectures trained from scratch. Sentence-BERT (SBERT, 2019) extended this to sentence-level representations: by training BERT with a siamese network architecture on natural language inference pairs, it produces sentence embeddings that are meaningfully comparable via cosine similarity — enabling efficient semantic similarity search, clustering, and information retrieval at scale. Sentence embeddings from models like all-MiniLM-L6-v2 and text-embedding-ada-002 are now the de facto input representation for vector search and RAG systems.

The Transformer Revolution

The 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain introduced the transformer architecture and fundamentally changed the trajectory of NLP — and, eventually, of AI as a whole. The transformer's key innovation was replacing recurrent processing (LSTM, GRU) with self-attention: a mechanism that computes relationships between all pairs of tokens in a sequence simultaneously, enabling fully parallel training and far superior modelling of long-range dependencies. An LSTM processing a 512-token document must propagate information through 512 sequential steps, with gradient signals weakening at each step. A transformer processes all 512 tokens simultaneously, with each token attending directly to every other token regardless of distance.

The practical consequences were profound. Training became orders of magnitude faster because all positions in a sequence could be processed in parallel on GPU hardware. Longer context windows became tractable. And the self-attention mechanism proved surprisingly good at learning what to attend to — heads specialising in syntactic agreement, coreference resolution, and semantic similarity without any explicit supervision. The transformer spawned three major architectural variants: encoder-only models (BERT) for understanding tasks, decoder-only models (GPT) for generation tasks, and encoder-decoder models (T5, BART) for sequence-to-sequence tasks like translation and summarisation.

Attention Mechanism

Self-attention works by computing, for each token in the input sequence, a weighted combination of all other tokens' representations — the weights reflecting how relevant each other token is to the current one. The mechanism is parameterised by three learned linear projections: Query (Q), Key (K), and Value (V). For each token, its Query is compared against the Keys of all tokens via a dot product; these scores are scaled by the square root of the key dimension (to prevent vanishing gradients in high dimensions) and passed through a softmax to produce attention weights that sum to 1. The output for each token is then a weighted sum of Value vectors. In matrix form: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V.
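The formula translates almost line-for-line into numpy. A minimal single-head sketch with random toy inputs (real implementations add batching, masking, and learned Q/K/V projections):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq): relevance of every token to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)             # (4, 8): one contextualised vector per token
print(weights.sum(axis=-1))  # each row sums to 1
```

Note that every output row is a convex combination of the Value rows, which is why attention weights are often inspected as a (rough) relevance signal.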

Multi-head attention runs this mechanism h times in parallel with different learned projections, then concatenates and linearly projects the results. Different heads can attend to different types of relationships simultaneously — syntactic, semantic, positional. Empirically, heads specialise: some track subject-verb agreement, some resolve pronouns to their antecedents, some capture word-level synonymy. Since self-attention is permutation-invariant (the output is the same regardless of token order), transformers add positional encodings to inject sequence order. The original transformer used sinusoidal encodings; modern models like RoPE (Rotary Position Embedding) and ALiBi encode positions in ways that generalise better to sequences longer than those seen during training.
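The original sinusoidal encodings are easy to generate directly. A short numpy sketch following the 2017 formulation (sin on even dimensions, cos on odd, frequencies geometrically spaced by the 10000^(2i/d) term):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)
print(pe.shape)    # (128, 64)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

These vectors are simply added to the token embeddings before the first attention layer, giving the otherwise permutation-invariant model a notion of order.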

BERT, GPT & Beyond

BERT (Bidirectional Encoder Representations from Transformers, Google 2018) is an encoder-only transformer pre-trained with two tasks: Masked Language Modelling (randomly masking 15% of input tokens and predicting them from context) and Next Sentence Prediction (classifying whether two sentences are consecutive). The bidirectionality is key — BERT's masked LM allows each token to attend to both left and right context simultaneously, making it ideal for understanding tasks (classification, NER, question answering) where the full context is available at inference time. Fine-tuning BERT on task-specific data by adding a small task head and running a few epochs became the dominant NLP recipe from 2019 to 2021.

GPT (Generative Pre-trained Transformer, OpenAI 2018) uses a decoder-only architecture with causal (autoregressive) language modelling: predict each token from its left context only. The causal masking means GPT can generate text by iteratively predicting the next token — making it natural for generation tasks. GPT-2 (2019) scaled this approach and demonstrated surprisingly capable zero-shot generation. GPT-3 (2020) scaled further to 175 billion parameters and revealed in-context learning: the model could perform new tasks from a few examples in its prompt without any weight updates. T5 (Text-to-Text Transfer Transformer, Google 2020) is the elegant unification: it frames every NLP task as text-in → text-out, using an encoder-decoder architecture. Classification becomes generating a class label string; summarisation generates a summary; translation generates the target language. This text-to-text framing enabled T5 to train a single model on diverse tasks simultaneously. Successors — RoBERTa, DeBERTa, PaLM, LLaMA, Mistral, Claude, Gemini — refine or scale these foundational architectures; Part 8 covers the LLM scaling landscape in depth.
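The split between the encoder and decoder families comes down to one detail: the attention mask. An illustrative numpy sketch (not any specific model's code) of causal masking, where upper-triangular positions are set to -inf before the softmax so each token sees only its left context:

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(1).standard_normal((seq_len, seq_len))

# Causal (GPT-style): position i may attend only to positions <= i
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(causal_mask, -np.inf, scores)  # -inf -> zero weight after softmax

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to token 0; row 4 attends to all five tokens.
# An encoder (BERT-style) simply omits the mask, so every row is dense.
```

This single mask is what lets a decoder generate autoregressively, and what lets an encoder condition every token on both directions.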

NLP Model Family Comparison

Model Family | Architecture | Best For | Context Window | Open-Source?
BERT / RoBERTa | Encoder-only transformer | Classification, NER, QA (extractive) | 512 tokens | Yes (HuggingFace)
GPT-series | Decoder-only transformer | Text generation, instruction following, code | 128K+ tokens (GPT-4o) | GPT-2 only; GPT-3/4 via API
T5 / BART | Encoder-decoder transformer | Summarisation, translation, data-to-text | 512–4096 tokens | Yes (T5: Google; BART: Meta)
LLaMA 3 | Decoder-only transformer | General purpose; fine-tunable on-premise | 128K tokens | Yes (Meta, open weights)
sentence-transformers | Bi-encoder (BERT-based siamese) | Semantic search, clustering, similarity | 512 tokens typical | Yes (HuggingFace Hub)

Core NLP Tasks in Production

The transformer revolution didn't just improve NLP research benchmarks — it dramatically lowered the bar for deploying NLP capabilities in production by enabling a single pre-trained model to be adapted to many tasks. The Hugging Face Transformers library with its pipeline() abstraction is perhaps the most significant democratisation of AI tooling in the past decade: a sentiment classifier, a named entity recogniser, a summariser, and a question-answering system can each be deployed in fewer than five lines of Python. Understanding what each task actually is — and what can go wrong in production — remains essential.

Classification & Sentiment

Text classification assigns a document or sentence to one of a discrete set of categories. Common production instances include: spam filtering (binary), support ticket routing (multi-class, 10–100 categories), content moderation (multi-label, multiple policy violation types can co-occur), and intent classification for dialogue systems. The standard approach is to fine-tune a BERT-class encoder on labelled examples by adding a classification head on top of the [CLS] token representation and training with cross-entropy loss. Label efficiency is excellent — BERT fine-tuned on 500–1000 examples typically outperforms classical ML trained on tens of thousands.

Sentiment analysis can be framed as classification (positive/negative/neutral) or regression (0–5 star rating prediction). Aspect-based sentiment analysis (ABSA) is the more demanding production variant: rather than assigning a single sentiment to a whole review, ABSA identifies which aspect each sentiment refers to ("the battery life is excellent but the camera is disappointing"). Zero-shot classification with instruction-tuned LLMs has shifted the paradigm for new use cases: you can describe your categories in natural language without labelled training data. The production cautions are real: LLMs add latency and cost; fine-tuned small models (DistilBERT, ELECTRA) are 10–100x faster and cheaper for high-volume, well-defined classification tasks. Label drift — the categories that matter in production shift over time — requires active monitoring and periodic retraining, often aided by human-in-the-loop labelling of low-confidence or high-impact predictions.
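As a sketch of the fast classical baseline referred to above, here is a TF-IDF plus logistic regression intent classifier on a tiny invented dataset (real systems need at least hundreds of labelled examples per class; the categories and texts here are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data: three intent classes, two examples each
texts = [
    "refund my order immediately", "I want my money back",
    "how do I reset my password", "cannot log in to my account",
    "great product, works perfectly", "love it, five stars",
]
labels = ["billing", "billing", "account", "account", "praise", "praise"]

# TF-IDF features (unigrams + bigrams) feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["please process my refund"]))
print(clf.predict(["help me reset the password"]))
```

A pipeline like this trains in milliseconds and serves thousands of requests per second on CPU, which is why it remains the benchmark any transformer deployment has to beat on cost.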

Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities in text — people, organisations, locations, dates, product names, medical concepts — by labelling each token with a tag in the BIO (Beginning-Inside-Outside) scheme. A sentence like "Tim Cook announced at Apple's Cupertino campus" would be tagged: Tim (B-PER), Cook (I-PER), Apple (B-ORG), Cupertino (B-LOC). NER is a foundational component in information extraction pipelines, knowledge graph construction, clinical data structuring, and financial document analysis.
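Converting BIO tags back into entity spans is a small but error-prone step in its own right. A minimal decoding sketch using the example sentence above (whitespace-joined tokens for simplicity; real pipelines track character offsets):

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open entity before starting a new one
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:  # "O", or an I- tag with no matching open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:  # entity running to the end of the sentence
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Tim", "Cook", "announced", "at", "Apple", "'s", "Cupertino", "campus"]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Tim Cook', 'PER'), ('Apple', 'ORG'), ('Cupertino', 'LOC')]
```

Note the handling of a stray I- tag without a preceding B- of the same type: treating it as "O" (as here) is one convention; some evaluation scripts instead treat it as opening a new entity, and the choice affects entity-level F1.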

Classical NER relied on hand-crafted features fed into Conditional Random Fields (CRFs) — and CRFs remain competitive for narrow, well-defined entity types with limited training data because their graphical model structure explicitly captures label dependencies (an I-ORG tag cannot follow a B-PER tag). Transformer-based NER (fine-tuned BERT with a token-level classification head) substantially outperforms CRFs when training data is abundant, and generalises better to novel entity mentions. Domain-specific NER presents the hardest challenges: clinical NER must recognise medications, dosages, adverse events, and procedures using vocabulary absent from Wikipedia-trained models; legal NER must identify statutory references and case citations. spaCy and Hugging Face provide the standard production tooling; customisation via annotation and fine-tuning with Prodigy or Label Studio is the standard path to domain-specific performance.

Machine Translation

Machine translation has undergone three paradigm shifts. Statistical machine translation (SMT), dominant from the 1990s to the mid-2010s, decomposed translation into a phrase translation model and a language model — explicitly learned from parallel corpora. Neural machine translation (NMT) with encoder-decoder architectures and attention (Bahdanau attention, 2015) replaced SMT by learning end-to-end continuous representations, producing more fluent translations at the cost of interpretability. The transformer-based systems that now power Google Translate, DeepL, and Microsoft Translator represent the third era: pre-trained on hundreds of language pairs simultaneously, they enable zero-shot translation between language pairs never seen together during training.

Current multilingual models like Meta's NLLB-200 (No Language Left Behind) support over 200 languages, specifically targeting low-resource languages where parallel data is scarce. Evaluation remains contentious: BLEU (n-gram overlap with reference translations) correlates poorly with human judgements of fluency and adequacy and is systematically gamed by modern neural systems; COMET and BLEURT use neural models to better approximate human evaluation. The remaining challenges are significant: translation of humour, cultural metaphors, technical jargon, and languages with radically different word orders (Japanese, Turkish) still requires human post-editing for high-stakes content.

NLP Task to Algorithm Mapping

Task | Approach | Key Library / Model | Evaluation Metric
Text Classification | Fine-tune BERT encoder + classification head | HuggingFace transformers, DistilBERT | F1-score (macro), accuracy
Named Entity Recognition | Token classification with BIO tagging | spaCy, BERT-NER, Flair | Entity-level F1 (CoNLL metric)
Summarisation | Encoder-decoder (T5, BART) or LLM prompting | facebook/bart-large-cnn, GPT-4 | ROUGE-1/2/L, BERTScore
Question Answering | Extractive (span selection) or generative | deepset/roberta-base-squad2, RAG pipeline | Exact Match, F1 over token overlap
Machine Translation | Encoder-decoder, multilingual pre-training | Helsinki-NLP/opus-mt, NLLB-200 | BLEU, COMET, BLEURT
Semantic Search / RAG | Bi-encoder embedding + ANN + cross-encoder reranker | sentence-transformers, FAISS, Pinecone | Recall@K, MRR, NDCG

NER and Semantic Search in Code

The following code demonstrates two production-grade NLP patterns: a BERT NER pipeline extracting structured entities from text, and a FAISS-based semantic search index for approximate nearest-neighbour retrieval:

from transformers import pipeline
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Named Entity Recognition with BERT
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english",
               aggregation_strategy="simple")
text = "Microsoft CEO Satya Nadella announced a $10B investment in OpenAI at Davos."
entities = ner(text)
for ent in entities:
    print(f"{ent['word']:20} → {ent['entity_group']} ({ent['score']:.2f})")
# Microsoft            → ORG (0.99)
# Satya Nadella        → PER (0.98)
# OpenAI               → ORG (0.95)
# Davos                → LOC (0.89)
# Note: this CoNLL-03 model only tags PER/ORG/LOC/MISC, so "$10B" is left untagged

# Simple semantic search with FAISS
embedder = SentenceTransformer('all-mpnet-base-v2')  # higher quality, slower
corpus = ["Python is a programming language.", "Dogs are loyal companions.",
          "Neural networks learn from data.", "The stock market fluctuated today."]
corpus_embeddings = embedder.encode(corpus, normalize_embeddings=True)

# Build FAISS index for fast ANN search
index = faiss.IndexFlatIP(corpus_embeddings.shape[1])  # inner product = cosine for normalized
index.add(corpus_embeddings.astype('float32'))

query = "How do machine learning models train?"
query_emb = embedder.encode([query], normalize_embeddings=True)
scores, indices = index.search(query_emb.astype('float32'), k=2)
print(f"Top match: '{corpus[indices[0][0]]}' (score={scores[0][0]:.3f})")
# Top match: 'Neural networks learn from data.' (score=0.742)

Important: NLP models trained on public web data encode and often amplify the biases present in that data. A sentiment classifier trained on product reviews will systematically underperform on reviews written in non-standard dialects. A hiring document classifier may rate CVs differently based on the demographic signals encoded in names and educational institutions. Never ship an NLP model into a decision-making context without an explicit audit of demographic and dialectal performance disparities — the gap between benchmark accuracy and equitable performance is consistently larger than expected.

Practical Exercises

These exercises progress from simple NLP tooling to building a functional RAG pipeline. Each exercise builds directly on the concepts covered in this article. You will need Python 3.9+, a HuggingFace account (free), and approximately 4GB of disk space for model downloads.

Exercise 1 Beginner

Entity Extraction from News

Use spaCy to extract all entities from a news article of your choice (copy-paste from any newspaper). Count how many ORG, PERSON, and LOCATION entities appear. What does the entity distribution tell you about the article's topic? Now compare spaCy's output on the same text to the HuggingFace NER pipeline using dbmdz/bert-large-cased-finetuned-conll03-english. Which system identifies more entities? Which makes more errors? Document 3 disagreements and hypothesise why each occurred.

Exercise 2 Intermediate

Semantic Search vs Keyword Search

Implement a simple semantic search over 50 Wikipedia sentences (use the datasets library to pull from the Wikipedia dataset) using sentence-transformers and cosine similarity. Implement TF-IDF keyword search over the same corpus using sklearn. Design 5 test queries: 2 that use exact keywords from the corpus, 2 that use synonyms, and 1 that asks a question. Compare the top-3 results from each system. Which approach handles synonym queries better? Which handles exact-match queries better? This exercise reveals the complementarity of hybrid search.

Exercise 3 Intermediate

BERT Fine-tuning for Sentiment Classification

Fine-tune a BERT-based model (distilbert-base-uncased is recommended for speed) for sentiment classification on the IMDB dataset using HuggingFace Trainer. Train for 3 epochs with a learning rate of 2e-5. Track training and validation loss at each epoch. What accuracy do you achieve? Now compare to a TF-IDF + Logistic Regression baseline trained on the same data. How many labelled examples does the BERT model need before it overtakes the classical baseline? (Hint: run both systems at 100, 500, 1000, and 25000 examples.)
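The classical baseline and the learning-curve loop can be structured as below. This is a sketch only: the six inline sentences are a toy stand-in for IMDB (in the exercise, load the real data with `load_dataset("imdb")` from the `datasets` library and substitute the slice sizes from the hint), and the BERT side follows the standard HuggingFace Trainer recipe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for IMDB reviews (label 1 = positive, 0 = negative)
train_texts = [
    "a wonderful, moving film", "utterly boring and predictable",
    "loved every minute of it", "a dull waste of two hours",
    "brilliant acting and a great script", "terrible pacing, awful dialogue",
]
train_labels = [1, 0, 1, 0, 1, 0]

def train_baseline(texts, labels):
    """TF-IDF + Logistic Regression: the classical baseline to beat."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model

# Learning curve: refit on growing slices (use 100/500/1000/25000 on real IMDB)
for n in (2, 4, 6):
    model = train_baseline(train_texts[:n], train_labels[:n])
    print(n, model.score(train_texts, train_labels))
```

Running the same slice sizes through the fine-tuned DistilBERT model gives you the crossover point the exercise asks for; on IMDB-scale data the classical baseline typically remains competitive in the low-hundreds regime.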

Exercise 4 Advanced

Build a Simple RAG Pipeline

Build a minimal Retrieval-Augmented Generation pipeline: (1) choose a PDF document of your choice (a technical report or paper works well); (2) parse it into paragraphs using pypdf; (3) embed all paragraphs with all-mpnet-base-v2 and store in a FAISS index; (4) at query time, retrieve the top-3 passages using cosine similarity; (5) format a prompt with the retrieved context and call a GPT-4o or Claude API to generate a grounded answer. Test on 5 questions about the document's content — 3 that are directly answerable from the text, and 2 that require synthesis across paragraphs. Compare the LLM's answers with and without the retrieved context. What happens when no relevant passages are retrieved?
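Step (5) — formatting the retrieved context into a grounded prompt — is where most RAG quality problems originate, so it is worth isolating. The sketch below is one reasonable prompt template, not a canonical one; `build_rag_prompt` is a hypothetical helper, and the `passages` argument is whatever your FAISS search from steps (3)–(4) returns.

```python
def build_rag_prompt(question, passages):
    """Assemble retrieved passages and a question into a grounded-answer prompt."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What dataset was used?",
    ["The experiments use the IMDB review corpus.", "Models were trained for 3 epochs."],
)
print(prompt)
```

The explicit "say so" instruction matters for the final test in the exercise: when retrieval returns nothing relevant, a prompt without it invites the LLM to answer from parametric memory, which is exactly the ungrounded behaviour you are trying to observe and prevent.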

NLP Pipeline Specification Generator

Use this tool to document your NLP pipeline design — from model selection to deployment. A clear specification helps align engineering, product, and data teams before implementation begins, and serves as a reference during QA and monitoring.


Conclusion & Next Steps

NLP has undergone a more complete paradigm shift than any other ML domain over the past decade. The progression from bag-of-words to word embeddings to contextual embeddings to transformer pre-training represents not incremental improvement but a fundamental change in what NLP systems can do. A BERT model fine-tuned on 1,000 labelled examples now outperforms the best classical system trained on 100,000 examples on almost any task, and a modern LLM used zero-shot often matches both without any task-specific training at all.

The practical takeaways for practitioners are clear. Tokenization matters more than it appears: vocabulary choices affect compute cost, context window effectiveness, and performance on low-frequency and multilingual text. Embeddings are the universal interface between text and downstream ML: whether you are building a classifier, a retrieval system, a clustering pipeline, or a RAG application, your entry point is always a text encoder producing dense vectors. And production NLP demands evaluation beyond benchmark accuracy — calibration, demographic fairness, latency under load, and robustness to the specific distribution of text your users actually write.

The concepts developed here — tokenization, attention, contextual representations, the encoder/decoder/encoder-decoder triad — are the substrate on which every subsequent article in this series builds. Conversational AI (Part 7) rests on intent classification, NER, and dialogue management. Large Language Models (Part 8) are transformers at scale. Prompt engineering (Part 9) is the art of eliciting useful behaviour from autoregressive decoders. Understanding NLP is no longer optional for AI practitioners — it is the common language of the field.

Next in the Series

In Part 4: Computer Vision in the Real World, we'll shift from text to images — covering CNNs, Vision Transformers, object detection, segmentation, and real deployment patterns across healthcare, retail, and autonomous systems.
