

Generative AI projects move beyond classification into producing text. This section covers four projects that form a natural progression: you start by predicting the next character or word, scale up to full sequence generation with recurrent networks, build a deterministic autocomplete engine using a prefix tree, and finally combine retrieval with generation in a RAG pipeline. Each project teaches a distinct modeling paradigm and, together, they cover the core ideas behind modern language model applications.
What is RAG and why does it matter?
Retrieval-Augmented Generation (RAG) solves a fundamental limitation of generative models: their knowledge is frozen at training time. A RAG pipeline retrieves relevant passages from an external document store at inference time and injects them into the prompt before the model generates a response. This means the model can answer questions about documents it never saw during training — without any fine-tuning. The vector store (usually backed by embeddings + approximate nearest-neighbor search) does the heavy lifting of finding semantically relevant context, and the language model focuses on synthesizing a coherent answer from that context.

Projects at a glance

Project | Paradigm | Core technique | Key artifact
Next Token Prediction (50) | Statistical / neural LM | N-gram, character-level RNN | Probability distribution over vocabulary
Text Generator (51) | Sequence-to-sequence | LSTM with teacher forcing | Generated text sequences
Prefix Tree Autocomplete (52) | Deterministic data structure | Trie + frequency ranking | Sorted completion candidates
RAG Injection Research Pipeline (45) | Retrieval + generation | Embeddings + vector DB + LLM | Grounded natural language answers
Next Token Prediction (50)

Goal: Given a sequence of characters or words, predict the most likely next token. This is the foundational training objective behind all autoregressive language models.

How it works: A character-level or word-level recurrent model is trained with a sliding window over the input corpus. At each step, the model receives the previous seq_len tokens and predicts a probability distribution over the vocabulary. Cross-entropy loss drives the model to assign high probability to the true next token.

Character-level model (Keras):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding

# Build character vocabulary
text = open("corpus.txt").read().lower()
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for c, i in char2idx.items()}
VOCAB_SIZE = len(chars)

# Create sliding-window sequences
SEQ_LEN = 40
step = 3
sequences, next_chars = [], []
for i in range(0, len(text) - SEQ_LEN, step):
    sequences.append([char2idx[c] for c in text[i:i + SEQ_LEN]])
    next_chars.append(char2idx[text[i + SEQ_LEN]])

X = np.array(sequences)             # (n_seq, SEQ_LEN)
y = np.array(next_chars)            # (n_seq,)

# Model
model = Sequential([
    Embedding(VOCAB_SIZE, 64, input_length=SEQ_LEN),
    LSTM(256, return_sequences=False),
    Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, batch_size=128, epochs=30)
Sampling with temperature:
def sample(preds: np.ndarray, temperature: float = 1.0) -> int:
    preds = np.log(preds + 1e-8) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.random.choice(len(preds), p=preds)
Lower temperature (0.2–0.5) produces more conservative, repetitive output. Higher temperature (0.8–1.2) yields more creative but less coherent text.
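The sample() helper can be dropped into a simple generation loop that feeds the model's own output back in. This is a minimal sketch that reuses model, char2idx, idx2char, SEQ_LEN, and sample from the snippets above; the generate_text name and the seed string are illustrative, not part of the original project code.

def generate_text(seed: str, n_chars: int = 200, temperature: float = 0.5) -> str:
    generated = seed
    for _ in range(n_chars):
        # Use the last SEQ_LEN characters as context, left-padded with spaces for short seeds
        window = generated[-SEQ_LEN:].rjust(SEQ_LEN)
        x = np.array([[char2idx.get(c, 0) for c in window]])
        preds = model.predict(x, verbose=0)[0]        # probability distribution, shape (VOCAB_SIZE,)
        next_idx = sample(preds, temperature)
        generated += idx2char[next_idx]
    return generated

print(generate_text("the model ", n_chars=120, temperature=0.5))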
Text Generator (51)

Goal: Generate coherent multi-sentence text by training an LSTM (Long Short-Term Memory) network on a domain corpus using teacher forcing.

How it works: Unlike next-token prediction, which predicts one token at a time during evaluation, the Text Generator unrolls multiple generation steps, maintaining hidden state across them. Teacher forcing — passing the ground-truth token at each training step rather than the model's own prediction — stabilizes training with LSTMs.

Word-level LSTM generator:
import torch
import torch.nn as nn

class TextGeneratorLSTM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, n_layers: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers,
                            batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)                   # (batch, seq, embed_dim)
        out, hidden = self.lstm(emb, hidden)      # (batch, seq, hidden_dim)
        logits = self.fc(out)                     # (batch, seq, vocab_size)
        return logits, hidden

# Generation loop
def generate(model, seed_tokens, idx2word, word2idx, n_words=50, temperature=0.8):
    model.eval()
    tokens = torch.tensor([seed_tokens])
    hidden = None
    generated = list(seed_tokens)

    with torch.no_grad():
        for _ in range(n_words):
            logits, hidden = model(tokens, hidden)
            probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
            next_token = torch.multinomial(probs, 1).item()
            generated.append(next_token)
            tokens = torch.tensor([[next_token]])

    return " ".join(idx2word[t] for t in generated)
Training tip: Use gradient clipping (torch.nn.utils.clip_grad_norm_) with a max norm of 5.0 to prevent exploding gradients, which are common in deep LSTMs on long sequences.
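To make the teacher-forcing objective concrete, the sketch below trains TextGeneratorLSTM on shifted input/target sequences with the gradient clipping recommended above. The vocabulary size, hyperparameters, and the stand-in train_loader are illustrative assumptions rather than the project's actual training code.

import torch.optim as optim

VOCAB_SIZE = 10_000                                   # illustrative; use the real vocabulary size
model = TextGeneratorLSTM(VOCAB_SIZE, embed_dim=128, hidden_dim=256, n_layers=2)
criterion = nn.CrossEntropyLoss(ignore_index=0)       # ignore the padding index used by the embedding
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Stand-in loader: batches of random token IDs with shape (batch, seq_len + 1)
train_loader = [torch.randint(1, VOCAB_SIZE, (32, 41)) for _ in range(100)]

for epoch in range(10):
    model.train()
    for batch in train_loader:
        inputs, targets = batch[:, :-1], batch[:, 1:] # teacher forcing: target is the input shifted by one
        logits, _ = model(inputs)                     # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()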
Prefix Tree Autocomplete (52)

Goal: Given a partial string prefix, return a ranked list of completion candidates in sub-millisecond time — without any neural network.

How it works: A trie (prefix tree) stores all known words or phrases. Each node represents one character. Insertion is O(k) where k is the word length; prefix lookup is also O(k) and returns all completions reachable from the prefix node. Completions are ranked by insertion frequency so the most common completions surface first.
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    is_end: bool = False
    frequency: int = 0

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, frequency: int = 1) -> None:
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.frequency += frequency

    def _collect(self, node: TrieNode, prefix: str, results: list) -> None:
        if node.is_end:
            results.append((prefix, node.frequency))
        for char, child in node.children.items():
            self._collect(child, prefix + char, results)

    def autocomplete(self, prefix: str, top_n: int = 5) -> list[str]:
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        results = []
        self._collect(node, prefix, results)
        results.sort(key=lambda x: -x[1])        # sort by frequency descending
        return [word for word, _ in results[:top_n]]

# Example usage
trie = Trie()
for word, freq in [("python", 120), ("pytorch", 95), ("pandas", 88), ("pickle", 40)]:
    trie.insert(word, freq)

print(trie.autocomplete("py"))   # ['python', 'pytorch']
print(trie.autocomplete("pa"))   # ['pandas']
When to use a trie over a neural autocomplete: Tries are deterministic, explainable, and extremely fast. They are the right choice when you have a fixed vocabulary (e.g., product names, command completions) and need guaranteed latency. Neural models are better when the completion space is open-ended and semantic similarity matters more than exact prefix matching.
RAG Injection Research Pipeline (45)

Goal: Answer factual questions about a document corpus by retrieving relevant passages at query time and injecting them as context into a language model prompt.

How it works: The pipeline has two phases:
  1. Ingestion — documents are split into overlapping chunks, embedded with a sentence transformer, and stored in a vector database (ChromaDB in this project, as evidenced by the vector_store/chroma.sqlite3 artifact).
  2. Query — the user’s question is embedded with the same model, the top-k most similar chunks are retrieved from the vector store, and they are concatenated into the prompt before the LLM generates an answer.
from sentence_transformers import SentenceTransformer
import chromadb

EMBED_MODEL = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMBED_MODEL)

# --- Ingestion ---
client = chromadb.PersistentClient(path="vector_store/")
collection = client.get_or_create_collection("research_docs")

def ingest_documents(docs: list[dict]) -> None:
    """docs: list of {"id": str, "text": str, "source": str}"""
    texts = [d["text"] for d in docs]
    embeddings = embedder.encode(texts, normalize_embeddings=True).tolist()
    collection.add(
        ids=[d["id"] for d in docs],
        embeddings=embeddings,
        documents=texts,
        metadatas=[{"source": d["source"]} for d in docs],
    )

# --- Retrieval ---
def retrieve(query: str, top_k: int = 4) -> list[str]:
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()
    results = collection.query(query_embeddings=q_emb, n_results=top_k)
    return results["documents"][0]   # list of passage strings

# --- Generation (inject context into prompt) ---
def rag_answer(query: str, llm_generate_fn) -> str:
    passages = retrieve(query)
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Use the following research passages to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\nAnswer:"
    )
    return llm_generate_fn(prompt)
Vector store: This project persists embeddings in ChromaDB (vector_store/chroma.sqlite3). The collection is reloaded across sessions with chromadb.PersistentClient, so ingestion only needs to happen once.

Chunk size matters: Chunks that are too small miss context; chunks that are too large dilute relevance. A chunk size of 256–512 tokens with a 50-token overlap is a good starting point for research papers.
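The ingestion code above assumes documents have already been split into chunks. A minimal word-based chunker is sketched below; it approximates token counts with whitespace-separated words, and the chunk_text helper, the ID scheme, and the example file path are illustrative assumptions rather than part of the project.

def chunk_text(text: str, source: str, chunk_size: int = 300, overlap: int = 50) -> list[dict]:
    """Split a document into overlapping chunks (word counts approximate token counts)."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        piece = " ".join(words[start:start + chunk_size])
        chunks.append({"id": f"{source}-{len(chunks)}", "text": piece, "source": source})
        start += chunk_size - overlap                 # slide forward, keeping `overlap` words of context
    return chunks

# Example: chunk one document, then ingest the pieces with ingest_documents() from above
paper_text = open("papers/example_paper.txt").read()  # hypothetical path
ingest_documents(chunk_text(paper_text, source="example_paper.txt"))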
