RAG Systems: The Context Window Problem
How to handle large document sets when your LLM has a 4K token limit. Practical chunking strategies and retrieval optimization.
The Problem
You’re building a RAG (Retrieval-Augmented Generation) system. You have a 500-page technical manual. Your LLM has a 4K token context window. How do you fit 500 pages into 4K tokens? You don’t.
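To put rough numbers on it: at a few hundred words per page and roughly 1.3 tokens per word, 500 pages comes to a few hundred thousand tokens, on the order of a hundred times more than a 4K window can hold.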
The Solution: Smart Chunking + Retrieval
The trick is to retrieve only the relevant chunks and pass those to the LLM.
Step 1: Chunk Your Documents
Break large documents into smaller, semantically meaningful chunks:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text, chunk_size=500, chunk_overlap=50):
    """
    Split text into chunks with overlap for context preservation.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_text(text)
    return chunks

# Example
document = """
# Chapter 1: Introduction
This is a long document...

# Chapter 2: Architecture
System design principles...
"""

chunks = chunk_document(document)
# Result: ["# Chapter 1: Introduction\nThis is...", "# Chapter 2: Architecture\nSystem..."]
```
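Note that chunk_size here counts characters (the splitter's default length function is len), not tokens. If you'd rather size chunks in tokens directly, LangChain's splitters can be built from a tiktoken encoder. A minimal sketch, assuming a LangChain version that exposes from_tiktoken_encoder:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Size chunks by token count instead of character count
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # encoding used by GPT-4 / text-embedding-3-*
    chunk_size=500,                # now measured in tokens
    chunk_overlap=50
)
chunks = token_splitter.split_text(document)
```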
Step 2: Embed and Store
Convert chunks to vectors and store in a vector database:
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("technical_docs")

def embed_chunks(chunks):
    """
    Generate embeddings for all chunks and store them.
    """
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding

        # Store in vector DB
        collection.add(
            embeddings=[embedding],
            documents=[chunk],
            ids=[f"chunk_{i}"]
        )
```
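Calling the embeddings endpoint once per chunk gets slow for large documents. The OpenAI embeddings API accepts a list of inputs, and Chroma's add takes parallel lists, so a batched variant is straightforward. A sketch; the function name and the batch size of 100 are arbitrary choices, not part of the original example:

```python
def embed_chunks_batched(chunks, batch_size=100):
    """
    Embed chunks in batches to cut round trips to the embeddings API.
    """
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]

        # One API call per batch; `input` accepts a list of strings
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings = [item.embedding for item in response.data]

        # Chroma accepts parallel lists of embeddings, documents, and ids
        collection.add(
            embeddings=embeddings,
            documents=batch,
            ids=[f"chunk_{start + j}" for j in range(len(batch))]
        )
```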
Step 3: Retrieve Relevant Chunks
When a user asks a question, find the most relevant chunks:
```python
def retrieve_relevant_chunks(query, top_k=3):
    """
    Find the top K most relevant chunks for a query.
    """
    # Embed the query
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    # Query vector DB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results['documents'][0]

# Example
query = "How does the authentication system work?"
relevant_chunks = retrieve_relevant_chunks(query)
# Returns: ["# Chapter 5: Authentication\nThe system uses...", ...]
```
Step 4: Generate Answer with Context
Pass only the relevant chunks to the LLM:
```python
def answer_question(query, chunks):
    """
    Generate an answer using the retrieved context.
    """
    # Build context from chunks
    context = "\n\n".join(chunks)

    # Create prompt
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

# Usage
query = "How does authentication work?"
relevant_chunks = retrieve_relevant_chunks(query, top_k=3)
answer = answer_question(query, relevant_chunks)
```
Optimization Strategies
1. Dynamic Chunk Count
Adjust top_k based on token budget:
```python
def count_tokens(text):
    """Rough estimate: 1 token ≈ 4 characters."""
    return len(text) // 4

def retrieve_within_budget(query, max_tokens=3000):
    """
    Retrieve chunks until the token budget is exhausted.
    """
    chunks = retrieve_relevant_chunks(query, top_k=10)

    selected_chunks = []
    total_tokens = 0
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk)
        if total_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens

    return selected_chunks
```
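The 4-characters-per-token heuristic is fine for budgeting, but if you want exact counts you can swap in tiktoken, OpenAI's tokenizer library. A minimal sketch; the function name is my own:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and the text-embedding-3-* models
_encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens_exact(text):
    """Count tokens exactly using the model's tokenizer."""
    return len(_encoding.encode(text))
```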
2. Re-ranking for Precision
Use a re-ranker to improve relevance:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_chunks(query, chunks):
    """
    Re-rank chunks for better relevance.
    """
    # Score each (query, chunk) pair with the cross-encoder
    scores = reranker.predict([
        (query, chunk) for chunk in chunks
    ])

    # Sort by score, highest first
    ranked = sorted(
        zip(chunks, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [chunk for chunk, score in ranked]
```
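A common way to use this: over-fetch from the vector store with the fast embedding search, re-rank with the slower but more accurate cross-encoder, and keep only the best few. A sketch using the functions defined above; the 10-then-3 split is an arbitrary choice:

```python
# Cast a wide net, then refine
candidates = retrieve_relevant_chunks(query, top_k=10)
top_chunks = rerank_chunks(query, candidates)[:3]
answer = answer_question(query, top_chunks)
```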
3. Hierarchical Retrieval
Retrieve at multiple levels:
```python
def hierarchical_retrieval(query):
    """
    First find relevant sections, then chunks within those sections.

    Assumes two collections: a "sections" collection with one embedding
    per section summary, plus the chunk collection used above.
    """
    # Level 1: Find relevant sections (queries the section-level collection)
    sections = retrieve_relevant_sections(query, top_k=2)

    # Level 2: Find chunks within those sections
    relevant_chunks = []
    for section in sections:
        section_chunks = retrieve_chunks_from_section(section, query)
        relevant_chunks.extend(section_chunks)

    return relevant_chunks[:5]  # Top 5 total
```
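Neither helper is defined above: retrieve_relevant_sections is just retrieve_relevant_chunks pointed at a section-level collection, and retrieve_chunks_from_section can be built on Chroma's metadata filtering. A sketch of the latter, assuming each chunk was stored with a "section" metadata field (as in the metadata example under Common Pitfalls):

```python
def retrieve_chunks_from_section(section_name, query, top_k=3):
    """
    Query the chunk collection, restricted to chunks whose metadata
    marks them as belonging to the given section.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where={"section": section_name}  # Chroma metadata filter
    )
    return results['documents'][0]
```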
Common Pitfalls
❌ Chunks Too Small
```python
# BAD: Loses context
chunk_size=100  # Too small, fragmented context
```
❌ No Overlap
```python
# BAD: Context breaks at boundaries
chunk_overlap=0  # Information lost between chunks
```
❌ Ignoring Metadata
```python
# GOOD: Include metadata for better filtering
collection.add(
    embeddings=[embedding],
    documents=[chunk],
    metadatas=[{
        "source": "manual.pdf",
        "section": "Chapter 5",
        "page": 42
    }],
    ids=[f"chunk_{i}"]
)
```
Production Checklist
- ✅ Chunk size: 300-1000 tokens (balance context vs granularity)
- ✅ Overlap: 10-20% of chunk size
- ✅ Use semantic separators (paragraphs, headers)
- ✅ Include metadata (source, section, page)
- ✅ Implement re-ranking for better precision
- ✅ Cache embeddings (don’t regenerate on every query; see the sketch after this list)
- ✅ Monitor retrieval quality (are correct chunks retrieved?)
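On the query path, the only embedding generated at request time is the query's; caching it avoids paying for repeated identical questions. A minimal sketch of such a cache, assuming a single process, an in-memory dict, and no eviction policy (swap in Redis or similar for production):

```python
import hashlib

_embedding_cache = {}

def embed_text_cached(text, model="text-embedding-3-small"):
    """
    Return a cached embedding if this exact text was embedded before,
    otherwise call the API and cache the result.
    """
    key = (model, hashlib.sha256(text.encode("utf-8")).hexdigest())
    if key not in _embedding_cache:
        response = client.embeddings.create(model=model, input=text)
        _embedding_cache[key] = response.data[0].embedding
    return _embedding_cache[key]
```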
The Takeaway
RAG isn’t about fitting everything into context—it’s about retrieving the right pieces. Chunk smart, embed well, retrieve precisely.
Rule of thumb: 3-5 chunks of 500 tokens each = sweet spot for most questions.