RAG Systems: The Context Window Problem
How to handle large document sets when your LLM has a 4K token limit. Practical chunking strategies and retrieval optimization.
The Problem
You’re building a RAG (Retrieval-Augmented Generation) system. You have a 500-page technical manual. Your LLM has a 4K token context window. How do you fit 500 pages into 4K tokens? You don’t.
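To put rough numbers on it: at a few hundred words per page and roughly 1.3 tokens per word, 500 pages comes to a few hundred thousand tokens, on the order of a hundred times more than a 4K window can hold.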
The Solution: Smart Chunking + Retrieval
The trick is to retrieve only the relevant chunks and pass those to the LLM.
Step 1: Chunk Your Documents
Break large documents into smaller, semantically meaningful chunks:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text, chunk_size=500, chunk_overlap=50):
    """
    Split text into chunks with overlap for context preservation.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_text(text)
    return chunks

# Example
document = """
# Chapter 1: Introduction
This is a long document...

# Chapter 2: Architecture
System design principles...
"""

chunks = chunk_document(document)
# Result: ["# Chapter 1: Introduction\nThis is...", "# Chapter 2: Architecture\nSystem..."]
```
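Note that chunk_size here counts characters (the splitter's default length function is len), not tokens. If you'd rather size chunks in tokens directly, LangChain's splitters can be built from a tiktoken encoder. A minimal sketch, assuming a LangChain version that exposes from_tiktoken_encoder:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Size chunks by token count instead of character count
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # encoding used by GPT-4 / text-embedding-3-*
    chunk_size=500,                # now measured in tokens
    chunk_overlap=50
)
chunks = token_splitter.split_text(document)
```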
Step 2: Embed and Store
Convert chunks to vectors and store in a vector database:
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("technical_docs")

def embed_chunks(chunks):
    """
    Generate embeddings for all chunks and store them.
    """
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding

        # Store in vector DB
        collection.add(
            embeddings=[embedding],
            documents=[chunk],
            ids=[f"chunk_{i}"]
        )
```
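Calling the embeddings endpoint once per chunk gets slow for large documents. The OpenAI embeddings API accepts a list of inputs, and Chroma's add takes parallel lists, so a batched variant is straightforward. A sketch; the function name and the batch size of 100 are arbitrary choices, not part of the original example:

```python
def embed_chunks_batched(chunks, batch_size=100):
    """
    Embed chunks in batches to cut round trips to the embeddings API.
    """
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]

        # One API call per batch; `input` accepts a list of strings
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings = [item.embedding for item in response.data]

        # Chroma accepts parallel lists of embeddings, documents, and ids
        collection.add(
            embeddings=embeddings,
            documents=batch,
            ids=[f"chunk_{start + j}" for j in range(len(batch))]
        )
```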
Step 3: Retrieve Relevant Chunks
When a user asks a question, find the most relevant chunks:
```python
def retrieve_relevant_chunks(query, top_k=3):
    """
    Find the top K most relevant chunks for a query.
    """
    # Embed the query
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    # Query vector DB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results['documents'][0]

# Example
query = "How does the authentication system work?"
relevant_chunks = retrieve_relevant_chunks(query)
# Returns: ["# Chapter 5: Authentication\nThe system uses...", ...]
```
Step 4: Generate Answer with Context
Pass only the relevant chunks to the LLM:
```python
def answer_question(query, chunks):
    """
    Generate an answer using the retrieved context.
    """
    # Build context from chunks
    context = "\n\n".join(chunks)

    # Create prompt
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

# Usage
query = "How does authentication work?"
relevant_chunks = retrieve_relevant_chunks(query, top_k=3)
answer = answer_question(query, relevant_chunks)
```
Optimization Strategies
1. Dynamic Chunk Count
Adjust top_k based on token budget:
```python
def count_tokens(text):
    """Rough estimate: 1 token ≈ 4 characters."""
    return len(text) // 4

def retrieve_within_budget(query, max_tokens=3000):
    """
    Retrieve chunks until the token budget is exhausted.
    """
    chunks = retrieve_relevant_chunks(query, top_k=10)

    selected_chunks = []
    total_tokens = 0
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk)
        if total_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens

    return selected_chunks
```
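The 4-characters-per-token heuristic is fine for budgeting, but if you want exact counts you can swap in tiktoken, OpenAI's tokenizer library. A minimal sketch; the function name is my own:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and the text-embedding-3-* models
_encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens_exact(text):
    """Count tokens exactly using the model's tokenizer."""
    return len(_encoding.encode(text))
```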
2. Re-ranking for Precision
Use a re-ranker to improve relevance:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_chunks(query, chunks):
    """
    Re-rank chunks for better relevance.
    """
    # Score each (query, chunk) pair with the cross-encoder
    scores = reranker.predict([
        (query, chunk) for chunk in chunks
    ])

    # Sort by score, highest first
    ranked = sorted(
        zip(chunks, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [chunk for chunk, score in ranked]
```
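A common way to use this: over-fetch from the vector store with the fast embedding search, re-rank with the slower but more accurate cross-encoder, and keep only the best few. A sketch using the functions defined above; the 10-then-3 split is an arbitrary choice:

```python
# Cast a wide net, then refine
candidates = retrieve_relevant_chunks(query, top_k=10)
top_chunks = rerank_chunks(query, candidates)[:3]
answer = answer_question(query, top_chunks)
```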
3. Hierarchical Retrieval
Retrieve at multiple levels:
```python
def hierarchical_retrieval(query):
    """
    First find relevant sections, then chunks within those sections.

    Assumes two collections: a "sections" collection with one embedding
    per section summary, plus the chunk collection used above.
    """
    # Level 1: Find relevant sections (queries the section-level collection)
    sections = retrieve_relevant_sections(query, top_k=2)

    # Level 2: Find chunks within those sections
    relevant_chunks = []
    for section in sections:
        section_chunks = retrieve_chunks_from_section(section, query)
        relevant_chunks.extend(section_chunks)

    return relevant_chunks[:5]  # Top 5 total
```
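Neither helper is defined above: retrieve_relevant_sections is just retrieve_relevant_chunks pointed at a section-level collection, and retrieve_chunks_from_section can be built on Chroma's metadata filtering. A sketch of the latter, assuming each chunk was stored with a "section" metadata field (as in the metadata example under Common Pitfalls):

```python
def retrieve_chunks_from_section(section_name, query, top_k=3):
    """
    Query the chunk collection, restricted to chunks whose metadata
    marks them as belonging to the given section.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where={"section": section_name}  # Chroma metadata filter
    )
    return results['documents'][0]
```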
Common Pitfalls
❌ Chunks Too Small
```python
# BAD: Loses context
chunk_size=100  # Too small, fragmented context
```
❌ No Overlap
```python
# BAD: Context breaks at boundaries
chunk_overlap=0  # Information lost between chunks
```
❌ Ignoring Metadata
```python
# GOOD: Include metadata for better filtering
collection.add(
    embeddings=[embedding],
    documents=[chunk],
    metadatas=[{
        "source": "manual.pdf",
        "section": "Chapter 5",
        "page": 42
    }],
    ids=[f"chunk_{i}"]
)
```
Production Checklist
- ✅ Chunk size: 300-1000 tokens (balance context vs granularity)
- ✅ Overlap: 10-20% of chunk size
- ✅ Use semantic separators (paragraphs, headers)
- ✅ Include metadata (source, section, page)
- ✅ Implement re-ranking for better precision
- ✅ Cache embeddings (don’t regenerate on every query; see the sketch after this list)
- ✅ Monitor retrieval quality (are correct chunks retrieved?)
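On the query path, the only embedding generated at request time is the query's; caching it avoids paying for repeated identical questions. A minimal sketch of such a cache, assuming a single process, an in-memory dict, and no eviction policy (swap in Redis or similar for production):

```python
import hashlib

_embedding_cache = {}

def embed_text_cached(text, model="text-embedding-3-small"):
    """
    Return a cached embedding if this exact text was embedded before,
    otherwise call the API and cache the result.
    """
    key = (model, hashlib.sha256(text.encode("utf-8")).hexdigest())
    if key not in _embedding_cache:
        response = client.embeddings.create(model=model, input=text)
        _embedding_cache[key] = response.data[0].embedding
    return _embedding_cache[key]
```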
The Takeaway
RAG isn’t about fitting everything into context—it’s about retrieving the right pieces. Chunk smart, embed well, retrieve precisely.
Rule of thumb: 3-5 chunks of 500 tokens each = sweet spot for most questions.