How to Build Your Own RAG Knowledge Base with Python, LangChain, and Open-Source Tools
TL;DR: Building a RAG knowledge base means ingesting your documents, splitting them into chunks, generating embeddings, and storing them in a vector database so LLMs can retrieve relevant context. This guide walks you through the full pipeline with working Python code, from document loading to retrieval queries, using LangChain and ChromaDB.
Quick Takeaways
- RAG improves LLM accuracy: Pulls your actual data as context instead of relying on training data alone
- Chunking matters for retrieval: Splitting documents correctly directly impacts whether relevant information gets found
- Embeddings turn text into vectors: These numeric representations let you measure similarity and rank search results
- Vector databases are the backbone: ChromaDB, FAISS, or Pinecone store and retrieve chunks efficiently at scale
- The pipeline is straightforward: Load, chunk, embed, index, then query with your LLM
- Open-source tools work great: You don’t need expensive cloud services to build production RAG systems
- Common mistakes are fixable: Poor chunking, bad embedding choices, and weak retrieval strategies are addressable with testing
What is RAG and Why Build a Custom Knowledge Base?
Retrieval Augmented Generation (RAG) is straightforward: instead of asking an LLM to answer from its training data alone, you give it your own documents as context first. The model then generates responses based on what it actually finds in your data, not what it thinks it knows.
A custom RAG knowledge base means building this system with your own data. You’re not relying on someone else’s pre-built knowledge base. You control everything: what documents go in, how they’re processed, and how they’re retrieved.
Why does this matter? Training data gets stale. Your company’s internal policies, recent documents, and proprietary information aren’t in GPT-4’s training set. When you ask an LLM about your specific business context without RAG, you get hallucinations. With RAG, you get answers grounded in your actual data.
According to Databricks’ explanation of RAG, this approach directly improves LLM relevancy by providing domain-specific context at query time. You’re not changing the LLM itself, just feeding it better information.
The benefits are real: fewer hallucinations, faster updates (add documents without retraining), lower costs (smaller models work better with good context), and full control over your data.
Core Components: Embeddings, Chunking, and Vector Storage
A RAG system has three essential pieces that work together. Understanding each one prevents a lot of problems downstream.
Embeddings convert text into numbers. An embedding model reads a piece of text and outputs a vector (a list of numbers, typically 384-1536 dimensions depending on the model). The key insight: similar text produces similar vectors. This lets you find relevant documents by computing vector similarity instead of keyword matching.
Popular embedding options include OpenAI’s embedding API (high quality, costs ~$0.02 per million tokens), open-source models from Hugging Face like all-MiniLM-L6-v2 (free, runs locally, decent quality for most use cases), and enterprise options like Cohere’s API. For most projects, all-MiniLM or OpenAI embeddings are the right starting point.
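To make the “similar text produces similar vectors” idea concrete, here is a minimal sketch using the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned above; the sample sentences are placeholders, not anything from a real knowledge base:
from sentence_transformers import SentenceTransformer, util
# Load a small local embedding model (384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How do I reset my password?",
    "Steps for recovering account access",
    "Quarterly revenue grew by 12 percent",
]
# Encode each sentence into a vector
vectors = model.encode(sentences)
# Related sentences score higher on cosine similarity than unrelated ones
print(util.cos_sim(vectors[0], vectors[1]))  # higher
print(util.cos_sim(vectors[0], vectors[2]))  # lower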
Chunking is how you split documents before embedding them. Raw documents are often too long. A 50-page PDF produces one huge vector, which loses detail. Instead, you split documents into chunks (typically 500-1000 tokens, with 20% overlap) so each chunk represents one idea or section. When you query, you retrieve individual chunks, not whole documents.
Chunk size matters. Too small (100 tokens): you lose context and retrieve dozens of irrelevant fragments. Too large (2000 tokens): you include noise and waste embedding space. The sweet spot for most domains is 600-800 tokens with 100-200 token overlap.
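One wrinkle worth knowing: LangChain’s RecursiveCharacterTextSplitter counts characters by default, not tokens. If you want chunk sizes measured in tokens as described above, one option is the from_tiktoken_encoder constructor; a sketch, assuming the tiktoken package is installed:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap below are measured in tokens, not characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=700,
    chunk_overlap=150,
)
chunks = splitter.split_text("...your document text here...")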
Vector storage is where chunks and their embeddings live. ChromaDB is the easiest for local development (in-memory or SQLite). FAISS scales to millions of vectors on a single machine. Pinecone or Weaviate handle cloud scale with managed infrastructure. Dev.to’s detailed RAG analysis covers the full architecture including these storage mechanisms.
These three pieces form the foundation. Bad embeddings make retrieval fail. Bad chunking loses relevant information. Bad vector storage creates latency problems. Get all three right and your RAG system works.
Did You Know? Chunk overlap is crucial but often overlooked. Suppose a passage about “machine learning models” starts near the end of one chunk and its key details spill into the next. With overlap (say, chunks covering positions 1-500 and 400-900), the boundary region appears in both chunks, so a query still retrieves the full idea. Without overlap, that passage gets cut in half and neither chunk tells the whole story.
Step-by-Step: Ingesting and Processing Your Data
Building a RAG knowledge base follows a proven sequence. As outlined in Astera’s step-by-step guide, the process goes: ingest, clean, chunk, embed, index.
Step 1: Ingest documents. LangChain provides document loaders for PDFs, markdown, web pages, APIs, and databases. You point it at your data source and it loads everything into LangChain Document objects (each has content and metadata).
Step 2: Clean your data. Remove headers, footers, and formatting noise that confuse embeddings. This step is often skipped, and skipping it is a common cause of retrieval problems later.
Step 3: Split into chunks. LangChain’s RecursiveCharacterTextSplitter handles this automatically. You specify chunk size and overlap, and it splits on paragraph, line, and word boundaries rather than arbitrary character positions.
Step 4: Generate embeddings. Pass each chunk through an embedding model. This happens once during setup, not on every query. Store the embedding vector alongside the chunk text and metadata.
Step 5: Index in vector storage. Insert embeddings into ChromaDB, FAISS, or your chosen vector database. The index structure makes similarity search fast.
Step 6: Query and retrieve. When a user asks a question, embed the question, search the vector store for similar chunks, and return the top-k results as context for the LLM.
The whole pipeline takes minutes for small datasets (under 10,000 documents) and scales to millions. The bottleneck is usually embedding generation, not storage.
Implementing Retrieval and Generation Pipeline
Once your knowledge base is built, the retrieval pipeline is where everything comes together. Here’s what happens when someone queries your RAG system.
Query embedding: The user’s question gets embedded using the same embedding model you used for documents. If you used OpenAI embeddings for documents, use OpenAI for the query too. Mixing embedding models breaks similarity matching.
Similarity search: The vector database finds chunks with vectors closest to the query vector. You typically retrieve top-k results (k=3-5 works for most cases). This is instant with proper indexing.
Context assembly: The retrieved chunks get formatted into a prompt context. You’re building something like: “Here’s relevant information: [chunk 1] [chunk 2] [chunk 3]. Now answer: [user question]”
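The RetrievalQA chain shown later handles this assembly automatically, but doing it by hand takes only a few lines. A rough sketch, assuming a vector_store like the one built in the code section below and an illustrative question:
# Retrieve the top chunks for the user's question
retrieved_chunks = vector_store.similarity_search("How does billing work?", k=3)
# Join the chunk texts into one context block
context = "\n\n".join(doc.page_content for doc in retrieved_chunks)
# Assemble the prompt the LLM will see
prompt = (
    "Here's relevant information:\n"
    f"{context}\n\n"
    "Now answer the question using only the information above.\n"
    "Question: How does billing work?"
)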
LLM generation: Pass the assembled prompt to Claude, GPT-4, or your chosen model. The model generates an answer grounded in the retrieved context instead of relying on training data.
Optional: re-ranking. Advanced setups use a re-ranker model (smaller and faster) to reorder the top-k results before sending to the LLM. This catches cases where vector similarity is misleading. It costs more but significantly improves answer quality on hard questions.
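A common way to implement re-ranking is a cross-encoder from the sentence-transformers library, which scores each (query, chunk) pair directly. A sketch, assuming retrieved_chunks came back from your vector store; the model name is one popular public checkpoint, not a requirement:
from sentence_transformers import CrossEncoder
# A cross-encoder reads the query and chunk together: slower per pair, but more precise
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How do refunds work?"
candidate_texts = [doc.page_content for doc in retrieved_chunks]
scores = reranker.predict([(query, text) for text in candidate_texts])
# Keep only the highest-scoring chunks for the LLM prompt
ranked = sorted(zip(scores, candidate_texts), key=lambda pair: pair[0], reverse=True)
top_chunks = [text for _, text in ranked[:3]]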
The entire retrieval pipeline (embedding + search + formatting) typically takes 200-500ms. Generation takes longer depending on answer length. Galileo AI’s enterprise patterns dive deeper into optimization strategies for scale.
Code Examples: Python with LangChain and OpenAI
Here’s how to build this in working code. This example uses LangChain, OpenAI embeddings, ChromaDB for vector storage, and Claude for generation.
Install dependencies:
pip install langchain langchain-community langchain-openai langchain-anthropic langchain-chroma chromadb pypdf
Load and chunk documents:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load PDF document
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
# Split into chunks (600 characters, 100-character overlap; the splitter counts characters by default)
splitter = RecursiveCharacterTextSplitter(
chunk_size=600,
chunk_overlap=100,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
Generate embeddings and store in ChromaDB:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Create embeddings using OpenAI
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Store in Chroma (creates persistent database)
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print("Knowledge base created")
Query and retrieve with Claude:
from langchain_anthropic import ChatAnthropic
from langchain.chains import RetrievalQA
# Initialize Claude
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")
# Create retriever (fetches top 4 chunks)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
# Combine retrieval with generation
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True
)
# Query your knowledge base
query = "What are the main points about implementation?"
result = qa.invoke({"query": query})
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
That’s a complete RAG pipeline in roughly 30 lines. You load documents, split them, generate embeddings, store them in a vector database, and query with an LLM that uses your actual data as context.
LangChain’s official RAG documentation has more advanced patterns, and the LangChain GitHub repository contains dozens of working examples you can adapt.
Troubleshooting and Optimization Tips
RAG systems don’t always work perfectly on the first try. Here are the most common problems and fixes.
Problem: Retrieval returns irrelevant chunks. This usually means your chunks are too large (includes noise) or your chunking strategy is bad (splits in the middle of ideas). Solution: reduce chunk size from 1000 to 500 tokens and test. If that doesn’t help, manually review what your query retrieves and look for patterns in what’s missing.
Problem: Embeddings are expensive. OpenAI embeddings cost about $0.02 per million tokens. For large document collections, this adds up. Solution: use open-source models locally. all-MiniLM-L6-v2 runs on CPU and costs nothing. Quality drops slightly, but for RAG retrieval the difference often doesn’t matter. Benchmark your specific use case before paying.
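In LangChain, swapping to a local model is mostly a one-line change to the embeddings object. A sketch using the langchain-huggingface integration (install langchain-huggingface and sentence-transformers first); the rest of the pipeline from the code section above stays the same:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
# Free, CPU-friendly embeddings (384-dimensional vectors)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Build the vector store exactly as before, just with local embeddings
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db_local",
)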
Problem: LLM generates answers not in the context. The retrieval found relevant chunks but the LLM ignores them and hallucinates. This happens with weaker models. Solution: use better models (Claude 3.5, GPT-4) or add explicit instructions to the prompt: “Answer only based on the provided context. Do not use outside knowledge.”
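With the RetrievalQA setup from the code section above, one way to add that instruction is a custom prompt passed through chain_type_kwargs. A sketch; the exact template wording is up to you:
from langchain_core.prompts import PromptTemplate
strict_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer only based on the provided context. Do not use outside knowledge.\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\nAnswer:"
    ),
)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": strict_prompt},
    return_source_documents=True,
)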
Problem: Vector search is slow. Chroma on disk or FAISS with millions of vectors gets sluggish. Solution: retrieve a smaller candidate set and let a re-ranker handle precision, or use cloud vector databases like Pinecone that handle indexing optimization for you.
Optimization tip: Metadata filtering. Add metadata to chunks (document source, timestamp, section) and filter during retrieval. Search only documents from Q4 2024 or specific categories. This massively reduces noise without expensive reranking.
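With the Chroma store from the code section above, one way to express this is a filter in the retriever’s search_kwargs. A sketch; the metadata key and value are illustrative and depend on what you attach to your chunks:
# Only search chunks whose metadata marks them as policy documents
filtered_retriever = vector_store.as_retriever(
    search_kwargs={"k": 4, "filter": {"category": "policy"}}
)
docs = filtered_retriever.invoke("What changed in the travel policy?")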
Optimization tip: Hybrid search. Combine vector similarity with keyword matching. Vector search finds semantically similar content. Keyword search catches exact matches. Using both together beats either alone. LangChain supports this with BM25 + vector search pipelines.
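A sketch of that combination, reusing the chunks and vector_store from the code section above (the BM25 retriever needs the rank_bm25 package, and the weights are just a starting point to tune):
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Keyword-based retriever built over the same chunks
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Semantic retriever from the existing vector store
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 4})
# Blend keyword and vector results (weights sum to 1.0)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],
)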
Optimization tip: Query expansion. Before searching, generate multiple versions of the user’s query and retrieve results for all of them. This catches cases where the original question phrasing didn’t match your documents. It costs more (extra embeddings, extra retrieval) but significantly improves quality on hard questions.
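LangChain ships a MultiQueryRetriever that uses an LLM to generate those alternative phrasings for you. A sketch, reusing the llm and vector_store from the code section above:
from langchain.retrievers.multi_query import MultiQueryRetriever
# The LLM rewrites the question into several variants, retrieves for each,
# and returns the de-duplicated union of results
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
docs = multi_query_retriever.invoke("What are the main points about implementation?")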
Putting This Into Practice
Here’s how to implement RAG knowledge bases at different skill levels:
If you’re just starting: Load one PDF, split it into 500-token chunks with 100-token overlap, generate embeddings with OpenAI’s API (or local all-MiniLM), store in ChromaDB, and test retrieval with simple queries. Verify that searching for a specific word in your document returns the right chunk. This foundation works for 80% of real-world use cases. Time investment: 2-3 hours. Focus on getting the basic pipeline working before optimizing.
To deepen your practice: Implement custom chunking strategies (split by sections for documentation, by paragraphs for articles). Add metadata to chunks and test metadata filtering. Implement hybrid search combining vector and keyword matching. Build a simple FastAPI endpoint that accepts questions and returns answers with source citations (a minimal sketch follows after these tiers). Test with 10-20 documents and measure retrieval accuracy (what percentage of top-k results were actually relevant?). Add monitoring to track embedding costs and retrieval latency. Time investment: 1 week. This puts you in the intermediate tier with real production patterns.
For serious exploration: Fine-tune embeddings on your specific domain using your own relevant/irrelevant pairs (improves domain accuracy 10-20%). Implement multi-query retrieval that reformulates questions automatically. Add re-ranking with a cross-encoder model to reorder retrieved results. Optimize vector database indexing and sharding for millions of chunks. Build monitoring dashboards tracking retrieval accuracy, latency, cost, and user satisfaction. Implement caching for repeated queries. Scale across multiple GPU instances if embedding generation becomes a bottleneck. Time investment: ongoing. This is the advanced practitioner level where you’re squeezing every ounce of accuracy and efficiency.
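For the FastAPI endpoint mentioned in the middle tier above, a minimal sketch might look like the following; the route name and response shape are illustrative, and it assumes the qa chain built in the code section earlier:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    query: str

@app.post("/ask")
def ask(question: Question):
    # Run the RAG chain and return the answer plus source metadata
    result = qa.invoke({"query": question.query})
    return {
        "answer": result["result"],
        "sources": [doc.metadata for doc in result["source_documents"]],
    }

# Run with: uvicorn main:app --reload (assuming this file is main.py)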
Conclusion
Building a RAG knowledge base sounds complex, but it’s actually a straightforward pipeline: load documents, chunk them intelligently, generate embeddings, store in a vector database, and retrieve chunks when users ask questions. Your LLM then generates answers grounded in actual data instead of guesses.
The biggest mistake most people make is skipping the troubleshooting phase. Your first RAG system probably won’t retrieve perfect results immediately. Chunk sizes will be wrong, embedding models might not fit your domain, metadata filtering will reveal patterns you missed. That’s normal. Test, measure, adjust. What works for marketing documents might fail for technical specifications.
The good news: you have solid open-source tools (LangChain, ChromaDB, Hugging Face) and fast, cheap embedding APIs (OpenAI). Building production RAG systems doesn’t require expensive enterprise software or cloud vendor lock-in. Many of the world’s best RAG implementations run on open-source stacks.
Start small. Load one document, verify retrieval works, then grow from there. By the time you’re handling thousands of documents, you’ll understand your domain well enough to optimize effectively. The principles stay the same: good chunks, good embeddings, good retrieval, good generation. Get those right and your knowledge base works.
Frequently Asked Questions
- Q: What is a knowledge base in RAG?
- A: A RAG knowledge base is a vector database containing your documents split into chunks, each with an embedding. When you query, the system finds similar chunks and passes them to an LLM as context. This grounds answers in your actual data instead of the model’s training data.
- Q: How do you build a RAG knowledge base?
- A: Load documents (e.g., PDFs) and split them into chunks using a RecursiveCharacterTextSplitter, targeting 600-800 tokens with 100-token overlap. Generate embeddings for these chunks using models like OpenAI or local alternatives, then store them in a vector database such as ChromaDB. Query by embedding user questions and retrieving similar chunks.
- Q: What are the best embedding models for RAG?
- A: OpenAI’s text-embedding-3-small costs $0.02 per million tokens and provides high quality. For local/free alternatives, all-MiniLM-L6-v2 from Hugging Face runs on CPU with decent accuracy. For enterprise scale, Cohere’s API or specialized domain models work. Benchmark your specific use case before choosing.
- Q: How to chunk data for RAG embeddings?
- A: Use RecursiveCharacterTextSplitter with chunk_size=600-800 tokens and chunk_overlap=100-200 tokens. Split on sentence/paragraph boundaries, not arbitrary characters. Too-small chunks lose context, too-large chunks include noise. Adjust based on your domain and test retrieval accuracy.
- Q: What vector databases work with RAG?
- A: ChromaDB is best for local development (in-memory or SQLite). FAISS scales to millions on a single machine. Pinecone and Weaviate handle cloud scale with managed infrastructure. All integrate with LangChain. Choose based on scale needs and whether you want managed vs. self-hosted.
