RAG Explained: How to Build AI That Knows Your Data
Jun 15, 2024
ChatGPT is impressive. It can write code, explain quantum physics, and compose poetry that doesn't completely suck.
But it has no idea what's in your company's internal documentation. It doesn't know your product specifications. It can't answer questions about your customer data. And when you ask it anyway, it confidently makes stuff up.
This is the fundamental problem with large language models: they know everything in general and nothing about you specifically.
Enter RAG: Retrieval-Augmented Generation. It's the technique that lets you give AI a library card to your own knowledge base. Instead of hallucinating answers, it actually looks stuff up.
If you've ever thought "I wish ChatGPT knew about MY data," this is how you make that happen. Let me show you what RAG is, when to use it, and how to build your first system in about 100 lines of code.
No PhD required. No buzzword nonsense. Just practical AI that actually works.
What the Hell is RAG?
Let me explain RAG with an analogy:
Traditional LLM: A really smart person with an amazing memory who refuses to admit when they don't know something. Ask them anything, and they'll give you a confident answer. Sometimes it's right. Sometimes it's completely made up. You can't tell the difference.
RAG System: That same smart person, but now they have a library card. When you ask them something they don't know, they go look it up, read the relevant pages, and then answer based on what they found. They can even cite their sources.
That's RAG. It's not magic. It's just giving LLMs the ability to look stuff up before answering.
Here's the problem RAG solves:
You ask ChatGPT: "What's our refund policy for enterprise customers?"
ChatGPT responds: "Based on typical enterprise SaaS practices, your refund policy probably allows 30-day returns with..."
WRONG. It just hallucinated your entire policy.
With RAG:
Your question gets converted to a search query
System searches your company docs for "refund policy enterprise"
Finds the actual policy document
Feeds it to the LLM along with your question
LLM answers based on YOUR real policy, not made-up nonsense
The difference? One is guessing. The other is reading and then answering.
Real-world example: Customer support chatbot
Without RAG:
"How do I reset my password?" → Generic answer that might not match your actual process
"What's included in the Pro plan?" → Hallucinated feature list
Result: Angry customers, confused support team
With RAG:
System searches your actual documentation
Returns accurate, sourced answers
Cites which doc it used
Result: Happy customers, less support load
The magic of RAG: You update your docs, and the AI automatically knows the new information. No retraining. No fine-tuning. Just update the knowledge base and you're done.
The Three Components of RAG
Building a RAG system has three main pieces. None of them are particularly complicated, which is good news.
1. Document Ingestion & Embedding
First, you need to turn your documents into something searchable.
The process:
Step 1: Break documents into chunks
You can't feed an entire 500-page manual to an LLM. Context windows have limits. So you split documents into smaller pieces (usually 500-1000 characters each).
Step 2: Convert chunks to embeddings
This is where it gets interesting. Each chunk gets converted into a vector—basically a list of numbers that represents the meaning of the text.
Similar meaning = similar numbers.
The magic: You can now search by meaning, not just keywords. "How do I get my money back?" will find your refund policy even though the words don't match exactly.
2. Vector Storage
Now you need somewhere to store these embeddings so you can search them later.
Vector databases are designed for this. They can:
Store millions of vectors efficiently
Search them fast (milliseconds)
Find the most similar vectors to your query
Popular options:
Pinecone - Cloud-hosted, easy, costs money
Weaviate - Open-source, self-hosted, powerful
pgvector - PostgreSQL extension (my favorite for small-to-medium projects)
FAISS - Facebook's library, great for local development
Chroma - Simple, embedded, perfect for prototypes
For your first RAG system, use Chroma or FAISS. They're free and work out of the box.
How retrieval works:
When someone asks "What's the refund policy?":
Question gets embedded: [0.25, -0.44, 0.66, ..., 0.13]
Vector database finds chunks with similar embeddings
Returns top 3-5 most relevant chunks
These chunks become context for the LLM
It's like Google, but for meaning instead of keywords.
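To make "search by meaning" concrete, here's a minimal plain-Python sketch of similarity ranking. The 3-dimensional vectors are made up for illustration (real embeddings have around 1,536 dimensions), but the ranking logic is the same:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-dimensional "embeddings" (real ones have ~1536 dimensions).
chunks = {
    "Refunds are available within 30 days.": [0.9, 0.1, 0.0],
    "Standard shipping takes 5-7 business days.": [0.1, 0.9, 0.0],
    "The Pro plan includes unlimited seats.": [0.0, 0.1, 0.9],
}

def top_k(query_vector, k=2):
    # Rank every stored chunk by similarity to the query vector.
    ranked = sorted(chunks.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector close to the "refunds" vector retrieves that chunk first.
print(top_k([0.85, 0.15, 0.05]))
```

Real systems swap the toy vectors for model-generated embeddings and the linear scan for a vector database index, but the ranking idea is identical.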
3. Retrieval + Generation
This is where RAG becomes useful.
The flow:
User asks: "How long does shipping take?"
System embeds the question
Searches vector DB for relevant chunks
Finds: "Standard shipping: 5-7 business days. Express: 2-3 days..."
Builds a prompt: "Based on this information: [chunks], answer: [question]"
LLM generates answer using YOUR data
Returns answer with sources
Why this is better than raw LLM:
✅ Answers based on facts, not hallucinations
✅ Can cite sources
✅ Updates when docs update (no retraining)
✅ Works with private/proprietary data
✅ Much cheaper than fine-tuning
The catch: You need good documents. Garbage in, garbage out. RAG won't magically make bad documentation good.
RAG vs Fine-Tuning vs Prompt Engineering
"Should I use RAG, fine-tune, or just write better prompts?"
Great question. Here's when to use what:
| Approach | When to Use | Cost | Speed | Best For |
|---|---|---|---|---|
| Prompt Engineering | Knowledge fits in prompt | Free | Instant | Simple rules, small knowledge |
| RAG | Large/changing knowledge base | Low-Med | Fast | Dynamic docs, Q&A systems |
| Fine-Tuning | Need specific style/behavior | High | Slow | Domain expertise, writing style |
Examples:
Use Prompt Engineering when:
"Always respond in a professional tone"
"Format output as JSON"
"Never use contractions"
Simple rules that fit in system prompt
Use RAG when:
"Answer based on our 500 product manuals"
"Search through 10,000 customer support tickets"
"Query our internal documentation"
Large knowledge that changes frequently
Use Fine-Tuning when:
"Write legal contracts in our firm's specific style"
"Generate code following our company conventions"
"Diagnose medical conditions using domain knowledge"
Need behavior change, not just knowledge access
The truth nobody tells you: You'll probably use all three together.
Fine-tune for style and domain expertise
Add RAG for knowledge retrieval
Use prompts for formatting and rules
But if you're starting? Start with RAG. It's easier, cheaper, and more flexible than fine-tuning.
Building Your First RAG System
Let's build a simple document Q&A system. Full working code. No hand-waving.
What we're building:
Load documents from a folder
Make them searchable
Ask questions, get answers with sources
Step 1: Install Dependencies
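Assuming the packages described below, the install is a one-liner (exact package names vary by LangChain version):

```shell
pip install langchain openai chromadb tiktoken
```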
What these do:
langchain - Framework for building LLM apps
openai - Access to GPT models
chromadb - Local vector database
tiktoken - Token counting (useful for chunking)
Step 2: Load Your Documents
Why chunking matters:
Imagine you have a 50-page product manual. You can't feed the whole thing to GPT every time someone asks a question. So you split it into chunks:
Chunk 1: Introduction section
Chunk 2: Setup instructions
Chunk 3: Troubleshooting
etc.
When someone asks "How do I set up?", you only retrieve the setup chunk, not the entire manual.
Why overlap?
If you split exactly at 1000 characters, you might cut a sentence in half. The overlap ensures you don't lose context at boundaries.
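The sliding-window idea is easy to sketch in plain Python. LangChain's text splitters do a smarter version that respects separators, but the mechanics look like this:

```python
def chunk_text(text, chunk_size=1000, overlap=150):
    # Slide a window across the text. Each new chunk starts
    # (chunk_size - overlap) characters after the previous one,
    # so neighboring chunks share `overlap` characters of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

manual = "".join(str(i % 10) for i in range(2500))  # stand-in for a long doc
pieces = chunk_text(manual)
print(len(pieces), [len(p) for p in pieces])
```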
Step 3: Create Embeddings and Store in Vector DB
What happens in this step:
Each of your chunks gets converted into a 1536-dimensional vector using OpenAI's embedding model. These vectors are stored in Chroma, a local database.
Cost: Embedding 1000 chunks ≈ $0.01. Basically free.
Step 4: Build the Q&A Chain
What's chain_type="stuff"?
It means "stuff all the retrieved chunks into the prompt." Simple and works for most cases.
Other options:
map_reduce - Summarize each chunk, then combine (for lots of chunks)
refine - Iteratively refine answer with each chunk (slower but better)
Start with "stuff." It works 90% of the time.
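The "stuff" strategy needs no framework to understand; it's just string assembly:

```python
def build_stuff_prompt(chunks, question):
    # "Stuff" every retrieved chunk into one prompt, then append the question.
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

retrieved = [
    "Refunds are available within 30 days of purchase.",
    "Enterprise customers get a dedicated account manager.",
]
prompt = build_stuff_prompt(retrieved, "What is the refund window?")
print(prompt)
```

This string is what actually gets sent to the LLM. map_reduce and refine differ only in how they squeeze the chunks down to something that fits.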
Step 5: Ask Questions!
Boom. You just built RAG.
The system:
Embedded your question
Searched for relevant chunks
Found your policy docs
Fed them to GPT-4
Got an accurate answer
Cited its sources
Step 6: Make It Interactive
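Structurally, "interactive" is just a read-ask-print loop. A sketch, with `answer_question` as a stub standing in for whatever QA chain you built:

```python
def answer_question(question):
    # Stub: replace with a call to your real QA chain.
    return f"(stub answer for: {question})"

def chat_loop(get_input=input, quit_word="quit"):
    # Keep answering until the user types the quit word.
    answers = []
    while True:
        question = get_input("Ask a question (or 'quit'): ").strip()
        if question.lower() == quit_word:
            break
        answer = answer_question(question)
        print(answer)
        answers.append(answer)
    return answers

# Scripted demo; call chat_loop() with no arguments for a real session.
scripted = iter(["What is the refund policy?", "quit"])
chat_loop(get_input=lambda _: next(scripted))
```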
That's it. You now have a working RAG system that can answer questions about YOUR documents.
Common RAG Challenges (And How to Fix Them)
RAG isn't magic. You'll hit problems. Here's what to expect and how to solve them.
Challenge #1: Retrieval Quality Sucks
Problem: System returns irrelevant chunks. You ask about refunds, it gives you shipping info.
Why it happens:
Bad chunking (chunks are too big or too small)
Documents aren't well-organized
Query doesn't match document language
Solutions:
Better chunking:
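One approach: split on paragraph boundaries and pack whole paragraphs into chunks, so no sentence gets cut in half. A stdlib sketch:

```python
def chunk_by_paragraph(text, max_chars=1000):
    # Split on blank lines first, then pack whole paragraphs into
    # chunks of up to max_chars, so no sentence gets cut in half.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSetup steps here.\n\n" + "T" * 990
chunks_out = chunk_by_paragraph(doc)
print([len(c) for c in chunks_out])
```

A paragraph longer than max_chars still lands in its own oversized chunk in this sketch; fall back to character splitting for those.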
Add metadata filtering:
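The idea: store metadata with every chunk and narrow the candidate set before similarity ranking. A toy sketch (real vector DBs expose this as a filter argument on the query; the scoring here is a plain dot product for illustration):

```python
# Each stored chunk carries metadata alongside its (toy) embedding.
index = [
    {"text": "Refund policy: 30 days.", "meta": {"doc_type": "policy"}, "vec": [1.0, 0.0]},
    {"text": "Shipping takes 5-7 days.", "meta": {"doc_type": "faq"}, "vec": [0.0, 1.0]},
    {"text": "Enterprise refunds: 60 days.", "meta": {"doc_type": "policy"}, "vec": [0.9, 0.1]},
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, doc_type=None, k=2):
    # Filter by metadata first (cheap), then rank the survivors by similarity.
    candidates = [e for e in index
                  if doc_type is None or e["meta"]["doc_type"] == doc_type]
    ranked = sorted(candidates, key=lambda e: dot(query_vec, e["vec"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

# Only policy documents are considered for a refund question.
print(search([1.0, 0.0], doc_type="policy"))
```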
Use reranking:
Challenge #2: Context Window Limitations
Problem: You retrieve 10 chunks of 1,000 characters each. Add the system prompt, the question, and room for the answer, and a small context window runs out fast. Now what?
Solutions:
Retrieve fewer chunks:
Use map-reduce for long docs:
Upgrade to larger context models:
Challenge #3: Slow Response Times
Problem: Users wait 5+ seconds for answers. That's unacceptable.
Why it's slow:
Embedding the query (200-500ms)
Searching vector DB (100-300ms)
LLM generation (2-4 seconds)
Total: 3-5 seconds
Solutions:
Cache common queries:
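The shape of a query cache: normalize the question, hash it, and only run the expensive retrieval-plus-LLM path on a miss. A stdlib sketch with a fake LLM standing in for the real call:

```python
import hashlib

cache = {}

def normalize(query):
    # Lowercase and collapse whitespace so near-identical phrasings share a key.
    return " ".join(query.lower().split())

def cached_answer(query, compute):
    # `compute` is the expensive path (retrieval + LLM call).
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    if key not in cache:
        cache[key] = compute(query)
    return cache[key]

calls = []
def fake_llm(q):
    calls.append(q)
    return f"answer to: {q}"

print(cached_answer("What's the refund policy?", fake_llm))
print(cached_answer("  what's the REFUND policy? ", fake_llm))  # cache hit
print(f"expensive calls made: {len(calls)}")
```

In production you'd also want a TTL so cached answers expire when the docs change.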
Use faster embedding models:
Optimize vector database:
Use streaming for better UX:
Challenge #4: Hallucinations Still Happen
Problem: Even with RAG, the LLM sometimes makes stuff up.
Why: LLMs are creative. Sometimes too creative.
Solutions:
Lower temperature:
Explicit prompts:
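A template along these lines works well (the exact wording is illustrative; tune it for your domain):

```python
GROUNDED_PROMPT = """You are a support assistant.
Answer using ONLY the context between the <context> tags.
If the context does not contain the answer, reply exactly:
"I don't know based on the available documentation."
Do not use outside knowledge.

<context>
{context}
</context>

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(
    context="Refunds: 30 days for all plans.",
    question="What is the refund window?",
)
print(prompt)
```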
Add confidence scores:
Challenge #5: Cost Adds Up
Problem: Every query costs money. Embeddings + LLM calls = $$$.
Solutions:
Cache embeddings:
Use open-source embeddings:
Use GPT-3.5 for simple queries:
Batch processing:
When NOT to Use RAG
RAG is powerful, but it's not always the answer. Don't use it when:
1. Knowledge is Tiny and Static
Example: "Our company has 3 products with fixed prices"
Better solution: Just put it in the system prompt
No need for RAG. It's overkill.
2. You Need Real-Time Transactional Data
Example: "What's my current account balance?"
Better solution: Query the database directly with SQL or an API
RAG is for knowledge retrieval, not database queries. Use the right tool for the job.
3. You Need Guaranteed Accuracy
Example: Medical diagnoses, legal advice, financial calculations
Better solution: RAG can assist, but require human review
LLMs can still make mistakes even with RAG. For high-stakes decisions, humans need to be in the loop.
4. Documents Are Highly Structured
Example: Querying spreadsheets, database tables, structured forms
Better solution: Use SQL, Pandas, or structured query languages
5. You're Just Starting
Example: MVP with 5 documents, unclear if users will even ask questions
Better solution: Start with simple prompts, add RAG later
Don't over-engineer V1. Build RAG when the pain of not having it becomes real.
The rule of thumb:
Use RAG when:
✅ Knowledge base is large (100+ docs)
✅ Content changes frequently
✅ Users ask unpredictable questions
✅ You need source attribution
✅ You can't fit everything in a prompt
Real-World RAG Example: Customer Support Chatbot
Let's look at a concrete example.
The Problem:
Your company has:
500 product documentation pages
200 FAQ articles
1,000 past support tickets
Docs updated weekly by product team
Support team answers the same questions repeatedly. Customers wait hours for responses. Hiring more support agents is expensive.
The RAG Solution: Index all of it (docs, FAQs, past tickets) in a vector store. Each incoming question retrieves the relevant chunks, the bot answers with sources, and anything it can't handle confidently gets escalated to a human.
Results After Deployment:
80% of questions answered automatically
"How do I reset password?" ✅
"What's the refund policy?" ✅
"How do I upgrade my plan?" ✅
20% escalated to humans
Complex account issues
Billing disputes
Feature requests
Response time: <2 seconds (vs 2 hours with human-only)
Cost: ~$50/month for 1,000 queries
Embeddings: ~$5/month (only new and updated docs get re-embedded)
LLM calls: $45/month (at $0.045 per query)
Support team loves it
Handles repetitive questions
They focus on complex issues
Customers happier (instant responses)
The catch: You need good documentation. If your docs suck, RAG will just efficiently serve up sucky answers.
Advanced RAG Techniques (Teaser)
We've covered basic RAG. But there's more you can do:
Hybrid Search
Combine keyword search (BM25) + semantic search (embeddings)
Best of both worlds: exact matches + meaning
Reranking
Retrieve 20 chunks
Use a reranker model to pick best 3
Much better precision
Query Decomposition
Break "Compare Pro vs Enterprise features" into sub-questions
Answer each separately
Synthesize final answer
Hypothetical Document Embeddings (HyDE)
Generate a hypothetical answer to the question
Search for docs similar to that answer
Often finds better results than searching with the question
Agent-Based RAG
LLM decides when to retrieve
Can make multiple retrieval calls
Can use different knowledge sources
More flexible, more powerful
Multi-Modal RAG
Search images, videos, audio
Combine text + visual information
Future of RAG
These are advanced techniques. Master basic RAG first.
I'll cover these in detail in my "Advanced RAG Techniques" article (coming soon).
RAG Best Practices
After building several RAG systems, here's what actually matters:
1. Chunk Smartly
Don't:
Use fixed 1000-character splits that cut sentences in half
Make chunks too small (<300 chars) or too large (>1500 chars)
Ignore document structure
Do:
Split on semantic boundaries (paragraphs, sections)
Use 500-1000 characters for most cases
Add 150-200 char overlap
Respect document structure (headings, lists)
2. Add Metadata
Tag every chunk:
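In practice each chunk is stored alongside a small metadata dict, something like this (field names are illustrative):

```python
chunk_record = {
    "text": "Refunds are available within 30 days of purchase.",
    "metadata": {
        "source": "policies/refunds.md",  # which file it came from
        "section": "Refund Policy",       # heading it sat under
        "doc_type": "policy",             # enables filtered search
        "last_updated": "2024-06-01",     # helps debug stale answers
    },
}
print(chunk_record["metadata"]["source"])
```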
Why: Enables filtering, improves relevance, helps debugging
3. Test Retrieval Quality First
Before building the full system:
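A minimal retrieval test: a handful of (question, expected source) pairs, plus a check that the expected source lands in the top-k results. A sketch with a stubbed retriever in place of real vector search:

```python
def retrieve(question, k=3):
    # Stub standing in for your real vector search; returns source names.
    fake_results = {
        "refund": ["refund_policy.md", "billing_faq.md", "terms.md"],
        "shipping": ["shipping_faq.md", "returns.md", "refund_policy.md"],
    }
    for keyword, sources in fake_results.items():
        if keyword in question.lower():
            return sources[:k]
    return []

eval_set = [
    ("What's the refund policy?", "refund_policy.md"),
    ("How long does shipping take?", "shipping_faq.md"),
]

hits = sum(expected in retrieve(q) for q, expected in eval_set)
recall_at_3 = hits / len(eval_set)
print(f"recall@3: {recall_at_3:.0%}")
```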
If retrieval sucks, the whole system sucks. Fix it before adding the LLM.
4. Monitor Everything
Track:
Query latency
Retrieval quality
Cost per query
User satisfaction (thumbs up/down)
5. Version Your Knowledge Base
Why: You can roll back if new docs break things, and A/B test different chunking strategies.
6. Handle "I Don't Know" Gracefully
Force the LLM to admit uncertainty:
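Beyond prompt instructions, you can refuse before the LLM even runs when the best retrieval score is weak. A sketch using a made-up similarity threshold:

```python
def answer_or_abstain(scored_chunks, min_score=0.75):
    # scored_chunks: list of (similarity_score, text) from the vector search.
    # If nothing clears the threshold, abstain instead of letting
    # the LLM improvise from irrelevant context.
    relevant = [text for score, text in scored_chunks if score >= min_score]
    if not relevant:
        return "I don't know. I couldn't find that in the documentation."
    return f"(would call LLM with {len(relevant)} relevant chunk(s))"

print(answer_or_abstain([(0.42, "shipping info"), (0.51, "pricing page")]))
print(answer_or_abstain([(0.91, "refund policy text")]))
```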
Better to say "I don't know" than to hallucinate confidently.
7. Start Simple, Iterate
V1: Basic RAG with LangChain + Chroma
V2: Add metadata filtering
V3: Improve chunking strategy
V4: Add reranking
V5: Implement hybrid search
Don't build V5 on day one. Ship V1, measure, iterate.
Conclusion
RAG isn't magic. It's a practical, proven way to make LLMs useful for your specific data.
The core idea:
Store your documents as searchable vectors
Find relevant chunks for each query
Feed chunks + query to LLM for answer
It's that simple.
When to use RAG:
Large, changing knowledge bases (100+ docs)
Need accurate, sourced answers
Can't fine-tune constantly
Private/proprietary data
When NOT to use RAG:
Tiny knowledge base (use prompts)
Real-time transactional data (use APIs)
Guaranteed accuracy required (add human review)
Your next steps:
Build a simple RAG system
Pick 10-20 docs
Use the code from this article
Test with real questions
Test retrieval quality
Are you getting relevant chunks?
Iterate on chunking strategy
Add it to a real project
Internal doc search
Customer support chatbot
Product Q&A
Monitor and improve
Track what works
Fix what doesn't
Iterate based on user feedback
RAG won't solve every problem. But for making AI understand your domain? It's the best tool we have.
Want to go deeper?
Check out my other articles:
"Advanced RAG: Chunking Strategies That Actually Work" (coming soon)
"Building Production RAG with pgvector" (coming soon)
"5 Things I Wish I Knew Before Building My First MLOps Pipeline" (live now)
Questions? Let's connect:
Portfolio: jonathansodeke.framer.website
GitHub: github.com/Shodexco
Now go build something cool. And when your RAG system hallucinates (it will), remember: even Google gets it wrong sometimes.
About the Author
Jonathan Sodeke is a Data Engineer and ML Engineer who builds AI systems that actually work in production, not just in demos. He specializes in RAG, MLOps, and making LLMs useful for real-world applications.
When he's not debugging vector databases at 2am, he's writing about the practical reality of building AI systems and teaching others to navigate the hype.
LinkedIn: https://www.linkedin.com/in/jonathan-sodeke/




