RAG: Making LLMs Smarter and More Accurate
Retrieval-Augmented Generation (RAG) combines external knowledge bases with large language models to significantly improve accuracy, timeliness, and reliability. This article explores RAG principles and practical implementation.
What is RAG?
RAG is a hybrid approach combining information retrieval and text generation. It enhances LLM capabilities by:
- Retrieving relevant information from external knowledge bases
- Combining retrieved results with the user query
- Generating precise answers based on augmented context
- Enabling real-time knowledge updates and expansion
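The whole flow fits in a few lines of pseudocode. In the sketch below, retrieve_chunks and llm_complete are hypothetical stand-ins for the retrieval and generation layers built out in the rest of this article.

def answer_with_rag(query):
    chunks = retrieve_chunks(query, top_k=5)                  # retrieve relevant knowledge
    context = "\n\n".join(chunk["text"] for chunk in chunks)  # combine it with the user query
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)                               # generate a grounded answer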
RAG System Architecture
1. Document Processing Layer
# Document splitting and preprocessing
def process_documents(docs):
    chunks = []
    for doc in docs:
        # Intelligent document splitting
        segments = smart_split(doc,
                               chunk_size=512,
                               overlap=50)
        # Add metadata
        for segment in segments:
            chunks.append({
                'text': segment.text,
                'metadata': {
                    'source': doc.source,
                    'page': segment.page,
                    'timestamp': doc.timestamp
                }
            })
    return chunks

2. Embedding Layer
# Text embedding
from openai import OpenAI

client = OpenAI(api_key="your-key")

def create_embeddings(texts):
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text
        )
        embeddings.append(response.data[0].embedding)
    return embeddings

3. Retrieval Layer
# Similarity search
def retrieve_context(query, vector_store, top_k=5):
    # Embed the query (reuse create_embeddings from the embedding layer)
    query_embedding = create_embeddings([query])[0]
    # Vector similarity search
    results = vector_store.similarity_search(
        query_embedding,
        k=top_k,
        filter={'score': {'$gte': 0.75}}
    )
    # Rerank the retrieved chunks against the query
    reranked = rerank_results(query, results)
    return reranked

4. Generation Layer
# RAG generation
def generate_with_rag(query, context):
    prompt = f"""Answer the question based on the context below.

Context:
{context}

User question: {query}

Please answer accurately based on the context. If the context lacks relevant information, state that explicitly.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an accurate QA assistant"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

Vector Database Selection
Open-source Solutions
- Chroma: Lightweight and easy to integrate
- Weaviate: Feature-rich, supports hybrid search
- Milvus: High performance and highly scalable
- Qdrant: Rust implementation with excellent performance

Cloud Services
- Pinecone: Fully managed with zero ops
- AWS OpenSearch: Deep integration with the AWS ecosystem
- Azure Cognitive Search: Microsoft ecosystem support
- Alibaba Cloud Vector Search: Latency advantages in Mainland China
RAG Optimization Strategies
Key tips to improve RAG system performance
1. Document Splitting Optimization
- Choose splitting strategies based on document type
- Preserve semantic integrity, avoid cutting mid-thought
- Add overlaps to improve recall
- Use hierarchical splitting for multi-granularity retrieval (see the sentence-aware splitting sketch below)
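One way to preserve semantic integrity while keeping an overlap is to split on sentence boundaries and carry the last few sentences into the next chunk. The sketch below is illustrative only; the character budget, the two-sentence overlap, and the regex-based sentence splitter are assumptions, not a prescribed recipe.

import re

def split_by_sentences(text, max_chars=1500, overlap_sentences=2):
    # Naive sentence segmentation; swap in a proper sentence tokenizer for production
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if length + len(sentence) > max_chars and current:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward to preserve context across chunks
            current = current[-overlap_sentences:]
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks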
2. Retrieval Strategy Optimization
- Hybrid retrieval: vector + keyword + metadata (a fusion sketch follows this list)
- Query rewriting and expansion
- Multi-path retrieval and result fusion
- Dynamically adjust the number of retrieved chunks
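A common way to fuse vector and keyword results is reciprocal rank fusion (RRF). The sketch below assumes two hypothetical search callables that return ranked document IDs; the constant k=60 is a conventional default, not a tuned value.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Documents near the top of any ranking accumulate the most score
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, vector_search, keyword_search, top_k=5):
    vector_hits = vector_search(query)    # ranked IDs from the vector store (hypothetical callable)
    keyword_hits = keyword_search(query)  # ranked IDs from BM25 / full-text search (hypothetical callable)
    return reciprocal_rank_fusion([vector_hits, keyword_hits])[:top_k]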
3. Prompt Engineering
- Make instructions explicit to reduce hallucination
- Guide the model to answer strictly from context (see the prompt sketch below)
- Add chain-of-thought reasoning
- Set an appropriate temperature
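As one concrete example of the first two points, a grounded-answer prompt can spell out a refusal fallback and a citation requirement. The exact wording below is a sketch, not a canonical template.

# A stricter, grounded-answer prompt template (wording is an assumption; adjust to your domain)
GROUNDED_PROMPT = """Answer using ONLY the references below.
Rules:
1. If the references do not contain the answer, reply exactly: "The knowledge base does not cover this."
2. Cite the reference you used for each claim.
3. Do not add facts from outside the references.

References:
{context}

Question: {query}
"""

def build_grounded_prompt(query, contexts):
    return GROUNDED_PROMPT.format(context="\n\n".join(contexts), query=query)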
Complete RAG Implementation Example
import chromadb
from openai import OpenAI
import tiktoken

class RAGSystem:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)
        self.chroma = chromadb.Client()
        self.collection = self.chroma.create_collection("knowledge_base")
        self.tokenizer = tiktoken.encoding_for_model("gpt-4")

    def add_documents(self, documents):
        """Add documents to the knowledge base"""
        for doc_index, doc in enumerate(documents):
            chunks = self.split_document(doc)
            embeddings = self.create_embeddings(chunks)
            self.collection.add(
                embeddings=embeddings,
                documents=chunks,
                # Prefix IDs with the document index so chunks from different documents don't collide
                ids=[f"doc_{doc_index}_chunk_{i}" for i in range(len(chunks))],
                metadatas=[{"source": doc["source"]} for _ in chunks]
            )

    def split_document(self, doc, chunk_size=500, overlap=50):
        """Intelligent document splitting"""
        tokens = self.tokenizer.encode(doc["text"])
        chunks = []
        for i in range(0, len(tokens), chunk_size - overlap):
            chunk_tokens = tokens[i:i + chunk_size]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
        return chunks

    def create_embeddings(self, texts):
        """Create text embeddings"""
        response = self.client.embeddings.create(
            model="text-embedding-ada-002",
            input=texts
        )
        return [data.embedding for data in response.data]

    def retrieve(self, query, top_k=5):
        """Retrieve relevant documents"""
        query_embedding = self.create_embeddings([query])[0]
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        return results['documents'][0]

    def generate_answer(self, query):
        """Generate RAG answer"""
        # Retrieve relevant context
        contexts = self.retrieve(query)
        # Build prompt
        context_str = "\n\n".join(contexts)
        prompt = f"""Answer the user's question based on the references below.

References:
{context_str}

User question: {query}

Provide an accurate and detailed response. If the references are insufficient, say so explicitly.
"""
        # Call LLM to generate the answer
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a knowledge-base QA assistant"},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        return {
            "answer": response.choices[0].message.content,
            "contexts": contexts,
            "usage": response.usage
        }

# Usage example
rag = RAGSystem(api_key="your-api-key")

# Add documents
rag.add_documents([
    {"text": "The company was founded in 2020...", "source": "company_intro.pdf"},
    {"text": "Product features include ...", "source": "product_manual.pdf"}
])

# Q&A
result = rag.generate_answer("When was the company founded?")
print(f"Answer: {result['answer']}")
print(f"Contexts: {result['contexts']}")

RAG Application Scenarios
📚 Enterprise Knowledge Base
- Internal document Q&A
- Policy and regulation lookup
- Technical documentation search
- Historical data analysis

🤖 Intelligent Customer Support
- Product inquiries
- After-sales support
- Automated FAQ
- Order lookup

📖 Education and Training
- Course Q&A
- Learning material retrieval
- Personalized tutoring
- Exam review
Performance Evaluation Metrics
RAG evaluation dimensions
Retrieval Quality
- Recall: proportion of relevant documents that are retrieved
- Precision: proportion of retrieved results that are relevant
- MRR: mean reciprocal rank of the first relevant result (computed in the sketch below)
- NDCG: normalized discounted cumulative gain
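Recall and MRR are straightforward to compute once you have a labeled evaluation set. The sketch below assumes each evaluation item maps a query to the set of IDs of its known-relevant documents, plus a hypothetical retriever that returns ranked IDs; both are assumptions for illustration.

def evaluate_retrieval(eval_set, retrieve_ids, k=5):
    # eval_set: {query: set_of_relevant_doc_ids}; retrieve_ids: hypothetical ranked-ID retriever
    recalls, reciprocal_ranks = [], []
    for query, relevant_ids in eval_set.items():
        retrieved = retrieve_ids(query, top_k=k)
        hits = [doc_id for doc_id in retrieved if doc_id in relevant_ids]
        recalls.append(len(set(hits)) / len(relevant_ids))
        # Reciprocal rank of the first relevant hit; 0 if nothing relevant was retrieved
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }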
Generation Quality
- Accuracy: factual correctness
- Relevance: alignment with the question
- Completeness: coverage of required information
- Consistency: consistency with the knowledge base
Common Issues and Solutions
Issue 1: Irrelevant retrieval results
Cause: Vector similarity ≠ semantic relevance
Solution: Use hybrid retrieval, reranking, and query rewriting
Issue 2: Hallucinated answers
Cause: Model extrapolates beyond provided context
Solution: Stricter prompt constraints, lower temperature, add verification
Issue 3: Slow responses
Cause: Retrieval and generation both add latency
Solution: Caching, asynchronous processing, optimized retrieval algorithms
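For the latency issue, caching repeated query embeddings (or whole answers for common questions) is often the cheapest win. The sketch below is a minimal in-memory cache; the hash-based key scheme, the eviction policy, and the embed_fn wrapper are assumptions, not a recommended production design.

import hashlib

class CachedEmbedder:
    def __init__(self, embed_fn, max_entries=10_000):
        # embed_fn: hypothetical wrapper around your embedding API call
        self._embed_fn = embed_fn
        self._cache = {}
        self._max_entries = max_entries

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            if len(self._cache) >= self._max_entries:
                # Simplistic eviction: drop an arbitrary entry; use a real LRU in production
                self._cache.pop(next(iter(self._cache)))
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]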
Start Building Your RAG System
RAG makes LLMs more grounded and controllable. With LLM APIs, you can quickly build enterprise-grade RAG applications, combining knowledge management with intelligent Q&A.