RAG: Making LLMs Smarter and More Accurate

Retrieval-Augmented Generation (RAG) combines external knowledge bases with large language models to significantly improve accuracy, timeliness, and reliability. This article explores RAG principles and practical implementation.

What is RAG?

RAG is a hybrid approach combining information retrieval and text generation. It enhances LLM capabilities by:

  • Retrieving relevant information from external knowledge bases
  • Combining retrieved results with the user query
  • Generating precise answers based on augmented context
  • Enabling real-time knowledge updates and expansion (the sketch after this list shows the basic flow)
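These four steps compose into a single pipeline. The following sketch is only an illustration of the flow; embed, vector_store, and llm are hypothetical placeholders for an embedding model, a vector database, and a chat model (concrete versions of each appear in the architecture below).

# Minimal RAG flow sketch (embed / vector_store / llm are hypothetical stand-ins)
def answer_with_rag(query, vector_store, embed, llm, top_k=5):
    # 1. Retrieve relevant chunks from the external knowledge base
    query_vector = embed(query)
    chunks = vector_store.search(query_vector, k=top_k)

    # 2. Combine retrieved results with the user query
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer from the context only."

    # 3. Generate an answer grounded in the augmented context
    return llm(prompt)

# 4. Knowledge updates are just writes to the vector store:
# vector_store.add(embed(new_doc.text), new_doc)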

RAG System Architecture

1. Document Processing Layer

# Document splitting and preprocessing
def process_documents(docs):
    chunks = []
    for doc in docs:
        # Intelligent document splitting
        segments = smart_split(doc, 
                             chunk_size=512,
                             overlap=50)
        
        # Add metadata
        for segment in segments:
            chunks.append({
                'text': segment.text,
                'metadata': {
                    'source': doc.source,
                    'page': segment.page,
                    'timestamp': doc.timestamp
                }
            })
    return chunks

2. Embedding Layer

# Text embedding
from openai import OpenAI

client = OpenAI(api_key="your-key")

def create_embeddings(texts):
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text
        )
        embeddings.append(response.data[0].embedding)
    return embeddings

3. Retrieval Layer

# Similarity search
def retrieve_context(query, vector_store, top_k=5):
    # Embed the query (reusing the create_embeddings helper above)
    query_embedding = create_embeddings([query])[0]
    
    # Vector similarity search
    results = vector_store.similarity_search(
        query_embedding,
        k=top_k,
        filter={'score': {'$gte': 0.75}}
    )
    
    # Rerank
    reranked = rerank_results(query, results)
    
    return reranked

4. Generation Layer

# RAG generation
def generate_with_rag(query, context):
    prompt = f"""Answer the question based on the context below. 
    
Context: 
{context}

User question: {query}

Please answer accurately based on the context. If the context lacks relevant information, state that explicitly. 
"""
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an accurate QA assistant"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    
    return response.choices[0].message.content

Vector Database Selection

Open-source Solutions

  • Chroma

    Lightweight and easy to integrate (see the sketch after this list)

  • Weaviate

    Feature-rich, supports hybrid search

  • Milvus

    High performance and highly scalable

  • Qdrant

    Rust implementation with excellent performance
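
For a quick local start, Chroma runs in-process with on-disk persistence and no separate server. A minimal sketch (assumes chromadb 0.4+; the path, IDs, and toy vectors are placeholders):

import chromadb

# Persistent local client; data is stored under ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="knowledge_base")

# Add pre-computed embeddings with their texts and metadata
collection.add(
    ids=["chunk-0"],
    embeddings=[[0.1, 0.2, 0.3]],          # replace with real embedding vectors
    documents=["The company was founded in 2020..."],
    metadatas=[{"source": "company_intro.pdf"}],
)

# Query by embedding
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=3)
print(results["documents"])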

Cloud Services

  • Pinecone

    Fully managed with zero ops

  • AWS OpenSearch

    Deep integration with AWS ecosystem

  • Azure Cognitive Search

    Microsoft ecosystem support

  • Alibaba Cloud Vector Search

    Latency advantages in Mainland China

RAG Optimization Strategies

Key tips to improve RAG system performance

1. Document Splitting Optimization

  • Choose splitting strategies based on document type
  • Preserve semantic integrity, avoid cutting mid-thought (see the sketch after this list)
  • Add overlaps to improve recall
  • Use hierarchical splitting for multi-granularity retrieval
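
One way to preserve semantic integrity is to split on paragraph boundaries first and then pack whole paragraphs into chunks, carrying a small overlap forward. A minimal character-based sketch (the token-based variant appears in the complete example below):

def split_by_paragraph(text, chunk_size=1500, overlap=1):
    """Pack whole paragraphs into chunks, carrying `overlap` paragraphs forward."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, length = [], [], 0
    for para in paragraphs:
        if current and length + len(para) > chunk_size:
            chunks.append("\n\n".join(current))
            # keep the last `overlap` paragraphs to preserve context across chunks
            current = current[-overlap:]
            length = sum(len(p) for p in current)
        current.append(para)
        length += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks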

2. Retrieval Strategy Optimization

  • Hybrid retrieval: vector + keyword + metadata (a fusion sketch follows this list)
  • Query rewriting and expansion
  • Multi-path retrieval and result fusion
  • Dynamically adjust number of retrieved chunks
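
A common way to fuse vector and keyword results is reciprocal rank fusion (RRF), which needs only each retriever's ranking, not comparable scores. A minimal sketch, assuming each retriever returns a list of document IDs ordered by relevance:

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking (higher score = better)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse vector-search and keyword (BM25) rankings
fused = reciprocal_rank_fusion([
    ["doc_3", "doc_1", "doc_7"],   # vector similarity ranking
    ["doc_1", "doc_9", "doc_3"],   # keyword/BM25 ranking
])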

3. Prompt Engineering

  • Make instructions explicit to reduce hallucination
  • Guide the model to answer strictly from context (a template sketch follows this list)
  • Add chain-of-thought reasoning
  • Set appropriate temperature
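
These tips translate directly into the prompt template. The sketch below is one possible stricter variant of the prompt used in the Generation Layer; the exact wording is a suggestion, not a requirement:

STRICT_RAG_PROMPT = """Answer the question using ONLY the numbered context passages below.

Context:
{context}

Question: {question}

Rules:
1. Think step by step, but base every step on the context.
2. Cite passage numbers like [1] for each claim.
3. If the context does not contain the answer, reply exactly: "Not found in the knowledge base."
"""

def build_prompt(question, chunks):
    # Number each retrieved chunk so the model can cite it
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return STRICT_RAG_PROMPT.format(context=context, question=question)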

Complete RAG Implementation Example

import chromadb
from openai import OpenAI
import tiktoken

class RAGSystem:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)
        self.chroma = chromadb.Client()
        self.collection = self.chroma.create_collection("knowledge_base")
        self.tokenizer = tiktoken.encoding_for_model("gpt-4")
        
    def add_documents(self, documents):
        """Add documents to the knowledge base"""
        for doc_idx, doc in enumerate(documents):
            chunks = self.split_document(doc)
            embeddings = self.create_embeddings(chunks)
            
            # IDs must be unique across the whole collection, so include the document index
            self.collection.add(
                embeddings=embeddings,
                documents=chunks,
                ids=[f"doc_{doc_idx}_chunk_{i}" for i in range(len(chunks))],
                metadatas=[{"source": doc["source"]} for _ in chunks]
            )
    
    def split_document(self, doc, chunk_size=500, overlap=50):
        """Intelligent document splitting"""
        tokens = self.tokenizer.encode(doc["text"])
        chunks = []
        
        for i in range(0, len(tokens), chunk_size - overlap):
            chunk_tokens = tokens[i:i + chunk_size]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
            
        return chunks
    
    def create_embeddings(self, texts):
        """Create text embeddings"""
        response = self.client.embeddings.create(
            model="text-embedding-ada-002",
            input=texts
        )
        return [data.embedding for data in response.data]
    
    def retrieve(self, query, top_k=5):
        """Retrieve relevant documents"""
        query_embedding = self.create_embeddings([query])[0]
        
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        
        return results['documents'][0]
    
    def generate_answer(self, query):
        """Generate RAG answer"""
        # Retrieve relevant context
        contexts = self.retrieve(query)
        
        # Build prompt
        context_str = "\n\n".join(contexts)
        prompt = f"""Answer the user's question based on the references below. 
        
References: 
{context_str}

User question: {query}

Provide an accurate and detailed response. If references are insufficient, say so explicitly. 
"""
        
        # Call LLM to generate the answer
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a knowledge-base QA assistant"},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        
        return {
            "answer": response.choices[0].message.content,
            "contexts": contexts,
            "usage": response.usage
        }

# Usage example
rag = RAGSystem(api_key="your-api-key")

# Add documents
rag.add_documents([
    {"text": "The company was founded in 2020...", "source": "company_intro.pdf"},
    {"text": "Product features include ...", "source": "product_manual.pdf"}
])

# Q&A
result = rag.generate_answer("When was the company founded?")
print(f"Answer: {result['answer']}")
print(f"Contexts: {result['contexts']}")

RAG Application Scenarios

📚 Enterprise Knowledge Base

  • Internal document Q&A
  • Policy and regulation lookup
  • Technical documentation search
  • Historical data analysis

🤖 Intelligent Customer Support

  • Product inquiries
  • After-sales support
  • Automated FAQ
  • Order lookup

📖 Education and Training

  • Course Q&A
  • Learning material retrieval
  • Personalized tutoring
  • Exam review

Performance Evaluation Metrics

RAG evaluation dimensions

Retrieval Quality

  • Recall: Proportion of relevant documents retrieved
  • Precision: Relevance of retrieved results
  • MRR: Mean reciprocal rank (computed, along with recall@k, in the sketch after this list)
  • NDCG: Normalized discounted cumulative gain
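
Recall@k and MRR are straightforward to compute once each query has a labeled set of relevant document IDs. A minimal sketch:

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant doc IDs that appear in the top-k retrieved results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant result over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved) if all_retrieved else 0.0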

Generation Quality

  • Accuracy: Factual correctness
  • Relevance: Alignment with the question
  • Completeness: Coverage of required information
  • Consistency: Consistency with the knowledge base

FAQ and Solutions

Issue 1: Irrelevant retrieval results

Cause: Vector similarity ≠ semantic relevance

Solution: Use hybrid retrieval, reranking, and query rewriting

Issue 2: Hallucinated answers

Cause: Model extrapolates beyond provided context

Solution: Stricter prompt constraints, lower temperature, add verification

Issue 3: Slow responses

Cause: Retrieval and generation both add latency

Solution: Caching (sketched below), asynchronous processing, optimized retrieval algorithms
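
Caching is often the cheapest win: repeated queries skip the embedding call entirely. A minimal in-memory sketch built on the create_embeddings helper from the Embedding Layer; a production system would typically use Redis or a similar shared cache:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_query_embedding(text: str):
    # create_embeddings is the helper defined in the Embedding Layer above
    return tuple(create_embeddings([text])[0])  # tuples are hashable, so cacheable

def retrieve_cached(query, vector_store, top_k=5):
    # Identical queries reuse the cached embedding instead of calling the API again
    query_embedding = list(cached_query_embedding(query))
    return vector_store.similarity_search(query_embedding, k=top_k)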

Start Building Your RAG System

RAG makes LLMs more grounded and controllable. With LLM APIs, you can quickly build enterprise-grade RAG applications, combining knowledge management with intelligent Q&A.
