Intelligent Caching: Cut AI API Call Costs by Up to 90%

Intelligent caching strategies let you avoid repetitive API calls, cutting costs while dramatically improving response speed.

Caching Strategy Comparison

🔤 Exact Match Cache

  • Matches identical inputs only
  • Hit rate: 20-30%
  • Simple to implement
  • Suited to FAQ scenarios (a minimal sketch follows this comparison)

🧠 Semantic Cache

  • Matches semantically similar questions
  • Hit rate: 50-70%
  • Requires a vector database
  • Most intelligent matching

📊 Dialogue Cache

  • Optimizes multi-turn dialogue
  • Hit rate: 30-40%
  • Context-aware
  • Highest implementation complexity
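
The semantic cache gets a full implementation below; the exact-match tier needs far less machinery. A minimal sketch, assuming a local Redis instance (the key prefix and TTL are illustrative):

import hashlib

import redis

class ExactMatchCache:
    """Exact-match tier: one Redis GET per lookup."""

    def __init__(self, redis_host='localhost', ttl=3600):
        self.redis_client = redis.Redis(host=redis_host)
        self.ttl = ttl

    def _key(self, prompt):
        # sha256 is stable across processes, unlike Python's built-in
        # hash(), which is salted per interpreter run
        return "cache:exact:" + hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt):
        hit = self.redis_client.get(self._key(prompt))
        return hit.decode() if hit else None

    def set(self, prompt, result):
        self.redis_client.set(self._key(prompt), result, ex=self.ttl)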

Semantic Cache Implementation

import hashlib

import redis
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Semantic cache system"""
    
    def __init__(self, redis_host='localhost', threshold=0.85):
        self.redis_client = redis.Redis(host=redis_host)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold
        self.vector_dim = 384
        
    def get_embedding(self, text):
        """Generate a normalized text vector (dot product == cosine similarity)"""
        return self.encoder.encode(text, normalize_embeddings=True)
    
    def search_similar(self, query, top_k=5):
        """Search for similar queries"""
        query_vec = self.get_embedding(query)
        
        # Get all cached vectors. KEYS blocks Redis while it scans the whole
        # keyspace; at scale, prefer SCAN or a proper vector index.
        keys = self.redis_client.keys("cache:vec:*")
        
        similarities = []
        for key in keys:
            cached_vec = np.frombuffer(
                self.redis_client.get(key), 
                dtype=np.float32
            )
            
            # Calculate cosine similarity
            similarity = np.dot(query_vec, cached_vec)
            if similarity > self.threshold:
                cache_id = key.decode().split(":")[-1]
                similarities.append((cache_id, similarity))
        
        # Return the most similar results
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
    
    def _cache_id(self, query):
        """Stable content hash (Python's built-in hash() is salted per process)"""
        return hashlib.sha256(query.encode()).hexdigest()

    def get(self, query):
        """Get cached result"""
        # Exact match
        exact_key = f"cache:exact:{self._cache_id(query)}"
        exact_result = self.redis_client.get(exact_key)
        if exact_result:
            return exact_result.decode()
        
        # Semantic match
        similar_results = self.search_similar(query, top_k=1)
        if similar_results:
            cache_id, _similarity = similar_results[0]
            result = self.redis_client.get(f"cache:result:{cache_id}")
            # Guard against the result expiring after its vector was read
            if result:
                return result.decode()
        
        return None
    
    def set(self, query, result, ttl=3600):
        """Set cache"""
        cache_id = self._cache_id(query)
        
        # Store vector
        query_vec = self.get_embedding(query)
        vec_key = f"cache:vec:{cache_id}"
        self.redis_client.set(vec_key, query_vec.tobytes(), ex=ttl)
        
        # Store result
        result_key = f"cache:result:{cache_id}"
        self.redis_client.set(result_key, result, ex=ttl)
        
        # Exact match cache
        exact_key = f"cache:exact:{cache_id}"
        self.redis_client.set(exact_key, result, ex=ttl)

# Usage Example
cache = SemanticCache()

def llm_with_cache(prompt):
    # Check cache first
    cached = cache.get(prompt)
    if cached:
        print("Cache hit!")
        return cached
    
    # Call the LLM (call_llm_api stands in for your provider's client)
    result = call_llm_api(prompt)
    
    # Store in cache
    cache.set(prompt, result)
    
    return result
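
The threshold parameter is the main tuning knob here: values near 0.95 produce few false hits (distinct questions served a cached answer meant for another) but a lower hit rate, while values much below the 0.85 default risk conflating different questions. A reasonable approach is to start high and lower it gradually while spot-checking answer quality.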

Cache Effect Analysis

Actual Application Data

Customer Service Scenario

  • Cache hit rate: 65%
  • Cost reduction: 70%
  • Response time: 50ms

Knowledge Q&A

  • Cache hit rate: 82%
  • Cost reduction: 85%
  • Response time: 20ms

Advanced Caching Techniques

Distributed Cache Architecture

# Redis Cluster configuration (redis-py >= 4.1)
import hashlib

from redis.cluster import RedisCluster, ClusterNode

class DistributedCache:
    def __init__(self):
        self.nodes = [
            ClusterNode("redis1", 6379),
            ClusterNode("redis2", 6379),
            ClusterNode("redis3", 6379)
        ]
        self.rc = RedisCluster(
            startup_nodes=self.nodes,
            decode_responses=True
        )
        
    def consistent_hash(self, key):
        """Consistent hashing for client-side sharding. Note that Redis
        Cluster already routes keys to nodes via CRC16 hash slots, so this
        is only needed when sharding across independent Redis instances."""
        return hashlib.md5(key.encode()).hexdigest()
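
Usage is transparent once the cluster client exists, and the key layout from SemanticCache works unchanged. A sketch, assuming the three hostnames above resolve to running cluster nodes:

cache = DistributedCache()
cache.rc.set("cache:exact:abc123", "cached answer", ex=3600)
print(cache.rc.get("cache:exact:abc123"))  # -> "cached answer"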

Smart Expiration Strategy

  • Popularity-aware: extend the TTL of frequently accessed entries (see the sketch after this list)
  • Capacity control: evict the least recently used entries (LRU)
  • Tiered storage: keep hot data in memory, persist cold data
  • Preloading: predict likely queries and warm the cache in advance
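
A minimal sketch of the popularity-aware rule, reusing the key layout from SemanticCache above; the hit-counter key and the linear TTL growth are illustrative choices:

def touch(redis_client, cache_id, base_ttl=3600, max_ttl=86400):
    """On each cache hit, extend the entry's TTL in proportion to its
    hit count, capped at max_ttl."""
    hits = redis_client.incr(f"cache:hits:{cache_id}")
    new_ttl = min(base_ttl * (1 + hits), max_ttl)
    for prefix in ("cache:vec:", "cache:result:", "cache:exact:"):
        redis_client.expire(prefix + cache_id, new_ttl)
    redis_client.expire(f"cache:hits:{cache_id}", new_ttl)

For the LRU item, Redis can handle eviction natively: set maxmemory and maxmemory-policy allkeys-lru in the server configuration.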

Cache Monitoring Metrics

Track at least three numbers in production:

  • Hit rate: share of requests served from cache
  • Cost savings: API spend avoided by cache hits
  • Latency reduction: response time of cache hits versus misses
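
A minimal in-process instrumentation sketch; the class and counter names are illustrative, and a production system would export these to a metrics backend:

class CacheMetrics:
    """Counters for the metrics above."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

metrics = CacheMetrics()

def llm_with_cache_and_metrics(prompt):
    cached = cache.get(prompt)
    metrics.record(hit=cached is not None)
    if cached:
        return cached
    result = call_llm_api(prompt)
    cache.set(prompt, result)
    return result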

Start Optimizing Your AI Costs

Implement smart caching strategies to make every API call worthwhile.
