Intelligent Caching: Reduce AI Call Costs by 90%
Intelligent caching strategies eliminate many repetitive API calls, cutting costs while also dramatically improving response speed.
Caching Strategy Comparison
| Strategy | How It Matches | Typical Hit Rate | Notes |
|---|---|---|---|
| 🔤 Exact Match Cache | Identical inputs only | 20-30% | Simple to implement; suited to FAQ scenarios |
| 🧠 Semantic Cache | Similar questions | 50-70% | Requires a vector database; the most intelligent option |
| 📊 Dialogue Cache | Multi-turn dialogue context | 30-40% | Context-aware; higher implementation complexity |
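For the exact-match strategy, a few lines are enough. The sketch below keys responses by a stable hash of the prompt; `call_llm_api` is a hypothetical stand-in for your actual LLM client.

```python
import hashlib

_exact_cache = {}  # in-memory exact-match cache; swap in Redis for persistence

def cached_llm_call(prompt):
    # hashlib gives a key that is stable across processes, unlike built-in hash()
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _exact_cache:
        return _exact_cache[key]  # hit: no API call, no cost
    result = call_llm_api(prompt)  # hypothetical LLM client call
    _exact_cache[key] = result
    return result
```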
Semantic Cache Implementation
```python
import hashlib

import numpy as np
import redis
from sentence_transformers import SentenceTransformer


class SemanticCache:
    """Semantic cache system backed by Redis."""

    def __init__(self, redis_host='localhost', threshold=0.85):
        self.redis_client = redis.Redis(host=redis_host)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold
        self.vector_dim = 384  # output dimension of all-MiniLM-L6-v2

    @staticmethod
    def _cache_id(text):
        """Stable ID for a query; built-in hash() is salted per process."""
        return hashlib.sha256(text.encode('utf-8')).hexdigest()

    def get_embedding(self, text):
        """Generate a normalized text vector (dot product == cosine similarity)."""
        return self.encoder.encode(text, normalize_embeddings=True)

    def search_similar(self, query, top_k=5):
        """Search for cached queries above the similarity threshold."""
        query_vec = self.get_embedding(query)
        similarities = []
        # Iterate over all cached vectors (SCAN avoids blocking Redis like KEYS would)
        for key in self.redis_client.scan_iter("cache:vec:*"):
            cached_vec = np.frombuffer(
                self.redis_client.get(key),
                dtype=np.float32
            )
            # Cosine similarity: both vectors are already L2-normalized
            similarity = float(np.dot(query_vec, cached_vec))
            if similarity > self.threshold:
                cache_id = key.decode().split(":")[-1]
                similarities.append((cache_id, similarity))
        # Return the most similar results first
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]

    def get(self, query):
        """Get a cached result: exact match first, then semantic match."""
        # Exact match
        exact_key = f"cache:exact:{self._cache_id(query)}"
        exact_result = self.redis_client.get(exact_key)
        if exact_result:
            return exact_result.decode()
        # Semantic match
        similar_results = self.search_similar(query, top_k=1)
        if similar_results:
            cache_id, _similarity = similar_results[0]
            result = self.redis_client.get(f"cache:result:{cache_id}")
            if result:  # the result may have expired independently of the vector
                return result.decode()
        return None

    def set(self, query, result, ttl=3600):
        """Cache the query vector, the result, and an exact-match entry."""
        cache_id = self._cache_id(query)
        # Store vector
        query_vec = self.get_embedding(query).astype(np.float32)
        self.redis_client.set(f"cache:vec:{cache_id}", query_vec.tobytes(), ex=ttl)
        # Store result
        self.redis_client.set(f"cache:result:{cache_id}", result, ex=ttl)
        # Exact-match cache
        self.redis_client.set(f"cache:exact:{cache_id}", result, ex=ttl)
```
```python
# Usage example (call_llm_api stands in for your actual LLM client)
cache = SemanticCache()

def llm_with_cache(prompt):
    # Check cache first
    cached = cache.get(prompt)
    if cached:
        print("Cache hit!")
        return cached
    # Cache miss: call the LLM and store the result
    result = call_llm_api(prompt)
    cache.set(prompt, result)
    return result
```

Cache Effect Analysis
Actual Application Data
| Scenario | Cache Hit Rate | Cost Reduction | Response Time |
|---|---|---|---|
| Customer Service | 65% | -70% | 50 ms |
| Knowledge Q&A | 82% | -85% | 20 ms |
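As a rough sanity check on these figures: with a pass-through cache, every hit avoids one paid API call, so savings track the hit rate. The model below uses illustrative request volumes and prices, not numbers from the table.

```python
def monthly_api_cost(requests, cost_per_call, hit_rate):
    """Only cache misses reach the paid API."""
    return requests * (1 - hit_rate) * cost_per_call

baseline = monthly_api_cost(1_000_000, 0.002, 0.0)     # no cache: $2000
with_cache = monthly_api_cost(1_000_000, 0.002, 0.65)  # 65% hit rate: $700
print(f"savings: {1 - with_cache / baseline:.0%}")     # savings: 65%
```

Reductions above the hit rate, like the -70% reported for customer service, would require hits to skew toward the longer, more expensive prompts.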
Advanced Caching Techniques
Distributed Cache Architecture

```python
# Redis Cluster configuration (redis-py >= 4.1)
import hashlib

from redis.cluster import RedisCluster, ClusterNode


class DistributedCache:
    def __init__(self):
        # Seed nodes; the full cluster topology is discovered automatically
        self.nodes = [
            ClusterNode("redis1", 6379),
            ClusterNode("redis2", 6379),
            ClusterNode("redis3", 6379),
        ]
        self.rc = RedisCluster(
            startup_nodes=self.nodes,
            decode_responses=True
        )

    def consistent_hash(self, key):
        """Stable digest for application-level key placement."""
        # Note: Redis Cluster itself shards by CRC16 hash slots, so this
        # helper only matters when routing across independent Redis nodes.
        return hashlib.md5(key.encode()).hexdigest()
```
Smart Expiration Strategy
- Popularity-aware: extend the TTL of frequently accessed entries (see the sketch after this list)
- Capacity control: evict the least-recently-used (LRU) entries when capacity is reached
- Tiered storage: keep hot data in memory, persist cold data to disk
- Preloading: predict likely queries and warm the cache in advance
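The popularity-aware item can be layered onto the SemanticCache above by extending a key's TTL on every hit. A minimal sketch using Redis TTL/EXPIRE; the bonus and cap values are arbitrary:

```python
def get_with_ttl_boost(redis_client, key, bonus=600, max_ttl=86400):
    """On each cache hit, extend the key's TTL so popular entries live longer."""
    value = redis_client.get(key)
    if value is not None:
        remaining = redis_client.ttl(key)  # seconds left; -1 = no TTL, -2 = missing
        if remaining > 0:
            redis_client.expire(key, min(remaining + bonus, max_ttl))
    return value
```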
Cache Monitoring Metrics
- Hit rate: share of requests served from cache
- Cost savings: API spend avoided by cache hits
- Latency reduction: response-time improvement when a hit skips the API round trip
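All three metrics fall out of a couple of counters wrapped around the cache lookup. A minimal sketch; the per-call price is an assumption for illustration:

```python
class CacheMetrics:
    """Running counters behind the three dashboard metrics."""

    def __init__(self, cost_per_call=0.002):  # assumed per-call price
        self.hits = 0
        self.misses = 0
        self.cost_per_call = cost_per_call

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_saved(self):
        # Each hit is one paid API call avoided
        return self.hits * self.cost_per_call


metrics = CacheMetrics()
metrics.record(hit=True)
metrics.record(hit=False)
print(f"hit rate {metrics.hit_rate:.0%}, saved ${metrics.cost_saved:.4f}")
```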
Start Optimizing Your AI Costs
Implement smart caching strategies to make every API call worthwhile.