LLM API Performance Testing Complete Guide

Systematic performance testing can help you understand API performance boundaries, optimize system architecture, and ensure stable operation under high load.

Performance Testing Dimensions

⏱️ Latency Testing

  • Time to First Token (TTFT)
  • End-to-end latency
  • P50/P95/P99 latency percentiles
  • Streaming output latency

📊 Throughput Testing

  • QPS (Queries Per Second)
  • TPS (Tokens Per Second)
  • Concurrent users
  • Resource utilization
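
To see how the two throughput rates differ, a quick worked example (the numbers are hypothetical):

```python
# Hypothetical test window: 100 requests finish in 20 s, emitting 15,000 tokens total
requests_completed = 100
tokens_generated = 15_000
window_seconds = 20

qps = requests_completed / window_seconds   # request throughput
tps = tokens_generated / window_seconds     # token throughput

print(f"QPS={qps:.1f}, TPS={tps:.1f}")  # → QPS=5.0, TPS=750.0
```

QPS reflects how many user interactions the system absorbs; TPS reflects how fast the model actually generates text.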

Test Tool Implementation

import asyncio
import time
import statistics
import aiohttp

class LLMPerformanceTester:
    """LLM API performance testing tool"""
    
    def __init__(self, api_key, base_url):
        self.api_key = api_key
        self.base_url = base_url
        self.results = []
    
    async def single_request(self, prompt, session):
        """Single request test"""
        start_time = time.time()
        first_token_time = None
        tokens = []
        
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True
        }
        
        try:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers=headers,
                json=data
            ) as response:
                async for raw_line in response.content:
                    line = raw_line.decode("utf-8").strip()
                    # Keep only SSE data chunks; skip keep-alive blanks and the [DONE] marker
                    if not line.startswith("data:") or line == "data: [DONE]":
                        continue
                    if first_token_time is None:
                        first_token_time = time.time()
                    # Each data chunk approximates one streamed token
                    tokens.append(line)
                
                end_time = time.time()
                
                if first_token_time is None:
                    return {"error": "empty response stream"}
                
                return {
                    "total_time": end_time - start_time,
                    "ttft": first_token_time - start_time,
                    "tokens": len(tokens),
                    "tps": len(tokens) / (end_time - start_time)
                }
        except Exception as e:
            return {"error": str(e)}
    
    async def concurrent_test(self, prompt, num_requests=100):
        """Concurrency test"""
        async with aiohttp.ClientSession() as session:
            tasks = []
            for _ in range(num_requests):
                task = self.single_request(prompt, session)
                tasks.append(task)
            
            results = await asyncio.gather(*tasks)
            
        # Analyze results
        successful = [r for r in results if "error" not in r]
        failed = len(results) - len(successful)
        
        if successful:
            latencies = [r["total_time"] for r in successful]
            ttfts = [r["ttft"] for r in successful]
            tps_values = [r["tps"] for r in successful]
            
            return {
                "total_requests": num_requests,
                "successful": len(successful),
                "failed": failed,
                "avg_latency": statistics.mean(latencies),
                "p50_latency": statistics.median(latencies),
                "p95_latency": self.percentile(latencies, 95),
                "p99_latency": self.percentile(latencies, 99),
                "avg_ttft": statistics.mean(ttfts),
                "avg_tps": statistics.mean(tps_values)
            }
        
        # Every request failed
        return {"total_requests": num_requests, "successful": 0, "failed": failed}
    
    def percentile(self, data, p):
        """Nearest-rank percentile; index clamped so P99/P100 stay in range"""
        sorted_data = sorted(data)
        index = min(int(len(sorted_data) * p / 100), len(sorted_data) - 1)
        return sorted_data[index]
    
    async def load_test(self, prompt, duration=60, rps=10):
        """Sustained load test at a fixed request rate"""
        start_time = time.time()
        
        async with aiohttp.ClientSession() as session:
            tasks = []
            while time.time() - start_time < duration:
                # Fire requests at the target RPS without waiting for responses
                tasks.append(asyncio.create_task(self.single_request(prompt, session)))
                await asyncio.sleep(1 / rps)
            results = await asyncio.gather(*tasks)
        
        successful = [r for r in results if "error" not in r]
        return {
            "requests_sent": len(results),
            "successful": len(successful),
            "failed": len(results) - len(successful)
        }
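
To sanity-check the aggregation step without calling a live endpoint, you can run the same nearest-rank summary over synthetic results. The `summarize` function below is a standalone sketch mirroring the analysis in `concurrent_test`, with made-up timing values:

```python
import statistics

def summarize(results):
    """Aggregate per-request results the same way concurrent_test does."""
    successful = [r for r in results if "error" not in r]
    latencies = sorted(r["total_time"] for r in successful)

    def pct(p):
        # Nearest-rank percentile, clamped to the last element
        return latencies[min(int(len(latencies) * p / 100), len(latencies) - 1)]

    return {
        "successful": len(successful),
        "failed": len(results) - len(successful),
        "avg_latency": statistics.mean(latencies),
        "p50_latency": statistics.median(latencies),
        "p95_latency": pct(95),
        "p99_latency": pct(99),
    }

# Synthetic results: nine fast requests, one slow outlier, one failure
fake = [{"total_time": 1.0 + 0.1 * i} for i in range(9)]
fake += [{"total_time": 5.0}, {"error": "timeout"}]

summary = summarize(fake)
print(summary)
```

The P95/P99 figures land on the 5.0s outlier while the median stays near 1.45s, which is exactly the gap these percentiles exist to expose.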

Performance Benchmark Testing

Performance Metrics for Different Scenarios

| Test Scenario        | TTFT    | TPS  | P95 Latency |
|----------------------|---------|------|-------------|
| Simple Q&A           | < 500ms | > 80 | < 2s        |
| Code generation      | < 800ms | > 60 | < 5s        |
| Long text generation | < 1s    | > 40 | < 10s       |
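
These targets can be turned into an automated pass/fail check. The sketch below hard-codes the table's thresholds; the `check_slo` helper and scenario keys are our own naming, not part of any standard API:

```python
# Target thresholds from the table above (seconds for latency, tokens/s for TPS)
TARGETS = {
    "simple_qa": {"ttft": 0.5, "tps": 80, "p95": 2.0},
    "code_gen":  {"ttft": 0.8, "tps": 60, "p95": 5.0},
    "long_text": {"ttft": 1.0, "tps": 40, "p95": 10.0},
}

def check_slo(scenario, ttft, tps, p95):
    """Return a list of violated targets for the given scenario."""
    t = TARGETS[scenario]
    violations = []
    if ttft > t["ttft"]:
        violations.append(f"TTFT {ttft:.2f}s exceeds {t['ttft']}s")
    if tps < t["tps"]:
        violations.append(f"TPS {tps:.0f} below {t['tps']}")
    if p95 > t["p95"]:
        violations.append(f"P95 {p95:.2f}s exceeds {t['p95']}s")
    return violations

print(check_slo("simple_qa", ttft=0.4, tps=95, p95=1.8))   # no violations
print(check_slo("code_gen",  ttft=1.2, tps=45, p95=6.0))   # all three violated
```

Running a check like this in CI turns the benchmark table into a regression gate rather than a one-off measurement.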

Performance Optimization Recommendations

Client-side Optimization

  • ✅ Use connection pooling
  • ✅ Implement request retries with backoff
  • ✅ Batch requests where possible
  • ✅ Cache results locally
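
The retry item above can be sketched as a small exponential-backoff wrapper. This is a pure-Python illustration with a simulated flaky call, not a production client; the function names are ours:

```python
import time
import random

def with_retries(fn, max_attempts=4, base_delay=0.1):
    """Call fn(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Backoff doubles each attempt; jitter avoids synchronized retry storms
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

# Simulated flaky API call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # "ok" after two retries
```

In a real client you would retry only on transient errors (timeouts, 429, 5xx) and respect any Retry-After header the server sends.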

Server-side Optimization

  • ✅ Load balancing
  • ✅ Auto scaling
  • ✅ Edge caching
  • ✅ Traffic control
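
Traffic control is commonly implemented as a token bucket: requests consume tokens, which refill at a fixed rate, allowing short bursts while capping sustained throughput. A minimal sketch (parameters are illustrative, not tied to any particular gateway):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: `capacity` is burst size, `rate` is tokens/sec refill."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)   # 5 req/s sustained, burst of 2
decisions = [bucket.allow() for _ in range(4)]
print(decisions)  # first two pass, the rest are throttled
```

Rejected requests would typically receive a 429 response, which pairs with the client-side retry logic above.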

Optimize Your API Performance

Systematic performance testing helps ensure your AI applications run stably and efficiently under realistic production load.
