LLM API Performance Testing Complete Guide
Systematic performance testing helps you understand an API's performance limits, optimize your system architecture, and ensure stable operation under high load.
Performance Testing Dimensions
⏱️ Latency Testing
- Time to First Token (TTFT)
- End-to-end latency
- P50/P95/P99 percentiles (see the sketch below)
- Streaming output latency
📊 Throughput Testing
- QPS (Queries Per Second)
- TPS (Tokens Per Second)
- Concurrent users
- Resource utilization
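To make these metrics concrete, here is a minimal sketch that derives the percentile, QPS, and TPS figures from raw per-request measurements; the sample numbers are invented for illustration:

import statistics

# Invented samples: (total_latency_seconds, tokens_generated) per request,
# collected over a 10-second test window.
samples = [(1.2, 96), (0.8, 64), (2.5, 180), (1.1, 90), (0.9, 72)]
window_s = 10.0

latencies = sorted(lat for lat, _ in samples)
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

qps = len(samples) / window_s                    # requests completed per second
tps = sum(tok for _, tok in samples) / window_s  # tokens generated per second
print(f"P50={p50:.2f}s P95={p95:.2f}s P99={p99:.2f}s QPS={qps:.2f} TPS={tps:.1f}")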
Test Tool Implementation
import asyncio
import time
import statistics

import aiohttp


class LLMPerformanceTester:
    """LLM API performance testing tool."""

    def __init__(self, api_key, base_url):
        self.api_key = api_key
        self.base_url = base_url
        self.results = []

    async def single_request(self, prompt, session):
        """Send one streaming request and record latency metrics."""
        start_time = time.time()
        first_token_time = None
        chunks = []

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        data = {
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }

        try:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers=headers,
                json=data,
            ) as response:
                response.raise_for_status()
                async for line in response.content:
                    # Skip SSE keep-alive blank lines; count only data lines.
                    if not line.strip():
                        continue
                    if first_token_time is None:
                        first_token_time = time.time()
                    chunks.append(line)

            if first_token_time is None:
                return {"error": "no tokens received"}

            end_time = time.time()
            return {
                "total_time": end_time - start_time,
                "ttft": first_token_time - start_time,
                "tokens": len(chunks),  # approximation: one SSE data line per chunk
                "tps": len(chunks) / (end_time - start_time),
            }
        except Exception as e:
            return {"error": str(e)}

    async def concurrent_test(self, prompt, num_requests=100):
        """Fire num_requests concurrent requests and aggregate the results."""
        async with aiohttp.ClientSession() as session:
            tasks = [self.single_request(prompt, session) for _ in range(num_requests)]
            results = await asyncio.gather(*tasks)

        successful = [r for r in results if "error" not in r]
        failed = len(results) - len(successful)

        if not successful:
            return {"total_requests": num_requests, "successful": 0, "failed": failed}

        latencies = [r["total_time"] for r in successful]
        ttfts = [r["ttft"] for r in successful]
        tps_values = [r["tps"] for r in successful]

        return {
            "total_requests": num_requests,
            "successful": len(successful),
            "failed": failed,
            "avg_latency": statistics.mean(latencies),
            "p50_latency": statistics.median(latencies),
            "p95_latency": self.percentile(latencies, 95),
            "p99_latency": self.percentile(latencies, 99),
            "avg_ttft": statistics.mean(ttfts),
            "avg_tps": statistics.mean(tps_values),
        }

    def percentile(self, data, p):
        """Nearest-rank percentile; the index is clamped so p=100 stays in range."""
        sorted_data = sorted(data)
        index = min(int(len(sorted_data) * p / 100), len(sorted_data) - 1)
        return sorted_data[index]

    async def load_test(self, prompt, duration=60, rps=10):
        """Hold a fixed request rate for `duration` seconds, then collect results."""
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            tasks = []
            while time.time() - start_time < duration:
                # Launch at the target RPS without blocking on each response.
                tasks.append(asyncio.create_task(self.single_request(prompt, session)))
                await asyncio.sleep(1 / rps)
            results = await asyncio.gather(*tasks)
        return results
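A minimal sketch of driving the tester; the API key and base URL below are placeholders:

async def main():
    tester = LLMPerformanceTester(
        api_key="YOUR_API_KEY",                 # placeholder
        base_url="https://api.example.com/v1",  # placeholder endpoint
    )
    summary = await tester.concurrent_test(
        "Explain TCP slow start in two sentences.",
        num_requests=50,
    )
    print(summary)

asyncio.run(main())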
Performance Benchmark Testing
Performance Metrics for Different Scenarios
| Test Scenario | TTFT | TPS (tokens/s) | P95 Latency |
|---|---|---|---|
| Simple Q&A | < 500ms | > 80 | < 2s |
| Code Generation | < 800ms | > 60 | < 5s |
| Long Text Generation | < 1s | > 40 | < 10s |
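One way to put these targets to work is to check each test run against them automatically. The sketch below encodes the table and compares it with the summary dict returned by concurrent_test; the scenario keys are made up for this example:

# Targets from the table above (times in seconds, TPS in tokens/s).
BENCHMARKS = {
    "simple_qa":       {"avg_ttft": 0.5, "avg_tps": 80, "p95_latency": 2.0},
    "code_generation": {"avg_ttft": 0.8, "avg_tps": 60, "p95_latency": 5.0},
    "long_text":       {"avg_ttft": 1.0, "avg_tps": 40, "p95_latency": 10.0},
}

def check_benchmarks(summary, scenario):
    """Return (metric, measured, target) tuples for every missed target."""
    targets = BENCHMARKS[scenario]
    misses = []
    if summary["avg_ttft"] > targets["avg_ttft"]:
        misses.append(("avg_ttft", summary["avg_ttft"], targets["avg_ttft"]))
    if summary["avg_tps"] < targets["avg_tps"]:  # throughput: higher is better
        misses.append(("avg_tps", summary["avg_tps"], targets["avg_tps"]))
    if summary["p95_latency"] > targets["p95_latency"]:
        misses.append(("p95_latency", summary["p95_latency"], targets["p95_latency"]))
    return misses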
Performance Optimization Recommendations
Client-side Optimization
- ✅ Use connection pooling
- ✅ Implement request retries (both sketched after this list)
- ✅ Batch requests where possible
- ✅ Cache results locally
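The first two items can be sketched together: one pooled aiohttp session shared across requests, plus exponential-backoff retries. The pool sizes, timeout, and backoff schedule here are arbitrary illustration values, not recommendations:

import asyncio
import aiohttp

async def post_with_retry(session, url, payload, max_retries=3):
    """POST with exponential backoff on transient failures."""
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=payload) as resp:
                resp.raise_for_status()
                return await resp.json()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s

async def main():
    # Connection pooling: one session with a bounded connector, shared by all requests.
    connector = aiohttp.TCPConnector(limit=50, limit_per_host=20)
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        payload = {"model": "gpt-4", "messages": [{"role": "user", "content": "ping"}]}
        result = await post_with_retry(
            session, "https://api.example.com/v1/chat/completions", payload  # placeholder URL
        )
        print(result)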
Server-side Optimization
- ✅ Load balancing
- ✅ Auto scaling
- ✅ Edge caching
- ✅ Traffic control (sketched below)
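Most of these belong to infrastructure (load balancers, autoscalers, CDNs), but traffic control is easy to illustrate. Below is a hypothetical token-bucket limiter of the kind a gateway might apply per client:

import time

class TokenBucket:
    """Allow `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False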
Optimize Your API Performance
Systematic performance testing helps ensure your AI applications run stably and efficiently as load grows.