Context Management Guide
Master context window management techniques, optimize token usage, and improve conversation quality with Large Language Models.
Token Optimization · Performance Boost · Cost Control
Understanding the Context Window
What Is a Context Window?
The context window is the maximum number of tokens a Large Language Model can process in a single request, counting the input and the output together.
Context Window = Input Tokens + Output Tokens
Context Limits for Popular Models
| Model | Context Window |
| --- | --- |
| GPT-3.5-Turbo | 4,096 tokens |
| GPT-3.5-Turbo-16k | 16,384 tokens |
| GPT-4 | 8,192 tokens |
| GPT-4-32k | 32,768 tokens |
| Claude 2 | 100,000 tokens |
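These limits determine how much room is left for the model's reply once the prompt is in place. A minimal sketch (the lookup table simply restates the figures above; adjust it to the models you actually use):

// Context limits from the table above, in tokens
const MODEL_LIMITS = {
  'gpt-3.5-turbo': 4096,
  'gpt-3.5-turbo-16k': 16384,
  'gpt-4': 8192,
  'gpt-4-32k': 32768,
  'claude-2': 100000
};

// Input and output share the window, so the output budget
// is whatever the prompt leaves over.
function maxOutputTokens(model, promptTokens) {
  return Math.max(0, MODEL_LIMITS[model] - promptTokens);
}

console.log(maxOutputTokens('gpt-4', 6500)); // 1692 tokens left for the reply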
Token Estimation Reference
1 token ≈ 0.75 English words
1 token ≈ 0.5 Chinese characters
1,000 Chinese characters ≈ 2,000 tokens
1 A4 page of text ≈ 500-800 tokens
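Several of the examples below call an estimateTokens helper. A minimal sketch based on the rough ratios above — a heuristic only; use a real tokenizer such as tiktoken when accuracy matters:

// Rough estimate: CJK characters count as ~2 tokens each,
// other text as ~1 token per 4 characters (≈0.75 words per token).
function estimateTokens(text) {
  const cjkChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const otherChars = text.length - cjkChars;
  return Math.ceil(cjkChars * 2 + otherChars / 4);
}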
Context Management Strategies
1. Sliding Window Strategy
Keep only the most recent N rounds of conversation, automatically discarding old conversation history.
class ConversationManager {
constructor(maxRounds = 5) {
this.maxRounds = maxRounds;
this.messages = [];
}
addMessage(message) {
this.messages.push(message);
// Keep system message + recent N rounds of conversation
const systemMsg = this.messages.filter(m => m.role === 'system');
const conversation = this.messages.filter(m => m.role !== 'system');
// Each round includes user message and assistant reply
const recentRounds = conversation.slice(-this.maxRounds * 2);
this.messages = [...systemMsg, ...recentRounds];
}
getContext() {
return this.messages;
}
}
2. Summary Compression Strategy
Compress historical conversations into summaries, preserving key information.
async function compressHistory(messages) {
// Compress when history messages exceed threshold
if (calculateTokens(messages) > 2000) {
const historyText = messages
.slice(0, -4) // Keep last 2 rounds
.map(m => `${m.role}: ${m.content}`)
.join('\n');
// Use AI to generate summary
const summary = await generateSummary(historyText);
// Return compressed context
return [
{ role: 'system', content: `Conversation history summary: ${summary}` },
...messages.slice(-4) // Last 2 rounds of detailed conversation
];
}
return messages;
}
Advantage: Retains a longer conversation history while keeping token usage under control.
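The generateSummary and calculateTokens helpers used above are not defined in this guide. One possible generateSummary sketch using the OpenAI Chat Completions API — the model name and prompt wording here are assumptions, so adapt them to whatever client the rest of your code uses:

// Hedged sketch: summarize the trimmed-off history with a cheap model.
// Assumes an OPENAI_API_KEY environment variable.
async function generateSummary(historyText) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-3.5-turbo',
      max_tokens: 200,
      messages: [{
        role: 'user',
        content: `Summarize this conversation in under 100 words, keeping decisions and open questions:\n${historyText}`
      }]
    })
  });
  const data = await response.json();
  return data.choices[0].message.content;
}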
3. Topic Segmentation Strategy
Identify conversation topic changes, keeping only context relevant to the current topic.
class TopicAwareContext {
constructor() {
this.currentTopic = null;
this.topicHistory = new Map();
}
async detectTopicChange(newMessage) {
// Use embedding vectors to detect topic changes
const embedding = await getEmbedding(newMessage);
const similarity = this.currentTopic
? cosineSimilarity(embedding, this.currentTopic.embedding)
: 0;
if (similarity < 0.7) {
// Topic changed, create new context
this.currentTopic = {
id: generateId(),
embedding: embedding,
messages: []
};
return true;
}
return false;
}
addMessage(message) {
if (this.currentTopic) {
this.currentTopic.messages.push(message);
// Limit messages per topic
if (this.currentTopic.messages.length > 10) {
this.currentTopic.messages =
this.currentTopic.messages.slice(-10);
}
}
}
}
Long Text Processing Techniques
Text Chunking
Smart Chunking Algorithm
function smartChunking(text, maxTokens = 2000) {
const chunks = [];
  const sentences = text.split(/[。！？.!?]+/); // split on Chinese and English sentence punctuation
let currentChunk = '';
let currentTokens = 0;
for (const sentence of sentences) {
const sentenceTokens = estimateTokens(sentence);
if (currentTokens + sentenceTokens > maxTokens) {
// Current chunk is full, save and start new chunk
if (currentChunk) {
chunks.push({
content: currentChunk,
tokens: currentTokens,
overlap: chunks.length > 0 ? 100 : 0 // Overlapping tokens
});
}
// Add context overlap
const previousEnd = currentChunk.slice(-200);
currentChunk = previousEnd + sentence;
currentTokens = estimateTokens(currentChunk);
} else {
currentChunk += sentence + '. ';
currentTokens += sentenceTokens;
}
}
// Add the last chunk
if (currentChunk) {
chunks.push({
content: currentChunk,
tokens: currentTokens
});
}
return chunks;
}
// Process multiple chunks in parallel
async function processLongText(text) {
const chunks = smartChunking(text);
// Process all chunks in parallel
const results = await Promise.all(
chunks.map(chunk =>
processChunk(chunk.content)
)
);
// Merge results
return mergeResults(results);
}
Overlapping Window Technique
Keep overlapping parts between chunks to ensure context continuity:
Chunk 1 → [overlap] → Chunk 2 → [overlap] → Chunk 3
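A simpler alternative to sentence-based chunking is a fixed-size window with an explicit character overlap. A minimal sketch (the 3,000-character chunk size and 200-character overlap are arbitrary defaults, not recommendations):

// Fixed-size chunks that share `overlap` characters with their neighbour,
// so information near a boundary appears in both chunks.
function chunkWithOverlap(text, chunkSize = 3000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // step back so consecutive chunks overlap
  }
  return chunks;
}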
Optimization Tips
Token Counting Optimization
- Pre-calculate token counts
- Use the tiktoken library for accurate counts
- Cache calculation results
- Set a token budget (see the sketch below)
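A minimal sketch combining these tips — exact counting with the tiktoken package, a cache keyed by content, and a fixed budget. It assumes the WASM tiktoken npm package; js-tiktoken or gpt-tokenizer work similarly with slightly different APIs:

import { get_encoding } from 'tiktoken';

const enc = get_encoding('cl100k_base');
const tokenCache = new Map(); // text -> token count

// Count each unique string once, then serve repeats from the cache.
function countTokensCached(text) {
  if (!tokenCache.has(text)) {
    tokenCache.set(text, enc.encode(text).length);
  }
  return tokenCache.get(text);
}

// Check a message list against a hard token budget.
function withinBudget(messages, budget = 3000) {
  const total = messages.reduce((sum, m) => sum + countTokensCached(m.content), 0);
  return total <= budget;
}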
Smart Trimming
- Remove redundant information
- Compress duplicate content (see the sketch below)
- Simplify system prompts
- Dynamically adjust the level of detail
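A small sketch of the first two ideas — collapsing whitespace and dropping messages that exactly repeat the previous one. What counts as redundant is application-specific, so treat this as a starting point:

// Normalize whitespace and drop consecutive duplicates from the same role.
function trimMessages(messages) {
  const result = [];
  for (const m of messages) {
    const content = m.content.replace(/\s+/g, ' ').trim();
    const prev = result[result.length - 1];
    if (prev && prev.role === m.role && prev.content === content) continue;
    result.push({ ...m, content });
  }
  return result;
}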
External Storage
- Use vector databases
- Implement a RAG architecture
- Load context on demand (see the sketch below)
- Retrieve from a knowledge base
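This is the basic RAG pattern: retrieve only the passages relevant to the current question and inject them as context instead of carrying everything in the history. In the sketch below, vectorStore.similaritySearch and getEmbedding are placeholders for whatever database and embedding client you use, not a specific library's API:

// Build a compact context from the top-k relevant knowledge-base passages.
async function buildRagContext(userQuestion, vectorStore) {
  const queryEmbedding = await getEmbedding(userQuestion); // placeholder embedding client
  const passages = await vectorStore.similaritySearch(queryEmbedding, 3); // placeholder store API
  return [
    { role: 'system', content: `Relevant reference material:\n${passages.join('\n---\n')}` },
    { role: 'user', content: userQuestion }
  ];
}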
Dynamic Compression
- Sort by importance (see the sketch below)
- Preserve key information
- Use an adaptive compression rate
- Summarize progressively
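One way to combine importance sorting with a token budget: score each message, keep the highest-scoring ones that fit, then restore chronological order. The scoring rule below (system messages first, then pinned messages, then recency) is a deliberately simple assumption:

// Keep the most important messages that fit inside the budget.
function compressByImportance(messages, budget, countTokens) {
  const scored = messages.map((m, i) => ({
    message: m,
    index: i,
    score: (m.role === 'system' ? 1000 : 0) + (m.pinned ? 500 : 0) + i
  }));
  scored.sort((a, b) => b.score - a.score);
  const kept = [];
  let used = 0;
  for (const item of scored) {
    const cost = countTokens(item.message.content);
    if (used + cost <= budget) {
      kept.push(item);
      used += cost;
    }
  }
  // Restore the original conversation order.
  return kept.sort((a, b) => a.index - b.index).map(item => item.message);
}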
Complete Practical Example
import { get_encoding } from 'tiktoken';
class AdvancedContextManager {
constructor(options = {}) {
this.maxTokens = options.maxTokens || 4000;
this.maxOutputTokens = options.maxOutputTokens || 1000;
this.compressionThreshold = options.compressionThreshold || 0.7;
this.messages = [];
this.summaries = [];
}
  // Count tokens with tiktoken (serializing the messages approximates the chat format)
  countTokens(messages) {
    const encoding = get_encoding('cl100k_base');
    const count = encoding.encode(JSON.stringify(messages)).length;
    encoding.free(); // release the WASM encoder
    return count;
  }
// Add message and automatically manage context
async addMessage(message) {
this.messages.push(message);
const currentTokens = this.countTokens(this.messages);
const availableTokens = this.maxTokens - this.maxOutputTokens;
if (currentTokens > availableTokens * this.compressionThreshold) {
await this.compressContext();
}
}
// Compress context
async compressContext() {
// Separate system messages and conversation
const systemMessages = this.messages.filter(m => m.role === 'system');
const conversation = this.messages.filter(m => m.role !== 'system');
// Choose compression strategy
if (conversation.length > 10) {
// Strategy 1: Generate summary
const oldMessages = conversation.slice(0, -6);
const summary = await this.generateSummary(oldMessages);
this.summaries.push({
timestamp: Date.now(),
summary: summary,
messageCount: oldMessages.length
});
// Reconstruct message list
this.messages = [
...systemMessages,
{
role: 'system',
content: `Previous conversation summary: ${summary}`
},
...conversation.slice(-6)
];
} else {
// Strategy 2: Simple truncation
this.messages = [
...systemMessages,
...conversation.slice(-6)
];
}
}
// Generate summary
async generateSummary(messages) {
const prompt = `Summarize the following conversation in 100 words:
${messages.map(m => `${m.role}: ${m.content}`).join('\n')}`;
// Call AI to generate summary
const response = await callAPI({
messages: [{ role: 'user', content: prompt }],
max_tokens: 200
});
return response.content;
}
// Get optimized context
getOptimizedContext() {
const tokens = this.countTokens(this.messages);
const budget = this.maxTokens - this.maxOutputTokens;
if (tokens <= budget) {
return this.messages;
}
// Need further trimming
return this.trimToFit(this.messages, budget);
}
// Trim to specified size
trimToFit(messages, targetTokens) {
let trimmed = [...messages];
// Priority: system > recent > old
while (this.countTokens(trimmed) > targetTokens && trimmed.length > 2) {
// Remove oldest non-system message
const nonSystemIndex = trimmed.findIndex(m => m.role !== 'system');
if (nonSystemIndex > -1 && nonSystemIndex < trimmed.length - 2) {
trimmed.splice(nonSystemIndex, 1);
} else {
break;
}
}
return trimmed;
}
}
// Usage example
const contextManager = new AdvancedContextManager({
maxTokens: 4000,
maxOutputTokens: 1000,
compressionThreshold: 0.7
});
// Handle conversation
async function handleConversation(userInput) {
await contextManager.addMessage({ role: 'user', content: userInput });
const optimizedContext = contextManager.getOptimizedContext();
const response = await callAPI({
messages: optimizedContext,
max_tokens: contextManager.maxOutputTokens
});
await contextManager.addMessage(response.message);
return response.content;
}
Important Notes
- Over-compression can lose important information
- Token counting adds some computational overhead
- Different models use different tokenizers, so token counts vary between them
- Preserving the necessary context is essential for conversation quality
- Consider the user experience and avoid abrupt context switches