Context Management Guide

Master context window management techniques to optimize token usage and improve the quality of Large Language Model conversations.

Token Optimization · Performance Boost · Cost Control

Understanding Context Window

What is Context Window?

The context window is the maximum number of tokens a Large Language Model can process at once, counting both input and output.

Context Window = Input Tokens + Output Tokens
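As a quick worked example of this budget arithmetic (the 3,000-token prompt size here is an arbitrary assumption), whatever the input does not use is all that remains for the reply:

// A minimal sketch of the budget arithmetic above
const contextLimit = 4096;    // e.g. GPT-3.5-Turbo
const inputTokens = 3000;     // assumed size of the prompt
const outputBudget = contextLimit - inputTokens; // 1,096 tokens left for the reply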

Context Limits for Popular Models

  • GPT-3.5-Turbo: 4,096 tokens
  • GPT-3.5-Turbo-16k: 16,384 tokens
  • GPT-4: 8,192 tokens
  • GPT-4-32k: 32,768 tokens
  • Claude 2: 100,000 tokens

Token Estimation Reference

  • 1 token ≈ 0.75 English words
  • 1 token ≈ 0.5 Chinese characters
  • 1,000 Chinese characters ≈ 2,000 tokens
  • 1 A4 page of text ≈ 500-800 tokens
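The strategy code later in this guide relies on an estimateTokens helper. Here is a minimal sketch based purely on the ratios above; it is a rough heuristic, not a real tokenizer:

// Rough token estimate based on the ratios above (heuristic only;
// use a real tokenizer such as tiktoken when accuracy matters)
function estimateTokens(text) {
  const chineseChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const otherChars = text.length - chineseChars;
  // ≈2 tokens per Chinese character, ≈4 characters per token otherwise
  return Math.ceil(chineseChars * 2 + otherChars / 4);
}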

Context Management Strategies

1. Sliding Window Strategy

Keep only the most recent N rounds of conversation, automatically discarding old conversation history.

class ConversationManager {
  constructor(maxRounds = 5) {
    this.maxRounds = maxRounds;
    this.messages = [];
  }

  addMessage(message) {
    this.messages.push(message);

    // Keep the system message plus the most recent N rounds of conversation
    const systemMsg = this.messages.filter(m => m.role === 'system');
    const conversation = this.messages.filter(m => m.role !== 'system');

    // Each round consists of a user message and an assistant reply
    const recentRounds = conversation.slice(-this.maxRounds * 2);
    this.messages = [...systemMsg, ...recentRounds];
  }

  getContext() {
    return this.messages;
  }
}
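A brief usage sketch; callChatAPI is a hypothetical placeholder for whatever chat-completion client you actually use:

// Hypothetical usage of the sliding window manager
const manager = new ConversationManager(5);
manager.addMessage({ role: 'system', content: 'You are a helpful assistant.' });

async function chat(userInput) {
  manager.addMessage({ role: 'user', content: userInput });
  // Only the retained window is sent to the model
  const reply = await callChatAPI({ messages: manager.getContext() });
  manager.addMessage(reply.message);
  return reply.content;
}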

2. Summary Compression Strategy

Compress historical conversations into summaries, preserving key information.

async function compressHistory(messages) {
  // Compress once the history exceeds the token threshold
  if (calculateTokens(messages) > 2000) {
    const historyText = messages
      .slice(0, -4) // everything except the last 2 rounds
      .map(m => `${m.role}: ${m.content}`)
      .join('\n');

    // Use the model to generate a summary of the older history
    const summary = await generateSummary(historyText);

    // Return the compressed context
    return [
      { role: 'system', content: `Conversation history summary: ${summary}` },
      ...messages.slice(-4) // last 2 rounds kept verbatim
    ];
  }
  return messages;
}

Advantage: retains a longer conversation history while keeping token usage under control.
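The snippet above assumes calculateTokens and generateSummary helpers. A minimal sketch of the latter might look like this, with callAPI standing in for your chat-completion client (as in the complete example further down):

// Hypothetical helper assumed by compressHistory above
async function generateSummary(historyText) {
  const response = await callAPI({
    messages: [{
      role: 'user',
      content: `Summarize the following conversation in about 100 words, keeping names, decisions, and open questions:\n${historyText}`
    }],
    max_tokens: 200
  });
  return response.content;
}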

3. Topic Segmentation Strategy

Identify conversation topic changes, keeping only context relevant to the current topic.

class TopicAwareContext {
  constructor() {
    this.currentTopic = null;
    this.topicHistory = new Map();
  }

  async detectTopicChange(newMessage) {
    // Use embedding vectors to detect topic changes
    const embedding = await getEmbedding(newMessage);
    const similarity = this.currentTopic
      ? cosineSimilarity(embedding, this.currentTopic.embedding)
      : 0;

    if (similarity < 0.7) {
      // Topic changed: start a new context
      this.currentTopic = {
        id: generateId(),
        embedding: embedding,
        messages: []
      };
      return true;
    }
    return false;
  }

  addMessage(message) {
    if (this.currentTopic) {
      this.currentTopic.messages.push(message);
      // Limit the number of messages kept per topic
      if (this.currentTopic.messages.length > 10) {
        this.currentTopic.messages = this.currentTopic.messages.slice(-10);
      }
    }
  }
}
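The class above assumes getEmbedding, generateId, and cosineSimilarity helpers. The first two depend on your embedding provider and ID scheme, but cosine similarity itself is straightforward to sketch:

// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}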

Long Text Processing Techniques

Text Chunking

Smart Chunking Algorithm

function smartChunking(text, maxTokens = 2000) {
  const chunks = [];
  const sentences = text.split(/[。！？.!?]+/);
  let currentChunk = '';
  let currentTokens = 0;

  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);

    if (currentTokens + sentenceTokens > maxTokens) {
      // Current chunk is full: save it and start a new one
      if (currentChunk) {
        chunks.push({
          content: currentChunk,
          tokens: currentTokens,
          overlap: chunks.length > 0 ? 100 : 0 // overlapping tokens
        });
      }

      // Carry the tail of the previous chunk over as context overlap
      const previousEnd = currentChunk.slice(-200);
      currentChunk = previousEnd + sentence;
      currentTokens = estimateTokens(currentChunk);
    } else {
      currentChunk += sentence + '. ';
      currentTokens += sentenceTokens;
    }
  }

  // Add the last chunk
  if (currentChunk) {
    chunks.push({ content: currentChunk, tokens: currentTokens });
  }

  return chunks;
}

// Process multiple chunks in parallel
async function processLongText(text) {
  const chunks = smartChunking(text);

  // Process all chunks concurrently (processChunk and mergeResults are application-specific)
  const results = await Promise.all(
    chunks.map(chunk => processChunk(chunk.content))
  );

  // Merge the per-chunk results
  return mergeResults(results);
}

Overlapping Window Technique

Keep overlapping parts between chunks to ensure context continuity:

Chunk 1 → [overlap] → Chunk 2 → [overlap] → Chunk 3
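A simpler illustration of the idea using fixed-size character windows (the 1,000/200 sizes are arbitrary assumptions):

// Fixed-size chunking with overlap: each chunk starts before the previous one
// ends, so the shared tail preserves context across chunk boundaries
function chunkWithOverlap(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step forward, leaving an overlapping tail
  }
  return chunks;
}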

Optimization Tips

Token Counting Optimization

  • Pre-calculate token counts
  • Use the tiktoken library for accurate counting
  • Cache counting results (see the sketch after this list)
  • Set a token budget
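A minimal sketch of cached token counting. It assumes the npm tiktoken (WASM) package, whose get_encoding function returns an encoder; adjust the import if you use a different tokenizer library:

import { get_encoding } from 'tiktoken'; // assumes the npm `tiktoken` (WASM) package

const encoder = get_encoding('cl100k_base');
const tokenCache = new Map();

// Count tokens once per distinct string and reuse the result
function countTokensCached(text) {
  if (!tokenCache.has(text)) {
    tokenCache.set(text, encoder.encode(text).length);
  }
  return tokenCache.get(text);
}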

Smart Trimming

  • Remove redundant information
  • Compress duplicate content (see the sketch after this list)
  • Simplify system prompts
  • Dynamically adjust the level of detail
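One possible trimming pass illustrating the first two points; a sketch, not a complete policy:

// Collapse whitespace and drop messages whose content is an exact duplicate
function trimMessages(messages) {
  const seen = new Set();
  return messages
    .map(m => ({ ...m, content: m.content.replace(/\s+/g, ' ').trim() }))
    .filter(m => {
      const key = `${m.role}:${m.content}`;
      if (seen.has(key)) return false; // duplicate content, drop it
      seen.add(key);
      return true;
    });
}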

External Storage

  • Use vector databases
  • Implement a RAG architecture
  • Load context on demand (see the sketch after this list)
  • Knowledge base retrieval
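A hedged sketch of on-demand retrieval; embed and vectorStore.search are hypothetical placeholders for your embedding call and vector database client:

// Hypothetical RAG-style retrieval: fetch only the chunks relevant to the query
// and inject them as context instead of keeping everything in the prompt
async function buildRagContext(query, vectorStore, topK = 3) {
  const queryEmbedding = await embed(query);                    // placeholder embedding call
  const hits = await vectorStore.search(queryEmbedding, topK);  // placeholder vector DB call
  const knowledge = hits.map(h => h.text).join('\n---\n');

  return [
    { role: 'system', content: `Relevant knowledge:\n${knowledge}` },
    { role: 'user', content: query }
  ];
}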

Dynamic Compression

  • Sort by importance (see the sketch after this list)
  • Preserve key information
  • Adaptive compression rate
  • Progressive summarization
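A sketch of importance-based selection: score each message, then keep the highest-scoring ones that still fit the token budget. The scoring rule here (system messages first, then recency) and the reuse of the estimateTokens sketch from earlier are assumptions:

function selectByImportance(messages, budget) {
  const scored = messages.map((m, i) => ({
    message: m,
    // Assumed scoring: system messages outrank everything, then newer beats older
    score: (m.role === 'system' ? 1000 : 0) + i
  }));

  scored.sort((a, b) => b.score - a.score);

  const kept = [];
  let used = 0;
  for (const { message } of scored) {
    const cost = estimateTokens(message.content);
    if (used + cost > budget) continue; // skip anything that would blow the budget
    kept.push(message);
    used += cost;
  }

  // Restore the original conversation order
  return messages.filter(m => kept.includes(m));
}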

Complete Practical Example

import { get_encoding } from 'tiktoken'; // the npm `tiktoken` (WASM) package exposes get_encoding

const encoder = get_encoding('cl100k_base');

class AdvancedContextManager {
  constructor(options = {}) {
    this.maxTokens = options.maxTokens || 4000;
    this.maxOutputTokens = options.maxOutputTokens || 1000;
    this.compressionThreshold = options.compressionThreshold || 0.7;
    this.messages = [];
    this.summaries = [];
  }
  
  // Count tokens accurately (serializing the messages is a rough stand-in
  // for the real chat-format overhead)
  countTokens(messages) {
    return encoder.encode(JSON.stringify(messages)).length;
  }
  
  // Add message and automatically manage context
  async addMessage(message) {
    this.messages.push(message);
    
    const currentTokens = this.countTokens(this.messages);
    const availableTokens = this.maxTokens - this.maxOutputTokens;
    
    if (currentTokens > availableTokens * this.compressionThreshold) {
      await this.compressContext();
    }
  }
  
  // Compress context
  async compressContext() {
    // Separate system messages and conversation
    const systemMessages = this.messages.filter(m => m.role === 'system');
    const conversation = this.messages.filter(m => m.role !== 'system');
    
    // Choose compression strategy
    if (conversation.length > 10) {
      // Strategy 1: Generate summary
      const oldMessages = conversation.slice(0, -6);
      const summary = await this.generateSummary(oldMessages);
      
      this.summaries.push({
        timestamp: Date.now(),
        summary: summary,
        messageCount: oldMessages.length
      });
      
      // Reconstruct message list
      this.messages = [
        ...systemMessages,
        { 
          role: 'system', 
          content: `Previous conversation summary: ${summary}` 
        },
        ...conversation.slice(-6)
      ];
    } else {
      // Strategy 2: Simple truncation
      this.messages = [
        ...systemMessages,
        ...conversation.slice(-6)
      ];
    }
  }
  
  // Generate summary
  async generateSummary(messages) {
    const prompt = `Summarize the following conversation in 100 words:
    ${messages.map(m => `${m.role}: ${m.content}`).join('\n')}`;
    
    // Call AI to generate summary
    const response = await callAPI({
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 200
    });
    
    return response.content;
  }
  
  // Get optimized context
  getOptimizedContext() {
    const tokens = this.countTokens(this.messages);
    const budget = this.maxTokens - this.maxOutputTokens;
    
    if (tokens <= budget) {
      return this.messages;
    }
    
    // Need further trimming
    return this.trimToFit(this.messages, budget);
  }
  
  // Trim to specified size
  trimToFit(messages, targetTokens) {
    let trimmed = [...messages];
    
    // Priority: system > recent > old
    while (this.countTokens(trimmed) > targetTokens && trimmed.length > 2) {
      // Remove oldest non-system message
      const nonSystemIndex = trimmed.findIndex(m => m.role !== 'system');
      if (nonSystemIndex > -1 && nonSystemIndex < trimmed.length - 2) {
        trimmed.splice(nonSystemIndex, 1);
      } else {
        break;
      }
    }
    
    return trimmed;
  }
}

// Usage example
const contextManager = new AdvancedContextManager({
  maxTokens: 4000,
  maxOutputTokens: 1000,
  compressionThreshold: 0.7
});

// Handle conversation
async function handleConversation(userInput) {
  await contextManager.addMessage({ role: 'user', content: userInput });
  
  const optimizedContext = contextManager.getOptimizedContext();
  
  const response = await callAPI({
    messages: optimizedContext,
    max_tokens: contextManager.maxOutputTokens
  });
  
  await contextManager.addMessage(response.message);
  
  return response.content;
}

Important Notes

  • Over-compression may lose important information
  • Token counting adds computational overhead
  • Different models use different tokenizers, so counts are not directly comparable
  • Preserving the necessary context is critical for conversation quality
  • Consider the user experience and avoid abrupt context switches