Model Compression: Bringing LLMs Everywhere

With the compression techniques covered below, LLMs that occupy tens of gigabytes can be shrunk by an order of magnitude or more (in aggressive cases down to tens of megabytes) while retaining most of their accuracy, enabling AI to run on phones, edge devices, and even embedded systems.

Core Compression Techniques

🔢 Quantization

  • INT8/INT4 quantization
  • Mixed-precision quantization
  • Dynamic quantization
  • Quantization-aware training

✂️ Model Pruning

  • Structured pruning
  • Unstructured pruning
  • Dynamic sparsity
  • Channel pruning

📚 Knowledge Distillation

  • Teacher–student models
  • Self-distillation
  • Feature distillation
  • Relation distillation

🔄 Low-rank Factorization

  • Matrix factorization
  • LoRA adapters
  • SVD (see the sketch below)
  • Tucker decomposition
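
To make the low-rank idea concrete, here is a minimal sketch of factorizing a single linear layer with a truncated SVD. The helper name factorize_linear, the rank, and the layer dimensions are illustrative choices, not values from this article.

import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]         # (out_features, rank), singular values absorbed
    V_r = Vh[:rank, :]                   # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# A 4096x4096 projection (16.8M parameters) factorized at rank 256 keeps ~2.1M parameters.
compressed = factorize_linear(nn.Linear(4096, 4096), rank=256)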

Quantization in Detail

Performance Comparison: FP32 → INT4

import torch
import torch.nn as nn
from transformers import AutoModel

class QuantizationEngine:
    """Model quantization engine"""
    
    def quantize_model(self, model, quantization_config):
        """Perform model quantization"""
        
        # 1. Collect calibration data
        calibration_data = self.collect_calibration_data(model)
        
        # 2. Compute quantization parameters
        quant_params = self.calculate_quantization_params(
            model, 
            calibration_data,
            method=quantization_config['method']  # 'symmetric' or 'asymmetric'
        )
        
        # 3. Apply quantization
        if quantization_config['bits'] == 8:
            quantized_model = self.int8_quantization(model, quant_params)
        elif quantization_config['bits'] == 4:
            quantized_model = self.int4_quantization(model, quant_params)
        else:
            raise ValueError(f"Unsupported bit width: {quantization_config['bits']}")
        
        # 4. Quantization-aware fine-tuning (optional)
        if quantization_config.get('qat', False):
            quantized_model = self.quantization_aware_training(
                quantized_model,
                training_data=quantization_config['training_data']
            )
        
        return quantized_model
    
    def benchmark_quantization(self, original_model, quantized_model):
        """Performance benchmark test"""
        results = {
            'model_size': {
                'original': self.get_model_size(original_model),  # 13GB
                'quantized': self.get_model_size(quantized_model), # 3.2GB
                'compression_ratio': '4.1x'
            },
            'inference_speed': {
                'original': self.measure_latency(original_model),   # 245ms
                'quantized': self.measure_latency(quantized_model), # 62ms
                'speedup': '3.9x'
            },
            'accuracy': {
                'original': self.evaluate_accuracy(original_model),   # 92.5%
                'quantized': self.evaluate_accuracy(quantized_model), # 91.8%
                'degradation': '-0.7%'
            }
        }
        return results
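
The QuantizationEngine above is a skeleton. For a concrete, runnable starting point, PyTorch ships post-training dynamic quantization, which stores nn.Linear weights as INT8 and quantizes activations on the fly at inference time, so no calibration pass is needed. This is a minimal example assuming CPU inference; the checkpoint name is only a placeholder.

import torch
from transformers import AutoModel

# Load a full-precision model (any Hugging Face checkpoint works here).
model = AutoModel.from_pretrained("bert-base-uncased")

# Post-training dynamic quantization: Linear weights become INT8,
# activations are quantized dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Static INT8/INT4 flows and quantization-aware training additionally require the calibration and fine-tuning steps sketched in the engine above.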

Quantization Results Comparison

| Precision | Model Size | Inference Latency | Accuracy | Memory Usage |
|---|---|---|---|---|
| FP32 (original) | 13 GB | 245 ms | 92.5% | 16 GB |
| FP16 | 6.5 GB | 125 ms | 92.3% | 8 GB |
| INT8 | 3.2 GB | 62 ms | 91.8% | 4 GB |
| INT4 | 1.6 GB | 35 ms | 89.2% | 2 GB |
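
The sizes in this table follow almost directly from bytes per parameter. A quick back-of-the-envelope check, assuming roughly 3.25B parameters (which is what 13 GB at FP32 implies) and ignoring activations and other runtime overhead:

def estimate_size_gb(n_params: float, bits_per_param: int) -> float:
    """Rough weight-only size estimate: parameters x bits per parameter."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3.25e9  # implied by 13 GB at FP32 (4 bytes per parameter)
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {estimate_size_gb(n_params, bits):.2f} GB")
# 32-bit: 13.00 GB, 16-bit: 6.50 GB, 8-bit: 3.25 GB, 4-bit: 1.62 GB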

Knowledge Distillation

Teacher–Student Architecture

Distillation Process

🎓 Teacher Model
  • Parameters: 175B
  • Layers: 96
  • Hidden size: 12288
  • Accuracy: 95.2%

👶 Student Model
  • Parameters: 1.3B
  • Layers: 24
  • Hidden size: 2048
  • Target accuracy: > 90%

Distillation Strategies

1. Soft-label distillation: transfer the teacher's output probability distribution (a minimal loss sketch follows this list)
2. Feature distillation: match intermediate representations
3. Attention distillation: learn the teacher's attention patterns
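
A minimal sketch of the soft-label loss from strategy 1, combining a temperature-scaled KL term with the usual cross-entropy. The temperature and weighting values are common defaults, not settings taken from this article.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label distillation: blend a KL term against the teacher with the hard-label CE loss."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels keeps the student grounded.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# In the training loop: run the teacher under torch.no_grad() in eval mode,
# and optimize only the student's parameters.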

Model Pruning Techniques

Structured Pruning in Practice

Pruning Results Analysis

  • Attention head pruning: remove 30% of redundant heads
  • FFN layer pruning: compress 40% of neurons
  • Layer pruning: remove 6 redundant layers

  • Parameters reduced: 65%
  • Speedup: 2.8x
  • Accuracy drop: 2.1%
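
As an illustration of neuron-level structured pruning, PyTorch's pruning utilities can zero out whole rows of an FFN projection ranked by L2 norm. The toy dimensions below are assumptions; the 40% ratio mirrors the FFN figure above, and realizing the actual speedup requires a further repacking step that drops the zeroed rows and the matching input columns of the next layer.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy FFN block standing in for a transformer's feed-forward layers.
ffn = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Structured pruning: zero 40% of the rows (output neurons) of the first
# projection, ranked by L2 norm, then make the mask permanent.
prune.ln_structured(ffn[0], name="weight", amount=0.4, n=2, dim=0)
prune.remove(ffn[0], "weight")

zero_rows = (ffn[0].weight.abs().sum(dim=1) == 0).sum().item()
print(f"zeroed neurons: {zero_rows} / 4096")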

Edge Deployment Optimization

Mobile Deployment Solutions

Deployment Architecture Comparison

☁️ Cloud
  • Latency: 200–500ms
  • Cost: $0.01/request
  • Privacy: Data uploaded
  • Reliability: Network-dependent

🖥️ Edge server
  • Latency: 20–50ms
  • Cost: $0.001/request
  • Privacy: Local processing
  • Reliability: High availability

📱 On-device
  • Latency: < 10ms
  • Cost: $0
  • Privacy: Fully local
  • Reliability: Works offline

Real-world Case Studies

Mobile Keyboard AI

Compression Strategy

INT8 quantization + knowledge distillation + structured pruning

Results

  • Model size: 2GB → 45MB
  • Response time: < 5ms
  • Accuracy retained: 95%

In-vehicle Voice Assistant

Compression Strategy

Mixed-precision quantization + LoRA fine-tuning (a minimal LoRA sketch follows the results below)

Results

  • Deployment chip: Qualcomm 8155
  • Power consumption reduced: 70%
  • Offline recognition rate: 92%
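
The LoRA half of that recipe can be sketched in a few lines: the base weights stay frozen (and can remain quantized) while a small low-rank update is trained. The class below is an illustrative stand-in; in practice a library such as peft handles the wrapping.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # base weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen output plus the scaled low-rank correction (initially zero, since lora_b starts at 0).
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Wrapping a 4096x4096 projection at rank 8 adds only ~65K trainable parameters.
adapter = LoRALinear(nn.Linear(4096, 4096), rank=8)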

Choosing Compression Techniques

Best Practices by Scenario

📱 Mobile applications

Recommended: INT8 quantization + knowledge distillation; keep the model size under 100 MB

🚗 Embedded devices

Recommended: INT4 quantization + aggressive pruning to fit tight compute and memory budgets

🖥️ Edge servers

Recommended: FP16 + structured pruning to balance performance and accuracy
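
These guidelines can be captured as a starting-point configuration per deployment target. The structure below is purely illustrative, and the knobs would be tuned per model and hardware.

# Hypothetical recipe table mirroring the recommendations above.
COMPRESSION_RECIPES = {
    "mobile": {"quantization": "int8", "distillation": True,
               "pruning": None, "max_model_size_mb": 100},
    "embedded": {"quantization": "int4", "distillation": False,
                 "pruning": "structured-aggressive", "max_model_size_mb": None},
    "edge_server": {"quantization": "fp16", "distillation": False,
                    "pruning": "structured", "max_model_size_mb": None},
}

def pick_recipe(target: str) -> dict:
    """Look up the starting-point compression recipe for a deployment target."""
    return COMPRESSION_RECIPES[target]

print(pick_recipe("mobile"))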

Make LLMs Within Reach

Master model compression and deploy powerful AI to any device.
