LLM Fine-tuning: Make AI Understand Your Business
Fine-tuning adapts a pre-trained LLM to a specific domain or task. By continuing training on your own data, you can significantly improve performance on specialized tasks while largely preserving the model's general capabilities.
Fine-tuning Methods Comparison
| Method | Params Trained | VRAM Requirement | Training Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Full-parameter FT | 100% | Very high | Slow | Best | Ample resources |
| LoRA | 0.1-1% | Low | Fast | Great | General recommendation |
| QLoRA | 0.1-1% | Very low | Medium | Good | Resource-constrained |
| Prefix Tuning | 0.01% | Very low | Very fast | Moderate | Simple tasks |
| Adapter | 1-5% | Medium | Fast | Good | Multi-task |
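LoRA and QLoRA are walked through in detail below. As a quick sketch of one of the other rows, here is what Prefix Tuning looks like with the same peft library; num_virtual_tokens=20 is an arbitrary choice for illustration, not a value from the table.
# Prefix Tuning sketch with peft (hypothetical configuration values)
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20   # learned prefix length prepended to every layer's keys/values
)
prefix_model = get_peft_model(base, prefix_config)
prefix_model.print_trainable_parameters()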
LoRA Fine-tuning in Practice
Most popular parameter-efficient method
# LoRA fine-tuning example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token; reuse EOS so padding works
# LoRA configuration
lora_config = LoraConfig(
    r=16,                        # LoRA rank
    lora_alpha=32,               # LoRA scaling factor
    target_modules=[             # Target modules (attention projections)
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ],
    lora_dropout=0.1,            # Dropout rate
    bias="none",                 # Bias handling
    task_type=TaskType.CAUSAL_LM
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# With r=16 on q/k/v/o_proj this prints roughly:
# trainable params: 16,777,216 || all params: ~6.76B || trainable%: ~0.25%
# Prepare training data
def prepare_dataset(examples):
    # Build instruction format
    texts = []
    for instruction, output in zip(examples['instruction'], examples['output']):
        text = f"""### Instruction:
{instruction}
### Response:
{output}"""
        texts.append(text)
    # Tokenize
    encodings = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors='pt'
    )
    return encodings
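# The Trainer below expects train_dataset / eval_dataset, which this article never builds.
# A hypothetical sketch, assuming the instruction data sits in a local "train_data.json"
# in the format shown in the data-preparation section (file name and 90/10 split are assumptions):
from datasets import load_dataset

raw = load_dataset("json", data_files="train_data.json", split="train")
splits = raw.train_test_split(test_size=0.1)
train_dataset = splits["train"].map(prepare_dataset, batched=True, remove_columns=splits["train"].column_names)
eval_dataset = splits["test"].map(prepare_dataset, batched=True, remove_columns=splits["test"].column_names)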
# Training configuration
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./lora_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch"
)
# Start training
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    # Copies input_ids into labels so the causal-LM loss can be computed
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
# Save LoRA weights
model.save_pretrained("./lora_weights")
# Load for inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./lora_weights")
model = model.merge_and_unload()  # Merge LoRA weights into the base model
QLoRA: 4-bit Fine-tuning
Fine-tune LLMs on consumer GPUs
# QLoRA fine-tuning — very low VRAM usage
from transformers import BitsAndBytesConfig
import torch
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # 13B model
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
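# A common extra step at this point (provided by peft, not shown in the original):
# prepare_model_for_kbit_training() readies the quantized model for training, e.g. it
# enables gradient checkpointing and casts norm layers to fp32 for numerical stability.
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)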
# QLoRA configuration
qlora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
# Apply QLoRA
model = get_peft_model(model, qlora_config)
# VRAM comparison (model weights only):
# Full-precision (FP16) 13B weights: ~26GB
# 4-bit (NF4) 13B weights: ~7GB; the LoRA adapters and their gradients add only a little on top
# Optionally use DeepSpeed for further optimization
deepspeed_config = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-4,
            "warmup_num_steps": 100
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8
    }
}
# Pass the dict to the Trainer via TrainingArguments(deepspeed=deepspeed_config, ...)
Data Preparation Best Practices
High-quality data is critical
Data format example
[
  {
    "instruction": "Translate the following text from Chinese into English",
    "input": "机器学习是人工智能的一个分支",
    "output": "Machine learning is a branch of artificial intelligence"
  },
  {
    "instruction": "Summarize the key points of the following text",
    "input": "Large language models are pre-trained on massive text corpora...",
    "output": "Key points: 1. Pretraining 2. Data scale 3. Application capabilities"
  },
  {
    "instruction": "Write a Python function implementing quicksort",
    "input": "",
    "output": "def quicksort(arr):\n    if len(arr) <= 1:..."
  }
]
Data quality requirements
- ✅ Accuracy: Ensure labels are correct
- ✅ Diversity: Cover scenarios and edge cases
- ✅ Consistency: Maintain uniform formatting and style
- ✅ Representativeness: Reflect real usage scenarios
- ✅ Right scale: Typically 1k–10k high-quality samples
Data cleaning flow
def clean_dataset(data):
    cleaned = []
    seen_texts = set()  # track duplicates
    for item in data:
        # Remove empty items
        if not item['instruction'] or not item['output']:
            continue
        # Length filter
        if len(item['output']) < 10 or len(item['output']) > 2000:
            continue
        # Deduplicate
        text = item['instruction'] + item['output']
        if text in seen_texts:
            continue
        seen_texts.add(text)
        # Normalize format
        item['instruction'] = item['instruction'].strip()
        item['output'] = item['output'].strip()
        cleaned.append(item)
    return cleaned
Evaluating Fine-tuned Models
How to evaluate fine-tuned model performance
Automated metrics (a perplexity sketch follows this list)
- Perplexity: Language model uncertainty on held-out text (lower is better)
- BLEU: Translation quality
- ROUGE: Summarization quality
- Accuracy: Classification accuracy
- F1 Score: Harmonic mean of precision and recall
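A minimal sketch of the first metric, assuming the fine-tuned model and tokenizer from the sections above are already loaded:
# Perplexity on a held-out text sample (lower is better)
import math
import torch

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy per token
    return math.exp(loss.item())

print(perplexity(model, tokenizer, "Machine learning is a branch of artificial intelligence."))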
Human evaluation
- Relevance: Is the answer on-topic?
- Correctness: Is the information accurate?
- Fluency: Is the text natural?
- Completeness: Is the answer comprehensive?
- Consistency: Is style consistent?
A/B Test Example
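The comparison below calls two helpers, generate and calculate_score, that the article never defines. One hedged way to fill them in, assuming greedy decoding and ROUGE-L as the score (the evaluate and rouge_score packages are assumed dependencies):
# Hypothetical helpers for the A/B comparison below
import evaluate

rouge = evaluate.load("rouge")

def generate(model, prompt, max_new_tokens=256):
    # Greedy decoding with the tokenizer loaded earlier
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def calculate_score(output, expected):
    # ROUGE-L F-measure against the reference answer (one possible scoring choice)
    return rouge.compute(predictions=[output], references=[expected])["rougeL"]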
# Compare base vs fine-tuned models
import numpy as np

def compare_models(base_model, finetuned_model, test_cases):
    results = []
    for case in test_cases:
        base_output = generate(base_model, case['input'])
        ft_output = generate(finetuned_model, case['input'])
        # Auto-scoring
        base_score = calculate_score(base_output, case['expected'])
        ft_score = calculate_score(ft_output, case['expected'])
        results.append({
            'input': case['input'],
            'base_output': base_output,
            'ft_output': ft_output,
            'base_score': base_score,
            'ft_score': ft_score,
            'improvement': ft_score - base_score
        })
    # Aggregate stats
    avg_improvement = np.mean([r['improvement'] for r in results])
    win_rate = sum(1 for r in results if r['ft_score'] > r['base_score']) / len(results)
    print(f"Average improvement: {avg_improvement:.2%}")
    print(f"Win rate: {win_rate:.2%}")
Serving and Deployment
Bring fine-tuned models to production
# Deploy a fine-tuned model with vLLM
from vllm import LLM, SamplingParams
# Load merged model
llm = LLM(
    model="./merged_model",
    tensor_parallel_size=1,
    dtype="half",        # FP16 inference
    max_model_len=2048
)
# API service
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9
@app.post("/generate")
async def generate(request: GenerationRequest):
    sampling_params = SamplingParams(
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens
    )
    outputs = llm.generate([request.prompt], sampling_params)
    return {
        "generated_text": outputs[0].outputs[0].text,
        "usage": {
            "prompt_tokens": len(outputs[0].prompt_token_ids),
            "completion_tokens": len(outputs[0].outputs[0].token_ids)
        }
    }
# Run the service
# uvicorn main:app --host 0.0.0.0 --port 8000
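# Example client call against the /generate endpoint defined above
# (assumes the service is running locally on port 8000):
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "### Instruction:\nSummarize this article in one sentence.\n### Response:\n",
          "max_tokens": 128},
)
print(resp.json()["generated_text"])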
# Docker deployment
dockerfile = """
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
RUN pip install vllm fastapi uvicorn
COPY ./merged_model /model
COPY ./main.py /app/main.py
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
"""Fine-tuning Cost Estimates
Cost by Model Size
| Model size | GPU requirement | Training time | Estimated cloud cost (USD) |
|---|---|---|---|
| 7B (LoRA) | 1 × A100 40GB | 2–4 hours | $10–20 |
| 13B (QLoRA) | 1 × A100 40GB | 4–8 hours | $20–40 |
| 30B (LoRA) | 2 × A100 80GB | 8–16 hours | $80–160 |
| 70B (QLoRA) | 4 × A100 80GB | 24–48 hours | $500–1000 |
* Estimates assume roughly 10k training samples and 3 epochs; actual prices vary by cloud provider and sequence length.
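As a rough sanity check on the first row, cost is approximately GPU count × GPU-hours × hourly rate; the ~$4/hour on-demand A100 40GB price below is an assumption that varies widely by provider.
# Back-of-the-envelope check for the 7B LoRA row (the hourly rate is an assumption)
gpus = 1
hourly_rate_usd = 4.0           # assumed on-demand A100 40GB price; varies by provider
hours_low, hours_high = 2, 4
print(f"${gpus * hourly_rate_usd * hours_low:.0f}-${gpus * hourly_rate_usd * hours_high:.0f}")
# -> $8-$16, roughly in line with the $10-20 estimate in the table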
Start Fine-tuning Your Own Model
With fine-tuning, you can make an LLM genuinely understand and handle your domain-specific tasks. Combined with an LLM API, you can deploy and start serving your fine-tuned model quickly.