Fine-Tuning
Fine-tuning is the process of further training a pre-trained AI model on your own dataset to adapt its behavior, knowledge, or style for a specific domain or task. It changes the model's internal weights, baking the specialization into the model itself rather than into the prompt.

How It Works
A base language model like GPT-4 or Claude knows a lot about the world but nothing about your specific business. Fine-tuning teaches the model your domain by training it on examples of the inputs and outputs you care about.
The process works like this: you prepare a dataset of example inputs and desired outputs, then run a training process that adjusts the model's weights. After fine-tuning, the model responds in ways that align with your examples. It might learn your company's terminology, follow a specific format, or handle domain-specific tasks more accurately.
Modern fine-tuning usually means parameter-efficient fine-tuning (PEFT) rather than updating all of a model's weights. LoRA (Low-Rank Adaptation) and QLoRA train a small number of additional parameters (typically 0.1-1% of the base model) while keeping the base weights frozen. This cuts training cost and memory by 10-100x compared to full fine-tuning, with competitive quality for most tasks. Beyond supervised fine-tuning, RLHF and DPO (Direct Preference Optimization) add a preference layer: the model is trained to prefer better outputs over worse ones using human or AI preference pairs.
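As a concrete sketch, attaching a LoRA adapter with Hugging Face's peft library looks roughly like this. The checkpoint name and hyperparameters below are illustrative placeholders, not recommendations:

```python
# Minimal LoRA setup with Hugging Face peft. Checkpoint and hyperparameters
# are illustrative placeholders; tune them for your task.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Because the base weights stay frozen, only the small adapter matrices are trained and saved, which is where the cost and memory savings come from.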
Fine-tuning is most useful when you need the model to adopt a specific style, consistently follow a complex format, or develop deep expertise in a narrow domain. Medical report generation, legal document drafting, code generation for a specific framework, and structured extraction tasks are common fits.
The main alternative is retrieval-augmented generation (RAG), which gives the model relevant context at runtime without changing the model itself. For most enterprise use cases, RAG is the better starting point: it's cheaper, faster to set up, and easier to update when your data changes.
Fine-tuning makes sense when RAG isn't enough. If the model needs to behave differently (not just know different things), fine-tuning is the right tool. Many production systems use both: a fine-tuned model for style and format, with RAG for current knowledge. When not to fine-tune: when your training set is under about 500 examples (prompt engineering will usually do better), when the task changes frequently (you'll re-train constantly), or when the base model already performs well with a good prompt.
In Practice
Fine-tuning infrastructure falls into two camps. Hosted services include OpenAI's fine-tuning API (supervised fine-tuning and DPO on GPT-4o mini and GPT-4.1 variants), Anthropic's fine-tuning on Claude Haiku via AWS Bedrock, and Google's Vertex AI tuning. Self-hosted options use Hugging Face's PEFT library plus accelerate or DeepSpeed, with LoRA adapters trained on a single A100 or H100 for 7-13B models and on multi-GPU nodes for 70B models.
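On the hosted side, starting a job is a couple of API calls. Here is a sketch against the OpenAI Python SDK; the model snapshot name is a placeholder, so check the current fine-tuning docs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file, then start a supervised fine-tuning job.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",   # snapshot name: check current docs
    hyperparameters={"n_epochs": 3},
)
print(job.id, job.status)  # poll fine_tuning.jobs.retrieve(job.id) until done
```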
Typical configuration: 1,000-10,000 training examples for supervised fine-tuning, 3 epochs, learning rate around 1e-4 for LoRA, rank (r) of 8-32, and alpha of 16-64. Dataset format is JSONL with instruction-response pairs or chat-style messages. Training time for a LoRA on a 7B model: roughly 2-6 hours on an H100 for 5,000 examples. Hosted fine-tuning pricing is typically a few dollars per million training tokens, plus a small premium on inference for the fine-tuned model.
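For reference, a minimal sketch of one chat-style JSONL record, written from Python. The field names follow the common OpenAI-style schema; other providers use similar but not identical shapes, so treat the exact shape as an assumption:

```python
import json

# One chat-style training record (OpenAI-style schema; check your
# provider's spec before committing to this shape).
example = {
    "messages": [
        {"role": "system", "content": "You answer in the company's house style."},
        {"role": "user", "content": "Summarize this support ticket: ..."},
        {"role": "assistant", "content": "Summary: ..."},
    ]
}

# JSONL is just one JSON object per line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```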
A working workflow. Curate a dataset of 2,000-5,000 high-quality examples labeled or verified by domain experts. Hold out 10% for evaluation. Train a LoRA adapter on a base model, track loss and eval metrics in Weights & Biases, and compare against a prompt-engineered baseline on the same eval set. Deploy only if the fine-tuned model beats the baseline by a meaningful margin (usually 5+ points on your domain metric). Keep the training pipeline reproducible so you can retrain on refreshed data every quarter.
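A minimal sketch of that deploy gate, with stub functions standing in for real inference calls (both stubs and the tiny eval set are hypothetical placeholders):

```python
# Stubs standing in for real inference calls: swap in your prompted
# baseline and your fine-tuned endpoint.
def prompted_baseline(text: str) -> str:
    return "label_a"

def finetuned_model(text: str) -> str:
    return "label_a"

def accuracy(model_fn, eval_set) -> float:
    """Exact-match accuracy in points (0-100) on a held-out set."""
    correct = sum(model_fn(ex["input"]) == ex["expected"] for ex in eval_set)
    return 100.0 * correct / len(eval_set)

held_out = [{"input": "example input", "expected": "label_a"}]  # your 10% split

baseline_acc = accuracy(prompted_baseline, held_out)
tuned_acc = accuracy(finetuned_model, held_out)
deploy = tuned_acc - baseline_acc >= 5.0  # the "meaningful margin" gate
```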
Worked Example
A medical coding vendor wants an LLM that reads physician notes and outputs ICD-10 and CPT codes. The base Claude Haiku with a good prompt hits 78% code accuracy on a labeled eval set of 500 notes, which isn't enough for billing-grade output. The team has 12,000 expert-labeled note-to-code pairs from the past two years of production.
They fine-tune Claude Haiku via AWS Bedrock using 10,000 pairs for training and 2,000 held out for eval. The training job runs for about 4 hours and costs around $180 in compute. The fine-tuned model hits 94% accuracy on the held-out set, with the biggest gains on specialty-specific codes (cardiology, orthopedics) where the base model had been weakest. Per-call inference cost on the fine-tuned model is about 8% higher than base Haiku on Bedrock.
In production, the fine-tuned model handles first-pass coding with a confidence threshold: codes above 0.9 confidence go straight to billing, codes between 0.7 and 0.9 get a human coder review, and codes below 0.7 get a full human workflow. Human coders now review about 30% of notes instead of 100%, so each coder handles 3x more volume. Payback on the fine-tuning project was roughly 6 weeks. The team re-trains the model every quarter as new coding guidelines come out.
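The routing rule itself is a few lines. A sketch using the thresholds from this example; the label names and the handling of exact boundary values are illustrative choices:

```python
def route(code: str, confidence: float) -> str:
    """Route a predicted billing code by model confidence."""
    if confidence >= 0.9:
        return "auto_bill"            # straight to billing
    if confidence >= 0.7:
        return "coder_review"         # human coder reviews the suggestion
    return "full_human_workflow"      # model output treated as a draft at best

assert route("I10", 0.95) == "auto_bill"
assert route("I10", 0.80) == "coder_review"
assert route("I10", 0.50) == "full_human_workflow"
```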
What People Get Wrong
Myth
Fine-tuning is how you teach a model new facts.
Reality
Fine-tuning changes behavior more reliably than it adds knowledge. Teaching a model specific facts through fine-tuning requires many examples per fact and still produces confident errors on related questions. RAG is the right tool for new facts: retrieve them at query time, let the model read them, done. Fine-tune for style, format, and reasoning patterns.
Myth
You need thousands of examples to fine-tune usefully.
Reality
For LoRA on narrow tasks, 100-500 high-quality examples can produce real gains. Data quality matters more than volume: 500 carefully curated examples typically outperform 5,000 noisy ones. Start small, measure, and scale the dataset only if you need more. Many teams over-invest in data collection when better prompts and a smaller, cleaner dataset would have worked.
Myth
Fine-tuned models always beat prompted base models.
Reality
They often don't, especially on general reasoning tasks. Frontier base models with good prompts (few-shot examples, clear output format, chain-of-thought) match or beat fine-tuned smaller models on many benchmarks. Always run the prompt-engineered baseline first; the point of fine-tuning is to beat that baseline, not to skip measuring it. Deploying a fine-tuned model that loses to a base model with a better prompt is a common, painful mistake.
Need help implementing this?
We build production AI systems for enterprises. Tell us what you are working on and we will scope it in 30 minutes.