Glossary

Token (LLM)

A token is the basic unit of text that a large language model processes. It can be a word, part of a word, or a punctuation mark. Language models read, process, and generate text as sequences of tokens, and API pricing is based on the number of tokens used.

How It Works

Language models don't read text the way humans do. They break text into tokens, which are typically 3-4 characters long. The word "understanding" might be two tokens: "under" and "standing." Common words like "the" are a single token. Rare or technical terms get split into more tokens.

Tokenization is handled by a tokenizer specific to each model. OpenAI's GPT-4 family uses cl100k_base or o200k_base (tiktoken). Anthropic's Claude uses its own tokenizer, exposed through the count_tokens API. Google's Gemini uses SentencePiece variants. These tokenizers assign different token counts to the same text. A paragraph that's 200 tokens for GPT-4o might be 210 for Claude and 190 for Gemini. Close but not identical, which matters when you're comparing prices across providers.
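To see this concretely, here's a minimal sketch using OpenAI's tiktoken library to count the same string under two of its encodings (Claude and Gemini require their own tools, so treat the exact numbers as illustrative):

```python
import tiktoken  # pip install tiktoken

text = "Retrieval-augmented generation reduces hallucination."

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    # Same text, different encoding, different token count.
    print(name, len(ids), [enc.decode([i]) for i in ids[:4]])
```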

This matters for two practical reasons. First, pricing. API providers charge per token, with separate rates for input tokens (your prompt) and output tokens (the model's response). Output tokens are typically 3-5x more expensive than input. A long prompt with lots of context costs more than a short one. Understanding your token usage is essential for managing AI costs at scale.
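As a quick sketch of that pricing model (the per-million rates below are placeholders; substitute your provider's current price sheet):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float = 3.0, output_rate: float = 15.0) -> float:
    """Rates are USD per million tokens."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# A 2,000-token prompt producing a 500-token reply:
print(f"${estimate_cost(2_000, 500):.4f}")  # $0.0135
```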

Second, token limits. Every model has a maximum number of tokens it can handle in a single request (its context window). This includes both your input and the model's output. If your prompt is too long, you need to either shorten it, use a model with a larger context window, or redesign your approach.
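A minimal pre-flight check, assuming a 128k-token context window and tiktoken's o200k_base encoding (swap in your model's documented limit and tokenizer):

```python
import tiktoken

CONTEXT_WINDOW = 128_000  # assumed; check your model's documented limit
MAX_OUTPUT = 4_096        # room reserved for the response

def fits(prompt: str) -> bool:
    # Input tokens plus reserved output must stay inside the window.
    n_input = len(tiktoken.get_encoding("o200k_base").encode(prompt))
    return n_input + MAX_OUTPUT <= CONTEXT_WINDOW
```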

In RAG systems, token management is a key design decision. You need to fit the user's question, the retrieved documents, the system prompt, and leave room for the model's response, all within the context window. This is why chunking strategy and retrieval count matter: you want to maximize relevant context without hitting the token limit or burning budget.
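One way to make that budgeting explicit is to allocate slices of the window up front. The numbers here are an assumed example for a small model, not a recommendation:

```python
CONTEXT_WINDOW = 16_000   # hypothetical model
RESERVED_OUTPUT = 1_500   # room for the response
SYSTEM_PROMPT = 800       # measured with the model's tokenizer
HISTORY_CAP = 3_000       # truncate older turns beyond this

retrieval_budget = CONTEXT_WINDOW - RESERVED_OUTPUT - SYSTEM_PROMPT - HISTORY_CAP
chunk_size = 500          # tokens per retrieved chunk
top_k = retrieval_budget // chunk_size
print(retrieval_budget, top_k)  # 10700 tokens of headroom -> 21 chunks
```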

As a rough rule of thumb, 1 token is about 0.75 words in English. A 1,000-word document is roughly 1,300 tokens. Non-English text, code, and structured data often tokenize less efficiently. Chinese and Japanese text can use 2-3x more tokens per character than English. JSON with deeply nested keys tokenizes worse than flat Markdown. If your workload has a lot of non-English content or code, benchmark actual token usage rather than estimating from the English rule of thumb.

In Practice

Pricing across the major providers as of 2026 sits in a rough range: input tokens around $1-15 per million, output tokens around $3-75 per million, depending on model tier. OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet are both around $3/$15 per million in/out. Frontier tiers (Claude Opus, OpenAI's o1-class reasoning models) run higher. Smaller models (Haiku, GPT-4o mini, Gemini Flash) run 5-20x cheaper. Prompt caching discounts cached prefix tokens substantially: Anthropic charges cache reads at about 10% of the base input rate, with TTLs of 5 minutes or 1 hour, while OpenAI's automatic caching charges cached tokens at roughly half the base rate.

Token counting tools: tiktoken for OpenAI, Anthropic's count_tokens API for Claude, the transformers AutoTokenizer for open models. Every serious production system logs input tokens, output tokens, cached tokens, and derived cost per request to an observability tool like Langfuse or Helicone. Team-level budgets are enforced via per-user or per-tenant quotas, often checked at the API gateway layer before the request reaches the model.
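A minimal version of that logging step, sketched against the Anthropic Messages API (the usage fields match their SDK; the rates and tags are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)

u = response.usage
record = {
    "input_tokens": u.input_tokens,
    "output_tokens": u.output_tokens,
    "cached_tokens": getattr(u, "cache_read_input_tokens", 0) or 0,
    "cost_usd": u.input_tokens / 1e6 * 3.0 + u.output_tokens / 1e6 * 15.0,
}
print(record)  # ship to your observability tool, tagged by user/feature/request
```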

A typical cost-management pattern: log every LLM call with input and output token counts tagged by user, feature, and request ID. Aggregate nightly to a dashboard showing cost per feature and per cohort of users. Alert when any feature's daily cost exceeds 2x its 7-day rolling average. For high-volume workloads, use prompt caching on stable prefixes (system prompts, retrieved reference docs), route simpler queries to cheaper models via a router (a small classifier that decides Haiku vs Sonnet), and cap output lengths with max_tokens to prevent runaway generation.
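The alert rule in that pattern is a few lines. This sketch assumes you can pull an ordered list of a feature's daily costs from your logs:

```python
def over_budget(daily_costs: list[float], multiplier: float = 2.0) -> bool:
    """True if today's cost exceeds `multiplier` x the 7-day rolling average."""
    *history, today = daily_costs[-8:]   # last 7 full days plus today
    baseline = sum(history) / len(history)
    return today > multiplier * baseline

assert over_budget([10, 11, 9, 10, 12, 10, 11, 30])      # spike -> alert
assert not over_budget([10, 11, 9, 10, 12, 10, 11, 14])  # normal day
```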

Worked Example

A SaaS company builds an AI-powered email reply assistant embedded in their customer support tool. In the first month of production, usage ramps to 14,000 assistant invocations per day. Each invocation uses roughly 3,500 input tokens (system prompt, 3 retrieved policy docs, conversation history) and produces about 400 output tokens.

Total daily tokens: 49M input plus 5.6M output. On Claude 3.5 Sonnet at $3 per million input and $15 per million output, that's $147 + $84 = $231 per day, or about $7,000 per month. The finance team flags the spend. The engineering team looks at traces in Langfuse and finds three optimizations.
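Before the optimizations, the arithmetic behind that daily figure, spelled out:

```python
invocations = 14_000
input_m = invocations * 3_500 / 1e6   # 49.0M input tokens/day
output_m = invocations * 400 / 1e6    # 5.6M output tokens/day
daily = input_m * 3 + output_m * 15   # $147 + $84
print(daily, daily * 30)              # 231.0/day, ~6,930/month
```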

One: the system prompt and the retrieved policy docs are identical across the first few turns of a conversation. Adding Anthropic prompt caching on that prefix drops input costs on cached hits to about $0.30 per million tokens. About 70% of invocations are follow-up turns that hit the cache. Effective input cost drops by roughly 55%. Two: they route simple acknowledgement replies ("got it, thanks") through a small router classifier to Claude Haiku at $0.25/$1.25 per million. That's about 30% of traffic at 12x lower cost. Three: max_tokens is capped at 500, preventing the occasional 4,000-token response that nobody reads.
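A rough check on the caching math in the first optimization. The 85% cacheable share is an assumption (conversation history varies turn to turn, so not every input token sits in the stable prefix), and the sketch ignores Anthropic's 1.25x cache-write surcharge on the first turn:

```python
base, cached = 3.00, 0.30         # $/M input: full rate vs cache-read rate
hit_rate, cacheable = 0.70, 0.85  # 70% of calls hit; 85% of tokens cacheable

effective = (1 - hit_rate) * base + hit_rate * (
    (1 - cacheable) * base + cacheable * cached
)
print(effective, 1 - effective / base)  # ~$1.39/M -> ~54% cheaper
```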

Post-optimization: monthly cost drops from $7,000 to $2,100 with no quality regression. The team adds a per-tenant token budget in the API gateway as a safety net.

What People Get Wrong

Myth

A token is a word.

Reality

Tokens are subword units. Common words are one token. Uncommon words, long words, and most non-English text break into multiple tokens. "Tokenization" might be 3 tokens. A Japanese sentence might use 2-3x more tokens than an equivalent English sentence. Always count with the model's actual tokenizer for accuracy, not word counts or character estimates.

Myth

Input tokens and output tokens cost the same.

Reality

Output is typically 3-5x more expensive than input across the major providers. This matters for system design. A chatty system with long responses costs far more than a retrieval system with concise answers, even if total token volume is similar. Capping max_tokens and prompting for brevity can meaningfully reduce bills on high-volume workloads.

Myth

Prompt caching is automatic and free.

Reality

Caching must be set up deliberately, and it isn't free. Anthropic requires explicit cache_control markers on the stable prefix (and cache writes cost a 25% premium over base input, more for the 1-hour TTL); Gemini requires its separate Context Caching setup; OpenAI applies caching automatically, but only to prompts above a minimum length with a byte-stable prefix. In every case it only pays off when the prefix is stable and reused within the TTL (roughly 5 minutes to 1 hour). If your "cached" prefix has varying content (a date, a user ID, a timestamp), caching doesn't activate and you pay full rate. Verify cache hits in your observability tool.
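For reference, Anthropic's opt-in looks roughly like this (a sketch; the short placeholder prompt stands in for a real prefix, which must also exceed the model's minimum cacheable length, about 1,024 tokens on Sonnet):

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder: in practice this is your full system prompt plus stable
# reference docs, byte-identical across calls.
STABLE_PREFIX = "You are a support assistant. <policy docs here>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,  # no dates, user IDs, or timestamps
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's your refund window?"}],
)
# On a repeat call within the TTL, usage.cache_read_input_tokens > 0
# confirms the cache actually hit.
print(response.usage)
```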

Related Solutions

Generative AI Applications
AI Agent Development

Need help implementing this?

We build production AI systems for enterprises. Tell us what you are working on and we will scope it in 30 minutes.