Context Window
The context window is the maximum number of tokens a language model can process in a single request, including both the input prompt and the generated output. It defines how much information the model can consider at once when generating a response.
How It Works
Think of the context window as the model's working memory. Everything the model needs to know for a given request has to fit inside this window: the system prompt, conversation history, retrieved documents, user question, and the space for the response.
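The budgeting described above can be sketched as a simple pre-flight check. This is a minimal illustration, not a production implementation: the 4-characters-per-token ratio is a rough heuristic, and the window size and response reserve are hypothetical values. Real systems should count tokens with the model provider's actual tokenizer.

```python
# Hedged sketch of a context-budget check before sending a request.
CONTEXT_WINDOW = 8_000     # total token budget (hypothetical model)
RESPONSE_RESERVE = 1_000   # tokens held back for the generated output

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return len(text) // 4 + 1

def fits_in_window(system_prompt: str, history: list[str],
                   documents: list[str], question: str) -> bool:
    """True if the full request still leaves room for the response."""
    parts = [system_prompt, *history, *documents, question]
    used = sum(estimate_tokens(p) for p in parts)
    return used + RESPONSE_RESERVE <= CONTEXT_WINDOW
```

The key design point is reserving output space up front: a request that fills the entire window leaves the model no room to respond.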
Context window sizes have grown rapidly. GPT-3 shipped with a 2K token context window, GPT-4 launched with 8K and 32K variants, and GPT-4 Turbo extended that to 128K. Claude supports up to 200K tokens, and Google's Gemini 1.5 goes up to 1M. Larger windows mean you can include more context, but there are tradeoffs in cost and sometimes in quality.
For enterprise applications, context window size affects architecture decisions. With a small context window, you need aggressive chunking and careful retrieval to fit the most relevant information. With a large context window, you can include more documents, but you also spend more on tokens. Sending 100K tokens of context when 10K would have been enough is wasteful.
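One common way to balance that tradeoff is budget-aware retrieval: keep only the highest-scoring chunks that fit a fixed token budget instead of sending everything. The sketch below assumes relevance scores and token counts have already been computed upstream (by a retriever and a real tokenizer); the chunk data is illustrative.

```python
# Greedy budget fill: take chunks best-first until the token budget is spent.
def select_chunks(chunks: list[tuple[float, int, str]],
                  budget: int) -> list[str]:
    """chunks: (relevance_score, token_count, text). Greedy fill by score."""
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: -c[0]):
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected

chunks = [(0.9, 400, "refund policy"), (0.7, 300, "shipping terms"),
          (0.4, 500, "company history"), (0.8, 350, "return steps")]
print(select_chunks(chunks, 800))  # ['refund policy', 'return steps']
```

With an 800-token budget, the two top-scored chunks fit (400 + 350 tokens) and the rest are dropped, rather than paying for 1,550 tokens of context.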
There is also the question of attention quality. Models tend to pay more attention to information at the beginning and end of the context window, sometimes missing details in the middle. This is known as the "lost in the middle" problem. Smart prompt design places the most important context where the model is most likely to attend to it.
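One simple mitigation, sketched below under the assumption that chunks arrive ranked best-first, is to interleave them so the strongest material lands at the start and end of the prompt, pushing the weakest chunks into the middle where attention is weakest.

```python
# Place the best-ranked chunks at both edges of the prompt,
# leaving the lowest-ranked chunks in the middle.
def edge_order(ranked_chunks: list[str]) -> list[str]:
    """ranked_chunks is best-first; returns best items at both edges."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(edge_order(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here the top-ranked chunk ("A") opens the context and the second-ranked ("B") closes it, so a model biased toward the edges still sees the most important information.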
In practice, typical requests in enterprise AI applications use between 4K and 32K tokens, even when the model supports far more. Larger context windows earn their cost in document analysis, long conversations, and complex reasoning tasks that require lots of supporting information.