Glossary
Prompt Engineering · Foundational

Context Window

The maximum number of tokens a model can process in a single prompt and response cycle.

Definition

A context window is the maximum number of tokens that a large language model can process in a single interaction, encompassing both the input prompt and the generated output. It represents the model's working memory, determining how much information it can consider at once when generating a response.

Key characteristics of context windows include:

  1. Token-Based Measurement: Context windows are measured in tokens, not words. A token is typically a word fragment, punctuation mark, or common word. English text averages roughly 1.3 tokens per word.

  2. Shared Budget: The context window is shared between input and output. A model with a 200K token window that receives a 150K token prompt can generate at most 50K tokens in response.

  3. Rapid Expansion: Context windows have grown dramatically, from GPT-3's 2,048 tokens in 2020 to models like Claude and Gemini supporting 200K or more tokens by 2025, enabling processing of entire codebases or books.

  4. Attention Trade-offs: While larger windows allow more context, models may attend less reliably to information in the middle of very long contexts, a phenomenon known as the "lost in the middle" problem.

  5. Cost Implications: Larger context windows increase inference costs, since processing scales with the number of tokens. This makes efficient context management a key engineering concern.
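The token-estimation and shared-budget points above can be sketched in a few lines of Python. This is a rough illustration only: the 1.3 tokens-per-word ratio is the approximate figure quoted above, and the function names are made up for this example; a real tokenizer library would be needed for exact counts.

```python
TOKENS_PER_WORD = 1.3  # rough average for English text, per the figure above

def estimate_tokens(text: str) -> int:
    """Estimate token count from word count using the rough ratio."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def max_output_tokens(context_window: int, prompt_tokens: int) -> int:
    """Remaining budget for the response: input and output share the window."""
    return max(context_window - prompt_tokens, 0)

# A 200K-token window minus a 150K-token prompt leaves 50K tokens for output.
print(max_output_tokens(200_000, 150_000))  # 50000
```

In practice, applications reserve part of this budget explicitly (e.g. via a max-output-tokens parameter) so a long prompt cannot silently truncate the response.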

Last updated: 3/11/2026