Inference is the process of running a trained AI model on new input to generate a prediction or output. When you send a message to Claude or generate an image with DALL·E, the serving infrastructure is performing inference: it applies the model's trained weights to your input in real time.
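Inference cost and latency are among the main operational concerns for production AI products. Larger models are generally more capable but slower and more expensive to serve. Several techniques reduce cost and latency or raise throughput: quantization (storing weights at lower numeric precision), batching (processing several requests together), KV caching (reusing attention keys and values already computed for earlier tokens), and speculative decoding (drafting tokens with a small model and verifying them with the large one). Inference providers include Anthropic, OpenAI, Google, and specialized inference clouds like Together.ai and Groq.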
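As an illustration of one of these techniques, here is a minimal sketch of quantization in Python, assuming a simple symmetric scheme with a single scale for the whole weight list. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and chosen only for this example; production systems typically use more elaborate schemes. The idea it shows is that each weight is stored as a small integer (1 byte instead of the 4 bytes of a 32-bit float) at the price of a bounded rounding error.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (illustrative scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Choose the scale so the largest-magnitude weight maps to +/-127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in quantized]


# Hypothetical weights, for demonstration only.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("max rounding error:", max_error)
```

The trade-off is the one described above: less memory and bandwidth per weight, which makes serving cheaper and faster, in exchange for a small loss of precision.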