
AI Inference

How trained AI models make predictions in real-time

What is Inference?

Inference is when a trained AI model is actually used to make predictions or generate outputs. While training teaches the model, inference is when you put it to work.

When you ask ChatGPT a question, upload a photo for facial recognition, or get a Netflix recommendation—that's all inference.

Training vs. Inference

Training is expensive, happens once (or occasionally for updates), and takes days to weeks. Inference is cheap per request, happens constantly, and takes milliseconds to seconds.

How Inference Works

During inference:

  1. Input — Data enters the model (text, image, audio)
  2. Processing — Data flows through the network's layers
  3. Output — Model produces predictions or generated content

The model's weights are now fixed—it's not learning anymore, just applying what it learned during training.
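
A minimal sketch of those three steps, assuming a tiny PyTorch model (the layer sizes and the input here are placeholders, not a real trained network):

```python
import torch
import torch.nn as nn

# A tiny placeholder network; in practice the weights come from training.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()  # inference mode: dropout off, batch-norm stats frozen

x = torch.randn(1, 4)               # 1. Input: one example enters the model
with torch.no_grad():               # weights are fixed, no gradients tracked
    logits = model(x)               # 2. Processing: data flows through the layers
prediction = logits.argmax(dim=1)   # 3. Output: the predicted class
print(prediction.item())
```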

Key Metrics

Latency

How long it takes to go from input to output. For real-time applications like voice assistants, latency typically needs to stay under about 100 milliseconds. For batch processing, it matters much less.
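
A rough way to measure it, sketched here with a hypothetical predict function standing in for a real model call:

```python
# Rough latency measurement for one inference call.
# predict() is a hypothetical stand-in for a real model call.
import time

def predict(features):
    return sum(features) / len(features)

start = time.perf_counter()
_ = predict([0.1, 0.2, 0.3])
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.3f} ms")
```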

Throughput

How many inferences per second the system can handle. Important for high-traffic applications like Google Search or recommendation systems.
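
A simple back-of-the-envelope estimate, again using a stand-in predict function rather than a real model:

```python
# Rough throughput estimate: inferences completed per second.
# predict() is a hypothetical stand-in for a real model call.
import time

def predict(features):
    return sum(features) / len(features)

requests = [[0.1, 0.2, 0.3]] * 10_000
start = time.perf_counter()
for r in requests:
    predict(r)
elapsed = time.perf_counter() - start
print(f"throughput: {len(requests) / elapsed:.0f} inferences/sec")
```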

Accuracy

How often the model gets the right answer. Sometimes there's a trade-off between speed and accuracy.

Where Inference Happens

Cloud Inference

Most AI runs in data centers. When you use ChatGPT, your request goes to servers running powerful GPUs. Benefits: massive computing power and easy updates. Downsides: requires an internet connection and raises privacy concerns.

Edge Inference

AI runs locally on your device: phone, car, camera. Benefits: faster responses, works offline, better privacy. Downsides: limited computing power and harder to update.

Hybrid

Simple tasks run locally; complex ones go to the cloud. Siri, for example, does initial voice processing on your iPhone and sends more complex requests to Apple's servers.
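
A hedged sketch of that routing idea. The classify_complexity, run_on_device, and call_cloud_api functions are hypothetical placeholders, not any vendor's actual API:

```python
# Sketch of hybrid routing: simple requests stay on-device, complex ones go to the cloud.
# classify_complexity, run_on_device, and call_cloud_api are hypothetical placeholders.

def classify_complexity(request: str) -> str:
    # toy heuristic: short commands count as "simple"
    return "simple" if len(request.split()) <= 3 else "complex"

def run_on_device(request: str) -> str:
    return f"on-device answer for: {request!r}"

def call_cloud_api(request: str) -> str:
    return f"cloud answer for: {request!r}"

def handle(request: str) -> str:
    if classify_complexity(request) == "simple":
        return run_on_device(request)   # fast, offline, private
    return call_cloud_api(request)      # more compute, needs a network round-trip

print(handle("set a timer"))
print(handle("summarize my last three meetings and draft a follow-up email"))
```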

Optimizing Inference

Making inference faster and cheaper:

  • Quantization — Using smaller numbers (8-bit instead of 32-bit) for faster math (see the sketch after this list)
  • Pruning — Removing unnecessary connections in the network
  • Distillation — Training smaller models to mimic larger ones
  • Batching — Processing multiple requests together efficiently
  • Caching — Storing common responses to avoid recomputing
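
For example, quantization can be as simple as rescaling 32-bit weights onto an 8-bit integer grid. A minimal NumPy sketch (the weight matrix here is made up, and real quantization pipelines do considerably more):

```python
# Illustrative 8-bit symmetric quantization of a weight matrix (not a production pipeline).
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)   # made-up 32-bit weights

# map the float range onto the int8 range [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)   # 4x smaller, integer math

# dequantize to approximate the originals when needed
recovered = q_weights.astype(np.float32) * scale
print("max rounding error:", np.abs(weights - recovered).max())
```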

Inference Hardware

  • GPUs — Graphics cards adapted for AI (NVIDIA dominates)
  • TPUs — Google's custom AI chips
  • NPUs — Neural processing units in phones (Apple Neural Engine)
  • FPGAs — Programmable chips for specialized tasks

Summary

  • Inference is using a trained model to make predictions
  • Key metrics: latency, throughput, and accuracy
  • Can run in the cloud, on edge devices, or in a hybrid setup
  • Optimization techniques make models faster and cheaper to run