Reinforcement Learning

How AI learns through trial, error, and rewards

What is Reinforcement Learning?

Reinforcement learning (RL) is a type of machine learning where an AI learns by taking actions and receiving rewards or penalties. It's like training a dog: reward good behavior, correct bad behavior.

Famous Example

AlphaGo learned to beat the world Go champion by playing millions of games against itself, getting rewarded for winning.

How It Works

  1. Agent — The AI that takes actions
  2. Environment — The world the agent operates in
  3. Actions — What the agent can do
  4. State — Current situation
  5. Reward — Feedback signal (good or bad)

The agent learns to maximize total reward over time.
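The loop above can be sketched in code. Below is a minimal, illustrative example using tabular Q-learning on a made-up toy environment (a five-state corridor where the agent is rewarded for reaching the right end); the environment, constants, and table sizes are all assumptions chosen for brevity, not a standard benchmark.

```python
import random

random.seed(0)

N_STATES = 5
ACTIONS = [-1, +1]               # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table: estimated future reward for each (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state + action)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reached the goal
    return next_state, 0.0, False

for episode in range(200):
    state, done = 0, False
    while not done:
        # Exploration vs. exploitation: occasionally act at random
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward
        # reward + discounted best future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy: which action each state prefers
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the greedy policy steps right in every state, because the Q-values propagate the goal reward backward through the corridor.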

RL vs. Other Machine Learning

  • Supervised learning — Learn from labeled examples
  • Unsupervised learning — Find patterns without labels
  • Reinforcement learning — Learn from actions and consequences

Famous Applications

Game Playing

  • AlphaGo — Beat world Go champion
  • AlphaStar — Reached Grandmaster level in StarCraft II
  • OpenAI Five — Beat Dota 2 world champions
  • Atari games — RL breakthroughs in 2013

Robotics

Robots learn to walk, grasp objects, and navigate by trying and failing.

RLHF (RL from Human Feedback)

Used to train ChatGPT. Humans rate AI responses, and RL trains the model to generate better-rated outputs.
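One piece of this pipeline is a reward model trained on human preference pairs. A common formulation is a pairwise (Bradley-Terry style) loss that pushes the human-preferred response to score higher; the sketch below uses toy scores, whereas in practice they come from a neural network.

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred response scores higher."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A correctly ordered pair costs little; a mis-ordered pair is penalized
good = pairwise_loss(2.0, -1.0)   # preferred response already ranked higher
bad = pairwise_loss(-1.0, 2.0)    # ranking disagrees with the human label
print(round(good, 3), round(bad, 3))
```

Minimizing this loss over many human-labeled pairs teaches the reward model to imitate human preferences; that learned reward then drives the RL step.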

Key Concepts

  • Exploration vs. exploitation — Try new things or stick with what works?
  • Delayed rewards — Actions now may pay off later
  • Policy — The strategy for picking actions
  • Value function — Estimating future rewards
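The "delayed rewards" idea can be made concrete with the discounted return, the quantity a value function estimates. The reward sequence below is illustrative.

```python
GAMMA = 0.9   # discount factor: how much future rewards count today

def discounted_return(rewards, gamma=GAMMA):
    """G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# No reward for three steps, then a payoff of 10 at step 3:
# the action taken now is still worth 0.9**3 * 10 ≈ 7.29 today
print(round(discounted_return([0, 0, 0, 10]), 2))
```

A discount factor below 1 makes nearer rewards worth more, but a distant payoff can still dominate the choice of action now.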

Challenges

  • Sample efficiency — Needs many trials to learn
  • Reward hacking — AI finds loopholes instead of intended behavior
  • Simulation-to-reality gap — What works in simulation may fail in real world
  • Safety — Trial-and-error can be dangerous for real robots

Summary

  • RL learns from actions and rewards, not examples
  • Powers game-playing AI like AlphaGo and robotics
  • RLHF was key to making ChatGPT useful
  • Challenges: sample efficiency, reward hacking, safety