Reinforcement Learning

How AI learns through trial, error, and rewards

What is Reinforcement Learning?

Reinforcement learning (RL) is a type of machine learning where an AI learns by taking actions and receiving rewards or penalties. It's like training a dog: reward good behavior, correct bad behavior.

Famous Example

AlphaGo learned to beat the world Go champion by playing millions of games against itself, getting rewarded for winning.

How It Works

  1. Agent — The AI that takes actions
  2. Environment — The world the agent operates in
  3. Actions — What the agent can do
  4. State — Current situation
  5. Reward — Feedback signal (good or bad)

The agent learns to maximize total reward over time.
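The loop above can be sketched in code. Below is a minimal, illustrative example using tabular Q-learning on a made-up toy environment (a five-state corridor where the agent is rewarded for reaching the right end); the environment, constants, and table sizes are all assumptions chosen for brevity, not a standard benchmark.

```python
import random

random.seed(0)

N_STATES = 5
ACTIONS = [-1, +1]               # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table: estimated future reward for each (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state + action)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reached the goal
    return next_state, 0.0, False

for episode in range(200):
    state, done = 0, False
    while not done:
        # Exploration vs. exploitation: occasionally act at random
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward
        # reward + discounted best future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy: which action each state prefers
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the greedy policy steps right in every state, because the Q-values propagate the goal reward backward through the corridor.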

RL vs. Other Machine Learning

  • Supervised learning — Learn from labeled examples
  • Unsupervised learning — Find patterns without labels
  • Reinforcement learning — Learn from actions and consequences

Famous Applications

Game Playing

  • AlphaGo — Beat world Go champion
  • AlphaStar — Reached Grandmaster level in StarCraft II
  • OpenAI Five — Beat Dota 2 world champions
  • Atari games — RL breakthroughs in 2013

Robotics

Robots learn to walk, grasp objects, and navigate by trying and failing.

RLHF (RL from Human Feedback)

Used to train ChatGPT. Humans rate AI responses, and RL trains the model to generate better-rated outputs.
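One piece of this pipeline is a reward model trained on human preference pairs. A common formulation is a pairwise (Bradley-Terry style) loss that pushes the human-preferred response to score higher; the sketch below uses toy scores, whereas in practice they come from a neural network.

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred response scores higher."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A correctly ordered pair costs little; a mis-ordered pair is penalized
good = pairwise_loss(2.0, -1.0)   # preferred response already ranked higher
bad = pairwise_loss(-1.0, 2.0)    # ranking disagrees with the human label
print(round(good, 3), round(bad, 3))
```

Minimizing this loss over many human-labeled pairs teaches the reward model to imitate human preferences; that learned reward then drives the RL step.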

Key Concepts

  • Exploration vs. exploitation — Try new things or stick with what works?
  • Delayed rewards — Actions now may pay off later
  • Policy — The strategy for picking actions
  • Value function — Estimating future rewards
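The "delayed rewards" idea can be made concrete with the discounted return, the quantity a value function estimates. The reward sequence below is illustrative.

```python
GAMMA = 0.9   # discount factor: how much future rewards count today

def discounted_return(rewards, gamma=GAMMA):
    """G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# No reward for three steps, then a payoff of 10 at step 3:
# the action taken now is still worth 0.9**3 * 10 ≈ 7.29 today
print(round(discounted_return([0, 0, 0, 10]), 2))
```

A discount factor below 1 makes nearer rewards worth more, but a distant payoff can still dominate the choice of action now.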

Challenges

  • Sample efficiency — Needs many trials to learn
  • Reward hacking — AI finds loopholes instead of intended behavior
  • Simulation-to-reality gap — What works in simulation may fail in real world
  • Safety — Trial-and-error can be dangerous for real robots

Summary

  • RL learns from actions and rewards, not examples
  • Powers game-playing AI like AlphaGo and robotics
  • RLHF was key to making ChatGPT useful
  • Challenges: sample efficiency, reward hacking, safety