RLHF (Reinforcement Learning from Human Feedback) — definition

RLHF is the training technique that aligns LLMs to human preferences — making them helpful, harmless, and honest. Human raters score model outputs, those scores train a "reward model," and the LLM is then optimized with reinforcement learning to generate outputs that score highly.

RLHF is responsible for the conversational, instruction-following behavior of modern chatbots. Without it, a raw language model generates text that is statistically plausible but not reliably useful or safe. The quality of RLHF data — and the values encoded in rating guidelines — shapes how a model responds to edge cases, sensitive topics, and ambiguous requests.