Alignment refers to the process of making an AI’s actions and responses consistent with human goals and values. Alignment is a somewhat fuzzy concept, yet AI systems are typically trained to optimize a measurable metric. This is where approaches such as Reinforcement Learning from Human Feedback (RLHF) come in handy: they turn human preferences into a signal the model can be trained on.
A key ingredient of RLHF is reinforcement learning (RL), a framework in which an agent learns what actions to take so as to maximize a reward signal. RL has a long history, with a revival in the 1980s. Its shining moment came in 2016, when AlphaGo, a program developed by DeepMind using RL, defeated the Go world champion Lee Sedol. RL has been credited with enabling AlphaGo to extend its abilities beyond what was achievable from training on human play data alone. A more recent example is DeepSeek R1, published in January 2025, which used RL to develop strong reasoning capabilities. A deep dive into reinforcement learning is beyond the scope of this post; we refer interested readers to the text by Sutton and Barto.
The manner in which RLHF helps with LLM alignment is as follows. A fine-tuned LLM generates two or more responses to each prompt in a set of prompts. A human evaluator is shown the prompt and the responses, and ranks the responses in order of preference. These ranked prompt-response pairs are then used to train a reward model (RM) that predicts, for each prompt-response pair, the score a human would give it. The reward model is essentially meant to mimic the human evaluator, since human rating cannot scale to the large number of prompt-response pairs needed to align the LLM. The LLM can now be trained directly against the reward model using a reinforcement learning algorithm such as PPO (proximal policy optimization). In this stage, the LLM generates a response for each prompt, the reward model scores the prompt and response, and this score is used as feedback (i.e. reinforcement) to update the weights of the LLM, along with a constraint that keeps the responses of the updated LLM from deviating significantly from the original model. This is repeated many times, until the LLM is ‘aligned’ with human preferences as modeled by the reward model.
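To make the two training signals concrete, here is a minimal PyTorch-style sketch, not tied to any particular library; the function and argument names are illustrative. The first function is the pairwise loss commonly used to train the reward model from human rankings; the second shows how the reward-model score is combined with a KL penalty against the original model during the RL stage.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss for training the reward model:
    push the score of the human-preferred response above the other one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

def rl_reward(rm_score: torch.Tensor,
              policy_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Reward used in the RL (PPO) stage: the reward-model score minus a KL
    penalty that keeps the updated policy close to the original (reference) LLM."""
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs).sum(dim=-1)
    return rm_score - kl_penalty
```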
There are caveats to this approach. RLHF can result in LLMs that engage in reward hacking, i.e. producing responses that score highly on the reward model but are nonsensical. Human preferences can be inconsistent or biased, and those biases can subsequently surface in the LLM’s responses. RLHF is also resource-intensive, requiring human labeling, training a reward model, and an RL fine-tuning loop.
A recent alternative to RLHF is Direct Preference Optimization (DPO), presented at NeurIPS 2023, which removes the need for RLHF’s multiple steps by allowing the LLM to learn human preferences directly. It does so by identifying an analytical mapping from reward functions to optimal policies, which allows DPO to optimize directly over policies. There is an excellent talk by the authors of the paper that starts with an explanation of RLHF and the elegant manner in which DPO fuses its multiple steps. DPO has been used in the Mixtral 8x7B Instruct and Llama 3 models.
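The core of DPO is a single loss over preference pairs. Below is a short sketch of that loss in PyTorch (the function and argument names are illustrative, not a specific library API): it increases the margin by which the policy prefers the chosen response over the rejected one, measured relative to a frozen reference model, with `beta` controlling how far the policy may drift from that reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is the summed log-probability of a response under either the
    policy being trained or the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```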
There are a few caveats with DPO:
- Like RLHF, DPO is susceptible to a length-exploitation bias: it tends to associate longer responses with higher quality
- Since DPO uses relative rankings rather than absolute scores, it can sometimes reduce the likelihood of the preferred responses themselves, which may increase the chance of generating out-of-distribution outputs, a phenomenon referred to as Degraded Chosen Response (DCR)
- DPO is also prone to overfitting the preference data when the number of preference pairs is limited
Alternatives to DPO, such as Identity Preference Optimization (IPO) from Google DeepMind and Kahneman-Tversky Optimization (KTO) from contextual.ai, attempt to address some of these shortcomings. A comparison of these techniques can be found in this post from HuggingFace (HF), which also maintains a library called TRL (Transformer Reinforcement Learning) that implements these techniques for post-training foundation models. A further add-on to DPO or its alternatives is Constitutional AI from Anthropic: an AI is prompted with an undesirable request, asked to critique its own response by checking whether it violates its constitution (a set of desired values, such as avoiding responses that are illegal or harmful), and then asked to revise its response accordingly. The prompt, along with the initial and revised responses, can be used as a preference pair to align the model using DPO. HF provides a detailed evaluation of this approach.
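As an illustration of how such a preference pair might be constructed, here is a hedged sketch of the critique-and-revise step. The `generate` callable and the prompt wording are hypothetical placeholders for whatever chat model and templates are actually used; the point is only the shape of the loop.

```python
def constitutional_preference_pair(generate, user_request: str, constitution: str) -> dict:
    """Build a DPO-style preference pair via self-critique and revision.

    `generate` is a hypothetical callable (prompt string -> response string)
    standing in for any chat model; the prompt templates are illustrative only.
    """
    # 1. Initial response to the (possibly undesirable) request.
    initial = generate(user_request)

    # 2. Self-critique: does the response violate the constitution?
    critique = generate(
        f"Constitution:\n{constitution}\n\nResponse:\n{initial}\n\n"
        "Identify any way in which this response violates the constitution."
    )

    # 3. Revision: rewrite the response to comply with the constitution.
    revised = generate(
        f"Constitution:\n{constitution}\n\nOriginal response:\n{initial}\n\n"
        f"Critique:\n{critique}\n\nRewrite the response so that it follows the constitution."
    )

    # The initial response acts as 'rejected' and the revision as 'chosen'.
    return {"prompt": user_request, "chosen": revised, "rejected": initial}
```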
As fuzzy as alignment may be, the methods to achieve it are evolving rapidly. From RLHF to DPO and beyond, the field is actively exploring how to make LLMs better reflect human preferences and values. This ongoing research is critical—not only for niche applications but also for the broader consumer adoption of AI systems.