Instead of the classical SFT-plus-DPO alignment pipeline for training our LLMs, there is a new method available: ORPO, an innovative "reference model-free" monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase.
A New Preference-aligned SFT method.
We explore this idea from a theoretical physics perspective and notice a similarity to regularization-term methodologies. We further explore the conceptual similarity between a Lagrange multiplier and the new correction term that ORPO adds to the classical SFT loss functional.
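The structure of that correction term can be sketched in a few lines: the ORPO objective is the usual SFT negative log-likelihood plus a weighted odds-ratio penalty that favors the chosen response over the rejected one. This is a minimal scalar sketch assuming per-sequence (average) log-probabilities; the function names and the weight `lam` are illustrative, not from the paper's code.

```python
import math

def log_odds(logp):
    # log odds from a sequence log-probability:
    # odds(y|x) = p / (1 - p), so log odds = log p - log(1 - p)
    p = math.exp(logp)
    return logp - math.log(1.0 - p)

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    # SFT term: negative log-likelihood of the chosen response
    nll = -logp_chosen
    # odds-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)),
    # small when the model already prefers the chosen response
    log_or = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    # monolithic objective: one loss, no reference model needed
    return nll + lam * l_or
```

Note how `lam` plays the role of the Lagrange-multiplier-like weight discussed above: it trades off fitting the chosen responses (SFT) against widening the odds gap to the rejected ones.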
The performance figures of ORPO are given in comparison to Llama-2 7B and Mistral 7B models.
ORPO: Monolithic Preference Optimization without Reference Model
https://arxiv.org/pdf/2403.07691v2.pdf