Instead of the classical SFT-plus-DPO alignment pipeline for training our LLMs, there is a new method available: ORPO, an innovative "reference model-free" monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase.
A New Preference-aligned SFT method.
We explore this idea from a theoretical physics perspective and notice a similarity to regularization-term methodologies. We further explore the conceptual similarity between a Lagrange multiplier and the new correction term that ORPO adds to the classical SFT loss functional.
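The structure of that correction term can be sketched in a few lines: the ORPO objective is the usual SFT negative log-likelihood plus a weighted odds-ratio penalty that favors the chosen response over the rejected one. This is a minimal scalar sketch assuming per-sequence (average) log-probabilities; the function names and the weight `lam` are illustrative, not from the paper's code.

```python
import math

def log_odds(logp):
    # log odds from a sequence log-probability:
    # odds(y|x) = p / (1 - p), so log odds = log p - log(1 - p)
    p = math.exp(logp)
    return logp - math.log(1.0 - p)

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    # SFT term: negative log-likelihood of the chosen response
    nll = -logp_chosen
    # odds-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)),
    # small when the model already prefers the chosen response
    log_or = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    # monolithic objective: one loss, no reference model needed
    return nll + lam * l_or
```

Note how `lam` plays the role of the Lagrange-multiplier-like weight discussed above: it trades off fitting the chosen responses (SFT) against widening the odds gap to the rejected ones.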
The performance figures of ORPO are given in comparison to Llama-2 7B and Mistral 7B models.
ORPO: Monolithic Preference Optimization without Reference Model
https://arxiv.org/pdf/2403.07691v2.pdf