Can AI Call Its Own Bluffs?
Hallucinations are one of the key issues preventing wider adoption of LLMs in products [3] [4] [10].
Current LLMs undergo an alignment phase in which they are fine-tuned to score highly on “helpfulness”, “truthfulness” and “harmlessness” [13] [1].
However, the details of the alignment process make it unlikely to actually ensure the “truthfulness” of generations; what gets optimised in practice looks more like “convincingness”.
Typically, during the last stage of the alignment process, the LLM is fine-tuned via an RL algorithm such as PPO [7] to maximise some reward.
The latency of human feedback makes it impractical to query humans directly during training, so a preference model is trained on human feedback and used as a proxy.
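For concreteness, this stage is commonly described as maximising the expected reward under the policy minus a KL penalty that keeps the fine-tuned model close to its reference; the exact formulation varies between systems, so the expression below is only the standard textbook form, with \(\pi_\theta\) the model being tuned, \(\pi_{\mathrm{ref}}\) the frozen reference model, \(r_\phi\) the learned preference (reward) model and \(\beta\) the penalty coefficient:

$$
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[\, r_\phi(x, y) \,\right] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
$$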
Such a preference model is unlikely to have better factual knowledge than the base model: it is typically initialised from the weights of the pre-trained base model and fine-tuned on pairs of generated answers to predict which one a human preferred, so it may end up relying on the tone of the response and other surface cues rather than factual correctness.
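To make “fine-tuned on pairs of generated answers” concrete, here is a minimal sketch of the usual pairwise training setup: the model assigns a scalar score to each answer and the loss pushes the score of the human-preferred answer above the rejected one. The `base_lm` backbone, its output shape and `hidden_size` are placeholders for illustration, not a reference to any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceModel(nn.Module):
    """Scores a (prompt, answer) sequence with a single scalar.

    `base_lm` stands in for the pre-trained backbone whose weights the
    preference model is initialised from; it is assumed to return hidden
    states of shape (batch, seq_len, hidden_size).
    """

    def __init__(self, base_lm: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = base_lm
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids, attention_mask)          # (batch, seq, hidden)
        last_token = attention_mask.sum(dim=1) - 1                 # index of final non-pad token
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[batch_idx, last_token]                     # (batch, hidden)
        return self.score_head(pooled).squeeze(-1)                 # (batch,) scalar scores


def preference_loss(chosen_score: torch.Tensor, rejected_score: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: the model is rewarded only for
    ranking the human-preferred answer above the rejected one."""
    return -F.logsigmoid(chosen_score - rejected_score).mean()
```

Note that nothing in this objective references ground-truth facts: the model is trained only to reproduce the human ranking, which is exactly why it can latch onto tone and style rather than truthfulness.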
Even if actual humans are used instead of the reward model, it is quite possible that they would not be able to reliably evaluate the truthfulness of many statements (unless each answer goes through an extensive fact-checking procedure), which leads us to believe that this approach to alignment is insufficient to ensure the truthfulness of the model.
Read the full piece here.