
Some Issues in Predictive Ethics Modeling: An Annotated Contrast Set of “Moral Stories”

By Ben Fitzgerald (Published on June 19, 2024)

This project was a runner-up to the 'Novel research' prize on our AI Alignment (March 2024) course.

Models like the Allen Institute’s Delphi have been able to label ethical dilemmas as moral or immoral with astonishing accuracy. This paper challenges accuracy as a holistic metric for ethics modeling by identifying issues with translating moral dilemmas into text-based input. It demonstrates these issues with contrast sets that substantially reduce the performance of classifiers trained on the Moral Stories dataset. Ultimately, the paper obtains concrete estimates of how much specific forms of data misrepresentation harm classifier accuracy. Specifically, label-changing tweaks to a situation’s descriptive content (as small as 3-5 words) can reduce classifier accuracy to as low as 51%, almost half the initial accuracy of 99.8%. Associating situations with a misleading social norm lowers accuracy to 98.8%, while adding textual bias (i.e. an implication that a situation already fits a certain label) lowers accuracy to 77%.
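To make the contrast-set idea concrete, here is a minimal sketch of how such an evaluation could be run. The `classify` function, the example texts, and the labels are all illustrative placeholders standing in for a real classifier fine-tuned on Moral Stories; they are not taken from the paper.

```python
# Sketch of a contrast-set evaluation: tweak a few words of a situation's
# descriptive content so the gold label flips, then check whether the
# classifier's accuracy survives the change. Everything here is hypothetical.
from dataclasses import dataclass


@dataclass
class Example:
    norm: str        # social norm the action is judged against
    situation: str   # descriptive context surrounding the action
    action: str      # action to label as moral (1) or immoral (0)
    label: int


def classify(example: Example) -> int:
    """Placeholder for a trained classifier (e.g. a fine-tuned transformer)."""
    # Trivial keyword heuristic, purely so the sketch runs end to end.
    return 0 if "without asking" in example.action else 1


def accuracy(examples: list[Example]) -> float:
    correct = sum(classify(ex) == ex.label for ex in examples)
    return correct / len(examples)


# An original example and a label-changing contrast example that alters only
# a few words of the situation's descriptive content.
original = Example(
    norm="It's good to help friends move.",
    situation="Sam's friend asked him to help carry boxes this weekend.",
    action="Sam borrows his neighbour's truck without asking to move the boxes.",
    label=0,
)
contrast = Example(
    norm=original.norm,
    situation="Sam's neighbour told him to take the truck whenever he needs it.",
    action=original.action,
    label=1,  # the small contextual tweak flips the gold label
)

print("accuracy on originals:   ", accuracy([original]))
print("accuracy on contrast set:", accuracy([contrast]))
```

A classifier that has merely memorised surface patterns will keep its original prediction on the contrast example, which is exactly the failure mode the reported drop from 99.8% to 51% points at.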

These results suggest not only that many ethics models have substantially overfit, but also that several precautions are required to ensure that input accurately captures a moral dilemma. This paper recommends re-examining how social norms are structured, training models to ask for context via defeasible reasoning, and filtering input for textual bias. The work thus provides not only the first concrete estimates of the average cost of ethical data misrepresentation on accuracy, but also practical guidance for researchers who want to account for these estimates in their own work.
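As a rough illustration of the last precaution, filtering input for textual bias, one could screen dilemma descriptions for wording that already implies a moral label before passing them to a classifier. The cue lists and example below are assumptions for the sake of the sketch, not the paper's actual method.

```python
# Hypothetical textual-bias screen: flag label-implying wording in a dilemma
# description so it can be reworded (or the example discarded) before labeling.
BIASED_CUES = {
    "positive": ["kindly", "generously", "thoughtfully", "responsibly"],
    "negative": ["rudely", "selfishly", "cruelly", "recklessly"],
}


def flag_textual_bias(text: str) -> list[str]:
    """Return any label-implying cue words found in the description."""
    lowered = text.lower()
    return [cue for cues in BIASED_CUES.values() for cue in cues if cue in lowered]


situation = "Alex selfishly kept the shared supplies locked in his desk."
cues = flag_textual_bias(situation)
if cues:
    print(f"Possible textual bias ({', '.join(cues)}); reword before labeling.")
```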

Read the full piece here.
