Inner Misalignment as a Correlation Problem
TL;DR: We tackled the problem of inner misalignment in AI systems, which occurs when a model’s learned behavior diverges from the intended behavior because the model latches onto spurious correlations in the training data. We proposed two methods, Simple Mahalanobis and Class Mahalanobis, that detect inner misalignment by modeling the correlational structure of learned features using covariance matrices and Mahalanobis distances. We evaluated both methods on adversarial attack detection and out-of-distribution (OOD) detection tasks from the OpenOOD benchmark. Class Mahalanobis performed strongly on adversarial attack detection, while results on OOD detection were mixed. Despite the varying performance, both methods are computationally efficient, adding minimal overhead. This work highlights the importance of addressing inner misalignment, proposes a promising approach based on modeling feature correlations, and paves the way for future research on building more reliable and aligned AI systems.
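To make the core idea concrete, here is a minimal sketch of class-conditional Mahalanobis scoring over learned features. This is an illustrative reconstruction, not the exact method from the full piece: the synthetic features, the tied (shared) covariance estimate, and the function name `class_mahalanobis_score` are all assumptions for the example. The intuition is that an input whose features break the correlations learned during training will sit far, in Mahalanobis distance, from every class-conditional Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "learned features": penultimate-layer activations
# for 200 training inputs across 2 classes, 64 dimensions each.
feats = rng.normal(size=(200, 64))
labels = rng.integers(0, 2, size=200)

# Fit a per-class mean and a single covariance shared across classes
# (a common choice for class-conditional Mahalanobis detectors).
means = {c: feats[labels == c].mean(axis=0) for c in (0, 1)}
centered = np.vstack([feats[labels == c] - means[c] for c in (0, 1)])
cov = centered.T @ centered / len(feats)
prec = np.linalg.pinv(cov)  # precision matrix (pseudo-inverse for stability)

def class_mahalanobis_score(x):
    """Squared Mahalanobis distance to the nearest class mean.

    Large scores flag inputs (e.g. adversarial or OOD) whose features
    violate the correlational structure captured by the covariance.
    """
    return min(float((x - mu) @ prec @ (x - mu)) for mu in means.values())

# An in-distribution feature vector scores low; a shifted one scores high.
in_score = class_mahalanobis_score(feats[0])
ood_score = class_mahalanobis_score(feats[0] + 10.0)
assert ood_score > in_score
```

In practice the features would come from a trained network's penultimate layer rather than a random draw, and the score would be thresholded (e.g. at a value chosen for a target false-positive rate) to decide whether to flag an input.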
Read the full piece here.