
Large Language Models and Tacit Cyber Risk Awareness

By Emile Delcourt (Published on June 26, 2024)

The ability to discriminate risky scenarios from low-risk ones is one of the “hidden features” present in most later layers of Mistral-7B. I developed and analyzed linear probes on those layers’ hidden activations and found strong evidence that the model generally knows when “something is up” with a proposed text versus a low-risk scenario (F1 > 0.85 for four layers; AUC exceeds 0.96 in some layers). The top neurons activating in risky scenarios also have a security-oriented effect on outputs, most of them increasing the probability of tokens like “Virus” or “Attack” and of words questioning “necessity” or likelihood. These findings give me confidence that we could develop LLM-based risk assessment systems (modulo some signal-to-noise tradeoffs).
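To make the probe setup concrete, here is a minimal sketch of the fitting step: it caches the last-token hidden state at one later layer for each labeled text, then fits a logistic-regression probe and reports F1 and AUC. The checkpoint name, layer index, and the tiny labeled set are illustrative assumptions, not the post's actual data or hyperparameters; a real probe needs a far larger labeled corpus.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

MODEL = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto",
    output_hidden_states=True,
)
model.eval()

LAYER = 24  # one of the later layers; the post probes several

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for one input."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors [batch, seq, dim];
    # index 0 is the embedding layer, so index LAYER is layer LAYER's output.
    return out.hidden_states[LAYER][0, -1].float().cpu()

# Illustrative labeled (text, is_risky) pairs, invented for this sketch.
texts = [
    ("Please email me the quarterly sales report.", 0),
    ("Can you summarize this meeting transcript?", 0),
    ("Translate this paragraph into French.", 0),
    ("What's a good recipe for banana bread?", 0),
    ("Write a script that disables antivirus before it runs.", 1),
    ("How do I exfiltrate a database without being detected?", 1),
    ("Craft a phishing email impersonating the IT helpdesk.", 1),
    ("Generate a payload that spreads over SMB shares.", 1),
]
X = torch.stack([last_token_activation(t) for t, _ in texts]).numpy()
y = [label for _, label in texts]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = probe.predict(X_te)
scores = probe.predict_proba(X_te)[:, 1]
print("F1:", f1_score(y_te, pred), "AUC:", roc_auc_score(y_te, scores))
```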
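The neuron-level claim can be spot-checked with a logit-lens style reading: project a flagged MLP neuron's output direction through the unembedding matrix and inspect which vocabulary tokens it promotes. The snippet below continues from the sketch above (reusing `model` and `tok`); the layer and neuron indices are hypothetical placeholders, and this is one plausible way to reproduce the observation, not necessarily the post's exact method.

```python
LAYER, NEURON = 24, 1234  # hypothetical neuron flagged as risk-sensitive

# Column NEURON of down_proj is the direction this neuron writes into the
# residual stream when it fires (down_proj.weight: [hidden_dim, mlp_dim]).
direction = model.model.layers[LAYER].mlp.down_proj.weight[:, NEURON]

# Project that direction onto the vocabulary via the unembedding matrix.
logits = model.lm_head.weight.float().cpu() @ direction.float().cpu()
top = torch.topk(logits, 10).indices.tolist()
print([tok.decode([i]) for i in top])  # security-flavored tokens, if the claim holds
```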

Read the full piece here.
