
Can large language models effectively identify cybersecurity risks?

By Emile Delcourt (Published on July 3, 2024)

The ability to discriminate risky scenarios from low-risk ones is one of the “hidden features” present in most of the later layers of Mistral 7B. I developed and analyzed linear probes on those layers’ hidden activations and found strong evidence that the model generally knows when “something is up” with a proposed text versus a low-risk scenario (F1 > 0.85 for 4 layers; AUC exceeding 0.96 in some layers). The neurons that activate most strongly in risky scenarios also have a security-oriented effect on outputs, most of them boosting tokens like “Virus” or “Attack” and questioning “necessity” or likelihood. These findings give me confidence that we could build LLM-based risk assessment systems, subject to some signal/noise tradeoffs.
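As a rough illustration of the probing setup (not the exact pipeline from the full piece), a linear probe can be fit on last-token hidden activations from one of the later layers and scored with F1 and AUC. The checkpoint name, the layer index, and the `texts`/`labels` dataset below are placeholder assumptions for the sketch.

```python
# Hypothetical sketch of a linear probe on Mistral 7B hidden activations.
# Assumes a labeled dataset of (text, risky/low-risk) pairs; not the author's exact code.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

MODEL = "mistralai/Mistral-7B-v0.1"   # assumed checkpoint
LAYER = 24                            # assumed "later layer" index

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_activation(text: str, layer: int) -> torch.Tensor:
    """Return the hidden state of the final token at the given layer."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; layer i is hidden_states[i]
    return out.hidden_states[layer][0, -1, :].float().cpu()

# texts: list[str], labels: list[int] (1 = risky, 0 = low risk) -- placeholder data
X = torch.stack([last_token_activation(t, LAYER) for t in texts]).numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=2000)
probe.fit(X_train, y_train)

pred = probe.predict(X_test)
scores = probe.predict_proba(X_test)[:, 1]
print("F1:", f1_score(y_test, pred), "AUC:", roc_auc_score(y_test, scores))
```

Repeating this per layer is what lets one compare F1 and AUC across the model's depth and identify which later layers carry the risk signal most cleanly.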


Read the full piece here.
