A new security threat to AI models

SRI research demonstrates how bad actors might encode undetectable malware inside complex deep neural networks.


Information security teams have spent decades stressing the same message: Any software downloaded from the internet can potentially contain malicious code. But is this true of AI models?

Many AI developers rely on pre-trained, third-party models available on public repositories, including deep learning models that power generative AI (GenAI) tools. These have generally been considered “safe” — developers might find that a given model doesn’t perform well, but these models haven’t been understood as serious IT security threats.

A recent paper by Briland Hitaj (an advanced computer scientist at SRI) and his co-authors highlights an alarming finding: It is currently quite possible to inject malware into deep neural networks in a way that is virtually undetectable.

Understanding a new wave of AI vulnerabilities

“During my PhD, I worked a lot on devising security and privacy attacks, especially focused on decentralized learning and federated learning,” Hitaj comments. More recently, he has explored how large language models (LLMs) can be used by bad actors to improve their ability to guess passwords.

In this new paper, Hitaj and his co-authors were curious to better understand potential vulnerabilities in publicly available deep learning models.

The deep learning models that power capabilities like GenAI are markedly different from traditional software. These neural networks contain billions or even trillions of “parameters,” the adjustable connection strengths between the network’s digital neurons that are tuned during training. These parameters are what enable the uncannily human-like performance of the best generative AI systems. They’re also why many GenAI outputs struggle with “explainability”: an answer may be correct or useful, but there are simply too many variables to explain exactly how the model arrived at it.

This same complexity, computer scientists have recognized, creates numerous opportunities for malware to hide.

“Say an adversary makes a model publicly available,” Hitaj explains. “They release it on Hugging Face, on GitHub, or on any other platform — you name it. They make sure that the model operates as intended on its main task. So, if it’s generating some interesting filters to improve your images or helping you with text summarization, it does its original task perfectly. However, the model may contain within its weight parameters additional hidden capabilities. Therefore, we decided to investigate whether it is possible to hide a malicious payload within a deep neural network.”

Building MaleficNet

Previous efforts to hide malware in deep learning models have shown that embedding a payload need not compromise model performance. Even so, those earlier approaches were easy for antivirus software to detect.

To explore a different way of embedding malware, Hitaj and his co-authors created a framework called MaleficNet. The framework combines two techniques borrowed from digital communications, code-division multiple access (CDMA) spread-spectrum encoding and low-density parity-check (LDPC) error correction, with the aim of making a malware payload undetectable even by the most robust malware detection engines.
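To make the general idea concrete, here is a minimal, purely illustrative sketch of CDMA-style spread-spectrum embedding in a weight vector. It is not the MaleficNet implementation: it omits the LDPC error correction and everything else needed in practice, the payload is a toy bit string rather than malware, and the names and values used here (embed_bits, extract_bits, the gain setting) are hypothetical.

```python
# Illustrative sketch only (assumed names and values), not the MaleficNet code.
# Each payload bit is spread across the entire weight vector as a tiny
# pseudorandom +/- perturbation; LDPC error correction is omitted.
import numpy as np

def embed_bits(weights: np.ndarray, bits: list[int], gain: float = 1e-3,
               seed: int = 0) -> np.ndarray:
    """Add one pseudorandom spreading code per bit, so no single weight
    carries the payload on its own."""
    rng = np.random.default_rng(seed)
    w = weights.astype(np.float64)                     # work on a copy
    for bit in bits:
        code = rng.choice([-1.0, 1.0], size=w.size)    # spreading code for this bit
        w += gain * (1.0 if bit else -1.0) * code      # +code encodes 1, -code encodes 0
    return w

def extract_bits(weights: np.ndarray, n_bits: int, seed: int = 0) -> list[int]:
    """Recover the bits by correlating the weights with each spreading code."""
    rng = np.random.default_rng(seed)
    return [int(np.dot(weights, rng.choice([-1.0, 1.0], size=weights.size)) > 0)
            for _ in range(n_bits)]

if __name__ == "__main__":
    # Stand-in for a flattened model weight vector (typical small magnitudes).
    clean = 0.01 * np.random.default_rng(1).standard_normal(100_000)
    payload = [1, 0, 1, 1, 0, 0, 1, 0]                 # toy bit string, not real malware
    stego = embed_bits(clean, payload)
    assert extract_bits(stego, len(payload)) == payload
    print("max change to any single weight:", np.abs(stego - clean).max())
```

Because every bit is smeared across all of the weights rather than stored in any one parameter, the change to each individual weight stays tiny, which is why this family of techniques can leave model accuracy essentially intact while giving signature-based scanners nothing obvious to flag.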

“It’s important to uncover any potential weaknesses in the ML supply chain.” — Briland Hitaj

The bad news: According to the MaleficNet paper, “Through extensive empirical analysis, we showed that MaleficNet models incur little to no performance penalty, that MaleficNet generalizes across a wide variety of architectures and datasets, and that state-of-the-art malware detection and statistical analysis techniques fail to detect the malware payload.”

In other words, an infected deep learning model can demonstrate uncompromised performance while hiding malware that no current system can detect. Under some circumstances, Hitaj adds, it’s quite possible that a malicious payload could circulate widely before being automatically triggered, leading to widespread impacts.

A problem in search of a solution

“Companies are starting to think in terms of a ‘machine learning supply chain,’” Hitaj observes. The outcomes of AI programs — new efficiencies, new products, and new ways of doing business — are critical to achieving an edge in today’s market. And if the inputs (data, human teams, or the AI models themselves) are compromised, the outcomes will inevitably suffer.

Most conversations around AI risks focus on model performance and training data limitations. AI models can hallucinate, for example, or encode biases. The fact that high-performing models might be perfect for hiding certain types of malware is a concern that the AI community will need to address head-on.

“It’s important to uncover any potential weaknesses in the ML supply chain,” Hitaj concludes. “Understanding the vulnerabilities is the first step toward addressing them, and we hope our findings spur additional research into how to mitigate this particular risk to deep learning models.”

MaleficNet is the result of a collaboration between researchers from SRI, the Swiss Data Science Center, and Sapienza University of Rome. It is documented in two research papers, the first presented at the European Symposium on Research in Computer Security and the second available as an arXiv preprint.

