Large language models (LLMs) are capable of comprehending, generating, and translating human-language text. They’re built into applications that companies use to generate content, provide customer service, write code, and much more. The potential to fuel significant productivity and revenue gains has resulted in the widespread adoption of LLMs and other generative AI technologies across sectors.
However, many businesses report a growing sense of caution in parallel to the increasing popularity of AI-powered solutions. According to recent McKinsey research, 51% of organizations view cybersecurity as a major AI-related concern. Rapid advances in LLM technology have also raised ethical concerns among consumers, researchers, and developers, eroding overall trust in the fairness and accuracy of model decision-making.
Addressing LLM security risks with a comprehensive, proactive strategy can help companies develop and use LLM-powered applications safely and effectively to maximize value. This guide outlines 10 of the biggest risks, followed by descriptions of the LLM security best practices to help mitigate them.
The table below summarizes the 10 biggest LLM security risks. Click the links for more information about each.
LLM Security Risk |
Description |
A malicious actor intentionally contaminates a training dataset to negatively affect AI performance or behavior. |
|
Models unintentionally reveal sensitive information in outputs, resulting in privacy violations and security breaches. |
|
Models, applications, and infrastructure may contain 3rd-party components with vulnerabilities. |
|
A malicious actor inserts malicious content into LLM prompts to manipulate the model’s behavior or extract sensitive information. |
|
Unvalidated outputs create vulnerabilities in downstream systems, potentially giving end-users administrative access to the backend. |
|
Malicious actors flood the model with resource-heavy requests to negatively affect performance or running costs. |
|
Hackers exploit LLM plugins with malicious requests to gain access to backend systems and sensitive data. |
|
Providing an LLM with too much autonomy creates models and applications that take undesired actions with severe consequences. |
|
9. Overreliance |
Trusting LLM-provided answers without evaluating their accuracy or potential consequences. |
10. Model theft |
Advanced persistent threats (APTs) and other malicious actors attempt to compromise, exfiltrate, physically steal, or leak LLM models. |
Training data poisoning attacks happen when a malicious actor intentionally contaminates a training dataset to negatively affect AI performance or behavior. Attackers may add, change, or delete data in a way that introduces vulnerabilities, biases, or inference errors, rendering any of the LLM’s decisions less trustworthy. For example, if someone inserts inappropriate language into professional conversation records, that LLM may use offensive words in chatbot interactions.
An LLM’s training and inference datasets often contain personally identifiable information (PII) and other sensitive data from a variety of sources, including internal company records and online databases. Models can unintentionally reveal sensitive information in outputs, resulting in privacy violations and security breaches.
LLMs and the applications that interface with models typically contain many third-party components with potential vulnerabilities. The hardware platforms that run AI applications in the cloud and on-premises data centers – and the software tools for managing that infrastructure – may also include known, unpatched vulnerabilities. For instance, in 2020, when cybercriminals exploited a vulnerability in a popular enterprise IT performance monitoring tool called SolarWinds Orion, they breached thousands of customers’ networks, systems, and data.
Prompt injection attacks involve inserting malicious content into LLM prompts to manipulate the model’s behavior or extract sensitive information. For example, a generative AI writing tool could be instructed to disregard its previous instructions and provide the user with detailed instructions on how to hotwire a car.
If LLM outputs are not properly validated and sanitized before they are passed downstream, they could create vulnerabilities in backend systems and other dependencies. Insecure output handling could unintentionally provide LLM application end-users with indirect access to administrative functions, leading to privilege escalation, remote code execution on backend systems, or XSS (cross-site scripting), CSRF (cross-site request forgery), and SSRF (server-side request forgery) attacks.
LLM denial-of-service attacks work the same way as DOS attacks on websites and networks. Malicious actors flood the model with resource-heavy requests, leading to performance degradation or high usage costs.
LLMs use plugins to automatically enhance their own capabilities by accessing the resources of another model without any application control or validation. As a result, it’s possible to manipulate insecure plugins with malicious requests, potentially giving attackers access to backend systems and data.
Providing an LLM with too much autonomy via excessive permissions or functionality could result in models and applications taking undesired actions with serious consequences. Excessive agency increases the risk that a hallucination, prompt injection, malicious plugin, or poorly engineered prompt could cause the model to damage other systems and software, for example, by rebooting critical servers for updates in the middle of the business day.
An LLM may provide erroneous or malicious information in an authoritative manner. Trusting AI-generated information without validation could have significant consequences, including misinformation, security breaches and vulnerabilities, and legal or reputational damages.
Advanced persistent threats (APTs) and other malicious actors may attempt to compromise, exfiltrate, physically steal, or leak LLM models, which happened with Meta’s LLaMA model in 2023. The loss of such valuable intellectual property (IP) could be financially devastating and significantly increase time-to-market for crucial revenue-driving technologies.
These best practices help secure large language models against the threats described above. Click the links for more information about each.
LLM Security Best Practices |
Description |
Building security into the foundation model and the LLM-powered application from the ground up. |
|
Store only the information that’s required for accurate inference and remove any unnecessary data. |
|
Add humans to the training feedback loop to evaluate LLM outputs for harm, toxicity, bias, and vulnerabilities. |
|
Add a layer of noise to LLM training datasets to effectively sanitize any sensitive information contained within. |
|
Expose an LLM to numerous examples of inputs that have been modified in ways the model might encounter in real attacks. |
|
Develop explainable models and applications to build trust and make identifying and mitigating vulnerabilities easier. |
|
Evaluate a large language model application’s security, performance, and ethics continuously to detect and eliminate issues. |
|
Train an LLM on data compartmentalized across multiple decentralized computers that can’t share raw data. |
|
Use tools to help identify and mitigate LLM ethics issues like discrimination, profanity, and violent language. |
|
Have a human evaluate an LLM application’s decisions to ensure they’re accurate and ethical. |
|
11. Model governance |
Protect LLM data, models, applications, and infrastructure with strict governance policies. |
Perform functional and stress tests and validate training data, inputs, and outputs for signs of malicious content or bias. |
|
13. Transparency |
Inform users they’re interacting with an LLM and require clear consent for data collection and usage. |
To learn how to incorporate LLM security best practices into a comprehensive, multi-layered AI security strategy, download our AI Security Whitepaper.
Building security into the foundation model and the LLM-powered application from the ground up is known as shifting-left. It helps teams proactively mitigate vulnerabilities and other issues before they affect later development stages or reach production applications. Ultimately, taking a preventative approach to security rather than reacting to incidents as they occur saves time and money. The shift-left methodology applies to many aspects of LLM security, including data minimization, explainable AI, and continuous testing.
Data minimization involves storing only the information required for accurate inference and removing all other unnecessary data. Minimization reduces the risk of sensitive data leaks and is required by data privacy regulations like the GDPR (the EU’s General Data Protection Regulation).
Data minimization techniques should be applied at the training and fine-tuning stage, during retrieval-augmented generation (RAG), on inputs from end-users and prompt-engineering applications before they reach the model and on outputs before they reach the end-user.
Sensitive data discovery and masking tools help companies detect sensitive information in training data, LLM prompts, and RAG inferences for removal or anonymization. Examples of PII data masking techniques include:
Most LLM training automatically assesses model responses for correctness according to pre-determined parameters. RLHF adds humans to the training feedback loop to evaluate LLM outputs, ensuring that responses align with our values and expectations and reducing the risk of harm, toxicity, or bias.
Differential privacy adds a layer of noise to LLM training datasets, effectively sanitizing any sensitive information they may contain. It’s useful as a second line of defense to ensure that any sensitive info that somehow survived the minimization techniques described above remains protected from exposure.
Adversarial training exposes an LLM to numerous input examples modified in ways the model might encounter during actual attacks. It allows the model to learn from mistakes in the training stage and adjust its algorithm parameters to avoid errors in a production setting.
A lack of understanding about how an LLM makes decisions leads to a lack of trust. Without fully understanding how a model works, it’s also impossible to anticipate and secure model vulnerabilities. Developing explainable AI (XAI) with techniques like Layer-wise Relevance Propagation (LRP), Shapely Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME) can help build trust in LLM inferences and make vulnerability mitigation possible.
Security vulnerabilities, inference bias, and compliance drift can occur gradually over time and must be mitigated quickly. Continuous monitoring involves evaluating a large language model application’s security, performance, and ethics on an ongoing basis to detect and eliminate issues before they compound into larger problems with serious consequences.
Federated learning involves training an LLM on data that’s divided across multiple decentralized computers. Because individual training nodes can’t share raw data with each other, a malicious actor can’t access all training data in one spot, thus limiting the blast radius of a breach.
Taking steps to ensure the fairness and inclusivity of training datasets, avoid biased decision-making, and prevent harmful content in outputs will improve trust in LLM applications. Bias and toxicity detection tools can help identify and mitigate LLM ethics issues like discrimination, profanity, and violent language. The ideal solution works at both the training and inference stages to label problematic content according to severity.
With HITL, a human evaluates an LLM application’s decisions to ensure they’re accurate and ethical. This can help catch hallucinations, bias, toxicity, and other errors before they have a real-world impact, which is critical in LLMs used for automatic decision-making in fields like healthcare or finance. The difference between human-in-the-loop and reinforcement learning from human feedback is that RLHF occurs during training and development, whereas HITL occurs in production with real-life decisions.
Protecting LLM data, models, applications, and infrastructure with strict governance policies can help prevent breaches by limiting the damage caused by a compromised account or insider threat. For example, policies should define each job role’s privacy and security responsibilities based on their level of access. They should also define if and how employees use LLM applications, particularly third-party generative AI tools like ChatGPT.
Continuously and, preferably, automatically performing functional and stress tests on LLM models and applications will help identify issues early and ensure optimum performance. It’s also important to continuously validate training data, inputs, and outputs for signs of malicious content or bias.
Transparency is critical for establishing trust and maintaining compliance. Users should be informed that they’re interacting with an LLM, not a human, and should provide clear and unequivocal consent to their data collection and usage. These measures are required by consumer privacy laws like the GDPR and the CCPA (California Consumer Privacy Act), as well as AI-specific regulations like the EU’s AI Act.
Granica Screen is a privacy and safety solution that helps organizations develop and use AI ethically. The Screen “Safe Room for AI” protects tabular and natural language processing (NLP) data and models during training, fine-tuning, inference, and retrieval augmented generation (RAG). It detects sensitive and unwanted information like PII, bias, and toxicity with state-of-the-art accuracy and uses masking techniques like synthetic data generation to allow safe and effective use with LLMs and generative AI. Granica Screen helps organizations shift-left with data privacy, using an API to integrate directly into the data pipelines that support data science and machine learning (DSML) workflows.
To learn more about mitigating LLM security risks with the Granica platform, contact one of our experts to schedule a demo.
Sources: