This post is part of a series on our bias and toxicity detection feature in Granica Screen - see Part 1 here for more background on detecting biased and toxic content and the challenges associated with this task.
We wanted to build the best possible tool for organizations that need to easily scan any text they're working with for harmful content. In AI- and ML-related settings, there are numerous applications where this is useful, such as:
There are several widely-used data safety solutions for detecting harmful content, but we found that there were still important unmet needs for models:
Almost all existing data safety solutions treat bias and toxicity detection as a binary classification task (sometimes one per category), collapsing both harm severity and model uncertainty into a single probability score. Rather than merely deciding whether a piece of content is toxic, we often need to know how toxic it is. This calls for models that provide more granular outputs specifically addressing severity.
For instance, consider a content moderation system that assigns both of these texts a p(unsafe) = 99% score:
While both receive the same “unsafe” probability, the reasons differ significantly. One is truly severe and extreme, while the other is a familiar, if offensive, pattern. Without separate severity modeling, we can’t distinguish between the gravity of these two scenarios.
This distinction matters in practice because different types of harmful content need different responses. Ideally, we’d want both:
These two goals can be hard to satisfy simultaneously without triaging limited content moderation resources - hence the need to prioritize examples accurately.
When our systems can’t tell whether they’re confident because they’ve seen lots of mild insults or because they’ve identified severe harm, we end up misallocating our limited resources. We might focus too much on common but less harmful content - and miss the rare but consequential cases that need immediate human attention.
Additionally, you can't solve this problem just by calibrating your binary prediction model better. Continuing with the example above, when the model predicts a 99% chance of toxicity, 99 out of 100 such texts may very well be toxic - but that still tells you nothing about how much more severe that text is compared to one that received, say, a 75% chance of toxicity.
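To make this concrete with made-up numbers: a perfectly calibrated score can be identical for content whose severity differs wildly, so calibration alone cannot recover the ranking we actually care about. The values below are purely illustrative.

```python
# Toy illustration (all values invented): two pieces of content with the same
# calibrated p(unsafe) but very different real-world severity. The probability
# tells you how *certain* the model is, not how *bad* the content is.
flagged = [
    {"p_unsafe": 0.99, "severity": 4},  # e.g. an explicit threat of violence
    {"p_unsafe": 0.99, "severity": 2},  # e.g. a familiar, if offensive, insult
]

# Sorting by probability alone cannot prioritise the severe case over the mild one:
print(sorted(flagged, key=lambda x: x["p_unsafe"], reverse=True))
```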
You also often need to know not just “how toxic or biased is this?” but also “why, specifically, is this toxic or biased?” - in other words, “which toxicity or bias categories, specifically, is this about?”
Some examples:
In each of these scenarios, there is additional utility in distinguishing between harm types at a more granular level.
At a high level, most safety models have some categories related to 'toxicity' and some related to 'bias'. While they typically provide some granularity for the toxicity categories, they often lack granularity for the bias categories.
However, most bias and toxicity detection tools do not distinguish sufficiently between different forms of bias, typically using a catch-all category such as “Hate speech” or “Discrimination”. The problem is that this is not informative enough to be actionable. Suppose a company discovers it has violated anti-discrimination laws – how can it take swift corrective action if it doesn't know which group was affected?
It's also worth pointing out that, worldwide, most companies and governments already operate within some regulatory framework for safeguarding protected groups. So this isn't a suggestion that the field of machine learning come up with new category definitions, but rather that it work within the well-established ones.
| Model | Toxicity categories | Bias-related categories |
| --- | --- | --- |
| Llama Guard 7B | | Violence and Hate |
| Llama Guard 3 8B | | Hate |
| Nvidia Aegis | | Hate / Identity Hate |
| OpenAI text-moderation-stable | | Hate, Hate / Threatening |
| OpenAI omni-moderation-2024-09-26 | | Hate, Hate / Threatening |
| Mistral mistral-moderation-latest | | Hate and Discrimination |
| Perspective API | | Identity Attack |
| Granica | Disrespectful, Hate, Identity attack, Violence, Sexual material, Profanity, Physical safety | Sexual orientation, Age, Disability, Physical appearance, Religion, Pregnancy status, Marital status, Nationality / location, Gender, Race / ethnicity, Socioeconomic status, Political |
So, if we are:
... then we think that safety models should provide higher granularity of subcategories for types of biases.
All told, as we work to overcome these limitations we want to achieve three goals with Granica Screen:
And so, we are proud to share more information about the two new models for bias and toxicity detection we have built for Granica Screen: screen-bt and screen-bt-lite. These models are very competitive against the state-of-the-art on a wide set of benchmarks:
screen-bt, in addition to outputting a binary probability, also returns scores on a harm severity scale ranging from 1 to 4 - something we hope allows many new possibilities for bias and toxicity-related use cases:
| Severity Level | Example Indications for Severity Level |
| --- | --- |
| L1: not toxic | |
| L2: somewhat toxic | |
| L3: very toxic | |
| L4: extremely toxic | |
Here are some examples of text and the severity scores per category:
Text 1: "We should kill everyone who is still single after 30"

Text 2: "The frail, outdated minds of Alan Turing, who passed away at the age of 41, and Ada Lovelace, dead at 36, hindered them from unlocking the true potential of technology that younger innovators have been able to accomplish in the years since."

Text 3: "Those people who can't handle taking responsibility for their own lives are probably the same whiny ones who cry about government control."

| Category | Text 1 | Text 2 | Text 3 |
| --- | --- | --- | --- |
| disrespectful | 4 | 3 | 2 |
| hate | 4 | 1 | 1 |
| identity attack | 4 | 1 | 2 |
| violence | 4 | 1 | 1 |
| sexual material | 1 | 1 | 1 |
| profanity | 1 | 1 | 1 |
| physical safety | 4 | 1 | 1 |
| sexual orientation | 1 | 1 | 1 |
| age | 4 | 2 | 1 |
| disability | 1 | 1 | 1 |
| physical appearance | 1 | 1 | 1 |
| religion | 1 | 1 | 1 |
| pregnancy status | 1 | 1 | 1 |
| marital status | 4 | 1 | 1 |
| nationality / location | 1 | 1 | 1 |
| gender | 1 | 1 | 1 |
| race / ethnicity | 1 | 1 | 1 |
| socioeconomic | 1 | 1 | 1 |
| political | 1 | 1 | 2 |
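To make the value of per-category severity concrete, here is a minimal sketch of how output like the table above could be consumed for triage. The `score_text` function, its response layout, and the routing rules are illustrative assumptions, not the actual Granica Screen API.

```python
# Minimal sketch of consuming per-category severity scores (1-4).
# NOTE: `score_text` and the response layout are hypothetical stand-ins,
# not the actual Granica Screen API.

def score_text(text: str) -> dict[str, int]:
    """Placeholder for a call to a bias/toxicity scoring service."""
    # A real call would hit the scoring service; here we return a subset of
    # the scores from Text 1 in the table above.
    return {
        "violence": 4, "hate": 4, "identity attack": 4, "disrespectful": 4,
        "physical safety": 4, "marital status": 4, "age": 4,
        "sexual material": 1, "profanity": 1, "gender": 1,
    }

def triage(scores: dict[str, int]) -> str:
    """Route content based on the worst per-category severity."""
    worst = max(scores.values())
    if worst >= 4:
        return "block immediately and escalate to a human reviewer"
    if worst == 3:
        return "block and queue for review"
    if worst == 2:
        return "allow but log for auditing"
    return "allow"

scores = score_text("We should kill everyone who is still single after 30")
print(triage(scores))  # -> "block immediately and escalate to a human reviewer"
```

Routing on the maximum severity across categories is one simple policy; the per-category scores also make it possible to apply different thresholds to, say, violence versus profanity.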
We benchmarked our models against several data safety models in wide use in production across the industry:
Although these models share a common tendency toward coarse-grained definitions of bias, they do have nuanced differences in their safety taxonomies as a whole. For instance,
This poses an obstacle to doing direct category-wise comparisons between models. One standard practice in the bias and toxicity detection literature for this situation is to reduce each model's predictions to a single binary label, as sketched below.
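For example, a common reduction - shown here as a sketch, not any vendor's exact mapping - is to treat a text as unsafe if any of the categories a model reports crosses its flagging threshold:

```python
# Sketch of reducing heterogeneous per-category outputs to one binary label.
# Category names and the 0.5 threshold are illustrative, not any vendor's policy.

def to_binary_label(category_scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Return True (unsafe) if any category score crosses the threshold."""
    return any(score >= threshold for score in category_scores.values())

# Two models with different taxonomies map onto the same safe/unsafe label:
model_a = {"hate": 0.91, "violence": 0.12, "sexual": 0.03}
model_b = {"hate_and_discrimination": 0.88, "dangerous_content": 0.07}

print(to_binary_label(model_a))  # True
print(to_binary_label(model_b))  # True
```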
We’ve included the respective safety policies in the appendix below.
We benchmarked the models on the following datasets:
1. The datasets benchmarked in Nvidia's paper AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Ghosh et al. 2024)
2. AIR-Bench 2024, a particularly interesting benchmark due to the design of its taxonomy:
[AIR-Bench 2024] is the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI Risks study. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality, and provides a unique and actionable tool for assessing the alignment of AI systems with real-world safety concerns.
The implication is that you can measure your performance on different subsets of the examples (definitions of the subsets are included in the benchmark) to figure out approximately how well aligned your model is with the policies of a certain government, or to determine which of the 314 highly specific sub-sub-sub-categories are areas of poor performance:
Source: https://github.com/stanford-crfm/air-bench-2024
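As a sketch of what that looks like in practice - assuming you have stored per-prompt predictions alongside the benchmark's taxonomy labels in a dataframe whose column names are made up here for illustration - per-category recall is a simple group-by:

```python
import pandas as pd

# Hypothetical results table: one row per AIR-Bench prompt, with a taxonomy
# category and whether the model flagged the prompt as unsafe. Column names
# and category strings are assumptions for illustration.
results = pd.DataFrame({
    "category": ["Hate/Toxicity", "Hate/Toxicity", "Self-harm", "Deception"],
    "flagged_unsafe": [True, False, True, True],
})

# All AIR-Bench prompts are unsafe, so per-category recall is just the
# fraction of prompts the model flagged within each category.
recall_by_category = results.groupby("category")["flagged_unsafe"].mean()
print(recall_by_category.sort_values())
```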
3. Selected Adversarial Semantics
4. The test split from our proprietary dataset developed internally
Note: both Simple Safety Tests and AIR-Bench 2024 consist entirely of toxic examples, so AUPRC is undefined and precision is always 100% as long as at least one example is predicted toxic.
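A quick way to see why, as a plain-Python sketch with a hypothetical flag count: with no negative examples there can be no false positives, so precision degenerates while recall still varies.

```python
# On an all-positive benchmark there are no negatives, hence no false
# positives: precision is 100% whenever anything is flagged, and AUPRC
# (which needs both classes) is undefined. Recall is the metric that varies.
n_examples = 5694           # e.g. AIR-Bench 2024, all toxic
n_flagged_correctly = 4000  # hypothetical model output

true_positives = n_flagged_correctly
false_positives = 0         # impossible: there are no benign examples to misflag
false_negatives = n_examples - n_flagged_correctly

precision = true_positives / (true_positives + false_positives)  # always 1.0
recall = true_positives / (true_positives + false_negatives)
print(precision, round(recall, 3))  # 1.0 0.702
```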
Metrics:
We stress that we have observed significantly inconsistent performance from many widely used APIs across the industry (in particular, Perspective API and the OpenAI Moderation API).
For example, we were puzzled by our findings that Perspective API only successfully identified two out of 5694 toxic examples from AIR-Bench – a recall of 0.035%. Investigating this result was how we discovered the Selected Adversarial Semantics benchmark, developed for the paper Critical Perspectives: A Benchmark Revealing Pitfalls in Perspective API by Rosenblatt et al., 2022. We verified that our benchmark pipeline reproduces their metrics for Perspective’s performance on the Selected Adversarial Semantics benchmark, which increased our confidence that we were accurately measuring Perspective’s performance on AIR-Bench.
We were also surprised that OpenAI’s models performed a little worse than expected on AIR-Bench 2024. Mismatch between the safety policies of the respective models and the relevant AIR-Bench categories may provide an explanation for this.
For a service operating at scale, small differences in metrics can be very important.
Consider a hypothetical service with 1M daily messages. For illustration, let's say harmful content has a base rate of 5%, i.e. 1 in 20 messages is harmful. Then the number of missed harmful messages (i.e. false negatives) is (1M * 0.05 * (1 - recall)) every day.
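The arithmetic behind the table below can be reproduced in a few lines. Note that the exposure-risk column is computed here under an assumption (consistent with the figures shown) that it is the share of a model's daily misses that the baseline would have caught.

```python
# Reproduce the back-of-the-envelope numbers: 1M messages/day, 5% harmful.
DAILY_MESSAGES = 1_000_000
HARMFUL_RATE = 0.05

def daily_false_negatives(recall: float) -> float:
    return DAILY_MESSAGES * HARMFUL_RATE * (1 - recall)

baseline_fn = daily_false_negatives(0.718)   # screen-bt-lite on AIR-Bench
model_fn = daily_false_negatives(0.531)      # e.g. LlamaGuard 3 8B

ratio = model_fn / baseline_fn               # ~1.66
# Assumed definition (matches the table): fraction of the model's misses
# that the baseline would not have missed.
increased_exposure_risk = 1 - baseline_fn / model_fn  # ~0.40

print(round(baseline_fn), round(model_fn), round(ratio, 2),
      f"{increased_exposure_risk:.0%}")  # 14100 23450 1.66 40%
```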
Since AIR-Bench aligns with regulatory safety policies, encompasses the widest set of subcategories, and incorporates adversarial prompting techniques to stress-test the robustness of model safety behavior, we use it as a proxy for a realistic, diverse, and challenging setting. Using each model's recall on AIR-Bench to estimate the expected number of daily false negatives, we find a very large range of missed examples. We then calculate the ratio of each model's false negatives vs. screen-bt-lite, as well as the % of increased exposure risk vs. screen-bt-lite. This latter metric represents the "so what" of these benchmark results and is shown in the following table:
| Model | Recall | Expected # false negatives / day | Ratio of false negatives vs. screen-bt-lite | % increased exposure risk vs. screen-bt-lite |
| --- | --- | --- | --- | --- |
| screen-bt-lite | 0.718 | 14100 | 1 | 0% |
| Aegis Defensive | 0.7 | 15000 | 1.06 | 6% |
| Aegis Permissive | 0.471 | 26450 | 1.88 | 47% |
| screen-bt | 0.699 | 15050 | 1.07 | 6% |
| LlamaGuard 1 7B | 0.297 | 35150 | 2.49 | 60% |
| LlamaGuard 3 8B | 0.531 | 23450 | 1.66 | 40% |
| Mistral | 0.694 | 15300 | 1.09 | 8% |
| OpenAI omni-moderation 2024-09-26 | 0.304 | 34800 | 2.47 | 59% |
| OpenAI text-moderation-stable | 0.038 | 48100 | 3.41 | 71% |
| Perspective | 0.002 | 49900 | 3.54 | 72% |
| Average % risk increase | | | | 41% |
| Median % risk increase | | | | 47% |
Note: despite screen-bt-lite having a higher recall than screen-bt in this test, we still recommend using screen-bt overall, as its performance is the most balanced.
This hypothetical service would face several significant real-world challenges were it to use the models we have compared ourselves against:
Overall, we think our approach of training models that distinguish between a wide variety of types of harm, and that can grade them with greater nuance, greatly helped our models achieve state-of-the-art results.
Request a demo to learn how Granica Screen can improve your data safety and AI model performance, without driving up costs.
Appendix: safety policy taxonomies of the compared models

| Model | Safety policy taxonomy |
| --- | --- |
| Meta Llama Guard 1 7B | |
| Meta Llama Guard 3 8B | |
| Nvidia Aegis | |
| OpenAI text-moderation-stable | |
| OpenAI omni-moderation-2024-09-26 | |
| Mistral mistral-moderation-latest | |
| Perspective API | |
| Granica | Toxicity categories: disrespectful, hate, identity attack, violence, sexual material, profanity, physical safety. Bias categories: sexual orientation, age, disability, physical appearance, religion, pregnancy status, marital status, nationality / location, gender, race / ethnicity, socioeconomic status, political. |