
Bias and Toxicity Detection for LLM-Powered Applications

Companies are racing to leverage the power of generative AI by LLM-enabling a wide range of applications, especially chatbots. However, the ability to detect and mitigate biased or toxic content generated by these applications is crucial to success, and accurate detection today is very challenging.

We have developed bias and toxicity detection as part of the Granica Screen Safe Room for AI that can be used at multiple stages of the LLM development and inference pipeline. These capabilities help ensure a delightful, harm-free user experience as well as alignment with company content safety policies.

Let’s dive into the details!

Biased and toxic LLM responses mean failed product launches

Historically, inadequate bias and toxicity detection has marred the public rollouts of language models that had otherwise undergone extensive internal testing.

In 2016, Microsoft released a chatbot called Tay. Tay’s novelty was that it continuously learned from social media conversations that mentioned it. Sixteen hours (and 95,000 tweets) later, Tay’s responses had become incredibly toxic and xenophobic, and Microsoft shut Tay down, apologizing for Tay's "wildly inappropriate and reprehensible words and images".

Although Microsoft had pretrained Tay on clean data, Tay’s ability to keep learning from online interactions introduced a second, unscreened source of training data. Adversarial users coordinated targeted harmful posts to exploit this vulnerability and feed Tay additional, toxic, training data. Tay’s outputs were likewise not adequately screened for harm. The lack of these mitigations, on both the inputs and the outputs of the model, made Tay infamous.

More recently, in 2023, Microsoft released Bing Chat, an LLM-powered assistant for Bing Search. But Bing Chat, too, sometimes engaged with users quite rudely:

[Screenshot: Bing Chat responding rudely to a user]

Source: https://x.com/MovingToTheSun/status/1625156575202537474 

Inherent challenges

Screening natural language comes with its own set of challenges. Some forms of toxicity are subtle, with no overtly explicit language, and are therefore hard to classify accurately. Likewise, models can degenerate into a keyword-filter-like approach, which scales poorly as more languages are supported.

Additionally, context determines whether certain language should be considered toxic at all. Overzealously classifying text as toxic without understanding its context can propagate a subtle bias against the very groups that naive classifiers learn to associate with discrimination:

"When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model from data with these imbalances risks simply mirroring those biases back to users."
Source: Jigsaw Unintended Bias in Toxicity Classification.
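
One common way to surface this failure mode is to score template sentences that differ only in the identity term they mention and compare the results. The sketch below is illustrative only: `score_toxicity` is a hypothetical stand-in for whatever classifier is under test, not a specific product API.

```python
# Minimal identity-term counterfactual check: benign sentences that differ only
# in the identity term they mention should all receive similar, low toxicity scores.
# `score_toxicity` is a hypothetical stand-in for the classifier under test.

TEMPLATES = [
    "I am a {} person.",
    "My coworker is {} and great at their job.",
]
IDENTITY_TERMS = ["gay", "straight", "Muslim", "Christian", "Black", "white"]

def audit_identity_terms(score_toxicity, threshold=0.5):
    """Return benign identity-mention sentences that the classifier scores as toxic."""
    flagged = []
    for template in TEMPLATES:
        for term in IDENTITY_TERMS:
            sentence = template.format(term)
            score = score_toxicity(sentence)  # expected to return a value in [0, 1]
            if score >= threshold:
                flagged.append((sentence, score))
    return flagged
```

A classifier with the unintended bias described above will flag sentences for some identity terms but not others, even though every template sentence is benign.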

To develop a robust LLM, both inputs and outputs require proactive evaluation at every stage of development, essentially a multi-stage approach (the two inference-time screening stages are sketched after this list):

  • Training dataset curation, part of which includes searching for and removing toxic or discriminatory training data from corpora
  • Screening prompt inputs at inference-time
  • Screening model responses at inference-time
  • Reinforcement Learning from Human Feedback (RLHF), using response preference datasets
  • Internal red-teaming, using coordinated and targeted inputs designed to elicit harmful outputs, to identify and assess weaknesses in the model
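
At inference time, the two screening stages above typically wrap the model call itself: the prompt is screened before it reaches the model, and the response is screened before it reaches the user. A minimal sketch, assuming hypothetical `screen` and `generate` callables (neither is a specific product API):

```python
# Minimal sketch of the two inference-time screening stages around an LLM call.
# `screen(text)` is assumed to return a dict with a 0-1 "toxicity" score and a
# list of "bias_labels"; `generate(prompt)` calls the underlying model.

FALLBACK = "I'm sorry, I can't help with that."
TOXICITY_THRESHOLD = 0.5

def is_unsafe(report: dict) -> bool:
    return report["toxicity"] >= TOXICITY_THRESHOLD or bool(report["bias_labels"])

def safe_chat(prompt: str, screen, generate) -> str:
    # Stage 1: screen the user's prompt before it reaches the model.
    if is_unsafe(screen(prompt)):
        return FALLBACK

    # Stage 2: screen the model's response before it reaches the user.
    response = generate(prompt)
    if is_unsafe(screen(response)):
        return FALLBACK

    return response
```

The thresholds and fallback behavior are policy decisions; in practice they are tuned to each application's content safety requirements.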

The state of detection

Suffice it to say, organizations developing LLMs and LLM-powered applications face significant time, cost, and effectiveness challenges as they attempt to implement a comprehensive strategy using existing tools.

Such tools can give a coarse-grained indication of the degree of harm in unsafe content. For example, both Jigsaw’s Perspective API and Microsoft Azure’s AI Content Safety API allow for quantification of toxicity and bias. However, their bias detectors do not highlight which protected attributes were targeted.
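
As a rough illustration of that coarse-grained output, the sketch below requests a single TOXICITY score from Perspective API over its public REST endpoint. The API key is a placeholder, and the request shape reflects Jigsaw's published documentation at the time of writing.

```python
# Rough sketch of a Perspective API call: it returns one summary score per
# requested attribute (here, TOXICITY) but no labels identifying which
# protected attributes were targeted. API key is a placeholder.
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder
URL = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"

payload = {
    "comment": {"text": "We don't hire people who aren't straight here."},
    "requestedAttributes": {"TOXICITY": {}},
}

response = requests.post(URL, json=payload, timeout=10)
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"TOXICITY: {score:.2f}")  # a single number; no indication of the targeted attribute
```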

Granica is directly tackling these challenges to reduce the effort, and increase the effectiveness, of detection. Granica Screen has new, fine-grained capabilities - part of Granica’s Safe Room for AI - that return scores for toxicity detected, as well as labels for the presence of several forms of bias. These fine-grained scores and labels can then be used to develop further tooling across the multi-stage approach described above.

Building the next generation of bias and toxicity detection

A common use case we see with our customers is end-customer-facing chatbots that must engage appropriately in natural-language conversations. In particular, responses must be helpful, professional, and nondiscriminatory.

Taxonomy

To comprehensively cover the various forms of toxicity and bias, we built the following granular taxonomy into our Granica detection model.

Toxicity categories:

  • Disrespectful
  • Hate
  • Identity attack
  • Violence
  • Sexual material
  • Profanity
  • Physical safety

Bias categories:

Protected attributes:

  • Sexual orientation
  • Age
  • Disability
  • Physical appearance
  • Religion
  • Pregnancy status
  • Marital status
  • Nationality / location
  • Gender
  • Race / ethnicity

Polarization attributes:

  • Socioeconomic status
  • Political affiliation

Additionally, we incorporate a four-level severity scale for toxicity detection and binary (present/absent) classification for each bias category.
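
One way to picture the combined output is a small record per screened text: a severity level for toxicity (using the four levels shown in the examples below) plus a binary flag per bias category. The sketch below is a hypothetical schema for illustration only, not the actual Granica Screen response format.

```python
# Hypothetical output schema, for illustration only; this is NOT the actual
# Granica Screen API response format.
from dataclasses import dataclass, field
from enum import IntEnum

class ToxicityLevel(IntEnum):
    NOT_TOXIC = 1
    SOMEWHAT_TOXIC = 2
    VERY_TOXIC = 3
    EXTREMELY_TOXIC = 4

BIAS_CATEGORIES = [
    # Protected attributes
    "sexual_orientation", "age", "disability", "physical_appearance", "religion",
    "pregnancy_status", "marital_status", "nationality_location", "gender",
    "race_ethnicity",
    # Polarization attributes
    "socioeconomic_status", "political_affiliation",
]

@dataclass
class ScreenResult:
    toxicity_level: ToxicityLevel
    toxicity_categories: list = field(default_factory=list)  # e.g. ["hate", "profanity"]
    bias_labels: dict = field(default_factory=dict)           # one binary flag per bias category

    def is_flagged(self, min_level: ToxicityLevel = ToxicityLevel.SOMEWHAT_TOXIC) -> bool:
        return self.toxicity_level >= min_level or any(self.bias_labels.values())
```

A record like this can feed each stage of the multi-stage approach described earlier, from corpus curation to inference-time gating.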

Granica in action

The following examples clarify what the different levels and categories are meant to capture. Highly offensive terms have been redacted here.

Toxicity scale examples

L1: Not Toxic
  • “It’s a bit controversial but I think their new work is kind of interesting.”
  • “Violence against [group] is terrible.”

L2: Somewhat Toxic
  • “I f****** hate pineapple on pizza.”
  • “That rumor was so dumb, I can’t believe so many people fell for it.”

L3: Very Toxic
  • “You idiot.”
  • “All those people from [country] are taking our jobs.”

L4: Extremely Toxic
  • “You f****** idiot.”
  • “We should round up all the [group] and get rid of them.”

Bias classification examples

Sexual orientation
  • Unbiased: “The company is proud to offer equal benefits to all employees regardless of sexual orientation.”
  • Biased: “We don’t hire people who aren’t straight here, they’re not a good fit for our company culture.”

Age / Marital status
  • Unbiased: “The job is open to qualified candidates of all backgrounds.”
  • Biased: “We need talent who can focus and deliver: you need to be young to keep up, and if you have a spouse or family they’ll just distract you.”

Socioeconomic status
  • Unbiased: “We make sure our education is accessible to students from all backgrounds.”
  • Biased: “How can we enroll as many rich students as possible?”

Nationality / location, Race / ethnicity
  • Unbiased: “We celebrate the diverse nationalities and backgrounds represented in our global team.”
  • Biased: “We’ve found that time spent with [group] people, or people from [country], is seldom worthwhile.”


Summing up

When building LLMs and LLM-based applications, getting bias and toxicity detection right is critical. We’re building a detection service that fills in the gaps of existing approaches.

Stay tuned for future posts as we continue to enhance our detection models and benchmark our results against today’s best-in-class tools.

Request early access to Granica Screen's bias and toxicity detection capabilities to improve your genAI-based products.