AI Development Blog & News | Granica AI

PII Data Masking Techniques Explained

Written by Granica | Jul 25, 2024 1:40:01 PM

User data is highly valuable, but collecting, storing, and using it for analytics and AI is inherently risky in today’s cybersecurity climate. High-profile breaches involving personally identifiable information (PII), such as the attacks on Ticketmaster and Evolve Bank & Trust, illustrate the dangers of storing sensitive data without adequate protection.

PII data masking first discovers and then removes or hides sensitive information in datasets, mitigating security and compliance risks while still allowing the data to be used for data analysis and AI/machine learning.

A variety of PII data masking techniques offer differing degrees of privacy and data usability. The main difficulty with PII discovery and masking, however, is accuracy: many tools have high rates of false positives and false negatives. False positives restrict data usage unnecessarily, lowering business value, while false negatives leave data open to leakage, creating business risk.
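To make the accuracy trade-off concrete, here is a minimal Python sketch that scores a detector's output against hand-labeled PII. It assumes spans are (start, end) character offsets and counts only exact matches, which is a simplification. The function name `score_detection` and the sample spans are hypothetical and not taken from any particular tool.

```python
def score_detection(predicted, actual):
    """Compare predicted PII spans with hand-labeled spans.

    Both arguments are collections of (start, end) character offsets.
    Only exact span matches count as correct (a simplifying assumption).
    """
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)  # non-PII masked unnecessarily
    false_neg = len(actual - predicted)  # PII left exposed
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(actual) if actual else 1.0
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


text = "My name is Fred Johnson and I live in Austin."
actual = {(11, 23), (38, 44)}    # "Fred Johnson", "Austin"
predicted = {(11, 23), (0, 2)}   # finds the name, wrongly flags "My", misses "Austin"

result = score_detection(predicted, actual)
print(result)
```

In this example the detector has one false positive and one false negative, so both precision and recall are 0.5. Low precision corresponds to over-masking, and low recall corresponds to leaked PII.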

This blog explains the most common and effective data masking techniques, then briefly compares PII data masking tools to help you protect user privacy with high accuracy.

PII data masking techniques explained

1. Redaction

  • Description: Removing PII without replacing it with anything
  • Example: “My name is Fred Johnson” becomes “My name is”

2. Replacement

  • Description: Replacing PII with a fixed value
  • Example: “My name is Fred Johnson” becomes “My name is [REDACTED]”

3. Size-preserving replacement

  • Description: Replacing PII with a value of equal length
  • Example: “My name is Fred Johnson” becomes “My name is XXXX XXXXXXX”

4. Named/numbered replacement

  • Description: Replacing PII with an identifying label
  • Example: “My name is Fred Johnson” becomes “My name is [FIRSTNAME1] [SURNAME1]”

5. Encryption

  • Description: Replacing PII with an encrypted value
  • Example: “fjohnson@email.ai” becomes “[EMAIL_m&3s85+;sdfm]”

6. Format-preserving encryption

  • Description: Replacing PII with an encrypted value in the original format
  • Example: “fjohnson@email.ai” becomes “le4ds&cd@nedf.op”

7. Synthetic data replacement

  • Description: Replacing PII with a similar synthetic value of the same type
  • Example: “My name is Fred Johnson” becomes “My name is Lenny Smith”

All of the PII data masking techniques listed above sanitize data so that it can be used safely for data analysis, generative AI, and other data-heavy applications. However, the first six methods can limit how much information can be inferred from the masked data, potentially degrading an AI model’s inference abilities or producing inaccurate business intelligence.
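The first four techniques differ only in what is written back over each detected span, which a short Python sketch can illustrate. It assumes a detector has already returned non-overlapping (start, end, label) spans. The function name `mask`, the technique names, and the labels are hypothetical choices for illustration.

```python
def mask(text, spans, technique):
    """Apply one masking technique to already-detected PII spans.

    spans: list of (start, end, label) tuples, assumed non-overlapping.
    """
    ids = {}    # (label, original value) -> number, so repeats share a label
    edits = []
    for start, end, label in sorted(spans):
        original = text[start:end]
        if technique == "redaction":
            new = ""
        elif technique == "replacement":
            new = "[REDACTED]"
        elif technique == "size_preserving":
            # Keep whitespace so word boundaries and length are unchanged.
            new = "".join(ch if ch.isspace() else "X" for ch in original)
        elif technique == "named":
            key = (label, original)
            if key not in ids:
                ids[key] = 1 + sum(1 for seen_label, _ in ids if seen_label == label)
            new = f"[{label}{ids[key]}]"
        else:
            raise ValueError(f"unknown technique: {technique}")
        edits.append((start, end, new))
    # Apply edits right to left so earlier offsets stay valid.
    for start, end, new in reversed(edits):
        text = text[:start] + new + text[end:]
    return text


text = "My name is Fred Johnson"
full_name = [(11, 23, "NAME")]
name_parts = [(11, 15, "FIRSTNAME"), (16, 23, "SURNAME")]

print(mask(text, full_name, "replacement"))      # My name is [REDACTED]
print(mask(text, full_name, "size_preserving"))  # My name is XXXX XXXXXXX
print(mask(text, name_parts, "named"))           # My name is [FIRSTNAME1] [SURNAME1]
```

Redaction with the same span returns “My name is ” with a trailing space, since only the name itself is removed. The encryption-based techniques are not shown here because they depend on a cryptographic library and key management.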

Synthetic data replacement, on the other hand, provides realistic information for model training, data analysis, and generative AI. The named/numbered replacement approach also preserves context, which is helpful for AI/ML, but synthetic data looks more “real” to the model. Of the seven techniques, synthetic data replacement is the best way to ensure the quality and accuracy of downstream data processes while protecting user privacy and mitigating security and compliance risks.
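As a rough illustration of synthetic data replacement, the Python sketch below swaps each detected name for a value drawn from a small pool of the same type. It uses a keyed hash so the same original always maps to the same synthetic value, which is one possible design and not necessarily how any product works. The pools, the `secret` key, and the function names are all hypothetical; a real system would use far larger pools and would need to handle the case where the original value itself appears in the pool.

```python
import hashlib
import hmac

# Tiny illustrative pools; real systems would draw from much larger sets.
POOLS = {
    "FIRSTNAME": ["Lenny", "Maria", "Priya", "Tomas"],
    "SURNAME": ["Smith", "Okafor", "Nguyen", "Garcia"],
}


def synthetic_value(original, label, secret=b"demo-secret"):
    """Pick a synthetic value of the same type, deterministically per input."""
    message = f"{label}:{original}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    pool = POOLS[label]
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def replace_with_synthetic(text, spans):
    """Replace each (start, end, label) span with a synthetic value."""
    # Work right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + synthetic_value(text[start:end], label) + text[end:]
    return text


text = "My name is Fred Johnson"
spans = [(11, 15, "FIRSTNAME"), (16, 23, "SURNAME")]
masked = replace_with_synthetic(text, spans)
print(masked)
```

The output keeps the shape of a natural sentence with a plausible first name and surname, which is what makes this technique look more “real” to a model than a bracketed label.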

Comparing PII data masking tools

Granica

Capabilities:

• PII data discovery, classification, and masking
• Large-scale data lake privacy
• Real-time LLM prompt privacy

Pros and cons:

✔ State-of-the-art accuracy for named entity recognition (NER) from PII to custom fields across any text/tabular data
✔ Extensive support for 100+ languages across 20+ regions, recognizing 80+ types of global PII
✔ Unified platform for comprehensive data privacy from training to inference
✔ Highly compute-efficient for low-cost scanning of large-scale AWS and Google Cloud data lakes
✔ Real-time performance to protect LLM prompt inputs
✔ Deployed in customers’ VPC, ensuring information never leaves the customers’ environment
✘ Technical and CLI/API-oriented with a limited GUI

Nightfall AI

Capabilities:

• PII data discovery and masking for SaaS, genAI, email, and endpoints
• Data loss prevention (DLP)
• SaaS data privacy posture management

Pros and cons:

✔ Streamlined, easy-to-use platform
✔ Excellent sales and technical support
✘ Notifications can be noisy
✘ Performance of some detection services could be improved

Private AI

Capabilities:

• On-premises PII data discovery
• Data masking

Pros and cons:

✔ Private AI’s PII data discovery is highly accurate
✔ The user interface is easy to use
✘ High compute requirements drive up infrastructure costs
✘ Data sampling techniques create security concerns

Satori

Capabilities:

• PII discovery and data masking
• Data access control
• Data audits and monitoring

Pros and cons:

✔ Provides robust security and privacy features
✔ The platform is intuitive with easy integrations
✘ Platform performance can be slow
✘ Inbound and outbound data transfers are also slow

K2view

Capabilities:

• PII data discovery and masking
• Data pipelining
• Master data management

Pros and cons:

✔ Easy data integrations
✔ Has an extensive data management feature set
✘ Platform has a steep learning curve
✘ Pricing is high compared to similar tools

Some of the most important qualities to look for in a PII data discovery and masking solution include:

  • Highly accurate named entity recognition (NER) to reduce the rate of false positives and negatives, prevent data from being screened unnecessarily, and ensure all sensitive data is protected.

  • Accurate NER for different types of PII in multiple languages. This will ensure that no private data is left unprotected, even if it’s in an unusual format or a non-Latin alphabet.

  • Compute-efficient data scanning algorithms. Lightweight scanning tools will use fewer cloud or data center resources, making them less expensive to run and allowing companies to protect more data, especially large-scale training data sets.

  • Real-time PII scanning and masking at LLM inference time. This will eliminate any delays in protection, for example, when masking PII in LLM prompts and RAG before data reaches the genAI model.

  • Secure data scanning techniques. Ideally, discovery and masking should both occur within the customer’s cloud environment (e.g., a VPC) to ensure sensitive information stays safe even if the vendor is breached.
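The real-time requirement in the list above amounts to placing a masking step directly in the request path, so the model only ever receives sanitized text. The Python sketch below shows that placement under heavy simplification: a single regular expression for email addresses stands in for a real NER model, and `fake_model` is a hypothetical stand-in for an LLM client.

```python
import re

# Simplified email pattern used as a stand-in for a real PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_prompt(prompt):
    """Replace each distinct email address with a numbered label."""
    seen = {}

    def _label(match):
        value = match.group(0)
        if value not in seen:
            seen[value] = f"[EMAIL{len(seen) + 1}]"
        return seen[value]

    return EMAIL_RE.sub(_label, prompt)


def guarded_call(prompt, model):
    """Mask the prompt first, so raw PII never reaches the model."""
    return model(mask_prompt(prompt))


def fake_model(prompt):
    # Hypothetical stand-in for an LLM client; it just echoes its input.
    return f"model saw: {prompt}"


reply = guarded_call("Draft a reply to fjohnson@email.ai about the refund.", fake_model)
print(reply)  # model saw: Draft a reply to [EMAIL1] about the refund.
```

Because the masking function runs on every request, its latency adds directly to response time, which is why compute efficiency and real-time performance appear together in the criteria above.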

PII data masking with Granica Screen

Granica Screen provides PII data masking with synthetic data replacement for AI and LLMs. It runs as a lightweight software agent within end-customer data lake and lakehouse environments, protecting PII in tabular and natural language processing data without ever removing data from the environment. Granica offers state-of-the-art NER accuracy and real-time PII protection to ensure user privacy and AI safety from training and fine-tuning to inference.

Get an interactive demo to see Granica Screen’s PII data masking techniques in action.