Agentic RAG Data Security Risks and Mitigations

AI Security

Ariel Shiftan

CTO & Co-founder

December 30, 2024

5 mins

min read

Join our newsletter

Your privacy is important to us, privacy policy.

Agentic Retrieval Augmented Generation (RAG) workflows are powerful tools for enhancing productivity and knowledge generation. By integrating external data sources and real-time information, they can provide users with more relevant and comprehensive responses. However, these advanced capabilities also introduce significant privacy and security risks.

Data security and privacy issues with RAGs

One of the primary concerns with RAG workflows is the potential for unauthorized access to sensitive information. If not properly secured, these workflows could inadvertently expose personal data, confidential business information, or intellectual property. This could result in severe consequences, including identity theft, financial loss, reputational damage, and legal liabilities.

Therefore, it is essential to implement robust security measures to mitigate this potential data exposure. This includes implementing strict access controls, encrypting sensitive data and embeddings, tokenizing PII, and regularly monitoring for suspicious activity. Additionally, it is important to educate developers about the potential risks and best practices for safeguarding their customer data when integrating with gen AI and RAG.

For example, prompts, text and other types of data used in RAG systems often contains sensitive data — like PII, PHI, and PCI — and may also include other sensitive content (e.g., user personal notes, diary entries, chats, or saved draft messages) that can be easily exposed during processing, agentic retrieval from DBs, APIs and web searches, or prompt generation.

Specifically, embeddings, which are generated from data, can be sensitive and contain identifiable information. This sensitivity arises because embeddings are mathematical representations of data, and even with transformations, the potential for reversibility exists. This reversibility means that the original data, which may include personally identifiable information (PII), could be extracted from the embeddings, leading to privacy breaches and potential harm to individuals.

That data, being indexed and used by the RAG system, may inadvertently be exposed to unauthorized users if not handled properly. For example, data that belongs to one user may be leaked to another user by using the same embeddings.

Storage and cache systems (e.g., documents, SQL, vector stores, graph databases) are often not well protected against unauthorized access. Also because of how the application layers work with these components. This is where more risk can be introduced, with data leaks between customers.

Another issue is enforcing regulatory requirements like the right-to-be-forgotten of the GDPR. Deleting data upon request from an end user can be very challenging. Do you keep an inventory of where each user’s data is stored throughout your system and now in RAGs too?

Data leaks and jailbreaks

Yet another issue we see is sending and receiving sensitive data to third-party models and APIs is risky and may also violate regulatory requirements. Many businesses or even consumers are worried about how their own data flows around when AI is involved. We have already seen situations where even Microsoft leaked customer information in their training sets. Because of such potential drastic privacy breaches, people are right about worrying how and where their data is used. De-identifying the data is one viable solution.

Furthermore, end user queries (prompts), public data that get indexed, and any other information retrieved by agents can both be malicious and should not be blindly trusted. It can be used to jailbreak and take over an LLM or abuse it.

Risks in an AI data workflow

In the following data flow diagram you can see a few risks involved when using RAGs and LLMs in your new AI-based workflows. Starting from the querying string (prompt), to using the RAG to sending to the LLM and getting a result.

Mitigations

How can we mitigate these risks in Agentic RAG workflows?

1) Identification and tokenization of identifiers (PII): Protect clearly identifiable sensitive data (like SSNs, credit card numbers, and health records) by tokenizing them early in the workflow, eliminating leak probability.

2) Field level encryption: Encrypt the rest of the unstructured text, the embeddings and the metadata that may still hold sensitive information, providing ongoing protection while allowing retrieval.

3) Context-aware access controls, decryption & masking: Decrypt and share only what is necessary, when it is necessary — while reducing sensitivity through masking or other techniques.

4) Treat prompts, and retrieved data as untrusted: Always treat user prompts and publicly indexed data as untrusted inputs to avoid potential security vulnerabilities. This includes validating, sanitizing, and carefully monitoring any input before using it in downstream processes.

These targeted mitigations would minimize data exposure and reduce compliance burdens.

Piiano Vault can help you implement the mitigations revolving data security effectively, providing built-in tokenization, encryption, and access control solutions.

‍

Summary

The future of using generative AI in your apps is awesome, but it requires awareness toward data security and privacy. Specifically, how to make sure customer data is secure and can't be leaked during the use of external LLMs.

You can learn more about how Piiano Vault can secure your RAG workflows.

About the author

Ariel Shiftan

CTO & Co-founder

Ariel, despite holding a PhD in Computer Science, doesn't strictly conform to the traditional academic archetype. His heart lies in the realm of hacking, a passion he has nurtured since his early years. As a proficient problem solver, Ariel brings unmatched practicality and resourcefulness to every mission he undertakes.

# Tags:

No items found.

Powering Data Protection

Skip PCI compliance with our tokenization APIs

Skip PCI compliance with our tokenization APIs

hey

h2

dfsd

link2

It all begins with the cloud, where applications are accessible to everyone. Therefore, a user or an attacker makes no difference per se. Technically, encrypting all data at rest and in transit might seem like a comprehensive approach, but these methods are not enough anymore. For cloud hosted applications, data-at-rest encryption does not provide the coverage one might expect.

John Marcus

Senior Product Owner

const protectedForm = 
pvault.createProtectedForm(payment Div, 
secureFormConfig);

This is some text inside of a div block.

Continue your reading

See all articels

No items found.

Piiano offers developer-friendly privacy and security products.

Piiano Privacy Solutions US, Inc

135 W. 50th St. Suite 200 
New York, NY 10020

Product

Piiano Vault PCI Tokenization

Company

About us

Resources

Docs Blog Privacy by Design Data Tokenization

What is PII PII Protection Column - Level Encryption 101

PII By Design ^TM Cheat Sheet

Compared to Hashicorp Vault

Compare to other data vaults

Piiano offers developer-friendly privacyand security products.