Privacy, Privacy Engineering

Practical Pseudonymization by Tokenization

November 14, 2021

By Piiano Team

Data leaks hurt consumers. Though the true number of breaches and compromised data remains unknown, those we know of have compromised billions of records–including highly sensitive customer data. Even the largest corporations with the most competitive security teams have failed to prevent data leaks. Until this is properly addressed, consumers cannot trust enterprises to keep their information safe and will continue to see the private information they share with enterprises dangerously exposed, stolen, and shared.

Mounting public concern over how enterprises interact with our personal information has led to the development of laws and regulations, such as GDPR and CCPA, that define consumer privacy rights. Personal data now officially requires special handling and care to meet today’s compliance standards and consumer expectations.

This post will discuss one of the most effective techniques to mitigate and reduce the risk of compromised consumer data.

Data Pseudonymization and Data De-Identification

Personal data de-identification is the process of removing identifiers from a data set so that its information cannot be linked to individuals. When de-identification makes it impossible to link the data back to individuals, that is, to re-identify it, the data is considered anonymized. Today’s business analysts and data scientists require access to data. However, because of limited security awareness and problematic infrastructure, direct access introduces unacceptable risk–especially if the data in use is subsequently copied outside of controlled systems. Applying anonymization, or at the very least pseudonymization, dramatically reduces these risks.

What is Pseudonymization

Anonymized data sets are out of the scope of privacy regulations like GDPR. However, full anonymization can disrupt many legitimate data uses that require identifiers (e.g., your bank probably needs to know who you are as a user). GDPR proposes pseudonymization as a practical alternative for reducing the risk of data exposure, relieving compliance obligations, and maintaining optimal data utilization:

‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person (GDPR, Article 4(5))

Pseudonymization is the process of replacing all sensitive identifiers with pseudonyms/aliases/tokens in a specific data set (e.g., a table in a database or a CSV file). The original identifiers have to be kept securely elsewhere. C’est tout.

Unfortunately, it’s much more complicated to execute. Building systems that work with pseudonymized data requires a whole lot of designing from the get-go. This is incredibly difficult to do–so much so that GDPR can only recommend it due to unenforceability.

What is Tokenization

Tokenization is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no exploitable meaning or value. The token is a reference that maps back to the sensitive data through a tokenization system.

Data Anonymization Techniques

Pseudonymization is typically implemented by using a tokenization technique (see below). Note that, like anonymization, pseudonymized data cannot be linked to a person’s specific identity on its own. However, unlike anonymization, it is possible to re-identify pseudonymized data using the additional piece of information kept secure elsewhere. Effectively, when systems require the original plaintext identifiers, they can still translate the pseudonyms back.

Example of an original table (with PIIs of emails and SSNs):

user_id email_address ssn salary
1 liz@example.com 000-11-1111 25K
2 darcy@altostrat.com 000-22-2222 39K
3 fiona@example.com 000-33-3333 32K
4 Jon@examplepetstore.com 000-44-4444 29K
5 tomL@test.com 000-55-5555 41K

 

SSN is now tokenized:

user_id email_address ssn salary
1 liz@example.com 1ffa0bf4002a968e7d87d7dc8815f551895378ac 25K
2 darcy@altostrat.com 1d55ec7079cb0a6aca2423ebb5c2b08e8a3fa1d8 39K
3 fiona@example.com be85b326855e0e748a6e466ffa92dde8e34b3e5c 32K
4 Jon@examplepetstore.com a8018df9bf9b78a98da20058e59fe8d311fbfbf7 29K
5 tomL@test.com 39512b47a68f4c3fb03845660fca79234270e946 41K

 

Both emails and SSNs are tokenized, and the table is now pseudonymized using PII tokenization.

user_id email_address ssn salary
1 ccecd98685a699bd@0fe3485.com 1ffa0bf4002a968e7d87d7dc8815f551895378ac 25K
2 e74c15f3f39db602@4706c986.com 1d55ec7079cb0a6aca2423ebb5c2b08e8a3fa1d8 39K
3 55715565ab5e1378@f3b1c1bb.com be85b326855e0e748a6e466ffa92dde8e34b3e5c 32K
4 5847be2298b245a@3970680.com a8018df9bf9b78a98da20058e59fe8d311fbfbf7 29K
5 E85d62b7520b21@6eae1.com 39512b47a68f4c3fb03845660fca79234270e946 41K

* Email addresses are tokenized using a format-preserving tokenization mechanism.
* This table is now either anonymized or pseudonymized depending on whether it’s possible to restore its original email or SSN values.

This is a 1:1 translation (tokens) table matching tokens to the original plaintext email address:

email_address_token email_address
ccecd98685a699bd@0fe3485.com liz@example.com
e74c15f3f39db602@4706c986.com darcy@altostrat.com
55715565ab5e1378@f3b1c1bb.com fiona@example.com
5847be2298b245a@3970680.com Jon@examplepetstore.com
E85d62b7520b21@6eae1.com tomL@test.com

* Authorizing, auditing, and monitoring access to this table are critical for preserving the anonymity of the users in the original table.
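As a rough illustration of how a pseudonymized table and its translation table relate, here is a minimal Python sketch. It assumes random tokens that merely look like email addresses; the names `new_email_token` and `pseudonymize` are hypothetical and not part of any particular product.

```python
import secrets


def new_email_token() -> str:
    """Random token shaped like an email address (hypothetical format)."""
    return f"{secrets.token_hex(8)}@{secrets.token_hex(4)}.com"


def pseudonymize(rows, column, make_token):
    """Return (pseudonymized rows, translation table) for one column."""
    token_for = {}    # plaintext -> token, so a repeated value reuses its token
    translation = {}  # token -> plaintext; keep this separately and securely
    result = []
    for row in rows:
        value = row[column]
        if value not in token_for:
            token = make_token()
            while token in translation:  # very unlikely collision: draw again
                token = make_token()
            token_for[value] = token
            translation[token] = value
        # Copy the row and swap the identifier for its token.
        result.append({**row, column: token_for[value]})
    return result, translation


users = [
    {"user_id": 1, "email_address": "liz@example.com", "salary": "25K"},
    {"user_id": 2, "email_address": "darcy@altostrat.com", "salary": "39K"},
]
pseudonymized, tokens_table = pseudonymize(users, "email_address", new_email_token)
```

Only `pseudonymized` would be handed to analysts; `tokens_table` plays the role of the translation table and is the part that needs strict access control.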

Data Tokenization

Tokenization is the process of substituting a single piece of sensitive information with non-sensitive information. The non-sensitive substitute information is called a token. It can be created using cryptography, a hash function, or a randomly generated index identifier and used to redact the original sensitive information. For example, tokenizing sensitive information like PII or credit card numbers is important when logging them into a file.
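To make the token-creation options concrete, the sketch below shows two of them in Python: a keyed hash (HMAC-SHA256) and a random value. The helper names and the SSN pattern are assumptions for illustration. A keyed hash is used rather than a plain hash because the space of possible SSNs is small enough to enumerate, so an unkeyed hash could be reversed by brute force.

```python
import hashlib
import hmac
import re
import secrets

# Assumed pattern for US SSNs written as 000-00-0000.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def keyed_hash_token(value: str, key: bytes) -> str:
    """Deterministic token: the same value and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def random_token() -> str:
    """Random token: carries no information about the value it replaces."""
    return secrets.token_hex(20)


def redact_ssns(line: str, key: bytes) -> str:
    """Replace every SSN in a log line with its keyed-hash token."""
    return SSN_PATTERN.sub(lambda m: keyed_hash_token(m.group(0), key), line)


key = secrets.token_bytes(32)  # in practice the key would come from a secret store
print(redact_ssns("user 1 updated ssn=000-11-1111", key))
```

Neither kind of token can be reversed on its own: translating it back requires a stored mapping from token to original value.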

The operation of detokenization, or translating a token to its corresponding data, should only be done on a need-to-know basis (using the right permissions).
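One way to picture need-to-know detokenization is a small vault that checks the caller's role and records every attempt. This is a sketch under assumptions: the `TokenVault` class, the role names, and the in-memory audit log are hypothetical, and a real system would rely on its own authorization and logging infrastructure.

```python
from datetime import datetime, timezone


class TokenVault:
    """Holds the token -> plaintext mapping and guards access to it."""

    def __init__(self, translation, authorized_roles):
        self._translation = dict(translation)
        self._authorized_roles = set(authorized_roles)
        self.audit_log = []  # every attempt is recorded, allowed or not

    def detokenize(self, token, requester, role):
        allowed = role in self._authorized_roles
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": requester,
            "token": token,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not detokenize")
        return self._translation[token]


vault = TokenVault(
    {"ccecd98685a699bd@0fe3485.com": "liz@example.com"},
    authorized_roles={"support"},
)
print(vault.detokenize("ccecd98685a699bd@0fe3485.com", "alice", "support"))
```

A caller with any other role gets a `PermissionError`, and the refused attempt still appears in `audit_log`.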

To support various use cases, tokenization engines that are used for pseudonymization must account for the following possible requirements:

  • Format-preserving tokens: These tokens preserve the format of the original data. They are often required in situations with strict storage formatting, such as when changing the database schema is impossible.
  • Deterministic vs. unique tokens: This refers to whether the same value is always mapped to the same token or whether each occurrence of a value is mapped to its own unique token. Deterministic tokenization enables users to look up exact matches and perform joins directly on the tokenized columns of a pseudonymized dataset. However, it still leaks some information about the original identifiers and consequently reduces the protection provided by the tokenization engine. For example, it exposes the fact that two records share the same identifier, which statistical methods can exploit to reveal the original data.
  • Ephemeral tokens: These tokens are valid for a limited amount of time. Their expiry limits exposure. They are often used for regular tasks, for example, identifiers exported on a nightly job from a transactional system to an analytical data pipeline.
  • Querying the data: The ability to perform lookups on the tokenized data, such as searching for all emails with the domain “@example.com” in the example above.
  • Efficiently updating / deleting identifiers: The ability to update or delete only the referenced identifier (e.g., inside the translation/tokens table) while keeping the token as-is, instead of updating every table that holds that token.
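The difference between deterministic, unique, and ephemeral tokens from the list above can be sketched with a small in-memory, table-based tokenizer. The `TableTokenizer` class and its parameters are hypothetical, chosen only to illustrate the behavior; persistence and access control are left out.

```python
import secrets
import time


class TableTokenizer:
    """In-memory table-based tokenizer (illustrative only)."""

    def __init__(self, deterministic, ttl_seconds=None, clock=time.time):
        self.deterministic = deterministic
        self.ttl_seconds = ttl_seconds  # None means tokens never expire
        self._clock = clock
        self._by_value = {}  # value -> token, used in deterministic mode
        self._by_token = {}  # token -> (value, expiry time or None)

    def _is_live(self, token):
        entry = self._by_token.get(token)
        if entry is None:
            return False
        expires_at = entry[1]
        return expires_at is None or self._clock() < expires_at

    def tokenize(self, value):
        if self.deterministic:
            existing = self._by_value.get(value)
            if existing is not None and self._is_live(existing):
                return existing  # same value, same token
        token = secrets.token_hex(16)
        expires_at = None
        if self.ttl_seconds is not None:
            expires_at = self._clock() + self.ttl_seconds
        self._by_token[token] = (value, expires_at)
        if self.deterministic:
            self._by_value[value] = token
        return token

    def detokenize(self, token):
        if not self._is_live(token):
            self._by_token.pop(token, None)  # drop expired entries
            raise KeyError("unknown or expired token")
        return self._by_token[token][0]


deterministic = TableTokenizer(deterministic=True)
unique = TableTokenizer(deterministic=False)
ephemeral = TableTokenizer(deterministic=False, ttl_seconds=3600)
```

In deterministic mode, two rows holding the same email receive the same token, so joins and exact-match lookups keep working. In unique mode, every call returns a fresh token, which hides the fact that the rows match.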

Tokenization engines for pseudonymization can be implemented in two main ways: based on a translation table or based on encryption. Table-based tokenization maintains a mapping from identifiers to randomly generated tokens in a table that is stored in a centralized location. Encryption-based tokenization uses a cryptographic algorithm and a corresponding key to translate identifiers to tokens and vice versa. Both the table and the key must be secured and stored separately from the original database.
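To show what encryption-based tokenization can look like without a lookup table, the toy sketch below builds a keyed, reversible, format-preserving permutation over decimal digits from a Feistel network whose round function is HMAC-SHA256. This is an illustration only, not a vetted cipher: the round count and all names are assumptions, and a production system would normally use a standardized format-preserving encryption mode such as NIST FF1.

```python
import hashlib
import hmac

DIGITS = "0123456789"
ROUNDS = 10  # illustrative choice, not a vetted security parameter


def _round_value(key: bytes, round_no: int, other_half: int, modulus: int) -> int:
    """Keyed pseudo-random number derived from the round and the other half."""
    message = f"{round_no}:{other_half}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _feistel(key: bytes, digits: str, decrypt: bool) -> str:
    """Permute a string of decimal digits; the output has the same length."""
    if len(digits) < 2:
        raise ValueError("need at least two digits")
    a = len(digits) // 2
    b = len(digits) - a
    left, right = int(digits[:a]), int(digits[a:])
    mod_left, mod_right = 10 ** a, 10 ** b
    sign = -1 if decrypt else 1
    order = reversed(range(ROUNDS)) if decrypt else range(ROUNDS)
    for i in order:
        # Each round changes one half using the other, unchanged half,
        # so running the rounds backwards with subtraction undoes them.
        if i % 2 == 0:
            left = (left + sign * _round_value(key, i, right, mod_left)) % mod_left
        else:
            right = (right + sign * _round_value(key, i, left, mod_right)) % mod_right
    return f"{left:0{a}d}{right:0{b}d}"


def _apply(key: bytes, value: str, decrypt: bool) -> str:
    """Transform only the digits; separators such as '-' stay in place."""
    digits = "".join(c for c in value if c in DIGITS)
    permuted = iter(_feistel(key, digits, decrypt))
    return "".join(next(permuted) if c in DIGITS else c for c in value)


def tokenize(key: bytes, value: str) -> str:
    return _apply(key, value, decrypt=False)


def detokenize(key: bytes, token: str) -> str:
    return _apply(key, token, decrypt=True)


key = b"demo-key-not-for-production"
token = tokenize(key, "000-11-1111")
print(token, detokenize(key, token))
```

The token keeps the `NNN-NN-NNNN` shape, is deterministic for a given key, and can be translated back by anyone holding the key, so the key is the only secret that has to be stored.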

The following table summarizes important considerations for both approaches:

| Consideration | Table | Encryption |
|---|---|---|
| Operational cost | High – requires maintaining an always-available table | Low – simply requires a key that can easily be backed up and copied if needed |
| Security risk | Moderate – the table requires a high degree of protection. It contains all identifiers and represents a potential single point of failure | Low – requires protection for the key. Access to identifiers requires both the key and the tokens |
| Queries | Any query can be supported | Only supports exact matches for deterministic tokenization |
| Efficiently updating / deleting identifiers | Can update or delete identifiers without requiring a token change | Tokens must be updated or deleted |
| Format preserving | Supported | Supported |
| Deterministic tokens | Supported | Supported |
| Unique tokens | In some cases, it’s not possible to generate unique tokens for format-preserving tokens | Requires extra context. In some cases, it’s not possible to generate unique tokens for format-preserving tokens |
| Ephemeral tokens | Supported | Possible for non-format-preserving tokens in a trusted execution environment |

It can be hard to tell which model suits you better, and the use case at hand should decide. Typically, it’s best to use a table when you need more searchability over the data. If you need more performance and security, and can accept less searchability, then encryption may be best. And if you know the architecture and how it uses the data, you can roll your own solution that trades off these characteristics.
