Every non-trivial software application prints and collects logs. Large applications log an insane amount of data, so much so that log management costs are a significant subject of the FinOps movement that promotes best practices for cost management in the cloud.
Logs are essential for troubleshooting, auditing and incident response, yet developers often tend to log more than necessary for these purposes. They do have a point: when you’ve experienced troubleshooting critical issues in stressful conditions, you’re inclined to log more to be on the “safe” side. However, this practice results in logging data that shouldn’t be logged, including credentials, credit card numbers and personal data, a.k.a. PII.
This is known as a data leak: exposure of sensitive information from a system or application code. In contrast to a data breach, data leaks don’t have to be intentional. More often than not, they are caused by developers unknowingly introducing vulnerabilities to their code bases or forgetting to erase diagnostic code that they previously used in debugging sessions.
A log leak doesn’t necessarily result in PII made available to the general public. However, considering that access to log aggregators and monitoring systems is often provided to a wide range of employees and contractors, this creates a confidentiality risk. Companies tend to treat even internal PII leaks as incidents that necessitate notifying and requesting action from customers. This happens to the best of us: just ask GitHub, which once found itself internally logging plaintext passwords.
What’s PII, Again?
PII stands for “Personally Identifiable Information”. This umbrella term covers all kinds of data that can identify a natural person in one of two ways:
- Directly. This includes any data that uniquely identifies a person: for example, social security numbers (SSN) or national identification numbers used in other countries, passport information, driver’s license numbers, biometric records, bank account numbers, debit and credit card numbers.
- Indirectly. This includes data that can’t help identify a person when used separately but can provide an unmistakable identification when used along with other pieces of data. Examples include full names, dates of birth, zip codes, phone numbers, and home addresses.
“Personally Identifiable Information” is a term that’s more in use in the US whereas in Europe, “personal data” is the preferred term.
PII is a broad and dynamic concept that can be broken down:
- By industry. Some industries govern PII more strictly than others. For example, the Health Insurance Portability and Accountability Act (HIPAA) in the United States defines and establishes rules for maintaining protected health information (PHI), which includes patient names, geographical identifiers, phone numbers, email addresses, SSNs, and medical health records.
- By region. Data protection laws vary around the world. According to the UN Trade and Development (UNCTAD) organization, 137 countries have introduced legislation to protect data and privacy. Perhaps the most widely known PII regulations are the General Data Protection Regulation (GDPR) in the EU, the California Consumer Privacy Act (CCPA) in the US, and the Data Protection Act 2018 (DPA) in the UK.
Each country or region’s legislation could treat PII in a wider or in a more limiting manner. If you’re an international business, it’s better to err on the safe side and protect even the data that is not treated as PII universally. In other words, when in doubt whether you should protect some kind of data, you probably should.
Curious to dive deeper? Learn more about what PII is, which PII you should protect, and the best ways to protect it.
What Happens if PII Leaks Out?
Depending on the severity of a PII leak from your logs and how widely the leaked data has spread, the consequences for your company may include:
- Having to spend time on incident response, notifying customers and regulators.
- Further attacks that take advantage of leaked PII to gain access.
- Privacy law enforcement and fines.
- Too much complexity in executing “the right to be forgotten” when customers submit requests to delete their PII but the PII has propagated too far into your current and retained logs that may be managed by other data processors.
- Loss of customer trust.
One attack scenario involving logs is referred to as unprotected logging. OWASP describes the scenario as follows:
The mobile application logs sensitive data, including user actions, API responses, or error messages, without proper security controls. This can lead to unintentional exposure of sensitive information if an attacker gains access to the device or intercepts the log files.
Is it possible that your company has such an impeccable level of protection against security threats that you can guarantee data will never leak out? Unfortunately, this is not the way the world works. Security incidents will occur. It’s not “if”, it’s “when” your company’s data is compromised in one way or another.
When it happens, you should be ready. One way of being ready is to limit the exposure of PII into your logs. Here are the basic rules to follow:
- Don’t log credentials and PII, including payment card information, unless you absolutely have to. In most cases, logging internal identifiers is enough for troubleshooting purposes. If and when internal identifiers are leaked, there are fewer notification requirements, lower costs, and less risk.
- Refrain from logging free-form data fields that are not designed to hold PII but still may have it due to user input. A credit card number that someone has mistakenly entered into a free-form “Description” field ends up unprotected when stored and logged because you didn’t expect it to be there.
- If you absolutely have to log PII, obfuscate it before it gets in your logs with techniques such as masking or tokenization.
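To make the last rule more concrete, here is a minimal sketch of both techniques using only the JDK. The class and method names (PiiObfuscator, maskAllButLast, token) are hypothetical and are not part of PetClinic. Note that a keyed hash (HMAC-SHA-256) stands in for tokenization here; a real tokenization service would typically map values to tokens in a protected vault.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper for obfuscating values before they reach a log statement. */
final class PiiObfuscator {

    private PiiObfuscator() {}

    /** Masking: hide everything except the last {@code visible} characters. */
    static String maskAllButLast(String value, int visible) {
        if (value == null || value.isEmpty()) {
            return "";
        }
        // Values that are not longer than the visible window are masked entirely.
        int keep = (visible > 0 && value.length() > visible) ? visible : 0;
        int hidden = value.length() - keep;
        return "*".repeat(hidden) + value.substring(hidden);
    }

    /**
     * Stand-in for tokenization: a keyed hash, so the same input and key always
     * produce the same token and log entries can still be correlated.
     */
    static String token(String value, byte[] key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("tok_");
            for (int i = 0; i < 8; i++) { // the first 8 bytes are enough for a log token
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is not available", e);
        }
    }
}
```

With this sketch, maskAllButLast("4111111111111111", 4) returns ************1111, while token(...) returns a stable 20-character identifier that does not contain the original value. The key would have to be kept out of the logs and the source code for the token to offer any protection.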
How To Avoid Spilling PII to Logs
Now, let’s get from concepts to code and see what you as a software engineer can do to avoid, or at least minimize, spilling PII to your application’s logs.
We’ll use Java code samples, but similar techniques are available in other programming languages and frameworks as well. Code samples are based on a popular sample Spring application called PetClinic but contain additional code to demonstrate logging practices.
We’ll cover options that are available both on the source code level and by customizing your logging frameworks. You’re free to choose an approach that works best for you, but for better overall results, it’s best to take measures on both of these layers.
Do You Really Need To Log This?
You don’t need to protect something that you don’t have. Every time you see a log statement in your code that captures PII or credentials, ask yourself: do you really need to log PII here?
Let’s take this log statement from a Spring Boot controller for example:
@PostMapping("/owners/{ownerId}/edit")
public String processUpdateOwnerForm(@Valid Owner owner,
BindingResult result, @PathVariable("ownerId") int ownerId,
RedirectAttributes redirectAttributes) {
…
owner.setId(ownerId);
this.owners.save(owner);
logger.info("Updated owner {} ({} {}, phone number {})",
owner.getId(), owner.getFirstName(), owner.getLastName(),
owner.getTelephone());
…
}
This controller action updates a database entry for a pet owner and logs updated registration data at the INFO level:
2024-08-25 20:57:28,748 INFO [http-nio-8999-exec-10] o.s.s.p.o.OwnerController: Updated owner 11 (John Doe, phone number 8454066210)
Out of the data logged, the name and the phone number are both PII. Do you really need to see real personal information of users who created or updated their profiles in your logs? If the answer is you actually don’t, then removing the logger call or modifying it to only display the internal ID of the person would be best.
If you believe that logging anything other than the internal ID is justified, then you have a few tools at your disposal.
Overriding toString() Directly
The most obvious way you could prevent PII from leaking is to override the toString() methods in your data classes. You can then apply masking to every sensitive field that takes part in your toString() override.
Here’s an example of a toString() override in the Owner data class that masks or redacts various fields that represent PII:
@Override
public String toString() {
return new ToStringCreator(this).append("id", this.getId())
.append("new", this.isNew())
.append("lastName", this.getLastName().toCharArray()[0] +
"*".repeat(this.getLastName().length()-1))
.append("firstName", this.getFirstName().toCharArray()[0] + "*".repeat(this.getFirstName().length()-1))
.append("address", "Address hidden")
.append("city", this.city)
.append("telephone", "*".repeat(10))
.toString();
}
With this override, when you log an Owner object like this:
logger.info("Updated owner: " + owner);
Then the resulting log entry hides the sensitive values that you’ve chosen to override:
INFO [2024-08-26 05:57:09,769] [http-nio-8999-exec-2] org.springframework.samples.petclinic.owner.OwnerController: Updated owner: [Owner@1058f6b4 id = 11, new = false, lastName = 'D**', firstName = 'J***', address = 'Address hidden', city = 'New York', telephone = '**********']
This approach has quite a few downsides though:
- You’ll need to provide overrides for every data class that may contain sensitive data.
- You need to remember to maintain the override when your class changes.
- It doesn’t apply when you’re logging individual fields instead of the entire Owner object.
- Its impact is not limited to logging. For example, when you evaluate Owner objects in debugging mode, you’ll be seeing masked data, too.
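One more caveat about the override shown earlier: an expression such as getLastName().toCharArray()[0] throws an exception when the name is null or empty, which would turn a log statement into a runtime failure. A null-safe helper could look like the sketch below. The NameMasker class is hypothetical and is not part of PetClinic.

```java
/** Hypothetical helper for use inside toString() overrides. */
final class NameMasker {

    private NameMasker() {}

    /** Keeps the first character and replaces the rest with '*'; tolerates null and empty input. */
    static String firstLetterOnly(String value) {
        if (value == null || value.isEmpty()) {
            return "";
        }
        return value.charAt(0) + "*".repeat(value.length() - 1);
    }
}
```

Inside toString(), the call would then read .append("lastName", NameMasker.firstLetterOnly(this.getLastName())), producing D** for Doe as before.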
Overriding toString() With Lombok and Providing Masked Versions of Data
If you’re using the Lombok library and its wonderful annotations, you may have an easier time providing PII-safe toString() overrides for your data classes.
Specifically, you may want to make use of the @ToString.Include annotation to provide an alternative representation of a PII field that could mask or stub out the actual field value.
For example, here’s a part of the Owner data class:
@Table(name = "owners")
@ToString(callSuper = true)
public class Owner extends Person {
@Column(name = "address")
@NotBlank
private String address;
@ToString.Include(name = "address")
private String addressMasked() {
return "[Address hidden]";
}
And here’s the Person class that Owner inherits from:
@ToString
public class Person extends BaseEntity {
@Column(name = "first_name")
@NotBlank
private String firstName;
@ToString.Include(name = "firstName")
private String firstNameMasked() {
return firstName.toCharArray()[0] +
"*".repeat(firstName.length() - 1);
}
Here’s a breakdown of what happens in the two code samples above:
- The Owner class has a class-level @ToString annotation that generates a toString() override for you. The annotation comes with a callSuper = true argument that makes sure to include the fields of the parent class (Person) into the toString() override of the current class (Owner).
- The @ToString.Include(name = "address") annotation on the addressMasked() method inside Owner makes sure that when toString() is generated, the value of the sensitive address field is replaced with the value returned by the addressMasked() method.
- In a similar fashion, inside the Person class, the @ToString.Include(name = "firstName") annotation on the firstNameMasked() method makes sure that all toString() overrides in this class as well as in all descendant classes will use the masked version of the first name instead of the actual value of the firstName field.
As a result, when you’re logging an object of class Owner, you’re getting values from the methods that provide masked versions of PII:
2024-08-26 17:53:16,834 INFO [http-nio-8999-exec-10] o.s.s.p.o.OwnerController: Updated owner: Owner(super=Person(firstName=J***, lastName=D**), address=[Address hidden], city=New York, telephone=[Phone hidden], pets=[])
Using Lombok is easier than customizing toString() directly, but it still comes with most of the same drawbacks: it doesn’t apply to log records that contain separate fields, and its impact is not limited to logging.
Curiously, there’s a popular request for data masking annotations in Lombok’s issue tracker, but so far it’s been rejected by the library’s maintainers.
Filtering PII Using Your Logging Framework
Since you’re focused on stripping PII out of your logs specifically, a reasonable way to do this is to utilize the features of your logging framework of choice.
In the Java world, let’s see what kind of configuration could help you strip out PII with the two most popular logging frameworks: Logback and Log4j.
Filtering PII with Logback
If you want to mask PII in log records collected using Logback, consider using the approach documented in detail here. In essence, it involves two steps:
- Updating your Logback configuration file (such as logback-spring.xml) with a custom appender that defines specific masking rules using regular expressions.
- Extending Logback’s pattern layout to execute the regular expressions defined in the configuration file.
The custom appender in your Logback configuration file could look as follows:
<appender name="MaskPII" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="org.springframework.samples.petclinic.logging.MaskingPatternLayout">
<maskPattern>firstName\s*=\s*\'(.*?)\'</maskPattern> <!-- First name -->
<maskPattern>lastName\s*=\s*\'(.*?)\'</maskPattern> <!-- Last name -->
<maskPattern>(\d+\.\d+\.\d+\.\d+)</maskPattern> <!-- Generic: IPv4 address -->
<maskPattern>(\d{10})</maskPattern> <!-- Generic: phone -->
<maskPattern>(\w+@\w+\.\w+)</maskPattern> <!-- Generic: email -->
<maskPattern>address\s*=\s*\'(.*?)\'</maskPattern> <!-- Address -->
<maskPattern>\"address\"\s*:\s*\"(.*?)\"</maskPattern> <!-- Address in JSON -->
<maskPattern>\"ssn\"\s*:\s*\"(.*?)\"</maskPattern> <!-- SSN in JSON -->
<pattern>%-5p [%d{ISO8601,UTC}] [%thread] %c: %m%n%rootException</pattern>
</layout>
</encoder>
</appender>
The regular expressions that will be used to try to locate PII in log records are defined in <maskPattern> elements.
Don’t forget to add this appender to the <root> element and any <logger> elements that your configuration file may have:
<root level="info">
…
<appender-ref ref="MaskPII" />
</root>
<logger name="org.springframework.samples.petclinic" level="trace" additivity="false">
…
<appender-ref ref="MaskPII" />
</logger>
Then create a class that extends Logback’s pattern layout and make it available by the fully qualified name that is defined in the appender (in this case org.springframework.samples.petclinic.logging.MaskingPatternLayout):
package org.springframework.samples.petclinic.logging;
import ch.qos.logback.classic.PatternLayout;
import ch.qos.logback.classic.spi.ILoggingEvent;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
public class MaskingPatternLayout extends PatternLayout {
private Pattern multilinePattern;
private List<String> maskPatterns = new ArrayList<>();
public void addMaskPattern(String maskPattern) {
maskPatterns.add(maskPattern);
multilinePattern = Pattern.compile(maskPatterns.stream().collect(Collectors.joining("|")), Pattern.MULTILINE);
}
@Override
public String doLayout(ILoggingEvent event) {
return maskMessage(super.doLayout(event));
}
private String maskMessage(String message) {
if (multilinePattern == null) {
return message;
}
StringBuilder sb = new StringBuilder(message);
Matcher matcher = multilinePattern.matcher(sb);
while (matcher.find()) {
IntStream.rangeClosed(1, matcher.groupCount()).forEach(group -> {
if (matcher.group(group) != null) {
IntStream.range(matcher.start(group), matcher.end(group)).forEach(i -> sb.setCharAt(i, '*'));
}
});
}
return sb.toString();
}
}
With this configuration, if you now add the following log statement to log the entire Owner object:
logger.info("Updated owner: " + owner);
then the resulting log entry is going to look like this:
INFO [2024-08-26 07:11:12,301] [http-nio-8999-exec-2] org.springframework.samples.petclinic.owner.OwnerController: Updated owner: [Owner@592e09f2 id = 11, new = false, lastName = '***', firstName = '****', address = '******************', city = 'New York', telephone = '**********']
However, if you log selected fields of the Owner object individually:
logger.info("Updated owner {} ({} {}, phone number {})",
owner.getId(), owner.getFirstName(), owner.getLastName(),
owner.getTelephone());
then some of the PII in the resulting log entry will not get properly masked:
INFO [2024-08-26 07:11:12,300] [http-nio-8999-exec-2] org.springframework.samples.petclinic.owner.OwnerController: Updated owner 11 (John Doe, phone number **********)
That’s because most of the mask patterns in the Logback configuration only match PII in a structured format, whether in accordance with their toString() representation or contained in JSON properties. For example, mask patterns for first and last names match only when these pieces of PII are provided as parts of an object:
<maskPattern>firstName\s*=\s*\'(.*?)\'</maskPattern> <!-- First name -->
<maskPattern>lastName\s*=\s*\'(.*?)\'</maskPattern> <!-- Last name -->
which aligns with the way toString() is overridden in the Owner class:
@Override
public String toString() {
return new ToStringCreator(this)
…
.append("lastName", this.getLastName())
.append("firstName", this.getFirstName())
…
.toString();
}
That’s why when a last name or a first name is logged as a separate field rather than as part of a formatted object, the Logback mask pattern doesn’t match, leaving both fields unmasked. Coming up with a regular expression generic enough to capture names and only names is hardly feasible.
At the same time, the mask pattern for a phone number is generic, which is why phone numbers are masked even if logged apart from the containing object.
<maskPattern>(\d{10})</maskPattern> <!-- Generic: phone -->
Therefore, if you choose to go down this path, note that you’ll need to align your mask patterns with whatever format is defined by toString() overrides, and probably with data validation logic in your application. In this example, a simple mask pattern for phone numbers is possible only because a validation rule accepts phone numbers solely as strings of exactly 10 digits.
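The difference between structured and generic patterns can be reproduced without Logback. The sketch below uses only java.util.regex and mirrors the masking loop of the layout class with two of the patterns from the configuration. The MaskPatternDemo class is a hypothetical stand-in for experimenting with patterns, not a replacement for the layout.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical playground that mirrors the masking loop of the Logback layout. */
final class MaskPatternDemo {

    // Two of the mask patterns from the appender configuration, joined with "|".
    private static final Pattern COMBINED = Pattern.compile(
            String.join("|", List.of(
                    "firstName\\s*=\\s*'(.*?)'", // structured: only matches toString()-style output
                    "(\\d{10})")),               // generic: matches any run of 10 digits
            Pattern.MULTILINE);

    private MaskPatternDemo() {}

    /** Replaces every character captured by a group with '*'. */
    static String mask(String message) {
        StringBuilder sb = new StringBuilder(message);
        Matcher matcher = COMBINED.matcher(message);
        while (matcher.find()) {
            for (int group = 1; group <= matcher.groupCount(); group++) {
                if (matcher.group(group) != null) {
                    for (int i = matcher.start(group); i < matcher.end(group); i++) {
                        sb.setCharAt(i, '*');
                    }
                }
            }
        }
        return sb.toString();
    }
}
```

Passing firstName = 'John', telephone = '8454066210' to mask() masks both values, whereas Updated owner 11 (John Doe, phone number 8454066210) comes back with only the phone number masked. The generic pattern also over-matches: in ts=1724700000000 the first ten digits of the timestamp are masked, so generic patterns are worth testing against real log lines before they are rolled out.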
Filtering PII with Log4j
If you’re using Log4j, you can implement similar PII masking logic as described here. This involves creating a custom converter class and referencing it from a pattern layout in the Log4j configuration file.
Here’s the custom converter class:
package org.springframework.samples.petclinic.logging;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.config.plugins.Plugin;
import org.apache.logging.log4j.core.layout.PatternLayout;
import org.apache.logging.log4j.core.pattern.ConverterKeys;
import org.apache.logging.log4j.core.pattern.LogEventPatternConverter;
import java.util.Arrays;
@Plugin(name = "MaskingConverter", category = "Converter")
@ConverterKeys({"mask"})
public class MaskingConverter extends LogEventPatternConverter {
private final PatternLayout patternLayout;
protected MaskingConverter(String[] options) {
super("mask", "mask");
this.patternLayout = createPatternLayout(options);
}
public static MaskingConverter newInstance(String[] options) {
return new MaskingConverter(options);
}
private PatternLayout createPatternLayout(String[] options) {
if (options == null || options.length == 0) {
throw new IllegalArgumentException("Options for MaskingConverter are missing.");
}
return PatternLayout.newBuilder().withPattern(options[0]).build();
}
@Override
public void format(LogEvent event, StringBuilder toAppendTo) {
String formattedMessage = patternLayout.toSerializable(event);
String maskedMessage = maskSensitiveValues(formattedMessage);
toAppendTo.setLength(0);
toAppendTo.append(maskedMessage);
}
private String maskSensitiveValues(String message) {
// Replace sensitive values with masked value
message = message.replaceAll("(?<=firstName = ')[^']+?(?=')|(?<=\"firstName\":\")[^\"]+?(?=\")", "****");
message = message.replaceAll("(?<=lastName = ')[^']+?(?=')|(?<=\"lastName\":\")[^\"]+?(?=\")", "****");
message = message.replaceAll("(?<=creditCardNumber=)\\d+(?=(,|\\s|}))|(?<=\"creditCardNumber\":)\\d+(?=(,|\\s|}))", "****");
return message;
}
}
The class passes every logged message through a method called maskSensitiveValues(). This method defines regular expression based replacement rules for various kinds of PII data.
The class is annotated with @ConverterKeys({"mask"}), which enables using the mask identifier to reference the custom class from the Log4j configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
<Appenders>
<Console name="LogToConsole" target="SYSTEM_OUT">
<PatternLayout pattern="%mask{%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n}"/>
</Console>
<File name="LogToFile" fileName="logs/app.log">
<PatternLayout pattern="%mask{%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n}"/>
</File>
</Appenders>
<Loggers>
<Root level="info">
<AppenderRef ref="LogToConsole" />
<AppenderRef ref="LogToFile" />
</Root>
</Loggers>
</Configuration>
As a result of applying this custom converter, when logging the Owner object, both firstName and lastName values are masked:
19:19:32.947 [http-nio-8999-exec-2] INFO org.springframework.samples.petclinic.owner.OwnerController - Updated owner (object): [Owner@743e6525 id = 11, new = false, lastName = '****', firstName = '****', address = '3805 Benedum Drive', city = 'New York', telephone = '8454066214']
The drawbacks of using this approach are the same as with Logback customization. PII fields logged separately from their objects remain unmasked.
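Note also that the address and the phone number in the log entry above stay in plain text, because the converter has no rules for them. Below is a sketch of how such rules might be added in the same lookbehind/lookahead style. The two extra rules and the Log4jMaskRulesDemo class are assumptions made for illustration; they target the toString() format only, not JSON.

```java
/** Hypothetical stand-alone version of the replacement rules, for experimenting outside Log4j. */
final class Log4jMaskRulesDemo {

    private Log4jMaskRulesDemo() {}

    static String maskSensitiveValues(String message) {
        // Rules from the converter (toString() format only).
        message = message.replaceAll("(?<=firstName = ')[^']+?(?=')", "****");
        message = message.replaceAll("(?<=lastName = ')[^']+?(?=')", "****");
        // Additional rules (assumptions, not part of the original converter).
        message = message.replaceAll("(?<=address = ')[^']+?(?=')", "****");
        message = message.replaceAll("(?<=telephone = ')\\d+(?=')", "****");
        return message;
    }
}
```

Unlike the Logback layout, which replaces each character with an asterisk, these rules substitute a fixed-width ****, which also hides the length of the original value.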
Additional Layers of Defense on the Way to Your Log Aggregator
If you’re developing a production application that has a substantial user base, chances are that you’re collecting all kinds of logs and sending them to a log aggregator like Datadog.
Log aggregators may provide their own sensitive data scanner features that have predefined PII detection rules and allow creating custom rules. These scanners apply to all ingested logs before they’re available in the log aggregator’s UI.
Additional log filtering capabilities may also be available in data collectors that you plug your applications into. If you have a sizable infrastructure with dozens of systems and containers that supply logs, you may additionally employ observability pipeline tools like Vector or Fluent Bit that sit between your infrastructure and your log aggregator and also enable a layer of filtering.
If your company uses these tools, they, along with your code-level or logging framework-level measures, contribute to a multi-tier PII obfuscation system that minimizes data leaks and limits the impact of potential data breaches. However, the further PII filtering tools are from your code, the more expensive they tend to be. This is why it’s important to apply as much masking as you can within the confines of your applications, and leave the tools available further down the pipeline as the last line of defense.
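To give a feel for what a detection rule involves, here is a minimal sketch of one for payment card numbers: a regular expression finds candidates, and a Luhn checksum filters out most random digit runs. This is not how any particular aggregator or pipeline tool implements its scanner; the CardNumberRule class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical detection rule: regex candidates filtered by the Luhn checksum. */
final class CardNumberRule {

    // 13 to 19 digits, optionally separated by single spaces or hyphens.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");

    private CardNumberRule() {}

    /** Luhn check: double every second digit from the right and sum the digits. */
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    /** Returns the digit strings in a log line that look like valid card numbers. */
    static List<String> findCardNumbers(String logLine) {
        List<String> hits = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(logLine);
        while (m.find()) {
            String digits = m.group().replaceAll("[ -]", "");
            if (passesLuhn(digits)) {
                hits.add(digits);
            }
        }
        return hits;
    }
}
```

A rule like this would flag the well-known test number 4111 1111 1111 1111 typed into a free-form field, while ignoring a 10-digit phone number or a 16-digit value that fails the checksum.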
How To Evaluate PII Exposure in Your Current Logging Infrastructure?
If your company has established procedures with regard to security and privacy, you probably hold a security review session every once in a while. An auditor and a developer would sit together, search logs for personal information, try to identify where it’s coming from, and update offending log statements.
While this process is valuable, it’s purely manual, prone to error, and not continuous.
A more automated way to reveal PII log leaks in your application’s source code, and one that is available instantly and continuously, is Piiano Flows.
Piiano Flows is a code scanner that discovers where in your source code data leaks are coming from. You can use it on-demand or as part of your CI/CD pipeline, and it will send detailed remediation guidance to your dev team. Piiano Flows will detect sensitive data in arguments passed to your logging APIs, recursively checking for them inside objects and their fields.
Using Piiano Flows drastically reduces the lag in discovering data leaks that is inherent to manual security review processes. You will know about new potential data leaks the moment the offending code is pushed to your repository, which will enable you to apply fixes even before the leak origin reaches production.
Summary
After reading this blog post, you are hopefully better aware of the importance of avoiding PII leaks to your logs.
You have learned a few techniques that you could apply to mask or redact PII: both in your source code and in the configuration of your preferred logging framework. You know that additional PII filtering tools may be available further down your monitoring pipeline, but you also realize that these tools are best used as the last line of defense, and core PII obfuscation should be performed earlier.
That said, preventing PII leaks in your code is challenging, and none of the techniques shown above cover all potential leaks. The best way to address this problem is to continuously reduce logging sensitive data as you evolve your code base.
To get quick and proactive in identifying potential log leaks in your code base as it evolves, check out Piiano Flows.