Introduction
I’ve been working for high-tech companies for two decades. It's no secret that organizations put significant focus on perimeter security. However, they may have inadvertently left vulnerabilities open within their web application code. Duolingo's recent data leak incident really scared me. It serves as a stark reminder that, despite all these safeguards, application data security is far from bulletproof.
These sensitive data leaks often stem from casual, honest coding mistakes that are very common yet very difficult to detect, such as log leaks. My current focus is developing a proactive, automated way to detect these log leaks before they reach production. To find out how common the problem is, and whether the solution I’m working on is effective, I ran the new scanner we’ve developed against a few chosen open source projects to see whether it would find leaks. This article describes how I did it and what I found.
But first, what are application data leaks?
An application data leak occurs when sensitive or confidential information is unintentionally exposed or transmitted by application software.
These leaks can occur for various reasons, including programming errors. Such leaks can involve a wide range of data, such as personal user information, financial records, credit card holder information, or any other sensitive data that the application processes or stores.
Application data leaks pose a significant risk to individuals and organizations as they can lead to privacy violations, financial losses, and reputational damage. Protecting against these leaks requires rigorous security measures, regular and manual code audits, and proactive monitoring to identify and mitigate vulnerabilities before they lead to data exposure.
Why are application data leaks hard to detect, yet very common?
Detecting data leaks, particularly through logs and third-party APIs, can be challenging due to several reasons:
- Log leaks often involve large volumes of unstructured data, making it difficult to distinguish between normal and potentially sensitive information. Identifying patterns or anomalies within this data can be a complex and time-consuming process. Also, these leaks look like expected application behavior. They’re not like a data breach, where you see abnormal activity such as someone downloading gigabytes of your data.
- Third-party APIs can introduce additional layers of complexity because organizations often lack full control over these external services. This lack of control makes it harder to monitor and secure the data flow. Moreover, the APIs may be used for legitimate purposes, making it tricky to differentiate between authorized and unauthorized data transfers. Also, the use of third-party APIs may involve additional stakeholders and organizations, further complicating the detection process.
These challenges, combined with the growing complexity of modern software ecosystems, make data leaks through logs and third-party APIs all too common and a persistent concern for cybersecurity professionals.
During a recent conversation with a colleague, they revealed that their organization endured a staggering 50 data leak incidents annually. Each of these incidents triggered a high-pressure response, with multiple teams collaborating in a war room to rectify the situation swiftly. To tackle this issue, they implemented a dynamic tool designed to sample a portion of the data entering their pipelines and scan for sensitive information, such as personally identifiable information (PII) or credit card numbers. Despite their efforts, they remained aware of potential blind spots and understood that by the time they detected an issue with this limited sampling, it was often too late to prevent harm.
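To make the sampling approach concrete, here is a minimal Python sketch of such a dynamic scanner. It is an illustration only, not my colleague's tool: the function names, the sampling rate, and the regular expressions are all assumptions. It samples a fraction of the records passing through and flags those that contain something resembling an email address or a card number that passes the Luhn checksum.

```python
import random
import re

# Hypothetical patterns; real detectors are far more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_sensitive(record: str) -> list[str]:
    """Return the kinds of sensitive data that appear to be in `record`."""
    findings = []
    if EMAIL_RE.search(record):
        findings.append("email")
    if any(luhn_valid(match.group()) for match in CARD_RE.finditer(record)):
        findings.append("card_number")
    return findings


def scan_sample(records, sample_rate=0.1, rng=None):
    """Scan roughly `sample_rate` of the records; unsampled records are never checked."""
    rng = rng or random.Random()
    flagged = []
    for record in records:
        if rng.random() < sample_rate:
            findings = find_sensitive(record)
            if findings:
                flagged.append((record, findings))
    return flagged
```

The blind spot is visible in `scan_sample`: any record that is not sampled is never inspected, and a flagged record has already entered the pipeline by the time it is found.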
Exploring Solutions - Application Data Leak Solutions
Solutions are starting to emerge that make it easier for companies to move away from reactive approaches to identifying data leaks. These tools scan a codebase to identify potential issues with stored, incoming, and outgoing data and log entries.
I used Piiano Flows, which offers free scans for projects in a range of hosted Git services, such as GitHub and Bitbucket. It can also scan private repositories on GitHub. A commercial version is available for scanning projects in your infrastructure using a CLI. It's currently limited to scanning Java code, but support for Go and Ruby is in beta.
I thought it would be an interesting exercise to see what issues this tool could tell us about some public open source projects.
To get started, I visited scanner.piiano.io where I took the option to sign up with my Google account. I could’ve signed up with a GitHub account or simply used my email address.
When the scanner dashboard opens, you see that Piiano has already loaded a scan for Shopizer, an open source project for headless commerce that can be used to create online stores, marketplaces, product listings, B2B applications, transactional portals, and the like.
Adding projects is easy; select Add Project and provide your project’s URL and a custom name, if you wish.
You can also specify a directory within the repository to scan, a useful option if you have a monorepo. Then you simply select scan to get things going. If you’re scanning a private repo on GitHub, you’re prompted for your credentials before the scan starts.
A scan can take a few minutes to complete, depending on the size of the repository.
So what do you get as the result of a scan? To find out, select the project name or, for projects you scan, View Report from the icon in the Actions column.
The report has seven sections:
- Dashboard, a summary of the scan findings.
- Storage, details of sensitive data types stored by your application.
- Log Leaks, details of code that writes sensitive data to external logs.
- Outbound, details of third-party API calls that access sensitive data.
- Inbound, details of the declaration (class member) and use of sensitive data in your code.
- Report, a formatted version of the report that you can download as a PDF.
- Exclusions, details of any files not scanned.
So, what do the scan reports help you do?
Identify sensitive data storage
Where your project defines its database in code, this section of the report details the content of tables that may contain personal or sensitive data. For example, from the Shopizer scan you see that the database includes a table for Billing details. This table includes the customer's first and last name and phone number, among other details.
This report is of great use to your privacy officer. With it, they can see what sensitive or personal data the application stores without needing to read the code or find a developer to extract the information.
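As a rough illustration of what such an inventory contains, the Python sketch below flags columns whose names suggest personal data. This is a naive name-based heuristic of my own, not a description of how Flows analyzes code, and the table and column names are hypothetical, loosely modeled on the billing example above.

```python
# Hypothetical keyword hints; a real classifier would be far more thorough.
SENSITIVE_HINTS = {
    "name": "personal name",
    "phone": "phone number",
    "email": "email address",
    "address": "postal address",
}


def inventory_sensitive_columns(schema: dict[str, list[str]]) -> dict[str, dict[str, str]]:
    """Map each table to the columns whose names suggest personal data."""
    report = {}
    for table, columns in schema.items():
        hits = {}
        for column in columns:
            lowered = column.lower()
            for hint, label in SENSITIVE_HINTS.items():
                if hint in lowered:
                    hits[column] = label
                    break  # first matching hint wins
        if hits:
            report[table] = hits
    return report


# Hypothetical schema for illustration only.
example_schema = {
    "BILLING": ["FIRST_NAME", "LAST_NAME", "TELEPHONE", "ORDER_TOTAL"],
    "PRODUCT": ["SKU", "PRICE"],
}
example_report = inventory_sensitive_columns(example_schema)
```

Matching on names alone produces false positives (a column such as `FILE_NAME` would be flagged) and misses badly named columns, which is one reason an inventory like this still needs human review.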
Identify sensitive data leaking through logs
We've probably all done it: while tracking down a bug during development, we create a log entry. We fix the bug but then forget to remove the log statement. What if that log included personal or sensitive data and the entry makes its way to production?
You probably don't secure your logs in the same way as your database or other transactional data. Indeed, you may use a third-party service to store and analyze logs.
The log leaks scan looks for this type of potential leak. Take this example from Teammates.
Here a log entry is written that includes the student ID and their unsanitized email address.
This is a good example of why log leaks can be difficult to track down. In the right-hand section of the report you can see the destination for the log record.
Nothing obviously problematic here, just a log line being output.
However, when you look at the flow on the left-hand side, you can see how the log line is built, including the email address.
Depending on the scope of the issue that resulted in these unsanitized email addresses, the log could be a trove for a hacker.
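The pattern is easy to reproduce. The sketch below, written in Python rather than the Java of the scanned project, shows a leftover debugging log line of this kind along with one possible safety net: a logging filter that masks email addresses before the record is written. The logger name, function names, and masking rule are hypothetical and are not taken from the Teammates code.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<redacted-email>", text)


class RedactEmailFilter(logging.Filter):
    """Logging filter that masks email addresses in the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_emails(record.getMessage())
        record.args = None  # the message is already fully formatted
        return True  # keep the record, now redacted


logger = logging.getLogger("enrollment")  # hypothetical logger name
logger.addFilter(RedactEmailFilter())


def enroll_student(student_id: str, email: str) -> None:
    # A leftover debugging line like this is the kind of leak described above:
    # it looks harmless, but it writes personal data to the log.
    logger.warning("Enrolling student %s with email %s", student_id, email)
```

A filter like this only catches what its pattern recognizes, so it is a backstop rather than a fix. The more reliable remedy is to remove the sensitive value from the log statement itself, which is what finding the leak in code makes possible.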
Identify outbound data leaks
This section of the report looks at third-party API calls that could be accessing sensitive data.
You might assume that the production version of your app only sends data to third parties that are approved and known to be secure. However, oversights and omissions are always a risk. For example, as with logs, someone could forget to remove third-party calls used during testing or development. And, even if all the calls are planned, having an audit to use in checking that all third-party libraries are secure could be invaluable.
Here is an outbound scan result from Teammates.
In this example, the application is sending a person's name to an external email service. This may be perfectly legitimate and part of the system design, but there is a need to flag this and make sure it is by design. The benefit of Flows is that the egress of data from the system can be quickly and easily identified, allowing effort to focus on determining its legitimacy.
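One way to make "by design" explicit is to declare which fields each third-party destination may receive and to strip everything else before the call is made. The Python sketch below illustrates the idea; the destination names, field names, and function are hypothetical and do not reflect how Teammates or Flows work.

```python
# Hypothetical allowlist: destination name -> fields approved for sending.
APPROVED_FIELDS = {
    "email_service": {"name", "email"},
    "analytics": {"course_id"},
}


def prepare_outbound(destination: str, payload: dict) -> tuple[dict, list[str]]:
    """Return the payload restricted to approved fields, plus the names of dropped fields."""
    if destination not in APPROVED_FIELDS:
        raise ValueError(f"Destination {destination!r} has not been approved")
    allowed = APPROVED_FIELDS[destination]
    safe = {key: value for key, value in payload.items() if key in allowed}
    dropped = sorted(key for key in payload if key not in allowed)
    return safe, dropped


# Example: the phone number is not approved for the email service, so it is dropped.
safe_payload, dropped_fields = prepare_outbound(
    "email_service",
    {"name": "Alice", "email": "alice@example.com", "phone": "555-0100"},
)
```

Returning the dropped field names makes it possible to log or alert on unexpected data (by name only, not by value) instead of silently discarding it.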
Identify inbound data leaks
The inbound section of the report provides details of the declaration (class member) and use of sensitive data in your code.
This report is useful because, for example, GDPR requires you to monitor personal and sensitive data being processed by your application.
This example shows how Shopizer obtains a billing telephone number.
From this, you can see that Shopizer is obtaining order information from a shopping cart through a REST API.
Flows also identifies data stored in the database (see the “Persistent” label). This is something you should review to ensure it was implemented as designed.
Conclusion: Application Data Leaks
Locking the stable door after the horse has bolted is not a particularly effective approach to managing the security of personal and other sensitive data. However, for many organizations, this is the approach they take.
The emergence of tools, like Piiano Flows, enables organizations to take a proactive and automatic approach to data security and take practical steps to ensure their system embodies the principles of security by design. This approach secures the data from the source, making sure your customers' sensitive data is much less likely to be unwittingly exposed.