When security and compliance teams talk about data classification, they speak in the language of regulations and standards. Personally Identifiable Information needs to be protected one way. Health data another way. Employee information in yet another way. And that’s before we get to IP and developer secrets.
And while approaching data security this way is obviously necessary, I want to introduce a different way of thinking about it. When we look at a data asset, we should classify it by asking: “Is this asset valuable for the amount of data it contains, or for the quality of data it contains?”
Here’s what I mean. If an attacker is able to breach a bank’s records and steal a single customer record and credit card number, it’s an annoyance for the bank, and a major inconvenience for the customer. But the bank will simply resolve the issue directly with the customer. There’s no media attention. And it certainly doesn’t have an impact on the bank's business. On the other hand, if 10,000 customer records and credit card details were leaked… that’s another story.
This is what we can call ‘quantitative data’.
The risk to the company if the data is leaked comes partly from the type of data, yes, but perhaps more important is the amount of data.
But there’s another type of data that works in exactly the opposite way. This is what we can call ‘qualitative data’. This is data that is so sensitive that even a small amount being leaked would cause major damage. A good example of this type of data is intellectual property or developer secrets. An attacker doesn’t need a huge file to affect the business with a breach. Even a piece of source code or a single certificate could be catastrophic.
Most data an organization holds can be classified as either ‘qualitative’ or ‘quantitative’. And the two have very little in common, both in how they’re used and how they’re secured.
Quantitative data moves. Its schema changes. It evolves, moving in and out of the organization. It’s used in BI tools, added to data pipelines, and extracted via ETLs. It’s also subject to data retention policies and regulations.
Qualitative data is exactly the opposite. It doesn’t move at all. Maybe certain developer secrets have a refresh token, but other than that there are no real data retention policies. The data does not move through pipelines, it’s not extracted, and it certainly doesn’t leave the organization.
So how does this affect how we approach data security?
When it comes to quantitative data, it’s much easier to find and classify - and not just because of the size of the asset. A classification tool uses probabilities to identify data: if it sees that 95% of the values in a column contain email addresses, it will label the column ‘email’. Its degree of certainty will vary from asset to asset.
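To make the idea concrete, here is a minimal sketch of how such a probabilistic classifier might work. The email pattern, the 95% threshold, and the `classify_column` function are all illustrative assumptions, not any particular tool’s implementation:

```python
import re

# Hypothetical sketch of probabilistic classification: label a column
# "email" when a high enough fraction of its values match an email-like
# pattern. The pattern and threshold are illustrative only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify_column(values, threshold=0.95):
    """Return a (label, confidence) pair based on the match ratio."""
    if not values:
        return ("unknown", 0.0)
    hits = sum(1 for v in values if EMAIL_RE.match(v))
    confidence = hits / len(values)
    if confidence >= threshold:
        return ("email", confidence)
    return ("unknown", confidence)

column = ["alice@example.com", "bob@example.com", "not-an-email", "carol@example.com"]
print(classify_column(column))  # 3 of 4 values match -> below the 0.95 threshold
```

Note that the output is a degree of belief, not a verdict: the same column could be labeled differently under a stricter or looser threshold, which is exactly the property that makes this approach a poor fit for qualitative data.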
Qualitative data classification and security can’t work like this. There’s no such thing as ‘50% of a certificate’. It’s either a cert or not. For this reason, identifying qualitative data is much more of a challenge, but securing it is simpler because it doesn’t need to move.
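By contrast, a detector for qualitative data is binary. A sketch of what that might look like, assuming PEM-style markers as the example (the `contains_secret` helper is hypothetical):

```python
import re

# Hypothetical sketch: qualitative data is detected by exact structural
# markers, not statistics. A PEM-encoded certificate or private key
# either has its header or it doesn't - a single hit flags the asset.
PEM_MARKERS = re.compile(
    r"-----BEGIN (?:RSA PRIVATE KEY|PRIVATE KEY|CERTIFICATE)-----"
)

def contains_secret(text):
    """True if the text contains any PEM header - a yes/no decision."""
    return bool(PEM_MARKERS.search(text))

print(contains_secret("config notes\n-----BEGIN CERTIFICATE-----\nMIIB..."))  # True
print(contains_secret("just ordinary application logs"))                      # False
```

There is no confidence score here to tune; the challenge shifts to knowing all the marker patterns worth looking for, which is why identifying qualitative data is the harder problem.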
This method of thinking about data security can be helpful when trying to understand how different security tools find, classify, and protect sensitive data. Tools that rely *only* on probabilities will have trouble recognizing when qualitative data moves or is at risk. Similarly, any approach that focuses on keeping qualitative data secure might neglect data in movement. Understanding the differences between qualitative and quantitative data is an easy framework for measuring the effectiveness of your organization’s data protection tools and policies.