This is a guest post by Cameron Laird.
Testing in production is like walking a tightrope: It takes skill to keep everything in balance. One crucial technique you have to master for testing in production is the anonymization of proprietary end-user data. Let’s examine why you need to understand this technique, and how to start with it.
The problem
Operators and system admins have recognized for decades an ethical obligation to solve technical problems without prejudice or interference. While it might catch our attention that employee Hector Jones received three email messages one week from what appears to be a divorce lawyer, as IT professionals we do not spread that information past the debugging session where it is pertinent. We strictly separate technical investigations from the content that turns up during them.
But with the pervasive digitalization of modern lives, new legislation regarding privacy, and increasingly specialized and capable hostile forces in pursuit of personal information, it’s now time to be even more careful. Think for a moment about something as mundane as production logs. You probably have a log near you that looks something like this:
2020-01-06 09:24:40,698 [build-graph] [__main__] [INFO] Arguments: Namespace(cache_dir=None,
cached_docs_only=False, customer_case='Customer #1',
customer_query='ReviewStatus = "Initialize"',
directory_path=None, doc_max_bytes=None,
doc_min_bytes=None, graph_dir=PosixPath('/tmp/experiment/Query#1/graph'),
ingestion_source='internal', max_filter_workers=None, max_ingestion_workers=None, ...)
That doesn’t present a privacy problem, does it?
It certainly can. Consider the Strava incident, when clever correlation of what appeared to be benign public information from a fitness tracking app revealed the locations of secret military operations.
We don’t yet have a comprehensive theory of what’s private and what’s not, or even the certainty that the boundary between the two can be clear. That’s why it’s safer for the log entry above to appear as such:
2020-01-06 09:24:40,698 [build-graph] [__main__] [INFO] Arguments: Namespace(cache_dir=None,
cached_docs_only=False, customer_case='9249E40CA14FFC36304C',
customer_query='ReviewStatus = "Initialize"',
directory_path=None, doc_max_bytes=None, doc_min_bytes=None,
graph_dir=PosixPath('/tmp/experiment/Query#1/graph'),
ingestion_source='internal', max_filter_workers=None, max_ingestion_workers=None, ...)
Or it could even be made to appear like this:
20YY-XX-06 09:24:40,698 [build-graph] [__main__] [INFO] Arguments: Namespace(cache_dir=None,
cached_docs_only=False,
customer_case='9249E40CA14FFC36304C', customer_query='ReviewStatus = "A7595B6"',
directory_path=None, doc_max_bytes=None,
doc_min_bytes=None, graph_dir=PosixPath('843D23BFB50'), ingestion_source='internal',
max_filter_workers=None, max_ingestion_workers=None, ...)
This last version still has enough detail to permit realistic debugging to progress. By hashing different details, though, we lower the chance that an entry falls into the wrong hands, where a bad actor could piece together details like "Someone used Customer #1's account at 9:24 in the morning to begin a request for …"
Even if this only represents a test, and "Customer #1" isn't a real name or account, and even if logs should only be available to a small team of operators, consider the risk. What if a fragment of your logs shows up in a conference presentation (as frequently happens), revealing that "Customer #1" is in use by someone, somewhere, and drawing attention to the real person behind "Customer #1"? Do you see the damage that can result?
This is the benefit and even necessity of log anonymization. It’s the right thing to do, and I predict it will become a compliance requirement in many contexts. Get ahead of the trend and start to implement your own anonymization now.
Anonymization tactics
How can you do that? First, focus on “data at rest.”
Presumably, your databases are encrypted, you employ good security practices throughout your network, your organization understands how "need to know" applies to private data, and you have good practices and records regarding access to different categories of information. That leaves logs, dashboards, and monitors as the biggest vulnerabilities. If they're at all useful technically, they're likely to embed enough private information to sicken your compliance officers.
That’s the situation of a majority of IT operations. A quick fix is available, though: Hash revealing data before it’s written.
By “hash,” I mean a lightweight, fast function that maps data, such as:
John Smith -> 249E40CA14FFC3
divorced   -> 6304C9A759
…
While not cryptographically secure, such hashes are hard enough to decipher that they encourage casual crackers to take their criminal attempts elsewhere, yet, provided the team keeps a lookup table of the mappings under stricter access control, readily enough reversible to permit serious debugging when needed.
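One minimal way to sketch this in Python is a truncated SHA-256 digest plus a lookup table for reversal. The function name `pseudonymize`, the token length, and the in-memory table are my own illustrative choices, not a prescribed scheme:

```python
import hashlib

# Table mapping each token back to its original value, so the team
# can "reverse" a hash during serious debugging. In practice this
# would live apart from the logs, under stricter access control.
reverse_table = {}

def pseudonymize(value: str, length: int = 14) -> str:
    """Map a sensitive value to a short, stable hex token."""
    token = hashlib.sha256(value.encode("utf-8")).hexdigest()[:length].upper()
    reverse_table[token] = value
    return token

token = pseudonymize("John Smith")
# The same input always yields the same token, so log entries about
# one customer remain correlatable even after anonymization.
assert pseudonymize("John Smith") == token
assert reverse_table[token] == "John Smith"
```

Determinism is the important property here: a given customer always maps to the same token, so operators can still follow one account's activity through a log without ever seeing the plaintext.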
Hashes can be applied to logs in at least a couple of distinct ways. At one extreme, it’s possible to build them into the log writer in a single location with the intent that one filter obscures all private data from any possible source. The opposite approach is to leave the definition of the log writer alone and inspect each use of the logging entry point for the potential to leak private information.
In pseudocode, the first looks like this:
…
logline = sanitize(logline)
fp_log(logline)
…

def sanitize(this_string):
    # Python strings are immutable, so rebind and return the result.
    for found in search_for_private(this_string):
        this_string = this_string.replace(found, hash(found))
    return this_string
The second is more like this:
…
log(f"This is a log entry about {datum1}, {hash(datum2)}, and {datum3}.")
…
log(f"This is a different log entry about {hash(datum4)} and {datum5}.")
…
The weakness of the latter approach is that it requires programmers to expertly find and update every instance of log(); a substantial application might have thousands of them. The former is hard in a different way: even though hash() and sanitize() each appear in only one location, search_for_private() has to be rather sophisticated to find all the possible private data in an unstructured logline.
I’ve had success with the former approach in recent years. A typical implementation involves the construction of a dictionary of private information taken from the application’s own databases. Log entries scrubbed clean of all those values are orders of magnitude safer than their originals.
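A sketch of that dictionary-driven approach might look like the following. The record layout and helper names (`build_private_dictionary`, `sanitize`) are hypothetical; the point is that the list of values to scrub comes from the application's own data, not from pattern-guessing:

```python
import hashlib

def build_private_dictionary(records):
    """Collect sensitive string values (names, account labels, ...)
    from the application's own data store."""
    private = set()
    for record in records:
        private.update(v for v in record.values() if isinstance(v, str))
    return private

def sanitize(logline, private_values):
    """Replace every known private value with a short hash token."""
    # Replace longer values first, so 'Customer #11' is scrubbed
    # before its prefix 'Customer #1' can partially match it.
    for value in sorted(private_values, key=len, reverse=True):
        token = hashlib.sha256(value.encode("utf-8")).hexdigest()[:20].upper()
        logline = logline.replace(value, token)
    return logline

customers = [{"name": "Customer #1"}, {"name": "Customer #2"}]
private = build_private_dictionary(customers)
line = "Arguments: Namespace(customer_case='Customer #1', ...)"
print(sanitize(line, private))  # 'Customer #1' no longer appears
```

Sorting by descending length when substituting is a small but real design choice: it prevents a shorter private value that happens to be a prefix of a longer one from being replaced first and leaving fragments of the longer value behind.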
Two tips
Keep two more ideas in mind as you anonymize your logs and other operational records: tooling and culture.
The support staff will likely need plaintext versions of logs occasionally when solving problems. That doesn’t necessarily mean an unhash function, which operates on an entire log at a time — although debuggers certainly would find it convenient. For practical cases, though, it might be enough just to have a small tool that confirms that:
20YY-XX-06 09:24:40,698 [build-graph] [__main__] [INFO] Arguments: Namespace(cache_dir=None,
cached_docs_only=False, customer_case='9249E40CA14FFC36304C' …
originated as:
2020-01-06 09:24:40,698 [build-graph] [__main__] [INFO] Arguments: Namespace(cache_dir=None,
cached_docs_only=False, customer_case='Customer #1' …
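Such a confirmation tool can be tiny. Assuming the log writer uses a deterministic scheme like the truncated SHA-256 sketched earlier (the article's sample tokens are illustrative, not actual digests), it only needs to re-hash a candidate and compare:

```python
import hashlib

def confirm(token: str, candidate: str, length: int = 20) -> bool:
    """Check whether `candidate` would hash to `token` under the same
    scheme the log writer uses (sketched here as truncated SHA-256;
    substitute your actual scheme)."""
    digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    return digest[:length].upper() == token

# A support engineer suspects an anonymized customer_case value was
# 'Customer #1'; confirm it without exposing a full plaintext log.
token = hashlib.sha256(b"Customer #1").hexdigest()[:20].upper()
assert confirm(token, "Customer #1")
assert not confirm(token, "Customer #2")
```

This keeps the common support workflow ("is this token the customer I think it is?") possible without ever granting broad access to unhashed logs.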
Also, anonymization is necessarily a cultural fit: It will require ongoing maintenance and attention to new threats and leaks. Just as with other security issues, team members deserve encouragement to find and report problems they notice. Logs might be perfectly anonymous in one release, then, without anyone intending it, they may begin to leak a few private details the next day because of an apparently unrelated enhancement. A strong and supportive privacy culture recognizes that all staff can help detect and solve such problems.
The anonymization story goes far beyond this brief glimpse. It's big enough to support specialists, consultants, commercial providers, books, and the rest of a growing ecosystem devoted to the subject. This write-up on database anonymization hints at the depths possible.
The point for today, though, is that log sanitization is a natural and rewarding place to start — and you’ll almost certainly be happier if you start before privacy violations become an emergency for your organization.
Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron’s favorite applications are for farm automation.