Let’s Discuss Zero-Knowledge Data

Written by Chenxi Wang

This is the second article in a series: you can read the first article on data security here.  

In cryptography, a zero-knowledge proof is a method by which one party can prove to another party that a statement is true without revealing anything beyond the fact that it is true. Goldwasser, Micali, and Rackoff first proposed the concept in a paper published in the 1980s. A practical example of a zero-knowledge proof, in layman’s terms: A can prove to B that A knows a secret X without telling B what the secret is.

Pretty cool concept, right? (Here is a link to a famous description of what a zero-knowledge proof is, told in a children’s story format, for those of you who want to dig into this a bit more. Matthew Green also has a great post on the concept that is well worth reading.)
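To make the "A proves to B that A knows a secret X" example concrete, here is a minimal sketch of the Schnorr identification protocol, a classic interactive proof of knowledge of a discrete logarithm (zero-knowledge against an honest verifier). It is an illustration chosen for this article, not something taken from the original paper. The group parameters P, Q, and G are toy values and far too small for real use, and all function names are hypothetical.

```python
import secrets

# Toy group parameters (assumption, NOT secure): G = 2 generates the
# subgroup of prime order Q = 11 in the integers modulo P = 23.
P, Q, G = 23, 11, 2


def keygen():
    """A picks a secret x and publishes y = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1  # secret in [1, Q-1]
    y = pow(G, x, P)
    return x, y


def prover_commit():
    """A picks a fresh random nonce r and sends the commitment t = G^r mod P."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)


def verifier_challenge():
    """B replies with a random challenge c."""
    return secrets.randbelow(Q)


def prover_respond(x, r, c):
    """A answers with s = r + c*x mod Q; the random r masks x."""
    return (r + c * x) % Q


def verifier_check(y, t, c, s):
    """B accepts if G^s == t * y^c (mod P). B never sees x."""
    return pow(G, s, P) == (t * pow(y, c, P)) % P


def run_protocol(x, y, rounds=20):
    """Repeat the exchange; with a tiny Q, many rounds are needed for soundness."""
    for _ in range(rounds):
        r, t = prover_commit()
        c = verifier_challenge()
        s = prover_respond(x, r, c)
        if not verifier_check(y, t, c, s):
            return False
    return True


secret_x, public_y = keygen()
accepted = run_protocol(secret_x, public_y)
```

B only ever sees t, c, and s. Because r is fresh and random in every round, s by itself says nothing about x, yet a prover who does not know x can pass a round only by guessing the challenge in advance.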

It turns out that zero-knowledge proofs are incredibly powerful tools for a variety of security and privacy applications.

Consider, for instance, a bank: with a zero-knowledge proof, it might be able to prove that an individual’s records do not constitute a money-laundering threat without revealing the records to the government.

A zero-knowledge proof, as its name indicates, lets you deliver a particular function to a party that you don’t trust implicitly (a government, a service provider, a partner) without having to reveal every piece of information that went into the function.

This is exactly what many organizations are looking for: strong data protection combined with the ability to keep operating on the data. If you search for database protection on different security sites, you will find many threads discussing how to protect data from privileged (but potentially malicious) access while still maintaining performance and functionality. One thread asked: “How do I protect my data from ‘a corrupted DBA – or someone who has compromised the DBA privileges’?” This is a perfect example of a use case for zero-knowledge data.

Following the same principle as zero-knowledge proofs, zero-knowledge data is data that supports certain standard data operations without revealing the actual clear-text data to database servers, applications, or even local admins. A corollary is that the physical form of the data (the bits) should not reveal information either.

What types of data operations should zero-knowledge data support? For starters, search and sort, and perhaps we can also think about “Join” and limited set operations. Set operations will be, by definition, more challenging if you want to prevent information leakage entirely.
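As a rough illustration of what "search without revealing clear text" can look like, here is a minimal sketch of a keyed "blind index" for exact-match search. This is one commonly used approach and an assumption on our part, not necessarily the technique this series has in mind. The names blind_index and ServerTable and the record ids are hypothetical, and encryption of the records themselves is left out of the sketch.

```python
import hashlib
import hmac
import secrets


def blind_index(key: bytes, value: str) -> str:
    """Derive a search token from a value with HMAC-SHA256.

    The key stays with the data owner; without it, the token cannot be
    recomputed from a guessed value.
    """
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class ServerTable:
    """Stands in for the untrusted database: it stores only tokens and opaque record ids."""

    def __init__(self):
        self._rows = {}

    def insert(self, token: str, record_id: str) -> None:
        self._rows.setdefault(token, []).append(record_id)

    def search(self, token: str) -> list:
        # Exact-match lookup on the token; the server never sees the clear text.
        return list(self._rows.get(token, []))


# Client side: the index key is never sent to the server.
index_key = secrets.token_bytes(32)

table = ServerTable()
table.insert(blind_index(index_key, "123-45-6789"), "rec-1")
table.insert(blind_index(index_key, "987-65-4321"), "rec-2")

# To search, the client recomputes the token and sends only that.
matches = table.search(blind_index(index_key, "123-45-6789"))
```

Note the trade-off: the tokens are deterministic, so the server can still tell when two rows hold the same value. Supporting sort, join, or set operations with less leakage requires heavier machinery than this sketch.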

Of course, this isn’t for the faint-hearted; you should not transform all of your data into zero-knowledge data, which would be neither efficient nor wise. Do this only for your most critical and most toxic data: “toxic” in the sense that a leak would create something akin to a “toxic spill,” a notion coined by Forrester Research. Customer PII, such as credit card numbers and SSNs, is a good example of toxic data.

If you have zero-knowledge data, you can reliably protect it from prying admin eyes, insecure applications, and security compromises. At the same time, you can still satisfy legitimate data access requests such as those by authorized users and law enforcement without revealing what the data is.

Cryptographically astute readers will point out that this isn’t exactly “zero-knowledge” in the strict sense of Goldwasser, Micali, and Rackoff’s paper. But we argue that for practical applications, probabilistically “zero-knowledge” is good enough.

Given that 2014 was a busy year for breaches, and 2015 hasn’t exactly been quiet so far, the power of zero-knowledge data cannot be ignored. So how do we get “zero-knowledge” data? We’ll need some creative cryptographic engineering techniques, which we’ll cover in the next post in this data security series.