haveibeenpwned.com, k-anonymity and zero-knowledge proof

in #science5 years ago (edited)

What is the k-anonymity, what is the zero-knowledge proof and how does it work? Today I would like to introduce you to these interesting (at least for me) ideas that are connected to the ideas of privacy, anonymization and secrets. We will start with k-anonymity, which was implemented in haveibeenpwned.com website to allow you checking if your password was leaked, without passing your password to haveibeenpwned.com website.

I am assuming that my readers are not all IT people, not every one of you is a technical person, so we will start with a primer on authentication and how does it work. In other words - what is happening when you're logging into your favourite website. Note, that it is about regular websites, not services like steemit.com, where things are more complicated.

Passwords, hashes and hackers

We all know how a passwords work. I have a password that no one else have, and so only I can login to the service that I use, by passing to this service unique combination of my login and password. Simple, right? But what is happening in the background, while we are logging in? In most basic version it would be like this: you're opening yourfavouritewebsite.com in a web browser, type in login and password, and your browser would send that data to the web server on which yourfavouritewebsite.com is located. Then this web server would check if the password that he got is equal to the password that he stores in database. As pictured below:

Usually that is not how it works, at least it shouldn't. That would mean, that our password is stored on that web server in so called "plaintext", in a "not encoded" form. So if your password is "ilovemymum123" then exact same string is stored in the database, that is "ilovemymum123". If a hacker, or someone else, gain the access to that database, then he will know our password. That is very dangerous, especially if we are using the same password on other websites (which is not recommended), and maybe we are using the same e-mail address or the same login to log into those websites. So, what is usually done? Password is stored in the database in "hashed" form, hashed using hashing function. Everyone who knows the basics of how blockchain works, knows also what is and how hashing function works. Hashing function transforms some string, in this case a password, to other string, but while maintaining few important rules:

  • transformed string (hash) can not be easily reverted to the password form. So if I have a hash XyZ14I7d94rf#sk2, I can not easily guess what is the password that has been transformed to that hash, I can not guess that what is behind this hash is "ilovemymum123"
  • in ideal situation only one string will result in a certain hash. That means that "ilovemymum123" results in XyZ14I7d94rf#sk2, but no other string should result in the same XyZ14I7d94rf#sk2 hash
  • hash is always of the same lenght, doesn't matter what is the length of hashed password
  • if we have 2 similar passwords, let's say "ilovemymum123" and "ilovemymum1", the hashes for those two should be very different. For example XyZ14I7d94rf#sk2 and 9321dsklcv23d$fe. This way you can not see if passwords behind these hashes are similar or not

As some of you might suspect, or already know, this means that a pair of login+hash can be used by the server as a equivalent of login+password, without storing our password in the database permanently. So, how does logging looks like now? We are opening the browser, typing login and password, browser is sending the data to the server. The server is transforming the password that we sent using hash function, and compares resulting hash with the hash that is stored in the database (it has been created and stored there when we were registering). If the hashes are equal, then we are authenticated. As presented in the below picture:

The password in not in hashed form while traveling to the server. It's being hashed after the server received it, that's why it could be "sniffed" on it's way to the server by malicious entities. But that's the topic for other post, and something that for example SSL is trying to protect us from.

Leaks, leaks everywhere

Quite often we hear, that there has been a "leak" of data from some website or service. In other words, something got hacked, and that something might store data about us. So we worry. If the website that has been hacked stored passwords in "plaintext" form, then it's bad. If not, then still hashes of the passwords might be cracked after some time, depend on the exact situation. It's even worse if there were some sensitive information about us, for example as was the case when there was a leak from a website created for people in relationships that were looking for someone to cheat their spouse with.

So, there are leaks. It would be convenient to check if information about our accounts are among that leaks. It would be nice if we would not have to manually search leaked data all over the Internet. That's why haveibeenpwned.com has been created. It lets us check two things:

  • if our e-mail address is assigned to some of the accounts that has been leaked?
  • if our password is among the ones that has been leaked as plaintext, or as hashes that has been broken?

First option does not worry the user. He can just type his e-mail into the text box on https://haveibeenpwned.com and the website will check if he is in leaked databases. The worst that could happen is that the owner of haveibeenpwned.com would send you some spam to that e-mail. But it is popular and well regarded website, so you are not worrying.

If you want to check if your password is among the database, then it's different. You can type your password on https://haveibeenpwned.com/Passwords and check if it's been leaked. But what if the owner of the website will store my password, and will be hacked, or if he himself is malicious? Something like picture below presents:

But haveibeenpwned.com used some nice trick, thanks to which we can ask this website if our password has been leaked, without worrying that the haveibeenpwned.com server will even see our password. It is based on something that is called "k-anonymity".

Enter k-anonymity

Let's just wait with that k-anonymity thing for a second and think what could be the solution to this problem without using k-anonimowość. One of the solutions is to hash our password, on our end, and then send only the hash to the haveibeenpwned.com. Then the server would hash his passwords using the same hash function as we did, and compare his hashes to our hash. If he got the same hash in his database, then our password has been leaked. But, wait a minute! If out password wasn't leaked, then the server would get only our hash and would not know our password, but if it was leaked, then the server could check which of the leaked passwords, when leaked, create the same hash that we've sent, and could know what is the password that we entered. So it's not ideal:

k-anonymity, is a solution here. It is a concept introduced in 1998 roku by Latanya Sweeney and Pierangela Samarati. It was created as a way to deal with the questions "how to anonymize data, but still be able to use this data in a meaningful way, in some research".^ Data is easy to anonymize, you might say, let's just remove columns named "Social Security Number", "Name and surname" and "Exact address" in our excel sheet, right? It's not that easy.

Let's say that a hospital wants to share the data with researchers. Data about government workers that were patients of that hospital. It removed names, surnames, social security numbers, exact address and other things like that from the data. What has been left are "gender", "date of birth" and "ZIP code" because it was needed by researchers. Nice data, so we could check for example if government workers living under given ZIP code are more likely to deal with a flu, or sexually transmitted diseases... This is what happened in 1997 in Massachusetts. Latanya Sweeney noticed that. The governor of Massachusetts stated that data is safe and anonymous. Latanya bought public voter records from Massachusetts. In that record she spotted the governor, the same that told peple that data is anonymized, and it's safe. She checked what is his ZIP code and date of birth, and coordinated it with data that hospital shared with researchers. In hospital's data there was only one person living under exact same ZIP code, with the same date of birth, and with a gender "male". Now she knew that everyting related to that "anonymous" person in hospital's data was about governor - all data about his medical history in that hospital.^

This is the mentioned Massachusetts governor - Bill Weld

What went wrong, what could be done different? We could remove ZIP code from data also, but as was said, it would render the data unusable for researchers. We could make sure that there are more than 2 people with the same ZIP code, gender and birth date, then Latanya would not be sure which one of those two is the governor. In that case we could say that data was 2-anonymous. If there will be k people like that, the data is k-anonymous. Is the k is high, then we can sleep well - the data was safely anonymized. How to achieve that? For example by generalizing - instead of mentioning exact birth date, we could group people into groups "aged 18 to 25", "aged 25 to 35" et cetera. Then John Kowalsky aged 19 and John Kowalsky aged 24, living in the same neighbourhood, under the same ZIP code, would be seen in the anonymized data as exact same people and we could not differentiate between the data about John aged 19 and John aged 24.

Ok, so, how does it relate to haveibeenpwned.com? I mentioned that sending the hash to the server of haveibeenpwned.com would not be secure enough, because then if our password was leaked and was among databases of haveibeenpwned.com, the server would know what is the password that we typed on that website. But what if we send only first characters of our hash, the server will send us back all hashes that starts with the same sequence and than we will check on our side, if among that hashes if the hash of our password? Voila, we can check if our password is in haveibeenpwned.com database of leaked passwords, without sending to that server our password nor hash, and without downloading a whole database of passwords or hashes.

Of course we don't have to trust that haveibeenpwned.com works as I explained. If we are not scared of technology, we can manually generate hash of our password, and ask the haveibeenpwned.com about hashes starting with beginning characters of our hash, and then check if the list that we will get contains our ful hash. Just a explained in this video:

What is the moral of this story? Underestimating metadata is dangerous. Would you mind if your phone operator would record your phone calls? Of course you would mind. But what about history of your phone calls (when you called and whom) and history of relay stations/BTS that your phone has been connected to? This is not so worrying, right? So imagine hypothetical edge-case scenario. We are checking history of phone calls of some citizen, we can see that he was calling a number under which there is a hot line for suicidal people, and his phone call was connected at that time to relay/BTS that covered region around some river, mostly bridge on that river. The metadata is telling us that this person probably wanted to jump from a bridge...

Zero-knowledge proof

Another topic that is related to data anonymization, while keeping the data usefull, is "zero-knowledge proof". This term was coined in 1989, in a whitepaper called "The Knowledge Complexity of Interactive Proof-Systems"^. Zero-knowledge proof means that we are able to demonstrate to someone that we knows something, without revealing what is the content of that knowledge. It's very easy to prove to my friend that I know what is the capital city of Argentina, by saying "The capital city of Argentine is Buenos Aires, my friend". But it's harder to prove it without revealing what is the capital city of Argentina that we have in mind. Why would I want to do that, you ask? Sometimes we are forced to reveal some data about us, and we might not want to. If you are taking a loan in the bank, you might need to share your history of transactions. It would be easier to prove that you are solvent and the bank can give you a loan, without sharing your transaction history. Or let's say that you're a country and struck a deal with other country that you both will destroy 10% of your nuclear power. How to prove to the other country that you destroyed 10% of your bombs, and not some dummy that was pretending to be the real bomb, but without revealing how your bombs are constructed and how do they look exactly? It is your secret how you're engineering your bombs.

So, how does the zero-knowledge proof might work. Let's imagine, that we have a friend who is colorblind. You have two balls, red and green. You want to prove to your friend that they are of different colour, without saying to him what colour they are. Your friend might take the green ball to the left hand, red to the right hand, hide them behind his back and switch them between hands (or not), then show you the ball and ask - did I switched them when they were behind my back? You would be able to say if they were or not, because you can see the colours. So, the balls are of different colour. But your friend might not believe you, maybe you had a luck and balls are actually both green? So you do it again, and again and again. After few guesses your friend is convinced that the balls are of different colour, because you was able to recognize if he switched them or not.


That example was an interactive zero-knowledge proof, because we had to do some interaction with out friend to convince him. It has its limitations, for example if we will meet another colorblind person (one that do not recognize red from green, because there are many versions of colorblindness), then we need to start all over again to convince that person. There are also non interactive versions, one of them is zk-snarks. Zk-snarks is implemented in zcash cryptocurrency, to anoynmize transactions on blockchain. Team of researchers working on blockchain in ING bank came up with updated version of so called zero knowledge range proof^. What it does? It let you prove that you are earning enough to get a loan from the bank, that what you earns is between a range of X and Y, without actually telling the bank what you exactly earn. (if the range if big enough)

There are disadvantages to zero knowledge proof. One of them is that we will never be 100%. Example with a colorblind person show us that the colorblind friend is never 100% sure. he can be 99.99999% sure, because we were able to guess if he switched the balls or not many times in a row, but it's not 100%. Of course the 99% certainty is often enough. Another disadvantage, in case of zk-snarks and zcash is that it needs big amount of computing power.

There is another disadvantage, and it's because of how zero knowledge proof might be. We could lose information using too often zero knowledge proof. Let's say that we have 3 people who knows something. We don't ask them about that something, only about proof of that something. If that 3 people will die or we will lose touch with them, we could also lose that information that those people held. Just as if we had some safe, and forgot the combination to that safe because we were not opening it. Just knowing that expensive things are in that safe was enough for us. End then we forget the key, and expensive things are lost...

What about nuclear bombs that I mentioned? In 2016 it was proposed to use ZKP method to verify if a country destroyed real or fake nuclear bomb, without forcing that country to reveal their engineering secrets of those bombs. That country would need to prepare some kind of "template" with a "negative" of characteristic of their bombs. Something that would have physical characteristics of their bombs, but negated. Then if we line up that template with a real bomb (hidden in a black box, for example) and let some neutrons past that template and the bomb, we could obtain a resulting "image" that would be blank, because things in atomic bomb and template negate each other. We can check many bombs with that one negative, if some bomb will be a fake one, then the "image" obtained by passing neutrons through template and bomb will actually reveal some characteristics of a real bomb (negation of those characteristics), that's why this method even discourages from cheating and trying to convince someone that the bomb is real one when it's fake, because then that person might have a peek into how our real bomb is constructed.^.

Mentioned template/negative is not literal template showing physical shape of a bomb. It's more of a metaphore. The template is a negative of some physical features of the payload of atomic bomb, of properties of isotope used et cetera. I am not getting into the details intentionally, because physics is not my specialty. But it is a nice example of how ZKP could be implemented in practice, not even as some mathematical equation but experiment with real physical things. You could read more about that here.


source of images: main, vectors that I used to create images with server and PC "talking", vectors part 2, governor, zcash logo, safe, separator
other sources and inspiration not mentioned in the text: 1, 2