[Editor Note: This article is contributed by Jonathan Palley <jonathan At brainpage.com> who is the co-founder of Brainpage. He started his first Company focused on Virtual Flash Cards in high school and successfully obtained half million paid users globally. In 2006, he founded Idapted, a company focused on e-education and sold in 2011. Residing in beijing, his current company, Brainpage, is aiming at bringing cutting edge data technologies to Chinese market.]
There is turmoil in the Middle East, global economic uncertain and imminent high-stakes political decisions in the world’s two largest economies (the U.S. and China) – but none of these would be more popular than a simple math problem on the New York Times’ website this past week. Yes, nothing beats the magical mathematical fun of The Birthday Problem – the surprising statistical insight that if you have 23 people in a room, there is a 50% chance that two of them have the same birthday.
But perhaps, somehow, its appropriate. This problem – and the basic statistical principles behind it – is an essential ingredient to our digital and increasingly “big data” world. Without having this problem, you can’t securely send information over the internet nor can you instantly access massive amounts of information. We often forget these low-level discoveries that enables our modern lives; and major technical announcements, like the release of a new cryptographic standard called SHA-3 this week, are rarely covered by the news. Thus, in honor of the NY Times’ article and SHA-3 (more on this news from the security industry below), we thought we’d take a moment to explain just why these types of problems are so significant to the technology world.
THE LOSSY CONNECTION:
Imagine I’m sending you a message. But instead of writing the message on paper, I send it as a series of Scrabble blocks. How can we make sure you get the message completely intact – that blocks weren’t switched or missed in transport?
One method might be to send the message twice. In this way you can compare the two messages and see if something got screwed up. But that is too laborious. So instead, I decide on something clever. I’m going to give each character a number (say A is 1 and B is 2 and so on) and I will add up all of the characters in the message. At the end of the message I will also send this sum of the characters. Then, you can do the summation yourself, compare to my sum and know whether you got the message intact. It works!
“But wait”, you say, “couldn’t different combinations of letters have the same sum? For example, if I have the message ‘HI’ (8+9=17) it would be the same as ‘BO’.” This is true. But its a “risk” we are willing to take. More specifically, I could ask: what is the probability that two messages would have the same sum? In fact, this is a very similar question to the birthday problem!
What I described here is nearly exactly what is done when data is transferred across the internet or stored on a hardisk. It is transferred/stored with what is called a “checksum” or hash sum – a number calculated from the data designed to be statistically robust using the birthday problem. Without it, data storage and transmission would be unreliable to the point of being unusable.
THE SECURE CONNECTION:
Let’s take this this one step further. Imagine that instead of sending you a message, I’m sending you an operating system for your computer. While you trust me, its an ugly world out there and we can’t trust how this operating system gets delivered to you. Specifically, we are afraid that someone will insert something in the operating system, while it was being transported to you, that secretly captures what you do on your computer.
“But wait”, you say again, “we can use the method above to see whether someone has messed with this operating system while it was being transported!” You are right – almost. Imagine if this malicious person was really determined and had a lot of computing power at their disposal. They could, in theory, find a way to modify the operating system to their malicious intent while also keeping the checksum exactly the same (called “finding a collision”). Now, you would have no idea that the operating system you received was not what I intended to send.
There is, of course, a solution to this problem. First, using much of the math found in the birthday problem, we need to choose a checksum that is long enough such that it takes a HUGE amount of guesses to have a reasonable chance of finding a collision. This makes the theoretically probability of collisions very small.
Then, we need to calculate the “checksum” using what is called a “cryptographic hash function”. A cryptographic hash function has two very special properties unique to our secure needs. First, it is impossible to use a “mathematical shortcut” to find a collision in a cryptographic hash function. In other words, someone trying to find a collision MUST play the probability game and try to “guess” every combination. Second, it is computationally “expensive” to calculate – based on the way computers process instructions and data it will take a measurable amount of time to compute each checksum.
We can then put these numbers together – we can multiple the number of guesses it would take to have a chance of finding a collision by the amount of time it takes to compute each checksum. The result is the total amount of computing power you need to find a collision; and here’s the magic: this number for modern cryptographic hash function, by orders of magnitude, far exceeds the total amount of computing power in the entire world. In other words, even if you had ALL of the computing power in the world your chances of find a collision is STIL much, much smaller then winning the lottery. The hash is, practically, secure.
Which brings us to the SHA-3 function. Many “cryptographic hash functions” are internationally recognized standards – allowing different systems to talk to each other. Probably the oldest still in use hash function is called MD5, created in the early 1990s. However, with increased computing power and increased understanding of the math behind hash functions, researchers have found ways to take mathematical shortcuts and quickly find “collisions” in MD5. Indeed, Microsoft Windows’ continued use of MD5 in certain areas was a key enabler of the Flame virus.
Thus, competitions have been held to find the successor. First was SHA-1, then SHA-2 and now SHA-3. Each one successively more robust then the previous. (SHA-1 is still widely in use but has a “theoretical” mathematical shortcut that no one has actually been able to implement yet. This has given rise to the need for SHA-2, and it its broken, SHA-3 as a backup).
One final thought. The awesome and disruptive power of quantum computing is most stark in the context of our above competition. In quantum computing, the rules of classic probability are thrown out in favor of “quantum probability”. The Birthday Problem is irrelevant. The result is it becomes (theoretically) possible to find a collision nearly instantly.
RE-EMERGENCE OF THE ELEGANT:
If you’ve never heard your technology friends talking about hash functions or the birthday problem, you can be forgiven. Many of these low-level concepts have been effectively abstracted under higher-level programming paradigms. However, with the emergence of big-data and ubiquitous computing, engineers are finding these fundamental concepts have new applications and uses (see technologies like Redis, Riak and Hbase) and an understanding of hash functions is critical to building modern secure applications.
I’ve titled this post: steal my money. Below is the password to my bank account, encoded in the new SHA-3 standard. If you can guess it, all my money is yours. Good luck – it will take you approximately 5e153 (5 with 153 zeros after it) guesses to get it right!
My banking password: ee25979a0f91058e6952d09593a4ad6ccdfe2f96b3a35fdba8203c9a4cda3e078b3674530b8bfdda6f369f095b9947f268839c70115ac19f47b9f2e2916681e