Millionaire’s Problem

October 28th, 2007

Lets assume there are two millionaires, named Alice and Bob. Alice and Bob what to know who is richer without revealing how much money either one of them have. This is what is known as the millionaires problem, originally introduced by Andrew Yao [paper link]. Why am I talking about this? Well I’m interested in the ability to share information among people while preserving their data privacy. In this case the two millionaires will know which is richer (information) without sharing how much either is worth (their private data).

The solution we will be exploring relies upon two assumptions. First, it is assumed that the two millionaires knows approximately how rich each are. In our example we’ll assume that Alice and Bob knows that they are both worth somewhere between 1 and 10 million dollars. The second assumption is that Alice and Bob are using a public key encryption system, in this example we’ll be using RSA. With those assumptions lets begin working through an example.

We’ll give Alice 8 million dollars (i=8) and Bob 6 million dollars (j=6). Then lets assume that Alice’s public key is (17, 1591) and her private key is (89, 1591). These keys are really small to make the math easy, also note that the maximum number that can be encoded with this pair is limited (actually the maximum is 1590).

STEP 1: Lets have Bob start the process. To do this Bob first generates a N bit random number called U, we’ll choose 1180. Bob then takes that number and computes C = 1180^17 mod 1591 = 749. Note this is actually the encrypted version of Bob’s random number using Alice’s public key.

STEP 2: Bob takes the C value he computed and adds his wealth value J and subtracts one, C - J + 1 = 749 - 6 + 1 = 744. Bob then sends Alice this value.

STEP 3: After receiving this value Alice creates values Y1, Y2, …, Y10. Where Y1 is the RSA deciphering of (C - J + 1), Y2 is the deciphering of (C - J + 2), Y3 is the deciphering of (C - J + 3), etc. Although Alice doesn’t know C or J, Alice can still compute this because Bob sent her (C - J +1). This produces the following table of Y values.

x (C - J + x) RSA Function Yx
1 744 744^89 mod 1591 805
2 745 745^89 mod 1591 281
3 746 746^89 mod 1591 339
4 747 1311
5 748 1244
6 749 1180
7 750 1062
8 751 1359
9 752 219
10 753 753^89 mod 1591 942

STEP 4: Alice generates a random N/2 prime number called p, remember Bob’s random number was N bits. We’ll use 631, its prime and its close enough to N/2 bits. Using this number Alice generates Z1, Z2, Z3, … Z10. Where Z1 = Y1 mod p, Z2 = Y2 mod p, etc. This operation generates the following table of Z values.

x (C - J + x) RSA Function Yx Zx (= Yx mod 631)
1 744 744^89 mod 1591 805 174
2 745 745^89 mod 1591 281 281
3 746 746^89 mod 1591 339 339
4 747 1311 49
5 748 1244 613
6 749 1180 549
7 750 1062 431
8 751 1359 97
9 752 219 219
10 753 753^89 mod 1591 942 311

Detail Note: The random prime needs to be chosen such that the consecutive Z values differ by at least 2, the one we chose does and its easy to verify in the general case.

STEP 5: The rest of this is just the setup, heres the punchline. Alice then sends Bob the random prime she selected and the calculated Z values: Z1, Z2, Z3, … up to the i value (the number of millions Alice has i = 8). The remaining 10 values Alice sends but adds one to each value; Z9 + 1 and Z10 + 1. To recap Alice sends the first i Z values as they were calculated, she then sends the remaining Z values plus 1. Here are the numbers alice sends:

p Z1 Z2 Z3 Z4 Z5 Z6 Z6 Z8 Z9+1 Z10+1
631 174 281 339 49 613 549 431 97 219+1=220 311+1=312

STEP 6: Bob then computes G = U mod p = 1180 mod 631 = 549. Recall U is the random number that Bob selected in step 1. The value p is known because Alice told Bob the random prime she selected (p). Bob then looks at the jth value (his wealth in millions) of the 10 Z values sent to him by Alice (ignoring the random prime Alice sent first). If this jth value is equal to G then Bob knows that Alice is of greater than or equal wealth to himself. In our case G does equal the jth value (549 = 549), this means that Alice is either of greater or equal wealth. On the other hand if G is not equal to jth value it means that Bob is richer than Alice.

STEP 7: Bob then tells Alice that she is richer. Both parties now know which person is richer and was able to determine this without exposing their actual wealth. Note that Bob actually determines if Alice is greater than or equal wealth, to determine if its equal or greater than you would have to run the algorithm in reverse (swap the roles for Alice and Bob).

At the end they both know more, for example Alice knows that Bob is worth between 1 and 8 million and Bob knows that Alice is between 6 and 10 million. This is a result of just knowing who is richer than the other and the original range of values they already knew. Besides this no other information can be determined by the involved parties.

Still not convinced? Here is some python code that demonstrates how this algorithm works, it implements RSA public key encryption and then walks through one example (similar to this article). You can easily change the wealth of Alice and Bob and see how it works. How to play with this:

  1. Download million.py
  2. Execute this file using python. python million.py
  3. At the bottom of the file is the “main” body of the program. You can edit various values in this section and generally play with everything there. Specifically BOB_VALUE, ALICE_VALUE, and MAX_VALUE.

There are several implementation problems with this when in a real world problem. First, what if people lie? Bob and or Alice could simply not tell the truth about their values. By lying Alice and Bob could run the algorithm multiple times to and eventually determine exactly how rich the other party is. Yao actually addresses this issue in his original paper and proposes a few solutions. Second, computation cost is extremely high, since you are computing public key encryption values for several values the computation time could be very high. Third, the range of values, in most practical problems the range of values is greater than 10 which means lots of computation but also lots of Z values need to be sent between Alice and Bob. Fourth, if you implement this for multiple parties there can be collusion among the parties which could reveal private data about an individual in the group.

Its interesting to think that this was written in 1982 just a few years after public key encryption was discovered. Was it a problem looking for a solution or a solution looking for a problem? Which ever direction the solution came from the millionaires problem proves that you can share information yet protect ones private data. While this solution might have some issues at least there is a solution! Since this paper was published there has been several more sophisticated solutions but this one is elegant and is fairly obvious after working through a few examples.

CRLF, PHP, Postfix, and SPAM

October 21st, 2007

Recently we switched our production machines from running sendmail to postfix. In this process our emails started getting caught as SPAM and it took a bit of digging to figure out what exactly was happening. The error we got was:

BAD_HEADER: Improper use of control character (char 0D hex)

If you look up 0D hex character in your favorite ASCI character table you will find that this is a carriage return (CR). As specified in email standards document RFC 822, all of our email headers were separated by CRLF (a carriage return and an line feed - \r\n). At this point I began scratching my head, postfix won’t relay the message because our email conforms to the specs? After a bit more digging and looking at the actual message postfix was trying to send I noticed that instead of CRLF (\r\n) our email had CRCRLF (\r\r\n).

What is actually happening is postfix assumes that any line feed separated header must be missing its carriage return so it takes the liberty to replace any line feed character (LF) with a carriage return and line feed (CRLF). Because we were already formatting our messages correctly this was causing our email headers to be separated with CRCRLF. The spam detection software sees this and assumes we are trying to perform a carraige return injection attack.

The fix is to send email headers that are only line feed separated (\n) because postfix will fix the headers along the way. What we did was make this a configuration parameter in our software since our software runs machines with both sendmail and postfix (depending on the client).

I like Neil Diamond! Is that private data?

October 11th, 2007

Recently I purchased a CD that was recommended to me by Amazon. I was really impressed, although I’ve not listened to the entire CD I’m glad Amazon recommended it. I’m not the only one with this experience but it got me thinking about the privacy of my data (in this case my purchase data) and more importantly who actually owns it.

We are told to only allow trusted parties access to our private information but what if those trusted parties don’t have what I’m interested in? Or maybe they just can’t figure out what I like. Is there a benefit in allowing untrusted parties access to my private information so they can recommend new items or at least target their presentation towards me? I’m not talking about spam I’m talking about user-aware presentation logic. If I visit a new website selling shoes it would be useful for both of us to know that I usually buy brown shoes. Do you want to tell everyone in the world that you buy brown shoes? Maybe preferred shoe color isn’t violating my privacy but what if the site could know that I support gun rights or that I’m pro-choice. Where do you draw the line? Can they know my favorite color but not my preferred style of underwear?

I believe that a much better approach is to have a decentralized algorithm that allows me to use my peers to determine what items I might be interested in. The idea is that I want to tell a group of peers “Hey, I’ve bought these items do you have any suggestions for items I should buy?”. How might we do this? The constraint is that I want to keep both mine and my peers data private.

To recap I have to ask my peers about purchases without revealing that I made those purchases myself? How can I do that? How do you do that with your friends today? Ahh… we use the I have a friend that really likes Neil Diamond what should I buy her for her birthday? That’s right blame the friend for your bad music tastes :) How can the friend respond without admitting they too listen to Neil Diamond? Using the same trick; “I have a friend that loves Neil Diamond and he also likes Bread”.

There you have it a decentralized privacy preserving approach to creating personal recommendations. The actual implementation isn’t so simple but the general idea is easy to see. Get a bunch of people together then form groups. Ask around your neighbors and then have your neighbors ask the people in their group. As long as responses and requests hide the origin of the message no one will know who has this sickening music taste. But in the end everyone is happy because data is kept private but you also have recommendations. How might this work in the context of the web? That’s what I’m working on right now so I’ll let you know if I come up with a solution.

In the meantime if you want to learn more about privacy of data you might want to look at various papers in the data mining literature. There are also some papers dealing with peer-to-peer technologies and how to implement private data sharing, an interesting one is Friends Troubleshooting Network (FTN) out of the Princeton University (NDSS 2005). Over the next few weeks I’ll post some code samples that implement various ideas about privacy preserving data mining.

Information Dispersal Algorithms

September 25th, 2007

Information dispersal algorithms are used for splitting data into multiple pieces such that given some group of these pieces the original data can be still be read. In general, the goal of information dispersal is to divide data into f pieces so that a subset of k of those pieces can be used to recover the data. One of the widest known information dispersal algorithms is used in RAID level 5, known as a the parity drive. In this system there are at least 3 disk drives of storing data, the first two are storing different data but the third drive is storing the XOR value of the data on the other two drives. In this system, given any two of the drives one can read all the data being stored. This is simple to see and easy to implement making its popularity unsurprising.

There are many information dispersal algorithms but this article is going to focus the cleverly named and the very first algorithm known as The Information Dispersal Algorithm (IDA). This algorithm was proposed by Micheal Rabin way back in 1989 and involves a bunch of scary math. IDA will break data into f pieces such that any k of those f pieces can be used to read the data, where k is less than f. The really attractive feature of IDA is that all the pieces are equally important, so that given any k of the pieces the data can be recovered. This article will cover the high level details of this algorithm and present some python code that actually implements the algorithm.

IDA is rooted in two mathematical concepts, one is matrix multiplication and the other is finite field arithmetic. Lets start with the simple one, matrix multiplication.

When you multiply two matrices together (A \times B) recall that the number of columns in A must be the number of rows in B. Then the output of this multiplication will result in another matrix with the number of rows equal to the number of rows in A and the number of columns equal to the number columns in B. Watch out here comes the trick. Let A be a special encoding matrix made up of f rows and k columns then let B be the raw data separated into k rows. Can we multiple these two matrices? Yep, A has k columns and B has k rows. What is the outcome of this multiplication? A new matrix that is made of f rows and the same number of columns found in the data matrix (B). This new matrix is a our encoded data, it contains f rows of data where each row is a piece of data.

A_{encoding-matrix} \times B_{original-data} = C_{encoded-data}

Now that we have encoded data how do we decode it? This isn’t that complicated but does force us to add a special constraint to the encoding matrix used previously. The encoding matrix must be linearly independent, I’m not going into details about this but its necessary because we need the to be able to inverse the encoding matrix (A).

Let say we recover 7 out of 10 pieces of data, and assume we only need 7 to recover the data (k=7). Then we build a matrix with those 7 pieces of data we recovered, lets call it C_{recovered}. Next, build a new matrix using a subset of our encoding matrix, lets call this matrix D. Matrix D is the 7 rows of the encoding matrix that corresponds to the 7 rows of the recovered data. Once matrix D is create we need to take the inverse of that matrix, this is why the needed the encoding matrix to be made up of linearly independent rows. After we have this inverse matrix (D^{-1}) we can multiply it with the recovered data pieces (C_{recovered}). The result will be the original encoded data!

D^{-1} \times C_{recovered-data} = D_{original-data}

In reality this was the easy part. Just divide a bunch of data into rows in a matrix, create a linearly independent encoding matrix, multiple them, inverse a matrix, and multiple again. Viola, erasure coding on a platter. As with all things easy there is always a footnote, all the math must be done using finite field arithmetic.

The reason this is necessary is that we need to ensure that the individual cells inside the matrix represent bytes, always even when we multiple them together. No, we can’t just preform all math mod 256, this would violate various arithmetic properties that we depend on. Finite field arithmetic is really just a fancy way of saying we are going to do math assuming we have a limited set of numbers. Specifically, we are doing this math using the finite field known a GF(256)=GF(2^8), also referred to as a galois field. The tricky math part is ensuring that the arithmetic properties we depend on are preserved when we do mathematical operations This means we have to redefine what addition, subtraction, multiplication and division do. You can find details about this in various papers but the gist of it is that addition and subtraction is an XOR operation. To we define a log function allowing use to reduce the problem of multiplication and division to one of addition and subtraction.

Once we build matrix multiplication and matrix inversion using finite field arithmetic we just have to connect a few dots and we have IDA. I spent several weeks digging into various C++ implementations and came away more confused that I started. I decided the only way to learn it was to actually build it so I set out on the journey of re-implementing IDA in Python. Yes, its slow but this was about learning the algorithm not about writing efficient code.

The python code comes in the form of 5 main library files:

  • ida.py - Connects all the above libraries to actually preform the IDA algorithm
  • Splitter.py - Splits a file into a matrix
  • Matrix.py - Object to represent a matrix and do math with it
  • FiniteFieldMatrix.py - Overloads the matrix object and makes all the math happen using the finite field arithmetic
  • FiniteField.py - Object to represent numbers in a finite field and implements all mathematical operations
  • SmallField.py - Class that does the mathematical operations in the finite field.

Tar file available here: idapy.tar.gz
Where should I start? Thats a great question which I’ve been struggling with for a few days. First dump all these files in one directory, then try running python ida.py. This should execute a unit test found at the bottom of this file. You should start by looking at the unit test function called “testDecoder”. This function attempts to encode “/etc/hosts” file and properly decode it (you’ll need to change this path if you are running on a non *nix box). Then you can start by tracing the steps backward from here. This code isn’t the cleanest but it does work and I believe is fairly easy to understand. Feedback is welcome!

SCAR - Scatter, Conceal, and Recover

April 25th, 2007

After several painful months of writing my masters thesis it is finally done. I feel really good about the work, I believe its original and could have some commercial application.

I have learned that I’m not good at writing (I kinda knew that already). As a result I’m going to focus my energy on learning how to be a better writer. I have several other research topics that am going to pursue and force myself to write about them. I figured the only way to get better at writing is to write.

Back to the topic at hand, my thesis is done! Below is the abstract and you can download the pdf.

Abstract:

This thesis describes a secure and reliable method for storing data in a distributed hash table (DHT) that leverages the inherent properties of the DHT to provide a secure storage substrate. The framework presented is referred to as “Scatter, Conceal, and Recover” (SCAR). The standard method of securing data in a DHT is to encrypt the data using symmetrical encryption before storing it in the network. SCAR provides this level of security, but also prevents any known cryptoanalysis from being performed. It does this by breaking the data into smaller blocks and scattering these blocks throughout the DHT. Hence, SCAR prevents any unauthorized user from obtaining the entire encrypted data block. SCAR uses hash chains to determine the storage locations for the data blocks within the DHT. To ensure storage availability, SCAR uses an erasure coding scheme to provide full data recovery given only partial block recovery.

This thesis presents the details of SCAR. First, the framework, related protocols, and mechanisms are described. Second, a prototype implementation is presented showing the feasibility of SCAR. Third, analytical models are discussed that characterize SCAR’s behavior, the models are then validated using experimental results. Lastly, the models are analyzed to further understand the tradeoff between data security and data availability. The exploration of this tradeoff leads to the conclusion that SCAR can effectively balance this tradeoff when the nodes of the network are “sufficiently” available.