Information Dispersal Algorithms
Information dispersal algorithms are used for splitting data into multiple pieces such that given some group of these pieces the original data can be still be read. In general, the goal of information dispersal is to divide data into f pieces so that a subset of k of those pieces can be used to recover the data. One of the widest known information dispersal algorithms is used in RAID level 5, known as a the parity drive. In this system there are at least 3 disk drives of storing data, the first two are storing different data but the third drive is storing the XOR value of the data on the other two drives. In this system, given any two of the drives one can read all the data being stored. This is simple to see and easy to implement making its popularity unsurprising.
There are many information dispersal algorithms but this article is going to focus the cleverly named and the very first algorithm known as The Information Dispersal Algorithm (IDA). This algorithm was proposed by Micheal Rabin way back in 1989 and involves a bunch of scary math. IDA will break data into f pieces such that any k of those f pieces can be used to read the data, where k is less than f. The really attractive feature of IDA is that all the pieces are equally important, so that given any k of the pieces the data can be recovered. This article will cover the high level details of this algorithm and present some python code that actually implements the algorithm.
IDA is rooted in two mathematical concepts, one is matrix multiplication and the other is finite field arithmetic. Lets start with the simple one, matrix multiplication.
When you multiply two matrices together (
) recall that the number of columns in A must be the number of rows in B. Then the output of this multiplication will result in another matrix with the number of rows equal to the number of rows in A and the number of columns equal to the number columns in B. Watch out here comes the trick. Let A be a special encoding matrix made up of f rows and k columns then let B be the raw data separated into k rows. Can we multiple these two matrices? Yep, A has k columns and B has k rows. What is the outcome of this multiplication? A new matrix that is made of f rows and the same number of columns found in the data matrix (B). This new matrix is a our encoded data, it contains f rows of data where each row is a piece of data.

Now that we have encoded data how do we decode it? This isn’t that complicated but does force us to add a special constraint to the encoding matrix used previously. The encoding matrix must be linearly independent, I’m not going into details about this but its necessary because we need the to be able to inverse the encoding matrix (A).
Let say we recover 7 out of 10 pieces of data, and assume we only need 7 to recover the data (k=7). Then we build a matrix with those 7 pieces of data we recovered, lets call it
. Next, build a new matrix using a subset of our encoding matrix, lets call this matrix D. Matrix D is the 7 rows of the encoding matrix that corresponds to the 7 rows of the recovered data. Once matrix D is create we need to take the inverse of that matrix, this is why the needed the encoding matrix to be made up of linearly independent rows. After we have this inverse matrix (
) we can multiply it with the recovered data pieces (
). The result will be the original encoded data!

In reality this was the easy part. Just divide a bunch of data into rows in a matrix, create a linearly independent encoding matrix, multiple them, inverse a matrix, and multiple again. Viola, erasure coding on a platter. As with all things easy there is always a footnote, all the math must be done using finite field arithmetic.
The reason this is necessary is that we need to ensure that the individual cells inside the matrix represent bytes, always even when we multiple them together. No, we can’t just preform all math mod 256, this would violate various arithmetic properties that we depend on. Finite field arithmetic is really just a fancy way of saying we are going to do math assuming we have a limited set of numbers. Specifically, we are doing this math using the finite field known a
, also referred to as a galois field. The tricky math part is ensuring that the arithmetic properties we depend on are preserved when we do mathematical operations This means we have to redefine what addition, subtraction, multiplication and division do. You can find details about this in various papers but the gist of it is that addition and subtraction is an XOR operation. To we define a log function allowing use to reduce the problem of multiplication and division to one of addition and subtraction.
Once we build matrix multiplication and matrix inversion using finite field arithmetic we just have to connect a few dots and we have IDA. I spent several weeks digging into various C++ implementations and came away more confused that I started. I decided the only way to learn it was to actually build it so I set out on the journey of re-implementing IDA in Python. Yes, its slow but this was about learning the algorithm not about writing efficient code.
The python code comes in the form of 5 main library files:
- ida.py - Connects all the above libraries to actually preform the IDA algorithm
- Splitter.py - Splits a file into a matrix
- Matrix.py - Object to represent a matrix and do math with it
- FiniteFieldMatrix.py - Overloads the matrix object and makes all the math happen using the finite field arithmetic
- FiniteField.py - Object to represent numbers in a finite field and implements all mathematical operations
- SmallField.py - Class that does the mathematical operations in the finite field.
Tar file available here: idapy.tar.gz
Where should I start? Thats a great question which I’ve been struggling with for a few days. First dump all these files in one directory, then try running python ida.py. This should execute a unit test found at the bottom of this file. You should start by looking at the unit test function called “testDecoder”. This function attempts to encode “/etc/hosts” file and properly decode it (you’ll need to change this path if you are running on a non *nix box). Then you can start by tracing the steps backward from here. This code isn’t the cleanest but it does work and I believe is fairly easy to understand. Feedback is welcome!