IFX Group


Vaporware: Compressible You

Lossless data compression has been big business ever since it was discovered that a little CPU time could make storing and transporting raw data less expensive. To date, lossless data compression technology has taken two main paths: patterns and tokens. The pattern method looks for repeated patterns in the data and uses one copy of the pattern to represent all of them. The token method takes a chunk of data and uses a smaller token to represent that chunk. At best, every method works only on the subset of data that contains redundancy it can exploit. When presented with truly random, non-repeating data, all efforts at compression fail miserably.
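The gap between repetitive and random data is easy to see for yourself. The following is a minimal sketch, assuming Python's standard `zlib` module as a stand-in for a pattern-based compressor; the helper name `compression_ratio` and the sample sizes are made up for illustration.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

# 100,000 bytes built from one short repeated pattern.
repetitive = b"ABCD" * 25_000

# 100,000 bytes from the operating system's random source.
random_like = os.urandom(100_000)

print(f"repetitive:  {compression_ratio(repetitive):.4f}")
print(f"random-like: {compression_ratio(random_like):.4f}")
```

The repeated pattern shrinks to a tiny fraction of its original size, while the random bytes come out no smaller than they went in, and usually a few bytes larger because of the container overhead.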

This brings us to the concept of lossy (acceptable data loss) or approximate representational compression. Most people know that the JPEG and MPEG compression methods discard some of the less noticeable visual data for the sake of smaller data size. This means it is impossible to recover the original data from the compressed result, which works well for images but not for storing the balance in your bank account, where the data must be exact. The world appears to be stuck with a choice between accurate or small, but never both.
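A toy example shows why the lost detail cannot be recovered. This sketch assumes simple quantization, which is only one ingredient of real image codecs and not how JPEG or MPEG work as a whole; the function names and the sample values are hypothetical.

```python
def quantize(samples, step):
    """Lossy step: keep only which multiple of `step` each sample is nearest to."""
    return [round(s / step) for s in samples]

def dequantize(codes, step):
    """Rebuild approximate samples from the stored codes."""
    return [c * step for c in codes]

STEP = 8
pixels = [13, 14, 15, 200, 201, 203]   # illustrative brightness values

codes = quantize(pixels, STEP)          # smaller numbers, many repeats
restored = dequantize(codes, STEP)      # close to the original, not equal

print(codes)
print(restored)
```

Three different inputs collapse onto the same code, so nothing in the stored result says which one was the original. That is tolerable for a pixel and unacceptable for an account balance.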

The future of data compression needs a fast way to get very high compression that is also exactly accurate. The solution will most likely look more like genetic science than computer science.

Consider for a moment the human genome, with 23 pairs of chromosomes that hold all of the information required to turn one cell into a complex collection of individualized and specialized cells, each performing unique functions nothing like the original cell. In terms of raw data this is the ultimate compression. What would happen to data storage and transmission rates if the most complex data set could be compressed to a single genome?

The inner workings of the human genome are still mostly a mystery to scientists. Maybe it has something to do with the gene not needing to know the final result; instead, it only needs the generational ability to make the next part, which in turn has the ability to make the part beyond itself. In this chain of parts creating the next part, the complexity can grow exponentially, with each new level adding more and more specialized branches. In data terms this is a little like storing the instructions to create a program to make the data.

Most programmers with access to a simple scripting language can make a tiny script create enormous data files. The idea with genetic compression is to use a language of genes to encode a sequence of instructions that build the engine that creates the data. Additional levels of genetic compression are added as the data gets larger and more complex.
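The "tiny script, huge file" half of this idea can be sketched in a few lines. The example below is an assumption-laden illustration, not a genetic compression method: the function `expand`, the seed `b"gene"`, and the choice of chained SHA-256 hashing are all hypothetical stand-ins for "a short set of instructions that grows the data".

```python
import hashlib
import zlib

def expand(seed: bytes, size: int) -> bytes:
    """Deterministically grow `size` bytes from a short seed by chained hashing."""
    out = bytearray()
    block = seed
    while len(out) < size:
        block = hashlib.sha256(block).digest()  # each part makes the next part
        out += block
    return bytes(out[:size])

seed = b"gene"                      # the entire stored description
data = expand(seed, 1_000_000)      # one million bytes grown from it

print(len(seed), "bytes of seed ->", len(data), "bytes of data")
print("zlib size:", len(zlib.compress(data, 9)))
```

A conventional compressor finds nothing to squeeze in this output, yet the seed and the short function reproduce it exactly every time. Note that the sketch only runs in the easy direction, from instructions to data; it says nothing about how to find such instructions for data you already have.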

A functional genetic data compression method would let the need determine the level of compression. Long-term archival storage needs the maximum compression, even at the expense of long compression times, while streaming data, where throughput takes priority over compression, can spend less time squeezing. Fortunately, the dividing line between compression size and data throughput is a moving target thanks to the continual improvement in computing power available at consumer prices. As the computing power increases, so does the level of compression possible in any given period of time.
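Today's compressors already expose this trade-off as a dial. As a rough sketch, assuming `zlib` and its compression levels 1 through 9 stand in for the "time spent squeezing", the following measures size and elapsed time on some made-up sample records; the helper `measure` and the record format are hypothetical.

```python
import time
import zlib

def measure(data: bytes, level: int):
    """Compress at the given zlib level; return (compressed size, seconds taken)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(packed), elapsed

# Illustrative sample: 50,000 short text records.
sample = b"".join(
    f"record {i}: balance={i * 37 % 1000}\n".encode() for i in range(50_000)
)

for level in (1, 6, 9):
    size, seconds = measure(sample, level)
    print(f"level {level}: {size} bytes in {seconds * 1000:.1f} ms")
```

Timings vary by machine, which is the point: on faster hardware the higher levels become affordable for data that once had to settle for a quick pass.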

The effect would be an instant expansion of all available storage and transport bandwidth by many orders of magnitude. Imagine holding the complete collective works of all history in a single memory chip or transporting all that data over the slowest communication line in seconds.