History: Developed in 1999 by a company called Rocksoft (now part of Quantum), the concept of variable-length blocks revolutionized the way data backups are performed. Most backup software nowadays has data deduplication technology built into its packages.
Here are two good definitions of Data Deduplication:
The term “data deduplication”, as it is used and implemented by Quantum Corporation, refers to a specific approach to data reduction built on a methodology that systematically substitutes reference pointers for redundant variable-length blocks (or data segments) in a specific dataset. The purpose of data deduplication is to increase the amount of information that can be stored on disk arrays and to increase the effective amount of data that can be transmitted over networks. When it is based on variable-length data segments, data deduplication has the capability of providing greater granularity than single-instance store technologies that identify and eliminate the need to store repeated instances of identical whole files. In fact, variable-length block data deduplication can be combined with file-based data reduction systems to increase their effectiveness. It is also compatible with established compression systems used to compact data being written to tape or to disk, and may be combined with compression at a solution level. Key elements of variable-length data deduplication were first described in a patent issued to Rocksoft, Ltd (now a part of Quantum Corporation) in 1999.
(read here for a complete white paper on it: http://www.gosignal.com/whitepapers/quantum1.pdf )
excerpt from the book: “Data Deduplication for Dummies”
…”Data deduplication is a really simple concept with very smart technology behind it. You only store the block once. If it shows up again, you store a pointer to the first one. That takes up less space than storing the whole thing again. When Data Deduplication is put into systems that you can actually use, however, there are several options for implementation. And before you pick an approach to use or a model to plug in, you need to look at your particular data needs to see whether data deduplication can help you. Factors to consider include the type of data, how much it changes, and what you want to do with it. So let’s look at how data deduplication works.
Making the most of the building blocks of data
Basically, data deduplication segments a stream of data into variable-length blocks and writes those blocks to disk. Along the way, it creates a digital signature – like a fingerprint – for each data segment and an index of the signatures it has seen. The index, which can be recreated from the stored data segments, lets the system know when it’s seeing a new block.
When data deduplication software sees a duplicate block, it inserts a pointer to the original block in the dataset’s metadata (the information that describes the dataset) rather than storing the block again. If the same block shows up more than once, multiple pointers to it are created. Pointers are smaller than blocks, so you need less disk space.
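The mechanism described above – fingerprint each block, keep an index of signatures, and store a pointer instead of a repeated block – can be sketched in a few lines of Python. This is a toy illustration, not Quantum’s implementation: it uses fixed-size blocks for simplicity (real variable-length dedup segments the stream differently) and SHA-256 as the stand-in fingerprint.

```python
import hashlib

def dedup_store(data: bytes, block_size: int = 4):
    """Toy dedup: split data into blocks, store each unique block once.

    Returns (store, pointers):
      store    - signature -> block; the unique blocks kept "on disk"
      pointers - ordered list of signatures; the dataset's metadata
    """
    store = {}
    pointers = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        sig = hashlib.sha256(block).hexdigest()  # the block's "fingerprint"
        if sig not in store:
            store[sig] = block       # new block: write it once
        pointers.append(sig)         # duplicate or not: record a pointer
    return store, pointers

def rebuild(store, pointers) -> bytes:
    """Follow the pointers to reconstruct the original data stream."""
    return b"".join(store[sig] for sig in pointers)
```

For input with lots of repeated blocks, `store` stays small while `pointers` preserves the full stream, which is exactly the space saving the excerpt describes.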
Data deduplication technology works best when it sees sets of data with lots of repeated segments. For most people, that is a perfect description of a backup. Whether you back up everything every day (and lots of us do this) or once a week with incremental backups in between, backup jobs by their nature send the same pieces of data to the storage system over and over again. Until data deduplication there wasn’t a good alternative to storing all the duplicates. Now there is. …”
Example: Joe is really tall (this text document is stored on the hard drive)
Now the creator opens the document and makes a change to:
John is really tall
Now, see this graphical representation of Fixed Length blocks vs Variable Length blocks:
In our example, the only change to the file was in the first block “a”: “Joe” changed to “John”. Note that in the variable-length block image, the whole data stream shifts but only segment a is rewritten; the other data segments (b, c and d) remain unchanged (“is really tall”, where b is “is”, c is “really” and d is “tall”), so only pointers are created for those three blocks instead of storing the whole data stream again. If the backup software has data deduplication built in, that is how the data will be saved. That is data deduplication!
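The Joe/John example can be reproduced in code to show why variable-length blocks matter. This is a hypothetical sketch: real variable-length deduplication uses content-defined chunking (rolling hashes) to find block boundaries, but splitting on spaces stands in for it here, mirroring the a/b/c/d word segments of the example.

```python
def fixed_blocks(text: str, size: int = 4):
    """Cut the text into fixed-size blocks, ignoring content."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def variable_blocks(text: str):
    """Stand-in for content-defined chunking: break at word boundaries."""
    return text.split(" ")

v1 = "Joe is really tall"
v2 = "John is really tall"

# Fixed-size blocks: the one-character insertion shifts every later
# block, so nothing dedupes between the two versions.
print(fixed_blocks(v1))  # ['Joe ', 'is r', 'eall', 'y ta', 'll']
print(fixed_blocks(v2))  # ['John', ' is ', 'real', 'ly t', 'all']

# Content-aware blocks: only segment "a" changes; b, c and d
# ("is", "really", "tall") dedupe against the first version.
print(variable_blocks(v1))  # ['Joe', 'is', 'really', 'tall']
print(variable_blocks(v2))  # ['John', 'is', 'really', 'tall']
```

With fixed blocks the two versions share no blocks at all; with content-aware boundaries three of four segments are unchanged and need only pointers.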