With data growing at the pace it is, data optimization technology needs to be in place and integrated with every tier of storage – primary, archive and backup. But dedupe and compression come in different flavors, with different algorithms to choose from, so what strategy do you use at each tier of storage? I'll cover that part in my next post, but first, if you are wondering what those differences actually are, let me list a few in this post:
Things to consider with compression techniques:
Inline versus post-process strategies: Do you compress the data on ingest, or after it has gone idle and is already stored? This question applies to dedupe as well. Thankfully, it usually makes sense to apply the two data optimization techniques together in a pipeline: dedupe first to eliminate any redundant data altogether, then compress the residual (truly novel) data.
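To make that concrete, here is a minimal sketch of the dedupe-then-compress idea, assuming a fixed chunk size, SHA-256 fingerprints and zlib – all of which are my own illustrative choices, not how any particular product does it:

```python
import hashlib
import zlib

CHUNK_SIZE = 8192  # illustrative fixed chunk size

def dedupe_then_compress(data, store):
    """Dedupe first, then compress only the chunks we have not seen before.

    `store` maps chunk fingerprint -> compressed chunk bytes.
    Returns the recipe (list of fingerprints) needed to rebuild `data`.
    """
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:                   # truly novel data...
            store[fp] = zlib.compress(chunk)  # ...gets compressed and kept
        recipe.append(fp)                     # duplicates become pointers
    return recipe

def rebuild(recipe, store):
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)

store = {}
payload = b"hello world " * 10000
recipe = dedupe_then_compress(payload, store)
assert rebuild(recipe, store) == payload
print(f"{len(payload)} bytes in, {sum(len(v) for v in store.values())} bytes stored")
```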
Statistical techniques versus dictionary compression approaches: Do you use code-substitution algorithms, or statistical compression techniques that learn the patterns in the data over time to predict the next piece of information being stored? To complicate things further, a lot of data today is already compressed in some form or another (take JPEG images, for example). So do you decode that data in order to re-compress it with a superior algorithm?
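As a toy illustration of the contrast (not how any real compressor is built internally), the snippet below compresses some text with zlib, a dictionary-style LZ coder, and then runs a tiny order-1 statistical model over the same bytes to see how often "predict the next byte from the previous one" gets it right; a real statistical compressor would feed those probabilities into an arithmetic coder:

```python
import zlib
from collections import Counter, defaultdict

data = b"the quick brown fox jumps over the lazy dog. " * 200

# Dictionary / code-substitution style: zlib (LZ77 + Huffman) replaces
# repeated byte sequences with short back-references into a sliding window.
print("zlib compressed to", len(zlib.compress(data)) / len(data), "of original size")

# Statistical style: learn P(next byte | previous byte) as the stream goes by.
# Here we just count how often the model's top guess is right; a real
# statistical compressor (PAQ mixes many such models) would turn these
# probabilities into output bits with an arithmetic coder.
model = defaultdict(Counter)
hits, prev = 0, 0
for byte in data:
    if model[prev] and model[prev].most_common(1)[0][0] == byte:
        hits += 1
    model[prev][byte] += 1      # learn after predicting
    prev = byte
print("order-1 model guessed the next byte right", hits / len(data), "of the time")
```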
How do you solve random access into compressed data? If the data is compressed, how do you access a byte of information at logical offset X? It is an issue because that byte is no longer physically at offset X once the data has been compressed. And what sort of impact does accessing compressed data have on the tier in question? Any data transformation technique (be it dedupe, compression or encryption) has some performance impact on data transfers.
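One common way out (an assumption on my part – there are other designs) is to compress in fixed-size logical blocks and keep a small index from logical block number to physical offset, trading a little compression ratio for seekability. A minimal sketch:

```python
import zlib

BLOCK = 64 * 1024  # logical block size -- an illustrative choice

def compress_with_index(data):
    """Compress per logical block and remember where each block landed."""
    blob, index = bytearray(), []
    for i in range(0, len(data), BLOCK):
        index.append(len(blob))              # physical offset of this block
        blob += zlib.compress(data[i:i + BLOCK])
    index.append(len(blob))                  # sentinel after the last block
    return bytes(blob), index

def read_byte(blob, index, logical_offset):
    """Fetch the byte at logical offset X by decompressing only its block."""
    b = logical_offset // BLOCK
    block = zlib.decompress(blob[index[b]:index[b + 1]])
    return block[logical_offset % BLOCK]

data = bytes(range(256)) * 4096              # 1 MiB of sample data
blob, index = compress_with_index(data)
assert read_byte(blob, index, 777777) == data[777777]
```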
How do I handle an already compressed data stream? Take an image: traditional compression algorithms will not be able to shrink an already compressed image any further. You need content-specific compression algorithms, and those can be CPU expensive. So is it worth applying them to the tier of storage in question? To understand more about compression technology, take a look at PAQ, an open source compression project started by one of our Data Protection Chief Scientists, Matt Mahoney.
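You can see the problem with any general-purpose compressor: run it over data that has already been squeezed and there is essentially nothing left to take out. In the quick illustration below, a zlib-compressed buffer stands in for the already-compressed image:

```python
import zlib

text = b"GET /index.html 200 served in 12ms\n" * 5000

once = zlib.compress(text)    # raw text compresses dramatically
twice = zlib.compress(once)   # compressing the compressed stream again

print("raw text       ->", len(once) / len(text))
print("already packed ->", len(twice) / len(once))  # ~1.0: no gain, just CPU spent
```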
Things to consider with dedupe techniques:
What chunk size do I use? Dedupe works by chunking data, finding exact matches between chunks, and substituting duplicates with pointers. So do I use fixed-block chunking, where data is cut at predetermined physical boundaries, or variable-size chunking, where the cut points move with the content so as to get better matches? See this article to understand the intricate details: http://virtualbill.wordpress.com/2011/02/24/fixed-block-vs-variable-block-deduplication-a-quick-primer/
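Here is a small sketch of the difference, using adler32 over a sliding window as a crude stand-in for a proper rolling hash (real implementations use something like Rabin fingerprints, and the window size and target chunk size below are just my illustrative picks):

```python
import hashlib
import os
import zlib

WINDOW = 48     # rolling-hash window size -- an assumption for this sketch
MASK = 0x3FF    # cut when the low 10 bits are zero -> ~1 KiB average chunks

def fixed_chunks(data, size=1024):
    """Fixed-block: cut at predetermined physical boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data):
    """Content-defined: cut wherever the data itself says to.

    Boundaries move with the content, so an insert only disturbs the
    chunk it lands in; everything downstream still matches.
    """
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        if zlib.adler32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def fingerprints(chunks):
    return {hashlib.sha256(c).digest() for c in chunks}

base = os.urandom(100_000)
edited = base[:5] + b"X" + base[5:]      # insert a single byte near the front

for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
    old, new = fingerprints(chunker(base)), fingerprints(chunker(edited))
    print(f"{name}: chunks that no longer dedupe after a 1-byte insert:",
          len(new - old))
```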
How do I handle read-back speeds? This is an issue because when you dedupe data, you are essentially replacing sequential data with pointers to other data chunks. Those chunks were not stored in sequential order and end up scattered all over the storage subsystem. If you have spinning disks under the hood, this is a huge problem, since spinning disks do not like being randomly accessed.
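A toy simulation makes the fragmentation visible: two backup generations share half their chunks, and because the shared chunks stay wherever they were first written, restoring the second generation is no longer one sequential sweep of the disk. Everything here (the log-structured "disk", the block helper) is made up purely for illustration:

```python
import hashlib

def block(tag):
    """A deterministic, unique 4 KiB block for a given tag (test data only)."""
    return (tag.encode() * 4096)[:4096]

def backup(blocks, store, disk):
    """Append never-seen chunks to the simulated disk; return the recipe."""
    recipe = []
    for chunk in blocks:
        fp = hashlib.sha256(chunk).digest()
        if fp not in store:
            store[fp] = len(disk)          # "physical" location on disk
            disk.append(chunk)
        recipe.append(fp)
    return recipe

store, disk = {}, []
gen1 = [block(f"a{i}") for i in range(10)] + [block(f"b{i}") for i in range(10)]
gen2 = [block(f"c{i}") for i in range(10)] + [block(f"b{i}") for i in range(10)]

backup(gen1, store, disk)
recipe2 = backup(gen2, store, disk)

# A sequential read of generation 2 becomes this on-disk access pattern:
print([store[fp] for fp in recipe2])
# -> positions 20..29, then a jump back to 10..19: the deduped half lives
#    wherever it was first written, so the restore hops around the disk.
```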
There are many more issues like this, but I won't get into all the minutiae – today.
The point behind this post is that there are real differences in compression and dedupe technologies, and a product needs to apply the correct strategy at each tier. So what are the correct strategies? Stay tuned for my next post on that!