In my last post, I discussed the differences between various dedupe and compression techniques. In this post, I'd like to talk about how a product should apply the right technique at each data tier. People have different opinions on this, so here is my take on the correct technologies at each tier:
Online storage:
- The nature of this storage is fast access. Data is in constant use, so the emphasis is on speed rather than space savings. The data path directly impacts application performance, so don't go too crazy trying to save space.
- Best techniques: Inline dictionary-based compression coupled with fixed-block dedupe. Fixed-block dedupe requires less CPU processing (the block boundaries are predetermined and do not have to be computed in real time), and dictionary-based compression is little more than memory lookups, so almost no CPU time is consumed. Together, these two are fast and still save space for many applications such as virtual machine workloads (virtual desktops), sparse databases, blogs and so on. A minimal sketch of the idea follows below.
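To make that concrete, here is a minimal Python sketch of fixed-block dedupe combined with dictionary-based (LZ-style) compression via zlib. The 4 KB block size, SHA-256 fingerprints and zlib itself are my own illustrative choices, not a description of any particular product:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems tune this per workload

def fixed_block_dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep only unique blocks (compressed)."""
    recipe = []  # ordered list of fingerprints needed to rebuild the data
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # Dictionary-based (LZ77-style) compression on the unique block only
            store[fingerprint] = zlib.compress(block, level=1)  # fast setting
        recipe.append(fingerprint)
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from the block recipe."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)

# Example: two "virtual disks" that share most of their content dedupe well
store = {}
disk_a = b"A" * 8192 + b"unique-to-a"
disk_b = b"A" * 8192 + b"unique-to-b"
recipe_a = fixed_block_dedupe(disk_a, store)
recipe_b = fixed_block_dedupe(disk_b, store)
print(len(store), "unique blocks stored")   # shared blocks are stored only once
assert rebuild(recipe_a, store) == disk_a
```

Because the boundaries never move, each block can be fingerprinted and looked up with no per-byte scanning, which is why this combination stays out of the application's way.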
Archival storage:
- The data here is typically largely unstructured (just files). It is seldom accessed and some latency can be tolerated, since it is usually a human user retrieving the data.
- Best Techniques: My vote here is to go with variable-sized dedupe using very large chunk sizes (in a future blog post I'll explain the implications of chunk size in more detail).
The more important point here is that I would go with file-specific compression techniques that understand the type of file they are dealing with and apply a format-aware algorithm. Additionally, I find that post-process techniques work best here, since these dedupe and compression stages are more CPU intensive and you do not want to impact ingest speeds. A rough sketch of both ideas follows below.
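Here is a rough Python sketch of both ideas: content-defined (variable-sized) chunking driven by a rolling hash, and a file-type-aware choice of compressor. The window size, target chunk sizes, hash parameters and extension-to-codec mapping are all assumptions for illustration, not the algorithm of any specific product:

```python
import bz2
import lzma

# Content-defined (variable-sized) chunking: a boundary is declared where a
# rolling hash of the last WINDOW bytes satisfies a condition, so boundaries
# follow the content rather than fixed offsets and survive insertions/deletions.
WINDOW = 48
BASE = 257
PRIME = (1 << 61) - 1                   # modulus for the rolling hash
POW_OUT = pow(BASE, WINDOW - 1, PRIME)  # weight of the byte leaving the window
AVG_CHUNK = 1 * 1024 * 1024             # ~1 MiB between boundaries on average
MIN_CHUNK = 256 * 1024
MAX_CHUNK = 4 * 1024 * 1024

def variable_chunks(data: bytes):
    """Split data into variable-sized chunks using a rolling hash."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the byte leaving the window before adding the new one
            rolling = (rolling - data[i - WINDOW] * POW_OUT) % PRIME
        rolling = (rolling * BASE + byte) % PRIME
        size = i - start + 1
        boundary = size >= MIN_CHUNK and rolling % AVG_CHUNK == 0
        if boundary or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

# File-aware compression: pick a codec based on the file type, applied as a
# post-process step so ingest speed is unaffected.
def compress_for(filename: str, chunk: bytes) -> bytes:
    name = filename.lower()
    if name.endswith((".jpg", ".mp4", ".zip", ".gz")):
        return chunk                     # already compressed; store as-is
    if name.endswith((".txt", ".log", ".csv", ".xml")):
        return lzma.compress(chunk)      # slow but strong; fine post-process
    return bz2.compress(chunk)           # general-purpose fallback
```

The design choice is to accept cheap, fast ingest up front and spend the CPU-heavy chunking and codec work later, when nobody is waiting on the write path.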
Backup storage:
- The most important thing to keep in mind about backup targets (such as our DR4100 product) is that you want to minimize backup windows (how fast you can protect your data). To do this, you need the backup target to be as fast as the source can deliver the data… so we are talking the fastest ingest speeds possible, typically on the order of terabytes per hour.
- Best Techniques: To achieve these speeds, I find that what works best is a small holding tank where incoming data accumulates into a meaningful portion (usually just a few gigabytes); you then apply variable-sized dedupe with very quick compression and store the result to disk. A conceptual sketch follows below.
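Below is a conceptual Python sketch of that flow: a memory holding tank that accepts data at ingest speed, then a flush stage that runs variable dedupe (reusing a chunker like the one sketched earlier) plus a fast compression pass before persisting unique chunks. The threshold, index structure and storage layout are my own assumptions, not a description of how the DR4100 actually works internally:

```python
import hashlib
import zlib

HOLDING_TANK_LIMIT = 4 * 1024**3   # flush once a few GB have accumulated (assumed)

class BackupTarget:
    def __init__(self, chunker, store_path="dedupe.store"):
        self.chunker = chunker          # e.g. the variable_chunks() sketch above
        self.tank = bytearray()         # in-memory holding tank
        self.index = set()              # fingerprints of chunks already on disk
        self.store = open(store_path, "ab")

    def ingest(self, data: bytes):
        """Accept data at line speed; defer dedupe until the tank fills."""
        self.tank.extend(data)
        if len(self.tank) >= HOLDING_TANK_LIMIT:
            self.flush()

    def flush(self):
        """Variable dedupe + quick compression, then persist unique chunks."""
        for chunk in self.chunker(bytes(self.tank)):
            fp = hashlib.sha256(chunk).digest()
            if fp in self.index:
                continue                 # duplicate chunk: reference only
            self.index.add(fp)
            self.store.write(zlib.compress(chunk, level=1))  # fast setting
            # A real system would also record a per-backup recipe of
            # fingerprints so the stream can be restored later.
        self.tank.clear()

    def close(self):
        if self.tank:
            self.flush()
        self.store.close()
```

The holding tank is what lets the target keep up with the source: ingest is just an append to memory, while the more expensive chunking, fingerprinting and compression happen in batches large enough to find duplicates across the incoming stream.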
What do you guys think about the variations in approaches? Have you thought about what happens when data moves between tiers? I have some thoughts on that but I’d love to hear your input.
