In my last post, I discussed the differences between various dedupe and compression techniques. In this post, I'd like to talk about how a product should apply the right technique at each data tier. People have different opinions on this, so here is my take on the correct technologies at each tier:
Online storage:
- The nature of this storage is fast access. Data is in constant use, so the emphasis is on speed rather than space savings. The data path directly impacts application performance, so don't go too crazy trying to save space.
- Best techniques: Inline dictionary-based compression coupled with fixed-block dedupe. Fixed-block dedupe requires less CPU processing (the block boundaries are predetermined and do not have to be computed in real time), and dictionary-based compression is little more than memory lookups, so almost no CPU time is consumed. Together, these two are fast and still save space for many applications such as virtual machine workloads (virtual desktops), sparse databases, blogs and so on. A minimal sketch of the idea follows below.
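To make that concrete, here is a minimal Python sketch of fixed-block dedupe combined with dictionary-based (LZ-style) compression via zlib. The 4 KB block size, SHA-256 fingerprints and zlib itself are my own illustrative choices, not a description of any particular product:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems tune this per workload

def fixed_block_dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep only unique blocks (compressed)."""
    recipe = []  # ordered list of fingerprints needed to rebuild the data
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # Dictionary-based (LZ77-style) compression on the unique block only
            store[fingerprint] = zlib.compress(block, level=1)  # fast setting
        recipe.append(fingerprint)
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from the block recipe."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)

# Example: two "virtual disks" that share most of their content dedupe well
store = {}
disk_a = b"A" * 8192 + b"unique-to-a"
disk_b = b"A" * 8192 + b"unique-to-b"
recipe_a = fixed_block_dedupe(disk_a, store)
recipe_b = fixed_block_dedupe(disk_b, store)
print(len(store), "unique blocks stored")   # shared blocks are stored only once
assert rebuild(recipe_a, store) == disk_a
```

Because the boundaries never move, each block can be fingerprinted and looked up with no per-byte scanning, which is why this combination stays out of the application's way.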
Archival storage:
- The data here is typically largely unstructured (just files). It is seldom accessed and some latency can be tolerated, since it is usually a human user retrieving the data.
- Best Techniques: My vote here is to go with variable-sized dedupe using very large chunk sizes (in a future blog post I'll explain the implications of chunk size in more detail).
The more important point here is that I would go with file-specific compression techniques that understand the type of file they are dealing with and apply a format-aware algorithm. Additionally, I find that post-process techniques work best here, since these dedupe and compression stages are more CPU intensive and you do not want to impact ingest speeds. A rough sketch of both ideas follows below.
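Here is a rough Python sketch of both ideas: content-defined (variable-sized) chunking driven by a rolling hash, and a file-type-aware choice of compressor. The window size, target chunk sizes, hash parameters and extension-to-codec mapping are all assumptions for illustration, not the algorithm of any specific product:

```python
import bz2
import lzma

# Content-defined (variable-sized) chunking: a boundary is declared where a
# rolling hash of the last WINDOW bytes satisfies a condition, so boundaries
# follow the content rather than fixed offsets and survive insertions/deletions.
WINDOW = 48
BASE = 257
PRIME = (1 << 61) - 1                   # modulus for the rolling hash
POW_OUT = pow(BASE, WINDOW - 1, PRIME)  # weight of the byte leaving the window
AVG_CHUNK = 1 * 1024 * 1024             # ~1 MiB between boundaries on average
MIN_CHUNK = 256 * 1024
MAX_CHUNK = 4 * 1024 * 1024

def variable_chunks(data: bytes):
    """Split data into variable-sized chunks using a rolling hash."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the byte leaving the window before adding the new one
            rolling = (rolling - data[i - WINDOW] * POW_OUT) % PRIME
        rolling = (rolling * BASE + byte) % PRIME
        size = i - start + 1
        boundary = size >= MIN_CHUNK and rolling % AVG_CHUNK == 0
        if boundary or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

# File-aware compression: pick a codec based on the file type, applied as a
# post-process step so ingest speed is unaffected.
def compress_for(filename: str, chunk: bytes) -> bytes:
    name = filename.lower()
    if name.endswith((".jpg", ".mp4", ".zip", ".gz")):
        return chunk                     # already compressed; store as-is
    if name.endswith((".txt", ".log", ".csv", ".xml")):
        return lzma.compress(chunk)      # slow but strong; fine post-process
    return bz2.compress(chunk)           # general-purpose fallback
```

The design choice is to accept cheap, fast ingest up front and spend the CPU-heavy chunking and codec work later, when nobody is waiting on the write path.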
Backup storage:
- The most important thing to keep in mind about backup targets (such as our DR4100 product) is that you want to minimize backup windows (how fast you can protect your data). To do this, you need the backup target to be as fast as the source can deliver the data… so we are talking the fastest ingest speeds possible, typically on the order of terabytes per hour.
- Best Techniques: To achieve these speeds, I find that what works best is a small holding tank where incoming data accumulates into a meaningful portion (usually just a few gigabytes); you then apply variable-sized dedupe with very quick compression and store the result to disk. A conceptual sketch follows below.
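Below is a conceptual Python sketch of that flow: a memory holding tank that accepts data at ingest speed, then a flush stage that runs variable dedupe (reusing a chunker like the one sketched earlier) plus a fast compression pass before persisting unique chunks. The threshold, index structure and storage layout are my own assumptions, not a description of how the DR4100 actually works internally:

```python
import hashlib
import zlib

HOLDING_TANK_LIMIT = 4 * 1024**3   # flush once a few GB have accumulated (assumed)

class BackupTarget:
    def __init__(self, chunker, store_path="dedupe.store"):
        self.chunker = chunker          # e.g. the variable_chunks() sketch above
        self.tank = bytearray()         # in-memory holding tank
        self.index = set()              # fingerprints of chunks already on disk
        self.store = open(store_path, "ab")

    def ingest(self, data: bytes):
        """Accept data at line speed; defer dedupe until the tank fills."""
        self.tank.extend(data)
        if len(self.tank) >= HOLDING_TANK_LIMIT:
            self.flush()

    def flush(self):
        """Variable dedupe + quick compression, then persist unique chunks."""
        for chunk in self.chunker(bytes(self.tank)):
            fp = hashlib.sha256(chunk).digest()
            if fp in self.index:
                continue                 # duplicate chunk: reference only
            self.index.add(fp)
            self.store.write(zlib.compress(chunk, level=1))  # fast setting
            # A real system would also record a per-backup recipe of
            # fingerprints so the stream can be restored later.
        self.tank.clear()

    def close(self):
        if self.tank:
            self.flush()
        self.store.close()
```

The holding tank is what lets the target keep up with the source: ingest is just an append to memory, while the more expensive chunking, fingerprinting and compression happen in batches large enough to find duplicates across the incoming stream.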
What do you guys think about the variations in approaches? Have you thought about what happens when data moves between tiers? I have some thoughts on that but I’d love to hear your input.
