SearchStorage.com FAQ Guide

Data deduplication and compression on primary storage can reduce data footprint

What are the most common approaches for data deduplication and compression on primary storage? Learn the answer to this and more in this FAQ guide featuring insight from Dave Russell, research vice president at Gartner Research.
Table of Contents
Data deduplication and compression on primary storage can reduce data footprint
About IBM
Data deduplication and compression on primary storage can reduce data footprint
Although the opportunity for data deduplication and compression on primary storage may
be less than for backups, there's ample reason for users to expect significant success in
reducing unstructured data such as word processing documents, spreadsheets and
PowerPoint presentations.
Dave Russell, a research vice president at Gartner, discusses the current techniques for
data reduction on primary storage, from standard compression to file- and sub-file
deduplication, to a combination of deduplication and compression. He also outlines the
emerging approaches that users may find more prevalent in the future. You can read the
podcast interview below.
SearchStorage.com: What are the most common approaches for data deduplication
and compression on primary storage?
Russell: There are really four main techniques that are used. The first is standard
compression, typically Lempel-Ziv compression, based on algorithms from the late '70s (1977, 1978). Another approach is oftentimes referred to by one of two names: either single-instance store, or its acronym SIS, or it's sometimes referred to as file-level deduplication,
which really tries to reduce commonality from a complete file perspective -- for example, if
you and I both have the same copy of a PDF. The third approach is really sub-file
deduplication, and that's the kind of deduplication that most people are aware of, where we
look for commonality between little bits of files or potentially databases, email systems as
well. And a fourth area that we see really being applied more frequently is a couple of
different techniques, most typically deduplication and compression, whereby we can look for
commonality within files, bits and pieces of file data, and then compress the results further
from there.
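To make the first two of those techniques concrete, here is a minimal sketch (an illustrative example, not part of the interview). It uses Python's zlib module, whose DEFLATE format is built on the Lempel-Ziv algorithms Russell mentions, for standard compression, and whole-file SHA-256 hashes to keep a single instance of identical files, the way a single-instance store would.

```python
import hashlib
import zlib

# Standard compression: reduce a single object in isolation (Lempel-Ziv family).
document = b"quarterly report " * 1000           # highly repetitive sample data
compressed = zlib.compress(document, 6)           # DEFLATE = LZ77 + Huffman coding
print(f"compression: {len(document)} -> {len(compressed)} bytes")

# Single-instance store (file-level dedup): keep one copy per unique file hash.
store = {}                                        # file hash -> file contents

def store_file(data: bytes) -> str:
    """Store a whole file only if an identical copy is not already present."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:                       # first time we see this exact file
        store[digest] = data
    return digest                                 # callers keep just the reference

# Two users saving the same PDF results in a single stored copy.
ref_a = store_file(b"%PDF-1.7 ... same contract ...")
ref_b = store_file(b"%PDF-1.7 ... same contract ...")
print(f"identical files, one copy stored: {ref_a == ref_b}, store size = {len(store)}")
```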
SearchStorage.com: Can you briefly compare and contrast how each of the
different approaches works?
Russell: Compression is really looking at a very specific amount of data. One everyday
example might be a sound file. An MP3 is an example. Compression is just looking within
that individual object, in this case, one single music file, and doesn't really persist any kind
of data reduction across other types of files or data that it's going to process later on.
The next step would be single-instance store, which actually would look for commonality
across many different files and would persist this idea of a dictionary of looking to see what
had been repeated in the past. Deduplication takes this concept even further and really
keeps this dictionary of known bits of data and looks further at a smaller chunk, or further
granularity, across files for repetitive data that may have shown up in the past. So, whereas
compression typically works on whatever it's presented with, oftentimes from a single-file perspective, single-instance store looks at multiple files over a period of time, and then deduplication breaks this down a little bit further and looks at elements of objects or files over a longer period of time.
And part of the difference is when and how this data reduction is applied. It could be as
data is created, or it could be after data lands on disk and is processed later on. So these
techniques have different kinds of processing requirements associated with them.
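The "dictionary of known bits of data" Russell describes can be sketched as follows. This is a simplified illustration assuming fixed-size chunks; shipping products typically use variable-size, content-defined chunking and keep far more metadata.

```python
import hashlib

CHUNK_SIZE = 4096        # fixed-size chunking for simplicity; many products
                         # use variable-size (content-defined) chunks instead

chunk_store = {}         # persistent "dictionary": chunk hash -> chunk bytes

def dedupe(data: bytes) -> list[str]:
    """Split data into chunks and store only chunks not seen before.

    Returns the recipe (list of chunk hashes) needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:        # new data: keep it
            chunk_store[digest] = chunk
        recipe.append(digest)                # known data: just reference it
    return recipe

def rebuild(recipe: list[str]) -> bytes:
    """Reassemble the original data from its chunk references."""
    return b"".join(chunk_store[d] for d in recipe)

# Two files that differ only slightly share most of their chunks.
original = b"A" * 20000
edited   = b"A" * 16000 + b"B" * 4000
r1, r2 = dedupe(original), dedupe(edited)
assert rebuild(r1) == original and rebuild(r2) == edited
print(f"chunks referenced: {len(r1) + len(r2)}, unique chunks stored: {len(chunk_store)}")
```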
SearchStorage.com: How effective are data deduplication and compression for
primary storage vs. backups?
Russell: Certainly backup is one of the most redundant kinds of workloads that we have,
meaning that we're capturing the same files on a very frequent basis. And for some
organizations, they might be doing full backups every single night, and if they're not doing
that, they're probably doing a full backup at least once a week, which with the typical
change rate of data means at least 90% to 95% of what they're backing up on each full
backup is exactly redundant with what they've captured before.
So, the opportunity on primary storage might be a little less, but it's still very significant,
especially for so-called unstructured data or things like word processing documents,
spreadsheets, PowerPoints, which tend to have not only a lot of commonality but a lot of
situations where even one individual saves a file with very, very similar data multiple times.
Maybe they're only changing a little bit of a title page as one example, but a lot of the
specifics in, say, a contract might look very, very similar. Another example might be
databases, where an organization oftentimes has at least half a dozen, maybe even more
than 10, copies of their database. So there's a lot of opportunity for reduction in the primary
world as well.
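As a back-of-the-envelope illustration of that redundancy argument, the figures below are assumed for the example and do not come from the interview: weekly 10 TB full backups with a 5% change rate, and eight near-identical database copies on primary storage.

```python
# Hypothetical numbers chosen only to illustrate the reduction opportunity.

# Backup: weekly 10 TB fulls with a 5% weekly change rate, retained 12 weeks.
full_tb, change_rate, weeks = 10.0, 0.05, 12
logical = full_tb * weeks                                # what gets written
stored  = full_tb + full_tb * change_rate * (weeks - 1)  # first full + weekly changes
print(f"backup reduction  ~{logical / stored:.1f}:1")    # ~7.7:1

# Primary: eight near-identical copies of a 2 TB database (dev, test, reporting...).
copies, db_tb = 8, 2.0
print(f"primary reduction ~{copies * db_tb / db_tb:.1f}:1")  # ~8.0:1
```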
SearchStorage.com: Do you foresee any new approaches emerging for data
deduplication and compression on primary storage, and if so, how will they work
and what kind of results can users expect to see?
Russell: I think the first thing we see coming down into the marketplace relatively soon is a
situation where vendors combine more of these capabilities together, and certainly we have
some evidence of that today. But we think that we're going to see many more products and
solutions that combine compression and deduplication. Today one vendor may only offer
deduplication; in the near future, they may offer compression on top of that; and where
potentially today they only offer compression, they may expand into dedupe as well.
The next area that we see is that because this tends to take a certain amount of processing
power, particularly CPU, we're going to see more advancement in chip technology, and that is going to become much more affordable. The speed of being able to process data, and potentially to do that more in an inline process rather than landing all of this on disk first, is very
likely to come about in numerous products.
The third area really is around scope, meaning how global is the data reduction, how wide
can we look across repetitive data and reduce commonality. Today, some products are
limited to a single LUN or a volume. Others are limited to certain streams, as examples,
and we think that we're going to see increasingly broader, more global capabilities that will
really drive home this data reduction even further for primary storage.
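A rough sketch of that combined, inline approach: deduplicate chunks as data is written and compress only the unique chunks before they land on disk. This is an illustrative example under assumed fixed-size chunking, not a description of any particular vendor's product.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096
store = {}                                       # chunk hash -> compressed chunk

def write_inline(data: bytes) -> list[str]:
    """Deduplicate as data arrives, then compress only the unique chunks."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)  # compression on top of dedupe
        recipe.append(digest)
    return recipe

def read(recipe: list[str]) -> bytes:
    """Decompress and reassemble the chunks referenced by a recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

data = b"the same paragraph, saved twice " * 2000   # 64,000 logical bytes
recipe = write_inline(data)
assert read(recipe) == data
physical = sum(len(c) for c in store.values())
print(f"logical {len(data)} bytes stored as {physical} compressed bytes")
```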
About IBM
At IBM, we strive to lead in the creation, development and manufacture of the industry's
most advanced information technologies, including computer systems, software, networking
systems, storage devices and microelectronics. We translate these advanced technologies
into value for our customers through our professional solutions and services businesses
worldwide.