It is no secret that every enterprise, big or small, is struggling with the
rising tide of data. The vast amount of information gathered poses not only
storage management problems but also challenges related to performance, data
security and escalating operating costs. Anticipating such
issues, storage vendors have come up with various interesting technologies to
address the capacity optimization challenges. Data deduplication is one such
technology that has gained prominence in the recent past. It is touted as the
technology that can comprehensively address capacity problems in conjunction
with performance, security and management issues. However, its effectiveness is
largely dependent on each vendor's choice of architectural and technological
underpinnings.
Influencing Factors
- Duplicate Data: The efficiency of a data deduplication implementation
largely depends on its ability to recognize duplicate copies of data blocks or
objects. If there is no duplicate data in a given storage capacity, no capacity
savings can be achieved, irrespective of the implementation technique and its
efficacy. Conversely, even if the internal algorithm is capable of recognizing
every duplicate, the maximum extent to which storage capacity can be optimized
is limited by the amount of duplicate data present, as the simple arithmetic
below illustrates.
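To make that ceiling concrete, consider a back-of-the-envelope sketch in Python. The figures are purely illustrative assumptions (10 TB of data, of which 60% duplicates blocks already stored), not measurements:

# Illustrative ceiling on deduplication savings; all figures are assumptions.
logical_tb = 10.0          # total data written by applications
duplicate_share = 0.60     # assumed fraction that duplicates existing blocks

# Even a perfect engine stores each unique block exactly once.
stored_tb = logical_tb * (1.0 - duplicate_share)

dedup_ratio = logical_tb / stored_tb                  # 10 / 4 -> 2.5:1
savings_pct = (1.0 - stored_tb / logical_tb) * 100    # 60%

print(f"best-case ratio  : {dedup_ratio:.1f}:1")
print(f"best-case savings: {savings_pct:.0f}%")

Real ratios will be lower than this ceiling, since no engine recognizes every duplicate.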
- Data and Workload Characteristics: Apart from the data type itself, the
usage pattern determines the amount of storage savings. Typically, user data
such as mails and their attachments tends to be forwarded and replied to
multiple times, creating duplicate data. In addition, certain types of shared
data lend themselves to deduplication; for example, the digital media industry
creates multiple assets from a common data set.
Often, a company's operational policies also have a bearing on the amount of
duplicate data stored. Restrictions on storing specific file types and
extensions (such as music, pictures and videos) reduce the capacity required
for backup and thereby limit duplication.
Data deduplication techniques that operate at file-level boundaries treat an
entire file as unique even if only a small change occurred during an update.
Environments in which file objects change frequently will therefore see a lower
deduplication ratio, whereas reference data that is kept unmodified for extended
periods yields a better ratio. The sketch below illustrates the difference
between file-level and block-level behavior.
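The following minimal sketch is an illustration only; it assumes fixed 4 KB blocks and SHA-256 fingerprints and is not modeled on any particular product. A one-byte edit makes the whole-file fingerprint unique, while block-level comparison still recognizes the unchanged blocks:

import hashlib
import os

BLOCK = 4096

def file_fingerprint(data):
    # Whole-file fingerprint, as a file-level deduplicator might compare.
    return hashlib.sha256(data).hexdigest()

def block_fingerprints(data):
    # Fixed-size block fingerprints, as a block-level deduplicator might compare.
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

original = os.urandom(64 * BLOCK)                       # a 256 KiB file
updated = original[:-1] + bytes([original[-1] ^ 0xFF])  # one byte changed at the end

# File level: the updated copy no longer matches, so both copies are kept in full.
print("whole files match:", file_fingerprint(original) == file_fingerprint(updated))

# Block level: only the final block changed; the other blocks still deduplicate.
shared = set(block_fingerprints(original)) & set(block_fingerprints(updated))
print(f"{len(shared)} of {len(block_fingerprints(updated))} blocks shared")

Real products differ in chunk size and chunking strategy, but the boundary effect is the same.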
- Data Protection and Deduplication: Given a choice, IT teams prefer to hold
data online for as long as possible, because it improves performance for
business analytics, legal discovery and data recall. This, of course, comes at
the cost of increased capacity consumption. Since data protection
implementations tend to create multiple copies of data, deduplication solutions
can benefit corporations by reducing the online storage requirement.
- Compression, Deduplication, Encryption: It is general practice to apply
compression to inactive data. Once compressed, a data set takes on a unique
constitution, so applying deduplication to compressed data may yield little or
no space saving. Similar results are observed with encrypted data: encryption
renders the data unique, so deduplication may have little effect. It is
therefore good practice to perform deduplication before compression or
encryption, as the sketch below illustrates for the encryption case.
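The sketch below uses a deliberately toy keystream built from SHA-256 purely as a stand-in for a real cipher with per-copy random nonces; it is an illustration, not cryptographic advice. Two identical copies of a data set share every block in plaintext, yet their ciphertexts share none, leaving deduplication nothing to find:

import hashlib
import os

BLOCK = 4096

def block_hashes(data):
    # Fingerprints a block-level deduplication engine would compare.
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

def toy_encrypt(data, key, nonce):
    # Toy stream cipher for illustration only (NOT real cryptography):
    # keystream = SHA-256(key | nonce | counter), XORed with the plaintext.
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, keystream))

plaintext = os.urandom(64 * BLOCK)
copy_a = plaintext                          # e.g. Monday's backup
copy_b = plaintext                          # Tuesday's backup of unchanged data

key = os.urandom(32)
cipher_a = toy_encrypt(copy_a, key, os.urandom(16))   # fresh nonce per copy
cipher_b = toy_encrypt(copy_b, key, os.urandom(16))

print("plaintext blocks shared :", len(block_hashes(copy_a) & block_hashes(copy_b)))
print("ciphertext blocks shared:", len(block_hashes(cipher_a) & block_hashes(cipher_b)))

Compression has a comparable effect in practice, since small differences between otherwise similar files produce very different compressed streams; hence the ordering recommendation above.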
- Deduplication Architecture: The deduplication architecture is the main area
that differentiates vendor offerings. Its underlying algorithms determine how
effective a solution is at detecting duplicate data, when the deduplication
sequence is applied, and how the activity is performed.
- Sequence: This is perhaps the most debated aspect of data deduplication:
where and how should it be applied? Two techniques are commonly followed,
namely in-line and post-processing.
If the technology is applied in-line, data is deduplicated as it is generated,
before it is written to the storage device, so the storage capacity consumed
can be much lower. However, since the processing happens in real time,
additional compute capacity may be required, and application performance can be
limited by the speed at which deduplication is performed. Vendors have
addressed this by provisioning the required compute power. Many VTL-based
deployments operate on this principle; a minimal sketch of the in-line write
path appears at the end of this item.
An additional benefit of in-line processing is better business continuity. In a
disaster recovery (DR) solution, in-line processing can begin replicating
stored data immediately, leveraging storage-based synchronous or asynchronous
replication techniques. Moreover, since the stored data is already optimized,
the bandwidth consumed is considerably lower.
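The in-line write path reduces to a few steps: fingerprint each incoming block, look it up in an index, store only blocks not seen before and record references for the rest. The following is a minimal sketch under simplifying assumptions (fixed 4 KB blocks, an in-memory dictionary as the index, no persistence or hash-collision handling); it is not a description of any particular product:

import hashlib
import os

BLOCK = 4096

class InlineDedupStore:
    # Minimal in-line deduplication sketch: data is fingerprinted and
    # deduplicated before anything reaches the backing store.

    def __init__(self):
        self.index = {}    # fingerprint -> position in the block store
        self.store = []    # stands in for the physical storage device

    def write(self, stream):
        # Deduplicate a stream as it arrives; return block references.
        refs = []
        for i in range(0, len(stream), BLOCK):
            block = stream[i:i + BLOCK]
            fp = hashlib.sha256(block).digest()
            if fp not in self.index:            # new content: pay for the write
                self.index[fp] = len(self.store)
                self.store.append(block)
            refs.append(self.index[fp])         # duplicates cost only a reference
        return refs

    def read(self, refs):
        # Rebuild the original stream from its block references.
        return b"".join(self.store[r] for r in refs)

dev = InlineDedupStore()
payload = os.urandom(64 * BLOCK)            # 256 KiB of unique data
refs = dev.write(payload + payload)         # written twice, e.g. two backup jobs
assert dev.read(refs) == payload + payload
print(f"logical {2 * len(payload)} bytes -> stored {sum(map(len, dev.store))} bytes")

A post-processing design would typically write the incoming data to disk first and run the same fingerprint-and-reference pass later as a background task, trading temporary capacity for lower impact on the ingest path.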
- Scope: A wider scope for data analysis can result in a better deduplication
ratio. If the solution is deployed only at the source location, the benefits
tend to be limited; some solutions deduplicate across multiple storage systems,
extending the scope to a global level, and the resulting space savings are
naturally higher.
Another dimension to consider is whether the scope of deduplication extends to
a single file object, a file system or multiple file systems. Temporal
deduplication detects duplicates within a file at different points in time,
while spatial deduplication looks for duplicate blocks across files, and
sometimes across file systems as well. The sketch below contrasts local and
global scope.
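A minimal sketch of the local-versus-global trade-off, again under simplifying assumptions (fixed 4 KB blocks, SHA-256 fingerprints, and a hypothetical golden image shared by two storage systems):

import hashlib
import os

BLOCK = 4096

def unique_blocks(streams):
    # Number of blocks kept when the given streams share one fingerprint index.
    seen = set()
    for data in streams:
        for i in range(0, len(data), BLOCK):
            seen.add(hashlib.sha256(data[i:i + BLOCK]).digest())
    return len(seen)

golden_image = os.urandom(128 * BLOCK)              # content common to both systems
system_a = golden_image + os.urandom(8 * BLOCK)     # plus some local data
system_b = golden_image + os.urandom(8 * BLOCK)

# Local scope: each storage system deduplicates only within itself,
# so the common image is stored twice.
local_total = unique_blocks([system_a]) + unique_blocks([system_b])

# Global scope: one index spans both systems and keeps the common image once.
global_total = unique_blocks([system_a, system_b])

print("blocks stored with local scope :", local_total)
print("blocks stored with global scope:", global_total)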
- Algorithms: The architectural underpinnings of a deduplication
implementation can be broadly categorized as hash-based or delta-based. The two
methods take different approaches and hence have different characteristics and
results; depending on the algorithm used, the deduplication ratio and
performance can differ, as the sketch below suggests.
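The contrast can be sketched in a few lines. This is an illustration only: the hash-based path assumes fixed 4 KB chunks with SHA-256 fingerprints, and the delta-based path uses a deliberately crude common-prefix/common-suffix delta as a stand-in for real delta-encoding engines:

import hashlib
import os

BLOCK = 4096

def hash_based_bytes(versions):
    # Hash-based: fingerprint fixed-size chunks and keep each unique chunk once.
    unique = {}
    for data in versions:
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())

def delta_based_bytes(reference, new_version):
    # Delta-based: keep the reference in full and store the newer version only
    # as the bytes outside the common prefix and suffix (a crude delta).
    limit = min(len(reference), len(new_version))
    p = 0
    while p < limit and reference[p] == new_version[p]:
        p += 1
    s = 0
    while s < limit - p and reference[-1 - s] == new_version[-1 - s]:
        s += 1
    changed = new_version[p:len(new_version) - s]
    return len(reference) + len(changed)

reference = os.urandom(64 * BLOCK)                # yesterday's backup image
new_copy = reference[:100_000] + b"<one updated row>" + reference[100_000:]

print("hash-based bytes stored :", hash_based_bytes([reference, new_copy]))
print("delta-based bytes stored:", delta_based_bytes(reference, new_copy))

In this contrived case the 17-byte insertion shifts every later block and defeats fixed-size chunking, which is why many hash-based products chunk on content-defined boundaries instead; delta-based engines avoid that problem but pay the cost of locating and comparing against a suitable reference copy.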
Going Forward
There is no denying that capacity optimization is increasingly becoming a
primary focus area for many IT leaders, and enterprises looking to optimize
capacity find deduplication a promising technology. However, the savings
actually achieved depend largely on the factors identified earlier. In-line
solutions are simpler to implement and manage, while the post-processing
approach excels at managing multiple deduplication domains across multiple
devices.
In real-life environments, where there can be a wide mix of data types, the
actual capacity savings will vary. IT leaders adopting capacity optimization
strategies should therefore deploy data deduplication in conjunction with the
other techniques discussed earlier; taken together, the savings can be more
effective and predictable.
Subram Natarajan
The author is senior consultant, STG Asia Pacific
maildqindia@cybermedia.co.in