What is deduplication?
A deduplication process reduces or eliminates redundant data in a dataset. It analyzes the dataset, finds identical files or pieces of files, and strips the extraneous copies, leaving behind a smaller, refined set.
There are different methods of deduplication depending on the scale of the process. The simplest form compares two files and, if they match exactly, keeps only one. More sophisticated deduplication breaks a file down to the block or byte level, compares the individual pieces, and keeps only the unique ones. The file is then reconstructed from that remaining data when it is accessed.
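As a minimal sketch of the simplest, file-level form, the Python below compares file contents by their SHA-256 digest and keeps one copy per distinct content. The function name `dedupe_files` and the sample file names are hypothetical, and treating matching digests as exact matches is an assumption of this sketch (a stricter version would also compare the bytes directly).

```python
import hashlib

def dedupe_files(files):
    """Keep one name per distinct content; map each duplicate to the kept copy."""
    kept = {}        # content digest -> first file name seen with that content
    duplicates = {}  # duplicate file name -> name of the copy that was kept
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in kept:
            duplicates[name] = kept[digest]
        else:
            kept[digest] = name
    return kept, duplicates

# Hypothetical in-memory "files" (name -> bytes) for illustration only.
files = {
    "report.docx": b"quarterly numbers",
    "report_copy.docx": b"quarterly numbers",
    "notes.txt": b"meeting notes",
}
kept, duplicates = dedupe_files(files)
print(duplicates)
```

Here `report_copy.docx` is recorded as a duplicate of `report.docx`, so only two of the three files would need to be stored.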
Why is deduplication important?
Deduplication reduces the amount of data in a set. A smaller dataset takes less time to transfer and leaves more room for other data in the same storage.
Local storage has higher speeds but limited space, so deduplication increases the amount of data that fits on a network-attached storage (NAS) device or RAID array. Cloud storage can be bottlenecked by lower-quality internet connections and infrastructure, and deduplication reduces the amount of data transmitted over such connections.
For companies holding data, deduplication also helps manage costs: bandwidth, long-term storage, and the time spent on data operations. Time is money, and data takes time. Companies often have to store a large amount of redundant data, and cutting down the time to store and transmit that data is in itself a significant gain in efficiency.
Best practices for deduplication
Understand deduplication and how it affects your data.
At a high level, deduplication can be performed across a dataset (comparing multiple files), by block (within the data of a single file), or even byte by byte (within a single file). It can be performed at the hardware level by certain types of storage and server devices, or by software running on an endpoint. It can even run in the cloud using third-party compute power. Knowing how your hardware and software perform deduplication, and where that deduplication is performed, enables you to allocate the right resources and to choose products that fit your needs.
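To make the block level more concrete, here is a rough Python sketch that splits one file into fixed-size blocks, stores each unique block once, and rebuilds the file from a list of block references. The names `dedupe_blocks`, `reconstruct`, and `BLOCK_SIZE` are hypothetical, and fixed-size blocks are an assumption made for simplicity; real products may chunk data differently.

```python
import hashlib

# Hypothetical block size, kept tiny so the example is easy to follow.
BLOCK_SIZE = 4

def dedupe_blocks(data, store):
    """Store each unique block once and return the ordered list of block digests."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy of a block is kept
        recipe.append(digest)
    return recipe

def reconstruct(recipe, store):
    """Rebuild the original bytes by looking up each block in order."""
    return b"".join(store[digest] for digest in recipe)

store = {}
original = b"ABCDABCDABCDWXYZ"
recipe = dedupe_blocks(original, store)
restored = reconstruct(recipe, store)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

The sample data references four blocks but only two are stored, and `reconstruct` returns the original bytes unchanged.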
Ask yourself where you are trying to save storage or bandwidth, and choose solutions that work in the correct layers for your business. Do you need to protect individual endpoints? Do you need to store data locally or in the cloud?
Implement deduplication that works for you.
Deduplication can save you time, storage space and bandwidth, but it also has its own costs: the process itself takes up CPU cycles. Weigh the performance and time impact of running deduplication locally versus relying on cloud solutions, and of running it in hardware versus software.
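One way to weigh that trade-off is to measure both sides on a sample of your own data. The sketch below is only an illustration under stated assumptions: the workload is synthetic (random 64 KiB blocks, each repeated 20 times), the function name `measure_dedup` is hypothetical, and the CPU time reported will vary by machine.

```python
import hashlib
import os
import time

def measure_dedup(blocks):
    """Return (bytes in, bytes stored after deduplication, CPU seconds spent)."""
    start = time.process_time()
    unique = {}
    for block in blocks:
        unique.setdefault(hashlib.sha256(block).digest(), block)
    cpu_seconds = time.process_time() - start
    total = sum(len(block) for block in blocks)
    stored = sum(len(block) for block in unique.values())
    return total, stored, cpu_seconds

# Synthetic, made-up workload: 10 distinct 64 KiB blocks, each repeated 20 times.
distinct = [os.urandom(64 * 1024) for _ in range(10)]
blocks = distinct * 20
total, stored, cpu_seconds = measure_dedup(blocks)
print(f"{total} bytes in, {stored} bytes stored, {cpu_seconds:.4f} CPU seconds")
```

Comparing the bytes saved against the CPU time spent gives a first estimate of whether deduplication pays off for a given dataset and machine.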
Ask yourself what is best for the data you are working with and what you are trying to do. Do you need a mail server that can save you bandwidth when emailing thousands of newsletters? Do you need a backup solution that can deduplicate thousands of documents into the smallest possible size?
Deploy software and hardware that handles deduplication for you.
Deduplicating your data manually with a local process takes time. Choosing optimized software that automatically performs several processes at once, such as analysis, deduplication and encryption of your datasets, lets you use your time and resources more efficiently. If you need deduplication, you are likely also interested in encryption and storage. A backup as a service (BaaS) provider could, for example, handle all three of these tasks for your data and save you a lot of time.
© 2024 CrashPlan® All rights reserved.