
How Do You Archive Data at Scale?

By Catherine Chiang on June 12, 2018

Archiving, as a concept, seems simple enough: offload your infrequently accessed data to a secondary storage tier and keep it until the day you might need it.

But what happens when you have petabytes of unstructured data? As with other aspects of data management, the game changes once your data reaches a certain scale and is unstructured.

Why is Archiving Billions of Files So Hard?

As with many data management functions involving this scale of data, archiving billions of files becomes a data movement game. How do you move billions of files to an archive tier, and how do you recover within a reasonable amount of time when you actually need the data?

In addition, data is always growing, meaning that your archive tier needs to be able to grow with your data. Adding capacity in a way that doesn’t introduce unwanted complexity into your infrastructure can be a difficult problem to solve with legacy solutions.

Another problem with a forever-growing archive tier is that it can become unmanageable and cumbersome unless your archive solution is designed to reduce complexity and management overhead.

How Do You Archive Data at Scale?

Archiving data at scale requires your archive solution to have three main capabilities. First, it needs to be able to move large amounts of data quickly. Second, it needs to be scalable. And third, it needs to be consolidated.

Highly parallel, latency-aware data movement

Whenever massive amounts of data are involved, a modern data movement engine is a must. Whether data needs to be moved from where it lives to where it’s needed or to protect the data through offsite redundancy, an enterprise data management strategy must include effective data movement.

Archive is no exception. After all, an archive solution is useless if you can’t move your data to the solution effectively!

Modern solutions that run many transfer threads in parallel can move far more data than legacy solutions, which were built for terabyte-scale data and fail under the demands of moving hundreds of terabytes or petabytes.

Another issue with moving large amounts of data with traditional archive solutions is the load they place on the underlying filesystems, which degrades the performance of other applications. A latency-aware data mover watches how quickly the source filesystem responds and backs off when it slows down, so archiving does not interfere with user functions and work doesn’t have to halt while data is archived.
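To make the idea concrete, here is a minimal sketch of highly parallel, latency-aware data movement in Python. It is an illustration under stated assumptions, not how any particular product works: the function names (probe_latency, copy_one, archive_tree), the thresholds, and the use of a timed metadata call as a stand-in for "how busy is the source filesystem" are all hypothetical choices made for this example.

```python
import os
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def probe_latency(path: Path) -> float:
    """Time one metadata call as a rough proxy for how busy the source is."""
    start = time.perf_counter()
    os.stat(path)
    return time.perf_counter() - start


def copy_one(src: Path, src_root: Path, dst_root: Path,
             max_latency_s: float, backoff_s: float) -> Path:
    # Latency awareness (assumed policy): pause while the source looks busy.
    # The retry cap keeps a permanently slow source from stalling the job.
    for _ in range(5):
        if probe_latency(src_root) <= max_latency_s:
            break
        time.sleep(backoff_s)
    dst = dst_root / src.relative_to(src_root)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copies file data and metadata
    return dst


def archive_tree(src_root: Path, dst_root: Path, workers: int = 8,
                 max_latency_s: float = 0.05, backoff_s: float = 0.1) -> int:
    """Copy every file under src_root to dst_root using a pool of threads."""
    files = [p for p in src_root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        copied = list(pool.map(
            lambda p: copy_one(p, src_root, dst_root, max_latency_s, backoff_s),
            files))
    return len(copied)
```

Calling archive_tree(Path("/mnt/primary/projects"), Path("/mnt/archive/projects"), workers=32) would copy the tree with 32 threads (both paths are placeholders). A real data mover would also need checksums, retries, and a far more efficient way to walk billions of files, all of which this sketch leaves out.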

Scale-out

Why scalable? As an organization’s data grows, which it will, likely at an exponential rate, more data will need to be archived off the primary tier. The nature of an archive tier is that it’s forever growing, since you’re retaining all of the data your organization has ever created that may be of use someday.

Scale-out archive allows you to add more capacity by adding more appliances that stay within the same system, eliminating silos and complexity as your archive tier grows with your business.

Consolidated

For enterprise-scale data, archive needs to be not only scale-out, but consolidated as well. Scale-out may seem to imply that the data is “consolidated,” because a scale-out architecture contains all the data within one system, but there is a difference.

Scale-out simplifies your archive tier by allowing you to add more capacity within the same system, but if your clusters are geographically distributed, you’ll still be left managing them separately. Consolidated archive means that everything is managed via one interface or dashboard. The key added benefit of consolidated archive is that it reduces the amount of management needed.

Reducing management overhead is extremely important when dealing with large amounts of data, as most organizations don’t want the management overhead involved in traditional primary storage for their archive tier of rarely accessed data.

In multi-vendor environments, a consolidated secondary storage solution is key to reducing unwanted complexity in your archive tier. Traditional archive solutions are vendor-specific, locking users into buying archive solutions from the same vendors as their primary storage solutions. To make the situation worse, silos typically exist even within the same vendor, with each silo on the primary tier creating a silo on the secondary tier.

For organizations with multiple primary storage vendors, archiving gets cumbersome and difficult to manage when that complexity is replicated on the secondary tier. Archived data becomes siloed, preventing the organization from taking full advantage of its archive tier as a searchable repository.

A modern consolidated archive solution should be able to integrate easily with any NAS vendor, as well as recover transparently back to the same system.

Igneous provides a scale-out, consolidated backup and archive solution for petabyte-scale unstructured data.

Read our use-case page to learn more about how we manage file archive at enterprise scale.

