We often talk about backup and archive as coupled processes, so much so that these two very different concepts may become conflated.
It’s not just a matter of semantics. Many organizations don’t differentiate clearly between backup and archive, using backups kept for long periods of time as their “archive.” Unfortunately, this approach creates problems down the road, especially as data grows.
To effectively manage their data, organizations must differentiate between backup and archive and build both processes into their infrastructure.
What’s the Difference Between Backup and Archive?
Backup is one type of data protection. It reduces the risk of data loss by creating a secondary copy of the data. Traditionally, organizations have used purpose-built software to back up their entire corpus of data, at regular intervals, to LTO tape or to another, lower-cost tier of spinning disk. For instance, some organizations might run a full backup every weekend and run “differentials” each night while their employees are asleep.
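The weekend-full, nightly-differential schedule described above can be sketched in a few lines. This is only an illustration under assumed rules: the choice of Saturday as the "weekend" day and the function names `backup_type_for` and `files_for_differential` are hypothetical, not taken from any particular backup product.

```python
from datetime import date


def backup_type_for(day: date) -> str:
    """Return 'full' on Saturdays and 'differential' on every other day.

    Assumption: the weekly full backup runs on Saturday.
    """
    # date.weekday() numbers Monday as 0, so Saturday is 5.
    return "full" if day.weekday() == 5 else "differential"


def files_for_differential(mtimes: dict[str, float], last_full_ts: float) -> list[str]:
    """Select files modified since the last FULL backup.

    A differential is measured against the last full backup, not against
    the previous night's run.
    """
    return sorted(path for path, mtime in mtimes.items() if mtime > last_full_ts)
```

Because each differential is measured against the last full backup, a restore under this scheme would need only the most recent full backup plus the most recent differential.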
Archive involves moving rarely-accessed, old, or inactive data to a secondary or tertiary storage tier for medium- or long-term storage. Organizations may archive in order to save data that may one day be needed, to comply with industry regulations, or to offload data from expensive primary storage. With data growth increasing and the advent of high performance primary storage, such as all-flash arrays, archiving has become a necessity for controlling costs. Best of all, when done effectively, archiving enables organizations to better understand their data and its value.
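One rough way to find "rarely-accessed, old, or inactive" data is to filter on the last access time. The sketch below assumes a one-year idle threshold and a hypothetical `FileInfo` record. In practice the timestamps might come from something like `os.stat().st_atime`, though access times are not reliably updated on every filesystem.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    last_access_ts: float  # seconds since the epoch


def archive_candidates(files: list[FileInfo], now_ts: float,
                       max_idle_days: int = 365) -> list[FileInfo]:
    """Return files not accessed within the last `max_idle_days` days.

    The 365-day default is an illustrative assumption, not a recommendation.
    """
    cutoff = now_ts - max_idle_days * SECONDS_PER_DAY
    return [f for f in files if f.last_access_ts < cutoff]
```

Summing `size_bytes` over the result gives a first estimate of how much primary capacity archiving could free.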
The key difference between backup and archive is that a backup is a copy of the data, while an archive is the main version of the data, located in an infrequently accessed but cost-effective tier.
You’ll need your backup copy if a server goes down, if you accidentally delete a file, or if some data set is accidentally changed. You’ll look in the archives if your applications need to reference more historical data than anticipated, if you want to run a 25th anniversary edition of your studio’s breakout cartoon, or if you are involved in a lawsuit.
Traditional Backup Software Can’t Archive Effectively
Often, organizations attempt to fulfill their backup and archive needs through backup software. Although this approach may seem to kill two birds with one stone, it can actually result in more complexity and management overhead for enterprise IT.
For example, backup schedules will often include yearly full backups for long-term retention, referred to as “archives.” But unlike with a true archive, retrieving specific files from these full backups is difficult.
Another pitfall of using backups as archives is that storing backups over long periods of time is not cost-effective. Since the backup is a copy of the data, there are now two copies of data that need to be stored. The original data is still on primary storage, eating up expensive storage capacity. In addition, having to manage two copies of the data adds to data management overhead. At scale, this is simply untenable.
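The two-copies argument can be made concrete with simple arithmetic. Every number below is a made-up placeholder in arbitrary cost units per TB per month, and the function names are hypothetical. The sketch also ignores any protection copy an organization might keep of the archive itself.

```python
def monthly_cost_backup_as_archive(tb: int, primary_per_tb: int, backup_per_tb: int) -> int:
    """Data stays on primary storage AND a long-retention copy sits on the backup tier."""
    return tb * (primary_per_tb + backup_per_tb)


def monthly_cost_true_archive(tb: int, archive_per_tb: int) -> int:
    """Data is moved off primary storage; the archive tier holds the main version."""
    return tb * archive_per_tb


# Hypothetical figures: 100 TB of inactive data, primary at 100 units/TB,
# backup and archive tiers at 20 units/TB each.
two_copies = monthly_cost_backup_as_archive(100, primary_per_tb=100, backup_per_tb=20)
one_copy = monthly_cost_true_archive(100, archive_per_tb=20)
print(two_copies, one_copy)
```

With these placeholder prices the gap comes almost entirely from the inactive data continuing to occupy primary storage.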
How Do You Back Up and Archive Massive Unstructured Data?
Traditionally, organizations have used tape as their archive solution. While tape is a reasonable archive solution for some, today’s increasingly data-driven organizations may find that tape doesn’t allow them to fully harness the power of their archived data.
Tape workflows are labor-intensive and require administrative overhead. Between shuttling tapes to and from a tape-vaulting service and keeping track of catalogs, maintaining archives on tape is far from painless. Once the data is needed, retrieving it from tape is another laborious process that often prevents archived data from actually being used.
Organizations utilizing modern workflows, especially machine learning and artificial intelligence workflows, need to be able to access their archives in a timely manner without investing large amounts of IT resources into maintaining their secondary tier.
A modern archive solution should:
- Help identify data that needs to be archived
- Have automated workflows that are simple to set up and use
- Enable end users to easily access archived data
- Contain true archive capabilities such as cataloging and search, which make it easier to know what’s there, organize the data, and retrieve it when it’s needed.
If you would like to learn more about Igneous’ modern backup and archive capabilities for massive unstructured data, check out our product datasheet.