
Rsync: Necessary but not Sufficient

by Jeff Hughes – June 27, 2019

What is rsync?

Rsync is an influential open-source tool that people have been using for more than 20 years to copy and move their data. It’s the “Swiss Army knife” of data movement, capable of migrating home directories, maintaining websites, backing up small to medium-sized databases, and aiding in content distribution for businesses of all sizes.

First published in 1996 by Andrew Tridgell and Paul Mackerras, rsync was built around the networking constraints of the early- to mid-1990s. This was the era when drive capacities were measured in megabytes and just crossing into gigabytes, when Western Digital had just come out with a 1.6GB hard drive that sold for a wildly low $399. Storage was pricey and network bandwidth was in short supply, but compute was relatively plentiful, so a good tool for the time would have optimized for the storage:network:compute ratios that were actually available. Rsync was a good tool for the time.

Today, we have 15TB hard drives readily available: our world has changed. Volumes and scales are fundamentally different now. 

 

How does rsync work?

Although rsync, as an open-source project, has gone through many surface-level improvements, the fundamental algorithm developed in its early days to compare two file systems remains central to the tool. It works like this: for each file, the receiving side splits its copy into blocks and sends a checksum for each one, and the sending side scans its own copy against those checksums, transmitting only the blocks that don’t match. This minimizes load on the network very effectively, but it does so by running calculations on both sides and waiting for each checksum comparison to complete before a block is sent. Today the constraints have flipped: we have tons of network, but not enough I/O and not enough CPU to touch all of your data just to decide whether it’s worth transferring (or to actually transfer it). Every bit of latency matters when it comes to maximizing throughput.
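
To make the trade-off concrete, here is a toy Python sketch of the blockwise-comparison idea. It is not rsync’s actual implementation (the real algorithm adds a weak rolling checksum so matching blocks can be found at any byte offset, and picks a block size based on file size); it just shows where the work goes: both sides read and hash every block, and only mismatched blocks cross the wire.

    # Toy illustration of blockwise comparison, not rsync's real algorithm.
    import hashlib

    BLOCK_SIZE = 4096  # bytes per block; the real rsync sizes this per file

    def block_digests(path):
        """Checksum every block of the destination file (reads ALL of its data)."""
        digests = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                digests.append(hashlib.md5(block).hexdigest())
        return digests

    def blocks_to_send(src_path, dst_digests):
        """Return (block_index, data) for every source block the destination lacks."""
        changed = []
        with open(src_path, "rb") as f:
            index = 0
            while block := f.read(BLOCK_SIZE):
                if index >= len(dst_digests) or hashlib.md5(block).hexdigest() != dst_digests[index]:
                    changed.append((index, block))  # only these bytes go over the network
                index += 1
        return changed

Network traffic is minimal, but notice that both ends have to read and hash every byte of every candidate file before a single block moves; that is exactly the I/O and CPU bill that dominates today.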

Over time, upgrades and features have been added to the project to try to optimize it for today’s constraints (e.g., skipping files whose metadata, such as size and modification time, already match), giving IT admins and other users the option of layering more code on top to make up for the fundamental mismatch between the algorithm and the evolving infrastructure it’s deployed on.

At the end of the day, you’re still spending a lot of time waiting for checksums or comparisons. If, like many of our customers, you want to move data from fast systems to archive in order to free up your fast storage for more latency-sensitive applications, and you’re forced to check every one of billions of files in order to do this, it doesn’t make a lot of sense as a strategy.

What day-to-day issues does this actually cause in modern production environments?

 

First Problem: Scale

Rsync works serially, because back in the '90s, everything was single-stream, spinning disk. When we entered the petabyte era, this stopped making sense. But the “petabyte era” isn’t a single point in time--some industries are more data-intensive than others, and some people are managing home media libraries, for instance, in the tens of terabytes. If you’re still working in the terabyte world, and you don’t see yourself or your organization passing half a petabyte anytime soon, perhaps scale won’t be what makes rsync an untenable solution for you. 

If you’re dealing with more data, however, you know you simply have to parallelize things in order to maximize throughput in today’s multi-core world, with today’s I/O profile. We’re talking about running a petabyte through an algorithm in hours instead of weeks. And when it comes to running backups, which is what many people still use rsync for, weeks of lost work is about as devastating as complete loss of data.
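
To give a sense of what “parallelize it yourself” looks like, here is a minimal, hypothetical Python sketch that shards a tree by top-level directory and runs one rsync process per shard. The paths and worker count are made up for illustration; a real setup also has to cope with uneven shard sizes, files at the root, retries, and reporting.

    # Hypothetical fan-out: one rsync per top-level directory, several at a time.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    SRC = Path("/mnt/fast-tier/projects")       # placeholder source tree
    DST = "archive-host:/archive/projects"      # placeholder destination

    def copy_subdir(subdir: Path) -> int:
        # -a preserves permissions, times, and links; one stream per subdirectory.
        result = subprocess.run(
            ["rsync", "-a", f"{subdir}/", f"{DST}/{subdir.name}/"],
            capture_output=True, text=True,
        )
        return result.returncode

    subdirs = [d for d in SRC.iterdir() if d.is_dir()]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return_codes = list(pool.map(copy_subdir, subdirs))

    failed = [d.name for d, code in zip(subdirs, return_codes) if code != 0]
    if failed:
        print("rsync failed for:", ", ".join(failed))

Even this simple fan-out has sharp edges: one oversized subdirectory still serializes behind a single process, and the scheduling, retry, and reporting logic is now yours to own, which leads directly to the next problem.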

 

Second Problem: Reliability and monitoring, or, Tools versus Solutions

Rsync is a command. It’s a tool. If you’re going to use rsync to set up a data transfer system, you have to think about a whole host of things:

  • Where am I going to run it?
    • Does it have access to the network it needs?
    • Does it have access to the file systems I want it to look at? 
    • Is it close to the data?
  • Is it reliable, or is there likely to be some hiccup?
    • Who’s going to monitor this? 
    • Who’s going to monitor this if I go on vacation?
    • What happens if it fails?

These are just the questions that are top of mind. As you can imagine, answering them effectively is likely to take significant time and expertise, and that’s time and expertise your team may or may not have.
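
To give a flavor of what that time and expertise actually goes into, here is a small, hypothetical sketch of the kind of wrapper teams end up writing around rsync. The paths are placeholders, and alert_oncall() stands in for whatever paging or chat integration you actually use.

    # Hypothetical wrapper: retry rsync a few times, then alert a human.
    import subprocess
    import time

    SRC = "/data/projects/"                     # placeholder source
    DST = "backup-host:/backups/projects/"      # placeholder destination
    MAX_ATTEMPTS = 3

    def alert_oncall(message: str) -> None:
        print("ALERT:", message)  # stand-in for email/pager/chat integration

    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(["rsync", "-a", "--delete", SRC, DST])
        if result.returncode == 0:
            break
        time.sleep(60 * attempt)  # crude backoff before the next attempt
    else:
        alert_oncall(f"rsync {SRC} -> {DST} failed {MAX_ATTEMPTS} times")

And this still leaves the harder questions above unanswered: where does this script run, who watches it while you’re on vacation, and what happens to a half-finished transfer when it dies?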

In fairness, there are lots of people who have built more tools to help this tool. With the right choices and effort, you could find another open-source tool to handle job scheduling and monitoring, or an rsync-like tool that isn’t rsync but is similar and has the features you want. But then you’ll be in the world of evaluating the gaps in each tool and finding ways to work around them. Globus, Aspera, and Signiant are all commercially available data transfer tools that give you more reliability (which itself points to the usefulness of rsync), but they don’t solve the problem of management. Whatever tool you choose, there will always be some priority you’ve optimized for and some gap you’ll eventually have to home-grow around to make up for the tool’s shortcomings. This is the challenge of using tools versus a solution: you can always find another tool to pile on, another project that implements the feature you want, but there’s no guarantee it will “just work.”

 

Third Problem: Complexity of the modern data ecosystem

The final big barrier to “just using rsync” in today’s enterprise environment is that it only works with POSIX filesystems. That means rsync works with any Unix filesystem and any NFS-type file system, but it doesn’t work with SMB or with object storage. In other words, an investment in setting up rsync for your environment means committing to using only POSIX filesystems.

Of course, even if your datacenter is only dealing in POSIX-compatible storage at the moment, you can’t control the people you’re working with. Perhaps you get assigned to collaborate with a different business unit. Perhaps your company acquires your next biggest competitor… and they rely exclusively on SMB.* Perhaps you need to share data with a business partner or a fellow research university. Essentially, if you’re ever going to collaborate with anyone, ever, you don’t want to make storage decisions based on who you’re currently collaborating with! Not only can your list of collaborators change, but the type of storage they’re using can change. 

If you’re at any type of scale, storage is a big budget item, and you need to shop based on what you need--not based on the limitations of your collaboration tools.

Your team’s workflows are another thing to consider. If they ever want to move data from a filesystem to a public cloud, and maybe back to a filesystem later, you’ll need a tool that isn’t limited to POSIX.

Finally, if you’re considering cloud migration, a POSIX-only tool won’t serve you well. Whether you’re moving only a certain application to the cloud, or adopting a multi-year, multi-cloud migration plan, rsync will be useless in this endeavor. 

 

To wrap it up…

Rsync is still good at what it does, but the data management world has reached 2019 and it’s not going to rewind. You might be able to make it work for a project, or even a small and simple environment, today. But a serial algorithm that eats up compute and I/O, requires constant monitoring, and is only compatible with POSIX systems isn’t going to serve your organization well for very long. Even if you don’t end up choosing Igneous for your UDM projects, please don’t start building out an environment that depends on rsync for anything that is petabyte-scale, business-critical, or both. If you ignore this advice and start building anyway… well, you know who to call.


*If you are the company that uses SMB, just substitute the word “robocopy” for “rsync” in the other sections of this article. 

 
