Article update: Curious about moving a petabyte of data in 2019? Read Jeff's Medium article about moving the 5PB of data used to discover the first black hole image.
There is a saying that nothing beats the bandwidth of a FedEx truck. To this point, in 2007 Jonathan Schwartz (then CEO of Sun) claimed that moving a petabyte from San Francisco to Hong Kong would be faster via sailboat than via a network link. In his calculations, it would take 507 years to transfer this mythical petabyte.
That was nearly a decade ago. I decided to revisit that claim to see if things are better today. After all, much has changed since 2007. First, while a petabyte was a lot of data in 2007, today it's commonplace. Second, network speeds have improved substantially. Jonathan was talking about a half-megabit-per-second network link, which is laughably slow amidst today's ubiquitous 10 Gbps connections.
So, using the same methodology Jonathan used, but with updated numbers, I calculate that transferring a petabyte should take less than 11 days. However, having spent the better part of 7 years moving petabyte data sets, I can report that it's never that simple. Here's why …
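The arithmetic behind that estimate is simple enough to sketch. This assumes a decimal petabyte (10^15 bytes) and a completely idle 10 Gbps link with zero protocol overhead, which is why it lands a bit under the 11-day figure:

```python
# Back-of-envelope: ideal transfer time for 1 PB over a dedicated 10 Gbps link.
# Assumes a decimal petabyte and zero protocol overhead.
PETABYTE_BITS = 10**15 * 8          # 1 PB expressed in bits
LINK_BPS = 10 * 10**9               # 10 Gbps

seconds = PETABYTE_BITS / LINK_BPS  # 800,000 seconds
days = seconds / 86_400

print(f"{days:.1f} days")           # prints "9.3 days"
```

Real-world protocol overhead and retransmissions push this toward the 11 days quoted above.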
- Jonathan’s calculations assumed you would have the entire pipe for your transfer. Who has a 10 Gbps link lying around to dedicate to a petabyte move? Better plan for congestion: in the real world, 10 Gbps is probably closer to 5 Gbps.
- TCP/IP over long fat networks (high-bandwidth, high-latency links) doesn’t perform well out of the box. If not tuned for the characteristics of the network, transfers will run 50 to 70 percent slower.
- Applications aren’t built for high-throughput, high-latency links, even tools made expressly for transferring data, such as rsync and robocopy. As a real-world example, I had to run more than 50 concurrent rsync processes to come close to filling a 10 Gbps link between the US East and West Coasts.
- Is the source storage system capable of delivering the transfer throughput on top of its regular workload? The answer is most frequently no. The usual workaround is to run transfers only during off hours, which effectively multiplies the time required by 2-3x.
- Just as storage systems are chosen based on the type of data (more CPU for metadata-intensive workloads, more spindles/controllers for throughput-intensive ones), transfer solutions have to be tuned for the data set. In one large-file workflow, 15 rsync threads were able to saturate the network. For the same customer, a different data set made up of tiny files ran at 10 percent of that performance.
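The second and third bullets both come down to the bandwidth-delay product: a TCP stream can have at most one window of data in flight per round trip, so on a long fat network you need either a very large window or many parallel streams. A rough sketch, assuming an 80 ms coast-to-coast round trip and a 4 MB per-stream window (both illustrative numbers, not measurements from the rsync anecdote above):

```python
# Bandwidth-delay product: bytes that must be "in flight" to keep a link full.
LINK_BPS = 10 * 10**9        # 10 Gbps link
RTT_S = 0.080                # assumed 80 ms coast-to-coast round trip

bdp_bytes = LINK_BPS / 8 * RTT_S              # 100 MB in flight to fill the pipe
print(f"BDP: {bdp_bytes / 1e6:.0f} MB")

# A single stream capped at a 4 MB window moves one window per round trip:
WINDOW_BYTES = 4 * 1024 * 1024
per_stream_bps = WINDOW_BYTES * 8 / RTT_S     # ~420 Mbps per stream
streams_needed = LINK_BPS / per_stream_bps    # ~24 streams to fill the link
print(f"per-stream: {per_stream_bps / 1e6:.0f} Mbps, "
      f"streams to fill link: {streams_needed:.0f}")
```

Even under these generous assumptions, two dozen streams are needed just to cover the window math; per-file overhead in a tool like rsync pushes the practical count higher still, which is how you end up at 50 concurrent processes.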
So, how do you bulk move data across networks? The most common strategy is to optimize one of the above bottlenecks until performance is “good enough.” The rest is a waiting game. In my 7 years of experience, a realistic average for a petabyte over a 10 Gbps link is a transfer lasting over a month.
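Stacking the factors above makes the month-long figure plausible. As a rough illustration, take the 50 percent congestion haircut and an 8-hour off-hours window from the bullets above (both assumptions, not measurements):

```python
# Back-of-envelope: petabyte transfer time once real-world constraints stack up.
PETABYTE_BITS = 10**15 * 8
EFFECTIVE_BPS = 5 * 10**9        # 10 Gbps link, ~50% usable due to congestion
OFF_HOURS_FRACTION = 8 / 24      # transfers permitted only 8 hours per day

avg_bps = EFFECTIVE_BPS * OFF_HOURS_FRACTION
days = PETABYTE_BITS / avg_bps / 86_400

print(f"{days:.0f} days")        # prints "56 days"
```

Roughly two months: well past the idealized 11-day figure, and consistent with the month-plus averages seen in practice.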
Specialized software can help. Aspera (now owned by IBM) built a nice niche business optimizing large file transfer across WANs. Signiant did as well. And now we have another: TransferBoost. Yet none of these address the entire range of large data transfer pitfalls.
What’s the point of all this? Despite huge advances in raw bandwidth, Jonathan’s statements still hold quite a bit of truth! Moving large data sets is slow (though not impossible) and will be for the foreseeable future. Bandwidth alone won’t fix the problem.
A better question might be: Do you need to transfer all that data?
In a surprisingly large number of cases, the people who produce large data sets are colocated with the people who need those same data sets. In cases like this, the current trend toward cloud storage is the worst thing you can do.
I spoke to one very large producer, processor, and consumer of content, who said they would need 320 Direct Connect links to various AWS data centers to move their workflows into “the cloud” while maintaining their current SLAs. Bear in mind that this is a business with fairly sophisticated IT operations.
So, while the problem of transferring a petabyte of data from San Francisco to Hong Kong is an interesting challenge, many companies’ time would be better spent finding ways to optimize storing, maintaining, and accessing their large data sets locally.
What do you think? Let me know how you are solving the ‘petabyte to Hong Kong’ dilemma. How do you handle large stores of information?