Archive First, Backup Should Be Boring, and Other Insights: A Conversation with Life Sciences Technologist Chris Dwan

by Catherine Chiang – April 24, 2018

Chris Dwan is a leading consultant and technologist specializing in scientific computing and data architecture for the life sciences. Previously, he directed the research computing team at the Broad Institute, and was the first technologist at the New York Genome Center.

Chris joined us for a conversation on data management challenges and trends in the life sciences. Read on to discover Chris’ insights from over a decade of experience in life sciences IT!

 

The Life Sciences Data Management Problem

Life sciences data has exploded in recent years due to developments in lab devices and equipment that now generate more data than ever before. In particular, DNA sequencing workflows now generate huge amounts of data as a result of next-generation sequencing.

“One challenge with next-generation sequencing was that it generated monstrous amounts of data. This is when we saw small to mid-sized labs blowing up the storage budgets of the entire enterprise,” said Chris.

“In the early 2000s, we were all impressed by gigabyte-scale instruments. You could get an instrument that would produce a couple of terabytes a week, which was ridiculous at the time.

"There was a disturbingly common pattern of behavior that continues to this day: The lab would just be bumping along with completely ordinary IT needs—laptops, monitors, printers, and so on. Then one day they file a ticket or mention in passing that they are setting up an instrument that could produce a substantial fraction of a petabyte of data per year. Enterprise IT would be shocked, but by that time the instrument was usually already on the loading dock.

"The raw DNA ‘reads’ that make up one human genome in the most common mode of sequencing take up something like 130 GB. Modern DNA sequencing instruments can sequence tens of individuals at the same time over a run that takes a couple of days. That works out to tens of terabytes per week.

"A compounding factor in all of this is the desire to maximize the value of that expensive instrument. When you buy an expensive machine, there is a strong financial incentive to run 24/7 and keep it busy all the time.”

The result? Life sciences organizations can now routinely generate petabytes of data, which will crush IT infrastructure that was built for less data-intensive office or lab operations.
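To make those numbers concrete, here is a rough back-of-envelope calculation in Python using the figures Chris cites. The samples-per-run count and run length are illustrative assumptions; actual throughput varies widely by instrument and sequencing mode.

```python
# Rough back-of-envelope estimate of sequencing data growth,
# using the illustrative figures from the conversation above.

GB_PER_GENOME = 130     # raw reads for one human genome (approximate)
SAMPLES_PER_RUN = 48    # "tens of individuals" per run (illustrative assumption)
RUN_DAYS = 2            # a run takes "a couple of days"

gb_per_week = GB_PER_GENOME * SAMPLES_PER_RUN * (7 / RUN_DAYS)
tb_per_week = gb_per_week / 1000
tb_per_year = tb_per_week * 52

print(f"~{tb_per_week:.0f} TB per week")   # ~22 TB per week
print(f"~{tb_per_year:.0f} TB per year")   # ~1136 TB per year, on the order of a petabyte
```

One busy instrument, in other words, can outgrow an enterprise backup budget within its first year of operation.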

Archive First: How Life Sciences Organizations Can Effectively Manage Their Data

Chris has led and advised many IT departments of life sciences organizations through periods of huge data growth, including the Broad Institute and the New York Genome Center, so we were curious about his insights on life sciences data management strategy.

Chris advocates for a simple and elegant “archive first” approach to data management. From Chris’ perspective, archiving first is the simplest way to protect the data, along with its valuable metadata, before the data can be lost or changed.

“One of the things I’m pushing with most of my customers is the idea that there should be a place where your data is protected for the long term, and that valuable data goes there first. As soon as you generate something of value, you capture it in the archive, along with whatever metadata you have at the time of creation. The key here is to not overthink the metadata schema while you allow that first few petabytes to accumulate in a disorganized way.

"What that does for you is that downstream, in analysis and exploitation, you can recover. It means that all subsequent or derived pieces of data can be treated as transientkeep them as long as they have value. It also does wonders for the questions of reproducibility, governance, and provenance.” said Chris.

“It’s actually a surprisingly hard sell, because people think that I’m talking about storing the data twice (doubling storage costs), or that I’m proposing that we stash it in a place that is not amenable to high performance analytics, machine learning, and so on. In my experience, over time, the benefits of simplicity, reliability, and capturing metadata up front outweigh any arguments about incremental costs. I actually think it’s much cheaper in the long run because hopefully you had planned to back the data up anyway. Why not do it in a simple way that creates an incredibly valuable index almost for free?”
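To make the "archive first" idea concrete, here is a minimal Python sketch of what an ingest step might look like: copy each newly generated file into a durable archive location and append whatever metadata exists at the time of creation to a simple index. The paths, the archive_first function, and the append-only index format are hypothetical illustrations, not a description of any particular product.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical locations: in practice these would point at whatever durable,
# cost-effective archive tier and metadata index the organization has chosen.
ARCHIVE_ROOT = Path("/archive")
INDEX_FILE = Path("/archive/index.jsonl")

def _sha256(path: Path) -> str:
    """Checksum a (potentially very large) file in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_first(source: Path, metadata: dict) -> Path:
    """Copy a newly generated file into the archive and record whatever
    metadata is known at the time of creation in a simple append-only index."""
    checksum = _sha256(source)
    destination = ARCHIVE_ROOT / checksum[:2] / source.name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)

    entry = {
        "archived_path": str(destination),
        "original_path": str(source),
        "sha256": checksum,
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,  # whatever is known now; don't overthink the schema
    }
    with INDEX_FILE.open("a") as index:
        index.write(json.dumps(entry) + "\n")
    return destination

# Example: capture a run's output as soon as the instrument finishes writing it.
# archive_first(Path("/instrument/run_42/sample_07.fastq.gz"),
#               {"instrument": "sequencer-1", "run": 42, "sample": "sample_07"})
```

The point of the sketch is the ordering, not the specific tooling: the durable copy and the metadata record are made first, and everything downstream can be treated as disposable.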

This “archive first” approach ties back to Chris’ belief that backups should be boring. In the backup space, we’re familiar with the perception that backup is boring compared to more exciting and flashy technologies. According to Chris, this is okay; in fact, it makes perfect sense.

“As an infrastructure person, there are pieces of technology that just shouldn’t be exciting,” said Chris. “What you would really like, particularly out of your disaster recovery plan, your operational continuity, your data availability setup, and things of that nature, is that they be absolutely, rock-solid reliable. Eventually, over time, you always need to use your backups. At that time, it’s never a good sign when it gets too interesting or exciting.

“On high performance computing systems, part of the game is to see how far you can push it. That means that you run larger and more difficult problems until you find the edges of a system’s capabilities. It’s okay when you find the limits of an HPC system. It’s actually pretty cool to have a really exciting day breaking and fixing an HPC system over and over as you see what it’s capable of.

"When the data layer fails, it’s never a good day. That means that architects make our plans looking for simplicity, reliability, and stability. A virtuous sort of boring backup and data management layer is what we want."

“With the durable archive layer of our data, we’re looking for stability coupled with predictable costs over time periods measured in years. On that timescale, all systems experience small and large failures. You also experience staff turnover, mission creep, corporate mergers and acquisitions, and other fun external circumstances. When the data layer fails, it’s never a good day. That means that architects make our plans looking for simplicity, reliability, and stability. A virtuous sort of boring backup and data management layer is what we want.

“Naturally, there will also be high performance storage associated with the analytic or HPC systems. That’s not what we’re talking about here. In order to get really high performance, we usually need to spend more dollars per terabyte and accept lower levels of uptime and durability. It’s a different problem.

"Failure to set up a robust data strategy does not prevent the bad thing from happening. You just give up your agency and let your system choose when your team is going to have a really bad day."

“In terms of setting up failover systems, you’ve got a couple of options. You can either do the work at your convenience, before the failure occurs. That will be a project that requires justification, scheduling, investment, and so on. The other possible plan is to just wait for something to fail and implicitly plan to drop everything and deal with it then. Failure to set up a robust data strategy does not prevent the bad thing from happening. You just give up your agency and let your system choose when your team is going to have a really bad day.

“It’s very much like the old ‘quiet conversation’ sales pitch from insurance vendors. Nobody really likes to talk about planning for the bad thing, but in the long run it’s much better than the alternative.”

How Does Igneous Fit into the Picture?

At Igneous, we have built our product to solve the data management problems of data-intensive industries, including the life sciences.

We asked Chris how he thinks Igneous fits into the picture and solves the problems that life sciences organizations face.

Chris explained that in many life sciences organizations, the problem starts with not having a consistent way to implement the core pieces of a data strategy that focus on durability and indexing.

“The core problem that Igneous is positioned to solve, I would call it data skew,” said Chris. “In most organizations, I find teams making up their very own solutions to the question of data storage. This creates challenges for some person in the future who wants to go back and do a cross-cutting aggregate analysis. Even simple questions like ‘Have we ever run experiment X on sample Y?’ can be incredibly difficult to answer. When data is spread over a bunch of different platforms and schemas, you get a large amount of waste, re-work, and redundancy.”

According to Chris, Igneous is a solution that could provide that centralized location. While it is certainly one tool among many, merely capturing and indexing data is a foundational piece of the data puzzle.
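To see why a single index matters, here is a small, hypothetical sketch of Chris’ example question, “Have we ever run experiment X on sample Y?”, answered with one pass over a consolidated metadata index (the same illustrative index format as the earlier sketch) rather than a hunt across per-team storage layouts.

```python
import json
from pathlib import Path

INDEX_FILE = Path("/archive/index.jsonl")  # hypothetical consolidated index from the sketch above

def have_we_ever_run(experiment: str, sample: str) -> bool:
    """Answer 'have we ever run experiment X on sample Y?' with a single pass
    over one consolidated metadata index."""
    with INDEX_FILE.open() as index:
        for line in index:
            meta = json.loads(line).get("metadata", {})
            if meta.get("experiment") == experiment and meta.get("sample") == sample:
                return True
    return False

# Without a consolidated index, answering the same question means chasing down
# every team's home-grown storage layout and schema -- the "data skew" Chris describes.
```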

Chris is excited to help life sciences and medical organizations move past the basics of capturing and protecting their data and get on with the business of improving medical outcomes and quality of life for people around the world.

Chris said, “What we’ve found in the last couple of decades is that bringing together data from a bunch of different modalities is where the real value lies. It’s fairly rare to be able to really change someone’s life with any single type of observation or measurement.

“The real clinical value of this work will come when we are able to combine genomic data with the clinical narrative in the context of wearables and journals of observations from clinicians, caregivers, and the patient themselves. This is stuff that the digital marketing world has been exploiting for years and years, and we’re finally beginning to see it in hospitals. While it’s been important and challenging to get to the point where we can reliably capture and store genomic data, the value of that data will come from bringing all of these other data modalities together in a common and flexible framework.

“We need to think about data in terms of systems where you’re able to pull up information on a single individual across many different modalities. I think that’s where the real discoveries and clinical innovations will be enabled.”

Investing for the Future

As data and its value grow, it’s imperative that life sciences organizations begin investing in their IT infrastructure now to prepare for the future.

Chris suggested that organizations should invest in the following:

“The first thing that a life sciences organization should invest in is a person, a human being, whose job is to own the organization of the data. This needs to be somebody’s job. I’ve struggled to find the right word for this role: they are a curator, a steward, a librarian, and a warden.

"Of course, they will also need a durable, cost-effective storage substrate coupled with a robust, flexible way to deal with metadata."

“That person will need tools: The most important tool is clout within the organization. They will build a network of data champions who will be the ones to create and maintain the metadata index for the data their team creates. Of course, they will also need a durable, cost-effective storage substrate coupled with a robust, flexible way to deal with metadata.

“The analytic infrastructure is going to be domain dependent. It could be image based machine learning, natural language processing, custom statistical tools that run on top of Spark and Graph data models, or any of a bunch of other things.”

The bottom line, though, is that “it starts with a decision by the organization that data is important enough to hold a person accountable for it, and then capturing the data in a safe, secure, and well indexed platform. Everything proceeds from there.”

Learn more about Chris Dwan and his work on his website.

Interested in learning more about how our life sciences customers use Igneous? Contact us!
