The data landscape in genomics and the life sciences is changing. We are moving away from a world in which the work was dominated by merely capturing data, towards one with a more subtle set of challenges centered on data organization, governance, and appropriate usage.
In this new world, we see the weaknesses of the very systems we built to solve our earlier problems. The organizations that succeed in leveraging machine learning, cloud technologies, and other innovations in storage will be those that focus on metadata management and data mobility.
The New Problem for Data Storage: Organization
For more than a decade, we have been grappling with the sheer size of the files generated by high-throughput DNA sequencing. Adding to this challenge is that life sciences data is notoriously heterogeneous. The files holding the information from a single human genome take up between 100 and 130 gigabytes on disk. Of course, the numbers can be either smaller or larger depending on the particulars of the scientific or medical question at hand.
I have seen bioinformatics pipelines blithely produce single files weighing in at more than 10 terabytes, while other tools used by the same project team were spraying thousands or millions of files, each a kilobyte or two, into a single directory on the same filesystem. The small files were admittedly temporary, but the episode illustrates that there is no single best answer to the question of how to configure data storage for the life sciences.
For years now, scale-out network-attached filers have been the compromise of choice. In 2018, it is a straightforward question of engineering and purchasing to deploy storage systems, either cloud-based or on-premises, that are capable of holding petabytes of unstructured data. Most large organizations will adopt hybrid architectures, combining some level of locally controlled systems with their exascale public cloud of choice.
The question of merely storing the data appears to be well and truly solved, but of course, even in this simple case, the devil is in the details. Ewan Birney recently observed, “Genome sequencing is routine in the same way the US Navy routinely lands planes on aircraft carriers. Yes, a good, organized crew does this routinely, but it is complex and surprisingly easy to screw up.” This is as true of the analysis and storage of data as it is for the laboratory processes.
Beyond Storage: Organizing Data for Success
Once we are set with merely capturing the data, we must confront the question of organization. Without extremely strong organizational discipline, even a few petabytes of storage holding a few million files will very quickly become an unusable mess. I know of several organizations that have little to no idea, beyond vague approximations sketched on whiteboards during budget season, of the usage patterns or level of inadvertent duplication in their data. This holds them back from important insights. It also means that we are almost certainly spending more money than we need to.
We are entering a new and transformational era in which life sciences datasets will be sufficiently large, rich, and well-annotated to reap the benefits of the AI/ML revolution. These new analytic technologies require well-organized and curated datasets housed on storage systems that are capable of feeding data to tens of thousands of CPUs or GPUs at a time. Machine learning and analytics have an insatiable appetite for raw performance, which pushes our storage in the direction of speed.
At the same time, we are pushed towards durability. As the benefits of genomics move towards the clinic, we are bound by ever stricter systems of compliance and governance. Regulatory frameworks like HIPAA and agencies like the FDA require us to be able to confidently control and protect data over long timelines. This demands durable, cost-effective storage systems with strong metadata management.
There is an old engineering saying: “You never eliminate a bottleneck; you just move it around.” This comes to mind as I consider a data architecture capable of meeting the performance requirements of ML/AI, the governance requirements of the clinic, and the financial constraints of the CFO’s office.
A modern data strategy must go well beyond the capabilities of any single storage technology. It requires agility in the face of changing requirements. Metadata management, then, seems fundamental to the future of data management.
Chris Dwan is a leading consultant and technologist specializing in scientific computing and data architecture for the life sciences. Previously, he directed the research computing team at the Broad Institute, and was the first technologist at the New York Genome Center.
Chris is a featured speaker at our Data Workflow Forums for the Life Sciences.