
Navigating the Unstructured Data Management Challenges of Next Generation Sequencing Workflows

by Adam Marko – May 1, 2019

Life science organizations face many challenges when it comes to the informatics component of their research. Scientific instrumentation is generating unstructured data at an unprecedented rate, and existing first tier storage systems can quickly reach capacity.

Next Generation Sequencing (NGS) is currently the largest consumer of storage capacity in the life sciences, but adding expensive high-performance storage as demands increase is not a sustainable or cost-effective solution. Let’s look at the unstructured data management challenges of NGS workflows and possible solutions.

What is Next Generation Sequencing?

Next Generation Sequencing (NGS), also called massively parallel or deep sequencing, is one of the most commonly used tools in modern research. NGS is a method of genomic sequence determination that is highly parallelized and high throughput, and has accelerated genomics research immensely. Several different NGS technologies exist, but all involve sequencing large numbers of fragments of DNA which are then used in downstream bioinformatics pipelines. Since the advent of NGS over a decade ago, the cost of sequencing has dropped dramatically, at a rate even exceeding that of Moore’s Law (Fig 1).
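The reads an NGS instrument produces are delivered in FASTQ format, where each read occupies four lines: an identifier, the sequence, a separator, and a per-base quality string. As a rough illustration (a minimal sketch with made-up reads, not data from a real run):

```python
# Minimal FASTQ parser sketch: each read occupies four lines
# (@identifier, sequence, '+', per-base quality string).

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)               # '+' separator line, ignored
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

# Illustrative data: two short reads, not from a real sequencer
raw = """@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIFFFF
""".splitlines()

reads = list(parse_fastq(raw))
print(reads[0])  # ('read1', 'ACGTACGT', 'IIIIIIII')
```

A single sequencing run produces millions of such records, which is why FASTQ files compress well but still dominate storage consumption.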



Figure 1. Rapid decrease in cost per genome (green line) after NGS was developed c. 2007. Note that the cost falls more rapidly than Moore’s Law would predict (white line). Source:

This decrease in cost has come with an associated increase in data production, and currently NGS accounts for the majority of unstructured data stored by life science organizations. As an example, at the NIH Biowulf cluster, the world’s largest compute resource dedicated to public life science research, 70% of the storage is consumed by genomics data.



Figure 2. Storage usage by research area on the NIH Biowulf compute resource. 

Not Just Volume: Other Challenges of Managing Unstructured NGS Data

NGS data is not just voluminous, and therefore expensive to store; it is also hard to manage. Although it all represents genomic sequence, several factors prevent end users or IT departments from treating it as a single entity.
  • Instruments: Different instruments write data at varying rates and sizes. They also can have different network connectivity and attached Windows or Linux workstations.
  • Workflows: Within any given organization, you’ll find a large number of workflows associated with different research areas. For example, a cancer panel pipeline will differ from an RNA-Seq gene expression pipeline. To further complicate this, many researchers extensively customize their workflows so even analysis from the same type of experiment may have different output file types and amounts.
  • Researchers: Different researchers at the same organization and even the same department often have different ways of organizing their data. In most cases, there is no enforced data management policy. This results in multiple unorganized NGS data directories within organizations.
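Given this heterogeneity, a first step for an IT department is often simply to measure what is on a share. A hypothetical audit sketch (the directory layout and extensions here are illustrative, not from any particular organization):

```python
import os
from collections import defaultdict

def inventory(root):
    """Summarize total bytes per file extension under a directory tree.

    Useful for spotting which file types (e.g. .fastq, .bam, .vcf)
    dominate a share before defining any tiering or retention policy.
    """
    totals = defaultdict(int)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext] += os.path.getsize(os.path.join(dirpath, name))
    return dict(totals)
```

Running this against an NGS share typically shows a handful of raw-data extensions accounting for nearly all the capacity, which is exactly the data that is a candidate for a lower tier.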

What Does an NGS Workflow Look Like?

While there is considerable variability in NGS workflows, most have three main steps:

  • Raw data acquisition
    • FASTQ files are generated by sequencers such as the Illumina NovaSeq. These raw data files are typically written to a local disk or a local NAS shared by several devices in a single lab or core facility. It is essential to back up FASTQ files at this stage, in order to preserve the experiment in the event of hardware failure or user error.
  • Alignment
    • This step aligns the experimental sequence data (sample FASTQ files) with a known reference genome, such as the human HG38 genome, using alignment software (e.g., Bowtie or BWA). FASTQ files and the alignment files derived from them are large, so this step demands substantial high-performance storage.
  • Variant calling
    • Determination and discovery of genomic variants is performed with analysis software (e.g., GATK). The files resulting from the alignment step are analyzed for differences from the reference genome. These differences inform research in areas including cancer, hereditary disease, and drug response.
    • Once the final results have been obtained, the raw data files can be archived. It makes sense to use a tool that will let you keep the small results files local, for rapid access, while automatically pushing the larger raw files to a secondary- or cold-storage tier, freeing up expensive primary-tier storage for the next batch of raw data. 
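The three steps above are commonly stitched together from standard command-line tools. As a hedged sketch, the following only constructs the command lines for one sample rather than running them; it assumes bwa, samtools, and GATK are installed, the sample and reference file names are placeholders, and real pipelines add read groups, QC, and other steps:

```python
# Sketch of the command lines for a common alignment + variant-calling
# path. File names are hypothetical; real pipelines are more elaborate.

def build_pipeline(sample, reference="HG38.fa"):
    """Return the shell commands for one sample, in execution order."""
    bam = f"{sample}.sorted.bam"
    return [
        # Alignment: map paired-end FASTQ files to the reference,
        # sorting the output into a BAM file
        f"bwa mem {reference} {sample}_R1.fastq.gz {sample}_R2.fastq.gz "
        f"| samtools sort -o {bam} -",
        # Index the sorted BAM for random access
        f"samtools index {bam}",
        # Variant calling against the same reference
        f"gatk HaplotypeCaller -R {reference} -I {bam} -O {sample}.vcf.gz",
    ]

for cmd in build_pipeline("sample01"):
    print(cmd)
```

Note the size asymmetry this creates: the FASTQ and BAM inputs are tens to hundreds of gigabytes per sample, while the final VCF results are comparatively tiny, which is what makes the archive-the-raw-data strategy above so effective.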

Simplifying Unstructured NGS Data Tiering

File shares used for NGS analysis are filled with unstructured data of varying sizes. Even within research groups, there are rarely data retention and management policies in place. This can cause issues for IT departments, since there may be terabytes, and often petabytes, of NGS data sitting on the first tier of storage that has already been analyzed but must still be retained.

Life sciences organizations would benefit from an intelligent system that automates the movement of large post analysis files to lower cost storage, freeing up expensive high performance tiers for current analysis needs.
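The policy described above can be sketched in a few lines. This is purely an illustration of the idea of size- and age-based tiering (the thresholds and paths are hypothetical, and it is not how any particular product implements it):

```python
import os
import shutil
import time

def tier_down(primary, archive, min_bytes=1_000_000_000, min_age_days=30):
    """Move files larger than min_bytes and untouched for min_age_days
    from the primary tier to the archive tier, preserving layout.

    Returns the relative paths of the files that were moved.
    """
    cutoff = time.time() - min_age_days * 86400
    moved = []
    for dirpath, _dirs, files in os.walk(primary):
        for name in files:
            src = os.path.join(dirpath, name)
            st = os.stat(src)
            # Only large, cold files are candidates; small result
            # files stay on the fast tier for rapid access
            if st.st_size >= min_bytes and st.st_mtime < cutoff:
                rel = os.path.relpath(src, primary)
                dst = os.path.join(archive, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                moved.append(rel)
    return moved
```

An intelligent tiering product does considerably more than this (cataloging, verification, recall, scale-out), but the core policy decision of which files move, and which stay, reduces to criteria like these.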

How Igneous DataProtect Can Help

Igneous DataProtect is an efficient, simplified, and automated backup, archive, and recovery tool. It is not file or block storage, but a software service for addressing unstructured data management at scale across your cloud and network attached storage (NAS) systems. Igneous DataProtect provides an automated way for large post-analysis files to be moved to lower cost storage, freeing up expensive high performance tiers for current analysis needs.

Learn more about Igneous DataProtect by downloading the datasheet.


Related Content

Accelerating Image Analysis and Cancer Diagnosis with AIRI from Pure Storage and Igneous

November 28, 2018

Artificial Intelligence (AI) has various applications today, from self-driving vehicles to optimizing workflows in manufacturing operations to detecting malware on the internet. Deep learning is a form of AI where multi-layer neural networks are utilized to transform input data into progressively more defined and useful outputs. Deep learning differs from machine learning (ML) in that ML focuses on the development of task-specific algorithms that can be applied to specific problems, while deep learning focuses on extracting information at multiple levels.


Archive First, Backup Should Be Boring, and Other Insights: A Conversation with Life Sciences Technologist Chris Dwan

April 24, 2018

Chris Dwan is a leading consultant and technologist specializing in scientific computing and data architecture for the life sciences. Previously, he directed the research computing team at the Broad Institute, and was the first technologist at the New York Genome Center.

Chris joined us for a conversation on data management challenges and trends in the life sciences. Read on to discover Chris’ insights from over a decade of experience in life sciences IT!

