
Why is Data Growth a Big Problem for Science?

by Catherine Chiang – October 17, 2017

In recent years, developments in scientific research have enabled scientists to collect and generate far more data than ever before, but the scientific industry’s ability to store, protect, and manage this data has not kept pace.

Too Much Data, Too Quickly

Rapid breakthroughs in scientific technology have enabled the generation of huge quantities of data at far faster rates and lower costs than just a decade ago. Meanwhile, the field’s ability to store and manage this data still lags behind.

A prominent example is DNA sequencing. The first time the human genome was sequenced, it took 13 years and cost between $50 million and $1 billion; today, sequencing a human genome takes one to two days and costs less than $1,500.

A 2015 PLoS Biology study predicts that by 2025, between 100 million and 2 billion human genomes could have been sequenced, with data storage demands of 2 to 40 exabytes. That’s more than the projected storage needs of YouTube and Twitter.
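
To get a feel for where those exabyte figures come from, here is a rough back-of-the-envelope sketch in Python. The ~20 GB of stored data per genome is an assumption chosen purely for illustration, not a number taken from the study; real per-genome footprints vary widely with sequencing coverage and compression.

    # Back-of-the-envelope estimate of genomic storage demand by 2025.
    # The per-genome figure below is an illustrative assumption, not a
    # number taken from the PLoS Biology study.
    LOW_GENOMES = 100_000_000        # lower bound: 100 million genomes
    HIGH_GENOMES = 2_000_000_000     # upper bound: 2 billion genomes
    BYTES_PER_GENOME = 20e9          # assumed ~20 GB stored per genome
    EXABYTE = 1e18                   # bytes in an exabyte

    low_eb = LOW_GENOMES * BYTES_PER_GENOME / EXABYTE
    high_eb = HIGH_GENOMES * BYTES_PER_GENOME / EXABYTE
    print(f"Projected storage: {low_eb:.0f} to {high_eb:.0f} exabytes")
    # Projected storage: 2 to 40 exabytes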

The falling cost of DNA sequencing means that more researchers and labs can afford to sequence DNA, resulting in more data being generated from disparate sources and stored in silos. Without a way to aggregate and analyze all of this data, scientists cannot take full advantage of today’s wealth of genomic information.

In recent years, advancements in technologies such as electron microscopy and flow cytometry have resulted in similar explosions in data growth.

Challenges of Scientific Data Management

Effective data management enables and accelerates research, but the scientific community struggles to store, protect, and manage its rapidly growing data.

Long-term preservation of data would enable scientists to access the results of previous studies and conduct ongoing, robust research, but data loss is prevalent due to the challenges of scientific data management. Often, data is not preserved after a study concludes, is too difficult to find, or is too difficult to access because it is stored on obsolete media.

A 2013 study found that the odds of obtaining a dataset decline by 17% each year, and that more than 80% of datasets over 20 years old are no longer available. This prevents scientists from tapping the potential gold mine of information in past studies.
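
A quick sketch shows how fast a 17% annual decline compounds. The 80% starting retrieval chance below is a hypothetical figure, used only to illustrate the trend the study describes.

    # Illustrative decay of dataset availability, assuming the odds of
    # obtaining the data shrink by 17% each year (per the 2013 study).
    # The starting odds are hypothetical, not taken from the study.
    def odds_to_probability(odds: float) -> float:
        """Convert odds (p / (1 - p)) back to a probability p."""
        return odds / (1 + odds)

    starting_odds = 4.0      # assumption: ~80% chance of retrieval when new
    annual_decline = 0.17    # 17% drop in the odds each year

    for years in (5, 10, 20):
        odds = starting_odds * (1 - annual_decline) ** years
        print(f"After {years} years: ~{odds_to_probability(odds):.0%} retrievable")
    # After 5 years: ~61% retrievable
    # After 10 years: ~38% retrievable
    # After 20 years: ~9% retrievable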

“My team and I realized that the additional cost of storing data represents about 1/1,000 of the global budget. Thus, publication of new articles based on use of archives in the ensuing 5 years represents a profit of 10%. We basically have research that costs almost nothing. Without any data storage strategy, we completely miss out on potential discoveries and low-cost research. Once data has been properly stored, however, its cost is practically zero,” said Cristinel Diaconu, research director at the Centre National de la Recherche Scientifique.

Scientists need not only more data storage, but also more computing power and effective ways to move their data to where it’s needed. Unfortunately, the costs of processing power and storage can be prohibitively high.

“The cost of computing is threatening to become a limiting factor in biological research,” said Folker Meyer, a computational biologist at Argonne National Laboratory in Illinois, who estimates that computing costs ten times more than research. “That’s a complete reversal of what it used to be.”

In addition, it’s essential that data management platforms preserve rich metadata, making it easier for scientists to retrieve the data they need from a growing sea of scientific information.
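
As a simple illustration of what rich metadata can mean in practice, the hypothetical record below tags a sequencing run with searchable fields. The field names and values are invented for this sketch, not taken from any particular platform.

    # A hypothetical metadata record for a sequencing run. Searching on
    # fields like these lets researchers find a dataset without opening
    # terabytes of raw files.
    run_metadata = {
        "dataset_id": "run-2017-0042",           # stable identifier
        "instrument": "Illumina NovaSeq 6000",   # how the data was generated
        "sample": "human whole genome, 30x",     # what the data describes
        "created": "2017-10-17",                 # acquisition date
        "owner": "genomics-lab@example.org",     # responsible group
        "retention_policy": "preserve 10 years", # how long to keep it
        "related_publication": None,             # linked once published
    }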

Scientists need cost-effective, streamlined data storage and management solutions built to handle petabytes of data and beyond.

Curious if Igneous can help you manage your scientific data? Talk to us!

