Last Friday, we talked about DataDiscover and went over a few positive outcomes that our customers have achieved by making more fact-based decisions about their data. For our customers in the life sciences, data visibility, or the lack of it, has an especially large effect on their success. Their concerns include everything mentioned in that article, but since the life sciences generate such massive volumes of data per employee (i.e., per researcher), these customers have some additional concerns.
The first of these concerns: scientists generally don’t keep track of what data is added to a NAS system, what’s compressed, or what has been used recently. As long as they can read and write, they aren’t going to focus on managing their storage, so it’s really up to the IT department to keep things under control. IT, then, needs up-to-date answers to these questions so that the organization can keep creating data.
The second concern specific to the life sciences is that in research labs, the established career path of rotations, internships, and postdoctoral fellowships results in high turnover of researchers. Since few organizations have a defined data management policy, each researcher’s data can end up fragmented across different primary storage systems.
Data left behind by former researchers generally falls into one of two categories:
- No immediate research need: OK to archive to secondary storage.
- Needed immediately for continued research efforts: keep on primary storage.
Unfortunately, when a researcher leaves an organization, finding their data can be very difficult. With DataDiscover, you can find these datasets by looking at the directory structures and the age of the data, then archive entire shares or directories that aren’t relevant to anyone’s current projects.
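To make the "directory structure plus data age" idea concrete, here is a minimal Python sketch that walks a mounted share and lists the top-level directories in which nothing has been modified recently. It is not DataDiscover's interface, just a generic filesystem walk; the function names, the `max_idle_days` threshold, and the assumption that each project or researcher has its own top-level directory are all hypothetical.

```python
import os
import time
from pathlib import Path
from typing import Optional


def newest_mtime(directory: Path) -> float:
    """Return the most recent modification time of any file under directory.

    Returns 0.0 for a directory tree that contains no readable files.
    """
    newest = 0.0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                newest = max(newest, os.stat(os.path.join(root, name)).st_mtime)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return newest


def archive_candidates(share: Path, max_idle_days: int,
                       now: Optional[float] = None) -> list[Path]:
    """List top-level directories whose newest file is older than max_idle_days.

    Empty directories are reported too, because their newest mtime is 0.0.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day
    return sorted(
        child for child in share.iterdir()
        if child.is_dir() and newest_mtime(child) < cutoff
    )
```

Calling `archive_candidates(Path("/mnt/lab_share"), 365)` (a hypothetical mount point) would return the directories untouched for a year. Those are only candidates: someone still has to confirm that nobody's current project depends on them before anything is archived.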
The third aspect of data visibility that is specific to the life sciences is the opportunity to get all of your data into one system. In many other industries, deleting or moving cold data off of primary storage is a cost-saving measure, designed to avoid buying more primary storage capacity, but in the life sciences, it could mean the difference between having a single, centralized data storage system and, well, not. Today, many world-class life sciences research organizations are keeping all the data they create, “just in case.” Obviously, this is difficult: since keeping it all on primary storage is out of the question due to its sheer volume, the data ends up on servers under tables, in closets, and on drives in conference rooms. Some of it gets used in workflows, some of it doesn’t, and it’s impossible for teams to tell the difference between recent backups and data that is being kept far past its useful lifespan.
With DataDiscover, as teams begin to understand which of these mountains of data are worth keeping (i.e., data that is current, in use, and high-value), they can be more deliberate about how they keep it. Ultimately, as they set different retention policies for different types of files and proactively put different workloads in different places, the mess of random hard drives gets replaced by a well-organized directory structure on a NAS device.
Finally, answering the question “who is generating the most data?” can be essential for allocating costs among research groups that share a lab, since there is often great variation between teams. Even within a single research group, it is often useful to understand the relationship between data generation or storage usage and scientific output for a given workflow.
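As a rough illustration of that cost-allocation question, the following Python sketch totals the bytes stored under each top-level directory of a share and converts the totals into fractional shares. It assumes, hypothetically, that each research group keeps its data under its own top-level directory; the function names are made up for this example, and it measures logical file size rather than what DataDiscover itself reports.

```python
import os
from pathlib import Path


def bytes_per_group(share: Path) -> dict[str, int]:
    """Sum logical file sizes under each top-level directory of share."""
    totals: dict[str, int] = {}
    for child in share.iterdir():
        if not child.is_dir():
            continue
        total = 0
        for root, _dirs, files in os.walk(child):
            for name in files:
                try:
                    # lstat does not follow symlinks, so linked data
                    # is not counted against the group holding the link
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    continue  # file vanished or is unreadable; skip it
        totals[child.name] = total
    return totals


def cost_shares(totals: dict[str, int]) -> dict[str, float]:
    """Convert per-group byte totals into fractions of the overall total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: size / grand_total for name, size in totals.items()}
```

Multiplying each fraction by the storage budget gives one simple way to split costs. A lab might reasonably prefer a different basis, such as growth rate over the last quarter.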
Although data visibility is vital to all industries, the life sciences stand to gain more than most from a clear understanding of what data is on premises, where it is, and how old it is. If these problems sound familiar, don’t hesitate to get in touch for your Test Drive of DataDiscover, so you can see firsthand how useful this information can be.