Laboratory technologies are evolving rapidly. The latest instruments for life science research enable scientists to visualize highly detailed 3D models of protein molecules and assemble genomes in a fraction of the time they could in the past. These instruments are helping to accelerate the journey toward personalized medicine and deepen the exploration of new treatments for cancer and infectious diseases.
New lab instruments are also producing tremendous amounts of data. For example, in the field of genomic research, next-generation sequencing (NGS) instruments can output 1 TB of data per hour.
Cryo-electron microscopy (cryo-EM) is another burgeoning field that can produce massive data volumes. With cryo-EM, researchers study protein molecules at cryogenic temperatures, taking numerous images of each molecule and constructing 3D models. Researchers then use those models to study cancer at a molecular level and develop strategies for combating emerging viruses, such as Zika. The cryo-EM workflow can generate 5 TB of data in 24 hours.
Life science teams often require substantial high-performance computing (HPC) resources to process and analyze data created by new instruments and techniques. In fact, at some research institutions, life science departments consume more compute and storage resources than the traditional data-driven sciences of computational chemistry, astrophysics and climate research.
Now, many organizations need to bolster their storage environment. They need solutions that allow them to ingest a large amount of raw data from instruments, present data for analyses, preserve all data for the long term and ensure that data remains readily accessible by collaborative research teams. These IT solutions should not be overly complicated to manage; scientists should remain focused on their scientific research, not IT.
Identifying emerging IT challenges
Growing data volumes and disconnected systems threaten to slow research. The development of new scientific instruments and the emergence of new techniques are creating important research opportunities for life science organizations. To take full advantage of those opportunities, however, organizations must address several key IT challenges:
Today’s scientific instruments are generating exponentially more data than instruments from just five years ago. They are part of workflows that incorporate a wide variety of file types, from text and binary files to databases and directories, in sizes ranging from a few kilobytes to hundreds of gigabytes.
To accommodate growing data volumes and new file types, organizations must scale storage capacity. But simply expanding existing storage environments is not always the most expedient or cost-effective option. In recent years, many institutions leveraged cloud storage for long-term data storage.
The cloud model had the benefit of low capital expenditures and was well suited for moderate data volumes that rarely needed to be retrieved.
However, as data volumes have grown, many organizations must now rethink their use of the cloud. They need ways to avoid the substantial costs of retrieving increasingly large data sets from the cloud, and they must support research teams that require stronger performance than cloud services can deliver.
Due to the cyclical nature of research grants and the resulting fluctuation in budgetary planning for IT, many organizations have expanded their technology resources without an overarching, standardized methodology. As a result, they are now faced with multiple data silos. These distinct, unconnected environments make it difficult for researchers to easily access the data they need, when they need it.
Rapid data growth and the development of multiple data silos can contribute to spiraling IT costs. Without a single, coherent storage strategy, many organizations buy more than they need. They might expand storage for multiple siloed environments, leaving capacity underutilized in each. Or they might spend too much on production storage because they lack an integrated archiving solution. Managing these complex, siloed environments can be costly and pull researchers away from their primary tasks.
Defining requirements for life science storage
Solutions must offer cost-effective scalability and robust performance while simplifying management. What does your organization need in a storage solution? The right one will provide scalability, performance, and flexibility while also reducing management complexity.
Your storage solution must have the scalability to accommodate rapidly rising data volumes generated by the latest scientific instruments. Research teams need to store both data produced through scientific analysis and raw data, so they have the option to run additional analyses in the future. You should be able to expand the capacity of your storage environment to petabytes of data without ripping and replacing your solution or undergoing major augmentation projects.
An integrated archive solution is critical for preserving and protecting data. In many cases, researchers need to retain data for several years. Over that time, data must stay both secure and unaltered to conform to regulations that mandate the stored copy is identical to the original. Lost or corrupted data could be catastrophic: in clinical research, there might be no way to recapture historical patient samples. Meanwhile, archived files must be quickly and easily accessible. Researchers can't wait hours or days to retrieve data sets.
The archive must also be cost-effective. In the past, organizations used the cloud as a cost-effective option for archiving. But retrieving data from the cloud can be time-consuming and costly: it can take up to a day to transfer 10 TB at 1 Gbps. For many life science organizations, on-premises tape-based archives are a better solution. Compared with the cloud, tape can offer a more cost-effective option for storing and retrieving massive data volumes. Keeping an offline copy of data can also protect organizations from various cybersecurity threats.
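The retrieval figure above follows from simple arithmetic on data size and link speed. As a rough sketch (the function name and the optional efficiency factor are illustrative, not from the text), the estimate can be reproduced like this:

```python
def transfer_time_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate bulk-transfer time over a network link.

    data_tb: data set size in terabytes (decimal, 1 TB = 1e12 bytes)
    link_gbps: link speed in gigabits per second
    efficiency: fraction of line rate actually achieved (protocol
                overhead, congestion); 1.0 assumes an ideal link
    """
    bits = data_tb * 1e12 * 8                      # total bits to move
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# The 10 TB example from the text, at an ideal 1 Gbps:
print(round(transfer_time_hours(10, 1), 1))        # → 22.2 hours
```

At line rate this is already most of a day; with realistic protocol overhead and shared bandwidth, the effective time stretches further, which is the basis of the "up to a day" claim.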
Your storage solution requires robust performance for multiple phases of the life science workflow. For example, you need sufficient sequential performance for ingesting large amounts of data from scientific instruments. In addition, you need strong random I/O performance to support the sophisticated analyses you conduct using HPC or supercomputing resources. The storage solution must be able to complete a significant number of reads and writes per second, especially when you are using large clusters with numerous cores to process data.
Whether you have a small team of researchers in a single lab or numerous researchers spread across the globe, your data storage solution must support seamless collaboration. It must help eliminate data silos, enabling researchers to easily access data from a single, shared pool. The storage solution must also have the flexibility to accommodate multiple types of client systems, from Linux and UNIX machines to PCs and Macs, so researchers can tap into the data they need without having to change systems or alter workflows.
Meanwhile, the storage platform should accommodate different types of connectivity. You should be able to continue utilizing your existing investments and have the flexibility to support new systems, regardless of whether those systems are connected by Ethernet or Fibre Channel.
Few life science organizations have large, dedicated IT teams. In many cases, researchers with technical knowledge are responsible for managing IT systems. As your organization expands its storage environment to accommodate growing data volumes and new techniques, you need ways to manage that environment simply and in a holistic way. The less time and resources your researchers spend on IT administration, the more they can devote to scientific work and research projects.