Bielefeld Institute for Bioinformatics Infrastructure

Cloud Computing

High-performance Computing for Life Sciences

Alexander Sczyrba; Bielefeld University, Bielefeld

The increasingly widespread use of high-throughput technologies in the life sciences, such as (meta-)genomics studies or imaging applications, generates an exponentially increasing amount of experimental data. The number of specialized databases distributed around the world is also growing rapidly. As a result, storing, integrating and processing this data becomes the bottleneck of analysis workflows, which require infrastructure for data storage as well as services for data processing, analysis and, where necessary, special access approval.

According to the definition of the National Institute of Standards and Technology (NIST), “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Cloud computing plays an important role in many modern bioinformatics analysis workflows, from data management and processing to data integration and analysis, including data exploration and visualization. It provides massively scalable computing and storage infrastructure and can therefore be a key technology for overcoming the problems described above.

Two exemplary BIBI projects are presented on the following pages:

BiBiGrid
Scaling from single virtual machines to high-performance clusters within minutes
EU SIMBA project
Analyzing large scale metagenomics data on the de.NBI Cloud

Cloud computing (bioinformatics) services are often divided into the following areas:

Data as a Service (DaaS):
provides data storage in a dynamic virtual environment hosted in the cloud, so that the data can be accessed from a variety of Internet-connected devices. One example is the National Center for Biotechnology Information (NCBI), which provides Sequence Read Archive (SRA) data on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) clouds. All publicly available, unassembled read data and authorized-access human data are available for access and computation through these cloud providers.
Software as a Service (SaaS):
offers cloud-based tools for performing various bioinformatics tasks, e.g. sequence processing, gene expression analysis, or image analysis.
Platform as a Service (PaaS):
In contrast to SaaS solutions, PaaS solutions let users deploy their own bioinformatics applications while keeping complete control over their instances and the associated data.
Infrastructure as a Service (IaaS):
This service model offers a compute infrastructure comprising servers (usually virtualized) with specific computing capacities and/or storage. The user controls all provided storage resources, operating systems and bioinformatics applications. The German Network for Bioinformatics Infrastructure (de.NBI) Cloud provides such a service free of charge for life scientists in Germany.
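
The division of responsibilities described above can be summarized in a small lookup table. The following Python sketch is purely illustrative: the class, field and function names are assumptions made for this example, and the entries only restate what the descriptions above say, not any provider's formal definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceModel:
    """One cloud service model as described in the text (hypothetical structure)."""
    abbreviation: str
    name: str
    provider_offers: str
    # Components the text explicitly places under user control.
    # An empty tuple means the text names none for this model.
    user_controls: tuple = ()


SERVICE_MODELS = {
    "DaaS": ServiceModel("DaaS", "Data as a Service",
                         "data storage hosted in the cloud"),
    "SaaS": ServiceModel("SaaS", "Software as a Service",
                         "cloud-based tools for bioinformatics tasks"),
    "PaaS": ServiceModel("PaaS", "Platform as a Service",
                         "a platform for user-provided applications",
                         ("applications", "instances", "data")),
    "IaaS": ServiceModel("IaaS", "Infrastructure as a Service",
                         "servers (usually virtualized) and/or storage",
                         ("storage resources", "operating systems", "applications")),
}


def models_where_user_controls(component: str) -> list:
    """Return the abbreviations of all models in which the user controls `component`."""
    return [abbr for abbr, model in SERVICE_MODELS.items()
            if component in model.user_controls]


if __name__ == "__main__":
    for model in SERVICE_MODELS.values():
        controlled = ", ".join(model.user_controls) or "not specified in the text"
        print(f"{model.abbreviation}: provider offers {model.provider_offers}; "
              f"user controls: {controlled}")
```

For example, `models_where_user_controls("operating systems")` lists only IaaS, which matches the statement that the operating system is under user control only in that model.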

Growth of the Sequence Read Archive (SRA) database hosted at the National Center for Biotechnology Information (NCBI), USA. The SRA, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high-throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. [1]

Development of costs for sequencing one megabase of genomic information over the last 20 years [2]

Virtual environments such as virtual machines (VMs), Docker or Singularity provide maximal flexibility to users. In contrast to classical high-performance computing environments, they are independent of the installed operating system, software stacks and libraries, so special requirements can be met easily and without side effects. Virtual environments also make analysis workflows easy to exchange, and publishing these environments makes research reproducible.
The cloud computing department of BIBI develops and provides environments and workflows for bioinformatics analyses, mainly in the field of (meta-)genomics. A mirror of SRA’s metagenomics data sets, hosted at the de.NBI Cloud site in Bielefeld, allows large-scale analyses that integrate publicly available data. Examples of such projects are described in the following sections.

References

[1] National Center for Biotechnology Information (NCBI), Sequence Read Archive (SRA). https://www.ncbi.nlm.nih.gov/sra/
[2] The Cost of Sequencing a Human Genome. http://genome.gov/sequencingcosts