Scalable cyberinfrastructure for life science research
Over the last decade, the life sciences have benefited tremendously from new, massively parallel, and highly quantitative technologies. These technologies have enabled rapid data acquisition at ever higher resolution and throughput across many data modalities. Transformational advances in information technology, including cloud and high-performance computing, large-scale data management systems, and high-bandwidth networks, have complemented and fueled this growth in data acquisition. Managing the lifecycle of these datasets, from acquisition and analysis to publication and archiving, often requires interdisciplinary collaboration among geographically distributed teams of experts. A common requirement for these teams is access to integrated computational platforms that are flexible and scalable. These platforms must provide appropriate hardware and software resources that support diverse data types, computational scalability needs, and the usage patterns of diverse research communities.
This talk will focus on my work on two platforms that address scalable data management and analysis in life science research. CyVerse, a National Science Foundation (NSF) funded cyberinfrastructure project launched in 2008 (as the iPlant Collaborative), provides computational resources for managing data-driven research, collaborations, and discoveries. Its computational infrastructure consists of several modular components that connect data, tools, storage, computational resources, and knowledge resources. By focusing on building software and systems that provide access to these components, CyVerse enables third-party platforms to leverage its underlying resources to scale their services and interoperate more easily with one another. CoGe, funded by NSF, USDA, and GBMF, is one such platform that is Powered by CyVerse, and it provides data management and comparative analysis for over 21,000 genomes from 17,000 organisms. CoGe also permits researchers to add new genomes, keep them private, share them with collaborators, and make them fully public, and it offers a variety of pipelines to seamlessly integrate all types of functional and diversity genomics data for downstream analysis and visualization. In addition, each analysis run through CoGe is assigned a unique URL that permits its exact regeneration, ensuring long-term reproducibility, and all of CoGe’s data and analyses are available to programmers through its REST APIs.
These platforms are part of a larger ecosystem of interoperable computational resources that enables researchers to seamlessly move their data and analyses across platforms and permits new platforms to be created using backend services. This allows the global community of computational biologists, bioinformaticists, and life science programmers to focus on solving new problems rather than re-solving old ones.