Working in the Clouds: The Virtual Lab
By Carina Storrs, NYAS Contributor
For scientists of all stripes, research networks are expanding. The number of authors listed on articles published in peer-reviewed journals, the gold standard for academic achievement, climbed steadily in the second half of the 20th century, and shows no sign of slowing down. A physics paper published in Physical Review Letters in 2015 set the record with more than 5,000 authors!
Although many of the reasons for scientific collaborations are as old as science itself, cloud-based data sharing and collaboration platforms allow researchers to access vast data sets and connect with experts in different disciplines — historically two of the major motivations to collaborate — with unprecedented ease.
Last spring, St. Jude Children’s Research Hospital in Memphis, Tenn., launched St. Jude Cloud, where researchers can openly view and analyze sequencing data from nearly 10,000 whole genomes, as well as thousands of whole-exome and RNA-seq datasets, from pediatric cancer patients. For researchers studying pediatric cancer, tapping into such large datasets is not just informative, but necessary. “Without data sharing, everyone has a small cohort to analyze and you really cannot assess whether your finding is significant,” said Jinghui Zhang, PhD, co-leader of the St. Jude Cloud project and chair of the St. Jude Department of Computational Biology.
Although the St. Jude Cloud team had concerns about the data security of cloud-based genomic computing, recent improvements made them comfortable with moving to the cloud, Zhang said. They are partnering with DNAnexus, a data management and bioinformatics company that developed the platform’s software. The U.S. Food and Drug Administration also uses DNAnexus software, which complies with the data protection standards of the Health Insurance Portability and Accountability Act (HIPAA) and a number of regulatory bodies, for precisionFDA, a cloud-based portal where researchers exchange data and tools for processing next-generation sequencing data. “We found it reassuring that a government organization is also confident about the security features of the software,” Zhang said.
Previously, researchers had to download sequencing data to local computers to work with them, which would be nearly impossible with massive datasets like those on the St. Jude Cloud. Scott Newman, PhD, recalls trying years ago to query sequencing data from only about 100 patients with high-grade glioma to see if any harbored a mutation he had found in a single patient. “It was just so frustrating because I had one question about one gene and it took months to download the data,” said Newman, who is currently a bioinformatics scientist with the St. Jude Cloud project. All the while, his principal investigator was asking for the result.
Now researchers who want to know more about a mutation can use the St. Jude Cloud to visualize pre-analyzed sequence data in graphic form. They can also run their own analyses using tools available on the platform, exploring either sequencing data already on the cloud or data that they upload. In a way, this capability may reduce collaborations with computational scientists because “everyone can read a picture” and researchers no longer have to figure out how to install tools on their own computers, Zhang said.
The tools are also opening doors to different types of collaborations. For example, St. Jude Cloud recently added a collaboration interface to one of its tools, PeCan Pie. The interface lets clinicians or geneticists share intriguing new mutations with other researchers, which facilitates the difficult task of determining whether a mutation found in a patient’s genome is responsible for the disease. Scientists who have been studying the function of the gene in transgenic mice or other model systems can add their data and “really bring a lot of power to the classification,” Zhang said.
In a similar way, VDJServer, a cloud-based portal for analyzing immune cell receptor gene sequencing data, has a special sharing feature for users to send such data amongst themselves. “It is an intermediary between working independently and uploading your sequencing data into the publicly available section,” said Lindsay Cowell, PhD, associate professor of biomedical informatics at the University of Texas Southwestern Medical Center, Dallas, Texas.
Cowell and her colleagues developed VDJServer to streamline the sharing of B- and T-cell receptor sequencing data, and also to provide a platform for researchers without computational expertise to handle these data sets, which have rapidly grown bigger, more complicated and more abundant in the era of high-throughput sequencing.
Nevertheless, navigating the tools on VDJServer, from assessing the quality of the sequence data to comparing the processed data with other samples to answer an experimental question, still requires some computational know-how and a basic understanding of immunology. Clinicians who want to analyze samples on VDJServer, such as samples from patients with cancer or an autoimmune disease or from a vaccine study, commonly strike up collaborations with Cowell’s group or another bioinformatics research group.
In one case, research clinicians at Moffitt Cancer Center in Tampa, Fla., collaborated with Cowell’s team to use VDJServer to identify B-cell receptor genes encoding an antibody that binds a protein highly expressed on the surface of acute myeloid leukemia (AML) cells. They then engineered CAR T-cells expressing these receptor genes and hope to test them in a clinical trial of patients with AML this year.
The strength of cloud-based platforms is not limited to immunology and cancer, or even the biomedical sciences. “I definitely think the sharing and collaboration, really all the benefits of VDJServer, apply across the board to any of these high throughput complicated data types,” Cowell said. A major strength of VDJServer, according to Cowell, is that it eases reproducibility. Each time a researcher runs an analysis, their method — the file, software version and parameters they use — is automatically saved in the publicly available section and can be viewed by others who may be interested in running the same analysis, or applying the same method to their own data set.
While cloud-based platforms are providing researchers with an entirely new forum for finding collaborators, other services have revolutionized some of the tried-and-true methods, which include searching the literature for relevant academic articles and meeting other researchers at conferences. SSRN (Social Science Research Network), now a division of Elsevier, was launched two decades ago with the express purpose of connecting scholars at different institutions. SSRN sets itself apart from peer-reviewed journals and their highly specialized content by hosting free content on its website from a wide range of academic disciplines, such as engineering, economics and cognitive science. Researchers submit material, often unpublished, ranging from a new database to conference slides to white papers. Experts working for SSRN, typically PhD students and junior professors, evaluate each piece and determine the categories it best fits into, an iterative process that increases its chances of landing in multiple related categories.
“Whereas I as an accountant would submit to, and read, an accounting journal, SSRN ensures I get exposed to other areas that make sense,” said Gregory Gordon, the managing director of SSRN. Researchers find unexpected yet relevant content in their categories of interest, which they can peruse on the website or view in email alerts. The SSRN team found that one of the most effective ways to connect researchers is simply to post their email addresses on their SSRN author pages.
Another virtual avenue for connecting researchers is the web-based seminar. The pioneers in this area are a phylogenetics-focused YouTube channel called Phyloseminar and a microbiology series on Google+ Hangouts called MicroSeminar. These series help create environments that foster casual interactions in which researchers can learn more about their field, said Erick Matsen, an associate member in computational biology at Fred Hutchinson Cancer Research Center, who created and runs Phyloseminar. “I’ve had people come up to me at conferences and say, ‘I am here because I saw a really neat talk on Phyloseminar.’ A lot of these people are from other countries where maybe there’s not as much phylogenetics and wouldn’t have heard about it otherwise,” Matsen said.
As for managing shared projects once collaborations are established, Matsen relies heavily on Slack, a cloud-based collaboration platform that has quickly become popular among scientists and others in communication-intensive fields. Matsen’s lab has a separate Slack channel for each of its research projects, where members exchange messages and share data and feedback in a much more user-friendly way than email. Matsen can easily share the appropriate channels with his collaborators while keeping the others private. With Slack, he said, “If they are in the next room or across the world, it’s just the same.”