The New York Academy of Sciences
RECOMB Regulatory Genomics/Systems Biology/DREAM Conference 2009
Posted March 30, 2010
The 2009 RECOMB Regulatory Genomics/Systems Biology/DREAM conference took place on December 2-6, 2009 at the Broad Institute in Cambridge, Massachusetts. The meeting combined the 6th RECOMB Satellite Conference on Regulatory Genomics, co-chaired by Manolis Kellis and Ziv Bar-Joseph, the 5th RECOMB Satellite Conference on Systems Biology, chaired by Andrea Califano, and the 4th DREAM Conference, chaired by Gustavo Stolovitzky. The first two conferences had previously spun off from the RECOMB conference on Research in Computational Molecular Biology, while the DREAM conference, Dialogue for Reverse Engineering Assessments and Methods, had arisen with a more focused goal of evaluating systems-biology tools for building networks and assessing the limitations of quantitative prediction in biology.
This eBriefing contains reports and multimedia focusing on the keynote talks, which addressed a wide-ranging set of topics related to stem cell networks, development networks, new approaches to genetics in heterogeneous populations, protein interactions, the evolution of networks, and new directions in research in regulatory genomics. It also reports on the results of the DREAM Challenges, a set of problems that invite teams of researchers to submit "predictions" that are then compared with highly trusted "gold standards."
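As a rough illustration of what comparing "predictions" with a "gold standard" can involve in a network-inference challenge, the sketch below scores a ranked list of predicted edges against a set of trusted edges. The function name, the toy edges, and the choice of precision and recall at a cutoff are assumptions made for illustration; they are not the scoring scheme used by the DREAM organizers.

```python
def score_predictions(ranked_edges, gold_edges, top_k):
    """Return (precision, recall) of the top_k predicted edges.

    ranked_edges: predicted (source, target) pairs, most confident first.
    gold_edges:   trusted (source, target) pairs (the "gold standard").
    """
    gold = set(gold_edges)
    top = ranked_edges[:top_k]
    hits = sum(1 for edge in top if edge in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy network: three trusted edges, four ranked predictions.
gold = [("A", "B"), ("B", "C"), ("C", "D")]
ranked = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]

precision, recall = score_predictions(ranked, gold, top_k=2)
print(precision, recall)
```

With a cutoff of two, one of the two top-ranked edges is in the gold standard, so precision is one half and recall is one third; moving the cutoff down the ranked list trades one against the other.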
Presentations are available from:
Naama Barkai (Weizmann Institute of Science)
Mark Biggin (Lawrence Berkeley National Laboratory)
Walter Fontana (Harvard Medical School)
Nevan Krogan (University of California, San Francisco)
Ihor Lemischka (Mount Sinai School of Medicine)
Edward Marcotte (University of Texas at Austin)
Franziska Michor (Memorial Sloan-Kettering Cancer Center)
Garry Nolan (Stanford University)
John Reinitz (Stony Brook University, Chicago Center for Systems Biology)
Robert Waterston (University of Washington)
Kevin White (University of Chicago)
Michael Yaffe (Massachusetts Institute of Technology)
Richard Young (Massachusetts Institute of Technology)
Philip Kim (University of Toronto)
Daniel Marbach (Massachusetts Institute of Technology)
Julio Saez-Rodriguez (Harvard Medical School, Massachusetts Institute of Technology)
Robert Prill (IBM)
DREAM Challenges 1-3 Responses
- 00:01 1. Introduction; Flexibility of gene expression
- 03:20 2. Promoter structure
- 10:04 3. Nucleosome pattern and flexibility; Evolutionary implication
- 16:43 4. Genetic basis for expression; cis vs. trans effects; Low predictive power
- 25:39 5. Correlation with expression; Summary, acknowledgements, and conclusion
- 00:01 1. Introduction; Knowing the normal state; Single cell signaling analysis
- 06:34 2. Differentiation; Mapping altered signaling; Clustered AML signaling
- 10:33 3. Clinical progression and potentiation; Recounting disease history
- 15:55 4. Experimental design - RA; Predicting outcomes and flares
- 23:40 5. Measuring more "stuff"; Metasearch and scale-up; Summary, acknowledgements, and conclusion
- 00:01 1. Introduction; Toolkit for systems analysis
- 06:12 2. Breaking a network down by signals; The apoptotic response
- 12:10 3. The model and what it tells us; The dynamic range of cell processing
- 20:07 4. Linear distribution of dynamic range; MK2 activity; Biological importance
- 28:18 5. Applications; Conclusion and acknowledgement
Academy eBriefings on past DREAM events
For multimedia and meeting reports from past DREAM events, please see the following eBriefings:
Annals of the New York Academy of Sciences
Stolovitzky G, Kahlem P, Califano A. 2009. The Challenges of Systems Biology: Community Efforts to Harness Biological Complexity. Annals of the New York Academy of Sciences, Vol. 1158.
Stolovitzky G, Califano A. 2007. Reverse Engineering Biological Networks: Opportunities and Challenges in Computational Methods for Pathway Inference. Annals of the New York Academy of Sciences, Vol. 1115.
RECOMB Regulatory Genomics and Systems Biology 2009
The official Web site for this combined conference.
The DREAM Project
The DREAM home page includes more information about the project and the DREAM challenges.
Software used by Walter Fontana to explore rule-based modeling of complex biological systems.
Company that makes CyTOF, the mass-spectrometry based analog of flow cytometry being used by Garry Nolan.
The model organism ENCyclopedia Of DNA Elements (modENCODE)
This project aims to identify all of the sequence-based functional elements in the Caenorhabditis elegans and Drosophila melanogaster genomes.
Berkeley Drosophila Transcription Network Project
Discussed by Mark Biggin, this effort is working to decipher the transcriptional information contained in the extensive cis-acting DNA sequences that direct the patterns of gene expression that underlie animal development.
This algorithm for predicting substrates of kinases was a best performer in DREAM Challenge 1.
Gene Net Weaver
Used in generating in silico networks for DREAM Challenge 2.
The Web site of the Pawson lab, which includes a catalog of peptide recognition domains often found in proteins, like those used in DREAM Challenge 3.
Genomic Evolutionary Rate Profiling
Used by Kevin White to quantify evolutionarily conserved sequence elements.
Berlin Institute for Medical Systems Biology
Focuses on post-transcriptional gene regulation.
A database of quantitative expression of transcription factors at cellular resolution referred to by John Reinitz.
Tirosh I, Barkai N. 2008. Two strategies for gene regulation by promoter nucleosomes. Genome Res. 18: 1084-1091.
Tirosh I, Reikhav S, Levy AA, & Barkai N. 2009. A yeast hybrid provides insight into the evolution of gene expression regulation. Science 324: 659-662.
Tirosh I, Weinberger A, Carmi M, Barkai N. 2006. A genetic signature of interspecies variations in gene expression. Nat. Genet. 38: 830-834.
Fowlkes CC, Hendriks CL, Keränen SV, et al. 2008. A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell 133: 364-374.
Keränen SV, Fowlkes CC, Luengo Hendriks CL, et al. 2006. Three-dimensional morphology and gene expression in the Drosophila blastoderm at cellular resolution II: dynamics. Genome Biol. 7: R124.
Li XY, MacArthur S, Bourgon R, et al. 2008. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 6: e27.
Luengo Hendriks CL, Keränen SV, Fowlkes CC, et al. 2006. Three-dimensional morphology and gene expression in the Drosophila blastoderm at cellular resolution I: data acquisition pipeline. Genome Biol. 7: R123.
MacArthur S, Li XY, Li J, et al. 2009. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biol. 10: R80.
Feret J, Danos V, Krivine J, et al. 2009. Internal coarse-graining of molecular systems. Proc. Natl. Acad. Sci. 106: 6453-6458.
Danos V, Feret J, Fontana W, et al. 2008. Rule-based modeling, symmetries, refinements. Lecture Notes in Bioinformatics 5054: 103-122.
Danos V, Feret J, Fontana W, et al. 2007. Rule-based modeling of cellular signaling. Lecture Notes in Computer Science 4703: 17-41.
Hlavacek WS, Faeder JR, Blinov ML, et al. 2006. Rules for modeling signal-transduction systems. Science STKE, 344: re6.
Bandyopadhyay S, Kelley R, Krogan NJ, Ideker T. 2008. Functional maps of protein complexes from quantitative genetic interaction data. PLoS Comput. Biol. 4: e1000065.
Beltrao P, Trinidad JC, Fiedler D, et al. 2009. Evolution of phosphoregulation: comparison of phosphorylation patterns across yeast species. PLoS Biol. 7: e1000134.
Collins SR, Miller KM, Maas NL, et al. 2007. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 446: 806-810.
Roguev A, Bandyopadhyay S, Zofall M, et al. 2008. Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast. Science 322: 405-410.
Ulitsky I, Shlomi T, Kupiec M, Shamir R. 2008. From E-MAPs to module maps: dissecting quantitative genetic interactions using physical interactions. Mol. Syst. Biol. 4: 209.
Lu R, Markowetz F, Unwin RD, et al. 2009. Systems-level dynamic analyses of fate change in murine embryonic stem cells. Nature 462: 358-362.
Macarthur BD, Ma'ayan A, Lemischka IR. 2009. Systems biology of stem cell fate and cellular reprogramming. Nat. Rev. Mol. Cell Biol. 10: 672-681.
Schaniel C, Ang YS, Ratnakumar K, et al. 2009. Smarcc1/Baf155 Couples Self-Renewal Gene Repression with Changes in Chromatin Structure in Mouse Embryonic Stem Cells. Stem Cells 27: 2979-2991.
Whetton AD, Williamson AJ, Krijgsveld J, et al. 2008. The time is right: proteome biology of stem cells. Cell Stem Cell 2: 215-217.
Gray RS, Abitua PB, Wlodarczyk BJ, et al. 2009. The planar cell polarity effector Fuz is essential for targeted membrane trafficking, ciliogenesis and mouse embryonic development. Nat. Cell Biol. 11: 1225-1232.
Hart GT, Lee I, Marcotte EM. 2007. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 8: 236.
Lee I, Lehner B, Crombie C, et al. 2008. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat. Genet. 40: 181-188.
Li Z, Lee I, Moradi E, Hung NJ, et al. 2009. Rational extension of the ribosome biogenesis pathway using network-guided genetics. PLoS Biol. 7: e1000213.
McGary KL, Park TJ, Woods JO, et al. 2010. Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proc. Natl. Acad. Sci. USA Published online before print March 22, 2010.
Schrimpf SP, Weiss M, Reiter L, et al. 2009. Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes. PLoS Biol. 7: e48.
Assanah MC, Bruce JN, Suzuki SO, et al. 2009. PDGF stimulates the massive expansion of glial progenitors in the neonatal forebrain. Glia 57: 1835-1847.
Haeno H, Levine RL, Gilliland DG, Michor F. 2009. A progenitor cell origin of myeloid malignancies. Proc. Natl. Acad. Sci. 106: 16616-16621.
Irish JM, Hovland R, Krutzik PO, et al. 2004. Single cell profiling of potentiated phospho-protein networks in cancer cells. Cell 118: 217-228.
Sachs K, Perez O, Pe'er D, et al. 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308: 523-529.
Janssens H, Hou S, Jaeger J, et al. 2006. Quantitative and predictive model of transcriptional control of the Drosophila melanogaster even skipped gene. Nat. Genet. 38: 1159-1165.
Friedländer MR, Chen W, Adamidi C, et al. 2008. Discovering microRNAs from deep sequencing data using miRDeep. Nat. Biotechnol. 26: 407-415.
Selbach M, Schwanhäusser B, Thierfelder N, et al. 2008. Widespread changes in protein synthesis induced by microRNAs. Nature 455: 58-63.
Stoeckius M, Maaskola J, Colombo T, et al. 2009. Large-scale sorting of C. elegans embryos reveals the dynamics of small RNA expression. Nat. Methods. 6: 745-751.
Bao Z, Murray JI, Boyle T, et al. 2006. Automated cell lineage tracing in Caenorhabditis elegans. Proc. Natl. Acad. Sci. 103: 2707-2712.
Boyle TJ, Bao Z, Murray JI, et al. 2006. AceTree: a tool for visual analysis of Caenorhabditis elegans embryogenesis. BMC Bioinformatics 7: 275.
Murray JI, Bao Z, Boyle TJ, et al. 2008. Automated analysis of embryonic gene expression with cellular resolution in C. elegans. Nat. Methods. 5: 703-709.
Zhao Z, Flibotte S, Murray JI, et al. 2009. New tools for investigating the comparative biology of Caenorhabditis briggsae and Caenorhabditis elegans. Genetics [ePub ahead of print].
Hua S, Kallen CB, Dhar R, et al. 2008. Genomic analysis of estrogen cascade reveals histone variant H2A.Z associated with breast cancer progression. Mol. Syst. Biol. 4: 188.
Hua S, Kittler R, White KP. 2009. Genomic antagonism between retinoic acid and estrogen signaling in breast cancer. Cell 137: 1259-1271.
Liu J, Ghanim M, Xue L, et al. 2009. Analysis of Drosophila segmentation network identifies a JNK pathway factor overexpressed in kidney cancer. Science 323: 1218-1222.
Zhuang M, Calabrese MF, Liu J, et al. 2009. Structures of SPOP-substrate complexes: insights into molecular architectures of BTB-Cul3 ubiquitin ligases. Mol. Cell. 36: 39-50.
Janes KA, Reinhardt HC, Yaffe MB. 2008. Cytokine-induced signaling networks prioritize dynamic range over signal strength. Cell 135: 343-354.
Janes KA, Yaffe MB. 2006. Data-driven modelling of signal-transduction networks. Nat. Rev. Mol. Cell Biol. 7: 820-828.
Reinhardt HC, Yaffe MB. 2009. Kinases that control the cell cycle in response to DNA damage: Chk1, Chk2, and MK2. Curr. Opin. Cell Biol. 21: 245-55.
Toettcher JE, Loewer A, Ostheimer GJ, et al. 2009. Distinct mechanisms act in concert to mediate cell cycle arrest. Proc. Natl. Acad. Sci. 106: 785-790.
Jaenisch R, Young R. 2008. Stem cells, the molecular circuitry of pluripotency and nuclear reprogramming. Cell 132: 567-582.
Marson A, Levine SS, Cole MF, et al. 2008. Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134: 521-533.
Mathur D, Danford TW, Boyer LA, et al. 2008. Analysis of the mouse embryonic stem cell regulatory networks obtained by ChIP-chip and ChIP-PET. Genome Biol. 9: R126.
DREAM Challenge 1 (Peptide Recognition Domain)
Benedix A, Becker CM, de Groot BL, et al. 2009. Predicting free energy changes using structural ensembles. Nat. Methods 6: 3-4.
Brinkworth RI, Breinl RA, Kobe B. 2003. Structural basis and prediction of substrate specificity in protein serine/threonine kinases. Proc. Natl. Acad. Sci. 100: 74-79.
Guerois R, Nielsen JE, Serrano L. 2002. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. 320: 369-387.
Mok et al. 2010. Science Signaling (in press).
Saunders NF, Brinkworth RI, Huber T, et al. 2008. Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics 9: 245.
Saunders NF, Kobe B. 2008. The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information. Nucleic Acids Res. 36: W286-W290.
Tonikian R, Zhang Y, Sazinsky SL, et al. 2008. A specificity map for the PDZ domain family. PLoS Biol. 6: e239.
Tonikian R, Xin X, Toret CP, et al. 2009. Bayesian modeling of the yeast SH3 domain interactome predicts spatiotemporal dynamics of endocytosis proteins. PLoS Biol. 7: e1000218.
DREAM Challenge 2 (in silico networks)
Marbach D, Prill RJ, Schaffter T, et al. 2010. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. USA Published online before print March 22, 2010.
Marbach D, Schaffter T, Mattiussi C, Floreano D. 2009. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 16: 229-239.
DREAM Challenge 3 (signaling network)
Saez-Rodriguez J, Alexopoulos LG, Epperlein J, et al. 2009. Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol. Syst. Biol. 5: 331.
Saez-Rodriguez J, Goldsipe A, Muhlich J, et al. 2008. Flexible informatics for linking experimental data to mathematical models via DataRail. Bioinformatics 24: 840-847.
Prill RJ, Marbach D, Saez-Rodriguez J, et al. 2010. Towards a rigorous assessment of systems biology models: the DREAM3 Challenges. PLoS ONE. 5(2):e9202.
Ziv Bar-Joseph, PhD
Ziv Bar-Joseph is an associate professor in the Machine Learning Department and the Computer Science Department (CSD) at Carnegie Mellon University. His primary research areas are computational biology, bioinformatics, and machine learning. He heads the Systems Biology Group at the School of Computer Science at CMU. His team's work addresses issues ranging from the experimental design level to the systems biology level. He completed his PhD at the MIT Laboratory for Computer Science, where he worked on problems related to distributed computing and computer graphics.
Andrea Califano, PhD
Andrea Califano is professor of biomedical informatics at Columbia University, where he leads several cross-campus activities in computational and systems biology. Califano is also codirector of the Center for Computational Biochemistry and Biosystems, chief of the bioinformatics division, and director of the Genome Center for Bioinformatics.
Califano completed his doctoral thesis in physics at the University of Florence and studied the behavior of high-dimensional dynamical systems. From 1986 to 1990, he was on the research staff in the Exploratory Computer Vision Group at the IBM Thomas J. Watson Research Center, where he worked on several algorithms for machine learning, including the interpretation of two- and three-dimensional visual scenes. In 1997 he became the program director of the IBM Computational Biology Center, and in 2000 he cofounded First Genetic Trust, Inc., to pursue translational genomics research and infrastructure-related activities in the context of large-scale patient studies with a genetic component.
Manolis Kellis, PhD
Manolis Kellis is an assistant professor of computer science at MIT, a member of the Computer Science and Artificial Intelligence Laboratory, and associate member of the Broad Institute of MIT and Harvard. He is the recipient of a National Science Foundation Career Award (2007), the Karl Van Tassel 1925 Career Development Professorship, and the Distinguished Alumnus 1964 Career Development Professorship. He was selected as one of the 35 top young innovators under the age of 35 by Technology Review magazine, one of 20 young scientists recognized as the Principal Investigators of the Future by Genome Technology magazine, and one of three scientists representing the next generation in biotechnology by the Museum of Science in Boston.
Kellis obtained his PhD from MIT, where he received the Sprowls award for the best doctoral thesis in computer science, the first Paris Kanellakis graduate fellowship, and the Chorafas Foundation award. His research is in the field of computational biology, developing algorithms and machine learning techniques to interpret complete genomes, understand gene regulation, reconstruct cellular networks, and study genome evolution. Prior to computational biology, he worked on artificial intelligence, sketch and image recognition, robotics, and computational geometry at MIT and at the Xerox Palo Alto Research Center.
Gustavo Stolovitzky, PhD
Gustavo Stolovitzky is manager of the Functional Genomics and Systems Biology Group at the IBM Computational Biology Center in IBM Research. The Functional Genomics and Systems Biology group is involved in several projects, including DNA chip analysis and gene expression data mining, the reverse engineering of metabolic and gene regulatory networks, modeling cardiac muscle, describing emergent properties of the myofilament, modeling P53 signaling pathways, and performing massively parallel signature sequencing analysis.
Stolovitzky received his PhD in mechanical engineering from Yale University and worked at the Rockefeller University and the NEC Research Institute before coming to IBM. He has served as Joliot Invited Professor at Laboratoire de Mecanique de Fluides in Paris and as visiting scholar at the physics department of the Chinese University of Hong Kong. Stolovitzky is a member of the steering committee of the Systems Biology Discussion Group of the New York Academy of Sciences.
Naama Barkai, PhD
Mark Biggin, PhD
Walter Fontana, PhD
Nevan Krogan, PhD
Ihor Lemischka, PhD
Edward Marcotte, PhD
Franziska Michor, PhD
Garry Nolan, PhD
Nikolaus Rajewsky, PhD
John Reinitz, PhD
Robert Waterston, PhD
Kevin White, PhD
Michael Yaffe, MD, PhD
Richard Young, PhD
DREAM Best Performers
University of Padua
Jonathan Ellis, PhD
Alex Greenfield, PhD
Philip Kim, PhD
Robert Küffner, PhD
Daniel Marbach, PhD
Robert Prill, PhD
Julio Saez-Rodriguez, PhD
John Schwacke, PhD
Nicola Soranzo, PhD
Vân Anh Huynh-Thu
How do some DNA regions emerge as strong regulatory elements from the many regions that bind weakly to transcription factors?
How clear is the distinction, experimentally or conceptually, between protein complexes and pathways?
Is there selection for the long-term flexibility of modular processes, or only for their short-term advantages?
Does the inclusion of regulatory RNAs and RNA-binding proteins change the inferred topology of biological network models?
How do rule-based descriptions change the understanding of various biological processes?
How does the organization of chromatin at different scales reinforce the expression patterns of different cell types, such as stem cells?
Do we have all of the ingredients to quantitatively predict the global expression patterns of developing metazoans?
How can quantitative differences in transcription-factor binding give rise to robust changes in expression of the corresponding proteins?
How do regulatory RNAs and RNA-binding proteins affect the dynamics and stability of transcription networks?
Is the dynamic range of signaling more important than the absolute levels for some disease pathways?
Richard Young, Massachusetts Institute of Technology
Ihor Lemischka, Mount Sinai School of Medicine
- A handful of "core" transcription factors reinforce each other to maintain the self-renewing pluripotent stem-cell state, in which stem cells have the ability to differentiate into any cell type in the body.
- To maintain pluripotency, stem cells must avoid expressing the master regulators that lead them to differentiate into specialized cell fates, but they are typically "poised" to do so.
- Other factors, including histone modifications, chromatin structure, microRNAs, and signaling pathways, are also involved in pluripotency.
- By inserting an inducible gene whose transcript resists degradation, researchers can precisely control the core transcription factor Nanog to initiate differentiation in stem cells.
- Subsequent snapshots of the patterns of histone modifications, RNA polymerase II elongation, messenger RNA levels, and nuclear protein levels reveal how the differentiation signal propagates through the regulatory network.
- In differentiating stem cells, messenger RNA levels correlate very poorly with the levels of the corresponding proteins.
A simple twist of fate
Pluripotent stem cells, such as embryonic stem cells, combine a persistent self-renewing state with the capacity to differentiate into any cell type in the body. If researchers learn to control these processes, the cells could be widely useful for studying genetically influenced diseases or perhaps even for new treatments. At a more fundamental level, the molecular underpinnings of pluripotency are central to how cell fate is determined.
Over the past few years, researchers have identified a set of "core transcription factors" that are central to maintaining the pluripotent state. "The master regulators—Oct4, Sox2, Nanog, and to some extent Tcf3—form these interconnected autoregulatory loops where they all get together and bind the promoters of their own genes," noted Richard Young.
Another critical feature is that "to maintain pluripotency you must not express the master regulators of any other cell type," Young stressed. Once differentiation begins, Oct4 and Nanog rapidly degrade, he said, "so that once you initiate the process of differentiation, these cells cannot normally return to their embryonic state."
"To maintain pluripotency, you must not express the master regulators of any other cell type."
Shinya Yamanaka's 2006 demonstration that exposing already differentiated cells to a handful of transcription factors can reset the clock, rendering them pluripotent, showed that "transcription factors are really key to cell state," Young said. Nonetheless, there are other factors that help establish or maintain the self-renewing state, including chromatin-regulating proteins like those of the polycomb group, histone modifications, microRNAs including miR-92, and developmental signaling pathways such as the Wnt pathway.
Young and his colleagues used ChIP-seq (chromatin immunoprecipitation followed by sequencing) to identify DNA regions bound by transcription factors, confirming that the key regulators not only reinforce the pluripotent state but also suppress programs for other cell types. "Oct4, Sox2, Nanog, and Tcf3," together with proteins that modify chromatin and histone modifications, "co-occupy genes that encode the master regulators of other cell types," he said.
Young suggested that this repression keeps these genes in a state that's "poised for expression" once differentiation begins. The core transcription factors also occupy genes for microRNAs that fine-tune the messenger RNA levels of the same target genes through "incoherent feed-forward loops," Young said. "This kind of circuitry allows you to more rapidly remove the messages that would otherwise cause the cells to remain longer in the embryonic stem cell state."
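To make the idea of an incoherent feed-forward loop more concrete, the toy model below has a transcription factor that switches on both a target messenger RNA and a microRNA that degrades that message. When the factor is switched off, the lingering microRNA clears the message much faster than basal decay alone. This is a minimal sketch under stated assumptions: the rate constants, the simple Euler integration, and the function name are all hypothetical and are not taken from Young's work.

```python
def time_to_half(k_mirna, dt=0.01, t_max=50.0):
    """Time for target mRNA to fall to half its starting level after the
    transcription factor is switched off.

    k_mirna: strength of microRNA-mediated degradation (0 = no microRNA arm).
    All rate constants below are hypothetical, chosen only for illustration.
    """
    a, d = 1.0, 0.1      # mRNA synthesis rate and basal decay rate
    b, d_r = 1.0, 0.05   # microRNA synthesis rate and decay rate

    # Steady state while the transcription factor is on.
    r = b / d_r
    m0 = a / (d + k_mirna * r)

    # Factor switched off at t = 0: synthesis stops, decay continues.
    m, t = m0, 0.0
    while m > 0.5 * m0 and t < t_max:
        dm = -(d + k_mirna * r) * m   # basal plus microRNA-driven decay
        dr = -d_r * r                 # microRNA itself decays slowly
        m += dm * dt
        r += dr * dt
        t += dt
    return t


without_mirna = time_to_half(k_mirna=0.0)
with_mirna = time_to_half(k_mirna=0.5)
print(without_mirna, with_mirna)
```

In this toy setting the message disappears far sooner when the microRNA arm is present, which is the qualitative behavior Young described.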
In addition to the expected binding of the core factors, the researchers found RNA polymerase II not only at the promoter regions of genes, but also at enhancer regions upstream. Rather than binding directly to enhancers, Young said, "it's just as likely that the enhancers bend around to interact with polymerase at the initiation site, so these [measurements] may represent just cross-linking of RNA polymerase to the transcription factors."
Further screens identified other indications of the DNA looping between enhancer and promoter sites. Young and his colleagues found evidence for the huge protein complex called mediator at active genes, which is not too surprising since it "mediates the interaction between DNA-bound transcription factors and the transcription-initiation apparatus at the core promoter."
Getting the ball rolling
To understand how pluripotent cells begin to move towards differentiation, it's not enough to map the networks that maintain the self-renewing state. Ihor Lemischka said, "It is absolutely necessary to measure network dynamics during the course of changes in cell fate and hopefully in real time."
To address this challenge, Lemischka and his colleagues have developed an experimental system to initiate differentiation in mouse embryonic stem cells in a controlled way. They then track the cells using a wide variety of techniques to begin to build a movie of information propagating through the network during the early stages of commitment to a more differentiated cell type.
"It is absolutely necessary to measure network dynamics at multiple levels during changes in cell fate."
To kick-start the process, the researchers use a tetracycline- (or doxycycline-) inducible promoter. They couple this promoter to a modified gene for the Nanog protein, whose messenger RNA contains a 3' untranslated region that is immune to targeted degradation by short hairpin RNA. "These cells depend absolutely on the presence of the doxycycline to maintain their pluripotency," Lemischka said. Moreover, "there are no significant off-target effects" in this scheme.
After the researchers remove doxycycline from the medium, the Nanog protein disappears within a day, and the cells begin to differentiate. The team fixed cells after one, three, and five days and profiled them at several levels: they mapped some histone modifications, notably acetylation of lysine-4 on histone 3, mapped the positions of elongating RNA polymerase II on the DNA, and measured the levels of messenger RNA transcripts for various proteins.
In addition, in collaboration with Tony Whetton in the UK, they profiled over 1600 nuclear proteins using inductively coupled plasma mass spectrometry. The comparison of protein levels with the corresponding messenger RNA revealed a surprise. "In almost half the cases there is, if anything, an anticorrelation between what happens at the mRNA level and what happens at the business end of the process; i.e., the protein level," Lemischka remarked. Under these conditions, he said, if one followed the common practice of using messenger RNA as a proxy for protein levels, "one would be not only fooled but seriously misled." With his Mount Sinai colleague Avi Ma'ayan, Lemischka has been developing visualization tools to integrate the various types of data. He noted that such visualization aids are essential to understanding, "how does information propagate through this rather simple network?"
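One simple way to check how well messenger RNA tracks protein is to compare the change in each, gene by gene. The sketch below computes a Pearson correlation and the fraction of genes whose mRNA and protein move in opposite directions. The numbers are made-up toy values, and the function names are hypothetical; this is not the analysis Lemischka's group performed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def discordant_fraction(mrna_changes, protein_changes):
    """Fraction of genes whose mRNA and protein change in opposite directions.

    Genes with a zero change at either level are ignored.
    """
    pairs = [(m, p) for m, p in zip(mrna_changes, protein_changes)
             if m != 0 and p != 0]
    opposite = sum(1 for m, p in pairs if m * p < 0)
    return opposite / len(pairs)


# Toy log fold changes for four hypothetical genes (not real data).
mrna = [1.0, -0.5, 0.8, -1.2]
protein = [-0.4, 0.3, 0.9, -1.0]

print(pearson(mrna, protein))
print(discordant_fraction(mrna, protein))
```

In the toy data, two of the four genes change in opposite directions at the two levels, the kind of disagreement that makes messenger RNA an unreliable proxy for protein.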
Recently, the researchers have extended their inducible promoter technique to the nuclear hormone receptor Esrrb, and constructed putative networks to describe the observations. "These data, I hope, will give you essentially a complete—or mostly complete—picture of how the cell fate change is taking place across multiple molecular and biochemical levels and as a function of time after a very defined perturbation," Lemischka said.
Nanog sets differentiation in motion. But what sets Nanog in motion? Lemischka worked with Patrick Paddison, then at Cold Spring Harbor Laboratory, to look for drivers of Nanog expression after they triggered differentiation with retinoic acid. They monitored a fluorescent protein driven by the Nanog promoter as they used short interfering or hairpin RNAs against a library of proteins to see whether the transition was slowed down or sped up. "If you knock down a gene that's required for this transition, you'd expect to prolong the time," Lemischka said.
Among the library components that influenced the rate of differentiation was SWI/SNF (SWItch/Sucrose NonFermentable), which is an ATP-dependent chromatin-remodeling complex. As with Richard Young's research, this observation suggests that the chromatin state is critical to the commitment of stem cells to differentiation, Lemischka said. "SWI/SNF is absolutely required to change the chromatin from a loose configuration to a more compacted and folded configuration that's seen in more differentiated cells."
John Reinitz, Stony Brook University and the Chicago Center for Systems Biology
Mark Biggin, Lawrence Berkeley National Laboratory
Robert Waterston, University of Washington
- Gene regulation in eukaryotes involves the combined effect of many enhancers, which are usually thought to contribute independently, but sometimes do not.
- Many of the features of development in Drosophila, including surprising non-additive effects of transcription factors, can be captured by a mathematical model that considers both long-range recruitment of the general transcription factors to the promoter and short-range quenching.
- The binding affinity of sequence-specific transcription factors in Drosophila to various regions of the genome varies widely, but functional importance cannot be assessed by using a threshold occupancy probability.
- Variations in accessibility reflecting local chromatin structure explain why some sequences that should bind factors do not.
- Fluorescence microscopy allows quantitative tracking of the gene expression in each cell in the developing fly embryo, providing a rich data set for evaluating quantitative models of gene regulation.
- Tracking worm nuclei during development lets researchers trace the known cell lineages to within one cell division of the hatched larva while a fluorescent reporter reflects the gene activity in each cell.
- Many of the 150 genes tracked so far in C. elegans have a repetitive structure that is reminiscent of the partitioning known to take place in Drosophila embryos.
- Many genes in C. elegans are spliced to form different transcripts in different tissues or stages of development.
More than the sum of the parts
Early animal development is shaped by relatively few transcription factors in a geometrically simple environment. These features make development an attractive context for exploring the rules governing gene expression in animals, which is much more complex than in simpler organisms. Researchers are hopeful that they will soon be able to predict the global expression patterns that result from the interactions of many transcription factors with the known DNA sequence.
Although the essence of gene regulation in prokaryotes was understood a half century ago, John Reinitz said, "we still don't understand it for higher metazoans, and I would claim for eukaryotes in general."
Researchers attribute genetic regulation to the combined effects of individual cis-regulatory modules or enhancers, which are short sequences on the same DNA strand as the genes they act on, but not next to them. These entities "give you a way to break up large promoter or control regions into independently acting pieces," Reinitz said, "but the hypothesis of independence insured by short-range repression has not been adequately tested."
Reinitz and his collaborators have developed mathematical models for the development of the fruit fly, Drosophila melanogaster. The eventual segmented body plan of the insect appears early during development as a series of seven stripes of expression of the well-studied “even-skipped”, or “eve”, gene. Quantitative measurements of the levels of various transcription factors allow stringent tests of models, Reinitz noted. "As in physics, if you have a fairly well understood system, looking very, very closely at small changes from what you would expect can be very scientifically informative."
"In a well understood system, small changes can be very informative."
Reinitz highlighted two experiments, more than a decade old, that pose particular challenges for modeling. One observation involves two fragments of DNA, neither of which by itself induces expression of stripe 7. In genetically modified flies where the elements are adjacent in the DNA, however, stripe 7 appears. "This is a case where nothing plus nothing gives you something," Reinitz said. "That doesn't fit into the classical picture of enhancer action."
Although researchers often talk about thermodynamic models of transcription-factor binding, Reinitz said, when it comes to protein-protein interactions "people really don't understand the fundamental chemistry. Those have to be represented phenomenologically."
Reinitz and his coworkers have built a model in which binding of multiple activators to DNA recruits the transcription machinery. They also include short-range quenching factors. "Each activator tends to interact with six or eight quenchers," he said, and the feed-forward calculation "can model this something-from-nothing behavior."
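The something-from-nothing result can be illustrated with a toy calculation. The sketch below is not the Reinitz group's model: it simply assumes that activator inputs add, that short-range quenchers remove a fixed fraction of each activator's input, and that transcription responds to the total through a sigmoidal threshold. The function name, parameter values, and units are all hypothetical.

```python
import math

def expression(activators, quench=0.0, threshold=5.0, steepness=2.0):
    """Toy transcription rate between 0 and 1 (hypothetical units).

    activators: strengths of the activators bound on the DNA fragment(s).
    quench: fraction of each activator's input removed by nearby quenchers.
    """
    total = sum(a * (1.0 - quench) for a in activators)
    # Sigmoidal response: little output until the summed input nears the threshold.
    return 1.0 / (1.0 + math.exp(-steepness * (total - threshold)))

fragment_a = [3.0]  # assumed activator input on the first fragment
fragment_b = [3.0]  # assumed activator input on the second fragment

alone_a = expression(fragment_a)
alone_b = expression(fragment_b)
together = expression(fragment_a + fragment_b)
quenched = expression(fragment_a + fragment_b, quench=0.6)

print(round(alone_a, 3), round(alone_b, 3), round(together, 3), round(quenched, 3))
```

With these made-up numbers, each fragment alone gives a rate near 0.02, the adjacent pair gives about 0.88, and the quenched pair falls below 0.01, so a non-additive pattern can arise from a thresholded sum.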
The second puzzling observation, which involved fusing two well characterized enhancers and obtaining a completely novel pattern, also violated the simple additive expectation. Reinitz showed that the observed effect was a consequence of the interaction of two factors, in this case bicoid and hunchback. "Bicoid is a co-activator, that is to say it transforms hunchback from a quencher to an activator," Reinitz said.
Reinitz emphasized that recognizing these effects required careful measurements coupled to a quantitative model. He expressed cautious optimism that researchers have enough ingredients to predict genome-wide expression as in real embryos, where many enhancers act simultaneously. "One of the things we hope will come out of this is to understand much better how enhancers emerge as emergent properties from a distribution of binding sites."
A question of degree
Together with his colleagues in the Berkeley Drosophila Transcription Network Project, Mark Biggin has used chromatin immunoprecipitation to map the binding of tens of transcription factors in fly embryos. "The predominant view is that transcription factors that have different biological functions each bind and control the expression of very distinct sets of target genes," he said. In contrast, experiments show that binding of many factors is widespread and highly overlapping across the genome, but it varies widely in degree, in what Biggin calls a "quantitative network."
"A lot of the lower-level binding is nonfunctional," and such low-level binding has long been expected from simple thermodynamic arguments. "It may at least be weakly affecting transcription," Biggin said, but "it's my suspicion that the system tolerates some level of weak transcriptional control that's not biologically significant." Simply setting a threshold level, however, cannot discriminate between functional and non-functional binding.
Instead, the overall activity of genes seems to reflect the precise degree of binding. "Relatively modest quantitative differences in occupancy between transcription factors on shared targets turn out to be a major determinant of the factors’ distinct regulatory specificities," Biggin explained.
Specificity does not arise because particular factors bind exclusively to particular targets. Nonetheless, groups of transcription factors that have similar functions tend to bind at similar levels at different sites, and these groups recapitulate the classic developmental classification of embryonic factors. In contrast, when researchers use a threshold to classify factors as either bound or not bound at each site, this correlation disappears.
Researchers often attribute the degree of binding between a transcription factor and a particular sequence to the measured affinity between purified protein and naked DNA. "If we use only such in vitro data as input to our models, the results correlate very poorly with the measured patterns of binding in vivo," Biggin cautioned. However, accounting for the accessibility of different regions of DNA, as measured by sensitivity of chromatin to cleavage by DNase, markedly improves the binding predictions, especially for regions that would be expected to bind tightly on the basis of sequence alone.
To move further toward quantitative modeling of transcription patterns, the Berkeley team has developed a scheme to measure the expression levels of genes over time in each cell in the embryo, using confocal microscopy. Since they can track only a few gene products at a time, they have devised ways to use a reference reporter to register data from thousands of embryos, encompassing millions of cells, into a single model blastoderm.
"Our current atlas records expression at 7 time points and has data for protein expression of 20 factors and mRNA expression of 100 target genes and 25 cis-regulatory modules," Biggin said. By integrating these data with other data sets, he said, "we will be able to gain a deeper mechanistic understanding of how the networks function by using more advanced computational models."
Robert Waterston and his colleagues have developed tools for tracking gene expression at individual cells in the worm, Caenorhabditis elegans. As is well known, "every worm undergoes exactly the same pattern of cell divisions to produce exactly the same cells, so if you know the pattern of cell divisions, you can identify what the cell is going to be," he said.
"If we can take advantage of the lineage, all of the anatomy will be done for us."
The researchers track individual nuclei through development using confocal microscopy to infer where each cell lies on the well-known lineage diagram. "If we can take advantage of the lineage we don't have to do anatomy: all of the anatomy will be done for us," Waterston said. Currently this process works for embryos with up to about 350 cells, compared to the final total of 671 cells, which includes the 558 that survive in the hatched larva. "We're about one round of cell division short," he said, "and we're hopeful."
The researchers have fluorescently tracked about 150 genes so far, and have also used protein fusions to get more direct information about protein levels instead of just messenger RNA levels. Although some genes do nothing interesting at this stage, others show a clear lineage pattern, Waterston said. "Many have a repetitive pattern. It reminds us of the kind of partitioning of the embryo that you see in flies." He added, "if we go through the whole set of cells, we basically can see distinct gene expression patterns for almost every cell."
As part of the modENCODE project, Waterston and his collaborators have also done deep analysis of transcription in worms. To do this they sequence huge numbers of short RNAs from the animals, often limiting attention to segments with a poly-A tail that marks them as processed messenger RNA. They looked at samples from 19 populations, including a wide variety of different cell types and developmental stages.
Because the sequence reads are short, the researchers had to align them to the genome, using special techniques for sequences that span a splice site. They also used spliced-leader sequences and poly-A tails to identify the beginnings and ends of transcripts in their small segments.
In over a billion reads, Waterston said, "we've got almost 15,000 splice junctions that were not found in [the C. elegans database] WormBase, either confirmed or even in prediction." The new junctions included all but about 50 of those previously found by Marc Vidal's group. "By that standard, we've got high representation."
The coverage of the sequence was deep enough to let the researchers, notably Jiang Du in Mark Gerstein's lab, identify stage-specific alternative splice forms. For one gene, for example, in one tissue, 90% of the transcripts represented a single isoform, while in another tissue 96% of the transcripts from the same gene were a different isoform. Nonetheless, the coverage is not deep enough to uncover all rare transcripts, Waterston noted, so the team is working with Jay Shendure to enrich the rare genes by doing array capture of the abundant, well-known forms.
"If we can pull all of these things together," Waterston said, "we can start to really understand what the transcriptional network is doing in going from this one cell egg to the 558-cell hatched embryo."
Nevan Krogan, University of California, San Francisco
Michael Yaffe, Massachusetts Institute of Technology
- Integrating protein-protein interaction data with genetic interaction data can help identify protein complexes and pathways, individual proteins, and components of proteins.
- Genetic interaction data from mutants can clarify the specific amino acids that are critical to proteins and complexes such as RNA polymerase II, nucleosomes, chaperones, and the ribosome.
- The effect of a signaling molecule can change markedly depending on what other signals are present and when the signal occurs.
- A precisely characterized network model can generate novel biological predictions.
- The dynamic range of a signal can be more important in disease than its absolute level.
Much biological activity is carried out by interacting proteins, either in the form of protein complexes or sequential pathways. To dissect these relationships, scientists draw on data from in vitro protein-protein interactions and genetics, as well as antibody-based measurements that quantitatively assess the activity levels of signaling proteins.
To explore the proteins that make up complexes, researchers use affinity tagging and mass spectrometry to identify proteins in contact. Nevan Krogan and his colleagues augment this data with genetic interaction data to help them find functional pathways. The real challenge, Krogan said, is to "take protein-protein interaction data and integrate it with the genetic interaction data and visualize it in a representation that biologists can look at and formulate hypotheses about their protein or proteins of interest." The merged data can give insight at many levels, he said: "organelle, biological process, pathway, complex, protein within a complex, domain within a protein, and amino acids."
Krogan's team characterizes genetic interaction in yeast by monitoring colony size. This gives a quantitative measure of fitness when two genes are simultaneously modified. Sets of genes that have mostly positive interactions "often encode for proteins that are physically associated, or work in the same pathway," he noted. "Oftentimes now, we can look at the genetic interaction data... and make predictions about the composition of protein complexes before we even go and biochemically characterize them."
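The report does not say how the interactions are scored. A widely used convention, assumed here purely for illustration, compares the double-mutant fitness with the product of the two single-mutant fitnesses: a double mutant that grows better than expected scores positive, and one that grows worse scores negative. The fitness values and the tolerance below are hypothetical.

```python
def interaction_score(w_a, w_b, w_ab):
    """Deviation of double-mutant fitness from the multiplicative expectation.

    w_a, w_b: relative fitness (e.g. normalized colony size) of each single mutant.
    w_ab: relative fitness of the double mutant.
    """
    return w_ab - w_a * w_b

def classify(score, tol=0.05):
    """Label an interaction, treating small deviations as neutral (tol is assumed)."""
    if score > tol:
        return "positive"
    if score < -tol:
        return "negative"
    return "neutral"

# Hypothetical single-mutant fitnesses; the expected double-mutant fitness is 0.56.
w_a, w_b = 0.8, 0.7
for w_ab in (0.78, 0.56, 0.30):
    score = interaction_score(w_a, w_b, w_ab)
    print(w_ab, round(score, 2), classify(score))
```

Here a double mutant with fitness 0.78 scores about +0.22 (positive), one at 0.56 is neutral, and one at 0.30 scores about -0.26 (negative).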
"Often the genetic-interaction data lets us predict protein complexes before they are biochemically characterized."
Krogan illustrated the power of the merged data by describing work on Mediator, a large transcriptional initiation complex also discussed by Richard Young. "If you just had the protein-protein interaction data, you'd say it's a 25-protein complex. If you just had the genetic interaction data, you'd hypothesize maybe you're looking at four different complexes. But together you say it's a 25 protein complex [comprising] four functionally distinct submodules."
Interactions between two genes indicate that their protein products are functionally interdependent, irrespective of their physical relationship. Extending genetic interaction experiments to mutants allows researchers to infer the precise subunits or amino acids needed for the interaction. The relevant subunits may differ for different roles of a multifunctional protein.
Krogan described in detail work exploring RNA polymerase II, done in collaboration with Roger Kornberg and his former postdoc Craig Kaplan. A mutagenesis screen found 71 different mutations in four essential subunits of the polymerase, which disrupt transcriptional regulation. The researchers looked for genetic interactions in crosses between these mutants and 1100 other mutants with either deletions or extra copies of essential genes involved in a variety of processes. "We didn't just want to see the connections with transcriptional machinery, we wanted to connect these point mutants to other biological processes," Krogan said.
Krogan acknowledged that targeting a few subunits of one specialized cellular machine was not guaranteed to illuminate other aspects of cell function. "The first question we had," he said, "was will you get these mutants impinging on other biological processes and therefore interactions with components of those processes?" But the results showed that "this machine is actually impinging on enough processes that it's worthwhile to do this genetic interaction map." The mutants with strong interaction showed that the polymerase subunits impinged on a range of processes, including cryptic initiation, chromosome segregation, and DNA damage. "This has inspired us to look at other major molecular machines," Krogan said. Among their targets so far are the nucleosome, the ribosome, and the chaperone protein Hsp90.
Krogan and his team also used their merged protein-protein and genetic interaction data to compare related species. "The complexes, the functional modules, are highly conserved" during evolution, he said. But their interactions have changed significantly. "There seems to be this massive rewiring that's occurred between these conserved functional modules."
Signaling pathways, in which kinases change the activity of other proteins (including other kinases) by phosphorylating them, play a central role in many biological processes. But biological context is critical, said Michael Yaffe. "One kinase can do very different things, depending on the context in which it sends the signals."
"One kinase can do very different things, depending on the context in which it sends the signals."
Yaffe reviewed an exhaustive set of measurements he and his colleagues, notably Kevin James, made of the activity of tumor necrosis factor (TNF) in cultured colon carcinoma cells. Following a cue/signal/response model, the researchers first stimulate the cells with various cytokines and growth factors. They then measure the internal signals, such as protein phosphorylation. "[We look at] whatever assay we can get, as long as the assay is rigorously quantitative," Yaffe said. Finally, they look at cellular responses such as cytokine release or measures of apoptosis.
"We can't just have a view of our favorite molecule. We have to have lots of molecules all at the same time in order to figure out what's going on," Yaffe said. The researchers collected 12 different measures of apoptosis over a 760-dimensional vector space, he noted.
Faced with this surfeit of data, "you have to do dimensional reduction," Yaffe said. Principal-components analysis showed that out of 760 components, three combinations "completely predict the apoptotic response. Two of those principal components could capture 95% of the response," he said. "The first principal component was the 'stress and death' axis. The second is a sort of survival axis." This analysis helped to clarify the role of TNF, which can either promote apoptosis or survival, depending on the cellular context and when the signal occurs. As Yaffe suggested, "Combinatorial stimuli are not the linear superposition of individual responses."
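The dimensional reduction Yaffe describes can be sketched on synthetic data. This is not the group's analysis: the data are invented (two hidden axes plus small noise, spread across 30 measured variables rather than 760 so that pure Python stays fast), and power iteration with deflation is just one simple way to extract leading principal components. All names and sizes are assumptions.

```python
import random

def pca_explained_variance(rows, n_components, n_iter=200, seed=0):
    """Fraction of total variance captured by each leading principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    total = sum(cov[j][j] for j in range(p))
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_components):
        # Power iteration: repeated multiplication converges on the top eigenvector.
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigval = sum(v[a] * cov[a][b] * v[b] for a in range(p) for b in range(p))
        fractions.append(eigval / total)
        # Deflation: remove the component just found before looking for the next.
        for a in range(p):
            for b in range(p):
                cov[a][b] -= eigval * v[a] * v[b]
    return fractions

# Synthetic measurements driven by two hidden axes (hypothetical).
rng = random.Random(1)
n_samples, n_features = 40, 30
loadings = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(2)]
data = []
for _ in range(n_samples):
    z = [rng.gauss(0, 3.0), rng.gauss(0, 1.0)]
    data.append([z[0] * loadings[0][j] + z[1] * loadings[1][j] + rng.gauss(0, 0.1)
                 for j in range(n_features)])

fractions = pca_explained_variance(data, 3)
print([round(f, 3) for f in fractions])
```

Because the synthetic data were built from two hidden axes plus small noise, the first two components account for nearly all of the variance, which mirrors the kind of compression reported in the talk.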
More recently, Yaffe and his colleagues have applied this approach to DNA damage. In response to stimuli including doxorubicin (dox), which is part of many chemotherapy regimens, "one of the most important signals in getting a DNA damage response was extracellular signal-regulated kinase (Erk)," also known as MAP kinase.
"What emerged is something I still am having trouble getting my head around," Yaffe admitted. "First, in response to dox, Erk was stopping the cell cycle. Second, for cells that were in S-phase, Erk was causing them to die by apoptosis." "It's surprising because I grew up thinking that Erk is a survival kinase," he said, but in this case "the very same kinase now sends a cell-cycle arrest and apoptosis message." This observation reiterates the context-dependent nature of signaling.
In another set of projects, the researchers used their extensive quantitative measurements of signaling to "get more insight into the biology without doing a single experiment, by just playing around with the model," Yaffe said. They did this by model-break-point analysis, changing the mathematical properties of the model until its behavior changed. Perhaps surprisingly, the model did not degrade steadily, but often worked fine until it broke catastrophically.
Once the model failed, the researchers looked to see what parts had broken. "The signal that was lost from the model was the activity of MAPKAP kinase 2, MK2," Yaffe said. Later, they experimentally validated the role of this kinase in stabilizing the gradual increase in messenger RNA for IL-1. "We found new biology," Yaffe concluded, "by taking data we already had." In another example, the researchers replaced the continuous variation of variables with a finite number of discrete bins. The model failed if the number of bins fell below about 20 for the autocrine signal IL-1, showing that "the IL-1 circuit must be analog. It really cares about the fine levels," Yaffe said. In contrast, the model still worked when TGFα levels were reduced to just two levels.
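The binning step can be pictured with a short sketch. The report does not describe the binning scheme, so equal-width bins on a signal scaled to the range 0 to 1 are assumed here, and the function names are hypothetical. The point is only that a coarse signal loses resolution: the worst-case error shrinks as the number of bins grows.

```python
def quantize(value, n_bins, lo=0.0, hi=1.0):
    """Replace a value in [lo, hi] with the midpoint of its equal-width bin."""
    width = (hi - lo) / n_bins
    index = min(int((value - lo) / width), n_bins - 1)  # keep value == hi in the last bin
    return lo + (index + 0.5) * width

def worst_error(n_bins, steps=1000):
    """Largest gap between a signal level and its binned version on a fine grid."""
    return max(abs(i / steps - quantize(i / steps, n_bins)) for i in range(steps + 1))

for n_bins in (2, 5, 20):
    print(n_bins, round(worst_error(n_bins), 3))
```

With two bins a signal can be misstated by up to a quarter of its full range, while with 20 bins the worst case is about 2.5%. A circuit that fails below roughly 20 bins is therefore one that depends on fine gradations of the signal.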
Finally, the researchers warped the response of some signals. In one case they made the response saturate quickly, and in another they desensitized it. Again, different kinases were sensitive to different regions of the response curve. By transgenically modifying MK2, they experimentally verified that a highly sensitive response was necessary for its biological function. "It looks like a full linear dynamic range of signals is probably more important than the absolute activity of a particular kinase," Yaffe concluded.
Nikolaus Rajewsky, Max Delbrück Center for Molecular Medicine
Walter Fontana, Harvard Medical School
- MicroRNAs are important for many biological processes and directly regulate at least one third of all human genes.
- About 1000 human genes code for proteins with at least one RNA-binding domain, and new experimental methods promise to clarify their significance and specificity.
- Feeding E. coli that carry isotope-labeled amino acids to different nematode worm species allows in vivo, high-throughput, quantitative, comparative proteomics.
- Because of the combinatorial complexity of modifications like phosphorylation or complex formation, the number of possible molecular species in even a small cellular network is astronomical.
- The large number of possible molecular states can create "logjams" that impede the approach to equilibrium.
- If the different modifications can be treated independently, it is much more efficient to represent changes in the system in terms of rules, rather than in terms of chemical reactions.
The advances in understanding the networks of biological regulation have happened against a backdrop of continuing discovery of basic mechanisms and principles in complex biological systems. These new insights could dramatically change the way people think about regulatory networks.
Biologists have rapidly adopted the powerful tools of RNA interference and its cousins to help them interrogate biological networks by suppressing target messenger RNAs in fully-functional cells. But naturally occurring small RNAs, notably microRNAs (miRNAs), also have a widespread role in many biological processes.
There are at least 500 microRNAs in the human genome, said Nikolaus Rajewsky. They are often conserved and often differentially expressed. "We now know that they are important for many—if not most—biological processes," he said, including metabolism, cancer, memory, immune responses, development, signaling, and others. "They directly regulate the post-transcription expression of at least one third of all human genes," with each miRNA regulating on average hundreds of target genes. Most commonly they do this by inducing degradation of messenger-RNA that contains a complementary sequence, for example in the 3' untranslated region.
Identifying miRNAs requires cloning and sequencing them from cells, but also distinguishing them from other small RNAs. Rajewsky's group has recently revised their miRDeep program, which analyzes raw sequence data to find miRNA candidates. The program exploits the fact that miRNA processing by protein complexes in the nucleus cuts the precursor RNA in characteristic patterns that distinguish it from RNA produced by random degradation.
Analyzing available data from a variety of organisms, the algorithm found some new candidate miRNAs, but not very many. The researchers estimated 80 in mice, but only six in humans and nine in flies. "There aren't many more of the microRNAs to find in these pretty-well researched animals," Rajewsky said. He also commented that many of the microRNAs in the database miRBase may not be true microRNAs.
Short RNAs such as miRNA generally act through complexes with RNA-binding proteins, and these are less thoroughly understood. With perhaps 1000 human genes encoding RNA-binding domains, often several in a single protein, this is a large class of genes and a complicated problem. "The function of this class of genes is largely unexplored," Rajewsky said, especially the question of what determines the proteins’ binding specificities. "Computational methods are just in their infancy."
Fortunately, emerging experimental techniques mean that "we are now in a position to really study post-transcriptional gene regulation at a systematic level," Rajewsky said. These techniques include high-throughput quantitative proteomics to complement messenger RNA measurements, translational profiling, and high-throughput identification of binding sites on messenger RNAs for RNA-binding proteins.
"We are now in a position to study post-transcriptional gene regulation at a systematic level."
One powerful technique for quantifying protein levels is called SILAC (for Stable Isotope Labeling by Amino acids in cell Culture). "It allows us to look at changes in protein synthesis, rather than changes in protein levels," Rajewsky said, since it measures the incorporation of amino acids labeled with specific isotopes. "The technology is at the point where you can measure 5000 different proteins in one sample," Rajewsky reported.
These studies showed that although changes induced by miRNA overexpression are widespread, they tend to be small. "MicroRNAs exert mostly weak effects on protein synthesis," Rajewsky said. In many cases they act to globally fine-tune genetic activity. Nonetheless, including RNA in network models is likely to be important for quantitative modeling and other aspects of regulation.
Rules, not reactions
"Physics and chemistry, while necessary, are not sufficient for laying the foundation of systems biology," challenged Walter Fontana, because they do not let researchers grapple with the complexity of the cell and the organism. "Systems biology needs a third leg to stand on to take systems seriously. I believe that that leg, that third partner, is going to be computer science."
Even a small fraction of a model network can include a staggering number of molecular species, once various combinations of phosphorylation and complex formation are accounted for. "Do all these possible molecular species matter, or should we just ignore them?" Fontana asked. "We can't answer that if we have to eliminate them at the outset."
Fontana used simple examples to illustrate this combinatorial complexity. Independent reaction pathways, analogous to computer scientists' "concurrency," can speed the assembly of a protein complex, for example. But interference between the different pathways leaves many molecules in an intermediate, partially-assembled state, in what Fontana called "dynamic jamming." In addition, each cell will contain a unique subset of the vast array of possible molecular species, continuously drifting over time. "We can't ignore these things," he said, "although we routinely do."
The combinatorial complexity arises when different reactions, such as phosphorylation at different sites, can occur independently, Fontana said. "Only then will this set of possibilities become manifest." At the same time, "Independence also offers us a potential solution for tackling these systems," Fontana said. "I can specify this whole system, rather than specifying exponentially many reactions, by using linearly many rules, exploiting independence."
Rule-based descriptions specify patterns rather than molecular species, so they keep the combinatorial complexity implicit. "It allows us to coarse-grain the system," Fontana said. In one example, he and his colleagues pared down a system of 10^19 equations to an exactly equivalent set with "only" 180,000. "That [number] is within the realm of the tractable," he said.
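The gap between reactions and rules can be counted directly in a toy case. The sketch assumes a single protein with n independent sites, each either phosphorylated or not; this setup and the function names are illustrative and are not taken from Fontana's software. Every fully specified molecular species needs its own reactions, whereas a rule mentions only the site it changes.

```python
from itertools import product

def enumerate_reactions(n_sites):
    """List every single-site change between fully specified molecular species."""
    reactions = []
    for state in product((0, 1), repeat=n_sites):  # 0 = unmodified, 1 = phosphorylated
        for site in range(n_sites):
            flipped = list(state)
            flipped[site] = 1 - flipped[site]
            reactions.append((state, tuple(flipped)))
    return reactions

def rule_count(n_sites):
    """One phosphorylation and one dephosphorylation rule per independent site."""
    return 2 * n_sites

for n in (2, 4, 10):
    species = 2 ** n
    print(n, species, len(enumerate_reactions(n)), rule_count(n))
```

For ten sites this gives 1,024 species and 10,240 reactions but only 20 rules. The reaction count grows exponentially with the number of sites, while the rule count grows linearly, provided the sites really are independent.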
In this view, Fontana said, "a network does not exist in the cell in the same way as a subway network or road network in Boston or New York. This is a network of possibilities." One critical issue in applying these techniques is the degree to which biological rules are truly independent. Many biological processes, such as cooperativity and compartmentalization, limit independence, Fontana noted. Although the context can be refined to include more conditions on each rule, this dilutes the simplifying power of the idealized model.
Rule-based models lead to new ways of thinking about how complex biological interactions collectively give rise to coherent, plastic, adaptive, and evolvable system behavior. Assembly of a complex such as the proteasome, for example, "does not occur in isolation, but in the context of massive pleiotropy and conflict with many different proteins trying to compete for the same binding sites," Fontana said. The protein components "might be shaped in a particular way, not just to assemble the proteasome correctly, but to prevent the interference from other proteins that co-occur in the cell."
"Perhaps signaling pathways are induced by the signals that the cell receives, rather than being hard-wired, waiting for a signal to arrive."
"This view leads one to contrast two kinds of perspectives of what is going on in cells," Fontana said, although he cautioned that both are probably useful for different processes. "One view is the engineering view, in which we view the cell as a set of hardwired circuits." In the other view, he said, "you have a huge, fluid network in which signaling pathways are induced by the signals that the cell receives, rather than being ready-made, hard-wired, waiting for signal to arrive."
Kevin White, University of Chicago
Garry Nolan, Stanford University
Franziska Michor, Memorial Sloan-Kettering Cancer Center
- Breast cancer subtypes have distinct genetic identities, but many mutations occur repeatedly.
- Common variants may be less important than rare alleles in breast cancer and other diseases.
- Single-cell studies reveal phenomena that are not apparent in populations.
- The presence of sub-populations of cells that have a particular signaling response can predict outcomes for patients with cancer or autoimmune disease.
- New technology using isotopic labeling and mass spectrometry could quantify the levels of hundreds of proteins in individual cells.
- Mathematical models of population growth can be used to determine where the mutations that lead to cancer first appear in the differentiation cascade of cells.
- For blood and brain tumors, cancer most likely initiates from a progenitor cell that develops self-renewal propensities by accumulating appropriate genetic or epigenetic changes.
In many systems, important features of biology are obscured when researchers measure the average response of a heterogeneous population. At the level of human populations, for example, people with different genetic backgrounds may respond quite differently to treatment. Even within a single person, a disease like cancer may include different populations of cells, and a treatment that deals effectively with one may open the door for others that are more deadly. Moreover, the cancer-cell population generally arises from changes in a single cell. New tools and concepts for grappling with heterogeneity will be critical in understanding how gene regulation and networks affect diseases.
By understanding the transcriptional regulatory networks in cancer, researchers can identify genes that may be high-value targets, said Kevin White. "They are likely to contribute to the cancer phenotype...and are potentially druggable, or parts of pathways that are druggable."
In one example, White described how network analysis in Drosophila led his team to several factors, including one called SPOP in the Jun-kinase signaling pathway. This factor acts in ubiquitin-driven degradation and is overexpressed in virtually all clear-cell renal-cell carcinomas. In collaboration with Brenda Schulman, the researchers determined the crystal structure of SPOP when it was bound to some substrates, which let them identify a small peptide domain that was critical to its substrate interaction. Knowing this motif, they then scanned all human phosphatases, found one that was a potential substrate, and then verified that this protein's levels are inversely related to SPOP in cancer.
For breast cancer, choosing the right treatment depends very much on whether a patient has estrogen and progesterone receptors and the well-known ErbB2 receptor, White noted. "About 60% of breast cancers are estrogen-receptor (ER) positive, and therefore respond to treatments like tamoxifen, aromatase inhibitors, or related treatments." In contrast, triple-negative patients, who lack all three receptors, "are clinically a much bigger problem because there's not a good targeted way to treat them."
"Disease phenotypes are often heterogeneous."
The differences between patients reflect, at least in part, their genetic differences. Even when they are outwardly similar, White said, "disease phenotypes are often heterogeneous," and these differences affect treatment outcomes.
To explore this heterogeneity, White and his colleagues compared ER-positive and triple-negative breast cancers to look for genetic variants that are shared within one or the other subtype. In addition to the common variants that are compiled in single-nucleotide polymorphism (SNP) databases, they also looked at less common single-nucleotide variants (SNVs). They used the Genomic Evolutionary Rate Profiling (GERP) score to quantify the degree to which variations have been rejected during evolution.
Compared to common polymorphisms (SNPs), White said, "rare polymorphisms have much higher GERP scores, meaning they're much more likely to be deleterious mutations." In addition, using the fixation index to estimate the statistical correlations within the populations showed that "rare alleles have a similar kind of association, whereas the common alleles do not." "Breast cancer actually has a large number of recurrent mutations," White concluded, and breast cancer subtypes have distinct genetic identities. He contrasted that with the common idea that a few genes drive diseases. "This leads one to think that perhaps rare alleles drive this disease and common polymorphisms have little impact."
Signaling in single cells
Some chemical analyses are not yet feasible at the single-cell level, Garry Nolan admitted, but probing individual cells can reveal important aspects of biology and disease that are obscured in larger average samples. Moreover, their importance may not be obvious until the experiments are done.
In the current technique, Nolan's team starts with cells from patients, fresh or frozen, from solid or liquid tissue. After perturbing the cells, they fix them to lock in the cell state, and then disrupt the membranes and proteins to allow access by fluorescently tagged antibodies to proteins in the cell or on the surface. Running the cells one at a time through a cell sorter, they then quantify the fluorescence for ten or fifteen molecules simultaneously—although new technology will raise this number.
These experiments often show that concentrations of specific molecular species vary together with those of others, meaning that they reflect distinct sub-populations of cells. "You see these perturbations and antiperturbations going on at the level of a single cell, things that you would miss if you were to lyse the cell," Nolan observed.
Interrogating the cells, for example by exposing them to signaling molecules, can reveal that subsets of the cells respond differently. Using antibodies to phosphorylated proteins, Nolan and his colleagues have found differences in the signaling response of different subpopulations of cells in what seems like a homogeneous population. "We dive inside of those cell types to find the underlying heterogeneity that exists in there that is revealed by stimulation that you wouldn't see otherwise."
In cancers, the presence of different subpopulations can predict how patients will respond to treatment. "It might be that the gene mutation that occurred doesn't really have its phenotype until later on," Nolan said, but the change can be revealed by its different signaling response.
In acute myeloid leukemia, for example, "cancers that were resistant to therapy could be predicted by that kind of signaling cascade." A company named Nodality, founded by Nolan, has demonstrated, using a preliminary blinded study, an algorithm that identifies with 95% confidence which patients should bypass debilitating chemotherapy and go straight to a bone marrow transplant.
Single-cell studies can also help dissect the effects of chemotherapy on the cancer-cell population. By genotyping the cell-sorted subpopulations, back at Stanford, graduate student Erin Simonds confirmed that returning cancer retained specific genetic signatures. "In multiple tumor types [from one patient], the signaling shows that they're all related and that, at relapse, they're still related," Nolan said. He cautioned that killing a population of daughter cells can encourage cancerous precursors to proliferate. "We believe we are relieving the feedback inhibition that's going on, which perhaps explains why, in many cancers the cancer seems to come back an awful lot faster."
The team also found that cancerous changes could be tracked to only two genes, FGFR3 and BRSK3, out of a large region covering 1/20th of the genome. Nolan suspects that dividing cancer cells into subpopulations could improve the power of genetic studies. "One problem with genome-wide association studies is that they're working with four or five different cancers in the same patient," he said.
Recently, Nolan's group has applied single-cell techniques to autoimmune diseases. Rheumatoid arthritis showed reduced activation of STAT1 and other molecules, Nolan said. "This is paradoxical because you often think of autoimmune disease as an inflammatory disease, as opposed to a suppression disease."
By comparing the signaling responses of cells, Nolan said, "we can predict systemic lupus erythematosus outcome three months out," as well as indicating treatments that patients had previously received. "As the immune system is changing as a network, different stimulations and particular phosphoproteins within that cell type, when compared to another cell type, can tell you whether the patient is getting better or getting worse."
New technology promises to expand the power of single-cell techniques. For example, overlapping emission spectra from fluorescently-labeled antibodies currently limit the number of molecular species that can be simultaneously tracked in single-cell experiments. Nolan's group has recently acquired the first "CyToF" instrument, pioneered by Scott Tanner at the University of Toronto. This machine uses antibodies labeled with specific isotopes, which are identified by sensitive inductively coupled plasma mass spectrometry in each vaporized cell, and should be able to track the levels of hundreds of species.
Nolan's group has also teamed with electrical engineers to implement network-inference algorithms in hardware in real time for each cell. "We've been promoting our own DREAM competition on the chip," Nolan quipped.
Where do cancers begin?
For some tumor types, the differentiation cascade of cancer cells mimics that of normal cells, progressing from self-renewing stem cells to progenitor cells to progressively more differentiated cells. But a cancer stem cell could in theory arise in several ways, for example when a normal stem cell gets a cancerous mutation or when a cancerous progenitor or even more differentiated cell acquires the capacity for self-renewal.
"One of the important questions in cancer research is which of those sets of cells is actually going to accumulate all the genetic alterations leading to cancer," said Franziska Michor. "It's important because we want to understand the mechanisms of cancer initiation." Determining the cell of origin may also be useful for modeling the tumors in patients and in the lab. In addition, treatments should treat the whole tumor, and if any surviving cells are of the critical type, they can potentially lead to relapses.
Michor has developed mathematical models for tumor genesis that account for the probabilistic nature of mutations. She defines a differentiation cascade for the tumor cells that parallels the cascade for normal cells, and uses biological insight to define the model parameters. For each candidate cell of origin, she runs the model many times and calculates how the cancer-cell population evolves. "With this model we can follow the accumulation of specific mutations over time," she said.
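The idea of running a probabilistic model many times for each candidate cell of origin can be illustrated with a toy Monte Carlo sketch. This is not Michor's model: the function names, the two-step mutation order, the per-division mutation probabilities, and the division counts are all hypothetical values chosen only to show the mechanics of repeated stochastic runs.

```python
import random


def run_once(rng, divisions, p_first, p_second):
    """One stochastic run of a toy lineage.

    The lineage divides `divisions` times before it dies out. At each
    division it may gain the first mutation (probability p_first) and,
    on a later division, the second (probability p_second). Returns
    True if both mutations were acquired in that order.
    """
    have_first = False
    for _ in range(divisions):
        if not have_first:
            if rng.random() < p_first:
                have_first = True
        elif rng.random() < p_second:
            return True
    return False


def estimate_probability(runs, divisions, p_first, p_second, seed=0):
    """Fraction of independent runs in which both mutations accumulate."""
    rng = random.Random(seed)
    hits = sum(run_once(rng, divisions, p_first, p_second) for _ in range(runs))
    return hits / runs


# Hypothetical comparison: a long-lived lineage versus a short-lived one.
# All parameter values are invented for illustration.
long_lived = estimate_probability(runs=5000, divisions=200, p_first=0.01, p_second=0.01)
short_lived = estimate_probability(runs=5000, divisions=20, p_first=0.01, p_second=0.01)
print(f"long-lived lineage:  {long_lived:.3f}")
print(f"short-lived lineage: {short_lived:.3f}")
```

A realistic model would also track population sizes, fitness effects, and the full differentiation cascade, which this sketch leaves out.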
Michor applied the model to several tumor types to determine the cell of origin in each case. She described the analysis for myeloproliferative neoplasms, whose cancerous state is driven by a mutation in the JAK2 tyrosine kinase. The critical role of this single mutation is similar to that of the Bcr/Abl mutation in chronic myelogenous leukemia, which she had studied previously. This mutation has been shown not to confer self-renewal, though. In one evolutionary scenario leading to a cancer-initiating cell, the JAK2 mutation occurs in a normal stem cell that already has the property of self-renewal. A second evolutionary scenario has a progenitor cell developing self-renewal by acquiring another mutation, and later the JAK2 mutation. In a third, the progenitor gets the JAK2 mutation and later, before the line dies out, acquires self-renewal. In a fourth scenario, a stem cell accumulates a mutation that confers self-renewal to progenitors, which then evolve the JAK2 mutation.
Calculating the probability of cancer initiation for each scenario, Michor found it to be largest for the second scenario: "This suggests that a progenitor is the most likely cell of origin."
Working with colleagues at Memorial Sloan-Kettering, Michor also applied this analysis to gliomas driven by PDGF (platelet-derived growth factor) overexpression, which can be induced by injecting the oncogenic substance into the brain.
Edward Marcotte, University of Texas at Austin
Naama Barkai, Weizmann Institute of Science
- Protein complexes tend to be more conserved during evolution than individual proteins.
- Phenotypes that are driven by orthologous genes can lead researchers to other relevant genes, even when the phenotypes are quite different.
- Flexibility, or the propensity of individual genes to change the ways in which they are expressed, appears on many time scales, from evolutionary periods to everyday expression variation.
- Differences in flexibility among genes correlate with their different promoter structures.
Everything in biology is the product of evolution, but the way that the usefulness of a phenotype translates into selective pressure on the genome depends on how genes work together. Many biological systems, including regulatory networks, are modular, and useful modules tend to persist through evolution even as their relationships change. Researchers are using evolutionary insight to discover new biology, and are also trying to understand the mechanisms that determine the rate of change in different parts of the genome.
In spite of exploding information about sequence variation, said Edward Marcotte, "we have relatively little ability to interpret the consequences of the genetic variation" at the level of health or disease. Individual genetic changes affect the organism by altering the existing functional and physical organizations of proteins. Recent observations suggest that how essential a gene is depends mostly on the complexes its proteins join, Marcotte said. "It's the machine that matters, not the protein."
Marcotte's team has used large-scale gene and protein networks to identify disease-linked genes. In what he described as "guilt-by-association propagation," known disease genes implicate their neighbors in the inferred networks, and the researchers have validated many of these predictions.
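A minimal sketch of the guilt-by-association idea follows, assuming the simplest possible scoring rule: each unlabeled gene is scored by the fraction of its network neighbors that are already known disease genes. The scoring rule, the function name, and the toy network with genes named A through E are assumptions for illustration, not the method or data used by Marcotte's team.

```python
def propagate_guilt(network, known_disease_genes):
    """Rank unlabeled genes by the fraction of neighbors that are known
    disease genes.

    `network` maps each gene to the set of genes it is linked to.
    Returns (gene, score) pairs, highest score first, ties broken by name.
    """
    scores = {}
    for gene, neighbors in network.items():
        if gene in known_disease_genes:
            continue  # already labeled, nothing to predict
        if not neighbors:
            scores[gene] = 0.0
            continue
        guilty = sum(1 for n in neighbors if n in known_disease_genes)
        scores[gene] = guilty / len(neighbors)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy network; gene names are placeholders.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
ranking = propagate_guilt(toy_network, known_disease_genes={"A", "B"})
for gene, score in ranking:
    print(gene, score)
```

Real networks carry weighted, probabilistic links, so a practical scheme would weight each neighbor by link confidence rather than counting neighbors equally.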
But there is much more protein and genetic information available for model organisms than for people, Marcotte stressed. "Even with the genome-wide association study explosion we still have fewer than 2000 gene-phenotype associations in human at an organismal level."
Researchers could better exploit the data from model organisms, Marcotte said, if they could answer questions like "What's the worm equivalent of breast cancer?" To this end, his team has been exploring the usefulness of orthologous phenotypes, or "phenologs" between distant species.
To identify candidate genes associated with a trait in one species, the researchers start with a known gene. They then look for orthologous genes in another species, and determine what phenotype they relate to in that species. In many cases, the phenotypes in the two species have no obvious relationship. But gene interactions tend to persist through evolution, even when the modules they form are used for new purposes.
As a result, other genes involved in the orthologous phenotype are more likely to also be involved in the original phenotype of interest. "Where this becomes powerful is that almost all phenotypes that we have are very undersampled for the genes associated with a phenotype," Marcotte said, so the approach will often flag candidate genes that haven't been tested yet. For example, genes affiliated with breast and ovarian cancer in humans significantly overlap worm genes that cause excess male progeny.
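One common way to judge whether two gene sets overlap more than expected by chance is a hypergeometric test, sketched below. The report does not say which statistic the team used, so the choice of test is an assumption, as are the function name and all the counts in the example.

```python
from math import comb


def overlap_p_value(universe, set_a, set_b, overlap):
    """Probability of seeing at least `overlap` shared genes by chance.

    universe: number of genes with orthologs in both species
    set_a:    genes (as orthologs) linked to the phenotype in species 1
    set_b:    genes (as orthologs) linked to the phenotype in species 2
    overlap:  observed number of genes in both sets
    """
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p = overlap_p_value(universe=2000, set_a=12, set_b=16, overlap=3)
print(f"P(overlap >= 3) = {p:.2e}")
```

When many phenotype pairs are screened at once, the resulting P-values would also need a multiple-testing correction.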
"We found many interesting mappings between model organism phenotypes and human diseases," Marcotte said, including obvious relations like that between mouse and human cataracts. "But the important thing is we get also surprising cases that you wouldn't necessarily anticipate."
"Ancient systems of genes retain their functional coherence even as their function varies, and can predict disease genes in distantly related organisms."
In another example, sensitivity of yeast to the cholesterol drug lovastatin parallels angiogenesis in mammals. "We're identifying ancient systems of genes that predate the split of these organisms, and ones that in each case retain their functional coherence and do one thing in yeast, and the genes stay together as a system and in the animal case they do something else."
Most surprisingly, Marcotte said, "we can find hundreds of good mappings between plant defects and mammalian defects." For example, genes that disrupt the response of Arabidopsis plants to gravity are orthologous to genes associated with diverse defects called Waardenburg syndrome, which results from problems with neural-crest cells. The researchers confirmed that other vertebrate genes that are orthologous to those that disrupt gravitropism in plants indeed affect the migration of neural-crest cells in developing frogs.
"This small network of genes worked together and has retained the tendency to work together all of the way through the development of the plant lineage and animal development," Marcotte noted. His team is actively exploring many other candidate disease genes that the phenolog approach identified.
Some aspects of biological systems remain remarkably constant through evolution, even as others change. Naama Barkai argued that genes differ from one another in their variability. Together with her former student Itay Tirosh, she traced some of those differences to differing organizations of the genes' promoters.
Flexibility, or the propensity for changing gene expression, occurs on several different time scales. Over evolutionary times, comparison of different yeast species shows that genes vary in how fast they evolve, Barkai said. "Some genes are quite conserved between species or strains, some other genes change rapidly."
This result makes evolutionary sense, Barkai said. "You always have this interplay between being able to adapt to different conditions and being robust" under current conditions. The relative tradeoff between flexibility and robustness depends on the gene, she noted. "It's fair to say that different genes have to be positioned differently on this scale."
On shorter time scales, "different genes have different dynamic ranges when you change conditions or have mutations." A third aspect is the noise, or intrinsic variability in the expression level for each gene. The response on these different time scales is "all very much connected," Barkai said.
On both short and evolutionary time scales, different genes have different degrees of flexibility, which are connected with their promoter structure.
Statistically, the different types of flexibility are all connected to the structure in genes' promoter regions. One of the most important determinants is the presence or absence of the well-known "TATA box" sequence in the promoter, Barkai said. During evolution, "genes that diverge quickly are much more likely to contain a TATA box than those which diverge slowly." Other researchers have shown that this feature is also associated with noisier expression. "One sees this on all three levels of flexibility of time: noise, regulation, and also in evolution."
The arrangements of nucleosomes in the promoter region also vary in a characteristic fashion, Tirosh and Barkai found. It is well known that nucleosome binding in the promoter region changes expression levels, but it also affects the flexibility of expression. Low-flexibility genes, Barkai said, typically have a well-defined nucleosome-free region, with two well-positioned neighboring nucleosomes. "This pattern is also conserved in humans."
In contrast, high-flexibility genes often have a TATA box and show no gap in nucleosome occupancy in their promoter region. Barkai is currently exploring the mechanisms by which the promoter structure modifies flexibility, for example by measuring the allele-specific expression in hybrids between two yeast species.
In addition to understanding what determines flexibility, it is important to understand how it evolves, since there can be no direct evolutionary selection pressure for future adaptability. The correlations between genes' flexibility on different time scales hint at a way that evolvability may appear as a favorable side effect of selection for shorter-term responsiveness.
After a successful introduction in 2008, three conferences on genetic regulation, systems biology, and network identification reconvened in early December 2009 in Cambridge, Massachusetts. Over five tightly scheduled days, the meeting combined the 6th RECOMB Satellite Conference on Regulatory Genomics, co-chaired by Manolis Kellis and Ziv Bar-Joseph, the 5th RECOMB Satellite Conference on Systems Biology, chaired by Andrea Califano, and the 4th DREAM Conference, chaired by Gustavo Stolovitzky. The first two conferences had previously spun off from the RECOMB conference on Research in Computational Molecular Biology, while the DREAM conference, Dialogue for Reverse Engineering Assessments and Methods, had arisen with a more focused goal of evaluating systems-biology tools for building networks and assessing the limitations of quantitative prediction in biology.
Because of the large overlap between the conferences, the following summaries draw together the keynote talks that share common elements, irrespective of which conference they were officially part of. This eBriefing contains reports and multimedia focusing on the keynote talks, which addressed the following general topics. Please use the navigation above to find more detailed reports and video from the conference.
Stem cell networks
The potential of stem cells, both for research and treatment, has only been recognized in recent years. Nonetheless, intense study has shown that a handful of transcription factors maintain these cells in their versatile, pluripotent state. In addition to this core regulatory network, Richard Young noted that other factors, including microRNA, signaling, and chromatin structure play key roles in reinforcing pluripotency and its eventual transition toward differentiated cell types. To clarify the nature of this transition, Ihor Lemischka emphasized the need to track it over time using a variety of measurements, including nuclear protein levels, which are surprisingly poorly correlated with the levels of their corresponding messenger RNAs.
The early development of model organisms is a simplified context for exploring gene regulation, but one that remains challenging in detail for even simple animals. John Reinitz described the complex mathematical modeling needed to describe precisely how different factors combine to shape the patterning of Drosophila embryos. In contrast to the common view that each factor targets different subsets of genes, Mark Biggin showed that the factors are often found at varying levels at many common genes, with expression depending on their precise levels as well as chromatin structure. He also described techniques for tracking gene expression with cellular resolution in developing flies. Robert Waterston showed similar mapping results for the worm C. elegans, where the fixed pattern of cell divisions allows researchers to connect each dividing cell to its eventual fate in the mature organism.
At the level of human populations, individual patients differ significantly in how their diseases progress and respond to treatment. Kevin White showed how the tools of systems biology illuminate their differences and point to new approaches to diagnosis and therapy. Even with a single patient, the presence of a subpopulation of cells that responds differently to stimuli than another subpopulation can affect the disease prognosis. Garry Nolan illustrated how single-cell studies can better classify patients with cancers or autoimmune diseases. Cancer itself can start with mutations in a single cell. Franziska Michor simulates the evolving population of cells to determine which cell in the lineage is likely to suffer the critical mutation that sets the disease in motion.
The proteins resulting from gene expression determine biology through their interactions with one another, for example in complexes containing multiple proteins or in pathways in which proteins sequentially affect one another. Nevan Krogan showed how integrating information about genetic interactions enhances the view provided by direct measurements of protein-protein interactions. These techniques illuminate processes ranging from the composition of molecular complexes down to the individual amino acids through which proteins interact. Michael Yaffe explored the interactions that make up a signaling network, emphasizing that the response to even familiar signaling molecules can vary widely with the cellular context. He also showed that tweaking models built on precise measurements of many responses under a variety of circumstances can predict new biological phenomena.
The networks of interacting molecules in biology today are the products of billions of years of evolution, but the principles of how varying phenotypes influence the selection of the underlying genotype are not fully understood. Evolutionary biologists have found that complex biological innovations, such as particular pathways or complexes, remain relatively fixed through evolution, even as they are re-used for different functions. Edward Marcotte showed that this principle can be a powerful predictor of disease genes. Even between distantly related organisms, orthologous proteins (that is, proteins in different species that derive from a common genetic ancestor) often interact in similar groups, although their functions have no obvious relationship. For example, Marcotte's team found new genes for Waardenburg syndrome by looking at disruptions of gravity-sensing in plants. Even as modules stay fixed, their relationships change. Naama Barkai showed that some genes evolve quickly and also change their expression in response to regulation and intrinsic noise. She related some of these differences to different classes of promoter structure for the genes.
In spite of steady progress in understanding the regulatory processes governing biological systems, there remains much to learn, and new phenomena and viewpoints continue to emerge. Nikolaus Rajewsky reminded listeners of the widespread regulatory role of microRNAs and other small RNAs, which are often missing from network models. New experimental techniques could help change this situation by allowing a systematic genome-wide characterization of small RNAs and of the thousand or so RNA-binding proteins in the human genome. Walter Fontana suggested that the combinatorial possibilities in even small networks can give rise to logjams in the assembly of protein complexes, especially ring structures, and cause extreme variability in the molecular contents of similar cells. As a complement to the traditional description in terms of molecular reactions, he advocated a rule-based formalism that can vastly simplify the description and uncover new principles governing complex biological systems.
The DREAM challenges
The DREAM conference arose in 2006 out of a concern that there was little way to tell whether the "reverse engineered" networks being generated from biological data agreed with "reality," or even with each other. The DREAM challenges have become a central tool for addressing that situation, inviting teams of researchers to submit "predictions" that are then compared with highly trusted "gold standards." After the predictions are scored—which is a challenge of its own—the best predictors are invited to describe their techniques at the conference.
Under the guidance of Gustavo Stolovitzky and Robert Prill, DREAM4 included three challenges. Philip Kim described the first, which asked teams to predict the position weight matrices describing the specificities of peptide recognition domains from proteins and of kinases. Seungpyo Hong described a structure-based model for peptide recognition, while Jonathan Ellis presented the established Predikin algorithm for kinases. The second challenge was described by Daniel Marbach, who had also generated biologically-inspired computer-generated networks for DREAM3. Robert Küffner, Nicola Soranzo, Alex Greenfield, and Vân Anh Huynh-Thu discussed the diverse techniques that led their respective teams to best-performer status for different subchallenges. The third challenge was also refined from one presented in DREAM3. As described by Julio Saez-Rodriguez, the previous version omitted a few items from a very large set of measurements of a signaling network in liver cancer and normal cells, and was effectively tackled by regression techniques with no biological insight. This year's challenge used less data and required prediction of a new type of data as well as a network. The best performers, represented by Federica Eduati and John Schwacke, each generated networks using simple mathematical models and then used a more sophisticated model to predict the measurements.
In addition to the keynote talks and DREAM challenges discussed here, the Cambridge conferences included many fascinating contributed talks, an even larger number of posters, and innumerable stimulating discussions in between.
Gustavo Stolovitzky, IBM
Robert Prill, IBM
Challenge 1: Peptide Recognition Domain
Philip Kim, University of Toronto
Seungpyo Hong, KAIST
Jonathan Ellis, University of Queensland
Challenge 2: In Silico Networks
Daniel Marbach, Massachusetts Institute of Technology
Robert Küffner, Ludwig Maximilian University
Nicola Soranzo, Center for Advanced Studies, Research and Development in Sardinia (CRS4)
Alex Greenfield, New York University
Vân Anh Huynh-Thu, University of Liège
Challenge 3: Signaling Network
Julio Saez-Rodriguez, Harvard Medical School and MIT
Federica Eduati, University of Padua
John Schwacke, Medical University of South Carolina
- DREAM, the Dialogue for Reverse Engineering Assessments and Methods, challenges researchers to compare and benchmark their techniques for addressing difficult systems biology problems.
- This meeting discussed three challenges: deducing the specificity of protein-recognition domains, inferring the structure of computer-built transcription networks, and predicting a laboratory-measured signaling network.
- The best predictions of peptide-recognition domains exploit information about the three-dimensional structure of proteins.
- The most useful data for predicting the topology of computer generated networks came from doing gene knockouts.
- Successful predictors of signaling used simple models to develop their own networks rather than tweaking a reference network, and used more sophisticated models to predict the network dynamics.
A central feature of the annual DREAM (Dialogue for Reverse Engineering Assessments and Methods) conference is the DREAM Challenges. Before each meeting, the organizers solicit predictions of computational tasks of interest to systems biology. The goal is to compare how different techniques work on the same problem, quantitatively assessing their performance against a "gold standard" solution.
Under the guidance of Gustavo Stolovitzky and Robert Prill of IBM, and with Web support from Nikhil Podduturi and Michael Honig of Columbia, the DREAM4 meeting marked the culmination of the third set of DREAM challenges, which continue to be refined. This year, three challenges were posed, covering peptide recognition domains, in silico transcription networks, and signaling networks. Together, the committee gathered 432 predictions from 53 different teams. These predictions "become a data set themselves," said Prill, and the organizers are pulling the data together to look for larger messages. "We like to think of it as a community-wide experiment," he said.
The following sections describe the design of each of the challenges, the overall results, and the strategies used by the best performers.
DREAM Challenge 1: Peptide recognition domain specificity prediction
The first DREAM challenge does not directly involve biological networks, although the results are relevant to networks. The challenge, constructed by Philip Kim, Gary Bader, and David Gfeller of the University of Toronto, asked participants to predict what sequences of amino acids would be targeted by three common peptide recognition domains.
Peptide recognition domains are modular protein subunits, intermediate in size between proteins and individual amino acids, that have become progressively more abundant in higher eukaryotes. "The challenge was to create the specificities" of the domains, Kim said. "If you know the specificity then you can predict the binding [between proteins] in vivo."
The organizers chose "the simplest representation of biological specificity," the position weight matrix, or PWM. For peptide sequences comprising 10 amino acids, the PWM is a 10 by 20 matrix containing the relative probability of binding each of the 20 amino acids at each position. The team decided that the benefits that would arise from the familiarity and power of this approach outweighed its inherent neglect of the known correlations in binding at different positions.
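To make the 10 by 20 layout concrete, the sketch below builds a PWM from a few aligned 10-residue peptides and scores a new peptide. The example peptides, the pseudocount of 1, and the function names are assumptions for illustration and do not reproduce the challenge data or any team's method.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pwm(peptides, pseudocount=1.0):
    """Return one {amino acid: probability} row per peptide position.

    Each position is counted independently, which is the simplification
    noted above: correlations between positions are ignored.
    """
    length = len(peptides[0])
    pwm = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[pos]] += 1
        total = sum(counts.values())
        pwm.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return pwm


def score_peptide(pwm, peptide):
    """Product of per-position probabilities for one peptide."""
    prob = 1.0
    for pos, aa in enumerate(peptide):
        prob *= pwm[pos][aa]
    return prob


# Hypothetical aligned binders, each 10 residues long.
binders = ["AAPPLPPRNR", "KLPPLPSRPS", "RAPPLPPKPG"]
pwm = build_pwm(binders)
print(len(pwm), "positions x", len(pwm[0]), "amino acids")
print(score_peptide(pwm, "AAPPLPPRNR") > score_peptide(pwm, "WWWWWWWWWW"))
```

Because every row is normalized on its own, the matrix cannot express a preference such as "arginine at position 8 only when proline is at position 7," which is the neglected correlation the organizers accepted.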
"Structure-based models are outperforming machine-learning methods for predicting the specificity of peptide-binding domains."
"Computational predictions of these binding specificities are slowly but surely coming into their own," Kim said. He contrasted two extreme approaches to the problem. On one hand are "physics guys who like to play with structure-based models," including three-dimensional structures and interaction energies. "On the other are machine-learning methods," he said, which try to mimic patterns on training data for peptide binding. Based on the competition results so far, "it seems as if structure-based, or at least structure-aided models, are still outperforming the machine-learning based methods in this challenge," Kim said.
The overall challenge included three subchallenges, asking participants to predict specificity for five domains from the SH3 family, three kinases, and five from the PDZ family. In each case, participants were given the amino-acid sequence of domains from the family, as well as examples of peptides bound by each domain. They were asked to describe, in the form of a position weight matrix, how likely each domain would be to bind other peptide sequences. These predictions were then compared to unpublished experiments.
To gauge the significance of the results, the organizers compared their performance to that of uniform random matrices, although some thought that this naïve model was too easy to beat. In fact, some teams did even worse than the random matrix. "When we did more complicated null models," said organizer Robert Prill, "some of the subchallenges didn't produce best performers, so for the purpose of today we're using this very naïve setup that always has a significant winner—well, best performer."
The organizers employed this null model to determine the probability (P-value) of obtaining a result at least as good by chance. The final ranking was based on the geometric mean of the P-values for each of the instances in the subchallenge, Prill said. "We're looking for consistently good predictions across all of the domains."
The data for the SH3-domain subchallenge was provided by Haiming Huang and Sachdev Sidhu at the University of Toronto. Four teams submitted predictions. The best results came from the Protein BioInformatics Lab (PBIL) at KAIST. Seungpyo Hong, together with Dongsup Kim and Taesu Chung, used an approach that examined many possible conformations for the complex made by the domain and the target peptide.
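The scoring scheme described above can be sketched in two steps: an empirical P-value against scores from a null model, and a geometric mean across domains. The convention that a higher score is better, the add-one correction, the function names, and all numbers are assumptions for illustration; the report does not give the organizers' exact metric.

```python
import math


def empirical_p_value(observed_score, null_scores):
    """Fraction of null-model scores at least as good as the observed one.

    Assumes a higher score is better. The add-one correction keeps the
    estimate above zero when no null score beats the observed score.
    """
    as_good = sum(1 for s in null_scores if s >= observed_score)
    return (as_good + 1) / (len(null_scores) + 1)


def geometric_mean(p_values):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(p) for p in p_values) / len(p_values))


# Hypothetical per-domain P-values for two teams.
consistent_team = [0.01, 0.01, 0.01]
uneven_team = [0.0001, 0.5, 0.5]
print(geometric_mean(consistent_team))
print(geometric_mean(uneven_team))
```

In this toy comparison the consistent team receives the lower (better) combined value even though the uneven team has the single best P-value, which matches the stated aim of rewarding consistently good predictions across all domains.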
As a starting point for the structure, the researchers used homology, looking at the known structures of proteins with the highest sequence similarity. They calculated the binding energy to candidate peptides using the FoldX algorithm. "But in nature many different peptide bindings are possible," Hong said, "so they could have different structures for different sequences." For this reason the team identified classes of structure for the complex, and did molecular dynamics calculations to determine the precise configuration and energy of each one, for a particular peptide.
The final predictions were a combination of many different configurations with the highest binding energies to the peptide. "We think that this kind of approach will be promising in cases where no binding structure is known and no complex structures are known," Hong said.
Data for evaluating the substrates of kinases came from Benjamin Turk at Yale University. Three teams participated in this subchallenge, and the best predictions came from an existing tool for kinase-substrate binding called Predikin. Jonathan Ellis described the technique, which also reflected the work of Ross Brinkworth, Neil Saunders, and Bostjan Kobe.
Predikin begins by searching the sequence for kinase catalytic domains. These candidates are evaluated for "substrate-determining residues," which are the particular amino acids whose close interaction with a potential substrate protein determines whether it will bind. The precise positions that determine binding are identified by analogy to other kinases, for example at a particular offset from a recognized motif.
Specificity is inferred from databases of other kinases with similar substrate-determining residues. "We look to see what these kinases actually bind to," Ellis said. Once the sites are identified, the data are combined to construct the final position weight matrix.
The PDZ-domain subchallenge took advantage of data from Andreas Ernst and Sachdev Sidhu, at the University of Toronto. Seven teams submitted results for this challenge. The best predictor was the team "Chuck Daly," headed by Phil Bradley of the Fred Hutchinson Cancer Research Center, who did not attend the meeting in Cambridge.
Challenge 2: In silico networks
The second DREAM challenge is a perennial favorite: inferring a set of "in silico" networks. Generating networks on the computer is really the only way to have a true gold standard that is known perfectly, although it is not always clear whether other networks might be equally compatible with the data presented. "We can do systematic and efficient performance evaluation because we know the networks, because they're simulated," said Daniel Marbach of MIT.
The benchmarks of this challenge were created using GeneNetWeaver (GNW), an open-source Java tool for the generation of biologically plausible in silico gene networks. However, any similarity between these artificial networks and their biological cousins is limited by the designer's knowledge of the relevant principles, and the synthetic versions probably do not interest biologists much. But they clearly enable the best possible tests of how well algorithms deal with data.
Daniel Marbach again generated the challenge data with Thomas Schaffter, Claudio Mattiussi, and Dario Floreano—former colleagues at the Swiss Federal Institute of Technology in Lausanne. As in the DREAM3 challenges, the network topology was chosen for these small networks to mimic the module-based structure found in biological networks. "These modules can correspond to functional properties of the network and have very similar structural properties," Marbach said. "They are representative samples of complete networks."
Within this topology, the researchers embedded models for both protein and mRNA levels and for their modification by transcription, translation, and degradation; only the transcriptional part is regulated. The transcriptional model is sophisticated enough to include both independent actions by transcription factors and cooperative or competitive binding between them. "Compared to the complexity of real biological networks it's a very simple kinetic model, yet it's already much more accurate and detailed than most models that are used by inference methods," Marbach said.
The researchers use this dynamic model to generate data simulating various biological experiments. They also include noise in the model. In contrast to the previous Gaussian noise model, the data this year included noise in the form of stochastic differential equations (a Langevin framework). The noisy expression time courses that these models produce are given to participants, who are asked to infer the original network topology.
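The report does not give the equations used to generate the data, so the following is only a generic illustration of how noise enters through a stochastic differential equation. It integrates a single-species production and degradation model with the Euler-Maruyama scheme. The model form, the square-root noise terms, the clamping at zero, and all parameter names are assumptions made for this sketch.

```python
import math
import random


def simulate_langevin(production, degradation, x0, dt, steps, noise_scale, seed=0):
    """Euler-Maruyama integration of dx = (k - d*x) dt + noise terms.

    Each reaction (production, degradation) gets its own Wiener increment,
    scaled by the square root of its rate, in the style of a chemical
    Langevin equation. Setting noise_scale to 0 gives the deterministic model.
    """
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for _ in range(steps):
        drift = production - degradation * x
        dw_prod = rng.gauss(0.0, math.sqrt(dt))
        dw_deg = rng.gauss(0.0, math.sqrt(dt))
        diffusion = noise_scale * (
            math.sqrt(production) * dw_prod - math.sqrt(degradation * x) * dw_deg
        )
        # Clamp at zero so the concentration never becomes negative.
        x = max(0.0, x + drift * dt + diffusion)
        trajectory.append(x)
    return trajectory
```

Unlike measurement noise added after the fact, noise of this kind feeds back into the dynamics, so fluctuations at one time point influence later ones.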
Several forms of output data are provided, starting with the unperturbed "wild-type" expression levels. The simulated experiments include removal of each gene in the network, one at a time, analogous to biological knockouts. Knockdown experiments reduce the activity of each gene in turn by half. This year the researchers for the first time included multifactorial perturbations, simulating the effect of many gene changes at the same time. "This could be thought of as changes in the network," Marbach said, perhaps analogous to environmental or genetic diversity in the population. The researchers also calculate a time series response to the perturbation of a smaller number of genes.
Challenge 2 included three subchallenges. The first sought the structure of five different 10-gene networks, based on their wild-type, knockout, knockdown, multifactorial, and time series response. The second subchallenge asked the same for five 100-gene networks, except without the multifactorial perturbation. The responses to multifactorial perturbation for size-100 networks were pulled out as the third subchallenge. Finally, an optional bonus round asked teams to predict the results of pairs of knockouts, which were data that were not part of the training set.
Teams summarized their predicted network topology as a ranked list of edges between the various genes. By including more and more candidates, starting at the top, the organizers can treat the list as a binary classification task, said Robert Prill. As they vary the cutoff from very selective to very inclusive, they evaluate how accuracy degrades using two well-known characteristics, the "ROC" curve and the precision-recall curve.
The area under either curve is "a way to assign a single number to a prediction without imposing any specific cutoff," Prill said. Using randomly sorted lists as a null model, the organizers compute a P-value for how likely it would be to get that area by chance. The final score averages the negative logs of the P-values for the two areas under the curve. This year, "the best performers are doing much better than random on both areas," for the "classic" DREAM data types. For the new multifactorial data, however, the predictions were not so impressive.
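The scoring idea can be sketched as follows. This is not the organizers' code: the area under the precision-recall curve is approximated by average precision, the P-values are empirical estimates from shuffled lists, and the base-10 logarithm is an assumption, since the report does not state the base. The input is the ranked edge list converted to 1 (edge is in the gold standard) or 0 (it is not), most confident prediction first.

```python
import math
import random


def auroc(labels):
    """Fraction of (true edge, false edge) pairs that are ranked in the right order."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one true and one false edge")
    negatives_below = 0
    ordered_pairs = 0
    for label in reversed(labels):  # walk up from the bottom of the ranking
        if label:
            ordered_pairs += negatives_below
        else:
            negatives_below += 1
    return ordered_pairs / (pos * neg)


def average_precision(labels):
    """Mean precision at the rank of each true edge (approximates the PR area)."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one true edge")
    return total / hits


def empirical_p(stat_fn, labels, n_null=1000, seed=0):
    """Share of randomly sorted lists that score at least as well as the prediction."""
    rng = random.Random(seed)
    observed = stat_fn(labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_null):
        rng.shuffle(shuffled)
        if stat_fn(shuffled) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_null + 1)


def combined_score(labels, n_null=1000, seed=0):
    """Average of -log10(P) for the two areas; the log base is an assumption."""
    p_roc = empirical_p(auroc, labels, n_null, seed)
    p_pr = empirical_p(average_precision, labels, n_null, seed)
    return -0.5 * (math.log10(p_roc) + math.log10(p_pr))
```

Because both areas integrate over every possible cutoff, a team is never asked to decide how many edges to call.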
"A diversity of methods is being effectively used to extract in silico networks."
Prill noted that blending predictions from different methods tended to be better than any single method. Overall, "a diversity of methods is being used," he said. "It comes down to the talent of the team," rather than a particular method or philosophy. For the optional double-knockout predictions, Prill said, a simple null model, regression, simply adds the effects of the individual knockouts. For the 10-node network, he said, "several teams are doing better than regression." For the 100-node networks, however, none of the submitted predictions performed better than "this kind of regression where you're just taking the sum of the two."
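The additive null model for double knockouts can be written in a few lines. The function name and the example expression values are hypothetical; the only element taken from the report is that the effects of the two single knockouts are summed.

```python
def additive_double_knockout(wild_type, knockout_a, knockout_b):
    """Predict double-knockout expression by summing the single-knockout effects.

    Each argument is a list of expression levels, one value per gene.
    The result is not clamped, so it can in principle fall below zero.
    """
    return [
        wt + (a - wt) + (b - wt)
        for wt, a, b in zip(wild_type, knockout_a, knockout_b)
    ]


# Hypothetical three-gene example: gene 0 is knocked out in A, gene 2 in B.
wild_type = [1.0, 0.8, 0.5]
knockout_a = [0.0, 0.4, 0.5]
knockout_b = [1.0, 0.6, 0.0]
prediction = additive_double_knockout(wild_type, knockout_a, knockout_b)
```

Any method that captures interactions between the two knocked-out genes has to beat this baseline to show that it has learned something beyond the single-knockout data.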
For the 10-node networks, Team Amalia, including Robert Küffner, Florian Erhard, Tobias Petri, Lukas Windhager, and Ralf Zimmer of Ludwig Maximilians University, outperformed 28 other teams. The team adopted a fuzzy logic approach to simulate candidate models, parameterizing the interactions by simple rule tables. To evaluate candidate models, they compared these simulated results to the response data, exploiting all of the datasets provided in the challenge.
To improve the candidates, they employed a genetic algorithm on a population of models, including crossover rules that mix up their attributes. In addition, they used a simulated annealing algorithm to decide if a particular change in rule should be accepted or rejected, based on an effective temperature that sometimes allows relatively low-cost changes. Progressively lowering the temperature allows the system to approach a global optimum without being trapped in locally optimal arrangements.
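The acceptance step of simulated annealing can be illustrated with a toy problem. The report does not say which acceptance rule or cooling schedule Team Amalia used; the Metropolis rule, the geometric cooling, and the integer toy problem in the usage note are assumptions chosen because they are the most common textbook form.

```python
import math
import random


def accept(delta_cost, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


def anneal(cost, neighbor, start, t_start=10.0, cooling=0.995, steps=2000, seed=0):
    """Minimize `cost` by proposing neighbors while the temperature is lowered."""
    rng = random.Random(seed)
    current = best = start
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if accept(cost(candidate) - cost(current), temperature, rng):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return best
```

For example, `anneal(lambda x: (x - 7) ** 2, lambda x, rng: x + rng.choice((-1, 1)), 0)` searches the integers for the minimum of a simple quadratic. At high temperature almost any move is accepted; as the temperature falls, the search behaves more and more like plain descent.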
The 100-node network challenge drew submissions from 19 teams, two of whom described their techniques. In both cases the final method included only the knockout data, ignoring the knockdown and time series data that were provided.
One of the teams consisted of Nicola Soranzo, with Andrea Pinna and Alberto de la Fuente, at CRS4 in Sardinia. They honed their technique by analyzing data from the related DREAM3 challenge, as well as from their own simulator, SysGenSIM, employing data sets with noise patterns similar to those in the DREAM4 data.
To assign edges between two genes, the Sardinian team tried several techniques, but settled on the z-score, which is essentially the absolute distance from the mean expression of one gene when the other gene is knocked out, normalized by the standard deviation. "We did not use at all the knockdown or time series," Soranzo said.
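A minimal sketch of z-score edge ranking follows. It is not the CRS4 team's code: the report does not say how the mean and standard deviation were estimated, so this version computes them for each target gene across all knockout experiments other than the target's own, which is an assumption. The matrix layout and function names are also hypothetical.

```python
from statistics import mean, pstdev


def knockout_z_scores(knockout):
    """Score each edge regulator -> target from a knockout expression matrix.

    knockout[i][j] is the expression of gene j when gene i is knocked out.
    """
    n = len(knockout)
    scores = {}
    for target in range(n):
        # Expression of the target in every knockout except its own.
        column = [knockout[ko][target] for ko in range(n) if ko != target]
        mu = mean(column)
        sigma = pstdev(column)
        for regulator in range(n):
            if regulator == target:
                continue
            deviation = abs(knockout[regulator][target] - mu)
            scores[(regulator, target)] = deviation / sigma if sigma > 0 else 0.0
    return scores


def ranked_edges(scores):
    """Edges sorted from most to least confident."""
    return sorted(scores, key=scores.get, reverse=True)
```

A gene whose expression shifts strongly only when one particular gene is removed produces a large z-score for that pair, which places the corresponding edge near the top of the submitted list.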
To go further with this relatively simple approach, the researchers pruned the reconstructed network topology. They decreased the weight of edges that indicated a direct influence between two genes if the indirect influence mediated by other nodes was already stronger than some threshold. "What we wanted to do was down-rank these extra edges since they are not essential for reachability and a lot of these feed-forward loops are present," Soranzo said. "This allowed us to be first."
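The down-ranking step might look like the sketch below. The details are assumptions: only two-step paths through a single intermediate gene are considered, a path counts as strong when both of its edges reach the threshold, and the direct edge is multiplied by a fixed penalty. The report specifies none of these choices.

```python
def down_rank_shortcuts(weights, threshold, penalty=0.5):
    """Reduce the weight of i -> j when a strong path i -> k -> j already exists.

    `weights` maps (regulator, target) pairs to edge scores. The original
    weights are used for every check, so the result does not depend on the
    order in which edges are visited.
    """
    genes = {g for edge in weights for g in edge}
    pruned = {}
    for (src, dst), weight in weights.items():
        has_strong_path = any(
            weights.get((src, mid), 0.0) >= threshold
            and weights.get((mid, dst), 0.0) >= threshold
            for mid in genes
            if mid not in (src, dst)
        )
        pruned[(src, dst)] = weight * penalty if has_strong_path else weight
    return pruned
```

In a chain A to B to C, knocking out A also changes C, so a z-score method tends to report a spurious A to C edge; lowering its weight moves it below the two edges that actually explain the effect.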
Alex Greenfield, with Aviv Madar, Harry Ostrer, and Richard Bonneau, also used only knockout data, since their experience with DREAM3 led them to conclude that this type of data was the most informative. "The knockout data performs well, it's powerful, we wanted to use it for DREAM4," Greenfield said.
For the "knockout data we pretty much used z-scores," although "we need to account for gene-specific noise" for the DREAM4 data, Greenfield said. But although this "does very well at predicting topology," the researchers sought more insight and predictive power by combining the z-score predictions with predictions from Inferelator, their method based on ordinary differential equations. Overall, this dynamic method underperformed the z-score method when the median expression of the gene was high. However, the Inferelator performed better than the z-score method for genes with low median expression, and allowed for the prediction of double knockouts.
To get the best predictions out of the 12 teams that submitted to the size-100 multifactorial subchallenge, Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts of the University of Liège used a machine learning approach based on regression tree ensemble methods.
To inform their supervised learning algorithm, the researchers regarded one gene as the output, and looked for rules that best approximated that output, using the activity of the other genes as inputs. They used the Random Forests™ method to generate the ensemble of regression trees. Then, an edge between two genes was assigned a score equal to the importance of the first gene as an input, to predict the expression of the second gene, as derived from the Random Forests model.
"Our approach is a non-parametric approach and we don't make any assumption about how the data are distributed or how the genes are supposed to be connected to regulate the genes," Huynh-Thu said. "Our method is scalable and can be easily parallelized and it can be extended to handle various types of data."
This method can also be adapted to handle time-series data and, for the size-10 and size-100 challenges, it would have ranked at the third and eighth positions respectively. However, based on calibration experiments on the DREAM3 challenge, the researchers eventually decided to submit the predictions of a much simpler approach that scores an edge by the absolute change in expression in one gene when the other gene is knocked out. "This is a very naïve procedure, but still it would have ranked at the second position in DREAM3 10 and 100," Huynh-Thu said. However, against the better predictions in DREAM4, it did not work so well.
Challenge 3: Predictive signaling network modeling
The third DREAM challenge was also a refinement of a previous challenge, constructed by Julio Saez-Rodriguez and Peter Sorger, both of MIT and Harvard Medical School, and Leonidas Alexopoulos and Giannis Melas of the National Technical University of Athens. The data are drawn from extensive experiments in which normal and liver cancer cells are exposed to a variety of stimuli, including extracellular ligands and small molecule inhibitors, and interrogated for internal signals such as protein activation and responses such as cytokine production.
One question that arose in the DREAM2 challenges was whether the gold-standard networks used to evaluate predictions might reflect the prejudices of the organizers, so that emulating their philosophy could be an inappropriately effective strategy. To address that issue, the signaling challenge for DREAM3 was based completely on observable data, simply omitting particular conditions from a very large multidimensional array of observations.
As it turned out, the most successful predictors did not create a network model at all, but instead looked for regularities in the observations to fill in the missing data. "These were able to predict the data quite well but were not able to say much about the signaling network," said Saez-Rodriguez. "Based on this experience," he said, the researchers sought predictions "that are able to tell us more about how the network is functioning."
"Previous techniques predict the data quite well but were not able to say much about the signaling network."
To allow as many computational approaches as possible, the researchers used a much smaller subset of the original data, about 500 points instead of 10,000. Cancer cells were subjected to combinations of one of four extracellular ligands and one inhibitor, and the levels of seven phosphoproteins were measured at three time points.
In addition, the organizers provided a network, based on the current public knowledge of the signaling pathway pertaining to this challenge. Predictors were asked to refine this network by adding or removing links, and to use this network to predict the phosphoprotein responses at a specific time point to various pairs of inhibitors and pairs of ligands.
The overall score for each team was the average prediction score, minus a complexity cost, said Robert Prill. "It rewards teams that predict well, and penalizes teams that use densely connected networks." As in the other challenges, the prediction score is an average for the seven phosphoproteins. "We're looking for consistency," Prill said.
At first, Prill said, there was "no clear delineation between performers. Most teams are predicting well." But after post-conference refinement three teams clearly outperformed the others. Unfortunately, one of the three (Casual_Learning_Without DAGs, formed by Kevin Murphy and David Duvenaud from the University of British Columbia) was recognized as a best performer only in the post-conference analysis. Therefore, only the other two best performers were invited to present at the conference.
The team from the University of Padua included Federica Eduati, Alberto Corradin, Barbara Di Camillo, and Gianna Toffolo. Because of the rather sparse information on the time course and the network, Eduati said, they used a simple Boolean representation to classify whether a particular combination of stimulus and inhibitor is affecting the protein or not. The minimum change in protein that they designated as a true influence, relative to its intrinsic variability, could be varied to adjust the sparsity of the resulting network.
Based on this classification, the team built a network containing no hidden nodes. To predict the output of their network in response to the combinations, the researchers linearly combined the given responses. For a combination of two ligands plus an inhibitor, for example, they added the change induced by each ligand alone in the presence of the inhibitor and subtracted the change induced by the inhibitor alone.
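The linear combination described for two ligands plus an inhibitor can be written directly. The function name, argument names, and numbers are hypothetical; each argument is the change in one phosphoprotein relative to the untreated baseline.

```python
def predict_two_ligands_with_inhibitor(
    change_ligand1_inhibitor, change_ligand2_inhibitor, change_inhibitor
):
    """Combine single-ligand responses measured in the presence of the inhibitor.

    The inhibitor's own effect is contained in both single-ligand measurements,
    so it is subtracted once to avoid counting it twice.
    """
    return change_ligand1_inhibitor + change_ligand2_inhibitor - change_inhibitor


# Hypothetical changes in one phosphoprotein signal.
predicted_change = predict_two_ligands_with_inhibitor(0.6, 0.3, 0.1)
```

With these made-up values the predicted change is 0.8: the two ligand-specific contributions (0.5 and 0.2) plus the inhibitor's own effect (0.1).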
In addition to predicting the combined outputs rather well, the inferred network had many edges in common with the one provided by the organizers. However, the original network was not used in the construction, except to tune the parameter that determines the sparsity of their own network.
A second team that performed well for the signaling challenge, STeam, was John Schwacke of the Medical University of South Carolina. Like the Padua team, he used separate models to create the network topology and to predict the outputs, in what he called a two-stage approach.
To see how well candidate networks replicated the training data, Schwacke used a linear model for the dynamics that allowed closed-form mathematical descriptions. He used an evolutionary algorithm to explore possible networks, adding or subtracting nodes to create new candidates to replace the lower-performing networks, and chose the final network based on how frequently edges were used.
For the final predictions, Schwacke described each node using nonlinear ordinary differential equations that included saturation. The 35 parameters of the resulting model were chosen by fitting the training data, and then used to predict the combination experiments. The time trajectories of some phosphoproteins matched the observations well, but others did rather poorly, Schwacke said. "We're still evaluating what worked and didn't."
That statement could serve as a summary of the DREAM challenges overall. Both the designs of the challenges and the algorithms used to attack them continue to evolve, but there is steady improvement in both.