Presented by Merck Research Laboratories, IBM, and the NIH Roadmap Magnet Center
Crossing Paths: The RECOMB Regulatory Genomics / Systems Biology / DREAM Conference

Posted January 28, 2009
Overview
From October 29 to November 2, 2008, in Cambridge, Massachusetts, the RECOMB Regulatory Genomics / Systems Biology / DREAM conference brought three separate conferences together in a single venue. The meeting combined the 5th RECOMB Satellite Conference on Regulatory Genomics, chaired by Manolis Kellis; the 4th RECOMB Satellite Conference on Systems Biology, chaired by Andrea Califano; and the 3rd DREAM Conference, chaired by Gustavo Stolovitzky.
The first two conferences had previously spun off from the RECOMB conference on Research in Computational Molecular Biology. RECOMB was founded in 1997 as a forum for computer science issues in biology, and was last held in March–April 2008 in Singapore. The DREAM conference (Dialog for Reverse Engineering Assessments and Methods) began in 2006 with a more focused goal of evaluating systems biology tools for building biological networks.
This eBriefing focuses on the keynote talks of the three conferences, and pulls together common themes that were officially presented in separate conferences.
Sponsorship
This conference and eBriefing were made possible with support from Merck Research Laboratories, IBM, and the NIH Roadmap Magnet Center.
- 00:01 1. Introduction
- 03:17 2. Robustness (part 1)
- 09:18 3. Model for exploring robustness
- 13:56 4. Test for robustness in this model
- 15:55 5. Evolutionary causes of robustness
- 18:49 6. Robustness (part 2)
- 24:51 7. Head and neck squamous cell carcinoma
- 31:18 8. Observations
- 33:34 9. Conclusions and acknowledgement
- 00:01 1. Introduction; Post-transcriptional gene regulation; Small RNAs
- 05:38 2. Small RNA cloning
- 10:44 3. Visualizing micro RNAs
- 12:37 4. Targeting micro RNAs
- 15:20 5. Accommodating the target
- 19:29 6. Statistics
- 22:10 7. Locating the truly used binding sites
- 30:56 8. The mRNP code
- 31:34 9. Acknowledgement
- 00:01 1. Introduction
- 04:28 2. Analyzing the metabolite data
- 06:59 3. Problems with traditional gene expression experiments
- 10:00 4. The chemostat
- 13:07 5. Dilution rate series in chemostat
- 21:52 6. Relation of heat shock and growth rate
- 26:11 7. Conclusions: distinguishing stress and growth rate effects
- 27:55 8. Glucose wasting and survival after starvation
- 31:54 9. How stressful is slow growth
- 00:01 1. Introduction to challenge 1
- 01:56 2. Self/nonself discrimination by T lymphocytes
- 02:52 3. Single-cell analysis reveals variability in cell response
- 06:49 4. Determining the roles of CD8, Shp-1, and Erk-1
- 07:42 5. Flow cytometry data allows for theoretical prediction
- 08:27 6. Model for T cell regulation by Shp-1 and Erk-1
- 09:27 7. Insights from computational model
- 10:10 8. Summary
- 00:01 1. Introduction; The challenge
- 02:07 2. Quantifying system response to input signal; Local system response
- 05:44 3. A top-down approach; Untangling the signaling wires
- 11:46 4. The mechanistic model; A bottom-up approach
- 17:05 5. Ligand-dependent responses in MCF7 cells; Coherent feedback loops
- 20:14 6. Ubiquitous control mechanisms of c-Fos expression
- 22:22 7. General control principles; Simple motifs; Acknowledgement
- 00:01 1. Introduction
- 01:19 2. Challenge 1: signaling cascade identification
- 02:24 3. Challenge 1 submissions
- 03:41 4. Scoring challenge 1
- 06:00 5. Challenge 1 conclusions
- 06:52 6. Challenge 2: Signaling response prediction
- 08:44 7. Basis for scoring challenge
- 10:31 8. Scores for phosphoproteins and cytokines
- 11:39 9. Challenge 2 conclusion
- 00:01 1. Introduction; Posttranscriptional regulation; Context-dependence
- 01:46 2. Illumina mRNA-Seq
- 05:33 3. Alternative transcript events
- 10:52 4. Variation in isoform expression
- 13:55 5. Switch-like regulation and Psi
- 18:30 6. Splicing and alternative cleavage
- 21:30 7. Which motifs are conserved?
- 23:56 8. hnRNP H
- 29:06 9. Summary; Acknowledgement
Web Sites
The Cancer Genome Atlas project
The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.
The Connectivity Map 02
The Connectivity Map (also known as cmap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections between drugs, genes, and diseases through the transitory feature of common gene-expression changes.
The Coriell Cell Repositories
The Coriell Cell Repositories provide essential research reagents to the scientific community by establishing, verifying, maintaining, and distributing cells, cultures, and DNA derived from cell cultures. These collections, supported by funds from the National Institutes of Health (NIH) and several foundations, are extensively utilized by research scientists around the world.
DREAM Initiative
Since its first meeting in May 2006, the Dialog on Reverse Engineering Assessment and Methods has been working to judge the effectiveness of techniques for describing the networks of interacting molecules that underlie biological systems. Basic information is available here. The Academy has published eBriefings on the first organizational meeting, the first DREAM conference, and the DREAM2 conference.
Gerstein Lab: Networks
Links to software developed by Mark Gerstein's group at Yale for declaring, visualizing, and analyzing networks, as well as to databases containing network representations of several systems.
ImMunoGeneTics database
IMGT is a collection of high-quality integrated databases specializing in immunoglobulins, T cell receptors, and the major histocompatibility complex (MHC). It was created in 1989 by Marie-Paule Lefranc at the Universite Montpellier 2, CNRS. A European project since 1992, IMGT works in close collaboration with the European Bioinformatics Institute (EBI). IMGT consists of sequence databases (IMGT/LIGM-DB, a comprehensive database of IG and TR from human and other vertebrates, with translation for fully annotated sequences, IMGT/MHC-DB, IMGT/PRIMER-DB), genome database (IMGT/GENE-DB) and structure database (IMGT/3Dstructure-DB), Web resources (IMGT Marie-Paule page) and interactive tools.
Personal Genome Project
Founded by George Church, this project aims to get to the point where there is a critical mass of interested users, tools for obtaining and interpreting genome information, and supportive policy, research, and service communities.
Pseudogene.org
A comprehensive database of identified pseudogenes, utilities to find pseudogenes, various publication data sets and a pseudogene knowledgebase, maintained by Mark Gerstein's lab at Yale.
SmiRNAdb
SmiRNAdb is a database containing information about mammalian (human, mouse, and rat) small RNAs that were cloned by the Tuschl Lab. The total number of small RNAs in the database is 224,515 for human, 86,414 for mouse, and 21,141 for rat.
The VISTA Enhancer Browser
The VISTA enhancer browser is a tool for exploring enhancer sequences that have been functionally tested in mice.
Yeast Memory Device
Movie of engineered transcriptional memory circuit that persists as yeast cells divide.
Books and Journals
Alon U. 2006. An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC, Boca Raton, FL.
Partner journals for the conference
Journal of Computational Biology
Journal Articles
John Tyson
Csikász-Nagy A, Kapuy O, Tóth A, Pál C, et al. 2009. Cell cycle regulation by feed-forward loops coupling transcription and translation. Mol. Syst. Biol. (in press).
Novák B, Tyson JJ. 2008. Design principles of biochemical oscillators. Nat. Rev. Mol. Cell. Biol. 9: 981-991.
Tyson JJ, Chen KC, Novak B. 2003. Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr. Opin. Cell Biol. 15: 221-231.
Tyson JJ, Novak B. 2008. Temporal organization of the cell cycle. Curr. Biol. 18: R759-R768.
Boris Kholodenko
Birtwistle MR, Hatakeyama M, Yumoto N, et al. 2007. Ligand-dependent responses of the ErbB signaling network: experimental and modeling analyses. Mol. Syst. Biol. 3: 144. Full Text
Kholodenko BN, Kiyatkin A, Bruggeman FJ, et al. 2002. Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Proc. Natl. Acad. Sci. USA 99: 12841-12846.
Kholodenko BN. 2006. Cell-signalling dynamics in time and space. Nat. Rev. Mol. Cell Biol. 7: 165-176.
Sontag E, Kiyatkin A, Kholodenko BN. 2004. Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data. Bioinformatics 20: 1877-1886.
Yalamanchili N, Zak DE, Ogunnaike BA, et al. 2006. Quantifying gene network connectivity in silico: scalability and accuracy of a modular approach. Syst. Biol. 153: 236-246. Full Text
Pamela Silver
Ajo-Franklin CM, Drubin DA, Eskin JA, et al. 2007. Rational design of memory in eukaryotic cells. Genes Dev. 21: 2271-2276. Full Text
Swinburne IA, Miguez DG, Landgraf D, Silver PA. 2008. Intron length increases oscillatory periods of gene expression in animal cells. Genes Dev. 22: 2342-2346.
Daphne Koller
Chechik G & Koller D. 2009. Timing of gene expression responses to environmental changes. Journal of Computational Biology 16 (in press). (PDF, 413 KB) Full Text
Chechik G, Oh E, Rando O, et al. 2008. Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network. Nat. Biotechnol. 26: 1251-1259.
Fiedler D, et al. Functional organization of the S. cerevisiae phosphorylation network. (under review)
David Botstein
Boer VM, Amini S, Botstein D. 2008. Influence of genotype and nutrition on survival and metabolism of starving yeast. Proc. Natl. Acad. Sci. USA 105: 6930-6935. Full Text
Brauer MJ, Yuan J, Bennett BD, et al. 2006. Conservation of the metabolomic response to starvation across two divergent microbes. Proc. Natl. Acad. Sci. USA 103: 19302-19307. Full Text
Brauer MJ, Huttenhower C, Airoldi EM, et al. 2008. Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast. Mol. Biol. Cell 19: 352-367. Full Text
Saldanha AJ, Brauer MJ, Botstein D. 2004. Nutritional homeostasis in batch and steady-state culture of yeast. Mol. Biol. Cell 15: 4089-4104. Full Text
Uri Alon
Kashtan N, Alon U. 2005. Spontaneous evolution of modularity and network motifs. Proc. Natl. Acad. Sci. USA 102: 13773-13778. Full Text
Kashtan N, Noor E, Alon U. 2007. Varying environments can speed up evolution. Proc. Natl. Acad. Sci. USA 104: 13711-13716. Full Text
Parter M, Kashtan N, Alon U. 2007. Environmental variability and modularity of bacterial metabolic networks. BMC Evol. Biol. 7: 169. Full Text
Aviv Bergman
Belbin TJ, Bergman A, Brandwein-Gensler M, et al. 2007. Head and neck cancer: reduce and integrate for optimal outcome. Cytogenet. Genome Res. 118: 92-109.
Bergman A, Siegal ML. 2003. Evolutionary capacitance as a general feature of complex gene networks. Nature 424: 549-552.
MacCarthy T, Bergman A. 2007. Coevolution of robustness, epistasis, and recombination favors asexual reproduction. Proc. Natl. Acad. Sci. USA 104: 12801-12806. Full Text
MacCarthy T, Bergman A. 2007. The limits of subfunctionalization. BMC Evol. Biol. 7: 213. Full Text
Siegal ML, Bergman A. 2002. Waddington's canalization revisited: developmental stability and evolution. Proc. Natl. Acad. Sci. USA 99: 10528-10532. Full Text
Timothy Hughes
Badis G et al. & Hughes TR 2008, Molecular Cell (in press)
Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, et al. 2008. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature
Lee W, Tillo D, Bray N, Morse RH, et al. 2007. A high-resolution atlas of nucleosome occupancy in yeast. Nat. Genet. 39: 1235-1244.
Bing Ren
Barrera LO, Li Z, Smith AD, et al. 2008. Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs. Genome Res. 18: 46-59. Full Text
Heintzman ND, Stuart RK, Hon G, et al. 2007. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39: 311-318.
Hon G, Ren B, Wang W. 2008. ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 4: e1000201. Full Text
Xi H, Shulha HP, Lin JM, et al. 2007. Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genet. 3: e136. Full Text
Eddy Rubin
Prabhakar S, Visel A, Akiyama JA, et al. 2008. Human-specific gain of function in a developmental enhancer. Science 321: 1346-1350.
Visel A, Prabhakar S, Akiyama JA, et al. 2008. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40: 158-160.
Mark Gerstein
Gerstein MB, Bruce C, Rozowsky JS, et al. 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17: 669-681. Full Text
Kim P, Lam YK, Alexander E, et al. 2008. Analysis of copy number variants and segmental duplications in the human genome: evidence for a change in the process of formation mechanism in recent evolutionary history. Genome Res. 18: 1865-1874. Full Text
Rozowsky JS, Newburger D, Sayward F, et al. 2007. The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci. Genome Res. 17: 732-745. Full Text
Wang LY, Abyzov A, Korbel JO, et al. 2008. MSB: A mean-shift-based approach for the analysis of structural variation in the genome. Genome Res. (in press)
Washietl S, Pedersen JS, Korbel JO, et al. 2007. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res. 17: 852-864. Full Text
Zhang ZD, Paccanaro A, Fu Y, et al. 2007. Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res. 17: 787-797. Full Text
Zhang ZD, Rozowsky J, Snyder M, et al. 2008. Modeling ChIP sequencing in silico with applications. PLoS Comput Biol. 4: e1000158. Full Text
Chris Burge
Wang ET, Sandberg R, Luo S, et al. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature 456: 470-476.
Thomas Tuschl
Hafner M, Landgraf P, Ludwig J, et al. 2008. Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing. Methods 44: 3-12.
Landgraf P, Rusu M, Sheridan R, et al. 2007. A mammalian microRNA expression atlas based on small RNA library sequencing. Cell 129: 1401-1414.
Todd Golub
Beroukhim R, Getz G, Nghiemphu L, et al. 2007. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc. Natl. Acad. Sci. USA 104: 20007-20012. Full Text
Cancer Genome Atlas Research Network (231 collaborators). 2008. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061-1068.
Hieronymus H, Lamb J, Ross KN, et al. 2006. Gene expression signature-based chemical genomic prediction identifies a novel class of HSP90 pathway modulators. Cancer Cell 10: 321-330. Full Text
Lamb J, Crawford ED, Peck D, et al. 2006. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313: 1929-1935.
Stegmaier K, Ross KN, Colavito SA, et al. 2004. Gene expression-based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nat. Genet. 36: 257-263.
Douglas Lauffenburger
Jones S, Zhang X, Parsons DW, et al. 2008. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321: 1801-1806.
George Church
Dantas G, Sommer MO, Oluwasegun RD, Church GM. 2008. Bacteria subsisting on antibiotics. Science 320: 100-103.
Forton JT, Udalova IA, Campino S, et al. 2007. Localization of a long-range cis-regulatory element of IL13 by allelic transcript ratio mapping. Genome Res. 17: 82-87. Full Text
Lunshof JE, Chadwick R, Vorhaus DB, Church GM. 2008. From genetic privacy to open consent. Nat. Rev. Genet. 9: 406-411.
Lunshof JE, Chadwick R, Church GM. 2008. Hippocrates revisited? Old ideals and new realities. Genomic Med. 2: 1-3. Full Text
Park IH, Arora N, Huo H, et al. 2008. Disease-specific induced pluripotent stem cells. Cell 134: 877-886.
Porreca GJ, Zhang K, Li JB, et al. 2007. Multiplex amplification of large sets of human exons. Nat. Methods 4: 931-936.
Shendure J, Porreca GJ, Reppas NB, et al. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728-1732.
Shendure JA, Porreca GJ, Church GM. 2008. Overview of DNA sequencing strategies. Curr. Protoc. Mol. Biol. Jan; Chapter 7: Unit 7.1.
Leona Samson
Fry RC, Svensson JP, Valiathan C, et al. 2008. Genomic predictors of interindividual differences in response to DNA damaging agents. Genes Dev. 22: 2621-2626. Full Text
Workman CT, Mak HC, McCuine S, et al. 2006. A systems approach to mapping DNA damage response pathways. Science 312: 1054-1059.
DREAM
Challenge 1: Signaling Cascade
Feinerman O, Veiga J, Dorfman JR, et al. 2008. Variability and robustness in T cell activation from regulated heterogeneity in protein levels. Science 321: 1081-1084.
Feinerman O, Germain RN, Altan-Bonnet G. 2008. Quantitative challenges in understanding ligand discrimination by αβ T cells. Mol. Immunol. 45: 619-631. Full Text
Challenge 2: Signaling-Response Prediction
DataRail: an open-source MATLAB toolbox for linking experimental data to mathematical models.
Saez-Rodriguez J, Goldsipe A, Muhlich J, et al. 2008. Flexible informatics for linking experimental data to mathematical models via DataRail. Bioinformatics 24: 840-847. Full Text
King G, Honaker J, Joseph A, & Scheve K. 2001. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review 95: 49-69. (PDF, 571 KB) Full Text
Challenge 3: Gene-Expression Prediction
Gustafsson M, Hörnquist M, & Lombardi A. 2005. Constructing and analyzing a large-scale gene-to-gene regulatory network—lasso-constrained inference and biological validation. IEEE/ACM Trans Comput. Biol. Bioinform. 2: 254-61.
Harbison CT, Gordon DB, Lee TI, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99-104.
Natarajan K, Meyer MR, Jackson BM, et al. 2001. Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol. Cell. Biol. 21: 4347-4368. Full Text
Challenge 4: In-Silico Networks
Organizers
Manolis Kellis, PhD
Massachusetts Institute of Technology
Manolis Kellis is an assistant professor of computer science at MIT, a member of the Computer Science and Artificial Intelligence Laboratory, and an associate member of the Broad Institute of MIT and Harvard. He is the recipient of a National Science Foundation Career Award (2007), the Karl Van Tassel 1925 Career Development Professorship, and the Distinguished Alumnus 1964 Career Development Professorship. He was selected as one of the 35 top young innovators under the age of 35 by Technology Review magazine, one of 20 young scientists recognized as the Principal Investigators of the Future by Genome Technology magazine, and one of three scientists representing the next generation in biotechnology by the Museum of Science in Boston.
Kellis obtained his PhD from MIT, where he received the Sprowls award for the best doctorate thesis in computer science, the first Paris Kanellakis graduate fellowship, and the Chorafas Foundation award. His research is in the field of computational biology, developing algorithms and machine learning techniques to interpret complete genomes, understand gene regulation, reconstruct cellular networks, and study genome evolution. Prior to computational biology, he worked on artificial intelligence, sketch and image recognition, robotics, and computational geometry at MIT and at the Xerox Palo Alto Research Center.
Andrea Califano, PhD
Columbia University
Andrea Califano is professor of biomedical informatics at Columbia University, where he leads several cross-campus activities in computational and systems biology. Califano is also codirector of the Center for Computational Biochemistry and Biosystems, chief of the bioinformatics division, and director of the Genome Center for Bioinformatics.
Califano completed his doctoral thesis in physics at the University of Florence and studied the behavior of high-dimensional dynamical systems. From 1986 to 1990, he was on the research staff in the Exploratory Computer Vision Group at the IBM Thomas J. Watson Research Center, where he worked on several algorithms for machine learning, including the interpretation of two- and three-dimensional visual scenes. In 1997 he became the program director of the IBM Computational Biology Center, and in 2000 he cofounded First Genetic Trust, Inc., to pursue translational genomics research and infrastructure-related activities in the context of large-scale patient studies with a genetic component.
Gustavo Stolovitzky, PhD
IBM Computational Biology Center
Gustavo Stolovitzky is manager of the Functional Genomics and Systems Biology Group at the IBM Computational Biology Center in IBM Research. The Functional Genomics and Systems Biology group is involved in several projects, including DNA chip analysis and gene expression data mining, the reverse engineering of metabolic and gene regulatory networks, modeling cardiac muscle, describing emergent properties of the myofilament, modeling P53 signaling pathways, and performing massively parallel signature sequencing analysis.
Stolovitzky received his PhD in mechanical engineering from Yale University and worked at the Rockefeller University and the NEC Research Institute before coming to IBM. He has served as Joliot Invited Professor at Laboratoire de Mecanique de Fluides in Paris and as visiting scholar at the physics department of the Chinese University of Hong Kong. Stolovitzky is a member of the steering committee at the Systems Biology Discussion Group of the New York Academy of Sciences.
Speakers
Uri Alon, PhD
Weizmann Institute of Science
Grégoire Altan-Bonnet, PhD
Memorial Sloan-Kettering Cancer Center
Aviv Bergman, PhD
Albert Einstein College of Medicine
Guillaume Bourque, PhD
Genome Institute of Singapore
David Botstein, PhD
Princeton University
Chris Burge, PhD
Massachusetts Institute of Technology
George Church, PhD
Harvard Medical School
Neil Clarke, PhD
Genome Institute of Singapore
Mark Gerstein, PhD
Yale University
Todd Golub, MD
Broad Institute and Dana-Farber Cancer Institute
Nicolas Guex, PhD
Swiss Institute of Bioinformatics
Mika Gustafsson
Linköping University
Timothy Hughes, PhD
University of Toronto
Boris Kholodenko, PhD
Thomas Jefferson University
Daphne Koller, PhD
Stanford University
Douglas Lauffenburger, PhD
Massachusetts Institute of Technology
Daniel Marbach
Ecole Polytechnique Fédérale de Lausanne
Robert Prill, PhD
IBM Computational Biology Center
Bing Ren, PhD
Ludwig Institute for Cancer Research and University of California, San Diego
Eddy Rubin, PhD
Lawrence Berkeley National Laboratory
Julio Saez-Rodriguez, PhD
Harvard Medical School and Massachusetts Institute of Technology
Jianhua Ruan, PhD
University of Texas at San Antonio
Leona Samson, PhD
Massachusetts Institute of Technology
Pamela Silver, PhD
Harvard Medical School
Thomas Tuschl, PhD
The Rockefeller University
John Tyson, PhD
Virginia Polytechnic Institute and State University
Kevin Yip
Yale University
Don Monroe
Don Monroe is a science writer based in Murray Hill, New Jersey. After getting a PhD in physics from MIT, he spent more than fifteen years doing research in physics and electronics technology at Bell Labs. He writes on biology, physics, and technology.
Speakers:
Todd Golub, Broad Institute and Dana-Farber Cancer Institute
Douglas Lauffenburger, Massachusetts Institute of Technology
Highlights
- DNA sequence alone will not reveal all disruptions of signaling pathways in cancer, but measurements of protein levels and activity are more likely to.
- A high-throughput, bead-based assay of gene expression signatures in cells can screen for small molecules that may have useful effects.
- The Src inhibitor dasatinib (Sprycel, Bristol-Myers Squibb) may be useful against glioblastoma as well as the leukemia for which the FDA approved it.
- Expression signatures provide a common language for finding relationships between diseases, genetic changes, and drugs, and it may be possible to capture these signatures with as few as a thousand transcripts.
- Characterizing network response requires exposing cells to a broad range of conditions (cues), measurements across multiple pathways (signals), and connection to cell phenotypes (response).
- Simple linear models relating the many external cues to internal signals and internal signals to responses can reveal important aspects of how cancer cells differ from normal cells.
- Although normal and cancer cell networks include the same nodes, their effective connectivity looks different.
- Suppression of innate immune response may be an important part of liver cancer progression.
Screening small molecules
The first results from the Cancer Genome Atlas, released in the fall of 2008, confirmed the complexity of even a single type of cancer, in this case glioblastoma. But although researchers found a huge variety of mutations, "what was gratifying was that this was not just a sprinkling of mutations randomly across the genome," said Todd Golub. "Rather, these were falling together in a set of pathways that were increasingly well understood in cancer," even though a mutation of any particular element in the pathway was relatively uncommon.
Other important disruptions, such as aberrant expression of ligands of a relevant kinase, might not be obvious in the DNA sequences of the pathway members, Golub suggested. "You couldn't capture this if you were focusing on the coding sequence of a kinase gene per se," he noted. To better monitor the pathway itself, Golub and his colleagues have developed methods for watching the state of kinase phosphorylation, as a surrogate for its activity, rather than depending on messenger RNA (mRNA) levels or even protein abundance.
Using a bead-based assay to look for desired changes in the gene expression signature of cells (right) provides an alternative to biochemistry-based screening (left).
Instead of requiring antibodies to specific phosphorylation states of specific kinases, the researchers use a combination of tyrosine kinase-specific antibody and an antibody to generic phosphokinases. The first antibody is anchored to luminescent, color-coded, polystyrene beads, the second antibody to a different luminescent reporter. The beads are measured with a two-laser Luminex system that reports both the kinase and its phosphorylation state. This system requires about 1000-fold fewer cells than mass spectrometry, and involves much less time and effort.
The new profiling method suggested that activation of the kinase Src, the product of a well-known oncogene, was particularly common in glioblastoma. "This was odd because there was no hint of Src activation at the genome level in the Cancer Genome Atlas," Golub observed. Further experiments confirmed that the Src inhibitor dasatinib (Sprycel, Bristol-Myers Squibb), which is FDA approved for leukemia, inhibits growth of human glioblastoma xenografts in immunodeficient mice. Although there was already a drug to target Src, Golub said, "In many cases such drugs don't exist for a known target, or you don't know what target you want to go after." This motivated his team to update the old-fashioned, cell-based approach to drug discovery by characterizing biological states through a signature pattern of gene expression. The researchers devised a way to detect mRNA signatures using the Luminex bead platform. They expose cells to a variety of small molecules, and look for those that cause the desired shift in expression pattern. They then explore these candidate molecules using more time-consuming conventional methods.
The Connectivity Map uses gene expression signatures as a common representation to connect diseases, genes, and drugs.
Golub and colleagues used this gene expression-based high-throughput screening, or GE-HTS, to look for molecules that push an acute-myeloid leukemia cell to differentiate, avoiding the arrested development that characterizes the disease. They identified FDA-approved drugs Iressa (gefitinib, AstraZeneca) and Tarceva (erlotinib, Genentech and OSI Pharmaceuticals) and confirmed that they indeed induce differentiation, even though these cells do not express the known target receptor for these drugs. "There's an off-target effect," Golub commented, "but in this case not a side effect, but a salutary effect."
Gene expression signatures can also help in a more general sense to make connections between drugs, targets, and diseases. "The concept here is to use gene expression as the lingua franca of these domains," Golub said. His lab has created a database of expression data called the Connectivity Map to let users search, for example, for small molecules whose expression signature resembles a desirable one. In one example, outside researchers looked for signatures similar to known histone deacetylase (HDAC) inhibitors, and the Connectivity Map flagged valproic acid, which was later also found to be an HDAC inhibitor. Golub and his colleagues are currently exploring whether the expression of a smaller subset of genes can adequately represent the signature.
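The idea of searching a signature database can be illustrated with a minimal sketch. It is not the Connectivity Map's actual scoring method: it assumes each signature is simply a vector of expression changes over the same ordered set of genes, and it swaps in plain cosine similarity as the matching rule. The compound names and numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two expression-change vectors over the same genes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero signature carries no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_signature(query, reference_profiles):
    """Return (name, score) pairs, most similar to the query first."""
    scored = [
        (name, cosine_similarity(query, profile))
        for name, profile in reference_profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical log-fold changes for four genes (illustrative values only).
query = [2.0, -1.0, 0.5, -2.0]
reference_profiles = {
    "compound_A": [1.8, -0.9, 0.4, -2.1],   # closely mimics the query
    "compound_B": [-2.0, 1.0, -0.5, 2.0],   # exactly reverses the query
    "compound_C": [0.1, 0.0, -0.1, 0.2],    # weak, mostly unrelated change
}

ranking = rank_by_signature(query, reference_profiles)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

In this toy setting, a compound at the top of the ranking mimics the query signature, while one at the bottom reverses it. Either end could be of interest, depending on whether the query represents a desirable or a disease state.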
Contrasting networks
Like Golub, Douglas Lauffenburger is working to understand cancer at the level of dynamic protein operations, but not in the context of DNA or mRNA expression. His primary interest is in which pathways are disrupted, not the specific mutations that disrupt them in a particular patient. "I think the last place you find it is at the sequence level," he asserted.
Understanding information processing in signaling networks, Lauffenburger emphasized, requires a wide range of measurements. First, the response must be characterized for a broad range of extracellular "cues," such as ligands and growth factors. Second, researchers must measure internal "signals"—such as protein activity—across the various pathways and processes to understand the internal dynamics of the network. Finally, the results must be related to some phenotypic "response," such as apoptosis or cytokine release. "Unless you know it's related to something the cell does," Lauffenburger said, "it might just be epiphenomena—noise."
Networks must be characterized for a wide range of external cues, internal signals, and phenotypic responses.
Lauffenburger illustrated this framework for liver cancer, which has "no good therapeutic." The researchers subjected cell lines to a variety of stimuli, including lipopolysaccharide, pathogens, growth factors, and pro-death and inflammatory cytokines. "All of these things might be individually or in combinations driving your network," Lauffenburger said. He used small-molecule inhibitors to dysregulate different aspects of the pathways, although he noted that RNA interference would also be a natural tool for this purpose.
The team measured signals, including the phosphorylation state of numerous proteins. Finally, they characterized the response in terms of apoptosis, necrosis, and cell division as well as cytokine release. They used the visualization capabilities of the DataRail tool to navigate this extensive data set, which also served as the basis for DREAM3 Challenge 2.
Lauffenburger primarily discussed the results in terms of a multilinear regression analysis that relates the signals to the cues and the responses to the signals. He also described a Boolean-logic model, but remarked that "as simple as multilinear regression is, it ends up pulling out the same types of differences, cell type to cell type." More mechanistic, differential-equation style modeling, he said, would be "way premature with pathways this complex."
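The two-stage regression idea can be illustrated with a toy calculation. The sketch below is not the analysis presented in the talk: the cue levels, the single signal, and the single response are all synthetic, and the helper names (solve, fit_linear, predict) are hypothetical. It simply fits one linear model from cues to a signal and a second from the signal to a response, then chains them to predict the response to an unseen cue combination.

```python
# Toy sketch of two chained multilinear regressions: cues -> signal -> response.
# All numbers are synthetic; nothing here is data from the talk.

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(xs, ys):
    """Ordinary least squares for y ~ w0 + w . x, via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]  # prepend an intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(weights, x):
    """Evaluate the fitted linear model at the feature vector x."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

# Dose levels of two hypothetical cues in six experimental conditions.
cues = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
# One hypothetical internal signal, constructed as 0.5 + 2*cue1 - 1*cue2.
signal = [0.5 + 2 * a - 1 * b for a, b in cues]
# One hypothetical phenotypic response, constructed as 1 + 3*signal.
response = [1 + 3 * s for s in signal]

cue_to_signal = fit_linear(cues, signal)
signal_to_response = fit_linear([(s,) for s in signal], response)

# Chain the two models to predict the response to an unseen cue combination.
new_cue = (2, 0)
predicted_signal = predict(cue_to_signal, new_cue)
predicted_response = predict(signal_to_response, (predicted_signal,))
print(cue_to_signal, signal_to_response, predicted_response)
```

Because the synthetic data are noise-free, the fit recovers the constructed coefficients; real measurements would of course include noise and many more signals and responses.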
The networks that emerge from this analysis for cancer cells are quite different from those that emerge for primary liver cells. For example, the primary cells respond strongly to inflammatory stimuli, while the liver tumor cells do not. "They've got the receptors, but somehow they've been uncoupled from the downstream pathways," Lauffenburger observed. Similarly, the primary cells release a "whole battery of cytokines," primarily through the IKK/NFκB pathway, while the cancer cells release different cytokines, primarily through the p38/HSP27 pathway.
The researchers validated aspects of this altered immune response in four tumor cell lines, although the details of the response varied because the lines contained different mutations. "It may well be that part of tumor progression is evasion of immune surveillance effects," Lauffenburger observed. Using these detailed results, he said, "you can start to make very specific biochemistry hypotheses of what pathways are responsible for the differences," and ultimately for potential therapeutic approaches.
Powerful new experimental tools have created opportunities for diverse groups of researchers to contribute to studies of biology—not just biologists and chemists but physicists, engineers, mathematicians, and computer scientists. Although each group brings unique strengths to bear, they also speak different dialects and are excited by different challenges. One consequence is a proliferation of small and specialized conferences.
On October 29 – November 2, 2008, in Cambridge, Massachusetts, the RECOMB Regulatory Genomics / Systems Biology / DREAM conference bucked this trend by bringing three separate conferences together in a single venue. Over four tightly scheduled days, the meeting combined the 5th RECOMB Satellite Conference on Regulatory Genomics, chaired by Manolis Kellis, the 4th RECOMB Satellite Conference on Systems Biology, chaired by Andrea Califano, and the 3rd DREAM Conference, chaired by Gustavo Stolovitzky.
The first two conferences had previously spun off from the RECOMB conference on Research in Computational Molecular Biology. RECOMB was founded in 1997 as a forum for computer science issues in biology, and was last held in March–April 2008 in Singapore. The DREAM conference (Dialog for Reverse Engineering Assessments and Methods) began in 2006 with a more focused goal of evaluating systems biology tools for building biological networks.
The combined conference included 17 keynote speakers, 72 oral presentations, and 160 posters. There were 500 registered attendees, 80% of whom registered for all three meetings. These statistics, and the stimulating discussions in the talks and poster sessions and in the halls, illustrate the strong overlap in interests for attendees from each group. "I'm not sure what conference I'm talking in," joked speaker Tom Tuschl, "but I guess it doesn't matter."
This eBriefing focuses on the keynote talks of the three conferences, pulling together common themes that were officially presented in separate conferences. A snapshot of the highlights follows below. Continue to our meeting report pages for a more detailed summary, or to our multimedia page for slides, audio, and video from the event.
Highlights
Dissecting and Replicating Functional Motifs
An important challenge in making sense of complex systems-biology networks is understanding how their pieces work together to perform a function. John Tyson illustrated how distinctive patterns of activation and inhibition among small groups of proteins give rise to various steps in the cell cycle. Boris Kholodenko helped to pioneer the classification of such motifs, but he emphasized the importance of quantitative description to understand, for example, how two distinct receptors can act through the same pathway to generate completely different outcomes. Pamela Silver demonstrated that understanding such basic circuits is enough to construct synthetic elements that can be used to record exposure to molecules or test hypotheses about biological mechanisms.
Unveiling Metabolism
Although identifying motifs and other features of network topology is clearly important, Daphne Koller emphasized recurring patterns of network dynamics, which she called "activity motifs." She analyzed metabolic data to uncover numerous instances in which the timing of gene expression appeared to be tuned for "just in time" response, in part by variations in the binding affinity of transcription factors. David Botstein took a broader view of metabolic activities. By augmenting steady-state reactor technology with modern tools, he has distinguished large-scale patterns of response to environmental changes such as stress or temperature. He warned that such systemic responses can swamp the response to specific experimental perturbations.
Beyond Transcription Factors
Although promoters and repressors adjacent to a gene are important regulators of its transcription into RNA, other mechanisms also need to be considered. The wrapping of DNA around histone proteins to form nucleosomes has a strong effect on expression, for example, and Timothy Hughes described a model that predicts where nucleosomes bind both in vitro and in vivo. Chemical modifications of the tails of the histone proteins also modify the expression of the neighboring DNA. Bing Ren illustrated that these modifications show characteristic patterns for promoters, but also different patterns for enhancers, which can be much more distant from the genes they regulate. Eddy Rubin described the time- and tissue-specific expression of these enhancers, which number in the thousands, in a mouse assay. These studies represent efforts to discover meaning in the nearly 99% of the genome that does not code for protein. Mark Gerstein described various aspects of annotation for noncoding genes, including both identifying the functions of local regions and assessing the quality of those identifications. In addition, even for regions whose function is not known, aggregating different regions into meaningful classes based on repetitions and similarities can help researchers to manage and analyze the data.
Post-transcriptional Regulation
The transcription of an RNA copy of DNA is not the last opportunity for regulatory changes. Chris Burge explained that it's not only the cleavage and polyadenylation of transcripts that changes between tissues, but also their splicing to include or exclude exons in the original sequence. Another area of active research concerns regulation of the degradation or translation of messenger RNA, under the control of complexes including proteins and complementary or nearly complementary small RNA sequences. Thomas Tuschl described both the important regulatory features of these processes and methods his lab has devised to clarify how they work.
Signaling and Cancer
Once messenger RNAs have been translated into proteins, they can still regulate one another's activity by phosphorylation or other modifications. Todd Golub and Douglas Lauffenburger both stressed that these modifications can reveal disruptions of signaling pathways, for example in cancer, that are hard to discern from DNA sequence alone. Nonetheless, Golub showed that gene expression as messenger RNA can provide a common language to connect drugs, genes, and diseases. Lauffenburger emphasized the importance of measuring many different responses and internal pathway signals for a wide variety of external cues if one is to make sense of biological networks.
Network Evolution
The networks that we observe today emerged over millions of years of evolution, and learning how that process affects the networks can improve our understanding of both networks and evolution. Uri Alon proposed a mechanism by which evolution could give rise to the modular organizations that are often observed in networks and in biology in general, if the environmental challenges change with time but subtasks that contribute to survival remain the same. Aviv Bergman showed that robustness, in which phenotypes are relatively insensitive to genetic change, arises naturally when there is a complex relationship between genotype and phenotype. He views cancer as a breakdown of this robustness that reveals underlying variations.
Personalized Genetics
In spite of the extensive data being generated around the world, understanding the results of individual genetic variations remains a huge challenge. George Church described the Personal Genome Project, which aims to address this problem by making available both genomes and traits for thousands of individuals. In addition, Church emphasized, the project is collecting epigenomic data, through tissue-specific expression, and environmental exposure history, through retained immune responses. Leona Samson illustrated the remarkable variability among cells from different ethnic groups following exposure to DNA-damaging agents. Surprisingly, two agents that cause similar damage exhibit very different genetic patterns.
Still DREAMing
Gustavo Stolovitzky, Andrea Califano, and a blue-ribbon steering committee organized the DREAM initiative in 2006 to evaluate how accurate and unique the network models of systems biologists were. The conference has brought together experts on aspects of biology, machine learning, and other fields to share their knowledge and to devise ways to answer this question.
One critical aspect of DREAM is a blinded competition, called the DREAM Challenges, in which teams download data, build models, and make predictions. As discussed at previous conferences, the design of such a competition raises many important issues. At the DREAM2 conference, for example, one successful team devised a prediction strategy that leveraged the known biases of the challenge designers. The match between the predicted network and the "gold-standard" target network leaves open the question of whether either matches the true biological network.
To avoid this weakness, the DREAM3 competition emphasized predictions of measurable data, rather than predictions of an unknown network. Nevertheless, the contest entries exposed another issue: many successful predictions did not reverse engineer any network at all, but simply looked for trends in the data provided and used them to fill in missing information. Though the contest produced interesting results, it also made clear that designing reverse-engineering techniques is not the only challenge: designing ways to assess them remains problematic as well.
Speakers:
Timothy Hughes, University of Toronto
Bing Ren, Ludwig Institute for Cancer Research and University of California, San Diego
Eddy Rubin, Lawrence Berkeley National Laboratory
Mark Gerstein, Yale University
Highlights
- Nucleosome binding is affected by competitive binding of transcription factors and other nucleosomes, as well as specific sequences that make their binding unlikely.
- Enhancers, short sequences of DNA that affect the transcription of distant genes, play a critical role in controlling expression in eukaryotes, both spatially and temporally.
- Enhancers and promoters show distinct patterns of chromatin modification, such as chemical modifications of histone proteins. By studying these patterns, researchers can identify enhancers that are active in specific tissues.
- About half of noncoding sequences that are highly conserved across species acted as enhancers in a mouse-embryo assay.
- Chromatin immunoprecipitation and sequencing (ChIP-seq) of DNA bound to the cofactor P300 in particular tissues efficiently predicts enhancers that act in those tissues.
- Annotating the noncoding genome requires identifying functional or recognizable regions, cleaning up the data, and aggregating related regions.
- The genome includes large "forest" and "desert" regions where regulatory elements are much more or less likely than average.
- The proximity between different types of large-scale rearrangement in the human genome suggests a burst of changes about 40 million years ago.
What determines nucleosome positions?
Eukaryotes exhibit many ways to modulate transcription. In one such mechanism, DNA is wrapped around histone proteins to form nucleosomes. This wrapping is important for packing meters of DNA into the nucleus, but it also inhibits the expression of some DNA regions. Nucleosomes have characteristic patterns of positioning and occupancy along the sequence, but as Timothy Hughes observed, scientists still don't know what controls this.
Hughes and his colleagues built a linear model of in vivo nucleosome occupancy, based on known nucleosome positioning sequences and structural features of DNA and transcription-factor binding sites. They used a lasso algorithm to set as many weights as possible to zero. The remaining weights were almost entirely negative, Hughes observed, suggesting that "the nucleosome occupancy landscape may be dominated by nucleosome-excluding sequences."
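The role of the lasso in such a model can be seen in a small example. The sketch below is not the Hughes lab's model: the four features, their weights, and the penalty value are hypothetical, and the design matrix is chosen to be orthogonal so the outcome is easy to check by hand. It fits a linear model by coordinate descent with an L1 penalty, which drives weak or irrelevant weights exactly to zero and shrinks the rest.

```python
# Minimal lasso by coordinate descent on synthetic data (pure Python).
# Objective: (1 / 2n) * sum((y - X w)^2) + penalty * sum(|w|).

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it does not exceed penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso(rows, targets, penalty, sweeps=50):
    """Cycle through the weights, re-fitting each against the partial residual."""
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for x, y in zip(rows, targets):
                fit_without_j = sum(weights[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - fit_without_j)
                norm += x[j] * x[j]
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights

# Orthogonal, zero-mean toy design: 8 hypothetical sequences, 4 hypothetical features.
rows = []
for i in range(8):
    b0, b1, b2 = i & 1, (i >> 1) & 1, (i >> 2) & 1
    rows.append([(-1) ** b0, (-1) ** b1, (-1) ** b2, (-1) ** (b0 ^ b1)])

# Two strongly nucleosome-excluding features, one irrelevant, one very weak.
true_weights = [-2.0, 0.0, -1.0, 0.05]
occupancy = [sum(w * v for w, v in zip(true_weights, x)) for x in rows]

weights = lasso(rows, occupancy, penalty=0.1)
print(weights)
```

In this toy case the irrelevant and very weak features end up with weights of exactly zero, while the two strong features keep negative weights that are shrunk by the penalty.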
"Nucleosome occupancy may be dominated by nucleosome-excluding sequences."
To further test the models outside of the complexities of the cell, Hughes and his collaborators did in vitro experiments and found good agreement. He concluded that the models thus represent "intrinsic properties of DNA nucleosome interactions."
The team examined various 4-nucleotide sequence stretches as other possible predictors of position. "The major feature in terms of the weight is poly-A," Hughes said, a string of four adenosines. Including this feature and a second one representing the overall fraction of guanine and cytosine captured most of the model predictions for in vitro nucleosome occupancy.
The model also worked well for yeast data. But although the data clearly correlate with the model for humans, there were many positions where a nucleosome was predicted to bind but didn't in vivo, Hughes said. He speculated that the more numerous human transcription factors "do a lot more autonomous exclusion" than those in yeast.
In spite of the excellent correlation, however, the model is "pretty far from absolute" in predicting positions, Hughes said. He attributed this limitation to the regular packing of nucleosomes, as well as interactions with other proteins.
Indeed, mutations in transcription factors have a significant impact on nucleosome binding positions. One such transcription factor, called Rsc3, binds to a motif that includes a CGCG sequence, which is abundant in yeast promoters. "CGCG has been sort of a thorn in my side," Hughes noted, since the model had puzzlingly identified this sequence as a predictor of nucleosome exclusion. "It's probably Rsc3 binding," he said.
Seeking enhancers
Until a few years ago, Bing Ren said, his research on gene regulation in humans focused on promoter regions immediately adjacent to a regulated gene. But "research in the last five years has completely changed my view," he said, and the distant sequences known as enhancers "also need to be included in our consideration of the gene regulation network." However, since enhancers can be upstream, downstream, or even in an intron within a gene, Ren said, "we don't really know how to find them."
Enhancers are thought to bind a sequence-specific transcription factor, which recruits co-activators. A loop of DNA then brings this complex close to the promoter, which becomes activated. Virtually all of the co-activators are known to be involved in histone modifications, such as acetylation, methylation, and ubiquitination of the histone proteins that comprise the core of the nucleosome, Ren said. "Such chromatin modifications are not randomly distributed along the genes."
"Histone modifications can be used as a signature for enhancers on a genome-wide scale."
"We had a hypothesis that chromatin modifications could be used as a signature for enhancers," Ren recalled. In earlier work, he and his colleagues had found that promoter regions show distinct modification patterns depending on their activity. All active promoters, Ren said, carry "strikingly similar chromatin modifications."
Looking for such patterns for enhancers, Ren and his colleagues did chromatin immunoprecipitation to identify DNA that binds p300, a histone acetyltransferase that acts as a co-factor for many enhancers. The team identified some 74 sequences that bound p300 in HeLa cells, making them putative enhancers that are active in this cell line. In comparison with promoters, he observed, "enhancers are associated with different patterns of chromatin modification."
The researchers used these observations to develop modification signatures for promoters and enhancers. Validation showed that 90% of the predicted promoters were indeed promoters, and 80% of promoters were identified. For enhancers, at least 65% of the predictions were supported by observations of other chromatin modifications, p300 binding, or DNase hypersensitivity. When the researchers selected nine of the predicted enhancers for further testing, seven robustly activated promoter expression in a transgenic assay.
Surprisingly, although this method flagged some 37,000 putative enhancer sequences in HeLa cells and another 25,000 in K562 cells, "the overlap between these two is quite minimal," Ren said, "less than 6000." Nonetheless, the researchers found that the enhancer positions they predicted correlated with those of genes that are active in HeLa cells.
Ren emphasized that the chromatin modifications associated with enhancers, unlike those associated with promoters, are highly dynamic. For example, only a small subset of the enhancers have the active signature both before and after researchers induce differentiation in embryonic stem cells. "The differences between promoters and enhancers," he concluded, "indicate that enhancers play a critical role in driving cell-specific gene expression."
Enhancers across the genome
Eddy Rubin and his colleagues began their search for enhancers using comparative genomics. "We align genomes of humans and mice and fish, and we statistically look for noncoding conserved regions that are above a threshold," Rubin said. "The most highly evolutionarily constrained sequences are our candidates for being enhancers."
To test these candidates, the team devised a way to insert human sequences into mouse embryos, coupled with a blue-staining reporter gene. At day 11.5, while the entire embryo is still visible, they look for blue-stained tissue as confirmation of enhancer activity in particular tissues at that developmental stage. Some enhancers stimulate characteristic spatial expression patterns in developing limbs, for example, while others appear in the forebrain.
Although this technique is laborious, over the past two and a half years the team has confirmed more than 500 enhancers. Overall, almost half of the candidates act as enhancers in this assay. The team maintains a database of these confirmed enhancers. "Over the next few years," Rubin said, "we hope to have about 3000 enhancers."
In contrast, non-constrained sequences "do not show activity in this assay," Rubin said. The comparative genomics analysis "is pretty good at picking up true positives." He suggested that many false positives may act as enhancers at different points of development. "It's both tissue as well as temporal specificity that are contained with the sequences."
Recently, to improve both the number and the quality of the predictions, as well as to identify non-constrained enhancer sequences, the team has adapted a chromatin-immunoprecipitation/sequencing (ChIP-seq) technique. In collaboration with Bing Ren, they are identifying sequences that bind to the p300 enhancer co-activator. "The prediction is that when we see reads from a particular tissue at a particular time point," Rubin said, "this is an enhancer in that tissue at that time point."
The researchers tested putative enhancers in day-11.5 mouse embryos. In at least 84% of predicted cases, they observed enhancer activity in their assay—almost double the fraction for the comparative-genomics technique. Moreover, the activity occurred in the same tissues where they had seen p300 binding at that developmental stage. The researchers also found that enhancers active in particular tissues were located near genes that were expressed in those tissues.
Enhancers can also tell us more about how humans differ from chimpanzees. Rubin, Jim Noonan (now at Yale University), and Shyam Prabhakar (now at the Genome Institute of Singapore) found enhancers whose sequences were similar across a wide range of vertebrates, including chimps, but differed significantly in humans. The team found that the differences in human sequences caused marked variation in the spatial expression pattern in the developing limbs of a mouse embryo. His former colleagues are now exploring "how you get a human thumb from this element," Rubin said.
Annotating noncoding DNA
Nucleosome position and enhancers are just two parts of the larger question that arises now that the human genome has been sequenced, said Mark Gerstein: "What are most of the bases of the genome doing?" Gerstein illustrated several projects aimed at annotating the noncoding regions of human DNA. Many of his examples drew on the ENCODE project, whose first phase, addressing a representative 1% of the genome, was completed in 2007.
To be useful, annotation should not only survey the genome for signals of functional activity, Gerstein said. It should also look for regularities such as repeated regions and clarify the significance of noisy experiments. Furthermore, annotation should collect small blocks of data into functionally related groups, or presumably nonfunctional but recognizable blocks like pseudogenes.
Considering how best to evaluate data quality, Gerstein described his team's analysis of the significance of ChIP-seq measurements, to determine whether a particular peak really represents binding of a protein to DNA. To do this, they need to understand the measurement background in the absence of binding. "If you assume the genome is uniform, then you find out that doesn't work so well," he noted. "You have to have various elements of nonuniformity." The team built a tool they called PeakSeq, which reproduces the observed power-law distribution of the number of tags. Another project in Gerstein's lab aims to score experiments that identify large regions of the genome that may be duplicated or removed.
Working with ENCODE data, Gerstein and his collaborators looked for patterns in regulatory elements at intermediate length scales, roughly 50 kilobases. Collecting all elements into regional bins, he said, "some bins are highly enriched in binding, and some are highly depleted. We decided to call these 'forests' and 'deserts.'" The team used an elegant technique called a "biplot" to simultaneously illustrate which regions tend to bind to the same transcription factors and which transcription factors tend to bind in the same regions.
The ENCODE project revealed that a large fraction of noncoding DNA is transcribed into RNA. Since many of these regions are not annotated, Gerstein and his team built an automated tool to classify them and group them according to various characteristics such as proximity to genes and conservation across species. This classification clustered some 7000 novel transcribed regions into 200 loci. These regions are also more likely than most to fold into structured RNAs, according to structure-prediction tools.
"You see in our genome the remnants of a huge Alu burst of activity 40 million years ago."
In a final topic, Gerstein discussed the difference between structural variants or copy-number variants and mutations of individual bases. "There's a tremendous amount of block variation in the genome," he said. Some of these changes have become fixed in the population, and are then called segmental duplications (SDs).
"There's a lot of thought about how these copying events happen," Gerstein observed. One proposed mechanism involves non-allelic homologous recombination between repeated sequences that flank the copied region, while other mechanisms require no such repeats.
Gerstein and his collaborators looked for correlations between these block variations and various recurrent features in the genome. They found that the older segmental duplications often correlate with each other, and also with Alu elements, highly abundant mobile elements found in primate genomes.
In contrast, more recent copy number variations correlate with other genomic repeats like microsatellites and pseudogenes. Gerstein suggested that the appearance of Alu elements about 40 million years ago created a huge amount of self-sustaining variability, including localized hot spots, which has been slowly dying away. "You see in our genome now the remnants of this huge Alu burst," he commented.
Speakers:
Robert Prill, IBM
Gustavo Stolovitzky, IBM
Grégoire Altan-Bonnet, Memorial Sloan-Kettering Cancer Center
Julio Saez-Rodriguez, Harvard Medical School and Massachusetts Institute of Technology
Neil Clarke, Genome Institute of Singapore
Daniel Marbach, Ecole Polytechnique Fédérale de Lausanne
Nicolas Guex, Swiss Institute of Bioinformatics
Guillaume Bourque, Genome Institute of Singapore
Mika Gustafsson, Linköping University
Jianhua Ruan, University of Texas at San Antonio
Kevin Yip, Yale University
The DREAM vision
The DREAM initiative grew out of a frustration that although various algorithms were being used to infer networks underlying various biological problems, there was no objective way to determine which was performing best. The Dialog for Reverse Engineering Assessments and Methods includes a conference and other mechanisms of information sharing, but at its core are the DREAM Challenges.
The Challenges were inspired by other competitions, notably for protein-structure prediction, but biological networks pose special difficulties for the design and evaluation of an appropriate challenge. (Many of these issues were discussed in the report on the first DREAM meeting.) A central issue is the fact that real biological networks are never perfectly known, while artificial networks may have special features that make them unrepresentative of biological reality.
The second DREAM meeting, held in December 2007, nonetheless included challenges to predict network structure from observed data. One of these problems, for example, was to predict targets for Bcl6. Speaking for the "best predictor" team, Neil Clarke noted at this meeting, "It was a success because we adopted a model for what a Bcl6 target should look like under different expression conditions, and that was the model of the people who generated the data."
To reduce that problem this year, said Gustavo Stolovitzky, "two challenges respond to this notion that you can only make predictions about something that can be measured," rather than the unknown underlying network. However, the results showed that such predictions can be highly successful without actually reverse engineering the network, but simply by filling in the missing data as best as possible. A third challenge yielded no predictions that met the customary threshold for significance, which may in part reflect the unusual single-cell data set provided. The final challenge exploited the known structure and easy repeatability of artificial network generation to test real network reverse engineering. The complicated algorithm used by the best performer in that case may hold useful lessons, but the results themselves have no direct importance for biology. Assessment of reverse-engineering methods clearly represents an ongoing challenge.
Challenge 1: Signaling Cascade
DREAM3 Challenge 1 required participants to identify four nodes in a specified signaling network, based on measurements of their joint variations in single cells. The powerful but rather unusual data set consisted of flow-cytometry measurements of the corresponding protein levels, drawn from a study of the variability in T-cell activation. The idea, said Grégoire Altan-Bonnet, is to use the power of flow cytometry in dealing with individual cells and to "leverage the variability" of protein levels to learn about the signaling pathway. Looking at the endogenous variability is important, he noted, because "it is very difficult to do systematic perturbations" in this system.
The researchers presented a rather simple network containing seven species: a receptor and its ligand and the complex they form; a phosphatase and its activated form; a kinase that with the activated phosphatase phosphorylates the complex; and a protein and its phosphorylated form, which the phosphorylated complex creates. Participants were given scatter plots relating the levels of four anonymous species for each of the individual cells. They were then asked to match the measurements to a node of the network, such as the phosphorylated complex or the kinase.
DREAM3 Challenge 1 was to assign each of four measured sets of single-cell data to a node in this network.
"We don't have 'the truth' necessarily, but we do have the next best thing, which is a Science paper," observed Robert Prill, who helped to compile the results. In fact Altan-Bonnet and his colleagues published their analysis before the competition closed. However, none of the seven teams who submitted predictions appears to have noticed, since none of them "correctly" assigned more than two of the nodes, although five of them got two right.
The P-value of assigning two nodes of the four correctly, Prill said, is about 11%. "Based on that, there are no winners this year" for this challenge.
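A figure of this size can be reproduced by brute-force counting under one simple null model. The report does not say how the P-value was actually computed, so the following is an assumption: each of the four measurements is assigned at random to a distinct node among the seven species in the network, and every such assignment is equally likely. The sketch enumerates all assignments and counts those with at least two correct.

```python
from itertools import permutations

NODES = 7     # species in the network described above
MEASURED = 4  # anonymous measurements to be assigned to nodes

# Arbitrary labelling: measurement i truly belongs to node i.
truth = tuple(range(MEASURED))

def count_correct(guess):
    """Number of measurements assigned to their true node."""
    return sum(g == t for g, t in zip(guess, truth))

# Every way to assign the four measurements to four distinct nodes.
assignments = list(permutations(range(NODES), MEASURED))
at_least_two = sum(1 for guess in assignments if count_correct(guess) >= 2)
p_value = at_least_two / len(assignments)

print(f"{at_least_two} of {len(assignments)} assignments: p = {p_value:.3f}")
```

Under this assumed null model, 91 of the 840 possible assignments get two or more nodes right, a probability of about 0.108, which is consistent with the roughly 11% that Prill quoted.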
Challenge 2: Signaling Response Prediction
The second DREAM3 Challenge involved predicting the differences in signaling between healthy and cancerous liver cells, as described in more detail by Douglas Lauffenburger. There were two related challenges. The participants were given data on the response of either a set of phosphoproteins or cytokines under dozens of conditions and were asked to predict the response in a handful of conditions that had been omitted. It is a "leave some out" problem, Prill observed, so any inferences that the participant might make about the network were only relevant to the extent that they affected the predicted measurements.
The Lauffenburger and Sorger labs pursue a philosophy of collecting information about many signaling pathways, represented by phosphorylated proteins, and responses, represented by extracellular cytokines, in the presence of a wide range of external conditions, or cues. The researchers looked at the liver cell line subjected to eight different stimuli and eight different pathway inhibitors. Of the 64 stimulus/inhibitor pairs, they withheld data for seven. They collected data for 17 phosphoproteins and for 20 cytokines, for both normal and HepG2 cancer cells at two different time delays from the exposure. Lab member Julio Saez-Rodriguez used his DataRail tool to organize and present the huge arrays of data.
Altogether, participants needed to supply about 500 numbers for either the phosphoprotein or cytokine challenge, corresponding to each of the 17 or 20 levels for both cell lines and for seven stimulus/inhibitor pairs at two time points. To assess the predictions, Prill said, the judges computed a normalized squared error and compared it with a null model in which they sampled the missing entries for the same phosphoprotein or cytokine to estimate a P-value. All three teams that submitted phosphoprotein predictions performed very well by this measure, one with a P-value near 10^-13, and the two best performers with P-values near 10^-22. One of the latter teams was also the best performer for the cytokine challenge, with a P-value of about 10^-35.
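The counting and the general shape of the scoring can be sketched in a few lines. The exact normalization and null model used by the judges are not given in this report, so the sketch rests on assumptions: errors are divided by a single hypothetical measurement scale, and null predictions are drawn at random from a pool standing in for the observed values of the same analyte under other conditions. All numbers are synthetic.

```python
import random

def normalized_squared_error(predicted, measured, scale):
    """Mean squared error after dividing each error by an assumed measurement scale."""
    return sum(((p - m) / scale) ** 2 for p, m in zip(predicted, measured)) / len(measured)

def empirical_p_value(predicted, measured, pool, scale, draws=10000, seed=0):
    """Fraction of random null predictions that score at least as well as the real one."""
    rng = random.Random(seed)
    observed = normalized_squared_error(predicted, measured, scale)
    as_good = 0
    for _ in range(draws):
        null_prediction = [rng.choice(pool) for _ in measured]
        if normalized_squared_error(null_prediction, measured, scale) <= observed:
            as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_good + 1) / (draws + 1)

# Numbers requested: analytes x cell types x withheld conditions x time points.
phospho_count = 17 * 2 * 7 * 2
cytokine_count = 20 * 2 * 7 * 2

# Synthetic example for one analyte across the seven withheld conditions.
measured = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
predicted = [1.1, 1.9, 3.1, 4.1, 4.9, 6.0, 7.1]
pool = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]  # hypothetical observed values

p_value = empirical_p_value(predicted, measured, pool, scale=1.0)
print(phospho_count, cytokine_count, p_value)
```

The two counts come to 476 and 560, matching the "about 500" figure. A sampling estimate of this kind can never fall below 1/(draws + 1), so it cannot resolve P-values as small as those reported; it only illustrates the idea of comparing a prediction with random draws.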
One of the best performers, Nicolas Guex, observed that in light of the variability of data from different labs, and the "well-designed data set" provided, he and his colleagues chose not to take advantage of prior knowledge or biology to address the challenge.
Challenge 2 asked for the levels of either 17 phosphoproteins or 20 cytokines under seven masked conditions of stimulus and inhibitor, given the rest of the 64 pairs.
Rather than analyzing the network, Guex and his colleagues used established techniques for imputing the missing data by comparison with the existing data. They honed the parameters of the imputation algorithm by leaving out even more of the data and choosing the algorithm parameters to best predict 50 different choices of omitted data. The most important parameter was the number of imputations. The polynomial order was also important, as well as the function (raw, square-root, logarithm) used to transform the data.
Guillaume Bourque and Neil Clarke also used what Bourque called a "simpleminded approach" to attack DREAM3 Challenge 2, and, like Guex, concluded that "adding new information would not help much." By finding two inhibitors with similar profiles and comparing them for different stimuli, the researchers filled in the data, he said. "The data was begging to be averaged."
Bourque acknowledged that there was "not much biology in this case, but this is a missing value problem."
In spite of the very high success of the predictors, the results of DREAM3 Challenge 2 illuminate a weakness of the data-prediction approach. The challenge avoided the problem of trying to predict what kind of network model someone else would expect. "We need to be sure we're not just trying to guess what someone else is thinking," Bourque noted. However, the successful methods simply attacked the missing data by looking for regularities in the existing data. The teams made no inferences about the underlying structure of the network, so they can hardly be regarded as "reverse engineering" the network, or indeed as producing any information at all about the underlying biology.
Challenge 3: Gene-Expression Prediction
Challenge 3 concerned the time course of gene expression changes in yeast following exposure to an inhibitor that mimics amino-acid starvation. A 2001 study showed that this inhibitor, 3-aminotriazole (3AT), presumably through its action on the central regulator Gcn4, induces some 1000 genes and represses a similar number. Neil Clarke and his colleagues did microarray profiling of gene expression at eight time points after 3AT exposure. In addition to the wild type, they profiled a deletion strain lacking Gcn4, as well as deletion strains for Leu3 and Gat1, whose promoters bind the Gcn4 protein.
Participants were given complete time series of expression for more than 9000 genes, except that expression data for 50 genes were omitted completely for the Gat1 deletion strain. The participants were asked to rank these 50 genes in terms of their relative expression change at eight time points, relative to the wild type at time zero. Clarke noted that simply averaging the expression of the other strains predicted the expression rather well, except at the earliest times.
Of nine groups that submitted predictions, Gustavo Stolovitzky noted, two groups did significantly better than the rest. Based on a geometric mean between gene-expression and time-course predictions, both groups achieved a P-value of less than 10^-3.
Representing one of those teams, Jianhua Ruan noted that incorporating the rich biological data available that related to this response looked very challenging. "That was scary for me," Ruan said, both because of limited predictive power of existing data and possible variations between platforms. Therefore, like the best predictors of Challenge 2, they decided to make predictions based only on the data provided.
The team considered basing their predictions for the 50 missing genes in the Gat1 deletion strain on the expression of the same genes in the other three strains, but they noticed that the different strains were not well correlated at early times. Instead, the researchers sliced the data in the orthogonal direction, looking for genes whose expression was highly correlated with the missing genes. They used the expression of those genes in the Gat1 deletion strain to predict the missing expression data.
The researchers used a general algorithm to supply the missing data, varying the number of closely correlated genes they included. By repeatedly and randomly omitting 50 more genes, they found that ten such neighbors gave the best predictions. Ruan also described potential improvements in the algorithm.
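The neighbor-based approach described above can be sketched as follows. This is a minimal illustration, assuming Pearson correlation as the similarity measure and a plain average of the neighbors' values; the names `reference`, `target`, `knn_impute`, and `choose_k` are hypothetical, and the team's actual algorithm may differ. The challenge asked for rankings rather than raw values, so predictions like these would still need to be converted to ranks.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def knn_impute(gene, reference, target, k):
    """Predict `gene`'s profile in the target strain from its k nearest neighbors.

    reference[g]: expression of gene g in the strains where it was measured.
    target[g]:    expression of gene g in the strain with missing data
                  (only genes with known values appear here).
    """
    candidates = [g for g in target if g != gene]
    candidates.sort(key=lambda g: pearson(reference[gene], reference[g]), reverse=True)
    neighbors = candidates[:k]
    n_points = len(target[neighbors[0]])
    return [mean(target[g][t] for g in neighbors) for t in range(n_points)]

def choose_k(reference, target, held_out, k_values):
    """Pick k by hiding genes with known target values and scoring the predictions."""
    visible = {g: v for g, v in target.items() if g not in held_out}

    def error(k):
        total = 0.0
        for gene in held_out:
            predicted = knn_impute(gene, reference, visible, k)
            total += sum((p - a) ** 2 for p, a in zip(predicted, target[gene]))
        return total

    return min(k_values, key=error)
```

Repeating `choose_k` over many random choices of `held_out` genes mirrors the repeated omission of 50 additional genes described in the text.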
Although their results ranked similarly to the other best predictor for Challenge 3, Mika Gustafsson and Michael Hörnquist used a very different approach, one that incorporated much more biological information than other participants. As allowed by the rules, they mapped the Affymetrix probes onto gene names so they could integrate other types of data. "We included two other types of data," Gustafsson noted, a "compendium of knockouts and other experiments, and time series," from Rosetta and the NCBI omnibus, respectively. The different sets, some 700 extra profiles, were normalized so they could be combined, with the DREAM data weighted more heavily. "The core of what we did was a least-squares problem for each of these 50 genes," Gustafsson said, which aimed to predict the expression of the missing genes based on that of the others. Because the amount of data is inadequate to determine the parameters, the team used a "mix between ridge regression and lasso." They chose the parameters for this algorithm by cross-validation, leaving additional genes out and testing the predictions.
The researchers also used another type of biological information, incorporating sets of validated and predicted transcription-factor-DNA interactions. They incorporated this information into the function that the algorithm uses to penalize extra links. "We softly integrate both structural data and extra expression data, but only to the extent that it enhances this cross-validation error," Gustafsson noted. "This soft fusion is needed."
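A "mix between ridge regression and lasso" is commonly known as an elastic net. The sketch below shows one standard way to fit such a model by coordinate descent, with optional per-coefficient weights on the lasso penalty as a stand-in for the prior knowledge about transcription-factor links. The objective, the `l1_weights` parameter, and the function names are assumptions made for illustration; the team's exact penalty function is not given here.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, l1, l2, l1_weights=None, n_sweeps=200):
    """Minimize (1/2n)*||y - Xw||^2 + l1*sum(c_j*|w_j|) + (l2/2)*||w||^2.

    X is a list of rows, y a list of targets. l1_weights (the c_j) lets
    some coefficients be penalized more than others, for example links
    with no prior support (an illustrative assumption).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, since w starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            norm = sum(c * c for c in column) / n
            if norm + l2 == 0:
                continue  # all-zero column with no ridge term: leave w[j] at zero
            # Correlation of feature j with the residual excluding its own contribution.
            rho = sum(c * (r + c * w[j]) for c, r in zip(column, residual)) / n
            penalty = l1 * (l1_weights[j] if l1_weights else 1.0)
            new_wj = soft_threshold(rho, penalty) / (norm + l2)
            delta = new_wj - w[j]
            residual = [r - c * delta for r, c in zip(residual, column)]
            w[j] = new_wj
    return w
```

In practice `l1` and `l2` would be chosen by cross-validation, as the text describes: fit on part of the data, score the held-out part, and keep the pair with the lowest held-out error.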
Challenge 4: In silico Networks
Real biological networks are never completely understood. The lack of a true "gold standard" against which to compare predictions led DREAM3 organizers to emphasize predictions of measurable experimental data. Computer-generated networks, however, are completely known. Networks that are predicted from calculated data can thus be rigorously compared with the networks that generated the data, although it is not always known how unique these networks are. These in silico networks formed the basis for Challenge 4. A further strength of computer networks is that many, statistically similar networks can easily be generated, allowing a more confident assessment of the accuracy of reverse-engineering algorithms. "We need to do many networks," noted Daniel Marbach, and "the only way to do this is in silico."
Marbach and his colleagues generated networks including 10, 50, or 100 nodes, with five examples of each. Marbach described the generation of the topology of the networks by sampling nodes from real biological networks of known structure, thus "borrowing" the structure of real networks for the in silico benchmarks. He stressed, however, that this sampling must be done carefully if the extracted subnetwork is to look reasonable. "It only makes sense if it's biologically plausible," he noted. The researchers devised a scheme to modularly extract subnetworks, and showed that network structures obtained in this way preserve such properties as the functional significance and network motifs from the original network.
The team endowed these network structures with dynamical models of expression, and generated two types of "gene expression" data that are common in experiments: Deletion data summarized the steady-state expression changes when particular nodes of the network were knocked out. Perturbation data represented the evolution of the expression with time following a new initial concentration of multiple proteins and mRNAs.
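To make the two data types concrete, here is a toy sketch that generates both from a small dynamical model. It assumes a single variable per gene with sigmoidal regulation and unit decay, integrated with a simple Euler step; the model form, the `weights` matrix, and the function names are illustrative assumptions and are much simpler than the protein and mRNA models the organizers describe.

```python
import math

def simulate(weights, x0, knocked_out=None, dt=0.01, steps=5000):
    """Euler-integrate dx_i/dt = sigmoid(sum_j w_ij * x_j) - x_i.

    weights[i][j] is the assumed influence of gene j on gene i.
    A knocked-out gene has its level clamped to zero throughout.
    Returns the trajectory as a list of state vectors.
    """
    n = len(x0)
    x = list(x0)
    if knocked_out is not None:
        x[knocked_out] = 0.0
    trajectory = [list(x)]
    for _ in range(steps):
        rates = []
        for i in range(n):
            drive = sum(weights[i][j] * x[j] for j in range(n))
            rates.append(1.0 / (1.0 + math.exp(-drive)) - x[i])
        x = [xi + dt * r for xi, r in zip(x, rates)]
        if knocked_out is not None:
            x[knocked_out] = 0.0
        trajectory.append(list(x))
    return trajectory

def deletion_data(weights, x0):
    """Row k holds the steady-state change of every gene when gene k is knocked out."""
    wild_type = simulate(weights, x0)[-1]
    rows = []
    for k in range(len(x0)):
        mutant = simulate(weights, x0, knocked_out=k)[-1]
        rows.append([m - w for m, w in zip(mutant, wild_type)])
    return rows

# Two-gene example: gene 0 activates gene 1.
example_weights = [[0.0, 0.0], [4.0, 0.0]]
example_deletions = deletion_data(example_weights, [0.0, 0.0])
# A perturbation time course is a trajectory from a shifted starting state.
example_time_course = simulate(example_weights, [1.0, 0.0], steps=100)
```

In the example, knocking out gene 0 lowers the steady-state level of gene 1, while knocking out gene 1 leaves gene 0 unchanged, which is the asymmetry that reveals the direction of the link.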
These challenges attracted many participants, noted Stolovitzky, with 29, 27, and 22 teams making predictions for the 10, 50, and 100 node challenges, respectively. Having five networks of each size proved useful, since some competitors did well on one instance but not on average. "You have to do well in all five to do well overall," Stolovitzky noted.
When comparing the results of the 50-node networks this year with those of DREAM2, Stolovitzky observed that "as a community, there was no real improvement." However, since the network structures were more realistic, noise was added to the data, and the dynamical models were more detailed, inferring the networks may have been more challenging than in DREAM2. Interestingly, "some teams that did very well last year did much worse this year," Stolovitzky noted, suggesting that the performance of these methods may strongly depend on the type of network and data to which they are applied.
The same team was the best predictor for all three network sizes. Speaking for that team from Yale University, Kevin Yip noted that "the overall algorithm is rather complicated." He also cautioned that the computer analysis time grew rapidly, from two minutes for the 10-node networks to 78 hours for the 100-node networks.
The Yale team, which also included Roger Alexander, Koon-Kiu Yan, and Mark Gerstein, combined multiple models to do their reconstruction. They noticed that deletion data were useful for finding direct regulatory connections between nodes. With these connections in place, the perturbation data allowed refinement of the model to include weaker and more complex interactions. A critical challenge in evaluating the deletion data, Yip said, is assessing which deviations from the wild-type expression are meaningful and which may be due to noise. He illustrated an iterative procedure for making this assessment.
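The details of Yip's iterative procedure are not given here. As a generic illustration of why iteration helps, the sketch below estimates the noise level from all deviations, flags those that exceed a cutoff, and then re-estimates the noise from the unflagged values only. The cutoff of three standard deviations, the zero-mean noise assumption, and the function name are all assumptions for this example, not the team's method.

```python
def significant_deviations(deviations, z_cutoff=3.0, max_rounds=20):
    """Iteratively separate likely real knockout effects from noise.

    deviations: expression changes relative to wild type (noise assumed
    to be centered on zero). Returns the sorted indices flagged as
    significant and the final noise estimate.
    """
    flagged = set()
    sigma = 0.0
    for _ in range(max_rounds):
        background = [d for i, d in enumerate(deviations) if i not in flagged]
        if not background:
            break
        # Root-mean-square deviation of the values currently treated as noise.
        sigma = (sum(d * d for d in background) / len(background)) ** 0.5
        updated = {i for i, d in enumerate(deviations) if abs(d) > z_cutoff * sigma}
        if updated == flagged:
            break
        flagged = updated
    return sorted(flagged), sigma

# Twelve small fluctuations plus one moderate and one large effect.
example = [0.1, -0.1] * 6 + [1.0, 5.0]
example_flagged, example_sigma = significant_deviations(example)
```

In this example the large effect inflates the first noise estimate enough to hide the moderate one; only after the large effect is set aside does the moderate effect stand out.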
The team then used the perturbation data to find ordinary differential equations to model the expression rate. They tried three models: linear, sigmoidal, and multiplicative. In each case they evaluated potential regulators, but could practically evaluate only a small number of these.
These two types of results were combined to get the final predictions. One common cause of deviations of the predictions from the real network was the confusion between direct and indirect regulation connecting two nodes, Yip noted. Nonetheless, the team consistently outperformed the others for all of the tested network sizes.
Speakers:
George Church, Harvard Medical School
Leona Samson, Massachusetts Institute of Technology
Highlights
- The Personal Genome Project aims to collect extensive data on tens of thousands of individuals, including not just genetic data and traits, but also environmental and cellular data.
- Reprogramming donated cells to induced pluripotent stem (iPS) cells and then redifferentiating them creates pseudotissues that reflect epigenetic information.
- Past environmental and pathogen exposure is retained in the immune system response.
- Alkylating agents are a ubiquitous and unavoidable source of DNA damage.
- Human cell lines from various ethnic populations have a huge variation in their sensitivity or resistance to methylation.
- A signature combining the expression of 48 genes correctly classifies cell line resistance with 94% accuracy.
Personal genomes
One of the most complex challenges in genomics, said George Church, "is how you get from human personal genomes—not the generic human genome, but individual ones—to traits." Even without attempting sequence-based preventative medicine, which he called "a little too challenging," making the connection to traits requires much more than sequence information, he said. "It's not just a genome in a vacuum."
Church was a pioneer in high-throughput sequencing and has promoted an open-source hardware and software system called the Polonator. Several commercial systems are available now or soon will be, and have contributed to a phenomenal price reduction for sequencing. Over the past several years, he noted, the historical halving of the price every year, similar to Moore's law in electronics, has been replaced by a tenfold annual reduction.
The economics of sequencing have inspired Church to establish the Personal Genome Project, which aims to collect genomic profiles of a large number of individuals—currently about 10,000, with plans for up to 100,000. Importantly, the project explicitly does not disguise the personal traits of individuals. "The idea is to get genes, environments, traits, and cellular data," Church emphasized. "This is not just genomic." By collecting many attributes, he said, researchers can mine the data to explore many different hypotheses with only logarithmic growth in the number of individuals.
The Personal Genome Project aims to collect not just genomic data, but also epigenetic and exposure data to relate to traits.
Church said his goal is "to get a hint about how the genome plays out" in the context of specific tissues. For example, "to get at the regulatory signals we need RNA or other epigenomics data." Getting such tissue-specific signals is a challenge, he acknowledged, since the volunteers "are not signing up for thousands of biopsies." Instead he and his collaborators have streamlined techniques to induce pluripotent stem cells from fibroblasts. These cells are then redifferentiated to create what he calls "pseudotissues" whose specific gene expression can be compared.
Because the volunteers' cell lines are available, researchers can do mechanistic followup studies, Church said. "It's a huge improvement on looking at a little piece of your liver and a little piece of mine." Differences in cis-acting elements that result in allele-specific expression, he added, "will allow us to analyze all the upstream and downstream and intronic controlling elements."
Naturally the manifestation of a genome as traits also depends on the environment. "Some of the environment is not known," Church acknowledged, but allergens and microorganisms are recorded for years by the immune system, in what he called "VDJomics," referring to the combinations of variable, diversity, and joining regions that underlie the extraordinary variation among antibodies. Researchers are still assembling the tools for analyzing the Personal Genome Project data, he noted, and he invited others to join in the shared enterprise.
Individual responses to DNA damage
Correlating DNA damage responses with specific environmental agents is complicated, said Leona Samson, in part because "even if everyone had the same exposure, people would have different responses." She described studies of DNA damage caused by alkylating agents, such as those in tobacco smoke, combustion products, and certain foods. Samson is interested in alkylating agents because they are used to treat cancer patients, so that "an extraordinarily high number of people will be deliberately exposed to these agents."
The alkylating agent MNNG adds a methyl group at the O6 position of guanine in DNA. The resulting methylguanine is very mutagenic, she said, since it mistakenly pairs with thymine instead of cytosine. To correct this problem, she noted, virtually all cells in all species express a protein that repairs this methylation.
Surprisingly, cells that lack the gene for this methyltransferase protein are more sensitive to cell killing by methylation when they have an intact DNA mismatch repair pathway. "The presence of mismatch repair actually drives cell death," Samson noted. But she cautioned that cells with neither the methyltransferase nor the mismatch repair pathway "aren't dying, but they have lots of damage to the DNA." By contrast, in normal cells that have both, "you've taken care of the damage" by repairing the DNA.
Because the balance between at least two repair pathways determines the consequences of exposure, Samson said, "we wanted to look at human populations and ask what the range of sensitivities was." Using 24 ethnically diverse lymphoblastoid cell lines from the Coriell Institute, her team found "tremendous difference in alkylation sensitivity."
Ethnically diverse cell lines show extraordinary differences in their sensitivity to the alkylating agents MNNG and MMS. The color coding shows that sensitivity to one predicts the sensitivity to the other rather poorly.
The researchers looked at expression profiles for the least and most sensitive cell lines, and found a set of 48 genes whose basal expression differed between the two groups, and which showed a clear dose-response trend. The most important effect was for the known methyltransferase gene, but the second most positive gene previously had no known function.
When applied to the full spectrum of cell lines, Samson said, "this set of 48 genes seems to be highly accurate in predicting sensitivity or resistance to the alkylating agent MNNG." In addition, selected knockdown experiments confirmed the predictions of the genomic analysis. A large fraction of the genes the researchers identified were associated with cancer and carcinogenesis.
More recently, Samson and her collaborators have compared the effects of different alkylating agents. The results showed a similarly broad distribution of sensitivities, she said, "but they were completely jumbled up about which ones are sensitive and resistant." These diverse responses to similar damage illustrate the challenge of connecting individual genomes to biological characteristics.
Speakers:
Uri Alon, Weizmann Institute
Aviv Bergman, Albert Einstein College of Medicine
Highlights
- Biological networks have been profoundly shaped by evolution.
- The modularity commonly observed in biological systems and networks may arise to solve problems that persist in spite of environmental changes.
- Adaptation to modularly changing goals is significantly faster than adaptation to fixed goals.
- Problems that persist over evolutionary time may evolve nonmodular solutions, like the ribosome.
- Individuals generally vary much less in their phenotype than in their underlying genotypic variation.
- Even in the absence of selective pressure, phenotypic robustness emerges naturally in computer models of evolution that feature a complex connection between genotype and phenotype.
- Genetic variation can be revealed by compromising the function of many individual genes, not just special genes like that for a heat-shock protein.
- Cancers exhibit extreme variability of expression, even when expression changes show no obvious trend.
The evolution of modularity
Evolution has given the biological networks we study today a number of important features. Understanding these features can help to clarify the nature of the networks, and can also provide clues about the conditions under which they evolved.
One striking feature of biological networks is their approximate organization into weakly interacting functional modules. This organization is not an artifact of our limited imagination, said Uri Alon, who has identified over-represented motifs as one aspect of this modular arrangement. Modularity is "a major design principle," he suggested, applying to body plans and protein structure as well as networks.
In contrast, when researchers simulate evolution in computer algorithms, Alon observed, "you see that it's not modular." This is not surprising, since entropy favors the much more numerous non-modular arrangements, which are also likely to open options for greater optimization.
Alon described one possible reason why biological systems develop modular structures: changes in the environment over time that require organisms to develop new survival strategies. He and his colleagues, notably Nadav Kashtan, have explored the effect of changing environment by simulating the evolution of algorithms as they respond to changing goals.
As expected, algorithms that evolved to solve an unchanging problem were non-modular and highly efficient. But repeatedly changing the goal generally doesn't create modularity, Alon observed. "You just confuse the system." The simulation evolves modularity only when "the different goals that it changes through over time have something in common," he said. When the goals have such common sub-problems, the researchers call the environment "modularly varying."
When systems have to meet shifting goals that have common sub-problems, the shifting landscape causes them to evolve a modular structure, and they also solve the problem more rapidly than if the goals stay fixed.
Alon suspects that such modular challenges may be common in biology, since functions such as metabolism or locomotion retain constant activities even as organisms encounter new environments. Other tasks, such as protein translation from messenger RNA, may stay the same over time; the ribosome, which evolved to address this task, has a complex, non-modular structure. Alon also showed evidence that bacteria that navigate varied environments have more modular gene networks than those that see only one environment, such as obligate parasites.
One surprise emerging from the studies, Alon said, is that "when you vary the goals, you have tremendous speedup in the evolution." This acceleration continues to increase when the organism is faced with more challenging problems, so that the time needed to solve them grows only as the cube root of their complexity. "You solve both goals faster than you would solve either one of them giving them constant goals," he said. He stressed that this accelerated evolution occurs only for modularly related goals. Moreover, the goals must change slowly enough for the system to rearrange the modules, but not so slowly that it can evolve into a more optimal, non-modular structure.
The evolution of robustness
In addition to showing modularity, biological networks that persist through evolution often exhibit robustness, meaning that individuals vary much less in their phenotype than would be expected from their underlying genetic variation.
Aviv Bergman is working to understand "how genetic variation within the individual is translated into the phenotype of that individual." Although molecular and developmental biologists "run from variation like from fire," he said, "evolutionary biologists treasure genetic variation—we can't live without it." Robustness preserves this genetic variation without causing fatal phenotypic change. He noted, "Mechanisms that allow the harboring of variation are critical for evolutionary change and for our understanding of how evolution occurs."
Knocking out many different genes leads to wider variability in phenotype than is seen in wild-type yeast.
Bergman has simulated the evolution of coupled genotypes and phenotypes to explore this issue. "My model organism is some sort of mathematical object," he commented. These simulations show that robustness emerges even when there is no selective pressure. "Simply by having complex phenotype–genotype mapping, as a side effect you end up with robustness," he said. "Selection is very important, but it may not be the major driving force behind the evolution of robustness."
Genetic variability among individuals is often revealed in mutants. One well known example involves the heat-shock protein Hsp90, which buffers variability like an "evolutionary capacitor." In yeast studies, Bergman and his colleagues found that a huge variety of genes also enhance variability when they are knocked out. "Every gene, just by being an element of this complex gene network, can reveal variation if you compromise its functionality," he noted. "It's not necessarily unique to Hsp90."
Bergman suggested that aggressive cancers exhibit especially large variability, presumably resulting from mutations. He and his collaborators studied a population of patients with squamous-cell carcinoma. Even sophisticated statistical techniques revealed no clear signal distinguishing high- and low-survival groups in the average expression of any gene or group of genes. However, the low-survival group showed significantly higher variability of expression. "We hypothesize that robustness is disrupted in tumor cells, leading to higher gene-expression variation," Bergman said. "Our hypothesis of revealed variation may explain why there are so many different outcomes in cancer."
Speakers:
Chris Burge, Massachusetts Institute of Technology
Thomas Tuschl, The Rockefeller University
Highlights
- Alternative splicing and post-transcriptional modification are critically important for understanding gene regulation.
- By sequencing poly-adenylated sequences preferentially, researchers can identify the splicing and end modifications of messenger RNA sequences targeted for translation.
- For transcripts that have multiple alternatives for splicing or related structures, roughly two thirds occur in different proportions in different tissues.
- "Switch-like" exons, whose incorporation varies strongly between different tissues, will help to elucidate the code for tissue-specific splicing.
- Splicing is coordinated with cleavage and polyadenylation of 3′ untranslated regions, suggesting they might be controlled by common, tissue-specific factors.
- A new method allows efficient, high-resolution identification of RNA sequences bound by binding proteins.
- Complete understanding of the RNA code requires lots of measurements.
Alternative messages
Even after an RNA is transcribed, there are several additional opportunities for regulation of its translation into protein. For example, a single DNA sequence can produce a variety of messenger RNA (mRNA) transcripts. In alternative splicing, different sections of the original sequence are included, producing functionally distinct proteins known as isoforms. The final transcripts may also differ in the noncoding regions at either end, including the polyadenylated (poly-A) 3′ tail that adorns eukaryotic mRNA.
Chris Burge and his colleagues use recent advances in sequencing technology to clarify the biological significance of these alternative proteins, in particular the degree to which they vary between tissues. A new method, called mRNA-seq, selectively sequences RNA that includes poly-A tails. As a result, it identifies mature transcripts with much lower background noise than other techniques.
In one experiment, Burge said, about 40% of the 20 million sequence reads mapped to unique splice locations. Comparing reads from the same location showed where alternative splicing occurred. The researchers compared the frequency of alternative splice forms in different tissues and individuals. "Of the 10,000 skipped exons where we had reads supporting both isoforms, over 6000 of them were significantly tissue-regulated," Burge said. "We estimate that about 2/3 of alternative processing events in humans are differentially regulated in one or the other of this set of ten tissues," he concluded, "substantially higher than previous estimates." Between individuals, the team estimated about 10%–20% differential regulation.
To explore the molecular mechanisms that lead to alternative splicing, the researchers "looked for exons that had the most pronounced patterns of tissue differences" in how frequently they were included in the final mRNA, Burge said. "Our attention was drawn to the subset of exons that have very different inclusion values between different tissues," he noted, calling them "switch-like exons."
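The notion of an inclusion value, and of exons whose inclusion differs sharply between tissues, can be sketched in a few lines. This example assumes that each tissue provides a count of reads supporting exon inclusion and a count supporting skipping, and it scores an exon by the spread between its highest and lowest inclusion levels; the function names and the score itself are illustrative, and a real analysis would also account for read coverage and the number of positions that can support each isoform.

```python
def inclusion_level(inclusion_reads, skipping_reads):
    """Fraction of informative reads that support including the exon."""
    total = inclusion_reads + skipping_reads
    return inclusion_reads / total if total else None

def switch_score(reads_by_tissue):
    """Spread between the highest and lowest inclusion level across tissues.

    reads_by_tissue maps a tissue name to (inclusion_reads, skipping_reads).
    Returns None when fewer than two tissues have informative reads.
    """
    levels = [inclusion_level(i, s) for i, s in reads_by_tissue.values()]
    levels = [v for v in levels if v is not None]
    if len(levels) < 2:
        return None
    return max(levels) - min(levels)

# Hypothetical counts for one exon in two tissues.
example_exon = {"brain": (90, 10), "liver": (5, 95)}
example_score = switch_score(example_exon)
```

An exon with a score near 1 is almost always included in one tissue and almost always skipped in another, which is the "switch-like" behavior Burge describes.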
"About 2/3 of the alternative processing events in humans are differentially regulated."
As might be expected, the switch-like exons showed high sequence conservation, suggesting that they are biologically important. In addition, however, the regions within 100 bases on either side, where splice regulating elements are known to occur, were also highly conserved. These regions near switch-like exons should be especially useful for defining such regulatory elements, Burge suggested.
The researchers also found a correlation between alternative splicing events and alternative cleavage and polyadenylation events. This was a surprise since the latter processes, which modify the 3′ untranslated region (UTR), are thought to occur when transcription is terminated, while splicing occurs later.
Surprisingly, the team found conserved motifs both near the splice sites and far away in the sequence. This suggests, Burge said, that the factors that bind to these motifs "don't just regulate splicing, but also have 3′ UTR regulatory functions, either regulating 3′ UTR stability, or translation, or cleavage and polyadenylation."
Regulation by RNA
The stability and translation of mRNAs, as well as their localization within the cell, are also guided by other, small RNA segments. "As we are becoming increasingly aware," said Thomas Tuschl, "the control of gene expression to proteins once you have made your mRNAs is an extremely important regulatory mechanism that we're just at the beginning of understanding."
These regulatory processes are performed by complexes of RNA and RNA-binding proteins, which make up some 5% of human genes. Mammalian cells feature two types of endogenous small RNA: piwi-interacting RNA (piRNA), which appears only in germ-line cells, and microRNA (miRNA). The precursor transcripts for miRNA form stem-loop structures. These are cleaved and exported to the cytoplasm to form twenty-some base-pair segments of RNA that complex with argonaute-family proteins.
These complexes act in two ways to modulate activity of mRNAs. In the RNA interference pathway, complexes containing argonaute-2 degrade mRNA that is fully complementary to its miRNA. Complexes with the other three argonaute proteins instead repress translation of mRNA targets that need only be partially complementary, mostly in a "seed" region near the 5′ end of the miRNA.
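Seed-based targeting lends itself to a simple search. The sketch below takes the seed as nucleotides 2 through 8 of the miRNA, a common convention assumed here rather than stated in the talk, and looks for perfectly complementary sites in an mRNA. The function names are illustrative, and real target prediction also weighs conservation, site context, and pairing outside the seed.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_sites(mirna, mrna, start=1, end=8):
    """Positions in `mrna` that are perfectly complementary to the miRNA seed.

    The default slice 1:8 selects nucleotides 2 through 8 of the miRNA
    (0-based, end-exclusive). Overlapping matches are all reported.
    """
    site = reverse_complement(mirna[start:end])
    positions = []
    i = mrna.find(site)
    while i != -1:
        positions.append(i)
        i = mrna.find(site, i + 1)
    return positions

# let-7a miRNA against a made-up mRNA fragment containing two seed matches.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"
example_positions = seed_sites(let7a, "AAACUACCUCAAACUACCUCA")
```

Because a seed match is only seven nucleotides long, matches occur often by chance, which is one reason experimental methods for locating sites that are actually bound are valuable.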
"To understand the role of miRNA, you have to know which miRNA is expressed in what cell," Tuschl observed. He and his colleagues have developed protocols to clone RNA sequences that have signature termination groups for miRNA, avoiding the duplication of other short RNAs such as degradation products.
Fluorescence linked to micro-RNA can illuminate molecular processes within individual cells.
Sequencing the clones from a single set of cells reveals some 10,000 different miRNAs, although many of the variants can be clustered into related families. Moreover, Tuschl emphasized, "very abundant miRNAs can represent 70% of everything you clone." Only for these abundant miRNAs is it practical to study the regulatory activity.
Nonetheless, the miRNA profiles differ in modest but distinguishable ways, for example between different classes of breast cancer, Tuschl said. Knowing these differences might help to categorize tumors, but he noted that "taking a tissue is not good enough," since a tissue sample always includes a variety of cells, such as normal, tumor, and stromal cells. Tuschl's lab is developing chemical techniques to visualize miRNA expression in individual cells in tissue.
In addition to knowing which cells express a particular miRNA, however, researchers must learn which mRNA sequences they target. To do this, Tuschl and his colleagues have used chromatin immunoprecipitation (ChIP) to isolate binding partners for the various proteins involved in endogenous miRNA regulation.
Although these techniques for identifying miRNA targets are powerful, Tuschl said, "we really need a method that lets you look at single sites." To do this, his team has recently adapted the cumbersome crosslinking immunoprecipitation (CLIP) method. They fed their cells a photo-reactive modified nucleoside, such as 4-thiouridine, which improves the crosslinking yield a thousandfold. In addition, although the modified base substitutes for uridine (read as thymine in sequencing), crosslinking it chemically alters it so that it acts like cytosine, so that comparing the sequences locates the crosslink precisely.
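The sequence comparison that locates a crosslink can be illustrated with a small sketch. It assumes that reads have already been aligned to a reference without gaps, each given as an offset and a sequence, and it counts positions where the reference has T but the read shows C. The function name and input format are hypothetical; a real pipeline would also need to separate true conversions from sequencing errors and polymorphisms.

```python
def conversion_sites(reference, aligned_reads, from_base="T", to_base="C"):
    """Count T-to-C mismatches per reference position.

    aligned_reads: list of (offset, sequence) pairs describing ungapped
    alignments to `reference`. Returns {position: number of reads
    showing the conversion at that position}.
    """
    counts = {}
    for offset, read in aligned_reads:
        for k, base in enumerate(read):
            pos = offset + k
            if pos >= len(reference):
                break  # ignore any overhang past the end of the reference
            if reference[pos] == from_base and base == to_base:
                counts[pos] = counts.get(pos, 0) + 1
    return counts

# Made-up reference and three reads; two reads show a conversion at position 3.
example_reference = "GATTACAGT"
example_reads = [(0, "GATCAC"), (2, "TCACAG"), (4, "ACAGT")]
example_counts = conversion_sites(example_reference, example_reads)
```

Positions supported by several independent reads are the most plausible crosslink sites, since a single mismatch could simply be a sequencing error.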
The researchers used the technique to map the precise positions where miRNAs bind in conjunction with different argonaute proteins. "We find about half of them in the coding sequence and the other half in the 3′ UTR," Tuschl said, "which was a little bit unexpected, because you know that miRNAs are predominantly targeting the 3′ UTRs."
Speakers:
Daphne Koller, Stanford University
David Botstein, Princeton University
Highlights
- Over-represented patterns of activity within a particular network topology give clues to general network design principles.
- Such recurring "activity motifs" in metabolic networks include sequential, "just in time" transcription and other examples of coordinated timing.
- The relative timing of expression of different genes is due in part to different binding affinity for transcription factors.
- Growth-rate-induced expression changes resemble, but are distinguishable from, those that constitute the environmental stress response.
- Applying heat pulses to a steady-state reactor can help distinguish genes that respond to heat alone from those that respond to both heat and growth rate.
- New tools allow the simultaneous monitoring of hundreds of metabolites with time resolution of tens of seconds.
Coordinated activity
Researchers have learned much about networks by analyzing their topology, at both the global scale of statistical connectivity and the fine scale of network motifs. But Daphne Koller cautioned that "topology by nature is a static thing, whereas the cell is constantly adapting usage of different parts of these networks in response to changes in the environment, as well as internal changes." She likens the dynamic activity of networks to traffic patterns on a highway: "To truly understand the way in which networks are used, we need to understand not just the map, but also the traffic pattern."
Assuming a fixed network topology, Koller and her colleagues search for "patterns in functional data that are more likely than you would expect by chance." Like the topological motifs discussed by Uri Alon, she said, such "activity motifs" should "correspond to things that it is beneficial for the cell to do."
Among her group's many projects, Koller is analyzing activity in metabolic networks. In particular, they have explored a previous proposal that the timing of transcription along a metabolic chain may be tuned to provide proteins precisely when they are needed. Such "just-in-time" transcription could preserve raw materials and limit side reactions.
To generate precise timing data from periodically sampled microarray data, group member Gal Chechik devised a way to extract specific features from expression at a few discrete times, by assuming a characteristic impulse response. Among the parameters he extracted is the onset time of expression. Koller and her colleagues looked at a variety of metabolic chains, and asked whether this onset proceeded in a strict temporal sequence along the chains in response to a perturbation.
In response to heat shock, for example, the team found 268 instances of "forward activation," in which the onset was later for genes further down the chain than for those earlier in the chain. In contrast, a randomly permuted network showed only 56 such instances. "That's a very significant difference," Koller noted. For the opposite perturbation, de-heating, the order of shutoff times followed the same pattern. This "forward shutoff" pattern had not been seen previously. The team also saw enrichment of backward activation and shutoff, as well as coordinated timing of activation.
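The comparison Koller described, counting forward-ordered gene pairs in the real network and again in a randomized one, can be sketched as a simple permutation test. The sketch below is only illustrative: the chains, gene names, and onset times are invented, and shuffling onset times across genes is an assumed stand-in for whatever network permutation the team actually used.

```python
import random


def count_forward(chains, onset):
    """Count adjacent (upstream, downstream) pairs where the downstream gene starts later."""
    return sum(
        1
        for chain in chains
        for up, down in zip(chain, chain[1:])
        if onset[down] > onset[up]
    )


def permutation_p_value(chains, onset, n_perm=1000, seed=0):
    """Compare the observed forward count with counts from shuffled onset times."""
    rng = random.Random(seed)
    observed = count_forward(chains, onset)
    genes = list(onset)
    times = [onset[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(times)  # hypothetical null model: onset times reassigned at random
        shuffled = dict(zip(genes, times))
        if count_forward(chains, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical metabolic chains and onset times (arbitrary units).
chains = [["A", "B", "C", "D"], ["E", "F", "G"]]
onset = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 1.5, "F": 2.5, "G": 3.5}

observed, p_value = permutation_p_value(chains, onset)
print(observed, p_value)
```

A small p-value here means that the shuffled data rarely match the observed number of forward-ordered pairs, which is the kind of enrichment the team reported.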
In more cases than would be expected by chance, the order in which genes change their expression is connected to the relationship of their products in the metabolic network.
One limitation of using microarrays is that the messenger RNA expression they measure does not completely determine the concentration of proteins that participate in metabolic reactions. The beauty of her team's approach, Koller suggested, is that "our analysis does not rely on RNA levels, but on timing ... Timing seems to be a more robust thing." For one sequence of nine timed genes, the researchers confirmed that the onset times for protein levels were highly correlated with the onset times for expression of the corresponding genes.
Some of the time-ordering they observed probably reflects feedback loops in the cell that sense metabolite levels and respond appropriately, Koller said. In addition, "there is a preprogrammed transcriptional response" that tends to optimize the timing of expression. One mechanism that affects timing is the quantitative binding affinity of transcription factors. Her team confirmed that, for cases where they had data, a significant fraction showed tighter binding for the gene that was expressed earlier. Koller concluded that "the cell has evolved the affinity of its binding sites to fine tune its response to changes."
The big picture
Experimenters monitoring gene expression frequently overlook biological side effects of their genetic manipulations, said David Botstein. "Organisms are hugely sensitive to their environment, and the signal from that [response to environment] has a tendency to dominate everything else."
He recalled the widely-cited "yeast-abuse diagram," produced by his student Audrey Gasch by measuring expression changes over time as "she did everything you could do bad to a yeast." The response was by and large the same for all stresses, leading researchers to call it the environmental stress response. Botstein noted that this presents a frustrating obstacle to gaining a clear picture of gene expression, asking, "How do you stop having this confound every experiment that you do?" For example, if a mutant is tuned to slightly different conditions than standard growth conditions, "you'll see many thousands of genes shift their expression."
To resolve this conundrum, Botstein and his colleagues have revived the decades-old chemostat, a closed reactor that ensures that all nutrient flows are well defined. "Steady state is reached when the growth rate is equal to the dilution rate," he noted. "I can arrange the rate-limiting nutrient, and know it for sure." Augmenting this method with modern expression profiling yields a powerful tool for separating different causes of gene expression changes.
The expression profile under these nutrient-limited conditions is strikingly similar no matter which nutrient the researchers restrict, except for a few genes on nutrient-specific pathways. "It doesn't matter whether they are limited for glucose, leucine, uracil, ammonia, phosphate, or sulfate," Botstein noted. In fact, the changes are also similar to those seen in the environmental stress response. He said that about three-quarters of the information in the stress response relates to growth rate.
Improved mass-spectrometry techniques allow profiling of hundreds of metabolites with a time resolution of tens of seconds.
Nonetheless, because growth rate takes some time to show an effect, the responses can be distinguished by subjecting the chemostat to a pulsed perturbation such as heat. Based on their response, the genes divide into six classes: those that increase or decrease transcription with both growth rate and heat, those that increase or decrease in response to heat but not growth rate, and those that respond oppositely to the two stimuli in one sense or the other. "There is a growth-rate response, and it's distinguishable from the environmental stress response," Botstein argued, although "much of the pattern is due to growth."
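The six classes follow from combining the direction of the heat response with the direction, or absence, of the growth-rate response. A minimal sketch of that bookkeeping is shown below; the function name, the sign encoding, and the label wording are assumptions made for illustration, not the researchers' own scheme.

```python
from itertools import product


def response_class(growth, heat):
    """Label a gene from the signs of its responses.

    growth: +1, 0 or -1 (response to growth rate); heat: +1 or -1 (response to heat).
    """
    if heat not in (1, -1) or growth not in (1, 0, -1):
        raise ValueError("expected heat in {+1, -1} and growth in {+1, 0, -1}")
    heat_dir = "up" if heat > 0 else "down"
    if growth == 0:
        return f"heat {heat_dir} only"
    if growth == heat:
        return f"{heat_dir} with both heat and growth rate"
    growth_dir = "up" if growth > 0 else "down"
    return f"heat {heat_dir}, growth rate {growth_dir}"


# Enumerate every combination to confirm there are exactly six classes.
classes = sorted({response_class(g, h) for g, h in product((-1, 0, 1), (-1, 1))})
for label in classes:
    print(label)
```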
Although the overall response of their yeast cells to nutrient deprivation was quite similar, Botstein and his colleagues identified an important difference in the metabolic response for some nutrients. In particular, "artificial" starvations, which are not ordinarily encountered in the wild, "waste all the glucose," he said. In contrast, "the natural mutations save the glucose," presumably in response to reduced growth rate.
As a result, artificial starvations kill the cells. By finding strains that survive artificial starvation, the researchers identified genes, many affecting or affected by Cdc55, that govern this effect. "We think we have located the region of the network that has evolved this ability to detect the instantaneous growth rate," Botstein concluded.
In addition to studying steady-state growth, Botstein and his team have developed mass-spectrometry protocols for tracking hundreds of metabolites over tens of seconds, which is critical for metabolism studies. "The time scale for transcription is completely out of whack with the time scale for metabolite flux," he stressed. By measuring many metabolites at once, the researchers can produce the sort of heat maps that are familiar from expression data.
Still, Botstein emphasized that the continuous growth often studied in the laboratory may not be a "normal" environment for organisms like yeast. Instead, these organisms probably evolved to conserve resources most of the time, but to be ready to exploit the occasional opportunity for explosive growth. In this view, the "stress" response may actually be the normal state.
Speakers:
John Tyson, Virginia Polytechnic Institute and State University
Boris Kholodenko, Thomas Jefferson University
Pamela Silver, Harvard Medical School
Highlights
- Various "motifs"—simple patterns of excitation and inhibition in biological networks—can be identified and shown to exhibit behaviors like signal transduction, hysteresis, and oscillation.
- The cell cycle in yeast is driven by back-to-back motifs that act as self-flipping switches.
- Feedforward loops with antagonistic interactions execute cell-cycle events exactly once, and are common among periodically expressed genes.
- Choosing perturbations carefully is critical to network reconstruction.
- Even for complex nonlinear systems, the Jacobian matrix, which summarizes the sensitivity of each variable to each other, determines the connections between network elements.
- Different stimuli that share a pathway can induce different time responses because of additional interactions.
- Researchers have designed a completely synthetic transcriptional system in yeast that stores a memory of previous small-molecule exposure, even through cell division; a more sophisticated version does the same in mammalian cells.
- A synthetic auto-inhibiting gene network shows oscillations whose period grows as researchers increase the length of the intron, and thus the transcription delay, for one of the genes.
- It may be possible to modify the photosynthetic system of cyanobacteria to produce hydrogen.
Functional motifs drive the cell cycle
John Tyson defines a protein interaction network to include more than direct interactions, such as the formation and breakup of complexes, and processes like phosphorylation and dephosphorylation. He includes the entire complex set of chemical reactions by which proteins affect each other, including those that modify their synthesis and degradation. To understand a network's dynamic behavior, it is more important to classify one protein's effect on another as activating, repressing, or neutral than to describe the specific mechanisms by which it achieves this effect.
Classifying protein interactions helps to identify "motifs" in a network. Tyson defined a motif as "a simple pattern of activation and inhibition among a small number of proteins." This meaning is different from that popularized by Uri Alon and his collaborators, for whom a motif is a pattern of activation and inhibition that appears more frequently than expected by chance. In any case, the important question for network behavior is whether motifs are "sufficiently isolated to serve an identifiable function," Tyson said. If so, Tyson identifies the motif as a "module."
For small numbers of interacting proteins, Tyson categorized all possible motifs by the patterns of interactions (positive, negative, or neutral) between each pair of proteins. For two proteins and one link, for example, the motif passes a signal from one to the other, whereas for two proteins and reciprocal links, the motif forms a loop. If the interactions within that loop are both excitatory or both inhibitory, the net result will be positive feedback, which creates bistability for certain values of the interaction parameters. If the interactions are opposite in sign, the resulting negative feedback tends to damp disturbances, creating homeostasis.
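The rule for loops reduces to multiplying the signs of the interactions: an even number of inhibitory links gives positive feedback, an odd number gives negative feedback. The short sketch below encodes that rule; the function name and the +1/-1 encoding are illustrative choices, not Tyson's notation.

```python
def loop_feedback(signs):
    """Classify a closed loop from its interaction signs (+1 activating, -1 inhibiting)."""
    product = 1
    for sign in signs:
        if sign not in (1, -1):
            raise ValueError("each interaction must be +1 or -1")
        product *= sign
    return "positive feedback" if product > 0 else "negative feedback"


print(loop_feedback([+1, +1]))  # mutual activation
print(loop_feedback([-1, -1]))  # mutual inhibition
print(loop_feedback([+1, -1]))  # one activates, the other inhibits
```

Whether a positive-feedback loop is actually bistable still depends on the interaction parameters, as the text notes; the sign rule only identifies which behavior is possible.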
"Simple motifs exist in the cell-cycle control system and they carry out their expected functions."
"Three-component motifs are more interesting and more complicated," observed Tyson, who drew on the work of Kholodenko to help classify them. Some motifs can generate sustained oscillations, for example. Other motifs, called feedforward loops, involve competition or cooperation between separate branches of the network.
Tyson identified such modules in the protein interaction network that underlies the yeast cell cycle (see also this 2005 eBriefing). In one example, he predicted that the total concentration of cyclin B in the cell drives the activation of the protein to one of two stable levels; later experiments confirmed that the motif "[carries] out the function that it's expected to carry out," Tyson commented.
Another motif in the cell-cycle network is a three-component feedback loop surrounding a "toggle switch," Tyson observed. "This can operate as a kind of self-flipping switch." Indeed, progression through the cell cycle can be viewed as the sequential operation of two back-to-back self-flipping switches. One switch activates during the G1 phase, once conditions are right, to begin DNA synthesis and the events leading up to metaphase. The second switch initiates the completion of mitosis and cell division.

Two separate "self-flipping switches" drive yeast through the cell cycle.
Once the cell cycle is established, Tyson asked, "How does this control system drive the downstream events?" He showed that an "incoherent" feedforward loop, in which one leg inhibits the action of the other until the first stops acting, acts as a "cock-and-fire transducer: It will fire once and only once." This action helps prevent critical events like DNA replication from happening twice in one cycle. Tyson's collaborators have found that such feedforward loops are common among genes whose expression varies with the cell cycle, he said.
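A toy simulation can show how an incoherent feedforward loop produces a single pulse from a sustained input. The sketch below uses the textbook form of the motif, in which an input X activates the output Z directly and also activates a repressor Y of Z. It is not Tyson's cell-cycle model, whose wiring differs in detail, and the rate equations and parameter values are assumptions chosen only to make the behavior visible.

```python
def simulate_iffl(t_end=10.0, dt=0.01, k=0.2, hill=4):
    """Euler integration of a toy incoherent feedforward loop; returns the Z trace."""
    steps = int(round(t_end / dt))
    y = 0.0  # repressor, switched on by X
    z = 0.0  # output, activated by X and repressed by Y
    trace = [z]
    for _ in range(steps):
        x = 1.0  # input is switched on at t = 0 and stays on
        repression = 1.0 / (1.0 + (y / k) ** hill)
        dy = x - y
        dz = x * repression - z
        y += dt * dy
        z += dt * dz
        trace.append(z)
    return trace


def count_pulses(trace):
    """Count upward crossings of half the peak value."""
    half = max(trace) / 2.0
    return sum(1 for a, b in zip(trace, trace[1:]) if a < half <= b)


trace = simulate_iffl()
print(count_pulses(trace), max(trace), trace[-1])
```

Even though the input never switches off, the output rises once and then settles far below its peak, which is the "fire once" behavior in miniature.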
Diverse responses
Boris Kholodenko has also categorized simple network motifs. However, he emphasized that understanding the overall behavior of a network also depends on knowing detailed parameters of the interactions.
To illustrate the importance of quantitative modeling, Kholodenko contrasted the response of the PC12 rat cancer cell line to epidermal growth factor (EGF) and nerve growth factor (NGF). Both effectors act through the same MAP-kinase pathway, he noted, "yet the cellular response is extremely different." EGF causes a transient response in downstream extracellular signal-regulated kinase (ERK), leading to proliferation, while NGF causes sustained ERK activity, leading to differentiation.
To understand the contrasting responses, Kholodenko said, researchers must reconstruct the network based on a number of observations. "How can we predict the local connections if you measure only the system response?" he asked. Kholodenko argued that the "local responses are directly connected to the Jacobian matrix," which contains the partial derivatives of different variables with respect to each other.

Exposure of cultured cells to two growth factors causes very different cellular responses, even though the two act through the same pathway.
This in turn requires knowing how the system is "wired"; that is, which nodes are connected to which others. "To understand the structure," Kholodenko said, "you should do perturbations." The critical challenge is how to choose the perturbations that will reveal the structure, an issue he has explored extensively.
Even then, Kholodenko noted, "We cannot deduce the wiring inside modules, but we can deduce the wiring between modules." These methods can also be extended to dynamic measurements, he said. He illustrated his ideas by studying how ligands interact with ErbB receptors, which are important in the early stages of many tumors.
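The link between the Jacobian matrix and network wiring can be illustrated with a small numerical example. The sketch below takes a made-up three-node model, estimates its Jacobian by finite differences, and reads the connections off the nonzero off-diagonal entries. It starts from a known model, so it does not reproduce Kholodenko's method of inferring local connections from measured system responses; the rate equations and node names are hypothetical.

```python
def rates(x):
    """Hypothetical three-node cascade with negative feedback from C onto A."""
    a, b, c = x
    return [
        1.0 / (1.0 + c) - a,  # A: production inhibited by C, first-order decay
        2.0 * a - b,          # B: activated by A
        3.0 * b - c,          # C: activated by B
    ]


def jacobian(f, x, h=1e-6):
    """Forward-difference estimate of J[i][j] = d f_i / d x_j at the point x."""
    n = len(x)
    base = f(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        shifted = list(x)
        shifted[j] += h
        bumped = f(shifted)
        for i in range(n):
            jac[i][j] = (bumped[i] - base[i]) / h
    return jac


def wiring(jac, names, tol=1e-4):
    """Map (source, target) to '+' or '-' for each nonzero off-diagonal entry."""
    edges = {}
    for i, row in enumerate(jac):
        for j, value in enumerate(row):
            if i != j and abs(value) > tol:
                edges[(names[j], names[i])] = "+" if value > 0 else "-"
    return edges


edges = wiring(jacobian(rates, [0.5, 1.0, 3.0]), ["A", "B", "C"])
print(edges)
```

The sign of each entry gives the type of interaction (activating or inhibiting), while its magnitude gives the strength of the local response.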
New functions from scratch
One way to test how well researchers understand biological networks is to build new ones and see whether they work as expected. Pamela Silver described three such synthetic biology projects, two that illustrate regulatory principles and one aimed at saving the planet. These projects rely on the modularity of biological systems, Silver said, which makes it possible to alter the systems' function methodically. (See also Uri Alon's talk for one idea of how this modularity evolves.)
Silver's first project involved "building cells that can remember that they were exposed to a [small-molecule] signal." She and her team built an autofeedback loop in yeast purely from synthetic parts. They inserted two DNA sequences, each including luminescent reporters of their activity. The first sequence contained a gene that was activated by exposure to a small molecule. This gene's product activated transcription of the second gene, whose product also activated its own production.

An auto-feedback loop inserted into yeast lets it "remember" exposure to a chemical, even after cell division.
Once the cells had been exposed, a green luminescent reporter showed that the second gene remained active over many cell divisions, even after exposure to the small molecule stopped. "This meets our design goal of building a system from parts that has predictable behavior," Silver said. The researchers also used this system to explore the dynamics of the response in detail.
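The memory effect can be reproduced in a toy model of a single self-activating gene that receives a transient input. In the sketch below the input stands in for the first, signal-responsive gene, and the self-activation uses a Hill function. The equation and every parameter value are assumptions for illustration and are not taken from Silver's circuit.

```python
def simulate_memory(pulse_start=5.0, pulse_end=10.0, t_end=60.0, dt=0.01,
                    vmax=1.0, k=0.5, hill=4, decay=1.0):
    """Euler integration of a self-activating gene driven by a transient signal."""
    steps = int(round(t_end / dt))
    b = 0.0  # product of the self-activating (memory) gene
    trace = [b]
    for i in range(steps):
        t = i * dt
        signal = 1.0 if pulse_start <= t < pulse_end else 0.0
        self_activation = vmax * b ** hill / (k ** hill + b ** hill)
        b += dt * (signal + self_activation - decay * b)
        trace.append(b)
    return trace


exposed = simulate_memory()
never_exposed = simulate_memory(pulse_start=0.0, pulse_end=0.0)  # empty pulse window
print(exposed[-1], never_exposed[-1])
```

Long after the pulse ends, the exposed trace remains in the high state while the unexposed one stays off, because positive autoregulation makes the system bistable for these parameter values.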
The team has built a similar system for mammalian cells in culture, although Silver observed that "things get a little more complicated." There could be significant applications for such synthetic memory circuits, she said. "For example, we are building cells that can report on how much DNA damage they were exposed to and when."
In another project, Silver and her colleagues explored the timing of gene expression. They varied the length of intron regions of a gene, which are transcribed but not translated. They incorporated this gene into a delayed auto-inhibition loop that they designed to oscillate, resulting in a series of pulses of expression. "As the intron length increases, the average time between the pulses also increases," Silver noted. "One could use this to build a general system, for example pulsatile delivery of a gene product."
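The dependence of pulse spacing on delay can be seen in a generic delayed auto-inhibition model, in which a gene product represses its own synthesis after a fixed lag. The sketch below is a minimal version of that idea: the lag plays the role of the extra transcription time contributed by a longer intron, and the equation, parameter values, and function names are assumptions, not the team's model.

```python
def simulate_delayed_repressor(delay, t_end=200.0, dt=0.01, beta=10.0,
                               k=1.0, hill=4, gamma=1.0, x0=0.1):
    """Euler integration of dx/dt = beta / (1 + (x(t - delay)/k)^hill) - gamma * x."""
    steps = int(round(t_end / dt))
    lag = int(round(delay / dt))
    xs = [x0]
    for i in range(steps):
        delayed = xs[i - lag] if i >= lag else x0  # constant history before t = 0
        production = beta / (1.0 + (delayed / k) ** hill)
        xs.append(xs[i] + dt * (production - gamma * xs[i]))
    return xs


def mean_period(xs, dt=0.01):
    """Average spacing of upward crossings of the mean level, using the second half of the run."""
    tail = xs[len(xs) // 2:]
    level = sum(tail) / len(tail)
    crossings = [i * dt for i in range(1, len(tail)) if tail[i - 1] < level <= tail[i]]
    if len(crossings) < 2:
        return None
    return (crossings[-1] - crossings[0]) / (len(crossings) - 1)


short_delay_period = mean_period(simulate_delayed_repressor(2.0))
long_delay_period = mean_period(simulate_delayed_repressor(4.0))
print(short_delay_period, long_delay_period)
```

In this toy model, doubling the delay lengthens the interval between pulses, which matches the qualitative trend Silver described.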
"We built a system from parts that has predictable behavior."
Silver also described work "to create microbes that would use sunlight and convert it to hydrogen." Her team worked with a green photosynthetic cyanobacterium that "is extremely easy to work with." In cyanobacteria (formerly called blue-green algae), fixation of carbon dioxide to useful organic compounds occurs in a geometrically regular structure called a carboxysome, and is catalyzed by the enzyme ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO).
The chemical energy to drive this reaction derives from absorption of two photons. The first photon is absorbed in Photosystem II, which splits a water molecule and transfers an electron to Photosystem I, which absorbs another photon to add to its energy before passing it on to the carboxysome.
Silver and her team adapted parts of this system to produce molecular hydrogen from water. Ordinarily the electron excited by the absorbed photon passes into iron-sulfur clusters and then tunnels into ferredoxin. "Our idea," Silver said, is to "build a [three-way] fusion protein that directs these electrons efficiently into ferredoxin and then into hydrogenases that produce hydrogen."
The researchers have tested part of this scheme by providing the energetic electrons chemically instead of with light. "We have created the ferredoxin hydrogenase fusion protein," Silver concluded. "Our long term goal is to connect it to the photosystem."
- Can the photosynthetic mechanisms of cyanobacteria be redesigned to produce hydrogen?
- How will the emerging understanding of RNA interference and related phenomena change the modeling of regulatory networks?
- How do enhancers, microRNA, and alternative splicing interact in time- and tissue-specific expression?
- What are the detailed sequence codes that affect alternative splicing?
- To what degree have gene-expression profiles been obscured by global responses of the organisms to their environments?
- What aspects of the genome contribute most to the differences between humans and chimpanzees?
- Can the increased diversity of phenotypes seen in cancer be used in treatment?
- What will be the best ways to take advantage of the extensive data collected in the Personal Genome project?
- How can the reverse engineering of biological networks best be assessed?