RECOMB Regulatory Genomics / Systems Biology / DREAM Conference 2010
Reported by
Don Monroe

Posted April 22, 2011


For the third year, three conferences on genetic regulation, systems biology and network biology joined forces. Over five days, from November 16 through 20, 2010, the meeting at the Riverside Church near Columbia University combined the 7th RECOMB Satellite Conference on Regulatory Genomics, chaired by Manolis Kellis and Ziv Bar-Joseph, with the 6th RECOMB Satellite Conference on Systems Biology and the 5th DREAM Conference, chaired by Gustavo Stolovitzky and Andrea Califano.

In addition to the keynote talks summarized in the meeting report, the conferences featured both oral and poster presentations of exciting new work in these dynamic fields. Furthermore, the DREAM Conference highlighted results of the latest round of "Challenges" to assess the skills of participants in learning about biological networks from blinded data.


Presentations available from:

Matti Annala (Tampere University of Technology, Finland)
Nicola Barbarini (University of Pavia, Italy)
Harmen Bussemaker (Columbia University)
Alberto de la Fuente (CRS4)
Tom Gingeras (Cold Spring Harbor Laboratory)
Vân Anh Huynh-Thu (University of Liège, Belgium)
Leonid Kruglyak (Princeton University)
Robert Küffner (Ludwig-Maximilians-Universität, Munich, Germany)
Po-Ru Loh (MIT)
Daniel Marbach (Massachusetts Institute of Technology)
Randall T. Moon (HHMI & the University of Washington)
Raquel Norel (IBM Research)
Yaron Orenstein (Tel Aviv University)
Rob Patro (University of Maryland)
Scott Powers (Cold Spring Harbor Laboratory)
Bobby Prill (IBM Research)
Stuart Schreiber (HHMI, Broad Institute of Harvard and MIT)
Eran Segal (Weizmann Institute of Science, Israel)
Michael Snyder (Stanford University)
Peter Sorger (Harvard Medical School)
John Stamatoyannopoulos (University of Washington)
Hans-Jürgen Thiesen (University of Rostock)
Marc Vidal (Harvard Medical School)
Matthieu Vignes (Institut National de la Recherche Agronomique (INRA), Toulouse, France)
Matthew Weirauch (University of Toronto)

Image: Phylogenetic dependency network for HIV adaptation.
Credit: Jonathan Carlson and David Heckerman (Microsoft Research).

Presented by:

  • IBM Research
  • Center for the Multiscale Analysis of Genomic and Cellular Networks
  • The New York Academy of Sciences

Strategies for Identifying and Validating New Components of Signal Transduction Networks

Randall T. Moon (HHMI, University of Washington)
  • 00:01
    1. Introduction; the Wnt-beta-catenin network
  • 02:52
    2. The signaling pathway; Promotion and inhibition; Context-dependent roles in adults
  • 09:50
    3. Identification and validation of signaling networks
  • 23:15
    4. Proteomic screens; Small molecule screens
  • 29:17
    5. Limitations; Integrating screens
  • 34:52
    6. Summary; Acknowledgements and conclusion

Transcriptional Lego: Predictable Control of Gene Expression by Manipulating Promoter Building Blocks

Eran Segal (Weizmann Institute of Science)
  • 00:01
    1. Introduction
  • 05:38
    2. The modeling framework
  • 14:10
    3. Measuring expression of a promoter sequence
  • 17:32
    4. Measurement of systematically varied sequence elements
  • 20:55
    5. The presence, length, and strength of the boundary
  • 25:48
    6. Factor and site affinity; The TF site; The importance of distance
  • 32:32
    7. The fine-tuning of expression levels
  • 35:40
    8. The importance of nucleosome disfavoring sequences; Acknowledgements and conclusion

Bridging the Gap with Small-molecule Probes of Cancer

Stuart Schreiber (HHMI, Broad Institute of Harvard and MIT)
  • 00:01
    1. Introduction
  • 04:48
    2. Mapping genotype/SM sensitivity
  • 10:21
    3. Targeting non-oncogene co-dependencies; Modeling
  • 14:58
    4. ROS biology and small-molecule sensitivity; The CTD2 Network
  • 19:23
    5. Cell-line models of cancer; CCLE and the small-molecule probe kit
  • 27:50
    6. Global clustering of pilot-phase data; Studies
  • 38:25
    7. Acknowledgements and conclusion

Interactome Networks and Human Disease

Marc Vidal (Dana–Farber Cancer Institute)
  • 00:01
    1. Introduction
  • 03:34
    2. The network approach and global properties; Biological attributes
  • 09:08
    3. Empirically-controlled mapping; Examining network perturbations, experiments
  • 20:00
    4. Comparing genetic variations and pathogens
  • 28:08
    5. Gene-centered and edge-centered views of evolution; Paralogs; Actin family
  • 35:50
    6. The evolution of interactive networks
  • 41:12
    7. Acknowledgements, summary, and conclusion

Signal Transduction and Pharmaceutical Mechanism from Bottom-Up and Top-Down Perspectives

Peter Sorger (Harvard Medical School)
  • 00:01
    1. Introduction
  • 02:17
    2. EGFR signaling and ErbB receptors; ErbB1 experiment
  • 08:38
    3. ErbB2 and ErbB3; Phospho turnover
  • 15:22
    4. Differences in phospho-dynamics
  • 20:21
    5. Model implementations and rate estimates
  • 23:53
    6. Network context; Inference of differences in topologies
  • 29:30
    7. Comparing and clustering models; Fuzzy logic modeling; Context-specific mapping
  • 33:35
    8. Summary; Grant opportunities; Acknowledgements and conclusion

Identification of Oncogenic Drivers and Predictive Biomarkers in Liver Cancer

Scott Powers (Cold Spring Harbor Laboratory)
  • 00:01
    1. Introduction; HCC treatment options
  • 04:55
    2. Oncogenes activated in human HCC; Oncogenomic cDNA screening
  • 11:13
    3. Prediction algorithms; POFUT1
  • 18:12
    4. CCND1 and FGF19
  • 25:23
    5. Oncogenomic screening in ovarian cancer
  • 27:05
    6. Acknowledgements and conclusion

Genomes and Variation

Michael Snyder (Stanford University)
  • 00:01
    1. Introduction; Sequencing with different technologies
  • 08:25
    2. Mapping structural variations; TF binding variation in yeast
  • 18:41
    3. Ste12 binding and six new factors; Trans QTLs; Amn1 and Flo8
  • 25:47
    4. TF binding variation among people; Mappable variations
  • 34:24
    5. Conclusions and acknowledgement

What Is the Genetic Basis of Phenotypic Variation?

Leonid Kruglyak (Princeton University)
  • 00:01
    1. Introduction; Height and heritability
  • 09:15
    2. Dissecting genetically complex phenotypes through yeast
  • 18:40
    3. 4NQO sensitivity; Assessing effects and interactions; Architectural differences
  • 28:30
    4. Dissecting complex traits in populations; Simple allelic patterns
  • 34:45
    5. Summary and future directions; Acknowledgments and conclusion

Eukaryotic Transcriptomes: Complex, Multifunctional, Compartmentalized, and Elegant

Thomas Gingeras (Cold Spring Harbor Laboratory)
  • 00:01
    1. Introduction
  • 06:05
    2. GENCODE; Changes in RNAseq data
  • 12:30
    3. Sub-cellular compartmentalization; ENCODE transcriptome project; IDR
  • 24:08
    4. Unannotated transcripts; Conclusion

Identifying the Genetic Determinants of TF Activity

Harmen Bussemaker (Columbia University)
  • 00:01
    1. Introduction; Modeling philosophy
  • 03:07
    2. Aspects of TF function; Identifying genetic determinants of TF activity
  • 11:31
    3. Identification through protein-protein interaction data
  • 15:10
    4. Summary and conclusion

Learning from Diversity

Rob Patro (University of Maryland)

Prediction of Peptide Reactivity with Human IVIg through a Knowledge-based Approach

Nicola Barbarini (University of Pavia, Italy)

A DREAM5 Best Performing Method for TF Binding Affinity Prediction in PBM Microarrays

Matti Annala (Tampere University of Technology, Finland)

Analyzing PBM Data to Find Binding Site Motifs and Predict TF Binding Intensities

Yaron Orenstein (Tel Aviv University)

Gene Regulatory Network Reconstruction Using Bayesian Networks, the Dantzig Selector, and the Lasso: A Meta-Analysis

Matthieu Vignes (Institut National de la Recherche Agronomique, Toulouse, France)

Max-Correlation Min-Redundancy and Other Regression Variants Predict Phenotype in DREAM5

Po-Ru Loh (Massachusetts Institute of Technology)

Regulatory Network Inference with GENIE3: Application to the DREAM5 Challenge

Vân Anh Huynh-Thu (University of Liège, Belgium)

Inference of GRNs by ANOVA

Robert Küffner (Ludwig-Maximilians-Universität, Munich, Germany)

Epitope–Antibody Recognition (EAR)

Hans-Jürgen Thiesen (University of Rostock)

Learning and Testing Transcription Factor Models Using Protein Binding Microarrays

Matt Weirauch (University of Toronto)

The DREAM5 Systems Genetics Challenges

Alberto de la Fuente (CRS4)

Profiling Network Inference Methods: The DREAM5 Network Inference Challenge

Daniel Marbach (Massachusetts Institute of Technology)

DREAM5 Challenge 2 Results

Raquel Norel (IBM Research)

DREAM5 Challenge 1 Results: Epitope–Antibody Recognition (EAR)

Bobby Prill (IBM Research)

DREAM5 Challenge 3 Results: Systems Genetics A & B

Bobby Prill (IBM Research)

DREAM5 Challenge 4 Results: [Gene] Network Inference

Bobby Prill (IBM Research)

Projects, Databases, and Tools

Results for DREAM5 Challenges

Mammalian Gene Collection
This database provides researchers with unrestricted access to sequence-validated full-length protein-coding cDNA clones for human, mouse, and rat genes.

Human Gene Mutation Database
This database collates published gene lesions responsible for human inherited disease.

The Cancer Genome Atlas
The Cancer Genome Atlas is a comprehensive and coordinated effort to accelerate our understanding of the genetics of cancer using innovative genome analysis technologies.

International Cancer Genome Consortium (ICGC)
The ICGC aims to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe.

Cancer Genomics Pathway Portal
This portal provides direct download and visualization of large-scale cancer genomics data sets, currently prostate cancer, sarcoma, and glioblastoma multiforme.

Cancer Target Discovery and Development (CTD2) Network
The network is aimed at developing new scientific approaches to accelerate the translation of genomic discoveries into new treatments.

ChemBank is a public, web-based informatics environment, including data derived from small molecules and small-molecule screens, and resources for studying the data.

ENCODE Project
The ENCODE Project aims to identify all functional elements in the human genome sequence.

The GENCODE project is a sub-project of the ENCODE scale-up project whose aim is to annotate all evidence-based gene features in the entire human genome at high accuracy.

modENCODE will try to identify all of the sequence-based functional elements in the Caenorhabditis elegans and Drosophila melanogaster genomes.

NIH Roadmap Epigenomics Mapping Consortium
The NIH Roadmap Epigenomics Mapping Consortium aims to produce a public resource of human epigenomic data to catalyze basic biology and disease-oriented research.

Pathway Commons
Pathway Commons is a tool to search and visualize public biological pathway information.

Online Mendelian Inheritance in Man (OMIM)
OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes.

Pfam is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

UCSC Genome browser
This site contains the reference sequence and working draft assemblies for a large collection of genomes.

1000 Genomes Project
The 1000 Genomes Project aims to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

Catalog of Genome-wide Association Studies
The Catalog of Genome-wide Association Studies lists studies attempting to assay at least 100,000 single nucleotide polymorphisms (SNPs).

Journal Articles

Steven Altschuler

Altschuler SJ, Wu LF. Cellular heterogeneity: do differences make a difference? Cell. 2010;141(4):559-563.

Altschuler SJ, Angenent SB, Wang Y, Wu LF. On the spontaneous emergence of cell polarity. Nature. 2008;454(7206):886-889.

Loo L, Lin H, Singh DK, et al. Heterogeneity in the physiological states and pharmacological responses of differentiating 3T3-L1 preadipocytes. J. Cell Biol. 2009;187(3):375-384.

Singh DK, Ku C, Wichaidit C, et al. Patterns of basal signaling heterogeneity can distinguish cellular populations with different drug sensitivities. Mol. Syst. Biol. 2010;6:369.

Charlie Boone

Baryshnikova A, Costanzo M, Kim Y, et al. Quantitative analysis of fitness and genetic interactions in yeast on a genome scale. Nat. Methods. 2010;7(12):1017-1024.

Costanzo M, Baryshnikova A, Bellay J, et al. The genetic landscape of a cell. Science. 2010;327(5964):425-431.

Dowell RD, Ryan O, Jansen A, et al. Genotype to phenotype: a complex problem. Science. 2010;328(5977):469.

Leidel S, Pedrioli PGA, Bucher T, et al. Ubiquitin-related modifier Urm1 acts as a sulphur carrier in thiolation of eukaryotic transfer RNA. Nature. 2009;458(7235):228-232.

Tong AHY, Lesage G, Bader GD, et al. Global mapping of the yeast genetic interaction network. Science. 2004;303(5659):808-813.

Harmen Bussemaker

Brem R, Hall J. XRCC1 is required for DNA single-strand break repair in human cells. Nucleic Acids Res. 2005;33(8):2512-2520.

Brem RB, Kruglyak L. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci. U.S.A. 2005;102(5):1572-1577.

Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature. 2005;436(7051):701-703.

Brown TA. Genomes. 2nd edition. Oxford: Wiley-Liss; 2002.

Bussemaker HJ, Foat BC, Ward LD. Predictive modeling of genome-wide mRNA expression: from modules to molecules. Annu Rev Biophys Biomol Struct. 2007;36:329-347.

Fox KR. The effect of HhaI methylation on DNA local structure. Biochem. J. 1986;234(1):213-216.

Hesselberth JR, Chen X, Zhang Z, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat. Methods. 2009;6(4):283-289.

Joshi R, Passner JM, Rohs R, et al. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell. 2007;131(3):530-543.

Lee E, Bussemaker HJ. Identifying the genetic determinants of transcription factor activity. Mol. Syst. Biol. 2010;6:412.

Lee AP, Koh EGL, Tay A, Brenner S, Venkatesh B. Highly conserved syntenic blocks at the vertebrate Hox loci and conserved regulatory elements within and outside Hox gene clusters. Proc. Natl. Acad. Sci. U.S.A. 2006;103(18):6994-6999.

MacIsaac KD, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput. Biol. 2006;2(4):e36.

MacIsaac KD, Gordon DB, Nekludova L, et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics. 2006;22(4):423-429.

Noyes MB, Meng X, Wakabayashi A, et al. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Res. 2008;36(8):2547-2560.

Noyes MB, Christensen RG, Wakabayashi A, et al. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell. 2008;133(7):1277-1289.

Rockman MV, Kruglyak L. Genetics of global gene expression. Nat. Rev. Genet. 2006;7(11):862-872.

Suck D, Oefner C. Structure of DNase I at 2.0 A resolution suggests a mechanism for binding to and cutting DNA. Nature. 1986;321(6070):620-625.

Yvert G, Brem RB, Whittle J, et al. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet. 2003;35(1):57-64.

Tom Gingeras

Birney E, Stamatoyannopoulos JA, Dutta A, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799-816.

Denoeud F, Kapranov P, Ucla C, et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007;17(6):746-759.

Gingeras TR. Implications of chimaeric non-co-linear transcripts. Nature. 2009;461(7261):206-211.

Kapranov P, Drenkow J, Cheng J, et al. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 2005;15(7):987-997.

Li H, Wang J, Mor G, Sklar J. A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science. 2008;321(5894):1357-1361.

Parra G, Reymond A, Dabbouseh N, et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 2006;16(1):37-44.

Salehi-Ashtiani K, Yang X, Derti A, et al. Isoform discovery by targeted cloning, 'deep-well' pooling and parallel sequencing. Nat. Methods. 2008;5(7):597-600.

Post-transcriptional processing generates a diversity of 5'-modified long and short RNAs. Nature. 2009;457(7232):1028-1032.

Leonid Kruglyak

Demogines A, Smith E, Kruglyak L, Alani E. Identification and dissection of a complex DNA repair sensitivity phenotype in Baker's yeast. PLoS Genet. 2008;4(7):e1000123.

Ehrenreich IM, Torabi N, Jia Y, et al. Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature. 2010;464(7291):1039-1042.

Khan Z, Bloom JS, Garcia BA, Singh M, Kruglyak L. Protein quantification across hundreds of experimental conditions. Proc. Natl. Acad. Sci. U.S.A. 2009;106(37):15544-15548.

Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747-753.

Rockman MV, Skrovanek SS, Kruglyak L. Selection at linked sites shapes heritable phenotypic variation in C. elegans. Science. 2010;330(6002):372-376.

Randall T. Moon

Ahumada A, Slusarski DC, Liu X, et al. Signaling of rat Frizzled-2 through phosphodiesterase and cyclic GMP. Science. 2002;298(5600):2006-2010.

Angers S, Thorpe CJ, Biechele TL, et al. The KLHL12-Cullin-3 ubiquitin ligase negatively regulates the Wnt-beta-catenin pathway by targeting Dishevelled for degradation. Nat. Cell Biol. 2006;8(4):348-357.

Brannon M, Brown JD, Bates R, Kimelman D, Moon RT. XCtBP is a XTcf-3 co-repressor with roles throughout Xenopus development. Development. 1999;126(14):3159-3170.

Cheyette BNR, Waxman JS, Miller JR, et al. Dapper, a Dishevelled-associated antagonist of beta-catenin and JNK signaling, is required for notochord formation. Dev. Cell. 2002;2(4):449-461.

Dorsky RI, Raible DW, Moon RT. Direct regulation of nacre, a zebrafish MITF homolog required for pigment cell formation, by the Wnt pathway. Genes Dev. 2000;14(2):158-162.

James RG, Biechele TL, Conrad WH, et al. Bruton's tyrosine kinase revealed as a negative regulator of Wnt-beta-catenin signaling. Sci Signal. 2009;2(72):ra25.

Kaykas A, Yang-Snyder J, Héroux M, et al. Mutant Frizzled 4 associated with vitreoretinopathy traps wild-type Frizzled in the endoplasmic reticulum by oligomerization. Nat. Cell Biol. 2004;6(1):52-58.

Liu T, DeCostanzo AJ, Liu X, et al. G protein signaling from activated rat frizzled-1 to the beta-catenin-Lef-Tcf pathway. Science. 2001;292(5522):1718-1722.

Major MB, Moon RT. "Omic" risk assessment. Sci Signal. 2009;2(72):eg7.

Major MB, Camp ND, Berndt JD, et al. Wilms tumor suppressor WTX negatively regulates WNT/beta-catenin signaling. Science. 2007;316(5827):1043-1046.

Major MB, Roberts BS, Berndt JD, et al. New regulators of Wnt/beta-catenin signaling revealed by integrative molecular screening. Sci Signal. 2008;1(45):ra12.

McMahon AP, Moon RT. Ectopic expression of the proto-oncogene int-1 in Xenopus embryos leads to duplication of the embryonic axis. Cell. 1989;58(6):1075-1084.

Raevskiĭ KS. [Effect of reserpine on the analgesic effect of morphine and promedol]. Farmakol Toksikol. 1969;32(2):134-137.

Takemaru K, Yamaguchi S, Lee YS, et al. Chibby, a nuclear beta-catenin-associated antagonist of the Wnt/Wingless pathway. Nature. 2003;422(6934):905-909.

Waxman JS, Hocking AM, Stoick CL, Moon RT. Zebrafish Dapper1 and Dapper2 play distinct roles in Wnt-mediated developmental processes. Development. 2004;131(23):5909-5921.

Yost C, Torres M, Miller JR, et al. The axis-inducing activity, stability, and subcellular distribution of beta-catenin is regulated in Xenopus embryos by glycogen synthase kinase 3. Genes Dev. 1996;10(12):1443-1454.

Scott Powers

Bric A, Miething C, Bialucha CU, et al. Functional identification of tumor-suppressor genes through an in vivo RNA interference screen in a mouse lymphoma model. Cancer Cell. 2009;16(4):324-335.

Croft D, O'Kelly G, Wu G, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39(Database issue):D691-697.

Nicholes K, Guillet S, Tomlinson E, et al. A mouse model of hepatocellular carcinoma: ectopic expression of fibroblast growth factor 19 in skeletal muscle of transgenic mice. Am. J. Pathol. 2002;160(6):2295-2307.

Pai R, Dunlap D, Qing J, et al. Inhibition of fibroblast growth factor 19 reduces tumor growth by modulating beta-catenin signaling. Cancer Res. 2008;68(13):5086-5095.

Tomlinson E, Fu L, John L, et al. Transgenic mice expressing human fibroblast growth factor-19 display increased metabolic rate and decreased adiposity. Endocrinology. 2002;143(5):1741-1747.

Wu G, Feng X, Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 2010;11(5):R53.

Chris Sander

Cerami EG, Gross BE, Demir E, et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39(Database issue):D685-690.

Taylor BS, Barretina J, Socci ND, et al. Functional copy-number alterations in cancer. PLoS ONE. 2008;3(9):e3179.

Taylor BS, Schultz N, Hieronymus H, et al. Integrative genomic profiling of human prostate cancer. Cancer Cell. 2010;18(1):11-22.

Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061-1068.

Stuart Schreiber

Druker BJ. Perspectives on the development of imatinib and the future of cancer research. Nat. Med. 2009;15(10):1149-1152.

Kaelin WG, Thompson CB. Q&A: Cancer: clues from cell metabolism. Nature. 2010;465(7298):562-564.

Luo J, Solimini NL, Elledge SJ. Principles of cancer therapy: oncogene and non-oncogene addiction. Cell. 2009;136(5):823-837.

Perlstein EO, Ruderfer DM, Ramachandran G, et al. Revealing complex traits with small molecules and naturally recombinant yeast strains. Chem. Biol. 2006;13(3):319-327.

Perlstein EO, Ruderfer DM, Roberts DC, Schreiber SL, Kruglyak L. Genetic basis of individual differences in the response to small-molecule drugs in yeast. Nat. Genet. 2007;39(4):496-502.

Ramanathan A, Wang C, Schreiber SL. Perturbational profiling of a cell-line model of tumorigenesis by using metabolic measurements. Proc. Natl. Acad. Sci. U.S.A. 2005;102(17):5992-5997.

Schreiber SL, Shamji AF, Clemons PA, et al. Towards patient-based cancer therapeutics. Nat. Biotechnol. 2010;28(9):904-906.

Shaw SY, Blodgett DM, Ma MS, et al. Disease allele-dependent small-molecule sensitivities in blood cells from monogenic diabetes. Proc. Natl. Acad. Sci. U.S.A. 2011;108(2):492-497.

Stockwell BR, Haggarty SJ, Schreiber SL. High-throughput screening of small molecules in miniaturized mammalian cell-based assays involving post-translational modifications. Chem. Biol. 1999;6(2):71-83.

Eran Segal

Field Y, Kaplan N, Fondufe-Mittendorf Y, et al. Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput. Biol. 2008;4(11):e1000216.

Iyer V, Struhl K. Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J. 1995;14(11):2570-2579.

Kaplan N, Moore IK, Fondufe-Mittendorf Y, et al. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature. 2009;458(7236):362-366.

Kornberg RD, Stryer L. Statistical distributions of nucleosomes: nonrandom locations by a stochastic mechanism. Nucleic Acids Res. 1988;16(14A):6677-6690.

Raveh-Sadka T, Levo M, Segal E. Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Res. 2009;19(8):1480-1496.

Peter Sorger

Chen WW, Niepel M, Sorger PK. Classic and contemporary approaches to modeling biochemical reactions. Genes Dev. 2010;24(17):1861-1875.

Morris MK, Saez-Rodriguez J, Sorger PK, Lauffenburger DA. Logic-based models for the analysis of cell signaling networks. Biochemistry. 2010;49(15):3216-3224.

Saez-Rodriguez J, Alexopoulos LG, Epperlein J, et al. Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol. Syst. Biol. 2009;5:331.

Michael Snyder

Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB. Annotating non-coding regions of the genome. Nat. Rev. Genet. 2010;11(8):559-571.

Habegger L, Sboner A, Gianoulis TA, et al. RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 2011;27(2):281-283.

Kasowski M, Grubert F, Heffelfinger C, et al. Variation in transcription factor binding among humans. Science. 2010;328(5975):232-235.

Raha D, Wang Z, Moqtaderi Z, et al. Close association of RNA polymerase II and many transcription factors with Pol III genes. Proc. Natl. Acad. Sci. U.S.A. 2010;107(8):3639-3644.

Zheng W, Zhao H, Mancera E, Steinmetz LM, Snyder M. Genetic analysis of variation in transcription factor binding in yeast. Nature. 2010;464(7292):1187-1191.

John Stamatoyannopoulos

Bernstein BE, Stamatoyannopoulos JA, Costello JF, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 2010;28(10):1045-1048.

Dostie J, Richmond TA, Arnaout RA, et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 2006;16(10):1299-1309.

Gheldof N, Smith EM, Tabuchi TM, et al. Cell-type-specific long-range looping interactions identify distant regulatory elements of the CFTR gene. Nucleic Acids Res. 2010;38(13):4325-4336.

Hesselberth JR, Chen X, Zhang Z, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat. Methods. 2009;6(4):283-289.

Stamatoyannopoulos JA. Illuminating eukaryotic transcription start sites. Nat. Methods. 2010;7(7):501-503.

Marc Vidal

Braun P, Tasan M, Dreze M, et al. An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods. 2009;6(1):91-97.

Gavin A, Bösche M, Krause R, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141-147.

Gavin A, Aloy P, Grandi P, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440(7084):631-636.

Giot L, Bader JS, Brouwer C, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727-1736.

Goh K, Cusick ME, Valle D, et al. The human disease network. Proc. Natl. Acad. Sci. U.S.A. 2007;104(21):8685-8690.

Gunsalus KC, Ge H, Schetter AJ, et al. Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature. 2005;436(7052):861-865.

Hanada K, Kuromori T, Myouga F, Toyoda T, Shinozaki K. Increased expression and protein divergence in duplicate genes is associated with morphological diversification. PLoS Genet. 2009;5(12):e1000781.

Ito T, Chiba T, Ozawa R, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U.S.A. 2001;98(8):4569-4574.

Krogan NJ, Cagney G, Yu H, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440(7084):637-643.

Lee MG, Nurse P. Complementation used to clone a human homologue of the fission yeast cell cycle control gene cdc2. Nature. 1987;327(6117):31-35.

Pujana MA, Han JJ, Starita LM, et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat. Genet. 2007;39(11):1338-1349.

Venkatesan K, Rual J, Vazquez A, et al. An empirical framework for binary interactome mapping. Nat. Methods. 2009;6(1):83-90.

Whyte P, Buchkovich KJ, Horowitz JM, et al. Association between an oncogene and an anti-oncogene: the adenovirus E1A proteins bind to the retinoblastoma gene product. Nature. 1988;334(6178):124-129.

Zhong Q, Simonis N, Li Q, et al. Edgetic perturbation models of human inherited disorders. Mol. Syst. Biol. 2009;5:321.

DREAM Challenges

Bhaskaran R, Ponnuswamy PK. Dynamics of amino acid residues in globular proteins. Int. J. Pept. Protein Res. 1984;24(2):180-191.

Bhaskaran R, Yu C. NMR spectra and restrained molecular dynamics of the mushroom toxin viroisin. Int. J. Pept. Protein Res. 1994;43(4):393-401.

Butte AJ, Kohane IS. Unsupervised knowledge discovery in medical databases using relevance networks. Proc AMIA Symp. 1999:711-715.

Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 2007;35:2313-2351.

Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 1978;47:45-148.

Efron B, Tibshirani R. Empirical bayes methods and false discovery rates for microarrays. Genet. Epidemiol. 2002;23(1):70-86.

Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):e8.

Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000;7(3-4):601-620.

Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432-441.

Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE. 2010;5(9).

Janin J. Surface and inside volumes in globular proteins. Nature. 1979;277(5696):491-492.

Jansen RC, Nap JP. Genetical genomics: the added value from segregation. Trends Genet. 2001;17(7):388-391.

Kolaskar AS, Tongaonkar PC. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990;276(1-2):172-174.

Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7 Suppl 1:S7.

Margolin AA, Wang K, Lim WK, et al. Reverse engineering cellular networks. Nat Protoc. 2006;1(2):662-671.

Parker JM, Guo D, Hodges RS. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry. 1986;25(19):5425-5432.

Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21(6):754-764.

Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 2010;11(11):751-760.

Tiengo A, Barbarini N, Troiani S, Rusconi L, Magni P. A Perl procedure for protein identification by Peptide Mass Fingerprinting. BMC Bioinformatics. 2009;10 Suppl 12:S11.


Ziv Bar-Joseph, PhD

Carnegie Mellon University

Ziv Bar-Joseph is an Associate Professor in the Lane Center for Computational Biology at Carnegie Mellon University. Before starting this academic post, Bar-Joseph spent 4 years (1999–2003) in Cambridge, Massachusetts, earning his PhD in computer science under the guidance of David Gifford and Tommi Jaakkola. He did his master's and undergraduate work at the Hebrew University, acquiring a BSc in computer science and mathematics and then an MSc in computer science.

His work at Carnegie Mellon is in computational biology, bioinformatics and machine learning. Bar-Joseph also leads the systems biology group where researchers develop computational methods for understanding the interactions, dynamics and conservation of complex biological systems. Some of his previous work has focused on the exciting areas of distributed computing and computer graphics.

Andrea Califano, PhD

Columbia University

Andrea Califano is professor of biomedical informatics at Columbia University, where he leads several cross-campus activities in computational and systems biology. Califano is also codirector of the Center for Computational Biology and Bioinformatics, director of the Center for the Multiscale Analysis of Genetic Networks, and associate director for bioinformatics at the Irving Cancer Research Center.

Califano completed his doctoral thesis in physics at the University of Florence and studied the behavior of high-dimensional dynamical systems. From 1986 to 1990, he was on the research staff in the Exploratory Computer Vision Group at the IBM Thomas J. Watson Research Center, where he worked on several algorithms for machine learning, including the interpretation of two- and three-dimensional visual scenes. In 1997 he became the program director of the IBM Computational Biology Center, and in 2000 he cofounded First Genetic Trust, Inc., to pursue translational genomics research and infrastructure-related activities in the context of large-scale patient studies with a genetic component.

Manolis Kellis, PhD

Manolis Kellis is an Associate Professor of Computer Science at MIT and a member of the Computer Science and Artificial Intelligence Laboratory and of the Broad Institute of MIT and Harvard, where he directs the MIT Computational Biology Group. His group has recently been funded to lead the integrative analysis efforts of the modENCODE project for Drosophila melanogaster and also for integrative analysis of the NIH Epigenome Roadmap Project. He has received the US Presidential Early Career Award in Science and Engineering (PECASE) for his NIH R01 work in Computational Genomics, the NSF CAREER award, the Alfred P. Sloan Fellowship, the Karl Van Tassel chair in EECS, the Distinguished Alumnus 1964 chair, and the Ruth and Joel Spira Teaching Award in EECS. Kellis obtained his PhD from MIT, where he received the Sprowls award for the best doctoral thesis in computer science, and the first Paris Kanellakis graduate fellowship. Prior to computational biology, he worked on artificial intelligence, sketch and image recognition, robotics, and computational geometry, at MIT and at the Xerox Palo Alto Research Center.

Gustavo Stolovitzky, PhD

IBM Research

Gustavo Stolovitzky is manager of the Functional Genomics and Systems Biology Group at the IBM Computational Biology Center in IBM Research. The Functional Genomics and Systems Biology group is involved in several projects, including DNA chip analysis and gene expression data mining, the reverse engineering of metabolic and gene regulatory networks, modeling cardiac muscle, describing emergent properties of the myofilament, modeling P53 signaling pathways, and performing massively parallel signature sequencing analysis.

Stolovitzky received his PhD in mechanical engineering from Yale University and worked at The Rockefeller University and at the NEC Research Institute before coming to IBM. He has served as Joliot Invited Professor at Laboratoire de Mecanique de Fluides in Paris and as visiting scholar at the physics department of The Chinese University of Hong Kong. Stolovitzky is a member of the steering committee at the Systems Biology Discussion Group of the New York Academy of Sciences.


Steven Altschuler, PhD

University of Texas Southwestern

Matti Annala

Tampere University of Technology, Finland

Nicola Barbarini, PhD

University of Pavia, Italy

Charlie Boone, PhD

University of Toronto, Canada

Harmen Bussemaker, PhD

Columbia University

Alberto de la Fuente, PhD

CRS4 Bioinformatica, Italy

Tom Gingeras, PhD

Cold Spring Harbor Laboratory

Vân Anh Huynh-Thu

University of Liège, Belgium

Leonid Kruglyak, PhD

Princeton University

Robert Küffner, PhD

Ludwig-Maximilians-Universität, Munich, Germany

Po-Ru Loh

Randall T. Moon, PhD

HHMI, University of Washington

Raquel Norel, PhD

IBM Research

Yaron Orenstein

Tel Aviv University

Rob Patro

University of Maryland

Scott Powers, PhD

Cold Spring Harbor Laboratory

Bobby Prill, PhD

IBM Research

Chris Sander, PhD

Memorial Sloan-Kettering Cancer Center, Sloan-Kettering Institute

Stuart Schreiber, PhD

HHMI, Broad Institute of Harvard and MIT

Eran Segal, PhD

Weizmann Institute of Science, Israel

Michael Snyder, PhD

Stanford University

Peter Sorger, PhD

Harvard Medical School

John Stamatoyannopoulos, PhD

University of Washington

Gustavo Stolovitzky, PhD

IBM Research

Marc Vidal, PhD

Harvard Medical School

Matthieu Vignes, PhD

Institut National de la Recherche Agronomique (INRA), Toulouse, France

Matthew Weirauch, PhD

University of Toronto, Canada

Don Monroe

Don Monroe is a science writer based in Murray Hill, New Jersey. After getting a PhD in physics from MIT, he spent more than fifteen years doing research in physics and electronics technology at Bell Labs. He writes on biology, physics, and technology.


  • IBM Research
  • Center for the Multiscale Analysis of Genomic and Cellular Networks
  • The New York Academy of Sciences

For the third year, three conferences on genetic regulation, systems biology and network biology joined forces. Over five days, the meeting at the Riverside Church near Columbia University combined the 7th RECOMB Satellite Conference on Regulatory Genomics, chaired by Manolis Kellis and Ziv Bar-Joseph, with the 6th RECOMB Satellite Conference on Systems Biology and the 5th DREAM Conference, chaired by Gustavo Stolovitzky and Andrea Califano.

In addition to the keynote talks summarized below, the conferences featured both oral and poster presentations of exciting new work in these dynamic fields. Furthermore, the DREAM Conference highlighted results of the latest round of "Challenges" to assess the skills of participants in learning about biological networks from blinded data.


The expression patterns of cells depend in part on the cellular context provided by signaling pathways such as the Wnt and ErbB pathways, both of which are implicated in cancers. Stimulation of the versatile Wnt signaling pathway has markedly different effects, depending on when and where it occurs. To find molecules that control this sensitivity, Randall Moon uses both frog and fish embryo assays and cell-based assays using siRNA, proteomics, and small molecules. Integrating the results from these screens helps compensate for their individual weaknesses, but multiple validation steps are still critical, he said. Peter Sorger integrates signaling experiments and models, including those for the ErbB pathway, at various levels of detail. Sorger's models range from differential equations for detailed dynamics to Boolean models for large networks. He noted an inherent conflict between the level of biological detail and the ability to determine model parameters.


Liver cancers show widespread changes in gene copy numbers. Scott Powers explored genomic regions that are frequently amplified in these cancers and found 18 genes whose overexpression causes liver cancer in a mouse model. These genes may be useful as biomarkers for the pathways that are disrupted in particular tumors. Cancer genome projects show that the mutations in each patient are different, but they tend to affect a common group of pathways. Chris Sander thinks the best therapies will recognize these common modules but use combinations of drugs aimed at particular subgroups of patients.


Heterogeneity in cell populations is ubiquitous, said Steven Altschuler, and is often biologically important. Because individual cells may behave completely differently from the population average, researchers need to justify their use of averages. Recognizing the heterogeneous distribution of transcribed RNA in different subcellular compartments, said Thomas Gingeras, lets researchers find rare transcripts. About half of the genome is transcribed and processed, he said, and appears to include many types of RNA whose function is not yet understood.


The analysis of the response of cells to small molecules, said Stuart Schreiber, provides biological insight into cellular processes as well as the potential for therapeutics. He and his colleagues are systematically cataloguing dose-response relationships for a panel of highly specific molecules, which together with genetic characterization of cells should provide an important tool for translating science into treatments. Marc Vidal said that the increasing mapping and understanding of networks such as those of protein interactions give important biological insight. More than half of the mutations leading to human disease appear to change the way macromolecules interact, rather than crippling the individual molecular species. These "edgetic" mutations can help explain the close relationships between many diseases.

Genetic interactions

The patterns of genetic interaction, such as which gene pairs show synthetic lethality when both are deleted, give powerful insight into their functions, said Charlie Boone. He and his collaborators are using quantitative analyses to assess all 18 million possible double mutants of the 6000 yeast genes, and building networks based on the results. But natural variability is much more complicated, because gene variants typically have small effects and appear against a background of variations in other genes. The background can modify the effect of a variant and confound attempts to trace heritable traits to individual variants. Leonid Kruglyak has developed methods to map all the genetic loci that contribute to continuously variable traits, revealing many loci and complex interactions among them.
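The figure of 18 million double mutants quoted above follows from counting unordered pairs of genes. A quick check, assuming the round number of 6000 genes used in the talk:

```python
from math import comb

# Number of yeast genes, using the round figure quoted in the talk.
n_genes = 6000

# Each double mutant corresponds to an unordered pair of distinct genes.
n_double_mutants = comb(n_genes, 2)

print(n_double_mutants)  # 17997000, i.e. roughly 18 million
```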

Genome-wide regulation

Differences in regulatory interactions, in particular in binding of transcription factors, are behind much individual variation as well as disease, said Michael Snyder. The sequence differences that underlie binding variations often affect not the motifs of the transcription factors themselves, but of other cofactors that can be just as important for the few genes they regulate. The overall accessibility of DNA to regulatory molecules can be mapped using the endonuclease DNase I, said John Stamatoyannopoulos, and the pattern of its binding over the genome mirrors developmental lineages. At a single-nucleotide level, cleavage gives a specific fingerprint for each transcription factor.

DNA–protein interactions

The organization of nucleosomes and transcription factors is largely explained by their sequence-dependent affinity, said Eran Segal. His team has developed a quantitative expression assay to assess the effects of sequence changes, including the dramatic effect of rigid poly-adenosine sequences in inhibiting the formation of nucleosomes and thus allowing transcription factors access to DNA. The calculated binding affinity of proteins to DNA should include dependencies between different positions, said Harmen Bussemaker, as shown by sequence-dependent cleavage of DNA by DNase I. Using predicted affinities allows researchers to precisely extract the genetic determinants of transcription factors.

The DREAM challenges

The DREAM challenges critically assess methods for making deductions about biological systems from high-throughput data. Because of the diversity of measurement techniques and biological problems, the various challenges are customized and adjusted from year to year.

This year, the co-organizers for the challenges were Gustavo Stolovitzky, Robert Prill, and Julio Saez-Rodriguez. Scoring was led by Prill and Raquel Norel, with website support from Tom Garben. There were four challenges for DREAM5.

The Epitope-Antibody Recognition (EAR) Challenge required the teams to predict which peptides would react with a panel of antibodies, based on the known reactivity of a similar group of peptides. The data were assembled by Hans-Juergen Thiessen and his colleagues. The best performing teams, represented by Rob Patro and Nicola Barbarini, both used machine classification based mainly on numerous features of peptide sequence data.

The Transcription-Factor-DNA-Motif Recognition Challenge data were assembled by Matt Weirauch and Tim Hughes from Protein-Binding Microarrays (PBMs). The main challenge was to predict the binding preferences of transcription factors to a wide range of DNA sequences, based on the measured specificities of a training set of transcription factors. A bonus challenge was to identify the transcription factors assayed in each experiment. The best performing team for the main challenge, represented by Matti Annala, used a linear-affinity model based on the most-informative short sequences. That team shared the best performer title for the bonus round with a team represented by Yaron Orenstein employing the Amadeus motif finder, also based on the most informative short sequences.

The Systems Genetics Challenge used the type of data that emerges from crosses between two very different pure strains. The cross produces a large number of distinct offspring, each with a genotype that has one or the other parent's allele for each gene. The data were assembled by Alberto de la Fuente and his colleagues. Part A of this challenge used simulated data from a 1000-gene network with both cis and trans genetic variations. The best performing team, represented by Matthieu Vignes, used several different algorithms and combined the results. Part B used experimental data for mold resistance in soybeans. The best performing team, represented by Po-Ru Loh, used a rank-ordering transformation to avoid being swamped by extreme outliers, and included Boolean logical combinations to account for possible interactions between genes.
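The rank-ordering idea mentioned for Part B can be illustrated with a generic rank transform. This is only a minimal sketch of the general technique, not the team's actual code; the function name and the phenotype values are hypothetical.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical phenotype scores with one extreme outlier (250.0).
phenotype = [0.9, 1.1, 1.0, 250.0, 1.1]
ranked = rank_transform(phenotype)
print(ranked)  # [1.0, 3.5, 2.0, 5.0, 3.5]
```

After the transform the outlier is simply the largest rank, so it can no longer dominate a correlation or regression computed on the ranks.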

The Network Inference Challenge represented a recurring challenge at DREAM meetings, that of deducing a gene network from expression levels in various perturbed states. Daniel Marbach, Jim Costello, Diogo Camacho, and Jim Collins assembled the data. They included data from a perfectly known in silico network, as in previous DREAM challenges, but also three sets of biological data. Yeast and E. coli data were scored based on well accepted networks for those species. In contrast, data for Staphylococcus aureus were not scored but will be used to generate a community prediction, since there is no accepted network for this microbe. The best performer overall and in silico, represented by Vân Anh Huynh-Thu, used a decision-tree model, as in the team's best-performer analysis from DREAM4. The best performer for in vivo data, represented by Robert Küffner, was also a repeat from DREAM4, and used the ANOVA test.

The inclusion of biological and simulated data side by side represents an important maturation of the DREAM challenges, and may help them garner more attention from biologists for these inference methods.

Randall Moon, University of Washington
Peter Sorger, Harvard Medical School and Massachusetts Institute of Technology


  • The Wnt signaling pathway is involved in many processes, including embryonic development and cell regeneration, but its effect varies strongly with the cellular context.
  • Large-scale, cell-based assays, including siRNA knockdowns, proteomic networks, and small-molecule screens give important clues about which genes and proteins modulate signaling on this pathway.
  • The integration of data from multiple screens identifies candidates more accurately than single analyses, but multiple levels of validation are still critical to assuring accurate conclusions.
  • Dynamic measurements of the ErbB signaling pathway showed a rapid turnover of phosphorylation state that contrasts with the normal long recovery times, and is critical for modeling of drug responses.
  • There is an inherent conflict between comprehensive biological modeling and the ability to determine parameter values, but choosing the most informative experiments can vastly improve the efficiency of this process.
  • Modeling of large-scale networks requires approaches such as Boolean or fuzzy logic that are more efficient than exhaustive differential equations.

Dissecting Wnt signaling

Biological signaling defines the biological context within which genetic programs are executed. In his 25 years of studying the Wnt signaling pathway, Randall Moon has learned the importance of independent checks of biological significance. "Multiple levels of validation are incredibly important," he says, especially as researchers apply new high-throughput tools that generate many molecular candidates for roles in a pathway.

The transmembrane Wnt receptor initiates signals that affect a wide variety of processes. In development, for example, this signal is central to the organization of tissues and organs, so that frog embryos develop a second head in response to extraneous Wnt signaling. In adults, the pathway is important for regeneration and stem-cell homeostasis, while excessive Wnt signaling leads to various tumors.

"Wnt signaling can have quite a few different effects depending on when and where it is expressed," Moon emphasized, producing almost opposite changes in identical cells whose age differs by only a few hours. "The hallmark of Wnt signaling is that it is context dependent."

"The key control of this pathway centers around the regulation of stability of the β-catenin protein," he noted, which acts by translocating to the nucleus to modify transcription. β-catenin activity is regulated by ubiquitination, which targets the protein for proteasomal degradation. Wnt signaling inhibits this ubiquitination.

Many of Moon's experiments are performed with embryos, for example of frogs and zebrafish, where he noted that "the Wnt signaling pathway is used in a completely normal context" in various tissues. To identify other candidate actors in Wnt signaling, Moon uses cell-based assays, such as siRNA knockdowns, proteomics, and small-molecule assays. Still, he stressed, it's important "to maximize the validity of any hits, so that you don't spend all your time chasing off-target hits."

Moon described an siRNA screen based on an optimized luciferase reporter for the presence of β-catenin, which in one cell type produced 804 hits out of a 22,325 gene library. A secondary screen, which used three additional siRNAs for each gene and required response in multiple cell lines, reduced the number of candidates to 310. A third screen, which quantified the expression of the endogenous genes, not just exogenous reporters, whittled the number to 63 genes.
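The successive narrowing of candidates can be summarized as retention fractions between stages. The counts below are the ones reported above; the stage labels and the helper function are illustrative names only.

```python
# Candidate counts reported for each stage of the siRNA screen.
stages = [
    ("library", 22325),
    ("primary screen", 804),
    ("secondary screen", 310),
    ("endogenous-gene screen", 63),
]


def retention(stages):
    """Fraction of candidates kept at each stage relative to the previous stage."""
    return [
        (name, count / previous)
        for (_, previous), (name, count) in zip(stages, stages[1:])
    ]


for name, fraction in retention(stages):
    print(f"{name}: {fraction:.1%} retained")

# Fraction of the original library that survives all three screens.
overall = stages[-1][1] / stages[0][1]
print(f"overall: {overall:.2%} of the library")
```

With these counts, roughly 3.6% of the library passes the primary screen, and about 0.28% of the library survives all three screens.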

Final validation looked for either stereotypical Wnt-mutant phenotypes in zebrafish embryos or a proteomic analysis that positioned the gene in a protein–protein interaction network with other elements of the Wnt/β-catenin pathway. In one example, this process identified AGGF1, which "is required for modulating about half of β-catenin target genes," Moon said.

In a similar success, proteomic screens suggested the tumor suppressor protein WTX interacted with β-catenin. To check this prediction, Moon's team verified that recombinant WTX interacts with β-catenin, probably increasing its ubiquitination and thus keeping its level low. "Anything that comes out of a proteomics screen or an siRNA screen, ultimately you need to push it towards the level of biochemical understanding," he emphasized.

In a third cell-based assay, Moon said, "we use small-molecule screens to develop potential therapeutics, but also to identify components of signaling pathways." Surveying a library of compounds, similar to that described by Stuart Schreiber, identified an FDA-approved drug called riluzole. The glutamate receptor GRM1 that this drug targets had not previously been recognized in β-catenin signaling, so the technique points to new biology as well as a potential therapeutic. "Using small-molecule screens to identify components of a signaling pathway is very powerful," Moon noted.

Although each cell-based assay is useful on its own, "integrating these techniques is a powerful way of compensating for the deficiency of any one screen," Moon said. "The biggest limitation of siRNAs is that they give a lot of off-target hits, and also that you get no insight into signaling mechanism," whereas proteomics gives you "lots of data but absolutely no clue if your hits are functional." If researchers set a high threshold for significance in these tests, they risk missing promising candidates. By combining siRNA and small-molecule screens, Moon and his colleagues identified a particular kinase as a contributor to the Wnt/β-catenin pathway, and validated it using mass spectrometry.

Modeling signaling dynamics

A complete analysis of signaling requires knowing not just which molecules interact, but the details of how they influence one another. Peter Sorger described some of his team's efforts to create more complete models at several different levels of description.

One project involved measuring the dynamics of epidermal-growth-factor receptor (EGFR, also called ErbB1 and Her1). Binding of extracellular EGF to this transmembrane molecule causes it to homo-dimerize or to hetero-dimerize with other members of the ErbB family, which are often targeted in cancer therapy. The dimer is then phosphorylated, which lets it dock intracellular proteins that have SH2 or PTB binding domains and activate them for further signaling.

"This is a well understood class of proteins," Sorger observed, and studies have shown that the adaptation to a ligand persists for several hours. Several drugs inhibit the receptor response, some by binding to the ATP pocket of ErbB1. Surprisingly, in response to one of these drugs, gefitinib, the receptor is dephosphorylated in tens of seconds, not hours. "It was a much faster dephosphorylation reaction than we might have imagined," Sorger noted. The ErbB2 or ErbB3 binding partners, as well as Shc and other downstream proteins, were also rapidly dephosphorylated.

The researchers successfully modeled this behavior using differential equations that account for the very high background concentration of ATP. "These drugs are trying to get access to their binding pocket in the presence of a 2 millimolar competitor, which is one of their biggest problems therapeutically," Sorger noted.
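The effect of a high ATP background on an ATP-competitive drug can be illustrated with the standard equilibrium expression for two ligands competing for one site. This is a simplified sketch, not the differential-equation model described in the talk; the affinities and the drug concentration are assumed values chosen only for illustration, and only the 2 millimolar ATP level comes from the quote above.

```python
def drug_occupancy(drug_conc, drug_kd, atp_conc, atp_kd):
    """Equilibrium fraction of binding pockets occupied by the drug when it
    competes with ATP for the same site (all quantities in molar units)."""
    drug_term = drug_conc / drug_kd
    atp_term = atp_conc / atp_kd
    return drug_term / (1.0 + drug_term + atp_term)


# Hypothetical constants, for illustration only.
DRUG_KD = 1e-9     # assumed 1 nM drug dissociation constant
ATP_KD = 20e-6     # assumed 20 micromolar ATP dissociation constant
DRUG_CONC = 10e-9  # assumed 10 nM drug concentration

without_atp = drug_occupancy(DRUG_CONC, DRUG_KD, 0.0, ATP_KD)
with_atp = drug_occupancy(DRUG_CONC, DRUG_KD, 2e-3, ATP_KD)  # 2 mM ATP, as quoted

print(f"no ATP: {without_atp:.2f}, 2 mM ATP: {with_atp:.2f}")
```

With these assumed numbers the drug occupies about 91% of pockets in the absence of ATP but only about 9% at 2 millimolar ATP, which is the kind of penalty Sorger's remark points to.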

Modeling the effects of another drug, lapatinib, required a more sophisticated model with 47 ordinary differential equations that include transitions of the receptor to an inactive configuration. "Rapid turnover is necessary to see the difference between the two drugs," Sorger observed. "Looking at what seems a 25-year-old piece of biology in a simple dynamic setting has led to the notion that in fact these signalosomes are highly dynamic."

The increasing complexity of these models, although biologically grounded, poses an "inevitable tradeoff," Sorger said. "As we get more sophisticated with the underlying hypotheses, we have greater difficulty building a rigorous framework for future inference." Sorger's team addresses this problem in part by deriving statistical ranges of values rather than deceptively precise values.

Comparing models for four cancer cell lines: The most efficient models for representing the responses of four different cancer cell lines have significant differences in their interactions (different-colored arrows).

Postdoc William Chen analyzed how well different RNAi knockdown experiments determined the parameters for a known network topology. He found that choosing the most informative species for knockdown, based on detailed analysis, was much more efficient than choosing experiments at random. "The three best experiments are better, on average, than 25 randomly chosen RNAi," Sorger said, although the best choices were "not obviously intuitive" even when the network was known. Combining RNAi with other experiments was even more efficient.

Although detailed mathematical models of signaling can be very useful, the difficulty in determining parameters makes this level of detail impractical for more comprehensive networks. One manifestation of this difficulty is the lack of consensus on the topology of networks, Sorger noted. "Depending on where you go in the literature you're going to find a different idea about what the network should look like."

Sorger and his coworkers have been extensively characterizing an experimental system consisting of cultures of primary human liver cells and hepatic-cancer cell lines. They expose these lines to many perturbations and measure many responses, such as cytokine production and reaction with phospho-specific antibodies.

In modeling this system, Julio Saez-Rodriguez assembled a consensus network from information in the literature and then refined the model based on experiments. The consensus was "wrong a fairly large fraction of the time," Sorger said, mainly because the literature indicated links that had no support in the experiments. Using a simple two-state-logic description for the nodes, and penalizing extra complexity, the researchers were gratified to find that the data could be described by a much simpler model than they started with. The resulting models, built separately for each of the cell types, allowed the team to determine how the primary and tumor cell lines differ in network topology.

Sorger's team is also trying to bridge the detailed mathematical models and the simplified Boolean models by looking at a "fuzzy logic" description. "Fuzzy logic allows you to take what would be a straight on-off transition in discrete logic and instead code it as a gradual transition," he said, yet the approach remains rather efficient.
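The contrast between a discrete on-off transition and a gradual one can be sketched as follows. The Hill-type curve is just one possible choice of gradual transfer function and is an assumption here; the talk summary does not say which functions the team used, and the parameter values are arbitrary.

```python
def boolean_gate(x, threshold=0.5):
    """Discrete logic: the output switches abruptly at the threshold."""
    return 1.0 if x >= threshold else 0.0


def fuzzy_gate(x, midpoint=0.5, steepness=3.0):
    """Gradual transition using a Hill-type curve (an assumed, illustrative choice).
    Returns 0.5 at the midpoint and moves smoothly toward 0 or 1 away from it."""
    if x <= 0.0:
        return 0.0
    return x**steepness / (midpoint**steepness + x**steepness)


# Compare the two gates over a range of normalized input levels.
for level in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(level, boolean_gate(level), round(fuzzy_gate(level), 3))
```

Raising `steepness` makes the fuzzy gate approach the Boolean step, which is one way to see the fuzzy description as sitting between the two modeling styles.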

Scott Powers, Cold Spring Harbor Laboratory
Chris Sander, Memorial Sloan-Kettering Cancer Center


  • Deletions and amplifications that are detected frequently in genetic analysis of liver tumors can point to possible protective genes and oncogenes, respectively.
  • Transfecting mice with cDNA from such amplified genes confirmed 18 known and new oncogenes.
  • These genes may be more useful in leading to biomarkers indicating which signaling pathways are disrupted in a particular patient, rather than as direct therapeutic targets.
  • Genetic analysis shows that patients who apparently have the same cancer often have different specific mutations, but these mutations each disrupt common modules.
  • Combinatorial therapy that is tailored to individual variations is likely to be the best way to address cancer.
  • For large networks, first deducing statistical properties and then selecting specific examples may be more effective than generalizing from individual solutions.

Finding new oncogenes in liver cancer

Genetic alterations are a common feature in cancers, and make them much more diverse and difficult to treat. Recent projects, including the Cancer Genome Atlas and the International Cancer Genome Consortium, are extensively characterizing the genetic variation in several cancers with the hope of finding common features and new ways to treat them.

In liver cancer, of the well-known cancer genes targeted by existing therapies, "none are actually mutated with any frequency," said Scott Powers. He and his colleagues have used cancer genome data to find new candidate genes for therapeutics or for new biomarkers to guide therapy.

"About 80% of liver cancers have extensive DNA copy-number variations," Powers noted. He and his Cold Spring Harbor colleague Scott Lowe previously looked at regions that were frequently deleted in the cancers to identify protective genes using short-RNA knockdown experiments. In his current work, Powers said, "we took amplified regions and looked to see which ones contained oncogenes, not by knocking them down with RNAi but by overexpressing them using cDNAs." Using verified sequences from the Mammalian Gene Collection, his team transfected mouse hepatocytes that had been modified to be prone to cancer. In these cells, the tumor suppressor p53 was lost and the Myc oncogene was overexpressed—"two very common genetic alterations in liver cancer."

Cells transfected with candidate oncogenes were injected into the spleen. "If you do this carefully, you'll get dispersal of the cells throughout the liver" via the blood system, Powers said. Out of 124 cDNAs chosen for their overexpression in cancers, 18 generated new liver cancers in this mouse model.

In addition, genes from small amplified regions were much more likely to be true oncogenes, rather than "passengers." For the largest amplified regions, more than 10 megabases, the chances of a gene being a driver were not much higher than for a control set, Powers said. "In the future, to go beyond these small amplicons, we have to develop a hybrid approach of computational selection."

"This was quite the largest data set of cDNAs of oncogenes that had ever been constructed," Powers said, so the researchers could check which computational approaches might have predicted the oncogenes. They found that neither the widely used RNA expression level nor the GRAIL method made statistically significant predictions.

The method that worked best based its ranking on a protein functional interaction network, containing some 20,000 interactions. "The final algorithm is guilt by association," Powers said. The final score for a gene is based on the highest score of the proteins its product interacts with. "The interface of computational and functional validation is going to be increasingly important to enable the productive analysis of cancer genome projects," Powers said.
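The scoring rule described above, in which a gene takes the highest score among its interaction partners, can be sketched on a toy network. The summary does not say how the partner scores themselves were defined, so the `seed_scores` input, the gene names, and the numbers below are all hypothetical.

```python
def guilt_by_association(interactions, seed_scores):
    """Score each gene by the highest seed score among its interaction partners.

    interactions: iterable of (gene_a, gene_b) undirected edges.
    seed_scores: dict mapping genes to prior scores (hypothetical input).
    Genes whose partners have no seed score get 0.0.
    """
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {
        gene: max((seed_scores.get(p, 0.0) for p in partners), default=0.0)
        for gene, partners in neighbors.items()
    }


# Hypothetical toy network; gene names and scores are placeholders.
edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G5", "G4")]
seeds = {"G2": 0.9, "G4": 0.4}
scores = guilt_by_association(edges, seeds)

# Rank candidates from most to least suspicious.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that in this sketch a gene's own seed score does not contribute to its final score; only its partners do, matching the description of the rule.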

One of the new oncogenes is POFUT1, which acts in Notch signaling. Follow-up experiments confirmed that cells with amplified POFUT1 seemed to be more sensitive to gamma-secretase inhibition of the Notch pathway.

Another oncogene is FGF19, which was a surprise since it lies in a region that has been extensively studied in cancer. Powers suggested that the effect had been missed because different tissues vary widely in expressing the gene, for example "in liver cancer but not in breast cancer." Powers said that, in contrast to more familiar ErbB signaling, raising FGF19 turns off β-catenin without activating MAP kinase. "In liver cancer maybe you can just give monoclonal antibody to FGF19 to patients who have an amplification of this locus."

In both of these cases, "the most interesting data we get is testing for dependency," Powers said. "We haven't really discovered new targets per se as much as we've discovered new biomarkers for administration of treatments."

Cancer networks

Cancer genomes also allow researchers to go beyond individual genes to look at their networks of interaction. One profound result for glioblastoma multiforme, one of the first cancers analyzed in The Cancer Genome Atlas, was that there was "incredible diversity" in the affected genes, said Chris Sander. "Even though this is all glioblastoma, the differences are quite substantial."

The amount of data makes it impractical to manually compare genetic profiles of the tumors with background knowledge such as biological pathways, Sander observed. "It's got to be done computationally." This sort of analysis shows that "any individual gene does not make a contribution that is consistent across all these tumors, but what is consistent is the modules, collections of genes that appear together. These modules are recurrent in essentially all of these individuals, but the implementation is different from one individual to the next."

Addressing this diversity requires that the cancer patient population be partitioned, at least into major groups, to ensure that treatment is aimed at their specific genetic complexions, Sander stressed. "I'm confident that the approach of combinatorial therapy, targeted to modules but modified from one individual to the next is the right way to go."

In other cancers, DNA copy-number changes are strikingly different from those in glioblastoma, Sander said. Prostate tumors differ significantly in the extent of copy-number changes, and metastatic tumors have many more alterations. Even before metastases are evident, the copy-number changes are predictive, in that patients with low copy numbers have better survival. This test is "more predictive than the Gleason grade, which is what pathologists would report," Sander said. "The question is whether this can be translated into a clinical test." He added that "you have to have a reasonable level of prediction and certainty to be able to actually go there. Psychologically, people want to be treated."

On the conceptual side, Sander is working on a "flavor of systems biology" that he called "Perturbation Cell Biology." The goal is to model the responses of cell lines to systematic perturbations such as drugs and combinations of drugs, as reflected in rich observations including cellular phenotypes and molecular measurements.

The underlying mathematical description is differential equations describing concentrations of molecules in different phosphorylation states, similar to those discussed by Peter Sorger. The traditional approach is to determine a locally optimal set of model parameters, and then repeat the process many times with new starting conditions. "Then you report the aggregate statistical properties of this set of solutions and draw the map." For small systems, Sander said, "you recover some textbook biology."

"The challenge is to scale this up to larger systems," Sander stressed, because "network inference problems actually become quite unmanageable in larger systems." In collaboration with Riccardo Zecchina of Politecnico di Torino, Sander is looking at statistical physics ideas that are "a flavor of global-to-local algorithm." Rather than build probability distributions by averaging over individual solutions, the researchers first derive distributions for each of the parameter values, such as those that describe the interaction between two species. In building this "factor graph," the other interactions enter in an averaged way, and the network is inferred using "belief propagation." Only then do the researchers generate particular solutions. "It's much more efficient," Sander said. But this is still a work in progress.

Steven Altschuler, University of Texas Southwestern Medical Center
Thomas Gingeras, Cold Spring Harbor Laboratory


  • Different cells in a population often behave differently, but this heterogeneity is usually ignored even though none of the cells may behave like the average.
  • The tremendous epigenetic diversity of cultured cancer cells reflects varying populations of a few types, and the relative populations of each type fall into patterns that predict response to the drug taxol.
  • In a simple positive-feedback model, heterogeneity within a cell population always accompanies the conditions that allow cell polarity to develop.
  • Different subcellular compartments contain RNA transcripts that arise from different sections of the genome.
  • Tracking transcripts by compartment lets researchers identify transcripts they might miss in the entire cell, and shows that almost half of the genome is both transcribed and spliced into mature RNA.
  • Chimeric splices between transcripts originating from different chromosomes are present at low levels in some cellular compartments.

Cell heterogeneity

Systems biologists are constructing complex network models for many aspects of biology. But "these networks are almost entirely derived from population-averaged measurements," cautions Steven Altschuler. "In many cases, you can have very predictive responses in perturbed populations, but your average measurement corresponds to not a single cell in your entire assay."

"If you're going to make the assumption that the mean is a good representation of your cells, you have to prove that," Altschuler asserted. He described three projects from the lab he runs with his wife, Lani Wu, in which heterogeneity was not merely noise, but was biologically significant. "Cells that are different from the mean can be very important."

The first example concerned development of adipocytes, or fat cells. The well-known molecular circuits underlying this process include the master regulator PPARγ driving steady growth of lipid droplets in the cells as well as the level of adiponectin. A natural expectation would be that individual cells would follow the same trajectory, with correlated growth of droplets and adiponectin levels.

"If you look at the cells, it's rather disturbing," Altschuler said, because most cells have either large lipid droplets or high adiponectin levels, but never both. Tracking individual cells shows that in "virtually all" of them, the adiponectin level rises first, with small lipid droplets. Later, the droplets grow, accompanied by a drop in the adiponectin level, the opposite of what was expected from the population-averaged measurements.

"The correlation is an illusion," Altschuler concluded, because a large population of cells that's still going through the early differentiation skews the average. Moreover, added compounds almost always affect different subpopulations differently, rather than just moving them all in the same way. As a result, Altschuler said, by studying the effects on subpopulations, "we actually have a way to identify new targets of compounds."
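A small numerical sketch shows how asynchronous differentiation can produce this kind of illusory correlation. The three stages, their marker levels, and the fractions of cells in each stage over time are all hypothetical numbers chosen for illustration, not measurements from Altschuler's lab.

```python
# Toy model: every cell follows the same path (adiponectin rises, then falls
# as lipid droplets grow), but cells enter the path at different times.
# All levels (arbitrary units) and fractions are hypothetical.

STAGES = {
    "undifferentiated": {"adiponectin": 0.0, "lipid": 0.0},
    "early": {"adiponectin": 10.0, "lipid": 1.0},  # adiponectin rises first
    "late": {"adiponectin": 4.0, "lipid": 10.0},   # droplets grow, adiponectin drops
}

# Assumed fraction of the population in each stage at successive time points.
TIMELINE = [
    {"undifferentiated": 1.0, "early": 0.0, "late": 0.0},
    {"undifferentiated": 0.6, "early": 0.4, "late": 0.0},
    {"undifferentiated": 0.3, "early": 0.5, "late": 0.2},
    {"undifferentiated": 0.1, "early": 0.5, "late": 0.4},
]

def population_mean(fractions, marker):
    """Population-averaged level of one marker for a given stage mixture."""
    return sum(frac * STAGES[stage][marker] for stage, frac in fractions.items())

mean_adiponectin = [population_mean(f, "adiponectin") for f in TIMELINE]
mean_lipid = [population_mean(f, "lipid") for f in TIMELINE]

for t, (a, l) in enumerate(zip(mean_adiponectin, mean_lipid)):
    print(f"t={t}: mean adiponectin={a:.1f}, mean lipid={l:.1f}")
```

In this toy population both averages rise together over time, even though within any single cell adiponectin falls once the droplets grow, because cells still entering the early stage dominate the mean.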

In the second part of his talk, Altschuler turned to cancer. "Almost always, heterogeneity is ignored, because you just don't know what to do with it," he said. His team compared a group of 49 clones from a lung-cancer cell line. "I presume that a lot of the differences we're seeing are epigenetic," Altschuler said.

Using markers for signaling, "you see great diversity," Altschuler noted. "You're thinking cancer must be infinitely complicated." But feature extraction algorithms make the classification manageable, and principal-component analysis reduced 1000 features to about 20 eigendimensions per cell. Moreover, all of the variation between cells could be captured using about five subpopulations, Altschuler said. "This no longer felt like a problem of infinite complexity."

In analyzing the 49 clones, Altschuler said, "the most amazing thing happened: they group into six or seven different clades," each with a characteristic fraction of the cells in the different subpopulations. Moreover, this classification almost perfectly segregated the clones with respect to their response to the drug taxol. "The ensemble subpopulations allowed us, before we ever gave these populations any drug, the ability to distinguish whether they would be drug sensitive or not," Altschuler concluded.

Altschuler's third topic explored how heterogeneity arises in a theoretical model of cell polarization. The model includes active particles on the cell membrane that can recruit inactive particles from the cytosol and make them active. "It's your classic positive feedback loop," Altschuler said. In addition, particles in the membrane diffuse around the surface of the cell.

It turns out that there is one key parameter in the model: the number of particles per cell. If this number is large, particles diffuse everywhere and no polarization develops. (This differs from the classic cases studied by Turing and others, in which pattern formation reflects both positive and negative feedback.) If the number of particles is small, feedback is too small for polarity to arise. Only for intermediate numbers of particles does this model produce cell polarity. But for the values of parameters for which polarization occurs, it only occurs for about half of the cells. "Heterogeneity is mathematically unavoidable here," Altschuler said.

Subcellular localization of transcribed RNA

Heterogeneous expression, even within a cell, is also important for RNA. One way to surmise how much of the genome is functional is to measure how much of it is transcribed into RNA and processed into usable forms. The large GENCODE team of the ENCODE project is reviewing published, full-length cDNA sequences and hand-curating them to assure "quality, coding capacity, apparent legitimacy of splicing sites, and so forth," said Thomas Gingeras. About 142,000 transcripts have been well annotated, he said, half of which appear to be non-protein-coding. "About 70% of the transcripts are unannotated," Gingeras noted, so there is much to be learned.

An important aspect of this project is tracking, in 15 cell lines, transcripts that appear in different subcellular compartments, including the cytoplasm and nucleus, and for one cell line in the nucleolus, nucleoplasm, and chromatin. Recognizing the heterogeneous composition of these compartments suggests the possibility of different functions for transcripts.

Looking at different compartments also highlights the importance of rare transcripts. "This enrichment is allowing you to see fewer things that otherwise would have been in the tail of the distribution," Gingeras said. In contrast, "if you treat the whole cell as a bag of molecules, you'd better sequence the hell out of it."

The researchers obtained some 400 million sequence reads in each compartment. "That seems to be a lot, but it only allows us to see a glimpse of the low-copy-number transcripts that are in certain compartments," Gingeras said. But he emphasized that most of the reads passed a stringent "irreproducible discovery rate," or IDR, the chance that a repeated measurement would not yield the same result. "These data that we're using are very conservative." Even with an IDR of 0.1, he said, "the genome is almost half covered with transcripts that are processed and spliced."

Of the many intriguing classes of transcripts, "one class stands out," Gingeras said. The transcription of these strands appears to start in the 3′-untranslated regions (UTRs) at the tail of other transcripts. Such transcripts occur in 80% of expressed genes in fly and 62% of expressed human genes examined so far. "It looks like a different kind of regulated region for expression," Gingeras said.

Transcribed sequences in different compartments come from very different regions. For example, "in the nucleus and the chromatin, capped 5′ ends are most prominently found emanating from the intergenic regions, not from annotated transcripts," Gingeras observed. "The cell is putting into these compartments transcripts that are initiated in different parts of the genome."

Populations of transcribed RNA differ widely, depending on the cellular compartment they are isolated from.

Gingeras also described "chimeric" RNAs, which merge segments transcribed from different chromosomes. These odd combinations have been described by others in the literature, but their function and even their existence have been controversial. After extensive experimental cross-checks, Gingeras said, "clearly these molecules exist in the cells where we had identified them, albeit at much lower copy number within that cell type than the normal-spliced forms."

Among other things, the researchers found that chimeras tend to join regions that are close together in three-dimensional chromatin. "Seventy-six percent of the chimeric RNAs that we see fall in the regions where the DNA, by crosslinking experiments, are close enough to be cross-linked in a 5C experiment. It looks like those genomic regions are brought together for transcriptional purposes."

"We don't think these are random events, even though they're present at fairly low copy number," Gingeras said. Overall, "the transcription landscape contains a whole variety of transcripts whose function remains to be determined, but whose characteristics we were unaware of."

Stuart Schreiber, Broad Institute
Marc Vidal, Dana-Farber Cancer Institute and Harvard Medical School


  • Response to small molecules can serve as a classifier for the presence or subtype of cancer.
  • Researchers are systematically cataloging the molecular responses of cells to varying levels of highly specific small molecules, to help generate hypotheses for disease biology and treatment.
  • Studying networks, for example of protein–protein interactions, has provided important insights into the connection between genotype and phenotypes such as disease.
  • About half of mutations causing human disease seem to be "edgetic": modifying the interactions between proteins rather than their presence.
  • Finding proteins targeted by viruses and genes that cause the same disease identifies new candidates for intervention.
  • Interaction networks seem to evolve faster than protein-coding sequences.

Small-molecule probes of cancer

Extensive knowledge of the molecular networks underlying disease is not of much use to patients without ways to manipulate those networks, for example using small-molecule drugs. At the same time, the response to small molecules that perturb specific nodes in a network can yield powerful biological insight about how that node interacts with others. To speed both therapeutics and basic understanding, Stuart Schreiber and his colleagues are assembling a comprehensive catalog of the dose-dependent response of cell cultures to a library of narrowly targeted compounds. "We're trying to look at cancer therapeutics in an integrated way," he said, supplementing earlier catalogs of responses with detailed genetic characterization of cells.

Some small molecules are extraordinarily effective against particular genetic versions of cancers. Imatinib, marketed in the U.S. as Gleevec, is essentially 100% effective against chronic myelogenous leukemia, for example. But "less than 1% of cancer patients today benefit from this dramatic clinical outcome," Schreiber noted, because no analogous drug is known for their cancers. His new project aims to achieve broader benefits by linking genetically distinct patient populations to targets for drugs or combinations of drugs.

Schreiber's team has recently shown that the response of cultured cells to various small molecules can identify patients with a genetic form of diabetes called MODY1. "You can use small molecules as a classifier and predict whether the cells came from affected or unaffected individuals," he said. This work was one inspiration for their project, funded by NCI's Cancer Target Discovery and Development Network, or CTD2, to translate cancer genome data (discussed by Scott Powers and Chris Sander) into clinical applications.

"What we don't know is whether small molecules that target non-oncogene co-dependencies, using the principle of synthetic lethality, could have the same kind of clinical outcomes," Schreiber warned. Such dependencies are common in cancer, because, as oncogenes co-opt pre-existing signaling pathways for cancer proliferation and survival, they enlist the support of other proteins to enable these pathways.

For example, oncogenes are often temperature-sensitive labile proteins, so they acquire a need for chaperones. Early exploration of the accumulating data showed that the effect of an inhibitor of such a chaperone, the heat-shock protein HSP70, increased in cells with amplified Myc, which is found frequently in cancers.

Another small molecule has a much larger effect in cells with activating mutations in the β-catenin oncogene. Schreiber suggested that this effect related not to β-catenin's role in Wnt signaling (discussed by Randall Moon) but to the small molecule's effect in neutralizing reactive oxygen species, which relate to the unusual metabolism of cancer cells.

These early test cases, and others, support the efforts of Schreiber and his team to systematically catalog the dose-response relationship of cell lines to various compounds. These compounds, which make up the CTD2 probe kit, are chosen "primarily on the basis of there being evidence that the compound is highly selective," Schreiber said. "We call them 'narrowly active compounds'." Using extensive automation, the researchers characterize the molecular and phenotypic response of 1000 genetically characterized cells to a range of concentrations of the small molecules. In addition to individual responses, Schreiber stressed, "you can use these genetically characterized cells to look at combinations of compounds."

Another project Schreiber and his colleagues are working on is called the Cancer Cell Line Encyclopedia Project, a collaboration with the Novartis Institute for Biomedical Research. This resource will soon provide extensive characterization data for many publicly available cell lines, including genome-wide copy-number data, gene expression, and mutations of target oncogenes, plus extensive exome sequencing. Schreiber hopes these resources can dramatically change the traditional serial "bucket-brigade" of pharmaceutical development.

Connecting genotype and phenotype through interaction networks

Genotype data for diverse healthy individuals and for tumors is becoming widely available. Relationships between these genotypes and phenotypes, such as the susceptibility to diseases, "are the most interesting questions in biology," says Marc Vidal. But the connection is complex. Even simple Mendelian traits show incomplete penetrance, multiple effects of mutations, and modification by other genes. For complex traits, the connection is even less direct. "To understand genotype-phenotype relationships, which are far from linear, we need to understand systems," Vidal said.

One of the most effective ways to describe this nonlinear connection is with the language of networks, with macromolecular species represented as nodes and their interactions as edges. Based on a decade or so of progress, Vidal said, "we can safely say that there are really global properties in cellular interactome networks, and that those properties relate to biology."

Vidal is a pioneer in exhaustive measurements of protein–protein interactions, in particular through the yeast two-hybrid method. So far, only about 20% of the yeast interactome and about 5% of the human interactome are known. But Vidal thinks in another 10 years some 70%–90% of those networks will be mapped with high quality, and will continue to generate biological insights.

In the context of human disease, Vidal noted that "in many cases you have mutations in several genes that can cause one disorder, and the reverse, which is that different mutations in the same gene can give rise to different disorders." To explore these relationships, he and his collaborators mined the Online Mendelian Inheritance in Man (OMIM) database to construct the "diseaseome." By connecting diseases that share a gene and genes that share a disease, they created a bipartite graph that helps to illustrate the complex relationships between diseases.
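The construction can be sketched in a few lines. The associations below are hypothetical placeholders rather than OMIM records, and the helper names are illustrative; the sketch only shows how a bipartite disease-gene graph yields links between diseases that share a gene and between genes that share a disease.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease-gene associations (placeholders, not real OMIM entries).
associations = [
    ("disease_A", "GENE1"),
    ("disease_B", "GENE1"),
    ("disease_B", "GENE2"),
    ("disease_C", "GENE3"),
]

def build_bipartite(pairs):
    """Return the two sides of the bipartite graph as adjacency maps."""
    genes_of = defaultdict(set)     # disease -> genes
    diseases_of = defaultdict(set)  # gene -> diseases
    for disease, gene in pairs:
        genes_of[disease].add(gene)
        diseases_of[gene].add(disease)
    return genes_of, diseases_of

def project(adjacency):
    """Link every pair of nodes that share a neighbor on the other side."""
    links = set()
    for members in adjacency.values():
        for a, b in combinations(sorted(members), 2):
            links.add((a, b))
    return links

genes_of, diseases_of = build_bipartite(associations)
disease_links = project(diseases_of)  # diseases that share a gene
gene_links = project(genes_of)        # genes that share a disease
print(disease_links, gene_links)
```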

Looking at this relationship between diseases leads to new questions, Vidal said, such as "How do we explain that different mutations in the same gene give rise to different disorders, from a network perspective?" He suggested that some diseases arise not because a particular node, representing a macromolecule, is missing from the graph. Instead, a network edge, or interaction, could be altered. "Perturbation of another edge might give a different phenotype," he said.

Roughly half of the mutations associated with human diseases appear to disrupt the interactions between proteins, or edges, rather than disabling the proteins themselves.

Vidal and his colleagues used sequence data from the Human Gene Mutation Database to test this possibility, hypothesizing that sequence changes like premature stop codons probably represent node perturbations, while missense or in-frame mutations are likely to be "edgetic," affecting protein interactions. They found that roughly half of mutations known to be associated with disease look edgetic.
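The sorting rule behind this estimate can be written as a very small sketch. The mutation-type labels and the example records are hypothetical, and grouping frameshifts with premature stops is an added assumption; the actual analysis of the Human Gene Mutation Database was certainly more careful.

```python
# Toy classifier following the hypothesis described above.
# Type labels and example records are hypothetical.

NODE_TYPES = {"nonsense", "frameshift"}         # premature stop; frameshift is an added assumption
EDGETIC_TYPES = {"missense", "in_frame_indel"}  # protein still made, interactions may change

def classify(mutation_type):
    """Label a mutation as a likely node removal or a likely edge perturbation."""
    if mutation_type in NODE_TYPES:
        return "node"
    if mutation_type in EDGETIC_TYPES:
        return "edgetic"
    return "unclassified"

mutations = ["nonsense", "missense", "missense", "in_frame_indel", "nonsense", "splice_site"]
labels = [classify(m) for m in mutations]

classified = [label for label in labels if label != "unclassified"]
edgetic_fraction = classified.count("edgetic") / len(classified)
print(labels, edgetic_fraction)
```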

In follow-up tests, Vidal said, "every time, we said, according to this simple model, this gene might actually give rise to edgetic perturbations, we could verify experimentally that this was indeed the case." Looking at proteins that have multiple binding domains and whose genes are connected with at least two diseases, he said, showed that the different disorders always reflected mutations in different domains, as expected.

The idea of changing edges in networks also provides a new window on evolution, Vidal said. His team is exploring how the wiring of networks changes during evolution, rather than the sequence of genes themselves. They exploit empirical data from plants, which have many paralogous pairs that look alike and probably emerged from a duplication in a common ancestor. The analysis so far suggests that the interaction profiles for duplicated genes diverge faster than the corresponding sequences.

In another ongoing project, Vidal and his team are looking at disease-causing viruses, and comparing them with genetic mutations that cause the same disease. They have confirmed that the protein targets of viruses are close in the interaction network to the products of genes involved in the same disease. "The shortest paths help us to pose hypotheses for disease etiology," Vidal said.
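As an illustration of the shortest-path idea, the following sketch runs a breadth-first search on a tiny hypothetical interaction network. The node names and the function `shortest_path_length` are assumptions made for the example and do not describe the Vidal lab's software or data.

```python
from collections import deque

def shortest_path_length(network, sources, targets):
    """Breadth-first search from all sources; hops to the nearest target, or None."""
    targets = set(targets)
    seen = set(sources)
    queue = deque((s, 0) for s in sources)
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for nbr in network.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

# Hypothetical undirected protein-protein interaction network (adjacency lists).
network = {
    "VIRUS_TARGET": ["P1"],
    "P1": ["VIRUS_TARGET", "DISEASE_GENE_PRODUCT"],
    "DISEASE_GENE_PRODUCT": ["P1"],
    "ISOLATED": [],
}

hops = shortest_path_length(network, ["VIRUS_TARGET"], ["DISEASE_GENE_PRODUCT"])
unreachable = shortest_path_length(network, ["ISOLATED"], ["DISEASE_GENE_PRODUCT"])
print(hops, unreachable)
```

In a real analysis the observed distances would be compared against those for randomly chosen proteins to decide whether viral targets are unusually close to the disease genes.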

Charlie Boone, University of Toronto
Leonid Kruglyak, Princeton University


  • Non-additive interactions, such as synthetic lethality, between pairs of deleted genes are highly informative about the genes' relationships, and are being systematically catalogued in yeast.
  • Genes with similar patterns of interaction with others are often in the same pathway, and imply networks with recognizable biological functions.
  • In comparison with double mutants, the full natural variation revealed by crosses between genetically diverse strains is much more complex.
  • The heritability of human diseases is not completely explained by the small effects of individual variants that genome-wide studies have identified.
  • Yeast crosses generate millions of genetically distinct strains, allowing quantitative assessment of the contributions of different genetic loci to a trait.

Surveying yeast double mutants

Systematic deletion of pairs of genes in yeast gives rich information about how the genes interact with one another, said Charlie Boone. Networks based on these genetic interactions recapitulate known biology and reveal new aspects of pathways and complexes. But crosses between strains show that the effect of natural variation is not easily explained in terms of even the interacting pairs.

Of the 6000 genes in budding yeast, said Boone, 5000 can be deleted without killing the organisms. For the remaining 1000 "essential" genes, researchers are developing temperature-sensitive (ts) alleles that effectively delete them after development. Altogether, this means that researchers can make about 18 million distinct double-deletion mutants, each of which Boone aims to characterize.
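The figure of about 18 million follows from counting unordered pairs among roughly 6000 genes (5000 deletable genes plus 1000 essential genes handled with ts alleles). A quick check, using the rounded gene counts quoted in the talk:

```python
from math import comb

nonessential = 5000  # genes that can be deleted outright
essential = 1000     # genes perturbed through temperature-sensitive alleles
total_genes = nonessential + essential

# Number of unordered gene pairs: n * (n - 1) / 2
double_mutants = comb(total_genes, 2)
print(f"{double_mutants:,}")  # 17,997,000, i.e. about 18 million
```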

"Genetic interaction occurs when something weird happens," Boone said. The most obvious example is synthetic lethality, when neither gene is essential on its own but deleting both is deadly. This can occur when two genes lie on redundant pathways, and the cell needs at least one pathway to survive. "Many pathways are not essential ... because there's a backup pathway," Boone said. The opposite interaction happens when deleting either gene causes some reduction in fitness, but since they both disrupt the same pathway, deleting both doesn't make things any worse.

"Genes with the same pattern of synthetic lethal interactions are often in the same pathway," Boone said. Connecting genes that have similar interaction patterns creates genetic interaction networks analogous to those based on physical protein interactions or similarities in gene expression. "In the end, the position of a gene on the network and its connectivity define its function," Boone said.
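One simple way to compare interaction patterns is to correlate the genes' interaction profiles. The sketch below uses Pearson correlation on hypothetical profiles; the choice of similarity measure and all numbers are assumptions for illustration, not a description of the Boone lab's pipeline.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length interaction profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical interaction scores of three genes against the same four partners
# (negative = double mutant sicker than expected).
profiles = {
    "gene_a": [-0.8, 0.0, -0.6, 0.1],
    "gene_b": [-0.7, 0.1, -0.5, 0.0],
    "gene_c": [0.2, -0.9, 0.0, -0.4],
}

similarity_ab = pearson(profiles["gene_a"], profiles["gene_b"])
similarity_ac = pearson(profiles["gene_a"], profiles["gene_c"])
print(round(similarity_ab, 2), round(similarity_ac, 2))
```

Here gene_a and gene_b have nearly identical profiles and would be linked in the network, while gene_c would not.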

Clustering genes based on their similarity of interactions leads to networks in which genes with familiar biological functions are grouped together, along with some previously unannotated genes. "Inevitably, when we test it, these genes are new components of the pathway that they're linked to."

Boone, Brenda Andrews, and their collaborators hope to extend these measurements to cover all possible pairs within a couple of years. "One of the major challenges with these projects is to take them to completion," he said. In addition to automating the setup and the measurement of colony size as an indicator of fitness, one big challenge was developing a quantitative measure of the deviation of double mutants from the expected effect of combining individual deletions. These deviations can be either "negative" as for backups, or "positive," when the genes work together.
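A common way to express such a deviation, assumed here for illustration since the report does not give the formula, is to take the product of the single-mutant fitnesses as the expected double-mutant fitness and score the difference from what is observed. The fitness values below are hypothetical.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Observed double-mutant fitness minus the multiplicative expectation.

    The multiplicative null model is an assumption of this sketch.
    Negative: sicker than expected (e.g. synthetic lethality).
    Positive: healthier than expected (e.g. genes in the same pathway).
    """
    return fitness_ab - fitness_a * fitness_b

# Hypothetical fitness values relative to wild type (1.0).
backup_pair = interaction_score(0.9, 0.8, 0.0)        # redundant pathways
same_pathway_pair = interaction_score(0.5, 0.5, 0.5)  # second deletion adds no harm
independent_pair = interaction_score(0.9, 0.8, 0.72)  # matches expectation

print(backup_pair, same_pathway_pair, independent_pair)
```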

The genetic interaction network inferred solely from unexpected results of double mutations often connects genes of related biological function.

Comparing the genetic interaction network with known protein–protein interactions yields some surprises. For example, physically interacting proteins within a complex would be expected to show positive genetic interactions, so that the pair is less detrimental than their expected combined effect, since the complex is already disabled. But although "there are positive interactions that overlap with physical interactions, there are just as many negative interactions that overlap with physical interactions," Boone said. In addition, "many of the positive interactions also occur between pathways."

Ultimately, researchers would like to understand how different variants of genes interact in their complex natural setting, not just in isolated pairs. To learn about this natural variability, Boone and his colleagues studied crosses between two well-characterized laboratory strains.

The team looked for "conditional essential" genes, which are essential in one strain but not in the other. They then performed crosses between a strain where a gene was essential but present with another where it was not essential and deleted. They expected that survival of the hybrid would frequently hinge on the presence of an allele of a different gene that rendered the deleted gene fatal. Looking at the statistics of the crosses, "we can assess whether these conditional essentials are due to a simple case of synthetic lethality or not," Boone said. But "it was never a simple case of a single modifier leading to a synthetic lethal interaction," he found to his surprise. "Our conclusion is that genotype to phenotype is an incredibly complex problem."

Quantifying genetic contributions to traits

Although some human traits and diseases follow simple Mendelian inheritance, said Leonid Kruglyak, "most of the things we really care about ... follow much more complicated inheritance patterns." Although genome-wide association studies in the past few years have turned up nearly 1000 genetic regions associated with diseases, the total effect of these genes doesn't usually explain the known heritability.

Human height, for example, is 80% heritable, but the 180 known loci explain only 10% of population variance. One possible source of the "missing heritability" is that there are many variants with a larger effect size, but these are too rare to see convincingly in most studies.

"Because we're dealing with small effects, we need very large sample sizes," Kruglyak said. Although characterizing 100,000 humans is a major challenge, "it's no problem growing large populations in yeast." His team picked a number of traits, namely sensitivity to different drugs, which are continuously variable in yeast in a genetically complex manner. They then asked if they could find all the genes involved in these traits' variation. "Posing the question this way almost assures us of failure," Kruglyak said, "but we'd like to get as far as we can."

As in Boone's experiments, crossing two different strains yields a wide variation, in this case in drug sensitivity. Rather than phenotype the individual progeny, the researchers decided to take advantage of the large numbers and phenotype the population—but only the outliers. "Most of the genetic information is contained in the phenotypically extreme individuals," Kruglyak said.

The resulting population, containing thousands or tens of thousands of genetically distinct strains, is still too large to genotype individually, though. So "instead of genotyping them one at a time, we just measure the frequencies of the two parental alleles across the genome." For alleles that push the trait towards an extreme value, the expected frequency should deviate from the 50/50 ratio for the entire population.
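The logic of the pooled measurement can be sketched as follows. The marker names, allele counts, and significance threshold are hypothetical, and the simple binomial z-score is an assumption of this sketch rather than the statistic Kruglyak's team used.

```python
from math import sqrt

def allele_shift(count_parent1, count_parent2):
    """Parent-1 allele frequency in the pool and its z-score against 50/50."""
    n = count_parent1 + count_parent2
    freq = count_parent1 / n
    # Under no selection, the parent-1 count is roughly Binomial(n, 0.5).
    z = (count_parent1 - n / 2) / sqrt(n / 4)
    return freq, z

# Hypothetical allele counts (parent 1, parent 2) measured in the selected pool.
pool = {
    "marker_1": (510, 490),  # close to 50/50: no evidence of selection
    "marker_2": (800, 200),  # parent-1 allele enriched
    "marker_3": (300, 700),  # parent-2 allele enriched
}

THRESHOLD = 5.0  # arbitrary cutoff chosen for this illustration
flagged = [m for m, counts in pool.items() if abs(allele_shift(*counts)[1]) > THRESHOLD]
print(flagged)
```

Loci enriched for either parent's allele are flagged, consistent with the observation that both parent strains can contribute alleles pushing the trait in the selected direction.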

"The trick is to do this quantitatively enough," Kruglyak said. But with care, "we can detect loci even when they have quite small phenotypic effects." One important trick for improving the signal-to-noise ratio is using custom microarrays with probes for the allele from each strain, rather than just inferring the allele from the presence or absence of a signal.

Kruglyak had previously studied sensitivity to 4-nitroquinoline 1-oxide, or 4NQO, linking sensitivity to this DNA-damaging agent to a particular gene called RAD5 that acts in DNA damage repair. "It explained some of the variation but it didn't explain all of it," he recalled.

From a cross, his team selected those segregants with extreme resistance to 4NQO, and quantitatively genotyped this subpopulation. "In addition to RAD5, which showed up as our clearest, strongest selection as it should, there are about a dozen or so contributions of other loci from both the parent strains," Kruglyak said.

This pooled approach identifies important loci, but is not very accurate about the size of their effect or whether and how they interact, Kruglyak said. "You can get back at that by making collections of individual segregants, measuring their phenotypes, and just genotyping them at the positions where you found the loci. So you don't need to pay either the experimental or the statistical cost for searching the whole genome."

This analysis showed that the other loci have much smaller effects than RAD5, which explained about 40% of the variance. The effects of the other loci are all below about 5%, which would not have been statistically significant in a traditional genome-wide linkage scan.

The researchers went on to test for sensitivity to "about 20 other chemical compounds and other ways of making yeast cells unhappy," Kruglyak said. "The genetic architectures can look quite different." For some insults, the sensitivity is dominated by a single locus with a Mendelian inheritance pattern. In other cases, there are as many as 20 statistically significant loci that contribute to variation.

The technique is not limited to enrichment for drug resistance. One powerful extension is using cell sorting to isolate individuals with extreme phenotypes. This technique could then be used for any property for which there is an appropriate reporter. Kruglyak illustrated sorting based on mitochondrial output, but it could be applied to tracing the genetic loci for many types of phenotypic variation.

Michael Snyder, Stanford University
John Stamatoyannopoulos, University of Washington


  • Regulatory difference due to variable binding of transcription factors underlies much individual variation as well as disease.
  • Specialized transcription factors that regulate only a few genes can be as important, for those genes, as master regulators that have widespread effects.
  • Personalized genome sequencing needs better accuracy, in particular in assessing the number of gene copies, and its interpretation is often uncertain.
  • Mapping of cleavage by DNase I gives direct, genome-wide indications of locations accessible to DNA regulatory features like nucleosome formation and transcription factor binding.
  • The patterns of DNase I hypersensitive sites in different cells mirror the cells' developmental relationships.
  • Deep sequencing of cleavage sites reveals characteristic patterns of cleavage, at single-nucleotide resolution, for different transcription factors.

Variations in transcription-factor binding

As an expert in genomic techniques, Michael Snyder decided to check out the increasingly affordable options for personal genome analysis. He compared the results from Complete Genomics and Illumina, both of which identified more than 3 million single-nucleotide polymorphisms (SNPs). Each had several hundred thousand calls that were not in the other set, a difference that Snyder ascribed to missing data. "The number-one problem with getting your genome sequenced is that they're not deep enough in all regions."

Even SNPs reported by both companies disagreed in many cases on zygosity, which "makes a big difference. It's no big deal for 1000 genomes at low coverage," Snyder said. But "I don't care about average. I care about me." Overall, he said, there is still a long way to go in terms of accuracy and interpretation of personal genome data. The techniques are even less reliable for structural variants, such as insertions, deletions, and inversions, Snyder said.

In their primary research, Snyder and his team are exploring variation both between related species and between individuals. In particular, they are looking to see how much variability arises from differences in transcription-factor binding.

In one study, they mapped the genome-wide binding, using chromatin immunoprecipitation and sequencing (ChIP-seq), of the transcription factor Ste12 in yeast. They exploited natural variation among 45 segregant strains from a cross between two lab strains and tracked the binding, as well as gene expression, after exposing the yeast to a pheromone.

Most of the sites (about 70%) showed classic Mendelian segregation, binding Ste12 in one genetic background and not in the other. But other sites showed "transgression" from this expectation, for example binding in some of the segregants when neither parent did.

Snyder's team then looked for quantitative trait loci (QTLs) that contribute to binding at these highly variable sites. Of 195 sites with a unique QTL, 166 are cis (close to the binding region), while 35 are trans (a few are both). "Most variable binding sites are linked in cis to the QTLs," Snyder said.

The simplest explanation would be that the differences in binding reflect changes in the sequence of the binding site for Ste12. "That turns out to be true, but it's only true in 36 of the 166 cis-variable regions," Snyder said. For the remainder, there appear to be variations in the sequences that code cofactors that help Ste12 to bind. Using a test for what they called "Allele Binding Cooperativity," or ABC, the team found six new binding sites for factors whose motif covaries with Ste12 binding.

"Cooperative binding is rampant throughout the genome."

"None of these factors were known previously to work with Ste12," Snyder said, which is a master regulator that binds about 1000 sites across the genome. "These guys are only operating on a subset of regions, but they have a really strong effect" in the places they bind. "We think this kind of cooperative binding is rampant throughout the genome," Snyder said. But because the effect only occurs at a few sites, it's hard to detect in a genome wide scan, he said. "This is what's going to make the regulatory code very, very hard to decipher."

In related work in humans, Snyder and his colleagues mapped the binding of two factors, RNA polymerase II (Pol-II) and NFκB. They compared ChIP-seq data for cells from ten individuals, finding variation at 7.5% of sites for NFκB and 25% for Pol-II. "There are a number of variable binding regions out there," Snyder observed, and on average the binding does correlate with gene expression.

Only about 7% of the variation in binding corresponds to deviations from the consensus binding motif. There are also some sites whose binding correlates with copy-number variants, as well as with inversions (together about 3%). Another 31% of sites have a SNP nearby, but for two-thirds of sites, "we have no idea what's going on," Snyder said.

Using their ABC test, Snyder said, "we found five different factors whose motif varies in accordance with their NFκB binding," but not in the NFκB binding site. The results suggest that, as in yeast, some sites are controlled not just by master regulators but also by other, locally powerful factors. "This is a nice way of seeing which factors are working together."

Genome-wide mapping of proteins on DNA

"Finding regulatory factors on the genome, by itself, doesn't necessarily indicate what they're doing, but it does serve as an incredibly useful generic marker of the whole wide range of classes of elements," said John Stamatoyannopoulos. He helped popularize the genome-wide mapping of DNA cleavage by deoxyribonuclease I, or DNase I, in projects including ENCODE and the Roadmap Epigenomics Mapping Consortium. DNase-I hypersensitive sites, or DHSs, are DNA regions that are particularly accessible to cleavage, which often reflects the presence of regulatory sequences like promoters.

So far the ENCODE and Roadmap Epigenomics projects have mapped DHSs, with roughly 150 base-pair resolution, in over 100 cell types and tissues and developmental stages, Stamatoyannopoulos said. "You find between 100,000 and 275,000 DNase hypersensitive sites per cell type, or 0.5%–1.5% of genome," even with a stringent 1% false discovery rate. "The real numbers are a bit higher," he said.

Across all cell types, "we detect about 2.2 million distinct DNase-I hypersensitive site positions on the human genome," Stamatoyannopoulos said. Comparing with the literature, these sites encompass about 96% of all known non-promoter regulatory elements, such as enhancers, silencers, and insulators.

Differences between different cells in chromatin structure and other regulatory interactions can modify binding at these sites. "About 340,000 are cell-type specific," Stamatoyannopoulos said, while "about 7500 are present in every single cell type." The rest of the sites show rich intermediate patterns of expression. A clustering analysis of these patterns reveals a hierarchical relationship that precisely mirrors the relationships of the corresponding cells, he said. "We're looking at an encoding of early developmental processes and developmental lineages in the patterns of regulatory DNA that persist into adults."

One mechanism for regulating expression in different tissues is the large-scale organization of chromatin, which can bring together sequences that are on very distant parts of the DNA molecule. To capture these physical interactions in vivo, Stamatoyannopoulos and his collaborators are using the cross-linking technique known as Chromosome Conformation Capture Carbon Copy, or 5C. "We get very, very quantitative information on these interactions," Stamatoyannopoulos said, with a resolution of about a kilobase.

"Deep sequencing DNase-I data can reveal transcription-factor binding at nucleotide resolution."

At a genome-wide scale, DNase-I cleavage reveals areas where the enzyme, and presumably transcription factors, have free access to the DNA. But at a finer scale, a transcription-factor protein that does bind then blocks access by DNase I, "leaving behind a negative image of the protein," Stamatoyannopoulos said. "By deep sequencing the DNase-I data, you can effectively transform the mapping data into footprint data to reveal transcription-factor binding at nucleotide resolution."

On the scale of tens of bases, "every kind of different transcription factor binding site has its own stereotypical DNase-I cleavage pattern, sort of a fingerprint," Stamatoyannopoulos said. "These cleavage patterns match extremely closely with structural motifs that are identified in crystallography," he added, and can also be used to locate specific factors in a scan of the genome. He stressed that these fingerprints do not simply reflect the sequence-dependent cut rates described by Harmen Bussemaker.

The depth of the footprints in DNase-I activity can be used to track how frequently a transcription factor occupies a site. Stamatoyannopoulos and his colleagues found that this occupancy changes precisely as expected during changes in cellular conditions, for example during differentiation. "These data are both qualitative and quantitative in terms of measures of occupancy," he said.

The researchers also developed techniques for detecting specific factors at particular sites. They first replicated the sequence in the footprint and tagged it to create a probe specific to that region. They then used one of two techniques to detect binding of a transcription factor to the probe. Factors for which there are appropriate antibodies could be detected using a Western blot. For other factors, they used targeted mass spectrometry to find signature peptides whose mass-to-charge ratio is specific for particular proteins. "We don't need antibodies anymore as long as you can clone the transcription factors," Stamatoyannopoulos said. "With this approach, you can prove that a protein is actually engaging a specific motif sequence, even in a competitive context with other proteins."

Eran Segal, Weizmann Institute of Science
Harmen Bussemaker, Columbia University


  • A framework based on statistical mechanics predicts the probability of any configuration of nucleosomes and transcription factors on DNA, based on their sequence-dependent affinities.
  • An experimental yeast system allows comparison of the effect on expression of different promoter sequences with an accuracy better than 10%.
  • Much, but not all, of the organization of nucleosomes in vivo is determined by their DNA sequence preferences.
  • Poly-adenosine sequences, which are too rigid to easily wind into nucleosomes, significantly modify the expression controlled by binding of transcription factors nearby, and appear to have been used for this purpose during yeast evolution.
  • The common weight-matrix description, describing the sequence dependence of the binding affinity between a protein and DNA, ignores potentially important dependencies between bases at different positions.
  • The cleavage rate by DNase I varies by several orders of magnitude with the local DNA sequence, and gives information about the affinity with single-nucleotide resolution.
  • Combining models of affinity with genetic crosses lets researchers find regions that affect the activity of transcription factors, which is more powerful than locus identification for other traits.

Sequence specificity of nucleosome organization

Understanding the rules that determine how transcription is regulated, analogous to our understanding of the genetic code, would be tremendously useful in biology, says Eran Segal. But "despite many years of study we really still don't understand a lot of the basics and many fundamental questions are still open." Some of the more complicated questions involve the roles of distant elements like enhancers, chromatin structure, and the cooperative interactions of multiple regulatory events. Clarifying these complex issues requires a quantitative understanding of how gene expression is affected when nearby DNA is bound by transcription factors or wraps around histone proteins to form nucleosomes.

To explore these issues, Segal and his colleagues have developed a modeling framework for predicting sequence-dependent binding, and an experimental system that can quantitatively distinguish even small transcriptional effects of sequence changes. By varying the sequences and comparing the results with models, they are unraveling the rules of sequence-dependent organization of transcription factors and nucleosomes.

The modeling framework starts with an "affinity landscape," which describes how the affinity between a particular molecule and DNA varies along the sequence. For transcription factors, the affinity is determined by relatively short sequences, and can be deduced from protein-binding microarray data. Analogous experiments reveal the sequence-sensitive binding of DNA in nucleosomes, which reflects larger regions of 147 bases.

The experimental affinities are measured in vitro. "We'd like to understand how, in that affinity landscape, in a dynamic situation, you can get different configurations of actual bound molecules," Segal said. Using a statistical-mechanics model, "under the assumption of thermodynamic equilibrium we can compute exactly the probability that the system will be in any one of these configurations." The predicted positions of nucleosomes match well with in vivo experiments in yeast. "Much, but certainly not all, of the organization of nucleosomes in vivo is dictated by nucleosome sequence preferences," Segal concluded. "We understand to a large degree the rules that govern nucleosome sequence preferences."
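The equilibrium calculation can be illustrated with a toy sketch. It assumes a single kind of particle with a fixed footprint, a hypothetical list of statistical (Boltzmann) weights for each start position, and brute-force enumeration of configurations. None of these details come from the talk, and a realistic 147-base footprint on a genome would need a dynamic-programming recursion rather than enumeration.

```python
from itertools import combinations


def configurations(length, footprint):
    """Yield every set of non-overlapping start positions for
    particles of width `footprint` on a sequence of `length` bases."""
    starts = range(length - footprint + 1)
    for k in range(len(starts) + 1):
        for combo in combinations(starts, k):
            # combinations() yields sorted tuples, so checking
            # neighbouring starts is enough to rule out overlap.
            if all(b - a >= footprint for a, b in zip(combo, combo[1:])):
                yield combo


def configuration_probabilities(weights, footprint):
    """Map each configuration to its equilibrium probability.

    weights[i] is the statistical weight of a particle starting at
    position i; the empty configuration has weight 1.
    """
    length = len(weights) + footprint - 1
    raw = {}
    for config in configurations(length, footprint):
        w = 1.0
        for start in config:
            w *= weights[start]
        raw[config] = w
    z = sum(raw.values())  # partition function
    return {config: w / z for config, w in raw.items()}


# Hypothetical weights for four start positions, footprint of 3 bases.
probabilities = configuration_probabilities([2.0, 0.5, 1.0, 4.0], 3)
for config, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(config, round(p, 3))
```

Each configuration's probability is its weight divided by the sum of all weights, so high-affinity positions dominate only to the extent that competing, mutually exclusive placements allow.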

"We understand to a large degree the rules that govern nucleosome sequence preferences."

One biological consequence of such binding preferences is their effect on expression of nearby genes. Segal and his team have developed an experimental system in yeast that allows quantitative assessment of expression changes that result from sequence changes in promoter regions. Because the genomic context is always the same, "the system controls for many different things," Segal said. "We can distinguish expression differences that are as small as 5 or 10%."

The researchers have been using this experimental system to clarify how sequence changes affect expression, by swapping in both natural and synthetic promoters and making systematic changes in regulatory elements. Segal discussed in detail the role of poly-adenosine (poly-A or poly(dA:dT)) sequences, which are abundant in eukaryotic genomes, especially in promoter regions. "They repel nucleosomes due to their rigidity and inability to conform to the sharp bending of the DNA that is required by the nucleosome structure," he said. Deletion of such sequences near a binding site for the transcription factor GCN4 in yeast was shown fifteen years ago to reduce expression of the gene it regulates.

The proximity and position of a nearby poly-adenosine sequence in the DNA has a strong effect on the binding of a transcription factor and the resulting gene expression.

"We wanted to examine these questions in a more systematic and comprehensive way," Segal said. Without changing the GCN4 binding site, the researchers modulated expression through various changes to the nearby poly-A sequences, which changed the likelihood of a nucleosome forming nearby and thus blocking transcription-factor access to the binding site. "By making changes only to poly-A sequences ... we can get dramatic influences on gene expression levels," Segal concluded.

The resulting expression changes are as large as those resulting from sequence changes in the binding site, and may provide a way to fine tune expression. To see if evolution has exploited this mechanism, Segal's group compared promoters for various ribosomal components, which need to be produced in similar amounts. They found that, for genes that have only a single copy, the associated promoters are much more likely to have nearby poly-A sequences that make them highly expressed, in comparison with genes that have multiple copies. This suggests that the fine transcriptional control provided by nucleosome organization has truly been exploited during evolution to compensate for copy-number variations.

Modeling DNA–protein interactions

Harmen Bussemaker and his colleagues use a biophysically motivated position-specific affinity matrix to capture sequence specificity. They then use the calculated binding affinity of cis-regulatory regions to estimate the regulatory activity of each transcription factor in a particular cell state.

The regulatory activity of transcription factors, which is "how much more transcriptional activity you get when the promoter affinity increases," Bussemaker said, can be regarded as a trait. The researchers map the genetic influences on this activity to quantitative trait loci, or "aQTLs," from yeast segregants. "We can determine not only how mRNA levels are determined by non-coding sequence but go one level upstream and understand how the activities of the transcription factors themselves are determined," Bussemaker said.
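One way to picture activity as a trait is as the slope of expression against promoter affinity, computed separately for each segregant. The sketch below makes that assumption explicit with a single-factor least-squares fit. The affinities, expression values, and segregant names are hypothetical, and the report does not describe the actual fitting procedure.

```python
def regression_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical promoter affinities of four genes for one factor.
affinity = [0.1, 0.4, 0.7, 1.0]

# Hypothetical log-expression of the same four genes in two segregants.
expression = {
    "segregant_A": [1.0, 1.6, 2.2, 2.8],
    "segregant_B": [1.0, 1.1, 1.2, 1.3],
}

# The slope is the inferred activity of the factor in each segregant.
activity = {
    name: regression_slope(affinity, values)
    for name, values in expression.items()
}
print(activity)
```

The per-segregant slopes would then be treated like any other quantitative trait and tested for linkage against genotype markers.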

Combining calculated promoter binding affinities with expression data lets researchers infer which loci (aQTLs) affect regulatory activity.

"There's quite good statistical power for this," Bussemaker said. Trends in activity are less noisy than the individual activity levels used in expression QTLs. In addition, the number of tests is limited to about 100 transcription factors, rather than thousands of gene expression levels.

The aQTLs typically cover 10 or 20 genes, Bussemaker said. "This genetic variation causally influences the expression, through the transcription factors, but in general we don't know the mechanism." In contrast, protein–protein interactions give mechanistic molecular information but may not be relevant in a particular cell state. Combining the two can narrow the field down to one particular gene.

In another project Bussemaker collaborated with John Stamatoyannopoulos to explore the sequence specificity of cleavage by DNase I. Sequencing the resulting fragments shows that the cut rate varies by two or three orders of magnitude, "much more than you might have expected based on the literature," Bussemaker said. Because the enzyme position can be determined to within a single base pair, this study provides an "ideal case for modeling."

The researchers determined the cut rate for all possible hexamer sequences straddling the cut. A position-weight-matrix model predicts cut rate much more poorly than does the full hexamer sequence, "so there have to be significant dependencies between nucleotide positions," Bussemaker concluded. Using the full data set let the researchers systematically quantify the strength of these dependencies.

The position-weight matrix description of binding affinity considers each base within a motif independently, but Bussemaker cautions that this calculation is too simplistic. "It's important to be quantitative and go beyond the independence assumption of these weight matrices to be able to discriminate between these factors," he said.
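The independence assumption has a simple testable consequence, sketched here for a hypothetical two-position motif. Under any additive weight matrix, the scores of AA and CC must sum to the same value as the scores of AC and CA. A measured table that violates this equality cannot be reproduced by a weight matrix. All numbers below are invented for illustration.

```python
def pwm_score(sequence, matrix):
    """Sum independent per-position contributions (log scale)."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence and matrix must have the same width")
    return sum(column[base] for base, column in zip(sequence, matrix))


def additivity_gap(score):
    """score(AA) + score(CC) - score(AC) - score(CA).

    This is exactly zero for any additive (weight-matrix) model.
    """
    return score("AA") + score("CC") - score("AC") - score("CA")


# Hypothetical two-column weight matrix.
matrix = [
    {"A": 1.0, "C": 0.2, "G": 0.0, "T": -0.5},
    {"A": 0.3, "C": 0.9, "G": -0.2, "T": 0.0},
]

# Hypothetical measured log-affinities with a dependency between
# the two positions (matching bases are favoured).
measured = {"AA": 2.0, "AC": 0.5, "CA": 0.4, "CC": 2.1}

pwm_gap = additivity_gap(lambda s: pwm_score(s, matrix))
measured_gap = additivity_gap(measured.__getitem__)
print(round(pwm_gap, 6), round(measured_gap, 6))
```

A gap far from zero in measured data is one signal that a model with terms for pairs of positions, or a full k-mer table, is needed.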

"It's important to be quantitative and go beyond the assumption that different sequence positions contribute independently."

Bussemaker and his colleagues also studied the sequence specificity of Hox proteins. Experimental in vitro "monomer specificities cannot really explain the variation in target specificity of Hox proteins in vivo," he said. Columbia colleagues Barry Honig and Richard Mann had suggested that the specificity in vivo arises when the minor groove of DNA interacts with the junction between the Hox protein and a cofactor called extradenticle (Exd).

In collaboration with Mann's group, postdocs Matt Slattery and Todd Riley devised an extension of SELEX (Systematic Evolution of Ligands by Exponential Enrichment), which exploits laboratory selection of high-affinity DNA binding. By stopping before the enrichment is saturated and then sequencing the enriched population, Bussemaker said, researchers "get quantitative information about the rate at which different DNA sequences are selected, and that's a good source of sequence specificity models."

The team compared DNA binding of two Hox proteins, Ubx and Scr, both in the presence of Exd. The binding strength was strongly altered by the two central bases in the binding motif. "Hopefully that ultimately will allow people to understand in vivo why these Hoxes can have such different targets," Bussemaker said.

Overall Coordinators:
Gustavo Stolovitzky, IBM
Robert Prill, IBM
Raquel Norel, IBM
Challenge Speakers:
Hans-Juergen Thiesen, University of Rostock
Rob Patro, University of Maryland
Nicola Barbarini, University of Pavia
Matt Weirauch, University of Toronto
Matti Annala, Tampere University of Technology
Yaron Orenstein, Tel Aviv University
Alberto de la Fuente, CRS4
Matthieu Vignes, INRA-Toulouse
Po-Ru Loh, Massachusetts Institute of Technology
Daniel Marbach, Massachusetts Institute of Technology
Vân Anh Huynh-Thu, University of Liège
Robert Küffner, Ludwig Maximilian University


  • DREAM, the Dialog for Reverse Engineering Assessments and Methods, lets researchers collaborate by competition to solve biological or biologically inspired problems with known, but withheld, answers.
  • The challenges this year included two that combine simulated data from a known network with measured biological data.
  • The proper combination of predictions from all teams usually beats even the best team, because of the complementary strengths and weaknesses of different techniques.


A continuing goal of the DREAM (Dialog for Reverse Engineering Assessments and Methods) conference is to determine as objectively as possible how well researchers can infer and predict biological reality. The blinded competitions known as the DREAM Challenges are the vehicle for that assessment. Long before each meeting, organizers Gustavo Stolovitzky, Robert Prill, and Julio Saez-Rodriguez worked with other researchers to assemble four sets of unpublished or disguised data.

The tasks change from year to year, and are chosen to illuminate important biological issues and challenging, but hopefully solvable, computational problems. One recurring issue in choosing problems is the conflict between perfect mathematical specification and biological accuracy. As part of the continuing effort to test biological relevance of the tasks, two of this year's challenges used both real biological data and simulated data in different parts of the challenge.

The data were made available to numerous teams of researchers, who sought to extract the undisclosed rules or structure that give rise to the data, or to make predictions about additional data that had been withheld. Predictions from 73 teams, whose membership was not made public, were collated, scored and compared by Prill, Raquel Norel, and Gustavo Stolovitzky, and organized on the DREAM project website with support from Tom Garben and Aris Floratos at Columbia University. In most cases, the combined performance of all the predictions was better than any individual prediction, and Prill, Norel, and Daniel Marbach described ways that the community might make use of that collective wisdom.
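One simple way to pool predictions is to average each item's rank across teams. The sketch below shows that idea on hypothetical confidence scores. The report does not say how the organizers combined predictions, so rank averaging is an assumption made for illustration, and the function name and data are invented.

```python
def consensus_ranking(team_scores):
    """Order items by their mean rank across teams.

    team_scores is a list of dicts mapping item -> confidence score
    (higher means more confident). Rank 1 is the most confident item.
    """
    if not team_scores:
        raise ValueError("need at least one team")
    items = set(team_scores[0])
    if any(set(scores) != items for scores in team_scores):
        raise ValueError("every team must score the same items")
    rank_sums = dict.fromkeys(items, 0)
    for scores in team_scores:
        # Break score ties alphabetically so the result is deterministic.
        ordered = sorted(scores, key=lambda item: (-scores[item], item))
        for rank, item in enumerate(ordered, start=1):
            rank_sums[item] += rank
    # Sorting by the rank sum is equivalent to sorting by the mean rank.
    return sorted(items, key=lambda item: (rank_sums[item], item))


# Hypothetical scores from three teams for four candidate edges.
teams = [
    {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2},
    {"a": 0.7, "b": 0.5, "c": 0.6, "d": 0.1},
    {"a": 0.8, "b": 0.9, "c": 0.3, "d": 0.2},
]
print(consensus_ranking(teams))
```

Working with ranks rather than raw scores sidesteps the fact that different methods report confidence on incomparable scales.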

The best-performing individual teams for each of the four challenges were invited to speak briefly at the conference about their methods. The organizers also took note of other teams for honorable mention. The following summarizes the different challenges, the overall results, and the approaches taken by the best performers.

DREAM Challenge 1: Epitope–antibody recognition

The first challenge asked participants to predict whether individual peptides would react strongly, or not at all, with a commercially available mixture of antibodies. Hans-Juergen Thiesen and his colleagues assembled the experimental data for the challenge of describing rule sets for epitope–antibody recognitions (EAR).

The diverse antibody mixture, called intravenous immunoglobulin, or IVIG, is used clinically and was obtained from 10,000 to 100,000 healthy people. The peptides mostly matched sequences from the human genome, but some were modified slightly and others were random. These peptides were synthesized and arrayed at high density on glass slides for quantitative readout. Teams received a list of more than 13,000 peptide sequences that had reacted strongly or not at all with the IVIG, and were given a similar number of sequences to classify. In principle, the challenge was to find common rules or attributes that determine the interaction of antibodies with peptide sequences, exemplifying the interaction of antibodies with linear epitopes. A "bonus round" subchallenge asked teams to propose peptide sequences that would bind IVIG either strongly or not at all.

The two best performers for challenge 1 were significantly better at prediction than the rest of the teams. The Best Performer was Team Pythia, formed by Rob Patro and Carl Kingsford of the University of Maryland. They settled on a support-vector-machine implementation, and combined a large number of candidate features for classification. The single best classifier was local amino-acid composition, Patro said, so "simple features should not be discounted." The structural calculation of the best docking geometry using Zdock performed worst as a single classifier. "There's lots of room for improvement," he observed.

Team Pavia, consisting of Nicola Barbarini, Alessandra Tiengo and Riccardo Bellazzi of the University of Pavia, were a close second. They evaluated a large number of sequence features, including some proxies for structural features but no comprehensive structural modeling. They used a leave-one-out approach to train various algorithms, and found that the best performance came with a linear regression model that exploited 28 attributes. No single rule dominated the classification.

Peptides predicted in the bonus round by both best-performing groups are currently being validated experimentally by the group of Hans-Juergen Thiesen.

DREAM Challenge 2: Transcription-factor–DNA motif recognition

The second challenge concerned the prediction of transcription-factor binding motifs in DNA sequences. Matt Weirauch and Tim Hughes of the University of Toronto assembled the data from Protein-Binding Microarrays (PBMs).

The current paradigm for evaluating sequences, Weirauch noted, is the position-weight matrix, which simply combines contributions from the nucleotide at each position. "It's becoming more obvious that there are problems with this approach," he said. In particular, it cannot handle variable-width gaps between sections of the motif, transcription factors with multiple binding modes, and dependencies between residues at different positions as described by Harmen Bussemaker.

Participants received binding specificity data for 20 different transcription factors from two PBM arrays containing different probe sequences. The probes in each array are designed such that all possible 10-base sequences are present once, so all possible 8-mer sequences are present 32 times. The teams then predicted the affinity for 66 more factors, 33 for each type of array. A "bonus round" subchallenge asked teams to name the anonymized transcription factors.
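The counting behind this design can be checked on a scaled-down example. The sketch assumes the probes tile a de Bruijn sequence, in which every n-base word occurs exactly once, and it reads the factor of 32 as 16 occurrences per strand (4 to the power 10 minus 8) doubled by counting the reverse complement. That reading is an interpretation rather than something stated in the talk. The toy uses order 5 and 3-mers so that it runs instantly, and it treats the sequence as circular.

```python
from collections import Counter


def de_bruijn(alphabet, n):
    """Cyclic de Bruijn sequence containing every n-mer exactly once
    (standard Lyndon-word construction)."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def cyclic_kmer_counts(sequence, k):
    """Count every k-mer, reading the sequence as a circle."""
    wrapped = sequence + sequence[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(len(sequence)))


def reverse_complement(kmer):
    return kmer[::-1].translate(str.maketrans("ACGT", "TGCA"))


ORDER = 5  # toy stand-in for the 10-base design
sequence = de_bruijn("ACGT", ORDER)
counts = cyclic_kmer_counts(sequence, ORDER - 2)

# Occurrences of one 3-mer on the forward strand, then on both strands.
both_strands = counts["ACG"] + counts[reverse_complement("ACG")]
print(len(sequence), counts["ACG"], both_strands)
```

Every word two bases shorter than the design order appears 16 times on one strand, so the same ratio carries over from this toy to 8-mers in a 10-base design. Words that equal their own reverse complement are a special case that this sketch does not treat separately.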

The best performer in both the main challenge and the bonus round was Team csb_tut, consisting of Matti Annala of Tampere University of Technology, Kirsti Laurila, Matti Nykter, and Harri Lähdesmäki. They used a linear-affinity model that included k-mers of length between 4 and 8, but regularized the overconstrained data by only retaining the most informative k-mers. They performed several corrections to the PBM data for artifacts and signal saturation, and found it important to include the linker sequences used to build the arrays in their analysis. To identify the names of the transcription factors, they assessed the similarity of the sequences with motifs in the TRANSFAC and JASPAR databases.

Sharing best-performer honors in the bonus round was Team ACGT, Yaron Orenstein, Chaim Linhart, and Ron Shamir of Tel Aviv University. They used their lab's Amadeus motif finder, which was designed to find sequences in promoter regions. The most obvious way to apply this tool, however, simply giving it the probes with the highest binding, "failed miserably" on the training set, Orenstein said. What did work was to rank all k-mers based on the probe binding, and give those most informative k-mers to Amadeus. In particular, they averaged the binding strengths of all probes containing each 9-mer and gave the top-ranked 1000 as input sequences to Amadeus to find a motif position weight matrix of width 8.
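The k-mer ranking step can be sketched as follows. The probe sequences and intensities are hypothetical, the function name is invented, and the sketch ignores reverse complements and any preprocessing the team applied. It only illustrates averaging the signal of all probes that contain a given k-mer and then sorting.

```python
from collections import defaultdict


def rank_kmers(probes, k):
    """Return k-mers sorted by the mean intensity of the probes
    that contain them (highest first).

    probes is a list of (sequence, intensity) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sequence, intensity in probes:
        # Use a set so a probe contributes once per distinct k-mer.
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers:
            totals[kmer] += intensity
            counts[kmer] += 1
    means = {kmer: totals[kmer] / counts[kmer] for kmer in totals}
    # Break ties alphabetically so the ordering is deterministic.
    return sorted(means, key=lambda kmer: (-means[kmer], kmer))


# Hypothetical probes with made-up binding intensities.
probes = [
    ("ACGTAC", 9.0),
    ("TTGTAC", 7.0),
    ("ACGGGG", 1.0),
]
ranked = rank_kmers(probes, 4)
print(ranked[:3])
```

In the approach described above, the top of such a list (the best 1000 of the 9-mers) would be handed to the motif finder in place of whole probes.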

DREAM Challenge 3: Systems genetics

The third challenge concerned data from segregating populations, a field known as systems genetics or genetical genomics. The data include both simulated data and measured data from plants, and were assembled by Alberto de la Fuente and his colleagues.

Combined genetic and phenotypic data from segregants that result from crosses between inbred strains was discussed in several of the keynote talks at this conference, including those by Charlie Boone, Leonid Kruglyak, Michael Snyder, and Harmen Bussemaker. As those talks illustrate, the natural but highly constrained genetic variation among segregants yields powerful information about the genetic contributors to phenotype. The systems-genetics DREAM challenge should provide continuing insight into the evaluation of this type of data.

Part A of the challenge used simulated systems-genetics data. The researchers first generated 1000-gene networks with a modular scale-free topology using SysGenSIM, a tool developed by the labs of de la Fuente and Ina Hoeschele. They modeled the interaction between genes using nonlinear differential equations. The parameters of this model describing the basal transcription rate (cis) or its effect on a target gene (trans) were chosen from two values, representing the parent alleles, and the steady-state gene expression was calculated.

Participants were given both the expression levels and the corresponding parental allele for all 1000 genes, for simulated crosses between parents. Subchallenges A1, A2, and A3 had populations of 100, 300, and 999 offspring. The teams then reported edges on a directed graph, in order of confidence. "These networks are much bigger than we've had in DREAM before," commented Prill, "so we asked for just the first 100,000 edges."

The best performer in Part A was Team SaAB_meta and SaAB Dantzig, Matthieu Vignes, J. Vandel, N. Ramadan, D. Allouche, C. Cierco, S. De Givry, Brigitte Mangin, and Thomas Schiex of INRA-MIA in Toulouse. They first ran a regression test to distinguish cis- and trans-acting alleles. For further analysis they ran three different algorithms: a Bayesian network, Lasso regression, and the Dantzig selector. They then combined the three techniques into the Meta algorithm that gave their best-performer results.

Part B used data from soybean plants, produced at the Virginia Bioinformatics Institute, to see if participants could predict two phenotypes measuring their susceptibility to mold. The plants came from crosses between an ancestor that was resistant to the pathogen and one that was sensitive. Genotypes for 941 genes and pre-exposure gene expression for 28,397 genes were provided for 200 different plants. Teams were asked to predict the phenotype for 30 other offspring, using only genotype (B1), only pre-exposure gene expression (B2), or both (B3).

Overall, the results were "not too good," de la Fuente said, especially on genotype alone, so perhaps the task needs to be simpler. Prill commented that, particularly for challenge B1, "all the teams are correlated with each other, and none of them are correlated with the gold standard" (measured data). Still, two performers made statistically significant phenotype predictions.

The best performer for Part B2 was Team orangeballs, Po-Ru Loh, George Tucker, Michael Yu, and Bonnie Berger of the Massachusetts Institute of Technology. With expression data for so many genes, the challenge is figuring out "which of those 20,000 are going to be the ones that actually tell you about phenotype," Loh said. The challenge is made worse by the possibility of correlations among the predictors, both genotypes and phenotypes. The variation was dominated by extreme outliers, which the team de-emphasized by using a rank-ordering transformation. To account for possible nonlinear interactions between the predictors, they included Boolean combinations of genotypes. In the end, a handful of well-chosen predictors achieved most of the performance.
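A rank-ordering transformation of the kind mentioned above replaces each measurement with its position in the sorted data, so that an extreme outlier sits only one step beyond its nearest neighbor. The sketch below is a generic illustration, not the team's implementation; the function name rank_transform and the use of average ranks for ties are assumptions.

```python
def rank_transform(values):
    """Replace each value with its 1-based rank; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


if __name__ == "__main__":
    # The outlier 1000.0 simply becomes the highest rank.
    print(rank_transform([0.2, 1000.0, 0.5, 0.5]))
```

After the transformation, the distance between the outlier and the rest of the data no longer depends on its magnitude, which is why such a step can de-emphasize extreme values before fitting a predictor.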

The best performers in Part B3, Team RNI_group, consisting of Madhuchhanda Bhattacharjee of the University of Pune and Mikko Sillanpää of the University of Helsinki, were not able to present their methods.

Challenge 4: Network Inference

The fourth challenge assessed the common biological goal of inferring four transcriptional regulatory networks from expression data following perturbations. The data were assembled by Daniel Marbach, Jim Costello, Diogo Camacho, and Jim Collins.

This challenge builds on experience from previous years with "in silico" networks, where the network that generated the data is precisely known. For DREAM5, the simulated network, inspired by Escherichia coli, provided just one of the four data sets. The second data set was based on expression for Staphylococcus aureus, where there is not yet any reference network that can be regarded as a gold standard. "Hopefully the biologists will be more excited if we start focusing not only on the benchmarking but on this community-based prediction," Marbach said. The third and fourth data sets were measured in E. coli and the budding yeast Saccharomyces cerevisiae, where the underlying networks are quite well established.

Participants were given a list of genes and a large amount of microarray expression data, anonymized from the original data. They were also given supplementary information, such as the conditions of the experiments and some candidate transcription factors.

The predictions were judged for consistent performance across the networks, but "I see great diversity in the performances of the various teams" on the different networks, Prill commented. In particular, the yeast predictions were "terrible." The S. aureus network was not scored, since there was no gold standard, but will be used as the basis for a community prediction. The two best overall performers were both returning leaders from DREAM4.

The best performer overall and in silico was Team ulg_biomod, consisting of Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts of the University of Liège, and Yvan Saeys of Ghent University. They used a decision-tree-based model that relied only on expression data. Huynh-Thu noted that the predictions improve dramatically when the transcription factors are known. In addition, although the team did much better than others on the in silico data, their performance for the in vivo data was "only competitive."

Best performer in vivo, and second place overall, was Team Amalia, including Robert Küffner, Florian Erhard, Tobias Petri, Lukas Windhager, and Ralf Zimmer of Ludwig Maximilian University. To rank candidate interactions between transcription factors and possible target genes, the team used the ANOVA test. This requires neither linearity, as assumed in a correlation coefficient, nor discretization of the data, which is needed for Bayesian network or mutual information techniques. This technique worked well for E. coli. But for the yeast network, where all teams did poorly, and the unscored S. aureus network, the number of perturbation experiments in the data set was too small to expect reliable results, Küffner said. Nonetheless, he regarded the inclusion of in vivo data as a "big step forward" for DREAM.

Will small molecules that target co-dependencies in cancer be as profoundly effective as those that target oncogenes directly?

Can structural variants, like copy-number variation, and zygosity be cheaply and accurately included in personal genome tests?

How can increasingly genetically-targeted therapies, such as combination therapies for cancer, be designed and clinically tested?

Can the power of detailed interaction models be effectively folded into large networks in which determining parameters is impractical?

How can the graphical representation of biological networks better illustrate functional relationships?