Institute for Systems Biology

Computational Biology:




The computational biology group, led by Ilya Shmulevich, works with diverse data types to construct models of biomolecular systems for gaining insight into key biological processes and predicting cellular responses to environmental cues. The group also studies the theory of complex dynamical systems, develops machine learning methods for biomarker discovery, particularly in the context of cancer, and develops image analysis approaches for high-throughput cellular imaging and microfluidic platforms.

Systems Biology Theory:
Criticality is a dynamical property of complex systems in which, on average, small perturbations to a system's state are neither dampened nor amplified, but are propagated throughout the system over long periods of time. In a study published in Physical Review Letters [P. Krawitz, I. Shmulevich, "Basin Entropy in Boolean Network Ensembles," Physical Review Letters, Vol. 98, No. 15, pp. 158701(1-4), 2007.], we showed that systems that are critical can store an arbitrarily large amount of information that is only limited by the size of the system. On the other hand, both ordered systems, in which perturbations are dampened, and chaotic systems, in which perturbations are amplified, can only store a limited amount of information that cannot increase regardless of how large the system gets. To demonstrate this, we introduced a novel parameter, the basin entropy, for the dynamical uncertainty or information storage capacity of a network.
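
As an illustration of the basin entropy, the following minimal sketch computes it for a small synchronous Boolean network by exhaustive enumeration. It takes the basin entropy to be the Shannon entropy of the basin weights, where a basin weight is the fraction of all states that flow into a given attractor. The toy update rule, the function names, and the choice of base-2 logarithms are assumptions made for this example, not taken from the cited paper.

```python
import math
from itertools import product

def basin_entropy(update, n):
    """Shannon entropy (bits) of the basin sizes of an n-node Boolean network.

    `update` maps a state tuple of 0/1 values to the next state tuple.
    """
    attractor_of = {}  # state -> label of the attractor it eventually reaches
    for start in product((0, 1), repeat=n):
        path, position = [], {}
        cur = start
        # Follow the trajectory until it hits a labelled state or closes a cycle.
        while cur not in attractor_of and cur not in position:
            position[cur] = len(path)
            path.append(cur)
            cur = update(cur)
        if cur in attractor_of:
            label = attractor_of[cur]
        else:
            # New attractor: label it by the smallest state on its cycle.
            label = min(path[position[cur]:])
        for state in path:
            attractor_of[state] = label
    sizes = {}
    for label in attractor_of.values():
        sizes[label] = sizes.get(label, 0) + 1
    total = 2 ** n
    return -sum((s / total) * math.log2(s / total) for s in sizes.values())

def toy_update(state):
    """Hypothetical 3-node chain: x1 holds its value, x2 copies x1, x3 copies x2."""
    x1, x2, _ = state
    return (x1, x1, x2)

print(basin_entropy(toy_update, 3))
```

The toy network has two fixed points, (0,0,0) and (1,1,1), each attracting half of the eight states, so its basin entropy is one bit. Exhaustive enumeration is only feasible for small networks, since the state space grows as 2^n.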

This study is significant because criticality affords the system an ability to coordinate complex macroscopic behaviors that strike an optimal balance between stability and adaptability. Because of this, it has long been hypothesized that living systems are critical. This is supported by recent evidence from genomic studies. As information processing and storage is the basis for learning, the results of this study support the idea that living systems maintain criticality during evolution in order to maximize their capacity to learn, and hence be maximally fit in a changing environment.

We also studied the basin entropy numerically and in several model networks, such as the mammalian cell cycle, and developed an approach to estimate the basin entropy from time series data. This demonstrates that the basin entropy is a useful global biological observable that can be estimated from data without knowledge of the underlying network [P. Krawitz, I. Shmulevich, "Entropy of complex relevant components in Boolean networks," Physical Review E (in press)].

We tested experimentally the hypothesis that cells are critical, using system-wide gene expression dynamics in the macrophage. To this end, we developed a novel method, based on algorithmic information theory, to assess macrophage criticality and validated the method on networks with known properties. Using global gene expression data from macrophages stimulated with a variety of Toll-like receptor agonists, we found that macrophage dynamics are indeed critical, providing the most compelling evidence to date for this general principle of dynamics in biological systems.
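
The text does not specify which measure from algorithmic information theory was used. As a hedged illustration of the general idea, the sketch below computes the normalized compression distance (NCD), a widely used compression-based approximation of information distance, between two byte strings such as serialized expression states. The use of zlib as the compressor and the function names are assumptions made for this example.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical stand-ins for two serialized system states.
regular = b"abc" * 200
unrelated = bytes(range(256))
print(ncd(regular, regular), ncd(regular, unrelated))
```

Comparing how such a distance between successive states changes over time is one way to ask whether perturbations grow, shrink, or persist. How closely this matches the group's method is not established by the text.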

Network structure strongly constrains the range of dynamic behaviors available to a complex system. Numerous studies have shown that the most complex dynamics arise near the critical regime. We used an information theoretic approach to study structure-dynamics relationships in networks within a unified framework and demonstrated that these relationships are indeed most diverse in the critical regime.

Cell types can be thought of as attractors of underlying genetic regulatory networks, with differentiation being a route or trajectory through the state space of such networks. We used the HL-60 leukemia cell line as a model for differentiation and stimulated it with different dosages and durations of vitamin A, a known neutrophil differentiation agent. We found sets of genes that did not exhibit a convergence to an attractor starting from two different dose/duration stimuli that were experimentally determined to result in the same stage of differentiation according to flow cytometry data collected by monitoring a cell surface differentiation marker (CD11b). Cells were live-sorted and cultured in fresh media. Under one stimulus the cells terminally differentiated and under the other, they reverted back to their undifferentiated state. The genes that did not exhibit a convergence were hypothesized to be implicated in the neutrophil differentiation process. By scanning the promoter regions of these genes for transcription factor binding sites we identified enriched sets of transcription factors and through protein-protein interactions, predicted novel co-activators in neutrophil differentiation.

Inference and Modeling of Genetic Regulatory Networks:
Revealing regulatory mechanisms by which transcription factors (TF) bind and regulate gene expression is one of the key problems in understanding genome-wide transcriptional regulation. Many TF binding sites are relatively short and degenerate and hence they occur frequently in a genome by chance. These presumably non-functional sequences cause traditional TF binding site prediction methods to have unacceptably high false positive rates. A natural way to improve specificity of TF binding site predictions is to incorporate additional information, such as phylogenetic footprinting.

We have developed a principled way of combining multiple data sources, such as evolutionary conservation, regulatory potential, nucleosome positioning, ChIP-chip, CpG islands, and other prior knowledge, into a unified probabilistic framework to improve TF binding predictions. The proposed method estimates not only the probability of binding at each base pair position but also the probability of transcriptional regulation, i.e., the probability that the whole promoter has a binding site.
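
To make the two quantities concrete, here is a minimal sketch under strong simplifying assumptions; it is not the group's model. It assumes that the auxiliary data sources have already been summarized as a per-position prior probability of binding, that the motif model supplies a per-position likelihood ratio, and that positions are independent. All function names are hypothetical.

```python
def posterior_binding(prior, likelihood_ratio):
    """Bayes update of a per-position binding prior.

    likelihood_ratio = P(sequence | bound) / P(sequence | background).
    """
    weighted = prior * likelihood_ratio
    return weighted / (weighted + (1.0 - prior))

def promoter_probability(position_posteriors):
    """Probability of at least one binding site, assuming independent positions."""
    p_none = 1.0
    for p in position_posteriors:
        p_none *= 1.0 - p
    return 1.0 - p_none

# Hypothetical promoter: (prior from conservation etc., motif likelihood ratio).
positions = [(0.01, 2.0), (0.20, 50.0), (0.01, 0.5)]
posteriors = [posterior_binding(pr, lr) for pr, lr in positions]
print(posteriors, promoter_probability(posteriors))
```

The second position combines a strong motif match with a high prior and therefore dominates the promoter-level probability, which shows how additional evidence can suppress chance motif matches elsewhere.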

Dynamic regulatory models of such genetic control mechanisms are typically learned from time series measurements. However, it is not always possible to monitor changes over time, and in some experiments only steady-state behavior can be measured. Inference is then commonly restricted to learning static models. We developed a rigorous Bayesian method for learning the structure of dynamic regulatory networks from both time-series and steady-state measurements.

Gene regulatory networks are nonlinear dynamical systems in which gene expression is often initiated by a complex interaction of TFs. We have used kernel learning theory to non-parametrically learn a nonlinear differential equation model of the gene regulatory architecture controlling the activation program of the Toll-like receptor (TLR)-stimulated macrophage. We demonstrated the effectiveness of this approach in faithfully reconstructing the underlying network in terms of the specificity and sensitivity of the reconstruction, in a simulation framework with random ensembles of networks. This network inference framework enables us to integrate a variety of large-scale data sources (e.g., gene expression profiles, transcription factor binding data) and specifically addresses both the combinatorial and nonlinear nature of genetic control.

In the TLR-stimulated murine macrophage, clusters of co-expressed genes are observed to be induced or repressed in a series of "waves" post stimulation; the earliest clusters are found to be enriched for induced TFs. We investigated the use of the time-lagged correlation (TLC), a technique from signal processing, to identify likely regulatory associations between induced transcription factors and gene clusters responding at nonzero time lags relative to the TFs. While the TLC is not a definitive source of evidence of regulatory association by itself, we hypothesized that the TLC can be effectively used in combination with statistical testing of TF binding site motif enrichment within the gene cluster. We developed a novel unbiased statistical test for assessing the significance of an observed TLC, and for combining the TLC evidence with motif scanning evidence to obtain an overall significance score for the association between an induced TF and a gene cluster. Applying this method to a large-scale microarray expression dataset from TLR-stimulated macrophages, we obtained a matrix of TF-cluster associations that recapitulates many of the key TFs known to regulate macrophage activation, and provides insights into several novel potential regulators.
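
The core of the time-lagged correlation can be sketched as follows: the TF profile is correlated with the cluster profile shifted by each candidate lag, and the lag with the highest Pearson correlation is reported. This sketch assumes evenly spaced time points and does not reproduce the significance test or the combination with motif evidence described above. The function names and example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def time_lagged_correlation(tf, target, max_lag):
    """Return (lag, r) for the lag in 0..max_lag at which `target` best follows `tf`."""
    best_lag, best_r = 0, -math.inf
    for lag in range(max_lag + 1):
        # Pair tf[t] with target[t + lag] over the overlapping time points.
        r = pearson(tf[:len(tf) - lag], target[lag:])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

tf_profile = [0, 1, 4, 2, 1, 0, 0, 0]       # hypothetical induced TF
cluster_mean = [0, 0, 0, 1, 4, 2, 1, 0]     # same shape, two time points later
print(time_lagged_correlation(tf_profile, cluster_mean, max_lag=3))
```

In this example the best lag is two time points. With unevenly sampled time courses, the profiles would first need to be interpolated onto a common grid.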

We demonstrated that TLRs differentially induce an array of TFs, and used computational tools including promoter analysis, kinetic model selection, and enhanceosome prediction to identify possible gene regulatory networks that drive macrophage differentiation. In addition to identifying likely transcriptional regulators of TLR-activated genes, we also predicted the relative importance of candidate regulatory factors, and whether they act as activators or repressors. Specific regulatory hypotheses were evaluated for the NFkB, AP-1, and the CREB/ATF families of TFs, using genome-wide localization and gene deletion studies. Our studies also suggested a role for TFs not previously associated with macrophage activation. The postulated networks show promise in global prediction of transcription during macrophage activation. We have reported a general methodology for translating promoter analysis, transcriptional complex identification, and gene expression data into predictions of transcriptional regulatory networks. For the case of macrophages, our predictions offered numerous experimentally testable hypotheses.

Dynamical simulations using ordinary differential equations or stochastic Markov chain models can be useful in revealing the system-wide behavior of a complex biomolecular network, as well as the response of the network to a perturbation or knock-out of a key element (e.g., a transcription factor). We have used dynamical modeling to better understand the dynamics of two systems: the transcriptional response of yeast to the fatty acid oleate, and the transcriptional response of the macrophage to stimulation of TLRs. In each system, we focused on a small sub-network including key transcription factors and selected target genes that are representative of the induction of key functional modules. In the case of the macrophage, the dynamical model enables us to predict the effect of a double knock-out of the transcription factors ATF3 and C/EBP-delta on the induction of the cytokine IL-6 in response to the TLR agonist lipopolysaccharide. In the case of the yeast oleate network, the model is being used to investigate the specific functional role of the recently discovered transcription factor OAF3 in regulating the transcriptional response to oleic acid.
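
A knock-out experiment of this kind can be sketched with a deliberately small ordinary differential equation model. The two-variable system below (a stimulus-induced repressor and one target transcript), its rate constants, and the forward-Euler integrator are all hypothetical choices for illustration. They are not the group's macrophage or oleate models.

```python
def simulate_target(repressor_present=True, t_end=50.0, dt=0.01):
    """Integrate a toy repressor/target model and return the final target level.

    dr/dt = k_r - d_r * r              (repressor; k_r = 0 in the knock-out)
    dx/dt = k_x / (1 + r / K) - d_x * x  (target transcript, repressed by r)
    """
    k_r, d_r = 1.0, 0.5   # hypothetical repressor synthesis and decay rates
    k_x, d_x = 2.0, 0.5   # hypothetical target synthesis and decay rates
    K = 1.0               # hypothetical repression constant
    r = x = 0.0
    for _ in range(int(t_end / dt)):
        dr = (k_r if repressor_present else 0.0) - d_r * r
        dx = k_x / (1.0 + r / K) - d_x * x
        r += dt * dr
        x += dt * dx
    return x

print("wild type :", simulate_target(True))
print("knock-out :", simulate_target(False))
```

With these parameters the target settles near 4/3 in the wild type and near 4 when the repressor is removed, so the model predicts a threefold increase on knock-out. A real model would be fitted to measured time courses, and a stiff or larger system would call for a proper ODE solver rather than forward Euler.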

To make the computational predictions easily accessible and browsable, we have reported the development of a database, the Innate Immune Database (IIDB), of computationally predicted transcription factor binding sites and related genomic features for a set of over 2000 murine immune genes of interest. The database indexes genes and annotations of the immediate surrounding genomic regions. To facilitate both gene-specific and systems-oriented research, our database provides the means to analyze individual genes or an arbitrary set of genes. The database can be interrogated via a web interface, and genomic annotations and binding site predictions can be automatically viewed with a customized version of the Argo Genome Browser. The IIDB is freely accessible.

Bioinformatic Analysis Tools:

  1. Transcript Enumeration Clustering
    Transcript enumeration methods, such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern", are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data, which exhibit properties particular to the simplex space, where the sum of the components is constrained. These properties are not present in regular Euclidean spaces, in which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. We developed a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. Simcluster is available as a stand-alone command-line C package and as a user-friendly on-line tool. [R. Vêncio, L. Varuzza, C. Pereira, H. Brentani, I. Shmulevich, "Simcluster: clustering enumeration gene expression data on the simplex space," BMC Bioinformatics, Vol. 8, No. 246, 2007.]

  2. Enrichment Analysis with Probabilistic Annotation
    As in many other areas of science, Systems Biology makes extensive use of statistical association and significance measurements in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods, which are generally based on Fisher-like significance tests, largely ignore this information. We have developed an analysis framework, as well as a software tool, that does not require a static contingency table but rather deals with the probabilistic nature of this categorical data. An on-line web interface was created to allow use by non-programmers.
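
For reference, the classical baseline that the enrichment item above contrasts with can be sketched as a one-sided Fisher-type (hypergeometric) test on a fixed contingency table. This is the conventional static test, not the probabilistic-annotation framework described above. The function name and example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, selected, hits):
    """One-sided enrichment p-value P(X >= hits).

    X counts annotated genes among `selected` genes drawn without replacement
    from `population` genes, of which `annotated` carry the category.
    """
    upper = min(annotated, selected)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(population, selected)

# Hypothetical example: 40 of 200 selected genes carry a category that
# annotates 500 of 10000 genes overall (10 hits would be expected by chance).
print(hypergeom_enrichment_p(10000, 500, 200, 40))
```

The test treats every annotation as certain: a gene is either in the category or not. Handling annotations that carry probabilities requires replacing this fixed table, which is the gap the framework above addresses.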

Image Analysis:

  1. Quantification of peroxisomes in yeast
    We have developed segmentation methods for detecting yeast cells and their peroxisomes. Yeast cells are visualized with bright field microscopy and GFP-tagged peroxisomes are visualized by fluorescence microscopy. For the case of 2-D yeast images, we developed a segmentation method that is based on clustering of the local mean-variance space. The assumption in this method is that cell membrane areas are darker than other areas and also have a higher local variance. Segmentation of peroxisomes is based on thresholding. These two segmentations allow us to obtain such statistics as the number of peroxisomes in each cell, the area of each peroxisome, and the average intensity of each peroxisome [A. Niemistö, J. Selinummi, R. Saleem, I. Shmulevich, J. Aitchison, and O. Yli-Harja, "Extraction of the number of peroxisomes in yeast cells by automated image analysis," in Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'06), New York, USA, August 30 - September 3, 2006, pp. 2353-2356].

    For the case of 3-D image stacks obtained by confocal microscopy, we have developed a segmentation method that detects the yeast cells by K-means clustering [A. Niemistö, T. Korpelainen, R. Saleem, O. Yli-Harja, J. Aitchison, I. Shmulevich, "A K-Means Segmentation Method for Finding 2-D Object Areas Based on 3-D Image Stacks Obtained by Confocal Microscopy," 29th International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, August 23-26, 2007].

    We have also worked on analyzing and quantifying the 3-D peroxisome shapes by spherical harmonics and other shape descriptors [J. Selinummi, A. Niemistö, R. Saleem, G. W. Carter, J. Aitchison, O. Yli-Harja, I. Shmulevich, and J. Boyle, "A Case Study on 3-D Reconstruction and Shape Description of Peroxisomes in Yeast," IEEE International Conference on Signal Processing and Communication (ICSPC07), Dubai, United Arab Emirates (UAE), November 24-27, 2007].

  2. Tracking of yeast cells in microfluidics
    Microfluidics provides a way of conducting highly parallel studies of live single cells under well-defined and time-varying chemical stimuli. We have developed image analysis methods for segmenting yeast cells from time-course images taken of microfluidic chips. Our system also measures the integrated intensity of a GFP reporter. The segmentation method is based on clustering of the local mean-variance space; however, features of the microfluidic device, such as edges of flow and control channels that are visible as dark lines, require a large number of pre- and post-processing steps to minimize the number of false positives.

    In addition to the segmentation method, we have also developed a cell tracking method, which allows us to follow the integrated GFP reporter values of single cells over time. The Hungarian algorithm provides the core of our tracking method. However, since we cannot expect fully accurate segmentation, we have developed mechanisms that are able to deal with false negatives and false positives.
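
Frame-to-frame tracking of this kind is an instance of the linear assignment problem, which the Hungarian algorithm solves in polynomial time. The sketch below solves the same problem by brute force over permutations, which is adequate only for a handful of cells but keeps the example short and dependency-free. The distance gate and the dummy partners used to absorb missed or spurious detections are assumptions made for this example, not the group's actual mechanisms.

```python
import math
from itertools import permutations

def match_cells(prev, curr, max_dist=5.0):
    """Match cell centroids between two frames; returns [(prev_index, curr_index), ...].

    Cells with no partner within `max_dist` are left unmatched, which models
    missed detections (false negatives) and spurious ones (false positives).
    """
    n = max(len(prev), len(curr))
    too_far = 1e6  # prohibitive cost for pairs beyond the distance gate

    def cost(i, j):
        if i >= len(prev) or j >= len(curr):
            return max_dist  # dummy partner: the cell appears or disappears
        d = math.dist(prev[i], curr[j])
        return d if d <= max_dist else too_far

    best_perm, best_cost = (), math.inf
    # Exhaustive search over all one-to-one assignments (n! candidates).
    for perm in permutations(range(n)):
        total = sum(cost(i, j) for i, j in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return [(i, j) for i, j in enumerate(best_perm)
            if i < len(prev) and j < len(curr) and cost(i, j) <= max_dist]

frame_a = [(0, 0), (10, 10), (50, 50)]   # hypothetical centroids at time t
frame_b = [(11, 9), (1, 1)]              # time t+1: third cell was not segmented
print(match_cells(frame_a, frame_b))     # [(0, 1), (1, 0)]
```

Once matches are known for every pair of consecutive frames, the integrated GFP value of each cell can be chained into a single-cell time series. For realistic cell counts, the brute-force search would be replaced by a Hungarian-algorithm implementation with the same cost matrix.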

Center for Systems Biology at the Institute for Systems Biology
1441 N. 34th Street, Seattle, WA 98103
Phone: 206.732.1200 | Fax: 206.732.1299

© 2007, Institute for Systems Biology, All Rights Reserved