This is a very exciting time to do biomedical research. With past and ongoing breakthroughs in screening technologies, we are witnessing an unprecedented explosion in data generation. This development is an obvious boon for computational and theoretical researchers. For perhaps the first time in history, some disciplines within biology may become "analysis-limited" rather than "data-limited", as they have been historically. In other words, the interpretation of available data using computational and theoretical tools may become the limiting and most important step. Hence, the potential added value of computational biology is greater than ever.
Specifically, we believe that integrative approaches that combine the knowledge from different fields of bioscience, such as structural biology, systems biology, functional genomics and comparative genomics, will be especially powerful. We have pioneered integrative approaches combining structural and systems knowledge (see Kim et al. Science 2006), as well as methodologies from genetics and network biology (see Kim et al. PNAS 2007), and we strongly believe that these kinds of approaches will prove even more fruitful in the future. We are now applying these approaches in a translational setting and are actively developing inhibitors as lead compounds for cancer therapeutics.
Hence, our interests span the range from protein biophysics to genome evolution, as many of these fields will converge in the future. While our interests are not confined to these, most of our recent work can be categorized into one of two topics: Protein Interactions / Signaling Pathways / Cancer Inhibitors, or Genetic Variation / Alternative Splicing.
Analysis of signaling pathways and protein interactions:
In-depth analysis of signaling pathways and biophysics of kinase and domain binding
Modular protein domains have been shown to mediate important interactions, such as those in signal transduction. Prior studies have used various approaches to predict the targets of these peptide-binding domains; however, to date, the prediction of their biologically relevant targets has not been addressed in an automated and integrated fashion. Therefore, we have developed a motif analysis pipeline that predicts the target binding peptides of these domains by analyzing peptides identified from experimental screens, or a pre-made Position Weight Matrix, together with comparative genomic, structural genomic and genomic data.
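As an illustration of the motif-scoring step, the following is a minimal sketch of scanning a protein sequence with a Position Weight Matrix. It is not the pipeline's actual code: the toy matrix, the four-letter alphabet, the uniform background frequency, the pseudo-probability and the function names are all assumptions made for the example.

```python
import math

# Hypothetical toy PWM for a 3-residue motif: one dict of residue
# probabilities per motif position (illustrative values only).
TOY_PWM = [
    {"P": 0.7, "A": 0.1, "L": 0.1, "K": 0.1},
    {"P": 0.1, "A": 0.4, "L": 0.4, "K": 0.1},
    {"P": 0.7, "A": 0.1, "L": 0.1, "K": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background over the 4-letter toy alphabet
PSEUDO = 0.01      # assumed probability for residues missing from a column


def pwm_log_odds(window, pwm=TOY_PWM, background=BACKGROUND):
    """Sum of per-position log2 odds (motif vs. background) for one window."""
    if len(window) != len(pwm):
        raise ValueError("window length must equal motif width")
    score = 0.0
    for position, residue in enumerate(window):
        probability = pwm[position].get(residue, PSEUDO)
        score += math.log2(probability / background)
    return score


def best_window(sequence, pwm=TOY_PWM):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scored = [
        (pwm_log_odds(sequence[start:start + width], pwm), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

Calling `best_window("KKPAPKK")` picks out the window starting at index 2 (`PAP`), the only one that matches the toy motif at every position. In a real setting the matrix would have one column per motif position over all 20 amino acids, with background frequencies taken from the proteome.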
It complements the motif score with a variety of pre-computed features, such as conservation, surface propensity, and disorder, which have been previously shown to determine biologically relevant targets. It also integrates genomic features such as interaction and localization data to further improve the prediction. Finally, it applies a Bayesian learning algorithm to integrate all scores and give an optimal target prediction based upon a validated training data set. It aims to provide a comprehensive platform for researchers to predict biologically significant targets that are potentially recognized and bound by a particular domain of interest.
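The text does not specify which Bayesian learning algorithm is used, so the sketch below substitutes a simple naive Bayes classifier over binary features to show how independent lines of evidence can be combined into a single posterior probability. The feature names, the Laplace smoothing constant `alpha` and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(examples, labels, alpha=1.0):
    """Fit per-feature probabilities from a labeled training set.

    examples: list of dicts mapping feature name -> bool
    labels:   list of bools (True = validated target)
    alpha:    Laplace smoothing constant (an assumption of this sketch)
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("training set needs both positives and negatives")
    counts = {True: defaultdict(int), False: defaultdict(int)}
    feature_names = set()
    for features, label in zip(examples, labels):
        for name, present in features.items():
            feature_names.add(name)
            if present:
                counts[bool(label)][name] += 1
    model = {"prior_log_odds": math.log(n_pos / n_neg), "features": {}}
    for name in feature_names:
        p_pos = (counts[True][name] + alpha) / (n_pos + 2 * alpha)
        p_neg = (counts[False][name] + alpha) / (n_neg + 2 * alpha)
        model["features"][name] = (p_pos, p_neg)
    return model


def posterior(model, features):
    """Posterior probability that a candidate is a true target."""
    log_odds = model["prior_log_odds"]
    for name, (p_pos, p_neg) in model["features"].items():
        if name not in features:
            continue  # a missing feature contributes no evidence
        if features[name]:
            log_odds += math.log(p_pos / p_neg)
        else:
            log_odds += math.log((1 - p_pos) / (1 - p_neg))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Each feature adds a log-likelihood ratio to the prior log odds, so a candidate with a high motif score that is also conserved and disordered scores higher than one supported by the motif alone. Naive Bayes assumes the features are conditionally independent, which is a simplification for correlated features such as disorder and surface propensity.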
The SH3 domain is a common peptide recognition module conserved from yeast to humans and appears in numerous signaling proteins that regulate important biological processes such as actin cytoskeleton reorganization. Given that many yeast SH3 domains have been shown to recognize similar peptides, how can SH3 domains assemble protein interaction networks with high specificity? Although the structural basis for SH3 domain ligand recognition and binding is well established, much remains to be learned about how primary domain sequence relates to binding specificity and how these domains are capable of distinguishing between potential ligands. To address this question, we generated a high-resolution data set of yeast SH3 domain binding profiles using large-scale phage display analysis. This dataset was integrated with SPOT system data for the complement of yeast SH3 domains. Comparison of these newly derived binding profiles indicates that, in contrast to those previously known, SH3 domains appear to have a high degree of ligand specificity with little overlap. We have utilized these enhanced specificity profiles to generate a highly accurate quantitative protein interaction network for the yeast SH3 domains that will serve as an adequate list of candidate interactions for detailed biological investigations.
Disordered regions in protein networks
Recent studies have emphasized the value of including structural information in the topological analysis of protein networks. We utilized structural information to investigate the role of intrinsic disorder in these networks. Hub proteins tend to be more disordered than other proteins (i.e. the proteome average); however, we find that this is true only for those with one or two binding interfaces (‘single’-interface hubs). In contrast, the distribution of disordered residues in multi-interface hubs is indistinguishable from that of the overall proteome. Surprisingly, we find that the binding interfaces in single-interface hubs are highly structured, as is the case for multi-interface hubs. However, the binding partners of single-interface hubs tend to have a higher level of disorder than the proteome average, suggesting that their binding promiscuity is related to the disorder of their binding partners. In turn, the higher level of disorder of single-interface hubs can be partly explained by their tendency to bind to each other in a cascade. A good illustration of this trend can be found in signaling pathways and, more specifically, in kinase cascades.
3D Structural Modeling of the Interactome
Protein interaction networks are a principal component of a systems-level description of the cell. Network topology has been clearly linked to protein function, expression dynamics and other genomic features. However, most network studies thus far operate on a relatively high level of abstraction and treat all proteins as simple nodes and all interactions as simple edges, neglecting the structural and chemical aspects of each interaction. We utilize atomic-resolution information from 3D protein structures to further characterize proteins and interactions in the network, thereby giving a chemical reality to nodes and edges. Distinguishing hubs by their number of binding interfaces also helps to resolve the current debate on the correlation between a protein's degree and its evolutionary rate, as only multi-interface hubs appear to evolve more slowly than the average protein. Finally, we show that current models of network evolution can only explain the topology of interactions sharing the same interface, while they fail to explain the growth of multi-interface proteins and likely need to be revisited.
Novel Tools and Methods for Network Modeling
Biological processes involve complex networks of interactions between molecules. Various large-scale experiments and curation efforts have led to preliminary versions of complete cellular networks for a number of organisms. To grapple with these networks, we developed the TopNet-like Yale Network Analyzer (tYNA), a novel Web system for managing, comparing and mining multiple networks. tYNA efficiently implements methods that have proven useful in network analysis, including identifying defective cliques, finding small network motifs (such as feed-forward loops), calculating global statistics (such as the clustering coefficient and eccentricity), and identifying hubs and bottlenecks.
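To make some of these statistics concrete, here is a minimal, dependency-free sketch of three of them: the local clustering coefficient, degree-based hub identification and feed-forward loop detection. It is not tYNA's implementation. The graph representations, the `top_fraction` hub cutoff and the function names are assumptions chosen for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Local clustering coefficient of a node in an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    """
    neighbours = adj[node]
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    # Count how many pairs of neighbours are themselves connected.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (degree * (degree - 1))


def hubs(adj, top_fraction=0.2):
    """Nodes in the top fraction by degree (the cutoff is an assumption)."""
    ranked = sorted(adj, key=lambda node: len(adj[node]), reverse=True)
    n_hubs = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_hubs]


def feed_forward_loops(edges):
    """All (a, b, c) with directed edges a->b, b->c and a->c.

    edges: iterable of (source, target) pairs.
    """
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
    loops = []
    for a, targets in successors.items():
        for b in targets:
            for c in successors.get(b, ()):
                if c in targets and len({a, b, c}) == 3:
                    loops.append((a, b, c))
    return sorted(loops)
```

For example, in a regulatory network with edges X->Y, Y->Z and X->Z, `feed_forward_loops` returns the single triple `("X", "Y", "Z")`. Bottlenecks are usually defined through betweenness centrality, which requires shortest-path counting and is omitted here for brevity.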
Genetic variation and genome evolution:
Signatures of recent adaptive evolution in the interaction network
Because of recent advances in genotyping and sequencing, human genetic variation and adaptive evolution in the primate lineage have become major research foci. We examined the relationship between genetic signatures of adaptive evolution and network topology. We find a striking tendency of proteins that have been under positive selection (as compared with the chimpanzee) to be located at the periphery of the interaction network. Our results are based on the analysis of two types of genome evolution, in terms of both intra- and interspecies variation. First, we examined single-nucleotide polymorphisms and their fixed variants, single-nucleotide differences in the human genome relative to the chimpanzee. Second, we examined fixed structural variants, specifically large segmental duplications, and their polymorphic precursors, known as copy number variants. We propose two complementary mechanisms that lead to the observed trends. First, we can rationalize them in terms of constraints imposed by protein structure: we find that positively selected sites are preferentially located on the exposed surface of proteins. Because central network proteins (hubs) are likely to have a larger fraction of their surface involved in interactions, they tend to be constrained and under negative selection. Second, we show that the interaction network roughly maps to cellular organization, with the periphery of the network corresponding to the cellular periphery (i.e., extracellular space or cell membrane). This suggests that the observed positive selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.
Mechanisms of formation of Copy Number Variants and Segmental Duplications
In addition to variation in terms of single nucleotide polymorphisms (SNPs), whole genomic regions differ in copy number among individuals. These differences are referred to as Copy Number Variants (CNVs), which recent mapping studies have shown to be prevalent in mammalian genomes. CNVs that reach fixation in the population give rise to Segmental Duplications (SDs). SDs, in turn, are operationally defined as long (>1 kb) stretches of duplicated DNA with high sequence identity. Here, we investigate formation signatures for both phenomena. To this end, we examine in detail the co-occurrence patterns of different genomic repeat features with both CNVs and SDs. First, we analyzed the localization of SDs relative to other SDs (i.e. their co-localization) and find that SDs are significantly co-localized with each other, resulting in a highly skewed “power-law” distribution. This observation suggests a preferential attachment mechanism, i.e. existing SDs are likely to be involved in creating new ones nearby. Furthermore, we observe a significant association of CNVs with SDs, but show that an SD-mediated mechanism could account for only a fraction (at most 28%) of CNVs. Alu elements, a type of repeat, had previously been identified as another major contributor to SD formation by virtue of their strong association with SDs. While we also observe this association, we find that it sharply decreases for younger SDs. Continuing this trend, we find only weak associations of CNVs with Alu elements. In the same vein, we report an association of SDs with processed pseudogenes, which decreases for younger SDs and is absent for CNVs. Finally, we find a number of other repeat elements, namely LINEs and microsatellites, to be significantly more associated with CNVs than with SDs, which may explain the formation of CNVs. Overall, we find that a shift in the predominant formation mechanism occurred in recent evolutionary history.
About 40 million years ago, during a burst in retrotransposition activity (the “Alu burst”), non-allelic homologous recombination (NAHR), mediated by Alus, was the main driver of such genome rearrangement; however, its relative importance has decreased markedly since then, with proportionally more events now being associated with other repeats and with non-homologous end-joining (NHEJ).
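One simple way to quantify an association between two sets of genomic features, such as CNVs and SDs, is to measure the fraction of one set that overlaps the other and compare it with randomly repositioned intervals. The sketch below illustrates that general idea on a single chromosome. It is not the analysis performed in the study: the half-open interval convention, the uniform shuffling model, the number of permutations and all function names are assumptions.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(query, reference):
    """Fraction of query intervals that overlap any reference interval."""
    if not query:
        return 0.0
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)


def shuffled(intervals, chrom_length, rng):
    """Reposition each interval uniformly at random, preserving its length."""
    relocated = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        relocated.append((new_start, new_start + length))
    return relocated


def enrichment_p_value(query, reference, chrom_length, n_perm=1000, seed=0):
    """Observed overlap fraction and an empirical one-sided p-value."""
    rng = random.Random(seed)
    observed = fraction_overlapping(query, reference)
    at_least = sum(
        1
        for _ in range(n_perm)
        if fraction_overlapping(shuffled(query, chrom_length, rng), reference)
        >= observed
    )
    # Add-one correction keeps the empirical p-value strictly above zero.
    return observed, (at_least + 1) / (n_perm + 1)
```

A uniform shuffle ignores gaps, centromeres and local differences in repeat density, so a realistic null model would constrain the random placement accordingly. The all-pairs overlap check is quadratic and would need an interval index for genome-scale data.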