Computational Biology of Gene Expression
Computational Molecular Biology / Bioinformatics. Our goal is to understand the rules of RNA splicing specificity: how the precise locations of introns and splice sites are identified in primary transcripts. We are developing computational methods to identify splicing enhancer and repressor motifs and to identify genes in eukaryotic genomes. We are also using a combination of computational and experimental methods to study alternative splicing, a common mechanism of gene regulation in vertebrates.
RNA splicing specificity: Most eukaryotic genes contain one or more introns which must be removed from the primary transcript by the RNA splicing machinery in order to create the proper mRNA sequence to direct protein synthesis. This process must be highly accurate in order to ensure production of adequate amounts of correctly processed mRNA. The problem of RNA splicing specificity is to describe the set of "rules" which govern choice of intron and splice site locations in primary transcripts by the nuclear splicing machinery and to understand the molecular basis for these rules. This problem is analogous to the problem faced by biochemists in the early 1960s of identifying the rules governing translation of mRNAs into specific peptide sequences by the ribosome, the solution of which was the genetic code. The rules governing splicing are likely to be more complicated than those for translation and are not exactly the same in all organisms. On the other hand, progress in large-scale sequencing efforts is providing a wealth of data related to this problem in the form of thousands of gene sequences of known exon-intron structure.
A typical human primary transcript is about 30 kilobases long and contains several exons separated by much larger and more variably sized introns. The discrepancy between human exon and intron lengths led to the "exon definition" model of splicing in which splice sites are first paired across exons, with spliceosome assembly proceeding through subsequent pairing of exon units. In the alternative "intron definition" model, splice sites are initially paired across introns rather than exons. Intron definition is thought to be the predominant mode of splicing in transcripts containing short introns and long exons.
We have analyzed sequence features involved in recognition of short introns using available transcript data from five eukaryotes with complete or nearly complete genomic sequences. The information content of five different transcript features was measured using methods from information theory, and Monte Carlo simulations were used to determine the amount of information required for accurate recognition of short introns in each organism. We found that short introns in Drosophila melanogaster and Caenorhabditis elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs which simulate splicing specificity can predict the exact boundaries of approximately 95% of short introns in both organisms. In yeast the 5'ss, branch signal and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron. The 5'ss, branch signal and 3'ss are clearly not sufficient to accurately identify short introns in plant and human transcripts. However, specific subsets of short, intron-biased motifs can be identified in both human and Arabidopsis, and these contribute dramatically to the accuracy of splicing simulators, suggesting that intronic splicing enhancers play a large role in these organisms.
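As a rough illustration of what "information content" of a transcript feature can mean, the sketch below computes the per-position relative entropy (in bits) of a set of aligned sites against a background base composition. This is a common textbook measure, not necessarily the one used in the study described above. The function name, the uniform background, and the four toy 5'ss sequences are assumptions made for the example, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Per-position relative entropy (bits) of equal-length aligned sequences.

    `background` maps each base to its expected frequency; a uniform
    composition is assumed here when none is given (hypothetical default).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    length = len(sites[0])
    bits = []
    for i in range(length):
        # Count the bases observed in column i of the alignment.
        counts = Counter(seq[i] for seq in sites)
        total = sum(counts.values())
        column_bits = 0.0
        for base, n in counts.items():
            p = n / total
            column_bits += p * math.log2(p / background[base])
        bits.append(column_bits)
    return bits


# Toy alignment of 5' splice site sequences (illustrative only, not real data).
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
per_position = information_content(sites)
total_bits = sum(per_position)
```

Summing the per-position values gives one number per feature, which treats positions as independent. A Monte Carlo comparison of the kind mentioned above would need a separate simulation step that is not sketched here.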
It is well established that many exons contain internal sequences which either enhance or repress splicing and that other enhancers and repressors are commonly found in introns. We are developing computational methods for identifying novel splicing enhancer motifs based on the hypothesis that motifs which function as exonic enhancers should have two essential properties: 1) significantly higher frequency in exons than introns; and 2) significantly higher frequency in exons with "weak" (non-consensus) splice signals than in exons with strong consensus-matching splice signals. This screen clearly identifies several known classes of splicing enhancers including purine-rich elements in exons and GGG motifs in introns. Several novel classes of candidate enhancers are also identified. Both known and candidate enhancer motifs tend to be preferentially located at specific distances from splice junctions. The next step is to test the functions of candidate enhancers using in vitro and in vivo splicing assays. A similar approach will be used to screen for intronic enhancers and for splicing repressors.
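The two-property screen for exonic enhancers can be sketched as a pair of k-mer enrichment filters. The version below is a simplified illustration under stated assumptions: it uses pseudocount-smoothed frequency ratios with a fixed cutoff in place of a proper significance test, and the function names, the value of k, the `min_ratio` threshold, and the toy sequences are all hypothetical rather than taken from the work described above.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts, sum(counts.values())


def enrichment(fg, bg, kmer, pseudo=1.0):
    """Ratio of smoothed k-mer frequencies, foreground over background."""
    vocab = 4 ** len(kmer)  # number of possible DNA k-mers
    fg_counts, fg_total = fg
    bg_counts, bg_total = bg
    f = (fg_counts[kmer] + pseudo) / (fg_total + pseudo * vocab)
    b = (bg_counts[kmer] + pseudo) / (bg_total + pseudo * vocab)
    return f / b


def candidate_enhancers(weak_exons, strong_exons, introns, k=3, min_ratio=2.0):
    """Return k-mers enriched both in exons vs introns and in weak vs strong exons."""
    exons = kmer_counts(weak_exons + strong_exons, k)
    intr = kmer_counts(introns, k)
    weak = kmer_counts(weak_exons, k)
    strong = kmer_counts(strong_exons, k)
    hits = []
    for kmer in exons[0]:
        r_exon = enrichment(exons, intr, kmer)    # property 1
        r_weak = enrichment(weak, strong, kmer)   # property 2
        if r_exon >= min_ratio and r_weak >= min_ratio:
            hits.append((kmer, r_exon, r_weak))
    # Rank by the product of the two ratios, strongest first.
    return sorted(hits, key=lambda h: h[1] * h[2], reverse=True)


# Toy input (illustrative only): a purine-rich "weak" exon, a pyrimidine-rich
# "strong" exon, and a T-rich intron.
hits = candidate_enhancers(
    weak_exons=["GAAGAAGAAGAA"],
    strong_exons=["CCTCTCCTCTCC"],
    introns=["TTTTCTTTTCTT"],
)
```

On real data the ratio cutoff would be replaced by a statistical test with multiple-testing correction, and exons would first be classified as weak or strong by scoring their splice signals, a step this sketch takes as given.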
Gene finding: We have recently developed a new algorithm, GenomeScan, for identifying the locations and exon-intron structures of genes in genomic sequences. This algorithm is related to our previous Genscan algorithm but achieves higher accuracy by taking into account BLASTX similarity to available proteins. Application of this method to the assembled draft and finished human genome sequence identifies approximately 25,000 human genes which are homologous to known proteins. We plan to adapt GenomeScan to other eukaryotic genomes and to use the genes identified with this approach for comparative genomics studies.
Alternative splicing: To study the process of alternative splicing, we are constructing databases of alternatively spliced genes. We are identifying genes which exhibit patterns of alternative splicing that are conserved between human and mouse, together with conserved regions in the introns flanking the alternatively spliced exons, which suggests that these splicing events are regulated. Experiments are underway to study the regulation and possible function of one particularly interesting alternatively spliced gene, a member of the MAP kinase family.