Next-generation gene prediction: alternative splicing and evolutionary conservation
Alternative processing of primary RNA transcripts has been found across all eukaryotes, in unicellular parasites and flagellates as well as in green algae, plants, fungi, and animals. Alternative splicing was therefore most likely already a characteristic of the last eukaryotic common ancestor. It increases proteome diversity and has been shown to be highly regulated in many species. Many different types of alternative splicing exist, such as differential inclusion of exons, intron retention, and the use of alternative 5′ and 3′ splice sites. A particularly interesting case is mutually exclusive splicing, in which only one of a cluster of neighboring exons is spliced into the mature transcript. The most extreme case reported so far is the Drosophila Down Syndrome Cell Adhesion Molecule (Dscam) gene, which contains four clusters of mutually exclusive spliced exons (MXEs) with 93 alternative exons in the genomic sequence (Figure 1). Although MXEs within a cluster are relatively similar, they cannot substitute for each other if one is damaged. In humans, mutations in MXEs have been shown to cause diseases such as Timothy syndrome, cardiomyopathy, and cancer. The regulation of MXE splicing and the evolutionary conservation of MXEs have been studied in great detail for a few example genes such as Dscam and the insect muscle myosin heavy chain (Mhc) genes. However, a genome-wide analysis of MXEs has been missing to date.
Although bioinformatics methods are routinely used for gene prediction, a sufficiently sensitive and comprehensive solution is not yet available. Typically, these methods use complex probabilistic models of coding and non-coding regions but cannot predict alternative splice variants. To improve gene prediction by including biological knowledge and evolutionary conservation, we have developed a new algorithm for predicting MXEs. The software takes the following criteria, derived from biological knowledge, into account: A) MXEs must be neighboring exons. B) MXEs must be translated in the same reading frame, and their splice sites must be compatible. C) MXEs must have about the same length, because they code for the same structural region of the resulting protein, and length differences are only possible in loop regions. D) The protein sequences encoded by the MXEs are expected to be similar, because they code for the same region of the protein and most probably arose by exon duplication during evolution. The software requires the exon-intron structure of the gene as input. The introns flanking each original exon are then searched for MXE candidates. Using this approach, we could reproduce known cases such as Dscam and Mhc. To test the new method in a genome-wide context and to assess its predictive power, we further applied it to one of the best-annotated model organisms, the fruit fly Drosophila melanogaster (Figure 2).
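The four criteria can be illustrated with a minimal Python sketch. It is not the published implementation: the Exon record, the function names, and both thresholds (max_length_diff, min_similarity) are assumptions made for illustration only, and the standard-library difflib ratio stands in for whatever protein similarity measure the actual software uses. Criterion A is covered by the caller, who passes in only exon candidates found in the introns flanking the original exon; splice-site compatibility (part of criterion B) is not modeled.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record; all field names are illustrative."""
    start: int     # genomic start, 0-based, inclusive
    end: int       # genomic end, exclusive
    phase: int     # codon position (0, 1 or 2) at which the exon starts
    protein: str   # peptide encoded by the exon

    @property
    def length(self) -> int:
        return self.end - self.start


def same_reading_frame(a: Exon, b: Exon) -> bool:
    # Criterion B (frame only): the same start phase and the same length
    # modulo 3 leave the downstream reading frame unchanged when the
    # exons are swapped.
    return a.phase == b.phase and a.length % 3 == b.length % 3


def similar_length(a: Exon, b: Exon, max_length_diff: float = 0.2) -> bool:
    # Criterion C: peptide lengths may differ only slightly.
    # The 20 % tolerance is an assumed value.
    longer = max(len(a.protein), len(b.protein))
    return abs(len(a.protein) - len(b.protein)) <= max_length_diff * longer


def similar_sequence(a: Exon, b: Exon, min_similarity: float = 0.3) -> bool:
    # Criterion D: the encoded peptides should resemble each other.
    # difflib's ratio is a crude placeholder for a real alignment score.
    matcher = SequenceMatcher(None, a.protein, b.protein, autojunk=False)
    return matcher.ratio() >= min_similarity


def find_mxe_candidates(original: Exon, intron_candidates: list[Exon]) -> list[Exon]:
    # Criterion A is assumed to hold already: intron_candidates must come
    # from the introns directly flanking the original exon.
    return [
        c for c in intron_candidates
        if same_reading_frame(original, c)
        and similar_length(original, c)
        and similar_sequence(original, c)
    ]
```

Keeping each criterion in its own small function makes it easy to tighten or relax one of them, for example by replacing the similarity placeholder with a proper alignment score, without touching the others.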
In our recent publication highlighted here, we showed that the method has a very high sensitivity. Altogether, we predicted about twice as many internal MXEs as were already annotated. This was very surprising given a recent exhaustive exploration of the developmental transcriptome of Drosophila at high coverage (1,200-fold and 5,900-fold for the genome and transcriptome, respectively). To obtain further evidence for the predicted MXE candidates, we A) mapped publicly available EST and RNA-Seq data, B) analyzed the conservation of the MXE candidates in other Drosophila species and arthropods, C) performed ab initio predictions of exonic regions in the respective introns, and D) searched for competing RNA secondary structures reported to be essential for mutually exclusive splicing. More than half of the predicted MXEs were supported by several data types. Our data strongly suggest that our method can and should be applied to every newly sequenced genome.
While most of the predicted MXEs in Drosophila melanogaster are conserved in other Drosophila species and arthropods, it is highly likely that D. melanogaster has also lost MXEs that are still present in other insects. Therefore, we reconstructed the mutually exclusive exomes of eleven further Drosophila species and compared them with that of D. melanogaster (Figure 3). Our analysis showed that MXEs have been gained and lost continuously and rapidly since the Drosophila species began to diverge about 50 million years ago. We were very surprised to identify dozens of MXE clusters unique to each individual Drosophila species.
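A comparison of this kind can be sketched with plain set operations, assuming that orthologous MXE clusters have already been assigned shared identifiers across species. The function names, the identifiers, and the toy data below are invented for illustration and are not results from the study. Presence and absence alone also cannot distinguish a gain from a loss; that requires placing the pattern on the species phylogeny, which this sketch does not attempt.

```python
def compare_to_reference(clusters_by_species: dict[str, set[str]],
                         reference: str) -> dict[str, dict[str, set[str]]]:
    """For every non-reference species, split its MXE clusters into those
    shared with the reference and those found in only one of the two."""
    ref = clusters_by_species[reference]
    comparison = {}
    for species, clusters in clusters_by_species.items():
        if species == reference:
            continue
        comparison[species] = {
            "shared": clusters & ref,
            "only_in_species": clusters - ref,
            "only_in_reference": ref - clusters,
        }
    return comparison


def species_specific_clusters(clusters_by_species: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per species, the clusters that occur in no other species."""
    unique = {}
    for species, clusters in clusters_by_species.items():
        others = set().union(
            *(c for s, c in clusters_by_species.items() if s != species)
        )
        unique[species] = clusters - others
    return unique


# Made-up toy input: cluster identifiers are hypothetical placeholders.
toy_clusters = {
    "D. melanogaster": {"cluster_1", "cluster_2", "cluster_3"},
    "species_B": {"cluster_1", "cluster_4"},
    "species_C": {"cluster_1", "cluster_2", "cluster_5"},
}
```

Calling species_specific_clusters(toy_clusters) lists the clusters unique to each species, while compare_to_reference(toy_clusters, "D. melanogaster") gives the pairwise view against the reference genome.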