SpliceNest Help
SpliceNest [1] is a web-based graphical tool for exploring gene structure, including alternative splicing, based on a mapping of the EST consensus sequences (contigs) from GeneNest [2] to the complete human genome. SpliceNest is integrated with GeneNest and the SYSTERS [3] protein sequence cluster set in one framework, permitting exploration of the whole sequence space covering protein, mRNA and EST sequences, as well as genomic DNA [4].

How to find the alignments

An alignment between an EST cluster and a genomic sequence can be found in several different ways:

The alignment display

The graphics show the alignment with the genomic sequence and the exon/intron structure of each contig. Possible alternative splice sites are highlighted with yellow bands. Moving the mouse over most items displays more details; clicking an item leads to further information, such as GeneNest assemblies, detailed alignments, or EMBL databank sequences. A panel below the graphics lets you zoom into the alignment and switch certain features on and off.

Sample alignment screenshot with explanation

Methods and presentation details

Data sources

The EST contigs are from the GeneNest assembly [2] of the Mar 2001 version of the NCBI UniGene [5] clustering of human genes.

The chromosomes are the Apr 1, 2001 freeze of the HUGO Golden Path assembly [6] of the complete human genome.

Mapping and alignment

The matching pairs of EST contigs and chromosome fragments were found by searching for all matches of length 100 with at most 3 errors (mismatches or indels) between all contigs of all clusters and the complete genome. This was done using the fast search program vmatch by Stefan Kurtz [7], whose algorithm exploits a modified suffix tree data structure.
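To make the match criterion concrete, here is a minimal sketch in Python. It is not vmatch and does not use a suffix tree: it swaps in a plain dynamic-programming approximate search (Sellers' algorithm), which is far too slow for a whole genome but easy to read. The function names are hypothetical, and the sketch assumes that "length 100" refers to a 100 bp window of the contig.

```python
def min_edit_distance_to_substring(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)          # row 0: a match may start anywhere
    for i, p in enumerate(pattern, 1):
        cur = [i] + [0] * len(text)       # column 0: i pattern bases unmatched
        for j, t in enumerate(text, 1):
            cur[j] = min(prev[j - 1] + (p != t),  # match or mismatch
                         prev[j] + 1,             # pattern base left unmatched
                         cur[j - 1] + 1)          # extra base in the text
        prev = cur
    return min(prev)                      # a match may end anywhere


def has_seed_match(contig: str, genome: str,
                   length: int = 100, max_errors: int = 3) -> bool:
    """True if some window of `length` bp in the contig matches the genome
    with at most `max_errors` mismatches or indels (assumed reading)."""
    for start in range(len(contig) - length + 1):
        window = contig[start:start + length]
        if min_edit_distance_to_substring(window, genome) <= max_errors:
            return True
    return False
```

Under the same assumptions, the refined search described further down would correspond to calling has_seed_match with length=30 and max_errors=2 against the matching chromosome only.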

Before searching, repeat elements were filtered out using the program RepeatMasker by A.F.A. Smit and P. Green.

For each matching cluster, a refined search against the matching chromosome was made to approximately determine a gene region containing all exons. In the refined search, all matches of length 30 with at most 2 errors were found.

Finally, for each cluster a spliced alignment of all contigs against the matching chromosome region(s) was determined using the program sim4 by Florea et al. [8]. The exon positions, percentage identity and splice signals from this alignment are shown in the graphics.

The main criterion for including a match is that it contains a 100 bp region with at most 3 errors. Most such matches are aligned, but the following exceptional cases are skipped:

Automatic analysis and classification

The alignments are automatically parsed and analysed to help classify matches and find candidates for alternative splicing.

Contig quality

In an alignment, a contig is classified as good (or OK) if the following criteria are satisfied:
  1. the average alignment identity is at least 95%.
  2. it spans at least one intron, where each of the flanking exons is at least 20 bp long, with at least 80% alignment identity.
In the graphics, good contigs are labeled with black numbers and bad contigs with grey numbers. By checking "Skip bad contigs", only good contigs are displayed.
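The two criteria can be sketched as a small Python check. This is an illustration only, not the SpliceNest code. The Exon type and the function name are hypothetical. The sketch also assumes that the average identity is weighted by exon length, and that the second criterion requires at least one pair of adjacent exons that both pass the length and identity thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exon:
    length: int      # aligned length in bp
    identity: float  # percent identity of this exon's alignment (0-100)


def is_good_contig(exons: List[Exon]) -> bool:
    """Apply the two 'good contig' criteria to the exons of one contig."""
    total = sum(e.length for e in exons)
    if total == 0:
        return False
    # Criterion 1: average identity of at least 95% (length-weighted, assumed).
    average = sum(e.length * e.identity for e in exons) / total
    if average < 95.0:
        return False

    # Criterion 2: at least one intron whose two flanking exons are each
    # at least 20 bp long with at least 80% identity.
    def solid(e: Exon) -> bool:
        return e.length >= 20 and e.identity >= 80.0

    return any(solid(a) and solid(b) for a, b in zip(exons, exons[1:]))
```

A contig with a single exon spans no intron, so it always fails the second criterion in this sketch.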

Match quality

A match between a cluster and a genomic region is classified as good if its alignment contains at least one good contig. When browsing matches in a chromosome, it is possible to select only the good matches.

When a cluster has more than one match, a table of all matches is shown when the cluster is searched. (When browsing matches, use the link containing the number of matches, just above the graphics, to display this table.) The matches are sorted by quality, starting with the best, according to the following criteria (by priority):

  1. good matches before bad matches
  2. highest number of exons
  3. highest effective alignment length (#bp aligned in contig multiplied by % identity)
The alignment length and % identity are given for the "best" contig (i.e., highest number of exons). Matches with possible alternative splicing are displayed as orange table rows, other good matches as pink, and bad matches as light blue.
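The priority ordering above maps naturally onto a tuple sort key. The following Python sketch shows one way to express it. The Match record and its field names are hypothetical, and dividing the identity by 100 is an assumption that does not affect the ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    name: str
    good: bool        # True if the match contains at least one good contig
    exons: int        # number of exons in the best contig
    aligned_bp: int   # bp aligned in the best contig
    identity: float   # percent identity of the best contig (0-100)


def effective_length(m: Match) -> float:
    """Aligned bp scaled by percent identity."""
    return m.aligned_bp * m.identity / 100.0


def sort_matches(matches: List[Match]) -> List[Match]:
    """Best match first: good before bad, then more exons, then longer
    effective alignment length."""
    return sorted(
        matches,
        key=lambda m: (not m.good, -m.exons, -effective_length(m)),
    )
```

Because Python compares tuples element by element, the later criteria only break ties left by the earlier ones, which matches the "by priority" wording.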

Alternative splice candidates

Possible sites of alternative splicing are marked by yellow bands in the graphics (when "Highlight inconsistent regions" is checked). The detailed analysis in each case is available by clicking on the yellow band, or on the "view" links near the display options. Further indication is also available on the graphics by checking "Indicate possible splice variants" (then missing exons are marked by green, putative introns by blue, and alternative donor/acceptor sites by red).

When browsing matches in a chromosome, it is possible to select only matches where alternative splice candidates are found (that is, there are yellow bands).

Only the good contigs are considered when detecting splice variants. If a match contains at least two good contigs that overlap by at least 100 bp in the genomic sequence, the splice sites according to the alignments are compared and inconsistencies are reported as possible splice variants. They are classified as missing exons, putative introns, or alternative donor/acceptor sites (which must differ by at least 9 bp to be marked), or as combinations of these.
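The comparison of two good contigs can be illustrated with a simplified Python sketch. It is not the SpliceNest implementation. It assumes that each contig is given as a sorted list of exons in half-open genomic coordinates, and all names are hypothetical. It also ignores strand, so it cannot tell donor sites from acceptor sites, and it reports differences in one direction only (features of contig b that are inconsistent with contig a).

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), half-open genomic coordinates

MIN_OVERLAP = 100    # bp of genomic overlap required between two contigs
MIN_SITE_SHIFT = 9   # bp by which splice sites must differ to be marked


def genomic_overlap(a: List[Exon], b: List[Exon]) -> int:
    """Overlap in bp between the genomic spans of two contigs."""
    return min(a[-1][1], b[-1][1]) - max(a[0][0], b[0][0])


def introns(exons: List[Exon]) -> List[Exon]:
    """Gaps between consecutive exons of one contig."""
    return [(e1[1], e2[0]) for e1, e2 in zip(exons, exons[1:])]


def contains(outer: Exon, inner: Exon) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def compare_contigs(a: List[Exon], b: List[Exon]) -> List[Tuple[str, Exon]]:
    """List inconsistencies of contig b relative to contig a."""
    if genomic_overlap(a, b) < MIN_OVERLAP:
        return []
    found = []
    # An exon of b lying inside an intron of a: exon missing from a.
    for ia in introns(a):
        for eb in b:
            if contains(ia, eb):
                found.append(("missing exon", eb))
    # An intron of b lying inside an exon of a: putative intron.
    for ea in a:
        for ib in introns(b):
            if contains(ea, ib):
                found.append(("putative intron", ib))
    # Overlapping introns whose boundaries are shifted by at least 9 bp,
    # skipping pairs already explained by a missing exon.
    for ia in introns(a):
        for ib in introns(b):
            overlapping = max(ia[0], ib[0]) < min(ia[1], ib[1])
            skipped = (any(contains(ia, e) for e in b)
                       or any(contains(ib, e) for e in a))
            shifted = (abs(ia[0] - ib[0]) >= MIN_SITE_SHIFT
                       or abs(ia[1] - ib[1]) >= MIN_SITE_SHIFT)
            if overlapping and not skipped and shifted:
                found.append(("alternative donor/acceptor site", ib))
    return found
```

Calling compare_contigs(a, b) and compare_contigs(b, a) would cover both directions for a pair of contigs.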

Splice variants

References

[1] Coward,E., Haas,S.A., and Vingron,M. (2002). SpliceNest: visualization of gene structure and alternative splicing based on EST clusters. Trends Genet. 18 (1), 53-55.
[2] Haas,S.A., Beissbarth,T., Rivals,E., Krause,A., and Vingron,M. (2000). GeneNest: automated generation and visualization of gene indices. Trends Genet. 16 (11), 521-523.
[3] Krause,A., Stoye,J., and Vingron,M. (2000). The SYSTERS Protein Sequence Cluster Set. Nucleic Acids Res. 28 (1), 270-272.
[4] Krause,A., Haas,S.A., Coward,E., and Vingron,M. (2002). SYSTERS, GeneNest, SpliceNest: Exploring sequence space from genome to protein. Nucleic Acids Res. 30 (1), 299-300.
[5] Schuler,G.D. (1997). Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 75, 694-698.
[6] International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.
[7] Kurtz,S., Choudhuri,J.V., Ohlebusch,E., Schleiermacher,C., Stoye,J., and Giegerich,R. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29, 4633-4642.
[8] Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M., and Miller,W. (1998). A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967-974.



Eivind Coward
Last modified: Tue Aug 20 16:35:32 MET DST 2002