Diogenes sequence analysis
Diogenes is a sequence analysis code developed for use in high-throughput operations. It targets short genomic sequences, searching for and reporting likely protein-encoding regions.
It does not attempt to replicate standard gene-finding software; rather, it complements such tools where they are weakest: short sequences on the order of 300 to 800 bp.
How it works
Diogenes operates on your sequences in the following way:
- Identifies ORF candidates in all six reading frames
- Assigns measures of coding potential to each candidate region
- Decides whether the coding potential is high enough to mark the candidate as a likely protein-encoding region; if so, the prediction is reported
Coding potential is measured using statistics gathered from organism-specific training sets.
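The first step above can be sketched in Python. This is a minimal illustration under stated assumptions, not Diogenes's actual algorithm: it assumes the stop codons of the standard genetic code and treats any stop-free run of codons as a candidate, without requiring a start codon, since a short sequence may contain only a fragment of a gene. The names orf_candidates, reverse_complement, and min_codons are hypothetical. Scoring of coding potential is not shown because it depends on training statistics that are not described here.

```python
# Hypothetical sketch: enumerate stop-free codon runs in all six reading frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_candidates(seq, min_codons=30):
    """Yield (strand, start, end) for each stop-free run of at least min_codons codons.

    Coordinates are zero-based and relative to the indicated strand;
    end is exclusive (an assumption made for this sketch).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            # Visit every complete codon in this frame.
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield strand, run_start, i
                    run_start = i + 3
            # Close the run that reaches the end of the sequence.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield strand, run_start, end
```

For example, list(orf_candidates("ATGAAATAGCCC", min_codons=4)) yields a single candidate, ("-", 0, 12), because only the first frame of the reverse complement contains four consecutive codons without a stop.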
About the predictions
Predictions made by Diogenes are regions that appear to encode a portion
of a protein. Each prediction includes the following parameters:
- strand
  "+" means the prediction is with respect to the sequence as you provided it; "-" means the reverse complement was used.
- from, to
  Locations of the prediction on the sequence with respect to the strand indicated; zero-based. When strand = "+", the first base of the sequence has location 0; when strand = "-", the first base of the reverse complement sequence has location 0.
  NOTE: This will change in the next release when we switch over to GFF-style coordinates for reporting.
- p-value
  A significance measure. In practice, p = 1.0e-3 indicates a very weak prediction that you would probably ignore; p = 1.0e-6 is decent; p = 1.0e-10 is strong; p = 0.0 is strongest. The p-value is obtained by comparing the computed score with the scores obtained from regions not encoding a protein; it is the probability that a non-protein-encoding candidate region would produce a score higher than the one reported.
- predicted
  The nucleotide sequence corresponding to the prediction.
- translation
  The peptide sequence corresponding to the prediction.
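Because "-" strand locations are counted on the reverse complement, it can help to map them back onto the sequence as submitted. The Python sketch below shows one way to do that and to bin the p-value using the rough guide values given above. It is illustrative only: the Prediction class, forward_interval, and strength are hypothetical names, the "to" location is assumed to be exclusive (the report format may instead treat it as inclusive), and the cutoffs between "weak", "decent", and "strong" are an assumed reading of the guide values, not thresholds used by Diogenes.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical container for the fields of one reported prediction."""
    strand: str     # "+" or "-"
    start: int      # the reported "from" location, zero-based
    end: int        # the reported "to" location; ASSUMED exclusive
    p_value: float


def forward_interval(pred, seq_len):
    """Map strand-relative coordinates onto the sequence as submitted.

    Returns a zero-based (start, end) pair with end exclusive.
    """
    if pred.strand == "+":
        return pred.start, pred.end
    if pred.strand == "-":
        # Position k on the reverse complement is position seq_len - 1 - k
        # on the submitted sequence, so the interval flips end for end.
        return seq_len - pred.end, seq_len - pred.start
    raise ValueError(f"unknown strand: {pred.strand!r}")


def strength(p_value):
    """Bin a p-value with assumed cutoffs taken from the guide values above."""
    if p_value <= 1e-10:
        return "strong"
    if p_value <= 1e-6:
        return "decent"
    return "weak"


pred = Prediction(strand="-", start=2, end=11, p_value=1e-7)
print(forward_interval(pred, seq_len=12), strength(pred.p_value))
```

For a 12 bp sequence, the "-" strand prediction at 2 to 11 covers bases 1 to 10 (end exclusive) of the submitted sequence, and the example prints (1, 10) decent. If "to" turns out to be inclusive, the "-" strand mapping becomes (seq_len - 1 - end, seq_len - 1 - start) instead.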
Citing
The Diogenes publication has not yet appeared. In the meantime
please cite it in your work as:
Crow JA, Retzel EF. Diogenes -- Reliable prediction of protein-encoding regions in short genomic sequences,
http://analysis.ccgb.umn.edu/diogenes (2005)