Schema for AUGUSTUS - AUGUSTUS ab initio gene predictions v3.1
  Database: hg38    Primary Table: augustusGene    Row Count: 34,161   Data last updated: 2018-08-11
Format description: A gene prediction with some additional info.
field         example                         SQL type                               info    description
bin           585                             smallint(5) unsigned                   range   Indexing field to speed chromosome range queries.
name          g1.t1                           varchar(255)                           values  Name of gene (usually transcript_id from GTF)
chrom         chr1                            varchar(255)                           values  Reference sequence chromosome or scaffold
strand        -                               char(1)                                values  + or - for strand
txStart       14395                           int(10) unsigned                       range   Transcription start position (or end position for minus strand item)
txEnd         29445                           int(10) unsigned                       range   Transcription end position (or start position for minus strand item)
cdsStart      14695                           int(10) unsigned                       range   Coding region start (or end position for minus strand item)
cdsEnd        24886                           int(10) unsigned                       range   Coding region end (or start position for minus strand item)
exonCount     8                               int(10) unsigned                       range   Number of exons
exonStarts    14395,14969,16853,17232,179...  longblob                                       Exon start positions (or end positions for minus strand item)
exonEnds      14829,15038,17055,17364,180...  longblob                                       Exon end positions (or start positions for minus strand item)
score         0                               int(11)                                range   score
name2         g1                              varchar(255)                           values  Alternate name (e.g. gene_id from GTF)
cdsStartStat  cmpl                            enum('none', 'unk', 'incmpl', 'cmpl')  values  Status of CDS start annotation (none, unknown, incomplete, or complete)
cdsEndStat    cmpl                            enum('none', 'unk', 'incmpl', 'cmpl')  values  Status of CDS end annotation (none, unknown, incomplete, or complete)
exonFrames    1,1,0,0,0,2,0,-1,               longblob                                       Exon frame {0,1,2}, or -1 if no frame for exon
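
The bin field can be recomputed from txStart and txEnd. The following is a minimal sketch, assuming the table uses the standard UCSC binning scheme (smallest bins of 128 kb, each level 8 times larger than the one below, covering up to 512 Mb). The function name bin_from_range is illustrative and not part of the schema.

```python
# Offset of the first bin at each level, from the finest (128 kb) to the coarsest (512 Mb).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, the size of the smallest bins
BIN_NEXT_SHIFT = 3    # each level up is 2**3 = 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the 0-based, half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} does not fit the 512 Mb binning scheme")


if __name__ == "__main__":
    # txStart/txEnd pairs taken from the sample rows of this table.
    for tx_start, tx_end in [(14395, 29445), (159630, 183265), (258486, 629295)]:
        print(tx_start, tx_end, bin_from_range(tx_start, tx_end))
```

Under this assumption the function reproduces the bin values of the sample rows on this page: 585 for g1.t1, 586 for g2.t1, 73 for g4.t1, and 591 for g6.t1.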

Sample Rows
 
bin  name  chrom  strand  txStart  txEnd  cdsStart  cdsEnd  exonCount  exonStarts  exonEnds  score  name2  cdsStartStat  cdsEndStat  exonFrames
585  g1.t1  chr1  -  14395  29445  14695  24886  8  14395,14969,16853,17232,17914,18267,24737,29320,  14829,15038,17055,17364,18061,18379,25008,29445,  0  g1  cmpl  cmpl  1,1,0,0,0,2,0,-1,
585  g1.t2  chr1  -  14395  29445  14695  24886  8  14395,14969,16853,17232,17914,18267,24737,29320,  14829,15038,17055,17364,18061,18379,24891,29445,  0  g1  cmpl  cmpl  1,1,0,0,0,2,0,-1,
585  g1.t3  chr1  -  14395  29475  14695  24886  8  14395,14969,16853,17232,17914,18267,24737,29320,  14829,15038,17055,17364,18061,18379,24891,29475,  0  g1  cmpl  cmpl  1,1,0,0,0,2,0,-1,
586  g2.t1  chr1  -  159630  183265  161069  182836  5  159630,164770,165883,178140,182819,  161113,164791,165942,178239,183265,  0  g2  cmpl  cmpl  1,1,2,2,0,
586  g3.t1  chr1  -  184911  199935  185216  195411  8  184911,185490,186316,187375,187754,188790,195262,199836,  185350,185559,186469,187577,187886,188902,195533,199935,  0  g3  cmpl  cmpl  1,1,1,0,0,2,0,-1,
586  g3.t2  chr1  -  184911  199995  185216  195411  8  184911,185490,186316,187375,187754,188790,195262,199836,  185350,185559,186469,187577,187886,188902,195533,199995,  0  g3  cmpl  cmpl  1,1,1,0,0,2,0,-1,
586  g3.t3  chr1  -  184911  199995  185216  195411  8  184911,185490,186316,187375,187754,188790,195262,199836,  185350,185559,186469,187577,187886,188902,195416,199995,  0  g3  cmpl  cmpl  1,1,1,0,0,2,0,-1,
73  g4.t1  chr1  -  258486  629295  258540  628978  11  258486,363864,377226,379768,450778,490448,523366,524479,592928,607954,628918,  258886,363977,377358,379870,451716,490505,523387,524538,593041,608056,629295,  0  g4  cmpl  cmpl  2,0,0,0,1,1,1,2,0,0,0,
73  g5.t1  chr1  -  629734  778675  634801  770938  8  629734,633431,685754,725451,765226,766328,770927,778558,  632611,634825,686692,725508,765247,766387,770946,778675,  0  g5  cmpl  cmpl  -1,0,1,1,1,2,0,-1,
591  g6.t1  chr1  +  788945  790290  789348  790027  2  788945,789681,  789566,790290,  0  g6  cmpl  cmpl  0,2,
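
To show how the columns fit together, here is a minimal sketch that parses one row into a record and measures how many exonic bases fall inside the coding region. The class GenePred and the helpers parse_int_list, parse_row, and coding_length are illustrative names, not part of any UCSC tool. The g6.t1 row is written with whitespace between its 16 column values, and exon ends are assumed to be exclusive (half-open ranges).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenePred:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]
    exon_frames: List[int]


def parse_int_list(blob: str) -> List[int]:
    """Parse a comma-separated list that carries a trailing comma, e.g. '0,2,'."""
    return [int(item) for item in blob.split(",") if item]


def parse_row(fields: List[str]) -> GenePred:
    """Build a record from the 16 column values of one row, in schema order."""
    if len(fields) != 16:
        raise ValueError(f"expected 16 columns, got {len(fields)}")
    record = GenePred(
        name=fields[1],
        chrom=fields[2],
        strand=fields[3],
        tx_start=int(fields[4]),
        tx_end=int(fields[5]),
        cds_start=int(fields[6]),
        cds_end=int(fields[7]),
        exon_starts=parse_int_list(fields[9]),
        exon_ends=parse_int_list(fields[10]),
        exon_frames=parse_int_list(fields[15]),
    )
    exon_count = int(fields[8])
    if not (exon_count == len(record.exon_starts)
            == len(record.exon_ends) == len(record.exon_frames)):
        raise ValueError("exonCount does not match the exon lists")
    return record


def coding_length(record: GenePred) -> int:
    """Count exonic bases that overlap the range [cds_start, cds_end)."""
    total = 0
    for start, end in zip(record.exon_starts, record.exon_ends):
        overlap = min(end, record.cds_end) - max(start, record.cds_start)
        total += max(0, overlap)
    return total


# The g6.t1 sample row, with whitespace separating the columns.
ROW = ("591 g6.t1 chr1 + 788945 790290 789348 790027 2 "
       "788945,789681, 789566,790290, 0 g6 cmpl cmpl 0,2,")
g6 = parse_row(ROW.split())
```

For this row, coding_length(g6) is 218 + 346 = 564 bases, a multiple of three, as expected when cdsStartStat and cdsEndStat are both cmpl. The first exon contributes 218 coding bases, which leaves a remainder of 2 modulo 3 and agrees with the exonFrames value 2 listed for the second exon.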

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
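
As a small illustration of this note, the sketch below converts a table range to a 1-based position string and back. It assumes the usual convention that end coordinates are stored unchanged, so that a table range is half-open and its length is simply end minus start. The function names are hypothetical.

```python
from typing import Tuple


def to_one_based(chrom: str, start0: int, end: int) -> str:
    """Convert a 0-based, half-open table range to a 1-based, inclusive position string."""
    return f"{chrom}:{start0 + 1}-{end}"


def from_one_based(position: str) -> Tuple[str, int, int]:
    """Convert a 1-based, inclusive position string back to (chrom, 0-based start, end)."""
    chrom, span = position.split(":")
    start1, end = span.split("-")
    return chrom, int(start1) - 1, int(end)


if __name__ == "__main__":
    # txStart and txEnd of g1.t1 from the sample rows.
    print(to_one_based("chr1", 14395, 29445))
    print(from_one_based("chr1:14396-29445"))
```

Only the start coordinate changes in either direction: txStart 14395 and txEnd 29445 correspond to chr1:14396-29445, a span of 29445 - 14395 = 15050 bases.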

AUGUSTUS (augustusGene) Track Description
 

Description

This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone.

For more information on the different gene tracks, see our Genes FAQ.

Methods

Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternatively spliced transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2).

The models used by AUGUSTUS were trained on a number of species-specific gene sets, which included 1000-2000 training gene structures. The --species option selects the species whose trained models are used. Different training species were chosen for different groups of assemblies when generating these predictions.

Assembly Group                   Training Species
Fish                             zebrafish
Birds                            chicken
Human and all other vertebrates  human
Nematodes                        caenorhabditis
Drosophila                       fly
A. mellifera                     honeybee1
A. gambiae                       culex
S. cerevisiae                    saccharomyces

This table describes which training species was used for a particular group of assemblies. When available, the most closely related training species was used.
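
The table and the sampling options listed under Methods can be combined into an argument list for one assembly group. This is a sketch only: the executable name augustus, the placement of the input file as the last argument, and the file name genome.fa are assumptions for illustration and do not describe the exact command used to build this track.

```python
from typing import List

# Training species per assembly group, copied from the table above.
TRAINING_SPECIES = {
    "Fish": "zebrafish",
    "Birds": "chicken",
    "Human and all other vertebrates": "human",
    "Nematodes": "caenorhabditis",
    "Drosophila": "fly",
    "A. mellifera": "honeybee1",
    "A. gambiae": "culex",
    "S. cerevisiae": "saccharomyces",
}

# Sampling options for alternative transcripts, as listed under Methods.
SAMPLING_OPTIONS = [
    "--alternatives-from-sampling=true",
    "--sample=100",
    "--minexonintronprob=0.2",
    "--minmeanexonintronprob=0.5",
    "--maxtracks=3",
    "--temperature=2",
]


def augustus_args(assembly_group: str, genome_fasta: str) -> List[str]:
    """Assemble an argument list for one assembly group (assumed invocation layout)."""
    species = TRAINING_SPECIES[assembly_group]  # KeyError for an unknown group
    return ["augustus", f"--species={species}", *SAMPLING_OPTIONS, genome_fasta]


if __name__ == "__main__":
    # genome.fa is a hypothetical input file name.
    print(" ".join(augustus_args("Human and all other vertebrates", "genome.fa")))
```

Keeping the lookup in a dictionary makes the choice of training species explicit and causes an unknown assembly group to fail early rather than fall back silently to a default.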

Credits

Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke.

References

Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656

Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192