Schema for GENCODE V44 - GENCODE V44
  Database: hg38    Primary Table: knownGene    Row Count: 276,905   Data last updated: 2023-08-23
Format description: GENCODE bigGenePred
On download server: MariaDB table dump directory
This track is available both in ASCII MariaDB table dump format and bigGenePred (bigBed) format.
bigBed File Download: /gbdb/hg38/gencode/gencodeV44.bb
field       example             SQL type            info    description
name        ENST00000456328.2   varchar(255)        values  Ensembl ID
chrom       chr1                varchar(255)        values  Reference sequence chromosome or scaffold
strand      +                   char(1)             values  + or - for strand
txStart     11868               int(10) unsigned    range
txEnd       14409               int(10) unsigned    range
cdsStart    11868               int(10) unsigned    range
cdsEnd      11868               int(10) unsigned    range
exonCount   3                   int(10) unsigned    range
exonStarts  11868,12612,13220,  longblob
exonEnds    12227,12721,14409,  longblob
proteinID                       varchar(40)         values
alignID     uc286dmu.1          varchar(255)        values
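The column layout above can be read into a typed record. The following is a minimal sketch, not UCSC code: the class name `KnownGeneRow` and the helpers `parse_coord_list` and `parse_row` are hypothetical, and it assumes the row arrives as 12 strings in schema order with `exonStarts` and `exonEnds` as comma-terminated lists, as in the example column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownGeneRow:
    """One knownGene row (hypothetical container, field names follow the schema)."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]
    protein_id: str
    align_id: str


def parse_coord_list(blob: str) -> List[int]:
    """Split a comma-terminated list such as '11868,12612,13220,' into ints."""
    return [int(item) for item in blob.split(",") if item]


def parse_row(fields: List[str]) -> KnownGeneRow:
    """Build a record from the 12 columns, checking exonCount against the lists."""
    (name, chrom, strand, tx_start, tx_end, cds_start, cds_end,
     exon_count, exon_starts, exon_ends, protein_id, align_id) = fields
    starts = parse_coord_list(exon_starts)
    ends = parse_coord_list(exon_ends)
    if not (len(starts) == len(ends) == int(exon_count)):
        raise ValueError(f"exonCount does not match exon lists for {name}")
    return KnownGeneRow(name, chrom, strand, int(tx_start), int(tx_end),
                        int(cds_start), int(cds_end), starts, ends,
                        protein_id, align_id)


# The example values from the schema table above.
row = parse_row(["ENST00000456328.2", "chr1", "+", "11868", "14409",
                 "11868", "11868", "3", "11868,12612,13220,",
                 "12227,12721,14409,", "", "uc286dmu.1"])
print(row.exon_starts, row.exon_ends)
```

The `exonCount` check is a cheap guard against misaligned columns, which is easy to get wrong when a row is split by hand.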

The data is stored in the binary BigBed format.

Connected Tables and Joining Fields
      hg38.bioCycPathway.kgID (via knownGene.name)
      hg38.ccdsKgMap.geneId (via knownGene.name)
      hg38.ceBlastTab.query (via knownGene.name)
      hg38.dmBlastTab.query (via knownGene.name)
      hg38.drBlastTab.query (via knownGene.name)
      hg38.foldUtr3.name (via knownGene.name)
      hg38.foldUtr5.name (via knownGene.name)
      hg38.gnfAtlas2Distance.query (via knownGene.name)
      hg38.gnfAtlas2Distance.target (via knownGene.name)
      hg38.gnfU95Distance.query (via knownGene.name)
      hg38.gnfU95Distance.target (via knownGene.name)
      hg38.humanHprdP2P.query (via knownGene.name)
      hg38.humanHprdP2P.target (via knownGene.name)
      hg38.humanVidalP2P.query (via knownGene.name)
      hg38.humanVidalP2P.target (via knownGene.name)
      hg38.humanWankerP2P.query (via knownGene.name)
      hg38.humanWankerP2P.target (via knownGene.name)
      hg38.keggPathway.kgID (via knownGene.name)
      hg38.kgAlias.kgID (via knownGene.name)
      hg38.kgColor.kgID (via knownGene.name)
      hg38.kgProtAlias.kgID (via knownGene.name)
      hg38.kgSpAlias.kgID (via knownGene.name)
      hg38.kgTargetAli.qName (via knownGene.name)
      hg38.kgXref.kgID (via knownGene.name)
      hg38.knownAttrs.kgID (via knownGene.name)
      hg38.knownBlastTab.query (via knownGene.name)
      hg38.knownBlastTab.target (via knownGene.name)
      hg38.knownCanonical.transcript (via knownGene.name)
      hg38.knownCds.name (via knownGene.name)
      hg38.knownGeneMrna.name (via knownGene.name)
      hg38.knownGenePep.name (via knownGene.name)
      hg38.knownIsoforms.transcript (via knownGene.name)
      hg38.knownToEnsembl.name (via knownGene.name)
      hg38.knownToGnfAtlas2.name (via knownGene.name)
      hg38.knownToHprd.name (via knownGene.name)
      hg38.knownToKeggEntrez.name (via knownGene.name)
      hg38.knownToLocusLink.name (via knownGene.name)
      hg38.knownToLynx.name (via knownGene.name)
      hg38.knownToMrna.name (via knownGene.name)
      hg38.knownToMrnaSingle.name (via knownGene.name)
      hg38.knownToMupit.name (via knownGene.name)
      hg38.knownToNextProt.name (via knownGene.name)
      hg38.knownToPfam.name (via knownGene.name)
      hg38.knownToRefSeq.name (via knownGene.name)
      hg38.knownToSuper.gene (via knownGene.name)
      hg38.knownToTag.name (via knownGene.name)
      hg38.knownToU133.name (via knownGene.name)
      hg38.knownToU95.name (via knownGene.name)
      hg38.knownToVisiGene.name (via knownGene.name)
      hg38.knownToWikipedia.name (via knownGene.name)
      hg38.mmBlastTab.query (via knownGene.name)
      hg38.rnBlastTab.query (via knownGene.name)
      hg38.scBlastTab.query (via knownGene.name)
      hg38.ucscRetroInfo9.kgName (via knownGene.name)
      hg38.ucscScop.ucscId (via knownGene.name)
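
Each entry above names a column that holds the same identifier as knownGene.name, so the tables can be joined on that key. As an illustration only, the sketch below builds two toy tables in an in-memory SQLite database and joins them via kgXref.kgID. The `geneSymbol` column and the value `PLACEHOLDER_SYMBOL` are assumptions made for the example; the real tables live on the MariaDB server and have more columns.

```python
import sqlite3

# Toy stand-ins for hg38.knownGene and hg38.kgXref (reduced column sets).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
""")
conn.executemany(
    "INSERT INTO knownGene VALUES (?, ?, ?, ?)",
    [("ENST00000456328.2", "chr1", 11868, 14409),
     ("ENST00000619216.1", "chr1", 17368, 17436)],
)
# Placeholder symbol: not a real annotation, only there to show the join.
conn.execute("INSERT INTO kgXref VALUES (?, ?)",
             ("ENST00000456328.2", "PLACEHOLDER_SYMBOL"))

# Join on the key listed above: kgXref.kgID (via knownGene.name).
JOIN_SQL = """
SELECT g.name, g.chrom, g.txStart, g.txEnd, x.geneSymbol
FROM knownGene AS g
JOIN kgXref AS x ON x.kgID = g.name
ORDER BY g.txStart
"""
rows = conn.execute(JOIN_SQL).fetchall()
print(rows)
```

An inner join drops transcripts with no match in the second table; a `LEFT JOIN` would keep them with a NULL symbol instead.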

Sample Rows
 
name               chrom  strand  txStart  txEnd   cdsStart  cdsEnd  exonCount  exonStarts                  exonEnds                    proteinID   alignID
ENST00000456328.2  chr1   +       11868    14409   11868     11868   3          11868,12612,13220,          12227,12721,14409,                      uc286dmu.1
ENST00000619216.1  chr1   -       17368    17436   17368     17368   1          17368,                      17436,                                  uc031tla.1
ENST00000473358.1  chr1   +       29553    31097   29553     29553   3          29553,30563,30975,          30039,30667,31097,                      uc057aty.1
ENST00000469289.1  chr1   +       30266    31109   30266     30266   2          30266,30975,                30667,31109,                            uc057atz.1
ENST00000607096.1  chr1   +       30365    30503   30365     30365   1          30365,                      30503,                                  uc031tlb.1
ENST00000417324.1  chr1   -       34553    36081   34553     34553   3          34553,35276,35720,          35174,35481,36081,                      uc001aak.4
ENST00000461467.1  chr1   -       35244    36073   35244     35244   2          35244,35720,                35481,36073,                            uc057aua.1
ENST00000642116.1  chr1   +       57597    64116   57597     57597   3          57597,58699,62915,          57653,58856,64116,                      uc286dmy.1
ENST00000641515.2  chr1   +       65418    71585   65564     70008   3          65418,65519,69036,          65433,65573,71585,          A0A2U3U0J3  uc001aal.2
ENST00000466430.5  chr1   -       89294    120932  89294     89294   4          89294,92090,112699,120774,  91629,92240,112804,120932,              uc057aub.1

Note: all start coordinates in our database are 0-based, not 1-based, while end coordinates are 1-based, so each interval is half-open. See explanation here.
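
To make the coordinate convention concrete, here is a small sketch using the exon lists of the first sample row. It assumes the usual UCSC half-open convention (0-based start, end not included); the function names are hypothetical.

```python
from typing import List, Tuple


def exon_intervals_1based(starts: List[int], ends: List[int]) -> List[Tuple[int, int]]:
    """Convert 0-based half-open (start, end) pairs to 1-based inclusive pairs."""
    # Only the start shifts by one; the half-open end equals the inclusive end.
    return [(start + 1, end) for start, end in zip(starts, ends)]


def exon_lengths(starts: List[int], ends: List[int]) -> List[int]:
    """With half-open intervals the length is simply end - start."""
    return [end - start for start, end in zip(starts, ends)]


# exonStarts / exonEnds of ENST00000456328.2 from the sample rows.
starts = [11868, 12612, 13220]
ends = [12227, 12721, 14409]

print(exon_intervals_1based(starts, ends))  # [(11869, 12227), (12613, 12721), (13221, 14409)]
print(exon_lengths(starts, ends))           # [359, 109, 1189]
print(sum(exon_lengths(starts, ends)))      # 1657
```

When pasting a position into the browser's search box or comparing against GTF/GFF3 files, which are 1-based, the converted intervals are the ones to use.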

GENCODE V44 (knownGene) Track Description
 

Description

The GENCODE Genes track (version 44, July 2023) shows high-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project. By default, only the basic gene set is displayed, which is a subset of the comprehensive gene set. The basic set represents transcripts that GENCODE believes will be useful to the majority of users.

The track includes protein-coding genes, non-coding RNA genes, and pseudogenes, though pseudogenes are not displayed by default. It contains annotations on the reference chromosomes as well as assembly patches and alternative loci (haplotypes).

The following table provides statistics for the v44 release, derived from the GTF file that contains annotations only on the main chromosomes. More information on how these statistics were generated can be found on the GENCODE site.

GENCODE v44 Release Stats
Genes                                         Observed  Transcripts                                          Observed
Protein-coding genes                          19,396    Protein-coding transcripts                           89,067
Long non-coding RNA genes                     19,922    - full length protein-coding                         63,968
Small non-coding RNA genes                    7,566     - partial length protein-coding                      25,099
Pseudogenes                                   14,735    Nonsense mediated decay transcripts                  21,384
Immunoglobulin/T-cell receptor gene segments  647       Long non-coding RNA loci transcripts                 58,246
Total No of distinct translations             65,342    Genes that have more than one distinct translations  13,594

For more information on the different gene tracks, see our Genes FAQ.

Display Conventions and Configuration

By default, this track displays only the basic GENCODE set, splice variants, and non-coding genes. It includes options to display the entire GENCODE set and pseudogenes. To customize these options, the respective boxes can be checked or unchecked at the top of this description page.

This track also includes a variety of labels which identify the transcripts when visibility is set to "full" or "pack". Gene symbols (e.g. NIPA1) are displayed by default, but additional options include GENCODE Transcript ID (ENST00000561183.5), UCSC Known Gene ID (uc001yve.4), and UniProt Display ID (Q7RTP0). Additional information about gene and transcript names can be found in our FAQ.

This track, in general, follows the display conventions for gene prediction tracks. The exons for putative non-coding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker.

Coloring for the gene annotations is based on the annotation type:

  • coding: protein coding transcripts, including polymorphic pseudogenes
  • non-coding: non-protein coding transcripts
  • pseudogene: pseudogene transcript annotations
  • problem: problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain)

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. There is also an option to display the data as a density graph, which can be helpful for visualizing the distribution of items over a region.

Squishy-pack Display

Within a gene using the pack display mode, transcripts below a specified rank will be condensed into a view similar to squish mode. The transcript ranking approach is preliminary and will change in future releases. The transcript rankings are defined by the following criteria for protein-coding and non-coding genes:

Protein-coding genes
  1. MANE or Ensembl canonical
    • 1st: MANE Select / Ensembl canonical
    • 2nd: MANE Plus Clinical
  2. Coding biotypes
    • 1st: protein_coding and protein_coding_LoF
    • 2nd: NMDs and NSDs
    • 3rd: retained intron and protein_coding_CDS_not_defined
  3. Completeness
    • 1st: full length
    • 2nd: CDS start/end not found
  4. CARS score (only for coding transcripts)
  5. Transcript genomic span and length (only for non-coding transcripts)
Non-coding genes
  1. Transcript biotype
    • 1st: transcript biotype identical to gene biotype
  2. Ensembl canonical
  3. GENCODE basic
  4. Transcript genomic span
  5. Transcript length
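
One way to read the criteria above is as a lexicographic sort, where each later criterion only breaks ties left by the earlier ones. The sketch below implements criteria 1 to 3 for protein-coding genes under that assumption. It is an illustration, not the browser's implementation: the `Transcript` fields, the tag strings, and the biotype spellings for NMD and NSD are assumed, and criteria 4 and 5 are left out because the text does not say which direction they sort in.

```python
from dataclasses import dataclass

# Criterion 1: MANE or Ensembl canonical (lower number = higher rank).
MANE_RANK = {"MANE Select": 0, "Ensembl canonical": 0, "MANE Plus Clinical": 1}

# Criterion 2: coding biotypes (spellings are assumptions for this sketch).
BIOTYPE_RANK = {
    "protein_coding": 0,
    "protein_coding_LoF": 0,
    "nonsense_mediated_decay": 1,
    "non_stop_decay": 1,
    "retained_intron": 2,
    "protein_coding_CDS_not_defined": 2,
}


@dataclass
class Transcript:
    name: str          # hypothetical identifier
    tag: str           # "" when the transcript has none of the MANE/canonical tags
    biotype: str
    full_length: bool  # False when the CDS start or end was not found


def rank_key(t: Transcript) -> tuple:
    """Sort key for criteria 1-3; unknown tags and biotypes sort last."""
    return (
        MANE_RANK.get(t.tag, 2),
        BIOTYPE_RANK.get(t.biotype, 3),
        0 if t.full_length else 1,   # Criterion 3: completeness
    )


transcripts = [
    Transcript("tx_partial", "", "protein_coding", False),
    Transcript("tx_nmd", "", "nonsense_mediated_decay", True),
    Transcript("tx_mane", "MANE Select", "protein_coding", True),
    Transcript("tx_clinical", "MANE Plus Clinical", "protein_coding", True),
]
ranked = sorted(transcripts, key=rank_key)
print([t.name for t in ranked])
```

Because the biotype criterion comes before completeness, the partial protein_coding transcript ranks above the full-length NMD transcript in this reading.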

Methods

The GENCODE v44 track was built from the GENCODE downloads file gencode.v44.chr_patch_hapl_scaff.annotation.gff3.gz. Data from other sources were correlated with the GENCODE data to build association tables.

Related Data

The GENCODE Genes transcripts are annotated in numerous tables, each of which is also available as a downloadable file.

One can see a full list of the associated tables in the Table Browser by selecting GENCODE Genes from the track menu; this list is then available on the table menu.

Data access

GENCODE Genes and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. The genePred format files for hg38 are available from our downloads directory or in our GTF download directory. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog.
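
As a rough sketch of REST API access, the code below builds a request URL for the knownGene track over the region of the first sample row. The endpoint path `/getData/track`, the parameter names, and the semicolon separator follow the UCSC REST API help page as best recalled here and should be checked against it; `track_url` and `fetch_track` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def track_url(genome: str, track: str, chrom: str, start: int, end: int) -> str:
    """Build a getData/track URL (parameter names are assumed, see lead-in)."""
    params = {"genome": genome, "track": track, "chrom": chrom,
              "start": start, "end": end}
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params.items())
    return f"{API_BASE}/getData/track?{query}"


def fetch_track(url: str) -> dict:
    """Fetch the URL and decode the JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


url = track_url("hg38", "knownGene", "chr1", 11868, 14409)
print(url)
# data = fetch_track(url)  # uncomment to query the live server
```

The start and end values are passed in the same 0-based, half-open form used in the table itself.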

Credits

The GENCODE Genes track was produced at UCSC from the GENCODE comprehensive gene set using a computational pipeline developed by Jim Kent and Brian Raney. This version of the track was generated by Jonathan Casper.

References

Frankish A, Carbonell-Sala S, Diekhans M, Jungreis I, Loveland JE, Mudge JM, Sisu C, Wright JC, Arnan C, Barnes I et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D942-D949. PMID: 36420896; PMC: PMC9825462

A full list of GENCODE publications is available at The GENCODE Project web site.

Data Release Policy

GENCODE data are available for use without restrictions.