Web Sequences Track Settings
 
DNA Sequences in Web Pages Indexed by Bing.com / Microsoft Research

Data last updated: 2013-11-22

Description

This track is powered by Bing and Microsoft Research. UCSC collaborators at Microsoft Research (Bob Davidson, David Heckerman) implemented a DNA sequence detector and ran it over thirty days of web crawler updates, which cover roughly 40 billion web pages. The detected sequences were then mapped to the genome with BLAT.

Display Convention and Configuration

The track shows where sequences found on web pages map to the genome. Each feature is labelled with the URL of the web page. If the web page includes invisible metadata, the first author and the year of publication are shown instead of the URL. All matches from one web page are grouped ("chained") together. The web page title is shown when you move the mouse cursor over a feature. The thicker parts of a feature (exons) represent matching sequences; thin lines connect them to other matches from the same web page that lie within 30 kbp.
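The grouping of matches described above can be sketched in Python. This is only an illustration under stated assumptions: the function name chain_matches, the tuple layout, and the rule of measuring the gap from the furthest end seen so far in a chain are hypothetical, and the track's actual chaining code may differ. Only the 30 kbp distance comes from the description.

```python
from collections import defaultdict

MAX_GAP = 30_000  # 30 kbp, the distance stated in the track description


def chain_matches(matches, max_gap=MAX_GAP):
    """Group matches from the same web page into chains.

    matches: iterable of (url, chrom, start, end) tuples (hypothetical layout).
    Returns a list of (url, chrom, [(start, end), ...]) chains.
    """
    # Collect the matches of each web page per chromosome.
    by_page = defaultdict(list)
    for url, chrom, start, end in matches:
        by_page[(url, chrom)].append((start, end))

    chains = []
    for (url, chrom), blocks in sorted(by_page.items()):
        blocks.sort()
        current = [blocks[0]]
        chain_end = blocks[0][1]
        for start, end in blocks[1:]:
            if start - chain_end <= max_gap:
                # Close enough to the current chain: extend it.
                current.append((start, end))
                chain_end = max(chain_end, end)
            else:
                # Too far away: close the chain and start a new one.
                chains.append((url, chrom, current))
                current = [(start, end)]
                chain_end = end
        chains.append((url, chrom, current))
    return chains


example = [
    ("http://example.org/a", "chr1", 100, 150),
    ("http://example.org/a", "chr1", 10_000, 10_050),
    ("http://example.org/a", "chr1", 100_000, 100_040),
    ("http://example.org/b", "chr1", 120, 160),
]
chains = chain_matches(example)
```

In this example the first two matches from the first page fall within 30 kbp and form one chain, the third match from that page starts a separate chain, and the match from the second page is never merged with them.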

Methods

All file types (PDFs and various Microsoft Office formats) were converted to text. The resulting text was scanned for groups of words that look like DNA or RNA sequences. These sequences were then mapped to the human genome with BLAT, using the same software as the Publication track.
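To give a rough idea of what "groups of words that look like DNA/RNA sequences" can mean, here is a minimal Python sketch. It is not the detector written at Microsoft Research, whose rules are not described here. The function name, the accepted alphabet (A, C, G, T, U, N) and the 20-letter minimum length are all assumptions made for illustration.

```python
import re

# Hypothetical threshold: shortest run of nucleotide letters worth keeping.
MIN_LENGTH = 20

# A "DNA-like word" contains only nucleotide letters
# (U covers RNA, N covers unknown bases). The alphabet is an assumption.
DNA_WORD = re.compile(r"^[ACGTUN]+$", re.IGNORECASE)


def find_dna_like_sequences(text, min_length=MIN_LENGTH):
    """Return sequences built from runs of consecutive DNA-like words."""
    sequences = []
    current = []

    def flush():
        # Keep the joined run only if it is long enough to be a plausible sequence.
        seq = "".join(current)
        if len(seq) >= min_length:
            sequences.append(seq)
        current.clear()

    for word in text.split():
        # Drop digits and punctuation around a word, e.g. 5'- and -3' markers.
        cleaned = re.sub(r"[^A-Za-z]", "", word)
        if not cleaned:
            # Pure numbers (such as position counters) do not break a run.
            continue
        if DNA_WORD.match(cleaned):
            current.append(cleaned.upper())
        else:
            flush()
    flush()
    return sequences


found = find_dna_like_sequences(
    "The primer 5'-ACGTACGTAC GATTACAGAT-3' was used at a concentration"
)
```

Short English words such as "a" or "at" also consist of nucleotide letters, which is why the sketch requires a minimum total length before a run is reported. A real detector would need stricter filtering than this.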

Credits

DNA sequence detection by Bob Davidson at Microsoft Research. HTML parsing and sequence mapping by Maximilian Haeussler at UCSC.

References

Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31. PMID: 18271954; PMC: PMC2374703

Haeussler M, Gerner M, Bergman CM. Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics. 2011 Apr 1;27(7):980-6. PMID: 21325301; PMC: PMC3065681

Van Noorden R. Trouble at the text mine. Nature. 2012 Mar 7;483(7388):134-5.