BINC Web Notes

Wednesday, December 28, 2011

Protein Sequence Databases

Protein databases are more specialized than primary sequence databases. They contain information derived from the primary sequence databases. Some contain protein translations of the nucleic acid sequences. Some contain sets of patterns and motifs derived from sequence homologs.

UniProtKB: The UniProt Knowledgebase is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. About 85% of the protein sequences provided by UniProtKB are derived from the translation of the coding sequences (CDS) that have been submitted to the public nucleic acid databases, the EMBL-Bank/GenBank/DDBJ databases (INSDC). All these sequences, as well as the related data submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.

SWISS-PROT & TrEMBL: SWISS-PROT is a curated protein sequence database. TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

PIR: The Protein Information Resource is a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database.

PIR-PSD: The PIR-International Protein Sequence Database (PIR-PSD) was the world's first database of classified and functionally annotated protein sequences; it grew out of the Atlas of Protein Sequence and Structure. PIR-PSD has been the most comprehensive and expertly curated protein sequence database in the public domain for over 20 years. In 2002, PIR joined EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to form the UniProt consortium. PIR-PSD sequences and annotations have been integrated into the UniProt Knowledgebase. Bi-directional cross-references between UniProt (UniProt Knowledgebase and/or UniParc) and PIR-PSD are established to allow easy tracking of former PIR-PSD entries. PIR-PSD unique sequences, reference citations, and experimentally verified data can now be found in the relevant UniProt records.
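Most UniProtKB/TrEMBL entries, as noted above, are produced by translating coding sequences (CDS) from the nucleotide databases. The following is only a minimal sketch of that idea, assuming the standard genetic code and a CDS that starts in frame; the function name translate_cds is hypothetical, and the real pipelines also handle alternative genetic codes, partial CDS and other special cases.

```python
# Standard genetic code in the compact "TCAG" ordering:
# the amino acid for codon (i, j, k) sits at index 16*i + 4*j + k.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame CDS to one-letter protein code.

    Hypothetical helper for illustration: translation stops at the
    first stop codon, and unrecognised codons become 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":  # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTAA"))  # prints MAK
```

Here ATG, GCC and AAG translate to M, A and K, and TAA is a stop codon, so the made-up input yields the three-residue peptide MAK.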

DDBJ: Nucleotide Sequence Database

DDBJ (DNA Data Bank of Japan) is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. DDBJ exchanges the collected data with EMBL-Bank at EBI (European Bioinformatics Institute) and GenBank at NCBI (National Center for Biotechnology Information) on a daily basis, so the three data banks share virtually the same data at any given time. The virtually unified database is called INSD (International Nucleotide Sequence Database). DDBJ collects sequence data mainly from Japanese researchers, but it also accepts data from, and issues accession numbers to, researchers in any other country.

The nucleotide database can be searched with:

Getentry: Data retrieval by accession numbers, etc.
ARSA: All-round Retrieval of Sequence and Annotation.
TXSearch: Retrieval of unified taxonomy database.
BLAST: Homology Search.
DDBJ Vector Screening System

EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis. The current database release (Release 110, December 2011), with the corresponding release notes and user manual, is available from the EBI servers. A sample database entry is shown on the right.

The EMBL nucleotide sequence database forms part of the European Nucleotide Archive, an EBI project led by Guy Cochrane as part of the Protein and Nucleotide Database Group (PANDA) under Ewan Birney.

GenBank: Nucleotide Sequence Database

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of April 2011, there are approximately 135,440,924 sequence records in the traditional GenBank divisions and 62,715,288 sequence records in the WGS (Whole Genome Shotgun) division.

GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper. There are several options for submitting data to GenBank:

BankIt, a WWW-based submission tool for convenient and quick submission of sequence data
Sequin, NCBI's stand-alone submission software for Mac, PC, and UNIX platforms, is available by FTP. When using Sequin, the output files for direct submission should be sent to GenBank by e-mail.
tbl2asn, a command-line program, automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.
Barcode Submission Tool, a WWW-based tool for the submission of GenBank sequences and trace data for Barcode of Life projects. Currently, only mitochondrial cytochrome c oxidase subunit I (COI) genes are being accepted with this tool. For submissions of loci other than COI, please use either BankIt or Sequin.

Revisions or updates to GenBank entries can be made by the submitters at any time. Updates should be sent via e-mail or the UpdateMacroSend form. Send updates and revisions to gb-admin@ncbi.nlm.nih.gov. Be sure to give the accession number of the sequence to be updated in the subject line.

There are several ways to search and retrieve data from GenBank.

a) Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided into three divisions:

CoreNucleotide (the main collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome Survey Sequences).

b) Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool). BLAST searches CoreNucleotide, dbEST, and dbGSS independently; see BLAST info for more information about the numerous BLAST databases.

c) Search, link, and download sequences programmatically using NCBI E-utilities.
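As a rough illustration of option (c), the sketch below builds an E-utilities EFetch URL for one nucleotide record. It assumes the current public endpoint at eutils.ncbi.nlm.nih.gov and the "nuccore" database name used for Entrez Nucleotide. The helper name efetch_url is hypothetical, and U49845 is used only as an example accession. No network request is made here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed current public endpoint).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for a nucleotide record (hypothetical helper).

    rettype "gb" asks for the GenBank flat file, "fasta" for FASTA.
    """
    params = {
        "db": "nuccore",      # Entrez Nucleotide
        "id": accession,      # accession number or GI
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("U49845")
print(url)
```

The resulting URL can be opened in a browser or passed to urllib.request.urlopen to download the record. NCBI asks heavy users to limit their request rate, so check its current usage guidelines before running scripts in bulk.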

GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the section is marked by a line containing only "//".
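The three markers described above are enough to split a record into its two sections. The sketch below is a minimal illustration for a single record, not a full parser. The function name split_genbank_record and the sample record are made up for this example, and a maintained library is a better choice for real work.

```python
def split_genbank_record(text: str):
    """Split one GenBank flat-file record into (annotation_lines, sequence).

    Hypothetical helper: relies only on the LOCUS, ORIGIN and // markers.
    """
    annotation, sequence = [], []
    section = None
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"       # annotation section starts here
        elif line.startswith("ORIGIN"):
            section = "sequence"         # sequence section starts here
            continue                     # the ORIGIN line itself holds no bases
        elif line.strip() == "//":
            break                        # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # Sequence lines carry a position number and spaced blocks;
            # keep only the letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
    return annotation, "".join(sequence)


# Made-up record for illustration only (not a real GenBank entry).
SAMPLE = """LOCUS       EXAMPLE1    12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ORIGIN
        1 gatcctccat at
//
"""

annotation_lines, bases = split_genbank_record(SAMPLE)
print(len(annotation_lines), bases)  # prints 2 gatcctccatat
```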

The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.

Sequence database

In the field of bioinformatics, a sequence database is a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae), or it can include sequences from all organisms whose DNA has been sequenced.

Some widely used sequence databases are:

Nucleic acid sequence databases: GenBank, EMBL, DDBJ
Protein sequence databases: Uniprot-KB: SWISS-PROT, TrEMBL, PIR-PSD
Repositories for high-throughput genomic sequences: EST, STS, GSS, etc.
Genome databases at NCBI, EBI, TIGR, SANGER
Viral Genomes.
Archaeal and Bacterial Genomes.
Eukaryotic genomes.

PLoS: Public Library of Science

PLoS (Public Library of Science) is a nonprofit publisher and advocacy organization with a mission of leading a transformation in scientific and medical research communication. Everything PLoS publishes is open access, that is, freely available online, which benefits everyone from researchers, educators, and patient advocates to funders, policymakers, and the public.

To provide open access (OA), PLoS journals use a business model in which their expenses are recovered in part by charging a publication fee to the authors or research sponsors for each article they publish.

PLoS entered the publishing arena in October 2003 with the launch of PLoS Biology, followed by PLoS Medicine. PLoS later launched four discipline-based community journals: PLoS Genetics, PLoS Pathogens, PLoS Computational Biology, and PLoS Neglected Tropical Diseases. PLoS pushed the OA envelope yet again with PLoS ONE and PLoS Currents, which make research available to the public in as little as 24 hours.

PLoS launched a new blog network for discussing science and medicine in public, covering topics in research, culture, and publishing. PLoS also launched PLoS Hubs: Biodiversity, which aggregates content from a wide range of publishers, with the articles selected by expert curators.

BioMed Central: Literature database

BioMed Central (BMC) is a UK-based, for-profit STM (Science, Technology and Medicine) publisher specialising in open access journal publication. BMC, and its sister companies Chemistry Central and PhysMath Central, publish over 200 scientific open access, online, peer-reviewed journals. All original research articles published by BioMed Central are made freely and permanently accessible online immediately upon publication. BioMed Central levies an article-processing charge to cover the cost of the publication process. Authors publishing with BioMed Central retain the copyright to their work, licensing it under the Creative Commons Attribution License which allows articles to be re-used and re-distributed without restriction, as long as the original work is correctly cited.

BioMed Central was founded in 2000 as part of the Current Science Group (now Science Navigation Group, SNG), a nursery of scientific publishing companies. SNG chairman Vitek Tracz developed the concept for the company after NIH director Harold Varmus's PubMed Central concept for open-access publishing was scaled back.

BioMed Central owns and produces in-house six flagship journals: Journal of Biology, Genome Biology, Genome Medicine, Arthritis Research and Therapy, Breast Cancer Research, and Critical Care. It also produces the BMC series of 64 journals covering the fields of biology and medicine, and including the leading titles BMC Biology and BMC Medicine.

Chemistry Central Journal and the PhysMath series of journals are also produced by the company.