Wednesday, December 28, 2011

Protein Sequence Databases

Protein databases are more specialized than primary sequence databases. They contain information derived from the primary sequence databases. Some contain protein translations of the nucleic acid sequences. Some contain sets of patterns and motifs derived from sequence homologs. 

UniProtKB: The UniProt Knowledgebase is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. About 85% of the protein sequences provided by UniProtKB are derived from the translation of coding sequences (CDS) that have been submitted to the public nucleic acid databases, the EMBL-Bank/GenBank/DDBJ databases (INSDC). All these sequences, as well as the related data submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.
 
SWISS-PROT & TrEMBL: SWISS-PROT is a curated protein sequence database. TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
 
PIR: The Protein Information Resource is a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database.
PIR-PSD: The PIR-International Protein Sequence Database (PIR-PSD) was the world's first database of classified and functionally annotated protein sequences; it grew out of the Atlas of Protein Sequence and Structure. PIR-PSD has been the most comprehensive and expertly curated protein sequence database in the public domain for over 20 years. In 2002, PIR joined EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to form the UniProt consortium. PIR-PSD sequences and annotations have been integrated into the UniProt Knowledgebase. Bi-directional cross-references between UniProt (UniProt Knowledgebase and/or UniParc) and PIR-PSD are established to allow easy tracking of former PIR-PSD entries. PIR-PSD unique sequences, reference citations, and experimentally verified data can now be found in the relevant UniProt records.

DDBJ: Nucleotide Sequence Database

DDBJ (DNA Data Bank of Japan) is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. DDBJ exchanges the collected data with EMBL-Bank at EBI (European Bioinformatics Institute) and GenBank at NCBI (National Center for Biotechnology Information) on a daily basis, so the three data banks share virtually the same data at any given time. The virtually unified database is called the International Nucleotide Sequence Database (INSD). DDBJ collects sequence data mainly from Japanese researchers, but it also accepts data from, and issues accession numbers to, researchers in any other country.


The DDBJ nucleotide database can be searched with:

Getentry: Data retrieval by accession numbers, etc.
ARSA: All-round Retrieval of Sequence and Annotation.
TXSearch: Retrieval of unified taxonomy database.
BLAST: Homology Search.
DDBJ Vector Screening System


EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis. The current database release (Release 110, December 2011), together with the corresponding release notes and user manual, is available from the EBI servers.

The EMBL Nucleotide Sequence Database forms part of the European Nucleotide Archive, an EBI project led by Guy Cochrane as part of the Protein and Nucleotide Database Group (PANDA) under Ewan Birney.

GenBank: Nucleotide Sequence Database

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of April 2011, there are approximately 135,440,924 sequence records in the traditional GenBank divisions and 62,715,288 sequence records in the WGS (Whole Genome Shotgun) division.

GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper. There are several options for submitting data to GenBank:
  • BankIt, a WWW-based submission tool for convenient and quick submission of sequence data
  • Sequin, NCBI's stand-alone submission software for Mac, PC, and UNIX platforms, is available by FTP. When using Sequin, the output files for direct submission should be sent to GenBank by e-mail.
  • tbl2asn, a command-line program, automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.
  • Barcode Submission Tool, a WWW-based tool for the submission of GenBank sequences and trace data for Barcode of Life projects. Currently, only mitochondrial cytochrome c oxidase subunit I (COI) genes are being accepted with this tool. For submissions of loci other than COI, please use either BankIt or Sequin.
Revisions or updates to GenBank entries can be made by the submitters at any time. Updates should be sent via e-mail or the UpdateMacroSend form. Send updates and revisions to gb-admin@ncbi.nlm.nih.gov. Be sure to give the accession number of the sequence to be updated in the subject line. 

There are several ways to search and retrieve data from GenBank.
a) Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided into three divisions: CoreNucleotide (the main collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome Survey Sequences).

b) Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool). BLAST searches CoreNucleotide, dbEST, and dbGSS independently; see BLAST info for more information about the numerous BLAST databases.
c) Search, link, and download sequences programmatically using the NCBI E-utilities.

GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of sequence section is marked by a line beginning with the word "ORIGIN" and the end of the section is marked by a line with only "//".
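
The layout described above can be illustrated with a minimal Python sketch that splits one record into its annotation lines and its sequence. The function name `split_genbank_record` and the toy record are hypothetical, made up for illustration; real records carry many more annotation fields, and an established parser is the safer choice for production work.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into (annotation_lines, sequence)."""
    annotation = []
    sequence_parts = []
    in_annotation = False
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # "LOCUS" marks the start of the annotation section.
            in_annotation = True
        if line.startswith("ORIGIN"):
            # "ORIGIN" marks the start of the sequence section.
            in_annotation = False
            in_sequence = True
            continue
        if line.strip() == "//":
            # A line containing only "//" ends the record.
            break
        if in_sequence:
            # Sequence lines mix position numbers, spaces and bases; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif in_annotation:
            annotation.append(line)
    return annotation, "".join(sequence_parts)


# Hypothetical toy record, shortened for illustration (not a real GenBank entry).
EXAMPLE = """LOCUS       TOY0001        12 bp    DNA     linear   SYN 01-JAN-2011
DEFINITION  Toy record for illustration.
ORIGIN
        1 atggcc aagtaa
//
"""

annotation, sequence = split_genbank_record(EXAMPLE)
print(len(annotation), "annotation lines; sequence:", sequence)
```

The sketch handles a single record; a file with several records would need to be split on the "//" lines first.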

The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.



Sequence database

In the field of bioinformatics, a sequence database is a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae), or it can include sequences from all organisms whose DNA has been sequenced.

Some widely used sequence databases are:
  • Nucleic acid sequence databases: GenBank, EMBL, DDBJ
  • Protein sequence databases: UniProtKB (SWISS-PROT, TrEMBL), PIR-PSD
  • Repositories for high throughput genomic sequences: EST, STS, GSS, etc.
  • Genome databases at NCBI, EBI, TIGR, SANGER
  • Viral Genomes.
  • Archaeal and Bacterial Genomes.
  • Eukaryotic genomes.

PLoS: Public Library of Science

PLoS (Public Library of Science) is a nonprofit publisher and advocacy organization with a mission of leading a transformation in scientific and medical research communication. Everything PLoS publishes is open access, freely available online, which benefits everyone from researchers, educators, and patient advocates to funders, policymakers, and the public.

To provide open access (OA), PLoS journals use a business model in which their expenses are recovered in part by charging a publication fee to the authors or research sponsors for each article they publish.

PLoS entered the publishing arena in October 2003 with the launch of PLoS Biology, followed by PLoS Medicine. PLoS later launched four discipline-based community journals: PLoS Genetics, PLoS Pathogens, PLoS Computational Biology, and PLoS Neglected Tropical Diseases. PLoS pushed the OA envelope yet again with PLoS ONE and PLoS Currents, making research available to the public in as little as 24 hours.

PLoS launched a new blog network for discussing science and medicine in public, covering topics in research, culture, and publishing. PLoS also launched PLoS Hubs: Biodiversity, which aggregates content from a wide range of publishers; expert curators select the articles.

BioMed Central: Literature database

BioMed Central (BMC) is a UK-based, for-profit STM (Science, Technology and Medicine) publisher specialising in open access journal publication. BMC, and its sister companies Chemistry Central and PhysMath Central, publish over 200 scientific open access, online, peer-reviewed journals. All original research articles published by BioMed Central are made freely and permanently accessible online immediately upon publication. BioMed Central levies an article-processing charge to cover the cost of the publication process. Authors publishing with BioMed Central retain the copyright to their work, licensing it under the Creative Commons Attribution License which allows articles to be re-used and re-distributed without restriction, as long as the original work is correctly cited. 

BioMed Central was founded in 2000 as part of the Current Science Group (now Science Navigation Group, SNG), a nursery of scientific publishing companies. SNG chairman Vitek Tracz developed the concept for the company after NIH director Harold Varmus's PubMed Central concept for open-access publishing was scaled back.

BioMed Central owns and produces in-house six flagship journals: Journal of Biology, Genome Biology, Genome Medicine, Arthritis Research and Therapy, Breast Cancer Research, and Critical Care. It also produces the BMC series of 64 journals covering the fields of biology and medicine, including the leading titles BMC Biology and BMC Medicine.

Chemistry Central Journal and the PhysMath series of journals are also produced by the company. 

MeSH: Medical Subject Headings Vocabulary

MEDLINE uses a controlled vocabulary, meaning that there is a specific set of terms used to describe each article. Familiarity with this vocabulary will make you a better PubMed searcher.

MeSH vocabulary is organized by 16 main branches:
  1. Anatomy
  2. Organisms
  3. Diseases
  4. Chemicals and Drugs
  5. Analytical, Diagnostic and Therapeutic Techniques and Equipment
  6. Psychiatry and Psychology
  7. Biological Sciences
  8. Natural Sciences
  9. Anthropology, Education, Sociology and Social Phenomena
  10. Technology, Industry, Agriculture
  11. Humanities
  12. Information Science
  13. Named Groups
  14. Health Care
  15. Publication Characteristics
  16. Geographic Locations 
NLM indexers examine articles and assign the most specific MeSH heading(s) that appropriately describe the concept(s) discussed. Every drug and chemical MeSH heading has been assigned one or more headings that describe known pharmacological actions (PA).

PubMed: Literature database

PubMed comprises more than 21 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

PubMed is a Web-based retrieval system developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine. It is part of NCBI's vast retrieval system, known as Entrez. PubMed is a database of bibliographic information drawn primarily from the life sciences literature. It contains links to full-text articles at participating publishers' Web sites as well as links to other third party sites such as libraries and sequencing centers. It provides access and links to the integrated molecular biology and chemistry databases maintained by NCBI.
NLM has been indexing the biomedical literature since 1879, to help provide health professionals access to information necessary for research, health care, and education. What was once a printed index to articles, the Index Medicus, became a database now known as MEDLINE. MEDLINE contains journal citations and abstracts for biomedical literature from around the world. Since 1996, free access to MEDLINE has been available to the public online via PubMed.

Tuesday, December 27, 2011

NCBI bioinformatics tools

Amino Acid Explorer: This tool allows users to explore the characteristics of amino acids by comparing their structural and chemical properties, predicting protein sequence changes caused by mutations, viewing common substitutions, and browsing the functions of given residues in conserved domains.

Assembly Archive: Links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.


BLAST Microbial Genomes: Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

BLAST RefSeqGene: Performs a BLAST search of the genomic sequences in the RefSeqGene/LRG set. The default display provides ready navigation to review alignments in the Graphics display.


Batch Entrez: Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

BioAssay Services: Tools that summarize the biological test results in the PubChem database and provide alternative ways to view bioassay results and structure-activity relationships. Users also can download their analyses and data tables.

CDTree: A stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own. CDTree is tightly integrated with Entrez CDD and Cn3D, and allows users to create and update protein domain alignments.

COBALT: A protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.


Coffee Break: Part of the NCBI Bookshelf, Coffee Break combines reports on recent biomedical discoveries with use of NCBI tools. Each report incorporates interactive tutorials that show how NCBI bioinformatics tools are used as a part of the research process.

Concise Microbial Protein BLAST: A specialized BLAST service in which the queried database consists of all proteins from complete microbial (prokaryotic) genomes. NCBI has precalculated clusters of similar proteins at the genus-level and one representative is chosen from each cluster in order to reduce the dataset, thereby reducing search time and providing a broader taxonomic view.

Conserved Domain Architecture Retrieval Tool (CDART): Displays the functional domains that make up a given protein sequence. It lists proteins with similar domain architectures and can retrieve proteins that contain particular combinations of domains.


Digital Differential Display (DDD): A tool for comparing EST profiles in order to identify genes with significantly different expression levels.

E-Bench: This interactive tool allows users to build E-utility URLs, either from a form or by hand, and then view their raw output. The tool provides a simple environment for testing E-utility URLs before including them in applications.
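
As a rough illustration of what building an E-utility URL "by hand" involves, the sketch below assembles an ESearch request with Python's standard library. The base address and the `db`, `term` and `retmax` parameters are taken from NCBI's public E-utilities documentation rather than from this post, and the helper name `build_esearch_url` is hypothetical. The code only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities (from NCBI's documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for an Entrez database name and a query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and special characters in the query term.
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("nucleotide", "cytochrome c oxidase subunit I", retmax=5)
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw output before wiring the call into an application.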


Ebot: A tool that allows users to construct an E-utility analysis pipeline using an online form, and then generates a Perl script to execute the pipeline.

Electronic PCR (e-PCR): A computational procedure that is used to identify sequence tagged sites (STSs) within DNA sequences. e-PCR looks for potential STSs in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing that could represent the PCR primers used to generate known STSs.
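
The idea can be sketched in a few lines of Python. This is a simplified illustration, not NCBI's implementation: it requires exact primer matches (the real tool tolerates close matches), scans only the strand it is given, and the function names, the `margin` parameter and the toy template are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every start index of pattern in text (overlaps allowed)."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def epcr_hits(template, forward, reverse, expected_size, margin=50):
    """Return (start, end, size) for each candidate STS on the given strand."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer anneals to the opposite strand, so on this strand
    # we look for its reverse complement.
    reverse_site = reverse_complement(reverse.upper())
    hits = []
    for f_start in find_all(template, forward):
        for r_start in find_all(template, reverse_site):
            end = r_start + len(reverse_site)
            size = end - f_start
            # Order: the reverse site must lie downstream of the forward primer.
            # Spacing: the product size must be close to the expected STS size.
            if r_start >= f_start + len(forward) and abs(size - expected_size) <= margin:
                hits.append((f_start, end, size))
    return hits


# Toy template: flank + forward primer site + spacer + reverse primer site + flank.
template = "TTTT" + "ACGTAC" + "GGGGGGGGGG" + "CCATGA" + "AAAA"
print(epcr_hits(template, "ACGTAC", "TCATGG", expected_size=22, margin=2))
```

A fuller version would also scan the reverse complement of the template and allow a small number of mismatches per primer.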

Frequency-weighted Link (FLink): FLink is a tool that enables you to link from a group of records in a source database to a ranked list of associated records in a destination database based on frequency-weighted statistics.

Gene Expression Omnibus (GEO) BLAST: Tool for aligning a query sequence (nucleotide or protein) to GenBank sequences included on microarray or SAGE platforms in the GEO database.

Gene Plot: A tool for pairwise comparison of two prokaryotic genomes that displays pairs of protein homologs that are symmetrical best hits between the two genomes.
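
"Symmetrical best hits" can be illustrated with a short Python sketch: a pair is kept only when each protein is the other's top-scoring match. The data layout (nested dictionaries of alignment scores), the function names and the example scores are assumptions made for illustration, not Gene Plot's actual input or code.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of {query: {subject: score}}.
    """
    return {q: max(subjects, key=subjects.get)
            for q, subjects in scores.items() if subjects}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical alignment scores between proteins of genome A and genome B.
a_vs_b = {"a1": {"b1": 250.0, "b2": 80.0}, "a2": {"b1": 120.0, "b2": 90.0}}
b_vs_a = {"b1": {"a1": 245.0, "a2": 118.0}, "b2": {"a2": 95.0, "a1": 70.0}}

print(symmetrical_best_hits(a_vs_b, b_vs_a))
```

Here a2's best hit is b1, but b1's best hit is a1, so only the a1/b1 pair is symmetrical.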

Genetic Codes: Displays the genetic codes for organisms in the Taxonomy database in tables and on a taxonomic tree.

Genome BLAST: This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches using the Basic Local Alignment Search Tool (BLAST) algorithm.


Genome Remapping Service: NCBI's Remap tool allows users to project annotation data and convert locations of features from one genomic assembly to another or to RefSeqGene sequences through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.


LinkOut: A service that allows third parties to link directly from PubMed and other Entrez database records to relevant web-accessible resources beyond the Entrez system. Examples of LinkOut resources include full-text publications, biological databases, consumer health information and research tools.



NCBI Toolbox: A set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1) format, an International Standards Organization (ISO) data representation format.

OSIRIS: A public domain quality assurance software package that facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on laboratory-specific protocols. OSIRIS evaluates the raw electrophoresis data using an independently derived mathematically-based sizing algorithm. It offers two new peak quality measures - fit level and sizing residual. It can be customized to accommodate laboratory-specific signatures such as background noise settings, customized naming conventions and additional internal laboratory controls.

Open Mass Spectrometry Search Algorithm (OMSSA) Search: An efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.

Open Reading Frame Finder (ORF Finder): A graphical analysis tool that finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.
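
The core idea of scanning all six reading frames can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it uses only the stop codons of the standard genetic code (the tool itself offers sixteen codes), treats ATG as the only start codon, and the function names and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons=2):
    """Yield (start, end) of ATG...stop ORFs in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons, including the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=2):
    """Return (strand, orf_sequence) pairs for all six reading frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(s, min_codons):
            results.append((strand, s[start:end]))
    return results


print(find_orfs("ccatggcctaagg"))
```

ORFs that start but never reach a stop codon before the end of the sequence are ignored in this sketch.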

PSSM Viewer: Allows users to display, sort, subset and download position-specific score matrices (PSSMs) either from CDD records or from Position Specific Iterated (PSI)-BLAST protein searches. The tool also can align a query protein to the PSSM and highlight positions of high conservation.

Phenotype-Genotype Integrator (PheGenI): Supports finding human phenotype/genotype relationships with queries by phenotype, chromosome location, gene, and SNP identifiers. Currently includes information from dbGaP, the NHGRI GWAS Catalog, and GTEx. Displays results on the genome, on sequence, or in tables for download.



PubChem Power User Gateway (PUG): PUG provides access to PubChem services via a programmatic interface. PUG allows users to download data, initiate chemical structure searches, standardize chemical structures and interact with the E-utilities. PUG can be accessed using either standard URLs or via SOAP.

PubChem Standardization Service: Standardization, in PubChem terminology, is the processing of chemical structures in the same way used to create PubChem Compound records from contributors' original structures. This service lets users see how PubChem would handle any structure they would like to submit.



PubMed Tutorials: A collection of web and Flash tutorials on PubMed searching and linking, saving searches in MyNCBI, using MeSH and other PubMed services.

Related Structures: The Related Structures tool allows users to find 3D structures from the Molecular Modeling Database (MMDB) that are similar in sequence to a query protein. Although the query protein may not yet have a resolved structure, the 3D shape of a similar protein sequence can shed light on the putative shape and biological function of the query protein.

SNP Database Specialized Search Tools: A variety of tools are available for searching the SNP database, allowing search by genotype, method, population, submitter, markers and sequence similarity using BLAST. These are linked under "Search" on the left side bar of the dbSNP main page.


Sequence Viewer: Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component.  
 

TaxPlot: A tool for comparing genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome and two species for comparison. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.



Taxonomy Statistics: Displays the number of taxonomic nodes in the database for a given rank and date of inclusion.

Taxonomy Status Reports: Displays the current status of a set of taxonomic nodes or IDs.

Variation Reporter: A tool designed to search for and report human sequence variation data from dbSNP and dbVar. Individual variations or batch files can be submitted in HGVS, GVF or BED formats. Related information will be retrieved and reported in a downloadable table containing variation identifiers, nucleotide and cytogenetic band locations on various genomic assemblies, allele type and minor allele frequencies, predicted functional consequences (missense, nonsense, frameshift, splice site, etc.), reported clinical significance, and relevant citations.

VecScreen: A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. VecScreen searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).


Viral Genotyping Tool: This tool helps identify the genotype of a viral sequence. A window is slid along the query sequence and each window is compared by BLAST to each of the reference sequences for a particular virus.
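
The sliding-window idea can be sketched in Python as below. Note the substitution: the real tool compares each window to the references with BLAST, whereas this sketch counts identical positions and assumes the query and references are already aligned to the same coordinates. The function names and toy sequences are hypothetical.

```python
def sliding_windows(sequence, window_size, step):
    """Yield (start, window) pairs as a window slides along the sequence."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = max(len(sequence) - window_size, 0)
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + window_size]


def identity(a, b):
    """Count positions where two strings agree (up to the shorter length)."""
    return sum(x == y for x, y in zip(a, b))


def genotype_by_window(query, references, window_size, step):
    """Return (start, best_reference_name) for each window of the query.

    Assumes the query and all references share the same coordinates.
    """
    calls = []
    for start, window in sliding_windows(query, window_size, step):
        def score(name):
            return identity(window, references[name][start:start + window_size])
        calls.append((start, max(references, key=score)))
    return calls


# Toy pre-aligned references and a query that switches genotype part-way along.
references = {"genotype_A": "AAAAAAAACCCC", "genotype_B": "GGGGGGGGTTTT"}
query = "AAAAAGGGTTTT"
print(genotype_by_window(query, references, window_size=4, step=4))
```

Reporting a call per window, rather than one call for the whole sequence, is what makes it possible to notice a query whose best-matching reference changes along its length.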

ExPASy databases

Frequently used ExPASy databases are:

UniProtKB: functional information on proteins.

EBI databases

The main missions of the European Bioinformatics Institute (EBI) centre on building, maintaining and providing biological databases and information services to support data deposition and exploitation.



Some of the EBI databases include:

European Nucleotide Archive (ENA) - Europe's primary comprehensive nucleotide sequence data resource.

UniProt Knowledgebase - A complete annotated protein sequence database.

       UniProtKB/Swiss-Prot: An annotated protein sequence database (Reviewed).
       UniProtKB/TrEMBL: A computer-generated protein database enriched with automated classification and annotation (Unreviewed).


Protein Databank in Europe Database (PDBe) - European Project for the management and distribution of data on macromolecular structures.

ArrayExpress - Gene expression data.

Ensembl - Provides up-to-date completed metazoan genomes and the best possible automatic annotation.

IntAct - Provides a freely available, open source database system and analysis tools for protein interaction data.

Patent Data Resources: Patent data resources at the EBI contain patent abstracts, patent chemical compounds, patent sequences and patent equivalents. There are various ways of accessing and searching the patent data.

Gene Ontology Annotation (GOA): Provides Gene Ontology (GO) term assignments for proteins in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and the International Protein Index (IPI).

InterPro: A database of integrated documentation resources for protein families, domains and functional sites.

Chemical Entities of Biological Interest (ChEBI): A freely available dictionary of 'small molecular entities'.

Databases A-Z - A complete listing of all the EBI databases.

EBI also hosts many other databases, including literature citation databases such as MEDLINE.






Monday, December 26, 2011

NCBI databases

NCBI has created a large number of databases that are freely available to researchers. All these databases can be reached from the Entrez search page. Entrez Global Query allows cross searching of all the NCBI databases. Like Entrez, BLAST (Basic Local Alignment Search Tool) is not a database itself, but a means of accessing the data in NCBI protein and nucleotide databases.

Most frequently used NCBI databases are: 



PubMed: Major bibliographic database from NCBI. It searches MEDLINE and also allows access to journals indexed by MEDLINE.

OMIM (Online Mendelian Inheritance in Man): A database from Johns Hopkins University for human genetics containing short articles on genetic disorders.

Nucleotide databases: A composite of data from GenBank, EMBL and the DNA Data Bank of Japan (DDBJ). It is divided into three sub-databases:

GenBank Expressed Sequence Tags (EST): Short sequences derived from mRNA isolated from a particular tissue at a particular stage of development.

GenBank Genome Survey Sequence (GSS): Sequences derived from whole-genome sequencing projects.

CoreNucleotide: All sequences that are not ESTs or GSSs.

Protein database: contains data from GenBank, EMBL and DDBJ and sequences submitted to various other sources including SWISS-PROT.


Genome database: provides views of entire genomes and chromosomes. Results are displayed via NCBI's Map Viewer.

Structure database: contains three-dimensional images of proteins from the protein database. Protein images can be manipulated using the free Cn3D tool.

Gene database: allows users to search for individual genes from among the genomes represented in RefSeq. Results may be examined in the Sequence Viewer.

Taxonomy: contains names of all organisms that are represented by nucleotide or protein sequences in NCBI databases

RCSB: Research Collaboratory for Structural Bioinformatics

The Research Collaboratory for Structural Bioinformatics (RCSB) is an organization dedicated to improving our understanding of the function of biological systems through the study of the 3-D structure of biological macromolecules. RCSB provides free public resources and publications to assist others and further the fields of bioinformatics and biology.

RCSB members include RCSB-Rutgers, RCSB-SDSC (San Diego Supercomputer Center) and RCSB-BMRB (Biological Magnetic Resonance Data Bank).

In 1998, RCSB became responsible for the management of the Protein Data Bank (PDB). The PDB archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids.

Sunday, December 25, 2011

ExPASy: Expert Protein Analysis System

ExPASy is the Swiss Institute of Bioinformatics (SIB) Bioinformatics Resource Portal which provides access to scientific databases and software tools in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc. On this portal you find resources from many different SIB groups as well as external institutions.

The individual resources (databases, web-based and downloadable software tools) are hosted in a decentralised way by different groups of the SIB and partner institutions. The new Web interface provides visual guidance for newcomers to ExPASy.

EBI: European Bioinformatics Institute


The European Bioinformatics Institute (EBI) is an academic research institute located on the Wellcome Trust Genome Campus in Hinxton near Cambridge (UK), part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures.

The EBI search service is a gateway to information spanning genes, functional genomics, protein structures, small molecules, enzymes, interactions, pathways, scientific literature and patent sequences. EBI also provides tools that allow researchers to analyze information and to upload and share their work.





NCBI: National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. NCBI is directed by David Lipman, one of the original authors of the BLAST sequence alignment program. The NCBI houses genome sequencing data in GenBank and an index of biomedical research articles in PubMed Central and PubMed, as well as other information relevant to biotechnology. All these databases are available online through the Entrez search engine.

The NCBI Bookshelf provides free access to books and documents in life science and healthcare. As a vital node in NCBI's data-rich resource network, Bookshelf enables users to easily browse, retrieve, and read content, and spurs discovery of related information.

The NCBI Structure resources are developed by the Structure Group of the NCBI Computational Biology Branch (CBB) and are freely available to the public.