Protein sequence databases

https://doi.org/10.1016/j.cbpa.2003.12.004

Abstract

A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information in each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases will play an even more important role as central comprehensive resources of protein information. Several of the leading protein sequence databases are discussed here, with special emphasis on the databases now provided by the Universal Protein Knowledgebase (UniProt) consortium.

Introduction

With the availability of over 165 completed genome sequences from both eukaryotic and prokaryotic organisms, efforts are now being focused on the identification and functional analysis of the proteins encoded by these genomes. The large-scale analysis of these proteins has started to generate huge amounts of data due to the new information provided by the genome projects and to a range of new technologies in protein science. For example, mass spectrometry approaches are being used in protein identification and in determining the nature of post-translational modifications [1]. These and other methods make it possible to quickly identify large numbers of proteins, to map their interactions, to determine their location within the cell [2••] and to analyse their biological activities. Protein sequence databases play a vital role as a central resource for storing the data generated by these and more conventional efforts, and making them available to the scientific community.

To exploit the various resources fully, it is essential to distinguish between them and to identify the types of data they contain. Universal protein databases cover proteins from all species, whereas specialized data collections contain information about a particular protein family or group of proteins, or about the proteins of a specific organism. Universal protein sequence databases can be further subdivided into two categories: sequence repositories, in which data are stored with little or no manual intervention in the creation of the records; and expertly curated databases, in which the original data are enhanced by the addition of further information. In the following, we present the current status of the leading protein sequence databases.

Sequence repositories

Several protein sequence databases act as repositories of protein sequences. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a non-redundant collection of sequences to users.

Universal curated databases

Although repositories are an essential means of providing the user with sequences as quickly as possible, adding further information to a sequence greatly increases the value of the resource for users. The curated databases enrich the sequence data with additional information, which is validated by expert biologists before being added to the databases so that the data in these collections can be considered highly reliable. There is also a

UniProt: the next generation of protein sequence databases

One of the most significant developments with regard to protein sequence databases is the recent decision by the National Institutes of Health to award a grant [29] to combine the Swiss-Prot, TrEMBL and PIR-PSD databases into a single resource, UniProt (http://www.uniprot.org) [30••]. UniProt was launched on 15 December 2003 and comprises three components: first, the UniProt Knowledgebase, which will continue the work of Swiss-Prot, TrEMBL and PIR by providing an expertly curated database;

Conclusions

Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. With the rapid accumulation of genome sequences for many organisms, attention is turning to the identification and function of proteins encoded by these genomes. The recent joining of forces by the major protein databases Swiss-Prot, TrEMBL and PIR in the UniProt consortium to handle the increasing volume and variety of protein sequences and functional

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Minor support for the EBI’s involvement in UniProt comes from the two European Union contracts BioBabel (QLRT-2000-00981) and TEMBLOR (QLRI-2001-00015) and from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the National Science Foundation (NSF) grants

References (35)

  • D Butler. NIH pledges cash for global protein database. Nature (2002)
  • A Sickmann et al. Mass spectrometry – a key technology in proteome research. Adv. Biochem. Eng. Biotechnol. (2003)
  • W.K Huh et al. Global analysis of protein expression in yeast. Nature (2003)
  • D.L Wheeler et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. (2003)
  • S Miyazaki et al. DNA Data Bank of Japan in XML. Nucleic Acids Res. (2003)
  • G Stoesser et al. The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res. (2003)
  • D.A Benson et al. GenBank. Nucleic Acids Res. (2003)
  • B Boeckmann et al. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. (2003)
  • C.H Wu et al. The Protein Information Resource. Nucleic Acids Res. (2003)
  • K.D Pruitt et al. NCBI Reference Sequence Project: update and current status. Nucleic Acids Res. (2003)
  • J Westbrook et al. The Protein Data Bank and structural genomics. Nucleic Acids Res. (2003)
  • Dayhoff MO: Atlas of Protein Sequence and Structure, vol 5, suppl. 3. Washington, DC: National Biomedical Research...
  • E Gasteiger et al. SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr. Issues Mol. Biol. (2001)
  • V Junker et al. Representation of functional information in the Swiss-Prot data bank. Bioinformatics (1999)
  • H.M Wain et al. Genew: the human gene nomenclature database. Nucleic Acids Res. (2002)
  • FlyBase consortium: The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res...
  • J.A Blake et al. MGD: the Mouse Genome Database. Nucleic Acids Res. (2003)