Protein sequence databases
Introduction
With the availability of over 165 completed genome sequences from both eukaryotic and prokaryotic organisms, efforts are now being focused on the identification and functional analysis of the proteins encoded by these genomes. The large-scale analysis of these proteins has started to generate huge amounts of data due to the new information provided by the genome projects and to a range of new technologies in protein science. For example, mass spectrometry approaches are being used in protein identification and in determining the nature of post-translational modifications [1]. These and other methods make it possible to quickly identify large numbers of proteins, to map their interactions, to determine their location within the cell [2••] and to analyse their biological activities. Protein sequence databases play a vital role as a central resource for storing the data generated by these and more conventional efforts, and making them available to the scientific community.
To exploit the various resources fully, it is essential to distinguish between them and to identify the types of data they contain. Universal protein databases cover proteins from all species whereas specialized data collections contain information about a particular protein family or group of proteins, or related to a specific organism. Universal protein sequence databases can be further subdivided into two categories: sequence repositories, in which data are stored with little or no manual intervention in the creation of the records; and expertly curated databases, in which the original data are enhanced by the addition of further information. In the following, we present the current status of the leading protein sequence databases.
Sequence repositories
Several protein sequence databases act as repositories of protein sequences. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a non-redundant collection of sequences to users.
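What redundancy means in practice can be illustrated with a minimal sketch. It assumes, purely for illustration, that two records are redundant when their sequences are exactly identical; curated resources may apply broader criteria. The function name `collapse_identical` and the record identifiers and sequences are hypothetical and do not come from any real database.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group record identifiers by exact sequence identity.

    `records` is an iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of
    identifiers that carry it, in input order.
    """
    groups = defaultdict(list)
    for identifier, sequence in records:
        # Remove whitespace and normalise case so that formatting
        # differences alone do not make two sequences look distinct.
        key = "".join(sequence.split()).upper()
        groups[key].append(identifier)
    return dict(groups)


# Hypothetical records, as a repository might hold them: the same
# sequence submitted twice under different identifiers, plus one variant.
records = [
    ("repoA|0001", "MKTAYIAKQR"),
    ("repoB|0457", "mktayiakqr"),
    ("repoA|0002", "MKTAYIAKQQ"),
]

non_redundant = collapse_identical(records)
```

Here the three repository records collapse into two distinct sequences, with the first two identifiers grouped under one entry. A repository would keep all three records as submitted.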
Universal curated databases
Although repositories are essential for providing users with sequences as quickly as possible, adding further information to a sequence greatly increases the value of the resource. Curated databases enrich the sequence data with additional information that expert biologists validate before it is added, so that the data in these collections can be considered highly reliable. There is also a
UniProt: the next generation of protein sequence databases
One of the most significant developments with regard to protein sequence databases is the recent decision by the National Institutes of Health to award a grant [29] to combine the Swiss-Prot, TrEMBL and PIR-PSD databases into a single resource, UniProt (http://www.uniprot.org) [30••]. UniProt was launched on 15 December 2003 and comprises three components: first, the UniProt Knowledgebase which will continue the work of Swiss-Prot, TrEMBL and PIR by providing an expertly curated database;
Conclusions
Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. With the rapid accumulation of genome sequences for many organisms, attention is turning to the identification and function of proteins encoded by these genomes. The recent joining of forces by the major protein databases Swiss-Prot, TrEMBL and PIR in the UniProt consortium to handle the increasing volume and variety of protein sequences and functional
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Minor support for the EBI’s involvement in UniProt comes from the two European Union contracts BioBabel (QLRT-2000-00981) and TEMBLOR (QLRI-2001-00015) and from the NIH grant 1R01HG02273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the National Science Foundation (NSF) grants
References (35)
NIH pledges cash for global protein database. Nature (2002)
et al. Mass spectrometry – a key technology in proteome research. Adv. Biochem. Eng. Biotechnol. (2003)
et al. Global analysis of protein expression in yeast. Nature (2003)
et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. (2003)
et al. DNA Data Bank of Japan in XML. Nucleic Acids Res. (2003)
et al. The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res. (2003)
et al. GenBank. Nucleic Acids Res. (2003)
et al. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. (2003)
et al. The Protein Information Resource. Nucleic Acids Res. (2003)
et al. NCBI Reference Sequence Project: update and current status. Nucleic Acids Res. (2003)
The Protein Data Bank and structural genomics. Nucleic Acids Res.
SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr. Issues Mol. Biol.
Representation of functional information in the Swiss-Prot data bank. Bioinformatics
Genew: the human gene nomenclature database. Nucleic Acids Res.
MGD: the Mouse Genome Database. Nucleic Acids Res.