Prediction of Protein–Protein Interaction via co-occurring Aligned Pattern Clusters

doi:10.1016/j.ymeth.2016.07.018

Methods

Volume 110, 1 November 2016, Pages 26-34

https://doi.org/10.1016/j.ymeth.2016.07.018

Highlights

• APCs were introduced to model sequence patterns with variable length and variants.
• cAPC pairs were developed to model the co-occurring sequence patterns in PPI.
• A method was proposed to turn a protein pair into a feature vector using cAPC pairs.
• WeMine-P2P, a new PPI prediction method with outperforming results, was proposed.
• WeMine-P2P allows biologically intuitive understanding of the feature vector.

Abstract

Predicting Protein–Protein Interaction (PPI) is important for making new discoveries in the molecular mechanisms inside a cell. Traditionally, new PPIs are identified through biochemical experiments, but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Sequence-based prediction is currently the most readily applicable and cost-effective method. It exploits known PPI databases to construct classifiers for predicting unknown PPIs based only on sequence data, without requiring any other prior knowledge. Among existing sequence-based methods, most feature-based methods use exact sequence patterns with fixed length as features, a constraint which is biologically unrealistic. SVM with Pairwise String Kernel yields better prediction performance. However, its results are difficult to interpret biologically because it is kernel-based and no concrete feature values are computed. Here we have developed a novel method, WeMine-P2P, to overcome these drawbacks. By assuming that the regions/sites that mediate PPI are more conserved, WeMine-P2P first discovers and locates the conserved sequence patterns in protein sequences in the form of Aligned Pattern Clusters (APCs), allowing pattern variations with variable length. It then pairs up all APCs into a set of Co-Occurring APC (cAPC) pairs, and computes a cAPC-PPI score for each cAPC pair on all PPI pairs. It further constructs, for each PPI pair, a feature vector composed of all cAPC pairs with their cAPC-PPI scores, and uses these vectors to construct a PPI predictor. Through 40 independent experiments, we showed that (1) WeMine-P2P outperforms the well-known algorithm PIPE2, which also utilizes co-occurring amino acid sequence segments but does not allow variable lengths and pattern variations; (2) WeMine-P2P achieves satisfactory PPI prediction performance, comparable to the SVM-based methods, particularly among unseen protein sequences, with a potential reduction of feature dimension of 1280×; (3) unlike SVM-based methods, WeMine-P2P renders interpretable biological features, from which we observed that co-occurring sequence patterns from the compositional bias regions are more discriminative. WeMine-P2P is extendable to predict other biosequence interactions such as Protein–DNA interactions.

Introduction

Protein–Protein Interaction (PPI) is important for various biological processes and functions in living cells such as metabolic cycles, DNA transcription and replication, and signaling cascades [1]. Predicting PPI is thus critical for better understanding the molecular mechanisms inside the cell [1]. It is particularly useful for discovering unknown functions of a protein [2]. Following [3], [4], we refer to a PPI as an interaction that brings two different proteins A and B into direct physical contact, i.e. a heterodimeric interaction. In contrast, most homodimeric interactions, where proteins A and B are identical, serve to maintain the stability of the interacting complex rather than to regulate cellular processes [5].

A number of experimental techniques, such as the two-hybrid systems [6], mass spectrometry [7], tandem affinity purification (TAP) [1], and microarray analysis [8], have been developed for systematic and large-scale prediction of PPIs. However, these experimental methods are costly, labor-intensive and time-consuming [9], [10]. Thus, existing PPI data obtained by these methods cover only a small fraction of the complete PPI networks [11], [12]. Moreover, these experimental methods usually suffer from high rates of both false positive and false negative predictions [13], [14]. Hence, developing effective and reliable computational methods based on sequence data alone to facilitate PPI prediction is of fundamental importance [15].

Existing computational methods for PPI prediction can be divided into four types depending on the input data. The first type, such as computational docking [16], requires three-dimensional structures of the target proteins. It simulates whether the target proteins can interact based on physicochemical properties such as shape complementarity, electrostatics, and biochemical information [17]. The second type requires genomic information of the target proteins, e.g. gene fusion events [18], the conservation of gene order [19], and the calculation of prior probabilities of genomic features between interacting proteins [20]. The third type requires prior biological knowledge of the target proteins, e.g. phylogenetic profiles [21], domain knowledge of proteins [22], [23], [24] and topological properties of proteins in PPI networks [11]. All these methods have limited applicability because the required data/information is not always available. The last type of method requires only sequence data. It uses the coded information inherent in sequences to predict whether a protein pair interacts. For this reason, sequence-based methods are becoming popular, since sequence data is more readily available nowadays [2].

PIPE [25]/PIPE2 [26], [27] is a well-established sequence-based method. Given a protein A, a protein B and a database of positive PPIs, PIPE simply counts how frequently the fixed-length sequence segments of proteins A and B are found co-occurring in the database. To accomplish this, all combinations of 20-mers between protein A and protein B are first enumerated using a sliding window with a width of 20. Then, each combination, e.g. MGIRRLVSVITRPIINKVNS from protein A and GPEAIILTGTFDDWKGTLPM from protein B, is searched for in the database, and the frequency of its co-occurrence is counted. The sum of all counts is then computed. If the sum is larger than or equal to a threshold, the algorithm predicts that proteins A and B interact. PIPE2 is a much faster version of PIPE. However, despite its satisfactory prediction performance, there is room for improvement. The key drawback of PIPE/PIPE2 is the use of a fixed window of 20 amino acids. This is biologically unrealistic, since functional regions such as Short Linear Motifs (SLiMs) have variable lengths from 3 to 15 amino acids [28], and most of them are shorter than 10 amino acids [29]. Recently, a similar algorithm called VLASPD [2], which allows protein sequence segments of variable length, was proposed. However, it still uses exact patterns, which are neither realistic nor useful for biological analysis because variants are not accepted. Furthermore, it adopts a threshold-based prediction model, which does not allow a nonlinear relationship between features and class outputs. Nevertheless, since PIPE2 is well benchmarked [3], we compare our newly proposed algorithm with it.
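The counting scheme described above can be illustrated with a minimal Python sketch. It is not the PIPE implementation: the function names, the use of exact string matching between segments, and the choice to count a database pair in either protein order are assumptions made only for illustration.

```python
from itertools import product


def windows(seq, w):
    """Return all length-w segments of seq, one per sliding-window position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def pipe_like_score(prot_a, prot_b, positive_pairs, w=20):
    """Sum of co-occurrence counts over all segment combinations.

    For each segment a of protein A and segment b of protein B, count the
    known interacting pairs (X, Y) in which a occurs in one protein and b in
    the other. Exact matching is an assumption of this sketch.
    """
    # Pre-compute the segment sets of every database protein once.
    db = [(set(windows(x, w)), set(windows(y, w))) for x, y in positive_pairs]
    total = 0
    for a, b in product(windows(prot_a, w), windows(prot_b, w)):
        for segs_x, segs_y in db:
            if (a in segs_x and b in segs_y) or (a in segs_y and b in segs_x):
                total += 1
    return total


def predict_interaction(prot_a, prot_b, positive_pairs, threshold, w=20):
    """Predict an interaction when the summed count reaches the threshold."""
    return pipe_like_score(prot_a, prot_b, positive_pairs, w) >= threshold
```

With a toy window width of 3, `pipe_like_score("ABCD", "WXYZ", [("ABCQ", "XYZP")], w=3)` finds one co-occurring segment pair (ABC with XYZ). The fixed `w` is exactly the constraint the paragraph above criticizes: a conserved region shorter than the window can never be matched on its own.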

Another well-established sequence-based method involves the use of Support Vector Machine (SVM) with Pairwise String Kernel [30], [31], [32], [15], [33], [34]. These methods encode a PPI pair into a feature vector composed of the co-occurrences of k-mers (sequences of k residues) and train the SVM to predict whether a protein pair can interact. For example, with k = 3, a selected feature could be the number of times the 3-mers WTG and LGA co-occur in a protein pair along the entire sequences. Since all possible 3-mers are considered, the feature space could be as large as $20^{3} \times 20^{3}$ (i.e. 64 million) [4]. With SVM, even at such a high dimensionality, the kernel trick means that the feature vector need be neither computed nor stored. However, because no feature vectors are computed, it is hard to use the SVM results to reveal or interpret why the feature space leads to good performance, despite the satisfactory predictions. Thus, since the feature space is hardly interpretable, not much biological knowledge can be gained. Overcoming this hurdle is another key motivation of our proposed method. It should be noted that it is possible to generalize k-mer counting strategies to allow for gaps and mismatches [35]. However, these methods still do not allow a variable length. For example, if k is set to 5, these methods would still consider only 5-mers, while in WeMine-P2P there could be 5-mers, 6-mers and 7-mers. In WeMine-P2P, we utilize the locally conserved sequence pattern clusters [36], [37] and their co-occurrence [38] to obtain biologically realistic and interpretable features that are flexible in pattern length while allowing variants. Experiments showed that our prediction results based on these features are comparable to those achieved by the SVM with Pairwise String Kernel approaches. In addition, the presence of concrete feature values makes the feature analysis of our models (and the subsequent biological interpretation) easier for biologists than with the SVM with Pairwise String Kernel approaches, which have no concrete features and thus make such analysis difficult.
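To make the k-mer pair feature and the size of its feature space concrete, here is a small Python sketch. The definition of the co-occurrence count as the product of the two k-mer occurrence counts is an assumption of this sketch (the cited kernels may define it differently), and all function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pair_feature(prot_a, prot_b, kmer_a, kmer_b):
    """Co-occurrence feature for one k-mer pair.

    Assumed definition: occurrences of kmer_a in protein A multiplied by
    occurrences of kmer_b in protein B.
    """
    count_a = kmer_counts(prot_a, len(kmer_a))[kmer_a]
    count_b = kmer_counts(prot_b, len(kmer_b))[kmer_b]
    return count_a * count_b


def feature_space_size(k, alphabet_size=len(AMINO_ACIDS)):
    """Number of ordered k-mer pairs: (alphabet_size ** k) squared."""
    return (alphabet_size ** k) ** 2


print(feature_space_size(3))  # 20**3 * 20**3 = 64000000
```

An explicit vector over all 64 million pairs would be almost entirely zeros for any single protein pair, which is why kernel methods avoid materializing it and why the individual features are hard to inspect afterwards.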

Motivated by the wide acceptance of sequence-based methods and the realization of their drawbacks, the objective of the research reported in this paper is to develop a new sequence-based prediction method which (1) is based on biologically interpretable features, (2) generates features that are more biologically realistic, such as by allowing variable lengths and pattern variations, and (3) achieves satisfactory prediction performance with these biologically interpretable features. In this study, we propose a new algorithm, WeMine-P2P, as illustrated in Fig. 1, to accomplish these objectives.

The remaining sections are outlined as follows. Section 2 explains in detail the WeMine-P2P prediction algorithm. Section 3 describes the dataset used and the pre-processing involved. Section 4 shows the design of the experiments and reports the results. Section 5 discusses the experimental results. Section 6 concludes the whole study.

Section snippets

Overview

We discover and locate APCs, then cAPC pairs, the “what” and “where” of the conserved regions, using them as discriminative features to construct the PPI classifier. This is elaborated in steps 1 to 6 in Fig. 1.
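As a rough illustration of how cAPC pairs could turn a protein pair into a feature vector, the Python sketch below models an APC as a collection of pattern variants that may differ in length. The APC discovery step and the actual cAPC-PPI score are not reproduced here: the binary co-occurrence indicator, the toy patterns and all names are assumptions introduced purely for illustration.

```python
from itertools import combinations_with_replacement


def apc_occurs(apc, seq):
    """True if any variant of the APC occurs in seq.

    An APC is modelled here as a tuple of pattern variants, which may have
    different lengths (a simplification assumed for this sketch).
    """
    return any(pattern in seq for pattern in apc)


def capc_feature_vector(prot_a, prot_b, apcs):
    """One entry per unordered APC pair (x, y).

    The entry is 1 if x occurs in one protein and y in the other, else 0.
    This binary indicator is a stand-in for the cAPC-PPI score, whose
    definition is not given in this text.
    """
    features = []
    for x, y in combinations_with_replacement(apcs, 2):
        forward = apc_occurs(x, prot_a) and apc_occurs(y, prot_b)
        backward = apc_occurs(x, prot_b) and apc_occurs(y, prot_a)
        features.append(int(forward or backward))
    return features


# Hypothetical APCs: each groups variants of one conserved pattern.
toy_apcs = [("RGD", "KGD"), ("PPLP", "PPPLP")]
print(capc_feature_vector("MKGDLL", "AAPPPLPQ", toy_apcs))  # [0, 1, 0]
```

Because every position of the vector corresponds to a named pair of pattern clusters, a nonzero entry can be traced back to the concrete sequence patterns that produced it, which is the kind of interpretability the method aims for.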

Problem definition

A protein pair, or PPI pair, is defined as a pair of protein sequences that may or may not interact with one another. A Protein–Protein Interaction pair, referred to as a positive PPI pair, is defined as a pair of protein sequences that can interact with each other.

Material

In our experiments, 40 independent Yeast_Randam datasets were downloaded from [3] at http://www.marcottelab.org/differentialGeneralization. The procedure to obtain these 40 datasets is described below. Yeast Protein–Protein Interaction (PPI) data (Saccharomyces_cerevisiae-20100304.txt) containing the protein sequences and the positive PPI pairs was acquired from the protein interaction network analysis platform [42]. Further pre-processing was applied to the proteins therein. First, the

Experimental design and parameter setting

As mentioned in Section 3 (Material), we obtained in total 40 independent datasets provided by [3]. Each dataset has a training set of 16,000 PPIs and a testing set of 4000 PPIs (an 80%/20% split). In our experiment, we first extracted features (Step 1, Step 2) from the training set, then used the features to construct the PPI matrix (Step 3, Step 4) and trained a predictive model on the PPI matrix. In Step 1, we used WeMine Aligned Pattern Clustering algorithm [37], [36] to obtain APCs with length

Discussions

The contributions of this study are summarized as follows: First, Aligned Pattern Clusters (APCs) [37], [36] were introduced to represent the sequence patterns in Protein–Protein Interaction (PPI) (between two protein chains). This study demonstrates the first successful use of APCs in PPI, compared with the previous studies [37], [36], [54], [38]. Second, based on APCs, co-occurring Aligned Pattern Cluster pairs (cAPC pairs) were newly developed to model the co-occurring sequence patterns

Conclusions

Sequence-based machine learning methods are becoming more and more popular because they are readily applicable and achieve satisfactory performance. However, existing methods prohibit researchers from gaining biological knowledge in PPI as they adopt features that are not biologically realistic, such as fixing the pattern length and using exact patterns, or adopt string kernels to skip the computation of features. In this study, we have furnished a new sequence-based method WeMine-P2P that

Acknowledgements

This research is supported by NSERC Post Graduate Scholarship, NSERC Discovery Grant and Waterloo/China Graduate Scholarship.

References (57)

F. Zhou et al.
Large-scale analyses of glycosylation in cellulases
Genomics Proteomics Bioinf.
(2009)
J.R. Parrish et al.
Yeast two-hybrid contributions to interactome mapping
Curr. Opin. Biotechnol.
(2006)
H.A. Gabb et al.
Modelling protein docking using shape complementarity, electrostatics and biochemical information
J. Mol. Biol.
(1997)
T. Dandekar et al.
Conservation of gene order: a fingerprint of proteins that physically interact
Trends Biochem. Sci.
(1998)
V.N. Uversky et al.
Understanding protein non-folding
Biochim. Biophys. Acta (BBA), Proteins Proteomics
(2010)
S. Alberti et al.
A systematic survey identifies prions and illuminates sequence features of prionogenic proteins
Cell
(2009)
A.K. Dunker et al.
Intrinsically disordered protein
J. Mol. Graphics Model.
(2001)
V. Neduva et al.
Peptides mediating interaction networks: new leads at last
Curr. Opin. Biotechnol.
(2006)
A.-C. Gavin et al.
Functional organization of the yeast proteome by systematic analysis of protein complexes
Nature
(2002)
L. Hu et al.
Discovering variable-length patterns in protein sequences for protein–protein interaction prediction
IEEE Transac. Nanobiosci.
(2015)

Y. Park et al.
Flaws in evaluation schemes for pair-input computational predictions
Nat. Methods
(2012)
T. Hamp et al.
Evolutionary profiles improve protein–protein interaction prediction from sequence
Bioinformatics
(2015)
I.M. Nooren et al.
Diversity of protein–protein interactions
The EMBO J.
(2003)
T. Ito et al.
A comprehensive two-hybrid analysis to explore the yeast protein interactome
Proc. Nat. Acad. Sci.
(2001)
Y. Ho et al.
Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry
Nature
(2002)
M.F. Templin et al.
Protein microarrays: promising tools for proteomic research
Proteomics
(2003)
Z.-H. You et al.
Predicting protein–protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest
PloS One
(2015)
S. Pitre et al.
Computational methods for predicting protein–protein interactions
Z.-H. You et al.
Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data
Bioinformatics
(2010)
X. Luo, Z. You, M. Zhou, S. Li, H. Leung, Y. Xia, Q. Zhu, A highly efficient approach to protein interactome mapping...
J. Shen et al.
Predicting protein–protein interactions based only on sequences information
Proc. Nat. Acad. Sci.
(2007)
B.G. Pierce et al.
Zdock server: interactive docking prediction of protein–protein complexes and symmetric multimers
Bioinformatics
(2014)
A.J. Enright et al.
Protein interaction maps for complete genomes based on gene fusion events
Nature
(1999)
R. Jansen et al.
A bayesian networks approach for predicting protein–protein interactions from genomic data
Science
(2003)
M. Pellegrini et al.
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Proc. Nat. Acad. Sci.
(1999)
X.-W. Chen et al.
Prediction of protein–protein interactions using random decision forest framework
Bioinformatics
(2005)
S.P. Kanaan et al.
Inferring protein–protein interactions from multiple protein domain combinations
A.J. González et al.
Predicting domain–domain interaction based on domain profiles with feature selection and support vector machines
BMC Bioinf.
(2010)
