Prediction of Protein–Protein Interaction via co-occurring Aligned Pattern Clusters
Introduction
Protein–Protein Interaction (PPI) is important for various biological processes and functions in living cells such as metabolic cycles, DNA transcription and replication, and signaling cascades [1]. Predicting PPI is thus critical for better understanding the molecular mechanisms inside the cell [1]. It is particularly useful for discovering unknown functions of a protein [2]. Following [3], [4], we refer a PPI as an interaction that brings two different proteins A and B into direct physical contact, i.e. heterodimeric interactions. In contrast, most homodimeric interactions, where proteins A and B are identical, are for maintaining the stability of the interacting complex but not for regulating cellular processes [5].
A number of experimental techniques, such as the two–hybrid systems [6], mass spectrometry [7] tandem affinity purification (TAP) [1], and microarray analysis [8], have been developed for systematic and large-scale prediction of PPIs. However, these experimental methods are costly, labor-intensive and time-consuming [9], [10]. Thus, existing PPI data obtained by these methods covers only a small fraction of the complete PPI networks [11], [12]. Moreover, these experimental methods usually suffer from high rates of both false positive and false negative predictions [13], [14]. Hence, developing effective and reliable computational methods based on sequence data alone to facilitate PPI prediction is of fundamental importance [15].
Existing computational methods for PPI prediction can be divided into four types depending on the input data. The first type such as Computational docking [16] requires three-dimensional structures of the target proteins. It can be applied to the target proteins to simulate if they can interact based on physiochemical properties such as shape complementarity, electrostatics, and biochemcial information [17]. The second type requires genomic information of the target proteins, e.g. gene fusion events [18], the conservation of gene-order [19], and the calculation of prior probabilities of genomic features between interacting proteins [20]. The third type requires prior biological knowledge of the target proteins, e.g. phylogenetic profiles [21], domain knowledge of proteins [22], [23], [24] and topological properties of proteins in PPI networks [11]. All these methods have limited applicability because the required data/information is not always available. The last type of methods require only sequence data. It uses the coded information inherent in sequences to predict if a protein pair interacts. For this reason, sequence-based methods are becoming popular, since sequence data is more readily available nowadays [2].
PIPE [25]/ PIPE2 [26], [27] is a well-established sequence-based method. Given a protein A, a protein B and a database of positive PPIs, PIPE simply counts how frequently all fixed-length protein sequence segments in Proteins A and B found co-occurring in the database. To achieve such task, all combinations of 20-mers between Protein A and Protein B are first enumerated using a sliding window with a width of 20. Then, the co-occurrence of each combination, e.g. MGIRRLVSVITRPIINKVNS from Protein A and GPEAIILTGTFDDWKGTLPM from Protein B, is searched in the database, and the frequency of their co-occurrence is counted. The sum of all counts is then computed. If the sum is larger than or equal to a threshold, the algorithm then predicts that protein A and B would interact. PIPE2 is a much faster version of PIPE. However, in spite of the satisfactory prediction performance, we observe that there is room for improvement. The key drawback of PIPE/PIPE2 is their use of a fixed-window of 20 amino acids. This is biologically unrealistic since functional regions such as the Short Linear Motifs (SLiMs [28]) have variable length from 3 to 15 amino acids [28]). Most of them are less than 10 amino acids [29]. Recently, a similar algorithm called VLASPD [2] that allows variable length of protein sequence segments is proposed. Nevertheless, it still uses exact patterns, which are neither realistic nor useful for biological analysis since it does not accept variants. Furthermore, it adopts a threshold-based prediction model, which does not allow nonlinear relationship between features and class outputs. Nevertheless, since PIPE2 is well benchmarked [3], we would compare our newly proposed algorithm with it.
Another well-established sequence-based method involves the use of Support Vector Machine (SVM) with Pairwise String Kernel [30], [31], [32], [15], [33], [34]. They encode a PPI pair into a feature vector composed by the co-occurrence of k-mer (a sequence of k residues) and train the SVM to predict if a protein pair can interact. For example, assume k = 3, a selected feature could be the number of counts of how often the 3-mers, say WTG and LGA co-occur in a protein pair along the entire sequence. Since all possible 3-mers are considered, the feature space could be as large as (i.e. 64 millions) [4]. With SVM, even with such a high dimensionality, by using the kernel trick, neither computing nor storing the feature vector is needed. As no feature vectors are computed, in spite of achieving satisfactory prediction performance, it is hard to use SVM results to reveal or interpret why the feature space leads to its good performance. Thus, since the feature space is hardly interpretable, not much biological knowledge can be gained. Hence, to overcome this hurdle encountered in SVM is another key motivation of our proposed method. It should be noted that it is possible to generalize k-mer counting strategies allowing for gaps and mismatches [35]. However, these methods still do not allow a variable length. For example, if k is set to be 5, these methods would still consider all the 5-mers, while in WeMine-P2P, there could be 5-mers, 6-mers and 7-mers. In WeMine-P2P, we utilize the locally conserved sequence pattern clusters [36], [37] and their co-occurrence [38] to obtain biologically realistic and interpretable features that are flexible in pattern length while allowing variants. Experiments showed that our prediction results based on these features are comparable to those achieved by the SVM with Pairwise String Kernel approaches. In addition, the presence of concrete feature values makes the feature analysis of our models (and the subsequent biological interpretation) easier for biologists, comparing to the SVM with Pairwise String Kernel approaches, which have no concrete features and thus make feature analysis (and the subsequent biological interpretation) of the models difficult.
Motivated by the majority acceptance of sequence-based methods and the realization their drawbacks, the objective of our research as reported in this paper is to develop a new sequence-based prediction method which is (1) based on biologically interpretable features, (2) generating features to be more biologically realistic such as allowing variable lengths and pattern variations, and (3) achieving satisfactory prediction performance with biologically interpretable features. In this study, we propose a new algorithm WeMine-P2P, as illustrated in Fig. 1, to accomplish these objectives.
The remaining sections are outlined as follows. Section 2 explains in detail the WeMine-P2P prediction algorithm. Section 3 describes the dataset used and its pre-processing involved. Section 4 shows the design of the experiments and reports the results. Section 5 discusses the experimental results. Section 6 concludes the whole study.
Section snippets
Overview
We discover and locate APCs, then cAPC pairs, the “what” and “where” of the conserved regions, using them as discriminative features to construct the PPI classifier. This is elaborated in steps 1 to 6 in Fig. 1.
Problem definition
A protein pair, or a PPI pair is defined as a pair of protein sequences that can either be interacting or not interacting with one another. A Protein–Protein Interaction pair, referred to as a positive PPI pair, is defined as a pair of protein sequences that can interact with each other.
Material
In our experiments, 40 independent Yeast_Randam datasets were downloaded from [3] at http://www.marcottelab.org/differentialGeneralization. The procedure to obtain these 40 datasets is described below. Yeast Protein–Protein Interaction (PPI) data (Saccharomyces_cerevisiae-20100304.txt) containing the protein sequences and the positive PPI pairs was acquired from the protein interaction network analysis platform [42]. Further pre-processing was applied to the proteins therein. First, the
Experimental design and parameter setting
As mentioned in Section 3 Materials, we obtained in total 40 independent datasets provided by [3]. Each dataset has a training set of 16,000 PPIs and a testing set of 4000 PPIs (80%-20% split). In our experiment, we first extracted features (Step 1, Step 2) from the training set, then used the features to construct PPI matrix (Step 3, Step 4) and trained a predictive model on the PPI matrix. In Step 1, we used WeMine Aligned Pattern Clustering algorithm [37], [36] to obtain APCs with length
Discussions
The contributions of this study are summarized as follows: First, Aligned Pattern Clusters (APCs) [37], [36] were introduced to the represent the sequence patterns in Protein–Protein Interaction (PPI) (between two protein chains). This study demonstrates the first successful use of APCs in PPI, comparing to the previous studies [37], [36], [54], [38]. Second, based on APCs, co-occurring Aligned Pattern Cluster pairs (cAPC pairs) were newly developed to model the co-occurring sequence patterns
Conclusions
Sequence-based machine learning methods are becoming more and more popular because they are readily applicable and achieve satisfactory performance. However, existing methods prohibit researchers from gaining biological knowledge in PPI as they adopt features that are not biologically realistic, such as fixing the pattern length and using exact patterns, or adopt string kernels to skip the computation of features. In this study, we have furnished a new sequence-based method WeMine-P2P that
Acknowledgements
This research is supported by NSERC Post Graduate Scholarship, NSERC Discovery Grant and Waterloo/China Graduate Scholarship.
References (57)
- et al.
Large-scale analyses of glycosylation in cellulases
Genomics Proteomics Bioinf.
(2009) - et al.
Yeast two-hybrid contributions to interactome mapping
Curr. Opin. Biotechnol.
(2006) - et al.
Modelling protein docking using shape complementarity, electrostatics and biochemical information
J. Mol. Biol.
(1997) - et al.
Conservation of gene order: a fingerprint of proteins that physically interact
Trends Biochem. Sci.
(1998) - et al.
Understanding protein non-folding
Biochim. Biophys. Acta (BBA), Proteins Proteomics
(2010) - et al.
A systematic survey identifies prions and illuminates sequence features of prionogenic proteins
Cell
(2009) - et al.
Intrinsically disordered protein
J. Mol. Graphics Model.
(2001) - et al.
Peptides mediating interaction networks: new leads at last
Curr. Opin. Biotechnol.
(2006) - et al.
Functional organization of the yeast proteome by systematic analysis of protein complexes
Nature
(2002) - et al.
Discovering variable-length patterns in protein sequences for protein–protein interaction prediction
IEEE Transac. Nanobiosci.
(2015)
Flaws in evaluation schemes for pair-input computational predictions
Nat. Methods
Evolutionary profiles improve protein–protein interaction prediction from sequence
Bioinformatics
Diversity of protein–protein interactions
The EMBO J.
A comprehensive two-hybrid analysis to explore the yeast protein interactome
Proc. Nat. Acad. Sci.
Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry
Nature
Protein microarrays: promising tools for proteomic research
Proteomics
Predicting protein–protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest
PloS One
Computational methods for predicting protein–protein interactions
Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data
Bioinformatics
Predicting protein–protein interactions based only on sequences information
Proc. Nat. Acad. Sci.
Zdock server: interactive docking prediction of protein–protein complexes and symmetric multimers
Bioinformatics
Protein interaction maps for complete genomes based on gene fusion events
Nature
A bayesian networks approach for predicting protein–protein interactions from genomic data
Science
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Proc. Nat. Acad. Sci.
Prediction of protein–protein interactions using random decision forest framework
Bioinformatics
Inferring protein–protein interactions from multiple protein domain combinations
Predicting domain–domain interaction based on domain profiles with feature selection and support vector machines
BMC Bioinf.
Cited by (9)
Identification of potential drugs for diffuse large b-cell lymphoma based on bioinformatics and Connectivity Map database
2018, Pathology Research and PracticeCitation Excerpt :An FDR < 0.1 as the cut-off value for being statistically significant. The STRING database (http://www.string-db.org/) database and Cytoscape software (version 3.6.0; http://www.cytoscape.org/) were used to construct and analyze the interaction associations of proteins encoded by these DEGs from overlapping pathways [32,33]. In the protein-protein interaction (PPI) networks, nodes represent the DEGs and edges represent the interactions between DEGs.
PATSIM: Prediction and analysis of protein sequences using hybrid Knuth-Morris Pratt (KMP) and Boyer-Moore (BM) algorithm
2018, GeneCitation Excerpt :Functional domain information and sequential evolution information are combined using a fusion ensemble classifier for structural classification named PFP-FunDSeqE (Shen and Chou, 2009). The conserved sequence patterns in protein sequences have been identified using WeMine-P2P in the form of Aligned Pattern clusters that allows pattern variations with variable length (Sze-To A et al., 2016). To describe the pseudo amino acid composition, hydrophobic patterns of amino acid and average power-spectral density (APSD) are introduced (Zhang et al., 2006).
Structural study of the effects of mutations in proteins to identify the molecular basis of the loss of local structural fluidity leading to the onset of autoimmune diseases
2017, Biochemical and Biophysical Research CommunicationsCitation Excerpt :Earlier studies showed that an altered PPI could be one of the causes for disease onset due to mutations [16,17]. Computational and bioinformatic studies of altered PPIs, of SAV's of a proteins, correlated with diseases, were also been found useful [18]. In our this dataset, most of the amino acid residues (nearly 61%) were found to be present at the protein core rather than on the protein surface.
Editorial
2016, Methods