Elsevier

Methods

Volume 110, 1 November 2016, Pages 26-34
Methods

Prediction of Protein–Protein Interaction via co-occurring Aligned Pattern Clusters

https://doi.org/10.1016/j.ymeth.2016.07.018Get rights and content

Highlights

  • APCs were introduced to model sequence patterns with variable length and variants.

  • cAPC pairs were developed to model the co-occurring sequence patterns in PPI.

  • A method was proposed to turn a protein pair into a feature vector using cAPC pairs.

  • WeMine-PPI, a new PPI prediction method with outperforming results, was proposed.

  • WeMine-PPI allows biologically intuitive understanding of the feature vector.

Abstract

Predicting Protein–Protein Interaction (PPI) is important for making new discoveries in the molecular mechanisms inside a cell. Traditionally, new PPIs are identified through biochemical experiments but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Sequence-based prediction is currently the most readily applicable and cost-effective method. It exploits known PPI Databases to construct classifiers for predicting unknown PPIs based only on sequence data without requiring any other prior knowledge. Among existing sequence-based methods, most feature-based methods use exact sequence patterns with fixed length as features — a constraint which is biologically unrealistic. SVM with Pairwise String Kernel renders better predicting performance. However it is difficult to be biologically interpretable since it is kernel-based where no concrete feature values are computed. Here we have developed a novel method WeMine-P2P to overcome these drawbacks. By assuming that the regions/sites that mediate PPI are more conserved, WeMine-P2P first discovers/locates the conserved sequence patterns in protein sequences in the form of Aligned Pattern Clusters (APCs), allowing pattern variations with variable length. It then pairs up all APCs into a set of Co-Occurring APC (cAPC) pairs, and computes a cAPC-PPI score for each cAPC pair on all PPI pairs. It further constructs a feature vector composed of all cAPC pairs with their cAPC-PPI scores for each PPI pair and uses them for constructing a PPI predictor. Through 40 independent experiments, we showed that (1) WeMine-P2P outperforms the well-known algorithm, PIPE2, which also utilizes co-occurring amino acid sequence segments but does not allow variable lengths and pattern variations; (2) WeMine-P2P achieves satisfactory PPI prediction performance, comparable to the SVM-based methods particularly among unseen protein sequences with a potential reduction of feature dimension of 1280×; (3) Unlike SVM-based methods, WeMine-P2P renders interpretable biological features from which we observed that co-occurring sequence patterns from the compositional bias regions are more discriminative. WeMine-P2P is extendable to predict other biosequence interactions such as Protein–DNA interactions.

Introduction

Protein–Protein Interaction (PPI) is important for various biological processes and functions in living cells such as metabolic cycles, DNA transcription and replication, and signaling cascades [1]. Predicting PPI is thus critical for better understanding the molecular mechanisms inside the cell [1]. It is particularly useful for discovering unknown functions of a protein [2]. Following [3], [4], we refer a PPI as an interaction that brings two different proteins A and B into direct physical contact, i.e. heterodimeric interactions. In contrast, most homodimeric interactions, where proteins A and B are identical, are for maintaining the stability of the interacting complex but not for regulating cellular processes [5].

A number of experimental techniques, such as the two–hybrid systems [6], mass spectrometry [7] tandem affinity purification (TAP) [1], and microarray analysis [8], have been developed for systematic and large-scale prediction of PPIs. However, these experimental methods are costly, labor-intensive and time-consuming [9], [10]. Thus, existing PPI data obtained by these methods covers only a small fraction of the complete PPI networks [11], [12]. Moreover, these experimental methods usually suffer from high rates of both false positive and false negative predictions [13], [14]. Hence, developing effective and reliable computational methods based on sequence data alone to facilitate PPI prediction is of fundamental importance [15].

Existing computational methods for PPI prediction can be divided into four types depending on the input data. The first type such as Computational docking [16] requires three-dimensional structures of the target proteins. It can be applied to the target proteins to simulate if they can interact based on physiochemical properties such as shape complementarity, electrostatics, and biochemcial information [17]. The second type requires genomic information of the target proteins, e.g. gene fusion events [18], the conservation of gene-order [19], and the calculation of prior probabilities of genomic features between interacting proteins [20]. The third type requires prior biological knowledge of the target proteins, e.g. phylogenetic profiles [21], domain knowledge of proteins [22], [23], [24] and topological properties of proteins in PPI networks [11]. All these methods have limited applicability because the required data/information is not always available. The last type of methods require only sequence data. It uses the coded information inherent in sequences to predict if a protein pair interacts. For this reason, sequence-based methods are becoming popular, since sequence data is more readily available nowadays [2].

PIPE [25]/ PIPE2 [26], [27] is a well-established sequence-based method. Given a protein A, a protein B and a database of positive PPIs, PIPE simply counts how frequently all fixed-length protein sequence segments in Proteins A and B found co-occurring in the database. To achieve such task, all combinations of 20-mers between Protein A and Protein B are first enumerated using a sliding window with a width of 20. Then, the co-occurrence of each combination, e.g. MGIRRLVSVITRPIINKVNS from Protein A and GPEAIILTGTFDDWKGTLPM from Protein B, is searched in the database, and the frequency of their co-occurrence is counted. The sum of all counts is then computed. If the sum is larger than or equal to a threshold, the algorithm then predicts that protein A and B would interact. PIPE2 is a much faster version of PIPE. However, in spite of the satisfactory prediction performance, we observe that there is room for improvement. The key drawback of PIPE/PIPE2 is their use of a fixed-window of 20 amino acids. This is biologically unrealistic since functional regions such as the Short Linear Motifs (SLiMs [28]) have variable length from 3 to 15 amino acids [28]). Most of them are less than 10 amino acids [29]. Recently, a similar algorithm called VLASPD [2] that allows variable length of protein sequence segments is proposed. Nevertheless, it still uses exact patterns, which are neither realistic nor useful for biological analysis since it does not accept variants. Furthermore, it adopts a threshold-based prediction model, which does not allow nonlinear relationship between features and class outputs. Nevertheless, since PIPE2 is well benchmarked [3], we would compare our newly proposed algorithm with it.

Another well-established sequence-based method involves the use of Support Vector Machine (SVM) with Pairwise String Kernel [30], [31], [32], [15], [33], [34]. They encode a PPI pair into a feature vector composed by the co-occurrence of k-mer (a sequence of k residues) and train the SVM to predict if a protein pair can interact. For example, assume k = 3, a selected feature could be the number of counts of how often the 3-mers, say WTG and LGA co-occur in a protein pair along the entire sequence. Since all possible 3-mers are considered, the feature space could be as large as 203×203 (i.e. 64 millions) [4]. With SVM, even with such a high dimensionality, by using the kernel trick, neither computing nor storing the feature vector is needed. As no feature vectors are computed, in spite of achieving satisfactory prediction performance, it is hard to use SVM results to reveal or interpret why the feature space leads to its good performance. Thus, since the feature space is hardly interpretable, not much biological knowledge can be gained. Hence, to overcome this hurdle encountered in SVM is another key motivation of our proposed method. It should be noted that it is possible to generalize k-mer counting strategies allowing for gaps and mismatches [35]. However, these methods still do not allow a variable length. For example, if k is set to be 5, these methods would still consider all the 5-mers, while in WeMine-P2P, there could be 5-mers, 6-mers and 7-mers. In WeMine-P2P, we utilize the locally conserved sequence pattern clusters [36], [37] and their co-occurrence [38] to obtain biologically realistic and interpretable features that are flexible in pattern length while allowing variants. Experiments showed that our prediction results based on these features are comparable to those achieved by the SVM with Pairwise String Kernel approaches. In addition, the presence of concrete feature values makes the feature analysis of our models (and the subsequent biological interpretation) easier for biologists, comparing to the SVM with Pairwise String Kernel approaches, which have no concrete features and thus make feature analysis (and the subsequent biological interpretation) of the models difficult.

Motivated by the majority acceptance of sequence-based methods and the realization their drawbacks, the objective of our research as reported in this paper is to develop a new sequence-based prediction method which is (1) based on biologically interpretable features, (2) generating features to be more biologically realistic such as allowing variable lengths and pattern variations, and (3) achieving satisfactory prediction performance with biologically interpretable features. In this study, we propose a new algorithm WeMine-P2P, as illustrated in Fig. 1, to accomplish these objectives.

The remaining sections are outlined as follows. Section 2 explains in detail the WeMine-P2P prediction algorithm. Section 3 describes the dataset used and its pre-processing involved. Section 4 shows the design of the experiments and reports the results. Section 5 discusses the experimental results. Section 6 concludes the whole study.

Section snippets

Overview

We discover and locate APCs, then cAPC pairs, the “what” and “where” of the conserved regions, using them as discriminative features to construct the PPI classifier. This is elaborated in steps 1 to 6 in Fig. 1.

Problem definition

A protein pair, or a PPI pair is defined as a pair of protein sequences that can either be interacting or not interacting with one another. A Protein–Protein Interaction pair, referred to as a positive PPI pair, is defined as a pair of protein sequences that can interact with each other.

Material

In our experiments, 40 independent Yeast_Randam datasets were downloaded from [3] at http://www.marcottelab.org/differentialGeneralization. The procedure to obtain these 40 datasets is described below. Yeast Protein–Protein Interaction (PPI) data (Saccharomyces_cerevisiae-20100304.txt) containing the protein sequences and the positive PPI pairs was acquired from the protein interaction network analysis platform [42]. Further pre-processing was applied to the proteins therein. First, the

Experimental design and parameter setting

As mentioned in Section 3 Materials, we obtained in total 40 independent datasets provided by [3]. Each dataset has a training set of 16,000 PPIs and a testing set of 4000 PPIs (80%-20% split). In our experiment, we first extracted features (Step 1, Step 2) from the training set, then used the features to construct PPI matrix (Step 3, Step 4) and trained a predictive model on the PPI matrix. In Step 1, we used WeMine Aligned Pattern Clustering algorithm [37], [36] to obtain APCs with length

Discussions

The contributions of this study are summarized as follows: First, Aligned Pattern Clusters (APCs) [37], [36] were introduced to the represent the sequence patterns in Protein–Protein Interaction (PPI) (between two protein chains). This study demonstrates the first successful use of APCs in PPI, comparing to the previous studies [37], [36], [54], [38]. Second, based on APCs, co-occurring Aligned Pattern Cluster pairs (cAPC pairs) were newly developed to model the co-occurring sequence patterns

Conclusions

Sequence-based machine learning methods are becoming more and more popular because they are readily applicable and achieve satisfactory performance. However, existing methods prohibit researchers from gaining biological knowledge in PPI as they adopt features that are not biologically realistic, such as fixing the pattern length and using exact patterns, or adopt string kernels to skip the computation of features. In this study, we have furnished a new sequence-based method WeMine-P2P that

Acknowledgements

This research is supported by NSERC Post Graduate Scholarship, NSERC Discovery Grant and Waterloo/China Graduate Scholarship.

References (57)

  • Y. Park et al.

    Flaws in evaluation schemes for pair-input computational predictions

    Nat. Methods

    (2012)
  • T. Hamp et al.

    Evolutionary profiles improve protein–protein interaction prediction from sequence

    Bioinformatics

    (2015)
  • I.M. Nooren et al.

    Diversity of protein–protein interactions

    The EMBO J.

    (2003)
  • T. Ito et al.

    A comprehensive two-hybrid analysis to explore the yeast protein interactome

    Proc. Nat. Acad. Sci.

    (2001)
  • Y. Ho et al.

    Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry

    Nature

    (2002)
  • M.F. Templin et al.

    Protein microarrays: promising tools for proteomic research

    Proteomics

    (2003)
  • Z.-H. You et al.

    Predicting protein–protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest

    PloS One

    (2015)
  • S. Pitre et al.

    Computational methods for predicting protein–protein interactions

  • Z.-H. You et al.

    Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data

    Bioinformatics

    (2010)
  • X. Luo, Z. You, M. Zhou, S. Li, H. Leung, Y. Xia, Q. Zhu, A highly efficient approach to protein interactome mapping...
  • J. Shen et al.

    Predicting protein–protein interactions based only on sequences information

    Proc. Nat. Acad. Sci.

    (2007)
  • B.G. Pierce et al.

    Zdock server: interactive docking prediction of protein–protein complexes and symmetric multimers

    Bioinformatics

    (2014)
  • A.J. Enright et al.

    Protein interaction maps for complete genomes based on gene fusion events

    Nature

    (1999)
  • R. Jansen et al.

    A bayesian networks approach for predicting protein–protein interactions from genomic data

    Science

    (2003)
  • M. Pellegrini et al.

    Assigning protein functions by comparative genome analysis: protein phylogenetic profiles

    Proc. Nat. Acad. Sci.

    (1999)
  • X.-W. Chen et al.

    Prediction of protein–protein interactions using random decision forest framework

    Bioinformatics

    (2005)
  • S.P. Kanaan et al.

    Inferring protein–protein interactions from multiple protein domain combinations

  • A.J. González et al.

    Predicting domain–domain interaction based on domain profiles with feature selection and support vector machines

    BMC Bioinf.

    (2010)
  • Cited by (9)

    • Identification of potential drugs for diffuse large b-cell lymphoma based on bioinformatics and Connectivity Map database

      2018, Pathology Research and Practice
      Citation Excerpt :

      An FDR < 0.1 as the cut-off value for being statistically significant. The STRING database (http://www.string-db.org/) database and Cytoscape software (version 3.6.0; http://www.cytoscape.org/) were used to construct and analyze the interaction associations of proteins encoded by these DEGs from overlapping pathways [32,33]. In the protein-protein interaction (PPI) networks, nodes represent the DEGs and edges represent the interactions between DEGs.

    • PATSIM: Prediction and analysis of protein sequences using hybrid Knuth-Morris Pratt (KMP) and Boyer-Moore (BM) algorithm

      2018, Gene
      Citation Excerpt :

      Functional domain information and sequential evolution information are combined using a fusion ensemble classifier for structural classification named PFP-FunDSeqE (Shen and Chou, 2009). The conserved sequence patterns in protein sequences have been identified using WeMine-P2P in the form of Aligned Pattern clusters that allows pattern variations with variable length (Sze-To A et al., 2016). To describe the pseudo amino acid composition, hydrophobic patterns of amino acid and average power-spectral density (APSD) are introduced (Zhang et al., 2006).

    • Structural study of the effects of mutations in proteins to identify the molecular basis of the loss of local structural fluidity leading to the onset of autoimmune diseases

      2017, Biochemical and Biophysical Research Communications
      Citation Excerpt :

      Earlier studies showed that an altered PPI could be one of the causes for disease onset due to mutations [16,17]. Computational and bioinformatic studies of altered PPIs, of SAV's of a proteins, correlated with diseases, were also been found useful [18]. In our this dataset, most of the amino acid residues (nearly 61%) were found to be present at the protein core rather than on the protein surface.

    • Editorial

      2016, Methods
    View all citing articles on Scopus
    View full text