KOGAL (KnOwledge Graph ALignment) is a tool for local network alignment.

Written By: Warith Eddine DJEDDI (waritheddine@yahoo.fr), Sadok BEN YAHIA (sadok.ben@taltech.ee) and Engelbert MEPHU NGUIFO (engelbert.mephu_nguifo@uca.fr)

This page describes the usage of the command line interface of KOGAL. The KOGAL executable is compiled for Linux x86_64 and 64-bit Windows platforms. The KOGAL Python source code and the datasets used to generate the results are publicly available at https://perso.isima.fr/~enmephun/FILES/KOGAL/KOGAL.zip

The program KOGAL finds a local alignment of a pair of input protein-protein interaction networks. It applies knowledge graph embedding models (e.g. TransE, DistMult, TransR) for network alignment. Given two networks with N1 and N2 nodes, respectively, it returns a matching between the input networks.
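
As a rough illustration of what such embedding models compute, here is a minimal sketch of the textbook TransE and DistMult scoring functions on plain Python lists. The `gamma` margin follows the convention of shifting the TransE distance by a constant; whether KOGAL uses exactly these forms internally is an assumption, and the toy vectors are made up.

```python
import math


def transe_score(head, relation, tail, gamma=0.0):
    """TransE: gamma minus the L2 distance between head + relation and tail."""
    distance = math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )
    return gamma - distance


def distmult_score(head, relation, tail):
    """DistMult: sum of the element-wise product of head, relation and tail."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
head, relation, tail = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
transe = transe_score(head, relation, tail, gamma=5.9)
distmult = distmult_score(head, relation, tail)
print(transe, distmult)
```

A higher score means the triple (head, relation, tail) is considered more plausible by the model.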

A) KOGAL workflow

 

B) How to execute the source code

To understand how to use the algorithm, let's start with an example of pairwise alignment of two input networks. The multiple case is similar.

(1) Suppose the species are named 'A' and 'B'

(2) Create the graph files in a single directory. You'll need five kinds of files:

(2.1) Network files: You'll need A.txt and B.txt, tab-separated files where each line contains an interaction. For example, the first 5 lines of A.txt are:

 ====== BEGIN ========
 INTERACTOR_A INTERACTOR_B
 a0 a1
 a0 a2
 a0 a3
 a0 a4
 ====== END ========

  Columns are separated by tabs. The first line is a header line of the
 form shown above. All other lines describe an interaction, one per
 line.

 There may be a third column which contains edge weights (0 < wt <= 1);
 in that case the header line should have a third column titled Weight_Edge.

(2.2) Entity embedding file

(2.3) Relation embedding file

(2.4) Mapping from entity name to entity identifiers

(2.5) Mapping from relation_name to relation identifiers
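
The network file format described in (2.1) can be read with a few lines of Python. The sketch below is not part of KOGAL; the function name `read_network` is hypothetical, and defaulting the weight to 1.0 when the third column is absent is an assumption made here for illustration.

```python
import csv
import io


def read_network(handle):
    """Read a tab-separated interaction list that starts with a header line.

    Returns a list of (node_a, node_b, weight) tuples. The weight defaults
    to 1.0 when the optional third column is absent (an assumption, not
    documented KOGAL behaviour).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # INTERACTOR_A, INTERACTOR_B[, Weight_Edge]
    weighted = len(header) >= 3
    edges = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        weight = float(row[2]) if weighted and len(row) >= 3 else 1.0
        if not 0.0 < weight <= 1.0:
            raise ValueError(f"edge weight outside (0, 1]: {row}")
        edges.append((row[0], row[1], weight))
    return edges


# The five example lines from (2.1), with real tab characters.
sample = "INTERACTOR_A\tINTERACTOR_B\na0\ta1\na0\ta2\na0\ta3\na0\ta4\n"
edges = read_network(io.StringIO(sample))
print(len(edges), edges[0])
```

For a file on disk, pass `open("A.txt", newline="")` instead of the `io.StringIO` object.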

(3) Call the code using two different networks (e.g. Homo sapiens and Saccharomyces cerevisiae S288C) and two different clustering methods (e.g. IPCA, MCODE). Here are some samples:

(3.1) Using the IPCA method during the alignment process:

  python3 KOGAL.py --ppinet1=DATALMNAKG/HINTHUMAN/HomoSapiensbinaryhq.txt --ppinet2=DATALMNAKG/HINTSaccharomycesCerevisiaeS288C/SaccharomycesCerevisiaeS288Cbinaryhq.txt --clstm ipca

(3.2) Using the MCODE method during the alignment process:

  python3 KOGAL.py --ppinet1=DATALMNAKG/HINTHUMAN/HomoSapiensbinaryhq.txt --ppinet2=DATALMNAKG/HINTSaccharomycesCerevisiaeS288C/SaccharomycesCerevisiaeS288Cbinaryhq.txt --clstm mcode

 

 

The options are as follows (you can also use the "-h" or "--help" flag):

 usage: KOGAL.py [<args>] [-h | --help]
 Local network alignment using Knowledge Graph Embedding Models
 options:
   -h, --help            show this help message and exit
   --ppinet1 PPINET1     Source PPI network
   --ppinet2 PPINET2     Target PPI network
   --entity_emb_path     Entity embedding file
   --relation_emb_path   Relation embedding file
   --entity_idmap_path   Entity mapping file
   --relation_idmap_path Relation mapping file
   --gamma GAMMA
   --SCORE_THRESHOLD SCORE_THRESHOLD
                    Minimum score needed for cluster detection. Any
                    cluster whose score is less than a given threshold is
                    abandoned
   --SEED_THRESHOLD SEED_THRESHOLD
                    Applying a threshold to filter the pertinent seed node
                    pairs in order to detect the pairs of initial clusters
   --alpha ALPHA         
                    Tuning the contribution between the local and global
                    edge score computed from the knowledge graph embedding
   -save SAVE_PATH, --save_path SAVE_PATH
   --clstm CLSTM    Choosing graph clustering techniques (i.e. ipca,
                    mcode or coach)
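
When KOGAL is launched from another Python script, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of KOGAL; it only uses option names listed in the help text above, and the `results` output directory is a made-up example.

```python
import shlex


def build_kogal_command(ppinet1, ppinet2, clstm="ipca", **options):
    """Assemble a KOGAL.py command line as a list of arguments.

    Extra keyword arguments are passed through as `--name value` pairs, so
    only option names listed in the help text should be used.
    """
    command = [
        "python3", "KOGAL.py",
        f"--ppinet1={ppinet1}",
        f"--ppinet2={ppinet2}",
        "--clstm", clstm,
    ]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


cmd = build_kogal_command(
    "DATALMNAKG/HINTHUMAN/HomoSapiensbinaryhq.txt",
    "DATALMNAKG/HINTSaccharomycesCerevisiaeS288C/SaccharomycesCerevisiaeS288Cbinaryhq.txt",
    clstm="mcode",
    save_path="results",  # hypothetical output directory
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to start the alignment.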

(4) The output

  The main files are:
   (4.1) The alignment produces three types of output, all written to the directory given by the --save_path argument.
        (4.1.1) the "alignment_spec1_spec2_protein_ppi.txt" file: each line represents an aligned cluster from the two compared species.
        (4.1.2) the "alignment_spec1_spec2_pairwise_protein_PPI.txt" file: each line represents an aligned protein pair, with the first and second columns denoting the nodes from the source and target network, respectively.
        (4.1.3) two cluster files with the same names as the input network files. Both files have the same number of lines, and each line represents a discovered cluster. Each protein is represented by a string, and the fields on a line are separated by tabs. A line starts with an integer giving the number of proteins in the cluster (e.g., n1 and m1), followed by the proteins that belong to it. Clusters on the same line number of the two files are aligned clusters.
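
The two cluster files described in (4.1.3) can be paired up line by line. The following is a minimal sketch under the format stated above (a protein count followed by tab-separated protein names); the function names and the toy protein identifiers are hypothetical.

```python
import io


def read_clusters(handle):
    """Parse one cluster per line: a protein count followed by protein names."""
    clusters = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            # A blank line would shift the line-by-line pairing, so refuse it.
            raise ValueError(f"line {line_number}: empty line")
        count, proteins = int(fields[0]), fields[1:]
        if count != len(proteins):
            raise ValueError(
                f"line {line_number}: expected {count} proteins, "
                f"found {len(proteins)}"
            )
        clusters.append(proteins)
    return clusters


def pair_clusters(source_handle, target_handle):
    """Return (source_cluster, target_cluster) pairs, matched by line number."""
    source = read_clusters(source_handle)
    target = read_clusters(target_handle)
    if len(source) != len(target):
        raise ValueError("the two cluster files must have the same number of lines")
    return list(zip(source, target))


# Toy file contents standing in for the two generated cluster files.
species_a = io.StringIO("3\ta0\ta1\ta2\n2\ta5\ta7\n")
species_b = io.StringIO("2\tb0\tb4\n3\tb1\tb2\tb3\n")
pairs = pair_clusters(species_a, species_b)
for source_cluster, target_cluster in pairs:
    print(source_cluster, "<->", target_cluster)
```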

C) The generation of the yeast-human reference conserved protein complexes:

We provide the source code used to create two types of gold standard for the Homo sapiens and Saccharomyces cerevisiae S288C species. The source code (under the sub-folder "Gold-standard-human-yeast") needs the following files:

     (1) the CYC2008 and CORUM complexes files
     
     (2) the Gene ontology (GO) File
     
     (3) the GO annotation files of the two compared species (i.e. Homo sapiens and Saccharomyces cerevisiae S288C), which contain GO annotations for the proteins in the input networks. The format of these GO annotation files should comply with that of the GO Consortium.

The two gold standard reference complex files are in the subfolder Gold-standard-human-yeast/files_gold_standard_reference.

D) Knowledge graph embedding description

The whole dataset contains five parts:


 

DGLBACKEND=pytorch dglke_train --dataset LNA --model_name DistMult --batch_size 1000 --hidden_dim 200 --gamma 5.9 --lr 0.25 --max_step 80000 --log_interval 100 -adv --regularization_coef 1.00E-33 --test --data_path ./train/ --format raw_udd_hrt --data_files data_train.tsv data_valid.tsv data_test.tsv

Four files are generated by the execution of script_train.sh:

 1) LNA_DistMult_entity.npy, NumPy binary data, storing the entity embedding
 
 2) LNA_DistMult_relation.npy, NumPy binary data, storing the relation embedding
 
 3) entities.tsv, mapping from entity_name to entity_id
 
 4) relations.tsv, mapping from relation_name to relation_id
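
The mapping files and the embedding matrices fit together as a simple lookup: the mapping gives an integer id for a name, and that id is used as the row index into the embedding matrix. The sketch below illustrates this with toy data. It assumes each mapping line has the form "id TAB name" and that row i of the matrix belongs to the entity with id i; check both against your own files. The function names are hypothetical.

```python
import csv
import io


def read_id_map(handle):
    """Map names to integer ids from a two-column tab-separated file.

    Assumes each line is "<id> TAB <name>"; swap the columns if your
    entities.tsv / relations.tsv is written the other way round.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[1]] = int(row[0])
    return mapping


def embedding_of(name, id_map, matrix):
    """Return the embedding row for `name` (row i is assumed to be id i)."""
    return matrix[id_map[name]]


# Toy stand-ins. With the real files one would instead use
#   matrix = numpy.load("LNA_DistMult_entity.npy")
#   id_map = read_id_map(open("entities.tsv", newline=""))
entities_tsv = io.StringIO("0\ta0\n1\ta1\n2\tb0\n")
matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
id_map = read_id_map(entities_tsv)
print(embedding_of("a1", id_map, matrix))
```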

To use the pretrained embeddings, load the entity and relation embeddings by passing the corresponding arguments to KOGAL. The following call uses two different networks (i.e. Homo sapiens and Saccharomyces cerevisiae S288C), the IPCA clustering method and the new embedding parameters:


 

python3 KOGAL.py --ppinet1=DATALMNAKG/HINTHUMAN/HomoSapiensbinaryhq.txt --ppinet2=DATALMNAKG/HINTSaccharomycesCerevisiaeS288C/SaccharomycesCerevisiaeS288Cbinaryhq.txt --clstm ipca --entity_emb_path LNA_DistMult_entity.npy --relation_emb_path LNA_DistMult_relation.npy --entity_idmap_path entities.tsv --relation_idmap_path relations.tsv

E) Python code for evaluation metrics, including the fraction of matched reference conserved complexes (Frac), complex-wise sensitivity (Sn), positive predictive value (PPV), geometric accuracy (ACC), and maximum matching ratio (MMR)

Run the script "evaluation_cl1_reproducibility/run_all.sh" to reproduce the evaluation results (i.e., Frac, Sn, PPV, ACC and MMR) reported in the paper. Call the code using two different input results; these are the cluster alignments generated by the KOGAL approach.

Note: The Python code for the evaluation metrics is taken from the following published article:
Nepusz, Tamás, Haiyuan Yu, and Alberto Paccanaro. "Detecting overlapping protein complexes in protein-protein interaction networks." Nature methods 9.5 (2012): 471-472.
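
For readers who want to see how three of these metrics are defined, here is a minimal sketch of Sn, PPV and ACC in their commonly used complex-wise form, computed from the overlap between reference complexes and predicted clusters. This is an independent illustration, not the evaluation script cited above; the function name and the toy protein sets are hypothetical, and Frac and MMR are not covered.

```python
import math


def sn_ppv_acc(reference, predicted):
    """Complex-wise sensitivity, PPV and geometric accuracy.

    `reference` and `predicted` are lists of sets of protein names.
    """
    # overlap[i][j] = number of proteins shared by reference i and prediction j
    overlap = [[len(ref & pred) for pred in predicted] for ref in reference]
    columns = list(zip(*overlap))

    # Sn: best-matching overlap per reference complex, over total reference size.
    sn_den = sum(len(ref) for ref in reference)
    sn = sum(max(row, default=0) for row in overlap) / sn_den if sn_den else 0.0

    # PPV: best-matching overlap per predicted cluster, over total overlap.
    ppv_den = sum(sum(col) for col in columns)
    ppv = sum(max(col, default=0) for col in columns) / ppv_den if ppv_den else 0.0

    # ACC: geometric mean of Sn and PPV.
    return sn, ppv, math.sqrt(sn * ppv)


reference = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"d", "e", "f"}]
sn, ppv, acc = sn_ppv_acc(reference, predicted)
print(sn, ppv, acc)
```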