• What is the DRKG?

    Drug Repurposing Knowledge Graph (DRKG) is a comprehensive biological knowledge graph relating genes, compounds, diseases, biological processes, side effects and symptoms. It combines information from six existing databases (DrugBank, Hetionet, GNBR, STRING, IntAct and DGIdb) with data from other sources. It contains 97,238 entities belonging to 13 entity types and 5,874,261 triplets belonging to 107 edge types. Each edge type describes one kind of interaction between one of the 17 entity-type pairs (several kinds of interaction are possible between the same pair of entity types), as depicted in the figure below.

    With DRKG you can score potential relations between two previously unconnected entities.

    connectivity.png

  • How does the DRKG work?

    Representing the graph

    A graph is made up of heads, tails and the relations between them. In DRKG, heads and tails are written in one of the following formats:

    • <Entity-type>::<ID> e.g. Gene::100860742
    • <Entity-type>::<ID Format>:<ID> e.g. Disease::OMIM:256600

    Entity-type can be any of the 13 types shown in the figure at the top of this page, i.e. Side-effect, Atc, Pharmacologic class, Pathway, Compound, Symptom, Disease, Molecular function, Gene, Anatomy, Biological process, Tax, Cellular component.

    The paper linked at the bottom of this section says the following about the IDs of heads and tails:
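
    The two entity formats above can be split into their parts with a few lines of Python. This is a minimal sketch, not part of DRKG or the app: the `Entity` tuple and the `parse_entity` helper are hypothetical names, and the code assumes that the first single colon after `::` separates the ID format from the ID.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    entity_type: str          # e.g. "Gene"
    id_format: Optional[str]  # e.g. "OMIM"; None when the ID has no prefix
    identifier: str           # e.g. "100860742"


def parse_entity(name: str) -> Entity:
    """Split '<Entity-type>::<ID>' or '<Entity-type>::<ID Format>:<ID>'."""
    entity_type, sep, rest = name.partition("::")
    if not sep or not entity_type or not rest:
        raise ValueError(f"not a DRKG-style entity name: {name!r}")
    # Assumption: the first single colon separates the ID format from the ID.
    id_format, sep, identifier = rest.partition(":")
    if not sep:
        return Entity(entity_type, None, rest)
    return Entity(entity_type, id_format, identifier)


gene = parse_entity("Gene::100860742")
disease = parse_entity("Disease::OMIM:256600")
```

    Here `gene.id_format` is `None` because the gene ID carries no prefix, while `disease.id_format` is `"OMIM"`.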

    “Data sources use one of several ID spaces to represent genes, compounds, diseases and others. For example, the same chemical compound may be represented in the drugbank compound ID space in DrugBank and in the chembl compound ID space in the DGIdb. To ensure that information from different sources in integrating correctly, we map biological entities to a common ID space using the following rules:

    • Compound entities are mapped to the drugbank compound ID space and if not possible to the chembl compound ID space. If a compound can not be found to either of the two we use the native ID space and we include the name of the source as part of the entity’s name (e.g., Compound::brenda:1695336).
    • Gene entities are mapped to the Entrez ID space.
    • Disease entities are mapped to the MESH ID space.
    • The remaining biological entities appear only in a single data source and hence we use the data source’s ID.

    These rules are applied to the biological entities per database to map the entities to the common ID space. Finally, in order to avoid relations for which we do not have enough data to train good embeddings, we exclude relations types that have less than 50 edges”

    Relations between heads and tails are formatted as:

    • <Database>::<Relation type>::<Head Category>:<Tail Category>

      e.g. GNBR::Sa::Compound:Disease or INTACT::UBIQUITINATION REACTION::Gene:Gene
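
    As an illustration of the relation format above, the sketch below splits a relation name into its four parts. `Relation` and `parse_relation` are hypothetical helpers, not part of DRKG or the app, and the parser is deliberately strict: any name that does not follow the documented pattern raises a `ValueError`.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    database: str       # e.g. "GNBR"
    relation_type: str  # e.g. "Sa"
    head_type: str      # e.g. "Compound"
    tail_type: str      # e.g. "Disease"


def parse_relation(name: str) -> Relation:
    """Split '<Database>::<Relation type>::<Head Category>:<Tail Category>'."""
    parts = name.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected three '::'-separated parts: {name!r}")
    database, relation_type, pair = parts
    head_type, sep, tail_type = pair.partition(":")
    if not sep or not head_type or not tail_type:
        raise ValueError(f"expected '<Head>:<Tail>' at the end of: {name!r}")
    return Relation(database, relation_type, head_type, tail_type)


treats = parse_relation("GNBR::Sa::Compound:Disease")
ubiquitination = parse_relation("INTACT::UBIQUITINATION REACTION::Gene:Gene")
```

    The head and tail categories recovered this way can be compared with the entity types of the input heads and tails before a triplet is submitted.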

    The possible databases are GNBR, Drugbank, Hetionet, STRING, IntAct, DGIdb and bioarx. Each database has its own set of relation types:

    • Tables showing relation types for each database. These can also be found in the paper at the bottom of this section

    Edge scores

    The edge score represents how likely a given head, relation, tail (h, r, t) triplet is. A score is calculated for every h, r, t combination as follows:

    $$ d = \gamma - ||\mathbf{h}+\mathbf{r}-\mathbf{t}||_{2} $$

    $$ \textup{score} = \log\left(\frac{1}{1+\exp(-d)}\right) $$

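    $\gamma$ is a constant margin, and $\mathbf{h}$, $\mathbf{r}$, $\mathbf{t}$ are the embeddings of the head, relation and tail. The embeddings are learned by minimising the following loss, which raises the score function $f$ for triplets that exist in the graph ($\mathbb{D}^+$, label $y = +1$) and lowers it for triplets that do not exist ($\mathbb{D}^-$, label $y = -1$):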

    $$ \min \sum_{(\mathbf{h},\mathbf{r},\mathbf{t}) \in\mathbb{D}^+\cup \mathbb{D}^-} \log\left(1+\exp\left(-y\times f(\mathbf{h},\mathbf{r},\mathbf{t})\right)\right) $$
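
    To make the training objective above concrete, here is a small sketch of the summed logistic loss in plain Python. It assumes that y is +1 for triplets in D+ and -1 for triplets in D-, and it takes the values of f as given numbers. The function names and the toy values are invented for illustration and are not taken from the DRKG training code.

```python
import math
from typing import Iterable, Tuple


def softplus(x: float) -> float:
    """log(1 + exp(x)), arranged so exp() never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic_loss(scored: Iterable[Tuple[float, int]]) -> float:
    """Sum of log(1 + exp(-y * f)) over (f, y) pairs with y in {+1, -1}."""
    total = 0.0
    for f_value, y in scored:
        if y not in (1, -1):
            raise ValueError("label y must be +1 or -1")
        total += softplus(-y * f_value)
    return total


# Toy values: an existing triplet scored high and a non-existing one scored low...
well_ranked = [(7.0, +1), (-3.0, -1)]
# ...versus the same f values with the labels swapped.
badly_ranked = [(7.0, -1), (-3.0, +1)]

low_loss = logistic_loss(well_ranked)
high_loss = logistic_loss(badly_ranked)
```

    The loss is small when existing triplets receive high values of f and non-existing triplets receive low ones, which is the behaviour that minimising the sum encourages.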

    All scores are less than 0, because the score is the logarithm of a value between 0 and 1. The closer the score is to 0, the more strongly the model predicts that $\mathbf{h}$ has relation $\mathbf{r}$ with $\mathbf{t}$.
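
    The two edge-score formulas above can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: the embeddings are toy 2-dimensional vectors and the margin of 12.0 is an arbitrary illustrative value, neither taken from the trained DRKG model, and the function names are hypothetical.

```python
import math
from typing import Sequence


def margin_minus_distance(h: Sequence[float], r: Sequence[float],
                          t: Sequence[float], gamma: float) -> float:
    """d = gamma - ||h + r - t||_2 for plain Python vectors of equal length."""
    if not (len(h) == len(r) == len(t)):
        raise ValueError("h, r and t must have the same dimension")
    l2_norm = math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    return gamma - l2_norm


def log_sigmoid(d: float) -> float:
    """log(1 / (1 + exp(-d))), arranged so exp() never overflows."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    return d - math.log1p(math.exp(d))


def edge_score(h: Sequence[float], r: Sequence[float],
               t: Sequence[float], gamma: float) -> float:
    return log_sigmoid(margin_minus_distance(h, r, t, gamma))


# Toy embeddings and margin, invented for illustration only.
GAMMA = 12.0
head, relation = [1.0, 0.0], [0.0, 1.0]
close_tail = [1.0, 1.0]  # h + r lands exactly on this tail
far_tail = [4.0, 5.0]    # h + r is at distance 5 from this tail

close_score = edge_score(head, relation, close_tail, GAMMA)
far_score = edge_score(head, relation, far_tail, GAMMA)
```

    With these toy values both scores are negative, and `close_score` is nearer to 0 than `far_score` because h + r lies closer to `close_tail`.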

    Paper

    DRKG Drug Repurposing Knowledge Graph.pdf

    Paper source

  • How to use our DRKG app?

    Inputs

    Our DRKG app takes three input files. Note that all input entities and relations must already be in the DRKG (see entities.csv and relations.csv below):

    1. A csv file of head entities, e.g.:

      input_heads.csv

    2. A csv file of tail entities, e.g.:

      input_tails.csv

    3. A csv file of relations, e.g.:

      input_relations.csv

    The examples included above are the same as the example data in the DRKG app.

    Files containing all possible entities and all possible relations:

    entities.csv

    relations.csv

    Or view them in the browser here (note: the database may take a few seconds to load while scrolling):

    relations

    Outputs

    The DRKG app returns a downloadable list of scores for all head, relation, tail triplets. For example:

    scores.csv
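
    Because a score is produced for every head, relation, tail combination, the number of output rows is the product of the three input sizes. The sketch below shows that enumeration with `itertools.product`. The helper name is hypothetical, the example names are taken from this page purely for illustration, and the app's real output layout may differ.

```python
from itertools import product
from typing import List, Sequence, Tuple


def candidate_triplets(heads: Sequence[str], relations: Sequence[str],
                       tails: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Every (head, relation, tail) combination of the three input lists."""
    return list(product(heads, relations, tails))


# Example names taken from this page, used here only as illustrative inputs.
heads = ["Compound::brenda:1695336"]
relations = ["GNBR::Sa::Compound:Disease"]
tails = ["Disease::OMIM:256600", "Gene::100860742"]

triplets = candidate_triplets(heads, relations, tails)
```

    With one head, one relation and two tails, `triplets` holds two candidates. Every combination is listed, including pairings whose tail type does not match the relation's tail category.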

    If the DRKG app receives input entities or relations that it cannot find in the knowledge graph, it lists them in a "not found" table on the results page.
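
    The same check can be done before uploading. The sketch below separates inputs that appear in a list of known names from those that do not. `split_known` is a hypothetical helper, the known names are assumed to have been read from entities.csv or relations.csv beforehand, and the app's own lookup may work differently.

```python
from typing import Iterable, List, Tuple


def split_known(inputs: Iterable[str],
                known: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (found, not_found), preserving the order of the inputs."""
    known_set = set(known)
    found: List[str] = []
    not_found: List[str] = []
    for name in inputs:
        (found if name in known_set else not_found).append(name)
    return found, not_found


# Illustrative data only: a tiny stand-in for the full list of DRKG entities.
known_entities = ["Gene::100860742", "Disease::OMIM:256600"]
requested = ["Gene::100860742", "Gene::not-a-real-id"]

found, not_found = split_known(requested, known_entities)
```

    Anything that ends up in `not_found` would be expected to appear in the app's "not found" table rather than in the scores.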