Data Sources

Swanson's Apple synthesizes biomedical associations from six curated databases to generate hypotheses. Each source contributes a different type of evidence linking diseases, genes, and compounds.

Comparative Toxicogenomics Database (CTD)

NC State University

B↔A, C↔A

Manually curated associations between chemicals, genes/proteins, and diseases drawn from peer-reviewed literature. One of the most comprehensive hand-curated biomedical resources available.

Role: Primary source for both Disease–Gene (B↔A) and Chemical–Gene (C↔A) edges. Forms the backbone of most hypotheses in the engine.

Open Targets Platform

EMBL-EBI, Wellcome Sanger Institute, GSK, Pfizer, Takeda

B↔A

Integrates genetic, somatic, and functional genomics evidence linking diseases to molecular targets across multiple evidence types including GWAS, rare variants, and gene expression.

Role: Expands Disease–Gene (B↔A) edges with genetic evidence, broadening the set of disease-target links beyond manual curation.

ChEMBL

EMBL-EBI

FDA tagging

Manually curated bioactivity database for drug-like compounds. Records the maximum development phase reached for over two million compounds, with max_phase=4 marking compounds that have reached approval; the engine uses this value as its FDA-approval tag.

Role: Identifies FDA-approved drugs among compounds in the database. Powers the “FDA Approved” filter on the main results table.
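As a rough illustration of the tagging step, the sketch below filters compound records on max_phase. Only the max_phase field name and the value 4 come from the description above. The record layout, the is_approved helper, and the compound names are hypothetical, and the engine's real filter may work differently.

```python
from typing import Optional

# Value the description above associates with approval.
APPROVED_PHASE = 4


def is_approved(max_phase: Optional[float]) -> bool:
    """Return True when max_phase marks a compound as approved.

    Hypothetical helper: a missing phase (None) is treated as not approved.
    """
    if max_phase is None:
        return False
    return float(max_phase) >= APPROVED_PHASE


# Toy records, invented for illustration only.
compounds = [
    {"name": "compound_a", "max_phase": 4},
    {"name": "compound_b", "max_phase": 2},
    {"name": "compound_c", "max_phase": None},
]

approved_names = [c["name"] for c in compounds if is_approved(c["max_phase"])]
print(approved_names)
```

Treating a missing phase as "not approved" keeps the filter conservative: a compound appears under the "FDA Approved" filter only when the record states it explicitly.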

Drug-Gene Interaction Database (DGIdb)

Washington University in St. Louis

C↔A

Aggregates drug-gene interactions from roughly 30 sources including DrugBank, PharmGKB, and ChEMBL, providing a broad view of known pharmacological relationships.

Role: Adds Drug–Gene (C↔A) edges, expanding the set of known drug-target relationships used to generate compound hypotheses.

DrugCentral

University of New Mexico / NIH

C↔A

Clinically active drug-target pairs curated from FDA drug labels, the WHO essential medicines list, and the biomedical literature. Emphasizes interactions of established clinical significance.

Role: Adds Drug–Protein (C↔A) edges with high clinical confidence, grounded in FDA-approved indications and active pharmacology.

DISEASES (Jensen Lab)

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen

B↔A

Text-mined and curated disease-gene associations derived from the biomedical literature at scale. Applies confidence scoring to filter associations by evidence quality.

Role: Adds Disease–Gene (B↔A) edges from large-scale literature mining, complementing the manually curated sources with broader coverage.

How the sources combine

Each hypothesis connects a disease (B) to a compound (C) through a shared gene or protein target (A). The B↔A edges come from CTD, Open Targets, and DISEASES; the C↔A edges come from CTD, DGIdb, and DrugCentral. ChEMBL tags which compounds are FDA-approved. The ABC algorithm surfaces disease–compound pairs whose direct B–C connection has little or no existing literature, which is the basis of Swanson-style literature-based discovery.
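The join described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the engine's actual implementation: the abc_hypotheses function, the edge tuples, and the toy identifiers are hypothetical. A plain set of known pairs stands in for the literature check, and ranking by the number of shared genes is an illustrative choice that the source does not specify.

```python
from collections import defaultdict


def abc_hypotheses(disease_gene_edges, compound_gene_edges, known_pairs,
                   approved=frozenset()):
    """Join B-A and C-A edges on a shared A and keep unlinked B-C pairs."""
    # Index compounds (C) by the gene (A) they interact with.
    compounds_by_gene = defaultdict(set)
    for compound, gene in compound_gene_edges:
        compounds_by_gene[gene].add(compound)

    # Walk the B-A edges and collect the shared genes for each (B, C) pair.
    shared = defaultdict(set)
    for disease, gene in disease_gene_edges:
        for compound in compounds_by_gene.get(gene, ()):
            shared[(disease, compound)].add(gene)

    # Drop pairs that are already linked, then rank by shared-gene count
    # (assumed ranking; ties are broken alphabetically for determinism).
    rows = [
        {"disease": d, "compound": c, "genes": sorted(g),
         "fda_approved": c in approved}
        for (d, c), g in shared.items()
        if (d, c) not in known_pairs
    ]
    rows.sort(key=lambda r: (-len(r["genes"]), r["disease"], r["compound"]))
    return rows


# Toy data, invented for illustration only.
disease_gene = {("disease_x", "GENE1"), ("disease_x", "GENE2"),
                ("disease_y", "GENE2")}
compound_gene = {("drug_p", "GENE1"), ("drug_p", "GENE2"),
                 ("drug_q", "GENE2")}
known = {("disease_y", "drug_q")}  # stand-in for an existing B-C literature link

results = abc_hypotheses(disease_gene, compound_gene, known, approved={"drug_p"})
for row in results:
    print(row)
```

In this toy run the pair (disease_y, drug_q) is dropped because it is already in the known set, and (disease_x, drug_p) ranks first because it is supported by two shared genes.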