• Description:

Drug Cardiotoxicity dataset [1-2] is a molecule classification task to detect cardiotoxicity caused by binding hERG target, a protein associated with heart beat rhythm. The data covers over 9000 molecules with hERG activity.

  1. The data is split into four splits: train, test-iid, test-ood1, test-ood2.

  2. Each molecule in the dataset has 2D graph annotations which is designed to facilitate graph neural network modeling. Nodes are the atoms of the molecule and edges are the bonds. Each atom is represented as a vector encoding basic atom information such as atom type. Similar logic applies to bonds.

  3. We include Tanimoto fingerprint distance (to training data) for each molecule in the test sets to facilitate research on distributional shift in graph domain.

For each example, the features include: atoms: a 2D tensor with shape (60, 27) storing node features. Molecules with less than 60 atoms are padded with zeros. Each atom has 27 atom features. pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has 12 edge features. atom_mask: a 1D tensor with shape (60, ) storing node masks. 1 indicates the corresponding atom is real, othewise a padded one. pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates the corresponding edge is real, othewise a padded one. active: a one-hot vector indicating if the molecule is toxic or not. [0, 1] indicates it's toxic, otherwise [1, 0] non-toxic.


[1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM, 2020.

[2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under Distributional Shift. NeurIPS DistShift Workshop 2021.

Split Examples
'test' 839
'test2' 177
'train' 6,523
'validation' 1,631
  • Feature structure:
    'active': Tensor(shape=(2,), dtype=int64),
    'atom_mask': Tensor(shape=(60,), dtype=float32),
    'atoms': Tensor(shape=(60, 27), dtype=float32),
    'dist2topk_nbs': Tensor(shape=(1,), dtype=float32),
    'molecule_id': string,
    'pair_mask': Tensor(shape=(60, 60), dtype=float32),
    'pairs': Tensor(shape=(60, 60, 12), dtype=float32),
  • Feature documentation:
Feature Class Shape Dtype Description
active Tensor (2,) int64
atom_mask Tensor (60,) float32
atoms Tensor (60, 27) float32
dist2topk_nbs Tensor (1,) float32
molecule_id Tensor string
pair_mask Tensor (60, 60) float32
pairs Tensor (60, 60, 12) float32
  • Citation:
  title         = "Reliable Graph Neural Networks for Drug Discovery Under
                   Distributional Shift",
  author        = "Han, Kehang and Lakshminarayanan, Balaji and Liu, Jeremiah",
  month         =  nov,
  year          =  2021,
  archivePrefix = "arXiv",
  primaryClass  = "cs.LG",
  eprint        = "2111.12951"