CATEGORY
Clustering
SOURCE
RDKit
DESCRIPTION
Cluster molecules using the Butina algorithm from RDKit.
INPUTS
A Dataset of Molecules
OUTPUTS
A Dataset of Molecules
OPTIONS
Threshold | Similarity score cuttoff between 0 and 1 (1 means identical). Default is 0.7 | Number between 0 and 1 |
Fragment method | Strategy for selecting the largest fragment for multi component molecules | hac or mw |
Output fragment | If multiple fragments then output the biggest, otherwise output the whole molecule | true or false |
Descriptor | Fingerprint type. Options are: maccs, morgan2, morgan3, rdkit (default) | rdkit (default), maccs, morgan2, morgan3 |
Metric | Similarity comparison metric. | asymmetric, braunblanquet, cosine, dice, kulczynski, mcconnaughey, rogotgoldberg, russel, sokal, tanimoto (default) |
ADDITIONAL INFO
For more info on Butina clustering in RDKit see here.
Note: this methods builds a full distance matrix for the distances between the molecules so does not scale to large datasets. 1000 structures is OK, 10,000 is not.
To pick a diverse subset using this approach try the RDKit Diverse Subset Picker cell.