RDKit Butina Clustering

CATEGORY

Clustering

SOURCE

RDKit

DESCRIPTION

Cluster molecules using the Butina algorithm from RDKit.

INPUTS

A Dataset of Molecules

OUTPUTS

A Dataset of Molecules

OPTIONS

Threshold: Similarity score cutoff between 0 and 1 (1 means identical). Default is 0.7. Values: number between 0 and 1.
Fragment method: Strategy for selecting the largest fragment of multi-component molecules. Values: hac (heavy atom count) or mw (molecular weight).
Output fragment: If a molecule has multiple fragments, output the biggest fragment; otherwise output the whole molecule. Values: true or false.
Descriptor: Fingerprint type. Values: rdkit (default), maccs, morgan2, morgan3.
Metric: Similarity comparison metric. Values: asymmetric, braunblanquet, cosine, dice, kulczynski, mcconnaughey, rogotgoldberg, russel, sokal, tanimoto (default).
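
The Threshold value is only meaningful relative to the chosen Metric, because different metrics score the same pair of fingerprints differently. The sketch below is an illustration only, not the cell's code: it treats a fingerprint as a Python set of "on" bit positions, and the function name similarity and the example bit sets are hypothetical. It covers three of the listed metrics using their standard set-based definitions.

```python
import math


def similarity(a, b, metric="tanimoto"):
    """Similarity of two non-empty sets of 'on' bits (hypothetical helper)."""
    common = len(a & b)  # bits set in both fingerprints
    if metric == "tanimoto":
        return common / (len(a) + len(b) - common)
    if metric == "dice":
        return 2 * common / (len(a) + len(b))
    if metric == "cosine":
        return common / math.sqrt(len(a) * len(b))
    raise ValueError(f"metric not covered by this sketch: {metric}")


# Two made-up fingerprints sharing 2 of their 4 bits each.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}

scores = {m: similarity(fp_a, fp_b, m) for m in ("tanimoto", "dice", "cosine")}
print(scores)
```

For this pair the Tanimoto score is 1/3 while Dice and cosine both give 0.5, so a threshold tuned for one metric may need adjusting when switching to another.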

ADDITIONAL INFO

For more information on Butina clustering in RDKit, see the RDKit documentation for the rdkit.ML.Cluster.Butina module.
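
As a rough illustration of the idea behind Butina clustering, the following pure-Python sketch groups molecules around the ones with the most neighbours at or above the similarity threshold. It is not RDKit's implementation and not this cell's code: fingerprints are modelled as sets of "on" bits, Tanimoto similarity is assumed, and the names tanimoto and butina_sketch and the example data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def butina_sketch(fingerprints, threshold=0.7):
    """Return clusters as tuples of indices, with the centroid first."""
    n = len(fingerprints)
    # Neighbour lists: every pair at or above the similarity threshold.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Visit molecules with the most neighbours first (ties keep input order).
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue  # already claimed by an earlier centroid
        members = [j for j in sorted(neighbours[i]) if j not in assigned]
        assigned.add(i)
        assigned.update(members)
        clusters.append((i, *members))
    return clusters


# Made-up fingerprints: two groups of similar molecules and one outlier.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 12, 13},
    {20, 21},
]
clusters = butina_sketch(fps, threshold=0.7)
print(clusters)  # [(1, 0, 2), (3, 4), (5,)]
```

Molecule 1 has the most neighbours, so it becomes the first centroid and claims molecules 0 and 2, even though 0 and 2 are below the threshold with respect to each other. Raising the threshold splits clusters into smaller ones and eventually into singletons.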

Note: this method builds a full matrix of the distances between all pairs of molecules, so it does not scale to large datasets. 1,000 structures is OK; 10,000 is not.
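
To see why the size limit bites so quickly, count the pairwise distances: n molecules have n(n-1)/2 distinct pairs. The short calculation below uses a hypothetical helper, pair_count, and says nothing about how the cell stores the matrix; it shows only how the number of comparisons grows.

```python
def pair_count(n):
    """Number of distinct molecule pairs among n molecules."""
    return n * (n - 1) // 2


for n in (1_000, 10_000):
    print(f"{n:>6} molecules -> {pair_count(n):>12,} pairwise distances")
```

Going from 1,000 to 10,000 structures raises the count from 499,500 to 49,995,000, roughly a hundredfold increase for a tenfold larger dataset.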

To pick a diverse subset using this approach try the RDKit Diverse Subset Picker cell.