CATEGORY
Clustering
SOURCE
RDKit
DESCRIPTION
Picks a diverse subset of compounds using the MaxMinPicker algorithm as implemented in RDKit
INPUTS
A Dataset of Molecules to pick from.
Optional: a Dataset of Molecules that provides the seeds
See below for more details.
OUTPUTS
A Dataset of Molecules that comprises the newly picked molecules
OPTIONS
Number to pick | The number of new molecules to pick | Positive integer |
Threshold | Dissimilarity threshold (0.0 is identical) | Number between 0 and 1 |
Fragment method | Strategy for selecting the largest fragment of multi-component molecules | hac or mw |
Output fragment | If multiple fragments then output the biggest, otherwise output the whole molecule | true or false |
Descriptor | The molecular descriptor (fingerprint) to use to calculate the Tanimoto distance | maccs, morgan2, morgan3 |
ADDITIONAL INFO
Background
MaxMinPicker is an efficient algorithm for picking an optimal subset of diverse compounds from a candidate pool. The algorithm is described in Ashton, M. et al. (doi 10.1002/qsar.200290002) and an improved RDKit implementation was described by Roger Sayle at the 2017 RDKit user group meeting. The Squonk implementation is based on that improved algorithm.
Squonk also has the RDKit Diverse Subset Picker cell (based on Butina clustering), which picks a diverse subset and can optimise the selection for a particular property (e.g. logP). However, that approach requires the whole distance matrix to be generated, which limits scalability (1000 structures is OK, 10,000 is not). In contrast, the MaxMinPicker generates distances on demand and so is much more scalable.
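The idea of generating distances on demand can be illustrated with a minimal pure-Python sketch. It is not the RDKit or Squonk code and omits the pruning described in the improved implementation. Fingerprints are assumed to be plain sets of on-bits, and the names tanimoto_distance and maxmin_pick are hypothetical.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_pick(pool, n_pick, first=0):
    """Greedy MaxMin selection (illustrative sketch only).

    Each round picks the candidate whose distance to its nearest
    already-picked item is largest. Only distances to picked items are
    computed, so no full pairwise distance matrix is ever built.
    """
    picked = [first]
    # min_dist[i]: distance from candidate i to its nearest picked item
    min_dist = {
        i: tanimoto_distance(fp, pool[first])
        for i, fp in enumerate(pool)
        if i != first
    }
    while len(picked) < n_pick and min_dist:
        best = max(min_dist, key=min_dist.get)
        picked.append(best)
        del min_dist[best]
        # Only distances to the newest pick need to be evaluated
        for i in min_dist:
            d = tanimoto_distance(pool[i], pool[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


pool = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick(pool, 3))
```

Picking k items from a pool of n evaluates roughly n x k distances in this sketch, rather than the n x n needed for a full matrix.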
Variants
The MaxMinPicker cell comes in 2 variants:
RDKitMaxMinPickerSimple
This is for picking a diverse subset from a single pool of molecules. It has one input, the pool of molecules to pick from, and one output, the molecules that have been picked. It is broadly analogous to the RDKit Diverse Subset Picker cell but scales much better, though it does not allow for optimisation of a molecular property.
RDKitMaxMinPickerEnrich
This is for picking a diverse subset from a single pool of molecules given a set of seed molecules. It has two inputs, one for the pool of molecules to pick from (named input) and another for the seed molecules (named seeds). It has one output, the molecules that have been picked from the pool. It is designed for scenarios like ‘given my existing compound collection, give me the n most diverse molecules from a vendor catalog’.
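A minimal sketch of how seeded picking can work, assuming fingerprints are plain sets of on-bits. This is an illustration of the idea, not the cell's actual code, and the function names are hypothetical. The seeds act as if they had already been picked, and only pool molecules are returned.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def pick_with_seeds(pool, seeds, n_pick):
    """Pick n_pick pool indices that are diverse relative to the seeds
    and to each other (illustrative sketch only)."""
    # Start from each candidate's distance to its nearest seed;
    # with no seeds every candidate starts at the maximum distance.
    min_dist = {
        i: min((tanimoto_distance(fp, s) for s in seeds), default=1.0)
        for i, fp in enumerate(pool)
    }
    picked = []
    while len(picked) < n_pick and min_dist:
        best = max(min_dist, key=min_dist.get)
        picked.append(best)
        del min_dist[best]
        # New picks also count as "already covered" chemical space
        for i in min_dist:
            min_dist[i] = min(min_dist[i], tanimoto_distance(pool[i], pool[best]))
    return picked


seeds = [{1, 2, 3}]                                    # existing collection
pool = [{1, 2, 3, 4}, {7, 8, 9}, {1, 2, 5}, {3, 7, 8, 9}]  # vendor catalog
print(pick_with_seeds(pool, seeds, 2))
```

In this toy data the fourth pool entry is far from the seed but close to the first pick, so it is passed over in favour of the third entry.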
Options
Whilst the inputs of the two variants are different, the options are identical.
The Number to pick and Threshold parameters determine the picking criteria. One or both must be specified. The Number to pick parameter terminates the picker once that many molecules have been picked. The Threshold parameter (a dissimilarity measure) terminates the picker once no remaining molecule is at least that distant from every molecule already picked, and so can be used to keep picking until the chemical space is saturated at that threshold. If both Number to pick and Threshold are specified then the picker terminates as soon as the first of them hits its limit.
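The two stopping criteria can be sketched as follows. This is a simplified illustration under the assumption that fingerprints are sets of on-bits and that picking stops when the best remaining candidate falls below the threshold. The function and parameter names are hypothetical and do not reflect the cell's internals.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def maxmin_pick_until(pool, n_pick=None, threshold=None, first=0):
    """Greedy MaxMin picking that stops on a count, a threshold, or both."""
    if n_pick is None and threshold is None:
        raise ValueError("specify n_pick, threshold, or both")
    picked = [first]
    min_dist = {
        i: tanimoto_distance(fp, pool[first])
        for i, fp in enumerate(pool)
        if i != first
    }
    while min_dist:
        # Criterion 1: enough molecules have been picked
        if n_pick is not None and len(picked) >= n_pick:
            break
        best = max(min_dist, key=min_dist.get)
        # Criterion 2: even the most distant candidate is too similar
        if threshold is not None and min_dist[best] < threshold:
            break
        picked.append(best)
        del min_dist[best]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], tanimoto_distance(pool[i], pool[best]))
    return picked


pool = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick_until(pool, threshold=0.6))            # stops on threshold
print(maxmin_pick_until(pool, n_pick=2, threshold=0.1))  # stops on count
```

Whichever criterion is met first ends the loop, which mirrors the behaviour described above when both options are set.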
Generating descriptors and distances only really makes sense for discrete molecules, not mixtures. As salt forms and other types of mixture are common in molecule sets, the MaxMinPicker provides options for how to generate the fragment to use for the picking. The Fragment method option tells the picker how to determine the largest fragment where there are multiple fragments. It has 2 options, hac (heavy atom count) and mw (molecular weight). Where a molecule has multiple fragments, the biggest according to hac or mw is chosen and used to generate the descriptors. The Output fragment option specifies whether the picked molecules that are output should be the largest fragment or the entire molecule.
The final option is the Descriptor used to calculate the Tanimoto distance between molecules. The default is morgan2, but morgan3 or MACCS keys (maccs) can also be used.
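The difference between hac and mw can be shown with a toy sketch. It assumes a hypothetical representation in which each fragment is a dictionary of element counts, and it uses a small hand-written table of atomic weights. A real implementation would work on parsed molecules, so this only illustrates the selection rule.

```python
# Approximate atomic weights for the elements used in this example only
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "I": 126.904}


def heavy_atom_count(fragment):
    """Number of non-hydrogen atoms in a fragment."""
    return sum(n for element, n in fragment.items() if element != "H")


def molecular_weight(fragment):
    """Sum of atomic weights over all atoms in a fragment."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in fragment.items())


def largest_fragment(fragments, method="hac"):
    """Return the biggest fragment according to 'hac' or 'mw'."""
    if method == "hac":
        return max(fragments, key=heavy_atom_count)
    if method == "mw":
        return max(fragments, key=molecular_weight)
    raise ValueError("method must be 'hac' or 'mw'")


# Methylammonium iodide as two fragments: CH3NH3+ and I-
methylammonium = {"C": 1, "N": 1, "H": 6}
iodide = {"I": 1}
salt = [methylammonium, iodide]

print(largest_fragment(salt, "hac"))  # the organic cation (2 heavy atoms vs 1)
print(largest_fragment(salt, "mw"))   # the iodide (heavier despite 1 atom)
```

For most salts both methods agree. They diverge only when a counter-ion with few atoms outweighs the fragment of interest, as in this example.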
Examples
The following screenshot shows both variants in use.