Process a dataset using a specified Docker container
Input: A dataset of molecules
Output: A dataset of molecules
|Docker image name|The Docker image to use to process the dataset|
|Input media type|The format in which the data is provided to the container|
|Output media type|The format in which the container provides its output|
|Command|Command script (e.g. a bash script) to execute inside the container to process the data|
This cell allows you and other third parties to provide your own functionality in the form of Docker images. Whilst we try to provide a reasonable range of “off the shelf” functionality within Squonk, we could not begin to provide everything that a user might need. The Docker cell lets you supply a Docker image that is used to process the data, so you can provide whatever processing functionality you want, and Docker containerisation allows it to be executed securely without compromising other Squonk users.
If you are not familiar with Docker take a look here.
The Docker image you use needs to follow a few conventions for this to work.
- The input is written as files that the container can read.
- The container must write its output as files that can then be read once execution is complete.
- The container must execute without needing to access any external resources.
To execute this cell you need to specify a number of options:
Docker image name
This is the name of the Docker image to use. The image comes from Docker Hub, and currently it must already have been pulled and be present on the Squonk server. This means you need to notify us of any images you wish to use. We will probably remove this limitation in future.
Input and Output media types
These are the formats in which the data is provided to the container (Input) and generated by the container (Output). Currently there are two options: Squonk’s native JSON format and MDL’s SD file format (additional formats may be supported in future). The options you select here determine which files are provided to the container and which files it needs to generate.
Squonk’s native JSON format:
- Inputs: input.data.gz, input.meta
- Outputs: output.data.gz, output.meta
The .data.gz files contain the gzipped data (e.g. the JSON array of MoleculeObjects) and the .meta files contain the metadata.
MDL’s SD file format:
- Input: input.sdf.gz
- Output: output.sdf.gz
Both files are gzipped SD files.
All files will be located in the /source directory inside the container.
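As an illustration of the JSON file convention, a Python command script might look roughly like the sketch below. It is not taken from Squonk itself: `process_json` and `transform` are hypothetical names, the image is assumed to provide a Python interpreter, the records are treated as opaque JSON objects, and copying input.meta unchanged to output.meta is an assumption that only makes sense if the processing leaves the dataset's metadata valid.

```python
#!/usr/bin/env python
import gzip
import json
import os
import shutil


def process_json(source_dir="/source", transform=lambda record: record):
    """Read input.data.gz, apply transform to each record, write the outputs.

    Assumes input.data.gz holds a single gzipped JSON array of objects.
    Returns the number of records written.
    """
    in_data = os.path.join(source_dir, "input.data.gz")
    in_meta = os.path.join(source_dir, "input.meta")
    out_data = os.path.join(source_dir, "output.data.gz")
    out_meta = os.path.join(source_dir, "output.meta")

    # Load the whole array; fine for a sketch, not for very large datasets.
    with gzip.open(in_data, "rt", encoding="utf-8") as f:
        records = json.load(f)

    # Placeholder for real processing: the default transform is the identity.
    processed = [transform(record) for record in records]

    with gzip.open(out_data, "wt", encoding="utf-8") as f:
        json.dump(processed, f)

    # Assumption: the metadata is still valid, so it is copied unchanged.
    shutil.copyfile(in_meta, out_meta)
    return len(processed)
```

When used as a real command script, the file would end with a call to `process_json()` so that the default /source directory is processed.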
Command
This is the script to execute to process the data. The script must read the input files from the /source directory, process them, and write the required output files to the /source directory.
The simple example shown just copies the input file to the required output file, so it is effectively a “do nothing” example. Note that /source is the working directory, so it does not need to be specified in the example, but the command is equivalent to specifying the full paths as:
cp /source/input.sdf.gz /source/output.sdf.gz
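For something slightly beyond a straight copy, a Python command script could stream the SD file record by record. The sketch below is illustrative only: `iter_sdf_records`, `filter_sdf` and the `keep` predicate are hypothetical names, the image is assumed to provide a Python interpreter, and the splitting relies on the SD file convention that each record ends with a `$$$$` line.

```python
#!/usr/bin/env python
import gzip
import os


def iter_sdf_records(lines):
    """Yield one SD file record (as a string) per '$$$$' terminator."""
    record = []
    for line in lines:
        record.append(line)
        if line.rstrip("\r\n") == "$$$$":
            yield "".join(record)
            record = []
    # Yield any trailing text that lacks a terminator, unless it is blank.
    rest = "".join(record)
    if rest.strip():
        yield rest


def filter_sdf(source_dir="/source", keep=lambda record: True):
    """Copy records from input.sdf.gz to output.sdf.gz if keep(record) is true.

    Returns the number of records written. With the default predicate this
    behaves like the "do nothing" copy.
    """
    in_path = os.path.join(source_dir, "input.sdf.gz")
    out_path = os.path.join(source_dir, "output.sdf.gz")
    kept = 0
    # latin-1 and newline="" keep the bytes and line endings unchanged.
    with gzip.open(in_path, "rt", encoding="latin-1", newline="") as src, \
            gzip.open(out_path, "wt", encoding="latin-1", newline="") as dst:
        for record in iter_sdf_records(src):
            if keep(record):
                dst.write(record)
                kept += 1
    return kept
```

A real script would finish by calling `filter_sdf()` with whatever predicate is needed, for example `filter_sdf(keep=lambda record: "M  END" in record)`.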
When the cell is executed:
- The container is created from the specified Docker image
- A “volume” is created that will appear as the /source directory inside the container
- The data is copied to this directory according to the specified format (e.g. as input.data.gz + input.meta or as input.sdf.gz)
- The command script is written to the /source directory and made executable
- The working directory is set to /source (your script can change to a different directory if it needs to)
- The command script is executed
- On completion the results (e.g. output.data.gz + output.meta or output.sdf.gz) are read and stored as the output of the cell execution (always converted to a Dataset)
- The container and all the files are deleted.
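To try out a command script before asking for an image to be installed, the steps above can be imitated locally. The following is only a rough stand-in, not how Squonk runs the cell: it uses a temporary directory in place of the Docker volume, runs the script directly on the host rather than inside a container, and the script file name `execute` and the function `run_locally` are hypothetical.

```python
import os
import shutil
import stat
import subprocess
import tempfile


def run_locally(command_script, input_files, output_names):
    """Imitate the cell's execution steps without Docker.

    command_script: text of the script, including its shebang line.
    input_files: dict mapping file name to bytes (e.g. "input.sdf.gz").
    output_names: names of the files the script is expected to write.
    Returns a dict mapping each output name to its bytes.
    """
    # Stand-in for the volume mounted as /source.
    workdir = tempfile.mkdtemp(prefix="source-")
    try:
        # Copy the input data into the working directory.
        for name, data in input_files.items():
            with open(os.path.join(workdir, name), "wb") as f:
                f.write(data)

        # Write the command script and make it executable.
        script_path = os.path.join(workdir, "execute")
        with open(script_path, "w") as f:
            f.write(command_script)
        os.chmod(script_path, os.stat(script_path).st_mode | stat.S_IXUSR)

        # Run the script with the working directory set to the "volume".
        subprocess.run([script_path], cwd=workdir, check=True)

        # Read the results back.
        results = {}
        for name in output_names:
            with open(os.path.join(workdir, name), "rb") as f:
                results[name] = f.read()
        return results
    finally:
        # Delete all the files, as the cell does after execution.
        shutil.rmtree(workdir, ignore_errors=True)
```

For example, `run_locally("#!/bin/sh\ncp input.sdf.gz output.sdf.gz\n", {"input.sdf.gz": data}, ["output.sdf.gz"])` exercises the “do nothing” command on a Unix-like host. Because nothing here is isolated, it is suitable only for scripts you trust.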
The type of command script that is supported is dependent on the Docker container you are using. The possibilities are almost endless, but common examples would include:
- Shell scripts, using a file with a shell shebang header (#!/bin/bash etc.)
- Python scripts, using a file with the #!/usr/bin/env python shebang header
Some examples can be found on the examples page.