Usage
This document describes the use of the command-line interface (CLI). If you want more control over the inputs and outputs or want to customize DrugEx itself, you can also use the Python API directly (see DrugEx Package API Documentation). You can find a complete tutorial illustrating some common use cases for each model type on the project's GitHub.
The command line is a simple interface that can be used to preprocess data and build models quickly. Note, however, that obtaining a final model and generating novel compounds requires running multiple scripts.
A description of the functionality of each script can be displayed with the --help argument. For example, the help message for the drugex.dataset script can be shown as follows:
python -m drugex.dataset --help
On Linux and MacOS, you do not need to call python explicitly and the following will suffice:
drugex dataset --help
A basic command-line workflow to fine-tune and optimize a graph-based model is given below (see CLI Example). We also include a few other workflows that illustrate some of the other functionalities.
Before you start, make sure you have downloaded the example data and models into the tutorial/CLI/examples folder:
python -m drugex.download -o tutorial/CLI/examples # run from the repository root
Note
All of the examples below assume that you are executing them from the repository root.
Warning
All of the commands below are intended as quick examples and it is unlikely the resulting models will be useful in any way. In production settings, the models should of course be trained for many more epochs.
CLI Example
Basics
Fine-tuning a Pretrained Generator
In this example, we will use the DrugEx CLI to fine-tune a pretrained graph transformer (trained on the latest version of the Papyrus data set). This pretrained model has been trained on a diverse set of molecules. Fine-tuning will give us a model that generates molecules that should more closely resemble the compounds in the data set of interest. You can find the model used here archived on Zenodo or among the other data files for this tutorial (CLI/generators/). You can find links to more pretrained models on the project GitHub.
Here, we want to bias the model towards generating compounds that are more related to known ligands of the Adenosine receptors.
To use the CLI, all input data should be placed in the data folder of the base directory (-b tutorial/CLI/examples). For fine-tuning, this input is a file with compounds (A2AR_LIGANDS.tsv).
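As an illustration, this is a simple tab-separated file with (at least) a column holding the SMILES strings of the ligands. The two rows sketched below are hypothetical and use two well-known adenosine receptor ligands (caffeine and theophylline); the real file ships with the tutorial data:
SMILES
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
CN1C2=C(C(=O)N(C1=O)C)NC=N2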
Before we begin the fine-tuning, we have to preprocess the training data, as follows:
# input is in tutorial/CLI/examples/data/A2AR_LIGANDS.tsv
export BASE_DIR=tutorial/CLI/examples
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o arl -mt graph
This tells DrugEx to preprocess the compounds saved in the SMILES column (-mc SMILES) of the A2AR_LIGANDS.tsv file (-i A2AR_LIGANDS.tsv) for a graph-type transformer (-mt graph).
Preprocessing molecules for the graph-based models includes fragmentation and encoding. This is done because the transformer takes fragmented molecules as input, and for the graph-based transformers these inputs also need to be encoded into a graph representation.
The resulting files will be saved in the data folder and given a prefix (-o arl). You can use this prefix to load the compiled data files in the next step. If you made an error somewhere or got an exception, you may also notice some backup_{number} folders being created in the data folder. These are backups of the data files from before the last step. You can use them to go back to the previous results if you accidentally overwrite them.
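For example, you can quickly list what was created (arl_test_graph.txt is used later in this tutorial; the other file names are assumptions and may differ on your system):
ls ${BASE_DIR}/data
# expect the encoded test file arl_test_graph.txt alongside training split and vocabulary files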
Now that we have our data sets prepared, we can fine-tune the pretrained generator on the preprocessed molecules with the train script:
python -m drugex.train -tm FT -b ${BASE_DIR} -i arl -o arl -ag ${BASE_DIR}/models/pretrained/graph-trans/Papyrus05.5_graph_trans_PT/Papyrus05.5_graph_trans_PT.pkg -mt graph -e 2 -bs 32 -gpu 0
This tells DrugEx to use the generated files (prefixed with -i arl) to fine-tune (-tm FT) a pretrained model with model states saved in the Papyrus05.5_graph_trans_PT.pkg file (-ag).
The training will run for only 2 epochs (-e 2) with a batch size of 32 (-bs 32), and it will be done on GPU 0 (-gpu 0). You can also specify multiple GPUs with the -gpu argument (e.g. -gpu 0,1). The best model will be saved to ${BASE_DIR}/generators/arl_graph_trans_FT.pkg. However, you will also find output files with the .log and .tsv extensions in ${BASE_DIR}. These files contain the training and validation losses and the molecules generated at each epoch.
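For instance, you can peek at the outputs after training (the .pkg path is given above; the exact base names of the .log and .tsv files are assumptions):
ls ${BASE_DIR}/generators # arl_graph_trans_FT.pkg (best model weights)
ls ${BASE_DIR}/*.log ${BASE_DIR}/*.tsv # losses and per-epoch molecules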
Optimization with Reinforcement Learning
In this example, we want to generate drug-like molecules that are active on A2AR and have high synthetic accessibility (measured with the Synthetic Accessibility Score, SAScore). To achieve this, reinforcement learning (RL) is used to tune the generator model to generate molecules with the desired properties. For this task the RL framework is composed of the agent (the generator) and the environment (the predictor and SAScorer). The predictor model (a Random Forest QSAR model for binary A2AR bioactivity predictions) has been created using QSPRpred.
During RL, a combination of two generators with the same architecture is used to create molecules: the agent, which is optimized during RL for exploitation, and the prior, which is kept fixed for exploration. At each iteration, generated molecules are scored by the environment and the scores are sent back to the agent for tuning.
python -m drugex.train -tm RL -b ${BASE_DIR} -i arl -o arl -ag arl_graph_trans_FT -pr ${BASE_DIR}/models/pretrained/graph-trans/Papyrus05.5_graph_trans_PT/Papyrus05.5_graph_trans_PT.pkg -p models/qsar/qspr/models/A2AR_RandomForestClassifier/A2AR_RandomForestClassifier_meta.json -ta A2AR_RandomForestClassifier -sas -e 2 -bs 32 -gpu 0
This tells DrugEx to create molecules from input fragments encoded in the preprocessed data files (prefixed with arl) and to optimize the initial agent (the fine-tuned model, -ag arl_graph_trans_FT) with RL (-tm RL). In this case we are using two desirability functions to score molecules:
- Pretrained QSAR model (-p .../A2AR_RandomForestClassifier_meta.json): This model is located in the tutorial/CLI/examples/models/qsar/ folder and is used to predict the bioactivity of the generated molecules on A2AR, which is indicated by adding it by name to the list of active targets with -ta A2AR_RandomForestClassifier. This model was built using the QSPRpred package and you can check out the Jupyter notebook used to create it in the Python tutorial.
- SAScore (-sas): This is a synthetic accessibility score that will prevent DrugEx from generating molecules that are too difficult to synthesize.
The balance between exploration and exploitation of known chemical space is enforced by the use of a fixed prior generator (-pr Papyrus05.5_graph_trans_PT), and its influence can be tuned with the -eps, --epsilon parameter.
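For example, to rerun the optimization above with a different exploration rate (the value 0.1 is purely illustrative):
python -m drugex.train -tm RL -b ${BASE_DIR} -i arl -o arl -ag arl_graph_trans_FT -pr ${BASE_DIR}/models/pretrained/graph-trans/Papyrus05.5_graph_trans_PT/Papyrus05.5_graph_trans_PT.pkg -p models/qsar/qspr/models/A2AR_RandomForestClassifier/A2AR_RandomForestClassifier_meta.json -ta A2AR_RandomForestClassifier -sas -e 2 -bs 32 -gpu 0 -eps 0.1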
The best model found during RL will be saved as ${BASE_DIR}/generators/arl_graph_trans_RL.pkg.
Design new molecules
In this example, we use the optimized agent model to design new compounds that should be active on A2AR and have high synthetic accessibility.
python -m drugex.generate -b ${BASE_DIR} -i arl_test_graph.txt -g arl_graph_trans_RL
This tells DrugEx to generate new molecules based on the input fragments in arl_test_graph.txt with the arl_graph_trans_RL.pkg model.
The new compounds are saved to ${BASE_DIR}/new_molecules/arl_graph_trans_RL.tsv and are also scored with the original environment used to create the model.
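As with the other scripts, the generation itself can be adjusted; for example, the number of sampled molecules and the device can be set with the -n and -gpu arguments used elsewhere in this tutorial:
python -m drugex.generate -b ${BASE_DIR} -i arl_test_graph.txt -g arl_graph_trans_RL -n 30 -gpu 0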
Advanced
Using different generator architectures
You can vary the type of model to use with the -a and -mt parameters.
Recurrent neural network
The simplest model is the RNN-based generator. This model gets the 'go' token as input and generates SMILES strings from there.
Therefore, this model does not use input fragments for training or sampling. To preprocess the data for training an RNN-based generator, the molecules are standardized and encoded based on the vocabulary of the pretrained model (-vf Papyrus05.5_smiles_voc.txt), but no fragmentation is done (-nof).
To fine-tune an RNN-based generator on the A2AR set, the algorithm needs to be specified with -a rnn.
Here the generator is fine-tuned on the A2AR set and then used to generate new compounds.
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o rnn-example -nof -vf Papyrus05.5_smiles_voc.txt
python -m drugex.train -tm FT -b ${BASE_DIR} -i rnn-example -ag ${BASE_DIR}/models/pretrained/smiles-rnn/Papyrus05.5_smiles_rnn_PT/Papyrus05.5_smiles_rnn_PT.pkg -vfs Papyrus05.5_smiles_voc.txt -mt smiles -a rnn -e 2 -bs 32 -gpu 0
python -m drugex.generate -b ${BASE_DIR} -g rnn-example_smiles_rnn_FT -vfs Papyrus05.5_smiles_voc.txt -gpu 0 -n 30 --keep_undesired
Sequence-based transformer
To work with a SMILES-based transformer, you need to preprocess the data by specifying -mt smiles, indicating that the inputs are encoded as SMILES. By default, the transformer algorithm (-a trans) is used for training.
Warning
Note that the pretrained model for this architecture is not fetched by the tutorial utility at this point, so you will have to download its files separately. This model is also still more experimental and will likely not perform as well as the previous models.
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o ast -mt smiles
python -m drugex.train -tm FT -b ${BASE_DIR} -i ast -ag ${BASE_DIR}/models/pretrained/smiles-trans/Papyrus05.5_smiles_trans_PT/Papyrus05.5_smiles_trans_PT.pkg -mt smiles -a trans -e 2 -bs 32 -gpu 0
Pretraining a Generator
Pretraining (-tm PT) a model from scratch works exactly the same way as fine-tuning; the only difference is that the generator will not be initialized with pretrained model weights.
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o example_pt -mt graph
python -m drugex.train -tm PT -b ${BASE_DIR} -i example_pt -mt graph -e 2 -bs 32 -gpu 0
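You can then sample from the freshly pretrained model in the same way as before. This is only a sketch that assumes the outputs follow the same naming pattern as in the fine-tuning example (example_pt_graph_trans_PT and example_pt_test_graph.txt); check your data and generators folders for the actual names:
python -m drugex.generate -b ${BASE_DIR} -i example_pt_test_graph.txt -g example_pt_graph_trans_PT -n 5 -gpu 0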
Scaffold-based Reinforcement Learning
Tuning of the transformer-based generators can also be done on one scaffold or a subset of scaffolds. There are two ways to do this: either by using a subset of fragment-molecule pairs containing the selected scaffold, or by using the scaffold directly as input. If your training data contains molecules with the selected scaffold, we recommend the former method, as it is more stable with policy gradient-based reinforcement learning.
Here we show examples of these approaches on the previously trained and fine-tuned A2AR generators. We will use the molecule xanthine as the scaffold in both examples.
With a subset of molecules containing the scaffold
First, the molecules from the given dataset are fragmented and encoded while only selecting fragment-molecule pairs containing xanthine in the input fragments (-sf <scaffold>); then we proceed with RL on this subset of molecules.
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o arl_xanthine -mt graph -sf c1[nH]c2c(n1)nc(nc2O)O
python -m drugex.train -tm RL -b ${BASE_DIR} -i arl_xanthine -o arl_xanthine -ag arl_graph_trans_FT -pr ${BASE_DIR}/models/pretrained/graph-trans/Papyrus05.5_graph_trans_PT/Papyrus05.5_graph_trans_PT.pkg -p models/qsar/qspr/models/A2AR_RandomForestClassifier/A2AR_RandomForestClassifier_meta.json -ta A2AR_RandomForestClassifier -sas -e 2 -bs 32 -gpu 0
python -m drugex.generate -b ${BASE_DIR} -i arl_xanthine -g arl_xanthine_graph_trans_RL -gpu 0 -n 5
If you want the fragment-molecule pairs to consist exclusively of ones that have the selected scaffold as the input fragment, add the -sfe argument, as shown in the sketch below.
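A minimal sketch, appending this flag to the preprocessing command from above:
python -m drugex.dataset -b ${BASE_DIR} -i A2AR_LIGANDS.tsv -mc SMILES -o arl_xanthine -mt graph -sf c1[nH]c2c(n1)nc(nc2O)O -sfe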
With the scaffold as input
First, this molecule is encoded; then reinforcement learning is done with this scaffold as input. Lastly, new molecules are generated containing this scaffold.
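The input file only needs to hold the scaffold itself; a minimal sketch of what xanthine.tsv could contain (reusing the xanthine SMILES from the previous example):
SMILES
c1[nH]c2c(n1)nc(nc2O)O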
# input is in tutorial/CLI/examples/data/xanthine.tsv
python -m drugex.dataset -b ${BASE_DIR} -i xanthine.tsv -mc SMILES -o scaffold_based -mt graph -s
python -m drugex.train -tm RL -b ${BASE_DIR} -i scaffold_based_graph.txt -o scaffold_based -ag arl_graph_trans_FT -pr ${BASE_DIR}/models/pretrained/graph-trans/Papyrus05.5_graph_trans_PT/Papyrus05.5_graph_trans_PT.pkg -p models/qsar/qspr/models/A2AR_RandomForestClassifier/A2AR_RandomForestClassifier_meta.json -ta A2AR_RandomForestClassifier -sas -e 2 -bs 32 -gpu 0
python -m drugex.generate -b ${BASE_DIR} -i scaffold_based_graph.txt -g scaffold_based_graph_trans_RL -gpu 0 -n 5
Note
Since the model used here is not fully converged, it will have trouble producing the scaffold we need, so the generate command may take a long time.