Command Line Interface Usage
You can use the command-line interface to preprocess data and build models.
The description of the functionality can be displayed with the --help argument, e.g. the help message for the qsprpred.data_CLI script can be shown as follows:
python -m qsprpred.data_CLI --help
A simple command-line workflow to prepare your dataset and train QSPR models is given below (see CLI Example).
If you want more control over the inputs and outputs or want to customize QSPRpred for your purpose, you can also use the Python API directly (see tutorials).
CLI Example
In this example, we will use the command line utilities of QSPRpred to train QSAR models for the A2A and A2B receptors. We will use the Adenosine dataset from the API tutorial. The data is available through OneDrive (just unzip and place the two datasets A2A_LIGANDS.tsv and AR_LIGANDS.tsv in the 'tutorial_data' folder) or recreate the dataset yourself by running 'tutorial_data/create_tutorial_data.py'.
Input data should contain a column with SMILES sequences and at least one column with a property for modelling. The AR_LIGANDS.tsv file contains a SMILES column, a column with the property pchembl_value_Mean and a column with the property accession (UniProt accession numbers). However, to create models for the A2A and A2B receptors, we need the data in a pivot table format, where the properties are in the columns. To create this pivot table, we can use the pandas library in Python as follows:
import pandas as pd

# Read the raw data and pivot it so that each accession becomes its own column
df = pd.read_csv('tutorial_data/AR_LIGANDS.tsv', sep='\t')
df = df.pivot(index="SMILES", columns="accession", values="pchembl_value_Mean")
df.columns.name = None
df.reset_index(inplace=True)
df.to_csv('tutorial_data/AR_LIGANDS_pivot.tsv', sep='\t')
It is also possible to simply run the create_tutorial_data.py script in the tutorial_data folder with the following command:
python create_tutorial_data.py -m AR_LIGANDS.tsv
This will create a pivot table with the name AR_LIGANDS_pivot.tsv in the tutorial_data folder. Our example dataset now contains a SMILES column and two columns with the properties P29274 (A2AR) and P29275 (A2BR), as well as columns for the A1 and A3 receptors, which we will not use in this example.
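To sanity-check the pivoted file before preparing it, you can inspect its columns with pandas (a minimal sketch; expect a SMILES column plus one column per receptor accession):

import pandas as pd

# Peek at the pivoted dataset created above
df = pd.read_csv('tutorial_data/AR_LIGANDS_pivot.tsv', sep='\t')
print(df.columns.tolist())
print(df[['SMILES', 'P29274', 'P29275']].head())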
Preparing Data
Basics
We will now use the qsprpred.data_CLI script for data preparation. In the CLI we need to indicate which property/ies we are interested in predicting (here P29274 and P29275, the A2A and A2B receptor respectively); these should be equal to the column names (which should not contain spaces) containing the values to be predicted. For regression models these columns should contain numerical data points. For classification models either categorical or numerical data can be used (the latter will be categorized based on the activity threshold). Furthermore, we should indicate how we wish to split the data to create a train and test set. Here we will use a random split with a test fraction of 15%. Finally, we need to calculate features that describe the molecules; here we use Morgan fingerprints.
# input is in ./tutorial_data/AR_LIGANDS_pivot.tsv
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp random -sf 0.15 -fe Morgan
Running this command will create the folder tutorial_output/data with subfolders P29274_REGRESSION and P29275_REGRESSION containing the prepared data. Each subfolder, named after the property identifier (P29274/P29275) and the model task (REGRESSION), will contain the following files:
| File | Function |
|---|---|
| {prefixes}_df.pkl | Dataframe |
| {prefixes}_meta.json | Meta data, also used to instantiate a QSPRData object |
| {prefixes}_MorganFP | Descriptor set folder |
| {prefixes}_MorganFP/{prefixes}_MorganFP_df.pkl | Calculated descriptors |
| {prefixes}_MorganFP/{prefixes}_MorganFP_meta.json | Meta data of the descriptor set |
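The meta file can also be used to reload the prepared dataset from the Python API, for example (a minimal sketch; the exact import path and fromFile method are assumptions based on the QSPRpred tutorials):

from qsprpred.data import QSPRDataset

# Reload the prepared dataset from its meta file (assumed import path/API)
dataset = QSPRDataset.fromFile('./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json')
print(dataset.targetProperties)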
Furthermore, the command line interface will create a log file and settings file in the output folder.
| File | Function |
|---|---|
| QSPRdata.json | Command line interface settings |
| QSPRdata.log | Log file |
More
Run settings arguments
Apart from the input file name, there are a few other base options that can be set. Including -de will print debug information to the log file. The random seed (-ran) can also be set manually, which should guarantee identical results across runs with the same seed. Furthermore, the number of CPUs used for the data preparation can be set with -ncpu. Finally, the name of the SMILES column in your dataset can be indicated with -sm (default: SMILES).
# input is in ./tutorial_data/AR_LIGANDS_pivot.tsv; setting debug flag, SMILES column, random seed and number of CPUs
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -sm SMILES -de -ran 42 -ncpu 5 -pr P29274 -pr P29275 -r REG -sp random -sf 0.15 -fe Morgan
Transform target property
To apply transformations (-tr) to target properties, indicate this in the CLI as follows:
# Log transform data for P29274
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -tr '{"P29274":"log"}' -r REG -sp random -sf 0.15 -fe Morgan
Note: on Windows, remove the single quotes around the dictionary and add backslashes before the double quotes, e.g. -tr {\"P29274\":\"log\"}
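Conceptually, this transforms the target column before modelling, along these lines (a sketch with numpy; whether a natural or base-10 logarithm is applied is an implementation detail not confirmed here):

import numpy as np
import pandas as pd

# Conceptual equivalent of -tr '{"P29274":"log"}': log-transform one target column
df = pd.read_csv('tutorial_data/AR_LIGANDS_pivot.tsv', sep='\t')
df['P29274'] = np.log(df['P29274'])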
Train-test split
In the base example we use a random split to create the train and test set, but there are several more options. One is the scaffold split, where the data is split into a test and train set randomly, but molecules with the same (Murcko) scaffold are kept in the same set.
# Scaffold split
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp scaffold -sf 0.15 -fe Morgan
Another option is the cluster split, where the data is split into a test and train set randomly, but molecules from the same cluster are kept in the same set. Here you can also set the clustering method (-scm).
# Cluster split
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp cluster -scm MaxMin -sf 0.15 -fe Morgan
The third option is a temporal split, which requires a column holding the time at which each sample was observed; the data is then split based on a threshold on that column. In this example, all samples after 2015 (in the column year) make up the test set. NOTE: this example will not work on the example set, as it does not contain a year column.
# Time split
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp time -st 2015 -stc year -fe Morgan
Lastly, the data can be split based on a specific column in the dataset. This column has to be named datasplit, where the value test indicates the test set and the value train indicates the train set. NOTE: this example will not work on the example set, as it does not contain a datasplit column. A sketch of how such a column could be added is given after the command below.
# Split based on a specific column
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp manual -sf 0.15 -fe Morgan
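To use the manual split on this dataset, you would first add such a column yourself, for example (a minimal sketch; the output file name is just an example and should then be passed to -i):

import numpy as np
import pandas as pd

df = pd.read_csv('tutorial_data/AR_LIGANDS_pivot.tsv', sep='\t')
# Randomly assign ~15% of the rows to the test set via a 'datasplit' column
rng = np.random.default_rng(42)
df['datasplit'] = np.where(rng.random(len(df)) < 0.15, 'test', 'train')
df.to_csv('tutorial_data/AR_LIGANDS_pivot_manual.tsv', sep='\t', index=False)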
Data for classification models
You can set whether to prepare data for regression, classification or both. The default is to run both, but you can run either one by setting the regression argument to true/REG for regression or false/CLS for classification. When using classification, the threshold(s) for each property that has not been pre-classified need to be included. If the data is already pre-classified, the threshold has to be set to 'precomputed'. The thresholds are set using a dictionary. In case of multi-class classification, the boundaries of the bins need to be given; for binary classification, give only one threshold per property.
# Classification and regression
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r CLS -sp random -sf 0.15 -fe Morgan -th '{"P29274":[6.5],"P29275":[0,4,6,12]}'
Note: on Windows, remove the single quotes around the dictionary and add backslashes before the double quotes, e.g. -th {\"P29274\":[6.5],\"P29275\":[0,4,6,12]}
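Conceptually, these thresholds bin the continuous values much like pandas would (a sketch of the binning, not QSPRpred's internal code):

import pandas as pd

values = pd.Series([3.5, 5.2, 6.8, 9.1])
# Binary classification with a single threshold (6.5): active if above the threshold
print(values > 6.5)
# Multi-class classification with bin boundaries [0, 4, 6, 12]: three classes
print(pd.cut(values, bins=[0, 4, 6, 12], labels=[0, 1, 2]))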
Feature calculation
There are many different descriptor sets that can be calculated from the CLI, such as Morgan fingerprints and RDKit, Mordred, Mold2 and PaDEL descriptors. Check the help message for the full list of available descriptor sets. The different descriptor sets can also be combined. For more control over the descriptor settings, use the Python API.
# With Morgan, RDkit, Mordred, Mold2, PaDEL and DrugEx descriptors
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp random -sf 0.15 -fe Morgan RDkit Mordred Mold2 PaDEL DrugEx
Feature filtering
The calculated features can also be filtered. Three different filters are implemented in QSPRpred: a high correlation filter, a low variance filter and the Boruta filter. The high correlation filter and the low variance filter need to be set with a threshold for filtering, while the Boruta filter needs a threshold for the comparison between shadow and real features.
# input is in ./tutorial_data/AR_LIGANDS_pivot.tsv
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 -pr P29275 -r REG -sp random -sf 0.15 -fe Morgan -lv 0.1 -hc 0.9 -bf 90
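As an illustration of what the low variance and high correlation filters do (a conceptual sketch with pandas and scikit-learn, not QSPRpred's own implementation):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# X: calculated descriptors (rows: molecules, columns: features)
X = pd.DataFrame({'f1': [0, 0, 0, 0], 'f2': [1, 2, 3, 4], 'f3': [1.1, 2.0, 3.1, 4.2]})

# Low variance filter (-lv 0.1): drop features with variance below 0.1
lv = VarianceThreshold(threshold=0.1)
X_lv = pd.DataFrame(lv.fit_transform(X), columns=X.columns[lv.get_support()])

# High correlation filter (-hc 0.9): drop one of each feature pair correlated above 0.9
corr = X_lv.corr().abs()
drop = [c for i, c in enumerate(corr.columns) if (corr[c].iloc[:i] > 0.9).any()]
X_filtered = X_lv.drop(columns=drop)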
Papyrus Low quality filter
Specifically for use with data from the Papyrus dataset, an option is included for filtering low quality data (all rows with the value 'Low' in the column 'Quality' are removed). To apply this filter, include -lq or --low_quality in your command.
Multitask data
Multitask modelling is possible by passing multiple properties to the -pr argument. Furthermore, missing data can be imputed using the -im argument. You can combine any number of targets and any combination of regression and classification tasks in the data preparation. However, the DNN models currently do not support multitask modelling, and of the scikit-learn models only the random forest and KNN models do. Multitask scikit-learn modelling is only possible for multiple regression tasks or multiple single-class classification tasks; multiple multi-class classification tasks, or a combination of regression and classification tasks, are not supported at the moment.
# input is in ./tutorial_data/AR_LIGANDS_pivot.tsv
python -m qsprpred.data_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/data -pr P29274 P29275 -r REG -sp random -sf 0.15 -fe Morgan -im '{"P29274":"mean", "P29275":"median"}'
Note: on Windows, remove the single quotes around the dictionary and add backslashes before the double quotes, e.g. -im {\"P29274\":\"mean\",\"P29275\":\"median\"}
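Conceptually, the imputation fills missing target values per column, e.g. with pandas (a sketch, not QSPRpred's internal code):

import pandas as pd

df = pd.read_csv('tutorial_data/AR_LIGANDS_pivot.tsv', sep='\t')
# Equivalent in spirit to -im '{"P29274":"mean","P29275":"median"}'
df['P29274'] = df['P29274'].fillna(df['P29274'].mean())
df['P29275'] = df['P29275'].fillna(df['P29275'].median())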
Model Training
Basics
Finally, we need to indicate which models we want to train and which steps to take in the training. In this example, we will build regression random forest models by passing the prepared regression dataset files P29274_REGRESSION and P29275_REGRESSION to the -dp argument. If you wish to train classification models, you can pass the classification datasets P29274_SINGLECLASS and P29275_MULTICLASS to the -dp argument (or any combination thereof). The model type is set with -mt. We will also evaluate the model through cross-validation (-me) and train the model on all data to save it for further use (-s).
We will also evaluate the model through cross-validation (-me
) and train the model on all data to save for further use (-s
).
# Using the prepared datasets P29274_REGRESSION and P29275_REGRESSION
python -m qsprpred.model_CLI -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -mt RF -me -s
This will create a folder tutorial_output/models containing the trained models. Each subfolder, named after the model type (RF) and the dataset name (P29274_REGRESSION/P29275_REGRESSION), will contain the following files:
| File | Function |
|---|---|
| {prefixes}.json | Model file |
| {prefixes}_meta.json | Meta data, also used to instantiate a QSPRModel object |
| {prefixes}_cv.tsv | Cross-validation predictions |
| {prefixes}_ind.tsv | Test set predictions |
Furthermore, the command line interface will create a log file and settings file in the output folder.
| File | Function |
|---|---|
| QSPRmodel.json | Command line interface settings |
| QSPRmodel.log | Log file |
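The saved meta file can later be used to reload the trained model from the Python API, for example (a minimal sketch; the exact import path and the fromFile/predictMols methods are assumptions based on the QSPRpred tutorials):

from qsprpred.models import QSPRModel

# Reload the trained random forest model from its meta file (assumed import path/API)
model = QSPRModel.fromFile('./tutorial_output/models/RF_P29274_REGRESSION/RF_P29274_REGRESSION_meta.json')
# Predict directly from a list of SMILES strings
print(model.predictMols(['CCO', 'c1ccccc1']))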
More
The model training can be further customized with several CLI arguments; for more control over the model training settings, use the Python API. Here you can find a short overview.
Run settings arguments
As with the data preparation, including -de will print debug information to the log file. The random seed can also be set manually (although identical results are not guaranteed when keeping the same seed). Furthermore, the number of CPUs used for model training and the GPU number for training PyTorch models can be set.
# Setting debug flag, random seed, number of CPUs and a specific GPU (for now, multiple GPUs are not possible)
python -m qsprpred.model_CLI -de -ran 42 -ncpu 5 -gpus [3] -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -mt RF -me -s
Model types
You also need to indicate which models you want to run, out of the following model types: 'RF' (random forest), 'XGB' (XGBoost), 'SVM' (support vector machine), 'PLS' (partial least squares regression), 'KNN' (k-nearest neighbours), 'NB' (naive Bayes) and/or 'DNN' (PyTorch fully connected neural net). The default is to run all the different model types.
# Training a RF, SVM and PLS model
python -m qsprpred.model_CLI -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -me -s -mt RF SVM PLS
Defining model parameters
Specific model parameters can be set with the parameters argument (-p) by passing a json file, e.g. ./myparams.json:
[["RF", {"max_depth": [null, 20, 50, 100],
"max_features": ["sqrt", "log2"],
"min_samples_leaf": [1, 3, 5]}],
["KNN", {"n_neighbors" : [1, 5, 15, 25, 30],
"weights" : ["uniform", "distance"]}]]
# Setting some parameter values for a Random Forest and k-nearest neighbours model
python -m qsprpred.model_CLI -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -mt RF KNN -me -s -p ./tutorial_output/myparams
Specifically for the training of the DNN model, you can set the tolerance and the patience from the CLI. Tolerance gives the minimum decrease in loss needed to count as an improvement, and patience is the number of training epochs without improvement in loss after which training is stopped.
# Setting the tolerance and patience for training a DNN model
python -m qsprpred.model_CLI -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -mt DNN -me -s -tol 0.02 -pat 100
Hyperparameter optimization
In addition to setting model parameters manually, a hyperparameter search can be performed. In QSPRpred, two methods of hyperparameter optimization are implemented: grid search and Bayesian optimization. For Bayesian optimization, also give the number of trials. The search space needs to be set using a json file. A simple search space file for an RF and a KNN model is given below; note the indication of the model type as the first list item and the type of optimization algorithm as the third list item. The search space file should always include all models to be trained.
./mysearchspace.json:
[["RF", {"max_depth": [null, 20, 50, 100],
"max_features": ["sqrt", "log2"],
"min_samples_leaf": [1, 3, 5]}, "grid"],
["RF", {"n_estimators": ["int", 10, 2000],
"max_depth": ["int", 1, 100],
"min_samples_leaf": ["int", 1, 25]}, "bayes"],
["KNN", {"n_neighbors" : [1, 5, 15, 25, 30],
"weights" : ["uniform", "distance"]}, "grid"],
["KNN", {"n_neighbors": ["int", 1, 100],
"weights": ["categorical", ["uniform", "distance"]],
"metric": ["categorical", ["euclidean","manhattan",
"chebyshev","minkowski"]]}, "bayes"]]
# Bayesian optimization
python -m qsprpred.model_CLI -dp ./tutorial_output/data/P29274_REGRESSION/P29274_REGRESSION_meta.json ./tutorial_output/data/P29275_REGRESSION/P29275_REGRESSION_meta.json -o ./tutorial_output/models -mt RF -me -s -o bayes -nt 5 -ss ./tutorial_output/mysearchspace.json
Multitask modelling
Multitask modelling is also possible. This means that the models are trained on multiple targets at once. The modelling arguments are the same as for single-task modelling; you just need to pass a multitask dataset to the -dp argument (see multitask data preparation above).
Prediction
Furthermore, trained QSPRpred models can be used to predict values from SMILES with the command line interface qsprpred.predict_CLI.
Basics
Here we will predict activity values for the A2A (P29274) and A2B (P29275) receptors for the SMILES in the dataset used in the previous examples, using the models trained above. The input -i here is the set of SMILES for which we want to predict activity values. The argument -mp takes the paths to the meta files of the models we want to use for prediction.
# Making predictions for the A2A and A2B receptor
python -m qsprpred.predict_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/predictions/AR_LIGANDS_preds.tsv -mp ./tutorial_output/models/RF_P29274_REGRESSION/RF_P29274_REGRESSION_meta.json ./tutorial_output/models/RF_P29275_REGRESSION/RF_P29275_REGRESSION_meta.json
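The predictions are written to the output file given with -o and can be inspected with pandas (a sketch; the exact names of the prediction columns depend on the models used):

import pandas as pd

# Load the predictions written by predict_CLI: one prediction column per model
preds = pd.read_csv('tutorial_output/predictions/AR_LIGANDS_preds.tsv', sep='\t')
print(preds.head())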
More
The predictions can be further customized with several CLI arguments. Here you can find a short overview.
Run settings arguments
As with the data preparation, including -de will print debug information to the log file. The random seed can also be set manually. Furthermore, the number of CPUs used for model prediction and the GPU number for prediction with PyTorch models can be set.
# Setting debug flag, random seed, output file name, number of CPUs and a specific GPU (for now, multiple GPUs are not possible)
python -m qsprpred.predict_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/predictions/AR_LIGANDS_preds.tsv -mp ./tutorial_output/models/RF_P29274_REGRESSION/RF_P29274_REGRESSION_meta.json ./tutorial_output/models/RF_P29275_REGRESSION/RF_P29275_REGRESSION_meta.json -de -ran 42 -ncpu 5 -gpus [3]
Adding probability predictions
When using a classification model, the probability of the predicted class can be added to the output file using the -pr flag.
# Adding probability predictions
python -m qsprpred.predict_CLI -i ./tutorial_data/AR_LIGANDS_pivot.tsv -o ./tutorial_output/predictions/AR_LIGANDS_preds.tsv -mp ./tutorial_output/models/RF_P29274_SINGLECLASS/RF_P29274_SINGLECLASS_meta.json ./tutorial_output/models/RF_P29275_MULTICLASS/RF_P29275_MULTICLASS_meta.json -pr