qsprpred.extra.models package

Submodules

qsprpred.extra.models.pcm module

qsprpred.extra.models.random module

class qsprpred.extra.models.random.MedianDistributionAlgorithm[source]

Bases: RandomDistributionAlgorithm

fit(y_df: DataFrame)[source]
from_dict(loaded_dict)[source]
get_probas(X_test: ndarray)[source]
to_dict()[source]
class qsprpred.extra.models.random.RandomDistributionAlgorithm[source]

Bases: ABC

abstract fit(y_df: DataFrame)[source]
abstract from_dict(loaded_dict)[source]
abstract get_probas(X_test: ndarray)[source]
abstract to_dict()[source]
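
The four abstract methods above form the contract that every distribution algorithm implements. The sketch below mirrors those method names with plain Python lists in place of DataFrame and ndarray, so it is not the library's code. `ToyDistributionAlgorithm` and `ToyMedianAlgorithm` are hypothetical names, and the "predict the training median" behaviour is an assumption inferred from the class name MedianDistributionAlgorithm.

```python
from abc import ABC, abstractmethod
from statistics import median


class ToyDistributionAlgorithm(ABC):
    """Hypothetical stand-in mirroring the abstract interface."""

    @abstractmethod
    def fit(self, y): ...

    @abstractmethod
    def get_probas(self, X_test): ...

    @abstractmethod
    def to_dict(self): ...

    @abstractmethod
    def from_dict(self, loaded_dict): ...


class ToyMedianAlgorithm(ToyDistributionAlgorithm):
    """Assumed behaviour: predict the training median for every sample."""

    def __init__(self):
        self.median = None

    def fit(self, y):
        # Only the targets are used; the features never enter the fit.
        self.median = median(y)
        return self

    def get_probas(self, X_test):
        # One constant prediction per row of X_test.
        return [self.median for _ in X_test]

    def to_dict(self):
        # Serializable state, enough to rebuild the fitted object.
        return {"median": self.median}

    def from_dict(self, loaded_dict):
        self.median = loaded_dict["median"]
        return self


alg = ToyMedianAlgorithm().fit([1.0, 2.0, 10.0])
restored = ToyMedianAlgorithm().from_dict(alg.to_dict())
predictions = restored.get_probas([[0.1], [0.2]])
```

The round trip through to_dict and from_dict is what lets a fitted algorithm be stored alongside the model and reloaded without refitting.
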
class qsprpred.extra.models.random.RandomModel(base_dir: str, alg: RandomDistributionAlgorithm, name: str | None = None, parameters: dict | None = None, autoload=True, random_state: int | None = None)[source]

Bases: QSPRModel

Initialize a QSPR model instance.

If the model is loaded from file, the data set is not required. Note that the data set is required for fitting and optimization.

Parameters:
  • base_dir (str) – base directory of the model, the model files are stored in a subdirectory {baseDir}/{outDir}/

  • name (str) – name of the model

  • parameters (dict) – dictionary of algorithm specific parameters

  • autoload (bool) – if True, the estimator is loaded from the serialized file if it exists, otherwise a new instance of alg is created

property applicabilityDomain: Any

Return the applicability domain of the model.

Returns:

applicability domain of the model

Return type:

Any

checkData(ds: QSPRDataSet, exception: bool = True) bool

Check if the model has a data set.

Parameters:
  • ds (QSPRDataSet) – data set to check

  • exception (bool) – if true, an exception is raised if no data is set

Returns:

True if data is set, False otherwise (if exception is False)

Return type:

bool

property classPath: str

Return the fully qualified path of the model.

Returns:

class path of the model

Return type:

str

cleanFiles()

Clean up the model files.

Removes the model directory and all its contents.

convertToNumpy(X: DataFrame | ndarray, y: DataFrame | ndarray | None = None) tuple[ndarray, ndarray] | ndarray

Convert the given data matrix and target matrix to np.ndarray format.

Parameters:
  • X (pd.DataFrame, np.ndarray) – data matrix; if a QSPRDataSet instance is given, the features and targets are extracted from the data set and returned

  • y (pd.DataFrame, np.ndarray) – target matrix

Returns:

data matrix and/or target matrix in np.ndarray format

createPredictionDatasetFromMols(mols: Iterable[str | Mol], n_jobs: int = 1) tuple[QSPRTable, ndarray]

Create a QSPRTable instance from a list of SMILES strings.

Parameters:
  • mols (Iterable[str | Mol]) – iterable of SMILES strings or Mol objects

  • n_jobs (int) – number of parallel jobs to use

Returns:

a tuple containing the QSPRTable instance and a boolean mask indicating which molecules failed to be processed

Return type:

tuple

fit(X: DataFrame | ndarray | QSPRTable, y: DataFrame | ndarray | QSPRTable, estimator: Type[RandomDistributionAlgorithm] | None = None, mode: EarlyStoppingMode = None, monitor: FitMonitor | None = None, **kwargs) RandomDistributionAlgorithm[source]

Fit the model to the given data matrix or QSPRTable.

Parameters:
  • X (pd.DataFrame, np.ndarray, QSPRTable) – data matrix to fit

  • y (pd.DataFrame, np.ndarray, QSPRTable) – target matrix to fit

  • estimator (Any) – estimator instance to use for fitting

  • mode (EarlyStoppingMode) – early stopping mode, unused

  • monitor (FitMonitor) – monitor instance to track the fitting process, unused

  • kwargs – additional keyword arguments for the fit function

Returns:

fitted estimator instance

Return type:

(RandomDistributionAlgorithm)

fitDataset(ds: QSPRDataSet, pipeline: DatasetPipeline | None = None, monitor=None, mode=EarlyStoppingMode.OPTIMAL, save_model=True, save_data=False, **kwargs) str

Train model on the whole attached data set.

** IMPORTANT ** For models that support early stopping (supportsEarlyStopping), an Assessor should be run first, so that the average number of epochs from the cross-validation with early stopping can be used for fitting the model.

Parameters:
  • ds (QSPRDataSet) – data set to fit this model on

  • pipeline (DatasetPipeline) – pipeline to use for fitting

  • monitor (FitMonitor) – monitor for the fitting process, if None, the base monitor is used

  • mode (EarlyStoppingMode) – early stopping mode for models that support early stopping; by default, the model is fitted for the ‘optimal’ number of epochs at which training previously stopped during model assessment on the train or test set, which avoids setting aside extra data for a validation set

  • save_model (bool) – save the model to file

  • save_data (bool) – save the supplied dataset to file

  • kwargs – additional arguments to pass to fit

Returns:

path to the saved model, if save_model is True

Return type:

str
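
The note on early stopping above can be illustrated with a small sketch. It assumes each cross-validation fold reports the epoch at which it stopped and that the final fit reuses the average of those values. The helper name `average_stopping_epoch` and the fold values are hypothetical, and rounding to the nearest integer is an assumption; the library may round differently.

```python
from statistics import mean


def average_stopping_epoch(fold_epochs):
    """Average the per-fold stopping epochs (rounding is an assumption)."""
    if not fold_epochs:
        raise ValueError("run the assessment first: no fold epochs recorded")
    return round(mean(fold_epochs))


# Hypothetical epochs at which five cross-validation folds stopped.
fold_epochs = [42, 38, 45, 40, 41]
optimal_epochs = average_stopping_epoch(fold_epochs)
```

Fitting for a fixed, pre-computed number of epochs means no part of the data set has to be held out as a validation set during the final fit.
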

classmethod fromFile(filename: str) QSPRModel

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getParameters(new_parameters: dict | None = None) dict | None

Get the model parameters combined with the given parameters.

If both the model and the given parameters contain the same key, the value from the given parameters is used.

Parameters:

new_parameters (dict) – dictionary of new parameters to add

Returns:

dictionary of model parameters

Return type:

dict
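
The override rule described above (values from the given parameters win on key clashes) can be sketched with plain dictionaries. `merge_parameters` is a hypothetical helper, not the method itself, and returning None when neither dictionary is set is an assumption based on the `dict | None` return annotation.

```python
def merge_parameters(model_parameters, new_parameters=None):
    """Combine stored and new parameters; new values win on key clashes."""
    if model_parameters is None and new_parameters is None:
        return None  # assumption: nothing stored and nothing given
    merged = dict(model_parameters or {})  # copy, so the stored dict is untouched
    merged.update(new_parameters or {})
    return merged


stored = {"n_estimators": 100, "max_depth": 5}
merged = merge_parameters(stored, {"max_depth": 10, "min_samples_leaf": 2})
```

Copying before updating keeps the stored parameters unchanged, so the merge can be repeated with different overrides.
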

static handleInvalidsInPredictions(num_mols: int, predictions: ndarray | list[ndarray], failed_mask: ndarray) ndarray

Replace invalid predictions with None.

Parameters:
  • num_mols (int) – number of molecules for which the predictions were made

  • predictions (np.ndarray) – predictions made by the model

  • failed_mask (np.ndarray) – boolean mask of failed predictions

Returns:

predictions with invalids replaced by None

Return type:

np.ndarray
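
As a rough illustration of the masking step, the sketch below uses plain lists instead of NumPy arrays to stay dependency-free. It assumes that `predictions` holds rows only for the molecules that were processed successfully and that failed molecules are marked True in `failed_mask`. `fill_invalids` is a hypothetical name.

```python
def fill_invalids(num_mols, predictions, failed_mask):
    """Return num_mols rows, with None rows where failed_mask is True."""
    if len(failed_mask) != num_mols:
        raise ValueError("failed_mask must have one entry per molecule")
    n_targets = len(predictions[0]) if predictions else 1
    valid_rows = iter(predictions)
    # Walk the mask in input order, consuming one prediction per valid molecule.
    return [
        [None] * n_targets if failed else list(next(valid_rows))
        for failed in failed_mask
    ]


# Four input molecules, the second failed, so only three predictions exist.
failed_mask = [False, True, False, False]
predictions = [[0.1], [0.7], [0.3]]
filled = fill_invalids(4, predictions, failed_mask)
```

The result has one row per input molecule, so predictions stay aligned with the original input order.
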

initFromData(data: QSPRDataSet | None, pipeline: DatasetPipeline | None)

Initialize the model from a data set and pipeline.

Parameters:
  • data (QSPRDataSet) – data set to initialize the model with

  • pipeline (DatasetPipeline) – pipeline to use for feature calculation

initRandomState(random_state)

Set random state if applicable. Defaults to the random state of the data set if no random state is provided.

Parameters:

random_state (int) – Random state to use for shuffling and other random operations.

property isMultiTask: bool

Return whether the model is a multitask model, taken from the data set or deserialized from file if the model is loaded without data.

Returns:

True if model is a multitask model

Return type:

bool

loadEstimator(params: dict | None = None) object[source]

Initialize estimator instance with the given parameters.

If params is None, the default parameters will be used.

Parameters:

params (dict) – algorithm parameters

Returns:

initialized estimator instance

Return type:

object

loadEstimatorFromFile(params: dict | None = None, fallback_load=True) object[source]

Load estimator instance from file and apply the given parameters.

Parameters:

params (dict) – algorithm parameters

Returns:

initialized estimator instance

Return type:

object

classmethod loadParamsGrid(fname: str, optim_type: str, model_types: str) ndarray

Load parameter grids for bayes or grid search parameter optimization from json file.

Parameters:
  • fname (str) – file name of json file containing array with three columns containing modeltype, optimization type (grid or bayes) and model type

  • optim_type (str) – optimization type (grid or bayes)

  • model_types (list of str) – model type for hyperparameter optimization (e.g. RF)

Returns:

array with three columns containing modeltype, optimization type (grid or bayes) and model type

Return type:

np.ndarray

property metaFile: str
property optimalEpochs: int | None

Return the optimal number of epochs for early stopping.

Returns:

optimal number of epochs

Return type:

int | None

property outDir: str

Return output directory of the model, the model files are stored in this directory ({baseDir}/{name}).

Returns:

output directory of the model

Return type:

str

property outPrefix: str

Return output prefix of the model files.

The model files are stored with this prefix (i.e. {outPrefix}_meta.json).

Returns:

output prefix of the model files

Return type:

str
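
The path conventions described for outDir and outPrefix can be sketched as follows. The `{baseDir}/{name}` directory and the `{outPrefix}_meta.json` file name come from the descriptions above. That the prefix itself is `{outDir}/{name}` is an assumption, and `model_paths` is a hypothetical helper.

```python
import os


def model_paths(base_dir, name):
    """Build the output directory, file prefix, and meta file path."""
    out_dir = os.path.join(base_dir, name)      # {baseDir}/{name}
    out_prefix = os.path.join(out_dir, name)    # assumed prefix layout
    meta_file = f"{out_prefix}_meta.json"       # {outPrefix}_meta.json
    return out_dir, out_prefix, meta_file


out_dir, out_prefix, meta_file = model_paths("models", "RandomBaseline")
```
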

predict(X: DataFrame | ndarray | QSPRTable, estimator: Any = None) ndarray[source]

Make predictions for the given data matrix or QSPRTable.

Parameters:
  • X (pd.DataFrame, np.ndarray, QSPRTable) – data matrix to predict

  • estimator (Any) – estimator instance to use for prediction

Returns:

2D array containing the predictions, where each row corresponds to a sample in the data and each column to a target property

Return type:

np.ndarray

predictDataset(dataset: QSPRDataSet, use_probas: bool = False) ndarray | list[ndarray]

Make predictions for the given dataset.

Parameters:
  • dataset – a QSPRDataSet instance

  • use_probas – use probabilities if this is a classification model

Returns:

an array of predictions or a list of arrays of predictions (for classification models with use_probas=True)

Return type:

np.ndarray | list[np.ndarray]

predictMols(mols: Iterable[str | Mol], use_probas: bool = False, n_jobs: int = 1, use_applicability_domain: bool = False) ndarray | list[ndarray]

Make predictions for the given molecules.

Parameters:
  • mols (Iterable[str | Mol]) – iterable of SMILES strings or Mol objects

  • use_probas (bool) – use probabilities for classification models

  • n_jobs – Number of jobs to use for parallel processing.

  • use_applicability_domain – Use applicability domain to return if a molecule is within the applicability domain of the model.

Returns:

an array of predictions or a list of arrays of predictions

(for classification models with use_probas=True)

np.ndarray[bool]: boolean mask indicating which molecules fall

within the applicability domain of the model

Return type:

np.ndarray | list[np.ndarray]

predictProba(X: DataFrame | ndarray | QSPRTable, estimator: Any = None)[source]

Make predictions for the given data matrix or QSPRDataSet, but use probabilities for classification models. Does not work with regression models.

Note. convertToNumpy can be called here, to convert the input data to np.ndarray format.

Note. If no estimator is given, the estimator instance of the model is used.

Parameters:
  • X (pd.DataFrame, np.ndarray, QSPRDataSet) – data matrix to make predictions for

  • estimator (Any) – estimator instance to use for prediction

Returns:

a list of 2D arrays containing the probabilities for each class, where each array corresponds to a target property, each row to a sample in the data and each column to a class

Return type:

list[np.ndarray]

save(save_estimator=False)

Save model to file.

Parameters:

save_estimator (bool) – Explicitly save the estimator to file, if True. Note that some models may save the estimator by default even if this argument is False.

Returns:

absolute path to the metafile of the saved model

str: absolute path to the saved estimator, if save_estimator is True

Return type:

str

saveEstimator() str[source]

Save the underlying estimator to file.

Returns:

path to the saved estimator

Return type:

path (str)

setParams(params: dict | None, reset_estimator: bool = True)

Set model parameters. The estimator is also updated with the new parameters if ‘reset_estimator’ is True.

Parameters:
  • params (dict) – dictionary of model parameters or None to reset the parameters

  • reset_estimator (bool) – if True, the estimator is reinitialized with the new parameters

property supportsEarlyStopping: bool

Check if the model supports early stopping.

Returns:

whether the model supports early stopping or not

Return type:

(bool)

property task: ModelTasks

Return the task of the model, taken from the data set or deserialized from file if the model is loaded without data.

Returns:

task of the model

Return type:

ModelTasks

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON()

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
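
The four serialization methods (toJSON, fromJSON, toFile, fromFile) form a round trip. The sketch below shows that pattern on a toy class using only the standard library. `ToySerializable` and its attributes are hypothetical and say nothing about which fields the real model stores.

```python
import json
import os
import tempfile


class ToySerializable:
    """Hypothetical class with the same four method names."""

    def __init__(self, name, parameters=None):
        self.name = name
        self.parameters = parameters or {}

    def toJSON(self):
        # Everything needed to rebuild the object goes into the string.
        return json.dumps({"name": self.name, "parameters": self.parameters})

    @classmethod
    def fromJSON(cls, json_str):
        return cls(**json.loads(json_str))

    def toFile(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    original = ToySerializable("RandomBaseline", {"random_state": 42})
    path = original.toFile(os.path.join(tmp_dir, "toy_meta.json"))
    restored = ToySerializable.fromFile(path)
```

Building the file methods on top of the string methods keeps a single definition of what gets serialized.
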

class qsprpred.extra.models.random.RatioDistributionAlgorithm(random_state=None)[source]

Bases: RandomDistributionAlgorithm

Categorical distribution using the ratio of categories as probabilities.

Values of X are irrelevant; only the distribution of y is used.

Variables:
  • ratios (pd.DataFrame) – ratio of each category in y

  • random_state (int) – random state for reproducibility

fit(y_df: DataFrame)[source]

Calculate ratio of each category in y_df and store as probability distribution

from_dict(loaded_dict)[source]
get_probas(X_test: ndarray)[source]

Get probabilities of each category for each sample in X_test

to_dict()[source]
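
The idea behind the ratio-based algorithm can be sketched with the standard library alone. The class below is a hypothetical stand-in that uses lists and a dict where the real class stores a pd.DataFrame. The `sample` method is an assumed extra step that shows one way the stored ratios and random_state could produce reproducible random labels.

```python
import random
from collections import Counter


class ToyRatioAlgorithm:
    """Hypothetical stand-in: class ratios in y become probabilities."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.ratios = None

    def fit(self, y):
        counts = Counter(y)
        total = sum(counts.values())
        # Sort the categories so the column order is deterministic.
        self.ratios = {label: counts[label] / total for label in sorted(counts)}
        return self

    def get_probas(self, X_test):
        # The same probability row for every sample; X values are ignored.
        row = list(self.ratios.values())
        return [row[:] for _ in X_test]

    def sample(self, n):
        # Assumed extra step: draw labels according to the stored ratios.
        rng = random.Random(self.random_state)
        labels = list(self.ratios)
        return rng.choices(labels, weights=list(self.ratios.values()), k=n)


alg = ToyRatioAlgorithm(random_state=0).fit([0, 0, 0, 1])
probas = alg.get_probas([[1.2], [3.4]])
```

Because the features are never inspected, a model like this serves as a baseline: a real model should outperform predictions drawn from the class ratios alone.
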
class qsprpred.extra.models.random.ScipyDistributionAlgorithm(distribution: scipy.stats.rv_continuous = scipy.stats.norm, params={}, random_state=None)[source]

Bases: RandomDistributionAlgorithm

fit(y_df: DataFrame)[source]
from_dict(loaded_dict)[source]
get_probas(X_test: ndarray)[source]
to_dict()[source]

qsprpred.extra.models.tests module

Module contents