qsprpred.data package

Module contents

class qsprpred.data.BootstrapSplit(split: DataSplit, n_bootstraps=5, seed=None, dataset=None)[source]

Bases: DataSplit, Randomized, DataSetDependent

Splits the dataset into random train and test subsets (bootstraps). Unlike cross-validation, bootstrapping allows for repeated samples in the test set.

Variables:
  • nBootstraps (int) – number of bootstraps to perform

  • seed (int) – Random state to use for shuffling and other random operations.

Initialize a BootstrapSplit object.

Parameters:
  • split (DataSplit) – the splitter to use for the bootstraps

  • n_bootstraps (int) – number of bootstraps to perform

  • seed (int) – random seed to use for random operations

  • dataset (QSPRDataSet) – dataset for the underlying splitter if it is DataSetDependent

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

property randomState: int

Get the random state for the object.

setDataSet(dataset)[source]

Set the dataset for the underlying splitter.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) Iterable[tuple[list[int], list[int]]][source]

Split the given data into nBootstraps training and test sets.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over nBootstraps tuples generated by the underlying splitter
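The behaviour described above can be illustrated with a minimal, library-independent sketch. It is not the qsprpred implementation: `random_split` and `bootstrap_split` are hypothetical helpers, and the underlying splitter is assumed to be a simple seeded random split. Each bootstrap reseeds that splitter, so the same sample may end up in the test set of more than one bootstrap.

```python
import random
from typing import Iterator


def random_split(n_samples: int, test_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Hypothetical underlying splitter: one seeded random train/test split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def bootstrap_split(
    n_samples: int, n_bootstraps: int = 5, test_fraction: float = 0.2, seed: int = 42
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield n_bootstraps (train, test) index pairs, reseeding the splitter each time."""
    seeder = random.Random(seed)
    for _ in range(n_bootstraps):
        # Derive a fresh seed per bootstrap so that the splits differ
        # while the whole sequence stays reproducible.
        yield random_split(n_samples, test_fraction, seeder.randint(0, 2**32 - 1))


splits = list(bootstrap_split(n_samples=10, n_bootstraps=3))
```

Because every per-bootstrap seed is derived from the single top-level seed, running the generator twice with the same arguments gives the same sequence of splits.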

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
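The four serialization methods (toJSON, fromJSON, toFile, fromFile) form a round trip. The sketch below shows that contract with a hypothetical stand-in class, `SplitConfig`, built on the standard `json` module. It only mirrors the documented method names and return values and makes no claim about how qsprpred encodes its objects internally.

```python
import json
import os
import tempfile


class SplitConfig:
    """Hypothetical stand-in for a JSON-serializable object."""

    def __init__(self, n_bootstraps: int = 5, seed=None):
        self.nBootstraps = n_bootstraps
        self.seed = seed

    def toJSON(self) -> str:
        # Store everything that is needed to rebuild the object.
        return json.dumps({"n_bootstraps": self.nBootstraps, "seed": self.seed})

    @classmethod
    def fromJSON(cls, json_str: str) -> "SplitConfig":
        return cls(**json.loads(json_str))

    def toFile(self, filename: str) -> str:
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        # Return the absolute path, as the documented toFile does.
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename: str) -> "SplitConfig":
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = SplitConfig(n_bootstraps=3, seed=7).toFile(os.path.join(tmp_dir, "split.json"))
    restored = SplitConfig.fromFile(saved_path)
```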

class qsprpred.data.ClusterSplit(test_fraction: float = 0.1, n_folds: int = 1, custom_test_list: list[str] | None = None, seed: int | None = None, clustering: MoleculeClusters | None = None, data_set: QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit, Randomized

Splits dataset into balanced train and test subsets based on clusters of similar molecules.

Variables:
  • testFraction (float) – fraction of the total dataset to assign to the test set

  • customTestList (list) – list of molecule indexes to force in test set

  • seed (int) – Random state to use for shuffling and other random operations.

  • split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

property randomState: int

Get the random state for the object.

setDataSet(dataset: QSPRDataSet | None) None

Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix
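The core idea of a cluster-based split is that molecules from the same cluster never end up on both sides of the split. The sketch below is a greedy simplification with hypothetical names (`cluster_split`, `cluster_ids`): it assigns whole clusters to the test set until the requested fraction is reached. The documented class instead forwards the balancing to GloballyBalancedSplit, which this sketch does not reproduce.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids: list[int], test_fraction: float = 0.1, seed: int = 0):
    """Assign whole clusters to the test set until test_fraction is reached."""
    members = defaultdict(list)
    for row, cluster in enumerate(cluster_ids):
        members[cluster].append(row)

    # Visit the clusters in a seeded random order.
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    test: list[int] = []
    for cluster in clusters:
        if len(test) >= target:
            break
        test.extend(members[cluster])

    test_set = set(test)
    train = [row for row in range(len(cluster_ids)) if row not in test_set]
    return train, sorted(test)


# One cluster label per row of the input data matrix (made-up values).
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
train_idx, test_idx = cluster_split(cluster_ids, test_fraction=0.2, seed=1)
```

Since clusters are moved as a unit, the realized test fraction can overshoot the requested one when clusters are large.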

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.data.DatasetPipeline(feature_calculators: list[DescriptorSet] | None = None, steps: dict[str, Step | BaseEstimator] | None = None, fixed: list[str] | None = None, fit_on: dict[str, str] | None = None, apply_to: dict[str, str] | None = None, skip: list[str] | None = None, seed: int | None = None)[source]

Bases: Pipeline

Pipeline class for applying data preprocessing steps to a QSPRDataset.

Variables:
  • feature_calculators (list[DescriptorSet] | None) – List of feature calculators to apply to the dataset. If None, no feature calculators are applied.

  • originalfeatureNames (list[str] | None) – Original feature names in the dataset before applying the pipeline.

Initialize the DatasetPipeline

Parameters:
  • feature_calculators (list[DescriptorSet] | None) – List of feature calculators to apply to the dataset.

  • steps (dict[str, Step | BaseEstimator]) – Dictionary of named steps in the pipeline. If a step is a scikit-learn transformer, it will be wrapped in a SklearnStep.

  • fixed (list[str]) – List of step names that should not be fitted, only transformed

  • fit_on (dict[str, str]) – Settings for which data a step should be fitted on: either ‘train’, ‘test’ or ‘both’. If not specified, the step is fitted on the training data.

  • apply_to (dict[str, str]) – Settings for which data a step should be applied to: either ‘train’, ‘test’ or ‘both’. If not specified, the step is applied to both.

  • skip (list[str]) – List of step names to skip

  • seed (int | None) – Random state for the pipeline

addSkip(name: str)

Add a step to the skip list

Parameters:

name (str) – name of the step to skip

addStep(name: str, step: Step, fit_on: str = 'train', apply_to: str = 'both', fixed: bool = False)

Add a step to the pipeline

Parameters:
  • name (str) – name of the step

  • step (Step) – step to add to the pipeline

  • fit_on (str) – whether to fit the step on ‘train’, ‘test’ or ‘both’

  • apply_to (str) – whether to apply the step on ‘train’, ‘test’ or ‘both’

  • fixed (bool) – whether the step should be fixed and not fitted

apply(X_train: DataFrame, y_train: DataFrame | None = None, X_test: DataFrame | None = None, y_test: DataFrame | None = None, fit: bool = True) tuple[DataFrame, DataFrame | None, DataFrame | None, DataFrame | None]

Apply the pipeline to the data

If fit is True, the pipeline is fitted to the training data and then applied to the train and test data. If fit is False, the pipeline is only applied to the data.

Parameters:
  • X_train (pd.DataFrame) – training data to apply the pipeline to

  • y_train (pd.DataFrame | None) – training target data to apply the pipeline to

  • X_test (pd.DataFrame | None) – test data to apply the pipeline to

  • y_test (pd.DataFrame | None) – test target data to apply the pipeline to

  • fit (bool) – whether to fit the pipeline

Returns:
  • X_train (pd.DataFrame) – transformed training data

  • y_train (pd.DataFrame | None) – transformed training targets

  • X_test (pd.DataFrame | None) – transformed test data

  • y_test (pd.DataFrame | None) – transformed test targets

Return type:

tuple[pd.DataFrame, pd.DataFrame | None, pd.DataFrame | None, pd.DataFrame | None]
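The interplay of fit, fit_on and apply_to can be sketched without the library. In the example below, `MeanCenter` and `apply_step` are hypothetical names and plain lists replace data frames. With the documented defaults, a step is fitted on the training data only and then applied to both subsets. With fit=False the previously learned state is reused.

```python
class MeanCenter:
    """Hypothetical step: subtracts the mean learned during fit."""

    def __init__(self):
        self.mean = None

    def fit(self, values: list[float]) -> "MeanCenter":
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values: list[float]) -> list[float]:
        return [value - self.mean for value in values]


def apply_step(step, train, test, fit_on="train", apply_to="both", fit=True):
    """Fit a step on the selected data, then transform the selected data."""
    if fit:
        if fit_on == "train":
            step.fit(train)
        elif fit_on == "test":
            step.fit(test)
        else:  # "both"
            step.fit(train + test)
    if apply_to in ("train", "both"):
        train = step.transform(train)
    if apply_to in ("test", "both"):
        test = step.transform(test)
    return train, test


step = MeanCenter()
# Defaults: fitted on the training data, applied to train and test.
X_train, X_test = apply_step(step, train=[1.0, 2.0, 3.0], test=[4.0])
```

Fitting on the training data only keeps information from the test set out of the learned parameters, which is presumably why it is the default.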

applyOnDataSet(dataset: QSPRTable, split: DataSplit | None = None, fit: bool = True, seed: int | None = None) Generator[tuple[DataFrame, DataFrame, DataFrame, DataFrame] | tuple[DataFrame, DataFrame], None, None][source]

Apply the pipeline to the dataset

Note: the random state of the dataset is used to randomize the pipeline when the seed of feature calculators, splits or steps is not set.

Parameters:
  • dataset (QSPRTable) – dataset to apply the pipeline to

  • split (DataSplit) – split to apply to the dataset

  • seed (int | None) – seed to randomize the pipeline, if None, the random state of the dataset is used

  • fit (bool) – whether to fit the pipeline

Yields:
  • X_train (pd.DataFrame) – transformed training data

  • y_train (pd.DataFrame) – transformed training targets

  • X_test (pd.DataFrame | None) – transformed test data if split is not None

  • y_test (pd.DataFrame | None) – transformed test targets if split is not None

property fitted: bool

Check if the pipeline is fitted

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

orderSteps(order: list[str])

Order the steps in the pipeline

Parameters:

order (list[str]) – list of step names in the desired order

property randomState: int | None

Get the random state for the object.

removeSkip(name: str)

Remove a step from the skip list

Parameters:

name (str) – name of the step to remove from the skip list

removeStep(name: str)

Remove a step from the pipeline

Parameters:

name (str) – name of the step to remove

property skip: list[str]

Get the steps to skip

The steps to skip are not fitted or transformed, but are still present in the pipeline.

Returns:

list of step names to skip

Return type:

list[str]

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.data.GBMTRandomSplit(test_fraction: float = 0.1, n_folds: int = 1, seed: int | None = None, n_initial_clusters: int | None = None, custom_test_list: list[str] | None = None, data_set: QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit, Randomized

Splits dataset into balanced random train and test subsets.

Variables:
  • testFraction (float) – fraction of the total dataset to assign to the test set

  • customTestList (list) – list of molecule indexes to force in test set

  • split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

property randomState: int

Get the random state for the object.

setDataSet(dataset: QSPRDataSet | None) None

Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.data.MoleculeTable(storage: ChemStore | None = None, name: str | None = None, path: str = '.', random_state: int | None = None, store_format: str = 'pkl')[source]

Bases: MoleculeDataSet, Parallelizable

Class that holds and prepares molecule data for modelling and other analyses organized as a collection of PandasDataTable objects.

Variables:
  • descriptors (list[DescriptorTable]) – List of descriptor tables attached to this data set.

  • randomState (int) – Random state to use for shuffling and other random ops.

  • storeFormat (str) – Format to use for storing the data set.

  • rootDir (str) – Path to the directory where the data set is stored.

  • storage (ChemStore) – The storage object that holds the molecule data.

  • path (str) – Path to the directory where the data set will be stored.

  • name (str) – Name of the data set.

Initialize a MoleculeTable object.

This object wraps a pandas dataframe and provides short-hand methods to prepare molecule data for modelling and analysis.

Parameters:
  • storage (ChemStore) – The storage object that holds the molecule data.

  • name (str) – Name of the data set.

  • path (str) – Path to the directory where the data set will be stored.

  • random_state (int) – Random state to use for shuffling and other random ops.

  • store_format (str) – Format to use for storing the data set.

addClusters(clusters: list[MoleculeClusters], recalculate: bool = False)[source]

Add clusters to the data frame.

A new column is created that contains the identifier of the corresponding cluster calculator.

Parameters:
  • clusters (list) – list of MoleculeClusters calculators.

  • recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.

addDescriptors(descriptors: list[DescriptorSet], recalculate: bool = False, *args, **kwargs)[source]

Add descriptors to the data frame with the given descriptor calculators.

Parameters:
  • descriptors (list[DescriptorSet]) – List of DescriptorSet objects to use for descriptor calculation.

  • recalculate (bool) – Whether to recalculate descriptors even if they are already present in the data frame. If False, existing descriptors are kept and no calculation takes place.

  • *args – Additional positional arguments to pass to each descriptor set.

  • **kwargs – Additional keyword arguments to pass to each descriptor set.

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)[source]

Add entries to the data set.

Parameters:
  • ids (list[str]) – IDs of the entries to add.

  • props (dict[str, list]) – Properties to add.

  • raise_on_existing (bool) – Whether to raise an error if the entries already exist.

Raises:

NotImplementedError – Adding entries is not yet available for the data set.

addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]

Add a property to the data frame.

Parameters:
  • name (str) – Name of the property.

  • data (Sized) – Property values.

  • ids (list[str], optional) – IDs of the molecules to add the property for.

Returns:

Whether the property was added successfully.

Return type:

(bool)

addScaffolds(scaffolds: list[Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)[source]

Add scaffolds to the data frame.

A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.

Parameters:
  • scaffolds (list) – list of Scaffold calculators.

  • add_rdkit_scaffold (bool) – Whether to add the RDKit scaffold of the molecule as a new column.

  • recalculate (bool) – Whether to recalculate scaffolds even if they are already present in the data frame.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[Iterable[Any], None, None][source]

Apply a function to the data set.

Parameters:
  • func (callable) – Function to apply.

  • func_args (list, optional) – Positional arguments to pass to the function.

  • func_kwargs (dict, optional) – Keyword arguments to pass to the function.

  • on_props (tuple[str, ...], optional) – Properties to apply the function on.

  • chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the results.

Return type:

(Generator[Iterable[Any], None, None])

applyIdentifier(identifier: ChemIdentifier)[source]

Apply an identifier to the data set.

Parameters:

identifier (ChemIdentifier) – Identifier to apply.

applyStandardizer(standardizer: ChemStandardizer)[source]

Apply a standardizer to the data set.

Parameters:

standardizer (ChemStandardizer) – Standardizer to apply.

attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)[source]

Attach descriptors to the data frame.

Parameters:
  • calculator (DescriptorSet) – DescriptorSet object that was used to calculate the descriptors.

  • descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.

  • index_cols (list) – List of column names to use as index.

property chunkSize: int

Get the size of chunks to use per job in parallel processing.

clear()[source]

Clear the data set from memory and disk.

createScaffoldGroups(mols_per_group: int = 10)[source]

Create scaffold groups.

A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.

Parameters:

mols_per_group (int) – Number of molecules per scaffold group.

property descriptorSets: list[DescriptorSet]

Get the descriptor calculators for this table.

property descsPath

dropDescriptorSets(descriptors: list[DescriptorSet | str], full_removal: bool = False)[source]

Drop descriptors from the given sets from the data frame.

Parameters:
  • descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.

  • full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptorSet from the dataset, including the saved files. It is not possible to restore a descriptorSet after a full removal.

Raises:

AssertionError – If the data set does not contain any descriptors.
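The difference between a soft and a full removal can be sketched with a small registry. `DescriptorRegistry` and the set names "fingerprints" and "physchem" are hypothetical; the sketch only tracks an active flag per descriptor set and does not touch any files. It follows the documented behaviour: a soft removal can be undone (see restoreDescriptorSets), a full removal cannot.

```python
class DescriptorRegistry:
    """Hypothetical container tracking descriptor sets and whether they are active."""

    def __init__(self):
        self._active: dict[str, bool] = {}

    def add(self, name: str) -> None:
        self._active[name] = True

    def drop(self, names: list[str], full_removal: bool = False) -> None:
        assert self._active, "no descriptors present"
        for name in names:
            if full_removal:
                # Full removal: the set is gone and cannot be restored.
                self._active.pop(name, None)
            elif name in self._active:
                # Soft removal: keep the data, only mark it inactive.
                self._active[name] = False

    def restore(self, names: list[str]) -> None:
        missing = [name for name in names if name not in self._active]
        if missing:
            raise ValueError(f"unknown descriptor sets: {missing}")
        for name in names:
            self._active[name] = True

    def active(self) -> list[str]:
        return [name for name, is_active in self._active.items() if is_active]


registry = DescriptorRegistry()
registry.add("fingerprints")
registry.add("physchem")
registry.drop(["fingerprints"])                 # soft removal
registry.restore(["fingerprints"])              # possible because the data was kept
registry.drop(["physchem"], full_removal=True)  # permanent
```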

dropDescriptors(descriptors: list[str])[source]

Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.

Parameters:

descriptors (list[str]) – List of descriptor names to drop.

dropEmptyEntries(names: list[str])[source]

Drop rows with missing values in the properties.

Parameters:

names (list[str]) – list of property names

dropEntries(ids: Iterable[str])[source]

Drop entries from the data set.

Parameters:

ids (Iterable[str]) – IDs of the entries to drop.

classmethod fromDF(name: str, df: DataFrame, path: str = '.', smiles_col: str = 'SMILES', **kwargs) MoleculeTable[source]

Create a MoleculeTable instance from a pandas DataFrame.

Parameters:
  • name (str) – Name of the data set.

  • df (pd.DataFrame) – DataFrame containing the molecule data.

  • path (str) – Path to the directory where the data set will be stored.

  • smiles_col (str) – Name of the column in the data frame containing the SMILES sequences.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

classmethod fromSDF(name: str, filename: str, path: str, smiles_prop: str, *args, **kwargs)[source]

Create a MoleculeTable instance from an SDF file.

Parameters:
  • name (str) – Name of the data set.

  • filename (str) – Path to the SDF file.

  • path (str) – Path to the directory where the data set will be stored.

  • smiles_prop (str) – Name of the property in the SDF file containing the SMILES sequence.

  • *args – Additional arguments to pass to the MoleculeTable constructor.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

classmethod fromSMILES(name: str, smiles: list, path: str, *args, **kwargs)[source]

Create a MoleculeTable instance from a list of SMILES sequences.

Parameters:
  • name (str) – Name of the data set.

  • smiles (list) – list of SMILES sequences.

  • path (str) – Path to the directory where the data set will be stored.

  • *args – Additional arguments to pass to the MoleculeTable constructor.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromTableFile(name: str, filename: str, path: str, *args, sep='\t', **kwargs)[source]

Create a MoleculeTable instance from a file containing a table of molecules (e.g. a CSV or TSV file).

Parameters:
  • name (str) – Name of the data set.

  • filename (str) – Path to the file containing the table.

  • path (str) – Path to the directory where the data set will be stored.

  • sep (str) – Separator used in the file for different columns.

  • *args – Additional arguments to pass to the MoleculeTable constructor.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

generateDescriptorDataSetName(ds_set: str | DescriptorSet, name: str | None = None) str[source]

Generate a descriptor set name from a descriptor set.

Parameters:
  • ds_set (str | DescriptorSet) – Name of the descriptor set.

  • name (str) – Name of the data set.

Returns:

Name of the descriptor set.

Return type:

(str)

getClusterNames(clusters: list[MoleculeClusters] | None = None) list[str][source]

Get the names of the clusters in the data frame.

Parameters:

clusters (list) – List of cluster calculators of clusters to include

Returns:

List of cluster names.

Return type:

(list[str])

getClusters(clusters: list[MoleculeClusters] | None = None)[source]

Get the subset of the data frame that contains only clusters.

Parameters:

clusters (list) – List of cluster calculators of clusters to include.

Returns:

Data frame containing only clusters.

Return type:

pd.DataFrame

getDF() DataFrame[source]

Get the data frame of the data set.

getDescriptorNames() list[str][source]

Get the names of the descriptors present for molecules in this data set.

Returns:

list of descriptor names.

Return type:

(list[str])

getDescriptors(active_only: bool = True) DataFrame[source]

Get the calculated descriptors as a pandas data frame.

Returns:

Data frame containing only descriptors.

Return type:

pd.DataFrame

getProperties() list[str][source]

Get the names of the properties in the data frame.

getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any][source]

Get the property with the given name.

Parameters:
  • name (str) – Name of the property.

  • ids (tuple[str], optional) – IDs of the molecules to get the property for.

Returns:

Property values.

Return type:

(Iterable[Any])

getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10) Series[source]

Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.

Parameters:
  • scaffold_name (str) – Name of the scaffold.

  • mol_per_group (int) – Number of molecules per scaffold group.

Returns:

Series containing the scaffold groups.

Return type:

(pd.Series)

getScaffoldNames(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) list[str][source]

Get the names of the scaffolds in the data frame.

Parameters:
  • scaffolds (list) – List of scaffold calculators of scaffolds to include.

  • include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

List of scaffold names.

Return type:

(list[str])

getScaffolds(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) DataFrame[source]

Get the subset of the data frame that contains only scaffolds.

Parameters:
  • scaffolds (list) – List of scaffold calculators of scaffolds to include.

  • include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

Data frame containing only scaffolds.

Return type:

pd.DataFrame

getSubset(subset: Iterable[str], ids: Iterable[str] | None = None, name: str | None = None, path: str = '.', **kwargs) MoleculeTable[source]

Get a subset of the data frame.

Parameters:
  • subset (Iterable[str]) – List of properties to include in the subset.

  • ids (Iterable[str], optional) – IDs of the molecules to include in the subset.

  • name (str, optional) – Name of the new data set.

  • path (str) – Path to the directory where the data set will be stored.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

getSummary() DataFrame[source]

Get a summary of the data set.

Returns:

Summary of the data set.

Return type:

(pd.DataFrame)

Raises:

NotImplementedError – Summary not yet available for MoleculeTable.

property hasClusters: bool

Check whether the data frame contains clusters.

Returns:

Whether the data frame contains clusters.

Return type:

bool

hasDescriptors(descriptors: list[DescriptorSet | str] | None = None) bool | list[bool][source]

Check whether the data frame contains given descriptors.

Parameters:

descriptors (list[DescriptorSet | str] | None) – List of descriptor objects or prefixes of descriptors to check for. If None, all descriptors are checked and a single boolean is returned that indicates whether any descriptors are found.

Returns:

Whether the data frame contains the given descriptors.

Return type:

(bool | list[bool])
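The two return shapes can be made concrete with a short sketch. `has_descriptors` is a hypothetical free function, and matching is simplified to exact names rather than descriptor objects or prefixes. Without an argument it answers whether anything is present at all, and with a list it answers per entry.

```python
from typing import Optional, Union


def has_descriptors(
    present: set[str], requested: Optional[list[str]] = None
) -> Union[bool, list[bool]]:
    """Return one flag overall, or one flag per requested descriptor set."""
    if requested is None:
        return len(present) > 0
    return [name in present for name in requested]


present = {"fingerprints"}
any_present = has_descriptors(present)
per_set = has_descriptors(present, ["fingerprints", "physchem"])
```

Callers that pass a list should therefore expect a list back, even when it contains a single element.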

hasProperty(name: str) bool[source]

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

property hasScaffoldGroups: bool

Check whether the data frame contains scaffold groups.

Returns:

Whether the data frame contains scaffold groups.

Return type:

(bool)

property hasScaffolds: bool

Check whether the data frame contains scaffolds.

Returns:

Whether the data frame contains scaffolds.

Return type:

bool

property idProp: str

Get the name of the property that contains the molecule IDs.

property identifier: ChemIdentifier

Get the identifier to use for the data set.

iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol], None, None][source]

Iterate over chunks of the data set.

Parameters:
  • size (int, optional) – Size of the chunks.

  • on_props (list, optional) – Properties to iterate over.

  • chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the chunks.

Return type:

(Generator[list[StoredMol], None, None])
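Chunked iteration can be sketched with the standard library. `iter_chunks` is a hypothetical helper that works on plain SMILES strings instead of stored molecules. It shows only the generator pattern, in which each yielded list holds at most `size` items and the last one may be shorter.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    # islice consumes the shared iterator, so every pass continues
    # where the previous chunk stopped; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN", "O"]
chunks = list(iter_chunks(smiles, size=2))
```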

property metaFile: str

Get the path to the meta file of the data set.

property nJobs: int

Get the number of jobs to use for parallel processing.

property name: str

Get the name of the data set.

processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None][source]

Process molecules in the data set.

Parameters:
  • processor (MolProcessor) – Processor to use for molecule processing.

  • proc_args (tuple, optional) – Positional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Keyword arguments to pass to the processor.

  • mol_type (Literal["smiles", "mol", "rdkit"], optional) – Type of molecules to process.

  • add_props (Iterable[str], optional) – Additional properties to add to the data frame.

Returns:

Generator of the results.

Return type:

(Generator[Any, None, None])

property randomState: int

Get the random state to use for shuffling and other random ops.

reload()[source]

Reload the data set from disk.

removeProperty(name: str) bool[source]

Remove a property from the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property was removed successfully.

Return type:

(bool)

restoreDescriptorSets(descriptors: list[DescriptorSet | str])[source]

Restore descriptors that were previously removed.

Parameters:

descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.

Raises:

ValueError – If any of the descriptors are not present in the data set.

sample(n: int, name: str | None = None, random_state: int | None = None) MoleculeTable[source]

Sample n molecules from the table.

Parameters:
  • n (int) – Number of molecules to sample.

  • name (str) – Name of the new table. Defaults to the name of the old table, plus the _sampled suffix.

  • random_state (int) – Random state to use for shuffling and other random ops.

Returns:

A new table with the sampled molecules.

Return type:

(MoleculeTable)
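The role of random_state in sampling can be shown in a few lines. `sample_ids` is a hypothetical helper that draws from a list of made-up molecule IDs with the standard `random` module. It assumes sampling without replacement, which the description above does not state explicitly.

```python
import random


def sample_ids(ids: list[str], n: int, random_state=None) -> list[str]:
    """Draw n distinct IDs; a fixed random_state makes the draw repeatable."""
    return random.Random(random_state).sample(ids, n)


ids = [f"mol_{i}" for i in range(100)]
first = sample_ids(ids, n=5, random_state=42)
second = sample_ids(ids, n=5, random_state=42)
```

Passing the same random_state twice yields the same selection, which is what makes a sampled subset reproducible across runs.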

save()[source]

Save the whole storage to disk.

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None, path: str = '.') MoleculeTable[source]

Search the data set based on a property.

Parameters:
  • prop_name (str) – Name of the property to search on.

  • values (list[float | int | str]) – Values to search for.

  • exact (bool) – Whether to perform an exact search.

  • name (str) – Name of the new table.

  • path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, path: str = '.') MoleculeTable[source]

Search the data set with SMARTS patterns.

Parameters:
  • patterns (list[str]) – List of SMARTS patterns to search for.

  • operator (Literal["or", "and"]) – Operator to use for combining the patterns.

  • use_chirality (bool) – Whether to use chirality in the search.

  • name (str) – Name of the new table.

  • path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)
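The operator parameter decides how matches against several patterns are combined. The sketch below shows only that combination step on precomputed, made-up match flags; the actual substructure matching is left out, and combine_matches is a hypothetical helper rather than part of the API.

```python
def combine_matches(flags_per_pattern, operator="or"):
    """Combine the per-pattern match flags of one molecule."""
    if operator == "or":
        return any(flags_per_pattern)   # at least one pattern matches
    if operator == "and":
        return all(flags_per_pattern)   # every pattern matches
    raise ValueError(f"Unsupported operator: {operator!r}")

# Hypothetical match results for three molecules against two patterns
matches = {
    "mol_a": [True, False],
    "mol_b": [True, True],
    "mol_c": [False, False],
}
hits_or = [m for m, flags in matches.items() if combine_matches(flags, "or")]
hits_and = [m for m, flags in matches.items() if combine_matches(flags, "and")]
```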

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in the data set.

property smilesProp: str

Get the name of the property that contains the SMILES strings.

property standardizer: ChemStandardizer

Get the standardizer to use for the data set.

toFile(filename: str)[source]

Save the data set to a file.

Parameters:

filename (str) – Path to the file to save the data set to.

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

transformProperties(names: list[str], transformer: Callable[[Iterable[Any]], Iterable[Any]])[source]

Transform the properties of the data frame.

Parameters:
  • names (list[str]) – List of property names to transform.

  • transformer (Callable) – Function to use for transformation.
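A transformer is any callable that takes an iterable of values and returns an iterable of transformed values. The sketch below applies one to a plain dict of lists standing in for the table; transform_properties here is a hypothetical stand-in and the column names are made up.

```python
import math

def log10_transform(values):
    """Example transformer: map an iterable of positive numbers to log10."""
    return [math.log10(v) for v in values]

def transform_properties(table, names, transformer):
    """Apply the transformer to each named column of a dict-of-lists table."""
    for name in names:
        table[name] = list(transformer(table[name]))
    return table

table = {"IC50_nM": [1.0, 10.0, 1000.0], "MW": [180.2, 310.4, 250.3]}
transform_properties(table, ["IC50_nM"], log10_transform)
```

Only the columns listed in names are touched; the other columns keep their original values.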

class qsprpred.data.QSPRTable(storage: ChemStore | None = None, name: str | None = None, target_props: list[TargetSpec | dict] | None = None, path: str = '.', random_state: int | None = None, store_format: str = 'pkl', drop_empty_target_props: bool = True)[source]

Bases: QSPRDataSet, MoleculeTable

Implementation of QSPRDataSet using a collection of PandasDataTable objects.

Variables:

targetProperties (list[TargetSpec]) – specifications of the properties to be predicted with a QSPR model

Construct QSPRdata, also apply transformations of output property if specified.

Parameters:
  • storage (ChemStore | None) – storage object to use for saving the data. Defaults to None.

  • name (str) – data name, used in saving the data

  • target_props (list[TargetSpec | dict] | None) – target properties, names should correspond with target column names in df. If None, target specifications will be inferred if this data set has been saved previously. Defaults to None.

  • path (str, optional) – path to the directory where the data set will be saved. Defaults to “.”.

  • random_state (int, optional) – random state for splitting the data.

  • store_format (str, optional) – format to use for storing the data (‘pkl’ or ‘csv’).

  • drop_empty_target_props (bool, optional) – whether to ignore entries with empty target properties. Defaults to True.

Raises:

ValueError – Raised if threshold given with non-classification task.

addClusters(clusters: list[MoleculeClusters], recalculate: bool = False)

Add clusters to the data frame.

A new column is created that contains the identifier of the corresponding cluster calculator.

Parameters:
  • clusters (list) – list of MoleculeClusters calculators.

  • recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.

addDescriptors(descriptors: list[DescriptorSet], recalculate: bool = False, *args, **kwargs)

Add descriptors to the data frame with the given descriptor calculators.

Parameters:
  • descriptors (list[DescriptorSet]) – List of DescriptorSet objects to use for descriptor calculation.

  • recalculate (bool) – Whether to recalculate descriptors even if they are already present in the data frame. If False, existing descriptors are kept and no calculation takes place.

  • *args – Additional positional arguments to pass to each descriptor set.

  • **kwargs – Additional keyword arguments to pass to each descriptor set.

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the data set.

Parameters:
  • ids (list[str]) – IDs of the entries to add.

  • props (dict[str, list]) – Properties to add.

  • raise_on_existing (bool) – Whether to raise an error if the entries already exist.

Raises:

NotImplementedError – Adding entries is not yet available for the data set.

addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the data frame.

Parameters:
  • name (str) – Name of the property.

  • data (Sized) – Property values.

  • ids (list[str], optional) – IDs of the molecules to add the property for.

Returns:

Whether the property was added successfully.

Return type:

(bool)

addScaffolds(scaffolds: list[Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)

Add scaffolds to the data frame.

A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.

Parameters:
  • scaffolds (list) – list of Scaffold calculators.

  • add_rdkit_scaffold (bool) – Whether to add the RDKit scaffold of the molecule as a new column.

  • recalculate (bool) – Whether to recalculate scaffolds even if they are already present in the data frame.

addSplit(split: DataSplit, name: str)[source]

Add a split to the dataset.

Performs the split and stores the split object and the indices of the split. If the split has a random state, it will be set to the random state of the dataset if it is not set.

Parameters:
  • split (DataSplit) – split to add

  • name (str) – name of the split
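The seed rule described above (a split without a random state inherits the one of the data set) can be sketched as follows. SeededSplit and resolve_seed are hypothetical names used only for illustration.

```python
class SeededSplit:
    """Hypothetical splitter that may or may not carry its own seed."""

    def __init__(self, seed=None):
        self.seed = seed

def resolve_seed(split, dataset_random_state):
    """Give the split the data set's random state only if it has none."""
    if split.seed is None:
        split.seed = dataset_random_state
    return split.seed

inherited = resolve_seed(SeededSplit(), dataset_random_state=42)
kept = resolve_seed(SeededSplit(seed=7), dataset_random_state=42)
```

A split created with an explicit seed keeps it, so the data set's random state acts only as a fallback.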

addTargetProperty(target_spec: TargetSpec | dict, drop_empty: bool = True)[source]

Add a target property to the dataset.

Parameters:
  • target_spec (TargetSpec | dict) – target property specification to add or dictionary to initialize a TargetSpec

  • drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[Iterable[Any], None, None]

Apply a function to the data set.

Parameters:
  • func (callable) – Function to apply.

  • func_args (list, optional) – Positional arguments to pass to the function.

  • func_kwargs (dict, optional) – Keyword arguments to pass to the function.

  • on_props (tuple[str, ...], optional) – Properties to apply the function on.

  • chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the results.

Return type:

(Generator[Iterable[Any], None, None])

applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the data set.

Parameters:

identifier (ChemIdentifier) – Identifier to apply.

applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the data set.

Parameters:

standardizer (ChemStandardizer) – Standardizer to apply.

attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)

Attach descriptors to the data frame.

Parameters:
  • calculator (DescriptorSet) – DescriptorSet object that was used for descriptor calculation.

  • descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.

  • index_cols (list) – List of column names to use as index.

checkClassification(target_property: str) bool[source]

Checks the validity of the target property for classification tasks.

Parameters:

target_property (str) – Name of the target property to use for classification

Returns:

True if the target property is correctly set up for classification, False otherwise.

Return type:

bool

property chunkSize: int

Get the size of chunks to use per job in parallel processing.

clear()

Clear the data set from memory and disk.

createScaffoldGroups(mols_per_group: int = 10)

Create scaffold groups.

A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.

Parameters:

mols_per_group (int) – Number of molecules per scaffold group.

property descriptorSets: list[DescriptorSet]

Get the descriptor calculators for this table.

property descsPath

dropDescriptorSets(descriptors: list[DescriptorSet | str], full_removal: bool = False)

Drop descriptors from the given sets from the data frame.

Parameters:
  • descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.

  • full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptorSet from the dataset, including the saved files. It is not possible to restore a descriptorSet after a full removal.

Raises:

AssertionError – If the data set does not contain any descriptors.
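The difference between a soft and a full removal can be pictured with a toy registry. This is a sketch of the general pattern only; DescriptorRegistry and the descriptor set names are hypothetical and do not reflect how the library stores descriptors.

```python
class DescriptorRegistry:
    """Toy registry that distinguishes inactive from deleted descriptor sets."""

    def __init__(self):
        self._data = {}       # name -> stored descriptor values
        self._active = set()  # names currently exposed as features

    def add(self, name, values):
        self._data[name] = values
        self._active.add(name)

    def drop(self, name, full_removal=False):
        if name not in self._data:
            raise ValueError(f"Unknown descriptor set: {name}")
        self._active.discard(name)  # soft removal: hide, keep the data
        if full_removal:
            del self._data[name]    # full removal: the data is gone for good

    def restore(self, name):
        if name not in self._data:
            raise ValueError(f"Cannot restore {name}: not present")
        self._active.add(name)

    def active(self):
        return sorted(self._active)

registry = DescriptorRegistry()
registry.add("MorganFP", [[0, 1], [1, 0]])
registry.add("RDKitDescs", [[0.3], [0.7]])
registry.drop("MorganFP")                       # soft removal
registry.restore("MorganFP")                    # succeeds, data was kept
registry.drop("RDKitDescs", full_removal=True)  # can no longer be restored
```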

dropDescriptors(descriptors: list[str])

Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.

Parameters:

descriptors (list[str]) – List of descriptor names to drop.

dropEmptyEntries(names: list[str])

Drop rows with missing values in the properties.

Parameters:

names (list[str]) – list of property names

dropEntries(ids: Iterable[str])

Drop entries from the data set.

Parameters:

ids (Iterable[str]) – IDs of the entries to drop.

filter(table_filters: list[Callable])[source]

Filter the data set using the given filters.

Parameters:

table_filters (list[DataFilter]) – list of filters to apply

classmethod fromDF(name: str, df: DataFrame, target_props: list[TargetSpec | dict], path: str = '.', smiles_col: str = 'SMILES', drop_empty_target_props: bool = True, **kwargs) QSPRTable[source]

Create QSPRTable from a pandas DataFrame.

Parameters:
  • name (str) – name of the data set

  • df (pd.DataFrame) – data frame containing the data

  • target_props (list[TargetSpec | dict]) – target properties to use

  • path (str) – path to the directory where the data set will be saved

  • smiles_col (str) – name of the column containing SMILES

  • drop_empty_target_props (bool, optional) – whether to drop rows with empty target property values. Defaults to True.

  • **kwargs – additional keyword arguments for MoleculeTable constructor

Returns:

created data set

Return type:

QSPRTable

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

classmethod fromMolTable(mol_table: MoleculeTable, target_props: list[TargetSpec | dict], *args, path: str = '.', name: str | None = None, **kwargs) QSPRTable[source]

Create QSPRTable from a MoleculeTable.

Parameters:
  • mol_table (MoleculeTable) – MoleculeTable to use as the data source

  • target_props (list) – list of target properties to use

  • *args – additional positional arguments to pass to the constructor of QSPRTable

  • path (str) – path to the directory where the data set will be saved

  • name (str) – name of the data set

  • **kwargs – additional keyword arguments to pass to the constructor of QSPRTable

Returns:

created data set

Return type:

QSPRTable

classmethod fromSDF(name: str, filename: str, smiles_prop: str, *args, **kwargs)[source]

Create QSPRTable from SDF file.

It is currently not implemented for QSPRTable, but you can convert from ‘MoleculeTable’ with the ‘fromMolTable’ method.

Parameters:
  • name (str) – name of the data set

  • filename (str) – path to the SDF file

  • smiles_prop (str) – name of the property in the SDF file containing SMILES

  • *args – additional arguments for QSPRTable constructor

  • **kwargs – additional keyword arguments for QSPRTable constructor

classmethod fromSMILES(name: str, smiles: list, path: str, *args, **kwargs)

Create a MoleculeTable instance from a list of SMILES sequences.

Parameters:
  • name (str) – Name of the data set.

  • smiles (list) – list of SMILES sequences.

  • path (str) – Path to the directory where the data set will be stored.

  • *args – Additional arguments to pass to the MoleculeTable constructor.

  • **kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromTableFile(name: str, filename: str, path: str, *args, sep: str = '\t', target_props: list[TargetSpec | dict] | None = None, **kwargs)[source]

Create QSPRTable from table file (i.e. CSV or TSV).

Parameters:
  • name (str) – name of the data set

  • filename (str) – path to the table file

  • path (str) – path to the directory where the data set will be saved

  • *args – additional arguments for MolTable constructor

  • sep (str, optional) – separator in the table file. Defaults to “\t”.

  • target_props (list[TargetSpec | dict], optional) – target properties to use. Defaults to None.

  • **kwargs – additional keyword arguments for MolTable constructor

Returns:

QSPRTable object

Return type:

QSPRTable
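The sep parameter defaults to a tab, so TSV files are read without extra arguments while CSV files need sep=",". The standard-library sketch below shows the same idea; read_table and the column names are hypothetical.

```python
import csv
import io

def read_table(handle, sep="\t"):
    """Parse a delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=sep))

tsv_text = "SMILES\tpchembl_value\nCCO\t5.2\nc1ccccc1\t6.8\n"
csv_text = "SMILES,pchembl_value\nCCO,5.2\n"

tsv_rows = read_table(io.StringIO(tsv_text))           # default tab separator
csv_rows = read_table(io.StringIO(csv_text), sep=",")  # explicit comma separator
```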

generateDescriptorDataSetName(ds_set: str | DescriptorSet, name: str | None = None) str

Generate a descriptor set name from a descriptor set.

Parameters:
  • ds_set (str | DescriptorSet) – Name of the descriptor set.

  • name (str) – Name of the data set.

Returns:

Name of the descriptor set.

Return type:

(str)

getClusterNames(clusters: list[MoleculeClusters] | None = None) list[str]

Get the names of the clusters in the data frame.

Parameters:

clusters (list) – List of cluster calculators of clusters to include

Returns:

List of cluster names.

Return type:

(list[str])

getClusters(clusters: list[MoleculeClusters] | None = None)

Get the subset of the data frame that contains only clusters.

Parameters:

clusters (list) – List of cluster calculators of clusters to include.

Returns:

Data frame containing only clusters.

Return type:

pd.DataFrame

getDF() DataFrame

Get the data frame of the data set.

getDescriptorNames() list[str]

Get the names of the descriptors present for molecules in this data set.

Returns:

list of descriptor names.

Return type:

(list[str])

getDescriptors(active_only: bool = True) DataFrame

Get the calculated descriptors as a pandas data frame.

Returns:

Data frame containing only descriptors.

Return type:

pd.DataFrame

getProperties() list[str]

Get the names of the properties in the data frame.

getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get the property with the given name.

Parameters:
  • name (str) – Name of the property.

  • ids (tuple[str], optional) – IDs of the molecules to get the property for.

Returns:

Property values.

Return type:

(Iterable[Any])

getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10) Series

Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.

Parameters:
  • scaffold_name (str) – Name of the scaffold.

  • mol_per_group (int) – Number of molecules per scaffold group.

Returns:

Series containing the scaffold groups.

Return type:

(pd.Series)

getScaffoldNames(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) list[str]

Get the names of the scaffolds in the data frame.

Parameters:
  • scaffolds (list) – List of scaffold calculators of scaffolds to include.

  • include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

List of scaffold names.

Return type:

(list[str])

getScaffolds(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) DataFrame

Get the subset of the data frame that contains only scaffolds.

Parameters:
  • scaffolds (list) – List of scaffold calculators of scaffolds to include.

  • include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

Data frame containing only scaffolds.

Return type:

pd.DataFrame

getSplit(name: str, as_type: str = 'split') DataSplit | list[tuple[Index, Index]][source]

Get the split with the given name.

Parameters:
  • name (str) – name of the split

  • as_type (str) – determines the type of output. Can be one of:

      • “split”: returns a DataSplit object.

      • “ids”: returns train and test indices.

Returns:

the DataSplit object if as_type is “split”, or the train and test indices if as_type is “ids”

Return type:

DataSplit | list[tuple[pd.Index, pd.Index]]

getSubset(subset: list[str], ids: list[str] | None = None, name: str | None = None, path: str = '.', **kwargs) QSPRTable[source]

Get a subset of the data set.

Parameters:
  • subset (list[str]) – list of columns to include in the subset

  • ids (list[str], optional) – list of IDs to include in the subset. Defaults to None.

  • name (str, optional) – name of the subset. Defaults to None.

  • path (str, optional) – path to the directory where the subset will be saved. Defaults to “.”.

  • **kwargs – additional keyword arguments for the constructor of QSPRTable.

Returns:

subset of the data set

Return type:

QSPRTable

getSummary() DataFrame

Get a summary of the data set.

Returns:

Summary of the data set.

Return type:

(pd.DataFrame)

Raises:

NotImplementedError – Summary not yet available for MoleculeTable.

getTarget(name: str | TargetSpec) Series[source]

Get the target property values for the given target property.

Parameters:

name (str | TargetSpec) – name or specification of the target property

Returns:

target property values

Return type:

(pd.Series)

getTargetPropertiesNames() list[str]

Get the names of the target properties.

Returns:

list of target property names

Return type:

(list[str])

getTargetSpec(name: str) TargetSpec[source]

Get the target specification of a single target property by its name.

Parameters:

name (str) – name of the target property

Returns:

target specification with the given name

Return type:

TargetSpec

Raises:

ValueError – if the target property with the given name is not found

getTargetSpecs(names: list | None) list[TargetSpec][source]

Get the target specifications with the given names.

Parameters:

names (list[str]) – names of the target properties

Returns:

list of target specifications

Return type:

(list[TargetSpec])

getTargets() DataFrame[source]

Get the target property values.

Returns:

target property values

Return type:

(pd.DataFrame)

property hasClusters: bool

Check whether the data frame contains clusters.

Returns:

Whether the data frame contains clusters.

Return type:

bool

hasDescriptors(descriptors: list[DescriptorSet | str] | None = None) bool | list[bool]

Check whether the data frame contains given descriptors.

Parameters:

descriptors (list[DescriptorSet | str] | None) – List of descriptor objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.

Returns:

Whether the data frame contains the given descriptors.

Return type:

(bool | list[bool])

hasProperty(name: str) bool

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

property hasScaffoldGroups: bool

Check whether the data frame contains scaffold groups.

Returns:

Whether the data frame contains scaffold groups.

Return type:

(bool)

property hasScaffolds: bool

Check whether the data frame contains scaffolds.

Returns:

Whether the data frame contains scaffolds.

Return type:

bool

property idProp: str

Get the name of the property that contains the molecule IDs.

property identifier: ChemIdentifier

Get the identifier to use for the data set.

property isMultiTask: bool

Check if the dataset contains multiple target properties.

Returns:

True if the dataset contains multiple target properties

Return type:

(bool)

iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol], None, None]

Iterate over chunks of the data set.

Parameters:
  • size (int, optional) – Size of the chunks.

  • on_props (list, optional) – Properties to iterate over.

  • chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the chunks.

Return type:

(Generator[list[StoredMol], None, None])
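Chunked iteration can be sketched with a small generator. This version works on any iterable and is only meant to show the shape of the output; iter_chunks is a hypothetical helper and the SMILES strings are example data.

```python
from itertools import islice

def iter_chunks(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:  # iterator exhausted
            return
        yield chunk

smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
chunks = list(iter_chunks(smiles, size=2))
```

The last chunk may be shorter than the requested size when the number of items is not a multiple of it.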

iterSplit(name: str, as_type: str = 'ids') Generator[tuple[Index, Index], None, None] | Generator[tuple[ndarray, ndarray, ndarray, ndarray], None, None] | Generator[tuple[DataFrame, DataFrame, DataFrame, DataFrame], None, None] | Generator[tuple[QSPRTable, QSPRTable], None, None][source]

Iterate over the train and test sets of the split with the given name.

Parameters:
  • name (str) – name of the split

  • as_type (str) – determines the type of output. Can be one of:

      • “ids”: yields train and test indices.

      • “numpy”: yields train and test numpy arrays.

      • “pandas”: yields train and test pandas DataFrames.

      • “QSPRTable”: yields train and test QSPRTable objects.

Yields:
  • tuple[pd.Index, pd.Index] – train and test indices if as_type is “ids”

  • tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray] – train descriptors, train targets, test descriptors, test targets if as_type is “numpy”

  • tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame] – train descriptors, train targets, test descriptors, test targets if as_type is “pandas”

  • tuple[QSPRTable, QSPRTable] – train and test QSPRTable objects if as_type is “QSPRTable”

makeClassification(target_property: str, th: list[float] | None = None)[source]

Switch to classification task using the given threshold values.

Parameters:
  • target_property (str) – Name of target property to use for classification

  • th (list[float], optional) – list of threshold values. If not provided, it is assumed that the target property is already discretized and can be used for classification.
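For the simplest case of a single threshold, switching to classification amounts to binarizing a continuous property. The sketch below assumes that values strictly above the threshold become the positive class; that convention, and how lists with several thresholds are interpreted, are not specified above, so treat this as an illustration only. binarize is a hypothetical helper.

```python
def binarize(values, threshold):
    """Label a value 1 if it exceeds the threshold, else 0 (assumed convention)."""
    return [int(v > threshold) for v in values]

# Hypothetical activity values and a single cut-off, i.e. th=[6.5]
labels = binarize([5.1, 6.5, 7.3], threshold=6.5)
```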

makeRegression(target_property: str)[source]

Switch to regression task using the given target property.

Parameters:

target_property (str) – name of the target property to use for regression

property metaFile: str

Get the path to the meta file of the data set.

property nJobs: int

Get the number of jobs to use for parallel processing.

property nTargetProperties: int

Get the number of target properties in the dataset.

property name: str

Get the name of the data set.

processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None]

Process molecules in the data set.

Parameters:
  • processor (MolProcessor) – Processor to use for molecule processing.

  • proc_args (tuple, optional) – Positional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Keyword arguments to pass to the processor.

  • mol_type (Literal["smiles", "mol", "rdkit"], optional) – Type of molecules to process.

  • add_props (Iterable[str], optional) – Additional properties to add to the data frame.

Returns:

Generator of the results.

Return type:

(Generator[Any, None, None])

property randomState: int

Get the random state to use for shuffling and other random ops.

reload()

Reload the data set from disk.

removeProperty(name: str) bool

Remove a property from the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property was removed successfully.

Return type:

(bool)

restoreDescriptorSets(descriptors: list[DescriptorSet | str])

Restore descriptors that were previously removed.

Parameters:

descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.

Raises:

ValueError – If any of the descriptors are not present in the data set.

restoreTargetProperty(prop: TargetSpec | str)[source]

Reset target property to its original value.

Parameters:

prop (TargetSpec | str) – target property to reset

sample(n: int, name: str | None = None, random_state: int | None = None) MoleculeTable

Sample n molecules from the table.

Parameters:
  • n (int) – Number of molecules to sample.

  • name (str) – Name of the new table. Defaults to the name of the old table, plus the _sampled suffix.

  • random_state (int) – Random state to use for shuffling and other random ops.

Returns:

A new table containing the sampled molecules.

Return type:

(MoleculeTable)

save()

Save the whole storage to disk.

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None, path: str = '.') MoleculeTable

Search the data set based on a property.

Parameters:
  • prop_name (str) – Name of the property to search on.

  • values (list[float | int | str]) – Values to search for.

  • exact (bool) – Whether to perform an exact search.

  • name (str) – Name of the new table.

  • path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, path: str = '.') MoleculeTable

Search the data set with SMARTS patterns.

Parameters:
  • patterns (list[str]) – List of SMARTS patterns to search for.

  • operator (Literal["or", "and"]) – Operator to use for combining the patterns.

  • use_chirality (bool) – Whether to use chirality in the search.

  • name (str) – Name of the new table.

  • path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

setTargetProperties(target_props: list[TargetSpec | dict], drop_empty: bool = True)[source]

Set list of target properties for the dataset.

Parameters:
  • target_props (list[TargetSpec | dict]) – list of target properties specifications or dictionaries to initialize the TargetSpec objects from.

  • drop_empty (bool, optional) – whether to drop rows with empty target property values. Defaults to True.

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in the data set.

property smilesProp: str

Get the name of the property that contains the SMILES strings.

split(split: DataSplit) Generator[tuple[Index, Index], None, None][source]

Create folds from Descriptors and Targets. Can be used either for cross-validation, bootstrapping or train-test split.

Parameters:
  • split (DataSplit) – Split to apply to the data

  • X (pd.DataFrame) – data to apply the split to

  • y (pd.DataFrame | None) – target data to apply the split to

Yields:

pd.Index, pd.Index – indices of the train and test set

property standardizer: ChemStandardizer

Get the standardizer to use for the data set.

property targetProperties: list[TargetSpec]

Returns the specifications of target properties of the dataset.

toFile(filename: str)

Save the data set to a file.

Parameters:

filename (str) – Path to the file to save the data set to.

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

transformProperties(names: list[str], transformer: Callable[[Iterable[Any]], Iterable[Any]])

Transform the properties of the data frame.

Parameters:
  • names (list[str]) – List of property names to transform.

  • transformer (Callable) – Function to use for transformation.

unsetTargetProperty(name: str | TargetSpec)[source]

Unset a target property. It will not remove it from the data set, but will make it unavailable for training.

Parameters:

name (str | TargetSpec) – name or specification of the target property to drop

class qsprpred.data.RandomSplit(test_fraction=0.1, seed: int | None = None)[source]

Bases: DataSplit, Randomized

Splits dataset in random train and test subsets.

Variables:
  • testFraction (float) – fraction of the total data set to assign to the test set

  • seed (int) – Random state to use for shuffling and other random operations.
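A seeded random train/test split over integer row indices can be sketched with the standard library as below. It mirrors the idea of testFraction and seed but is not the library's implementation; random_split is a hypothetical function, and the rounding of the test set size is an assumption.

```python
import random

def random_split(n_samples, test_fraction=0.1, seed=None):
    """Yield one (train_indices, test_indices) pair of integer row indices."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Assumed rounding rule: at least one test sample
    n_test = max(1, round(n_samples * test_fraction))
    yield sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = next(random_split(20, test_fraction=0.1, seed=0))
```

The function is written as a generator so that it has the same shape as splitters that produce several folds.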

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

property randomState: int

Get the random state for the object.

split(X, y)[source]

Split the given data into one or multiple train/test subsets.

These classes handle partitioning of a feature matrix by returning a generator of train and test indices. It is compatible with the approach taken in the sklearn package (see sklearn.model_selection._BaseKFold). This can be used for either cross-validation or a one-time train/test split.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix X (note that these are integer indices, rather than a pandas index!)

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
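The toJSON and fromJSON pair implies a round trip: the string must carry everything needed to rebuild an equivalent object. A minimal sketch of that contract follows, using a hypothetical SplitConfig class and the standard json module rather than the library's own serialization.

```python
import json

class SplitConfig:
    """Hypothetical object whose full state fits in a JSON document."""

    def __init__(self, test_fraction=0.1, seed=None):
        self.test_fraction = test_fraction
        self.seed = seed

    def to_json(self):
        # Store every attribute needed to rebuild the object
        return json.dumps({"test_fraction": self.test_fraction, "seed": self.seed})

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))

original = SplitConfig(test_fraction=0.2, seed=42)
restored = SplitConfig.from_json(original.to_json())
```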

class qsprpred.data.ScaffoldSplit(scaffold: ~qsprpred.data.chem.scaffolds.Scaffold = <qsprpred.data.chem.scaffolds.BemisMurckoRDKit object>, test_fraction: float = 0.1, n_folds: int = 1, custom_test_list: list | None = None, data_set: ~qsprpred.data.tables.interfaces.qspr_data_set.QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit

Splits dataset into balanced train and test subsets based on molecular scaffolds.

Variables:
  • testFraction (float) – fraction of the total data set to assign to the test set

  • customTestList (list) – list of molecule indexes to force in test set

  • split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.
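The key property of a scaffold-based split is that molecules sharing a scaffold never end up on both sides. The greedy sketch below shows only that property; it does not reproduce the globally balanced algorithm used by this class, and group_split and the scaffold labels are hypothetical.

```python
from collections import defaultdict

def group_split(groups, test_fraction=0.1):
    """Assign whole groups to the test set until it reaches the target size."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    target = round(test_fraction * len(groups))
    train, test = [], []
    # Visit the smallest groups first so the test set fills gradually
    for group in sorted(members, key=lambda g: (len(members[g]), g)):
        if len(test) + len(members[group]) <= target:
            test.extend(members[group])
        else:
            train.extend(members[group])
    return sorted(train), sorted(test)

scaffolds = ["benzene"] * 4 + ["pyridine"] * 3 + ["indole"] * 2 + ["furan"]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.3)
```

Because whole groups are moved at once, the achieved test fraction can deviate from the requested one when group sizes do not add up exactly.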

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) None

Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
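The round-trip contract between toJSON and fromJSON can be sketched with the standard library alone. The class SplitConfig below is a hypothetical stand-in, not a qsprpred class, and dumping the instance attributes with the json module is an assumption about one possible approach rather than a description of how the library serializes its objects.

```python
import json


class SplitConfig:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(self, test_fraction=0.1, n_folds=1):
        self.test_fraction = test_fraction
        self.n_folds = n_folds

    def toJSON(self):
        # All state needed for reconstruction goes into the string.
        return json.dumps(self.__dict__)

    @classmethod
    def fromJSON(cls, json_str):
        # Rebuild a new instance from the serialized attributes.
        return cls(**json.loads(json_str))


original = SplitConfig(test_fraction=0.2, n_folds=3)
restored = SplitConfig.fromJSON(original.toJSON())
```

The restored object is a new instance that carries the same settings as the original.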

class qsprpred.data.TemporalSplit(timesplit: float | list[float], timeprop: str, data_set: QSPRDataSet | None = None)[source]

Bases: DataSplit, DataSetDependent

Splits dataset into train and test subsets based on a threshold in time.

Variables:
  • timeSplit (float) – time point after which samples are assigned to the test set

  • timeCol (str) – name of the column within the dataframe with timepoints

Initialize a TemporalSplit object.

Parameters:
  • timesplit (float | list[float]) – time point after which samples are assigned to the test set. If a list is provided, the splitter produces one train/test split per time point in the list.

  • timeprop (str) – name of the column within the dataset with timepoints

  • data_set (QSPRDataSet) – dataset that this splitter will be acting on

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) None

Set the data set for this object.

split(X, y)[source]

Split single-task dataset based on a time threshold.

Parameters:
  • X (np.ndarray | pd.DataFrame) – the input data matrix

  • y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets, each represented as a tuple of (train_indices, test_indices), where the indices are row indices of the input data matrix
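A minimal sketch of the thresholding logic, assuming that samples exactly at the time point stay in the training set and that a list of time points yields one split per entry. The function temporal_split and the plain list of years are hypothetical and stand in for the dataset column named by timeprop; this is not the library's implementation.

```python
def temporal_split(timepoints, timesplit):
    """Yield (train_indices, test_indices) for each time threshold.

    Hypothetical helper: samples strictly after the threshold go to the
    test set, everything else stays in the training set.
    """
    thresholds = timesplit if isinstance(timesplit, list) else [timesplit]
    for threshold in thresholds:
        train = [i for i, t in enumerate(timepoints) if t <= threshold]
        test = [i for i, t in enumerate(timepoints) if t > threshold]
        yield train, test


years = [2010, 2012, 2015, 2018, 2020]
single = list(temporal_split(years, 2015))
multi = list(temporal_split(years, [2012, 2018]))
```

With a single threshold the generator yields one tuple; with a list it yields one tuple per threshold, in the order given.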

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)