qsprpred.data package

Subpackages

Module contents

class qsprpred.data.BootstrapSplit(split: DataSplit, n_bootstraps=5, seed=None, dataset=None)[source]

Bases: DataSplit, Randomized, DataSetDependent

Splits dataset into random train and test subsets (bootstraps). Unlike cross-validation, bootstrapping allows for repeated samples in the test set.

Variables:

nBootstraps (int) – number of bootstraps to perform
seed (int) – Random state to use for shuffling and other random operations.

Initialize a BootstrapSplit object.

Parameters:

split (DataSplit) – the splitter to use for the bootstraps
n_bootstraps (int) – number of bootstraps to perform
seed (int) – random seed to use for random operations
dataset (QSPRDataSet) – dataset for the underlying splitter if it is DataSetDependent

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

property randomState: int: Get the random state for the object.

setDataSet(dataset)[source]: Set the dataset for the underlying splitter.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) → Iterable[tuple[list[int], list[int]]][source]

Split the given data into nBootstraps training and test sets.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over nBootstraps tuples generated by the underlying splitter
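The behaviour described above can be pictured with a small stand-alone sketch. It does not use QSPRpred itself: `random_split` and `bootstrap_split` are hypothetical stand-ins, and the assumption that each bootstrap reruns the underlying splitter with its own seed is an illustration, not a statement about the library's internals.

```python
import random
from typing import Iterator


def random_split(
    n_samples: int, test_fraction: float, seed: int
) -> tuple[list[int], list[int]]:
    """Hypothetical underlying splitter: one shuffled train/test split."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def bootstrap_split(
    n_samples: int, n_bootstraps: int = 5, test_fraction: float = 0.2, seed: int = 42
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield n_bootstraps (train, test) index tuples.

    Assumption: every bootstrap reruns the splitter with its own seed, so a
    row may land in the test set of more than one bootstrap.
    """
    for i in range(n_bootstraps):
        yield random_split(n_samples, test_fraction, seed + i)


splits = list(bootstrap_split(10, n_bootstraps=3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Because each bootstrap is drawn independently, the test sets are not guaranteed to be disjoint, which is the main contrast with k-fold cross-validation.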

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)
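The serialization methods above (`toJSON`, `fromJSON`, `toFile`, `fromFile`) form a round trip. The sketch below mirrors that contract with a hypothetical `SplitSettings` class built on the standard `json` module; the class, its fields, and the JSON layout are assumptions for illustration and say nothing about how QSPRpred stores its objects.

```python
import json
import os
import tempfile


class SplitSettings:
    """Hypothetical stand-in for a JSON-serializable object."""

    def __init__(self, n_bootstraps: int = 5, seed=None):
        self.nBootstraps = n_bootstraps
        self.seed = seed

    def toJSON(self) -> str:
        # The string holds everything needed to rebuild the object.
        return json.dumps({"nBootstraps": self.nBootstraps, "seed": self.seed})

    @classmethod
    def fromJSON(cls, json_str: str) -> "SplitSettings":
        state = json.loads(json_str)
        return cls(n_bootstraps=state["nBootstraps"], seed=state["seed"])

    def toFile(self, filename: str) -> str:
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        # Like the documented method, return the absolute path of the file.
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename: str) -> "SplitSettings":
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = SplitSettings(n_bootstraps=3, seed=7).toFile(
        os.path.join(tmp_dir, "split.json")
    )
    restored = SplitSettings.fromFile(saved_path)
```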

class qsprpred.data.ClusterSplit(test_fraction: float = 0.1, n_folds: int = 1, custom_test_list: list[str] | None = None, seed: int | None = None, clustering: MoleculeClusters | None = None, data_set: QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit, Randomized

Splits dataset into balanced train and test subsets based on clusters of similar molecules.

Variables:

testFraction (float) – fraction of the total dataset to assign to the test set
customTestList (list) – list of molecule indexes to force in test set
seed (int) – Random state to use for shuffling and other random operations.
split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

property randomState: int: Get the random state for the object.

setDataSet(dataset: QSPRDataSet | None) → None: Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) → Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix
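To make the idea of a cluster-based split concrete, here is a deliberately simplified sketch. It is not the globally balanced algorithm that the class delegates to: the hypothetical `cluster_split` function just fills the test set with whole clusters, smallest first, until the requested fraction is reached, so that no cluster is shared between train and test.

```python
from collections import defaultdict


def cluster_split(
    cluster_ids: list[str], test_fraction: float = 0.1
) -> tuple[list[int], list[int]]:
    """Assign whole clusters to either the train or the test subset.

    Assumption: a greedy fill, smallest clusters first. The real splitter
    balances subsets differently; this only illustrates the constraint that
    members of one cluster stay together.
    """
    members: dict[str, list[int]] = defaultdict(list)
    for row, cluster_id in enumerate(cluster_ids):
        members[cluster_id].append(row)

    target = test_fraction * len(cluster_ids)
    train: list[int] = []
    test: list[int] = []
    for cluster_id in sorted(members, key=lambda c: (len(members[c]), c)):
        bucket = test if len(test) < target else train
        bucket.extend(members[cluster_id])
    return sorted(train), sorted(test)


# One cluster label per row of the (hypothetical) data matrix.
labels = ["a", "a", "a", "b", "b", "c", "a", "b", "a", "c"]
train_idx, test_idx = cluster_split(labels, test_fraction=0.2)
```

The returned values have the same shape as one element of the documented generator: a tuple of train row indices and test row indices.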

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

Bases: Pipeline

Pipeline class for applying data preprocessing steps to a QSPRDataset.

Variables:

feature_calculators (list[DescriptorSet] | None) – List of feature calculators to apply to the dataset. If None, no feature calculators are applied.
originalfeatureNames (list[str] | None) – Original feature names in the dataset before applying the pipeline.

Initialize the DatasetPipeline

Parameters:

feature_calculators (list[DescriptorSet] | None) – List of feature calculators to apply to the dataset.
steps (dict[str, Step | BaseEstimator]) – Dictionary of named steps in the pipeline, if the step is a scikit-learn transformer, it will be wrapped in a SklearnStep.
fixed (list[str]) – List of step names that should not be fitted, only transformed
fit_on (dict[str, str]) – Settings for which data a step should be fitted on. Either ‘train’, ‘test’ or ‘both’, if not specified the step is fitted on the training data.
apply_to (dict[str, str]) – Settings for which data a step should be applied to. Either ‘train’, ‘test’ or ‘both’, if not specified the step is applied to both.
skip (list[str]) – List of step names to skip
seed (int | None) – Random state for the pipeline

addSkip(name: str)

Add a step to the skip list

Parameters:: name (str) – name of the step to skip

addStep(name: str, step: Step, fit_on: str = 'train', apply_to: str = 'both', fixed: bool = False)

Add a step to the pipeline

Parameters:

name (str) – name of the step
step (Step) – step to add to the pipeline
fit_on (str) – whether to fit the step on ‘train’, ‘test’ or ‘both’
apply_to (str) – whether to apply the step on ‘train’, ‘test’ or ‘both’
fixed (bool) – whether the step should be fixed and not fitted
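The interplay of `fit_on`, `apply_to` and `fixed` can be sketched without the library. `MeanCenter` and `run_step` below are hypothetical helpers that operate on plain lists; they only illustrate the semantics described in the parameter list and are not the pipeline's actual code.

```python
class MeanCenter:
    """Hypothetical step: subtracts the mean learned during fit."""

    def __init__(self):
        self.mean = None

    def fit(self, values: list[float]) -> "MeanCenter":
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values: list[float]) -> list[float]:
        return [value - self.mean for value in values]


def run_step(step, train, test, fit_on="train", apply_to="both", fixed=False):
    """Fit a step on the selected data, then transform the selected data."""
    if not fixed:
        # A fixed step is assumed to be fitted already and is left untouched.
        fit_data = {"train": train, "test": test, "both": train + test}[fit_on]
        step.fit(fit_data)
    if apply_to in ("train", "both"):
        train = step.transform(train)
    if apply_to in ("test", "both"):
        test = step.transform(test)
    return train, test


# Defaults: fit on the training data only, apply to both subsets.
train_out, test_out = run_step(MeanCenter(), [1.0, 2.0, 3.0], [4.0, 6.0])
```

Fitting on the training data only, as in the defaults, keeps information from the test set out of the fitted step.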

Apply the pipeline to the data

If fit is True, the pipeline is fitted to the training data and then applied to the train and test data. If fit is False, the pipeline is only applied to the data.

Parameters:

X_train (pd.DataFrame) – training data to apply the pipeline to
y_train (pd.DataFrame | None) – training target data to apply the pipeline to
X_test (pd.DataFrame | None) – test data to apply the pipeline to
y_test (pd.DataFrame | None) – test target data to apply the pipeline to
fit (bool) – whether to fit the pipeline

Returns:

X_train (pd.DataFrame) – transformed training data
y_train (pd.DataFrame | None) – transformed training targets
X_test (pd.DataFrame | None) – transformed test data
y_test (pd.DataFrame | None) – transformed test targets

applyOnDataSet(dataset: QSPRTable, split: DataSplit | None = None, fit: bool = True, seed: int | None = None) → Generator[tuple[DataFrame, DataFrame, DataFrame, DataFrame] | tuple[DataFrame, DataFrame], None, None][source]

Apply the pipeline to the dataset

Note: the random state of the dataset is used to randomize the pipeline when the seed of feature calculators, splits or steps is not set.

Parameters:

dataset (QSPRTable) – dataset to apply the pipeline to
split (DataSplit) – split to apply to the dataset
seed (int | None) – seed to randomize the pipeline, if None, the random state of the dataset is used
fit (bool) – whether to fit the pipeline

Yields:

X_train (pd.DataFrame) – transformed training data
y_train (pd.DataFrame) – transformed training targets
X_test (pd.DataFrame | None) – transformed test data if split is not None
y_test (pd.DataFrame | None) – transformed test targets if split is not None

property fitted: bool: Check if the pipeline is fitted

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

orderSteps(order: list[str])

Order the steps in the pipeline

Parameters:: order (list[str]) – list of step names in the desired order

property randomState: int | None: Get the random state for the object.

removeSkip(name: str)

Remove a step from the skip list

Parameters:: name (str) – name of the step to remove from the skip list

removeStep(name: str)

Remove a step from the pipeline

Parameters:: name (str) – name of the step to remove

property skip: list[str]

Get the steps to skip

The steps to skip are not fitted or transformed, but are still present in the pipeline.

Returns:: list of step names to skip
Return type:: list[str]

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

class qsprpred.data.GBMTRandomSplit(test_fraction: float = 0.1, n_folds: int = 1, seed: int | None = None, n_initial_clusters: int | None = None, custom_test_list: list[str] | None = None, data_set: QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit, Randomized

Splits dataset into balanced random train and test subsets.

Variables:

testFraction (float) – fraction of the total dataset to assign to the test set
customTestList (list) – list of molecule indexes to force in test set
split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

property randomState: int: Get the random state for the object.

setDataSet(dataset: QSPRDataSet | None) → None: Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) → Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets represented as a tuple of (train_indices, test_indices) where the indices are the row indices of the input data matrix

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

class qsprpred.data.MoleculeTable(storage: ChemStore | None = None, name: str | None = None, path: str = '.', random_state: int | None = None, store_format: str = 'pkl')[source]

Bases: MoleculeDataSet, Parallelizable

Class that holds and prepares molecule data for modelling and other analyses organized as a collection of PandasDataTable objects.

Variables:

descriptors (list[DescriptorTable]) – List of descriptor tables attached to this data set.
randomState (int) – Random state to use for shuffling and other random ops.
storeFormat (str) – Format to use for storing the data set.
rootDir (str) – Path to the directory where the data set is stored.
storage (ChemStore) – The storage object that holds the molecule data.
path (str) – Path to the directory where the data set will be stored.
name (str) – Name of the data set.

Initialize a MoleculeTable object.

This object wraps a pandas dataframe and provides short-hand methods to prepare molecule data for modelling and analysis.

Parameters:

storage (ChemStore) – The storage object that holds the molecule data.
name (str) – Name of the data set.
path (str) – Path to the directory where the data set will be stored.
random_state (int) – Random state to use for shuffling and other random ops.
store_format (str) – Format to use for storing the data set.

addClusters(clusters: list[MoleculeClusters], recalculate: bool = False)[source]

Add clusters to the data frame.

A new column is created that contains the identifier of the corresponding cluster calculator.

Parameters:

clusters (list) – list of MoleculeClusters calculators.
recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.

addDescriptors(descriptors: list[DescriptorSet], recalculate: bool = False, *args, **kwargs)[source]

Add descriptors to the data frame with the given descriptor calculators.

Parameters:

descriptors (list[DescriptorSet]) – List of DescriptorSet objects to use for descriptor calculation.
recalculate (bool) – Whether to recalculate descriptors even if they are already present in the data frame. If False, existing descriptors are kept and no calculation takes place.
*args – Additional positional arguments to pass to each descriptor set.
**kwargs – Additional keyword arguments to pass to each descriptor set.

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)[source]

Add entries to the data set.

Parameters:

ids (list[str]) – IDs of the entries to add.
props (dict[str, list]) – Properties to add.
raise_on_existing (bool) – Whether to raise an error if the entries already exist.

Raises:

NotImplementedError – Adding entries is not yet available for the data set.

addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]

Add a property to the data frame.

Parameters:

name (str) – Name of the property.
data (Sized) – Property values.
ids (list[str], optional) – IDs of the molecules to add the property for.

Returns:

Whether the property was added successfully.

Return type:

(bool)

addScaffolds(scaffolds: list[Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)[source]

Add scaffolds to the data frame.

A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.

Parameters:

scaffolds (list) – list of Scaffold calculators.
add_rdkit_scaffold (bool) – Whether to add the RDKit scaffold of the molecule as a new column.
recalculate (bool) – Whether to recalculate scaffolds even if they are already present in the data frame.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[Iterable[Any], None, None][source]

Apply a function to the data set.

Parameters:

func (callable) – Function to apply.
func_args (list, optional) – Positional arguments to pass to the function.
func_kwargs (dict, optional) – Keyword arguments to pass to the function.
on_props (tuple[str, ...], optional) – Properties to apply the function on.
chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the results.

Return type:

(Generator[Iterable[Any], None, None])
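The generator contract of `apply` (one result per processed chunk) can be illustrated with plain Python. `iter_chunks`, `apply_chunked` and `count_char` are hypothetical names, and SMILES strings stand in for whatever chunk type is selected; how the library actually distributes chunks over parallel jobs is not covered here.

```python
from typing import Any, Callable, Iterator, Optional


def iter_chunks(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def apply_chunked(
    func: Callable[..., Any],
    items: list,
    chunk_size: int,
    func_kwargs: Optional[dict] = None,
) -> Iterator[Any]:
    """Call `func` once per chunk and yield each result lazily."""
    func_kwargs = func_kwargs or {}
    for chunk in iter_chunks(items, chunk_size):
        yield func(chunk, **func_kwargs)


def count_char(smiles_chunk: list[str], symbol: str = "C") -> list[int]:
    """Toy function: count a character in every SMILES string of a chunk."""
    return [smiles.count(symbol) for smiles in smiles_chunk]


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "N", "CCN"]
results = list(apply_chunked(count_char, smiles, chunk_size=2))
```

Since the results arrive chunk by chunk, a caller that wants one flat list has to concatenate them itself.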

applyIdentifier(identifier: ChemIdentifier)[source]

Apply an identifier to the data set.

Parameters:: identifier (ChemIdentifier) – Identifier to apply.

applyStandardizer(standardizer: ChemStandardizer)[source]

Apply a standardizer to the data set.

Parameters:: standardizer (ChemStandardizer) – Standardizer to apply.

attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)[source]

Attach descriptors to the data frame.

Parameters:

calculator (DescriptorSet) – DescriptorSet object that was used to calculate the descriptors.
descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.
index_cols (list) – List of column names to use as index.

property chunkSize: int: Get the size of chunks to use per job in parallel processing.

clear()[source]: Clear the data set from memory and disk.

createScaffoldGroups(mols_per_group: int = 10)[source]

Create scaffold groups.

A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.

Parameters:: mols_per_group (int) – Number of molecules per scaffold group.

property descriptorSets: list[DescriptorSet]: Get the descriptor calculators for this table.

property descsPath

dropDescriptorSets(descriptors: list[DescriptorSet | str], full_removal: bool = False)[source]

Drop descriptors from the given sets from the data frame.

Parameters:

descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptorSet from the dataset, including the saved files. It is not possible to restore a descriptorSet after a full removal.

Raises:

AssertionError – If the data set does not contain any descriptors.

dropDescriptors(descriptors: list[str])[source]

Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.

Parameters:: descriptors (list[str]) – List of descriptor names to drop.

dropEmptyEntries(names: list[str])[source]

Drop rows with missing values in the properties.

Parameters:: names (list[str]) – list property names

dropEntries(ids: Iterable[str])[source]

Drop entries from the data set.

Parameters:: ids (Iterable[str]) – IDs of the entries to drop.

classmethod fromDF(name: str, df: DataFrame, path: str = '.', smiles_col: str = 'SMILES', **kwargs) → MoleculeTable[source]

Create a MoleculeTable instance from a pandas DataFrame.

Parameters:

name (str) – Name of the data set.
df (pd.DataFrame) – DataFrame containing the molecule data.
path (str) – Path to the directory where the data set will be stored.
smiles_col (str) – Name of the column in the data frame containing the SMILES sequences.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

classmethod fromSDF(name: str, filename: str, path: str, smiles_prop: str, *args, **kwargs)[source]

Create a MoleculeTable instance from an SDF file.

Parameters:

name (str) – Name of the data set.
filename (str) – Path to the SDF file.
path (str) – Path to the directory where the data set will be stored.
smiles_prop (str) – Name of the property in the SDF file containing the SMILES sequence.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

classmethod fromSMILES(name: str, smiles: list, path: str, *args, **kwargs)[source]

Create a MoleculeTable instance from a list of SMILES sequences.

Parameters:

name (str) – Name of the data set.
smiles (list) – list of SMILES sequences.
path (str) – Path to the directory where the data set will be stored.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromTableFile(name: str, filename: str, path: str, *args, sep='\t', **kwargs)[source]

Create a MoleculeTable instance from a file containing a table of molecules (e.g. a CSV file).

Parameters:

name (str) – Name of the data set.
filename (str) – Path to the file containing the table.
path (str) – Path to the directory where the data set will be stored.
sep (str) – Separator used in the file for different columns.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

generateDescriptorDataSetName(ds_set: str | DescriptorSet, name: str | None = None) → str[source]

Generate a descriptor set name from a descriptor set.

Parameters:

ds_set (str | DescriptorSet) – Name of the descriptor set.
name (str) – Name of the data set.

Returns:

Name of the descriptor set.

Return type:

(str)

getClusterNames(clusters: list[MoleculeClusters] | None = None) → list[str][source]

Get the names of the clusters in the data frame.

Parameters:: clusters (list) – List of cluster calculators of clusters to include
Returns:: List of cluster names.
Return type:: (list[str])

getClusters(clusters: list[MoleculeClusters] | None = None)[source]

Get the subset of the data frame that contains only clusters.

Parameters:: clusters (list) – List of cluster calculators of clusters to include.
Returns:: Data frame containing only clusters.
Return type:: pd.DataFrame

getDF() → DataFrame[source]: Get the data frame of the data set.

getDescriptorNames() → list[str][source]

Get the names of the descriptors present for molecules in this data set.

Returns:: list of descriptor names.
Return type:: (list[str])

getDescriptors(active_only: bool = True) → DataFrame[source]

Get the calculated descriptors as a pandas data frame.

Returns:: Data frame containing only descriptors.
Return type:: pd.DataFrame

getProperties() → list[str][source]: Get the names of the properties in the data frame.

getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any][source]

Get the property with the given name.

Parameters:

name (str) – Name of the property.
ids (tuple[str], optional) – IDs of the molecules to get the property for.

Returns:

Property values.

Return type:

(Iterable[Any])

getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10) → Series[source]

Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.

Parameters:

scaffold_name (str) – Name of the scaffold.
mol_per_group (int) – Number of molecules per scaffold group.

Returns:

Series containing the scaffold groups.

Return type:

(pd.Series)

getScaffoldNames(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) → list[str][source]

Get the names of the scaffolds in the data frame.

Parameters:

scaffolds (list) – List of scaffold calculators of scaffolds to include.
include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

List of scaffold names.

Return type:

(list[str])

getScaffolds(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) → DataFrame[source]

Get the subset of the data frame that contains only scaffolds.

Parameters:

scaffolds (list) – List of scaffold calculators of scaffolds to include.
include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

Data frame containing only scaffolds.

Return type:

pd.DataFrame

getSubset(subset: Iterable[str], ids: Iterable[str] | None = None, name: str | None = None, path: str = '.', **kwargs) → MoleculeTable[source]

Get a subset of the data frame.

Parameters:

subset (Iterable[str]) – List of properties to include in the subset.
ids (Iterable[str], optional) – IDs of the molecules to include in the subset.
name (str, optional) – Name of the new data set.
path (str) – Path to the directory where the data set will be stored.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

getSummary() → DataFrame[source]

Get a summary of the data set.

Returns:: Summary of the data set.
Return type:: (pd.DataFrame)
Raises:: NotImplementedError – Summary not yet available for MoleculeTable.

property hasClusters: bool

Check whether the data frame contains clusters.

Returns:: Whether the data frame contains clusters.
Return type:: bool

hasDescriptors(descriptors: list[DescriptorSet | str] | None = None) → bool | list[bool][source]

Check whether the data frame contains given descriptors.

Parameters:: descriptors (list[DescriptorSet | str] | None) – List of descriptor objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.
Returns:: Whether the data frame contains the given descriptors.
Return type:: (bool | list[bool])

hasProperty(name: str) → bool[source]

Check whether a property is present in the data frame.

Parameters:: name (str) – Name of the property.

property hasScaffoldGroups: bool

Check whether the data frame contains scaffold groups.

Returns:: Whether the data frame contains scaffold groups.
Return type:: (bool)

property hasScaffolds: bool

Check whether the data frame contains scaffolds.

Returns:: Whether the data frame contains scaffolds.
Return type:: bool

property idProp: str: Get the name of the property that contains the molecule IDs.

property identifier: ChemIdentifier: Get the identifier to use for the data set.

iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[list[StoredMol], None, None][source]

Iterate over chunks of the data set.

Parameters:

size (int, optional) – Size of the chunks.
on_props (list, optional) – Properties to iterate over.
chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the chunks.

Return type:

(Generator[list[StoredMol], None, None])

property metaFile: str: Get the path to the meta file of the data set.

property nJobs: int: Get the number of jobs to use for parallel processing.

property name: str: Get the name of the data set.

processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None][source]

Process molecules in the data set.

Parameters:

processor (MolProcessor) – Processor to use for molecule processing.
proc_args (tuple, optional) – Positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Keyword arguments to pass to the processor.
mol_type (Literal["smiles", "mol", "rdkit"], optional) – Type of molecules to process.
add_props (Iterable[str], optional) – Additional properties to add to the data frame.

Returns:

Generator of the results.

Return type:

(Generator[Any, None, None])

property randomState: int: Get the random state to use for shuffling and other random ops.

reload()[source]: Reload the data set from disk.

removeProperty(name: str) → bool[source]

Remove a property from the data frame.

Parameters:: name (str) – Name of the property.
Returns:: Whether the property was removed successfully.
Return type:: (bool)

restoreDescriptorSets(descriptors: list[DescriptorSet | str])[source]

Restore descriptors that were previously removed.

Parameters:: descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
Raises:: ValueError – If any of the descriptors are not present in the data set.
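The difference between a soft removal, a full removal and a later restore can be sketched with a small in-memory registry. `DescriptorRegistry` is a hypothetical class written for this illustration; it only models the active/inactive bookkeeping described for `dropDescriptorSets` and `restoreDescriptorSets`, not the files the library manages on disk.

```python
class DescriptorRegistry:
    """Hypothetical bookkeeping for descriptor sets by name."""

    def __init__(self):
        self._sets: dict[str, dict] = {}

    def add(self, name: str, values: list[float]) -> None:
        self._sets[name] = {"values": values, "active": True}

    def drop(self, names: list[str], full_removal: bool = False) -> None:
        for name in names:
            if full_removal:
                # Data is gone for good and cannot be restored.
                del self._sets[name]
            else:
                # Soft removal: keep the data, mark it inactive.
                self._sets[name]["active"] = False

    def restore(self, names: list[str]) -> None:
        for name in names:
            if name not in self._sets:
                raise ValueError(f"Descriptor set not present: {name}")
            self._sets[name]["active"] = True

    def active_names(self) -> list[str]:
        return [name for name, entry in self._sets.items() if entry["active"]]


registry = DescriptorRegistry()
registry.add("fingerprint", [0.0, 1.0])
registry.add("properties", [3.2, 4.1])

registry.drop(["fingerprint"])                        # soft removal
after_soft_drop = registry.active_names()
registry.restore(["fingerprint"])                     # back to active
after_restore = registry.active_names()
registry.drop(["properties"], full_removal=True)      # irreversible
```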

sample(n: int, name: str | None = None, random_state: int | None = None) → MoleculeTable[source]

Sample n molecules from the table.

Parameters:

n (int) – Number of molecules to sample.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _sampled suffix.
random_state (int) – Random state to use for shuffling and other random ops.

Returns:

A new table with the sampled molecules.

Return type:

(MoleculeTable)
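Seeded sampling is what makes repeated calls reproducible. The sketch below shows that property with the standard `random` module; `sample_ids` and `sampled_name` are hypothetical helpers, and only the `_sampled` suffix is taken from the description above.

```python
import random
from typing import Optional


def sample_ids(ids: list[str], n: int, random_state: Optional[int] = None) -> list[str]:
    """Draw n distinct IDs; the same random_state gives the same draw."""
    rng = random.Random(random_state)
    return rng.sample(ids, n)


def sampled_name(name: str, new_name: Optional[str] = None) -> str:
    """Fall back to the old table name plus the _sampled suffix."""
    return new_name if new_name is not None else f"{name}_sampled"


ids = [f"mol_{i}" for i in range(20)]
first = sample_ids(ids, 5, random_state=42)
second = sample_ids(ids, 5, random_state=42)
print(first == second, sampled_name("tox_data"))
```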

save()[source]: Save the whole storage to disk.

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None, path: str = '.') → MoleculeTable[source]

Search the data set based on a property.

Parameters:

prop_name (str) – Name of the property to search on.
values (list[float | int | str]) – Values to search for.
exact (bool) – Whether to perform an exact search.
name (str) – Name of the new table.
path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, path: str = '.') → MoleculeTable[source]

Search the data set with SMARTS patterns.

Parameters:

patterns (list[str]) – List of SMARTS patterns to search for.
operator (Literal["or", "and"]) – Operator to use for combining the patterns.
use_chirality (bool) – Whether to use chirality in the search.
name (str) – Name of the new table.
path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)
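The `operator` argument decides whether a molecule must match any pattern or all of them. The sketch below shows only that combination logic. Real SMARTS matching needs a cheminformatics toolkit, so plain substring checks on SMILES strings are used here as a stand-in; `search`, `has_nitrogen` and `has_carbonyl` are hypothetical names.

```python
from typing import Callable


def search(
    smiles_list: list[str],
    predicates: list[Callable[[str], bool]],
    operator: str = "or",
) -> list[str]:
    """Keep molecules matching any ("or") or all ("and") predicates."""
    if operator not in ("or", "and"):
        raise ValueError(f"Unknown operator: {operator}")
    combine = any if operator == "or" else all
    return [
        smiles
        for smiles in smiles_list
        if combine(predicate(smiles) for predicate in predicates)
    ]


def has_nitrogen(smiles: str) -> bool:
    # Stand-in for a SMARTS match; a substring test is NOT substructure search.
    return "N" in smiles


def has_carbonyl(smiles: str) -> bool:
    return "C(=O)" in smiles


molecules = ["CC(=O)N", "CCN", "CC(=O)O", "CCC"]
either = search(molecules, [has_nitrogen, has_carbonyl], operator="or")
both = search(molecules, [has_nitrogen, has_carbonyl], operator="and")
```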

property smiles: Generator[str, None, None]: Generator of SMILES strings of all molecules in the data set.

property smilesProp: str: Get the name of the property that contains the SMILES strings.

property standardizer: ChemStandardizer: Get the standardizer to use for the data set.

toFile(filename: str)[source]

Save the data set to a file.

Parameters:: filename (str) – Path to the file to save the data set to.

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

transformProperties(names: list[str], transformer: Callable[[Iterable[Any]], Iterable[Any]])[source]

Transform the properties of the data frame.

Parameters:

names (list[str]) – List of property names to transform.
transformer (Callable) – Function to use for transformation.
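
The transformer argument is typed as a callable that takes an iterable of values and returns an iterable of transformed values. A minimal sketch of such a callable follows. The log10_transform and transform_properties helpers are hypothetical, and a plain dict of lists stands in for the table's properties; how the library stores and passes the values is not shown in this section.

```python
import math
from typing import Any, Callable, Iterable


def log10_transform(values: Iterable[Any]) -> Iterable[Any]:
    """Hypothetical transformer: log10 of each value, missing values stay None."""
    return [None if v is None else math.log10(v) for v in values]


def transform_properties(
    props: dict,
    names: list,
    transformer: Callable[[Iterable[Any]], Iterable[Any]],
) -> None:
    """Stand-in for the documented method: overwrite each named property in place."""
    for name in names:
        props[name] = list(transformer(props[name]))


# A dict of lists stands in for the data frame (assumption for illustration).
props = {"IC50_nM": [1.0, 10.0, None, 1000.0], "MW": [180.2, 250.3, 99.1, 310.4]}
transform_properties(props, ["IC50_nM"], log10_transform)
```

Only the properties listed in names are touched; the other columns keep their original values.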

class qsprpred.data.QSPRTable(storage: ChemStore | None = None, name: str | None = None, target_props: list[TargetSpec | dict] | None = None, path: str = '.', random_state: int | None = None, store_format: str = 'pkl', drop_empty_target_props: bool = True)[source]

Bases: QSPRDataSet, MoleculeTable

Implementation of QSPRDataSet using a collection of PandasDataTable objects.

Variables:: targetProperties (list[TargetSpec]) – properties to be predicted with a QSPR model

Construct QSPRdata, also apply transformations of output property if specified.

Parameters:

storage (ChemStore | None) – storage object to use for saving the data. Defaults to None.
name (str) – data name, used in saving the data
target_props (list[TargetSpec | dict] | None) – target properties, names should correspond with target column names in df. If None, target specifications will be inferred if this data set has been saved previously. Defaults to None.
path (str, optional) – path to the directory where the data set will be saved. Defaults to “.”.
random_state (int, optional) – random state for splitting the data.
store_format (str, optional) – format to use for storing the data (‘pkl’ or ‘csv’).
drop_empty_target_props (bool, optional) – whether to ignore entries with empty target properties. Defaults to True.

Raises:

ValueError – Raised if threshold given with non-classification task.

addClusters(clusters: list[MoleculeClusters], recalculate: bool = False)

Add clusters to the data frame.

A new column is created that contains the identifier of the corresponding cluster calculator.

Parameters:

clusters (list) – list of MoleculeClusters calculators.
recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.

addDescriptors(descriptors: list[DescriptorSet], recalculate: bool = False, *args, **kwargs)

Add descriptors to the data frame with the given descriptor calculators.

Parameters:

descriptors (list[DescriptorSet]) – List of DescriptorSet objects to use for descriptor calculation.
recalculate (bool) – Whether to recalculate descriptors even if they are already present in the data frame. If False, existing descriptors are kept and no calculation takes place.
*args – Additional positional arguments to pass to each descriptor set.
**kwargs – Additional keyword arguments to pass to each descriptor set.

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the data set.

Parameters:

ids (list[str]) – IDs of the entries to add.
props (dict[str, list]) – Properties to add.
raise_on_existing (bool) – Whether to raise an error if the entries already exist.

Raises:

NotImplementedError – Adding entries is not yet available for the data set.

addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the data frame.

Parameters:

name (str) – Name of the property.
data (Sized) – Property values.
ids (list[str], optional) – IDs of the molecules to add the property for.

Returns:

Whether the property was added successfully.

Return type:

(bool)

addScaffolds(scaffolds: list[Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)

Add scaffolds to the data frame.

A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.

Parameters:

scaffolds (list) – list of Scaffold calculators.
add_rdkit_scaffold (bool) – Whether to add the RDKit scaffold of the molecule as a new column.
recalculate (bool) – Whether to recalculate scaffolds even if they are already present in the data frame.

addSplit(split: DataSplit, name: str)[source]

Add a split to the dataset.

Performs the split and stores the split object and the indices of the split. If the split supports a random state and none has been set, the random state of the dataset is used.

Parameters:

split (DataSplit) – split to add
name (str) – name of the split

addTargetProperty(target_spec: TargetSpec | dict, drop_empty: bool = True)[source]

Add a target property to the dataset.

Parameters:

target_spec (TargetSpec | dict) – target property specification to add or dictionary to initialize a TargetSpec
drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[Iterable[Any], None, None]

Apply a function to the data set.

Parameters:

func (callable) – Function to apply.
func_args (list, optional) – Positional arguments to pass to the function.
func_kwargs (dict, optional) – Keyword arguments to pass to the function.
on_props (tuple[str, ...], optional) – Properties to apply the function on.
chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the results.

Return type:

(Generator[Iterable[Any], None, None])

applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the data set.

Parameters:: identifier (ChemIdentifier) – Identifier to apply.

applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the data set.

Parameters:: standardizer (ChemStandardizer) – Standardizer to apply.

attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)

Attach descriptors to the data frame.

Parameters:

calculator (DescriptorSet) – DescriptorSet object that was used for descriptor calculation.
descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.
index_cols (list) – List of column names to use as index.

checkClassification(target_property: str) → bool[source]

Checks the validity of the target property for classification tasks.

Parameters:: target_property (str) – Name of the target property to use for classification
Returns:: True if the target property is correctly set up for classification, False otherwise.
Return type:: bool

property chunkSize: int: Get the size of chunks to use per job in parallel processing.

clear(): Clear the data set from memory and disk.

createScaffoldGroups(mols_per_group: int = 10)

Create scaffold groups.

A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.

Parameters:: mols_per_group (int) – Number of molecules per scaffold group.

property descriptorSets: list[DescriptorSet]: Get the descriptor calculators for this table.

property descsPath

dropDescriptorSets(descriptors: list[DescriptorSet | str], full_removal: bool = False)

Drop descriptors from the given sets from the data frame.

Parameters:

descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptorSet from the dataset, including the saved files. It is not possible to restore a descriptorSet after a full removal.

Raises:

AssertionError – If the data set does not contain any descriptors.
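
The difference between a soft and a full removal can be pictured with a toy registry. This is only a sketch under stated assumptions: the DescriptorRegistry class and its methods are hypothetical and do not reflect how the library stores descriptor sets or their files.

```python
class DescriptorRegistry:
    """Toy stand-in that tracks descriptor sets by name (hypothetical)."""

    def __init__(self):
        self._sets = {}  # name -> {"data": ..., "active": bool}

    def add(self, name, data):
        self._sets[name] = {"data": data, "active": True}

    def drop(self, names, full_removal=False):
        for name in names:
            if full_removal:
                # Full removal: the data is gone and cannot be restored.
                del self._sets[name]
            else:
                # Soft removal: keep the data, only mark it inactive.
                self._sets[name]["active"] = False

    def restore(self, names):
        for name in names:
            if name not in self._sets:
                raise ValueError(f"descriptor set {name!r} is not present")
            self._sets[name]["active"] = True

    def active_names(self):
        return [n for n, s in self._sets.items() if s["active"]]


registry = DescriptorRegistry()
registry.add("fingerprint", [[0, 1], [1, 0]])
registry.add("physchem", [[1.2], [3.4]])

registry.drop(["fingerprint"])  # soft removal
after_soft = registry.active_names()
registry.restore(["fingerprint"])  # soft removal can be undone
after_restore = registry.active_names()

registry.drop(["physchem"], full_removal=True)
try:
    registry.restore(["physchem"])
    restore_failed = False
except ValueError:
    restore_failed = True
```

In this sketch a softly removed set stays available for restoring, while restoring a fully removed set raises ValueError.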

dropDescriptors(descriptors: list[str])

Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.

Parameters:: descriptors (list[str]) – List of descriptor names to drop.

dropEmptyEntries(names: list[str])

Drop rows with missing values in the properties.

Parameters:: names (list[str]) – list of property names

dropEntries(ids: Iterable[str])

Drop entries from the data set.

Parameters:: ids (Iterable[str]) – IDs of the entries to drop.

filter(table_filters: list[Callable])[source]

Filter the data set using the given filters.

Parameters:: table_filters (list[DataFilter]) – list of filters to apply

classmethod fromDF(name: str, df: DataFrame, target_props: list[TargetSpec | dict], path: str = '.', smiles_col: str = 'SMILES', drop_empty_target_props: bool = True, **kwargs) → QSPRTable[source]

Create QSPRTable from a pandas DataFrame.

Parameters:

name (str) – name of the data set
df (pd.DataFrame) – data frame containing the data
target_props (list[TargetSpec | dict]) – target properties to use
path (str) – path to the directory where the data set will be saved
smiles_col (str) – name of the column containing SMILES
drop_empty_target_props (bool, optional) – whether to drop rows with empty target property values. Defaults to True.
**kwargs – additional keyword arguments for MoleculeTable constructor

Returns:

created data set

Return type:

QSPRTable

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

classmethod fromMolTable(mol_table: MoleculeTable, target_props: list[TargetSpec | dict], *args, path: str = '.', name: str | None = None, **kwargs) → QSPRTable[source]

Create QSPRTable from a MoleculeTable.

Parameters:

mol_table (MoleculeTable) – MoleculeTable to use as the data source
target_props (list) – list of target properties to use
*args – additional positional arguments to pass to the constructor of QSPRTable
path (str) – path to the directory where the data set will be saved
name (str) – name of the data set
**kwargs – additional keyword arguments to pass to the constructor of QSPRTable

Returns:

created data set

Return type:

QSPRTable

classmethod fromSDF(name: str, filename: str, smiles_prop: str, *args, **kwargs)[source]

Create QSPRTable from SDF file.

This method is currently not implemented for QSPRTable. As a workaround, convert a MoleculeTable with the fromMolTable method.

Parameters:

name (str) – name of the data set
filename (str) – path to the SDF file
smiles_prop (str) – name of the property in the SDF file containing SMILES
*args – additional arguments for QSPRTable constructor
**kwargs – additional keyword arguments for QSPRTable constructor

classmethod fromSMILES(name: str, smiles: list, path: str, *args, **kwargs)

Create a MoleculeTable instance from a list of SMILES sequences.

Parameters:

name (str) – Name of the data set.
smiles (list) – list of SMILES sequences.
path (str) – Path to the directory where the data set will be stored.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

Returns:

The created data set.

Return type:

(MoleculeTable)

classmethod fromTableFile(name: str, filename: str, path: str, *args, sep: str = '\t', target_props: list[TargetSpec | dict] | None = None, **kwargs)[source]

Create QSPRTable from table file (i.e. CSV or TSV).

Parameters:

name (str) – name of the data set
filename (str) – path to the table file
path (str) – path to the directory where the data set will be saved
*args – additional arguments for MolTable constructor
sep (str, optional) – separator in the table file. Defaults to “\t”.
target_props (list[TargetSpec | dict], optional) – target properties to use. Defaults to None.
**kwargs – additional keyword arguments for MolTable constructor

Returns:

QSPRTable object

Return type:

QSPRTable

generateDescriptorDataSetName(ds_set: str | DescriptorSet, name: str | None = None) → str

Generate a descriptor set name from a descriptor set.

Parameters:

ds_set (str | DescriptorSet) – Name of the descriptor set.
name (str) – Name of the data set.

Returns:

Name of the descriptor set.

Return type:

(str)

getClusterNames(clusters: list[MoleculeClusters] | None = None) → list[str]

Get the names of the clusters in the data frame.

Parameters:: clusters (list) – List of cluster calculators of clusters to include
Returns:: List of cluster names.
Return type:: (list[str])

getClusters(clusters: list[MoleculeClusters] | None = None)

Get the subset of the data frame that contains only clusters.

Parameters:: clusters (list) – List of cluster calculators of clusters to include.
Returns:: Data frame containing only clusters.
Return type:: pd.DataFrame

getDF() → DataFrame: Get the data frame of the data set.

getDescriptorNames() → list[str]

Get the names of the descriptors present for molecules in this data set.

Returns:: list of descriptor names.
Return type:: (list[str])

getDescriptors(active_only: bool = True) → DataFrame

Get the calculated descriptors as a pandas data frame.

Returns:: Data frame containing only descriptors.
Return type:: pd.DataFrame

getProperties() → list[str]: Get the names of the properties in the data frame.

getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any]

Get the property with the given name.

Parameters:

name (str) – Name of the property.
ids (tuple[str], optional) – IDs of the molecules to get the property for.

Returns:

Property values.

Return type:

(Iterable[Any])

getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10) → Series

Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.

Parameters:

scaffold_name (str) – Name of the scaffold.
mol_per_group (int) – Number of molecules per scaffold group.

Returns:

Series containing the scaffold groups.

Return type:

(pd.Series)

getScaffoldNames(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) → list[str]

Get the names of the scaffolds in the data frame.

Parameters:

scaffolds (list) – List of scaffold calculators of scaffolds to include.
include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

List of scaffold names.

Return type:

(list[str])

getScaffolds(scaffolds: list[Scaffold] | None = None, include_mols: bool = False) → DataFrame

Get the subset of the data frame that contains only scaffolds.

Parameters:

scaffolds (list) – List of scaffold calculators of scaffolds to include.
include_mols (bool) – Whether to include the RDKit scaffold columns as well.

Returns:

Data frame containing only scaffolds.

Return type:

pd.DataFrame

getSplit(name: str, as_type: str = 'split') → DataSplit | list[tuple[Index, Index]][source]

Get the split with the given name.

Parameters:

name (str) – name of the split
as_type (str) – Determines the type of output. Can be one of:

“split”: Returns a DataSplit object.
“ids”: Returns train and test indices.

Returns:

the DataSplit object if as_type is “split”, or the train and test indices if as_type is “ids”

Return type:

DataSplit | list[tuple[pd.Index, pd.Index]]

getSubset(subset: list[str], ids: list[str] | None = None, name: str | None = None, path: str = '.', **kwargs) → QSPRTable[source]

Get a subset of the data set.

Parameters:

subset (list[str]) – list of columns to include in the subset
ids (list[str], optional) – list of IDs to include in the subset. Defaults to None.
name (str, optional) – name of the subset. Defaults to None.
path (str, optional) – path to the directory where the subset will be saved. Defaults to “.”.
**kwargs – additional keyword arguments for the constructor of QSPRTable.

Returns:

subset of the data set

Return type:

QSPRTable

getSummary() → DataFrame

Get a summary of the data set.

Returns:: Summary of the data set.
Return type:: (pd.DataFrame)
Raises:: NotImplementedError – Summary not yet available for MoleculeTable.

getTarget(name: str | TargetSpec) → Series[source]

Get the target property values for the given target property.

Parameters:: name (str | TargetSpec) – name or specification of the target property
Returns:: target property values
Return type:: (pd.Series)

getTargetPropertiesNames() → list[str]

Get the names of the target properties.

Returns:: list of target property names
Return type:: (list[str])

getTargetSpec(name: str) → TargetSpec[source]

Get the target specification of a single target property by its name.

Parameters:: name (str) – name of the target property
Returns:: target specification with the given name
Return type:: TargetSpec
Raises:: ValueError – if the target property with the given name is not found

getTargetSpecs(names: list | None) → list[TargetSpec][source]

Get the target specifications with the given names.

Parameters:: names (list[str]) – names of the target properties
Returns:: list of target specifications
Return type:: (list[TargetSpec])

getTargets() → DataFrame[source]

Get the target property values

Returns:: target property values
Return type:: (pd.DataFrame)

property hasClusters: bool

Check whether the data frame contains clusters.

Returns:: Whether the data frame contains clusters.
Return type:: bool

hasDescriptors(descriptors: list[DescriptorSet | str] | None = None) → bool | list[bool]

Check whether the data frame contains given descriptors.

Parameters:: descriptors (list[DescriptorSet | str] | None) – List of descriptor objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.
Returns:: Whether the data frame contains the given descriptors.
Return type:: (bool | list[bool])

hasProperty(name: str) → bool

Check whether a property is present in the data frame.

Parameters:: name (str) – Name of the property.

property hasScaffoldGroups: bool

Check whether the data frame contains scaffold groups.

Returns:: Whether the data frame contains scaffold groups.
Return type:: (bool)

property hasScaffolds: bool

Check whether the data frame contains scaffolds.

Returns:: Whether the data frame contains scaffolds.
Return type:: bool

property idProp: str: Get the name of the property that contains the molecule IDs.

property identifier: ChemIdentifier: Get the identifier to use for the data set.

property isMultiTask: bool

Check if the dataset contains multiple target properties.

Returns:: True if the dataset contains multiple target properties
Return type:: (bool)

iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[list[StoredMol], None, None]

Iterate over chunks of the data set.

Parameters:

size (int, optional) – Size of the chunks.
on_props (list, optional) – Properties to iterate over.
chunk_type (Literal["mol", "smiles", "rdkit", "df"], optional) – Type of chunks to use for processing.

Returns:

Generator of the chunks.

Return type:

(Generator[list[StoredMol], None, None])
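
Chunked iteration of this kind can be sketched in a few lines. The iter_chunks helper below is hypothetical and works on any iterable; it assumes that the last chunk may be shorter than size, which this section does not state for the library.

```python
from itertools import islice
from typing import Any, Generator, Iterable


def iter_chunks(items: Iterable[Any], size: int) -> Generator[list, None, None]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    iterator = iter(items)
    # islice pulls up to `size` items; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C"]
chunks = list(iter_chunks(smiles, size=2))
```

Because the helper is a generator, only one chunk is held in memory at a time, which is the usual motivation for processing a large table in chunks.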

iterSplit(name: str, as_type: str = 'ids') → Generator[tuple[Index, Index], None, None] | Generator[tuple[ndarray, ndarray, ndarray, ndarray], None, None] | Generator[tuple[DataFrame, DataFrame, DataFrame, DataFrame], None, None] | Generator[tuple[QSPRTable, QSPRTable], None, None][source]

Iterate over the train and test sets of the split with the given name.

Parameters:

name (str) – name of the split
as_type (str) – Determines the type of output. Can be one of:

“ids”: Yields train and test indices.
“numpy”: Yields train and test numpy arrays.
“pandas”: Yields train and test pandas DataFrames.
“QSPRTable”: Yields train and test QSPRTable objects.

Yields:

tuple[pd.Index, pd.Index] – train and test indices if as_type is “ids”
tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray] – train descriptors, train targets, test descriptors, test targets if as_type is “numpy”
tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame] – train descriptors, train targets, test descriptors, test targets if as_type is “pandas”
tuple[QSPRTable, QSPRTable] – train and test QSPRTable objects if as_type is “QSPRTable”

makeClassification(target_property: str, th: list[float] | None = None)[source]

Switch to classification task using the given threshold values.

Parameters:

target_property (str) – Name of target property to use for classification
th (list[float], optional) – list of threshold values. If not provided, it is assumed that the target property is already discretized and can be used for classification.
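
Turning a continuous target into class labels with a list of thresholds can be sketched as follows. This is an illustration only: the discretize helper is hypothetical, it treats the thresholds as interior cut points, and it places a value equal to a threshold in the lower class. The library may interpret a multi-value list differently (for example as bin edges) and may handle boundaries differently.

```python
from bisect import bisect_left


def discretize(values: list, th: list) -> list:
    """Map each value to the number of thresholds it strictly exceeds."""
    cuts = sorted(th)
    return [bisect_left(cuts, v) for v in values]


activity = [4.2, 6.5, 6.6, 8.1]
binary = discretize(activity, [6.5])      # one threshold: two classes
multi = discretize(activity, [5.0, 7.0])  # two thresholds: three classes
```

With a single threshold the result is a binary label per sample; each additional threshold adds one more class.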

makeRegression(target_property: str)[source]

Switch to regression task using the given target property.

Parameters:: target_property (str) – name of the target property to use for regression

property metaFile: str: Get the path to the meta file of the data set.

property nJobs: int: Get the number of jobs to use for parallel processing.

property nTargetProperties: int: Get the number of target properties in the dataset.

property name: str: Get the name of the data set.

processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None]

Process molecules in the data set.

Parameters:

processor (MolProcessor) – Processor to use for molecule processing.
proc_args (tuple, optional) – Positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Keyword arguments to pass to the processor.
mol_type (Literal["smiles", "mol", "rdkit"], optional) – Type of molecules to process.
add_props (Iterable[str], optional) – Additional properties to add to the data frame.

Returns:

Generator of the results.

Return type:

(Generator[Any, None, None])

property randomState: int: Get the random state to use for shuffling and other random ops.

reload(): Reload the data set from disk.

removeProperty(name: str) → bool

Remove a property from the data frame.

Parameters:: name (str) – Name of the property.
Returns:: Whether the property was removed successfully.
Return type:: (bool)

restoreDescriptorSets(descriptors: list[DescriptorSet | str])

Restore descriptors that were previously removed.

Parameters:: descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
Raises:: ValueError – If any of the descriptors are not present in the data set.

restoreTargetProperty(prop: TargetSpec | str)[source]

Reset target property to its original value.

Parameters:: prop (TargetSpec | str) – target property to reset

sample(n: int, name: str | None = None, random_state: int | None = None) → MoleculeTable

Sample n molecules from the table.

Parameters:

n (int) – Number of molecules to sample.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _sampled suffix.
random_state (int) – Random state to use for shuffling and other random ops.

Returns:

A new table with the sampled molecules.

Return type:

(MoleculeTable)

save(): Save the whole storage to disk.

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None, path: str = '.') → MoleculeTable

Search the data set based on a property.

Parameters:

prop_name (str) – Name of the property to search on.
values (list[float | int | str]) – Values to search for.
exact (bool) – Whether to perform an exact search.
name (str) – Name of the new table.
path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, path: str = '.') → MoleculeTable

Search the data set with SMARTS patterns.

Parameters:

patterns (list[str]) – List of SMARTS patterns to search for.
operator (Literal["or", "and"]) – Operator to use for combining the patterns.
use_chirality (bool) – Whether to use chirality in the search.
name (str) – Name of the new table.
path (str) – Path to the directory where the new table will be stored.

Returns:

Data set containing the search results.

Return type:

(MoleculeTable)

setTargetProperties(target_props: list[TargetSpec | dict], drop_empty: bool = True)[source]

Set list of target properties for the dataset.

Parameters:

target_props (list[TargetSpec | dict]) – list of target properties specifications or dictionaries to initialize the TargetSpec objects from.
drop_empty (bool, optional) – whether to drop rows with empty target property values. Defaults to True.

property smiles: Generator[str, None, None]: Generator of SMILES strings of all molecules in the data set.

property smilesProp: str: Get the name of the property that contains the SMILES strings.

split(split: DataSplit) → Generator[tuple[Index, Index], None, None][source]

Create folds from Descriptors and Targets. Can be used either for cross-validation, bootstrapping or train-test split.

Parameters:

split (DataSplit) – Split to apply to the data
X (pd.DataFrame) – data to apply the split to
y (pd.DataFrame | None) – target data to apply the split to

Yields:

pd.Index, pd.Index – indices of the train and test set

property standardizer: ChemStandardizer: Get the standardizer to use for the data set.

property targetProperties: list[TargetSpec]: Returns the specifications of target properties of the dataset.

toFile(filename: str)

Save the data set to a file.

Parameters:: filename (str) – Path to the file to save the data set to.

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

transformProperties(names: list[str], transformer: Callable[[Iterable[Any]], Iterable[Any]])

Transform the properties of the data frame.

Parameters:

names (list[str]) – List of property names to transform.
transformer (Callable) – Function to use for transformation.

unsetTargetProperty(name: str | TargetSpec)[source]

Unset a target property. It will not remove it from the data set, but will make it unavailable for training.

Parameters:: name (str | TargetSpec) – name or specification of the target property to drop

class qsprpred.data.RandomSplit(test_fraction=0.1, seed: int | None = None)[source]

Bases: DataSplit, Randomized

Splits dataset in random train and test subsets.

Variables:

testFraction (float) – fraction of the total dataset to assign to the test set
seed (int) – Random state to use for shuffling and other random operations.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

property randomState: int: Get the random state for the object.

split(X, y)[source]

Split the given data into one or multiple train/test subsets.

Splitter classes partition a feature matrix by returning a generator of train and test indices. This is compatible with the approach taken in the sklearn package (see sklearn.model_selection._BaseKFold) and can be used for both cross-validation and a one-time train/test split.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets, each represented as a tuple of (train_indices, test_indices), where the indices are the row indices of the input data matrix X (note that these are integer indices, rather than a pandas index!)
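
The contract described here, a generator of (train_indices, test_indices) tuples holding integer row positions, can be sketched with the standard library alone. The random_split function below is a hypothetical stand-in and not the library's implementation; in particular, rounding the test size and forcing at least one test sample are assumptions.

```python
import random


def random_split(X, y, test_fraction=0.1, seed=None):
    """Yield one (train_indices, test_indices) tuple of integer row positions."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    positions = list(range(len(X)))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(positions)
    n_test = max(1, round(len(positions) * test_fraction))
    test = sorted(positions[:n_test])
    train = sorted(positions[n_test:])
    yield train, test


X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
train_idx, test_idx = next(random_split(X, y, test_fraction=0.2, seed=42))
repeat = next(random_split(X, y, test_fraction=0.2, seed=42))
```

The yielded values are positions into X, so with a pandas DataFrame they would be used with positional indexing rather than label-based indexing.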

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)
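
The toJSON and fromJSON pair describes a round trip: everything needed to rebuild the object goes into the string, and the class method reads it back. A minimal sketch of that pattern, using a hypothetical ToySplit class rather than the library's own serialization code:

```python
import json


class ToySplit:
    """Hypothetical splitter used only to illustrate the JSON round trip."""

    def __init__(self, test_fraction=0.1, seed=None):
        self.testFraction = test_fraction
        self.seed = seed

    def toJSON(self) -> str:
        # Store every attribute needed to reconstruct the object.
        return json.dumps({"testFraction": self.testFraction, "seed": self.seed})

    @classmethod
    def fromJSON(cls, json_str: str) -> "ToySplit":
        state = json.loads(json_str)
        return cls(test_fraction=state["testFraction"], seed=state["seed"])


original = ToySplit(test_fraction=0.2, seed=42)
restored = ToySplit.fromJSON(original.toJSON())
```

Storing the seed alongside the other settings is what would let a restored splitter reproduce the same random split.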

class qsprpred.data.ScaffoldSplit(scaffold: Scaffold = <qsprpred.data.chem.scaffolds.BemisMurckoRDKit object>, test_fraction: float = 0.1, n_folds: int = 1, custom_test_list: list | None = None, data_set: QSPRDataSet | None = None, **split_kwargs)[source]

Bases: GBMTDataSplit

Splits dataset into balanced train and test subsets based on molecular scaffolds.

Variables:

testFraction (float) – fraction of the total dataset to assign to the test set
customTestList (list) – list of molecule indexes to force in test set
split_kwargs (dict) – additional arguments to be passed to the GloballyBalancedSplit

Initialize a GBMTDataSplit object.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) → None: Set the data set for this object.

split(X: ndarray | DataFrame, y: ndarray | DataFrame | Series) → Iterable[tuple[list[int], list[int]]]

Split dataset into balanced train and test subsets based on an initial clustering algorithm.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets, each represented as a tuple of (train_indices, test_indices), where the indices are the row indices of the input data matrix
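
The idea commonly associated with scaffold-based splitting is that molecules sharing a scaffold end up in the same subset. The sketch below shows that constraint with a simple greedy assignment. It is not the globally balanced algorithm this class inherits from GBMTDataSplit: the grouped_split function is hypothetical, and the scaffold labels are assumed to be precomputed strings.

```python
from collections import defaultdict


def grouped_split(groups: list, test_fraction: float = 0.1) -> tuple:
    """Assign whole groups to train or test; returns (train_rows, test_rows)."""
    members = defaultdict(list)
    for row, group in enumerate(groups):
        members[group].append(row)
    n_test_target = round(len(groups) * test_fraction)
    train, test = [], []
    # Visit small groups first and fill the test set without overshooting.
    for group in sorted(members, key=lambda g: (len(members[g]), g)):
        rows = members[group]
        if len(test) + len(rows) <= n_test_target:
            test.extend(rows)
        else:
            train.extend(rows)
    return sorted(train), sorted(test)


scaffolds = [
    "benzene", "benzene", "pyridine", "indole", "indole",
    "indole", "benzene", "furan", "pyridine", "benzene",
]
train_idx, test_idx = grouped_split(scaffolds, test_fraction=0.3)
```

No scaffold appears on both sides of the split, so test performance reflects generalization to scaffolds not seen during training.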

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

class qsprpred.data.TemporalSplit(timesplit: float | list[float], timeprop: str, data_set: QSPRDataSet | None = None)[source]

Bases: DataSplit, DataSetDependent

Splits dataset into train and test subsets based on a threshold in time.

Variables:

timeSplit (float) – time point after which samples are assigned to the test set
timeCol (str) – name of the column within the dataframe with timepoints

Initialize a TemporalSplit object.

Parameters:

timesplit (float | list[float]) – time point after which sample is moved to test set. If a list is provided, the splitter will split the dataset into multiple subsets based on the timepoints in the list.
timeprop (str) – name of the column within the dataset with timepoints
data_set (QSPRDataSet) – dataset that this splitter will be acting on

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) → None: Set the data set for this object.

split(X, y)[source]

Split single-task dataset based on a time threshold.

Parameters:

X (np.ndarray | pd.DataFrame) – the input data matrix
y (np.ndarray | pd.DataFrame | pd.Series) – the target variable(s)

Returns:

a generator over the generated subsets, each represented as a tuple of (train_indices, test_indices), where the indices are the row indices of the input data matrix
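
A time-threshold split can be sketched as follows, including the case where a list of time points produces several train/test pairs. The temporal_split function is a hypothetical stand-in that takes the time values directly; it assumes that samples exactly at the threshold stay in the training set, which this section does not specify.

```python
def temporal_split(times, timesplit):
    """Yield (train_indices, test_indices) for each time threshold."""
    thresholds = timesplit if isinstance(timesplit, list) else [timesplit]
    for threshold in thresholds:
        # Samples after the threshold go to the test set, the rest to training.
        train = [i for i, t in enumerate(times) if t <= threshold]
        test = [i for i, t in enumerate(times) if t > threshold]
        yield train, test


years = [2012, 2015, 2018, 2019, 2021]
single = list(temporal_split(years, 2018))
folds = list(temporal_split(years, [2015, 2019]))
```

Each threshold yields one train/test pair, so a list of time points gives a series of progressively larger training sets.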

splitDataset(dataset: QSPRDataSet)

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)