qsprpred.data.tables package
Submodules
qsprpred.data.tables.base module
- class qsprpred.data.tables.base.DataSetDependant(dataset: MoleculeDataTable | None = None)[source]
Bases:
object
Classes that need a data set to operate have to implement this.
- getDataSet()[source]
Get the data set attached to this object.
- Raises:
ValueError – If no data set is attached to this object.
- setDataSet(dataset: MoleculeDataTable)[source]
- class qsprpred.data.tables.base.DataTable[source]
Bases:
StoredTable
- abstract apply(func: callable, on_props: list[str] | None = None, func_args: list | None = None, func_kwargs: dict | None = None)[source]
Apply a function on all or selected properties. The properties are supplied as the first positional argument to the function.
- abstract clearFiles()
Delete the files associated with the table.
- abstract filter(table_filters: list[Callable])[source]
Filter the dataset.
- Parameters:
table_filters (List[Callable]) – The filters to apply.
- abstract static fromFile(filename: str) StoredTable
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
- abstract getSubset(prefix: str)[source]
Get a subset of the dataset.
- Parameters:
prefix (str) – The prefix of the subset.
- abstract reload()
Reload the table from a file.
- abstract removeProperty(name: str)[source]
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save()
Save the table to a file.
- class qsprpred.data.tables.base.MoleculeDataTable[source]
Bases:
DataTable
- abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)[source]
Add descriptors to the dataset.
- Parameters:
descriptors (list[DescriptorSet]) – The descriptors to add.
args – Additional positional arguments to be passed to each descriptor set.
kwargs – Additional keyword arguments to be passed to each descriptor set.
- abstract apply(func: callable, on_props: list[str] | None = None, func_args: list | None = None, func_kwargs: dict | None = None)
Apply a function on all or selected properties. The properties are supplied as the first positional argument to the function.
- abstract clearFiles()
Delete the files associated with the table.
- abstract filter(table_filters: list[Callable])
Filter the dataset.
- Parameters:
table_filters (List[Callable]) – The filters to apply.
- abstract static fromFile(filename: str) StoredTable
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
- abstract getDescriptorNames() list[str] [source]
Get the names of the descriptors that are currently in the dataset.
- Returns:
a list of descriptor names
- abstract getDescriptors() DataFrame [source]
Get the table of descriptors that are currently in the dataset.
- Returns:
a pd.DataFrame with the descriptors
- abstract getProperties()
Get the property names contained in the dataset.
- abstract getSubset(prefix: str)
Get a subset of the dataset.
- Parameters:
prefix (str) – The prefix of the subset.
- abstract reload()
Reload the table from a file.
- abstract removeProperty(name: str)
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save()
Save the table to a file.
- class qsprpred.data.tables.base.StoredTable[source]
Bases:
ABC
Abstract base class for tables that are stored in a file.
- abstract static fromFile(filename: str) StoredTable [source]
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
qsprpred.data.tables.mol module
- class qsprpred.data.tables.mol.DescriptorTable(calculator: DescriptorSet, name: str, df: DataFrame | None = None, store_dir: str = '.', overwrite: bool = False, key_cols: list | None = None, n_jobs: int = 1, chunk_size: int = 1000, autoindex_name: str = 'QSPRID', random_state: int | None = None, store_format: str = 'pkl')[source]
Bases:
PandasDataTable
Pandas table that holds descriptor data for modelling and other analyses.
- Variables:
calculator (DescriptorSet) – DescriptorSet used for descriptor calculation.
Initialize a DescriptorTable object.
- Parameters:
calculator (DescriptorSet) – DescriptorSet used for descriptor calculation.
name (str) – Name of the new descriptor table.
df (pd.DataFrame) – Data frame containing the descriptors. If you provide a dataframe for a dataset that already exists on disk, the dataframe from disk will override the supplied data frame. Set ‘overwrite’ to True to override the data frame on disk.
store_dir (str) – Directory to store the dataset files. Defaults to the current directory. If it already contains files with the same name, the existing data will be loaded.
overwrite (bool) – Overwrite existing dataset.
key_cols (list) – list of columns to use as index. If None, the index will be a custom generated ID.
n_jobs (int) – Number of jobs to use for parallel processing. If <= 0, all available cores will be used.
chunk_size (int) – Size of chunks to use per job in parallel processing.
autoindex_name (str) – Column name to use for automatically generated IDs.
random_state (int) – Random state to use for shuffling and other random ops.
store_format (str) – Format to use for storing the data (‘pkl’ or ‘csv’).
- apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: list[str] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator
Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.
Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.
If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.
- Parameters:
func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – List of properties to send to the function as arguments.
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.
- Returns:
Generator that yields the results of the function applied to each chunk of the data frame, as determined by chunk_size and n_jobs. Each item in the generator is the result of the function applied to one chunk of the data set.
- Return type:
Generator
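The chunked apply pattern described above can be sketched with plain pandas. This is a conceptual stand-in, not the qsprpred implementation; the function apply_chunked and its defaults are hypothetical:

```python
import pandas as pd

# Conceptual sketch of the chunked apply contract: properties are passed
# either as a dict of lists or, with as_df=True, as a data frame chunk.
def apply_chunked(df, func, on_props=None, as_df=False, chunk_size=2):
    props = list(df.columns) if on_props is None else on_props
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size][props]
        # pass a dict of lists by default, a data frame if as_df=True
        yield func(chunk if as_df else {p: chunk[p].tolist() for p in props})

df = pd.DataFrame({"prop1": [1, 2, 3, 4], "prop2": [10, 20, 30, 40]})
# Sum 'prop1' within each chunk of two rows.
results = list(apply_chunked(df, lambda d: sum(d["prop1"]), on_props=["prop1"]))
# results == [3, 7]
```

Each yielded item corresponds to one chunk of the data, mirroring how the generator returned by apply produces one result per chunk.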
- clearFiles()
Remove all files associated with this data set from disk.
- dropEmptyProperties(names: list[str])
Drop rows with empty target property value from the data set.
- filter(table_filters: list[Callable])
Filter the data frame using a list of filters.
Each filter is a function that takes the data frame and returns a new data frame with the filtered rows. The new data frame is then used as the input for the next filter. The final data frame is saved as the new data frame of the MoleculeTable.
- classmethod fromFile(filename: str) PandasDataTable
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
- generateIndex(name: str | None = None, prefix: str | None = None)
Generate a custom index for the data frame automatically.
- getDF()
Get the data frame this instance manages.
- Returns:
The data frame this instance manages.
- Return type:
pd.DataFrame
- getDescriptorNames(active_only=True)[source]
Get the names of the descriptors represented by this table. By default, only active descriptors are returned. Use active_only=False to get all descriptors saved in the table.
- getProperties() list[str]
Get names of all properties/variables saved in the data frame (all columns).
- Returns:
list of property names.
- Return type:
list[str]
- getProperty(name: str) Series
Get property values from the data set.
- Parameters:
name (str) – Name of the property to get.
- Returns:
List of values for the property.
- Return type:
pd.Series
- getSubset(prefix: str)
Get a subset of the data set by providing a prefix for the column names or a column name directly.
- Parameters:
prefix (str) – Prefix of the column names to select.
- hasProperty(name)
Check whether a property is present in the data frame.
- imputeProperties(names: list[str], imputer: Callable)
Impute missing property values.
- Parameters:
names (list) – List of property names to impute.
imputer (Callable) – imputer object implementing the fit_transform method from the scikit-learn API.
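Any object exposing a scikit-learn style fit_transform method qualifies as an imputer here. A minimal sketch with a hypothetical mean imputer (illustrative only, not part of qsprpred):

```python
import pandas as pd

# Hypothetical imputer: fills missing values with the column mean.
# It satisfies the fit_transform contract expected by imputeProperties.
class MeanImputer:
    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.fillna(X.mean())

df = pd.DataFrame({"pchembl": [6.0, None, 8.0]})
imputed = MeanImputer().fit_transform(df[["pchembl"]])
# imputed["pchembl"] == [6.0, 7.0, 8.0]
```

In practice, scikit-learn's own imputers (e.g. SimpleImputer) follow the same contract.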
- iterChunks(include_props: list[str] | None = None, as_dict: bool = False, chunk_size: int | None = None) Generator[DataFrame | dict, None, None]
Batch a data frame into chunks of the given size.
- Parameters:
include_props (list[str]) – Properties to include in the chunks. If None, all properties are included.
as_dict (bool) – If True, the chunks are yielded as dictionaries instead of data frames.
chunk_size (int) – Size of the chunks. If None, self.chunkSize is used.
- Returns:
Generator that yields batches of the data frame as smaller data frames.
- Return type:
Generator[pd.DataFrame, None, None]
- keepDescriptors(descriptors: list[str]) list[str] [source]
Mark only the given descriptors as active in this set.
- Parameters:
descriptors (list) – list of descriptor names to keep
- Returns:
list of descriptor names that were kept
- Return type:
list[str]
- Raises:
ValueError – If any of the descriptors are not present in the table.
- property metaFile
The path to the meta file of this data set.
- property nJobs
- reload()
Reload the data table from disk.
- removeProperty(name)
Remove a property from the data frame.
- Parameters:
name (str) – Name of the property to delete.
- save()
Save the data frame to disk and all associated files.
- Returns:
Path to the saved data frame.
- Return type:
str
- setIndex(cols: list[str])
Create an index column from several columns of the data set. This also resets the idProp attribute to the name of the index columns joined by a ‘~’ character. The values of the columns are joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined into a string.
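The ‘~’ joining described above can be sketched with plain pandas (a conceptual stand-in, not the qsprpred implementation; set_joined_index is a hypothetical helper):

```python
import pandas as pd

# Sketch: join several columns with '~' to form a single index column,
# mirroring how setIndex derives idProp and the index values.
def set_joined_index(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    id_prop = "~".join(cols)  # e.g. "batch~mol_id"
    df = df.copy()
    df[id_prop] = df[cols].astype(str).agg("~".join, axis=1)
    return df.set_index(id_prop)

df = pd.DataFrame({"batch": ["A", "A"], "mol_id": [1, 2]})
indexed = set_joined_index(df, ["batch", "mol_id"])
# indexed.index.name == "batch~mol_id"; index values: ["A~1", "A~2"]
```

Note how non-unique column combinations would produce duplicate index values, which is why the columns must be unique together.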
- setRandomState(random_state: int)
Set the random state for this instance.
- Parameters:
random_state (int) – Random state to use for shuffling and other random operations.
- property storeDir
The data set folder containing the data set files after saving.
- property storePath
The path to the main data set file.
- property storePrefix
The prefix of the data set files.
- toFile(filename: str)
Save the metafile and all associated files to a custom location.
- Parameters:
filename (str) – absolute path to the saved metafile.
- class qsprpred.data.tables.mol.MoleculeTable(name: str, df: DataFrame | None = None, smiles_col: str = 'SMILES', add_rdkit: bool = False, store_dir: str = '.', overwrite: bool = False, n_jobs: int | None = 1, chunk_size: int | None = None, drop_invalids: bool = True, index_cols: list[str] | None = None, autoindex_name: str = 'QSPRID', random_state: int | None = None, store_format: str = 'pkl')[source]
Bases:
PandasDataTable, SearchableMolTable, Summarizable
Class that holds and prepares molecule data for modelling and other analyses.
- Variables:
smilesCol (str) – Name of the column containing the SMILES sequences of molecules.
includesRdkit (bool) – Whether the data frame contains RDKit molecules as one of the properties.
descriptors (list[DescriptorTable]) – List of DescriptorTable objects containing the descriptors calculated for this table.
Initialize a MoleculeTable object.
This object wraps a pandas dataframe and provides short-hand methods to prepare molecule data for modelling and analysis.
- Parameters:
name (str) – Name of the dataset. You can use this name to load the dataset from disk anytime and create a new instance.
df (pd.DataFrame) – Pandas dataframe containing the data. If you provide a dataframe for a dataset that already exists on disk, the dataframe from disk will override the supplied data frame. Set ‘overwrite’ to True to override the data frame on disk.
smiles_col (str) – Name of the column containing the SMILES sequences of molecules.
add_rdkit (bool) – Add RDKit molecule instances to the dataframe. WARNING: This can take a lot of memory.
store_dir (str) – Directory to store the dataset files. Defaults to the current directory. If it already contains files with the same name, the existing data will be loaded.
overwrite (bool) – Overwrite existing dataset.
n_jobs (int) – Number of jobs to use for parallel processing. If <= 0, all available cores will be used.
chunk_size (int) – Size of chunks to use per job in parallel processing.
drop_invalids (bool) – Drop invalid molecules from the data frame.
index_cols (list[str]) – list of columns to use as index. If None, the index will be a custom generated ID.
autoindex_name (str) – Column name to use for automatically generated IDs.
random_state (int) – Random state to use for shuffling and other random ops.
store_format (str) – Format to use for storing the data (‘pkl’ or ‘csv’).
- addClusters(clusters: list['MoleculeClusters'], recalculate: bool = False)[source]
Add clusters to the data frame.
A new column is created that contains the identifier of the corresponding cluster calculator.
- Parameters:
clusters (list) – List of MoleculeClusters calculators.
recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.
- addDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet], recalculate: bool = False, fail_on_invalid: bool = True, *args, **kwargs)[source]
Add descriptors to the data frame with the given descriptor calculators.
- Parameters:
descriptors (list[DescriptorSet]) – List of DescriptorSet objects to use for descriptor calculation.
recalculate (bool) – Whether to recalculate descriptors even if they are already present in the data frame. If False, existing descriptors are kept and no calculation takes place.
fail_on_invalid (bool) – Whether to throw an exception if any molecule is invalid.
*args – Additional positional arguments to pass to each descriptor set.
**kwargs – Additional keyword arguments to pass to each descriptor set.
- addScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)[source]
Add scaffolds to the data frame.
A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.
- apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: list[str] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator
Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.
Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.
If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.
- Parameters:
func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – List of properties to send to the function as arguments.
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.
- Returns:
Generator that yields the results of the function applied to each chunk of the data frame, as determined by chunk_size and n_jobs. Each item in the generator is the result of the function applied to one chunk of the data set.
- Return type:
Generator
- attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)[source]
Attach descriptors to the data frame.
- Parameters:
calculator (DescriptorsCalculator) – DescriptorsCalculator object to use for descriptor calculation.
descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.
index_cols (list) – List of column names to use as index.
- checkMols(throw: bool = True)[source]
Returns a boolean array indicating whether each molecule is valid. If throw is True, an exception is thrown if any molecule is invalid.
- Parameters:
throw (bool) – Whether to throw an exception if any molecule is invalid.
- Returns:
Boolean series indicating whether each molecule is valid.
- Return type:
mask (pd.Series)
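The mask-or-raise contract of checkMols can be sketched as follows. This is an illustrative stand-in, not the qsprpred code, and is_valid_smiles is a deliberately naive placeholder (a real implementation would parse with RDKit):

```python
import pandas as pd

def is_valid_smiles(smiles: str) -> bool:
    # Placeholder validity check for illustration only.
    return isinstance(smiles, str) and len(smiles) > 0

# Sketch of the checkMols contract: return a boolean mask over molecules,
# or raise if throw=True and any molecule is invalid.
def check_mols(df: pd.DataFrame, smiles_col: str = "SMILES",
               throw: bool = True) -> pd.Series:
    mask = df[smiles_col].apply(is_valid_smiles)
    if throw and not mask.all():
        raise ValueError(f"Invalid molecules found: {list(df.index[~mask])}")
    return mask

df = pd.DataFrame({"SMILES": ["CCO", "", "c1ccccc1"]}, index=["a", "b", "c"])
mask = check_mols(df, throw=False)
# mask is a boolean pd.Series: a=True, b=False, c=True
```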
- clearFiles()
Remove all files associated with this data set from disk.
- createScaffoldGroups(mols_per_group: int = 10)[source]
Create scaffold groups.
A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.
- Parameters:
mols_per_group (int) – number of molecules per scaffold group.
- property descriptorSets
Get the descriptor calculators for this table.
- dropDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str], full_removal: bool = False)[source]
Drop descriptors from the given sets from the data frame.
- Parameters:
descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. The name of a descriptor set corresponds to the result returned by its __str__ method.
full_removal (bool) – Whether to remove the descriptor data (full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal removes the descriptor set from the dataset, including the saved files. It is not possible to restore a descriptor set after a full removal.
- dropDescriptors(descriptors: list[str])[source]
Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.
- dropEmptyProperties(names: list[str])
Drop rows with empty target property value from the data set.
- dropInvalids()[source]
Drops invalid molecules from the data set.
- Returns:
Boolean mask of invalid molecules in the original data set.
- Return type:
mask (pd.Series)
- filter(table_filters: list[Callable])
Filter the data frame using a list of filters.
Each filter is a function that takes the data frame and returns a a new data frame with the filtered rows. The new data frame is then used as the input for the next filter. The final data frame is saved as the new data frame of the
MoleculeTable
.
- classmethod fromFile(filename: str) PandasDataTable
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
- static fromSDF(name, filename, smiles_prop, *args, **kwargs)[source]
Create a MoleculeTable instance from an SDF file.
- Parameters:
name (str) – Name of the data set.
filename (str) – Path to the SDF file.
smiles_prop (str) – Name of the property in the SDF file containing the SMILES sequence.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.
- static fromSMILES(name: str, smiles: list, *args, **kwargs)[source]
Create a MoleculeTable instance from a list of SMILES sequences.
- Parameters:
name (str) – Name of the data set.
smiles (list) – list of SMILES sequences.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.
- static fromTableFile(name: str, filename: str, sep='\t', *args, **kwargs)[source]
Create a MoleculeTable instance from a file containing a table of molecules (i.e. a CSV file).
- Parameters:
name (str) – Name of the data set.
filename (str) – Path to the file containing the table.
sep (str) – Separator used in the file for different columns.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.
- generateDescriptorDataSetName(ds_set: str | DescriptorSet)[source]
Generate a descriptor set name from a descriptor set.
- generateIndex(name: str | None = None, prefix: str | None = None)
Generate a custom index for the data frame automatically.
- getClusterNames(clusters: list['MoleculeClusters'] | None = None)[source]
Get the names of the clusters in the data frame.
- Returns:
List of cluster names.
- Return type:
list
- getClusters(clusters: list['MoleculeClusters'] | None = None)[source]
Get the subset of the data frame that contains only clusters.
- Returns:
Data frame containing only clusters.
- Return type:
pd.DataFrame
- getDF()
Get the data frame this instance manages.
- Returns:
The data frame this instance manages.
- Return type:
pd.DataFrame
- getDescriptorNames()[source]
Get the names of the descriptors present for molecules in this data set.
- Returns:
list of descriptor names.
- Return type:
list
- getDescriptors(active_only=False)[source]
Get the calculated descriptors as a pandas data frame.
- Returns:
Data frame containing only descriptors.
- Return type:
pd.DataFrame
- getProperties() list[str]
Get names of all properties/variables saved in the data frame (all columns).
- Returns:
list of property names.
- Return type:
list[str]
- getProperty(name: str) Series
Get property values from the data set.
- Parameters:
name (str) – Name of the property to get.
- Returns:
List of values for the property.
- Return type:
pd.Series
- getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10)[source]
Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.
- getScaffoldNames(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)[source]
Get the names of the scaffolds in the data frame.
- getScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)[source]
Get the subset of the data frame that contains only scaffolds.
- Parameters:
include_mols (bool) – Whether to include the RDKit scaffold columns as well.
- Returns:
Data frame containing only scaffolds.
- Return type:
pd.DataFrame
- getSubset(prefix: str)
Get a subset of the data set by providing a prefix for the column names or a column name directly.
- Parameters:
prefix (str) – Prefix of the column names to select.
- getSummary()[source]
Make a summary with some statistics about the molecules in this table. The summary contains the number of molecules per target and the number of unique molecules per target.
Requires this data set to be imported from Papyrus for now.
- Returns:
A dataframe with the summary statistics.
- Return type:
(pd.DataFrame)
- property hasClusters
Check whether the data frame contains clusters.
- Returns:
Whether the data frame contains clusters.
- Return type:
bool
- hasDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str] | None = None) bool | list[bool] [source]
Check whether the data frame contains given descriptors.
- Parameters:
descriptors (list) – List of DescriptorSet objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.
- Returns:
List of booleans indicating whether each descriptor is present or not.
- Return type:
bool | list[bool]
- hasProperty(name)
Check whether a property is present in the data frame.
- property hasScaffoldGroups
Check whether the data frame contains scaffold groups.
- Returns:
Whether the data frame contains scaffold groups.
- Return type:
bool
- property hasScaffolds
Check whether the data frame contains scaffolds.
- Returns:
Whether the data frame contains scaffolds.
- Return type:
bool
- imputeProperties(names: list[str], imputer: Callable)
Impute missing property values.
- Parameters:
names (list) – List of property names to impute.
imputer (Callable) – imputer object implementing the fit_transform method from the scikit-learn API.
- iterChunks(include_props: list[str] | None = None, as_dict: bool = False, chunk_size: int | None = None) Generator[DataFrame | dict, None, None]
Batch a data frame into chunks of the given size.
- Parameters:
include_props (list[str]) – Properties to include in the chunks. If None, all properties are included.
as_dict (bool) – If True, the chunks are yielded as dictionaries instead of data frames.
chunk_size (int) – Size of the chunks. If None, self.chunkSize is used.
- Returns:
Generator that yields batches of the data frame as smaller data frames.
- Return type:
Generator[pd.DataFrame, None, None]
- property metaFile
The path to the meta file of this data set.
- property nJobs
- processMols(processor: MolProcessor, proc_args: tuple[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, add_props: list[str] | None = None, as_rdkit: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator [source]
Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.
IMPORTANT: For successful parallel processing, the processor must be picklable. Also note that the returned generator will produce results as soon as they are ready, which means that the chunks of data will not be in the same order as the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules. See CheckSmilesValid for an example.
- Parameters:
processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (tuple, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
as_rdkit (bool, optional) – Whether to convert the molecules to RDKit molecules before applying the processor.
chunk_size (int, optional) – Size of chunks to use per job in parallel. If not specified, self.chunkSize is used.
n_jobs (int, optional) – Number of jobs to use for parallel processing. If not specified, self.nJobs is used.
- Returns:
A generator that yields the results of the supplied processor on the chunked molecules from the data set.
- Return type:
Generator
- reload()
Reload the data table from disk.
- removeProperty(name)
Remove a property from the data frame.
- Parameters:
name (str) – Name of the property to delete.
- restoreDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str])[source]
Restore descriptors that were previously removed.
- Parameters:
descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. The name of a descriptor set corresponds to the result returned by its __str__ method.
- classmethod runMolProcess(props: dict[str, list] | DataFrame, func: MolProcessor, add_rdkit: bool, smiles_col: str, *args, **kwargs)[source]
A helper method to run a MolProcessor on a list of molecules via apply. It converts the SMILES to RDKit molecules if required and then applies the function to the MolProcessor object.
- Parameters:
props (dict) – Dictionary of properties that will be passed in addition to the molecule structure.
func (MolProcessor) – MolProcessor object to use for processing.
add_rdkit (bool) – Whether to convert the SMILES to RDKit molecules before applying the function.
smiles_col (str) – Name of the property containing the SMILES sequences.
*args – Additional positional arguments to pass to the function.
**kwargs – Additional keyword arguments to pass to the function.
- sample(n: int, name: str | None = None, random_state: int | None = None) MoleculeTable [source]
Sample n molecules from the table.
- Parameters:
n (int) – Number of molecules to sample.
name (str) – Name of the new table.
random_state (int) – Random state to use for sampling.
- Returns:
A dataframe with the sampled molecules.
- Return type:
MoleculeTable
- save()
Save the data frame to disk and all associated files.
- Returns:
Path to the saved data frame.
- Return type:
str
- searchOnProperty(prop_name: str, values: list[str], name: str | None = None, exact=False) MoleculeTable [source]
Search in this table using a property name and a list of values. It is assumed that the property is searchable with string matching. Either an exact match or a partial match can be used. If ‘exact’ is False, the search is performed with partial matching, i.e. all molecules that contain any of the given values in the property will be returned. If ‘exact’ is True, only molecules that have the exact property value for any of the given values will be returned.
- Parameters:
prop_name (str) – Name of the property to search on.
values (list[str]) – List of values to search for. If any of the values is found in the property, the molecule will be considered a match.
name (str | None, optional) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
exact (bool, optional) – Whether to use exact matching, i.e. whether to search for exact strings or just substrings. Defaults to False.
- Returns:
A new table with the molecules from the old table with the given property values.
- Return type:
MoleculeTable
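The exact vs. partial matching behavior described above can be sketched with plain pandas (a conceptual stand-in, not the qsprpred implementation; search_on_property is a hypothetical helper):

```python
import pandas as pd

# Sketch: exact matching keeps rows whose value is in the query list;
# partial matching keeps rows containing any query value as a substring.
def search_on_property(df, prop_name, values, exact=False):
    col = df[prop_name].astype(str)
    if exact:
        mask = col.isin(values)
    else:
        mask = col.apply(lambda v: any(s in v for s in values))
    return df[mask]

df = pd.DataFrame({"Target": ["CDK2", "CDK2/CyclinE", "EGFR"]})
partial = search_on_property(df, "Target", ["CDK2"])            # first two rows
exact = search_on_property(df, "Target", ["CDK2"], exact=True)  # only "CDK2"
```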
- searchWithIndex(index: Index, name: str | None = None) MoleculeTable [source]
Search in this table using a pandas index. The return value is a new table with the molecules from the old table with the given indices.
- Parameters:
index (pd.Index) – Indices to search for in this table.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
- Returns:
A new table with the molecules from the old table with the given indices.
- Return type:
MoleculeTable
- searchWithSMARTS(patterns: list[str], operator: ~typing.Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, match_function: ~typing.Callable = <function match_mol_to_smarts>) MoleculeTable [source]
Search the molecules in the table with a SMARTS pattern.
- Parameters:
patterns – List of SMARTS patterns to search with.
operator (object) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.
use_chirality – Whether to use chirality in the search.
name – Name of the new table. Defaults to the name of the old table, plus the smarts_searched suffix.
match_function – Function to use for matching the molecules to the SMARTS patterns. Defaults to match_mol_to_smarts.
- Returns:
A dataframe with the molecules that match the pattern.
- Return type:
(MolTable)
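How the ‘or’/‘and’ operator combines per-pattern results can be sketched in pure Python. The substring check below is a deliberately naive stand-in for a real SMARTS matcher such as match_mol_to_smarts:

```python
# Sketch: combine per-pattern match results with any() ("or") or all() ("and").
def match_any_or_all(smiles: str, patterns: list[str], operator: str = "or") -> bool:
    hits = (pattern in smiles for pattern in patterns)  # naive stand-in match
    if operator == "or":
        return any(hits)
    if operator == "and":
        return all(hits)
    raise ValueError(f"Unknown operator: {operator}")

mols = ["CCO", "c1ccccc1O", "c1ccccc1"]
or_hits = [m for m in mols if match_any_or_all(m, ["c1ccccc1", "O"], "or")]
and_hits = [m for m in mols if match_any_or_all(m, ["c1ccccc1", "O"], "and")]
# or_hits == ["CCO", "c1ccccc1O", "c1ccccc1"]; and_hits == ["c1ccccc1O"]
```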
- setIndex(cols: list[str])
Create an index column from several columns of the data set. This also resets the idProp attribute to the name of the index columns joined by a ‘~’ character. The values of the columns are joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined into a string.
- setRandomState(random_state: int)
Set the random state for this instance.
- Parameters:
random_state (int) – Random state to use for shuffling and other random operations.
- property smiles: Generator[str, None, None]
Get the SMILES strings of the molecules in the data frame.
- Returns:
Generator of SMILES strings.
- Return type:
Generator[str, None, None]
- standardizeSmiles(smiles_standardizer, drop_invalid=True)[source]
Apply the smiles_standardizer to the compounds in parallel.
- Parameters:
smiles_standardizer – either None to skip the standardization, 'chembl', 'old', or a partial function that reads and standardizes smiles.
drop_invalid (bool) – whether to drop invalid SMILES from the data set. Defaults to True. If False, invalid SMILES will be retained in their original form. If self.invalidsRemoved is True, there will be no effect even if drop_invalid is True. Set self.invalidsRemoved to False on this instance to force the removal of invalid SMILES.
- Raises:
ValueError – when smiles_standardizer is not a callable or one of the predefined strings.
- property storeDir
The data set folder containing the data set files after saving.
- property storePath
The path to the main data set file.
- property storePrefix
The prefix of the data set files.
- toFile(filename: str)[source]
Save the metafile and all associated files to a custom location.
- Parameters:
filename (str) – absolute path to the saved metafile.
qsprpred.data.tables.pandas module
- class qsprpred.data.tables.pandas.PandasDataTable(name: str, df: DataFrame | None = None, store_dir: str = '.', overwrite: bool = False, index_cols: list[str] | None = None, n_jobs: int = 1, chunk_size: int | None = None, autoindex_name: str = 'QSPRID', random_state: int | None = None, store_format: str = 'pkl', parallel_generator: ParallelGenerator | None = None)[source]
Bases:
DataTable
, JSONSerializable
A Pandas DataFrame wrapper class to enable data processing functions on QSPRpred data.
- Variables:
name (str) – Name of the data set. You can use this name to load the dataset from disk anytime and create a new instance.
df (pd.DataFrame) – Pandas dataframe containing the data. You can modify this one directly, but note that removing rows, adding rows, or changing the index or other automatic properties of the data frame might break the data set. In that case, it is recommended to recreate the data set from scratch.
indexCols (List) – List of columns to use as index. If None, the index will be a custom generated ID. Note that if you specify multiple columns their values will be joined with a '~' character rather than using the default pandas multi-index.
nJobs (int) – Number of jobs to use for parallel processing. If set to None or 0, all available cores will be used.
chunkSize (int) – Size of chunks to use per job in parallel processing. This is automatically set to the number of rows in the data frame divided by nJobs. However, you can also set it manually if you want to use a different chunk size. Set to None to again use the default value determined by nJobs.
randomState (int) – Random state to use for all random operations.
idProp (str) – Column name to use for automatically generated IDs. Defaults to 'QSPRID'. If indexCols is set, this will be the names of the columns joined by '~'.
storeFormat (str) – Format to use for storing the data frame. Currently only 'pkl' and 'csv' are supported. Defaults to 'pkl' because it is faster. However, 'csv' is more portable and can be opened in other programs.
parallelGenerator (Callable) – A ParallelGenerator to use for parallel processing of chunks of data. Defaults to qsprpred.utils.parallel.MultiprocessingPoolGenerator. You can replace this with your own parallel generator function if you want to use a different parallelization strategy (e.g. utilize remote servers instead of local processes).
Initialize a PandasDataTable object.
- Parameters:
name (str) – Name of the data set. You can use this name to load the dataset from disk anytime and create a new instance.
df (pd.DataFrame) – Pandas dataframe containing the data. If you provide a dataframe for a dataset that already exists on disk, the dataframe from disk will override the supplied data frame. Set 'overwrite' to True to override the data frame on disk.
store_dir (str) – Directory to store the dataset files. Defaults to the current directory. If it already contains files with the same name, the existing data will be loaded.
overwrite (bool) – Overwrite existing dataset.
index_cols (List) – List of columns to use as index. If None, the index will be a custom generated ID.
n_jobs (int) – Number of jobs to use for parallel processing. If <= 0, all available cores will be used.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to the number of rows in the data frame divided by n_jobs.
autoindex_name (str) – Column name to use for automatically generated IDs.
random_state (int) – Random state to use for all random operations for reproducibility. If not specified, the state is generated randomly. The state is saved upon save, so if you want to change the state later, call the setRandomState method after loading.
store_format (str) – Format to use for storing the data frame. Currently only 'pkl' and 'csv' are supported.
parallel_generator (ParallelGenerator | None) – A ParallelGenerator to use for parallel processing of chunks of data. Defaults to qsprpred.utils.parallel.MultiprocessingPoolGenerator. You can replace this with your own parallel generator function if you want to use a different parallelization strategy (e.g. utilize remote servers instead of local processes).
- apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: list[str] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator [source]
Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.
Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.
If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.
- Parameters:
func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – List of properties to send to the function as arguments.
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.
- Returns:
Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.
- Return type:
Generator
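With as_df=False, each chunk handed to the function is a plain dict of property lists. A minimal stand-in chunk and consumer (the column names here are hypothetical, chosen only for illustration):

```python
# A toy "function on properties": compute the string length of each SMILES.
def smiles_lengths(props: dict) -> list[int]:
    return [len(s) for s in props["SMILES"]]

# Shape of one chunk as described above: {'prop1': [...], 'prop2': [...], ...}
chunk = {"SMILES": ["CCO", "c1ccccc1"], "pchembl_value": [5.2, 6.1]}
result = smiles_lengths(chunk)  # [3, 8]
```

A function like this could then be passed as `func` with `on_props=["SMILES"]`, and the generator returned by apply would yield one such result per chunk.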
- dropEmptyProperties(names: list[str])[source]
Drop rows with empty target property value from the data set.
- filter(table_filters: list[Callable])[source]
Filter the data frame using a list of filters.
Each filter is a function that takes the data frame and returns a new data frame with the filtered rows. The new data frame is then used as the input for the next filter. The final data frame is saved as the new data frame of the
MoleculeTable
.
- classmethod fromFile(filename: str) PandasDataTable [source]
Load a
StoredTable
object from a file.- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The
StoredTable
object itself.
- generateIndex(name: str | None = None, prefix: str | None = None)[source]
Generate a custom index for the data frame automatically.
- getDF()[source]
Get the data frame this instance manages.
- Returns:
The data frame this instance manages.
- Return type:
pd.DataFrame
- getProperties() list[str] [source]
Get names of all properties/variables saved in the data frame (all columns).
- Returns:
list of property names.
- Return type:
list[str]
- getProperty(name: str) Series [source]
Get property values from the data set.
- Parameters:
name (str) – Name of the property to get.
- Returns:
List of values for the property.
- Return type:
pd.Series
- getSubset(prefix: str)[source]
Get a subset of the data set by providing a prefix for the column names or a column name directly.
- Parameters:
prefix (str) – Prefix of the column names to select.
- imputeProperties(names: list[str], imputer: Callable)[source]
Impute missing property values.
- Parameters:
names (list) – List of property names to impute.
imputer (Callable) –
- imputer object implementing the
fit_transform
method from scikit-learn API.
- imputer object implementing the
- iterChunks(include_props: list[str] | None = None, as_dict: bool = False, chunk_size: int | None = None) Generator[DataFrame | dict, None, None] [source]
Batch a data frame into chunks of the given size.
- Parameters:
- Returns:
Generator that yields batches of the data frame as smaller data frames.
- Return type:
Generator[pd.DataFrame, None, None]
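The batching behaviour described above can be sketched with plain pandas (an illustrative stand-in, not the qsprpred implementation):

```python
import pandas as pd

# Yield successive row slices of at most `chunk_size` rows.
def iter_chunks(df: pd.DataFrame, chunk_size: int):
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"QSPRID": [f"mol_{i}" for i in range(5)]})
sizes = [len(chunk) for chunk in iter_chunks(df, 2)]  # [2, 2, 1]
```

The last chunk may be smaller than `chunk_size` when the row count is not an exact multiple, as the sizes above show.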
- property metaFile
The path to the meta file of this data set.
- property nJobs
- removeProperty(name)[source]
Remove a property from the data frame.
- Parameters:
name (str) – Name of the property to delete.
- save()[source]
Save the data frame to disk and all associated files.
- Returns:
Path to the saved data frame.
- Return type:
str
- setIndex(cols: list[str])[source]
Create an index column from several columns of the data set. This also resets the
idProp
attribute to be the name of the index columns joined by a ‘~’ character. The values of the columns are also joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined to a string.
- setRandomState(random_state: int)[source]
Set the random state for this instance.
- Parameters:
random_state (int) – Random state to use for shuffling and other random operations.
- property storeDir
The data set folder containing the data set files after saving.
- property storePath
The path to the main data set file.
- property storePrefix
The prefix of the data set files.
- toFile(filename: str)[source]
Save the metafile and all associated files to a custom location.
- Parameters:
filename (str) – absolute path to the saved metafile.
qsprpred.data.tables.qspr module
- class qsprpred.data.tables.qspr.QSPRDataset(name: str, target_props: list[qsprpred.tasks.TargetProperty | dict] | None = None, df: DataFrame | None = None, smiles_col: str = 'SMILES', add_rdkit: bool = False, store_dir: str = '.', overwrite: bool = False, n_jobs: int | None = 1, chunk_size: int | None = None, drop_invalids: bool = True, drop_empty: bool = True, index_cols: list[str] | None = None, autoindex_name: str = 'QSPRID', random_state: int | None = None, store_format: str = 'pkl')[source]
Bases:
MoleculeTable
Prepare dataset for QSPR model training.
It splits the data into training and test sets and can create cross-validation folds. Optionally, low-quality data is filtered out. For classification, the dataset samples are labelled as active/inactive.
- Variables:
targetProperties (str) – property to be predicted with the QSPRModel
df (pd.dataframe) – dataset
X (np.ndarray/pd.DataFrame) – m x n feature matrix for cross validation, where m is the number of samples and n is the number of features.
y (np.ndarray/pd.DataFrame) – m-d label array for cross validation, where m is the number of samples and equals the number of rows of X.
X_ind (np.ndarray/pd.DataFrame) – m x n feature matrix for the independent set, where m is the number of samples and n is the number of features.
y_ind (np.ndarray/pd.DataFrame) – m x l label array for the independent set, where m is the number of samples and equals the number of rows of X_ind, and l is the number of target properties.
X_ind_outliers (np.ndarray/pd.DataFrame) – m x n feature matrix for outliers in the independent set, where m is the number of samples and n is the number of features.
y_ind_outliers (np.ndarray/pd.DataFrame) – m x l label array for outliers in the independent set, where m is the number of samples and equals the number of rows of X_ind_outliers, and l is the number of target properties.
featureStandardizer (SKLearnStandardizer) – feature standardizer
applicabilityDomain (ApplicabilityDomain) – applicability domain
- Construct a QSPRDataset, also applying transformations of the output property if specified.
- Parameters:
name (str) – data name, used in saving the data
target_props (list[TargetProperty | dict] | None) – target properties; names should correspond with target column names in df. If None, target properties will be inferred if this data set has been saved previously. Defaults to None.
df (pd.DataFrame, optional) – input dataframe containing smiles and target property. Defaults to None.
smiles_col (str, optional) – name of column in df containing SMILES. Defaults to “SMILES”.
add_rdkit (bool, optional) – if true, column with rdkit molecules will be added to df. Defaults to False.
store_dir (str, optional) – directory for saving the output data. Defaults to ‘.’.
overwrite (bool, optional) – whether already saved data at the output directory should be overwritten. Defaults to False.
n_jobs (int, optional) – number of parallel jobs. If <= 0, all available cores will be used. Defaults to 1.
chunk_size (int, optional) – chunk size for parallel processing. Defaults to None.
drop_invalids (bool, optional) – if true, invalid SMILES will be dropped. Defaults to True.
drop_empty (bool, optional) – if true, rows with empty target property will be removed.
index_cols (list[str], optional) – columns to be used as index in the dataframe. Defaults to None, in which case a custom ID will be generated.
autoindex_name (str) – Column name to use for automatically generated IDs.
random_state (int, optional) – random state for splitting the data.
store_format (str, optional) – format to use for storing the data (‘pkl’ or ‘csv’).
- Raises:
ValueError – Raised if a threshold is given with a non-classification task.
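Based on the signature above, a construction call might look as follows. The dict form of a target property ({"name": ..., "task": ...}) and the column names are assumptions for illustration, not verified against a particular qsprpred release:

```python
import pandas as pd

# Hypothetical input frame with a SMILES column and one target property.
df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"],
                   "pchembl_value": [5.2, 6.1]})
target_props = [{"name": "pchembl_value", "task": "REGRESSION"}]

# The actual construction (commented out; requires qsprpred installed):
# from qsprpred.data.tables.qspr import QSPRDataset
# dataset = QSPRDataset(name="demo", df=df, target_props=target_props,
#                       smiles_col="SMILES", store_dir=".", random_state=42)
```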
- addClusters(clusters: list['MoleculeClusters'], recalculate: bool = False)
Add clusters to the data frame.
A new column is created that contains the identifier of the corresponding cluster calculator.
- Parameters:
clusters (list) – list of
MoleculeClusters
calculators.recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.
- addDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet], recalculate: bool = False, featurize: bool = True, *args, **kwargs)[source]
Add descriptors to the data set.
If descriptors are already present, they will be recalculated if recalculate is True. Featurization will be performed after adding descriptors if featurize is True. Featurization converts current data matrices to pure numeric matrices of selected descriptors (features).
- Parameters:
descriptors (list[DescriptorSet]) – list of descriptor sets to add
recalculate (bool, optional) – whether to recalculate descriptors if they are already present. Defaults to False.
featurize (bool, optional) – whether to featurize the data set splits after adding descriptors. Defaults to True.
.*args – additional positional arguments to pass to each descriptor set
**kwargs – additional keyword arguments to pass to each descriptor set
- addFeatures(feature_calculators: list[qsprpred.data.descriptors.sets.DescriptorSet], recalculate: bool = False)[source]
Add features to the data set.
- Parameters:
feature_calculators (list[DescriptorSet]) – list of feature calculators to add. Defaults to None.
recalculate (bool) – if True, recalculate features even if they are already present in the data set. Defaults to False.
- addScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)
Add scaffolds to the data frame.
A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.
- apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: list[str] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator
Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.
Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.
If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.
- Parameters:
func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – List of properties to send to the function as arguments.
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.
- Returns:
Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.
- Return type:
Generator
- attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)
Attach descriptors to the data frame.
- Parameters:
calculator (DescriptorsCalculator) – DescriptorsCalculator object to use for descriptor calculation.
descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.
index_cols (list) – List of column names to use as index.
- checkMols(throw: bool = True)
Returns a boolean array indicating whether each molecule is valid or not. If throw is True, an exception is thrown if any molecule is invalid.
- Parameters:
throw (bool) – Whether to throw an exception if any molecule is invalid.
- Returns:
Boolean series indicating whether each molecule is valid.
- Return type:
mask (pd.Series)
- clearFiles()
Remove all files associated with this data set from disk.
- createScaffoldGroups(mols_per_group: int = 10)
Create scaffold groups.
A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.
- Parameters:
mols_per_group (int) – number of molecules per scaffold group.
- property descriptorSets
Get the descriptor calculators for this table.
- dropDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str], full_removal: bool = False)
Drop descriptors from the given sets from the data frame.
- Parameters:
descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. The name of a descriptor set corresponds to the result returned by its __str__ method.
full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptor set from the dataset, including the saved files. It is not possible to restore a descriptor set after a full removal.
- dropDescriptors(descriptors: list[str])[source]
Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.
- dropEmptyProperties(names: list[str])[source]
Drop rows with empty target property value from the data set.
- dropEmptySmiles()
Drop rows with empty SMILES from the data set.
- dropInvalids()[source]
Drops invalid molecules from the data set.
- Returns:
Boolean mask of invalid molecules in the original data set.
- Return type:
mask (pd.Series)
- featurizeSplits(shuffle: bool = True, random_state: int | None = None)[source]
If the data set has descriptors, load them into the train and test splits.
If no descriptors are available, remove all features from the splits. They will become zero length along the feature axis (columns), but will retain their original length along the sample axis (rows). This is useful for the case where the data set has no descriptors, but the user wants to retain train and test splits.
- Parameters:
shuffle (bool) – whether to shuffle the training and test sets.
random_state (int) – random state for shuffling.
- fillMissing(fill_value: float, columns: list[str] | None = None)[source]
Fill missing values in the data set with a given value.
- filter(table_filters: list[Callable])[source]
Filter the data set using the given filters.
- Parameters:
table_filters (list[Callable]) – list of filters to apply
- filterFeatures(feature_filters: list[Callable])[source]
Filter features in the data set.
- Parameters:
feature_filters (list[Callable]) – list of feature filter functions that take X feature matrix and y target vector as arguments
- classmethod fromFile(filename: str) PandasDataTable
Load a
StoredTable
object from a file.- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The
StoredTable
object itself.
- static fromMolTable(mol_table: MoleculeTable, target_props: list[qsprpred.tasks.TargetProperty | dict], name=None, **kwargs) QSPRDataset [source]
Create QSPRDataset from a MoleculeTable.
- Parameters:
mol_table (MoleculeTable) – MoleculeTable to use as the data source
target_props (list) – list of target properties to use
name (str, optional) – name of the data set. Defaults to None.
kwargs – additional keyword arguments to pass to the constructor
- Returns:
created data set
- Return type:
QSPRDataset
- static fromSDF(name: str, filename: str, smiles_prop: str, *args, **kwargs)[source]
Create QSPRDataset from SDF file.
It is currently not implemented for QSPRDataset, but you can convert from ‘MoleculeTable’ with the ‘fromMolTable’ method.
- static fromSMILES(name: str, smiles: list, *args, **kwargs)
Create a
MoleculeTable
instance from a list of SMILES sequences.- Parameters:
name (str) – Name of the data set.
smiles (list) – list of SMILES sequences.
*args – Additional arguments to pass to the
MoleculeTable
constructor.**kwargs – Additional keyword arguments to pass to the
MoleculeTable
constructor.
- static fromTableFile(name: str, filename: str, sep: str = '\t', *args, **kwargs)[source]
Create QSPRDataset from table file (i.e. CSV or TSV).
- Parameters:
- Returns:
QSPRDataset
object
- Return type:
QSPRDataset
- generateDescriptorDataSetName(ds_set: str | DescriptorSet)
Generate a descriptor set name from a descriptor set.
- generateIndex(name: str | None = None, prefix: str | None = None)
Generate a custom index for the data frame automatically.
- getClusterNames(clusters: list['MoleculeClusters'] | None = None)
Get the names of the clusters in the data frame.
- Returns:
List of cluster names.
- Return type:
- getClusters(clusters: list['MoleculeClusters'] | None = None)
Get the subset of the data frame that contains only clusters.
- Returns:
Data frame containing only clusters.
- Return type:
pd.DataFrame
- getDF()
Get the data frame this instance manages.
- Returns:
The data frame this instance manages.
- Return type:
pd.DataFrame
- getDescriptorNames()
Get the names of the descriptors present for molecules in this data set.
- Returns:
list of descriptor names.
- Return type:
- getDescriptors(active_only=False)
Get the calculated descriptors as a pandas data frame.
- Returns:
Data frame containing only descriptors.
- Return type:
pd.DataFrame
- getFeatures(inplace: bool = False, concat: bool = False, raw: bool = False, ordered: bool = False, refit_standardizer: bool = True)[source]
Get the current feature sets (training and test) from the dataset.
This method also applies any feature standardizers that have been set on the dataset during preparation. Outliers are dropped from the test set if they are present, unless concat is True.
- Parameters:
inplace (bool) – If True, the created feature matrices will be saved to the dataset object itself as 'X' and 'X_ind' attributes. Note that this will overwrite any existing feature matrices and if the data preparation workflow changes, these are not kept up to date. Therefore, it is recommended to generate new feature sets after any data set changes.
concat (bool) – If True, the training and test feature matrices will be concatenated into a single matrix. This is useful for training models that do not require separate training and test sets (i.e. the final optimized models).
raw (bool) – If True, the raw feature matrices will be returned without any standardization applied.
ordered (bool) – If True, the returned feature matrices will be ordered according to the original order of the data set. This is only relevant if concat is True.
refit_standardizer (bool) – If True, the feature standardizer will be refit on the training set upon this call. If False, the previously fitted standardizer will be used. Defaults to True. Use False if this dataset is used for prediction only and the standardizer has been initialized already.
- getProperties() list[str]
Get names of all properties/variables saved in the data frame (all columns).
- Returns:
list of property names.
- Return type:
- getProperty(name: str) Series
Get property values from the data set.
- Parameters:
name (str) – Name of the property to get.
- Returns:
List of values for the property.
- Return type:
pd.Series
- getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10)
Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.
- getScaffoldNames(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)
Get the names of the scaffolds in the data frame.
- getScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)
Get the subset of the data frame that contains only scaffolds.
- Parameters:
include_mols (bool) – Whether to include the RDKit scaffold columns as well.
- Returns:
Data frame containing only scaffolds.
- Return type:
pd.DataFrame
- getSubset(prefix: str)
Get a subset of the data set by providing a prefix for the column names or a column name directly.
- Parameters:
prefix (str) – Prefix of the column names to select.
- getSummary()
Make a summary with some statistics about the molecules in this table. The summary contains the number of molecules per target and the number of unique molecules per target.
Requires this data set to be imported from Papyrus for now.
- Returns:
A dataframe with the summary statistics.
- Return type:
(pd.DataFrame)
- getTargetProperties(names: list) list[qsprpred.tasks.TargetProperty] [source]
Get the target properties with the given names.
- Parameters:
- Returns:
list of target properties
- Return type:
- getTargetPropertiesValues(concat: bool = False, ordered: bool = False)[source]
Get the response values (training and test) for the set target property.
- Parameters:
- Returns:
tuple of (train_responses, test_responses) or pandas.DataFrame of all target property values
- property hasClusters
Check whether the data frame contains clusters.
- Returns:
Whether the data frame contains clusters.
- Return type:
- hasDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str] | None = None) bool | list[bool]
Check whether the data frame contains given descriptors.
- Parameters:
descriptors (list) – list of DescriptorSet objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.
- Returns:
list of booleans indicating whether each descriptor is present or not.
- Return type:
- property hasFeatures
Check whether the currently selected set of features is not empty.
- hasProperty(name)
Check whether a property is present in the data frame.
- property hasScaffoldGroups
Check whether the data frame contains scaffold groups.
- Returns:
Whether the data frame contains scaffold groups.
- Return type:
- property hasScaffolds
Check whether the data frame contains scaffolds.
- Returns:
Whether the data frame contains scaffolds.
- Return type:
- imputeProperties(names: list[str], imputer: Callable)[source]
Impute missing property values.
- Parameters:
names (list) – List of property names to impute.
imputer (Callable) –
- imputer object implementing the
fit_transform
method from scikit-learn API.
- imputer object implementing the
- property isMultiTask
Check if the dataset contains multiple target properties.
- iterChunks(include_props: list[str] | None = None, as_dict: bool = False, chunk_size: int | None = None) Generator[DataFrame | dict, None, None]
Batch a data frame into chunks of the given size.
- Parameters:
- Returns:
Generator that yields batches of the data frame as smaller data frames.
- Return type:
Generator[pd.DataFrame, None, None]
- iterFolds(split: DataSplit, concat: bool = False) Generator[tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame | pandas.core.series.Series, pandas.core.frame.DataFrame | pandas.core.series.Series, list[int], list[int]], None, None] [source]
Iterate over the folds of the dataset.
- loadDescriptorsToSplits(shuffle: bool = True, random_state: int | None = None)[source]
Load all available descriptors into the train and test splits.
If no descriptors are available, an exception will be raised.
- Parameters:
- Raises:
ValueError – if no descriptors are available
- makeClassification(target_property: str, th: list[float] | None = None)[source]
Switch to classification task using the given threshold values.
- makeRegression(target_property: str)[source]
Switch to regression task using the given target property.
- Parameters:
target_property (str) – name of the target property to use for regression
- property metaFile
The path to the meta file of this data set.
- property nJobs
- property nTargetProperties
Get the number of target properties in the dataset.
- prepareDataset(smiles_standardizer: str | ~typing.Callable | None = 'chembl', data_filters: list | None = (<qsprpred.data.processing.data_filters.RepeatsFilter object>, ), split=None, feature_calculators: list[qsprpred.data.descriptors.sets.DescriptorSet] | None = None, feature_filters: list | None = None, feature_standardizer: ~qsprpred.data.processing.feature_standardizers.SKLearnStandardizer | None = None, feature_fill_value: float = nan, applicability_domain: ~qsprpred.data.processing.applicability_domain.ApplicabilityDomain | ~mlchemad.base.ApplicabilityDomain | None = None, drop_outliers: bool = False, recalculate_features: bool = False, shuffle: bool = True, random_state: int | None = None)[source]
Prepare the dataset for use in QSPR model.
- Parameters:
smiles_standardizer (str | Callable) – either 'chembl', 'old', or a partial function that reads and standardizes smiles. If None, no standardization will be performed. Defaults to 'chembl'.
data_filters (list of datafilter obj) – filters number of rows from dataset
split (datasplitter obj) – splits the dataset into train and test set
feature_calculators (list[DescriptorSet]) – descriptor sets to add to the data set
feature_filters (list of feature filter objs) – filters features
feature_standardizer (SKLearnStandardizer or sklearn.base.BaseEstimator) – standardizes and/or scales features
feature_fill_value (float) – value to fill missing values with. Defaults to numpy.nan.
applicability_domain (applicabilityDomain obj) – attaches an applicability domain calculator to the dataset and fits it on the training set
drop_outliers (bool) – whether to drop samples that are outside the applicability domain from the test set, if one is attached.
recalculate_features (bool) – recalculate features even if they are already present in the file
shuffle (bool) – whether to shuffle the created training and test sets
random_state (int) – random state for shuffling
- processMols(processor: MolProcessor, proc_args: tuple[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, add_props: list[str] | None = None, as_rdkit: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator
Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.
IMPORTANT: For successful parallel processing, the processor must be picklable. Also note that the returned generator will produce results as soon as they are ready, which means that the chunks of data will not be in the same order as the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules. See CheckSmilesValid for an example.
- Parameters:
processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (list, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
as_rdkit (bool, optional) – Whether to convert the molecules to RDKit molecules before applying the processor.
chunk_size (int, optional) – Size of chunks to use per job in parallel. If not specified, self.chunkSize is used.
n_jobs (int, optional) – Number of jobs to use for parallel processing. If not specified, self.nJobs is used.
- Returns:
A generator that yields the results of the supplied processor on the chunked molecules from the data set.
- Return type:
Generator
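Because results stream back as chunks finish, they may arrive out of order; carrying an identifier through, as the docstring suggests with idProp, lets you reassemble them. A generic sketch of that pattern (not the qsprpred parallel backend; the data here is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_chunk(chunk):
    # toy "processor": return (id, result) pairs for each molecule-like item
    return [(mol_id, smiles.upper()) for mol_id, smiles in chunk]

data = [("m0", "cco"), ("m1", "ccn"), ("m2", "ccc"), ("m3", "cco")]
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(process_chunk, c) for c in chunks]
    for fut in as_completed(futures):  # completion order is not guaranteed
        for mol_id, value in fut.result():
            results[mol_id] = value  # the ID recovers the original mapping
```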
- reload()
Reload the data table from disk.
- removeProperty(name)
Remove a property from the data frame.
- Parameters:
name (str) – Name of the property to delete.
- reset()[source]
Reset the data set. Splits will be removed and all descriptors will be moved to the training data. Molecule standardization and molecule filtering are not affected.
- resetTargetProperty(prop: TargetProperty | str)[source]
Reset target property to its original value.
- Parameters:
prop (TargetProperty | str) – target property to reset
- restoreDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str])[source]
Restore descriptors that were previously removed.
- Parameters:
descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
- restoreTrainingData()[source]
Restore training data from the data frame.
If the data frame contains a column ‘Split_IsTrain’, the data will be split into training and independent sets. Otherwise, the independent set will be empty. If descriptors are available, the resulting training matrices will be featurized.
- classmethod runMolProcess(props: dict[str, list] | DataFrame, func: MolProcessor, add_rdkit: bool, smiles_col: str, *args, **kwargs)
A helper method to run a MolProcessor on a list of molecules via apply. It converts the SMILES to RDKit molecules if required and then applies the function to the MolProcessor object.
- Parameters:
props (dict) – Dictionary of properties that will be passed in addition to the molecule structure.
func (MolProcessor) – MolProcessor object to use for processing.
add_rdkit (bool) – Whether to convert the SMILES to RDKit molecules before applying the function.
smiles_col (str) – Name of the property containing the SMILES sequences.
*args – Additional positional arguments to pass to the function.
**kwargs – Additional keyword arguments to pass to the function.
- sample(n: int, name: str | None = None, random_state: int | None = None) MoleculeTable
Sample n molecules from the table.
- Parameters:
- Returns:
A dataframe with the sampled molecules.
- Return type:
- save(save_split: bool = True)[source]
Save the data set to file and serialize metadata.
- Parameters:
save_split (bool) – whether to save split data to the managed data frame.
- searchOnProperty(prop_name: str, values: list[str], name: str | None = None, exact=False) MoleculeTable
Search in this table using a property name and a list of values. It is assumed that the property is searchable with string matching. Either an exact match or a partial match can be used. If 'exact' is False, the search will be performed with partial matching, i.e. all molecules that contain any of the given values in the property will be returned. If 'exact' is True, only molecules that have the exact property value for any of the given values will be returned.
- Parameters:
prop_name (str) – Name of the property to search on.
values (list[str]) – List of values to search for. If any of the values is found in the property, the molecule will be considered a match.
name (str | None, optional) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
exact (bool, optional) – Whether to use exact matching, i.e. whether to search for exact strings or just substrings. Defaults to False.
- Returns:
A new table with the molecules from the old table with the given property values.
- Return type:
- searchWithIndex(index: Index, name: str | None = None) MoleculeTable [source]
Search in this table using a pandas index. The return value is a new table with the molecules from the old table with the given indices.
- Parameters:
index (pd.Index) – Indices to search for in this table.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
- Returns:
A new table with the molecules from the old table with the given indices.
- Return type:
- searchWithSMARTS(patterns: list[str], operator: ~typing.Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, match_function: ~typing.Callable = <function match_mol_to_smarts>) MoleculeTable
Search the molecules in the table with a SMARTS pattern.
- Parameters:
patterns – List of SMARTS patterns to search with.
operator (object) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.
use_chirality – Whether to use chirality in the search.
name – Name of the new table. Defaults to the name of the old table, plus the smarts_searched suffix.
match_function – Function to use for matching the molecules to the SMARTS patterns. Defaults to match_mol_to_smarts.
- Returns:
A dataframe with the molecules that match the pattern.
- Return type:
(MolTable)
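The "or"/"and" operator above combines the per-pattern match results into one flag per molecule. A sketch of that reduction over precomputed boolean matches (the real method matches RDKit molecules against the SMARTS patterns first):

```python
def combine_matches(matches_per_pattern: list[list[bool]],
                    operator: str) -> list[bool]:
    """Reduce a patterns x molecules boolean table to one flag per molecule:
    'or' keeps molecules matching any pattern, 'and' only those matching all."""
    reducer = any if operator == "or" else all
    return [reducer(col) for col in zip(*matches_per_pattern)]

# two patterns, three molecules
table = [[True, False, False],
         [True, True, False]]
or_hits = combine_matches(table, "or")
and_hits = combine_matches(table, "and")
```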
- setApplicabilityDomain(applicability_domain: ApplicabilityDomain | ApplicabilityDomain)[source]
Set the applicability domain calculator.
- Parameters:
applicability_domain (ApplicabilityDomain | MLChemADApplicabilityDomain) – applicability domain calculator instance
- setFeatureStandardizer(feature_standardizer)[source]
Set feature standardizer.
- Parameters:
feature_standardizer (SKLearnStandardizer | BaseEstimator) – feature standardizer
- setIndex(cols: list[str])
Create an index column from several columns of the data set. This also resets the idProp attribute to be the name of the index columns joined by a '~' character. The values of the columns are also joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined to a string.
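The joining rule described above can be sketched in plain Python (an illustration only; the column names here are hypothetical and this is not the actual implementation):

```python
def make_index(rows: list[dict], cols: list[str]) -> tuple[str, list[str]]:
    """Join the chosen column names into the new idProp name, and join each
    row's values the same way to form its index entry."""
    id_prop = "~".join(cols)
    index = ["~".join(str(row[c]) for c in cols) for row in rows]
    return id_prop, index

rows = [{"source": "chembl", "cid": 1}, {"source": "pubchem", "cid": 1}]
id_prop, index = make_index(rows, ["source", "cid"])
```

Note that neither row alone has a unique "cid", but the joined pairs are unique together, which is exactly the requirement stated above.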
- setRandomState(random_state: int)
Set the random state for this instance.
- Parameters:
random_state (int) – Random state to use for shuffling and other random operations.
- setTargetProperties(target_props: list[qsprpred.tasks.TargetProperty | dict], drop_empty: bool = True)[source]
Set list of target properties and apply transformations if specified.
- Parameters:
target_props (list[TargetProperty]) – list of target properties
drop_empty (bool, optional) – whether to drop rows with empty target property values. Defaults to True.
- setTargetProperty(prop: TargetProperty | dict, drop_empty: bool = True)[source]
Add a target property to the dataset.
- Parameters:
prop (TargetProperty) – name of the target property to add
drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.
- property smiles: Generator[str, None, None]
Get the SMILES strings of the molecules in the data frame.
- Returns:
Generator of SMILES strings.
- Return type:
Generator[str, None, None]
- split(split: DataSplit, featurize: bool = False)[source]
Split dataset into train and test set.
You can either split the data frame itself or you can set featurize to True if you want to use feature matrices instead of the raw data frame.
- standardizeSmiles(smiles_standardizer, drop_invalid=True)
Apply smiles_standardizer to the compounds in parallel.
- Parameters:
smiles_standardizer – either None to skip the standardization, chembl, old, or a partial function that reads and standardizes smiles.
drop_invalid (bool) – whether to drop invalid SMILES from the data set. Defaults to True. If False, invalid SMILES will be retained in their original form. If self.invalidsRemoved is True, there will be no effect even if drop_invalid is True. Set self.invalidsRemoved to False on this instance to force the removal of invalid SMILES.
- Raises:
ValueError – when smiles_standardizer is not a callable or one of the predefined strings.
- property storeDir
The data set folder containing the data set files after saving.
- property storePath
The path to the main data set file.
- property storePrefix
The prefix of the data set files.
- property targetPropertyNames
Get the names of the target properties.
- toFile(filename: str)
Save the metafile and all associated files to a custom location.
- Parameters:
filename (str) – absolute path to the saved metafile.
- toJSON() str
Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.
- Returns:
JSON string of the object
- Return type:
json (str)
- transformProperties(targets: list[str], transformer: Callable)[source]
Transform the target properties using the given transformer.
- Parameters:
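transformProperties applies the given transformer callable to the listed target columns. A minimal stand-in on plain lists, assuming a log10 transform (the column names here are hypothetical):

```python
import math

def transform_properties(data: dict[str, list[float]], targets: list[str],
                         transformer) -> None:
    """Apply the transformer to each listed target column in place;
    other columns are left untouched."""
    for prop in targets:
        data[prop] = [transformer(v) for v in data[prop]]

data = {"CL": [10.0, 100.0], "MW": [180.2, 250.3]}
transform_properties(data, ["CL"], math.log10)
```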
- unsetTargetProperty(name: str | TargetProperty)[source]
Unset the target property. It will not remove it from the data set, but will make it unavailable for training.
- Parameters:
name (str | TargetProperty) – name of the target property to drop or the property itself
qsprpred.data.tables.searchable module
- class qsprpred.data.tables.searchable.SearchableMolTable[source]
Bases:
MoleculeDataTable
- abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)
Add descriptors to the dataset.
- Parameters:
descriptors (list[DescriptorSet]) – The descriptors to add.
args – Additional positional arguments to be passed to each descriptor set.
kwargs – Additional keyword arguments to be passed to each descriptor set.
- abstract apply(func: callable, on_props: list[str] | None = None, func_args: list | None = None, func_kwargs: dict | None = None)
Apply a function on all or selected properties. The properties are supplied as the first positional argument to the function.
- abstract clearFiles()
Delete the files associated with the table.
- abstract filter(table_filters: list[Callable])
Filter the dataset.
- Parameters:
table_filters (List[Callable]) – The filters to apply.
- abstract static fromFile(filename: str) StoredTable
Load a StoredTable object from a file.
- Parameters:
filename (str) – The name of the file to load the object from.
- Returns:
The StoredTable object itself.
- abstract getDescriptorNames() list[str]
Get the names of the descriptors that are currently in the dataset.
- Returns:
a list of descriptor names
- abstract getDescriptors() DataFrame
Get the table of descriptors that are currently in the dataset.
- Returns:
a pd.DataFrame with the descriptors
- abstract getProperties()
Get the property names contained in the dataset.
- abstract getSubset(prefix: str)
Get a subset of the dataset.
- Parameters:
prefix (str) – The prefix of the subset.
- abstract hasDescriptors()
Indicates if the dataset has descriptors.
- abstract reload()
Reload the table from a file.
- abstract removeProperty(name: str)
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save()
Save the table to a file.
- abstract searchOnProperty(prop_name: str, values: list[str], name: str | None = None, exact=False) MoleculeDataTable [source]
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
name – Name of the new table.
exact – Whether to search for exact matches or not.
- Returns:
A data set with the molecules that match the search.
- Return type:
- abstract searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None) MoleculeDataTable [source]
Search the molecules within this MoleculeDataSet with SMARTS patterns.
- Parameters:
patterns – List of SMARTS patterns to search with.
operator (object) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.
use_chirality – Whether to use chirality in the search.
name – Name of the new table.
- Returns:
A dataframe with the molecules that match the pattern.
- Return type:
qsprpred.data.tables.tests module
- class qsprpred.data.tables.tests.TestApply(methodName='runTest')[source]
Bases:
DataSetsPathMixIn, QSPRTestCase
Tests the apply method of the data set.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
- self.assertEqual(Counter(list(first)),
Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message', 'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling
- Returns:
a QSPRDataset object
- Return type:
- createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
- createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a small dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
- createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)
Create a dataset for testing purposes from the given data frame.
- Parameters:
df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of
AssertionError
- classmethod getAllDescriptors()
Return a list of (ideally) all available descriptor sets. For now they need to be added manually to the list below.
TODO: would be nice to create the list automatically by implementing a descriptor set registry that would hold all installed descriptor sets.
- getBigDF()
Get a large data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- classmethod getDataPrepGrid()
Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.
- Returns:
a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
- Return type:
grid
- classmethod getDefaultCalculatorCombo()
Makes a list of default descriptor calculators that can be used in tests. It creates a calculator with only morgan fingerprints and rdkit descriptors, but also one with them both to test behaviour with multiple descriptor sets. Override this method if you want to test with other descriptor sets and calculator combinations.
- static getDefaultPrep()
Return a dictionary with default preparation settings.
- classmethod getPrepCombos()
Return a list of all possible preparation combinations as generated by getDataPrepGrid as well as their names. The generated list can be used to parameterize tests with the given named combinations.
- getSmallDF()
Get a small data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- testRegular = None
- testRegular_0(**kw)
- testRegular_1(**kw)
- testRegular_2(**kw)
- testRegular_3(**kw)
- validate_split(dataset)
Check if the split has the data it should have after splitting.
- class qsprpred.data.tables.tests.TestDataSetCreationAndSerialization(methodName='runTest')[source]
Bases:
DataSetsPathMixIn, QSPRTestCase
Simple tests for dataset creation and serialization under different conditions and error states.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
- Equivalent to: self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
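A minimal usage sketch (the warning message and regex are illustrative):

```python
import unittest
import warnings

class WarnDemo(unittest.TestCase):
    def test_deprecation_message(self):
        # Only warnings whose message also matches the regex count as a match.
        with self.assertWarnsRegex(DeprecationWarning, r"use .* instead"):
            warnings.warn("use new_api() instead", DeprecationWarning)

result = unittest.TestResult()
unittest.TestLoader().loadTestsFromTestCase(WarnDemo).run(result)
```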
- checkConsistency(ds: QSPRDataset)[source]
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a small dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)
Create a dataset for testing purposes from the given data frame.
- Parameters:
df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of AssertionError
- classmethod getAllDescriptors()
Return a list of (ideally) all available descriptor sets. For now they need to be added manually to the list below.
TODO: would be nice to create the list automatically by implementing a descriptor set registry that would hold all installed descriptor sets.
- getBigDF()
Get a large data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- classmethod getDataPrepGrid()
Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.
- Returns:
a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
- Return type:
grid
- classmethod getDefaultCalculatorCombo()
Makes a list of default descriptor calculators that can be used in tests. It creates a calculator with only morgan fingerprints and rdkit descriptors, but also one with them both to test behaviour with multiple descriptor sets. Override this method if you want to test with other descriptor sets and calculator combinations.
- static getDefaultPrep()
Return a dictionary with default preparation settings.
- classmethod getPrepCombos()
Return a list of all possible preparation combinations as generated by getDataPrepGrid, as well as their names. The generated list can be used to parameterize tests with the given named combinations.
- getSmallDF()
Get a small data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- testInvalidsDetection = None
- testInvalidsDetection_0(**kw)
- testInvalidsDetection_1(**kw)
- testTargetProperty()[source]
Test target property creation and serialization in the context of a dataset.
- validate_split(dataset)
Check if the split has the data it should have after splitting.
- class qsprpred.data.tables.tests.TestDataSetPreparation(methodName='runTest')[source]
Bases: DataSetsPathMixIn, DataPrepCheckMixIn, QSPRTestCase
Test as many possible combinations of data sets and their preparation settings. These can potentially run for a long time, so use the skip decorator if you want to skip all these tests to speed things up during development.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
- Equivalent to: self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
- checkDescriptors(dataset: QSPRDataset, target_props: list[dict | qsprpred.tasks.TargetProperty])
Check if information about descriptors is consistent in the data set. Checks if calculators are consistent with the descriptors contained in the data set. This is tested also before and after serialization.
- Parameters:
dataset (QSPRDataset) – The data set to check.
target_props (List of dicts or TargetProperty) – list of target properties
- Raises:
AssertionError – If the consistency check fails.
- checkFeatures(ds: QSPRDataset, expected_length: int)
Check if the feature names and the feature matrix of a data set is consistent with expected number of variables.
- Parameters:
ds (QSPRDataset) – The data set to check.
expected_length (int) – The expected number of features.
- Raises:
AssertionError – If the feature names or the feature matrix is not consistent
- checkPrep(dataset, feature_calculators, split, feature_standardizer, feature_filter, data_filter, applicability_domain, expected_target_props)
Check the consistency of the dataset after preparation.
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a small dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)
Create a dataset for testing purposes from the given data frame.
- Parameters:
df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of AssertionError
- classmethod getAllDescriptors()
Return a list of (ideally) all available descriptor sets. For now they need to be added manually to the list below.
TODO: would be nice to create the list automatically by implementing a descriptor set registry that would hold all installed descriptor sets.
- getBigDF()
Get a large data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- classmethod getDataPrepGrid()
Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.
- Returns:
a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
- Return type:
grid
- classmethod getDefaultCalculatorCombo()
Makes a list of default descriptor calculators that can be used in tests. It creates a calculator with only morgan fingerprints and rdkit descriptors, but also one with them both to test behaviour with multiple descriptor sets. Override this method if you want to test with other descriptor sets and calculator combinations.
- static getDefaultPrep()
Return a dictionary with default preparation settings.
- classmethod getPrepCombos()
Return a list of all possible preparation combinations as generated by getDataPrepGrid, as well as their names. The generated list can be used to parameterize tests with the given named combinations.
- getSmallDF()
Get a small data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- testPrepCombos = None
- testPrepCombos_00_MorganFP_None_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_None_None_None’, name=’MorganFP_None_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec5f70>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_01_MorganFP_None_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_None_None_TopKatApplicabilityDomain’, name=’MorganFP_None_None_None_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec5c40>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7ec5eb0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_02_MorganFP_None_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_None_RepeatsFilter_None’, name=’MorganFP_None_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec5f40>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ec5b80>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_03_MorganFP_None_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_None_Repeats…ilter_TopKatApplicabilityDomain’, name=’MorganFP_None_None_None_Repeats…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec6b10>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ec5df0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7ec6210>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_04_MorganFP_None_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_HighCorrelationFilter_None_None’, name=’MorganFP_None_None_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec5ee0>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ec6060>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_05_MorganFP_None_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_HighCorrelat…_None_TopKatApplicabilityDomain’, name=’MorganFP_None_None_HighCorrelat…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff9335070>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ec5fd0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7ec5250>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_06_MorganFP_None_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_HighCorrelationFilter_RepeatsFilter_None’, name=’MorganFP_None_None_HighCorrelationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec4920>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ec48c0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ec4560>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_07_MorganFP_None_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_None_HighCorrelat…ilter_TopKatApplicabilityDomain’, name=’MorganFP_None_None_HighCorrelat…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec4800>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ec47d0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ec4770>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7ec4710>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_08_MorganFP_None_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_None_None_None’, name=’MorganFP_None_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec4680>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_09_MorganFP_None_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_No…_None_TopKatApplicabilityDomain’, name=’MorganFP_None_StandardScaler_No…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ec45f0>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff83d5490>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_10_MorganFP_None_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_None_RepeatsFilter_None’, name=’MorganFP_None_StandardScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7bc20>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e7be90>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_11_MorganFP_None_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_No…ilter_TopKatApplicabilityDomain’, name=’MorganFP_None_StandardScaler_No…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7a540>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e7b740>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e7bb60>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_12_MorganFP_None_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_HighCorrelationFilter_None_None’, name=’MorganFP_None_StandardScaler_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7b7a0>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7e7a840>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_13_MorganFP_None_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_Hi…_None_TopKatApplicabilityDomain’, name=’MorganFP_None_StandardScaler_Hi…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7a3f0>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7e7bb30>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e7ae10>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_14_MorganFP_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_Hi…lationFilter_RepeatsFilter_None’, name=’MorganFP_None_StandardScaler_Hi…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7bad0>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7e7ba40>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e7b9e0>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_15_MorganFP_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_None_StandardScaler_Hi…ilter_TopKatApplicabilityDomain’, name=’MorganFP_None_StandardScaler_Hi…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7b950>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7e7b8c0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e7b800>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e7aab0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_16_MorganFP_RandomSplit_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_None_None_None’, name=’MorganFP_RandomSplit_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e795e0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e7a120>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_17_MorganFP_RandomSplit_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_None_None_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_None_None_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7a210>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e796a0>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e79d30>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_18_MorganFP_RandomSplit_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_None_RepeatsFilter_None’, name=’MorganFP_RandomSplit_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e79430>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e793d0>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e79880>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_19_MorganFP_RandomSplit_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_None_…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_None_None_…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7a750>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e7a810>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e7b020>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e7b0e0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_20_MorganFP_RandomSplit_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_HighCorrelationFilter_None_None’, name=’MorganFP_RandomSplit_None_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e7b170>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e7b1d0>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7e7a990>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_21_MorganFP_RandomSplit_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_HighC…_None_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_None_HighC…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7dde3f0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7dddfd0>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ddf080>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7dde870>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_22_MorganFP_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_HighC…lationFilter_RepeatsFilter_None’, name=’MorganFP_RandomSplit_None_HighC…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ddea20>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7ddffb0>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7dde4e0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ddff80>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_23_MorganFP_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_None_HighC…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_None_HighC…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ddd6d0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7ddd610>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7ddd640>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7ddcf20>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7ddcef0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_24_MorganFP_RandomSplit_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardScaler_None_None_None’, name=’MorganFP_RandomSplit_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7ddd5b0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7ddd580>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_25_MorganFP_RandomSplit_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…_None_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_StandardSc…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e32990>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e33ef0>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e30470>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_26_MorganFP_RandomSplit_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardScaler_None_RepeatsFilter_None’, name=’MorganFP_RandomSplit_StandardScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e30350>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e30380>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e30050>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_27_MorganFP_RandomSplit_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_StandardSc…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e30bf0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7e30c80>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7e305f0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7e31160>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_28_MorganFP_RandomSplit_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…HighCorrelationFilter_None_None’, name=’MorganFP_RandomSplit_StandardSc…HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7e30c20>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d083e0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08320>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_29_MorganFP_RandomSplit_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…_None_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_StandardSc…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7d08380>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d084d0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08560>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d085c0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_30_MorganFP_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…lationFilter_RepeatsFilter_None’, name=’MorganFP_RandomSplit_StandardSc…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7d08650>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d08680>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08740>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d087a0>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_31_MorganFP_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RandomSplit_StandardSc…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RandomSplit_StandardSc…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…anFP object at 0x7efff7d08830>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d08860>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08920>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d08980>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d089e0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_32_RDKitDescs_None_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_None_None_None’, name=’RDKitDescs_None_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08a70>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_33_RDKitDescs_None_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_None_None_TopKatApplicabilityDomain’, name=’RDKitDescs_None_None_None_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08a10>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d08aa0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_34_RDKitDescs_None_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_None_RepeatsFilter_None’, name=’RDKitDescs_None_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08b30>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d08b60>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_35_RDKitDescs_None_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_None_Repea…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_None_None_None_Repea…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08bf0>,), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d08c20>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d08c80>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_36_RDKitDescs_None_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_HighCorrelationFilter_None_None’, name=’RDKitDescs_None_None_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08d10>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08d40>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_37_RDKitDescs_None_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_HighCorrel…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_None_None_HighCorrel…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08dd0>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08e00>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d08e60>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_38_RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_None’, name=’RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d08ef0>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d08f20>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d08f80>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_39_RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_None_HighCorrel…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_None_None_HighCorrel…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09010>,), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d09040>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d090a0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d09100>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_40_RDKitDescs_None_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_None_None_None’, name=’RDKitDescs_None_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09190>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_41_RDKitDescs_None_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_None_StandardScaler_…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09250>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d092e0>].
- testPrepCombos_42_RDKitDescs_None_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_None_RepeatsFilter_None’, name=’RDKitDescs_None_StandardScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09370>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d09400>, applicability_domain=None].
- testPrepCombos_43_RDKitDescs_None_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_None_StandardScaler_…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09490>,), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d09520>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d09580>].
- testPrepCombos_44_RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_None’, name=’RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09610>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d096a0>, data_filter=None, applicability_domain=None].
- testPrepCombos_45_RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_None_StandardScaler_…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09730>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d097f0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d09850>].
- testPrepCombos_46_RDKitDescs_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_…lationFilter_RepeatsFilter_None’, name=’RDKitDescs_None_StandardScaler_…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d098e0>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d099a0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d09a00>, applicability_domain=None].
- testPrepCombos_47_RDKitDescs_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_None_StandardScaler_…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_None_StandardScaler_…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09a90>,), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d09b50>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d09bb0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d09c10>].
- testPrepCombos_48_RDKitDescs_RandomSplit_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_None_None_None’, name=’RDKitDescs_RandomSplit_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09ca0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d09cd0>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
- testPrepCombos_49_RDKitDescs_RandomSplit_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_Non…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_None_Non…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09d60>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d09d90>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d09df0>].
- testPrepCombos_50_RDKitDescs_RandomSplit_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_None_RepeatsFilter_None’, name=’RDKitDescs_RandomSplit_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09e80>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d09eb0>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d09f10>, applicability_domain=None].
- testPrepCombos_51_RDKitDescs_RandomSplit_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_Non…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_None_Non…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d09fa0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d09fd0>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0a030>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0a090>].
- testPrepCombos_52_RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_None’, name=’RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a120>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a150>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0a1b0>, data_filter=None, applicability_domain=None].
- testPrepCombos_53_RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_Hig…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_None_Hig…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a240>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a270>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0a2d0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0a330>].
- testPrepCombos_54_RDKitDescs_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_Hig…lationFilter_RepeatsFilter_None’, name=’RDKitDescs_RandomSplit_None_Hig…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a3c0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a3f0>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0a450>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0a4b0>, applicability_domain=None].
- testPrepCombos_55_RDKitDescs_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_None_Hig…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_None_Hig…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a540>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a570>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0a5d0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0a630>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0a690>].
- testPrepCombos_56_RDKitDescs_RandomSplit_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_StandardScaler_None_None_None’, name=’RDKitDescs_RandomSplit_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a720>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a750>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
- testPrepCombos_57_RDKitDescs_RandomSplit_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_Standard…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0a870>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0a8a0>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0a990>].
- testPrepCombos_58_RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_None’, name=’RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0aa20>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0aa50>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0ab40>, applicability_domain=None].
- testPrepCombos_59_RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_Standard…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0abd0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0ac00>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0acf0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0ad50>].
- testPrepCombos_60_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…HighCorrelationFilter_None_None’, name=’RDKitDescs_RandomSplit_Standard…HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0ade0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0ae10>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0af00>, data_filter=None, applicability_domain=None].
- testPrepCombos_61_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…_None_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_Standard…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0af90>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0afc0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0b0b0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0b110>].
- testPrepCombos_62_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…lationFilter_RepeatsFilter_None’, name=’RDKitDescs_RandomSplit_Standard…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0b1a0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0b1d0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0b2c0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0b320>, applicability_domain=None].
- testPrepCombos_63_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’RDKitDescs_RandomSplit_Standard…ilter_TopKatApplicabilityDomain’, name=’RDKitDescs_RandomSplit_Standard…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.sets…escs object at 0x7efff7d0b3b0>,), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d0b3e0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0b4d0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0b530>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0b590>].
- testPrepCombos_64_MorganFP_RDKitDescs_None_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_None_None_None’, name=’MorganFP_RDKitDescs_None_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0b680>), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
- testPrepCombos_65_MorganFP_RDKitDescs_None_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_N…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_None_N…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0b740>), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0b770>].
- testPrepCombos_66_MorganFP_RDKitDescs_None_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_None_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_None_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0b860>), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0b890>, applicability_domain=None].
- testPrepCombos_67_MorganFP_RDKitDescs_None_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_N…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_None_N…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0b980>), split=None, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0b9b0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0ba10>].
- testPrepCombos_68_MorganFP_RDKitDescs_None_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_HighCorrelationFilter_None_None’, name=’MorganFP_RDKitDescs_None_None_HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0bb00>), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0bb30>, data_filter=None, applicability_domain=None].
- testPrepCombos_69_MorganFP_RDKitDescs_None_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_H…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_None_H…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0bc50>), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0bc80>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d0bce0>].
- testPrepCombos_70_MorganFP_RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_H…lationFilter_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_None_None_H…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0bdd0>), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0be00>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0be60>, applicability_domain=None].
- testPrepCombos_71_MorganFP_RDKitDescs_None_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_None_H…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_None_H…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d0bf50>), split=None, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d0bf80>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d0bfe0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d64080>].
- testPrepCombos_72_MorganFP_RDKitDescs_None_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_StandardScaler_None_None_None’, name=’MorganFP_RDKitDescs_None_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64170>), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
- testPrepCombos_73_MorganFP_RDKitDescs_None_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_Standa…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d642c0>), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d64380>].
- testPrepCombos_74_MorganFP_RDKitDescs_None_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…dScaler_None_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_None_Standa…dScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64470>), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d64530>, applicability_domain=None].
- testPrepCombos_75_MorganFP_RDKitDescs_None_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_Standa…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64620>), split=None, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d646e0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d64740>].
- testPrepCombos_76_MorganFP_RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…HighCorrelationFilter_None_None’, name=’MorganFP_RDKitDescs_None_Standa…HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64830>), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d648f0>, data_filter=None, applicability_domain=None].
- testPrepCombos_77_MorganFP_RDKitDescs_None_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_Standa…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d649e0>), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d64aa0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d64b00>].
- testPrepCombos_78_MorganFP_RDKitDescs_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…lationFilter_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_None_Standa…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64bf0>), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d64ce0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d64d40>, applicability_domain=None].
- testPrepCombos_79_MorganFP_RDKitDescs_None_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_None_Standa…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_None_Standa…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d64e30>), split=None, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d64ef0>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d64f50>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d64fb0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_80_MorganFP_RDKitDescs_RandomSplit_None_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit_None_None_None_None’, name=’MorganFP_RDKitDescs_RandomSplit_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d650a0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d650d0>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_81_MorganFP_RDKitDescs_RandomSplit_None_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d651c0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d651f0>, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d65250>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_82_MorganFP_RDKitDescs_RandomSplit_None_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit_None_None_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_RandomSplit_None_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65340>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d65370>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d653d0>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_83_MorganFP_RDKitDescs_RandomSplit_None_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d654c0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d654f0>, feature_standardizer=None, feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d65550>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d655b0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_84_MorganFP_RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…HighCorrelationFilter_None_None’, name=’MorganFP_RDKitDescs_RandomSplit…HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d656a0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d656d0>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d65730>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_85_MorganFP_RDKitDescs_RandomSplit_None_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65820>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d65850>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d658b0>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d65910>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_86_MorganFP_RDKitDescs_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…lationFilter_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_RandomSplit…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65a00>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d65a30>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d65a90>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d65af0>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_87_MorganFP_RDKitDescs_RandomSplit_None_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65be0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d65c10>, feature_standardizer=None, feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d65c70>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d65cd0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d65d30>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_88_MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_None_None’, name=’MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65e20>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d65e50>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_89_MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d65fd0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66000>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d660f0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_90_MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…dScaler_None_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_RandomSplit…dScaler_None_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d661e0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66210>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d66300>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_91_MorganFP_RDKitDescs_RandomSplit_StandardScaler_None_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d663f0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66420>, feature_standardizer=StandardScaler(), feature_filter=None, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d66510>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d66570>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_92_MorganFP_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_None_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…HighCorrelationFilter_None_None’, name=’MorganFP_RDKitDescs_RandomSplit…HighCorrelationFilter_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d66660>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66690>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d66780>, data_filter=None, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_93_MorganFP_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_None_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…_None_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d66870>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d668a0>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d66990>, data_filter=None, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d669f0>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_94_MorganFP_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_None(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…lationFilter_RepeatsFilter_None’, name=’MorganFP_RDKitDescs_RandomSplit…lationFilter_RepeatsFilter_None’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d66ae0>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66b10>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d66c00>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d66c60>, applicability_domain=None].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- testPrepCombos_95_MorganFP_RDKitDescs_RandomSplit_StandardScaler_HighCorrelationFilter_RepeatsFilter_TopKatApplicabilityDomain(**kw)
Tests one combination of a data set and its preparation settings [with _=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, name=’MorganFP_RDKitDescs_RandomSplit…ilter_TopKatApplicabilityDomain’, feature_calculators=(<qsprpred.data.descriptors.fing…Descs object at 0x7efff7d66d50>), split=<qsprpred.data.sampling.splits.R…mSplit object at 0x7efff7d66d80>, feature_standardizer=StandardScaler(), feature_filter=<qsprpred.data.processing.featur…Filter object at 0x7efff7d66e70>, data_filter=<qsprpred.data.processing.data_f…Filter object at 0x7efff7d66ed0>, applicability_domain=<mlchemad.applicability_domains….Domain object at 0x7efff7d66f30>].
This generates a large number of parameterized tests. Use the skip decorator if you want to skip all these tests. Note that the combinations are not exhaustive, but defined by DataSetsPathMixIn.getPrepCombos().
- validate_split(dataset)
Check if the split has the data it should have after splitting.
- class qsprpred.data.tables.tests.TestSearchFeatures(methodName='runTest')[source]
Bases:
DataSetsPathMixIn, QSPRTestCase
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
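A minimal sketch of registering a type-specific comparator (the Point class and assertPointEqual helper are invented for illustration): assertEqual dispatches to the registered function whenever both operands are of the registered type.

```python
import unittest

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointTest(unittest.TestCase):
    def setUp(self):
        # assertEqual will call this comparator whenever both
        # operands are Point instances.
        self.addTypeEqualityFunc(Point, self.assertPointEqual)

    def assertPointEqual(self, a, b, msg=None):
        if (a.x, a.y) != (b.x, b.y):
            raise self.failureException(
                msg or "Points differ: (%s, %s) != (%s, %s)"
                % (a.x, a.y, b.x, b.y))

    def test_equal_points(self):
        # Passes via the registered comparator, even though Point
        # defines no __eq__ of its own.
        self.assertEqual(Point(1, 2), Point(1, 2))
```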
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
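A minimal, self-contained sketch of assertRaisesRegex as a context manager, using a standard-library call that is known to raise ValueError:

```python
import unittest

class RaisesRegexDemo(unittest.TestCase):
    def test_type_and_message(self):
        # Passes only if a ValueError is raised AND its message
        # matches the given regex (via re.search).
        with self.assertRaisesRegex(ValueError, r"invalid literal"):
            int("not a number")
```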
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
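A minimal sketch of assertWarnsRegex (the legacy_api function here is a hypothetical deprecated function, invented only for this illustration):

```python
import unittest
import warnings

def legacy_api():
    # Hypothetical deprecated function used only for this example.
    warnings.warn("legacy_api is deprecated; use new_api instead",
                  DeprecationWarning)

class WarnsRegexDemo(unittest.TestCase):
    def test_deprecation_message(self):
        # Only warnings whose message also matches the regex are
        # considered successful matches.
        with self.assertWarnsRegex(DeprecationWarning, r"use new_api"):
            legacy_api()
```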
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a small dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)
Create a dataset for testing purposes from the given data frame.
- Parameters:
df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of AssertionError
- classmethod getAllDescriptors()
Return a list of (ideally) all available descriptor sets. For now they need to be added manually to the list below.
TODO: would be nice to create the list automatically by implementing a descriptor set registry that would hold all installed descriptor sets.
- getBigDF()
Get a large data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- classmethod getDataPrepGrid()
Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.
- Returns:
a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
- Return type:
grid
- classmethod getDefaultCalculatorCombo()
Makes a list of default descriptor calculators that can be used in tests. It creates a calculator with only morgan fingerprints and rdkit descriptors, but also one with them both to test behaviour with multiple descriptor sets. Override this method if you want to test with other descriptor sets and calculator combinations.
- static getDefaultPrep()
Return a dictionary with default preparation settings.
- classmethod getPrepCombos()
Return a list of all possible preparation combinations as generated by getDataPrepGrid as well as their names. The generated list can be used to parameterize tests with the given named combinations.
- getSmallDF()
Get a small data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
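A minimal sketch of subTest in a loop, the usual pattern for checking many parameter values inside one test method:

```python
import unittest

class SubTestDemo(unittest.TestCase):
    def test_parity(self):
        # A failing subtest is reported with its parameters (here n=...)
        # and the loop continues instead of aborting at the first failure.
        for n in (0, 2, 4, 6):
            with self.subTest(n=n):
                self.assertEqual(n % 2, 0)
```

This is closely related to how the parameterized testPrepCombos tests above work: one named case per combination, each reported independently.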
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- validateSearch(dataset: QSPRDataset, result: QSPRDataset, name: str)[source]
Validate the results of a search.
- validate_split(dataset)
Check if the split has the data it should have after splitting.
- class qsprpred.data.tables.tests.TestTargetImputation(methodName='runTest')[source]
Bases:
PathMixIn, QSPRTestCase
Small tests to only check if the target imputation works on its own.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
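A brief sketch of the `places` and `delta` comparison modes (values chosen purely for illustration):

```python
import unittest


class AlmostEqualDemo(unittest.TestCase):
    def test_places(self):
        # round((0.1 + 0.2) - 0.3, 7) == 0, so this passes despite float error.
        self.assertAlmostEqual(0.1 + 0.2, 0.3)            # default: 7 decimal places
        self.assertAlmostEqual(3.14159, 3.1416, places=3)

    def test_delta(self):
        # delta compares |first - second| directly; it cannot be combined with places.
        self.assertAlmostEqual(100.0, 100.4, delta=0.5)
```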
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
Equivalent to: self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
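A small sketch of both calling conventions (the tested calls are standard-library examples chosen for illustration):

```python
import unittest


class RaisesRegexDemo(unittest.TestCase):
    def test_inline_form(self):
        # Callable form: the function and its arguments follow the regex.
        self.assertRaisesRegex(ValueError, r"invalid literal", int, "not a number")

    def test_context_manager_form(self):
        # The regex is matched against str(exception) with re.search.
        with self.assertRaisesRegex(KeyError, r"missing"):
            {}["missing"]
```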
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
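A short sketch of filtering warnings by message (the `legacy_api` function is made up for illustration):

```python
import unittest
import warnings


def legacy_api():
    """Hypothetical deprecated function that emits a warning."""
    warnings.warn("legacy_api is deprecated, use new_api", DeprecationWarning)


class WarnsRegexDemo(unittest.TestCase):
    def test_deprecation_message(self):
        # Passes only if a DeprecationWarning whose message matches the
        # regex is triggered inside the with-block.
        with self.assertWarnsRegex(DeprecationWarning, r"use new_api"):
            legacy_api()
```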
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of AssertionError
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
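A minimal sketch of parameterizing one test with subTest (the loop values are arbitrary illustrations):

```python
import unittest


class SubTestDemo(unittest.TestCase):
    def test_even_numbers(self):
        # A failure in any subtest is reported with its n=... parameters,
        # and the loop continues instead of aborting the whole test.
        for n in (0, 2, 4, 6):
            with self.subTest(n=n):
                self.assertEqual(n % 2, 0)
```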
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- class qsprpred.data.tables.tests.TestTargetProperty(methodName='runTest')[source]
Bases: QSPRTestCase
Test the TargetProperty class.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
Equivalent to: self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
- countTestCases()
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of AssertionError
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- setUp()
Hook method for setting up the test fixture before exercising it.
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
- tearDown()
Hook method for deconstructing the test fixture after testing it.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- testSerialization = None
- testSerialization_0(**kw)
- testSerialization_1(**kw)
- class qsprpred.data.tables.tests.TestTargetTransformation(methodName='runTest')[source]
Bases: DataSetsPathMixIn, QSPRTestCase
Tests the transformation of target properties.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod addClassCleanup(function, /, *args, **kwargs)
Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).
- addCleanup(function, /, *args, **kwargs)
Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.
Cleanup items are called even if setUp fails (unlike tearDown).
- addTypeEqualityFunc(typeobj, function)
Add a type specific assertEqual style function to compare a type.
This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.
- Parameters:
typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.
- assertAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
If the two objects compare equal then they will automatically compare almost equal.
- assertCountEqual(first, second, msg=None)
Asserts that two iterables have the same elements, the same number of times, without regard to order.
Equivalent to: self.assertEqual(Counter(list(first)), Counter(list(second)))
- Example:
[0, 1, 1] and [1, 0, 1] compare equal.
[0, 0, 1] and [0, 1] compare unequal.
- assertDictEqual(d1, d2, msg=None)
- assertEqual(first, second, msg=None)
Fail if the two objects are unequal as determined by the ‘==’ operator.
- assertFalse(expr, msg=None)
Check that the expression is false.
- assertGreater(a, b, msg=None)
Just like self.assertTrue(a > b), but with a nicer default message.
- assertGreaterEqual(a, b, msg=None)
Just like self.assertTrue(a >= b), but with a nicer default message.
- assertIn(member, container, msg=None)
Just like self.assertTrue(a in b), but with a nicer default message.
- assertIs(expr1, expr2, msg=None)
Just like self.assertTrue(a is b), but with a nicer default message.
- assertIsInstance(obj, cls, msg=None)
Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.
- assertIsNone(obj, msg=None)
Same as self.assertTrue(obj is None), with a nicer default message.
- assertIsNot(expr1, expr2, msg=None)
Just like self.assertTrue(a is not b), but with a nicer default message.
- assertIsNotNone(obj, msg=None)
Included for symmetry with assertIsNone.
- assertLess(a, b, msg=None)
Just like self.assertTrue(a < b), but with a nicer default message.
- assertLessEqual(a, b, msg=None)
Just like self.assertTrue(a <= b), but with a nicer default message.
- assertListEqual(list1, list2, msg=None)
A list-specific equality assertion.
- Parameters:
list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertLogs(logger=None, level=None)
Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.
This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.
Example:
with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])
- assertMultiLineEqual(first, second, msg=None)
Assert that two multi-line strings are equal.
- assertNoLogs(logger=None, level=None)
Fail unless no log messages of level level or higher are emitted on logger_name or its children.
This method must be used as a context manager.
- assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)
Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.
Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).
Objects that are equal automatically fail.
- assertNotEqual(first, second, msg=None)
Fail if the two objects are equal as determined by the ‘!=’ operator.
- assertNotIn(member, container, msg=None)
Just like self.assertTrue(a not in b), but with a nicer default message.
- assertNotIsInstance(obj, cls, msg=None)
Included for symmetry with assertIsInstance.
- assertNotRegex(text, unexpected_regex, msg=None)
Fail the test if the text matches the regular expression.
- assertRaises(expected_exception, *args, **kwargs)
Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertRaises(SomeException):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.
The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:
with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
- assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)
Asserts that the message in a raised exception matches a regex.
- Parameters:
expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.
- assertRegex(text, expected_regex, msg=None)
Fail the test unless the text matches the regular expression.
- assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)
An equality assertion for ordered sequences (like lists and tuples).
For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.
- Parameters:
seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.
- assertSetEqual(set1, set2, msg=None)
A set-specific equality assertion.
- Parameters:
set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.
assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).
- assertTrue(expr, msg=None)
Check that the expression is true.
- assertTupleEqual(tuple1, tuple2, msg=None)
A tuple-specific equality assertion.
- Parameters:
tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.
- assertWarns(expected_warning, *args, **kwargs)
Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.
If called with the callable and arguments omitted, will return a context object used like this:
with self.assertWarns(SomeWarning):
    do_something()
An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.
The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:
with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)
- assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)
Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.
- Parameters:
expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.
- clearGenerated()
Remove the directories that are used for testing.
- countTestCases()
- createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)
Create a large dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
n_jobs (int) – number of parallel jobs to use
chunk_size (int) – size of chunks to process per job
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)
Create a small dataset for testing purposes.
- Parameters:
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)
Create a dataset for testing purposes from the given data frame.
- Parameters:
df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings
n_jobs (int) – number of parallel jobs to use
chunk_size (int) – size of chunks to process per job
- Returns:
a QSPRDataset object
- Return type:
QSPRDataset
- debug()
Run the test without collecting errors in a TestResult
- defaultTestResult()
- classmethod doClassCleanups()
Execute all class cleanup functions. Normally called for you after tearDownClass.
- doCleanups()
Execute all cleanup functions. Normally called for you after tearDown.
- classmethod enterClassContext(cm)
Same as enterContext, but class-wide.
- enterContext(cm)
Enters the supplied context manager.
If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.
- fail(msg=None)
Fail immediately, with the given message.
- failureException
alias of
AssertionError
- classmethod getAllDescriptors()
Return a list of (ideally) all available descriptor sets. For now they need to be added manually to the list below.
TODO: would be nice to create the list automatically by implementing a descriptor set registry that would hold all installed descriptor sets.
- getBigDF()
Get a large data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- classmethod getDataPrepGrid()
Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.
- Returns:
a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
- Return type:
grid
- classmethod getDefaultCalculatorCombo()
Make a list of default descriptor calculators that can be used in tests. It creates a calculator with only Morgan fingerprints and one with RDKit descriptors, but also one with both, to test behaviour with multiple descriptor sets. Override this method if you want to test with other descriptor sets and calculator combinations.
- static getDefaultPrep()
Return a dictionary with default preparation settings.
- classmethod getPrepCombos()
Return a list of all possible preparation combinations, as generated by getDataPrepGrid, as well as their names. The generated list can be used to parameterize tests with the given named combinations.
- getSmallDF()
Get a small data frame for testing purposes.
- Returns:
a pandas.DataFrame containing the dataset
- Return type:
pd.DataFrame
- id()
- longMessage = True
- maxDiff = 640
- run(result=None)
- classmethod setUpClass()
Hook method for setting up class fixture before running tests in the class.
- setUpPaths()
Create the directories that are used for testing.
- shortDescription()
Returns a one-line description of the test, or None if no description has been provided.
The default implementation of this method returns the first line of the specified test method’s docstring.
- skipTest(reason)
Skip this test.
- subTest(msg=<object object>, **params)
Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.
- tearDown()
Remove all files and directories that are used for testing.
- classmethod tearDownClass()
Hook method for deconstructing the class fixture after running all tests in the class.
- validate_split(dataset)
Check if the split has the data it should have after splitting.