qsprpred.extra.data.tables package

Submodules

qsprpred.extra.data.tables.pcm module

class qsprpred.extra.data.tables.pcm.PCMDataSet(name: str, protein_col: str, target_props: list[qsprpred.tasks.TargetProperty | dict], df: DataFrame | None = None, smiles_col: str = 'SMILES', protein_seq_provider: Callable | None = None, add_rdkit: bool = False, store_dir: str = '.', overwrite: bool = False, n_jobs: int | None = 1, chunk_size: int | None = None, drop_invalids: bool = True, drop_empty: bool = True, index_cols: list[str] | None = None, autoindex_name: str = 'QSPRID', random_state: int | None = None, store_format: str = 'pkl')[source]

Bases: QSPRDataset

Extension of QSARDataset for PCM modelling.

It allows specification of a column with protein identifiers and the calculation of protein descriptors.

Variables:

proteinCol (str) – name of column in df containing the protein target identifier (usually a UniProt ID) to use for protein descriptors for PCM modelling and other protein related tasks.
proteinSeqProvider (Callable) – function that takes a list of protein identifiers and returns a dict mapping those identifiers to their sequences. Defaults to None.

Construct a data set to handle PCM data.

Parameters:

name (str) – data name, used in saving the data
protein_col (str) – name of column in df containing the protein target identifier (usually a UniProt ID) to use for protein descriptors for PCM modelling and other protein related tasks.
protein_seq_provider – Callable = None, optional): function that takes a list of protein identifiers and returns a dict mapping those identifiers to their sequences. Defaults to None.
target_props (list[TargetProperty | dict]) – target properties, names should correspond with target column name in df
df (pd.DataFrame, optional) – input dataframe containing smiles and target property. Defaults to None.
smiles_col (str, optional) – name of column in df containing SMILES. Defaults to “SMILES”.
add_rdkit (bool, optional) – if True, column with rdkit molecules will be added to df. Defaults to False.
store_dir (str, optional) – directory for saving the output data. Defaults to ‘.’.
overwrite (bool, optional) – if True, existing data will be overwritten. Defaults to False.
n_jobs (int, optional) – number of parallel jobs. If <= 0, all available cores will be used. Defaults to 1.
chunk_size (int, optional) – chunk size for parallel processing. Defaults to 50.
drop_invalids (bool, optional) – If True, invalid SMILES will be dropped. Defaults to True.
drop_empty (bool, optional) – If True, rows with empty SMILES will be dropped. Defaults to True.
index_cols (List[str], optional) – columns to be used as index in the dataframe. Defaults to None in which case a custom ID will be generated.
autoindex_name (str, optional) – Column name to use for automatically generated IDs.
random_state (int, optional) – random state for reproducibility. Defaults to None.
store_format – format to use for storing the data (‘pkl’ or ‘csv’).

Raises:

ValueError – Raised if threshold given with non-classification task.

addClusters(clusters: list['MoleculeClusters'], recalculate: bool = False)

Add clusters to the data frame.

A new column is created that contains the identifier of the corresponding cluster calculator.

Parameters:

clusters (list) – list of MoleculeClusters calculators.
recalculate (bool) – Whether to recalculate clusters even if they are already present in the data frame.

addDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | qsprpred.extra.data.descriptors.sets.ProteinDescriptorSet], recalculate: bool = False, featurize: bool = True, *args, **kwargs)[source]

Add descriptors to the data set.

If descriptors are already present, they will be recalculated if recalculate is True. Featurization will be performed after adding descriptors if featurize is True. Featurization converts current data matrices to pure numeric matrices of selected descriptors (features).

Parameters:

descriptors (list[DescriptorSet]) – list of descriptor sets to add
recalculate (bool, optional) – whether to recalculate descriptors if they are already present. Defaults to False.
featurize (bool, optional) – whether to featurize the data set splits after adding descriptors. Defaults to True.
*args – additional positional arguments to pass to each descriptor set
**kwargs – additional keyword arguments to pass to each descriptor set

addFeatures(feature_calculators: list[qsprpred.data.descriptors.sets.DescriptorSet], recalculate: bool = False)

Add features to the data set.

Parameters:

feature_calculators (list[DescriptorSet]) – list of feature calculators to add. Defaults to None.
recalculate (bool) – if True, recalculate features even if they are already present in the data set. Defaults to False.

addProperty(name: str, data: list)

Add a property to the data frame.

Parameters:

name (str) – Name of the property.
data (list) – list of property values.

addScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold], add_rdkit_scaffold: bool = False, recalculate: bool = False)

Add scaffolds to the data frame.

A new column is created that contains the SMILES of the corresponding scaffold. If add_rdkit_scaffold is set to True, a new column is created that contains the RDKit scaffold of the corresponding molecule.

Parameters:

scaffolds (list) – list of Scaffold calculators.
add_rdkit_scaffold (bool) – Whether to add the RDKit scaffold of the molecule as a new column.
recalculate (bool) – Whether to recalculate scaffolds even if they are already present in the data frame.

Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.

Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.

If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.

Parameters:

func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – list of properties to send to the function as arguments
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or `self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.

Returns:

Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.

Return type:

Generator

attachDescriptors(calculator: DescriptorSet, descriptors: DataFrame, index_cols: list)

Attach descriptors to the data frame.

Parameters:

calculator (DescriptorsCalculator) – DescriptorsCalculator object to use for descriptor calculation.
descriptors (pd.DataFrame) – DataFrame containing the descriptors to attach.
index_cols (list) – List of column names to use as index.

property baseDir: str: The base directory of the data set folder.

checkFeatures(): Check consistency of features and descriptors.

checkMols(throw: bool = True)

Returns a boolean array indicating whether each molecule is valid or not. If throw is True, an exception is thrown if any molecule is invalid.

Parameters:: throw (bool) – Whether to throw an exception if any molecule is invalid.
Returns:: Boolean series indicating whether each molecule is valid.
Return type:: mask (pd.Series)

property chunkSize: int

clearFiles(): Remove all files associated with this data set from disk.

createScaffoldGroups(mols_per_group: int = 10)

Create scaffold groups.

A scaffold group is a list of molecules that share the same scaffold. New columns are created that contain the scaffold group ID and the scaffold group size.

Parameters:: mols_per_group (int) – number of molecules per scaffold group.

property descriptorSets: Get the descriptor calculators for this table.

dropDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str], full_removal: bool = False)

Drop descriptors from the given sets from the data frame.

Parameters:

descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.
full_removal (bool) – Whether to remove the descriptor data (will perform full removal). By default, a soft removal is performed by just rendering the descriptors inactive. A full removal will remove the descriptorSet from the dataset, including the saved files. It is not possible to restore a descriptorSet after a full removal.

dropDescriptors(descriptors: list[str])

Drop descriptors by name. Performs a simple feature selection by removing the given descriptor names from the data set.

Parameters:: descriptors (list[str]) – List of descriptor names to drop.

dropEmptyProperties(names: list[str])

Drop rows with empty target property value from the data set.

Parameters:: names (list[str]) – list of property names to check for empty values.

dropEmptySmiles(): Drop rows with empty SMILES from the data set.

dropInvalids()

Drops invalid molecules from the data set.

Returns:

Boolean mask of invalid molecules in the original: data set.

Return type:

mask (pd.Series)

dropOutliers(): Drop outliers from the test set based on the applicability domain.

featurize(update_splits=True)

featurizeSplits(shuffle: bool = True, random_state: int | None = None)

If the data set has descriptors, load them into the train and test splits.

If no descriptors are available, remove all features from the splits. They will become zero length along the feature axis (columns), but will retain their original length along the sample axis (rows). This is useful for the case where the data set has no descriptors, but the user wants to retain train and test splits.

shuffle (bool): whether to shuffle the training and test sets random_state (int): random state for shuffling

fillMissing(fill_value: float, columns: list[str] | None = None)

Fill missing values in the data set with a given value.

Parameters:

fill_value (float) – value to fill missing values with
columns (list[str], optional) – columns to fill missing values in. Defaults to None.

filter(table_filters: list[Callable])

Filter the data set using the given filters.

Parameters:: table_filters (list[Callable]) – list of filters to apply

filterFeatures(feature_filters: list[Callable])

Filter features in the data set.

Parameters:: feature_filters (list[Callable]) – list of feature filter functions that take X feature matrix and y target vector as arguments

classmethod fromFile(filename: str) → PandasDataTable

Load a StoredTable object from a file.

Parameters:: filename (str) – The name of the file to load the object from.
Returns:: The StoredTable object itself.

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

static fromMolTable(mol_table: MoleculeTable, protein_col: str, target_props: list[qsprpred.tasks.TargetProperty | dict] | None = None, name: str | None = None, **kwargs) → PCMDataSet[source]

Construct a data set to handle PCM data from a MoleculeTable.

Parameters:

mol_table (MoleculeTable) – MoleculeTable instance containing the PCM data.
protein_col (str) – name of column in df containing the protein target identifier (usually a UniProt ID) to use for protein descriptors for PCM modelling and other protein related tasks.
target_props (list[TargetProperty | dict], optional) – target properties, names should correspond with target column name in df
name (str, optional) – data name, used in saving the data. Defaults to None.
**kwargs – keyword arguments to be passed to the PCMDataset constructor.

Returns:

PCMDataset instance containing the PCM data.

Return type:

PCMDataSet

static fromSDF(name, filename, smiles_prop, *args, **kwargs)[source]

Create QSPRDataset from SDF file.

It is currently not implemented for QSPRDataset, but you can convert from ‘MoleculeTable’ with the ‘fromMolTable’ method.

Parameters:

name (str) – name of the data set
filename (str) – path to the SDF file
smiles_prop (str) – name of the property in the SDF file containing SMILES
*args – additional arguments for QSPRDataset constructor
**kwargs – additional keyword arguments for QSPRDataset constructor

static fromSMILES(name: str, smiles: list, *args, **kwargs)

Create a MoleculeTable instance from a list of SMILES sequences.

Parameters:

name (str) – Name of the data set.
smiles (list) – list of SMILES sequences.
*args – Additional arguments to pass to the MoleculeTable constructor.
**kwargs – Additional keyword arguments to pass to the MoleculeTable constructor.

static fromTableFile(name: str, filename: str, sep: str = '\t', *args, **kwargs)

Create QSPRDataset from table file (i.e. CSV or TSV).

Parameters:

name (str) – name of the data set
filename (str) – path to the table file
sep (str, optional) – separator in the table file. Defaults to “t”.
*args – additional arguments for QSPRDataset constructor
**kwargs – additional keyword arguments for QSPRDataset constructor

Returns:

QSPRDataset object

Return type:

QSPRDataset

generateDescriptorDataSetName(ds_set: str | DescriptorSet): Generate a descriptor set name from a descriptor set.

generateIndex(name: str | None = None, prefix: str | None = None)

Generate a custom index for the data frame automatically.

Parameters:

name (str | None) – name of the resulting index column.
prefix (str | None) – prefix to use for the index column values.

getApplicability(): Get applicability predictions for the test set.

getClusterNames(clusters: list['MoleculeClusters'] | None = None)

Get the names of the clusters in the data frame.

Returns:: List of cluster names.
Return type:: list

getClusters(clusters: list['MoleculeClusters'] | None = None)

Get the subset of the data frame that contains only clusters.

Returns:: Data frame containing only clusters.
Return type:: pd.DataFrame

getDF()

Get the data frame this instance manages.

Returns:: The data frame this instance manages.
Return type:: pd.DataFrame

getDescriptorNames()

Get the names of the descriptors present for molecules in this data set.

Returns:: list of descriptor names.
Return type:: list

getDescriptors(active_only=False)

Get the calculated descriptors as a pandas data frame.

Returns:: Data frame containing only descriptors.
Return type:: pd.DataFrame

getFeatureNames() → list[str]

Get current feature names for this data set.

Returns:: list of feature names
Return type:: list[str]

getFeatures(inplace: bool = False, concat: bool = False, raw: bool = False, ordered: bool = False, refit_standardizer: bool = True)

Get the current feature sets (training and test) from the dataset.

This method also applies any feature standardizers that have been set on the dataset during preparation. Outliers are dropped from the test set if they are present, unless concat is True.

Parameters:

inplace (bool) – If True, the created feature matrices will be saved to the dataset object itself as ‘X’ and ‘X_ind’ attributes. Note that this will overwrite any existing feature matrices and if the data preparation workflow changes, these are not kept up to date. Therefore, it is recommended to generate new feature sets after any data set changes.
concat (bool) – If True, the training and test feature matrices will be concatenated into a single matrix. This is useful for training models that do not require separate training and test sets (i.e. the final optimized models).
raw (bool) – If True, the raw feature matrices will be returned without any standardization applied.
ordered (bool) – If True, the returned feature matrices will be ordered according to the original order of the data set. This is only relevant if concat is True.
refit_standardizer (bool) – If True, the feature standardizer will be refit on the training set upon this call. If False, the previously fitted standardizer will be used. Defaults to True. Use False if this dataset is used for prediction only and the standardizer has been initialized already.

getProperties() → list[str]

Get names of all properties/variables saved in the data frame (all columns).

Returns:: list of property names.
Return type:: list

getProperty(name: str) → Series

Get property values from the data set.

Parameters:: name (str) – Name of the property to get.
Returns:: List of values for the property.
Return type:: pd.Series

getProteinKeys() → list[str][source]

Return a list of keys identifying the proteins in the data frame.

Returns:: List of protein keys.
Return type:: keys (list)

getProteinSequences() → dict[str, str][source]

Return a dictionary of protein sequences for the proteins in the data frame.

Returns:: Dictionary of protein sequences.
Return type:: sequences (dict)

getScaffoldGroups(scaffold_name: str, mol_per_group: int = 10)

Get the scaffold groups for a given combination of scaffold and number of molecules per scaffold group.

Parameters:

scaffold_name (str) – Name of the scaffold.
mol_per_group (int) – Number of molecules per scaffold group.

Returns:

list of scaffold groups.

Return type:

list

getScaffoldNames(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)

Get the names of the scaffolds in the data frame.

Parameters:: include_mols (bool) – Whether to include the RDKit scaffold columns as well.
Returns:: List of scaffold names.
Return type:: list

getScaffolds(scaffolds: list[qsprpred.data.chem.scaffolds.Scaffold] | None = None, include_mols: bool = False)

Get the subset of the data frame that contains only scaffolds.

Parameters:: include_mols (bool) – Whether to include the RDKit scaffold columns as well.
Returns:: Data frame containing only scaffolds.
Return type:: pd.DataFrame

getSubset(prefix: str)

Get a subset of the data set by providing a prefix for the column names or a column name directly.

Parameters:: prefix (str) – Prefix of the column names to select.

getSummary()

Make a summary with some statistics about the molecules in this table. The summary contains the number of molecules per target and the number of unique molecules per target.

Requires this data set to be imported from Papyrus for now.

Returns:: A dataframe with the summary statistics.
Return type:: (pd.DataFrame)

getTargetProperties(names: list) → list[qsprpred.tasks.TargetProperty]

Get the target properties with the given names.

Parameters:: names (list[str]) – name of the target properties
Returns:: list of target properties
Return type:: list[TargetProperty]

getTargetPropertiesValues(concat: bool = False, ordered: bool = False)

Get the response values (training and test) for the set target property.

Parameters:

concat (bool) – if True, return concatenated training and validation set target properties
ordered (bool) – if True, return the target properties in the original order of the data set. This is only relevant if concat is True.

Returns:

tuple of (train_responses, test_responses) or pandas.DataFrame of all target property values

property hasClusters

Check whether the data frame contains clusters.

Returns:: Whether the data frame contains clusters.
Return type:: bool

hasDescriptors(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str] | None = None) → bool | list[bool]

Check whether the data frame contains given descriptors.

Parameters:: descriptors (list) – list of DescriptorSet objects or prefixes of descriptors to check for. If None, all descriptors are checked for and a single boolean is returned if any descriptors are found.
Returns:: list of booleans indicating whether each descriptor is present or not.
Return type:: list

property hasFeatures: Check whether the currently selected set of features is not empty.

hasProperty(name)

Check whether a property is present in the data frame.

Parameters:: name (str) – Name of the property.
Returns:: Whether the property is present.
Return type:: bool

property hasScaffoldGroups

Check whether the data frame contains scaffold groups.

Returns:: Whether the data frame contains scaffold groups.
Return type:: bool

property hasScaffolds

Check whether the data frame contains scaffolds.

Returns:: Whether the data frame contains scaffolds.
Return type:: bool

imputeProperties(names: list[str], imputer: Callable)

Impute missing property values.

Parameters:

names (list) – List of property names to impute.
imputer (Callable) –

imputer object implementing the fit_transform
method from scikit-learn API.

property isMultiTask

Check if the dataset contains multiple target properties.

Returns:: True if the dataset contains multiple target properties
Return type:: bool

iterChunks(include_props: list[str] | None = None, as_dict: bool = False, chunk_size: int | None = None) → Generator[DataFrame | dict, None, None]

Batch a data frame into chunks of the given size.

Parameters:

include_props (list[str]) – list of properties to include, if None, all properties are included.
as_dict (bool) – If True, the generator yields dictionaries instead of data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, self.chunkSize is used.

Returns:

Generator that yields batches of the data frame as smaller data frames.

Return type:

Generator[pd.DataFrame, None, None]

iterFolds(split: DataSplit, concat: bool = False) → Generator[tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame | pandas.core.series.Series, pandas.core.frame.DataFrame | pandas.core.series.Series, list[int], list[int]], None, None]

Iterate over the folds of the dataset.

Parameters:

split (DataSplit) – split instance orchestrating the split
concat (bool) – whether to concatenate the training and test feature matrices

Yields:

tuple – training and test feature matrices and target vectors for each fold

loadDescriptorsToSplits(shuffle: bool = True, random_state: int | None = None)

Load all available descriptors into the train and test splits.

If no descriptors are available, an exception will be raised.

Parameters:

shuffle (bool) – whether to shuffle the training and test sets
random_state (int) – random state for shuffling

Raises:

ValueError – if no descriptors are available

makeClassification(target_property: str, th: list[float] | None = None)

Switch to classification task using the given threshold values.

Parameters:

target_property (str) – Target property to use for classification or name of the target property.
th (list[float], optional) – list of threshold values. If not provided, the values will be inferred from th specified in TargetProperty. Defaults to None.

makeRegression(target_property: str)

Switch to regression task using the given target property.

Parameters:: target_property (str) – name of the target property to use for regression

property metaFile: The path to the meta file of this data set.

property nJobs

property nTargetProperties: Get the number of target properties in the dataset.

prepareDataset(smiles_standardizer: str | ~typing.Callable | None = 'chembl', data_filters: list | None = (<qsprpred.data.processing.data_filters.RepeatsFilter object>, ), split=None, feature_calculators: list[qsprpred.data.descriptors.sets.DescriptorSet] | None = None, feature_filters: list | None = None, feature_standardizer: ~qsprpred.data.processing.feature_standardizers.SKLearnStandardizer | None = None, feature_fill_value: float = nan, applicability_domain: ~qsprpred.data.processing.applicability_domain.ApplicabilityDomain | ~mlchemad.base.ApplicabilityDomain | None = None, drop_outliers: bool = False, recalculate_features: bool = False, shuffle: bool = True, random_state: int | None = None)

Prepare the dataset for use in QSPR model.

Parameters:

smiles_standardizer (str | Callable) – either chembl, old, or a partial function that reads and standardizes smiles. If None, no standardization will be performed. Defaults to chembl.
data_filters (list of datafilter obj) – filters number of rows from dataset
split (datasplitter obj) – splits the dataset into train and test set
feature_calculators (list[DescriptorSet]) – descriptor sets to add to the data set
feature_filters (list of feature filter objs) – filters features
feature_standardizer (SKLearnStandardizer or sklearn.base.BaseEstimator) – standardizes and/or scales features
feature_fill_value (float) – value to fill missing values with. Defaults to numpy.nan
applicability_domain (applicabilityDomain obj) – attaches an applicability domain calculator to the dataset and fits it on the training set
drop_outliers (bool) – whether to drop samples that are outside the applicability domain from the test set, if one is attached.
recalculate_features (bool) – recalculate features even if they are already present in the file
shuffle (bool) – whether to shuffle the created training and test sets
random_state (int) – random state for shuffling

processMols(processor: MolProcessor, proc_args: tuple[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, add_props: list[str] | None = None, as_rdkit: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) → Generator

Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.

IMPORTANT: For successful parallel processing, the processor must be picklable. Also note that the returned generator will produce results as soon as they are ready, which means that the chunks of data will not be in the same order as the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules. See CheckSmilesValid for an example.

Parameters:

processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (list, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
as_rdkit (bool, optional) – Whether to convert the molecules to RDKit molecules before applying the processor.
chunk_size (int, optional) – Size of chunks to use per job in parallel. If not specified, self.chunkSize is used.
n_jobs (int, optional) – Number of jobs to use for parallel processing. If not specified, self.nJobs is used.

Returns:

A generator that yields the results of the supplied processor on the chunked molecules from the data set.

Return type:

Generator

reload(): Reload the data table from disk.

removeProperty(name)

Remove a property from the data frame.

Parameters:: name (str) – Name of the property to delete.

reset(): Reset the data set. Splits will be removed and all descriptors will be moved to the training data. Molecule standardization and molecule filtering are not affected.

resetTargetProperty(prop: TargetProperty | str)

Reset target property to its original value.

Parameters:: prop (TargetProperty | str) – target property to reset

restoreDescriptorSets(descriptors: list[qsprpred.data.descriptors.sets.DescriptorSet | str])

Restore descriptors that were previously removed.

Parameters:: descriptors (list[DescriptorSet | str]) – List of DescriptorSet objects or their names. Name of a descriptor set corresponds to the result returned by its __str__ method.

restoreTrainingData()

Restore training data from the data frame.

If the data frame contains a column ‘Split_IsTrain’, the data will be split into training and independent sets. Otherwise, the independent set will be empty. If descriptors are available, the resulting training matrices will be featurized.

classmethod runMolProcess(props: dict[str, list] | DataFrame, func: MolProcessor, add_rdkit: bool, smiles_col: str, *args, **kwargs)

A helper method to run a MolProcessor on a list of molecules via apply. It converts the SMILES to RDKit molecules if required and then applies the function to the MolProcessor object.

Parameters:

props (dict) – Dictionary of properties that will be passed in addition to the molecule structure.
func (MolProcessor) – MolProcessor object to use for processing.
add_rdkit (bool) – Whether to convert the SMILES to RDKit molecules before applying the function.
smiles_col (str) – Name of the property containing the SMILES sequences.
*args – Additional positional arguments to pass to the function.
**kwargs – Additional keyword arguments to pass to the function.

sample(n: int, name: str | None = None, random_state: int | None = None) → MoleculeTable

Sample n molecules from the table.

Parameters:

n (int) – Number of molecules to sample.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _sampled suffix.
random_state (int) – Random state to use for shuffling and other random ops.

Returns:

A dataframe with the sampled molecules.

Return type:

(MoleculeTable)

save(save_split: bool = True)

Save the data set to file and serialize metadata.

Parameters:: save_split (bool) – whether to save split data to the managed data frame.

saveSplit(): Save split data to the managed data frame.

searchOnProperty(prop_name: str, values: list[str], name: str | None = None, exact=False) → MoleculeTable

Search in this table using a property name and a list of values. It is assumed that the property is searchable with string matching. Either an exact match or a partial match can be used. If ‘exact’ is False, the search will be performed with partial matching, i.e. all molecules that contain any of the given values in the property will be returned. If ‘exact’ is True, only molecules that have the exact property value for any of the given values will be returned.

Parameters:

prop_name (str) – Name of the property to search on.
values (list[str]) – List of values to search for. If any of the values is found in the property, the molecule will be considered a match.
name (str | None, optional) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
exact (bool, optional) – Whether to use exact matching, i.e. whether to search for exact strings or just substrings. Defaults to False.

Returns:

A new table with the molecules from the old table with the given property values.

Return type:

MoleculeTable

searchWithIndex(index: Index, name: str | None = None) → MoleculeTable[source]

Search in this table using a pandas index. The return values is a new table with the molecules from the old table with the given indices.

Parameters:

index (pd.Index) – Indices to search for in this table.
name (str) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.

Returns:

A new table with the molecules from the old table with the given indices.

Return type:

MoleculeTable

searchWithSMARTS(patterns: list[str], operator: ~typing.Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, match_function: ~typing.Callable = <function match_mol_to_smarts>) → MoleculeTable

Search the molecules in the table with a SMARTS pattern.

Parameters:

patterns – List of SMARTS patterns to search with.
operator (object) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.
use_chirality – Whether to use chirality in the search.
name – Name of the new table. Defaults to the name of the old table, plus the smarts_searched suffix.
match_function – Function to use for matching the molecules to the SMARTS patterns. Defaults to match_mol_to_smarts.

Returns:

A dataframe with the molecules that match the pattern.

Return type:

(MolTable)

setApplicabilityDomain(applicability_domain: ApplicabilityDomain | ApplicabilityDomain)

Set the applicability domain calculator.

Parameters:: applicability_domain (ApplicabilityDomain | MLChemADApplicabilityDomain) – applicability domain calculator instance

setFeatureStandardizer(feature_standardizer)

Set feature standardizer.

Parameters:: feature_standardizer (SKLearnStandardizer | BaseEstimator) – feature standardizer

setIndex(cols: list[str])

Create and index column from several columns of the data set. This also resets the idProp attribute to be the name of the index columns joined by a ‘~’ character. The values of the columns are also joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined to a string.

Parameters:: cols (list[str]) – list of columns to use as index.

setRandomState(random_state: int)

Set the random state for this instance.

Parameters:: random_state (int) – Random state to use for shuffling and other random operations.

setTargetProperties(target_props: list[qsprpred.tasks.TargetProperty | dict], drop_empty: bool = True)

Set list of target properties and apply transformations if specified.

Parameters:

target_props (list[TargetProperty]) – list of target properties
drop_empty (bool, optional) – whether to drop rows with empty target property values. Defaults to True.

setTargetProperty(prop: TargetProperty | dict, drop_empty: bool = True)

Add a target property to the dataset.

Parameters:

prop (TargetProperty) – name of the target property to add
drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.

shuffle(random_state: int | None = None): Shuffle the internal data frame.

property smiles: Generator[str, None, None]

Get the SMILES strings of the molecules in the data frame.

Returns:: Generator of SMILES strings.
Return type:: Generator[str, None, None]

split(split: DataSplit, featurize: bool = False)

Split dataset into train and test set.

You can either split tha data frame itself or you can set featurize to True if you want to use feature matrices instead of the raw data frame.

Parameters:

split (DataSplit) – split instance orchestrating the split
featurize (bool) – whether to featurize the data set splits after splitting. Defaults to False.

standardizeSmiles(smiles_standardizer, drop_invalid=True)

Apply smiles_standardizer to the compounds in parallel

Parameters:

() (smiles_standardizer) – either None to skip the standardization, chembl, old, or a partial function that reads and standardizes smiles.
drop_invalid (bool) – whether to drop invalid SMILES from the data set. Defaults to True. If False, invalid SMILES will be retained in their original form. If self.invalidsRemoved is True, there will be no effect even if drop_invalid is True. Set self.invalidsRemoved to False on this instance to force the removal of invalid SMILES.

Raises:

ValueError – when smiles_standardizer is not a callable or one of the predefined strings.

property storeDir: The data set folder containing the data set files after saving.

property storePath: The path to the main data set file.

property storePrefix: The prefix of the data set files.

property targetPropertyNames: Get the names of the target properties.

toFile(filename: str)

Save the metafile and all associated files to a custom location.

Parameters:: filename (str) – absolute path to the saved metafile.

toJSON() → str

Serialize object to a JSON string. This JSON string should: contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

transformProperties(targets: list[str], transformer: Callable)

Transform the target properties using the given transformer.

Parameters:

targets (list[str]) – list of target properties names to transform
transformer (Callable) – transformer function
add_as (list[str] | None, optional) – list of names to add the transformed target properties as. If None, the original target properties will be overwritten. Defaults to None.

unsetTargetProperty(name: str | TargetProperty)

Unset the target property. It will not remove it from the data set, but will make it unavailable for training.

Parameters:: name (str | TargetProperty) – name of the target property to drop or the property itself

qsprpred.extra.data.tables.tests module

class qsprpred.extra.data.tables.tests.TestPCMDataSetPreparation(methodName='runTest')[source]

Bases: DataSetsMixInExtras, DataPrepCheckMixIn, TestCase

Test the preparation of the PCMDataSet.

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

classmethod addClassCleanup(function, /, *args, **kwargs): Same as addCleanup, except the cleanup items are called even if setUpClass fails (unlike tearDownClass).

addCleanup(function, /, *args, **kwargs)

Add a function, with arguments, to be called when the test is completed. Functions added are called on a LIFO basis and are called after tearDown on test failure or success.

Cleanup items are called even if setUp fails (unlike tearDown).

addTypeEqualityFunc(typeobj, function)

Add a type specific assertEqual style function to compare a type.

This method is for use by TestCase subclasses that need to register their own type equality functions to provide nicer error messages.

Parameters:

typeobj – The data type to call this function on when both values are of the same type in assertEqual().
function – The callable taking two arguments and an optional msg= argument that raises self.failureException with a useful error message when the two arguments are not equal.

assertAlmostEqual(first, second, places=None, msg=None, delta=None)

Fail if the two objects are unequal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is more than the given delta.

Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).

If the two objects compare equal then they will automatically compare almost equal.

assertCountEqual(first, second, msg=None)

Asserts that two iterables have the same elements, the same number of times, without regard to order.

self.assertEqual(Counter(list(first)),
Counter(list(second)))

Example:

[0, 1, 1] and [1, 0, 1] compare equal.

[0, 0, 1] and [0, 1] compare unequal.

assertDictEqual(d1, d2, msg=None)

assertEqual(first, second, msg=None): Fail if the two objects are unequal as determined by the ‘==’ operator.

assertFalse(expr, msg=None): Check that the expression is false.

assertGreater(a, b, msg=None): Just like self.assertTrue(a > b), but with a nicer default message.

assertGreaterEqual(a, b, msg=None): Just like self.assertTrue(a >= b), but with a nicer default message.

assertIn(member, container, msg=None): Just like self.assertTrue(a in b), but with a nicer default message.

assertIs(expr1, expr2, msg=None): Just like self.assertTrue(a is b), but with a nicer default message.

assertIsInstance(obj, cls, msg=None): Same as self.assertTrue(isinstance(obj, cls)), with a nicer default message.

assertIsNone(obj, msg=None): Same as self.assertTrue(obj is None), with a nicer default message.

assertIsNot(expr1, expr2, msg=None): Just like self.assertTrue(a is not b), but with a nicer default message.

assertIsNotNone(obj, msg=None): Included for symmetry with assertIsNone.

assertLess(a, b, msg=None): Just like self.assertTrue(a < b), but with a nicer default message.

assertLessEqual(a, b, msg=None): Just like self.assertTrue(a <= b), but with a nicer default message.

assertListEqual(list1, list2, msg=None)

A list-specific equality assertion.

Parameters:

list1 – The first list to compare.
list2 – The second list to compare.
msg – Optional message to use on failure instead of a list of differences.

assertLogs(logger=None, level=None)

Fail unless a log message of level level or higher is emitted on logger_name or its children. If omitted, level defaults to INFO and logger defaults to the root logger.

This method must be used as a context manager, and will yield a recording object with two attributes: output and records. At the end of the context manager, the output attribute will be a list of the matching formatted log messages and the records attribute will be a list of the corresponding LogRecord objects.

Example:

with self.assertLogs('foo', level='INFO') as cm:
    logging.getLogger('foo').info('first message')
    logging.getLogger('foo.bar').error('second message')
self.assertEqual(cm.output, ['INFO:foo:first message',
                             'ERROR:foo.bar:second message'])

assertMultiLineEqual(first, second, msg=None): Assert that two multi-line strings are equal.

assertNoLogs(logger=None, level=None)

Fail unless no log messages of level level or higher are emitted on logger_name or its children.

This method must be used as a context manager.

assertNotAlmostEqual(first, second, places=None, msg=None, delta=None)

Fail if the two objects are equal as determined by their difference rounded to the given number of decimal places (default 7) and comparing to zero, or by comparing that the difference between the two objects is less than the given delta.

Note that decimal places (from zero) are usually not the same as significant digits (measured from the most significant digit).

Objects that are equal automatically fail.

assertNotEqual(first, second, msg=None): Fail if the two objects are equal as determined by the ‘!=’ operator.

assertNotIn(member, container, msg=None): Just like self.assertTrue(a not in b), but with a nicer default message.

assertNotIsInstance(obj, cls, msg=None): Included for symmetry with assertIsInstance.

assertNotRegex(text, unexpected_regex, msg=None): Fail the test if the text matches the regular expression.

assertRaises(expected_exception, *args, **kwargs)

Fail unless an exception of class expected_exception is raised by the callable when invoked with specified positional and keyword arguments. If a different type of exception is raised, it will not be caught, and the test case will be deemed to have suffered an error, exactly as for an unexpected exception.

If called with the callable and arguments omitted, will return a context object used like this:

with self.assertRaises(SomeException):
    do_something()

An optional keyword argument ‘msg’ can be provided when assertRaises is used as a context object.

The context manager keeps a reference to the exception as the ‘exception’ attribute. This allows you to inspect the exception after the assertion:

with self.assertRaises(SomeException) as cm:
    do_something()
the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)

assertRaisesRegex(expected_exception, expected_regex, *args, **kwargs)

Asserts that the message in a raised exception matches a regex.

Parameters:

expected_exception – Exception class expected to be raised.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertRaisesRegex is used as a context manager.

assertRegex(text, expected_regex, msg=None): Fail the test unless the text matches the regular expression.

assertSequenceEqual(seq1, seq2, msg=None, seq_type=None)

An equality assertion for ordered sequences (like lists and tuples).

For the purposes of this function, a valid ordered sequence type is one which can be indexed, has a length, and has an equality operator.

Parameters:

seq1 – The first sequence to compare.
seq2 – The second sequence to compare.
seq_type – The expected datatype of the sequences, or None if no datatype should be enforced.
msg – Optional message to use on failure instead of a list of differences.

assertSetEqual(set1, set2, msg=None)

A set-specific equality assertion.

Parameters:

set1 – The first set to compare.
set2 – The second set to compare.
msg – Optional message to use on failure instead of a list of differences.

assertSetEqual uses ducktyping to support different types of sets, and is optimized for sets specifically (parameters must support a difference method).

assertTrue(expr, msg=None): Check that the expression is true.

assertTupleEqual(tuple1, tuple2, msg=None)

A tuple-specific equality assertion.

Parameters:

tuple1 – The first tuple to compare.
tuple2 – The second tuple to compare.
msg – Optional message to use on failure instead of a list of differences.

assertWarns(expected_warning, *args, **kwargs)

Fail unless a warning of class warnClass is triggered by the callable when invoked with specified positional and keyword arguments. If a different type of warning is triggered, it will not be handled: depending on the other warning filtering rules in effect, it might be silenced, printed out, or raised as an exception.

If called with the callable and arguments omitted, will return a context object used like this:

with self.assertWarns(SomeWarning):
    do_something()

An optional keyword argument ‘msg’ can be provided when assertWarns is used as a context object.

The context manager keeps a reference to the first matching warning as the ‘warning’ attribute; similarly, the ‘filename’ and ‘lineno’ attributes give you information about the line of Python code from which the warning was triggered. This allows you to inspect the warning after the assertion:

with self.assertWarns(SomeWarning) as cm:
    do_something()
the_warning = cm.warning
self.assertEqual(the_warning.some_attribute, 147)

assertWarnsRegex(expected_warning, expected_regex, *args, **kwargs)

Asserts that the message in a triggered warning matches a regexp. Basic functioning is similar to assertWarns() with the addition that only warnings whose messages also match the regular expression are considered successful matches.

Parameters:

expected_warning – Warning class expected to be triggered.
expected_regex – Regex (re.Pattern object or string) expected to be found in error message.
args – Function to be called and extra positional args.
kwargs – Extra kwargs.
msg – Optional message used in case of failure. Can only be used when assertWarnsRegex is used as a context manager.

checkDescriptors(dataset: QSPRDataset, target_props: list[dict | qsprpred.tasks.TargetProperty])

Check if information about descriptors is consistent in the data set. Checks if calculators are consistent with the descriptors contained in the data set. This is tested also before and after serialization.

Parameters:

dataset (QSPRDataset) – The data set to check.
target_props (List of dicts or TargetProperty) – list of target properties

Raises:

AssertionError – If the consistency check fails.

checkFeatures(ds: QSPRDataset, expected_length: int)

Check if the feature names and the feature matrix of a data set is consistent with expected number of variables.

Parameters:

ds (QSPRDataset) – The data set to check.
expected_length (int) – The expected number of features.

Raises:

AssertionError – If the feature names or the feature matrix is not consistent

checkPrep(dataset, feature_calculators, split, feature_standardizer, feature_filter, data_filter, applicability_domain, expected_target_props): Check the consistency of the dataset after preparation.

clearGenerated(): Remove the directories that are used for testing.

countTestCases()

createLargeMultitaskDataSet(name='QSPRDataset_multi_test', target_props=[{'name': 'HBD', 'task': <TargetTasks.MULTICLASS: 'MULTICLASS'>, 'th': [-1, 1, 2, 100]}, {'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)

Create a large dataset for testing purposes.

Parameters:

name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
preparation_settings (dict) – dictionary containing preparation settings
random_state (int) – random state to use for splitting and shuffling

Returns:

a QSPRDataset object

Return type:

QSPRDataset

createLargeTestDataSet(name='QSPRDataset_test_large', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42, n_jobs=1, chunk_size=None)

Create a large dataset for testing purposes.

Parameters:

name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings

Returns:

a QSPRDataset object

Return type:

QSPRDataset

createPCMDataSet(name: str = 'QSPRDataset_test_pcm', target_props: list[qsprpred.tasks.TargetProperty] | list[dict] = [{'name': 'pchembl_value_Median', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings: dict | None = None, protein_col: str = 'accession', random_state: int | None = None)

Create a small dataset for testing purposes.

Parameters:

name (str, optional) – name of the dataset. Defaults to “QSPRDataset_test”.
target_props (list[TargetProperty] | list[dict], optional) – target properties.
preparation_settings (dict | None, optional) – preparation settings. Defaults to None.
protein_col (str, optional) – name of the column with protein accessions. Defaults to “accession”.
random_state (int, optional) – random seed to use in the dataset. Defaults to None

Returns:

a QSPRDataset object

Return type:

QSPRDataset

createSmallTestDataSet(name='QSPRDataset_test_small', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], preparation_settings=None, random_state=42)

Create a small dataset for testing purposes.

Parameters:

name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
preparation_settings (dict) – dictionary containing preparation settings

Returns:

a QSPRDataset object

Return type:

QSPRDataset

createTestDataSetFromFrame(df, name='QSPRDataset_test', target_props=[{'name': 'CL', 'task': <TargetTasks.REGRESSION: 'REGRESSION'>}], random_state=None, prep=None, n_jobs=1, chunk_size=None)

Create a dataset for testing purposes from the given data frame.

Parameters:

df (pd.DataFrame) – data frame containing the dataset
name (str) – name of the dataset
target_props (List of dicts or TargetProperty) – list of target properties
random_state (int) – random state to use for splitting and shuffling
prep (dict) – dictionary containing preparation settings

Returns:

a QSPRDataset object

Return type:

QSPRDataset

debug(): Run the test without collecting errors in a TestResult

defaultTestResult()

classmethod doClassCleanups(): Execute all class cleanup functions. Normally called for you after tearDownClass.

doCleanups(): Execute all cleanup functions. Normally called for you after tearDown.

classmethod enterClassContext(cm): Same as enterContext, but class-wide.

enterContext(cm)

Enters the supplied context manager.

If successful, also adds its __exit__ method as a cleanup function and returns the result of the __enter__ method.

fail(msg=None): Fail immediately, with the given message.

failureException: alias of AssertionError

fetchDataset(name: str) → PCMDataSet[source]

Create a quick dataset with the given name.

Parameters:: name (str) – Name of the dataset.
Returns:: The dataset.
Return type:: PCMDataSet

classmethod getAllDescriptors() → list[qsprpred.data.descriptors.sets.DescriptorSet]

Return a list of all available molecule descriptor sets.

Returns:: list of MoleculeDescriptorSet objects
Return type:: list

classmethod getAllProteinDescriptors() → list[qsprpred.extra.data.descriptors.sets.ProteinDescriptorSet]

Return a list of all available protein descriptor sets.

Returns:: list of ProteinDescriptorSet objects
Return type:: list

getBigDF()

Get a large data frame for testing purposes.

Returns:: a pandas.DataFrame containing the dataset
Return type:: pd.DataFrame

classmethod getDataPrepGrid()

Return a list of many possible combinations of descriptor calculators, splits, feature standardizers, feature filters and data filters. Again, this is not exhaustive, but should cover a lot of cases.

Returns:: a generator that yields tuples of all possible combinations as stated above, each tuple is defined as: (descriptor_calculator, split, feature_standardizer, feature_filters, data_filters)
Return type:: grid

classmethod getDefaultCalculatorCombo(): Return the default descriptor calculator combo.

static getDefaultPrep(): Return a dictionary with default preparation settings.

classmethod getMSAProvider(out_dir: str)

getPCMDF() → DataFrame

Return a test dataframe with PCM data.

Returns:: dataframe with PCM data
Return type:: pd.DataFrame

getPCMSeqProvider() → Callable[[list[str]], tuple[dict[str, str], dict[str, dict]]]

Return a function that provides sequences for given accessions.

Returns:: function that provides sequences for given accessions
Return type:: Callable[[list[str]], tuple[dict[str, str], dict[str, dict]]]

getPCMTargetsDF() → DataFrame

Return a test dataframe with PCM targets and their sequences.

Returns:: dataframe with PCM targets and their sequences
Return type:: pd.DataFrame

classmethod getPrepCombos()

Return a list of all possible preparation combinations as generated by getDataPrepGrid as well as their names. The generated list can be used to parameterize tests with the given named combinations.

Returns:: list of `list`s of all possible combinations of preparation
Return type:: list

getSmallDF()

Get a small data frame for testing purposes.

Returns:: a pandas.DataFrame containing the dataset
Return type:: pd.DataFrame

id()

longMessage = True

maxDiff = 640

run(result=None)

setUp()[source]: Hook method for setting up the test fixture before exercising it.

classmethod setUpClass(): Hook method for setting up class fixture before running tests in the class.

setUpPaths(): Create the directories that are used for testing.

shortDescription()

Returns a one-line description of the test, or None if no description has been provided.

The default implementation of this method returns the first line of the specified test method’s docstring.

skipTest(reason): Skip this test.

subTest(msg=<object object>, **params): Return a context manager that will return the enclosed block of code in a subtest identified by the optional message and keyword parameters. A failure in the subtest marks the test case as failed but resumes execution at the end of the enclosed block, allowing further test code to be executed.

tearDown(): Remove all files and directories that are used for testing.

classmethod tearDownClass(): Hook method for deconstructing the class fixture after running all tests in the class.

testPrepCombinations = None

testPrepCombinations_00_MorganFP_ProDec_None_None_None_None_None(**kw)

Test the preparation of the dataset [with _=’MorganFP_ProDec_None_None_None_None_None’, name=’MorganFP_ProDec_None_None_None_None_None’, feature_calculators=(<qsprpred.data.descriptors.fing…roDec object at 0x7efff3c4d7f0>), split=None, feature_standardizer=None, feature_filter=None, data_filter=None, applicability_domain=None].