qsprpred.data.tables.interfaces package

Submodules

qsprpred.data.tables.interfaces.data_set_dependent module

class qsprpred.data.tables.interfaces.data_set_dependent.DataSetDependent(dataset: QSPRDataSet | None = None, **kwargs: Any)[source]

Bases: JSONSerializable

Classes that need an attached QSPRDataSet should inherit from this class; the data set is then supplied to them through this API.

Variables:: dataSet (QSPRDataSet) – The data set attached to this object.

Initialize the object with a data set.

Parameters:: dataset (QSPRDataSet, optional) – The data set to attach to this object. Defaults to None.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

getDataSet() → QSPRDataSet[source]

Get the data set attached to this object.

Returns:: The data set attached to this object
Return type:: QSPRDataSet
Raises:: ValueError – If no data set is attached to this object.

property hasDataSet: bool: Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) → None[source]: Set the data set for this object.
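
The attach, query and detach behaviour documented above can be sketched without QSPRpred installed. The classes below are hypothetical stand-ins, not the library's implementation; they only mirror the documented contract of getDataSet, hasDataSet and setDataSet, and a plain list stands in for a QSPRDataSet.

```python
class DataSetDependentSketch:
    """Hypothetical stand-in that mirrors the documented behaviour only."""

    def __init__(self, dataset=None):
        self.dataSet = dataset

    @property
    def hasDataSet(self) -> bool:
        # True once a data set has been attached
        return self.dataSet is not None

    def setDataSet(self, dataset) -> None:
        # assumption: passing None detaches the current data set
        self.dataSet = dataset

    def getDataSet(self):
        # documented contract: raise ValueError when nothing is attached
        if self.dataSet is None:
            raise ValueError("No data set attached.")
        return self.dataSet


class RowCounter(DataSetDependentSketch):
    """Hypothetical subclass that needs a data set to do its work."""

    def count(self) -> int:
        return len(self.getDataSet())


counter = RowCounter()
attached_before = counter.hasDataSet
counter.setDataSet(["CCO", "c1ccccc1"])  # a plain list stands in for a QSPRDataSet
n_rows = counter.count()
```

Checking hasDataSet before calling getDataSet avoids having to handle the ValueError in code paths where the data set is optional.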

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)
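
The four serialization methods form a round trip: toJSON and fromJSON work on strings, toFile and fromFile on paths. The sketch below illustrates that round trip with the standard json module on a hypothetical class (JSONSketch and its two attributes are made up for illustration); how JSONSerializable encodes objects internally is not described here and may well differ.

```python
import json
import os
import tempfile


class JSONSketch:
    """Hypothetical class; the real JSONSerializable may encode state differently."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold

    def toJSON(self) -> str:
        # the string must hold everything needed to rebuild the object
        return json.dumps({"name": self.name, "threshold": self.threshold})

    @classmethod
    def fromJSON(cls, text: str):
        return cls(**json.loads(text))

    def toFile(self, filename: str) -> str:
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        # mirrors the documented return value: the absolute path of the file
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename: str):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp:
    path = JSONSketch("demo", 6.5).toFile(os.path.join(tmp, "obj.json"))
    restored = JSONSketch.fromFile(path)
```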

qsprpred.data.tables.interfaces.molecule_data_set module

class qsprpred.data.tables.interfaces.molecule_data_set.MoleculeDataSet[source]

Bases: PropertyStorage, DescriptorProvider, MolProcessable, SMARTSSearchable, Summarizable, Randomized, Identifiable, Standardizable, ABC

Interface for storing and managing chemical data sets for machine learning.

abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)

Add descriptors to the dataset.

Parameters:

descriptors (list[DescriptorSet]) – The descriptors to add.
args – Additional positional arguments to be passed to each descriptor set.
kwargs – Additional keyword arguments to be passed to each descriptor set.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:

ids (list) – The IDs of the entries to add.
props (dict) – The properties to add.
raise_on_existing (bool) – If True, an exception is raised if an entry already exists.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.
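
A minimal sketch of the raise_on_existing contract, using a plain dict as a hypothetical storage backend. The function name add_entries and the dict layout are illustrative only. What an implementation does with an existing entry when raise_on_existing is False is not stated above; overwriting is assumed here.

```python
def add_entries(store, ids, props, raise_on_existing=True):
    """Hypothetical dict-backed sketch of the documented addEntries contract."""
    existing = [entry_id for entry_id in ids if entry_id in store]
    if existing and raise_on_existing:
        raise ValueError(f"Entries already exist: {existing}")
    for row, entry_id in enumerate(ids):
        # assumption: with raise_on_existing=False an existing entry is overwritten
        store[entry_id] = {name: values[row] for name, values in props.items()}


store = {}
add_entries(store, ["mol_1", "mol_2"], {"SMILES": ["CCO", "CCN"], "pKi": [6.1, 7.3]})

try:
    add_entries(store, ["mol_2"], {"SMILES": ["CCC"], "pKi": [5.0]})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

add_entries(store, ["mol_2"], {"SMILES": ["CCC"], "pKi": [5.0]}, raise_on_existing=False)
```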

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:

name (str) – The name of the property.
data (list) – The data of the property.
ids (list, optional) – The IDs of the entries to add the property for.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) → Generator[Iterable[Any], None, None]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:

func (callable) – The function to apply.
func_args (list, optional) – The positional arguments of the function.
func_kwargs (dict, optional) – The keyword arguments of the function.
on_props (list, optional) – The properties to apply the function on.
as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.
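
The calling convention described above (selected properties passed as a single first positional argument, one result yielded per chunk) can be sketched as follows. The helper apply_in_chunks, the list-of-dicts input and the dict-of-lists chunk format are assumptions made for illustration; the interface leaves the chunk format to the implementation.

```python
def apply_in_chunks(rows, func, func_args=None, func_kwargs=None,
                    on_props=None, chunk_size=2):
    """Hypothetical sketch: rows is a list of dicts, chunks are dicts of lists."""
    func_args = func_args or []
    func_kwargs = func_kwargs or {}
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        props = on_props or tuple(chunk[0].keys())
        # the selected properties are handed over as one object, first in line
        payload = {prop: [row[prop] for row in chunk] for prop in props}
        yield func(payload, *func_args, **func_kwargs)


def mean_of(props, name, scale=1.0):
    values = props[name]
    return scale * sum(values) / len(values)


rows = [
    {"SMILES": "CCO", "pKi": 6.0},
    {"SMILES": "CCN", "pKi": 8.0},
    {"SMILES": "CCC", "pKi": 5.0},
]
results = list(apply_in_chunks(rows, mean_of, func_args=["pKi"], on_props=("pKi",)))
```

Because the result is a generator, nothing is computed until it is consumed, for example with list() as in the last line.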

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:: identifier (ChemIdentifier) – The identifier to apply.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:: standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkSize: int: The size of the chunks to iterate over.

abstract clear(): Delete entries in the persistent storage.

abstract property descriptorSets: list[DescriptorSet]

Get the descriptor sets that are currently in the storage.

Returns:: a list of descriptor sets

abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])

Drop descriptor sets from the storage.

Parameters:: descriptors – The descriptor sets to drop.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:: ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

abstract getDF() → DataFrame

Get the stored properties as a pandas DataFrame.

Returns:: The data as a pandas DataFrame.
Return type:: pd.DataFrame

abstract getDescriptorNames() → list[str]

Get the names of the descriptors that are currently in the storage.

Returns:: a list of descriptor names

abstract getDescriptors() → DataFrame

Get the table of descriptors that are currently in the storage.

Returns:: a pd.DataFrame with the descriptors

abstract getProperties() → list[str]: Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any]

Get values of a given property.

Parameters:

name (str) – The name of the property.
ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) → PropertyStorage

Get a subset of the storage for the given properties.

Parameters:

subset (list) – The list of property names to include in the subset.
ids (list, optional) – The IDs of the entries to include in the subset.
name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

abstract getSummary() → DataFrame

Make a summary with some statistics about this object or action.

Returns:: A dataframe with the summary statistics.
Return type:: pd.DataFrame

abstract hasDescriptors(): Indicates if the storage has descriptors.

abstract hasProperty(name: str) → bool

Check whether a property is present in the data frame.

Parameters:: name (str) – Name of the property.
Returns:: Whether the property is present.
Return type:: bool

abstract property idProp: str: Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:: The identifier used by this instance.
Return type:: ChemIdentifier

abstract iterChunks(size: int | None = None, on_props: list | None = None) → Generator[list[Any], None, None]

Iterate over chunks of molecules across the store.

Returns:: an iterable of lists of stored molecules
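
As a rough illustration of chunked iteration, the generator below slices a list of SMILES strings into fixed-size lists. The function iter_chunks and its default_size argument are hypothetical; falling back to a default when size is None is an assumption modelled on the chunkSize property, not documented behaviour.

```python
def iter_chunks(items, size=None, default_size=50):
    """Hypothetical sketch of fixed-size chunking over stored molecules."""
    # assumption: size=None falls back to a store-wide default chunk size
    step = size if size is not None else default_size
    for start in range(0, len(items), step):
        yield items[start:start + step]


smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
chunks = list(iter_chunks(smiles, size=2))
```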

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:: The absolute path to the metadata file.
Return type:: str

abstract property name: str: Get the name of the storage.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None]

Process the molecules in this instance with a given MolProcessor.

Parameters:

processor (MolProcessor) – The processor to use.
proc_args (tuple, optional) – Additional arguments to pass to the processor.
proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.
mol_type (str, optional) – The type of molecule to process.
add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator

abstract property randomState: int: Get the random state for the object.

abstract reload(): Reset the current state by reloading from storage.

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:: name (str) – The name of the property.

abstract save() → str

Save current state to storage and return the path to the serialized file.

Returns:: The path to the serialized file.
Return type:: str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) → PropSearchable

Search the molecules within this MoleculeDataSet on a property value.

Parameters:

prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

PropSearchable
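
Since the search returns another searchable instance, searches can be chained. The sketch below shows that pattern on a hypothetical list-of-dicts table (PropTable is not part of the library). The meaning of exact=False is not spelled out above; substring matching on the string form of each value is assumed here.

```python
class PropTable:
    """Hypothetical list-of-dicts table; not the library's PropSearchable."""

    def __init__(self, rows):
        self.rows = list(rows)

    def searchOnProperty(self, prop_name, values, exact=False):
        def matches(cell):
            if exact:
                return cell in values
            # assumption: a non-exact search matches substrings
            return any(str(value) in str(cell) for value in values)

        # return a new table so that the result can be filtered further
        return PropTable(row for row in self.rows if matches(row[prop_name]))


table = PropTable([
    {"ID": "a", "target": "ADORA2A", "year": 2019},
    {"ID": "b", "target": "ADORA2B", "year": 2021},
    {"ID": "c", "target": "DRD2", "year": 2021},
])
hits = table.searchOnProperty("target", ["ADORA"]).searchOnProperty(
    "year", [2021], exact=True
)
hit_ids = [row["ID"] for row in hits.rows]
```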

abstract searchWithSMARTS(patterns: list[str]) → SMARTSSearchable

Search the molecules within this instance with SMARTS patterns.

Parameters:: patterns – List of SMARTS patterns to search with.
Returns:: Another instance that can be filtered further.
Return type:: SMARTSSearchable

abstract property smiles: Generator[str, None, None]

Get the SMILES strings of the molecules in this instance.

Returns:: Generator of SMILES strings.
Return type:: Generator[str, None, None]

abstract property smilesProp: str: Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:: The standardizer used by the store.
Return type:: ChemStandardizer

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

qsprpred.data.tables.interfaces.qspr_data_set module

class qsprpred.data.tables.interfaces.qspr_data_set.QSPRDataSet[source]

Bases: MoleculeDataSet, ABC

Interface for storing and managing QSPR-specific data sets.

abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)

Add descriptors to the dataset.

Parameters:

descriptors (list[DescriptorSet]) – The descriptors to add.
args – Additional positional arguments to be passed to each descriptor set.
kwargs – Additional keyword arguments to be passed to each descriptor set.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:

ids (list) – The IDs of the entries to add.
props (dict) – The properties to add.
raise_on_existing (bool) – If True, an exception is raised if an entry already exists.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:

name (str) – The name of the property.
data (list) – The data of the property.
ids (list, optional) – The IDs of the entries to add the property for.

abstract addTargetProperty(prop: TargetSpec | dict, drop_empty: bool = True)[source]

Add a target property to the dataset.

Parameters:

prop (TargetSpec | dict) – the target property to add, given as a TargetSpec or a dict
drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) → Generator[Iterable[Any], None, None]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:

func (callable) – The function to apply.
func_args (list, optional) – The positional arguments of the function.
func_kwargs (dict, optional) – The keyword arguments of the function.
on_props (list, optional) – The properties to apply the function on.
as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:: identifier (ChemIdentifier) – The identifier to apply.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:: standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkSize: int: The size of the chunks to iterate over.

abstract clear(): Delete entries in the persistent storage.

abstract property descriptorSets: list[DescriptorSet]

Get the descriptor sets that are currently in the storage.

Returns:: a list of descriptor sets

abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])

Drop descriptor sets from the storage.

Parameters:: descriptors – The descriptor sets to drop.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:: ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:: filename (str) – path to the JSON file
Returns:: new instance of the class
Return type:: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:: json (str) – JSON string of the object
Returns:: reconstructed object
Return type:: obj (object)

abstract getDF() → DataFrame

Get the stored properties as a pandas DataFrame.

Returns:: The data as a pandas DataFrame.
Return type:: pd.DataFrame

abstract getDescriptorNames() → list[str]

Get the names of the descriptors that are currently in the storage.

Returns:: a list of descriptor names

abstract getDescriptors() → DataFrame

Get the table of descriptors that are currently in the storage.

Returns:: a pd.DataFrame with the descriptors

abstract getProperties() → list[str]: Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any]

Get values of a given property.

Parameters:

name (str) – The name of the property.
ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) → PropertyStorage

Get a subset of the storage for the given properties.

Parameters:

subset (list) – The list of property names to include in the subset.
ids (list, optional) – The IDs of the entries to include in the subset.
name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

abstract getSummary() → DataFrame

Make a summary with some statistics about this object or action.

Returns:: A dataframe with the summary statistics.
Return type:: pd.DataFrame

getTargetPropertiesNames() → list[str][source]

Get the names of the target properties.

Returns:: list of target property names
Return type:: list[str]

abstract hasDescriptors(): Indicates if the storage has descriptors.

abstract hasProperty(name: str) → bool

Check whether a property is present in the data frame.

Parameters:: name (str) – Name of the property.
Returns:: Whether the property is present.
Return type:: bool

abstract property idProp: str: Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:: The identifier used by this instance.
Return type:: ChemIdentifier

abstract property isMultiTask: bool: Indicates if the dataset is a multi-task dataset.

abstract iterChunks(size: int | None = None, on_props: list | None = None) → Generator[list[Any], None, None]

Iterate over chunks of molecules across the store.

Returns:: an iterable of lists of stored molecules

abstract makeClassification(target_property: str, threshold: float | list[float])[source]

Make this a classification dataset for the given target property.

Parameters:

target_property (str) – The name of the target property.
threshold (float | list[float]) – The threshold for the classification.

abstract makeRegression(target_property: str)[source]

Make this a regression dataset for the given target property. This is only possible if the target property was previously converted to classification.

Parameters:: target_property (str) – The name of the target property.
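
The relationship between makeClassification and makeRegression can be sketched on a single hypothetical target column. Several choices below are assumptions rather than documented behaviour: a single float is treated as a binary cut-off with values at or above it labelled 1, a list is read as ascending class boundaries, and calling makeRegression on a column that was never converted raises ValueError.

```python
from bisect import bisect_right


class TargetColumn:
    """Hypothetical single-target column; not the library's data structure."""

    def __init__(self, values):
        self.values = list(values)
        self._original = None  # continuous values kept while in classification mode

    def makeClassification(self, threshold):
        continuous = self._original if self._original is not None else self.values
        if isinstance(threshold, (int, float)):
            # assumption: values at or above the threshold form the positive class
            labels = [int(value >= threshold) for value in continuous]
        else:
            # assumption: a list holds ascending boundaries between classes
            edges = sorted(threshold)
            labels = [bisect_right(edges, value) for value in continuous]
        self._original = list(continuous)
        self.values = labels

    def makeRegression(self):
        # only possible after an earlier conversion to classification
        if self._original is None:
            raise ValueError("Target was never converted to classification.")
        self.values, self._original = self._original, None


column = TargetColumn([4.2, 6.5, 7.9, 5.1])
column.makeClassification(6.5)
binary = list(column.values)
column.makeClassification([5.0, 7.0])
multi = list(column.values)
column.makeRegression()
restored = list(column.values)
```

Keeping the continuous values alongside the labels is what makes the conversion reversible; an implementation that discarded them could not offer makeRegression at all.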

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:: The absolute path to the metadata file.
Return type:: str

abstract property name: str: Get the name of the storage.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None]

Process the molecules in this instance with a given MolProcessor.

Parameters:

processor (MolProcessor) – The processor to use.
proc_args (tuple, optional) – Additional arguments to pass to the processor.
proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.
mol_type (str, optional) – The type of molecule to process.
add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator

abstract property randomState: int: Get the random state for the object.

abstract reload(): Reset the current state by reloading from storage.

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:: name (str) – The name of the property.

abstract restoreTargetProperty(prop: TargetSpec | str)[source]

Restore a target property to the original state.

Parameters:: prop (TargetSpec | str) – The target property to restore.

abstract save() → str

Save current state to storage and return the path to the serialized file.

Returns:: The path to the serialized file.
Return type:: str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) → PropSearchable

Search the molecules within this MoleculeDataSet on a property value.

Parameters:

prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

PropSearchable

abstract searchWithSMARTS(patterns: list[str]) → SMARTSSearchable

Search the molecules within this instance with SMARTS patterns.

Parameters:: patterns – List of SMARTS patterns to search with.
Returns:: Another instance that can be filtered further.
Return type:: SMARTSSearchable

abstract setTargetProperties(target_props: list[TargetSpec | dict], drop_empty: bool = True)[source]

Set the target properties for the dataset.

Parameters:

target_props (list[TargetSpec | dict]) – The target properties to add.
drop_empty (bool) – If True, drop rows with missing target properties.

abstract property smiles: Generator[str, None, None]

Get the SMILES strings of the molecules in this instance.

Returns:: Generator of SMILES strings.
Return type:: Generator[str, None, None]

abstract property smilesProp: str: Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:: The standardizer used by the store.
Return type:: ChemStandardizer

abstract property targetProperties: list[TargetSpec]

Get the target properties of the dataset.

Returns:: list of target properties
Return type:: list[TargetSpec]

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:: filename (str) – filename to save object to
Returns:: absolute path to the saved JSON file of the object
Return type:: filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:: JSON string of the object
Return type:: json (str)

abstract unsetTargetProperty(name: str | TargetSpec)[source]

Unset the target property with the given name.

Parameters:: name (str | TargetSpec) – name of the target property to unset
