qsprpred.data.tables.interfaces package

Submodules

qsprpred.data.tables.interfaces.data_set_dependent module

class qsprpred.data.tables.interfaces.data_set_dependent.DataSetDependent(dataset: QSPRDataSet | None = None, **kwargs: Any)[source]

Bases: JSONSerializable

Classes that need an attached QSPRDataSet should inherit from this class; the data set is then supplied to them through this API.

Variables:

dataSet (QSPRDataSet) – The data set attached to this object.

Initialize the object with a data set.

Parameters:

dataset (QSPRDataSet, optional) – The data set to attach to this object. Defaults to None.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDataSet() QSPRDataSet[source]

Get the data set attached to this object.

Returns:

The data set attached to this object

Return type:

QSPRDataSet

Raises:

ValueError – If no data set is attached to this object.

property hasDataSet: bool

Indicates if this object has a data set attached to it.

setDataSet(dataset: QSPRDataSet | None) None[source]

Set the data set for this object.
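
The attach-and-query pattern documented for this class can be sketched with a stand-in. This is a minimal illustration, not the library's implementation: `DependentSketch` is a hypothetical class, and a plain dictionary stands in for a QSPRDataSet so the example runs without QSPRpred installed.

```python
from typing import Any, Optional


class DependentSketch:
    """Hypothetical stand-in mimicking the documented DataSetDependent API."""

    def __init__(self, dataset: Optional[Any] = None):
        self.dataSet = dataset

    @property
    def hasDataSet(self) -> bool:
        # True only while a data set is attached.
        return self.dataSet is not None

    def setDataSet(self, dataset: Optional[Any]) -> None:
        # Assumption: passing None detaches the current data set.
        self.dataSet = dataset

    def getDataSet(self) -> Any:
        # Mirrors the documented ValueError when nothing is attached.
        if not self.hasDataSet:
            raise ValueError("No data set is attached to this object.")
        return self.dataSet


dep = DependentSketch()
print(dep.hasDataSet)  # False
dep.setDataSet({"name": "demo"})  # any object stands in for a QSPRDataSet
print(dep.getDataSet()["name"])  # demo
```

Checking `hasDataSet` before calling `getDataSet()` avoids having to catch the ValueError in code paths where the data set is optional.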

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
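
The four serialization methods form a round trip: `toJSON`/`fromJSON` work on strings, and `toFile`/`fromFile` work on files. The sketch below shows that contract with the standard `json` module only. `SerializableSketch` and its two attributes are hypothetical, and the real JSONSerializable base class may encode objects differently.

```python
import json
import os
import tempfile


class SerializableSketch:
    """Hypothetical stand-in for a JSON-serializable object."""

    def __init__(self, name: str, chunk_size: int):
        self.name = name
        self.chunk_size = chunk_size

    def toJSON(self) -> str:
        # Store every attribute needed to rebuild the object.
        return json.dumps({"name": self.name, "chunk_size": self.chunk_size})

    @classmethod
    def fromJSON(cls, json_str: str) -> "SerializableSketch":
        return cls(**json.loads(json_str))

    def toFile(self, filename: str) -> str:
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        # Follow the documented contract: return the absolute path.
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename: str) -> "SerializableSketch":
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp:
    path = SerializableSketch("demo", 50).toFile(os.path.join(tmp, "obj.json"))
    restored = SerializableSketch.fromFile(path)

print(restored.name, restored.chunk_size)  # demo 50
```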

qsprpred.data.tables.interfaces.molecule_data_set module

class qsprpred.data.tables.interfaces.molecule_data_set.MoleculeDataSet[source]

Bases: PropertyStorage, DescriptorProvider, MolProcessable, SMARTSSearchable, Summarizable, Randomized, Identifiable, Standardizable, ABC

Interface for storing and managing chemical data sets for machine learning.

abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)

Add descriptors to the dataset.

Parameters:
  • descriptors (list[DescriptorSet]) – The descriptors to add.

  • args – Additional positional arguments to be passed to each descriptor set.

  • kwargs – Additional keyword arguments to be passed to each descriptor set.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – If True, an exception is raised if an entry already exists. Defaults to True.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.
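
The effect of `raise_on_existing` can be sketched with a dictionary-backed store. `EntryStoreSketch` is hypothetical. The overwrite behaviour when the flag is False is an assumption, since this reference only specifies the error case.

```python
class EntryStoreSketch:
    """Hypothetical in-memory store keyed by entry ID."""

    def __init__(self):
        self.rows = {}

    def addEntries(self, ids, props, raise_on_existing=True):
        existing = [entry_id for entry_id in ids if entry_id in self.rows]
        if existing and raise_on_existing:
            raise ValueError(f"Entries already exist: {existing}")
        # Assumption: with raise_on_existing=False, existing rows are overwritten.
        for pos, entry_id in enumerate(ids):
            self.rows[entry_id] = {name: values[pos] for name, values in props.items()}


store = EntryStoreSketch()
store.addEntries(["mol_0", "mol_1"], {"SMILES": ["CCO", "CCN"]})
store.addEntries(["mol_1"], {"SMILES": ["CCC"]}, raise_on_existing=False)
print(store.rows["mol_1"]["SMILES"])  # CCC
```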

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) Generator[Iterable[Any], None, None]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

  • on_props (list, optional) – The properties to apply the function on.

  • as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.
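
The calling convention described for `apply` (each chunk of properties passed as the first positional argument, followed by `func_args` and `func_kwargs`) can be sketched as a plain generator. `apply_sketch`, the dict-of-lists chunk format, and the `chunk_size` parameter are assumptions made for illustration. The `as_df` option is omitted to keep the example free of dependencies.

```python
def apply_sketch(props, func, func_args=None, func_kwargs=None,
                 on_props=None, chunk_size=2):
    """Yield func(chunk, *func_args, **func_kwargs) for each chunk of rows."""
    names = list(on_props) if on_props is not None else list(props)
    n_rows = len(next(iter(props.values()))) if props else 0
    for start in range(0, n_rows, chunk_size):
        # Each chunk is a single object holding only the selected properties.
        chunk = {name: props[name][start:start + chunk_size] for name in names}
        yield func(chunk, *(func_args or []), **(func_kwargs or {}))


def scaled_mean(chunk, prop, scale=1.0):
    values = chunk[prop]
    return scale * sum(values) / len(values)


data = {"SMILES": ["CCO", "CCN", "CCC"], "score": [1.0, 2.0, 4.0]}
results = list(
    apply_sketch(data, scaled_mean, func_args=["score"],
                 func_kwargs={"scale": 2.0}, on_props=("score",))
)
print(results)  # [3.0, 8.0]
```

Because the result is a generator, no chunk is processed until the caller iterates over it.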

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:

identifier (ChemIdentifier) – The identifier to apply.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract property descriptorSets: list[DescriptorSet]

Get the descriptor sets that are currently in the storage.

Returns:

a list of descriptor sets

abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])

Drop descriptor sets from the storage.

Parameters:

descriptors – The descriptor sets to drop.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getDescriptorNames() list[str]

Get the names of the descriptors that are currently in the storage.

Returns:

a list of descriptor names

abstract getDescriptors() DataFrame

Get the table of descriptors that are currently in the storage.

Returns:

a pd.DataFrame with the descriptors

abstract getProperties() list[str]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage
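
Subsetting by property names and, optionally, by entry IDs can be sketched on a plain dictionary. `get_subset_sketch` and the nested-dictionary table layout are hypothetical. A real implementation returns a PropertyStorage rather than a dict.

```python
def get_subset_sketch(table, subset, ids=None):
    """Return a new table limited to the given property names and entry IDs."""
    wanted = None if ids is None else set(ids)
    return {
        entry_id: {prop: row[prop] for prop in subset}
        for entry_id, row in table.items()
        if wanted is None or entry_id in wanted
    }


table = {
    "mol_0": {"SMILES": "CCO", "pIC50": 5.1, "source": "A"},
    "mol_1": {"SMILES": "CCN", "pIC50": 6.3, "source": "B"},
}
print(get_subset_sketch(table, ["SMILES", "pIC50"], ids=["mol_1"]))
# {'mol_1': {'SMILES': 'CCN', 'pIC50': 6.3}}
```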

abstract getSummary() DataFrame

Make a summary with some statistics about this object or action.

Returns:

A dataframe with the summary statistics.

Return type:

(pd.DataFrame)

abstract hasDescriptors()

Indicates if the storage has descriptors.

abstract hasProperty(name: str) bool

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:

The identifier used by this instance.

Return type:

ChemIdentifier

abstract iterChunks(size: int | None = None, on_props: list | None = None) Generator[list[Any], None, None]

Iterate over chunks of molecules across the store.

Returns:

an iterable of lists of stored molecules
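
Chunked iteration can be sketched as a generator over list slices. `iter_chunks_sketch` is hypothetical, and falling back to a default size when `size` is None is an assumption modelled on the `chunkSize` property. The `on_props` argument is left out here.

```python
def iter_chunks_sketch(items, size=None, default_size=3):
    """Yield consecutive lists of at most `size` items."""
    # Assumption: size=None falls back to a store-wide default chunk size.
    size = default_size if size is None else size
    if size < 1:
        raise ValueError("Chunk size must be a positive integer.")
    for start in range(0, len(items), size):
        yield items[start:start + size]


smiles = ["CCO", "CCN", "CCC", "CCCl"]
print(list(iter_chunks_sketch(smiles)))
# [['CCO', 'CCN', 'CCC'], ['CCCl']]
```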

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract property name: str

Get the name of the storage.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None]

Process the molecules in this instance with a given MolProcessor.

Parameters:
  • processor (MolProcessor) – The processor to use.

  • proc_args (tuple, optional) – Additional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – The type of molecule to process.

  • add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator

abstract property randomState: int

Get the random state for the object.

abstract reload()

Reset the current state by reloading from storage.

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the molecules within this MoleculeDataSet on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)
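
The difference between exact and non-exact searching can be sketched as follows. This reference does not define what a non-exact match is, so the substring matching used here is purely an assumption. `search_on_property_sketch` and the list-of-dicts row format are hypothetical as well.

```python
def search_on_property_sketch(rows, prop_name, values, exact=False):
    """Return the rows whose `prop_name` value matches any of `values`."""

    def matches(cell):
        if exact:
            return cell in values
        # Assumption: a non-exact match means substring containment.
        return any(str(value) in str(cell) for value in values)

    return [row for row in rows if matches(row[prop_name])]


rows = [
    {"ID": "mol_0", "source": "ChEMBL_31"},
    {"ID": "mol_1", "source": "PubChem"},
    {"ID": "mol_2", "source": "ChEMBL_33"},
]
hits = search_on_property_sketch(rows, "source", ["ChEMBL"])
print([row["ID"] for row in hits])  # ['mol_0', 'mol_2']
```

Since the documented method returns another searchable instance, calls of this kind can be chained to narrow a selection step by step.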

abstract searchWithSMARTS(patterns: list[str]) SMARTSSearchable

Search the molecules within this instance with SMARTS patterns.

Parameters:

patterns – List of SMARTS patterns to search with.

Returns:

Another instance that can be filtered further.

Return type:

(SMARTSSearchable)

abstract property smiles: Generator[str, None, None]

Get the SMILES strings of the molecules in this instance.

Returns:

Generator of SMILES strings.

Return type:

Generator[str, None, None]

abstract property smilesProp: str

Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

qsprpred.data.tables.interfaces.qspr_data_set module

class qsprpred.data.tables.interfaces.qspr_data_set.QSPRDataSet[source]

Bases: MoleculeDataSet, ABC

Interface for storing and managing QSPR-specific data sets.

abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)

Add descriptors to the dataset.

Parameters:
  • descriptors (list[DescriptorSet]) – The descriptors to add.

  • args – Additional positional arguments to be passed to each descriptor set.

  • kwargs – Additional keyword arguments to be passed to each descriptor set.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – If True, an exception is raised if an entry already exists. Defaults to True.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

abstract addTargetProperty(prop: TargetSpec | dict, drop_empty: bool = True)[source]

Add a target property to the dataset.

Parameters:
  • prop (TargetSpec | dict) – the target property to add

  • drop_empty (bool) – whether to drop rows with empty target property values. Defaults to True.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) Generator[Iterable[Any], None, None]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

  • on_props (list, optional) – The properties to apply the function on.

  • as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:

identifier (ChemIdentifier) – The identifier to apply.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract property descriptorSets: list[DescriptorSet]

Get the descriptor sets that are currently in the storage.

Returns:

a list of descriptor sets

abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])

Drop descriptor sets from the storage.

Parameters:

descriptors – The descriptor sets to drop.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getDescriptorNames() list[str]

Get the names of the descriptors that are currently in the storage.

Returns:

a list of descriptor names

abstract getDescriptors() DataFrame

Get the table of descriptors that are currently in the storage.

Returns:

a pd.DataFrame with the descriptors

abstract getProperties() list[str]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

abstract getSummary() DataFrame

Make a summary with some statistics about this object or action.

Returns:

A dataframe with the summary statistics.

Return type:

(pd.DataFrame)

getTargetPropertiesNames() list[str][source]

Get the names of the target properties.

Returns:

list of target property names

Return type:

list[str]

abstract hasDescriptors()

Indicates if the storage has descriptors.

abstract hasProperty(name: str) bool

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:

The identifier used by this instance.

Return type:

ChemIdentifier

abstract property isMultiTask: bool

Indicates if the dataset is a multi-task dataset.

abstract iterChunks(size: int | None = None, on_props: list | None = None) Generator[list[Any], None, None]

Iterate over chunks of molecules across the store.

Returns:

an iterable of lists of stored molecules

abstract makeClassification(target_property: str, threshold: float | list[float])[source]

Make this a classification dataset for the given target property.

Parameters:
  • target_property (str) – The name of the target property.

  • threshold (float | list[float]) – The threshold for the classification.
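
One way to read the two accepted threshold forms is that a single float gives a binary split and a list of floats gives several bins. The sketch below illustrates that reading only. `to_class_sketch` is hypothetical, and both the boundary handling (strictly greater than for the binary case) and the treatment of list values as inner cut points are assumptions that this reference does not confirm.

```python
from bisect import bisect_right


def to_class_sketch(values, threshold):
    """Convert continuous values to integer class labels."""
    if isinstance(threshold, (int, float)):
        # Assumption: values strictly above the threshold form the positive class.
        return [int(value > threshold) for value in values]
    # Assumption: list entries are inner cut points, giving len(threshold) + 1 classes.
    edges = sorted(threshold)
    return [bisect_right(edges, value) for value in values]


activities = [4.2, 6.5, 8.1]
print(to_class_sketch(activities, 6.5))         # [0, 0, 1]
print(to_class_sketch(activities, [5.0, 7.0]))  # [0, 1, 2]
```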

abstract makeRegression(target_property: str)[source]

Make this a regression dataset for the given target property.

This is only possible if the target property was previously converted to classification.

Parameters:

target_property (str) – The name of the target property.

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract property name: str

Get the name of the storage.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None]

Process the molecules in this instance with a given MolProcessor.

Parameters:
  • processor (MolProcessor) – The processor to use.

  • proc_args (tuple, optional) – Additional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – The type of molecule to process.

  • add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator

abstract property randomState: int

Get the random state for the object.

abstract reload()

Reset the current state by reloading from storage.

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract restoreTargetProperty(prop: TargetSpec | str)[source]

Restore a target property to the original state.

Parameters:

prop (TargetSpec | str) – The target property to restore.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the molecules within this MoleculeDataSet on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)

abstract searchWithSMARTS(patterns: list[str]) SMARTSSearchable

Search the molecules within this instance with SMARTS patterns.

Parameters:

patterns – List of SMARTS patterns to search with.

Returns:

Another instance that can be filtered further.

Return type:

(SMARTSSearchable)

abstract setTargetProperties(target_props: list[TargetSpec | dict], drop_empty: bool = True)[source]

Set the target properties for the dataset.

Parameters:
  • target_props (list[TargetSpec | dict]) – The target properties to add.

  • drop_empty (bool) – If True, drop rows with missing target properties.

abstract property smiles: Generator[str, None, None]

Get the SMILES strings of the molecules in this instance.

Returns:

Generator of SMILES strings.

Return type:

Generator[str, None, None]

abstract property smilesProp: str

Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer

abstract property targetProperties: list[TargetSpec]

Get the target properties of the dataset.

Returns:

list of target properties

Return type:

(list)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

abstract unsetTargetProperty(name: str | TargetSpec)[source]

Unset the target property with the given name.

Parameters:

name (str | TargetSpec) – name of the target property to unset

Module contents