qsprpred.data.storage.interfaces package

Submodules

qsprpred.data.storage.interfaces.chem_store module

class qsprpred.data.storage.interfaces.chem_store.ChemStore[source]

Bases: PropertyStorage, MolProcessable, Identifiable, Standardizable, ABC

Interface for storing and managing chemical data.

It can serve as an abstraction layer over different storage backends that hold molecules together with their properties or other metadata. See the documentation of the specific implementation for usage details, and the base classes of this class for details on the inherited functionality.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – If True, an exception is raised if an entry already exists. If False, existing entries are overwritten.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.

abstract addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) list[StoredMol][source]

Add molecules to the store.

This method should not perform any standardization or identifier calculation. The add_mol_from_smiles method should be used instead if automatic standardization and identification should be performed before storage.

Parameters:
  • smiles (Iterable[str]) – molecules to add as SMILES

  • props (dict, optional) – additional metadata to store with the molecules

  • args – Additional positional arguments to be passed to each molecule.

  • kwargs – Additional keyword arguments to be passed to each molecule.

Returns:

instances of the added molecules

Return type:

list[StoredMol]

Raises:

ValueError – if the molecules cannot be added

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[Iterable[StoredMol | str | Mol | DataFrame], None, None][source]

Apply a function on all or selected properties of the chunks of data. Each chunk, in the requested chunk type, is supplied as the first positional argument to the function, with the selected properties attached to it. How the properties are attached to the objects in a chunk is up to the downstream implementation.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

  • on_props (tuple, optional) – The properties to apply the function on.

  • chunk_type (str, optional) – The type of chunks to yield.

Returns:

A generator that yields the results of the function applied to each chunk.

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:

identifier (ChemIdentifier) – The identifier to apply.
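
The deduplication that applyIdentifier implies can be sketched without the library. The snippet below is only an illustration: toy_identifier and apply_identifier are hypothetical helpers, not part of qsprpred, and the sketch assumes that an identifier behaves like a function mapping a SMILES string to an ID. A real identifier would presumably be chemistry-aware, whereas this one hashes the raw string.

```python
import hashlib


def toy_identifier(smiles: str) -> str:
    """Hypothetical identifier: a short hash of the raw SMILES string."""
    return hashlib.sha256(smiles.encode("utf-8")).hexdigest()[:12]


def apply_identifier(smiles_list, identifier):
    """Keep the first SMILES seen for every identifier value."""
    unique = {}
    for smi in smiles_list:
        # setdefault leaves the first stored SMILES untouched on repeats
        unique.setdefault(identifier(smi), smi)
    return unique


deduplicated = apply_identifier(["CCO", "CCN", "CCO"], toy_identifier)
```

Because the toy identifier works on the raw string, it would treat two different SMILES spellings of the same molecule as distinct, which is why standardization normally comes first.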

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getMol(mol_id: str) StoredMol[source]

Get a molecule from the store using its ID.

Parameters:

mol_id (str) – identifier of the molecule to search

Returns:

instance of the molecule

Return type:

StoredMol

abstract getMolCount()[source]

Get the number of molecules in the store.

Returns:

(int) number of molecules

abstract getMolIDs() tuple[str, ...][source]

Get all molecule IDs in the store.

Returns:

molecule IDs

Return type:

tuple[str]

abstract getProperties() list[str]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

abstract hasProperty(name: str) bool

Check whether a property is present in the storage.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:

The identifier used by this instance.

Return type:

ChemIdentifier

abstract iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None][source]

Iterate over chunks of molecules across the store.

Parameters:
  • size (int, optional) – The size of the chunks.

  • on_props (list, optional) – The properties to include in the chunks.

  • chunk_type (str, optional) – The type of chunks to yield.

Returns:

an iterable of lists of stored molecules

abstract iterMols() Generator[StoredMol, None, None][source]

Iterate over all molecules in the store.

Returns:

iterator over StoredMol instances

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

property nMols: int

Number of molecules in storage.

abstract property name: str

Get the name of the storage.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None]

Process the molecules in this instance with a given MolProcessor.

Parameters:
  • processor (MolProcessor) – The processor to use.

  • proc_args (tuple, optional) – Additional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – The type of molecule to process.

  • add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator

abstract reload()

Reset the current state by reloading from storage.

abstract removeMol(mol_id: str)[source]

Remove a molecule from the store.

Parameters:

mol_id (str) – identifier of the molecule to remove

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the molecules within this instance on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in storage.

abstract property smilesProp: str

Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

qsprpred.data.storage.interfaces.chunk_iterable module

class qsprpred.data.storage.interfaces.chunk_iterable.ChunkIterable[source]

Bases: ABC

Objects that can be iterated over and processed in chunks.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None) Generator[Iterable[Any], None, None][source]

Apply a function on chunks of data. The chunks are supplied as the first positional argument to the function. The format of the chunks is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

Returns:

A generator that yields the results of the function applied to each chunk.

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract iterChunks(size: int | None = None) Generator[list[Any], None, None][source]

Iterate over chunks of the storage.

Parameters:

size (int) – The size of each chunk.

Returns:

A generator that yields chunks of the storage in any format.
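
As a rough illustration of the chunking contract described above, the following sketch implements iterChunks and apply on top of a plain list. ToyChunks and count_chars are hypothetical names used only here; a real implementation would subclass ChunkIterable and is free to choose another chunk format.

```python
class ToyChunks:
    """Hypothetical list-backed stand-in for a ChunkIterable."""

    def __init__(self, items, chunk_size=2):
        self._items = list(items)
        self._chunk_size = chunk_size

    @property
    def chunkSize(self):
        return self._chunk_size

    def iterChunks(self, size=None):
        # fall back to the default chunk size when none is given
        size = size or self.chunkSize
        for start in range(0, len(self._items), size):
            yield self._items[start:start + size]

    def apply(self, func, func_args=None, func_kwargs=None):
        # the chunk is always the first positional argument of func
        for chunk in self.iterChunks():
            yield func(chunk, *(func_args or []), **(func_kwargs or {}))


def count_chars(chunk, offset=0):
    """Toy function: string length of every item in a chunk."""
    return [len(item) + offset for item in chunk]


chunks = ToyChunks(["CCO", "c1ccccc1", "CC(=O)O", "CN", "O"])
lengths = list(chunks.apply(count_chars, func_kwargs={"offset": 0}))
```

Both methods are generators, so nothing is computed until the caller iterates over the result.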

qsprpred.data.storage.interfaces.data_store module

class qsprpred.data.storage.interfaces.data_store.DataStorage[source]

Bases: JSONSerializable, ABC

Abstract base class defining an API to interact with persistent data storage. The data itself does not have to be stored locally: the metadata file may instead hold only the connection details of a database or REST API. The class assumes the existence of a metaFile attribute that points to a metadata file describing the instance, and it should be possible to initialize the instance from this file.

abstract clear()[source]

Delete entries in the persistent storage.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract reload()[source]

Reset the current state by reloading from storage.

abstract save() str[source]

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
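
The serialization round trip described by toJSON, fromJSON, toFile and fromFile can be sketched with the standard library alone. ToySerializable below is a hypothetical stand-in, not the library's JSONSerializable; the state it stores (a name and a dictionary of entries) and the file name are assumptions made for the example.

```python
import json
import os
import tempfile


class ToySerializable:
    """Hypothetical stand-in that mimics the JSON round-trip contract."""

    def __init__(self, name, entries=None):
        self.name = name
        self.entries = entries or {}

    def toJSON(self):
        # everything needed to rebuild the object goes into the string
        return json.dumps({"name": self.name, "entries": self.entries})

    @classmethod
    def fromJSON(cls, json_str):
        state = json.loads(json_str)
        return cls(state["name"], state["entries"])

    def toFile(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    original = ToySerializable("demo", {"m1": {"pIC50": 6.1}})
    path = original.toFile(os.path.join(tmp_dir, "demo_meta.json"))
    restored = ToySerializable.fromFile(path)
```

In this sketch the JSON holds the data itself. An implementation backed by a database or REST API could store only connection details in the same way.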

qsprpred.data.storage.interfaces.descriptor_provider module

class qsprpred.data.storage.interfaces.descriptor_provider.DescriptorProvider[source]

Bases: ABC

Classes that implement this interface provide a way to store and retrieve molecular descriptors or other high-dimensional embeddings of molecules. It assumes that descriptors are divided into sets of related descriptors and described by a DescriptorSet object.

abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)[source]

Add descriptors to the dataset.

Parameters:
  • descriptors (list[DescriptorSet]) – The descriptors to add.

  • args – Additional positional arguments to be passed to each descriptor set.

  • kwargs – Additional keyword arguments to be passed to each descriptor set.

abstract property descriptorSets: list[DescriptorSet]

Get the descriptor sets that are currently in the storage.

Returns:

a list of descriptor sets

abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])[source]

Drop descriptor sets from the storage.

Parameters:

descriptors – The descriptor sets to drop.

abstract getDescriptorNames() list[str][source]

Get the names of the descriptors that are currently in the storage.

Returns:

a list of descriptor names

abstract getDescriptors() DataFrame[source]

Get the table of descriptors that are currently in the storage.

Returns:

a pd.DataFrame with the descriptors

abstract hasDescriptors()[source]

Indicates if the storage has descriptors.
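
To make the interplay of these methods concrete, here is a minimal in-memory sketch. ToyDescriptorSet and ToyDescriptorProvider are hypothetical and do not reproduce the library's DescriptorSet API; the "descriptors" are simple string counts used as placeholders, and getDescriptors returns a nested dict instead of a DataFrame to keep the example free of dependencies.

```python
class ToyDescriptorSet:
    """Hypothetical descriptor set: a name plus one function per descriptor."""

    def __init__(self, name, funcs):
        self.name = name
        self.funcs = funcs  # descriptor name -> function of a SMILES string


class ToyDescriptorProvider:
    """Hypothetical stand-in for a DescriptorProvider."""

    def __init__(self, smiles_by_id):
        self._smiles = dict(smiles_by_id)
        self._sets = {}
        self._table = {mol_id: {} for mol_id in self._smiles}

    def addDescriptors(self, descriptors):
        self._sets[descriptors.name] = descriptors
        for mol_id, smi in self._smiles.items():
            for desc_name, func in descriptors.funcs.items():
                self._table[mol_id][f"{descriptors.name}_{desc_name}"] = func(smi)

    @property
    def descriptorSets(self):
        return list(self._sets.values())

    def getDescriptorNames(self):
        return [f"{s.name}_{d}" for s in self._sets.values() for d in s.funcs]

    def getDescriptors(self):
        return {mol_id: dict(row) for mol_id, row in self._table.items()}

    def hasDescriptors(self):
        return bool(self._sets)

    def dropDescriptorSets(self, descriptors):
        for item in descriptors:
            # accept either a set object or its name
            name = item if isinstance(item, str) else item.name
            dropped = self._sets.pop(name)
            for row in self._table.values():
                for desc_name in dropped.funcs:
                    del row[f"{name}_{desc_name}"]


provider = ToyDescriptorProvider({"m1": "CCO", "m2": "CCN"})
counts = ToyDescriptorSet("counts", {"length": len, "n_oxygen": lambda s: s.count("O")})
provider.addDescriptors(counts)
names = provider.getDescriptorNames()
table = provider.getDescriptors()
had_descriptors = provider.hasDescriptors()
provider.dropDescriptorSets(["counts"])
```

Prefixing each column with the set name is one possible way to keep descriptors from different sets apart; it is a choice of this sketch, not a documented convention.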

qsprpred.data.storage.interfaces.mol_processable module

class qsprpred.data.storage.interfaces.mol_processable.MolProcessable[source]

Bases: ABC

Interface for processing molecules.

abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) Generator[Any, None, None][source]

Process the molecules in this instance with a given MolProcessor.

Parameters:
  • processor (MolProcessor) – The processor to use.

  • proc_args (tuple, optional) – Additional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – The type of molecule to process.

  • add_props (list, optional) – Additional properties to add to the dataset.

Returns:

A generator that yields the processed molecules.

Return type:

Generator
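
The following sketch shows one way processMols could behave. It assumes that a processor is a callable receiving a list of molecules and a dictionary of properties, followed by any extra arguments; that call convention, the ToyProcessable class, the tag_long function and the "ID" and "SMILES" keys are assumptions of this example. Unlike the interface, which defaults to 'mol', the toy only supports mol_type='smiles'.

```python
class ToyProcessable:
    """Hypothetical stand-in for a MolProcessable; not part of qsprpred."""

    def __init__(self, rows, chunk_size=2):
        self._rows = rows  # list of dicts, each with "ID" and "SMILES"
        self._chunk_size = chunk_size

    def processMols(self, processor, proc_args=None, proc_kwargs=None,
                    mol_type="smiles", add_props=None):
        if mol_type != "smiles":
            raise NotImplementedError("this sketch only handles SMILES strings")
        for start in range(0, len(self._rows), self._chunk_size):
            chunk = self._rows[start:start + self._chunk_size]
            mols = [row["SMILES"] for row in chunk]
            # only the requested properties are handed to the processor
            props = {name: [row[name] for row in chunk] for name in (add_props or [])}
            yield processor(mols, props, *(proc_args or ()), **(proc_kwargs or {}))


def tag_long(mols, props, min_len=4):
    """Toy processor: flag SMILES strings with at least min_len characters."""
    return [(mol_id, len(smi) >= min_len) for mol_id, smi in zip(props["ID"], mols)]


rows = [
    {"ID": "m1", "SMILES": "CCO"},
    {"ID": "m2", "SMILES": "c1ccccc1"},
    {"ID": "m3", "SMILES": "CN"},
]
results = list(
    ToyProcessable(rows).processMols(tag_long, proc_kwargs={"min_len": 4}, add_props=["ID"])
)
```

The generator yields one result per chunk, so callers that need a flat list have to concatenate the chunk results themselves.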

qsprpred.data.storage.interfaces.property_storage module

class qsprpred.data.storage.interfaces.property_storage.PropertyStorage[source]

Bases: DataStorage, ChunkIterable, PropSearchable, ABC

A simple DataStorage that maps property names to arbitrary data. It is assumed that PropertyStorage stores entries with one or more properties attached to each entry. It is up to the downstream implementation to decide how the data is stored and how it is accessed as long as the interface is respected. See the methods of this class and the base classes for more details.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)[source]

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – If True, an exception is raised if an entry already exists. If False, existing entries are overwritten.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.
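
A minimal sketch of the raise_on_existing behaviour, using a dictionary as the backend. ToyPropertyStore is hypothetical and not part of qsprpred; in particular, updating the stored values when raise_on_existing is False is this example's reading of "overwrite".

```python
class ToyPropertyStore:
    """Hypothetical in-memory stand-in; not part of qsprpred."""

    def __init__(self):
        self._entries = {}

    def addEntries(self, ids, props, raise_on_existing=True):
        existing = [entry_id for entry_id in ids if entry_id in self._entries]
        if existing and raise_on_existing:
            raise ValueError(f"Entries already exist: {existing}")
        for pos, entry_id in enumerate(ids):
            # props maps a property name to one value per added entry
            row = {name: values[pos] for name, values in props.items()}
            self._entries.setdefault(entry_id, {}).update(row)


store = ToyPropertyStore()
store.addEntries(["m1", "m2"], {"pIC50": [6.1, 7.3]})

try:
    store.addEntries(["m2"], {"pIC50": [7.9]})
    raised = False
except ValueError:
    raised = True

# with raise_on_existing=False the existing entry is updated instead
store.addEntries(["m2"], {"pIC50": [7.9]}, raise_on_existing=False)
```

Checking for clashes before writing anything means a failed call leaves the store unchanged.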

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) Generator[Iterable[Any], None, None][source]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

  • on_props (tuple, optional) – The properties to apply the function on.

  • as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract dropEntries(ids: Iterable[str])[source]

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame[source]

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getProperties() list[str][source]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any][source]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage[source]

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage
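
The idea behind getSubset, restricting a storage to selected properties and optionally to selected entries, can be sketched as follows. ToyStore is a hypothetical dictionary-backed stand-in; whether a real implementation copies the data or returns a view is left to that implementation.

```python
class ToyStore:
    """Hypothetical dictionary-backed stand-in for a PropertyStorage."""

    def __init__(self, entries):
        # copy the rows so that a subset never shares state with its source
        self._entries = {entry_id: dict(row) for entry_id, row in entries.items()}

    def getProperties(self):
        names = []
        for row in self._entries.values():
            for name in row:
                if name not in names:
                    names.append(name)
        return names

    def getProperty(self, name, ids=None):
        ids = list(self._entries) if ids is None else ids
        return [self._entries[entry_id][name] for entry_id in ids]

    def getSubset(self, subset, ids=None):
        ids = list(self._entries) if ids is None else list(ids)
        return ToyStore(
            {entry_id: {prop: self._entries[entry_id][prop] for prop in subset}
             for entry_id in ids}
        )


full = ToyStore({
    "m1": {"SMILES": "CCO", "pIC50": 6.1, "year": 2019},
    "m2": {"SMILES": "CCN", "pIC50": 7.3, "year": 2020},
})
small = full.getSubset(["SMILES", "pIC50"], ids=["m2"])
```

Here the subset is an independent object of the same type, so it can be queried or subset again without affecting the original.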

abstract hasProperty(name: str) bool[source]

Check whether a property is present in the storage.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the molecule IDs.

abstract iterChunks(size: int | None = None, on_props: list | None = None) Generator[list[Any], None, None][source]

Iterate over chunks of molecules across the store.

Returns:

an iterable of lists of stored molecules

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract property name: str

Get the name of the storage.

abstract reload()

Reset the current state by reloading from storage.

abstract removeProperty(name: str)[source]

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the molecules within this instance on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

qsprpred.data.storage.interfaces.searchable module

class qsprpred.data.storage.interfaces.searchable.PropSearchable[source]

Bases: ABC

Interface for searching on properties.

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable[source]

Search the molecules within this instance on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)
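
Because searchOnProperty returns another searchable instance, searches can be chained. The sketch below illustrates this with a hypothetical ToySearchable class. It assumes that exact=True means equality with one of the values and that exact=False means a substring match on the string form of the property; the interface itself does not spell out the non-exact semantics.

```python
class ToySearchable:
    """Hypothetical stand-in for a PropSearchable over a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def searchOnProperty(self, prop_name, values, exact=False):
        def matches(cell):
            if exact:
                return cell in values
            # assumed non-exact semantics: substring match on the string form
            return any(str(value) in str(cell) for value in values)

        return ToySearchable(row for row in self.rows if matches(row[prop_name]))


data = ToySearchable([
    {"ID": "m1", "target": "kinase A", "year": 2019},
    {"ID": "m2", "target": "kinase B", "year": 2020},
    {"ID": "m3", "target": "GPCR", "year": 2020},
])

# chain a loose text search with an exact numeric search
hits = data.searchOnProperty("target", ["kinase"]).searchOnProperty("year", [2020], exact=True)
hit_ids = [row["ID"] for row in hits.rows]
```

Returning a new instance instead of filtering in place keeps the original data available for further, unrelated searches.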

class qsprpred.data.storage.interfaces.searchable.SMARTSSearchable[source]

Bases: ABC

Instances of this class can be searched with SMARTS patterns.

abstract searchWithSMARTS(patterns: list[str]) SMARTSSearchable[source]

Search the molecules within this instance with SMARTS patterns.

Parameters:

patterns – List of SMARTS patterns to search with.

Returns:

Another instance that can be filtered further.

Return type:

(SMARTSSearchable)

qsprpred.data.storage.interfaces.stored_mol module

class qsprpred.data.storage.interfaces.stored_mol.StoredMol[source]

Bases: ABC

A simple interface for a molecule that can be stored in a ChemStore. Molecules in the ChemStore have properties, representations, and can also have a parent molecule. Representations can be for example conformers, tautomers, or protomers of the parent molecule. Representations can also be used to encode docked poses with metadata attached as properties.

as_rd_mol() Mol[source]

Get the RDKit molecule object of the standardized representation of this instance.

Returns:

rdkit.Chem.Mol instance

abstract property id: str

Get the identifier of the molecule.

Returns:

The identifier of the molecule.

Return type:

str

abstract property origin: str

Get the name of the storage where the molecule resides.

Returns:

The name of the storage.

Return type:

str

abstract property parent: StoredMol | None

Get the parent molecule of this representation.

Returns:

The parent molecule of this representation as a StoredMol instance.

abstract property props: dict[str, Any] | None

Get the metadata of the molecule.

Returns:

The metadata of the molecule.

Return type:

dict

abstract property representations: list[StoredMol] | None

Get the representations of the molecule.

Returns:

The representations of the molecule.

Return type:

list

abstract property smiles: str

Get the SMILES of the molecule.

Returns:

The SMILES of the molecule.

Return type:

str
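
The parent and representation relationship described for StoredMol can be pictured with a small data class. ToyStoredMol and its addRepresentation helper are hypothetical and exist only for this example; the interface itself exposes these relations as read-only properties and does not prescribe how they are populated.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(eq=False)
class ToyStoredMol:
    """Hypothetical stand-in mirroring the StoredMol properties."""

    id: str
    smiles: str
    origin: str
    props: Optional[Dict[str, Any]] = None
    # parent is excluded from repr to keep the printed form short
    parent: Optional["ToyStoredMol"] = field(default=None, repr=False)
    representations: List["ToyStoredMol"] = field(default_factory=list)

    def addRepresentation(self, rep_id, smiles, props=None):
        """Hypothetical helper: attach a representation pointing back to self."""
        rep = ToyStoredMol(rep_id, smiles, self.origin, props, parent=self)
        self.representations.append(rep)
        return rep


# acetic acid as the parent, its deprotonated form as a protomer
acid = ToyStoredMol("m1", "CC(=O)O", origin="demo_store", props={"name": "acetic acid"})
anion = acid.addRepresentation("m1_p1", "CC(=O)[O-]", props={"charge": -1})
```

A representation carries its own ID, SMILES and properties, while sharing the origin of its parent.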

Module contents