qsprpred.data.storage.interfaces package
Submodules
qsprpred.data.storage.interfaces.chem_store module
- class qsprpred.data.storage.interfaces.chem_store.ChemStore
Bases: PropertyStorage, MolProcessable, Identifiable, Standardizable, ABC
Interface for storing and managing chemical data.
It can be used as an abstraction layer over different storage backends that hold molecules together with their properties or other metadata. For usage details, check the documentation of the specific implementation; for the inherited functionality, see the base classes of this class.
- abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)
Add entries to the storage.
- abstract addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) → list[StoredMol]
Add molecules to the store.
This method should not perform any standardization or identifier calculation. The add_mol_from_smiles method should be used instead if automatic standardization and identification should be performed before storage.
- Parameters:
- Returns:
instances of the added molecules
- Return type:
list[StoredMol]
- Raises:
ValueError – if the molecules cannot be added
- abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)
Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.
- abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[Iterable[StoredMol | str | Mol | DataFrame], None, None]
Apply a function to chunks of data, using all or only selected properties. Each chunk, in the requested chunk type, is passed to the function as its first positional argument, with the selected properties attached. How the properties are attached to the objects in a chunk is up to the downstream implementation.
- Parameters:
func (callable) – The function to apply.
func_args (list, optional) – The positional arguments of the function.
func_kwargs (dict, optional) – The keyword arguments of the function.
on_props (tuple, optional) – The properties to apply the function on.
chunk_type (str, optional) – The type of chunks to yield.
- Returns:
A generator that yields the results of the function applied to each chunk.
- abstract applyIdentifier(identifier: ChemIdentifier)
Apply an identifier to the SMILES in this instance (i.e. remove duplicates).
- Parameters:
identifier (ChemIdentifier) – The identifier to apply.
- abstract applyStandardizer(standardizer: ChemStandardizer)
Apply a standardizer to the SMILES in the store.
- Parameters:
standardizer (ChemStandardizer) – The standardizer to apply
- abstract clear()
Delete entries in the persistent storage.
- abstract dropEntries(ids: Iterable[str])
Drop entries from the storage.
- Parameters:
ids (list) – The IDs of the entries to drop.
- abstract getDF() → DataFrame
Get the stored properties as a pandas DataFrame.
- Returns:
The data as a pandas DataFrame.
- Return type:
pd.DataFrame
- abstract getMolCount()
Get the number of molecules in the store.
- Returns:
number of molecules
- Return type:
int
- abstract getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any]
Get values of a given property.
- abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) → PropertyStorage
Get a subset of the storage for the given properties.
- Parameters:
- Returns:
The subset of the storage.
- Return type:
PropertyStorage
- abstract property identifier: ChemIdentifier
Get the identifier used by this instance.
- Returns:
The identifier used by this instance.
- Return type:
ChemIdentifier
- abstract iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') → Generator[list[StoredMol | str | Mol | DataFrame], None, None]
Iterate over chunks of molecules across the store.
- abstract iterMols() → Generator[StoredMol, None, None]
Iterate over all molecules in the store.
- Returns:
iterator over StoredMol instances
- abstract property metaFile: str
Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.
- Returns:
The absolute path to the metadata file.
- Return type:
str
- abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None]
Process the molecules in this instance with a given MolProcessor.
- Parameters:
processor (MolProcessor) – The processor to use.
proc_args (tuple, optional) – Additional arguments to pass to the processor.
proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.
mol_type (str, optional) – The type of molecule to process.
add_props (list, optional) – Additional properties to add to the dataset.
- Returns:
A generator that yields the processed molecules.
- Return type:
Generator
- abstract reload()
Reset the current state by reloading from storage.
- abstract removeMol(mol_id: str)
Remove a molecule from the store.
- Parameters:
mol_id (str) – identifier of the molecule to remove
- abstract removeProperty(name: str)
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save() → str
Save current state to storage and return the path to the serialized file.
- Returns:
The path to the serialized file.
- Return type:
str
- abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) → PropSearchable
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.
- Returns:
Another instance that can be filtered further.
- Return type:
PropSearchable
- property smiles: Generator[str, None, None]
Generator of SMILES strings of all molecules in storage.
- abstract property standardizer: ChemStandardizer
Get the standardizer used by the store.
- Returns:
The standardizer used by the store.
- Return type:
ChemStandardizer
qsprpred.data.storage.interfaces.chunk_iterable module
- class qsprpred.data.storage.interfaces.chunk_iterable.ChunkIterable
Bases: ABC
Objects that can be iterated over and processed in chunks.
- abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None) → Generator[Iterable[Any], None, None]
Apply a function on chunks of data. The chunks are supplied as the first positional argument to the function. The format of the chunks is up to the downstream implementation, but it should always be a single object supplied as the first parameter.
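The contract described above can be illustrated with a minimal sketch that chunks a plain Python list. `ListChunks`, its `chunk_size` parameter, and the `scale` function are hypothetical names made up for this example and are not part of QSPRpred; only the shape of `apply` (function, positional arguments, keyword arguments, generator of per-chunk results) follows the signature above.

```python
from __future__ import annotations

from typing import Any, Callable, Generator, Iterable


class ListChunks:
    """Hypothetical ChunkIterable-like class that chunks an in-memory list."""

    def __init__(self, items: list, chunk_size: int = 2):
        self.items = items
        self.chunk_size = chunk_size

    def iterChunks(self) -> Generator[list, None, None]:
        # slice the list into consecutive chunks of at most chunk_size items
        for start in range(0, len(self.items), self.chunk_size):
            yield self.items[start:start + self.chunk_size]

    def apply(
        self,
        func: Callable,
        func_args: list | None = None,
        func_kwargs: dict | None = None,
    ) -> Generator[Iterable[Any], None, None]:
        args = func_args or []
        kwargs = func_kwargs or {}
        for chunk in self.iterChunks():
            # the chunk is always passed as the first positional argument
            yield func(chunk, *args, **kwargs)


def scale(chunk: list, factor: int, offset: int = 0) -> list:
    """Toy function applied to every chunk."""
    return [x * factor + offset for x in chunk]


store = ListChunks([1, 2, 3, 4, 5], chunk_size=2)
results = list(store.apply(scale, func_args=[10], func_kwargs={"offset": 1}))
```

Because `apply` is a generator, nothing is computed until the results are consumed, which is why the sketch wraps the call in `list(...)`.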
qsprpred.data.storage.interfaces.data_store module
- class qsprpred.data.storage.interfaces.data_store.DataStorage
Bases: JSONSerializable, ABC
Abstract base class defining an API to interact with persistent data storage. Persistent does not necessarily mean that all data is stored locally: the metadata file may also hold only the connection details of a database or REST API. It assumes the existence of a metaFile attribute that points to a metadata file describing this instance, and it should be possible to initialize the instance from this file.
- abstract property metaFile: str
Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.
- Returns:
The absolute path to the metadata file.
- Return type:
str
- abstract save() → str
Save current state to storage and return the path to the serialized file.
- Returns:
The path to the serialized file.
- Return type:
str
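A save and reload round trip through the metadata file might look like the sketch below. `JsonFileStorage` and its `connection` attribute are hypothetical and written only for this example; the `fromFile` signature shown here is an assumption, since the text above only states that such a class method exists.

```python
from __future__ import annotations

import json
import os
import tempfile


class JsonFileStorage:
    """Hypothetical DataStorage-like class persisting its state as JSON."""

    def __init__(self, path: str, connection: dict):
        self.path = path
        # the metadata may hold only connection details, not the data itself
        self.connection = connection

    @property
    def metaFile(self) -> str:
        return os.path.abspath(self.path)

    def save(self) -> str:
        with open(self.metaFile, "w", encoding="utf-8") as handle:
            json.dump({"connection": self.connection}, handle)
        return self.metaFile

    @classmethod
    def fromFile(cls, filename: str) -> "JsonFileStorage":
        # assumed signature: rebuild the instance from the metadata file
        with open(filename, encoding="utf-8") as handle:
            state = json.load(handle)
        return cls(filename, state["connection"])


with tempfile.TemporaryDirectory() as tmp:
    storage = JsonFileStorage(
        os.path.join(tmp, "meta.json"), {"url": "https://example.org/api"}
    )
    saved_path = storage.save()
    restored = JsonFileStorage.fromFile(saved_path)
    restored_connection = restored.connection
    path_is_absolute = os.path.isabs(saved_path)
```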
qsprpred.data.storage.interfaces.descriptor_provider module
- class qsprpred.data.storage.interfaces.descriptor_provider.DescriptorProvider
Bases: ABC
Classes that implement this interface provide a way to store and retrieve molecular descriptors or other high-dimensional embeddings of molecules. It assumes that descriptors are divided into sets of related descriptors, each described by a DescriptorSet object.
- abstract addDescriptors(descriptors: DescriptorSet, *args, **kwargs)
Add descriptors to the dataset.
- Parameters:
descriptors (list[DescriptorSet]) – The descriptors to add.
args – Additional positional arguments to be passed to each descriptor set.
kwargs – Additional keyword arguments to be passed to each descriptor set.
- abstract property descriptorSets: list[DescriptorSet]
Get the descriptor sets that are currently in the storage.
- Returns:
a list of descriptor sets
- abstract dropDescriptorSets(descriptors: list[DescriptorSet | str])
Drop descriptor sets from the storage.
- Parameters:
descriptors – The descriptor sets to drop.
- abstract getDescriptorNames() → list[str]
Get the names of the descriptors that are currently in the storage.
- Returns:
a list of descriptor names
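To make the idea of descriptor sets more concrete, here is a small sketch of a provider that keeps computed descriptors in dictionaries. `CharCounts` and `DictDescriptorProvider` are hypothetical stand-ins, not QSPRpred classes, and the convention of prefixing descriptor names with the set name is an assumption of this example.

```python
from __future__ import annotations


class CharCounts:
    """Hypothetical descriptor set computing two toy descriptors from SMILES."""

    name = "CharCounts"
    descriptors = ["length", "uppercase"]

    def __call__(self, smiles: list[str]) -> list[list[int]]:
        return [[len(s), sum(ch.isupper() for ch in s)] for s in smiles]


class DictDescriptorProvider:
    """Hypothetical DescriptorProvider-like class backed by dictionaries."""

    def __init__(self, smiles: list[str]):
        self.smiles = smiles
        self._sets: dict[str, CharCounts] = {}
        self.values: dict[str, list[list[int]]] = {}

    def addDescriptors(self, descriptors, *args, **kwargs):
        # extra arguments are forwarded to the descriptor set calculation
        self._sets[descriptors.name] = descriptors
        self.values[descriptors.name] = descriptors(self.smiles, *args, **kwargs)

    @property
    def descriptorSets(self) -> list:
        return list(self._sets.values())

    def dropDescriptorSets(self, descriptors: list):
        for item in descriptors:
            # sets can be referenced either by object or by name
            name = item if isinstance(item, str) else item.name
            self._sets.pop(name, None)
            self.values.pop(name, None)

    def getDescriptorNames(self) -> list[str]:
        # assumption: names are prefixed with the set name to keep them unique
        return [f"{s.name}_{d}" for s in self._sets.values() for d in s.descriptors]


provider = DictDescriptorProvider(["CCO", "c1ccccc1"])
provider.addDescriptors(CharCounts())
names = provider.getDescriptorNames()
computed = provider.values["CharCounts"]
provider.dropDescriptorSets(["CharCounts"])
names_after_drop = provider.getDescriptorNames()
```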
qsprpred.data.storage.interfaces.mol_processable module
- class qsprpred.data.storage.interfaces.mol_processable.MolProcessable
Bases: ABC
Interface for processing molecules.
- abstract processMols(processor: MolProcessor, proc_args: tuple[Any, ...] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None) → Generator[Any, None, None]
Process the molecules in this instance with a given MolProcessor.
- Parameters:
processor (MolProcessor) – The processor to use.
proc_args (tuple, optional) – Additional arguments to pass to the processor.
proc_kwargs (dict, optional) – Additional keyword arguments to pass to the processor.
mol_type (str, optional) – The type of molecule to process.
add_props (list, optional) – Additional properties to add to the dataset.
- Returns:
A generator that yields the processed molecules.
- Return type:
Generator
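The following sketch shows one way the parameters above could interact. It assumes that a processor is a callable receiving a chunk of molecules and a dictionary of the requested properties, followed by the extra arguments; this calling convention, the `SmilesContainer` class, and the `count_char` processor are assumptions made for illustration and do not reproduce the real MolProcessor API. The `mol_type` parameter is left out because the sketch only handles SMILES strings.

```python
from __future__ import annotations

from typing import Any, Callable, Generator, Iterable


class SmilesContainer:
    """Hypothetical MolProcessable-like container of SMILES and properties."""

    def __init__(self, smiles: list[str], props: dict[str, list], chunk_size: int = 2):
        self.smiles = smiles
        self.props = props
        self.chunk_size = chunk_size

    def processMols(
        self,
        processor: Callable,
        proc_args: tuple | None = None,
        proc_kwargs: dict | None = None,
        add_props: Iterable[str] | None = None,
    ) -> Generator[Any, None, None]:
        args = proc_args or ()
        kwargs = proc_kwargs or {}
        wanted = list(add_props or [])
        for start in range(0, len(self.smiles), self.chunk_size):
            stop = start + self.chunk_size
            # only the requested properties are passed along with the chunk
            chunk_props = {name: self.props[name][start:stop] for name in wanted}
            yield processor(self.smiles[start:stop], chunk_props, *args, **kwargs)


def count_char(mols: list[str], props: dict[str, list], symbol: str) -> list[tuple]:
    """Toy processor: count a character in each SMILES and label it by name."""
    return [(name, smi.count(symbol)) for smi, name in zip(mols, props["name"])]


container = SmilesContainer(
    ["CCO", "CCN", "c1ccccc1"],
    {"name": ["ethanol", "ethylamine", "benzene"]},
)
processed = list(
    container.processMols(count_char, proc_args=("C",), add_props=["name"])
)
```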
qsprpred.data.storage.interfaces.property_storage module
- class qsprpred.data.storage.interfaces.property_storage.PropertyStorage
Bases: DataStorage, ChunkIterable, PropSearchable, ABC
A simple DataStorage that maps property names to arbitrary data. It is assumed that PropertyStorage stores entries with one or more properties attached to each entry. It is up to the downstream implementation to decide how the data is stored and accessed, as long as the interface is respected. See the methods of this class and the base classes for more details.
- abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)
Add entries to the storage.
- abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)
Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.
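A minimal in-memory sketch of `addEntries` and `addProperty` is given below. `DictPropertyStore` is a hypothetical class written for this example. Two behaviours are assumptions rather than documented facts: existing entries are silently skipped when `raise_on_existing` is `False`, and when `ids` is supplied to `addProperty` the data must have one value per listed ID.

```python
from __future__ import annotations

from collections.abc import Sized


class DictPropertyStore:
    """Hypothetical store keeping one dict of property values per entry ID."""

    def __init__(self):
        self.entries: dict[str, dict] = {}

    def addEntries(self, ids: list[str], props: dict[str, list], raise_on_existing: bool = True):
        existing = [entry_id for entry_id in ids if entry_id in self.entries]
        if existing and raise_on_existing:
            raise ValueError(f"Entries already exist: {existing}")
        for idx, entry_id in enumerate(ids):
            if entry_id in self.entries:
                continue  # assumption: existing entries are kept untouched
            self.entries[entry_id] = {name: vals[idx] for name, vals in props.items()}

    def addProperty(self, name: str, data: Sized, ids: list[str] | None = None):
        targets = ids if ids is not None else list(self.entries)
        # the data must line up with the entries it is attached to
        if len(data) != len(targets):
            raise ValueError("data must contain one value per targeted entry")
        for entry_id, value in zip(targets, data):
            self.entries[entry_id][name] = value


store = DictPropertyStore()
store.addEntries(["a", "b"], {"pchembl": [6.5, 7.1]})
store.addProperty("source", ["ChEMBL", "ChEMBL"])

duplicate_rejected = False
try:
    store.addEntries(["a"], {"pchembl": [9.9]})
except ValueError:
    duplicate_rejected = True

# with raise_on_existing=False only the new entry "c" is added
store.addEntries(["a", "c"], {"pchembl": [9.9, 5.0]}, raise_on_existing=False)

length_mismatch_rejected = False
try:
    store.addProperty("flag", [True])  # three entries, one value
except ValueError:
    length_mismatch_rejected = True
```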
- abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) → Generator[Iterable[Any], None, None]
Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.
- Parameters:
func (callable) – The function to apply.
func_args (list, optional) – The positional arguments of the function.
func_kwargs (dict, optional) – The keyword arguments of the function.
on_props (tuple, optional) – The properties to apply the function on.
as_df (bool, optional) – Provide properties as a DataFrame to the function.
- Returns:
A generator that yields the results of the function applied to each chunk.
- abstract clear()
Delete entries in the persistent storage.
- abstract dropEntries(ids: Iterable[str])
Drop entries from the storage.
- Parameters:
ids (list) – The IDs of the entries to drop.
- abstract getDF() → DataFrame
Get the stored properties as a pandas DataFrame.
- Returns:
The data as a pandas DataFrame.
- Return type:
pd.DataFrame
- abstract getProperty(name: str, ids: tuple[str] | None = None) → Iterable[Any]
Get values of a given property.
- abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) → PropertyStorage
Get a subset of the storage for the given properties.
- Parameters:
- Returns:
The subset of the storage.
- Return type:
PropertyStorage
- abstract hasProperty(name: str) → bool
Check whether a property is present in the data frame.
- abstract iterChunks(size: int | None = None, on_props: list | None = None) → Generator[list[Any], None, None]
Iterate over chunks of molecules across the store.
- Returns:
an iterable of lists of stored molecules
- abstract property metaFile: str
Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.
- Returns:
The absolute path to the metadata file.
- Return type:
str
- abstract reload()
Reset the current state by reloading from storage.
- abstract removeProperty(name: str)
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save() → str
Save current state to storage and return the path to the serialized file.
- Returns:
The path to the serialized file.
- Return type:
str
- abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) → PropSearchable
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.
- Returns:
Another instance that can be filtered further.
- Return type:
PropSearchable
qsprpred.data.storage.interfaces.searchable module
- class qsprpred.data.storage.interfaces.searchable.PropSearchable
Bases: ABC
Interface for searching on properties.
- abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) → PropSearchable
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.
- Returns:
Another instance that can be filtered further.
- Return type:
PropSearchable
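Since the search returns another searchable instance, calls can be chained. The sketch below illustrates this on a list of dictionaries. `RowSearch` is a hypothetical class, and treating a non-exact search as substring matching is an assumption of this example, not a statement about how any particular implementation behaves.

```python
from __future__ import annotations


class RowSearch:
    """Hypothetical PropSearchable-like wrapper around a list of row dicts."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def searchOnProperty(self, prop_name: str, values: list, exact: bool = False) -> "RowSearch":
        def matches(cell) -> bool:
            if exact:
                return cell in values
            # assumption: non-exact search means substring matching
            return any(str(value) in str(cell) for value in values)

        # return a new instance so that searches can be chained
        return RowSearch([row for row in self.rows if matches(row[prop_name])])


table = RowSearch(
    [
        {"name": "aspirin", "target": "COX1"},
        {"name": "celecoxib", "target": "COX2"},
        {"name": "imatinib", "target": "ABL1"},
    ]
)
partial = table.searchOnProperty("target", ["COX"])
exact = table.searchOnProperty("target", ["COX"], exact=True)
chained = partial.searchOnProperty("name", ["celecoxib"], exact=True)
```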
- class qsprpred.data.storage.interfaces.searchable.SMARTSSearchable
Bases: ABC
Instances of this class can be searched with SMARTS patterns.
- abstract searchWithSMARTS(patterns: list[str]) → SMARTSSearchable
Search the molecules within this instance with SMARTS patterns.
- Parameters:
patterns – List of SMARTS patterns to search with.
- Returns:
Another instance that can be filtered further.
- Return type:
SMARTSSearchable
qsprpred.data.storage.interfaces.stored_mol module
- class qsprpred.data.storage.interfaces.stored_mol.StoredMol
Bases: ABC
A simple interface for a molecule that can be stored in a ChemStore. Molecules in the ChemStore have properties, representations, and can also have a parent molecule. Representations can be, for example, conformers, tautomers, or protomers of the parent molecule. Representations can also be used to encode docked poses with metadata attached as properties.
- as_rd_mol() → Mol
Get the RDKit molecule object of the standardized representation of this instance.
- Returns:
rdkit.Chem.Mol instance
- abstract property id: str
Get the identifier of the molecule.
- Returns:
The identifier of the molecule.
- Return type:
str
- abstract property origin: str
Get the name of the storage where the molecule resides.
- Returns:
The name of the storage.
- Return type:
str
- abstract property parent: StoredMol | None
Get the parent molecule of this representation.
- Returns:
The parent molecule of this representation as a StoredMol instance.
- abstract property props: dict[str, Any] | None
Get the metadata of the molecule.
- Returns:
The metadata of the molecule.
- Return type:
dict[str, Any] | None
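The relationship between a parent molecule and its representations can be sketched with a small dataclass. `SimpleStoredMol`, its `smiles` and `representations` fields, and the `add_representation` helper are hypothetical and exist only in this example; the `as_rd_mol` method is omitted so that the sketch runs without RDKit.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(eq=False)
class SimpleStoredMol:
    """Hypothetical StoredMol-like record, not the QSPRpred implementation."""

    id: str
    smiles: str
    origin: str
    parent: SimpleStoredMol | None = None
    props: dict[str, Any] | None = None
    representations: list[SimpleStoredMol] = field(default_factory=list)

    def add_representation(self, rep_id: str, smiles: str, props: dict | None = None) -> SimpleStoredMol:
        # a representation lives in the same store and points back to its parent
        rep = SimpleStoredMol(rep_id, smiles, self.origin, parent=self, props=props)
        self.representations.append(rep)
        return rep


acid = SimpleStoredMol(
    "mol_0", "CC(=O)O", origin="toy_store", props={"name": "acetic acid"}
)
anion = acid.add_representation("mol_0_prot_1", "CC(=O)[O-]", props={"charge": -1})
```

Here the deprotonated form is stored as a protomer of the neutral parent, and its metadata (the charge) is carried in `props`, mirroring how the description above suggests attaching metadata to representations.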