qsprpred.data.storage.tabular package
Submodules
qsprpred.data.storage.tabular.hierarchical module
- class qsprpred.data.storage.tabular.hierarchical.PandasRepresentationStore(name: str, path: str, chem_store: ChemStore | None = None, df: DataFrame | None = None, store_format: str = 'pkl', add_rdkit: bool = False, overwrite: bool = False, chunk_processor: ParallelGenerator = None, chunk_size: int | None = None, n_jobs: int = 1)[source]
Bases: ParallelizedChemStore
- addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)[source]
Add entries to the storage.
- addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) list[StoredMol][source]
Add new representations to the store.
It is required that the properties contain a ‘parent_id’ property that points to the parent molecule in the underlying storage object or to another representation stored in this object itself. The ‘sdf’ property must also be provided, which defines the representation of the molecule in SDF format. Other properties can be provided as well to indicate the nature of the representation.
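The parent/child linkage described above can be pictured with a minimal, library-independent sketch. Only the ‘parent_id’ and ‘sdf’ property names come from the text; the row layout, the ‘id’ and ‘kind’ keys, and the `collect_representations` helper are hypothetical and do not reflect how the store is actually implemented.

```python
from collections import defaultdict

# Hypothetical flat table of representations. Only 'parent_id' and 'sdf'
# are property names taken from the documentation; the rest is illustrative.
rows = [
    {"id": "rep_1", "parent_id": "mol_1", "sdf": "<SDF block>", "kind": "protomer"},
    {"id": "rep_2", "parent_id": "rep_1", "sdf": "<SDF block>", "kind": "conformer"},
    {"id": "rep_3", "parent_id": "mol_2", "sdf": "<SDF block>", "kind": "protomer"},
]


def collect_representations(rows, mol_id):
    """Return the IDs of all direct and nested representations of mol_id."""
    # Index the rows by their parent so children can be looked up quickly.
    children = defaultdict(list)
    for row in rows:
        children[row["parent_id"]].append(row["id"])
    found = []
    stack = list(reversed(children[mol_id]))
    while stack:
        current = stack.pop()
        found.append(current)
        # A representation may itself be the parent of other representations.
        stack.extend(reversed(children[current]))
    return found


reps_of_mol_1 = collect_representations(rows, "mol_1")
print(reps_of_mol_1)
```

Here `rep_2` is reached through `rep_1`, which is the situation the text describes as a representation pointing to "another representation stored in this object itself".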
- addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]
Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.
- apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None]
Apply a function to the molecules in the data frame.
- Parameters:
func (callable) – Function to apply to the molecules.
func_args (list, optional) – Additional arguments to pass to the function.
func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.
on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.
chunk_type (str, optional) – Type of molecule to send to the function. Can be ‘smiles’, ‘mol’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.
- Returns:
A generator that yields the results of the supplied function on the chunked molecules from the data set.
- Return type:
(Generator)
- applyIdentifier(identifier: ChemIdentifier)[source]
Apply an identifier to the SMILES in this instance (i.e. remove duplicates).
- Parameters:
identifier (ChemIdentifier) – The identifier to apply.
- applyStandardizer(standardizer: ChemStandardizer)[source]
Apply a standardizer to the SMILES in the store.
- Parameters:
standardizer (ChemStandardizer) – The standardizer to apply
- property chunkProcessor: ParallelGenerator
Parallel generator to use for processing.
- dropEntries(ids: Iterable[str])[source]
Drop entries from the storage.
- Parameters:
ids (list) – The IDs of the entries to drop.
- getDF() DataFrame[source]
Get the stored properties as a pandas DataFrame.
- Returns:
The data as a pandas DataFrame.
- Return type:
pd.DataFrame
- getMol(mol_id: str) StoredMol[source]
Retrieve a molecule with all its representations attached.
- Parameters:
mol_id (str) – identifier of the molecule to retrieve
- Returns:
molecule with all its representations attached to its representations attribute
- Return type:
StoredMol
- getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any][source]
Get values of a given property.
- getRepresentations(mol_id: str, recursive=True, is_root=False) list[StoredMol][source]
Find all representations of a molecule recursively.
- Parameters:
- getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PandasRepresentationStore[source]
Get a subset of the storage for the given properties.
- Parameters:
- Returns:
The subset of the storage.
- Return type:
PandasRepresentationStore
- property identifier: ChemIdentifier
Get the identifier used by this instance.
- Returns:
The identifier used by this instance.
- Return type:
ChemIdentifier
- iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None][source]
Iterate over chunks of molecules with their representations added.
- Parameters:
- Yields:
(list[StoredMol | str | Chem.Mol | pd.DataFrame]) – chunk of molecules with all representations attached to their representations attribute
- iterMols() Generator[StoredMol, None, None][source]
Iterate over all molecules in the attached storage with their representations added.
- Yields:
(StoredMol) – molecule with all its representations attached to its representations attribute
- property metaFile: str
Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.
- Returns:
The absolute path to the metadata file.
- Return type:
str
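The relationship between saving, the metadata file, and reloading can be sketched as follows. This is a toy model under stated assumptions, not the QSPRpred implementation: the `TinyStore` class, its `meta_file` property, its `from_file` method, and the JSON layout of the metadata are all hypothetical.

```python
import json
import os
import pickle
import tempfile


class TinyStore:
    """Hypothetical store used only to illustrate the save/reload round trip."""

    def __init__(self, name, path, data=None):
        self.name = name
        self.path = os.path.abspath(path)
        self.data = data if data is not None else {}

    @property
    def meta_file(self):
        # Absolute path to the file that records where the data lives.
        return os.path.join(self.path, f"{self.name}_meta.json")

    def save(self):
        os.makedirs(self.path, exist_ok=True)
        data_file = os.path.join(self.path, f"{self.name}.pkl")
        with open(data_file, "wb") as handle:
            pickle.dump(self.data, handle)
        with open(self.meta_file, "w") as handle:
            json.dump({"name": self.name, "data_file": data_file}, handle)
        return self.meta_file

    @classmethod
    def from_file(cls, meta_file):
        # The metadata file is enough to find and restore the persisted data.
        with open(meta_file) as handle:
            meta = json.load(handle)
        with open(meta["data_file"], "rb") as handle:
            data = pickle.load(handle)
        return cls(meta["name"], os.path.dirname(meta_file), data)


with tempfile.TemporaryDirectory() as tmp:
    store = TinyStore("demo", tmp, {"mol_1": "CCO"})
    restored = TinyStore.from_file(store.save())
    meta_is_absolute = os.path.isabs(store.meta_file)
```

The design point is that the caller only needs to keep one path: the metadata file describes everything else needed to restore the object.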
- processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator
Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.
IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as in the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules or use MolProcessorWithID as the processor.
- Parameters:
processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (list, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
- Returns:
A generator that yields the results of the supplied processor on the chunked molecules from the data set.
- Return type:
Generator
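The ordering caveat above is easier to see in a small standalone sketch. It does not use QSPRpred: the `count_carbons` function, the chunk layout, and the IDs are hypothetical, and a thread pool from the standard library stands in for the library's parallel generator so the example runs anywhere. The point is only that results keyed by molecule ID stay identifiable regardless of the order in which chunks finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_carbons(smiles_chunk, ids_chunk):
    """Toy processor: count 'C'/'c' characters and key each result by ID.

    This is a character count for illustration only, not real chemistry
    (for example, the 'C' in 'Cl' would be counted as well).
    """
    return {
        mol_id: sum(ch in "Cc" for ch in smi)
        for mol_id, smi in zip(ids_chunk, smiles_chunk)
    }


# Each chunk carries the SMILES together with the matching IDs.
chunks = [
    (["CCO", "c1ccccc1"], ["m1", "m2"]),
    (["CC(=O)O"], ["m3"]),
]

results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(count_carbons, smi, ids) for smi, ids in chunks]
    # as_completed yields futures in completion order, not submission order,
    # so the IDs are what tie each result back to its molecule.
    for future in as_completed(futures):
        results.update(future.result())
```

With process-based parallelism the function would additionally have to be picklable, which in practice means defining it at module level rather than as a lambda or nested function.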
- removeProperty(name: str)[source]
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- removeRepresentations(mol_id: str)[source]
Remove all representations of a molecule from the store.
- save() str[source]
Save current state to storage and return the path to the serialized file.
- Returns:
The path to the serialized file.
- Return type:
str
- searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PandasRepresentationStore[source]
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.
- Returns:
Another instance that can be filtered further.
- Return type:
PandasRepresentationStore
- searchWithSMARTS(patterns: list[str]) PandasRepresentationStore[source]
Search the molecules within this instance with SMARTS patterns.
- Parameters:
patterns – List of SMARTS patterns to search with.
- Returns:
Another instance that can be filtered further.
- Return type:
PandasRepresentationStore
- property smiles: Generator[str, None, None]
Generator of SMILES strings of all molecules in storage.
- property standardizer: ChemStandardizer
Get the standardizer used by the store.
- Returns:
The standardizer used by the store.
- Return type:
ChemStandardizer
- class qsprpred.data.storage.tabular.hierarchical.RepresentationMol(mol_id: str, origin: str, smiles: str, parent: TabularMol | None = None, rd_mol: Mol | None = None, props: dict[str, Any] | None = None, representations: tuple[TabularMol, ...] | None = None)[source]
Bases: TabularMol
Create a new molecule instance.
- Parameters:
mol_id (str) – identifier of the molecule
smiles (str) – SMILES of the molecule
parent (TabularMol, optional) – parent molecule
rd_mol (Chem.Mol, optional) – rdkit molecule object
props (dict, optional) – properties of the molecule
representations (tuple, optional) – representations of the molecule
- as_rd_mol(add_props=False) Mol[source]
Get the rdkit molecule object.
- Returns:
(Chem.Mol) rdkit molecule object
- property origin: str
Get the name of the storage where the molecule resides.
- Returns:
The name of the storage.
- Return type:
str
- property parent: TabularMol
Get the parent molecule.
- property representations: list[TabularMol] | None
Get the representations of the molecule.
qsprpred.data.storage.tabular.simple module
- class qsprpred.data.storage.tabular.simple.PandasChemStore(name: str, path: str, df: DataFrame | None = None, smiles_col: str = 'SMILES', add_rdkit: bool = False, overwrite: bool = False, save: bool = False, standardizer=None, identifier=None, id_col: str | None = None, autoindex_name: str | None = None, store_format: str = 'pkl', chunk_processor: ParallelGenerator = None, chunk_size: int | None = None, n_jobs: int = 1)[source]
Bases: ParallelizedChemStore
Tabular storage for molecules. An example implementation of ChemStore that uses PandasDataTable to store the data.
- Variables:
name (str) – Name of the storage.
path (str) – Path to the storage directory.
storeFormat (str) – Format to use for storing the data.
nJobs (int) – Number of parallel jobs to use for processing.
chunkSize (int) – Size of the chunks to use for processing.
chunkProcessor (ParallelGenerator) – Parallel generator to use for processing.
Initialize the storage. If the storage with the given name already exists in the destination it will be reloaded.
- Parameters:
name (str) – Name of the storage.
path (str) – Path to the storage directory.
df (pd.DataFrame, optional) – Data frame to initialize the storage with.
smiles_col (str, optional) – Name of the column containing the SMILES.
add_rdkit (bool, optional) – Whether to add RDKit molecules to the storage.
overwrite (bool, optional) – Whether to overwrite the storage if it already exists.
save (bool, optional) – Whether to save the storage after initialization.
standardizer (ChemStandardizer, optional) – Standardizer to use for the molecules.
identifier (ChemIdentifier, optional) – Identifier to use for the molecules.
id_col (str, optional) – Name of the column containing the molecule IDs.
store_format (str, optional) – Format to use for storing the data.
chunk_processor (ParallelGenerator, optional) – Parallel generator to use for processing.
chunk_size (int, optional) – Size of the chunks to use for processing.
n_jobs (int, optional) – Number of parallel jobs to use for processing.
- addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True, library: str | None = None)[source]
Add entries to the storage.
- addLibrary(name: str, df: DataFrame, smiles_col: str = 'SMILES', id_col: str | None = None, add_rdkit=False, store_format='pkl', save=False)[source]
Reads molecules from a data frame and adds standardized SMILES to the store as a new library.
- Parameters:
name (str) – name of the library
df (pd.DataFrame) – data frame containing the molecules
smiles_col (str) – name of the column containing the SMILES
id_col (str) – name of the column containing the molecule IDs
add_rdkit (bool) – whether to add RDKit molecules to the store
store_format (str) – format to use for storing the data
save (bool) – whether to save the store after adding the library
- addMols(smiles: Iterable[str], props: dict[str, list] | None = None, library: str | None = None, raise_on_existing: bool = True, add_rdkit: bool = False, store_format: str = 'pkl', save: bool = False, chunk_size: int | None = None, chunk_processor: ParallelGenerator | None = None) list[TabularMol][source]
Add a molecule to the store using its raw SMILES.
The SMILES will be standardized and an identifier will be calculated.
- Parameters:
props (dict, optional) – Additional properties to store with the molecule.
library (str, optional) – Name of the library to add the molecule to.
raise_on_existing (bool, optional) – Whether to raise an error if the molecule already exists in the store.
add_rdkit (bool, optional) – Whether to add RDKit molecules to the store.
store_format (str, optional) – Format to use for storing the data.
save (bool, optional) – Whether to save the store after adding the molecule.
chunk_size (int, optional) – Size of the chunks to use for processing (not used).
chunk_processor (ParallelGenerator, optional) – Parallel generator to use for processing (not used).
- Returns:
Instances of the added molecules.
- Return type:
list[TabularMol]
- addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]
Add a property to the storage.
- apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None]
Apply a function to the molecules in the data frame.
- Parameters:
func (callable) – Function to apply to the molecules.
func_args (list, optional) – Additional arguments to pass to the function.
func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.
on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.
chunk_type (str, optional) – Type of molecule to send to the function. Can be ‘smiles’, ‘mol’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.
- Returns:
A generator that yields the results of the supplied function on the chunked molecules from the data set.
- Return type:
(Generator)
- applyIdentifier(identifier: ChemIdentifier)[source]
Apply an identifier to the SMILES in the store.
- Parameters:
identifier (ChemIdentifier) – Identifier to apply to the SMILES.
- applyStandardizer(standardizer: ChemStandardizer)[source]
Apply a standardizer to the SMILES in the store.
- Parameters:
standardizer (ChemStandardizer) – Standardizer to apply to the SMILES.
- property chunkProcessor: ParallelGenerator
Parallel generator to use for processing.
- dropEntries(ids: Iterable[str])[source]
Drop entries from the store.
- Parameters:
ids (tuple) – IDs of the entries to drop.
- classmethod fromDF(df: DataFrame, *args, name: str | None = None, **kwargs) PandasChemStore[source]
Create a new instance from a pandas DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame to create the instance from.
name (str) – Name of the new instance. Defaults to the name of the DataFrame.
*args – Additional arguments to pass to the constructor.
**kwargs – Additional keyword arguments to pass to the constructor.
- Returns:
New instance created from the DataFrame.
- Return type:
PandasChemStore
- getMol(mol_id) TabularMol[source]
Get a molecule from the store by its ID.
- Parameters:
mol_id (str) – ID of the molecule to get.
- Returns:
Molecule with the given ID.
- Return type:
TabularMol
- getMolIDs() tuple[str, ...][source]
Returns a tuple of all molecule IDs in the store.
Good for checking possible overlaps between stores.
- Returns:
Tuple of all molecule IDs in the store.
- Return type:
(tuple)
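An overlap check of the kind mentioned above needs nothing more than set operations on the returned tuples. The two ID tuples below are hypothetical stand-ins for what getMolIDs() might return on two different stores.

```python
# Hypothetical ID tuples, standing in for the output of getMolIDs()
# on two separate stores.
ids_a = ("mol_1", "mol_2", "mol_3")
ids_b = ("mol_3", "mol_4")

# Molecules present in both stores.
overlap = sorted(set(ids_a) & set(ids_b))
# Molecules that only the first store contains.
only_in_a = sorted(set(ids_a) - set(ids_b))

print(overlap, only_in_a)
```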
- getProperties() list[str][source]
Get a list of all properties in the storage.
- Returns:
List of all properties in the storage.
- Return type:
(list)
- getProperty(name: str, ids: list[str] | None = None) Series[source]
Get a property from the storage.
- getSubset(subset: Iterable[str], ids: list[str] | None = None, name: str | None = None) PandasChemStore[source]
Get a subset of the storage for the given properties.
- Parameters:
- Returns:
New table containing the subset.
- Return type:
PandasChemStore
- getSummary()[source]
Make a summary with some statistics about the molecules in this table.
The summary contains the number of molecules per target and the number of unique molecules per target. Requires this data set to be imported from Papyrus for now.
- Returns:
A dataframe with the summary statistics.
- Return type:
(pd.DataFrame)
- property idProp: str
Name of the property containing unique molecule IDs. The values are determined by the attached identifier.
- property identifier
Identifier used in this storage.
- iterChunks(size: int = 1000, on_props: Iterable[str] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None][source]
Iterate over the molecules in the store in chunks.
- iterMols() Generator[TabularMol, None, None][source]
Iterate over the molecules in the store.
- Yields:
(TabularMol) – Molecule from the store.
- property libsPath
Path to the directory where the primary library tables are stored.
- property nJobs
Number of parallel jobs to use for processing.
- property nLibs
Number of libraries in this storage.
- property originalSmilesProp: str
Name of the column containing the original SMILES before standardization.
- processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator
Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.
IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as in the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules or use MolProcessorWithID as the processor.
- Parameters:
processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (list, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
- Returns:
A generator that yields the results of the supplied processor on the chunked molecules from the data set.
- Return type:
Generator
- removeMol(mol_id)[source]
Remove a molecule from the store.
- Parameters:
mol_id (str) – ID of the molecule to remove.
- removeProperty(name: str)[source]
Remove a property from the storage.
- Parameters:
name (str) – Name of the property to remove.
- save()[source]
Save the whole storage to disk.
- Returns:
Path to the saved storage.
- Return type:
(str)
- searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None) PandasChemStore[source]
Search in this table using a property name and a list of values. It is assumed that the property is searchable with string matching or direct comparison if a number is supplied. Note that the types of the query list need to be consistent. Otherwise, a ValueError will be raised.
In the case of string comparison, if ‘exact’ is False, the search will be performed with partial matching, i.e. all molecules that contain any of the given values in the property will be returned. If ‘exact’ is True, only molecules that have the exact property value for any of the given values will be returned.
- Parameters:
prop_name (str) – Name of the property to search on.
values (list[str]) – List of values to search for. If any of the values is found in the property, the molecule will be considered a match.
exact (bool, optional) – Whether to use exact matching, i.e. whether to search for exact strings or just substrings. Defaults to False.
name (str | None, optional) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.
- Returns:
A new table with the molecules from the old table with the given property values.
- Return type:
PandasChemStore
- Raises:
ValueError – If the types of the query list are not consistent.
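The matching rules described for this method can be illustrated with a plain-Python sketch. The `search_on_property` function and the records below are hypothetical and operate on a list of dictionaries rather than on a store; the type check is also stricter than the library may be, since it rejects any mix of types, including int with float.

```python
def search_on_property(records, prop_name, values, exact=False):
    """Return records whose property matches any of the query values."""
    types = {type(v) for v in values}
    if len(types) > 1:
        raise ValueError("All query values must have the same type.")
    is_text = types == {str}
    hits = []
    for record in records:
        stored = record[prop_name]
        if is_text and not exact:
            # Partial matching: the query only has to occur as a substring.
            matched = any(v in str(stored) for v in values)
        else:
            # Exact strings and numbers are compared directly.
            matched = any(stored == v for v in values)
        if matched:
            hits.append(record)
    return hits


records = [
    {"id": "m1", "target": "adenosine A1 receptor", "pchembl": 6.5},
    {"id": "m2", "target": "adenosine A2A receptor", "pchembl": 7.0},
    {"id": "m3", "target": "dopamine D2 receptor", "pchembl": 6.5},
]

partial = [r["id"] for r in search_on_property(records, "target", ["adenosine"])]
exact = [
    r["id"]
    for r in search_on_property(records, "target", ["adenosine"], exact=True)
]
numeric = [r["id"] for r in search_on_property(records, "pchembl", [6.5])]
```

With these records the partial search finds both adenosine entries, while the exact search finds none because no stored value equals the bare string "adenosine".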
- searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, match_function: MolProcessor | None = None) PandasChemStore[source]
Search the molecules in the table with a SMARTS pattern.
- Parameters:
patterns – List of SMARTS patterns to search with.
operator (object) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.
use_chirality – Whether to use chirality in the search.
name – Name of the new table. Defaults to the name of the old table, plus the smarts_searched suffix.
match_function – Function to use for matching the molecules to the SMARTS patterns. Defaults to match_mol_to_smarts.
- Returns:
A dataframe with the molecules that match the pattern.
- Return type:
(MolTable)
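How the ‘or’ and ‘and’ operators combine several patterns can be sketched without a cheminformatics toolkit. Everything below is hypothetical: `search_with_patterns` is not the library function, and the default matcher is a plain substring test used as a stand-in, which is NOT real SMARTS matching. A real matcher could be supplied through the `match_function` argument of this sketch.

```python
def search_with_patterns(smiles_list, patterns, operator="or", match_function=None):
    """Keep entries that match any ('or') or all ('and') of the patterns."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    # Stand-in matcher: substring test only, not substructure search.
    match = match_function or (lambda smi, pattern: pattern in smi)
    combine = any if operator == "or" else all
    return [
        smi for smi in smiles_list
        if combine(match(smi, pattern) for pattern in patterns)
    ]


smiles = ["CCO", "CC(=O)O", "CCN", "NCC(=O)O"]
either = search_with_patterns(smiles, ["N", "C(=O)O"], operator="or")
both = search_with_patterns(smiles, ["N", "C(=O)O"], operator="and")
```

Only the last entry contains both fragments, so it is the single hit for ‘and’, while ‘or’ also keeps the entries that contain just one of them.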
- property smiles: Generator[str, None, None]
Generator of SMILES strings of all molecules in storage.
- property standardizer: ChemStandardizer
Standardizer used in this storage.
- class qsprpred.data.storage.tabular.simple.ParallelizedChemStore[source]
Bases: ChemStore, SMARTSSearchable, PropSearchable, Summarizable, Parallelizable, ABC
Base class with default implementations of some parallel processing features for ChemStore instances that want to support it. The mixin basically defines some methods required by the ChunkIterable and MolProcessable interfaces to make implementation of parallel processing for downstream instances of ChemStore easier.
- abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)
Add entries to the storage.
- abstract addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) list[StoredMol]
Add a molecule to the store.
This method should not perform any standardization or identifier calculation. The add_mol_from_smiles method should be used instead if automatic standardization and identification should be performed before storage.
- Parameters:
- Returns:
instances of the added molecules
- Return type:
list[StoredMol]
- Raises:
ValueError – if the molecules cannot be added
- abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)
Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.
- apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None][source]
Apply a function to the molecules in the data frame.
- Parameters:
func (callable) – Function to apply to the molecules.
func_args (list, optional) – Additional arguments to pass to the function.
func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.
on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.
chunk_type (str, optional) – Type of molecule to send to the function. Can be ‘smiles’, ‘mol’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.
- Returns:
A generator that yields the results of the supplied function on the chunked molecules from the data set.
- Return type:
(Generator)
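In this class, apply is implemented while iterChunks is left abstract. A general pattern that fits this split is a base class whose default method is written purely in terms of an abstract chunk iterator. The sketch below illustrates that pattern only; `ChunkedStore`, `ListStore`, and their snake_case methods are hypothetical names and the real implementation may differ (for example, it can dispatch chunks to a parallel generator).

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterator


class ChunkedStore(ABC):
    """Hypothetical base class: subclasses only say how to produce chunks."""

    @abstractmethod
    def iter_chunks(self, size: int) -> Iterator[list]:
        """Yield lists of at most `size` items."""

    def apply(self, func: Callable[[list], Any], size: int = 2) -> Iterator[Any]:
        """Default implementation: lazily call func on every chunk."""
        for chunk in self.iter_chunks(size):
            yield func(chunk)


class ListStore(ChunkedStore):
    """Minimal concrete store backed by a Python list."""

    def __init__(self, smiles):
        self._smiles = list(smiles)

    def iter_chunks(self, size):
        for start in range(0, len(self._smiles), size):
            yield self._smiles[start:start + size]


store = ListStore(["CCO", "CCN", "O"])
lengths = list(store.apply(lambda chunk: [len(s) for s in chunk]))
```

Because `apply` returns a generator, nothing is computed until the caller iterates over it, which matches the generator return type documented above.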
- abstract applyIdentifier(identifier: ChemIdentifier)
Apply an identifier to the SMILES in this instance (i.e. remove duplicates).
- Parameters:
identifier (ChemIdentifier) – The identifier to apply.
- abstract applyStandardizer(standardizer: ChemStandardizer)
Apply a standardizer to the SMILES in the store.
- Parameters:
standardizer (ChemStandardizer) – The standardizer to apply
- abstract property chunkProcessor: ParallelGenerator
Parallel generator to use for processing.
- abstract clear()
Delete entries in the persistent storage.
- abstract dropEntries(ids: Iterable[str])
Drop entries from the storage.
- Parameters:
ids (list) – The IDs of the entries to drop.
- abstract getDF() DataFrame
Get the stored properties as a pandas DataFrame.
- Returns:
The data as a pandas DataFrame.
- Return type:
pd.DataFrame
- abstract getMolCount()
Get the number of molecules in the store.
- Returns:
(int) number of molecules
- abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]
Get values of a given property.
- abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage
Get a subset of the storage for the given properties.
- Parameters:
- Returns:
The subset of the storage.
- Return type:
PropertyStorage
- abstract getSummary() DataFrame
Make a summary with some statistics about this object or action.
- Returns:
A dataframe with the summary statistics.
- Return type:
(pd.DataFrame)
- abstract property identifier: ChemIdentifier
Get the identifier used by this instance.
- Returns:
The identifier used by this instance.
- Return type:
ChemIdentifier
- abstract iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None]
Iterate over chunks of molecules across the store.
- abstract iterMols() Generator[StoredMol, None, None]
Iterate over all molecules in the store.
- Returns:
iterator over StoredMol instances
- abstract property metaFile: str
Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.
- Returns:
The absolute path to the metadata file.
- Return type:
str
- processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator[source]
Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.
IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as in the original data frame. However, you can pass the value of idProp in add_props to identify the processed molecules or use MolProcessorWithID as the processor.
- Parameters:
processor (MolProcessor) – MolProcessor object to use for processing.
proc_args (list, optional) – Any additional positional arguments to pass to the processor.
proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.
mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.
add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.
chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.
- Returns:
A generator that yields the results of the supplied processor on the chunked molecules from the data set.
- Return type:
Generator
- abstract reload()
Reset the current state by reloading from storage.
- abstract removeMol(mol_id: str)
Remove a molecule from the store.
- Parameters:
mol_id (str) – identifier of the molecule to remove
- abstract removeProperty(name: str)
Remove a property from the dataset.
- Parameters:
name (str) – The name of the property.
- abstract save() str
Save current state to storage and return the path to the serialized file.
- Returns:
The path to the serialized file.
- Return type:
str
- abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable
Search the molecules within this MoleculeDataSet on a property value.
- Parameters:
prop_name – Name of the column to search on.
values – Values to search for.
exact – Whether to search for exact matches or not.
- Returns:
Another instance that can be filtered further.
- Return type:
PropSearchable
- abstract searchWithSMARTS(patterns: list[str]) SMARTSSearchable
Search the molecules within this instance with SMARTS patterns.
- Parameters:
patterns – List of SMARTS patterns to search with.
- Returns:
Another instance that can be filtered further.
- Return type:
SMARTSSearchable
- property smiles: Generator[str, None, None]
Generator of SMILES strings of all molecules in storage.
- abstract property standardizer: ChemStandardizer
Get the standardizer used by the store.
- Returns:
The standardizer used by the store.
- Return type:
ChemStandardizer
qsprpred.data.storage.tabular.stored_mol module
- class qsprpred.data.storage.tabular.stored_mol.TabularMol(mol_id: str, origin: str, smiles: str, parent: TabularMol | None = None, rd_mol: Mol | None = None, props: dict[str, Any] | None = None, representations: tuple[TabularMol, ...] | None = None)[source]
Bases: StoredMol
Simple implementation of a molecule that is stored in a tabular storage.
Create a new molecule instance.
- Parameters:
mol_id (str) – identifier of the molecule
smiles (str) – SMILES of the molecule
parent (TabularMol, optional) – parent molecule
rd_mol (Chem.Mol, optional) – rdkit molecule object
props (dict, optional) – properties of the molecule
representations (tuple, optional) – representations of the molecule
- property origin: str
Get the name of the storage where the molecule resides.
- Returns:
The name of the storage.
- Return type:
str
- property parent: TabularMol
Get the parent molecule.
- property representations: list[TabularMol] | None
Get the representations of the molecule.