qsprpred.data.storage.tabular package

Submodules

qsprpred.data.storage.tabular.hierarchical module

class qsprpred.data.storage.tabular.hierarchical.PandasRepresentationStore(name: str, path: str, chem_store: ChemStore | None = None, df: DataFrame | None = None, store_format: str = 'pkl', add_rdkit: bool = False, overwrite: bool = False, chunk_processor: ParallelGenerator = None, chunk_size: int | None = None, n_jobs: int = 1)[source]

Bases: ParallelizedChemStore

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)[source]

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – Whether to raise an exception if an entry already exists. Defaults to True.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.

addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) list[StoredMol][source]

Add new representations to the store.

It is required that the properties contain a ‘parent_id’ property that points to the parent molecule in the underlying storage object or another representation stored in this object itself.

The ‘sdf’ property must also be provided, which defines the representation of the molecule in SDF format. Other properties can be provided as well to indicate the nature of the representation.

Parameters:
  • smiles – The SMILES of the representations to add.

  • props – The properties of the representations to add.

  • *args – Additional arguments.

  • **kwargs – Additional keyword arguments.

Returns:

The added representations.

Return type:

(list[StoredMol])
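The property requirements described above can be checked before any representations are added. The following is a minimal, library-independent sketch: check_representation_props is a hypothetical helper and not part of qsprpred, and it assumes that every property list holds exactly one value per SMILES string.

```python
def check_representation_props(smiles, props):
    """Hypothetical pre-flight check; not part of qsprpred."""
    smiles = list(smiles)
    # Both keys are described as required for representations.
    for key in ("parent_id", "sdf"):
        if key not in props:
            raise ValueError(f"missing required property: {key!r}")
    # Assumption: every property list is aligned with the SMILES list.
    for key, values in props.items():
        if len(values) != len(smiles):
            raise ValueError(
                f"property {key!r} has {len(values)} values, "
                f"expected {len(smiles)}"
            )
    return True


smiles = ["CCO", "CCO"]
props = {
    "parent_id": ["mol_0", "mol_0"],            # both entries share a parent
    "sdf": ["<SDF block 1>", "<SDF block 2>"],  # placeholders, not real SDF
    "kind": ["conformer", "conformer"],         # optional descriptive property
}
ok = check_representation_props(smiles, props)
```

Running a check like this first gives a clear error message when a required key is missing, rather than a failure somewhere inside the store.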

addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None]

Apply a function to the molecules in the data frame.

Parameters:
  • func (callable) – Function to apply to the molecules.

  • func_args (list, optional) – Additional arguments to pass to the function.

  • func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.

  • on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.

  • chunk_type (str, optional) – Type of molecule to send to the function. Can be ‘mol’, ‘smiles’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

  • no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.

Returns:

A generator that yields the results of the supplied function on the chunked molecules from the data set.

Return type:

(Generator)

applyIdentifier(identifier: ChemIdentifier)[source]

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:

identifier (ChemIdentifier) – The identifier to apply.

applyStandardizer(standardizer: ChemStandardizer)[source]

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

property baseDir: str

property chunkProcessor: ParallelGenerator

Parallel generator to use for processing.

property chunkSize: int

The size of the chunks to iterate over.

clear(files_only: bool = True)[source]

Clear the storage.

dropEntries(ids: Iterable[str])[source]

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDF() DataFrame[source]

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

getMol(mol_id: str) StoredMol[source]

Retrieve a molecule with all its representations attached.

Parameters:

mol_id (str) – identifier of the molecule to retrieve

Returns:

molecule with all its representations attached to its representations attribute

Return type:

(StoredMol)

getMolCount()[source]

Get the number of representations in the store.

getMolIDs() tuple[str, ...][source]

Get the identifiers of all representations in the store.

getProperties() list[str][source]

Get the property names contained in the storage.

getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any][source]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

getRepresentations(mol_id: str, recursive=True, is_root=False) list[StoredMol][source]

Find all representations of a molecule recursively.

Parameters:
  • mol_id (str) – identifier of the molecule to find representations for

  • recursive (bool) – whether to find representations recursively or just one level

  • is_root (bool) – whether the molecule is the root molecule (the parent of all representations) -> will be searched for in the main storage
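The difference between recursive and single-level lookup can be illustrated with a plain parent-to-children mapping. This is only a sketch of the traversal idea, not the store's implementation: find_representations and the children dictionary are hypothetical stand-ins for the ‘parent_id’ links kept in the store.

```python
def find_representations(mol_id, children, recursive=True):
    """Collect the IDs of representations below mol_id.

    `children` maps a parent ID to the IDs that name it as their parent.
    """
    found = []
    for child_id in children.get(mol_id, []):
        found.append(child_id)
        if recursive:
            # A representation may itself be the parent of others.
            found.extend(find_representations(child_id, children, True))
    return found


children = {
    "mol_0": ["rep_a", "rep_b"],  # direct representations of the root
    "rep_a": ["rep_a1"],          # a representation derived from rep_a
}
all_levels = find_representations("mol_0", children)
one_level = find_representations("mol_0", children, recursive=False)
```

With recursive=False only the direct children are returned. With recursive=True the representations of representations are included as well.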

getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PandasRepresentationStore[source]

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

getSummary() DataFrame[source]

Show the number of representations for each parent molecule.

hasProperty(name: str) bool[source]

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

property idProp: str

Get the name of the property that contains the molecule IDs.

property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:

The identifier used by this instance.

Return type:

ChemIdentifier

iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None][source]

Iterate over chunks of molecules with their representations added.

Parameters:
  • size (int) – size of the chunks to yield

  • on_props (list) – properties to chunk on

  • chunk_type (str) – type of the chunk to yield

Yields:

(list[StoredMol | str | Chem.Mol | pd.DataFrame]) – chunk of molecules with all representations attached to their representations attribute
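As a rough illustration of chunked iteration, the sketch below splits any iterable into lists of at most size items. It uses plain SMILES strings and a hypothetical iter_chunks function. How the store attaches representations or selects properties is not modelled here.

```python
from itertools import islice


def iter_chunks(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # iterator exhausted
        yield chunk


smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "C"]
chunks = list(iter_chunks(smiles, size=2))
```

The last chunk may be shorter than the requested size, so downstream code should not assume a fixed chunk length.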

iterMols() Generator[StoredMol, None, None][source]

Iterate over all molecules in the attached storage with their representations added.

Yields:

(StoredMol) – molecule with all its representations attached to its representations attribute

property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

property nJobs: int

Get the number of jobs to run in parallel.

property nMols: int

Number of molecules in storage.

property name: str

Name of the data set.

processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator

Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.

IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as the original data frame. To identify the processed molecules, you can pass the value of idProp in add_props or use MolProcessorWithID as the processor.

Parameters:
  • processor (MolProcessor) – MolProcessor object to use for processing.

  • proc_args (list, optional) – Any additional positional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.

  • add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

Returns:

A generator that yields the results of the supplied processor on the chunked molecules from the data set.

Return type:

Generator
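The two caveats in the note above (picklability and result ordering) can be sketched with the standard library alone. Everything here is illustrative: is_picklable is a hypothetical helper, and the arrived list simulates chunk results tagged with molecule IDs, as they might look if the ID property had been requested through add_props.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


# Named callables that can be looked up by import pickle by reference;
# lambdas cannot be pickled by the standard pickle module.
named_ok = is_picklable(len)
lambda_ok = is_picklable(lambda smiles: smiles)

# Simulated chunk results arriving out of order, each value tagged
# with the ID of the molecule it belongs to.
arrived = [
    [("mol_2", 8), ("mol_3", 7)],  # the second chunk finished first
    [("mol_0", 3), ("mol_1", 3)],
]
original_order = ["mol_0", "mol_1", "mol_2", "mol_3"]

# Flatten into a lookup table and restore the original order by ID.
by_id = {mol_id: value for chunk in arrived for mol_id, value in chunk}
ordered = [by_id[mol_id] for mol_id in original_order]
```

Keying results by ID rather than by position makes the outcome independent of the order in which workers finish.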

reload()[source]

Reset the current state by reloading from storage.

removeMol(mol_id: str)[source]

Remove all representations of a molecule from the store.

removeProperty(name: str)[source]

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

removeRepresentations(mol_id: str)[source]

Remove all representations of a molecule from the store.

save() str[source]

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PandasRepresentationStore[source]

Search the molecules within this MoleculeDataSet on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)

searchWithSMARTS(patterns: list[str]) PandasRepresentationStore[source]

Search the molecules within this instance with SMARTS patterns.

Parameters:

patterns – List of SMARTS patterns to search with.

Returns:

Another instance that can be filtered further.

Return type:

(SMARTSSearchable)

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in storage.

property smilesProp: str

Get the name of the property that contains the SMILES strings.

property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
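The relationship between toJSON, fromJSON, toFile and fromFile can be pictured with a toy class that follows the same four-method shape. This is a sketch under stated assumptions: Settings is a hypothetical class, and the real store serializes considerably more state than two attributes.

```python
import json
import os
import tempfile


class Settings:
    """Toy object mirroring the four serialization methods described above."""

    def __init__(self, name, chunk_size):
        self.name = name
        self.chunk_size = chunk_size

    def toJSON(self):
        # Everything needed to rebuild the object goes into the string.
        return json.dumps({"name": self.name, "chunk_size": self.chunk_size})

    @classmethod
    def fromJSON(cls, text):
        return cls(**json.loads(text))

    def toFile(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp:
    path = Settings("demo", 50).toFile(os.path.join(tmp, "demo_meta.json"))
    restored = Settings.fromFile(path)
```

In this sketch the file methods are thin wrappers around the string methods, so a round trip through a file and a round trip through a string reconstruct the same object.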

class qsprpred.data.storage.tabular.hierarchical.RepresentationMol(mol_id: str, origin: str, smiles: str, parent: TabularMol | None = None, rd_mol: Mol | None = None, props: dict[str, Any] | None = None, representations: tuple[TabularMol, ...] | None = None)[source]

Bases: TabularMol

Create a new molecule instance.

Parameters:
  • mol_id (str) – identifier of the molecule

  • origin (str) – name of the storage where the molecule resides

  • smiles (str) – SMILES of the molecule

  • parent (TabularMol, optional) – parent molecule

  • rd_mol (Chem.Mol, optional) – rdkit molecule object

  • props (dict, optional) – properties of the molecule

  • representations (tuple, optional) – representations of the molecule

as_rd_mol(add_props=False) Mol[source]

Get the rdkit molecule object.

Returns:

(Chem.Mol) rdkit molecule object

property id: str

Get the identifier of the molecule.

property origin: str

Get the name of the storage where the molecule resides.

Returns:

The name of the storage.

Return type:

str

property parent: TabularMol

Get the parent molecule.

property props: dict[str, Any] | None

Get the row of the dataframe corresponding to this molecule.

property representations: list[TabularMol] | None

Get the representations of the molecule.

sdf() str[source]

property smiles: str

Get the SMILES of the molecule.

to_file(directory, extension='.csv') str[source]

Write a minimal file containing the SMILES and the ID of the molecule. Used for ligrep (.csv is the preferred format).

qsprpred.data.storage.tabular.simple module

class qsprpred.data.storage.tabular.simple.PandasChemStore(name: str, path: str, df: DataFrame | None = None, smiles_col: str = 'SMILES', add_rdkit: bool = False, overwrite: bool = False, save: bool = False, standardizer=None, identifier=None, id_col: str | None = None, autoindex_name: str | None = None, store_format: str = 'pkl', chunk_processor: ParallelGenerator = None, chunk_size: int | None = None, n_jobs: int = 1)[source]

Bases: ParallelizedChemStore

Tabular storage for molecules. An example implementation of ChemStore that uses PandasDataTable to store the data.

Variables:
  • name (str) – Name of the storage.

  • path (str) – Path to the storage directory.

  • storeFormat (str) – Format to use for storing the data.

  • nJobs (int) – Number of parallel jobs to use for processing.

  • chunkSize (int) – Size of the chunks to use for processing.

  • chunkProcessor (ParallelGenerator) – Parallel generator to use for processing.

Initialize the storage. If the storage with the given name already exists in the destination it will be reloaded.

Parameters:
  • name (str) – Name of the storage.

  • path (str) – Path to the storage directory.

  • df (pd.DataFrame, optional) – Data frame to initialize the storage with.

  • smiles_col (str, optional) – Name of the column containing the SMILES.

  • add_rdkit (bool, optional) – Whether to add RDKit molecules to the storage.

  • overwrite (bool, optional) – Whether to overwrite the storage if it already exists.

  • save (bool, optional) – Whether to save the storage after initialization.

  • standardizer (ChemStandardizer, optional) – Standardizer to use for the molecules.

  • identifier (ChemIdentifier, optional) – Identifier to use for the molecules.

  • id_col (str, optional) – Name of the column containing the molecule IDs.

  • store_format (str, optional) – Format to use for storing the data.

  • chunk_processor (ParallelGenerator, optional) – Parallel generator to use for processing.

  • chunk_size (int, optional) – Size of the chunks to use for processing.

  • n_jobs (int, optional) – Number of parallel jobs to use for processing.

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True, library: str | None = None)[source]

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – Whether to raise an error if the entry already exists.

  • library (str) – Name of the library to add the entries to.

addLibrary(name: str, df: DataFrame, smiles_col: str = 'SMILES', id_col: str | None = None, add_rdkit=False, store_format='pkl', save=False)[source]

Reads molecules from a data frame and adds standardized SMILES to the store as a new library.

Parameters:
  • name (str) – name of the library

  • df (pd.DataFrame) – data frame containing the molecules

  • smiles_col (str) – name of the column containing the SMILES

  • id_col (str) – name of the column containing the molecule IDs

  • add_rdkit (bool) – whether to add RDKit molecules to the store

  • store_format (str) – format to use for storing the data

  • save (bool) – whether to save the store after adding the library

addMols(smiles: Iterable[str], props: dict[str, list] | None = None, library: str | None = None, raise_on_existing: bool = True, add_rdkit: bool = False, store_format: str = 'pkl', save: bool = False, chunk_size: int | None = None, chunk_processor: ParallelGenerator | None = None) list[TabularMol][source]

Add molecules to the store using their raw SMILES.

The SMILES will be standardized and an identifier will be calculated.

Parameters:
  • smiles (list[str]) – SMILES of the molecules to add.

  • props (dict, optional) – Additional properties to store with the molecule.

  • library (str, optional) – Name of the library to add the molecule to.

  • raise_on_existing (bool, optional) – Whether to raise an error if the molecule already exists in the store.

  • add_rdkit (bool, optional) – Whether to add RDKit molecules to the store.

  • store_format (str, optional) – Format to use for storing the data.

  • save (bool, optional) – Whether to save the store after adding the molecule.

  • chunk_size (int, optional) – Size of the chunks to use for processing (not used).

  • chunk_processor (ParallelGenerator, optional) – Parallel generator to use for processing (not used).

Returns:

Instances of the added molecules.

Return type:

(list[StoredMol])

addProperty(name: str, data: Sized, ids: list[str] | None = None)[source]

Add a property to the storage.

Parameters:
  • name (str) – Name of the property to add.

  • data (list) – Data of the property.

  • ids (list, optional) – IDs of the molecules to add the property for.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None]

Apply a function to the molecules in the data frame.

Parameters:
  • func (callable) – Function to apply to the molecules.

  • func_args (list, optional) – Additional arguments to pass to the function.

  • func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.

  • on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.

  • chunk_type (str, optional) – Type of molecule to send to the function. Can be ‘mol’, ‘smiles’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

  • no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.

Returns:

A generator that yields the results of the supplied function on the chunked molecules from the data set.

Return type:

(Generator)

applyIdentifier(identifier: ChemIdentifier)[source]

Apply an identifier to the SMILES in the store.

Parameters:

identifier (ChemIdentifier) – Identifier to apply to the SMILES.

applyStandardizer(standardizer: ChemStandardizer)[source]

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – Standardizer to apply to the SMILES.

property chunkProcessor: ParallelGenerator

Parallel generator to use for processing.

property chunkSize: int

Size of the chunks to use for processing.

clear(files_only: bool = True)[source]

Clear the storage.

dropEntries(ids: Iterable[str])[source]

Drop entries from the store.

Parameters:

ids (tuple) – IDs of the entries to drop.

classmethod fromDF(df: DataFrame, *args, name: str | None = None, **kwargs) PandasChemStore[source]

Create a new instance from a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame to create the instance from.

  • name (str) – Name of the new instance. Defaults to the name of the DataFrame.

  • *args – Additional arguments to pass to the constructor.

  • **kwargs – Additional keyword arguments to pass to the constructor.

Returns:

New instance created from the DataFrame.

Return type:

PropertyStorage

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getDF() DataFrame[source]

Get the stored properties as a pandas DataFrame.

getMol(mol_id) TabularMol[source]

Get a molecule from the store by its ID.

Parameters:

mol_id (str) – ID of the molecule to get.

Returns:

Molecule with the given ID.

Return type:

(TabularMol)

getMolCount() int[source]

Get the number of molecules in the store.

getMolIDs() tuple[str, ...][source]

Returns a tuple of all molecule IDs in the store.

Good for checking possible overlaps between stores.

Returns:

Tuple of all molecule IDs in the store.

Return type:

(tuple)
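An overlap check of the kind mentioned above can be done with ordinary set operations on the returned tuples. In this sketch the two ID tuples are made-up values standing in for the results of getMolIDs on two different stores.

```python
# Hypothetical ID tuples, as two stores might return them.
ids_a = ("mol_0", "mol_1", "mol_2")
ids_b = ("mol_2", "mol_3")

# Converting to sets makes the intersection cheap to compute;
# sorting gives a stable order for reporting.
overlap = sorted(set(ids_a) & set(ids_b))
only_in_a = sorted(set(ids_a) - set(ids_b))
```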

getProperties() list[str][source]

Get a list of all properties in the storage.

Returns:

List of all properties in the storage.

Return type:

(list)

getProperty(name: str, ids: list[str] | None = None) Series[source]

Get a property from the storage.

Parameters:
  • name (str) – Name of the property to get.

  • ids (list, optional) – IDs of the molecules to get the property for.

Returns:

Series containing the property values.

Return type:

(pd.Series)

getSubset(subset: Iterable[str], ids: list[str] | None = None, name: str | None = None) PandasChemStore[source]

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – List of property names to include in the subset.

  • ids (list, optional) – IDs of the entries to include in the subset.

  • name (str, optional) – Name of the new table.

Returns:

New table containing the subset.

Return type:

(PandasChemStore)

getSummary()[source]

Make a summary with some statistics about the molecules in this table.

The summary contains the number of molecules per target and the number of unique molecules per target. Requires this data set to be imported from Papyrus for now.

Returns:

A dataframe with the summary statistics.

Return type:

(pd.DataFrame)

hasProperty(name: str) bool[source]

Check if a property is present in the storage.

property idProp: str

Name of the property containing unique molecule IDs. The values are determined by the attached identifier.

property identifier

Identifier used in this storage.

iterChunks(size: int = 1000, on_props: Iterable[str] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None][source]

Iterate over the molecules in the store in chunks.

Parameters:
  • size (int, optional) – Size of the chunks to use for processing.

  • on_props (list, optional) – Properties to pass to the function.

  • chunk_type (str, optional) – Type of molecule to send to the function.

Yields:

(list) – List of molecules in the chunk.

iterMols() Generator[TabularMol, None, None][source]

Iterate over the molecules in the store.

Yields:

(TabularMol) – Molecule from the store.

property libsPath

Path to the directory where the primary library tables are stored.

property metaFile: str

Path to the meta file.

property nJobs

Number of parallel jobs to use for processing.

property nLibs

Number of libraries in this storage.

property nMols: int

Number of molecules in storage.

property name: str

Name of the data set.

property originalSmilesProp: str

Name of the column containing the original SMILES before standardization.

processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator

Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.

IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as the original data frame. To identify the processed molecules, you can pass the value of idProp in add_props or use MolProcessorWithID as the processor.

Parameters:
  • processor (MolProcessor) – MolProcessor object to use for processing.

  • proc_args (list, optional) – Any additional positional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.

  • add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

Returns:

A generator that yields the results of the supplied processor on the chunked molecules from the data set.

Return type:

Generator

reload()[source]

Reload the storage from disk.

removeMol(mol_id)[source]

Remove a molecule from the store.

Parameters:

mol_id (str) – ID of the molecule to remove.

removeProperty(name: str)[source]

Remove a property from the storage.

Parameters:

name (str) – Name of the property to remove.

save()[source]

Save the whole storage to disk.

Returns:

Path to the saved storage.

Return type:

(str)

searchOnProperty(prop_name: str, values: list[float | int | str], exact=False, name: str | None = None) PandasChemStore[source]

Search in this table using a property name and a list of values. It is assumed that the property is searchable with string matching or direct comparison if a number is supplied. Note that the types of the query list need to be consistent. Otherwise, a ValueError will be raised.

In the case of string comparison, if ‘exact’ is False, the search will be performed with partial matching, i.e. all molecules that contain any of the given values in the property will be returned. If ‘exact’ is True, only molecules that have the exact property value for any of the given values will be returned.

Parameters:
  • prop_name (str) – Name of the property to search on.

  • values (list[str]) – List of values to search for. If any of the values is found in the property, the molecule will be considered a match.

  • exact (bool, optional) – Whether to use exact matching, i.e. whether to search for exact strings or just substrings. Defaults to False.

  • name (str | None, optional) – Name of the new table. Defaults to the name of the old table, plus the _searched suffix.

Returns:

A new table with the molecules from the old table with the given property values.

Return type:

(MoleculeTable)

Raises:

ValueError – If the types of the query list are not consistent.
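The matching rules described above can be summarized in a few lines of plain Python. This is a sketch of the documented semantics only, not the store's implementation: matches is a hypothetical function, and treating any mix of Python types in the query list as inconsistent is an assumption made for this example.

```python
def matches(value, queries, exact=False):
    """Return True if value matches any query (illustrative semantics only)."""
    # Assumption: "consistent" means all queries share one Python type.
    kinds = {type(q) for q in queries}
    if len(kinds) > 1:
        raise ValueError("query values must all have the same type")
    if all(isinstance(q, str) for q in queries):
        text = str(value)
        if exact:
            return any(text == q for q in queries)
        # Partial matching: the query only has to occur inside the value.
        return any(q in text for q in queries)
    # Numbers are compared directly.
    return any(value == q for q in queries)


rows = {
    "mol_0": "kinase inhibitor",
    "mol_1": "kinase",
    "mol_2": "protease",
}
partial = [k for k, v in rows.items() if matches(v, ["kinase"])]
exact = [k for k, v in rows.items() if matches(v, ["kinase"], exact=True)]
```

Partial matching returns both entries that contain the query string, while exact matching keeps only the entry whose value equals it.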

searchWithSMARTS(patterns: list[str], operator: Literal['or', 'and'] = 'or', use_chirality: bool = False, name: str | None = None, match_function: MolProcessor | None = None) PandasChemStore[source]

Search the molecules in the table with SMARTS patterns.

Parameters:
  • patterns – List of SMARTS patterns to search with.

  • operator (str) – Whether to use an “or” or “and” operator on patterns. Defaults to “or”.

  • use_chirality – Whether to use chirality in the search.

  • name – Name of the new table. Defaults to the name of the old table, plus the smarts_searched suffix.

  • match_function – Function to use for matching the molecules to the SMARTS patterns. Defaults to match_mol_to_smarts.

Returns:

A new table with the molecules that match the patterns.

Return type:

(MolTable)
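The effect of the operator argument can be shown without a cheminformatics toolkit by combining precomputed match flags. In this sketch combine_hits is a hypothetical helper and the flags are invented. In practice they would come from a substructure matcher applied to each pattern.

```python
def combine_hits(hits, operator="or"):
    """Combine per-pattern match flags for one molecule."""
    if operator == "or":
        return any(hits)   # at least one pattern matched
    if operator == "and":
        return all(hits)   # every pattern matched
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical flags for two patterns:
# [matches pattern 1, matches pattern 2]
hits_per_mol = {
    "mol_0": [True, True],
    "mol_1": [True, False],
    "mol_2": [False, False],
}
either = [m for m, hits in hits_per_mol.items() if combine_hits(hits, "or")]
both = [m for m, hits in hits_per_mol.items() if combine_hits(hits, "and")]
```

With “or”, a molecule is kept if any pattern matches. With “and”, it is kept only if all patterns match.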

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in storage.

property smilesProp: str

Name of the property containing the SMILES.

property standardizer: ChemStandardizer

Standardizer used in this storage.

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.data.storage.tabular.simple.ParallelizedChemStore[source]

Bases: ChemStore, SMARTSSearchable, PropSearchable, Summarizable, Parallelizable, ABC

Base class with default implementations of some parallel processing features for ChemStore instances that want to support them. This mixin defines methods required by the ChunkIterable and MolProcessable interfaces, which makes it easier for downstream ChemStore implementations to support parallel processing.

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – Whether to raise an exception if an entry already exists. Defaults to True.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.

abstract addMols(smiles: Iterable[str], props: dict[str, list] | None = None, *args, **kwargs) list[StoredMol]

Add a molecule to the store.

This method should not perform any standardization or identifier calculation. The add_mol_from_smiles method should be used instead if automatic standardization and identification should be performed before storage.

Parameters:
  • smiles (Iterable[str]) – molecules to add as SMILES

  • props (dict, optional) – additional metadata to store with the molecules

  • args – Additional positional arguments to be passed to each molecule.

  • kwargs – Additional keyword arguments to be passed to each molecule.

Returns:

instances of the added molecules

Return type:

list[StoredMol]

Raises:

ValueError – if the molecules cannot be added

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.

apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol', chunk_processor: ParallelGenerator | None = None, no_parallel: bool = False) Generator[Iterable[Any], None, None][source]

Apply a function to the molecules in the data frame.

Parameters:
  • func (callable) – Function to apply to the molecules.

  • func_args (list, optional) – Additional arguments to pass to the function.

  • func_kwargs (dict, optional) – Additional keyword arguments to pass to the function.

  • on_props (tuple, optional) – Properties to pass to the function. If None, all properties will be passed.

  • chunk_type (str, optional) – Type of chunk to send to the function. Can be ‘mol’, ‘smiles’, ‘rdkit’, or ‘df’. Defaults to ‘mol’, which implies TabularMol objects.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

  • no_parallel (bool, optional) – Whether to disable parallel processing. Defaults to False.

Returns:

A generator that yields the results of the supplied function on the chunked molecules from the data set.

Return type:

(Generator)
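The general pattern of applying a function chunk by chunk, with an optional sequential fallback, can be sketched as follows. This is an illustration only: apply_to_chunks and count_symbol are hypothetical helpers, and a standard-library thread pool stands in for the ParallelGenerator used by the library.

```python
from concurrent.futures import ThreadPoolExecutor


def count_symbol(smiles_chunk, symbol):
    """Count occurrences of a character in each SMILES string of a chunk."""
    return [smi.count(symbol) for smi in smiles_chunk]


def apply_to_chunks(func, chunks, func_args=None, func_kwargs=None, no_parallel=False):
    """Yield func(chunk, *func_args, **func_kwargs) for every chunk."""
    args = list(func_args or [])
    kwargs = dict(func_kwargs or {})
    if no_parallel:
        # Sequential fallback: process one chunk at a time.
        for chunk in chunks:
            yield func(chunk, *args, **kwargs)
        return
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(func, chunk, *args, **kwargs) for chunk in chunks]
        # Iterating in submission order keeps the results in chunk order here.
        for future in futures:
            yield future.result()


chunks = [["CCO", "CCN"], ["c1ccccc1O"]]
parallel_results = list(apply_to_chunks(count_symbol, chunks, func_args=["O"]))
sequential_results = list(
    apply_to_chunks(count_symbol, chunks, func_args=["O"], no_parallel=True)
)
```

Both calls produce one result list per chunk, so the caller decides how to flatten or aggregate them.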

abstract applyIdentifier(identifier: ChemIdentifier)

Apply an identifier to the SMILES in this instance (i.e. remove duplicates).

Parameters:

identifier (ChemIdentifier) – The identifier to apply.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

abstract property chunkProcessor: ParallelGenerator

Parallel generator to use for processing.

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getMol(mol_id: str) StoredMol

Get a molecule from the store using its ID.

Parameters:

mol_id (str) – identifier of the molecule to search

Returns:

instance of the molecule

Return type:

StoredMol

abstract getMolCount()

Get the number of molecules in the store.

Returns:

(int) number of molecules

abstract getMolIDs() tuple[str, ...]

Get all molecule IDs in the store.

Returns:

molecule IDs

Return type:

tuple[str]

abstract getProperties() list[str]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (tuple, optional) – The IDs of the entries to get the property for.

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage
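Selecting a subset by property names and, optionally, by entry IDs can be sketched with plain dictionaries. The get_subset helper and the row layout are hypothetical and only illustrate the selection logic; the real method returns a PropertyStorage.

```python
def get_subset(rows, subset, ids=None):
    """Return only the requested properties for the requested entry IDs."""
    selected_ids = list(rows) if ids is None else list(ids)
    return {
        entry_id: {prop: rows[entry_id][prop] for prop in subset}
        for entry_id in selected_ids
    }


rows = {
    "mol_0": {"SMILES": "CCO", "pIC50": 6.1, "source": "toy"},
    "mol_1": {"SMILES": "CCN", "pIC50": 7.3, "source": "toy"},
}
all_entries = get_subset(rows, ["SMILES", "pIC50"])
one_entry = get_subset(rows, ["pIC50"], ids=["mol_1"])
```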

abstract getSummary() DataFrame

Make a summary with some statistics about this object or action.

Returns:

A dataframe with the summary statistics.

Return type:

(pd.DataFrame)

abstract hasProperty(name: str) bool

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the molecule IDs.

abstract property identifier: ChemIdentifier

Get the identifier used by this instance.

Returns:

The identifier used by this instance.

Return type:

ChemIdentifier

abstract iterChunks(size: int | None = None, on_props: list | None = None, chunk_type: Literal['mol', 'smiles', 'rdkit', 'df'] = 'mol') Generator[list[StoredMol | str | Mol | DataFrame], None, None]

Iterate over chunks of molecules across the store.

Parameters:
  • size (int, optional) – The size of the chunks.

  • on_props (list, optional) – The properties to include in the chunks.

  • chunk_type (str, optional) – The type of chunks to yield.

Returns:

an iterable of lists of stored molecules
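Chunked iteration of this kind can be sketched with the standard library. The iter_chunks function below is a hypothetical helper operating on a plain list of SMILES strings; it is not the library's implementation.

```python
from itertools import islice


def iter_chunks(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
chunks = list(iter_chunks(smiles, size=2))
```

The last chunk is shorter than the requested size whenever the number of items is not a multiple of it.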

abstract iterMols() Generator[StoredMol, None, None]

Iterate over all molecules in the store.

Returns:

iterator over StoredMol instances

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract property nJobs: int

Get the number of jobs to run in parallel.

property nMols: int

Number of molecules in storage.

abstract property name: str

Get the name of the storage.

processMols(processor: MolProcessor, proc_args: Iterable[Any] | None = None, proc_kwargs: dict[str, Any] | None = None, mol_type: Literal['smiles', 'mol', 'rdkit'] = 'mol', add_props: Iterable[str] | None = None, chunk_processor: ParallelGenerator | None = None) Generator[source]

Apply a function to the molecules in the data frame. The SMILES or an RDKit molecule will be supplied as the first positional argument to the function. Additional properties to provide from the data set can be specified with ‘add_props’, which will be a dictionary supplied as an additional positional argument to the function.

IMPORTANT: For successful parallel processing with multiprocessing, the processor must be picklable. Also note that the returned generator yields results as soon as they are ready, which means that the chunks of data may not be in the same order as in the original data frame. To identify the processed molecules, pass the value of idProp in add_props or use MolProcessorWithID as the processor.

Parameters:
  • processor (MolProcessor) – MolProcessor object to use for processing.

  • proc_args (list, optional) – Any additional positional arguments to pass to the processor.

  • proc_kwargs (dict, optional) – Any additional keyword arguments to pass to the processor.

  • mol_type (str, optional) – Type of molecule to send to the processor. Can be ‘smiles’, ‘mol’, or ‘rdkit’. Defaults to ‘mol’, which implies TabularMol objects.

  • add_props (list, optional) – List of data set properties to send to the processor. If None, all properties will be sent.

  • chunk_processor (ParallelGenerator, optional) – The parallel generator to use for processing. If not specified, self.chunkProcessor is used.

Returns:

A generator that yields the results of the supplied processor on the chunked molecules from the data set.

Return type:

Generator
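Because result chunks may arrive out of order, results tagged with molecule IDs can be mapped back to the original order. The sketch below assumes each result chunk is a list of (ID, value) pairs; that shape, the restore_order helper, and the example values are hypothetical, and the actual output of a given processor may differ.

```python
def restore_order(result_chunks, original_ids):
    """Reorder ID-tagged results to follow the original ID order."""
    by_id = {}
    for chunk in result_chunks:
        for mol_id, value in chunk:
            by_id[mol_id] = value
    return [by_id[mol_id] for mol_id in original_ids]


original_ids = ["mol_0", "mol_1", "mol_2", "mol_3"]
# Simulated output where the second chunk finished before the first one.
shuffled_chunks = [
    [("mol_2", 30.0), ("mol_3", 40.0)],
    [("mol_0", 10.0), ("mol_1", 20.0)],
]
ordered = restore_order(shuffled_chunks, original_ids)
```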

abstract reload()

Reset the current state by reloading from storage.

abstract removeMol(mol_id: str)

Remove a molecule from the store.

Parameters:

mol_id (str) – identifier of the molecule to remove

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the molecules within this MoleculeDataSet on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)

abstract searchWithSMARTS(patterns: list[str]) SMARTSSearchable

Search the molecules within this instance with SMARTS patterns.

Parameters:

patterns – List of SMARTS patterns to search with.

Returns:

Another instance that can be filtered further.

Return type:

(SMARTSSearchable)

property smiles: Generator[str, None, None]

Generator of SMILES strings of all molecules in storage.

abstract property smilesProp: str

Get the name of the property that contains the SMILES strings.

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
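The round trip between toJSON, fromJSON, toFile, and fromFile can be sketched with a toy class built on the standard json module. ToySerializable and its attributes are hypothetical; the real classes may store different fields and rely on other serialization helpers.

```python
import json
import os
import tempfile


class ToySerializable:
    """Hypothetical object that can be rebuilt from its JSON form."""

    def __init__(self, name, chunk_size):
        self.name = name
        self.chunk_size = chunk_size

    def toJSON(self):
        # Store everything needed to reconstruct the object.
        return json.dumps({"name": self.name, "chunk_size": self.chunk_size})

    @classmethod
    def fromJSON(cls, json_str):
        return cls(**json.loads(json_str))

    def toFile(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    path = ToySerializable("demo", 50).toFile(os.path.join(tmp_dir, "demo_meta.json"))
    restored = ToySerializable.fromFile(path)
```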

qsprpred.data.storage.tabular.stored_mol module

class qsprpred.data.storage.tabular.stored_mol.TabularMol(mol_id: str, origin: str, smiles: str, parent: TabularMol | None = None, rd_mol: Mol | None = None, props: dict[str, Any] | None = None, representations: tuple[TabularMol, ...] | None = None)[source]

Bases: StoredMol

Simple implementation of a molecule that is stored in a tabular storage.

Create a new molecule instance.

Parameters:
  • mol_id (str) – identifier of the molecule

  • origin (str) – name of the storage where the molecule resides

  • smiles (str) – SMILES of the molecule

  • parent (TabularMol, optional) – parent molecule

  • rd_mol (Chem.Mol, optional) – rdkit molecule object

  • props (dict, optional) – properties of the molecule

  • representations (tuple, optional) – representations of the molecule

as_rd_mol() Mol[source]

Get the rdkit molecule object.

Returns:

(Chem.Mol) rdkit molecule object

property id: str

Get the identifier of the molecule.

property origin: str

Get the name of the storage where the molecule resides.

Returns:

The name of the storage.

Return type:

str

property parent: TabularMol

Get the parent molecule.

property props: dict[str, Any] | None

Get the row of the dataframe corresponding to this molecule.

property representations: list[TabularMol] | None

Get the representations of the molecule.

property smiles: str

Get the SMILES of the molecule.
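The relationship between a molecule, its parent, and its representations can be sketched with a small dataclass. ToyMol is a hypothetical stand-in that mirrors the constructor arguments listed above; it is not the TabularMol implementation, and the IDs and store names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyMol:
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    mol_id: str
    origin: str
    smiles: str
    parent: Optional["ToyMol"] = None
    props: Optional[dict[str, Any]] = None
    representations: tuple["ToyMol", ...] = ()


parent = ToyMol("mol_0", "toy_store", "CCO", props={"pIC50": 6.1})
# A representation points back to the molecule it was derived from.
conformer = ToyMol("mol_0_conf_0", "toy_store_reps", "CCO", parent=parent)
parent.representations = (conformer,)
```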

Module contents