qsprpred.data.chem.standardizers package

Submodules

qsprpred.data.chem.standardizers.base module

exception qsprpred.data.chem.standardizers.base.ChemStandardizationException

Bases: Exception

Exception raised when standardization fails.

add_note(note, /)

Add a note to the exception.

args

with_traceback(tb, /)

Set self.__traceback__ to tb and return self.

class qsprpred.data.chem.standardizers.base.ChemStandardizer

Bases: ABC

Standardizer to convert SMILES to a standardized form.

This class defines the interface of a uniquely identifiable standardizer. The getID method should return a unique identifier for the standardizer based on its settings. Standardizers that have the same ID should produce the same standardized form for a given SMILES.

The main method of the class is convertSMILES, which should convert a given SMILES to a standardized form based on the settings of the standardizer.

abstract convertSMILES(smiles: str) → str | None

Convert the SMILES to a standardized form.

Parameters:

smiles (str) – SMILES to be converted

Returns:

The standardized SMILES string or None if standardization fails or the molecule is deemed invalid.

Return type:

str | None

Raises:

ChemStandardizationException – if standardization fails, but the upstream code should be notified and handle the exception.
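
The two failure modes described above (a None return versus a raised ChemStandardizationException) can be handled separately by calling code. The following is a minimal sketch and not qsprpred code: the exception class is a local stand-in so the example runs without the library, and toy_convert and standardize_all are hypothetical helpers.

```python
from __future__ import annotations


class ChemStandardizationException(Exception):
    """Local stand-in for the library exception of the same name."""


def toy_convert(smiles: str) -> str | None:
    """Hypothetical converter that mimics the documented contract."""
    if not smiles:
        # Empty input is treated as an invalid molecule: signal with None.
        return None
    if "?" in smiles:
        # Unexpected characters are treated as a failure the caller must see.
        raise ChemStandardizationException(f"cannot standardize {smiles!r}")
    return smiles.strip()


def standardize_all(smiles_list, convert):
    """Split inputs into standardized, rejected (None), and failed (raised)."""
    ok, rejected, failed = [], [], []
    for smi in smiles_list:
        try:
            result = convert(smi)
        except ChemStandardizationException as exc:
            failed.append((smi, str(exc)))
            continue
        if result is None:
            rejected.append(smi)
        else:
            ok.append(result)
    return ok, rejected, failed


ok, rejected, failed = standardize_all(["CCO ", "", "C?C"], toy_convert)
```

Keeping the rejected and failed lists apart lets a caller drop invalid molecules silently while still logging unexpected errors.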

abstract classmethod fromSettings(settings: dict) → ChemStandardizer

Create a new standardizer from a settings dictionary.

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer
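
A settings round trip through a JSON file might look like the sketch below. ToyStandardizer is a hypothetical stand-in defined here so the example runs without qsprpred, and the body of its fromSettingsFile (read the JSON, then delegate to fromSettings) is an assumption about how such a loader could work, not the library's implementation.

```python
import json
import os
import tempfile


class ToyStandardizer:
    """Hypothetical stand-in, not a qsprpred class."""

    def __init__(self, keep_stereo: bool = True):
        self.keep_stereo = keep_stereo

    @property
    def settings(self) -> dict:
        # Everything needed to rebuild an equivalent instance.
        return {"keep_stereo": self.keep_stereo}

    @classmethod
    def fromSettings(cls, settings: dict) -> "ToyStandardizer":
        return cls(**settings)

    @classmethod
    def fromSettingsFile(cls, path: str) -> "ToyStandardizer":
        # Assumed behavior: parse the JSON file and delegate to fromSettings.
        with open(path, encoding="utf-8") as handle:
            return cls.fromSettings(json.load(handle))


original = ToyStandardizer(keep_stereo=False)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "settings.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(original.settings, handle)
    restored = ToyStandardizer.fromSettingsFile(path)
```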

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str
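
The description above names the hash function (MD5) but not the text encoding or the digest format. The sketch below assumes UTF-8 encoding and a hexadecimal digest; hash_id is a hypothetical helper, not part of qsprpred.

```python
import hashlib


def hash_id(standardizer_id: str) -> str:
    # Assumption: the ID is UTF-8 encoded and the digest is returned as hex.
    return hashlib.md5(standardizer_id.encode("utf-8")).hexdigest()


first = hash_id("NaiveStandardizer")
second = hash_id("NaiveStandardizer")
other = hash_id("ValidationStandardizer")
```

Equal IDs always give equal hashes, so the hash can serve as a compact, fixed-length key wherever the full ID would be unwieldy.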

abstract getID() → str

Return the unique identifier of the standardizer. This method should return a unique identifier based on the settings of the standardizer.

Two standardizers with the same settings should have the same ID and produce the same standardized form for a given SMILES.

Returns:

The unique identifier of the standardizer.

Return type:

str

abstract property settings: dict

Settings of the standardizer. It should contain complete information needed to initialize another equivalent standardizer.
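
To illustrate how the abstract members fit together, here is a minimal sketch of an implementation. StandardizerInterface is a local stand-in that mirrors the abstract members documented above so the example runs without qsprpred; WhitespaceStandardizer and its ID format are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class StandardizerInterface(ABC):
    """Local stand-in mirroring the abstract members documented above."""

    @abstractmethod
    def convertSMILES(self, smiles: str) -> str | None: ...

    @property
    @abstractmethod
    def settings(self) -> dict: ...

    @abstractmethod
    def getID(self) -> str: ...

    @classmethod
    @abstractmethod
    def fromSettings(cls, settings: dict) -> StandardizerInterface: ...


class WhitespaceStandardizer(StandardizerInterface):
    """Hypothetical standardizer that trims input and can reject empty strings."""

    def __init__(self, reject_empty: bool = True):
        self.reject_empty = reject_empty

    def convertSMILES(self, smiles: str) -> str | None:
        cleaned = smiles.strip()
        if not cleaned and self.reject_empty:
            return None
        return cleaned

    @property
    def settings(self) -> dict:
        return {"reject_empty": self.reject_empty}

    def getID(self) -> str:
        # Sort keys so the ID does not depend on dictionary ordering.
        parts = [f"{key}={value}" for key, value in sorted(self.settings.items())]
        return "WhitespaceStandardizer~" + ":".join(parts)

    @classmethod
    def fromSettings(cls, settings: dict) -> WhitespaceStandardizer:
        return cls(**settings)


a = WhitespaceStandardizer(reject_empty=False)
b = WhitespaceStandardizer.fromSettings(a.settings)
```

Because b is rebuilt from a's settings, the two instances share an ID and convert any input identically, which is the equivalence the interface asks for.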

class qsprpred.data.chem.standardizers.base.Standardizable

Bases: ABC

Interface for objects that use chemical standardization with ChemStandardizer objects.

abstract applyStandardizer(standardizer: ChemStandardizer)

Apply a standardizer to the SMILES in the store.

Parameters:

standardizer (ChemStandardizer) – The standardizer to apply

abstract property standardizer: ChemStandardizer

Get the standardizer used by the store.

Returns:

The standardizer used by the store.

Return type:

ChemStandardizer
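
A store that follows this interface could be sketched as below. Both classes are hypothetical stand-ins written for illustration, and dropping entries for which the standardizer returns None is an assumed policy, not documented behavior.

```python
class StripStandardizer:
    """Hypothetical standardizer: trims whitespace and rejects empty input."""

    def convertSMILES(self, smiles):
        cleaned = smiles.strip()
        return cleaned or None


class ToyStore:
    """Hypothetical in-memory SMILES store, not a qsprpred class."""

    def __init__(self, smiles):
        self._smiles = list(smiles)
        self._standardizer = None

    @property
    def standardizer(self):
        # The standardizer most recently applied to this store, if any.
        return self._standardizer

    @property
    def smiles(self):
        return list(self._smiles)

    def applyStandardizer(self, standardizer):
        # Assumed policy: entries that standardize to None are dropped.
        converted = (standardizer.convertSMILES(s) for s in self._smiles)
        self._smiles = [s for s in converted if s is not None]
        self._standardizer = standardizer


store = ToyStore([" CCO", "", "c1ccccc1 "])
standardizer = StripStandardizer()
store.applyStandardizer(standardizer)
```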

qsprpred.data.chem.standardizers.check_smiles module

class qsprpred.data.chem.standardizers.check_smiles.CheckSmilesValid(id_prop: str | None = 'ID')

Bases: MolProcessorWithID

Processor to check the validity of the SMILES.

Initialize the processor with the name of the property that contains the molecule’s unique identifier.

Parameters:

id_prop (str) – Name of the property that contains the molecule’s unique identifier. Defaults to “ID”, as shown in the class signature.

iterMolsAndIDs(mols, props: dict[str, list] | None)

Iterate over molecules and their corresponding IDs regardless of the input molecule format. This is just a helper function that will detect the input and yield the molecule and its ID.

Parameters:
  • mols (list[str | Mol | StoredMol]) – A list of SMILES or RDKit molecules to process.

  • props (dict) – An optional dictionary of properties related to the molecules to process.

Returns:

A tuple of the molecules and their IDs.

Return type:

tuple[Mol, str]

property requiredProps: list[str]

The properties required by the processor. This is to inform the caller that the processor requires certain properties to be passed to the __call__ method or via the props attribute of StoredMol instances.

property supportsParallel: bool

Return True if the processor supports parallel processing.

class qsprpred.data.chem.standardizers.check_smiles.ValidationStandardizer

Bases: ChemStandardizer

Standardizer that checks the validity of the SMILES by attempting to sanitize the molecule using RDKit.

Variables:

checker (CheckSmilesValid) – Processor to check the validity of the SMILES

Initialize the standardizer.

Raises:

ValueError – If the SMILES is invalid

convertSMILES(smiles: str) → str | None

Check the validity of the SMILES.

Parameters:

smiles (str) – SMILES to be checked

Returns:

the standardized SMILES

Return type:

str | None

classmethod fromSettings(settings: dict) → ValidationStandardizer

Create a standardizer from settings. In this case, the settings are ignored.

Parameters:

settings (dict) – Settings of the standardizer

Returns:

The standardizer created from settings

Return type:

ValidationStandardizer

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str

getID()

Return the unique identifier of the standardizer. In this case, it is just “ValidationStandardizer”. There are no settings to consider.

property settings

Settings of the standardizer. Empty in this case since there is nothing to set except the default settings.

qsprpred.data.chem.standardizers.chembl module

class qsprpred.data.chem.standardizers.chembl.ChemblStandardizer(isomeric_smiles: bool = True, sanitize: bool = True)

Bases: ChemStandardizer

Standardizer using the ChEMBL standardizer.

Variables:
  • isomericSmiles (bool) – return the isomeric smiles.

  • sanitize (bool) – sanitize SMILES before standardization.

Initialize the ChEMBL standardizer.

Parameters:
  • isomeric_smiles (bool) – return the isomeric smiles. Defaults to True.

  • sanitize (bool) – sanitize SMILES before standardization. Defaults to True.

convertSMILES(smiles: str) → str | None

Standardize SMILES using the ChEMBL standardizer.

Parameters:

smiles (str) – SMILES to be standardized

Returns:

standardized SMILES string or None if standardization failed.

Return type:

str | None

classmethod fromSettings(settings: dict) → ChemblStandardizer

Create a standardizer from settings.

Parameters:

settings (dict) – Settings of the standardizer

Returns:

The standardizer created from settings

Return type:

ChemblStandardizer

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str

getID() → str

Return the unique identifier of the standardizer.

In this case, the identifier starts with “ChEMBLStandardizer” followed by the settings of the standardizer concatenated with “~”.

Returns:

unique identifier of the standardizer

Return type:

str
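
An identifier of the shape described above could be built as in the following sketch. The description gives only the prefix and the “~” separator; the name=value rendering and the order of the settings are assumptions, and chembl_style_id is a hypothetical helper rather than the library's code.

```python
def chembl_style_id(isomeric_smiles: bool = True, sanitize: bool = True) -> str:
    # Assumption: each setting is rendered as name=value, in a fixed order.
    parts = [f"isomeric_smiles={isomeric_smiles}", f"sanitize={sanitize}"]
    return "~".join(["ChEMBLStandardizer", *parts])


default_id = chembl_style_id()
no_sanitize_id = chembl_style_id(sanitize=False)
```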

property settings: dict

Settings of the standardizer. It should contain complete information needed to initialize another equivalent standardizer.

qsprpred.data.chem.standardizers.chembl.chembl_smi_standardizer(smi: str, isomeric_smiles: bool = True, sanitize: bool = True) → str | None

Standardize SMILES using ChEMBL standardizer.

Parameters:
  • smi (str) – SMILES string to be standardized.

  • isomeric_smiles (bool) – return the isomeric smiles. Defaults to True.

  • sanitize (bool) – applies sanitization using the ChEMBL standardizer. Defaults to True.

Returns:

standardized SMILES string or None if standardization failed.

Return type:

str | None

qsprpred.data.chem.standardizers.naive module

class qsprpred.data.chem.standardizers.naive.NaiveStandardizer

Bases: ChemStandardizer

Naive standardizer

Briefly, the standardization process involves disconnecting metals, normalizing, removing salts (largest fragment) and charges. See qsprpred.data.chem.standardizers.naive.standardize_mol for more details.

convertSMILES(smiles: str) → str | None

Standardize SMILES using standardize_mol.

Parameters:

smiles (str) – SMILES to be standardized

Returns:

standardized SMILES or None if SMILES could not be standardized

Return type:

str | None

classmethod fromSettings(settings: dict) → NaiveStandardizer

Create a naive standardizer from settings. In this case, the settings are ignored.

Parameters:

settings (dict) – settings of the standardizer

Returns:

a naive standardizer

Return type:

NaiveStandardizer

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str

getID() → str

Return the unique identifier of the standardizer, which in this case is “NaiveStandardizer” without any settings.

property settings: dict

Settings of the standardizer. They are empty in this case.

qsprpred.data.chem.standardizers.naive.standardize_mol(mol) → str | None

Standardizes SMILES and removes fragments

Standardizes the molecule using RDKit MolStandardize to disconnect metals, normalize, remove salts (keeping the largest fragment), and uncharge, followed by a second round of disconnecting metals and normalizing. Finally, the resulting SMILES is canonicalized.

Parameters:

mol (rdkit.Chem.rdchem.Mol) – RDKit molecule object

Returns:

Standardized SMILES, or None if the molecule could not be standardized, does not contain carbon, or still contains salts after standardization.

Return type:

str | None

qsprpred.data.chem.standardizers.papyrus module

class qsprpred.data.chem.standardizers.papyrus.PapyrusStandardizer(keep_stereo: bool = True, canonize: bool = True, mixture_handling: Literal['keep_largest', 'filter', 'keep'] = 'keep_largest', remove_additional_salts: bool = True, remove_additional_metals: bool = True, filter_inorganic: bool = False, filter_non_small_molecule: bool = True, small_molecule_min_mw: float = 200, small_molecule_max_mw: float = 800, canonicalize_tautomer: bool = True, tautomer_max_tautomers: int = 4294967295, extra_organic_atoms: list | None = None, extra_metals: list | None = None, extra_salts: list | None = None, uncharge: bool = True)

Bases: ChemStandardizer

Papyrus standardizer

Uses the Papyrus (>v05.6) standardization protocol to standardize SMILES.

Béquignon, O.J.M., Bongers, B.J., Jespers, W. et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. J Cheminform 15, 3 (2023). https://doi.org/10.1186/s13321-022-00672-x

Variables:

settings (dict) – Settings of the standardizer

Initialize Papyrus standardizer

Parameters:
  • keep_stereo (bool, optional) – Keep stereochemistry.

  • canonize (bool, optional) – Canonicalize SMILES.

  • mixture_handling (Literal["keep_largest", "filter", "keep"], optional) – How to handle mixtures. Defaults to “keep_largest”.

  • remove_additional_salts (bool, optional) – Removes a custom set of fragments if present in the molecule object.

  • remove_additional_metals (bool, optional) – Removes metal fragments if present in the molecule object. Ignored if remove_additional_salts is set to False.

  • filter_inorganic (bool, optional) – Filter inorganic molecules.

  • filter_non_small_molecule (bool, optional) – Filter non-small molecules.

  • small_molecule_min_mw (float, optional) – Minimum molecular weight of small molecules.

  • small_molecule_max_mw (float, optional) – Maximum molecular weight of small molecules.

  • canonicalize_tautomer (bool, optional) – Canonicalize tautomers.

  • tautomer_max_tautomers (int, optional) – Maximum number of tautomers to consider by the tautomer search algorithm (<2^32).

  • extra_organic_atoms (list, optional) – Extra organic atoms to consider in addition to the default set (Papyrus_standardizer.ORGANIC_ATOMS).

  • extra_metals (list, optional) – Extra metals to consider in addition to the default set (Papyrus_standardizer.METALS).

  • extra_salts (list, optional) – Extra salts to consider in addition to the default set (Papyrus_standardizer.SALTS).

  • uncharge (bool, optional) – Uncharge molecules.

convertSMILES(smiles: str, verbose: bool = False) → str | None

Standardize SMILES using Papyrus standardization protocol.

Parameters:
  • smiles (str) – SMILES to be standardized

  • verbose (bool, optional) – Print verbose output. Defaults to False.

Returns:

The standardized SMILES, or None if standardization fails (per the method signature and the ChemStandardizer interface).

Return type:

str | None

fromSettings(settings: dict) → PapyrusStandardizer

Create a Papyrus standardizer from settings.

Parameters:

settings (dict) – settings of the standardizer

Returns:

a Papyrus standardizer

Return type:

PapyrusStandardizer

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str

getID() → str

Get the ID of the standardizer.

In this case, the ID is based on the settings of the standardizer. It starts with ‘PapyrusStandardizer’ followed by a tilde and the settings concatenated with a colon.

Returns:

ID of the standardizer

Return type:

str
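
The ID layout described above (prefix, tilde, then colon-separated settings) can be sketched as follows. The name=value rendering and the sorted key order are assumptions made for illustration, papyrus_style_id is a hypothetical helper, and only three of the documented settings are included to keep the example short.

```python
def papyrus_style_id(settings: dict) -> str:
    # Assumption: each setting is rendered as name=value, in sorted key order.
    rendered = ":".join(f"{key}={settings[key]}" for key in sorted(settings))
    return f"PapyrusStandardizer~{rendered}"


defaults = {
    "keep_stereo": True,
    "canonize": True,
    "mixture_handling": "keep_largest",
}
default_id = papyrus_style_id(defaults)
changed_id = papyrus_style_id({**defaults, "mixture_handling": "filter"})
```

Changing a single setting changes the ID, so two standardizers compare as equivalent only when every setting matches.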

property settings: dict

Settings of the standardizer. It should contain complete information needed to initialize another equivalent standardizer.

Module contents

class qsprpred.data.chem.standardizers.ChemStandardizer

Bases: ABC

Standardizer to convert SMILES to a standardized form.

This class defines the interface of a uniquely identifiable standardizer. The getID method should return a unique identifier for the standardizer based on its settings. Standardizers that have the same ID should produce the same standardized form for a given SMILES.

The main method of the class is convertSMILES, which should convert a given SMILES to a standardized form based on the settings of the standardizer.

abstract convertSMILES(smiles: str) → str | None

Convert the SMILES to a standardized form.

Parameters:

smiles (str) – SMILES to be converted

Returns:

The standardized SMILES string or None if standardization fails or the molecule is deemed invalid.

Return type:

str | None

Raises:

ChemStandardizationException – if standardization fails, but the upstream code should be notified and handle the exception.

abstract classmethod fromSettings(settings: dict) → ChemStandardizer

Create a new standardizer from a settings dictionary.

classmethod fromSettingsFile(path: str) → ChemStandardizer

Load the standardizer from a settings file in JSON format.

Parameters:

path (str) – Path to the settings file.

Returns:

The standardizer loaded from the settings file.

Return type:

ChemStandardizer

getHashID() → str

Get the hash ID of the standardizer. This is simply the MD5 hash of the unique identifier of the standardizer.

Returns:

The hash ID of the standardizer

Return type:

str

abstract getID() → str

Return the unique identifier of the standardizer. This method should return a unique identifier based on the settings of the standardizer.

Two standardizers with the same settings should have the same ID and produce the same standardized form for a given SMILES.

Returns:

The unique identifier of the standardizer.

Return type:

str

abstract property settings: dict

Settings of the standardizer. It should contain complete information needed to initialize another equivalent standardizer.