qsprpred.extra.data.storage.protein.interfaces package

Submodules

qsprpred.extra.data.storage.protein.interfaces.protein_storage module

class qsprpred.extra.data.storage.protein.interfaces.protein_storage.ProteinStorage

Bases: PropertyStorage, ABC

Storage for proteins.

Variables:
  • sequenceProp (str) – name of the property that contains all protein sequences

  • proteins (Iterable[StoredProtein]) – all proteins in the store

abstract addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the storage.

Parameters:
  • ids (list) – The IDs of the entries to add.

  • props (dict) – The properties to add.

  • raise_on_existing (bool) – Whether to raise an exception if an entry already exists. If True, an exception is raised for any ID that is already in the storage.

Raises:

ValueError – If an entry already exists and raise_on_existing is True.
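
The raise_on_existing contract can be illustrated with a small stand-in. This is a minimal sketch, not the QSPRpred implementation: ToyStorage and its dict-based layout are hypothetical, and overwriting existing entries when raise_on_existing is False is an assumption that a concrete storage may handle differently.

```python
class ToyStorage:
    """Hypothetical in-memory stand-in; not a QSPRpred class."""

    def __init__(self):
        self.entries = {}  # maps entry ID -> dict of property values

    def addEntries(self, ids, props, raise_on_existing=True):
        # Every property list must supply exactly one value per ID.
        for name, values in props.items():
            if len(values) != len(ids):
                raise ValueError(f"property {name!r} has the wrong length")
        existing = [entry_id for entry_id in ids if entry_id in self.entries]
        if existing and raise_on_existing:
            raise ValueError(f"entries already exist: {existing}")
        # Assumption: with raise_on_existing=False, existing entries are overwritten.
        for pos, entry_id in enumerate(ids):
            self.entries[entry_id] = {k: v[pos] for k, v in props.items()}


store = ToyStorage()
store.addEntries(["P1", "P2"], {"Sequence": ["MKT", "MAG"]})

try:
    store.addEntries(["P1"], {"Sequence": ["MKV"]})  # P1 is already stored
except ValueError as err:
    print(err)

store.addEntries(["P1"], {"Sequence": ["MKV"]}, raise_on_existing=False)
```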

abstract addProperty(name: str, data: Sized, ids: list[str] | None = None)

Add a property to the dataset. The supplied data should be a sized list of values of the same length as the number of entries in the storage.

Parameters:
  • name (str) – The name of the property.

  • data (list) – The data of the property.

  • ids (list, optional) – The IDs of the entries to add the property for.
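
The length requirement on data can be sketched as follows. The add_property helper and the dict-of-dicts layout are hypothetical; the sketch also assumes that, when ids is given, data must align with ids instead of with the whole storage.

```python
def add_property(entries, name, data, ids=None):
    """Hypothetical helper: attach one value per targeted entry."""
    # Without explicit IDs, the property is added for every stored entry.
    target_ids = list(entries) if ids is None else list(ids)
    if len(data) != len(target_ids):
        raise ValueError(
            f"expected {len(target_ids)} values for {name!r}, got {len(data)}"
        )
    for entry_id, value in zip(target_ids, data):
        entries[entry_id][name] = value


entries = {"P1": {"Sequence": "MKT"}, "P2": {"Sequence": "MAGW"}}
add_property(entries, "Length", [3, 4])                   # one value per entry
add_property(entries, "Family", ["kinase"], ids=["P2"])   # only for P2
```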

abstract add_protein(protein: StoredProtein, raise_on_existing=True) StoredProtein

Add a protein to the store.

Parameters:
  • protein (StoredProtein) – the protein to add

  • raise_on_existing (bool) – raise an exception if the protein already exists in the store

Returns:

instance of the added protein

Return type:

StoredProtein

abstract apply(func: callable, func_args: list | None = None, func_kwargs: dict | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False) Generator[Iterable[Any], None, None]

Apply a function on all or selected properties of the chunks of data. The properties are supplied as the first positional argument to the function. The format of the properties is up to the downstream implementation, but it should always be a single object supplied as the first parameter.

Parameters:
  • func (callable) – The function to apply.

  • func_args (list, optional) – The positional arguments of the function.

  • func_kwargs (dict, optional) – The keyword arguments of the function.

  • on_props (tuple, optional) – The names of the properties to apply the function on.

  • as_df (bool, optional) – Provide properties as a DataFrame to the function.

Returns:

A generator that yields the results of the function applied to each chunk.
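
The chunked calling convention can be sketched with plain Python. Here apply_in_chunks, sequence_lengths, and the list-of-dicts data are hypothetical; passing the properties as a dict of lists is just one possible format, since the interface leaves the format to the implementation.

```python
def apply_in_chunks(rows, func, func_args=None, func_kwargs=None,
                    on_props=None, chunk_size=2):
    """Hypothetical sketch: yield func(props, *args, **kwargs) per chunk."""
    args = func_args or []
    kwargs = func_kwargs or {}
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        keys = on_props if on_props is not None else tuple(chunk[0])
        # The selected properties go in as ONE object, the first positional argument.
        props = {key: [row[key] for row in chunk] for key in keys}
        yield func(props, *args, **kwargs)


def sequence_lengths(props, offset=0):
    return [len(seq) + offset for seq in props["Sequence"]]


rows = [
    {"ID": "P1", "Sequence": "MKT"},
    {"ID": "P2", "Sequence": "MAGW"},
    {"ID": "P3", "Sequence": "M"},
]
results = list(apply_in_chunks(rows, sequence_lengths, on_props=("Sequence",)))
shifted = list(apply_in_chunks(rows, sequence_lengths, func_kwargs={"offset": 1}))
```

Because the result is a generator, nothing is computed until it is consumed, which is why the sketch wraps the calls in list().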

abstract property chunkSize: int

The size of the chunks to iterate over.

abstract clear()

Delete entries in the persistent storage.

abstract dropEntries(ids: Iterable[str])

Drop entries from the storage.

Parameters:

ids (list) – The IDs of the entries to drop.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

abstract getDF() DataFrame

Get the stored properties as a pandas DataFrame.

Returns:

The data as a pandas DataFrame.

Return type:

pd.DataFrame

abstract getPCMInfo() tuple[dict[str, str], dict]

Return a dictionary mapping of protein ids to their sequences and a dictionary with metadata for each. This is mainly for compatibility with QSPRpred’s PCM modelling API.

Returns:
  • sequences (dict) – Dictionary mapping protein IDs to their sequences.

  • metadata (dict) – Dictionary of metadata for each protein.

Return type:

tuple[dict[str, str], dict]
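
The shape of the returned pair can be sketched as follows. The get_pcm_info function and the dict-based protein records are hypothetical, and keying the metadata by protein ID is an assumption; the interface only states that metadata is provided for each protein.

```python
def get_pcm_info(proteins):
    """Hypothetical sketch: build (sequences, metadata) from protein records."""
    sequences = {p["id"]: p["sequence"] for p in proteins}
    # Assumption: metadata is keyed by the same protein IDs as the sequences.
    metadata = {p["id"]: dict(p.get("props", {})) for p in proteins}
    return sequences, metadata


proteins = [
    {"id": "P1", "sequence": "MKT", "props": {"Organism": "Homo sapiens"}},
    {"id": "P2", "sequence": "MAGW"},
]
sequences, metadata = get_pcm_info(proteins)
```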

abstract getProperties() list[str]

Get the property names contained in the storage.

abstract getProperty(name: str, ids: tuple[str] | None = None) Iterable[Any]

Get values of a given property.

Parameters:
  • name (str) – The name of the property.

  • ids (list, optional) – The IDs of the entries to get the property for.

abstract getProtein(protein_id: str) StoredProtein

Get a protein from the store using its name.

Parameters:

protein_id (str) – name of the protein to search

Returns:

the matching StoredProtein instance

Return type:

StoredProtein

abstract getSubset(subset: Iterable[str], ids: Iterable[str] | None = None) PropertyStorage

Get a subset of the storage for the given properties.

Parameters:
  • subset (list) – The list of property names to include in the subset.

  • ids (list, optional) – The IDs of the entries to include in the subset.

  • name (str, optional) – The name of the new storage.

Returns:

The subset of the storage.

Return type:

PropertyStorage

abstract hasProperty(name: str) bool

Check whether a property is present in the storage.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

abstract property idProp: str

Get the name of the property that contains the entry IDs.

abstract iterChunks(size: int | None = None, on_props: list | None = None) Generator[list[Any], None, None]

Iterate over chunks of entries across the store.

Returns:

an iterable of lists of stored entries

abstract property metaFile: str

Get the absolute path to the metadata file that describes how the persisted data can be accessed. This can be used to load the object back from storage using the fromFile class method.

Returns:

The absolute path to the metadata file.

Return type:

str

abstract property name: str

Get the name of the storage.

abstract property proteins: Iterable[StoredProtein]

Get all proteins in the store.

Returns:

iterable of StoredProtein instances

Return type:

Iterable[StoredProtein]

abstract reload()

Reset the current state by reloading from storage.

abstract removeProperty(name: str)

Remove a property from the dataset.

Parameters:

name (str) – The name of the property.

abstract save() str

Save current state to storage and return the path to the serialized file.

Returns:

The path to the serialized file.

Return type:

str

abstract searchOnProperty(prop_name: str, values: list[float | int | str], exact=False) PropSearchable

Search the entries in this storage on a property value.

Parameters:
  • prop_name – Name of the column to search on.

  • values – Values to search for.

  • exact – Whether to search for exact matches or not.

Returns:

Another instance that can be filtered further.

Return type:

(PropSearchable)
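
The difference between exact and non-exact searching can be sketched as below. The search_on_property helper and the list-of-dicts data are hypothetical, and treating a non-exact search as a substring match is an assumption; the interface does not specify how concrete storages interpret exact=False.

```python
def search_on_property(rows, prop_name, values, exact=False):
    """Hypothetical sketch of property-based filtering."""
    def matches(cell):
        if exact:
            return cell in values
        # Assumption: a non-exact search accepts substring matches.
        return any(str(value) in str(cell) for value in values)

    return [row for row in rows if matches(row[prop_name])]


rows = [
    {"ID": "P1", "Organism": "Homo sapiens"},
    {"ID": "P2", "Organism": "Mus musculus"},
]
partial = search_on_property(rows, "Organism", ["Homo"])
exact_miss = search_on_property(rows, "Organism", ["Homo"], exact=True)
exact_hit = search_on_property(rows, "Organism", ["Mus musculus"], exact=True)
```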

abstract property sequenceProp: str

Get the name of the property that contains all protein sequences.

toFile(filename: str) str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
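
The round trip through toJSON, fromJSON, toFile, and fromFile can be sketched with the standard json module. ToySerializable and its two attributes are hypothetical; the actual QSPRpred serialization format is not shown here and may differ.

```python
import json
import os
import tempfile


class ToySerializable:
    """Hypothetical stand-in that mirrors the four serialization methods."""

    def __init__(self, name, chunk_size):
        self.name = name
        self.chunk_size = chunk_size

    def toJSON(self):
        # The string must hold everything needed to rebuild the object.
        return json.dumps({"name": self.name, "chunk_size": self.chunk_size})

    @classmethod
    def fromJSON(cls, json_str):
        state = json.loads(json_str)
        return cls(state["name"], state["chunk_size"])

    def toFile(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.toJSON())
        return os.path.abspath(filename)

    @classmethod
    def fromFile(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.fromJSON(handle.read())


original = ToySerializable("proteins", 50)
restored = ToySerializable.fromJSON(original.toJSON())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = original.toFile(os.path.join(tmp_dir, "toy_meta.json"))
    from_file = ToySerializable.fromFile(path)
```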

qsprpred.extra.data.storage.protein.interfaces.storedprotein module

class qsprpred.extra.data.storage.protein.interfaces.storedprotein.StoredProtein

Bases: ABC

A protein object.

Variables:
  • id (str) – id of the protein

  • sequence (str) – sequence of the protein

  • props (dict[str, Any]) – properties of the protein

  • representations (Iterable[StoredProtein]) – representations of the protein

abstract as_fasta() str | None

Return the protein as a FASTA file.

abstract as_pdb() str | None

Return the protein as a PDB file.

as_rd_mol() Mol | None

Return the protein as an RDKit molecule.

abstract property id: str

Get the id of the protein.

abstract property parent: StoredProtein

Get the parent protein.

abstract property props: dict[str, Any] | None

Get the properties of the protein.

abstract property representations: Iterable[StoredProtein]

Get all representations of the protein.

abstract property sequence: str | None

Get the sequence of the protein.
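
A concrete protein class fills in the abstract members listed above. The sketch below avoids importing QSPRpred and defines its own reduced abstract base instead; ProteinLike and ToyProtein are hypothetical, only a subset of the interface is covered, and returning a FASTA-formatted string from as_fasta is an assumption based on the str | None return type.

```python
from abc import ABC, abstractmethod


class ProteinLike(ABC):
    """Hypothetical, reduced stand-in for the StoredProtein interface."""

    @property
    @abstractmethod
    def id(self): ...

    @property
    @abstractmethod
    def sequence(self): ...

    @abstractmethod
    def as_fasta(self): ...


class ToyProtein(ProteinLike):
    def __init__(self, protein_id, sequence=None, props=None):
        self._id = protein_id
        self._sequence = sequence
        self.props = props

    @property
    def id(self):
        return self._id

    @property
    def sequence(self):
        return self._sequence

    def as_fasta(self):
        # Without a sequence there is nothing to export, hence None.
        if self._sequence is None:
            return None
        return f">{self._id}\n{self._sequence}"


protein = ToyProtein("P1", "MKT", props={"Organism": "Homo sapiens"})
fasta = protein.as_fasta()
empty = ToyProtein("P2").as_fasta()
```

Instantiating ProteinLike directly raises TypeError, which is how an abstract base enforces that subclasses implement every abstract member.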

Module contents