qsprpred.extra.data.storage.protein package

Subpackages

Submodules

qsprpred.extra.data.storage.protein.tabular_pcm module

class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProtein(protein_id: str, sequence: str | None = None, parent: TabularProtein | None = None, props: dict[str, Any] | None = None, representations: Iterable[TabularProtein] | None = None)

Bases: StoredProtein

A protein object that is stored in a tabular format.

Variables:
  • id (str) – id of the protein

  • sequence (str) – sequence of the protein

  • props (dict[str, Any]) – properties of the protein

  • representations (Iterable[TabularProtein]) – representations of the protein

Create a new protein instance.

Parameters:
  • protein_id (str) – identifier of the protein

  • sequence (str) – sequence of the protein

  • parent (TabularProtein) – parent protein

  • props (dict[str, Any]) – properties of the protein

  • representations (Iterable[TabularProtein]) – representations of the protein

as_fasta() str | None

Return the protein as a FASTA file.
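As an illustration of the kind of text a FASTA export produces, the sketch below formats an identifier and a sequence by hand. It does not call QSPRpred: the helper name to_fasta, the example identifier, and the 60-character line width are assumptions made for this example, and the header layout written by as_fasta() may differ.

```python
import textwrap
from typing import Optional


def to_fasta(protein_id: str, sequence: Optional[str], width: int = 60) -> Optional[str]:
    """Hypothetical helper (not part of QSPRpred): format one FASTA record."""
    if sequence is None:
        # Mirrors the documented "str | None" return type: no sequence, no record.
        return None
    # Header line starts with ">", followed by the sequence in fixed-width lines.
    lines = textwrap.wrap(sequence, width)
    return "\n".join([f">{protein_id}", *lines])


record = to_fasta("P12345", "MKTAYIAKQR" * 7)  # made-up 70-residue sequence
print(record)
```

The None branch follows the documented return type, which suggests that a protein without a sequence yields no FASTA text.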

as_pdb() str | None

Return the protein as a PDB file.

as_rd_mol() Mol | None

Return the protein as an RDKit molecule.

property id: str

Get the id of the protein.

property parent: TabularProtein

Get the parent protein.

property props: dict[str, Any] | None

Get the properties of the protein.

property representations: Iterable[TabularProtein]

Get all representations of the protein.

property sequence: str | None

Get the sequence of the protein.

class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProteinStorage(name: str, df: DataFrame | None = None, sequence_col: str = 'Sequence', sequence_provider: Callable | None = None, store_dir: str = '.', overwrite: bool = False, index_cols: list[str] | None = None, n_jobs: int = 1, chunk_size: int | None = None, protein_col: str = 'accession', random_state: int | None = None, store_format: str = 'pkl', parallel_generator: ParallelGenerator | None = None)

Bases: ProteinStorage, PandasDataTable

A storage class for proteins stored in a tabular format.

Variables:
  • sequenceCol (str) – name of the column that contains all protein sequences

  • proteinSeqProvider (Callable) – function that provides protein sequences

  • sequenceProp (str) – name of the property that contains all protein sequences

  • proteins (Iterable[TabularProtein]) – all proteins in the store

Create a new protein storage instance.

Parameters:
  • name (str) – name of the storage

  • df (pd.DataFrame) – data frame containing the proteins

  • sequence_col (str) – name of the column that contains all protein sequences

  • sequence_provider (Callable) – function that provides protein sequences

  • store_dir (str) – directory to store the data

  • overwrite (bool) – overwrite the existing data

  • index_cols (list[str]) – columns to use as index

  • n_jobs (int) – number of parallel jobs

  • chunk_size (int) – size of the chunks

  • protein_col (str) – name of the column that contains the protein ids

  • random_state (int) – random state

  • store_format (str) – format to store the data

  • parallel_generator (ParallelGenerator) – parallel generator

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the data set.

Parameters:
  • ids (list[str]) – IDs of entries to add.

  • props (dict[str, list]) – Dictionary of properties to add.

  • raise_on_existing (bool) – If True, raise an error if any of the new entries are duplicates.
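
The effect of raise_on_existing can be sketched with a plain dictionary standing in for the table. This is not QSPRpred code: the function add_entries, the use of ValueError, and the choice to leave existing entries untouched when raise_on_existing is False are all assumptions made for illustration.

```python
def add_entries(table, ids, props, raise_on_existing=True):
    """Hypothetical stand-in: add rows keyed by ID to a dict-based table."""
    duplicates = [entry_id for entry_id in ids if entry_id in table]
    if duplicates and raise_on_existing:
        # Assumption: the exception type used by the library may differ.
        raise ValueError(f"Entries already exist: {duplicates}")
    for pos, entry_id in enumerate(ids):
        if entry_id in table:
            continue  # assumption: existing entries are left unchanged
        # props maps each property name to a list aligned with ids.
        table[entry_id] = {name: values[pos] for name, values in props.items()}
    return table


table = {}
add_entries(table, ["P1", "P2"], {"Sequence": ["MKT", "MA"]})

try:
    add_entries(table, ["P2"], {"Sequence": ["MAAA"]})
    raised = False
except ValueError:
    raised = True
```

Each list in props is expected to have one value per ID, in the same order as ids.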

addProperty(name: str, data: list, ids: list[str] | None = None, ignore_missing: bool = False)

Add a property to the data frame.

Parameters:
  • name (str) – Name of the property.

  • data (list) – list of property values.

  • ids – IDs of entries to get properties for.

  • ignore_missing (bool) – If True, missing IDs are ignored.

add_protein(protein: TabularProtein, raise_on_existing=True)

Add a protein to the store.

Parameters:
  • protein (TabularProtein) – protein to add to the store

  • raise_on_existing (bool) – raise an exception if the protein already exists in the store

apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any, ...] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator

Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.

Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.

If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.

Parameters:
  • func (Callable) – Function to apply to the data frame.

  • func_args (tuple) – Positional arguments to pass to the function.

  • func_kwargs (dict) – Keyword arguments to pass to the function.

  • on_props (tuple[str, ...]) – properties to send to the function as arguments

  • as_df (bool) – If True, the function is applied to chunks represented as data frames.

  • chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.

  • n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.

Returns:

Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.

Return type:

Generator
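
The calling convention described above can be sketched in pure Python: the properties arrive as a dictionary of lists, extra arguments follow, and one result is yielded per chunk. The function apply_chunked is a hypothetical serial stand-in, not the library implementation. It omits parallel execution and the as_df option, and the example data is made up.

```python
def apply_chunked(props, func, func_args=None, func_kwargs=None,
                  on_props=None, chunk_size=None):
    """Hypothetical serial sketch of the documented apply() semantics."""
    # Restrict to on_props if given, otherwise pass every property.
    names = list(on_props) if on_props is not None else list(props)
    n_rows = len(props[names[0]]) if names else 0
    # Without a chunk size, process all rows as a single chunk.
    size = chunk_size if chunk_size else max(n_rows, 1)
    for start in range(0, n_rows, size):
        chunk = {name: props[name][start:start + size] for name in names}
        # The chunk is the first positional argument; extras follow it.
        yield func(chunk, *(func_args or ()), **(func_kwargs or {}))


def total_length(chunk, scale=1):
    """Sum the sequence lengths in one chunk."""
    return scale * sum(len(seq) for seq in chunk["Sequence"])


data = {"accession": ["P1", "P2", "P3"], "Sequence": ["MKT", "MA", "MKTAY"]}
results = list(
    apply_chunked(data, total_length, func_kwargs={"scale": 1},
                  on_props=("Sequence",), chunk_size=2)
)
print(results)  # one result per chunk
```

Because the result is a generator, nothing is computed until it is iterated; wrapping it in list() as above forces evaluation of every chunk.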

property baseDir: str

The base directory of the data set folder.

property chunkSize: int

Size of chunks to use per job in parallel processing.

clear(files_only: bool = True)

Remove all files associated with this data set from disk.

dropEmptyProperties(names: list[str])

Drop rows with empty target property value from the data set.

Parameters:

names (list[str]) – list of property names to check for empty values.

dropEntries(ids: Iterable[str], ignore_missing: bool = False)

Drop entries from the data set by their IDs.

Parameters:
  • ids (Iterable[str]) – IDs of entries to drop.

  • ignore_missing (bool) – If True, missing IDs are ignored.

classmethod fromFile(filename: str) Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

generateIndex(name: str | None = None, prefix: str | None = None)

Generate a custom index for the data frame automatically.

Parameters:
  • name (str | None) – name of the resulting index column.

  • prefix (str | None) – prefix to use for the index column values.

getDF()

Get the data frame this instance manages.

Returns:

The data frame this instance manages.

Return type:

pd.DataFrame

getPCMInfo() tuple[dict[str, str], dict]

Return a dictionary of protein sequences for the proteins in the data frame and the additional metadata separately.

Returns:

A tuple of the dictionary of protein sequences and a dictionary with the additional metadata.

Return type:

tuple[dict[str, str], dict]
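
A minimal sketch of the two-part return value, using a list of row dictionaries in place of the data frame. The function pcm_info, the example rows, and the layout of the metadata dictionary (keyed by protein ID) are assumptions for illustration; the structure returned by getPCMInfo() for the metadata may differ.

```python
def pcm_info(rows, protein_col="accession", sequence_col="Sequence"):
    """Hypothetical stand-in: split rows into sequences and remaining metadata."""
    sequences = {row[protein_col]: row[sequence_col] for row in rows}
    # Assumption: metadata is every other column, grouped per protein ID.
    metadata = {
        row[protein_col]: {
            key: value
            for key, value in row.items()
            if key not in (protein_col, sequence_col)
        }
        for row in rows
    }
    return sequences, metadata


rows = [
    {"accession": "P1", "Sequence": "MKT", "organism": "human"},
    {"accession": "P2", "Sequence": "MA", "organism": "mouse"},
]
sequences, metadata = pcm_info(rows)
```

The default column names match the documented defaults of sequence_col and protein_col.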

getProperties() list[str]

Get names of all properties/variables saved in the data frame (all columns).

Returns:

list of property names.

Return type:

(list[str])

getProperty(name: str, ids: tuple[str] | None = None, ignore_missing: bool = False) Series

Get property values from the data set.

Parameters:
  • name (str) – Name of the property to get.

  • ids – IDs of entries to get properties for.

  • ignore_missing (bool) – If True, missing IDs are ignored.

Returns:

Series of values for the property.

Return type:

(pd.Series)

getProtein(protein_id: str) TabularProtein

Get a protein from the store using its ID.

Parameters:

protein_id (str) – ID of the protein to look up

Returns:

instance of Protein

Return type:

TabularProtein

Raises:

ValueError – if the protein is not found

getSubset(properties: list[str], ids: list[str] | None = None, name: str | None = None, path: str | None = None, ignore_missing: bool = False) PandasDataTable

Get a subset of the data set by providing a prefix for the column names or a column name directly.

Parameters:
  • properties (list[str]) – list of property names to get.

  • ids – IDs of entries to get subset of properties for.

  • name (str) – Name of the new data set.

  • path (str) – Path to save the new data set.

  • ignore_missing (bool) – If True, missing IDs are ignored.

Returns:

A new data set containing the subset of the properties

Return type:

(PandasDataTable)

hasProperty(name: str) bool

Check whether a property is present in the data frame.

Parameters:

name (str) – Name of the property.

Returns:

Whether the property is present.

Return type:

bool

property idProp: str

Column name to use for automatically generated IDs.

iterChunks(size: int | None = None, on_props: tuple[str] | None = None, as_dict: bool = False) Generator[DataFrame | dict, None, None]

Batch a data frame into chunks of the given size.

Parameters:
  • size (int) – Size of chunks to use per job in parallel processing. If None, self.chunkSize is used.

  • on_props (list[str]) – list of properties to include, if None, all properties are included.

  • as_dict (bool) – If True, the generator yields dictionaries instead of data frames.

Returns:

Generator that yields batches of the data frame as smaller data frames, or as dictionaries if as_dict is True.

Return type:

Generator[pd.DataFrame, None, None]

property metaFile

The path to the meta file of this data set.

property nJobs

Number of jobs to use for parallel processing.

property name: str

Name of the data set.

property proteins: list[TabularProtein]

Get all proteins in the store.

Returns:

list of proteins

Return type:

list[TabularProtein]

property randomState: int

Random state to use for all random operations for reproducibility.

reload()

Reload the data table from disk.

removeProperty(name)

Remove a property from the data frame.

Parameters:

name (str) – Name of the property to delete.

save() str

Save the data frame to disk and all associated files.

Returns:

Path to the saved data frame.

Return type:

(str)

searchOnProperty(prop_name: str, values: list[str], exact: bool = False) PandasDataTable

Search the entries of this data set on a property value and return the matching subset.

Parameters:
  • prop_name (str) – Name of the column to search on.

  • values (list[str]) – Values to search for.

  • exact (bool) – Whether to search for exact matches or not.

Returns:

A data set with the entries that match the search.

Return type:

(PandasDataTable)
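
The difference between exact and non-exact search can be sketched on a list of row dictionaries. This is not the library implementation: the function search_on_property is hypothetical, and treating a non-exact search as a substring match is an assumption made for this example.

```python
def search_on_property(rows, prop_name, values, exact=False):
    """Hypothetical stand-in: return the rows whose property matches a value."""
    if exact:
        # Exact search: the property must equal one of the values.
        return [row for row in rows if row[prop_name] in values]
    # Assumption: non-exact search means any value occurs as a substring.
    return [
        row for row in rows
        if any(value in str(row[prop_name]) for value in values)
    ]


rows = [
    {"accession": "P1", "organism": "Homo sapiens"},
    {"accession": "P2", "organism": "Mus musculus"},
]
partial = search_on_property(rows, "organism", ["Homo"])
exact = search_on_property(rows, "organism", ["Homo"], exact=True)
```

With these rows, the partial search finds the first entry, while the exact search finds nothing because no value equals "Homo" in full.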

property sequenceProp: str

Get the name of the property that contains all protein sequences.

setIndex(cols: list[str])

Create an index column from several columns of the data set. This also resets the idProp attribute to be the name of the index columns joined by a ‘~’ character. The values of the columns are also joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined to a string.

Parameters:

cols (list[str]) – list of columns to use as index.
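
The joining rule described above can be sketched without the library. The function joined_index is a hypothetical helper operating on a dictionary of column lists, and the column names and values are made up. The uniqueness check is an addition that makes the documented requirement explicit.

```python
def joined_index(columns, cols, sep="~"):
    """Hypothetical helper: build an index name and values by joining columns."""
    index_name = sep.join(cols)
    n_rows = len(columns[cols[0]])
    # Join the values of the chosen columns row by row, converting to strings.
    values = [sep.join(str(columns[col][i]) for col in cols) for i in range(n_rows)]
    if len(set(values)) != len(values):
        raise ValueError("Joined index values are not unique.")
    return index_name, values


columns = {
    "accession": ["P1", "P1", "P2"],
    "variant": ["WT", "A123G", "WT"],
}
index_name, index_values = joined_index(columns, ["accession", "variant"])
print(index_name)    # accession~variant
print(index_values)  # ['P1~WT', 'P1~A123G', 'P2~WT']
```

In this example neither column is unique by itself, but the combination is, which is the situation the method is designed for.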

shuffle(random_state: int | None = None)

Shuffle the internal data frame.

Parameters:

random_state (int | None) – Random state to use for shuffling. If None, the random state of the data set is used.
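
The fallback to the data set's own random state can be sketched with the standard library. The function shuffled_ids and the default seed value are assumptions for illustration; the library shuffles its internal data frame rather than a list.

```python
import random


def shuffled_ids(ids, random_state=None, default_state=42):
    """Hypothetical stand-in: shuffle IDs reproducibly with a seed."""
    # Use the explicit seed if given, else the data set's stored seed.
    seed = default_state if random_state is None else random_state
    order = list(ids)
    random.Random(seed).shuffle(order)
    return order


ids = ["P1", "P2", "P3", "P4", "P5"]
first = shuffled_ids(ids)
second = shuffled_ids(ids)
print(first == second)  # True: the same seed gives the same order
```

Using a dedicated random.Random instance keeps the shuffle independent of the global random state.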

property storeDir

The data set folder containing the data set files after saving.

property storePath

The path to the main data set file.

property storePrefix

The prefix of the data set files.

toFile(filename: str) str

Save the metafile and all associated files to a custom location.

Parameters:

filename (str) – absolute path to the saved metafile.

Returns:

Path to the saved data frame.

Return type:

(str)

toJSON() str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
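
The round trip between toJSON and fromJSON can be sketched with a small stand-in class. StorageConfig is hypothetical and only stores a few constructor arguments; how QSPRpred serializes the full storage, including its data frame, is not shown here.

```python
import json


class StorageConfig:
    """Hypothetical stand-in for an object with toJSON/fromJSON methods."""

    def __init__(self, name, sequence_col="Sequence", protein_col="accession"):
        self.name = name
        self.sequence_col = sequence_col
        self.protein_col = protein_col

    def toJSON(self):
        # Serialize everything needed to rebuild the object.
        return json.dumps(vars(self))

    @classmethod
    def fromJSON(cls, json_str):
        # Rebuild the object from the serialized constructor arguments.
        return cls(**json.loads(json_str))


original = StorageConfig("kinases", protein_col="uniprot_id")
restored = StorageConfig.fromJSON(original.toJSON())
print(vars(restored) == vars(original))  # True
```

The sketch works because the attribute names match the constructor parameters, so the serialized dictionary can be passed straight back as keyword arguments.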

transformProperties(names: list[str], transformer: Callable)

Transform property values using a transformer function.

Parameters:
  • names (list[str]) – list of column names to transform.

  • transformer (Callable) – Function that transforms the data in target columns to a new representation.

Module contents