qsprpred.extra.data.storage.protein package

Subpackages

qsprpred.extra.data.storage.protein.interfaces package

Submodules

qsprpred.extra.data.storage.protein.tabular_pcm module

class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProtein(protein_id: str, sequence: str | None = None, parent: TabularProtein | None = None, props: dict[str, Any] | None = None, representations: Iterable[TabularProtein] | None = None)

Bases: StoredProtein

A protein object that is stored in a tabular format.

Variables:

id (str) – id of the protein
sequence (str) – sequence of the protein
props (dict[str, Any]) – properties of the protein
representations (Iterable[TabularProtein]) – representations of the protein

Create a new protein instance.

Parameters:

protein_id (str) – identifier of the protein
sequence (str) – sequence of the protein
parent (TabularProtein) – parent protein
props (dict[str, Any]) – properties of the protein
representations (Iterable[TabularProtein]) – representations of the protein

as_fasta() → str | None: Return the protein as a FASTA file.
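
For orientation, a FASTA record is a header line starting with `>` followed by the sequence, usually wrapped at a fixed width. The sketch below only illustrates that text format and is not the library's implementation: the helper name `to_fasta`, the 60-character wrap, and returning None when no sequence is set are all assumptions made for the example.

```python
def to_fasta(protein_id, sequence, width=60):
    """Format one record as FASTA text (hypothetical helper, not part of qsprpred)."""
    if sequence is None:
        # Assumption: mirror the `str | None` return type by returning None
        # when there is no sequence to write.
        return None
    # Wrap the sequence into lines of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{protein_id}", *lines])


record = to_fasta("P12345", "MKTAYIAKQR" * 7)
print(record)
```

The 70-residue toy sequence is printed as one header line followed by two sequence lines.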

as_pdb() → str | None: Return the protein as a PDB file.

as_rd_mol() → Mol | None: Return the protein as an RDKit molecule.

property id: str: Get the id of the protein.

property parent: TabularProtein: Get the parent protein.

property props: dict[str, Any] | None: Get the properties of the protein.

property representations: Iterable[TabularProtein]: Get all representations of the protein.

property sequence: str | None: Get the sequence of the protein.

class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProteinStorage(name: str, df: DataFrame | None = None, sequence_col: str = 'Sequence', sequence_provider: Callable | None = None, store_dir: str = '.', overwrite: bool = False, index_cols: list[str] | None = None, n_jobs: int = 1, chunk_size: int | None = None, protein_col: str = 'accession', random_state: int | None = None, store_format: str = 'pkl', parallel_generator: ParallelGenerator | None = None)

Bases: ProteinStorage, PandasDataTable

A storage class for proteins stored in a tabular format.

Variables:

sequenceCol (str) – name of the column that contains all protein sequences
proteinSeqProvider (Callable) – function that provides protein sequences
sequenceProp (str) – name of the property that contains all protein sequences
proteins (Iterable[TabularProtein]) – all proteins in the store

Create a new protein storage instance.

Parameters:

name (str) – name of the storage
df (pd.DataFrame) – data frame containing the proteins
sequence_col (str) – name of the column that contains all protein sequences
sequence_provider (Callable) – function that provides protein sequences
store_dir (str) – directory to store the data
overwrite (bool) – overwrite the existing data
index_cols (list[str]) – columns to use as index
n_jobs (int) – number of parallel jobs
chunk_size (int) – size of the chunks
protein_col (str) – name of the column that contains the protein ids
random_state (int) – random state
store_format (str) – format to store the data
parallel_generator (ParallelGenerator) – parallel generator

addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)

Add entries to the data set.

Parameters:

ids (list[str]) – IDs of entries to add.
props (dict[str, list]) – Dictionary of properties to add.
raise_on_existing (bool) – If True, raise an error if any of the new entries are duplicates.

addProperty(name: str, data: list, ids: list[str] | None = None, ignore_missing: bool = False)

Add a property to the data frame.

Parameters:

name (str) – Name of the property.
data (list) – list of property values.
ids – IDs of entries to get properties for.
ignore_missing (bool) – If True, missing IDs are ignored.

add_protein(protein: TabularProtein, raise_on_existing=True)

Add a protein to the store.

Parameters:

protein (TabularProtein) – protein to add
raise_on_existing (bool) – raise an exception if the protein already exists in the store

Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.

Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.

If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.

Parameters:

func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – list of properties to send to the function as arguments
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.

Returns:

Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.

Return type:

Generator
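
The calling convention described above can be sketched without the library. This is a minimal stand-in, not the actual implementation: `apply_in_chunks` and `mean_length` are hypothetical names, the data is a plain dictionary of lists, and parallel execution and the as_df option are left out.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def apply_in_chunks(
    props: dict[str, list],
    func: Callable[..., Any],
    func_args: list | None = None,
    func_kwargs: dict | None = None,
    on_props: list[str] | None = None,
    chunk_size: int | None = None,
) -> Iterator[Any]:
    """Stand-in for the convention described above (not the library code)."""
    names = list(props) if on_props is None else list(on_props)
    n_rows = len(next(iter(props.values()), []))
    # Assumption: with no chunk size, all rows are processed as one chunk.
    size = chunk_size or max(n_rows, 1)
    for start in range(0, n_rows, size):
        chunk = {name: props[name][start:start + size] for name in names}
        # The chunk of properties is always the first positional argument;
        # extra positional and keyword arguments follow it.
        yield func(chunk, *(func_args or []), **(func_kwargs or {}))


def mean_length(chunk: dict[str, list], column: str, scale: float = 1.0) -> float:
    """Average string length of one property within a chunk."""
    values = chunk[column]
    return scale * sum(len(v) for v in values) / len(values)


data = {
    "accession": ["P1", "P2", "P3"],
    "Sequence": ["MKT", "MKTAYI", "MK"],
}
results = list(
    apply_in_chunks(
        data,
        mean_length,
        func_args=["Sequence"],
        on_props=["Sequence"],
        chunk_size=2,
    )
)
print(results)  # one result per chunk
```

Because the result is produced lazily, nothing is computed until the generator is consumed, here by `list(...)`.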

property baseDir: str: The base directory of the data set folder.

property chunkSize: int: Size of chunks to use per job in parallel processing.

clear(files_only: bool = True): Remove all files associated with this data set from disk.

dropEmptyProperties(names: list[str])

Drop rows with empty target property value from the data set.

Parameters: names (list[str]) – list of property names to check for empty values.

dropEntries(ids: Iterable[str], ignore_missing: bool = False)

Drop entries from the data set by their IDs.

Parameters:

ids (Iterable[str]) – IDs of entries to drop.
ignore_missing (bool) – If True, missing IDs are ignored.

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters: filename (str) – path to the JSON file
Returns: new instance of the class
Return type: instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters: json (str) – JSON string of the object
Returns: reconstructed object
Return type: obj (object)

generateIndex(name: str | None = None, prefix: str | None = None)

Generate a custom index for the data frame automatically.

Parameters:

name (str | None) – name of the resulting index column.
prefix (str | None) – prefix to use for the index column values.

getDF()

Get the data frame this instance manages.

Returns: The data frame this instance manages.
Return type: pd.DataFrame

getPCMInfo() → tuple[dict[str, str], dict]

Return a dictionary of protein sequences for the proteins in the data frame and the additional metadata separately.

Returns: Dictionary of protein sequences and a dictionary of additional metadata.
Return type: tuple[dict[str, str], dict]

getProperties() → list[str]

Get names of all properties/variables saved in the data frame (all columns).

Returns: list of property names.
Return type: (list[str])

getProperty(name: str, ids: tuple[str] | None = None, ignore_missing: bool = False) → Series

Get property values from the data set.

Parameters:

name (str) – Name of the property to get.
ids – IDs of entries to get properties for.
ignore_missing (bool) – If True, missing IDs are ignored.

Returns:

Series of values for the property.

Return type:

(pd.Series)

getProtein(protein_id: str) → TabularProtein

Get a protein from the store using its ID.

Parameters: protein_id (str) – ID of the protein to look up
Returns: the matching protein instance
Return type: TabularProtein
Raises: ValueError – if the protein is not found

getSubset(properties: list[str], ids: list[str] | None = None, name: str | None = None, path: str | None = None, ignore_missing: bool = False) → PandasDataTable

Get a subset of the data set by providing a prefix for the column names or a column name directly.

Parameters:

properties (list[str]) – list of property names to get.
ids – IDs of entries to get subset of properties for.
name (str) – Name of the new data set.
path (str) – Path to save the new data set.
ignore_missing (bool) – If True, missing IDs are ignored.

Returns:

A new data set containing the subset of the properties

Return type:

(PandasDataTable)

hasProperty(name: str) → bool

Check whether a property is present in the data frame.

Parameters: name (str) – Name of the property.
Returns: Whether the property is present.
Return type: bool

property idProp: str: Column name to use for automatically generated IDs.

iterChunks(size: int | None = None, on_props: tuple[str] | None = None, as_dict: bool = False) → Generator[DataFrame | dict, None, None]

Batch a data frame into chunks of the given size.

Parameters:

size (int) – Size of chunks to use per job in parallel processing. If None, self.chunkSize is used.
on_props (tuple[str]) – properties to include. If None, all properties are included.
as_dict (bool) – If True, the generator yields dictionaries instead of data frames.

Returns:

Generator that yields batches of the data frame as smaller data frames, or as dictionaries if as_dict is True.

Return type:

Generator[pd.DataFrame | dict, None, None]

property metaFile: The path to the meta file of this data set.

property nJobs: Number of jobs to use for parallel processing.

property name: str: Name of the data set.

property proteins: list[TabularProtein]

Get all proteins in the store.

Returns: list of proteins
Return type: list[TabularProtein]

property randomState: int: Random state to use for all random operations for reproducibility.

reload(): Reload the data table from disk.

removeProperty(name)

Remove a property from the data frame.

Parameters: name (str) – Name of the property to delete.

save() → str

Save the data frame and all associated files to disk.

Returns: Path to the saved data frame.
Return type: (str)

searchOnProperty(prop_name: str, values: list[str], exact: bool = False) → PandasDataTable

Search the entries in this data set on a property value and return the matching subset.

Parameters:

prop_name (str) – Name of the column to search on.
values (list[str]) – Values to search for.
exact (bool) – Whether to search for exact matches or not.

Returns:

A data set with the entries that match the search.

Return type:

(PandasDataTable)

property sequenceProp: str: Get the name of the property that contains all protein sequences.

setIndex(cols: list[str])

Create an index column from several columns of the data set. This also resets the idProp attribute to the names of the index columns joined by a ‘~’ character. The values of the columns are joined in the same way to create the index, so make sure the column values are unique in combination and can be joined into a string.

Parameters: cols (list[str]) – list of columns to use as index.
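
The joining rule described for setIndex can be illustrated on plain dictionaries. This is a minimal sketch of the idea only: `join_index` is a hypothetical helper, and raising on duplicate IDs is an assumption of the example rather than documented behavior.

```python
def join_index(rows, cols, sep="~"):
    """Build a joined index name and joined row IDs (illustrative helper only)."""
    # The index name is the column names joined by the separator.
    index_name = sep.join(cols)
    # Each row ID is the row's values in those columns, joined the same way.
    ids = [sep.join(str(row[col]) for col in cols) for row in rows]
    if len(set(ids)) != len(ids):
        # Assumption: duplicates are treated as an error in this sketch.
        raise ValueError(f"values of {cols} are not unique together")
    return index_name, ids


rows = [
    {"accession": "P1", "species": "human"},
    {"accession": "P1", "species": "mouse"},
]
index_name, ids = join_index(rows, ["accession", "species"])
print(index_name)  # accession~species
print(ids)         # ['P1~human', 'P1~mouse']
```

Here the accession alone would not be unique, which is why the second column is needed to form a usable index.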

shuffle(random_state: int | None = None)

Shuffle the internal data frame.

Parameters: random_state (int | None) – Random state to use for shuffling. If None, the random state of the data set is used.

property storeDir: The data set folder containing the data set files after saving.

property storePath: The path to the main data set file.

property storePrefix: The prefix of the data set files.

toFile(filename: str) → str

Save the metafile and all associated files to a custom location.

Parameters: filename (str) – absolute path to the saved metafile.
Returns: Path to the saved data frame.
Return type: (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns: JSON string of the object
Return type: json (str)
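
The round-trip contract between toJSON and fromJSON can be shown with a toy class built on the standard json module. This is a sketch of the pattern only: the `Record` class and its fields are hypothetical, and the way the library itself encodes its state is not shown here.

```python
import json


class Record:
    """Toy class with a toJSON/fromJSON pair (not a qsprpred class)."""

    def __init__(self, name, props):
        self.name = name
        self.props = props

    def toJSON(self):
        # Everything needed to rebuild the object goes into the string.
        return json.dumps({"name": self.name, "props": self.props})

    @classmethod
    def fromJSON(cls, json_str):
        # Rebuild a new, independent instance from the serialized state.
        state = json.loads(json_str)
        return cls(state["name"], state["props"])


original = Record("kinases", {"accession": ["P1", "P2"]})
restored = Record.fromJSON(original.toJSON())
print(restored.name, restored.props)
```

The restored object is a separate instance that carries the same state as the original.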

transformProperties(names: list[str], transformer: Callable)

Transform property values using a transformer function.

Parameters:

names (list[str]) – list of column names to transform.
transformer (Callable) – Function that transforms the data in target columns to a new representation.
