qsprpred.extra.data.storage.protein package
Subpackages
- qsprpred.extra.data.storage.protein.interfaces package
- Submodules
- qsprpred.extra.data.storage.protein.interfaces.protein_storage module
ProteinStorage, ProteinStorage.addEntries(), ProteinStorage.addProperty(), ProteinStorage.add_protein(), ProteinStorage.apply(), ProteinStorage.chunkSize, ProteinStorage.clear(), ProteinStorage.dropEntries(), ProteinStorage.fromFile(), ProteinStorage.fromJSON(), ProteinStorage.getDF(), ProteinStorage.getPCMInfo(), ProteinStorage.getProperties(), ProteinStorage.getProperty(), ProteinStorage.getProtein(), ProteinStorage.getSubset(), ProteinStorage.hasProperty(), ProteinStorage.idProp, ProteinStorage.iterChunks(), ProteinStorage.metaFile, ProteinStorage.name, ProteinStorage.proteins, ProteinStorage.reload(), ProteinStorage.removeProperty(), ProteinStorage.save(), ProteinStorage.searchOnProperty(), ProteinStorage.sequenceProp, ProteinStorage.toFile(), ProteinStorage.toJSON()
- qsprpred.extra.data.storage.protein.interfaces.storedprotein module
- Module contents
Submodules
qsprpred.extra.data.storage.protein.tabular_pcm module
- class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProtein(protein_id: str, sequence: str | None = None, parent: TabularProtein | None = None, props: dict[str, Any] | None = None, representations: Iterable[TabularProtein] | None = None)[source]
Bases: StoredProtein
A protein object that is stored in a tabular format.
- Variables:
id (str) – id of the protein
sequence (str) – sequence of the protein
representations (Iterable[TabularProtein]) – representations of the protein
Create a new protein instance.
- Parameters:
protein_id (str) – identifier of the protein
sequence (str) – sequence of the protein
parent (TabularProtein) – parent protein
representations (Iterable[TabularProtein]) – representations of the protein
- property parent: TabularProtein
Get the parent protein.
- property representations: Iterable[TabularProtein]
Get all representations of the protein.
- class qsprpred.extra.data.storage.protein.tabular_pcm.TabularProteinStorage(name: str, df: DataFrame | None = None, sequence_col: str = 'Sequence', sequence_provider: Callable | None = None, store_dir: str = '.', overwrite: bool = False, index_cols: list[str] | None = None, n_jobs: int = 1, chunk_size: int | None = None, protein_col: str = 'accession', random_state: int | None = None, store_format: str = 'pkl', parallel_generator: ParallelGenerator | None = None)[source]
Bases: ProteinStorage, PandasDataTable
A storage class for proteins stored in a tabular format.
- Variables:
sequenceCol (str) – name of the column that contains all protein sequences
proteinSeqProvider (Callable) – function that provides protein
sequenceProp (str) – name of the property that contains all protein sequences
proteins (Iterable[TabularProtein]) – all proteins in the store
Create a new protein storage instance.
- Parameters:
name (str) – name of the storage
df (pd.DataFrame) – data frame containing the proteins
sequence_col (str) – name of the column that contains all protein sequences
sequence_provider (Callable) – function that provides protein
store_dir (str) – directory to store the data
overwrite (bool) – overwrite the existing data
n_jobs (int) – number of parallel jobs
chunk_size (int) – size of the chunks
protein_col (str) – name of the column that contains the protein ids
random_state (int) – random state
store_format (str) – format to store the data
parallel_generator (ParallelGenerator) – parallel generator
- addEntries(ids: list[str], props: dict[str, list], raise_on_existing: bool = True)
Add entries to the data set.
- addProperty(name: str, data: list, ids: list[str] | None = None, ignore_missing: bool = False)
Add a property to the data frame.
- add_protein(protein: TabularProtein, raise_on_existing=True)[source]
Add a protein to the store.
- Parameters:
protein (TabularProtein) – protein to add to the store
raise_on_existing (bool) – raise an exception if the protein already exists in the store
- apply(func: Callable[[dict[str, list[Any]] | DataFrame, ...], Any], func_args: tuple[Any, ...] | None = None, func_kwargs: dict[str, Any] | None = None, on_props: tuple[str, ...] | None = None, as_df: bool = False, chunk_size: int | None = None, n_jobs: int | None = None) Generator
Apply a function to the data frame. The properties of the data set are passed as the first positional argument to the function. This will be a dictionary of the form {'prop1': [...], 'prop2': [...], ...}. If as_df is True, the properties will be passed as a data frame instead.
Any additional arguments specified in func_args and func_kwargs will be passed to the function after the properties as positional and keyword arguments, respectively.
If on_props is specified, only the properties in this list will be passed to the function. If on_props is None, all properties will be passed to the function.
- Parameters:
func (Callable) – Function to apply to the data frame.
func_args (list) – Positional arguments to pass to the function.
func_kwargs (dict) – Keyword arguments to pass to the function.
on_props (list[str]) – list of properties to send to the function as arguments
as_df (bool) – If True, the function is applied to chunks represented as data frames.
chunk_size (int) – Size of chunks to use per job in parallel processing. If None, the chunk size will be set to self.chunkSize. The chunk size will always be set to the number of rows in the data frame if n_jobs or self.nJobs is 1.
n_jobs (int) – Number of jobs to use for parallel processing. If None, self.nJobs is used.
- Returns:
Generator that yields the results of the function applied to each chunk of the data frame as determined by chunk_size and n_jobs. Each item in the generator will be the result of the function applied to one chunk of the data set.
- Return type:
Generator
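The calling convention described for apply() can be illustrated with a small plain-Python sketch. It does not call QSPRpred: apply_in_chunks and mean_length are hypothetical names, the sketch runs serially, and it covers only the dictionary form of the properties (not as_df or parallel jobs).

```python
from typing import Any, Callable, Generator, Optional


def apply_in_chunks(
    props: dict,
    func: Callable[..., Any],
    func_args: tuple = (),
    func_kwargs: Optional[dict] = None,
    on_props: Optional[tuple] = None,
    chunk_size: Optional[int] = None,
) -> Generator[Any, None, None]:
    """Hypothetical stand-in for the chunked calling convention (not the library code)."""
    func_kwargs = func_kwargs or {}
    # Keep only the requested properties, or all of them if on_props is None.
    selected = {k: v for k, v in props.items() if on_props is None or k in on_props}
    n_rows = len(next(iter(selected.values()), []))
    # Without a chunk size, the whole table is treated as a single chunk.
    size = chunk_size or max(n_rows, 1)
    for start in range(0, n_rows, size):
        chunk = {k: v[start:start + size] for k, v in selected.items()}
        # Properties come first, then extra positional and keyword arguments.
        yield func(chunk, *func_args, **func_kwargs)


def mean_length(chunk: dict, scale: float = 1.0) -> float:
    """Example function: scaled mean sequence length within one chunk."""
    lengths = [len(seq) for seq in chunk["Sequence"]]
    return scale * sum(lengths) / len(lengths)


props = {
    "accession": ["P1", "P2", "P3", "P4"],
    "Sequence": ["MKT", "MKTAY", "MA", "MKTAYIAK"],
}
# One result per chunk of two rows.
results = list(apply_in_chunks(props, mean_length, on_props=("Sequence",), chunk_size=2))
```

Because the result is a generator, nothing is computed until it is consumed, for example with list() as above.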
- dropEmptyProperties(names: list[str])
Drop rows with empty target property value from the data set.
- dropEntries(ids: Iterable[str], ignore_missing: bool = False)
Drop entries from the data set by their IDs.
- generateIndex(name: str | None = None, prefix: str | None = None)
Generate a custom index for the data frame automatically.
- getDF()
Get the data frame this instance manages.
- Returns:
The data frame this instance manages.
- Return type:
pd.DataFrame
- getPCMInfo() tuple[dict[str, str], dict][source]
Return a dictionary of protein sequences for the proteins in the data frame and the additional metadata separately.
- Returns:
Dictionary of protein sequences.
- Return type:
sequences (dict)
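As a rough illustration of the two-part return value, the sketch below splits tabular rows into a sequence dictionary and a separate metadata dictionary. It is plain Python rather than QSPRpred code: pcm_info is a hypothetical helper, the column names are the constructor defaults ('accession' and 'Sequence'), and the per-protein layout of the metadata is an assumption, since the reference does not specify it.

```python
from typing import Any


def pcm_info(
    rows: list,
    protein_col: str = "accession",
    sequence_col: str = "Sequence",
) -> tuple:
    """Hypothetical helper: return (sequences, metadata) built from row dictionaries."""
    sequences: dict = {}
    metadata: dict = {}
    for row in rows:
        protein_id = row[protein_col]
        sequences[protein_id] = row[sequence_col]
        # Assumed layout: every remaining column is kept as metadata per protein.
        metadata[protein_id] = {
            key: value
            for key, value in row.items()
            if key not in (protein_col, sequence_col)
        }
    return sequences, metadata


rows = [
    {"accession": "P1", "Sequence": "MKT", "organism": "human"},
    {"accession": "P2", "Sequence": "MA", "organism": "mouse"},
]
sequences, metadata = pcm_info(rows)
```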
- getProperties() list[str]
Get names of all properties/variables saved in the data frame (all columns).
- getProperty(name: str, ids: tuple[str] | None = None, ignore_missing: bool = False) Series
Get property values from the data set.
- getProtein(protein_id: str) TabularProtein[source]
Get a protein from the store using its name.
- Parameters:
protein_id (str) – name of the protein to search
- Returns:
instance of Protein
- Return type:
TabularProtein
- Raises:
ValueError – if the protein is not found
- getSubset(properties: list[str], ids: list[str] | None = None, name: str | None = None, path: str | None = None, ignore_missing: bool = False) PandasDataTable
Get a subset of the data set by providing a prefix for the column names or a column name directly.
- Parameters:
- Returns:
A new data set containing the subset of the properties
- Return type:
PandasDataTable
- iterChunks(size: int | None = None, on_props: tuple[str] | None = None, as_dict: bool = False) Generator[DataFrame | dict, None, None]
Batch a data frame into chunks of the given size.
- Parameters:
- Returns:
Generator that yields batches of the data frame as smaller data frames.
- Return type:
Generator[pd.DataFrame, None, None]
- property metaFile
The path to the meta file of this data set.
- property nJobs
Number of jobs to use for parallel processing.
- property proteins: list[TabularProtein]
Get all proteins in the store.
- Returns:
list of proteins
- Return type:
list[TabularProtein]
- reload()
Reload the data table from disk.
- removeProperty(name)
Remove a property from the data frame.
- Parameters:
name (str) – Name of the property to delete.
- save() str
Save the data frame to disk and all associated files.
- Returns:
Path to the saved data frame.
- Return type:
(str)
- searchOnProperty(prop_name: str, values: list[str], exact: bool = False) PandasDataTable
Search the molecules within this MoleculeDataSet on a property value and return the appropriate subset.
- Parameters:
- Returns:
A data set with the molecules that match the search.
- Return type:
PandasDataTable
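The parameters of searchOnProperty() are not described here, so the following plain-Python sketch rests on an assumption: that exact=True requires a value to equal one of the search terms, while exact=False accepts any value containing one of them as a substring. The function search_on_property and the example rows are hypothetical and do not use QSPRpred.

```python
def search_on_property(rows: list, prop_name: str, values: list, exact: bool = False) -> list:
    """Hypothetical stand-in: filter row dictionaries by the value of one property."""

    def matches(cell) -> bool:
        text = str(cell)
        if exact:
            # Assumed meaning of exact=True: full string equality with a search term.
            return text in values
        # Assumed meaning of exact=False: any search term occurs as a substring.
        return any(term in text for term in values)

    return [row for row in rows if matches(row[prop_name])]


rows = [
    {"accession": "P1", "family": "kinase A"},
    {"accession": "P2", "family": "kinase B"},
    {"accession": "P3", "family": "GPCR"},
]
partial = search_on_property(rows, "family", ["kinase"])
strict = search_on_property(rows, "family", ["kinase"], exact=True)
```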
- setIndex(cols: list[str])
Create an index column from several columns of the data set. This also resets the idProp attribute to be the name of the index columns joined by a '~' character. The values of the columns are also joined in the same way to create the index. Thus, make sure the values of the columns are unique together and can be joined to a string.
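The joining rule can be sketched in plain Python. joined_index is a hypothetical helper, not part of QSPRpred, and raising an error on duplicate values is an assumption made here to make the uniqueness requirement visible; the reference does not say how the library reacts to duplicates.

```python
def joined_index(rows: list, cols: list, sep: str = "~") -> tuple:
    """Hypothetical helper: build an index name and index values from several columns."""
    # The index name is the column names joined by the separator.
    name = sep.join(cols)
    # Each index value joins the row's column values in the same way.
    values = [sep.join(str(row[col]) for col in cols) for row in rows]
    if len(set(values)) != len(values):
        # Assumed behavior for this sketch only: reject non-unique combinations.
        raise ValueError("joined column values are not unique")
    return name, values


rows = [
    {"accession": "P1", "isoform": 1},
    {"accession": "P1", "isoform": 2},
]
name, values = joined_index(rows, ["accession", "isoform"])
```

Here neither column is unique on its own, but the combination is, which is the situation the note about uniqueness refers to.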
- property storeDir
The data set folder containing the data set files after saving.
- property storePath
The path to the main data set file.
- property storePrefix
The prefix of the data set files.