qsprpred.data.sources.papyrus package

Submodules

qsprpred.data.sources.papyrus.papyrus_class module

Creating dataset from Papyrus database.

class qsprpred.data.sources.papyrus.papyrus_class.Papyrus(data_dir: str = '/home/runner/.Papyrus', version: str = 'latest', descriptors: str | list[str] | None = None, stereo: bool = False, disk_margin: float = 0.01, plus_only: bool = True)

Bases: DataSource

Create new instance of Papyrus dataset. See papyrus_filter and Papyrus.download and Papyrus.getData for more details.

Variables:

DEFAULT_DIR (str) – default directory for Papyrus database and the extracted data
dataDir (str) – storage directory for Papyrus database and the extracted data
_papyrusDir (str) – directory where the Papyrus database is located, os.path.join(dataDir, "papyrus")
version (str) – Papyrus database version
descriptors (list, str, None) – descriptors to download if not already present
stereo (bool) – use version with stereochemistry
nostereo (bool) – use version without stereochemistry
plusplus (bool) – use plusplus version
diskMargin (float) – the disk space margin to leave free

Parameters:

data_dir (str) – storage directory for Papyrus database and the extracted data
version (str) – Papyrus database version
descriptors (str, list, None) – descriptors to download if not already present (set to 'all' for all descriptors, otherwise a list of descriptor names; see https://github.com/OlivierBeq/Papyrus-scripts)
stereo (bool) – include stereochemistry in the database
disk_margin (float) – the disk space margin to leave free
plus_only (bool) – use only the Papyrus++ version, which contains only high-quality data

DEFAULT_DIR = '/home/runner/.Papyrus'
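As a minimal sketch of how the constructor parameters above fit together, the helper below collects them into a dictionary and passes them to Papyrus. The helper names and the directory name papyrus_data are hypothetical, and the import is deferred so that the helpers can be defined without QSPRpred installed.

```python
def build_papyrus_args(data_dir: str, version: str = "latest",
                       stereo: bool = False, plus_only: bool = True) -> dict:
    """Collect the documented constructor arguments in one place."""
    return {
        "data_dir": data_dir,    # storage directory for the database and extracted data
        "version": version,      # Papyrus database version
        "descriptors": None,     # no descriptor files requested
        "stereo": stereo,        # include stereochemistry or not
        "disk_margin": 0.01,     # documented default
        "plus_only": plus_only,  # restrict to the high-quality Papyrus++ data
    }


def make_papyrus(data_dir: str, **options):
    """Instantiate Papyrus (requires QSPRpred to be installed)."""
    # Deferred import: only needed when the data source is actually created.
    from qsprpred.data.sources.papyrus.papyrus_class import Papyrus

    return Papyrus(**build_papyrus_args(data_dir, **options))


# "papyrus_data" is an illustrative directory name, not a library default.
args = build_papyrus_args("papyrus_data")
```

Passing an explicit data_dir avoids relying on DEFAULT_DIR, whose value depends on the environment in which the package runs.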

download()

Download Papyrus database with the required information.

Only newly requested data is downloaded. Remove the files if you want to reload the data completely.
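One way to force a complete reload is sketched below. It assumes that the downloaded files live in the papyrus subdirectory of the storage directory, as described for _papyrusDir; the helper name reset_papyrus_cache is hypothetical and not part of the package. Note that it deletes everything in that subdirectory.

```python
import os
import shutil


def reset_papyrus_cache(data_dir: str) -> bool:
    """Delete the downloaded Papyrus files so the next download() starts fresh.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    # Assumption: the database sits in os.path.join(data_dir, "papyrus").
    papyrus_dir = os.path.join(data_dir, "papyrus")
    if os.path.isdir(papyrus_dir):
        shutil.rmtree(papyrus_dir)
        return True
    return False
```

Calling download() on a Papyrus instance afterwards should then fetch the requested data again.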

getData(name: str | None = None, acc_keys: list[str] | None = None, quality: str = 'high', activity_types: list[str] | str = 'all', output_dir: str | None = None, drop_duplicates: bool = False, chunk_size: int = 100000.0, use_existing: bool = True, **kwargs) → MoleculeTable

Get the data from the Papyrus database as a MoleculeTable instance.

Parameters:

acc_keys (list) – protein accession keys
quality (str) – desired minimum quality of the dataset
activity_types (list, str) – list of activity types to include in the dataset
output_dir (str) – path to the directory where the data set will be stored
name (str) – name of the dataset (the prefix of the generated .tsv file)
drop_duplicates (bool) – remove duplicates after filtering
chunk_size (int) – data is read in chunks of this size (see papyrus_filter)
use_existing (bool) – use existing if available
kwargs – additional keyword arguments passed to MoleculeTable.fromTableFile

Returns:

the filtered data set

Return type:

MoleculeTable
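The call can be sketched as follows, using only the parameters documented above. The wrapper name fetch_activity_table is hypothetical, and any accession key, data set name, or directory passed to it is up to the caller.

```python
def fetch_activity_table(papyrus, accession: str, name: str, output_dir: str):
    """Extract a filtered table for a single protein target.

    `papyrus` is expected to be a Papyrus instance (or any object that
    exposes a compatible getData method).
    """
    return papyrus.getData(
        name=name,                # prefix of the generated .tsv file
        acc_keys=[accession],     # protein accession keys
        quality="high",           # documented default
        activity_types="all",     # keep every activity type
        output_dir=output_dir,    # where the data set is stored
        drop_duplicates=False,
        use_existing=True,        # reuse a previously generated file if available
    )
```

For example, fetch_activity_table(papyrus, "P29274", "A2A", "data") would request data for the UniProt accession P29274; the data set name and directory here are placeholders.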

getDataSet(target_props: list[qsprpred.tasks.TargetProperty | dict], name: str | None = None, **kwargs) → QSPRDataset

getProteinData(acc_keys: list[str], output_dir: str | None = None, name: str | None = None, use_existing: bool = True) → DataFrame

Get the protein data from the Papyrus database.

Parameters:

acc_keys (list) – protein accession keys
output_dir (str) – path to the directory where the data set will be stored
name (str) – name of the dataset (the prefix of the generated .tsv file)
use_existing (bool) – use existing if available

Returns:

the protein data

Return type:

pd.DataFrame
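A small sketch of preparing accession keys before requesting protein data is shown below. Both helper names are hypothetical, and the clean-up step (trimming whitespace, upper-casing, removing repeats) is an illustrative convenience rather than something the library is known to require.

```python
def normalize_accessions(raw_keys):
    """Trim, upper-case, and de-duplicate accession keys, preserving order."""
    seen = set()
    cleaned = []
    for key in raw_keys:
        key = key.strip().upper()
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


def fetch_protein_table(papyrus, raw_keys, output_dir=None, name=None):
    """Request protein data for the cleaned accession keys."""
    return papyrus.getProteinData(
        acc_keys=normalize_accessions(raw_keys),
        output_dir=output_dir,
        name=name,
        use_existing=True,
    )
```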

qsprpred.data.sources.papyrus.papyrus_filter module

Filter Papyrus data.

qsprpred.data.sources.papyrus.papyrus_filter.papyrus_filter(version: str, acc_key: list[str], quality: str, outdir: str, activity_types: list[str] | str = 'all', prefix: str | None = None, drop_duplicates: bool = True, chunk_size: int = 100000.0, use_existing: bool = True, stereo: bool = False, plusplus: bool = False, papyrus_dir: str | None = None)

Filters the downloaded Papyrus dataset for quality and accession key (UniProt) and outputs a .tsv file of all compounds fulfilling these requirements.

Parameters:

version (str) – Papyrus database version
acc_key (list) – list of UniProt accession keys
quality (str) – minimum quality of the data to keep
outdir (str) – path to the location of Papyrus data
activity_types (list, str) – list of activity types to keep
prefix (str) – prefix for the output file
drop_duplicates (bool) – if True, drop duplicates from the final dataset
chunk_size (int) – size of the chunks in which the data is read and processed, one chunk at a time
use_existing (bool) – if True, use existing data if available
stereo (bool) – if True, read stereochemistry data (if available)
plusplus (bool) – if True, read high quality Papyrus++ data (if available)
papyrus_dir (str, None) – path to the location of Papyrus database

Returns:

dataset (pd.DataFrame) – the filtered dataset
outfile (str) – path to the output file
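The sketch below shows one way to call the filter directly. It assumes that the function returns the filtered dataset together with the output file path as a two-item tuple, which is how the return description above reads. The wrapper name filter_targets and its filter_func parameter are hypothetical; the latter exists only so that the wrapper can be exercised with a stand-in function when QSPRpred is not installed.

```python
def filter_targets(version, acc_keys, outdir, quality="high", filter_func=None):
    """Filter Papyrus data for the given accession keys.

    Returns a (dataset, outfile) pair, assuming papyrus_filter returns both.
    """
    if filter_func is None:
        # Deferred import: requires QSPRpred to be installed.
        from qsprpred.data.sources.papyrus.papyrus_filter import papyrus_filter

        filter_func = papyrus_filter

    dataset, outfile = filter_func(
        version=version,          # Papyrus database version
        acc_key=list(acc_keys),   # UniProt accession keys
        quality=quality,          # minimum quality to keep
        outdir=outdir,
        activity_types="all",
        drop_duplicates=True,     # documented default
        use_existing=True,        # reuse existing output if available
    )
    return dataset, outfile
```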