qsprpred.data.sources.papyrus package

Submodules

qsprpred.data.sources.papyrus.papyrus_class module

Creating dataset from Papyrus database.

class qsprpred.data.sources.papyrus.papyrus_class.Papyrus(data_dir: str = '/home/runner/.Papyrus', version: str = 'latest', descriptors: str | list[str] | None = None, stereo: bool = False, disk_margin: float = 0.01, plus_only: bool = True)[source]

Bases: DataSource

Create a new instance of the Papyrus dataset. See papyrus_filter, Papyrus.download, and Papyrus.getData for more details.

Variables:
  • DEFAULT_DIR (str) – default directory for Papyrus database and the extracted data

  • dataDir (str) – storage directory for Papyrus database and the extracted data

  • _papyrusDir (str) – directory where the Papyrus database is located, os.path.join(dataDir, "papyrus")

  • version (str) – Papyrus database version

  • descriptors (list, str, None) – descriptors to download if not already present

  • stereo (bool) – use version with stereochemistry

  • nostereo (bool) – use version without stereochemistry

  • plusplus (bool) – use the Papyrus++ version (high-quality data only)

  • diskMargin (float) – the disk space margin to leave free

Parameters:
  • data_dir (str) – storage directory for Papyrus database and the extracted data

  • version (str) – Papyrus database version

  • descriptors (str, list, None) – descriptors to download if not already present (set to 'all' for all descriptors, otherwise a list of descriptor names; see https://github.com/OlivierBeq/Papyrus-scripts)

  • stereo (bool) – include stereochemistry in the database

  • disk_margin (float) – the disk space margin to leave free

  • plus_only (bool) – use only the Papyrus++ version, which contains only high-quality data
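The disk_margin parameter is described only as the disk space margin to leave free. The sketch below shows one plausible reading, in which the margin is a fraction of the total disk size that must stay free after a download. The function name has_enough_space and this interpretation are assumptions made for illustration; they are not taken from the Papyrus class.

```python
import shutil


def has_enough_space(free_bytes, total_bytes, required_bytes, margin=0.01):
    """Return True if `required_bytes` fits while `margin` of the disk stays free.

    Assumption: `margin` is a fraction of the total disk size (0.01 = 1 %).
    """
    reserved = total_bytes * margin
    return free_bytes - required_bytes >= reserved


# Query the disk that holds the current working directory.
usage = shutil.disk_usage(".")
print(has_enough_space(usage.free, usage.total, required_bytes=0))
```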

DEFAULT_DIR = '/home/runner/.Papyrus'
download()[source]

Download Papyrus database with the required information.

Only newly requested data is downloaded. Remove the files if you want to reload the data completely.
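Because download() skips data that is already present, forcing a full reload means deleting the stored files first. The sketch below removes the "papyrus" subdirectory of a data directory, following the _papyrusDir layout listed under Variables. The helper name clear_papyrus_files is hypothetical, and the sketch assumes that all downloaded files live in that subdirectory.

```python
import os
import shutil
import tempfile


def clear_papyrus_files(data_dir):
    """Delete the `papyrus` subdirectory of `data_dir`; return True if it existed."""
    papyrus_dir = os.path.join(data_dir, "papyrus")
    if os.path.isdir(papyrus_dir):
        shutil.rmtree(papyrus_dir)
        return True
    return False


# Demonstrate on a throwaway directory instead of a real Papyrus store.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "papyrus"))
    removed_first = clear_papyrus_files(tmp)   # the directory existed
    removed_second = clear_papyrus_files(tmp)  # nothing left to remove
```

After clearing the directory, a subsequent call to download() would fetch the requested data again.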

getData(name: str | None = None, acc_keys: list[str] | None = None, quality: str = 'high', activity_types: list[str] | str = 'all', output_dir: str | None = None, drop_duplicates: bool = False, chunk_size: int = 100000.0, use_existing: bool = True, **kwargs) MoleculeTable[source]

Get the data from the Papyrus database as a MoleculeTable instance.

Parameters:
  • acc_keys (list) – protein accession keys

  • quality (str) – desired minimum quality of the dataset

  • activity_types (list, str) – list of activity types to include in the dataset

  • output_dir (str) – path to the directory where the data set will be stored

  • name (str) – name of the dataset (the prefix of the generated .tsv file)

  • drop_duplicates (bool) – remove duplicates after filtering

  • chunk_size (int) – data is read in chunks of this size (see papyrus_filter)

  • use_existing (bool) – use existing if available

  • kwargs – additional keyword arguments passed to MoleculeTable.fromTableFile

Returns:

the filtered data set

Return type:

MoleculeTable
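A minimal usage sketch, built only from the parameters listed in the signatures above. The directory, the dataset name, and the accession key are illustrative values, not defaults of the library. The import is placed inside the function so that the configuration can be inspected without qsprpred installed; calling the function requires the package and network access.

```python
# Constructor arguments (names taken from the Papyrus signature).
papyrus_kwargs = {
    "data_dir": "./papyrus_data",  # hypothetical storage directory
    "version": "latest",
    "stereo": False,
    "plus_only": True,
}

# Arguments for getData (names taken from its signature).
getdata_kwargs = {
    "name": "example_dataset",     # hypothetical prefix of the generated .tsv file
    "acc_keys": ["P29274"],        # UniProt accession chosen for illustration
    "quality": "high",
    "activity_types": "all",
    "output_dir": "./papyrus_data",
    "drop_duplicates": False,
    "use_existing": True,
}


def fetch_molecule_table():
    """Download the requested Papyrus data and return the filtered data set."""
    # Lazy import: only needed when the function is actually called.
    from qsprpred.data.sources.papyrus.papyrus_class import Papyrus

    source = Papyrus(**papyrus_kwargs)
    source.download()  # only fetches data that is not already present
    return source.getData(**getdata_kwargs)
```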

getDataSet(target_props: list[qsprpred.tasks.TargetProperty | dict], name: str | None = None, **kwargs) QSPRDataset
getProteinData(acc_keys: list[str], output_dir: str | None = None, name: str | None = None, use_existing: bool = True) DataFrame[source]

Get the protein data from the Papyrus database.

Parameters:
  • acc_keys (list) – protein accession keys

  • output_dir (str) – path to the directory where the data set will be stored

  • name (str) – name of the dataset (the prefix of the generated .tsv file)

  • use_existing (bool) – use existing if available

Returns:

the protein data

Return type:

pd.DataFrame

qsprpred.data.sources.papyrus.papyrus_filter module

Filter Papyrus data.

qsprpred.data.sources.papyrus.papyrus_filter.papyrus_filter(version: str, acc_key: list[str], quality: str, outdir: str, activity_types: list[str] | str = 'all', prefix: str | None = None, drop_duplicates: bool = True, chunk_size: int = 100000.0, use_existing: bool = True, stereo: bool = False, plusplus: bool = False, papyrus_dir: str | None = None)[source]

Filters the downloaded Papyrus dataset for quality and accession key (UniProt) and outputs a .tsv file of all compounds fulfilling these requirements.

Parameters:
  • version (str) – Papyrus database version

  • acc_key (list) – list of UniProt accession keys

  • quality (str) – minimum quality of the data to keep

  • outdir (str) – path to the location of Papyrus data

  • activity_types (list, str) – list of activity types to keep

  • prefix (str) – prefix for the output file

  • drop_duplicates (bool) – if True, drop duplicates from the final dataset

  • chunk_size (int) – size of each chunk; the data is processed one chunk at a time

  • use_existing (bool) – if True, use existing data if available

  • stereo (bool) – if True, read stereochemistry data (if available)

  • plusplus (bool) – if True, read high quality Papyrus++ data (if available)

  • papyrus_dir (str) – path to the location of the Papyrus database

Returns:
  • dataset (pd.DataFrame) – the filtered dataset

  • outfile (str) – path to the output file

Return type:

tuple[pd.DataFrame, str]
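The chunked filtering described by these parameters can be illustrated with a standard-library sketch. It is not the implementation of papyrus_filter: the column names (smiles, accession, quality), the ordering of quality labels, and the helper names are all assumptions made for the example, and P00000 is a placeholder accession.

```python
import csv
import io
from itertools import islice

# Assumption: quality labels are ordered from lowest to highest.
QUALITY_ORDER = ["low", "medium", "high"]


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def filter_tsv(handle, acc_keys, quality, chunk_size=2, drop_duplicates=True):
    """Keep rows that match an accession key and meet the minimum quality."""
    min_rank = QUALITY_ORDER.index(quality.lower())
    wanted = set(acc_keys)
    kept, seen = [], set()
    reader = csv.DictReader(handle, delimiter="\t")
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            if row["accession"] not in wanted:
                continue
            if QUALITY_ORDER.index(row["quality"].lower()) < min_rank:
                continue
            key = tuple(row.values())
            if drop_duplicates and key in seen:
                continue
            seen.add(key)
            kept.append(row)
    return kept


# Tiny in-memory table with hypothetical column names and values.
sample = (
    "smiles\taccession\tquality\n"
    "CCO\tP29274\tHigh\n"
    "CCO\tP29274\tHigh\n"
    "CCN\tP29274\tLow\n"
    "CCC\tP00000\tHigh\n"
)
rows = filter_tsv(io.StringIO(sample), ["P29274"], "high")
```

Only one chunk is held in memory at a time, which is the point of processing a large table this way; the duplicate check keeps just a set of row keys across chunks.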

Module contents