qsprpred.data.sources.papyrus package
Submodules
qsprpred.data.sources.papyrus.papyrus_class module
Creating dataset from Papyrus database.
- class qsprpred.data.sources.papyrus.papyrus_class.Papyrus(data_dir: str = '/home/runner/.Papyrus', version: str = 'latest', descriptors: str | list[str] | None = None, stereo: bool = False, disk_margin: float = 0.01, plus_only: bool = True)[source]
Bases: DataSource
Create new instance of Papyrus dataset. See papyrus_filter, Papyrus.download and Papyrus.getData for more details.
- Variables:
DEFAULT_DIR (str) – default directory for Papyrus database and the extracted data
dataDir (str) – storage directory for Papyrus database and the extracted data
_papyrusDir (str) – directory where the Papyrus database is located, os.path.join(dataDir, "papyrus")
version (str) – Papyrus database version
descriptors (list, str, None) – descriptors to download if not already present
stereo (bool) – use version with stereochemistry
nostereo (bool) – use version without stereochemistry
plusplus (bool) – use plusplus version
diskMargin (float) – the disk space margin to leave free
- Parameters:
data_dir (str) – storage directory for Papyrus database and the extracted data
version (str) – Papyrus database version
descriptors (str, list, None) – descriptors to download if not already present (set to 'all' for all descriptors, otherwise a list of descriptor names, see https://github.com/OlivierBeq/Papyrus-scripts)
stereo (bool) – include stereochemistry in the database
disk_margin (float) – the disk space margin to leave free
plus_only (bool) – use only the Papyrus++ subset, which contains only high quality data
- DEFAULT_DIR = '/home/runner/.Papyrus'
- download()[source]
Download Papyrus database with the required information.
Only newly requested data is downloaded. Remove the files if you want to reload the data completely.
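As a rough illustration of how the constructor and download() fit together, the sketch below builds a Papyrus source from the parameters documented above and then fetches any missing files. The helper name make_papyrus_source is hypothetical and not part of QSPRpred; the import is done inside the function so the snippet can be loaded without the package installed. Calling it requires QSPRpred, network access and enough free disk space.

```python
def make_papyrus_source(data_dir, descriptors=None):
    """Create a Papyrus data source and download the files it needs.

    Hypothetical helper for illustration; only the Papyrus constructor
    arguments and download() come from the documented API.
    """
    # Lazy import: QSPRpred is only required when the helper is called.
    from qsprpred.data.sources.papyrus.papyrus_class import Papyrus

    source = Papyrus(
        data_dir=data_dir,        # where the database and extracts are stored
        version="latest",         # Papyrus release to use
        descriptors=descriptors,  # None, "all", or a list of descriptor names
        stereo=False,             # use the version without stereochemistry
        disk_margin=0.01,         # disk space margin to leave free
        plus_only=True,           # restrict to the Papyrus++ subset
    )
    # Only data that is not already present in data_dir is downloaded.
    source.download()
    return source
```

Because download() skips files that already exist, calling the helper twice with the same data_dir should not fetch the data again; delete the files first to force a full reload.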
- getData(name: str | None = None, acc_keys: list[str] | None = None, quality: str = 'high', activity_types: list[str] | str = 'all', output_dir: str | None = None, drop_duplicates: bool = False, chunk_size: int = 100000.0, use_existing: bool = True, **kwargs) MoleculeTable [source]
Get the data from the Papyrus database as a MoleculeTable instance.
- Parameters:
acc_keys (list) – protein accession keys
quality (str) – desired minimum quality of the dataset
activity_types (list, str) – list of activity types to include in the dataset
output_dir (str) – path to the directory where the data set will be stored
name (str) – name of the dataset (the prefix of the generated .tsv file)
drop_duplicates (bool) – remove duplicates after filtering
chunk_size (int) – data is read in chunks of this size (see papyrus_filter)
use_existing (bool) – use existing data if available
kwargs – additional keyword arguments passed to MoleculeTable.fromTableFile
- Returns:
the filtered data set
- Return type:
MoleculeTable
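A minimal sketch of a getData call using only the keyword arguments listed above. The wrapper fetch_target_data is a hypothetical name introduced for illustration, and source is assumed to be an already constructed Papyrus instance.

```python
def fetch_target_data(source, acc_keys, name, output_dir):
    """Extract high quality records for the given protein accession keys.

    `source` is assumed to be a Papyrus instance; this wrapper is a
    hypothetical convenience and not part of QSPRpred.
    """
    return source.getData(
        name=name,               # prefix of the generated .tsv file
        acc_keys=acc_keys,       # protein accession keys to keep
        quality="high",          # desired minimum quality
        activity_types="all",    # keep every activity type
        output_dir=output_dir,   # directory where the data set is stored
        drop_duplicates=False,   # keep duplicates after filtering
        use_existing=True,       # reuse an earlier extract if available
    )

# Example call (the accession key and names are illustrative only):
# table = fetch_target_data(source, ["P29274"], "example_set", "data")
```

With use_existing=True, repeating the call with the same name and output_dir should reuse the earlier extract rather than filtering the database again.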
- getDataSet(target_props: list[qsprpred.tasks.TargetProperty | dict], name: str | None = None, **kwargs) QSPRDataset
qsprpred.data.sources.papyrus.papyrus_filter module
Filter Papyrus data.
- qsprpred.data.sources.papyrus.papyrus_filter.papyrus_filter(version: str, acc_key: list[str], quality: str, outdir: str, activity_types: list[str] | str = 'all', prefix: str | None = None, drop_duplicates: bool = True, chunk_size: int = 100000.0, use_existing: bool = True, stereo: bool = False, plusplus: bool = False, papyrus_dir: str | None = None)[source]
Filters the downloaded Papyrus dataset for quality and accession key (UniProt) and outputs a .tsv file of all compounds fulfilling these requirements.
- Parameters:
version (str) – Papyrus database version
acc_key (list) – list of UniProt accession keys
quality (str) – str with minimum quality of dataset to keep
outdir (str) – path to the location of Papyrus data
prefix (str) – prefix for the output file
drop_duplicates (bool) – boolean to drop duplicates from the final dataset
chunk_size (int) – size of the chunks in which the data is read and processed, one chunk at a time
use_existing (bool) – if True, use existing data if available
stereo (bool) – if True, read stereochemistry data (if available)
plusplus (bool) – if True, read high quality Papyrus++ data (if available)
papyrus_dir – path to the location of Papyrus database
- Returns:
dataset (pd.DataFrame) – filtered dataset
outfile (str) – path to the output file
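To make the chunked filtering idea concrete, here is a standard-library sketch of the general technique: read a tab-separated file in fixed-size chunks, keep rows that match the accession keys and a minimum quality, and optionally drop exact duplicates. This is not the implementation of papyrus_filter. The column names accession and Quality, the quality ordering, and the function name filter_rows are all assumptions made for the example.

```python
import csv
import io
from itertools import islice

# Assumed ordering of quality labels, lowest to highest.
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def filter_rows(handle, acc_keys, min_quality, chunk_size=100_000,
                drop_duplicates=True):
    """Filter tab-separated rows chunk by chunk (illustrative only)."""
    reader = csv.DictReader(handle, delimiter="\t")
    wanted = set(acc_keys)
    threshold = QUALITY_RANK[min_quality.lower()]
    seen = set()
    kept = []
    while True:
        # Pull at most chunk_size rows so memory use stays bounded.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row in chunk:
            if row["accession"] not in wanted:
                continue
            if QUALITY_RANK[row["Quality"].lower()] < threshold:
                continue
            key = tuple(row.items())
            if drop_duplicates and key in seen:
                continue
            seen.add(key)
            kept.append(row)
    return kept


# Tiny in-memory table standing in for a database file (made-up rows).
SAMPLE = (
    "SMILES\taccession\tQuality\n"
    "CCO\tP29274\tHigh\n"
    "CCO\tP29274\tHigh\n"
    "CCN\tP29274\tLow\n"
    "CCC\tP00000\tHigh\n"
)
rows = filter_rows(io.StringIO(SAMPLE), ["P29274"], "high", chunk_size=2)
print([row["SMILES"] for row in rows])
```

Processing in chunks means only chunk_size rows plus the retained matches are held in memory at once, which matters when the source file is far larger than the filtered result.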