drugex.data package
Subpackages
Submodules
drugex.data.datasets module
defaultdatasets
Created by: Martin Sicho On: 25.06.22, 19:42
- class drugex.data.datasets.GraphFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet to manage the fragment-molecule pair encodings for the graph-based model (GraphModel).
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
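As an illustration, here is a minimal sketch of loading a previously encoded graph fragment data set and turning it into a PyTorch DataLoader via the inherited DataSet methods. The file names are hypothetical placeholders, and the VocGraph import path is taken from the defaults shown in drugex.data.fragments below.

```python
from drugex.data.datasets import GraphFragDataSet
from drugex.data.corpus.vocabulary import VocGraph

# hypothetical paths to a previously encoded data set and its vocabulary file
dataset = GraphFragDataSet('encoded_graph_pairs.tsv')
dataset.fromFile('encoded_graph_pairs.tsv', vocs=['voc_graph.txt'], voc_class=VocGraph)

# convert to a PyTorch DataLoader using the inherited DataSet.asDataLoader()
loader = dataset.asDataLoader(batch_size=128)
for batch in loader:
    ...  # feed batches to the graph-based model
```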
- class drugex.data.datasets.SmilesDataSet(path, voc=None, rewrite=False)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of molecules for the single-network, sequence-based DrugEx model (RNN).
- columns = ('Smiles', 'Token')
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
- class drugex.data.datasets.SmilesFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of fragment-molecule pairs for the sequence-based, encoder-decoder type of DrugEx models.
- class TargetCreator[source]
Bases: DataToLoader
Old creator for test data that is currently no longer used. Kept here for future reference.
- columns = ('Input', 'Output')
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
drugex.data.fragments module
- class drugex.data.fragments.FragmentCorpusEncoder(fragmenter, encoder, pairs_splitter=None, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
Fragments and encodes fragment-molecule pairs in parallel. Each encoded pair is used as input to the fragment-based DrugEx models.
- class FragmentPairsCollector(other=None)[source]
Bases: ListExtend
A simple ResultCollector that extends an internal list. It can also wrap another instance of itself.
- apply(mols, fragmentPairsCollector=None, encodingCollectors=None)[source]
Apply fragmentation and encoding to the given molecules represented as SMILES strings. Collectors can be used to fetch fragment-molecule pairs and the final encoding with vocabulary.
- Parameters:
mols – list of molecules as SMILES strings
fragmentPairsCollector – an instance of ResultCollector to collect results of the fragmentation (the generated fragment-molecule `tuple`s from the given “fragmenter”)
encodingCollectors – a list of ResultCollector instances matching in length the number of splits given by the “pairs_splitter”. Each ResultCollector receives a (data, FragmentPairsEncodedSupplier) tuple of the currently finished process.
- Returns:
- encodeFragments(pairs, collector)[source]
Encodes the fragment-molecule pairs obtained from FragmentCorpusEncoder.getFragmentPairs() with the FragmentPairEncoder specified in “encoder”.
- Parameters:
collector – The ResultCollector to apply to fetch encoding data from each process.
- Returns:
- getFragmentPairs(mols, collector)[source]
Apply the given “fragmenter” in parallel.
- Parameters:
mols – Molecules represented as SMILES strings.
collector – The ResultCollector to apply to fetch the result per process.
- Returns:
- splitFragmentPairs(pairs)[source]
Use the “pairs_splitter” to get splits of the calculated molecule-fragment pairs from FragmentCorpusEncoder.getFragmentPairs().
- Parameters:
pairs – pairs generated by the “fragmenter”
- Returns:
splits from the specified “splitter”
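To illustrate how these pieces fit together, the sketch below runs the fragmentation and encoding pipeline for the graph-based model. It is a hedged example: the Fragmenter import path and its constructor arguments are assumptions about the fragmenter implementation shipped with DrugEx, and the order of the encoding collectors is assumed to follow the order of splits produced by FragmentPairsSplitter.

```python
from drugex.data.fragments import (
    FragmentCorpusEncoder, FragmentPairsSplitter, GraphFragmentEncoder)
from drugex.data.datasets import GraphFragDataSet
# assumed import path and settings for the fragmenter used to generate the pairs
from drugex.molecules.converters.fragmenters import Fragmenter

smiles = ['CCO', 'c1ccccc1C(=O)O']  # toy input molecules

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'),         # assumed fragmenter settings
    encoder=GraphFragmentEncoder(),               # uses the default VocGraph vocabulary
    pairs_splitter=FragmentPairsSplitter(ratio=0.2),
    n_proc=2,
)

# one DataSet collector per split returned by the splitter (order assumed here)
test_set = GraphFragDataSet('test_graph.tsv', rewrite=True)
train_set = GraphFragDataSet('train_graph.tsv', rewrite=True)
encoder.apply(smiles, encodingCollectors=[test_set, train_set])
```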
- class drugex.data.fragments.FragmentPairsEncodedSupplier(pairs, encoder)[source]
Bases: MolSupplier
Transforms fragment-molecule pairs to the encoded representation used by the fragment-based DrugEx models.
- exception FragmentEncodingException[source]
Bases: ConversionException
Raise this when a fragment failed to encode.
- exception MoleculeEncodingException[source]
Bases: ConversionException
Raise this when the parent molecule of the fragment failed to be encoded.
- class drugex.data.fragments.FragmentPairsSplitter(ratio=0.2, max_test_samples=10000.0, train_collector=None, test_collector=None, unique_collector=None, make_unique=False, seed=None)[source]
Bases: DataSplitter
A DataSplitter to be used to split molecule-fragment pairs into training and test data.
- class drugex.data.fragments.FragmentPairsSupplier(molecules, fragmenter, max_bonds=None)[source]
Bases: MolSupplier
Produces fragment-molecule pairs from input molecules.
- class drugex.data.fragments.GraphFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocGraph object>)[source]
Bases: FragmentPairEncoder
Encodes molecules and fragments for the graph-based transformer (GraphModel).
- encodeMol(smiles)[source]
Molecules are encoded together with their fragments, so this method simply passes the SMILES back as both the tokens and the result of encoding.
- Parameters:
smiles – the input molecule as a SMILES string
- Returns:
The input smiles as both the tokens and as the encoded result.
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
- class drugex.data.fragments.SequenceFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocSmiles object>, update_voc=True, throw=False)[source]
Bases: FragmentPairEncoder
Encode fragment-molecule pairs for the sequence-based models.
- encodeFrag(mol, mol_tokens, frag)[source]
Encode a fragment.
This method is called by FragmentPairsEncodedSupplier, with the mol argument being the output of the encodeMol method above.
- encodeMol(sequence)[source]
Encode a molecule sequence.
- Parameters:
sequence – sequential representation of the molecule (i.e. SMILES)
- Returns:
a tuple containing the obtained tokens from the sequence (if any) and the corresponding sequence of codes
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
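For completeness, here is a corresponding sketch for the sequence-based pipeline, mirroring the graph-based example above. The Fragmenter import and arguments are again assumptions, and a single encoding collector is assumed to be sufficient when no pairs splitter is supplied.

```python
from drugex.data.fragments import FragmentCorpusEncoder, SequenceFragmentEncoder
from drugex.data.datasets import SmilesFragDataSet
from drugex.molecules.converters.fragmenters import Fragmenter  # assumed import path

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'),               # assumed fragmenter settings
    encoder=SequenceFragmentEncoder(update_voc=True),   # default VocSmiles vocabulary
    n_proc=2,
)

out = SmilesFragDataSet('smiles_pairs.tsv', rewrite=True)
encoder.apply(['CCO', 'c1ccccc1C(=O)O'], encodingCollectors=[out])
```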
drugex.data.interfaces module
splitting
Created by: Martin Sicho On: 07.05.22, 15:54
- class drugex.data.interfaces.DataSet(path, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: ResultCollector, ABC
Data sets represent encoded input data for the various DrugEx models. Each DataSet is associated with a file and also acts as a ResultCollector to append data from parallel operations (see ParallelProcessor). The DataSet is also coupled with the Vocabulary used to encode the data in it. However, the Vocabulary is usually saved in separate file(s) and needs to be loaded explicitly with DataSet.readVocs().
- asDataLoader(batch_size, splitter=None, split_converter=None, n_samples=-1, n_samples_ratio=None)[source]
Convert the data in this DataSet to a compatible PyTorch DataLoader.
- Parameters:
batch_size – the desired batch size
splitter – if a split of the data is required (i.e. a training/validation set), a custom ChunkSplitter can be supplied. Otherwise, only a single DataLoader is created.
split_converter – a custom DataToLoader implementation can be supplied to convert each split to a DataLoader. By default, the DataSet.dataToLoader() method is used instead.
n_samples – the number of desired samples in the supplied data before splitting. If “n_samples > 0” and “len(data) < n_samples”, the data of the DataSet is oversampled to match “len(data) == n_samples”.
n_samples_ratio – if supplied, only “n_samples*n_samples_ratio” samples are generated from this DataSet before splitting.
- Returns:
a tuple of PyTorch DataLoader instances matching the number of splits as defined by the current “splitter”. If only one split data set is created, its DataLoader is returned directly.
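The sketch below shows one way asDataLoader() might be used with a splitter to obtain separate training and validation loaders. The file path is a placeholder, RandomTrainTestSplitter is the splitter documented in drugex.data.processing below, and the number and order of the returned loaders follow whatever splits that splitter produces.

```python
from drugex.data.datasets import SmilesDataSet
from drugex.data.processing import RandomTrainTestSplitter

dataset = SmilesDataSet('corpus.tsv')     # hypothetical path to an encoded corpus
splitter = RandomTrainTestSplitter(0.1)   # hold out 10% of the data

# one DataLoader is created per split produced by the splitter
loaders = dataset.asDataLoader(batch_size=256, splitter=splitter)
```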
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- abstract static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- fromFile(path, vocs=(), voc_class=None)[source]
Initialize this DataSet from file and load the associated vocabulary.
- Parameters:
path – Path to the encoded data.
vocs – Paths to the file(s) containing the vocabulary
voc_class – The Vocabulary implementation to initialize.
- Returns:
- getData(chunk_size=None)[source]
Get this DataSet as a pandas DataFrame.
- Parameters:
chunk_size – the size of the chunk to load at a time
- Returns:
a pandas DataFrame representing this instance. If “chunk_size” is specified, an iterator is returned that supplies the data in chunks.
- getVoc()[source]
Return the Vocabulary associated with this data set (it should comprise all tokens within it). The vocabulary can be generated from the results collected from a CorpusEncoder or FragmentCorpusEncoder, for which this class acts as a collector, or it can be loaded from files with DataSet.readVocs().
- Returns:
the associated Vocabulary instance.
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
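A brief hedged sketch of loading vocabulary files into a data set with readVocs(). The file names are placeholders and VocSmiles is one of the Vocabulary implementations referenced in drugex.data.fragments above; any constructor arguments it requires beyond the words read from the files could be passed through *args/**kwargs.

```python
from drugex.data.datasets import SmilesDataSet
from drugex.data.corpus.vocabulary import VocSmiles

dataset = SmilesDataSet('corpus.tsv')                    # hypothetical encoded corpus
dataset.readVocs(['corpus_voc_smiles.txt'], VocSmiles)   # hypothetical vocabulary file
voc = dataset.getVoc()                                   # combined vocabulary for this DataSet
```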
- updateVoc(voc)[source]
Accept a Vocabulary instance and add it to the existing one.
- Parameters:
voc – vocabulary to add
- Returns:
- class drugex.data.interfaces.DataSplitter[source]
Bases: ABC
Splits input data into multiple parts.
- class drugex.data.interfaces.DataToLoader[source]
Bases: ABC
Responsible for the conversion of raw input data into data loaders used by the DrugEx models for training.
- class drugex.data.interfaces.FragmentPairEncoder[source]
Bases: ABC
Encode fragments and the associated molecules for the fragment-based DrugEx models.
- abstract encodeFrag(mol, mol_tokens, frag)[source]
Encode fragment.
- Parameters:
mol – the parent molecule of this fragment
mol_tokens – the encoded representation of the parent molecule
frag – the fragment to encode
- Returns:
the encoded representation of the fragment-molecule pair (i.e. the generated tokens corresponding to both the fragment and the parent molecule)
- abstract encodeMol(mol)[source]
Encode molecule.
- Parameters:
mol – molecule as SMILES
- Returns:
a tuple of the molecule tokens (as determined by the specified vocabulary) and the encoded representation
- abstract getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
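To show the shape of this interface, here is a deliberately toy implementation; the character-level tokenization is purely illustrative and not how the real DrugEx encoders work.

```python
from drugex.data.interfaces import FragmentPairEncoder


class ToyFragmentPairEncoder(FragmentPairEncoder):
    """Illustrative encoder that 'encodes' SMILES as lists of characters."""

    def __init__(self, voc):
        self.voc = voc  # any Vocabulary-like object

    def encodeMol(self, mol):
        tokens = list(mol)        # toy tokenization: one token per character
        return tokens, tokens     # (tokens, encoded representation)

    def encodeFrag(self, mol, mol_tokens, frag):
        # encode the fragment together with its parent molecule's tokens
        return list(frag) + ['|'] + mol_tokens

    def getVoc(self):
        return self.voc
```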
drugex.data.processing module
processing
Created by: Martin Sicho On: 27.05.22, 10:16
- class drugex.data.processing.CorpusEncoder(corpus_class, corpus_options, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
This processor translates input molecules to representations that can be used directly as input to both sequence- and graph-based models. It works by evaluating a Corpus in parallel on the input data.
- apply(mols, collector)[source]
Apply the encoder to the given molecules.
- Parameters:
mols – a list or similar data structure with molecules (the representation of each molecule depends on the Corpus implementation used).
collector – a custom ResultCollector to use as a callback to customize how results are collected. If it is specified, this method returns None. A tuple with two items is passed to the collector: the encoded data and the associated Corpus instance used to calculate it.
- Returns:
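A minimal sketch of encoding a set of SMILES into a SmilesDataSet with this processor. The SequenceCorpus import path and the empty corpus options are assumptions; in practice the options dictionary would carry the corpus constructor arguments (e.g. the vocabulary).

```python
from drugex.data.processing import CorpusEncoder
from drugex.data.datasets import SmilesDataSet
# assumed import path for the sequence corpus implementation
from drugex.data.corpus.corpus import SequenceCorpus

smiles = ['CCO', 'CCN', 'c1ccccc1O']  # toy input molecules

encoder = CorpusEncoder(
    SequenceCorpus,   # corpus_class evaluated on the input data
    {},               # corpus_options for the corpus constructor (assumed empty here)
    n_proc=2,
)

# the DataSet acts as the ResultCollector for the encoded data and its vocabulary
out = SmilesDataSet('corpus.tsv', rewrite=True)
encoder.apply(smiles, collector=out)
```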
- class drugex.data.processing.RandomTrainTestSplitter(test_size, max_test_size=10000.0, shuffle=True)[source]
Bases: DataSplitter
Simple splitter to facilitate a random split into training and test set with the option to fix the maximum size of the test set.
- class drugex.data.processing.Standardization(standardizer=<drugex.molecules.converters.standardizers.DefaultStandardizer object>, **kwargs)[source]
Bases: ParallelProcessor
Processor to standardize molecules in parallel.
- class Collector[source]
Bases: ListExtend
- apply(mols, collector=None)[source]
Apply the defined standardization to an iterable of molecules.
This method simply automates the initialization of a ParallelSupplierEvaluator on the given molecules. Molecules can be given as a generator or a MolSupplier, but note that they will be evaluated before processing, which may add overhead. In such a case, consider evaluating the list with a ParallelSupplierEvaluator separately prior to processing.
- Parameters:
mols – an iterable containing the molecules to transform
collector – a callable to collect the results, passed as the ‘result_collector’ to ParallelSupplierEvaluator
- Returns:
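A short hedged sketch of parallel standardization; it assumes that n_proc is accepted through **kwargs (as with the other parallel processors above) and that apply() returns the standardized SMILES when no collector is supplied.

```python
from drugex.data.processing import Standardization

smiles = ['c1ccccc1C(=O)O', 'CC(=O)Oc1ccccc1C(=O)O']  # toy input molecules

standardizer = Standardization(n_proc=2)    # uses the DefaultStandardizer by default
standardized = standardizer.apply(smiles)   # assumed to return the standardized SMILES
```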
drugex.data.tests module
tests
Created by: Martin Sicho On: 18.05.22, 11:49
drugex.data.utils module
- drugex.data.utils.getDataPaths(data_path, input_prefix, mol_type, unique_frags)[source]
Get paths to training and test data files.
- Parameters:
data_path (str) – Path to data directory.
input_prefix (str) – Prefix of data files. If a file with the exact name exists, it is used for both training and testing.
mol_type (str) – Type of molecules in data files. Either ‘smiles’ or ‘graph’.
unique_frags (bool) – Whether to use unique fragments or not.
- Returns:
Paths to training and test data files.
- Return type:
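A small hedged usage sketch; the directory, the file prefix, and the unpacking order of the returned paths are assumptions based on the description above.

```python
from drugex.data.utils import getDataPaths

# hypothetical data directory and file prefix
train_path, test_path = getDataPaths('data/encoded', 'chembl_corpus', 'smiles', unique_frags=False)
```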
Module contents
__init__.py
Created by: Martin Sicho On: 07.05.22, 15:53