drugex.data package
Subpackages
Submodules
drugex.data.datasets module
defaultdatasets
Created by: Martin Sicho On: 25.06.22, 19:42
- class drugex.data.datasets.GraphFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet to manage the fragment-molecule pair encodings for the graph-based model (GraphModel).
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method to use to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. Basically, mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – specified batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of PyTorch DataLoader generated from “data”, but depends on the implementation
- class drugex.data.datasets.SmilesDataSet(path, voc=None, rewrite=False)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of molecules for the single-network sequence-based DrugEx model (RNN).
- columns = ('Smiles', 'Token')
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method to use to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. Basically, mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – specified batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of PyTorch DataLoader generated from “data”, but depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
- class drugex.data.datasets.SmilesFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of fragment-molecule pairs for the sequence-based encoder-decoder type of DrugEx models.
- class TargetCreator[source]
Bases: DataToLoader
Old creator for test data that is no longer used. Saved here for future reference.
- columns = ('Input', 'Output')
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method to use to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. Basically, mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – specified batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of PyTorch DataLoader generated from “data”, but depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
drugex.data.fragments module
- class drugex.data.fragments.FragmentCorpusEncoder(fragmenter, encoder, pairs_splitter=None, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
Fragments and encodes fragment-molecule pairs in parallel. Each encoded pair is used as input to the fragment-based DrugEx models.
- class FragmentPairsCollector(other=None)[source]
Bases: ListExtend
A simple ResultCollector that extends an internal list. It can also wrap another instance of itself.
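The collector pattern described here can be pictured with a small stand-alone sketch. It does not import DrugEx: `ListCollector` is a hypothetical stand-in, and the reading of "wrap another instance" as "also forward each result to the wrapped collector" is an assumption, not something the documentation confirms.

```python
class ListCollector:
    """Hypothetical stand-in for a list-extending result collector."""

    def __init__(self, other=None):
        self.items = []      # internal list that grows with every result
        self.other = other   # optional wrapped collector (assumed semantics)

    def __call__(self, result):
        # Each call receives the items produced by one finished chunk.
        result = list(result)
        self.items.extend(result)
        if self.other is not None:
            # Assumption: wrapping means forwarding the same result onward.
            self.other(result)


inner = ListCollector()
outer = ListCollector(inner)
outer([("frag1", "mol1"), ("frag2", "mol1")])
outer([("frag3", "mol2")])
print(len(outer.items), len(inner.items))
```

Because the collector is just a callable, a parallel processor can invoke it once per finished chunk without knowing how the results are stored.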
- apply(mols, fragmentPairsCollector=None, encodingCollectors=None)[source]
Apply fragmentation and encoding to the given molecules represented as SMILES strings. Collectors can be used to fetch fragment-molecule pairs and the final encoding with vocabulary.
- Parameters:
mols – list of molecules as SMILES strings
fragmentPairsCollector – an instance of ResultCollector to collect results of the fragmentation (the generated fragment-molecule `tuple`s from the given “fragmenter”).
encodingCollectors – a list of ResultCollector instances matching in length the number of splits given by the “pairs_splitter”. Each ResultCollector receives a (data, FragmentPairsEncodedSupplier) tuple of the currently finished process.
- Returns:
- encodeFragments(pairs, collector)[source]
Encodes fragment-pairs obtained from FragmentCorpusEncoder.getFragmentPairs() with the specified FragmentPairEncoder initialized in “encoder”.
- Parameters:
collector – The ResultCollector to apply to fetch encoding data from each process.
- Returns:
- getFragmentPairs(mols, collector)[source]
Apply the given “fragmenter” in parallel.
- Parameters:
mols – Molecules represented as SMILES strings.
collector – The ResultCollector to apply to fetch the result per process.
- Returns:
- splitFragmentPairs(pairs)[source]
Use the “pairs_splitter” to get splits of the calculated molecule-fragment pairs from FragmentCorpusEncoder.getFragmentPairs().
- Parameters:
pairs – pairs generated by the “fragmenter”
- Returns:
splits from the specified “splitter”
- class drugex.data.fragments.FragmentPairsEncodedSupplier(pairs, encoder)[source]
Bases: MolSupplier
Transforms fragment-molecule pairs to the encoded representation used by the fragment-based DrugEx models.
- exception FragmentEncodingException[source]
Bases: ConversionException
Raise this when a fragment failed to encode.
- exception MoleculeEncodingException[source]
Bases: ConversionException
Raise this when the parent molecule of the fragment failed to be encoded.
- class drugex.data.fragments.FragmentPairsSplitter(ratio=0.2, max_test_samples=10000.0, train_collector=None, test_collector=None, unique_collector=None, make_unique=False, seed=None)[source]
Bases: DataSplitter
A DataSplitter to be used to split molecule-fragment pairs into training and test data.
- class drugex.data.fragments.FragmentPairsSupplier(molecules, fragmenter, max_bonds=None)[source]
Bases: MolSupplier
Produces fragment-molecule pairs from input molecules.
- class drugex.data.fragments.GraphFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocGraph object>)[source]
Bases: FragmentPairEncoder
Encode molecules and fragments for the graph-based transformer (GraphModel).
- encodeMol(smiles)[source]
Molecules are encoded together with fragments, so the SMILES string is simply passed back as both the tokens and the result of encoding.
- Parameters:
smiles –
- Returns:
The input smiles as both the tokens and as the encoded result.
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
- class drugex.data.fragments.SequenceFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocSmiles object>, update_voc=True, throw=False)[source]
Bases: FragmentPairEncoder
Encode fragment-molecule pairs for the sequence-based models.
- encodeFrag(mol, mol_tokens, frag)[source]
Encode a fragment.
Called by FragmentPairsEncodedSupplier with the mol argument being the output of the encodeMol method.
- encodeMol(sequence)[source]
Encode a molecule sequence.
- Parameters:
sequence – sequential representation of the molecule (i.e. SMILES)
- Returns:
a tuple containing the obtained tokens from the sequence (if any) and the corresponding sequence of codes
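To make the "tokens plus codes" return value concrete, here is a minimal sketch that does not use DrugEx. The regular expression, the toy vocabulary and the choice to return `(None, None)` for an unknown token are all assumptions made for illustration. The actual VocSmiles tokenization rules may differ.

```python
import re

# Assumed token rules: bracket atoms, two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def encode_smiles(smiles, token_to_code):
    """Split a SMILES string into tokens and map each token to an integer code."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if any(token not in token_to_code for token in tokens):
        # Assumption: signal an encoding failure with empty results.
        return None, None
    codes = [token_to_code[token] for token in tokens]
    return tokens, codes


# Hypothetical toy vocabulary, numbered in list order.
vocabulary = {t: i for i, t in enumerate(["C", "N", "O", "Cl", "(", ")", "="])}

tokens, codes = encode_smiles("CC(=O)Cl", vocabulary)
print(tokens)  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(codes)   # [0, 0, 4, 6, 2, 5, 3]
```

Matching `Cl` before the single-character fallback keeps chlorine as one token instead of splitting it into `C` and `l`.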
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
drugex.data.interfaces module
splitting
Created by: Martin Sicho On: 07.05.22, 15:54
- class drugex.data.interfaces.DataSet(path, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: ResultCollector, ABC
Data sets represent encoded input data for the various DrugEx models. Each DataSet is associated with a file and also acts as a ResultCollector to append data from parallel operations (see ParallelProcessor). The DataSet is also coupled with the Vocabulary used to encode the data in it. However, the Vocabulary is usually saved in one or more separate files and needs to be loaded explicitly with DataSet.readVocs().
- asDataLoader(batch_size, splitter=None, split_converter=None, n_samples=-1, n_samples_ratio=None)[source]
Convert the data in this DataSet to a compatible PyTorch DataLoader.
- Parameters:
batch_size – the desired batch size
splitter – If a split of the data is required (i.e. training/validation set) a custom ChunkSplitter can be supplied. Otherwise, only a single DataLoader is created.
split_converter – a custom DataToLoader implementation can be supplied to convert each split to a DataLoader. By default, the DataSet.dataToLoader() method is used instead.
n_samples – Number of desired samples in the supplied data before splitting. If “n_samples > 0” and “len(data) < n_samples”, the data of the DataSet is oversampled to match “len(data) == n_samples”.
n_samples_ratio – If supplied, only “n_samples*n_samples_ratio” samples are generated from this DataSet before splitting.
- Returns:
a tuple of PyTorch DataLoader instances matching the number of splits as defined by the current “splitter”. If only one split is created, its DataLoader is returned directly.
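The oversampling rule for n_samples can be illustrated with plain Python lists. This is a sketch of the stated condition only: the helper name `oversample`, sampling with replacement, and leaving larger data sets untouched are assumptions, and it does not reproduce how DrugEx handles n_samples_ratio.

```python
import random


def oversample(rows, n_samples=-1, seed=None):
    """Return the rows, padded with random repeats up to n_samples if needed."""
    rows = list(rows)
    if n_samples <= 0 or not rows or len(rows) >= n_samples:
        # Nothing to do: oversampling disabled, no data, or already large enough.
        return rows
    rng = random.Random(seed)
    # Keep every original row and draw the missing ones with replacement.
    extra = rng.choices(rows, k=n_samples - len(rows))
    return rows + extra


data = ["CCO", "CCN", "c1ccccc1"]
padded = oversample(data, n_samples=8, seed=42)
print(len(padded))
```

Keeping all original rows before drawing repeats guarantees that no sample is lost when the data set is padded.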
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- abstract static dataToLoader(data, batch_size, vocabulary)[source]
The default method to use to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. Basically, mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – specified batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of PyTorch DataLoader generated from “data”, but depends on the implementation
- fromFile(path, vocs=(), voc_class=None)[source]
Initialize this DataSet from file and load the associated vocabulary.
- Parameters:
path – Path to the encoded data.
vocs – Paths to the file(s) containing the vocabulary
voc_class – The Vocabulary implementation to initialize.
- Returns:
- getData(chunk_size=None)[source]
Get this DataSet as a pandas DataFrame.
- Parameters:
chunk_size – the size of the chunk to load at a time
- Returns:
pandas DataFrame representing this instance. If “chunk_size” is specified, an iterator is returned that supplies the chunks.
- getVoc()[source]
Return the Vocabulary associated with this data set (should comprise all tokens within it). The vocabulary can be generated from the results collected from CorpusEncoder or FragmentCorpusEncoder on which this class acts as a collector. Or it can be loaded from files with DataSet.readVocs().
- Returns:
the associated Vocabulary instance.
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
- updateVoc(voc)[source]
Accept a Vocabulary instance and add it to the existing one.
- Parameters:
voc – vocabulary to add
- Returns:
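The idea of reading several vocabulary files and adding them together, as readVocs and updateVoc describe, can be sketched without DrugEx. `ToyVocabulary`, the whitespace-separated file format and the use of `+` for merging are hypothetical choices for this illustration. The real Vocabulary classes and their file layout are not shown in this reference.

```python
import tempfile
from pathlib import Path


class ToyVocabulary:
    """Hypothetical vocabulary: nothing more than a set of token strings."""

    def __init__(self, words):
        self.words = set(words)

    def __add__(self, other):
        # Adding two vocabularies yields the union of their tokens.
        return ToyVocabulary(self.words | other.words)


def read_vocs(paths, voc_class):
    """Build one vocabulary per file and add them all together."""
    merged = voc_class([])
    for path in paths:
        # Assumed format: tokens separated by whitespace.
        words = Path(path).read_text().split()
        merged = merged + voc_class(words)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "voc_a.txt")
    second = Path(tmp, "voc_b.txt")
    first.write_text("C\nN\nO\n")
    second.write_text("O\nCl\n")
    voc = read_vocs([first, second], ToyVocabulary)

print(sorted(voc.words))
```

Merging by union means a token that appears in several files is stored once, so the combined vocabulary covers every token in the data set without duplicates.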
- class drugex.data.interfaces.DataSplitter[source]
Bases: ABC
Splits input data into multiple parts.
- class drugex.data.interfaces.DataToLoader[source]
Bases: ABC
Responsible for the conversion of raw input data into data loaders used by the DrugEx models for training.
- class drugex.data.interfaces.FragmentPairEncoder[source]
Bases: ABC
Encode fragments and the associated molecules for the fragment-based DrugEx models.
- abstract encodeFrag(mol, mol_tokens, frag)[source]
Encode fragment.
- Parameters:
mol – the parent molecule of this fragment
mol_tokens – the encoded representation of the parent molecule
frag – the fragment to encode
- Returns:
the encoded representation of the fragment-molecule pair (i.e. the generated tokens corresponding to both the fragment and the parent molecule)
- abstract encodeMol(mol)[source]
Encode molecule.
- Parameters:
mol – molecule as SMILES
- Returns:
a tuple of the molecule tokens (as determined by the specified vocabulary) and the encoded representation
- abstract getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
drugex.data.processing module
processing
Created by: Martin Sicho On: 27.05.22, 10:16
- class drugex.data.processing.CorpusEncoder(corpus_class, corpus_options, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
This processor translates input molecules to representations that can be used directly as input to both sequence- and graph-based models. It works by evaluating a Corpus in parallel on the input data.
- apply(mols, collector)[source]
Apply the encoder to given molecules.
- Parameters:
mols – list or similar data structure with molecules (representation of each molecule depends on the Corpus implementation used).
collector – custom ResultCollector to use as a callback to customize how results are collected. If it is specified, this method returns None. A tuple with two items is passed to the collector: the encoded data and the associated Corpus instance used to calculate it.
- Returns:
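The callback contract described for the collector (one call per finished chunk, receiving a two-item tuple) can be sketched as follows. Nothing here comes from DrugEx: `ToyCorpus`, `apply_in_chunks` and the use of a thread pool are assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyCorpus:
    """Hypothetical corpus that 'encodes' each molecule as a list of characters."""

    def __init__(self, molecules):
        self.molecules = molecules

    def encode(self):
        return [list(mol) for mol in self.molecules]


def apply_in_chunks(mols, collector, chunk_size=2, n_workers=2):
    """Encode chunks concurrently and hand each result to the collector."""
    chunks = [mols[i:i + chunk_size] for i in range(0, len(mols), chunk_size)]

    def work(chunk):
        corpus = ToyCorpus(chunk)
        # Two-item tuple: the encoded data and the corpus that produced it.
        return corpus.encode(), corpus

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for result in pool.map(work, chunks):
            collector(result)
    # Results are delivered through the collector, so nothing is returned.


collected = []


def collect(result):
    data, corpus = result
    collected.extend(data)


apply_in_chunks(["CCO", "CN", "O"], collect)
print(collected)
```

Passing results through a callback keeps the processor independent of where the encoded data ends up, for example an in-memory list or a file-backed data set.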
- class drugex.data.processing.RandomTrainTestSplitter(test_size, max_test_size=10000.0, shuffle=True)[source]
Bases: DataSplitter
Simple splitter to facilitate a random split into training and test set with the option to fix the maximum size of the test set.
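A random split with a capped test set can be sketched in a few lines. This is an illustration under assumptions, not the class's implementation: `random_train_test_split` is a hypothetical helper, test_size is interpreted as a fraction of the data, and the `seed` argument is added here only for reproducibility.

```python
import random


def random_train_test_split(items, test_size, max_test_size=10000, shuffle=True, seed=None):
    """Split items into (train, test), never letting test exceed max_test_size."""
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)
    # Assumption: test_size is a fraction; the cap wins when the fraction is larger.
    n_test = min(int(len(items) * test_size), int(max_test_size))
    return items[n_test:], items[:n_test]


train, test = random_train_test_split(range(100), test_size=0.2, max_test_size=5, seed=1)
print(len(train), len(test))  # 95 5
```

Capping the test set keeps evaluation cost bounded on very large data sets while the fractional setting still applies to small ones.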
- class drugex.data.processing.Standardization(standardizer=<drugex.molecules.converters.standardizers.DefaultStandardizer object>, **kwargs)[source]
Bases: ParallelProcessor
Processor to standardize molecules in parallel.
- class Collector[source]
Bases: ListExtend
- apply(mols, collector=None)[source]
Apply defined standardization to an iterable of molecules.
This method just automates initialization of a ParallelSupplierEvaluator on the given molecules. Molecules can be given as a generator or a MolSupplier, but note that they will be evaluated before processing, which may add overhead. In such a case consider evaluating the list with a ParallelSupplierEvaluator separately prior to processing.
- Parameters:
mols – an iterable containing molecules to transform
collector – a callable to collect the results, passed as the ‘result_collector’ to ParallelSupplierEvaluator
- Returns:
drugex.data.tests module
tests
Created by: Martin Sicho On: 18.05.22, 11:49
drugex.data.utils module
- drugex.data.utils.getDataPaths(data_path, input_prefix, mol_type, unique_frags)[source]
Get paths to training and test data files.
- Parameters:
data_path (str) – Path to data directory.
input_prefix (str) – Prefix of data files. If a file with the exact name exists, it is used for both training and testing.
mol_type (str) – Type of molecules in data files. Either ‘smiles’ or ‘graph’.
unique_frags (bool) – Whether to use unique fragments or not.
- Returns:
Paths to training and test data files.
- Return type:
Module contents
__init__.py
Created by: Martin Sicho On: 07.05.22, 15:53