drugex.data package
Subpackages
Submodules
drugex.data.datasets module
defaultdatasets
Created by: Martin Sicho On: 25.06.22, 19:42
- class drugex.data.datasets.GraphFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet to manage the fragment-molecule pair encodings for the graph-based model (GraphModel).
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
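As an illustration, here is a minimal sketch of loading a previously encoded graph fragment data set and turning it into a PyTorch DataLoader via the inherited DataSet methods. The file names are hypothetical placeholders, and the VocGraph import path is taken from the defaults shown in drugex.data.fragments below.

```python
from drugex.data.datasets import GraphFragDataSet
from drugex.data.corpus.vocabulary import VocGraph

# hypothetical paths to a previously encoded data set and its vocabulary file
dataset = GraphFragDataSet('encoded_graph_pairs.tsv')
dataset.fromFile('encoded_graph_pairs.tsv', vocs=['voc_graph.txt'], voc_class=VocGraph)

# convert to a PyTorch DataLoader using the inherited DataSet.asDataLoader()
loader = dataset.asDataLoader(batch_size=128)
for batch in loader:
    ...  # feed batches to the graph-based model
```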
- class drugex.data.datasets.SmilesDataSet(path, voc=None, rewrite=False)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of molecules for the single-network, sequence-based DrugEx model (RNN).
- columns = ('Smiles', 'Token')
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
- class drugex.data.datasets.SmilesFragDataSet(path, voc=None, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: DataSet
DataSet that holds the encoded SMILES representations of fragment-molecule pairs for the sequence-based, encoder-decoder type of DrugEx models.
- class TargetCreator[source]
Bases: DataToLoader
Old creator for test data that is currently no longer used. Kept here for future reference.
- columns = ('Input', 'Output')
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
drugex.data.fragments module
- class drugex.data.fragments.FragmentCorpusEncoder(fragmenter, encoder, pairs_splitter=None, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
Fragments and encodes fragment-molecule pairs in parallel. Each encoded pair is used as input to the fragment-based DrugEx models.
- class FragmentPairsCollector(other=None)[source]
Bases: ListExtend
A simple ResultCollector that extends an internal list. It can also wrap another instance of itself.
- apply(mols, fragmentPairsCollector=None, encodingCollectors=None)[source]
Apply fragmentation and encoding to the given molecules represented as SMILES strings. Collectors can be used to fetch fragment-molecule pairs and the final encoding with vocabulary.
- Parameters:
mols – list of molecules as SMILES strings
fragmentPairsCollector – an instance of ResultCollector to collect results of the fragmentation (the generated fragment-molecule `tuple`s from the given “fragmenter”)
encodingCollectors – a list of ResultCollector instances matching in length the number of splits given by the “pairs_splitter”. Each ResultCollector receives a (data, FragmentPairsEncodedSupplier) tuple of the currently finished process.
- Returns:
- encodeFragments(pairs, collector)[source]
Encodes the fragment-molecule pairs obtained from FragmentCorpusEncoder.getFragmentPairs() with the FragmentPairEncoder specified in “encoder”.
- Parameters:
collector – The ResultCollector to apply to fetch encoding data from each process.
- Returns:
- getFragmentPairs(mols, collector)[source]
Apply the given “fragmenter” in parallel.
- Parameters:
mols – Molecules represented as SMILES strings.
collector – The ResultCollector to apply to fetch the result per process.
- Returns:
- splitFragmentPairs(pairs)[source]
Use the “pairs_splitter” to get splits of the calculated molecule-fragment pairs from FragmentCorpusEncoder.getFragmentPairs().
- Parameters:
pairs – pairs generated by the “fragmenter”
- Returns:
splits from the specified “splitter”
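To illustrate how these pieces fit together, the sketch below runs the fragmentation and encoding pipeline for the graph-based model. It is a hedged example: the Fragmenter import path and its constructor arguments are assumptions about the fragmenter implementation shipped with DrugEx, and the order of the encoding collectors is assumed to follow the order of splits produced by FragmentPairsSplitter.

```python
from drugex.data.fragments import (
    FragmentCorpusEncoder, FragmentPairsSplitter, GraphFragmentEncoder)
from drugex.data.datasets import GraphFragDataSet
# assumed import path and settings for the fragmenter used to generate the pairs
from drugex.molecules.converters.fragmenters import Fragmenter

smiles = ['CCO', 'c1ccccc1C(=O)O']  # toy input molecules

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'),         # assumed fragmenter settings
    encoder=GraphFragmentEncoder(),               # uses the default VocGraph vocabulary
    pairs_splitter=FragmentPairsSplitter(ratio=0.2),
    n_proc=2,
)

# one DataSet collector per split returned by the splitter (order assumed here)
test_set = GraphFragDataSet('test_graph.tsv', rewrite=True)
train_set = GraphFragDataSet('train_graph.tsv', rewrite=True)
encoder.apply(smiles, encodingCollectors=[test_set, train_set])
```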
- class drugex.data.fragments.FragmentPairsEncodedSupplier(pairs, encoder)[source]
Bases: MolSupplier
Transforms fragment-molecule pairs to the encoded representation used by the fragment-based DrugEx models.
- exception FragmentEncodingException[source]
Bases: ConversionException
Raise this when a fragment failed to encode.
- exception MoleculeEncodingException[source]
Bases: ConversionException
Raise this when the parent molecule of the fragment failed to be encoded.
- class drugex.data.fragments.FragmentPairsSplitter(ratio=0.2, max_test_samples=10000.0, train_collector=None, test_collector=None, unique_collector=None, make_unique=False, seed=None)[source]
Bases: DataSplitter
A DataSplitter to be used to split molecule-fragment pairs into training and test data.
- class drugex.data.fragments.FragmentPairsSupplier(molecules, fragmenter, max_bonds=None)[source]
Bases: MolSupplier
Produces fragment-molecule pairs from input molecules.
- class drugex.data.fragments.GraphFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocGraph object>)[source]
Bases: FragmentPairEncoder
Encodes molecules and fragments for the graph-based transformer (GraphModel).
- encodeMol(smiles)[source]
Molecules are encoded together with their fragments, so this method simply passes the SMILES back as both the tokens and the result of encoding.
- Parameters:
smiles – the input molecule as a SMILES string
- Returns:
The input smiles as both the tokens and as the encoded result.
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
- class drugex.data.fragments.SequenceFragmentEncoder(vocabulary=<drugex.data.corpus.vocabulary.VocSmiles object>, update_voc=True, throw=False)[source]
Bases: FragmentPairEncoder
Encode fragment-molecule pairs for the sequence-based models.
- encodeFrag(mol, mol_tokens, frag)[source]
Encode a fragment.
This method is called by FragmentPairsEncodedSupplier, with the mol argument being the output of the encodeMol method above.
- encodeMol(sequence)[source]
Encode a molecule sequence.
- Parameters:
sequence – sequential representation of the molecule (i.e. SMILES)
- Returns:
a tuple containing the obtained tokens from the sequence (if any) and the corresponding sequence of codes
- getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
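For completeness, here is a corresponding sketch for the sequence-based pipeline, mirroring the graph-based example above. The Fragmenter import and arguments are again assumptions, and a single encoding collector is assumed to be sufficient when no pairs splitter is supplied.

```python
from drugex.data.fragments import FragmentCorpusEncoder, SequenceFragmentEncoder
from drugex.data.datasets import SmilesFragDataSet
from drugex.molecules.converters.fragmenters import Fragmenter  # assumed import path

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'),               # assumed fragmenter settings
    encoder=SequenceFragmentEncoder(update_voc=True),   # default VocSmiles vocabulary
    n_proc=2,
)

out = SmilesFragDataSet('smiles_pairs.tsv', rewrite=True)
encoder.apply(['CCO', 'c1ccccc1C(=O)O'], encodingCollectors=[out])
```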
drugex.data.interfaces module
splitting
Created by: Martin Sicho On: 07.05.22, 15:54
- class drugex.data.interfaces.DataSet(path, rewrite=False, save_voc=True, voc_file=None)[source]
Bases: ResultCollector, ABC
Data sets represent encoded input data for the various DrugEx models. Each DataSet is associated with a file and also acts as a ResultCollector to append data from parallel operations (see ParallelProcessor). The DataSet is also coupled with the Vocabulary used to encode the data in it. However, the Vocabulary is usually saved in separate file(s) and needs to be loaded explicitly with DataSet.readVocs().
- asDataLoader(batch_size, splitter=None, split_converter=None, n_samples=-1, n_samples_ratio=None)[source]
Convert the data in this DataSet to a compatible PyTorch DataLoader.
- Parameters:
batch_size – the desired batch size
splitter – if a split of the data is required (i.e. a training/validation set), a custom ChunkSplitter can be supplied. Otherwise, only a single DataLoader is created.
split_converter – a custom DataToLoader implementation can be supplied to convert each split to a DataLoader. By default, the DataSet.dataToLoader() method is used instead.
n_samples – the number of desired samples in the supplied data before splitting. If “n_samples > 0” and “len(data) < n_samples”, the data of the DataSet is oversampled to match “len(data) == n_samples”.
n_samples_ratio – if supplied, only “n_samples*n_samples_ratio” samples are generated from this DataSet before splitting.
- Returns:
a tuple of PyTorch DataLoader instances matching the number of splits as defined by the current “splitter”. If only one split data set is created, its DataLoader is returned directly.
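The sketch below shows one way asDataLoader() might be used with a splitter to obtain separate training and validation loaders. The file path is a placeholder, RandomTrainTestSplitter is the splitter documented in drugex.data.processing below, and the number and order of the returned loaders follow whatever splits that splitter produces.

```python
from drugex.data.datasets import SmilesDataSet
from drugex.data.processing import RandomTrainTestSplitter

dataset = SmilesDataSet('corpus.tsv')     # hypothetical path to an encoded corpus
splitter = RandomTrainTestSplitter(0.1)   # hold out 10% of the data

# one DataLoader is created per split produced by the splitter
loaders = dataset.asDataLoader(batch_size=256, splitter=splitter)
```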
- createLoaders(data, batch_size, splitter=None, converter=None)[source]
Facilitates splitting and conversion of data to `DataLoader`s.
- Parameters:
data – data to convert
batch_size – batch size
splitter – the ChunkSplitter to use
converter – the DataToLoader instance to convert with
- Returns:
a list of created data loaders (same length as the “splitter” return value)
- abstract static dataToLoader(data, batch_size, vocabulary)[source]
The default method used to convert data (as returned from DataSet.getData()) to a PyTorch DataLoader. It essentially mirrors the DataToLoader interface.
- Parameters:
data – data from DataSet.getData()
batch_size – the batch size for the DataLoader
vocabulary – a Vocabulary instance (in this case, it should be the same as returned by DataSet.getVoc())
- Returns:
typically an instance of a PyTorch DataLoader generated from “data”, but this depends on the implementation
- fromFile(path, vocs=(), voc_class=None)[source]
Initialize this DataSet from file and load the associated vocabulary.
- Parameters:
path – Path to the encoded data.
vocs – Paths to the file(s) containing the vocabulary
voc_class – The Vocabulary implementation to initialize.
- Returns:
- getData(chunk_size=None)[source]
Get this DataSet as a pandas DataFrame.
- Parameters:
chunk_size – the size of the chunk to load at a time
- Returns:
a pandas DataFrame representing this instance. If “chunk_size” is specified, an iterator is returned that supplies the data in chunks.
- getVoc()[source]
Return the Vocabulary associated with this data set (it should comprise all tokens within it). The vocabulary can be generated from the results collected from a CorpusEncoder or FragmentCorpusEncoder, for which this class acts as a collector, or it can be loaded from files with DataSet.readVocs().
- Returns:
the associated Vocabulary instance.
- readVocs(paths, voc_class, *args, **kwargs)[source]
Read vocabularies from files and add them together to form the full vocabulary for this DataSet.
- Parameters:
paths – file paths to vocabulary files
voc_class – Vocabulary implementation to initialize from the files
*args – any positional arguments passed to the Vocabulary constructor besides “words”
**kwargs – any keyword arguments passed to the Vocabulary constructor
- Returns:
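A brief hedged sketch of loading vocabulary files into a data set with readVocs(). The file names are placeholders and VocSmiles is one of the Vocabulary implementations referenced in drugex.data.fragments above; any constructor arguments it requires beyond the words read from the files could be passed through *args/**kwargs.

```python
from drugex.data.datasets import SmilesDataSet
from drugex.data.corpus.vocabulary import VocSmiles

dataset = SmilesDataSet('corpus.tsv')                    # hypothetical encoded corpus
dataset.readVocs(['corpus_voc_smiles.txt'], VocSmiles)   # hypothetical vocabulary file
voc = dataset.getVoc()                                   # combined vocabulary for this DataSet
```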
- updateVoc(voc)[source]
Accept a Vocabulary instance and add it to the existing one.
- Parameters:
voc – vocabulary to add
- Returns:
- class drugex.data.interfaces.DataSplitter[source]
Bases: ABC
Splits input data into multiple parts.
- class drugex.data.interfaces.DataToLoader[source]
Bases: ABC
Responsible for the conversion of raw input data into data loaders used by the DrugEx models for training.
- class drugex.data.interfaces.FragmentPairEncoder[source]
Bases: ABC
Encode fragments and the associated molecules for the fragment-based DrugEx models.
- abstract encodeFrag(mol, mol_tokens, frag)[source]
Encode fragment.
- Parameters:
mol – the parent molecule of this fragment
mol_tokens – the encoded representation of the parent molecule
frag – the fragment to encode
- Returns:
the encoded representation of the fragment-molecule pair (i.e. the generated tokens corresponding to both the fragment and the parent molecule)
- abstract encodeMol(mol)[source]
Encode molecule.
- Parameters:
mol – molecule as SMILES
- Returns:
a tuple of the molecule tokens (as determined by the specified vocabulary) and the encoded representation
- abstract getVoc()[source]
The vocabulary used for encoding.
- Returns:
a Vocabulary instance
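To show the shape of this interface, here is a deliberately toy implementation; the character-level tokenization is purely illustrative and not how the real DrugEx encoders work.

```python
from drugex.data.interfaces import FragmentPairEncoder


class ToyFragmentPairEncoder(FragmentPairEncoder):
    """Illustrative encoder that 'encodes' SMILES as lists of characters."""

    def __init__(self, voc):
        self.voc = voc  # any Vocabulary-like object

    def encodeMol(self, mol):
        tokens = list(mol)        # toy tokenization: one token per character
        return tokens, tokens     # (tokens, encoded representation)

    def encodeFrag(self, mol, mol_tokens, frag):
        # encode the fragment together with its parent molecule's tokens
        return list(frag) + ['|'] + mol_tokens

    def getVoc(self):
        return self.voc
```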
drugex.data.processing module
processing
Created by: Martin Sicho On: 27.05.22, 10:16
- class drugex.data.processing.CorpusEncoder(corpus_class, corpus_options, n_proc=None, chunk_size=None)[source]
Bases: ParallelProcessor
This processor translates input molecules to representations that can be used directly as input to both sequence- and graph-based models. It works by evaluating a Corpus in parallel on the input data.
- apply(mols, collector)[source]
Apply the encoder to the given molecules.
- Parameters:
mols – a list or similar data structure with molecules (the representation of each molecule depends on the Corpus implementation used).
collector – a custom ResultCollector to use as a callback to customize how results are collected. If it is specified, this method returns None. A tuple with two items is passed to the collector: the encoded data and the associated Corpus instance used to calculate it.
- Returns:
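A minimal sketch of encoding a set of SMILES into a SmilesDataSet with this processor. The SequenceCorpus import path and the empty corpus options are assumptions; in practice the options dictionary would carry the corpus constructor arguments (e.g. the vocabulary).

```python
from drugex.data.processing import CorpusEncoder
from drugex.data.datasets import SmilesDataSet
# assumed import path for the sequence corpus implementation
from drugex.data.corpus.corpus import SequenceCorpus

smiles = ['CCO', 'CCN', 'c1ccccc1O']  # toy input molecules

encoder = CorpusEncoder(
    SequenceCorpus,   # corpus_class evaluated on the input data
    {},               # corpus_options for the corpus constructor (assumed empty here)
    n_proc=2,
)

# the DataSet acts as the ResultCollector for the encoded data and its vocabulary
out = SmilesDataSet('corpus.tsv', rewrite=True)
encoder.apply(smiles, collector=out)
```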
- class drugex.data.processing.RandomTrainTestSplitter(test_size, max_test_size=10000.0, shuffle=True)[source]
Bases: DataSplitter
Simple splitter to facilitate a random split into training and test set with the option to fix the maximum size of the test set.
- class drugex.data.processing.Standardization(standardizer=<drugex.molecules.converters.standardizers.DefaultStandardizer object>, **kwargs)[source]
Bases: ParallelProcessor
Processor to standardize molecules in parallel.
- class Collector[source]
Bases: ListExtend
- apply(mols, collector=None)[source]
Apply the defined standardization to an iterable of molecules.
This method simply automates the initialization of a ParallelSupplierEvaluator on the given molecules. Molecules can be given as a generator or a MolSupplier, but note that they will be evaluated before processing, which may add overhead. In such a case, consider evaluating the list with a ParallelSupplierEvaluator separately prior to processing.
- Parameters:
mols – an iterable containing the molecules to transform
collector – a callable to collect the results, passed as the ‘result_collector’ to ParallelSupplierEvaluator
- Returns:
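A short hedged sketch of parallel standardization; it assumes that n_proc is accepted through **kwargs (as with the other parallel processors above) and that apply() returns the standardized SMILES when no collector is supplied.

```python
from drugex.data.processing import Standardization

smiles = ['c1ccccc1C(=O)O', 'CC(=O)Oc1ccccc1C(=O)O']  # toy input molecules

standardizer = Standardization(n_proc=2)    # uses the DefaultStandardizer by default
standardized = standardizer.apply(smiles)   # assumed to return the standardized SMILES
```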
drugex.data.tests module
tests
Created by: Martin Sicho On: 18.05.22, 11:49
drugex.data.utils module
- drugex.data.utils.getDataPaths(data_path, input_prefix, mol_type, unique_frags)[source]
Get paths to training and test data files.
- Parameters:
data_path (str) – Path to data directory.
input_prefix (str) – Prefix of data files. If a file with the exact name exists, it is used for both training and testing.
mol_type (str) – Type of molecules in data files. Either ‘smiles’ or ‘graph’.
unique_frags (bool) – Whether to use unique fragments or not.
- Returns:
Paths to training and test data files.
- Return type:
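A small hedged usage sketch; the directory, the file prefix, and the unpacking order of the returned paths are assumptions based on the description above.

```python
from drugex.data.utils import getDataPaths

# hypothetical data directory and file prefix
train_path, test_path = getDataPaths('data/encoded', 'chembl_corpus', 'smiles', unique_frags=False)
```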
Module contents
__init__.py
Created by: Martin Sicho On: 07.05.22, 15:53