drugex.data.corpus package
Submodules
drugex.data.corpus.corpus module
corpus
Created by: Martin Sicho On: 26.04.22, 16:47
- class drugex.data.corpus.corpus.SequenceCorpus(molecules, vocabulary=<drugex.data.corpus.vocabulary.VocSmiles object>, update_voc=True, throw=False, check_unique=True)[source]
Bases: Corpus
A Corpus to encode molecules for the sequence-based models.
- getVoc()[source]
Return current vocabulary.
- Returns:
Current vocabulary as a SequenceVocabulary instance.
drugex.data.corpus.interfaces module
interfaces
Created by: Martin Sicho On: 26.04.22, 13:12
- class drugex.data.corpus.interfaces.Corpus(molecules)[source]
Bases: MolSupplier, ABC
A MolSupplier that generates encoded molecule data from the given input.
- convert(representation)[source]
Converts a molecule from the supplied representation to a different one. This method is called automatically on the output of next. By default, it returns the produced representation as is.
- Parameters:
representation – the output produced by next
- Returns:
molecule – the molecule converted from “representation” to the desired output
- abstract getVoc()[source]
Corpus should keep track of the ‘Vocabulary’ used to encode molecules. This method should return its current state.
- Returns:
currently used Vocabulary
- next()[source]
Implement this method so that it provides iteration over molecules item by item. It should fetch the next item from a generator, the next line from a file, or the next item from a remote API. If there are no more items, raise StopIteration.
- Raises:
StopIteration – no more items to return
- Returns:
molecule – one instance of a molecule
annotations (optional) – molecule-associated metadata as a dict
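The iteration contract described above can be illustrated with a minimal, self-contained sketch. It does not import DrugEx: `ToyCorpus` is a hypothetical stand-in that only mirrors the method names documented here (`next`, `convert`, `getVoc`), and the one-character-per-token split is an assumption made for illustration, not the library's tokenizer.

```python
class ToyCorpus:
    """Hypothetical stand-in mirroring the documented Corpus methods."""

    def __init__(self, molecules):
        self._items = iter(molecules)   # any iterable of SMILES strings
        self._voc = set()               # toy "vocabulary": tokens seen so far

    def next(self):
        # Fetch the next item; the built-in next() raises StopIteration
        # once the underlying iterator is exhausted.
        smiles = next(self._items)
        tokens = list(smiles)           # assumption: one character per token
        self._voc.update(tokens)
        return self.convert(tokens)

    def convert(self, representation):
        # Default behaviour described above: return the representation as is.
        return representation

    def getVoc(self):
        # Report the current state of the toy vocabulary.
        return sorted(self._voc)

    # Standard Python iterator protocol layered on top of next().
    def __iter__(self):
        return self

    def __next__(self):
        return self.next()


corpus = ToyCorpus(["CCO", "CN"])
encoded = list(corpus)          # stops when next() raises StopIteration
vocabulary = corpus.getVoc()    # tokens collected while iterating
```

A subclass that needs a different output format would override only `convert`, leaving the item fetching in `next` untouched.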
- class drugex.data.corpus.interfaces.SequenceVocabulary(encode_frags, words, max_len=100, min_len=10)[source]
Bases: Vocabulary, ABC
Generic vocabulary for sequence-based models.
drugex.data.corpus.tests module
tests
Created by: Martin Sicho On: 28.04.22, 14:08
drugex.data.corpus.vocabulary module
vocabulary
Created by: Martin Sicho On: 26.04.22, 13:16
- class drugex.data.corpus.vocabulary.VocGraph(words=('2O', '3O+', '1O-', '4C', '3C+', '3C-', '3N', '4N+', '2N-', '1Cl', '2S', '6S', '4S', '3S+', '5S+', '1S-', '1F', '1I', '5I', '2I+', '1Br', '5P', '3P', '4P+', '2Se', '6Se', '4Se', '3Se+', '4Si', '3B', '4B-', '5As', '3As', '4As+', '2Te', '4Te', '3Te+'), max_len=80, n_frags=4)[source]
Bases: Vocabulary
- defaultWords = ('2O', '3O+', '1O-', '4C', '3C+', '3C-', '3N', '4N+', '2N-', '1Cl', '2S', '6S', '4S', '3S+', '5S+', '1S-', '1F', '1I', '5I', '2I+', '1Br', '5P', '3P', '4P+', '2Se', '6Se', '4Se', '3Se+', '4Si', '3B', '4B-', '5As', '3As', '4As+', '2Te', '4Te', '3Te+')
- class drugex.data.corpus.vocabulary.VocNonGPT(words, src_len=1000, trg_len=100, max_len=100, min_len=10)[source]
Bases: VocSmiles
Modified version of VocSmiles adjusted for the legacy sequence models (Seq2Seq and EncDec).
- decode(matrix, is_smiles=True, is_tk=False)[source]
Takes an array of indices and returns the corresponding SMILES.
- class drugex.data.corpus.vocabulary.VocSmiles(encode_frags, words=('#', '%', '(', ')', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', 'B', 'C', 'F', 'I', 'L', 'N', 'O', 'P', 'R', 'S', '[Ag-3]', '[As+]', '[As]', '[B-]', '[BH-]', '[BH2-]', '[BH3-]', '[B]', '[C+]', '[C-]', '[CH-]', '[CH2]', '[CH]', '[I+]', '[IH2]', '[N+]', '[N-]', '[NH+]', '[NH-]', '[NH2+]', '[N]', '[O+]', '[O-]', '[OH+]', '[O]', '[P+]', '[PH]', '[S+]', '[S-]', '[SH+]', '[SH2]', '[SH]', '[Se+]', '[SeH]', '[Se]', '[SiH2]', '[SiH]', '[Si]', '[Te]', '[b-]', '[c+]', '[c-]', '[cH-]', '[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[se+]', '[se]', '[te+]', '[te]', 'b', 'c', 'n', 'o', 'p', 's'), max_len=100, min_len=10)[source]
Bases: SequenceVocabulary
The class for handling encoding/decoding from SMILES to an array of indices for the main SMILES-based models (GPT2Model and RNN).
- decode(tensor, is_tk=True, is_smiles=True)[source]
Takes an array of indices and returns the corresponding SMILES.
- Parameters:
tensor (torch.LongTensor) – a long tensor containing all the indices of given tokens.
- Returns:
a decoded smiles sequence.
- Return type:
smiles (str)
- defaultWords = ('#', '%', '(', ')', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', 'B', 'C', 'F', 'I', 'L', 'N', 'O', 'P', 'R', 'S', '[Ag-3]', '[As+]', '[As]', '[B-]', '[BH-]', '[BH2-]', '[BH3-]', '[B]', '[C+]', '[C-]', '[CH-]', '[CH2]', '[CH]', '[I+]', '[IH2]', '[N+]', '[N-]', '[NH+]', '[NH-]', '[NH2+]', '[N]', '[O+]', '[O-]', '[OH+]', '[O]', '[P+]', '[PH]', '[S+]', '[S-]', '[SH+]', '[SH2]', '[SH]', '[Se+]', '[SeH]', '[Se]', '[SiH2]', '[SiH]', '[Si]', '[Te]', '[b-]', '[c+]', '[c-]', '[cH-]', '[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[se+]', '[se]', '[te+]', '[te]', 'b', 'c', 'n', 'o', 'p', 's')
- encode(tokens, frags=None)[source]
Takes a list of tokens (e.g. ‘[NH]’) and encodes it to an array of indices.
- Parameters:
tokens – a SMILES sequence represented as a list of tokens
- Returns:
a long tensor containing all the indices of given tokens.
- Return type:
output (torch.LongTensor)
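As a rough illustration of the index encoding described for `encode` and `decode`, the sketch below maps tokens to integer indices and back. It is self-contained and does not call DrugEx: `ToyVoc` is a hypothetical class, it uses plain Python lists where the real methods are documented to work with `torch.LongTensor`, and the `words` tuple is a small subset of `defaultWords` chosen for brevity.

```python
class ToyVoc:
    """Hypothetical token/index mapping; not the DrugEx implementation."""

    def __init__(self, words):
        self.words = list(words)
        # Each token's index is its position in the word list.
        self.tk2ix = {tk: ix for ix, tk in enumerate(self.words)}
        self.ix2tk = {ix: tk for tk, ix in self.tk2ix.items()}

    def encode(self, tokens):
        # A KeyError here means the token is missing from the vocabulary.
        return [self.tk2ix[tk] for tk in tokens]

    def decode(self, indices):
        # Join the tokens back into a single SMILES string.
        return "".join(self.ix2tk[ix] for ix in indices)


voc = ToyVoc(("(", ")", "=", "C", "N", "O", "[nH]", "c", "n"))
tokens = ["C", "C", "(", "=", "O", ")", "N"]   # acetamide, already tokenized
indices = voc.encode(tokens)
smiles = voc.decode(indices)
```

Multi-character tokens such as `[nH]` occupy a single index, which is why encoding operates on a token list rather than on raw characters.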
Module contents
This package contains classes that are necessary to encode data for modeling molecules as they are (i.e. without fragmenting).
Created by: Martin Sicho On: 26.04.22, 13:12