drugex.data.corpus package

Submodules

drugex.data.corpus.corpus module

corpus

Created by: Martin Sicho On: 26.04.22, 16:47

class drugex.data.corpus.corpus.SequenceCorpus(molecules, vocabulary=<drugex.data.corpus.vocabulary.VocSmiles object>, update_voc=True, throw=False, check_unique=True)[source]

Bases: Corpus

A Corpus to encode molecules for the sequence-based models.

getVoc()[source]

Return current vocabulary.

Returns:

Current vocabulary as a SequenceVocabulary instance.

processMolecule(seq)[source]

Generate encoding information for the given molecule sequence.

Parameters:

seq – molecule as a sequence (e.g. a SMILES string)

Returns:

A dict in which the key “seq” maps to the original sequence and the key “token” maps to the generated encoding of that sequence.

saveVoc(path)[source]

Save the current state of the vocabulary to a file.

Parameters:

path – Path to the generated file.

Returns:

None

drugex.data.corpus.interfaces module

interfaces

Created by: Martin Sicho On: 26.04.22, 13:12

class drugex.data.corpus.interfaces.Corpus(molecules)[source]

Bases: MolSupplier, ABC

A MolSupplier that generates encoded molecule data from the given input.

convert(representation)[source]

Can be used to convert a molecule from the supplied representation to a different one. This method is called automatically on the output of next. By default, it returns the produced representation as is.

Parameters:

representation – the output produced by next

Returns:

molecule – the molecule converted from “representation” to the desired output

abstract getVoc()[source]

A Corpus should keep track of the Vocabulary used to encode molecules. This method should return the current state of that Vocabulary.

Returns:

currently used Vocabulary

next()[source]

Implement this method so that it provides iteration over molecules item by item. It should fetch the next item from a generator, the next line from a file, or the next item from a remote API. If there are no more items, raise StopIteration.

Raises:

StopIteration – no more items to return

Returns:

molecule – one instance of a molecule

annotations (optional) – molecule-associated metadata as a dict

abstract processMolecule(molecule)[source]

Process one molecule.

Parameters:

molecule – a molecule instance (the representation depends on the implementation).

Returns:

encoded data of the molecule (i.e. data associated with one input sample to the desired DrugEx model)
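
The interplay of next, convert, processMolecule and getVoc described above can be sketched with a small stand-in class. This is not the DrugEx implementation: ToyCorpus and CharCorpus are hypothetical names, the assumption that next passes each molecule through processMolecule is ours, and the character-level "encoding" only illustrates the documented dict layout.

```python
from abc import ABC, abstractmethod


class ToyCorpus(ABC):
    """Hypothetical stand-in for the Corpus interface (not the DrugEx class)."""

    def __init__(self, molecules):
        self._molecules = iter(molecules)

    def __iter__(self):
        return self

    def __next__(self):
        # convert() is applied automatically to whatever next() produces.
        return self.convert(self.next())

    def next(self):
        # Assumption: each fetched molecule is passed through processMolecule.
        # StopIteration propagates once the underlying iterator is exhausted.
        molecule = next(self._molecules)
        return self.processMolecule(molecule)

    def convert(self, representation):
        # Default behaviour: return the produced representation as is.
        return representation

    @abstractmethod
    def processMolecule(self, molecule):
        ...

    @abstractmethod
    def getVoc(self):
        ...


class CharCorpus(ToyCorpus):
    """Toy implementation that treats every character as one token."""

    def __init__(self, molecules):
        super().__init__(molecules)
        self._words = set()

    def processMolecule(self, molecule):
        tokens = list(molecule)
        self._words.update(tokens)  # keep track of the words seen so far
        return {"seq": molecule, "token": tokens}

    def getVoc(self):
        return sorted(self._words)


corpus = CharCorpus(["CCO", "C=O"])
encoded = list(corpus)
print(encoded)
print(corpus.getVoc())
```

Keeping convert separate from next lets a subclass change the output format without touching the iteration logic.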

class drugex.data.corpus.interfaces.SequenceVocabulary(encode_frags, words, max_len=100, min_len=10)[source]

Bases: Vocabulary, ABC

Generic vocabulary for sequence-based models.

addWordsFromSeq(seq, ignoreConstraints=False)[source]
removeIfNew(seq, ignoreConstraints=False)[source]
abstract splitSequence(seq)[source]
toFile(path)[source]
updateIndex()[source]
class drugex.data.corpus.interfaces.Vocabulary(words)[source]

Bases: ABC

Definition of the vocabulary interface. All vocabularies contain “words” that are used for encoding and decoding molecules.

abstract decode(representation)[source]
abstract encode(tokens, frags=None)[source]
abstract static fromFile(path)[source]
abstract toFile(path)[source]
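
As a rough illustration of the four abstract methods, the sketch below implements a toy vocabulary that maps words to integer indices and back. ToyVocabulary is a hypothetical class, and both the one-word-per-line file layout and the plain Python lists (instead of tensors) are assumptions made for the example, not the formats used by DrugEx.

```python
import os
import tempfile


class ToyVocabulary:
    """Hypothetical vocabulary: maps words to integer indices and back."""

    def __init__(self, words):
        self.words = list(words)
        self.tk2ix = {word: index for index, word in enumerate(self.words)}

    def encode(self, tokens):
        # Look up the index of every token; unknown tokens raise KeyError.
        return [self.tk2ix[token] for token in tokens]

    def decode(self, representation):
        # Map indices back to words and join them into one string.
        return "".join(self.words[index] for index in representation)

    def toFile(self, path):
        # Assumed layout: one word per line.
        with open(path, "w") as handle:
            handle.write("\n".join(self.words))

    @staticmethod
    def fromFile(path):
        with open(path) as handle:
            return ToyVocabulary(handle.read().split("\n"))


voc = ToyVocabulary(["C", "O", "=", "[NH]"])
indices = voc.encode(["C", "=", "O"])
decoded = voc.decode(indices)

# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "voc.txt")
    voc.toFile(path)
    restored = ToyVocabulary.fromFile(path)

print(indices, decoded, restored.words)
```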

drugex.data.corpus.tests module

tests

Created by: Martin Sicho On: 28.04.22, 14:08

class drugex.data.corpus.tests.CorpusTest(methodName='runTest')[source]

Bases: TestCase

static getMols()[source]
static getTempFilePath(name)[source]
test_graph_voc()[source]
test_sequence_corpus_file()[source]

drugex.data.corpus.vocabulary module

vocabulary

Created by: Martin Sicho On: 26.04.22, 13:16

class drugex.data.corpus.vocabulary.VocGraph(words=('2O', '3O+', '1O-', '4C', '3C+', '3C-', '3N', '4N+', '2N-', '1Cl', '2S', '6S', '4S', '3S+', '5S+', '1S-', '1F', '1I', '5I', '2I+', '1Br', '5P', '3P', '4P+', '2Se', '6Se', '4Se', '3Se+', '4Si', '3B', '4B-', '5As', '3As', '4As+', '2Te', '4Te', '3Te+'), max_len=80, n_frags=4)[source]

Bases: Vocabulary

decode(matrix)[source]
defaultWords = ('2O', '3O+', '1O-', '4C', '3C+', '3C-', '3N', '4N+', '2N-', '1Cl', '2S', '6S', '4S', '3S+', '5S+', '1S-', '1F', '1I', '5I', '2I+', '1Br', '5P', '3P', '4P+', '2Se', '6Se', '4Se', '3Se+', '4Si', '3B', '4B-', '5As', '3As', '4As+', '2Te', '4Te', '3Te+')
encode(smiles, subs=None)[source]
static fromDataFrame(df, word_col='Word', max_len=80, n_frags=4)[source]
static fromFile(path, word_col='Word', max_len=80, n_frags=4)[source]
get_atom_tk(atom)[source]
static parseWord(word)[source]
toDataFrame()[source]
toFile(path)[source]
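
The default words above each appear to consist of a leading number, an element symbol and an optional charge sign. The sketch below splits a word into those three parts. Reading the number as a valence is an assumption, and parse_word is a hypothetical helper that does not reproduce the behaviour or return type of VocGraph.parseWord.

```python
import re

# Leading digits, an element symbol (one or two letters), optional sign.
WORD_RE = re.compile(r"^(\d+)([A-Z][a-z]?)([+-]?)$")


def parse_word(word):
    """Split a word such as '3O+' into (number, symbol, charge)."""
    match = WORD_RE.match(word)
    if match is None:
        raise ValueError(f"unrecognised word: {word!r}")
    digits, symbol, sign = match.groups()
    charge = {"": 0, "+": 1, "-": -1}[sign]
    return int(digits), symbol, charge


parsed = [parse_word(word) for word in ("2O", "3O+", "1Cl", "4B-")]
print(parsed)
```
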
class drugex.data.corpus.vocabulary.VocNonGPT(words, src_len=1000, trg_len=100, max_len=100, min_len=10)[source]

Bases: VocSmiles

Modified version of VocSmiles adjusted for the legacy sequence models (Seq2Seq and EncDec).

decode(matrix, is_smiles=True, is_tk=False)[source]

Takes an array of indices and returns the corresponding SMILES.

encode(input, is_smiles=True)[source]

Takes a list of characters (e.g. ‘[NH]’) and encodes it to an array of indices.

static fromFile(path, src_len=1000, trg_len=100, max_len=100, min_len=10)[source]

Takes a file containing separated characters to initialize the vocabulary

class drugex.data.corpus.vocabulary.VocSmiles(encode_frags, words=('#', '%', '(', ')', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', 'B', 'C', 'F', 'I', 'L', 'N', 'O', 'P', 'R', 'S', '[Ag-3]', '[As+]', '[As]', '[B-]', '[BH-]', '[BH2-]', '[BH3-]', '[B]', '[C+]', '[C-]', '[CH-]', '[CH2]', '[CH]', '[I+]', '[IH2]', '[N+]', '[N-]', '[NH+]', '[NH-]', '[NH2+]', '[N]', '[O+]', '[O-]', '[OH+]', '[O]', '[P+]', '[PH]', '[S+]', '[S-]', '[SH+]', '[SH2]', '[SH]', '[Se+]', '[SeH]', '[Se]', '[SiH2]', '[SiH]', '[Si]', '[Te]', '[b-]', '[c+]', '[c-]', '[cH-]', '[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[se+]', '[se]', '[te+]', '[te]', 'b', 'c', 'n', 'o', 'p', 's'), max_len=100, min_len=10)[source]

Bases: SequenceVocabulary

The class for handling encoding/decoding from SMILES to an array of indices for the main SMILES-based models (GPT2Model and RNN)

calc_voc_fp(smiles, prefix=None)[source]
decode(tensor, is_tk=True, is_smiles=True)[source]

Takes an array of indices and returns the corresponding SMILES.

Parameters:

tensor (torch.LongTensor) – a long tensor containing all the indices of given tokens.

Returns:

smiles (str) – a decoded SMILES sequence.

defaultWords = ('#', '%', '(', ')', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', 'B', 'C', 'F', 'I', 'L', 'N', 'O', 'P', 'R', 'S', '[Ag-3]', '[As+]', '[As]', '[B-]', '[BH-]', '[BH2-]', '[BH3-]', '[B]', '[C+]', '[C-]', '[CH-]', '[CH2]', '[CH]', '[I+]', '[IH2]', '[N+]', '[N-]', '[NH+]', '[NH-]', '[NH2+]', '[N]', '[O+]', '[O-]', '[OH+]', '[O]', '[P+]', '[PH]', '[S+]', '[S-]', '[SH+]', '[SH2]', '[SH]', '[Se+]', '[SeH]', '[Se]', '[SiH2]', '[SiH]', '[Si]', '[Te]', '[b-]', '[c+]', '[c-]', '[cH-]', '[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[se+]', '[se]', '[te+]', '[te]', 'b', 'c', 'n', 'o', 'p', 's')
encode(tokens, frags=None)[source]

Takes a list of tokens (e.g. ‘[NH]’) and encodes it to an array of indices.

Parameters:

tokens – a SMILES sequence represented as a list of tokens.

Returns:

output (torch.LongTensor) – a long tensor containing all the indices of given tokens.

static fromFile(path, encode_frags, min_len=10, max_len=100)[source]

Takes a file containing separated characters to initialize the vocabulary

parseDecoded(smiles)[source]
splitSequence(smile)[source]

Takes a SMILES string and returns a list of characters/tokens.

Parameters:

smile (str) – a decoded SMILES sequence.

Returns:

tokens (List) – a list of tokens decoded from the SMILES sequence.
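
A minimal sketch of this kind of tokenization is shown below. It keeps bracket atoms such as ‘[NH+]’ as single tokens and splits everything else into single characters. The default words contain ‘L’ and ‘R’ but no ‘Cl’ or ‘Br’, so the sketch assumes that the two-letter halogens are collapsed to those letters first. That assumption, the split_sequence helper and its regular expression are ours and are not taken from the VocSmiles source.

```python
import re

# A bracket atom is everything from '[' up to the matching ']'.
BRACKET_RE = re.compile(r"(\[[^\]]*\])")


def split_sequence(smile):
    """Split a SMILES string into bracket atoms and single characters."""
    # Assumption: 'Cl' and 'Br' are written as 'L' and 'R'. This naive
    # replacement would also rewrite halogens inside brackets.
    smile = smile.replace("Cl", "L").replace("Br", "R")
    tokens = []
    for chunk in BRACKET_RE.split(smile):
        if chunk.startswith("["):
            tokens.append(chunk)  # keep the bracket atom whole
        else:
            tokens.extend(chunk)  # one token per remaining character
    return tokens


print(split_sequence("C[NH+]=O"))
print(split_sequence("ClCBr"))
```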

Module contents

This package contains classes that are necessary to encode data for modeling molecules as they are (i.e. without fragmenting).

Created by: Martin Sicho On: 26.04.22, 13:12