drugex.training.generators package

Submodules

drugex.training.generators.graph_transformer module

class drugex.training.generators.graph_transformer.AtomLayer(d_model=512, n_head=8, d_inner=1024, n_layer=12)[source]

Bases: Module

forward(x: Tensor, key_mask=None, attn_mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
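The note above can be illustrated without PyTorch. The following is a minimal sketch using a hypothetical `ToyModule` class (a stand-in, not `torch.nn.Module`) whose `__call__` runs registered hooks after `forward`. It only mimics the behaviour the note describes.

```python
class ToyModule:
    """Toy stand-in for a module with forward hooks (hypothetical, not torch.nn.Module)."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        # Hooks are called as hook(module, inputs, output) after forward().
        self._forward_hooks.append(hook)

    def forward(self, x):
        return x * 2

    def __call__(self, *args):
        output = self.forward(*args)
        for hook in self._forward_hooks:
            hook(self, args, output)
        return output


seen = []
layer = ToyModule()
layer.register_forward_hook(lambda module, inputs, output: seen.append(output))

layer(3)          # calling the instance runs forward() and then the hook
layer.forward(3)  # calling forward() directly skips the hook
```

After both calls, `seen` holds a single entry, because only the call through the instance triggered the hook.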

class drugex.training.generators.graph_transformer.Block(d_model, n_head, d_inner)[source]

Bases: Module

forward(x, key_mask=None, attn_mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.graph_transformer.GraphTransformer(voc_trg, d_emb=512, d_model=512, n_head=8, d_inner=1024, n_layer=12, pad_idx=0, device=device(type='cuda'), use_gpus=(0,))[source]

Bases: FragGenerator

Graph Transformer for molecule generation from fragments

attachToGPUs(gpus)[source]

Attach model to GPUs

Parameters:

gpus: tuple

A tuple of GPU ids to use

Returns:

None

decodeLoaders(src, trg)[source]

forward(src, is_train=False)[source]

Forward pass

Parameters:

src: torch.Tensor

Input tensor of shape [batch_size, 80, 5] (transpose of the encoded graphs as drawn in the paper)

is_train: bool

Whether the model is in training mode

Returns:

TODO: fill outputs

init_states()[source]

Initialize model parameters

Notes:

Xavier initialization for all parameters except for the embedding layer
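For reference, Xavier (Glorot) uniform initialization draws each weight from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below assumes the uniform variant and uses a hypothetical rule that skips parameters whose name contains "emb". The documentation only states that the embedding layer is excluded, so the helper names and the name-based filter are illustrative assumptions.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of the U(-a, a) distribution used by Xavier uniform initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def init_weights(shapes, skip_substring="emb", seed=0):
    """Fill each (fan_out, fan_in) weight matrix, skipping names that contain skip_substring."""
    rng = random.Random(seed)
    weights = {}
    for name, (fan_out, fan_in) in shapes.items():
        if skip_substring in name:
            continue  # assumption: embedding weights keep their default initialization
        a = xavier_uniform_bound(fan_in, fan_out)
        weights[name] = [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]
    return weights


# Hypothetical parameter names and (fan_out, fan_in) shapes, kept small for illustration.
shapes = {"emb_word.weight": (10, 4), "linear.weight": (8, 4)}
weights = init_weights(shapes)
```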

iterLoader(loader)[source]

loaderFromFrags(frags, batch_size=32, n_proc=1)[source]

Encode the input fragments and create a dataloader object

Parameters:

frags: list

A list of input fragments (in SMILES format)

batch_size: int

Batch size for the dataloader

n_proc: int

Number of processes to use for encoding the fragments

Returns:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the input fragments

sample(loader)[source]

Sample SMILES from the network

Parameters:

loader (torch.utils.data.DataLoader) – The data loader for the input fragments

Returns:

  • smiles (list) – List of SMILES

  • frags (list) – List of fragments

trainNet(loader, epoch, epochs)[source]

Train the network for one epoch

Parameters:
  • loader (torch.utils.data.DataLoader) – The data loader for the training set

  • epoch (int) – The current epoch

  • epochs (int) – The total number of epochs

Returns:

loss – The training loss of the epoch

Return type:

float

validateNet(loader, evaluator=None, no_multifrag_smiles=True, n_samples=None)[source]

Validate the network

Parameters:
  • loader (torch.utils.data.DataLoader) – A dataloader object to iterate over the validation data

  • evaluator (Evaluator) – An evaluator object to evaluate the generated SMILES

  • no_multifrag_smiles (bool) – If True, only single-fragment SMILES are considered valid

Returns:

  • valid_metrics (dict) – Dictionary containing the validation metrics

  • scores (pandas.DataFrame) – DataFrame containing the SMILES, the fragments, and the scores for each SMILES

Notes

The validation metrics are:
  • valid_ratio: the ratio of valid SMILES

  • accurate_ratio: the ratio of SMILES that are valid and have the desired fragments

  • loss_valid: the validation loss
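As a rough illustration of how the two ratios relate, the sketch below computes them from per-molecule flags. The function name and its inputs are hypothetical: whether a SMILES parses and whether it contains the desired fragments would come from a chemistry toolkit, so they are passed in as plain booleans here. The multi-fragment check relies on the SMILES convention that "." separates disconnected components. The list is assumed to be non-empty.

```python
def validation_ratios(smiles, is_parsable, has_frags, no_multifrag_smiles=True):
    """Compute valid_ratio and accurate_ratio from per-molecule flags.

    is_parsable[i]: whether smiles[i] was accepted by a chemistry toolkit (supplied by the caller).
    has_frags[i]:   whether smiles[i] contains the desired input fragments.
    """
    n = len(smiles)
    valid = []
    for smi, ok in zip(smiles, is_parsable):
        # In SMILES, "." separates disconnected components (multiple fragments).
        if no_multifrag_smiles and "." in smi:
            ok = False
        valid.append(ok)
    # A molecule counts as accurate only if it is valid and has the desired fragments.
    accurate = [v and f for v, f in zip(valid, has_frags)]
    return {
        "valid_ratio": sum(valid) / n,
        "accurate_ratio": sum(accurate) / n,
    }


metrics = validation_ratios(
    smiles=["CCO", "CCO.Cl", "c1ccccc1O", "C1CC"],
    is_parsable=[True, True, True, False],
    has_frags=[True, True, False, False],
)
```

By construction, `accurate_ratio` can never exceed `valid_ratio`.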

drugex.training.generators.interfaces module

interfaces

Created by: Martin Sicho On: 01.06.22, 11:29

class drugex.training.generators.interfaces.FragGenerator(device=device(type='cuda'), use_gpus=(0,))[source]

Bases: Generator

A generator for fragment-based molecules.

attachToGPUs(gpus)[source]

Attach model to GPUs

Parameters:

gpus: tuple

A tuple of GPU ids to use

abstract decodeLoaders(src, trg)[source]

generate(input_frags: List[str] | None = None, input_dataset: DataSet | None = None, num_samples=100, batch_size=32, n_proc=1, keep_frags=True, drop_duplicates=True, drop_invalid=True, evaluator=None, no_multifrag_smiles=True, drop_undesired=False, raw_scores=True, progress=True, tqdm_kwargs={})[source]

Generate SMILES from either a list of input fragments (input_frags) or a dataset object directly (input_dataset). You have to specify either one or the other. Various other options are available to filter, score and show generation progress (see below).

Parameters:
  • input_frags (list) – a list of input fragments to incorporate in the generated molecules (given as molecules in SMILES format)

  • input_dataset (GraphFragDataSet) – a GraphFragDataSet object to use to provide the input fragments

  • num_samples – the number of SMILES to generate, default is 100

  • batch_size – the batch size to use for generation, default is 32

  • n_proc – the number of processes to use for encoding the fragments if input_frags is provided, default is 1

  • keep_frags – if True, the fragments are kept in the generated SMILES, default is True

  • drop_duplicates – if True, duplicate SMILES are dropped, default is True

  • drop_invalid – if True, invalid SMILES are dropped, default is True

  • evaluator (Environment) – an Environment object to score the generated SMILES against; if None, no scoring is performed. Required if drop_undesired is True. Default is None

  • no_multifrag_smiles – if True, only single-fragment SMILES are considered valid, default is True

  • drop_undesired – if True, SMILES that do not contain the desired fragments are dropped, default is False

  • raw_scores – if True and an evaluator is specified, raw scores (without modifiers) are calculated; these values are also used for filtering if drop_undesired is True. Default is True

  • progress – if True, a progress bar is shown, default is True

  • tqdm_kwargs – keyword arguments to pass to the tqdm progress bar, default is an empty dict

Returns:

df_smiles – a DataFrame with the generated molecules (and their scores)

Return type:

DataFrame

init_states()[source]

Initialize model parameters

Notes:

Xavier initialization for all parameters except for the embedding layer

abstract iterLoader(loader)[source]

abstract loaderFromFrags(frags, batch_size=32, n_proc=1)[source]

Encode the input fragments and create a dataloader object

Parameters:

frags: list

A list of input fragments (in SMILES format)

batch_size: int

Batch size for the dataloader

n_proc: int

Number of processes to use for encoding the fragments

Returns:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the input fragments

class drugex.training.generators.interfaces.Generator(device=device(type='cuda'), use_gpus=(0,))[source]

Bases: Model, ABC

The base generator class for fitting and evaluating a DrugEx generator.

evaluate(smiles: List[str], frags: List[str] | None = None, evaluator=None, no_multifrag_smiles: bool = True, unmodified_scores: bool = False)[source]

Evaluate molecules by using the given evaluator or checking for validity.

Parameters:

smiles: List

List of SMILES to evaluate

frags: List

List of fragments used to generate the SMILES

evaluator: Environment

An Environment instance used to evaluate the molecules

no_multifrag_smiles: bool

If True, only single-fragment SMILES are considered valid

unmodified_scores: bool

If True, the scores are not modified by the evaluator

Returns:

scores – A DataFrame with the scores for each molecule

Return type:

DataFrame

filterNewMolecules(df_old, df_new, with_frags=True, drop_duplicates=True, drop_undesired=True, evaluator=None, no_multifrag_smiles=True)[source]

Filter the generated SMILES

Parameters:

smiles: list

A list of previous SMILES

new_smiles: list

A list of additional generated SMILES

frags: list

A list of additional input fragments

drop_duplicates: bool

If True, duplicate SMILES are dropped

drop_undesired: bool

If True, SMILES that do not fulfill the desired objectives are dropped

evaluator: Evaluator

An evaluator object to evaluate the generated SMILES

no_multifrag_smiles: bool

If True, only single-fragment SMILES are considered valid

Returns:

new_smiles: list

A list of filtered SMILES

new_frags: list

A list of filtered input fragments
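The filtering idea can be sketched on plain lists. The helper below is hypothetical and deliberately simpler than the method documented here: it works on lists rather than DataFrames, compares SMILES as raw strings (no canonicalization), and does not model `drop_undesired` or the evaluator.

```python
def filter_new(old_smiles, new_smiles, new_frags, drop_duplicates=True, no_multifrag_smiles=True):
    """Return the (smiles, frags) pairs from the new batch that pass the filters."""
    seen = set(old_smiles)
    kept_smiles, kept_frags = [], []
    for smi, frag in zip(new_smiles, new_frags):
        if no_multifrag_smiles and "." in smi:
            continue  # "." marks disconnected components in SMILES
        if drop_duplicates:
            if smi in seen:
                continue  # already present in the old set or earlier in this batch
            seen.add(smi)
        kept_smiles.append(smi)
        kept_frags.append(frag)
    return kept_smiles, kept_frags


smiles, frags = filter_new(
    old_smiles=["CCO"],
    new_smiles=["CCO", "CCN", "CCN", "CC.O"],
    new_frags=["CC", "CC", "CC", "CC"],
)
```

Each kept SMILES stays paired with the fragment it was generated from, which is why the two lists are filtered together.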

fit(train_loader, valid_loader, epochs=100, patience=50, evaluator=None, monitor=None, no_multifrag_smiles=True)[source]

Fit the generator.

Parameters:
  • train_loader (DataLoader) – a DataLoader instance to use for training

  • valid_loader (DataLoader) – a DataLoader instance to use for validation

  • epochs (int) – the number of epochs to train for

  • patience (int) – the number of epochs to wait for improvement before early stopping

  • evaluator (ModelEvaluator) – a ModelEvaluator instance to use for validation. TODO: maybe the evaluator should be hard-coded to None here, since it is not needed during PT/FT training

  • monitor (Monitor) – a Monitor instance to use for saving the model and performance info

  • no_multifrag_smiles (bool) – if True, only single-fragment SMILES are considered valid
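The interaction between `epochs` and `patience` can be sketched as follows. This is an assumption-laden illustration, not the actual `fit` logic: it assumes a lower validation loss is better and that training stops once `patience` epochs have passed since the best epoch. The function name is hypothetical.

```python
def epochs_until_stop(valid_losses, epochs=100, patience=50):
    """Return (last_epoch_run, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(valid_losses[:epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return last_epoch, best_epoch


last_epoch, best_epoch = epochs_until_stop([1.0, 0.8, 0.9, 0.85, 0.81, 0.7], patience=3)
```

With `patience=3` the loop stops before it reaches the final, lower loss, which is the trade-off a small patience value makes.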

abstract generate(*args, **kwargs)[source]

Generate molecules from the generator.

Returns:

df_smiles – a DataFrame with the generated molecules (and their scores)

Return type:

DataFrame

getModel()[source]

Return a copy of this model as a state dictionary.

Returns:

model – A serializable copy of this model as a state dictionary

Return type:

dict

logPerformanceAndCompounds(epoch, metrics, scores)[source]

Log performance and compounds

Parameters:

epoch: int

The current epoch

metrics: dict

A dictionary with the performance metrics

scores: DataFrame

A DataFrame with generated molecules and their scores

abstract sample(*args, **kwargs)[source]

Sample molecules from the generator.

Returns:

  • smiles (List) – List of SMILES strings

  • frags (List, optional) – List of fragments used to generate the molecules

abstract trainNet(loader, epoch, epochs)[source]

Train the generator for a single epoch.

Parameters:
  • loader (DataLoader) – a DataLoader instance to use for training

  • epoch (int) – the current epoch

  • epochs (int) – the total number of epochs

abstract validateNet(loader=None, evaluator=None, no_multifrag_smiles=True, n_samples=None)[source]

Validate the performance of the generator.

Parameters:
  • loader (DataLoader) – a DataLoader instance to use for validation.

  • evaluator (ModelEvaluator) – a ModelEvaluator instance to use for validation

  • no_multifrag_smiles (bool) – if True, only single-fragment SMILES are considered valid

  • n_samples (int) – the number of samples to use for validation. Not used by transformers.

Returns:

  • valid_metrics (dict) – a dictionary with the validation metrics

  • smiles_scores (DataFrame) – a DataFrame with the scores for each molecule

drugex.training.generators.sequence_rnn module

class drugex.training.generators.sequence_rnn.SequenceRNN(voc, embed_size=128, hidden_size=512, is_lstm=True, lr=0.001, device=device(type='cuda'), use_gpus=(0,))[source]

Bases: Generator

Sequence RNN model for molecule generation.

attachToGPUs(gpus)[source]

This model currently uses only one GPU. Therefore, only the first one from the list will be used.

Parameters:

gpus: tuple

A tuple of GPU indices.

Returns:

None

evolve(batch_size, epsilon=0.01, crover=None, mutate=None)[source]

Evolve a SMILES from the model by sequential addition of tokens.

Parameters:

batch_size: int

Batch size.

epsilon: float

Probability of using the mutate network to generate the next token.

crover: drugex.models.Crover

Crover network.

mutate: drugex.models.Mutate

Mutate network.

Returns:

TODO: check if output SMILES are still encoded
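The role of `epsilon` can be sketched in isolation. The helpers and the two fixed distributions below are hypothetical and only show the switching step: at each token, the mutate network's distribution is used with probability `epsilon`, otherwise the main network's. The crover network is not modelled here.

```python
import random


def pick_distribution(base_proba, mutate_proba, epsilon, rng):
    """With probability epsilon use the mutate network's token distribution, else the base one."""
    return mutate_proba if rng.random() < epsilon else base_proba


def sample_token(proba, rng):
    """Draw one token index from a categorical distribution."""
    return rng.choices(range(len(proba)), weights=proba, k=1)[0]


rng = random.Random(0)
base = [1.0, 0.0, 0.0]    # hypothetical output of the main network: always token 0
mutate = [0.0, 0.0, 1.0]  # hypothetical output of the mutate network: always token 2

# epsilon = 0.0 never consults the mutate network, epsilon = 1.0 always does.
tokens_never = [sample_token(pick_distribution(base, mutate, 0.0, rng), rng) for _ in range(5)]
tokens_always = [sample_token(pick_distribution(base, mutate, 1.0, rng), rng) for _ in range(5)]
```

With the default `epsilon=0.01`, roughly one token in a hundred would come from the mutate network under this reading.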

forward(input, h)[source]

Forward pass of the model.

Parameters:

input: torch.Tensor

Input tensor of shape (batch_size, 1).

h: torch.Tensor

Hidden state tensor of shape (num_layers, batch_size, hidden_size). TODO: verify the shape of h.

Returns:

TODO: fill outputs

generate(num_samples=100, batch_size=32, n_proc=1, drop_duplicates=True, drop_invalid=True, evaluator=None, no_multifrag_smiles=True, drop_undesired=False, raw_scores=True, progress=True, tqdm_kwargs={})[source]

Generate molecules from the generator.

Returns:

df_smiles – a DataFrame with the generated molecules (and their scores)

Return type:

DataFrame

init_h(batch_size, labels=None)[source]

Initialize hidden state of the model.

Hidden state is initialized with random values. If labels are provided, the first hidden state will be set to the labels.

Parameters:

batch_size: int

Batch size.

labels: torch.Tensor

Labels tensor of shape (batch_size, 1).

Returns:

TODO: fill outputs

likelihood(target)[source]

Calculate the likelihood of the target sequence.

Parameters:

target: torch.Tensor

Target tensor of shape (batch_size, seq_len).

Returns:

scores: torch.Tensor

Scores tensor of shape (batch_size, seq_len).

sample(batch_size)[source]

Sample a SMILES from the model.

Parameters:

batch_size: int

Batch size.

Returns:

smiles: list

List of SMILES.

trainNet(loader, epoch, epochs)[source]

Train the RNN network for one epoch

Parameters:

loader: torch.utils.data.DataLoader

The data loader for the training set

epoch: int

The current epoch

epochs: int

The total number of epochs

Returns:

loss – The training loss of the epoch

Return type:

float

validateNet(loader=None, evaluator=None, no_multifrag_smiles=True, n_samples=128)[source]

Validate the network

Parameters:
  • loader (torch.utils.data.DataLoader) – A dataloader object to iterate over the validation data to compute the validation loss

  • evaluator (Evaluator) – An evaluator object to evaluate the generated SMILES

  • no_multifrag_smiles (bool) – If True, only single-fragment SMILES are considered valid

  • n_samples (int) – The number of SMILES to sample from the model

Returns:

  • valid_metrics (dict) – Dictionary containing the validation metrics

  • scores (pandas.DataFrame) – DataFrame containing the SMILES, the fragments, and the scores for each SMILES

Notes

The validation metrics are:
  • valid_ratio: the ratio of valid SMILES

  • accurate_ratio: the ratio of SMILES that are valid and have the desired fragments

  • loss_valid: the validation loss

drugex.training.generators.sequence_transformer module

class drugex.training.generators.sequence_transformer.Block(d_model, n_head, d_inner)[source]

Bases: Module

forward(x, key_mask=None, atn_mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.sequence_transformer.GPT2Layer(voc, d_emb=512, d_model=512, n_head=12, d_inner=1024, n_layer=12, pad_idx=0)[source]

Bases: Module

forward(input: Tensor, key_mask=None, atn_mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.sequence_transformer.SequenceTransformer(voc_trg, d_emb=512, d_model=512, n_head=8, d_inner=1024, n_layer=12, pad_idx=0, device=device(type='cuda'), use_gpus=(0,))[source]

Bases: FragGenerator

Sequence Transformer for molecule generation from fragments

decodeLoaders(src, trg)[source]

forward(src, trg=None)[source]

Forward pass of the model

Parameters:

src: torch.Tensor

Source tensor of shape [batch_size, 200]. TODO: check that the shape is correct.

trg: torch.Tensor

Target tensor of shape [batch_size, 200]

Returns:

TODO: fill outputs

iterLoader(loader)[source]

loaderFromFrags(frags, batch_size=32, n_proc=1)[source]

Encode the input fragments and create a dataloader object

Parameters:

frags: list

A list of input fragments (in SMILES format)

batch_size: int

Batch size for the dataloader

n_proc: int

Number of processes to use for encoding the fragments

Returns:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the input fragments

sample(loader)[source]

Sample SMILES from the model

Parameters:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the input fragments

Returns:

smiles: list

A list of sampled SMILES

frags: list

A list of input fragments

trainNet(loader, epoch, epochs)[source]

Train the model for one epoch

Parameters:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the training data

epoch: int

Current epoch number

epochs: int

Total number of epochs

Returns:

loss: float

The loss value for the current epoch

validateNet(loader, evaluator=None, no_multifrag_smiles=True, n_samples=None)[source]

Validate the model

Parameters:

loader: torch.utils.data.DataLoader

A dataloader object to iterate over the validation data

evaluator: Evaluator

An evaluator object to evaluate the generated SMILES

no_multifrag_smiles: bool

If True, only single-fragment SMILES are considered valid

Returns:

valid_metrics: dict

A dictionary containing the validation metrics

scores: pandas.DataFrame

DataFrame containing the SMILES, the fragments, and the scores for each SMILES

Notes:

The validation metrics are:
  • valid_ratio: ratio of valid SMILES

  • accurate_ratio: ratio of SMILES that are valid and have the desired fragments

  • loss_valid: loss on the validation set

drugex.training.generators.utils module

Define the Layers

class drugex.training.generators.utils.PositionalEmbedding(d_model: int, max_len=100, batch_first=False)[source]

Bases: Module

Positional embedding for sequence transformer

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.utils.PositionalEncoding(d_model: int, max_len=100, batch_first=False)[source]

Bases: Module

Positional encoding for graph transformer

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.utils.PositionwiseFeedForward(d_in, d_hid)[source]

Bases: Module

A two-feed-forward-layer module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class drugex.training.generators.utils.SublayerConnection(size, dropout=0.1)[source]

Bases: Module

A residual connection followed by a layer norm

forward(x, sublayer)[source]

Apply residual connection to any sublayer with the same size

drugex.training.generators.utils.pad_mask(seq, pad_idx=0)[source]

drugex.training.generators.utils.tri_mask(seq, diag=1)[source]

For masking out the subsequent info.
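To make the two mask helpers concrete, here is a pure-Python sketch on nested lists. The `toy_` functions are hypothetical stand-ins, not the library implementations. The sketch assumes that True means "masked out" and that `diag` behaves like the diagonal offset of an upper-triangular matrix; neither convention is confirmed by the documentation.

```python
def toy_pad_mask(seq, pad_idx=0):
    """True where a position holds the padding index."""
    return [[token == pad_idx for token in row] for row in seq]


def toy_tri_mask(seq_len, diag=1):
    """True at (i, j) when j - i >= diag, i.e. position j lies ahead of position i."""
    return [[j - i >= diag for j in range(seq_len)] for i in range(seq_len)]


padding = toy_pad_mask([[5, 3, 0, 0]])
future = toy_tri_mask(3)
```

Under these assumptions, row i of `future` hides every position after i, so each token can only attend to itself and to earlier tokens.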

drugex.training.generators.utils.unique(arr)[source]

Module contents