qsprpred.extra.data.utils package

Subpackages

qsprpred.extra.data.utils.testing package

Submodules

qsprpred.extra.data.utils.msa_calculator module

Various implementations of multiple sequence alignment (MSA).

The MSA providers are used to align sequences for protein descriptor calculation. This is required for the calculation of descriptors that are based on sequence alignments, such as ProDec.

class qsprpred.extra.data.utils.msa_calculator.BioPythonMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: MSAProvider, JSONSerializable, ABC

Common functionality for MSA providers using BioPython command line wrappers.

Variables:

outDir – directory to save the alignment to
fname – file name of the alignment file
cache – cache of alignments performed so far by the provider

Initializes the MSA provider.

Parameters:

out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file

checkTool() → bool

Check if the alignment tool is installed.
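One common way to perform a check like this is to look the executable up on the PATH. The sketch below is an assumption about how such a check could work, not the library's implementation; `tool_available` is a hypothetical helper.

```python
import shutil


def tool_available(executable: str) -> bool:
    """Return True if `executable` can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent
    return shutil.which(executable) is not None


# The result depends on the local machine, so no output is shown here
print(tool_available("mafft"))
```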

abstract property cmd: str

The command that runs the alignment algorithm.

Returns: the command to run the alignment algorithm
Return type: str

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns: the current alignment, or None if it has not been calculated yet
Return type: dict[str, str] | None
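To make the shape of this dictionary concrete, the following minimal sketch builds an alignment by hand and checks the equal-length property described above. The IDs, sequences, and the helper `is_valid_alignment` are hypothetical and not part of the package.

```python
from __future__ import annotations


def is_valid_alignment(alignment: dict[str, str] | None) -> bool:
    """Return True if the alignment is non-empty and all rows have equal length."""
    if not alignment:
        return False
    lengths = {len(seq) for seq in alignment.values()}
    return len(lengths) == 1


# Hypothetical sequence IDs mapped to gapped sequences of equal length
example = {
    "P00001": "MKT-AYIAK",
    "P00002": "MKTQAY-AK",
}
print(is_valid_alignment(example))
```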

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters: filename (str) – path to the JSON file
Returns: new instance of the class
Return type: object

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters: json (str) – JSON string of the object
Returns: reconstructed object
Return type: object

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters: target_ids (list[str]) – list of sequence IDs to get the alignment for
Returns: the alignment if it exists in the cache, None otherwise
Return type: dict[str, str] | None

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters: sequences – the original dictionary of sequences that were aligned
Returns: the aligned sequences mapped to their IDs
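As an illustration of what parsing an aligned FASTA file involves, here is a dependency-free sketch. It assumes FASTA-formatted output and a hypothetical function `parse_fasta_alignment`; the actual class may rely on BioPython's parsers instead.

```python
from __future__ import annotations

import io


def parse_fasta_alignment(handle) -> dict[str, str]:
    """Read aligned FASTA records into a dict of ID -> gapped sequence."""
    records: dict[str, list[str]] = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequences may be wrapped over several lines
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical aligner output with wrapped sequence lines
fasta_text = ">seq1\nMKT-AY\nIAK\n>seq2\nMKTQAY\n-AK\n"
aligned = parse_fasta_alignment(io.StringIO(fasta_text))
print(aligned)
```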

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:

sequences (dict[str,str]) – sequences to align
**kwargs – metadata to be stored with the alignment

Returns:

sequences_path (str) – path to the file with the sequences
n_sequences (int) – number of sequences in the file

Return type:

tuple[str, int]
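The step of saving sequences to an input file can be sketched as follows. This is a simplified stand-in that writes plain FASTA and ignores the metadata keyword arguments; `write_fasta` and the file name `sequences.fasta` are assumptions made for illustration.

```python
from __future__ import annotations

import os
import tempfile


def write_fasta(
    sequences: dict[str, str], out_dir: str, fname: str = "sequences.fasta"
) -> tuple[str, int]:
    """Write sequences in FASTA format and return (path, number of sequences)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, fname)
    with open(path, "w", encoding="utf-8") as handle:
        for seq_id, seq in sequences.items():
            handle.write(f">{seq_id}\n{seq}\n")
    return path, len(sequences)


# Hypothetical input sequences written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    sequences_path, n_sequences = write_fasta(
        {"seq1": "MKTAYIAK", "seq2": "MKTQAYAK"}, tmp
    )
    with open(sequences_path, encoding="utf-8") as handle:
        written = handle.read()
```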

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to save the alignment for
alignment (dict[str, str]) – the alignment to save
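The cache lookup and store operations can be pictured with a small dictionary-backed class. The key scheme below (sorted IDs joined with "~") is an assumption chosen for this sketch, and `AlignmentCache` is a hypothetical name; the library's own keying may differ.

```python
from __future__ import annotations


class AlignmentCache:
    """Toy cache mapping a set of sequence IDs to an alignment."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = {}

    @staticmethod
    def _key(target_ids: list[str]) -> str:
        # Sorting makes the key independent of ID order (a design choice of this sketch)
        return "~".join(sorted(target_ids))

    def save(self, target_ids: list[str], alignment: dict[str, str]) -> None:
        # Store a copy so later changes by the caller do not affect the cache
        self._store[self._key(target_ids)] = dict(alignment)

    def get(self, target_ids: list[str]) -> dict[str, str] | None:
        # Returns None when no alignment was stored for these IDs
        return self._store.get(self._key(target_ids))


cache = AlignmentCache()
cache.save(["seq2", "seq1"], {"seq1": "MK-", "seq2": "MKT"})
hit = cache.get(["seq1", "seq2"])
miss = cache.get(["seq1"])
```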

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters: filename (str) – filename to save object to
Returns: absolute path to the saved JSON file of the object
Return type: str

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns: JSON string of the object
Return type: str
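The round trip implied by the four serialization methods above can be sketched with the standard `json` module. `JSONBackedProvider` and its snake_case methods are hypothetical stand-ins; the real classes may serialize their state differently.

```python
import json
import os
import tempfile


class JSONBackedProvider:
    """Hypothetical stand-in for a JSON-serializable provider."""

    def __init__(self, out_dir=".", fname="alignment.aln-fasta.fasta"):
        self.out_dir = out_dir
        self.fname = fname
        self.cache = {}

    def to_json(self):
        # Collect everything needed to rebuild the object
        return json.dumps(
            {"out_dir": self.out_dir, "fname": self.fname, "cache": self.cache}
        )

    @classmethod
    def from_json(cls, text):
        state = json.loads(text)
        obj = cls(out_dir=state["out_dir"], fname=state["fname"])
        obj.cache = state["cache"]
        return obj

    def to_file(self, filename):
        with open(filename, "w", encoding="utf-8") as handle:
            handle.write(self.to_json())
        # Return the absolute path, mirroring the documented return value
        return os.path.abspath(filename)

    @classmethod
    def from_file(cls, filename):
        with open(filename, encoding="utf-8") as handle:
            return cls.from_json(handle.read())


provider = JSONBackedProvider(out_dir="alignments")
provider.cache["seq1~seq2"] = {"seq1": "MK-", "seq2": "MKT"}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = provider.to_file(os.path.join(tmp, "provider.json"))
    restored = JSONBackedProvider.from_file(saved_path)
```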

class qsprpred.extra.data.utils.msa_calculator.ClustalMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: BioPythonMSA

Multiple sequence alignment provider using the Clustal Omega Linux program - http://www.clustal.org/omega/

Uses the BioPython wrapper for Clustal Omega - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.ClustalOmegaCommandline

Initializes the MSA provider.

Parameters:

out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file

checkTool() → bool: Check if the MAFFT tool is installed

property cmd: str

The command that runs the alignment algorithm.

Returns: the command to run the alignment algorithm
Return type: str

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns: the current alignment, or None if it has not been calculated yet
Return type: dict[str, str] | None

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters: filename (str) – path to the JSON file
Returns: new instance of the class
Return type: object

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters: json (str) – JSON string of the object
Returns: reconstructed object
Return type: object

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters: target_ids (list[str]) – list of sequence IDs to get the alignment for
Returns: the alignment if it exists in the cache, None otherwise
Return type: dict[str, str] | None

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters: sequences – the original dictionary of sequences that were aligned
Returns: the aligned sequences mapped to their IDs

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:

sequences (dict[str,str]) – sequences to align
**kwargs – metadata to be stored with the alignment

Returns:

sequences_path (str) – path to the file with the sequences
n_sequences (int) – number of sequences in the file

Return type:

tuple[str, int]

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to save the alignment for
alignment (dict[str, str]) – the alignment to save

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters: filename (str) – filename to save object to
Returns: absolute path to the saved JSON file of the object
Return type: str

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns: JSON string of the object
Return type: str

class qsprpred.extra.data.utils.msa_calculator.MAFFT(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: BioPythonMSA

Multiple sequence alignment provider using the MAFFT cross-platform program - https://mafft.cbrc.jp/alignment/software/

Uses the BioPython wrapper for MAFFT - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.MafftCommandline

Initializes the MSA provider.

Parameters:

out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file

checkTool() → bool: Check if the MAFFT tool is installed

property cmd: str

The command that runs the alignment algorithm.

Returns: the command to run the alignment algorithm
Return type: str

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns: the current alignment, or None if it has not been calculated yet
Return type: dict[str, str] | None

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters: filename (str) – path to the JSON file
Returns: new instance of the class
Return type: object

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters: json (str) – JSON string of the object
Returns: reconstructed object
Return type: object

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters: target_ids (list[str]) – list of sequence IDs to get the alignment for
Returns: the alignment if it exists in the cache, None otherwise
Return type: dict[str, str] | None

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters: sequences – the original dictionary of sequences that were aligned
Returns: the aligned sequences mapped to their IDs

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:

sequences (dict[str,str]) – sequences to align
**kwargs – metadata to be stored with the alignment

Returns:

sequences_path (str) – path to the file with the sequences
n_sequences (int) – number of sequences in the file

Return type:

tuple[str, int]

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to save the alignment for
alignment (dict[str, str]) – the alignment to save

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters: filename (str) – filename to save object to
Returns: absolute path to the saved JSON file of the object
Return type: str

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns: JSON string of the object
Return type: str

class qsprpred.extra.data.utils.msa_calculator.MSAProvider

Bases: FileSerializable, ABC

Interface for multiple sequence alignment providers.

This interface defines how calculation and storage of multiple sequence alignments (MSAs) is handled.

abstract property current: dict[str, str] | None

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns: the current alignment, or None if it has not been calculated yet
Return type: dict[str, str] | None

abstract classmethod fromFile(filename: str) → object

Reconstruct object from a metafile.

Parameters: filename (str) – filename of the metafile to load object from
Returns: reconstructed object
Return type: object

abstract toFile(filename: str) → str

Serialize object to a metafile. This metafile should contain all data necessary to reconstruct the object.

Parameters: filename (str) – filename to save object to
Returns: absolute path to the saved metafile of the object
Return type: str
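The idea of an abstract provider interface with a concrete implementation can be sketched with Python's `abc` module. Everything here is hypothetical: `AlignmentProvider`, its `align` method, and `PaddingProvider` are illustration names, and padding with trailing gaps is not a real alignment algorithm.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class AlignmentProvider(ABC):
    """Hypothetical interface: exposes the current alignment and a way to compute one."""

    @property
    @abstractmethod
    def current(self) -> dict[str, str] | None:
        """The last computed alignment, or None if nothing was computed yet."""

    @abstractmethod
    def align(self, sequences: dict[str, str]) -> dict[str, str]:
        """Compute and store an alignment for the given sequences."""


class PaddingProvider(AlignmentProvider):
    """Toy provider that pads sequences with trailing gaps to equal length."""

    def __init__(self) -> None:
        self._current = None

    @property
    def current(self) -> dict[str, str] | None:
        return self._current

    def align(self, sequences: dict[str, str]) -> dict[str, str]:
        width = max(len(seq) for seq in sequences.values())
        # str.ljust appends '-' until each sequence reaches the common width
        self._current = {k: seq.ljust(width, "-") for k, seq in sequences.items()}
        return self._current


toy = PaddingProvider()
before = toy.current
after = toy.align({"seq1": "MK", "seq2": "MKTA"})
```

Because `AlignmentProvider` declares abstract members, it cannot be instantiated directly; only subclasses that implement both `current` and `align` can.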
