qsprpred.extra.data.utils package

Subpackages

Submodules

qsprpred.extra.data.utils.msa_calculator module

Various implementations of multiple sequence alignment (MSA).

The MSA providers are used to align sequences for protein descriptor calculation. This is required for the calculation of descriptors that are based on sequence alignments, such as ProDec.

class qsprpred.extra.data.utils.msa_calculator.BioPythonMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: MSAProvider, JSONSerializable, ABC

Common functionality for MSA providers using BioPython command line wrappers.

Variables:
  • outDir – directory to save the alignment to

  • fname – file name of the alignment file

  • cache – cache of alignments performed so far by the provider

Initializes the MSA provider.

Parameters:
  • out_dir (str) – directory to save the alignment to

  • fname (str) – file name of the alignment file

checkTool() → bool

Check if the alignment tool is installed.
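One common way to test whether a command line aligner is installed is to look it up on the PATH. The sketch below is only an illustration using the standard library; the helper name `is_tool_available` is hypothetical and this is not necessarily how checkTool() is implemented.

```python
import shutil


def is_tool_available(executable: str) -> bool:
    """Return True if `executable` can be found on the current PATH."""
    return shutil.which(executable) is not None


# Assumption: "mafft" and "clustalo" are typical executable names for
# MAFFT and Clustal Omega; the provider may use different ones.
for name in ("mafft", "clustalo"):
    print(name, "found" if is_tool_available(name) else "missing")
```

A lookup like this only confirms that the executable exists, not that it runs correctly.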

abstract property cmd: str

The command that runs the alignment algorithm.

Returns:

the command to run the alignment algorithm

Return type:

cmd (str)

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns:

the current alignment

Return type:

alignment (dict[str, str] | None)
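The shape of the alignment dictionary described above can be checked with a few lines of plain Python. This is a minimal sketch: the helper `is_valid_alignment` and the example sequences are made up for illustration and are not part of the package.

```python
from __future__ import annotations


def is_valid_alignment(alignment: dict[str, str] | None) -> bool:
    """Check that an alignment exists and all aligned sequences share one length."""
    if alignment is None:
        # Nothing has been aligned yet.
        return False
    lengths = {len(seq) for seq in alignment.values()}
    return len(lengths) <= 1


# Hypothetical alignment: keys are sequence IDs, values are gapped sequences.
example = {
    "P1": "MKT-AYI",
    "P2": "MKTQAYI",
    "P3": "M-TQA-I",
}
print(is_valid_alignment(example))
```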

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to get the alignment for

Returns:

the alignment if it exists in the cache, None otherwise

Return type:

alignment (dict[str, str] | None)

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters:

sequences – the original dictionary of sequences that were aligned

Returns:

the aligned sequences mapped to their IDs
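To illustrate what mapping aligned sequences to their IDs can look like, here is a minimal sketch that reads FASTA-formatted alignment text with the standard library only. The function `read_aligned_fasta` is hypothetical; the actual method may rely on BioPython parsers and on the original `sequences` dictionary instead.

```python
def read_aligned_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a dict of sequence ID -> aligned sequence."""
    alignment: dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-separated token
            # of the header line.
            current_id = line[1:].split()[0]
            alignment[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines.
            alignment[current_id] += line
    return alignment


aligned_text = ">P1\nMKT-A\nYI\n>P2 some description\nMKTQAYI\n"
parsed = read_aligned_fasta(aligned_text)
print(parsed)
```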

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:
  • sequences (dict[str,str]) – sequences to align

  • **kwargs – metadata to be stored with the alignment

Returns:

path to the file with the sequences and the number of sequences in the file

Return type:

sequences_path (str), n_sequences (int)
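The idea of saving sequences to an input file and returning its path together with the sequence count can be sketched as follows. This is an assumption-laden illustration: `write_fasta` and the file name `sequences.fasta` are hypothetical, and the real method also accepts metadata through `**kwargs`, which is omitted here.

```python
from __future__ import annotations

import os
import tempfile


def write_fasta(sequences: dict[str, str], out_dir: str) -> tuple[str, int]:
    """Write sequences in FASTA format; return the file path and sequence count."""
    path = os.path.join(out_dir, "sequences.fasta")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        for seq_id, seq in sequences.items():
            handle.write(f">{seq_id}\n{seq}\n")
    return path, len(sequences)


with tempfile.TemporaryDirectory() as tmp:
    fasta_path, n_sequences = write_fasta({"P1": "MKTAYI", "P2": "MKTQAYI"}, tmp)
    with open(fasta_path, encoding="utf-8") as handle:
        fasta_text = handle.read()

print(n_sequences)
print(fasta_text)
```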

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:
  • target_ids (list[str]) – list of sequence IDs to save the alignment for

  • alignment (dict[str, str]) – the alignment to save
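A cache keyed on a list of sequence IDs could be organized as in the sketch below. The class `AlignmentCache` and its key scheme (sorted IDs joined into one string) are assumptions made for illustration; the provider's actual cache key may be derived differently.

```python
from __future__ import annotations


class AlignmentCache:
    """Toy cache mapping a collection of sequence IDs to an alignment."""

    def __init__(self) -> None:
        self._cache: dict[str, dict[str, str]] = {}

    @staticmethod
    def _key(target_ids: list[str]) -> str:
        # Assumption: the order of IDs should not matter, so sort them first.
        return "~".join(sorted(target_ids))

    def save(self, target_ids: list[str], alignment: dict[str, str]) -> None:
        # Store a copy so later changes by the caller do not alter the cache.
        self._cache[self._key(target_ids)] = dict(alignment)

    def get(self, target_ids: list[str]) -> dict[str, str] | None:
        # Returns None when no alignment was saved for these IDs.
        return self._cache.get(self._key(target_ids))


cache = AlignmentCache()
cache.save(["B", "A"], {"A": "M-K", "B": "MQK"})
print(cache.get(["A", "B"]))
print(cache.get(["A"]))
```

Sorting the IDs makes the lookup independent of the order in which they are passed, which is one possible design choice among several.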

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)
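The requirement that the JSON string contains everything needed to reconstruct the object can be illustrated with a small round trip. The class `ToyProvider` below is a hypothetical stand-in written with the standard `json` module; it does not reproduce the serialization format used by JSONSerializable.

```python
import json


class ToyProvider:
    """Hypothetical stand-in holding the same kind of state as an MSA provider."""

    def __init__(self, out_dir=".", fname="alignment.aln-fasta.fasta"):
        self.out_dir = out_dir
        self.fname = fname
        self.cache = {}

    def to_json(self) -> str:
        # Everything needed to rebuild the object goes into the JSON string.
        state = {"out_dir": self.out_dir, "fname": self.fname, "cache": self.cache}
        return json.dumps(state)

    @classmethod
    def from_json(cls, text: str) -> "ToyProvider":
        state = json.loads(text)
        obj = cls(state["out_dir"], state["fname"])
        obj.cache = state["cache"]
        return obj


original = ToyProvider(out_dir="alignments")
original.cache["A~B"] = {"A": "M-K", "B": "MQK"}
restored = ToyProvider.from_json(original.to_json())
print(restored.out_dir, restored.fname, restored.cache)
```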

class qsprpred.extra.data.utils.msa_calculator.ClustalMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: BioPythonMSA

Multiple sequence alignment provider using the Clustal Omega Linux program - http://www.clustal.org/omega/

Uses the BioPython wrapper for Clustal Omega - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.ClustalOmegaCommandline

Initializes the MSA provider.

Parameters:
  • out_dir (str) – directory to save the alignment to

  • fname (str) – file name of the alignment file

checkTool() → bool

Check if the alignment tool is installed.

property cmd: str

The command that runs the alignment algorithm.

Returns:

the command to run the alignment algorithm

Return type:

cmd (str)

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns:

the current alignment

Return type:

alignment (dict[str, str] | None)

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to get the alignment for

Returns:

the alignment if it exists in the cache, None otherwise

Return type:

alignment (dict[str, str] | None)

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters:

sequences – the original dictionary of sequences that were aligned

Returns:

the aligned sequences mapped to their IDs

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:
  • sequences (dict[str,str]) – sequences to align

  • **kwargs – metadata to be stored with the alignment

Returns:

path to the file with the sequences and the number of sequences in the file

Return type:

sequences_path (str), n_sequences (int)

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:
  • target_ids (list[str]) – list of sequence IDs to save the alignment for

  • alignment (dict[str, str]) – the alignment to save

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.extra.data.utils.msa_calculator.MAFFT(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')

Bases: BioPythonMSA

Multiple sequence alignment provider using the MAFFT cross-platform program - https://mafft.cbrc.jp/alignment/software/

Uses the BioPython wrapper for MAFFT - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.MafftCommandline

Initializes the MSA provider.

Parameters:
  • out_dir (str) – directory to save the alignment to

  • fname (str) – file name of the alignment file

checkTool() → bool

Check if the MAFFT tool is installed

property cmd: str

The command that runs the alignment algorithm.

Returns:

the command to run the alignment algorithm

Return type:

cmd (str)

property current

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns:

the current alignment

Return type:

alignment (dict[str, str] | None)

classmethod fromFile(filename: str) → Any

Initialize a new instance from a JSON file.

Parameters:

filename (str) – path to the JSON file

Returns:

new instance of the class

Return type:

instance (object)

classmethod fromJSON(json: str) → Any

Reconstruct object from a JSON string.

Parameters:

json (str) – JSON string of the object

Returns:

reconstructed object

Return type:

obj (object)

getFromCache(target_ids: list[str]) → dict[str, str] | None

Gets the alignment from the cache if it exists for a list of sequence IDs.

Parameters:

target_ids (list[str]) – list of sequence IDs to get the alignment for

Returns:

the alignment if it exists in the cache, None otherwise

Return type:

alignment (dict[str, str] | None)

parseAlignment(sequences: dict[str, str]) → dict[str, str]

Parse the alignment from the output file of the alignment algorithm.

Parameters:

sequences – the original dictionary of sequences that were aligned

Returns:

the aligned sequences mapped to their IDs

parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]

Create object with sequences and the passed metadata.

Saves the sequences to a file that will serve as input to the command line tools.

Parameters:
  • sequences (dict[str,str]) – sequences to align

  • **kwargs – metadata to be stored with the alignment

Returns:

path to the file with the sequences and the number of sequences in the file

Return type:

sequences_path (str), n_sequences (int)

saveToCache(target_ids: list[str], alignment: dict[str, str])

Saves the alignment to the cache for a list of sequence IDs.

Parameters:
  • target_ids (list[str]) – list of sequence IDs to save the alignment for

  • alignment (dict[str, str]) – the alignment to save

toFile(filename: str) → str

Serialize object to a JSON file. This JSON file should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved JSON file of the object

Return type:

filename (str)

toJSON() → str

Serialize object to a JSON string. This JSON string should contain all data necessary to reconstruct the object.

Returns:

JSON string of the object

Return type:

json (str)

class qsprpred.extra.data.utils.msa_calculator.MSAProvider

Bases: FileSerializable, ABC

Interface for multiple sequence alignment providers.

This interface defines how calculation and storage of multiple sequence alignments (MSAs) is handled.
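As a rough picture of such an interface, the sketch below defines a toy abstract base class with a `current` property and one trivial implementation. All names here (`ToyMSAProvider`, `PaddingProvider`, `align`) are hypothetical, and the padding step is only a placeholder for a real alignment algorithm.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ToyMSAProvider(ABC):
    """Toy interface: subclasses must expose the current alignment."""

    @property
    @abstractmethod
    def current(self) -> dict[str, str] | None:
        """The current alignment, or None if nothing was aligned yet."""


class PaddingProvider(ToyMSAProvider):
    """Pads sequences with gaps on the right instead of truly aligning them."""

    def __init__(self) -> None:
        self._current: dict[str, str] | None = None

    @property
    def current(self) -> dict[str, str] | None:
        return self._current

    def align(self, sequences: dict[str, str]) -> dict[str, str]:
        # Placeholder "alignment": make all sequences the same length.
        width = max(len(seq) for seq in sequences.values())
        self._current = {k: seq.ljust(width, "-") for k, seq in sequences.items()}
        return self._current


provider = PaddingProvider()
before = provider.current
provider.align({"A": "MK", "B": "MKT"})
print(before, provider.current)
```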

abstract property current: dict[str, str] | None

The current alignment.

Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.

Returns:

the current alignment

Return type:

alignment (dict[str, str] | None)

abstract classmethod fromFile(filename: str) → object

Reconstruct object from a metafile.

Parameters:

filename (str) – filename of the metafile to load object from

Returns:

reconstructed object

Return type:

obj (object)

abstract toFile(filename: str) → str

Serialize object to a metafile. This metafile should contain all data necessary to reconstruct the object.

Parameters:

filename (str) – filename to save object to

Returns:

absolute path to the saved metafile of the object

Return type:

filename (str)

Module contents