qsprpred.extra.data.utils package
Subpackages
- qsprpred.extra.data.utils.testing package
- Submodules
- qsprpred.extra.data.utils.testing.path_mixins module
DataSetsMixInExtras
DataSetsMixInExtras.clearGenerated()
DataSetsMixInExtras.createLargeMultitaskDataSet()
DataSetsMixInExtras.createLargeTestDataSet()
DataSetsMixInExtras.createPCMDataSet()
DataSetsMixInExtras.createSmallTestDataSet()
DataSetsMixInExtras.createTestDataSetFromFrame()
DataSetsMixInExtras.getAllDescriptors()
DataSetsMixInExtras.getAllProteinDescriptors()
DataSetsMixInExtras.getBigDF()
DataSetsMixInExtras.getDataPrepGrid()
DataSetsMixInExtras.getDefaultCalculatorCombo()
DataSetsMixInExtras.getDefaultPrep()
DataSetsMixInExtras.getMSAProvider()
DataSetsMixInExtras.getPCMDF()
DataSetsMixInExtras.getPCMSeqProvider()
DataSetsMixInExtras.getPCMTargetsDF()
DataSetsMixInExtras.getPrepCombos()
DataSetsMixInExtras.getSmallDF()
DataSetsMixInExtras.setUpPaths()
DataSetsMixInExtras.tearDown()
DataSetsMixInExtras.validate_split()
- Module contents
Submodules
qsprpred.extra.data.utils.msa_calculator module
Various implementations of multiple sequence alignment (MSA).
The MSA providers are used to align sequences for protein descriptor calculation. This
is required for the calculation of descriptors that are based on sequence alignments,
such as ProDec.
- class qsprpred.extra.data.utils.msa_calculator.BioPythonMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')[source]
Bases: MSAProvider, JSONSerializable, ABC
Common functionality for MSA providers using BioPython command line wrappers.
- Variables:
outDir – directory to save the alignment to
fname – file name of the alignment file
cache – cache of alignments performed so far by the provider
Initializes the MSA provider.
- Parameters:
out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file
- abstract property cmd: str
The command that runs the alignment algorithm.
- Returns:
the command to run the alignment algorithm
- Return type:
str
- property current
The current alignment.
Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.
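To make the shape of this return value concrete, the following is a minimal sketch in plain Python. The sequence IDs and sequences are made up for illustration, and `is_valid_alignment` is a hypothetical helper that is not part of qsprpred.

```python
from typing import Optional


def is_valid_alignment(alignment: Optional[dict[str, str]]) -> bool:
    """Check the invariants described for ``current`` (hypothetical helper)."""
    if alignment is None:
        # No alignment has been calculated yet.
        return False
    # Collect the distinct lengths of all aligned sequences.
    lengths = {len(seq) for seq in alignment.values()}
    # Valid only if every sequence has the same length (an empty dict fails).
    return len(lengths) == 1


# Made-up example data: sequence IDs mapped to gapped sequences.
example = {
    "P00001": "MKT-AYIA",
    "P00002": "MKTQAY-A",
}
print(is_valid_alignment(example))
```

The check only covers the equal-length property; it does not verify that the gaps are placed sensibly.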
- getFromCache(target_ids: list[str]) → dict[str, str] | None [source]
Gets the alignment from the cache if it exists for a list of sequence IDs.
- Parameters:
target_ids (list[str]) – list of sequence IDs to get the alignment for
- parseAlignment(sequences: dict[str, str]) → dict[str, str] [source]
Parse the alignment from the output file of the alignment algorithm.
- Parameters:
sequences – the original dictionary of sequences that were aligned
- Returns:
the aligned sequences mapped to their IDs
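The default output file name (`alignment.aln-fasta.fasta`) suggests a FASTA-formatted alignment. Assuming that format, parsing could look roughly like the sketch below. `read_aligned_fasta` is a hypothetical standard-library helper; the actual method may well rely on BioPython, and it also receives the original sequences, which this sketch does not use.

```python
import os
import tempfile


def read_aligned_fasta(path: str) -> dict[str, str]:
    """Read a FASTA alignment into {sequence ID: aligned sequence}."""
    alignment: dict[str, str] = {}
    current_id = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: the ID is the first whitespace-separated token.
                current_id = line[1:].split()[0]
                alignment[current_id] = ""
            elif current_id is not None:
                # Sequence lines may be wrapped, so concatenate them.
                alignment[current_id] += line
    return alignment


# Demonstration with a made-up alignment written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "alignment.aln-fasta.fasta")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(">seq1 demo\nMKT-AY\nIA\n>seq2\nMKTQAY\n-A\n")
    parsed = read_aligned_fasta(demo_path)
```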
- parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int] [source]
Create object with sequences and the passed metadata.
Saves the sequences to a file that will serve as input to the command line tools.
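As a rough illustration of this step, the sketch below writes a dictionary of sequences to a FASTA file using only the standard library. It rests on two assumptions that the documentation above does not confirm: that the input file is FASTA, and that the returned `tuple[str, int]` holds the file path and the number of records written. `write_fasta` is a hypothetical name, and the metadata passed through `**kwargs` is left out.

```python
import os
import tempfile


def write_fasta(sequences: dict[str, str], path: str) -> tuple[str, int]:
    """Write sequences to a FASTA file; return (path, records written)."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for seq_id, seq in sequences.items():
            # One header line and one sequence line per record.
            handle.write(f">{seq_id}\n{seq}\n")
            count += 1
    return path, count


# Made-up sequences written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path, n_written = write_fasta(
        {"seq1": "MKTAYIA", "seq2": "MKTQAYA"},
        os.path.join(tmp, "sequences.fasta"),
    )
    with open(out_path, encoding="utf-8") as handle:
        fasta_text = handle.read()
```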
- saveToCache(target_ids: list[str], alignment: dict[str, str])[source]
Saves the alignment to the cache for a list of sequence IDs.
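The two cache methods pair up as a store and a lookup keyed by a list of sequence IDs. The sketch below shows one way such a cache could work. `AlignmentCache` and its key scheme (a sorted tuple of IDs) are assumptions made for illustration; the key that qsprpred actually uses is not documented here.

```python
from typing import Optional


class AlignmentCache:
    """Hypothetical in-memory cache keyed by a list of sequence IDs."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, ...], dict[str, str]] = {}

    @staticmethod
    def _key(target_ids: list[str]) -> tuple[str, ...]:
        # Lists are unhashable, so convert to a tuple. Sorting makes the key
        # independent of ID order, which is a design choice of this sketch.
        return tuple(sorted(target_ids))

    def save(self, target_ids: list[str], alignment: dict[str, str]) -> None:
        # Store a copy so later changes by the caller do not alter the cache.
        self._store[self._key(target_ids)] = dict(alignment)

    def get(self, target_ids: list[str]) -> Optional[dict[str, str]]:
        # Returns None when no alignment was saved for these IDs.
        return self._store.get(self._key(target_ids))


cache = AlignmentCache()
cache.save(["seq1", "seq2"], {"seq1": "MK-A", "seq2": "MKTA"})
hit = cache.get(["seq2", "seq1"])   # same set of IDs, different order
miss = cache.get(["seq1", "seq3"])  # never saved
```

An alignment is only meaningful for the exact set of sequences it was computed from, which is why the whole ID list forms the key rather than each ID separately.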
- class qsprpred.extra.data.utils.msa_calculator.ClustalMSA(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')[source]
Bases: BioPythonMSA
Multiple sequence alignment provider using the Clustal Omega Linux program - http://www.clustal.org/omega/
Uses the BioPython wrapper for Clustal Omega - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.ClustalOmegaCommandline
Initializes the MSA provider.
- Parameters:
out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file
- property cmd: str
The command that runs the alignment algorithm.
- Returns:
the command to run the alignment algorithm
- Return type:
str
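For orientation, a Clustal Omega invocation could be assembled as in the sketch below. It builds the argument list directly instead of going through the BioPython wrapper that this class uses, and the selected flags (`--outfmt=fasta`, `--auto`, `--force`) are assumptions rather than the options qsprpred necessarily passes. `clustalo_argv` and `run_if_available` are hypothetical helpers.

```python
import shlex
import shutil
import subprocess


def clustalo_argv(infile: str, outfile: str) -> list[str]:
    """Build an argument list for the ``clustalo`` executable."""
    return [
        "clustalo",
        "-i", infile,        # input sequences
        "-o", outfile,       # where to write the alignment
        "--outfmt=fasta",    # write the alignment in FASTA format
        "--auto",            # let Clustal Omega choose its settings
        "--force",           # overwrite an existing output file
    ]


def run_if_available(argv: list[str]) -> bool:
    """Run the command only if its executable is on PATH."""
    if shutil.which(argv[0]) is None:
        return False
    subprocess.run(argv, check=True)
    return True


argv = clustalo_argv("sequences.fasta", "alignment.aln-fasta.fasta")
command_line = shlex.join(argv)
print(command_line)
```

Passing the arguments as a list avoids shell quoting problems when the output directory contains spaces.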
- property current
The current alignment.
Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.
- getFromCache(target_ids: list[str]) → dict[str, str] | None
Gets the alignment from the cache if it exists for a list of sequence IDs.
- Parameters:
target_ids (list[str]) – list of sequence IDs to get the alignment for
- parseAlignment(sequences: dict[str, str]) → dict[str, str]
Parse the alignment from the output file of the alignment algorithm.
- Parameters:
sequences – the original dictionary of sequences that were aligned
- Returns:
the aligned sequences mapped to their IDs
- parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]
Create object with sequences and the passed metadata.
Saves the sequences to a file that will serve as input to the command line tools.
- saveToCache(target_ids: list[str], alignment: dict[str, str])
Saves the alignment to the cache for a list of sequence IDs.
- class qsprpred.extra.data.utils.msa_calculator.MAFFT(out_dir: str = '.', fname: str = 'alignment.aln-fasta.fasta')[source]
Bases: BioPythonMSA
Multiple sequence alignment provider using the MAFFT cross-platform program - https://mafft.cbrc.jp/alignment/software/
Uses the BioPython wrapper for MAFFT - https://biopython.org/docs/1.76/api/Bio.Align.Applications.html#Bio.Align.Applications.MafftCommandline
Initializes the MSA provider.
- Parameters:
out_dir (str) – directory to save the alignment to
fname (str) – file name of the alignment file
- property cmd: str
The command that runs the alignment algorithm.
- Returns:
the command to run the alignment algorithm
- Return type:
str
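MAFFT prints the finished alignment to standard output, so a caller has to redirect that stream into the alignment file. The sketch below shows this pattern with plain `subprocess` rather than the BioPython wrapper. `mafft_argv` and `run_to_file` are hypothetical helpers, `--auto` is an assumed option choice, and the demonstration uses a stand-in Python command so that it runs without MAFFT installed.

```python
import os
import subprocess
import sys
import tempfile


def mafft_argv(infile: str) -> list[str]:
    """Build an argument list for the ``mafft`` executable."""
    # --auto lets MAFFT pick an alignment strategy for the input.
    return ["mafft", "--auto", infile]


def run_to_file(argv: list[str], outfile: str) -> None:
    """Run a command and write its standard output to ``outfile``."""
    with open(outfile, "w", encoding="utf-8") as handle:
        subprocess.run(argv, stdout=handle, check=True)


# Stand-in command that prints a tiny made-up alignment.
with tempfile.TemporaryDirectory() as tmp:
    demo_out = os.path.join(tmp, "alignment.aln-fasta.fasta")
    stand_in = [sys.executable, "-c", "print('>seq1'); print('MK-A')"]
    run_to_file(stand_in, demo_out)
    with open(demo_out, encoding="utf-8") as handle:
        demo_text = handle.read()
```

With MAFFT installed, `run_to_file(mafft_argv("sequences.fasta"), "alignment.aln-fasta.fasta")` would follow the same path.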
- property current
The current alignment.
Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.
- getFromCache(target_ids: list[str]) → dict[str, str] | None
Gets the alignment from the cache if it exists for a list of sequence IDs.
- Parameters:
target_ids (list[str]) – list of sequence IDs to get the alignment for
- parseAlignment(sequences: dict[str, str]) → dict[str, str]
Parse the alignment from the output file of the alignment algorithm.
- Parameters:
sequences – the original dictionary of sequences that were aligned
- Returns:
the aligned sequences mapped to their IDs
- parseSequences(sequences: dict[str, str], **kwargs) → tuple[str, int]
Create object with sequences and the passed metadata.
Saves the sequences to a file that will serve as input to the command line tools.
- saveToCache(target_ids: list[str], alignment: dict[str, str])
Saves the alignment to the cache for a list of sequence IDs.
- class qsprpred.extra.data.utils.msa_calculator.MSAProvider[source]
Bases: FileSerializable, ABC
Interface for multiple sequence alignment providers.
This interface defines how calculation and storage of multiple sequence alignments (MSAs) is handled.
- abstract property current: dict[str, str] | None
The current alignment.
Returns the current alignment as a dictionary where keys are sequence IDs as str and values are aligned sequences as str. The values are of the same length and contain gaps (“-”) where necessary. If the alignment is not yet calculated, None is returned.
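The contract of an abstract `current` property can be illustrated with a self-contained stand-in. The classes below are not the qsprpred classes: `AlignmentProviderSketch`, `PaddingProvider`, and the `align` method are hypothetical names, and the toy provider only pads sequences with trailing gaps instead of performing a real alignment.

```python
from abc import ABC, abstractmethod
from typing import Optional


class AlignmentProviderSketch(ABC):
    """Stand-in for an MSA provider interface (not the qsprpred class)."""

    @property
    @abstractmethod
    def current(self) -> Optional[dict[str, str]]:
        """The most recent alignment, or None if none was calculated."""


class PaddingProvider(AlignmentProviderSketch):
    """Toy provider: pads with trailing gaps instead of really aligning."""

    def __init__(self) -> None:
        self._current: Optional[dict[str, str]] = None

    @property
    def current(self) -> Optional[dict[str, str]]:
        return self._current

    def align(self, sequences: dict[str, str]) -> dict[str, str]:
        # Hypothetical method name; pads every sequence to the longest one.
        width = max(len(seq) for seq in sequences.values())
        self._current = {k: s.ljust(width, "-") for k, s in sequences.items()}
        return self._current


provider = PaddingProvider()
before = provider.current  # None: nothing has been aligned yet
after = provider.align({"seq1": "MKTA", "seq2": "MK"})
```

Because `current` is abstract, instantiating `AlignmentProviderSketch` directly raises `TypeError`; only subclasses that implement the property can be created.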