Other Engines

Fetcher

Get FTP Files 1_0_0

class ursgal.wrappers.get_ftp_files_1_0_0.get_ftp_files_1_0_0(*args, **kwargs)

get_ftp_files_1_0_0 UNode

Downloads files from FTP servers

Note

meta info param ‘output_extensions’ is by default txt, so that the temporary txt json files get properly deleted

Parameters:
  • ftp_url (*) –
  • folder (*) –
  • login (*) –
  • password (*) –
  • include_ext (*) –
  • output_folder (*) –
  • max_number_of_files (*) –
  • blocksize (*) –
_execute()

Downloads files from FTP server

ursgal.resources.platform_independent.arc_independent.get_ftp_files_1_0_0.get_ftp_files_1_0_0.main(ftp_url=None, folder=None, login=None, password=None, include_ext=None, output_folder=None, max_number_of_files=None, blocksize=None)
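
The parameters listed above map naturally onto Python's standard ftplib module. The following is a minimal sketch of such a download loop, not the node's actual implementation. The helper names select_files and download_ftp_files are hypothetical, and the sketch assumes that ftp_url is a bare host name, that include_ext is a collection of file extensions, and that max_number_of_files caps the number of downloads.

```python
import os
from ftplib import FTP


def select_files(names, include_ext=None, max_number_of_files=None):
    """Keep names with a wanted extension, capped at max_number_of_files."""
    if include_ext:
        wanted = tuple(ext.lower() for ext in include_ext)
        names = [n for n in names if n.lower().endswith(wanted)]
    else:
        names = list(names)
    if max_number_of_files is not None:
        names = names[:max_number_of_files]
    return names


def download_ftp_files(ftp_url, folder, login, password, include_ext,
                       output_folder, max_number_of_files=None, blocksize=8192):
    """Download matching files from folder on ftp_url into output_folder."""
    os.makedirs(output_folder, exist_ok=True)
    downloaded = []
    with FTP(ftp_url) as ftp:  # assumption: ftp_url is a bare host name
        ftp.login(user=login, passwd=password)
        ftp.cwd(folder)
        for name in select_files(ftp.nlst(), include_ext, max_number_of_files):
            target = os.path.join(output_folder, os.path.basename(name))
            with open(target, "wb") as handle:
                # blocksize is the chunk size handed to the write callback
                ftp.retrbinary("RETR " + name, handle.write, blocksize=blocksize)
            downloaded.append(target)
    return downloaded
```

Keeping the file selection in a separate function makes the extension and count filters testable without a server connection.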

Get HTTP Files 1_0_0

class ursgal.wrappers.get_http_files_1_0_0.get_http_files_1_0_0(*args, **kwargs)

get_http_files_1_0_0 UNode

Downloads files via http

Parameters:
  • http_url (*) –
  • http_output_folder (*) –

Note

meta info param ‘output_extensions’ is by default txt, so that the temporary txt json files get properly deleted

_execute()

Downloads files via http

ursgal.resources.platform_independent.arc_independent.get_http_files_1_0_0.get_http_files_1_0_0.main(http_url=None, http_output_folder=None)
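
A download of this kind can be sketched with the standard library alone. The function name download_http_file and the rule for deriving the local file name from the URL are assumptions made for illustration; the node itself may name its output differently.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download_http_file(http_url, http_output_folder):
    """Download http_url into http_output_folder and return the local path."""
    os.makedirs(http_output_folder, exist_ok=True)
    # Hypothetical naming rule: reuse the last path component of the URL.
    file_name = os.path.basename(urlparse(http_url).path) or "download"
    target = os.path.join(http_output_folder, file_name)
    with urllib.request.urlopen(http_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # stream the body to disk
    return target
```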

Meta Engines

Combine FDR 0_1

class ursgal.wrappers.combine_FDR_0_1.combine_FDR_0_1(*args, **kwargs)

combine FDR 0_1 UNode

An implementation of the “combined FDR Score” algorithm, as described in: Jones AR, Siepen JA, Hubbard SJ, Paton NW (2009): “Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines.”

Input should be multiple CSV files from different search engines. Each CSV requires a PEP column, for instance by post-processing with Percolator.

Returns a merged CSV file with all PSMs that were found and an added column “Combined FDR Score”.

_execute()

Executing the combine_FDR_0_1 main function with parameters that were defined in preflight (stored in self.command_dict)

The main function is imported and then executed using the parameters from command_dict.

Returns:None
preflight()

Building the list of parameters that will be passed to the combine_FDR_0_1 main function.

These parameters are stored in self.command_dict

Returns:None

Combine PEP 1_0_0

class ursgal.wrappers.combine_pep_1_0_0.combine_pep_1_0_0(*args, **kwargs)

combine_pep_1_0_0 UNode

Combining Multiengine Search Results with “Combined PEP”

“Combined PEP” is a hybrid approach combining elements of the “combined FDR” approach (Jones et al., 2009), elements of PeptideShaker, and elements of Bayes’ theorem. Similar to “combined FDR”, “combined PEP” groups the PSMs. For each search engine, the reported PSMs are treated as a set, and the logical combinations of all sets are treated separately, as done in the “combined FDR” approach. For instance, three search engines would result in seven PSM groups, which can be visualized by the seven intersections of a three-set Venn diagram. Typically, a PSM group that is shared by multiple engines contains fewer decoy hits and therefore represents a higher-quality subset, so its PSMs receive a higher score. This approach is based on the assumption that the search engines agree on the decoys and false-positives as they agree on the targets.

The combined PEP approach uses Bayes’ theorem to calculate a multiengine PEP (MEP) for each PSM based on the PEPs reported by, for example, Percolator for different search engines, that is

http://pubs.acs.org/appl/literatum/publisher/achs/journals/content/jprobs/2016/jprobs.2016.15.issue-3/acs.jproteome.5b00860/20160229/images/pr-2015-00860d_m001.gif

This is done for each PSM group separately.
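
The exact expression is given in the equation linked above. As a rough illustration only, the sketch below assumes that the per-engine PEPs are treated as independent evidence and combined in the standard naive-Bayes form; the function name multiengine_pep is hypothetical.

```python
from math import prod


def multiengine_pep(peps):
    """Combine the per-engine PEPs of one PSM into a single Bayes-style PEP."""
    p_all_wrong = prod(peps)                       # every engine is wrong
    p_all_right = prod(1.0 - pep for pep in peps)  # every engine is right
    denominator = p_all_wrong + p_all_right
    if denominator == 0.0:
        # Happens only for contradictory inputs such as [0.0, 1.0].
        raise ValueError("PEPs of 0 and 1 cannot be combined")
    return p_all_wrong / denominator


# Two engines that each report a PEP of 0.1 reinforce each other.
combined = multiengine_pep([0.1, 0.1])
```

Under this assumption, agreement between engines lowers the combined value below each individual PEP, while a single engine's PEP is returned unchanged.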

Then, the combined PEP (the final score) is computed similar to PeptideShaker using a sliding window over all PSMs within each group (sorted by MEP). Each PSM receives a PEP based on the target/decoy ratio of the surrounding PSMs.

http://pubs.acs.org/appl/literatum/publisher/achs/journals/content/jprobs/2016/jprobs.2016.15.issue-3/acs.jproteome.5b00860/20160229/images/pr-2015-00860d_m002.gif

Finally, all groups are merged and the results reported in one output, including all the search result scores from the individual search engines as well as the FDR based on the “combined PEP”.

The sliding window size can be defined by adjusting the Ursgal parameter “window_size” (default is 249).
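
The sliding window step can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: it assumes a window centered on each PSM, a PEP estimated as the decoy-to-target ratio within that window, and a cap at 1.0. The function name sliding_window_pep is hypothetical.

```python
def sliding_window_pep(is_decoy_sorted, window_size=249):
    """Estimate a PEP per PSM from the decoy/target ratio of its neighbours.

    is_decoy_sorted: list of booleans for PSMs already sorted by ascending MEP.
    """
    half = window_size // 2
    peps = []
    for i in range(len(is_decoy_sorted)):
        # The window shrinks at both ends of the sorted list.
        window = is_decoy_sorted[max(0, i - half): i + half + 1]
        decoys = sum(window)
        targets = len(window) - decoys
        # Assumption: PEP is approximated by decoys / targets, capped at 1.0.
        peps.append(1.0 if targets == 0 else min(1.0, decoys / targets))
    return peps


example = sliding_window_pep([False, False, True, True], window_size=3)
```

An odd window size such as the default of 249 gives a symmetric window of 124 PSMs on either side.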

Input should be multiple CSV files from different search engines. Each CSV requires a PEP column, for instance by post-processing with Percolator.

Returns a merged CSV file with all PSMs that were found and two added columns:

  • column “Bayes PEP”:
    The multi-engine PEP, see explanation above
  • column “combined PEP”:
    The PEP as computed within the engine combination PSMs

For optimal ranking, PSMs should be sorted by combined PEP. Ties can be resolved by sorting them by Bayes PEP.
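
The recommended ordering can be expressed as a two-level sort key. The sketch assumes the merged CSV has been read into dictionaries (for example with csv.DictReader), so that both columns arrive as strings; rank_psms is a hypothetical helper name.

```python
def rank_psms(rows):
    """Sort PSM rows by combined PEP, breaking ties by Bayes PEP."""
    return sorted(
        rows,
        key=lambda row: (float(row["combined PEP"]), float(row["Bayes PEP"])),
    )


rows = [
    {"Sequence": "A", "combined PEP": "0.01", "Bayes PEP": "0.2"},
    {"Sequence": "B", "combined PEP": "0.001", "Bayes PEP": "0.5"},
    {"Sequence": "C", "combined PEP": "0.01", "Bayes PEP": "0.1"},
]
ranked = rank_psms(rows)
```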

_execute()

Executing the combine_pep_1_0_0 main function with parameters that were defined in preflight (stored in self.command_dict)

The main function is imported and then executed using the parameters from command_dict.

Returns:None
preflight()

Building the list of parameters that will be passed to the combine_pep_1_0_0 main function.

These parameters are stored in self.command_dict

Returns:None

Misc Engines

Filter CSV 1_0_0

class ursgal.wrappers.filter_csv_1_0_0.filter_csv_1_0_0(*args, **kwargs)

filter_csv_1_0_0 UNode

Filters .csv files row-wise according to user-defined rules.

The filter rules have to be defined in the params. See the engine documentation for further information ( filter_csv_1_0_0._execute() ).

_execute()

Result files (.csv) are filtered for defined filter parameters.

Input file has to be a .csv

Creates an _accepted.csv file and returns its path. If defined, rejected entries are also written to _rejected.csv.

Note

To write the rejected entries define ‘write_unfiltered_results’ as True in the parameters.

Available rules:

  • lte
  • gte
  • lt
  • gt
  • contains
  • contains_not
  • equals
  • equals_not
  • regex

Example

>>> params = {
...     'csv_filter_rules': [
...         ['PEP', 'lte', 0.01],
...         ['Is decoy', 'equals', 'false']
...     ]
... }

The example above would filter for posterior error probabilities lower than or equal to 0.01 and filter out all decoy proteins.

Rules are defined as a list of lists, with the first list element being the column name/csv fieldname, the second the rule and the third the value to compare against. Multiple rules can be applied, see example above. If the same fieldname should be filtered multiple times (e.g. Sequence should not contain ‘T’ and ‘Y’), the rules have to be defined separately.

Example

>>> params = {
...     'csv_filter_rules': [
...         ['Sequence', 'contains_not', 'T'],
...         ['Sequence', 'contains_not', 'Y']
...     ]
... }

lte:

‘lower than or equal’ (<=). The value has to be comparable, i.e. float or int. Values are accepted if they are lower than or equal to the defined value. E.g. ['PEP', 'lte', 0.01]

gte:

‘greater than or equal’ (>=). The value has to be comparable, i.e. float or int. Values are accepted if they are greater than or equal to the defined value. E.g. ['Exp m/z', 'gte', 180]

lt:

‘lower than’ (<). The value has to be comparable, i.e. float or int. Values are accepted if they are lower than the defined value. E.g. ['PEP', 'lt', 0.01]

gt:

‘greater than’ (>). The value has to be comparable, i.e. float or int. Values are accepted if they are greater than the defined value. E.g. ['PEP', 'gt', 0.01]

contains:

Checks whether the substring is present in the full string. Values are accepted if they contain it. E.g. ['Modifications', 'contains', 'Oxidation']

contains_not:

Checks whether the substring is absent from the full string. Values are accepted if they do not contain it. E.g. ['Sequence', 'contains_not', 'M']

equals:

String comparison (==). Comparison has to be an exact match to pass. E.g. ['Is decoy', 'equals', 'false']. Floats and ints are not compared at the moment!

equals_not:

String comparison (!=). Values that exactly match the defined value are rejected. E.g. ['Is decoy', 'equals_not', 'true']. Floats and ints are not compared at the moment!

regex:

Any regular expression can be used for matching, e.g. a CT and CD motif search: ['Sequence', 'regex', 'C[TD]']

Note

Some spreadsheet tools interpret False and True and show them as upper case when opening the files, even if they are actually written in lower case. This is especially important for target and decoy filtering, i.e. ['Is decoy', 'equals', 'false']. ‘false’ has to be lower case, even if the spreadsheet tool displays it as ‘FALSE’.
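
The rule semantics described above can be summarized in a small sketch. This is an illustration of the described behaviour, not the engine's source: the names RULES and row_passes are hypothetical, the numeric rules assume that cells parse as floats, and the string rules compare the raw cell text.

```python
import csv
import io
import re

# One predicate per rule name; cell is the CSV string, value comes from the rule.
RULES = {
    "lte": lambda cell, value: float(cell) <= float(value),
    "gte": lambda cell, value: float(cell) >= float(value),
    "lt": lambda cell, value: float(cell) < float(value),
    "gt": lambda cell, value: float(cell) > float(value),
    "contains": lambda cell, value: value in cell,
    "contains_not": lambda cell, value: value not in cell,
    "equals": lambda cell, value: cell == value,
    "equals_not": lambda cell, value: cell != value,
    "regex": lambda cell, value: re.search(value, cell) is not None,
}


def row_passes(row, filter_rules):
    """A row is accepted only if it satisfies every [field, rule, value] triple."""
    return all(RULES[rule](row[field], value) for field, rule, value in filter_rules)


csv_text = (
    "Sequence,PEP,Is decoy\n"
    "PEPTIDEK,0.001,false\n"
    "ACTK,0.2,false\n"
    "KEDITPEP,0.005,true\n"
)
rules = [["PEP", "lte", 0.01], ["Is decoy", "equals", "false"]]
rows = list(csv.DictReader(io.StringIO(csv_text)))
accepted = [row for row in rows if row_passes(row, rules)]
rejected = [row for row in rows if not row_passes(row, rules)]
```

In this example only the first row is accepted: the second fails the PEP rule and the third is a decoy.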

ursgal.resources.platform_independent.arc_independent.filter_csv_1_0_0.filter_csv_1_0_0.main(input_file=None, output_file=None, filter_rules=None, output_file_unfiltered=None)

Filters csvs

Generate Target Decoy 1_0_0

class ursgal.wrappers.generate_target_decoy_1_0_0.generate_target_decoy_1_0_0(*args, **kwargs)

Generate Target Decoy 1_0_0 UNode

_execute()

Creates a target decoy database based on shuffling of peptides or completely reversing the protein sequence.

The engine currently available generates a very stringent target decoy database by peptide shuffling but also offers the possibility to simply reverse the protein sequence. The mode can be defined in the params with ‘decoy_generation_mode’.

The peptide shuffling method is described below. In one of the first steps, redundant sequences are filtered and the protein receives a tag that highlights its double occurrence in the database. This ensures that target and decoy peptides are not unequally distributed. Further, every peptide is shuffled, while the amino acids where the enzyme cleaves are maintained at their original position. Every peptide is shuffled only once and the shuffling result is stored. This ensures that a peptide that occurs multiple times is always shuffled the same way. It is further ensured that unmutable peptides (e.g. ‘RR’ for trypsin) are not shuffled and are reported by the engine as unmutable peptides in a text file, so that they can be excluded from further analysis. This way of generating a target decoy database leads to the fulfillment of the following quality criteria (Proteome Bioinformatics, Eds: S.J. Hubbard, A.R. Jones, Humana Press).

Quality criteria:

  • every target peptide sequence has exactly one decoy peptide sequence
  • equal amino acid distribution
  • equal protein and peptide length
  • equal number of proteins and peptides
  • similar mass distribution
  • no predicted peptides in common

Available modes:

  • shuffle_peptide - stringent target decoy generation by shuffling
    peptides while maintaining the cleavage site amino acid.
  • reverse_protein - reverses the protein sequence

Available enzymes and their cleavage site can be found in the knowledge base of generate_target_decoy_1_0_0.
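
The two modes can be illustrated with a simplified sketch. It is not the engine's implementation: it assumes a trypsin-like enzyme that cleaves after K and R with no further exceptions, it does not filter redundant proteins, and it does not check that a shuffled peptide differs from its target. The function names cleave, shuffle_peptide and make_decoy are hypothetical.

```python
import random


def cleave(protein, cleavage_aa="KR"):
    """Split a protein after every cleavage residue (simplified rule)."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in cleavage_aa:
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def shuffle_peptide(peptide, cache, cleavage_aa="KR", rng=random):
    """Shuffle a peptide once, keeping cleavage residues at their positions."""
    if peptide not in cache:
        movable = [i for i, aa in enumerate(peptide) if aa not in cleavage_aa]
        residues = [peptide[i] for i in movable]
        rng.shuffle(residues)
        shuffled = list(peptide)
        for i, aa in zip(movable, residues):
            shuffled[i] = aa
        # Store the result so repeated peptides are shuffled identically.
        cache[peptide] = "".join(shuffled)
    return cache[peptide]


def make_decoy(protein, mode="shuffle_peptide", cache=None):
    """Build a decoy sequence for one protein in the given mode."""
    if mode == "reverse_protein":
        return protein[::-1]
    cache = {} if cache is None else cache
    return "".join(shuffle_peptide(p, cache) for p in cleave(protein))


protein = "PEPTIDEKAAGGRRPEPTIDEK"
decoy = make_decoy(protein)
```

Because only the non-cleavage residues are permuted, the decoy keeps the protein length, the peptide lengths and the amino acid composition, and a peptide consisting solely of cleavage residues such as ‘RR’ is returned unchanged.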

ursgal.resources.platform_independent.arc_independent.generate_target_decoy_1_0_0.generate_target_decoy_1_0_0.main(input_files=None, output_file=None, enzyme=None, decoy_tag='decoy_', mode='shuffle_peptide')

Merge CSVS 1_0_0

class ursgal.wrappers.merge_csvs_1_0_0.merge_csvs_1_0_0(*args, **kwargs)

Merge CSVS 1_0_0 UNode

_execute()

Merges .csv files

for same header, new rows are appended

for different header, new columns are appended

ursgal.resources.platform_independent.arc_independent.merge_csvs_1_0_0.merge_csvs_1_0_0.main(csv_files=None, output=None)

Merges ident csvs
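
The merge behaviour described above (rows appended for known columns, columns appended for new header fields) can be sketched with the standard csv module. For brevity the sketch works on CSV text held in memory rather than on file paths, and the function name merge_csv_texts is hypothetical.

```python
import csv
import io


def merge_csv_texts(csv_texts):
    """Merge CSV contents: rows are appended, unseen columns are added."""
    fieldnames, rows = [], []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)  # keep first-seen column order
        rows.extend(reader)
    out = io.StringIO()
    # restval fills cells for columns a given input file did not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


merged = merge_csv_texts(["a,b\n1,2\n", "a,c\n3,4\n"])
```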

Sanitize CSV 1_0_0

class ursgal.wrappers.sanitize_csv_1_0_0.sanitize_csv_1_0_0(*args, **kwargs)

sanitize_csv_1_0_0 UNode

Result files (.csv) are sanitized following defined parameters. That means, for each spectrum, PSMs are compared and the best PSM(s) is (are) chosen.

The parameters have to be defined in the params. See the engine documentation for further information ( sanitize_csv_1_0_0._execute() ).

_execute()

Result files (.csv) are sanitized following defined parameters. That means, for each spectrum, PSMs are compared and the best PSM(s) is (are) chosen.

Input file has to be a .csv

Creates a _sanitized.csv file and returns its path.

Note

If not specified, the validation_score_field and bigger_scores_better parameters are determined from the last engine. Therefore, if sanitize_csv_1_0_0 is applied to merged or processed result files, both parameters need to be specified.

Available parameters:

  • score_diff_threshold (float): minimum score difference between
    the best PSM and the first rejected PSM of one spectrum
  • threshold_is_log10 (bool): True, if log10 scale has been used for
    score_diff_threshold.
  • accept_conflicting_psms (bool): If True, multiple PSMs for one
    spectrum can be reported if their score difference is below the threshold. If False, all PSMs for one spectrum are removed if the score difference between the best and second-best PSM is not above the threshold, i.e. if there are conflicting PSMs with similar scores.
  • num_compared_psms (int): maximum number of PSMs (sorted by score,
    starting with the best scoring PSM) that are compared
  • remove_redundant_psms (bool): If True, redundant PSMs (e.g.
    the same identification reported by multiple engines) for the same spectrum are removed. An identification is defined by the combination of ‘Sequence’, ‘Modifications’ and ‘Charge’.
ursgal.resources.platform_independent.arc_independent.sanitize_csv_1_0_0.sanitize_csv_1_0_0.main(input_file=None, output_file=None, grouped_psms=None, validation_score_field=None, bigger_scores_better=None, score_diff_threshold=2.0, log10_threshold=True, accept_conflicting_psms=False, num_compared_psms=2, remove_redundant_psms=False)

Spectra with multiple PSMs are sanitized, i.e. only the PSM with best PEP score is accepted and only if the best hit has a PEP that is at least two orders of magnitude smaller than the others
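
The decision made for a single spectrum can be sketched as follows. This is a simplified reading of the parameters described above, not the engine's code: the function name sanitize_spectrum is hypothetical, redundant PSM removal is left out, and scores of exactly 0 are not handled on the log10 scale.

```python
import math


def sanitize_spectrum(psms, score_field="PEP", bigger_scores_better=False,
                      score_diff_threshold=2.0, threshold_is_log10=True,
                      accept_conflicting_psms=False, num_compared_psms=2):
    """Return the PSMs of one spectrum that survive sanitizing."""
    ranked = sorted(psms, key=lambda psm: float(psm[score_field]),
                    reverse=bigger_scores_better)[:num_compared_psms]
    if len(ranked) < 2:
        return ranked

    def scaled(psm):
        score = float(psm[score_field])
        return math.log10(score) if threshold_is_log10 else score

    best = scaled(ranked[0])
    accepted = [ranked[0]]
    for psm in ranked[1:]:
        if abs(best - scaled(psm)) >= score_diff_threshold:
            break  # first clearly worse PSM, stop comparing
        accepted.append(psm)  # conflicting PSM with a similar score
    if len(accepted) > 1 and not accept_conflicting_psms:
        return []  # no PSM is reported for an ambiguous spectrum
    return accepted


clear = sanitize_spectrum([{"Sequence": "A", "PEP": "1e-5"},
                           {"Sequence": "B", "PEP": "1e-2"}])
ambiguous = [{"Sequence": "A", "PEP": "1e-3"}, {"Sequence": "B", "PEP": "2e-3"}]
```

With the default threshold of 2.0 on the log10 scale, the best PSM is kept only if its PEP is at least two orders of magnitude smaller than that of the runner-up.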

Upeptide mapper v1_0_0

class ursgal.wrappers.upeptide_mapper_1_0_0.upeptide_mapper_1_0_0(*args, **kwargs)

upeptide_mapper_1_0_0 UNode

Note

Different converter versions can be used (see parameter ‘peptide_mapper_converter_version’) as well as different classes inside the converter node (see parameter ‘peptide_mapper_class_version’ )

Available converter classes of upeptide_mapper_1_0_0
  • UPeptideMapper_v3 (default)
  • UPeptideMapper_v4 (no buffering and enhanced speed compared to v3)
  • UPeptideMapper_v2
_execute()

Peptides from search engine csv file are mapped to the given database(s)

ursgal.resources.platform_independent.arc_independent.upeptide_mapper_1_0_0.upeptide_mapper_1_0_0.main(input_file=None, output_file=None, params=None)

Peptide mapping implementation as Unode.

Parameters:
  • input_file (str) – input filename of csv
  • output_file (str) – output filename
  • params (dict) – dictionary containing ursgal params
Results and fixes
  • All peptide sequences are remapped to their corresponding protein, ensuring correct start, stop, pre and post amino acid.
  • It is determined if the corresponding proteins are decoy proteins. These peptides are reported after the mapping process.
  • Non-mappable peptides are reported. This can be due to, e.g., ‘X’ in protein sequences in the fasta file or other non-standard amino acids. These are sometimes replaced/interpreted/interpolated by the search engine. A recheck is performed to see whether the peptides can be mapped when allowing an ‘X’ at any position. These peptides are also reported. If peptides still cannot be mapped after re-mapping, these are reported as well.

Mapper class v4 (dev)

class ursgal.resources.platform_independent.arc_independent.upeptide_mapper_1_0_0.upeptide_mapper_1_0_0.UPeptideMapper_v4(fasta_database)

UPeptideMapper V4

Improved version of class version 3 (changes proposed by Christian)

Note

Uses the implementation of Aho-Corasick algorithm pyahocorasick. Please refer to https://pypi.python.org/pypi/pyahocorasick/ for more information.

cache_database(fasta_database)

Function to cache the given fasta database.

Parameters:fasta_database (str) – path to the fasta database

Note

If the same fasta_name is buffered again all info is purged from the class.

map_peptides(peptide_list)

Function to map a given peptide list in one batch.

Parameters:peptide_list (list) – list with peptides to be mapped
Returns:Dictionary containing peptides as keys and lists of protein mappings as values for the given fasta_name
Return type:peptide_2_protein_mappings (dict)

Note

Based on the number of peptides the returned mapping dictionary can become very large.

Warning

The peptide to protein mapping is reset if a new list of peptides is mapped to the same database (fasta_name).

Examples:

peptide_2_protein_mappings['PEPTIDE']  = [
    {
        'start' : 1,
        'end'   : 10,
        'pre'   : 'K',
        'post'  : 'D',
        'id'    : 'BSA'
    }
]
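
To show how such a mapping dictionary can be produced, the sketch below uses plain substring search instead of the Aho-Corasick automaton the class relies on, so it illustrates the output format rather than the performance characteristics. The function signature, the proteins dictionary, the 1-based inclusive coordinates and the use of ‘-’ at protein termini are assumptions for illustration.

```python
def map_peptides(peptide_list, proteins):
    """Map peptides onto protein sequences with plain substring search.

    proteins: dict of protein id -> sequence, a hypothetical stand-in for
    the cached fasta database.
    """
    mappings = {}
    for peptide in peptide_list:
        hits = []
        for protein_id, sequence in proteins.items():
            start = sequence.find(peptide)
            while start != -1:
                end = start + len(peptide)
                hits.append({
                    "start": start + 1,  # assumption: 1-based, inclusive
                    "end": end,
                    "pre": sequence[start - 1] if start > 0 else "-",
                    "post": sequence[end] if end < len(sequence) else "-",
                    "id": protein_id,
                })
                # Continue after this hit to find overlapping occurrences.
                start = sequence.find(peptide, start + 1)
        mappings[peptide] = hits
    return mappings


peptide_2_protein_mappings = map_peptides(["PEPTIDE"], {"BSA": "MKPEPTIDEDK"})
```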

Mapper class v3 (dev)

class ursgal.resources.platform_independent.arc_independent.upeptide_mapper_1_0_0.upeptide_mapper_1_0_0.UPeptideMapper_v3(fasta_database)

UPeptideMapper V3

New improved version which is faster and consumes less memory than earlier versions. It is the new default version for peptide mapping.

Note

Uses the implementation of Aho-Corasick algorithm pyahocorasick. Please refer to https://pypi.python.org/pypi/pyahocorasick/ for more information.

Warning

The new implementation is still in beta/testing phase. Please use, check and interpret accordingly

cache_database(fasta_database, fasta_name)

Function to cache the given fasta database.

Parameters:
  • fasta_database (str) – path to the fasta database
  • fasta_name (str) – name of the database (e.g. os.path.basename(fasta_database))

Note

If the same fasta_name is buffered again all info is purged from the class.

map_peptides(peptide_list, fasta_name)

Function to map a given peptide list in one batch.

Parameters:
  • peptide_list (list) – list with peptides to be mapped
  • fasta_name (str) – name of the database (e.g. os.path.basename(fasta_database))
Returns:Dictionary containing peptides as keys and lists of protein mappings as values for the given fasta_name
Return type:peptide_2_protein_mappings (dict)

Note

Based on the number of peptides the returned mapping dictionary can become very large.

Warning

The peptide to protein mapping is reset if a new list of peptides is mapped to the same database (fasta_name).

Examples:

peptide_2_protein_mappings['BSA1']['PEPTIDE']  = [
    {
        'start' : 1,
        'end'   : 10,
        'pre'   : 'K',
        'post'  : 'D',
        'id'    : 'BSA'
    }
]
purge_fasta_info(fasta_name)

Purges regular sequence lookup and fcache for a given fasta_name

Mapper class v2 (deprecated)

class ursgal.resources.platform_independent.arc_independent.upeptide_mapper_1_0_0.upeptide_mapper_1_0_0.UPeptideMapper_v2(word_len=None)

The UPeptideMapper class offers ultra fast peptide to sequence mapping using a fast cache, hereafter referred to as fcache.

The fcache is built using the build_lookup_from_file or build_lookup functions. The fcache can be queried using the UPeptideMapper.map_peptide() function.

Note

This is the deprecated version of the peptide mapper, which can be used by setting the parameter ‘peptide_mapper_class_version’ to ‘UPeptideMapper_v2’. Otherwise the new mapper class version (‘UPeptideMapper_v3’) is used as default.

_create_fcache(id=None, seq=None, fasta_name=None)

Updates the fast cache with a given sequence

_format_hit_dict(seq, start, end, id)

Creates a formatted dictionary from a single mapping hit. At the same time, the pre and post amino acids are evaluated from the given sequence. The final output looks, for example, like this:

{
    'start' : 12,
    'end'   : 18,
    'id'    : 'Protein Id passed to the function',
    'pre'   : 'A',
    'post'  : 'V',
}

Note

If the pre or post amino acids are N- or C-terminal, respectively, then the reported amino acid will be ‘-’.

build_lookup(fasta_name=None, fasta_stream=None, force=True)

Builds the fast cache and regular sequence dict from a fasta stream

build_lookup_from_file(path_to_fasta_file, force=True)

Builds the fast cache and regular sequence dict from a fasta file.

Returns the internal fasta name, i.e. the path with dirs stripped away.

map_peptide(peptide=None, fasta_name=None, force_regex=False)

Maps a peptide to a fasta database.

Returns a list of single hits which look for example like this:

{
    'start' : 12,
    'end'   : 18,
    'id'    : 'Protein Id passed to the function',
    'pre'   : 'A',
    'post'  : 'V',
}
map_peptides(peptide_list, fasta_name=None, force_regex=False)

Wrapper function to map a given peptide list in one batch.

Parameters:
  • peptide_list (list) – list with peptides to be mapped
  • fasta_name (str) – name of the database
purge_fasta_info(fasta_name)

Purges regular sequence lookup and fcache for a given fasta_name

Quantification Engines

pyQms 1_0_0

Validation Engines

Kojak tailored Percolator 2_08

class ursgal.wrappers.kojak_percolator_2_08.kojak_percolator_2_08(*args, **kwargs)

Kojak adjusted Percolator 2_08 UNode

Kojak provides preformatted Percolator input, which is used directly as the input file for Percolator. In contrast to the original Percolator node, the input files are not reformatted or used to write a new input file.

Note

Percolator (2.08) has to be symlinked or copied to engine-folder ‘kojak_percolator_2_08’ in order to make this node work.

Reference: Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets.

postflight()

Convert the percolator output .tsv into the .csv format with headers as in the unified csv format.

preflight()

Formatting the command line via self.params

Percolator 2_08

class ursgal.wrappers.percolator_2_08.percolator_2_08(*args, **kwargs)

Percolator 2_08 UNode

q-value and posterior error probability calculation by a semi-supervised learning algorithm that dynamically learns to separate target from decoy peptide-spectrum matches (PSMs)

Reference: Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets.

postflight()

Reads the output and merges it back into the ident csv

preflight()

Formatting the command line via self.params

qvality 2_02

class ursgal.wrappers.qvality_2_02.qvality_2_02(*args, **kwargs)

qvality_2_02 UNode

q-value and posterior error probability calculation from score distributions

Reference: Käll L, Storey JD, Noble WS (2009) QVALITY: non-parametric estimation of q-values and posterior error probabilities.

postflight()

Parse the qvality output and merge it back into the csv file

preflight()

Formatting the command line via self.params

Visualizer

Plot pyGCluster heatmap from CSV 1_0_0

class ursgal.wrappers.plot_pygcluster_heatmap_from_csv_1_0_0.plot_pygcluster_heatmap_from_csv_1_0_0(*args, **kwargs)

plot_pygcluster_heatmap_from_csv_1_0_0 UNode

_execute()

Venn Diagram v1_0_0

class ursgal.wrappers.venndiagram_1_0_0.venndiagram_1_0_0(*args, **kwargs)

Venn Diagram uNode

_execute()

Plots a Venn diagram for a list of .csv result files (2-5)

Arguments are set in uparams.py but passed to the engine by self.params attribute

Returns:results for the different areas e.g. dict[‘C-(A|B|D)’][‘results’]

Output file is written to the common_top_level_dir

Return type:dict
ursgal.resources.platform_independent.arc_independent.venndiagram_1_0_0.venndiagram_1_0_0.main(*args, **kwargs)

Creates a simple SVG Venn diagram. Requires 2, 3, 4 or 5 sets as arguments.

Keyword Arguments:
 
  • output_file
  • header
  • label_A
  • label_B
  • label_C
  • label_D
  • label_E
  • color_A – e.g. #FF8C00
  • color_B
  • color_C
  • color_D
  • color_E
  • font

The function returns a dict with the following keys, where the results can be accessed by e.g. dict[‘C-(A|B|D)’][‘results’]

‘A&B-(C|D)’ ‘C&D-(A|B)’ ‘B&C-(A|D)’ ‘A&B&C&D’ ‘A&C-(B|D)’ ‘B&D-(A|C)’ ‘A&D-(B|C)’ ‘(A&C&D)-B’ ‘(A&B&D)-C’ ‘(A&B&C)-D’ ‘(B&C&D)-A’ ‘A-(B|C|D)’ ‘D-(A|B|C)’ ‘B-(A|C|D)’ ‘C-(A|B|D)’

or for 2 or 3 or 5 VennDiagrams the appropriate combinations …
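
How the exclusive regions behind these keys can be derived from plain Python sets is sketched below. The key format is inferred from the four-set keys listed above, and its application to other set counts is an assumption. The names region_key and venn_regions are hypothetical, and no SVG is drawn.

```python
from itertools import combinations


def region_key(included, excluded):
    """Build a key in the style of the four-set keys listed above."""
    left = "&".join(included)
    if not excluded:
        return left
    if len(included) > 2:
        left = "(" + left + ")"
    if len(excluded) == 1:
        right = excluded[0]
    else:
        right = "(" + "|".join(excluded) + ")"
    return left + "-" + right


def venn_regions(sets):
    """Map every exclusive Venn region to the elements it contains.

    sets: dict of label -> set, e.g. {'A': {...}, 'B': {...}}.
    """
    labels = sorted(sets)
    regions = {}
    for size in range(1, len(labels) + 1):
        for included in combinations(labels, size):
            excluded = [label for label in labels if label not in included]
            # Elements shared by all included sets ...
            members = set.intersection(*(sets[label] for label in included))
            # ... that occur in none of the excluded sets.
            for label in excluded:
                members = members - sets[label]
            regions[region_key(included, excluded)] = {"results": members}
    return regions


regions = venn_regions({"A": {1, 2, 3, 4}, "B": {2, 3, 5}, "C": {3, 4, 6}, "D": {7}})
```

For n sets this yields 2^n - 1 regions, i.e. 15 for four sets and 7 for three.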