UController

class ursgal.ucontroller.UController(*args, **kwargs)

ursgal main class

Keyword Arguments:
 
  • params (dict) – params that are used for all further analyses, overriding default values from ursgal/kb/*.py
  • profile (str) –

    Profile key for faster parameter selection. This idea is adapted from MS-GF+ and translated to all search engines. Currently available profiles are:

    • ‘QExactive+’
    • ‘LTQ XL high res’
    • ‘LTQ XL low res’.

Example:

>>> us = ursgal.UController(
...    profile = 'LTQ XL low res',
...    params = { 'database': 'BSA.fasta' }
...)
add_estimated_fdr(input_file=None, force=False, output_file_name=None)

The UController add_estimated_fdr function

Parses a target/decoy search result file and adds a column called “estimated_FDR”.

The CSV must contain:

  • a column with a quality score for each PSM (e-value, error probability etc.)
  • a column called “Is decoy” indicating whether a PSM is decoy or target.
Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which has to be a .csv file meeting the criteria described above.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc.params['validation_score_field'] = 'e-value'
>>> uc.params['bigger_scores_better']   = False
>>> uc.add_estimated_fdr(
...     input_file = 'my_search_results.csv',
... )

Note

This function can be used to independently compare the performance of different quality scores (where performance is the ability to distinguish target PSMs from decoy PSMs).

Returns:Path of the output file
Return type:str
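
How such an estimate can be derived from a score column and the “Is decoy” flag is sketched below. This is a minimal illustration of target/decoy counting, not ursgal's actual implementation: the function name estimate_fdr, the list-of-dicts input and the estimator (decoys divided by targets among all PSMs scoring at least as well) are assumptions.

```python
def estimate_fdr(psms, score_field, bigger_scores_better=False):
    """Return the PSMs ranked best-first, each with an 'estimated_FDR' entry.

    psms: list of dicts holding a numeric score and a boolean 'Is decoy'.
    """
    ranked = sorted(
        psms,
        key=lambda psm: psm[score_field],
        reverse=bigger_scores_better,
    )
    targets = 0
    decoys = 0
    result = []
    for psm in ranked:
        if psm['Is decoy']:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoys / targets among all PSMs at least this good
        fdr = decoys / targets if targets else 1.0
        result.append({**psm, 'estimated_FDR': fdr})
    return result


# Hypothetical PSMs scored by e-value (smaller is better)
psms = [
    {'e-value': 1e-8, 'Is decoy': False},
    {'e-value': 1e-2, 'Is decoy': True},
    {'e-value': 1e-5, 'Is decoy': False},
    {'e-value': 1e-1, 'Is decoy': False},
]
with_fdr = estimate_fdr(psms, 'e-value', bigger_scores_better=False)
```

Running the same sketch with a different score_field on the same PSMs shows how well each score separates targets from decoys.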
collect_all_unodes_from_kb()

The ucontroller function to collect all unodes

Iterates over all files in the kb folder and checks whether each of them can be imported. Nodes in development are loaded (‘in_developement’ = True), but not shown in the UController overview.

Note: internal function

Returns:Dictionary of unodes
Return type:dict
combine_search_results(input_files, engine, force=None, output_file_name=None)

The ucontroller combine_search_results function combines search result .csv files that were generated by different search engines.

Keyword Arguments:
 
  • input_files (list) – A list containing the complete paths to two or more input files. Input files have to be unified result .csv files that were produced by different engines.
  • engine (str) – The name of the desired search result combiner. Can also be a shortened version if it is unambiguous.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc=ursgal.UController()
>>> unified_merged_results = [
...    'BSA_xtandem_piledriver_unified_merged.csv',
...    'BSA_msgfplus_unified_merged.csv',
...    'BSA_omssa_unified_merged.csv'
...]
>>> uc.combine_search_results(
...    input_files = unified_merged_results,
...    engine      = 'combine_FDR_0_1'
...)

Note

If you have multiple result files from the same engine, you can merge them with merge_csvs().

Returns:Path of the output file
Return type:str
convert_results_to_csv(input_file, force=None, output_file_name=None)

The ucontroller convert_results_to_csv function

Note: uses the Java mzidentml library (Reisinger et al., 2012)

Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which currently has to be an identification engine result file.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> us=ursgal.UController( profile='LTQ XL high res' )
>>> us.convert_results_to_csv(
...    input_file = 'my_result.xml',
...)
Returns:Path of the output file
Return type:str
convert_to_mgf_and_update_rt_lookup(input_file, force=None, output_file_name=None)

Converts the mzML to MGF and updates the scan ID to retention time lookup. The lookup is needed for unifying the .csv files.

Parameters:input_file (str) – mzML input file name
Returns:name of the output mgf file
Return type:str
determine_availability_of_unodes()

The ucontroller determine_availability_of_unodes function

Note: internal function

Checks for engines in ursgal/resources/<platform>/<architecture> and expects the executable to be in the corresponding folder.

distinguish_multi_and_single_input(in_input)

Finds out whether the input is a single file or a list of files and returns a bool indicating so, as well as the input file(s)
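
A minimal sketch of such a check, assuming that any list or tuple counts as multi-input and everything else as a single file. The name split_single_or_multi and the order of the returned values are hypothetical.

```python
def split_single_or_multi(in_input):
    """Return (is_multi, input), where is_multi is True for a list of files."""
    if isinstance(in_input, (list, tuple)):
        return True, list(in_input)
    return False, in_input


single = split_single_or_multi('BSA.mzML')
multi = split_single_or_multi(['BSA_1.csv', 'BSA_2.csv'])
```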

download_resources()

Function to download all executables from the specified HTTP URL

dump_multi_json(fpath, fdicts)

For UNodes that take multiple input files. Generates a json for the multi-input helper file. This json allows ursgal to check whether input changed or not, to determine if a node has to be re-run or not.
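
The idea of detecting changed input through a stored json can be sketched as follows. The MD5-based comparison, the json layout and all function names are assumptions made for illustration; they are not taken from ursgal.

```python
import hashlib
import json
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's content."""
    with open(path, 'rb') as handle:
        return hashlib.md5(handle.read()).hexdigest()


def dump_input_state(json_path, input_files):
    """Write a json mapping each input path to its current MD5."""
    state = {path: file_md5(path) for path in input_files}
    with open(json_path, 'w') as handle:
        json.dump(state, handle, indent=2, sort_keys=True)


def inputs_changed(json_path, input_files):
    """True if no state was stored yet or any input differs from it."""
    if not os.path.exists(json_path):
        return True
    with open(json_path) as handle:
        stored = json.load(handle)
    current = {path: file_md5(path) for path in input_files}
    return stored != current


# Usage with a throwaway input file
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, 'a.csv')
json_path = os.path.join(workdir, 'multi_helper.json')
with open(csv_path, 'w') as handle:
    handle.write('x\n')

first_check = inputs_changed(json_path, [csv_path])   # no json stored yet
dump_input_state(json_path, [csv_path])
second_check = inputs_changed(json_path, [csv_path])  # input unchanged
with open(csv_path, 'a') as handle:
    handle.write('y\n')
third_check = inputs_changed(json_path, [csv_path])   # content was modified
```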

engine_sanity_check(short_engine)

The ucontroller engine_sanity_check function

Takes an engine short name and tries to guess the full engine name, e.g. including the version number. omssa as input will yield omssa_2_1_9 if there is only one omssa engine installed, i.e. the mapping <stored_full_engine_name>.startswith( <input> ) has to be unique and defined.

Additionally, the sanity check validates that the engine is available on the system.

Note: internal function, since an AssertionError is raised if the check fails.

Parameters:short_engine (str) – engine short name or tag

Calls self.guess_engine_name()

Returns:Full name of the engine or None.
Return type:str
eval_if_run_needs_to_be_executed(engine=None, force=None)

Returns the reason why self.run needs to be executed or None if there is no need

execute_unode(input_file, engine, force=False, output_file_name=None, dry_run=False)

The UController execute_unode function. Executes arbitrary UNodes, as specified by their name.

Keyword Arguments:
 
  • input_file (str or list of str) – The complete path to the input, or a list of paths to the input files.
  • engine (str) – Name of the engine to execute
  • force (bool) – (Re)do the analysis, even if output files already exist
  • dry_run (bool) – Do not execute; only return the output file name

Note

Can also execute UNodes that are tagged as ‘in development’ in kb (=not shown in UController overview) if their name is specified.

fetch_file(engine=None)

The UController fetch_file function

Downloads files (FTP or HTTP).

Keyword Arguments:
 engine (str) – Available options are ‘get_http_files_1_0_0’ and ‘get_ftp_files_1_0_0’

Example:

>>> params = {
...     'ftp_url'       : 'ftp.peptideatlas.org',
...     'ftp_login'         : 'PASS00269',
...     'ftp_password'      : 'FI4645a',
...     'ftp_include_ext'   : [
...         'JB_FASP_pH8_2-3_28122012.mzML',
...     ],
...     'ftp_output_folder' : '/home/Desktop/',
... }
>>> uc = ursgal.UController(
...     params = params
... )
>>> uc.fetch_file(
...     engine     = 'get_ftp_files_1_0_0'
... )
Returns:Path of the downloaded file
Return type:str
filter_csv(input_file, force=False, output_file_name=None)

The UController filter_csv function

Filters .csv files row-wise according to user-defined rules.

Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which currently has to be a .csv file.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

The filter rules have to be defined in the params. See the engine documentation for further information ( filter_csv_1_0_0._execute() ).

Example

>>> # Only rows with these attributes will be retained:
>>> # a) 'PEP' column value must be lower than or equal to 0.01
>>> # b) 'Is decoy' column value must equal 'false'
>>> uc.params['csv_filter_rules'] = [
...     ['PEP',      'lte',    0.01   ],
...     ['Is decoy', 'equals', 'false']
... ]
>>> uc.filter_csv( 'my_results.csv' )
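
How rules of this shape select rows can be sketched in plain Python. This is an illustration only, not the filter_csv_1_0_0 implementation: the OPERATORS table is an assumption ('lte' and 'equals' come from the example above, 'gte' is added for symmetry), and the CSV content is made up.

```python
import csv
import io

# Assumed mapping from rule keywords to comparisons
OPERATORS = {
    'lte': lambda value, ref: float(value) <= float(ref),
    'gte': lambda value, ref: float(value) >= float(ref),
    'equals': lambda value, ref: str(value) == str(ref),
}


def row_passes(row, rules):
    """True if the row satisfies every [column, operator, reference] rule."""
    return all(OPERATORS[op](row[column], ref) for column, op, ref in rules)


def filter_rows(rows, rules):
    """Keep only the rows that pass all rules."""
    return [row for row in rows if row_passes(row, rules)]


rules = [
    ['PEP', 'lte', 0.01],
    ['Is decoy', 'equals', 'false'],
]
# Made-up result table; csv.DictReader yields every value as a string
csv_text = (
    'Sequence,PEP,Is decoy\n'
    'PEPTIDEK,0.001,false\n'
    'ELVISK,0.5,false\n'
    'KEDITPEP,0.001,true\n'
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
kept = filter_rows(rows, rules)
```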
generate_multi_file_dicts(input_files)

Generates a file_dict for access in the UNode classes. In the UNode classes, a file_dict can be found for each input file under self.params[“input_file_dicts”]. These file dicts contain the input/output file dicts for that file, as well as some “quick-access” entries (i.e. “last_engine”).

generate_multi_helper_file(input_files)

For UNodes that take multiple input files. Generates a temporary single-input helper file, which acts as the input file so that all the routines (set_io, write history) work normally with multiple files.

generate_target_decoy(input_files=None, engine=None, force=False, output_file_name=None)

The ucontroller function for target_decoy database generation.

Keyword Arguments:
 
  • input_files (list) – List with complete paths to one or more fasta databases.
  • engine (str) – Name of the database generator which should be run; can also be a short version if this name is unambiguous.
  • force (bool) – (Re)do the analysis, even if output files already exist.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> my_databases = ['homo_sapiensA.fasta', 'homo_sapiensB.fasta']
>>> uc = ursgal.UController()
>>> new_target_decoy_db = uc.generate_target_decoy(
...    input_files      = my_databases,
...    engine           = 'generate_target_decoy_1_0_0',
...    output_file_name = 'my_homo_sapiens_target_decoy_db.fasta'
...)

The returned database can then be set as the new database for searches.

Example:

>>> uc.params['database'] = new_target_decoy_db
Returns:Name/path of the output file
Return type:str
get_mzml_that_corresponds_to_mgf(mgf_path)

Checks the history of an MGF file to determine which mzML it stems from. Returns the path to that mzML.

guess_engine_name(short_engine)

The ucontroller function for guessing the right engine name from a short name. For example, ‘omssa’ is translated into omssa_2_1_9, which is the only available version of omssa in ursgal. If a short name is ambiguous, e.g. because an engine has multiple versions, the engine has to be named unambiguously: instead of myrimatch, use myrimatch_2_1_138.

Parameters:short_engine (str) – engine short name or tag

Iterates over self.unodes.keys() and checks if:

  • the keys start with the short_engine
  • that the match is unique

Note: internal function

Returns:Full name of the engine, or None if short_engine has multiple hits
Return type:str
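
The unique-prefix rule described above can be sketched as follows. The function name resolve_engine_name and the engine list are illustrative assumptions; in ursgal the candidates come from self.unodes.keys().

```python
def resolve_engine_name(short_engine, available_engines):
    """Return the single engine name starting with short_engine, else None."""
    matches = [
        name for name in available_engines
        if name.startswith(short_engine)
    ]
    if len(matches) == 1:
        return matches[0]
    # No hit or several hits: the short name cannot be resolved
    return None


# Illustrative list of installed engines
engines = [
    'omssa_2_1_9',
    'myrimatch_2_1_138',
    'myrimatch_2_2_140',
    'xtandem_piledriver',
]
unique_hit = resolve_engine_name('omssa', engines)
ambiguous_hit = resolve_engine_name('myrimatch', engines)
full_name_hit = resolve_engine_name('myrimatch_2_1_138', engines)
```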
input_file_sanity_check(input_file, engine=None, extensions=None, multi=False, custom_str=None)

The ucontroller input_file_sanity_check function

Asserts that input files exist, can be read, have the right file type and file extension etc. Raises an AssertionError if any criterion is violated.

Keyword Arguments:
 
  • input_file (str or list) – input file path to be checked, or a list of input file paths in the case of multi-nodes
  • engine (str) – the name of the engine, file extension requirements will be looked up in engine/kb (optional)
  • extensions (list) – a list of permitted file extensions (optional)
  • multi (bool) – whether the UNode accepts multiple input files or not

Note

Internal Function

Returns:None
merge_csvs(input_files, force=None, output_file_name=None)

The ucontroller merge_csvs function

Merges unified .csv files generated by the same search engine into a single .csv file. This is needed if you want to validate search results from the same identification engine on multiple mzML files, for example if multiple fractions of the original sample were measured by LC-MS/MS and together represent one sample/analysis entity.

Keyword Arguments:
 
  • input_files (list) – A list containing the complete paths to two or more input files. Input files have to be .csv files.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> us = ursgal.UController()
>>> xtandem_results = [
...     'BSA_1_xtandem_sledgehammer_unified.csv',
...     'BSA_2_xtandem_sledgehammer_unified.csv',
...     'BSA_3_xtandem_sledgehammer_unified.csv'
... ]
>>> us.merge_csvs( input_files = xtandem_results )
Returns:Path of the output file
Return type:str
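
The merge itself amounts to concatenating the rows of several tables under one header. The following sketch works on CSV text held in memory and is not the merge_csvs implementation; the helper name merge_csv_texts and the column names are assumptions.

```python
import csv
import io


def merge_csv_texts(csv_texts):
    """Concatenate the rows of several CSV texts under one shared header."""
    fieldnames = []
    merged_rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Collect column names in order of first appearance
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        merged_rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, restval='', lineterminator='\n'
    )
    writer.writeheader()
    writer.writerows(merged_rows)
    return out.getvalue()


fraction_1 = 'Sequence,PEP\nAAK,0.1\n'
fraction_2 = 'Sequence,PEP\nCCK,0.2\n'
merged = merge_csv_texts([fraction_1, fraction_2])
```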
merge_fdicts(*fdicts)
prepare_resources(root_zip_target_folder)
run_unode_if_required(force, engine_name, answer, history_addon=None)

The ucontroller run_unode_if_required function

Note

internal function

Executes a UNode if required. Otherwise prints why the run was not required. If the UNode is executed, the corresponding json is dumped and the history is updated.

Keyword Arguments:
 
  • force (bool) – (Re)do the analysis, even if output files already exist.
  • engine_name (str) – name of the engine to be executed (after verifying with engine_sanity_check )
  • answer (str or None) – The answer of prepare_unode_run(). Can be None if no re-run is required, or a string indicating the reason for re-run
sanitize_csv(input_file, force=False, output_file_name=None)

The UController sanitize_csv function

Result files (.csv) are sanitized following defined parameters. That means that, for each spectrum, the PSMs are compared and the best PSM(s) is (are) chosen.

Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which currently has to be a .csv file.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

The parameters have to be defined in the params. See the engine documentation for further information ( sanitize_csv_1_0_0._execute() ).

Example

>>> # Only the best PSM for one spectrum is retained,
>>> # and only if its PEP differs from the second-best by
>>> # at least two orders of magnitude
>>> uc.params['validation_score_field'] = 'PEP'
>>> uc.params['bigger_scores_better'] = False
>>> uc.params['score_diff_threshold'] = 2
>>> uc.params['threshold_is_log10'] = True
>>> uc.sanitize_csv( 'my_results.csv' )
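
The selection rule from the example can be sketched as follows. This is an illustration under stated assumptions, not the sanitize_csv_1_0_0 implementation: PSMs are plain dicts, spectra are grouped by an assumed 'Spectrum ID' key, scores are assumed to be positive so that log10 is defined, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def best_psm_per_spectrum(psms, score_field, bigger_scores_better,
                          score_diff_threshold, threshold_is_log10):
    """Keep the best PSM of each spectrum if it clearly beats the runner-up."""
    by_spectrum = defaultdict(list)
    for psm in psms:
        by_spectrum[psm['Spectrum ID']].append(psm)

    kept = []
    for spectrum_psms in by_spectrum.values():
        ranked = sorted(
            spectrum_psms,
            key=lambda psm: psm[score_field],
            reverse=bigger_scores_better,
        )
        best = ranked[0]
        if len(ranked) > 1:
            second = ranked[1]
            if threshold_is_log10:
                # Difference in orders of magnitude
                diff = abs(
                    math.log10(best[score_field])
                    - math.log10(second[score_field])
                )
            else:
                diff = abs(best[score_field] - second[score_field])
            if diff < score_diff_threshold:
                continue  # best and second-best are too close, drop spectrum
        kept.append(best)
    return kept


# Made-up PSMs for three spectra
psms = [
    {'Spectrum ID': 1, 'Sequence': 'A', 'PEP': 1e-5},
    {'Spectrum ID': 1, 'Sequence': 'B', 'PEP': 1e-2},
    {'Spectrum ID': 2, 'Sequence': 'C', 'PEP': 1e-3},
    {'Spectrum ID': 2, 'Sequence': 'D', 'PEP': 5e-3},
    {'Spectrum ID': 3, 'Sequence': 'E', 'PEP': 1e-4},
]
kept = best_psm_per_spectrum(
    psms,
    score_field='PEP',
    bigger_scores_better=False,
    score_diff_threshold=2,
    threshold_is_log10=True,
)
```

In this made-up data, spectrum 2 is dropped because its two candidates differ by less than one order of magnitude.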
sanitize_userdefined_output_filename(user_fname, engine)

If the user defined a node output file name, we remove all path info from it (not supported) and throw a warning; possibly add a prefix; possibly add the correct file extension (if user didn’t already include it)
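
These three steps can be sketched as follows. The function name, the explicit extension and prefix arguments, and the returned flag (standing in for the warning) are hypothetical; in ursgal the extension and prefix are derived from the engine and the current parameters.

```python
import os


def sanitize_output_filename(user_fname, extension, prefix=None):
    """Strip path info, optionally add a prefix, ensure the file extension.

    Returns the sanitized name and whether path information was removed.
    """
    basename = os.path.basename(user_fname)
    path_was_removed = basename != user_fname
    if prefix and not basename.startswith(prefix):
        basename = f'{prefix}_{basename}'
    if not basename.endswith(extension):
        basename += extension
    return basename, path_was_removed


with_path = sanitize_output_filename('/tmp/results/my_out', '.csv')
with_prefix = sanitize_output_filename('my_out.csv', '.csv', prefix='BSA')
```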

search(input_file, engine, force=None, output_file_name=None)

The ucontroller search function

Performs a peptide search using the specified search engine and mzML file. Produces a CSV file with peptide spectrum matches in the unified Ursgal CSV format. See also: List of available engines

Keyword Arguments:
 
  • input_file (str) – The complete path to the mzML file, or an MGF file that was converted from mzML.
  • engine (str) – The name of the identification engine which should be used; can also be a short version if this name is unambiguous.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController(
...    profile = 'LTQ XL high res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> uc.search(
...    input_file = 'BSA.mzML',
...    engine     = 'omssa'
... )
Returns:Path of the output file (unified CSV format)
Return type:str

Note

Some search engines require a lot of RAM (up to 14GB, depending on your input files). If you don’t have a lot of RAM, some engines might crash. Consider using X!Tandem or OMSSA in these cases, since they are less demanding.

Note

This function calls four search-related ursgal functions in succession, all of which can also be called individually:
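
The four functions are not listed at this point. Assuming they are convert_to_mgf_and_update_rt_lookup(), search_mgf(), convert_results_to_csv() and unify_csv(), all of which are documented in this reference, the chain could be sketched as below. The helper search_step_by_step and the stand-in class are hypothetical; the stand-in only records calls and invents file names, so that the sketch runs without ursgal.

```python
def search_step_by_step(uc, mzml_file, engine):
    """Call the assumed four search steps one after another."""
    mgf_file = uc.convert_to_mgf_and_update_rt_lookup(input_file=mzml_file)
    raw_result = uc.search_mgf(input_file=mgf_file, engine=engine)
    csv_result = uc.convert_results_to_csv(input_file=raw_result)
    return uc.unify_csv(input_file=csv_result)


class RecordingController:
    """Stand-in for a UController; file names are invented, not ursgal's."""

    def __init__(self):
        self.calls = []

    def convert_to_mgf_and_update_rt_lookup(self, input_file):
        self.calls.append('convert_to_mgf_and_update_rt_lookup')
        return input_file.replace('.mzML', '.mgf')

    def search_mgf(self, input_file, engine):
        self.calls.append('search_mgf')
        return input_file.replace('.mgf', f'_{engine}.xml')

    def convert_results_to_csv(self, input_file):
        self.calls.append('convert_results_to_csv')
        return input_file.replace('.xml', '.csv')

    def unify_csv(self, input_file):
        self.calls.append('unify_csv')
        return input_file.replace('.csv', '_unified.csv')


stub = RecordingController()
unified = search_step_by_step(stub, 'BSA.mzML', 'omssa')
```

With a real UController in place of the stand-in, the same helper would pass each step's output path on to the next step.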

search_mgf(input_file, engine, force=None, output_file_name=None)

The UController search_mgf function

Does the main peptide identification search with the specified identification engine. This function is called once for every input file and every search engine that should be used. It uses UNode.run() to execute a single search engine, for example to execute X!Tandem via the command line.

Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which has to be a .MGF file (.mzML files can be converted to .MGF with Ursgal).
  • engine (str) – The name of the identification engine which should be run; can also be a short version if this name is unambiguous.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController(
...    profile ='LTQ XL high res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> uc.search_mgf(
...    input_file = 'BSA.mgf',
...    engine     = 'xtandem_piledriver'
... )
Returns:Path of the output file
Return type:str

Note

Consider using search() instead. search() automatically converts mzML to MGF and produces a unified CSV output file.

set_file_info_dict(in_file)

Splits the input file path into its components (extension, path and so on)
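
A minimal sketch of such a split using os.path. The function name file_info and the dictionary keys are hypothetical and do not reflect the keys ursgal actually stores.

```python
import os


def file_info(in_file):
    """Split a file path into directory, file name, root and extension."""
    directory, file_name = os.path.split(in_file)
    file_root, extension = os.path.splitext(file_name)
    return {
        'dir': directory,
        'file': file_name,
        'file_root': file_root,
        'file_extension': extension,
    }


info = file_info('/data/BSA.mzML')
```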

set_profile(profile, dev_mode=False)

The ucontroller set_profile function

Note

internal function

Parameters:profile (str) – Profile specified for use in all searches.

Available profiles:

  • ‘QExactive+’
  • ‘LTQ XL high res’
  • ‘LTQ XL low res’

Sets self.params according to profile name defined in ursgal.kb.profiles

Example:

>>> 'LTQ XL low res' : {
...    # MS 1 orbitrap & MSn iontrap
...    'frag_mass_tolerance'       : 0.5,
...    'frag_mass_tolerance_unit'  : 'da',
...    'instrument'                : 'low_res_LTQ',
...    'frag_method'               : 'cid'
...}

Custom profiles can easily be defined in profiles.py in ursgal/kb according to the required parameters or machine specifications.
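
Applying a profile can be pictured as writing its entries into the params dictionary. The sketch below assumes that profile values overwrite existing entries; the PROFILES dictionary repeats the excerpt above, while apply_profile and its precedence rule are assumptions rather than ursgal's implementation.

```python
# Stand-in for the profile definitions in ursgal/kb/profiles.py
PROFILES = {
    'LTQ XL low res': {
        'frag_mass_tolerance': 0.5,
        'frag_mass_tolerance_unit': 'da',
        'instrument': 'low_res_LTQ',
        'frag_method': 'cid',
    },
}


def apply_profile(params, profile, profiles=PROFILES):
    """Return a copy of params updated with the entries of the profile."""
    if profile not in profiles:
        raise KeyError(f'Unknown profile: {profile!r}')
    merged = dict(params)
    # Assumption: profile values take precedence over existing entries
    merged.update(profiles[profile])
    return merged


params = apply_profile(
    {'database': 'BSA.fasta', 'frag_method': 'hcd'},
    'LTQ XL low res',
)
```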

show_unode_overview()

The ucontroller show_unode_overview function

Note

internal function

Prints the overview of all available nodes. The overview includes the category, name and availability of each node. Available nodes are highlighted. This also verifies that the engine availability check and the installation work correctly.

unify_csv(input_file, force=False, output_file_name=None)

The ucontroller unify_csv function

Unifies the .csv files which were converted by the mzidentml library. The corrections for each engine are listed in the node under ursgal/resources/arc_independent/unify_csv_1_0_0

Keyword Arguments:
 
  • input_file (str) – The complete path to the input file, which currently has to be a .csv file.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc=ursgal.UController(
...     profile = 'LTQ XL low res',
...     params  = {'database': 'BSA.fasta'}
... )
>>> xtandem_result_xml = uc.search_mgf(
...     input_file = 'BSA.mgf',
...     engine     = 'xtandem',
... )
>>> xtandem_result_csv = uc.convert_results_to_csv(
...     input_file = xtandem_result_xml
... )
>>> unified_csv = uc.unify_csv(
...     input_file = xtandem_result_csv
... )
Returns:Path of the output file
Return type:str
validate(input_file, engine, force=None, output_file_name=None)

The UController validate function

Does statistical post-processing of unified search result .csv files with the specified validation engine.

Depending on the validation method a posterior error probability (PEP) and/or a q-value will be available in the final results.

Keyword Arguments:
 
  • input_file (str) – The complete path to the input, a unified (and possibly merged) search result .csv.
  • engine (str) – The name of the validation engine which should be run; can also be a short version if this name is unambiguous.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Note

Input files to validate() must be in unified csv format (i.e. output files of search() or unify_csv()).

Example:

>>> uc = ursgal.UController(
...    profile = 'LTQ XL low res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> xtandem_result_csv = uc.search(
...    input_file = 'BSA.mzML',
...    engine     = 'xtandem_piledriver'
... )
>>> validated_csv = uc.validate(
...    input_file = xtandem_result_csv,
...    engine     = 'percolator_2_08'
... )
Returns:Path of the output file
Return type:str
verify_engine_produced_an_output_file(expected_fpath, engine_name)

Since not all engines raise an exception when they fail, we check if the output file was successfully produced or not to throw a proper exception in case the engine crashed.
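
A minimal sketch of such a check. The function name verify_output_exists and the choice of RuntimeError are assumptions; the exception type ursgal raises is not stated here.

```python
import os
import tempfile


def verify_output_exists(expected_fpath, engine_name):
    """Raise an error if the engine did not write its expected output file."""
    if not os.path.isfile(expected_fpath):
        raise RuntimeError(
            f'{engine_name} did not produce the expected output file: '
            f'{expected_fpath}'
        )
    return expected_fpath


# A path inside a fresh temporary folder that was never written
missing = os.path.join(tempfile.mkdtemp(), 'no_such_output.csv')
try:
    verify_output_exists(missing, 'hypothetical_engine')
    raised = False
except RuntimeError:
    raised = True
```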

visualize(input_files, engine, force=None, output_file_name=None, multi=True)

The ucontroller function for visualization

Does graphical visualization of result .csv files.

Keyword Arguments:
 
  • input_files (list) – list with complete paths of .csv files
  • engine (str) – The name of the visualizer which should be run; can also be a short version if this name is unambiguous.
  • force (bool) – (Re)do the analysis, even if output file already exists.
  • output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController( profile='LTQ XL high res' )
>>> xtandem_result_csv = uc.search(
...     input_file = 'BSA.mzML',
...     engine     = 'xtandem_piledriver',
... )
>>> omssa_result_csv = uc.search(
...     input_file = 'BSA.mzML',
...     engine     = 'omssa',
... )
>>> uc.visualize(
...     input_files = [xtandem_result_csv, omssa_result_csv],
...     engine      = 'venndiagram',
... )

Note

For detailed information about the VennDiagram UNode, see venndiagram_1_0_0._execute().

Returns:Path of the output file
Return type:str