UController¶

class ursgal.ucontroller.UController(*args, **kwargs)¶

ursgal main class

Keyword Arguments:
params (dict) – Params that are used for all further analyses, overriding default values from ursgal/uparams.py.
profile (str) – Profile key for faster parameter selection. This idea is adapted from MS-GF+ and translated to all search engines. Currently available profiles are: ‘QExactive+’, ‘LTQ XL high res’, ‘LTQ XL low res’.

Example:

>>> us = ursgal.UController(
...    profile = 'LTQ XL low res',
...    params = { 'database': 'BSA.fasta' }
...)

combine_search_results(input_files, engine=None, force=None, output_file_name=None)¶

The ucontroller combine_search_results function combines search result .csv files that were generated by different search engines.

Keyword Arguments:

input_files (list) – A list containing the complete paths to two or more input files. Input files have to be unified result .csv files that were produced by different engines.
engine (str) – The name of the desired search result combiner. Can also be a shortened version if it is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc=ursgal.UController()
>>> unified_merged_results = [
...    'BSA_xtandem_piledriver_unified_merged.csv',
...    'BSA_msgfplus_unified_merged.csv',
...    'BSA_omssa_unified_merged.csv'
...]
>>> uc.combine_search_results(
...    input_files = unified_merged_results,
...    engine      = 'combine_FDR_0_1'
...)

Note

If you have multiple result files from the same engine, you can merge them with merge_csvs().

Returns:	Path of the output file
Return type:	str

convert(input_file, engine=None, force=None, output_file_name=None, guess_engine=False)¶

The UController convert function converts the given input_file into another format as defined by the specified engine.

Keyword Arguments:

input_file (str) – The complete path to the input file.
engine (str) – The name of the desired converter engine. Can also be a shortened version if it is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.
guess_engine (bool) – The converter engine is guessed based on the input file. This works so far for mzml2mgf conversion and conversion of search_engine result files to csv.

Example:

>>> uc=ursgal.UController()
>>> unified_merged_results = 'BSA_msgfplus_unified_merged.csv'
>>> uc.convert(
...    input_file = unified_merged_results,
...    engine     = 'csv2ssl_1_0_0'
...)

Returns:	Path of the output file
Return type:	str

convert_results_to_csv(input_file, force=None, output_file_name=None)¶

The ucontroller convert_results_to_csv function

Note: uses the Java mzidentml library (Reisinger et al., 2012)

Keyword Arguments:
input_file (str) – The complete path to the input. The input file currently has to be an identification engine result file.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> us=ursgal.UController( profile='LTQ XL high res' )
>>> us.convert_results_to_csv(
...    input_file = 'my_result.xml',
...)

Returns:	Path of the output file
Return type:	str

Notes: internal function, use convert() instead

convert_to_mgf_and_update_rt_lookup(input_file, force=None, output_file_name=None)¶

Converts the mzML to mgf and updates the scanID to retention time lookup. The lookup is needed for unifying the .csv files.

Parameters:	input_file (str) – mzML input file name
Returns:	name of the output mgf file
Return type:	str

Notes: internal function, use convert() instead

determine_availability_of_unodes()¶

The ucontroller determine_availability_of_unodes function

Note: internal function

Checks for engines in ursgal/resources/<platform>/<architecture> and expects the executable to be in the corresponding folder.

distinguish_multi_and_single_input(in_input)¶: Finds out whether the input is a single file or a list of files and returns a bool indicating so, as well as the input file(s)
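
The behaviour described above can be sketched in a few lines. This is only an illustration, not the ursgal implementation: the function name `split_single_and_multi_input` is hypothetical, and treating tuples like lists is an assumption.

```python
def split_single_and_multi_input(in_input):
    """Return (is_multi, in_input) for a single path or a list of paths.

    Hypothetical helper for illustration; not part of ursgal.
    """
    # A list (or tuple) of paths counts as multi-file input.
    if isinstance(in_input, (list, tuple)):
        return True, list(in_input)
    # A plain string is treated as exactly one input file.
    if isinstance(in_input, str):
        return False, in_input
    raise TypeError("expected a path string or a list of path strings")


is_multi, files = split_single_and_multi_input(["a_unified.csv", "b_unified.csv"])
```

Returning the input alongside the flag lets the caller continue with one normalized value regardless of which form was passed in.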

download_resources(resources=None)¶

Function to download all executables from the specified http url

Keyword Arguments:
	resources (list) – list of specific resources that should be downloaded. If left to None, all possible resources are downloaded.

dump_multi_json(fpath, fdicts)¶: For UNodes that take multiple input files. Generates a json for the multi-input helper file. This json allows ursgal to check whether input changed or not, to determine if a node has to be re-run or not.
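
As a rough sketch of how such a json can be used to detect changed input, the example below writes the file dicts to disk and later compares them with the current ones. The function names `write_multi_json` and `multi_input_changed` and the layout of the dicts are assumptions made for illustration; the format ursgal actually writes may differ.

```python
import json
import os


def write_multi_json(fpath, fdicts):
    """Store the file dicts of all input files in one json file (sketch)."""
    with open(fpath, "w") as handle:
        json.dump(fdicts, handle, indent=2, sort_keys=True)
    return fpath


def multi_input_changed(fpath, fdicts):
    """Return True if no json exists yet or the stored dicts differ."""
    if not os.path.exists(fpath):
        return True
    with open(fpath) as handle:
        previous = json.load(handle)
    return previous != fdicts
```

A node would only need to be re-run when `multi_input_changed()` reports a difference, for example because one of the input files was replaced.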

engine_sanity_check(short_engine)¶

The ucontroller engine_sanity_check function

Takes the input name and tries to guess the full engine name, e.g. including the version number. omssa as input will yield omssa_2_1_9 if there is only one omssa engine installed, i.e. the mapping <stored_full_engine_name>.startswith( <input> ) has to be unique and defined.

Additionally, sanity check also validates if engine is available on the system.

Note: internal function, since an AssertionError is raised if the check fails.

Parameters:	short_engine (str) – engine short name or tag

calls self.guess_engine_name()

Returns:	Full name of the engine or None.
Return type:	str

eval_if_run_needs_to_be_executed(engine=None, force=None)¶: Returns the reason why self.run needs to be executed or None if there is no need
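
The "reason or None" return convention can be illustrated as follows. The specific checks (force flag, missing output file, changed input) are assumptions pieced together from the parameter descriptions on this page, and `reason_to_run` is a hypothetical name; the actual criteria used by ursgal may differ.

```python
import os


def reason_to_run(output_file, force=False, input_changed=False):
    """Return a string explaining why a run is needed, or None (sketch)."""
    if force:
        return "force is True"
    if not os.path.exists(output_file):
        return "output file does not exist yet"
    if input_changed:
        return "input changed since the last run"
    # Nothing requires a re-run.
    return None
```

Returning the reason instead of a bare bool lets the caller print why a node was (or was not) executed.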

execute_misc_engine(input_file, engine=None, force=None, output_file_name=None, merge_duplicates=False)¶

The UController execute_misc_engine function

This function can be used to execute any misc engine by only giving the input_file and engine name.

Keyword Arguments:

input_file (str) – The complete path to the input, a unified (and possibly merged) search result .csv.
engine (str) – The name of the validation engine which should be run, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.
merge_duplicates (bool) – If True, the produced output file will be checked for duplicated PSMs, which will be merged into a single line. Caution, the original output file will be overwritten!

Note

Input files to validate() must be in unified csv format (i.e. output files of search() or unify_csv()).

Example:

>>> my_databases = ['homo_sapiensA.fasta', 'homo_sapiensB.fasta']
>>> uc = ursgal.UController()
>>> new_target_decoy_db = uc.execute_misc_engine(
...    input_file       = my_databases,
...    engine           = 'generate_target_decoy_1_0_0',
...    output_file_name = 'my_homo_sapiens_target_decoy_db.fasta'
...)

Returns:	Path of the output file
Return type:	str

execute_unode(input_file, engine=None, force=False, output_file_name=None, dry_run=False, merge_duplicates=False)¶

The UController execute_unode function. Executes arbitrary UNodes, as specified by their name.

Keyword Arguments:
input_file (str or list of str) – The complete path to the input, or a list of paths to the input files.
engine (str) – Name of the engine to execute.
force (bool) – (Re)do the analysis, even if output file already exists.
dry_run (bool) – Do not execute; only return the output file name.

Note

Can also execute UNodes that are tagged as ‘in development’ in kb (=not shown in UController overview) if their name is specified.

fetch_file(engine=None)¶

The UController fetch_file function

Downloads files (FTP or HTTP).

Keyword Arguments:
	engine (str) – Available options are ‘get_http_files_1_0_0’ and ‘get_ftp_files_1_0_0’

Example:

>>> params = {
...     'ftp_url'           : 'ftp.peptideatlas.org',
...     'ftp_login'         : 'PASS00269',
...     'ftp_password'      : 'FI4645a',
...     'ftp_include_ext'   : [
...         'JB_FASP_pH8_2-3_28122012.mzML',
...     ],
...     'ftp_output_folder' : '/home/Desktop/',
... }
>>> uc = ursgal.UController(
...     params = params
... )
>>> uc.fetch_file(
...     engine     = 'get_ftp_files_1_0_0'
... )

Returns:	Path of the downloaded file
Return type:	str

filter_csv(input_file, force=False, output_file_name=None)¶

[ WARNING ] This function is not supported anymore!: Please use execute_misc_engine() instead

The UController filter_csv function

Filters .csv files row-wise according to user-defined rules.

Keyword Arguments:
input_file (str) – The complete path to the input. The input file currently has to be a .csv file.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

The filter rules have to be defined in the params. See the engine documentation for further information ( filter_csv_1_0_0._execute() ).

Example

>>> # Only rows with these attributes will be retained:
>>> # a) 'PEP' column value must be lower than or equal to 0.01
>>> # b) 'Is decoy' column value must equal 'false'
>>> uc.params['csv_filter_rules'] = [
...     ['PEP',      'lte',    0.01   ],
...     ['Is decoy', 'equals', 'false']
... ]
>>> uc.filter_csv( 'my_results.csv' )
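
To make the rule format concrete, the following standalone sketch applies the two rules from the example to a small in-memory CSV. It only covers the `lte` and `equals` operators shown above and is not the filter_csv_1_0_0 implementation; the helper `filter_rows`, the `OPERATORS` table and the sample data are made up for illustration.

```python
import csv
import io

# Only the two operators used in the example above are sketched here.
OPERATORS = {
    "lte": lambda value, limit: float(value) <= float(limit),
    "equals": lambda value, expected: str(value) == str(expected),
}


def filter_rows(rows, rules):
    """Keep only rows that satisfy every [column, operator, target] rule."""
    kept = []
    for row in rows:
        if all(OPERATORS[op](row[column], target) for column, op, target in rules):
            kept.append(row)
    return kept


# Made-up sample data with the two columns referenced by the rules.
csv_text = (
    "Sequence,PEP,Is decoy\n"
    "PEPTIDEK,0.001,false\n"
    "EDITPEPK,0.2,false\n"
    "KEDITPEP,0.005,true\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
rules = [["PEP", "lte", 0.01], ["Is decoy", "equals", "false"]]
filtered = filter_rows(rows, rules)
```

Here only the first row passes both rules: the second fails the PEP threshold and the third is a decoy.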

generate_multi_file_dicts(input_files)¶: Generates a file_dict for access in the UNode classes. In the UNode classes, a file_dict can be found for each input file under self.params[“input_file_dicts”]. These file dicts contain the input/output file dicts for that file, as well as “quick-access” information (i.e. “last_engine”).

generate_multi_helper_file(input_files)¶: For UNodes that take multiple input files. Generates a temporary single input helper file, which acts as the input file so that all the routines (set_io, write history) work normally with multiple files.

generate_target_decoy(input_files=None, engine=None, force=False, output_file_name=None)¶

[ WARNING ] This function is not supported anymore!: Please use execute_misc_engine() instead

The ucontroller function for target_decoy database generation.

Keyword Arguments:

input_files (list) – List with complete paths to one or more fasta databases.
engine (str) – Name of the database generator which should be run, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> my_databases = ['homo_sapiensA.fasta', 'homo_sapiensB.fasta']
>>> uc = ursgal.UController()
>>> new_target_decoy_db = uc.generate_target_decoy(
...    input_files      = my_databases,
...    engine           = 'generate_target_decoy_1_0_0',
...    output_file_name = 'my_homo_sapiens_target_decoy_db.fasta'
...)

The returned database can then be set as the new database for searches.

Example:

>>> uc.params['database'] = new_target_decoy_db

Returns:	Name/path of the output file
Return type:	str

get_mzml_that_corresponds_to_mgf(mgf_path)¶: Checks the history of an MGF file to determine which mzML it stems from. Returns the path to that mzML.

guess_engine_name(short_engine)¶

The ucontroller function for guessing the right engine name from a short name. For example ‘omssa’ is translated into omssa_2_1_9, which is the only available version of omssa in ursgal. If the short name is ambiguous or if an engine has multiple versions, the engine must be named unambiguously, e.g. myrimatch_2_1_138 instead of myrimatch.

Parameters:	short_engine (str) – engine short name or tag

Iterates over self.unodes.keys() and checks if:

the keys start with the short_engine

that the match is unique

Notes: internal function

Returns:	Full name of engine or None if short_engine has multiple hits
Return type:	str
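
The prefix matching described above can be sketched as a standalone function. This is an illustration under assumptions, not the method itself: the real method iterates over self.unodes.keys(), whereas this sketch takes the engine names as an explicit argument, and the engine list is illustrative. Returning None when nothing matches is also an assumption.

```python
def guess_engine_name(short_engine, available_engines):
    """Return the unique engine name starting with short_engine, else None."""
    matches = [
        name for name in available_engines if name.startswith(short_engine)
    ]
    # Only an unambiguous prefix is accepted.
    if len(matches) == 1:
        return matches[0]
    return None


# Illustrative engine names; the second myrimatch version is only here
# to demonstrate an ambiguous short name.
engines = ["omssa_2_1_9", "myrimatch_2_1_138", "myrimatch_2_2_140"]
full_name = guess_engine_name("omssa", engines)
```

With this list, `'omssa'` resolves to a single engine, while `'myrimatch'` matches two entries and therefore yields None until the version is spelled out.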

input_file_sanity_check(input_file, engine=None, extensions=None, multi=False, custom_str=None)¶

The ucontroller input_file_sanity_check function

Asserts that input files exist, can be read, have the right file type and file extension etc. Raises an AssertionError if any criterion is violated.

Keyword Arguments:
input_file (str or list) – Input file path to be checked, or a list of input file paths in the case of multi-nodes.
engine (str) – The name of the engine; file extension requirements will be looked up in engine/kb (optional).
extensions (list) – A list of permitted file extensions (optional).
multi (bool) – Whether the UNode accepts multiple input files or not.

Note

Internal Function

Returns:	None

map_peptides_to_fasta(input_file, force=False, output_file_name=None)¶

[ WARNING ] This function is not supported anymore!: Please use execute_misc_engine() instead

The ucontroller function to call the upeptide_mapper node.

Note

Different converter versions can be used (see parameter ‘peptide_mapper_converter_version’) as well as different classes inside the converter node (see parameter ‘peptide_mapper_class_version’ )

Available converter nodes

upeptide_mapper_1_0_0

Available converter classes of upeptide_mapper_1_0_0

UPeptideMapper_v3 (default)
UPeptideMapper_v4 (no buffering and enhanced speed compared to v3)
UPeptideMapper_v2

Keyword Arguments:
input_file (str) – The complete path to the input. The input file currently has to be a .csv file.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.
Returns:	Path of the output file
Return type:	str

merge_csvs(input_files, force=None, output_file_name=None, merge_duplicates=False)¶

The ucontroller merge_csvs function

Merges unified .csv files generated by the same search engine into a single .csv file. This is needed if you want to validate search results from the same identification engine on multiple mzML files, for example if multiple fractions of the original sample were measured by LC-MS/MS and together represent one sample/analysis entity.

Keyword Arguments:
input_files (list) – A list containing the complete paths to two or more input files. Input files have to be .csv files.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> us = ursgal.UController()
>>> xtandem_results = [
...     'BSA_1_xtandem_sledgehammer_unified.csv',
...     'BSA_2_xtandem_sledgehammer_unified.csv',
...     'BSA_3_xtandem_sledgehammer_unified.csv'
... ]
>>> us.merge_csvs( input_files = xtandem_results )

Returns:	Path of the output file
Return type:	str
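
Conceptually, merging amounts to concatenating the rows of files that share the same header. The sketch below shows that idea with the standard csv module only; it is not the ursgal merge node, it ignores merge_duplicates, and the name `merge_csv_files` as well as the strict header check are assumptions.

```python
import csv


def merge_csv_files(input_files, output_file):
    """Concatenate CSV files with identical headers into one file (sketch)."""
    if not input_files:
        raise ValueError("at least one input file is required")
    fieldnames = None
    rows = []
    for path in input_files:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            elif reader.fieldnames != fieldnames:
                raise ValueError(f"{path} has a different header")
            rows.extend(reader)
    with open(output_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return output_file
```

Rejecting files with a different header keeps results from different engines or formats from being mixed by accident.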

quantify(input_file, engine, force=None, output_file_name=None, multi=False)¶

The ucontroller quantify function

Performs a peptide/protein quantification using the specified quantification engine and mzML/ident file. Produces a CSV file with peptide/protein quants in the unified Ursgal CSV format. See: List of available engines

Keyword Arguments:

input_file (str) – The complete path to the mzML file.
engine (str) – The name of the quantification engine which should be used, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController(
...    profile = 'LTQ XL high res',
...    params  = {'evidence': 'BSA_idents.csv'}
... )
>>> uc.quantify(
...    input_file = 'BSA.mzML',
...    engine     = 'pyQms_0_0_1'
... )

Returns:	Path of the output file (unified CSV format)
Return type:	str

run_unode_if_required(force, engine_name, answer, merge_duplicates=False, history_addon=None)¶

The ucontroller run_unode_if_required function

Note

internal function

Executes a UNode if required. Otherwise prints why the run was not required. If the UNode is executed, the corresponding json is dumped and the history is updated.

Keyword Arguments:
force (bool) – (Re)do the analysis, even if output file already exists.
engine_name (str) – Name of the engine to be executed (after verifying with engine_sanity_check).
answer (str or None) – The answer of prepare_unode_run(). Can be None if no re-run is required, or a string indicating the reason for re-run.

sanitize_userdefined_output_filename(user_fname, engine)¶: If the user defined a node output file name, we remove all path info from it (not supported) and throw a warning; possibly add a prefix; possibly add the correct file extension (if user didn’t already include it)
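
The three steps named above (strip path info with a warning, add a prefix, add the extension) might look roughly like this. The sketch takes the prefix and extension as explicit arguments, which is an assumption: the documented method receives `engine` instead and presumably derives them from the params and engine metadata. The name `sanitize_output_filename` is hypothetical.

```python
import os
import warnings


def sanitize_output_filename(user_fname, prefix=None, extension=None):
    """Strip path info, then add prefix and extension if missing (sketch)."""
    basename = os.path.basename(user_fname)
    if basename != user_fname:
        # Path information is not supported in output file names.
        warnings.warn("Path information in output file name is ignored")
    if prefix and not basename.startswith(prefix):
        basename = f"{prefix}_{basename}"
    if extension and not basename.endswith(extension):
        basename += extension
    return basename
```

For instance, `sanitize_output_filename('/tmp/out/results', prefix='BSA', extension='.csv')` warns about the dropped path and returns `'BSA_results.csv'`.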

search(input_file, engine=None, force=None, output_file_name=None, multi=False)¶

The ucontroller search function

Performs a peptide search using the specified search engine and mzML file. Produces a CSV file with peptide spectrum matches in the unified Ursgal CSV format. see: List of available engines

Keyword Arguments:

input_file (str) – The complete path to the mzML file, or an MGF file that was converted from mzML.
engine (str) – The name of the identification engine which should be used, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController(
...    profile = 'LTQ XL high res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> uc.search(
...    input_file = 'BSA.mzML',
...    engine     = 'omssa'
... )

Returns:	Path of the output file (unified CSV format)
Return type:	str

Note

Some search engines require a lot of RAM (up to 14GB, depending on your input files). If you don’t have a lot of RAM, some engines might crash. Consider using X!Tandem or OMSSA in these cases, since they are less demanding.

Note

This function calls five search-related ursgal functions in succession, all of which can also be called individually:

convert() (mzml to mgf, if required, using the mzml2mgf engine)

search_mgf()

convert() (raw search results to csv, if required)

execute_misc_engine() (peptide_mapper)

execute_misc_engine() (unify_csv)
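
The five steps listed above can be sketched as one helper that chains the documented methods on a UController-like object. This is a reading of the note, not the body of search(): the helper name `run_search_steps`, the use of file extensions to decide whether a conversion is required, and the short engine names passed to execute_misc_engine() are assumptions.

```python
def run_search_steps(uc, input_file, engine):
    """Call the five search-related steps in succession (sketch).

    `uc` is expected to offer convert(), search_mgf() and
    execute_misc_engine() as documented on this page.
    """
    # 1. mzML -> MGF, only if the input is not already an MGF file.
    if input_file.lower().endswith(".mzml"):
        mgf_file = uc.convert(input_file=input_file, engine="mzml2mgf")
    else:
        mgf_file = input_file
    # 2. Run the search engine on the MGF file.
    raw_result = uc.search_mgf(input_file=mgf_file, engine=engine)
    # 3. Raw search result -> csv, only if required.
    if raw_result.lower().endswith(".csv"):
        csv_result = raw_result
    else:
        csv_result = uc.convert(input_file=raw_result, guess_engine=True)
    # 4. Map the peptides (engine short name is an assumption).
    mapped = uc.execute_misc_engine(input_file=csv_result, engine="upeptide_mapper")
    # 5. Unify the csv (engine short name is an assumption).
    return uc.execute_misc_engine(input_file=mapped, engine="unify_csv")
```

In normal use, calling search() is simpler; a breakdown like this is mainly useful when one of the intermediate files is needed or a single step has to be repeated.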

search_mgf(input_file, engine=None, force=None, output_file_name=None, multi=False)¶

The UController search_mgf function

Does the main peptide identification search with the specified identification engine. This function is called for every mzML file and every search engine that should be used. The function uses UNode.run() to execute a single search engine, for example to execute X!Tandem via the command line.

Keyword Arguments:

input_file (str) – The complete path to the input, input file has to be a .MGF file (but .mzML files can be converted to .MGF with Ursgal)
engine (str) – The name of the identification engine which should be run, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController(
...    profile ='LTQ XL high res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> uc.search_mgf(
...    input_file = 'BSA.mgf',
...    engine     = 'xtandem_piledriver'
... )

Returns:	Path of the output file
Return type:	str

Note

Consider using search() instead. search() automatically converts mzML to MGF and produces a unified CSV output file.

set_file_info_dict(in_file)¶: Splits ext and path and so on

set_profile(profile, dev_mode=False)¶

The ucontroller set_profile function

Note

internal function

Parameters:	profile (str) – Profile specified to use for all searches.

Available profiles:

‘QExactive+’

‘LTQ XL high res’

‘LTQ XL low res’

Sets self.params according to profile name defined in ursgal.kb.profiles

Example:

>>>'LTQ XL low res' : {
...    # MS 1 orbitrap & MSn iontrap
...    'frag_mass_tolerance'       : 0.5,
...    'frag_mass_tolerance_unit'  : 'da',
...    'instrument'                : 'low_res_LTQ',
...    'frag_method'               : 'cid'
...}

Custom profiles can easily be defined in profiles.py in ursgal/kb according to the required parameters or machine specifications.

show_unode_overview()¶

The ucontroller show_unode_overview function

Note

internal function

Prints the overview of all available nodes. The overview includes the category, name and availability of each node. Available nodes are highlighted. This also verifies that engine availability and installation work correctly.

unify_csv(input_file, force=False, output_file_name=None)¶

[ WARNING ] This function is not supported anymore!: Please use execute_misc_engine() instead

The ucontroller unify_csv function

Unifies the .csv files which were converted by the mzidentml library. The corrections for each engine are listed in the node under ursgal/resources/arc_independent/unify_csv_1_0_0

Keyword Arguments:
input_file (str) – The complete path to the input. The input file currently has to be a .csv file.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc=ursgal.UController(
...     profile = 'LTQ XL low res',
...     params  = {'database': 'BSA.fasta'}
... )
>>> xtandem_result_xml = uc.search_mgf(
...     input_file = 'BSA.mgf',
...     engine     = 'xtandem',
... )
>>> xtandem_result_csv = uc.convert_results_to_csv(
...     input_file = xtandem_result_xml
... )
>>> unified_csv = uc.unify_csv(
...     input_file = xtandem_result_csv
... )

Returns:	Path of the output file
Return type:	str

validate(input_file, engine=None, force=None, output_file_name=None)¶

The UController validate function

Does statistical post-processing of unified search result .csv files with the specified validation engine.

Depending on the validation method a posterior error probability (PEP) and/or a q-value will be available in the final results.

Keyword Arguments:

input_file (str) – The complete path to the input, a unified (and possibly merged) search result .csv.
engine (str) – The name of the validation engine which should be run, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Note

Input files to validate() must be in unified csv format (i.e. output files of search() or unify_csv()).

Example:

>>> uc = ursgal.UController(
...    profile = 'LTQ XL low res',
...    params  = {'database': 'BSA.fasta'}
... )
>>> xtandem_result_csv = uc.search(
...    input_file = 'BSA.mzML',
...    engine     = 'xtandem_piledriver'
... )
>>> validated_csv = uc.validate(
...    input_file = xtandem_result_csv,
...    engine     = 'percolator_2_08'
... )

Returns:	Path of the output file
Return type:	str

verify_engine_produced_an_output_file(expected_fpath, engine_name)¶: Since not all engines raise an exception when they fail, we check if the output file was successfully produced or not to throw a proper exception in case the engine crashed.
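
A check of this kind can be as small as the sketch below. The function name `verify_output_file` and the choice of RuntimeError are assumptions; which exception ursgal actually raises is not stated here.

```python
import os


def verify_output_file(expected_fpath, engine_name):
    """Raise if the engine did not leave the expected output file (sketch)."""
    if not os.path.isfile(expected_fpath):
        raise RuntimeError(
            f"{engine_name} did not produce the expected output file "
            f"{expected_fpath}; the engine may have crashed."
        )
    return expected_fpath
```

Checking for the file after the run turns a silent engine failure into an explicit error at the point where it happened.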

visualize(input_files, engine=None, force=None, output_file_name=None, multi=True)¶

The ucontroller function for visualization

Does graphical visualization of result .csv files.

Keyword Arguments:

input_files (list) – list with complete paths of .csv files
engine (str) – The name of the visualizer which should be run, can also be a short version if this name is unambiguous.
force (bool) – (Re)do the analysis, even if output file already exists.
output_file_name (str or None) – Desired output file name excluding path (optional). If None, output file name will be auto-generated.

Example:

>>> uc = ursgal.UController( profile='LTQ XL high res' )
>>> xtandem_result_csv = uc.search(
...     input_file = 'BSA.mzML',
...     engine     = 'xtandem_piledriver',
... )
>>> omssa_result_csv = uc.search(
...     input_file = 'BSA.mzML',
...     engine     = 'omssa',
... )
>>> uc.visualize(
...     input_files = [xtandem_result_csv, omssa_result_csv],
...     engine      = 'venndiagram',
... )

Note

For detailed information about the VennDiagram UNode, see venndiagram_1_0_0._execute().

Returns:	Path of the output file
Return type:	str