Create/Implement your own UNode

Before implementing your own UNode, make sure that you have read about the General structure of Ursgal. This page will explain how to integrate a standalone executable or Python script into Ursgal’s structure of resources, wrappers and uparams.py, based on two examples:

  • A Python script: filter_csv_1_0_0.py
  • A standalone search engine: MS-GF+ v9979

1. Integration into Resources

The resources/ folder contains the main code of each UNode (an executable or Python script). This executable should be standalone and executable from the command line. Each UNode requires its own subfolder in the resources/ folder, which contains the executable.

Note:

The UNode’s resources/ subfolder, its wrapper file in wrappers/ and the Python class in that wrapper file should all have the same name (lowercase, with underscores instead of spaces, e.g. ‘msgfplus_v9979’ or ‘filter_csv_1_0_0’).
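
The naming rule in this note can be checked mechanically. The following is a minimal sketch, assuming the convention is "lowercase letters, digits and underscores only"; the regular expression and both helper names are hypothetical and not part of Ursgal.

```python
import re

# Assumed convention, inferred from the two example names in the note:
# a lowercase letter followed by lowercase letters, digits or underscores.
_UNODE_NAME = re.compile(r'[a-z][a-z0-9_]*')


def is_valid_unode_name(name):
    """Return True if `name` matches the assumed naming convention."""
    return _UNODE_NAME.fullmatch(name) is not None


def to_unode_name(label):
    """Hypothetical normaliser: lowercase the label and replace every run
    of characters other than a-z and 0-9 with a single underscore."""
    return re.sub(r'[^a-z0-9]+', '_', label.lower()).strip('_')
```

With these helpers, is_valid_unode_name('msgfplus_v9979') is accepted, while a label such as 'Filter CSV 1.0.0' first has to be normalised with to_unode_name().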

  1. Platform-dependent engines need to be placed according to the platform: darwin (OS X), linux or win32 (Windows 32 or 64 bit)

    • <ursgal_path>/resources/<platform>/<architecture>/<name_of_engine>/source (executable + potential additional files)

    Example: MS-GF+ on windows 64 bit:

    • <ursgal_path>/resources/win32/64bit/msgfplus_v9979/MSGFPlus.jar
  2. Architecture-independent engines, such as Python scripts or Java packages, should be placed in /resources/platform_independent/arc_independent/

    • <ursgal_path>/resources/platform_independent/arc_independent/<name_of_engine>/engine.py

    Example: filter_csv_1_0_0.py:

    • <ursgal_path>/resources/platform_independent/arc_independent/filter_csv_1_0_0/filter_csv_1_0_0.py

    In fact, MS-GF+ is platform independent as well (since it is based on Java) and can therefore also be placed in:

    • <ursgal_path>/resources/platform_independent/arc_independent/msgfplus_v9979/MSGFPlus.jar
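
The folder layout described in this section can be summarised in a few lines of Python. This is only a sketch: resource_path() is a hypothetical helper, not an Ursgal function, and it uses POSIX-style separators so that the result is the same on every platform.

```python
from pathlib import PurePosixPath


def resource_path(ursgal_path, engine_name, exe,
                  platform='platform_independent',
                  architecture='arc_independent'):
    """Build <ursgal_path>/resources/<platform>/<architecture>/<engine>/<exe>.

    The defaults correspond to architecture-independent engines such as
    Python scripts or Java packages.
    """
    return PurePosixPath(
        ursgal_path, 'resources', platform, architecture, engine_name, exe
    )


# Platform-dependent placement (MS-GF+ on Windows 64 bit)
print(resource_path('/opt/ursgal', 'msgfplus_v9979', 'MSGFPlus.jar',
                    platform='win32', architecture='64bit'))
# Platform-independent placement (filter_csv_1_0_0)
print(resource_path('/opt/ursgal', 'filter_csv_1_0_0', 'filter_csv_1_0_0.py'))
```

The install location '/opt/ursgal' is a placeholder for <ursgal_path>.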

2. Integration into uparams.py

Each parameter that is used by an engine needs to be included in the file <ursgal_path>/ursgal/uparams.py. This file contains a dictionary of all parameters that are available in Ursgal; its structure is explained here.

For every parameter that a new engine can use, first check whether a corresponding parameter is already present in uparams.py. If so, the new engine (UNode name) needs to be added to ‘available_in_unode’. Furthermore, ‘ukey_translation’ needs to map the utranslation_style defined in the engine’s META_INFO to the engine-specific parameter name. Parameter values can be translated in ‘uvalue_translation’, again keyed by the utranslation_style (only if a translation is necessary).
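
To make the role of the two translation dictionaries concrete, here is a minimal sketch of how a lookup by translation style could work. The translate() function is hypothetical and is not how Ursgal implements translation internally; the dictionary is a trimmed excerpt of the ‘enzyme’ entry.

```python
# Trimmed excerpt of the 'enzyme' entry, for illustration only
UPARAMS_EXCERPT = {
    'enzyme': {
        'available_in_unode': ['xtandem_vengeance', 'msgfplus_v9979'],
        'default_value': 'trypsin',
        'ukey_translation': {
            'msgfplus_style_1': '-e',
            'xtandem_style_1': 'protein, cleavage site',
        },
        'uvalue_translation': {
            'msgfplus_style_1': {'trypsin': '1', 'lysc': '3'},
            'xtandem_style_1': {'trypsin': '[KR]|{P}', 'lysc': '[K]|{P}'},
        },
    },
}


def translate(uparams, ukey, uvalue, style):
    """Hypothetical helper: return (engine_key, engine_value) for one style.

    Values without an entry in 'uvalue_translation' are passed through
    unchanged. Assumes `uvalue` is hashable (e.g. a string or number).
    """
    entry = uparams[ukey]
    translated_key = entry['ukey_translation'][style]
    value_map = entry['uvalue_translation'].get(style, {})
    return translated_key, value_map.get(uvalue, uvalue)


print(translate(UPARAMS_EXCERPT, 'enzyme', 'trypsin', 'msgfplus_style_1'))
print(translate(UPARAMS_EXCERPT, 'enzyme', 'trypsin', 'xtandem_style_1'))
```

The same ursgal value (‘trypsin’) therefore ends up as a different key/value pair for each engine, depending only on the translation style.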

Example: include the parameter ‘-e’ for MS-GF+

# -e defines the enzyme that has been used for digestion. This is called 'enzyme' in ursgal.
'enzyme' : {
    # include msgfplus_v9979 in available_in_unode
    'available_in_unode' : [
        'xtandem_vengeance',
        'msgfplus_v9979',
    ],
    # default_value, description, triggers_rerun, utag and uvalue_type don't need to be changed
    'default_value' : "trypsin",
    'description' :  ''' Enzyme: Rule of protein cleavage
        Possible cleavages are ... ''',
    'triggers_rerun' : True,
    # Translate the ursgal parameter name ('enzyme') to the MS-GF+ parameter name ('-e') using the translation style (msgfplus_style_1) in ukey_translation
    'ukey_translation' : {
        'msgfplus_style_1' : '-e',
        'xtandem_style_1' : 'protein, cleavage site',
    },
    # Translate the ursgal parameter values (e.g. 'trypsin') to the MS-GF+ parameter value (e.g. '1') using the translation style (msgfplus_style_1) in uvalue_translation
    'uvalue_translation' : {
        'msgfplus_style_1' : {
            'alpha_lp' : '8',
            'argc' : '6',
            'aspn' : '7',
            'chymotrypsin' : '2',
            'glutamyl_endopeptidase' : '5',
            'lysc' : '3',
            'lysn' : '4',
            'no_cleavage' : '9',
            'nonspecific' : '0',
            'trypsin' : '1',
        },
        'xtandem_style_1' : {
            'argc' : '[R]|{P}',
            'aspn' : '[X]|[D]',
            'chymotrypsin' : '[FMWY]|{P}',
            'chymotrypsin_p' : '[FMWY]|[X]',
            'clostripain' : '[R]|[X]',
            'cnbr' : '[M]|{P}',
            'elastase' : '[AGILV]|{P}',
            'formic_acid' : '[D]|{P}',
            'gluc' : '[DE]|{P}',
            'gluc_bicarb' : '[E]|{P}',
            'iodosobenzoate' : '[W]|[X]',
            'lysc' : '[K]|{P}',
            'lysc_p' : '[K]|[X]',
            'lysn' : '[X]|[K]',
            'lysn_promisc' : '[X]|[AKRS]',
            'nonspecific' : '[X]|[X]',
            'pepsina' : '[FL]|[X]',
            'protein_endopeptidase' : '[P]|[X]',
            'staph_protease' : '[E]|[X]',
            'tca' : '[FMWY]|{P},[KR]|{P},[X]|[D]',
            'trypsin' : '[KR]|{P}',
            'trypsin_cnbr' : '[KR]|{P},[M]|{P}',
            'trypsin_gluc' : '[DEKR]|{P}',
            'trypsin_p' : '[RK]|[X]',
        },
    },
},

If a parameter is not yet present in uparams.py, you can add a new parameter containing all necessary information (see here).

Example: add write_unfiltered_results for filter_csv_1_0_0

'write_unfiltered_results' : {
    'edit_version' : 1.00,
    'available_in_unode' : [
        'filter_csv_1_0_0',
    ],
    'triggers_rerun' : True,
    'ukey_translation' : {
        'filter_csv_style_1' : 'write_unfiltered_results',
    },
    'utag' : [
        'conversion',
    ],
    'uvalue_translation' : {
    },
    'uvalue_type' : 'bool',
    'uvalue_option' : {
    },
    'default_value' : False,
    'description' : \
        'Writes rejected results if True',
},

After changing uparams.py, please run the tests, especially chk_format_node_param_test.py to check for errors.
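
Before running the test suite, a quick self-check of a new entry can catch obvious omissions. The sketch below is not the project's format test. It is a hypothetical helper whose list of expected keys is simply taken from the write_unfiltered_results example above, so the real test may check more or different things.

```python
# Keys copied from the 'write_unfiltered_results' example; this list is an
# assumption and may not match what the real format test enforces.
EXPECTED_KEYS = {
    'edit_version', 'available_in_unode', 'triggers_rerun',
    'ukey_translation', 'utag', 'uvalue_translation', 'uvalue_type',
    'uvalue_option', 'default_value', 'description',
}


def check_uparam_entry(name, entry):
    """Return a list of human-readable problems found in one entry."""
    problems = []
    for key in sorted(EXPECTED_KEYS - set(entry)):
        problems.append('{0}: missing key {1!r}'.format(name, key))
    # Every style with value translations should also translate the key
    key_styles = set(entry.get('ukey_translation', {}))
    value_styles = set(entry.get('uvalue_translation', {}))
    for style in sorted(value_styles - key_styles):
        problems.append(
            '{0}: style {1!r} has value translations but no key '
            'translation'.format(name, style)
        )
    return problems


entry = {
    'edit_version': 1.00,
    'available_in_unode': ['filter_csv_1_0_0'],
    'triggers_rerun': True,
    'ukey_translation': {'filter_csv_style_1': 'write_unfiltered_results'},
    'utag': ['conversion'],
    'uvalue_translation': {},
    'uvalue_type': 'bool',
    'uvalue_option': {},
    'default_value': False,
    'description': 'Writes rejected results if True',
}
print(check_uparam_entry('write_unfiltered_results', entry))
```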

3. Implementation of the wrapper class

Each UNode has to have a Python wrapper file located in:

  • <ursgal_path>/wrappers/<unode_name>.py

The wrapper class has to inherit from the UNode class, which injects the node-related data into the class during initialization.

The default structure of the UNode class has to be:

class my_unode_1_0_0(ursgal.UNode):

    META_INFO = {}

    def __init__(self, *args, **kwargs):
        super(my_unode_1_0_0, self).__init__(*args, **kwargs)

    def preflight(self):
        # code that should be run before the UNode is executed
        # e.g. writing a config file
        return

    def postflight(self):
        # code that should be run after the UNode is executed
        # e.g. formatting the output file
        return

where my_unode_1_0_0 is the name of the UNode. The META_INFO is explained here and is available as an attribute of each UNode. You can define preflight() and postflight() methods, which the UNode runs before and after the execution of the main executable, respectively.
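
The order of the three phases can be illustrated with a small stand-in class. This is a sketch only: MiniNode and its run() method are hypothetical and do not reproduce ursgal.UNode, which does considerably more (parameter handling, file bookkeeping, and so on).

```python
class MiniNode:
    """Hypothetical stand-in for a UNode base class (not ursgal.UNode)."""

    def __init__(self):
        self.log = []

    def preflight(self):
        # Default: nothing to prepare
        return

    def _execute(self):
        # Placeholder for running the engine itself
        self.log.append('execute')

    def postflight(self):
        # Default: nothing to clean up
        return

    def run(self):
        """Run the three phases in the order described above."""
        self.preflight()
        self._execute()
        self.postflight()
        return self.log


class DemoNode(MiniNode):
    def preflight(self):
        self.log.append('preflight')

    def postflight(self):
        self.log.append('postflight')


print(DemoNode().run())
```

A subclass only overrides the phases it needs; a node without a postflight step simply inherits the empty default.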

3.1 Implementation of an engine from a command line tool

For binary executable UNodes, one has to create a command line list (see subprocess) in the preflight() method. The command list is used to run the UNode’s executable with the appropriate command line parameters. It should include the executable path of the engine (accessible via self.exe) and all relevant parameters. These are available via self.params, which contains the original parameters and values. self.params['translations'] contains translated values for all node-related parameters. Furthermore, self.params['translations']['_grouped_by_translated_key'] is a dictionary that maps each engine-specific (translated) parameter name to its corresponding ursgal parameters and their translated values.
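
The shape of the grouped dictionary is easiest to see on a small example. The following sketch builds such a structure from a flat list; group_by_translated_key() is a hypothetical helper and the values are made up for illustration, but the resulting nesting (engine key, then ursgal key, then translated value) matches how the MS-GF+ wrapper further below accesses it, e.g. translations['-t']['precursor_mass_tolerance_unit'].

```python
def group_by_translated_key(translated):
    """Group (ukey, translated_key, translated_value) triples by engine key.

    Returns {translated_key: {ukey: translated_value, ...}, ...}.
    """
    grouped = {}
    for ukey, translated_key, value in translated:
        grouped.setdefault(translated_key, {})[ukey] = value
    return grouped


# Illustrative values only; three ursgal parameters share the engine key '-t'
translated = [
    ('enzyme', '-e', '1'),
    ('precursor_mass_tolerance_minus', '-t', 5),
    ('precursor_mass_tolerance_plus', '-t', 5),
    ('precursor_mass_tolerance_unit', '-t', 'ppm'),
]
grouped = group_by_translated_key(translated)
print(grouped['-e'])
print(sorted(grouped['-t']))
```

Engine keys that collect more than one ursgal parameter, like '-t' here, need engine-specific code in preflight() to merge them into a single command line value.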

The command list is stored in self.params['command_list']. This list should be constructed in the UNode class preflight() method like this:

def preflight(self):

    # retrieve the path of the input file:
    input_file = os.path.join(
        self.params['input_dir_path'],
        self.params['input_file']
    )

    # retrieve the auto-generated output file name:
    output_file = os.path.join(
        self.params['output_dir_path'],
        self.params['output_file'],
    )

    # format parameters and input/output file names into command list:
    self.params['command_list'] = [
        self.exe,
        '-o',
        output_file,
        '-i',
        input_file,
        '--some_parameter',
        '{some_param_in_ursgal}'.format(**self.params['translations']),
        '--another_parameter',
        '{original_engine_parameter}'.format(**self.params['translations']['_grouped_by_translated_key']),
    ]

After preflight(), Ursgal automatically passes the command_list to Python’s built-in subprocess module:

proc = subprocess.Popen(
    self.params['command_list'],
    stdout = subprocess.PIPE,
)
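
For readers unfamiliar with subprocess, the following self-contained sketch runs a command list in the same way and collects its output. The helper name run_command_list() is hypothetical, and the command simply starts the current Python interpreter so that the example works without any engine installed.

```python
import subprocess
import sys


def run_command_list(command_list):
    """Run a command list and return (exit_code, decoded_stdout)."""
    proc = subprocess.Popen(
        command_list,
        stdout=subprocess.PIPE,
    )
    # communicate() reads stdout until the process ends and then waits for it
    stdout_bytes, _ = proc.communicate()
    return proc.returncode, stdout_bytes.decode().strip()


# Stand-in for an engine call: [executable, option, value, ...]
command_list = [sys.executable, '-c', 'print("hello from the engine")']
exit_code, output = run_command_list(command_list)
print(exit_code, output)
```

Because the command is passed as a list rather than a single string, no shell is involved and arguments containing spaces need no extra quoting.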

After the execution procedure, the postflight() sequence is executed (if a postflight function was defined as part of the class), e.g.:

def postflight(self):
    '''
    Move the result files to the Kojak folder, since the output files can
    not be specified manually.
    '''
    # kojak_extensions = [
    #     '.kojak.txt',
    #     '.pep.xml',
    #     '.perc.inter.txt',
    #     '.perc.intra.txt',
    #     '.perc.loop.txt',
    #     '.perc.single.txt',
    # ]
    for extension in self.META_INFO['all_extensions']:
        org_path = os.path.join(
            self.params['input_dir_path'],
            '{0}{1}'.format(
                self.params['file_root'],
                extension
            )
        )
        new_path = os.path.join(
            self.params['output_dir_path'],
            '{0}_kojak_{1}{2}'.format(
                self.params['file_root'],
                self.META_INFO['version'],
                extension
            )
        )
        if os.path.exists(org_path):
            shutil.move(
                org_path,
                new_path
            )
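
The same pattern, reduced to a standalone function, might look as follows. This is a sketch under stated assumptions: move_results() and its tag argument are hypothetical names, and the function only mirrors the renaming scheme of the postflight() example above.

```python
import os
import shutil


def move_results(input_dir, output_dir, file_root, tag, extensions):
    """Move <file_root><ext> from input_dir to output_dir, adding a tag.

    Files that do not exist are skipped. Returns the list of new paths.
    """
    moved = []
    for extension in extensions:
        org_path = os.path.join(input_dir, file_root + extension)
        new_path = os.path.join(
            output_dir,
            '{0}_{1}{2}'.format(file_root, tag, extension)
        )
        if os.path.exists(org_path):
            shutil.move(org_path, new_path)
            moved.append(new_path)
    return moved
```

Skipping missing files matters for engines that only write some of their possible output files, depending on the parameters used.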

Example: ursgal/engines/msgfplus_v9979.py

#!/usr/bin/env python
import ursgal
import os

class msgfplus_v9979( ursgal.UNode ):
    """
    MSGF+ UNode
    Parameter options at https://bix-lab.ucsd.edu/pages/viewpage.action?pageId=13533355

    Reference:
    Kim S, Mischerikow N, Bandeira N, Navarro JD, Wich L, Mohammed S, Heck AJ, Pevzner PA. (2010) The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search.
    """
    META_INFO = {
        'edit_version'                : 1.00,
        'name'                        : 'MSGF+',
        'version'                     : 'v9979',
        'release_date'                : '2010-12-1',
        'engine_type' : {
            'protein_database_search_engine' : True,
        },
        'input_extensions'            : ['.mgf', '.mzML', '.mzXML', '.ms2', '.pkl', '.dta.txt'],
        'output_extensions'           : ['.mzid'],
        'create_own_folder'           : True,
        'in_development'              : False,
        'include_in_git'              : False,
        'utranslation_style'          : 'msgfplus_style_1',
        'engine' : {
            'platform_independent' : {
                'arc_independent' : {
                    'exe'            : 'MSGFPlus.jar',
                    'url'            : 'http://proteomics.ucsd.edu/Software/MSGFPlus/MSGFPlus.zip',
                    'zip_md5'        : '82a3e2204ff698e260ac9f89d3880b59',
                    'additional_exe' : [],
                },
            },
        },
        'citation' : \
            'Kim S, Mischerikow N, Bandeira N, Navarro JD, Wich L, '\
            'Mohammed S, Heck AJ, Pevzner PA. (2010) The Generating Function '\
            'of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: '\
            'Applications to Database Search.',
    }

    def __init__(self, *args, **kwargs):
        super(msgfplus_v9979, self).__init__(*args, **kwargs)
        pass

    def preflight( self ):
        '''
        Formatting the command line via self.params

        Modifications file will be created in the output folder

        Returns:
                dict: self.params
        '''

        translations = self.params['translations']['_grouped_by_translated_key']

        self.params[ 'command_list' ] = [
            'java',
            '-jar',
            self.exe,
        ]

        self.params['translations']['mgf_input_file'] = os.path.join(
            self.params['input_dir_path'],
            self.params['input_file']
        )
        translations['-s']['mgf_input_file'] = self.params['translations']['mgf_input_file']

        self.params['translations']['output_file_incl_path'] = os.path.join(
            self.params['output_dir_path'],
            self.params['output_file']
        )
        translations['-o']['output_file_incl_path'] = self.params['translations']['output_file_incl_path']

        self.params['translations']['modification_file'] = os.path.join(
            self.params['output_dir_path'],
            self.params['output_file'] + '_Mods.txt'
        )
        self.created_tmp_files.append( self.params['translations']['modification_file'] )
        translations['-mod']['modifications'] = self.params['translations']['modification_file']

        mods_file = open( self.params['translations']['modification_file'], 'w', encoding = 'UTF-8' )
        modifications = []

        print('NumMods={0}'.format(translations['NumMods']['max_num_mods']), file = mods_file)

        if self.params['translations']['label'] == '15N':
            for aminoacid, N15_Diff in ursgal.ukb.DICT_15N_DIFF.items():
                existing = False
                for mod in self.params[ 'mods' ][ 'fix' ]:
                    if aminoacid == mod[ 'aa' ]:
                        mod[ 'mass' ] += N15_Diff
                        mod[ 'name' ] += '_15N_{0}'.format(aminoacid)
                        existing = True
                if existing:
                    continue
                modifications.append( '{0},{1},fix,any,15N_{1}'.format( N15_Diff, aminoacid ) )

        for t in [ 'fix', 'opt' ]:
            for mod in self.params[ 'mods' ][ t ]:
                modifications.append( '{0},{1},{2},{3},{4}'.format(mod[ 'mass' ], mod[ 'aa' ], t, mod[ 'pos' ], mod[ 'name' ] ) )

        for mod in modifications:
            print( mod, file = mods_file )

        mods_file.close()

        translations['-t'] = {
            '-t' : '{0}{1}, {2}{1}'.format(
                translations['-t']['precursor_mass_tolerance_minus'],
                translations['-t']['precursor_mass_tolerance_unit'],
                translations['-t']['precursor_mass_tolerance_plus'],
            )
        }

        command_dict = {}

        for translated_key, translation_dict in translations.items():
            if translated_key == '-Xmx':
                self.params[ 'command_list' ].insert(1,'{0}{1}'.format(
                    translated_key,
                    list(translation_dict.values())[0]
                ))
            elif translated_key in ['label', 'NumMods']:
                continue
            elif len(translation_dict) == 1:
                command_dict[translated_key] = str(list(translation_dict.values())[0])
            else:
                print('The translated key ', translated_key, ' maps on more than one ukey, but no special rules have been defined')
                print(translation_dict)
                raise SystemExit(1)
        for k, v in command_dict.items():
            self.params[ 'command_list' ].extend((k, v))

        return self.params
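
Two formatting steps of this preflight() are worth isolating: the lines written to the modifications file and the merged '-t' tolerance value. The sketch below extracts both into pure functions so they can be tried without Ursgal; the function names and the example modifications are illustrative assumptions, while the format strings are the ones used in the wrapper above.

```python
def format_msgfplus_mods(mods, max_num_mods):
    """Return the lines of the modifications file as a list of strings."""
    lines = ['NumMods={0}'.format(max_num_mods)]
    for mod_type in ('fix', 'opt'):
        for mod in mods[mod_type]:
            # mass, residue, fix/opt, position, name
            lines.append('{0},{1},{2},{3},{4}'.format(
                mod['mass'], mod['aa'], mod_type, mod['pos'], mod['name']
            ))
    return lines


def format_precursor_tolerance(minus, plus, unit):
    """Merge the three tolerance parameters into the single '-t' value."""
    return '{0}{1}, {2}{1}'.format(minus, unit, plus)


# Illustrative input in the structure the wrapper reads from self.params['mods']
mods = {
    'fix': [{'mass': 57.021464, 'aa': 'C', 'pos': 'any',
             'name': 'Carbamidomethyl'}],
    'opt': [{'mass': 15.994915, 'aa': 'M', 'pos': 'any',
             'name': 'Oxidation'}],
}
for line in format_msgfplus_mods(mods, max_num_mods=2):
    print(line)
print(format_precursor_tolerance(5, 5, 'ppm'))
```

Keeping such formatting in small functions makes it possible to test the generated file content without running the search engine.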

3.2 Implementation of a UNode from Python code

Using sys.argv or the argparse module, any Python code can be executed like a command line tool. Thus, it is possible to include pure Python UNodes using the steps described above. For convenience, it is also possible to import the main function of a Python script using self.import_engine_as_python_function(). This function can then be directly executed by Ursgal, which makes it possible to include Python scripts that don’t use argparse or sys.argv. To skip command line execution and run the main function of a Python script, one has to define the _execute() method of the UNode class. There are several pure Python UNodes in Ursgal, e.g. filter_csv_1_0_0.py, get_ftp_files_1_0_0.py and many others.
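
A script that supports both routes typically keeps its logic in a main() function and adds a thin command line layer on top. The following is a minimal sketch of that layout. It is not the code of filter_csv_1_0_0.py: the min_score parameter, the ‘score’ column and the cli() helper are hypothetical.

```python
import argparse
import csv


def main(input_file=None, output_file=None, min_score=0.0):
    """Toy engine: copy rows whose 'score' column is >= min_score.

    Returns the number of rows written.
    """
    kept = 0
    with open(input_file, newline='') as src, \
            open(output_file, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row['score']) >= min_score:
                writer.writerow(row)
                kept += 1
    return kept


def cli(argv=None):
    """Command line entry point; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description='Toy CSV filter')
    parser.add_argument('-i', '--input_file', required=True)
    parser.add_argument('-o', '--output_file', required=True)
    parser.add_argument('--min_score', type=float, default=0.0)
    args = parser.parse_args(argv)
    return main(args.input_file, args.output_file, args.min_score)
```

A wrapper that builds a command list would go through cli(), whereas a wrapper that defines _execute() can call main() directly with keyword arguments and skip argument parsing altogether.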

Example: ursgal/engines/filter_csv_1_0_0.py

#!/usr/bin/env python
import ursgal
import importlib
import os
import sys
import pickle
import shutil

class filter_csv_1_0_0( ursgal.UNode ):
    """filter_csv_1_0_0 UNode"""
    def __init__(self, *args, **kwargs):
        super(filter_csv_1_0_0, self).__init__(*args, **kwargs)

    def _execute( self ):
        print('[ -ENGINE- ] Executing conversion ..')
        self.time_point(tag = 'execution')

        # import the main function from the UNode's python script
        filter_csv_main = self.import_engine_as_python_function()

        if not self.params['output_file'].lower().endswith('.csv'):
            raise ValueError('Trying to filter a non-csv file.')

        # receive name of the input file so it can be passed to main function
        input_file  = os.path.join(
            self.params['input_dir_path'],
            self.params['input_file']
        )
        # receive auto-generated filename from UController
        output_file = os.path.join(
            self.params['output_dir_path'],
            self.params['output_file']
        )

        # Sometimes, engine-specific code is required! For instance,
        # filter_csv() can produce a second output file with the columns
        # that were removed:

        if self.params['translations']['write_unfiltered_results'] is False:

            output_file_unfiltered = None
        else:
            file_extension = self.meta_unodes[ self.engine ].META_INFO.get(
                'output_suffix',
                None
            )
            new_file_extension = self.meta_unodes[ self.engine ].META_INFO.get(
                'rejected_output_suffix',
                None
            )
            output_file_unfiltered = output_file.replace(
                file_extension,
                new_file_extension
            )
            shutil.copyfile(
                '{0}.u.json'.format(output_file),
                '{0}.u.json'.format(output_file_unfiltered)
            )
        # Engine-specific code ends here

        # Call the Python script's main() function using the information
        # we collected above:
        filter_csv_main(
            input_file     = input_file,
            output_file    = output_file,
            filter_rules   = self.params['translations']['csv_filter_rules'],
            output_file_unfiltered = output_file_unfiltered,
        )

        self.print_execution_time(tag='execution')
        return output_file