qmworks.plams.core.jobrunner.GridRunner Class Reference

Public Member Functions

def __init__ (self, grid='auto', sleepstep=None, **kwargs)
 
def call (self, runscript, workdir, out, err, runflags, **kwargs)
 
- Public Member Functions inherited from qmworks.plams.core.jobrunner.JobRunner
def __init__ (self, parallel=False, maxjobs=0)
 
def call (self, runscript, workdir, out, err, **kwargs)
 

Public Attributes

 sleepstep
 
 settings
 
- Public Attributes inherited from qmworks.plams.core.jobrunner.JobRunner
 parallel
 
 semaphore
 

Detailed Description

Subclass of |JobRunner| that submits the runscript to a job scheduler instead of executing it locally. Apart from two new keyword arguments (*grid* and *sleepstep*) and a different :meth:`call` method, it behaves and is meant to be used just like a regular |JobRunner|.

Many different job schedulers are popular and widely used nowadays (for example TORQUE, SLURM, OGE), and they usually use different commands for submitting jobs or checking the queue status. This class tries to provide a common, flexible interface for all those tools. The idea is that the commands used to communicate with the job scheduler are not rigidly hard-coded, but taken dynamically from a |Settings| instance. Thanks to that, the user has almost full control over the behavior of |GridRunner|.

The behavior of |GridRunner| is therefore determined by the contents of the |Settings| instance stored in its ``settings`` attribute. This instance can be supplied manually by the user or taken from a collection of predefined behaviors stored as branches of ``config.gridrunner``. The choice is made via the *grid* parameter, which should be either a string or a |Settings| instance. If it is a string, it has to be a key occurring in ``config.gridrunner`` (or ``'auto'`` for autodetection). For example, if ``grid='slurm'`` is passed, ``config.gridrunner.slurm`` is linked as ``settings``. If *grid* is ``'auto'``, the entries in ``config.gridrunner`` are tested one by one and the first one that works (i.e. whose submit command is present on your system) is chosen. If a |Settings| instance is passed, it is plugged in directly as ``settings``.
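The ``'auto'`` detection can be pictured with a short sketch. This is a simplified stand-in, not the actual PLAMS implementation: the ``GRID_DEFAULTS`` dictionary and the ``autodetect_grid`` function are hypothetical names, and plain dicts stand in for |Settings| branches::

```python
import shutil

# Hypothetical stand-in for config.gridrunner: each branch names its submit command.
GRID_DEFAULTS = {
    'slurm': {'commands': {'submit': 'sbatch', 'check': 'squeue -j '}},
    'pbs':   {'commands': {'submit': 'qsub',   'check': 'qstat '}},
}

def autodetect_grid(gridrunner_config):
    """Return the first branch whose submit command is found on PATH."""
    for name, settings in gridrunner_config.items():
        if shutil.which(settings['commands']['submit']) is not None:
            return name, settings
    raise RuntimeError('no supported job scheduler found on this system')
```

The key idea is simply "first branch whose submit command exists on this machine wins", which is why the order of entries in ``config.gridrunner`` matters for autodetection.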

Currently two predefined job schedulers are available (see ``plams_defaults.py``): ``slurm`` for SLURM and ``pbs`` for job schedulers following PBS syntax (PBS, TORQUE, Oracle Grid Engine etc.).

The |Settings| instance used for |GridRunner| should have the following structure:
    *   ``.output`` -- flag for specifying the output file path.
    *   ``.error`` -- flag for specifying the error file path.
    *   ``.workdir`` -- flag for specifying the path to the working directory.
    *   ``.commands.submit`` -- the submit command.
    *   ``.commands.check`` -- the queue status check command.
    *   ``.commands.getid`` -- function extracting the submitted job's ID from the output of the submit command.
    *   ``.commands.finished`` -- function checking if the submitted job is finished. It should take a single string (the job's ID) and return a boolean.
    *   ``.commands.special.`` -- branch storing definitions of special |run| keyword arguments.
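For illustration only, a structure of this shape could be written down as a plain nested dictionary (PLAMS uses its |Settings| class with dotted attribute access; the dict below merely conveys the same layout, and the ``getid`` lambda and ``finished`` placeholder are illustrative, not the predefined ones)::

```python
# Plain-dict sketch of the structure listed above, loosely mirroring SLURM.
slurm_like = {
    'workdir': '-D',             # flag for the working directory path
    'output':  '-o',             # flag for the output file path
    'error':   '-e',             # flag for the error file path
    'commands': {
        'submit': 'sbatch',      # submit command
        'check':  'squeue -j ',  # queue status check command
        'getid':  lambda output: output.strip().split()[-1],  # output -> job ID
        'finished': None,        # function: job ID -> bool (omitted in this sketch)
        'special': {'nodes': '-N ', 'walltime': '-t ', 'queue': '-p '},
    },
}
```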

See :meth:`call` for more technical details and examples.

The *sleepstep* parameter defines how often the job is checked for being finished. It should be an integer giving the number of seconds between two consecutive checks. If ``None``, the global default from ``config.sleepstep`` is copied.

.. note::
    Usually job schedulers are configured in such a way that the output of your job is captured somewhere else and copied to the location indicated by the output flag only when the job is finished. Because of that it is not possible to peek at your output while the job is running (for example, to see if your calculation is going well). This limitation can be worked around with ``[Job].settings.runscript.stdout_redirect``: if set to ``True``, the output redirection is not handled by the job scheduler but built into the runscript using the shell redirection ``>``. That forces the output file to be created directly in *workdir* and updated live as the job proceeds.

Member Function Documentation

◆ call()

def qmworks.plams.core.jobrunner.GridRunner.call (   self,
  runscript,
  workdir,
  out,
  err,
  runflags,
  **kwargs 
)
call(runscript, workdir, out, err, runflags, **kwargs)
Submit *runscript* to the job scheduler with *workdir* as the working directory. Redirect output and error streams to *out* and *err*, respectively. *runflags* stores submit command options.

The submit command has the following structure. Underscores denote spaces, parts in angle brackets correspond to ``settings`` entries, parts in curly brackets to :meth:`call` arguments, and square brackets contain optional parts::

    <.commands.submit>_<.workdir>_{workdir}_<.error>_{err}[_<.output>_{out}][FLAGS]_{runscript}

The output part is added only if *out* is not ``None``. This is handled automatically, based on the ``.runscript.stdout_redirect`` value in the job's ``settings``.

The ``FLAGS`` part is built from the *runflags* argument, which is a dictionary storing |run| keyword arguments. For every *(key, value)* pair in *runflags* the string ``_-key_value`` is appended, **unless** *key* is a special key occurring in ``.commands.special.``. In that case ``_<.commands.special.key>value`` is used (mind the lack of a space in between!). For example, the |Settings| instance defining the interaction with the SLURM job scheduler, stored in ``config.gridrunner.slurm``, has the following entries::

    .workdir = '-D'
    .output  = '-o'
    .error   = '-e'
    .special.nodes    = '-N '
    .special.walltime = '-t '
    .special.queue    = '-p '
    .commands.submit  = 'sbatch'
    .commands.check   = 'squeue -j '

The submit command produced for the following invocation::

    >>> gr = GridRunner(parallel=True, maxjobs=4, grid='slurm')
    >>> j.run(jobrunner=gr, queue='short', nodes=2, J='something', O='')

will be::

    sbatch -D {workdir} -e {err} -o {out} -p short -N 2 -J something -O  {runscript}
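The ``FLAGS`` construction rule can be sketched as a small helper. This is a simplified stand-in for what :meth:`call` does internally; the function name is illustrative::

```python
def build_flags(runflags, special):
    """Render run() keyword arguments as submit-command options.

    Ordinary keys become ' -key value'; keys present in *special* are rendered
    as ' <special[key]>value' with no separating space (the special flag string
    itself may already end with one, as in '-N ').
    """
    parts = []
    for key, value in runflags.items():
        if key in special:
            parts.append(' {}{}'.format(special[key], value))
        else:
            parts.append(' -{} {}'.format(key, value))
    return ''.join(parts)
```

With the SLURM entries shown above, ``build_flags({'queue': 'short', 'nodes': 2, 'J': 'something', 'O': ''}, special)`` yields exactly the ``-p short -N 2 -J something -O`` portion of the example command.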

In some job schedulers, some flags don't have a short form with the semantics ``-key value``. For example, in SLURM the flag ``--nodefile=value`` has a short form ``-F value``, but the flag ``--export=value`` does not. Such a flag can still be used via the special-keys mechanism::

    >>> gr = GridRunner(parallel=True, maxjobs=4, grid='slurm')
    >>> gr.settings.special.export = '--export='
    >>> j.run(jobrunner=gr, queue='short', export='value')
    sbatch -D {workdir} -e {err} -o {out} -p short --export=value {runscript}

The submit command produced in the way explained above is then executed, and the returned output is used to determine the submitted job's ID. The function stored in ``.commands.getid`` is used for that purpose: it should take one string (the whole output) and return a string with the job's ID.
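For instance, ``sbatch`` typically replies with a line like ``Submitted batch job 12345``, so a matching ``getid`` could be written as below. This is an illustrative sketch, not necessarily the predefined one in ``plams_defaults.py``::

```python
def slurm_getid(output):
    """Extract the job ID from sbatch's reply, e.g. 'Submitted batch job 12345'."""
    # The ID is the last whitespace-separated token of the (stripped) output.
    return output.strip().split()[-1]
```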

Now the method waits for the job to finish. Every ``sleepstep`` seconds it queries the job scheduler using the following algorithm:
    *   if a key ``finished`` exists in ``.commands.``, it is used. It should be a function taking the job's ID and returning ``True`` or ``False``.
    *   otherwise, the string stored in ``.commands.check`` is concatenated with the job's ID (with no space in between) and the resulting command is executed. A nonzero exit status indicates that the job is no longer present in the job scheduler, hence it is finished.
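The polling loop above can be sketched as follows. This is a simplified stand-in for the internal waiting logic, with an illustrative function name and plain-dict ``commands``::

```python
import subprocess
import time

def wait_until_finished(jobid, commands, sleepstep=1):
    """Poll the scheduler every *sleepstep* seconds until the job is finished."""
    while True:
        if 'finished' in commands:
            # User-supplied predicate: job ID -> bool.
            done = commands['finished'](jobid)
        else:
            # Concatenate the check command with the ID (no space in between);
            # a nonzero exit status means the job has left the queue.
            result = subprocess.run(commands['check'] + jobid, shell=True,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
            done = (result.returncode != 0)
        if done:
            return
        time.sleep(sleepstep)
```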

Since it is difficult (on some systems even impossible) to automatically obtain a job's exit code, the returned value is always 0. From the |run| perspective this means that a job executed with |GridRunner| is never *crashed*.

.. note::
    This method is used automatically during |run| and should never be explicitly called in your script.
