
                       Administrator's Guide
                   UMN-Cluster PBS Scheduler V.1.2
			    July 2000
	      Copyright (c) 2000 Veridian Systems, Inc.


This document covers the following information:

	o Introduction
	o Summary of features in Ver. 1.0
	o Overview of UMN-Cluster scheduler 
	o Installing the UMN-Cluster scheduler
	o Rebuilding PBS to use UMN-Cluster scheduler
	o Required modifications to existing PBS configuration
	o Configuring the UMN-Cluster scheduler
	o General Comments


Introduction
------------

This package contains the sources for a PBS scheduler (pbs_sched), which
was designed to be run on a cluster of systems with different CPU and
memory configurations. The function of the scheduler is to choose a job
or jobs that fit the resources. When a suitable job is found, the
scheduler will direct PBS to run that job on a specific execution host.
This scheduler assumes a 1:1 correlation between the executions queues
and execution hosts. The name of the queue is taken as the name of the
host that jobs in that queue should be run in. (The required queue
structure is discussed in detail below.)


Summary of features in Ver.1.0
------------------------------

Version of 1.0 of the UMN-Cluster PBS scheduler includes the following
features. These are discussed in more detail below, and in the scheduler's
configuration file.

    o User-Specified Architecture - When users submit a job they can
      specify what system architecture the job should run on. This is
      done via the "-l arch=xxx" option to qsub or within a PBS job
      script. The "arch" values correspond to the values determined
      during the PBS configure/build process for the target architectures.
      There is not currently any command to list the "arch" values for a
      given cluster. However, the scheduler includes the "arch" string in
      its status summary of each node. It is recommended that you grep
      "arch" out of the scheduler logs, and then add the corresponding
      "arch" string to each node in the server's nodes file as a "node
      attribute". Doing so will enable the "arch" strings to be displayed
      via the "pbsnodes" command. (See the General Notes section below for
      more info on "pbsnodes".)

    o Fair-Access Controls - Administrator can specific limits on the
      number of CPUs and amount of memory that a given group can use at
      the same time. This limit is enforced per-group, cluster-wide on
      a per-architecture basis. The administrator specifies these limits
      in the scheduler's configuration file (dicussed below).


Overview of UMN-Cluster scheduler internals
--------------------------------------------

This section provides a high level overview of the workings of the
UMN-Cluster PBS scheduler. 

* Overview Of Operation

Please be sure to read the section titled 'Configuring The Scheduler'
below before attempting to start the scheduler.

The basic mode of operation for the UMN-Cluster scheduler is as follows:

  o Jobs are submitted to the PBS server by users.

  o The scheduler wakes up and performs the following actions:

    + Get the list of jobs from the server. If the scheduler find a job
      without a memory specification, it will set the job's memory limit
      to its originating queue's default.mem value.

    + Get available resource information from each execution host. The
      PBS MOM daemon running on each host is queried for a set of resources
      for the host. Scheduling decisions will be based upon these resources,
      queue limits, time of day (optional), etc, etc.

    + Get information about the queues from the server.  The queues over
      which the scheduler has control are listed in the scheduler's
      configuration files.  The queues may be listed as batch or submit
      queues.

    + A job list is then created from the jobs on the submit queue.

    + Sort the jobs into FCFS order.

    + Loop through all the jobs, attempting to pack the execution hosts
      in order to maximize utilization:

      o If a job fits on a given host/queue and does not violate any policy
	requirements direct PBS to move the job to that queue, and start it
	running. If this succeeds, account for the expected resource
	consumption and continue. 

      o If the job is not runnable at this time, note the time at which
	it will be runnable, and publish this date/time in the job
	comment. In addition, modify the job comment to reflect the
	reason the job was not runnable (see the section on Lazy
	Comments). Note that this reason may change from iteration to
	iteration, and that there may be several reasons that the job is
	not runnable now, however only one reason is reported.

      o If the next-scheduled FCFS-ordered job cannot run, attempt to
	backfill around it with other jobs, in FCFS order.

    + Clean up all allocated resources, and go back to sleep until the next
      round of scheduling is requested.

The PBS server will wake up the scheduler when jobs arrive or terminate,
so jobs should be scheduled immediately if the resources are (or become) 
available for them.  There is also a periodic run every few minutes.


* The Configuration File

The scheduler's configuration file is a flat ASCII file.  Comments are
allowed anywhere, and begin with a '#' character.  Any non-comment lines
are considered to be statements, and must conform to the syntax :

	<option> <argument>

The descriptions of the options below describe the type of argument that
is expected for each of the options.  Arguments must be one of :

	<boolean>	A boolean value.  The strings "true", "yes", "on" and 
			"1" are all true, anything else evaluates to false.
	<hostname>	A hostname registered in the DNS system.
	<integer>	An integral (typically non-negative) decimal value.
	<pathname>	A valid pathname (i.e. "/usr/local/pbs/pbs_acctdir").
	<queue_spec>	The name of a PBS queue.  Either 'queue@exechost' or
			just 'queue'.  If the hostname is not specified, it
			defaults to the name of the local host machine.
	<real>		A real valued number (e.g. the number 0.80).
	<string>	An uninterpreted string passed to other programs.
	<time_spec>	A string of the form HH:MM:SS (i.e. 00:30:00 for
			thirty minutes, 4:00:00 for four hours).
	<variance>	Negative and positive deviation from a value.  The
			syntax is '-mm%,+nn%' (i.e. '-10%,+15%' for minus
			10 percent and plus 15% from some value).

Syntactical errors in the configuration file are caught by the parser, and
the offending line number and/or configuration option/argument is noted in
the scheduler logs.  The scheduler will not start while there are syntax
errors in its configuration file.

Before starting up, the scheduler attempts to find common errors in the
configuration files.  If it discovers a problem, it will note it in the
logs (possibly suggesting a fix) and exit.

The following is a complete list of the recognized options :

    SUBMIT_QUEUE			<queue_spec>
    BATCH_QUEUES			<queue_spec>[,<queue_spec>...]
    EXPRESS_QUEUE			<string>
    ENFORCE_PRIME_TIME			<boolean>
    PRIME_TIME_END			<time_spec>
    PRIME_TIME_START			<time_spec>
    PRIME_TIME_WALLT_LIMIT		<time_spec>
    SCHED_RESTART_ACTION		<string>
    TARGET_LOAD_PCT			<integer>
    TARGET_LOAD_VARIANCE		<variance>
    SORTED_JOB_DUMPFILE			<string>
    SORTED_JOB_DUMPFILE			<string>
    FAIR_ACCESS 			<access_spec>


Key options are described in greater detail below, the rest are discussed
in the configuration file ($PBS_HOME/sched_priv/sched_config).

* Queue and Associated Execution Host Definations

The queues are named on the following comma separated lists. If more space
is required, the list can be split into multiple lines, but each line must
be prefaced by the appropriate configuration option directive.

All queues are associated with a particular execution host. They may be
specified either as 'queuename' or 'queuename@exechost'. If only the queue
name is given, the canonical name of the local host will be automatically
appended to the queue name.

The scheduling algorithm picks jobs off the SUBMIT_QUEUE and attempts to
run them on the BATCH_QUEUES. Jobs are enqueued onto the SUBMIT_QUEUE via
the 'qsub' command (set the default queue name in PBS with the 'set server
default_queue' qmgr command), and remain there until they are rejected,
run, or deleted.  The host attached to the SUBMIT_QUEUE is ignored - it
is assumed to be on the PBS server.

Note that this implies that the SUBMIT_QUEUE's resources_max values must
be the union of all the BATCH_QUEUES' resources.

SUBMIT_QUEUE			pending

BATCH_QUEUES is a list of execution queues onto which the scheduler should
move and run the jobs it chooses from the SUBMIT_QUEUES. 
 
BATCH_QUEUES			hosta@hosta,hostb@hostb

EXPRESS_QUEUE is the name of the "highest priority" queue, jobs from which
will run before all else, checkpointing other jobs as necessary

EXPRESS_QUEUE   special

The following options are used to optimize system load average and scheduler 
effectiveness. It is a good idea to monitor system load as the user community 
grows, shrinks, or changes its focus from porting and debugging to production
runs. These defaults were selected for a 64 processor system with 16gb of
memory. 

Target Load Average (TARGET_LOAD_PCT) refers to a target percentage of the
maximum system load average (1 point for each processor on the machine). It
may vary as much as the +/- percentages listed in TARGET_LOAD_VARIANCE. Jobs
may or may not be scheduled if the load is too high or too low, even if the
resources indicate that doing so would otherwise cause a problem. The values
below attempt to maintain a system load within 75% to 100% of the theoretical
maximum (load average of 48.0 to 64.0 for a 64-cpu machine).
TARGET_LOAD_PCT			90%		
TARGET_LOAD_VARIANCE		-15%,+10%

The next section of options are used to enforce site-specific policies. It
is a good idea to reevaluate these policies as the user community grows,
shrinks, or changes its focus from porting and debugging to production.

Check for Prime Time Enforcement.  Sites with a mixed user base can use 
this option to enforce separate scheduling policies at different times
during the day. If ENFORCE_PRIME_TIME is set to "False", the non-prime-time
scheduling policy (as described in BATCH_QUEUES) will be used for the entire
24 hour period.

ENFORCE_PRIME_TIME		False

Prime-time is defined as a time period each working day (Mon-Fri)
from PRIME_TIME_START through PRIME_TIME_END.  Times are in 24
hour format (i.e. 9:00AM is 9:00:00, 5:00PM is 17:00:00) with hours, 
minutes, and seconds.  Sites can use the prime-time scheduling policy for 
the entire 24 hour period by setting PRIME_TIME_START and PRIME_TIME_END 
back-to-back.  The portion of a job that fits within primetime must be
no longer than PRIME_TIME_WALLT_LIMIT (represented in HH:MM:SS).

#PRIME_TIME_START		9:00:00
#PRIME_TIME_END			17:00:00
#PRIME_TIME_WALLT_LIMIT		1:00:00

The next option allows the site to choose an action to take upon scheduler
startup.  The default is to do no special processing (NONE). In some
instances, a job can end up queued in one of the batch queues, since it
was running before but was stopped by PBS. If the argument is RESUBMIT,
these jobs will be moved back to the queue the job was originally submitted
to, and scheduled as if they had just arrived. If the argument is RERUN,
the scheduler will have PBS run any jobs found enqueued on the execution
queues. This may cause the machine to get somewhat confused, as no limits
checking is done (the assumption being that they were checked when they
were enqueued).

SCHED_RESTART_ACTION		RESUBMIT

Define how long a job should be forced to wait in the queue before being
given "extra" priority to run. The priority given is exceeded only by the
priority of the Express queue jobs. Note that this extra priority is
ignored for jobs from an over-fairshare queue, or if the job owner has
exceeded his/her max running job limit. Value is express in HH:MM:SS
(default is 5 days)

MAX_WAIT_TIME                   120:00:00

If specified, this directive will tell the scheduler to dump an ordered
listing of the jobs to the named file. Useful for users and debugging,
but an expensive operation with LOTS of jobs queued, since the file is
rewritten for each run of the scheduler.

SORTED_JOB_DUMPFILE             /PBS/sched_priv/sorted_jobs

The Fair Access Directives allow the specification, on a per-architecture
basis, of a per-group limit on the maximum number of CPUs and the maximum
amount of memory simultaneously used by running jobs.
Format is: FAIR_ACCESS ARCH:arch1:groupA:num_cpus:MB_of_memory

FAIR_ACCESS ARCH:arch1:groupA:30:800
FAIR_ACCESS ARCH:arch1:default:40:100
FAIR_ACCESS ARCH:arch2:groupB:30:100
FAIR_ACCESS ARCH:arch2:default:20:100


* Lazy Commenting

Because changing the job comment for each of a large group of jobs can be
very time intensive, there is a notion of lazy comments. The function that
sets the comment on a job takes a flag that indicates whether or not the
comment is optional. Most of the "can't run because ..." comments are
considered to be optional.

When presented with an optional comment, the job will only be altered if
the job was enqueued after the last run of the scheduler, if it does not
already have a comment, or the job's 'mtime' (modification time) attribute
indicates that the job has not been touched in MIN_COMMENT_AGE seconds.

This should provide each job with a comment at least once per scheduler
lifetime.  It also provides an upper bound (MIN_COMMENT_AGE seconds + the
scheduling iteration) on the time between comment updates.
This compromise seemed reasonable because the comments themselves are some-
what arbitrary, so keeping them up-to-date is not a high priority.


Installing The UMN-Cluster Scheduler
------------------------------------

The UMN-Cluster scheduler is packaged as an optional scheduler for OpenPBS
v.2.3.  Basic steps are as follows (note that $PBSSRC is the directory into
which you extracted the PBS source tree; this is the directory that contains
the configure and configure.in file, amoung others); $PBSOBJ is the top of
your object tree.


Rebuilding PBS to use UMN-Cluster scheduler
--------------------------------------

While it is not necessary to rebuilt all of PBS, it is necessary
to rerun "configure" and then build the scheduler:

	cd $PBSOBJ
	$PBSSRC/configure [your options] --set-sched-code=umn_cluster
	make
	make install

Note: To run this scheduler on system that have more than 2gb of 
memory, you should build the scheduler in 64-bit mode, using the
configure options specific for your compiler.

Required modifications to existing PBS configuration
----------------------------------------------------

There are several changes that will need to be made to the PBS configuration.

The UMN-Cluster scheduler takes advantage of the server nodes file, which
contains one line per node. (For a detailed explaination of the format of
the "nodes" file, see the PBS Administrator Guide).

1. Edit $PBSHOME/server_priv/nodes, and add one line for each execution
   host, as the following example shows:

   pbsnode1:ts  np=8
   pbsnode2:ts  np=8
   pbsnode3:ts  np=8
   pbsnode4:ts  np=8

Where the first column is the hostname of the node, appended with a
":ts" denoting that it is a timeshared node, and the second column
is a number of processor specification. (This NP value is 
used as the maximum number of cpus on a given node. This allows
the server to reject a job immediately that requests more cpus
than can be provided by the current configuration.)

2. Create an execution queue that will become a holding queue from which
   the scheduler will pull jobs. This queue will need certain minimum
   attributes set, as indicated below:

   first start the server, if not already running:
   #pbs_server

   then create and set the queue attributes (example SUBMIT queue
   called "pending"):

   #qmgr
   create queue pending
   set queue pending queue_type = Execution
   set queue pending resources_default.ncpus = 1
   set queue pending resources_default.walltime = 00:05:00
   set queue pending enabled = True
   set queue pending started = True

3. Create an execution queue that corresponds to each execution host.
   Set the default and maximum attributes for each execution queue. 
   It is suggested you dump the qmgr output to a file, and then edit
   the file (ie  'qmgr -c "p s" > /tmp/somefile' ; edit the file; and
   then load the changed info back into the server: 'qmgr < /tmp/somefile').
   The example below shows the recommended attributes (and changes via qmgr):
   Note that the "from_route_only" directive prevents users from submitting
   directly to the backend execution queues. This is necessary since
   priorities are calculated based on the originating queues (i.e. the
   route queues defined below).

   #qmgr
   create queue pbsnode1
   set queue pbsnode1 queue_type = Execution
   set queue pbsnode1 from_route_only = True
   set queue pbsnode1 resources_max.ncpus = 8
   set queue pbsnode1 resources_max.walltime = 08:00:00
   set queue pbsnode1 resources_max.memory = 240mb
   set queue pbsnode1 resources_default.ncpus = 1
   set queue pbsnode1 resources_default.walltime = 00:05:00
   set queue pbsnode1 enabled = True
   set queue pbsnode1 started = True
   ...

4. Create as many "originating" Route queues as needed by your local
   configuration. These should be route queues with one destination:
   the above defined SUBMIT queue. Jobs will carry their originating
   queue name with them.

   # Create and define queue groupA
   #
   create queue groupA
   set queue groupA queue_type = Route
   set queue groupA route_destinations = pending
   set queue groupA resources_default.mem = 10mb
   set queue groupA enabled = True
   set queue groupA started = True
   #
   # Create and define queue groupB
   #
   create queue groupB
   set queue groupB queue_type = Route
   set queue groupB route_destinations = pending
   set queue groupB resources_default.mem = 20mb
   set queue groupB enabled = True
   set queue groupB started = True


   # and then set the default queue on the server. 
   #
   set server default_queue = pending


 5. Set the cluster-wide maximum number of cpus (e.g. the count of all the
    CPUs on all nodes within the cluster. This info will be used in addition
    to the dynamically queried per-node CPU counts. You will need to update
    this value if you remove or add nodes to your cluster. Do not worry about
    updating it for a temporarily unavailable node.

    #qmgr
    set server resources_max.ncpus = 48  (or whatever the correct value is)
    quit


Configuring the UMN-Cluster scheduler
--------------------------------------

The scheduler configuration file (as discussed above) will need to be
modified. Edit $PBSHOME/sched_priv/sched_config changing in particular
the BATCH_QUEUES line. This should contain the list of all the queues
you have defined, and the associated execution host, eg:

Host "hostA.pbs.com" has an associated queue named "hostA" and
"pbsnode1.pbs.com" is fed by queue "pbsnode1":

BATCH_QUEUES	hostA@hostA.pbs.com,pbsnode1@pbsnode1.pbs.com

However, the full hostname is not required, so for brevity, one could enter:

BATCH_QUEUES	hostA@hostA,pbsnode1@pbsnode1

The FAIR_ACCESS directive will also need to be updated, as described above.

Review the other configuration parameters, and change any as needed.
They are currently set to recommended defaults.


General Notes
-------------

This section has some general comments about this scheduler, and things
to be aware of.

Since this scheduler supports the PBS nodes file, you can use the "pbsnodes"
commands to view node status, take nodes offline, etc. Here is a short
summary of handy

	pbsnodes -l      # lists all down/offline/unavailable nodes

	pbsnodes -a 	 # lists all info for all nodes

	pbsnodes -o node # mark the named node OFFLINE, running jobs will
			 # will continue to run on that node, but no new
			 # jobs will be started on it. Be sure to list all
			 # OFFLINE nodes as any not listed will be assumed
			 # to be up.

	pbsnodes -c node # clear or remove the OFFLINE status on node, 
			 # making it available for running jobs again.

