Monday, August 26, 2013

In mainframe, from inside a job, check if another job is running

Recently, we had an odd issue in our client's production environment. Many years ago, maybe even before I knew there was something called a mainframe, someone had built a pair of jobs. The first job, let's call it LOGGER, reads log messages from an MQ queue and off-loads them to a mainframe dataset for archiving. This job is special: it never stops unless it abends or reads a message from the queue that instructs it to stop. Here comes the second job, let's call it STOPJOB, which puts a special message in the same MQ queue telling the first job to stop. The first job started along with the OS and stopped only when an operator scheduled the second job.

Everything was running fine until one day an operator, by mistake, scheduled STOPJOB twice during a scheduled quarterly maintenance. That left two stop messages in the MQ queue. LOGGER read the first message and stopped normally. After maintenance, when LOGGER was started along with the OS, it read the second stop message and stopped itself. Nobody noticed until the MQ job abended because of a space issue, which caused quite a big problem. After a lot of investigation the root cause was found, and it was decided to put a fail-safe mechanism in place, as the same situation could very easily happen again.
After a lot of discussion it was decided that the simplest solution would be to stop STOPJOB from executing when LOGGER is not running. In other words, STOPJOB should put the stop message in the MQ queue only when it knows for sure that LOGGER is running.

The challenge was: how do we find out, from within a job, whether another job is currently executing?
 
As always, a Google search was our best consultant. After searching through the IBM Information Center and other libraries, we found that there is a REXX interface to SDSF. This is really cool and exciting: anything you can do manually inside the spool can also be done from a REXX program, for example searching for currently running jobs or setting the job prefix or owner.
So we wrote a wee script and called it in STOPJOB just before the step that puts the stop message in the MQ queue.


JOBSTAT - REXX Script

/* REXX */
/*

Author: Chandan Chowdhury

Purpose
-------
Given a job name, this script looks in the executing system's spool and checks whether the job is currently running.

Return Code
-----------
RC=00, when job is running.
RC=01, when job is not running.
RC>04, when error

Interface: SDSF for REXX
*/

arg jobid

/* attach to host command environment */
IsfRC = IsfCalls("ON")
If IsfRC <> 0 Then Do
   say "IsfCalls: RC = " IsfRC
   exit IsfRC
End

IsfPrefix = jobid /* Only one jobid */
IsfOwner = "*" /* Any Owner */

/* check the INPUT spool queue */
address sdsf "IsfExec i"
If RC <> 0 Then Do
  say "SDSF: RC = " RC
  exit RC
End

JRC = 00
/* 
JNAME stem has the list of jobs matching the prefix and owner.
JNAME.0 has the count of jobs found.
*/
If JNAME.0 > 0 Then
   do
      JRC = 00 /* Job is found running, so set RC=00 */
   end
else
   do
      JRC = 01 /* Job is NOT found running, so set RC=01 */
   end

/* Detach from host command environment */
IsfRC = IsfCalls("OFF")
If IsfRC <> 0 Then Do
   say "IsfCalls RC = " IsfRC
   exit IsfRC
end

/* exit with a Return Code */
exit JRC
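
The script above queries the I (input queue) panel, which, as far as I understand, can also list jobs that are still waiting to execute. If that matters in your shop, a variant along the following lines could query the DA (display active) panel instead and compare the job name exactly. Treat it as an untested sketch: the member name JOBACT and the error code 8 are made up for illustration, and it assumes the job runs under a user ID that is authorized for the DA command.

```rexx
/* REXX */
/* JOBACT (hypothetical name): sketch of a DA-based variant of    */
/* JOBSTAT. RC=0 job is executing, RC=1 it is not, RC=8 on error. */
arg jobname .
if jobname = "" then do
   say "Usage: JOBACT jobname"
   exit 8
end

/* enable the SDSF host command environment */
if isfcalls("ON") <> 0 then do
   say "Could not enable the SDSF host command environment"
   exit 8
end

isfprefix = jobname      /* only rows for this job name */
isfowner  = "*"          /* any owner */

/* query the active-jobs panel instead of the input queue */
address sdsf "ISFEXEC DA"
sdsfrc = rc
if sdsfrc <> 0 then do
   say "ISFEXEC DA: RC =" sdsfrc
   /* ISFMSG holds the SDSF short message, when one was issued */
   if symbol("ISFMSG") = "VAR" then say isfmsg
   call isfcalls "OFF"
   exit 8
end

/* assume "not running" until an exact name match is found */
jrc = 1
do i = 1 to jname.0
   if strip(jname.i) = jobname then do
      jrc = 0
      leave
   end
end

/* detach from the host command environment and report */
call isfcalls "OFF"
exit jrc
```

It keeps the same return-code convention as JOBSTAT, so the IF test in the calling JCL would not need to change.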

The job snippet below uses the script above:
//JOBSTAT EXEC PGM=IRXJCL,PARM='JOBSTAT LOGGER'
//SYSEXEC DD DISP=SHR,DSN=PROD.REXX.LIB
//DDDUMP DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
// IF ( RC = 0 ) THEN
//STOPJOB EXEC PGM=STOPJOB,PARM='LOGGER'
// ENDIF