SAS Grid uses LSF for load sharing. LSF records every job that has run in a file named lsb.events. This file is stored under your cluster directory, which is usually under the LSF work directory.
Below is where the file is stored in our environment.
/app/lsf/work/bank_cluster/logdir/lsb.events
You can get formatted output by running the bhist command. For some reason it doesn't work in our environment, so we used the raw file and parsed it with custom code.
The file is space-delimited, and you can get the column information from the IBM documentation.
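Since the file is space-delimited, a plain split on spaces is tempting, but string fields such as the command can themselves contain spaces. Below is a minimal sketch of one way to tokenize a line, assuming string fields are wrapped in double quotes and that an embedded quote is doubled; please verify both assumptions against your own lsb.events. The function name split_event_line and the sample line are made up for illustration, and the sample is only a fragment, not a complete record.

```python
import csv
import io


def split_event_line(line):
    """Split one lsb.events line into a list of string tokens.

    Assumption: fields are separated by single spaces and string
    fields are wrapped in double quotes (embedded quotes doubled).
    """
    reader = csv.reader(io.StringIO(line.strip()), delimiter=" ", quotechar='"')
    return next(reader, [])


# Hypothetical fragment of a record, for illustration only.
sample = '"JOB_NEW" "10.1" 1700000000 101 "sas -sysin report.sas"'
tokens = split_event_line(sample)
print(tokens)
```

Every token comes back as a string, so numeric fields still need to be converted with int() or float() afterwards.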
For lines starting with JOB_NEW, the column names are listed below.
JOB_NEW
A new job has been submitted. The fields in order of occurrence are:
- Version number (%s)
- The version number
- Event time (%d)
- The time of the event
- jobId (%d)
- Job ID
- userId (%d)
- UNIX user ID of the submitter
- options (%d)
- Bit flags for job processing
- numProcessors (%d)
- Number of processors requested for execution
- submitTime (%d)
- Job submission time
- beginTime (%d)
- Start time – the job should be started on or after this time
- termTime (%d)
- Termination deadline – the job should be terminated by this time
- sigValue (%d)
- Signal value
- chkpntPeriod (%d)
- Checkpointing period
- restartPid (%d)
- Restart process ID
- userName (%s)
- User name
- rLimits
- Soft CPU time limit (%d), see getrlimit(2)
- rLimits
- Soft file size limit (%d), see getrlimit(2)
- rLimits
- Soft data segment size limit (%d), see getrlimit(2)
- rLimits
- Soft stack segment size limit (%d), see getrlimit(2)
- rLimits
- Soft core file size limit (%d), see getrlimit(2)
- rLimits
- Soft memory size limit (%d), see getrlimit(2)
- rLimits
- Reserved (%d)
- rLimits
- Reserved (%d)
- rLimits
- Reserved (%d)
- rLimits
- Soft run time limit (%d), see getrlimit(2)
- rLimits
- Reserved (%d)
- hostSpec (%s)
- Model or host name for normalizing CPU time and run time
- hostFactor (%f)
- CPU factor of the above host
- umask (%d)
- File creation mask for this job
- queue (%s)
- Name of job queue to which the job was submitted
- resReq (%s)
- Resource requirements
- fromHost (%s)
- Submission host name
- cwd (%s)
- Current working directory (up to 4094 characters for UNIX or 255 characters for Windows)
- chkpntDir (%s)
- Checkpoint directory
- inFile (%s)
- Input file name (up to 4094 characters for UNIX or 255 characters for Windows)
- outFile (%s)
- Output file name (up to 4094 characters for UNIX or 255 characters for Windows)
- errFile (%s)
- Error output file name (up to 4094 characters for UNIX or 255 characters for Windows)
- subHomeDir (%s)
- Submitter’s home directory
- jobFile (%s)
- Job file name
- numAskedHosts (%d)
- Number of candidate host names
- askedHosts (%s)
- List of names of candidate hosts for job dispatching
- dependCond (%s)
- Job dependency condition
- preExecCmd (%s)
- Job pre-execution command
- jobName (%s)
- Job name (up to 4094 characters)
- command (%s)
- Job command (up to 4094 characters for UNIX or 255 characters for Windows)
- nxf (%d)
- Number of files to transfer
- xf (%s)
- List of file transfer specifications
- mailUser (%s)
- Mail user name
- projectName (%s)
- Project name
- niosPort (%d)
- Callback port if batch interactive job
- maxNumProcessors (%d)
- Maximum number of processors
- schedHostType (%s)
- Execution host type
- loginShell (%s)
- Login shell
- timeEvent (%d)
- Time Event, for job dependency condition; specifies when time event ended
- userGroup (%s)
- User group
- exceptList (%s)
- Exception handlers for the job
- options2 (%d)
- Bit flags for job processing
- idx (%d)
- Job array index
- inFileSpool (%s)
- Spool input file (up to 4094 characters for UNIX or 255 characters for Windows)
- commandSpool (%s)
- Spool command file (up to 4094 characters for UNIX or 255 characters for Windows)
- jobSpoolDir (%s)
- Job spool directory (up to 4094 characters for UNIX or 255 characters for Windows)
- userPriority (%d)
- User priority
- rsvId (%s)
- Advance reservation ID; for example, "user2#0"
- jobGroup (%s)
- The job group under which the job runs
- sla (%s)
- SLA service class name under which the job runs
- rLimits
- Thread number limit
- extsched (%s)
- External scheduling options
- warningAction (%s)
- Job warning action
- warningTimePeriod (%d)
- Job warning time period in seconds
- SLArunLimit (%d)
- Absolute run time limit of the job for SLA service classes
- licenseProject (%s)
- IBM Platform License Scheduler project name
- options3 (%d)
- Bit flags for job processing
- app (%s)
- Application profile name
- postExecCmd (%s)
- Post-execution command to run on the execution host after the job finishes
- runtimeEstimation (%d)
- Estimated run time for the job
- requeueEValues (%s)
- Job exit values for automatic job requeue
- resizeNotifyCmd (%s)
- Resize notification command to run on the first execution host to inform job of a resize event.
- jobDescription (%s)
- Job description (up to 4094 characters).
- submitEXT
- Submission extension field, reserved for internal use.
- Num (%d)
- Number of elements (key-value pairs) in the structure.
- key (%s)
- Reserved for internal use.
- value (%s)
- Reserved for internal use.
- srcJobId (%d)
- The submission cluster job ID
- srcCluster (%s)
- The name of the submission cluster
- dstJobId (%d)
- The execution cluster job ID
- dstCluster (%s)
- The name of the execution cluster
- network (%s)
- Network requirements for IBM Parallel Environment (PE) jobs.
- cpu_frequency (%d)
- CPU frequency at which the job runs.
- options4 (%d)
- Bit flags for job processing
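The field list above can be turned into a small parser. Here is a rough sketch that maps only the leading, fixed-position fields of a JOB_NEW line to names, assuming the event type is the first token and the fields follow in the order listed. It stops at userName on purpose: later fields such as askedHosts and xf repeat according to numAskedHosts and nxf, so positions after them are not fixed. The names LEADING_FIELDS and parse_job_new_head, and the sample line, are hypothetical and not from our actual code.

```python
import csv
import io

# Event type followed by the first fields of the JOB_NEW list, in order.
LEADING_FIELDS = [
    "eventType", "version", "eventTime", "jobId", "userId", "options",
    "numProcessors", "submitTime", "beginTime", "termTime", "sigValue",
    "chkpntPeriod", "restartPid", "userName",
]
# Fields documented as %s stay strings; the rest are %d integers.
STRING_FIELDS = {"eventType", "version", "userName"}


def parse_job_new_head(line):
    """Return a dict of the leading JOB_NEW fields, or None for other lines."""
    tokens = next(
        csv.reader(io.StringIO(line.strip()), delimiter=" ", quotechar='"'), []
    )
    if len(tokens) < len(LEADING_FIELDS) or tokens[0] != "JOB_NEW":
        return None
    record = {}
    for name, token in zip(LEADING_FIELDS, tokens):
        record[name] = token if name in STRING_FIELDS else int(token)
    return record


# Hypothetical, truncated record used only to exercise the function.
sample = '"JOB_NEW" "10.1" 1700000000 101 5001 33554432 1 1700000000 0 0 0 0 0 "sasdemo"'
job = parse_job_new_head(sample)
print(job["jobId"], job["userName"], job["submitTime"])
```

To go further than userName, the parser would have to read numAskedHosts and nxf first and then consume that many entries before moving on to the next field.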