How to determine a job's place in a PBS queue? - bash

I'm working with a computation cluster that uses PBS/Torque for job scheduling. The queue can be pretty long at times; for example, I currently have a few jobs submitted in a queue of over 800 (as reported by showq, which shows a full list of jobs, though as far as I am aware these aren't necessarily in order of execution).
I would like to find out where in the queue my jobs are located, i.e. how many will be processed before mine. I would like to get some output like: Job <id>: 417/862. This way I would have at least some indication of progress and waiting time. However, I have not been able to find out how to do this. Can it be done, and how?

I wasn't sure I could count on queued jobs being executed in the order presented by showq, but after some more research it certainly looks that way.
The queue printed by showq has the following format:
ACTIVE JOBS--------
[table headers]
[listing of active jobs]
IDLE JOBS--------
[table headers]
[listing of idle jobs]
BLOCKED JOBS----------
[table headers]
[listing of blocked jobs]
Based on this format, I came up with the following bash script to find a job's place in the idle section of the queue, given the job's id:
#!/bin/bash
job=$1
idlestart=$(showq | grep -n "IDLE JOBS" | cut -d: -f1)
jobline=$(showq | grep -n "$job" | cut -d: -f1)
place=$((jobline - idlestart - 2))
echo "Idle Jobs section starts at line $idlestart"
echo "Job $job at line $jobline"
echo "Place in queue: $place"
Example output:
$ ./placeinq 6565618
Idle Jobs section starts at line 343
Job 6565618 at line 387
Place in queue: 42
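To get closer to the Job <id>: 417/862 format asked for above, the same line-counting idea can be extended by also locating the BLOCKED JOBS header and counting the lines in between. This is only a sketch: it captures the showq output once so the three line numbers are consistent, and the -2/-3 offsets assume the section layout shown earlier (a section header, one table-header line, and a blank line before the next section), so they may need adjusting for your showq output.
job=$1
queue=$(showq)
idlestart=$(echo "$queue" | grep -n "IDLE JOBS" | cut -d: -f1)
blockedstart=$(echo "$queue" | grep -n "BLOCKED JOBS" | cut -d: -f1)
jobline=$(echo "$queue" | grep -n "$job" | cut -d: -f1)
# offsets below depend on the exact header/blank lines in your showq output
place=$((jobline - idlestart - 2))
total=$((blockedstart - idlestart - 3))
echo "Job $job: $place/$total"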

Related

How do I write a Windows 10 script to increment a counter by 1 each time a print job prints?

I'd like to write a script for Windows 10 to add 1 to a count in a .txt each time a print job completes. Ideally a separate count for each day, so I can see how many print jobs were completed in a day.
Any help in understanding how to go about this is appreciated!
The print service already logs every time it prints - you just need to enable the appropriate event log channel and consume the resulting log events:
# Enable the Microsoft-Windows-PrintService/Operational log channel
wevtutil.exe set-log Microsoft-Windows-PrintService/Operational /enabled:true
Now that the log channel is enabled, the print service will log an event with event ID 307 every time it executes a local print job. Since the log events all have timestamps, getting a count per day is as simple as using the Group-Object cmdlet:
# Fetch the print job events from the event log
$printJobEvents = Get-WinEvent -FilterHashtable @{ LogName='Microsoft-Windows-PrintService/Operational'; Id=307 }
# Group by date logged, to get a count-per-day
$printJobEvents |Group-Object { '{0:yyyy-MM-dd}' -f $_.TimeCreated.Date } -NoElement |Sort-Object Name
One technique that might be useful is to query stats for the spooler service like this:
Get-CimInstance 'Win32_PerfFormattedData_Spooler_PrintQueue' |
Format-Table -Property Name,Jobs,TotalJobsPrinted,TotalPagesPrinted -AutoSize
This gives output like this:
Name     Jobs TotalJobsPrinted TotalPagesPrinted
----     ---- ---------------- -----------------
Printer1    0               50               212
Printer2    3               13               118
Printer3    1               33               306
_Total      4               96               636
The stats are reset each time the Print Spooler service restarts, so you'll need to take that into account in your final script, which might make this a trickier option than Mathias' event log solution.

Cancelling long-running Elasticsearch tasks times out

My _search requests had been gradually becoming slower and slower to the point of 504 gateway timeouts. Then I saw dozens of super-long running indices:data/read/search tasks with no end in sight so I tried to cancel them using POST _tasks/_cancel?actions=*search (note that I only have one node of interest so I didn't need the &node=... param).
This only resulted in another (cancel) task being registered and now even my GET _tasks and GET _cat/tasks?v requests are timing out.
I'm wondering whether it's possible to:
1. set a cap on the running_time_in_nanos attribute of all search tasks and/or auto-cancel any that exceed it
2. force-cancel the tasks without having to restart the ES service, given that the Tasks API itself is timing out
Side note: I already have a health-check bash script
if [ $(curl -s -m 20 https://my-es-instance | grep "You Know, for Search" | wc -l) -ne 1 ];
then
    echo "$(date "+%F %T") app not responding" &>> "$my_log_file"
    ...
but it doesn't account for the fact that the root endpoint (GET /) may be responding while the _search endpoints are not.
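A lightweight variant of the same check (only a sketch; the size=0 search and the 10-second timeout are arbitrary choices) could hit _search directly, so that the probe fails when searches hang even though GET / still answers:
# probe the _search endpoint itself; -f fails on HTTP errors, -m caps the wait
if ! curl -s -f -m 10 -o /dev/null "https://my-es-instance/_search?size=0"; then
    echo "$(date "+%F %T") _search not responding" &>> "$my_log_file"
fi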
What are the best practices here?

Parallel computing on multiple cores for data which is run independently with the program

I have a simulation program in Fortran which takes its input from a .dat file. This file has 100,000 lines, which takes a really long time to run. The program takes the first line, runs all the simulations, writes the result to a .out file, and moves on to the next line. I have a computer with 16 CPUs, so how can I split my data into 16 parts and run each part separately on its own CPU? I am running on a machine with Ubuntu. Each line is totally independent of the others.
For example, my data is HeadData10000.dat, and I have a file simulation.ini containing the name of the input data (in this case HeadData10000.dat) and the name of the output data. So the file simulation.ini will look like this:
HeadData10000.dat
outputdata.out
Right now I have two computers, so I split HeadData10000.dat into two files, create a simulation.ini for each input file, and run it like this on each computer: ./simulation.exe<./simulation.ini.
Assuming your list of 100,000 jobs is called "jobs.txt" and looks like this:
JobA
JobB
JobC
JobD
You could run this:
parallel 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
If you want to do a dry run to see what that would do without doing anything:
parallel --dry-run 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
Sample Output
printf "JobA\nJobA.out" | ./simulation.exe
printf "JobB\nJobB.out" | ./simulation.exe
printf "JobC\nJobC.out" | ./simulation.exe
printf "JobD\nJobD.out" | ./simulation.exe
If you have multiple servers available, look at using the -S parameter to GNU Parallel to spread the jobs across the machines. Also, look at the --eta and --bar parameters for getting progress reports.
I used printf "line1 \n line2" to generate the two lines of input, in order to avoid having to create, and later delete, 100,000 files.
By default, GNU Parallel will keep 1 job per CPU core running, so there will always be 16 jobs running on your 16-core machine, but you can change that to, say, 8 if you want to with parallel -j 8. You can also specify the number of jobs to run on your second (and subsequent) machines.
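As a hedged illustration of those options (the hostname server2 is a placeholder, and ./simulation.exe and any files it needs would have to be present on the remote machine as well):
# run jobs on the local machine (":") and on a second machine reachable
# as "server2" over ssh, showing a progress bar
parallel --bar -S :,server2 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt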

Efficient way of sending the same data to multiple dynamic processes

I have a stream of line-buffered data and many readers in other processes.
The readers need to attach to the system dynamically; they are not known to the process writing the stream.
First I tried reading every line and simply sending it to a lot of pipes:
#writer
command | while read -r line; do
    printf '%s\n' "$line" | tee listeners/*
done
#reader
mkfifo listeners/1
cat listeners/1
But that consumes a lot of CPU.
So I thought about writing to a file and truncating it repeatedly:
#writer
command >> file &
while true; do
    : > file
    sleep 1
done
#reader
tail -f -n0 file
But sometimes a line is not read by one or more readers before the truncation, creating a race condition.
Is there a better way I could implement this?
Sounds like pub/sub to me - see Wikipedia.
Basically, new interested parties come along whenever they like and "subscribe" to your channel. The process receiving the data then "publishes" it, line by line, to that channel.
You can do it with MQTT using mosquitto, or with Redis. Both have command-line interfaces/bindings, as well as Python, C/C++, Ruby, PHP etc. The client and server need not be on the same machine; some clients could be elsewhere on the network.
Mosquitto, for example, can be driven entirely from the command line.
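As a rough sketch (assuming the standard mosquitto_sub/mosquitto_pub clients and a broker running on localhost), each reader subscribes to a topic and the writer publishes its stream line by line; mosquitto_pub's -l option reads messages from stdin, one per line, so no per-line process is spawned:
# each reader
mosquitto_sub -h localhost -t myStream
# the writer
command | mosquitto_pub -h localhost -t myStream -l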
I did a few tests on my Mac with Redis pub/sub. The client code in Terminal to subscribe to a channel called myStream looks like this:
redis-cli SUBSCRIBE myStream
I then ran a process to synthesise 10,000 lines like this:
time seq 10000 | while read a ; do redis-cli PUBLISH myStream "$a" >/dev/null 2>&1 ; done
And that takes 40s, so it does around 250 lines per second, but it has to start a whole new process for each line and create and tear down the connection to Redis... and we don't want to send your CPU mad.
More appropriately for your situation then, here is how you can create a file with 100,000 lines, and read them one at a time, and send them to all your subscribers in Python:
# Make a "BigFile" with 100,000 lines
seq 100000 > BigFile
and read the lines and publish them with:
#!/usr/bin/env python3
import redis

if __name__ == '__main__':
    # Redis connection
    r = redis.Redis(host='localhost', port=6379, db=0)

    # Read file line by line...
    with open('BigFile', 'r') as infile:
        for line in infile:
            # Publish the current line to subscribers
            r.publish('myStream', line)
The entire 100,000 lines were sent and received in 4s, so around 25,000 lines per second, and the CPU was not unduly troubled by it: in my test, one terminal ran the Python publisher above while two separate client terminals each received all 100,000 lines.
Keywords: Redis, mosquitto, pub/sub, publish, subscribe.

Adding Job Array elements in Slurm after submission

I'm trying to use a Slurm-operated cluster to run LS-Dyna (a finite-element simulation program with a limited number of licenses available on my cluster). I am trying to write my batch scripts so that I do not waste processing time due to this license limit (as well as to improve legibility when running 'squeue' commands) by using job arrays, but I'm having trouble making that work.
I want to run identical Bash scripts on a variety of FEM meshes, each of which I have organized into its own subfolder.
Given this folder structure on my cluster...
cluster root
|
...
|
|-+ my scratch space's root
|
|-+ this project
|
|--+ lat_-5mm
| |- runCurrentLine.bash
| |- other files
|
|--+ lat_-4.75mm
| |- runCurrentLine.bash
| |- other files
|
|--+ lat_-4.5mm
| |- runCurrentLine.bash
| |- other files
|
...
|
|--+ lat_5mm
| |- runCurrentLine.bash
| |- other files
|
|
|-sendDynaRuns.bash
|-other dependencies
...I'm trying to submit "runCurrentLine.bash" in each folder by running the following script in my login node.
#!/bin/bash
iter=0
for foldernow in */; do
    # change to subdirectory for current line iteration
    cd "./${foldernow}";
    # make Slurm and user happy
    echo "sending LS Dyna simulation for ${pos}mm line..."
    sleep 1
    # first line only: send batch, and get job ID
    if [ "${iter}" == 0 ]; then
        # send the batch...
        jobID=$(sbatch -J "Dyna" --array="${iter}"%15 runCurrentLine.bash)
        # ...ensure that Slurm's output shows on console (which includes the job ID)...
        echo "${jobID}"
        # ...and extract the job ID and save as a variable
        jobID=$(echo "${jobID}" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
    # subsequent lines: add current line to job array
    else
        scontrol update --jobid="${jobID}" --array="${iter}"%15 runCurrentLine.bash
    fi
    # prepare to move onto next position
    iter=$((iter+1))
    cd ../
done
This setup properly sends the batch job for the first line, at -0.25mm (I intended the "lat_xmm" folders to be numerically ordered, but Unix doesn't sort them that way). However, from the second line onwards it doesn't seem to do the same thing. This is what I end up getting on my console:
$ ./sendDynaRuns.bash
sending LS Dyna simulation for -0.25mm line...
Submitted batch job 1081040
sending LS Dyna simulation for 0.25mm line...
sbatch: error: Batch job submission failed: Invalid job id specified
sending LS Dyna simulation for -0.5mm line...
sbatch: error: Batch job submission failed: Invalid job id specified
I know that runCurrentLine.bash runs just fine if I manually send it as a batch (and it runs to completion within the time limit I specified in-file, mainly since it doesn't have to compete with other lines for open licenses). What should I do to be able to get my code to work?
Thank you in advance!
As stated by @Poshi, you cannot add jobs to an existing array.
I would create a submission script like this one:
#!/bin/bash
#SBATCH --array=1-<nb of folders>%15
# ALL OTHER SLURM SBATCH DIRECTIVES HERE
folders=(lat_*)
# bash arrays are zero-indexed, while the array task IDs here start at 1
foldernow=${folders[$SLURM_ARRAY_TASK_ID - 1]}
cd "$foldernow" && ./runCurrentLine.bash
The only drawback is that you need to set the size of the array explicitly, based on the number of folders.
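If hard-coding that number is a concern, one option (just a sketch; runArray.bash is a hypothetical name for the submission script above, with its #SBATCH --array line removed) is to count the folders at submit time and pass the array range on the command line instead:
# size the array from the number of lat_* folders at submit time
nfolders=$(ls -d lat_*/ | wc -l)
sbatch -J "Dyna" --array=1-${nfolders}%15 runArray.bash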
