Invalid job array specification in slurm - cluster-computing

I am submitting a toy array job in slurm. My command line is
$ sbatch -p development -t 0:30:0 -n 1 -a 1-2 j1
where j1 is script:
#!/bin/bash
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id is $SLURM_ARRAY_TASK_ID
When I submit this, I get an error:
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/03400/myname)...OK
--> Verifying availability of your work dir (/work/03400/myname)...OK
--> Verifying availability of your scratch dir (/scratch/03400/myname)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (PRJ-1234)...OK
sbatch: error: Batch job submission failed: Invalid job array specification
The same job works fine without the array specification:
$ sbatch -p development -t 0:30:0 -n 1 j1

This post is a bit old, but in case others run into it: I had the same issue, and the accepted answer did not cover the cause in my case.
This error (sbatch: error: Batch job submission failed: Invalid job array specification) can also be raised when the array is too large, that is, when the highest task index reaches or exceeds MaxArraySize.
From https://slurm.schedmd.com/slurm.conf.html
MaxArraySize
The maximum job array size. The maximum job array task index value will be one less than MaxArraySize to allow for an index value of zero. Configure MaxArraySize to 0 in order to disable job array use. The value may not exceed 4000001. The value of MaxJobCount should be much larger than MaxArraySize. The default value is 1001.
To check the value, look in slurm.conf. According to the same page, the file should be readable by all Slurm users, and it is usually found near /etc/slurm.conf (see https://slurm.schedmd.com/slurm.conf.html#lbAM); in my case it was at /etc/slurm/slurm.conf.
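If the file is hard to locate, `scontrol show config` prints the running configuration as `Key = Value` lines, and the limit can be read from there. The sketch below is only an illustration: the helper names `parse_max_array_size` and `index_fits` are made up here, and it assumes the key is spelled `MaxArraySize` in the text it is given.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Slurm.

# Read "Key = Value" or "Key=Value" lines on stdin and print MaxArraySize.
# Commented-out lines (starting with #) are ignored because the key
# must be the first thing on the line.
parse_max_array_size() {
    awk -F'=' '
        tolower($1) ~ /^[ \t]*maxarraysize[ \t]*$/ {
            sub(/#.*/, "", $2)       # drop a trailing comment, if any
            gsub(/[ \t]/, "", $2)    # drop surrounding whitespace
            print $2
            exit
        }'
}

# Succeed if the highest array index is strictly below MaxArraySize.
# Usage: index_fits <highest_index> <max_array_size>
index_fits() {
    [ "$1" -lt "$2" ]
}
```

On a cluster this could be used as `scontrol show config | parse_max_array_size`, or as `parse_max_array_size < /etc/slurm/slurm.conf` if the file is readable.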

The syntax of your array specification is correct. However, the printout you pasted is not standard Slurm output. I guess you are working on Stampede; they have their own sbatch wrapper.
What you could do is use the -vvv option to sbatch to see exactly what Slurm sees:
$ sbatch -vvv -p development -t 0:30:0 -n 1 -a 1-2 j1 |& grep array
This should return
sbatch: array : 1-2
If it does not, the array specification is being lost somewhere before it reaches Slurm.
You can also try removing the array specification from the submission command line and putting it in the submission script instead, like this:
$ sbatch -p development -t 0:30:0 -n 1 j1
with j1 being
#!/bin/bash
#SBATCH -a 1-2
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id is $SLURM_ARRAY_TASK_ID
The next step is to contact the system administrators with the information you will get from running the above tests and ask for help.


Slots command in hostfile for mpirun not recognised

I saw another question that seemed similar ("mpirun: token slots not supported"), but its solution did not work for me.
I get the error
token slots not supported at this time
when running the command mpirun -hostfile temp.txt hostname
where temp.txt is
hostname1 slots=2
hostname2 slots=2
I have the mpirun version 2021.5
Release Date: 20211102 (id: 9279b7d62).
Writing the following instead did not work either:
hostname1:2
hostname2:2
In that case the command runs, but it uses the number of physical processors that are available, which is the default.
EDIT: I am adding the full output
[host RAMSES]$ mpirun -hostfile temp.txt hostname
[mpiexec@host] HYD_hostfile_process_tokens (../../../../../src/pm/i_hydra/libhydra/hostfile/hydra_hostfile.c:47): token slots not supported at this time
[mpiexec@host] HYD_hostfile_unique_parse (../../../../../src/pm/i_hydra/libhydra/hostfile/hydra_hostfile.c:232): unable to process token
[mpiexec@host] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:83): match handler returned error
[mpiexec@host] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@host] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
[mpiexec@host] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1784): error parsing parameters
I found that with my version of MPI I had to specify the process placement in a machinefile rather than in the hostfile, which is what most of the examples I found use.
So the new command and file look like:
mpirun -machinefile machine.txt hostname
machine.txt:
host1:2
host2:2
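If a hostfile in the `host slots=N` style already exists, it can be converted rather than retyped. This is a minimal sketch, assuming one host per line and the `host:N` machinefile format shown above; the function name `hostfile_to_machinefile` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: rewrite "host slots=N" lines as "host:N".
# Lines that do not match that pattern are passed through unchanged.
hostfile_to_machinefile() {
    sed -E 's/^[[:space:]]*([^[:space:]#]+)[[:space:]]+slots=([0-9]+)[[:space:]]*$/\1:\2/'
}
```

For example, `hostfile_to_machinefile < temp.txt > machine.txt` followed by `mpirun -machinefile machine.txt hostname`.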

'Wildcards' object has no attribute 'output'

I get an error for a rather simple rule. I have to write a task file for another program, expecting a tsv file. I read a certain number of parameters from my config file and write them to a file with a shell command.
Code:
rule create_tasks:
    output:
        temp("tasks_{sample}.tsv")
    params:
        ID="{sample}",
        file=lambda wc: samples["path"][wc.sample],
        bigwig=lambda wc: samples["bigwig"][wc.sample],
        ambig=lambda wc: samples["ambig"][wc.sample]
    shell:
        'echo -e "{params.ID}\t{params.file}" > {output}'
When I execute the workflow, I get the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
count jobs
1 create_tasks
1
[Mon Oct 12 14:48:15 2020]
rule create_tasks:
output: tasks_sampleA.tsv
jobid: 0
wildcards: sample=sampleA
echo -e "sampleA /Path/To/sampleA.bed " > tasks_sampleA.tsv
WorkflowError in line 23 of /path/to/workflow.snakefile:
'Wildcards' object has no attribute 'output'
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 1233, in run
I should mention that two of the variables are empty and that I expect the tabs/whitespace in the echo command.
Does anybody have an explanation for why snakemake is trying to find output in the wildcards? I am especially confused because it prints the correct command.
I've run into this same problem.
The issue is probably in how you invoked Snakemake from the command line.
For example, this was my Snakefile rule:
rule sort:
    input:
        "{file}.bam",
    output:
        "{file}.sorted.bam",
        "{file}.sorted.bai",
    shell:
        "sambamba sort {input}"
I don't even have params or wildcards explicitly anywhere in there.
But when I run it on my Slurm HPC I get the same error:
snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml
The Wildcards (note the capital "W") and params objects weren't from the rule.
They came from the cluster execution of the rule, and the error was thrown when trying to parse the cluster.yaml file.
There was no cluster parameter specification in my cluster.yaml file for the sort rule, so the error was thrown.
I fixed this by adding
sort:
  params: "..."
to my cluster.yaml file.
In your case, add the cluster submission options under a create_tasks: ... entry.
You can also add a __default__: ... entry, whose submission parameters apply to any job whose rule has no entry of its own.
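As a rough illustration, the snippet below writes a cluster.yaml containing both a `__default__` entry and a `create_tasks` entry. The `params` key matches the `{cluster.params}` placeholder in the command above; the sbatch options inside the strings and the function name `write_cluster_config` are placeholders chosen for this sketch and should be replaced with whatever your cluster needs.

```shell
#!/bin/bash
# Write an example cluster config to the path given as $1.
# The sbatch options below are placeholders, not taken from the question.
write_cluster_config() {
    cat > "$1" <<'EOF'
__default__:
  params: "-t 0:30:0 -n 1"
create_tasks:
  params: "-t 0:10:0 -n 1"
EOF
}
```

Calling `write_cluster_config cluster.yaml` and then running `snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml` would give every rule a value for `{cluster.params}`, with create_tasks using its own entry.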

problem with snakemake submitting jobs with multiple wildcard on SGE

I used snakemake on an LSF cluster before and everything worked just fine. However, I recently migrated to an SGE cluster, and I am getting a very strange error when I try to run a job with more than one wildcard.
When I try to submit a job based on this rule
rule download_reads:
    threads: 1
    output: "data/{sp}/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh {wildcards.sp} {wildcards.accesion} data/{wildcards.sp}/raw_reads/{wildcards.accesion}"
I get the following error (snakemake_clust.sh details below):
./snakemake_clust.sh data/Ecol1/raw_reads/SRA123456_1.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 10
Job counts:
count jobs
1 download_reads
1
[Thu Jul 30 12:08:57 2020]
rule download_reads:
output: data/Ecol1/raw_reads/SRA123456_1.fastq.gz
jobid: 0
wildcards: sp=Ecol1, accesion=SRA123456
scripts/download_reads.sh Ecol1 SRA123456 data/Ecol1/raw_reads/SRA123456
Unable to run job: ERROR! two files are specified for the same host
ERROR! two files are specified for the same host
Exiting.
Error submitting jobscript (exit code 1):
Shutting down, this might take some time.
When I replace the sp wildcard with a constant, it works as expected:
rule download_reads:
    threads: 1
    output: "data/Ecol1/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh Ecol1 {wildcards.accesion} data/Ecol1/raw_reads/{wildcards.accesion}"
I.e. I get
Submitted job 1 with external jobid 'Your job 50731 ("download_reads") has been submitted'.
I wonder why I have this problem. I am sure I used exactly the same rule on the LSF-based cluster before without any issue.
some details
The snakemake submitting script looks like this
#!/usr/bin/env bash
mkdir -p logs
snakemake $@ -p --jobs 10 --latency-wait 120 --cluster "qsub \
-N {rule} \
-pe smp64 \
{threads} \
-cwd \
-b y \
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
-b y makes the command run as it is; -cwd changes the working directory on the compute node to the working directory the job was submitted from. I hope the other flags and specifications are clear.
Also, I am aware of the --drmaa flag, but I think our cluster is not well configured for that. Until now, --cluster has been the more robust solution.
-- edit 1 --
When I execute exactly the same snakefile locally (on the frontend, without the --cluster flag), the script gets executed as expected. It seems to be a problem with the interaction between snakemake and the scheduler.
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
This is a random guess... When there is more than one wildcard, the values are joined with a space before being substituted into logs/{rule}.{wildcards}.err. So even though you use double quotes, SGE treats the resulting string as two files and throws the error. What if you use single quotes instead? Like:
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'
Alternatively, you could concatenate the wildcards in the rule and use the result on the command line. E.g.:
rule one:
    params:
        wc=lambda wc: '_'.join(wc)
    output: ...
Then use:
-o 'logs/{rule}.{params.wc}.out' \
-e 'logs/{rule}.{params.wc}.err'
(This second solution, if it works, kind of sucks though)
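The word splitting suspected above can be reproduced in plain bash, independent of Snakemake or SGE. The sketch below only demonstrates the shell behaviour; it does not confirm that this is what happens between Snakemake and qsub. The `count_args` helper and the variable names are made up, and the wildcard values are the ones from the question's log.

```shell
#!/bin/bash
# Print how many arguments the shell passed to this function.
count_args() { echo "$#"; }

# Stand-in for the expanded {wildcards}: two values joined by a space.
wildcards="Ecol1 SRA123456"

# Unquoted, the space splits the path into two arguments.
unquoted=$(count_args logs/download_reads.$wildcards.out)

# Quoted, it stays one argument.
quoted=$(count_args "logs/download_reads.$wildcards.out")

# Joined with "_", there is no space left to split on.
joined=$(printf '%s' "$wildcards" | tr ' ' '_')
joined_count=$(count_args logs/download_reads.$joined.out)

echo "unquoted=$unquoted quoted=$quoted joined=$joined_count"
```

If the quoting is lost at any point on the way to qsub, the first case applies, which would match the "two files are specified" message. Joining the values removes the space, so the result no longer depends on quoting.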

Shell Scripting to compare the value of current iteration with that of the previous iteration

I have an infinite loop that uses the AWS CLI to get the microservice names and their parameters (desired tasks, number of running tasks, etc.) for an environment.
There are hundreds of microservices running in an environment. I need to compare the value of the AWS ECS running-task metric for a particular microservice in the current iteration with its value in the previous iteration.
Say a microservice X has a running-task count of 5. Because the loop is infinite, after some time it comes around to microservice X again. Now assume the running-task count is 4. I want to compare the count for the current iteration, which is 4, with the count from the previous iteration, which was 5.
If you are asking a generic question of how to keep a previous value around so it can be compared to the current value, just store it in a variable. You can use the following as a starting point:
#!/bin/bash
# Remember the last value read so the next iteration can compare against it
previousValue=0
while read -r v; do
    echo "Previous value=${previousValue}; Current value=${v}"
    previousValue=${v}
done
exit 0
If the above script is called testval.sh, and you have an input file called test.in with the following values:
2
1
4
6
3
0
5
Then running
./testval.sh <test.in
will generate the following output:
Previous value=0; Current value=2
Previous value=2; Current value=1
Previous value=1; Current value=4
Previous value=4; Current value=6
Previous value=6; Current value=3
Previous value=3; Current value=0
Previous value=0; Current value=5
If the skeleton script works for you, feel free to modify it for however you need to do comparisons.
Hope this helps.
I don't know exactly what your input looks like, but something like this might be useful:
The script
#!/bin/bash
# Last seen task count per app, keyed by app name
declare -A app_stats
while read -r app tasks
do
    # Report only if the app was seen before and its count differs
    if [[ -n ${app_stats[$app]} && ${app_stats[$app]} -ne $tasks ]]
    then
        echo "Number of tasks for $app has changed from ${app_stats[$app]} to $tasks"
    fi
    app_stats[$app]=$tasks
done < input.txt
The input
App1 2
App2 5
App3 6
App1 6
The output
Number of tasks for App1 has changed from 2 to 6
Regards!
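The same idea can be wrapped in a function so it can be called from inside the existing infinite loop, once per microservice per pass. This is a sketch under assumptions: the function name `report_change` and the array name `previous_running` are made up, bash 4 or later is required for associative arrays, and the AWS CLI call that fetches the running count is left out because the question does not show it.

```shell
#!/bin/bash
# Last seen running-task count per service (needs bash 4+).
declare -A previous_running

# Usage: report_change <service_name> <running_count>
# Prints a line only when the service was seen before and its count changed.
report_change() {
    local service=$1 current=$2
    # ${previous_running[$service]+set} is non-empty only if the key exists.
    if [[ -n ${previous_running[$service]+set} && ${previous_running[$service]} -ne $current ]]; then
        echo "$service: running tasks changed from ${previous_running[$service]} to $current"
    fi
    previous_running[$service]=$current
}
```

Inside the loop, each pass would call `report_change "$name" "$running"` with whatever values were just fetched. The call must run in the current shell rather than in a subshell or pipeline, otherwise the stored values are lost between iterations.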

Parallel processing with dependencies on a SGE cluster

I'm doing some experiments on a computing cluster. My algorithm has two steps. The first one writes its outputs to some files which will be used by the second step. The dependencies are 1 to n, meaning one step 2 program needs the output of n step 1 programs. I'm not sure how to do this without either wasting cluster resources or keeping the head node busy. My current solution is:
submit script (this runs on the head node)
for different params, p:
    run step 1 with p
sleep some time based on an estimate of how long step 1 takes
for different params, q:
    run step 2 with q
step 2 algorithm (this runs on the computing nodes)
while files are not ready:
    sleep a few minutes
do the step 2
Is there any better way to do this?
SGE provides both job dependencies and array jobs for that. You can submit your phase 1 computations as an array job and then submit the phase 2 computation as a dependent job using qsub -hold_jid <phase 1 job ID|name> .... This makes the phase 2 job wait until all the phase 1 computations have finished, after which it is released and dispatched. The phase 1 computations will run in parallel as long as there are enough slots in the cluster.
In a submission script it might be useful to specify holds by job name and to give each array job a unique name. E.g.
mkdir experiment_1; cd experiment_1
qsub -N phase1_001 -t 1-100 ./phase1
qsub -hold_jid phase1_001 -N phase2_001 ./phase2 q1
cd ..
mkdir experiment_2; cd experiment_2
qsub -N phase1_002 -t 1-42 ./phase1 parameter_file
qsub -hold_jid phase1_002 -N phase2_002 ./phase2 q2
cd ..
This will schedule 100 executions of the phase1 script as the array job phase1_001 and another 42 executions as the array job phase1_002. If there are 142 slots on the cluster, all 142 executions will run in parallel. Then one execution of the phase2 script will be dispatched after all tasks in the phase1_001 job have finished and one execution will be dispatched after all tasks in the phase1_002 job have finished. Again those can run in parallel.
Each task in the array job will receive a unique $SGE_TASK_ID value, ranging from 1 to 100 for the tasks in job phase1_001 and from 1 to 42 for the tasks in job phase1_002. From it you can compute the p parameter.
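One simple way to map the task ID to a parameter is to keep one parameter set per line in a file and pick the line whose number equals $SGE_TASK_ID. The sketch below assumes that layout; the helper name `param_for_task` is made up, and the file name is whatever you pass to the phase1 script.

```shell
#!/bin/bash
# Hypothetical helper: print line <task_id> of <parameter_file>.
# Prints nothing if the file has fewer lines than task_id.
# Usage: param_for_task <task_id> <parameter_file>
param_for_task() {
    sed -n "${1}p" "$2"
}
```

Inside the phase1 script this could be used as `p=$(param_for_task "$SGE_TASK_ID" "$1")`, where `$1` is the parameter file given on the qsub command line.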
