PBS keeps aborting my jobs

I am requesting 14 processors from one node (each node has 32), like this:
#PBS -l nodes=1:ppn=14
#PBS -l walltime=12:00:00
With a lower ppn it almost always works, but once I get to numbers higher than 14 or so, the job begins execution and terminates immediately. tracejob is singularly unhelpful:
tracejob 14753.hpc2
Job: 14753.hpc2
01/21/2017 11:12:36 L Considering job to run
01/21/2017 11:12:36 L Job run
01/21/2017 11:12:36 M Resource_List.place = scatter
01/21/2017 11:12:36 M make_cpuset, vnode hpc2[0]: hv_ncpus (2) > mvi_acpus (0) (you are not expected to understand this)
01/21/2017 11:12:36 M start_exec, new_cpuset failed
01/21/2017 11:12:36 M kill_job
01/21/2017 11:12:36 M hpc2 cput= 0:00:00 mem=0kb
01/21/2017 11:12:37 M Obit sent
01/21/2017 11:12:37 M copy file request received
01/21/2017 11:12:37 M staged 2 items out over 0:00:00
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:38 M no active tasks
01/21/2017 11:12:38 M delete job request received
I have at times successfully requested more CPUs, so it's not completely deterministic. Is there a way to debug this?
As a side note, any job that requests more than one node sits in the queue forever and is never started; I don't know if that is related.

Are you doing a qrun to forcefully start this job on the specified vnode?
As a possible solution, try restarting your MoM (Machine Oriented Mini-server) or setting sharing to exclusive on the MoM (of course, you need to be a privileged user to do either).
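If you want to dig a little deeper before involving an admin, here is a minimal sketch of what I would check, assuming a fairly standard PBS Pro install (PBS_HOME, log paths and the service name may differ on your site):
# how the server currently accounts the vnode's CPUs
pbsnodes hpc2
# MoM log on the execution host around the failure time (default PBS_HOME assumed)
grep -i cpuset /var/spool/pbs/mom_logs/20170121
# restart the MoM as root; the exact service name/path depends on the install
systemctl restart pbs        # or: /etc/init.d/pbs restart
As a job-side experiment that needs no admin rights, you could also try requesting exclusive placement with #PBS -l place=excl and see whether the cpuset creation still fails.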

Related

SGE: All queues dropped because of overload or full

I'm trying to run a million batch jobs with SGE.
Roughly 10,000 jobs execute fine, but after about an hour of running, job execution slows down and eventually stops.
Checking for errors does not turn up anything; the only message I can see is:
"All queues dropped because of overload or full"
How do I configure the setup so that the jobs run normally?
There is one master server and four clients, with files shared over NFS, and every system runs on Docker and Docker Swarm.
Output of qstat when job execution had slowed down:
$qstat -j
queue instance "peteris.q@sge00" dropped because it is full
queue instance "peteris.q@sge02" dropped because it is full
queue instance "peteris.q@sge03" dropped because it is full
queue instance "peteris.q@sge01" dropped because it is full
All queues dropped because of overload or full
Detailed messages for one of the jobs:
$qstat -j 1595799
=============================================================
job_number: 1595799
exec_file: job_scripts/1595799
submission_time: Sun May 27 08:08:10 2018
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_workdir: /data/23andMe
sge_o_host: sge
account: sge
cwd: /data/23andMe
mail_list: root@sge
notify: FALSE
job_name: python3
jobshare: 0
env_list:
job_args: lineage.py,makeShell/1009_user3130_user3600.list
script_file: python3
usage 1: cpu=00:00:02, mem=0.59503 GBs, io=0.03963, vmem=493.180M, maxvmem=493.180M
scheduling info: queue instance "peteris.q@sge00" dropped because it is full
queue instance "peteris.q@sge02" dropped because it is full
queue instance "peteris.q@sge03" dropped because it is full
queue instance "peteris.q@sge01" dropped because it is full
All queues dropped because of overload or full
SGE scheduler configuration (qconf -ssconf):
algorithm default
schedule_interval 0:0:10
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=100.0
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 2
flush_finish_sec 2
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 0.500000
weight_waiting_time 0.278000
weight_deadline 3600000.000000
weight_urgency 0.500000
weight_priority 0.000000
max_reservation 0
default_duration INFINITY
SGE queue configuration (qconf -sq peteris.q):
qname peteris.q
hostlist @allhosts
seq_no 0
load_thresholds NONE
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:00:05
priority 0
min_cpu_interval 00:00:05
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 20
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:01
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
It seems you have hit a practical limit on the number of active jobs that the queue can handle at any given time. I cannot confirm where SGE defines the maximum, but it is likely:
max_jobs
The number of active (not finished) jobs simultaneously allowed in Sun Grid Engine is controlled by this parameter. A value greater than 0 defines the limit. The default value 0 means "unlimited". If the max_jobs limit is exceeded by a job submission then the submission command exits with exit status 25 and an appropriate error message.
Changing max_jobs will take immediate effect.
This value is a global configuration parameter only. It cannot be overwritten by the execution host local configuration.
From: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html?pathrev=V62u5_TAG
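For reference, you can check the current value on your own installation with the standard qconf commands (run on the master host; changing it requires manager privileges):
# show the global configuration and look for max_jobs (0 = unlimited)
qconf -sconf global | grep -i max_jobs
# to change it: qconf -mconf global   (opens the global config in an editor)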
If the man page is correct, then the default value is unlimited; however, SGE will likely not perform well trying to manage ~1 million active jobs, which is probably the issue you are having. I would recommend using job arrays, as that is exactly what they are for, i.e. managing and running many near-identical tasks (see the sketch after the links below).
There are many resources online for job arrays in SGE, such as this one:
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
http://talby.rcs.manchester.ac.uk/~ri/_linux_and_hpc_lib/sge_array.html
https://wiki.duke.edu/display/SCSC/SGE+Array+Jobs
I am happy to assist further if you edit your question with the specific requirements for each task. For example, does each of the ~1 million tasks require one or more parameters as input?
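As a rough sketch of what the array-job approach could look like here (the wrapper script name, the per-task input file naming, and the -tc concurrency cap are my assumptions; -tc requires a Grid Engine version that supports it):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# run_lineage.sh (hypothetical wrapper): each array task picks its own input list
python3 lineage.py "makeShell/task_${SGE_TASK_ID}.list"

# submit ONE array job with a million tasks instead of a million separate jobs,
# letting at most 500 tasks run concurrently
qsub -N lineage_array -t 1-1000000 -tc 500 run_lineage.sh
The scheduler then only has to track a single job object with many tasks instead of ~1 million pending jobs, which is far easier on qmaster and the scheduler.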

Hive cross join fails on local map join

Is there a direct way to address the following error, or overall a better way to use Hive to get the join that I need? Output to a stored table isn't a requirement, as I can be content with an INSERT OVERWRITE LOCAL DIRECTORY to a CSV.
I am trying to perform the following cross join. ipintegers is a 9 GB table and geoiplite is 270 MB.
CREATE TABLE iplatlong_sample AS
SELECT ipintegers.networkinteger, geoiplite.latitude, geoiplite.longitude
FROM geoiplite
CROSS JOIN ipintegers
WHERE ipintegers.networkinteger >= geoiplite.network_start_integer AND ipintegers.networkinteger <= geoiplite.network_last_integer;
I use CROSS JOIN on ipintegers instead of geoiplite because I have read that the rule is for the smaller table to be on the left, larger on the right.
The map and reduce stages complete to 100% according to Hive, but then:
2015-08-01 04:45:36,947 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8767.09 sec
MapReduce Total cumulative CPU time: 0 days 2 hours 26 minutes 7 seconds 90 msec
Ended Job = job_201508010407_0001
Stage-8 is selected by condition resolver.
Execution log at: /tmp/myuser/.log
2015-08-01 04:45:38 Starting to launch local task to process map join; maximum memory = 12221153280
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID: Stage-8
Logs:
/tmp/myuser/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
MapReduce Jobs Launched: Job 0: Map: 38 Reduce: 1 Cumulative CPU: 8767.09 sec HDFS Read: 9438495086 HDFS Write: 8575548486 SUCCESS
My hive config:
SET hive.mapred.local.mem=40960;
SET hive.exec.parallel=true;
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate = true;
SET hive.optimize.skewjoin = true;
SET mapred.compress.map.output=true;
SET hive.stats.autogather=false;
I have varied SET hive.auto.convert.join between true and false but with the same result.
Here are the errors in the output log from /tmp/myuser/hive.log
$ tail -12 -f /tmp/myuser/hive.log
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Execution failed with exit status: 3
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Obtaining error information
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) -
Task failed!
Task ID:
Stage-8
Logs:
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) - /tmp/myuser/hive.log
2015-08-01 07:30:46,087 ERROR mr.MapredLocalTask (MapredLocalTask.java:execute(268)) - Execution failed with exit status: 3
2015-08-01 07:30:46,094 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am running the Hive client on the master, a Google Cloud Platform instance of type n1-highmem-8 (8 CPUs, 52 GB), and the workers are n1-highmem-4 (4 CPUs, 26 GB), but I suspect that after MAP and REDUCE a local join (as implied) takes place on the master. Regardless, in bdutil I configured the JAVAOPTS for the worker nodes (n1-highmem-4).
SOLUTION EDIT: The solution was to organize the range data into a range tree.
I don't think it is possible to perform this kind of cross join by brute force: just multiply the row counts and it gets a little out of hand. You need some optimizations, which I don't think Hive is capable of yet.
But this problem can actually be solved in O(N1 + N2) time, provided your data is sorted (which Hive can do for you): you walk through both lists simultaneously; at each step you take an IP integer, add any intervals that start at or before it, drop those that have already ended, and emit the matching tuples. Pseudocode:
active = []                                               # intervals currently covering the sweep position
ipintegers = iterator(ipintegers_sorted_file)             # IP integers, sorted ascending
candidates = iterator(intervals_sorted_on_start_file)     # intervals, sorted by start
nxt = next(candidates, None)
for x in ipintegers:
    # drop intervals that ended before x
    active = [i for i in active if i.end >= x]
    # pull in every interval that starts at or before x (skip ones that already ended)
    while nxt is not None and nxt.start <= x:
        if nxt.end >= x:
            active.append(nxt)
        nxt = next(candidates, None)
    for i in active:
        output_match(i, x)
Now, if you have an external script/UDF that knows how to read the smaller table, takes IP integers as input, and emits matching tuples as output, you can use Hive's SELECT TRANSFORM to stream the inputs to it.
Or you can probably just run this algorithm on a local machine with two input files, because it is only O(N), and even 9 GB of data is very doable (see the sketch below).
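If you go the local route, here is a sketch of how the inputs could be produced and fed to the script (the table and column names are taken from the question; merge_intervals.py is assumed to be the pseudocode above, saved as a script that reads the two TSV files):
# let Hive do the sorting, then stream both files through the merge script locally
hive -e "SELECT networkinteger FROM ipintegers ORDER BY networkinteger" > ips_sorted.tsv
hive -e "SELECT network_start_integer, network_last_integer, latitude, longitude FROM geoiplite ORDER BY network_start_integer" > intervals_sorted.tsv
python merge_intervals.py ips_sorted.tsv intervals_sorted.tsv > iplatlong.csv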

Julia doesn't like it when I add and remove processes without doing any parallel processing

UPDATE: Confirmed as a bug. For more detail, see the link and details provided by @ViralBShah below.
Julia throws a strange error when I add and remove processes (addprocs and rmprocs), but only if I don't do any parallel processing in between. Consider the following example code:
#Set parameters
numCore = 4;
#Add workers
print("Adding workers... ");
addprocs(numCore - 1);
println(string(string(numCore-1), " workers added."));
#Detect number of cores
println(string("Number of processes detected = ", string(nprocs())));
# Do some stuff (COMMENTED OUT)
# XLst = {rand(10, 1) for i in 1:8};
# XMean = pmap(mean, XLst);
#Remove the additional workers
print("Removing workers... ");
rmprocs(workers());
println("Done.");
println("Subroutine complete.");
Note that I've commented out the only code that actually does any parallel processing (the call to pmap). If I run this code on my machine (Julia 0.2.1, Ubuntu 14.04), I get the following output in the console:
Adding workers... 3 workers added.
Number of processes detected = 4
Removing workers... Done.
Subroutine complete.
fatal error on
In [86]: fatal error on 88: ERROR: 87: ERROR: connect: connection refused (ECONNREFUSED)
in yield at multi.jl:1540
connect: connection refused (ECONNREFUSED) in wait at task.jl:117
in wait_connected at stream.jl:263
in connect at stream.jl:878
in Worker at multi.jl:108
in anonymous at task.jl:876
in yield at multi.jl:1540
in wait at task.jl:117
in wait_connected at stream.jl:263
in connect at stream.jl:878
in Worker at multi.jl:108
in anonymous at task.jl:876
The first four lines are printed by my program, and seem to indicate that it runs to completion. But then I get a fatal error. Any ideas?
The most interesting thing about this error is that if I uncomment the code with the call to pmap (i.e. if I actually do some parallel processing), the fatal error goes away.
This issue is being tracked at https://github.com/JuliaLang/julia/issues/7646 and I reproduce the answer by Amit Murthy:
pid 1 does an addprocs(3)
addprocs returns after it has established connections with all 3 new workers.
However, at this time the connections between workers may not have been set up yet, i.e. from pids 3 -> 2, 4 -> 2 and 4 -> 3.
Now pid 1 calls rmprocs(workers()), i.e., pids 2, 3 and 4.
As pid 2 exits, the connection attempt from 4 to 2 results in an error.
Since we have redirected the output of pid 4 to the stdout of pid 1, we see the same error printed.
The system is still in a consistent state, though the printing of said error messages may suggest something amiss.

Parallel processing with dependencies on a SGE cluster

I'm doing some experiments on a computing cluster. My algorithm has two steps. The first one writes its outputs to some files which will be used by the second step. The dependencies are 1 to n, meaning one step-2 program needs the output of n step-1 programs. I'm not sure how to do this without either wasting cluster resources or keeping the head node busy. My current solution is:
submit script (this runs on the head node):
    for different params, p:
        run step 1 with p
    sleep for some time, based on an estimate of how long step 1 takes
    for different params, q:
        run step 2 with q
step 2 algorithm (this runs on the computing nodes):
    while files are not ready:
        sleep a few minutes
    do step 2
Is there any better way to do this?
SGE provides both job dependencies and array jobs for that. You can submit your phase 1 computations as an array job and then submit the phase 2 computation as a dependent job using qsub -hold_jid <phase 1 job ID|name> .... This will make the phase 2 job wait until all the phase 1 computations have finished; it will then be released and dispatched. The phase 1 computations will run in parallel as long as there are enough slots in the cluster.
In a submission script it might be useful to specify holds by job name and to name each array job in a unique way. E.g.
mkdir experiment_1; cd experiment_1
qsub -N phase1_001 -t 1-100 ./phase1
qsub -hold_jid phase1_001 -N phase2_001 ./phase2 q1
cd ..
mkdir experiment_2; cd experiment_2
qsub -N phase1_002 -t 1-42 ./phase1 parameter_file
qsub -hold_jid phase1_002 -N phase2_002 ./phase2 q2
cd ..
This will schedule 100 executions of the phase1 script as the array job phase1_001 and another 42 executions as the array job phase1_002. If there are 142 slots on the cluster, all 142 executions will run in parallel. Then one execution of the phase2 script will be dispatched after all tasks in the phase1_001 job have finished and one execution will be dispatched after all tasks in the phase1_002 job have finished. Again those can run in parallel.
Each task in the array job will receive a unique $SGE_TASK_ID value, ranging from 1 to 100 for the tasks in job phase1_001 and from 1 to 42 for the tasks in job phase1_002. From it you can compute the p parameter, as in the sketch below.
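A minimal sketch of what the phase1 script itself might do to turn $SGE_TASK_ID into its p parameter (the params.txt file, with one parameter value per line, is my assumption):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# each array task reads the line of params.txt matching its task id (1-based)
p=$(sed -n "${SGE_TASK_ID}p" params.txt)
./run_step1 "$p"   # run_step1 is a placeholder for the actual step 1 program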

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-node cluster with Xeon processors. Sometimes, when I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
Linpack starts and prints the output; other times I only see the MPI rank mappings printed and then nothing follows. To me this seems like random behaviour, because I don't change anything between the calls and, as already mentioned, Linpack sometimes starts and sometimes doesn't.
In top I can see that xhpl_intel64 processes have been created and are heavily using the CPU, but when I watch the traffic between the nodes, iftop tells me that nothing is being sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Edit 2:
I just let the program run for a while, and after 30 minutes it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this an MPI problem? Do you know what kind of problem this could be?
I figured out what the problem was: MPICH2 uses different random ports each time it starts, and if these are blocked your application won't start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
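If it is a firewall between the nodes that blocks the dynamically chosen ports, the fixed range also has to be allowed through; here is a sketch assuming plain iptables (adapt to whatever firewall your nodes actually use):
# on each node, allow the fixed MPICH port range from the other node (run as root)
iptables -A INPUT -p tcp --dport 50000:51000 -s <other-node-ip> -j ACCEPT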
