Job unexpectedly cancelled due to time limit

There are several partitions on the cluster I work on. With sinfo I can see the time limit for each partition. I submitted my job to the mid1 partition, which has a time limit of 8-00:00:00, which I understand to mean 8 days. The job waited in the queue for 1-15:23:41, i.e. nearly 1 day and 15 hours, but it ran for only 00:02:24, about 2.5 minutes (and the solution was converging). I did not set a time limit in the file submitted with sbatch. The reason my job was stopped was given as:
JOB 3216125 CANCELLED AT 2015-12-19T04:22:04 DUE TO TIME LIMIT
So why was my job stopped if I did not exceed the time limit? I asked the people responsible for the cluster, but they have not gotten back to me.

Look at the value of DefaultTime in the output of scontrol show partitions. This is the time limit applied to your job when you do not specify one yourself with --time.
Most probably that value is set to 2 minutes to force you to specify a sensible time limit (within the limits of the partition), for example by adding a line such as #SBATCH --time=1-00:00:00 to your submission script or passing --time on the sbatch command line.

Related

Java 8 parallelstream worker issue

I am running a weekly job using Java 8 and Spring Boot, with a custom ForkJoinPool. With 8 threads, the job takes about 3 hours to complete. When I check the logs, the throughput is consistent until the job is about 80% complete, and I can see 5 to 6 threads running fine. But after about 80%, I see only one thread running and the throughput drops drastically.
From my initial analysis it feels as if the threads are somehow lost after 80%, but I am not sure.
Questions:
1) Any hints on what is going wrong?
2) What is the best way to debug and fix this, so that all threads keep running until the job completes?
I think the job should complete in less time than it does now, and I suspect the threads are the issue.
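For reference, a minimal sketch of the setup described, assuming the work is submitted as a parallel stream to a custom ForkJoinPool with 8 workers; the item list and the process step are placeholders, not the actual weekly job:

import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CustomPoolExample {
    public static void main(String[] args) throws Exception {
        List<Integer> items = IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        ForkJoinPool pool = new ForkJoinPool(8);            // custom pool with 8 workers
        try {
            // The stream runs inside the custom pool because the terminal
            // operation is invoked from a task submitted to that pool.
            long processed = pool.submit(() ->
                    items.parallelStream()
                         .map(CustomPoolExample::process)   // placeholder work item
                         .count()
            ).get();
            System.out.println("processed = " + processed);
        } finally {
            pool.shutdown();
        }
    }

    private static int process(int i) {
        return i * 2;                                       // stand-in for the real work
    }
}

Note that running a parallel stream inside a custom pool this way relies on the terminal operation being invoked from within that pool, which is common practice but not part of the documented Stream API contract.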

AWS EC2 instance status check interval less than a minute

The minimum interval for the EC2 instance StatusCheckFailed check seems to be one minute. Is it possible to reduce this to, say, 2 failures within 15 seconds?
We have a requirement to detect failures quickly, in the 10-15 second range. Are there other ways to accomplish this?
I don't believe you can set the resolution of the status check to less than 1 minute. One potential workaround would be to implement a Lambda function that performs a status check via your own code, triggered on a more frequent schedule (a cron-style rule).
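As a rough sketch of that workaround, here is a JDK-only probe that checks a TCP port on the instance every 15 seconds; the host, port, and failure handling are hypothetical placeholders, and the same logic could run inside a scheduled Lambda instead of a long-lived process:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class InstanceProbe {
    public static void main(String[] args) {
        String host = "10.0.0.12";   // hypothetical instance address
        int port = 22;               // any port the instance is expected to answer on
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5_000);  // 5 s connect timeout
                System.out.println("OK: " + host + ":" + port);
            } catch (IOException e) {
                // Placeholder for your failure handling: raise an alarm,
                // publish a custom metric, trigger failover, etc.
                System.err.println("FAILED: " + host + ":" + port + " - " + e.getMessage());
            }
        }, 0, 15, TimeUnit.SECONDS);
    }
}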

All map tasks reached 100%, but still in running state

In my MR job, which does bulk loading using HFileOutputFormat, 87 map tasks are spawned, and all of them reach 100% in around 20 minutes. Yet each task's status is still 'Running' on the Hadoop admin page and none is moved to the completed state. The reducer is always in the pending state and never starts. I waited, but the job errored out after the 30-minute timeout.
My job has to load around 150+ columns. I tried running the same MR job with fewer columns and it completes easily. Any idea why the map tasks are not moved to the completed state even after reaching 100%?
One probable cause would be that the output data emitted by the mappers is huge; sorting it and writing it back to disk is time-consuming and could keep a task showing 'Running' after the map function itself has finished, although this is typically not the case.
It would also be wise to check the task logs and look for ways to improve your map-reduce code.

Spreading/smoothing periodic tasks out over time

I have a database table with N records, each of which needs to be refreshed every 4 hours. The "refresh" operation is pretty resource-intensive. I'd like to write a scheduled task that runs occasionally and refreshes them, while smoothing out the spikes of load.
The simplest task I started with is this (pseudocode):
every 10 minutes:
    find all records that haven't been refreshed in 4 hours
    for each record:
        refresh it
        set its last refresh time to now
(Technical detail: "refresh it" above is asynchronous; it just queues a task for a worker thread pool to pick up and execute.)
What this causes is a huge resource (CPU/IO) usage spike every 4 hours, with the machine idling the rest of the time. Since the machine also does other stuff, this is bad.
I'm trying to figure out a way to get these refreshes more or less evenly spaced out -- that is, I'd want around N/(4 hours / 10 minutes), that is N/24, of those records to be refreshed on every run. Of course, it doesn't need to be exact.
Notes:
I'm fine with the algorithm taking time to start working (so say, for the first 24 hours there will be spikes but those will smooth out over time), as I only rarely expect to take the scheduler offline.
Records are constantly being added and removed by other threads, so we can't assume anything about the value of N between iterations.
I'm fine with records being refreshed every 4 hours +/- 20 minutes.
Do a full refresh, to get all your timestamps in sync. From that point on, every 10 minutes, refresh the oldest N/24 records.
The load will be steady from the start, and after 24 runs (4 hours), all your records will be updating at 4-hour intervals (if N is fixed). Insertions will decrease refresh intervals; deletions may cause increases or decreases, depending on the deleted record's timestamp. But I suspect you'd need to be deleting quite a lot (like, 10% of your table at a time) before you start pushing anything outside your 40-minute window. To be on the safe side, you could do a few more than N/24 each run.
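A minimal in-memory sketch of that approach, assuming records are kept in a map from id to last-refresh time (the map, the headroom factor, and the refresh stub are placeholders for your real storage and worker queue):

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SmoothedRefresher {
    // Hypothetical store: record id -> last refresh time (stand-in for the real table).
    private final Map<Long, Instant> lastRefreshed = new ConcurrentHashMap<>();

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::refreshOldestSlice, 0, 10, TimeUnit.MINUTES);
    }

    private void refreshOldestSlice() {
        int n = lastRefreshed.size();
        // 4 hours / 10 minutes = 24 runs per cycle; 10% headroom against insertions.
        int batch = Math.max(1, (int) Math.ceil(n / 24.0 * 1.1));
        lastRefreshed.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())   // oldest timestamps first
                .limit(batch)
                .forEach(entry -> {
                    refresh(entry.getKey());            // placeholder: queue the real refresh
                    lastRefreshed.put(entry.getKey(), Instant.now());
                });
    }

    private void refresh(long id) {
        // Stand-in for the asynchronous refresh described in the question.
    }
}

The 1.1 factor here is the "few more than N/24 each run" headroom mentioned above.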
Alternatively, each minute:
    take all records older than 4:10 and refresh them
    if the previous step did not find a lot of records:
        take some of the oldest records older than 3:40 and refresh them
This should eventually make the last-update times more evenly spaced out. What "a lot" and "some" mean, you should decide yourself (possibly based on N).
Another option: give each record its own refresh interval, a random duration between 3:40 and 4:20.
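A small sketch of picking such a per-record interval (220 to 260 minutes), assuming you store the resulting duration alongside the record:

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class RandomInterval {
    // Random refresh interval between 3:40 (220 minutes) and 4:20 (260 minutes), inclusive.
    static Duration nextRefreshInterval() {
        return Duration.ofMinutes(ThreadLocalRandom.current().nextLong(220, 261));
    }
}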

Oracle Database performance related

I am currently working on a 9.2.0.8 Oracle database. I have some questions about database performance, specifically redo log latches and contention. Answers from real practice will be highly appreciated; please help.
My database currently has 25 redo log groups with 2 members each, and each member is 100 MB in size.
Is it worth keeping 25 redo log groups, each with 2 members of 100 MB?
My database runs 24*7 with a minimum of 275 users and a maximum of 650. The workload is mostly SELECTs with very few INSERTs/UPDATEs/DELETEs.
About a month ago I started observing that the database generates archive logs averaging between 17 GB at minimum and 28 GB at maximum.
But the log switch takes place on average every 5-10 minutes, sometimes more frequently, and sometimes even 3 times in a minute.
Yet my SPFILE says log_checkpoint_timeout=1800 (30 minutes).
Regarding redo log latches and contention, when I issue:
SELECT name, value
FROM v$sysstat
WHERE name = 'redo log space requests';
the output is:
NAME                                      VALUE
----------------------------------------  ----------
redo log space requests                   20422
(This value is increasing day by day.)
Whereas Oracle recommends keeping redo log space requests close to zero.
So I want to know why my database is switching logs so frequently. Is this because of the data, or because of something else?
My thought was that increasing the redo log buffer might resolve the problem, so I increased it from 8 MB to 11 MB, but I didn't see much difference.
If I increase the size of the redo log files from 100 MB to 200 MB, will it help? Will it reduce the log switch frequency and bring the value of redo log space requests close to zero?
Something about the information you supplied doesn't add up - if you were really generating around 20G/min of archive logs, then you would be switching your 100M log files at least 200 times per minute - not the 3 times/minute worst case that you mentioned. This also isn't consistent with your description of "... mostly SELECTs".
In the real world, I wouldn't worry about log switches every 5-10 minutes on average. With this much redo, none of the init parameters are coming into play for switching - it is happening because of the online redo logs filling up. In this case, the only way to control the switching rate is to resize the logs, e.g. doubling the log size will reduce the switching frequency by half.
17GB of logfiles per minute seems pretty high to me. Perhaps one of the tablespaces in your database is still in online backup mode.
It would probably help to look at which sessions are generating lots of redo, and which sessions are waiting on the redo log space the most.
select name, sid, value
  from v$sesstat s, v$statname n
 where name in ('redo size', 'redo log space requests')
   and n.statistic# = s.statistic#
   and value > 0
 order by 1, 2;

Resources