I have an Oozie job with three actions: A1, B1 and C1. I run the three actions in parallel by configuring them in a fork. When A1 fails with EL_ERROR, the job fails. However, the other two actions, B1 and C1, stay in progress and never complete. What could be the issue?
I have two kinds of tasks in spark : A and B
In spark.scheduler.pool, I have two pools: APool and BPool.
I want task A to always be executed in APool and task B in BPool.
The resources in APool are reserved for A.
This is because task B may take too many resources. Whenever B is executing, A has to wait. I want some resources to always be available for A, no matter when the task is submitted.
I am using Spark with Java in standalone mode. I submit the job like javaRDD.map(..).reduce... The javaRDD is a subclass extended from JavaRDD. Tasks A and B use different RDD classes, ARDD and BRDD. They run in the same Spark application.
The procedure is like this: The app starts up -> the Spark application is created, but no job runs -> I click "run A" on the app UI, then ARDD will run. -> I click "run B" on the app UI, then BRDD will run in the same Spark application as A.
I've implemented long-running tasks in my Rails app using delayed_job along with delayed_job_web. My delayed_job configuration instructs jobs to be attempted once, and for failures to be retained:
config/initializers/delayed_job.rb:
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false
I tried 2 test jobs that automatically raised errors, in order to see how failures behave. What I get is the following:
My expectation was that Failed jobs would have a count of 2, but that Enqueued / Working / Pending would all be 0. I can't find any documentation on what determines whether a job is Enqueued / Working / Pending, or even what the difference between Working and Pending is (the web interface describes both lists as "contains jobs currently being processed".)
Can anyone provide some clarity?
If you check https://github.com/ejschmitt/delayed_job_web/blob/master/lib/delayed_job_web/application/app.rb , you see the following (starting line 114):
when :working
'locked_at is not null'
when :failed
'last_error is not null'
when :pending
'attempts = 0'
end
Enqueued would be the total number of delayed jobs, i.e. Delayed::Job.count
Working jobs are those that have been locked by the delayed_job process and are currently being worked.
Failed are those that have a last_error
Pending are those jobs that have never been attempted.
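To make the overlap between these lists concrete, here is a minimal Python sketch that applies the same predicates to plain dictionaries. The function names and the sample rows are hypothetical and only mirror the conditions quoted above; this is not the delayed_job_web code itself.

```python
def classify(job):
    """Return every list a job row would be counted in.

    `job` is a dict with the columns used by the quoted conditions:
    locked_at, last_error and attempts.
    """
    statuses = ["enqueued"]  # every stored job counts as enqueued
    if job.get("locked_at") is not None:
        statuses.append("working")  # 'locked_at is not null'
    if job.get("last_error") is not None:
        statuses.append("failed")  # 'last_error is not null'
    if job.get("attempts", 0) == 0:
        statuses.append("pending")  # 'attempts = 0'
    return statuses


def count_by_status(jobs):
    """Count how many jobs fall into each list."""
    counts = {"enqueued": 0, "working": 0, "failed": 0, "pending": 0}
    for job in jobs:
        for status in classify(job):
            counts[status] += 1
    return counts
```

The lists are not mutually exclusive: a retained failed job still counts as enqueued, and a row that has both locked_at and last_error set would be counted as working and failed at the same time.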
Can 1 Tasktracker run multiple JVMs?
Here is the scenario:
Assume there are 2 files (A & B) and 2 Data nodes (D1 & D2).
When you load A, assume it is getting split into A1 & A2 on D1 & D2
and when you load B, assume it is getting split into B1 & B2 on D1 & D2.
For some reason let us assume D1 is busy with some other tasks
and D2 is available and there are a couple of jobs which are submitted,
one using file A and the other one using file B.
So now D2 is available and has blocks A2 & B2.
Will the JobTracker submit the code to TaskTracker on D2 and run the task for A2 and B2 at a time or
will it first run A2 and after it finishes it will run B2?
If so, is it possible to run both tasks in parallel, which would mean 1 TaskTracker and 2 JVMs, or will it create/spawn 2 TaskTrackers on D2?
By default, the TaskTracker spawns one JVM for each task.
You can reuse JVMs by setting this configuration parameter: mapred.job.reuse.jvm.num.tasks
A TaskTracker (TT) can launch multiple map or reduce tasks in parallel on a single machine. By default, a TT launches up to 2 map tasks (mapreduce.tasktracker.map.tasks.maximum) and 2 reduce tasks (mapreduce.tasktracker.reduce.tasks.maximum) at a time. The defaults are listed in mapred-default.xml; to change them, override the properties in mapred-site.xml.
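As a rough illustration of what the slot limit means for the A2/B2 scenario, here is a small Python sketch. It is a simplified model with a hypothetical `schedule` function and made-up task durations, not Hadoop's actual scheduler: each task takes the slot that becomes free first.

```python
import heapq


def schedule(tasks, slots):
    """Assign tasks to `slots` parallel slots in submission order.

    `tasks` is a list of (name, duration) pairs.
    Returns {name: (start_time, end_time)}.
    """
    free_at = [0] * slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    timeline = {}
    for name, duration in tasks:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        timeline[name] = (start, end)
        heapq.heappush(free_at, end)
    return timeline
```

With two map slots, `schedule([("A2", 10), ("B2", 10)], slots=2)` starts both tasks at time 0. With a single slot, B2 only starts once A2 has finished.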
All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors due to one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt Task Machine State Error Logs
attempt_201110302152_0002_m_000037_0 task_201110302152_0002_m_000037 worker2 FAILED
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
-------
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Last 4KB
Last 8KB
All
Questions :
- Why is this happening?
- How can I handle such issues?
Thank you
The description for mapred.task.timeout, which defaults to 600000 ms (600 s), says "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out if more than 600s is actually required for the map task to complete processing the input data or if there is a bug in the code which needs to be debugged.
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
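If a single record legitimately takes a long time to process, a streaming task can also keep itself alive by updating its status string: Hadoop streaming treats a line of the form reporter:status:<message> written to stderr as a status update. The sketch below shows one way to do that from a Python streaming mapper. The `StatusReporter` class, its `beat` method and the 60 second interval are assumptions made for illustration; only the reporter:status: line format comes from Hadoop streaming.

```python
import sys
import time


class StatusReporter:
    """Rate-limited status updates for a Hadoop streaming task."""

    def __init__(self, err=sys.stderr, interval=60.0, clock=time.monotonic):
        self.err = err
        self.interval = interval  # seconds between updates (assumed value)
        self.clock = clock
        self.last = clock()

    def beat(self, message):
        """Write a status line if `interval` seconds have passed.

        Returns True when a status line was written.
        """
        now = self.clock()
        if now - self.last < self.interval:
            return False
        # Hadoop streaming reads this line from stderr as a status update.
        self.err.write("reporter:status:%s\n" % message)
        self.err.flush()
        self.last = now
        return True
```

A mapper would create one `StatusReporter` and call `reporter.beat("still working")` from inside its slow loop. This only prevents the kill; it does not help if the task is actually stuck.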
Greetings, your friendly neighborhood Quartz.NET n00b is back!
I have a Windows Service running iStatefulJob instances on a Quartz.NET CronTrigger based schedule scheme... The CRON String used to schedule the job: "0 0/1 * * * ? *"
Everything works great. However, if I have a job that is set to run, say, at the X:00 mark of every minute, and that job happens to run for MORE than a minute, I notice that the subsequent job runs IMMEDIATELY after the job is finished executing, rather than waiting until its next scheduled run, effectively "queuing" up instead of merely skipping the job until its next scheduled run.
I put in the trigger a CronTrigger MisfireInstruction of DONOTHING, but the exact same thing happens when a job overruns its next scheduled execution schedule.
How do I get an iStatefulJob instance to merely SKIP a scheduled execution trigger if it is currently running, rather than have it delay it until the first execution completes?
I explicitly set the trigger.MisfireInstruction = MisfireInstruction.CronTrigger.DoNothing;
...But instead of "doing nothing", for a job scheduled to run every minute that takes 90 seconds to complete, I experience the following execution log:
Job runs at 9:00:00am, finishes at 9:01:30am <- job runs for 1:30
Job runs at 9:01:30am, finishes at 9:03:00am <- subsequent job that should have run at 9:01:00
Job runs at 9:04:00am, finishes at 9:05:30am <- shouldn't this one have run at 9:03:00?
Job runs at 9:05:30am, finishes at 9:07:00am <- subsequent job that should have run at 9:05:00
Job runs at 9:08:00am, finishes at 9:09:30am <- shouldn't this have run at 9:07:00?
... it seems like it runs correctly the first time, on the minute... delays for 30 seconds as the 90 second job execution time expires, and then, instead of waiting till the NEXT full minute, EXECUTES IMMEDIATELY at the 30 second mark... Doubly odd, is that it then finishes the SECOND job on the minute mark, but waits till the NEXT minute mark to execute instead of running it back-2-back...
Pretty much seems like it works correctly EVERY OTHER RUN, when it is not running on the :30 marks...
What's the best way to get a job not to delay/queue, but to just SKIP until it is idle and the next schedule matures?
EDIT: I tried going back to iJobs instead of iStatefulJobs using the same DONOTHING trigger misfire instruction, but the job executes EVERY MINUTE despite the prior execution being still active. I can't seem to get it to skip a scheduled run if it is currently running with either iJob or iStatefulJob...
EDIT#2: I think that my triggers are NEVER misfiring, which is why DoNothing as a misfire instruction is useless... Given that's the case, I guess I need another mechanism to detect if a job instance of a schedule is running to ensure the job SKIPS its next execution until its following scheduled time rather than delaying it until first instance completion...
EDIT3: I tried adding an element to the iStatefulJob jobdatamap called "IsRunning"... I set it to TRUE when the execute sequence starts, and then return it to false after job completion. Before executing, it checks the element, which is apparently persisted between jobs, and prematurely quits the execution (logging "JOB SKIPPED!") if it detects it to be true... This unfortunately doesn't work, for probably obvious reasons: If the jobs are running following the bulleted schedule above, then the job is never SIMULTANEOUSLY running along with itself, as it is delaying the run till the job ends, so this check is useless. According to documentation, returning to iJob from iStatefulJob would not help here as the jobdatamap is only persisted between jobs in the Stateful job type...
I still haven't solved how to SKIP a scheduled job instead of delaying it until its current iteration completes... If anyone has ideas, you're a lifesaver! :)
It should be caused by misfireThreshold of RAMJobStore (http://quartznet.sourceforge.net/apidoc/topic2722.html).
The time span by which a trigger must
have missed its next-fire-time, in
order for it to be considered
"misfired" and thus have its misfire
instruction applied.
It is 60 seconds by default. So a job isn't considered "misfired" until it is late by at least the misfireThreshold value.
To resolve the problem, just decrease this threshold (the code below sets it to 1 ms):
...
properties["quartz.jobStore.misfireThreshold"] = "1";
...
schedulerFactory = new StdSchedulerFactory(properties);
It should resolve the issue.
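To see why the threshold explains the log in the question, here is a small Python simulation. It is a simplified model under stated assumptions, not Quartz.NET code: a single non-concurrent job, a trigger that fires every `period` seconds, and a DoNothing-style misfire rule that skips to the first scheduled time after "now" once the trigger is late by at least `threshold` seconds. The function name `simulate_starts` is hypothetical.

```python
def simulate_starts(duration, period, threshold, runs):
    """Return the start times (in seconds) of the first `runs` executions."""
    starts = []
    now = 0        # current time
    next_fire = 0  # next scheduled fire time of the trigger
    while len(starts) < runs:
        if now < next_fire:
            now = next_fire  # idle until the scheduled time
        elif now > next_fire and now - next_fire >= threshold:
            # Misfire: drop the late firing and move to the first
            # scheduled time strictly after now.
            next_fire = (now // period + 1) * period
            continue
        # Late by less than the threshold (or on time): fire right away.
        starts.append(now)
        next_fire += period
        now += duration  # the job blocks the next run until it ends
    return starts
```

In this model, `simulate_starts(90, 60, 60, 5)` gives starts at 0, 90, 240, 330 and 480 seconds, which matches the 9:00:00, 9:01:30, 9:04:00, 9:05:30, 9:08:00 pattern in the question. Lowering the threshold to 1 ms, `simulate_starts(90, 60, 0.001, 5)`, gives 0, 120, 240, 360 and 480, so every overrun firing is skipped and the job always starts on a minute mark.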