Oozie with Hadoop 2, Job hangs in "RUNNING" - hadoop

I have a workflow job with a Java action node, running on Hadoop 2.1.0.2.0.4.0-38 and Oozie 3.3.2.2.0.4.0.
When I submit the job I see two lines in the Hadoop ResourceManager screen:
1. one with the original job name
2. one with the Oozie job name
The task with the Oozie job name hangs in the "RUNNING" state, while the task with the original name stays in the "ACCEPTED" state.
All I see in the logs is:
>>> Invoking Main class now >>>
Heart beat
Heart beat
Heart beat
Heart beat
...
Thank you

It seems the number of map tasks that can run in parallel is limited. Set the property below to a value higher than its current value, for example:
mapred.tasktracker.map.tasks.maximum=50
This might fix your issue.
Thanks,
Sathish.
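A quick way to confirm whether the second application is simply waiting for cluster capacity is to list the YARN applications and their states; a minimal sketch (the -appStates filter only exists on newer 2.x releases, otherwise use plain yarn application -list):
yarn application -list -appStates ACCEPTED,RUNNING
If the application with the original job name never leaves ACCEPTED, no container is currently available for its ApplicationMaster while the Oozie launcher holds its resources.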

Related

How to hold Jenkins multiJob execution until chosen nodes are free?

I have a question about Jenkins MultiJob possibilities:
Current state:
I have 8 Jenkins nodes for job execution, 2 Linux and 6 Windows.
I have a MultiJob set up, consisting of 3 subJobs.
MultiJob setting: it is restricted to run only on Linux nodes.
SubJob settings: n1 can run only on Win node 1, n2 only on Win node 2, n3 only on Win node 3.
Desired state:
When I build the MultiJob, I need it to check and wait until Win nodes 1, 2, 3 are free.
I need to execute subJobs 1, 2, 3 at the same time.
This wouldn't be a problem if all nodes were free... but if at least one of those three nodes is running some other job, it's already a problem, because one subJob will start later than the other two.
Is there any way to set up a pre-build script, or some other mechanism, to run the subJobs only when all three chosen nodes are free, or to wait for them to become free?
Thanks a lot for all ideas :)
You can check the status of the build executors on a particular node as a pre-build action.
If a build executor is idle, that means no job is running on it; if it's busy, something is running on it.
A simple Groovy script can be used for this purpose:
import hudson.model.Node
import jenkins.model.Jenkins

Jenkins jenkins = Jenkins.instance
def jenkinsNodes = jenkins.nodes

for (Node node in jenkinsNodes) {
    // Make sure the node is online
    if (!node.getComputer().isOffline()) {
        // Make sure the node has no busy executors
        if (node.getComputer().countBusy() == 0) {
            // ... put your logic here ...
        }
    }
}
Thanks,
Subhadeep
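If you would rather do this check from a shell step than from a system Groovy script, the same information is exposed by the Jenkins REST API. A rough sketch (the Jenkins URL is a placeholder and you may need to pass credentials):
curl -s -g "http://JENKINS_HOST/computer/api/json?tree=computer[displayName,offline,idle]"
Each entry in the response says whether the node is offline and whether all of its executors are idle, so a wrapper script can poll until Win nodes 1, 2 and 3 are all idle before triggering the subJobs.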

How to remove/run a pending job in Nomad?

There are some pending jobs in "$ nomad status" output. Is there a way to run a pending job?
$ nomad status
ID Type Priority Status Submit Date
5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14 batch 50 pending (stopped) 2020-03-20T14:45:24+09:00
5e74678bdb1df409005677d6-0-build-1004018-z100-solid-octo-potato-17 batch 50 pending 2020-03-20T15:49:48+09:00
5e746884db1df409005677dc-0-build-1004018-z100-solid-octo-potato-19 batch 50 pending 2020-03-20T15:53:56+09:00
5e746a02db1df409005677e3-0-build-1004018-z100-solid-octo-potato-20 batch 50 pending 2020-03-20T16:00:19+09:00
Best regards,
Pending means the job is about to run but something is preventing it from running.
Try this:
Check the status of the job, e.g. nomad job status 5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14, to see if there are any messages that explain what's happening.
If the previous step doesn't help, look at the allocations Nomad creates (an allocation is the set of a job's tasks that should run on a particular node). Their IDs are visible in the nomad job status output, and you can check an allocation's status with nomad alloc status $ID, where $ID is the allocation's ID.
As for removal of jobs, you can run nomad job stop -purge 5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14 to remove the job from the job list.
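Putting those steps together, a typical triage sequence looks like this (the allocation ID is a placeholder you would copy from the job status output):
nomad job status 5e74678bdb1df409005677d6-0-build-1004018-z100-solid-octo-potato-17
nomad alloc status ALLOC_ID
nomad job stop -purge 5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14
If nomad job status shows a placement failure, it also lists which constraint or exhausted resource is keeping the job pending.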

Is preemption with Tez along with the YARN fair share scheduler supported?

We've been switching our 10-node cluster from MapReduce to Tez lately and we have been experiencing resource-management issues since then. It seems like preemption does not work as expected:
a very resource-hungry job arrives and gets all free resources
a second job arrives and waits for resources to be freed by job 1
job 2 gets very few resources (5%) over a long time; its share keeps increasing very slowly but most of the time never reaches the fair share
I'm assuming the preemption mechanism used by the FairScheduler is not working as it should, and resources only get assigned to job 2 when some of job 1's containers finish.
I've looked into the Tez docs, and I get the impression that Tez was developed with the CapacityScheduler as its de facto scheduler, but I can't find any guidance for the FairScheduler.
Some of the conf variables used, which may help:
hive.server2.tez.default.queues=default
hive.server2.tez.initialize.default.sessions=false
hive.server2.tez.session.lifetime=162h
hive.server2.tez.session.lifetime.jitter=3h
hive.server2.tez.sessions.init.threads=16
hive.server2.tez.sessions.per.default.queue=10
hive.tez.auto.reducer.parallelism=false
hive.tez.bucket.pruning=false
hive.tez.bucket.pruning.compat=true
hive.tez.container.max.java.heap.fraction=0.8
hive.tez.container.size=-1
hive.tez.cpu.vcores=-1
hive.tez.dynamic.partition.pruning=true
hive.tez.dynamic.partition.pruning.max.data.size=104857600
hive.tez.dynamic.partition.pruning.max.event.size=1048576
hive.tez.enable.memory.manager=true
hive.tez.exec.inplace.progress=true
hive.tez.exec.print.summary=false
hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
hive.tez.input.generate.consistent.splits=true
hive.tez.log.level=INFO
hive.tez.max.partition.factor=2.0
hive.tez.min.partition.factor=0.25
hive.tez.smb.number.waves=0.5
hive.tez.task.scale.memory.reserve-fraction.min=0.3
hive.tez.task.scale.memory.reserve.fraction=-1.0
hive.tez.task.scale.memory.reserve.fraction.max=0.5
yarn.scheduler.fair.preemption=true
yarn.scheduler.fair.preemption.cluster-utilization-threshold=0.7
yarn.scheduler.maximum-allocation-mb=32768
yarn.scheduler.maximum-allocation-vcores=4
yarn.scheduler.minimum-allocation-mb=2048
yarn.scheduler.minimum-allocation-vcores=1
yarn.resourcemanager.scheduler.address=${yarn.resourcemanager.hostname}:8030
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.client.thread-count=50
yarn.resourcemanager.scheduler.monitor.enable=false
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
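To see whether the FairScheduler ever gets to the point of preempting, it can help to watch what it reports per queue while both jobs are running. A minimal sketch, assuming the default ResourceManager web port (the hostname is a placeholder):
curl -s "http://RM_HOST:8088/ws/v1/cluster/scheduler"
For the FairScheduler this endpoint returns each queue's fair share and currently used resources, so you can check whether job 2's queue stays far enough below its fair share, for long enough, that preemption should have kicked in.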

Hadoop Mapreduce detailed task status queries

I want to write a third-party frontend for Hadoop MapReduce, which needs to query MapReduce for some information and statistics.
Right now I'm able to use hadoop job to query jobs and their map and reduce completion percentages, along with counters, e.g.:
# hadoop job -status job_201212170023_0127
Job: job_201212170023_0127
map() completion: 0.6342382
reduce() completion: 0.0
Counters: 28
Job Counters
SLOTS_MILLIS_MAPS=4537
...
What I would also like are the progress numbers for each individual task, as shown in the task visualisation within the JobTracker web UI.
I am able to list all the mappers...
# hadoop job -list-attempt-ids job_201212170023_0127 map running
attempt_201212170023_0127_m_000000_0
attempt_201212170023_0127_m_000001_0
attempt_201212170023_0127_m_000002_0
...
...but how would I get the completion percentage of each of these tasks? Ideally I would want something like this:
# hadoop job -task-status attempt_201212170023_0127_m_000000_0
completion: 0.6342382
start: 2012-12-18T12:23:34Z
... etc.
My current fallback would be to scrape the web interface, but I'd rather avoid that if it's at all possible to get this from command-line output.
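One partial, CLI-only option: for completed jobs on an MRv1 cluster like this one, per-task details (start/finish times and counters for every task attempt) can be printed from the job history; the output directory below is a placeholder for wherever the job wrote its output:
hadoop job -history all /path/to/job/output/dir
This only works after the job has finished, though; it does not give live progress for a running attempt.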

Hadoop streaming job fails to report?

All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors from one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt Task Machine State Error Logs
attempt_201110302152_0002_m_000037_0 task_201110302152_0002_m_000037 worker2 FAILED
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
-------
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description of mapred.task.timeout, which defaults to 600,000 ms (600 seconds), says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out whether more than 600 seconds is actually required for the map task to finish processing its input data, or whether there is a bug in the code that needs to be debugged.
According to Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
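If the task genuinely needs more time, the timeout can be raised per job, and since this is a streaming job the mapper can also report progress itself. A rough sketch (the streaming jar path, input/output paths and mapper script are placeholders):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.task.timeout=1200000 \
  -input /data/in -output /data/out \
  -mapper ./slow_mapper.sh -file ./slow_mapper.sh
Inside the mapper, writing a line of the form reporter:status:<message> to stderr updates the task's status string and resets the timeout, which is usually the cleaner fix when individual records legitimately take a long time to process.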
