All of my hadoop-streaming jobs were running successfully, but suddenly I started to see errors from one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt: attempt_201110302152_0002_m_000037_0
Task: task_201110302152_0002_m_000037
Machine: worker2
State: FAILED
Error logs:
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
-------
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description for mapred.task.timeout, which defaults to 600 seconds (600,000 ms), says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out whether more than 600 seconds is actually required for the map task to finish processing its input data, or whether there is a bug in the code that needs to be debugged.
According to Hadoop best practices, a map task should take about a minute on average to process an InputSplit.
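If the task legitimately needs more than 600 seconds between reads, writes, or status updates, there are two options: raise the timeout (e.g. pass -D mapred.task.timeout=1200000 to the job for 20 minutes), or keep the task alive by reporting progress. With hadoop-streaming, a script can report status by writing lines of the form reporter:status:<message> to stderr. For a native Java job, a minimal sketch (the chunked work loop is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a long-running mapper that keeps the framework informed,
// so it is not killed after mapred.task.timeout ms of silence.
public class SlowMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (int chunk = 0; chunk < 100; chunk++) {
            // ... expensive processing of one chunk (placeholder) ...
            context.progress();                  // heartbeat to the framework
            context.setStatus("chunk " + chunk); // updates the status string
        }
        context.write(new Text("done"), value);
    }
}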
I'm executing a script that should run for 1 hour, but it only runs for 10 minutes. I checked that the loop is set to run forever, the test data is correct, and the whole script runs without any errors. I ran the script three times and validated everything, but I can't figure out why this is happening.
How can I fix this and make the script run for the full hour?
Normally you can find the reason for the termination of a thread or the test in the jmeter.log file. If it's not there, or it's vague, you can increase the JMeter logging level to something more verbose.
The most common reasons for premature end of the test are:
Not enough loops to cover the anticipated duration of the test in the Thread Group
Not enough test data if CSV Data Set Config is configured to stop thread on EOF
Thread group is configured to stop the thread/test on a sampler error
There is a Flow Control Action sampler somewhere that is configured to stop the thread/test
There is a runtime error like OutOfMemoryError or StackOverflowError
I'm trying to load a dataset (280GB) using the Phoenix csv bulk load tool on a HDInsight Hbase cluster. The job fails with the following error:
18/02/23 06:09:10 INFO mapreduce.Job: Task Id : attempt_1519326441231_0004_m_000067_0, Status : FAILED
Error: Java heap space
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Here's my cluster configuration:
Region nodes: 8 cores, 56 GB RAM, 1.5 TB HDD
Master nodes: 4 cores, 28 GB RAM, 1.5 TB HDD
I tried increasing the value of yarn.nodemanager.resource.memory-mb from 5GB to 38GB, but the job still fails.
Can anyone please help me troubleshoot this issue?
Can you provide more details? For example, how are you kicking off the job? Are you following the instructions here: https://blogs.msdn.microsoft.com/azuredatalake/2017/02/14/hdinsight-how-to-perform-bulk-load-with-phoenix/ ?
Specifically, can you share the command you used? Also, is the job failing immediately, or does it run for a while and then start to fail? Are there any log messages other than the one you described above?
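For what it's worth, "Error: Java heap space" followed by exit code 143 usually means the per-task JVM heap is too small, and raising yarn.nodemanager.resource.memory-mb alone does not change that: it only raises the total memory a node offers YARN. The container size and the JVM heap need to be raised together. A hedged sketch with illustrative values (heap at roughly 80% of the container, leaving headroom for off-heap memory):

import org.apache.hadoop.conf.Configuration;

public class MemorySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.memory.mb", "8192");      // container size per map task
        conf.set("mapreduce.map.java.opts", "-Xmx6553m"); // JVM heap inside that container
        conf.set("mapreduce.reduce.memory.mb", "8192");   // same idea for reducers
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
        // The bulk load tool runs an ordinary MapReduce job, so the same
        // settings can likely be passed on its command line with -D flags.
    }
}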
I'm running into a “many active jobs” issue when using direct Kafka streaming on YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1).
The problem happens when Kafka has almost no traffic.
In the application UI, I see many “active” jobs that keep running for hours, until finally the driver reports “Requesting 4 new executors because tasks are backlogged”.
But when I look at the driver log for one of these “active” jobs, the log says the job has finished. So why does the application UI show the job as active, seemingly forever?
Thanks!
Here is the related log info for one of the “active” jobs.
There are two stages: a reduceByKey following a flatMap. The log says both stages finished in ~20 ms, and the job itself finished in 64 ms.
Got job 6567
Final stage: ResultStage 9851(foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
…
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) finished in 0.022 s
…
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
…
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 1468592373000 ms
I am facing a similar issue. Mine is a Spark Streaming application whose only action is a write to a Cassandra table, and this write fails due to an SSL authentication error. Ideally the Streaming tab should show such batches as failed, but they remain in the active state forever; the jobs inside the batch show as completed successfully, when ideally they should have been marked failed.
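As an aside, the “Requesting 4 new executors because tasks are backlogged” message comes from Spark's dynamic allocation, so one way to keep the phantom “active” jobs from scaling the application up while this is being investigated is to disable it and pin the executor count. A minimal sketch (the app name and count are placeholders):

import org.apache.spark.SparkConf;

public class DriverSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("my-streaming-app") // placeholder
                // Stop the driver from requesting executors for jobs the UI
                // still (wrongly) considers active and backlogged.
                .set("spark.dynamicAllocation.enabled", "false")
                // With dynamic allocation off, fix the executor count instead.
                .set("spark.executor.instances", "4");
    }
}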
I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, the job hung at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated it and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
Five seconds later, the log showed: ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this, which indicates that dfs.datanode.max.xcievers may need to be increased. However, that property is deprecated; the new one is called dfs.datanode.max.transfer.threads, with a default value of 4096. If changing it would fix my problem, what value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing it would fix my problem, what should I set it to?
Premature EOF can occur for multiple reasons. One of them is spawning a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file (and binds a port to it) to write to disk. This puts a limit on the number of files that can be written to on one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node: the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory-overshoot issue, nor can it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
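To make that failure mode concrete, here is a hedged sketch of the pattern being described: a reducer that uses MultipleOutputs with a per-key output path, so the number of open output files grows with the number of distinct keys (class and path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FanoutReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Each distinct baseOutputPath gets its own RecordWriter and an
            // open file under _temporary; tens of thousands of them on one
            // reducer node is the overload described above.
            mos.write(key, value, "bykey/" + key.toString());
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush and close all per-path writers
    }
}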
My hadoop job has a very high 'Killed Task Attempts' count on its reducer tasks. I checked the status of a killed task attempt:
Request received to kill task 'attempt_201308122006_41526_r_000030_1' by user
-------
Task has been KILLED_UNCLEAN by the user
and there are no stdout or stderr logs.
What could cause this, and how can I solve it?
If you have speculative execution turned on, you will potentially see a number of map/reduce tasks that are 'killed'. This is due to Hadoop running long-running tasks on more than a single task tracker; the first one to complete 'wins' while the others are killed off.
In general, I would only worry about the task attempts that 'failed' in the job tracker.
Try turning speculative execution off:
mapred.map.tasks.speculative.execution = false
mapred.reduce.tasks.speculative.execution = false
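A minimal sketch of setting these from driver code, using the old-API property names quoted above (on current Hadoop the equivalents are mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        Job job = Job.getInstance(conf, "no-speculation"); // job name is illustrative
        // ... rest of the job setup as usual ...
    }
}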
If it's not speculative execution, it could be the Fair Scheduler kicking in and claiming task trackers for pools with minMaps and minReduces.