Mapred Task Timeout - hadoop

I have written a Map only job, where data is written from one HBase table to another, after some processing. But in my setup method of mapper, I am loading data from a file which takes more time than my mapred.task.timeout configuration.
I read the explanation given here. My question is,
1) will there be no communication between the task and the task tracker in the middle of a setup phase?
2) How to update status string??

Job wont timeout as long as there is a progress
Progress reporting is important, as Hadoop will not fail a task that’s making progress. All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter’s
setStatus() method)
• Incrementing a counter (using Reporter’s incrCounter() method)
• Calling Reporter’s progress() method
so if you keep on doing any of this at a nominal interval that job wont be killed.

Related

Difference between task counter and job counter

Can anyone please help me understand what is the difference between task counter and job counter in map reduce?
Hadoop,The Definitive guide says that task counters are those which are updated as the task progresses and and job counter are those which are updated as the job progresses.
Is this the only difference or they have any other difference too?
Task Counters
Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. Task counters are sent in full every time, rather than sending the counts since the last transmission, since this guards against errors due to lost messages. Furthermore, during a job run, counters may go down if a task fails for example you dont want to add up the bad_records in a split of a failed tasks. So As the task progesses and completes successfully the overall count of the task statistics is sent over to task tracker which is passed over to job tracker.
Job counters
Job counters are maintained by the jobtracker (or application master in YARN), so they don’t need to be sent across the network, unlike all other counters, They measure job-level statistics, not values that change while a task is running For example, TOTAL_LAUNCHED_MAPS counts the number of total map task launched which is just a statistics about the overall job

"Too many fetch-failures" while using Hive

I'm running a hive query against a hadoop cluster of 3 nodes. And I am getting an error which says "Too many fetch failures". My hive query is:
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;
that's the query im trying to run. All i want to do is transfer data from tablename2 to tablename1. Any help is appreciated.
This can be caused by various hadoop configuration issues. Here a couple to look for in particular:
DNS issue : examine your /etc/hosts
Not enough http threads on the mapper side for the reducer
Some suggested fixes (from Cloudera troubleshooting)
set mapred.reduce.slowstart.completed.maps = 0.80
tasktracker.http.threads = 80
mapred.reduce.parallel.copies = sqrt (node count) but in any case >= 10
Here is link to troubleshooting for more details
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
Update for 2020 Things have changed a lot and AWS mostly rules the roost. Here is some troubleshooting for it
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource-1.html
Too many fetch-failures
PDF
Kindle
The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.
There are several reasons the output may not be available:
The prerequisite task is still processing. This is often a map task.
The data may be unavailable due to poor network connectivity if the data is located on a different instance.
If HDFS is used to retrieve the output, there may be an issue with HDFS.
The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.
One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.
If you are seeing this issue, there are several things you can try:
Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time. For more information, see Create Bootstrap Actions to Install Additional Software.
Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
Use a combiner class code to minimize the amount of outputs that need to be fetched.
Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
Review the CPU and memory resources of the instances in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Configure Cluster Hardware and Networking.
Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.
Jetty is an open-source HTTP server that is used for machine to machine communications within a Hadoop cluster

Query regarding shuffling in map reduce

How does a node processing running the mapper knows that it has to send some key-value output to node A (running the reducer) & some to node B (running another reducer)?
Is there somewhere a reducer node list is maintained by the the JobTracker?
If yes, how does it chooses a node to run the reducer?
A Mapper doesn't really know where to send the data, it focuses on 2 things:
Writes the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key which corresponds to which Reducer it will be sent to.
Once a map task is done it will notify the parent task tracker to say it's done, which will then notify the job tracker itself. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a Reducer starts, it will keep asking the job tracker for the map outputs corresponding to his partition until it has retrieved them all. Whenever a map output is available, the reduce task will start copying it, and gradually merge as it copies.
If this is still unclear, I will advise looking at the reference book on Hadoop which has a whole chapter describing this part, here is a schema extracted from it that could help you visualize what happens in the shuffle step:
The mappers do not send the data to the reducers, rather the reducers pull the data from the task trackers where successful map tasks ran.
The Job Tracker, when allocating a reducer task to a task tracker, knows where the successful map tasks ran, and can compile a list of task tracker and map attempt task results to pull.

How to increment a hadoop counter from outside a mapper or reducer?

I want to add something to the hadoop counters from outside a mappper.
So, I want to access the getCounter on context object like this:
context.getCounter(counter, key).increment(amount)
I'm not able to get the context object from where I start the job. I can only do
job.getCounters().findCounter()
which doesn't let me add something to the hadoop counters.
You can only use/write to the counters from within the mapper/reducer tasks. The job tracker has built in capabilities to interactz with the counters and you don't really want to interfere with what is already a complex setup.
I had exactly this issue a few months ago, trying to use the counters to store interim information, but I decided to write the informtion I needed to a defined hdfs directory and read that once my job was complete.
EDIT: why and whatdo you want to use the counter for outside of the mapper?
EDIT #2: if you want stats from a finished job, then counters are not the right place for that, as a) they don't seem to bewritable once the job tracker is done collecting data and b) they are intended to be used for aggregating metrics across tasks. I had a similar need recently and endedup doing my stats sums in the job set-up class (on my edge node) and then writing the data to the logs.

Hadoop DistributedCache failed to report status

In a Hadoop job i am mapping several XML-files and filtering an ID for every element (from < id>-tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, every line with just an integer as a ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
Before any map-job is executed.
The cluster consists of 40 nodes and since the files of a DistributedCache are copied to the slave nodes before any tasks for the job are executed, i assume the failure is caused by the large HashSet. I have already increased the mapred.task.timeout to 2000s. Of course I could raise the time even more, but actually this period should suffice, shouldn't it?
Since DistributedCache's are used to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here and if there is another way to pass the relevant IDs to every map-job?
Can you add some some debug printlns to your setup method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs in. You can probably store these values in a 50MB bloom filter with a good false positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.

Resources