Can anyone please help me understand the difference between task counters and job counters in MapReduce?
Hadoop: The Definitive Guide says that task counters are updated as a task progresses and job counters are updated as the job progresses.
Is this the only difference, or are there others?
Task Counters
Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. Task counters are sent in full every time, rather than as the counts since the last transmission, since this guards against errors due to lost messages. Furthermore, during a job run, counters may go down if a task fails; for example, you don't want to add up the bad_records from a split processed by a failed task. So as the task progresses and completes successfully, its overall counter values are sent to the task tracker, which passes them on to the job tracker.
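To make the "sent in full" point concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop's actual classes: the class name `CounterSnapshots`, its methods, and the task ids are all made up for illustration. Each report overwrites the task's previous value, so a lost update is repaired by the next one, and dropping a failed task makes the job total go down.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the aggregation described above; not Hadoop's actual classes. */
final class CounterSnapshots {
    // Latest full snapshot reported by each task (hypothetical task ids).
    private final Map<String, Long> latestByTask = new HashMap<>();

    /** A task reports its complete running count; the new value overwrites the old one. */
    void report(String taskId, long fullCount) {
        latestByTask.put(taskId, fullCount);
    }

    /** A failed task's contribution is dropped, so the job total can go down. */
    void taskFailed(String taskId) {
        latestByTask.remove(taskId);
    }

    /** Job-level value: the sum of the latest snapshot from every live task. */
    long total() {
        long sum = 0;
        for (long value : latestByTask.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        CounterSnapshots badRecords = new CounterSnapshots();
        badRecords.report("task_1", 3);
        // Suppose an update carrying 5 was lost here; the next full snapshot still gives the right value.
        badRecords.report("task_1", 8);
        badRecords.report("task_2", 4);
        System.out.println(badRecords.total()); // prints 12
        badRecords.taskFailed("task_2");
        System.out.println(badRecords.total()); // prints 8
    }
}
```

With deltas instead of snapshots, the lost message would have left the total permanently short, which is the error the full-value design avoids.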
Job counters
Job counters are maintained by the jobtracker (or application master in YARN), so they don’t need to be sent across the network, unlike all other counters. They measure job-level statistics, not values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the total number of map tasks launched, which is a statistic about the job as a whole.
Related
I am not sure if this is something that has been fixed for newer releases of Hadoop, but I'm currently locked into running Hadoop 0.20 (legacy code).
Here's the issue: when I launch a Hadoop job, there is a "Job setup" task that needs to run first. It seems to me that Hadoop randomly picks this task to be either a map task or a reduce task.
We have more capacity configured for map tasks than for reduce tasks, so whenever I get unlucky and have a reduce setup task, it takes forever for my job to even start running. Any ideas how to overcome this?
A Hadoop job first completes all of its mapper tasks. The reducers only start processing once all the mapper tasks have completed and their output has been shuffled across the network and sorted. So I guess there could be some other reason for this delay.
I have written a map-only job in which data is written from one HBase table to another after some processing. But in the setup method of my mapper, I load data from a file, which takes longer than my mapred.task.timeout setting.
I read the explanation given here. My question is,
1) Is there no communication between the task and the task tracker during the setup phase?
2) How do I update the status string?
A task won't time out as long as it is making progress.
Progress reporting is important, as Hadoop will not fail a task that’s making progress. All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter’s setStatus() method)
• Incrementing a counter (using Reporter’s incrCounter() method)
• Calling Reporter’s progress() method
So if you keep doing any of these at regular intervals, the task won't be killed.
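As a rough illustration of reporting progress from inside a long-running load, here is a sketch in plain Java with no Hadoop dependency. The class `SetupProgressSketch`, the method `loadWithProgress`, and the `reportProgress` callback are hypothetical names. In a real mapper the callback would presumably wrap one of the calls listed above, such as Reporter's progress(), but that wiring is an assumption about your setup.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a long-running load that signals liveness at a fixed interval. */
final class SetupProgressSketch {
    /**
     * Loads every line and invokes the callback once per `interval` lines.
     * `reportProgress` is a hypothetical stand-in for the framework's progress call.
     */
    static List<String> loadWithProgress(Iterable<String> lines, int interval, Runnable reportProgress) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<String> loaded = new ArrayList<>();
        int sinceLastReport = 0;
        for (String line : lines) {
            loaded.add(line);
            if (++sinceLastReport == interval) {
                reportProgress.run(); // tell the framework the task is still alive
                sinceLastReport = 0;
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            input.add("row-" + i);
        }
        int[] reports = {0};
        List<String> loaded = loadWithProgress(input, 3, () -> reports[0]++);
        // prints "10 rows loaded, 3 progress reports"
        System.out.println(loaded.size() + " rows loaded, " + reports[0] + " progress reports");
    }
}
```

The interval should be chosen so that the time between two reports stays well below the configured task timeout.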
I am running a Pig job that loads around 8 million rows from HBase (several columns) using HBaseStorage. The job finishes successfully and seems to produce the right results, but when I look at the job details in the job tracker it says 50 map tasks were created, of which 28 were successful and 22 were killed. The reduce ran fine. Looking at the logs of the killed map tasks, there is nothing obvious to me as to why the tasks were killed. In fact the logs of successful and killed tasks are practically identical, and both kinds of task take a reasonable time. Why are all these map tasks created and then killed? Is it normal or is it a sign of a problem?
This sounds like speculative execution in Hadoop. It runs the same task on several nodes and kills the remaining attempts once one of them completes. See the explanation in this book: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/task-execution
How does a node running the mapper know that it has to send some key-value output to node A (running one reducer) and some to node B (running another reducer)?
Is a list of reducer nodes maintained somewhere by the JobTracker?
If yes, how does it choose a node to run the reducer?
A mapper doesn't really know where to send the data; it focuses on two things:
1) It writes the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key; each partition corresponds to the reducer it will be sent to.
2) Once a map task is done, it notifies its parent task tracker, which in turn notifies the job tracker. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a reducer starts, it keeps asking the job tracker for the map outputs corresponding to its partition until it has retrieved them all. Whenever a map output is available, the reduce task starts copying it and gradually merges the outputs as it copies.
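The partitioning step described above can be sketched in a few lines of plain Java. This resembles the hash-modulo approach of Hadoop's default hash partitioner, but treat it as an illustration rather than the library's code; the class `HashPartitionSketch` and the method `partitionFor` are made-up names.

```java
/** Sketch of hash partitioning: the same key always maps to the same reducer index. */
final class HashPartitionSketch {
    /**
     * Returns a partition index in [0, numReducers).
     * Masking with Integer.MAX_VALUE clears the sign bit so negative hash codes stay in range.
     */
    static int partitionFor(Object key, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97 and "b".hashCode() is 98, so with 2 reducers:
        System.out.println(partitionFor("a", 2)); // prints 1
        System.out.println(partitionFor("b", 2)); // prints 0
    }
}
```

Because the index depends only on the key and the number of reducers, every mapper assigns a given key to the same partition without any coordination, and the reducers then pull the partition that belongs to them.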
If this is still unclear, I would advise looking at the reference book on Hadoop, which has a whole chapter describing this part. Here is a diagram extracted from it that could help you visualize what happens in the shuffle step:
The mappers do not send the data to the reducers; rather, the reducers pull the data from the task trackers where successful map tasks ran.
The job tracker, when allocating a reduce task to a task tracker, knows where the successful map tasks ran, and can compile a list of task trackers and map attempt outputs to pull from.
I don't see a straightforward way of setting a counter value at the beginning of a MapReduce job. Also, is the counter increment an atomic operation among map/reduce tasks?
Not sure what you mean by setting a counter value in the beginning. Do you mean initializing a counter to something other than 0? (What's your use case for doing this?)
As for atomicity, the counters are accumulated in isolation for each task. As tasks complete, the counter values are committed to the global totals. Only the counters of committed tasks are included, so if you have two attempts of a task running speculatively, only the successful attempt's counters are committed.
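A small plain-Java model may help picture this commit behavior. It is a sketch under assumptions, not Hadoop's implementation: the class `AttemptCounterCommit`, its methods, and the attempt ids are hypothetical. Each attempt accumulates its own count, and only the first attempt of a task to commit contributes to the job total.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model: per-attempt counters that reach the job total only on commit. */
final class AttemptCounterCommit {
    private final Map<String, Long> pending = new HashMap<>(); // attempt id -> local count
    private final Set<String> committedTasks = new HashSet<>();
    private long jobTotal = 0;

    /** Increments the counter held locally by one attempt. */
    void increment(String attemptId, long amount) {
        pending.merge(attemptId, amount, Long::sum);
    }

    /** Commits the attempt's count unless another attempt of the same task already committed. */
    boolean commit(String taskId, String attemptId) {
        Long local = pending.remove(attemptId);
        if (local == null || !committedTasks.add(taskId)) {
            return false;
        }
        jobTotal += local;
        return true;
    }

    long jobTotal() {
        return jobTotal;
    }

    public static void main(String[] args) {
        AttemptCounterCommit counters = new AttemptCounterCommit();
        // Two speculative attempts of the same task each count 5 records.
        counters.increment("m_0_a0", 5);
        counters.increment("m_0_a1", 5);
        counters.commit("m_0", "m_0_a1"); // the faster attempt wins
        counters.commit("m_0", "m_0_a0"); // ignored: the task is already committed
        counters.increment("m_1_a0", 2);
        counters.commit("m_1", "m_1_a0");
        System.out.println(counters.jobTotal()); // prints 7, not 12
    }
}
```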
Either set the value when you create the counter, like:
private AtomicInteger pages = new AtomicInteger(0); // total pages fetched
Or use the incrCounter() method in a loop if you want to do it at some later point.
(The first one is better.)
Counters are maintained by the task with which they are associated, and periodically sent to the tasktracker and then to the jobtracker, so they can be globally aggregated. So each map or reduce task has its own copy of the counter. If the job is successful, the counters are totaled and shown in the output summary.