Setting MapReduce Counter value to a certain value - hadoop

I don't see a straightforward way to set a counter's value at the beginning of a MapReduce job. Also, is a counter increment an atomic operation across map/reduce tasks?

Not sure what you mean by setting a counter value in the beginning. Do you mean initializing a counter to something other than 0? (What's your use case for doing this?)
As for atomicity, the counters are accumulated in isolation for each task. As tasks complete, their counter values are committed to the global totals. Only the counters of committed task attempts are included, so if you have two attempts of a task running speculatively, only the successful attempt's counters are committed.

Either set the value when you create the counter, like this:
private AtomicInteger pages = new AtomicInteger(0); // total pages fetched
OR call the incrCounter() method if you want to do it at some point later. It takes the amount to add as an argument, so a single call with the desired value is enough and no loop is needed.
(The first one is better.)
Counters are maintained by the task with which they are associated, and periodically sent to the tasktracker and then to the jobtracker, so they can be globally aggregated. So each map task / reduce task has its own copy of the counter. If the job is successful, the counters are totaled across all tasks and reported in the output summary.
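The commit behaviour described in the answers above can be modelled in plain Java. This is only a minimal sketch that does not use the Hadoop API: the class names `CounterAggregation` and `TaskAttempt` and the counter name `RECORDS` are hypothetical, chosen to show that each attempt counts in isolation and that only committed attempts contribute to the job totals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CounterAggregation {

    /** One task attempt with its own private counters (hypothetical model class). */
    static final class TaskAttempt {
        final String taskId;
        final boolean committed;
        final Map<String, Long> counters = new HashMap<>();

        TaskAttempt(String taskId, boolean committed) {
            this.taskId = taskId;
            this.committed = committed;
        }

        /** Adds an amount to a counter that is local to this attempt. */
        TaskAttempt increment(String counter, long amount) {
            counters.merge(counter, amount, Long::sum);
            return this;
        }
    }

    /** Sums counters over committed attempts only; uncommitted ones are ignored. */
    static Map<String, Long> jobTotals(List<TaskAttempt> attempts) {
        Map<String, Long> totals = new TreeMap<>();
        for (TaskAttempt attempt : attempts) {
            if (!attempt.committed) {
                continue; // e.g. the losing attempt of a speculative pair
            }
            attempt.counters.forEach((name, value) -> totals.merge(name, value, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<TaskAttempt> attempts = new ArrayList<>();
        attempts.add(new TaskAttempt("m_000000", true).increment("RECORDS", 10));
        attempts.add(new TaskAttempt("m_000001", true).increment("RECORDS", 20));
        // Speculative duplicate of m_000001 that never committed.
        attempts.add(new TaskAttempt("m_000001", false).increment("RECORDS", 20));
        System.out.println(jobTotals(attempts)); // prints {RECORDS=30}
    }
}
```

Because no attempt ever touches another attempt's map, no cross-task atomicity is needed in this model; the only shared step is the final summation.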

Related

Accessing Hadoop Counters in MapReduce

I am having a problem accessing Counters from a different Configuration. Is there any way to access Hadoop Counters from different Configurations when implementing MapReduce in Java, or are the counters Configuration-specific?
Counters exist at two levels: job level and task level.
You need to use the configuration and context objects if you want to track job-level aggregations.
If you want to count at the task level, for example the number of times the map method is called, you can declare a field in the Mapper class, increment it each time the map method is called, and write it to the context object in the cleanup method.
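The task-level pattern from the last answer (count in a field, publish once in cleanup) can be sketched without Hadoop on the classpath. In this sketch `MapCallCounter` and its `counterSink` parameter are hypothetical stand-ins: in a real Mapper the sink would be whatever you use to write to the context object.

```java
import java.util.function.LongConsumer;

public class MapCallCounter {
    // Task-local state: lives in one mapper instance, so no synchronization is needed.
    private long mapCalls = 0;

    // Hypothetical stand-in for "write it to the context object".
    private final LongConsumer counterSink;

    MapCallCounter(LongConsumer counterSink) {
        this.counterSink = counterSink;
    }

    /** Called once per input record; only bumps the local field. */
    void map(String key, String value) {
        mapCalls++;
    }

    /** Called once at the end of the task; publishes the local count in one step. */
    void cleanup() {
        counterSink.accept(mapCalls);
    }

    public static void main(String[] args) {
        MapCallCounter mapper = new MapCallCounter(n -> System.out.println("map calls: " + n));
        mapper.map("k1", "v1");
        mapper.map("k2", "v2");
        mapper.cleanup(); // prints "map calls: 2"
    }
}
```

Publishing once in cleanup keeps the per-record path cheap, at the cost of the count not being visible while the task is still running.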

Difference between task counter and job counter

Can anyone please help me understand the difference between task counters and job counters in MapReduce?
Hadoop: The Definitive Guide says that task counters are those which are updated as the task progresses, and job counters are those which are updated as the job progresses.
Is this the only difference, or are there others?
Task Counters
Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. Task counters are sent in full every time, rather than sending the counts since the last transmission, since this guards against errors due to lost messages. Furthermore, during a job run, counters may go down if a task fails; for example, you don't want to add up the bad_records from a split processed by a failed task. So as the task progresses and completes successfully, its overall counts are sent to the tasktracker, which passes them on to the jobtracker.
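Why sending the full value guards against lost messages, and why totals can go down, is easier to see in a small model. The sketch below is plain Java, not Hadoop code; `SnapshotCounters` and its method names are hypothetical. The aggregator keeps only the latest full value per task, so a dropped report is repaired by the next one, and removing a failed task lowers the total.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotCounters {
    // Latest full counter value reported by each task.
    private final Map<String, Long> latestByTask = new HashMap<>();

    /** A task reports its complete running value, not the change since last time. */
    void report(String taskId, long fullValue) {
        latestByTask.put(taskId, fullValue);
    }

    /** A failed task's counts are discarded, so the job total may decrease. */
    void taskFailed(String taskId) {
        latestByTask.remove(taskId);
    }

    long total() {
        long sum = 0;
        for (long value : latestByTask.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        SnapshotCounters badRecords = new SnapshotCounters();
        badRecords.report("task-1", 10);
        // Suppose a report of 20 from task-1 is lost here; nothing needs replaying.
        badRecords.report("task-1", 30);
        badRecords.report("task-2", 5);
        System.out.println(badRecords.total()); // prints 35
        badRecords.taskFailed("task-2");
        System.out.println(badRecords.total()); // prints 30
    }
}
```

With delta reporting, the lost message would have left the total permanently short; with full snapshots the next report overwrites the stale value.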
Job counters
Job counters are maintained by the jobtracker (or application master in YARN), so they don’t need to be sent across the network, unlike all other counters. They measure job-level statistics, not values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the total number of map tasks launched, which is just a statistic about the overall job.

Query regarding shuffling in map reduce

How does a node running the mapper know that it has to send some key-value output to node A (running the reducer) and some to node B (running another reducer)?
Is a list of reducer nodes maintained somewhere by the JobTracker?
If yes, how does it choose a node to run the reducer?
A Mapper doesn't really know where to send the data. It focuses on 2 things:
Writing the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key, which determines the Reducer it will be sent to.
Reporting completion. Once a map task is done, it notifies its parent task tracker, which then notifies the job tracker. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a Reducer starts, it keeps asking the job tracker for the map outputs corresponding to its partition until it has retrieved them all. Whenever a map output is available, the reduce task starts copying it, and gradually merges as it copies.
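The "hash of the output key" step can be written in a couple of lines. The sketch below mirrors the formula used by Hadoop's default HashPartitioner (mask off the sign bit, then take the remainder by the number of reducers), but it is a standalone illustration: the class and method names here are hypothetical, and a job with a custom Partitioner would behave differently.

```java
public class HashPartitionSketch {

    /**
     * Maps a key to a partition number in [0, numReducers).
     * The mask clears the sign bit so a negative hashCode cannot
     * produce a negative partition index.
     */
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReducers));
        }
    }
}
```

Since every mapper applies the same function, all values for a given key end up in the same partition number on every map node, which is why a reducer only needs to know its own partition number to find its data.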
If this is still unclear, I would advise looking at the reference book on Hadoop, which has a whole chapter describing this part, including a schema that could help you visualize what happens in the shuffle step.
The mappers do not send the data to the reducers, rather the reducers pull the data from the task trackers where successful map tasks ran.
The Job Tracker, when allocating a reducer task to a task tracker, knows where the successful map tasks ran, and can compile a list of task tracker and map attempt task results to pull.
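The pull model described in these answers can be illustrated with a toy model. This is a sketch under assumptions, not Hadoop's implementation: `CompletionLog` stands in for the job tracker's record of finished map outputs, `ReducePuller` stands in for one reduce task, and the "fetch" is just a string being recorded.

```java
import java.util.ArrayList;
import java.util.List;

public class ShufflePullSketch {

    /** Stand-in for the job tracker's record of where finished map outputs live. */
    static final class CompletionLog {
        private final List<String> locations = new ArrayList<>();

        /** Called when a task tracker reports a successful map attempt. */
        void mapCompleted(String host, String mapAttemptId) {
            locations.add(host + "/" + mapAttemptId);
        }

        /** Returns every completion recorded at or after the given index. */
        List<String> eventsFrom(int index) {
            return new ArrayList<>(locations.subList(index, locations.size()));
        }
    }

    /** Stand-in for one reduce task that pulls its partition from each finished map. */
    static final class ReducePuller {
        private final int partition;
        private int nextEvent = 0; // how many completions this reducer has already seen
        final List<String> fetched = new ArrayList<>();

        ReducePuller(int partition) {
            this.partition = partition;
        }

        /** Asks for new completions and "copies" this reducer's partition from each. */
        int poll(CompletionLog log) {
            List<String> fresh = log.eventsFrom(nextEvent);
            for (String location : fresh) {
                fetched.add(location + "#partition-" + partition);
            }
            nextEvent += fresh.size();
            return fresh.size();
        }
    }

    public static void main(String[] args) {
        CompletionLog log = new CompletionLog();
        ReducePuller reducer = new ReducePuller(1);
        log.mapCompleted("hostA", "m_000000_0");
        reducer.poll(log);
        log.mapCompleted("hostB", "m_000001_0");
        reducer.poll(log);
        System.out.println(reducer.fetched);
    }
}
```

The mapper side never appears in the fetch path here: it only reports completion, and the reducer decides when to copy.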

Pass the maximum key encountered across all mappers as parameter to the next job

I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?
There is no way for 2 maps to talk to each other, so a map-only job (job1) cannot give you the global max. timestamp. However, I can think of 2 approaches, as below.
I assume your job1 is currently a map-only job and you are writing output from the map itself.
A. Change your mapper to write the main output using MultipleOutputs and not Context or OutputCollector. Emit an additional (key, value) pair as (constant, timestamp) using context.write(). This way, you shuffle only the (constant, timestamp) pairs to the reducer. Add a reducer that calculates the max. among the values it receives. Run the job with the number of reducers set to 1. The output written from the mapper will give you your original output, while the output written from the reducer will give you the global max. timestamp.
B. In job1, write the max. timestamp in each mapper as output. You can do this in cleanup(). Use MultipleOutputs to write to a folder other than that of your original output.
Once job1 is done, you have 'x' part files in the output folder, assuming you have 'x' mappers in job1. You can do a getmerge on this folder to get all the part files into a single local file. This file will have 'x' lines, each containing a timestamp. You can read this using a stand-alone Java program, find the global max. timestamp and save it in some local file. Share this file with job2 using the distributed cache, or pass the global max. as a parameter.
I would suggest the following: create a directory where each Mapper puts its maximum in a file named after the mapper name + id. The idea is to have a second output directory, and to avoid concurrency issues just make sure that each mapper writes to a unique file. Keep the maximum as a variable and write it to the file in each mapper's cleanup method.
Once the job completes, it's trivial to iterate over the secondary output directory to find the maximum.
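The final step, finding the global maximum from the per-mapper files, might look like the stand-alone program below. It is a sketch under several assumptions that are not in the original posts: the files have already been copied to a local directory, each line holds one timestamp as a decimal long, and marker or checksum files (names starting with "_" or ".") should be skipped. The class name `GlobalMaxTimestamp` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalLong;

public class GlobalMaxTimestamp {

    /** Scans every regular file in dir and returns the largest timestamp found. */
    static OptionalLong maxTimestamp(Path dir) throws IOException {
        long max = Long.MIN_VALUE;
        boolean found = false;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Assumption: skip marker and checksum files such as _SUCCESS or .crc files.
                if (!Files.isRegularFile(file) || name.startsWith("_") || name.startsWith(".")) {
                    continue;
                }
                for (String line : Files.readAllLines(file)) {
                    String trimmed = line.trim();
                    if (trimmed.isEmpty()) {
                        continue;
                    }
                    max = Math.max(max, Long.parseLong(trimmed));
                    found = true;
                }
            }
        }
        return found ? OptionalLong.of(max) : OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GlobalMaxTimestamp <local directory with per-mapper files>
        OptionalLong max = maxTimestamp(Paths.get(args[0]));
        System.out.println(max.isPresent() ? String.valueOf(max.getAsLong()) : "no timestamps found");
    }
}
```

With one small file per mapper this scan is cheap compared with the terabyte-scale job, and the printed value can be passed to Job2 as an ordinary parameter.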

Hadoop DistributedCache failed to report status

In a Hadoop job I am mapping several XML files and filtering an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, every line with just an integer as an ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
Before any map-job is executed.
The cluster consists of 40 nodes, and since the files of a DistributedCache are copied to the slave nodes before any tasks for the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000s. Of course I could raise the timeout even more, but this period should really suffice, shouldn't it?
Since the DistributedCache is meant to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here and whether there is another way to pass the relevant IDs to every map-job?
Can you add some debug printlns to your setup method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a Bloom filter to hold the IDs. At a good false-positive rate (~0.5%) a Bloom filter needs about 11 bits per ID, so 250 million distinct IDs come to roughly 350 MB, which is still far smaller than a HashSet of boxed integers. You can then run a secondary job to perform a partitioned check against the actual reference file.
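To make the Bloom filter suggestion concrete, here is a minimal sketch in plain Java. It is not Hadoop's own Bloom filter class: `IdBloomFilter`, its sizing helpers and the bit-mixing hash are assumptions made for illustration. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, so it is worth running `optimalBits` against your real ID count before settling on a memory budget. A `long[]` is used for the bits because at this scale the bit count can exceed what an int index can address.

```java
public class IdBloomFilter {
    private final long[] words;  // bit storage, 64 bits per entry
    private final long numBits;
    private final int numHashes;

    public IdBloomFilter(long expectedIds, double falsePositiveRate) {
        this.numBits = optimalBits(expectedIds, falsePositiveRate);
        this.numHashes = optimalHashes(expectedIds, numBits);
        this.words = new long[(int) ((numBits + 63) / 64)];
    }

    /** Standard sizing: m = -n * ln(p) / (ln 2)^2 bits. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Standard sizing: k = (m / n) * ln 2 hash functions, at least one. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** 64-bit mixing step (splitmix64 finalizer) used to scatter the IDs. */
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    /** i-th probe position, derived from two hashes (double hashing). */
    private long bitIndex(long id, int i) {
        long h1 = mix(id + 0x9e3779b97f4a7c15L);
        long h2 = mix(h1) | 1L;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(long id) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(id, i);
            words[(int) (bit >>> 6)] |= 1L << (bit & 63);
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    public boolean mightContain(long id) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(id, i);
            if ((words[(int) (bit >>> 6)] & (1L << (bit & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long bits = optimalBits(250_000_000L, 0.005);
        System.out.println("bits needed: " + bits + " (~" + bits / 8 / 1_000_000 + " MB)");
    }
}
```

A filter like this never reports a stored ID as missing, which is why the answer pairs it with a secondary check against the reference file to weed out the false positives.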
