Accessing Hadoop Counters in MapReduce

I am having a problem accessing Counters from a different Configuration. Is there any way to access Hadoop Counters across Configurations when implementing MapReduce in Java, or are the counters Configuration-specific?

Counters exist at two levels: the job level and the task level.
If you want to track job-level aggregations, you work through the configuration and the context object.
If you want to count at the task level, for example the number of times the map method is called, you can declare an instance variable in your Mapper class, increment it each time map is called, and write it to a counter via the context object in the cleanup method, as in the sketch below.
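A minimal sketch of that task-level pattern, assuming a hypothetical counter enum (the class and counter names are only for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical counter enum; any enum constant can be used as a counter key.
    public enum MyCounters { MAP_CALLS }

    // Plain instance variable, local to this task's mapper instance.
    private long mapCalls = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        mapCalls++;  // count locally instead of touching the counter for every record
        // ... normal map logic and context.write(...) would go here ...
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the local tally to the framework counter once per task.
        context.getCounter(MyCounters.MAP_CALLS).increment(mapCalls);
    }
}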

Related

Hadoop mapper task detailed execution time

For a given Hadoop MapReduce mapper task, I already have the task's complete execution time. In general, a mapper has three steps: (1) read input from HDFS or another source such as Amazon S3; (2) process the input data; (3) write the intermediate result to local disk. Now I am wondering whether it is possible to know the time spent in each step.
My purpose is to find out (1) how long it takes the mappers to read input from HDFS or S3; this simply indicates how fast a mapper can read and is essentially a measure of the mapper's I/O performance; and (2) how long it takes the mapper to process that data, which is more a measure of the task's computing capability.
Does anyone have any idea how to obtain these results?
Thanks.
Just implement a read-only mapper that does not emit anything. This will then give an indication of how long it takes for each split to be read (but not processed).
As a further step, you can define a variable that is passed to the job at runtime (via the job properties) and that lets you do exactly one of the following (for example by parsing the variable into an Enum value and switching on it; see the sketch after this answer):
just read
just read and process (but not write/emit anything)
do it all
This of course assumes that you have access to the mapper code.
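A rough sketch of that idea, assuming a made-up job property named timing.mapper.mode that the driver sets (e.g. conf.set("timing.mapper.mode", "READ_ONLY")) before submitting the job:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Modes parsed from the (hypothetical) job property.
    private enum Mode { READ_ONLY, READ_AND_PROCESS, FULL }
    private Mode mode;

    @Override
    protected void setup(Context context) {
        mode = Mode.valueOf(context.getConfiguration().get("timing.mapper.mode", "FULL"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (mode == Mode.READ_ONLY) {
            return;  // measure read time only
        }
        Text processed = process(value);  // placeholder for the real per-record work
        if (mode == Mode.FULL) {
            context.write(processed, new LongWritable(1));  // only the full mode emits
        }
    }

    private Text process(Text value) {
        // ... real processing would go here ...
        return value;
    }
}

Comparing the run times of the three modes then gives an approximation of the time spent reading, processing, and writing.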

Setting parameter in MapReduce Job configuration

Is there any way to set a parameter in the job configuration from the Mapper so that it is accessible from the Reducer?
I tried the below code
In Mapper: map(..) : context.getConfiguration().set("Sum","100");
In reducer: reduce(..) : context.getConfiguration().get("Sum");
But in the reducer the value is returned as null.
Is there any way to implement this, or is there anything I have missed?
As far as I know, this is not possible. The job configuration is serialized to XML at run-time by the jobtracker, and is copied out to all task nodes. Any changes to the Configuration object will only affect that object, which is local to the specific task JVM; it will not change the XML at every node.
In general, you should try to avoid any "global" state. It is against the MapReduce paradigm and will generally prevent parallelism. If you absolutely must pass information between the Map and Reduce phase, and you cannot do it via the usual Shuffle/Sort step, then you could try writing to the Distributed Cache, or directly to HDFS.
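For the write-directly-to-HDFS route, a small hedged sketch; the directory is made up, and in a real job you would name the file after the task attempt id so that tasks do not overwrite each other:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFileWriter {
    // Called from a task (e.g. in the mapper's cleanup) to persist one value.
    public static void writeValue(Configuration conf, String taskId, long value)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/job-side-data/" + taskId);  // hypothetical location
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.writeLong(value);
        }
    }
}

The reducer (or the driver, after the job finishes) can then list that directory and read the files back with FileSystem.open.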
If you are using the new API, your code should ideally work. Have you created this "Sum" property when setting up the job? For example, like this:
Configuration conf = new Configuration();
conf.set("Sum", "0");
Job job = new Job(conf);
If not, you could instead use
context.getConfiguration().setIfUnset("Sum","100");
in your mapper class to fix the issue. That is the only thing I can see.

How to increment a hadoop counter from outside a mapper or reducer?

I want to add something to the Hadoop counters from outside a mapper.
So, I want to access the getCounter on context object like this:
context.getCounter(counter, key).increment(amount)
I'm not able to get the context object from where I start the job. I can only do
job.getCounters().findCounter()
which doesn't let me add something to the hadoop counters.
You can only use/write to the counters from within the mapper/reducer tasks. The job tracker has built-in capabilities to interact with the counters, and you don't really want to interfere with what is already a complex setup.
I had exactly this issue a few months ago, trying to use the counters to store interim information, but I decided to write the information I needed to a defined HDFS directory and read that once my job was complete.
EDIT: why and what do you want to use the counter for outside of the mapper?
EDIT #2: if you want stats from a finished job, then counters are not the right place for that, because a) they don't seem to be writable once the job tracker is done collecting data, and b) they are intended for aggregating metrics across tasks. I had a similar need recently and ended up doing my stats sums in the job set-up class (on my edge node) and then writing the data to the logs.
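To illustrate the read-only access that is available from the driver once the job has finished (the group and counter names here are hypothetical):

Job job = new Job(conf, "My Job Name");
// ... set mapper, reducer, input/output paths ...
job.waitForCompletion(true);

long value = job.getCounters()
                .findCounter("MyGroup", "MyCounter")
                .getValue();
System.out.println("MyCounter = " + value);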

Lifespan of mapper class instance

Is a mapper class instance created and used for one InputSplit (i.e. one mapper task)? Or can multiple mapper class instances handle one InputSplit (or one mapper task)?
Each input split is handed to a mapper, and a mapper will only process a single input split.
However, if you have mapper speculative execution turned on, then an input split can be run by two mappers on different nodes in parallel (there are certain conditions that trigger speculative execution; you should be able to google them).
Also, if a map task fails, then the input split will be scheduled to run on another cluster node as another map task.
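To make this visible in the task logs, a small sketch: one Mapper object is constructed per map task, so setup() and cleanup() each run exactly once per input split.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private long records = 0;

    @Override
    protected void setup(Context context) {
        // Logged once per task / input split.
        System.err.println("new mapper instance for split: " + context.getInputSplit());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        records++;
    }

    @Override
    protected void cleanup(Context context) {
        // Logged once per task, just before the instance is discarded.
        System.err.println("records seen by this instance: " + records);
    }
}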

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple map-reduce program on the Hadoop platform (Cloudera distribution). Each map and reduce task writes some diagnostic information to standard output in addition to its regular work.
However, when I look at these log files, I find that the map tasks are relatively evenly distributed among the nodes (I have 8 nodes), but the reduce task standard output logs can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce jobs also distribute evenly ?
If all of the output from your mappers has the same key, it will be sent to a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see the reducers it has as well as what machines are running them. There is other information that can be drilled into that will provide information that should be helpful in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducers are available on each node?
Does the node running the reducer have better hardware than the other nodes?
Hadoop decides which reducer will process which output keys by using a Partitioner.
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.:
public class MyCustomPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE>
{
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Do something based on the key or value to determine which
        // partition (i.e. which reducer) we want it to go to; for example,
        // spread records by key hash:
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
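A hedged sketch of that Configurable variant; the property name and key/value types are made up for illustration:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConfiguredPartitioner extends Partitioner<Text, LongWritable>
        implements Configurable {

    private Configuration conf;
    private int salt;

    @Override
    public void setConf(Configuration conf) {
        // Called by the framework when the partitioner is instantiated.
        this.conf = conf;
        this.salt = conf.getInt("my.partitioner.salt", 0);  // hypothetical job property
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        return ((key.hashCode() + salt) & Integer.MAX_VALUE) % numPartitions;
    }
}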
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks"), or in code, eg
job.setNumReduceTasks(1);
