Setting a parameter in the MapReduce job configuration - hadoop

Is there any way to set a parameter in the job configuration from the Mapper so that it is accessible from the Reducer?
I tried the code below:
In the Mapper, in map(..): context.getConfiguration().set("Sum", "100");
In the Reducer, in reduce(..): context.getConfiguration().get("Sum");
But in the reducer the value comes back as null.
Is there any way to make this work, or have I missed something?

As far as I know, this is not possible. The job configuration is serialized to XML at run-time by the jobtracker and copied out to all task nodes. Any change you make to the Configuration object inside a task only affects that object, which is local to that task's JVM; it does not change the XML on any other node.
In general, you should try to avoid "global" state. It goes against the MapReduce paradigm and will generally limit parallelism. If you absolutely must pass information between the map and reduce phases, and you cannot do it via the usual shuffle/sort step, then you could try writing to the DistributedCache, or directly to HDFS.
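For illustration, here is a minimal sketch (not part of the original answer) of the "write directly to HDFS" route: each map task accumulates a partial result and writes it to a side file in cleanup(), and the reducer or the driver reads those files afterwards. The side-file path and the summing logic are made up.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int sum = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        sum += Integer.parseInt(value.toString().trim());
        // ... emit regular map output with context.write(...) as usual ...
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Write this task's partial sum to a side file on HDFS; the reducer or
        // the driver can read and combine these files later. Note that with
        // speculative execution a task attempt may run more than once.
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path sideFile = new Path("/tmp/sum-side-data/" + context.getTaskAttemptID());
        try (FSDataOutputStream out = fs.create(sideFile)) {
            out.writeInt(sum);
        }
    }
}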

If you are using the new API, your code should ideally work. Have you created this "Sum" property at the start, when the job is created? For example like this:
Configuration conf = new Configuration();
conf.set("Sum", "0");
Job job = new Job(conf);
If not, you could instead use
context.getConfiguration().setIfUnset("Sum", "100");
in your mapper class to fix the issue. That is the only thing I can see.

Related

Set result from previous Reducer as configuration parameter

As part of the calculation logic in a MapReduce workflow, I need to take the result from one reducer as a parameter for the next reducer in the chain.
Path plc = new Path(args[1] + "/3");          // output path of the previous reducer
Configuration c4 = new Configuration();
c4.set("denom", GetLineC.extCount(plc));      // GetLineC.extCount is a function that returns a value
ControlledJob cJob4 = new ControlledJob(c4);
I'm using JobControl to create the dependency between the jobs and all the configuration. When the program is executed it gives "No such file or directory". By the time control reaches this part of the flow, the file will be present at this location, but since the configuration is instantiated at the beginning, this error shows up.
Is there a way to set the single-line output from the previous reducer as a parameter directly?
Well, I think you mean previous job rather than previous reducer. If you're executing the two jobs from the same driver class, you already know the output of the previous job, which is a directory. Since you're using only one reducer, it will write its output to a part-r-00000 file inside that output path. To set it as a configuration parameter for the next job, you will have to read this file manually.
Are you considering that in GetLineC.extCount(Path path)?
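For illustration, a rough sketch of reading that file manually (the helper name is mine); it can only run after the previous job has actually finished, which is exactly why building every Configuration up front with JobControl hits the missing file:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads the single line written by a one-reducer job to part-r-00000.
public static String readSingleReducerOutput(Configuration conf, Path outputDir)
        throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(outputDir, "part-r-00000");
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(part), StandardCharsets.UTF_8))) {
        return reader.readLine();
    }
}
The driver could then call c4.set("denom", readSingleReducerOutput(c4, new Path(args[1] + "/3"))); once the upstream job is done.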

Hadoop: force 1 mapper task per node from jobconf

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration the tasktracker runs with (I'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried setting mapred.tasktracker.map.tasks.maximum=1 on the hadoop jar command line, but the tasktracker ignores it because it has a different setting in its own configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
When you set the number of mappers, whether through the configuration files or by some other means, it's just a hint to the framework; it doesn't guarantee that you'll get exactly that many mappers. The creation of mappers is actually governed by the number of splits, and split creation is carried out by the logic in your InputFormat. If you really want just one mapper to process an entire file, override isSplitable() to return false in the InputFormat class you are using. But why would you do that? The power of Hadoop lies in distributed parallel processing.
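A minimal sketch of that override, assuming the new mapreduce API and text input (the class name is mine):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Marks every input file as non-splittable, so each file is read by exactly one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
The driver would then use job.setInputFormatClass(WholeFileTextInputFormat.class);. Note this gives one mapper per input file, which is not quite the same as one mapper per node.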

How to run hadoop multithread way in single JVM?

I have a 4-core desktop and want to use all my cores for local data processing with Hadoop.
(i.e. sometimes I have enough power to process data locally, sometimes I submit the same jobs to the cluster).
By default Hadoop local mode runs only one mapper and one reducer, so my local jobs are really slow.
I do not want to set up a cluster on a single machine, first because of the "painful" configuration and second because I would have to create a jar each time. So the perfect solution would be to run embedded Hadoop on a single machine.
P.S. Pseudo-distributed mode is a bad option since it creates a cluster with a single node, so I would get only one mapper and would still have to spend time on additional configuration.
You need to use MultithreadedMapRunner: just set it up via JobConf's setMapRunnerClass method and don't forget to set mapred.map.multithreadedrunner.threads to the desired concurrency level.
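Roughly, with the old mapred API that looks like the following (a sketch; the thread count is a placeholder):
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

JobConf conf = new JobConf();
conf.setMapRunnerClass(MultithreadedMapRunner.class);
// number of concurrent map() calls inside a single task JVM
conf.setInt("mapred.map.multithreadedrunner.threads", 4);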
There is also another way; you should:
set MultithreadedMapper as your mapper class on the Job object
call MultithreadedMapper.setMapperClass with your actual mapper class
call MultithreadedMapper.setNumberOfThreads with the desired concurrency level
But be careful: your mapper class must be thread-safe, and its setup and cleanup methods will be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipleOutputs unless you implement your own MultithreadedMapper-inspired class.
Hadoop purposely does not run more than one task at the same time in one JVM for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode, and increase the max number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.
Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MultithreadedMapper.class);
// delegate the real map work to your own (thread-safe) mapper and pick a thread count
MultithreadedMapper.setMapperClass(job, YourMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple map-reduce program on the Hadoop platform (Cloudera distribution). Each map and reduce task writes some diagnostic information to standard output besides doing its regular work.
However, when I look at these log files, I find that the map tasks are relatively evenly distributed among the nodes (I have 8 nodes), but the reduce task standard output logs can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce jobs distribute evenly as well?
If the outputs from your mappers all have the same key, they will all go to a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see the reducers it has as well as what machines are running them. There is other information that can be drilled into that will provide information that should be helpful in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducer slots are available on each node?
Does the node running the reducers have better hardware than the other nodes?
Hadoop decides which reducer will process which output keys by using a Partitioner.
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.
public class MyCustomPartitioner extends Partitioner<KEY, VALUE> {
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Decide which partition (and hence which reducer) this record goes to,
        // based on the key and/or value; for example, hash the key:
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks") or in code, e.g.
job.setNumReduceTasks(1);

Global variables in hadoop

My program follows an iterative map/reduce approach, and it needs to stop when certain conditions are met. Is there any way I can set a global variable that is distributed across all map/reduce tasks, and check whether that global variable has reached the completion condition?
Something like this:
while (!condition) {
    Configuration conf = getConf();
    Job job = new Job(conf, "Dijkstra Graph Search");
    job.setJarByClass(GraphSearch.class);
    job.setMapperClass(DijkstraMap.class);
    job.setReducerClass(DijkstraReduce.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.waitForCompletion(true);
    // ... update condition based on this iteration's results ...
}
Where condition is a global variable that is modified during/after each map/reduce execution.
Each time you run a map-reduce job, you can examine the state of the output, the values contained in the counters, etc., and make the decision at the node controlling the iteration about whether you want one more iteration or not (see the sketch after the list below). I guess I don't understand where the need for global state comes from in your scenario.
More generally, there are two main ways state is shared between executing nodes (although it should be noted that sharing state is best avoided, since it limits scalability):
Write a file to HDFS that other nodes can read (make sure the file gets cleaned up when the job exits, and that speculative execution won't cause weird failures).
Use ZooKeeper to store some data in dedicated ZK tree nodes.
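As a sketch of the counter-inspection approach mentioned above (the counter group and name are placeholders; the reducer would increment that counter whenever more work remains):
// Driver-side iteration control.
long unsettled = 1;
while (unsettled > 0) {
    Configuration conf = getConf();
    Job job = new Job(conf, "Dijkstra Graph Search");
    // ... usual mapper/reducer/input/output setup ...
    if (!job.waitForCompletion(true)) {
        throw new RuntimeException("iteration failed");
    }
    // Examine the finished job's counters at the controlling node
    // to decide whether another iteration is needed.
    unsettled = job.getCounters().findCounter("search", "unsettled").getValue();
}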
You can use Configuration.set(String name, String value) to set a value you will be able to access in your Mappers/Reducers/etc:
In your driver:
conf.set("my.dijkstra.parameter", "value");
And e.g. in your mapper:
public void configure(JobConf job) {
    myParam = job.get("my.dijkstra.parameter");
}
But this is unlikely to help you look at the output of previous jobs to decide whether to start one more iteration; i.e. this value will not be pushed back after job execution.
You can also use Hadoop's DistributedCache to store files that will be distributed among all nodes. This is a bit better than simply storing something on HDFS if the value you are going to pass this way is small.
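For instance, a small sketch of shipping a side file to every task with the distributed cache (new API; the path is a placeholder):
// In the driver: make the file available to all tasks.
Job job = new Job(conf, "Dijkstra Graph Search");
job.addCacheFile(new Path("/user/me/iteration-state.txt").toUri());
// In a task, context.getCacheFiles() returns the URIs of the cached files.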
Of course counters can also be used for this purpose, but they don't look reliable enough for making decisions in the algorithm. It looks like in some cases they can be incremented twice (if a task is executed more than once, e.g. in case of failure or speculative execution) - but I am not sure.
This is how it works in Hadoop 2.0
In your driver:
conf.set("my.dijkstra.parameter", "value");
And in your Mapper:
protected void setup(Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    strProp = conf.get("my.dijkstra.parameter");
    // and then you can use it
}
You can use Cascading to organize multiple Hadoop jobs. Specify an HDFS path where you want to keep the global state variable and initialize it with dummy contents. On each iteration, read the current contents of this HDFS path, delete those contents, perform any number of map/reduce steps, and finally perform a global reduce that updates the global state variable. Depending on the nature of your task, you may need to disable speculative execution and allow for many retries.
