shared variable in map reduce - hadoop

I need a variable that is shared among reduce tasks, where each reduce task can read and write it atomically.
The reason I need such a variable is to assign a unique identifier to each file created by a reduce task (the number of files created by the reduce tasks is not deterministic).
Thanks

In my understanding, ZooKeeper is built specifically to maintain atomic access to cluster-wide variables.
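For the use case above (handing out unique file IDs), a sequential znode is a natural fit: each create() atomically appends a unique, increasing counter to the node name. A minimal sketch, assuming a ZooKeeper ensemble at zkhost:2181 and an already-existing /file-ids parent znode (both placeholders):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class UniqueIdFromZk {
    public static long nextId() throws Exception {
        // Connect to the ensemble; "zkhost:2181" is a placeholder.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, null);
        try {
            // PERSISTENT_SEQUENTIAL makes ZooKeeper append a unique,
            // monotonically increasing suffix, e.g. /file-ids/id-0000000042.
            // The /file-ids parent znode must already exist.
            String path = zk.create("/file-ids/id-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            // The sequence suffix is the cluster-wide unique ID.
            return Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
        } finally {
            zk.close();
        }
    }
}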

I would recommend using FileSystem.createNewFile().
Have a look here:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html#createNewFile%28org.apache.hadoop.fs.Path%29
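createNewFile() is atomic: it returns true only for the one caller that actually creates the file. A rough sketch of using that to claim unique IDs, where the /uniq marker directory is an assumption of this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueIdFromHdfs {
    // Claim the first free numeric ID by atomically creating a marker file.
    // The /uniq marker directory is a placeholder; use any HDFS path you own.
    public static int claimId(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        for (int id = 0; ; id++) {
            // createNewFile() returns true only for the task that creates
            // the file first, so exactly one reducer can claim each ID.
            if (fs.createNewFile(new Path("/uniq/id-" + id))) {
                return id;
            }
        }
    }
}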

All the output files produced by the reducers already have unique names, such as part-r-00001.
There is a partition number you can read from your code in case you need that number.
Centralized counters that must be guaranteed unique break a lot of the scalability of Hadoop.
So if you need something different, I would use something like a SHA-1 of the reducer's task ID to get something that is unique across multiple jobs, as in the sketch below.
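A sketch of that idea in a new-API reducer: hash the task attempt ID (unique per task, across jobs) with SHA-1. The reducer's types and the println are placeholders:

import java.security.MessageDigest;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueNameReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws java.io.IOException {
        try {
            // The task attempt ID is unique across jobs on the cluster.
            String taskId = context.getTaskAttemptID().toString();
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(taskId.getBytes("UTF-8"));
            StringBuilder name = new StringBuilder();
            for (byte b : digest) {
                name.append(String.format("%02x", b));
            }
            // name is now a 40-char hex string unique to this task attempt.
            System.out.println("unique name: " + name);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}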

Related

how output files (part-m-0001/part-r-0001) are created in map reduce

I understand that the MapReduce outputs are stored in files named like part-r-* for reducers and part-m-* for mappers.
When I run a MapReduce job I sometimes get the whole output in a single file (around 150 MB), and sometimes for almost the same data size I get two output files (one 100 MB and the other 50 MB). This seems very random to me, and I can't figure out the reason for it.
I want to know how it is decided whether to put the data in a single or in multiple output files, and whether there is any way we can control it.
Thanks
Contrary to what is stated in the answer by Jijo here, the number of files depends on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file per Reducer. The number of Reducers is set by job.setNumReduceTasks().
If there are no Reducers in your job, then it is one part-m-* file per Mapper. There is one Mapper per InputSplit (and usually, unless you use a custom InputFormat implementation, one InputSplit per HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.
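In code, that means fixing the reducer count in the driver; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FixedReducerCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "fixed output file count");
        // One part-r-* file per reducer: this job writes part-r-00000
        // through part-r-00003, regardless of cluster size.
        job.setNumReduceTasks(4);
        // job.setNumReduceTasks(0) would make the job map-only, producing
        // one part-m-* file per input split instead.
    }
}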

hadoop get actual number of mappers

In the map phase of my program, I need to know the total number of mappers that are created. This will help me in the key creation process of the map (I want to emit as many key-value pairs for each object as the number of mappers).
I know that setting the number of mappers is just a hint, but what is the way to get the actual number of mappers?
I tried the following in the configure method of my Mapper:
public void configure(JobConf conf) {
    // Total number of map tasks the framework planned for this job.
    System.out.println("map tasks: " + conf.get("mapred.map.tasks"));
    // The task-in-progress ID of this particular map task.
    System.out.println("tipid: " + conf.get("mapred.tip.id"));
    // The zero-based partition number of this task.
    System.out.println("taskpartition: " + conf.get("mapred.task.partition"));
}
But I get the results:
map tasks: 1
tipid: task_local1204340194_0001_m_000000
taskpartition: 0
map tasks: 1
tipid: task_local1204340194_0001_m_000001
taskpartition: 1
which means (?) that there are two map tasks, and not just one as printed (having two is quite natural, since I have two small input files). Shouldn't the number after "map tasks" be 2?
For now, I just count the number of files in the input folder, but this is not a good solution, since a file can be larger than the block size and result in more than one input split, and hence more than one mapper. Any suggestions?
Finally, it seems that conf.get("mapred.map.tasks") DOES work after all, when I generate an executable jar file and run my program on the cluster or locally. Now the output of "map tasks" is correct.
It did not work only when running my MapReduce program locally on Hadoop from the Eclipse plugin. Maybe it is an issue with the Eclipse plugin.
I hope this will help someone else having the same issue. Thank you for your answers!
I don't think there is an easy way to do this. I've implemented my own InputFormat class; if you do that, you can implement a method to count the number of InputSplits, which you can call in the process that starts the job. If you put that number in some Configuration setting, you can read it in your mapper process, as in the sketch below.
By the way, the number of input files is not always the number of mappers, since large files can be split.
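A rough sketch of that idea, using the stock TextInputFormat to count the splits in the driver; the property name actual.mapper.count is made up for this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split count");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Ask the InputFormat up front how many splits (and thus mappers)
        // the job will get for the configured input paths.
        int splits = new TextInputFormat().getSplits(job).size();

        // "actual.mapper.count" is an arbitrary name; mappers read it back
        // with context.getConfiguration().getInt("actual.mapper.count", -1).
        job.getConfiguration().setInt("actual.mapper.count", splits);
    }
}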

How to define a shared (global) variable in Hadoop?

I need a shared (global) variable which is accessible to all mappers and reducers. The mappers just read its value, but the reducers change some values in it to be used in the next iteration. I know the DistributedCache is a technique to do that, but it only supports reading a shared value.
This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to write something now and then.
The other option would be to set values in the configuration object. However, this only persists globally for a single job. You'd have to manage the passing of this value across jobs yourself. Also, you can't change it while the job is running.
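A minimal sketch of the Configuration approach; the iteration.threshold property name and the mapper's types are just examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfSharedValue {
    public static class ReadingMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private float threshold;

        @Override
        protected void setup(Context context) {
            // Every task sees the value the driver set; it is read-only here.
            threshold = context.getConfiguration()
                    .getFloat("iteration.threshold", 0.0f);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set once per job in the driver; to "update" it for the next
        // iteration, set a new value and submit a new job.
        conf.setFloat("iteration.threshold", 0.75f);
        Job job = new Job(conf, "iteration 1");
        job.setMapperClass(ReadingMapper.class);
    }
}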

When is data written to the output file in MapReduce? How can I use processed reducer output data in the reducer?

I am using Hadoop version 1.0.0.
After processing each reducer input key, I collect the output, but it is not written to the actual output file. I am trying to use the already-processed intermediate output while processing further input keys. How can I do this?
Could you please suggest how I can use that intermediate data? When does MapReduce write data to the output file?
What you ask for goes against the MR paradigm, and as with any deviation from the concept, that has consequences.
Technically, the data is passed to the OutputFormat, and it is at its discretion when to push it to the output. I think it is written during the job, but you might see some delay before it appears.
I think it would be easier for you to explicitly accumulate the processed data in the reducer and use it from there, although this solution has an inherent problem: you can run out of memory if there are enough keys. A sketch of this idea follows below.
I would suggest using two MR jobs, or some other technique that makes the reducer stateless, or at least limits the amount of data it can accumulate.
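A sketch of the in-memory accumulation idea, with the out-of-memory caveat above; the value types and the "processing" are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce calls within one task run sequentially in a single JVM, so an
// instance field survives from one key to the next.
public class AccumulatingReducer extends Reducer<Text, Text, Text, Text> {
    private final List<String> seenSoFar = new ArrayList<String>();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Results for earlier keys are available while processing this key.
        for (String earlier : seenSoFar) {
            // ... use earlier results here ...
        }
        String result = key.toString(); // placeholder for real processing
        seenSoFar.add(result);          // grows unboundedly: OOM risk
        context.write(key, new Text(result));
    }
}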

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of some data read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file on the job level; is it possible to do a similar thing for the mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how the job output is handled in this case. Usually you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output at all. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (the identity function), as sketched below.
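A sketch of the identity-reducer route: keep the stock Reducer and force a single reduce task, so everything lands in one part-r-00000 file. The key/value types and output path are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "single CSV file");
        // job.setMapperClass(...) would read from HBase here.
        // The base Reducer just forwards its input (identity), and one
        // reduce task funnels all mapper output into a single part-r-00000.
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}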
Seriously, this is not how MapReduce works.
Why do you even need a job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same, as noted below.
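One such utility is hadoop fs -getmerge, which concatenates every file in an HDFS directory into one local file; assuming the job wrote its output to /job/output (a placeholder path):

hadoop fs -getmerge /job/output /tmp/merged.csv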
