How to define a shared (global) variable in Hadoop? - hadoop

I need a shared (global) variable that is accessible to all mappers and reducers. Mappers only read its value, but reducers change some values in it to be used in the next iteration. I know DistributedCache is a technique for this, however it only supports reading a shared value.

This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to write something now and then.
The other option would be to set values in the Configuration object. However, this only persists globally for a single job; you'd have to manage passing the value across jobs yourself. Also, you can't change it while the job is running.
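To illustrate the Configuration approach, here is a minimal sketch (the property name my.shared.threshold is just a placeholder, not a real Hadoop setting): the driver sets the value before submission and a mapper reads it once in setup(). Reducers cannot push changes back this way; you would have to write the new value somewhere yourself (e.g. HDFS or counters) and feed it into the next job's Configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedValueExample {

    // Mapper reads the value once in setup(); it cannot write it back.
    public static class ReadOnlyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private long threshold;

        @Override
        protected void setup(Context context) {
            threshold = context.getConfiguration().getLong("my.shared.threshold", 0L);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // use 'threshold' here; changing it only affects this task's JVM
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("my.shared.threshold", "42");   // value for this job only
        Job job = new Job(conf, "iteration-1");
        job.setJarByClass(SharedValueExample.class);
        job.setMapperClass(ReadOnlyMapper.class);
        // ... set input/output paths, then job.waitForCompletion(true)
    }
}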

Related

Can HDFS block size be changed during job run? Custom Split and Variant Size

I am using Hadoop 1.0.3. Can the input split/block size be changed (increased/decreased) at run time based on some constraints? Is there a class to override, such as FileSplit/TextInputFormat, to accomplish this? Can we have variable-size blocks in HDFS depending on a logical constraint within one job?
You're not limited to TextInputFormat... That's entirely configurable based on the data source you are reading. Most examples are line-delimited plain text, but that obviously doesn't work for XML, for example.
No, block boundaries can't change at runtime, since your data is already on disk and ready to read.
The InputSplit, however, depends on the InputFormat for the given job, which should remain consistent throughout a particular job. The Configuration object in the code, on the other hand, is basically a HashMap, and that can certainly be changed while the job is running.
If you want to change the block size only for a particular run or application, you can do so by passing "-D dfs.block.size=134217728". This lets you change the block size for your application instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728
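Note that the -D generic option is only picked up if your driver goes through GenericOptionsParser, typically via ToolRunner. A rough sketch (class name and usage line are just an illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();           // already contains the -D overrides
        System.out.println(conf.get("dfs.block.size"));
        // ... build and submit the job with this conf
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g.: hadoop jar myjob.jar MyDriver -D dfs.block.size=134217728 <in> <out>
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}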

Does Hadoop behave differently in local and distributed mode for static variables?

Suppose I have static variables assigned to class variables in my mapper. The value of a static variable depends on the job, so it is the same for the set of input splits being executed for that job, and hence I can assign the job-specific values directly as static variables in my mapper (in the JVM running that job's tasks).
For some different job these values will change, since it is a different job with its own classpath variables, but I believe it will not impact the former job because the two run in different JVMs.
Now, if I try this in local mode, that different job will be running in the same JVM, so when it tries to override the job-specific class variables which my former job had set, it will cause a problem for my former job.
So can we say that the behavior of the same code in local and distributed mode is not always the same?
The class variables I am setting are nothing but some resource-level and distributed-cache values.
I know the use case is not good, but I just wanted to know whether this is what will happen with static variables.
Thanks.
The usage of static variables is not encouraged, for the same reason you mentioned. The behavior is definitely different based on the mode in which Hadoop is running. If the static is just a resource name and you are only reading it, the usage is fine. But if you try to modify it, it will have an impact in standalone mode. Also, as you know, the standalone and pseudo-distributed modes are just for beginners and learning. Use cases should not dictate our learning :) Happy learning.
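As a rough illustration of the difference (class and property names below are made up): in distributed mode every task attempt runs in its own child JVM, so each gets a fresh copy of the static; in local mode all tasks of all jobs share one JVM, so a later job sees whatever an earlier one left behind.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StaticStateMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical job-specific value stored as a static.
    private static String jobResource;

    @Override
    protected void setup(Context context) {
        // Distributed mode: each task JVM runs this independently, so the static
        // is effectively per-task. Local mode: every task of every job shares the
        // same JVM, so this overwrites whatever a previous job's tasks had set.
        jobResource = context.getConfiguration().get("my.job.resource", "default");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(jobResource), key);
    }
}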

How to check if a parameter is set or not on Hadoop?

How can I check the value of a particular parameter (say io.sort.mb) on Hadoop while I'm running a benchmark (say TeraGen)?
I know you can always go to the configuration files and look, but I have many configuration files, plus some parameters get overwritten (like the number of map tasks).
I don't have a GUI. Is there any command to see this?
Thanks!
Whatever you set in the configuration files should be available in your job.xml, which you can find in the JobTracker under the specific job's entry.
From code, these values are available in the Configuration object.
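For a quick check of what the client side would use, you can also print the value from the configuration in a small program; the per-job, authoritative value is still the one in job.xml. A minimal sketch:

import org.apache.hadoop.mapred.JobConf;

public class PrintConf {
    public static void main(String[] args) {
        // JobConf pulls in mapred-site.xml in addition to core-site.xml,
        // so MapReduce parameters like io.sort.mb are resolved here.
        JobConf conf = new JobConf();
        System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
    }
}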

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of some data read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file at the job level; is it possible to do a similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a MapReduce job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how job output is handled in this case. Usually you get one result file per reducer. Without reducers, I could imagine that you either get one file per mapper or that you cannot produce job output at all. You will have to try or research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (the identity function).
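A sketch of both options just described: a map-only job (which in practice gives one part-m-xxxxx file per mapper) versus a single identity reducer (everything merged into one part-r-00000, at the cost of funnelling all data through one task):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class OutputOptions {

    // Option 1: map-only job -> mapper output is written directly, one file per mapper.
    static void mapOnly(Job job) {
        job.setNumReduceTasks(0);
    }

    // Option 2: a single identity reducer -> one merged output file.
    // Does not scale to huge outputs, since everything passes through one reducer.
    static void singleFile(Job job) {
        job.setReducerClass(Reducer.class);   // the base Reducer just forwards its input
        job.setNumReduceTasks(1);
    }
}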
Seriously, this is not how MapReduce works.
Why do you even need a job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same.

Shared variable in MapReduce

I need a variable that is shared between reduce tasks, and each reduce task must be able to read and write it atomically.
The reason I need such a variable is to give a unique identifier to each file created by a reduce task (the number of files created by the reduce tasks is not deterministic).
Thanks
In my understanding, ZooKeeper is built specifically to maintain atomic access to cluster-wide variables.
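For example, a ZooKeeper-backed counter can use the znode version for an atomic compare-and-set. A rough sketch, assuming a running ZooKeeper ensemble; connection string and znode path are placeholders and session/error handling is omitted:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkCounter {

    private final ZooKeeper zk;
    private final String path;

    public ZkCounter(String connectString, String path) throws Exception {
        this.zk = new ZooKeeper(connectString, 30000, null);
        this.path = path;
        try {
            zk.create(path, "0".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignore) {
            // counter already initialised by another task
        }
    }

    // Atomically increments the counter and returns the new value.
    public long next() throws Exception {
        while (true) {
            Stat stat = new Stat();
            long current = Long.parseLong(new String(zk.getData(path, false, stat)));
            try {
                zk.setData(path, Long.toString(current + 1).getBytes(), stat.getVersion());
                return current + 1;
            } catch (KeeperException.BadVersionException retry) {
                // another reducer updated it first; re-read and try again
            }
        }
    }
}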
I would recommend using FileSystem.createNewFile().
Have a look here:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html#createNewFile%28org.apache.hadoop.fs.Path%29
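The idea is that createNewFile() returns false if the path already exists, so a reducer can atomically claim the first free identifier. A rough sketch (the lock directory is a made-up example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileIdAllocator {

    // Claims the lowest unused id by atomically creating a marker file in lockDir.
    public static int claimId(Configuration conf, String lockDir) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        int id = 0;
        // createNewFile returns false if the file already exists,
        // so only one reducer can win each id.
        while (!fs.createNewFile(new Path(lockDir, "id-" + id))) {
            id++;
        }
        return id;
    }
}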
All the output files produced by the reducers already have unique names, such as part-r-00001.
There is a partition number you can read from your code in case you need that number.
Centralized counters that must be guaranteed unique break a lot of the scalability of Hadoop.
So if you need something different, I would use something like a SHA-1 of the reducer's task ID to get something that is unique across multiple jobs.
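For instance, both the partition number and the full task attempt ID are available from the reducer's context (the sketch below assumes the new org.apache.hadoop.mapreduce API; the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueNameReducer extends Reducer<Text, Text, Text, Text> {

    private String uniquePrefix;

    @Override
    protected void setup(Context context) {
        // Partition number: 0, 1, 2, ... -- the same number used in part-r-0000N.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        // The full attempt id (e.g. attempt_201201011200_0001_r_000003_0) is
        // already unique across jobs and can be hashed if a shorter name is needed.
        String attemptId = context.getTaskAttemptID().toString();
        uniquePrefix = attemptId + "-" + partition;
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // use uniquePrefix when naming side files created by this reducer
    }
}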
