mapreduce how to share global const variable - hadoop

How can I give all of my mappers access to a single variable, for example a TreeMap object, without having each mapper re-construct the TreeMap every time? The object will never be modified again once it is constructed.

Consider putting the contents of the TreeMap object in the Distributed Cache. If it is a small amount of data, you can place the object contents in your configuration object:
conf.set("key", "value");
then use the JobConf object to access it in your mapper.
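For illustration, here is a minimal sketch using the newer mapreduce API (the key name "lookup.data" and the comma-separated encoding are just assumptions for the example); with the old mapred API you would read the value from the JobConf in configure() instead. The TreeMap is rebuilt once per mapper task in setup(), not once per record:

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final TreeMap<String, String> lookup = new TreeMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver stored e.g. conf.set("lookup.data", "a=1,b=2") before submitting the job.
        String serialized = context.getConfiguration().get("lookup.data", "");
        for (String pair : serialized.split(",")) {
            String[] kv = pair.split("=");
            if (kv.length == 2) {
                lookup.put(kv[0], kv[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the shared, read-only lookup table for every record of this task.
        String resolved = lookup.get(value.toString().trim());
        if (resolved != null) {
            context.write(value, new Text(resolved));
        }
    }
}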

Related

How to pass different set of data to two different mappers of the same job

I have one single mapper, say SingleGroupIdentifierMapper.java.
This is a generic mapper that does all the filtering on a single line of mapper-input value/record, based on a property file (containing filters and key/value field indexes) passed to it from the driver class via the distributed cache.
Only the reducer business logic is different; the mapper logic has been kept generic and is driven by the property file mentioned above.
My problem is that I now have input from multiple sources with different formats. That means I have to do something like:
MultipleInputs.addInputPath(conf, new Path("/inputA"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
But the cached property file that I pass from the driver class to the mapper (to implement the field-index-based filtering) is common. So how can I pass two different property files to the same mapper, so that if it processes, say, input A it uses PropertyFileA (to filter and create key/value pairs), and if it processes input B it uses PropertyFileB?
It is possible to change the generic code of the mapper to handle this scenario, but how should I approach it in the generic class, and how can I identify within the same mapper class whether the input comes from inputA or inputB, so that I can apply the corresponding property-file configuration to the data?
Can we pass arguments to the constructor of this mapper class to specify that the input is from inputB, or which property file in the cache it needs to read?
E.g. something like:
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, args[], SingleGroupIdentifierMapper.class);
where args[] is passed to the SingleGroupIdentifierMapper class's constructor, which we define to accept it and store it as an attribute.
Any thoughts or expertise are most welcome.
Hope I was able to express my problem clearly, kindly ask me in case there needs to be more clarity in the question.
Thanks in Advance,
Cheers :)
Unfortunately MultipleInputs is not that flexible. But there is a workaround: match the InputSplit paths to the property files in the setup method of the Mapper. If you are not using any sort of Combine*Format, then a single mapper will process a single split from a single file:
When adding the property files to the cache, use /propfile_1#PROPS_A and /propfile_2#PROPS_B
Store the input paths in the configuration: job.getConfiguration().set("PROPS_A", "/inputA") and job.getConfiguration().set("PROPS_B", "/inputB")
In the Mapper.setup(Context context) method, use context.getInputSplit().toString() to get the path of the split, then match it against the paths saved under context.getConfiguration().get("PROPS_A") or PROPS_B (see the sketch below)
If you are using some Combine*Format, then you would need to extend it and override getSplits to use information from the JobContext to build the PathFilter[] and call createPool, which will create splits that contain files from the same group (inputA or inputB).
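A rough sketch of step 3, assuming the plain (non-Combine) case where each split is a FileSplit; the class name and field names here are hypothetical, and the generic filtering itself is left out:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class GroupIdentifierMapperSketch extends Mapper<LongWritable, Text, Text, Text> {

    // Symlink name of the property file to load from the cache working directory.
    private String propertyFileLink;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Path of the file this mapper's split comes from.
        String splitPath = ((FileSplit) context.getInputSplit()).getPath().toString();

        String inputA = context.getConfiguration().get("PROPS_A"); // "/inputA"
        String inputB = context.getConfiguration().get("PROPS_B"); // "/inputB"

        if (inputA != null && splitPath.contains(inputA)) {
            propertyFileLink = "PROPS_A";   // symlink created by /propfile_1#PROPS_A
        } else if (inputB != null && splitPath.contains(inputB)) {
            propertyFileLink = "PROPS_B";   // symlink created by /propfile_2#PROPS_B
        }
        // Load the filters from new File(propertyFileLink) here, so map() stays
        // fully generic as in the original mapper.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... generic filtering driven by the chosen property file ...
    }
}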

Hadoop's passage of parameter

I know that a Writable object can be passed to a mapper using something like:
DefaultStringifier.store(conf, object, "key");
object = DefaultStringifier.load(conf, "key", Class);
My question is:
In a mapper I read the object and then change its value,
for example: object = another.
How can I make sure that this change to the object's value
will be visible to the next mapper task?
Is there any better way to pass parameters to a mapper?
Use the file system instead. Write the value to HDFS and replace the file when the content changes. Neither the configuration nor the DistributedCache is appropriate for mutable state.
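As a rough illustration of that approach (the path /state/current-value and the helper class name are made up for the example), a small helper that overwrites and re-reads a single state file on HDFS might look like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsState {

    private static final Path STATE = new Path("/state/current-value");

    public static void write(Configuration conf, String value) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Overwrite the previous state file with the new content.
        try (FSDataOutputStream out = fs.create(STATE, true)) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static String read(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Whatever the last writer stored is what every reader sees.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(STATE), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }
}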

How do we pass objects of some custom class as a parameter to mapper in mapReduce programs??

JobConf has 'set' methods for boolean, string, int and long. What if I want to pass a Document object as a parameter to my mapper? Can anyone help me out?
I once gave a tip to someone who wanted to pass a whole map to a mapper:
Hadoop: How to save Map object in configuration
The idea is the same: serialize your object to a string and put it into the configuration. JSON works very well because the configuration itself is serialized as XML, so there is no problem deserializing it.
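For illustration, a minimal sketch of that JSON route (Gson is an assumed dependency here; any JSON library would do, and the helper class name is made up):

import java.lang.reflect.Type;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

import org.apache.hadoop.conf.Configuration;

public class MapConfUtil {

    private static final Type MAP_TYPE = new TypeToken<Map<String, String>>() { }.getType();

    // Driver side: store the map as a JSON string in the configuration.
    public static void store(Configuration conf, String key, Map<String, String> map) {
        conf.set(key, new Gson().toJson(map));
    }

    // Mapper side (e.g. in setup()): rebuild the map from the JSON string.
    public static Map<String, String> load(Configuration conf, String key) {
        return new Gson().fromJson(conf.get(key), MAP_TYPE);
    }
}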
If your object implements Writable, you can serialize it to a byte array, base64-encode the byte array, and save the resulting string to the configuration. To decode, do the opposite.
Of course, I wouldn't recommend this if your object has a very large footprint; in that case you're better off serializing it to a file in HDFS and using the distributed cache.
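A hedged sketch of that base64 approach (the helper class name is made up; Base64 here is the commons-codec class that ships with Hadoop):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class WritableConfUtil {

    // Serialize the Writable to bytes, base64-encode, and store in the configuration.
    public static void store(Configuration conf, String key, Writable value)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        value.write(new DataOutputStream(bytes));
        conf.set(key, Base64.encodeBase64String(bytes.toByteArray()));
    }

    // Decode the base64 string and rebuild the Writable into the supplied instance.
    public static <T extends Writable> T load(Configuration conf, String key, T instance)
            throws IOException {
        byte[] bytes = Base64.decodeBase64(conf.get(key));
        instance.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return instance;
    }
}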

Does hive instantiate a new UDF object for each record?

Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.
In this case I want to ensure that I'm not replicating work more than I need to be, in that I don't want to re-load the static data on every call to the evaluate() method.
Clearly each mapper uses its own instantiation of the UDF, but does a new instance get generated for each record processed?
For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record, and call evaluate only once per instance?
If the second example is true, in what alternate way should I structure this?
Couldn't find this anywhere in the docs, I'm going to look through the code, but figured I'd ask the smart people here at the same time.
Still not totally sure about this, but I got around it by having a static lazy value that loads the data on first use.
This way you have one instance of the static value per mapper. So if you are reading in a dataset and you have 6 map tasks, you will read the data in 6 times. Not ideal, but better than once per record.
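A rough sketch of that workaround using the simple UDF API (the file name lookup.tsv and the loading logic are hypothetical): the static field is populated once per JVM, not once per row.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StaticLookupUDF extends UDF {

    // Shared by all evaluate() calls made inside this mapper's JVM.
    private static Map<String, String> lookup;

    private static synchronized Map<String, String> getLookup() throws IOException {
        if (lookup == null) {
            lookup = new HashMap<String, String>();
            // Load the static data exactly once, lazily, on first use.
            try (BufferedReader in = new BufferedReader(new FileReader("lookup.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        return lookup;
    }

    public Text evaluate(Text key) {
        if (key == null) {
            return null;
        }
        try {
            String value = getLookup().get(key.toString());
            return value == null ? null : new Text(value);
        } catch (IOException e) {
            throw new RuntimeException("failed to load lookup data", e);
        }
    }
}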

hadoop CustomWritables

I have more of a design question regarding the necessity of a CustomWritable for my use case:
So I have a document pair that I will process through a pipeline and write out intermediate and final data to HDFS. My key will be something like ObjectId - DocId - Pair - Lang. I do not see why/if I will need a CustomWritable for this use case. I guess if I did not have a key, I would need a CustomWritable? Also, when I write data out to HDFS in the Reducer, I use a Custom Partitioner. So, that would kind of eliminate my need for a Custom Writable?
I am not sure if I got the concept of the need for a Custom Writable right. Can someone point me in the right direction?
Writables can be used for de/serializing objects. For example, a log entry can contain a timestamp, a user IP and the browser agent. So you should implement your own WritableComparable for a key that identifies this entry, and a value class that implements Writable and reads and writes the attributes in your log entry.
These serializations are just a handy way to get data from a binary format into an object. Some frameworks, like HBase, still require byte arrays to persist the data. Handling that conversion yourself adds a lot of overhead and clutters your code.
Thomas' answer explains a bit. It's way too late, but I'd like to add the following for prospective readers:
The Partitioner only comes into play between the map and reduce phases; it has no role in writing from the reducer to the output files.
I don't believe writing INTERMEDIATE data to HDFS is a requirement in most cases, although there are some hacks that can be applied to do the same.
When you write from a reducer to HDFS, each reducer writes to ONE SEPARATE file and the keys are automatically sorted according to their compareTo method. So if you want to sort on multiple fields, go for a custom key class that implements WritableComparable and provides the write, readFields and compareTo methods; the compareTo implementation then controls how the keys are sorted (see the sketch below).
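For illustration, a minimal sketch of such a composite key along the lines of ObjectId - DocId - Lang (the class and field names are made up; a real key would carry whatever fields drive your sort):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class DocPairKey implements WritableComparable<DocPairKey> {

    private String objectId = "";
    private String docId = "";
    private String lang = "";

    public DocPairKey() { }                       // required no-arg constructor

    public DocPairKey(String objectId, String docId, String lang) {
        this.objectId = objectId;
        this.docId = docId;
        this.lang = lang;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(objectId);
        out.writeUTF(docId);
        out.writeUTF(lang);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        objectId = in.readUTF();
        docId = in.readUTF();
        lang = in.readUTF();
    }

    @Override
    public int compareTo(DocPairKey other) {
        // Sort by objectId, then docId, then lang.
        int c = objectId.compareTo(other.objectId);
        if (c == 0) c = docId.compareTo(other.docId);
        if (c == 0) c = lang.compareTo(other.lang);
        return c;
    }

    @Override
    public int hashCode() {                       // used by the default HashPartitioner
        return (objectId + "\u0001" + docId + "\u0001" + lang).hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DocPairKey)) return false;
        DocPairKey k = (DocPairKey) o;
        return objectId.equals(k.objectId) && docId.equals(k.docId) && lang.equals(k.lang);
    }
}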
