I know that a Writable object can be passed to a mapper using something like:
DefaultStringifier.store(conf, object, "key");
object = DefaultStringifier.load(conf, "key", ObjectType.class);
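For context, a complete round trip might look like this (a sketch; the "my.param" key and the Text type are just for illustration):

    // in the driver: serialize the object into the configuration
    Text param = new Text("some value");
    DefaultStringifier.store(conf, param, "my.param");

    // in Mapper.setup(): read it back
    Text param = DefaultStringifier.load(context.getConfiguration(), "my.param", Text.class);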
My question is:
In a mapper I read the object and then change its value, for example object = another.
How can I make sure that this change is visible to the next mapper task?
Is there a better way to pass parameters to a mapper?
Use the file system instead. Write the value to a file in HDFS, and replace the file whenever the content changes. Neither the configuration nor the DistributedCache is appropriate for mutable state.
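The write side is an ordinary HDFS write from wherever you update the value; the read side goes in the mapper's setup(). A minimal sketch, assuming the new mapreduce API and a made-up path /params/current:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParamReadingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String param;

        @Override
        protected void setup(Context context) throws IOException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path paramFile = new Path("/params/current"); // hypothetical location
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(paramFile)))) {
                param = reader.readLine(); // every task re-reads the latest value
            }
        }
    }

Since every task re-reads the file in setup(), replacing the file's content is enough to make the new value visible to subsequent tasks.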
I have one single mapper, say SingleGroupIdentifierMapper.java.
This is a generic mapper which does all the filtering on a single line of mapper input (a value/record), based on a property file (containing filters and key/value field indexes) passed to it from the driver class via the distributed cache.
Only the reducer business logic differs; the mapper logic has been kept generic and is driven by the property file mentioned above.
My problem is that I now have input from multiple sources with different formats, which means I have to do something like:
MultipleInputs.addInputPath(conf, new Path("/inputA"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
But the cached property file which I pass from the driver class to the mapper for filtering on field indexes is common to both inputs. So how can I pass two different property files to the same mapper, so that when it processes input A it uses PropertyFileA (to filter and create key/value pairs), and when it processes input B it uses PropertyFileB?
It is possible to change the generic code of the mapper to handle this scenario, but how should this be approached in the generic class, and how can the same mapper class identify whether the input comes from inputA or inputB and apply the corresponding property file to the data?
Can we pass arguments to the constructor of the mapper class to specify whether the input is from inputB, or which property file in the cache it needs to read?
E.g. something like:
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, args[], SingleGroupIdentifierMapper.class);
where args[] is passed to the SingleGroupIdentifierMapper constructor, which we define to accept it and store it as an attribute.
Any thoughts or expertise are most welcome.
I hope I was able to express my problem clearly; please ask if anything in the question needs more clarity.
Thanks in Advance,
Cheers :)
Unfortunately MultipleInputs is not that flexible. But there is a workaround: match InputSplit paths to the property files in the setup method of the Mapper. If you are not using any sort of Combine*Format, then a single mapper will process a single split from a single file:
When adding the property files to the distributed cache, use /propfile_1#PROPS_A and /propfile_2#PROPS_B
Add the input paths to the configuration: job.getConfiguration().set("PROPS_A", "/inputA") and job.getConfiguration().set("PROPS_B", "/inputB")
In the Mapper.setup(Context context) method, use context.getInputSplit().toString() to get the path of the split. Then match it against the paths saved under PROPS_A and PROPS_B in the configuration (see the sketch below).
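Putting the three steps together, a minimal sketch of the setup method (assuming the new mapreduce API and that the split's toString() contains the file path; the key/value types here are illustrative):

    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SingleGroupIdentifierMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Properties filters = new Properties();

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            String split = context.getInputSplit().toString(); // contains the file path
            String localProps;
            if (split.contains(conf.get("PROPS_A"))) {
                localProps = "PROPS_A"; // symlink created by /propfile_1#PROPS_A
            } else if (split.contains(conf.get("PROPS_B"))) {
                localProps = "PROPS_B"; // symlink created by /propfile_2#PROPS_B
            } else {
                throw new IOException("Unexpected input split: " + split);
            }
            filters.load(new FileReader(localProps)); // load the matching filter config
        }
    }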
If you are using some Combine*Format, then you would need to extend it and override getSplits so that it uses information from the JobContext to build a PathFilter[] and calls createPool, which will create splits containing files from the same group (inputA or inputB).
How can I give all of my mappers access to one variable, for example a TreeMap object, without having each mapper re-construct the TreeMap every time? The object will never be modified again once it is constructed.
Consider putting the contents of the TreeMap object in the Distributed Cache. If it is a small amount of data you can place the object contents in your configuration object:
conf.set("key", "value");
then use the JobConf object to access it in your mapper.
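For example, a sketch using the old mapred API the answer refers to (the "shared.treemap" key and the k=v encoding are made up; the driver would do conf.set("shared.treemap", "a=1,b=2,c=3")):

    import java.io.IOException;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TreeMapMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        // static: built once per JVM, so it is re-used across tasks when JVM reuse is on
        private static TreeMap<String, String> shared;

        @Override
        public void configure(JobConf job) {
            if (shared == null) {
                shared = new TreeMap<>();
                for (String entry : job.get("shared.treemap").split(",")) {
                    String[] kv = entry.split("=", 2);
                    shared.put(kv[0], kv[1]);
                }
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // look values up in 'shared' here
        }
    }

Without JVM reuse each task still parses the map once in configure(), which is cheap for configuration-sized data.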
How do we pass objects of some custom class as a parameter to a mapper in MapReduce programs?
JobConf has 'set' methods for boolean, string, int and long. What if I want to pass a Document object as a parameter to my mapper? Can anyone help me out?
I once gave a tip to someone who wanted to pass a whole map to a mapper:
Hadoop: How to save Map object in configuration
The idea is the same: you have to serialize your object into a string and put it into the configuration. JSON works very well, because the configuration is serialized as XML, so there is no problem when deserializing.
If your object implements Writable, you can serialize it to a byte array, Base64-encode the byte array, and then save the resultant string to the configuration. To decode, do the opposite.
Of course, I wouldn't recommend this if your object has a very large footprint; in that case you're better off serializing it to a file in HDFS and using the distributed cache.
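A sketch of that round trip (the helper class is made up; note that DefaultStringifier in org.apache.hadoop.io automates much the same thing):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Base64;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class WritableConfigUtil {
        // serialize the Writable, Base64-encode the bytes and store them under the key
        public static void store(Configuration conf, Writable obj, String key)
                throws IOException {
            DataOutputBuffer out = new DataOutputBuffer();
            obj.write(out);
            byte[] bytes = Arrays.copyOf(out.getData(), out.getLength());
            conf.set(key, Base64.getEncoder().encodeToString(bytes));
        }

        // do the opposite: decode the string and rebuild the object
        public static <T extends Writable> T load(Configuration conf, String key, Class<T> cls)
                throws IOException {
            byte[] bytes = Base64.getDecoder().decode(conf.get(key));
            T obj = ReflectionUtils.newInstance(cls, conf);
            DataInputBuffer in = new DataInputBuffer();
            in.reset(bytes, bytes.length);
            obj.readFields(in);
            return obj;
        }
    }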
To use counters I need access to a Reporter object.
The Reporter object is passed as parameter to map() and reduce(), hence I can do:
reporter.incrCounter(NUM_RECORDS, 1);
But I need to use counters inside the MultipleOutputFormat class (I am using the method generateFileNameForKeyValue).
Question: how to access the Reporter object inside the MultipleOutputFormat class?
You could create your own MultipleOutputFormat class, MyMultipleOutputFormat (which it sounds like you are doing), and add a function that takes a Reporter (as well as the other parameters) and then calls the base generateFileNameForKeyValue.
Alternatively, if you can access the context from where you need to record it, you can increment the counter directly: context.getCounter(YOUR_COUNTER.HERE).increment(1);
I don't know your exact situation, but attempting to use a counter inside a function that should be acting on/for a single record seems unnecessary and likely could be done 'outside' where access to the Reporter/Context is easy. I could be wrong and your situation/use of the counter is needed there, but I'd suggest checking if you really need it inside that function, or if it could be done outside as well.
Edit: to respond to a couple of points that were unclear:
Creating a function that takes in a Reporter: since you are extending MultipleOutputFormat, you can add additional functions. If you add a function with the definition generateFileNameForKeyValueAndTrack(K key, V value, String name, Reporter reporter), you can increment the counter in that function and have it call generateFileNameForKeyValue, passing along key, value and name.
Using a counter inside seems unneeded: I'm assuming you are calling generateFileNameForKeyValue inside the map function (replace map with whatever function if that assumption is wrong). Create a collection (any type, as long as it can do what I describe) in which you store the generated file names. Every time a file name is generated you can check whether it already exists in the collection and increment the appropriate counter.
I can see the appeal of doing it inside the generate... function to avoid duplicating data, so I'd (off the top of my head) probably go with creating the additional function (specified above).
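A minimal sketch of that additional function (the subclass and counter names are made up; this assumes the old mapred API, where MultipleOutputFormat lives):

    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class MyMultipleOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {
        public enum FileCounters { NAMES_GENERATED } // hypothetical counter

        // wrapper that counts, then delegates to the base implementation
        public String generateFileNameForKeyValueAndTrack(K key, V value, String name,
                                                          Reporter reporter) {
            reporter.incrCounter(FileCounters.NAMES_GENERATED, 1);
            return generateFileNameForKeyValue(key, value, name);
        }
    }

The de-duplicating variant from the second point would instead keep a Set<String> of generated names and only increment the counter when add() returns true.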
I hope that helps clarify what I was suggesting.
To keep communication flowing properly (and me being notified), if you have comments/questions relating to this post, please add a comment to this post instead of adding an answer.
I was curious if anyone had insight on the best way for an object to load data from a file in Ruby. Is there a convention? There are two ways I can think of accomplishing this:
Have the initialize method accept a path or file and parse the data within the initialize method, setting the object variables as well.
Have the main "runner" code open the file and parse it, then pass the correct arguments to your constructor.
I am also aware that I could support both methods through an options hash or *args and looking at its size, but I do not have any need to implement both.
I would use the second option, combined with providing the path info as an argument to the main code. This makes it more portable and keeps the object decoupled from the source of the data.