I am working on a simple MapReduce program. I want to create a different output file after the reducer for each distinct word in the key. For example, after executing the MapReduce job I have something like
Priority1 x 2
Priority1 y 2
Priority1 z 2
priority2 x 2
priority2 y 2
Now I want separate files after the reduce phase, named Priority1 and Priority2, each containing the values for that priority. I am using Java and want to know what should be written in the reducer to get this kind of output.
I just want to know whether this is even possible and, if it is, how to approach or solve it.
I am using Hadoop 0.20.203, so MultipleOutputs doesn't work.
Any pointers will be helpful.
Thanks for the help!
Atul
You need to create a Partitioner class first that partitions based on your criteria.
You then need to create your own OutputFormat class and a RecordWriter class.
The RecordWriter needs to write to different files as per your needs. Further, if you need to sort your values, create a Comparator class for your key field.
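Since each reducer writes its own part file, a simpler route on 0.20.x is to partition by the priority prefix and run one reduce task per priority. A minimal sketch of the partitioning logic in plain Java (the Hadoop Partitioner wrapper is omitted so the snippet stays self-contained, and priorityOf is a hypothetical helper that extracts the prefix from your key):

```java
import java.util.Arrays;

public class PriorityPartitionDemo {

    // Hypothetical helper: assumes the priority is the first
    // whitespace-delimited token of the key, e.g. "Priority1 x" -> "Priority1".
    static String priorityOf(String key) {
        return key.split("\\s+")[0];
    }

    // Same contract as Partitioner.getPartition: map a key to [0, numPartitions).
    // Keys with the same priority always land in the same partition.
    static int getPartition(String key, int numPartitions) {
        return (priorityOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 2; // one reducer (and hence one part file) per priority
        for (String key : Arrays.asList("Priority1 x", "Priority1 y", "priority2 x")) {
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

With one reducer per priority, each part-r-NNNNN file holds a single priority's records, and you can rename them afterwards. Note that hashing can put two priorities into the same partition; for a small, known set of priorities you may prefer to map each priority name to a fixed partition index instead.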
Have a look at MultipleOutputs.
In a Hadoop Cascading flow, I have a number of tuples which are processed and finally sunk into a destination.
Now my requirement is to sink those tuples into the destination file with certain defined constant String values at the beginning and at the end.
For example, I have the following input tuples:
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Now I need output like this:
Certain data before those data
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Certain data after those data
I have searched a little through the repository class DelimitedParser and its methods like joinLine and joinFirstLine, but due to poor documentation I am unable to work out exactly what they do.
It may depend on what "Certain data before those data" means.
If you are using TextDelimited, you can write the header values in the sink. By default, header values are not written (as per the documentation), so you will need to enable it. Another thing to remember is that the header values represent the output fields.
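If the trailing constant line is not covered by header support, another option is a small post-processing step that wraps the sunk file. A plain-Java sketch of the desired effect (the wrap method and the header/footer strings are illustrative, not Cascading API):

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderFooterDemo {

    // Wrap the sunk tuple lines with constant header and footer lines.
    static List<String> wrap(List<String> tupleLines, String header, String footer) {
        List<String> out = new ArrayList<>();
        out.add(header);          // constant line before the data
        out.addAll(tupleLines);   // the tuples, unchanged
        out.add(footer);          // constant line after the data
        return out;
    }

    public static void main(String[] args) {
        List<String> tuples = List.of(
                "10|11|12|13|14|15|16|17|18|19|20",
                "1|2|3|4|5|6|7|8|9|10");
        wrap(tuples, "Certain data before those data", "Certain data after those data")
                .forEach(System.out::println);
    }
}
```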
-Amit
I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of all the files present in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same hash would go to the same reducer.
The default for the mapper in Hadoop (with TextInputFormat) is that the key is the byte offset of the line and the value is the line itself.
Also I read that if a file is big, it is split into chunks of 64 MB, which is the default block size in Hadoop.
How can I set the key to be the name of the file, so that in my mapper I can compute the hash of the whole file? Also, how can I ensure that no two nodes compute the hash for the same file?
If you need the entire file as input to one mapper, then you need to make isSplitable return false. In that scenario you can take the whole file as input to the mapper, apply MD5 to it, and emit that as the key.
WholeFileInputFormat (not part of the Hadoop code base) can be used here. You can find the implementation online, or it is available in the book Hadoop: The Definitive Guide.
The value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to a FileSplit; then fileSplit.getPath().getName() yields the file name, which can be emitted as the value.
I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils, but the Javadocs say it might work well for you.
Textbook source links for WholeFileInputFormat and the associated RecordReader are included for reference:
1) WholeFileInputFormat
2) WholeFileRecordReader
Also including the grepcode link to MD5FileUtils
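Whichever input format delivers the file bytes, the MD5 key itself can be computed with the JDK's MessageDigest. A self-contained sketch (in the real job, fileBytes would come from the mapper's value rather than a literal):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5KeyDemo {

    // Compute the MD5 digest of the file's bytes as a lowercase hex string,
    // suitable for use as the reducer key.
    static String md5Hex(byte[] fileBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(fileBytes);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm in every JDK, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Identical contents hash to the same key, so duplicate files
        // meet at the same reducer.
        System.out.println(md5Hex("hello".getBytes()));
    }
}
```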
My map function emits two different kinds of key-value pairs for the same data. Naturally I would need two independent reduce functions to handle this. Is it possible?
Like, can I have multiple output.collect() statements at the end of map with an additional parameter specifying the reducer?
I tried looking it up but couldn't find anything.
You should consider using the MultipleOutputs class. It has nice, self-explanatory documentation.
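A job can only run a single reducer class, so a common alternative (MultipleOutputs handles the output side, not the reduce logic) is to tag each emitted key with which "logical reducer" it belongs to and branch on the tag inside reduce(). A plain-Java sketch of that tagging convention (Hadoop types omitted; the "A"/"B" tags and the helper names are illustrative):

```java
public class TaggedKeyDemo {

    // In map(): prefix the key so both kinds of pairs flow through one shuffle.
    static String tag(String logicalReducer, String key) {
        return logicalReducer + ":" + key;
    }

    // In reduce(): strip the tag and dispatch to the right handling branch.
    static String dispatch(String taggedKey) {
        String[] parts = taggedKey.split(":", 2);
        return parts[0].equals("A")
                ? "handled by logic A: " + parts[1]
                : "handled by logic B: " + parts[1];
    }

    public static void main(String[] args) {
        System.out.println(dispatch(tag("A", "word1")));
        System.out.println(dispatch(tag("B", "word2")));
    }
}
```

Combined with MultipleOutputs, each branch can then write to its own named output file.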
In my case I do not need a reduce function, so I am assuming that the map function should not have to worry about choosing and splitting the input text file into key-value pairs.
Yes, it does. Mappers always output key-value pairs. If you don't want to use a reducer, you can write the map output directly to the file system (by setting the number of reduce tasks to zero), or you can use an identity reducer. If you're not interested in what the key is, you can just assign some default key. If you can share some more details about what you're trying to do, we can probably help you out better.
I have found several tutorials on how to create my own non-distributed recommender, but none on how to create my own distributed recommender job (any link is welcome if you know one).
In the book "Mahout in Action" there are some examples of how to write mappers/reducers using Mahout's objects, but it does not seem to show how to put these jobs together.
However, there is item/RecommenderJob in mahout-core, which gives an idea of how this can be done. My actual intent is to replace the first mapper so that I don't have to prepare my data outside of Mahout (my lines look like "userid,itemid1,itemid2,itemid3...", and using item.RecommenderJob I obviously need lines like "itemid1,itemid2", "itemid1,itemid3", ...).
Now would it be a good idea to just copy over the RecommenderJob class and change what I need?
I have tried it, but since this class uses variables that have package scope (e.g. UserVectorSplitterMapper.USERS_FILE), I have to replace these, which does not feel right.
Should I rather create a new class extending AbstractJob and pick out the things I need from RecommenderJob? Then what are the elements in RecommenderJob that I really need?
Your alternatives are to precede the job with your own job that translates your input into the form the job wants, or, indeed, to just modify the job. I don't think it's a big deal to copy, modify, and customize the job if you need non-trivial changes that aren't (and wouldn't make sense to be) supported as some kind of config parameter.
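The "precede with your own job" option amounts to a tiny translation step. Assuming the usual userID,itemID input format that RecommenderJob consumes, a plain-Java sketch of the per-line expansion (in practice this would live in a mapper; the expand helper is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class TranslateInputDemo {

    // Expand "userid,itemid1,itemid2,..." into one "userid,itemid" line per item.
    static List<String> expand(String line) {
        String[] fields = line.split(",");
        List<String> out = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            out.add(fields[0] + "," + fields[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        expand("u1,i1,i2,i3").forEach(System.out::println);
    }
}
```

Writing this as its own preliminary MapReduce job keeps RecommenderJob itself untouched, which avoids the package-scope problem entirely.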