I want to implement deduplication of files using Hadoop Mapreduce. I plan to do it by calculating MD5 sum of all the files present in the input directory in my mapper function. These MD5 hash would be the key to the reducer, so files with the same hash would go to the same reducer.
The default for the mapper in Hadoop is that the key is the line number and the value is the content of the file.
Also I read that if the file is big, then it is split into chunks of 64 MB, which is the maximum block size in Hadoop.
How can I set the key values to be the names of the files, so that in my mapper I can compute the hash of the file ? Also how to ensure that no two nodes will compute the hash for the same file?
If you would need to have the entire file as input to one mapper, then you need to keep the isSplitable false. In this scenario you could take in the whole file as input to the mapper and apply your MD5 on the same and emit it as the key.
WholeFileInputFormat (not a part of the hadoop code) can be used here. You can get the implementation online or its available in the Hadoop: The Definitive Guide book.
Value can be the file name. Calling getInputSplit() on Context instance would give you the input splits which can be cast as filesplits. Then fileSplit.getPath().getName() would yield you the file name. This would give you the filename, which could be emitted as the value.
I have not worked on this - org.apache.hadoop.hdfs.util.MD5FileUtils, but the javadocs says that this might be what works good for you.
Textbook src link for WholeFileInputFormat and associated RecordReader have been included for reference
1) WholeFileInputFormat
2) WholeFileRecordReader
Also including the grepcode link to MD5FileUtils
Related
In Hadoop Cascading Flow, i have a number of tuples which is processed and finally i have sunk into a destination.
Now my requirement is: To sink that tuples in destination file with certain defined constant String values at beginning and at the end.
For example: I have following input tuples
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Now i need to have like this output:
Certain data before those data
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Certain data after those data
Little bit i have searched of repository class DelimitedParser and its methods like joinLine, joinFirstLine but due to poor documentation i am unable to get exact point of it.
It may depend on what "Certain data before those data" means ?
If you are using TextDelimited, then you can dump the header values in the sink. By default header values are not written as per the documentation hence you will need to enable it. Another thing to remember is that the header values represents the output fields.
-Amit
I need to move complicated value (implements Writable) from output of 1st map-reduce job to input of other map-reduce job. Results of 1st job saved to file. File can store Text data or BytesWritable (with default output \ input formats). So I need some simple way to convert my Writable to Text or To BytesWritable and from it. Does it exists? Any alternative way to do this?
Thanks a lot
User irW is correct, use SequenceFileOutputFormat. SequenceFile solves this exact problem, without converting to Text Writable. When setting up your job, use job.setOutputKeyClass and job.setOutputValueClass to set the Writable subclasses you are using:
job.setOutputKeyClass(MyWritable1.class);
job.setOutputValueClass(MyWritable2.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
This will use the Hadoop SequenceFile format to store your Writables. Then in your next job, use SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
Then the input key and value for the mapper in this job will be the two Writable classes you originally specified as output in the previous job.
Note, it is crucial that your complex Writable subclass is implemented correctly. Beyond the fact that you must have an empty constructor, the write and readFields methods must be implemented such that any Writable fields in the class also write and read their information.
In my case I do not need a reduce function. So I am assuming that the map function should not worry about choosing and splitting the input text file to some key-value pair.
Yes it does. Mappers always output a key,value pair. If you don't want to use a reducer, you can write Map output directly to file system, or you can use an Identity Reducer. If you're not Interested in what the key is, you can maybe just assign some default key. If you can share some more details about what you're trying to do, we can probably help you out better.
Summary: can I specify some action to be executed on each output file after it's written with hadoop streaming?
Basically, this is follow-up to Easiest efficient way to zip output of hadoop mapreduce question. I want for each key X its value written to X.txt file, compressed into X.zip archive. But when we write zip output stream, it's hard to tell something about a key or a name of resulting file, so we end up with X.zip archive containing default-name.txt.
It'd be very simple operation to rename archive contents, but where can I place it? What I don't want to do is download all zips from S3 and upload them back then.
Consider using a custom MultipleOutputFormat:
Basic use cases:
This class is used for a map reduce job with at least one reducer. The reducer wants to write data to different files depending on the actual keys.
It is assumed that a key (or value) encodes the actual key (value) and the desired location for the actual key (value).
This class is used for a map only job. The job wants to use an output file name that is either a part of the input file name of the input data, or some derivation of it.
This class is used for a map only job. The job wants to use an output file name that depends on both the keys and the input file name
You may also control which key goes to which reducer (Partitioner)
I have two files with different data formats in HDFS. How would a job set up look like, if I needed to reduce across both data files?
e.g. imagine the common word count problem, where in one file you have space as the world delimiter and in another file the underscore. In my approach I need different mappers for the various file formats, that than feed into a common reducer.
How to do that?
Or is there a better solution than mine?
Check out the MultipleInputs class that solves this exact problem. It's pretty neat-- you pass in the InputFormat and optionally the Mapper class.
If you are looking for code examples on google, search for "Reduce-side join", which is where this method is typically used.
On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space delimited and the other that is underscore delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Count the number of tokens from the two results set. In the word count example, pick the one with more tokens.
This also works if both files are the same delimiter but have a different number of standard columns. You can tokenize on comma then see how many tokens there are. If it is say 5 tokens it is from data set A, if it is 7 tokens it is from data set B.