Save and read complicated Writable value in Hadoop job

I need to pass a complex value (implementing Writable) from the output of a first map-reduce job to the input of another map-reduce job. The results of the first job are saved to a file. The file can store Text data or BytesWritable (with the default output / input formats). So I need some simple way to convert my Writable to Text or BytesWritable and back. Does it exist? Is there any alternative way to do this?
Thanks a lot

User irW is correct: use SequenceFileOutputFormat. SequenceFile solves exactly this problem, without converting anything to Text or BytesWritable. When setting up your job, use job.setOutputKeyClass and job.setOutputValueClass to set the Writable subclasses you are using:
job.setOutputKeyClass(MyWritable1.class);
job.setOutputValueClass(MyWritable2.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
This will use the Hadoop SequenceFile format to store your Writables. Then in your next job, use SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
Then the input key and value for the mapper in this job will be the two Writable classes you originally specified as output in the previous job.
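For example (a sketch only; the Text/IntWritable output types are placeholders for whatever the second job itself emits), the mapper of the second job would be declared against those same classes:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: the input types mirror what the previous job wrote to the SequenceFile.
public class SecondJobMapper extends Mapper<MyWritable1, MyWritable2, Text, IntWritable> {
    @Override
    protected void map(MyWritable1 key, MyWritable2 value, Context context)
            throws IOException, InterruptedException {
        // key and value arrive here fully deserialized; no Text/BytesWritable conversion is needed.
    }
}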
Note that it is crucial that your complex Writable subclass is implemented correctly. Besides having an empty constructor, the write and readFields methods must be implemented such that any Writable fields in the class also write and read their own data.
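A minimal sketch of what such a composite Writable might look like (the "name" and "count" fields are invented for illustration); the important part is that write and readFields handle the nested Writables in the same fixed order:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical composite Writable with two nested Writable fields.
public class MyWritable1 implements Writable {
    private final Text name = new Text();
    private final IntWritable count = new IntWritable();

    // The empty constructor is required so Hadoop can instantiate the class by reflection.
    public MyWritable1() {
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Delegate to the nested Writables, always in the same order.
        name.write(out);
        count.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields back in exactly the order they were written.
        name.readFields(in);
        count.readFields(in);
    }
}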

Related

Mapreduce XML input format - to build custom format

If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record is on its own line of the input file and the Mapper class is called for each line to get a key/value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
I am new to Hadoop MapReduce; is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there is no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you want to read or write XML, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
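As a rough sketch of how such an XmlInputFormat is typically wired up (the package name and the xmlinput.start / xmlinput.end property names are assumptions that vary across Mahout versions, and <record> is a hypothetical element):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// The package has moved between Mahout releases; adjust the import to the copy you use.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property names for the record delimiters; check the XmlInputFormat source you ship.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = Job.getInstance(conf, "xml-input-example");
        job.setInputFormatClass(XmlInputFormat.class);
        // Each map() call then receives one <record>...</record> block as a Text value,
        // keyed by a LongWritable file offset.
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        // ... set the mapper, reducer, input/output paths and output format as usual ...
    }
}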

Multiple outputformats in single map reduce

I have a mapreduce job which reads a text file and creates a Parquet file from it, and at the same time writes a plain text file as output. I have used a multiple output format for that, but the multiple output format object can only be initialized for writing either the Parquet file or the text file at a time. I need to accommodate both in a single mapper. Any help is highly appreciated.
Not sure it's the best way, but you can just initialize a StringBuilder in your mapper's setup() method, append all text values to it during the map() method, and then write it to disk in the cleanup() method. Whether this works depends on the size of your text output and on whether you have enough memory. That way the text file doesn't need to be a mapper output at all, and your mapper output can be the Parquet data only.
You could use context.getInputSplit() or something similar to derive the text output file names, so that each mapper writes a unique file and you know which output corresponds to which input.
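A rough sketch of that buffering approach (the mapper's output types and the /tmp/text-side-output path are placeholders; adapt them to whatever your Parquet write path expects):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: the NullWritable/Text output types stand in for whatever your Parquet output format expects.
public class TextAndParquetMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private StringBuilder textBuffer;

    @Override
    protected void setup(Context context) {
        textBuffer = new StringBuilder();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Buffer the plain-text output in memory...
        textBuffer.append(value.toString()).append('\n');
        // ...and emit only the Parquet-bound record through the normal output path.
        context.write(NullWritable.get(), value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Name the side file after the input split so every mapper writes a unique file.
        String splitName = ((FileSplit) context.getInputSplit()).getPath().getName();
        Path sideFile = new Path("/tmp/text-side-output/" + splitName + ".txt");
        FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
        try (FSDataOutputStream out = fs.create(sideFile, true)) {
            out.write(textBuffer.toString().getBytes(StandardCharsets.UTF_8));
        }
    }
}
Keep in mind that everything buffered in the StringBuilder has to fit in the mapper's heap, which is the trade-off mentioned above.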

Deciding key value pair for deduplication using hadoop mapreduce

I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of all the files present in the input directory in my mapper function. These MD5 hashes would be the keys sent to the reducer, so files with the same hash would go to the same reducer.
The default for the mapper in Hadoop (TextInputFormat) is that the key is the byte offset of the line and the value is the line itself.
I also read that if a file is big, it is split into chunks of 64 MB, which is the default block size in Hadoop.
How can I set the keys to be the names of the files, so that in my mapper I can compute the hash of the whole file? Also, how can I ensure that no two nodes compute the hash for the same file?
If you need the entire file to be the input to one mapper, then you need an input format whose isSplitable() returns false. In this scenario you can take the whole file as input to the mapper, compute your MD5 on it, and emit that as the key.
WholeFileInputFormat (not a part of the Hadoop code base) can be used here. You can find the implementation online, or it is available in the book Hadoop: The Definitive Guide.
The value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to a FileSplit; then fileSplit.getPath().getName() yields the file name, which you can emit as the value (see the sketch after the links below).
I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils, but the javadocs say it might be what works well for you.
Textbook source links for WholeFileInputFormat and the associated RecordReader are included for reference:
1) WholeFileInputFormat
2) WholeFileRecordReader
Also including the grepcode link to MD5FileUtils
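Putting those pieces together, a mapper for this setup might look roughly like the following (it assumes the book's WholeFileInputFormat, which delivers a NullWritable key and the whole file as a BytesWritable value):
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: emits (MD5 hex of file contents, file name) pairs for deduplication.
public class DedupMapper extends Mapper<NullWritable, BytesWritable, Text, Text> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        try {
            // MD5 of the whole file becomes the key...
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(value.copyBytes());
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // ...and the file name, taken from the input split, becomes the value.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(hex.toString()), new Text(fileName));
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}
The reducer then receives all file names sharing an MD5 hash under one key, which is exactly the grouping needed for deduplication.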

Hadoop: Modify output file after it's written

Summary: can I specify some action to be executed on each output file after it's written with hadoop streaming?
Basically, this is a follow-up to the question Easiest efficient way to zip output of hadoop mapreduce. I want the value of each key X written to a file X.txt, compressed into an X.zip archive. But when we write the zip output stream, it's hard to tell anything about the key or the name of the resulting file, so we end up with an X.zip archive containing default-name.txt.
It would be a very simple operation to rename the archive contents, but where can I place it? What I don't want to do is download all the zips from S3 and then upload them back.
Consider using a custom MultipleOutputFormat:
Basic use cases:
This class is used for a map reduce job with at least one reducer. The reducer wants to write data to different files depending on the actual keys.
It is assumed that a key (or value) encodes the actual key (value) and the desired location for the actual key (value).
This class is used for a map only job. The job wants to use an output file name that is either a part of the input file name of the input data, or some derivation of it.
This class is used for a map only job. The job wants to use an output file name that depends on both the keys and the input file name
You can also control which key goes to which reducer by using a Partitioner.
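As a minimal sketch of the per-key file-naming hook (this uses the old mapred API, where MultipleOutputFormat lives; your zip-producing logic would override the same method):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: every key X is routed to its own X.txt output file; a custom
// zip-producing output format would override the same hook.
public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part-XXXXX file name; replace it with the key.
        return key.toString() + ".txt";
    }
}
With the old API you register it via jobConf.setOutputFormat(KeyNamedOutputFormat.class); with Hadoop streaming the class can be passed through the -outputformat option, provided it is on the job's classpath.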

How to convert from text file to sequence file?

I have a large .txt file of records that I need to convert into the (Hadoop) SequenceFile format for efficiency. I have found some answers to this online (such as How to convert .txt file to Hadoop's sequence file format), but I'm new to Hadoop and don't really understand them. If you could explain these a little more, or if you have another solution, that would be great. If it helps, the records are separated by line.
Thanks in advance.
Since you said you are new to Hadoop: do you know the basic idea of a Mapper and a Reducer? Both are parameterized with KEY_IN_CLASS, VALUE_IN_CLASS, KEY_OUT_CLASS and VALUE_OUT_CLASS, so in your case you can simply use a mapper to do the conversion:
for KEY_IN_CLASS, you can use the default LongWritable;
for VALUE_IN_CLASS, you need to use Text, since the Text class deals with text input.
For KEY_OUT_CLASS, you can use NullWritable; it is a null key if you don't have a specific key.
For VALUE_OUT_CLASS, you can keep Text, and set the job's output format to SequenceFileOutputFormat.
I believe that in order to use SequenceFileOutputFormat, you need to tell the job which output key class and value class you are using.
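Putting that together, a minimal map-only conversion job might look like this (a sketch only; the class names and the NullWritable key choice are illustrative, and the input/output paths come from the command line):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch of a map-only text-to-SequenceFile conversion job.
public class TextToSequenceFile {

    public static class ConvertMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Drop the byte-offset key and pass each line through unchanged.
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-to-seqfile");
        job.setJarByClass(TextToSequenceFile.class);
        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}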
