How to convert from text file to sequence file? - hadoop

I have a large .txt file of records that I need to convert into a (hadoop) sequence format for efficiency. I have found some answers to this online (such as How to convert .txt file to Hadoop's sequence file format), but I'm new to hadoop and don't really understand them. If you could explain these a little more, or if you have another solution, that'd be great. If it helps, the records are separated by line.
Thanks in advance.

Since you said you are new to hadoop: do you know the basic idea of a Mapper and a Reducer? Both of them have KEY_IN_CLASS, VALUE_IN_CLASS, KEY_OUT_CLASS and VALUE_OUT_CLASS, so in your case you can simply use a mapper to do the conversion.
For KEY_IN_CLASS, you can use the default LongWritable.
For VALUE_IN_CLASS, you need to use Text, since the Text class deals with text input.
For KEY_OUT_CLASS, you can use NullWritable; it's a null key if you don't have a specific key.
For VALUE_OUT_CLASS, keep Text; SequenceFileOutputFormat is not a value class but the output format you set on the job.
In order to use SequenceFileOutputFormat, you need to tell the job what key class and value class you use, via job.setOutputKeyClass and job.setOutputValueClass.
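Putting that together, a minimal map-only job could look like the sketch below (TextToSequenceFile and LineMapper are illustrative names, not anything from your code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {

    // Emit each input line unchanged, with a null key.
    public static class LineMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "txt-to-seq");
        job.setJarByClass(TextToSequenceFile.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                        // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with the text file as the first argument and an output directory as the second; the part files written there will be sequence files.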

Related

Mapreduce XML input format - to build custom format

If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record is on its own line of the input file and the Mapper class is called for each line to get a Key Value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Being new to Hadoop mapreduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there's no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you are going to read or write, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
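As far as I recall, Mahout's XmlInputFormat is driven by two configuration keys that mark the start and end tags of one record; a minimal driver fragment along those lines (treat the key names and the <record> tag as assumptions to verify against your Mahout version):

// import Mahout's XmlInputFormat; its package differs between Mahout versions
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<record>");   // opening tag of one logical record
conf.set("xmlinput.end", "</record>");    // closing tag of one logical record

Job job = Job.getInstance(conf, "xml-parse");
job.setInputFormatClass(XmlInputFormat.class);
// each map() call then receives the raw bytes of one <record>...</record> block as its value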

How to write a Custom Input Format

I am a newbie to Hadoop, and I have a situation where only one line out of every 4 lines of the input text is relevant. Currently I am using the default TextInputFormat and conditional logic to skip the other three lines, which are irrelevant.
How can I use a Custom Input Format to handle this? Since I am new to hadoop, I don't know much about custom input formats. Any help would be appreciated. Thanks!
I think you can use NLineInputFormat, where you can specify how many lines each mapper receives (see the sketch after the link below). This could be an easy, ready-to-use solution.
If you want to decide yourself what constitutes one record, then you would implement a custom input format and record reader.
Below is one example:
http://deep-developers.blogspot.in/2014/06/custom-input-split-and-custom.html
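A minimal sketch of the NLineInputFormat approach (assuming the relevant line is the first of each 4-line group; OneInFourMapper is an illustrative name):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only the first line of each 4-line group; the counter resets
// per mapper instance because each split holds exactly one group.
public class OneInFourMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    private int lineInSplit = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        if (lineInSplit++ % 4 == 0) {       // the relevant line
            context.write(NullWritable.get(), value);
        }
    }
}

// In the driver:
// job.setInputFormatClass(NLineInputFormat.class);
// NLineInputFormat.setNumLinesPerSplit(job, 4);   // one 4-line group per split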

How to convert hadoop sequence file to json format?

As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling has only turned up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a hadoop sequence file sitting on my local machine and would like to get the data in the corresponding json format.
So, in effect, I'm looking for some tool/utility which will take a hadoop sequence file as input and produce output in json format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
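For what it's worth, you don't even need a full MapReduce job for a local file; a minimal standalone reader in that spirit might look like this (SeqToJson is an illustrative name, and it assumes the stored key and value classes are Writables whose toString() output is what you want serialized):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SeqToJson {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        ObjectMapper json = new ObjectMapper();

        try (SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path(args[0])))) {
            // Instantiate the key/value types recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                java.util.Map<String, String> record = new java.util.LinkedHashMap<>();
                record.put("key", key.toString());
                record.put("value", value.toString());
                System.out.println(json.writeValueAsString(record));   // one JSON object per line
            }
        }
    }
}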
I thought there must be some tool which does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then again, why do so if something already exists that does just the same?
Anyway, I figured out how to do it using jaql. Here is a sample query that worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

Save and read complicated Writable value in Hadoop job

I need to move a complicated value (implementing Writable) from the output of the 1st map-reduce job to the input of another map-reduce job. The results of the 1st job are saved to a file. The file can store Text data or BytesWritable (with the default output/input formats). So I need some simple way to convert my Writable to Text or BytesWritable and back. Does one exist? Is there any alternative way to do this?
Thanks a lot
User irW is correct: use SequenceFileOutputFormat. SequenceFile solves this exact problem, without converting to Text or BytesWritable. When setting up your job, use job.setOutputKeyClass and job.setOutputValueClass to set the Writable subclasses you are using:
job.setOutputKeyClass(MyWritable1.class);
job.setOutputValueClass(MyWritable2.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
This will use the Hadoop SequenceFile format to store your Writables. Then in your next job, use SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
Then the input key and value for the mapper in this job will be the two Writable classes you originally specified as output in the previous job.
Note, it is crucial that your complex Writable subclass is implemented correctly. Beyond having an empty constructor, the write and readFields methods must be implemented such that any Writable fields in the class also write and read their information.
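For example, a MyWritable1 with a nested Writable field might look like this (a sketch; the fields are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyWritable1 implements Writable {
    private Text name = new Text();   // nested Writable field
    private long count;

    public MyWritable1() { }          // empty constructor is required

    @Override
    public void write(DataOutput out) throws IOException {
        name.write(out);              // delegate to the nested Writable
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name.readFields(in);          // read back in exactly the same order
        count = in.readLong();
    }
}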

WholeFileInputFormat with multiple files Input

How can I use WholeFileInputFormat with many files as input?
Many files as one file...
FileInputFormat.addInputPaths(job, String ...); doesn't seem to work properly
You need to set "isSplittable" in your InputFormat to "false" so that the input file doesn't get split and get processed by just 1 mapper. One small suggestion though, you could give Sequence File a try. Combine multiple files, you are trying to process, into a single Sequence File and then process it. It would be more efficient as Sequence Files are already in key/value form.
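A minimal sketch of the isSplitable override (abstract here because the whole-file RecordReader is omitted; the key/value types are one common choice, not mandated by anything above):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public abstract class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: each file goes to exactly one mapper
    }
    // createRecordReader (not shown) would read the entire file as one record
}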
