Mapreduce XML input format - to build custom format - hadoop

If the input files in XML format, I shouldn't be using TextInputFormat because TextInputFormat assumes each record is in each line of the input file and the Mapper class is called for each line to get a Key Value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Being new to Hadoop mapreduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath

Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So I mean no need to have custom input format since Mahout library present.
I am not sure, whether you are going to read or write but both were described in above link.
Pls have a look at XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat

Related

ESQL code for CSV to XML conversion in IIB v10

I want to convert input CSV file to XML file using ESQL in IIB v10. Can you please help me with the ESQL code to achieve the same. I've provided the Input CSV file sample and Output XML file sample as below:
Input CSV file
Output XML file
Your question is fundamentally wrong. Using ESQL only to do it on Integration Bus is like using a knife to cut down a tree (when you have the choice with a chainsaw). If you want to convert a csv file to an xml, the proper solution is the following :
1) Define a new DFDL schema to parse the CSV file
2) Define your xsd for the output XML
3) Use the DFDL parser when you read the CSV, and use the structure you created (on the fileInput node for example, I don't know your exact case)
4) Use a mapping node to map from your DFDL structure to your XML structure (defined in the xsd)
Note : the last step can be done with alternatives solution, like compute Nodes (ESQL, Java, C#, php).
If you have any additional questions, feel free to contact me

How to convert hadoop sequence file to json format?

As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling have only shown up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I've hadoop sequence file sitting on my local machine and would like to get data in corresponding json format.
So in-effect, I'm looking for some tool/utility which will take hadoop sequence file as input and produce output in json format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
I thought there must be some tool which will do this given that its such common requirement. Yes, it should be pretty easy to code but again why to do so if you already have something which does just the same.
Anyway, I figured out to do it using jaql. Sample working query which worked for me,
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

Save and read complicated Writable value in Hadoop job

I need to move complicated value (implements Writable) from output of 1st map-reduce job to input of other map-reduce job. Results of 1st job saved to file. File can store Text data or BytesWritable (with default output \ input formats). So I need some simple way to convert my Writable to Text or To BytesWritable and from it. Does it exists? Any alternative way to do this?
Thanks a lot
User irW is correct, use SequenceFileOutputFormat. SequenceFile solves this exact problem, without converting to Text Writable. When setting up your job, use job.setOutputKeyClass and job.setOutputValueClass to set the Writable subclasses you are using:
job.setOutputKeyClass(MyWritable1.class);
job.setOutputValueClass(MyWritable2.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
This will use the Hadoop SequenceFile format to store your Writables. Then in your next job, use SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
Then the input key and value for the mapper in this job will be the two Writable classes you originally specified as output in the previous job.
Note, it is crucial that your complex Writable subclass is implemented correctly. Beyond the fact that you must have an empty constructor, the write and readFields methods must be implemented such that any Writable fields in the class also write and read their information.

How to convert from text file to sequence file?

I have a large .txt file of records that I need to convert into a (hadoop) sequence format for efficiency. I have found some answers to this online (such as How to convert .txt file to Hadoop's sequence file format), but I'm new to hadoop and don't really understand them. If you could explain these a little more, or if you have another solution, that'd be great. If it helps, the records are separated by line.
Thanks in advance.
Since you said you were new to hadoop, do you know the basic idea of Mapper and Reducer? Both of them have KEY_IN_CLASS, VALUE_IN_CLASS, KEY_OUT_CLASS, VALUE_OUT_CLASS, so in your case, you can simple use mapper to do the convert,
for KEY_IN_CLASS, you can use the default LongWritable,
VALUE_IN_CLASS you need to use Text, since Text class deals with text input.
For KEY_OUT_CLASS, you can use NullWritable, it's a null key if you don't have a specific key.
For VALUE_OUT_CLASS, use SequenceFileOutputFormat.
I believe in order to use SequenceFileOutputFormat, you need to tell SequenceFileOutputFormat what key class and value class you use.

how to access and manipulate pdf file's datas in Hadoop?

I want to read the PDF file using hadoop, how it is possible?
I only know that hadoop can process only txt files, so is there anyway to parse the PDF files to txt.
Give me some suggestion.
An easy way would be to create a SequenceFile to contain the PDF files. SequenceFile is a binary file format. You could make each record in the SequenceFile a PDF. To do this you would create a class derived from Writable which would contain the PDF and any metadata that you needed. Then you could use any java PDF library such as PDFBox to manipulate the PDFs.
Processing PDF files in Hadoop can be done by extending FileInputFormat Class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class you override the getRecordReader() method. Now each pdf will be received as an Individual Input Split. Then these individual splits can be parsed to extract the text. This link gives a clear example of understanding how to extend FileInputFormat.

Resources