How to write a Custom Input Format - hadoop

I am a newbie to Hadoop, and I have a situation where only one line out of every four lines of the input text is relevant. Currently I am using the default TextInputFormat and conditional logic to skip the other three irrelevant lines.
How can I use a custom InputFormat to handle this? Since I am new to Hadoop, I don't know much about custom InputFormats. Any help would be appreciated. Thanks!

I think you can use NLineInputFormat, which lets you control how many lines each mapper receives. This could be an easy, ready-to-use solution.
If you want to implement your own input format, you would write a custom InputFormat and RecordReader to define what constitutes one record.
Below is one example:
http://deep-developers.blogspot.in/2014/06/custom-input-split-and-custom.html
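For the do-it-yourself route, here is a rough sketch of the idea (not the code from that post), under one assumption: the relevant line is the first of each four-line group, which you would adapt to your data. The RecordReader wraps Hadoop's LineRecordReader, emits that line as the record, and skips the other three. The class names are placeholders; note that it marks files as non-splittable so a split boundary never cuts a four-line group, at the cost of some parallelism.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class EveryFourthLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep each file in one split so four-line groups are never cut at a split boundary.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new EveryFourthLineRecordReader();
    }

    /** Emits the first line of every four-line group and skips the other three. */
    public static class EveryFourthLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;                                // no more lines
            }
            key.set(lineReader.getCurrentKey().get());
            value.set(lineReader.getCurrentValue());         // copy before advancing the reader
            for (int i = 0; i < 3; i++) {                    // skip the three irrelevant lines
                if (!lineReader.nextKeyValue()) {
                    break;                                   // short final group at end of file
                }
            }
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}
```

You would then set this class with job.setInputFormatClass(EveryFourthLineInputFormat.class) in the driver, and the mapper no longer needs the skip logic.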

Related

Spring Batch: Reading Multi-Line Record from Flat File

I have the following problem to solve:
There is a flat file to read, but the information is unfortunately spread over two rows, so I need to merge these two rows.
I thought about creating an incomplete object first and then adding the information from the next row, then moving on to the next pair. But I don't really see how to manage that.
Is there a way to read two lines and then process them, or to carry an object over from one step to the next? I'm quite confused.
Any hint would be appreciated. Thanks.
This is a perfect use case for using a SingleItemPeekableItemReader. Check out this older answer for an example.
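To illustrate the pattern, here is a rough sketch, assuming a record is always exactly two consecutive lines. TwoLineRecordReader and MergedRecord are made-up names for the example; the delegate would typically be a FlatFileItemReader registered as a stream in the step so its open/close lifecycle is handled by Spring Batch.

```java
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

public class TwoLineRecordReader implements ItemReader<MergedRecord> {

    private final SingleItemPeekableItemReader<String> delegate;

    public TwoLineRecordReader(ItemReader<String> lineReader) {
        this.delegate = new SingleItemPeekableItemReader<>();
        this.delegate.setDelegate(lineReader);    // e.g. a FlatFileItemReader<String>
    }

    @Override
    public MergedRecord read() throws Exception {
        String first = delegate.read();
        if (first == null) {
            return null;                           // end of input
        }
        String second = delegate.peek();           // look ahead without consuming
        if (second != null) {
            delegate.read();                       // consume the second half of the record
        }
        return MergedRecord.from(first, second);   // merge the two rows into one item
    }
}

// Hypothetical merged domain object; replace with your own mapping logic.
class MergedRecord {
    final String row1;
    final String row2;

    private MergedRecord(String row1, String row2) {
        this.row1 = row1;
        this.row2 = row2;
    }

    static MergedRecord from(String row1, String row2) {
        return new MergedRecord(row1, row2);
    }
}
```

The peek() call is what lets you decide whether the next line belongs to the current record before committing to reading it, which is the point of SingleItemPeekableItemReader.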

Mapreduce XML input format - to build custom format

If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record is on its own line of the input file and the Mapper class is called once per line to get a key/value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Being new to Hadoop MapReduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format like XML that is not inherently splittable?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there is no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you are going to read or write, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
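To give a concrete picture, here is a minimal driver sketch of how XmlInputFormat is typically wired up: you tell it the start and end tags that delimit one record, and each map() call then receives everything between (and including) those tags as its value. The <record> tags and MyXmlMapper are placeholders, and the package name below is an assumption (older Mahout releases ship the same class under a different package), so check the version you depend on.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlJobDriver {

    /** Placeholder mapper: "value" holds one full <record>...</record> block as text. */
    public static class MyXmlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Real XML parsing of the record block would go here.
            context.write(new Text("record"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One record = everything from <record> to </record>, inclusive.
        conf.set(XmlInputFormat.START_TAG_KEY, "<record>");
        conf.set(XmlInputFormat.END_TAG_KEY, "</record>");

        Job job = Job.getInstance(conf, "xml input example");
        job.setJarByClass(XmlJobDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(MyXmlMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```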

How to split a large csv file into multiple files in GO lang?

I am a novice Go programmer trying to learn the language's features. I want to split a large CSV file into multiple files in Go, each file containing the header. How do I do this? I have searched everywhere but couldn't find the right solution. Any help in this regard will be greatly appreciated.
Also, please suggest a good book for reference.
Thank you.
Depending on your shell fu, this problem might be better suited to common shell utilities, but you specifically mentioned Go.
Let's think through the problem.
How big is this CSV file? Are we talking 100 lines, or is it 5 GB?
If it's smallish I typically use this:
http://golang.org/pkg/io/ioutil/#ReadFile
However, this package also exists:
http://golang.org/pkg/encoding/csv/
Regardless - let's return to the abstraction of the problem. You have a header (which is the first line) and then the rest of the document.
So what we probably want to do (ignoring CSV parsing for the moment) is read in our file.
Then we want to split the file body by all the newlines in it.
You can use this to do so:
http://golang.org/pkg/strings/#Split
You didn't mention it, but do you know how many files you want to split into, or would you rather split by line count or byte count? What's the actual limitation here?
Generally the limitation isn't going to be file count, but if we pretend it is, we simply divide our line count by our expected file count to get lines per file.
Now we can take slices of the appropriate size and write the file back out via:
http://golang.org/pkg/io/ioutil/#WriteFile
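Putting those steps together, here is a rough sketch under a couple of assumptions: the whole file fits in memory (ioutil.ReadFile), we split into a fixed number of output files, and every line is one record (if fields can contain embedded newlines, encoding/csv is the safer route). The splitCSV helper and the output file naming are made up for the example.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// splitCSV reads path into memory and writes numFiles chunk files,
// each starting with a copy of the header line.
func splitCSV(path string, numFiles int) error {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}

	lines := strings.Split(strings.TrimRight(string(data), "\n"), "\n")
	if len(lines) < 2 {
		return fmt.Errorf("nothing to split in %s", path)
	}

	header, body := lines[0], lines[1:]
	linesPerFile := (len(body) + numFiles - 1) / numFiles // round up

	for i := 0; i < numFiles; i++ {
		start := i * linesPerFile
		if start >= len(body) {
			break // fewer chunks than requested; we ran out of lines
		}
		end := start + linesPerFile
		if end > len(body) {
			end = len(body)
		}

		// Every chunk gets its own copy of the header line.
		chunk := header + "\n" + strings.Join(body[start:end], "\n") + "\n"
		out := fmt.Sprintf("%s.part%d", path, i)
		if err := ioutil.WriteFile(out, []byte(chunk), 0644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := splitCSV("input.csv", 4); err != nil {
		fmt.Println("split failed:", err)
	}
}
```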
A trick I sometimes use to help think through these things is to write down the mission statement.
"I want to split a large csv file into multiple files in go"
Then I start breaking that up into pieces but take the divide/conquer approach - don't try to solve the entire problem in one go - just break it up to where you can think about it.
Also - make gratuitous use of pseudo-code until you can comfortably write the real code itself. Sometimes it helps to just write a short comment inline describing how you think the code should flow, then get it down to the smallest portion you can code and work from there.
By the way - many of the golang.org packages have example links where you can run the example code right in your browser and cut/paste it into your own local environment.
Also, I know I'll catch some haters with this - but as for books - imo - you are going to learn a lot faster just by trying to get things working rather than reading. Action trumps passivity always. Don't be afraid to fail.
Here is a package that might help. You can set the desired chunk size in bytes, and the file will be split into the appropriate number of chunks.

Pig removing parentheses when storing output

I'm new to programming Pig, and currently I'm trying to implement my Hadoop jobs with Pig.
So far my Pig programs work. I've got some output files stored as *.txt with a semicolon as the delimiter.
My problem is that Pig adds parentheses around the tuples...
Is it possible to store the output in a file without these parentheses, only storing the values? Maybe by overriding PigStorage with a UDF?
Does anyone have a hint for me?
I want to read my output files into an RDBMS (Oracle), without the parentheses.
You probably need to write your own custom Storer. See: http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
Shouldn't be too difficult to just write it as a plain CSV or whatever. There's also a pre-existing DBStorage class that you might be able to use to write directly to Oracle if you want.
For people who find this topic first, the question is answered here:
Remove brackets and commas in output from Pig
use the FLATTEN command in your script like this:
output = FOREACH [variable] GENERATE FLATTEN(($1, $2, $3));
STORE output INTO '[path]' USING PigStorage(',');
notice the second set of parentheses for the output you want to flatten.

How to convert from text file to sequence file?

I have a large .txt file of records that I need to convert into (Hadoop) SequenceFile format for efficiency. I have found some answers to this online (such as How to convert .txt file to Hadoop's sequence file format), but I'm new to Hadoop and don't really understand them. If you could explain these a little more, or if you have another solution, that'd be great. If it helps, the records are separated by line.
Thanks in advance.
Since you said you are new to Hadoop: do you know the basic idea of a Mapper and a Reducer? Both have KEY_IN_CLASS, VALUE_IN_CLASS, KEY_OUT_CLASS and VALUE_OUT_CLASS type parameters, so in your case you can simply use a mapper to do the conversion:
for KEY_IN_CLASS, you can use the default LongWritable,
for VALUE_IN_CLASS, you need to use Text, since the Text class deals with text input,
for KEY_OUT_CLASS, you can use NullWritable; it's a null key if you don't have a specific key,
for VALUE_OUT_CLASS, use Text, and set the job's output format to SequenceFileOutputFormat.
In order to use SequenceFileOutputFormat, you need to tell it what key class and value class you use.
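Here is a minimal sketch of that map-only job under the assumptions above (NullWritable key, the line itself as the Text value, SequenceFileOutputFormat as the output format); the class names are placeholders, not code from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {

    /** Passes every input line through unchanged, keyed by NullWritable. */
    public static class IdentityLineMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text to sequence file");
        job.setJarByClass(TextToSequenceFile.class);

        job.setMapperClass(IdentityLineMapper.class);
        job.setNumReduceTasks(0);                          // map-only job

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output directory will then contain SequenceFiles whose records are (NullWritable, Text) pairs, one per input line.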
