I have to process, with Hadoop, files of variable-length records that have no delimiters.
The format of these files is:
(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)
There is no delimiter between the records (the file is in one line).
There is no delimiter between the LengthRecord and the Record itself (parentheses were added in this text only for clarity).
I don't think I can use either of the default TextInputFormat or KeyValueTextInputFormat classes, because they rely on a linefeed or carriage return to signal the end of a line.
So I think I have to write a custom InputFormat to load these files, but I don't know exactly how to do this.
Do I have to override createRecordReader() in order to read the length of record n and identify the end of record n? If so, how can I handle the fact that a split can begin or end in the middle of a record?
Thanks in advance.
Regards
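To make the format concrete, here is a minimal sketch of the read loop a custom RecordReader would need, written in plain Java without any Hadoop classes. The width and encoding of the length field are not stated above, so this assumes each length is exactly 4 ASCII decimal digits; the class name LengthPrefixedReader and the sample data are made up for illustration. It does not deal with splits: one common way to sidestep partial records is to make the files non-splittable (in Hadoop, by overriding FileInputFormat.isSplitable to return false), at the cost of parallelism within a file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedReader {
    // ASSUMPTION (not stated in the question): each length field is exactly 4 ASCII digits.
    static final int LENGTH_WIDTH = 4;

    /** Reads (length)(record) pairs until the stream is exhausted. */
    static List<String> readRecords(InputStream in) {
        List<String> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        byte[] lengthField = new byte[LENGTH_WIDTH];
        try {
            int first;
            while ((first = data.read()) >= 0) {            // -1 means a clean end of input
                lengthField[0] = (byte) first;
                data.readFully(lengthField, 1, LENGTH_WIDTH - 1);
                int length = Integer.parseInt(
                        new String(lengthField, StandardCharsets.US_ASCII));
                byte[] record = new byte[length];
                data.readFully(record);                      // fails if the record is cut short
                records.add(new String(record, StandardCharsets.US_ASCII));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three records of length 5, 2 and 3.
        byte[] sample = "0005hello0002hi0003abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readRecords(new ByteArrayInputStream(sample))); // [hello, hi, abc]
    }
}
```

Because a record boundary can only be found by walking the length fields from the start of the file, a reader cannot resynchronize from an arbitrary split offset, which is why the non-splittable route is the simplest starting point.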
I'm new to Hadoop and wondering how many types of InputFormat are there in Hadoop such as TextInputFormat? Is there a certain InputFormat that I can use to read files via http requests to remote data servers?
Thanks :)
There are many classes implementing InputFormat:
CombineFileInputFormat, CombineSequenceFileInputFormat,
CombineTextInputFormat, CompositeInputFormat, DBInputFormat,
FileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat,
MultiFileInputFormat, NLineInputFormat, Parser.Node,
SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat,
SequenceFileInputFilter, SequenceFileInputFormat, TextInputFormat
Have a look at this article on when to use which type of InputFormat.
Out of these, the most frequently used formats are:
FileInputFormat : Base class for all file-based InputFormats
KeyValueTextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line. Each line is divided into key and value parts by a separator byte. If no such byte exists, the key will be the entire line and the value will be empty.
TextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line. Keys are the position in the file, and values are the line of text.
NLineInputFormat : Splits N lines of input as one split. In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
SequenceFileInputFormat : An InputFormat for SequenceFiles.
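The key/value rule described for KeyValueTextInputFormat above can be sketched in plain Java as follows. This is only an illustration of the rule (split at the first separator, whole line becomes the key if there is none), not Hadoop's own code; the class name KeyValueLineSplitter and the sample lines are made up. The tab separator used here is Hadoop's default for this format.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

class KeyValueLineSplitter {
    /** Splits a line at the FIRST occurrence of the separator. */
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the key is the entire line and the value is empty.
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines.
        Map.Entry<String, String> a = split("user42\tclicked\tbutton", '\t');
        System.out.println(a.getKey());   // user42
        Map.Entry<String, String> b = split("no-separator-here", '\t');
        System.out.println(b.getValue().isEmpty()); // true
    }
}
```

Note that only the first separator counts, so any later separators stay inside the value.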
Regarding the second query: get the files from the remote servers first, then use the appropriate InputFormat depending on the contents of the files. Hadoop works best when it can exploit data locality.
Your first question - how many types of InputFormat are there in Hadoop such as TextInputFormat?
TextInputFormat - each line will be treated as value
KeyValueTextInputFormat - First value before delimiter is key and rest is value
FixedLengthInputFormat - Each fixed-length chunk of the file is treated as one value
NLineInputFormat - N lines are treated as one split
SequenceFileInputFormat - For binary SequenceFiles
Also there is DBInputFormat to read from databases
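As a rough picture of what "fixed length" means in the list above, here is a small plain-Java sketch that cuts a byte array into records of a given length. It is not Hadoop code; the class name FixedLengthChunker, the sample data and the choice to reject a trailing partial record with an IllegalArgumentException are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedLengthChunker {
    /** Cuts data into consecutive records of exactly recordLength bytes. */
    static List<byte[]> chunk(byte[] data, int recordLength) {
        if (recordLength <= 0 || data.length % recordLength != 0) {
            // Sketch-level choice: refuse input that ends in a partial record.
            throw new IllegalArgumentException("input is not a whole number of records");
        }
        List<byte[]> records = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            records.add(Arrays.copyOfRange(data, offset, offset + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three 4-byte records.
        byte[] data = "AAAABBBBCCCC".getBytes(StandardCharsets.US_ASCII);
        for (byte[] record : chunk(data, 4)) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```

Fixed-length records are easy to split across mappers because every record boundary is a multiple of the record length, which is exactly what variable-length records lack.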
Your second question - there is no input format to read files via HTTP requests.
I have been trying to learn Hadoop. In the examples I saw (such as the word counting example), the key parameter of the map function is not used at all. The map function only uses the value part of the pair. So it seems to me that the key parameter is unnecessary, but it should not be. What am I missing here? Can you give me example map functions which use the key parameter?
Thanks
To understand the use of the key, you need to know the various input formats available in Hadoop.
TextInputFormat -
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line.
Keys are the position in the file, and values are the line of text.
NLineInputFormat -
Splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps").
One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications: it splits the input file so that, by default, one line is fed as a value to one map task, and the key is the offset, i.e. (k,v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
KeyValueTextInputFormat -
An InputFormat for plain text files. Files are broken into lines.
Either linefeed or carriage-return is used to signal end of line.
Each line is divided into key and value parts by a separator byte.
If no such byte exists, the key will be the entire line and the value will be empty.
SequenceFileAsBinaryInputFormat-
InputFormat reading keys, values from SequenceFiles in binary (raw) format.
SequenceFileAsTextInputFormat-
This class is similar to SequenceFileInputFormat, except it generates SequenceFileAsTextRecordReader which converts the input keys and values
to their String forms by calling toString() method.
In the WordCount example, we want to count the occurrences of each word in the file, so we used the following approach:
In Mapper -
Key - Offset of the line in the text file.
Value - Line in the text file.
For example.
file.txt
Hi I love Hadoop.
I code in Java.
Here
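Key - 0 , value - Hi I love Hadoop.
Key - 18 , value - I code in Java.
(Key 18 is the byte offset from the start of the file: the 17 characters of the first line plus its newline, assuming a one-byte line terminator.)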
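If you want to check the offsets yourself, here is a small plain-Java sketch that computes the byte offset at which each line starts. It assumes lines end with a single '\n' byte; with "\r\n" line endings the second offset would be one higher. The class name LineOffsets is made up for the example, and this is not how Hadoop's LineRecordReader is implemented, only the same arithmetic.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    /** Returns the byte offset at which each line of the text starts. */
    static List<Long> offsets(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<Long> starts = new ArrayList<>();
        long lineStart = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {          // ASSUMPTION: '\n' is the only line terminator
                starts.add(lineStart);
                lineStart = i + 1;           // the next line begins after the newline
            }
        }
        if (lineStart < bytes.length) {
            starts.add(lineStart);           // last line without a trailing newline
        }
        return starts;
    }

    public static void main(String[] args) {
        String file = "Hi I love Hadoop.\nI code in Java.\n";
        System.out.println(offsets(file)); // [0, 18]
    }
}
```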
Basically, the offset is just the default key, and we do not need it in WordCount.
The rest of the logic you will find here and in the many other links available.
Just in case:
In Reducer
Key is the Word
Value is the list of counts emitted for that word (a 1 per occurrence), which the reducer sums.
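The mapper and reducer roles described above can be simulated in plain Java, without Hadoop, to see where the key goes. This is only a sketch: the class name WordCountSketch and its method names are made up, the grouping step stands in for Hadoop's shuffle, and words are split on whitespace only (so "Hadoop." keeps its full stop).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    /** Map step: receives (offset, line); the offset key is deliberately ignored. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleImmutableEntry<>(word, 1)); // emit (word, 1)
            }
        }
        return out;
    }

    /** Reduce step: receives (word, [1, 1, ...]) and sums the counts. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Stand-in for the framework: feeds lines to map, groups by key, calls reduce. */
    static Map<String, Integer> run(String[] lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
            offset += line.length() + 1; // +1 for '\n'; assumes single-byte characters
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = run(new String[] {"Hi I love Hadoop.", "I code in Java."});
        System.out.println(totals.get("I")); // 2
    }
}
```

A map function that does use the key would look at the offset argument, for example to skip a header line by ignoring the record whose offset is 0.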
I need to process a file whose records each span a variable number of lines.
For example, I have the following file:
100,abc,123
101,abc,123
120,abc,123
100,abc,123
111,abc,123
123,abc,123
120,abc,123
100,abc,123
111,abc,123
120,abc,123
100,abc,123
114,abc,123
120,abc,123
Each record, as you can see above, starts with a 100 line and ends with a 120 line, but contains a variable number of lines (3 or 4 here). I know this could be solved with a custom InputFormat and a custom RecordReader that reuses LineRecordReader to handle the variable number of lines. The problem with that approach is that a record (from its 100 line to its 120 line) may itself be too large to hold in the mapper as a single record, in which case it will fail. So I need a better solution that uses the default InputFormat and RecordReader and does the work in the mapper or reducer. A solution that needs more than one job is also welcome.
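For reference, the record-boundary logic itself (open a record on a 100 line, close it on a 120 line) can be sketched in plain Java like this. It only illustrates the grouping rule on an in-memory list; it does not address the two hard parts of the question, namely records that straddle split boundaries and records too large to buffer. The class name MarkerRecordGrouper is made up, and the sketch assumes the first comma-separated field is the line type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerRecordGrouper {
    /** Groups lines into records that start with a 100 line and end with a 120 line. */
    static List<List<String>> group(List<String> lines) {
        List<List<String>> records = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith("100,")) {
                current = new ArrayList<>();   // start marker opens a new record
            }
            if (current != null) {             // lines before the first 100 line are skipped
                current.add(line);
                if (line.startsWith("120,")) {
                    records.add(current);      // end marker closes the record
                    current = null;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "100,abc,123", "101,abc,123", "120,abc,123",
                "100,abc,123", "111,abc,123", "123,abc,123", "120,abc,123");
        for (List<String> record : group(lines)) {
            System.out.println(record.size() + " lines: " + record);
        }
    }
}
```

Skipping lines until the first start marker is the same trick a split-aware reader would use when its split begins in the middle of a record.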
I'm using the CLPB_IMPORT function module to read the clipboard into an internal table, and that works. I'm copying two columns of Excel data, so it fills the table with the delimiter '#', like this:
4448#3000
4449#4000
4441#5000
But the problem is splitting these strings. I'm trying:
LOOP AT foytab.
SPLIT foytab-tab AT '#' INTO temp1 temp2.
ENDLOOP.
But it doesn't split; it puts the whole line into temp1. I think the delimiter is not what I thought ('#'), because when I write a string manually with the delimiter '#', it does split.
Do you have any idea how to split this?
You should not use CLPB_IMPORT since it's explicitly marked as obsolete. Use CL_GUI_FRONTEND_SERVICES=>CLIPBOARD_IMPORT instead.
The data is probably not separated by # but by a tab character. You can check this in the hex view of the debugger. # is just a replacement symbol the UI uses for any unprintable character. If the delimiter is the tab character, you can use the constant CL_ABAP_CHAR_UTILITIES=>HORIZONTAL_TAB.
I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?
I can't find the question that was asked earlier, but you just have to iterate over your lines in a simple MapReduce job and append them to a StringBuilder. Flush the StringBuilder to the context whenever a new record begins. The trick is to declare the StringBuilder as a field of your mapper class and not as a local variable, so that it survives between map() calls.
here it is:
Processing paraphragraphs in text files as single records with Hadoop
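A plain-Java sketch of the StringBuilder idea, without Hadoop classes, might look like the following. The class name LogicalLineJoiner is made up, and the list called out stands in for context.write. It only handles backslash continuations like the one in the question (not open brackets or triple-quoted strings), and it does not solve the case where a logical line straddles two input splits.

```java
import java.util.ArrayList;
import java.util.List;

class LogicalLineJoiner {
    // Fields, not locals: they must survive from one map() call to the next.
    private final StringBuilder pending = new StringBuilder();
    private final List<String> out = new ArrayList<>(); // stands in for context.write

    /** Called once per physical line, like a mapper's map() method. */
    void map(String physicalLine) {
        if (physicalLine.endsWith("\\")) {
            // Continuation: keep the text without the trailing backslash and wait.
            pending.append(physicalLine, 0, physicalLine.length() - 1);
        } else {
            // End of the logical line: flush what has been collected.
            pending.append(physicalLine);
            out.add(pending.toString());
            pending.setLength(0);
        }
    }

    List<String> emitted() {
        return out;
    }

    public static void main(String[] args) {
        LogicalLineJoiner joiner = new LogicalLineJoiner();
        joiner.map("print \"Blah Blah\\");
        joiner.map("... blah blah\"");
        joiner.map("x = 1"); // hypothetical second statement
        System.out.println(joiner.emitted().size()); // 2
    }
}
```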
You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.
Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?