Hadoop custom split of TextFile - hadoop

I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?

I can't find the question that was asked earlier, but you just have to iterate over your lines in a simple MapReduce job and collect them in a StringBuilder. Flush the StringBuilder to the context whenever a new record begins. The trick is to declare the StringBuilder as a field of your mapper class and not as a local variable, so that it survives between calls to map().
Here it is:
Processing paragraphs in text files as single records with Hadoop
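The buffering idea above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name LogicalLineJoiner is hypothetical, the list "emitted" stands in for context.write(...), and only backslash continuation (as in the question's example) is handled, not Python's implicit continuation inside brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Mapper; no Hadoop types are used.
final class LogicalLineJoiner {
    // A field, not a local: the buffer must survive from one map() call to the next.
    private final StringBuilder buffer = new StringBuilder();
    // Stands in for context.write(...).
    private final List<String> emitted = new ArrayList<>();

    // Called once per physical line, as map() would be under TextInputFormat.
    public void map(String physicalLine) {
        if (physicalLine.endsWith("\\")) {
            // Continuation: drop the trailing backslash and keep accumulating.
            buffer.append(physicalLine, 0, physicalLine.length() - 1);
        } else {
            // Last physical line of the logical line: append it and emit the record.
            buffer.append(physicalLine);
            flush();
        }
    }

    // Called once at the end of the input, as cleanup() would be.
    public void cleanup() {
        if (buffer.length() > 0) {
            flush();
        }
    }

    private void flush() {
        emitted.add(buffer.toString());
        buffer.setLength(0);
    }

    public List<String> emitted() {
        return emitted;
    }
}
```

Note that this only joins lines seen by the same mapper. A logical line that straddles a split boundary would still be cut in two, which is the case the custom RecordReader suggestion addresses.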

You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.

Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?

Related

Hadoop: InputFormat for Variable-Length files without delimiter

I have to process (by Hadoop) variable-length files without delimiter.
The format of these files is:
(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)
There is no delimiter between the records (the file is in one line).
There is no delimiter between the LengthRecord and the Record itself (the parentheses were added in this text only for clarity).
I think I can use neither of the default classes TextInputFormat and KeyValueTextInputFormat, because they rely on linefeed or carriage-return to signal the end of a line.
So, I think I have to customize an InputFormat to load these files. But I don't know exactly how to do this.
Do I have to override createRecordReader() in order to read the length of record n, and identify the end of record n? If so, how can I manage the fact that the splits can have half lines?
Thanks in advance.
Regards
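For illustration, here is a minimal plain-Java sketch of walking such a file. The question does not say how the length is encoded, so this assumes, purely for the example, that each length is a 4-byte big-endian integer; the class name LengthPrefixedRecords is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class LengthPrefixedRecords {
    // ASSUMPTION: each length is a 4-byte big-endian int.
    // The original post does not specify the encoding of the length field.
    static List<byte[]> parse(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        List<byte[]> records = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int headerPos = buf.position();
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining()) {
                throw new IllegalArgumentException(
                        "corrupt or truncated record at byte " + headerPos);
            }
            byte[] record = new byte[length];
            buf.get(record); // advances past the record to the next length header
            records.add(record);
        }
        if (buf.hasRemaining()) {
            throw new IllegalArgumentException("trailing bytes after last record");
        }
        return records;
    }
}
```

Because a record boundary can only be found by reading every length from the start of the file, a reader like this cannot begin at an arbitrary split offset. One common way around that in Hadoop is to make the custom FileInputFormat return false from isSplitable(), so that each file is read by a single mapper.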

The "key" parameter of Hadoop map function is not used

I have been trying to learn Hadoop. In the examples I saw (such as the word count example), the key parameter of the map function is not used at all. The map function only uses the value part of the pair. So it seems that the key parameter is unnecessary, but surely it is there for a reason. What am I missing here? Can you give me example map functions which use the key parameter?
Thanks
To understand how the key is used, you need to know the various input formats available in Hadoop.
TextInputFormat -
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line.
Keys are the position in the file, and values are the line of text.
NLineInputFormat -
An InputFormat that puts N lines of input into each split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps").
One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications: it splits the input file so that, by default, one line is fed as a value to one map task, and the key is the offset, i.e. (k,v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
KeyValueTextInputFormat -
An InputFormat for plain text files. Files are broken into lines.
Either linefeed or carriage-return is used to signal end of line.
Each line is divided into key and value parts by a separator byte.
If no such byte exists, the key will be the entire line and the value will be empty.
SequenceFileAsBinaryInputFormat -
An InputFormat that reads keys and values from SequenceFiles in binary (raw) format.
SequenceFileAsTextInputFormat -
This class is similar to SequenceFileInputFormat, except that it generates a SequenceFileAsTextRecordReader, which converts the input keys and values
to their String forms by calling their toString() method.
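As a small illustration of the KeyValueTextInputFormat rule described above, the following plain-Java sketch splits a line at the first occurrence of a separator. It is not Hadoop's own code: the class name KeyValueLine is hypothetical, and the separator is passed in explicitly (a tab is used in the usage note).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

final class KeyValueLine {
    // Splits a line at the FIRST separator: the text before it is the key,
    // everything after it is the value. With no separator, the whole line
    // is the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }
}
```

For example, split("user1\tlogged in", '\t') yields the key "user1" and the value "logged in". A mapper reading such input would receive a meaningful key rather than a byte offset, which is one case where the key parameter is actually used.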
In the word count example, we want to count the occurrences of each word in the file,
so we used the following setup:
In the Mapper -
Key is the offset of the line in the text file.
Value is the line in the text file.
For example:
file.txt
Hi I love Hadoop.
I code in Java.
Here
Key - 0 , value - Hi I love Hadoop.
Key - 18 , value - I code in Java.
(Key 18 is the byte offset from the start of the file: the first line is 17 characters long, plus one newline character.)
The offset is simply the key that TextInputFormat supplies by default; word count has no use for it.
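To make the offset keys concrete, here is a short plain-Java sketch that computes where each line starts. It assumes UTF-8 text with a single "\n" at the end of each line; with "\r\n" line endings every later offset would be larger. The class name LineOffsets is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineOffsets {
    // Returns the byte offset at which each line starts.
    // ASSUMPTION: UTF-8 text, each line terminated by a single '\n' byte.
    static List<Long> offsets(String fileContents) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            result.add(offset);
            // Length of the line in bytes, plus 1 for the newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }
}
```

Calling offsets("Hi I love Hadoop.\nI code in Java.\n") returns [0, 18], since the first line occupies 17 bytes followed by one newline byte.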
The rest of the logic is covered here and in many other available links.
Just in case:
In the Reducer
Key is the word.
Value is the list of 1s emitted for that word; their sum is its count.
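The mapper and reducer roles described above can be simulated in plain Java as follows. This is a sketch of the data flow only, with no Hadoop classes: the class name WordCountSketch is hypothetical, and a TreeMap stands in for the framework's grouping of values by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map phase: the offset key is accepted but never used; emit (word, 1) per token.
    static void map(long offset, String line, Map<String, List<Integer>> shuffle) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                shuffle.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
            }
        }
    }

    // Reduce phase: the key is a word, the values are the 1s emitted for it.
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    static Map<String, Integer> run(String... lines) {
        // Stands in for the shuffle: groups all values by key.
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, shuffle);
            offset += line.length() + 1; // assumes ASCII text and '\n' line endings
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle.forEach((word, ones) -> counts.put(word, reduce(word, ones)));
        return counts;
    }
}
```

Running run("Hi I love Hadoop.", "I code in Java.") counts "I" twice and every other token once. Punctuation is left attached to words ("Hadoop.", "Java."), since the sketch splits on whitespace only.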

How to write mapreduce program for counting lines in text file?

I have a .dat file with n lines and multiple fields per line.
Each field is separated by '|'. Now I would like to write a MapReduce program to count
the number of lines for a particular field (I can do the same in Hive using count(Column_name)).
I am very new to MapReduce programming. Any help would be appreciated.
You should first work through the "word count" example; then you will see how to approach your problem.
Here is the example http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
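As a rough sketch of the per-line logic such a job would need, the following plain Java counts the lines in which a chosen '|'-separated field is present. Treating an empty field as missing is an assumption made for this example, not something stated in the question, and the class name FieldCounter is hypothetical. In a MapReduce job, the test inside the loop would sit in the mapper (emitting a constant key with the value 1), and the summing would sit in the reducer, exactly as in word count.

```java
import java.util.List;

final class FieldCounter {
    // Counts the lines whose column at fieldIndex (0-based) exists and is non-empty.
    // ASSUMPTION: an empty field counts as "missing".
    static long countNonEmpty(List<String> lines, int fieldIndex) {
        long count = 0;
        for (String line : lines) {
            // '|' is a regex metacharacter, so it must be escaped;
            // the limit -1 keeps trailing empty fields.
            String[] fields = line.split("\\|", -1);
            if (fieldIndex < fields.length && !fields[fieldIndex].isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

For the lines "a|b|c", "d||f", "g|h|" and "x", countNonEmpty(lines, 1) returns 2, because only the first and third lines have a non-empty second field.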

Writing to hdfs sequence file

I have a requirement that my map task should read a big HDFS text file and write it to a sequence file as a "text_file_name text_file_contents" key:value pair in a single line.
My Mapper then sends the path of this sequence file to Reducer.
Currently what I am doing is:
read all lines from the text file and keep appending them to a Text() variable (e.g. "contents").
once done reading the whole text file, write "contents" into the sequence file.
However, I am not sure whether Text() is able to store a big file. Hence I want to do the following:
read a single line from the text file
write it to the sequence file using writer.append(key, value), where "writer" is a SequenceFile.Writer
repeat the above until the whole text file is written.
The problem with this approach is that it writes the "key" with every line I write to the sequence file.
So, I just want to know:
can Text() store a file of any size if I keep appending to it?
how can I avoid writing the "key" in writer.append() in all writes but the first?
can writer.appendRaw() be used? I could not find sufficient documentation on this function.
To answer your questions:
Text() can store up to a maximum of 2GB.
You can avoid writing a key by either writing a NullWritable or setting key.ignore to false.
But with the second approach you cannot write your key the first time either, so it is better to use NullWritable.

Process variable numbers of lines in a Record using mapreduce

I need to process a file whose records each contain a variable number of lines.
For example, I have the following file:
100,abc,123
101,abc,123
120,abc,123
100,abc,123
111,abc,123
123,abc,123
120,abc,123
100,abc,123
111,abc,123
120,abc,123
100,abc,123
114,abc,123
120,abc,123
The records above were distinguished by alternating bold and non-bold text.
As you can see above, each record starts with a 100 line and ends with a 120 line, but contains a variable number of lines (3, 4, etc.). I know this could be solved with a custom input format and a custom record reader that reuses LineRecordReader to handle the variable number of lines. The problem with that approach is that a record (starting with a 100 line and ending with a 120 line) may itself be too large to hold in the map as a single record, in which case it will fail. So I need a better solution that uses the default InputFormat and RecordReader and does the work in the mapper, the reducer, etc. More than one job is also fine if that solves the problem.

Resources