I have been trying to learn Hadoop. In the examples I have seen (such as the word-count example), the key parameter of the map function is not used at all; the map function only uses the value part of the pair. So it seems that the key parameter is unnecessary, but surely it is not. What am I missing here? Can you give me example map functions that use the key parameter?
Thanks
To understand the use of the key, you need to know the various input formats available in Hadoop.
TextInputFormat -
An InputFormat for plain text files. Files are broken into lines; either linefeed or carriage return is used to signal end of line.
Keys are the byte position in the file, and values are the line of text.
NLineInputFormat-
NLineInputFormat which splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps"). One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications: it splits the input file so that, by default, one line is fed as the value to one map task, and the key is the offset, i.e. (k,v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
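For example, a parameter-sweep job might be configured like this (class name and paths are illustrative, and this assumes the new mapreduce API):

```java
// Illustrative job setup: feed N parameter lines to each mapper via NLineInputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class ParameterSweepJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parameter-sweep");
        job.setInputFormatClass(NLineInputFormat.class);
        // Default is one line per split; here each mapper gets 10 parameter sets.
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        NLineInputFormat.addInputPath(job, new Path(args[0]));  // the control file
    }
}
```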
KeyValueTextInputFormat -
An InputFormat for plain text files. Files are broken into lines.
Either linefeed or carriage return is used to signal end of line. Each line is divided into key and value parts by a separator byte.
If no such byte exists, the key will be the entire line and the value will be empty.
SequenceFileAsBinaryInputFormat-
InputFormat reading keys, values from SequenceFiles in binary (raw) format.
SequenceFileAsTextInputFormat-
This class is similar to SequenceFileInputFormat, except it generates SequenceFileAsTextRecordReader which converts the input keys and values
to their String forms by calling toString() method.
In the wordcount example, we want to count the occurrence of each word in the file, so we use the following method:
In Mapper -
Key is the byte offset into the text file.
Value - Line in text file.
For example:
file.txt
Hi I love Hadoop.
I code in Java.
Here
Key - 0 , value - Hi I love Hadoop.
Key - 18 , value - I code in Java.
(Key 18 is the byte offset of the second line from the start of the file: the 17 characters of the first line plus the newline.)
Basically, the offset key comes by default, and we do not need it, especially in wordcount.
The later logic, I guess, you will get from the many tutorials and links available.
Just in case:
In the Reducer
Key is the word.
Values are the 1s emitted by the mapper, which the reducer sums to get the count.
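To come back to the question of map functions that actually use the key: one example is an inverted-index style mapper, which emits each word together with the byte offset of the line it came from. Here is a plain-Java stand-in for that map logic (the class and method names are illustrative; in a real job the signature would be map(LongWritable key, Text value, Context context) and the pairs would go to context.write):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetIndexMapper {
    // Stand-in for map(LongWritable key, Text value, Context context):
    // emits (word, byte-offset-of-its-line) pairs, so the key is actually used.
    static List<String[]> map(long key, String value) {
        List<String[]> emitted = new ArrayList<>();
        for (String word : value.split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new String[] { word, Long.toString(key) });
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // The two lines of file.txt from the example above, with their offsets.
        for (String[] kv : map(0, "Hi I love Hadoop.")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
        for (String[] kv : map(18, "I code in Java.")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```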
Related
I have to process (with Hadoop) variable-length files without delimiters.
The format of these files is:
(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)
There is no delimiter between the records (the file is in one line).
There is no delimiter between the LengthRecord and the Record itself (parentheses were added in this text only for clarity).
I think I can use neither the TextInputFormat nor the KeyValueTextInputFormat default classes, because they rely on a linefeed or carriage return to signal the end of a line.
So, I think I have to customize an InputFormat to load these files. But I don't know exactly how to do this.
Do I have to override createRecordReader() in order to read the length of record n and identify the end of record n? If so, how can I handle the fact that a split may cut a record in half?
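For reference, the reading loop itself is straightforward if (and this is an assumption, since the question does not say how the length is encoded) each length prefix is a 4-byte big-endian int:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedReader {
    // Reads (length)(record)(length)(record)... assuming each length is a
    // 4-byte big-endian int; adapt this to however your files encode the length.
    static List<String> readAll(byte[] data) {
        List<String> records = new ArrayList<>();
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            while (in.available() >= 4) {
                int length = in.readInt();        // the (LengthRecordN) field
                byte[] record = new byte[length];
                in.readFully(record);             // the (RecordN) payload
                records.add(new String(record, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build a sample file in the (length)(record)... layout, then read it back.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String rec : new String[] { "Record1", "LongerRecord2" }) {
            byte[] payload = rec.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
        }
        System.out.println(readAll(bytes.toByteArray()));  // prints [Record1, LongerRecord2]
    }
}
```

A real createRecordReader() would wrap this same loop around the split's FSDataInputStream. Handling records that straddle split boundaries is the hard part, because with no sync marker there is no way to resynchronize mid-split; overriding isSplitable() to return false (one mapper per file) is often the simplest correct option.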
Thanks in advance.
Regards
I'm new to Hadoop and wondering how many types of InputFormat there are in Hadoop, such as TextInputFormat. Is there a certain InputFormat that I can use to read files via HTTP requests from remote data servers?
Thanks :)
There are many classes implementing InputFormat
CombineFileInputFormat, CombineSequenceFileInputFormat,
CombineTextInputFormat, CompositeInputFormat, DBInputFormat,
FileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat,
MultiFileInputFormat, NLineInputFormat, Parser.Node,
SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat,
SequenceFileInputFilter, SequenceFileInputFormat, TextInputFormat
Have a look at this article on when to use which type of InputFormat.
Out of these, most frequently used formats are:
FileInputFormat : Base class for all file-based InputFormats
KeyValueTextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage return is used to signal end of line. Each line is divided into key and value parts by a separator byte. If no such byte exists, the key will be the entire line and the value will be empty.
TextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage return is used to signal end of line. Keys are the position in the file, and values are the line of text.
NLineInputFormat : An InputFormat which splits N lines of input as one split. In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
SequenceFileInputFormat : An InputFormat for SequenceFiles.
Regarding the second query, get the files from the remote servers first and then use an appropriate InputFormat depending on the file contents. Hadoop works best with data locality.
Your first question - how many types of InputFormat are there in Hadoop such as TextInputFormat?
TextInputFormat - each line is treated as the value
KeyValueTextInputFormat - the text before the delimiter is the key and the rest is the value
FixedLengthInputFormat - each fixed-length chunk of the file is treated as a value
NLineInputFormat - N lines are treated as one value/record
SequenceFileInputFormat - for binary SequenceFiles
There is also DBInputFormat to read from databases.
Your second question - there is no built-in input format to read files via HTTP requests.
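Whichever format you pick, it is set on the Job. A job-setup fragment (assumes the new mapreduce API and an existing Configuration conf):

```java
// Illustrative fragment: choose how input records are produced for the mappers.
Job job = Job.getInstance(conf, "input-format-example");
job.setInputFormatClass(KeyValueTextInputFormat.class);  // keys = text before the tab
// or, for fixed-length binary records:
// job.setInputFormatClass(FixedLengthInputFormat.class);
// FixedLengthInputFormat.setRecordLength(job.getConfiguration(), 32);
```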
I have a requirement that my map task should read a big HDFS text file and write it to a sequence file as a single "text_file_name text_file_contents" key:value pair.
My Mapper then sends the path of this sequence file to Reducer.
Currently what I am doing is :
read all lines from the text file and keep appending them to a Text() variable (e.g. "contents").
once done reading the whole text file, write "contents" into sequence file
However, I am not sure whether Text() is able to store a big file. Hence want to do the following :
read a single line from text file
write it to sequence file using (writer.append(key, value) where "writer" is SequenceFile.Writer)
do above until the whole text file is written.
The problem with this approach is, it writes the "key" with every line I am writing to the sequence file.
So, just want to know,
can Text() store a file of any size if I keep on appending to it?
how can I avoid writing the "key" in writer.append() on every write but the first?
can writer.appendRaw() be used? I did not find sufficient documentation on this function.
To answer to your questions:
Text() can store up to a maximum of 2 GB.
You can avoid writing a key either by using a NullWritable as the key or by setting key.ignore to false.
But with the second approach you cannot write your key even on the first record, so it is better to use NullWritable.
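A sketch of the NullWritable approach (Hadoop API, not runnable standalone; assumes an existing Configuration conf and an iterable of lines already read from the text file, and the path is illustrative):

```java
// Stream the text file line by line into a SequenceFile without storing
// a key per record: every key is NullWritable, so no key bytes are written.
Path seqPath = new Path("/tmp/out.seq");  // illustrative output path
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(seqPath),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
    for (String line : lines) {
        writer.append(NullWritable.get(), new Text(line));
    }
}
```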
I have a file to process (with Hadoop) that contains records with a variable number of lines.
For example I have the following file:-
100,abc,123
101,abc,123
120,abc,123
100,abc,123
111,abc,123
123,abc,123
120,abc,123
100,abc,123
111,abc,123
120,abc,123
100,abc,123
114,abc,123
120,abc,123
The alternating groups of lines above (originally shown in bold and non-bold) are the individual records.
As you can see, each record starts with a 100 line and ends with a 120 line, but contains a variable number of lines (3, 4, etc.). I know this could be solved using a custom InputFormat and a custom RecordReader, reusing LineRecordReader to handle the variable lines. But with that approach the problem is that a record (starting with the 100 line and ending with the 120 line) may itself be too large to fit in the mapper as a single record, in which case it will fail. So I need a better solution by which this could be solved using the default InputFormat and RecordReader, doing the work in the mapper or reducer. More than one job is also welcome if the problem can be solved that way.
I'm trying to setup a MapReduce task that utilizes the Parallel Scan feature by dynamodb.
Basically, I want each Mapper class to take a tuple as the input value.
Every example I've seen so far sets this :
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Can I set the input format for the job to be a HashMap instead?
I think you want to read your file as key-value pairs rather than the standard way of reading an input split (line offset as the key and the line as the value). If that is what you asked, then you can use KeyValueTextInputFormat. The description below can be found in Hadoop: The Definitive Guide.
KeyValueTextInputFormat
TextInputFormat’s keys, being simply the offset within the file, are not normally
very useful. It is common for each line in a file to be a key-value pair,
separated by a delimiter such as a tab character. For example, this is the output
produced by TextOutputFormat, Hadoop’s default OutputFormat. To interpret such
files correctly, KeyValueTextInputFormat is appropriate.
You can specify the separator via the key.value.separator.in.input.line property.
It is a tab character by default. Consider the following input file,
where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
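The splitting rule described above can be sketched in plain Java (KeyValueLineRecordReader does essentially this with the configured separator byte; the class name here is illustrative):

```java
public class KeyValueSplit {
    // Mirrors KeyValueTextInputFormat's rule: split at the first separator;
    // if the separator is absent, the whole line is the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("(" + kv[0] + ", " + kv[1] + ")");
    }
}
```

In a job you would pick the separator with conf.set("key.value.separator.in.input.line", ",") as the quote says; note that newer Hadoop releases renamed the property to mapreduce.input.keyvaluelinerecordreader.key.value.separator.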