I have a requirement that my mapper should read a big HDFS text file and write it to a sequence file as a "text_file_name text_file_contents" key:value pair on a single line.
My mapper then sends the path of this sequence file to the reducer.
Currently what I am doing is:
read all lines from the text file and keep appending them to a Text() variable (e.g. "contents").
once done reading the whole text file, write "contents" to the sequence file.
However, I am not sure whether Text() can store a big file. Hence I want to do the following:
read a single line from the text file
write it to the sequence file using writer.append(key, value), where "writer" is a SequenceFile.Writer
repeat until the whole text file is written.
The problem with this approach is that it writes the "key" with every line I write to the sequence file.
So, I just want to know:
Can Text() store a file of any size if I keep appending to it?
How can I avoid writing the "key" in writer.append() on all writes but the first?
Can writer.appendRaw() be used? I did not find sufficient documentation for this function.
To answer your questions:
Text() can store up to a maximum of 2 GB.
You can avoid writing a key either by writing a NullWritable or by setting key.ignore to false.
However, with the second approach you cannot write your key even the first time, so it is better to use NullWritable.
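For example, here is a minimal standalone sketch of the NullWritable approach (the class name, the args-based paths, and the use of the older SequenceFile.createWriter(fs, conf, ...) signature are my assumptions, not part of the original answer):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);   // big HDFS text file to read
        Path output = new Path(args[1]);  // sequence file to create

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, NullWritable.class, Text.class);
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String line;
            Text value = new Text();
            while ((line = reader.readLine()) != null) {
                value.set(line);
                // NullWritable serializes to zero bytes, so no key data is stored per record.
                writer.append(NullWritable.get(), value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

This writes one record per input line; if you really need the whole file as a single value, you would still have to accumulate it (subject to the 2 GB Text limit) before the append.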
I have two large files. The first one is about 200k lines, the second one about 30 million lines.
I want to check if each line of the first one is in the second one using Perl.
Is it faster to compare each line of the first directly to each line of the second, or is it better to store them all in two different arrays and then manipulate the arrays?
You have File A and File B. You want to check if lines in File A appear in File B.
If you have enough memory to hold the contents of File B in a hash using one entry per line, that's the simplest. Go ahead.
However, if you do not, I recommend you put both files in tables in an SQL database. SQLite might be enough to start with. Then your problem is reduced to a simple JOIN. If line length is an issue, use a fast hash such as xxHash. If implemented correctly, the 64-bit version is blazing fast on a 64-bit machine, especially if you enabled optimizations in your Perl. Store two columns: the hash and the actual line. If the hashes match, check whether the lines match. Make sure to index on the hash column.
You say:
In fact, my files are like: File A: name number (per line); File B: name date location number (per line). And I have to check whether File B contains lines matching the data of File A (ignoring date and location, for example), so it's not an exact match ...
In that case, you are set. You do not even have to worry about the hash stuff (which I am leaving here for reference). Put the interesting bits of data on which you need to match in separate columns in an SQLite database. Write a join. ... Profit.
Alternatively, you could use BerkeleyDB, which gives you the conceptual simplicity of an in-memory hash while storing the table on disk. If you have multiple attributes on which to match, this will not scale well.
Store the first file's lines in a hash, then iterate through the second file without storing it in memory.
It might be counterintuitive to store the first file and iterate the second file as opposed to vice-versa, but it allows you to avoid creating a 30 million element hash.
use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

# Build a hash keyed by the lines of file 1, mapped to their line numbers.
open my $fh1, "<", $path_1 or die "Cannot open $path_1: $!";
my %f1;
$f1{$_} = $. while (<$fh1>);
close $fh1;

# Stream file 2 and report every line that was also seen in file 1.
open my $fh2, "<", $path_2 or die "Cannot open $path_2: $!";
while (<$fh2>) {
    if (my $f1_line = $f1{$_}) {
        say "file 1 line $f1_line appears in file 2 line $.";
    }
}
close $fh2;
Note that without further processing, the duplicated lines will display in the order they appear in the second file, not the first.
Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary.
I have been trying to learn Hadoop. In the examples I have seen (such as the word counting example), the key parameter of the map function is not used at all. The map function only uses the value part of the pair. So it seems that the key parameter is unnecessary, but surely it is not. What am I missing here? Can you give me examples of map functions that use the key parameter?
Thanks
To understand the use of the key, you need to know the various input formats available in Hadoop.
TextInputFormat -
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line.
Keys are the position in the file, and values are the line of text.
NLineInputFormat -
NLineInputFormat splits N lines of input into one split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters
(referred to as "parameter sweeps"). One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications; it splits the input file so that, by default, one line is fed as a value to one map task and the key is the offset, i.e. (k, v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
KeyValueTextInputFormat -
An InputFormat for plain text files. Files are broken into lines.
Either linefeed or carriage-return are used to signal end of line.
Each line is divided into key and value parts by a separator byte.
If no such byte exists, the key will be the entire line and the value will be empty.
SequenceFileAsBinaryInputFormat -
An InputFormat that reads keys and values from SequenceFiles in binary (raw) format.
SequenceFileAsTextInputFormat -
This class is similar to SequenceFileInputFormat, except that it generates a SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling the toString() method.
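As a concrete example of a map function that actually uses its key (a sketch of my own, not from the original answer): with KeyValueTextInputFormat, the text before the separator byte (tab by default) arrives as the key, so the mapper can work with it directly.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the job was configured with
// job.setInputFormatClass(KeyValueTextInputFormat.class);
public class KeyAwareMapper extends Mapper<Text, Text, Text, IntWritable> {
    private final IntWritable valueLength = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = text before the tab (e.g. a record id), value = the rest of the line.
        valueLength.set(value.getLength());
        context.write(key, valueLength);
    }
}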
In the WordCount example, we want to count the occurrences of each word in the file,
so we use the following method:
In the Mapper -
Key is the byte offset of the line within the text file.
Value is the line in the text file.
For example.
file.txt
Hi I love Hadoop.
I code in Java.
Here
Key - 0, value - Hi I love Hadoop.
Key - 18, value - I code in Java.
(key 18 is the byte offset of the second line from the start of the file: the first line's 17 characters plus its newline.)
Basically, the offset key comes by default, and we do not need it, especially in WordCount.
I guess you will find the rest of the logic here and in many more available links.
Just in case:
In the Reducer -
Key is the word.
Values are the 1s emitted by the mapper, which are summed to give the word's count.
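For comparison, here is a minimal WordCount mapper sketch (standard boilerplate, not copied from the original answer) showing that the offset key is received but simply ignored:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The offset key is available but never used here.
        StringTokenizer tokenizer = new StringTokenizer(line.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}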
Instead of giving a single file as input, I want to give a directory which can contain any number of files. I want the output to be written as follows:
Input:
File 1 File 2 File 3
Output:
File 1 File 2 File 3
where each input file should have its word count written to the corresponding output file. To identify which file is being processed in the map, I can use context.getInputSplit(). But how can I make it write the output the way I want?
You could use the input splits from your mapper to identify the files they came from, and use that combined with MultipleOutputs to write out to separate files from your reducer.
However, you will need to pass the file it came from to the reducer, so you may need to make a composite key object and write a custom Partitioner and WritableComparator to carry both the filename and the original key across together. See: Hadoop - composite key
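For illustration, here is a simplified sketch (class and method names are mine; it folds the file name into a plain Text key rather than a proper composite key with a custom Partitioner, which the linked answer covers):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileWordCount {

    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text compositeKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Identify the source file from the input split and fold it into the key.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer tokenizer = new StringTokenizer(line.toString());
            while (tokenizer.hasMoreTokens()) {
                compositeKey.set(fileName + "\t" + tokenizer.nextToken());
                context.write(compositeKey, ONE);
            }
        }
    }

    public static class PerFileReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            // Split the key back into file name and word, then route the count
            // to an output whose base name is the source file's name.
            String[] parts = key.toString().split("\t", 2);
            mos.write(new Text(parts[1]), new IntWritable(sum), parts[0]);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }
}

Note that with a plain string key like this, words from different files can land in the same reducer; MultipleOutputs still separates them by base output name.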
Using the Read from Text File function, I am able to easily read the first line of my file. However, I now want it to read the second line. It would be great to just use a for loop or something if I could specify the line number somewhere. Is there a way to do so? Thanks!
First, you can read the entire file as lines by right-clicking on the Read From Text File node and selecting "Read Lines". One read will return an array containing one element for each line, and you can work with the lines using regular array handling methods. If you want to read each line individually, you can do so by wiring a 1 into the Count input and looping. Each iteration will return an array with one element (the current line read). You can get/set the offset (in bytes) to specify where in the file you want to read, but that's not necessary if I read your question correctly.
I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?
I can't find where this question was asked earlier, but you just have to iterate over your lines in a simple MapReduce job and save them into a StringBuilder. Flush the StringBuilder to the context whenever you want to begin a new record. The trick is to set up the StringBuilder as a field of your mapper class, not as a local variable.
Here it is:
Processing paragraphs in text files as single records with Hadoop
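A loose sketch of that idea (the continuation rule, i.e. a physical line ending in a backslash, and all names here are my assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogicalLineMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // Keep the builder as a field so it survives across map() calls within a split.
    private final StringBuilder logicalLine = new StringBuilder();

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.endsWith("\\")) {
            // Continuation: drop the trailing backslash and keep accumulating.
            logicalLine.append(line, 0, line.length() - 1);
        } else {
            // End of a logical line: flush the accumulated record to the context.
            logicalLine.append(line);
            context.write(NullWritable.get(), new Text(logicalLine.toString()));
            logicalLine.setLength(0);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Flush any partial logical line left at the end of the split.
        if (logicalLine.length() > 0) {
            context.write(NullWritable.get(), new Text(logicalLine.toString()));
        }
    }
}

Be aware that a logical line can still be cut in half at an input split boundary; handling that properly is what the custom RecordReader suggestion below addresses.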
You should create your own variation on TextInputFormat. In it, you make a new RecordReader that skips lines until it sees the start of a logical line.
Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?