In Hadoop, how can you give a whole file as input to a mapper?

An interviewer recently asked me this question:
I said by configuring block size or split size equal to file size.
He said it is wrong.

Well, if you phrased it like that, I think he didn't like the "configuring block size" part.
EDIT: On reflection, changing the block size is a bad idea because it is global to HDFS.
On the other hand, a solution that does prevent splitting is to set the minimum split size larger than the largest file to map.
A cleaner solution is to subclass the relevant InputFormat implementation, specifically by overriding the isSplitable() method to return false. In your case you could do something like this with FileInputFormat:
public class NoSplitFileInputFormat extends FileInputFormat<LongWritable, Text>
{
    @Override
    protected boolean isSplitable(JobContext context, Path file)
    {
        return false;
    }

    // Note: FileInputFormat is abstract, so createRecordReader(...)
    // must still be implemented (or extend TextInputFormat instead).
}

The interviewer wanted to hear that you can prevent splitting by gzip-compressing the input file.
In this case, MapReduce will do the right thing and not try to split the gzipped file,
since it knows that the input is gzip-compressed (by looking at the filename extension)
and that gzip does not support splitting.
This will work, but at the expense of locality: a single map will process all HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.
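For completeness, the standard non-gzip answer (the approach described in Hadoop: The Definitive Guide) is a whole-file input format: a FileInputFormat subclass whose record reader hands the entire file to the mapper as a single record. A minimal sketch, where the class names `WholeFileInputFormat` and the anonymous reader are illustrative rather than a built-in Hadoop API:

```java
// Sketch of a whole-file input format: one file = one unsplittable record.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: each file is processed by exactly one mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private Configuration conf;
            private boolean processed = false;
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Read the whole file into one BytesWritable value.
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}
```

The same locality caveat applies: one mapper reads every block of the file, most of them remotely.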

Related

how to design 1 mapper for 1 text file in Mapreduce

I am running Mapreduce on hadoop 2.9.0.
My problem:
I have a number of text files (about 10- 100 text files). Each file is very small in terms of size, but due to my logical problem, I need 1 mapper to handle 1 text file. The result of these mappers will be aggregated by my reducers.
I need to design it so that the number of mappers always equals the number of files. How do I do that in Java code? Which class do I need to extend?
Thanks a lot.
I've had to do something very similar, and faced similar problems to you.
The way I achieved this was to feed in a text file containing the paths to each file; for example, the text file would contain this kind of information:
/path/to/filea
/path/to/fileb
/a/different/path/to/filec
/a/different/path/to/another/called/filed
I'm not sure exactly what you want your mappers to do, but when creating your job, you want to do the following:
public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "My Map reduce application");
    job.setJarByClass(Main.class);
    job.setMapperClass(CustomMapper.class);
    job.setInputFormatClass(NLineInputFormat.class);
    ...
}
Your CustomMapper.class will want to extend Mapper like so:
public class CustomMapper extends Mapper<LongWritable, Text, <Reducer Key>, <Reducer Value>> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Configuration configuration = context.getConfiguration();
        ObjectTool tool = new ObjectTool(configuration, new Path(value.toString()));
        context.write(<reducer key>, <reducer value>);
    }
}
Where ObjectTool is another class which deals with what you want to actually do with your files.
So let me explain broadly what this is doing, the magic here is job.setInputFormatClass(NLineInputFormat.class), but what is it doing exactly?
It's essentially taking your input and splitting the data by each line, and sends each line to a mapper. By having a text file containing each file by a new line, you then create a 1:1 relationship between mappers and files. A great addition to this setup is it allows you to create advanced tooling for the files you want to deal with.
I used this to create a compression tool in HDFS. When I was researching approaches to this, a lot of people were essentially reading the file to stdout and compressing it that way; however, when it came to doing a checksum on the original file and on the file after being compressed and decompressed, the results were different. This was due to the type of data in these files, and there was no easy way to implement BytesWritable. (Information on cat'ing files to stdout can be seen here.)
That link also quotes the following:
org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It basically tells the job to feed one file per maptask
Hope this helps!
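To make the driver side concrete: the new-API NLineInputFormat defaults to one line per split, and the number of lines can also be set explicitly. A minimal sketch, where the job name and paths are illustrative:

```java
// Sketch: one mapper per line of the file-list input.
Job job = Job.getInstance(new Configuration(), "one mapper per file");
job.setInputFormatClass(NLineInputFormat.class);
// One line per split => one mapper per listed file (also the default).
NLineInputFormat.setNumLinesPerSplit(job, 1);
// Equivalent configuration key in Hadoop 2.x:
// job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 1);
NLineInputFormat.addInputPath(job, new Path("/path/to/file-list.txt"));
```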

How to just output value in context.write(k,v)

In my mapreduce job, I just want to output some lines.
But if I code like this:
context.write(data, null);
the program will throw java.lang.NullPointerException.
I don't want to code like below:
context.write(data, new Text(""));
because I have to trim the blank space in every line in the output files.
Is there any good ways to solve it?
Thanks in advance.
Sorry, it's my mistake. I checked the program carefully and found that the cause was that I had set the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in reducer works fine. In the output data file, there is just the data line.
Sharing the NullWritable explanation from Hadoop: The Definitive Guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don’t need
to use that position—it effectively stores a constant empty value. NullWritable can also
be useful as a key in SequenceFile when you want to store a list of values, as opposed
to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling
NullWritable.get().
You should use NullWritable for this purpose.
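A minimal sketch of a reducer that emits value-only lines (the class name and the Text/Text input types are illustrative):

```java
// Sketch: use NullWritable as the output value so only the key text is written.
public class ValueOnlyReducer
        extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // TextOutputFormat omits the separator and the value entirely
        // when the value is a NullWritable, so each output line is just the key.
        context.write(key, NullWritable.get());
    }
}
```

Remember to also declare it on the job with job.setOutputValueClass(NullWritable.class).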

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the Map function. In the map function shown, it takes in a "key" of type LongWritable, but this parameter is never used in the body of the Map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text? Can someone give me an example where both a key and a value are required for input? I only see map as V1 -> (K2, V2).
Another question: in a real Hadoop implementation, are there multiple reduction steps? If so, how does Hadoop apply the same reduce function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take input of type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you as to whether you actually use the key and/or value. In this case, the key represents a line number from the file being read. The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
As to your second question (which really should be its own Stack Overflow question), you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs will read pre-existing files as input and others will read the output of other jobs. Each job will have one or more mappers and zero or more reducers.
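Multi-step reduction in practice is job chaining: the second job's input path is the first job's output path, so its mapper and reducer are written against the first job's output types. A hedged sketch, with all paths and job names illustrative:

```java
// Sketch: two chained MapReduce jobs; paths and names are illustrative.
Configuration conf = new Configuration();
Path input = new Path("/data/in");
Path intermediate = new Path("/data/tmp");
Path output = new Path("/data/out");

Job first = Job.getInstance(conf, "first pass");
FileInputFormat.addInputPath(first, input);
FileOutputFormat.setOutputPath(first, intermediate);
// ... set mapper/reducer classes and output types for the first job ...
first.waitForCompletion(true); // block until the first job finishes

Job second = Job.getInstance(conf, "second pass");
FileInputFormat.addInputPath(second, intermediate); // reads the first job's output
FileOutputFormat.setOutputPath(second, output);
// ... the second job's mapper must accept the first job's (K3, V3) output types ...
second.waitForCompletion(true);
```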

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom-jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key.
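A minimal sketch of that approach using the old mapred API (the subclass name is illustrative; generateFileNameForKeyValue is the real hook MultipleTextOutputFormat provides):

```java
// Sketch (old mapred API): name each output file after its key.
public class KeyBasedOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Every record with the same key lands in a file named after that key.
        return key.toString();
    }
}
```

It is registered on the job with JobConf.setOutputFormat(KeyBasedOutputFormat.class); in the newer mapreduce API, the roughly equivalent tool is MultipleOutputs.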

Hadoop searching words from one file in another file

I want to build a hadoop application which can read words from one file and search in another file.
If the word exists - it has to write to one output file
If the word doesn't exist - it has to write to another output file
I tried a few examples in hadoop. I have two questions
Two files are approximately 200MB each. Checking every word of one file against another might cause an out-of-memory error. Is there an alternative way of doing this?
How to write data to different files because output of the reduce phase of hadoop writes to only one file. Is it possible to have a filter for reduce phase to write data to different output files?
Thank you.
How I would do it:
split value in 'map' by words, emit (<word>, <source>) (*1)
you'll get in 'reduce': (<word>, <list of sources>)
check source-list (might be long for both/all sources)
if NOT all sources are in the list, emit every time (<missingsource>, <word>)
job2: job.setNumReduceTasks(<numberofsources>)
job2: emit in 'map' (<missingsource>, <word>)
job2: emit for each <missingsource> in 'reduce' all (null, <word>)
You'll end up with as many reduce outputs as there are different <missingsources>, each containing the missing words for that document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
(*1) Howto find out the source in map (0.20):
private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...
public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);
    localname = ((FileSplit) context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    ...
    outkey.set(...);
    outvalue.set(localname);
    context.write(outkey, outvalue);
}
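On the reduce side, the "check source-list" step from the outline above could look like this sketch. The class name is illustrative, and the set of all sources is hard-coded here for clarity; in practice you would pass it in via the job configuration:

```java
// Sketch: emit (<missingsource>, <word>) for every source a word did NOT appear in.
public class MissingSourceReducer
        extends Reducer<Text, Text, Text, Text> {

    // All known source files; hard-coded for illustration only,
    // in practice read from the job configuration in setup().
    private final Set<String> allSources =
            new HashSet<>(Arrays.asList("/path/to/filea", "/path/to/fileb"));

    @Override
    protected void reduce(Text word, Iterable<Text> sources, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<>();
        for (Text source : sources) {
            seen.add(source.toString());
        }
        for (String source : allSources) {
            if (!seen.contains(source)) {
                // This word never appeared in 'source'.
                context.write(new Text(source), word);
            }
        }
    }
}
```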
Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something more suited to a Lucene based application than Hadoop.
If you have to use Hadoop I have a few suggestions:
Your 'documents' will need to be in a format that MapReduce can deal with. The easiest format to use would be a CSV based file with each word in the document on a line. Having PDF etc will not work.
To take a set of words as input to your MapReduce job, to compare against the data the job processes, you could use the Distributed Cache to let each mapper build a set of the words you want to find in the input. However, if your list of words to find is large (you mention 200MB) I doubt this would work. This method is one of the main ways to do a join in MapReduce, however.
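The distributed-cache pattern could be sketched as follows, assuming the word list fits in memory; the class name and file path are illustrative:

```java
// Sketch: build the word set in each mapper from a distributed-cache file.
// The file is registered on the driver with: job.addCacheFile(new URI("/path/to/wordlist.txt"));
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> words = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            Path wordFile = new Path(cacheFiles[0].getPath());
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(wordFile)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    words.add(line.trim()); // one word per line, loaded once per mapper
                }
            }
        }
    }

    // map(...) can then test each input word against the 'words' set.
}
```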
The indexing method mentioned in another answer here also offers possibilities. Again, though, the idea of indexing a document's terms just makes me think of Lucene, not Hadoop. If you did use this method, you would need to make sure the key contains a document identifier as well as the word, so that you have the word counts contained within each document.
I don't think I've ever produced multiple output files from a MapReduce job. You would need to write some (very simple) code to process the indexed output into multiple files.
You'll want to do this in two stages, in my opinion. Run the wordcount program (included in the Hadoop examples jar) against the two initial documents; this will give you two files, each containing a unique list (with counts) of the words in each document. From there, rather than using Hadoop, do a simple diff on the two files, which should answer your question.
