Querying Data from DBMS in Hadoop Mapper Before Mapping - hadoop

I'm kind of new to MapReduce in Hadoop. I'm trying to process entries from many log files. The mapper process is quite similar with the one in WordCount tutorial.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
The thing is instead of putting the word as the key for the reducer, I want to put a related data from a table in RDBMS. For example, the processed text are like this
apple orange duck apple giraffe horse lion, lion grape
And there is a table
name type
apple fruit
duck animal
giraffe animal
grape fruit
orange fruit
lion animal
So, instead of counting the word, I want to count the type. The output would be like
fruit 4
animal 5
Let's say in the previous code, it will be like this
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
String object = tokenizer.nextToken();
//========================================
String type = SomeClass.translate(object);
//========================================
word.set(type);
output.collect(word, one);
}
}
The SomeClass.translate will translate the object name to the type by querying from a RDBMS.
My questions
Is this doable? (and how to do that?)
What are the concerns? I came into understanding that the mapper will be run in more than one machines. So let's say there are apple words in more than one machines, how to reduce the number of database look-up for apple?
Or is there a very good alternative without doing translation in the mapper? Or maybe there is a common way to do this? (or is this whole question a really stupid question?)
UPDATE
I'm implementing it using Apache Hadoop on Amazon Elastic MapReduce and the translation table is stored in Amazon RDS/MySQL. I would really appreciate if you could provide some sample codes or links.

If you're worried about minimizing DB queries, you could do this in two MR jobs: first do a standard word count, then use the output of that job to do the translation to type, and re-summing.
Alternatively, if your mapping table is small enough to fit in memory, you could start by serializing it, adding it to the DistributedCache, and then loading it into memory as part of the Mapper's setup method. Then there's no need to worry about doing the translation too many times, as it's just a cheap memory lookup.

To summarize the requirement, a join is done between the data in table and a file and count is done on the joined data. Based on the input size of the data there are different ways (M or MR only) join can be done. For more details on joining go through Data-Intensive Text Processing with MapReduce - Section 3.5.

Related

How to skip reading the file header in hadoop mapreduce

I am learning hadoop mapreduce using java,I have a sample file with data as below, how do I skip processing the header line in this file..because when I see the mapper input, it is considering the header also..
roll no|school name|name|age|Gender|class|subject|marks
1|xyz|pqr|abc|10|M|1|science|98
Because you already know what header looks like, you can just skip the header. This approach makes the application more slower.
#Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{
String[] row = value.toString();
if( row.equals( "roll no|school name|name|age|Gender|class|subject|marks") )
return;
//NOW YOU ARE HEADER FREE
//do some operations depending on your needs..
}
If you are running with single mapper, you can use the counter in an if condition. If you are running more than one mapper, check for header string in an if condition.

What determines how many times map() will get called?

I have a text file and a parser that will parse each line(s) and store into my customSplitInput, I do the parsing in my custom FileInputFormat phase so my splits are custom. Right now, I have 2 splits and within each split contains a list of my data.
But right now, my mapper function is getting called repeatedly on the same split. I thought the mapper function will only get called based on the number of splits you have?
I don't know if this applies but my custom InputSplit returns a fixed number for getLength() and an empty string array for getLocation(). I am unsure of what to put in for these.
#Override
public RecordReader<LongWritable, ArrayWritable> createRecordReader(
InputSplit input, TaskAttemptContext taskContext)
throws IOException, InterruptedException {
logger.info(">>> Creating Record Reader");
CustomRecordReader recordReader = new CustomRecordReader(
(EntryInputSplit) input);
return recordReader;
}
map() is called once for every record from the RecordReader in (or referenced by) your InputFormat. For example, TextInputFormat calls map() for every line in the input, even though there are usually many lines in a split.

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing a M/R job that processes large time-series-data files written in binary format that looks something like this (new lines here for readability, actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Where timestamp is simply a 8 byte struct, identifiable as such by the first 2 bytes. The actual data is bounded between duplicate value timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit the key/value pair to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP, and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, so if a certain reader receives the following split
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Do I have an easy way to access the file offsets such that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps on how splits are handled, so perhaps an explanation of these may help. thanks.
In general it is not simple to create input format which support splits, since you should be able to find out where to move from the split boundary to get consistent records. XmlInputFormat is good example of format doing so.
I would suggest first consider if you indeed need splittable inputs? You can define your input format as not splittable and not have all these issues.
If you files are generally not much larger then block size - you loose nothing. If they do - you will loose part of the data locality.
You can subclass the concrete subclass of FileInputFormat, for example, SeqenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat{
#Override
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
#Override
public RecordReader getRecordReader(InputSplit split, JobConf job,
Reporter reporter) throws IOException {
//return your customized record reader here
}
}

Map Reduce - Not able to get the rigth key

Hi I am writing map reduce code finding the maximum temperature. The problem is that I am getting the maximum temperature but without the corresponding key.
public static class TemperatureReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
Text year=new Text();
int maxTemperature=Integer.MIN_VALUE;
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
for(IntWritable valTemp:values) {
maxTemperature=Math.max(maxTemperature, valTemp.get());
}
//System.out.println("The maximunm temperature is " + maxTemperature);
context.write(year, new IntWritable(maxTemperature));
}
}
mapper imagin like
1955 52
1958 7
1985 22
1999 32
and so on.
It is overwriting the keys and printing all the data. I want only maximum temperature and its year.
I see a couple of things wrong with your code sample:
Reset the maxTemperature inside the reduce method (as the first statement), at the moment you have a bug in that it will output the maximum temperature seen for all preceding key/values
Where are you configuring the contents of year? in fact you don't need to, just call context.write(key, new IntWritable(maxTemperature); as the input key is the year
You might want to create a IntWritable instance variable and re-use it rather than creating a new IntWritable when writing out the output value (this is an efficiency thing rather than a potential cause of your problem)

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that etc) to a mapper, then do certain operation on this data, output the results to reducers and use that output again to do the second phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't also help in this case. I need to extend FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method which I don't exactly know how.
for(int i=0;i<shard_size;i++){
if(lineRecordReader.nextKeyValue()){
lineValue.append(lineRecordReader.getCurrentValue().getBytes(),0,lineRecordReader.getCurrentValue().getLength());
}
}
The above code snippet is the one that wrote but somehow doesn't work well.
I would suggest to put into your input files connection strings and some other indications where to find the shard.
Mapper will take this information, connect to the database and do a job. I would not suggest t o convert result sets to hadoop's writable classes - it will hinder performance.
The problem I see to be addressed - is to have enough splits of this relatively small input.
You can simply create enough small files with a few shards references each, or you can tweak input format to build small splits. Second way will be more flexible.
What I did, is something like this. I wrote my own record reader to read n lines at a time and send them to mappers as input
public boolean nextKeyValue() throws IOException,
InterruptedException {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
if (!lineRecordReader.nextKeyValue()) {
return false;
}
lineKey = lineRecordReader.getCurrentKey();
lineValue = lineRecordReader.getCurrentValue();
sb.append(lineValue.toString());
sb.append(eol);
}
lineValue.set(sb.toString());
//System.out.println(lineValue.toString());
return true;
// throw new UnsupportedOperationException("Not supported yet.");
}
how do you thin

Resources