How to skip reading the file header in hadoop mapreduce - hadoop

I am learning Hadoop MapReduce using Java. I have a sample file with data as below. How do I skip processing the header line in this file? When I look at the mapper input, the header line is being passed to the mapper as well.
roll no|school name|name|age|Gender|class|subject|marks
1|xyz|pqr|abc|10|M|1|science|98

Because you already know what the header looks like, you can simply compare each record against it and skip the header. Note that this approach makes the application slightly slower, since the string comparison runs for every record.
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String row = value.toString();
    // Skip the record if it matches the known header line
    if (row.equals("roll no|school name|name|age|Gender|class|subject|marks"))
        return;
    // NOW YOU ARE HEADER FREE
    // do some operations depending on your needs..
}

If you are running with a single mapper, you can use a counter in an if condition and skip only the first record (see the sketch below). If you are running with more than one mapper, check for the header string in an if condition instead, since only one of the splits will actually start with the header line.
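A minimal sketch of the counter idea for the single-mapper case, inside the same mapper class as above (the recordCount field name is just illustrative, not from the original answer):

    private int recordCount = 0;   // illustrative field name

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        recordCount++;
        if (recordCount == 1) {
            return; // the first record of the only split is the header
        }
        // process the remaining records here
    }

With TextInputFormat you can get the same effect from the key itself: the LongWritable key is the byte offset within the file, so a record with key.get() == 0 can only be the first line of the file.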

Related

cluster.getJob is returning null in hadoop

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    org.apache.hadoop.mapreduce.Cluster cluster = new org.apache.hadoop.mapreduce.Cluster(conf);
    Job currentJob = cluster.getJob(context.getJobID());
    mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();
}
I wrote the code above to get the value of a counter that I am incrementing in my mapper function. The problem is that the currentJob returned by cluster.getJob turns out to be null. Does anyone know how I can fix this?
My question is different because I am trying to access my counter in the reducer, not after all the map and reduce tasks are done. The code I have pasted here belongs in my reducer class.
It seems that cluster.getJob(context.getJobID()) does not work in Hadoop's standalone (local) mode.
Try running your program with YARN in Hadoop's pseudo-distributed (single-node cluster) mode as described in the documentation: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
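As a rough sketch, the key settings from that page are the following (standard Hadoop 2.x property names; adjust paths and other settings per the linked guide):

    <!-- etc/hadoop/mapred-site.xml: run MapReduce on YARN instead of the local runner -->
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

    <!-- etc/hadoop/yarn-site.xml: enable the shuffle service MapReduce needs -->
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>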

What determines how many times map() will get called?

I have a text file and a parser that parses each line and stores it into my customSplitInput. I do the parsing in my custom FileInputFormat, so my splits are custom. Right now I have 2 splits, and each split contains a list of my data.
However, my mapper function is getting called repeatedly on the same split. I thought the mapper function would only get called based on the number of splits you have?
I don't know if this applies, but my custom InputSplit returns a fixed number for getLength() and an empty string array for getLocations(). I am unsure of what to put in for these.
@Override
public RecordReader<LongWritable, ArrayWritable> createRecordReader(
        InputSplit input, TaskAttemptContext taskContext)
        throws IOException, InterruptedException {
    logger.info(">>> Creating Record Reader");
    CustomRecordReader recordReader = new CustomRecordReader(
            (EntryInputSplit) input);
    return recordReader;
}
map() is called once for every record produced by the RecordReader in (or referenced by) your InputFormat. For example, TextInputFormat calls map() for every line in the input, even though there are usually many lines in a split. It is the RecordReader's nextKeyValue() that drives this loop, as in the sketch below.
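A hedged sketch of a RecordReader over an in-memory list of records (the class name ListRecordReader and the getRecords() accessor are illustrative, not the asker's CustomRecordReader); the framework keeps calling map() until nextKeyValue() returns false, so the number of map() calls per split equals the number of records the reader produces:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class ListRecordReader extends RecordReader<LongWritable, Text> {
        private final List<String> records;
        private int index = -1;

        // In practice the record list would come from the custom split handed to
        // createRecordReader(), e.g. ((EntryInputSplit) input).getRecords() in the
        // asker's setup (method name illustrative).
        public ListRecordReader(List<String> records) {
            this.records = records;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            // nothing to do; the records were handed to the constructor
        }

        @Override
        public boolean nextKeyValue() {
            index++;
            return index < records.size();   // returning false is what ends the map() calls for this split
        }

        @Override
        public LongWritable getCurrentKey() { return new LongWritable(index); }

        @Override
        public Text getCurrentValue() { return new Text(records.get(index)); }

        @Override
        public float getProgress() {
            return records.isEmpty() ? 1.0f : (index + 1) / (float) records.size();
        }

        @Override
        public void close() { }
    }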

output HBase Increment in MR reducer

I have a MapReduce job that writes to HBase. I know you can output Put and Delete from the reducer using TableMapReduceUtil.
Is it possible to emit Increment to increment values in an HBase table instead of emitting Puts and Deletes? If yes, how do I do it, and if not, why not?
I'm using CDH3.
public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // ....DO SOME STUFF HERE
        Increment increment = new Increment(row);
        increment.addColumn(col, qual, 1L);
        context.write(null, increment); // <--- I want to be able to do this
    }
}
Thanks
As far as I know you can't write an Increment through the context, but you can always open a connection to HBase and apply Increments anywhere (mapper, mapper cleanup, reducer, etc.), as in the sketch below.
Do note that increments are not idempotent, so the result might be problematic on partial success of the map/reduce job and/or if you have speculative execution enabled for M/R (i.e. multiple mappers doing the same work).
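A minimal sketch of that approach using the CDH3-era (HBase 0.90) client API; the table name, column family, and qualifier are placeholders, and newer HBase versions would use ConnectionFactory/Table instead of HTable:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public static class TheReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
        private HTable table;

        @Override
        protected void setup(Context context) throws IOException {
            // Placeholder table name; error handling omitted for brevity
            table = new HTable(context.getConfiguration(), "my_counts_table");
        }

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Increment increment = new Increment(Bytes.toBytes(key.toString()));
            increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual"), 1L);
            table.increment(increment);   // applied directly, not via context.write()
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            table.close();
        }
    }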

Querying Data from DBMS in Hadoop Mapper Before Mapping

I'm fairly new to MapReduce in Hadoop. I'm trying to process entries from many log files. The mapper is quite similar to the one in the WordCount tutorial.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}
The thing is, instead of putting the word as the key for the reducer, I want to put related data from a table in an RDBMS. For example, the processed text is like this:
apple orange duck apple giraffe horse lion, lion grape
And there is a table
name type
apple fruit
duck animal
giraffe animal
grape fruit
orange fruit
lion animal
So, instead of counting the words, I want to count the types. The output would be like:
fruit 4
animal 5
In terms of the previous code, it would be like this:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String object = tokenizer.nextToken();
        //========================================
        String type = SomeClass.translate(object);
        //========================================
        word.set(type);
        output.collect(word, one);
    }
}
SomeClass.translate will translate the object name to its type by querying an RDBMS.
My questions:
Is this doable? (And how?)
What are the concerns? I understand that the mapper will run on more than one machine. So if the word apple shows up on more than one machine, how do I reduce the number of database look-ups for apple?
Is there a good alternative that avoids doing the translation in the mapper, or a common way to do this? (Or is this whole question a really stupid question?)
UPDATE
I'm implementing this using Apache Hadoop on Amazon Elastic MapReduce, and the translation table is stored in Amazon RDS/MySQL. I would really appreciate it if you could provide some sample code or links.
If you're worried about minimizing DB queries, you could do this in two MR jobs: first do a standard word count, then use the output of that job to do the translation to type and re-sum.
Alternatively, if your mapping table is small enough to fit in memory, you could start by serializing it, adding it to the DistributedCache, and then loading it into memory as part of the Mapper's setup method. Then there's no need to worry about doing the translation too many times, as it's just a cheap memory lookup. A sketch of this approach follows.
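A minimal sketch of the distributed-cache approach using the newer mapreduce API, assuming the name-to-type table has been exported from the database to a tab-separated file (the file name types.tsv, the class name, and the cache setup shown in the comment are illustrative assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TypeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Map<String, String> nameToType = new HashMap<String, String>();
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException {
            // types.tsv is assumed to have been shipped with the job, e.g. via
            // job.addCacheFile(new URI("s3://my-bucket/types.tsv#types.tsv")), so it is
            // available in the task's working directory under the symlink name.
            BufferedReader reader = new BufferedReader(new FileReader("types.tsv"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");   // assumed format: name<TAB>type
                if (parts.length == 2) {
                    nameToType.put(parts[0], parts[1]);
                }
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                String type = nameToType.get(tokenizer.nextToken());
                if (type != null) {           // cheap in-memory lookup, no DB query per token
                    word.set(type);
                    context.write(word, one);
                }
            }
        }
    }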
To summarize the requirement: a join is done between the data in the table and a file, and a count is done on the joined data. Depending on the input size, the join can be done in different ways (map-only, or a full map-reduce join). For more details on joining, go through Data-Intensive Text Processing with MapReduce, Section 3.5.

Map Reduce - Not able to get the right key

Hi, I am writing MapReduce code to find the maximum temperature. The problem is that I am getting the maximum temperature, but without the corresponding key.
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    Text year = new Text();
    int maxTemperature = Integer.MIN_VALUE;

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        //System.out.println("The maximum temperature is " + maxTemperature);
        context.write(year, new IntWritable(maxTemperature));
    }
}
Imagine the mapper output is like:
1955 52
1958 7
1985 22
1999 32
and so on.
It is overwriting the keys and printing all the data. I want only the maximum temperature and its year.
I see a couple of things wrong with your code sample:
Reset maxTemperature inside the reduce method (as its first statement); at the moment you have a bug in that it will output the maximum temperature seen across all preceding keys/values.
Where are you setting the contents of year? In fact you don't need it at all: just call context.write(key, new IntWritable(maxTemperature)); as the input key is the year.
You might want to create an IntWritable instance variable and re-use it rather than creating a new IntWritable when writing out the output value (this is an efficiency point rather than a potential cause of your problem).
A corrected version of the reducer is sketched below.
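Putting those points together, a minimal sketch of the corrected reducer:

    // Sketch of the corrected reducer incorporating the points above
    public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();   // re-used output value

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxTemperature = Integer.MIN_VALUE;              // reset for every key
            for (IntWritable valTemp : values) {
                maxTemperature = Math.max(maxTemperature, valTemp.get());
            }
            result.set(maxTemperature);
            context.write(key, result);                          // the input key is the year
        }
    }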
