I'm trying to run a simple MapReduce operation on a TSV dataset, and I'm a bit confused about what goes wrong when I try a simple map operation. Below is my modification of the Map class from the sample Word Count problem.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private Text node = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String tokens[] = line.split('t');
        node.set(tokens[1]);
        int weight = Integer.parseInt(tokens[2]);
        output.collect(node, new IntWritable(weight));
    }
}
The input can be visualized as a TSV file with three columns. For the above code, I get an error saying that the method java.lang.String.split is not applicable, on the line where the line is split into tokens. Any ideas where I may be going wrong?
String tokens[] = line.split('t');
Change it to
String tokens[] = line.split("\t");
String tokens[] = line.split('t');
It should be:
String tokens[] = line.split("\t");
Single quotes create a char literal, and String.split only accepts a String (a regex), which is why the compiler reports that the method is not applicable.
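Putting the fix in context, a sketch of the corrected map method from the question above:

// Corrected map method (sketch): split on a tab String, not a char literal
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    String[] tokens = line.split("\t");   // "\t" is a String regex, so this compiles
    node.set(tokens[1]);
    int weight = Integer.parseInt(tokens[2]);
    output.collect(node, new IntWritable(weight));
}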
I am new to Hadoop, so I am having a little difficulty understanding the programs. Could someone help me understand this mapper program?
package SearchTxn;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String Txn = value.toString();
        String TxnParts[] = Txn.split(",");
        Double Amt = Double.parseDouble(TxnParts[3]);
        String Uid = TxnParts[2];
        if (Uid.equals("4000010") && Amt > 100)
        {
            context.write(null, value);
        }
    }
}
The code basically filters lines in which Uid (TxnParts[2], the third column of your CSV) is "4000010" and Amt (I guess short for amount; TxnParts[3], the fourth column) is greater than 100.
Adding to the answer from @Thomas Jungblut, the line below declares the Mapper class's overall input and output types. Here nothing is returned as a key (NullWritable), but Text is returned as the value.
public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text>{
The same goes for the parameters of the write method:
context.write(null, value);
It's not always necessary to write a key from the Mapper class. Depending on your use case, either the key, the value, or both can be passed to context.write.
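If you would rather not pass a null literal, the same thing can be written with the NullWritable singleton (a minor variation, not a requirement):

// Equivalent write using the NullWritable singleton as the key
context.write(NullWritable.get(), value);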
Please help me create a UserInputFormat class in MapReduce to produce key-value pairs based on my need. I need to store the first character of the string as the key and the entire string as the value. How do I achieve this?
public static class UserInputFormat extends Mapper<Object, Text, Text, Text> { // define datatype of key:value = Text:Text

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String raw_String = value.toString();
        if (raw_String.length() > 0)
        {
            Text key_str = new Text(raw_String.substring(0, 1)); // get the first char of raw_String as key
            context.write(key_str, value); // key is the first character and value is the entire string
        }
    }
}
I think the above is what you need. It is a map task that receives a line of text as input and outputs a key:value pair of the form [first char]:[entire string].
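If it helps, a minimal driver sketch to wire this mapper into a map-only job might look like the following (the enclosing class name and the input/output paths are assumptions, not part of your code):

// Driver sketch (new API); FirstCharJob is a hypothetical enclosing class
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "first char key");
job.setJarByClass(FirstCharJob.class);
job.setMapperClass(UserInputFormat.class);   // the mapper shown above
job.setNumReduceTasks(0);                    // map-only: mapper output is written directly
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);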
If not, please make your question clearer.
In one of my classes I'm using a HashMap, and I'm calling that class inside my mapper, so each mapper has its own HashMap. Can I use all of those HashMaps in a single reducer? My HashMap has the filename as its key and a Set as its value, so each HashMap contains a filename and a Set. I want to take all the HashMaps containing the same filename, club all the values (Sets) together, and then write that HashMap to an HDFS file.
Yes, you can do that. If your mapper produces output in the form of a HashMap, you can use Hadoop's MapWritable as the mapper's value type.
For example:
public class MyMapper extends Mapper<LongWritable, Text, Text, MapWritable>
You have to convert your HashMap into a MapWritable:
MapWritable mapWritable = new MapWritable();
for (Map.Entry<String, String> entry : yourHashMap.entrySet()) {
    if (null != entry.getKey() && null != entry.getValue()) {
        mapWritable.put(new Text(entry.getKey()), new Text(entry.getValue()));
    }
}
Then write the MapWritable to your context:
ctx.write(new Text("my_key"), mapWritable);
In your Reducer class, you have to take MapWritable as your input value type:
public class MyReducer extends Reducer<Text, MapWritable, Text, Text>
public void reduce(Text key, Iterable<MapWritable> values, Context ctx) throws IOException, InterruptedException
Then iterate through the map and extract the values the way you want. For example:
for (MapWritable entry : values) {
    for (Entry<Writable, Writable> extractData : entry.entrySet()) {
        // your logic for the data will go here.
    }
}
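For the specific use case in the question (clubbing the Sets that belong to the same filename), the inner loop could merge everything into a plain java.util.Map before writing out. The sketch below assumes each Set was serialized into the MapWritable as a comma-joined Text value, which is an assumption on my part rather than something the answer above specifies:

// Sketch: merge per-filename sets across all incoming MapWritables (assumes comma-joined Text values)
Map<String, Set<String>> merged = new HashMap<String, Set<String>>();
for (MapWritable entry : values) {
    for (Entry<Writable, Writable> e : entry.entrySet()) {
        String fileName = e.getKey().toString();
        Set<String> set = merged.get(fileName);
        if (set == null) {
            set = new HashSet<String>();
            merged.put(fileName, set);
        }
        for (String element : e.getValue().toString().split(",")) {
            set.add(element);
        }
    }
}
// one output line per filename, with the clubbed set as the value
for (Map.Entry<String, Set<String>> m : merged.entrySet()) {
    ctx.write(new Text(m.getKey()), new Text(m.getValue().toString()));
}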
This is a question regarding the performance of Writable variables and allocation within a MapReduce step. Here is a reducer:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            context.write(key, new Text(val));
        }
    }
}
Or is this better performance-wise:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {

    private Text myText = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            myText.set(val);
            context.write(key, myText);
        }
    }
}
In Hadoop: The Definitive Guide, all the examples are in the first form, but I'm not sure whether that is for the sake of shorter code samples or because it is more idiomatic.
The book may use the first form because it is more concise. However, it is less efficient. For large input files, that approach will create a large number of objects. This excessive object creation would slow down your performance. Performance-wise, the second approach is preferable.
Some references that discuss this issue:
Tip 7 here,
On Hadoop object re-use, and
This JIRA.
Yes, the second approach is preferable if the reducer has a lot of data to process. The first approach keeps creating objects, and cleaning them up is left to the garbage collector.
I need to implement some functionality using MapReduce.
The requirement is mentioned below.
The input for the mapper is a file containing two columns: productId and salesCount.
The reducer's output is the sum of salesCount.
The requirement is that I need to calculate salesCount / sum(salesCount).
For this I am planning to use a nested MapReduce.
But for the second mapper I need to use the first reducer's output and the first map's input.
How can I implement this? Or is there an alternate way?
Regards
Vinu
You can use ChainMapper and ChainReducer to pipe Mappers and Reducers the way you want. Please have a look here.
The following is similar to the code snippet you would need to implement:
JobConf mapBConf = new JobConf(false);
JobConf reduceConf = new JobConf(false);

ChainMapper.addMapper(conf, FirstMapper.class, FirstMapperInputKey.class, FirstMapperInputValue.class,
        FirstMapperOutputKey.class, FirstMapperOutputValue.class, false, mapBConf);

ChainReducer.setReducer(conf, FirstReducer.class, FirstMapperOutputKey.class, FirstMapperOutputValue.class,
        FirstReducerOutputKey.class, FirstReducerOutputValue.class, true, reduceConf);

ChainReducer.addMapper(conf, SecondMapper.class, FirstReducerOutputKey.class, FirstReducerOutputValue.class,
        SecondMapperOutputKey.class, SecondMapperOutputValue.class, false, null);

ChainReducer.setReducer(conf, SecondReducer.class, SecondMapperOutputKey.class, SecondMapperOutputValue.class,
        SecondReducerOutputKey.class, SecondReducerOutputValue.class, true, reduceConf);
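For context, conf in the snippet above is the main JobConf of the chained job. A minimal sketch of the surrounding driver code, where the driver class name, job name, and paths are assumptions:

// Created before the ChainMapper/ChainReducer calls shown above
JobConf conf = new JobConf(ChainDriver.class);   // ChainDriver is a hypothetical driver class
conf.setJobName("salescount ratio");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

// ... the ChainMapper.addMapper / ChainReducer.setReducer calls from above go here ...

// Submitted after the chain has been configured
JobClient.runJob(conf);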
Or, if you don't want to use multiple Mappers and Reducers, you can do the following:
public static class ProductIndexerMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static Text productId = new Text();
    private static LongWritable salesCount = new LongWritable();

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        String[] values = value.toString().split("\t");
        productId.set(values[0]);
        salesCount.set(Long.parseLong(values[1]));
        output.collect(productId, salesCount);
    }
}
public static class ProductIndexerReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, DoubleWritable> {

    private static DoubleWritable productWritable = new DoubleWritable();

    @Override
    public void reduce(Text key, Iterator<LongWritable> values,
            OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        List<LongWritable> items = new ArrayList<LongWritable>();
        long total = 0;
        while (values.hasNext()) {
            // copy the value: Hadoop reuses the same Writable instance across next() calls
            LongWritable item = new LongWritable(values.next().get());
            total += item.get();
            items.add(item);
        }
        Iterator<LongWritable> newValues = items.iterator();
        while (newValues.hasNext()) {
            // use floating-point division, otherwise the ratio truncates to zero
            productWritable.set((double) newValues.next().get() / total);
            output.collect(key, productWritable);
        }
    }
}
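A sketch of the JobConf setup for this single job could look like the following (the driver class name, job name, and paths are assumptions):

// Driver sketch for the single-job version above (old mapred API)
JobConf conf = new JobConf(ProductRatioDriver.class);   // hypothetical driver class
conf.setJobName("product sales ratio");
conf.setMapperClass(ProductIndexerMapper.class);
conf.setReducerClass(ProductIndexerReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(LongWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(DoubleWritable.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);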
With the use case at hand, I believe we don't need two different mappers/MapReduce jobs to achieve this. (This is an extension to the answer given in the comments above.)
Let's assume you have a very large input file split into multiple blocks in HDFS. When you trigger a MapReduce job with this file as input, multiple mappers (equal to the number of input splits, typically one per block) will start executing in parallel.
In your mapper implementation, read each line from the input and write the productId as the key and the saleCount as the value to the context. This data is passed to the Reducer.
We know that in an MR job all the data with the same key is passed to the same reducer. Now, in your reducer implementation, you can calculate the sum of all saleCounts for a particular productId.
Note: I'm not sure about the value 'salescount' in your numerator.
Assuming that it's the count of the number of occurrences of a particular product, use a counter to accumulate the total sales count in the same for loop where you are calculating SUM(saleCount). So we have:
totalCount -> Count of number of occurrences of a product
sumSaleCount -> Sum of saleCount value for each product.
Now, you can directly divide the above values: totalCount/sumSaleCount.
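A minimal sketch of such a reducer in the new API (the class name is an assumption; it assumes the mapper emits productId as Text and saleCount as LongWritable):

// Sketch: computes totalCount / sumSaleCount per productId
public static class RatioReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
    private final DoubleWritable ratio = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long totalCount = 0;     // number of occurrences of this product
        long sumSaleCount = 0;   // SUM(saleCount) for this product
        for (LongWritable value : values) {
            totalCount++;
            sumSaleCount += value.get();
        }
        if (sumSaleCount != 0) {
            ratio.set((double) totalCount / sumSaleCount);
            context.write(key, ratio);
        }
    }
}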
Hope this helps! Please let me know if you have a different use case in mind.