Read values wrapped in Hadoop ArrayWritable - hadoop

I am new to Hadoop and Java. My mapper outputs Text and ArrayWritable, and I am having trouble reading the ArrayWritable values: I am unable to cast the .get() values to int. The mapper and reducer code are attached. Can someone please help me correct my reducer code so that it reads the ArrayWritable values?
public static class Temp2Mapper extends Mapper<LongWritable, Text, Text, ArrayWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String date = line.substring(07, 14);
        int maxTemp, minTemp, avgTemp;
        IntArrayWritable carrier = new IntArrayWritable();
        IntWritable[] innercarrier = new IntWritable[3];
        maxTemp = Integer.parseInt(line.substring(39, 45));
        minTemp = Integer.parseInt(line.substring(47, 53));
        avgTemp = Integer.parseInt(line.substring(63, 69));
        if (maxTemp != MISSING)
            innercarrier[0] = new IntWritable(maxTemp); // maximum temperature
        if (minTemp != MISSING)
            innercarrier[1] = new IntWritable(minTemp); // minimum temperature
        if (avgTemp != MISSING)
            innercarrier[2] = new IntWritable(avgTemp); // average temperature of 24 hours
        carrier.set(innercarrier);
        context.write(new Text(date), carrier); // output Text and ArrayWritable
    }
}
public static class Temp2Reducer extends Reducer<Text, ArrayWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<ArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        int[] arr = new int[3];
        for (ArrayWritable val : values) {
            arr = (Int) val.get(); // Error: cannot cast Writable to int
            max = Math.max(max, arr[0]);
        }
        context.write(key, new IntWritable(max));
    }
}

ArrayWritable#get method returns an array of Writable.
You can't cast an array of Writable to int. What you can do is:
iterate over this array
cast each item (which will be of type Writable) of the array to IntWritable
use IntWritable#get method to get the int value.
for (ArrayWritable val : values) {
    for (Writable writable : val.get()) { // iterate
        IntWritable intWritable = (IntWritable) writable; // cast
        int value = intWritable.get(); // get
        // do your thing with the int value
    }
}
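Applied to the reducer in the question, a minimal sketch might look like the following. It assumes IntArrayWritable is the usual small ArrayWritable subclass (Hadoop needs one with a no-arg constructor so it can deserialize the values), that the job declares it as the map output value class, and that the mapper filled all three slots:

public static class IntArrayWritable extends ArrayWritable {
    // No-arg constructor telling ArrayWritable which element type to deserialize.
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}

public static class Temp2Reducer extends Reducer<Text, IntArrayWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntArrayWritable val : values) {
            Writable[] items = val.get();                 // array of Writable
            IntWritable maxTemp = (IntWritable) items[0]; // cast: slot 0 holds the maximum temperature
            max = Math.max(max, maxTemp.get());           // get the int and keep the largest
        }
        context.write(key, new IntWritable(max));
    }
}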

Related

What exactly is output of mapper and reducer function

This is a follow-up question to Extracting rows containing specific value using mapReduce and hadoop.
Mapper function
public static class MapForWordCount extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable saleValue = new IntWritable();
    private Text rangeValue = new Text();

    public void map(Object key, Text value, Context con) throws IOException, InterruptedException
    {
        String line = value.toString();
        String[] words = line.split(",");
        for (String word : words)
        {
            if (words[3].equals("40")) {
                saleValue.set(Integer.parseInt(words[0]));
                rangeValue.set(words[3]);
                con.write(rangeValue, saleValue);
            }
        }
    }
}
Reducer function
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable result = new IntWritable();

    public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
    {
        for (IntWritable value : values)
        {
            result.set(value.get());
            con.write(word, result);
        }
    }
}
Output obtained is
40 105
40 105
40 105
40 105
EDIT 1 :
But the Expected output is
40 102
40 104
40 105
What am I doing wrong ?
What exactly is happening here in mapper and reducer function ?
In the context of the original question - you don't need the loop, neither in the mapper nor in the reducer, as you are duplicating entries:
public static class MapForWordCount extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable saleValue = new IntWritable();
    private Text rangeValue = new Text();

    public void map(Object key, Text value, Context con) throws IOException, InterruptedException
    {
        String line = value.toString();
        String[] words = line.split(",");
        if (words[3].equals("40")) {
            saleValue.set(Integer.parseInt(words[0]));
            rangeValue.set(words[3]);
            con.write(rangeValue, saleValue);
        }
    }
}
And in the reducer, as suggested by @Serhiy in the original question, you need only one line of code:
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable result = new IntWritable();

    public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
    {
        con.write(word, null);
    }
}
Regarding "Edit 1" - I will leave it as a trivial exercise :)
What exactly is happening
You are consuming lines of comma-delimited text, splitting on the commas, and filtering out some values. con.write() should only be called once per line if all you are doing is extracting those values.
The framework will group all the "40" keys that the mapper outputs and form a list of all the values that were written with that key, and that is what the reducer iterates over.
You should probably try this for your map function.
// Set the values to write
saleValue.set(Integer.parseInt(words[0]));
rangeValue.set(words[3]);
// Filter out only the 40s
if (words[3].equals("40")) {
    // Write out "(40, saleValue)" words.length times
    for (String word : words)
    {
        con.write(rangeValue, saleValue);
    }
}
If you don't want duplicate values for the length of the split string, then get rid of the for loop.
All your reducer is doing is just printing out what it received from the mapper.
Mapper output would be something like this:
<word, count>
Reducer output would be like this:
<unique word, its total count>
E.g. a line is read and all words in it are counted and put in <key,value> pairs:
<40,1>
<140,1>
<50,1>
<40,1> ..
Here 40, 50, 140, ... are all keys and the value is the count of the number of occurrences of that key in a line. This happens in the mapper.
Then, these key-value pairs are sent to the reducer, where similar keys are all reduced to a single key and all the values associated with that key are summed to give the value of the key-value pair. So, the result of the reducer would be something like:
<40,10>
<50,5>
...
In your case, the reducer isn't doing anything. The unique values/words found by the mapper are just given out as the output.
Ideally, you are supposed to reduce and get an output like "40, 150", i.e. the key 40 together with the total of all the values found for it.
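For contrast, a reducer that actually aggregates (instead of echoing what the mapper sent) would typically sum the values per key; a minimal sketch, reusing the Text/IntWritable types from the question (the class name SumReducer is just illustrative):

public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context con)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up every value written for this key
        }
        result.set(sum);
        con.write(key, result); // one line per key, e.g. "40  <total>"
    }
}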

Part of key changes when iterating through values when using composite key - Hadoop

I have implemented Secondary sort on Hadoop and I don't really understand the behavior of the framework.
I have created a composite key which contains the original key and part of the value, which is used for sorting.
To achieve this I have implemented my own partitioner:
public class CustomPartitioner extends Partitioner<CoupleAsKey, LongWritable> {

    @Override
    public int getPartition(CoupleAsKey couple, LongWritable value, int numPartitions) {
        return Long.hashCode(couple.getKey1()) % numPartitions;
    }
}
My own group comparator
public class GroupComparator extends WritableComparator {

    protected GroupComparator()
    {
        super(CoupleAsKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CoupleAsKey c1 = (CoupleAsKey) w1;
        CoupleAsKey c2 = (CoupleAsKey) w2;
        return Long.compare(c1.getKey1(), c2.getKey1());
    }
}
And defined the couple in the following way
public class CoupleAsKey implements WritableComparable<CoupleAsKey> {

    private long key1;
    private long key2;

    public CoupleAsKey() {
    }

    public CoupleAsKey(long key1, long key2) {
        this.key1 = key1;
        this.key2 = key2;
    }

    public long getKey1() {
        return key1;
    }

    public void setKey1(long key1) {
        this.key1 = key1;
    }

    public long getKey2() {
        return key2;
    }

    public void setKey2(long key2) {
        this.key2 = key2;
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeLong(key1);
        output.writeLong(key2);
    }

    @Override
    public void readFields(DataInput input) throws IOException {
        key1 = input.readLong();
        key2 = input.readLong();
    }

    @Override
    public int compareTo(CoupleAsKey o2) {
        int cmp = Long.compare(key1, o2.getKey1());
        if (cmp != 0)
            return cmp;
        return Long.compare(key2, o2.getKey2());
    }

    @Override
    public String toString() {
        return key1 + "," + key2 + ",";
    }
}
And here is the driver
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(SSDriver.class);
job.setMapperClass(SSMapper.class);
job.setReducerClass(SSReducer.class);
job.setMapOutputKeyClass(CoupleAsKey.class);
job.setMapOutputValueClass(LongWritable.class);
job.setPartitionerClass(CustomPartitioner.class);
job.setGroupingComparatorClass(GroupComparator.class);
FileInputFormat.addInputPath(job, new Path("/home/marko/WORK/Whirlpool/input.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/marko/WORK/Whirlpool/output"));
job.waitForCompletion(true);
Now, this works, but what is really strange is that while iterating in the reducer over the values for a key, the second part of the key (the value part) changes in each iteration. Why and how?
@Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
    for (LongWritable value : values) {
        // key.key2 changes during iterations, why?
        context.write(key, value);
    }
}
The definition says that "if you want all your relevant rows within a partition of data sent to a single reducer you must implement a grouping comparator". This only ensures that that set of keys is sent to a single reduce call, not that the key will change from the composite form to something that contains only the part of the key on which the grouping was done.
However, when you iterate over the values, the corresponding keys will also change. We normally do not observe this happening because, by default, the values are grouped on the same (non-composite) key, and thus, even when the value changes, the (value of the) key remains the same.
You can try printing the object reference of the key, and you shall notice that with every iteration, the object reference of the key is also changing (like this:)
IntWritable#1235ft
IntWritable#6635gh
IntWritable#9804as
Alternatively, you can also try applying a group comparator on an IntWritable in the following way (you will have to write your own logic to do so):
Group1:
1 a
1 b
2 c
Group2:
3 c
3 d
4 a
and you shall see that with every iteration of value, your key is also changing.
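If you need to keep the composite key that belongs to a particular value (for example to compare it with the next one), one option is to copy it inside the loop, since the key's contents change as the iterator advances; a minimal sketch, assuming the CoupleAsKey class above:

@Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
    for (LongWritable value : values) {
        // Snapshot the current composite key before moving on to the next value.
        CoupleAsKey snapshot = new CoupleAsKey(key.getKey1(), key.getKey2());
        context.write(snapshot, value);
    }
}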

hadoop input data problems

I'm having trouble with the map functions:
The original data is stored in a TSV file.
I just want the last two columns saved. In each line, the first column is the original node (383), the second is the target (4575), and the third is the weight (1):
383 4575 1
383 4764 1
383 5458 1
383 5491 1
public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException {
    String line = value.toString();
    String[] tokens = line.split("t");
    int weight = Integer.parseInt(tokens[2]);
    int target = Integer.parseInt(tokens[0]);
}
Here is my code:
public void map(LongWritable key, Text value, Context context) throws IOException InterruptedException
{
    String line = value.toString();
    // split the tsv file
    String[] tokens = line.split("/t");
    // save the weight and target
    private Text target = Integer.parsetxt(tokens[0]);
    int weight = Integer.parseInt(tokens[2]);
    context.write(new Text(target), new Intwritable(weight));
}
}
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        // initialize the count variable
        int weightsum = 0;
        for (IntWritable value : values) {
            weightsum += value.get();
        }
        context.write(key, new IntWritable(weightsum));
    }
}
String[] tokens = line.split("t");
should be
String[] tokens = line.split("\t");
Or split on whitespace:
String[] tokens = line.split("\\s+");
Also, the target is the second column, so use tokens[1]:
Text target = new Text(tokens[1]);
int weight = Integer.parseInt(tokens[2]);
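Putting those corrections together, a minimal sketch of the map method (assuming the new-API Mapper<LongWritable, Text, Text, IntWritable>, with the target in the second column and the weight in the third):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    // Split the TSV line on tabs; use "\\s+" instead to split on any whitespace.
    String[] tokens = line.split("\t");
    String target = tokens[1];                // second column: target node
    int weight = Integer.parseInt(tokens[2]); // third column: weight
    context.write(new Text(target), new IntWritable(weight));
}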

Can I get a Partition number of Hadoop?

I am a hadoop newbie.
I want to get the partition number in the output file.
At first, I made a customized partitioner.
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;
    }
}
It works. But I want to output the partition numbers 'visually' in the Reducer.
How can I get the partition number?
Below is my reducer source.
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
    private Text textList = new Text();

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String list = new String();
        for (LongWritable value : values) {
            list = new String(list + "\t" + value.toString());
        }
        textList.set(list);
        context.write(key, textList);
    }
}
I want to append the partition number to 'list' for each value. It will be '0' or '1'.
list = new String(list + "\t" + value.toString() + "\t" + ??);
It would be great if someone helps me.
+
Thanks to the answer, I got a solution. But it didn't work, and I think I did something wrong.
Below is the modified MyPartitioner.
public static class MyPartitioner extends Partitioner {
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;

        private int bring_num = 0;

        public void configure(JobConf job) {
            bring_num = jobConf.getInt(numOfChars & numReduceTasks);
        }
    }
}
Add the code below to the Reducer class to get the partition number into a class variable, which can later be used in the reduce method.
String partition;

protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    partition = conf.get("mapred.task.partition");
}
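With that in place, the partition number can be appended inside the reduce method; a minimal sketch based on the reducer from the question (note that newer Hadoop releases expose the same value under "mapreduce.task.partition"):

public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
    private Text textList = new Text();
    private String partition;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The reduce task's partition number, as set by the framework.
        partition = context.getConfiguration().get("mapred.task.partition");
    }

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (LongWritable value : values) {
            list.append("\t").append(value.toString()).append("\t").append(partition);
        }
        textList.set(list.toString());
        context.write(key, textList);
    }
}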

Shouldn't Hadoop group <key, (list of values) in reducer only based on hashCode?

I decided to create my own WritableComparable class to learn how Hadoop works with it. So I created an Order class with two instance variables (orderNumber and cliente) and implemented the required methods. I also used the Eclipse generators for getters/setters/hashCode/equals/toString.
In compareTo, I decided to use only the orderNumber variable.
I created a simple MapReduce job only to count the occurrences of an order in a dataset. By mistake one of my test records is Ita instead of Itá, as you can see here:
123 Ita
123 Itá
123 Itá
345 Carol
345 Carol
345 Carol
345 Carol
456 Iza Smith
As I understand it, the first record should be treated as a different order, because record 1's hashCode is different from the hashCodes of records 2 and 3.
But in the reduce phase the 3 records are grouped together, as you can see here:
Order [cliente=Ita, orderNumber=123] 3
Order [cliente=Carol, orderNumber=345] 4
Order [cliente=Iza Smith, orderNumber=456] 1
I thought it should have a line for the Itá records with count 2, and Ita should have count 1.
Well, as I used only orderNumber in compareTo, I tried using the String cliente in this method (commented out in the code below), and then it worked as I was expecting.
So, is that an expected result? Shouldn't Hadoop use only hashCode to group keys and their values?
Here is the Order class (I omitted the getters/setters):
public class Order implements WritableComparable<Order>
{
    private String cliente;
    private long orderNumber;

    @Override
    public void readFields(DataInput in) throws IOException
    {
        cliente = in.readUTF();
        orderNumber = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException
    {
        out.writeUTF(cliente);
        out.writeLong(orderNumber);
    }

    @Override
    public int compareTo(Order o) {
        long thisValue = this.orderNumber;
        long thatValue = o.orderNumber;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        //return this.cliente.compareTo(o.cliente);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((cliente == null) ? 0 : cliente.hashCode());
        result = prime * result + (int) (orderNumber ^ (orderNumber >>> 32));
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Order other = (Order) obj;
        if (cliente == null) {
            if (other.cliente != null)
                return false;
        } else if (!cliente.equals(other.cliente))
            return false;
        if (orderNumber != other.orderNumber)
            return false;
        return true;
    }

    @Override
    public String toString() {
        return "Order [cliente=" + cliente + ", orderNumber=" + orderNumber + "]";
    }
}
Here is the MapReduce code:
public class TesteCustomClass extends Configured implements Tool
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Order, LongWritable>
    {
        LongWritable outputValue = new LongWritable();
        String[] campos;
        Order order = new Order();

        @Override
        public void configure(JobConf job)
        {
        }

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Order, LongWritable> output, Reporter reporter) throws IOException
        {
            campos = value.toString().split("\t");
            order.setOrderNumber(Long.parseLong(campos[0]));
            order.setCliente(campos[1]);
            outputValue.set(1L);
            output.collect(order, outputValue);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Order, LongWritable, Order, LongWritable>
    {
        @Override
        public void reduce(Order key, Iterator<LongWritable> values, OutputCollector<Order, LongWritable> output, Reporter reporter) throws IOException
        {
            LongWritable value = new LongWritable(0);
            while (values.hasNext())
            {
                value.set(value.get() + values.next().get());
            }
            output.collect(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), TesteCustomClass.class);
        conf.setMapperClass(Map.class);
        // conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setJobName("Teste - Custom Classes");
        conf.setOutputKeyClass(Order.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TesteCustomClass(), args);
        System.exit(res);
    }
}
The default partitioner is the HashPartitioner, which uses the hashCode method to determine which reducer to send the K,V pair to.
Once in the reducer (or if you're using a Combiner, which is run map-side), the compareTo method is used to sort the keys and then also used (by default) to determine whether sequential keys should be grouped together and their associated values reduced in the same iteration.
If you use only your orderNumber variable, and not the cliente variable, in your compareTo method, then any key with the same orderNumber will have its values reduced together - regardless of the cliente value (which is what you're currently observing).
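So, to have Ita and Itá counted separately, the comparison used for grouping must also look at cliente; a minimal sketch of the adjusted compareTo, combining the orderNumber comparison with the cliente comparison the question has commented out:

@Override
public int compareTo(Order o) {
    int cmp = Long.compare(this.orderNumber, o.orderNumber);
    if (cmp != 0) {
        return cmp;
    }
    // Break ties on cliente so orders with the same number but different
    // clients are not grouped into the same reduce call.
    return this.cliente.compareTo(o.cliente);
}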
