mapreduce - coding for multiple keys and values - hadoop

I need to emit two keys and two values from my mapper. Could you please tell me how to write the code and which data types to use? For example:
key = { store_id : this.store_id,
product_id : this.product_id };
value = { quantity : this.quantity,
price : this.price,
count : this.count };
emit(key, value);
Regards
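One common way to model this (just a rough sketch using the field names from the question, not necessarily what the answer below does) is a custom WritableComparable for the two-field key, with the value emitted as a MapWritable or a similar custom Writable holding quantity, price and count:
import java.io.*;
import org.apache.hadoop.io.*;

// Sketch only: a composite key holding store_id and product_id.
public class StoreProductKey implements WritableComparable<StoreProductKey> {
    private Text storeId = new Text();
    private Text productId = new Text();

    public StoreProductKey() {}

    public StoreProductKey(String storeId, String productId) {
        this.storeId.set(storeId);
        this.productId.set(productId);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        storeId.write(out);
        productId.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        storeId.readFields(in);
        productId.readFields(in);
    }

    @Override
    public int compareTo(StoreProductKey o) {
        int cmp = storeId.compareTo(o.storeId);
        return cmp != 0 ? cmp : productId.compareTo(o.productId);
    }

    @Override
    public int hashCode() {
        return storeId.hashCode() * 163 + productId.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StoreProductKey
                && storeId.equals(((StoreProductKey) o).storeId)
                && productId.equals(((StoreProductKey) o).productId);
    }
}
The mapper would then be declared as Mapper<LongWritable, Text, StoreProductKey, MapWritable>, and each value built as a MapWritable mapping Text("quantity"), Text("price") and Text("count") to the corresponding writables before calling context.write(key, value).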

As per the given example input
A B B C A R A D S D A C A R S D F A B
the mapper emits:
key - A
values - A, AB
key - B
values - B, BB
key - B
values - B, BC
key - C
values - C, CA
and so on...
In the reducer, you get the grouped values:
key - A
values - A, AB, A, AR, A, AD, A, AC and so on
key - B
values - B, BB, B, BC and so on
Add a delimiter of your choice between the two words/letters.
For each key in the reducer, you can use a HashMap/MapWritable to track the occurrence count of each value, for example:
A - 5 times
AB - 7 times
etc.
Then you can calculate the ratio.
Sample Mapper Implementation
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] valueSplits = value.toString().split(" ");
        for (int i = 0; i < valueSplits.length; i++) {
            // emit the pair "word~nextWord" keyed by the current word
            if (i != valueSplits.length - 1) {
                context.write(new Text(valueSplits[i]), new Text(valueSplits[i] + "~" + valueSplits[i + 1]));
            }
            // also emit the word by itself so the reducer can count single occurrences
            context.write(new Text(valueSplits[i]), new Text(valueSplits[i]));
        }
    }
}
Sample Reducer Implementation
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TestReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // count how often each value (single word or word~word pair) occurs for this key
        Map<String, Integer> countMap = new HashMap<String, Integer>();
        for (Text t : values) {
            String value = t.toString();
            int count = 0;
            if (countMap.containsKey(value)) {
                count = countMap.get(value);
                count += 1;
            } else {
                count = 1;
            }
            countMap.put(value, count);
        }
        for (String s : countMap.keySet()) {
            if (s.equalsIgnoreCase(key.toString())) {
                // skip the single-word entry itself; it only supplies the denominator
            } else {
                int keyCount = countMap.get(s.split("~")[0]);
                int occurrence = countMap.get(s);
                context.write(new Text(key.toString() + " , " + s),
                        new Text(String.valueOf((float) occurrence / (float) keyCount)));
            }
        }
    }
}
For an input of
A A A B
the reducer would emit
A , A~A 0.6666667
A , A~B 0.33333334
A~A appears 2 times, A~B 1 time and A 3 times.
A~A is hence 2/3
A~B is hence 1/3
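If it helps, here is a rough driver sketch that wires the sample mapper and reducer together (assuming a reasonably recent Hadoop API with Job.getInstance; the class name and paths are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver for the TestMapper/TestReducer pair above.
public class TestDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word pair ratios");
        job.setJarByClass(TestDriver.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}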

Related

MapReduce sort by value in descending order

I'm trying to write, in pseudocode, a MapReduce task that returns the items sorted in descending order by value. For example, for the wordcount task, instead of getting:
apple 1
banana 3
mango 2
I want the output to be:
banana 3
mango 2
apple 1
Any ideas on how to do it? I know how to do it in ascending order (swap the keys and values in the mapper job) but not in descending order.
You can take help of the reducer code below to achieve sorting in descending order.
It assumes you have already written the mapper and driver code, where the mapper produces output such as (Banana, 1).
In the reducer we sum all values for a particular key and put the final result in a map; then we sort that map by value and write the final result in the reducer's cleanup method.
Please see the code below for further understanding:
import java.io.IOException;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Word_Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Change access modifier as per your need
    public Map<String, Integer> map = new LinkedHashMap<String, Integer>();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // sum the counts for this word and store the reduced value in the map
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        map.put(key.toString(), count);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // cleanup is called once at the end of the reducer; write the final, sorted output here
        Map<String, Integer> sortedMap = sortMap(map);
        for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }

    public Map<String, Integer> sortMap(Map<String, Integer> unsortMap) {
        Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();
        int count = 0;
        List<Map.Entry<String, Integer>> list =
                new LinkedList<Map.Entry<String, Integer>>(unsortMap.entrySet());
        // sort the list we created from the unsorted map
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                // sorting in descending order
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        for (Map.Entry<String, Integer> entry : list) {
            // only writing the top 3 into the sorted map
            if (count > 2) {
                break;
            }
            sortedMap.put(entry.getKey(), entry.getValue());
            count++;
        }
        return sortedMap;
    }
}

How to sort comma-separated keys in Reducer output?

I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=BigInteger, F=BigInteger, M=BigDecimal, and the value is also a Text representing the Customer_ID. I know that Hadoop sorts output based on keys, but my final result is a bit weird. I want the output keys to be sorted by R first, then F and then M. But I am getting the following output sort order for unknown reasons:
545,1,7652 100000
545,23,390159.402343750 100001
452,13,132586 100002
452,4,32202 100004
452,1,9310 100007
452,1,4057 100018
452,3,18970 100021
But I want the following output:
545,23,390159.402343750 100001
545,1,7652 100000
452,13,132586 100002
452,4,32202 100004
452,3,18970 100021
452,1,9310 100007
452,1,4057 100018
NOTE: The Customer_ID was the key in the Map phase, and all the RFM values belonging to a particular Customer_ID are brought together at the Reducer for aggregation.
So after a lot of searching I found some useful material, a compilation of which I am posting now:
You have to start with your custom data type. Since I had three comma-separated values which needed to be sorted in descending order, I had to create a TextQuadlet.java data type in Hadoop. The reason I am creating a quadlet is that the first part of the key will be the natural key and the remaining three parts will be R, F, M:
import java.io.*;
import org.apache.hadoop.io.*;
public class TextQuadlet implements WritableComparable<TextQuadlet> {
private String customer_id;
private long R;
private long F;
private double M;
public TextQuadlet() {
}
public TextQuadlet(String customer_id, long R, long F, double M) {
set(customer_id, R, F, M);
}
public void set(String customer_id2, long R2, long F2, double M2) {
this.customer_id = customer_id2;
this.R = R2;
this.F = F2;
this.M=M2;
}
public String getCustomer_id() {
return customer_id;
}
public long getR() {
return R;
}
public long getF() {
return F;
}
public double getM() {
return M;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.customer_id);
out.writeLong(this.R);
out.writeLong(this.F);
out.writeDouble(this.M);
}
@Override
public void readFields(DataInput in) throws IOException {
this.customer_id = in.readUTF();
this.R = in.readLong();
this.F = in.readLong();
this.M = in.readDouble();
}
// This hashCode function is important as it is used by the custom
// partitioner for this class.
@Override
public int hashCode() {
return (int) (customer_id.hashCode() * 163 + R + F + M);
}
@Override
public boolean equals(Object o) {
if (o instanceof TextQuadlet) {
TextQuadlet tp = (TextQuadlet) o;
return customer_id.equals(tp.customer_id) && R == tp.R && F == tp.F && M == tp.M;
}
return false;
}
@Override
public String toString() {
return customer_id + "," + R + "," + F + "," + M;
}
// Returning a negative value means this key sorts before the key passed in,
// a positive value means it sorts after, and 0 means the two are equal.
@Override
public int compareTo(TextQuadlet tp) {
// Here my natural key is customer_id and I don't even take it into
// consideration.
// So as you might have concluded, I am sorting R,F,M in descending order.
if (this.R != tp.R) {
if(this.R < tp.R) {
return 1;
}
else{
return -1;
}
}
if (this.F != tp.F) {
if(this.F < tp.F) {
return 1;
}
else{
return -1;
}
}
if (this.M != tp.M){
if(this.M < tp.M) {
return 1;
}
else{
return -1;
}
}
return 0;
}
public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
int cmp = tp1.compareTo(tp2);
return cmp;
}
public static int compare(Text customer_id1, Text customer_id2) {
int cmp = customer_id1.compareTo(customer_id2);
return cmp;
}
}
Next, you'll need a custom partitioner so that all the values which have the same key end up at the same reducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
@Override
public int getPartition(TextQuadlet key, Text value, int numPartitions) {
// mask the sign bit so the partition index is never negative
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
Thirdly, you'll need a custom grouping comparator so that all the values are grouped together by their natural key, which is customer_id, and not by the composite key, which is customer_id,R,F,M:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class GroupComparator_RFM_N extends WritableComparator {
protected GroupComparator_RFM_N() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// Here we tell hadoop to group the keys by their natural key.
return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
}
}
Fourthly, you'll need a key comparator which will again sort the keys by R, F, M in descending order, implementing the same sort logic used in TextQuadlet.java. Since I got lost while coding, I slightly changed the way I compare the data types in this function, but the underlying logic is the same as in TextQuadlet.java:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class KeyComparator_RFM extends WritableComparator {
protected KeyComparator_RFM() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// Returning a negative value places w1 before w2 in the sort order,
// a positive value places it after, and 0 treats them as equal.
// Since R, F and M should be sorted in descending order, we return 1 when
// w1's field is smaller than w2's and -1 when it is larger.
if(ip1.getR() == ip2.getR()){
if(ip1.getF() == ip2.getF()){
if(ip1.getM() == ip2.getM()){
return 0;
}
else{
if(ip1.getM() < ip2.getM())
return 1;
else
return -1;
}
}
else{
if(ip1.getF() < ip2.getF())
return 1;
else
return -1;
}
}
else{
if(ip1.getR() < ip2.getR())
return 1;
else
return -1;
}
}
}
And finally, in your driver class, you'll have to register these custom classes. Here I have used TextQuadlet, Text as the key-value pair, but you can choose other classes depending on your needs:
job.setPartitionerClass(FirstPartitioner_RFM.class);
job.setSortComparatorClass(KeyComparator_RFM.class);
job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
job.setMapOutputKeyClass(TextQuadlet.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(TextQuadlet.class);
job.setOutputValueClass(Text.class);
Do correct me if I am technically going wrong somewhere in the code or in the explanation; I have based this answer purely on my personal understanding of what I read on the internet, and it works perfectly for me.

Why is TreeMap reset after every reduce method?

In my reduce method, I want to use the TreeMap variable reducedMap to aggregate the incoming key values. However, this map loses its state with every reduce call. Subsequently, Hadoop prints only the very last value (plus the test values I added) that is put into the TreeMap. Why is that? It works as I intend in my map method.
public static class TopReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
private TreeMap<Text, Integer> reducedMap = new TreeMap<Text, Integer>();
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
String strValues = "";
for (IntWritable value : values) {
sum += value.get();
strValues += value.get() + ", ";
}
System.out.println("Map size Before: " +reducedMap);
Integer val = sum;
if (reducedMap.containsKey(key))
val += reducedMap.get(key);
// Only add, if value is of top 30.
reducedMap.put(key, val);
System.out.println("Map size After: " +reducedMap);
reducedMap.put(new Text("test"), 77777);
System.out.println("REDUCER: rcv: (" + key + "), " + "(" + sum
+ "), (" + strValues + "):: new (" + val + ")");
}
/**
* Flush top 30 context to the next phase.
*/
@Override
protected void cleanup(Context context) throws IOException,
InterruptedException {
System.out.println("-----FLUSHING TOP " + TOP_N
+ " MAPPING RESULTS-------");
System.out.println("MapSize: " + reducedMap);
int i = 0;
for (Entry<Text, Integer> entry : entriesSortedByValues(reducedMap)) {
System.out.println("key " + entry.getKey() + ", value "
+ entry.getValue());
context.write(entry.getKey(), new IntWritable(entry.getValue()));
if (i >= TOP_N)
break;
else
i++;
}
}
}
Hadoop re-uses object references for efficiency purposes - so when you call reducedMap.put(key, val) the key value will match a key already in the map (because Hadoop had just replaced the contents of your key object, not given you a new reference to a new object with new contents). It's effectively the same as calling the following:
Text key = new Text("x");
reducedMap.put(key, val); // map will be of size 1
key.set("y");
reducedMap.put(key, val); // map will still be of size 1
// as it will be comparing key to the itself
// and just updating the mapped value val
You need to make a deep copy of your key before putting it into the map:
reducedMap.put(new Text(key), val)

hadoop pig bag subtraction

I'm using Pig to parse my application logs to find out which exposed methods have been called by a user but were not called by that same user during the previous month.
I have managed to get the methods called, grouped by user, before last month and after last month:
BEFORE last month relation sample
u1 {(m1),(m2)}
u2 {(m3),(m4)}
AFTER last month relation sample
u1 {(m1),(m3)}
u2 {(m1),(m4)}
What I want is to find, per user, which methods are in AFTER but not in BEFORE, that is:
NEWLY_CALLED expected result
u1 {(m3)}
u2 {(m1)}
Question: how can I do that in Pig? Is it possible to subtract bags?
I have tried the DIFF function but it does not perform the expected subtraction.
Regards,
Joel
I think you need to write a UDF; inside it you can work with plain Java sets, for example:
Set<Tuple> setA = ...
Set<Tuple> setB = ...
setA.removeAll(setB); // setA now contains A minus B
For those who might be interested, here is the subtract function. I wrote the class below and proposed it to Pig (PIG-2881):
import static java.lang.String.format;
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
/**
* Subtract takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.<br>
* Null bag arguments are replaced by empty bags.
* <p>
* The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously.
* <br>
* If that is not the case the UDF will still function, but it will be <strong>very</strong> slow.
*/
public class Subtract extends EvalFunc<DataBag> {
/**
* Compares the two bag fields from the input Tuple and returns a new bag composed of the elements of the first bag not in the second bag.
* @param input a tuple with exactly two bag fields.
* @throws IOException if there are not exactly two fields in the tuple or if they are not {@link DataBag}.
*/
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.size() != 2) {
throw new ExecException("Subtract expected two inputs but received " + input.size() + " inputs.");
}
DataBag bag1 = toDataBag(input.get(0));
DataBag bag2 = toDataBag(input.get(1));
return subtract(bag1, bag2);
}
private static String classNameOf(Object o) {
return o == null ? "null" : o.getClass().getSimpleName();
}
private static DataBag toDataBag(Object o) throws ExecException {
if (o == null) {
return BagFactory.getInstance().newDefaultBag();
}
if (o instanceof DataBag) {
return (DataBag) o;
}
throw new ExecException(format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
}
private static DataBag subtract(DataBag bag1, DataBag bag2) {
DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
// convert each bag to Set, this does make the assumption that the sets will fit in memory.
Set<Tuple> set1 = toSet(bag1);
// remove elements of bag2 from set1
Iterator<Tuple> bag2Iterator = bag2.iterator();
while (bag2Iterator.hasNext()) {
set1.remove(bag2Iterator.next());
}
// set1 now contains all elements of bag1 not in bag2 => we can build the resulting DataBag.
for (Tuple tuple : set1) {
subtractBag2FromBag1.add(tuple);
}
return subtractBag2FromBag1;
}
private static Set<Tuple> toSet(DataBag bag) {
Set<Tuple> set = new HashSet<Tuple>();
Iterator<Tuple> iterator = bag.iterator();
while (iterator.hasNext()) {
set.add(iterator.next());
}
return set;
}
}
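To sanity-check the UDF outside of a Pig script, you can call exec directly with two small bags built through the standard TupleFactory/BagFactory APIs. This is only an illustrative sketch (the class name is made up, and the m1/m2/m3 values mirror the relations above):
import java.io.IOException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch: feed the Subtract UDF two small bags directly and check the result.
public class SubtractSmokeTest {
    public static void main(String[] args) throws IOException {
        TupleFactory tf = TupleFactory.getInstance();
        BagFactory bf = BagFactory.getInstance();
        DataBag after = bf.newDefaultBag();  // {(m1),(m3)}
        after.add(tf.newTuple((Object) "m1"));
        after.add(tf.newTuple((Object) "m3"));
        DataBag before = bf.newDefaultBag(); // {(m1),(m2)}
        before.add(tf.newTuple((Object) "m1"));
        before.add(tf.newTuple((Object) "m2"));
        Tuple input = tf.newTuple(2);
        input.set(0, after);
        input.set(1, before);
        DataBag newlyCalled = new Subtract().exec(input);
        System.out.println(newlyCalled); // expected: {(m3)}
    }
}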

Hadoop seems to modify my key object during an iteration over values of a given reduce call

Hadoop Version: 0.20.2 (On Amazon EMR)
Problem: I have a custom key that I write during the map phase, which I have added below. During the reduce call, I do some simple aggregation on the values for a given key. The issue I am facing is that during the iteration over values in the reduce call, my key changes and I then get the values of that new key.
My key type:
class MyKey implements WritableComparable<MyKey>, Serializable {
private MyEnum type; //MyEnum is a simple enumeration.
private TreeMap<String, String> subKeys = new TreeMap<String, String>();
MyKey() {} //for hadoop
public MyKey(MyEnum t, Map<String, String> sK) { type = t; subKeys = new TreeMap<String, String>(sK); }
public void readFields(DataInput in) throws IOException {
Text typeT = new Text();
typeT.readFields(in);
this.type = MyEnum.valueOf(typeT.toString());
subKeys.clear();
int i = WritableUtils.readVInt(in);
while ( 0 != i-- ) {
Text keyText = new Text();
keyText.readFields(in);
Text valueText = new Text();
valueText.readFields(in);
subKeys.put(keyText.toString(), valueText.toString());
}
}
public void write(DataOutput out) throws IOException {
new Text(type.name()).write(out);
WritableUtils.writeVInt(out, subKeys.size());
for (Entry<String, String> each: subKeys.entrySet()) {
new Text(each.getKey()).write(out);
new Text(each.getValue()).write(out);
}
}
public int compareTo(MyKey o) {
if (o == null) {
return 1;
}
int typeComparison = this.type.compareTo(o.type);
if (typeComparison == 0) {
if (this.subKeys.equals(o.subKeys)) {
return 0;
}
int x = this.subKeys.hashCode() - o.subKeys.hashCode();
return (x != 0 ? x : -1);
}
return typeComparison;
}
}
Is there anything wrong with this implementation of the key? Following is the code where I am facing the mix-up of keys in the reduce call:
reduce(MyKey k, Iterable<MyValue> values, Context context) {
Iterator<MyValue> iterator = values.iterator();
int sum = 0;
while(iterator.hasNext()) {
MyValue value = iterator.next();
//when I come here in the 2nd iteration, if I print k, it is different from what it was in iteration 1.
sum += value.getResult();
}
//write sum to context
}
Any help in this would be greatly appreciated.
This is expected behavior (with the new API at least).
When next() is called on the underlying iterator of the values Iterable, the next key/value pair is read from the sorted mapper/combiner output and checked to see whether the key is still part of the same group as the previous key.
Because Hadoop re-uses the objects passed to the reduce method (it just calls the readFields method of the same object), the underlying contents of the key parameter 'k' will change with each iteration of the values Iterable.
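If you do need to keep keys or values around beyond the current iteration, copy them before storing them. A minimal sketch inside the reduce method above, assuming MyValue implements Writable (WritableUtils.clone and context.getConfiguration() are standard Hadoop APIs):
// Sketch: deep-copy each value (and the key, if needed) so it survives Hadoop's object re-use.
// Requires java.util.ArrayList, java.util.List and org.apache.hadoop.io.WritableUtils.
List<MyValue> retained = new ArrayList<MyValue>();
for (MyValue value : values) {
    retained.add(WritableUtils.clone(value, context.getConfiguration()));
}
MyKey keyCopy = WritableUtils.clone(k, context.getConfiguration());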
