I'm trying to write, in pseudocode, a MapReduce task that returns the items sorted in descending order. For example, for the wordcount task, instead of getting:
apple 1
banana 3
mango 2
I want the output to be:
banana 3
mango 2
apple 1
Any ideas of how to do it? I know how to do it in ascending order (swap the keys and values in the mapper job) but not in descending order.
You can use the reducer code below to achieve sorting in descending order.
Assuming you have written the mapper and driver code, where the mapper produces output such as (Banana, 1), in the reducer we sum all values for a particular key, put the final result in a map, sort that map by value, and write the final result in the cleanup function of the reducer.
Please see the code below for further understanding:
import java.io.IOException;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Word_Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Holds the reduced (word, count) pairs until cleanup();
    // change the access modifier as per your need.
    private Map<String, Integer> map = new LinkedHashMap<String, Integer>();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // Sum the counts associated with each word and store the
        // reduced value in the map for each key.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        map.put(key.toString(), count);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // cleanup() is called once at the end to finish off anything for the reducer;
        // here we sort the accumulated counts and write the final output.
        Map<String, Integer> sortedMap = sortMap(map);
        for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }

    public Map<String, Integer> sortMap(Map<String, Integer> unsortMap) {
        Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();
        List<Map.Entry<String, Integer>> list =
                new LinkedList<Map.Entry<String, Integer>>(unsortMap.entrySet());
        // Sort the list built from the unsorted map, in descending order of value.
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // Only write the top 3 entries into the sorted map (drop this check
        // to keep every entry); LinkedHashMap preserves insertion order.
        int count = 0;
        for (Map.Entry<String, Integer> entry : list) {
            if (count > 2) {
                break;
            }
            sortedMap.put(entry.getKey(), entry.getValue());
            count++;
        }
        return sortedMap;
    }
}
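One caveat: the sort done in cleanup() is only global if the job runs with a single reduce task; with several reducers, each one sorts just its own partition. A minimal driver-side sketch (assuming the usual Job setup object, named job here):

// Force a single reducer so cleanup() sees every key and the
// output is globally sorted in descending order.
job.setNumReduceTasks(1);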
I have a LinkedHashMap which contains multiple entries. I'd like to reduce the multiple entries to a single one in the first step, and then map that to a single String.
For example:
I'm starting with a Map like this:
{"<a>"="</a>", "<b>"="</b>", "<c>"="</c>", "<d>"="</d>"}
And finally I want to get a String like this:
<a><b><c><d></d></c></b></a>
(In that case the String contains the keys in order, then the values in reverse order. But that doesn't really matter; I'd like a general solution.)
I think I need map.entrySet().stream().reduce(), but I have no idea what to write in the reduce method, and how to continue.
Since you're reducing entries by concatenating keys with keys and values with values, the identity you're looking for is an entry with empty strings for both key and value.
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map.Entry;

String reduceEntries(LinkedHashMap<String, String> map) {
    Entry<String, String> entry =
        map.entrySet()
           .stream()
           .reduce(
               new SimpleImmutableEntry<>("", ""),
               (left, right) ->
                   new SimpleImmutableEntry<>(
                       left.getKey() + right.getKey(),
                       right.getValue() + left.getValue()
                   )
           );
    return entry.getKey() + entry.getValue();
}
Java 9 adds a static method Map.entry(key, value) for creating immutable entries.
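With that, the same reduction can be sketched more concisely on Java 9+ (note that Map.entry rejects null keys and values):

Entry<String, String> entry =
    map.entrySet()
       .stream()
       .reduce(
           Map.entry("", ""),
           (left, right) -> Map.entry(
               left.getKey() + right.getKey(),
               right.getValue() + left.getValue()));
return entry.getKey() + entry.getValue();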
Here is an example of how I would do it:
import java.util.LinkedHashMap;

public class Main {

    static String result = "";

    public static void main(String[] args) {
        LinkedHashMap<String, String> map = new LinkedHashMap<String, String>();
        map.put("<a>", "</a>");
        map.put("<b>", "</b>");
        map.put("<c>", "</c>");
        map.put("<d>", "</d>");

        map.keySet().forEach(s -> result += s);
        map.values().forEach(s -> result += s);

        System.out.println(result);
    }
}
Note: to get </d> first you need to reverse the values before appending them; ArrayUtils.reverse() from Commons Lang only works on arrays, so with plain JDK collections use Collections.reverse() on a List copy.
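For instance, a small sketch replacing the map.values().forEach(...) line above (needs java.util.ArrayList, java.util.Collections, java.util.List):

// Copy the values into a list and reverse it so </d> comes first.
List<String> values = new ArrayList<>(map.values());
Collections.reverse(values);
values.forEach(s -> result += s);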
I need to emit two keys and two values from my mapper. Could you please provide me info on how to write the code and which data types to use for that? For example:
key = { store_id : this.store_id,
product_id : this.product_id };
value = { quantity : this.quantity,
price : this.price,
count : this.count };
emit(key, value);
Regards
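A minimal sketch of one common approach, assuming plain Text fields joined with a delimiter are acceptable (storeId, productId and the other names are hypothetical, mirroring the pseudocode above):

// Hypothetical sketch: pack the two key fields (and the three value
// fields) into single delimited Text objects before writing them out.
Text compositeKey = new Text(storeId + "|" + productId);
Text compositeValue = new Text(quantity + "|" + price + "|" + count);
context.write(compositeKey, compositeValue);

A custom WritableComparable holding both fields is the more robust alternative if you need proper sorting and grouping on the composite key.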
As per the given example input, A B B C A R A D S D A C A R S D F A B, emit from the mapper:
key - A
value A, AB
key - B
value B, BB
key - B
value B, BC
key - C
value C, CA
and so on...
In the reducer, you get the grouped values
key - A
values A, AB, A, AR, A, AD, A, AC and so on
key - B
values - B, BB, B, BC and so on
Add a delimiter of your choice between the 2 words/alphabets.
For each key in the reducer, you can use a HashMap/MapWritable to track the occurrence count of each value, i.e. for example:
A - 5 times
AB - 7 times
etc.
Then you can calculate the ratio.
Sample Mapper Implementation
public class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] valueSplits = value.toString().split(" ");
        for (int i = 0; i < valueSplits.length; i++) {
            // Emit (word, word~nextWord) for each adjacent pair.
            if (i != valueSplits.length - 1) {
                context.write(new Text(valueSplits[i]),
                        new Text(valueSplits[i] + "~" + valueSplits[i + 1]));
            }
            // Also emit (word, word) so the reducer can count single occurrences.
            context.write(new Text(valueSplits[i]), new Text(valueSplits[i]));
        }
    }
}
Sample Reducer Implementation
public class TestReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Count the occurrences of each value (the word itself and each pair).
        Map<String, Integer> countMap = new HashMap<String, Integer>();
        for (Text t : values) {
            String value = t.toString();
            int count = 0;
            if (countMap.containsKey(value)) {
                count = countMap.get(value);
                count += 1;
            } else {
                count = 1;
            }
            countMap.put(value, count);
        }
        // For every pair, divide its count by the count of the single word.
        for (String s : countMap.keySet()) {
            if (!s.equalsIgnoreCase(key.toString())) {
                int keyCount = countMap.get(s.split("~")[0]);
                int occurrence = countMap.get(s);
                context.write(new Text(key.toString() + " , " + s),
                        new Text(String.valueOf((float) occurrence / (float) keyCount)));
            }
        }
    }
}
For an input of
A A A B
the reducer would emit
A , A~A 0.6666667
A , A~B 0.33333334
A~A appears 2 times, A~B 1 time and A 3 times.
A~A is hence 2/3
A~B is hence 1/3
I am trying to write a MapReduce application in which the Mapper passes a set of values to the Reducer as follows:
Hello
World
Hello
Hello
World
Hi
Now these values are to be grouped and counted first and then some further processing is to be done. The code I wrote is:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> records = new ArrayList<String>();

    /* Collects all the records from the mapper into the list. */
    for (Text value : values) {
        records.add(value.toString());
    }

    /* Groups the values. */
    Map<String, Integer> groupedData = groupAndCount(records);
    Set<String> groupKeys = groupedData.keySet();

    /* Writes the grouped data. */
    for (String groupKey : groupKeys) {
        System.out.println(groupKey + ": " + groupedData.get(groupKey));
        context.write(NullWritable.get(), new Text(groupKey + groupedData.get(groupKey)));
    }
}
public Map<String, Integer> groupAndCount(List<String> records) {
    Map<String, Integer> groupedData = new HashMap<String, Integer>();
    String currentRecord = "";

    Collections.sort(records);
    for (String record : records) {
        System.out.println(record);
        if (!currentRecord.equals(record)) {
            currentRecord = record;
            groupedData.put(currentRecord, 1);
        } else {
            int currentCount = groupedData.get(currentRecord);
            groupedData.put(currentRecord, ++currentCount);
        }
    }
    return groupedData;
}
But in the output I get a count of 1 for all. The sysout statements are printed something like:
Hello
World
Hello: 1
World: 1
Hello
Hello: 1
Hello
World
Hello: 1
World: 1
Hi
Hi: 1
I cannot understand what the issue is and why not all records are received by the Reducer at once and passed to the groupAndCount method.
As you note in your comment, if each value has a different corresponding key then they will not be reduced in the same reduce call, and you'll get the output you're currently seeing.
Fundamental to Hadoop reducers is the notion that values will be collected and reduced for the same key - I suggest you re-read some of the Hadoop getting-started documentation, especially the Word Count example, which appears to be roughly what you are trying to achieve with your code.
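For reference, a minimal sketch of that pattern, assuming the mapper emits each word as its own key with a count of 1 so the framework does the grouping for you:

// Minimal word-count reducer sketch: since the mapper emits (word, 1)
// pairs, each reduce() call already receives all counts for one word.
public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}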
I'm using Pig to parse my application logs to find out which exposed methods have been called by a user this month that weren't called last month (by the same user).
I have managed to get the methods called, grouped by user, before last month and after last month:
BEFORE last month relation sample
u1 {(m1),(m2)}
u2 {(m3),(m4)}
AFTER last month relation sample
u1 {(m1),(m3)}
u2 {(m1),(m4)}
What I want is to find, by users, which methods are in AFTER that are not in BEFORE, that is
NEWLY_CALLED expected result
u1 {(m3)}
u2 {(m1)}
Question: how can I do that in Pig? Is it possible to subtract bags?
I have tried the DIFF function, but it does not perform the expected subtraction.
Regards,
Joel
I think you need to write a UDF; then inside it you can use plain Java sets (note that java.util.Set has no subtract method; the standard call is removeAll, or CollectionUtils.subtract from Commons Collections):
Set<T> setA = ...
Set<T> setB = ...
setA.removeAll(setB); // setA now contains the elements of setA not in setB
For those who might be interested, here is the subtract function I wrote (the class below), which I proposed to Pig (PIG-2881):
/**
 * Subtract takes two bags as arguments and returns a new bag composed of the tuples of the first bag not in the second bag.<br>
 * Null bag arguments are replaced by empty bags.
 * <p>
 * The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously.
 * If that is not the case the UDF will still function, but it will be <strong>very</strong> slow.
 * </p>
 */
public class Subtract extends EvalFunc<DataBag> {
    /**
     * Compares the two bag fields from the input Tuple and returns a new bag composed of the elements of the first bag not in the second bag.
     * @param input a tuple with exactly two bag fields.
     * @throws IOException if there are not exactly two fields in the tuple or if they are not {@link DataBag}.
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.size() != 2) {
            throw new ExecException("Subtract expected two inputs but received " + input.size() + " inputs.");
        }
        DataBag bag1 = toDataBag(input.get(0));
        DataBag bag2 = toDataBag(input.get(1));
        return subtract(bag1, bag2);
    }

    private static String classNameOf(Object o) {
        return o == null ? "null" : o.getClass().getSimpleName();
    }

    private static DataBag toDataBag(Object o) throws ExecException {
        if (o == null) {
            return BagFactory.getInstance().newDefaultBag();
        }
        if (o instanceof DataBag) {
            return (DataBag) o;
        }
        throw new ExecException(format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
    }

    private static DataBag subtract(DataBag bag1, DataBag bag2) {
        DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
        // Convert the first bag to a Set; this assumes the set fits in memory.
        Set<Tuple> set1 = toSet(bag1);
        // Remove the elements of bag2 from set1.
        Iterator<Tuple> bag2Iterator = bag2.iterator();
        while (bag2Iterator.hasNext()) {
            set1.remove(bag2Iterator.next());
        }
        // set1 now contains all elements of bag1 not in bag2 => we can build the resulting DataBag.
        for (Tuple tuple : set1) {
            subtractBag2FromBag1.add(tuple);
        }
        return subtractBag2FromBag1;
    }

    private static Set<Tuple> toSet(DataBag bag) {
        Set<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iterator = bag.iterator();
        while (iterator.hasNext()) {
            set.add(iterator.next());
        }
        return set;
    }
}
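For a quick sanity check, here is a hypothetical snippet exercising the UDF directly with the sample data from the question (TupleFactory and BagFactory are Pig's standard factories; needs java.util.Arrays):

// Hypothetical check: AFTER minus BEFORE for user u1 should leave {(m3)}.
TupleFactory tf = TupleFactory.getInstance();
BagFactory bf = BagFactory.getInstance();

DataBag after = bf.newDefaultBag();   // {(m1),(m3)}
after.add(tf.newTuple("m1"));
after.add(tf.newTuple("m3"));

DataBag before = bf.newDefaultBag();  // {(m1),(m2)}
before.add(tf.newTuple("m1"));
before.add(tf.newTuple("m2"));

DataBag result = new Subtract().exec(tf.newTuple(Arrays.asList((Object) after, (Object) before)));
// result now contains {(m3)}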
Hadoop Version: 0.20.2 (On Amazon EMR)
Problem: I have a custom key that I write during the map phase, which I've added below. During the reduce call, I do some simple aggregation on the values for a given key. The issue I am facing is that during the iteration of the values in the reduce call, my key changes and I get the values of that new key.
My key type:
class MyKey implements WritableComparable<MyKey>, Serializable {
    private MyEnum type; // MyEnum is a simple enumeration.
    private TreeMap<String, String> subKeys = new TreeMap<String, String>();

    MyKey() {} // for hadoop

    public MyKey(MyEnum t, Map<String, String> sK) {
        type = t;
        subKeys = new TreeMap<String, String>(sK);
    }

    public void readFields(DataInput in) throws IOException {
        Text typeT = new Text();
        typeT.readFields(in);
        this.type = MyEnum.valueOf(typeT.toString());

        subKeys.clear();
        int i = WritableUtils.readVInt(in);
        while (0 != i--) {
            Text keyText = new Text();
            keyText.readFields(in);
            Text valueText = new Text();
            valueText.readFields(in);
            subKeys.put(keyText.toString(), valueText.toString());
        }
    }

    public void write(DataOutput out) throws IOException {
        new Text(type.name()).write(out);
        WritableUtils.writeVInt(out, subKeys.size());
        for (Entry<String, String> each : subKeys.entrySet()) {
            new Text(each.getKey()).write(out);
            new Text(each.getValue()).write(out);
        }
    }

    public int compareTo(MyKey o) {
        if (o == null) {
            return 1;
        }
        int typeComparison = this.type.compareTo(o.type);
        if (typeComparison == 0) {
            if (this.subKeys.equals(o.subKeys)) {
                return 0;
            }
            int x = this.subKeys.hashCode() - o.subKeys.hashCode();
            return (x != 0 ? x : -1);
        }
        return typeComparison;
    }
}
Is there anything wrong with this implementation of key? Following is the code where I am facing the mixup of keys in reduce call:
public void reduce(MyKey k, Iterable<MyValue> values, Context context) {
    Iterator<MyValue> iterator = values.iterator();
    int sum = 0;
    while (iterator.hasNext()) {
        MyValue value = iterator.next();
        // When I come here in the 2nd iteration, if I print k,
        // it is different from what it was in iteration 1.
        sum += value.getResult();
    }
    // write sum to context
}
Any help in this would be greatly appreciated.
This is expected behavior (with the new API at least).
When the next method of the values Iterable's underlying iterator is called, the next key/value pair is read from the sorted mapper/combiner output, and it is checked whether the key is still part of the same group as the previous key.
Because Hadoop re-uses the objects passed to the reduce method (just calling the readFields method of the same object), the underlying contents of the key parameter 'k' will change with each iteration of the values Iterable.
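If you need to hold on to the key (or a value) beyond the current iteration, the usual workaround is a defensive copy; a sketch using Hadoop's WritableUtils.clone:

// Sketch: clone the re-used key object if you must retain it
// across iterations of the values Iterable.
MyKey keyCopy = WritableUtils.clone(k, context.getConfiguration());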