hadoop pig bag subtraction

hadoop pig bag subtraction - hadoop

I'm using Pig to parse my application logs to know which exposed methods have been called by a user that wasn't called the last month (by the same user).
I have managed to get methods called grouped by users before last month and after last month :
BEFORE last month relation sample
u1 {(m1),(m2)}
u2 {(m3),(m4)}
AFTER last month relation sample
u1 {(m1),(m3)}
u2 {(m1),(m4)}
What I want is to find, by users, which methods are in AFTER that are not in BEFORE, that is
NEWLY_CALLED expected result
u1 {(m3)}
u2 {(m1)}
Question : how can I do that in Pig ? is it possible to subtract bags ?
I have tried DIFF function but it does not perform the expected subtraction.
Regards,
Joel

I think you need to write a UDF, then you can use
Set<T> setA ...
Set<T> setB ...
Set<T> setAminusB = setA.subtract(setB);

For those who might be interested, here is the subtract function I wrote the class below and proposed it to Pig (PIG-2881) :
/**
* Subtract takes two bags as arguments returns a new bag composed of tuples of first bag not in the second bag.<br>
* If null bag arguments are replaced by empty bags.
* <p>
* The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously.
* </br>
* If that is not the case the UDF will still function, but it will be <strong>very</strong> slow.
*/
public class Subtract extends EvalFunc<DataBag> {
/**
* Compares the two bag fields from input Tuple and returns a new bag composed of elements of first bag not in the second bag.
* #param input a tuple with exactly two bag fields.
* #throws IOException if there are not exactly two fields in a tuple or if they are not {#link DataBag}.
*/
#Override
public DataBag exec(Tuple input) throws IOException {
if (input.size() != 2) {
throw new ExecException("Subtract expected two inputs but received " + input.size() + " inputs.");
}
DataBag bag1 = toDataBag(input.get(0));
DataBag bag2 = toDataBag(input.get(1));
return subtract(bag1, bag2);
}
private static String classNameOf(Object o) {
return o == null ? "null" : o.getClass().getSimpleName();
}
private static DataBag toDataBag(Object o) throws ExecException {
if (o == null) {
return BagFactory.getInstance().newDefaultBag();
}
if (o instanceof DataBag) {
return (DataBag) o;
}
throw new ExecException(format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
}
private static DataBag subtract(DataBag bag1, DataBag bag2) {
DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
// convert each bag to Set, this does make the assumption that the sets will fit in memory.
Set<Tuple> set1 = toSet(bag1);
// remove elements of bag2 from set1
Iterator<Tuple> bag2Iterator = bag2.iterator();
while (bag2Iterator.hasNext()) {
set1.remove(bag2Iterator.next());
}
// set1 now contains all elements of bag1 not in bag2 => we can build the resulting DataBag.
for (Tuple tuple : set1) {
subtractBag2FromBag1.add(tuple);
}
return subtractBag2FromBag1;
}
private static Set<Tuple> toSet(DataBag bag) {
Set<Tuple> set = new HashSet<Tuple>();
Iterator<Tuple> iterator = bag.iterator();
while (iterator.hasNext()) {
set.add(iterator.next());
}
return set;
}
}

Related

Comparable class with a List<String> field

I have a simple class which stores an integer and a list of Strings.
As I want to use this class in a TreeSet<>, the one must be Comparable.
But when trying to use the Java 8 Comparator class, I cannot compare my inner list.
I have the following error:
Bad return type in method reference: cannot convert java.util.List to U
I think there is a very simple way to do that but I could not find it out.
How to do that?
public class MyClass implements Comparable<MyClass> {
private final int someInt;
private final List<String> someStrings;
public MyClass (List<String> someStrings, int someInt) {
this.someInt = someInt;
this.someStrings = new ArrayList<>(someStrings);
}
#Override
public int compareTo(MyClass other) {
return
Comparator.comparing(MyClass::getSomeInt)
.thenComparing(MyClass::getSomeStrings) // Error here
.compare(this, other);
}
public int getSomeInt() {
return someInt;
}
public List<String> getSomeStrings() {
return someStrings;
}
}
Edit 1
I just want the String list to be compared in the simplest way (using implicitly String.compareTo()).
Note that I do now want to sort my List<String> but I want it to be Comparable so that MyClass is also comparable and finally, I can insert MyClass instances into a TreeSet<MyClass>.
A also saw in the JavaDoc the following:
java.util.Comparator<T> public Comparator<T>
thenComparing(#NotNull Comparator<? super T> other)
For example, to sort a collection of String based on the length and then case-insensitive natural ordering, the comparator can be composed using following code,
Comparator<String> cmp = Comparator.comparingInt(String::length)
.thenComparing(String.CASE_INSENSITIVE_ORDER);
It seems to be an clue but I don't know how to apply it to this simple example.
Edit 2
Let's say I want my List<String> to be sorted the following way:
First check: List.size() (the shorter is less than the larger one);
Second check if sizes match: comparing one by one each element of both Lists until finding one where the String.compareTo method returns 1 or -1.
How to do that with lambdas in a my compareTo method?
Edit 3
This does not duplicates this question because I want to know how to build a comparator of a class which contains a List<String> with Java 8 chaining Comparable calls.

So to compare the list, first you check the length, then you compare each item with same indexes in both list one by one right?
(That is [a, b, c] < [b, a, c])
Make a custom comparator for list return join of your list string:
Comparator<List<String>> listComparator = (l1, l2) -> {
if (l1.size() != l2.size()) {
return l1.size() - l2.size();
}
for (int i = 0; i < l1.size(); i++) {
int strCmp = l1.get(i).compareTo(l2.get(i));
if (strCmp != 0) {
return strCmp;
}
}
return 0; // Two list equals
};
Then you can compare using that custom comparator:
#Override
public int compareTo(MyClass other) {
return Comparator.comparing(MyClass::getSomeInt)
.thenComparing(Comparator.comparing(MyClass:: getSomeStrings , listComparator))
.compare(this, other);
}
If you want [a, b, c] = [b, a, c], then you have to sort those list first before comparing:
public String getSomeStringsJoined() {
return getSomeStrings().stream().sort(Comparator.naturalOrder()).collect(Collectors.joining());
}

MapReduce sort by value in descending order

I'm trying to write in pseudo code a MapReduce task that returns the items sorted in descending order. For example: for the wordcount task, instead of getting:
apple 1
banana 3
mango 2
I want the output to be:
banana 3
mango 2
apple 1
Any ideas of how to do it? I know how to do it in ascending order (replace the keys and value in the mapper job) but not in descending order.

Here you can take help of below reducer code to achieve sorting in descending order .
Assuming you have written mapper and driver code where mapper will produce output as (Banana,1) etc
In reducer we will sum all values for a particular key and put final result in a map then sort the map on the basis of values and write final result in cleanup function of reduce.
Please see below code for further understadnind:
public class Word_Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
// Change access modifier as per your need
public Map<String , Integer > map = new LinkedHashMap<String , Integer>();
public void reduce(Text key , Iterable<IntWritable> values ,Context context)
{
// write logic for your reducer
// Enter reduced values in map for each key
for (IntWritable value : values ){
// calculate "count" associated with each word
}
map.put(key.toString() , count);
}
public void cleanup(Context context){
//Cleanup is called once at the end to finish off anything for reducer
//Here we will write our final output
Map<String , Integer> sortedMap = new HashMap<String , Integer>();
sortedMap = sortMap(map);
for (Map.Entry<String,Integer> entry = sortedMap.entrySet()){
context.write(new Text(entry.getKey()),new IntWritable(entry.getValue()));
}
}
public Map<String , Integer > sortMap (Map<String,Integer> unsortMap){
Map<String ,Integer> hashmap = new LinkedHashMap<String,Integer>();
int count=0;
List<Map.Entry<String,Integer>> list = new
LinkedList<Map.Entry<String,Integer>>(unsortMap.entrySet());
//Sorting the list we created from unsorted Map
Collections.sort(list , new Comparator<Map.Entry<String,Integer>>(){
public int compare (Map.Entry<String , Integer> o1 , Map.Entry<String , Integer> o2 ){
//sorting in descending order
return o2.getValue().compareTo(o1.getValue());
}
});
for(Map.Entry<String, Integer> entry : list){
// only writing top 3 in the sorted map
if(count>2)
break;
hashmap.put(entry.getKey(),entry.getValue());
}
return hashmap ;
}

java 8 use reduce and Collectors grouping by to get list

EDIT
**Request to provide answer to First approach also using reduce method **
public class Messages {
int id;
String message;
String field1;
String field2;
String field3;
int audId;
String audmessage;
//constructor
//getter or setters
}
public class CustomMessage {
int id;
String msg;
String field1;
String field2;
String field3;
List<Aud> list;
//getters and setters
}
public class Aud {
int id;
String message;
//getters and setters
}
public class Demo {
public static void main(String args[]){
List<Messages> list = new ArrayList<Messages>();
list.add(new Messages(1,"abc","c","d","f",10,"a1"));
list.add(new Messages(2,"ac","d","d","f",21,"a2"));
list.add(new Messages(3,"adc","s","d","f",31,"a3"));
list.add(new Messages(4,"aec","g","d","f",40,"a4"));
list.add(new Messages(1,"abc","c","d","f",11,"a5"));
list.add(new Messages(2,"ac","d","d","f",22,"a5"));
}
I want the message to be mapped with audits
CustomMessage must have ->1,"abc","c","d","f"----->List of 2 audits (10,a1) and (11,"a5");
There are two ways to do it
1.Reduce-I would like to use reduce also to create my own accumulator and combiner
List<CustomMessage> list1= list.stream().reduce(new ArrayList<CustomMessage>(),
accumulator1,
combiner1);
**I am unable to write a accumulator and combiner**
2.Collectors.groupingBy-
I do not want to use constructors for creating the Message and
neither for Custom Message.here I have less fields my actual object has many fields.Any way to have a static
method for object creation
Is there is a way to do it via reduce by writing accumulator or
combiner
List<CustomMessage> l = list.stream()
.collect(Collectors.groupingBy(m -> new SimpleEntry<>(m.getId(), m.getMessage()),
Collectors.mapping(m -> new Aud(m.getAudId(), m.getAudMessage()), Collectors.toList())))
.entrySet()
.stream()
.map(e -> new CustomMessage(e.getKey().getKey(), e.getKey().getValue(), e.getValue()))
.collect(Collectors.toList());
Can anyone help me with both the approaches.

This code will create a Collection of CustomMessage. I would recommend putting a constructor in CustomMessage that takes a Messages argument. And maybe also move the mergeFunction out of the collect.
Collection<CustomMessage> customMessages = list.stream()
.collect(toMap(
Messages::getId,
m -> new CustomMessage(m.getId(), m.getMessage(), m.getField1(), m.getField2(), m.getField3(),
new ArrayList<>(singletonList(new Aud(m.getAudId(), m.getAudmessage())))),
(m1, m2) -> {
m1.getList().addAll(m2.getList());
return m1;
}))
.values();
What toMap does here is : The first time a Messages id is encountered, it will put it to a Map as key with value the newly created CustomMessage by the second argument to toMap (the "valueMapper"). The next times it will merge two CustomMessage with the 3rd argument the "mergeFunction" that will effectively concatenate the 2 lists of Aud.
And if you absolutely need a List and not a Collection:
List<CustomMessage> lm = new ArrayList<>(customMessages);

You cannot do this by either grouping or reducing. You need both: group first and then reduce. I coded the reduction differently:
List<CustomMessage> list1 = list.stream()
.collect(Collectors.groupingBy(Messages::getId))
.values()
.stream() // stream of List<Messages>
.map(lm -> {
List<Aud> la = lm.stream()
.map(m -> new Aud(m.getAudId(), m.getAudmessage()))
.collect(Collectors.toList());
Messages m0 = lm.get(0);
return new CustomMessage(m0.getId(), m0.getMessage(),
m0.getField1(), m0.getField2(), m0.getField3(), la);
})
.collect(Collectors.toList());
I have introduced a constructor in Aud and then read your comment that you are trying to avoid constructors. I will revert to this point in the end. Anyway, you can rewrite the creation of Aud objects to be the same way as in your question. And the construction of CustomMessage objects too if you like.
Result:
[1 abc c d f [10 a1, 11 a5], 3 adc s d f [31 a3], 4 aec g d f [40 a4],
2 ac d d f [21 a2, 22 a5]]
I grouped messages only by ID since you said their equals method uses ID only. You may also group by more fields like in your question. A quick and dirty way wold be
.collect(Collectors.groupingBy(m -> "" + m.getId() + '-' + m.getMessage()
+ '-' + m.getField1() + '-' + m.getField2() + '-' + m.getField3()))
Avoiding public constructors and using static methods for object creation doesn’t change a lot. For example if you have
public static Aud createAud(int id, String message) {
return new Aud(id, message);
}
(well, this didn’t eliminate the constructor completely, but now you can declare it private; if still not satisfied, you can also rewrite the method into not using a declared constructor). Now in the stream you just need to do:
.map(m -> Aud.createAud(m.getAudId(), m.getAudmessage()))
You can do similarly for CustomMessage. In this case your static method may take a Messages argument if you like, a bit like Manos Nikolaidis suggested, this could simplify the stream code a bit.
Edit: You couldn’t just forget about the three-argument reduce method, could you? ;-) It can be used. If you want to do that, I suggest you first fit CustomMessage with a range of methods for the purpose:
private CustomMessage(int id, String msg,
String field1, String field2, String field3, List<Aud> list) {
this.id = id;
this.msg = msg;
this.field1 = field1;
this.field2 = field2;
this.field3 = field3;
this.list = list;
}
public static CustomMessage create(Messages m, List<Aud> la) {
return new CustomMessage(m.getId(), m.getMessage(),
m.getField1(), m.getField2(), m.getField3(), la);
}
/**
* #return original with the Aud from m added
*/
public static CustomMessage adopt(CustomMessage original, Messages m) {
if (original.getId() != m.getId()) {
throw new IllegalArgumentException("adopt(): incompatible messages, wrong ID");
}
Aud newAud = Aud.createAud(m.getAudId(), m.getAudmessage());
original.addAud(newAud);
return original;
}
public static CustomMessage merge(CustomMessage cm1, CustomMessage cm2) {
if (cm1.getId() != cm2.getId()) {
throw new IllegalArgumentException("Cannot merge non-matching custom messages, id "
+ cm1.getId() + " and " + cm2.getId());
}
cm1.addAuds(cm2.getList());
return cm1;
}
private void addAud(Aud aud) {
list.add(aud);
}
private void addAuds(List<Aud> list) {
this.list.addAll(list);
}
With these in place it’s not so bad:
List<CustomMessage> list2 = list.stream()
.collect(Collectors.groupingBy(Messages::getId))
.values()
.stream()
.map(lm -> lm.stream()
.reduce(CustomMessage.create(lm.get(0), new ArrayList<>()),
CustomMessage::adopt,
CustomMessage::merge))
.collect(Collectors.toList());

Linq Query Need - Looking for a pattern of data

Say I have a collection of the following simple class:
public class MyEntity
{
public string SubId { get; set; }
public System.DateTime ApplicationTime { get; set; }
public double? ThicknessMicrons { get; set; }
}
I need to search through the entire collection looking for 5 consecutive (not 5 total, but 5 consecutive) entities that have a null ThicknessMicrons value. Consecutiveness will be based on the ApplicationTime property. The collection will be sorted on that property.
How can I do this in a Linq query?

You can write your own extension method pretty easily:
public static IEnumerable<IEnumerable<T>> FindSequences<T>(this IEnumerable<T> sequence, Predicate<T> selector, int size)
{
List<T> curSequence = new List<T>();
foreach (T item in sequence)
{
// Check if this item matches the condition
if (selector(item))
{
// It does, so store it
curSequence.Add(item);
// Check if the list size has met the desired size
if (curSequence.Count == size)
{
// It did, so yield that list, and reset
yield return curSequence;
curSequence = new List<T>();
}
}
else
{
// No match, so reset the list
curSequence = new List<T>();
}
}
}
Now you can just say:
var groupsOfFive = entities.OrderBy(x => x.ApplicationTime)
.FindSequences(x => x.ThicknessMicrons == null, 5);
Note that this will return all sub-sequences of length 5. You can test for the existence of one like so:
bool isFiveSubsequence = groupsOfFive.Any();
Another important note is that if you have 9 consecutive matches, only one sub-sequence will be located.

Hadoop seems to modify my key object during an iteration over values of a given reduce call

Hadoop Version: 0.20.2 (On Amazon EMR)
Problem: I have a custom key that i write during map phase which i added below. During the reduce call, I do some simple aggregation on values for a given key. Issue I am facing is that during the iteration of values in reduce call, my key got changed and i got values of that new key.
My key type:
class MyKey implements WritableComparable<MyKey>, Serializable {
private MyEnum type; //MyEnum is a simple enumeration.
private TreeMap<String, String> subKeys;
MyKey() {} //for hadoop
public MyKey(MyEnum t, Map<String, String> sK) { type = t; subKeys = new TreeMap(sk); }
public void readFields(DataInput in) throws IOException {
Text typeT = new Text();
typeT.readFields(in);
this.type = MyEnum.valueOf(typeT.toString());
subKeys.clear();
int i = WritableUtils.readVInt(in);
while ( 0 != i-- ) {
Text keyText = new Text();
keyText.readFields(in);
Text valueText = new Text();
valueText.readFields(in);
subKeys.put(keyText.toString(), valueText.toString());
}
}
public void write(DataOutput out) throws IOException {
new Text(type.name()).write(out);
WritableUtils.writeVInt(out, subKeys.size());
for (Entry<String, String> each: subKeys.entrySet()) {
new Text(each.getKey()).write(out);
new Text(each.getValue()).write(out);
}
}
public int compareTo(MyKey o) {
if (o == null) {
return 1;
}
int typeComparison = this.type.compareTo(o.type);
if (typeComparison == 0) {
if (this.subKeys.equals(o.subKeys)) {
return 0;
}
int x = this.subKeys.hashCode() - o.subKeys.hashCode();
return (x != 0 ? x : -1);
}
return typeComparison;
}
}
Is there anything wrong with this implementation of key? Following is the code where I am facing the mixup of keys in reduce call:
reduce(MyKey k, Iterable<MyValue> values, Context context) {
Iterator<MyValue> iterator = values.iterator();
int sum = 0;
while(iterator.hasNext()) {
MyValue value = iterator.next();
//when i come here in the 2nd iteration, if i print k, it is different from what it was in iteration 1.
sum += value.getResult();
}
//write sum to context
}
Any help in this would be greatly appreciated.

This is expected behavior (with the new API at least).
When the next method for the underlying iterator of the values Iterable is called, the next key/value pair is read from the sorted mapper / combiner output, and checked that the key is still part of the same group as the previous key.
Because hadoop re-uses the objects passed to the reduce method (just calling the readFields method of the same object) the underlying contents of the Key parameter 'k' will change with each iteration of the values Iterable.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

hadoop pig bag subtraction - hadoop

I think you need to write a UDF, then you can use Set<T> setA ... Set<T> setB ... Set<T> setAminusB = setA.subtract(setB);

Related

Comparable class with a List<String> field

MapReduce sort by value in descending order

java 8 use reduce and Collectors grouping by to get list

Linq Query Need - Looking for a pattern of data

Hadoop seems to modify my key object during an iteration over values of a given reduce call

Categories

Resources