Intersection of Intervals in Apache Pig - hadoop

In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.

This is not an efficient solution, and I recommend writing a UDF to do this.
Self-join the dataset to get the cross product of all combinations. In Pig it is awkward to join a relation with itself, so load the same input twice, as if it were two separate datasets. After the cross product, you end up with data like
1,A,1500,2000,1,A,1500,2000
1,A,1500,2000,2,A,1900,3000
.....
At this point, you need to satisfy three conditions:
the "where" fields match
the two ids from the self-join don't match (so you don't get back the same ID intersecting with itself)
the start time of the second datapoint is at or after the start time of the first and at or before the end time of the first
This code should work. It may contain a syntax error somewhere, as I couldn't test it, but it should help you write what you need.
inputdataone = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
inputdatatwo = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
crossProduct = CROSS inputdataone, inputdatatwo;
crossProduct =
FOREACH crossProduct
GENERATE inputdataone::id as id_one,
inputdatatwo::id as id_two,
((inputdataone::where == inputdatatwo::where AND inputdataone::id != inputdatatwo::id AND inputdatatwo::start >= inputdataone::start AND inputdatatwo::start <= inputdataone::end) ? 1 : 0) as intersect;
find_intersect = FILTER crossProduct BY intersect==1;
final =
FOREACH find_intersect
GENERATE id_one,
id_two;

Crossing large sets inflates the data.
A naive solution without crossing would be to partition the intervals and check for intersections within each interval.
I am working on a similar problem and will provide a code sample when I am done.
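As a rough illustration of what a UDF (or any per-group routine) could do without a cross product, here is a minimal Python sketch of a sort-and-sweep over each group. The function name `overlapping_pairs` is hypothetical, and treating intervals that merely touch at an endpoint as overlapping is an assumption carried over from the `>=`/`<=` comparisons in the answer above.

```python
from collections import defaultdict


def overlapping_pairs(rows):
    """Return sorted (id1, id2) pairs of overlapping intervals per group.

    rows: iterable of (id, where, start, end) tuples.
    Assumption: intervals that share an endpoint count as overlapping.
    """
    groups = defaultdict(list)
    for id_, where, start, end in rows:
        groups[where].append((start, end, id_))

    pairs = []
    for intervals in groups.values():
        intervals.sort()  # order by start time within the group
        active = []  # (end, id) of earlier intervals that may still overlap
        for start, end, id_ in intervals:
            # Drop intervals that ended before the current one starts.
            active = [(e, i) for (e, i) in active if e >= start]
            # Every interval still active overlaps the current one.
            pairs.extend((i, id_) for (_e, i) in active)
            active.append((end, id_))
    return sorted(pairs)


# Sample data from the question.
rows = [
    (0, "A", 0, 1000),
    (1, "A", 1500, 2000),
    (2, "A", 1900, 3000),
    (3, "B", 500, 2000),
    (4, "B", 3000, 4000),
    (5, "B", 3500, 5000),
    (6, "B", 7000, 8000),
]
print(overlapping_pairs(rows))
```

Sorting by start time means each interval is only compared against earlier intervals that have not yet ended, so the work per group stays close to the sort cost plus the number of overlaps found.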

Related

Summing up values in Pig

I'm trying to produce an output which sums the last two fields (count and books) for each grouping and divides one sum by the other (count/books). Currently I have the grouping code, which groups by the first element in the array. However, I'm not sure how to get the sums of the last two elements and divide them. I've posted the code I have so far. Thanks in advance!
bigrams = LOAD 'txt' AS (bigram:chararray, year:int, count:int, books:int);
grouping = group bigrams by bigram;
STORE grouping INTO 's3://cse6242vrv3/output1.txt';
It's not completely clear what output you are expecting, so I assume you just want to know how to do aggregations in Pig. Let us know if you are looking for something different.
bigrams = LOAD 'txt' AS (bigram:chararray, year:int, count:int, books:int);
-- Sum both columns per bigram; the cast avoids integer division in the ratio.
grouping = foreach(group bigrams by bigram) generate group AS bigram,
SUM(bigrams.count) AS sum_count,
SUM(bigrams.books) AS sum_books,
(double)SUM(bigrams.count)/SUM(bigrams.books) AS ratio;
STORE grouping INTO 's3://cse6242vrv3/output1.txt';
You can find more details about Pig aggregation here:
https://pig.apache.org/docs/r0.15.0/basic.html#group
You might also be interested in Pig's nested blocks, which can be used for more complex calculations within a group.
https://pig.apache.org/docs/r0.15.0/basic.html#nestedblock
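To make the intended result concrete, here is a small Python sketch of the same per-group aggregation. The function name `ratio_per_bigram` and the sample records are made up for illustration, and the sketch assumes the summed `books` value is never zero.

```python
from collections import defaultdict


def ratio_per_bigram(records):
    """Sum count and books per bigram and return (sum_count, sum_books, ratio).

    records: iterable of (bigram, year, count, books) tuples.
    Assumption: sum_books is non-zero for every bigram.
    """
    totals = defaultdict(lambda: [0, 0])
    for bigram, _year, count, books in records:
        totals[bigram][0] += count
        totals[bigram][1] += books
    # True division, so the ratio is not truncated to an integer.
    return {b: (c, k, c / k) for b, (c, k) in totals.items()}


# Hypothetical sample records, not taken from the real dataset.
records = [
    ("a b", 2000, 10, 4),
    ("a b", 2001, 6, 4),
    ("c d", 2000, 3, 2),
]
aggregated = ratio_per_bigram(records)
print(aggregated)
```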

How to rename the fields in a relation

Consider the following code :
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
The relation ranked has two fields : the line number and the text. The text is called line and can be referred to by this alias, but the line number generated by RANK has none. As a consequence, the only way I can refer to it is as $0.
How can I give $0 a name, so that I can refer to it more easily once it's been joined to another data set and is no longer $0?
What you want to do is define a schema for your data. The easiest way to do so is to use the AS keyword, just as you're doing with LOAD.
You can define a schema with three operators : LOAD, STREAM and FOREACH.
Here, the easiest way to do so would be the following :
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
renamed_ranked = foreach ranked generate $0 as rank, $1;
You may find more information in the associated documentation.
It is also good to know that this operation won't add an iteration to your script. As @ArnonRotem-Gal-Oz said:
Pig doesn't perform the action in a serial manner i.e. it doesn't do all the ranking and then does another iteration on all the records. The pig optimizer will do the rename when it assigns the rank. You can see a similar behaviour explained in the pig cookbook.
You can add a projection with FOREACH:
named_ranked = FOREACH ranked GENERATE $0 as r,*;

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have a fixed schema, you can add columns to rows to suit your needs, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)'
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive? (I have never touched any of those, I'm just guessing.)
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
If you use the existing cassandra file, you would have to unwind the data. Since NOSQL files are unidirectional this could be a very time consuming operation in Cassandra itself. The data would have to be sorted in the opposite order from the first file. Frankly I believe that you would have to go back to the original data that was used to populate the first file and populate this new file from that.
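Whichever tool runs the job, the transformation asked about in the question is an inversion of the student-to-courses mapping. The following Python sketch shows that logic on in-memory data only. It does not use any Cassandra API. The function name, the dict-based row model, and the sample IDs are all assumptions made for illustration, with column names written as `course:ID1` as in the question.

```python
from collections import defaultdict


def courses_from_students(students):
    """Invert student rows into course rows.

    students: dict mapping student id -> dict of column name -> value,
    where course memberships are columns named 'course:<course id>'.
    Returns a dict mapping course id -> set of student ids.
    """
    courses = defaultdict(set)
    for student_id, columns in students.items():
        for name in columns:
            prefix, _, course_id = name.partition(":")
            if prefix == "course":  # skip meta:* and any other columns
                courses[course_id].add(student_id)
    return dict(courses)


# Hypothetical rows standing in for the Students column family.
students = {
    "s1": {"meta:username": "alice", "course:ID1": "", "course:ID2": ""},
    "s2": {"meta:username": "bob", "course:ID2": ""},
}
courses = courses_from_students(students)
print(courses)
```

In a Pig or MapReduce job the same idea would be expressed as emitting one (course, student) pair per course column and grouping by course.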

Pig: Count number of keys in a map

I'd like to count the number of keys in a map in Pig. I could write a UDF to do this, but I was hoping there would be an easier way.
data = LOAD 'hbase://MARS1'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'A:*', '-loadKey true -caching=100000')
AS (id:bytearray, A_map:map[]);
In the code above, I want to basically build a histogram of id and how many items in column family A that key has.
In hoping, I tried c = FOREACH data GENERATE id, COUNT(A_map); but that unsurprisingly didn't work.
Or, perhaps someone can suggest a better way to do this entirely. If I can't figure this out soon I'll just write a Java MapReduce job or a Pig UDF.
SIZE should apparently work for you (not tried it myself):
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SIZE

Efficient set operations in mapreduce

I have inherited a mapreduce codebase which mainly calculates the number of unique user IDs seen over time for different ads. To me it doesn't look like it is being done very efficiently, and I would like to know if anyone has any tips or suggestions on how to do this kind of calculation as efficiently as possible in mapreduce.
We use Hadoop, but I'll give an example in pseudocode, without all the cruft:
map(key, value):
    ad_id = ..    // extract from value
    user_id = ... // extract from value
    collect(ad_id, user_id)
reduce(ad_id, user_ids):
    unique_user_ids = new Set()
    foreach (user_id in user_ids):
        unique_user_ids.add(user_id)
    collect(ad_id, unique_user_ids.size)
It's not much code, and it's not very hard to understand, but it's not very efficient. Every day we get more data, and so every day we need to look at all the ad impressions from the beginning to calculate the number of unique user IDs for that ad, so each day it takes longer, and uses more memory. Moreover, without having actually profiled the code (not sure how to do that in Hadoop) I'm pretty certain that almost all of the work is in creating the set of unique IDs. It eats enormous amounts of memory too.
I've experimented with non-mapreduce solutions and have gotten much better performance (but the question there is how to scale it in the same way that I can scale with Hadoop), but it feels like there should be a better way of doing it in mapreduce than the code I have. It must be a common enough problem for others to have solved.
How do you implement the counting of unique IDs in an efficient manner using mapreduce?
The problem is that the code you inherited was written with the mindset "I'll determine the unique set myself" instead of the "let's leverage the framework to do it for me".
I would do something like this (pseudocode) instead, using two jobs:
// Job 1: emit each (ad_id, user_id) pair once
map(key, value):
    ad_id = ..    // extract from value
    user_id = ... // extract from value
    collect(ad_id & user_id, unused dummy value)
reduce(ad_id & user_id, unused dummy values):
    output(ad_id, 1)  // one unique user_id
// Job 2: count the pairs per ad
map(ad_id, 1):  // identity mapper
    collect(ad_id, 1)
reduce(ad_id, set of a lot of '1's):
    unique_user_ids = sum of the '1's
    output(ad_id, unique_user_ids)
Niels' solution is good, but for an approximate alternative that is closer to the original code and uses only one map reduce phase, just replace the set with a bloom filter. The membership queries in a bloom filter have a small probability of error, but the size estimates are very accurate.
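The two-job approach can be tried out locally with a small Python simulation. This is only a sketch: `run_job` is a hypothetical in-memory stand-in for a map/shuffle/reduce pass, not a Hadoop API, and the sample impressions are invented.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one map/shuffle/reduce pass."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # group values by key, as the shuffle does
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def map_pairs(record):
    # Job 1 mapper: the composite (ad_id, user_id) becomes the key.
    ad_id, user_id = record
    yield (ad_id, user_id), None


def reduce_pairs(key, _values):
    # Job 1 reducer: called once per distinct pair, so emit a single 1.
    ad_id, _user_id = key
    yield ad_id, 1


def map_identity(record):
    # Job 2 mapper: pass (ad_id, 1) through unchanged.
    yield record


def reduce_sum(ad_id, ones):
    # Job 2 reducer: the number of 1s is the number of unique users.
    yield ad_id, sum(ones)


# Invented sample impressions as (ad_id, user_id) pairs.
impressions = [("ad1", "u1"), ("ad1", "u2"), ("ad1", "u1"), ("ad2", "u1")]
distinct_pairs = run_job(impressions, map_pairs, reduce_pairs)
unique_counts = dict(run_job(distinct_pairs, map_identity, reduce_sum))
print(unique_counts)
```

The point of the design is that no reducer ever holds a set of user IDs in memory: the framework's grouping by composite key does the de-duplication.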
