Join 2 tables in Hadoop

I am very new to Hadoop, so please bear with me. Any help would be appreciated.
I need to join 2 tables.
Table 1 has pagename, pagerank.
For example (the actual data set is huge, but follows the same pattern):
pageA,0.13
pageB,0.14
pageC,0.53
Table 2 is a simple wordcount-style table with word, pagename.
For example (again, the actual data set is huge but follows the same pattern):
test,pageA:pageB
sample,pageC
json,pageC:pageA:pageD
Now, if a user searches for any word from the second table, I should return the matching pages ordered by their pagerank from table 1.
Expected output when searching for test:
test = pageB,pageA
My approach was to load the first table into the distributed cache, read the second table in the map method, get the list of pages for the word, and sort that list using the pagerank info from the first table held in the distributed cache. This works for the dataset I am working with, but I wanted to know if there is a better way, and also how this join can be done with Pig or Hive.

A simple approach using a Pig script:
PAGERANK = LOAD 'hdfs/pagerank/dataset/location' USING PigStorage(',')
AS (page:chararray, rank:float);
WORDS_TO_PAGES = LOAD 'hdfs/words/dataset/location' USING PigStorage(',')
AS (word:chararray, pages:chararray);
PAGES_MATCHING = FOREACH (FILTER WORDS_TO_PAGES BY word == '$query_word') GENERATE FLATTEN(TOKENIZE(pages, ':'));
RESULTS = FOREACH (JOIN PAGERANK BY page, PAGES_MATCHING BY $0) GENERATE page, rank;
SORTED_RESULTS = ORDER RESULTS BY rank DESC;
DUMP SORTED_RESULTS;
The script needs one parameter, which is the query word:
pig -f pagerank_join.pig -param query_word=test
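Since the pagerank table is exactly what the question already keeps in the distributed cache, the join above can also be done map-side with Pig's fragment-replicate join. This is a hedged variant of the RESULTS line only, assuming PAGERANK is small enough to fit in memory (the replicated relation must be listed last in the JOIN):
RESULTS = FOREACH (JOIN PAGES_MATCHING BY $0, PAGERANK BY page USING 'replicated') GENERATE page, rank;
The rest of the script stays the same.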


How to rename the fields in a relation

Consider the following code :
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
The relation ranked has two fields: the line number and the text. The text field is called line and can be referred to by this alias, but the line number generated by RANK has no name. As a consequence, the only way I can refer to it is as $0.
How can I give $0 a name, so that I can refer to it more easily once it has been joined to another data set and is no longer $0?
What you want to do is define a schema for your data. The easiest way to do so is to use the AS keyword, just like you are already doing with LOAD.
You can define a schema with three operators: LOAD, STREAM and FOREACH.
Here, the easiest approach would be the following:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
renamed_ranked = FOREACH ranked GENERATE $0 AS rank, $1;
You may find more information in the associated documentation.
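A quick sanity check, assuming the snippet above: DESCRIBE should now show a named first field.
DESCRIBE renamed_ranked;
-- prints something like: renamed_ranked: {rank: long, line: chararray}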
It is also good to know that this operation won't add an iteration to your script. As @ArnonRotem-Gal-Oz said:
Pig doesn't perform the action in a serial manner, i.e. it doesn't do all the ranking and then do another iteration over all the records. The Pig optimizer will do the rename when it assigns the rank. You can see similar behaviour explained in the Pig cookbook.
You can add a projection with FOREACH as
named_ranked = FOREACH ranked GENERATE $0 as r,*;

Intersection of Intervals in Apache Pig

In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.
This is not an efficient solution, and I recommend writing a UDF to do this.
Self-join the dataset with itself to get a cross product of all combinations. In Pig, it's difficult to join a relation with itself, so you just act as if you are loading two separate datasets. After the cross product, you end up with data like:
1,A,1500,2000,1,A,1500,2000
1,A,1500,2000,2,A,1900,3000
.....
At this point, you need to satisfy the following conditions:
the "where" field matches
the two ids from the self join don't match (so you don't get back the same record intersecting with itself)
the start time of the second record is greater than or equal to the start time of the first record and less than or equal to the first record's end time
This code should work; it might have a syntax error somewhere as I couldn't test it, but it should help you write what you need.
inputdataone = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
inputdatatwo = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
crossProduct = CROSS inputdataone, inputdatatwo;
crossProduct = FOREACH crossProduct GENERATE
    inputdataone::id AS id_one,
    inputdatatwo::id AS id_two,
    ((inputdataone::id != inputdatatwo::id
      AND inputdataone::where == inputdatatwo::where
      AND inputdatatwo::start >= inputdataone::start
      AND inputdatatwo::start <= inputdataone::end) ? 1 : 0) AS intersect;
find_intersect = FILTER crossProduct BY intersect == 1;
final =
FOREACH find_intersect
GENERATE id_one,
id_two;
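A less explosive variant of the same idea, sketched below and not tested, reusing the two LOAD statements above: JOIN on the "where" field instead of taking a full CROSS, so only records from the same group are ever paired, and use the symmetric overlap test so a pair is caught no matter which interval starts first.
pairs = JOIN inputdataone BY where, inputdatatwo BY where;
overlapping = FILTER pairs BY inputdataone::id < inputdatatwo::id
    AND inputdatatwo::start <= inputdataone::end
    AND inputdatatwo::end >= inputdataone::start;
result = FOREACH overlapping GENERATE inputdataone::id AS id_one, inputdatatwo::id AS id_two;
DUMP result;
Requiring id_one < id_two also means each overlapping pair is reported exactly once, which matches the (1,2) and (4,5) output the question expects.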
Crossing large sets inflates the data.
A naive solution without crossing would be to partition the time range and check for intersections within each partition.
I am working on a similar problem and will provide a code sample when I am done.

Problems generating ordered files in Pig

I am facing two issues:
Report Files
I'm generating a Pig report whose output goes into several files: part-r-00000, part-r-00001, ... (These all come from the same relation; multiple reducer tasks produce the data, hence the multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO $output USING PigStorage(',');
I'd like all of these to end up in one report, so before storing the result using HBaseStorage I sort it with a single reducer: report = ORDER report BY col1 PARALLEL 1. In other words, I am forcing the number of reducers to 1, and therefore generating a single file as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO $output USING PigStorage(',');
Is there a better way of generating a single file output?
Group By
I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers to group the result, and when I then sum or count the data I get incorrect results. For example:
Instead of seeing this:
part-r-00000:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
part-r-00001:
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
part-r-00000:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
So I end up doing my group as follows: grouped = GROUP data BY col PARALLEL 1
then I see the correct result.
I have a feeling I'm missing something.
Here is a pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val;
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group AS id, SUM(row.val);
STORE report INTO '$outpath' USING PigStorage...
EDIT, new answers based on the extra details you provided:
1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing a hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
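For reference, here is a hedged illustration of what reading the table back with that option might look like; the table name (report_table), column family (data) and column names are placeholders. HBase keeps rows sorted by row key, so loading the key should give you ordered data without a separate ORDER BY:
report = LOAD 'hbase://report_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:col1 data:col2', '-loadKey true')
    AS (rowkey:chararray, col1:chararray, col2:chararray);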
2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find to your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
OLD ANSWERS:
1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:
Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.
The way to do this is as follows:
C = JOIN A BY a1, B BY b1 USING 'merge';
2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have any fixed schema, you can add columns to rows as needed, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)'
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive (I have never touched any of those, just guessing)?
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
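Along the same lines, here is a hedged sketch of the specific inversion the question asks about: read the Students CF, keep only the course:* columns, and write one Courses row per course with the student ids as column names. The keyspace name (School), the course: column prefix and the empty column value are assumptions taken from the question, and a UUID row key may need an extra conversion step before it is usable as a column name:
students = LOAD 'cassandra://School/Students' USING CassandraStorage()
    AS (student_id:bytearray, cols:bag{col:tuple(name:chararray, value)});
cols_flat = FOREACH students GENERATE student_id, FLATTEN(cols) AS (name, value);
course_cols = FILTER cols_flat BY name MATCHES 'course:.*';
-- strip the 'course:' prefix (7 characters) to get the course id, and emit
-- (column name = student id, column value = empty) pairs for the Courses row
pairs = FOREACH course_cols GENERATE
    SUBSTRING(name, 7, (int)SIZE(name)) AS course_id,
    student_id AS col_name,
    '' AS col_value;
by_course = GROUP pairs BY course_id;
courses_out = FOREACH by_course GENERATE group, pairs.(col_name, col_value);
STORE courses_out INTO 'cassandra://School/Courses' USING CassandraStorage();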
If you use the existing Cassandra column family, you would have to unwind the data. Since NoSQL column families are unidirectional, this could be a very time-consuming operation in Cassandra itself; the data would have to be sorted in the opposite order from the first column family. Frankly, I believe you would have to go back to the original data that was used to populate the first column family and populate this new one from that.

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids using another file that contains 2 columns of mappings (column 1 is our id, column 2 is a third party's id):
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value, iterate over each id, and return the first mapped value found in the passed-in mappings (thinking that is how it would logically work).
We are loading the data in Pig like this:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
there is an error during parsing: Invalid alias mappings in [data::title: chararray, data::category: chararray]
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?
Is there any way the "map" type in Pig would help us here? Or do we need to somehow join the values?
EDIT: To be more specific, we don't want to map ALL of the category ids to the 3rd party ids; we just want to map the first one. The UDF will iterate over the list of our category ids and will return when it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
I don't think you can do it this way in Pig.
A solution similar to what you wanted to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
output = FOREACH data GENERATE title, MAP_PRODUCT(category);
This will work if your mapping file is not too big. If it does not fit in memory, you will have to partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native join plus a nested FOREACH with ORDER BY/LIMIT 1 for each product, as sketched below.
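For completeness, here is a hedged sketch of that join alternative. The file mappings_with_lineno.txt (the mapping file with a line number prepended as its first column) and all relation names are illustrative, and the nested ORDER BY/LIMIT needs a newer Pig release than 0.6. For each product it keeps the mapping whose line number in the mapping file is lowest:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
exploded = FOREACH data GENERATE title, FLATTEN(TOKENIZE(category)) AS ourId;
mappings = LOAD 'mappings_with_lineno.txt' USING PigStorage()
    AS (lineno:int, ourId:chararray, theirId:chararray);
joined = FOREACH (JOIN exploded BY ourId, mappings BY ourId)
    GENERATE exploded::title AS title, mappings::lineno AS lineno, mappings::theirId AS theirId;
by_title = GROUP joined BY title;
first_map = FOREACH by_title {
    ordered = ORDER joined BY lineno;
    top = LIMIT ordered 1;
    GENERATE group AS title, FLATTEN(top.theirId) AS theirId;
};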
