I'm new to Pig. I'm trying to do a merge join, which has the following requirement:
Data must be sorted on join keys in ascending (ASC) order on both
sides.
Sample File:
4, The Object of Beauty, 1991,2.8,6150
1, The Nightmare Before Christmas, 1993,3.9,4568
2, The Mummy, 1932,3.5,4388
3, Orphans of the Storm, 1921,3.2,9062
3, Orphans of the Storm, 1921,3.2,9062
4, The Object of Beauty, 1991,2.8,6150
5, Night Tide, 1963,2.8,5126
6, One Magic Christmas, 1985,3.8,5333
7, Muriel's Wedding, 1994,3.5,6323
8, Mother's Boys, 1994,3.4,5733
9, Nosferatu: Original Version, 1929,3.5,5651
10, Nick of Time, 1995,3.4,5333
I executed the following commands inside Pig:
movies = LOAD 'Sample.csv' USING PigStorage(',') AS (id:int, name, year, rating, duration);
movies_sorted = ORDER movies BY id ASC PARALLEL 3;
STORE movies_sorted INTO 'output_movies';
When I execute:
hadoop fs -cat output_movies/part-r-00000
I see that there are records with equal keys in different partitions. For example, I have the record with id 3 in two different partitions. To my knowledge, records with the same key should always end up in the same partition.
What could be wrong?
In a few cases, including ORDER BY and skewed JOIN, Pig will break the map-reduce convention of sending all records for a given key to just one reducer. (Note that the notion of ordering is already outside the map-reduce paradigm.) You are still guaranteed, however, that if you traverse the output of the reducers in order (as indicated by the number in part-r-NNNNN), the records will be ordered as specified.
You can read more in this thread.
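For example, with PARALLEL 3 the ORDER BY above produces three part files. Each part file is internally sorted, and every key in part-r-00000 is less than or equal to every key in part-r-00001, and so on, so reading the parts in numeric order gives you the fully sorted data (paths assume the output_movies directory from the question):
# Concatenate the reducer outputs in part-number order to see the globally sorted data.
hadoop fs -cat output_movies/part-r-00000 output_movies/part-r-00001 output_movies/part-r-00002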
I have a preview of a matrix like this:
The columns "Report#" and "Headline Indicator" are added manually, "Report Month" is the number of the month and is used for row grouping, and "1" and "3" are the column group fields (department_key_id). The design is shown in the picture below:
I want to sort the data by the first column, "Report#".
I tried to configure interactive sorting for that column, but it did not work for me.
How can I sort the rows of data in the table in this way: 2, 2, 3, 3, 4, 4, 5, 5?
Thanks!
So it looks to me like you might need a couple of changes to get the set-up you want here. First, I'm going to recommend adding an ORDER BY [report_month_num] to your query and you can do away with the current row grouping on report_month_num.
You'll need to add a Detail grouping row for each of your four rows. So right click in any textbox, navigate to Add Group > Row Group > Adjacent below. Make sure to select Show detail data for each of the four groups you make. You'll need to copy the data into the new rows and you should be all set.
This should ensure the ordering you are expecting.
I've been racking my brain trying to figure out a way to reorganize my model array when the user commits a drag and drop on an NSTableView. They can currently select one or more rows and drop them before or after any row in the table.
I'm given an IndexSet containing the rows of the table that were dragged. I'm also given a single destinationRow representing the destination of where those rows will be placed.
So for example, the original model looks like this: [0, 1, 2, 3, 4, 5]
And the user drags rows 1 and 2 to after 3...
Then I need to update my array to look like this: [0, 3, 1, 2, 4, 5]
Any ideas on how to efficiently achieve this? Thank you.
I realized I'd been approaching the problem the wrong way.
I was trying to think of how to swap elements by their indexes, when all I really needed to do was use the built-in Array methods remove(at:) and insert(_:at:).
That allowed me to reorganize the array quite simply.
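For reference, here is a minimal sketch of that approach (the function name is mine; it assumes the model is a plain array and that destinationRow is the row index handed to the drop delegate method):

import Foundation

// Move the elements at `sourceIndexes` so they land where the drop indicator (destinationRow) points.
func moveRows<T>(in model: inout [T], from sourceIndexes: IndexSet, to destinationRow: Int) {
    // Remember the dragged elements in their original order.
    let dragged = sourceIndexes.map { model[$0] }

    // Remove from highest index to lowest so earlier removals don't shift later ones.
    for index in sourceIndexes.reversed() {
        model.remove(at: index)
    }

    // The destination shifts down by one for every removed element that sat above it.
    let offset = sourceIndexes.filter { $0 < destinationRow }.count
    model.insert(contentsOf: dragged, at: destinationRow - offset)
}

With var model = [0, 1, 2, 3, 4, 5], calling moveRows(in: &model, from: IndexSet([1, 2]), to: 4) yields [0, 3, 1, 2, 4, 5], as in the example above (dropping after row 3 means the delegate reports destination row 4).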
I am facing two issues:
Report Files
I'm generating a Pig report, the output of which goes into several files: part-r-00000, part-r-00001, ... (These all come from the same relation; multiple reducers produce the data, hence the multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO '$output' USING PigStorage(',');
I'd like all of these to end up in one report, so what I end up doing is, before storing the result (using HBaseStorage), sorting it with a single reducer: report = ORDER report BY col1 PARALLEL 1. In other words, I am forcing the number of reducers to 1 and therefore generating a single file, as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO '$output' USING PigStorage(',');
Is there a better way of generating a single file output?
Group By
I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers to group the result. When I sum or count the data, I get incorrect results. For example:
Instead of seeing this:
part-r-00000:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
part-r-00001:
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
part-r-00000:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
So I end up doing my group as follows: grouped = GROUP data BY col PARALLEL 1
then I see the correct result.
I have a feeling I'm missing something.
Here is a pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val;
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group AS id, SUM(row.val);
STORE report INTO '$outpath' USING PigStorage...
EDIT, new answers based on the extra details you provided:
1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing a hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
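For instance, a quick sketch of the getmerge route (the paths are placeholders):
# Concatenate all part-r-NNNNN files of the job output into one local file.
hdfs dfs -getmerge /path/to/output /tmp/report.csv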
2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find to your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
OLD ANSWERS:
1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:
Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.
The way to do this is as follows:
C = JOIN A BY a1, B BY b1 USING 'merge';
2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?
I'm working on a Pig script which performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total number and amount of transactions).
My expected output should be two data files:
1) Rollup data - e.g., account A1 received 50 transactions from country AU.
2) Raw transactions - all 50 of the above transactions for A1.
My Pig script is currently creating an output data source in the following format:
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question here is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g., ID 1 for the rollup record used as a foreign key for all 50 associated txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (they don't need to be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the Pig script:
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I am getting:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to ensure you generate ids which contain a sufficient number of random bits to ensure the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
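If you want to apply this directly from Pig, one option is a trivial UDF along these lines (just a sketch; the class name RandomId is made up):

import java.io.IOException;
import java.util.UUID;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Tags each record it is applied to with a random 122-bit identifier.
public class RandomId extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        return UUID.randomUUID().toString();
    }
}

You would then REGISTER the jar containing it and call it in a FOREACH ... GENERATE.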
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
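For instance, using the receiver fields from the LOAD statement in the question, the composite key could be built like this (the relation name rollup and the remaining field names are placeholders for your rolled-up output):

keyed = FOREACH rollup GENERATE
            CONCAT(CONCAT((chararray)rcvr_acct, '-'), rcvr_cntry) AS link_id,
            rcvr_acct, rcvr_cntry, total_txns, raw_txns;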
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
This is for the case when your IDs need to be numbers and you cannot use UUIDs or other string-based IDs.
There is a library of Pig UDFs from LinkedIn called DataFu with a very useful UDF, Enumerate. What you can do is group all records into a single bag and pass that bag to Enumerate. Here is the code off the top of my head:
register /path/to/datafu-0.0.9.jar
define Enumerate datafu.pig.bags.Enumerate('1');
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));
Consider I have two collections in MongoDB. One for products with documents like:
{'_id': ObjectId('lalala'), 'title': 'Yellow banana'}
And another stores price changes with documents like:
{'product': DBRef('products', ObjectId('lalala')),
'since': datetime(2011, 4, 5),
'new_price': 150 }
One product may have many price changes. A price lasts until a new change with a later timestamp. I guess you've caught the idea.
Say I have 100 products. I want to query my DB to find out the price of each product as of June 9, 2011. What is the most efficient (quickest) way to perform this query in MongoDB? Suppose I have no caching solution, or the cache is empty.
I thought about a group statement on the prices collection, where the reduce function would select the last since before the provided date, grouping by product.$id. But in this case I would not benefit from an index on the since field, and all documents would be scanned.
Any ideas?
I had a similar problem, but for GPS locations. I found the fastest way was to set up a query for each item, which is rather counter-intuitive if you're used to SQL databases.
Query for the item where its timestamp is less than or equal to the date you're looking for, and limit the result to 1. Repeat for each item. To really speed things up, run multiple queries in parallel to utilise all the cores on the MongoDB server.
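For illustration, here is a minimal pymongo sketch of that approach (the database name and the collection names products and price_changes are assumptions; a compound index on product.$id and since keeps each lookup cheap):

from datetime import datetime
from pymongo import MongoClient, DESCENDING

db = MongoClient()["shop"]          # hypothetical database name
as_of = datetime(2011, 6, 9)

def price_at(product_id, as_of):
    # Latest price change at or before `as_of` for a single product; find_one limits the result to 1.
    change = db.price_changes.find_one(
        {"product.$id": product_id, "since": {"$lte": as_of}},
        sort=[("since", DESCENDING)],
    )
    return change["new_price"] if change else None

# These lookups are independent, so they can be issued in parallel (threads or async).
for product in db.products.find():
    print(product["title"], price_at(product["_id"], as_of))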