Interacting with multiple stored values within a field in Pig - hadoop

I am currently working with a field in Pig that contains multiple values. I am looking to count users by product by location, and I used LOAD to create data in the following format: (Location, {(product1), (product2), (product3)}, numOfUsers). I am looking to separate out each of the products and treat them as separate entities, meaning I'd like to end up with the following:
(location, (product1), numOfUsers)
(location, (product2), numOfUsers)
(location, (product3), numOfUsers)
I believe I need to use some sort of nested FOREACH, but I'm a bit lost. The number of users for each product contained in the same tuple will be the same, since they are grouped, and that's perfectly fine. I am a beginner (started with Pig 3 days ago), so any guidance would be greatly appreciated. I believe I would use FLATTEN?

B = FOREACH A GENERATE location, FLATTEN(products) AS product, numOfUsers;
This solved the issue: FLATTEN creates a cross product of the tuples stored in the bag with the rest of the record. Used http://www.st.ewi.tudelft.nl/~hauff/BDP-Lectures/9_10_advanced_pig.pdf for reference. Very useful resource.
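To make the effect concrete, here is a minimal sketch (relation name and sample values are illustrative) of what the FLATTEN does to a single grouped record:

-- assume A has the schema (location:chararray, products:bag{t:(product:chararray)}, numOfUsers:long)
-- and contains one record: (NYC, {(product1),(product2),(product3)}, 25)
B = FOREACH A GENERATE location, FLATTEN(products) AS product, numOfUsers;
DUMP B;
-- (NYC,product1,25)
-- (NYC,product2,25)
-- (NYC,product3,25)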

Related

Global filters for different data sources (with common tables)

I am currently working in Tableau with 2 data sources, each using a join of 2 tables (named A, B, C):
Data source 1: A-B
Data source 2: A-C
Basically, A contains the major information that I need and then I join data from B and C to get the extra information I need for each report I am doing.
I then do a dashboard that contains reports using the data source 1 and 2.
My problem now is that I am filtering this dashboard using a dimension in A, and I would like it to apply to all worksheets (i.e. both those using data source 1 and those using data source 2).
I thought that because A is the common table in both data sources, filtering on a dimension in A would be enough to filter everything, but it seems that is not the case.
Is there a way to fix this?
I read some forums about creating a parameter. However, the filtering I am doing is basically as follows: I want my users to choose 1 shop name. They can find it either by:
Typing the name in the 'Shop name' quick filter,
Using a combination of the quick filters 'Region' and 'Country' to then get a drop-down of 'Shop Name' with a reduced list of shop names (easier when the user knows where the shop is but does not remember its exact name).
Using a parameter would not allow me to do this anymore, since all of this relies on filtering down to the relevant values.
Does anyone have any recommendations?

Storing data with big and dynamic groupings / paths

I currently have the following pig script (column list truncated for brevity):
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
    SITE_ID_COL:int,            -- Item Site ID
    META_ID_COL:int,            -- Top Level (meta) category ID
    EXTRACT_DATE_COL:chararray, -- Date for the data points
    ...
);
SPLIT inputData INTO site0 IF (SITE_ID_COL == 0), site3 IF (SITE_ID_COL == 3), site15 IF (SITE_ID_COL == 15);
STORE site0 INTO 'pigsplit1/0/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/0/','2', 'bz2', '\\x7F');
STORE site3 INTO 'pigsplit1/3/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/3/','2', 'bz2', '\\x7F');
STORE site15 INTO 'pigsplit1/15/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/15/','2', 'bz2', '\\x7F');
It works great for what I wanted it to do, but there are actually at least 22 possible site IDs, and I'm not certain there aren't more. I'd like to dynamically create the splits and store into paths based on that column. Is the easiest way to do this a two-step use of the MultiStorage UDF, first splitting by the site ID and then loading all those results and splitting by the date? That seems inefficient. Can I somehow do it through GROUP BYs? It seems like I should be able to GROUP BY the site ID, then flatten each row and run MultiStorage on that, but I'm not sure how to concatenate the GROUP into the path.
The MultiStorage UDF is not set up to divide inputs on two different fields, but that's essentially what you're doing -- the use of SPLIT is just to emulate MultiStorage with two parameters. In that case, I'd recommend the following:
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
    SITE_ID_COL:int,            -- Item Site ID
    META_ID_COL:int,            -- Top Level (meta) category ID
    EXTRACT_DATE_COL:chararray, -- Date for the data points
    ...
);
dataWithKey = FOREACH inputData GENERATE CONCAT(CONCAT((chararray)SITE_ID_COL, '-'), EXTRACT_DATE_COL), *;
STORE dataWithKey INTO 'tmpDir' USING org.apache.pig.piggybank.storage.MultiStorage('tmpDir', '0', 'bz2', '\\x7F');
Then go over your output with a simple script to list all the files in your output directories, extract the site and date IDs, and move them to appropriate locations with whatever structure you like.
Not the most elegant workaround, but it could work all right for you. One thing to watch out for is that the separator you choose for your key might not be allowed (it may need to be alphanumeric). Also, you'll be stuck with that extra key field in your output data.
I've actually submitted a patch to the MultiStorage module to allow splitting on multiple tuple fields rather than only one, resulting in a dynamic output tree.
https://issues.apache.org/jira/browse/PIG-3258
It hasn't gotten much attention yet, but I'm using it in production with no issues.

assigning IDs to hadoop/PIG output data

I'm working on a Pig script which performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total count and amount of those transactions).
My expected output should be two data files
1) Rollup data - like account A1 received 50 transactions from country AU.
2) Raw transactions - all above 50 transactions for A1.
My Pig script is currently creating output in the following format:
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
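For reference, output in that shape would typically come from a GROUP followed by a FOREACH; here is a rough sketch (the aliases follow the schema shown in the EDIT below and are only illustrative):

byAcct = GROUP data_txn BY (rcvr_acct, rcvr_cntry);
rollup = FOREACH byAcct GENERATE
    group.rcvr_acct AS account,
    group.rcvr_cntry AS country,
    COUNT(data_txn)  AS totalTxns,
    data_txn         AS rawTransactions;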
Now the question is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g. an ID of 1 for the rollup record used as a foreign key for all 50 associated Txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (they don't need to be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the Pig script:
register /UDF/datafu-0.0.8.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I am getting:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to make sure you generate ids with enough random bits that the probability of a collision is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.
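For example, taking the log as base 2: to label n = 1,000,000 records you would want roughly 3*log2(10^6) ≈ 60 random bits, and by the birthday approximation the chance of any collision among them is about n^2 / 2^61 = 10^12 / 2^61, on the order of 4e-7.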
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
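If you do want a single explicit link field in both outputs, here is a hedged sketch in Pig (relation and field names are illustrative, assuming a rollup relation shaped like the table in the question):

-- rollup: (account, country, totalTxns, rawTransactions:bag) from an earlier GROUP BY (account, country)
withKey = FOREACH rollup GENERATE
    CONCAT(CONCAT((chararray)account, '-'), country) AS link_id,
    account, country, totalTxns, rawTransactions;
rollup_out = FOREACH withKey GENERATE link_id, account, country, totalTxns;
txns_out   = FOREACH withKey GENERATE link_id, FLATTEN(rawTransactions);
STORE rollup_out INTO 'rollup' USING PigStorage(',');
STORE txns_out   INTO 'raw_transactions' USING PigStorage(',');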
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
This is for the case when your IDs are numbers and you cannot use UUIDs or other string-based IDs.
There is a library of UDFs by LinkedIn (DataFu) with a very useful UDF, Enumerate. What you can do is group all records into a single bag and pass that bag to Enumerate. Here is the code from the top of my head:
-- REGISTER the DataFu jar and DEFINE the Enumerate UDF, as in the EDIT above
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));
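Enumerate appends an index field to each tuple in the bag (starting from the offset given in the constructor, so Enumerate('1') gives 1-based IDs); after the FLATTEN, each record therefore carries its generated ID as the last field.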

Filtering Quotes by InventTable

I'm trying to build a report in AX 2009 (SP1, currently rollup 6) with a primary data source of the SalesQuotationLine table. Due to how our inventory is structured, I need to apply a filter that shows only certain categories of items (in this case, non-service items as defined in the InventTable). However, it seems that there is a problem in the link between the SalesQuotationLine and InventTable such that only two specific items will ever display.
We have tested this against the Sales Quotation Details screen as well, with the same results. Executing a query such as this:
...will only show quotes that have one of the specific items mentioned earlier. If we change the Item Type to something else (for example to Item), the result is an empty set. We are also getting this issue on one of our secondary test servers, which for all intents and purposes is a fresh install.
There doesn't seem to be any issues with the data mapping from the one table to the other, and we are not experiencing this issue with any other table set. Is this a real issue, or am I just missing something?
After analyzing the results of a SQL Profiler run during the execution of the query, it seems the issue was a system bug. When selecting a table to join to the SalesQuotationLines, you have two options: 'Items' and 'Items (Item Number)'. Regardless of which table you select, the query joins the InventTable with the relation "SalesQuotationLines.ProjTransCode = InventTable.ItemId".
After comparing the table to other layers in the system, I found that the following block of code had been removed from the createLine method (in the SYP layer):
if (this.ProjTransType == QuotationProjTransType::Item)
{
    this.ProjTransCode = this.ItemId;
}
Since the ProjTransCode is no longer being populated, the join does not work except on certain quote lines that do have the ProjTransCode populated.
In addition, there is no directly defined relation to the InventTable - the link is only maintained via an Extended Data Type that is used on the SalesQuotationLine.ItemId field. Adding this relation in manually solved the problem.

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense, since you don't have any fixed schema, you can add columns to rows to suffice your needs, instead of running an ALTER TABLE query which may freeze your database for hours for very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)',
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig, or Hive (I have never touched any of those, just guessing)?
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
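Applied to the question itself, here is a hedged sketch of filling Courses from Students. The keyspace, CF names, and the 'course:' column-name convention are assumptions taken from the question, not tested code; with the composite comparator the column names may arrive as two-field tuples, in which case the FILTER and SUBSTRING steps would change accordingly:

students  = LOAD 'cassandra://School/Students' USING CassandraStorage()
            AS (key:bytearray, cols:bag{col:tuple(name:chararray, value)});
flat      = FOREACH students GENERATE (chararray)key AS student_id, FLATTEN(cols) AS (name, value);
enrolled  = FILTER flat BY name MATCHES 'course:.*';
pairs     = FOREACH enrolled GENERATE
                SUBSTRING(name, 7, (int)SIZE(name)) AS course_id,  -- strip the 'course:' prefix
                student_id AS col_name, 'enrolled' AS col_value;
by_course = GROUP pairs BY course_id;
courses   = FOREACH by_course GENERATE group, pairs.(col_name, col_value);
STORE courses INTO 'cassandra://School/Courses' USING CassandraStorage();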
If you use the existing Cassandra column family, you would have to unwind the data. Since NoSQL stores like this are unidirectional, this could be a very time-consuming operation in Cassandra itself; the data would have to be sorted in the opposite order from the first column family. Frankly, I believe you would have to go back to the original data that was used to populate the first column family and populate the new one from that.
