Hadoop Pig: Show entries using STARTSWITH

I am having issues using the STARTSWITH string function. I want to display all records whose System_Period begins with 20040.
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
sysGroup = GROUP transactions BY System_Period;
sysFilter = FILTER sysGroup BY STARTSWITH(transactions.System_Period, 20040);
DUMP sysFilter;
The error I am receiving is
Could not infer the matching function for org.apache.pig.builtin.STARTSWITH as multiple or none of them fit. Please use an explicit cast.

STARTSWITH only compares two chararrays, checking whether the first one starts with the second. You cannot pass a relation or a bag to it. One more thing to note is that it accepts only chararray (string) arguments, not integers. Either FILTER the records whose System_Period begins with 20040 before the GROUP BY, loading System_Period as a chararray and casting it back after the filter as per your need:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');
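If you need System_Period back as an integer after the filter, you can cast it in a FOREACH. A minimal sketch, assuming the chararray-based load above and Pig's range projection:
casted = FOREACH sysFilter GENERATE Branch_Number .. Service_Date, (int)System_Period AS System_Period;
sysGroup = GROUP casted BY System_Period;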
Or, after the GROUP BY, FLATTEN the result and then filter:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysGroup = GROUP transactions BY System_Period;
flatres = FOREACH sysGroup GENERATE group,FLATTEN(transactions);
sysFilter = FILTER flatres BY STARTSWITH(System_Period, '20040');
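Since the group key is System_Period itself, you could also skip the FLATTEN and filter the grouped relation directly on its key. A sketch, again assuming the chararray load above:
sysFilter = FILTER sysGroup BY STARTSWITH(group, '20040');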

Related

Filter in Pig by an existing relation

I have a blacklist file that looks a little bit like this:
481295b2-30c7-4191-8c14-4e513c7e7577
481295a2-1234-4191-8c14-4e513c7e7577
and a lot of other data I am loading. How can I filter out the data that is already in the blacklist? Sort of a NOT IN, in SQL terms.
I tried something a little bit like this, but couldn't make it work with a relation.
You can use a left join and filter to implement it. E.g.,
data = load '/path/to/data.txt' as (id: chararray);
blacklist = load '/path/to/blacklist.txt' as (id: chararray);
jnd = join data by id left outer, blacklist by id using 'replicated';
filtered = filter jnd by blacklist::id is null;
result = foreach filtered generate data::id as id;
dump result;
In this example, the input data is left-outer-joined with the blacklist. After that, the rows that match the blacklist are removed by an is null check.
using 'replicated' tells Pig to load the second relation into memory to speed up the join. If the blacklist is too big to fit into memory, remove using 'replicated'.
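If you would rather avoid the join entirely, a COGROUP-based anti-join is another option. This is a sketch, not part of the answer above; IsEmpty is a Pig builtin:
cgd = COGROUP data BY id, blacklist BY id;
noMatch = FILTER cgd BY IsEmpty(blacklist);
result = FOREACH noMatch GENERATE FLATTEN(data);
dump result;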

HBase filter to find rows without a specific column

I want to filter out all rows that do not have a specific column. Any idea which comparator to use?
You can use a SkipFilter combined with a QualifierFilter. The comparison must be NOT_EQUAL, so that SkipFilter skips any row containing a cell whose qualifier matches.
If you use the Java client API:
Filter filter = new QualifierFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("column-name")));
Filter filter2 = new SkipFilter(filter);
scan.setFilter(filter2);
This will return all the rows without that specific column.
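For context, here is a minimal sketch of wiring the SkipFilter into a full scan. It assumes the HBase 1.x client API, and the table name "mytable" is hypothetical:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.filter.SkipFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("mytable"))) { // "mytable" is hypothetical
    Scan scan = new Scan();
    // NOT_EQUAL fails on cells whose qualifier IS "column-name",
    // and SkipFilter then drops the whole row containing such a cell.
    Filter filter = new SkipFilter(new QualifierFilter(
            CompareFilter.CompareOp.NOT_EQUAL,
            new BinaryComparator(Bytes.toBytes("column-name"))));
    scan.setFilter(filter);
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
    }
}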
SingleColumnValueFilter has a setFilterIfMissing method that, given true, excludes all rows that do not contain the given column. All that is needed is to design the filter so it will always pass and to call setFilterIfMissing(true):
SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes(columnFamily), Bytes.toBytes("column_name"), CompareFilter.CompareOp.NOT_EQUAL, Bytes.toBytes("non-sense"));
filter.setFilterIfMissing(true);
scan.setFilter(filter);

Hive Columnar Loader in HDP 2.0

I am using HDP 2.0 and running a simple Pig script.
I have registered the jars below and am then executing the following code (with the schema updated):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is, though the value filtered for in F is present in the Hive table, the result always writes 0 records to the output. Yet it is able to load all the records into A.
Basically, the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution; I am struggling a lot with this.
Thanks a lot for the help!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that, for processing the actual values, you need the intermediate relation B.
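As an aside, on HDP the more common route for reading Hive tables from Pig is HCatalog rather than HiveColumnarLoader. A hedged sketch (the loader class was org.apache.hcatalog.pig.HCatLoader in Hive 0.11-era releases, later org.apache.hive.hcatalog.pig.HCatLoader; run the script with pig -useHCatalog):
A = LOAD 'test.hivetables' USING org.apache.hcatalog.pig.HCatLoader();
F = FILTER A BY id == 85986249;
STORE F INTO '/user/test/Pigout' USING PigStorage();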

Pig Latin (Filter 2nd data source in foreach loop)

I have 2 data sources. One contains a list of api calls and the other contains all related authentication events. There can be multiple Auth Events for each Api Call. I want to find the auth event that:
a) contains the same "identifier" as the Api Call
b) happened within a second after the Api Call
c) is the closest to the Api Call after the above filtering.
I had planned to loop through each ApiCall event in a foreach loop and then use filter statements on the authEvents to find the correct one; however, it does not appear that this is possible (see USING Filter in a Nested FOREACH in PIG).
Would anyone be able to suggest other ways to achieve this? If it helps, here's the Pig script I tried to use:
apiRequests = LOAD '/Documents/ApiRequests.txt' AS (api_fileName:chararray, api_requestTime:long, api_timeFromLog:chararray, api_call:chararray, api_leadString:chararray, api_xmlPayload:chararray, api_sourceIp:chararray, api_username:chararray, api_identifier:chararray);
authEvents = LOAD '/Documents/AuthEvents.txt' AS (auth_fileName:chararray, auth_requestTime:long, auth_timeFromLog:chararray, auth_call:chararray, auth_leadString:chararray, auth_xmlPayload:chararray, auth_sourceIp:chararray, auth_username:chararray, auth_identifier:chararray);
specificApiCall = FILTER apiRequests BY api_call == 'CSGetUser'; -- Get all events for this specific call
match = foreach specificApiCall { -- Now try to get the closest matching auth event
filtered1 = filter authEvents by auth_identifier == api_identifier; -- Only use auth events that have the same identifier (this will return several)
filtered2 = filter filtered1 by (auth_requestTime-api_requestTime)<1000; -- Further refine by using auth events within a second of the api call's time
sorted = order filtered2 by auth_requestTime; -- Get the auth event that's closest to the api call
limited = limit sorted 1;
generate limited;
};
dump match;
Nested FOREACH is not for working with a second relation while looping over the first one. It's for when your relation has a bag in it and you want to work with that bag as though it were its own relation. You cannot work with apiRequests and authEvents at the same time unless you do some kind of joining or grouping first to put all the information you need into a single relation.
Conceptually, your task maps nicely onto a JOIN and FILTER, if you did not need to limit yourself to a single authentication event:
allPairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier;
match = FILTER allPairs BY (auth_requestTime-api_requestTime)<1000;
Now all the information is together, and you could do GROUP match BY api_identifier followed by a nested FOREACH to pick out a single event.
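A sketch of that route (the joined fields carry relation-qualified names, hence the :: references):
grouped = GROUP match BY specificApiCall::api_identifier;
closest = FOREACH grouped {
    sorted = ORDER match BY authEvents::auth_requestTime;
    nearest = LIMIT sorted 1;
    GENERATE FLATTEN(nearest);
};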
However, you could do this in a single step if you use the COGROUP operator, which is like JOIN but without the cross-product: you get two bags containing the grouped records from each relation. Use this to pick out the nearest authentication event:
cogrp = COGROUP specificApiCall BY api_identifier, authEvents BY auth_identifier;
singleAuth = FOREACH cogrp {
auth_sorted = ORDER authEvents BY auth_requestTime;
auth_1 = LIMIT auth_sorted 1;
GENERATE FLATTEN(specificApiCall), FLATTEN(auth_1);
};
Then FILTER to keep only the ones within 1 second:
match = FILTER singleAuth BY (auth_requestTime-api_requestTime)<1000;

Pig approach to pairing data fields in a data set

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:
Say that I have a dataset representing college classes and students that attend them:
Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry
I want to generate every pairing of students that took the same class; the output would include this, showing the explosion of the four 'English' rows into six associations:
Philosophy John,David
English Mary,Sue
English Mary,Mark
English Mary,Larry
English Sue,Mark
English Sue,Larry
English Mark,Larry
This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively and would post my attempts but honestly I'm flailing and I think that would just confuse the reader and not provide any value. But here's the boilerplate:
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...
(I believe the problem I'm facing has to do with FLATTEN requiring multiple bags, not multiple fields, and I can't figure out how to get my grouping to generate multiple bags...)
Thank you for any assistance!
You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package from here and issue the following (tested on Pig v0.10.0):
register '/home/user/datafu/dist/datafu-0.0.4.jar';
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));
When further flattening the result:
D = FOREACH C generate FLATTEN($0) as (class:chararray),
FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);
You'll end up having the desired result:
dump D;
(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.
The first approach is a self join:
s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...
The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.
The other option would be to use CROSS nested in a FOREACH after the GROUP:
Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment where I can test this right now). Perhaps someone can confirm.
B = GROUP s BY class;
C = FOREACH B {
DA = CROSS s, s;
GENERATE FLATTEN(DA);
};
This can be done with a self-join and some simple filtering.
classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
filtered = FILTER joined BY classes1::student < classes2::student;
pairs = FOREACH filtered GENERATE classes1::class AS class, classes1::student AS student1, classes2::student AS student2;
Note that the joined fields need the :: disambiguation prefix, and that filtering by student1 < student2 gets you each pair exactly once (and drops self-pairs).
