Need to flatten multiple tuples from a bag - hadoop

I have input like this:
100.101.74.22 {(1358308803000,start,100.101.74.22,http://server1.com/flvplay-1.26.swf%23),(1358308973000,stop,100.101.74.22,http://server1.com/flvplay-1.26.swf%23),(1358308843000,pause,100.101.74.22,http://server1.com/flvplay-1.26.swf%23)}
I have written a script like this:
A = load 'inputpath' USING PigStorage('\t') AS (f1 : chararray, B:bag{T:tuple(x1 : chararray, x2 : chararray, x3 : chararray, x4 : chararray)});
B = foreach A generate f1, flatten(B.(x1, x2, x3, x4));
I am expecting output as:
100.101.74.22,1358308803000,start,100.101.74.22,http://server1.com/flvplay-1.26.swf%23,1358308973000,stop,100.101.74.22,http://server1.com/flvplay-1.26.swf%23,1358308843000,pause,100.101.74.22,http://server1.com/flvplay-1.26.swf%23
How can I get that? Please help.
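One approach that should give a single row per key is BagToTuple, which un-nests the bag into one tuple whose fields can then be flattened onto the row. A minimal, untested sketch, assuming a Pig version where BagToTuple is available as a built-in (0.11+):
-- sketch: requires BagToTuple (built in from Pig 0.11)
C = foreach A generate f1, FLATTEN(BagToTuple(B));
dump C;
Note that the tuples come out in bag order; if a particular ordering matters, sort the bag in a nested foreach before calling BagToTuple.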

Related

How to merge rows (items) of same relation in Apache Pig

I'm new to Apache Pig.
I have data like the following:
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows according to the key, that is linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 & linsys4f-PORT44-0211201516291287,
and the output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=SPANISH)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC).
How can I merge this? Any help will be appreciated.
Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port id for a clearer picture:
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: int),
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
group AS port,
FLATTEN(BagToTuple(C.dnis)) AS dnis,
FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn,
FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location,
FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location,
FLATTEN(BagToTuple(C.state)) AS state,
FLATTEN(BagToTuple(C.language)) AS language,
FLATTEN(BagToTuple(C.outcome)) AS outcome,
FLATTEN(BagToTuple(C.exitType)) AS exitType,
FLATTEN(BagToTuple(C.exitState)) AS exitState;
DUMP E;
Output:
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)

Apache-PIG script: ERROR Invalid field projection on joined variable

The Pig script I have created works, unless I try to use GENERATE on the field that I joined on.
cc_data = LOAD 'default.complaint1' USING org.apache.hive.hcatalog.pig.HCatLoader();
cc2_data = LOAD 'default.complaint2' USING org.apache.hive.hcatalog.pig.HCatLoader();
combined = join cc_data by complaintid, cc2_data by complaintid;
If I do a DESCRIBE on my combined relation, it shows as follows:
combined:
{cc_data::daterecieved: chararray,
cc_data::product: chararray,
cc_data::subproduct: chararray,
cc_data::issue: chararray,
cc_data::subissue: chararray,
cc_data::consumercomplaintnarrative: chararray,
cc_data::companypublicresponse: chararray,
cc_data::company: chararray,
cc_data::state: chararray,
cc_data::zip: chararray,
cc_data::submitted: chararray,
cc_data::datesenttocompany: chararray,
cc_data::companyresponsetoconsumer: chararray,
cc_data::timelyresponse: chararray,
cc_data::consumerdisputed: chararray,
cc_data::complaintid: int,
cc2_data::complaintid: int,
cc2_data::complaintamount: float,
cc2_data::consumerzip: int,
cc2_data::creditrating: chararray,
cc2_data::bankrupthistory: chararray}
I can use a FOREACH and GENERATE on all of the fields except for the complaintid field. I've even tried cc_data.complaintid. I get this error:
ERROR 1025:
<file pig_read_orcfile.pig, line 13, column 190> Invalid field projection. Projected field [complaintid] does not exist in schema
Any ideas? Any help would be greatly appreciated!
Please try
... FOREACH combined GENERATE cc_data::complaintid;
http://pig.apache.org/docs/r0.9.1/basic.html#disambiguate
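For example, building on the DESCRIBE output above, a minimal sketch that projects the join key plus one field from each side (the alias projected is just for illustration):
-- projected is an example alias; field names are taken from the DESCRIBE above
projected = FOREACH combined GENERATE cc_data::complaintid AS complaintid, cc_data::company, cc2_data::creditrating;
DUMP projected;
The :: prefix is needed because the join keeps both cc_data::complaintid and cc2_data::complaintid, so a bare complaintid is ambiguous.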

Apache Pig - How to extract sets of records

I'm a new user of Apache Pig, and I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I tried to extract the records below:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code that I've tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It doesn't work.
Another attempt, saving into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: chararray
)}
);
Still doesn't work.
Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
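If a line could carry more than two orders, the same pattern should extend by adding further TOTUPLE groups over higher positions, with the FILTER still dropping the all-null tuples produced for shorter lines. An untested sketch for up to three orders per line:
-- untested extension of the answer above for up to three orders per line
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5),TOTUPLE($6..$8)));
D = FILTER C BY $0 is not null;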

PIG Script for processing raw data

I am doing Pig processing on my raw data to get some structure out of it.
Here's the sample data:
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
Expected output:
(Nov 1 18:23:34, 03, 0123, 198.0.13.24, 5)
I am trying to CONCAT(month, day, time) and remove the information before "=". I am using the following script:
A = LOAD '----' using PigStorage('\t') as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time);
C = foreach A generate REGEX_EXTRACT(devid, '^.*=(.*)$', 1),REGEX_EXTRACT(userid, '^.*=(.*)$', 1), REGEX_EXTRACT(intip, '^.*=(.*)$', 1),REGEX_EXTRACT(response, '^.*=(.*)$', 1);
Dump B; Dump C;
Output:
(Nov 1 18:23:34)
(03, 0123, 198.0.13.24, 5)
Suggestions I need:
Can I merge, union, or join (B, C) to achieve the expected output? As there is no common field, how can we do that?
Is there any other way to optimize the script, or a different procedure to get the expected output using a Map-Reduce program?
Looking forward to a reply; your help is highly appreciated.
Probably your problem was in the delimiters. You specify \t as the delimiter (though this is the default), but your input data has spaces between fields.
Here is the code that works:
$ cat input
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
$ cat script.pig
A = LOAD 'input' as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time),
REGEX_EXTRACT(devid, '^.*=(.*)$', 1),
REGEX_EXTRACT(userid, '^.*=(.*)$', 1),
REGEX_EXTRACT(intip, '^.*=(.*)$', 1),
REGEX_EXTRACT(response, '^.*=(.*)$', 1);
DUMP B;
$ pig -x local script.pig
...log messages...
(Nov 1 18:23:34,03,0123,198.0.13.24,5)
Hope that helps.

Loading from mysqldump with PIG

I have a mysqldump of the format:
INSERT INTO `MY_TABLE` VALUES (893024968,'342903068923468','o03gj8ip234qgj9u23q59u','testing123','HTTP','1','4213883b49b74d3eb9bd57b7','blahblash','2011-04-19 00:00:00','448','206',NULL,'GG');
How do I load this data using Pig? I have tried:
A = LOAD 'pig-test/test.log' USING PigStorage(',') AS (ID: chararray, USER_ID: chararray, TOKEN: chararray, NODE: chararray, CHANNEL: chararray, CODE: float, KEY: chararray, AGENT: chararray, TIME: chararray, DURATION: float, RESPONSE: chararray, MESSAGE: chararray, TARGET: chararray);
Using , as a delimiter works fine, but I want the ID to be an int and I cannot figure out how to chop off the leading "INSERT INTO MY_TABLE VALUES (" and the trailing ");" when loading.
Also how should I load datetime information so that I can query it?
Any help you can give would be great.
You could load each record as a line of text and then extract the fields with a regex, using MyRegExLoader or REGEX_EXTRACT_ALL:
A = LOAD 'data' AS (record: CHARARRAY);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(record, 'INSERT INTO...., \'(\\d+)\', ...');
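As a rough, untested sketch of that approach, REGEX_EXTRACT can pull out the numeric ID (cast to int) and the rest of the value list, which also chops off the leading INSERT INTO ... VALUES ( and the trailing ); in one step:
A = LOAD 'pig-test/test.log' USING TextLoader() AS (record: chararray);
-- sketch: assumes the first value is the unquoted numeric ID, as in the sample row
B = FOREACH A GENERATE
(int) REGEX_EXTRACT(record, 'VALUES \\((\\d+),', 1) AS ID,
REGEX_EXTRACT(record, 'VALUES \\(\\d+,(.*)\\);', 1) AS rest;
The remaining fields in rest still carry their single quotes and would need further splitting, for example with STRSPLIT on ','.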
It is kind of a hack, but you can also use REPLACE to chop off the extra text:
B = FOREACH A
GENERATE
(INT) REPLACE(ID, 'INSERT INTO `MY_TABLE` VALUES \\(', ''),
...
REPLACE(TARGET, '\\);', '');
Currently there is a problem with semicolons, so you might need to do your own REPLACE.
There is no native date type in Pig, but you can juggle the date utils in PiggyBank or build your own UDF to convert it to a Unix long.
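If a newer Pig (0.11+) is an option, there is now a built-in datetime type, so once the surrounding quotes are stripped the timestamp can be converted with the built-in ToDate and ToUnixTime functions. A sketch under that assumption, where TIME is the chararray field from the LOAD above with its quotes already removed:
-- assumes Pig 0.11+ built-ins and an unquoted 'yyyy-MM-dd HH:mm:ss' string in TIME
T = FOREACH A GENERATE ToUnixTime(ToDate(TIME, 'yyyy-MM-dd HH:mm:ss')) AS epoch_seconds;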
Another way would be to write a simple script (Python...) to prepare the data for loading.
