PIG Script for processing raw data - hadoop

I am processing my raw data with Pig to give it some structure.
Here's the sample data:
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
Expected output:
(Nov 1 18:23:34, 03, 0123, 198.0.13.24, 5)
I am trying to CONCAT(month, day, time) and remove the information before "=". I am using the following script:
A = LOAD '----' using PigStorage('\t') as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time);
C = foreach A generate REGEX_EXTRACT(devid, '^.*=(.*)$', 1),REGEX_EXTRACT(userid, '^.*=(.*)$', 1), REGEX_EXTRACT(intip, '^.*=(.*)$', 1),REGEX_EXTRACT(response, '^.*=(.*)$', 1);
Dump B; Dump C;
Output:
(Nov 1 18:23:34)
(03, 0123, 198.0.13.24, 5)
Suggestions I need:
Can I MERGE, UNION, or JOIN B and C to achieve the expected output? Since there is no common field, how can that be done?
Is there another way to optimize the script, or a different procedure to get the expected output using a MapReduce program?
Looking forward to a reply; your help is highly appreciated.

Probably your problem is in the delimiters. You specify \t as the delimiter (though this is the default), but your input data has spaces between the fields.
Here is the code that works:
$ cat input
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
$ cat script.pig
A = LOAD 'input' as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time),
REGEX_EXTRACT(devid, '^.*=(.*)$', 1),
REGEX_EXTRACT(userid, '^.*=(.*)$', 1),
REGEX_EXTRACT(intip, '^.*=(.*)$', 1),
REGEX_EXTRACT(response, '^.*=(.*)$', 1);
DUMP B;
$ pig -x local script.pig
...log messages...
(Nov 1 18:23:34,03,0123,198.0.13.24,5)
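For reference, the same result can be produced in a single pass with REGEX_EXTRACT_ALL over the whole line. A sketch (the pattern assumes the log lines look exactly like the sample):
A = LOAD 'input' AS (line: chararray);
-- one capture group per output field: timestamp, dev_id, user_id, int_ip, response_code
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,
    '^(\\w+ \\d+ [\\d:]+) dev_id=(\\S+) user_id=(\\S+) int_ip=(\\S+) response_code=(\\S+)$'))
    AS (ts: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
DUMP B;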
Hope that helps.

Related

How to merge rows (items) of same relation in Apache Pig

I'm new to Apache Pig.
I have data like below.
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f-PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows according to the key, i.e. linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 and linsys4f-PORT44-0211201516291287,
and the output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
How can I merge this? Any help will be appreciated.
Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port id for a clearer picture:
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: chararray), -- pid has 16 digits, too wide for int
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
group AS port,
FLATTEN(BagToTuple(C.dnis)) AS dnis,
FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn,
FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location,
FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location,
FLATTEN(BagToTuple(C.state)) AS state,
FLATTEN(BagToTuple(C.language)) AS language,
FLATTEN(BagToTuple(C.outcome)) AS outcome,
FLATTEN(BagToTuple(C.exitType)) AS exitType,
FLATTEN(BagToTuple(C.exitState)) AS exitState;
DUMP E;
Output:
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)
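The runs of empty fields in the output come from FLATTEN(BagToTuple(...)): every row of a group contributes one field per column, including nulls for attributes that row doesn't have. If one concatenated string per key is enough, BagToString gives cleaner output. A minimal sketch (the split-at-the-first-comma load is an assumption about the raw file layout, untested):
raw = LOAD '/home/coe_user_1/del/data.txt' AS (line: chararray);
-- split each line at the first comma only: the key, then the rest of the attributes
kv = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, ',', 2)) AS (key: chararray, attrs: chararray);
grp = GROUP kv BY key;
merged = FOREACH grp GENERATE group AS key, BagToString(kv.attrs, ',');
DUMP merged;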

Data transformation using pig

I have a CSV file in which there are two variables that I have to add together: salary and bonus. The salary values are comma-separated (e.g. "1,234"), so the addition is not happening in Pig; I tried using casting as well. (A screenshot of the dataset accompanied the original question but is not reproduced here.)
I used the below Pig script:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
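A likely fix (a sketch, not an answer from the original thread): strip the thousands separators with REPLACE before casting, since casting a chararray such as '1,234' directly to double yields null:
-- REPLACE takes a regex; a bare ',' matches a literal comma here
B = FOREACH NoHeader GENERATE id, name, address, occupation,
    (double)REPLACE(salary, ',', '') + bonus AS total;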

apache pig output null values when loading with int datatype

I am working with pig-0.16.0
I'm trying to join two tab-delimited files (.tsv) using a Pig script. Some of the columns are integers, so I am trying to load them as int. But whichever columns I declare as int are not loaded with data; they show up empty. My join was not producing any results, so I took a step back and found that the problem occurs at the loading step. I am pasting part of my Pig script here:
REGISTER /usr/local/pig/lib/piggybank.jar;
-- $0 = streaminputs/forum_node.tsv
-- $1 = streaminputs/forum_users.tsv
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray);
LUFN = LIMIT u_f_n 10;
STORE LUFN INTO 'pigout/LN';
u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray);
LUFUU = LIMIT u_f_u 10;
STORE LUFUU INTO 'pigout/LU';
I tried using long, but the same issue persists; only chararray seems to work here. So, what could be the problem?
Snippets from two input .tsv files:
forum_nodes.tsv:
"id" "title" "tagnames" "author_id" "body" "node_type" "parent_id" "abs_parent_id" "added_at" "score" "state_string" "last_edited_id" "last_activity_by_id" "last_activity_at" "active_revision_id" "extra" "extra_ref_id" "extra_count" "marked"
"5339" "Whether pdf of Unit and Homework is available?" "cs101 pdf" "100000458" "" "question" "\N" "\N" "2012-02-25 08:09:06.787181+00" "1" "" "\N" "100000921" "2012-02-25 08:11:01.623548+00" "6922" "\N" "\N" "204" "f"
forum_users.tsv:
"user_ptr_id" "reputation" "gold" "silver" "bronze"
"100006402" "18" "0" "0" "0"
"100022094" "6354" "4" "12" "50"
"100018705" "76" "0" "3" "4"
"100021176" "213" "0" "1" "5"
"100045508" "505" "0" "1" "5"
You need to remove the quotes so Pig can parse the values as int; otherwise it will display blank. You should use CSVLoader or CSVExcelStorage; see my tests:
Sample File:
"1","test"
Test 1 - Using CSVLoader:
You can use CSVLoader or CSVExcelStorage if you want to ignore the quotes.
PIG Commands:
register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray);
output:
(1,test)
Test 2 - Replacing double quotes:
PIG commands:
file1 = load 'file1.txt' using PigStorage(',');
data = foreach file1 generate REPLACE($0,'\\"','') as (f1:int) ,$1 as (f2:chararray);
output:
(1,"test")
Test 3 - using data as it is:
PIG commands:
file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray);
Output:
(,"test")

How can I convert a bag to an array of numeric values?

I'm trying to turn the following schema:
{
id: chararray,
v: chararray,
paid: chararray,
ts: {(ts: int)}
}
into the following JSON output:
{
"id": "abcdef123456",
"v": "some identifier",
"paid": "another identifier",
"ts": [1, 2, 3, 4, 5, 6]
}
I know how to generate the JSON output, but I can't figure out how to turn the ts attribute in my Pig schema into just an array of numeric values.
The number of items in the ts bag is not known in advance, but they all have the same schema (ts: int).
Pig doesn't support an array-like datatype; one option could be to try something like this.
input
1 1 100 {(1),(2),(3)}
2 2 200 {(4),(5)}
3 3 300 {(1),(2),(3),(4),(5),(6)}
PigScript:
A = LOAD 'input' USING PigStorage() AS (id: chararray, v: chararray,paid: chararray,ts: {(ts: int)});
B = FOREACH A GENERATE id,v,paid,CONCAT('[',BagToString(ts,','),']') AS ts;
STORE B INTO 'output' USING JsonStorage();
Output:
{"id":"1","v":"1","paid":"100","ts":"[1,2,3]"}
{"id":"2","v":"2","paid":"200","ts":"[4,5]"}
{"id":"3","v":"3","paid":"300","ts":"[1,2,3,4,5,6]"}

Loading from mysqldump with PIG

I have a mysqldump of the format:
INSERT INTO `MY_TABLE` VALUES (893024968,'342903068923468','o03gj8ip234qgj9u23q59u','testing123','HTTP','1','4213883b49b74d3eb9bd57b7','blahblash','2011-04-19 00:00:00','448','206',NULL,'GG');
How do I load this data using Pig? I have tried:
A = LOAD 'pig-test/test.log' USING PigStorage(',') AS (ID: chararray, USER_ID: chararray, TOKEN: chararray, NODE: chararray, CHANNEL: chararray, CODE: float, KEY: chararray, AGENT: chararray, TIME: chararray, DURATION: float, RESPONSE: chararray, MESSAGE: chararray, TARGET: chararray);
Using , as a delimiter works fine, but I want the ID to be an int and I cannot figure out how to chop off the leading "INSERT INTO MY_TABLE VALUES (" and the trailing ");" when loading.
Also how should I load datetime information so that I can query it?
Any help you can give would be great.
You could load each record as a line of text and then try to regex/extract the field with MyRegExLoader or REGEX_EXTRACT_ALL:
A = LOAD 'data' AS (record: CHARARRAY);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(record, 'INSERT INTO...., \'(\\d+)\', ...');
It's a kind of a hack, but you can use REPLACE for chopping off the extra text too. Note that REPLACE takes a regular expression, so the parentheses have to be escaped:
B = FOREACH A
GENERATE
(INT) REPLACE(ID, 'INSERT INTO `MY_TABLE` VALUES \\(', ''),
...
REPLACE(TARGET, '\\);', '');
Currently there is a problem with semicolons in Pig string literals, so you might need to do your own REPLACE.
There is no native date type in Pig, but you can juggle the date utils in PiggyBank or build your own UDF in order to convert it to a Unix long.
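For example, a sketch assuming Pig 0.11+ (where ToDate and ToUnixTime are built in; on older versions the PiggyBank datetime converters play the same role):
-- parse the dump's 'yyyy-MM-dd HH:mm:ss' strings into epoch seconds;
-- assumes the surrounding quotes from the dump have already been stripped from TIME
T = FOREACH B GENERATE ToUnixTime(ToDate(TIME, 'yyyy-MM-dd HH:mm:ss')) AS epoch;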
Another way would be a simple script (in Python, for example) to prepare the data for loading.
