I have a mysqldump of the format:
INSERT INTO `MY_TABLE` VALUES (893024968,'342903068923468','o03gj8ip234qgj9u23q59u','testing123','HTTP','1','4213883b49b74d3eb9bd57b7','blahblash','2011-04-19 00:00:00','448','206',NULL,'GG');
How do I load this data using Pig? I have tried:
A = LOAD 'pig-test/test.log' USING PigStorage(',') AS (ID: chararray, USER_ID: chararray, TOKEN: chararray, NODE: chararray, CHANNEL: chararray, CODE: float, KEY: chararray, AGENT: chararray, TIME: chararray, DURATION: float, RESPONSE: chararray, MESSAGE: chararray, TARGET: chararray);
Using , as a delimiter works fine, but I want the ID to be an int and I cannot figure out how to chop off the leading "INSERT INTO MY_TABLE VALUES (" and the trailing ");" when loading.
Also how should I load datetime information so that I can query it?
Any help you can give would be great.
You could load each record as a line of text and then extract the fields with a regex, using MyRegExLoader or REGEX_EXTRACT_ALL:
A = LOAD 'data' AS (record: CHARARRAY);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(record, 'INSERT INTO...., \'(\d+)\', ...');
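For illustration, a fuller version of that statement might look like the sketch below; the regex is an assumption to adapt to your dump (it only captures the first three columns, and REGEX_EXTRACT_ALL has to match the whole line, hence the trailing .*):
-- sketch, reusing A from above; table name and captured columns are assumptions
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(record, 'INSERT INTO `MY_TABLE` VALUES \\((\\d+),\'([^\']*)\',\'([^\']*)\'.*')) AS (id: chararray, user_id: chararray, token: chararray);
C = FOREACH B GENERATE (int) id AS id, user_id, token;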
It is a bit of a hack, but you can also use REPLACE to chop off the extra text (note that the second argument is a regex, so the parentheses have to be escaped):
B = FOREACH A
GENERATE
(int) REPLACE(ID, 'INSERT INTO `MY_TABLE` VALUES \\(', ''),
...
REPLACE(TARGET, '\\);', '');
Currently there is a problem with semicolons inside string literals, so you might need to handle that last REPLACE with your own UDF.
There is no native date type in Pig prior to 0.11, but you can juggle the date utils in PiggyBank or build your own UDF to convert the timestamp to a Unix long.
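On Pig 0.11 or later, which added a built-in datetime type, a minimal sketch would do it with the built-in functions; PARSED here is a placeholder for whichever relation holds the cleaned-up value:
-- assumes TIME is a bare string like 2011-04-19 00:00:00
-- (i.e. the surrounding single quotes from the dump have already been stripped)
WITH_EPOCH = FOREACH PARSED GENERATE ToUnixTime(ToDate(TIME, 'yyyy-MM-dd HH:mm:ss')) AS time_epoch;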
Another option would be a simple script (Python, for example) to prepare the data before loading.
Related
I'm new to Apache Pig.
I have data like below.
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows according to the key, that is linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 and linsys4f-PORT44-0211201516291287,
and the output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=SPANISH)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC).
How can I merge this? Any help will be appreciated.
Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port id for a clearer picture:
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: int),
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
    group AS port,
    FLATTEN(BagToTuple(C.dnis)) AS dnis,
    FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn,
    FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location,
    FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location,
    FLATTEN(BagToTuple(C.state)) AS state,
    FLATTEN(BagToTuple(C.language)) AS language,
    FLATTEN(BagToTuple(C.outcome)) AS outcome,
    FLATTEN(BagToTuple(C.exitType)) AS exitType,
    FLATTEN(BagToTuple(C.exitState)) AS exitState;
DUMP E;
Output (the runs of commas are the NULL slots that BagToTuple/FLATTEN emit for attributes missing from a given row):
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)
I have a CSV file with two variables, salary and bonus, that I have to add together (the salary values are comma separated), but the addition is not happening in Pig. I tried using casting as well. Below is a screenshot of the dataset:
I used the Pig script below:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
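If the cast is failing because the salary values contain thousands separators (for example 1,200), which is what "comma separated" suggests, one minimal fix is to strip the commas before casting; a sketch:
B = FOREACH NoHeader GENERATE id, name, address, occupation, ((double) REPLACE(salary, ',', '') + bonus) AS total;
REPLACE treats the ',' as a regex, so every comma in the value is removed before the cast to double.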
I am working with pig-0.16.0
I'm trying to join two tab-delimited files (.tsv) using a Pig script. Some of the columns are of integer type, so I am trying to load them as int. But whichever columns I make int are not loaded with data; they show up as empty. My join was not producing any output, so I took a step back and found that the problem occurs at the loading step. I am pasting part of my Pig script here:
REGISTER /usr/local/pig/lib/piggybank.jar;
-- $0 = streaminputs/forum_node.tsv
-- $1 = streaminputs/forum_users.tsv
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray);
LUFN = LIMIT u_f_n 10;
STORE LUFN INTO 'pigout/LN';
u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray);
LUFUU = LIMIT u_f_u 10;
STORE LUFUU INTO 'pigout/LU';
I tried using long, but still had the same issue; only chararray seemed to work here. So, what could be the problem?
Snippets from two input .tsv files:
forum_nodes.tsv:
"id" "title" "tagnames" "author_id" "body" "node_type" "parent_id" "abs_parent_id" "added_at" "score" "state_string" "last_edited_id" "last_activity_by_id" "last_activity_at" "active_revision_id" "extra" "extra_ref_id" "extra_count" "marked"
"5339" "Whether pdf of Unit and Homework is available?" "cs101 pdf" "100000458" "" "question" "\N" "\N" "2012-02-25 08:09:06.787181+00" "1" "" "\N" "100000921" "2012-02-25 08:11:01.623548+00" "6922" "\N" "\N" "204" "f"
forum_users.tsv:
"user_ptr_id" "reputation" "gold" "silver" "bronze"
"100006402" "18" "0" "0" "0"
"100022094" "6354" "4" "12" "50"
"100018705" "76" "0" "3" "4"
"100021176" "213" "0" "1" "5"
"100045508" "505" "0" "1" "5"
You need to strip the quotes so Pig knows the value is an int, or else it will display blank. You should use CSVLoader or CSVExcelStorage; see my tests:
Sample File:
"1","test"
Test 1 - Using CSVLoader:
You can use CSVLoader or CSVExcelStorage if you want to ignore quotes - see example here
PIG Commands:
register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray);
output:
(1,test)
Test 2 - Replacing double quotes:
PIG commands:
file1 = load 'file1.txt' using PigStorage(',');
data = foreach file1 generate REPLACE($0,'\\"','') as (f1:int) ,$1 as (f2:chararray);
output:
(1,"test")
Test 3 - using data as it is:
PIG commands:
file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray);
Output:
(,"test")
The Pig script I have created works, unless I try to use GENERATE on the field that I joined on.
cc_data = LOAD 'default.complaint1' USING org.apache.hive.hcatalog.pig.HCatLoader();
cc2_data = LOAD 'default.complaint2' USING org.apache.hive.hcatalog.pig.HCatLoader();
combined = join cc_data by complaintid, cc2_data by complaintid;
If I do a DESCRIBE on my combined it shows as follows:
combined:
{cc_data::daterecieved: chararray,
cc_data::product: chararray,
cc_data::subproduct: chararray,
cc_data::issue: chararray,
cc_data::subissue: chararray,
cc_data::consumercomplaintnarrative: chararray,
cc_data::companypublicresponse: chararray,
cc_data::company: chararray,
cc_data::state: chararray,
cc_data::zip: chararray,
cc_data::submitted: chararray,
cc_data::datesenttocompany: chararray,
cc_data::companyresponsetoconsumer: chararray,
cc_data::timelyresponse: chararray,
cc_data::consumerdisputed: chararray,
cc_data::complaintid: int,
cc2_data::complaintid: int,
cc2_data::complaintamount: float,
cc2_data::consumerzip: int,
cc2_data::creditrating: chararray,
cc2_data::bankrupthistory: chararray}
I can use FOREACH and GENERATE on all of the fields except for the complaintid field. I've even tried cc_data.complaintid. I get this error:
ERROR 1025:
<file pig_read_orcfile.pig, line 13, column 190> Invalid field projection. Projected field [complaintid] does not exist in schema
Any ideas? Any help would be greatly appreciated!
Please try
... FOREACH combined GENERATE cc_data::complaintid;
http://pig.apache.org/docs/r0.9.1/basic.html#disambiguate
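For example, against the schema shown by DESCRIBE above, the projection could look like this (a sketch; project whichever columns you actually need):
result = FOREACH combined GENERATE
    cc_data::complaintid AS complaintid,
    cc_data::company AS company,
    cc2_data::complaintamount AS complaintamount,
    cc2_data::creditrating AS creditrating;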
I am doing Pig processing on my raw data to make some structure out of it.
Here's the sample data:
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
Expected output:
(Nov 1 18:23:34, 03, 0123, 198.0.13.24, 5)
I am trying to CONCAT(month, day, time) and remove the information before the "=". I am using the following script:
A = LOAD '----' using PigStorage('\t') as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time);
C = foreach A generate REGEX_EXTRACT(devid, '^.*=(.*)$', 1),REGEX_EXTRACT(userid, '^.*=(.*)$', 1), REGEX_EXTRACT(intip, '^.*=(.*)$', 1),REGEX_EXTRACT(response, '^.*=(.*)$', 1);
Dump B; Dump C;
Output:
(Nov 1 18:23:34)
(03, 0123, 198.0.13.24, 5)
Suggestions I need:
Can I merge, UNION, or JOIN B and C to achieve the expected output? Since there is no common field, how can that be done?
Is there any other way to optimize the script, or a different procedure to get the expected output using a Map-Reduce program?
Looking forward to your reply; your help is highly appreciated.
Probably your problem was in the delimiters. You specified \t as the delimiter (though this is the default), but your input data has spaces between fields.
Here is the code that works:
$ cat input
Nov 1 18:23:34 dev_id=03 user_id=0123 int_ip=198.0.13.24 response_code=5
$ cat script.pig
A = LOAD 'input' as (mnth: chararray, day: chararray, time: chararray, devid: chararray, userid: chararray, intip: chararray, response: chararray);
B = foreach A generate CONCAT(CONCAT(CONCAT(CONCAT(mnth, ' '), day), ' '), time),
REGEX_EXTRACT(devid, '^.*=(.*)$', 1),
REGEX_EXTRACT(userid, '^.*=(.*)$', 1),
REGEX_EXTRACT(intip, '^.*=(.*)$', 1),
REGEX_EXTRACT(response, '^.*=(.*)$', 1);
DUMP B;
$ pig -x local script.pig
...log messages...
(Nov 1 18:23:34,03,0123,198.0.13.24,5)
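If you also want typed fields, a follow-up FOREACH can name and cast the extracted values; a sketch using positional references, since the fields of B are unnamed:
C = foreach B generate $0 AS ts, $1 AS devid, $2 AS userid, $3 AS intip, (int) $4 AS response;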
Hope that helps.