Pig Script STRSPLIT - hadoop

My column contains a first name and a last name separated by a space. I want to use a Pig function to split it into two separate columns. I thought of the STRSPLIT function, but I don't know how to use it.
Could anyone help me with this simple question?

You can try something like the sample code below. Here is what it does:
1. Reads each line as a single column
2. Applies the STRSPLIT function, using whitespace as the delimiter
3. Stores the first name and last name in two different columns
input.txt
Pearson Charles
James Michael
Smith Linda
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+',2)) AS (firstname:chararray,lastname:chararray);
C = FOREACH B GENERATE firstname,lastname;
DUMP C;
Output:
(Pearson,Charles)
(James,Michael)
(Smith,Linda)
For more information, see the STRSPLIT documentation:
http://pig.apache.org/docs/r0.13.0/func.html#strsplit

Related

PIG: CONCAT A relation OUTPUT to another RELATION

Sorry for the wrong phrasing of the question.
I am new to Stack Overflow and completely new to Pig, and I am trying to experiment on my own.
I have a scenario where I need to process a words.txt file and a data.txt file.
words.txt
word1
word2
word3
word4
data.txt
{"created_at":"18:47:31,Sun Sep 30 2012","text":"RT #Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616}
I need to get the output as
(word1_epochtime){complete data which matched in text attribute}
i.e
(word1_1234567890){"created_at":"18:47:31,Sun Sep 30 2012","text":"RT #Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616}
I got the output as
(word1){"created_at":"18:47:31,Sun Sep 30 2012","text":"RT #Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616}
by using this script.
load words.txt
load data.txt
c = cross words,data;
d = FILTER c BY (data::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));
e = foreach (group d BY word) {data};
and I got the epoch time together with the words using
time = FOREACH words GENERATE CONCAT(CONCAT(word,'_'),(chararray)ToUnixTime(CurrentTime(created_at)));
But I am unable to CONCAT the words with the time.
How can I get the output as
(word1_time){data}
Please feel free to suggest anything for the above.
Thank you.
I think I got the output.
Here is the script that I have written.
d = FILTER c BY (data::text MATCHES CONCAT(CONCAT('.*',word::word),'.*'));
e = FOREACH d GENERATE CONCAT(CONCAT(word,'_'),(chararray)ToUnixTime(CurrentTime(created_at))) as epochtime;
f = foreach (group e BY epochtime) {data}
dump f;
Per this reference, CONCAT takes two "fields" as input. I think the problem in your case is that (chararray)ToUnixTime(CurrentTime()) is not a field name. You could first generate a field that holds the current timestamp value and then use that field in your CONCAT call.
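The suggestion above could be sketched roughly as follows. This is only an illustration, not the asker's final script: the relation names stamped and keyed and the field name ts are made up here, and it assumes a Pig version that provides CurrentTime() and ToUnixTime() (0.12 or later, as far as I know).

```pig
-- Assumes words.txt holds one word per line, as in the question.
words = LOAD 'words.txt' AS (word:chararray);

-- Step 1: materialize the current epoch time as its own chararray field
-- (ts is a hypothetical field name).
stamped = FOREACH words GENERATE word, (chararray)ToUnixTime(CurrentTime()) AS ts;

-- Step 2: CONCAT now only sees plain fields and a string constant.
keyed = FOREACH stamped GENERATE CONCAT(CONCAT(word, '_'), ts) AS word_time;

DUMP keyed;
```

Each tuple in keyed should then have the form (word1_<epochtime>), which can be used as the key when joining or grouping with the matched data.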

Split characters inside Pig field

I have a text input with '|' separator as
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into different fields based on character position, e.g. positions 0-2 = BM as Field1, and likewise for the rest.
So after this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way to achieve this in a Pig script? Otherwise I plan to write a UDF that puts a separator at the required positions.
Thanks.
You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE SUBSTRING($4,0,2) AS FIELD_1, SUBSTRING($4,2,7) AS FIELD_2, SUBSTRING($4,7,11) AS FIELD_3, SUBSTRING($4,11,13) AS FIELD_4, SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
You can find more info about this function here.
I think it will be much more efficient to use the built-in function REGEX_EXTRACT_ALL.
You can get an idea of how to use it from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
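As a rough sketch of the REGEX_EXTRACT_ALL approach for this input: the field widths (2, 5, 4, 2, 7) are taken from the SUBSTRING answer above, and the output field names are made up for illustration. Note that REGEX_EXTRACT_ALL only returns groups when the pattern matches the entire string, so the widths have to add up to the full 20 characters.

```pig
A = LOAD '/user/hue/data.txt' USING PigStorage('|');

-- One capture group per fixed-width slice of the fifth column ($4).
-- field_1..field_5 are hypothetical names.
B = FOREACH A GENERATE
      FLATTEN(REGEX_EXTRACT_ALL((chararray)$4, '(.{2})(.{5})(.{4})(.{2})(.{7})'))
      AS (field_1:chararray, field_2:chararray, field_3:chararray,
          field_4:chararray, field_5:chararray);

DUMP B;
```

For the sample row this should give the same tuple as the SUBSTRING version. Whether it is actually faster is something to measure on your own data rather than assume.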
STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Splitting Pig tuple

I want to split the following tuple into two tuples using pig script.
(key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
I want the output as follows:
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
Yes, you can solve this using a regex and the TOTUPLE function. First split the string into two parts: the first column is everything before the first comma, and the second column is the rest of the string. Then convert each of the two columns into a tuple and store the result.
input
key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([^,]+),(.*)$')) AS (col1,col2);
C = FOREACH B GENERATE TOTUPLE(col1),TOTUPLE(col2);
STORE C INTO 'output';
Output (will be stored in the output/part* files):
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)

How to load the data without text qualifiers dynamically from a file using PIG/HIVE/HBASE?

My CSV file contains 150 columns, and it uses double quotes ("") as text qualifiers. How can I remove the quotes using a dynamic Pig/Hive/HBase script? Similarly, I have multiple files (50 CSV files with different columns). How can I remove the quotes from those files as well?
I tried the Pig script below for a file with 2 columns:
A = LOAD 'hdfs://<hostname>:<port>/user/test/input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'"(.*)","(.*)"')) AS (id:int,name:chararray);
STORE B INTO '/user/test/output' USING PigStorage(',');
Any help would be appreciated.
Can you try it like this?
input.txt
"123","456","789"
"abc","def","ghi"
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE REPLACE(line,'\\"','') AS line1;
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line1,'\\,',3));
D = FOREACH C GENERATE $0,$1,$2;
DUMP D;
Output:
(123,456,789)
(abc,def,ghi)
In your case, you can change the third line above to STRSPLIT(line1,'\\,',150), where 150 is the total number of columns, and then access the values as $0, $1, ..., $149.
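Since the question mentions 50 files with different column counts, one possible variation is to leave out the limit argument so the number of columns does not have to be hard-coded. This is a sketch under two assumptions: that no value contains a comma inside its quotes, and that dropping trailing empty fields is acceptable (STRSPLIT follows Java's split behavior, which discards trailing empty strings when no limit is given).

```pig
A = LOAD 'input.txt' AS (line:chararray);

-- Strip every double quote from the line.
B = FOREACH A GENERATE REPLACE(line, '\\"', '') AS line1;

-- No limit argument: split on every comma, however many columns there are.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line1, ','));

DUMP C;
```

The same script can then be pointed at each file by changing only the LOAD path.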

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but that would discard the rest. It would be nice to use a wildcard and put the rest in a tuple.
Is there any way to do this?
Try this
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
reference
Drop single column in Pig
I am not sure what you need.
Try this one:
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the other values after the second comma in your file.
Or you can create a map for the remaining fields.

Resources