I want to split the following tuple into two tuples using pig script.
(key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
I want the output as follows:
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
Yes you can solve this problem using REGEX and TOTUPLE function. First split the string into two parts, first column is before the first comma and second column is remaining strings. Finally convert the two columns as tuples and store it.
input
key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([^,]+),(.*)$')) AS (col1,col2);
C = FOREACH B GENERATE TOTUPLE(col1),TOTUPLE(col2);
STORE C INTO 'output';
Output:( will be stored in output/part* file)
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
Related
I have a text input with '|' separator as
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into different fields based on their position , say 0-2 = BM - Field1 and like wise.
So after this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way in Pig script to achieve this, otherwise I plan to write an UDF and put separator on required positions.
Thanks.
You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE SUBSTRING($4,0,2) AS FIELD_1, SUBSTRING($4,2,7) AS FIELD_2, SUBSTRING($4,7,11) AS FIELD_3, SUBSTRING($4,11,13) AS FIELD_4, SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
You can find more info about this function here.
I think that it will be much more efficient to use the built in UDF REGEX_EXTRACT_ALL.
You can get some idea of how to use this UDF from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
STRSPLIT and REGEX_EXTRACT_ALL in PigLatin
How can I do this using STRSPLIT, TOKENIZER or any other method?
You can use STRSPLIT with regex to solve this problem. I am not sure your input has single or multiple combination of delimiters(dash,comma,hypen,space and hash) but the below solution will work for both.
input
a#b c-d,e
f e,g#h:i
1,2,3,4,5
l#y#z#h#n
A B C D E
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'[-,:\\s#]',5));
DUMP B;
Output:
(a,b,c,d,e)
(f,e,g,h,i)
(1,2,3,4,5)
(l,y,z,h,n)
(A,B,C,D,E)
If you have only single delimiter in your input, say'#' or any other delimiter that you mentioned then try the below script ( '5' in the third arg is total number of columns in your input)
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'#',5));
In case of multiple delimiter, suppose you want to add any new delimiter say '$' then just add this delimiter inside the character class of regex.
Note '$' is special character in Regex which needs escaping for double backslashs like this '[\\$-,:\\s#]'
My csv file contain 150 columns!! It contain "" as text qualifiers. how can i remove quotes("") using pig/hive/hbase dynamic script? similarly I have multiple files(50 csv files with different columns). How can i remove these "" from different files?
I tried with below pig script for 2 columns in a file:
A = LOAD 'hdfs://<hostname>:<port>/user/test/input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'"(.*)","(.*)"')) AS (id:int,name:chararray);
STORE B INTO '/user/test/output' USING PigStorage(',');
Any help would be appreciated.
Can you try like this?
input.txt
"123","456","789"
"abc","def","ghi"
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE REPLACE(line,'\\"','') AS line1;
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line1,'\\,',3));
D = FOREACH C GENERATE $0,$1,$2;
DUMP D;
Output:
(123,456,789)
(abc,def,ghi)
In your case you can change the above 3rd line to STRSPLIT(line1,'\\,',150), where 150 is the total number of columns and you can access all the values by $0,$1...$149
My column has first name and last name separated by SPACE. I want to use pig function to split into 2 different columns. I think of STRSPLIT function but I don't know how to use it.
Could anyone help me on this simple question?
You can try something like this, sample code below
here what i am doing is
1.Reading each line as single column
2.Apply the STRSPLIT function using space as delimiter
3.Store the firstname and lastname into two different columns
input.txt
Pearson Charles
James Michael
Smith Linda
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+',2)) AS (firstname:chararray,lastname:chararray);
C = FOREACH B GENERATE firstname,lastname;
DUMP C;
Output:
(Pearson,Charles)
(James,Michael)
(Smith,Linda)
Check more info from this link
http://pig.apache.org/docs/r0.13.0/func.html#strsplit
I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do a B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but it would discard the rest. It would be nice to do a wildcard and put the rest in a tuple.
Is there anyway to do this?
Try this
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
reference
Drop single column in Pig
I am not sure about what you need.
Try this one
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the other values after the second comma in your file.
Or you can create a Map for reamining fields.