Handle thorn delimiter in pig - hadoop

My source is a log file that uses "þ" as the delimiter. I am trying to read this file in Pig. Please look at the options I tried.
Option 1 :
Using PigStorage("þ") - this doesn't work, as it can't handle Unicode characters.
Option 2 :
I tried reading each line as a string and splitting it on "þ". This doesn't work either, because STRSPLIT leaves out the last field, which has "\n" at the end.
I can see multiple similar questions on the web, but I am unable to find a solution.
Kindly direct me on this.
Thorn Details :
http://www.fileformat.info/info/unicode/char/fe/index.htm

Is this the solution you are expecting?
input.txt:
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));
dump B;
Output:
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
Added a 2nd option with different datatypes:
input.txt
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);
DUMP B;
DESCRIBE B;
Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)
(hello,4567,1990-01-01T00:00:00.000+00:00,world)
(hello,8901,2001-01-01T00:00:00.000+00:00,world)
(hello,9876,2014-01-01T00:00:00.000+00:00,world)
B: {f1: chararray,f2: long,f3: datetime,f4: chararray}
Another thorn symbol, Ã¾:
input.txt
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)Ã¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);
DUMP B;
describe B;
Output:
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}

This should work (replace the Unicode code point with the one that works for you; this is for capital thorn):
A = LOAD 'input' USING TextLoader() AS (f1:chararray); -- any loader that yields the whole line as f1 will do
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);
I don't see why the last field should be left out.
Somehow, this does not work:
A = LOAD 'input' USING PigStorage('\u00DE');
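For reference, a fuller sketch of the STRSPLIT approach for the lowercase thorn in the question (U+00FE rather than the capital U+00DE). The loader and field names are assumptions; the -1 limit follows Java's split semantics and keeps trailing fields, which may be what Option 2 above tripped over:
A = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\u00FE', -1)) AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
DUMP B;
-- with the sample input above this should come out as (hello,world,hello,world) per line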

Related

Pig Latin: Loading a very simple Bag

I'm writing because today I bumped into a problem I can't solve in any way despite having searched everywhere and tried a lot of different statements.
I have this input file:
3 {(car pen house glass)}
5 {(battery phone)}
6 {(the)}
(I would like to clarify that I've added '(' and ')' to the original file because they were missing).
My goal is just to load this file (using LOAD) into a variable and dumping it (using DUMP).
I show below the attempts I made and their respective DUMP outputs:
wc = LOAD 'input.txt' USING PigStorage(' ') AS (count:int,b:bag{(s:chararray)});
(3,)
(5,)
(6,{(the)})
wc = LOAD 'input.txt' USING PigStorage(' ') AS (count:int,b:tuple(s:chararray));
(3,)
(5,)
(6,(the))
wc = LOAD 'input.txt' USING PigStorage(' ') AS (count:int,b:bag{item:tuple(s:chararray)});
(3,)
(5,)
(6,{(the)})
Do you have any idea how to solve it?
Thanks in advance.
The issue here is that you are using ' ' as the delimiter and the bag itself contains ' '. A workaround is to load each record as a single line and then use STRSPLIT to split the line into 2 fields:
wc = LOAD 'input.txt' AS (line:chararray);
wc_new = FOREACH wc GENERATE STRSPLIT(line,' ',2);
DUMP wc_new;
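If you also want the two pieces as separate named fields, a hedged follow-up on the same idea (it reuses the wc alias loaded above; the names num and bagtext are only illustrative):
wc_split = FOREACH wc GENERATE FLATTEN(STRSPLIT(line, ' ', 2)) AS (num:chararray, bagtext:chararray);
DUMP wc_split;
-- e.g. (3,{(car pen house glass)}) -- note bagtext is still plain text here, not a bag type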

Pig Latin - foreach generate method does not work without the first field

I am facing a strange problem with the Pig GENERATE function: if I do not use the first field, the data generated seems to be wrong. Is this the expected behaviour?
a = load '/input/temp2.txt' using PigStorage(' ','-tagFile') as (fname:chararray,line:chararray) ;
grunt> b = foreach a generate $1;
grunt> dump b;
(temp2.txt)
(temp2.txt)
grunt> c = foreach a generate $0,$1;
grunt> dump c;
(temp2.txt,field1,field2)
(temp2.txt,field1,field22)
$cat temp2.txt
field1,field2
field1,field22
pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
In the example, I was expecting dump b to return the data file values instead of the file name.
In your example you use PigStorage(' ','-tagFile'), so each line is split by space.
Then:
$0 -> field1,field2
$1 -> nothing
Just use PigStorage(',','-tagFile') instead.
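A hedged sketch of that fix, assuming the same file as in the question (the field names f1/f2 are illustrative; with -tagFile the file name still arrives as the first column):
a = load '/input/temp2.txt' using PigStorage(',','-tagFile') as (fname:chararray, f1:chararray, f2:chararray);
b = foreach a generate fname, f1, f2;
dump b;
-- expected along the lines of (temp2.txt,field1,field2) and (temp2.txt,field1,field22)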

Get the count through iterate over Data Bag but condition should be different count for each value associated to that field

Below is the data I have; the schema for it is:
student_name, question_number, actual_result (either false or Correct)
(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,flase)
(a,q5,false)
What I want is, for each student (i.e. a/b), the count of correct and false answers he/she has made.
For the use case shared, the Pig script below suffices.
Pig Script :
student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
answers = student_data.actual_result;
correct_answers = FILTER answers BY actual_result=='Correct';
incorrect_answers = FILTER answers BY actual_result=='false';
GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count ;
};
Input : student_data.csv :
b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false
Output : DUMP student_correct_answer_data :
-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
Ref : For more details on nested FOREACH, see:
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
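For comparison, a hedged sketch that produces the same counts without the nested block, by flagging each row and summing (it reuses the student_data alias loaded above; kpi is just an illustrative name):
flagged = FOREACH student_data GENERATE student_name, (actual_result == 'Correct' ? 1 : 0) AS is_correct, (actual_result == 'false' ? 1 : 0) AS is_incorrect;
by_student = GROUP flagged BY student_name;
kpi = FOREACH by_student GENERATE group AS student_name, SUM(flagged.is_correct) AS correct_ans_count, SUM(flagged.is_incorrect) AS incorrect_ans_count;
DUMP kpi;
-- expected: (a,1,4) and (b,2,3)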
Use this:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray,result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
and the answer will look like:
(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
Hope this is the output you are looking for

Pig Latin - adding values from different bags?

I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate RANK for second set of values, but starting with value greater than max from first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last, most important line, which adds the value from limited to rank_new from new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both scripts work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
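Putting it together, a hedged end-to-end sketch (it reuses the aliases from the question; limited is read as a scalar inside the FOREACH because it holds exactly one row, and the cast plus field aliases are only there so the final UNION sees a single schema):
existing = LOAD 'max_rank.txt' USING PigStorage(',') AS (id:int, text:chararray);
new = LOAD 'max_rank_add.txt' AS (text2:chararray);
ordered = ORDER existing BY id DESC;
limited = LIMIT ordered 1;
new_rank = RANK new;
-- limited.$0 is the current maximum id, added to every rank value
rank_plus_max = FOREACH new_rank GENERATE (int)(limited.$0 + rank_new) AS id, text2 AS text;
combined = UNION existing, rank_plus_max;
DUMP combined;
-- expected along the lines of (1,a) (2,b) (3,c) (4,d) (5,e) (6,f)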

Cannot extract values from a map in apache pig

I have a simple relation, v, in Apache Pig:
dump v;
(151364,[ 'ref'#'R813','highway'#'secondary', 'name:ga'#'Lána Chairdif', 'name'#'Cardiff Lane'],(31015271, 31053762))
(151368,[ 'ref'#'N1', 'oneway'#'yes','designation'#'Buses Only', 'highway'#'trunk', 'motor_vehicle'#'designated', 'name:ga'#'Cearnóg Pharnell Thoir', 'maxspeed'#'30', 'name'#'Parnell Square East'],(389365, 540403072))
(151596,[ 'name:en'#'Liffey', 'boundary'#'administrative', 'name:ga'#'An Life','admin_level'#'8', 'name'#'Liffey', 'waterway'#'river'],(1347749, 1426049020, 1347745, 1426049019, 1347742, 900075612))
(367947,[ 'maxspeed'#'80', 'ref'#'L2223','highway'#'tertiary'],(13259933, 2384217, 335978958))
(367952,['created_by'#'YahooApplet 1.0', 'name'#'Charnwood Avenue', 'highway'#'residential'],(2384386, 25963471, 14949594, 2384385, 6146344, 2384254))
(508603,[ 'ref'#'L3018','highway'#'tertiary', 'maxspeed'#'50', 'name'#'Shelerin Road'],(2854184, 2854168, 335978984, 2853307, 2384254, 335978978, 335978975, 2655735, 2655703, 392675957, 11676198, 920037194, 244531387, 2655952, 11675077))
(727153,[ 'ref'#'N8','highway'#'trunk', 'name'#'Merchants' Quay'],(354153, 453344873))
(727157,['highway'#'unclassified', 'oneway'#'yes', 'maxspeed'#'30', 'name'#'Kyle Street'],(354168, 354167))
(727159,['highway'#'unclassified', 'oneway'#'yes', 'maxspeed'#'30', 'name'#'North Main Street'],(354178, 465226768, 354167, 413995429, 72219131, 685537307, 1232381779, 354164))
(727161,[ 'maxspeed'#'30','highway'#'pedestrian', 'name'#'Maylor Street'],(1486492976, 1515360721, 1515360722, 1515345383, 1515344226, 1515344227, 1515344228, 1515344231))
On @orangeoctopus's advice, I have tried regenerating my data without any ' in the key names, and I now have this data:
(151364,[ ref#'R813', name:ga#'Lána Chairdif', name#'Cardiff Lane',highway#'secondary'],(31015271, 31053762))
(151368,[ motor_vehicle#'designated', name#'Parnell Square East', highway#'trunk', oneway#'yes',designation#'Buses Only', maxspeed#'30', name:ga#'Cearnóg Pharnell Thoir', ref#'N1'],(389365, 540403072))
(151596,[ name:en#'Liffey', boundary#'administrative', waterway#'river', name:ga#'An Life',admin_level#'8', name#'Liffey'],(1347749, 1426049020, 1347745, 1426049019, 1347742, 900075612))
(367947,[highway#'tertiary', maxspeed#'80', ref#'L2223'],(13259933, 2384217, 335978958))
(367952,[ name#'Charnwood Avenue',created_by#'YahooApplet 1.0', highway#'residential'],(2384386, 25963471, 14949594, 2384385, 6146344, 2384254))
(508603,[ maxspeed#'50', ref#'L3018', name#'Shelerin Road',highway#'tertiary'],(2854184, 2854168, 335978984, 2853307, 2384254, 335978978, 335978975, 2655735, 2655703, 392675957, 11676198, 920037194, 244531387, 2655952, 11675077))
(727153,[highway#'trunk', name#'Merchants' Quay', ref#'N8'],(354153, 453344873))
(727157,[ oneway#'yes', maxspeed#'30', name#'Kyle Street',highway#'unclassified'],(354168, 354167))
(727159,[ oneway#'yes', maxspeed#'30', name#'North Main Street',highway#'unclassified'],(354178, 465226768, 354167, 413995429, 72219131, 685537307, 1232381779, 354164))
(727161,[highway#'pedestrian', name#'Maylor Street', maxspeed#'30'],(1486492976, 1515360721, 1515360722, 1515345383, 1515344226, 1515344227, 1515344228, 1515344231))
In both cases v has the same schema/structure:
grunt> describe v;
2012-01-09 22:55:34,271 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
v: {id: int,tags: map[ ],nodes: (null)}
Then I try to extract out just one value from the tags map:
grunt> w = foreach v generate tags#'ref';
dump w;
But it only gives me empty data, even though some elements have data here.
()
()
()
()
()
()
()
()
()
()
With the old 'quoted' keys I tried (as per @orangeoctopus' solution):
w = foreach v generate tags#'\'ref\'';
And that gave me the same 'empty' data, and didn't work. (I also tried other combinations of ' and ", like "'ref'"/'"ref"'/etc. but all except '\'ref\'' were invalid pig latin syntax)
What's going on? If I try to filter based on the tag value (e.g. filter v by tags#'highway' != ''), I get nothing, which is consistent with the above problem of not being able to extract data from the map. Am I doing something wrong?
Very tricky!
Your problem is that your literal data includes single quotes. Your string is not ref (3 characters long), it is 'ref' (5 characters long). I realized this because the dump of a map containing strings does not typically have the quotes there.
Therefore, your key needs to include those quotes (you have to escape them with \):
grunt> w = foreach v generate tags#'\'ref\'';
Your other option would be to change the way your data is loaded so that the single quotes are stripped out of the strings themselves. PigStorage doesn't do this for free, but you could use something like REPLACE or your own UDF to do it.
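If regenerating the data is not convenient, a small hedged sketch of the REPLACE idea, applied after pulling the value out with the quoted key (aliases follow the question; the cast is there because map values come back as bytearray):
w = FOREACH v GENERATE REPLACE((chararray) tags#'\'ref\'', '\'', '') AS ref;
DUMP w;
-- should give (R813), (N1), ... and empty results for ways without a ref tag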
Are you loading the data correctly too? It is weird that there is a space after the [ and before the ] when you dump your map.
Also, it is simpler to drop all the quotes from the keys and values in the input data. For example:
Input file
151364 [ref#R813,highway#secondary]
Pig
a = LOAD 'data.txt' AS (id:INT, m:MAP[]);
DUMP a;
b = FOREACH a GENERATE m#'ref';
DUMP b;
Output
(151364,[highway#secondary,ref#R813])
(R813)
