How to process multi - delimiter file in pig 0.8 - hadoop

I have input text file( name multidelimiter) with followings records
1,Mical,2000;10
2,Smith,3000;20
I have written pig code as follows
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code in not work given following error
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I refereed following links but not able to resolve my error
how to load files with different delimiter each time in piglatin
Please help me get out from this error.
Thanks.

Solution for your input example:
LOAD as comma separated, than STRSPLIT by ';' and FLATTEN

Finally got solution.
Here is my solution:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);

Related

Error in loading the csv file in Apache Pig

I tried to load the data using following command in apache pig on hdfs mode:
test = LOAD /user/swap/done2.csv using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);
it gives the error as below:
2017-12-12 13:49:10,347 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input '/' expecting QUOTEDSTRING
Details at logfile: /home/matlab/Documents/pig_1513066708530.log
surprisingly My dataset does not have the 13 columns.
file path should be in quotes '' to LOAD
test = LOAD '/user/swap/done2.csv' using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);

Getting AVG in pig

I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there is multiple delimiter, i read somewhere here in stackoverflow to load it first via TextLoader()
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadusers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
after running
dump average_age_of_testusers
this is the error in hdfs
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException:
ERROR 2106: Error while computing average in Initial 2016-10-31 13:39:22,175 [main]
ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try in programming in pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average, i thought i made a mistake in data type but age is int.
if you can help me, thank you.
I figured out the problem in this one. Please refer to How can correct data types on Apache Pig be enforced? for a better explanation.
But then, just to show what I did... I had to cast my data
FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
AVG is failing because loadusers.age is being treated as string instead of int.

PIG: Unable to open iterator for alias AliasName.Scalar has more than one row in the output

I am new to pig and trying to learn on my own.
I have written a script to get the epoch time with a word that is reading from words.txt file.
Here is the script.
words = LOAD 'words.txt' AS word:chararray;
B = FOREACH A GENERATE CONCAT(CONCAT(A.word,'_'),(chararray)ToUnixTime(CurrentTime());
dump B;
But the issue is, if words.txt file have only one word it is giving proper output.
If it is having multiple words like
word1
word2
word3
word4
then it is giving the following error
ERROR 1066: Unable to open iterator for alias B
java.lang.Exception:
org.apache.pig.backend.executionengine.ExecException: ERROR 0:
Scalar has more than one row in the output. 1st : (word1 ), 2nd :(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar"
should be "foo::bar" ) at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR
0: Scalar has more than one row in the output. 1st : (word1 ), 2nd
:(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar"
should be "foo::bar" ) at
org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:122) at
o
Please suggest me to solve this issue.
Thank you.
solved on my own.
just removed the A. from the inner CONCAT. It worked for me.
script:
words = LOAD 'words.txt' AS word:chararray;
B = FOREACH A GENERATE CONCAT(CONCAT(word,'_'),(chararray)ToUnixTime(CurrentTime());
dump B;

SPLIT command not working in apache-pig

initialData = load 'Weather_Report.log' using PigStorage('|') as (cityid:int,cityname:chararray,currentWeather:chararray,weatherCode:int);
SPLIT initialData INTO noRainsCities IF weatherCode ==10;
STORE noRainCities INTO 'WEATHER_ANALYTICS/TEST_OUT/NoRainCititesData';
PLZ HElp me out guys
This is the error
2016-09-28 11:03:14,597 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 52> Syntax error, unexpected symbol at or near '='
iniData = LOAD 'empdetail.log' using PigStorage('|') as (id:int,x:chararray,city:chararray,tech:chararray);
split iniData into a if tech=='Java',b if city=='Pune';
dump a;
dump b;
GUYS : SPLIT DONT WORK UNTIL 2 OR MORE CONDITIONS ARE GIVEN ;
PROBLEM SOLVED THANKS

string concatenation not working in pig

I have a table in hcatalog which has 3 string columns. When I try to concatenate string, I am getting the following error:
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader() ;
B = LOAD 'default.cdr_data' USING org.apache.hcatalog.pig.HCatLoader();
c = FOREACH A GENERATE CONCAT(mcc,'-',mnc) as newCid;
Could not resolve concat using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast
What might be the root cause of the problem?
May be it will help for concatenation in pig
data1 contain:
(Maths,abc)
(Maths,def)
(Maths,ef)
(Maths,abc)
(Science,ac)
(Science,bc)
(Chemistry,xc)
(Telugu,xyz)
considering schema as sub:Maths,Maths,Science....etc and name :abc,def ,ef..etc
X = FOREACH data1 GENERATE CONCAT(sub,CONCAT('#',name));
O/P of X is:
(Maths#abc)
(Maths#def)
(Maths#ef)
(Maths#abc)
(Science#ac)
(Science#bc)
(Chemistry#xc)
(Telugu#xyz)

Resources