How to use string functions in Pig (Hadoop)

I am trying to convert a string to upper case in Pig using one of its built-in functions. I am using Pig in local mode.
emps.csv
1,John,35,M,101,50000.00,03/03/79
2,Jack,30,F,201,3540000.00,09/10/84
Commands for loading data (WORKS FINE)
empdata = load 'emps.csv' using PigStorage(',') as (id:int,name:chararray,age:int,gender:chararray,deptId:int,sal:double);
dump empdata
Convert to upper case and print it (FAILS WITH ERROR)
empnameucase = foreach empdata generate id,upper(name);
But I am getting the following exception after executing the above command:
Error Log:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve upper using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:653)
at org.apache.pig.impl.PigContext.getClassForAlias(PigContext.java:769)
at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1491)
... 28 more
Please guide.

Try this:
Pig's built-in function names are case-sensitive, so you should write the function name in upper case, like
UPPER(name)
Hope it works.
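Applied to the failing statement from the question, only the function name changes:
empnameucase = foreach empdata generate id, UPPER(name);
dump empnameucase;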

Related

Error in loading the csv file in Apache Pig

I tried to load the data using the following command in Apache Pig in HDFS mode:
test = LOAD /user/swap/done2.csv using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);
It gives the error below:
2017-12-12 13:49:10,347 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input '/' expecting QUOTEDSTRING
Details at logfile: /home/matlab/Documents/pig_1513066708530.log
Surprisingly, my dataset does not have 13 columns.
The file path should be in quotes ('') for LOAD:
test = LOAD '/user/swap/done2.csv' using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);

Getting AVG in pig

I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since the delimiter is more than one character ('::'), I read somewhere here on Stack Overflow to load it first via TextLoader():
loadusers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadusers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
After running
dump average_age_of_testusers
this is the error on HDFS:
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException:
ERROR 2106: Error while computing average in Initial 2016-10-31 13:39:22,175 [main]
ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems Pig has trouble computing the average. I thought I had made a mistake in the data type, but age is an int.
If you can help me, thank you.
I figured out the problem in this one. Please refer to "How can correct data types on Apache Pig be enforced?" for a better explanation.
But then, just to show what I did... I had to cast my data:
testusers = FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
AVG was failing because age was being treated as a string instead of an int.
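Putting it together, a minimal sketch of the corrected script (paths and aliases are the ones used in the question):
loadusers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() AS (line:chararray);
-- cast the tuple returned by STRSPLIT so the fields get real types instead of bytearrays
testusers = FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
dump average_age_of_testusers;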

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written the following Pig code:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN, as sketched below.
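A minimal sketch of that approach (the field names and types are assumptions based on the sample records):
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:int, ename:chararray, rest:chararray);
-- rest still holds 'sal;deptno'; split it on ';' and flatten the resulting tuple
B = FOREACH A GENERATE empid, ename, FLATTEN(STRSPLIT(rest, ';')) AS (sal:chararray, deptno:chararray);
DUMP B;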
Finally got a solution. Here it is (the \u003B in the regex is the Unicode escape for the ';' character, which avoids the lexical error above):
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);

Error while using python udf in Pig

I am trying to use a Python UDF, but it is throwing the error below. I am using CDH 5.2.
cat /home/spanda20/pig_data/panda1.py
def get_length(data):
    return len(data)
REGISTER '/home/spanda20/pig_data/panda1.py' USING jython as my_udf;
grunt> A = LOAD 'hdfs://itsusmpl00509.jnj.com:8020/user/spanda20/pig_1.dat' USING PigStorage(',') AS (name:chararray, id:int);
grunt> B = FOREACH A GENERATE name, id,my_udf.get_length(name) as name_len;
2015-01-25 20:47:15,243 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve
my_udf.get_length using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at
logfile: /home/spanda20/pig_1422230028021.log
Sometimes, after a Pig REGISTER command fails for a UDF, you may have to restart the Grunt shell for Pig to reload the UDF, then re-run the same statements, as sketched below.
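A minimal sketch of the sequence to re-run after restarting (paths, aliases, and the UDF file are the ones from the question):
REGISTER '/home/spanda20/pig_data/panda1.py' USING jython AS my_udf;
A = LOAD 'hdfs://itsusmpl00509.jnj.com:8020/user/spanda20/pig_1.dat' USING PigStorage(',') AS (name:chararray, id:int);
-- my_udf.get_length returns the length of the name field
B = FOREACH A GENERATE name, id, my_udf.get_length(name) AS name_len;
DUMP B;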

string concatenation not working in pig

I have a table in HCatalog which has 3 string columns. When I try to concatenate strings, I get the following error:
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader() ;
B = LOAD 'default.cdr_data' USING org.apache.hcatalog.pig.HCatLoader();
c = FOREACH A GENERATE CONCAT(mcc,'-',mnc) as newCid;
Could not resolve concat using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast
What might be the root cause of the problem?
Maybe this will help for concatenation in Pig.
data1 contains:
(Maths,abc)
(Maths,def)
(Maths,ef)
(Maths,abc)
(Science,ac)
(Science,bc)
(Chemistry,xc)
(Telugu,xyz)
Considering the schema as sub (Maths, Maths, Science, etc.) and name (abc, def, ef, etc.):
X = FOREACH data1 GENERATE CONCAT(sub,CONCAT('#',name));
Output of X is:
(Maths#abc)
(Maths#def)
(Maths#ef)
(Maths#abc)
(Science#ac)
(Science#bc)
(Chemistry#xc)
(Telugu#xyz)
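Applied back to the original question, a minimal sketch would be (the explicit chararray casts are an assumption, prompted by the "Please use an explicit cast" hint in the error):
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader();
-- nest two-argument CONCATs and cast both operands explicitly
c = FOREACH A GENERATE CONCAT((chararray)mcc, CONCAT('-', (chararray)mnc)) AS newCid;
DUMP c;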
