string concatenation not working in pig - hadoop

I have a table in HCatalog with 3 string columns. When I try to concatenate them with the script below, I get the following error:
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader() ;
B = LOAD 'default.cdr_data' USING org.apache.hcatalog.pig.HCatLoader();
c = FOREACH A GENERATE CONCAT(mcc,'-',mnc) as newCid;
Could not resolve concat using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast
What might be the root cause of the problem?

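Maybe this will help with concatenation in Pig. Older Pig releases accept only two arguments to CONCAT, so three values have to be joined with nested calls.
data1 contains: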
(Maths,abc)
(Maths,def)
(Maths,ef)
(Maths,abc)
(Science,ac)
(Science,bc)
(Chemistry,xc)
(Telugu,xyz)
Assume the schema is (sub, name), where sub holds Maths, Maths, Science, etc. and name holds abc, def, ef, etc.
X = FOREACH data1 GENERATE CONCAT(sub,CONCAT('#',name));
The output of X is:
(Maths#abc)
(Maths#def)
(Maths#ef)
(Maths#abc)
(Science#ac)
(Science#bc)
(Chemistry#xc)
(Telugu#xyz)
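As a rough illustration of why the nested form works, the sketch below models CONCAT as a strictly two-argument function in Python and applies it to the sample rows. The helper name concat2 is hypothetical, and the null handling (null in, null out) is an assumption about Pig's behaviour rather than something stated in the answer.

```python
from typing import Optional


def concat2(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for a two-argument CONCAT.

    Assumption: a null (None) input yields a null result.
    """
    if left is None or right is None:
        return None
    return left + right


# Sample rows from data1, as (sub, name) pairs.
rows = [
    ("Maths", "abc"), ("Maths", "def"), ("Maths", "ef"), ("Maths", "abc"),
    ("Science", "ac"), ("Science", "bc"), ("Chemistry", "xc"), ("Telugu", "xyz"),
]

# Three values are joined by nesting two-argument calls,
# mirroring CONCAT(sub, CONCAT('#', name)).
joined = [concat2(sub, concat2("#", name)) for sub, name in rows]
```

The same nesting applied to the question's columns would be CONCAT(mcc, CONCAT('-', mnc)).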

Related

Invalid format: "19690321" is too short

I am trying to convert a date from yyyyMMdd format to yyyy/MM/dd format using Pig, and I have written the code below.
Code:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate($8,'yyyy/MM/dd','UTC') AS dob;
When I try to dump the result, I get the error below.
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0:
Exception while executing [POUserFunc (Name:
POUserFunc(org.apache.pig.builtin.ToDate3ARGS)[datetime] - scope-209
Operator Key: scope-209) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "19690321" is too short
Sample:
EXVORV##PDULD21F|ID|1|483|1020783||EXVORV##PDULD||19690321|F|
$8 seems valid to me, and I cannot work out why this error occurs. Any help would be really appreciated.
You use:
ToDate($8,'yyyy/MM/dd','UTC')
but the input value is
19690321
The format argument describes the string being parsed, not the desired output, so you should have
ToDate($8,'yyyyMMdd','UTC')
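The parse-then-format idea can be sketched outside Pig. The Python below is only an analogy, not Pig code: the function name reformat_date is hypothetical, and Python's %Y%m%d and %Y/%m/%d stand in for the yyyyMMdd and yyyy/MM/dd patterns. In Pig the second step would presumably be a ToString call on the parsed datetime.

```python
from datetime import datetime


def reformat_date(raw: str) -> str:
    """Parse a yyyyMMdd string, then render it as yyyy/MM/dd."""
    # Step 1: the parse pattern must match the input layout.
    parsed = datetime.strptime(raw, "%Y%m%d")
    # Step 2: the output layout is chosen separately, when formatting.
    return parsed.strftime("%Y/%m/%d")


def parse_with_output_pattern(raw: str) -> bool:
    """Return True if parsing with the output pattern succeeds."""
    try:
        datetime.strptime(raw, "%Y/%m/%d")
        return True
    except ValueError:
        # Analogous to the "Invalid format" error in the question.
        return False
```

Calling reformat_date on the sample value 19690321 gives the slash-separated form, while parse_with_output_pattern on the same value fails, which is the same mistake as passing 'yyyy/MM/dd' to ToDate.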
The issue is most likely caused by the LOAD statement. Since you are not specifying a schema, every field defaults to bytearray. You will have to cast the field to chararray before passing it to ToDate (the pattern must also match the input layout, yyyyMMdd):
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate((chararray)$8,'yyyyMMdd','UTC') AS dob;

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me resolve this error.
Thanks.
Solution for your input example:
LOAD the file as comma-separated, then STRSPLIT the last field by ';' and FLATTEN the result.
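The two-stage split described here can be sketched in Python to show the intended result for the sample records. This is an analogy rather than Pig code; parse_record is a hypothetical helper, and the field names are borrowed from the thread's later solution.

```python
from typing import Tuple


def parse_record(line: str) -> Tuple[int, str, int, int]:
    """Split on ',' first, then split the last field on ';'."""
    # Stage 1: comma-separated load gives three fields.
    empid, ename, rest = line.split(",")
    # Stage 2: the third field still holds two values joined by ';'.
    sal, deptno = rest.split(";")
    # Flattening the pieces yields one four-field record.
    return int(empid), ename, int(sal), int(deptno)


records = [parse_record(line) for line in ("1,Mical,2000;10", "2,Smith,3000;20")]
```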
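Finally got a solution. Grunt appears to treat the ';' inside the quoted regex as the end of the statement (the parser stopped right after "[;"), so the semicolon is written as the Unicode escape \\u003B instead.
Here is my solution: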
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);
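To see that the escaped pattern matches the same text as a literal semicolon, here is a small Python sketch. It assumes Python's re module as a stand-in for the Java regex engine behind REGEX_EXTRACT_ALL; split_sal_dept is a hypothetical name. A full-string match is used because REGEX_EXTRACT_ALL only yields groups when the whole input matches.

```python
import re
from typing import Optional, Tuple

# "\u003B" is the Unicode escape for ';', so the pattern is "(.*);(.*)".
PATTERN = re.compile("(.*)\u003B(.*)")


def split_sal_dept(field: str) -> Optional[Tuple[int, int]]:
    """Extract (sal, deptno) from a 'sal;deptno' field, or None if no match."""
    match = PATTERN.fullmatch(field)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```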

Presence of "in" in Pig's UDF causes problems

I was trying out my first UDF in Pig and wrote the following function:
package com.pig.in.action.assignments.udf;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;
import java.io.IOException;
public class CountLength extends EvalFunc<Integer> {
    public Integer exec(Tuple inputVal) throws IOException {
        // Validate Input Value ...
        if (inputVal == null ||
                inputVal.size() == 0 ||
                inputVal.get(0) == null) {
            // Emit warning text for user, and skip this iteration
            super.warn("Inappropriate parameter, Skipping ...",
                    PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }
        // Count # of characters in this string ...
        final String inputString = (String) inputVal.get(0);
        return inputString.length();
    }
}
However, when I try to use it as follows, Pig throws an error message that is not easy to understand, at least for me, in the context of my UDF:
grunt> cat dept.txt;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
grunt> dept = LOAD '/user/sgn/dept.txt' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
grunt> d = FOREACH dept GENERATE dept_no, com.pig.in.action.assignments.udf.CountLength(d_name);
2015-06-02 16:24:13,416 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 79> mismatched input '(' expecting SEMI_COLON
Details at logfile: /home/sgn/pig_1433261973141.log
Can anyone help me figure out what's wrong with this?
I have gone through the documentation, but nothing in the sample above seems obviously wrong to me. Am I missing something here?
These are the libraries I am using in pom.xml:
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.14.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
Is there any compatibility problem?
Thanks,
-Vipul Pathak;
Found the cause of the problem after about 36 hours of downtime ...
The package name contains "IN", which somehow was a problem for Pig.
package com.pig.in.action.assignments.udf;
//              ^^
When I changed the package name to the following, everything was good:
package com.pig.nnn.action.assignments.udf;
//              ^^^
After building my modified UDF, I registered the jar and defined an alias for the function name and bingo, everything worked:
REGISTER /user/sgn/UDFs/Pig/CountLength-1.jar;
DEFINE CL com.pig.nnn.action.assignments.udf.CountLength;
. . .
. . .
d = FOREACH dept GENERATE dept_no, CL(d_name) AS DeptLength;
I don't recall whether IN is a reserved word in Pig, but its presence in the package name still causes a problem (at least in version 0.14.0 of Pig).
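A quick way to spot this kind of collision before building the jar is to scan the package segments for words the Pig parser may treat as keywords. The sketch below is illustrative only: it assumes the failure comes from the parser reading "in" as a keyword (the answer only observes that renaming fixed it), the keyword set is a small hand-picked sample rather than Pig's full list, and keyword_segments is a hypothetical helper.

```python
from typing import List

# Illustrative sample only; NOT the complete list of Pig keywords.
PIG_KEYWORDS_SAMPLE = {
    "in", "as", "by", "load", "store", "filter",
    "group", "join", "define", "register",
}


def keyword_segments(qualified_name: str) -> List[str]:
    """Return the dot-separated segments that match a sampled keyword.

    The comparison is case-insensitive, on the assumption that Pig
    keywords are matched regardless of case.
    """
    return [
        segment
        for segment in qualified_name.split(".")
        if segment.lower() in PIG_KEYWORDS_SAMPLE
    ]
```

For the original class name this flags the "in" segment, and for the renamed package it flags nothing.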
I tried the above example. As long as the jar is registered using the REGISTER command and is available on the classpath, we should not see any error.
REGISTER myudfs.jar;
dept = LOAD 'a.csv' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
d = FOREACH dept GENERATE dept_no, CountLength(d_name) as length;
Input : a.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
Output : d
(10,10)
(20,8)
(30,5)
(40,10)
N.B.: In the above run the class CountLength was defined in the default package.
If the class CountLength is defined in a package such as com.pig.utility, then to access the UDF we either need a DEFINE statement as below
DEFINE CountLength com.pig.utility.CountLength;
or we have to refer to the UDF by its fully qualified name as below:
d = FOREACH dept GENERATE dept_no, com.pig.utility.CountLength(d_name) as length;
Your jar should be registered, for example:
REGISTER /home/hadoop/udf.jar;
DEFINE CountLength package.CountLength;

Error while using python udf in Pig

I am trying to use a Python UDF, but it throws the error below. I am using CDH5.2.
cat /home/spanda20/pig_data/panda1.py
def get_length(data):
    return len(data)
REGISTER '/home/spanda20/pig_data/panda1.py' USING jython as my_udf;
grunt> A = LOAD 'hdfs://itsusmpl00509.jnj.com:8020/user/spanda20/pig_1.dat' USING PigStorage(',') AS (name:chararray, id:int);
grunt> B = FOREACH A GENERATE name, id,my_udf.get_length(name) as name_len;
2015-01-25 20:47:15,243 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve my_udf.get_length using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/spanda20/pig_1422230028021.log
Sometimes, after a Pig REGISTER command fails for a UDF, you might have to restart the client for Pig to reload the UDF.

How to use string functions in pig

I am trying to convert a string to upper case in Pig using one of its built-in functions. I am using Pig in local mode.
emps.csv
1,John,35,M,101,50000.00,03/03/79
2,Jack,30,F,201,3540000.00,09/10/84
Commands for loading data (WORKS FINE)
empdata = load 'emps.csv' using PigStorage(',') as (id:int,name:chararray,age:int,gender:chararray,deptId:int,sal:double);
dump empdata
Convert to upper case and print it (FAILS WITH ERROR)
empnameucase = foreach empdata generate id,upper(name);
But I am getting the following exception after executing the above command:
Error Log:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve upper using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:653)
at org.apache.pig.impl.PigContext.getClassForAlias(PigContext.java:769)
at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1491)
... 28 more
Please guide.
Try this:
Pig function names are case-sensitive, so you should write the function name in upper case, like
UPPER(name)
Hopefully it works.
