Presence of "in" in Pig's UDF causes problems - hadoop

I was trying my first UDF in pig and wrote the following function -
package com.pig.in.action.assignments.udf;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;
import java.io.IOException;
public class CountLength extends EvalFunc<Integer> {
    public Integer exec(Tuple inputVal) throws IOException {
        // Validate Input Value ...
        if (inputVal == null ||
                inputVal.size() == 0 ||
                inputVal.get(0) == null) {
            // Emit warning text for user, and skip this iteration
            super.warn("Inappropriate parameter, Skipping ...",
                    PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }
        // Count # of characters in this string ...
        final String inputString = (String) inputVal.get(0);
        return inputString.length();
    }
}
However, when I try to use it as follows, Pig throws an error message that is not easy to understand, at least for me, in the context of my UDF:
grunt> cat dept.txt;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
grunt> dept = LOAD '/user/sgn/dept.txt' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
grunt> d = FOREACH dept GENERATE dept_no, com.pig.in.action.assignments.udf.CountLength(d_name);
2015-06-02 16:24:13,416 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 79> mismatched input '(' expecting SEMI_COLON
Details at logfile: /home/sgn/pig_1433261973141.log
Can anyone help me figure out what's wrong with this?
I have gone through the documentation, but nothing seems obviously wrong in the sample above. Am I missing something here?
These are the libraries I am using in pom.xml:
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.14.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
Is there any compatibility problem?
Thanks,
-Vipul Pathak;

Found the reason for the problem after about 36 hours of being stuck ...
The package name contains "in", which somehow was a problem for Pig.
package com.pig.in.action.assignments.udf;
// ^^
When I changed the package name to the following, everything was good -
package com.pig.nnn.action.assignments.udf;
// ^^^
After building my modified UDF, I registered the jar and defined an alias for the function name, and bingo, everything worked:
REGISTER /user/sgn/UDFs/Pig/CountLength-1.jar;
DEFINE CL com.pig.nnn.action.assignments.udf.CountLength;
. . .
. . .
d = FOREACH dept GENERATE dept_no, CL(d_name) AS DeptLength;
I don't recall whether IN is a reserved word in Pig, but its presence in the package name causes a problem (at least in Pig 0.14.0). As it happens, recent Pig versions do have an IN operator, which would explain why the parser chokes on it inside a package path.

Tried the above example. As long as the jar is registered using the REGISTER command and is available on the classpath, we should not see any error.
REGISTER myudfs.jar;
dept = LOAD 'a.csv' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
d = FOREACH dept GENERATE dept_no, CountLength(d_name) as length;
Input : a.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
Output : d
(10,10)
(20,8)
(30,5)
(40,10)
N.B.: In the above run the class CountLength was defined in the default package.
If the class CountLength is defined in a package com.pig.utility, then to access the UDF we either have to add a DEFINE statement as below:
DEFINE CountLength com.pig.utility.CountLength;
or refer to the UDF by its fully qualified name, as below:
d = FOREACH dept GENERATE dept_no, com.pig.utility.CountLength(d_name) as length;

Your jar should be registered, e.g.:
REGISTER /home/hadoop/udf.jar;
DEFINE CountLength package.CountLength;
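A usage sketch with that alias, reusing the relation and column names from the question (a minimal illustration, not the asker's exact script):
dept = LOAD '/user/sgn/dept.txt' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
d = FOREACH dept GENERATE dept_no, CountLength(d_name);
DUMP d;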

Related

Airflow Failed: ParseException line 2:0 cannot recognize input near

I'm trying to run a test task on Airflow but I keep getting the following error:
FAILED: ParseException 2:0 cannot recognize input near 'create_import_table_fct_latest_values' '.' 'hql'
Here is my Airflow DAG file:
import airflow
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.models import DAG

args = {
    'owner': 'raul',
    'start_date': datetime(2018, 11, 12),
    'provide_context': True,
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['raul.gregglino@leroymerlin.ru'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily"
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql ',
    hiveconf_jinja_translate=True,
    dag=dag
)

deps = {}

# Explicitly define the dependencies in the DAG
for downstream, upstream_list in deps.iteritems():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)
Here is the content of my HQL file, in case this may be the issue and I can't figure it out:
*I'm testing the connection to check whether the table gets created or not; then I'll try to LOAD DATA, hence the LOAD DATA is commented out.
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED ',';
#LOAD DATA LOCAL INPATH
#'/media/windows_share/schemas/opus/fct_latest_values_20181106.csv'
#OVERWRITE INTO TABLE opus_data.fct_latest_values_new_data;
In the HQL file it should be FIELDS TERMINATED BY ',':
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Also, comments should start with -- in an HQL file, not #.
In addition, this seems incorrect and is causing an exception: hql='create_import_table_fct_latest_values.hql '
Have a look at this example:
# Create full path for the file
hql_file_path = os.path.join(os.path.dirname(__file__), source['hql'])
print hql_file_path
run_hive_query = HiveOperator(
    task_id='run_hive_query',
    dag=dag,
    hql="""
    {{ local_hive_settings }}
    """ + "\n " + open(hql_file_path, 'r').read()
)
See here for more details.
Or put all HQL into hql parameter:
hql='CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data ...'
I managed to find the answer to my issue.
It was related to the path my HiveOperator was calling the file from. As no Variable had been defined to tell Airflow where to look, I was getting the error I mentioned in my post.
Once I defined it using the webserver interface (see picture), my DAG started to work properly.
I made a change to my DAG code regarding the file location, for organization only, and this is how my HiveOperator looks now:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='hql/create_import_table_fct_latest_values2.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)
Thanks to (@panov.st) who helped me in person to identify my issue.

1003 error (unable to find an operator for alias ) in group function in pig

I have written a .pig file whose content is :
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '/pigdata/salaryTravelReport.csv' using csvloader();
x = foreach xyz generate $0 as name:chararray, $1 as title:chararray, replace($2, ',','') as salary:bytearray, replace($3, ',', '') as travel:bytearray, $4 as orgtype:chararray, $5 as org:chararray, $6 as year:bytearray;
refined = foreach x generate name, title, (float)salary, (float)travel, orgtype, org, (int)year;
year2010 = filter refined by year == 2010;
byjobtitile = GROUP year2010 by title;
The purpose is to remove ',' from the dollar values in two columns and then group the data by job title. When I run this using the run command there is no error. Even dumping year2010 works fine. But dumping byjobtitile gives an error (screenshot of the error omitted).
The output of the log file is:
Pig Stack Trace
---------------
ERROR 1003: Unable to find an operator for alias byjobtitle

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1003: Unable to find an operator for alias byjobtitle
    at org.apache.pig.PigServer$Graph.buildPlan(PigServer.java:1544)
    at org.apache.pig.PigServer.storeEx(PigServer.java:1029)
    at org.apache.pig.PigServer.store(PigServer.java:997)
    at org.apache.pig.PigServer.openIterator(PigServer.java:910)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
    at org.apache.pig.Main.run(Main.java:565)
    at org.apache.pig.Main.main(Main.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I am new to big data and don't have much knowledge, but it looks like there is a problem with the data types. Can anyone help me out?
The issue is due to the "CSVLoader" you are using. It has ',' as the default delimiter. Since your data also has ',' in some of its fields, like salary and travel, the positional index gets shifted. So if your data is something like this
name title salary travel orgtype org year
A B 10,000 23,1357 ORG_TYPE ORG 2016
then using CSVLoader will make "A B 10" the first field, "000 23" the second field and "1357 ORG_TYPE ORG 2016" the third field, based on ','.
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '<path to your file>' using csvloader();
a = foreach xyz generate $0;
2016-06-07 12:28:12,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(A B 10)
You can make your delimiter different so that it is not present in any field value.
Try using CSVExcelStorage. You can use its constructor to explicitly define the delimiter
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
It will work fine as long as the same character is not used both as the delimiter and inside any field value.
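For illustration, a minimal usage sketch of that advice, assuming the data has been re-exported with '|' as the separator (the file name salaryTravelReport_pipe.csv is hypothetical):
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
-- with '|' as the delimiter, the commas inside salary and travel no longer shift the fields
xyz = load '/pigdata/salaryTravelReport_pipe.csv' using CSVExcelStorage();
x = foreach xyz generate $0 as name:chararray, $1 as title:chararray, replace($2, ',', '') as salary:chararray, replace($3, ',', '') as travel:chararray;
refined = foreach x generate name, title, (float)salary, (float)travel;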

Using Spark Context To Read Parquet File as RDD(wihout using Spark-Sql Context) giving Exception

I am trying to read and write a Parquet file as an RDD using Spark. I can't use Spark SQL context in my current application (it needs a Parquet schema as a StructType, and converting it from the Avro schema gives me a cast exception in a few cases).
So I try to implement saving a Parquet file by overloading AvroParquetFormat and sending ParquetOutputFormat to Hadoop, writing in the following way:
def saveAsParquetFile[T <: IndexedRecord](records: RDD[T], path: String)(implicit m: ClassTag[T]) = {
  val keyedRecords: RDD[(Void, T)] = records.map(record => (null, record))
  spark.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
  val job = Job.getInstance(spark.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, m.runtimeClass.newInstance().asInstanceOf[IndexedRecord].getSchema())
  keyedRecords.saveAsNewAPIHadoopFile(
    path,
    classOf[Void],
    m.runtimeClass.asInstanceOf[Class[T]],
    classOf[ParquetOutputFormat[T]],
    job.getConfiguration
  )
}
This is throwing the error:
Exception in thread "main" java.lang.InstantiationException: org.apache.avro.generic.GenericRecord
I am calling the function as follows:
val file1: RDD[GenericRecord] = sc.parquetFile[GenericRecord]("/home/abc.parquet")
sc.saveAsParquetFile(file1, "/home/abc/")

How to process multi-delimiter file in pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT the last field by ';' and FLATTEN, as in the sketch below.
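A minimal sketch of that approach (field names borrowed from the asker's solution below; the final casts are illustrative):
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:int, ename:chararray, rest:chararray);
-- split the remaining 'sal;deptno' part on ';' and flatten the resulting tuple into two columns
B = FOREACH A GENERATE empid, ename, FLATTEN(STRSPLIT(rest, ';')) AS (sal:chararray, deptno:chararray);
C = FOREACH B GENERATE empid, ename, (int)sal, (int)deptno;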
Finally got the solution.
Here is my solution:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);

string concatenation not working in pig

I have a table in HCatalog which has 3 string columns. When I try to concatenate strings, I am getting the following error:
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader() ;
B = LOAD 'default.cdr_data' USING org.apache.hcatalog.pig.HCatLoader();
c = FOREACH A GENERATE CONCAT(mcc,'-',mnc) as newCid;
Could not resolve concat using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast
What might be the root cause of the problem?
Maybe this will help for concatenation in Pig.
data1 contains:
(Maths,abc)
(Maths,def)
(Maths,ef)
(Maths,abc)
(Science,ac)
(Science,bc)
(Chemistry,xc)
(Telugu,xyz)
considering the schema as sub (Maths, Maths, Science, etc.) and name (abc, def, ef, etc.):
X = FOREACH data1 GENERATE CONCAT(sub,CONCAT('#',name));
O/P of X is:
(Maths#abc)
(Maths#def)
(Maths#ef)
(Maths#abc)
(Science#ac)
(Science#bc)
(Chemistry#xc)
(Telugu#xyz)
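As a side note on the error in the question: the message explicitly suggests a cast, so if mcc and mnc are not already typed as chararray, a sketch along these lines (nesting two-argument CONCAT as above, with illustrative casts) may also be worth trying:
c = FOREACH A GENERATE CONCAT((chararray)mcc, CONCAT('-', (chararray)mnc)) AS newCid;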

Resources