Error in loading the csv file in Apache Pig - hadoop

I tried to load the data using the following command in Apache Pig in HDFS mode:
test = LOAD /user/swap/done2.csv using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);
It gives the error below:
2017-12-12 13:49:10,347 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input '/' expecting QUOTEDSTRING
Details at logfile: /home/matlab/Documents/pig_1513066708530.log
Surprisingly, my dataset does not have 13 columns.

The file path should be in quotes ('') for LOAD:
test = LOAD '/user/swap/done2.csv' using PigStorage (',')as (ID:long, Country:chararray, Carrier:float, ClickDate:chararray, Device:chararray, OS:chararray, UserIp:chararray, PublisherId:float, advertiserCampaignId:float, Fraud:float);
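Once the path is quoted, a quick way to confirm the load worked (standard Pig commands) is:
DESCRIBE test;
DUMP test;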

Related

Error while trying to aggregate data using Apache Pig

This is the code I'm running:
bigrams = LOAD 's3://******' AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp = filter bigrams BY (occurrences >= 300) AND (books >= 12);
bg_tmp_2 = GROUP bg_tmp ALL;
occ_cnt = FOREACH bg_tmp_2 GENERATE bigram, SUM(bg_tmp_2.occurrences);
x = LIMIT occ_cnt 100;
DUMP x;
This is the error I'm getting when computing occ_cnt:
81201 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
18/10/26 16:05:07 ERROR grunt.Grunt: ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
Details at logfile: /mnt/var/log/pig/pig_1540569826316.log
I have no idea why this is happening. I'm using Apache Pig 0.17.0 and Hadoop 2.8.4 on AWS EMR.
I would rewrite your query as:
bg_tmp_2 = GROUP bg_tmp by (bigram);
occ_cnt = FOREACH bg_tmp_2 GENERATE group, SUM(bg_tmp.occurrences);
Replacing GROUP ALL with GROUP BY, since I think you want the SUM per bigram entry.
Replacing bg_tmp_2 with bg_tmp inside SUM, since you want to reference the bg_tmp bag inside the bg_tmp_2 relation.
(If you run "describe bg_tmp_2", you'll see the following schema)
bg_tmp_2: {group: chararray,bg_tmp: {(bigram: chararray,year: int,occurrences: int,books: int)}}
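Put together, the corrected script would look roughly like this (a sketch based on your statements; the alias total_occurrences is just an illustration and I haven't run it):
bigrams = LOAD 's3://******' AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp = FILTER bigrams BY (occurrences >= 300) AND (books >= 12);
-- group per bigram instead of GROUP ALL
bg_tmp_2 = GROUP bg_tmp BY bigram;
-- 'group' holds the bigram; reference the bg_tmp bag inside the grouped relation
occ_cnt = FOREACH bg_tmp_2 GENERATE group AS bigram, SUM(bg_tmp.occurrences) AS total_occurrences;
x = LIMIT occ_cnt 100;
DUMP x;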

SPLIT command not working in Apache Pig

initialData = load 'Weather_Report.log' using PigStorage('|') as (cityid:int,cityname:chararray,currentWeather:chararray,weatherCode:int);
SPLIT initialData INTO noRainsCities IF weatherCode ==10;
STORE noRainCities INTO 'WEATHER_ANALYTICS/TEST_OUT/NoRainCititesData';
Please help me out, guys.
This is the error:
2016-09-28 11:03:14,597 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 52> Syntax error, unexpected symbol at or near '='
iniData = LOAD 'empdetail.log' using PigStorage('|') as (id:int,x:chararray,city:chararray,tech:chararray);
split iniData into a if tech=='Java',b if city=='Pune';
dump a;
dump b;
Guys: SPLIT doesn't work unless 2 or more conditions are given.
Problem solved, thanks.
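Applying that to the original weather query, a sketch (otherCities is an alias I'm introducing; OTHERWISE needs Pig 0.10 or later, otherwise give the second branch its own condition; untested):
initialData = LOAD 'Weather_Report.log' USING PigStorage('|') AS (cityid:int, cityname:chararray, currentWeather:chararray, weatherCode:int);
-- SPLIT needs at least two branches, so send the remaining rows to a second alias
SPLIT initialData INTO noRainsCities IF weatherCode == 10, otherCities OTHERWISE;
STORE noRainsCities INTO 'WEATHER_ANALYTICS/TEST_OUT/NoRainCititesData';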

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following links but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN.
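A rough sketch of that approach (the field names empid, ename, and rest are assumptions, STRSPLIT is assumed available in your Pig version, and this is untested):
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:int, ename:chararray, rest:chararray);
-- split the remaining 'sal;deptno' piece on the semicolon
B = FOREACH A GENERATE empid, ename, FLATTEN(STRSPLIT(rest, ';')) AS (sal:chararray, deptno:chararray);
-- cast the split pieces to the numeric types you need
C = FOREACH B GENERATE empid, ename, (int)sal AS sal, (int)deptno AS deptno;
DUMP C;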
Finally got the solution.
Here it is:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);

Error while using python udf in Pig

I am trying to use a Python UDF, but it is throwing the error below. I am using CDH 5.2.
cat /home/spanda20/pig_data/panda1.py
def get_length(data):
    return len(data)
REGISTER '/home/spanda20/pig_data/panda1.py' USING jython as my_udf;
grunt> A = LOAD 'hdfs://itsusmpl00509.jnj.com:8020/user/spanda20/pig_1.dat' USING PigStorage(',') AS (name:chararray, id:int);
grunt> B = FOREACH A GENERATE name, id,my_udf.get_length(name) as name_len;
2015-01-25 20:47:15,243 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve my_udf.get_length using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/spanda20/pig_1422230028021.log
Sometimes, after a Pig REGISTER command fails for a UDF, you might have to restart the client for Pig to reload the UDF.
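In practice that means starting a fresh grunt session and running the REGISTER before any statement that references the UDF, along the lines of the original script:
REGISTER '/home/spanda20/pig_data/panda1.py' USING jython AS my_udf;
A = LOAD 'hdfs://itsusmpl00509.jnj.com:8020/user/spanda20/pig_1.dat' USING PigStorage(',') AS (name:chararray, id:int);
B = FOREACH A GENERATE name, id, my_udf.get_length(name) AS name_len;
DUMP B;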

How to use string functions in pig

I am trying to convert a string to upper case in Pig using one of its built-in functions. I am using Pig in local mode.
emps.csv
1,John,35,M,101,50000.00,03/03/79
2,Jack,30,F,201,3540000.00,09/10/84
Commands for loading data (WORKS FINE)
empdata = load 'emps.csv' using PigStorage(',') as (id:int,name:chararray,age:int,gender:chararray,deptId:int,sal:double);
dump empdata
Convert to upper case and print it (FAILS WITH ERROR)
empnameucase = foreach empdata generate id,upper(name);
But I am getting the following exception after executing the above command:
Error Log:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve upper using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:653)
at org.apache.pig.impl.PigContext.getClassForAlias(PigContext.java:769)
at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1491)
... 28 more
Please guide.
Try this:
You should specify the function name in upper case, like
UPPER(name)
Hope it works.
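With that change, the failing statement becomes (Pig built-in function names are case-sensitive, so UPPER must be written in capitals):
empnameucase = FOREACH empdata GENERATE id, UPPER(name);
DUMP empnameucase;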
