Getting AVG in Pig - Hadoop

I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there is a multi-character delimiter, I read somewhere here on Stack Overflow to load it first via TextLoader():
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadUsers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
After running
dump average_age_of_testusers
this is the error:
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException:
ERROR 2106: Error while computing average in Initial
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average. I thought I had made a mistake in the data types, but age is an int.
If you can help me, thank you.

I figured out the problem in this one. Please refer to How can correct data types on Apache Pig be enforced? for a better explanation.
But then, just to show what I did... I had to cast my data:
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
AVG was failing because testusers.age was being treated as a string instead of an int; the AS schema after STRSPLIT only names the fields, it does not cast them, so the cast has to be explicit.
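For completeness, here is a minimal end-to-end sketch with the cast applied (same paths and aliases as above):
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
-- cast the STRSPLIT tuple explicitly; the AS schema alone does not convert the fields
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grouped_testusers = GROUP testusers BY gender;
-- AVG now operates on a genuine int column
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
DUMP average_age_of_testusers;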

Related

Error while trying to aggregate data using Apache Pig

This is the code I'm running:
bigrams = LOAD 's3://******' AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp = filter bigrams BY (occurrences >= 300) AND (books >= 12);
bg_tmp_2 = GROUP bg_tmp ALL;
occ_cnt = FOREACH bg_tmp_2 GENERATE bigram, SUM(bg_tmp_2.occurrences);
x = LIMIT occ_cnt 100;
DUMP x;
This is the error I'm getting when computing occ_cnt:
81201 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
18/10/26 16:05:07 ERROR grunt.Grunt: ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
Details at logfile: /mnt/var/log/pig/pig_1540569826316.log
I have no idea why this is happening. I'm using Apache Pig 0.17.0 and Hadoop 2.8.4 on AWS EMR.
I would rewrite your query as:
bg_tmp_2 = GROUP bg_tmp by (bigram);
occ_cnt = FOREACH bg_tmp_2 GENERATE group, SUM(bg_tmp.occurrences);
Replacing GROUP ALL, since I think you want the SUM per bigram entry.
Replacing bg_tmp_2 with bg_tmp, since you want to reference the bg_tmp bag inside the bg_tmp_2 relation.
(If you run DESCRIBE bg_tmp_2, you'll see the following schema:)
bg_tmp_2: {group: chararray,bg_tmp: {(bigram: chararray,year: int,occurrences: int,books: int)}}
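(If a single global sum was actually intended, GROUP ... ALL works too, but then there is no per-bigram group to project; a minimal sketch:)
bg_tmp_2 = GROUP bg_tmp ALL;
-- with GROUP ALL the inner bag is still named bg_tmp, and there is no bigram field to project
total_occ = FOREACH bg_tmp_2 GENERATE SUM(bg_tmp.occurrences);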

SPLIT command not working in Apache Pig

initialData = load 'Weather_Report.log' using PigStorage('|') as (cityid:int,cityname:chararray,currentWeather:chararray,weatherCode:int);
SPLIT initialData INTO noRainsCities IF weatherCode ==10;
STORE noRainCities INTO 'WEATHER_ANALYTICS/TEST_OUT/NoRainCititesData';
Please help me out, guys.
This is the error:
2016-09-28 11:03:14,597 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 52> Syntax error, unexpected symbol at or near '='
iniData = LOAD 'empdetail.log' using PigStorage('|') as (id:int,x:chararray,city:chararray,tech:chararray);
split iniData into a if tech=='Java',b if city=='Pune';
dump a;
dump b;
Guys: SPLIT doesn't work unless two or more conditions are given.
Problem solved, thanks!
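(For a single condition, a plain FILTER is the usual alternative to SPLIT; a minimal sketch against the original schema:)
initialData = load 'Weather_Report.log' using PigStorage('|') as (cityid:int,cityname:chararray,currentWeather:chararray,weatherCode:int);
-- FILTER handles the single-branch case that SPLIT's grammar rejects
noRainsCities = FILTER initialData BY weatherCode == 10;
STORE noRainsCities INTO 'WEATHER_ANALYTICS/TEST_OUT/NoRainCititesData';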

Error while storing in HBase using Pig

Input data on HDFS:
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
But it gives this error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my Load is working perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP 2.3
Apache Pig version 0.15.0
HBase 1.1.1
Also, all JARs are in place, as I installed them through Ambari.
Solved the data upload:
I was missing a RANK on the relation, so the HBase row key becomes the rank.
DATA_FIL_1 = RANK DATA_FIL_2;
NOTE: this will generate an arbitrary row key.
But if you want to define your own row key, do it like this:
you have to create another relation; the STORE function alone won't work.
This will take the first field as the row key (which you have defined).
STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');
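(Putting the two pieces together, a minimal sketch, assuming DATA_FIL is the relation from the question: RANK prepends a rank column, and HBaseStorage treats that first field as the row key.)
-- RANK prepends a rank column; HBaseStorage uses the first field as the row key
DATA_FIL_1 = RANK DATA_FIL;
STORE DATA_FIL_1 INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');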

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get out of this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN.
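(A minimal sketch of that approach; the field names are assumptions, and the explicit tuple cast is there for the same reason as in the AVG question above:)
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:int, ename:chararray, rest:chararray);
-- split the remaining 'sal;deptno' part on ';' and cast the pieces to int
B = FOREACH A GENERATE empid, ename, FLATTEN((tuple(int,int)) STRSPLIT(rest, ';')) AS (sal:int, deptno:int);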
Finally got a solution. Here it is:
A = LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
-- '\u003B' is the Unicode escape for ';', which sidesteps the lexer error above
B = FOREACH A GENERATE empid, ename, FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);
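(One caveat, same as in the AVG question above: REGEX_EXTRACT_ALL returns chararray fields, so an explicit cast may be needed before using sal or deptno numerically:)
B = FOREACH A GENERATE empid, ename, FLATTEN((tuple(int,int)) REGEX_EXTRACT_ALL(line,'(.*)\\u003B(.*)')) AS (sal:int, deptno:int);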

Pig Job Failed. Need suggestion

Please help me understand why the join below is failing.
Code:
nyse_div = load '/home/cloudera/NYSE_daily_dividends' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, dividends:double);
nyse_div1 = foreach nyse_div generate symbol, SUBSTRING(date,0,4) as year, dividends;
nyse_div2 = group nyse_div1 by (symbol,year);
nyse_div3 = foreach nyse_div2 generate group, AVG(nyse_div1.dividends);
nyse_price = load '/home/cloudera/NYSE_daily_prices' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
nyse_price1 = foreach nyse_price generate symbol, SUBSTRING(date,0,4) as year, open..;
nyse_price2 = group nyse_price1 by (symbol,year);
nyse_price3 = foreach nyse_price2 generate group, MAX(nyse_price1.high), MIN(nyse_price1.low);
nyse_final = join nyse_div3 by group, nyse_price3 by group;
--store nyse_div3 into 'home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
--store nyse_price3 into 'home/cloudera/NYSE_daily_dividends/output1' using PigStorage(',');
store nyse_final into '/home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
Failed Jobs:
JobId Alias Feature Message Outputs
job_local766969553_0008 nyse_final HASH_JOIN Message: Job failed! /home/cloudera/NYSE_daily_dividends/output,
Input(s):
Successfully read records from: "/home/cloudera/NYSE_daily_dividends"
Successfully read records from: "/home/cloudera/NYSE_daily_prices"
Output(s):
Failed to produce result in "/home/cloudera/NYSE_daily_dividends/output"
Job DAG:
job_local1308827629_0006 -> job_local766969553_0008,
job_local241929118_0007 -> job_local766969553_0008,
job_local766969553_0008
2014-11-12 17:00:35,263 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
Your code works perfectly for me. I think you will have to paste the complete error log to understand the error.
Please see the output that I got for the input below.
NYSE_daily_dividends
NYSE,AIT,2009-11-12,0.15
NYSE,AIT,2009-08-12,0.15
NYSE,AIT,2009-05-13,0.15
NYSE,AIT,2009-02-11,0.15
NYSE_daily_prices
NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24
NYSE,AEA,2010-02-05,4.42,4.54,4.22,4.41,194300,4.41
NYSE,AEA,2010-02-04,4.55,4.69,4.39,4.42,233800,4.42
NYSE,AIT,2009-02-11,0.15,4.87,4.55,4.55,234444,4.56
Output from your code:
((AIT,2009),0.15,(AIT,2009),4.87,4.55)
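(As a side note, and not part of the original answer: flattening the group before the join yields plain symbol/year columns instead of the nested tuples seen in that output. A sketch:)
nyse_div3 = foreach nyse_div2 generate FLATTEN(group) as (symbol, year), AVG(nyse_div1.dividends) as avg_div;
nyse_price3 = foreach nyse_price2 generate FLATTEN(group) as (symbol, year), MAX(nyse_price1.high) as max_high, MIN(nyse_price1.low) as min_low;
nyse_final = join nyse_div3 by (symbol, year), nyse_price3 by (symbol, year);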
