A cat of the Hadoop DFS input data:
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but it gives this error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my LOAD works perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP 2.3
Apache Pig version 0.15.0
HBase 1.1.1
All jars are also in place, as I installed them through Ambari.
Solved the data upload:
I was missing a RANK on the relation, so the HBase row key becomes the rank.
DATA_FIL_1 = RANK DATA_FIL;
NOTE: this will generate an arbitrary row key.
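Combined with the RANK above, a minimal sketch of the full store (table and column names as in the question; when storing, HBaseStorage takes the first field of each tuple, here the synthetic rank, as the row key and maps the remaining thirteen fields to the listed columns):
-- the rank field produced by RANK becomes the HBase row key
STORE DATA_FIL_1 INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');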
But if you want to define your own row key, then do it like this:
you have to assign the STORE to another relation; the STORE statement alone won't work.
This will take the first field as the row key (which you have defined).
storage_data = STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');
Related
I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there is a multi-character delimiter, I read somewhere here on Stack Overflow to load it first via TextLoader():
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadUsers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
After running
dump average_age_of_testusers
this is the error I get:
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average. I thought I had made a mistake in the data type, but age is an int.
If you can help me, thank you.
I figured out the problem in this one. Please refer to "How can correct data types on Apache Pig be enforced?" for a better explanation.
But then, just to show what I did: I had to cast my data.
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
AVG was failing because age was really a bytearray (effectively a string) rather than an int: the AS clause after FLATTEN only names the fields that STRSPLIT returns, it does not cast them, so the cast has to be explicit.
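Putting that cast back into the original script, a minimal end-to-end sketch (paths and aliases exactly as in the question):
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() AS (line:chararray);
-- the explicit tuple cast makes user/age/occupation real ints rather than bytearrays
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
grouped_testusers = GROUP testusers BY gender;
-- AVG now receives ints, so the per-gender average computes cleanly
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
DUMP average_age_of_testusers;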
How can I parse the below type of XML in Hadoop or Pig?
Here is the type of XML I need to handle (in Pig or Hive):
<PowerEvent sequence="00829" elapsedrealtime="0000047391" uptime="0000047391" timestamp="2016-01-17 00:31:36.750+0100" health="Good" level="69" plugged="NotPlugged" present="Present" status="NotCharging" temperature="23.0" voltage="3731" chargercurrent="25" batterycurrent="2209" coulombcounter="4294967292" screen="Off"/>
<ConnectivityEvent sequence="00830" elapsedrealtime="0000047471" uptime="0000047471" timestamp="2016-01-17 00:31:36.831+0100" connected="true" available="true" activenetwork="WIFI" mobiledata="Off" cellular="Unknown" operatorid="22210" operatorname="vodafone IT"/>
I tried the below script:
register '/home/rajpsu03/pig/piggybank.jar'
xmldata = LOAD '/user/rajpsu03/pig/test.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Events') as(doc:chararray);
data = foreach xmldata
GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'[\\s*\\S*]*<PowerEvent[\\s*\\S*]*sequence="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*elapsedrealtime="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*uptime="(.*?)"[\\s*\\S*]*/>')) AS (sequence:chararray,elapsedrealtime:chararray,uptime:chararray);
dump data;
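There is no accepted answer here, but as a hedged alternative sketch: since each event sits on its own line in the sample, the attribute values can be pulled out line by line with TextLoader and REGEX_EXTRACT instead of matching the whole Events block at once (this is a swapped-in technique, not a fix of the script above; paths are taken from the question):
-- read the file one line per record and keep only the PowerEvent lines
lines = LOAD '/user/rajpsu03/pig/test.xml' USING TextLoader() AS (line:chararray);
power_lines = FILTER lines BY line MATCHES '.*<PowerEvent.*';
-- extract each attribute value from the line
power = FOREACH power_lines GENERATE
    REGEX_EXTRACT(line, 'sequence="([^"]*)"', 1) AS sequence:chararray,
    REGEX_EXTRACT(line, 'elapsedrealtime="([^"]*)"', 1) AS elapsedrealtime:chararray,
    REGEX_EXTRACT(line, 'uptime="([^"]*)"', 1) AS uptime:chararray;
dump power;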
How can I create Pig schema for the below tuple data while loading the relation?
$ cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
I tried the below statement in local mode
A = LOAD '/home/cloudera/data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
When I dump the data, I expect this result:
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
But what I get is:
((3,8,9),)
((1,4,7),)
((2,5,8),)
I am using Apache Pig version 0.11.0-cdh4.7.0
The following works. With the original statement, LOAD falls back to PigStorage with its default tab delimiter, so each line holds only one field: t1 is cast from it and t2 comes back null. Specifying the space as the delimiter fixes it:
A = load '$input' using PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
describe A;
dump A;
The dump:
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
Please help me to understand why the below join is failing.
Code:
nyse_div= load '/home/cloudera/NYSE_daily_dividends' using PigStorage(',')as(exchange:chararray, symbol:chararray, date:chararray, dividends:double);
nyse_div1= foreach nyse_div generate symbol,SUBSTRING(date,0,4) as year,dividends;
nyse_div2= group nyse_div1 by (symbol,year);
nyse_div3= foreach nyse_div2 generate group,AVG(nyse_div1.dividends);
nyse_price= load '/home/cloudera/NYSE_daily_prices' using PigStorage(',')as(exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
nyse_price1= foreach nyse_price generate symbol,SUBSTRING(date,0,4) as year,open..;
nyse_price2= group nyse_price1 by (symbol,year);
nyse_price3= foreach nyse_price2 generate group,MAX(nyse_price1.high),MIN(nyse_price1.low);
nyse_final= join nyse_div3 by group,nyse_price3 by group;
--store nyse_div3 into 'home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
--store nyse_price3 into 'home/cloudera/NYSE_daily_dividends/output1' using PigStorage(',');
store nyse_final into '/home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
Failed Jobs:
JobId Alias Feature Message Outputs
job_local766969553_0008 nyse_final HASH_JOIN Message: Job failed! /home/cloudera/NYSE_daily_dividends/output,
Input(s):
Successfully read records from: "/home/cloudera/NYSE_daily_dividends"
Successfully read records from: "/home/cloudera/NYSE_daily_prices"
Output(s):
Failed to produce result in "/home/cloudera/NYSE_daily_dividends/output"
Job DAG:
job_local1308827629_0006 -> job_local766969553_0008,
job_local241929118_0007 -> job_local766969553_0008,
job_local766969553_0008
2014-11-12 17:00:35,263 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
Your code works perfectly for me. I think you have to paste the complete error log to understand the error.
Please see the output that I got for the below input:
NYSE_daily_dividends
NYSE,AIT,2009-11-12,0.15
NYSE,AIT,2009-08-12,0.15
NYSE,AIT,2009-05-13,0.15
NYSE,AIT,2009-02-11,0.15
NYSE_daily_prices
NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24
NYSE,AEA,2010-02-05,4.42,4.54,4.22,4.41,194300,4.41
NYSE,AEA,2010-02-04,4.55,4.69,4.39,4.42,233800,4.42
NYSE,AIT,2009-02-11,0.15,4.87,4.55,4.55,234444,4.56
Output from your code
((AIT,2009),0.15,(AIT,2009),4.87,4.55)
Are the below properties in hive-site.xml correct for Hive access to Cassandra?
(I have copied the entire hive-default.xml content but changed only the below properties.)
javax.jdo.option.ConnectionURL: cassandra://localhost:9160
javax.jdo.option.ConnectionDriverName: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbclass: jdbc:cassandra
hive.stats.jdbcdriver: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbconnectionstring: jdbc:cassandra:;databaseName=TempStatsStore;create=true
I am running a 1-node Cassandra, but would later make it at least a 2-node cluster.
When I run the below table creation command, I get an error:
CREATE EXTERNAL TABLE MyHiveTable
(m string, n string, o string, p string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
"cassandra.cf.name" = "test",
"cassandra.cql3.type" = "text, text, text, text");
Error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I don't know about the JDO settings, but you could try this link, which is a far better option for integrating Hive with Cassandra:
https://github.com/milliondreams/hive/tree/cas-support-cql/cassandra-handler