A cat of the Hadoop DFS input data:
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but it gives this error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my LOAD works perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP 2.3
Apache Pig version 0.15.0
HBase 1.1.1
All jars are also in place, as I installed them through Ambari.
Solved the data upload:
I was missing a RANK on the relation, so the HBase row key becomes the rank.
DATA_FIL_1 = RANK DATA_FIL;
NOTE: this will generate an arbitrary row key.
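Combined with the RANK above, a minimal sketch of the full store (table and column names as in the question; when storing, HBaseStorage takes the first field of each tuple, here the synthetic rank, as the row key and maps the remaining thirteen fields to the listed columns):
-- the rank field produced by RANK becomes the HBase row key
STORE DATA_FIL_1 INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');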
But if you want to define your own row key, then do it like this:
you have to assign the STORE to another relation; the STORE statement alone won't work.
This will take the first field as the row key (which you have defined).
storage_data = STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');
Related
I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there is a multi-character delimiter, I read somewhere here on Stack Overflow to load it first via TextLoader():
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadUsers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
After running
dump average_age_of_testusers
this is the error I get:
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average. I thought I had made a mistake in the data type, but age is an int.
If you can help me, thank you.
I figured out the problem in this one. Please refer to "How can correct data types on Apache Pig be enforced?" for a better explanation.
But then, just to show what I did: I had to cast my data.
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
AVG was failing because age was really a bytearray (effectively a string) rather than an int: the AS clause after FLATTEN only names the fields that STRSPLIT returns, it does not cast them, so the cast has to be explicit.
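Putting that cast back into the original script, a minimal end-to-end sketch (paths and aliases exactly as in the question):
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() AS (line:chararray);
-- the explicit tuple cast makes user/age/occupation real ints rather than bytearrays
testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
grouped_testusers = GROUP testusers BY gender;
-- AVG now receives ints, so the per-gender average computes cleanly
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
DUMP average_age_of_testusers;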
How can I parse the below type of XML in Hadoop or Pig?
Here is the type of XML I need to handle (in Pig or Hive):
<PowerEvent sequence="00829" elapsedrealtime="0000047391" uptime="0000047391" timestamp="2016-01-17 00:31:36.750+0100" health="Good" level="69" plugged="NotPlugged" present="Present" status="NotCharging" temperature="23.0" voltage="3731" chargercurrent="25" batterycurrent="2209" coulombcounter="4294967292" screen="Off"/>
<ConnectivityEvent sequence="00830" elapsedrealtime="0000047471" uptime="0000047471" timestamp="2016-01-17 00:31:36.831+0100" connected="true" available="true" activenetwork="WIFI" mobiledata="Off" cellular="Unknown" operatorid="22210" operatorname="vodafone IT"/>
I tried the below script:
register '/home/rajpsu03/pig/piggybank.jar'
xmldata = LOAD '/user/rajpsu03/pig/test.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Events') as(doc:chararray);
data = foreach xmldata
GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'[\\s*\\S*]*<PowerEvent[\\s*\\S*]*sequence="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*elapsedrealtime="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*uptime="(.*?)"[\\s*\\S*]*/>')) AS (sequence:chararray,elapsedrealtime:chararray,uptime:chararray);
dump data;
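There is no accepted answer here, but as a hedged alternative sketch: since each event sits on its own line in the sample, the attribute values can be pulled out line by line with TextLoader and REGEX_EXTRACT instead of matching the whole Events block at once (this is a swapped-in technique, not a fix of the script above; paths are taken from the question):
-- read the file one line per record and keep only the PowerEvent lines
lines = LOAD '/user/rajpsu03/pig/test.xml' USING TextLoader() AS (line:chararray);
power_lines = FILTER lines BY line MATCHES '.*<PowerEvent.*';
-- extract each attribute value from the line
power = FOREACH power_lines GENERATE
    REGEX_EXTRACT(line, 'sequence="([^"]*)"', 1) AS sequence:chararray,
    REGEX_EXTRACT(line, 'elapsedrealtime="([^"]*)"', 1) AS elapsedrealtime:chararray,
    REGEX_EXTRACT(line, 'uptime="([^"]*)"', 1) AS uptime:chararray;
dump power;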
How can I create Pig schema for the below tuple data while loading the relation?
$ cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
I tried the below statement in local mode
A = LOAD '/home/cloudera/data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
When I dump the data, I expect this result:
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
But what I get is:
((3,8,9),)
((1,4,7),)
((2,5,8),)
I am using Apache Pig version 0.11.0-cdh4.7.0
The following works. With the original statement, LOAD falls back to PigStorage with its default tab delimiter, so each line holds only one field: t1 is cast from it and t2 comes back null. Specifying the space as the delimiter fixes it:
A = load '$input' using PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
describe A;
dump A;
The dump:
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
Please help me to understand why the below join is failing.
Code:
nyse_div= load '/home/cloudera/NYSE_daily_dividends' using PigStorage(',')as(exchange:chararray, symbol:chararray, date:chararray, dividends:double);
nyse_div1= foreach nyse_div generate symbol,SUBSTRING(date,0,4) as year,dividends;
nyse_div2= group nyse_div1 by (symbol,year);
nyse_div3= foreach nyse_div2 generate group,AVG(nyse_div1.dividends);
nyse_price= load '/home/cloudera/NYSE_daily_prices' using PigStorage(',')as(exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
nyse_price1= foreach nyse_price generate symbol,SUBSTRING(date,0,4) as year,open..;
nyse_price2= group nyse_price1 by (symbol,year);
nyse_price3= foreach nyse_price2 generate group,MAX(nyse_price1.high),MIN(nyse_price1.low);
nyse_final= join nyse_div3 by group,nyse_price3 by group;
--store nyse_div3 into 'home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
--store nyse_price3 into 'home/cloudera/NYSE_daily_dividends/output1' using PigStorage(',');
store nyse_final into '/home/cloudera/NYSE_daily_dividends/output' using PigStorage(',');
Failed Jobs:
JobId Alias Feature Message Outputs
job_local766969553_0008 nyse_final HASH_JOIN Message: Job failed! /home/cloudera/NYSE_daily_dividends/output,
Input(s):
Successfully read records from: "/home/cloudera/NYSE_daily_dividends"
Successfully read records from: "/home/cloudera/NYSE_daily_prices"
Output(s):
Failed to produce result in "/home/cloudera/NYSE_daily_dividends/output"
Job DAG:
job_local1308827629_0006 -> job_local766969553_0008,
job_local241929118_0007 -> job_local766969553_0008,
job_local766969553_0008
2014-11-12 17:00:35,263 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
Your code works perfectly for me. I think you have to paste the complete error log to understand the error.
Please see the output that I got for the below input:
NYSE_daily_dividends
NYSE,AIT,2009-11-12,0.15
NYSE,AIT,2009-08-12,0.15
NYSE,AIT,2009-05-13,0.15
NYSE,AIT,2009-02-11,0.15
NYSE_daily_prices
NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24
NYSE,AEA,2010-02-05,4.42,4.54,4.22,4.41,194300,4.41
NYSE,AEA,2010-02-04,4.55,4.69,4.39,4.42,233800,4.42
NYSE,AIT,2009-02-11,0.15,4.87,4.55,4.55,234444,4.56
Output from your code
((AIT,2009),0.15,(AIT,2009),4.87,4.55)
Are the below properties in hive-site.xml correct for Hive access to Cassandra?
(I have copied the entire hive-default.xml content but changed only the below properties.)
javax.jdo.option.ConnectionURL: cassandra://localhost:9160
javax.jdo.option.ConnectionDriverName: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbclass: jdbc:cassandra
hive.stats.jdbcdriver: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbconnectionstring: jdbc:cassandra:;databaseName=TempStatsStore;create=true
I am running a 1-node Cassandra, but would later make it at least a 2-node cluster.
When I run the below table creation command, I get an error:
CREATE EXTERNAL TABLE MyHiveTable
(m string, n string, o string, p string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
"cassandra.cf.name" = "test",
"cassandra.cql3.type" = "text, text, text, text");
Error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I don't know about the JDO settings, but you could try this link, which is a far better option for integrating Hive with Cassandra:
https://github.com/milliondreams/hive/tree/cas-support-cql/cassandra-handler