Creating schema for Tuple in Apache Pig - hadoop

How can I create Pig schema for the below tuple data while loading the relation?
$ cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
I tried the below statement in local mode
A = LOAD '/home/cloudera/data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
When I dump the data, I expect this result:
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
But what I got was:
((3,8,9),)
((1,4,7),)
((2,5,8),)
I am using Apache Pig version 0.11.0-cdh4.7.0

The following works, specifying the space delimiter explicitly:
A = load '$input' using PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
describe A;
dump A;
The dump:
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
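Why the first attempt returned nulls for the second tuple: PigStorage defaults to a tab field delimiter, so each space-separated line arrives as a single field and only the first tuple in the schema gets data. A minimal Python sketch of the two split behaviors (illustrative only, not Pig code):

```python
line = "(3,8,9) (4,5,6)"

# Default PigStorage splits on tab: the whole line is one field, so the
# second tuple in the schema has nothing to bind to and comes back null.
print(line.split('\t'))  # ['(3,8,9) (4,5,6)']

# PigStorage(' ') splits on a space: two fields, one per tuple.
print(line.split(' '))   # ['(3,8,9)', '(4,5,6)']
```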

Related

Unable to load data into parquet file format?

I am trying to parse log data into parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the Hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp//hive/auxlib directory.
2. Copy /usr/hdp//hive/lib/hive-contrib-.jar to /usr/hdp//hive/auxlib.
3. Restart the HiveServer2 service.
For further reference, see:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

How to merge orc files for external tables?

I am trying to merge multiple small ORC files. I came across the ALTER TABLE ... CONCATENATE command, but that only works for managed tables.
Hive gave me the following error when I tried to run it:
FAILED: SemanticException
org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge
can only be performed on managed tables
Following are the table parameters :
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 535
numRows 27051810
orc.compress SNAPPY
rawDataSize 20192634094
totalSize 304928695
transient_lastDdlTime 1512126635
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I believe your table is an external table; in that case there are two ways:
Either you can change it to a managed table (ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='FALSE')) and run ALTER TABLE ... CONCATENATE. Then you can convert it back to external by changing the property to 'TRUE'.
Or you can create a managed table using CTAS and insert the data. Then run the merge query and import the data back into the external table.
From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.
import pyorc
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            for line in reader:
                writer.write(line)

if __name__ == '__main__':
    main()

XPath for stackoverflow dump files

I am working with a file in the following format:
<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>
I am using pig xmlloader to fetch this xml data into hdfs.
A = LOAD '/badges' using org.apache.pig.piggybank.storage.XMLLoader('row') as (x:chararray);
B = foreach A generate xpath(x, '/row#Id');
Dump B;
The output I get is () - no values.
I want the output as text, i.e. 1,1,Teacher,2009-09-30T15:17:50.66. How can I do this?
I'm not familiar with pig xmlloader, but /row#Id has two problems:
It's not valid XPath
If it were, it would be an absolute path
Try:
B = foreach A generate xpath(x, 'row/@Id');
It uses valid syntax and a relative path.
Use XPathAll for extracting attributes; XPath has an issue when it comes to attributes.
REGISTER '/path/piggybank-0.15.0.jar'; -- Use the jar name you downloaded
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
B = foreach A generate XPathAll(x, 'row/@Id', true, false).$0 as (id:chararray);
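For comparison, here is what that attribute extraction looks like outside Pig, using Python's standard-library ElementTree on the sample badges XML (a sketch to show the expected values, not a Pig replacement):

```python
import xml.etree.ElementTree as ET

xml_data = """<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>"""

root = ET.fromstring(xml_data)
# 'row' is a relative path from <badges>, like 'row/@Id' in the answer above.
for row in root.findall('row'):
    print(','.join(row.get(a) for a in ('Id', 'UserId', 'Name', 'Date')))
# prints:
# 1,1,Teacher,2009-09-30T15:17:50.66
# 2,3,Teacher,2009-09-30T15:17:50.69
```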

error while storing in Hbase using Pig

Input data in HDFS:
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but it gives the error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my Load is working perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP2.3
Apache Pig version 0.15.0
HBase 1.1.1
Also all jars are in place as I have installed them through Ambari.
Solved the data upload: I was missing a RANK on the relation, so the HBase rowkey becomes the rank.
DATA_FIL_1 = RANK DATA_FIL;
NOTE: this will generate an arbitrary rowkey.
But if you want to define your own rowkey, the STORE function alone won't work; you have to build another relation first. HBaseStorage takes the first field of each tuple as the rowkey (which you have defined):
STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');

Parse XML data with multiple attributes in Hadoop using Pig or Hive

Parse the below mentioned type of XML in Hadoop using Pig or Hive:
<PowerEvent sequence="00829" elapsedrealtime="0000047391" uptime="0000047391" timestamp="2016-01-17 00:31:36.750+0100" health="Good" level="69" plugged="NotPlugged" present="Present" status="NotCharging" temperature="23.0" voltage="3731" chargercurrent="25" batterycurrent="2209" coulombcounter="4294967292" screen="Off"/>
<ConnectivityEvent sequence="00830" elapsedrealtime="0000047471" uptime="0000047471" timestamp="2016-01-17 00:31:36.831+0100" connected="true" available="true" activenetwork="WIFI" mobiledata="Off" cellular="Unknown" operatorid="22210" operatorname="vodafone IT"/>
I tried the below script:
register '/home/rajpsu03/pig/piggybank.jar'
xmldata = LOAD '/user/rajpsu03/pig/test.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Events') as(doc:chararray);
data = foreach xmldata
GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'[\\s*\\S*]*<PowerEvent[\\s*\\S*]*sequence="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*elapsedrealtime="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*uptime="(.*?)"[\\s*\\S*]*/>')) AS (sequence:chararray,elapsedrealtime:chararray,uptime:chararray);
dump data;
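One likely problem with that pattern: bracket expressions like [\s*\S*]* match any character, including '>', so a group can grab text from the wrong element. A minimal Python sketch of a tighter pattern that keeps each match inside one element (illustrative only; the element text below is abridged from the question's sample):

```python
import re

# One PowerEvent element (attributes abridged from the question's sample).
text = ('<PowerEvent sequence="00829" elapsedrealtime="0000047391" '
        'uptime="0000047391" health="Good" level="69"/>')

# [^>]*? never crosses the end of the element, unlike [\s*\S*]*.
pattern = (r'<PowerEvent\s[^>]*?sequence="(.*?)"'
           r'[^>]*?elapsedrealtime="(.*?)"'
           r'[^>]*?uptime="(.*?)"')
m = re.search(pattern, text)
print(m.groups())  # ('00829', '0000047391', '0000047391')
```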
