How to Store into HBase using Pig and HBaseStorage - hadoop

In the HBase shell, I created my table via:
create 'pig_table','cf'
In Pig, here are the results of the alias I wish to store into pig_table:
DUMP B;
Produces tuples with 6 fields:
(D1|30|2014-01-01 13:00,D1,30,7.0,2014-01-01 13:00,DEF)
(D1|30|2014-01-01 22:00,D1,30,1.0,2014-01-01 22:00,JKL)
(D10|20|2014-01-01 11:00,D10,20,4.0,2014-01-01 11:00,PQR)
...
The first field is a concatenation of the 2nd, 3rd, and 5th fields, and will be used as the HBase row key.
But
STORE B INTO 'hbase://pig_table'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code')
results in:
Failed to produce result in "hbase:pig_table"
The logs are giving me:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataByteArray
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.objToBytes(HBaseStorage.java:924)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:875)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:468)
... 11 more
What is wrong with my syntax?

It appears that HBaseStorage does not automatically convert the tuple's data fields into chararray, which is necessary before they can be stored in HBase. I simply cast them explicitly:
C = FOREACH B GENERATE
    (chararray)$0,
    (chararray)$1,
    (chararray)$2,
    (chararray)$3,
    (chararray)$4,
    (chararray)$5;
STORE C INTO 'hbase://pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code');
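For reference, HBaseStorage also accepts an option string as a second constructor argument. A hedged alternative sketch (untested) is to pass the -caster option, which selects the converter used when writing values (the default is Utf8StorageConverter); whether this avoids the cast error depends on how the fields are typed:
STORE B INTO 'hbase://pig_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code',
        '-caster HBaseBinaryConverter');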

Related

PySpark JdbcUtils relation between Catalyst and JDBC Types

While writing a DataFrame as a MySQL table in PySpark I am facing a java.sql.BatchUpdateException: Data truncation: Data too long for column error, which means that the data exceeds the maximum size allowed by the MySQL TEXT type.
As seen in the JdbcUtils.getCommonJDBCType method, TEXT is the default JDBC type for Catalyst's StringType:
def getCommonJDBCType(dt: DataType): Option[JdbcType] = {
  dt match {
    ...
    case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))
    ...
  }
}
I was wondering, is there any way of manually defining a mapping between a Catalyst type (StringType) and a JDBC type (LONGTEXT) when using the write.jdbc method of a DataFrame?

Moving data to HBASE using Pig

I tried moving 851 records into HBase. To do that I created the table using the command below:
create 'customers', 'customers_data'
I moved the files using a Pig script. My Pig script is:
STOCK_A = LOAD '/user/cloudera/xxx' USING PigStorage('|');
data = FILTER STOCK_A BY ( $0 matches '.*MH.*');
MH_DATA = FOREACH data GENERATE $1, $3, $4;
STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
I got 851 records using my Pig command. My data is:
(aman,george,22)
(aman,george,22)
(aman,george,22)
... (851 rows in total)
But when I try to put this data into HBase using the command below:
PIG_CLASSPATH=/usr/lib/hbase/hbase.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar /usr/bin/pig /home/cloudera/remot/pighl7
the data that gets stored in HBase is:
ROW COLUMN+CELL
\xB5~\x5C& column=customers_data:firstname, timestamp=1478700582076, value=george
\xB5~\x5C& column=customers_data:lastname, timestamp=1478700582076, value=22
I can't find my 851 records, and the third column is missing as well. I don't know what I am doing wrong.
Please help.
I think you have missed giving aliases in the GENERATE statement (to be on the safe side I have also cast your fields to chararray); also, give a name to your store relation at the end.
TRY:
MH_DATA = FOREACH data GENERATE (chararray)$1 AS firstname , (chararray)$3 AS lastname, (chararray)$4 AS age;
STORE_IN_HBASE = STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
for more information follow this link:
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
After doing a lot of research and trial and error, I solved my problem by changing the row key from the name to a timestamp. Since I was using a row key with the same value as other rows, each insert kept updating the same row.
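For reference, a hedged sketch (untested, field positions taken from the script above) of one way to give every record a unique row key so that later puts do not overwrite earlier ones, using RANK (available in Pig 0.11+) to prepend a sequence number:
ranked = RANK data;
MH_DATA = FOREACH ranked GENERATE
    (chararray)$0 AS rowkey,     -- the rank value, unique per record
    (chararray)$2 AS firstname,  -- original fields shift right by one after RANK
    (chararray)$4 AS lastname,
    (chararray)$5 AS age;
STORE MH_DATA INTO 'hbase://customers'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'customers_data:firstname, customers_data:lastname, customers_data:age');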

how to join header row to detail rows in multiple files with apache pig

I have several CSV files in a HDFS folder which I load to a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command.
When I dump it, the structure of the source relation is as follows (note that the data is text qualified, but I will deal with that using the REPLACE function):
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure and create a new relation like the following?:
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Where each header tuple is followed by a bag of record tuples belonging to that header?
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop and this is one of the first concept projects that I am engaging in.
Hope my question is clear and look forward to some guidance here.
This should get you started. With the '-tagFile' option, PigStorage prepends the source file name to every tuple, which gives a header row and its detail rows a common key to group and join on (assuming the surrounding quotes have been stripped so that the comparison against 'HEADER' matches).
Code:
Source = LOAD '$data' USING PigStorage(',', '-tagFile');
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
D = JOIN B BY group, C BY group;
...
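From there, a hedged continuation sketch (untested, assuming the default disambiguated field names produced by the JOIN) that pairs each file's single header tuple with the bag of its detail rows:
E = FOREACH D GENERATE FLATTEN(C::FileHeaders), B::FileData;
DUMP E;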

Filter by UUID in Pig

I have a list of known UUIDs. I want to write a FILTER in Pig that filters out records whose id column does not contain a UUID from my list.
I have yet to find a way to specify bytearray literals such that I can write that filter statement.
How do I filter by UUID?
(In one attempt I tried using https://github.com/cevaris/pig-dse per "How to FILTER Cassandra TimeUUID/UUID in Pig", thinking I could filter by a chararray literal of the UUID, but I got:
grunt> post_creators = LOAD 'cql://mykeyspace/mycf/' USING AbstractCassandraStorage;
2014-10-09 14:56:05,597 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: could not instantiate 'AbstractCassandraStorage' with arguments 'null'
)
Use this Python UDF:
import array
import uuid

@outputSchema("uuid:bytearray")
def to_bytes(uuid_str):
    return array.array('b', uuid.UUID(uuid_str).bytes)
Filter like this:
users = FILTER users by user_id == my_udf.to_bytes('dd2e03a7-7d3d-45b9-b902-2b39c5c541b5');
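To call the UDF from Pig, the script has to be registered first; a minimal sketch, assuming the UDF above is saved as uuid_udf.py (hypothetical file name):
REGISTER 'uuid_udf.py' USING jython AS my_udf;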

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (with the schema updated):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables'
    USING org.apache.pig.piggybank.storage.HiveColumnarLoader(
        'id int,name string,age int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is, though the value filtered for in F exists in the Hive table, the result always has 0 records written to the output, even though all the records are loaded into A.
Basically the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition key, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables'
    USING org.apache.pig.piggybank.storage.HiveColumnarLoader(
        'id int,name string,age int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation seems to say that to process the actual values you need the intermediate relation B.
