Moving data to HBASE using Pig - hadoop

I tried to move 851 records into HBase. For that, I created an HBase table using the command below:
create 'customers', 'customers_data'
I moved the files using a Pig script. My Pig script is:
STOCK_A = LOAD '/user/cloudera/xxx' USING PigStorage('|');
data = FILTER STOCK_A BY ( $0 matches '.*MH.*');
MH_DATA = FOREACH data GENERATE $1, $3, $4;
STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
I got 851 records using my Pig command. My data is:
(aman,george,22)
(aman,george,22)
(aman,george,22)
...
(851 records in total)
But when I try to put this data into HBase using the command below:
PIG_CLASSPATH=/usr/lib/hbase/hbase.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar /usr/bin/pig /home/cloudera/remot/pighl7
the data that gets stored in HBase is:
ROW COLUMN+CELL
\xB5~\x5C& column=customers_data:firstname, timestamp=1478700582076, value=george
\xB5~\x5C& column=customers_data:lastname, timestamp=1478700582076, value=22
I can't find my 851 records, nor the third field. I don't know what I am doing wrong.
Please help.

I think you have missed giving aliases in the GENERATE statement (to be on the safe side, I have cast your fields to chararray).
Note that STORE is a statement rather than an operator, so its result cannot be assigned to an alias.
Try:
MH_DATA = FOREACH data GENERATE (chararray)$1 AS firstname, (chararray)$3 AS lastname, (chararray)$4 AS age;
STORE MH_DATA INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
for more information follow this link:
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

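After a lot of research and trial and error, I solved my problem by changing the row key from the name to a timestamp. HBaseStorage uses the first field of each tuple as the row key, and because my records all had the same name in that field, every record overwrote the previous one in the same row. That is also why only two columns showed up: the first field was consumed as the row key and the remaining two were mapped to the first two columns.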
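The fix described above can be sketched roughly as follows. This is only an illustration under assumptions: `$2` is a hypothetical position for a column whose value is unique per record (the question does not say which column holds the timestamp), and the path, table, and column names are taken from the question.

```pig
STOCK_A = LOAD '/user/cloudera/xxx' USING PigStorage('|');
data = FILTER STOCK_A BY ($0 matches '.*MH.*');

-- HBaseStorage treats the first field as the row key, so it must be unique per record.
-- $2 is a hypothetical column position; substitute the column that is unique in your data.
MH_DATA = FOREACH data GENERATE
    (chararray)$2 AS rowkey,
    (chararray)$1 AS firstname,
    (chararray)$3 AS lastname,
    (chararray)$4 AS age;

-- The remaining three fields map, in order, to the three listed columns.
STORE MH_DATA INTO 'hbase://customers'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'customers_data:firstname, customers_data:lastname, customers_data:age');
```

With a unique first field, each record lands in its own HBase row instead of updating the same one.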

Related

how to join header row to detail rows in multiple files with apache pig

I have several CSV files in an HDFS folder, which I load into a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows (note that the data is text-qualified, but I will deal with that using the REPLACE function):
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header that provides some information about the data set that follows it, such as the provider of the data and the date range it covers.
How can I transform the above structure and create a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Here each header tuple is followed by a bag of the record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop, and this is one of the first concept projects that I am engaging in.
I hope my question is clear, and I look forward to some guidance here.
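This should get you started. The -tagFile option makes PigStorage prepend the source file name to each tuple, which gives the header and detail rows of a file a common key.
Code:
Source = LOAD '$data' USING PigStorage(',','-tagFile');
-- $0 is now the file name, so the original first column is $1
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
D = JOIN B BY group, C BY group;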
...
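As an alternative to grouping both relations and then joining them, COGROUP can collect each file's header and detail rows in one step. A rough sketch, assuming a Pig version whose PigStorage supports -tagFile and that the quotes are still present in the data (hence the '"HEADER"' literal); the alias Grouped is made up for illustration:

```pig
-- -tagFile prepends the source file name as field $0, so the original first column becomes $1.
Source = LOAD '$data' USING PigStorage(',', '-tagFile');

-- Assumption: the text qualifiers have not been stripped yet, so the marker is "HEADER" with quotes.
SPLIT Source INTO FileHeaders IF $1 == '"HEADER"', FileData OTHERWISE;

-- One output tuple per file: (file name, {header tuple}, {detail tuples}).
Grouped = COGROUP FileHeaders BY $0, FileData BY $0;
DUMP Grouped;
```

This only works as intended if each file contains exactly one header row, since the file name is the only key linking a header to its detail rows.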

Store pig result in a text file

Hi stackoverflow community;
I'm totally new to Pig. I want to STORE the result in a text file and name it as I want. Is it possible to do this using the STORE function?
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput';
Thanks.
Yes, you can store your result under the name myoutput.txt, and you can write it with any delimiter you want using PigStorage. Note that STORE creates a directory with that name containing part files, rather than a single file.
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput.txt' using PigStorage(';');
Yes, it is possible. For every row, b will contain the 15 selected fields, taken from positions $0 to $25 of the input.
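Because STORE writes a directory of part files rather than one file, a single named file has to be produced in a second step. One possible sketch, assuming the job runs against HDFS and that merging the parts into a local file called myoutput.txt is acceptable; Pig's fs command passes its arguments to the Hadoop file system shell:

```pig
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;

-- Creates the directory 'myoutput' containing part-* files, with fields separated by ';'.
STORE b INTO 'myoutput' USING PigStorage(';');

-- Concatenate the part files into a single local file named myoutput.txt.
fs -getmerge myoutput myoutput.txt
```

The same merge can also be run outside Pig with `hadoop fs -getmerge` once the job has finished.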

Pig reading data as databytearray

Hey guys, I have one more question. I am just not able to understand the behavior of Pig.
I am loading data into Pig and, after some transformation, storing it on HDFS (/user/sga/transformeddata) using PigStorage().
But when I load the data from /user/sga/transformeddata and do
temp = load '/user/sga/transformeddata' using PigStorage();
gen = foreach temp generate page_type;
dump gen;
I get the following error:
databytearray can not be cast to java.lang.String
But if I do
gen = foreach temp generate *;
dump gen;
it works fine.
Any help in understanding this is much appreciated.
As requested, here is the code:
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
union_of_all_records is an alias in Pig.
Now another script consumes this data:
lookup_data =
LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);
load_denorm_clickstream_record =
LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
and joins these two aliases:
denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;
step x : final_output =
FOREACH join_with_lookup
GENERATE denorm_clickstream_record::page_type as page_type;
At step x I get the above error.
I think you have two options:
1) Tell Pig the schema of the data. For example:
temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);
2) When you first store the data, tell PigStorage to store the schema information as well, using PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema from the stored schema information.

How to Store into HBase using Pig and HBaseStorage

In the HBase shell, I created my table via:
create 'pig_table','cf'
In Pig, here are the results of the alias I wish to store into pig_table:
DUMP B;
Produces tuples with 6 fields:
(D1|30|2014-01-01 13:00,D1,30,7.0,2014-01-01 13:00,DEF)
(D1|30|2014-01-01 22:00,D1,30,1.0,2014-01-01 22:00,JKL)
(D10|20|2014-01-01 11:00,D10,20,4.0,2014-01-01 11:00,PQR)
...
The first field is a concatenation of the second, third, and fifth fields, and will be used as the HBase rowkey.
But
STORE B INTO 'hbase://pig_table'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code')
results in:
Failed to produce result in "hbase:pig_table"
The logs are giving me:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataByteArray
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.objToBytes(HBaseStorage.java:924)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:875)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:468)
... 11 more
What is wrong with my syntax?
It appears that HBaseStorage does not automatically convert the fields of the tuples to chararray, which is necessary before they can be stored in HBase. I simply cast them as such and stored the resulting relation:
C = FOREACH B {
GENERATE
(chararray)$0
,(chararray)$1
,(chararray)$2
,(chararray)$3
,(chararray)$4
,(chararray)$5
;
}
STORE C INTO 'hbase://pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ( 'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code');

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (with the schema updated):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is that, although the value filtered for in F is present in the Hive table, the job always writes 0 records to the output. It is, however, able to load all the records into A.
Basically the filter is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation seems to say that, to process the actual values, you need the intermediate relation B.
