Hive Columnar Loader in HDP2.0 - hadoop

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (I updated the schema):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is, though the value used in F is present in the Hive table, the result always writes 0 records to the output. It is, however, able to load all the records into A.
Basically the FILTER is not working. My Hive table is not partitioned. I believe the problem lies in HiveColumnarLoader but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!

Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that, for processing the actual values, you need the intermediate relation B.

Related

Moving data to HBASE using Pig

I tried moving 851 records into HBase. For that I created the HBase table using the command below:
create 'customers', 'customers_data'
I moved the files using a Pig script. My Pig script is:
STOCK_A = LOAD '/user/cloudera/xxx' USING PigStorage('|');
data = FILTER STOCK_A BY ( $0 matches '.*MH.*');
MH_DATA = FOREACH data GENERATE $1, $3, $4;
STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
I got 851 records using my Pig command. My data is:
(aman,george,22)
(aman,george,22)
(aman,george,22)
...
(851 rows in total)
But when I try to put this data into HBase using the command below:
PIG_CLASSPATH=/usr/lib/hbase/hbase.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar /usr/bin/pig /home/cloudera/remot/pighl7
the data that gets stored in HBase is:
ROW COLUMN+CELL
\xB5~\x5C& column=customers_data:firstname, timestamp=1478700582076, value=george
\xB5~\x5C& column=customers_data:lastname, timestamp=1478700582076, value=22
I can't find my 851 records, and the third column is missing. I don't know what I am doing wrong.
Please help.
I think you have missed giving aliases in the GENERATE statement (to be on the safe side I have cast your fields to chararray).
TRY:
MH_DATA = FOREACH data GENERATE (chararray)$1 AS firstname, (chararray)$3 AS lastname, (chararray)$4 AS age;
STORE MH_DATA INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
For more information follow this link:
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
After a lot of research and trial and error, I solved my problem by changing the row key from the name to a timestamp. Since I was using a row key with the same value as the others, it kept updating the same row.
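For completeness: HBaseStorage treats the first field of the relation as the HBase row key, which is why rows sharing the same name kept overwriting one another. Below is a minimal sketch (not from the original posts) of one way to make the key unique with RANK, available in Pig 0.11+; RANK prepends the rank as $0, so the original positions shift right by one:
RANKED = RANK data; -- prepends a unique rank field as $0
MH_DATA = FOREACH RANKED GENERATE $0 AS rowkey, (chararray)$2 AS firstname, (chararray)$4 AS lastname, (chararray)$5 AS age;
STORE MH_DATA INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');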

how to join header row to detail rows in multiple files with apache pig

I have several CSV files in an HDFS folder which I load into a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows (note that the data is text-qualified, but I will deal with that using the REPLACE function):
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure and create a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Each header tuple should be followed by a bag of the record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop and this is one of the first concept projects I am engaging in.
I hope my question is clear and I look forward to some guidance here.
This should get you started.
Code:
Source = LOAD '$data' USING PigStorage(',', '-tagFile'); -- '-tagFile' prepends the source file name as $0
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE; -- SPLIT cannot be assigned to an alias; assumes the quotes have been stripped first
B = GROUP FileData BY $0; -- group detail rows by file name
C = GROUP FileHeaders BY $0;
D = JOIN B BY group, C BY group; -- the generated group field is lowercase
...
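One possible sketch of the remaining step, assuming one header row per file (so each FileHeaders bag holds a single tuple):
E = FOREACH D GENERATE FLATTEN(FileHeaders), FileData; -- one row per file: the header fields followed by the bag of its detail rows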

Pig reading data as databytearray

Hey guys, I have one more question; I am just not able to understand this behaviour of Pig.
I am loading data into Pig and, after some transformation, storing it using PigStorage() on HDFS (/user/sga/transformeddata).
But when I load the data from /user/sga/transformeddata and do:
temp = load '/user/sga/transformeddata' using PigStorage();
gen = foreach temp generate page_type;
dump gen;
I get the following error:
databytearray can not be cast to java.lang.String
But if I do:
gen = foreach temp generate *;
dump gen;
it works fine.
Any help in understanding this is appreciated.
As requested, here is the code:
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
union_of_all_records is an alias in Pig.
Now another script consumes this data:
lookup_data =
LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);
load_denorm_clickstream_record =
LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
and join on these two aliases:
denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;
step x : final_output =
FOREACH join_with_lookup
GENERATE denorm_clickstream_record::page_type as page_type;
At step x I get the above error.
I think you have two options:
1) Tell Pig the schema that the data has. For example:
temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);
2) When you first store the data, tell PigStorage to store the schema information as well: PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema from the stored schema file.
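A third variant, if re-storing the data is not an option, is to cast the bytearray explicitly at read time. A sketch, assuming page_type is the first field of the stored records (adjust the position to the actual layout):
temp = LOAD '/user/sga/transformeddata' USING PigStorage();
gen = FOREACH temp GENERATE (chararray)$0 AS page_type; -- explicit cast from bytearray to chararray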

Extracting an Array of Structs in Hive

I have an external table in Hive:
CREATE EXTERNAL TABLE FOO (
TS string,
customerId string,
products array< struct <productCategory:string, productId:string> >
)
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'some.serde'
WITH SERDEPROPERTIES ('error.ignore'='true')
LOCATION 'some_locations'
;
A record of the table may hold data such as:
1340321132000, 'some_company', [{"productCategory":"footwear","productId":"nik3756"},{"productCategory":"eyewear","productId":"oak2449"}]
Does anyone know if there is a way to simply extract all the productCategory values from this record and return them as an array of product categories, without using explode? Something like the following:
["footwear", "eyewear"]
Or do I need to write my own GenericUDF? If so, I do not know much Java (I'm a Ruby person); can someone give me some hints? I have read some instructions on UDFs from Apache Hive, but I do not know which collection type is best for handling arrays, and which for handling structs.
===
I have somewhat answered this question by writing a GenericUDF, but I ran into two other problems. It is in this SO Question.
You can use a JSON SerDe or the built-in functions get_json_object and json_tuple.
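For the built-in functions, a quick sketch, assuming the raw JSON sits in a single string column of a hypothetical table raw_json(line string):
SELECT get_json_object(line, '$.DocId') AS docid,
       get_json_object(line, '$.Orders[0].ItemId') AS first_item_id
FROM raw_json;
-- json_tuple extracts several top-level keys in one pass:
SELECT t.docid, t.orders
FROM raw_json LATERAL VIEW json_tuple(line, 'DocId', 'Orders') t AS docid, orders;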
With rcongiu's Hive-JSON SerDe the usage is as follows.
Define the table:
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>)
Load sample JSON into it (it is important that each record is on a single line):
{"DocId":"ABC","Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}
Then fetching order ids is as easy as:
SELECT Orders.ItemId FROM complex_json LIMIT 100;
It will return the list of ids for you:
itemid
[1111,2222]
Proven to return correct results in my environment. Full listing:
add jar hdfs:///tmp/json-serde-1.3.6.jar;
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA INPATH '/tmp/test.json' OVERWRITE INTO TABLE complex_json;
SELECT Orders.ItemId FROM complex_json LIMIT 100;
Read more here:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
One way would be to use either the inline or explode functions, like so:
SELECT
TS,
customerId,
pCat,
pId
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId
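With the sample record shown earlier, inline(products) emits one output row per struct in the array, e.g. (..., footwear, nik3756) and (..., eyewear, oak2449).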
Otherwise you can write a UDF. Check out this post and this post for that, along with the following resources:
Matthew Rathbone's guide to writing generic UDFs
Mark Grover's how to guide
the baynote blog post on generic UDFs
If the size of the array is fixed (like 2), please try:
products[0].productCategory, products[1].productCategory
But if not, a UDF should be the right solution. I guess you could do it in JRuby. GL!
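Since the question also asks for GenericUDF hints, here is a hypothetical Java sketch of such a UDF; the class name and error handling are mine, not from any post above, and it assumes productCategory is a string field. The key idea is that initialize receives ObjectInspectors describing the array<struct<...>> argument, and evaluate walks the list pulling one struct member out:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ExtractCategories extends GenericUDF {
  private ListObjectInspector listOI;
  private StructObjectInspector structOI;
  private StructField categoryField;
  private StringObjectInspector categoryOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    if (args.length != 1) {
      throw new UDFArgumentException("extract_categories takes exactly one argument");
    }
    // Expect array<struct<productCategory:string, ...>>; Hive lower-cases field names.
    listOI = (ListObjectInspector) args[0];
    structOI = (StructObjectInspector) listOI.getListElementObjectInspector();
    categoryField = structOI.getStructFieldRef("productcategory");
    categoryOI = (StringObjectInspector) categoryField.getFieldObjectInspector();
    // Declare the return type: array<string>.
    return ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object array = args[0].get();
    if (array == null) {
      return null;
    }
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < listOI.getListLength(array); i++) {
      Object struct = listOI.getListElement(array, i);
      result.add(categoryOI.getPrimitiveJavaObject(
          structOI.getStructFieldData(struct, categoryField)));
    }
    return result;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "extract_categories(" + children[0] + ")";
  }
}
Once compiled and registered (ADD JAR plus CREATE TEMPORARY FUNCTION), it could be called as extract_categories(products) to return ["footwear", "eyewear"].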

Linq Contains issue: cannot formulate the equivalent of 'WHERE IN' query

In the table ReservationWorkerPeriods there are records of all workers that are planned to work in a given period on any possible machine.
The additional table WorkerOnMachineOnConstructionSite contains the columns WorkerId, MachineId and ConstructionSiteId.
From the table ReservationWorkerPeriods I would like to retrieve just the workers who work on the selected machine.
In order to retrieve just the relevant records from the WorkerOnMachineOnConstructionSite table I have written the following code:
var relevantWorkerOnMachineOnConstructionSite = (from cswm in currentConstructionSiteSchedule.ContrustionSiteWorkerOnMachine
where cswm.MachineId == machineId
select cswm).ToList();
workerOnMachineOnConstructionSite = relevantWorkerOnMachineOnConstructionSite as List<ContrustionSiteWorkerOnMachine>;
These records are also used in the application, so I don't want to bypass the above code even if it is possible to directly retrieve just the worker periods for workers who work on the selected machine. Anyway, I haven't figured out how to retrieve the relevant worker periods once we know which user IDs are relevant.
I have tried the following code:
var userIDs = from w in workerOnMachineOnConstructionSite select new {w.WorkerId};
List<ReservationWorkerPeriods> workerPeriods = currentConstructionSiteSchedule.ReservationWorkerPeriods.ToList();
allocatedWorkers = workerPeriods.Where(wp => userIDs.Contains(wp.WorkerId));
but it seems to be incorrect and I don't know how to fix it. Does anyone know what the problem is and how to retrieve just the records whose WorkerId is in the list?
Currently, you are constructing an anonymous object on the fly, with one property. You'll want to grab the id directly (note the removed curly braces):
var userIDs = from w in workerOnMachineOnConstructionSite select w.WorkerId;
Also, in such cases, don't call ToList on it: the variable userIDs just contains the query, not the result. If you use that variable in a further query, the provider can translate everything into a single SQL query.
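Putting that together, a sketch of how the corrected code might look; the names are taken from the question and untested against the real model:
// Select plain IDs, not anonymous objects, so Contains can match WorkerId.
var userIDs = workerOnMachineOnConstructionSite.Select(w => w.WorkerId);
// The provider can translate Contains into a single SQL WHERE ... IN (...).
var allocatedWorkers = currentConstructionSiteSchedule.ReservationWorkerPeriods
    .Where(wp => userIDs.Contains(wp.WorkerId))
    .ToList();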
