Extracting an Array of Structs in Hive - hadoop

I have an external table in Hive:
CREATE EXTERNAL TABLE FOO (
TS string,
customerId string,
products array< struct <productCategory:string, productId:string> >
)
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'some.serde'
WITH SERDEPROPERTIES ('error.ignore'='true')
LOCATION 'some_locations'
;
A record of the table may hold data such as:
1340321132000, 'some_company', [{"productCategory":"footwear","productId":"nik3756"},{"productCategory":"eyewear","productId":"oak2449"}]
Does anyone know if there is a way to simply extract all the productCategory values from this record and return them as an array of product categories, without using explode? Something like the following:
["footwear", "eyewear"]
Or do I need to write my own GenericUDF? If so, I do not know much Java (I am a Ruby person), so can someone give me some hints? I read the UDF instructions from Apache Hive, but I do not know which collection type is best for handling arrays, and which for handling structs.
===
I have partially answered this question by writing a GenericUDF, but I ran into two other problems. It is in this SO question.

You can use a JSON SerDe or the built-in functions get_json_object and json_tuple.
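For reference, a minimal sketch of those built-in functions, assuming the raw JSON is stored in a string column named json of a hypothetical table complex_json_raw (both names are illustrative):
-- get_json_object navigates the JSON string with a JsonPath-like expression
SELECT get_json_object(json, '$.DocId')            AS doc_id,
       get_json_object(json, '$.Orders[0].ItemId') AS first_item_id
FROM complex_json_raw;
-- json_tuple extracts several top-level keys in a single pass
SELECT jt.DocId, jt.Orders
FROM complex_json_raw
LATERAL VIEW json_tuple(json, 'DocId', 'Orders') jt AS DocId, Orders;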
With rcongiu's Hive-JSON SerDe the usage would be as follows.
Define the table:
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>)
Load sample JSON into it (it is important that each record is on a single line):
{"DocId":"ABC","Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}
Then fetching the order ids is as easy as:
SELECT Orders.ItemId FROM complex_json LIMIT 100;
It will return the list of ids for you:
itemid
[1111,2222]
This is proven to return correct results in my environment. Full listing:
add jar hdfs:///tmp/json-serde-1.3.6.jar;
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA INPATH '/tmp/test.json' OVERWRITE INTO TABLE complex_json;
SELECT Orders.ItemId FROM complex_json LIMIT 100;
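Applied to the table in the original question, the same field projection should return the array the asker is after (a sketch, assuming the SerDe populates products as the declared array of structs):
SELECT products.productCategory FROM FOO;
-- expected: ["footwear","eyewear"]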
Read more here:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html

One way would be to use either the inline or explode functions, like so:
SELECT
TS,
customerId,
pCat,
pId
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId
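If you still want the result back as a single array per record, one option (a sketch, not part of the original answer) is to combine the lateral view with collect_list; collect_set would deduplicate instead:
-- flatten with inline, then re-aggregate the categories into one array per record
SELECT TS,
       customerId,
       collect_list(pCat) AS productCategories
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId
GROUP BY TS, customerId;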
Otherwise you can write a UDF. Check out this post and this post for that, along with the following resources:
Matthew Rathbone's guide to writing generic UDFs
Mark Grover's how to guide
the baynote blog post on generic UDFs

If the size of the array is fixed (like 2), please try:
products[0].productCategory,products[1].productCategory
But if not, a UDF should be the right solution. I guess you could do it in JRuby. Good luck!
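For the fixed-size case, you could also pack the two values back into a single array with the built-in array() function, e.g.:
SELECT TS,
       customerId,
       array(products[0].productCategory, products[1].productCategory) AS productCategories
FROM FOO;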

Related

How to get table name for a simple Sequel Dataset object?

I.e., given a dataset object ds = DB[:transactions].where{updated_at > 1.day.ago} (no funny joins or anything going on), how could I fetch the table name (:transactions)?
If you want the first table in the dataset, you can use ds.first_source.
If you want it as a string you can do:
ds.first_source_table.to_s
If you want a symbol, just omit .to_s
Based on the example provided, I would do something like this:
ds.klass.name
That will return a string with the name of your table.

Stored procedure/functions in SparkSql

Are there any ways to achieve SQL features like stored procedures or functions in Spark SQL?
I'm aware of HPL/SQL and coprocessors in HBase, but I want to know whether anything similar is available in Spark.
You may consider using a user-defined function (UDF) along with the built-in functions.
A quick example
// requires a SparkSession and: import spark.implicits._ (for toDF and the 'text column syntax)
val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
val upper: String => String = _.toUpperCase
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)
// Apply the UDF to add an upper-cased column to the source dataset
dataset.withColumn("upper", upperUDF('text)).show()
Result
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
We cannot create stored procedures/functions in Spark SQL. However, the best way is to create a temporary view, much like a CTE, and reuse it in subsequent queries. Or you can create a UDF in Spark.
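A rough sketch of the temporary-view approach in Spark SQL (the orders table and column names are illustrative):
-- register an intermediate result once, then reuse it like a CTE
CREATE OR REPLACE TEMPORARY VIEW active_orders AS
SELECT order_id, customer_id, amount
FROM orders
WHERE status = 'ACTIVE';

-- later queries reuse the view instead of repeating the logic
SELECT customer_id, SUM(amount) AS total_amount
FROM active_orders
GROUP BY customer_id;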

How to choose data type for creating table in hive

I have to ingest data into a Hive table from HDFS, but I don't know how to choose the correct data types for the data shown below:
$34740$#$Disrupt Worldwide LLC$#$40425$#$null$#$13$#$6$#$317903$#$null$#$Scott Bodily$#$+$#$null$#$10$#$0$#$1$#$0$#$disruptcentral.com$#$null$#$null$#$1$#$null$#$null$#$null$#$Scott Bodily$#$1220DB56-56D7-E411-80D6-005056A451E3$#$true$
$34741$#$The Top Tipster Leagues Limited$#$35605$#$null$#$13$#$7$#$317902$#$null$#$AM Support Team$#$+447886 027371$#$null$#$1$#$1$#$1$#$0$#$www.toptipsterleagues.com, www.toptipsterleagues.co.uk, http://test.toptipsterleague.com$#$Jamil Johnson$#$Cheng Liem Li$#$1$#$0.70$#$1.50$#$1.30$#$Bono van Nijnatten$#$0B758BF9-F1D6-E411-80D7-005056A44C5C$#$true$
Refer to the Hive documentation for the available data types.
Other than the numeric and decimal fields, you can use the STRING data type. For the numeric fields, depending on the range and precision, you can use INT or DECIMAL.
Using STRING, VARCHAR, or any other string data type will read the null values in your data as the literal string "null". To handle nulls you should set the table properties as below:
ALTER TABLE tablename SET
SERDEPROPERTIES ('serialization.null.format' = 'null');
Let me know if anything more is needed on this.
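Since the sample rows use the multi-character delimiter $#$, a plain ROW FORMAT DELIMITED clause cannot split them; one option is the MultiDelimitSerDe from hive-contrib. A minimal sketch (table and column names are illustrative, only the first few fields are typed, the leading/trailing $ on each row may still need cleanup, and the SerDe class packaging varies between Hive versions):
CREATE EXTERNAL TABLE accounts_raw (
  account_id   INT,
  account_name STRING,
  ref_id       INT,
  note         STRING
  -- remaining fields omitted; keep STRING wherever the type is unclear
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim' = '$#$')
LOCATION '/path/to/hdfs/dir';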

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (with the schema updated):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is that, although the value filtered for in F is present in the Hive table, the result always writes 0 records to the output, even though it is able to load all the records into A.
Basically, the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that for processing the actual values you need intermediate relation B.

With Oracle XML Tables do XQuery selects use XmlIndexes?

I am trying to retrieve keys and parent keys from some structured XML stored as binary XML in Oracle. I have tried creating an unstructured index and also an index with a structured component. The structured component works fine when doing a SELECT against XMLTABLE(), but I cannot retrieve values of the parent node using XMLTable. I am therefore trying the following XQuery to retrieve the parent values, but it is not using the index at all. Does this style of query support the use of XMLIndexes? I can't find anything in the docs that says either way.
SELECT y.*
FROM xml_data x, XMLTABLE(xmlnamespaces( DEFAULT 'namespace'),
'for $i in /foo/bar
return element r {
$i/someKey
,element parentKey { $i/../someKey }
}'
PASSING x.import_xml
COLUMNS
someKey VARCHAR2(100) PATH 'someKey'
,parentKey VARCHAR2(100) PATH 'parentKey'
) y
Thanks, Tom
