How to create Glue table with Parquet format? - parquet

In the documentation I found how to create a Glue table in JSON format, but I cannot find how to create one in Parquet format.
I think I could provide a subtype of glue.DataFormat, but I don't know how to do that: https://docs.aws.amazon.com/cdk/api/latest/docs/aws-glue-readme.html

OK, I found the working values on Terraform's website:
https://www.terraform.io/docs/providers/aws/r/glue_catalog_table.html
const glue_DataFormat_Parquet = <glue.DataFormat> {
    inputFormat: new glue.InputFormat('org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'),
    outputFormat: new glue.OutputFormat('org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'),
    serializationLibrary: new glue.SerializationLibrary('org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe')
};

Related

Errors writing Geopandas Dataframe to Shapefile

I'm writing a geopandas polygon file to an Esri Shapefile. I can't write it directly because I have date fields that I don't want converted to text; I want to keep them as dates.
I've written a custom schema, but how do I handle the geometry column in the custom schema? It's a WKT field.
This is my custom schema (shortened for length):
schema = {
    'geometry':'MultiPolygon',
    'properties':{
        'oid':'int',
        'date_anncd':'date',
        'value_mm':'float',
        'geometry':??
    }
}
Change the geometry type to 'Polygon' (MultiPolygon isn't an Esri type) and drop the geometry value in the properties.
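Applied to the schema from the question, that advice might look like the sketch below. The field names come from the question; the `to_file` call in the trailing comment assumes a GeoDataFrame named `gdf` and a Fiona-based writer that accepts a `schema` argument, neither of which appears in the original post.

```python
# Corrected schema: the geometry type is 'Polygon', and there is no
# 'geometry' entry under 'properties' (the geometry column is described
# only by the top-level 'geometry' key).
schema = {
    'geometry': 'Polygon',
    'properties': {
        'oid': 'int',
        'date_anncd': 'date',
        'value_mm': 'float',
    },
}

# Hypothetical usage, assuming `gdf` is the GeoDataFrame being written:
# gdf.to_file('output.shp', driver='ESRI Shapefile', schema=schema)
```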

Maximo MAXINTMSGTRK table: How to extract text from MSGDATA column? (HUGEBLOB)

I'm attempting to extract the text from the MSGDATA column (HUGEBLOB) in the MAXINTMSGTRK table:
I've tried the options outlined in "How to query hugeblob data":
select
    msg.*,
    utl_raw.cast_to_varchar2(dbms_lob.substr(msgdata,1000,1)) msgdata_expanded,
    dbms_lob.substr(msgdata, 1000,1) msgdata_expanded_2
from
    maxintmsgtrk msg
where
    rownum = 1
However, the output is not text:
How can I extract text from MSGDATA column?
It is possible to do this with an Automation script that uncompresses the data using the psdi.iface.jms.MessageUtil class.
from psdi.iface.jms import MessageUtil
...
msgdata_blob = maxintmsgtrkMbo.getBytes("msgdata")
byteArray = MessageUtil.uncompressMessage(msgdata_blob, maxintmsgtrkMbo.getLong("msglength"))
msgdata_clob = ""
for symb1 in byteArray:
    msgdata_clob = msgdata_clob + chr(symb1)
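One caveat with building the string character by character: Java bytes are signed, so any byte above 127 may show up as a negative number, which `chr` cannot handle. A minimal sketch of a more defensive conversion follows, written in plain Python 3 so it runs outside Maximo. `signed_bytes_to_text` is a hypothetical helper name, and UTF-8 is an assumed encoding that the original answer does not state.

```python
def signed_bytes_to_text(byte_values, encoding="utf-8"):
    """Convert an iterable of Java-style signed bytes (-128..127) to text."""
    # Mask each value to 0..255 so negative (signed) bytes become valid octets.
    raw = bytes(b & 0xFF for b in byte_values)
    # Decode all bytes in one step instead of concatenating characters.
    return raw.decode(encoding)


# Example: "<é/>" encoded as UTF-8, with the two bytes of "é" given as signed values.
print(signed_bytes_to_text([60, -61, -87, 47, 62]))
```

Inside an Automation script the same idea would apply to the array returned by MessageUtil.uncompressMessage, adapted to the Jython version in use.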
It sounds like it's not possible because the value is compressed:
Starting in Maximo 7.6, the messages written by the Message Tracking
application are stored in the database. They are no longer written as
xml files as in previous versions.
Customers have asked how to search and view MSGDATA data from the
MAXINTMSGTRK table.
It is not possible to search or retrieve the data in the maxintmsgtrk
table in 7.6 using SQL. The BLOB field is stored compressed.
MIF 7.6 Message tracking changes

Stored procedure/functions in SparkSql

Is there any way to get SQL features such as stored procedures or functions in Spark SQL?
I'm aware of HPL/SQL and of coprocessors in HBase, but I want to know whether anything similar is available in Spark.
You may consider using a User Defined Function (UDF) together with the built-in functions.
A quick example
import org.apache.spark.sql.functions.udf
// In spark-shell, spark.implicits._ is already imported (needed for toDF and 'text)
val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
// Apply the UDF to add an upper-cased column (the source dataset is not modified)
dataset.withColumn("upper", upperUDF('text)).show
Result
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
You cannot create stored procedures or functions in Spark SQL itself. The best alternative is to create temp tables, much like CTEs, and reuse those tables in later queries. Or you can create a UDF in Spark.

How to choose data type for creating table in hive

I have to ingest data from HDFS into a Hive table, but I don't know how to choose the correct data types for the data shown below:
$34740$#$Disrupt Worldwide LLC$#$40425$#$null$#$13$#$6$#$317903$#$null$#$Scott Bodily$#$+$#$null$#$10$#$0$#$1$#$0$#$disruptcentral.com$#$null$#$null$#$1$#$null$#$null$#$null$#$Scott Bodily$#$1220DB56-56D7-E411-80D6-005056A451E3$#$true$
$34741$#$The Top Tipster Leagues Limited$#$35605$#$null$#$13$#$7$#$317902$#$null$#$AM Support Team$#$+447886 027371$#$null$#$1$#$1$#$1$#$0$#$www.toptipsterleagues.com, www.toptipsterleagues.co.uk, http://test.toptipsterleague.com$#$Jamil Johnson$#$Cheng Liem Li$#$1$#$0.70$#$1.50$#$1.30$#$Bono van Nijnatten$#$0B758BF9-F1D6-E411-80D7-005056A44C5C$#$true$
Refer to the linked page for the available data types.
For everything other than the numeric and decimal fields, you can use the STRING data type. For the numeric fields, use INT or DECIMAL depending on the range and precision.
STRING, VARCHAR, and the other string types will read a null in your data as the literal string "null". To have it treated as NULL, set the SerDe property as shown below:
ALTER TABLE tablename SET
SERDEPROPERTIES ('serialization.null.format' = 'null');
Let me know if you need anything else on this.
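To see which columns are numeric before writing the DDL, it can help to split a sample record and look at each field. The sketch below is a rough heuristic in Python, not part of Hive: `split_record` and `guess_hive_type` are hypothetical helpers, the sample is a shortened version of the second record in the question, and the BOOLEAN suggestion for true/false is an addition to the advice above. Values that overflow INT would need BIGINT, which this sketch does not check.

```python
import re


def split_record(line):
    """Split a '$#$'-delimited record, dropping the outer '$' wrappers."""
    line = line.strip()
    if line.startswith("$"):
        line = line[1:]
    if line.endswith("$"):
        line = line[:-1]
    return line.split("$#$")


def guess_hive_type(value):
    """Suggest a Hive type for a single field value (rough heuristic)."""
    if value == "null":
        return None  # no information; decide from other rows
    if value in ("true", "false"):
        return "BOOLEAN"
    if re.fullmatch(r"-?\d+", value):
        return "INT"
    if re.fullmatch(r"-?\d+\.\d+", value):
        return "DECIMAL"
    return "STRING"


# Shortened sample based on the second record in the question.
sample = "$34741$#$The Top Tipster Leagues Limited$#$null$#$0.70$#$+447886 027371$#$true$"
for field in split_record(sample):
    print(repr(field), "->", guess_hive_type(field))
```

Running this over several rows and taking the widest type seen per column gives a reasonable starting point for the column definitions.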

Extracting an Array of Structs in Hive

I have an external table in hive
CREATE EXTERNAL TABLE FOO (
TS string,
customerId string,
products array< struct <productCategory:string, productId:string> >
)
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'some.serde'
WITH SERDEPROPERTIES ('error.ignore'='true')
LOCATION 'some_locations'
;
A record of the table may hold data such as:
1340321132000, 'some_company', [{"productCategory":"footwear","productId":"nik3756"},{"productCategory":"eyewear","productId":"oak2449"}]
Does anyone know if there is a way to simply extract all the productCategory values from this record and return them as an array of productCategories, without using explode? Something like the following:
["footwear", "eyewear"]
Or do I need to write my own GenericUDF? If so, I don't know much Java (I'm a Ruby person), so could someone give me some hints? I read some instructions on UDFs from Apache Hive, but I don't know which collection type is best for handling arrays, or which one to use for structs.
===
I have somewhat answered this question by writing a GenericUDF, but I ran into 2 other problems. It is in this SO Question
You can use a JSON SerDe or the built-in functions get_json_object and json_tuple.
With rcongiu's Hive-JSON SerDe the usage will be as follows.
Define the table:
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>)
Load sample JSON into it (it is important that each record is on a single line):
{"DocId":"ABC","Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}
Then fetching the item IDs is as easy as:
SELECT Orders.ItemId FROM complex_json LIMIT 100;
It will return the list of ids for you:
itemid
[1111,2222]
Verified to return correct results in my environment. Full listing:
add jar hdfs:///tmp/json-serde-1.3.6.jar;
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA INPATH '/tmp/test.json' OVERWRITE INTO TABLE complex_json;
SELECT Orders.ItemId FROM complex_json LIMIT 100;
Read more here:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
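To make the behaviour of `Orders.ItemId` concrete: selecting a struct field through an array of structs yields an array of that field's values, one per element. The plain-Python model below mimics that projection on the sample document. It only illustrates the semantics and is not Hive code; `project_field` is a hypothetical helper.

```python
import json


def project_field(structs, field):
    """Return the given field from every struct in the array, in order."""
    return [s[field] for s in structs]


doc = json.loads(
    '{"DocId":"ABC","Orders":['
    '{"ItemId":1111,"OrderDate":"11/11/2012"},'
    '{"ItemId":2222,"OrderDate":"12/12/2012"}]}'
)
print(project_field(doc["Orders"], "ItemId"))  # [1111, 2222]
```

For the table in the question, the analogous Hive expression would presumably be products.productCategory.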
One way would be to use either the inline or explode functions, like so:
SELECT
    TS,
    customerId,
    pCat,
    pId
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId
Otherwise, you can write a UDF. Check out this post and this post for that, along with the following resources:
Matthew Rathbone's guide to writing generic UDFs
Mark Grover's how to guide
the baynote blog post on generic UDFs
If the size of the array is fixed (say, 2), try:
products[0].productCategory,products[1].productCategory
If not, a UDF should be the right solution. I guess you could write it in JRuby. Good luck!
