How to Create External Tables (similar to Hive) on Azure Delta Lake - azure-databricks

How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure). I couldn't find much documentation around creating unmanaged tables in Azure Delta Lake. Here is a sequence of operations that I am currently able to perform in a PySpark/Hive/HDFS setup, and I wonder how I can establish the same on Azure.
Actions, in sequence:
1. Create a dataframe DF.
2. Drop the Hive external table if it exists, then load dataframe DF into this external table using DF.write.insertInto("table").
3. Create a dataframe DF1.
4. Drop the Hive external table if it exists, then load dataframe DF1 into this external table using DF1.write.insertInto("table").
Even though I perform "drop table if exists" before loading the 2nd dataframe, if I query the "table" after step 4, I can see content from both dataframes, because I am only dropping the table structure and not the actual data (Hive external table). Here is how it looks:
>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
| s| d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+
I am trying to perform something similar using an Azure Delta Lake table, with the steps below:
1. Create the dataframes.
2. Save the dataframes to ADLS (probably this is what I am doing wrong here; should I use a mounted DBFS path instead of the container?).
3. Create an unmanaged table on top of this path.
Here is the code in my Databricks notebook.
df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://mycontainer#xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://mycontainer#xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table'")
However, it is not writing the dataframe to the location table_path, and the final step fails to create the table (probably a dbfs: mount path is required here?). How can I perform similar operations using unmanaged Delta Lake tables?

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'spark-warehouse/tables/my_existing_table'
Use the method above to create an external table in Delta Lake, in a Hive-like manner.
For a complete reference, check the web doc link below:
https://learn.microsoft.com/en-us/azure/databricks/spark/2.x/spark-sql/language-manual/create-table
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
[AS select_statement]
The [LOCATION path] clause specifies the external path; providing it is what makes the table external (unmanaged).
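Applied to the Delta Lake scenario in the question, a minimal PySpark sketch might look like the following (assuming the cluster can already authenticate to the storage account; the container and account names are placeholders). Note that the DataFrameWriter only writes once it is terminated with .save() or .saveAsTable(); .option("path", ...) on its own does not write anything, which may be why the snippet in the question produced no files at table_path.

df = spark.createDataFrame([('abcd', 'xyz')], ['s', 'd'])

# abfss URIs use "@" to separate the container from the storage account.
table_path = "abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"

# Write the Delta files to the external location; .save() triggers the actual write.
df.write.format("delta").mode("overwrite").save(table_path)

# Register an unmanaged (external) table over the files that now exist at that path.
spark.sql(f"CREATE TABLE IF NOT EXISTS test_table USING DELTA LOCATION '{table_path}'")

spark.sql("SELECT * FROM test_table").show()

Because the LOCATION points outside the managed warehouse, DROP TABLE test_table should remove only the metadata; the Delta files at table_path remain.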

Related

How to create external tables from parquet files in s3 using hive 1.2?

I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on performing a SELECT * FROM table_name I am getting null values for all columns except the partitioned column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
On removing the property 'serialization.format' = '1' I am getting ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
file_01.snappy.parquet:
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
External table stmt:
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC
The answer below assumes that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).
How was the data created?
Spark supports case-sensitive schemas.
When we use the dataframe APIs, it is possible to write with a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example case sensitivity is preserved.
When we create a Hive table on top of the data created from Spark, Hive is able to read it correctly since it is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark has introduced a config spark.sql.hive.caseSensitiveInferenceMode.
object HiveCaseSensitiveInferenceMode extends Enumeration {
val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
.doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
"table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
"formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
"any table backed by files containing case-sensitive field names or queries may not return " +
"accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
"case-sensitive schema from the underlying data files and write it back to the table " +
"properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
"properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
"instead of inferring).")
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT either INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We can check the following to see whether the problem is related to schema case sensitivity (a PySpark sketch of these checks follows the list):
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this).
2. Whether the data was created using Spark.
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet(<location>).printSchema).
4. If 3 shows a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark.
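A minimal PySpark sketch of checks 1, 3, and 4 (parquet_test and the S3 path are placeholders taken from the question):

# 1. Current value of the inference mode
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode").show(truncate=False)

# 3. Schema as written in the data files (column-name case is preserved here)
spark.read.parquet("s3://path_to_parquet/").printSchema()

# 4. Switch the mode, then drop/recreate the table with the original DDL and re-read it
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")
spark.sql("DROP TABLE IF EXISTS parquet_test")
# ... re-run the CREATE EXTERNAL TABLE statement from the question here, then:
spark.table("parquet_test").show()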

Does hive query always do full scan, even if it is in equals condition?

I am trying to create an external Hive table that maps to a DynamoDB table like the official documentation says.
CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<String, String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "myTable",
"dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");
But when running a query using the hash key, it seems to be doing a full table scan.
hive> select * from dynamodb where hashKey="test";
Any suggestions on that? Thanks

hive add columns on partitioned table does not work

I want to share my experience with adding columns to a partitioned Hive table.
As you can see, despite the CASCADE clause, the ALTER breaks my table :(
add columns on partitioned table
table description
CREATE TABLE test (
a string,
b string,
c string
)
PARTITIONED BY (
x string,
y string,
z string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
duplicate the table
CREATE TABLE test_tmp...
hadoop distcp hdfs://.../test/* dfs://.../test_tmp
MSCK REPAIR TABLE test_tmp;
SELECT * FROM test_tmp
LIMIT 100
check : OK (I get results)
modify the table
ALTER TABLE test_tmp
ADD COLUMNS(
aa timestamp,
bb string,
cc int,
dd string
) CASCADE;
SELECT * FROM test_tmp
LIMIT 100
...
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:19, Vertex vertex_1502459312997_187854_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 21.655/0.000 sec [0 successful, 1 errors]
check : KO (I get this error)
If you are using Hive 0.x or 1.x then you are probably a victim of...
HIVE-10598 Vectorization borks when column is added to table.
...which is specific to ORC format, even if it's not apparent from the JIRA label.
There is a partial fix as of Hive 2.0 (i.e. ADD is fixed, but DROP / RENAME / CHANGE are still crippled) thanks to
HIVE-11981 ORC Schema Evolution Issues (Vectorized, ACID, and Non-Vectorized)
And another related fix as of Hive 2.1.1 for CHANGE:
HIVE-14355 Schema evolution for ORC in llap is broken for Int to String conversion
To be continued...

insert data to external table from an external table

While inserting data into external table-2 from external table-1, the data of external table-2 gets stored in /user/hive/warehouse/db-name/table-name/, but as an external table it should not store data in the warehouse directory, right?
Should we specify a location for storing the external table's data?
Yes, you will have to mention the location while creating the external table.
You can simply do it in the following way.
Create the tables table1 and table2:
CREATE EXTERNAL TABLE table1(col1 INT, col2 BIGINT,col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '<hdfs_location1>';
CREATE EXTERNAL TABLE table2(col21 INT, col22 BIGINT,col23 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '<hdfs_location2>';
Now insert the data from table1 into table2:
INSERT OVERWRITE TABLE table2 SELECT * FROM table1;
It will copy the data from table1 to table2's HDFS location.
Please note that CTAS (CREATE TABLE AS SELECT) is not supported for external tables.
Any table you create in Hive, whether internal or external, has its data stored under '/user/hive/warehouse' (or whatever you specify in hive.metastore.warehouse.dir in hive-site.xml) when you do not provide a LOCATION.
External tables are created to prevent data loss when someone drops the table accidentally. Try creating 2 external tables and browsing the filesystem; you can easily see the difference (a small sketch follows below).
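As a rough way to confirm this, a PySpark sketch (table2 and <hdfs_location2> are the placeholders used in this answer): DESCRIBE FORMATTED shows the table type and location, and dropping the external table leaves the files behind.

# Illustrative only; "table2" and <hdfs_location2> are placeholders from this answer.
# "Table Type" should read EXTERNAL_TABLE and "Location" should be <hdfs_location2>.
spark.sql("DESCRIBE FORMATTED table2").show(50, truncate=False)

# Dropping an external table removes only the metastore entry...
spark.sql("DROP TABLE IF EXISTS table2")

# ...the delimited text files are still present at the external location.
print(spark.read.text("<hdfs_location2>").count())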
I think you have created external table-2 without specifying a LOCATION. Try using the syntax below:
CREATE EXTERNAL TABLE [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

How transfer a Table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
What I tried before, you can read in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem here is that the data is still in rows and not in columns. So I made 3 tables for news, social and all, each with a specific where clause. After that I made 2 joins on those tables, which gives me the result table. So I had 6 tables in total, which is not really performant!)
To sum up my problem: in HBase there are column families which are saved as rows, like this.
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a data structure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name, collect_set(concat_ws(',', type, val)) input from TESTTABLE group by name;
I am grouping the data by name; the above query returns one row per name, with the (type, value) pairs collected into an array such as ["all,1","social,0","news,1"].
I then wrote a custom mapper which takes that array as input and emits the values:
from (select '["all,1","social,0","news,1"]' input from TESTTABLE group by name) d MAP d.input USING 'python test.py' AS all, social, news
Alternatively, you can use the output to insert into another table which has the column names name, all, social, news.
Hope this helps.
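The custom mapper ('python test.py') is not shown in the answer, so here is a hypothetical minimal sketch of what it might look like, assuming the usual Hive MAP/TRANSFORM contract (rows arrive on stdin, output columns are emitted tab-separated on stdout) and the ["all,1","social,0","news,1"] input shape used above:

#!/usr/bin/env python
# Hypothetical test.py: parses a bracketed list of "type,value" strings per input row
# and emits the values in the fixed column order all, social, news (tab-separated).
import ast
import sys

COLUMNS = ["all", "social", "news"]

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # e.g. ["all,1","social,0","news,1"] -> {"all": "1", "social": "0", "news": "1"}
    pairs = dict(item.split(",", 1) for item in ast.literal_eval(line))
    print("\t".join(pairs.get(col, "") for col in COLUMNS))

You can test it locally with echo '["all,1","social,0","news,1"]' | python test.py, which should print 1, 0 and 1 separated by tabs.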
