Delta table configuration differs between open source Delta and Databricks Delta

I'm trying to set the isolation level to Serializable on open source Delta from an Azure Synapse notebook.
Command:
ALTER TABLE schema.table SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
It seems that Delta does not recognize the configuration:
Error: Unknown configuration was specified: delta.isolationLevel
org.apache.spark.sql.delta.DeltaErrors$.unknownConfigurationKeyException(DeltaErrors.scala:398)
org.apache.spark.sql.delta.DeltaConfigsBase.$anonfun$validateConfigurations$3(DeltaConfig.scala:147)
scala.Option.getOrElse(Option.scala:189)
org.apache.spark.sql.delta.DeltaConfigsBase.$anonfun$validateConfigurations$1(DeltaConfig.scala:147)
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
The same command works on Databricks Delta. Is this expected? Why is there this inconsistency between Databricks Delta and open source Delta?

The command you are trying uses T-SQL-style syntax, which works on Databricks. On a Synapse Spark pool, can you try the command below and see if it works for you?
spark.sql("ALTER TABLE schema.table SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')")

Answering my own question: the isolation level is not exposed in the open source version of Delta.
https://github.com/delta-io/delta/issues/1265
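If the same notebook has to run on both platforms, one option is to attempt the property and treat the "Unknown configuration" error as "not supported here". This is a minimal sketch, not a Delta API: the helper name try_set_tblproperty is hypothetical, spark is assumed to be an existing SparkSession (any object with a .sql() method works), and matching on the message text assumes the error wording shown in the question.

```python
def try_set_tblproperty(spark, table, key, value):
    """Try to set a table property; return False if the key is rejected as unknown.

    Hypothetical helper. `table`, `key` and `value` are interpolated as-is,
    so they are assumed to come from trusted code, not from user input.
    """
    stmt = f"ALTER TABLE {table} SET TBLPROPERTIES ('{key}' = '{value}')"
    try:
        spark.sql(stmt)
        return True
    except Exception as exc:
        # Assumption: the failure message contains the text seen in the question.
        if "Unknown configuration" in str(exc):
            return False
        # Anything else (missing table, permissions, ...) is a real error.
        raise
```

In a notebook this would be called as try_set_tblproperty(spark, "schema.table", "delta.isolationLevel", "Serializable"), and a False result means the table keeps its default isolation level.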

Related

Hive Transactions + Remote Metastore Error

I'm running Hive 2.1.1 on EMR 5.5.0 with a remote MySQL metastore DB. I need to enable transactions in Hive, but when I follow the configuration here and run any query, I get the following error:
FAILED: Error in acquiring locks: Error communicating with the metastore
Settings on the metastore:
hive.compactor.worker.threads = 0
hive.compactor.initiator.on = true
Settings in the hive client:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
This only happens when I set hive.txn.manager, so my hive metastore is definitely online.
I've tried some of the older suggestions to turn on the Hive test features, which didn't work, and I don't think this is a test feature anymore. I can't turn off concurrency, as a similar post on SO suggests, because I need concurrency. It seems that either DbTxnManager isn't getting the remote metastore connection info properly from the Hive config, or the MySQL database is missing some tables required by DbTxnManager. I have datanucleus.autoCreateTables=true.
It looks like Hive wasn't properly creating the tables needed for the transaction manager. I'm not sure where it was getting its schema from, but it was definitely wrong.
So we ran the hive-txn-schema script to set up the schema manually. We'll do this at the start of any of our clusters from now on.
https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-txn-schema-2.1.0.mysql.sql
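To confirm this diagnosis before running the script, the tables present in the metastore database can be compared against the ones the transaction manager expects. A rough sketch under stated assumptions: the table list below is assumed to be a subset of what hive-txn-schema-2.1.0.mysql.sql creates (check the script for the authoritative list), missing_txn_tables is a hypothetical helper, and the existing table names would come from something like SHOW TABLES on the MySQL metastore database.

```python
# Tables assumed to be created by hive-txn-schema-2.1.0.mysql.sql.
# This is a subset for illustration; the script itself is authoritative.
EXPECTED_TXN_TABLES = {
    "TXNS",
    "TXN_COMPONENTS",
    "COMPLETED_TXN_COMPONENTS",
    "NEXT_TXN_ID",
    "HIVE_LOCKS",
    "NEXT_LOCK_ID",
    "COMPACTION_QUEUE",
}


def missing_txn_tables(existing_tables):
    """Return the expected transaction tables absent from existing_tables, sorted."""
    # Compare case-insensitively, since table-name case handling varies by setup.
    present = {name.upper() for name in existing_tables}
    return sorted(EXPECTED_TXN_TABLES - present)


# Example: a metastore that has the core tables but only one transaction table.
print(missing_txn_tables(["DBS", "TBLS", "txns"]))
```

An empty result suggests the transaction schema is in place and the lock error has another cause.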
The error
FAILED: Error in acquiring locks: Error communicating with the metastore
sometimes occurs because the tables contain no data. You need to initialize some data in your tables, for example:
create table t1(id int, name string)
clustered by (id) into 8 buckets
stored as orc TBLPROPERTIES ('transactional'='true');

Changing Transaction Isolation level setting behavior in Sqoop

We are currently trying to use Sqoop to move data from Hadoop to Azure SQL Data Warehouse, but we are getting an error related to the transaction isolation level. Sqoop tries to set the transaction isolation level to READ COMMITTED during import/export, and this feature isn't currently supported in Azure SQL Data Warehouse. I've tried Sqoop's --relaxed-isolation parameter, but it had no effect.
As a solution, I am thinking to:
1. Change the Sqoop source code so that Sqoop does not set the transaction isolation level
2. Look for APIs (if any) that may allow me to change this Sqoop behavior programmatically.
Has anyone encountered such a scenario? I'm looking for suggestions on the proposed solutions and how to go about them.
This issue has just been resolved in Sqoop: https://issues.apache.org/jira/browse/SQOOP-2349
Otherwise, @wBob's comment about using Polybase is definitely best practice: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-azure-sql-data-warehouse-connector#use-polybase-to-load-data-into-azure-sql-data-warehouse

Impala Query Editor always shows AnalysisException

I am running the Cloudera QuickStart VM on a Windows 7 computer with 8 GB of RAM, 4 GB of which is dedicated to the VM.
I loaded tables from a SQL database into Hive, using Sqoop (Cloudera VM tutorial exercise 1). Using the Hive Query Editor OR Impala Shell, everything works fine (i.e. "show tables" shows me the tables that were imported).
Using the Impala Query Editor, whatever I type, I get the same error message:
AnalysisException: Syntax error in line 1:
USE ``
^
Encountered: EMPTY IDENTIFIER
Expected: IDENTIFIER
CAUSED BY...
I get the same error if I type "show tables;".
I checked that the Impala services were up and running, which they were, and everything works fine in the Impala shell.
I googled around but could not find an answer. Many thanks in advance for your help!
You need to use the Hive Query Editor. The error shows up if you use the Impala editor or another query editor because you're using a library written for Hive.
Query -> Editor -> Hive
Yes, try selecting a database. If none appears, try clearing your browser cache and reloading the page, and also verify that your user has permission to view the default database. Since you said that the Hive query editor works fine, though, it sounds like permissions are not the issue.
I solved this issue by clearing the history in Firefox. After that I signed in to Hue again and the databases were shown in the Impala Query Editor again.
Impala does not support the ORC file format. I changed to a sequence file and it works.

Create database in Hive with multiple locations with Sentry enabled

I am creating a database in Hive with multiple locations, for example:
CREATE DATABASE sample1 location 'hdfs://nameservice1:8020/db/dev/abc','hdfs://nameservice1:8020/db/dev/def','hdfs://nameservice1:8020/db/dev/ghi'
but I am getting an error while doing this. Is creating a database with multiple locations allowed? Is there an alternative solution?
PS: My cluster is Sentry-enabled.
Which error? If it is
User xx does not have privileges for CREATETABLE
then look at
http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/quot-User-does-not-have-privileges-for-CREATETABLE-quot-Error/td-p/21044
You may have to omit LOCATION and upload the files directly to the Hive warehouse location of that Hive schema. I can't think of a better workaround.

Tables not found when hive cli called from different directory

I am facing a weird problem with Hive tables. I have HIVE_HOME set in my environment and it is also in my search path, so I can invoke hive directly.
Now I invoke hive from a directory, let's say /a/b/c, and create some tables. I can see the tables.
Then I change to another directory, e.g. /a/b, and invoke hive from there. This is where the problem starts: either I am unable to see the tables or I get this error:
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start
database 'metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db', see the next exception
for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Why are tables tied to the directory from which the hive CLI was called? Any pointers?
I think you are using the Derby database, which Hive uses by default to store its metadata. What you can do is delete everything inside the metastore_db folder, restart Hadoop, and try again. But I think the best advice would be to use MySQL as the metastore.
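The directory dependence follows from how embedded Derby locates its database: Hive's default connection URL, jdbc:derby:;databaseName=metastore_db;create=true, names a relative path, so by default the database is created under whatever directory hive was started from. The sketch below only illustrates that path resolution. resolve_derby_db is a hypothetical helper, not part of Hive or Derby, and /var/lib/hive is just an example location.

```python
import posixpath


def resolve_derby_db(cwd, database_name="metastore_db"):
    """Return the directory a relative or absolute database_name resolves to from cwd."""
    # posixpath.join discards cwd when database_name is already absolute.
    return posixpath.normpath(posixpath.join(cwd, database_name))


# Relative name: each working directory gets its own, separate metastore.
print(resolve_derby_db("/a/b/c"))
print(resolve_derby_db("/a/b"))

# Absolute name: the same metastore is used no matter where hive is started.
print(resolve_derby_db("/a/b/c", "/var/lib/hive/metastore_db"))
print(resolve_derby_db("/a/b", "/var/lib/hive/metastore_db"))
```

Setting javax.jdo.option.ConnectionURL in hive-site.xml to a URL with an absolute databaseName should therefore make every invocation see the same tables, although embedded Derby still allows only one process to open the database at a time.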
