Hive metastore (COLUMNS_V2) for Azure Databricks - azure-databricks

I have configured a Hive 2.3.0 metastore in an Azure SQL database on DBR 10.x. I can see entries for all Delta tables in dbo.TBLS, however [dbo].[COLUMNS_V2] shows only one entry per table, which looks like the following:
CD_ID | COMMENT           | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
346   | from deserializer | col         | array     | 0
What am I missing here? Why don't I see all the columns for table ID 346?

I came across this same problem and found the cause in my case:
When I wrote data to the Hive table in my Databricks notebook I had:
myDf.write.format('delta').saveAsTable('myHiveDb.myTable')
This caused the columns not to show up in COLUMNS_V2, because a Delta table keeps its schema in the Delta transaction log and the metastore only holds a placeholder entry. Instead, write the table in Hive format rather than Delta:
myDf.write.format('hive').saveAsTable('myHiveDb.myTable')
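As a rough illustration of the difference, here is a minimal PySpark sketch, assuming a Databricks notebook with Hive metastore access; the names demo_db, demo_hive_tbl and demo_delta_tbl are made up for the example:

# Minimal sketch; demo_db / demo_hive_tbl / demo_delta_tbl are illustrative names.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
# Hive-format table: each column gets its own row in COLUMNS_V2
df.write.format("hive").mode("overwrite").saveAsTable("demo_db.demo_hive_tbl")
# Delta table: the schema lives in the Delta transaction log, so the metastore
# keeps only a placeholder entry for the table
df.write.format("delta").mode("overwrite").saveAsTable("demo_db.demo_delta_tbl")
# Spark resolves both schemas correctly either way
spark.sql("DESCRIBE TABLE demo_db.demo_hive_tbl").show(truncate=False)
spark.sql("DESCRIBE TABLE demo_db.demo_delta_tbl").show(truncate=False)

Note that Spark itself still sees all of the Delta table's columns; only the COLUMNS_V2 rows in the metastore differ.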

Related

Spark (2.3) not able to identify new columns added to a Parquet table via the Hive ALTER TABLE command

I have a Hive Parquet table which I am creating using the Spark 2.3 API df.saveAsTable. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements).
However, the next time I try to read the same Parquet table into a Spark DataFrame, the new column that was added via the Hive ALTER TABLE command does not show up in the df.printSchema output.
Based on initial analysis, it seems that there might be some conflict, and Spark is using its own schema instead of reading from the Hive metastore.
Hence, I tried the options below:
Changing the spark setting:
spark.sql.hive.convertMetastoreParquet=false
and Refreshing the spark catalog:
spark.catalog.refreshTable("table_name")
However, the above two options are not solving the problem.
Any suggestions or alternatives would be super helpful.
This sounds like the bug described in SPARK-21841. The JIRA description also contains an idea for a possible workaround:
...Interestingly enough it appears that if you create the table
differently like:
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
Run your alter table on mydb.t1, then:
val t1 = spark.table("mydb.t1")
Then it works properly...
To apply this workaround, you have to run the same ALTER TABLE command that was used in Hive from spark-shell as well:
spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

Hive Table retention support

I want to support retention of old partitions on a Hive table. Basically, I need to automatically delete Hive partitions after a specific period. I could do this manually or with a script, but I have noticed that a retention property exists on every Hive table, and I can't find much information about it.
For example, when using describe on a Hive table there is a Retention property:
desc formatted my_hive_table;
>>>
col_name data_type comment
...
Retention: 0 NULL
...
I have also found this 2014 JIRA, but I am not sure whether it is implemented and how.
Can anyone confirm whether Hive supports this capability and, if so, how to configure it properly?
I think it's available in Hive 3; at least it's in HDP since 3.1.4.
See configuration here https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive-set-partition-retention.html
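For reference, the linked page configures retention through table properties. A minimal sketch, assuming Hive 3 (HDP 3.1.4+) with the metastore's partition management/housekeeping task enabled, an external partitioned table named sales_db.events, and a 30-day period (all of which are illustrative):

# Sketch only; sales_db.events and the 30d period are made-up examples.
# The same ALTER TABLE statements can be run directly in Beeline instead of via Spark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Let the metastore's partition management task track partitions automatically
spark.sql("ALTER TABLE sales_db.events SET TBLPROPERTIES ('discover.partitions'='true')")
# Ask the metastore to drop partitions older than 30 days
spark.sql("ALTER TABLE sales_db.events SET TBLPROPERTIES ('partition.retention.period'='30d')")

The property names come from the Cloudera page linked above; the actual dropping is done by a background metastore task, so that task also needs to be enabled on the metastore side.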

RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe

On an HDP cluster, I am trying to create and integrate Hive tables with existing HBase tables. The Hive table gets created, but when I try to query it, it throws the following exception, especially when the number of columns exceeds 200.
I checked that the number of columns is the same in both HBase and Hive, but I cannot find a proper way to debug it.
hive> select * from hbase_hive.lead;
FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
org.apache.hadoop.hive.hbase.HBaseSerDe:
columns has 273 elements while hbase.columns.mapping has 204 elements (counting the key if implicit))
Is there any column limitation in this case?
Please suggest a solution for this.
This has fixed the issue.
https://hortonworks.my.salesforce.com/kA2E0000000LZQ5?srPos=0&srKp=ka2&lang=en_US
ROOT CAUSE:
The root cause of this issue is a 4000-character limit on the PARAM_VALUE field of the SERDE_PARAMS table in the Hive metastore. This limitation prevents Hive from handling tables with a high number of columns, eventually causing DESCRIBE or SELECT * queries to fail with the error above.
WORKAROUND: This issue can be worked around by running the following in the Hive metastore database:
-- log into the Hive metastore DB, then:
alter table SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(400000000);

Hive error - SELECT * FROM table;

I created an external table in Hive, and it was created successfully:
create external table load_tweets(id BIGINT,text STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_raw';
But, when I did:
hive> select * from load_tweets;
I got the below error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream#5dfb0646; line: 1, column: 2]
Please suggest how to fix this. Is the Twitter output file that was created using Flume corrupted, or is it something else?
You'll need to do two additional things.
1) Put data into the file (perhaps using INSERT). Or maybe it's already there. In either case, you'll then need to
2) from Hive, msck repair table load_tweets;
For Hive tables, the schema and other meta-information about the data is stored in what's called the Hive Metastore -- it's actually a relational database under the covers. When you perform operations on Hive tables created without the LOCATION keyword (that is, internal, not external tables), Hive will automatically update the metastore.
But most Hive use-cases cause data to be appended to files that are updated using other processes, and thus external tables are common. If new partitions are created externally, before you can query them with Hive you need to force the metastore to sync with the current state of the data using msck repair table <tablename>;.
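As a concrete illustration of that last point, here is a minimal sketch of the external-table / MSCK REPAIR pattern, written with PySpark (the same statements can be run in the Hive CLI); the table name tweets_ext, the dt partition column, and the assumption that the HCatalog JsonSerDe jar is on the classpath are all illustrative:

# Sketch of registering externally-added partitions; names and paths are illustrative.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# External, partitioned table over a directory that other processes (e.g. Flume) write to
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tweets_ext (id BIGINT, text STRING)
    PARTITIONED BY (dt STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/cloudera/data/tweets_raw'
""")
# Suppose new files later land under .../tweets_raw/dt=2024-01-01/ ;
# the metastore does not know about that partition until it is registered:
spark.sql("MSCK REPAIR TABLE tweets_ext")
# The newly discovered partition is now queryable
spark.sql("SELECT * FROM tweets_ext WHERE dt = '2024-01-01'").show()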

Sqoop - Create empty hive partitioned table based on schema of oracle partitioned table

I have an Oracle table which has 80 columns and is partitioned on the state column. My requirement is to create a Hive table with a schema similar to the Oracle table, partitioned on state.
I tried using the sqoop --create-hive-table option, but keep getting an error:
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Partition key state cannot be a column to import.
I understand that in Hive the partition column should not be in the table definition, but then how do I get around this issue?
I do not want to manually write create table command, as I have 50 such tables to import and would like to use sqoop.
Any suggestion or ideas?
Thanks
There is a workaround for this. Below is the procedure I follow (a rough sketch of the generator script is shown after the list):
1) On Oracle, run a query to get the schema for the table and store it in a file.
2) Move that file to Hadoop.
3) On Hadoop, create a shell script which constructs an HQL file. The HQL file contains the Hive CREATE TABLE statement along with the columns, built from the Oracle schema file copied to Hadoop.
4) To run the script, you just need to pass the Hive database name, table name, partition column name, path, etc., depending on your level of customization. At the end of the shell script, add "hive -f <HQL filename>".
If everything is ready, it takes just a couple of minutes per table creation.
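A rough sketch of the generator from step 3, written here in Python rather than shell for readability; the expected input format (one "COLUMN_NAME DATA_TYPE" pair per line, as might come from an Oracle schema dump), the type mapping, and all names and paths are assumptions for illustration:

# Hypothetical HQL generator: builds a partitioned Hive CREATE TABLE statement
# from an Oracle schema dump with one "COLUMN_NAME DATA_TYPE" pair per line.
# All names, paths and the type mapping below are illustrative assumptions.
import sys

ORACLE_TO_HIVE = {"VARCHAR2": "STRING", "CHAR": "STRING", "NUMBER": "DECIMAL(38,10)", "DATE": "TIMESTAMP"}

def build_hql(schema_file, db, table, partition_col, location):
    cols = []
    with open(schema_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            name, ora_type = parts[0], parts[1]
            hive_type = ORACLE_TO_HIVE.get(ora_type.split("(")[0].upper(), "STRING")
            # the partition column must not appear in the regular column list
            if name.lower() != partition_col.lower():
                cols.append("  %s %s" % (name.lower(), hive_type))
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s.%s (\n" % (db, table)
        + ",\n".join(cols)
        + "\n) PARTITIONED BY (%s STRING)\n" % partition_col.lower()
        + "STORED AS PARQUET\nLOCATION '%s';\n" % location
    )

if __name__ == "__main__":
    # usage: python gen_hql.py schema.txt mydb mytable state /data/mydb/mytable
    schema_file, db, table, partition_col, location = sys.argv[1:6]
    with open(table + ".hql", "w") as out:
        out.write(build_hql(schema_file, db, table, partition_col, location))
    # afterwards: hive -f <table>.hql

The partition column is left out of the column list, which matches the Hive requirement mentioned in the question.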

Resources