Spark (2.3) not able to identify new columns added to a Parquet table via Hive ALTER TABLE - hadoop

I have a Hive Parquet table which I create using the Spark 2.3 API df.write.saveAsTable. A separate Hive process alters the same Parquet table to add columns (based on requirements).
However, the next time I read the same Parquet table into a Spark dataframe, the new column that was added via the Hive ALTER TABLE command does not show up in the df.printSchema output.
Based on initial analysis, it seems there may be a conflict: Spark is using its own cached schema instead of reading the Hive metastore.
Hence, I tried the options below.
Changing the spark setting:
spark.sql.hive.convertMetastoreParquet=false
and refreshing the Spark catalog:
spark.catalog.refreshTable("table_name")
However, neither of the above options solves the problem.
Any suggestions or alternatives would be super helpful.

This sounds like the bug described in SPARK-21841. The JIRA description also contains an idea for a possible workaround:
...Interestingly enough it appears that if you create the table differently, like:
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
then run your ALTER TABLE on mydb.t1, and then
val t1 = spark.table("mydb.t1")
it works properly...

To work around this, run the same ALTER command that was used in Hive from spark-shell as well:
spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

Related

How to use external tables in Hive?

Can anyone please explain why and where we use external tables in Hive?
Please describe a scenario so it is easy to understand.
We use an external table when the underlying dataset pointed to by the Hive table is shared by multiple consumers (e.g. MapReduce jobs, Pig, etc.), and a managed table when the dataset pointed to by the Hive table is used only by Hive.
A managed table in Hive has full control over its dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS. In the case of an external table, dropping the table does not delete the dataset from HDFS.
For example, suppose you have a 50 GB dataset. Creating multiple copies of it for different purposes simply takes more space, so the better option is an external table: when you drop the table the dataset is not deleted, and it can still be used by other applications such as Pig.
As a rule of thumb: use an external table if you plan to work with that data not only from Hive but from other frameworks as well. Otherwise make it internal (managed).
The only difference between external and managed tables in Hive is the DROP TABLE / DROP PARTITION behavior: for a managed table the data is dropped as well; for an external table the data remains untouched in the table/partition location.
Use external tables in most cases. An external table allows you to change the table definition easily, and you can create several tables on top of the same location (see the sketch below).
Use a managed table if the table is temporary/intermediate and the data should be deleted to free up space.
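As a sketch of the shared-location point above (the table names, columns, and path are made up for illustration), two external tables can sit on top of the same HDFS directory, and dropping either one leaves the files intact:
-- Both tables read the same files under /data/clicks
CREATE EXTERNAL TABLE clicks_a (id STRING, ts BIGINT)
STORED AS TEXTFILE
LOCATION '/data/clicks';

CREATE EXTERNAL TABLE clicks_b (id STRING, ts BIGINT)
STORED AS TEXTFILE
LOCATION '/data/clicks';

-- Removes only clicks_a's metadata; the files and clicks_b are untouched
DROP TABLE clicks_a;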
A managed table can be converted to external and vice versa using:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
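The reverse conversion should work the same way; note that, at least in classic Hive, the property value must be the uppercase string 'FALSE':
alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');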

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. I have a pseudo-distributed HDFS that contains recordings encoded with protobuf 3.0.0. Using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH SERDEPROPERTIES (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so Hive does not manage the underlying data for you: an external table can be read by Hive, but its files are not managed by Hive. You need to create the partition folder yourself using HDFS commands, MR, or Spark. If you check the HDFS location '/test/dt=20171117' you will see that the folder has not been created.
My suggestion is to create the folder (partition) with "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows at first, you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE:
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your partition locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
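Putting the pieces of this answer together, a sketch of the full flow might look like the following (it reuses the DDL from the question and assumes the default dt=value directory layout):
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH SERDEPROPERTIES (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE
LOCATION '/test';

-- after placing the files under /test/dt=20171117/ in HDFS:
MSCK REPAIR TABLE test_messages;
select * from test_messages limit 1;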

HBase Shell - Create a reduced table from existing Hbase table

I want to create a reduced version of an HBase table via the HBase shell. For example:
HBase table 'test' is already present in HBase with the following info:
TableName: 'test'
ColumnFamily: 'f'
Columns: 'f:col1', 'f:col2', 'f:col3', 'f:col4'
I want to create another table 'test_reduced' in HBase which looks like this:
TableName: 'test_reduced'
ColumnFamily: 'f'
Columns: 'f:col1', 'f:col3'
How can we do this via the HBase shell? I know how to copy a table using the snapshot command, so I am mainly looking for a way to drop columns from an HBase table.
You can't do it from the shell alone; you need to use the HBase client API:
1. Read the source table in.
2. Only "put" the columns you want into your new table.
Cloudera came close by letting users perform "partial HBase table copies" with the CopyTable utility, but that only allows you to change column family names (I am not sure whether you are using Cloudera), and even that is not what you are looking for.
For your reference:
http://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/
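For completeness, a minimal sketch of that read-and-put loop against the client API (written here in Scala; it assumes the target table was already created in the shell with create 'test_reduced', 'f'):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

object ReduceTable {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val source = conn.getTable(TableName.valueOf("test"))
      val target = conn.getTable(TableName.valueOf("test_reduced"))
      val fam    = Bytes.toBytes("f")
      val cols   = Seq("col1", "col3").map(c => Bytes.toBytes(c))

      // Scan only the columns we want to keep
      val scan = new Scan()
      cols.foreach(c => scan.addColumn(fam, c))

      val scanner = source.getScanner(scan)
      try {
        var result = scanner.next()
        while (result != null) {
          val put = new Put(result.getRow)
          cols.foreach { c =>
            val v = result.getValue(fam, c)
            if (v != null) put.addColumn(fam, c, v)
          }
          if (!put.isEmpty) target.put(put) // skip rows that have neither column
          result = scanner.next()
        }
      } finally scanner.close()
    } finally conn.close()
  }
}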

Hive - Hbase integration Transactional update with timestamp

I am new to Hadoop and big data, and these days I am trying to figure out the possibilities of moving my data store to HBase. I have come across a problem which some of you might be able to help me with. It is as follows:
I have an HBase table "hbase_testTable" with column family "ColFam1". I have set the number of versions of "ColFam1" to 10, as I have to maintain a history of up to 10 updates to this column family, which works fine. When I add new rows through the HBase shell with an explicit timestamp value, it also works fine. Basically, I want to use the timestamp as my version control, so I specify the timestamp as:
put 'hbase_testTable', '1001', 'ColFam1:q1', '1000$', 3
where '3' is my version. Everything works fine.
Now I am trying to integrate with a Hive external table, and I have all the mappings set to match the HBase table, like below:
create external table testtable (id string, q1 string, q2 string, q3 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colfam1:q1,colfam1:q2,colfam1:q3")
TBLPROPERTIES("hbase.table.name" = "testtable", "transactional" = "true");
And it works fine with normal insertion: it updates the HBase table, and vice versa.
However, even though the external table is marked "transactional", I am not able to update data from Hive. It gives me the error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete
using transaction manager that does not support these operations
That said, any updates made to the HBase table are reflected immediately in the Hive table.
I can update the HBase table through the Hive external table by inserting a row into it with the existing "rowid" and new data for the column.
Is it possible for me to control the timestamp being written to the referenced HBase table (like 4, 5, 6, 7, etc.)? Please help.
The timestamp is one of the important elements of HBase versioning. You are trying to create your own timestamps, which works fine at the HBase level.
One point: you should be very careful that they are unique and non-negative. You can look at "Custom versioning" in the HBase: The Definitive Guide book.
Now you have Hive on top of HBase. As per the documentation:
there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
That's the reading part. For putting data, you can look here.
It still says that you have to give a valid timestamp and not any other value.
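In practice that means version-aware reads have to go through the HBase shell or client API rather than Hive; for example, using the table from the question:
get 'hbase_testTable', '1001', {COLUMN => 'ColFam1:q1', VERSIONS => 10}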
Future versions are expected to expose the timestamp attribute.
I hope this gives you a better idea of how to deal with custom timestamps in Hive-HBase integration.

Sqoop - Create empty hive partitioned table based on schema of oracle partitioned table

I have an Oracle table which has 80 columns and is partitioned on the state column. My requirement is to create a Hive table with a schema similar to the Oracle table, partitioned on state.
I tried using the sqoop --create-hive-table option, but keep getting the error:
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Partition key state cannot be a column to import.
I understand that in Hive the partition column should not be in the table definition, but then how do I get around this issue?
I do not want to write the CREATE TABLE commands manually, as I have 50 such tables to import and would like to use Sqoop.
Any suggestions or ideas?
Thanks
There is a workaround for this. Below is the procedure I follow:
1. On Oracle, run a query to get the schema for the table and store it in a file.
2. Move that file to Hadoop.
3. On Hadoop, create a shell script which constructs an HQL file. The HQL file contains the Hive CREATE TABLE statement along with its columns, which we can build from the Oracle schema file copied to Hadoop. A rough sketch of this script follows below.
4. To run the script, you just pass the Hive database name, table name, partition column name, path, etc., depending on your level of customization. At the end of the shell script, add "hive -f <HQL filename>".
If everything is ready, it takes just a couple of minutes per table.
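As a rough sketch of step 3 (the file names, argument order, and the schema-file format of one "COLUMN_NAME TYPE" pair per line are all assumptions; a real script would also need to map Oracle types such as VARCHAR2 and NUMBER to Hive types):
#!/bin/bash
# Usage: ./build_hql.sh <hive_db> <table> <partition_col> <schema_file> <out.hql>
DB=$1; TABLE=$2; PART_COL=$3; SCHEMA_FILE=$4; HQL=$5

{
  echo "CREATE TABLE ${DB}.${TABLE} ("
  # Emit "col type," for every column except the partition column
  grep -iv "^${PART_COL} " "${SCHEMA_FILE}" \
    | awk '{printf "  %s %s,\n", $1, $2}' \
    | sed '$ s/,$//'
  echo ") PARTITIONED BY (${PART_COL} STRING);"
} > "${HQL}"

hive -f "${HQL}"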
