Below is what I am trying, in order:
1. Create a partitioned table in Hive based on the current hour.
2. Use the Spark Hive context and run MSCK REPAIR TABLE.
3. Manually delete the HDFS folder of one of the added partitions.
4. Use the Spark Hive context again and run:
a> msck repair
This does not remove the already-added partition that no longer has an HDFS folder.
This seems to be known behavior of "msck repair".
b> select * from tablexxx where (existing partition);
This fails with a FileNotFoundException pointing to the HDFS folder
that was deleted manually.
Any insights on this behavior would be of great help.
Yes, MSCK REPAIR TABLE only discovers new partitions; it does not delete "old" ones.
For external Hive tables where you deleted the HDFS folder, I see two solutions:
1. Drop the table (the files will not be deleted because the table is external), re-create the table using the same location, and then run MSCK REPAIR TABLE. This is my preferred solution.
2. Drop all the partitions you deleted using ALTER TABLE <table> DROP PARTITION <partition>.
What you observe in your case may be related to these issues: https://issues.apache.org/jira/browse/SPARK-15044 and
https://issues.apache.org/jira/browse/SPARK-19187
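For the second solution you first need to know which partitions are stale. A minimal sketch, assuming you have already collected the lines printed by SHOW PARTITIONS and the partition directory names found under the table location (for example from hdfs dfs -ls); the function name and the sample values are hypothetical:

```python
def find_stale_partitions(metastore_partitions, hdfs_partition_dirs):
    """Return partition specs the metastore lists but HDFS no longer has.

    Both arguments are iterables of strings such as 'hr=2019040101',
    relative to the table location. Order of the metastore list is kept.
    """
    existing = {d.strip("/") for d in hdfs_partition_dirs}
    return [p for p in metastore_partitions if p.strip("/") not in existing]


# Hypothetical inputs: one partition folder was deleted manually.
metastore = ["hr=2019040100", "hr=2019040101", "hr=2019040102"]
on_hdfs = ["hr=2019040100", "hr=2019040102"]

stale = find_stale_partitions(metastore, on_hdfs)
print(stale)
```

Each spec in the result is a candidate for ALTER TABLE <table> DROP PARTITION.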
Related
I am creating a partition through Spark in an HDFS path, not directly in Hive, and then copying it to user/hive/warehouse/test.db/testtbl with the cp command. But after running show partitions in the Hive shell, the partition is not shown. I also ran the repair table command to repair the table and add the partition, but it did not work. How can I add these partitions in Hive? Is there another way to add them?
Either of the commands below should work for you:
MSCK REPAIR TABLE <table_name>
ALTER TABLE <table_name> ADD PARTITION (<col_name>='<value>')
Recently I came across this issue. I had files under a Hadoop Distributed File System (HDFS) path and a related Hive table. The table had 30 partitions on both sides.
I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but printed
"Partitions missing from filesystem:"
I then tried running select count(*) from <db.tablename>; (on Tez), and it failed with the following error:
Caused by: java.util.concurrent.ExecutionException:
java.io.FileNotFoundException:
But when I set hive.execution.engine to "mr" and executed "select count(*) from <db.tablename>;", it worked without any issue.
I have two questions now:
1. How is this possible?
2. How can I sync the Hive metastore with the HDFS partitions in the above case? (My Hive version is "Hive 1.2.1000.2.6.5.0-292".)
Thanks in advance for help.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
This will update the Hive metastore with metadata for partitions for which such metadata doesn't already exist. The default option for the MSCK command is ADD PARTITIONS. With this option, it adds to the metastore any partitions that exist on HDFS but not in the metastore. The DROP PARTITIONS option removes from the metastore the information for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.
However, this is available only from Hive 3.0 onward. See HIVE-17824.
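If a script has to decide between the two approaches, the check comes down to the major version. A rough sketch, assuming the version string looks like the one quoted in the question; the function names are hypothetical, and vendor builds may backport features, so treat the result as a hint only:

```python
import re


def msck_supports_sync(hive_version):
    """True when the leading major version number is 3 or higher."""
    match = re.match(r"\s*(?:Hive\s+)?(\d+)\.", hive_version)
    if match is None:
        raise ValueError(f"unrecognized Hive version: {hive_version!r}")
    return int(match.group(1)) >= 3


def repair_command(table, hive_version):
    """Pick the MSCK form; on older versions stale partitions must
    still be dropped separately with ALTER TABLE ... DROP PARTITION."""
    if msck_supports_sync(hive_version):
        return f"MSCK REPAIR TABLE {table} SYNC PARTITIONS"
    return f"MSCK REPAIR TABLE {table}"


print(repair_command("db.tablename", "Hive 1.2.1000.2.6.5.0-292"))
print(repair_command("db.tablename", "3.1.0"))
```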
In your case the version is Hive 1.2, so below are the steps to sync the HDFS partitions with the table partitions in the metastore.
Drop the 5 partitions that you removed directly from HDFS, using the ALTER statement below:
ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);
Run SHOW PARTITIONS <table_name>; and check that the list of partitions is refreshed.
This should bring the partitions in the Hive metastore (HMS) in line with those in HDFS.
Alternatively, if it is an EXTERNAL table, you can drop and re-create the table and then run MSCK REPAIR on the newly created table, because dropping an external table does not delete the underlying data.
Note: By default, MSCK REPAIR only adds partitions newly created in HDFS to the Hive metastore; it does not remove partitions from the metastore that were deleted manually in HDFS.
To avoid these steps in the future, delete partitions from Hive directly using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>).
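Typing the DROP statements by hand gets tedious with several partitions. A small sketch that turns one line of SHOW PARTITIONS output into the statement, assuming values contain no quotes or escaped characters and can be passed as quoted strings; the function name is hypothetical:

```python
def drop_partition_statement(table, partition_spec):
    """Turn 'year=2019/month=04' into an ALTER TABLE ... DROP statement."""
    clauses = []
    for part in partition_spec.strip("/").split("/"):
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value partition component: {part!r}")
        # Assumption: the value needs no escaping and a quoted literal
        # is acceptable for the partition column's type.
        clauses.append(f"{key}='{value}'")
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({', '.join(clauses)})"


print(drop_partition_statement("db.table_name", "year=2019/month=04"))
```

IF EXISTS keeps the statement from failing if the partition has already been dropped.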
I have a folder structure in HDFS like the one below. However, no partitions were actually created on the table using ALTER TABLE ADD PARTITION commands, even though the folder structure was set up as if the table had partitions.
How can I automatically add all the partitions to the Hive table? (Hive 1.0, external table)
/user/frank/clicks.db
    /date=20190401
        /file0004.csv
    /date=20190402
        /file0009.csv
    /date=20190501
        /file0000.csv
        /file0001.csv
    ...etc
Use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or
ALTER TABLE tablename RECOVER PARTITIONS;
if you are running Hive on EMR.
Read more details about both commands here: RECOVER PARTITIONS
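If neither command is an option, the explicit ALTER TABLE ... ADD PARTITION statements can be generated from the file paths. A minimal sketch, assuming every directory level under the table location is a key=value partition folder; the table name clicks and the function name are hypothetical, and depending on the Hive version a column named date may need backtick quoting, which this sketch does not do:

```python
import posixpath


def add_partition_statements(table, table_location, file_paths):
    """Build one ADD PARTITION statement per distinct partition directory."""
    root = table_location.rstrip("/")
    statements = []
    seen = set()
    for path in file_paths:
        directory = posixpath.dirname(path)
        # Skip files outside the table location, files sitting directly
        # in it, and partition directories that were already handled.
        if not directory.startswith(root + "/") or directory in seen:
            continue
        seen.add(directory)
        relative = directory[len(root) + 1:]
        clauses = []
        for component in relative.split("/"):
            key, _, value = component.partition("=")
            clauses.append(f"{key}='{value}'")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({', '.join(clauses)}) LOCATION '{directory}'"
        )
    return statements


paths = [
    "/user/frank/clicks.db/date=20190401/file0004.csv",
    "/user/frank/clicks.db/date=20190402/file0009.csv",
    "/user/frank/clicks.db/date=20190501/file0000.csv",
    "/user/frank/clicks.db/date=20190501/file0001.csv",
]
for statement in add_partition_statements("clicks", "/user/frank/clicks.db", paths):
    print(statement)
```

The two files under date=20190501 produce a single statement, since a partition is registered per directory, not per file.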
I'm doing the following:
Delete hive partition using ALTER TABLE ... DROP IF EXISTS PARTITION (col='val1')
hdfs dfs -rm -r path_to_remove
Run the ingestion program that creates this partition (col='val1') and writes Avro files under the HDFS folder.
sqlContext.sql("select count(0) from table1 where col='val1'").show returns 0 until MSCK REPAIR TABLE.
Is it compulsory to do the repair step to see the data again in spark-sql? Please advise.
If it's an external table, yes, you need to repair the table. I don't think you need to do that with managed tables.
Spark SQL reads partition information from the Hive metastore. Without an entry for the partition there, nothing can be counted, by Spark or by any other tool that uses the metastore.
I had a table in Hive called test at the location 'hdfs://location1/partition='x'' and moved all the data to 'hdfs://location2/partition='x''.
hdfs dfs -mv /location1 /location2
Then I did
alter table test set location 'hdfs://location2'.
On doing
hdfs dfs -ls /location2
I see all the data in the right partition
Querying to get counts i.e.
select count(*) from test
works fine.
But doing
select * from test
pulls no records.
I am unable to figure out what went wrong while moving the data.
You need to manually drop the existing partitions that were pointing to the original location "hdfs://location1/partition='x'".
Use the command below for each partition:
alter table test drop partition(partition='x');
Once all the partitions are dropped, run the command below to register the new partitions in the Hive metastore:
msck repair table test;
Why? Because the location of the table was changed, but the Hive metastore was not updated with the partitions in the new location; it still holds the partition information for the old location. Once you drop the partitions and run msck repair, the metastore is updated with the partitions from the new location.
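As an alternative to dropping and repairing (not what the steps above do), each partition can be pointed at its new path with ALTER TABLE ... PARTITION ... SET LOCATION. A hedged sketch that generates those statements, assuming you already have each partition's spec and current location (for example from DESCRIBE FORMATTED); the function name and the sample mapping are hypothetical:

```python
def relocate_partition_statements(table, partition_locations, old_prefix, new_prefix):
    """Rewrite partition locations that live under old_prefix.

    partition_locations maps a partition spec such as "partition='x'"
    to the location currently stored in the metastore.
    """
    old = old_prefix.rstrip("/")
    new = new_prefix.rstrip("/")
    statements = []
    for spec, location in partition_locations.items():
        # Match whole path components only, so 'hdfs://location1'
        # does not also match 'hdfs://location10'.
        if location != old and not location.startswith(old + "/"):
            continue
        new_location = new + location[len(old):]
        statements.append(
            f"ALTER TABLE {table} PARTITION ({spec}) SET LOCATION '{new_location}'"
        )
    return statements


# Hypothetical metastore state after the hdfs dfs -mv shown in the question.
locations = {"partition='x'": "hdfs://location1/partition=x"}
for statement in relocate_partition_statements(
    "test", locations, "hdfs://location1", "hdfs://location2"
):
    print(statement)
```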