Hive: Add partitions for existing folder structure - hadoop

I have a folder structure in HDFS like the one below. However, no partitions were actually created on the table using ALTER TABLE ADD PARTITION commands, even though the folder structure was set up as if the table had partitions.
How can I automatically add all the partitions to the Hive table? (Hive 1.0, external table)
/user/frank/clicks.db
    /date=20190401
        /file0004.csv
    /date=20190402
        /file0009.csv
    /date=20190501
        /file0000.csv
        /file0001.csv
    ...etc

Use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or, if you are running Hive on Amazon EMR:
ALTER TABLE tablename RECOVER PARTITIONS;
Read more details about both commands here: RECOVER PARTITIONS
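For the structure above, a minimal sketch of the repair, assuming the external table is named clicks and is partitioned by date (the table name here is only illustrative, taken from the folder name in the question):
MSCK REPAIR TABLE clicks;
-- verify that the partitions were picked up
SHOW PARTITIONS clicks;
MSCK scans the table location for directories in the key=value form and registers every partition that exists on HDFS but not yet in the metastore.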

Related

How to add partition in hive managed table?

I am creating a partition with Spark in an HDFS path, not directly in Hive. I then copy it to /user/hive/warehouse/test.db/testtbl with the cp command. But after running the SHOW PARTITIONS command in the hive shell, the partition is not shown. I also ran the repair table command to repair the table and add the partition, but it did not work. How can I add those partitions in Hive? Is there another way to add them?
Either of the commands below should work for you.
MSCK REPAIR TABLE <table_name>
ALTER TABLE <table_name> ADD PARTITION (<col_name>='<value>')
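If MSCK does not pick up the copied directory (for example because the directory name is not in the key=value form), you can mount it explicitly with ADD PARTITION. A hedged example, assuming a partition column named part_col; the column name and value are illustrative, since the question does not show the table schema:
ALTER TABLE testtbl ADD IF NOT EXISTS PARTITION (part_col='A')
LOCATION '/user/hive/warehouse/test.db/testtbl/part_col=A';
ADD PARTITION only registers metadata; it does not move or validate the underlying files.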

What happens if I move Hive table data files before moving the table?

I am trying to move the location of a table to a new directory. Let's say the original location is /data/dir. For example, I am trying something like this:
hadoop fs -mkdir /data/dir_bkp
hadoop fs -mv /data/dir/* /data/dir_bkp
I then run Hive commands such as:
ALTER TABLE db.mytable RENAME TO db.mytable_bkp;
ALTER TABLE db.mytable_bkp SET LOCATION '/data/dir_bkp';
Is it fine to move the directory files before changing the location of the table? After I run these commands, will the table mytable_bkp be populated as it was before?
After you execute the mv command, your original table becomes empty, because mv moved the data files away.
After you rename the table, it is still empty, because its location is empty.
After you execute ALTER TABLE ... SET LOCATION, the table is still empty, because the partitions are mounted to the old locations (now empty). Sorry for misleading you in this step previously: after a table rename, the partitions remain as they were before the rename. Each partition can normally have its own location outside the table location.
If the table is MANAGED, make it EXTERNAL:
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
Now drop the table, re-create it with the new location, and run MSCK to create the partitions:
MSCK [REPAIR] TABLE tablename;
If you are on Amazon EMR, run
ALTER TABLE tablename RECOVER PARTITIONS;
instead of MSCK.
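Putting the whole recovery together, a sketch under the assumptions in the question (table db.mytable_bkp, backup path /data/dir_bkp); the column and partition definitions are illustrative and should be replaced with your original DDL:
ALTER TABLE db.mytable_bkp SET TBLPROPERTIES('EXTERNAL'='TRUE');
DROP TABLE db.mytable_bkp;  -- the data stays on HDFS because the table is now external
CREATE EXTERNAL TABLE db.mytable_bkp (id INT, name STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/dir_bkp';
MSCK REPAIR TABLE db.mytable_bkp;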

How to create partitioned hive table on dynamic hdfs directories

I am having difficulty getting Hive to discover partitions which are created in HDFS.
Here's the directory structure in HDFS
warehouse/database/table_name/A
warehouse/database/table_name/B
warehouse/database/table_name/C
warehouse/database/table_name/D
A, B, C, D being values of a column named type.
When I create a Hive table using the following syntax:
CREATE EXTERNAL TABLE IF NOT EXISTS
table_name(`name` string, `description` string)
PARTITIONED BY (`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///tmp/warehouse/database/table_name'
I am unable to see any records when I query the table.
But when I create directories in HDFS as below
warehouse/database/table_name/type=A
warehouse/database/table_name/type=B
warehouse/database/table_name/type=C
warehouse/database/table_name/type=D
It works and discovers the partitions when I check using SHOW PARTITIONS table_name.
Is there some configuration in Hive to be able to detect dynamic directories as partitions?
Creating an external table on top of a directory is not enough; the partitions need to be mounted as well. The discover partitions feature was added in Hive 4.0.0. Use MSCK REPAIR TABLE for earlier versions:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
or its equivalent on EMR:
ALTER TABLE table_name RECOVER PARTITIONS;
And when you create dynamic partitions using INSERT OVERWRITE, the partition metadata is created automatically and the partition folders are in the key=value form.
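A minimal sketch of such a dynamic-partition insert for the table above, assuming a staging table named stage_table with matching columns (the staging table name is illustrative):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_name PARTITION (type)
SELECT name, description, type FROM stage_table;
Hive takes the partition value from the last column of the SELECT, writes each row into the corresponding type=<value> folder, and registers the partition in the metastore as it goes.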

Difference in behaviour while running "count(*)" in Tez and MapReduce

Recently I came across this issue. I had a file at a Hadoop Distributed File System path and a related Hive table. The table had 30 partitions on both sides (HDFS and the metastore).
I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but reported:
"Partitions missing from filesystem:"
I tried running select count(*) from <db.tablename>; (on Tez); it failed with the following error:
Caused by: java.util.concurrent.ExecutionException:
java.io.FileNotFoundException:
But when I set hive.execution.engine to "mr" and executed "select count(*) from <db.tablename>;", it worked fine without any issue.
I have two questions now:
How is this possible?
How can I sync the Hive metastore and the HDFS partitions for the above case? (My Hive version is "Hive 1.2.1000.2.6.5.0-292".)
Thanks in advance for help.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
This will update the partition metadata in the Hive metastore for partitions for which such metadata does not already exist. The default option for the MSCK command is ADD PARTITIONS; with this option, it adds any partitions that exist on HDFS but not in the metastore. The DROP PARTITIONS option removes partition information from the metastore for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.
However, this is available only from Hive version 3.0. See HIVE-17824.
In your case, the version is Hive 1.2; below are the steps to sync the HDFS partitions and the table partitions in the metastore.
Drop the 5 partitions that you removed from HDFS directly, using the ALTER statement below.
ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);
Run SHOW PARTITIONS <table_name>; and see if the list of partitions is refreshed.
This should bring the partitions in the HMS in sync with HDFS.
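For example, assuming the table is partitioned by a column named dt and two of the removed partitions carried date values (the table name, column name, and values here are all illustrative):
ALTER TABLE db.tablename DROP IF EXISTS PARTITION (dt='20190401');
ALTER TABLE db.tablename DROP IF EXISTS PARTITION (dt='20190402');
IF EXISTS makes the statement safe to re-run if a partition has already been dropped from the metastore.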
Alternatively, you can drop and recreate the table (IF it is an EXTERNAL table) and perform MSCK REPAIR on the newly created table, since dropping an external table does not delete the underlying data.
Note: By default, MSCK REPAIR only adds partitions newly added in HDFS to the Hive metastore; it does not delete from the metastore partitions that have been deleted in HDFS manually.
To avoid these steps in the future, it is good to delete partitions directly from Hive using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>).

Spark SQL queries on partitioned table with removed partition files fail

Below is what I am trying, in order:
create a partitioned table in Hive based on the current hour.
use the Spark HiveContext and perform MSCK REPAIR TABLE.
delete the HDFS folders of one of the added partitions manually.
use the Spark HiveContext again and perform
a> msck repair
this does not remove the partition that was already added but no longer has an HDFS folder.
Seems like known behavior with respect to "msck repair".
b> select * from tablexxx where (existing partition);
Fails with a FileNotFoundException pointing to the HDFS folder
which was deleted manually.
Any insights on this behavior would be of great help.
Yes, MSCK REPAIR TABLE will only discover new partitions, not delete "old" ones.
Working with external Hive tables where you deleted the HDFS folder, I see two solutions:
drop the table (the files will not be deleted because the table is external), then re-create the table using the same location, and then run MSCK REPAIR TABLE. This is my preferred solution.
Drop all the partitions you deleted, using ALTER TABLE <table> DROP PARTITION <partition>.
What you observe in your case may be related to these: https://issues.apache.org/jira/browse/SPARK-15044 and https://issues.apache.org/jira/browse/SPARK-19187
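A sketch of the first (preferred) solution, assuming the table from the question is named tablexxx and you still have its original DDL; the column list, partition column, and location are illustrative placeholders:
DROP TABLE tablexxx;  -- external table: the remaining data files stay on HDFS
CREATE EXTERNAL TABLE tablexxx (id INT, payload STRING)
PARTITIONED BY (hr STRING)
LOCATION '/path/to/tablexxx';
MSCK REPAIR TABLE tablexxx;
After this, only partitions whose folders still exist on HDFS are registered in the metastore.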
