hive/hdfs moving data not working as expected - hadoop

I had a Hive table called test at location, say, 'hdfs://location1/partition='x'', and moved all the data to 'hdfs://location2/partition='x''.
hdfs dfs -mv /location1 /location2
Then I did
alter table test set location 'hdfs://location2';
On doing
hdfs dfs -ls /location2
I see all the data in the right partition
Querying to get counts i.e.
select count(*) from test
works fine.
But doing
select * from test
pulls no records.
I am unable to figure out what went wrong while moving.

You need to manually drop the existing partitions that still point to the original location "hdfs://location1/partition='x'".
Use the command below to drop each of those partitions:
alter table test drop partition(partition='x');
Once all the partitions are dropped, run the command below to register the new partitions in the Hive metastore:
msck repair table test;
Why? Changing the location of the table does not update the partition entries in the Hive metastore, so the metastore still holds the partition information from the old location. Once you drop the partitions and run the
msck repair
command, the Hive metastore is updated with the partitions found at the new location.
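When a table has many partitions, typing each DROP PARTITION statement by hand is tedious. The following is a minimal sketch that only prints the HiveQL so it can be reviewed first; the partition column name `dt` and the values passed in are hypothetical, not taken from the question.

```shell
# Sketch: print the HiveQL that drops stale partitions and then lets
# MSCK REPAIR rediscover them under the table's new location.
# Arguments: table name, partition column, then one or more partition values.
# The column name and values used in the example call are hypothetical.
print_repartition_sql() {
  table="$1"
  column="$2"
  shift 2
  for value in "$@"; do
    # IF EXISTS avoids an error when a partition was already dropped
    printf "ALTER TABLE %s DROP IF EXISTS PARTITION (%s='%s');\n" "$table" "$column" "$value"
  done
  printf "MSCK REPAIR TABLE %s;\n" "$table"
}

# Example: prints two DROP statements followed by the repair statement
print_repartition_sql test dt 2020-01-01 2020-01-02
```

The output can be saved to a file and run with `hive -f` once it looks right.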

Related

What happens if I move Hive table data files before moving the table?

I am trying to move the location of a table to a new directory. Let's say the original location is /data/dir. For example, I am trying something like this:
hadoop fs -mkdir /data/dir_bkp
hadoop fs -mv /data/dir/* /data/dir_bkp
I then do hive commands such as:
ALTER TABLE db.mytable RENAME TO db.mytable_bkp;
ALTER TABLE db.mytable_bkp SET LOCATION '/data/dir_bkp';
Is it fine to move the directory files before changing the location of the table? After I run these commands, will the table mytable_bkp be populated as it was before?
After you execute the mv command, your original table becomes empty, because mv removed the data files.
After you rename the table, it is still empty, because its location is empty.
After you execute ALTER TABLE SET LOCATION, the table is empty because the partitions are still mounted to the old locations (now empty). Sorry for misleading you on this step previously. After renaming the table, the partitions remain as they were before the rename. Each partition can normally have its own location outside the table location.
If table is MANAGED, make it EXTERNAL:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
Now drop the table, create it again with the new location, and run MSCK to create the partitions:
MSCK [REPAIR] TABLE tablename;
If you are on Amazon EMR, run this instead of MSCK:
ALTER TABLE tablename RECOVER PARTITIONS;

Difference in behaviour while running "count(*)" in Tez and MapReduce

Recently I came across this issue. I had a file at a Hadoop Distributed File System path and related hive table. The table had 30 partitions on both sides.
I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but printed
"Partitions missing from filesystem:"
I tried running select count(*) from <db.tablename>; (on Tez), and it failed with the following error:
Caused by: java.util.concurrent.ExecutionException:
java.io.FileNotFoundException:
But when I set hive.execution.engine to "mr" and executed "select count(*) from <db.tablename>;", it worked without any issue.
I have two questions now :
How is this possible?
How can I sync the Hive metastore with the HDFS partitions in the above case? (My Hive version is "Hive 1.2.1000.2.6.5.0-292".)
Thanks in advance for help.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
This updates the Hive metastore with metadata for partitions that do not already have it. The default option for the MSCK command is ADD PARTITIONS, which adds to the metastore any partitions that exist on HDFS but not in the metastore. The DROP PARTITIONS option removes from the metastore the information for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.
However, this is available only from Hive version 3.0. See HIVE-17824.
In your case, the version is Hive 1.2, below are the steps to sync the HDFS Partitions and Table Partitions in Metastore.
Drop the 5 partitions that you removed directly from HDFS, using the ALTER statement below.
ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);
Run SHOW PARTITIONS <table_name>; and check that the list of partitions is refreshed.
This should sync the partitions in HMS as in HDFS.
Alternatively, if it is an EXTERNAL table, you can drop and recreate it and then run MSCK REPAIR on the newly created table, because dropping an external table does not delete the underlying data.
Note: By default, MSCK REPAIR only adds partitions newly created in HDFS to the Hive metastore; it does not remove from the metastore the partitions that were deleted manually in HDFS.
====
To avoid these steps in the future, delete partitions from Hive directly using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>).
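On Hive versions without SYNC PARTITIONS, the stale entries can be found by comparing what the metastore lists with what is still on HDFS. The sketch below assumes two hypothetical input files: one holding the output of SHOW PARTITIONS (one `key=value` spec per line) and one holding the partition directories that still exist, in the same form. It also assumes partition values need no escaping and can be quoted as strings.

```shell
# Sketch: print a DROP PARTITION statement for every partition that the
# metastore lists but that is no longer present on HDFS.
#   $1  table name
#   $2  file with SHOW PARTITIONS output, e.g. dt=2020-01-01 or dt=2020-01-01/hr=05
#   $3  file with the partition directories still on HDFS, same key=value form
print_stale_partition_drops() {
  stale_table="$1"
  # Keep lines of $2 that do not appear in $3 (the HDFS list is read first)
  awk 'FILENAME == ARGV[1] { present[$0] = 1; next } !($0 in present)' "$3" "$2" |
    # Turn key=value/key2=value2 into key='value', key2='value2'
    sed -e "s/=\([^/]*\)/='\1'/g" -e 's#/#, #g' |
    while IFS= read -r spec; do
      printf 'ALTER TABLE %s DROP IF EXISTS PARTITION (%s);\n' "$stale_table" "$spec"
    done
}
```

Review the printed statements before running them; nothing here touches Hive or HDFS by itself.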

Spark-Sql returns 0 records without repairing hive table

I'm doing the following:
Delete hive partition using ALTER TABLE ... DROP IF EXISTS PARTITION (col='val1')
hdfs dfs -rm -r path_to_remove
Run the ingestion program that creates this partition (col='val1') and writes Avro files under the HDFS folder
sqlContext.sql("select count(0) from table1 where col='val1'").show returns 0 until MSCK REPAIR TABLE.
Is it compulsory to do the repair step to see the data again in spark-sql? Please advise.
If it's an external table, yes, you need to repair the table. I don't think you need to do that with managed tables.
SparkSQL reads partition information from the Hive metastore. Without an entry for the partition there, nothing can be counted, by Spark or by any other tool that uses the metastore.
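As an alternative to repairing the whole table, the ingestion step could register only the partition it has just written, using Hive's ALTER TABLE ... ADD PARTITION. This is a sketch rather than something the answer above prescribes, and the HDFS path in the example call is hypothetical.

```shell
# Sketch: print the HiveQL that registers a single partition in the metastore.
#   $1 table, $2 partition column, $3 partition value, $4 partition directory
print_add_partition_sql() {
  # IF NOT EXISTS makes the statement safe to re-run after a repeated ingestion
  printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (%s='%s') LOCATION '%s';\n" \
    "$1" "$2" "$3" "$4"
}

# Example with a hypothetical path for the partition from the question
print_add_partition_sql table1 col val1 /data/table1/col=val1
```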

Spark sql queries on partitioned table with removed partitions files fails

Below is what I am trying, in order:
create partitioned table in hive based on current hour.
use spark hive context and perform msck repair table.
delete the hdfs folders of one of the added partitions manually.
use spark hive context again and perform
a> msck repair
this does not remove the partition added already with no hdfs folder.
seems like known behavior with respect to "msck repair"
b> select * from tablexxx where (existing partition);
Fails with a FileNotFoundException pointing to the HDFS folder that was deleted manually.
Any insights on this behavior would be of great help.
Yes, MSCK REPAIR TABLE will only discover new partitions, not delete "old" ones.
Working with external hive tables where you deleted the HDFS folder, I see two solutions
Drop the table (the files will not be deleted because the table is external), then re-create the table using the same location, and then run MSCK REPAIR TABLE. This is my preferred solution.
Drop all the partitions you deleted using ALTER TABLE <table> DROP PARTITION <partition>
What you observe in your case is maybe related to these: https://issues.apache.org/jira/browse/SPARK-15044 and
https://issues.apache.org/jira/browse/SPARK-19187
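To decide which partitions to drop, it helps to know which partition directories still exist. The sketch below reduces the text of an `hdfs dfs -ls` listing of the table directory to partition directory names. It assumes a single partition level, the usual listing layout with the path in the last column, and paths without spaces.

```shell
# Sketch: read an `hdfs dfs -ls <table_dir>` listing on stdin and print the
# names of the partition directories (last path component, key=value form).
partition_dirs_from_listing() {
  awk '$1 ~ /^d/ {                      # directory entries start with "d"
         n = split($NF, parts, "/")     # the path is the last column
         if (parts[n] ~ /=/) print parts[n]
       }'
}
```

A typical use would be to pipe the listing into the function and compare the result with the output of SHOW PARTITIONS.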

Hive error after alter table partition set location

I have a table TEST with one partition Profession.
After the execution of
Alter Table TEST PARTITION(Profession='50') set location 'hdfs:/apps/hive/warehouse1/TEST/Profession=50';
Command was executed without errors;
Next query failed with exception:
cannot find dir = hdfs:/xxxxxxxx/apps/hive/wharehouse/TEST/Profession=50
this was the directory where the partition was originally set.
Even executing an ALTER TABLE to move the location back to the original does not fix the problem.
My goal is to move old partitions over time from a SSD hdfs volume to a HDD hdfs volume.
Any suggestion?
Thanks
Try to do msck repair table Test
