Hive change table location for all partitions - hadoop

After moving my table's data to /some_loc, I am able to change its location with a command such as:
ALTER TABLE db.my_table SET LOCATION '/some_loc'; followed by
MSCK REPAIR TABLE db.my_table;
However, while this does change the table's location, the table appears empty. When I do show partitions db.my_table; I see the partitions, but they are referencing the old location. I have to drop and recreate the table for the data to show up.
Is there a way to make sure the partitions point to the correct location when I change the location of the table?

Related

What happens if I move Hive table data files before moving the table?

I am trying to move the location of a table to a new directory. Let's say the original location is /data/dir. For example, I am trying something like this:
hadoop fs -mkdir /data/dir_bkp
hadoop fs -mv /data/dir/* /data/dir_bkp
I then do hive commands such as:
ALTER TABLE db.mytable RENAME TO db.mytable_bkp;
ALTER TABLE db.mytable_bkp SET LOCATION '/data/dir_bkp';
Is it fine to move the directory files before changing the location of the table? After I run these commands, will the table mytable_bkp be populated as it was before?
After you execute the mv command, your original table becomes empty, because mv moved the data files away.
After you rename the table, it is still empty, because its location is empty.
After you execute ALTER TABLE ... SET LOCATION, the table is still empty because the partitions are mounted to the old locations (which are now empty). Sorry for misleading you in this step previously: after a table rename, the partitions remain as they were before the rename. Each partition can normally have its own location outside the table location.
If the table is MANAGED, make it EXTERNAL first:
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
Now drop the table, create it again with the new location, and run MSCK to create the partitions:
MSCK [REPAIR] TABLE tablename;
If you are on Amazon EMR, run
ALTER TABLE tablename RECOVER PARTITIONS; instead of MSCK.
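Putting those steps together, a minimal sketch of the whole sequence might look like the following. The table and path names come from the question above; the column list and the partition column dt are placeholders, since the original table definition is not shown.
-- detach the data from the managed table so DROP does not delete the files
ALTER TABLE db.mytable_bkp SET TBLPROPERTIES('EXTERNAL'='TRUE');
DROP TABLE db.mytable_bkp;
-- recreate the table on top of the new location (the schema shown here is hypothetical)
CREATE EXTERNAL TABLE db.mytable_bkp (col1 STRING, col2 INT)
PARTITIONED BY (dt STRING)
LOCATION '/data/dir_bkp';
-- rebuild partition metadata from the directories found under the new location
MSCK REPAIR TABLE db.mytable_bkp;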

How can we drop a HIVE table with its underlying file structure, without corrupting another table under the same path?

Assuming we have 2 hive tables created under the same HDFS file path.
I want to be able to drop a table WITH its HDFS files, without corrupting the other table that's in the same shared path.
By doing the following:
drop table test;
Then:
hadoop fs -rm -r hdfs/file/path/folder/*
I delete both tables' files, not just those of the one I've dropped.
In another post I found this solution:
--changing the tbl properties to make the table internal
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='False');
--now the table is internal if you drop the table data will be dropped automatically
drop table <table-name>;
But I couldn't get past the ALTER statement as I got a permission denied error (User does not have [ALTER] privilege on table).
Any other solution?
If you have two tables using the same location, then all files in that location belong to both tables, no matter how they were created.
Say you have table1 with location hdfs/file/path/folder and table2 with the same location hdfs/file/path/folder, and you insert some data into table1: files are created, and they are read when you select from table2, and vice versa: if you insert into table2, the new files will be accessible from table1. This is because the table data is simply whatever is stored in the location, no matter how the files got there. You can insert data into a table using SQL, put files into its location manually, etc.
Each table or partition has its own location; you cannot assign individual files to only one of the tables.
For better understanding, read also this answer with examples about multiple tables on top of the same location: https://stackoverflow.com/a/54038932/2700344
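As a small illustration of the point above, a hedged sketch with two hypothetical external tables, shared_a and shared_b, created on top of the same directory (all names and the path are illustrative):
-- two tables, one shared location
CREATE EXTERNAL TABLE shared_a (id INT, name STRING) LOCATION '/data/shared_folder';
CREATE EXTERNAL TABLE shared_b (id INT, name STRING) LOCATION '/data/shared_folder';
-- the file written by this insert is visible through both tables
INSERT INTO TABLE shared_a VALUES (1, 'x');
SELECT * FROM shared_b;   -- returns the row inserted via shared_a
-- dropping one EXTERNAL table removes only its metadata; the shared files stay in place
DROP TABLE shared_a;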

Different Hive location for different commands

I have a Hive external table in my production (let's say table1). When I do desc formatted table1 I can see some location. When I do desc formatted table1 partition(date = 22042019) instead, I get a different hdfs location.
E.g:
desc formatted table1
Location: user/hive/warehouse/db.db/loc1
desc formatted table1 partition (date = 22042019)
Location: x/y/loc/date=22042019
Table and partition locations can be different. When you add a partition without specifying a location, or when partitions are created dynamically during an insert, the partition folders are normally created inside the table location. But you can use ALTER TABLE ... ADD PARTITION ... LOCATION ... or ALTER TABLE ... PARTITION ... SET LOCATION ... to create or point partitions outside the table location. You can also ALTER TABLE ... SET LOCATION to a different location; in that case all existing partitions and their locations remain as they are and stay accessible, even though their base locations and the table location differ.
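A hedged sketch of the commands mentioned above, using the table1 and date partition from the question (the paths are placeholders):
-- add a partition whose data lives outside the table location
ALTER TABLE table1 ADD PARTITION (`date`='22042019') LOCATION '/x/y/loc/date=22042019';
-- or re-point an existing partition at a different directory
ALTER TABLE table1 PARTITION (`date`='22042019') SET LOCATION '/x/y/loc/date=22042019';
-- changing the table location does not touch existing partition locations
ALTER TABLE table1 SET LOCATION '/user/hive/warehouse/db.db/loc1_new';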

Does an external Hive table refresh itself when a file is added to the directory it points to?

I have a directory in HDFS, and every day one processed file is placed in that directory with a DateTimeStamp in the file name. If I create an external table on top of that directory location, does the external table refresh itself when the daily file arrives in that directory?
If you add files into a table directory or partition directory, it does not matter whether the table is external or managed in Hive: the data will be accessible for queries. You do not need any additional steps to make the data available; no refresh is necessary.
A Hive table/partition is metadata (DDL, location, statistics, access permissions, etc.) plus the data files in the location. So the data is stored in the table/partition location in HDFS.
Only if you create a new directory for a partition which does not exist yet do you need to execute an ALTER TABLE ... ADD PARTITION ... LOCATION '<new location>' or MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS.
If you add files into already created table/partition locations, no refresh is necessary.
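A small sketch of registering a new daily directory as a partition, assuming a hypothetical table logs partitioned by dt and a dated subdirectory created by the upstream job (names and paths are illustrative):
-- register one new directory as a partition explicitly
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2019-04-22') LOCATION '/data/logs/dt=2019-04-22';
-- or discover all new partition-style directories (dt=...) under the table location at once
MSCK REPAIR TABLE logs;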
The CBO can answer some queries from statistics alone, without reading the data files; this works only for simple queries like count(*) or max().
If you rely on statistics for such queries, you may need to refresh them using ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS. See this answer for more details: https://stackoverflow.com/a/39914232/2700344
If you do not need statistics and want your table location to be scanned every time you query it, switch this behaviour off: set hive.compute.query.using.stats=false;

How to alter Hive partition column name

I have to change a partition column name (not the partition spec). I looked for commands in the Hive wiki and some Google pages, but I can only find options for altering the partition spec.
For example:
in /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option available for changing a partition column name is dropping and re-creating the table. If there is any other option available, please help me.
Thanks in advance.
You can change a column name in the metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec, so you also need to rename those directories on HDFS using the hadoop fs command.
You can alter the partition column using a simple swap method:
1. Create a new temp table with the same schema as the current table.
2. Move all files from the old table's location to the newly created table's location:
hadoop fs -mv <current_table_location> <temp_table_location>
3. Alter the schema of the original table (rename or drop the partitions).
4. Copy/load the temp table data back to the original table with the appropriate partition values:
hadoop fs -mv <temp_table_location> <current_table_location>
5. Run MSCK REPAIR on the original table and drop the temp table.
NOTE: the mv command moves files from one location to another without copying them, which saves time. Alternatively, we can use LOAD DATA INPATH to copy the data into the original table.
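A hedged sketch of one way to carry out that swap in HiveQL, assuming a hypothetical table sales partitioned by country that should end up partitioned by continent (the table names, columns, paths, and the continent mapping are all illustrative):
-- 1. temp table with the same schema and partitioning as the current table
CREATE TABLE sales_tmp (id INT, amount DOUBLE) PARTITIONED BY (country STRING);
-- 2. move the old table's files under the temp table's location, then register them
--    (shell step) hadoop fs -mv /warehouse/db/sales/* /warehouse/db/sales_tmp/
MSCK REPAIR TABLE sales_tmp;
-- 3. recreate the original table with the new partition column
DROP TABLE sales;
CREATE TABLE sales (id INT, amount DOUBLE, country STRING) PARTITIONED BY (continent STRING);
-- 4. reload with the appropriate partition values (dynamic partitioning)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (continent)
SELECT id, amount, country,
       CASE WHEN country = 'US' THEN 'america' ELSE 'other' END AS continent
FROM sales_tmp;
-- 5. clean up
DROP TABLE sales_tmp;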
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
You can think of it this way: Hive stores the data by creating a folder in HDFS for each partition column value, so altering the partition column means changing the whole directory structure and data of the Hive table, which is not possible. E.g., if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column you can perform the steps below:
Create another Hive table with the required change in the partition column:
CREATE TABLE new_table ( A int, ..... ) PARTITIONED BY (B STRING);
Load the data from the previous table:
INSERT INTO new_table PARTITION (B) SELECT A, B FROM prev_table;
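One detail worth noting, assuming a reasonably recent Hive version: for a dynamic-partition insert like the one above to run, dynamic partitioning usually has to be enabled in non-strict mode first:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;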
As you said, renaming the value of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data from HDFS (supposing your table is external), and reinsert the data using continent as the partition.
What I would do in your case is to have multiple partition levels, so that your folder structure will look like that:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
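A hedged sketch of what that multi-level layout could look like as a table definition (the table name, columns, and location are illustrative, not taken from the question):
CREATE EXTERNAL TABLE sales_by_region (id INT, amount DOUBLE)
PARTITIONED BY (continent STRING, country STRING)
LOCATION '/path/to/the/data';
-- queries can then prune on either level of granularity
SELECT SUM(amount) FROM sales_by_region WHERE continent = 'europe';
SELECT SUM(amount) FROM sales_by_region WHERE continent = 'america' AND country = 'usa';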
Adding a solution here for later:
Use case: change the partition column type from STRING to INT.
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);
