Different Hive location for different commands - hadoop

I have a Hive external table in my production environment (let's say table1). When I run desc formatted table1 I see one location. When I run desc formatted table1 partition(date = 22042019) instead, it shows a different HDFS location.
E.g:
desc formatted table1
Location: user/hive/warehouse/db.db/loc1
desc formatted table1 partition (date = 22042019)
Location: x/y/loc/date=22042019

Table and partition locations can be different. When you add a partition without specifying a location, or create partitions dynamically during insert, the partition folders are normally created inside the table location. But with alter table ... add partition ... location ... or alter table ... partition ... set location you can place partitions outside the table location. You can also use alter table ... set location to point the table itself at a different location; all existing partitions and their locations will remain as they are and stay accessible, even though their base locations and the table location differ.
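For example, a minimal sketch of both variants (the paths here are hypothetical):
-- add a partition that lives outside the table location
ALTER TABLE table1 ADD PARTITION (date='22042019') LOCATION '/x/y/loc/date=22042019';
-- point an existing partition at a different location
ALTER TABLE table1 PARTITION (date='22042019') SET LOCATION '/new/base/date=22042019';
-- point the table itself at a different base location (existing partitions keep their old paths)
ALTER TABLE table1 SET LOCATION '/another/base/table1';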

Related

How can we drop a HIVE table with its underlying file structure, without corrupting another table under the same path?

Assuming we have 2 hive tables created under the same HDFS file path.
I want to be able to drop a table together with its HDFS files, without corrupting the other table that's in the same shared path.
By doing the following:
drop table test;
Then:
hadoop fs -rm -r hdfs/file/path/folder/*
I delete both tables' files, not just the one I've dropped.
In another post I found this solution:
--changing the table properties to make the table internal
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='False');
--now the table is internal; if you drop the table, the data will be dropped automatically
drop table <table-name>;
But I couldn't get past the ALTER statement as I got a permission denied error (User does not have [ALTER] privilege on table).
Any other solution?
If you have two tables using the same location, then all files in that location belong to both tables, no matter how they were created.
Say you have table1 with location hdfs/file/path/folder and table2 with the same location hdfs/file/path/folder, and you insert some data into table1: files are created, and they are read when you select from table2, and vice-versa; if you insert into table2, the new files will be accessible from table1. This is because table data is simply whatever is stored in the location, no matter how the files got there. You can insert data into a table using SQL, put files into the location manually, etc.
Each table or partition has its own location; you cannot assign individual files to a table.
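A minimal sketch (table and path names are hypothetical; INSERT ... VALUES needs Hive 0.14+):
CREATE EXTERNAL TABLE table1 (id int, name string)
LOCATION 'hdfs:///data/shared/folder';

CREATE EXTERNAL TABLE table2 (id int, name string)
LOCATION 'hdfs:///data/shared/folder';

-- rows written through table1 become visible through table2 as well
INSERT INTO TABLE table1 VALUES (1, 'a');
SELECT * FROM table2;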
For better understanding, read also this answer with examples about multiple tables on top of the same location: https://stackoverflow.com/a/54038932/2700344

creating partition in external table in hive

I have successfully created and added dynamic partitions to an internal table in Hive, using the following steps:
1. Created a source table
2. Loaded data from local into the source table
3. Created another table with partitions - partition_table
4. Inserted the data into this table from the source table, resulting in the creation of all the partitions dynamically
My question is, how do I do this with an external table? I have read many articles on this, but I am confused: do I have to specify the path to the already existing partitions when creating partitions for an external table?
example:
Step 1:
create external table table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
Step 2:
alter table table1 add partition(age)
location 'path/to/already/existing/partition'
I am not sure how to proceed with partitioning in external tables. Can somebody please help by giving a step-by-step description of the same?
Thanks in advance!
Yes, you have to tell Hive explicitly what your partition field is.
Say you have the following HDFS directory on which you want to create an external table:
/path/to/dataFile/
Let's say this directory already has data stored (partitioned) department-wise, as follows:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
Each of these directories has a bunch of files, where each file contains comma-separated data for the fields name, age, and height.
e.g.
/path/to/dataFile/dept1/file1.txt
/path/to/dataFile/dept1/file2.txt
Now let's create an external table on top of this:
Step 1. Create the external table:
CREATE EXTERNAL TABLE testdb.table1(name string, age int, height int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
Step 2. Add partitions:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
Done. Run a select query once to verify that the data loaded successfully.
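For example, a quick check could look like this:
SHOW PARTITIONS testdb.table1;
SELECT * FROM testdb.table1 WHERE dept = 'dept1' LIMIT 5;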
1. Set the below properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
2. Create an external partitioned table:
create external table table1 ( name string, height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
3. Insert data into the partitioned table from the source table (see the sketch after these steps).
Basically, the process is the same; it's just that you create an external partitioned table and provide the HDFS path for the table, under which it will create and store the partitions.
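For step 3, a hedged sketch, assuming a source_table holding name, age and height and the dynamic partition settings from step 1 (the dynamic partition column has to come last in the select list):
insert into table table1 partition(age)
select name, height, age from source_table;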
Hope this helps.
The proper way to do it:
Create the table and declare that it is partitioned.
create external table table1 ( name string, height int)
partitioned by (age int)
stored as <your format>
location 'path/to/dataFile/in/HDFS';
Now you have to refresh the partitions in the hive metastore.
msck repair table table1
This will take care of loading all your partitions into the hive metastore.
You can use msck repair table at any point during your process to have the metastore updated.
Follow the below steps:
Create a temporary/source table:
create table source_table (name string, age int, height int) row format delimited fields terminated by ',';
Use whatever delimiter your file actually uses instead of ','.
Load data into the source table:
load data local inpath 'path/to/dataFile/in/HDFS' into table source_table;
Create external table with partition
create external table external_dynamic_partitions(name string,height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
Set dynamic partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;
Load data into the external partitioned table from the source table (the dynamic partition column must come last in the select list):
insert into table external_dynamic_partitions partition(age)
select name, height, age from source_table;
That's it.
You can check the partition information using
show partitions external_dynamic_partitions;
You can also check whether it is an external table using
describe formatted external_dynamic_partitions;
An external table is a type of table in Hive where the data is not moved into the Hive warehouse. That means even if you delete the table, the data still persists and you will always get the latest data, which is not the case with a managed table.

How to alter Hive partition column name

I have to change the partition column name (not the partition spec). I looked for commands in the Hive wiki and some Google results, but I can only find options for altering the partition spec.
For example:
In /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option available for changing the partition column name is dropping and re-creating the table. Is there any other option available? Please help me.
Thanks in advance.
You can change the column name in the metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec, so you also need to change the names of those directories on HDFS using the "hadoop fs" command.
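A hedged sketch of both steps (the rename syntax is from the linked DDL page; whether the metadata change applies to partition columns depends on your Hive version, and the paths are hypothetical):
ALTER TABLE mytable CHANGE country continent string;   -- metadata only
hadoop fs -mv /path/to/mytable/country=US /path/to/mytable/continent=US
MSCK REPAIR TABLE mytable;   -- re-sync the metastore with the renamed directories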
You can alter the partition column using a simple swap method:
Create a new temp table with the same schema as the current table.
Move all files in the old table to the newly created table's location.
hadoop fs -mv <current_table_location> <temp_table_location>
Alter the schema of the original table (rename or drop the partitions).
Recopy/load the temp table data to the original table with the appropriate partition values.
hadoop fs -mv <temp_table_location> <current_table_location>
msck repair the original table and drop the temp table.
NOTE: the mv command moves the files from one location to another, avoiding copy time. Alternatively, we can use LOAD DATA INPATH to copy the data into the original table.
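One possible reading of these steps, as a hedged sketch with hypothetical locations and a country-to-continent rename:
hadoop fs -mv /warehouse/db.db/orig_table /warehouse/db.db/temp_table
-- recreate (or alter) orig_table so it is partitioned by the new column, then move the staged files back
hadoop fs -mv /warehouse/db.db/temp_table/country=US /warehouse/db.db/orig_table/continent=US
msck repair table orig_table;
-- finally, drop the temp table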
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
You can think of it this way: Hive stores the data by creating folders in HDFS named after the partition column values. If you try to alter the Hive partition column, you are effectively trying to change the whole directory structure and data of the Hive table, which is not possible. For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the below steps.
Create another Hive table with the required changes in the partition column:
create table new_table ( A int, B string, ... )
Load data from the previous table:
insert into new_table partition (B) select A, B from prev_table
As you said, renaming the value of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data from HDFS (supposing your table is external), and reinsert the data using continent as the partition column.
What I would do in your case is have multiple partition levels, so that your folder structure looks like this:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data at different levels of granularity (in this case continent and country).
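For instance, a hedged sketch of such a table (the column names and storage format here are made up for illustration):
CREATE EXTERNAL TABLE usage (
  user_id STRING,
  event_time TIMESTAMP
)
PARTITIONED BY (continent STRING, country STRING)
STORED AS PARQUET
LOCATION '/path/to/the/data';

-- queries can then prune at either level of granularity
SELECT count(*) FROM usage WHERE continent = 'europe';
SELECT count(*) FROM usage WHERE continent = 'america' AND country = 'usa';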
Adding solution here for later:
Use case: Change partition column from STRING to INT
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);

How to point one Hive Table to Multiple External Files?

I would like to be able to append multiple HDFS files to one Hive table while leaving the HDFS files in their original directories. These files are located in different directories.
LOAD DATA INPATH moves the HDFS file into the Hive warehouse directory.
As far as I can tell, an External Table must be pointed to one file, or to one directory within which multiple files with the same schema can be placed. However, my files would not be underneath a single directory.
Is it possible to point a single Hive table to multiple external files in separate directories, or to otherwise copy multiple files into a single hive table without moving the files from their original HDFS location?
Expanded solution based on Pradeep's answer:
For example, my files look like this:
/root_directory/<job_id>/input/<dt>
Pretend the schema of each is (foo STRING, bar STRING, job_id STRING, dt STRING)
I first create an external table. However, note that my DDL does not contain an initial location, and it does not include the job_id and dt fields:
CREATE EXTERNAL TABLE hivetest (
foo STRING,
bar STRING
) PARTITIONED BY (job_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
Let's say I have two files I wish to insert located at:
/root_directory/b1/input/2014-01-01
/root_directory/b2/input/2014-01-02
I can load these two external files into the same Hive table like so:
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b1', dt='2014-01-01')
LOCATION '/root_directory/b1/input/2014-01-01';
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b2', dt='2014-01-02')
LOCATION '/root_directory/b2/input/2014-01-02';
If anyone happens to require the use of Talend to perform this, they might try the tHiveLoad component [edit: this doesn't work; see below].
The code Talend produces for this using tHiveLoad is actually LOAD DATA INPATH ..., which will remove the file from its original location in HDFS.
You will have to do the earlier ALTER TABLE syntax in a tHiveLoad instead.
The short answer is yes. A Hive External Table can be pointed to multiple files/directories. The long answer will depend on the directory structure of your data. The typical way you do this is to create a partitioned table with the partition columns mapping to some part of your directory path.
E.g. We have a use case where an external table points to thousands of directories on HDFS. Our paths conform to this pattern /prod/${customer-id}/${date}/. In each of these directories we have approx 100 files. In mapping this into a Hive Table, we created two partition columns, customer_id and date. So every day, we're able to load the data into Hive, by doing
ALTER TABLE x ADD PARTITION (customer_id = "blah", dt = "blah_date") LOCATION '/prod/blah/blah_date';
Try this:
LOAD DATA LOCAL INPATH '/path/local/file_1' INTO TABLE tablename;
LOAD DATA LOCAL INPATH '/path/local/file_2' INTO TABLE tablename;

Is it possible to have multiple hive tables represented within the same HDFS directory structure?

Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?
Here is my situation:
I have a table named "mytable", located in hdfs:/tables/mytable
CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
)
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';
So, HDFS will look like:
hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....
Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.
The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.
Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?
Thanks so much for any hints or help anyone can provide,
Yes, we can have multiple hive tables with the same underlying HDFS directory.
Example:
Create table emp and load data file file3 into it.
create table emp (id int, name string, salary int)
row format delimited
fields terminated by ',';
-- default location would be used

load data
local inpath '/home/parv/testfiles/file3'
into table emp;
Create another table, mirror, pointing at the same location as emp. When you select data from the mirror table, it will be the same as the emp table (contents of file3).
create table mirror (id int, name string, salary int)
row format delimited
fields terminated by ','
location 'hdfs:///user/hive/warehouse/parv.db/emp';
Load data into the mirror table. When you select data from either the mirror table or the emp table, it will return the same results (contents of file3 and file4).
load data
local inpath '/home/parv/testfiles/file4'
into table mirror;
Conclusion:
The same data files are shared between the tables emp and mirror.
But, strangely, HDFS only shows a data directory for the emp table and not for the mirror table. However, both tables are present in Hive and can be queried.
Answering my own question:
It IS possible to have multiple hive tables represented by the same HDFS directory structure, but for what I am looking to do:
A mytable table partitioned by logname (logname=tarzan, logname=jane, etc...)
A separate table for each logname: a "tarzan" table with only the columns used by the tarzan logs, and not any other logname; the same for the "jane" table, etc.
Only represent the data one time in HDFS
A better solution is to have the one mytable table, partitioned by logname, AND create a view for each logname, with only the subset of columns needed in each.
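A hedged sketch of that approach, using the shared columns from the question (the tarzan-only columns are hypothetical and would be added per view):
CREATE VIEW tarzan AS
SELECT `timestamp`, ip_address, country, user_id   -- plus the columns only the tarzan logs use
FROM mytable
WHERE logname = 'tarzan';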
Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.
Is there a reason you can't just have 3 different tables? This would allow you do have different schemas (columns) for each.
--Brandon
