I'm working with Hive tables and I have a problem with updating and dropping an external table.
I created two external tables, T1 and T2, with the same attributes:
create external table T1(
nom string,
prenom string,
age int);
With the query:
insert overwrite table T2
select
nom,
prenom,
age from T1;
I can update T2 with the data in T1, but after doing:
drop table T2;
and then recreating it with create external table T2..., I automatically get all the data that was in T2 before dropping, whereas I would like to have an empty table.
Is this normal? Could anybody explain why, and/or recommend a method?
thx.
Dropping an external table does not remove its data from HDFS. The files will still be available in the folder
/user/hive/warehouse/dbname.db/tablename
Before creating the table the second time, either remove the data from HDFS or specify a different location in the create query itself.
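The two options above could look roughly like this. It is only a sketch: dfs is a Hive CLI command rather than HiveQL, the warehouse path is a placeholder to be replaced with the table's real location, and '/tmp/hive/t2_fresh' is a hypothetical directory.

```sql
-- Find the directory the table points at (see the "Location:" row).
DESCRIBE FORMATTED T2;

-- Dropping an external table removes only its metastore entry.
DROP TABLE T2;

-- Option 1: delete the leftover files, then recreate the table as before.
-- The path below is a placeholder; use the Location reported above.
dfs -rm -r /user/hive/warehouse/dbname.db/t2;

-- Option 2: recreate the table on a different, empty directory.
-- '/tmp/hive/t2_fresh' is a hypothetical path.
CREATE EXTERNAL TABLE T2 (
  nom STRING,
  prenom STRING,
  age INT
)
LOCATION '/tmp/hive/t2_fresh';
```

Only one of the two options is needed. Option 1 destroys the old files, so check the path before running it.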
I have a use case wherein I need to implement SQL-based data warehousing activities using Hive.
The software generates a bunch of csv files. When a file is transformed into the SQL table, a unique id called session is assigned to that csv file and it is loaded into the table. Let's say I have 3 columns in the csv files. I will then have four columns in the SQL table, where the first column represents the session. This means that the values stored in the first csv file are written into the SQL table with the session id '1', the values from the second csv file are appended to the SQL table with the session id '2', and so on.
In Hive,
I stored these csv files in an HDFS directory and want to create one Hive table with an additional column that represents the session id. I am not sure how I can do it. Any help or clue will be highly appreciated.
Try the approaches below.
Using a random session id:
Create an external table on top of the source dataset:
create external table staging (a string, b string, c string) location 'xyz';
Assign a unique id to each row:
insert into table destination select reflect("java.util.UUID", "randomUUID") AS session_id, s.* from staging s;
Using a sequence number as session id:
Create an external table on top of the source dataset:
create external table staging (a string, b string, c string) location 'xyz';
First-time data load:
CREATE TABLE IF NOT EXISTS max_session_id (session_id int);
Append a sequence id to each record (max_session_id must already hold one row, otherwise the join returns nothing):
insert into table destination
select cast(coalesce(t.session_id,0) + row_number() over () as INT) as session_id, t1.*
from max_session_id t join staging t1 on 1=1;
Maintain the max session id in a separate table:
DROP TABLE IF EXISTS tmp_max_session_id;
CREATE TABLE tmp_max_session_id AS SELECT COALESCE(MAX(session_id), 0) AS session_id FROM destination;
INSERT OVERWRITE TABLE max_session_id SELECT * FROM tmp_max_session_id;
If you want to tag every row of a file with the same session id, add each file as a partition. You can keep reflect("java.util.UUID", "randomUUID") or max_session_id in a separate table and, when adding the partition, use the newly generated session_id as the partition value.
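A minimal sketch of the one-partition-per-file idea, assuming each csv file has been moved into its own directory. The table name sessions and the /data/sessions/... paths are hypothetical, and the session ids are written by hand here instead of being generated.

```sql
-- One external table; session_id is a partition column, not a data column.
CREATE EXTERNAL TABLE sessions (a STRING, b STRING, c STRING)
PARTITIONED BY (session_id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sessions';            -- hypothetical base directory

-- Each directory holds exactly one csv file.
ALTER TABLE sessions ADD PARTITION (session_id=1) LOCATION '/data/sessions/1';
ALTER TABLE sessions ADD PARTITION (session_id=2) LOCATION '/data/sessions/2';

-- session_id can be queried like any other column.
SELECT session_id, a, b, c
FROM sessions
WHERE session_id = 2;
```

With this layout no data is rewritten: registering a new file is a single ALTER TABLE statement.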
I have successfully created and added dynamic partitions in an internal table in Hive, using the following steps:
1. Created a source table
2. Loaded data from local into the source table
3. Created another table with partitions - partition_table
4. Inserted the data into this table from the source table, which created all the partitions dynamically
My question is, how do I do this with an external table? I have read many articles on this, but I am confused: do I have to specify the path to the already existing partitions when creating partitions for an external table?
example:
Step 1:
create external table table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
Step 2:
alter table table1 add partition(age)
location 'path/to/already/existing/partition'
I am not sure how to proceed with partitioning in external tables. Can somebody please help by giving a step-by-step description?
Thanks in advance!
Yes, you have to tell Hive explicitly which field is your partition field.
Suppose you have the following HDFS directory on which you want to create an external table:
/path/to/dataFile/
Let's say this directory already has data stored (partitioned) department-wise as follows:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
Each of these directories has a bunch of files, where each file contains the actual comma-separated data for the fields name, age, height,
e.g.
/path/to/dataFile/dept1/file1.txt
/path/to/dataFile/dept1/file2.txt
Now let's create an external table on this:
Step 1. Create external table:
CREATE EXTERNAL TABLE testdb.table1(name string, age int, height int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
Step 2. Add partitions:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
Done. Run a select query once to verify that the data loaded successfully.
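The verification step could be sketched as follows, using the table and partition names from the steps above. The LIMIT value is arbitrary.

```sql
-- The three partitions added in step 2 should be listed.
SHOW PARTITIONS testdb.table1;

-- Row counts per department; a missing or zero count hints at a wrong path.
SELECT dept, COUNT(*) AS row_count
FROM testdb.table1
GROUP BY dept;

-- Spot-check a few rows of one partition to confirm the columns parse.
SELECT name, age, height
FROM testdb.table1
WHERE dept = 'dept1'
LIMIT 10;
```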
1. Set the properties below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
2. Create the external partitioned table:
create external table table1 ( name string, height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
3. Insert data into the partitioned table from the source table.
Basically, the process is the same. It's just that you create an external partitioned table and provide the HDFS path under which Hive will create and store the partitions.
Hope this helps.
The proper way to do it:
Create the table and declare it as partitioned. The partition column must not be repeated in the regular column list:
create external table table1 ( name string, height int)
partitioned by (age int)
stored as ****(your format)
location 'path/to/dataFile/in/HDFS';
Now you have to refresh the partitions in the Hive metastore:
msck repair table table1;
This registers in the Hive metastore every partition directory under the table location that follows the age=<value> naming convention.
You can run msck repair table at any point during your process to bring the metastore up to date.
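As a rough illustration, msck repair table only discovers directories named after the partition column. The directory and file names in the comments are hypothetical.

```sql
-- Layout that MSCK can discover under the table location (hypothetical files):
--   path/to/dataFile/in/HDFS/age=25/part-00000
--   path/to/dataFile/in/HDFS/age=30/part-00000
-- Directories that do not follow key=value naming (e.g. .../25/) are not
-- registered and must be added with ALTER TABLE ... ADD PARTITION instead.
MSCK REPAIR TABLE table1;

-- Confirm which partitions the metastore now knows about.
SHOW PARTITIONS table1;
```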
Follow the steps below.
Create a temporary table/source table:
create table source_table(name string,age int,height int) row format delimited fields terminated by ',';
Use the delimiter of your file instead of ','.
Load data into the source table:
load data local inpath 'path/to/dataFile/in/HDFS' into table source_table;
Create the external table with a partition:
create external table external_dynamic(name string,height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
Set dynamic partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;
Load data into the external table with partitions from the source table, selecting the partition column last:
insert into table external_dynamic partition(age)
select name, height, age from source_table;
That's it.
You can check the partitions information using
show partitions external_dynamic;
You can even check if it is an external table or not using
describe formatted external_dynamic;
An external table is a type of table in Hive whose data is not moved to the Hive warehouse. That means that even if you drop the table, the data still persists and you will always get the latest data, which is not the case with a managed table.
I am interested in loading specific columns into a table created in Hive.
Is it possible to load specific columns directly, or should I load all the data and create a second table to SELECT the specific columns?
Thanks
Yes, you have to load all the data, like this:
LOAD DATA [LOCAL] INPATH '/Your/Path' [OVERWRITE] INTO TABLE yourTable;
LOCAL means that your file is on your local file system and not in HDFS; OVERWRITE means that the data currently in the table will be deleted.
Then you create a second table with only the fields you need and execute this query:
INSERT OVERWRITE TABLE yourNewTable
yourSelectStatement
FROM yourOldTable;
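Put together, the load-then-select approach might look like the sketch below. The table names raw_people and people_names, the column names, the comma delimiter and the file /tmp/people.csv are all hypothetical.

```sql
-- Staging table that mirrors every column of the file.
CREATE TABLE raw_people (name STRING, age INT, height INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load the whole file; LOCAL reads from the local file system.
LOAD DATA LOCAL INPATH '/tmp/people.csv' OVERWRITE INTO TABLE raw_people;

-- Second table holding only the columns of interest.
CREATE TABLE people_names (name STRING, age INT);

INSERT OVERWRITE TABLE people_names
SELECT name, age
FROM raw_people;
```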
A suggested approach is to create an external table in Hive that maps the data you have, and then create a new table with the specific columns using the create table as select command:
create table new_table_name as select column_list from source_table_name;
For example, the statement looks like this:
create table employee as select id as id,emp_name as name from emp;
Try this:
insert into table table_name
(
-- columns you want to insert values into, in lowercase
)
select columns_you_need from source_table;
I have a table with 3 columns. Now I need to turn one of the columns into a partition column.
Is that possible? If not, how can we add a partition to an existing table? I used the syntax below:
create table t1 (eno int, ename string ) row format delimited fields terminated by '\t';
load data local inpath '/....path/' into table t1;
alter table t1 add partition (p1='india');
I am getting errors.
Does anyone know how to add a partition to an existing table?
Thanks in advance.
I don't think this is directly possible. Hive would have to completely rearrange and split the files in HDFS because adding the partition would impose a new directory structure.
What I suggest you do is simply create a new table with the desired schema and partition, and insert everything from the first into the second.
You can't add a partition column to a table that was created without one.
But you can do something like the following steps.
Create a new table and insert the data from the old table into the new one.
/*Original table structure*/
CREATE TABLE original_table(
c1 string,
c2 string,
c3 string)
STORED AS ORC;
/*Partitioned table structure*/
CREATE TABLE partitioned_table(
c1 string,
c2 string)
partitioned by (c3 string)
STORED AS ORC;
/*load data from original_table to partitioned_table*/
insert into
table partitioned_table partition(c3)
select c1, c2, c3
from original_table;
/*rename original_table to old_table. You can just drop it instead if you no longer need it*/
ALTER TABLE original_table RENAME TO old_table;
/*rename partitioned_table to original_table*/
ALTER TABLE partitioned_table RENAME TO original_table;
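One detail the steps above leave implicit: an insert whose partition clause is fully dynamic, like partition(c3), usually needs dynamic partitioning enabled in nonstrict mode for the session. A sketch of those settings plus a quick check after the renames, assuming default Hive settings otherwise:

```sql
-- Run these before the INSERT ... PARTITION(c3) statement.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- After the load and the two renames, original_table is the partitioned one.
SHOW PARTITIONS original_table;

-- Row counts per partition value; compare the total with old_table.
SELECT c3, COUNT(*) AS row_count
FROM original_table
GROUP BY c3;
```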
I think there is no way to convert an existing column of a table into a partition column.
If you want to add a partition to a table, use the ALTER command as you have already done, but note that it only works on a table that was created with PARTITIONED BY. If you are dealing with an external table, specify the location as well. The same ALTER command also works for managed tables, as long as the table is partitioned.
I'm using Amazon's Elastic MapReduce and I have a hive table created based on a series of log files stored in Amazon S3 and split in folders by day like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the LIKE option, so thanks for showing me that. Will it actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
create external table if not exists table2 like table1;
insert overwrite table table2 partition(day) select col1, col2, day from table1;
It might not be the best solution, as I think you have to specify your columns in the select clause (with the partition column last, matching the partition clause).
And you must turn on dynamic partitioning.
I hope this helps.
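For reference, the dynamic-partition variant with the settings switched on might look like the sketch below. col1 and col2 stand in for the elided field list, and the filter assumes the % wildcards in the question were meant for LIKE rather than =.

```sql
-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Same columns and partition column as table1, but no data or partitions yet.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;

-- The partition column (day) must come last in the SELECT list.
INSERT OVERWRITE TABLE table2 PARTITION (day)
SELECT col1, col2, day          -- placeholder columns, replace with real ones
FROM table1
WHERE col1 LIKE '%somecriteria%';

-- The insert should register the new partitions; list them to check.
SHOW PARTITIONS table2;
```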