I have to change the partition column name (not partition spec), I looked for the commands in hive wiki and some google pages. I can find the options for altering the partition spec,
i.e. For example
In /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option available for changing partition column name is dropping and re-creating the table. Is there is any other option available please help me.
Thanks in advance.
You can change column name in metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec. So you also need to change the names of those directories on HDFS by using "hadoop fs" command.
You have alter the partition column using simple swap method.
Create a new temp table which is same schema as current table.
Move all files in the old table to newly create table location.
hadoop fs -mv <current_table_name> <temp_table_name>
Alter the schema of the original table (Rename or drop the partitions)
Recopy/load the temp table data to the original table with appropriate partition values.
hadoop fs -mv <temp_table_name> <current_table_name>
msck repair the the original table & drop the temp_table.
NOTE : mv command move the file from one location to another with reducing the copy time. alternately we can use LOAD DATA INPATH for copy the data to the original table.
You can not change the partition column in hive infact Hive does not support alterting of partitioning columns
You can think of it this way - Hive stores the data by creating a folder in hdfs with partition column values - Since if you trying to alter the hive partition it means you are trying to change the whole directory structure and data of hive table which is not possible exp if you have partitioned on year this is how directory structure looks like
tab1/clientdata/**2009**/file2
tab1/clientdata/**2010**/file3
If you want to change the partition column you can perform below steps
Create another hive table with required changes in partition column
Create table new_table ( A int, B String.....)
Load data from previous table
Insert into new_table partition ( B ) select A,B from table Prev_table
As you said, rename the value for of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country ='US') RENAME TO PARTITION (date='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data (supposing your table is external) from the HDFS and reinsert the data using continent as partition.
What I would do in your case is to have multiple partition levels, so that your folder structure will look like that:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
Adding solution here for later:
Use case: Change partition column from STRING to INT
set hive.mapred.mode=norestrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN dept INT;
Related
I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.
I need a year's worth of data in my new table to start, so I want to copy a year of partitions from my existing table to a new table. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy files into the new table's partition directories in the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.
So, this is all I'm doing:
hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123
hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"
Both are managed tables, the new table dest_tbl is clustered by the same column into the same number of buckets as the src_tbl, and the only difference in schema is the addition of INGESTION_KEY.
So far my SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to INSERT to a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?
Thanks!!
Although i prefer copying by Hive query just to make it all in Hive, but it's ok to copy data files using other tools, but ..
There is a dedicated command that add the new partitions metadata, you can use it in place of alter table add partition.., and it can add many partitions at once :
MSCK REPAIR TABLE dest_tbl;
Keep using Hive default partitioning format : partionKey=partitionValue
When I am trying to rename all partition columns in an existing table for date range of one year which are partitioned - this is what I am getting.
hive> ALTER TABLE test.usage PARTITION ('date') RENAME TO PARTITION (partition_date);
FAILED: ValidationFailureSemanticException Partition spec {partition_date=null} contains non-partition columns.
I got that syntax from here: 1
Given that I am not 100% sure if what you want is to rename the value of the partition or actually change the column which the table is partitioned.
Let's suppose you want to rename the partition's value.
There is an issue in Hive 0.13 and in Hive 0.14 is working fine. Anyways, this should work:
set fs.hdfs.impl.disable.cache=false;
set fs.file.impl.disable.cache=false;
Now run the query by setting this property.
> hive> set fs.hdfs.impl.disable.cache=false;
> hive> set fs.file.impl.disable.cache=false;
> hive> ALTER TABLE test.usage PARTITION (date='oldValue') RENAME TO PARTITION (date='newValue');
Let's suppose you want to change the partition column
In this case what you will need to do is to recreate the table changing the column which the table will be partitioned.
Note: Remember that if you already had data inside the partitions, you will need to reinsert the data.
Please see also this answer
Creating partition table in hive, does it mandatory to choose always the last column for partition column.
If I choose 1st column as partition, I cant do filter data, is there any way to choose first column for partition?
In hive, if you want to partition a table, you have to define partition column first during table creation time. & while populating the data into table you need to specify as follow:
"INSERT INTO partitioned_table PARTITION(status) SELECT id , name, status from temp_tbl "
in this way using you can partition based on last column only. if you want to partition on the basis of first column. you have to write a Mapreduce job for that . that is the only option available.
I guess the problem you are facing is that you already have table "source" in your local system or hdfs and you want to upload it to partitioned table. And you want the first column in the source table to be partitioned in hive. As the source table does not have headers i guess we can not do anything here if we try to directly upload the file in the hive destination folder. The only alternate way i know is that create a non partitioned table in hive whose structure is exactly the same as the source file. then upload the source data to non partitioned table first, then copy the data from non partitioned table to partitioned table.
Suppose the source file is like this
create table source(eid int, ename int, esal int) partitioned by (dept string)
your non partioned table where you upload the data is like thiscreate table nopart(dept string, esal int,ename string, eid int)
then you use the dynamic partition by command insert overwrite table source partition(dept) select eid,ename,esal,dept from nopart;
the order of the parameters is the only point here.
I have successfully created and added Dynamic partitions in an Internal table in hive. i.e. by using following steps:
1-created a source table
2-loaded data from local into source table
3- created another table with partitions - partition_table
4- inserted the data to this table from source table resulting in creation of all the partitions dynamically
My question is, how to perform this in external table? I read so many articles on this, but i am confused , that do I have to specify path to the already existing partitions for creating partitions for external table??
example:
Step 1:
create external table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
Step 2:
alter table table1 add partition(age)
location 'path/to/already/existing/partition'
I am not sure how to proceed with partitioning in external tables. Can somebody please help by giving step by step description of the same?.
Thanks in advance!
Yes, you have to tell Hive explicitly what is your partition field.
Consider you have a following HDFS directory on which you want to create a external table.
/path/to/dataFile/
Let's say this directory already have data stored(partitioned) department wise as follows:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
Each of these directories have bunch of files where each file
contains actual comma separated data for fields say name,age,height.
e.g.
/path/to/dataFile/dept1/file1.txt
/path/to/dataFile/dept1/file2.txt
Now let's create external table on this:
Step 1. Create external table:
CREATE EXTERNAL TABLE testdb.table1(name string, age int, height int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
Step 2. Add partitions:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
Done, run select query once to verify if data loaded successfully.
1. Set below property
set hive.exec.dynamic.partition=true
set hive.exec.dynamic.partition.mode=nonstrict
2. Create External partitioned table
create external table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
3. Insert data to partitioned table from source table.
Basically , the process is same. its just that you create external partitioned table and provide HDFS path to table under which it will create and store partition.
Hope this helps.
The proper way to do it.
Create the table and mention it is partitioned.
create external table1 ( name string, age int, height int)
partitioned by (age int)
stored as ****(your format)
location 'path/to/dataFile/in/HDFS';
Now you have to refresh the partitions in the hive metastore.
msck repair table table1
This will take care of loading all your partitions into the hive metastore.
You can use msck repair table at any point during your process to have the metastore updated.
Follow the below steps:
Create a temporary table/Source table
create table source_table(name string,age int,height int) row format delimited by ',';
Use your delimiter as in the file instead of ',';
Load data into the source table
load data local inpath 'path/to/dataFile/in/HDFS';
Create external table with partition
create external table external_dynamic_partitions(name string,height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
Enable dynamic partition mode to nonstrict
set hive.exec.dynamic.partition.mode=nonstrict
Load data to external table with partitions from source file
insert into table external_dynamic partition(age)
select * from source_table;
That's it.
You can check the partitions information using
show partitions external_dynamic;
You can even check if it is an external table or not using
describe formatted external_dynamic;
External table is a type of table in Hive where the data is not moved to the hive warehouse. That means even if U delete the table, the data still persists and you will always get the latest data, which is not the case with Managed table.
I have a log file in HDFS, values are delimited by comma. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file to Hive table which has columns "timestamp","action" and partitioned by "userid","deviceid". How can I ask Hive to take that last 2 columns in log file as partition for table? All examples e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');" require definition of partitions in the script, but I want partitions to set up automatically from HDFS file.
The one solution is to create intermediate non-partitioned table with all that 4 columns, populate it from file and then make an INSERT into first_table PARTITION (userid,deviceid) select from intermediate_table timestamp,action,userid,deviceid; but that is and additional task and we will have 2 very similiar tables.. Or we should create external table as intermediate.
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
Load data simply copies data, it doesn't read it so it cannot figure out what to partition
Would suggest that you load data into an intermediate table first (or using an external table pointing to all the files) and then letting partition dynamic insert to kick in to load it into a partitioned table
As mentioned in #Denny Lee's answer, we need to involve a staging table(invites_stg)
managed or external and then INSERT from staging table to partitioned table(invites in this case).
Make sure we have these two properties set to:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally insert to invites,
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked this very same scenario, but instead, what we did is create separate HDFS data files for each partition you need to load.
Since our data is coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into their corresponding partition file. Afterwards, it is just a matter of building the script using the Partition from the HDFS file name.
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
5. Instruct hive to dynamically load partitions
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;