I want to partition my table in Hive so that a partition is created for every unique value in a column. There are ~250 partitions for an approximately 4-billion-row table, so I would like to do something like a for loop or a DISTINCT. Here are my thoughts in code (which obviously has not worked):
ALTER TABLE myTable ADD IF NOT EXISTS
PARTITION( myColumn = distinct myColumn);
Or is there some kind of loop in Hive?
Does this require a UDF? A Hive answer would be preferable if possible.
Thanks.
Just use dynamic partitions:
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
Dynamic partitioning creates the partitions on the fly as the data is inserted.
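For the question as asked, a dynamic-partition insert would look roughly like this (a sketch: the staging table name and the non-partition columns col1 and col2 are assumptions; only myTable and myColumn come from the question):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Each row's myColumn value decides which partition it lands in;
-- Hive creates all ~250 partitions automatically.
INSERT OVERWRITE TABLE myTable PARTITION (myColumn)
SELECT col1, col2, myColumn   -- the partition column must come last
FROM my_unpartitioned_table;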
I want to use the ALTER TABLE ... CONCATENATE functionality in Hive, but it seems I have to give the exact partition specification. For example, I have a table with two partition columns, date and group. I'd like to be able to do something like this:
alter table mytable partition (insert_date='2017-04-11',group='%') CONCATENATE;
But I can't find a way to do it.
Concatenate doesn't support this.
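The workaround is to run it once per concrete partition, e.g. (the group values 'A' and 'B' below are assumptions; the real values would come from SHOW PARTITIONS):
ALTER TABLE mytable PARTITION (insert_date='2017-04-11', group='A') CONCATENATE;
ALTER TABLE mytable PARTITION (insert_date='2017-04-11', group='B') CONCATENATE;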
Can I directly consider the Hive partition columns to be similar to the partition columns present in my source (Teradata) tables? Or do I have to consider any other parameters to decide the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, then a person who is trying to access the HDFS data directly will not find the partition columns inside each partition's files. For example, say the Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then the files in the HDFS partition directory for, say, 2016-08-06 will not contain the date field. To make it easy for the end user, partition by a dummy column, say date_d, which holds exactly the same values as the date column.
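A minimal sketch of that dummy-column approach (all table and column names here are assumptions):
-- event_date stays in the data files; date_d only drives the partition layout.
CREATE TABLE mytable (
  id INT,
  event_date STRING
)
PARTITIONED BY (date_d STRING);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mytable PARTITION (date_d)
SELECT id, event_date, event_date AS date_d
FROM source_table;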
Abstractly, partitioning in Teradata and Hive are similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge in each single partition, then consider partitioning it further to improve performance. The multilevel partitioning would mostly depend on the number of filters you apply in your queries.
I want to create a Hive partitioned table with 2 partitions: one with score less than 300 and the other with score greater than 300.
create table parttab(id int,name string) partitioned by (score int) row format delimited fields terminated by '\t' stored as textfile;
load data local inpath '/data/hive/input' into table parttab partition (score<300);
load data local inpath '/data/hive/newinput' into table parttab partition (score>300);
But the LOAD DATA statements give an error because of the ">" and "<" symbols. So how do I create partitions for this scenario?
The reason I want it this way is that a query like
select * from parttab where score<300;
is easy to write.
If I give that partition some name, e.g.:
load data local inpath '/data/hive/input' into table parttab partition (score='lessthan300');
then, while querying, I will have to remember the names of the partitions! :(
select * from parttab where score='lessthan300';
This doesn't sound good! So, is there a more elegant way to partition this?
Hive partitions map to specific values. If you have only two partitions then having specific values for the two ranges isn't a bad compromise.
Hive does not support < or > in a partition definition. Also, Hive does not store the partition column in the underlying data; it is only held in the partition folder name. If you somehow managed to achieve your said partitioning with < or >, it would lead to data loss for the score field, as you would not be able to get back the actual score value for each record.
The suggested approach is to keep score as is and create a new column specifically for partitioning, which has a value of "NEW" or "OLD" based on the requirement, deriving this column's value from the score column:
if score < 300 then part = 'OLD' else part = 'NEW'
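In Hive that derivation would look roughly like this (a sketch: the partitioned table parttab_part and the staging table parttab_temp are assumed names):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- part becomes the partition folder; score itself stays in the data files.
INSERT OVERWRITE TABLE parttab_part PARTITION (part)
SELECT id, name, score,
       CASE WHEN score < 300 THEN 'OLD' ELSE 'NEW' END AS part
FROM parttab_temp;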
This is how I would do it:
Load the original data into a temp table.
From the temp table (Parttab_temp), select the data with your logic (score <= 300 and score > 300) and INSERT it into the Hive table. You will have to run the INSERT INTO query a couple of times, once per condition that you require.
Instead of using "load data", use INSERT INTO:
INSERT INTO TABLE Parttab PARTITION (score)
SELECT * FROM Parttab_temp WHERE score <= 300;

INSERT INTO TABLE Parttab PARTITION (score)
SELECT * FROM Parttab_temp WHERE score > 300;
(I have used <=, so records containing exactly 300 are not missed.)
Hope this helps!
Alternative: to find a specific partition you can use the hive shell to list the partitions and then extract the specific ones with grep. This worked well for me.
hive -e 'show partitions db.tablename;' | grep '202101'
hive -e "show partitions db.tablename partition (type='abc');" | grep '202101'
Using Hive 0.12.0, I am looking to populate a table that is partitioned and bucketed, with the data stored on HDFS. I would also like to create an index on this table, on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I had a similar requirement, I used almost the same approach as yours, as I couldn't find an efficient working alternative.
However, to make the dynamic partitioning a bit faster, I tried setting a few configuration parameters:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=2000;
I am sure you must be using the first two, and the last two you can set depending on your data size.
You can check out the Configuration Properties page and decide for yourself which parameters might help in making your process faster, e.g. increasing the number of reducers used.
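For instance, two settings that often matter for a bucketed load like this one (a sketch; both settings existed in the Hive 0.x line, but verify the defaults for your version):
set hive.enforce.bucketing=true;   -- make the INSERT honor the CLUSTERED BY ... INTO 64 BUCKETS clause
set mapred.reduce.tasks=64;        -- alternatively, match the reducer count to the bucket count by hand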
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
I have a table which has two partitions (by range): first_half and second_half based on a column "INSERT_DAY".
I need to add subpartitions "SUCCESS" and "NONSUCCESS" based on the values of another column "STATUS" (subpartition by list), i.e. I need to transform my range partitioning into composite (range-list) partitioning.
I do not wish to drop existing tables or partitions. What is the ALTER query for this?
PS: The database is Oracle 9i
There is no ALTER query for adding subpartitions, as far as I know.
To get the desired result, perform the following steps:
Create a new table in the structure you want, using CREATE TABLE ... AS SELECT with the partitions and the subpartitions.
Switch the names of the two tables.
You can also explore the use of dbms_redefinition, but if you have the luxury of a little downtime, it's not worth it.
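A rough sketch of that create-and-rename approach (the new table name, the range boundary date, and the status values are assumptions; SUBPARTITION TEMPLATE requires 9i Release 2):
CREATE TABLE mytable_new
PARTITION BY RANGE (insert_day)
SUBPARTITION BY LIST (status)
SUBPARTITION TEMPLATE (
  SUBPARTITION success    VALUES ('SUCCESS'),
  SUBPARTITION nonsuccess VALUES ('NONSUCCESS'))
( PARTITION first_half  VALUES LESS THAN (TO_DATE('2017-07-01','YYYY-MM-DD')),
  PARTITION second_half VALUES LESS THAN (MAXVALUE) )
AS SELECT * FROM mytable;

-- Swap the names so existing queries keep working.
RENAME mytable TO mytable_old;
RENAME mytable_new TO mytable;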