I want to create a Hive partitioned table with two partitions:
one with score less than 300 and the other with score greater than 300.
create table parttab(id int,name string) partitioned by (score int) row format delimited fields terminated by '\t' stored as textfile;
load data local inpath '/data/hive/input' into table parttab partition (score<300);
load data local inpath '/data/hive/newinput' into table parttab partition (score>300);
But the load data statements give an error because of the ">" and "<" symbols. So how can I create partitions for this scenario?
The reason I want it this way is that querying becomes easy:
select * from parttab where score<300;
If I give the partition some name, e.g.:
load data local inpath '/data/hive/input' into table parttab partition (score='lessthan300');
then, while querying, I will have to remember the names of the partitions! :(
select * from parttab where score='lessthan300';
This doesn't sound good! So, is there a more elegant way to partition it?
Hive partitions map to specific values. If you have only two partitions then having specific values for the two ranges isn't a bad compromise.
Hive does not support < or > in a partition definition. Also, Hive does not store the partition column in the underlying data; it is held only in the partition folder name. If you somehow managed to achieve your said partitioning with < or >, it would lead to data loss for the score field, because you would not be able to get back the actual score value for each record.
The suggested approach is to keep score as-is and create a new column specifically for partitioning, whose value ("OLD" or "NEW") is derived from the score column, along these lines:
if score < 300 then part = 'OLD' else part = 'NEW'
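A minimal sketch of that approach (the staging table parttab_staging is an assumption for illustration):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- score stays in the data; part is the derived partition column
create table parttab (id int, name string, score int)
partitioned by (part string)
row format delimited fields terminated by '\t'
stored as textfile;
-- derive the partition value from score (the partition column goes last in the SELECT)
insert into table parttab partition (part)
select id, name, score, if(score < 300, 'OLD', 'NEW') as part
from parttab_staging; -- hypothetical flat table loaded with "load data"
Note that a filter like where score < 300 still works because score remains a real column, but partition pruning only kicks in if you also filter on part = 'OLD'.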
This is how I would do it:
Load the original data into a temp table.
From the temp table (Parttab_temp), select the data with your logic (score <= 300 and score > 300) and INSERT it into the Hive table. You will have to run the INSERT INTO query a couple of times based on the conditions that you require.
Instead of using "load data", use INSERT INTO:
INSERT INTO TABLE Parttab PARTITION (score)
SELECT * FROM Parttab_temp WHERE score <= 300;
(I have used <=, so records containing exactly 300 are not missed.)
INSERT INTO TABLE Parttab PARTITION (score)
SELECT * FROM Parttab_temp WHERE score > 300;
Hope this helps!
Alternative: to find a specific partition, you can use the hive shell to list the partitions and then extract the one you need with grep. This worked well for me.
hive -e 'show partitions db.tablename;' | grep '202101'
hive -e "show partitions db.tablename partition (type='abc');" | grep '202101'
I have some twice-partitioned files in HDFS with the following structure:
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet
and would like to load these into a hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer it all to the final table using dynamic partitioning, as mentioned here.
However, my files don't have the datekey and coeff values in the actual data; they are only in the directory path, since that's how the data is partitioned. So how would I keep track of these values when I load them into the intermediate table?
One workaround would be to do a separate load data inpath query for each coeff value and datekey. This would not need the intermediate table, but would be cumbersome and probably not optimal.
Are there any better ways for how to do this?
The typical solution is to build an external partitioned table on top of the HDFS directory:
create external table table_name (
column1 datatype,
column2 datatype,
...
columnN datatype
)
partitioned by (datekey int,
coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations'
After that, recover all partitions, this command will scan table location and create partitions in Hive metadata:
MSCK REPAIR TABLE table_name;
Now you can query table columns along with partition columns and do whatever you want with it: use it as-is, or load it into another table using insert .. select .., etc.:
select
column1,
column2,
...
columnN,
--partition columns
datekey,
coeff
from table_name
where datekey = 20210506
;
Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).
What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?
It's not a new column but a pseudo-column. You should re-create your table, adding the partitioning specification, like this:
create table table_name (
id int,
name string,
timestamp string
)
partitioned by (date string)
Then you load the data, creating the partitions dynamically, like this:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION (date)
SELECT tno.id, tno.name, tno.timestamp, substring(tno.timestamp, 0, 10) AS date;
Now if you select all from your table you will see a new column for the partition. But consider that a Hive partition is just a subdirectory, not a real column stored in the data, so it adds only a few kilobytes to the total table size.
Since a partition is also treated as a column in Hive, every partition has a value (assigned using static or dynamic partitioning) and is mapped to a directory in HDFS, so it has to be an additional column.
You may choose one of the options below.
Let's say the table DDL is:
CREATE TABLE xyz (id string) PARTITIONED BY (day int);
If the data is organised day-wise, then add a static partition:
ALTER TABLE xyz
ADD PARTITION (day=00)
location '/2017/02/02';
or
INSERT OVERWRITE TABLE xyz
PARTITION (day=1)
SELECT id FROM temp
WHERE dayOfTheYear(timestamp)=1;
Or generate the day number using dynamic partitioning:
INSERT INTO TABLE xyz
PARTITION (day)
SELECT id,
dayOfTheYear(timestamp)
FROM temp;
Hive doesn't have a built-in dayOfTheYear function; you have to create it yourself (e.g. as a UDF).
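That said, on Hive 1.2+ you may not need a UDF at all: date_format() accepts SimpleDateFormat patterns, and 'D' is the day-of-year pattern. A sketch (temp_src and its columns are assumptions for illustration):
-- 'D' is the SimpleDateFormat day-of-year pattern
select id,
cast(date_format(`timestamp`, 'D') as int) as day
from temp_src;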
I have to change a partition column's name (not the partition spec). I looked for commands in the Hive wiki and some Google pages, but I can only find options for altering the partition spec.
For example:
In /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option available for changing a partition column name is dropping and re-creating the table. If there is any other option available, please help me.
Thanks in advance.
You can change the column name in the metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=value, so you also need to rename those directories on HDFS using the "hadoop fs" command.
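For example, the renames might look like this (the warehouse path is illustrative), followed by MSCK REPAIR TABLE so the metastore picks up the renamed directories:
# rename each partition directory from country=... to continent=...
hadoop fs -mv /user/hive/warehouse/mydb.db/mytable/country=US /user/hive/warehouse/mydb.db/mytable/continent=US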
You can alter the partition column using a simple swap method:
Create a new temp table with the same schema as the current table.
Move all files from the old table's location to the newly created table's location:
hadoop fs -mv <current_table_location> <temp_table_location>
Alter the schema of the original table (rename or drop the partitions).
Recopy/load the temp table data into the original table with the appropriate partition values:
hadoop fs -mv <temp_table_location> <current_table_location>
Run msck repair on the original table and drop the temp table.
NOTE: the mv command moves files from one location to another, avoiding copy time. Alternatively, you can use LOAD DATA INPATH to copy the data into the original table.
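For instance, the LOAD DATA INPATH variant for a single partition might look like this (the path, table name, and partition value are illustrative):
LOAD DATA INPATH '/tmp/temp_table/day=1' INTO TABLE current_table PARTITION (day=1);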
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
You can think of it this way: Hive stores the data by creating a folder in HDFS with the partition column values. Trying to alter a partition column means trying to change the whole directory structure and data of the Hive table, which is not possible. For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below:
Create another Hive table with the required changes in the partition column:
Create table new_table (A int, ...) partitioned by (B string)
Load data from the previous table:
Insert into new_table partition (B) select A, B from prev_table
As you said, renaming the value of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data (supposing your table is external) from HDFS, and reinsert the data using continent as the partition.
What I would do in your case is to have multiple partition levels, so that your folder structure looks like this:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
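A sketch of the DDL for that layout (table and column names are illustrative):
create external table sales (id int, amount double)
partitioned by (continent string, country string)
stored as parquet
location '/path/to/the/data';
-- coarse-grained: prunes to the continent='america' subtree
select * from sales where continent = 'america';
-- fine-grained: prunes to a single leaf directory
select * from sales where continent = 'europe' and country = 'spain';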
Adding solution here for later:
Use case: Change partition column from STRING to INT
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);
Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets with data stored on HDFS. I would also like to create an index of this table on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I have had a similar requirement, I used almost the same approach you are using, as I couldn't find an efficient alternative.
However, to make the dynamic partitioning a bit faster, I tried setting a few configuration parameters:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;
I am sure you must be using the first two, and the last two you can set depending on your data size.
You can check out the Configuration Properties page and decide for yourself which parameters might help speed up your process, e.g. increasing the number of reducers used.
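For example, two of the knobs that influence reducer parallelism (the values below are illustrative, not recommendations):
set hive.exec.reducers.bytes.per.reducer=134217728; -- smaller share per reducer => more reducers
set mapred.reduce.tasks=64; -- or pin the reducer count explicitly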
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
I want to partition my table in Hive so that a partition is created for every unique value in a column. There are ~250 partitions for a table of about 4 billion rows, so I would like to do something like a for loop or a distinct. Here are my thoughts in code (which obviously have not worked):
ALTER TABLE myTable ADD IF NOT EXISTS
PARTITION( myColumn = distinct myColumn);
Or is there some kind of loop in Hive?
Does this require a UDF? A Hive answer would be preferable if possible.
Thanks.
Just use dynamic partitions:
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
Dynamic partitioning creates the partitions on the go.
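A sketch of what that looks like here (staging_table is a hypothetical unpartitioned table holding the rows, and col1/col2 stand in for the non-partition columns):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500; -- headroom above the ~250 expected partitions
insert overwrite table myTable partition (myColumn)
select col1, col2, myColumn -- the partition column must come last in the SELECT
from staging_table;
Hive creates one partition per distinct myColumn value automatically; no loop or UDF is needed.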