Hive (0.12.0) - Load data into table with partition, buckets and attached index - hadoop

Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets with data stored on HDFS. I would also like to create an index of this table on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you

Whenever I had a similar requirement, I used almost the same approach as yours, because I couldn't find a more efficient alternative.
However, to speed up the dynamic partitioning a bit, I tried setting a few configuration parameters:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=10000;
I am sure you are already using the first two; the last two you can tune depending on your data size.
You can check out this Configuration Properties page and decide for yourself which parameters might help speed up your process, e.g. increasing the number of reducers used.
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
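For reference, a minimal sketch of the whole single-pass load with those settings; the hive.enforce.bucketing flag is what makes Hive 0.12 actually honour the 64 buckets on insert, and the COMPACT index has to be rebuilt after each load since it is not maintained automatically:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;   -- without this, Hive 0.12 does not enforce the CLUSTERED BY bucketing on insert

FROM my_flat_table
INSERT OVERWRITE TABLE final_table PARTITION(date)
SELECT col1, col2, col3, to_date(my_date) AS date;

ALTER INDEX final_table_index ON final_table REBUILD;   -- refresh the compact index after the load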

Related

Hive load multiple partitioned HDFS file to table

I have some twice-partitioned files in HDFS with the following structure:
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet
and would like to load these into a Hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer it all to the final table using dynamic partitioning, as mentioned here.
However, my files don't have the datekey and coeff values in the actual data; they only appear in the file path, since that's how it's partitioned. So how would I keep track of these values when I load them into the intermediate table?
One workaround would be to do a separate load data inpath query for each coeff value and datekey. This would not need the intermediate table, but would be cumbersome and probably not optimal.
Are there any better ways for how to do this?
The typical solution is to build an external partitioned table on top of the HDFS directory:
create external table table_name (
column1 datatype,
column2 datatype,
...
columnN datatype
)
partitioned by (datekey int,
coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations'
After that, recover all partitions; this command will scan the table location and create the partitions in the Hive metastore:
MSCK REPAIR TABLE table_name;
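If only a few partitions need registering, they can also be added one at a time instead (values taken from the directory layout above):
ALTER TABLE table_name ADD PARTITION (datekey=20210506, coeff=0.5)
LOCATION '/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5';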
Now you can query the table columns along with the partition columns and do whatever you want with them: use the table as is, or load into another table using insert .. select .., etc.:
select
column1,
column2,
...
columnN,
--partition columns
datekey,
coeff
from table_name
where datekey = 20210506
;
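If you do want the data copied into a managed table afterwards, here is a minimal sketch of the insert .. select .. route mentioned above (final_table_name is a hypothetical target table, partitioned by the same two columns):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE final_table_name PARTITION (datekey, coeff)
SELECT column1, column2, columnN,   -- plus the remaining regular columns, in table order
       datekey, coeff               -- partition columns must come last
FROM table_name;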

How to merge small files created by hive while inserting data into buckets?

I have a Hive table which contains call data records (CDRs). The table is partitioned on the phone number and bucketed on call_date. Now, when I insert data into Hive, back-dated call_date values create small files in my buckets, which increases NameNode metadata and slows down performance.
Is there a way to merge these small files into one?
One way to control the size of files when inserting into a table using Hive is to set the parameters below:
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=128000000;
set hive.merge.smallfiles.avgsize=128000000;
This works for both the M/R and Tez engines and merges small output files up to roughly 128 MB (you can adjust that size according to your use case; additional reading here: https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963).
The easiest way to merge the files of an existing table is to rebuild it, with the above Hive settings applied in the session:
CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table select * from old_table;
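Since your table is partitioned (the partition column is assumed here to be called phone_number), note that on older Hive versions the rebuild insert also needs dynamic partitioning enabled and an explicit PARTITION clause, roughly:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO new_table PARTITION (phone_number)
SELECT * FROM old_table;   -- SELECT * returns partition columns last, which is what dynamic partitioning expects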
In your case, for ORC tables you can concatenate the files after creation:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value')] CONCATENATE;
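For example, to compact the files of a single back-dated partition (the partition column name and value here are just placeholders):
ALTER TABLE table_name PARTITION (phone_number='15551234567') CONCATENATE;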

How can I partition a hive table by (only) a portion of a timestamp column?

Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).
What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?
It's not a new column but a pseudo-column. You should re-create your table, adding the partition specification like this:
create table table_name (
id int,
name string,
timestamp string
)
partitioned by (date string)
Then you load the data, creating the partitions dynamically; the partition value has to be the last column in the SELECT:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION(substring(timestamp,0,10))
SELECT tno.id, tno.name, tno.timestamp;
Now, if you select everything from your table, you will see a new column for the partition; but keep in mind that a Hive partition is just a subdirectory and not a real column, so it only adds a few kilobytes of metadata rather than increasing the total table size.
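Queries that filter on that pseudo-column then prune down to the matching daily subdirectories, for example (the date value is just an illustration):
SELECT count(*) FROM table_name WHERE date = '2017-02-02';   -- only the 2017-02-02 partition directory is read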
Since a partition also behaves like a column in Hive, every partition has a value (assigned using static or dynamic partitioning) and every partition is mapped to a directory in HDFS, so it has to be an additional column.
You may choose one of the options below.
Let's say the table DDL is:
CREATE TABLE temp( id string) PARTITIONED BY (day int)
If the data is organised day-wise, then add a static partition:
ALTER TABLE xyz
ADD PARTITION (day=00)
location '/2017/02/02';
or
INSERT OVERWRITE TABLE xyz
PARTITION (day=1)
SELECT id FROM temp 
WHERE dayOfTheYear(timestamp)=1;
Generate day number using dynamic partition:
INSERT INTO TABLE xyz
PARTITION (day)
SELECT id ,
dayOfTheYear(timestamp)
FROM temp;
Hive doesn't have a built-in dayOfTheYear function; you have to create it yourself (e.g. as a UDF).
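On Hive 1.2 or later you can also avoid the custom UDF and derive the day number with the built-in date_format function; a sketch, assuming the timestamp column from the question:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE xyz
PARTITION (day)
SELECT id,
       cast(date_format(`timestamp`, 'D') AS int) AS day   -- 'D' is the day of the year
FROM temp;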

Dynamic partition in hive

I have created a table with dynamic partitioning in Hive as below:
create table sample(uuid String,date String,Name String,EmailID String,Comments String,CompanyName String,country String,url String,keyword String,source String) PARTITIONED BY (id String) Stored as parquet;
I have also set the following in the Hive shell:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=100000000;
set hive.exec.max.dynamic.partitions.pernode=100000000;
set hive.exec.max.created.files = 100000000;
Is this good practice, given that I am setting a value of 100 million for each dynamic partition configuration as shown above?
Dynamic partitioning is designed for tables that keep receiving new partition values. If your table is loaded through INSERT statements, that is fine; without dynamic partitioning you either have to run a separate statement to create each new partition, or you have to know the partition values in advance:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
You can check an example in the official Hive tutorial.
Best practices for partitioning are related to the kind of data stored. For example:
It is not recommended to partition on unique values like IDs (if each row has a different id value, this is a bad practice).
The data has to have enough dispersion; if the partition column has only a few distinct values (like a boolean field or similar), that is also a bad practice.
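For comparison, a dynamic-partition insert into your sample table would look roughly like this (sample_stg is a hypothetical staging table with the same columns plus id as its last column). With a genuinely unique id, every single row would land in its own partition, which is exactly the situation to avoid:
-- the dynamic partition settings from the question are assumed to be in place
INSERT OVERWRITE TABLE sample PARTITION (id)
SELECT uuid, date, Name, EmailID, Comments, CompanyName, country, url, keyword, source, id
FROM sample_stg;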

How to partition large Hive table with many categories

I want to partition my table in Hive so that a partition is created for every unique value in the column. There would be ~250 partitions for a roughly 4-billion-row table, so I would like to do something like a for loop or a distinct. Here are my thoughts in code (which obviously have not worked):
ALTER TABLE myTable ADD IF NOT EXISTS
PARTITION( myColumn = distinct myColumn);
or is there some kind of loop in Hive?
Does this require a UDF? A hive answer would be preferable if possible.
Thanks.
Just use dynamic partitions:
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
They create the partitions on the fly.
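A minimal sketch of what that looks like for your table (myTable_stg is a hypothetical staging table holding the unpartitioned rows, with myColumn as its last column):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;   -- headroom above the ~250 expected partitions

INSERT OVERWRITE TABLE myTable PARTITION (myColumn)
SELECT * FROM myTable_stg;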
