How to merge small files created by Hive while inserting data into buckets? - hadoop

I have a Hive table which contains call data records (CDRs). The table is partitioned on the phone number and bucketed on call_date. When I insert data into Hive, back-dated call_date values create small files in my buckets, which inflates NameNode metadata and slows performance.
Is there a way to merge these small files into one?
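For reference, a table like the one described might be declared roughly as follows (the column names, bucket count, and storage format are illustrative assumptions, not taken from the question):
CREATE TABLE cdr (
  call_date DATE,   -- bucketing column
  duration  INT,
  callee    STRING
)
PARTITIONED BY (phone_number STRING)
CLUSTERED BY (call_date) INTO 32 BUCKETS
STORED AS ORC;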

One way to control the size of files when inserting into a table with Hive is to set the parameters below:
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=128000000;
set hive.merge.smallfiles.avgsize=128000000;
This works for both the M/R and Tez engines and ensures that all files created are at or below 128 MB in size (you can adjust that size to suit your use case; additional reading here: https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963).
The easiest way to merge the files of an existing table is to remake it, with the above settings applied in the same session:
CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table SELECT * FROM old_table;
In your case, for ORC tables you can concatenate the files after creation:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value')] CONCATENATE;
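For example, to merge the small files of a single partition (the table and partition names here are illustrative):
ALTER TABLE cdr PARTITION (phone_number='5551234567') CONCATENATE;
CONCATENATE rewrites the partition's small ORC files in place into fewer, larger files, without rerunning the insert.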

Related

Is there a way to prevent a Hive table from being overwritten if the SELECT query of the INSERT OVERWRITE does not return any results

I am developing a batch job that loads data into Hive tables from HDFS files. The flow of data is as follows:
Read the file received in HDFS using an external Hive table
INSERT OVERWRITE the final hive table from the external Hive table applying certain transformations
Move the received file to Archive
This flow works fine if there is a file in the input directory for the external table to read during step 1.
If there is no file, the external table will be empty and as a result executing step 2 will empty the final table. If the external table is empty, I would like to keep the existing data in the final table (the data loaded during the previous execution).
Is there a hive property that I can set so that the final table is overwritten only if we are overwriting it with some data?
I know that I can check whether the input file exists using an HDFS command and conditionally launch the Hive queries. But I am wondering if I can achieve the same behavior directly in Hive, which would help me avoid this extra verification.
Try adding a dummy partition column to your table, say LOAD_TAG, and use a dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE your_table PARTITION(LOAD_TAG)
select
col1,
...
colN,
'dummy_value' as LOAD_TAG
from source_table;
The partition value should always be the same in your case. This works because a dynamic-partition INSERT OVERWRITE only overwrites partitions that actually appear in the query result: if the SELECT returns no rows, no partition is written and the existing data is left untouched.
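A minimal sketch of the table DDL this approach assumes (the column names are illustrative):
CREATE TABLE your_table (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (LOAD_TAG STRING);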

How to force CTAS to generate a single file?

I'm using HDP 2.5 with the Hive service. When I create a Hive table using the query below:
create table Sample_table
row format delimited
fields terminated by '|'
stored as textfile
AS
select *
from sample_table_unique
where state='AL';
Alternatively, I can create an external table with a specific location.
My question: when I create a table or external table, the stored data gets split into multiple files, like these:
/apps/hive/warehouse/sampledb/sample_table:
00000_0,
00001_0,
00002_0,
00003_0,
I don't want those split files; I want one merged file like 00000_0. I don't know why this happens. Please tell me how to resolve this issue.
The SELECT statement runs a map-only or MapReduce job (depending on the query) to write data from the source table sample_table_unique into the target table Sample_table.
Based on the number of tasks, the number of files generated may vary.
To merge them into one, you can set these properties, either for the session or permanently in hive-site.xml:
hive> SET hive.merge.mapfiles=true;
hive> SET hive.merge.mapredfiles=true;
hive> SET hive.merge.smallfiles.avgsize=16000000;
hive> SET hive.merge.size.per.task=256000000;
In case of TEZ execution engine, use
hive> SET hive.merge.tezfiles=true;
instead of mapfiles and mapredfiles.
When the average output file size of a job is less than the hive.merge.smallfiles.avgsize value, Hive will start an additional map-reduce job to merge the output files into bigger files.
The values shown for hive.merge.smallfiles.avgsize and hive.merge.size.per.task are the defaults; adjust them according to your input size.
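If you need a hard guarantee of exactly one output file regardless of these thresholds, another commonly used trick is to end the CTAS query with ORDER BY, since a global sort in Hive runs through a single reducer. A sketch based on the question's query (it trades write parallelism for a single file; the ordering column is an arbitrary choice):
create table Sample_table
row format delimited
fields terminated by '|'
stored as textfile
AS
select *
from sample_table_unique
where state='AL'
order by state;  -- global ORDER BY forces one reducer, hence one output file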

How to alter Hive partition column name

I have to change the partition column name (not the partition spec). I looked for commands in the Hive wiki and some Google results, but I could only find options for altering the partition spec.
For example, in /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option for changing a partition column name is dropping and re-creating the table. Is there any other option available? Please help me.
Thanks in advance.
You can change the column name in the metadata by following this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec, so you also need to rename those directories on HDFS using the "hadoop fs" command.
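For example, if the table lives at a warehouse path like the one below (the paths are illustrative), each partition directory needs to be renamed to carry the new column name:
hadoop fs -mv /apps/hive/warehouse/mydb.db/mytable/country=US /apps/hive/warehouse/mydb.db/mytable/continent=US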
You can change the partition column using a simple swap method:
Create a new temp table with the same schema as the current table.
Move all files in the old table to the newly created table's location:
hadoop fs -mv <current_table_path> <temp_table_path>
Alter the schema of the original table (rename or drop the partitions).
Recopy/load the temp table data into the original table with the appropriate partition values:
hadoop fs -mv <temp_table_path> <current_table_path>
Run MSCK REPAIR on the original table and drop the temp table.
NOTE: the mv command moves files from one location to another without copying them, which saves time. Alternatively, you can use LOAD DATA INPATH to copy the data into the original table.
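A sketch of the LOAD DATA INPATH alternative mentioned in the note (the path and partition value are illustrative):
LOAD DATA INPATH '/apps/hive/warehouse/mydb.db/temp_table/part-00000' INTO TABLE current_table PARTITION (continent='america');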
You cannot change the partition column in Hive; in fact, Hive does not support altering partition columns.
You can think of it this way: Hive stores data by creating an HDFS directory for each partition column value, so altering the partition column means restructuring the whole directory layout and data of the table, which is not possible. For example, if you have partitioned on year, the directory structure looks like this:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below:
Create another Hive table with the required change in the partition column:
CREATE TABLE new_table (A INT, ...) PARTITIONED BY (B STRING)
Load the data from the previous table:
INSERT INTO TABLE new_table PARTITION (B) SELECT A, B FROM prev_table;
As you said, renaming the value of a partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data from HDFS (supposing your table is external), and reinsert the data using continent as the partition column.
What I would do in your case is have multiple partition levels, so that your folder structure looks like this:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
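A sketch of the corresponding DDL (the data columns are illustrative):
CREATE TABLE usage_data (
  user_id STRING,
  amount  DOUBLE
)
PARTITIONED BY (continent STRING, country STRING);
Queries can then prune at either level, e.g. WHERE continent='europe' or WHERE continent='europe' AND country='spain'.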
Adding solution here for later:
Use case: Change partition column from STRING to INT
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);

Hive (0.12.0) - Load data into table with partition, buckets and attached index

Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets with data stored on HDFS. I would also like to create an index of this table on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data into a "flat" intermediate table (no partitions, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I had a similar requirement, I used almost the same approach as you, since I couldn't find a more efficient alternative.
However, to speed up the dynamic partitioning step, I tried setting a few configuration parameters:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;
I am sure you are already using the first two; the last two you can set depending on your data size.
You can check out the Configuration Properties page and decide for yourself which parameters might help speed up your process, e.g. increasing the number of reducers used.
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
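One more setting worth checking for bucketed inserts on Hive 0.12 (hedged, as this behavior changed in later versions): without it, Hive does not force the reducer count to match the bucket count, and the inserted data may not actually end up bucketed on disk.
set hive.enforce.bucketing = true;  -- one reducer per bucket during INSERT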

Hive loading in partitioned table

I have a log file in HDFS; values are delimited by commas. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last two columns in the log file as the partition values for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all four columns, populate it from the file, and then run INSERT INTO first_table PARTITION (userid, deviceid) SELECT timestamp, action, userid, deviceid FROM intermediate_table; but that is an additional task and we end up with two very similar tables. Alternatively, we could create an external table as the intermediate.
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
LOAD DATA simply copies files; it doesn't read them, so it cannot figure out what to partition on.
He suggests loading the data into an intermediate table first (or using an external table pointing to all the files) and then letting a dynamic partition insert kick in to load it into the partitioned table.
As mentioned in Denny Lee's answer, we need to involve a staging table (invites_stg), managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure these two properties are set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally, insert into invites:
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer to this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked through this very same scenario, but instead what we did was create a separate HDFS data file for each partition that needed to be loaded.
Since our data came from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into the corresponding partition files. Afterwards, it is just a matter of building the load script using the partition from each HDFS file name.
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Instruct Hive to load partitions dynamically:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
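The sequence then finishes with the dynamic-partition insert shown earlier. Spelled out against the DDL above, with the partition column STATE last as dynamic partitioning requires, it would look like this:
INSERT OVERWRITE TABLE India PARTITION (STATE)
SELECT OFFICE_NAME, OFFICE_STATUS, PINCODE, TELEPHONE, TALUK,
       DISTRICT, POSTAL_DIVISION, POSTAL_REGION, POSTAL_CIRCLE, STATE
FROM invites_stg;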
