Hive job taking too long to read data and insert in partitioned sorted bucketed table - hadoop

We have a job that reads from a Hive table with around 3 billion rows and inserts into a sorted, bucketed table.
Files in both the source and destination tables are in Parquet format.
This job is taking too long to finish; we had to stop it after 3 days.
We recently migrated to a new cluster. The old cluster was on Cloudera 5.12 and the new one is on Cloudera 6.3.1.
This job used to run fine and finish within 6 hours on the 5.12 cluster, but it is taking far longer on the new cluster.
We have tried the following to solve this, without any results:
Removed the cap on reducers by removing set hive.exec.reducers.max=200;
Set mapreduce.job.running.reduce.limit=100;
Merged files at the source to make sure we are not reading small files. File sizes in the source table were increased to 1 GB each.
Reduced the number of rows in the source table to reduce the data the mappers read.
Reduced the max split size to 64 MB to increase the number of mappers.
Inserted into a new table.
Inserted into a new table that is not sorted or bucketed.
The query we are trying to run:
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;
set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
SELECT mpn, mfr, partnum, source, ingestdate, max(value) AS value, feature, ingestmonth
FROM dbname.features_archive_tmp
WHERE feature = 'price'
  AND ingestmonth LIKE '20%'
GROUP BY mpn, mfr, partnum, source, ingestdate, feature, ingestmonth;

We found out that Hive 2.x in Cloudera 6.3 uses vectorization, while Hive 1.x in the old Cloudera 5.12 does not.
So setting the property below fixed the issue for us. I have no explanation for this; vectorization should speed up the query, not slow it down.
set hive.vectorized.execution.enabled=false;

Related

hive select query poor performance

I have a hive table into which a few thousand records are inserted every hour. But when I execute select * from <table>, it takes a very long time to run. What is the reason behind this?
Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce execution, also use Hive 2 w/ LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place data into hourly partitions. Then actually query against the partition rather than scanning all the rows/columns because Hive does partition pruning.
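As an illustration of that partition-pruning point, a hedged sketch (the table and column names are hypothetical, not from the post): filtering on the partition column means Hive only reads the matching partitions.
-- Hypothetical table 'events' partitioned by an ingest_hour column.
-- The WHERE clause on the partition column lets Hive skip every other partition.
SELECT id, payload
FROM events
WHERE ingest_hour = '2019-06-01-13';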
Also, HDFS doesn't like files smaller than the HDFS block size (128 MB). Anything smaller means wasted time in map tasks.
I agree with cricket_007 about using the Tez/Spark execution engine. There are also some customizations you can make on your end to improve Hive performance:
Use vectorization, which executes in batches of 1024 rows at once:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use of CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
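Note that the stats-based settings above only help if statistics have actually been gathered. A minimal sketch (the table name is hypothetical):
-- Gather basic and column-level statistics so the CBO can use them.
ANALYZE TABLE my_table COMPUTE STATISTICS;
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;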
It is best practice to partition your data to speed up queries. Partitioning makes Hive run the query on a subset of the data instead of the entire dataset. Partitions may be created as follows:
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then, on the table itself (assuming it's an external table), your CREATE TABLE statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (partition STRING)
LOCATION '/path/to/directory'
You can then query the table and treat the partition as another column.
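For example, reusing the hypothetical names from the sketch above (the backticks are needed because partition is a keyword):
-- Filtering on the partition column also triggers partition pruning.
SELECT *
FROM table_name
WHERE `partition` = 'partition_name';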
If you look at the Hive design and architecture you will see that a typical query has some overhead. A query is translated into code for distributed execution, sent over to the cluster backend, executed there, and then the results are stored and collected for display. This adds latency to every one of your queries, even if the input data and the final result set are small.

Transfer big table from one Hive database to another

I would like to transfer one big table (over 150 million records and 700 columns) from one Hive database to another, with a few transformations along the way: a cast on a date column, a substr on a string column, and one simple CASE statement.
So, something like this:
-- initial settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
CASE WHEN a='Something' THEN 'SOMETHING'
     WHEN a is null THEN 'Missing'
     ELSE a END AS a,
column1,
column2,
...
cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS run_date,
substr(some_string, 1, 3)
FROM databaseB.tableName;
The problem is that this query is going to take a lot of time (about 1 million rows per hour). Does anybody know how to speed it up?
I'm using the MapReduce engine for this task.
Thanks!
As all the data in Hive tables are files on HDFS, why don't you move/copy the files directly into the new table's HDFS location?
Example:
Assuming the table you want to move is already present in db1 as table_to_cpy:
create database db2;
create table db2.table_to_cpy like db1.table_to_cpy;
desc formatted db1.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db1.db/table_to_cpy
desc formatted db2.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db2.db/table_to_cpy
hadoop fs -cp /user/hive/warehouse/db1.db/table_to_cpy/* /user/hive/warehouse/db2.db/table_to_cpy/.
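One caveat (my assumption, not part of the original answer): if the table is partitioned and you copy partition sub-directories this way, the metastore won't know about them until you repair the table.
-- Register copied partition directories with the metastore (only needed for partitioned tables).
MSCK REPAIR TABLE db2.table_to_cpy;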
Few suggestions on how to speed-up your query:
Avoid unix_timestamp() if possible. This function is non-deterministic and prevents proper optimization of queries; it will be executed in each mapper or reducer and may return different values. Use this instead:
current_date() AS run_date
See also this answer for more details: https://stackoverflow.com/a/41140298/2700344
Tune mapper and reducer parallelism. If your process ends up with one big file (20 GB) instead of a few smaller ones, then obviously there is not enough parallelism.
For mappers, play with these settings:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Decrease hive.exec.reducers.bytes.per.reducer to increase the number of reducers.
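For example (the value below is only an illustrative starting point, not a recommendation from the original answer):
-- Fewer bytes per reducer means more reducers; tune to your data volume.
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer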
Play with these settings. The success criteria are more mappers/reducers, and map and reduce stages that run faster.
See this answer for details: https://stackoverflow.com/a/42842117/2700344
Try adding DISTRIBUTE BY partition_col, as sketched below. It will distribute data between reducers according to the partition keys, so each reducer will create fewer partitions and consume less memory. It also helps to avoid too many small output files. This should be used together with hive.exec.reducers.bytes.per.reducer to avoid problems with uneven distribution between reducers and to avoid overly large output files.
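A sketch of what that looks like applied to the query above (the column list is abbreviated, and the dynamic partition column must come last in the SELECT):
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
  CASE WHEN a = 'Something' THEN 'SOMETHING'
       WHEN a IS NULL THEN 'Missing'
       ELSE a END AS a,
  column1,
  column2,
  current_date() AS run_date,
  substr(some_string, 1, 3),
  partition_col  -- dynamic partition column goes last
FROM databaseB.tableName
DISTRIBUTE BY partition_col;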

hive merge properties not working for small files

I am trying to insert data into a dynamically partitioned table, which is creating lots of small files. I have set the Hive properties below, but I still see small files in the partition folders; neither the size-per-task nor the average-file-size setting seems to be working for me, as the files in the partition folders are larger than the size per task I gave.
Any help will be greatly appreciated.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=10000;
set hive.merge.smallfiles.avgsize=100;
Your example shows you setting the average size to 100 bytes, which would create a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to an average of 128 MB (134217728), which should on average increase the size of the files being merged after your job is complete.
set hive.merge.smallfiles.avgsize = 134217728;
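A fuller sketch of the merge settings in byte units (the values are reasonable starting points on my part, not prescriptions from the answer):
-- Merge small output files of both map-only and map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- Target size of merged files, and the average-size threshold below which a merge is triggered.
set hive.merge.size.per.task=268435456;      -- 256 MB
set hive.merge.smallfiles.avgsize=134217728; -- 128 MB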
This can happen when you execute multiple inserts into a single Hive table. A single insert can result in one or more files under the HDFS location.
I have managed this situation by executing the command below - it will compact the table and merge all files into one (or into bigger ones).
There's one restriction though: you can't have indexes on your Hive tables and still execute the merge command.
I have also tested it from Spark SQL over ORC files (1.5.2) and it works fine.
ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE;
Hope it helps.
Working with small files in Hive is a common problem, and it can also be mitigated by using CombineHiveInputFormat as the input format. Also use ORC files by default:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
This will help Hive jobs run faster when the input consists of small files.
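A hedged sketch of making ORC the default (the table name is hypothetical):
-- Make new tables default to ORC storage.
set hive.default.fileformat=ORC;
-- Or state the format explicitly per table.
CREATE TABLE my_table (id INT, payload STRING) STORED AS ORC;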

Not able to apply dynamic partitioning for a huge data set in Hive

I have a table test_details with some 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par with records partitioned on visit_date. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using dynamic partitions, Hive gives up when I try to insert data for a larger number of days. If I do it for 2 or 3 days the MapReduce job runs successfully, but for more days it fails with a Java heap space error or a GC error.
A Simplified Snapshot of my DDLs is as follows:
CREATE TABLE test_details_par (visit_id INT, store_id SMALLINT) PARTITIONED BY (visit_date DATE);
INSERT INTO TABLE test_details_par PARTITION (visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;
I have tried setting these parameters, so that Hive executes my job in a better way:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=10000;
Is there anything that I am missing to run the INSERT for a complete batch without specifying the dates specifically?
Neels,
Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed in Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes out-of-memory errors and crashes. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
You have 3 options as I see it:
Change your job to insert only a few partitions at a time.
Wait for Hive 13 to be released and try that (2-3 months to wait).
If you know how, build Hive from trunk and use it to complete your data load.
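For reference, the Hive 13 behaviour described above later surfaced as a configuration flag; a hedged sketch, assuming Hive 0.13 or later:
-- Sort rows by partition key before writing, so each writer holds only one file open at a time.
set hive.optimize.sort.dynamic.partition=true;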

Hive index rebuild too slow in compare with PostgreSQL

I am trying to compare the same functionality on my PostgreSQL data warehouse and a newly created Hive data warehouse on the same box, with the same data and the same table structure. I am trying to understand the benefits of Hive, but... despite the fact that loading the data into PostgreSQL is 3 times slower, index creation/rebuild on PostgreSQL is 20 times faster, and the index doesn't need to be rebuilt every time as it does in Hive.
My question is: what I am missing in Hive configuration?
My setup is:
CREATE TABLE mytable
(
aa int,
bb string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/spaces/hadoop/hadoopfs';
LOAD DATA LOCAL INPATH '/data/Informix94/spaces/postgres/myfile_big' OVERWRITE INTO TABLE mytable;
CREATE INDEX mytable_indx ON TABLE mytable(aa) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD LOCATION '/data/spaces/hadoop/hadoopfs';
set hive.optimize.autoindex=true;
set hive.optimize.index.filter=true;
alter index mytable_indx ON mytable rebuild;
My box is a VM with 3 GB of RAM, with PostgreSQL running on it and taking ~1 GB of RAM; it serves as the metadata store. I am using the most recent stable versions of CentOS, Hadoop, and Hive, and didn't change the Hive default settings except for the metadata store location and disabling statistics.
The result:
the index rebuild takes 4,798 seconds on 260,000,000 rows, or 80 seconds on 5,000,000 rows.
Hive only works well when your data no longer fits on a single machine, so the results you are seeing are expected. Once you've collected terabytes or petabytes of data, you'll be much happier with Hive. In the use case you describe, PostgreSQL would be a much better match.
