hive select query poor performance - hadoop

I have a Hive table into which a few thousand records are inserted every hour. But when I execute select * from <table>, it takes a very long time to run. What is the reason for this?

Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce as the execution engine, use Hive 2 with LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place the data into hourly partitions, then actually query against a specific partition rather than scanning all the rows and columns, because Hive does partition pruning.
Also, HDFS doesn't like files smaller than the HDFS block size (128 MB); anything smaller means wasted time in map tasks.
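For example, a minimal sketch (the table, column, and partition names here are illustrative, not from the question) of an hourly-partitioned ORC table, and a query that lets Hive prune down to a single partition:
CREATE TABLE events (
  id string,
  payload string
)
PARTITIONED BY (load_hour string)
STORED AS ORC;
-- scans only one partition directory instead of the whole table
SELECT id, payload
FROM events
WHERE load_hour = '2019-01-01-00';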

I agree with @cricket_007 about using the Tez/Spark execution engine. There are also some customizations you can make on your end to improve Hive performance:
Use vectorization, which executes operations in batches of 1024 rows at once:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use the cost-based optimizer (CBO) and statistics:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
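Note that the stats-based settings above only pay off once statistics actually exist; they can be gathered with ANALYZE TABLE (the table and column names here are placeholders):
ANALYZE TABLE my_table COMPUTE STATISTICS;
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS col1, col2;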

It is best practice to partition your data to speed up queries. Partitioning makes Hive run the query on a subset of the data instead of the entire dataset. Creating partitions may be done as follows:
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then on the table itself (assuming it's an external table), your create table statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (partition string)
LOCATION '/path/to/directory'
You can then query the table and treat the partition column like any other column.
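For example, a query against the table sketched above that prunes to a single partition might look like:
SELECT * FROM table_name WHERE `partition` = 'partition_name';
If the partition directories already exist on HDFS, you may also need to run MSCK REPAIR TABLE table_name; (or ALTER TABLE ... ADD PARTITION ...) so the metastore learns about them.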

If you look at the Hive design and architecture, you will see that a typical query has some overhead: the query is translated into code for distributed execution, sent over to the cluster backend, executed there, and the results are then stored and collected for display. This adds latency to every one of your queries, even if the input data and the final result set are small.

Related

Transfer big table from one Hive database to another

I would like to transfer one big table (over 150 million records and 700 columns) from one Hive database to another, with a few transformations along the way: one cast on a date column, a substr on a string column, and one simple case statement.
So, something like this:
-- initial settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
CASE WHEN a='Something' THEN 'SOMETHING'
WHEN a is null THEN 'Missing'
ELSE a END AS a,
column1,
column2,
...
cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS run_date,
substr(some_string, 1, 3),
partition_col -- the dynamic partition column must come last in the SELECT
FROM databaseB.tableName;
The problem is that this query takes a lot of time (about 1 million rows per hour). Does anybody know how to speed it up?
I'm using the MapReduce engine for this task.
Thanks!
As all the data in Hive tables is just files on HDFS, why don't you move/copy the files directly into the new table's HDFS location?
Example:
Assuming the table you want to move is already present in db1 as table_to_cpy:
create database db2;
create table db2.table_to_cpy like db1.table_to_cpy;
desc formatted db1.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db1.db/table_to_cpy
desc formatted db2.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db2.db/table_to_cpy
hadoop fs -cp /user/hive/warehouse/db1.db/table_to_cpy/* /user/hive/warehouse/db2.db/table_to_cpy/.
A few suggestions on how to speed up your query:
Avoid unix_timestamp() if possible. This function is non-deterministic and prevents proper optimization of queries; it will be executed in each mapper or reducer and may return different values. Use instead:
current_date() AS run_date
See also this answer for more details: https://stackoverflow.com/a/41140298/2700344
Tune mapper and reducer parallelism. If your process ends up with one big file (20 GB) instead of a few smaller ones, then obviously there is not enough parallelism.
For mappers, play with these settings:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Decrease hive.exec.reducers.bytes.per.reducer to increase the number of reducers.
Play with these settings. The success criterion is more mappers/reducers, with your map and reduce stages running faster.
See this answer for details: https://stackoverflow.com/a/42842117/2700344
Try to add DISTRIBUTE BY partition_col. It will distribute data between reducers according to the partition keys, so each reducer will create fewer partitions and consume less memory. It also helps to avoid too many small output files. This should be used together with hive.exec.reducers.bytes.per.reducer to avoid both uneven distribution between reducers and too-big output files.
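Putting these suggestions together, a hedged sketch of the rewritten INSERT (the column names are the asker's; the reducer size is just an illustrative value):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.reducers.bytes.per.reducer=67108864; -- illustrative: 64 MB per reducer
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
  CASE WHEN a='Something' THEN 'SOMETHING'
       WHEN a is null THEN 'Missing'
       ELSE a END AS a,
  column1,
  column2,
  -- ... remaining columns ...
  cast(current_date() AS string) AS run_date, -- deterministic, unlike unix_timestamp()
  substr(some_string, 1, 3),
  partition_col -- the dynamic partition column must come last
FROM databaseB.tableName
DISTRIBUTE BY partition_col;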

Why is hive join taking too long?

I am running a query which basically goes like this:
CREATE TABLE abc AS
SELECT A.*
FROM table1 A
LEFT OUTER JOIN table2 B
ON A.col1 = B.col1 AND A.col2 = B.col2;
Number of records in table1=7009102
Number of records in table2=1787493
I have 6 similar queries in my script, but my script is stuck on the 4th such query. I tried running via Tez and MapReduce, but both have the same issue.
In MapReduce it is stuck at map 0% and reduce 0% even after an hour, and there are no reducers.
In Tez, it's only at 22% after 1 hour.
Upon checking the logs, I see many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.
I ran the job in Tez, and now after almost 3 hours the job is about to finish, with 2 failed tasks in the Map-2 vertex.
General tips to make Hive queries run faster
1. Use ORC File
Hive supports ORC files – a columnar table storage format that sports fantastic speed improvements through techniques like predicate pushdown, compression, and more.
Using ORC for every Hive table should really be a no-brainer, and extremely beneficial for getting fast response times for your Hive queries.
CREATE TABLE A_ORC (
  customerID int, name string, age int, address string
)
STORED AS ORC;
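Populating the ORC table from an existing row-oriented table (a hypothetical table A with the same schema) is then a plain insert-select:
INSERT INTO TABLE A_ORC SELECT * FROM A;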
2. Use Vectorization
Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time.
Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create Hive tables with partitions and specify the partition predicates in the WHERE clause or the ON clause of a JOIN.
For example: the table 'state_view' is partitioned on the column 'state'.
The below query retrieves rows for only the given states:
SELECT state_view.* FROM state_view WHERE state_view.state = 'State-1' OR state_view.state = 'State-3';
If the table state_view is joined with another table, city_users, you can specify the partition column in the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
Hope this post helped you with all your join optimization needs in Hive.
Hive uses MapReduce, and this is the main reason why it's slow; if you want to find more information, see the link below:
https://community.hortonworks.com/content/supportkb/48808/a-hive-join-query-is-slow-because-it-is-stuck-for.html

Hive query optimisation

I have to perform an incremental load into an internal table from an external table in Hive, where the source data file is appended with new records on a daily basis. The new records can be filtered out based on the timestamp (column load_ts in the table) at which they were loaded. I am trying to achieve this by selecting the records from the source table whose load_ts is greater than the current max(load_ts) in the target table, as given below:
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
But the above query does not give the desired output: it takes a very long time to execute (which should not be the case with the MapReduce paradigm).
I have tried other scenarios as well, like passing the max(load_ts) as a variable instead of joining. Still no improvement in performance. It would be very helpful if anyone could give their insights as to what is possibly incorrect in this approach, or any alternate solutions.
First of all, the map/reduce model does not guarantee that your queries will run faster. The main idea is that performance scales roughly linearly with the number of nodes, but you still have to think about how you're doing things, more so than in normal SQL.
The first thing to check is whether the source table is partitioned by time. If not, it should be, since otherwise you'd be reading the whole table every single time.
Second, you're also calculating the max every time, again on the whole destination table. You could make this a lot faster by calculating the max only on the last partition, so change this:
JOIN (select max(load_ts) max_load_ts from target_temp) mt
to this (you didn't write the partition column, so I am going to assume it's called 'dt'):
JOIN (select max(load_ts) max_load_ts from target_temp WHERE dt=PREVIOUS_DATA_DT) mt
since we know the max load_ts is going to be in the last partition.
Otherwise, it's hard to help without knowing the structure of the source table and, as somebody else commented, the sizes of the two tables.
A JOIN is slower than a variable in the WHERE clause. But the main problem with performance here is that your query performs a full scan of both the target table and the source table. I would recommend:
Query only the latest partition for max(load_ts).
Enable statistics gathering and usage
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.autogather=true;
Compute column statistics on both tables.
Statistics will make queries like selecting MAX(partition) or max(ts) execute faster.
Try to put source partition files into the target partition folder instead of using INSERT, if applicable (the target and source tables' partitioning and storage format should permit this). It works fine, for example, for the textfile storage format when a source table partition contains only rows > max(target_partition). You can combine the copy-files method (for those source partitions that contain exactly the rows to be inserted, without filtering) with INSERT (for partitions containing mixed data that needs filtering).
Hive may be merging your files during INSERT. This merge phase takes additional time and adds an additional stage to the job. Check the hive.merge.mapredfiles option and try switching it off.
And of course, use a pre-calculated variable instead of the join, as sketched below.
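A sketch of that variable approach from a shell wrapper (the script layout and the dt filter are assumptions; --hivevar is the standard Hive CLI substitution mechanism):
# grab the max timestamp from the latest partition only
max_ts=$(hive -S -e "SELECT max(load_ts) FROM target_temp WHERE dt='PREVIOUS_DATA_DT';")
# pass it in as a literal, so no join is needed
hive --hivevar max_load_ts="${max_ts}" -e '
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms
WHERE ms.load_ts > "${hivevar:max_load_ts}";'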
Use the cost-based optimization technique by enabling the properties below:
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;
Also analyze the tables:
ANALYZE TABLE temp_db.source_temp COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
ANALYZE TABLE target_temp PARTITION(DATA_DT) COMPUTE STATISTICS;

Tuning Hive Queries That Use an Underlying HBase Table

I've got a table in HBase, let's say "tbl", and I would like to query it using Hive. Therefore I mapped the table to Hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");
Queries like:
select * from tbl
select id from tbl
select id, data from tbl
are really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345"
select id from tbl where data["777"] IS NOT NULL
are incredibly slow.
On the contrary, when running from the HBase shell:
scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346' }
or
scan 'tbl', { COLUMNS=>'data', FILTER => FilterList.new([qualifierFilter('777')]) }
it is lightning fast!
When I looked into the mapred job generated by Hive on the JobTracker, I discovered that "map.input.records" counts ALL the items in the HBase table, meaning the job makes a full table scan before it even starts any mappers!
Moreover, I suspect it copies all the data from the HBase table to a temporary HDFS input folder for the mappers before execution.
So, my questions are: why does the HBase storage handler for Hive not translate Hive queries into the appropriate HBase functions? Why does it scan all the records and only then filter them using the WHERE clause? How can this be improved?
Any suggestions to improve the performance of Hive queries (mapped to HBase tables)?
Can we create secondary indexes on HBase tables?
We are using the HBase and Hive integration and trying to tune the performance of Hive queries.
Lots of questions! I'll try to answer them all and give you a few performance tips:
The data is not copied to HDFS, but the MapReduce jobs generated by Hive will store their intermediate data in HDFS.
Secondary indexes or alternative query paths are not supported by HBase (more info).
Hive translates everything into MapReduce jobs, which need time to be distributed and initialized. If you have a very small number of rows, it's possible that a simple SCAN operation in the HBase shell is faster than a Hive query, but on big datasets, distributing the job among the DataNodes is a must.
The Hive HBase handler doesn't do a very good job of extracting the start & stop row keys from the query; queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.
Before executing your queries, run an EXPLAIN [your_query]; command and check for the filterExpr: part; if you don't find it, your query will perform a full table scan. On a side note, all expressions within the Filter Operator: will be transformed into the appropriate filters.
EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
tbl
TableScan
alias: tbl
filterExpr:
expr: ((id>= '12345') and (id < '12346'))
type: boolean
Filter Operator
....
Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes: just convert substr(id, 0, 5) = "12345" into a simpler query, id>="12345" AND id<"12346". It will be detected by the handler, and start & stop row keys will be provided to the SCAN (12345, 12346).
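So, under that rewrite, the slow prefix query from the question becomes:
select id from tbl where id >= "12345" and id < "12346";
and EXPLAIN on it should now show a filterExpr: entry like the one above.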
Now, here are a few tips in order to speed up your queries (by a lot):
Make sure you set the following properties to take advantage of batching, reducing the number of RPC calls (the right number depends on the size of your columns):
SET hbase.scan.cache=10000;
SET hbase.client.scanner.cache=10000;
Make sure you set the following properties to run a distributed job on your task trackers instead of a local job:
SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;
SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];
Reduce the number of columns in your SELECT statement to the minimum. Try not to SELECT *.
Whenever you want to use start & stop row keys to prevent full table scans, always provide key>=x and key<y expressions (don't use the BETWEEN operator).
Always EXPLAIN SELECT your queries before executing them.

How to improve performance of loading data from NON Partition table into ORC partition table in HIVE

I'm new to Hive querying, and I'm looking for best practices for retrieving data from Hive tables. We have enabled Tez as the execution engine and enabled vectorization.
We want to do reporting from a Hive table; I read in the Tez documentation that it can be used for real-time reporting. The scenario is: from my web application, I would like to show the result of a Hive query (select * from the Hive table) on the UI, but any query at the Hive command prompt takes a minimum of 20-60 secs, even though the Hive table has only 60 GB of data.
1) Can anyone tell me how to do real-time reporting by querying a Hive table and showing the results immediately on the UI, within 10-30 secs?
2) Another problem we have identified: initially we have an un-partitioned table pointing to a blob/file in HDFS; it is 60 GB in size with 200 columns. When we dump the data from the un-partitioned table into the ORC table (the ORC table is partitioned), it takes 3+ hrs. Is there a way to improve the performance of dumping data into the ORC table?
3) When we query a non-partitioned table with bucketing, inserting into the Hive table and querying take less time than a select query on the ORC table; but as the number of records in the Hive table increases, the ORC table's SELECT query performs better than the bucketed table. Is there a way to improve performance for small data sets as well? Since this is the initial phase, we load 50 GB of data into the Hive table every month, but this can increase; we are looking to improve the performance of loading data into the ORC partitioned table.
4) Tez supports interactive, low-latency queries and drill-down support for reports. How can I enable my drill-down reports to get data from Hive (which should be interactive) within human response time, i.e. 5-40 sec?
We are testing with 4 nodes; each node has 4 CPU cores, 7 GB RAM, and 3 disks attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data into the ORC table, you can try playing around with the following parameters (an example follows the list):
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
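For example (the values below are illustrative starting points, not tuned recommendations):
SET hive.exec.orc.default.stripe.size=268435456; -- 256 MB stripes
SET hive.exec.orc.default.buffer.size=262144; -- 256 KB write buffers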
Also, you might check whether compression helps. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
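Putting it together, a hedged sketch of the load into the partitioned ORC table (the table and column names here are hypothetical) could look like:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orc_table PARTITION (data_dt)
SELECT col1, col2, data_dt
FROM staging_table
DISTRIBUTE BY data_dt; -- groups each partition's rows onto the same reducer, avoiding many small ORC files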
Hope it helps!
First of all, Hive is not meant for real-time data processing. No matter how small the data may be, the query will take a while to return it.
The real power of Hive lies in batch processing huge amounts of data.
