Hive Full Table Scan Issue (Partitioned Columns Used) - performance

I have a BIG table in Hive 0.13 - it has approx 250 GB of data per day. Per hour, it is, hence, approx, 10 GB of data. I have a BI Tool which would like to access this table's data on per day or per hour basis for which I need to test the queries which the BI tool would generate and run on Hive.
One of the queries, when BI is used for daily data for yesterday, looks like below:
select count(*)
from my_table
where
yyyy=year(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and mm=month(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and dd=day(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
;
My Table in Hive in MY_TABLE while YYYY, MM and DD are the partitioned columns in MY_TABLE. It is already stored in ORC Format.
The above query runs for very good amount of time, post which when I see the EXPLAIN EXTENDED output, I clearly see that it is doing a FULL TABLE SCAN of MY_TABLE irrespective of filter conditions.
How can we avoid this issue ?
Kindly advise.
Note again : Hive version is 0.13. We're in middle of an upgrade.
Thanks,
Suddhasatwa
Note:
The solution provided here (Why partitions elimination does not happen for this query?) is not applicable in my case, since I am using Hive 0.13 while CURRENT_DATE function is available only post Hive version 1.+.

Related

Why is hive join taking too long?

I am running a code which basically goes like this:
Create table abc as
select A.* from
table1 A
Left outer join
table2 B
on
A.col1=B.col1 and A.col2=B.col2;
Number of records in table1=7009102
Number of records in table2=1787493
I have similar 6 queries in my script but my script is stuck on the 4th such query. I tried running via tez and mapreduce but both have the same issue.
In mapreduce it is stuck at map 0% nd reduce 0% even after an hour. There are no reducers
In Tez, its only 22% in 1 hour.
Upon checking the logs it shows many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.
I ran the job in tez, and now its almost 3 hours and the job is about to finish with 2 failed in Map-2 Vertice.
General tips to improve Hive queries to run faster
1. Use ORC File
Hive supports ORC file – a new table storage format that sports fantastic speed improvements through techniques like predicate pushdown (pushup in Hive), compression and more.
Using ORCFile for every HIVE table should really be a no-brainer, and extremely beneficial to get fast response times for your HIVE queries.
CREATETABLEA_ORC (
customerIDint, namestring, age int, address string
)
2. Use Vectorization
Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time.
Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:
I. sethive.vectorized.execution.enabled = true;
II. sethive.vectorized.execution.reduce.enabled = true;
3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create a Hive table with partitions by specifying the partition predicates in the ‘WHERE’ clause or the ON clause in a JOIN.
For Example: The table ‘state view’ is partitioned on the column ‘state.’
The below query retrieves rows for only a given state:
Optimizing Joins In Hive
SELECT state_view.* FROM state view WHERE state_view.state= ‘State-1’ AND state_view.state = ‘State-3’;
If a table state view is joined with another table city users, you can specify a range of partitions in the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
Hope this post helped you with all your joins optimization needs in Hive.
Hive use MapReduce and this is the main reason why it's slow, but if you want to find more information see the link bellow
https://community.hortonworks.com/content/supportkb/48808/a-hive-join-query-is-slow-because-it-is-stuck-for.html

HIVE query takes very long time to fetch 20 gb records

Hi I have a hive table on HBASE that has 200gb of records .
I am running simple hive query to fetch 20 gb records .
But this takes around 4 hours of time .
I can not create partition on HIVE table cause it is integrated on HBASE.
Please suggest any idea to improve performance
This is my HIVE query
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FundamentalAnalytic/FundamentalAnalytic_2014.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FundamentalAnalytic where FilePartition='ThirdPartyPrivate' and FilePartitionDate='2014';
If you can, then I think Apache Phoenix will speed things up.
https://phoenix.apache.org/faq.html
Very simple and intuitive to use and super fast.

Loading more records than actual in HIve

While inserting from Hive table to HIve table, It is loading more records that actual records. Can anyone help in this weird behaviour of Hive ?
My query would be looking like this:
insert overwrite table_a
select col1,col2,col3,... from table_b;
My table_b consists of 6405465 records.
After inserting from table_b to table_a, i found total records in table_a are 6406565.
Can any one please help here ?
If hive.compute.query.using.stats=true; then optimizer is using statistics for query calculation instead of querying table data. This is much faster because metastore is a fast database like MySQL and does not require map-reduce. But statistics can be not fresh (stale) if the table was loaded not using INSERT OVERWRITE or configuration parameter hive.stats.autogather responsible for statistics auto gathering was set to false. Also statistics will be not fresh after loading files or after using third-party tools. It's because files was never analyzed, statistics in metastore is not fresh, if you have put new files, nobody knows about how the data was changed. Also after sqoop loading, etc. So, it's a good practice to gather statistics for table or partition after loading using 'ANALYZE TABLE ... COMPUTE STATISTICS'.
In case it's impossible to gather statistics automatically (works for INSERT OVERWRITE) or by running ANALYZE statement then better to switch off hive.compute.query.using.stats parameter. Hive will query data instead of using statistics.
See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive

Hive count(*) query is not invoking mapreduce

I have external tables in hive, I am trying to run select count(*) from table_name query but the query returns instantaneously and gives result which is i think already stored. The result returned by query is not correct. Is there a way to force a map reduce job and make the query execute each time.
Note: This behavior is not followed for all external tables but some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some finding I have got a method that kicks off MR for counting number of records on orc table.
ANALYZE TABLE 'table name' PARTITION('partition columns') COMPUTE STATISTICS;
--OR
ANALYZE TABLE 'table name' COMPUTE STATISTICS;
This is not a direct alternative for count(*) but provides latest count of records in the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, you must add a non-deterministic clause e.g. "where MYKEY is null". Assuming that MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
please try the below :
hive>set hive.fetch.task.conversion=none in your hive session and then trigger select count(*) operation in your hive session to mandate mapreduce

How to improve performance of loading data from NON Partition table into ORC partition table in HIVE

I'm new to Hive Querying, I'm looking for best practices to retrieve data from Hive table. we have enabled TeZ has execution engine and enabled vectorization.
We want to make reporting from Hive table, I read from TEZ document that it can be used for real time reporting. Scenario is from my WEB Application, I would like to show result from Hive Query Select * from Hive table on UI, but for any query, in the hive command prompt takes minimum 20-60 secs even though hive table has 60 GB data ,.
1) Can any one tell me how to show real time reporting by querying Hive table and show results immediately on UI within 10-30 secs
2) Another problem we have identified is, Initially we have Un-Partitioned table pointing to a Blob/File in HDFS,it is of size 60 GB with 200 columns, when we dump the data from Un-Partitioned table to ORC table(ORC table is partitioned), it takes 3 + hrs, Is there a way to improve performance in dumping data into ORC table.
3) When we do querying on Non Partition table with bucketing, inserting to hive table and querying taking less time than select query on ORC table, but has the number of records in hive table increase ORC table's SELECT query is better than table with buckets. Is there a way to improve performance for small data sets also. Since it is initial phase, every month we load 50 GB data into Hive table. but it can increase, we looking improve performance of loading data into Orc partitioned table.
4) TEZ supports interactive, less latency and drill down support for reports. How to enable my drill down reports to get data from Hive ( which should be interactive) within in Human response time i.e 5-40 sec.
we are testing with 4 Nodes each Node is having 4 cpu cores and 7 GB RAM and 3 disk attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data to ORC table, you can try playing around with following parameters:
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
Also, you might see, whether compression might also help you. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
Hope it helps!
First of all. HIVE is not meant for real time data processing. No matter how small the data may be the query will take a while to return data.
Real power of hive lies in batch processing huge amount of data.

Resources