Why is hive join taking too long? - hadoop

I am running a script with a query which basically goes like this:
CREATE TABLE abc AS
SELECT A.*
FROM table1 A
LEFT OUTER JOIN table2 B
  ON A.col1 = B.col1 AND A.col2 = B.col2;
Number of records in table1=7009102
Number of records in table2=1787493
I have 6 similar queries in my script, but the script is stuck on the 4th one. I tried running it via Tez and via MapReduce, but both have the same issue.
In MapReduce it is stuck at map 0% and reduce 0% even after an hour. There are no reducers.
In Tez, it is only 22% done after 1 hour.
Upon checking the logs, I see many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.
I re-ran the job in Tez, and now, almost 3 hours in, the job is about to finish with 2 failed task attempts in the Map-2 vertex.

General tips to make Hive queries run faster
1. Use ORC File
Hive supports the ORC file format – a columnar table storage format that delivers substantial speed improvements through techniques like predicate pushdown, compression and more.
Using ORC for every Hive table should really be a no-brainer, and it is extremely beneficial for getting fast response times from your Hive queries.
CREATE TABLE A_ORC (
  customerID int, name string, age int, address string
)
STORED AS ORC;
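To populate it from an existing (non-ORC) table A, a straight copy like the one below should work; this sketch assumes A has the same columns:
INSERT INTO TABLE A_ORC SELECT * FROM A;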
2. Use Vectorization
Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time.
Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create Hive tables with partitions and restrict the scan by specifying partition predicates in the WHERE clause or in the ON clause of a JOIN.
For example: the table 'state_view' is partitioned on the column 'state.'
The below query retrieves rows for only the given states:
SELECT state_view.* FROM state_view WHERE state_view.state = 'State-1' OR state_view.state = 'State-3';
If the table state_view is joined with another table city_users, you can restrict the partitions involved via the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
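For reference, a hypothetical DDL showing how such a table could be partitioned on 'state' (the non-partition columns are assumptions):
CREATE TABLE state_view (user_id int, city string)
PARTITIONED BY (state string)
STORED AS ORC;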
Hope this post helped you with all your joins optimization needs in Hive.

Hive uses MapReduce and this is the main reason why it's slow, but if you want to find more information see the link below:
https://community.hortonworks.com/content/supportkb/48808/a-hive-join-query-is-slow-because-it-is-stuck-for.html

Related

hive select query poor performance

I have a Hive table into which a few thousand records get inserted every hour. But when I execute select * from <table>, it takes a very long time to execute. What is the reason behind this?
Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce execution, also use Hive 2 w/ LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place data into hourly partitions. Then actually query against the partition rather than scanning all the rows/columns because Hive does partition pruning.
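For example (the table and partition column names here are hypothetical), a pruned query would look like:
SELECT col1, col2 FROM my_table WHERE dt = '2019-01-01' AND hr = '05';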
Also, HDFS doesn't like files smaller than the HDFS block size (128 MB by default). Anything smaller means wasted time in map tasks.
I agree with @cricket_007 about using the Tez/Spark execution engine. There are some customizations you can do on your end to achieve performance in Hive:
Use of vectorization which executes in batches of 1024 rows at once
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use of CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
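Note that CBO relies on statistics being available; a hedged sketch of collecting them (my_table is a placeholder):
ANALYZE TABLE my_table COMPUTE STATISTICS;
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;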
It is best practice to partition your data to speed up queries. Partitioning makes Hive run the query on a subset of the data instead of the entire dataset. Partitions may be created as follows:
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then for the table itself (assuming it's an external table) your CREATE TABLE statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (partition string)
LOCATION '/path/to/directory'
You can then query the table and treat the partition as another column.
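For example (all names are placeholders), once the partition folders exist you can register them and then filter on the partition column so Hive prunes the rest:
MSCK REPAIR TABLE table_name;  -- picks up partition folders already on disk
SELECT * FROM table_name WHERE partition = 'partition_name';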
If you look at the Hive design and architecture you will see that a typical query has some overhead. A query is translated into code for distributed execution, sent over to the cluster backend, executed there, and then the results are stored and collected for display. This adds latency to every one of your queries even if the input data and the final result set are small.

Few Hive Interview Questions

I have some questions which I faced recently in the interview with a company. As I am a newbie in Hadoop, can anyone please tell me the right answers?
Questions:
Difference between "Sort By" and "Group by" in Hive. How they work?
If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
How to optimize Hive Performance?
Difference between "Internal Table" and "External Table"
What is the main difference between Hive and SQL
Please point me to a few useful resources, so that I can learn in a better way. Thanks
PFB the answers:
1. Difference between "Sort By" and "Group by" in Hive. How they work?
Ans. SORT BY sorts the data per reducer; it only guarantees ordering of the rows within each reducer. If there is more than one reducer, "sort by" may give a partially ordered final result.
GROUP BY, on the other hand, aggregates records by the specified columns, which allows you to perform aggregation functions (such as SUM, COUNT, AVG, etc.) on the non-grouped columns.
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
Ans. I think the reducer will run, because per the Hive documentation --
LIMIT indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random:
SELECT * FROM t1 LIMIT 5
Since it has to pick rows at random, it needs the complete result output from the reducer.
3. How to optimize Hive Performance?
Ans. These links should answer this
5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER
5 Tips for efficient Hive queries with Hive Query Language
4. Difference between "Internal Table" and "External Table"
Ans. "Internal Table" also known as Managed Table, is the one that is managed by Hive. When you point data in HDFS to such table, the data is moved to Hive default location /ust/hive/warehouse/. And, then if such internal table is dropped, the data is deleted along with.
"External table" on the other hand is user managed, and data is not moved to hive default directory after loading i.e, any custom location can be specified. Consecutively, when you drop such table, no data is deleted, only table schema is dropped.
5. What is the main difference between Hive and SQL
Ans. Hive is a data warehousing layer on top of Hadoop that provides an SQL-like table interface to users for analyzing the underlying data. It employs the HiveQL (HQL) language for this, which is loosely based on the SQL-92 standard.
SQL is a standard RDBMS language for accessing and manipulating databases.
I am new to Hadoop and Hive as well so I can't give you a complete answer.
From what I've read in the book "Hadoop The Definitive Guide" the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot about Java (and SQL is well known by data professionals anyway).
As time has gone on, Hive has become more compliant with the SQL standard. It blends a mix of MySQL's and Oracle's SQL dialects with SQL-92.
The Main Difference
From what I've read, the biggest difference is that an RDBMS has a schema that is typically enforced on write (schema-on-write). This means that data needs to conform to the schema when you load it into the database. Hive uses schema-on-read because it doesn't verify the data when it is loaded.
Information obtained from Hadoop The Definitive Guide
Really good book and gives a good overview of all the technologies involved.
EDIT:
For external and internal tables, check out this response:
Difference between Hive internal tables and external tables?
Information regarding Sort By and Group By
Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
(Taken from the link provided maybe this will help with the difference between Group By and Sort By)
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone on a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field, while SORT BY, if there are multiple reducers, partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
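A hedged illustration of the distinction (t1 and col1 are placeholders):
SELECT * FROM t1 SORT BY col1;     -- rows sorted within each reducer only
SELECT * FROM t1 CLUSTER BY col1;  -- rows distributed by col1, then sorted within each reducer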
Group By:
GROUP BY is used for aggregation. It works pretty much the same as it would in any other SQL dialect.
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
This query selects pv_users.gender and counts the distinct userids from the pv_users table. In order to count the users of a given gender, you first have to group all the users of that gender together. (Query taken from the GROUP BY link below.)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
Information on Optimizing Hive Performance
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Optimizing Joins
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/
General Hive Performance Tips
https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips
Some extra resources
SQL to Hive Cheat Sheet
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Hive LIMIT Documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
Best of luck in your interview!
From Hive 0.10.0, a simple select statement, such as select column_name from table_name LIMIT n, can avoid MapReduce entirely if the fetch task conversion property hive.fetch.task.conversion=more is set.
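For example (table and column names are placeholders):
set hive.fetch.task.conversion=more;
select column_name from table_name limit 5;  -- served as a fetch task, no MapReduce job launched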
1. Difference between "Sort By" and "Group by" in Hive. How they work?
SORT BY: It sorts the result within each reducer defined for the MapReduce job. It's not necessary that the overall output be in sorted order, but the output coming from each reducer will be in order. (I verified this on an 11-node cluster.)
GROUP BY: It helps in aggregation of the data. sum(), count(), avg(), max(), min(), collect_list(), collect_set() all use GROUP BY. It's like clubbing result rows based on a common feature. Example: if there is a state column and a population column and we aggregate on state, there would be one row per distinct state (say, 29 of them) with sum(population).
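A minimal sketch of such an aggregation (the census table and its columns are assumptions):
SELECT state, SUM(population) AS total_population
FROM census
GROUP BY state;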
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
select * from db.table limit 1: this statement never needs reducers; you can check by using the EXPLAIN statement.
select * from db.table order by column: this uses reducers, as does any query with an aggregation. You can verify it with EXPLAIN, as sketched below.
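For example (db.table and col1 are placeholders):
EXPLAIN select * from db.table limit 1;        -- plan shows no reduce stage
EXPLAIN select * from db.table order by col1;  -- plan includes a reduce stage for the global sort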
3. How to optimize Hive Performance?
Using a Tez session
Using bucketing and partitioning
Using the ORC file format
Using vectorization
Using CBO
4. Difference between "Internal Table" and "External Table"
Internal table: Both the metadata and the data are stored in Hive. If one deletes the table, the entire schema and the data are automatically deleted.
External table: Only the metadata is handled by Hive; the data is handled by the user. If one deletes the table, only the schema is deleted and the data remains intact. To create an external table, one needs to use the EXTERNAL keyword in the CREATE statement and also specify the location where the data resides.
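A minimal sketch of such a DDL (the table, columns and location are hypothetical):
CREATE EXTERNAL TABLE sales (id int, amount double)
LOCATION '/data/sales';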
5. What is the main difference between Hive and SQL
Hive is a data warehouse tool designed to process structured data on Hadoop, while SQL is used to process structured data on an RDBMS.
The reducer will not run if we use LIMIT alone in the select clause:
select * from table_name limit 5;

Hive count(*) query is not invoking mapreduce

I have external tables in Hive, and I am trying to run a select count(*) from table_name query, but the query returns instantaneously and gives a result which, I think, is already stored. The result returned by the query is not correct. Is there a way to force a MapReduce job and make the query execute each time?
Note: This behavior is not followed for all external tables but some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off MR for counting the number of records of an ORC table.
ANALYZE TABLE table_name PARTITION(partition_column) COMPUTE STATISTICS;
--OR
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative for count(*) but it provides an up-to-date count of records in the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce job for count(*) of an ORC table since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
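For example (the file path is a placeholder):
hive --orcfiledump /apps/hive/warehouse/db.db/table_name/000000_0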
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows in the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
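A hedged sketch of that workaround (the table name is a placeholder):
SELECT COUNT(*) FROM my_orc_table WHERE 1=1;  -- dummy clause defeats the metadata shortcut and forces a real count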
With 0.14 the optimizer got smarter, you must add a non-deterministic clause e.g. "where MYKEY is null". Assuming that MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the following: set hive.fetch.task.conversion=none in your Hive session and then trigger the select count(*) operation again to mandate MapReduce.
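For example (the table name is a placeholder):
set hive.fetch.task.conversion=none;
select count(*) from table_name;  -- per the suggestion above, this should now launch a job rather than answer from stored stats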

Tuning Hive Queries That Use an Underlying HBase Table

I've got a table in HBase, let's say "tbl", and I would like to query it using Hive. Therefore I mapped the table to Hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");
Queries like:
select * from tbl
select id from tbl
select id, data from tbl
are really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345"
select id from tbl where data["777"] IS NOT NULL
are incredibly slow.
On the contrary, when running from the HBase shell:
scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346' }
or
scan 'tbl', { COLUMNS=>'data', "FILTER" => FilterList.new([qualifierFilter('777')]) }
it is lightning fast!
When I looked into the mapred job generated by Hive on the JobTracker, I discovered that "map.input.records" counts ALL the items in the HBase table, meaning the job makes a full table scan before it even starts any mappers! Moreover, I suspect it copies all the data from the HBase table to the mappers' tmp input folders on HDFS before execution.
So, my questions are: Why does the HBase storage handler for Hive not translate Hive queries into the appropriate HBase functions? Why does it scan all the records and only then slice them using the "where" clause? How can this be improved?
Any suggestions to improve the performance of Hive queries mapped to HBase tables?
Can we create a secondary index on HBase tables?
We are using the HBase and Hive integration and trying to tune the performance of Hive queries.
Lots of questions! I'll try to answer all of them and give you a few performance tips:
The data is not copied to HDFS, but the MapReduce jobs generated by Hive will store their intermediate data in HDFS.
Secondary indexes or alternative query paths are not supported by HBase (more info).
Hive translates everything into MapReduce jobs which need time to be distributed and initialized. If you have a very small number of rows, it's possible that a simple SCAN operation in the HBase shell is faster than a Hive query, but on big datasets, distributing the job among the datanodes is a must.
The Hive HBase handler doesn't do a very good job of extracting the start & stop row keys from the query; queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.
Before executing your queries, run an EXPLAIN [your_query]; command and check for the filterExpr: part; if you don't find it, your query will perform a full table scan. On a side note, all expressions within the Filter Operator: will be transformed into the appropriate filters.
EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        tbl
          TableScan
            alias: tbl
            filterExpr:
                expr: ((id >= '12345') and (id < '12346'))
                type: boolean
            Filter Operator
              ....
Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes: just convert substr(id, 0, 5) = "12345" into the simpler predicate id>="12345" AND id<"12346". It will be detected by the handler, and start & stop row keys will be provided to the SCAN (12345, 12346).
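A sketch of the rewritten query, using the values from the question:
SELECT id FROM tbl WHERE id >= "12345" AND id < "12346";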
Now, here are a few tips in order to speed up your queries (by a lot):
Make sure you set the following properties to take advantage of batching to reduce the number of RPC calls (the number depends on the size of your columns)
SET hbase.scan.cache=10000;
SET hbase.client.scanner.cache=10000;
Make sure you set the following properties to run a distributed job on your task trackers instead of running a local job.
SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;
SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];
Reduce the number of columns in your SELECT statement to the minimum. Try not to SELECT *.
Whenever you want to use start & stop row keys to prevent full table scans, always provide key>=x and key<y expressions (don't use the BETWEEN operator)
Always EXPLAIN SELECT your queries before executing them.

How to improve performance of loading data from a non-partitioned table into an ORC partitioned table in Hive

I'm new to Hive querying, and I'm looking for best practices for retrieving data from a Hive table. We have enabled Tez as the execution engine and enabled vectorization.
We want to build reporting from the Hive table. I read in the Tez documentation that it can be used for real-time reporting. The scenario: from my web application, I would like to show the result of the Hive query select * from Hive table on the UI, but any query at the hive command prompt takes a minimum of 20-60 secs, even though the hive table has 60 GB of data.
1) Can anyone tell me how to show real-time reports by querying the Hive table and showing the results immediately on the UI, within 10-30 secs?
2) Another problem we have identified: initially we have an un-partitioned table pointing to a blob/file in HDFS; it is 60 GB in size with 200 columns. When we dump the data from the un-partitioned table into the ORC table (the ORC table is partitioned), it takes 3+ hrs. Is there a way to improve the performance of dumping data into the ORC table?
3) When we query the non-partitioned table with bucketing, inserting into the hive table and querying take less time than a select query on the ORC table, but as the number of records in the hive table increases, the ORC table's SELECT query beats the bucketed table. Is there a way to improve performance for small data sets also? Since it is the initial phase, every month we load 50 GB of data into the Hive table, but this can increase; we are looking to improve the performance of loading data into the ORC partitioned table.
4) Tez supports interactive, low-latency queries and drill-down support for reports. How do I enable my drill-down reports to get data from Hive (which should be interactive) within human response time, i.e. 5-40 sec?
We are testing with 4 nodes; each node has 4 CPU cores, 7 GB RAM and 3 disks attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data into the ORC table, you can try playing around with the following parameters:
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
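For example (these values are purely illustrative, not tuned recommendations):
SET hive.exec.orc.default.stripe.size=67108864;  -- 64 MB stripes
SET hive.exec.orc.default.buffer.size=262144;    -- 256 KB buffers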
Also, you might check whether compression helps you. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
Hope it helps!
First of all, Hive is not meant for real-time data processing. No matter how small the data may be, the query will take a while to return data.
The real power of Hive lies in batch processing huge amounts of data.
