Any ideas for a count job on Hive? - hadoop

We are having some issues with our data stored in Hive. We have more than 50 tables holding PBs of data. In order to fix the issues, we take Hive counts and then analyze them, so I have to spend approximately 2-3 hours every day on the count job since our tables are huge.
I am just wondering if there are any tools, applications, or ideas to reduce the amount of time spent on the count job.
I could not find anything on Google about this.

You have two options:
If you want the correct, actual count, you can use PySpark or Spark. Write the count like select count(1) from mytable and do not use count(*).
But this can give you performance problems if you have a PB-sized table.
If a close-enough count will do, you can use show table stats mytab, which shows the row count (#Rows). If your Hive system is set to gather table stats daily or regularly, you will get a count that is close to the actual count. If your table is partitioned, you need to add all the partitions up.
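For reference, a minimal HiveQL sketch of the stats route (table names are illustrative); the row count shows up as numRows in the table or partition parameters:
-- unpartitioned table: gather basic stats, then read numRows back
ANALYZE TABLE mytable COMPUTE STATISTICS;
DESCRIBE FORMATTED mytable;
-- partitioned table: stats are kept per partition, so gather them per partition and add them up
ANALYZE TABLE mytable_part PARTITION (dt) COMPUTE STATISTICS;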

Related

Hive select query poor performance

I have a Hive table into which a few thousand records are inserted every hour. But when I execute select * from <table>, it takes a very long time to execute. What is the reason behind this?
Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce execution, use Hive 2 with LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place the data into hourly partitions, then query against a specific partition rather than scanning all rows and columns, because Hive does partition pruning.
Also, HDFS doesn't like files smaller than the HDFS block size (128 MB); anything smaller means wasted time in map tasks.
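A minimal sketch of those suggestions combined (the table, column, and partition names are illustrative):
set hive.execution.engine=tez;
CREATE TABLE events_orc (
  id bigint, payload string
)
PARTITIONED BY (dt string, hr string)
STORED AS ORC;
-- filter on one hourly partition so Hive can prune the rest
SELECT count(*) FROM events_orc WHERE dt = '2019-01-01' AND hr = '07';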
I agree with @cricket_007 about using the Tez/Spark execution engine. There are some customizations you can make on your end to get better performance out of Hive:
Use vectorization, which executes operations in batches of 1024 rows at once:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use the cost-based optimizer (CBO):
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
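Note that the stats-based flags above only pay off if statistics have actually been gathered; a one-line sketch, assuming Hive 0.14+ and an illustrative table name:
-- compute column-level statistics so the CBO has data to work with
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS;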
It is best practice to partition your data to speed up queries. Partitioning makes Hive run the query on a subset of the data instead of the entire dataset. Partitions can be created as follows.
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then, on the table itself (assuming it's an external table), your create table statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (`partition` string)
LOCATION '/path/to/directory';
You can then query the table and treat the partition as another column.
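For instance, after new partition directories land, a short sketch (assuming the table definition above; the value 'partition_name' is illustrative):
-- register partition directories created outside of Hive in the metastore
MSCK REPAIR TABLE table_name;
-- filter on the partition column so only the matching directory is read
SELECT * FROM table_name WHERE `partition` = 'partition_name';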
If you look at the Hive design and architecture, you will see that a typical query has some overhead: the query is translated into code for distributed execution, sent over to the cluster backend, executed there, and the results are then stored and collected for display. This adds latency to each of your queries, even when the input data and the final result set are small.

Why is a Hive join taking too long?

I am running a query which basically goes like this:
CREATE TABLE abc AS
SELECT A.*
FROM table1 A
LEFT OUTER JOIN table2 B
  ON A.col1 = B.col1 AND A.col2 = B.col2;
Number of records in table1=7009102
Number of records in table2=1787493
I have 6 similar queries in my script, but my script is stuck on the 4th such query. I tried running it via Tez and via MapReduce, but both have the same issue.
In MapReduce it is stuck at map 0% and reduce 0% even after an hour, and there are no reducers.
In Tez, it is only at 22% after 1 hour.
Upon checking the logs, I see many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.
I ran the job in Tez, and now, after almost 3 hours, the job is about to finish with 2 failed attempts in the Map-2 vertex.
General tips to make Hive queries run faster:
1. Use the ORC file format
Hive supports ORC files, a table storage format that brings big speed improvements through techniques like predicate push-down, compression, and more.
Using ORC for every Hive table should really be a no-brainer, and it is extremely beneficial for getting fast response times on your Hive queries.
CREATE TABLE A_ORC (
  customerID int, name string, age int, address string
) STORED AS ORC;
2. Use Vectorization
Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time.
Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create partitioned Hive tables and specify the partition predicates in the WHERE clause or the ON clause of a JOIN.
For example: the table state_view is partitioned on the column state.
The below query retrieves rows for only a given state:
SELECT state_view.* FROM state_view WHERE state_view.state = 'State-1' OR state_view.state = 'State-3';
If the table state_view is joined with another table, city_users, you can join on the partition column in the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
Hope this post helped you with all your join optimization needs in Hive.
Hive uses MapReduce, and this is the main reason why it's slow; if you want more information, see the link below:
https://community.hortonworks.com/content/supportkb/48808/a-hive-join-query-is-slow-because-it-is-stuck-for.html

Hive Full Table Scan Issue (Partitioned Columns Used)

I have a BIG table in Hive 0.13 with approx 250 GB of data per day, which is hence approx 10 GB per hour. I have a BI tool which needs to access this table's data on a per-day or per-hour basis, for which I need to test the queries that the BI tool would generate and run on Hive.
One of the queries, when the BI tool is used for yesterday's daily data, looks like this:
select count(*)
from my_table
where
yyyy=year(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and mm=month(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and dd=day(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
;
My table in Hive is MY_TABLE, and YYYY, MM, and DD are the partition columns of MY_TABLE. It is already stored in ORC format.
The above query runs for a very long time, and when I then look at the EXPLAIN EXTENDED output, I clearly see that it is doing a FULL TABLE SCAN of MY_TABLE irrespective of the filter conditions.
How can we avoid this issue?
Kindly advise.
Note again : Hive version is 0.13. We're in middle of an upgrade.
Thanks,
Suddhasatwa
Note:
The solution provided here (Why partitions elimination does not happen for this query?) is not applicable in my case, since I am using Hive 0.13, while the CURRENT_DATE function is only available from Hive 1.x onward.
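For context, partition pruning generally does kick in when the partition filters are constants; on older Hive versions a common workaround is to compute yesterday's date outside Hive and pass it in as variables. A minimal sketch, assuming the script is launched with hivevar substitutions (the variable and file names are illustrative):
-- invoked as, e.g.: hive --hivevar y=2016 --hivevar m=5 --hivevar d=17 -f daily_count.sql
select count(*)
from my_table
where yyyy = ${hivevar:y}
  and mm = ${hivevar:m}
  and dd = ${hivevar:d};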

How to improve performance of loading data from a non-partitioned table into an ORC partitioned table in Hive

I'm new to Hive querying, and I'm looking for best practices to retrieve data from a Hive table. We have enabled Tez as the execution engine and enabled vectorization.
We want to do reporting from a Hive table. I read in the Tez documentation that it can be used for real-time reporting. The scenario: from my web application I would like to show the result of a Hive query (Select * from the Hive table) on the UI, but any query takes a minimum of 20-60 secs at the Hive command prompt, even though the table only has 60 GB of data.
1) Can anyone tell me how to do real-time reporting by querying the Hive table and showing the results immediately on the UI, within 10-30 secs?
2) Another problem we have identified: initially we have an un-partitioned table pointing to a blob/file in HDFS; it is 60 GB in size with 200 columns. When we dump the data from the un-partitioned table into the ORC table (the ORC table is partitioned), it takes 3+ hrs. Is there a way to improve the performance of dumping data into the ORC table?
3) When we query a non-partitioned table with bucketing, inserting into the Hive table and querying take less time than a select query on the ORC table, but as the number of records in the Hive table increases, the ORC table's SELECT query performs better than the bucketed table. Is there a way to improve performance for small data sets as well? Since it is the initial phase, we load 50 GB of data into the Hive table every month, but that can increase; we are looking to improve the performance of loading data into the ORC partitioned table.
4) Tez supports interactive, low-latency queries and drill-down support for reports. How do I enable my drill-down reports to get data from Hive (which should be interactive) within human response time, i.e. 5-40 sec?
We are testing with 4 nodes; each node has 4 CPU cores, 7 GB RAM, and 3 disks attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data into the ORC table, you can try playing around with the following parameters:
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
Also, you might check whether compression helps. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
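For the load itself, a minimal dynamic-partition sketch (the table and column names are illustrative, and the dynamic-partition flags are assumptions about your setup):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- the dynamic partition column must come last in the SELECT list
INSERT OVERWRITE TABLE orc_table PARTITION (dt)
SELECT col1, col2, dt FROM staging_table;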
Hope it helps!
First of all, Hive is not meant for real-time data processing. No matter how small the data may be, the query will take a while to return data.
The real power of Hive lies in batch processing huge amounts of data.

MicroStrategy / Oracle - slow performance

We have a Microstrategy / Oracle setup which has a fact table with 50+ billion rows (that is 50,000,000,000+ rows).
The system performance is very unstable: sometimes it runs OK, but at other times it is very slow, i.e. simple reports take 20 minutes to run!
The weirdest part: if we add more constraints to a report (i.e. more where clauses) that end up in LESS data coming back, the report actually slows down further.
We are able to pick up the SQL from Microstrategy, and we find that the SQL itself runs quite slowly as well. However, since the SQL is generated by Microstrategy, we do not have much control over the SQL.
Any thoughts as to where we should look?
Look at the SQL and see if you can add any more useful indexes. Check that the query is using the indexes you think it should be.
Check that every column that is filtered on has an index.
Remember to update the statistics for all the tables involved: with tables this big, it is very important.
Look at the query plan and check that there aren't table scans on large tables (you can accept them on small lookup tables); a sketch of these checks follows this list.
EnableDescribeParam=1 in ODBC driver
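A rough sketch of those checks on the Oracle side (all object names here are illustrative, not from the actual system):
-- index a column that reports commonly filter on
CREATE INDEX fact_cust_idx ON fact_table (customer_id);
-- refresh optimizer statistics so the planner sees current row counts
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'FACT_TABLE');
-- inspect the plan for full scans of the big fact table
EXPLAIN PLAN FOR SELECT COUNT(*) FROM fact_table WHERE customer_id = 42;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);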
If your environment is like mine, then what I provide here may help with your request; if not, it may help others. We too have a table like that, and after weeks of trying to add this index or that index, the ultimate solution was enabling parallelism on the table and at the index level.
report runtime 25 mins
alter table TABLE_NAME parallel(degree 4 instances 4);
alter index INDEX_NAME parallel(degree 4 instances 4);
report runtime 6 secs.
There are criteria for a table to have parallelism set up on it, such as being larger than 1 GB, but play with the number of parallel threads to get the most optimal time.
