Hive ANALYZE query taking a lot of time - performance

In order to speed up ETL queries on large tables, we run many ANALYZE queries on these tables and their date columns in the evening.
But these ANALYZE queries on columns take a lot of memory and time.
We are using Tez.
Is there any way to optimize the ANALYZE queries as well, for example with some SET commands?

If you are loading tables using INSERT OVERWRITE, then statistics can be gathered automatically by setting hive.stats.autogather=true during the INSERT OVERWRITE queries.
If the table is partitioned and partitions are being loaded incrementally, then you can analyze only the last partitions:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [FOR COLUMNS];
See examples here: https://cwiki.apache.org/confluence/display/Hive/StatsDev
For ORC files it's possible to set hive.stats.gather.num.threads to increase parallelism.
See the full list of statistics settings here: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics
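For illustration, a minimal nightly sketch assuming a table named sales partitioned by a load_date column (the table name, column name, thread count and partition value are all made up); it restricts statistics gathering to the newest partition only:
SET hive.stats.autogather=true;        -- basic stats collected automatically on INSERT OVERWRITE
SET hive.stats.gather.num.threads=16;  -- more threads for ORC footer reads (example value)
-- table/partition-level statistics for just the latest partition
ANALYZE TABLE sales PARTITION(load_date='2021-12-12') COMPUTE STATISTICS;
-- column statistics for the same partition only
ANALYZE TABLE sales PARTITION(load_date='2021-12-12') COMPUTE STATISTICS FOR COLUMNS;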

Related

Reading a hadoop.hive.ql.io.HiveSequenceFileOutputFormat Hive table in Spark

I have a Hive table in Hadoop which has an output format of
hadoop.hive.ql.io.HiveSequenceFileOutputFormat
I am reading this table using Spark SQL:
spark.sql('select * from testtable where y = 2021 and month = 12 and day = 12')
The Spark job runs super slow. I have tried adjusting the number of executors and the memory per executor, but nothing seems to improve the performance. I read on a blog that SequenceFiles are not the best format for Hive tables.
Is there a better way of reading this table?
Thanks in advance for any help.
You should consider partitioning your table by date if you will continue to access it regularly by date. (Lookups on the partition will be very fast, at the cost of queries that don't use partitions.)
You should also look into the "small files problem" with Hadoop. You can get some nice speed out of making files larger.
You should look at using Parquet or ORC. They compress nicely and often boost performance.
You should also look at computing table stats on the Hive table; this also helps to increase performance.
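A rough HiveQL sketch of those suggestions, assuming a hypothetical target table testtable_orc and payload columns col1/col2 (the source testtable and its y/month/day columns come from the question):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- columnar, date-partitioned copy of the data
CREATE TABLE testtable_orc (col1 STRING, col2 BIGINT)
PARTITIONED BY (y INT, month INT, day INT)
STORED AS ORC;
-- rewrite the existing data into the new layout (also an opportunity to merge small files)
INSERT OVERWRITE TABLE testtable_orc PARTITION (y, month, day)
SELECT col1, col2, y, month, day FROM testtable;
-- gather statistics on the partition you query most
ANALYZE TABLE testtable_orc PARTITION(y=2021, month=12, day=12) COMPUTE STATISTICS;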

Can we sort a column of a Hive table just before querying?

My Hive table is in ORC format, and queries on it run fastest when the columns in the WHERE clause are sorted, but in my case they currently are not. What is the syntax to sort a column just before querying?
If I understand your question properly, you have an unsorted ORC table and you want to "sort" the data "before" querying it. That does not make much sense, since you would be firing one query just to produce sorted data and then another query on top of it.
Sorting can be a costly operation depending on how you implement it (a sketch of one way to do it follows at the end of this answer). However, there are a bunch of other options you can use while querying the data that can speed up your queries. Some details follow.
Use the Tez execution engine. It is way faster than the traditional MR jobs launched by Hive.
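For example, the engine can be switched per session (assuming Tez is installed on the cluster):
SET hive.execution.engine=tez;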
Enable predicate pushdown (PPD) to filter at the storage layer:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
Vectorized query execution processes data in batches of 1024 rows instead of one by one:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
Enable the Cost-Based Optimizer (CBO) for efficient cost-based query execution, and fetch table statistics:
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Partition and column statistics are fetched from the metastore. Use this with caution: if you have too many partitions and/or columns, it could degrade performance.
Control reducer output:
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
Also, you may want to look at the best practices for creating ORC tables, mentioned here, so that you can get the most out of your queries in the minimum of time!
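Coming back to the sort itself: if you do decide to pay the one-time cost of physically sorting the data, here is a minimal sketch (the names my_table, my_table_sorted and filter_col are made up). Rewriting the table with rows clustered and sorted on the column you filter by lets ORC's min/max indexes and PPD skip far more data:
CREATE TABLE my_table_sorted STORED AS ORC AS
SELECT *
FROM my_table
-- DISTRIBUTE BY + SORT BY sorts within each reducer's output file,
-- avoiding the single global reducer an ORDER BY would need
DISTRIBUTE BY filter_col
SORT BY filter_col;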
Hope that helps!

Time-based directory structure in Apache Drill

I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the organization of the files can be utilized to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?
Please note:
SQL queries will almost always contain the time frame.
The number of CSV files in a given directory is not huge, but combined across all years' worth of data it will be huge.
There is a field called 'trans' in every CSV file, which contains the date and time.
Each CSV file is put under the appropriate directory based on the value of the 'trans' field.
CSV files do not follow any schema. Columns may or may not be different.
Querying using a column inside the data file would not help with partition pruning.
You can use the dir* variables in Drill to refer to partitions in a table:
create view trans_logs_view as
select
  `dir0` as `tran_year`,
  `dir1` as `trans_month`,
  `dir2` as `tran_date`,
  *
from dfs.`/data/logs`;
You can query using the tran_year, trans_month and tran_date columns for partition pruning.
Also, see whether the query below helps with pruning:
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define a view that aliases concat(dir0, dir1, dir2) to the trans column name and query against that.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/
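A hedged sketch of that second approach, assuming the view lives in a writable workspace such as dfs.tmp and naming the concatenated directory column dir_date to avoid clashing with the trans field already inside the CSV files (both names are assumptions):
create or replace view dfs.tmp.trans_logs_by_dir as
select
  concat(`dir0`, `dir1`, `dir2`) as `dir_date`,  -- YYYY || MM || DD taken from the directory path
  t.*
from dfs.`/data/logs` t;

select count(1)
from dfs.tmp.trans_logs_by_dir
where `dir_date` between '20170101' and '20170102';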

1 billion records join (filters) in Spark with Parquet file format vs Hadoop text input format

I am reading 1 billion records of a table in Spark from Hive, and the table has date and country columns as partitions. It runs for a very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet, will there be any performance gain? Any suggestions on improving performance?
Changing from ORC to Parquet may not improve the performance.
It depends on the type of data you have. If you are working with nested objects you need to use Parquet; ORC is not good for that.
But to get some improvement, I suggest a few steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that can create big problems for Hive queries is the number of files in each partition and how large those files are. If you are using Spark to store the data, I suggest you check the size of the files and whether they are stored at the size of your Hadoop block. If not, try using the CONCATENATE command to solve that problem, as you can see here.
Predicate PushDown
This is where Hive and ORC files can give you the best performance when querying the data. I suggest you run an ANALYZE command to force the creation of statistics for your table; this will improve performance, and if the data layout is not efficient this will help. Check here: this will update the Hive Metastore and give you some relevant information about your data.
Ordered Data
If it is possible, try to store your data ordered by some column, and filter and do other operations on that column. Your joins can be improved by this.
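A hedged HiveQL sketch of the CONCATENATE and statistics suggestions, assuming an ORC table named events partitioned by load_date and country (the table name, column names and values are made up; CONCATENATE works on ORC and RCFile tables):
-- merge the small files inside one partition into larger ones
ALTER TABLE events PARTITION (load_date='2021-12-12', country='US') CONCATENATE;
-- refresh the table/partition and column statistics used by the optimizer
ANALYZE TABLE events PARTITION (load_date='2021-12-12', country='US') COMPUTE STATISTICS;
ANALYZE TABLE events PARTITION (load_date='2021-12-12', country='US') COMPUTE STATISTICS FOR COLUMNS;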

How to optimize Hive queries with an external table and SerDe

Part 1: my environment
I have the following files uploaded to Hadoop:
They are plain text
Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}
code values are numbers from 1 to 3000
customerIds total up to 4 million, daily up to 0.5 million
All files are gzipped
In Hive I created an external table with a custom JSON SerDe (let's call it CUSTOMER_DATA)
All files from each date are stored in a separate directory, and I use these directories as partitions in the Hive table
Most of my queries filter by date, code and customerId. I also have a second file with the following format (let's call it CUSTOMER_ATTRIBUTES):
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so it has up to 4 million rows.
I query and filter my data in the following way:
Filtering by date - partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
Filtering by code, using a statement like WHERE code IN (1,4,5,33,6784)
Joining the CUSTOMER_ATTRIBUTES table with CUSTOMER_DATA, with a query like:
SELECT customerId
FROM CUSTOMER_DATA
JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question
Is there any efficient way to optimize my queries? I have read about indexes and buckets, but I don't know whether I can use them with external tables and whether they will optimize my queries.
Performance on search:
Internal or external tables make no difference as far as performance is concerned. You can build indexes on both. Either way, building indexes on large data sets is counterintuitive.
Bucketing the data on your search columns would give a lot of performance gains. But whether you can bucket your data or not depends on your use case.
You can consider more partitioning (if possible) on code/customerId to get more gains. Hopefully you don't have too many unique codes or customerIds.
Rather than trying these things out on your textual JSON-formatted data, I would strongly suggest you move away from JSON text data. Parsing JSON (text) is a big performance killer.
These days there are a lot of file formats which work pretty well. If you can't change the component which produces the data, you can use a series of queries and tables to convert to other file formats. This will be a one-time job for each partition of data (a sketch follows at the end of this answer). After that, your search queries will run faster on the newer file formats.
For example, the RCFile format is supported by Hive. If you pull out code and customerId as separate columns in an RCFile, then the query engine can completely skip the data column for rows not matching code IN (1,4,5,33,6784), reducing IO heavily.
Also, storing data in RCFile, i.e. columnar storage, will help your joins. With RCFile, when you run a query with a join, the Hive execution engine will only read in the required columns, again significantly reducing IO. On top of this, if you bucketed the columns which are part of the JOIN keys, it will lead to more performance gains.
If you need to keep JSON due to the nested nature of your data, then I would suggest you look at Parquet.
It will give you the performance gains of RCFile plus a binary format (Avro, Thrift etc).
At my work we had 2 columns of heavily nested JSON data. We tried storing this as compressed text and in the SequenceFile format. We then broke up the complex nested JSON columns into multiple, less nested columns and pulled out some frequently searched keys into other columns. We stored this as RCFile, and the performance gains we observed on searching were huge.
Right now, with a bigger burst in data, we need to improve further. After trying a few more things and talking to Cloudera folks, there is only one big area left to improve: move away from JSON parsing. Parquet seems to be the ideal candidate for this.
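A minimal sketch of that one-time, per-partition conversion, assuming a hypothetical target table CUSTOMER_DATA_RC and assuming the SerDe exposes the data field as a string (adjust the type if it is a struct); the source table, partition column name and partition value echo the question:
-- columnar copy with code and customerId pulled out as their own columns
CREATE TABLE CUSTOMER_DATA_RC (
  code INT,
  customerId STRING,
  data STRING  -- remaining JSON payload kept as text; an assumption
)
PARTITIONED BY (partitionDate INT)
STORED AS RCFILE;

-- one-time job per partition
INSERT OVERWRITE TABLE CUSTOMER_DATA_RC PARTITION (partitionDate=20141020)
SELECT code, customerId, data
FROM CUSTOMER_DATA
WHERE partitionDate = 20141020;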
Yes, you can use indexes with external tables. Indexes do optimize search queries.
CREATE INDEX your_index_name ON TABLE your_table_name(field_you_want_to_index) AS 'COMPACT' WITH DEFERRED REBUILD;
Indexing takes a lot of time for a huge dataset, so we can do a deferred rebuild, i.e. after production hours :)
ALTER INDEX your_index_name ON your_table_name REBUILD;
You can even rebuild a specific partition:
ALTER INDEX your_index_name ON your_table_name PARTITION(your_field = 'any_thing') REBUILD;
When you JOIN two tables, BUCKETING is the best option to go with; it does a lot of optimization.
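A hedged sketch of bucketing both tables on the join key (the *_BUCKETED table names, the attribute column list and the bucket count of 64 are all assumptions):
SET hive.enforce.bucketing=true;  -- needed on older Hive versions so inserts respect the bucket definition
CREATE TABLE CUSTOMER_DATA_BUCKETED (
  code INT,
  customerId STRING,
  data STRING
)
PARTITIONED BY (partitionDate INT)
CLUSTERED BY (customerId) INTO 64 BUCKETS
STORED AS RCFILE;

CREATE TABLE CUSTOMER_ATTRIBUTES_BUCKETED (
  customerId STRING,
  attribute_1 STRING,
  attribute_2 STRING
)
CLUSTERED BY (customerId) INTO 64 BUCKETS
STORED AS RCFILE;

-- let Hive use a bucket map join when both sides are bucketed on the join key
SET hive.optimize.bucketmapjoin=true;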
