hive query performance is bad - performance

I am joining 3 huge tables (billion-row tables) in Hive. All the statistics are collected, but the performance is still very bad (the query takes around 40 minutes).
Is there any parameter which I can set in the HIVE prompt to get better performance?
When I run the query I see info like:
Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetFileReader: reading another 1 footers
All the tables are created in BigSql with storage parameter as "STORED AS PARQUETFILE"
How can I suppress the job progress details when a HIVE query is running?
Regarding the Hive version:
hive> set system:sun.java.command;
system:sun.java.command=org.apache.hadoop.util.RunJar /opt/ibm/biginsights/hive/lib/hive-cli-0.12.0.jar org.apache.hadoop.hive.cli.CliDriver -hiveconf hive.aux.jars.path=file:///opt/ibm/biginsights/hive/lib/hive-hbase-handler-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hive-contrib-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-client-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-common-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-hadoop2-compat-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-prefix-tree-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-protocol-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-server-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/htrace-core-2.01.jar,file:///opt/ibm/biginsights/hive/lib/zookeeper-3.4.5.jar,file:///opt/ibm/biginsights/sheets/libext/piggybank.jar,file:///opt/ibm/biginsights/sheets/libext/pig-0.11.1.jar,file:///opt/ibm/biginsights/sheets/libext/avro-1.7.4.jar,file:///opt/ibm/biginsights/sheets/libext/opencsv-1.8.jar,file:///opt/ibm/biginsights/sheets/libext/json-simple-1.1.jar,file:///opt/ibm/biginsights/sheets/libext/joda-time-1.6.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets-serdes-1.0.0.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-column-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-common-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-encoding-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-generator-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hadoop-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hive-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-thrift-1.3.2.jar,file:///opt/ibm/biginsights/hive/lib/guava-11.0.2.jar

Koushik - this question I asked a month back will give you good insight into the performance of ORC vs Parquet.
Let me ask this question: what is the structure of your data? Is it nested or flat? If it is flat data, for example data ingested from an RDBMS, ORC is better, since it has lightweight indexes stored alongside the data that make retrieval faster.
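If your data really is flat, a minimal sketch of an ORC copy to benchmark against the Parquet original (table and column names here are hypothetical) could look like:
-- hypothetical names: sales_orc / sales_parquet
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE, sale_date STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
INSERT OVERWRITE TABLE sales_orc SELECT id, amount, sale_date FROM sales_parquet;
Running the same join against both copies would show whether the file format is actually the bottleneck.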
Hope this helps

Related

reading a hadoop.hive.ql.io.HiveSequenceFileOutputFormat hive table in spark

I have a hive table in hadoop, which has an output format of
hadoop.hive.ql.io.HiveSequenceFileOutputFormat
I am reading this table using Spark SQL:
spark.sql('select * from testtable where y = 2021 and month = 12 and day =12')
The Spark job runs super slow. I have tried adjusting the number of executors and the memory per executor, but nothing seems to improve the performance. I read on a blog that SequenceFiles are not the best format for Hive tables.
Is there a better way of reading this table ?
Thanks in advance for any help.
You should consider partitioning your table by date if you will continue to access it regularly by date. (Lookups on the partition column will be very fast, at the cost of queries that don't use the partitions.)
You should also look into the "small files problem" with Hadoop; you can get a nice speedup by making the files larger.
You should look at using Parquet or ORC. They compress nicely and often boost performance.
You should also look at computing table statistics on the Hive table; this also helps performance.
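As a rough sketch of the partitioning, file-format, and statistics points above (the schema is an assumption; only the table name testtable and the y/month/day columns appear in the question), a date-partitioned Parquet copy plus statistics might look like:
-- hypothetical schema; testtable is the SequenceFile table from the question
CREATE TABLE testtable_parquet (msg STRING)
PARTITIONED BY (y INT, month INT, day INT)
STORED AS PARQUET;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE testtable_parquet PARTITION (y, month, day)
SELECT msg, y, month, day FROM testtable;
ANALYZE TABLE testtable_parquet PARTITION (y, month, day) COMPUTE STATISTICS;
The Spark SQL query from the question should then prune down to a single day's partition instead of scanning the whole table.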

1 Billion records join (filters) in Spark with Parquet file format vs Hadoop Text input format

I am reading about 1 billion records of a table in Spark from Hive, and the table has date and country columns as partitions. It runs for a very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet, will there be any performance gain? Any suggestions on improving performance?
Changing from ORC to Parquet may not improve the performance.
It depends on the type of data you have. If you are working with nested objects you should use Parquet; ORC is not as good for that.
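For example, a table with a nested column can be declared directly as Parquet (the names here are made up for illustration):
-- hypothetical table; 'visitor' is a nested struct column
CREATE TABLE events_parquet (
  id BIGINT,
  visitor STRUCT<name:STRING, country:STRING>
)
STORED AS PARQUET;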
To get some improvement, though, I suggest a few steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that can create big problems in a Hive query is the number of files in each partition and how large those files are. If you are using Spark to store the data, check the size of the files and whether they match your Hadoop block size. If not, try using the CONCATENATE command to solve the problem, as you can see here.
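A hedged example of that (the table name and partition spec are assumptions; note that CONCATENATE applies to ORC and RCFile tables):
-- merges the small files inside one partition into larger ones
ALTER TABLE my_table PARTITION (country='US', dt='20150904') CONCATENATE;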
Predicate PushDown
This is what Hive and ORC files can give you for the best performance when querying the data. I suggest you run an ANALYZE command to force the creation of statistics for your table; this will improve performance, and it will help even if the data layout is not efficient. Check here; this updates the Hive Metastore and gives you some relevant information about the data.
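A minimal sketch of those commands (the table name and partition columns are hypothetical; the exact PARTITION syntax varies a bit across Hive versions):
-- basic statistics per partition, plus column-level statistics for the optimizer
ANALYZE TABLE my_table PARTITION (country, dt) COMPUTE STATISTICS;
ANALYZE TABLE my_table PARTITION (country, dt) COMPUTE STATISTICS FOR COLUMNS;
-- predicate pushdown into the ORC reader (normally enabled by default)
SET hive.optimize.ppd=true;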
Ordered Data
If possible, try to store your data ordered by some column, and filter and do other operations on that column. Your join can be improved by this.
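One possible way to do that in HiveQL (the names and bucket count are made up) is a bucketed table sorted on the join column:
-- on older Hive versions bucketing must be enforced explicitly
SET hive.enforce.bucketing=true;
CREATE TABLE events_sorted (visitor_id BIGINT, event_time STRING, payload STRING)
CLUSTERED BY (visitor_id) SORTED BY (visitor_id) INTO 64 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE events_sorted SELECT visitor_id, event_time, payload FROM events;
If both sides of the join are bucketed and sorted the same way on the join key, Hive can use a sort-merge bucket join.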

hive data processing taking longer than expected

I'm facing an issue with ORC data in Hive. I need some suggestions in case someone has faced a similar problem.
I have huge data stored in a Hive table (partitioned & ORC). The ORC data size is around 4 TB. I'm trying to copy this data to an uncompressed, plain Hive table (same table structure).
The process runs forever and occupies a huge amount of non-DFS storage along the way. At present the process has been running for 12 hours and has occupied 130 TB of non-DFS storage. That's very abnormal for a Hadoop cluster with 20 servers.
Below are my parameters:
Hadoop running: HDP 2.4
Hive: 0.13
No. of servers: 20 (2 NN included)
I wonder what a simple join or a normal analytics operation on this ORC table would do, when theory says that ORC-format data improves performance for basic DML queries.
Can someone please let me know if I'm doing something wrong, or is this normal behavior? This is my first experience with ORC data.
Well, for starters, I see that the YARN log files are being created at huge sizes, and they mostly contain error logs.
Thanks

Performance Issue in Hadoop, HBase & Hive

I am working on migrating data from a SQL database to Hadoop, for which I have used HBase and Hive as well. I have successfully imported my data from the SQL DB into Hadoop, HBase, and Hive, but the problem is the performance of the system. I was getting results for millions of entries within 5-10 minutes in the SQL DB, but it takes around 1 hour to fetch 10 million rows from HBase & Hive. Can anyone help me improve the performance of my Hadoop system?
Data in HBase is only 'indexed' by rowkey. If you're querying in Hive on anything other than rowkey prefixes, you will generally be performing a full table scan.
There are some optimizations that can be made with HBase filters e.g., when using a FamilyFilter, you may be able to skip entire regions, but I doubt Hive is doing that.
How to improve performance depends on how your data is shaped and what analysis you need to perform on it. When performing frequent ad-hoc analysis, you may be better served by exporting data from HBase into something like Parquet files on HDFS and running your analysis against those with Hive (or Drill, Spark, Impala, etc.).
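As a rough illustration (the table names are assumptions, and CTAS into Parquet needs a reasonably recent Hive), a one-off export of the HBase-backed table into a native Parquet table might look like:
-- hbase_logs is the hypothetical HBase-backed Hive table
CREATE TABLE logs_parquet STORED AS PARQUET AS
SELECT * FROM hbase_logs;
Scans and aggregations against logs_parquet then read columnar HDFS files instead of going through the HBase region servers.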

How to create a data pipeline from hive table to relational database

Background :
I have a Hive table "log" which contains log information. This table is loaded with new log data every hour. I want to do some quick analytics on the logs for the past 2 days, so I want to extract the last 48 hours of data into my relational database.
To solve the above problem I have created a staging Hive table which is loaded by a Hive SQL query. After loading the new data into the staging table, I load the new logs into the relational database using a Sqoop query.
The problem is that Sqoop loads data into the relational database in batches, so at any particular time I have only partial logs for a particular hour.
This is leading to erroneous analytics output.
Questions:
1) How do I make this Sqoop data load transactional, i.e. either all records are exported or none are?
2) What is the best way to build this data pipeline, covering the whole process of Hive table -> staging table -> relational table?
Technical Details:
Hadoop version 1.0.4
Hive- 0.9.0
Sqoop - 1.4.2
You should be able to do this with Sqoop by using the option called --staging-table. This basically designates an auxiliary table that is used to stage the exported data; the staged data is finally moved to the destination table in a single transaction. So by doing this, you shouldn't have consistency issues with partial data.
(source: Sqoop documentation)
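A sketch of what such an export might look like (the connection string, table, and directory names are all hypothetical):
sqoop export \
  --connect jdbc:mysql://dbhost/logsdb \
  --table log_target \
  --export-dir /user/hive/warehouse/staging_log \
  --staging-table log_target_stage \
  --clear-staging-table
Per the Sqoop documentation, the staging table has to exist beforehand and have the same structure as the target table.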
Hive and Hadoop are great technologies that allow your analytics to run inside MapReduce tasks, performing the analytics very fast by utilizing multiple nodes.
Use that to your benefit. First of all, partition your Hive table.
I guess that you store all the logs in a single Hive table, so when you run your queries you have something like
SQL .... WHERE LOG_DATA > '17/10/2013 00:00:00'
Then you effectively query all the data that you have collected so far.
Instead, if you use partitions - let's say one per day - you can write in your query
WHERE p_date=20131017 OR p_date=20131016
The Hive table is partitioned, and Hive now knows to read only those two partitions.
So let's say you get 10 GB of logs per day - then a Hive query should complete in a few seconds on a decent Hadoop cluster.
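A minimal sketch of that layout (the column names are made up; the table and the p_date column follow the discussion above):
-- daily-partitioned log table; Hive only reads the partitions named in the WHERE clause
CREATE TABLE log_partitioned (msg STRING, host STRING)
PARTITIONED BY (p_date INT)
STORED AS TEXTFILE;
SELECT COUNT(*) FROM log_partitioned WHERE p_date=20131017 OR p_date=20131016;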
