Improving performance of Hive JDBC

Does anyone know how to increase the performance of a Hive JDBC connection?
Detailed problem:
When I query Hive from the Hive CLI, I get a response within 7 seconds, but through a Hive JDBC connection the same query takes about 14 seconds. I was wondering whether there is any way (configuration changes) to improve the performance of queries run through the JDBC connection.
Thanks in advance.

Using connection pooling helped me increase Hive JDBC performance.
Since Hive performs many transformations while a query runs, reusing existing connection objects from a connection pool, instead of opening a new connection and closing it for each request, was quite helpful; see the sketch below.
Please let me know if anyone else is facing the same issue, and I will post a more detailed answer.
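A minimal sketch of that approach, assuming HikariCP as the pool implementation and a placeholder HiveServer2 URL (neither is specified in the answer above):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePoolExample {
    public static void main(String[] args) throws Exception {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:hive2://hiveserver:10000/default"); // placeholder host/port
        config.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        config.setMaximumPoolSize(5); // a handful of long-lived connections

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // Each request borrows a connection from the pool and returns it on close(),
            // instead of establishing a new HiveServer2 session every time.
            try (Connection conn = dataSource.getConnection();
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM some_table")) { // hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }
}
```

Borrowing from the pool avoids re-establishing a HiveServer2 session on every request, which is often a noticeable share of the per-query latency seen over JDBC.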

Can you please try the options below?
If your query has joins, try setting hive.auto.convert.join to true.
Try changing the Java heap size and garbage collection configuration (reference link).
Change the execution engine to Tez using set hive.execution.engine=tez.
To check the currently set engine, use set hive.execution.engine;.
Other Hive performance configuration tips can be found in the link.
Please let me know the results. A sketch of applying these session settings over JDBC follows.
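For completeness, a minimal sketch of applying such session-level settings over a Hive JDBC connection; the HiveServer2 URL and table are placeholders, and only two of the settings above are shown:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSessionSettings {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", ""); // placeholders
             Statement stmt = conn.createStatement()) {
            stmt.execute("set hive.execution.engine=tez");   // switch execution engine for this session
            stmt.execute("set hive.auto.convert.join=true"); // convert to map-side joins where possible
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM some_table")) { // hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }
}
```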

If your database is Oracle, you can try Oracle Table Access for Hadoop and Spark (OTA4H), which can also be used from HiveQL. OTA4H optimizes the JDBC queries that retrieve the data from Oracle, using splitters to get the best performance. You can also join Hive tables with external tables stored in Oracle directly in your Hive queries.

To improve the performance of the JDBC connection, use the standard JDBC performance features: connection pooling and prepared statement pooling (available starting with JDBC 3.0); a sketch of statement reuse appears at the end of this answer.
Performance of the Hive CLI can be improved by changing these configuration parameters:
-- enable cost based optimizer
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
--collects statistics
analyze table <TABLENAME> compute statistics for columns;
--enable vectorization of queries.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Hope this helps.
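As a sketch of the prepared statement reuse mentioned above, here is one way to keep a single PreparedStatement and vary only its parameters between executions; the HiveServer2 URL and table are placeholders, and driver-level statement pooling (where the driver supports it) follows the same idea:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReusedStatementExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", ""); // placeholders
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM some_table WHERE id = ?")) {        // hypothetical table
            for (long id : new long[] {1L, 2L, 3L}) {
                ps.setLong(1, id); // only the parameter changes between runs
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
    }
}
```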

Related

Impala query with LIMIT 0

Being a production support team member, I investigate issues with various Impala queries. While researching an issue, I saw a team submit an Impala query with LIMIT 0, which obviously does not return any rows, and then submit it again without LIMIT 0, which gives them results. I guess they submit these queries from IBM DataStage. Before I ask them why they do so, I wanted to check what the reason could be for someone to run a query with LIMIT 0. Is it just to check syntax or the connection to Impala? I see a similar question discussed here in the context of SQL, but thought to ask anyway from an Impala perspective. Thanks, Neel
I think you are partially correct.
Please note that Impala will process all the data and only then apply the LIMIT clause.
LIMIT 0 is mostly used to:
check whether the syntax of the SQL is correct. Impala still fetches all the records before applying the limit, so the SQL is completely validated. Some systems may use this to check SQL they generated automatically before actually running it on the server (see the sketch after this list).
avoid fetching lots of rows from a huge table or data set every time you run the SQL.
create an empty table using the structure of another table without copying its storage format, configuration, etc.
avoid burdening Hue, or any other interface interacting with Impala; all data is still processed, but nothing is returned.
run a rough performance test: this gives you some idea of the run time of the SQL. I say "some" because it is not the actual time to complete, only an estimate of the time to complete the SQL.
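A minimal sketch of the syntax-check use case from the first point, assuming an Impala JDBC driver on the classpath; the JDBC URL, wrapper query, and table are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SqlSyntaxCheck {
    static boolean isValid(Connection conn, String generatedSql) {
        // Wrapping the query with LIMIT 0 returns no rows, but (as noted above) the data
        // is still processed, so the SQL is completely validated before the real run.
        String probe = "SELECT * FROM (" + generatedSql + ") t LIMIT 0";
        try (Statement stmt = conn.createStatement()) {
            stmt.executeQuery(probe);
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:impala://impalad:21050/default")) { // placeholder URL
            System.out.println(isValid(conn, "SELECT id, name FROM some_table")); // hypothetical
        }
    }
}
```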

r2dbc-oracle back pressure implementation vs. fetchsize

It seems that r2dbc-oracle doesn't have a proper back pressure implementation. If I select a bigger amount of rows (say 10k), then it is way slower than a regular JDBC/JPA query. If I manually set the fetch size to 1000, then the query is approximately 8 times(!) faster.
So:
Can you confirm that back pressure is or is not implemented? If not, is that planned?
Is there an easier way to set the fetch size (maybe even globally...) than using manual databaseclient.sql() queries?
Thanks for sharing these findings.
I can confirm that request signals from a Subscriber do not affect the fetch size of Oracle R2DBC's Row Publisher. Currently, the only supported way to configure the fetch size is by calling io.r2dbc.spi.Statement.fetchSize(int).
This behavior can be attributed to Oracle JDBC's implementation of oracle.jdbc.OracleResultSet.publisherOracle(Function). The Oracle R2DBC Driver is using Oracle JDBC's Publisher to fetch rows from the database.
I can also confirm that the Oracle JDBC Team is aware of this issue, and is working on a fix. The fix will have the publisher use larger fetch sizes when demand from a subscriber exceeds the value configured with Statement.fetchSize(int).
Source: I wrote the code for Oracle R2DBC and Oracle JDBC's row publisher.
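A minimal sketch of the fetchSize(int) call described above, using the io.r2dbc.spi API directly with Project Reactor; the connection URL and table are placeholders:

```java
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import reactor.core.publisher.Flux;

public class FetchSizeExample {
    public static void main(String[] args) {
        // Placeholder connection details; the r2dbc URL format below is an assumption.
        ConnectionFactory factory =
                ConnectionFactories.get("r2dbc:oracle://user:password@dbhost:1521/ORCL");

        Flux<String> names = Flux.usingWhen(
                factory.create(),
                connection -> Flux.from(
                                connection.createStatement("SELECT name FROM big_table") // hypothetical table
                                        .fetchSize(1000) // larger fetch size -> fewer database round trips
                                        .execute())
                        .flatMap(result ->
                                result.map((row, metadata) -> row.get("name", String.class))),
                Connection::close);

        // Blocking here only to make the example self-contained.
        System.out.println("rows fetched: " + names.count().block());
    }
}
```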

Oracle connection with Spark SQL

I'm trying to connect to an Oracle DB from Spark SQL with the following code:
val dataTarget=sqlcontext.read.
  format("jdbc").
  option("driver", config.getString("oracledriver")).
  option("url", config.getString("jdbcUrl")).
  option("user", config.getString("usernameDH")).
  option("password", config.getString("passwordDH")).
  option("dbtable", targetQuery).
  option("partitionColumn", "ID").
  option("lowerBound", "5").
  option("upperBound", "499999").
  option("numPartitions", "10").
  load().persist(StorageLevel.DISK_ONLY)
By default, when we connect to Oracle through Spark SQL, one connection and one partition are created for the entire RDD. This way I lose parallelism, and performance issues come up when there is a huge amount of data in a table. In my code I have passed option("numPartitions", "10"),
which will create 10 connections. Please correct me if I'm wrong; as far as I know, the number of connections to Oracle will be equal to the number of partitions we pass.
I'm getting the error below if I use more connections, maybe because there is a connection limit in Oracle.
java.sql.SQLException: ORA-02391: exceeded simultaneous
SESSIONS_PER_USER limit
If I use more partitions to get more parallelism, the error comes up, but if I use fewer partitions I face performance issues. Is there any other way to create a single connection and load the data into multiple partitions (this would save my life)?
Please suggest.
Is there any other way to create a single connection and load data into multiple partitions
There is not. In general, partitions are processed by different physical nodes and different virtual machines. Considering all the authorization and authentication mechanisms, you cannot just take a connection and pass it from node to node.
If the problem is just exceeding SESSIONS_PER_USER, contact the DBA and ask for the value to be increased for the Spark user.
If the problem is throttling, you can try to keep the same number of partitions but decrease the number of Spark cores. Since this is mostly micromanaging, it might be better to drop JDBC completely, use a standard export mechanism (COPY FROM) and read the files directly.
One workaround might be to load the data using a single Oracle connection (one partition) and then simply repartition it:
val dataTargetPartitioned = dataTarget.repartition(100);
You can also partition by a field (if partitioning a dataframe):
val dataTargetPartitioned = dataTarget.repartition(100, "MY_COL");

jdbc with oracle DB - out of memory

I've written some simple code that reads a table from an Oracle DB.
I try to run it on a very big table and I see that it consumes a huge amount of memory.
I thought that using fetchSize would make it optimize memory usage (that is what happens when using it with SQL Server), but it didn't. I tried it with various values, from 10 to 100000.
I can't see how to accomplish a simple task: export a very big Oracle table to a CSV file.
I use ojdbc6.jar as the driver.
I also use
connection.setAutoCommit(false);
Any idea?
Seems like creating the statement with ResultSet.TYPE_FORWARD_ONLY solved this problem.
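A minimal sketch of that setup: a forward-only, read-only statement with an explicit fetch size, streaming rows straight to a CSV file so the full table never has to be held in memory. The JDBC URL, table, and file name are placeholders:

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExportTableToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password"); // placeholders
             PrintWriter out = new PrintWriter("big_table.csv")) {
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement(
                         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                stmt.setFetchSize(1000); // rows fetched per round trip
                try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM big_table")) { // hypothetical table
                    while (rs.next()) {
                        out.println(rs.getLong("id") + "," + rs.getString("name"));
                    }
                }
            }
        }
    }
}
```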

Hibernate with Oracle JDBC issue

I have a select query which takes 10 minutes to complete, as it runs through 10M records. When I run it through TOAD or a program using a normal JDBC connection I get the results back, but a job which uses Hibernate as the ORM does not return any results. It just hangs, even after 45 minutes. Please help.
Are you saying you are trying to retrieve 10M records using an ORM like Hibernate?
If that is the case you have one big problem: you need to redesign your application, because this is not going to work. As for why it hangs, well, I bet it is because it runs out of memory.
Have you enabled SQL output for Hibernate? You need to set hibernate.show_sql to true in order to do that.
Once that's done, compare the generated SQL with the one you've been running through TOAD. Are they exactly the same or not?
I'm going to venture a guess here and say they're not, because once the SQL is generated Hibernate does nothing fancy - a connection is taken from a pool; a prepared statement is created and executed - so it should be no different from plain JDBC.
Thus the question most likely is how your HQL can be optimized. If you need any help with that, you'll have to post the HQL in question as well as the appropriate mappings / table schemas. Running EXPLAIN on the query would help as well. A sketch of enabling the SQL logging mentioned above follows.
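A minimal sketch of enabling that SQL logging with Hibernate's Configuration API; the connection settings are placeholders:

```java
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class ShowSqlExample {
    public static void main(String[] args) {
        Configuration configuration = new Configuration()
                .setProperty("hibernate.connection.driver_class", "oracle.jdbc.OracleDriver")
                .setProperty("hibernate.connection.url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder
                .setProperty("hibernate.connection.username", "user")     // placeholder
                .setProperty("hibernate.connection.password", "password") // placeholder
                .setProperty("hibernate.show_sql", "true")    // print the generated SQL to stdout
                .setProperty("hibernate.format_sql", "true"); // pretty-print it

        SessionFactory sessionFactory = configuration.buildSessionFactory();
        // Run the HQL in question here and inspect the SQL printed by Hibernate,
        // then compare it against the query used in TOAD.
        sessionFactory.close();
    }
}
```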
