modifying hive tez container size from IBM datastage with JDBC connector takes too long - jdbc

In a mapr cluster using yarn and tez engine, we need to query hive data from datastage using jdbc connector. In some cases we need to increase tez container size due to data size. We do that in before sql statement in a parallel job, and then we query data in main job statement.
The problem is the before sql statement SET hive.tez.container.size=3000 is taking hours, but the query to data is running fine (few seconds).
Could it be related to how busy is the cluster at that time? many jobs in the queue??
Don't think so because it always crashes in set statement, but never in select statement.
Thanks in advance!

I would suggest to use IBM provided Hive JDBC driver and Hive Connector stage which allows to set Hive parameters via built-in stage property.
When a DataStage job runs slow, it could be for several reasons, from what you are saying, setting hive.tez.container.size=3000 in before sql statement is what taking hours, I would suggest to look on Hive DB side while running DataStage job.
If you are not using IBM provided Hive JDBC driver, then it's better to engage official support of 3rd party Hive JDBC driver to enable JDBC driver tracing.

Related

How hive manage the Non-Tez and Non-MapReduce based queries

Create table t1(id int)
I was firing above query on Hive 2.3.6 (MapR Hadoop Distribution 6.3.0).
Default hive engine was tez.
So after firing the query I was not able to see any TEZ application is launched on the yarn resource manager web ui
So I've changed the execution engine to MapReduce.
set hive.execution.engine=mr
And tried to run the same query again.
Same I was not able to see any MR application was launched on the yarn resource manager web ui
So my questions are how hive manage such types of queries?
And where the details of this queries are stored like application id, start time so on?
create table - is a metadata operation only, data is not being processed. It creates records in the metastore database, no distributed processing framework like Tez or MR is necessary for this, Yarn is not used.
Compiler translates DDL to the metastore query only if possible.
Also some simple DQL queries can be executed as metastore only if statistics exists and this feature is enabled: https://stackoverflow.com/a/41021682/2700344, without using Tez or MR.
Also small tables can be queried without distributed framework, using fetch-only task, see this: Why is Fetch task in Hive works faster than Map-only task?

Is Hive and Impala integration possible?

Is Hive and Impala integration possible?
After data processing in hive i want to store result data in impala for better read, is it possible?
If yes can you please share one example.
Both hive and impala, do not store any data. The data is stored in the HDFS location and hive an impala both are used just to visualize/transform the data present in the HDFS.
So yes, you can process the data using hive and then read it using impala, considering both of them have been setup properly. But since impala needs to be refreshed, you need to run the invalidate metadata and refresh commands
Impala uses the HIVE metastore to read the data. Once you have a table created in hive, it is possible to read the same and query the same using Impala. All you need is to refresh the table or trigger INVALIDATE METADATA in impala to read the data.
Hope this helps :)
Hive and impala are two different query engines. Each query engine is unique in terms of its architecture as well as performance. We can use hive metastore to get metadata and running query using impala. The common usecase is to connect impala/hive from tableau. If we are visualizing hive from tableau, we can get the latest data without any work around. If we keep on loading the data continuously, metadata will be updated as well. Impala does not aware of those changes. So we should run metadata invalidate query by connecting impalad to refresh its state and sync with the latest info available in metastore. So that user will get the same results as hive when the run the same query from tableau using impala engine.
There is no configuration parameter available now to run this invalidation query periodically. This blog reads well to execute meta data invalidation query through oozie scheduler periodically to handle such problems, Or simply we can set up a cronjob from the server itself.

Oracle Hadoop Connectors vs Sqoop

I have used Sqoop to ingest data from Oracle to Hadoop and it worked well. It took only 4 mins to bring 86 million records from Oracle to Hive table without using partitions on Sqoop. Can anyone give some details about Oracle Hadoop connectors, Will it perform better than Sqoop?
Most of connectors would have the performance close to same as you'll have have a set of MapReduce jobs on the very end of your workflow and this would play the main role in your overall performance.
Oracle provides a set of different connectors for accessing the Hive and you could check a nice overview about standard solutions but I doubt that on the very end you will expect significant performance differences other then you see in Sqoop:
https://docs.oracle.com/cd/E37231_01/doc.20/e36961/start.htm#BDCUG119
Sqoop is a generic tool for working with the relational databases from Hadoop realm, and it is not limited by Oracle only. Besides it has an integration with other Hadoop solutions like Oozie for making complicated workflows, which makes it a good candidate over other types of connectors.
Personally myself I prefer Sqoop for Hadoop-driven import-export operations and connector approach for querying the data in Hadoop.
Sqoop will leverage a standard JDBC connection. Oracles connector will work with a fastloader/fastexport class integrated into the sqoop connection. It should be faster that Sqoop.

Does Hive depend on/require Hadoop?

Hive installation guide says that Hive can be applied to RDBMS, my question is, sounds like Hive can exist without Hadoop, right? It's an independent HQL engineer that could work with any data source?
You can run Hive in local mode to use it without Hadoop for debugging purposes. See below url
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provided JDBC driver to query hive like JDBC, however if you are planning to run Hive queries on production system, you need Hadoop infrastructure to be available. Hive queries eventually converts into map-reduce jobs and HDFS is used as data storage for Hive tables.

Hive JDBC Vs CLI client

I need to access data using Hive programatically (data in the order of GBs per query). I was evaluating CLI driver Vs Hive JDBC driver.
When we use JDBC, there is an extra overhead of thrift server & I am trying to understand how heavy is that. Also can it be a single point bottleneck if multiple clients connect to single thrift server? Or is it a common practice that people configure multiple thrift servers on Hadoop and do some load balancing stuff?
I am looking for the better performance rather than faster prototyping.
Thanks in advance.
Shengjie's link doesn't work- This might properly automagically linkify:
http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/
From performance point of view, yes, thrift server can potentially be the bottleneck and the SPF. I've seen people set up multiple thrift servers talking to mysql metastore. Take a look at this http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/.Hope it helps.
You can try using connection pooling. I had a similar issue while submitting hive query through JDBC was taking more time than hive cli.
Also in your connection string mention few parameters as below:
jdbc:hive2://servername:portno/;hive.execution.engine=tez;tez.queue.name=alt;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true;

Resources