I have run several large queries using impala-shell and found the performance to be satisfactory. These queries typically write 100k-1m rows to disk. However, when I run the very same queries programmatically using JDBC, the results take much, much longer to write to disk. For example, a query which takes five minutes from impala-shell takes up to thirty minutes over JDBC.
I have tried both the Hive and Cloudera JDBC drivers but get similarly bad performance. I have tried various fetch sizes, but it has not made any difference (a sketch of roughly how I fetch the rows is below). Is Impala streaming over JDBC fundamentally slow, or could I do something else to speed up the streaming?
This is on CDH 5.9.1.
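For reference, this is roughly how I fetch and write the rows over JDBC; the URL, credentials, table, and output path are placeholders, and I have varied the fetch size without seeing any difference:

    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcDump {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; I have tried both the Hive driver
            // (jdbc:hive2://...) and the Cloudera Impala driver (jdbc:impala://...).
            String url = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 BufferedWriter out = Files.newBufferedWriter(
                         Paths.get("result.tsv"), StandardCharsets.UTF_8)) {
                stmt.setFetchSize(10000); // varied this widely with no visible difference
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                    int cols = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        StringBuilder row = new StringBuilder();
                        for (int i = 1; i <= cols; i++) {
                            if (i > 1) row.append('\t');
                            row.append(rs.getString(i));
                        }
                        out.write(row.toString());
                        out.newLine();
                    }
                }
            }
        }
    }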
This turned out to be a client-side issue. I was using curl to test a web application which was making the Impala queries. Switching from curl to a client written in Scala removed the latency.
In a MapR cluster using YARN and the Tez engine, we need to query Hive data from DataStage using the JDBC connector. In some cases we need to increase the Tez container size because of the data size. We do that in a Before SQL statement in a parallel job, and then we query the data in the main job statement.
The problem is that the Before SQL statement SET hive.tez.container.size=3000 takes hours, while the query on the data itself runs fine (a few seconds).
Could it be related to how busy the cluster is at that time? Many jobs in the queue?
I don't think so, because it always hangs on the SET statement, never on the SELECT statement.
Thanks in advance!
I would suggest using the IBM-provided Hive JDBC driver and the Hive Connector stage, which allows you to set Hive parameters via a built-in stage property.
When a DataStage job runs slowly, it could be for several reasons. From what you are saying, setting hive.tez.container.size=3000 in the Before SQL statement is what takes hours, so I would suggest looking at the Hive DB side while the DataStage job is running.
If you are not using the IBM-provided Hive JDBC driver, then it is better to engage the official support of the third-party Hive JDBC driver vendor to enable JDBC driver tracing.
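If you do stay on a plain Hive JDBC driver, one alternative worth trying is to pass the setting as a Hive configuration property in the connection URL instead of issuing a separate SET statement. A minimal sketch; the host, port, credentials, and table are placeholders, and depending on your HiveServer2 configuration the property may need to be whitelisted:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TezContainerSizeViaUrl {
        public static void main(String[] args) throws Exception {
            // In the HiveServer2 URL format, Hive configuration properties go
            // after the '?' separator, so every statement on this connection
            // runs with the larger Tez container (no separate SET needed).
            String url = "jdbc:hive2://hiveserver:10000/default"
                       + "?hive.tez.container.size=3000";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM some_table")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }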
I would like to know how Hadoop components are used in real time.
Here are my questions:
Data importing/export:
I know the options available in Sqoop, but I would like to know how Sqoop is commonly used in real-time implementations.
If I'm correct:
1.1 Sqoop commands are placed in shell scripts and called from schedulers/event triggers. Can I have a real code example of this, specifically one that passes parameters to Sqoop dynamically (such as the table name) from a shell script? (A sketch of what I mean follows below.)
1.2 I believe an Oozie workflow could also be used. Any examples, please?
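To make 1.1 concrete, the sketch below is roughly what I have in mind: the table name arrives as a runtime parameter and the Sqoop command line is assembled and launched from code (shown here with Java's ProcessBuilder rather than a shell script; the JDBC URL, credentials, and paths are placeholders):

    import java.util.Arrays;
    import java.util.List;

    public class SqoopImportRunner {
        public static void main(String[] args) throws Exception {
            // Table name supplied dynamically, e.g. by a scheduler or event trigger.
            String tableName = args.length > 0 ? args[0] : "CUSTOMERS";

            // Placeholder connection details; the flags are standard Sqoop 1 options.
            List<String> command = Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
                    "--username", "etl_user",
                    "--password-file", "/user/etl/.sqoop.pwd",
                    "--table", tableName,
                    "--target-dir", "/data/staging/" + tableName.toLowerCase(),
                    "--num-mappers", "4");

            Process process = new ProcessBuilder(command)
                    .inheritIO()   // stream Sqoop's output to our own stdout/stderr
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new RuntimeException("Sqoop import failed for table " + tableName);
            }
        }
    }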
Pig
How are Pig commands commonly called in real time? Via Java programs?
Any real-time code examples would be a great help.
If I am correct, Pig is commonly used for data quality checks/cleanup on staging data before loading it into the actual HDFS path or into Hive tables.
And we could see Pig scripts invoked from shell scripts (in real-time projects).
Please correct me or add anything I missed.
Hive
Where will we see Hive commands in real-time scenarios?
In shell scripts, or in Java API calls for reporting?
HBase
HBase commands are commonly called as API calls from languages like Java.
Am I correct?
Sorry for so many questions. I don't see any articles/blogs on how these components are used in real-time scenarios.
Thanks in advance.
The reason you don't see articles on the use of those components in real-time scenarios is that those components are not real-time oriented, but batch oriented.
Sqoop: not used in real time; it is batch oriented.
I would use something like Flume to ingest data.
Pig, Hive: again, not real-time ready. Both are batch oriented; the setup time of each query/script can take tens of seconds.
You can replace both with something like Spark Streaming (it even supports Flume).
HBase: a NoSQL database on top of HDFS. It can be used for real time and is quick on inserts; it can also be used from Spark (a small Java client sketch follows below).
If you want to use those systems to support real-time apps, think of something like a Lambda architecture, which has a batch layer (using Hive, Pig and the like) and a speed layer using streaming/real-time technologies.
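On the HBase point from your question: yes, HBase is usually driven through its client API from a language like Java. A minimal sketch, assuming the HBase client library is on the classpath and a reachable cluster; the table name, column family, and values are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {

                // Write a cell: row key "row-1", column family "d", qualifier "status".
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("OK"));
                table.put(put);

                // Read it back.
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"));
                System.out.println(Bytes.toString(value));
            }
        }
    }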
Regards.
I have a SAS application where I pull data from Oracle and produce Excel reports using Base SAS and SAS macros. Now the problem is that my database is getting huge day by day, and fetching data from Oracle is taking more and more time; as a result, my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I need to use for this?
The short answer is: it depends.
For unloading data from Oracle, I would recommend using Sqoop (http://sqoop.apache.org/); it is designed for this specific use case, can do incremental loads, and can create a Hive table for the unloaded data.
Once the data is unloaded, you can use Impala to build the report you need. Impala can work natively with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code as a set of SQL statements that run on top of Impala.
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool capable of using ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have fairly structured data, and an MPP database might be a better fit for this case.
Currently I am doing a project in the Business Intelligence and Big Data areas, two areas in which, in all honesty, I am new and very green.
I was planning to build a Hive data warehouse using MongoDB and connect it to a Business Intelligence platform like Pentaho. While researching I came across Spark and got interested in its Shark module because of its in-memory functionality and the performance increase when running queries.
I know that I can connect Hive to Pentaho, but what I was wondering is whether I could use Shark queries between them for performance. If not, does anyone know of another BI platform that would allow that?
As I said, I am pretty new to these areas, so feel free to correct me; there is a good chance I have some concepts mixed up and have said something idiotic.
I think that you should build a Hive data warehouse using Hive, or a MongoDB data warehouse using MongoDB. I don't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure the BI tool with a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from the DB is completely transparent to the BI tool.
Thus, you can use Hive, Shark, or any other DB that comes with a JDBC driver.
I can summarize your options this way:
Hive: the most complete feature set and the most compatible tool. It can be used over plain data, or you can ETL the data into its ORC format to boost performance.
Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into Parquet format to boost performance.
Shark: cutting edge, not mainstream yet. Performance depends on what fraction of your data can fit into RAM across your cluster.
First of all, Shark is being absorbed by Spark SQL.
Spark SQL provides a JDBC/ODBC connector, which should allow you to integrate it with most of your existing platforms.
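For example, the Spark Thrift server exposes a HiveServer2-compatible endpoint, so any client that can use the Hive JDBC driver (including most BI tools) can query Spark SQL through it. A rough sketch in Java; the host, port, credentials, and query are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SparkSqlJdbcExample {
        public static void main(String[] args) throws Exception {
            // The Spark Thrift server speaks the HiveServer2 protocol; 10000 is a
            // common default port, but adjust for your deployment.
            String url = "jdbc:hive2://spark-thrift-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }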
I need to access data in Hive programmatically (on the order of GBs per query). I am evaluating the CLI driver vs. the Hive JDBC driver.
When we use JDBC, there is the extra overhead of the Thrift server, and I am trying to understand how heavy that is. Also, can it become a single-point bottleneck if multiple clients connect to a single Thrift server? Or is it common practice to configure multiple Thrift servers on Hadoop and do some load balancing?
I am looking for better performance rather than faster prototyping.
Thanks in advance.
Shengjie's link doesn't work. This one might properly automagically linkify:
http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/
From a performance point of view, yes, the Thrift server can potentially be the bottleneck and a single point of failure. I've seen people set up multiple Thrift servers talking to a MySQL metastore. Take a look at this: http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/. Hope it helps.
You can try using connection pooling. I had a similar issue where submitting a Hive query through JDBC took more time than the Hive CLI (a pooled-connection sketch follows below).
Also, in your connection string, mention a few parameters as below:
jdbc:hive2://servername:portno/;hive.execution.engine=tez;tez.queue.name=alt;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true;
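For illustration, a minimal pooling sketch assuming HikariCP as the pool (any JDBC connection pool would do); the host, port, credentials, pool size, and query are placeholders, and depending on the driver the Hive settings may need to go after a '?' as configuration properties rather than as ';' session variables:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class PooledHiveQueries {
        public static void main(String[] args) throws Exception {
            HikariConfig config = new HikariConfig();
            // The settings travel in the URL, so every pooled connection picks them up.
            // Host and port are placeholders (10000 is the usual HiveServer2 default).
            config.setJdbcUrl("jdbc:hive2://servername:10000/"
                    + ";hive.execution.engine=tez;tez.queue.name=alt"
                    + ";hive.exec.parallel=true"
                    + ";hive.vectorized.execution.enabled=true"
                    + ";hive.vectorized.execution.reduce.enabled=true");
            config.setUsername("user");     // placeholder credentials
            config.setPassword("password");
            config.setMaximumPoolSize(5);   // connections are reused instead of re-opened per query

            try (HikariDataSource dataSource = new HikariDataSource(config);
                 Connection conn = dataSource.getConnection();
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM some_table")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }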