Is there a way to find out which queries are active on a Derby database
and how long they have been running?
Something like performance monitoring.
Yes, there are several useful performance monitoring features in Derby. See this question for some discussion: apache derby - explain select
Related
I am pretty new to Oracle Golden Gate and wanted to understand whether it is possible to create a bidirectional sync between Oracle 12x and Cassandra (DSE) using Oracle Golden Gate. I searched several places on the internet, but most examples replicate data between Oracle databases. I started wondering if it is even possible to do so. Can anyone help me with any documentation?
There is a separate module called Oracle GoldenGate for BigData. It supports many NoSQL replication targets.
One of the supported BigData databases is also Apache Cassandra.
There is a separate manual explaining how to use it.
There is no separate module that allows you to connect Apache Cassandra as the source of your replication. If you need such replication you need to provide some intermediate step. The source of replication for Oracle GoldenGate can only be a database (Oracle, TimesTen, DB2, Informix, MySQL, MS SQL Server, NonStop SQL/MX, SAP/Sybase ASE, Teradata) or a JMS queue.
We have an Oracle database that holds our tables. We would like to implement a new project, as mentioned in the title: Oracle to Cassandra real-time replication.
This new Cassandra environment would serve as a reporting service. Our in-house application inserts data into the Oracle production environment. Then our custom service (or whatever we end up using) would read the delta and insert it into Cassandra, perhaps much like GoldenGate does.
Briefly, will Cassandra meet our needs for this scenario?
In our case, we have 20 Oracle DBs in different locations (all 20 have a similar implementation) and 1 central report DB that is refreshed daily from these 20 DBs. We use "outdated" snapshot technology: every night our single central report DB (REPORTDB) gathers the daily delta from these 20 DBs using Oracle snapshots with the fast refresh option. We need a structure that reads data from the 20 DBs and injects it in real time into a new Cassandra database, just like REPORTDB.
These days you can run Spark jobs on Cassandra, thanks to Datastax, so yes, it can be used as a reporting tool. It's best utilized as a key-value store if your number of writes is high compared to your reads.
Reading a delta is not real time, so you should try using Oracle's AQ (Advanced Queuing). I've been doing real-time replication of Oracle to Cassandra using Oracle's AQ and Apache Storm for almost 4 years now and it's running flawlessly.
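To illustrate why polling a delta lags behind the source, here is a minimal sketch of the "read delta and insert" service described in the question. Everything in it is an assumption made for illustration: an in-memory SQLite database stands in for Oracle, a plain dict stands in for Cassandra, and the `orders` table, its `updated_at` column and the `read_delta` helper are hypothetical names.

```python
import sqlite3


def read_delta(source, last_seen):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


# Stand-ins (assumptions): SQLite plays the Oracle source, a dict plays Cassandra.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "a", 10), (2, "b", 20)]
)

report_store = {}
watermark = 0

# First poll: copy everything newer than the watermark.
rows, watermark = read_delta(source, watermark)
for row_id, payload, _ in rows:
    report_store[row_id] = payload  # upsert keyed by primary key

# A change made after the poll stays invisible until the next poll runs.
source.execute("UPDATE orders SET payload = 'b2', updated_at = 30 WHERE id = 2")
stale = report_store[2]

# Second poll: only the changed row is picked up.
rows, watermark = read_delta(source, watermark)
for row_id, payload, _ in rows:
    report_store[row_id] = payload
```

Between the two polls the report store still holds the old value, which is the lag a queue-based approach is meant to avoid.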
I don't understand this architecture with Oracle and Cassandra running side by side.
Either Oracle suits your needs, in which case you should stick with it, or it doesn't and you need scalability/high availability, in which case you should switch to Cassandra.
Can you elaborate on the reasons that made you choose Cassandra for the reporting service?
According to this page: https://spark.apache.org/sql/ you can connect existing BI tools to Spark SQL via ODBC or JDBC:
I don't mean Shark as this is basically EOL:
It is for this reason that we are ending development in Shark as a separate project and moving all our development resources to Spark SQL, a new component in Spark.
How would a BI tool (like Tableau) connect to Spark SQL via ODBC?
With the release of Spark SQL 1.1 you also have a Thrift JDBC driver; see https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Simba provides the ODBC driver that Databricks uses, however that is only for the Databricks distribution. We are launching the public version for use with Apache tomorrow (Wed, Dec 3rd) at www.simba.com. You'll be able to download and trial the driver for use with Tableau then.
As Carlos said, Stratio Meta is a module that acts as a parser, validator, planner and coordinator layer over the different persistence layers (currently only Cassandra and Mongo, with HDFS to follow in the short term). This module offers a shell with a SQL-like language, a Java/Scala API, a REST API and ODBC (JDBC shortly). It also uses another Stratio module, Stratio Deep, which allows us to use Apache Spark to execute queries efficiently and quickly.
Disclaimer: I am currently employed by Stratio Big Data
Please take a look at: http://www.openstratio.org/blog/connecting-to-the-stratio-big-data-platform-using-odbc-2/
Stratio is a platform that includes a certified Spark distribution that allows you to connect Spark to any type of data repository (like Cassandra, MongoDB, ...). It has an ODBC driver, so you can write SQL queries that will be translated to Spark jobs or, where possible and even faster, to direct queries against Cassandra (or whichever database you connect to it). This way, it is pretty simple to connect Tableau to Spark and your data repository. If you need any help, we will be more than glad to assist you.
Disclaimer: I'm one of Stratio's ODBC developers
Simba will offer one: http://databricks.com/blog/2014/04/30/Databricks-selects-Simba-ODBC-driver-for-shark.html. No known official release date.
[update]
Use Hive's ODBC driver to connect to Spark SQL as described here and here.
For Spark on Azure HDInsight, you can connect Tableau (or PowerBI) as described here https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/. The ODBC driver is here: http://www.microsoft.com/en-us/download/details.aspx?id=47713
Currently I am doing a project in Business Intelligence and Big Data area, 2 areas in which in all honesty I am new and very green.
I was planning to build a Hive data warehouse using MongoDB and connect it with a Business Intelligence platform like Pentaho. While researching I came across Spark and got interested in its Shark module due to its in-memory functionality and the increase in query performance.
I know that I can connect Hive to Pentaho, but I was wondering whether I could use Shark queries between them for performance. If not, does anyone know of any other BI platform that would allow that?
As I said, I am pretty new to these areas, so feel free to correct me, since there is a good chance that I have some concepts mixed up and have said something idiotic.
I think that you should build either a Hive data warehouse using Hive or a MongoDB data warehouse using MongoDB. I didn't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure the BI tool with a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from the DB is completely transparent to the BI tool.
Thus, you can use Hive, Shark or any other DB which comes with a JDBC driver.
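As a rough analogy for that transparency, the sketch below uses Python's DB-API in place of JDBC. The `fetch_report` helper and the `sales` table are hypothetical, and SQLite is only a stand-in so the example runs anywhere; the point is that the reporting code talks to a generic connection and never to a specific engine. One caveat: parameter placeholder styles differ between DB-API drivers, so the example avoids parameters.

```python
import sqlite3


def fetch_report(connection, sql):
    """Run a query through any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()


# Stand-in (assumption): SQLite replaces Hive, Shark or any other backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 5), ("south", 7), ("north", 3)],
)

totals = fetch_report(
    conn,
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
)
```

Swapping the backend would mean changing only how `conn` is created, which is the same separation a BI tool relies on with JDBC drivers.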
I can summarize your options this way:
Hive: has the most complete feature set and is the most compatible tool. It can be used over plain data, or you can ETL the data into its ORC format to boost performance.
Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into its Parquet format to boost performance.
Shark: cutting edge, not mainstream yet. Performance depends on what percentage of your data fits into RAM across your cluster.
First of all, Shark is being absorbed by Spark SQL.
Spark SQL provides a JDBC/ODBC connector. That should allow you to integrate it with most of your existing platforms.
I need to access data using Hive programmatically (data in the order of GBs per query). I was evaluating the CLI driver vs. the Hive JDBC driver.
When we use JDBC, there is the extra overhead of the Thrift server, and I am trying to understand how heavy that is. Can it also become a single-point bottleneck if multiple clients connect to a single Thrift server? Or is it common practice to configure multiple Thrift servers on Hadoop and do some load balancing?
I am looking for better performance rather than faster prototyping.
Thanks in advance.
Shengjie's link doesn't work. This might properly automagically linkify:
http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/
From a performance point of view, yes, the Thrift server can potentially be the bottleneck and the single point of failure (SPOF). I've seen people set up multiple Thrift servers talking to a MySQL metastore. Take a look at this: http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/. Hope it helps.
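One way to picture spreading clients over several Thrift servers is a simple client-side round-robin, sketched below. This is only an illustration under assumptions: the host names are hypothetical, the `RoundRobinEndpoints` class is made up for this example, and the `jdbc:hive2://` scheme is borrowed from the connection string quoted elsewhere in this thread. A dedicated load balancer in front of the servers would be an alternative.

```python
from itertools import cycle


class RoundRobinEndpoints:
    """Hand out Thrift server URLs in rotation (client-side balancing)."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = cycle(endpoints)

    def next_url(self):
        host, port = next(self._endpoints)
        return f"jdbc:hive2://{host}:{port}/"


# Hypothetical hosts, used only for illustration.
balancer = RoundRobinEndpoints(
    [("thrift1.example.com", 10000), ("thrift2.example.com", 10000)]
)
urls = [balancer.next_url() for _ in range(3)]
```

Each new client connection would ask the balancer for the next URL, so no single Thrift server receives all of the traffic.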
You can try using connection pooling. I had a similar issue where submitting a Hive query through JDBC took more time than through the Hive CLI.
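A minimal sketch of the pooling idea follows, assuming a fixed-size pool and using SQLite connections as a stand-in for Hive JDBC connections. The `ConnectionPool` class and the `connect` factory are hypothetical names for this example; in a real setup an existing pooling library would normally be used instead.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keep a fixed set of open connections and lend them out one at a time."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks while all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it


opened = []


def connect():
    # Stand-in (assumption): a real factory would open a Hive connection here.
    conn = sqlite3.connect(":memory:")
    opened.append(conn)
    return conn


pool = ConnectionPool(connect, size=2)
results = []
for _ in range(5):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])
```

Five queries run here while only two connections are ever opened, so the connection setup cost is paid once per pooled connection rather than once per query.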
Also, add a few parameters to your connection string, as below:
jdbc:hive2://servername:portno/;hive.execution.engine=tez;tez.queue.name=alt;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true;
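If the settings change often, the connection string above can be assembled from a mapping rather than typed by hand. The sketch below reproduces the layout used in this answer; `build_hive_jdbc_url` is a hypothetical helper, and `servername` and `portno` remain placeholders. Note that the HiveServer2 documentation also describes a `?`-separated section for Hive configuration settings, so it is worth checking which section your driver version expects.

```python
def build_hive_jdbc_url(host, port, settings):
    """Join settings as ';key=value' pairs after the host, as in the answer."""
    pairs = ";".join(f"{key}={value}" for key, value in settings.items())
    return f"jdbc:hive2://{host}:{port}/;{pairs}"


url = build_hive_jdbc_url(
    "servername",  # placeholder, as in the original string
    "portno",      # placeholder, as in the original string
    {
        "hive.execution.engine": "tez",
        "tez.queue.name": "alt",
        "hive.exec.parallel": "true",
        "hive.vectorized.execution.enabled": "true",
        "hive.vectorized.execution.reduce.enabled": "true",
    },
)
```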