Connect to Spark SQL via ODBC - hadoop

According to this page: https://spark.apache.org/sql/ you can connect existing BI tools to Spark SQL via ODBC or JDBC:
I don't mean Shark as this is basically EOL:
It is for this reason that we are ending development in Shark as a separate project and moving all our development resources to Spark SQL, a new component in Spark.
How would a BI tool (like Tableau) connect to shark sql via ODBC?

With the release of Spark SQL 1.1 you also have thrift JDBC driver see https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine

Simba provides the ODBC driver that Databricks uses, however that is only for the Databricks distribution. We are launching the public version for use with Apache tomorrow (Wed, Dec 3rd) at www.simba.com. You'll be able to download and trial the driver for use with Tableau then.

As Carlos said, Stratio Meta is a module that acts as a parser, validator, planner and coordinator layer over the different persistence layers (currently, only Cassandra and Mongo, but also HDFS in the short term). This modules offers a Shell with a SQL-like language, a Java/Scala API, a REST API and ODBC (JDBC shortly). It also uses another Stratio module, Stratio Deep, which allows us to use Apache Spark in order to execute query in an efficent and fast way.
Disclaimer: I am currently employed by Stratio Big Data

Please take a look at: http://www.openstratio.org/blog/connecting-to-the-stratio-big-data-platform-using-odbc-2/
Stratio is a platform that includes a certified Spark distribution that allows you to connect Spark to any type of data repository (like Cassandra, MongoDB,...). It has an ODBC Driver so you can write SQL queries that will be translated to Spark jobs, or even faster, direct queries to Cassandra -or whichever database you want to connect to it - if possible. This way, it is pretty simple to connect Tableau into Spark and your data repository. If you need any help, we will be more than glad to assist you.
Disclaimer: I'm one of Stratio's ODBC developers

Simba will offer one: http://databricks.com/blog/2014/04/30/Databricks-selects-Simba-ODBC-driver-for-shark.html. No known official release date.
[update]
Use HIVE's ODBC driver to connect to Spark SQL as described here and here.

For Spark on Azure HDInsight, you can connect Tableau (or PowerBI) as described here https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/. The ODBC driver is here: http://www.microsoft.com/en-us/download/details.aspx?id=47713

Related

MapR 5.2.2 clients

I have a task which requires me to create a Go program to read from an HBASE table.
HBASE is installed in a MapR cluster.
Every other application (Java) uses a MapR client to connect to the MapR cluster so as to retrieve the data.
However, I am unable to find a way to connect to HBASE with a Go application.
I have found HBASE package, but it does not support integration with MapR.
It would be great if anyone could guide me in this situation.
I also have seen that for MapR 6 and above has Go support through OJAI, but sadly, upgrading MapR is not an option.
Can someone advice me how to proceed in this situation?
If you are actually running HBase in MapR, then the Go package for HBase should work (assuming version match and such).
If you are actually using the MapR DB Binary tables (which are roughly HBase compatible) the likely best approach would be to use the Thrift API or REST.
The OJAI lightweight client should work well in Go since it uses gRPC to talk to the underlying table (and thus gains lots of portability). The problem in your case won't be so much that you need to upgrade the platform so much as the lightweight client only works with MapR DB JSON (the document oriented version of MapR DB).
Ping me directly if you would like more information.

Can ETL informatica Big Data edition (not the cloud version) connect to Cloudera Impala?

We are trying do a proof of concept on Informatica Big Data edition (not the cloud version) and I have seen that we might be able to use HDFS, Hive as source and target. But my question is does Informatica connect to Cloudera Impala? If so, do we need to have any additional connector for that? I have done comprehensive research to check if this is supported but could not find anything. Did anyone already try this? If so, can you specify the steps and link to any documentation?
Informatica version: 9.6.1 (Hotfix 2)
You can use the odbc driver provided by cloudera.
http://www.cloudera.com/downloads/connectors/impala/odbc/2-5-22.html
For Irene, the you can use the same driver the above one is based the simba driver.
http://www.simba.com/drivers/hbase-odbc-jdbc/

how to integrate Cassandra with Hadoop to take advantage of Hive

It is almost 3 days that I've been looking for a solution at year 2015 to integrate Cassandra on Hadoop and lots of resources on the net are outdated or vanished from the net and the Datastax Enterprise offers no free of charge solution for such integration.
What are the options for doing such? I want to use Hive query language to get data from my Cassandra and I think the first step is to integrate the Cassandra with Hadoop.
The easiest (but also paid option) is to use Datastax Enterprise packaging of C* with Hadoop + Hive. This provides an automatic connection and registration of Hive tables with C* and includes and setups up a Hadoop execution platform if you need one.
http://www.datastax.com/products/datastax-enterprise
The second easiest way is to utilize Spark instead. The Spark Cassandra Connector is open source and allows HiveQL to be used to access C* tables. This is done running on Spark as an execution platform instead of Hadoop but has similar (if not better) performance.
With this solution I would standup a stand alone Spark Cluster (since you don't have an existing hadoop infra) and then use the spark-sql-thrift server to run queries against C* tables.
https://github.com/datastax/spark-cassandra-connector
There are other options but these are the ones I am most familiar with (and conflict of interest notice, also develop :D )

Hadoop remote data sources connectivity

What are the available hadoop remote data sources connectivity options?
I know about drivers for MongoDB, MySQL and Vertica connectivity but my question is what are other available data sources that have driver for hadoop connectivity?
These are the few I am aware of :
Oracle
ArcGIS Geodatabase
Teradata
Microsoft SQL Server 2008 R2 Parallel Data Warehouse (PDW)
PostgreSQL
IBM InfoSphere warehouse
Couchbase
Netezza
Tresata
But I am still wondering about the intent of this question. Every data source fits into a particular use case. Like, Couchbase for document data storage, Tresata for financial data storage and so on. Are you going to decide your store based on the connector availability??I don't think so.
Your list will be too long to be useful.
Just one reference: cascading gives you access to almost anything you want to access. More, you're not limited with Java. For example there is scalding component which provides very good framework for Scala programmers.

Cassandra - Hive integration

What's the best practice for integrating Cassandra and Hive?
An old question on Stackoverflow (Cassandra wih Hive) points to Brisk, which has now become a subscription-only Datastax Enterprise product.
A google search only points to two open jira issues,
https://issues.apache.org/jira/browse/CASSANDRA-4131
https://issues.apache.org/jira/browse/HIVE-1434
but none of them has resulted in any code committed in one of the two projects.
Is the only way to integrate Cassandra and Hive patching the Cassandra/Hive source code? Which solution are you using in your stack?
I did the same research a month ago, to reach to the same conclusion.
Brisk is no longer available as a community download, and besides patching the Cassandra/Hive code, the only way to throw map/reduce jobs at your Cassandra database is to use DSE -- Datastax Enterprise, which I believe is free for any use but production clusters.
You might have a look at HBase which is based on HDFS.
There's an open source Cassandra Storage Handler for Hive currently maintained by Datastax.
here is a git de cassandra hive driver with cassandra 2.0 and hadoop 2,
https://github.com/2013Commons/hive-cassandra
and others for cassandra 1.2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
You can use an integration framework or integration suite for this problem. Take a look at my presentation "Big Data beyond Hadoop - How to integrate ALL your data" for more information about how to use open source integration frameworks and integration suites with Hadoop.
For example, Apache Camel (integration framework) and Talend Open Studio for Big Data (integration suite) are two open source solutions which offer connectors to both Cassandra and Hadoop.

Resources