How to connect Apache Drill with Cassandra in embedded mode (Cassandra 2.0)

I want to connect Drill with Cassandra. I found one blog post, but when I followed its instructions I got an error. Does anyone know how to connect the two?

Drill (up to 1.4) does not provide direct support for Cassandra. You can write your own patch on top of core Drill to achieve this.
This article can be used as a reference, but it was tested against an older version of Drill and does not work with Drill 1.0+.
In future releases, we can expect direct support for Cassandra from the Drill developers. :)


Apache Sqoop moved into the Attic in 2021-06

I have installed Hadoop 3.3.1 and Sqoop 1.4.7, which don't seem to be compatible: I get a deprecated-API error while importing an RDBMS table.
When I searched for compatible versions, I found that Apache Sqoop has been moved to the Apache Attic, and the documentation for 1.4.7, the last stable version, says: "Sqoop is currently supporting 4 major Hadoop releases - 0.20, 0.23, 1.0 and 2.0."
Could you please explain what this means and what I should do?
Could you also suggest alternatives to Sqoop?
It means just what the board minutes say: Sqoop has become inactive and is now moved to the Apache Attic. This doesn't mean Sqoop is deprecated in favor of some other project, but for practical purposes you should probably not build new implementations using it.
Much of the same functionality is available in other tools, including other Apache projects. Possible options are Spark, Kafka, and Flume. Which one to use depends heavily on the specifics of your use case, since none of them quite fills the same niche as Sqoop. The database connectivity of Spark makes it the most flexible solution, but it could also be the most labor-intensive to set up. Kafka might work, although it's not quite as ad-hoc friendly as Sqoop (take a look at Kafka Connect). I probably wouldn't use Flume, but it might be worth a look (it is mainly meant for shipping logs).
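As a rough sketch of what the Spark route could look like for a Sqoop-style table import: this assumes PySpark is installed and that the JDBC driver jar for your database is on Spark's classpath. The helper names `jdbc_options` and `import_table` are made up for illustration, as are the connection details in the usage comment. The option keys themselves (`url`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are the ones Spark's JDBC data source accepts.

```python
def jdbc_options(url, table, user, password, driver,
                 partition_column=None, lower=None, upper=None,
                 num_partitions=None):
    """Build the option dict for Spark's JDBC data source (hypothetical helper)."""
    opts = {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }
    if partition_column is not None:
        # Spark expects the partitioning options to be given together;
        # they split the read into parallel range queries, roughly like
        # Sqoop's --split-by.
        if lower is None or upper is None or num_partitions is None:
            raise ValueError("lower, upper and num_partitions are required "
                             "when partition_column is set")
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),
        })
    return opts


def import_table(spark, opts, target_path):
    """Read one table over JDBC and write it out as Parquet files."""
    df = spark.read.format("jdbc").options(**opts).load()
    df.write.mode("overwrite").parquet(target_path)


# Usage (all values are placeholders, not taken from the question):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("rdbms-import").getOrCreate()
#   opts = jdbc_options("jdbc:mysql://dbhost:3306/shop", "orders",
#                       "etl_user", "secret", "com.mysql.cj.jdbc.Driver",
#                       partition_column="id", lower=1, upper=1000000,
#                       num_partitions=8)
#   import_table(spark, opts, "hdfs:///data/shop/orders")
```

Unlike Sqoop, nothing here is generated for you: scheduling, incremental imports, and type mapping are left to your own job code, which is the "labor-intensive" part mentioned above.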

MapR 5.2.2 clients

I have a task which requires me to create a Go program that reads from an HBase table.
HBase is installed in a MapR cluster.
Every other application (Java) uses a MapR client to connect to the MapR cluster and retrieve the data.
However, I am unable to find a way to connect to HBase from a Go application.
I have found an HBase package, but it does not support integration with MapR.
It would be great if anyone could guide me in this situation.
I have also seen that MapR 6 and above has Go support through OJAI, but sadly, upgrading MapR is not an option.
Can someone advise me on how to proceed?
If you are actually running HBase in MapR, then the Go package for HBase should work (assuming version match and such).
If you are actually using MapR DB Binary tables (which are roughly HBase compatible), the best approach is likely to use the Thrift API or REST.
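To give an idea of the REST route: the HBase REST gateway speaks plain HTTP and JSON, so the same request can be made from Go's `net/http` or any other client. The sketch below is in Python only to show the wire format. It assumes a gateway is running and reachable (host, port, and table path are placeholders), and that MapR table paths such as `/apps/t1` need to be percent-encoded in the URL, which is worth checking against the MapR documentation for your version. In the JSON response, row keys, column names, and values (the `$` field) come back base64-encoded.

```python
import base64
import json
import urllib.parse
import urllib.request


def row_url(base_url, table, row_key):
    """Build the gateway URL for one row, percent-encoding table and key."""
    table_part = urllib.parse.quote(table, safe="")
    row_part = urllib.parse.quote(row_key, safe="")
    return f"{base_url.rstrip('/')}/{table_part}/{row_part}"


def decode_rows(payload):
    """Decode the gateway's JSON into {row_key: {"family:qualifier": bytes}}."""
    doc = json.loads(payload)
    rows = {}
    for row in doc.get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            # Values are kept as raw bytes; their encoding is up to the writer.
            cells[column] = base64.b64decode(cell["$"])
        rows[key] = cells
    return rows


def fetch_row(base_url, table, row_key, timeout=10):
    """GET a single row as JSON from the REST gateway and decode it."""
    req = urllib.request.Request(
        row_url(base_url, table, row_key),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_rows(resp.read())


# Usage (placeholder host and table path):
#   cells = fetch_row("http://gateway-host:8080", "/apps/t1", "row1")
```

In Go, the equivalent is a `GET` with the `Accept: application/json` header, followed by `encoding/json` and `encoding/base64` for the decoding step.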
The OJAI lightweight client should work well in Go since it uses gRPC to talk to the underlying table (and thus gains lots of portability). The problem in your case is not so much that you would need to upgrade the platform as that the lightweight client only works with MapR DB JSON (the document-oriented version of MapR DB).
Ping me directly if you would like more information.

Can Informatica Big Data Edition (not the cloud version) connect to Cloudera Impala?

We are trying to do a proof of concept on Informatica Big Data Edition (not the cloud version), and I have seen that we might be able to use HDFS and Hive as source and target. But my question is: does Informatica connect to Cloudera Impala? If so, do we need an additional connector for that? I have researched this thoroughly to check whether it is supported but could not find anything. Has anyone already tried this? If so, can you specify the steps and link to any documentation?
Informatica version: 9.6.1 (Hotfix 2)
You can use the ODBC driver provided by Cloudera.
http://www.cloudera.com/downloads/connectors/impala/odbc/2-5-22.html
For Irene: you can use the same driver; the one above is based on the Simba driver.
http://www.simba.com/drivers/hbase-odbc-jdbc/

Cassandra integration with Hadoop

I am a newbie to Cassandra. I am posting this question because different documents give different details about integrating Hive with Cassandra, and I was not able to find the GitHub page.
I have installed a single node Cassandra 2.0.2 (Datastax Community Edition) in one of the data nodes of my 3 node HDP 2.0 cluster.
I am unable to use hive to access Cassandra using 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'. I am getting the error ' return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
I have copied all the jars in /$cassandra_home/lib/* to /$hive-home/lib and also included the /cassandra_home/lib/* in the $HADOOP_CLASSPATH.
Are there any other configuration changes that I have to make to integrate Cassandra with Hadoop/Hive?
Please let me know. Thanks for the help!
Thanks,
Arun
Probably these are starting points for you:
Hive support for Cassandra, github
Top level article related to your topic with general information: Hive support for Cassandra CQL3.
Hadoop support, Cassandra Wiki.
Your question is actually quite broad; there could be a lot of reasons for this error. Keep in mind, though, that Hive is based on the MapReduce engine.
Hope this helps.

Cassandra - Hive integration

What's the best practice for integrating Cassandra and Hive?
An old question on Stack Overflow (Cassandra wih Hive) points to Brisk, which has since become the subscription-only DataStax Enterprise product.
A Google search only points to two open JIRA issues,
https://issues.apache.org/jira/browse/CASSANDRA-4131
https://issues.apache.org/jira/browse/HIVE-1434
but neither of them has resulted in any code committed to either project.
Is the only way to integrate Cassandra and Hive patching the Cassandra/Hive source code? Which solution are you using in your stack?
I did the same research a month ago and reached the same conclusion.
Brisk is no longer available as a community download, and besides patching the Cassandra/Hive code, the only way to throw map/reduce jobs at your Cassandra database is to use DSE -- Datastax Enterprise, which I believe is free for any use but production clusters.
You might have a look at HBase which is based on HDFS.
There's an open source Cassandra Storage Handler for Hive currently maintained by Datastax.
Here is a Git repository with a Cassandra Hive driver for Cassandra 2.0 and Hadoop 2:
https://github.com/2013Commons/hive-cassandra
and another for Cassandra 1.2:
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
You can use an integration framework or integration suite for this problem. Take a look at my presentation "Big Data beyond Hadoop - How to integrate ALL your data" for more information about how to use open source integration frameworks and integration suites with Hadoop.
For example, Apache Camel (integration framework) and Talend Open Studio for Big Data (integration suite) are two open source solutions which offer connectors to both Cassandra and Hadoop.
