Cassandra integration with Hadoop - hadoop

I am a newbie to Cassandra. I am posting this question because different documents provide different details on integrating Hive with Cassandra, and I was not able to find the GitHub page.
I have installed a single-node Cassandra 2.0.2 (DataStax Community Edition) on one of the data nodes of my 3-node HDP 2.0 cluster.
I am unable to use Hive to access Cassandra using 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'. I am getting the error 'return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'.
I have copied all the jars from /$cassandra_home/lib/* to /$hive-home/lib and also included /cassandra_home/lib/* in $HADOOP_CLASSPATH.
Are there any other configuration changes I have to make to integrate Cassandra with Hadoop/Hive?
Please let me know. Thanks for the help!
Thanks,
Arun

These are probably good starting points for you:
Hive support for Cassandra, GitHub.
Top-level article related to your topic with general information: Hive support for Cassandra CQL3.
Hadoop support, Cassandra Wiki.
Actually, your question is not very narrow; there could be a lot of reasons for this. But what you should remember is that Hive is based on the MapReduce engine.
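For orientation, the table definition you would issue through Hive usually looks something like the sketch below. This is only an illustration: it assumes the hive-cassandra handler jar (which is not part of Apache Hive itself, so the Cassandra server jars alone are not enough) is on Hive's classpath and that HiveServer2 is reachable over JDBC; the keyspace, table, and property names are placeholders that differ between handler builds.

```java
// Minimal sketch (assumptions: HiveServer2 on localhost:10000, a
// hive-cassandra storage handler jar already on Hive's classpath, and a
// Cassandra keyspace "demo_ks" containing a "users" table). The
// SERDEPROPERTIES keys below are placeholders, not guaranteed names.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCassandraTableSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // External Hive table mapped onto the Cassandra CQL3 table.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS users_hive (id string, name string) "
              + "STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' "
              + "WITH SERDEPROPERTIES ("
              + "'cassandra.host' = '127.0.0.1', "
              + "'cassandra.ks.name' = 'demo_ks', "
              + "'cassandra.cf.name' = 'users')");
        }
    }
}
```

The same DDL can of course be typed directly into the Hive CLI; the JDBC wrapper is only there to keep the example self-contained.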
Hope this helps.

Related

Does Apache Kylin need Apache Derby or MySQL to run the sample cube?

I installed Java, Hadoop, HBase, Hive, Spark, and Kylin:
hadoop-3.0.3
hbase-1.2.6
apache-hive-2.3.3-bin
spark-2.2.2-bin-without-hadoop
apache-kylin-2.3.1-bin
I will be grateful if someone can help me with installing and configuring Kylin.
http://kylin.apache.org/docs/ may help you. You can also send an email to dev@kylin.apache.org, and your questions will be discussed and answered on the mailing list. Some tips for that email:
1. Provide the Kylin version.
2. Provide log information.
3. Describe the usage scenario.
If you want a quick start, you can run Kylin in a Hadoop sandbox VM or in the cloud, for example by starting a small AWS EMR or Azure HDInsight cluster and installing Kylin on one of the nodes. When you use Kylin 2.3.1, I suggest you use Spark 2.1.2.

PiG + Cassandra + Hadoop

I have a Hadoop (2.7.2) setup over a Cassandra (3.7) cluster. I have no problem using Hadoop MapReduce. Similarly, I have no problem creating tables and keyspaces in cqlsh. However, I have been trying to install Pig over Hadoop so as to access the tables in Cassandra (the installation of Pig itself is fine), and that is where I'm having trouble.
I have come across numerous websites; most are either for outdated versions of Cassandra or just plain vague.
The one thing I gleaned from this website is that we can load and access the Cassandra tables in Pig using CqlStorage / CqlNativeStorage. However, it seems this support has been removed in the latest version (since 2015).
Now my question is, are there any workarounds?
I would be running MapReduce jobs over Cassandra tables and using Pig mostly for querying.
Thanks in Advance.
All Pig support was deprecated in Cassandra 2.2 and removed in 3.0: https://issues.apache.org/jira/browse/CASSANDRA-10542
So I think you are a bit out of luck here. You may be able to use the old classes with modern C*, but Pig is very niche right now. Spark SQL is definitely the current favorite child (I may be biased since I work on the Spark + Cassandra Connector) and allows for very flexible querying of C* data.
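To make that concrete, here is a minimal sketch of reading a Cassandra table through Spark SQL with the Spark Cassandra Connector. It assumes the connector jar is on the classpath; the contact point, keyspace, and table names are placeholders, not values from your cluster.

```java
// Sketch: query a Cassandra table with Spark SQL via the Spark Cassandra
// Connector. Contact point, keyspace, and table names are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkCassandraQuerySketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-sql-sketch")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();

        // Register the Cassandra table as a DataFrame-backed temp view.
        Dataset<Row> users = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "demo_ks")
                .option("table", "users")
                .load();
        users.createOrReplaceTempView("users");

        // Query with plain Spark SQL; simple filters are pushed down to Cassandra.
        spark.sql("SELECT id, name FROM users WHERE name IS NOT NULL").show();

        spark.stop();
    }
}
```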

How to integrate Cassandra with Hadoop

I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping, but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to the Cassandra libs, and I have put this on all of my nodes. Is that correct? What more do I need to do, and how do I run it: do I start the Hadoop cluster or Cassandra first?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
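For the last question: MapReduce jobs read and write Cassandra through the Hadoop input/output formats they are configured with, while a regular Java client usually talks to Cassandra directly, typically through the DataStax Java driver. A minimal sketch of the latter, assuming driver 3.x and placeholder contact point, keyspace, and table:

```java
// Sketch: direct access to Cassandra from a Java client using the
// DataStax Java driver 3.x. Contact point, keyspace, and table are placeholders.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraClientSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // a Cassandra node, not a Hadoop daemon
                .build();
             Session session = cluster.connect("demo_ks")) {
            ResultSet rs = session.execute("SELECT id, name FROM users LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getString("id") + " -> " + row.getString("name"));
            }
        }
    }
}
```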
BR

Cassandra - Hive integration

What's the best practice for integrating Cassandra and Hive?
An old question on Stack Overflow (Cassandra with Hive) points to Brisk, which has now become a subscription-only DataStax Enterprise product.
A Google search only points to two open JIRA issues:
https://issues.apache.org/jira/browse/CASSANDRA-4131
https://issues.apache.org/jira/browse/HIVE-1434
but neither of them has resulted in any code being committed to either of the two projects.
Is the only way to integrate Cassandra and Hive patching the Cassandra/Hive source code? Which solution are you using in your stack?
I did the same research a month ago and reached the same conclusion.
Brisk is no longer available as a community download, and besides patching the Cassandra/Hive code, the only way to throw MapReduce jobs at your Cassandra database is to use DSE (DataStax Enterprise), which I believe is free for any use except production clusters.
You might have a look at HBase, which is based on HDFS.
There's an open source Cassandra Storage Handler for Hive currently maintained by Datastax.
Here is a Git repository for a Cassandra Hive driver that works with Cassandra 2.0 and Hadoop 2:
https://github.com/2013Commons/hive-cassandra
and another for Cassandra 1.2:
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
You can use an integration framework or integration suite for this problem. Take a look at my presentation "Big Data beyond Hadoop - How to integrate ALL your data" for more information about how to use open source integration frameworks and integration suites with Hadoop.
For example, Apache Camel (integration framework) and Talend Open Studio for Big Data (integration suite) are two open source solutions which offer connectors to both Cassandra and Hadoop.
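As a small illustration of the Camel side, below is a sketch of a route that writes rows into Cassandra through the camel-cassandraql ("cql") component. Everything specific in it is an assumption made for the example: the host, keyspace, table, column names, CQL statement, and the timer feeding the route are all placeholders.

```java
// Sketch: a Camel route that inserts rows into Cassandra via the "cql"
// component. Host, keyspace, table, and column names are placeholders,
// and camel-cassandraql is assumed to be on the classpath.
import java.util.Arrays;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CamelCassandraSketch {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Every 5 seconds, bind a one-element parameter list as the
                // message body and run the parameterised INSERT against Cassandra.
                from("timer:ingest?period=5000")
                    .setBody(constant(Arrays.asList("alice")))
                    .to("cql://127.0.0.1/demo_ks?cql=INSERT INTO users(id, name) VALUES (uuid(), ?)");
            }
        });
        context.start();
        Thread.sleep(20000); // let the route run briefly for the demo
        context.stop();
    }
}
```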

integrate pentaho community with hadoop

I want to integrate Hadoop with Pentaho Data Integration. I found on the Pentaho site that there is Pentaho for Hadoop, but it's commercial. I want to make my Data Integration Community Edition integrate with Hadoop.
How can I solve this?
Thanks
In the new version (PDI 4.2.0), you can see Hadoop components in PDI.
visit: http://sourceforge.net/projects/pentaho/files/Data%20Integration/
Actually, since PDI 4.3.0 (which was released yesterday), all the Hadoop stuff is now included in the open source version! So just go straight to SourceForge and download it! All the docs are on infocenter.pentaho.com.
The most recent work for integrating Kettle (ETL) with Hadoop and other various NoSQL data stores can be found in the Pentaho Big Data Plugin. This is a Kettle plugin and provides connectors to HDFS, MapReduce, HBase, Cassandra, MongoDB, CouchDB that work across many Pentaho products: Pentaho Data Integration, Pentaho Reporting, and the Pentaho BA Server. The code is hosted on Github: https://github.com/pentaho/big-data-plugin.
There's a community landing page with more information on the Pentaho Wiki. You'll find how-to guides, configuration options, and documentation for Java developers here: http://community.pentaho.com/bigdata
