Integrate Pentaho Community Edition with Hadoop

I want to integrate Hadoop with Pentaho Data Integration. I found on the Pentaho site that there is a Pentaho for Hadoop product, but it's commercial. I want to make my Community Edition of Data Integration work with Hadoop.
How can I solve this?
Thanks

In the new version (PDI 4.2.0), you can see Hadoop components in PDI.
Visit: http://sourceforge.net/projects/pentaho/files/Data%20Integration/

Actually, since PDI 4.3.0 (which was released yesterday), all the Hadoop functionality is included in the open source version! So just go straight to SourceForge and download it. All the docs are on infocenter.pentaho.com.

The most recent work on integrating Kettle (ETL) with Hadoop and various other NoSQL data stores can be found in the Pentaho Big Data Plugin. This is a Kettle plugin that provides connectors to HDFS, MapReduce, HBase, Cassandra, MongoDB, and CouchDB, and it works across many Pentaho products: Pentaho Data Integration, Pentaho Reporting, and the Pentaho BA Server. The code is hosted on GitHub: https://github.com/pentaho/big-data-plugin.
There's a community landing page with more information on the Pentaho Wiki. You'll find How-To guides, configuration options, and documentation for Java developers here: http://community.pentaho.com/bigdata

Related

Can Tableau connect to Apache Hadoop, or only to the major Hadoop distributions?

I need help choosing a reporting tool. Basically, we are looking for the best reporting tool that can connect to Hive and pull reports, so we thought of using Tableau. We are using our own Hadoop distribution (not one from Hortonworks, Cloudera, MapR, etc.). Will Tableau connect to the Apache distribution of Hadoop as well? If not, please suggest a good reporting tool. Freeware is highly preferred.
Thank you
Yes, Tableau will connect to your free Apache Hadoop distribution.
You will have to put all the necessary JAR files, such as the Hadoop core and Hadoop common JARs, into your Tableau lib directory. Likewise, you have to put the correct version of the Tableau driver into your Hadoop lib directory.
Then, with the help of HiveServer2 (also known as the Hive Thrift server), you can supply your driver name and connection string.
for more details:
http://kb.tableau.com/articles/knowledgebase/connecting-to-hive-server-2-in-secure-mode
http://kb.tableau.com/articles/knowledgebase/administering-hadoop-hive
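As a rough sketch of the connection-string part of the setup above: a HiveServer2 endpoint is typically addressed with a `jdbc:hive2://` URL. The hostname and database below are hypothetical placeholders; 10000 is HiveServer2's default Thrift port.

```python
# Build a HiveServer2 JDBC-style connection string.
# The host and database names are hypothetical placeholders;
# 10000 is the default HiveServer2 Thrift port.

def hive2_jdbc_url(host, port=10000, database="default"):
    """Return a JDBC URL for a HiveServer2 (Hive Thrift server) endpoint."""
    return f"jdbc:hive2://{host}:{port}/{database}"

url = hive2_jdbc_url("hadoop-master.example.com")
print(url)  # jdbc:hive2://hadoop-master.example.com:10000/default
```

The driver name and exact URL options accepted vary with the Hive driver version Tableau ships, so check the knowledge-base articles above for your release.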

Can Informatica Big Data Edition (not the cloud version) connect to Cloudera Impala?

We are trying to do a proof of concept with Informatica Big Data Edition (not the cloud version), and I have seen that we might be able to use HDFS and Hive as source and target. But my question is: does Informatica connect to Cloudera Impala? If so, do we need an additional connector for that? I have done comprehensive research to check whether this is supported but could not find anything. Has anyone already tried this? If so, can you describe the steps and link to any documentation?
Informatica version: 9.6.1 (Hotfix 2)
You can use the ODBC driver provided by Cloudera.
http://www.cloudera.com/downloads/connectors/impala/odbc/2-5-22.html
For Irene: you can use the same driver; the one above is based on the Simba driver.
http://www.simba.com/drivers/hbase-odbc-jdbc/
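As a minimal sketch of how a DSN-less ODBC connection string for that Cloudera Impala driver might be assembled: the driver name, key names, and host below are assumptions on my part, so verify them against the install guide for the driver version you download (21050 is Impala's default HiveServer2-compatible port).

```python
# Sketch of a DSN-less ODBC connection string for the Cloudera
# Impala ODBC driver. The driver name and key names are assumptions;
# check the driver's install guide for your version.

def impala_odbc_conn_str(host, port=21050, auth_mech=0):
    """Assemble key=value pairs into an ODBC connection string."""
    parts = [
        ("Driver", "Cloudera ODBC Driver for Impala"),
        ("Host", host),
        ("Port", port),        # 21050: Impala's HiveServer2-compatible port
        ("AuthMech", auth_mech),  # 0 = no authentication
    ]
    return ";".join(f"{k}={v}" for k, v in parts)

print(impala_odbc_conn_str("impala-host.example.com"))
```

Informatica would consume such a string (or an equivalent `odbc.ini` DSN entry) through its standard ODBC connection configuration.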

How to integrate Cassandra with Hadoop to take advantage of Hive

For almost 3 days now I have been looking for a way, in 2015, to integrate Cassandra with Hadoop. Many resources on the net are outdated or have vanished, and DataStax Enterprise offers no free-of-charge solution for such integration.
What are the options for doing this? I want to use the Hive query language to get data out of Cassandra, and I think the first step is to integrate Cassandra with Hadoop.
The easiest (but paid) option is to use the DataStax Enterprise packaging of C* with Hadoop + Hive. This provides automatic connection and registration of Hive tables with C*, and it includes and sets up a Hadoop execution platform if you need one.
http://www.datastax.com/products/datastax-enterprise
The second easiest way is to use Spark instead. The Spark Cassandra Connector is open source and allows HiveQL to be used to access C* tables. This runs on Spark as the execution platform instead of Hadoop, but with similar (if not better) performance.
With this solution I would stand up a standalone Spark cluster (since you don't have existing Hadoop infrastructure) and then use the Spark SQL Thrift server to run queries against C* tables.
https://github.com/datastax/spark-cassandra-connector
There are other options, but these are the ones I am most familiar with (and, conflict-of-interest notice, also develop :D).
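The thrift-server setup described above might be launched roughly like this; the connector version, Scala suffix, and Cassandra host are placeholders, so match them to your Spark version using the connector's compatibility table before trying it:

```shell
# Hedged sketch: start the Spark SQL Thrift server with the
# Spark Cassandra Connector on the classpath. Version numbers
# and the host name below are placeholder assumptions.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 \
  --conf spark.cassandra.connection.host=cassandra-host.example.com
```

Once the server is up, a JDBC client such as beeline can connect (port 10000 by default) and issue HiveQL queries against C* tables.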

Cassandra integration with Hadoop

I am a newbie to Cassandra. I am posting this question because different pieces of documentation provide different details about integrating Hive with Cassandra, and I was not able to find the GitHub page.
I have installed a single-node Cassandra 2.0.2 (DataStax Community Edition) on one of the data nodes of my 3-node HDP 2.0 cluster.
I am unable to use Hive to access Cassandra using 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'. I am getting the error 'return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'.
I have copied all the JARs from $cassandra_home/lib/ to $hive_home/lib and also included $cassandra_home/lib/* in $HADOOP_CLASSPATH.
Are there any other configuration changes I have to make to integrate Cassandra with Hadoop/Hive?
Please let me know, and thanks for the help!
Thanks,
Arun
These are probably good starting points for you:
Hive support for Cassandra, on GitHub
A top-level article related to your topic, with general information: Hive support for Cassandra CQL3
Hadoop support, on the Cassandra Wiki
Actually, your question is not so narrow; there could be a lot of reasons for this error. But you should remember that Hive is based on the MapReduce engine.
Hope this helps.
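For reference, the kind of DDL that exercises this storage handler looks roughly like the following. The keyspace, table, columns, and property names here are hypothetical and vary between the different hive-cassandra forks, so check the README of the driver you actually built:

```sql
-- Hedged sketch: map a Hive external table onto a Cassandra table
-- via the CQL3 storage handler. Property names differ between the
-- various hive-cassandra forks; verify against your driver's README.
CREATE EXTERNAL TABLE cassandra_users (user_id string, name string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
  'cassandra.ks.name' = 'my_keyspace',
  'cassandra.cf.name' = 'users',
  'cassandra.host'    = '127.0.0.1'
);
```

If the handler class itself fails to load (as in the error above), the DDL never gets this far, which points back at the classpath rather than at these properties.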

Cassandra - Hive integration

What's the best practice for integrating Cassandra and Hive?
An old question on Stack Overflow (Cassandra with Hive) points to Brisk, which has now become a subscription-only DataStax Enterprise product.
A Google search only points to two open JIRA issues,
https://issues.apache.org/jira/browse/CASSANDRA-4131
https://issues.apache.org/jira/browse/HIVE-1434
but neither of them has resulted in any code being committed to either project.
Is patching the Cassandra/Hive source code the only way to integrate Cassandra and Hive? Which solution are you using in your stack?
I did the same research a month ago and reached the same conclusion.
Brisk is no longer available as a community download, and besides patching the Cassandra/Hive code, the only way to throw map/reduce jobs at your Cassandra database is to use DSE (DataStax Enterprise), which I believe is free for any use except production clusters.
You might have a look at HBase, which is based on HDFS.
There's an open source Cassandra Storage Handler for Hive currently maintained by Datastax.
Here is a Git repository with a Cassandra Hive driver for Cassandra 2.0 and Hadoop 2:
https://github.com/2013Commons/hive-cassandra
and another for Cassandra 1.2:
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
You can use an integration framework or integration suite for this problem. Take a look at my presentation, "Big Data beyond Hadoop - How to integrate ALL your data", for more information about how to use open source integration frameworks and integration suites with Hadoop.
For example, Apache Camel (integration framework) and Talend Open Studio for Big Data (integration suite) are two open source solutions which offer connectors to both Cassandra and Hadoop.
