PiG + Cassandra + Hadoop - hadoop

I have a Hadoop (2.7.2) setup over a Cassandra (3.7) Cluster. I have no problem with using Hadoop MapReduce. Similarly, I have no problem to create tables and keyspace in CQLSH. However, I have been trying to install PIG over hadoop, so as to access the tables in Cassandra. (Installation of PIG is as such fine) It is where I'm having trouble.
I have come across numerous websites, most are either for outdated versions of Cassandra or just plain vague.
The one thing I gleaned from this website is that we can load access the cassandra tables in pig using CqlStorage / CqlNativeStorage. However, in the latest version, it seems this support has been removed (since 2015).
Now my question is, are there any workarounds?
I would be running mapreduce jobs over cassandra tables, and use PiG for querying, mostly.
Thanks in Advance.

All pig support was Deprecated in 2.2 and removed in 3.0. https://issues.apache.org/jira/browse/CASSANDRA-10542
So I think you are a bit out of luck here. You may be able to use old classes with modern C* but Pig is very niche right now. SparkSql is definitely the current favorite child (I may be biased since I work on the Spark + Cassandra Connector) and allows for very flexible querying of C* data.

Related

MapR 5.2.2 clients

I have a task which requires me to create a Go program to read from an HBASE table.
HBASE is installed in a MapR cluster.
Every other application (Java) uses a MapR client to connect to the MapR cluster so as to retrieve the data.
However, I am unable to find a way to connect to HBASE with a Go application.
I have found HBASE package, but it does not support integration with MapR.
It would be great if anyone could guide me in this situation.
I also have seen that for MapR 6 and above has Go support through OJAI, but sadly, upgrading MapR is not an option.
Can someone advice me how to proceed in this situation?
If you are actually running HBase in MapR, then the Go package for HBase should work (assuming version match and such).
If you are actually using the MapR DB Binary tables (which are roughly HBase compatible) the likely best approach would be to use the Thrift API or REST.
The OJAI lightweight client should work well in Go since it uses gRPC to talk to the underlying table (and thus gains lots of portability). The problem in your case won't be so much that you need to upgrade the platform so much as the lightweight client only works with MapR DB JSON (the document oriented version of MapR DB).
Ping me directly if you would like more information.

Hive on Spark in Mapr Distribution

Currently we are working on Hive, which by default uses map reduce as processing framework in our MapR cluster. Now we want to change from map reduce to spark for better performance. As per my understanding we need to set hive.execution.engine=spark.
Now my question is Hive on spark is currently supported by MapR ? if yes, what are configuration changes that we need to do ?
Your help is very much appreciated. Thanks
No, MapR (5.2) doesn't support that. From their docs,
MapR does not support Hive on Spark. Therefore, you cannot use Spark as an execution engine for Hive. However, you can run Hive and Spark on the same cluster. You can also use Spark SQL and Drill to query Hive tables.
Cheers.
I know and understand that your question is about using Spark as data processing engine for Hive; and as you can see in the various answer it is today not officially supported by MapR.
However, if you goal is to make Hive faster, and do not use MapReduce you can switch to Tez, for this install the MEP 3.0.
See: http://maprdocs.mapr.com/home/Hive/HiveandTez.html

how to integrate Cassandra with Hadoop to take advantage of Hive

It is almost 3 days that I've been looking for a solution at year 2015 to integrate Cassandra on Hadoop and lots of resources on the net are outdated or vanished from the net and the Datastax Enterprise offers no free of charge solution for such integration.
What are the options for doing such? I want to use Hive query language to get data from my Cassandra and I think the first step is to integrate the Cassandra with Hadoop.
The easiest (but also paid option) is to use Datastax Enterprise packaging of C* with Hadoop + Hive. This provides an automatic connection and registration of Hive tables with C* and includes and setups up a Hadoop execution platform if you need one.
http://www.datastax.com/products/datastax-enterprise
The second easiest way is to utilize Spark instead. The Spark Cassandra Connector is open source and allows HiveQL to be used to access C* tables. This is done running on Spark as an execution platform instead of Hadoop but has similar (if not better) performance.
With this solution I would standup a stand alone Spark Cluster (since you don't have an existing hadoop infra) and then use the spark-sql-thrift server to run queries against C* tables.
https://github.com/datastax/spark-cassandra-connector
There are other options but these are the ones I am most familiar with (and conflict of interest notice, also develop :D )

Difference between MapR-DB and Hbase

I am bit new in MapR but i am aware about hbase. I was going through one of the video where I found that Mapr-DB is a NoSQL DB in MapR and it similar to Hbase. In addition to this Hbase can also be run on MapR. I am confused between MapR-Db and Hbase. What is the exact difference between them ?
When to use Mapr-DB and when to use Hbase?
Basically I have one java code which do bulk load in Hbase on MapR , Now here if I use same code that i have used for Apache hadoop , will that code work here?
Please help me to avoid this confusion.
They are both NOSQL, wide column stores.
HBase is open source and can be installed as a part of a Hadoop installation.
MapR-DB is a proprietary (not open source) NOSQL database that MapR offers. A core difference that MapR will detail with MapR-DB (along with their file system (they do not use HDFS)) is that MapR-DB offers significant performance and scalability over HBase (unlimited tables, columns, re-architecture to name a few).
MapR maintains that you can use MapR-DB or HBase interchangeably. I suggest testing on both extensively before committing to one vs the other. You also need to realize that MapR-DB is MapR's proprietary NOSQL HBase equivalent and if you require support for MapR-DB you'll have to get that from MapR (HBase support can come from any of the other Hadoop distributions as well as the open source community).
Some links you should look at:
http://www.theregister.co.uk/2013/05/01/mapr_hadoop_m7_edition_solr/
https://www.mapr.com/blog/get-real-hadoop-enterprise-grade-nosql#.VVfHuvlVhBc
They are similar but not same. MapR claims that MapR DB is faster and more efficient as they have migrated the critical functionality in native C/C++ code and interface is kept the same. But end of the day MapR DB is propriatory and you depend on the support of MapR for any thing which is done differently than HBase. I didn't liked MapR-DB because it's not compatible with Apache Phoenix(HBase coprocessors are not present in MapR DB) - the SQL way of accessing HBase kind of NoSQL databases.
Limitations that i have taken from MapR documentation:
Custom HBase filters are not supported.
User permissions for column families are not supported. User
permissions for tables and columns are supported.
HBase authentication is not supported.
HBase replication is handled with Mirror Volumes.
Bulk loads using the HFiles workaround are not supported and not
necessary. HBase coprocessors are not supported.
Filters use a different regular expression library
Co processors are not supported
So i second previous answer - try out your solution in both(MapR DB vs HBase) before going too far. I didn't liked to very idea of MapR DB from MapR as it's propitiatory and the code is not open source. If any Hadoop distributor is enhancing hadoop - they should also make it available to open source community. Why one should totally rely on commercial support when using opensource.

Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

I am new to all these terms and given some time to understand it. But i have some confusions in it. Please correct me if i am wrong.
Nutch: It's for web crawling, using it we can crawl web pages. We can store these web pages somewhere in db.
Solr: Solr can be used for indexing web pages crawled by Apache Nutch. It helps in searching the indexes web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data at real time from HDFS. It provides simple SQL type interface for interacting.
Hadoop: It provides two functionalities: One is HDFS (Hadoop data file system) and other is Map-Reduce functionality taken from Google algorithms. Its basically used for offline data backup etc.
Gora and ZooKeeper: I am not sure of.
Confusions:
1). Is HBase a key-value pair DB or just an interface to Hadoop ? or i should ask, can HBase exist without Hadoop ?
If yes, can you explain a bit more about its usage.
2). Is there any use of crawling data using Apache Nutch without indexing into Solr ?
3). For running apache nutch, do we need HBase and Hadoop ? If no, how we can make it work without it?
4). Is Hadoop part of HBase ?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop you can't really have HBase without Hadoop.
Yes you can run Nutch without Solr; there do not seem to be lots of use cases, however, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be a lot of real-world examples of people doing this.
Yes Hadoop is part of HBase, in that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
Zookeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora is a memory management/persistence framework and is built on top of Hadoop.

Resources