Hadoop and Cassandra integration - substituting HDFS with Cassandra - hadoop

I want to efficiently develop a Hadoop job using Cassandra as input and output.
As I know MapReduce jobs in Hadoop uses HDFS to store intermediate results.
Is it possible to make Hadoop store intermediate results in Cassandra File System? If yes then how to achieve that?
I wonder if it possible to completely disable HDFS if I am using Hadoop only Cassandra as underlying data storage system.
I am using Cassandra 2.0.11 and Hadoop 1.0.4 (If the above is possible only in Hadoop 2.x I would also apreciate that information)

Related

How to query kerberos enabled hbase using apache drill?

We have a kerberoized hadoop cluster, where HBase is running. Apache drill is running in distributed mode, in another cluster. Now, we need to query the Kerberos enabled HBase from the apache drill, using web UI. The Hadoop cluster is actually running in AWS, the HBase uses s3 as storage. Please help me with steps to achieve successful queries to HBase.
Apache_Drill_version:1.16.0 Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually, then get into HBase shell in Hadoop cluster. We wanted to make use of drill, to query in a SQL fashion easily, better readability.

How is Hadoop different from database?

I was doing a case study on Spotify. I found out that Spotify uses Cassandra as a DB and also Hadoop. My question is, how is Hadoop different from a database. What type of files does Hadoop datanode stores? Why every corporation has DB and Hadoop as well. I know Hadoop is not a DB but what is it used for if there is DB cluster to save data?
Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as distributed filesystem (HDFS), MapReduce framework libraries, YARN resource manager.
Other tools like Hive, Spark, Pig, Giraph, sqoop, etc, etc can use Hadoop or it's components. For example Hive is a database. It uses HDFS for storing it's data and MapReduce framework primitives for building query execution graph.

why I could use HBase without starting Hadoop/HDFS?

I am new to HBase, recently I installed HBase and tried to start it on my Mac. Everything is fine and I could play with HBase. In some articles, it said I should start Hadoop first when using HBase, I am wondering if this prerequisite changed?
Hadoop is not a hard requirement for HBase unless you are running fully distributed which you are not. Running on a single node like you are you can use the local filesystem. See HBase run modes: Standalone and Distributed for more information.
Your local filesystem (the file:// URI) is Hadoop-compatible. Hbase requires a Hadoop compatible storage layer, but that does not mean that it must literally be HDFS.
HDFS will simply provide scalability and reliability

Spark cluster - read/write on hadoop

I would like to read data from hadoop, process on spark, and wirte result on hadoop and elastic search. I have few worker nodes to do this.
Spark standalone cluster is sufficient? or Do I need to make hadoop cluster to use yarn or mesos?
If standalone cluster mode is sufficient, should jar file be set on all node unlike yarn, mesos mode?
First of all, you can not write data in Hadoop or read data from Hadoop. It is HDFS (Component of Hadoop ecosystem) which is responsible for read/write of data.
Now coming to your question
Yes, it possible to read data from HDFS and process it in spark engine and then write the output on HDFS.
YARN, mesos and spark standalone all are cluster managers and you can use any one of them to do management of resources in your cluster and it had nothing to do with hadoop. But since you want to read and write data from/to HDFS then you need to install HDFS on cluster and thus it is better to install hadoop on your all nodes that will also install HDFS on all nodes. Now whether you want to use YARN, mesos or spark standalone that is your choice all will work with HDFS I myself use spark standalone for cluster management.
It is not clear about which jar files you are talking to but I assume it will be of spark then yes you need to set the path for spark jar on each node so that there will be no contradiction in paths when spark run's.

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is basically used for processing structured and semi-structured data in Hadoop. We can also perform Analysis of large datasets which is present in HDFS and also in Amazon S3 filesystem using Hive. In order to query data hive also provides query language known as HiveQL which is similar to SQL. Using Hive one can easily run Ad-hoc queries for the data analysis. Using Hive we don’t need to write complex Map-Reduce jobs, we just need to submit SQL queries. Hive converts these SQL queries into MapReduce jobs.
Finally Hive SQL will get converted to MapReduce jobs and we don't have to submit MapReduce job from all node in a Hadoop cluster, in the same way we don't need Hive to be installed in all node of Hadoop cluster

Resources