I would like to read data from Hadoop, process it on Spark, and write the result to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient? Or do I need to set up a Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be set on all nodes, unlike in YARN or Mesos mode?
First of all, you cannot write data to Hadoop or read data from Hadoop. It is HDFS (a component of the Hadoop ecosystem) that is responsible for reading and writing data.
Now coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output back to HDFS.
YARN, Mesos and Spark standalone are all cluster managers; you can use any one of them to manage the resources of your cluster, and that choice has nothing to do with Hadoop itself. But since you want to read and write data from/to HDFS, you need HDFS installed on the cluster, so it is better to install Hadoop on all your nodes, which will also install HDFS on all nodes. Whether you then use YARN, Mesos or Spark standalone is your choice; all of them will work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but if you mean the Spark jars, then yes, you need to set the path to the Spark jars on each node so that there is no contradiction in paths when Spark runs.
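For illustration, here is a minimal Scala sketch of such a pipeline, assuming hypothetical HDFS paths (hdfs:///data/input, hdfs:///data/output), a hypothetical Elasticsearch host (es-host:9200) and the elasticsearch-hadoop connector on the classpath; adapt it to your own cluster:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._   // elasticsearch-hadoop connector (assumed dependency)

    object HdfsSparkEs {
      def main(args: Array[String]): Unit = {
        // The master (standalone, YARN or Mesos) is chosen at spark-submit time.
        val spark = SparkSession.builder()
          .appName("hdfs-spark-es")
          .config("es.nodes", "es-host:9200")              // hypothetical Elasticsearch address
          .getOrCreate()
        import spark.implicits._

        val lines  = spark.read.textFile("hdfs:///data/input")            // read from HDFS
        val counts = lines.flatMap(_.split("\\s+")).groupBy("value").count()

        counts.write.mode("overwrite").parquet("hdfs:///data/output")     // write result to HDFS
        counts.saveToEs("wordcounts/doc")                                  // write the same result to Elasticsearch
        spark.stop()
      }
    }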
Related
I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster. I have also installed and configured Spark on the master node.
I have doubts about whether I have to configure Spark on the slaves as well.
You should not. You're done. You have already done more than you needed to in order to submit Spark applications to Hadoop YARN (which I conclude is your cluster manager).
Spark is a library for distributed computations on massive datasets, and as such it belongs solely to your Spark applications (not to any cluster you may use).
Time to spark-submit Spark applications!
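To make the "Spark belongs to the application" point concrete, here is a minimal sketch (hypothetical app name and data): notice that the code never says which cluster manager runs it; that is decided by --master when you spark-submit, so the slaves need nothing Spark-specific installed beforehand.

    import org.apache.spark.sql.SparkSession

    object MyApp {
      def main(args: Array[String]): Unit = {
        // No master is hard-coded here; YARN (or standalone) is selected at submit time.
        val spark = SparkSession.builder().appName("my-app").getOrCreate()
        import spark.implicits._

        val ds = spark.createDataset(1 to 1000)
        println(ds.reduce(_ + _))   // runs on whatever executors the cluster manager allocates
        spark.stop()
      }
    }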
Do I need to use Spark with YARN to achieve NODE_LOCAL data locality with HDFS?
If I use the Spark standalone cluster manager and my data is distributed across an HDFS cluster, how will Spark know that the data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, not with the workings of HDFS or data locality.
Since Spark can read from HDFS sources, and the NameNodes and DataNodes take care of all the HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark with YARN?
Standalone mode has its own cluster manager/resource manager, which will talk to the NameNode for locality. The client/driver will then place tasks based on the results.
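You can actually see this from the driver: for an HDFS-backed RDD, Spark asks the NameNode for the block locations of each partition and exposes them as preferred locations, and the scheduler tries to place each task on one of those hosts (NODE_LOCAL) whatever the cluster manager is. A small sketch, assuming a hypothetical path hdfs:///data/input:

    import org.apache.spark.SparkContext

    val sc  = SparkContext.getOrCreate()
    val rdd = sc.textFile("hdfs:///data/input")   // roughly one partition per HDFS block

    // Preferred locations come from the NameNode's block-location metadata,
    // independently of YARN, Mesos or standalone.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} prefers hosts: ${rdd.preferredLocations(p).mkString(", ")}")
    }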
Well, I am new to Apache Storm, and after some searching and reading of tutorials, I still don't understand how fault tolerance, load balancing and other resource manager duties take place in a Storm cluster. Should it be configured on top of YARN, or does it do the resource management job itself? Does it have its own HDFS part, or should there be an existing HDFS configured in the cluster first?
Storm can manage its resources by itself or can run on top of YARN. If you have a shared cluster (i.e., with other systems like Hadoop, Spark, or Flink running), using YARN is the better choice to avoid resource conflicts.
About HDFS: Storm is independent of HDFS. If you want to run it on top of HDFS, you need to set up HDFS yourself. Furthermore, Storm provides spouts/bolts to access HDFS: https://storm.apache.org/documentation/storm-hdfs.html
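For example, the storm-hdfs module documented at that link gives you an HdfsBolt you can wire into a topology. A rough sketch (in Scala, with a hypothetical NameNode address and output path; API details may differ slightly between Storm versions):

    import org.apache.storm.hdfs.bolt.HdfsBolt
    import org.apache.storm.hdfs.bolt.format.{DefaultFileNameFormat, DelimitedRecordFormat}
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units
    import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy

    // Writes incoming tuples to HDFS, syncing every 1000 tuples and rotating files at 5 MB.
    val hdfsBolt = new HdfsBolt()
      .withFsUrl("hdfs://namenode:8020")                                        // hypothetical NameNode
      .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/out/"))  // hypothetical output path
      .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
      .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
      .withSyncPolicy(new CountSyncPolicy(1000))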
I am working on a solution for adding resources to a Hadoop YARN cluster at run time. The purpose is to handle heavy peaks in our application.
I am not an expert and I need help in order to confirm or correct what I understand.
Hadoop YARN
This application can run in cluster mode. It provides resource management (CPU & RAM).
A Spark application, for example, asks for a job to be done. YARN handles the request and provides an executor computing on the YARN cluster.
HDFS - Data & Executors
The data is not shared between executors, so it has to be stored in a file system, in my case HDFS. That means I will have to run a copy of my Spark Streaming application on the new server (Hadoop node).
I am not sure of this:
The YARN cluster and HDFS are different, so writing to HDFS won't write to the new Hadoop node's local data (because it is not an HDFS node).
As I will only write new data to HDFS from a Spark Streaming application, creating a new application should not be a problem.
Submit the job to YARN
--- peak: more resources needed
Provision a new server
Install / configure Hadoop & YARN, making it a slave
Modify hadoop/conf/slaves, adding its IP address (or DNS name from the hosts file)
Modify dfs.include and mapred.include
On the host machine:
yarn rmadmin -refreshNodes
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
Should this work? refreshQueues does not sound very useful here, as it seems to only take care of the process queues.
I am not sure whether the running job will increase its capacity. Another idea is to wait for the new resources to become available and then submit a new job.
Thanks for your help
I want to efficiently develop a Hadoop job using Cassandra as input and output.
As far as I know, MapReduce jobs in Hadoop use HDFS to store intermediate results.
Is it possible to make Hadoop store intermediate results in the Cassandra File System? If yes, how can I achieve that?
I wonder if it is possible to completely disable HDFS if I am using Hadoop with only Cassandra as the underlying data storage system.
I am using Cassandra 2.0.11 and Hadoop 1.0.4 (if the above is possible only in Hadoop 2.x, I would also appreciate that information).