How to add Spark worker nodes on Cloudera with YARN - hadoop

We have Cloudera 5.2, and our users would like to start using Spark to its full potential (in distributed mode, so it can take advantage of data locality with HDFS). The service is already installed and shows up under Status on the Cloudera Manager home page, but when I click the service and then "Instances", it shows only a History Server role, plus a Gateway role on the other nodes. From my understanding of Spark's architecture, you have a master node and worker nodes (which live alongside the HDFS DataNodes), so in Cloudera Manager I tried "Add role instances", but only the "Gateway" role is available. How do you add Spark's worker node (or executor) role to the hosts where you have HDFS DataNodes? Or is it unnecessary (I think that, because of YARN, YARN takes charge of creating the executors and the application master)? And what about the master node? Do I need to configure anything so the users can use Spark in fully distributed mode?

The Master and Worker roles are part of the Spark (Standalone) service. You can run Spark either on YARN (in which case Master and Worker roles are irrelevant) or as Spark (Standalone).
Because you started the Spark service rather than Spark (Standalone) in Cloudera Manager, Spark is already using YARN. In Cloudera Manager 5.2 and higher, there are two separate Spark services: Spark and Spark (Standalone). The Spark service runs Spark as a YARN application, with only Gateway roles in addition to the Spark History Server role.
"How do you add Sparks worker node (or executor) role to the hosts where you have HDFS datanodes?"
Not required. Only Gateway roles are required on these hosts.
Quoting from CM Documentation:
In Cloudera Manager Gateway roles take care of propagation of client configurations to the other hosts in your cluster. So, ensure that you assign the gateway roles to hosts in the cluster. If you do not have gateway roles, client configurations are not deployed.
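As a rough illustration of what running on YARN looks like from a gateway host, here is a minimal sketch that only builds (and prints) a spark-submit command line; it does not run anything. The jar name, class name, and resource sizes are hypothetical placeholders, and older Spark 1.x releases wrote the master as yarn-cluster or yarn-client instead of using a separate --deploy-mode flag.

```python
import shlex


def build_spark_submit(app_jar, main_class, num_executors=4,
                       executor_memory="2g", executor_cores=2):
    """Return the argument list for a spark-submit run on YARN.

    No Master or Worker role is involved: YARN starts the application
    master and the executors in containers on the NodeManager hosts.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app_jar,
    ]


# Hypothetical application jar and main class, for illustration only.
cmd = build_spark_submit("my-app.jar", "com.example.MyApp")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The command would be run from a host that has the Gateway role, since that is where the client configuration is deployed.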

Related

When do YARN and NameNode interact

When a job is submitted, when do YARN and the NameNode interact? When a job is submitted, who does it get sent to? Could someone explain the end-to-end flow, i.e. how the Hadoop ecosystem works?
Thanks!
NameNode: Stores the metadata for all the data held on the DataNodes and monitors the health of the DataNodes. HDFS is a master-slave architecture, with the NameNode as the master and the DataNodes as the slaves.
YARN: Stands for Yet Another Resource Negotiator. The Resource Manager has two main components:
1.> Scheduler
2.> Applications Manager
YARN is also master-slave: the master is the Resource Manager and the slaves are the Node Managers.
For scheduling, there are 3 schedulers:
1.> FIFO 2.> Capacity 3.> Fair
There is a component called the Application Master, which the Resource Manager launches in a container on one of the Node Managers.
One Application Master is assigned to one application.
The client submits the job directly to the Resource Manager. The Resource Manager starts an Application Master for the job, and the Node Manager hosting it monitors the container in which the Application Master runs.
Whenever a job comes in, the Resource Manager creates an application ID and assigns an Application Master to that job. The metadata about the data the tasks will work on (for example, which DataNodes hold which blocks) comes from the NameNode. In a typical MapReduce job it is the client and the Application Master, rather than the Resource Manager itself, that obtain this information; the Application Master then uses it when requesting containers from the Resource Manager.
This is a basic overview of how YARN works with the NameNode. You can read more in the YARN documentation.
Also, only the Hadoop applications running within YARN talk to the NameNode; not all YARN applications need to communicate with HDFS.
Basically, there is no direct interaction between YARN and HDFS; see https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
However, YARN jobs require some files (libraries, configuration, etc.) which usually reside on HDFS.
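To make it more concrete why block locations from the NameNode matter to an application running on YARN, here is a toy model of locality-aware placement. It is not YARN code: the function, block ids, and host names are all made up for illustration, and real schedulers also consider rack locality, queues, and capacity.

```python
def place_tasks(block_hosts, free_slots):
    """Assign one task per block, preferring a host that holds a replica.

    block_hosts: block id -> hosts holding a replica (the kind of
                 information the NameNode reports for a file).
    free_slots:  host -> number of free containers (the kind of
                 information the Resource Manager tracks).
    Returns block id -> (chosen host, True if the data is local).
    """
    slots = dict(free_slots)  # work on a copy, leave the input untouched
    placement = {}
    for block, hosts in block_hosts.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            host, is_local = local[0], True
        else:
            # No replica holder has room: fall back to any free host.
            candidates = [h for h, n in slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free containers left")
            host, is_local = candidates[0], False
        slots[host] -= 1
        placement[block] = (host, is_local)
    return placement


# Hypothetical blocks and hosts.
blocks = {"blk_1": ["node1", "node2"], "blk_2": ["node1", "node3"], "blk_3": ["node1"]}
slots = {"node1": 1, "node2": 1, "node3": 1, "node4": 1}
result = place_tasks(blocks, slots)
print(result)
```

In this run blk_3 ends up on a host without a replica because node1 is already full, which is the case where the task has to read its data over the network.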

Hadoop Key-Value store with remote deploy

My application is launched from a remote PC via spark-submit in yarn-cluster mode with a Kerberos keytab and principal, following this guide: https://spark.apache.org/docs/latest/running-on-yarn.html. The advantage of this approach is that I can use my own version of Spark on any cluster.
Is it possible to automatically deploy Ignite/Hazelcast/Accumulo/Kudu, or another NoSQL DB with random-access reads and writes, into a Hadoop YARN cluster without sftp/ssh, just by running a bash script with the HADOOP_CONF_DIR/YARN_CONF_DIR configs?
Deploying Hazelcast on a YARN cluster is possible and easy; take a look at https://github.com/hazelcast/hazelcast-yarn

Spark Job Submission with AWS Hadoop cluster setup

I have a Hadoop cluster set up on AWS EC2, but my development setup (Spark) is on a local Windows system. When I try to connect to the AWS Hive Thrift server I am able to connect, but I get a connection refused error when I try to submit a job from my local Spark configuration. Please note that my Windows user name is different from the user name under which the Hadoop ecosystem runs on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NN under its own identity, or will it carry forward my Spark setup's identity?
2) In my configuration, do I need to run Spark locally under the same user name that I have for the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Please note that my local system is not part of the Hadoop cluster, and I cannot add it to the AWS Hadoop cluster.
Please let me know what the actual setup should be for an environment where my Hadoop cluster is in AWS and Spark runs on my local machine.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client in AWS, and then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you just want to run Spark against EC2 from your local machine, you can set a few properties programmatically.
See the post: Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be getting your cluster's XML files from HADOOP_CONF_DIR, and copying them over into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
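As a small sketch of what "getting your cluster's XML files" gives you, the snippet below reads Hadoop-style <configuration>/<property> XML and turns each entry into a spark.hadoop.* setting, which Spark passes on to the Hadoop configuration. The host names and the inline sample are made-up stand-ins; in practice the text would come from the cluster's core-site.xml or yarn-site.xml.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the format used by Hadoop's *-site.xml files.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
</configuration>
"""


def read_hadoop_conf(xml_text):
    """Parse Hadoop configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name] = prop.findtext("value")
    return conf


conf = read_hadoop_conf(SAMPLE_SITE_XML)

# Prefixing a Hadoop property with "spark.hadoop." is one way to set it
# programmatically on the Spark side.
spark_conf = {"spark.hadoop." + name: value for name, value in conf.items()}
for key, value in sorted(spark_conf.items()):
    print(key, "=", value)
```

Putting the XML files on the classpath, as described above, avoids this translation step entirely; the sketch is mainly useful for checking which addresses the local client would try to reach.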

Problems applying AMBARI to existing system

I'm going to apply Ambari to my system.
But my system already has Hadoop installed.
How do I add an existing Hadoop cluster to my new Ambari environment?
Sorry for my English.
Ambari can only manage clusters that it provisioned. Your pre-existing hadoop cluster was not provisioned with Ambari so it cannot be managed by Ambari.
Ambari is designed around a Stack concept where each stack consists of several services. A stack definition is what allows Ambari to install, manage and monitor the services in the cluster.
You cannot do this right now: Hadoop is already installed on the system, and applying Ambari on top of it to manage the existing cluster is not possible.
A more detailed description of Apache Ambari:
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
Monitor a Hadoop Cluster
Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
Ambari leverages Ambari Metrics System for metrics collection.
Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
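Since the web UI is backed by RESTful APIs, the clusters an Ambari server manages can also be listed over HTTP. The sketch below only builds the request object and does not send it; the host name and the admin/admin credentials are placeholders, and port 8080 is assumed to be the server's HTTP port.

```python
import base64
import urllib.request


def ambari_clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for Ambari's cluster list."""
    url = "http://{}:{}/api/v1/clusters".format(host, port)
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    # Ambari's REST API accepts HTTP Basic authentication.
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder host and credentials, for illustration only.
req = ambari_clusters_request("ambari.example.com", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would return a JSON document describing the clusters; a cluster that was not provisioned through Ambari would simply not appear in it.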

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (vanilla version) in our org, and we are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it). I'm mainly interested in the Hadoop administration steps, how the master and slaves communicate, and the various configurations. I already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating from Apache Hadoop to AWS EMR?
During EMR cluster creation, you are asked to specify the master and the nodes. The default settings provision 1 master and two nodes for you. You can also specify which applications you want in the cluster (e.g., Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it provisions all the services. You can click on these services and access them via the web, or by using ssh into the master. For example, to access the Ambari interface, go to the service within EMR and click it; a new window will open with the Ambari monitoring interface.
Installing these applications is very easy: all you have to do is specify the services during cluster creation.
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, the Hadoop version provided with EMR can be a few releases behind. Check the EMR release notes for the detailed list of applications provided with each version. EMR takes care of application provisioning, setup, and configuration. The Hadoop configuration (and that of the other applications) changes based on the EC2 instance type. You can override the default settings by configuring applications.
Other than this, the Hadoop you have on premises and the one on EMR should be the same.
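The override mentioned above is supplied as a list of configuration objects, each naming a classification (roughly, the config file it targets) and the properties to set. Here is a minimal sketch that produces such a list as JSON; the property values are arbitrary examples, not recommendations, and the helper function is hypothetical.

```python
import json


def emr_configuration(classification, properties):
    """Return one EMR configuration object for the given classification."""
    return {"Classification": classification, "Properties": dict(properties)}


# Example values only; suitable settings depend on the instance type.
configurations = [
    emr_configuration("yarn-site", {"yarn.nodemanager.resource.memory-mb": "12288"}),
    emr_configuration("spark-defaults", {"spark.executor.memory": "4g"}),
]

print(json.dumps(configurations, indent=2))
```

The resulting JSON is what you would provide when creating the cluster, so settings you tuned by hand in the on-premises *-site.xml files can be carried over in this form.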

Resources