I have set up a cluster using Ambari that consists of 3 nodes.
When I want to refer to HDFS, YARN, etc. (services installed using Ambari), do I have to use the URI of each individual node? Or is there a unified URI that represents the whole cluster?
Maybe providing some more context about what you are trying to use the URI for will help us better answer your question.
However, in general each service consists of one or more components. It is common for a component to be installed on some nodes and not others, in which case a unified URI would not be useful. If a component has a running process (a Master or Slave component), you address it by the node and port it's running on.
For example, the YARN service has several components, including the ResourceManager, the NodeManager, and the client. The YARN client is installed on one or more nodes (a cardinality of 1+), the ResourceManager has a cardinality of 1, and the NodeManager has a cardinality of 1+.
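As a concrete illustration, here is a minimal sketch (assuming the Hadoop/YARN client libraries are on the classpath and that the client's yarn-site.xml points at the cluster's ResourceManager) that asks the single ResourceManager for the list of NodeManagers instead of contacting each node directly:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());   // picks up yarn.resourcemanager.address from yarn-site.xml
            yarn.start();
            // One ResourceManager answers for the whole cluster and reports the many NodeManagers.
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " -> " + node.getHttpAddress());
            }
            yarn.stop();
        }
    }

So the closest thing to a "unified" URI is the address of the service's master component (the NameNode for HDFS, the ResourceManager for YARN).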
Related
How can I access YARN metrics such as the status of the ResourceManager and the NodeManagers?
Moreover, the same question applies to running YARN containers. I would like to do this via a web interface.
You can use the YARN ResourceManager UI, which is usually accessible at port 8088 of your ResourceManager (although the port can be configured). There you get an overview of your cluster.
Details about the nodes of the cluster can be found in this UI under the Cluster menu, submenu Nodes. There you find health information, some information about the hardware, the currently running jobs, and the software version of each NodeManager. For more details, you would have to log into each node and check the NodeManager logs (for the node where the ResourceManager is running, this log is also available via Tools menu -> Local logs, but that is not sufficient if you have more than one node in your cluster).
More details about the ResourceManager (including runtime statistics) are available via Tools menu -> Server metrics.
If you want to access this information programmatically, you can use the ResourceManager's REST API.
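For example, here is a minimal sketch that fetches the cluster metrics as JSON over HTTP (assuming the ResourceManager web UI is reachable at rm-host:8088, which you would replace with your own host and port, and that the cluster does not require Kerberos/SPNEGO authentication):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RmRestMetrics {
        public static void main(String[] args) throws Exception {
            // /ws/v1/cluster/metrics returns overall cluster metrics;
            // /ws/v1/cluster/nodes and /ws/v1/cluster/apps cover the nodes and running applications.
            URL url = new URL("http://rm-host:8088/ws/v1/cluster/metrics");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // raw JSON; parse it with the JSON library of your choice
                }
            }
            conn.disconnect();
        }
    }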
Another option might be to use Ambari. Ambari is a Hadoop management tool that can be used to monitor the different services within a Hadoop cluster and to trigger alerts in case of unusual or unexpected events. However, it requires some installation and configuration effort.
I know that the client machine consults the NameNode to store the data it contains.
Also, the client machine will have Hadoop installed on it with the cluster settings.
What cluster settings are present?
Whenever an HDFS command is invoked, the client has to send a request to the NameNode, and to do so it needs the fs.defaultFS property. Similarly, when submitting a YARN job, it needs yarn.resourcemanager.address to connect to the ResourceManager.
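To illustrate, here is a minimal sketch of an HDFS client call (assuming the cluster's core-site.xml and hdfs-site.xml are on the client's classpath, e.g. under /etc/hadoop/conf); every call goes through the NameNode named by fs.defaultFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsRoot {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // loads core-site.xml from the classpath
            System.out.println("NameNode URI: " + conf.get("fs.defaultFS"));

            FileSystem fs = FileSystem.get(conf);              // connects to the NameNode given by fs.defaultFS
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }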
File-level HDFS properties like dfs.blocksize and dfs.replication are determined at the client node. If they need to be changed from their defaults, set the respective properties on the client node.
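For instance, here is a minimal sketch of a client-side override (the property names are the standard HDFS ones; the values and the output path are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientSideHdfsProperties {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side values take effect for files created by this client,
            // regardless of the defaults configured on the cluster nodes.
            conf.set("dfs.replication", "2");
            conf.set("dfs.blocksize", "67108864");   // 64 MB

            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/tmp/client-props-demo.txt"))) {
                out.writeUTF("written with client-side replication and block size settings");
            }
        }
    }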
Normally, the same set of configuration properties (the *-site.xml files) defined on the nodes of the cluster is defined on the client node as well. Keeping the cluster settings uniform across all nodes, including the client nodes, is considered best practice.
I have a cluster set up with nodes that are not reliable and can go down (they are AWS spot instances). I am trying to make sure that my application master only launches on the reliable nodes (AWS on-demand instances) of the cluster. Is there a workaround for this? My cluster is managed by Hortonworks Ambari.
This can be achieved by using node labels. I was able to use the Spark property spark.yarn.am.nodeLabelExpression to restrict my application master to a set of nodes while running Spark on YARN. Assign the node label to whichever nodes you want to use for application masters; see the sketch below.
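Here is a minimal sketch, assuming Spark on YARN and that a node label (named ondemand purely for this example) has already been created and assigned to the reliable nodes on the cluster side, e.g. with yarn rmadmin -addToClusterNodeLabels and yarn rmadmin -replaceLabelsOnNode:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AmOnReliableNodes {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("am-on-reliable-nodes")
                    // Pin only the application master to labelled (on-demand) nodes; executors
                    // can still run anywhere unless spark.yarn.executor.nodeLabelExpression is also set.
                    .set("spark.yarn.am.nodeLabelExpression", "ondemand");
            JavaSparkContext sc = new JavaSparkContext(conf);   // master is supplied by spark-submit --master yarn
            // ... job logic ...
            sc.stop();
        }
    }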
I want to know whether I can request specific nodes from the YARN ResourceManager when running a MapReduce job.
In more detail, let's say that there is a YARN cluster deployed with the following nodes: nodeA, nodeB, nodeC.
Can I submit an MR job that will run only on nodeB and nodeC?
No. As of current versions of CDH and YARN, there is no property that lets you dynamically choose the nodes on which your jobs run. This is taken care of by the ResourceManager.
I am considering the following Hadoop services for setting up a cluster using HDP 2.1:
- HDFS
- YARN
- MapReduce2
- Tez
- Hive
- WebHCat
- Ganglia
- Nagios
- ZooKeeper
There are 3 node types that I can think of:
- NameNodes (e.g. primary, secondary)
- Application nodes (from where I will access the Hive service most often and will also copy code repositories and any other code artifacts)
- Data nodes (the workhorses of the cluster)
Given the above, I know of these best practices and common denominators:
- ZooKeeper should be running on at least 3 data nodes
- The DataNode service should be running on all data nodes
- The Ganglia monitor should be running on all data nodes
- The NameNode service should be running on the name nodes
- The NodeManager should be installed on all nodes containing the DataNode component
This still leaves lots of open questions, for example:
- Which is the ideal node on which to install the various servers needed (e.g. Hive Server, App Timeline Server, WebHCat Server, Nagios Server, Ganglia Server, MySQL Server)? Is it the application nodes? Should each get its own node? Should we have a separate 'utilities' node?
- Is there some criterion for choosing where ZooKeeper should be installed?
I think the more generic question is: is there a table mapping Hadoop components to nodes, essentially saying which components should be installed where?
Seeking advice/insight/links or documents on this topic.