How to get the number of hosts in a Hadoop cluster, their IPs and racks

I'm working on a cluster, but I don't know exactly how many hosts it has, what their IPs are, or which rack each belongs to.
I've previously worked with clusters managed via Cloudera and got that information from the Cloudera API (http://cloudera.github.io/cm_api/apidocs/v16/); in particular, http://cm_server_host:7180/api/v16/hosts gave me all the info I was looking for. But how can I do that if the cluster doesn't use Cloudera? It has Spark as well, but since there is Hadoop and HDFS, I think the information is more likely to be found there.
Thanks in advance!

You can find that information in the NameNode web UI, which by default (in Hadoop 2.x) is available at:
http://<namenodehost>:50070
and in the YARN ResourceManager web UI, which by default is available at:
http://<resourcemanagerhost>:8088/cluster/nodes
Alternatively, you can use the ResourceManager REST API:
http://<resourcemanagerhost>:8088/ws/v1/cluster/nodes
You can find more on the topic here, for example:
https://www.datadoghq.com/blog/collecting-hadoop-metrics/
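As a rough illustration of the REST route, the sketch below (Python, standard library only) reads the ResourceManager's /ws/v1/cluster/nodes response and groups hosts by rack. The `nodes` / `node` nesting and the `nodeHostName` and `rack` fields follow the ResourceManager REST API documentation; the function names and the DNS lookup for IPs are assumptions of this sketch, since the API reports host names rather than IP addresses.

```python
import json
import socket
import urllib.request
from collections import defaultdict


def fetch_nodes(rm_host, port=8088):
    """Fetch the node list from the ResourceManager REST API."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/nodes"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def hosts_by_rack(payload):
    """Group node host names by rack from a parsed /cluster/nodes response."""
    # "nodes" can be null when the ResourceManager knows no nodes.
    nodes = (payload.get("nodes") or {}).get("node") or []
    racks = defaultdict(list)
    for node in nodes:
        racks[node.get("rack", "unknown")].append(node["nodeHostName"])
    return {rack: sorted(hosts) for rack, hosts in racks.items()}


def resolve_ip(hostname):
    """Resolve a host name to an IPv4 address, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return None
```

Calling hosts_by_rack(fetch_nodes("<resourcemanagerhost>")) would give a rack-to-hosts mapping; the host count is the sum of the list lengths, and resolve_ip can be applied to each host name.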

Related

Does Hadoop itself contain fault-tolerance/failover functionality?

I just installed a new version of Hadoop 2. If I configure a Hadoop cluster and bring it up, how can I know whether a data transmission has failed and a failover is needed?
Do I have to install other components, like ZooKeeper, to track or enable HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Apache Hadoop documentation (http://hadoop.apache.org/). It gives an overview of the architecture and of the services that run on a Hadoop cluster.
ZooKeeper is required by many Hadoop services to coordinate their actions across the entire cluster, whether or not the cluster is HA. More information can be found in the Apache ZooKeeper documentation (http://zookeeper.apache.org/).

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input files can be large, running my application on a single machine is not enough. With one more physical machine, how should I design the architecture?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing the file data)?
Spark supports several cluster managers: YARN, Mesos, and Standalone. You may start with Standalone mode, which needs no external cluster manager; without HDFS, your input files must then be readable at the same path on every node (for example via a shared file system).
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts, which launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. On each worker machine in your cluster, pass that URL to the following command (the script is named start-worker.sh in newer Spark releases):
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark master web UI shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)
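The worker check in the last step can also be scripted. The sketch below is a minimal Python example, assuming the standalone master serves its status as JSON at /json/ on the web UI port and that each worker entry carries `host`, `cores` and `state` fields; those field names and the function names are assumptions to verify against your Spark version.

```python
import json
import urllib.request


def fetch_master_status(master_host, ui_port=8080):
    """Fetch the standalone master's status as JSON from its web UI port."""
    url = f"http://{master_host}:{ui_port}/json/"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def alive_workers(status):
    """Return (host, cores) for every worker the master reports as ALIVE."""
    return [
        (worker.get("host"), worker.get("cores"))
        for worker in status.get("workers", [])
        if worker.get("state") == "ALIVE"
    ]
```

After starting the second machine as a worker, alive_workers(fetch_master_status("localhost")) run on the master should list both machines if they registered successfully.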

Accessing Hadoop data using REST service

I am trying to update our HDP architecture so that data residing in Hive tables can be accessed through REST APIs. What are the best approaches to expose data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables and want to expose some of the information through a REST API, so I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What are the options that I have here?
Thank you
You can use WebHDFS, which is basically a REST service over HDFS.
Please see the documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
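For a sense of what a WebHDFS call looks like, here is a minimal Python sketch. It assumes the NameNode's default HTTP port 50070 and an unsecured cluster; the function names are illustrative. Note that WebHDFS exposes files and directories (for example, the files backing a Hive table), not SQL queries.

```python
import json
import urllib.parse
import urllib.request


def webhdfs_url(namenode_host, path, op, port=50070, **params):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=..."""
    query = urllib.parse.urlencode({"op": op, **params})
    absolute = path if path.startswith("/") else "/" + path
    # quote() keeps "/" intact and percent-encodes spaces and the like.
    return f"http://{namenode_host}:{port}/webhdfs/v1{urllib.parse.quote(absolute)}?{query}"


def list_directory(namenode_host, path, port=50070):
    """Return the entry names of an HDFS directory via op=LISTSTATUS."""
    url = webhdfs_url(namenode_host, path, "LISTSTATUS", port=port)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return [status["pathSuffix"] for status in payload["FileStatuses"]["FileStatus"]]
```

For example, list_directory("<namenodehost>", "/some/table/dir") would return the file names under that directory, where the path is a placeholder for wherever your table data lives.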
The REST API gateway for the Apache Hadoop ecosystem is called Apache Knox.
I would check it before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and is manageable via Ambari.
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through JDBC/ODBC drivers without requiring Hadoop config and binaries (other than those required for the drivers) on the client machines.

JobTracker web UI not working in pseudo-distributed mode (v2.7.1)

I have installed Hadoop 2.7.1 in pseudo-distributed mode (all daemons on a single machine). It's up and running; I'm able to access HDFS through the command line, run jobs, and see the output.
I can access http://localhost:50070/dfshealth.html#tab-overview. It shows the version and cluster status, and I can browse the Hadoop file system.
I found one link and applied its accepted solution, but that does not work for me. When I try to access http://127.0.0.1:54310, I get the error message below:
It looks like you are making an HTTP request to a Hadoop IPC port. This is
not the correct port for the web interface on this daemon.
Any help is appreciated.
Thanks..
I am using MR2 and am not able to track my job on port 8088. When I run a MapReduce job, it submits the job to http://localhost:8080, and that URL does not open to track the job.
Use port 50030 if you are using MRv1 (JobTracker web UI); for YARN, use port 8088 to access the ResourceManager web UI.
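To check quickly which of these web UIs is actually up, a small Python sketch like the following may help. The port numbers are the Hadoop 2.x defaults mentioned in this thread; the dictionary and function names are illustrative, and a cluster with customized ports would need different values.

```python
import socket

# Default web UI ports mentioned in this thread (Hadoop 2.x defaults).
DEFAULT_UI_PORTS = {
    "namenode": 50070,         # HDFS NameNode web UI
    "resourcemanager": 8088,   # YARN ResourceManager web UI (MRv2)
    "jobtracker": 50030,       # MRv1 JobTracker web UI
}


def ui_url(daemon, host="localhost"):
    """Return the default web UI URL for a daemon listed in DEFAULT_UI_PORTS."""
    return f"http://{host}:{DEFAULT_UI_PORTS[daemon]}"


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If is_listening("localhost", 8088) returns False on a YARN setup, the ResourceManager is probably not running or is bound to a different address or port.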

Monitoring a Hadoop cluster with Ganglia

I'm new to Hadoop and am trying to monitor a multi-node cluster using Ganglia. gmond is set up on all nodes, and the Ganglia monitor only on the master. However, there are Hadoop metrics graphs only for the master node and just system metrics for the slaves. Do these Hadoop metrics on the master include the slave metrics as well? Or is there a mistake in the configuration files? Any help would be appreciated.
I think you should read this in order to understand how metrics flow between master and slave.
In brief: in general, Hadoop- or HBase-based metrics are emitted directly to the master server (by master server, I mean the server on which gmetad is installed). All other OS-related metrics are first collected by the gmond installed on the corresponding slave and then forwarded to the gmond installed on the master server.
So, if you are not getting any OS-related metrics from the slave servers, there is some misconfiguration in your gmond.conf. To learn more about how to configure Ganglia, please read this. It helped me and should help you too if you go through it carefully.
There is a mistake in your configuration files.
More precisely, in transmitting / collecting the data, whichever approach you use.
