Find port number where HDFS is listening - hadoop

I want to access hdfs with fully qualified names such as:
hadoop fs -ls hdfs://machine-name:8020/user
I could also simply access hdfs with
hadoop fs -ls /user
However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing hdfs files with qualified names.
I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name. But this seems to differ between distributions. For example, hdfs is maprfs on MapR. IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf.
There doesn't seem to be a way for hadoop to tell me what's defined in fs.default.name through its command line options.
How can I get the value defined in fs.default.name reliably from the command line?
The test will always be running on the namenode, so the machine name is easy. But getting the port number (8020) is a bit difficult. I tried lsof and netstat, but still couldn't find a reliable way.
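For context, the kind of netstat probe I was trying looks something like the sketch below; it is fragile because it assumes a single local NameNode process whose pid can be found, which is exactly why it didn't feel reliable:
# fragile sketch: find the pid of the NameNode JVM and list the TCP ports it listens on
sudo netstat -tlnp | grep "$(pgrep -f NameNode | head -1)/"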

The command below is available in Apache Hadoop 2.7.0 onwards and can be used to get the values of Hadoop configuration properties. fs.default.name is deprecated since Hadoop 2.0; fs.defaultFS is the updated name. Not sure whether this will work in the case of maprfs.
hdfs getconf -confKey fs.defaultFS # ( new property )
or
hdfs getconf -confKey fs.default.name # ( old property )
Not sure whether there are any command-line utilities available for retrieving configuration property values in MapR or in Hadoop 0.20 versions. In that situation, you are better off retrieving the value of the configuration property in Java:
// requires org.apache.hadoop.conf.Configuration on the classpath
Configuration conf = new Configuration();
System.out.println(conf.get("fs.default.name"));
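For completeness, a hedged sketch of compiling and running such a snippet against the cluster classpath; the class name PrintDefaultFs is hypothetical, but hadoop classpath is a standard command that prints the required jars:
# assumes the snippet above lives in a class named PrintDefaultFs (hypothetical name)
javac -cp "$(hadoop classpath)" PrintDefaultFs.java
java -cp ".:$(hadoop classpath)" PrintDefaultFs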

fs.default.name is deprecated.
Use: hdfs getconf -confKey fs.defaultFS
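If, as in the original question, you specifically need the port, a minimal sketch (assuming the value has the usual hdfs://host:port form) is to strip everything up to the last colon:
# prints e.g. 8020; assumes fs.defaultFS looks like hdfs://namenode-host:8020
hdfs getconf -confKey fs.defaultFS | sed -E 's|.*:([0-9]+)/?$|\1|'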

I encountered this answer when I was looking for the HDFS URI. Generally that's a URL pointing to the namenode. While hdfs getconf -confKey fs.defaultFS gets me the name of the nameservice, it doesn't help me build the HDFS URI.
I tried the command below to get a list of the namenodes instead:
hdfs getconf -namenodes
This gave me a list of all the namenodes, primary first, followed by the secondary ones. After that, constructing the HDFS URI was simple:
hdfs://<primarynamenode>/
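A minimal shell sketch of that construction, assuming the first name returned is the primary namenode and that the default RPC port is in use:
# pick the first namenode from the list and build a URI (port omitted, so the default applies)
NN=$(hdfs getconf -namenodes | awk '{print $1}')
hadoop fs -ls "hdfs://${NN}/"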

You can use:
hdfs getconf -confKey fs.default.name

Yes, hdfs getconf -namenodes will show the list of namenodes.

Related

regarding core-site.xml file entries with start-dfs.sh and map reduce task - Hadoop

I am new to big data and am running Hadoop on Ubuntu.
For map reduce jobs, the below entry from core-site.xml needs to be suppressed:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
start-dfs.sh does not execute with the above entry suppressed.
Kindly assist, and do let me know whether multiple core-site.xml files or entries are permitted.
fs.defaultFS is the preferred property over the deprecated fs.default.name. One of them is required, and they cannot be "suppressed".
If you define multiple matching properties in the XML, only one will be used.
You can't have multiple files with the same name in the same Hadoop config directory anyway, and that includes core-site.xml.
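As a hedged sanity check after editing the file, you can ask Hadoop which value it actually resolved rather than re-reading the XML yourself:
# whichever of fs.defaultFS / fs.default.name is in effect, this prints the resolved value
hdfs getconf -confKey fs.defaultFS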

Namenode and Jobtracker information on Hadoop cluster

How can I get the following information on the Hadoop cluster?
1. namenode and jobtracker name
2. list of all nodes with their roles on the cluster
To get namenode info:
hdfs getconf -confKey fs.defaultFS
For jobtracker
hdfs getconf -confKey yarn.resourcemanager.address.rm2
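The second part of the question (nodes and their roles) isn't covered by getconf; a hedged sketch using standard commands is:
# DataNodes known to the namenode (may require HDFS superuser rights on some clusters)
hdfs dfsadmin -report | grep '^Name:'
# NodeManagers registered with the ResourceManager
yarn node -list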
I am using a Cloudera-based cluster and also working on EMR.
In both clusters I can find the information in the configuration directory.
To get the namenode information, go into the core-site.xml file and look for fs.defaultFS, as #daemon12 said.
Here is the straight way to get it.
For the namenode information, use the command below:
cat /etc/hadoop/conf/core-site.xml | grep '8020'
Here is the result
<value>hdfs://10.872.22.1:8020</value>
The value inside the value tag is the namenode information.
Similarly, to get the jobtracker information, do the below:
cat /etc/hadoop/conf/yarn-site.xml | grep '8032'
Here is the result
<value>10.872.12.32:8032</value>
Again, the jobtracker value is inside the value tag.
Generally the NN and JT information is used to run Oozie jobs, and this method will help you for that purpose.
DISCLAIMER: I am grepping based on the namenode and jobtracker port numbers, which are 8020 and 8032 respectively. These are the widely known default ports for the NN and JT in Hadoop. If your organization uses different ones, please grep for those instead to get a more appropriate result.
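A slightly more robust variant is to grep by property name rather than by port number, which sidesteps the non-default-port caveat above; this sketch assumes the stock /etc/hadoop/conf layout and the usual one-element-per-line XML formatting:
# -A1 also prints the <value> line that follows the matching <name> line
grep -A1 -e 'fs.defaultFS' -e 'fs.default.name' /etc/hadoop/conf/core-site.xml
grep -A1 -e 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml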
Along with the command-line way of getting the information, you can also get similar information in the browser:
http://<namenode>:50070 (For in general hadoop informtion)
http://<namenode>:50030 (For JobTracker related information)
These are default ports. You can check here for more information.
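If you prefer something scriptable over the web UI, the same namenode web port also exposes a JMX endpoint that returns JSON; this is just a hedged sketch, and the exact bean names vary by Hadoop version:
# dumps the namenode's JMX beans as JSON; filter with the JSON tool of your choice
curl -s "http://<namenode>:50070/jmx"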
With the correct authorization granted (e.g. running as sudo -u hdfs), you may try:
hdfs dfsadmin -report

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration, given the configuration key, using the yarn command?
For example, I would like to do something like this:
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
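Applied to the property from the question, a hedged usage sketch is to capture the resolved value into a shell variable:
# prints the effective YARN maximum container allocation in MB
MAX_ALLOC_MB=$(hdfs getconf -confKey yarn.scheduler.maximum-allocation-mb)
echo "yarn.scheduler.maximum-allocation-mb = ${MAX_ALLOC_MB}"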
A benefit of using this is that you'll see the actual, final values of the configuration properties as Hadoop uses them. This accounts for some of the more advanced configuration patterns, such as use of XInclude in the XML files or property substitutions, like this:
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to accurately match the implementation as it's done inside Hadoop, so it's better to ask Hadoop itself.
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation needing to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.

Hadoop filesystem reads linux filesystem instead of hdfs?

I have a strange thing happening: when I read the hadoop filesystem, it shows me the linux filesystem, not the hadoop one. Is anyone familiar with this issue?
Thanks,
Mika
This will happen if a valid hadoop configuration is not found.
e.g. if you do:
hadoop fs -ls
and no configuration is found at the default location, then you will see the linux filesystem. You can test this by pointing the hadoop command at a config directory explicitly, e.g.
hadoop --config <path-to-conf-dir> fs -ls
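A hedged way to confirm which case you are in: with no usable configuration the client falls back to the local filesystem default, which you can see directly:
# prints file:/// when no HDFS configuration is loaded, hdfs://... when one is
hdfs getconf -confKey fs.defaultFS
hadoop fs -ls /     # with file:/// in effect this lists the local root directory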

hadoop hdfs points to file:/// not hdfs://

So I installed Hadoop via Cloudera Manager cdh3u5 on CentOS 5. When I run the command
hadoop fs -ls /
I expected to see the contents of hdfs://localhost.localdomain:8020/
However, it returned the contents of file:///
Now, this goes without saying that I can access my hdfs:// through
hadoop fs -ls hdfs://localhost.localdomain:8020/
But when it came to installing other applications such as Accumulo, Accumulo would automatically detect the Hadoop filesystem as file:///
Question is, has anyone run into this issue, and how did you resolve it?
I had a look at HDFS thrift server returns content of local FS, not HDFS , which was a similar issue, but did not solve this issue.
Also, I do not get this issue with Cloudera Manager cdh4.
By default, Hadoop is going to use local mode. You probably need to set fs.default.name to hdfs://localhost.localdomain:8020/ in $HADOOP_HOME/conf/core-site.xml.
To do this, you add this to core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost.localdomain:8020/</value>
</property>
Accumulo is confused because it's using the same default configuration to figure out where HDFS is... and it's defaulting to file://
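After adding the property, a hedged sanity check (assuming the HDFS daemons are running) is that the short form and the fully qualified form should now list the same thing:
# should list the HDFS root rather than the local filesystem root
hadoop fs -ls /
hadoop fs -ls hdfs://localhost.localdomain:8020/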
We should specify the datanode data directory and the namenode metadata directory, i.e. the properties
dfs.name.dir,
dfs.namenode.name.dir,
dfs.data.dir,
dfs.datanode.data.dir,
fs.default.name
The dfs.* properties go in hdfs-site.xml and fs.default.name goes in core-site.xml; then format the namenode.
To format HDFS Name Node:
hadoop namenode -format
Enter 'Yes' to confirm formatting the namenode. Restart the HDFS service and deploy the client configuration to access HDFS.
If you have already done the above steps, ensure the client configuration is deployed correctly and that it points to the actual cluster endpoints.
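For reference, a hedged sketch of that sequence on a CDH-style install; the service name is an assumption and differs between distributions and init systems:
hdfs namenode -format                        # newer form of "hadoop namenode -format"
sudo service hadoop-hdfs-namenode restart    # assumption: CDH package service name
hadoop fs -ls /                              # verify the client now talks to HDFS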
