CDH 5.9 dfs.datanode.data.dir configuration - hadoop

I've installed CDH 5.9 using Cloudera Manager Installer, where I've specified directories for HDFS metadata (/dfs/nn) and actual data (/dfs/dn).
After installation HDFS works correctly and stores metadata and data in the locations defined in Cloudera Manager, but in /etc/hadoop/hdfs-site.xml there is no setting for the dfs.datanode.data.dir parameter.
Running the following command returns the default location:
# hdfs getconf -confKey dfs.datanode.data.dir
file:///tmp/hadoop-root/dfs/data
Can anyone tell me where in CDH 5.9 I can find the HDFS configuration that reflects my setup?
Regards,

Search for dfs.datanode.data.dir in Cloudera Manager; you will get the configured values there, and from there you can change them.
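As noted for the dfs.datanode.du.reserved question further down, Cloudera Manager writes the effective values into the per-role configuration it generates, not into /etc/hadoop/conf. A rough way to confirm the value from the shell, assuming the agent's default process directory (the numeric prefix in the path varies per cluster):
# Sketch: show the value the running DataNode actually uses
grep -A1 dfs.datanode.data.dir /var/run/cloudera-scm-agent/process/*-hdfs-DATANODE/hdfs-site.xml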

Related

How do you create an HDFS data directory?

Every time my Hadoop server reboots, I have to format the NameNode to start Hadoop, and this removes all of the files in my Hadoop installation.
I need to move my HDFS storage location from /tmp to a permanent location so that I don't have to format the NameNode every time the server reboots.
I am very new to Hadoop.
How do I create an HDFS data directory in another location? How do I reference this data directory in the config file so that I don't have to format the NameNode?
These two properties in hdfs-site.xml determine where HDFS stores its local files; the defaults are under /tmp:
dfs.namenode.name.dir
dfs.datanode.data.dir
You typically only have to format a NameNode when the HDFS processes failed to terminate correctly (for example after a power failure or forced shutdown). Running a standby NameNode is encouraged to guard against these scenarios.
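A minimal hdfs-site.xml sketch; /data/hadoop/namenode and /data/hadoop/datanode are placeholder paths, so substitute directories on a disk that survives reboots. After pointing dfs.namenode.name.dir at a new, empty directory you format it once, and never again on reboot:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/hadoop/datanode</value>
</property>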

Find port number where HDFS is listening

I want to access HDFS with fully qualified names such as:
hadoop fs -ls hdfs://machine-name:8020/user
I could also simply access HDFS with
hadoop fs -ls /user
However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing HDFS files with qualified names.
I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name, but this seems to differ between distributions. For example, HDFS is maprfs on MapR, and IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf.
There doesn't seem to be a way for Hadoop to tell me what's defined in fs.default.name through its command-line options.
How can I reliably get the value defined in fs.default.name from the command line?
The tests will always run on the NameNode, so the machine name is easy. But getting the port number (8020) is a bit difficult. I tried lsof and netstat, but still couldn't find a reliable way.
The command below is available in Apache Hadoop 2.7.0 onwards and can be used to get the values of Hadoop configuration properties. fs.default.name is deprecated in Hadoop 2.0; fs.defaultFS is the updated key. I am not sure whether this will work in the case of maprfs.
hdfs getconf -confKey fs.defaultFS # ( new property )
or
hdfs getconf -confKey fs.default.name # ( old property )
I am not sure whether any command-line utilities are available for retrieving configuration property values in MapR or in Hadoop 0.20-era versions. In that case, you can do the same thing in Java to retrieve the value of a configuration property:
import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();  // loads core-site.xml / hdfs-site.xml from the classpath
System.out.println(conf.get("fs.default.name"));
fs.default.name is deprecated.
Use: hdfs getconf -confKey fs.defaultFS
I came across this answer while looking for the HDFS URI, which is generally a URL pointing to the NameNode. hdfs getconf -confKey fs.defaultFS gives me the name of the nameservice, but it won't help me build the HDFS URI.
I tried the command below to get a list of the NameNodes instead:
hdfs getconf -namenodes
This gave me a list of all the NameNodes, primary first, followed by secondary. After that, constructing the HDFS URI was simple:
hdfs://<primarynamenode>/
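A small sketch that combines this with a port; 8020 is just the common default NameNode RPC port and is an assumption here, since -namenodes only returns host names:
# Sketch: take the first NameNode from the list and assume the default RPC port 8020
NN=$(hdfs getconf -namenodes | awk '{print $1}')
hadoop fs -ls hdfs://${NN}:8020/user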
You can use
hdfs getconf -confKey fs.default.name
Yes, hdfs getconf -namenodes will show the list of NameNodes.

Cloudera Manager - dfs.datanode.du.reserved not working

I have set the dfs.datanode.du.reserved property to 10 GB using Cloudera Manager. But when I check the MapReduce job.xml file, I find dfs.datanode.du.reserved is still set to 0. How do I verify whether the property is set?
PS: I am using Cloudera Standard 4.7.2 with CDH 4.4.0.
This flag is set in hdfs-site.xml, not in mapred-site.xml.
You will not be able to see this flag in the client configuration (/etc/hadoop/conf/hdfs-site.xml) without tweaking the configuration.
It is only set in the DataNode configuration that is regenerated by Cloudera Manager. This configuration can be found in /var/run/cloudera-scm-agent/process/XXXXXX-hdfs-DATANODE/hdfs-site.xml, where XXXXXX is an incrementing number of some kind (used by Cloudera Manager).
From within Cloudera Manager you can see this flag on a DataNode: click Processes, then Configuration Files/Environment - Show, and there you will find the hdfs-site.xml for that DataNode.
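As a rough command-line cross-check (assuming, as is normally the case, that reserved space is subtracted from the capacity a DataNode reports), you can compare the reported capacity against the raw disk size:
# Sketch: once the setting is picked up, the "Configured Capacity" per DataNode
# should be roughly the disk capacity minus dfs.datanode.du.reserved
hdfs dfsadmin -report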

hadoop hdfs points to file:/// not hdfs://

So I installed Hadoop via Cloudera Manager cdh3u5 on CentOS 5. When I run the command
hadoop fs -ls /
I expected to see the contents of hdfs://localhost.localdomain:8020/
However, it returned the contents of file:///
Needless to say, I can access hdfs:// through
hadoop fs -ls hdfs://localhost.localdomain:8020/
But when it comes to installing other applications such as Accumulo, Accumulo automatically detects the Hadoop filesystem as file:///
The question is, has anyone run into this issue, and how did you resolve it?
I had a look at "HDFS thrift server returns content of local FS, not HDFS", which was a similar issue, but it did not solve this one.
Also, I do not get this issue with Cloudera Manager cdh4.
By default, Hadoop will use local mode. You probably need to set fs.default.name to hdfs://localhost.localdomain:8020/ in $HADOOP_HOME/conf/core-site.xml.
To do this, add this to core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost.localdomain:8020/</value>
</property>
Accumulo is confused because it uses the same default configuration to figure out where HDFS is, and it defaults to file://.
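After adding that property and restarting the Hadoop daemons, a quick sanity check (this is just the expected outcome, not part of the original answer) is that the plain and fully qualified listings now agree:
# Sketch: both commands should now list the same HDFS contents
hadoop fs -ls /
hadoop fs -ls hdfs://localhost.localdomain:8020/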
You should specify the DataNode data directory and the NameNode metadata directory:
dfs.name.dir / dfs.namenode.name.dir
dfs.data.dir / dfs.datanode.data.dir
fs.default.name
The dfs.* directory properties go in hdfs-site.xml and fs.default.name goes in core-site.xml; then format the NameNode.
To format the HDFS NameNode:
hadoop namenode -format
Enter 'Yes' to confirm formatting the NameNode. Then restart the HDFS service and deploy the client configuration to access HDFS.
If you have already done the above steps, ensure the client configuration is deployed correctly and points to the actual cluster endpoints.

Where does HDFS store files locally by default?

I am running Hadoop with the default configuration on a one-node cluster and would like to find out where HDFS stores files locally.
Any ideas?
Thanks.
You need to look in your hdfs-default.xml configuration file for the dfs.data.dir setting. The default value is ${hadoop.tmp.dir}/dfs/data; note that ${hadoop.tmp.dir} is itself defined in core-default.xml, described here.
The configuration options are described here. The description for this setting is:
Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
It seems that for the current version (2.7.1) the directory is
/tmp/hadoop-${user.name}/dfs/data
based on the dfs.datanode.data.dir and hadoop.tmp.dir settings from:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml
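A small sketch for checking this on a 2.x single-node setup, assuming both properties are still at their defaults (the exact path will differ if you changed either one):
# Sketch: print the effective setting, then look at the block storage on disk
hdfs getconf -confKey dfs.datanode.data.dir
ls /tmp/hadoop-$(whoami)/dfs/data/current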
As "more recent answer" and to clarify hadoop version numbers:
If you use Hadoop 1.2.1 (or something similar), #Binary Nerd's answer is still true.
But if you use Hadoop 2.1.0-beta (or something similar), you should read the configuration documentation here and the option you want to set is: dfs.datanode.data.dir
For Hadoop 3.0.0, the HDFS root path is given by the property "dfs.datanode.data.dir".
Run this at the command prompt, and you will get a listing of the HDFS root:
bin/hadoop fs -ls /
