Where HDFS stores files locally by default? - hadoop

I am running hadoop with default configuration with one-node cluster, and would like to find where HDFS stores files locally.
Any ideas?
Thanks.

You need to look in your hdfs-default.xml configuration file for the dfs.data.dir setting. The default setting is: ${hadoop.tmp.dir}/dfs/data and note that the ${hadoop.tmp.dir} is actually in core-default.xml described here.
The configuration options are described here. The description for this setting is:
Determines where on the local
filesystem an DFS data node should
store its blocks. If this is a
comma-delimited list of directories,
then data will be stored in all named
directories, typically on different
devices. Directories that do not exist
are ignored.

Seems like for the current version(2.7.1) the dir is
/tmp/hadoop-${user.name}/dfs/data
Based on dfs.datanode.data.dir, hadoop.tmp.dir setting from:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml

As "more recent answer" and to clarify hadoop version numbers:
If you use Hadoop 1.2.1 (or something similar), #Binary Nerd's answer is still true.
But if you use Hadoop 2.1.0-beta (or something similar), you should read the configuration documentation here and the option you want to set is: dfs.datanode.data.dir

For hadoop 3.0.0, the hdfs root path is as given by the property "dfs.datanode.data.dir"

Run this in the cmd prompt, and you will get the HDFS location:
bin/hadoop fs -ls /

Related

why there is a need of hadoop commands in Pseudo-distributed mode?

It might be a stupid question but I needed to know.
For example: Why do we need hadoop fs -ls command to list files? Instead why can't just ls be used?
If in pseudo-distributed mode, is that case part of filesystem is given to hadoop file system that is only accessible to hadoop namenode daemon...this is my guess. Please explain.
ls will list all file spaces available to your computer
You can set the fs.defaultFS property to be file:///, the default, then both will act the same, but this is not considered pseudodistributed mode.
Pseudodistributed node requires that you specify a list of datanode and namenode volumes on each respective system in the cluster, and hdfs dfs commands will only list those files that are known by the namenode.
And its called pseudodistributed only because it's a single node. Once you have that working, adding another node should be straightforward given appropriate networking connections

Why Hive will search its configuration profile in HADOOP_CONF_DIR first?

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive will use the hive-site.xml in the $HADOOP_HOME/etc/hadoop/ instead of the one in $HIVE_HOME/conf, and it will also search for the hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If not found, Hive will just use the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf, but why?
I searched the keyword copy hive-site.xml to HADOOP_HOME in the official Hive manual in apache.org but failed to find any explanation...
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
So, you've mentioned Sqoop, therefore I'll point out the proper processes for getting hive XML configuration.
1) There's a classpath problem if the file isn't found. Copying the file is one solution, but a poor one. A symlink is preferred.
Every time I've used Sqoop, I never messed around with controlling any XML files - it just worked. Therefore, both HDP and CDH must have the proper classpath and/or symlinks setup.
2) The documentation states where configurations are loaded from
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from
3) You can also, at runtime, give extra files
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...

Hadoop copying file to hadoop filesystem

I have copied a file from a local to the hdfs file system and the file got copied -- /user/hduser/in
hduser#vagrant:/usr/local/hadoop/hadoop-1.2.1$ bin/hadoop fs -copyFromLocal /home/hduser/afile in
Question:-
1.How does hadoop by default copies the file to this directory -- /user/hduser/in ...Where is this mapping specified in the conf file?
If you write the command like above, the file gets copied to your user's HDFS home directory, which is /home/username. See also here: HDFS Home Directory.
You can use an absolute pathname (one starting with "/") just like in a Linux filesystem, if you want to write the file to a different location.
Are u using a default vm? Basically if you configure hadoop from binaries without using the preconfigure yum package. It doesnt have a default path. But if you use yum via hortin or cloudera vm. It comes with default path i guess
Check the hdfs-site.xml to see the default fs path. So "/" will point to the base URL set in the above mentioned XML. Any folder mentioned in the command without the use of home path will be appended to that.
hadoop picks the default path defined in hdfs-site.xml and write data.
below image clear how writes works in HDFS.

Explanation of the hadoop file system

Can any one help me understand the data storage concept of hadoop?
As I understand it, hadoop deals with fs image and data blocks, and fsimage and edit logs paths are stored hdfs-site.xml. But what about the data blocks? Can anyone help me in this? I am little bit confused where the /user and /tmp dir is actually present in the filesystem.
I used this link to set up a single node hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Files are split into blocks and stored in the Hadoop Distributed File System (HDFS). Consult the HDFS module of Yahoo's Hadoop Tutorial for a description of HDFS. The directories stored in HDFS can be viewed by typing the following command into a terminal: hadoop dfs -ls
The Namenode's FSImage keeps track of which Datanode has which files. In the hdfs-site.xml file, the configuration 'dfs.data.dir' defines where the datanode stores the underlying files on the filesystem. This can be a comma separated list of directories (think multiple disks).

"hadoop namenode -format" formats wrong directory

I'm trying to install Hadoop 1.1.2.21 on CentOS 6.3
I've configured dfs.name.dir in /etc/hadoop/conf/hdfs-site.xml file
<name>dfs.name.dir</name>
<value>/mnt/ext/hadoop/hdfs/namenode</value>
But when I run "hadoop namenode -format" command, it formats /tmp/hadoop-hadoop/dfs/name instead.
What am I missing?
I ran into this problem and solved it. So updating this answer.
Make sure your environment variable HADOOP_CONF_DIR points to the directory where it can find all you xml files for used for configuration. It solved it for me.
It might be taking the path /tmp/hadoop-hadoop/dfs/name from hdfs-default.xml. Not sure why the value from hdfs-site.xml is not taken. Is dfs.name.dir marked as final in hdfs-default.xml?
Check if some Hadoop Process is running in the background already. This happens if you have aborted a previous process and it has not been killed and has become a ZOMBIE process
If that is the case kill the process and then again try to format the system
Also you can check the permission of the Directory.
Try to give a different location for the directory, if it is reflected
Please don't set HADOOP_CONF_DIR. You can check .bashrc file and remove it.

Resources