Where HDFS stores data - hadoop

I am trying to understand where Hadoop stores data in HDFS. I have referred to the config files, viz. core-site.xml and hdfs-site.xml.
The property that I have set is:
In core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp</value>
</property>
In hdfs-site.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/hdfs/datanode</value>
</property>
With the above arrangement, the data blocks should be stored in the directory given by dfs.datanode.data.dir. Is this correct?
I referred to the Apache Hadoop documentation, and from that I see this:
core-default.xml: hadoop.tmp.dir --> A base for other temporary directories.
hdfs-default.xml: dfs.datanode.data.dir --> Determines where on the local filesystem a DFS data node should store its blocks.
The default value for this property is file://${hadoop.tmp.dir}/dfs/data.
Since I explicitly provided the value for dfs.datanode.data.dir (in hdfs-site.xml), does it mean data would be stored in that location? If so, would dfs/data be appended to ${dfs.datanode.data.dir}, i.e. would it become /hadoop/hdfs/datanode/dfs/data?
However I didn't see this directory structure getting created.
One observation that I saw in my env:
After I run some MapReduce programs, this directory gets created: /hadoop/tmp/dfs/data.
So I am not sure whether data is stored in the directory suggested by the property dfs.datanode.data.dir.
Does anyone have similar experience?

The data for HDFS files will be stored in the directory specified by dfs.datanode.data.dir, and the /dfs/data suffix that you see in the default value will not be appended.
If you edit hdfs-site.xml, you'll have to restart the DataNode service for the change to take effect. Also remember that changing the value will eliminate the ability of the DataNode service to supply blocks that were stored in the previous location.
Lastly, above you have your values specified with file:/... instead of file://.... File URIs do need that extra slash, so that might be causing these values to revert to the defaults.
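For reference, here is a sketch of the same two properties with well-formed file:// URIs (the paths are the ones from the question; adjust to your own layout):

```xml
<!-- hdfs-site.xml: note the three slashes in file:///...
     (scheme + empty authority + absolute path) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///hadoop/hdfs/datanode</value>
</property>
```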

Related

hadoop-Not able to run namenode without a format

I know that this question has been asked before, and I saw the following solution:
<!-- to be modified in hdfs-site.xml-->
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hadoop/data</value>
</property>
I modified my hdfs-site.xml file and also removed the contents of the temp directory. But still, without formatting the namenode, it doesn't start.
Also, should the path of the directory given in the value tag already exist?
Any suggestions on what I am missing in the above update?
without formatting the namenode, it doesn't start
As mentioned here, you need to format the namenode.
should the path of the directory given in the value of the value tag be already existing?
Yes. Both the namenode and datanode directories should exist beforehand and, most importantly, should have the proper permissions for the HDFS user.

Where exactly should hadoop.tmp.dir be set? core-site.xml or hdfs-site.xml?

I'm asking about the Hadoop 2.x series. There's conflicting advice about this on the Internet: in one case it is suggested to specify it in core-site.xml, while an SO answer mentions that hadoop.tmp.dir be set in hdfs-site.xml. Which is the right place to put it?
hadoop.tmp.dir (a base for other temporary directories) is a property that needs to be set in core-site.xml; other config files can then reference it, much like an exported variable in Linux.
Ex:
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
You can use reference of hadoop.tmp.dir in hdfs-site.xml like above
For more, see core-site.xml and hdfs-site.xml.
There are three HDFS properties which contain hadoop.tmp.dir in their values:
dfs.name.dir: directory where namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name.
dfs.data.dir: directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data.
fs.checkpoint.dir: directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary.
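As a sketch, setting the base once in core-site.xml then cascades into all three defaults (the path is an example taken from the first question above):

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/tmp</value>
</property>
<!-- dfs.name.dir, dfs.data.dir and fs.checkpoint.dir then default to
     /hadoop/tmp/dfs/name, /hadoop/tmp/dfs/data and /hadoop/tmp/dfs/namesecondary -->
```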

Hadoop keeps on writing mapred intermediate output in /tmp directory

I have limited capacity in /tmp so I want to move all the intermediate output of mapred in a bigger partition, say /home/hdfs/tmp_data .
If I understand correctly, I just need to set
<property>
<name>mapred.child.tmp</name>
<value>/home/hdfs/tmp_data</value>
</property>
in mapred-site.xml
I restarted the cluster through Ambari and checked that everything was written in the conf file;
however, when I run a pig script, it keeps writing to:
/tmp/hadoop-hdfs/mapred/local/taskTracker/hdfs/jobcache/job_localXXX/attempt_YY/output
I have also modified hadoop.tmp.dir in core-site.xml to be /home/hdfs/tmp_data , but nothing changes.
Is there any parameter that overwrite my settings?
Try overriding the following property in the tasktracker nodes' mapred-site.xml file and restart them.
<property>
<name>mapred.local.dir</name>
<value>/home/hdfs/tmp_data</value>
</property>

Hadoop/MR temporary directory

I've been struggling with getting Hadoop and Map/Reduce to start using a separate temporary directory instead of the /tmp on my root directory.
I've added the following to my core-site.xml config file:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
I've added the following to my mapreduce-site.xml config file:
<property>
<name>mapreduce.cluster.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>${hadoop.tmp.dir}/mapred/system</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>${hadoop.tmp.dir}/mapred/temp</value>
</property>
No matter what job I run though, it's still doing all of the intermediate work out in the /tmp directory. I've been watching it do it via df -h and when I go in there, there are all of the temporary files it creates.
Am I missing something from the config?
This is on a 10 node Linux CentOS cluster running 2.1.0.2.0.6.0 of Hadoop/Yarn Mapreduce.
EDIT:
After some further research, the settings seem to be working on my management and namenode/secondary namenode boxes. It is only on the data nodes that this is not working, and it is only the mapreduce temporary output files that are still going to /tmp on my root drive rather than the data mount I set in the configuration files.
If you are running Hadoop 2.0, then the proper name of the config file you need to change is mapred-site.xml, not mapreduce-site.xml.
An example can be found on the Apache site: http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
and it uses the mapreduce.cluster.local.dir property name, with a default value of ${hadoop.tmp.dir}/mapred/local
Try renaming your mapreduce-site.xml file to mapred-site.xml in your /etc/hadoop/conf/ directories and see if that fixes it.
If you are using Ambari, you should be able to just use the "Add Property" button in the MapReduce2 / Custom mapred-site.xml section, enter mapreduce.cluster.local.dir for the property name, and a comma-separated list of directories you want to use.
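The resulting entry would look something like this (the directories are examples; any comma-separated list of local paths on each node works):

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data/tmp/mapred/local,/data2/tmp/mapred/local</value>
</property>
```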
I think you need to specify this property in hdfs-site.xml rather than core-site.xml. Try setting it there; I hope this will solve your problem.
The mapreduce properties should be in mapred-site.xml.
I was facing a similar issue where some nodes would not honor the hadoop.tmp.dir set in the config.
A reboot of the misbehaving nodes fixed it for me.

Why do we need to format HDFS after every time we restart machine?

I have installed Hadoop in pseudo distributed mode on my laptop, OS is Ubuntu.
I have changed paths where hadoop will store its data (by default hadoop stores data in /tmp folder)
hdfs-site.xml file looks as below :
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node did not start by checking the logs and by using the jps command.
Then I
Stopped cluster using stop-all.sh script.
Formatted HDFS using hadoop namenode -format command.
Started cluster using start-all.sh script.
Now everything works fine even if I stop and start cluster again. Problem occurs only when I restart machine and try to start the cluster.
Has anyone encountered similar problem?
Why this is happening and
How can we solve this problem?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive across a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change, as it depends on your config, but the namenode dir is mandatory and might also be sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
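For the 1.x naming, a sketch (the /HADOOP_CLUSTER_DATA path mirrors the question's datanode setting; adjust to your layout):

```xml
<!-- hdfs-site.xml on Hadoop 1.x: point the namenode dir away from /tmp as well -->
<property>
  <name>dfs.name.dir</name>
  <value>/HADOOP_CLUSTER_DATA/name</value>
</property>
```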
For those who use hadoop 2.0 or above versions config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your hadoop installation.
Open the file hdfs-site.xml. This user configuration will override the default Hadoop configuration that is loaded by the Java classloader first.
Add dfs.namenode.name.dir property and set a new namenode dir (default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for dfs.datanode.data.dir property (default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Other property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is: file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can easily also add this property:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>