What is the difference between hadoop.tmp.dir, mapred.temp.dir and mapreduce.cluster.temp.dir? - hadoop

I want to know what the difference is between hadoop.tmp.dir and mapred.temp.dir, and also how the deprecated mapred.temp.dir differs from mapreduce.cluster.temp.dir.

hadoop.tmp.dir is the highest-level temporary directory. It defaults to /tmp/hadoop-${user.name}.
By default, mapred.temp.dir refers to a directory under hadoop.tmp.dir: ${hadoop.tmp.dir}/mapred/temp.
Logically this makes sense: MapReduce (like Hive, Spark, etc.) is a component of the Hadoop ecosystem, so its temporary data lives under Hadoop's temp directory. As for mapreduce.cluster.temp.dir, it is simply the newer name for the deprecated mapred.temp.dir; the old name is still accepted but maps to the new property.
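For reference, the relationship shows up directly in the default configuration files, roughly like this (just a sketch of the defaults quoted above; you normally do not need to set these yourself):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>
<property>
  <name>mapred.temp.dir</name>
  <value>${hadoop.tmp.dir}/mapred/temp</value>
</property>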

Related

How to change java.io.tmpdir for spark job running on yarn

How can I change the java.io.tmpdir folder for my Hadoop 3 cluster running on YARN?
By default it gets something like /tmp/***, but my /tmp filesystem is too small for everything the YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of "What should be hadoop.tmp.dir?". Also, go through all the config files in /etc/hadoop/conf and search for "tmp" to see if anything is hardcoded. Also specify:
Whether you see any files getting created at the location you specified as hadoop.tmp.dir.
What pattern of files is being created under /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also want to look at hive-site.xml. The same applies to any other ecosystem product you are using.
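As a quick check (assuming the usual /etc/hadoop/conf and /etc/hive/conf locations; adjust the paths to your install), something like this will surface any hardcoded temp paths:
grep -rn "tmp" /etc/hadoop/conf /etc/hive/conf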
I have configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp filesystem and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for the Spark executors was also set to the directories defined in yarn.nodemanager.local-dirs.
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/somepath1,/anotherpath2</value>
</property>

Is pig.temp.dir property mandatory?

Pig Execution Mode = Local
In that case, do we need to set the pig.temp.dir=/temp property, and does this /temp folder need to be present in HDFS?
Note:
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property's default value is "/tmp" which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
As per the "Storing Intermediate Results" heading at http://pig.apache.org/docs/r0.14.0/start.html#req
You'll still need some temp directory, but it needs to be present on your local filesystem. In local mode, Pig (and MapReduce) performs all operations on the local filesystem by default.
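For example, in local mode you could point Pig's temp dir at a local path (the path below is just a placeholder), either in conf/pig.properties:
pig.temp.dir=/home/user/pig-tmp
or on the command line, assuming your Pig version accepts Java properties via -D placed before the other arguments:
pig -Dpig.temp.dir=/home/user/pig-tmp -x local myscript.pig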

Restarting datanodes after reformating namenode in a hadoop cluster

Using the basic configuration provided in the hadoop setup official documentation, I can run a hadoop cluster and submit mapreduce jobs.
The problem is that whenever I stop all the daemons, reformat the namenode, and then start all the daemons again, the datanode does not start.
I've been looking around for a solution, and it appears this is because formatting only formats the namenode, so the datanode's storage needs to be erased as well.
How can I do this? What changes do I need to make to my config files? After those changes are made, how do I delete the correct files when formatting the namenode again?
Specifically, check whether you have configured the two parameters below, which can be defined in hdfs-site.xml:
dfs.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have provided specific directory locations for the above two parameters, then you need to delete those directories as well before formatting the namenode.
If you have not provided them, then by default the data is created under the parameter below:
hadoop.tmp.dir, which can be configured in core-site.xml.
Again, if you have specified this parameter, you need to remove that directory before formatting the namenode.
If you have not defined it, then by default the data is created in /tmp/hadoop-<username> (for example /tmp/hadoop-hadoop when running as the hadoop user), so you need to remove that directory.
Summary: you have to delete the namenode and datanode directories before reformatting. By default they are created under /tmp.
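A rough sequence (a sketch; the paths are placeholders for whatever dfs.name.dir / dfs.data.dir, or hadoop.tmp.dir, resolve to in your setup):
stop-dfs.sh
# remove the namenode and datanode storage directories
# (or /tmp/hadoop-<username> if you never set the properties)
rm -rf /path/to/dfs/name /path/to/dfs/data
hdfs namenode -format
start-dfs.sh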

How can I specify Hadoop XML configuration variables via the Hadoop shell scripts?

I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc., to be a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I'm checking in the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At checkin time: I can modify the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At run time: I generate the temporary directory that I want to use for my data storage.
I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.
However, HDFS determines the local directories it uses to store the filesystem via two properties that are defined in XML files, not environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.
Is there any way to edit these XML files to determine the value of hadoop.tmp.dir at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?
We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir as part of HADOOP_OPTS worked well for us. For example:
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"
This method can be used to set other configuration properties as well, such as the namenode URL.
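In the same spirit, hadoop.tmp.dir and dfs.datanode.data.dir could be passed the same way (a sketch; $TMP_BASE is a placeholder for the runtime-generated directory, and it is worth verifying on your Hadoop version that these system properties are actually picked up during configuration variable expansion):
export HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.tmp.dir=$TMP_BASE/hadoop -Ddfs.datanode.data.dir=$TMP_BASE/dfs/data"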

Where HDFS stores files locally by default?

I am running hadoop with default configuration with one-node cluster, and would like to find where HDFS stores files locally.
Any ideas?
Thanks.
You need to look in your hdfs-default.xml configuration file for the dfs.data.dir setting. The default setting is ${hadoop.tmp.dir}/dfs/data; note that ${hadoop.tmp.dir} is defined in core-default.xml, described here.
The configuration options are described here. The description for this setting is:
Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
Seems like for the current version (2.7.1) the dir is
/tmp/hadoop-${user.name}/dfs/data
Based on the dfs.datanode.data.dir and hadoop.tmp.dir settings from:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml
As "more recent answer" and to clarify hadoop version numbers:
If you use Hadoop 1.2.1 (or something similar), #Binary Nerd's answer is still true.
But if you use Hadoop 2.1.0-beta (or something similar), you should read the configuration documentation here and the option you want to set is: dfs.datanode.data.dir
For Hadoop 3.0.0, the datanode's local storage path is likewise given by the property dfs.datanode.data.dir.
Run this at the command prompt to list the contents of the HDFS root directory:
bin/hadoop fs -ls /
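If you want to see which local directory the datanode is actually configured to use, the getconf tool should also work (assuming a reasonably recent Hadoop version):
hdfs getconf -confKey dfs.datanode.data.dir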
