Regarding core-site.xml file entries with start-dfs.sh and MapReduce tasks

I am new to the big data modules and am running Hadoop on Ubuntu.
For MapReduce jobs, the entry below from core-site.xml needs to be suppressed:
fs.default.name
hdfs://localhost:8020
start-dfs.sh does not execute with the above entry suppressed.
Kindly assist, and please let me know whether multiple core-site.xml files or entries are permitted.

fs.defaultFS is the preferred property over the deprecated fs.default.name. One of them is required, and they cannot be "suppressed".
If you define multiple matching properties in the XML, only one of them will be used.
You can't have multiple files with the same name in the same Hadoop config directory anyway, and that includes core-site.xml.
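A minimal Java sketch of that behaviour (assuming the Hadoop conf directory is on the classpath; the class name is just for illustration): Hadoop's Configuration class maps the deprecated fs.default.name to fs.defaultFS, so either key resolves to the same value, and one of them has to point at the NameNode for start-dfs.sh and MapReduce jobs to work.
import org.apache.hadoop.conf.Configuration;

public class ShowDefaultFs {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();
        // Both keys resolve to the same value thanks to the deprecation mapping
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}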

Related

Get a YARN configuration from the command line

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command?
For example I would like to do something like this
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
A benefit of using this is that you'll see the actual, final values of configuration properties as they are used by Hadoop. This accounts for some of the more advanced configuration patterns, such as the use of XInclude in the XML files or property substitution, like this:
<property>
  <description>The address of the applications manager interface in the RM.</description>
  <name>yarn.resourcemanager.address</name>
  <value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to match the implementation exactly as it's done inside Hadoop, so it's better to ask Hadoop itself.
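As a small illustration of why (the hostname rm.example.com is made up), Hadoop's own Configuration class performs the ${...} expansion when a value is read, which is exactly what a hand-rolled XML parser would miss:
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.set("yarn.resourcemanager.hostname", "rm.example.com");
conf.set("yarn.resourcemanager.address", "${yarn.resourcemanager.hostname}:8032");
// get() expands the ${...} reference and prints rm.example.com:8032
System.out.println(conf.get("yarn.resourcemanager.address"));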
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation needing to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.
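If you want the same merged view from Java rather than from hdfs getconf, a small sketch like this (assuming the cluster's conf directory is on the classpath) pulls in the YARN resources explicitly instead of relying on that class-loading side effect:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// YarnConfiguration's static initializer adds yarn-default.xml and yarn-site.xml
// on top of the core-default.xml/core-site.xml that a plain Configuration loads.
Configuration conf = new YarnConfiguration();
System.out.println(conf.get("yarn.resourcemanager.address"));
System.out.println(conf.get("fs.defaultFS"));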

How to find installation mode of Hadoop 2.x

What is the quickest way of finding the installation mode of Hadoop 2.x?
I just want to learn the best way to find the mode when I log in for the first time to a machine with Hadoop installed.
In Hadoop 2, go to the /etc/hadoop/conf folder and check the fs.defaultFS property in core-site.xml and the yarn.resourcemanager.hostname property in yarn-site.xml. The values of those properties decide which mode you are running in.
fs.defaultFS
Standalone mode - file:///
Pseudo-distributed - hdfs://localhost:8020/
Fully distributed - hdfs://namenodehostname:8020/
yarn.resourcemanager.hostname
Standalone mode - not set (YARN is not used; mapreduce.framework.name defaults to local)
Pseudo-distributed - localhost (or the default 0.0.0.0)
Fully distributed - the ResourceManager's hostname
Alternatively, you can use the jps command to check the mode: if you see the NameNode, SecondaryNameNode and ResourceManager (or JobTracker in MR1) daemons running as separate processes, it is distributed; in standalone mode no Hadoop daemons run at all.
Similarly, for MR1 go to the /etc/hadoop/conf folder and check the fs.default.name property in core-site.xml and the mapred.job.tracker property in mapred-site.xml.
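A rough programmatic equivalent of the checks above, as a sketch (it is only a heuristic and assumes the Hadoop conf directory is on the classpath):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();            // picks up core-site.xml
URI defaultFs = FileSystem.getDefaultUri(conf);      // resolves fs.defaultFS / fs.default.name
if ("file".equals(defaultFs.getScheme())) {
    System.out.println("standalone (local) mode");
} else if ("localhost".equals(defaultFs.getHost())) {
    System.out.println("pseudo-distributed mode");
} else {
    System.out.println("fully distributed, namenode at " + defaultFs);
}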

Find port number where HDFS is listening

I want to access hdfs with fully qualified names such as :
hadoop fs -ls hdfs://machine-name:8020/user
I could also simply access hdfs with
hadoop fs -ls /user
However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing HDFS files with qualified names.
I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name. But this seems to differ across distributions. For example, the filesystem is maprfs on MapR, and IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf.
There doesn't seem to be a way for Hadoop to tell me what's defined in fs.default.name via its command-line options.
How can I get the value defined in fs.default.name reliably from the command line?
The test will always be running on the namenode, so the machine name is easy. But getting the port number (8020) is a bit difficult. I tried lsof, netstat... but still couldn't find a reliable way.
The command below is available from Apache Hadoop 2.7.0 onwards and can be used for getting the values of Hadoop configuration properties. fs.default.name is deprecated in Hadoop 2.0; fs.defaultFS is the updated name. Not sure whether this will work in the case of maprfs.
hdfs getconf -confKey fs.defaultFS # ( new property )
or
hdfs getconf -confKey fs.default.name # ( old property )
Not sure whether any command-line utilities are available for retrieving configuration property values in MapR or Hadoop 0.20 versions. In that situation you can try the same in Java to retrieve the value of a configuration property:
import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();          // loads core-site.xml from the classpath
System.out.println(conf.get("fs.default.name"));   // resolved via the fs.defaultFS deprecation mapping
fs.default.name is deprecated.
Use: hdfs getconf -confKey fs.defaultFS
I encountered this answer when I was looking for the HDFS URI. Generally that's a URL pointing to the namenode. While hdfs getconf -confKey fs.defaultFS gets me the name of the nameservice, it won't help me build the HDFS URI.
I tried the command below to get a list of the namenodes instead:
hdfs getconf -namenodes
This gave me a list of all the namenodes, primary first followed by secondary. After that, constructing the HDFS URI was simple:
hdfs://<primarynamenode>/
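The same URI is also available from the Java API, which reflects whatever fs.defaultFS is set to (hdfs://, maprfs://, ...); a sketch, assuming the conf directory is on the classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// getDefaultUri() simply reflects fs.defaultFS (or the deprecated fs.default.name),
// so it returns the scheme, host and port exactly as the distribution configured them.
System.out.println(FileSystem.getDefaultUri(conf));   // e.g. hdfs://<primarynamenode>:8020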
You can use:
hdfs getconf -confKey fs.default.name
Yes, hdfs getconf -namenodes will show list of namenodes.

PIG automatically connected with default HDFS, how?

I just started learning Hadoop and Pig (over the last two days!) for one of my future projects.
For experiments I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (map-reduce mode).
When I initialized Pig by typing the ./bin/pig command, it launched the Grunt command line and I got a message that Pig had connected to HDFS (localhost:9000); later I was able to access HDFS through Pig.
I was expecting to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is, where did Pig pick up the default HDFS configuration (localhost:9000) from? I checked pig.properties but didn't find anything there. I need this info as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how it will connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf dir path to the PIG_CLASSPATH environment variable. You can set HADOOP_CONF_DIR as well.
When you start the Grunt shell, Pig locates the directory of the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
For reference you may have a look at the Pig shell script to see how the environment variables are collected and evaluated.
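To illustrate the mechanism (the /etc/hadoop/conf path below is only an example), this is roughly what the lookup amounts to once the conf directory is reachable, whether via PIG_CLASSPATH, HADOOP_CONF_DIR or an explicit addResource():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// With PIG_CLASSPATH/HADOOP_CONF_DIR these files are found on the classpath;
// they are added explicitly here just for clarity (example path).
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
System.out.println(conf.get("fs.default.name"));     // e.g. hdfs://localhost:9000
System.out.println(conf.get("mapred.job.tracker"));  // JobTracker location (MR1)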
Pig can connect to the underlying HDFS in three ways:
1. Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2. Otherwise your HADOOP_CONF_DIR may already be set up; it points to the directory containing the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3. If neither of these is set up, you can also connect to the underlying HDFS by changing pig.properties, which is present under the PIG_HOME/conf dir.

Running multiple Hadoop instances on the same machine

I wish to run a second instance of Hadoop on a machine which already has an instance of Hadoop running. After untarring the Hadoop distribution, some config files need to be changed in the hadoop-version/conf directory. The Linux user will be the same for both instances. I have identified the following attributes, but I am not sure if this is good enough.
hdfs-site.xml : dfs.data.dir and dfs.name.dir
core-site.xml : fs.default.name and hadoop.tmp.dir
mapred-site.xml : mapred.job.tracker
I couldn't find the attribute names for the port numbers of the job tracker / task tracker / DFS web interfaces. Their default values are 50030, 50060 and 50070 respectively.
Are there any more attributes that need to be changed to ensure that the new hadoop instance is running in its own environment?
Look for ".address" in src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml, and you'll find plenty attributes defined there.
BTW, I had a box with firewall enabled, and I observed that the effective ports in default configuration are 50010, 50020, 50030, 50060, 50070, 50075 and 50090.
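As a sketch of which properties typically need overriding for a second Hadoop 1.x instance (the property names come from hdfs-default.xml and mapred-default.xml; the alternate ports and directories are arbitrary examples), written as Configuration calls only to keep the list compact; in practice the same names and values go into the second copy's core-site.xml, hdfs-site.xml and mapred-site.xml:
import org.apache.hadoop.conf.Configuration;

Configuration second = new Configuration(false);                  // don't inherit the first instance's files
second.set("fs.default.name", "hdfs://localhost:9010");           // core-site.xml
second.set("hadoop.tmp.dir", "/data/hadoop2/tmp");
second.set("dfs.name.dir", "/data/hadoop2/name");                 // hdfs-site.xml
second.set("dfs.data.dir", "/data/hadoop2/data");
second.set("dfs.http.address", "0.0.0.0:51070");                  // NameNode web UI, default 50070
second.set("dfs.secondary.http.address", "0.0.0.0:51090");        // default 50090
second.set("dfs.datanode.address", "0.0.0.0:51010");              // default 50010
second.set("dfs.datanode.ipc.address", "0.0.0.0:51020");          // default 50020
second.set("dfs.datanode.http.address", "0.0.0.0:51075");         // default 50075
second.set("mapred.job.tracker", "localhost:9011");               // mapred-site.xml
second.set("mapred.job.tracker.http.address", "0.0.0.0:51030");   // JobTracker web UI, default 50030
second.set("mapred.task.tracker.http.address", "0.0.0.0:51060");  // TaskTracker web UI, default 50060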
