How to find installation mode of Hadoop 2.x - hadoop

What is the quickest way to find the installation mode of Hadoop 2.x?
I just want to learn the best way to find the mode when I log in for the first time to a machine that has Hadoop installed.

In Hadoop 2, go to the /etc/hadoop/conf folder and check the fs.defaultFS property in core-site.xml and the yarn.resourcemanager.hostname property in yarn-site.xml. The values of those properties tell you which mode you are running in.
fs.defaultFS
Standalone mode - file:///
Pseudo-distributed - hdfs://localhost:8020/
Fully distributed - hdfs://namenodehostname:8020/
yarn.resourcemanager.hostname (a plain hostname, not a URI)
Standalone mode - not set (the default is 0.0.0.0)
Pseudo-distributed - localhost, or left at the default
Fully distributed - resourcemanagerhostname
Alternatively, you can use the jps command to check the mode. If you see the NameNode, SecondaryNameNode and ResourceManager (JobTracker in MR1) daemons running as separate processes, then it is distributed.
Similarly, in MR1, go to the /etc/hadoop/conf folder and check the fs.default.name property in core-site.xml and the mapred.job.tracker property in mapred-site.xml.
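The check on the file system URI described above can be scripted. The following is only a rough sketch: the function names `read_property` and `guess_mode` are made up for illustration, the "localhost means pseudo-distributed" rule is a heuristic, and the parser returns the first matching property without modelling includes or variable substitution.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hosts treated as "this machine" (an assumption of this sketch).
LOCAL_HOSTS = {"localhost", "127.0.0.1"}

def read_property(xml_text, name):
    """Return the value of the first <property> with this name, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None

def guess_mode(core_site_xml):
    """Guess the installation mode from the contents of core-site.xml."""
    value = read_property(core_site_xml, "fs.defaultFS")
    if value is None:
        # Fall back to the deprecated MR1-era key.
        value = read_property(core_site_xml, "fs.default.name")
    # An unset property falls back to the local file system (file:///).
    if value is None or urlparse(value).scheme == "file":
        return "standalone"
    if urlparse(value).hostname in LOCAL_HOSTS:
        return "pseudo-distributed"
    return "fully distributed"
```

For example, `guess_mode(open("/etc/hadoop/conf/core-site.xml").read())` would classify the machine you are logged in to, assuming the configuration lives at that path.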

Related

Regarding core-site.xml file entries with start-dfs.sh and MapReduce tasks - Hadoop

I am new to big data modules and am running Hadoop on Ubuntu.
For MapReduce jobs, the entry below from core-site.xml needs to be suppressed:
fs.default.name
hdfs://localhost:8020
start-dfs.sh does not execute with the above entry suppressed.
Kindly assist, and let me know whether multiple core-site.xml files or entries are permitted.
fs.defaultFS is the preferred property over the deprecated fs.default.name. One of them is required, and they cannot be "suppressed".
If you define multiple matching properties in the XML, only one will be used.
You can't have multiple files with the same name in the same Hadoop config directory anyway. This includes core-site.xml.
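Since only one of several matching definitions takes effect, it can help to spot duplicates before starting the daemons. This is a small sketch with a hypothetical helper name, `duplicate_properties`; it only reports names that occur more than once and makes no claim about which definition Hadoop ends up using.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_properties(xml_text):
    """Return the sorted property names defined more than once in the file."""
    root = ET.fromstring(xml_text)
    names = [(p.findtext("name") or "").strip() for p in root.iter("property")]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if name and count > 1)
```

Running it on the text of core-site.xml returns an empty list when every property is defined exactly once.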

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command?
For example I would like to do something like this
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
A benefit of using this is that you'll see the actual, final results of any configuration properties as they are actually used by Hadoop. This would account for some of the more advanced configuration patterns, such as use of XInclude in the XML files or property substitutions, like this:
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to accurately match the implementation as it's done inside Hadoop, so it's better to ask Hadoop itself.
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation needing to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.
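If you want to call this from a script rather than a shell, a thin wrapper around the same `hdfs getconf -confKey` command shown above is enough. This is a minimal sketch: `getconf_command` and `get_conf` are illustrative names, and it assumes the `hdfs` executable is on the PATH of the user running the script.

```python
import subprocess

def getconf_command(key):
    """Build the argument list for `hdfs getconf -confKey <key>`."""
    return ["hdfs", "getconf", "-confKey", key]

def get_conf(key, run=subprocess.run):
    """Ask Hadoop for the effective value of one configuration key.

    `run` is injectable so the function can be exercised without a cluster.
    """
    result = run(getconf_command(key), capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For the original question, `get_conf("yarn.scheduler.maximum-allocation-mb")` would return the value as a string; with `check=True`, a non-zero exit status from the command raises `subprocess.CalledProcessError`.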

where is the hadoop task manager UI

I installed the hadoop 2.2 system on my ubuntu box using this tutorial
http://codesfusion.blogspot.com/2013/11/hadoop-2x-core-hdfs-and-yarn-components.html
Everything worked fine for me and now when I do
http://localhost:50070
I can see the management UI for HDFS. Very good!!
But then I am going through another tutorial which tells me that there must be a task manager UI running at http://mymachine.com:50030 and http://mymachine.com:50060
On my machine I cannot open these ports.
I have already done
start-dfs.sh
start-yarn.sh
start-all.sh
Is something wrong? Why can't I see the task manager UI?
You have installed YARN (MRv2) which runs the ResourceManager. The URL http://mymachine.com:50030 is the web address for the JobTracker daemon that comes with MRv1 and hence you are not able to see it.
To see the ResourceManager UI, check your yarn-site.xml file for the following property:
yarn.resourcemanager.webapp.address
By default, it should point to: resource_manager_hostname:8088
Assuming your ResourceManager runs on mymachine, you should see the ResourceManager UI at http://mymachine.com:8088/
Make sure all your daemons are up and running before you visit the URL for the ResourceManager.
For Hadoop 2 (aka YARN/MRv2): on any Hadoop installation at version 2.x or higher, it is at port 8088, e.g. localhost:8088
For Hadoop 1: on any Hadoop installation at a version lower than 2.x (e.g. 1.x or 0.x), it is at port 50030, e.g. localhost:50030
By default, the HDFS web UI is located at:
http://mymachine.com:50070
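To see quickly which of these web UIs is actually listening, you can probe the default ports mentioned in this thread. The sketch below is illustrative only: `WEB_UI_PORTS`, `port_is_open` and `reachable_uis` are made-up names, and the port numbers are the defaults, so a cluster with customised addresses will not match.

```python
import socket

# Default web UI ports mentioned in this thread (assumed unchanged).
WEB_UI_PORTS = {
    8088: "ResourceManager (YARN / MRv2)",
    50070: "HDFS NameNode",
    50030: "JobTracker (MRv1)",
    50060: "TaskTracker (MRv1)",
}

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def reachable_uis(host):
    """Map each default UI port that accepts connections to its daemon name."""
    return {port: name for port, name in WEB_UI_PORTS.items()
            if port_is_open(host, port)}
```

On a YARN installation like the one in the question, `reachable_uis("localhost")` would be expected to list 8088 and 50070 but not 50030 or 50060.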

Cloudera Manager - dfs.datanode.du.reserved not working

I have set the dfs.datanode.du.reserved property to 10 GB using Cloudera Manager. But when I check the map-reduce job.xml file, I find dfs.datanode.du.reserved is still set to 0. How do I verify whether the property is set?
PS: I am using Cloudera Standard 4.7.2 with CDH 4.4.0
This flag is set in hdfs-site.xml, not in mapred-site.xml.
You will not be able to see this flag in the client configuration (/etc/hadoop/conf/hdfs-site.xml) without tweaking the configuration.
It is only set in the datanode configuration that is regenerated by Cloudera Manager. This configuration can be found in /var/run/cloudera-scm-agent/process/XXXXXX-hdfs-DATANODE/hdfs-site.xml, where XXXXXX is an incrementing number of some kind (used by Cloudera Manager).
From within Cloudera Manager you can see this flag on Datanode (), click Processes, then Configuration files/Environment - Show and then you find the hdfs-site.xml for a datanode.
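If you prefer the command line, the generated datanode configuration can be located with a short script. This is a sketch under one assumption that the answer does not confirm: that the directory with the highest numeric prefix is the most recently generated one. The function name `latest_datanode_conf` is hypothetical.

```python
import glob
import os
import re

def latest_datanode_conf(process_root="/var/run/cloudera-scm-agent/process"):
    """Return the hdfs-site.xml path under the highest-numbered
    <number>-hdfs-DATANODE directory, or None if there is none."""
    candidates = []
    for path in glob.glob(os.path.join(process_root, "*-hdfs-DATANODE")):
        match = re.match(r"(\d+)-hdfs-DATANODE$", os.path.basename(path))
        if match:
            # Assumption: a larger number means a newer configuration.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return os.path.join(max(candidates)[1], "hdfs-site.xml")
```

You can then open the returned file and look for dfs.datanode.du.reserved. Reading that directory typically needs elevated privileges.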

Running multiple hadoop instances on same machine

I wish to run a second instance of Hadoop on a machine which already has an instance of Hadoop running. After untarring the Hadoop distribution, some config files need to be changed in the hadoop-version/conf directory. The Linux user will be the same for both instances. I have identified the following attributes, but I am not sure if this is enough.
hdfs-site.xml : dfs.data.dir and dfs.name.dir
core-site.xml : fs.default.name and hadoop.tmp.dir
mapred-site.xml : mapred.job.tracker
I couldn't find the attribute names for the port numbers of the job tracker, task tracker and DFS web interfaces. Their default values are 50030, 50060 and 50070 respectively.
Are there any more attributes that need to be changed to ensure that the new hadoop instance is running in its own environment?
Look for ".address" in src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml, and you'll find plenty of attributes defined there.
BTW, I had a box with firewall enabled, and I observed that the effective ports in default configuration are 50010, 50020, 50030, 50060, 50070, 50075 and 50090.
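The search for ".address" can also be done programmatically, which gives you a checklist of properties to override for the second instance. This is a minimal sketch; `address_properties` is an illustrative name, and it expects the text of one of the *-default.xml files mentioned above.

```python
import xml.etree.ElementTree as ET

def address_properties(xml_text):
    """Return {name: value} for every property whose name contains '.address'."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if ".address" in name:
            found[name] = (prop.findtext("value") or "").strip()
    return found
```

Each entry in the result is a host:port pair that the second instance would need to move to a free port in its own *-site.xml.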
