Running multiple Hadoop instances on the same machine - hadoop

I wish to run a second instance of Hadoop on a machine that already has an instance of Hadoop running. After untarring the Hadoop distribution, some config files in the hadoop-version/conf directory need to be changed. The Linux user will be the same for both instances. I have identified the following attributes, but I am not sure if this is enough.
hdfs-site.xml : dfs.data.dir and dfs.name.dir
core-site.xml : fs.default.name and hadoop.tmp.dir
mapred-site.xml : mapred.job.tracker
I couldn't find the attribute names for the port numbers of the job tracker, task tracker and DFS web interfaces. Their default values are 50030, 50060 and 50070 respectively.
Are there any more attributes that need to be changed to ensure that the new Hadoop instance runs in its own environment?
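For illustration, a minimal sketch of how a second instance might override those properties; all paths and port numbers below are placeholders, not recommended values:
<!-- core-site.xml of the second instance (illustrative values) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9001</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop2/tmp</value>
</property>
<!-- hdfs-site.xml of the second instance -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop2/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop2/data</value>
</property>
<!-- mapred-site.xml of the second instance -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9002</value>
</property>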

Look for ".address" in src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml, and you'll find plenty of attributes defined there.
BTW, I had a box with the firewall enabled, and I observed that the effective ports in the default configuration are 50010, 50020, 50030, 50060, 50070, 50075 and 50090.
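As a rough sketch, shifting the second instance off those defaults might look like this; the replacement ports are arbitrary examples, not values taken from any particular setup:
<!-- hdfs-site.xml: move the HDFS daemons and web UIs to non-default ports (example values) -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50170</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50110</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50175</value>
</property>
<!-- mapred-site.xml: move the JobTracker/TaskTracker web UIs -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50130</value>
</property>
<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50160</value>
</property>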

Related

Regarding core-site.xml file entries with start-dfs.sh and MapReduce task - Hadoop

I am new to big data modules and am running Hadoop on Ubuntu.
For MapReduce jobs, the entry below from core-site.xml needs to be suppressed:
fs.default.name
hdfs://localhost:8020
start-dfs.sh does not execute with the above entry suppressed.
Kindly assist, and please let me know whether multiple core-site.xml files or entries are permitted.
fs.defaultFS is the preferred property over the deprecated fs.default.name. One of them is required, and they cannot be "suppressed".
If you define multiple matching properties in the XML, only one will be used.
You can't have multiple files with the same name in the same hadoop config directory, anyway. This includes "core-site.xml"
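For reference, a typical single-node core-site.xml carries exactly one such entry, along the lines of:
<!-- core-site.xml: the default filesystem URI (fs.defaultFS supersedes fs.default.name) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>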

How to change java.io.tmpdir for a Spark job running on YARN

How can I change java.io.tmpdir folder for my Hadoop 3 Cluster running on YARN?
By default it gets set to something like /tmp/***, but my /tmp filesystem is too small for everything the YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of What should be hadoop.tmp.dir?. Also, go through all the .conf files in /etc/hadoop/conf and search for tmp, to see if anything is hardcoded. Also specify:
Whether you see (any) files getting created at what you specified as hadoop.tmp.dir.
What pattern of files is being formed at /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also have a look at hive-site.xml. Similar for any other ecosystem product you are using.
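If you do want hadoop.tmp.dir set explicitly, it belongs in core-site.xml; the path below is only an example:
<!-- core-site.xml: base directory for Hadoop's temporary files (example path) -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>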
I have configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp filesystem and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for the Spark executors was also set to the directories defined in the yarn.nodemanager.local-dirs property.
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/somepath1,/anotherpath2</value>
</property>

How to check the port number of the hadoop services

How can I check the port numbers of the Hadoop services, e.g. the port numbers for Hive, Oozie, Sqoop, Pig, etc.? I heard each Hadoop service has a port number.
Normally the ports are configured in the configuration files themselves, available under either "/etc/hadoop/conf/" or "/usr/local/hadoop/conf/" for Hadoop, and under similarly named locations for "pig/hive/sqoop", etc.
The configuration files are named "hdfs-site.xml", "core-site.xml", "hive-site.xml", "mapred-site.xml", etc.
Some of the default ports used by Hadoop and its ecosystem are:
Daemon                   Default Port   Configuration Parameter
Namenode                 50070          dfs.http.address
Datanodes                50075          dfs.datanode.http.address
Secondarynamenode        50090          dfs.secondary.http.address
Backup/Checkpoint node   50105          dfs.backup.http.address
Jobtracker               50030          mapred.job.tracker.http.address
Tasktrackers             50060          mapred.task.tracker.http.address
Also check the reference: MORE DETAIL PORTS
You can also take advantage of the Cloudera Distribution of Hadoop port numbers for each component: Ports Used by Components of CDH 5
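The same pattern applies to the ecosystem components; for example, the port HiveServer2 listens on is set in hive-site.xml (10000 is the usual default):
<!-- hive-site.xml: port for HiveServer2's Thrift service -->
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>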

How to find installation mode of Hadoop 2.x

What is the quickest way of finding the installation mode of Hadoop 2.x?
I just want to learn the best way to find the mode when I log in for the first time to a machine with Hadoop installed.
In Hadoop 2, go to the /etc/hadoop/conf folder and check the fs.defaultFS property in core-site.xml and the yarn.resourcemanager.hostname property in yarn-site.xml. The values of those properties decide which mode you are running in.
fs.defaultFS
Standalone mode - file:///
Pseudo distributed - hdfs://localhost:8020/
Fully distributed - hdfs://namenodehostname:8020/
yarn.resourcemanager.hostname
Standalone mode - not set (no YARN daemons run)
Pseudo distributed - localhost
Fully distributed - resourcemanagerhostname
Alternatively, you can use the jps command to check the mode. If you see the namenode, secondary namenode and resourcemanager (or jobtracker) daemons running as separate processes, then it is distributed.
Similarly, for MR1, go to the /etc/hadoop/conf folder and check fs.default.name in core-site.xml and the mapred.job.tracker property in mapred-site.xml.
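For example, a pseudo-distributed Hadoop 2 node would typically carry values like these (hostname and port are illustrative):
<!-- core-site.xml on a pseudo-distributed Hadoop 2 node -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>
<!-- yarn-site.xml on the same node -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>localhost</value>
</property>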

Hadoop conf "fs.default.name" can't be set in ip:port format directly?

Hi all,
I have set up a Hadoop cluster in fully distributed mode. First, I set core-site.xml "fs.default.name" and mapred-site.xml "mapred.job.tracker" in hostname:port format and changed /etc/hosts correspondingly; the cluster works successfully.
Then I tried another way: I set core-site.xml "fs.default.name" and mapred-site.xml "mapred.job.tracker" in ip:port format. It doesn't work.
I find
ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
in the namenode log file and
ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
java.net.UnknownHostException: slave01: slave01: Name or service not known
in the datanode log file.
In my opinion, IP and hostname are equivalent. Is there something wrong in my Hadoop conf?
Maybe there is a wrongly configured hostname in /etc;
you should check hostname, /etc/hosts, /etc/HOSTNAME (RHEL/Debian) or rc.conf (Arch Linux), etc.
I got your point. This is probably because you wrote hdfs://ip:port in mapred-site.xml (it starts with hdfs, which is wrong), but when you wrote hostname:port you did not write hdfs at the beginning of the value, which is the correct way. Therefore, the ip:port configuration did not work, but the hostname:port one worked.
Fatih Haltas, I found the answer here.
It seems that HDFS uses hostnames only for all of its communication and display purposes, so we can NOT use an IP directly in core-site.xml and mapred-site.xml.
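A hostname-based configuration along those lines might look like this, with the name resolving through /etc/hosts on every node (hostname and ports are illustrative):
<!-- core-site.xml: use a resolvable hostname, not a raw IP -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
<!-- mapred-site.xml: plain host:port, no hdfs:// scheme here -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
<!-- /etc/hosts on every node then maps the name, e.g.: 192.168.1.10  master -->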
