Unable to start a node manager on master - hadoop

I am setting up a Hadoop YARN cluster and I am using one machine as both a master and a slave. When I start YARN with the following command, it starts the NodeManager on the slaves but not on the master node.
sbin/yarn-daemons.sh start nodemanager
I have a master that also acts as a slave, plus another two slaves in the cluster; the NodeManagers on those slaves start properly.
The error I get:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
Output of some of the commands:
cat /etc/services | grep 8040
ampify 8040/tcp # Ampify Messaging Protocol
ampify 8040/udp # Ampify Messaging Protocol
lsof -i tcp:8040
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 28021 df 195u IPv6 3580602 0t0 TCP server1.mydomain.com:ampify (LISTEN)

Under Hadoop's default configuration, port 8040 is the port the NodeManager uses for the localizer. This is essentially a server endpoint responsible for bringing the files required to run a container onto the local node (for example, a MapReduce job's jar file or distributed cache files).
Assuming there is another server on the machine (here shown as Ampify) legitimately bound to port 8040, and you don't want to stop that service, you can reconfigure the port the NodeManager uses for the localizer. Set the yarn.nodemanager.localizer.address property in your yarn-site.xml file. It is documented here:
http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Pulling that from the XML source in the Hadoop tree, here is the documentation for the property:
<property>
  <description>Address where the localizer IPC is.</description>
  <name>yarn.nodemanager.localizer.address</name>
  <value>${yarn.nodemanager.hostname}:8040</value>
</property>
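For example, if you want to keep Ampify on 8040, you could move the localizer to a free port. This is a minimal sketch assuming port 8041 is available on your machine (pick any unused port), added to yarn-site.xml on the affected node:

<property>
  <name>yarn.nodemanager.localizer.address</name>
  <!-- 8041 is an assumed free port; adjust to your environment -->
  <value>${yarn.nodemanager.hostname}:8041</value>
</property>

After changing this, restart the NodeManager on that node so it picks up the new binding.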

The above error means you are trying to start a process on port 8040, which is already occupied by another instance.
To get rid of this error, you need to kill the process that is currently listening on port 8040. Your lsof output says the PID is 28021. Kill the process with the following command and start again:
kill -9 28021

Related

Spark history server does not start on Ambari cluster

We start the Spark history server as follows:
/usr/hdp/2.6.0.3-8/spark2/sbin/start-history-server.sh
From the log file
spark-root-org.apache.spark.deploy.history.HistoryServer-1-master01
we get:
WARN AbstractLifeCycle: FAILED ServerConnector#14a54ef6{HTTP/1.1}{0.0.0.0:18081}: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Please advise on a solution so that the Spark history server can start.
You need to kill the process (like a zombie History Server) that already has the port open, or change the port in Ambari to something else.
A combination of netstat -an, ps -ef, and lsof will help you find which process holds the port.
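As a rough sketch (assuming the history server port 18081 shown in your log), the following commands can help identify the process holding the port:

# show whether anything is listening on the Spark history server port
netstat -an | grep 18081
# map the port to a PID and process name (may need root)
sudo lsof -i tcp:18081
# inspect the candidate process before killing it
ps -ef | grep HistoryServer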

Cannot connect slave1:8088 in hadoop 2.7.2

I am new to Hadoop and I have installed Hadoop 2.7.2 on two machines, master and slave1. I have followed this tutorial. It was not mentioned in the tutorial, but I have also edited the JAVA_HOME and HADOOP_CONF_DIR variables in hadoop-env.sh. In the end I have Hadoop installed on both machines. On the master, NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are running, and on slave1, DataNode and NodeManager are running.
I am able to go to master:8088 in the browser, and when I go to http://master:8088/cluster/nodes, only the master node is listed there. I am not able to go to isci17:8088 and it is not a live node. Why could that be?
Port 8088 is the ResourceManager web UI port, so if the ResourceManager is running on the master you probably won't have it on the slave.
You should also be able to go to the NameNode web UI on port 50070 to see its status (for example http://master:50070/), and to the MapReduce JobHistory Server web UI at http://hostname:19888/.
If you have access to a terminal session, run the following command on each server as a root/sudo user to see which ports are listening on which server:
sudo lsof -i tcp | grep -i LISTEN
You can also run Hadoop CLI commands that will give you info; for example, you can run the following in a terminal session to check Hadoop's ports:
hdfs portmap
Other health checks on the command line:
hdfs classpath
hdfs getconf -namenodes
hdfs dfsadmin -report -live
hdfs dfsadmin -report -dead
hdfs dfsadmin -printTopology
Depending on whether the Hadoop CLI command is on your PATH, you might have to find the executable and run ./hdfs. Also, depending on your distro/version, you might have to replace the hdfs command with the hadoop command.
If you want to see your cluster configuration, check your /etc/hadoop/conf folder along with /etc/hadoop/hive. You will find about 5-10 *-site.xml files. These configuration files contain your cluster's configuration with the hostnames and ports.
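For example (assuming the /etc/hadoop/conf layout mentioned above; adjust the path to your distro), you can quickly scan those files for the hostnames and ports the cluster is actually configured with:

# list the site files and pull out address/port style properties
ls /etc/hadoop/conf/*-site.xml
grep -A1 -E "address|defaultFS" /etc/hadoop/conf/*-site.xml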

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application:
could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
I've tried the top-rated answer here around reformatting but this doesn't work for me: HDFS error: could only be replicated to 0 nodes, instead of 1
What is happening is this:
My application consists of 2 threads, each configured with its own Spring Data PartitionTextFileWriter
Thread 1 is the first to process data and this can successfully write to HDFS
However, once Thread 2 starts to process data, I get this error when it attempts to flush to a file
Thread 1 and 2 will not be writing to the same file, although they do share a parent directory at the root of my directory tree.
There are no problems with disk space on my server.
I also see this in my NameNode logs, but I'm not sure what it means:
2016-03-15 11:23:12,149 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2016-03-15 11:23:12,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.104.247.78:52004 Call#61 Retry#0
java.io.IOException: File /metrics/abc/myfile could only be replicated to 0 nodes instead of [2016-03-15 13:34:16,663] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.GroupMetadataManager)
What could be the cause of this error?
Thanks
This error is caused by the block replication system of HDFS, which could not manage to make any copies of a specific block of the file being written. Common reasons for that:
Only a NameNode instance is running, and it's not in safe mode
There are no DataNode instances up and running, or some are dead. (Check the servers.)
NameNode and DataNode instances are both running, but they cannot communicate with each other, i.e. there is a connectivity issue between the DataNode and NameNode instances.
Running DataNode instances are not able to talk to the server because of networking or Hadoop-related issues (check the logs that include DataNode info).
There is no disk space in the data directories configured for the DataNode instances, or the DataNode instances have run out of space. (Check dfs.data.dir and delete old files if any.)
The reserved space specified for DataNode instances in dfs.datanode.du.reserved is larger than the free space, which makes the DataNode instances conclude there is not enough free space.
There are not enough threads for the DataNode instances (check the DataNode logs and the dfs.datanode.handler.count value).
Make sure dfs.data.transfer.protection is not equal to "authentication" and dfs.encrypt.data.transfer is equal to true.
Also, please do the following (a quick command sketch follows this list):
Verify the status of the NameNode and DataNode services and check the related logs
Verify that core-site.xml has the correct fs.defaultFS value and that hdfs-site.xml has valid values.
Verify that hdfs-site.xml has dfs.namenode.http-address.. set for all NameNode instances in case of a PHD HA configuration.
Verify that the permissions on the directories are correct
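A minimal command-line sketch for those checks (assuming the hdfs CLI is on your PATH):

# how many DataNodes the NameNode can actually see, and their capacity
hdfs dfsadmin -report
# make sure the NameNode is not stuck in safe mode
hdfs dfsadmin -safemode get
# confirm which fs.defaultFS the client side is really picking up
hdfs getconf -confKey fs.defaultFS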
Ref: https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
Ref: https://support.pivotal.io/hc/en-us/articles/201846688-HDFS-reports-Configured-Capacity-0-0-B-for-datanode
Also, please check: Writing to HDFS from Java, getting "could only be replicated to 0 nodes instead of minReplication"
Another reason could be that your DataNode machine hasn't exposed its port (50010 by default). In my case, I was trying to write a file from Machine1 to HDFS running in a Docker container C1 which was hosted on Machine2.
For the host machine to forward requests to the services running in the container, port forwarding needs to be set up. I could resolve the issue after forwarding port 50010 from the host machine to the guest machine.
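For reference, a rough sketch of publishing that port when starting the DataNode container (the image name and the extra HTTP port below are placeholders, not from the original setup):

# expose the DataNode data-transfer port (50010 by default in Hadoop 2.x)
docker run -d \
  -p 50010:50010 \
  -p 50075:50075 \
  --name hdfs-datanode \
  your-hadoop-datanode-image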
Check whether the jps command on the computers that run the DataNodes shows that the DataNodes are running. If they are running and you still get this error, it means they could not connect to the NameNode, and hence the NameNode thinks there are no DataNodes in the Hadoop system.
In such a case, after running start-dfs.sh, run netstat -ntlp on the master node. 9000 is the port number most tutorials tell you to specify in core-site.xml. So if you see a line like this in the output of netstat:
tcp 0 0 127.0.1.1:9000 0.0.0.0:* LISTEN 4209/java
then you have a problem with the host alias. I had the same problem, so I'll describe how it was resolved.
This is the content of my core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://vm-sm:9000</value>
  </property>
</configuration>
So the vm-sm alias on the master computer maps to 127.0.1.1. This is because of the setup of my /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vm-sm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
It looks like the core-site.xml of the master system mapped to 127.0.1.1:9000, while the worker nodes were trying to connect through 192.168.1.1:9000.
So I had to change the alias of the master node for the Hadoop system (I just removed the hyphen) in the /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vmsm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
and reflected the change in the core-site.xml, mapred-site.xml, and slaves files (wherever the old alias of the master occurred).
After deleting the old HDFS files from the Hadoop location as well as the tmp folder and restarting all nodes, the issue was solved.
Now, netstat -ntlp after starting DFS returns:
tcp 0 0 192.168.1.1:9000 0.0.0.0:* LISTEN ...
...
I had the same error; restarting the HDFS services (i.e. the NameNode and DataNode services) solved the issue.
In my case the storage policy of the output path was set to COLD.
To check the storage policy of your folder:
hdfs storagepolicies -getStoragePolicy -path my_path
In my case it returned:
The storage policy of my_path
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
I dumped the data elsewhere (to HOT storage) and the issue went away.
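Alternatively (untested for this exact case), you could try changing the policy on the path itself instead of moving the data:

# switch the path back to the default HOT (DISK) policy
hdfs storagepolicies -setStoragePolicy -path my_path -policy HOT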
You may leave HDFS safe mode:
hdfs dfsadmin -safemode forceExit
I had a similar issue recently. As my DataNodes (only) had SSDs for storage, I put [SSD]file:///path/to/data/dir in the dfs.datanode.data.dir configuration. Because the logs contained unavailableStorages=[DISK], I removed the [SSD] tag, which solved the problem.
Apparently, Hadoop uses [DISK] as the default storage type and does not 'fall back' (or rather 'fall up') to using SSD if no [DISK]-tagged storage location is available. I could not find any documentation on this behaviour though.
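For illustration, here is a sketch of the dfs.datanode.data.dir change described above (the path is the placeholder from the answer, not a real location):

<!-- before: only an [SSD]-tagged directory, which the HOT/[DISK] policy cannot use -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file:///path/to/data/dir</value>
</property>

<!-- after: drop the tag so the directory counts as the default DISK storage type -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///path/to/data/dir</value>
</property>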
I too had the same error; then I changed the block size, and this resolved the problem.
In my case the problem was Hadoop's temporary files.
The logs were showing the following error:
2019-02-27 13:52:01,079 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /tmp/hadoop-i843484/dfs/data/in_use.lock acquired by nodename 28111#slel00681841a
2019-02-27 13:52:01,087 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /tmp/hadoop-i843484/dfs/data: namenode clusterID = CID-38b0104b-d3d2-4088-9a54-44b71b452006; datanode clusterID = CID-8e121bbb-5a08-4085-9817-b2040cd399e1
I solved it by removing the Hadoop tmp files:
sudo rm -r /tmp/hadoop-*
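To keep this from recurring when /tmp gets cleaned, you can point Hadoop's storage directories somewhere persistent. A sketch assuming /var/hadoop as the target (any persistent path works):

<!-- core-site.xml: move the base temp directory out of /tmp -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/hadoop/tmp</value>
</property>

<!-- hdfs-site.xml: explicit NameNode and DataNode directories -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///var/hadoop/dfs/data</value>
</property>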
I got this error because the DataNode was not running. To resolve this on a VM I:
Removed the NameNode/DataNode directories
Re-created the directories
Formatted the NameNode and DataNode (not strictly required): hadoop namenode -format
Restarted the service: start-dfs.sh
Now jps shows both the NameNode and DataNode, and the Sqoop job worked successfully.
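A rough sketch of those steps on the command line (the directory paths are placeholders; adjust to your dfs.namenode.name.dir / dfs.datanode.data.dir settings, and note that formatting erases existing HDFS data):

stop-dfs.sh
# remove and recreate the NameNode and DataNode directories (placeholder paths)
rm -rf /var/hadoop/dfs/name /var/hadoop/dfs/data
mkdir -p /var/hadoop/dfs/name /var/hadoop/dfs/data
# reformat (destroys existing HDFS metadata)
hadoop namenode -format
start-dfs.sh
jps   # should now list both NameNode and DataNode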
Maybe the number of your DataNodes is too small (fewer than 3). I put 3 IP addresses in hadoop/etc/hadoop/slaves, and it worked!
1. Check your firewall status; you can simply stop the firewall on both master and slaves: systemctl stop firewalld. This fixed my problem.
2. Delete the NameNode and reformat it: delete both the NameNode dir and the DataNode dir (as my slave computer didn't shut down normally, my DataNode was broken), then call hdfs namenode -format.
3. Call jps on both master and slaves; make sure the master has a NameNode and the slaves have DataNodes.

Need to Install Mesos to get Mesos Slave?

I'm trying to get this question solved:
To get a Mesos slave, do we have to install Mesos and then start the Mesos slave, or is there another way?
I also have a problem with the Mesos master, which I run with the command
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos
but it does not continue to run, so I stop it. When I run the same command again, I get the error shown:
Failed to initialize, bind: Address already in use [98]
Which part did I do wrong?
You have to run mesos-master first, and then you can connect a Mesos slave running on a different node to the master. You can refer to the Mesos getting started guide. Only one slave can connect to the master on a given port. If you get "bind: Address already in use", you can try running the slave on another port by passing the --port=value parameter, replacing value with a port number.
To start the Mesos master on localhost:
./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
To start a slave and connect it to the master:
./bin/mesos-slave.sh --master=127.0.0.1:5050
To start and connect another slave to the same master, you have to use another port, as the default port 5051 is already used by the first connected slave. Use the --port argument to start the slave on another port:
./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You may get a permission denied error; if so, use sudo to access the given port:
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You can run one more slave, but you have to specify an IP and a different work_dir:
./mesos-slave.sh --master=<ipaddr>:<port> --ip=<ip of slave> --work_dir=<work_dir other than that of a running slave> --port=<another_port>
Edit your /etc/hosts and add more local IPs with the following entries:
127.0.0.2 slave2
127.0.0.3 slave3
Then you can replace --ip=<ip of slave> with --ip=slave2 or --ip=slave3.
You may have to replace <another_port> with a port like 5052 or 5053, or any other available port, if you already have a running slave, since the first slave will be using the default port.
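Putting that together, a sketch of starting a second slave next to an already-running one (the IP alias, work_dir, and port below are example values based on the /etc/hosts entries above, not from the original question):

./bin/mesos-slave.sh --master=127.0.0.1:5050 --ip=127.0.0.2 --work_dir=/var/lib/mesos2 --port=5052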
To run only a mesos-slave on a host, simply install the mesos package and run only the mesos-slave process with the correct flags; it's not a problem if the master is also installed, but be careful to run only as many masters as the quorum requires.
Something is already running on the port where you are trying to start the mesos-master (which has a web interface).
Check what program is running on the Mesos default port, or use another port; more info about the command-line options is available here: Mesos configuration
To see what's using port 5050 or 5051, use either of these commands:
sudo fuser -v 5050/tcp
sudo lsof -i | grep 5050
Both commands will give you the PID of the process that holds the port. Either kill it or specify a new port for Mesos by starting it with the correct port option:
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos --port=FREE_PORT
Where do you specify the ZooKeepers for the Mesos master and slaves? The following flags are required to start mesos-master (see the link I gave you):
--advertise_ip, --advertise_port, --quorum, --work_dir, --zk
What is your current full configuration for the Mesos master? You can find the related files under /etc/mesos/, /etc/mesos-master/, /etc/mesos-slave/, /etc/defaults/mesos, /etc/defaults/mesos-master, /etc/defaults/mesos-slave. If you copy-paste the lines from them and the Mesos log here, we can give you more help.
Also, please explain the cluster you would like to set up (number of hosts, masters, slaves) and we can help there as well.
Execute the command below:
sudo netstat -peanut
Then check which process is using port 5050 or 5051.
Kill those processes using their PIDs.
Start the Mesos master and slave again.
This happened to me when I killed the Mesos slave accidentally and then restarted it, but it failed with an address-bind issue.

org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

I am trying to set up a multinode HBase cluster. When I do jps on the slave I get:
5780 Jps
5558 HQuorumPeer
5684 HRegionServer
1963 DataNode
2093 TaskTracker
Similarly, on the master I get:
4254 SecondaryNameNode
15226 Jps
14982 HMaster
3907 NameNode
14921 HQuorumPeer
4340 JobTracker
Everything is running properly. But when I try to create a table in the HBase shell, it gives an error:
ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
The RegionServer log on my slave (where the RegionServer is running):
2013-06-11 13:09:53,119 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
2013-06-11 13:10:53,190 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:60000
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
2013-06-11 13:10:53,391 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
FYI, I have also taken care of the /etc/hosts file on both master and slave:
127.0.0.1 localhost
127.0.0.1 naresh-PC
I then changed the /etc/hosts entry for naresh-PC to 127.0.1.1, but I am still getting this error:
2013-06-11 14:51:17,781 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at naresh-pc,60000,137094$
2013-06-11 14:52:17,817 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.UnknownHostException: unknown host: naresh-pc
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.<init>(HBaseClient.java:276)
at org.apache.hadoop.hbase.ipc.HBaseClient.createConnection(HBaseClient.java:255)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1111)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
Try clearing all the state in ZooKeeper:
Stop Zookeeper
Wipe the Zookeeper data directory
Start Zookeeper
I was getting the same issue and followed this approach and it worked fine.
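A rough sketch of those steps for a standalone ZooKeeper (the data directory path is a placeholder; if HBase manages ZooKeeper for you, i.e. jps shows HQuorumPeer, stop HBase first and wipe the directory configured as hbase.zookeeper.property.dataDir instead):

# stop HBase so nothing is talking to ZooKeeper
stop-hbase.sh
# stop ZooKeeper, wipe its data directory (placeholder path), start it again
zkServer.sh stop
rm -rf /var/lib/zookeeper/*
zkServer.sh start
start-hbase.sh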
You need to change the configuration on the slave node to point at the master. It is currently pointing to localhost and not connecting to the actual master:
"org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This
server is in the failed servers list: localhost/127.0.0.1:60000 at "
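As a sketch, the slave's hbase-site.xml should reference the master host rather than localhost; the hostname below is taken from the question and the HDFS port 9000 is an assumption, so adjust both to your setup:

<property>
  <name>hbase.zookeeper.quorum</name>
  <!-- the host running HMaster/HQuorumPeer, not localhost -->
  <value>naresh-pc</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <!-- must match the NameNode address in core-site.xml (port assumed) -->
  <value>hdfs://naresh-pc:9000/hbase</value>
</property>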
I'm hosting my own cluster inside Docker. Here's what worked in my case: I grepped the HBase log file for errors and found "Master passed us a different hostname to use":
[root@docker-iop bin]# grep ERROR /var/log/hbase/hbase-hbase-regionserver-bi-mgmt01.local.log
2016-10-06 00:05:29,816 ERROR [regionserver/bi-mgmt01.local/111.11.2.3:16020] regionserver.HRegionServer: Master passed us a different hostname to use; was=my-host-name, but now=111.22.33.444
I mapped my-host-name to 111.22.33.444 in my hosts file, restarted HBase, and it worked.
I also had the same issue with a fully distributed HBase cluster with the configuration below.
Master Node (Node-A)
Backup Masters ($HBASE_HOME/conf/backup-masters) (Node-B & Node-C)
3 Replication servers (Node-A, Node-B & Node-C)
RCA:
The backup master nodes were attempting to start when the cluster started.
Solution
I removed the backup masters by making $HBASE_HOME/conf/backup-masters empty on all HBase nodes.
So I had a cluster running without backup masters.
I wonder if the master and backup master nodes must not also function as regionservers? The HBase documentation says otherwise, though.
I came across the same issue and could not find anything; it turns out I was copy-pasting from the HBase documentation (https://hbase.apache.org/book.html#shell_exercises). I believe some character in there may be causing the error, so try to enter the command manually:
create 'test', 'cf'
We resolved this issue. The solution is to:
stop HBase
log in to zookeeper-client as root
execute the command rmr /hbase-unsecure/meta-region-server
start HBase
We stop/start HBase through the Ambari UI and delete /hbase... through a bash shell on the server.
[root@s1 ~]# zookeeper-client
Connecting to localhost:2181
.......
[zk: localhost:2181(CONNECTED) 0] rmr /hbase-unsecure/meta-region-server
I use docker/docker-compose to set up my distributed HBase; after I made changes, I could not create a table in the hbase shell.
I removed all the related containers/images with docker rm and rebuilt them, and that worked. Simply rebuilding the images without removing them didn't work.
