Unable to start mesos-slave on different VM. Constant Deactivated status - mesos

I'm trying to set up a simple Mesos cluster on 2 virtual machines. The IPs are:
10.10.0.102 (with 1 master and 1 slave) - FQDN mesos1.mydomain
10.10.0.103 (with 1 slave) - FQDN mesos2.mydomain
I'm using Mesos 0.27.1 (RPMs downloaded from Mesosphere) and CentOS Linux release 7.1.1503 (Core).
I was successful in deploying a 1-node cluster (10.10.0.102): the master and slave work, and I can deploy and scale a simple application via Marathon.
The problem comes when I try to start the second slave on 10.10.0.103. Whenever I start that slave, its state is Deactivated.
Logs from the slave on 10.10.0.103:
I0226 13:49:58.428019 14937 slave.cpp:463] Slave resources: cpus(*):1; mem(*):2768; disk(*):3409; ports(*):[31000-32000]
I0226 13:49:58.428019 14937 slave.cpp:471] Slave attributes: [ ]
I0226 13:49:58.428019 14937 slave.cpp:476] Slave hostname: mesos2
I0226 13:49:58.430469 14946 state.cpp:58] Recovering state from '/tmp/mesos/meta'
I0226 13:49:58.430922 14947 status_update_manager.cpp:200] Recovering status update manager
I0226 13:49:58.430954 14947 containerizer.cpp:390] Recovering containerizer
I0226 13:49:58.432219 14947 provisioner.cpp:245] Provisioner recovery complete
I0226 13:49:58.432273 14947 slave.cpp:4495] Finished recovery
I0226 13:49:58.448940 14948 group.cpp:349] Group process (group(1)@10.10.0.103:5051) connected to ZooKeeper
I0226 13:49:58.449050 14948 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0226 13:49:58.449064 14948 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0226 13:49:58.451846 14948 detector.cpp:154] Detected a new leader: (id='3')
I0226 13:49:58.451937 14948 group.cpp:700] Trying to get '/mesos/json.info_0000000003' in ZooKeeper
I0226 13:49:58.453397 14948 detector.cpp:479] A new leading master (UPID=master@10.10.0.102:5050) is detected
I0226 13:49:58.453459 14948 slave.cpp:795] New master detected at master@10.10.0.102:5050
I0226 13:49:58.453698 14948 slave.cpp:820] No credentials provided. Attempting to register without authentication
I0226 13:49:58.453724 14948 slave.cpp:831] Detecting new master
I0226 13:49:58.453743 14948 status_update_manager.cpp:174] Pausing sending status updates
I0226 13:50:58.445101 14948 slave.cpp:4304] Current disk usage 22.11%. Max allowed age: 4.752451232032847days
I0226 13:51:58.460233 14948 slave.cpp:4304] Current disk usage 22.11%. Max allowed age: 4.752451232032847days
Logs from the master on 10.10.0.102:
I0226 22:55:14.240464 2021 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 682
I0226 22:55:14.240542 2021 hierarchical.cpp:473] Added slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 (mesos2) with cpus(*):1; mem(*):2768; disk(*):3409; ports(*):[31000-32000] (allocated: )
I0226 22:55:14.240671 2021 master.cpp:5350] Sending 1 offers to framework c5a5818d-16fa-42bf-8e73-697a2d12fe97-0001 (marathon) at scheduler-91034353-1820-4020-aad1-10e11d567136@10.10.0.102:45698
I0226 22:55:14.240767 2021 replica.cpp:537] Replica received write request for position 682 from (1259)@10.10.0.102:5050
E0226 22:55:14.241082 2027 process.cpp:1966] Failed to shutdown socket with fd 32: Transport endpoint is not connected
I0226 22:55:14.241143 2019 master.cpp:1172] Slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2) disconnected
I0226 22:55:14.241153 2019 master.cpp:2633] Disconnecting slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2)
I0226 22:55:14.241161 2019 master.cpp:2652] Deactivating slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2)
I0226 22:55:14.241230 2019 hierarchical.cpp:560] Slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 deactivated
I0226 22:55:14.245923 2019 master.cpp:3673] Processing DECLINE call for offers: [ a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-O1251 ] for framework c5a5818d-16fa-42bf-8e73-697a2d12fe97-0001 (marathon) at scheduler-91034353-1820-4020-aad1-10e11d567136@10.10.0.102:45698
W0226 22:55:14.245923 2019 master.cpp:3720] Ignoring decline of offer a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-O1251 since it is no longer valid
I0226 22:55:14.249065 2021 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 8.264893ms
I0226 22:55:14.249107 2021 replica.cpp:712] Persisted action at 682
I0226 22:55:14.249220 2021 replica.cpp:691] Replica received learned notice for position 682 from @0.0.0.0:0
I've tried to start the slave using two approaches (on 10.10.0.103):
sudo service mesos-slave start
mesos-slave --master=10.10.0.102:5050 --ip=10.10.0.103
Both give me the same result.
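For reference, when the Mesosphere RPM packages are used, the mesos-slave service normally picks up its flags from files under /etc/mesos-slave/ and the ZooKeeper URL from /etc/mesos/zk. A minimal sketch of that setup (assuming the standard package layout; the values below just mirror the cluster described above):
# On 10.10.0.103: point the slave at the master's ZooKeeper and pin its IP/hostname
echo "zk://10.10.0.102:2181/mesos" | sudo tee /etc/mesos/zk
echo "10.10.0.103" | sudo tee /etc/mesos-slave/ip
echo "mesos2.mydomain" | sudo tee /etc/mesos-slave/hostname
sudo service mesos-slave restart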
Additionally, in mesos-slave.WARNING I also see:
Running on machine: mesos2.mydomain
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0226 13:49:58.415089 14937 systemd.cpp:244] Required functionality `Delegate` was introduced in Version `218`. Your system may not function properly; however since some distributions have patched systemd packages, your system may still be functional. This is why we keep running. See MESOS-3352 for more information
Based on similar topics I see that this can be related to the network configuration, so below is some info about it:
hosts file on 10.10.0.102
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.103 mesos2 mesos2.mydomain
10.10.0.102 mesos1 mesos1.mydomain
hosts file on 10.10.0.103
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.102 mesos1 mesos1.mydomain
10.10.0.103 mesos2 mesos2.mydomain
Both VMs have 2 network interfaces (not counting loopback). The output below comes from 10.10.0.103; 10.10.0.102 is similar:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:49:76:48 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
valid_lft 75232sec preferred_lft 75232sec
inet6 fe80::a00:27ff:fe49:7648/64 scope link
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:d9:24:2a brd ff:ff:ff:ff:ff:ff
inet 10.10.0.103/24 brd 10.10.0.255 scope global enp0s8
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fed9:242a/64 scope link
valid_lft forever preferred_lft forever
Both VMs have network connectivity.
from 10.10.0.102 to 10.10.0.103
[root@mesos1 ~]# ping mesos2.mydomain
PING mesos2 (10.10.0.103) 56(84) bytes of data.
64 bytes from mesos2 (10.10.0.103): icmp_seq=1 ttl=64 time=0.578 ms
64 bytes from mesos2 (10.10.0.103): icmp_seq=2 ttl=64 time=0.616 ms
from 10.10.0.103 to 10.10.0.102
[root@mesos2 ~]# ping mesos1.mydomain
PING mesos1 (10.10.0.102) 56(84) bytes of data.
64 bytes from mesos1 (10.10.0.102): icmp_seq=1 ttl=64 time=0.441 ms
64 bytes from mesos1 (10.10.0.102): icmp_seq=2 ttl=64 time=0.972 ms
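(Note that ping only proves ICMP reachability; it says nothing about whether the Mesos TCP ports themselves are open. A quick port-level sanity check, not from the original post and assuming the default ports 5050/5051, could be:)
# From the slave (10.10.0.103): is the master reachable on its TCP port?
curl -v http://10.10.0.102:5050/master/state.json -o /dev/null
# From the master (10.10.0.102): is the agent reachable on its TCP port?
curl -v http://10.10.0.103:5051/state.json -o /dev/null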
Any help would be highly appreciated.
Regards
Andrzej

As always, the simplest answers are the best. It turns out that I had the firewall (firewalld) running on the slave node. Disabling it resolved my problem:
systemctl disable firewalld
systemctl stop firewalld
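If you'd rather keep firewalld running, an alternative sketch (assuming the default ports: 5050 master, 5051 agent, 2181 ZooKeeper, 31000-32000 task port range) is to open just the required ports instead:
sudo firewall-cmd --permanent --add-port=5051/tcp          # Mesos agent (slave node)
sudo firewall-cmd --permanent --add-port=31000-32000/tcp   # task port range offered by the agent
sudo firewall-cmd --permanent --add-port=5050/tcp          # Mesos master (master node)
sudo firewall-cmd --permanent --add-port=2181/tcp          # ZooKeeper (master node)
sudo firewall-cmd --reload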
Thanks everyone for the help!

Related

Cannot determine ethernet address for proxy ARP (CentOS PPTP VPN)

I've installed pptpd on CentOS 7 on an AWS EC2 instance and I can connect to the VPN with a Windows client, but I have no internet access, while the server has full internet access. In the pptpd log I noticed the error "Cannot determine ethernet address for proxy ARP".
I've changed the DNS in /etc/ppp/options.pptpd as below:
ms-dns 8.8.8.8
ms-dns 8.8.4.4
I've also created users in /etc/ppp/chap-secrets, and clients can connect without problems (but with no internet access).
I've also enabled IP forwarding in /etc/sysctl.conf:
net.ipv4.ip_forward = 1
and executed this command:
sudo sysctl -p
I changed the local and remote IPs in /etc/pptpd.conf as below:
localip 192.168.10.1
remoteip 192.168.20.10-100
I configured the firewall for IP masquerading:
sudo iptables -t nat -A POSTROUTING -o ens5 -j MASQUERADE
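For what it's worth, a MASQUERADE rule alone is often not enough for PPTP; GRE and forwarding also need to be allowed. A hedged sketch of the extra rules (assuming ens5 is the internet-facing interface, as above):
sudo iptables -A INPUT -p tcp --dport 1723 -j ACCEPT   # PPTP control channel
sudo iptables -A INPUT -p gre -j ACCEPT                # GRE tunnel traffic
sudo iptables -A FORWARD -i ppp+ -o ens5 -j ACCEPT     # let VPN clients out
sudo iptables -A FORWARD -i ens5 -o ppp+ -m state --state RELATED,ESTABLISHED -j ACCEPT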
This is the ifconfig result:
ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 172.31.28.246 netmask 255.255.240.0 broadcast 172.31.31.255
inet6 fe80::4e6:11ff:fed8:bb4a prefixlen 64 scopeid 0x20<link>
ether 06:e6:11:d8:bb:4a txqueuelen 1000 (Ethernet)
RX packets 3668 bytes 347939 (339.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3111 bytes 385009 (375.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 6 bytes 416 (416.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6 bytes 416 (416.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ppp0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 1396
inet 192.168.10.1 netmask 255.255.255.255 destination 192.168.20.10
ppp txqueuelen 3 (Point-to-Point Protocol)
RX packets 40 bytes 3158 (3.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8 bytes 104 (104.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
And this is the pptpd status (I could connect to the VPN successfully but could not access the internet):
[root@ip-172-31-28-246 ~]# systemctl status pptpd
● pptpd.service - PoPToP Point to Point Tunneling Server
Loaded: loaded (/usr/lib/systemd/system/pptpd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2021-08-22 09:24:41 UTC; 2min 9s ago
Main PID: 1476 (pptpd)
CGroup: /system.slice/pptpd.service
├─1476 /usr/sbin/pptpd -f
├─1505 pptpd [171.213.14.133:ED5A - 0000]
└─1506 /usr/sbin/pppd local file /etc/ppp/options.pptpd 115200 192.168.10.1:192.168.20.10 ipparam 171.213.14.133 plugin /usr/lib64/pptpd/pptpd-logwtmp.so pptpd-original-ip 171.213.14.133 remote...
Aug 22 09:25:28 ip-172-31-28-246.ap-east-1.compute.internal pptpd[1505]: CTRL: Starting call (launching pppd, opening GRE)
Aug 22 09:25:28 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: Plugin /usr/lib64/pptpd/pptpd-logwtmp.so loaded.
Aug 22 09:25:28 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: pppd 2.4.5 started by root, uid 0
Aug 22 09:25:28 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: Using interface ppp0
Aug 22 09:25:28 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: Connect: ppp0 <--> /dev/pts/1
Aug 22 09:25:32 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: peer from calling number 171.213.14.133 authorized
Aug 22 09:25:32 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: MPPE 128-bit stateless compression enabled
Aug 22 09:25:34 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: Cannot determine ethernet address for proxy ARP
Aug 22 09:25:34 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: local IP address 192.168.10.1
Aug 22 09:25:34 ip-172-31-28-246.ap-east-1.compute.internal pppd[1506]: remote IP address 192.168.20.10

HDFS NFS startup error: "ERROR mount.MountdBase: Failed to start the TCP server...ChannelException: Failed to bind..."

I'm attempting to start up HDFS NFS following the docs (ignoring the instructions to stop the rpcbind service, and not starting the hadoop portmap service, given that the OS is not SLES 11 or RHEL 6.2), but I'm running into an error when trying to start the hdfs nfs3 service:
[root@HW02 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@HW02 ~]# service nfs status
Redirecting to /bin/systemctl status nfs.service
Unit nfs.service could not be found.
[root@HW02 ~]# service nfs stop
Redirecting to /bin/systemctl stop nfs.service
Failed to stop nfs.service: Unit nfs.service not loaded.
[root@HW02 ~]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-23 13:48:54 HST; 28s ago
Process: 27337 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 27338 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─27338 /sbin/rpcbind -w
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Starting RPC bind service...
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Started RPC bind service.
[root@HW02 ~]# hdfs nfs3
19/07/23 13:49:33 INFO nfs3.Nfs3Base: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Nfs3
STARTUP_MSG: host = HW02.ucera.local/172.18.4.47
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.1.3.1.0.0-78
STARTUP_MSG: classpath = /usr/hdp/3.1.0.0-78/hadoop/conf:/usr/hdp/3.1.0.0-78/hadoop/lib/jersey-server-1.19.jar:/usr/hdp/3.1.0.0-78/hadoop/lib/ranger-hdfs-plugin-shim-1.2.0.3.1.0.0-78.jar:
...
<a bunch of other jars>
...
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64; compiled by 'jenkins' on 2018-12-06T12:26Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
19/07/23 13:49:33 INFO nfs3.Nfs3Base: registered UNIX signal handlers for [TERM, HUP, INT]
19/07/23 13:49:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Nfs3 metrics system started
19/07/23 13:49:33 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:33 INFO security.ShellBasedIdMapping: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
19/07/23 13:49:33 INFO nfs3.WriteManager: Stream timeout is 600000ms.
19/07/23 13:49:33 INFO nfs3.WriteManager: Maximum open streams is 256
19/07/23 13:49:33 INFO nfs3.OpenFileCtxCache: Maximum open streams is 256
19/07/23 13:49:34 INFO nfs3.DFSClientCache: Added export: / FileSystem URI: / with namenodeId: -1408097406
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Configured HDFS superuser is
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Delete current dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Create new dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.Nfs3Base: NFS server port set to: 2049
19/07/23 13:49:34 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:34 INFO mount.RpcProgramMountd: FS:hdfs adding export Path:/ with URI: hdfs://hw01.ucera.local:8020/
19/07/23 13:49:34 INFO oncrpc.SimpleUdpServer: Started listening to UDP requests at port 4242 for Rpc program: mountd at localhost:4242 with workerCount 1
19/07/23 13:49:34 ERROR mount.MountdBase: Failed to start the TCP server.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.hadoop.oncrpc.SimpleTcpServer.run(SimpleTcpServer.java:89)
at org.apache.hadoop.mount.MountdBase.startTCPServer(MountdBase.java:83)
at org.apache.hadoop.mount.MountdBase.start(MountdBase.java:98)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startServiceInternal(Nfs3.java:56)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:69)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:79)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
...
...
19/07/23 13:49:34 INFO util.ExitUtil: Exiting with status 1: org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
19/07/23 13:49:34 INFO nfs3.Nfs3Base: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down Nfs3 at HW02.ucera.local/172.18.4.47
************************************************************/
Not sure how to interpret any of the errors seen here (and I have not installed any packages like nfs-utils, assuming that Ambari would have installed all the needed packages when the cluster was initially installed).
Any debugging suggestions or solutions for what to do about this?
** UPDATE:
After looking at the error, I can see
Caused by: java.net.BindException: Address already in use
and looking into what is already using it, we see...
[root@HW02 ~]# netstat -ltnp | grep 4242
tcp 0 0 0.0.0.0:4242 0.0.0.0:* LISTEN 98067/jsvc.exec
The process jsvc.exec appears to be related to running Java applications. Given that Hadoop runs on Java, I assume it would be bad to just kill the process. Is it not supposed to be on this port (since it interferes with the NFS gateway)? Not sure what to do about this.
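(Before killing anything, one way to see exactly what that PID is running, using the PID 98067 from the netstat output above, would be:)
ps -feww | grep 98067                       # full command line of whatever owns port 4242
cat /proc/98067/cmdline | tr '\0' ' '; echo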
TLDR: the NFS gateway service was already running (by default, apparently), and the service that I thought was blocking the hadoop nfs3 service from starting (jsvc.exec) was, I'm assuming, part of that already-running service.
What made me suspect this was that the service also stopped when I shut down the cluster, plus the fact that it was using the port I needed for NFS. I confirmed it by following the verification steps in the docs and seeing that my output was similar to what was expected.
[root@HW02 ~]# rpcinfo -p hw02
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4242 mountd
100005 2 udp 4242 mountd
100005 3 udp 4242 mountd
100005 1 tcp 4242 mountd
100005 2 tcp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
[root@HW02 ~]# showmount -e hw02
Export list for hw02:
/ *
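(Not from the original post, but for completeness: once rpcinfo and showmount look right, the gateway can be mounted roughly as the HDFS NFS gateway docs describe; /mnt/hdfs_nfs here is just a hypothetical mount point.)
sudo mkdir -p /mnt/hdfs_nfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync hw02:/ /mnt/hdfs_nfs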
Another thing that could have told me that the jsvc process was part of an already-running HDFS NFS service would have been checking the process info...
[root@HW02 ~]# ps -feww | grep jsvc
root 61106 59083 0 14:27 pts/2 00:00:00 grep --color=auto jsvc
root 163179 1 0 12:14 ? 00:00:00 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
...
hdfs 163193 163179 0 12:14 ? 00:00:17 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
and seeing jsvc.exec -Dproc_nfs3 ..., which is the hint that jsvc (which apparently is for running Java apps on Linux) was being used to run the very nfs3 service I was trying to start.
And for anyone else with this problem, note that I did not stop all the services that the docs want you to stop (since I'm using CentOS 7):
[root@HW01 /]# service nfs status
Redirecting to /bin/systemctl status nfs.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@HW01 /]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2019-07-19 15:17:02 HST; 6 days ago
Main PID: 2155 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─2155 /sbin/rpcbind -w
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Also note that I did not follow any of the config file settings recommended in the docs; some of the properties mentioned there could not even be found in the Ambari-managed HDFS configs (so if anyone can explain why this is still working for me despite that, please do).
** Update:
After talking with some people more experienced with HDP (v3.1) than me, it seems the docs that I linked to for setting up NFS for HDFS may not be totally up to date (at least when setting up NFS via Ambari management)...
You can have a cluster node act as an NFS gateway by checking it off as an NFS node in the Ambari host management UI.
The needed configs can then be set in the HDFS management UI.
You can confirm that the HDFS NFS gateway is running by looking at the Host > Summary > Components section in Ambari.

ElasticSearch java.net.NoRouteToHostException in docker

[2015-10-11 13:08:26,587][WARN ][transport.netty ] [Joseph] exception caught on transport layer [[id: 0x7e9f652b]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
I get this exception when launching Elasticsearch in Docker (actually I only have this problem on a CentOS 7 Docker host).
First, my Dockerfile exposes the UDP ports:
EXPOSE 9200 9300/udp 9301/udp 9302/udp 9303/udp 9304/udp 9305/udp
When I start the Docker container, I open these ports via -p 9200:9200 -p 9300:9300/udp -p 9301:9301/udp -p 9302:9302/udp -p 9303:9303/udp -p 9304:9304/udp -p 9305:9305/udp
In docker ps, I do see these ports opened as 0.0.0.0:9300-9305->9300-9305/udp
And here are some lines of my elasticsearch.yml:
cluster.name: changsha
discovery.zen.ping.unicast.hosts: [ "10.0.5.241" ]
network.publish_host: 10.0.5.241
10.0.5.241 is my Docker host's IP address. What is wrong here? It succeeded on a CentOS 6 host but fails on this CentOS 7 host.
UPDATE
Following this answer, I get the following result from tcpdump -p -nn icmp.
09:26:53.277117 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
09:26:53.277494 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
09:26:53.277822 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
09:26:53.278043 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
09:26:54.277753 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
09:27:04.280703 IP 10.0.5.241 > 172.17.0.8: ICMP host 10.0.5.241 unreachable - admin prohibited, length 68
First, find out the docker interface ip address
# ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.42.1 netmask 255.255.0.0 broadcast 0.0.0.0
ether 56:84:7a:fe:97:99 txqueuelen 0 (Ethernet)
RX packets 115761 bytes 12605533 (12.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 55687 bytes 22647938 (21.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Then add all of the Docker IP addresses to the whitelist:
firewall-cmd --permanent --zone=trusted --add-source=172.17.0.0/16
firewall-cmd --reload
Problem solved.
If someone comes across this issue on CentOS 7.4, it's because of a conflict between the docker service and the firewalld service.
You can solve it by disabling firewalld and then restarting the docker service.
Please refer to https://sanenthusiast.com/docker-and-firewalld-mess-in-centos-7/
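A narrower alternative to trusting the whole Docker subnet or disabling firewalld entirely (a sketch, using the ports published in the question) would be to open just the Elasticsearch ports:
firewall-cmd --permanent --add-port=9200/tcp
firewall-cmd --permanent --add-port=9300-9305/tcp
firewall-cmd --permanent --add-port=9300-9305/udp
firewall-cmd --reload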

Running HBase in standalone mode but get hadoop "retrying connect to server" message?

I'm trying to run HBase in standalone mode following this tutorial:
http://hbase.apache.org/book.html#quickstart
I get the following exception when I try to run
create 'test', 'cf'
in the HBase shell
ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
I've seen questions here regarding this error, but the solutions haven't worked for me.
What is perhaps more troubling, and what may be at the heart of the matter, is that when I stop HBase, I get the following message over and over in the log:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 192.168.200.1/192.168.200.1:54310. Already tried <n> time(s)
I don't know what server it's trying to connect to- that's not my computer's IP address- and like I said, I'm trying to run HBase in standalone mode.
I would really appreciate if someone could help me understand this log output.
My /etc/hosts file:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
127.0.0.1 j.gloves
ifconfig -a
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
options=3<RXCSUM,TXCSUM>
inet6 ::1 prefixlen 128
inet 127.0.0.1 netmask 0xff000000
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
nd6 options=1<PERFORMNUD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
ether 10:9a:dd:60:de:3d
nd6 options=1<PERFORMNUD>
media: autoselect (none)
status: inactive
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
lladdr 70:cd:60:ff:fe:4c:07:7a
nd6 options=1<PERFORMNUD>
media: autoselect <full-duplex>
status: inactive
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 10:9a:dd:b6:b4:7d
inet6 fe80::129a:ddff:feb6:b47d%en1 prefixlen 64 scopeid 0x6
inet 192.168.1.161 netmask 0xffffff00 broadcast 192.168.1.255
nd6 options=1<PERFORMNUD>
media: autoselect
status: active
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
ether 02:9a:dd:b6:b4:7d
media: autoselect
status: inactive
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///Users/j.gloves/trynutch/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/j.gloves/trynutch/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
Thank you to everyone who offered help in the comments.
My boss was able to fix the problem. It turned out there was an older version of Hadoop on my machine that was referencing an old IP address. Once it was removed from my PATH and from the machine, HBase worked as expected.
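For anyone hitting the same thing, a quick (hypothetical) way to spot a stray Hadoop install shadowing the one you expect:
which -a hadoop    # every hadoop executable on the PATH, in order
hadoop version     # confirm which version actually runs
echo $PATH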

Hadoop slave cannot connect to master, even when service is running and ports are open

I'm running Hadoop 2.5.1 and I'm having a problem when slaves connect to the master. My goal is to set up a Hadoop cluster. I hope someone can help; I've been pondering this for too long already! :)
This is what comes up in the slave's log file:
2014-10-18 22:14:07,368 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.0.104:8020
This is my core-site.xml file (same on master and slave):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master/</value>
</property>
</configuration>
This is my hosts file ((almost) the same on master and slave). I have hard-coded the addresses there without any success:
127.0.0.1 localhost
192.168.0.104 xubuntu: xubuntu
192.168.0.104 master
192.168.0.194 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Netstat output from the master:
xubuntu@xubuntu:/usr/local/hadoop/logs$ netstat -atnp | grep 8020
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 192.168.0.104:8020 0.0.0.0:* LISTEN 26917/java
tcp 0 0 192.168.0.104:52114 192.168.0.104:8020 ESTABLISHED 27046/java
tcp 0 0 192.168.0.104:8020 192.168.0.104:52114 ESTABLISHED 26917/java
Nmap from master to master:
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:36 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.000072s latency).
rDNS record for 192.168.0.104: xubuntu:
PORT STATE SERVICE
8020/tcp open unknown
...and nmap from the slave to the master (even though the port is open, the slave doesn't connect to it):
ubuntu@ubuntu:/usr/local/hadoop/logs$ nmap master -p 8020
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:35 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.14s latency).
PORT STATE SERVICE
8020/tcp open unknown
What is this all about? The problem is not the firewall. I have also read every thread there is on this without any success. I'm getting frustrated with this.. :(
At least one of your problems is that you are using an old configuration name for HDFS. For version 2.5.1 the configuration name should be fs.defaultFS instead of fs.default.name. I also suggest defining the port in the value, so the value would be hdfs://master:8020.
Sorry, I'm not a Linux guru, so I don't know about nmap, but does telnetting from the slave to the master on that port work?
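Putting that together, the corrected core-site.xml would look roughly like this:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:8020</value>
</property>
</configuration>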
