I set up a Hazelcast Jet cluster on AWS EC2 following the instructions here. I made use of the hazelcast-aws module so that the nodes can automatically discover each other. The cluster is up and running:
[2019-09-26 22:26:26.288] [INFO ] com.hazelcast.config.AbstractConfigLocator - Using configuration file at /home/ec2-user/hazelcast-jet-3.1/config/hazelcast.xml
[2019-09-26 22:26:26.416] [INFO ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Interfaces is enabled, trying to pick one address matching to one of: [172.31.*.*]
[2019-09-26 22:26:26.416] [INFO ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Prefer IPv4 stack is true, prefer IPv6 addresses is false
[2019-09-26 22:26:26.425] [INFO ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Picked [172.31.33.212]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
[2019-09-26 22:26:26.460] [INFO ] com.hazelcast.system - [172.31.33.212]:5701 [jet] [3.1] Hazelcast Jet 3.1 (20190624 - 000ced7) starting at [172.31.33.212]:5701
It also successfully found its peer:
[2019-09-26 22:26:26.664] [INFO ] com.hazelcast.spi.impl.operationservice.impl.BackpressureRegulator - [172.31.33.212]:5701 [jet] [3.1] Backpressure is disabled
[2019-09-26 22:26:27.103] [INFO ] com.hazelcast.instance.Node - [172.31.33.212]:5701 [jet] [3.1] Activating Discovery SPI Joiner
[2019-09-26 22:26:27.297] [INFO ] com.hazelcast.jet.impl.metrics.JetMetricsService - [172.31.33.212]:5701 [jet] [3.1] Configuring metrics collection, collection interval=5 seconds, retention=5 seconds, publishers=[Management Center Publisher, JMX Publisher]
[2019-09-26 22:26:27.343] [INFO ] com.hazelcast.jet.impl.JetService - [172.31.33.212]:5701 [jet] [3.1] Setting number of cooperative threads and default parallelism to 36
[2019-09-26 22:26:27.345] [INFO ] com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl - [172.31.33.212]:5701 [jet] [3.1] Starting 36 partition threads and 19 generic threads (1 dedicated for priority tasks)
[2019-09-26 22:26:27.354] [INFO ] com.hazelcast.internal.diagnostics.Diagnostics - [172.31.33.212]:5701 [jet] [3.1] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
[2019-09-26 22:26:27.364] [INFO ] com.hazelcast.core.LifecycleService - [172.31.33.212]:5701 [jet] [3.1] [172.31.33.212]:5701 is STARTING
[2019-09-26 22:26:27.772] [INFO ] com.hazelcast.nio.tcp.TcpIpConnector - [172.31.33.212]:5701 [jet] [3.1] Connecting to /172.31.47.40:5701, timeout: 10000, bind-any: true
[2019-09-26 22:26:27.782] [INFO ] com.hazelcast.nio.tcp.TcpIpConnection - [172.31.33.212]:5701 [jet] [3.1] Initialized new cluster connection between /172.31.33.212:47065 and /172.31.47.40:5701
[2019-09-26 22:26:33.786] [INFO ] com.hazelcast.internal.cluster.ClusterService - [172.31.33.212]:5701 [jet] [3.1]
Members {size:2, ver:6} [
Member [172.31.47.40]:5701 - 3ba123c0-e98b-47dc-9bf5-34944d2c53a2
Member [172.31.33.212]:5701 - 0127e9a7-80b1-4c5d-a122-2da5aa7fa042 this
]
Everything looks good except that my client (not on AWS) is not able to connect to the cluster. All I am doing is running the word-counting example. The only difference is that, instead of having both client and server run in the same JVM, I want to submit the job to the cluster I set up. Following the instructions, I replaced JetInstance jet = Jet.newJetInstance(); with:
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientNetworkConfig;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

ClientConfig clientConfig = new ClientConfig();
ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();
clientConfig.getGroupConfig().setName("jet");   // must match the cluster's group name
networkConfig.getAwsConfig().setEnabled(true)
             .setProperty("access-key", "abc")
             .setProperty("secret-key", "cde")
             .setProperty("region", "us-west-2")
             .setProperty("security-group-name", "eee")
             .setProperty("hz-port", "5701")
             .setProperty("use-public-ip", "true");
JetInstance jet = Jet.newJetClient(clientConfig);
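For completeness, the rest of the word-count job is unchanged; it is only submitted through this client instance instead of an embedded one. A minimal sketch of that part, assuming the sample reads from an IMap named "book" and writes to an IMap named "counts" (placeholder names, not necessarily the ones in the sample):

import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

// Classic word count: split lines into words, count per word, write to a map.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map("book"))     // placeholder source IMap: line number -> line of text
 .flatMap(e -> Traversers.traverseArray(e.getValue().toLowerCase().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(word -> word)
 .aggregate(AggregateOperations.counting())
 .drainTo(Sinks.map("counts"));                   // placeholder sink IMap: word -> count

jet.newJob(p).join();                             // submits to the remote cluster and waits for completion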
I can tell the client is looking for the right endpoints:
INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to cluster with name: jet
Sep 26, 2019 3:40:55 PM com.hazelcast.client.connection.nio.ClusterConnectorService
INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to [172.31.47.40]:5701 as owner member
Sep 26, 2019 3:41:00 PM com.hazelcast.client.connection.nio.ClusterConnectorService
WARNING: hz.client_0 [jet] [3.0] [3.12] Exception during initial connection to [172.31.47.40]:5701: com.hazelcast.core.HazelcastException: java.net.SocketTimeoutException
Sep 26, 2019 3:41:00 PM com.hazelcast.client.connection.nio.ClusterConnectorService
INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to [172.31.33.212]:5701 as owner member
Sep 26, 2019 3:41:05 PM com.hazelcast.client.connection.nio.ClusterConnectorService
WARNING: hz.client_0 [jet] [3.0] [3.12] Exception during initial connection to [172.31.33.212]:5701: com.hazelcast.core.HazelcastException: java.net.SocketTimeoutException
I have already added port 5701 to the inbound rules of the security group used by the two EC2 instances.
To debug, I ran a couple of networking commands to see whether port 5701 is open:
[ec2-user@ip-172-31-33-212 ~]$ sudo lsof -i -P -n | grep LISTEN
rpcbind 5428 rpc 8u IPv4 50298 0t0 TCP *:111 (LISTEN)
rpcbind 5428 rpc 11u IPv6 50301 0t0 TCP *:111 (LISTEN)
master 5897 root 13u IPv4 40255 0t0 TCP 127.0.0.1:25 (LISTEN)
sshd 6115 root 3u IPv4 41329 0t0 TCP *:22 (LISTEN)
sshd 6115 root 4u IPv6 41331 0t0 TCP *:22 (LISTEN)
java 43020 ec2-user 10u IPv6 118393 0t0 TCP *:5701 (LISTEN)
[ec2-user@ip-172-31-33-212 ~]$ sudo lsof -i:5701
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 43020 ec2-user 10u IPv6 118393 0t0 TCP *:5701 (LISTEN)
java 43020 ec2-user 45u IPv6 152973 0t0 TCP ip-172-31-33-212.us-west-2.compute.internal:52599->ip-172-31-47-40.us-west-2.compute.internal:5701 (ESTABLISHED)
My knowledge of networking is limited, and I cannot figure out what the issue is. One thing I noticed is that the port is opened on an IPv6 socket while the client tried to connect to the private IPv4 address (although a wildcard IPv6 listener on a dual-stack JVM normally accepts IPv4 connections as well, so that alone may not explain it).
Marko was right (see the comments on the question). This looks like some AWS network constraint. I set up a netcat server on port 5701 on one of my EC2 boxes. I was not able to connect to that port from my laptop using nc, but I was able to connect to it from another EC2 instance in the same VPC. I then did the same experiment with port 80: I could connect to that port both from my laptop and from EC2 instances in the same VPC. It looks like something only allows machines outside of AWS to connect to a couple of well-known ports on the EC2 instances.
Anyway, I unblocked myself by running the Hazelcast server on port 80. This is not ideal, but it is much more convenient for me to try out Hazelcast Jet features from my IDE than to deploy test code to EC2.
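For reference, if the member is started programmatically rather than with the bundled start script, the port change is a one-liner on the network config. A sketch under that assumption (with the packaged script the equivalent setting goes into config/hazelcast.xml, and the client's hz-port property has to match; binding to a port below 1024 also requires elevated privileges):

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.config.JetConfig;

JetConfig config = new JetConfig();
config.getHazelcastConfig().getNetworkConfig()
      .setPort(80)                   // listen on 80 instead of the default 5701
      .setPortAutoIncrement(false);  // do not fall back to 81, 82, ... if 80 is taken
JetInstance jet = Jet.newJetInstance(config);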
Related
I have Elasticsearch running on Kubernetes (EKS), with Filebeat running as a DaemonSet on Kubernetes.
Now I am trying to get the logs from other EC2 machines (outside of EKS), so I have installed the exact same version of Filebeat on EC2 and configured it to send logs to the Elasticsearch running on Kubernetes.
But I am not able to see any logs in Elasticsearch (Kibana). Here are the Filebeat logs:
2019-08-26T18:18:16.005Z INFO instance/beat.go:292 Setup Beat: filebeat; Version: 7.2.1
2019-08-26T18:18:16.005Z INFO [index-management] idxmgmt/std.go:178 Set output.elasticsearch.index to 'filebeat-7.2.1' as ILM is enabled.
2019-08-26T18:18:16.005Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
2019-08-26T18:18:16.005Z INFO add_cloud_metadata/add_cloud_metadata.go:351 add_cloud_metadata: hosting provider type detected as aws, metadata={"availability_zone":"us-west-2a","instance":{"id":"i-0185e1d68306f95b4"},"machine":{"type":"t2.medium"},"provider":"aws","region":"us-west-2"}
2019-08-26T18:18:16.005Z INFO [publisher] pipeline/module.go:97 Beat name: dev-web1
2019-08-26T18:18:16.006Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
There is not much info in the logs.
Then I noticed:
root@dev-web1:~# sudo systemctl status filebeat
● filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.
Loaded: loaded (/lib/systemd/system/filebeat.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-08-26 18:18:47 UTC; 18min ago
Docs: https://www.elastic.co/products/beats/filebeat
Main PID: 7768 (filebeat)
CGroup: /system.slice/filebeat.service
└─7768 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://elasticsear
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://elastic
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:191 done
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:168 done
Aug 26 18:35:47 dev-web1 filebeat[7768]: 2019-08-26T18:35:47.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
Aug 26 18:36:17 dev-web1 filebeat[7768]: 2019-08-26T18:36:17.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
root@dev-web1:~#
But I can't read the complete lines in the status output above.
So I tried:
root@dev-web1:~# curl elasticsearch.dev.domain.net/_cat/health
1566844775 18:39:35 dev-eks-logs green 3 3 48 24 0 0 0 0 - 100.0%
root@dev-web1:~#
which worked, but it did not work when I specified the port:
root@dev-web1:~# curl elasticsearch.dev.domain.net:9200/_cat/health
Filebeat has the following config:
output.elasticsearch:
  hosts: ["elasticsearch.dev.domain.net"]
  username: "elastic"
  password: "changeme"
How can I fix this on the Filebeat side?
Telnet test:
root@dev-web1:~# telnet <ip> 5044
Trying <ip>...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~# telnet localhost 5044
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~#
https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#hosts-option says:
hosts...If no port is specified, 9200 is used.
Adding hosts: ["elasticsearch.dev.domain.net:80"] to the Filebeat configuration should resolve the issue.
I think it is a network problem; check with a telnet to localhost/the IP on port 5044.
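To reproduce the same reachability check from the machine running Filebeat without curl or telnet, a plain TCP connect attempt is enough. A small sketch using the hostname and ports from this question (80 worked via curl, 9200 did not):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) {
        for (int port : new int[] {80, 9200}) {
            try (Socket socket = new Socket()) {
                // 3-second timeout so an unreachable port fails fast
                socket.connect(new InetSocketAddress("elasticsearch.dev.domain.net", port), 3000);
                System.out.println("port " + port + " reachable");
            } catch (IOException e) {
                System.out.println("port " + port + " NOT reachable: " + e);
            }
        }
    }
}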
I am attempting to use/start HDFS NFS following the docs (ignoring the instructions to stop the rpcbind service, and not starting the hadoop portmap service, since the OS is neither SLES 11 nor RHEL 6.2), but I am running into an error when trying to start the NFS service with the hdfs nfs3 command:
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service nfs status
Redirecting to /bin/systemctl status nfs.service
Unit nfs.service could not be found.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service nfs stop
Redirecting to /bin/systemctl stop nfs.service
Failed to stop nfs.service: Unit nfs.service not loaded.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-23 13:48:54 HST; 28s ago
Process: 27337 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 27338 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─27338 /sbin/rpcbind -w
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Starting RPC bind service...
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Started RPC bind service.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# hdfs nfs3
19/07/23 13:49:33 INFO nfs3.Nfs3Base: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Nfs3
STARTUP_MSG: host = HW02.ucera.local/172.18.4.47
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.1.3.1.0.0-78
STARTUP_MSG: classpath = /usr/hdp/3.1.0.0-78/hadoop/conf:/usr/hdp/3.1.0.0-78/hadoop/lib/jersey-server-1.19.jar:/usr/hdp/3.1.0.0-78/hadoop/lib/ranger-hdfs-plugin-shim-1.2.0.3.1.0.0-78.jar:
...
<a bunch of other jars>
...
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64; compiled by 'jenkins' on 2018-12-06T12:26Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
19/07/23 13:49:33 INFO nfs3.Nfs3Base: registered UNIX signal handlers for [TERM, HUP, INT]
19/07/23 13:49:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Nfs3 metrics system started
19/07/23 13:49:33 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:33 INFO security.ShellBasedIdMapping: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
19/07/23 13:49:33 INFO nfs3.WriteManager: Stream timeout is 600000ms.
19/07/23 13:49:33 INFO nfs3.WriteManager: Maximum open streams is 256
19/07/23 13:49:33 INFO nfs3.OpenFileCtxCache: Maximum open streams is 256
19/07/23 13:49:34 INFO nfs3.DFSClientCache: Added export: / FileSystem URI: / with namenodeId: -1408097406
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Configured HDFS superuser is
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Delete current dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Create new dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.Nfs3Base: NFS server port set to: 2049
19/07/23 13:49:34 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:34 INFO mount.RpcProgramMountd: FS:hdfs adding export Path:/ with URI: hdfs://hw01.ucera.local:8020/
19/07/23 13:49:34 INFO oncrpc.SimpleUdpServer: Started listening to UDP requests at port 4242 for Rpc program: mountd at localhost:4242 with workerCount 1
19/07/23 13:49:34 ERROR mount.MountdBase: Failed to start the TCP server.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.hadoop.oncrpc.SimpleTcpServer.run(SimpleTcpServer.java:89)
at org.apache.hadoop.mount.MountdBase.startTCPServer(MountdBase.java:83)
at org.apache.hadoop.mount.MountdBase.start(MountdBase.java:98)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startServiceInternal(Nfs3.java:56)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:69)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:79)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
...
...
19/07/23 13:49:34 INFO util.ExitUtil: Exiting with status 1: org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
19/07/23 13:49:34 INFO nfs3.Nfs3Base: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down Nfs3 at HW02.ucera.local/172.18.4.47
************************************************************/
I am not sure how to interpret any of the errors seen here (and I have not installed any packages like nfs-utils, assuming that Ambari would have installed all the needed packages when the cluster was initially installed).
Any debugging suggestions or solutions for what to do about this?
** UPDATE:
After looking at the error, I can see
Caused by: java.net.BindException: Address already in use
and looking into what is already using it, we see...
[root@HW02 ~]# netstat -ltnp | grep 4242
tcp 0 0 0.0.0.0:4242 0.0.0.0:* LISTEN 98067/jsvc.exec
The jsvc.exec process appears to be related to running Java applications. Given that Hadoop runs on Java, I assume it would be bad to just kill the process. Is it not supposed to be on this port (since it interferes with the NFS gateway)? I am not sure what to do about this.
TL;DR: the NFS gateway service was already running (by default, apparently), and the process that I thought was blocking the hadoop nfs3 service from starting (jsvc.exec) was (I'm assuming) part of that already-running service.
What made me suspect this was that the service also stopped when I shut down the cluster, plus the fact that it was using the port I needed for NFS. I confirmed it by following the verification steps in the docs and seeing that my output was similar to what should be expected:
[root@HW02 ~]# rpcinfo -p hw02
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4242 mountd
100005 2 udp 4242 mountd
100005 3 udp 4242 mountd
100005 1 tcp 4242 mountd
100005 2 tcp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
[root@HW02 ~]# showmount -e hw02
Export list for hw02:
/ *
Another thing that could have told me that the jsvc process was part of an already-running HDFS NFS service would have been checking the process info...
[root@HW02 ~]# ps -feww | grep jsvc
root 61106 59083 0 14:27 pts/2 00:00:00 grep --color=auto jsvc
root 163179 1 0 12:14 ? 00:00:00 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
...
hdfs 163193 163179 0 12:14 ? 00:00:17 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
and seeing jsvc.exec -Dproc_nfs3 ..., which gives the hint that jsvc (which is apparently used for running Java applications on Linux) was running the very nfs3 service I was trying to start.
And for anyone else with this problem, note that I did not stop all the services that the docs want you to stop (since I am using CentOS 7):
[root@HW01 /]# service nfs status
Redirecting to /bin/systemctl status nfs.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@HW01 /]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2019-07-19 15:17:02 HST; 6 days ago
Main PID: 2155 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─2155 /sbin/rpcbind -w
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Also note that I did not apply any of the config file settings recommended in the docs, and some of the properties mentioned there could not even be found in the Ambari-managed HDFS configs (so if anyone can explain why this still works for me despite that, please do).
** Update:
After talking with some people more experienced with HDP (v3.1) than me, it seems the docs I linked to for setting up NFS for HDFS may not be totally up to date (at least when setting up NFS via Ambari management)...
You can have a cluster node act as an NFS gateway by checking it off as an NFS node in the Ambari host-management UI.
The needed configs can be set in the HDFS management UI.
You can confirm that the HDFS NFS gateway is running by looking at the Host > Summary > Components section in Ambari.
I'm using Hazelcast in EC2 discovery mode to allow OrientDB to run in distributed mode. The nodes are running in the same security group under the same EC2 role. The plugin successfully discovers other nodes but fails to connect over ports 5701-5703. Here is the error message:
2019-05-31 19:44:45:303 INFO [10.4.31.181]:5701 [orientdb] [3.8.4] Could not connect to: /10.4.26.235:5703. Reason: SocketException[Connection timed out to address /10.4.26.235:5703] [InitConnectionTask]
I checked whether any process is listening on those ports on the other nodes (lsof -i -P -n) and found these entries:
java 10023 root 91u IPv6 357165 0t0 TCP *:2424 (LISTEN)
java 10023 root 92u IPv6 357166 0t0 TCP *:2480 (LISTEN)
java 10023 root 135u IPv6 357826 0t0 TCP *:5701 (LISTEN)
It seems that all OrientDB listeners are using IPv6, although I never enabled it anywhere (there are no IPv4 listeners). How do I make it listen on IPv4? Here is the only thing I changed in hazelcast.xml after installing it:
<network>
    <join>
        <multicast enabled="false"/>
        <aws enabled="true">
            <tag-key>Name</tag-key>
            <tag-value>orientdb-test</tag-value>
        </aws>
    </join>
</network>
When I tried to stop ZooKeeper with the command "zkServer stop", I got the following result:
call "C:\Program Files\Java\jdk1.8.0_121"\bin\java "-Dzookeeper.log.dir=C:\zookeeper-3.4.10\bin\.." "-Dzookeeper.root.logger=INFO,CONSOLE" -cp "C:\zookeeper-3.4.10\bin\..\build\classes;C:\zookeeper-3.4.10\bin\..\build\lib\*;C:\zookeeper-3.4.10\bin\..\*;C:\zookeeper-3.4.10\bin\..\lib\*;C:\zookeeper-3.4.10\bin\..\conf" org.apache.zookeeper.server.quorum.QuorumPeerMain "C:\zookeeper-3.4.10\bin\..\conf\zoo.cfg" stop
Output:
2017-09-01 13:55:22,070 [myid:] - INFO [main:DatadirCleanupManager#78] - autopurge.snapRetainCount set to 3
2017-09-01 13:55:22,072 [myid:] - INFO [main:DatadirCleanupManager#79] - autopurge.purgeInterval set to 0
2017-09-01 13:55:22,072 [myid:] - INFO [main:DatadirCleanupManager#101] - Purge task is not scheduled.
2017-09-01 13:55:22,072 [myid:] - WARN [main:QuorumPeerMain#113] - Either no config or no quorum defined in config, running in standalone mode
2017-09-01 13:55:22,145 [myid:] - ERROR [main:ZooKeeperServerMain#55] - Invalid arguments, exiting abnormally
java.lang.NumberFormatException: For input string: "C:\zookeeper-3.4.10\bin\..\conf\zoo.cfg"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.zookeeper.server.ServerConfig.parse(ServerConfig.java:59)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:84)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:53)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
2017-09-01 13:55:22,148 [myid:] - INFO [main:ZooKeeperServerMain#56] - Usage: ZooKeeperServerMain configfile | port datadir [ticktime] [maxcnxns]
I am sure I have started ZooKeeper, because when I tried to start a new instance, it showed "java.net.BindException: Address already in use: bind".
Another strange thing is that I cannot find ZooKeeper in the Windows services list. However, when I showed all port usage in Windows PowerShell with netstat -and, I found that port 2181 is in use:
Proto Local Address Foreign Address State
TCP 0.0.0.0:2181 0.0.0.0:0 LISTENING
[java.exe]
TCP [::1]:2181 [::1]:62268 ESTABLISHED
[java.exe]
TCP [::1]:2181 [::1]:62279 ESTABLISHED
[java.exe]
TCP [::1]:2181 [::1]:62280 ESTABLISHED
[java.exe]
TCP [::1]:2181 [::1]:62281 ESTABLISHED
[java.exe]
I was running ZooKeeper on Windows and wasn't able to stop the ZooKeeper instance listening on port 2181 using zookeeper-stop.sh, so I tried this double-slash "//" method with taskkill. It worked:
1. netstat -ano | findstr :2181
TCP 0.0.0.0:2181 0.0.0.0:0 LISTENING 8876
TCP [::]:2181 [::]:0 LISTENING 8876
2. taskkill //PID 8876 //F
SUCCESS: The process with PID 8876 has been terminated.
Credit goes to: How do I kill the process currently using a port on localhost in Windows?
It looks like there is an open bug concerning the start and stop commands in ZooKeeper.
To start ZooKeeper, omit the start parameter and call bin\zkServer instead.
To stop it, if you don't see the process in Task Manager, you need to connect to the ZooKeeper server as an administrator and perform the kill commands.
More details are here.
Please tell me how to get the HBase RegionServers connected to the master.
I configured 5 region servers; however, only 2 of them are working properly.
hbase(main):001:0> status
2 servers, 0 dead, 1.5000 average load
The hostnames of these two servers are sm3-10 and sm3-12, according to http://hbase-master:60010.
But the other servers, such as sm3-8, do not work.
I'd like to know the troubleshooting steps and resolutions.
sm3-10: slave, works well
[root@sm3-10 ~]# jps
2581 QuorumPeerMain
2761 SecondaryNameNode
2678 DataNode
19913 Jps
2551 HRegionServer
[root@sm3-10 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2678 hdfs 52r IPv6 27608 TCP sm3-10:33316->sm3-12:54310 (ESTABLISHED)
[root@sm3-10 ~]# lsof -i:3888
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2581 zookeeper 19u IPv6 7239 TCP *:ciphire-serv (LISTEN)
java 2581 zookeeper 20u IPv6 7242 TCP sm3-10:ciphire-serv->sm3-11:53593 (ESTABLISHED)
java 2581 zookeeper 25u IPv6 27011 TCP sm3-10:ciphire-serv->sm3-12:40352 (ESTABLISHED)
java 2581 zookeeper 29u IPv6 25573 TCP sm3-10:ciphire-serv->sm3-8:44271 (ESTABLISHED)
sm3-8: slave, does not work properly; however, the status looks good
[root@sm3-8 ~]# jps
3489 Jps
2249 HRegionServer
2463 DataNode
2297 QuorumPeerMain
2686 SecondaryNameNode
[root@sm3-8 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2463 hdfs 51u IPv6 9919 TCP sm3-8.nos-seamicro.local:40776->sm3-12:54310 (ESTABLISHED)
[root@sm3-8 ~]# lsof -i:3888
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2297 zookeeper 18u IPv6 5951 TCP *:ciphire-serv (LISTEN)
java 2297 zookeeper 19u IPv6 9839 TCP sm3-8.nos-seamicro.local:52886->sm3-12:ciphire-serv (ESTABLISHED)
java 2297 zookeeper 20u IPv6 5956 TCP sm3-8.nos-seamicro.local:44271->sm3-10:ciphire-serv (ESTABLISHED)
java 2297 zookeeper 24u IPv6 5959 TCP sm3-8.nos-seamicro.local:47922->sm3-11:ciphire-serv (ESTABLISHED)
Master: sm3-12
[root@sm3-12 ~]# jps
2760 QuorumPeerMain
3035 NameNode
3096 SecondaryNameNode
2612 HRegionServer
4330 Jps
2872 DataNode
3723 HMaster
[root@sm3-12 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2872 hdfs 51u IPv6 7824 TCP sm3-12:45482->sm3-12:54310 (ESTABLISHED)
java 3035 hdfs 54u IPv6 7783 TCP sm3-12:54310 (LISTEN)
java 3035 hdfs 70u IPv6 7873 TCP sm3-12:54310->sm3-8:40776 (ESTABLISHED)
java 3035 hdfs 71u IPv6 7874 TCP sm3-12:54310->sm3-11:54990 (ESTABLISHED)
java 3035 hdfs 72u IPv6 7875 TCP sm3-12:54310->sm3-10:33316 (ESTABLISHED)
java 3035 hdfs 74u IPv6 7877 TCP sm3-12:54310->sm3-12:45482 (ESTABLISHED)
[root@sm3-12 ~]#
[root@sm3-12 ~]# cat /etc/hbase/conf/hbase-site.xml
hbase.rootdir
hdfs://sm3-12:54310/hbase
true
hbase.zookeeper.quorum
sm3-8,sm3-10,sm3-11,sm3-12,sm3-13
true
--- snip ---
[root@sm3-12 ~]# cat /etc/zookeeper/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=sm3-10:2888:3888
server.2=sm3-11:2888:3888
server.3=sm3-12:2888:3888
server.4=sm3-8:2888:3888
[root@sm3-12 ~]#
Thanks in advance,
Hiromi
Check to make sure your DNS is configured properly on all of the hosts and that each server can do a reverse lookup.
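As a quick way to verify forward and reverse resolution from a JVM (the same environment the HBase daemons run in), a couple of InetAddress calls are enough. A sketch; the hostname is just an example taken from this question and the check should be run against each server in turn:

import java.net.InetAddress;

public class DnsCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "sm3-8";    // example host from the question
        InetAddress addr = InetAddress.getByName(host);       // forward lookup
        System.out.println(host + " -> " + addr.getHostAddress());
        String reverse = InetAddress.getByAddress(addr.getAddress())
                                    .getCanonicalHostName();  // reverse lookup
        System.out.println(addr.getHostAddress() + " -> " + reverse);
    }
}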