I am sure the answer is out there somewhere, but I have not been able to find it or fix the problem after several tries. Here is the use case:
1.> I have two ec2 instances belonging to the same VPC but having different security groups
2.> Both security groups open ports 22 and 80 to the public and allow all traffic on all ports from the CIDR block 10.20.0.0/16
3.> The internal IPs of the EC2 instances are 10.20.0.51 (server-1) and 10.20.0.202 (server-2)
4.> I am using the following commands to run two dockerized Consul servers on them
server-1 : docker run -it -p 8400:8400 -p 8500:8500 -p 8600:53/udp -p 8301:8301 -p 8300:8300 -h node1 progrium/consul -server -advertise 10.20.0.51 -bootstrap-expect 2
server-2 : docker run -it -p 8400:8400 -p 8500:8500 -p 8600:53/udp -p 8301:8301 -p 8300:8300 --name node2 -h node2 progrium/consul -server -advertise 10.20.0.202 -join 10.20.0.51
5.> Both of them start, and for about a second they recognise each other: the election happens and the first node gets elected. Soon after that, server-2 starts saying "memberlist: Suspect node1 has failed, no acks received" and server-1 says "memberlist: Suspect node2 has failed, no acks received".
This is what the logs look like for server-1:
2016/01/04 19:18:35 [INFO] serf: EventMemberJoin: node2 10.20.0.202
2016/01/04 19:18:35 [INFO] consul: adding server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:35 [INFO] consul: Attempting bootstrap with nodes: [10.20.0.51:8300 10.20.0.202:8300]
2016/01/04 19:18:35 [WARN] raft: Heartbeat timeout reached, starting election
2016/01/04 19:18:35 [INFO] raft: Node at 10.20.0.51:8300 [Candidate] entering Candidate state
2016/01/04 19:18:35 [WARN] raft: Remote peer 10.20.0.202:8300 does not have local node 10.20.0.51:8300 as a peer
2016/01/04 19:18:35 [INFO] raft: Election won. Tally: 2
2016/01/04 19:18:35 [INFO] raft: Node at 10.20.0.51:8300 [Leader] entering Leader state
2016/01/04 19:18:35 [INFO] consul: cluster leadership acquired
2016/01/04 19:18:35 [INFO] consul: New leader elected: node1
2016/01/04 19:18:35 [INFO] raft: pipelining replication to peer 10.20.0.202:8300
2016/01/04 19:18:35 [INFO] consul: member 'node1' joined, marking health alive
2016/01/04 19:18:35 [INFO] consul: member 'node2' joined, marking health alive
2016/01/04 19:18:37 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:37 [INFO] agent: Synced service 'consul'
2016/01/04 19:18:39 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:41 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:42 [INFO] memberlist: Marking node2 as failed, suspect timeout reached
2016/01/04 19:18:42 [INFO] serf: EventMemberFailed: node2 10.20.0.202
2016/01/04 19:18:42 [INFO] consul: removing server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
And for server-2:
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node2 10.20.0.202
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node2.dc1 10.20.0.202
2016/01/04 19:18:10 [INFO] raft: Node at 10.20.0.202:8300 [Follower] entering Follower state
2016/01/04 19:18:10 [INFO] agent: (LAN) joining: [10.20.0.51]
2016/01/04 19:18:10 [INFO] consul: adding server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:10 [INFO] consul: adding server node2.dc1 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node1 10.20.0.51
2016/01/04 19:18:10 [INFO] agent: (LAN) joined: 1 Err: <nil>
2016/01/04 19:18:10 [ERR] agent: failed to sync remote state: No cluster leader
2016/01/04 19:18:10 [INFO] consul: adding server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:12 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:14 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:16 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:17 [INFO] memberlist: Marking node1 as failed, suspect timeout reached
2016/01/04 19:18:17 [INFO] serf: EventMemberFailed: node1 10.20.0.51
2016/01/04 19:18:17 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:17 [INFO] consul: removing server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:19 [INFO] serf: EventMemberJoin: node1 10.20.0.51
2016/01/04 19:18:19 [INFO] consul: adding server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:19 [INFO] consul: New leader elected: node1
2016/01/04 19:18:21 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:22 [INFO] agent: Synced service 'consul'
2016/01/04 19:18:23 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:25 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:26 [INFO] memberlist: Marking node1 as failed, suspect timeout reached
2016/01/04 19:18:26 [INFO] serf: EventMemberFailed: node1 10.20.0.51
2016/01/04 19:18:26 [INFO] consul: removing server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:26 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:40 [INFO] serf: attempting reconnect to node1 10.20.0.51:8301
2016/01/04 19:18:40 [INFO] serf: EventMemberJoin: node1 10.20.0.51
What exactly am I doing wrong? All I want is to run two Consul Docker containers on two EC2 instances and have them communicate without explicitly opening the ports in the security group. (When I explicitly open them, it works, of course!)
Can somebody please help?
Thanks
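One thing that may be worth checking (an assumption based on the commands above, not a confirmed diagnosis): Consul's LAN gossip on port 8301 uses both TCP and UDP, while `docker run -p 8301:8301` publishes the TCP port only. The Python sketch below parses the `-p` flags of a `docker run` command and lists gossip ports that have no UDP mapping. The helper names are hypothetical, and the command string is the server-1 command from the question.

```python
import shlex

# The server-1 command from the question, unchanged.
SERVER_1 = (
    "docker run -it -p 8400:8400 -p 8500:8500 -p 8600:53/udp "
    "-p 8301:8301 -p 8300:8300 -h node1 progrium/consul "
    "-server -advertise 10.20.0.51 -bootstrap-expect 2"
)

def published_ports(docker_cmd):
    """Return a set of (host_port, protocol) pairs taken from the -p flags."""
    tokens = shlex.split(docker_cmd)
    result = set()
    for i, tok in enumerate(tokens):
        if tok == "-p" and i + 1 < len(tokens):
            # A spec looks like [ip:]host:container[/proto]; proto defaults to tcp.
            mapping, _, proto = tokens[i + 1].partition("/")
            parts = mapping.split(":")
            host_port = parts[-2] if len(parts) >= 2 else parts[0]
            result.add((int(host_port), proto or "tcp"))
    return result

def missing_udp(docker_cmd, gossip_ports=(8301,)):
    """List gossip ports that are not published over UDP."""
    ports = published_ports(docker_cmd)
    return [p for p in gossip_ports if (p, "udp") not in ports]

print(missing_udp(SERVER_1))
```

If the list is not empty, adding a matching `-p 8301:8301/udp` to both commands would be the thing to try.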
I am running a single-node Consul (v1.8.4) on Ubuntu 18.04. The consul service is up, and I have set ui to true (default).
But when I try to access http://192.168.37.128:8500/ui, I get:
This site can’t be reached 192.168.37.128 took too long to respond.
ui.json
{
"addresses": {
"http": "0.0.0.0"
}
}
consul.service file:
[Unit]
Description=Consul
Documentation=https://www.consul.io/
[Service]
ExecStart=/usr/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=-config-dir=/etc/consul.d/
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
systemctl status consul
● consul.service - Consul
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2020-10-04 19:19:08 CDT; 50min ago
Docs: https://www.consul.io/
Main PID: 9477 (consul)
Tasks: 9 (limit: 4980)
CGroup: /system.slice/consul.service
└─9477 /opt/consul/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=1
agent.server.raft: heartbeat timeout reached, starting election: last-leader=
agent.server.raft: entering candidate state: node="Node at 192.168.37.128:8300 [Candid
agent.server.raft: election won: tally=1
agent.server.raft: entering leader state: leader="Node at 192.168.37.128:8300 [Leader]
agent.server: cluster leadership acquired
agent.server: New leader elected: payload=vault
agent.leader: started routine: routine="federation state anti-entropy"
agent.leader: started routine: routine="federation state pruning"
agent.leader: started routine: routine="CA root pruning"
agent: Synced node info
The log shows the server bound at 192.168.37.128:8300.
The issue was the firewall; I had to open port 8500:
sudo ufw allow 8500/tcp
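A quick way to tell a firewall problem from a Consul problem is to test whether the TCP port accepts connections from the client machine at all. This is a minimal Python sketch; the function name `port_open` is made up for illustration, and the host and port in the comment are the ones from this question.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Example call, run from the machine where the browser is:
#   port_open("192.168.37.128", 8500)
```

A timeout (rather than an immediate refusal) usually points at packets being dropped along the way, which fits the "took too long to respond" message in the browser.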
Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have a Storm topology with 3 bolts (A, B, C), where the middle bolt takes around 450 ms on average and the other two bolts take less than 1 ms each.
I am able to run the topology with the following parallelism hint values:
A: 4
B: 700
C: 10
But when I increase the parallelism hint of B to 1200, the topology does not start.
In the topology logs, I see log lines for loading executor B many times, like this:
2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]
But in between, the worker process gets restarted. I don't see any error in the topology logs or the Storm logs. These are the Storm logs from when the worker gets restarted:
2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0
This keeps happening and the topology never starts, although it used to start perfectly when the parallelism hint for bolt B was 700; there is no other change.
One interesting log line is this one, though I am not yet sure what it means:
Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
Any suggestions?
Edit:
Config:
topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g
Edit:
Logs for strace -fp PID -e trace=read,write,network,signal,ipc in gist.
I am not yet able to understand it fully; here are some relevant-looking lines from it:
[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 3362] kill(1487, SIGTERM) = 0
[pid 3362] close(1)
A quick Google search suggests that 143 is the exit code a JVM returns when it receives a SIGTERM (see, e.g., "Always app Java end with "Exit 143" Ubuntu"). You might be running out of memory, or the OS may be killing the process for some other reason. Remember that setting the parallelism hint to 1200 means you will get 1200 tasks (copies) of bolt B, where you only had 700 before.
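The arithmetic behind that exit code is the common convention that a process killed by signal N is reported with exit status 128 + N, so 143 - 128 = 15, which is SIGTERM. A small Python sketch of the decoding (the function name is hypothetical; signal names are resolved on the machine running it):

```python
import signal

def signal_from_exit_code(code):
    """Map an exit status of the form 128 + N to the name of signal N, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None

print(signal_from_exit_code(143))
```

This matches the `kill(1487, SIGTERM)` call visible in the strace output: something asked the worker to stop, rather than the worker crashing on its own.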
I was able to get this running by tweaking the following configurations. It seems it was timing out due to nimbus.task.launch.secs, which was set to 120, so the worker was restarted if it had not started within 120 seconds.
Updated value of some of these configs:
drpc.request.timeout.secs: 1600
supervisor.worker.start.timeout.secs: 1200
nimbus.supervisor.timeout.secs: 1200
nimbus.task.launch.secs: 1200
About nimbus.task.launch.secs:
A special timeout used when a task is initially launched. During launch, this is the timeout used until the first heartbeat, overriding nimbus.task.timeout.secs.
A separate timeout exists for launch because there can be quite a bit of overhead to launching new JVM's and configuring them.
I am trying to run the Consul image on a Mac, forwarding port 8500 for simple tests.
My command to run the image is:
docker run -it -p 8500:8500 consul agent -server -bootstrap 0.0.0.0
I do not use --net=host since it does not work on Mac, so I forward port 8500 instead.
When I try to telnet from my Mac, the connection is closed immediately:
user$ telnet localhost 8500
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
Or when I try to add a new value I get:
consul kv put foo bar
Error! Failed writing data: Put http://127.0.0.1:8500/v1/kv/foo: dial tcp 127.0.0.1:8500: getsockopt: connection refused
What did I miss?
I have just tried what you posted, and it seems that port 8500 is open:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f4ac8a5233e2 consul "docker-entrypoint..." 2 minutes ago Up 2 minutes 8300-8302/tcp, 8301-8302/udp, 8600/tcp, 8600/udp, 0.0.0.0:8500->8500/tcp sharp_knuth
And I get this:
Trying 0.0.0.0...
Connected to dev-consul
Escape character is '^]'.
Connection closed by foreign host.
However, it is running as you can see from the logs:
==> Starting Consul agent...
==> Consul agent running!
Version: 'v0.9.3'
Node ID: '27998add-58f9-e424-84a0-038db228629f'
Node name: '68bfdf141e7f'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2017/10/02 20:26:27 [DEBUG] Using random ID "27998add-58f9-e424-84a0-038db228629f" as node ID
2017/10/02 20:26:27 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:127.0.0.1:8300 Address:127.0.0.1:8300}]
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2017/10/02 20:26:27 [INFO] serf: EventMemberJoin: 68bfdf141e7f.dc1 127.0.0.1
2017/10/02 20:26:27 [INFO] serf: EventMemberJoin: 68bfdf141e7f 127.0.0.1
2017/10/02 20:26:27 [INFO] consul: Adding LAN server 68bfdf141e7f (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2017/10/02 20:26:27 [INFO] consul: Handled member-join event for server "68bfdf141e7f.dc1" in area "wan"
2017/10/02 20:26:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2017/10/02 20:26:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2017/10/02 20:26:27 [INFO] agent: Started HTTP server on [::]:8500
2017/10/02 20:26:27 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2017/10/02 20:26:27 [DEBUG] raft: Votes needed: 1
2017/10/02 20:26:27 [DEBUG] raft: Vote granted from 127.0.0.1:8300 in term 2. Tally: 1
2017/10/02 20:26:27 [INFO] raft: Election won. Tally: 1
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Leader] entering Leader state
2017/10/02 20:26:27 [INFO] consul: cluster leadership acquired
2017/10/02 20:26:27 [DEBUG] consul: Skipping self join check for "68bfdf141e7f" since the cluster is too small
2017/10/02 20:26:27 [INFO] consul: member '68bfdf141e7f' joined, marking health alive
2017/10/02 20:26:27 [INFO] consul: New leader elected: 68bfdf141e7f
2017/10/02 20:26:28 [INFO] agent: Synced node info
2017/10/02 20:27:27 [DEBUG] consul: Skipping self join check for "68bfdf141e7f" since the cluster is too small
2017/10/02 20:27:34 [DEBUG] agent: Node info in sync
I added a Consul agent to the host in client mode and registered a service.
Now it constantly and silently removes the service and registers it again:
2017/01/27 08:25:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:26:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:28:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:29:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:30:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:31:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:33:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:35:23 [INFO] consul: member 'static' joined, marking health alive
2017/01/27 08:37:23 [INFO] consul: member 'static' joined, marking health alive
The service config is simple:
{
"service": {
"tags": [
"master"
],
"address": "172.16.50.40",
"port": 5432,
"name": "staging-postgres"
}
}
Is it possible to register a service permanently and deregister it only manually?
Consul services are usually registered to a specific node (member). When that member leaves the cluster, it is assumed that all its services are also unhealthy, therefore they are marked as unhealthy.
It would be helpful to know why "static" continues to join and leave the cluster. If that's a behavior that cannot be prevented, it might be best to register your service as an external service.
$ curl -X PUT -d '{"Datacenter": "dc1", "Node": "google",
  "Address": "www.google.com",
  "Service": {"Service": "search", "Port": 80}}' \
  http://127.0.0.1:8500/v1/catalog/register
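For the Postgres service from the question, the same catalog call could be built in Python roughly as follows. This is only a sketch: the node name `postgres-external` is invented for the example, the agent address is assumed to be the local default used in the curl command, and the function only constructs the request without sending it.

```python
import json
import urllib.request

def build_register_request(node, address, service, port,
                           datacenter="dc1",
                           consul_url="http://127.0.0.1:8500"):
    """Build (but do not send) a PUT request for /v1/catalog/register."""
    payload = {
        "Datacenter": datacenter,
        "Node": node,
        "Address": address,
        "Service": {"Service": service, "Port": port},
    }
    return urllib.request.Request(
        consul_url + "/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# "postgres-external" is a hypothetical node name chosen for this example.
req = build_register_request("postgres-external", "172.16.50.40",
                             "staging-postgres", 5432)
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing `req` to `urllib.request.urlopen`. Because the entry is written to the catalog directly instead of being owned by the "static" agent, it should not follow that agent's join and leave cycle.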
When I tried to run a topology from my Storm client, I got an error pointing to a failed connection with Nimbus.
I checked my Nimbus log and here is what it shows:
2014-04-25 11:05:03 nimbus [INFO] Uploading file from client to storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Finished uploading file from client: storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Received topology submission for beat with conf {"topology.max.task.parallelism" nil, "topology.acker.executors" 1, "topology.kryo.register" nil, "topology.kryo.decorators" (), "topology.nam$
2014-04-25 11:05:03 nimbus [INFO] Activating beat: beat-2-1398416703
2014-04-25 11:05:03 EvenScheduler [INFO] Available slots: (["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6702] ["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6703] ["8f506a92-4a1b-4cc6-8f80-ed53ea810256" 6701] ["8f506a92-4a1b-4cc6-8f80-e$
2014-04-25 11:05:03 nimbus [INFO] Setting new assignment for topology id beat-2-1398416703: #backtype.storm.daemon.common.Assignment{:master-code-dir "storm-local/nimbus/stormdist/beat-2-1398416703", :node->host {"c3a1bab3-e$
2014-04-25 12:08:03 nimbus [INFO] Cleaning inbox ... deleted: stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 13:59:47 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?
2014-04-25 14:00:16 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?
Any clarification?
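A possible reading of the error, offered as a hypothesis rather than a diagnosis: TFramedTransport prefixes each message with a 4-byte big-endian signed length, so the "invalid frame size" is simply the first four bytes the server received from some client. Re-packing -720899 recovers those bytes, and they happen to match Telnet command codes from RFC 854 (IAC IP IAC DO). That would be consistent with something like a telnet session or a port probe hitting the Nimbus Thrift port, not a Storm client. Note also that both errors are logged hours after the 11:05 submission. The Python sketch below shows the decoding; the function and table names are made up for illustration.

```python
import struct

# Telnet command codes defined in RFC 854.
TELNET_CODES = {0xFF: "IAC", 0xF4: "IP", 0xFD: "DO"}

def frame_size_bytes(size):
    """Return the 4 bytes that a framed-transport server read as the frame length."""
    return struct.pack(">i", size)

def describe(raw):
    """Label each byte with its Telnet command name where one is known."""
    return [TELNET_CODES.get(b, hex(b)) for b in raw]

raw = frame_size_bytes(-720899)
print(raw.hex())      # fff4fffd
print(describe(raw))  # ['IAC', 'IP', 'IAC', 'DO']
```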