I am using ECK to deploy an Elasticsearch cluster on Kubernetes.
My Elasticsearch cluster is working fine and its health shows green. But when Enterprise Search starts and begins creating indices in Elasticsearch, it fails with a timeout error after creating some of them.
pv.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-master
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/nfs/kubernetes/elasticsearch/master/
...
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-data
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/nfs/kubernetes/elasticsearch/data/
...
multi_node.yaml
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: bselastic
spec:
  version: 8.1.2
  nodeSets:
    - name: masters
      count: 1
      config:
        node.roles: ["master",
                     # "data",
                    ]
        xpack.ml.enabled: true
      # A volumeClaimTemplate is required here; without it the operator reported
      # a missing volume claim and the pod did not start.
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
            storageClassName: standard
    - name: data-node
      count: 1
      config:
        node.roles: ["data", "ingest"]
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
            storageClassName: standard
...
---
apiVersion: enterprisesearch.k8s.elastic.co/v1
kind: EnterpriseSearch
metadata:
  name: enterprise-search-bselastic
spec:
  version: 8.1.3
  count: 1
  elasticsearchRef:
    name: bselastic
  podTemplate:
    spec:
      containers:
        - name: enterprise-search
          env:
            - name: JAVA_OPTS
              value: -Xms2g -Xmx2g
            - name: "elasticsearch.startup_retry.interval"
              value: "30"
            - name: allow_es_settings_modification
              value: "true"
...
Apply these changes using the command below.
kubectl apply -f multi_node.yaml -n deleteme -f pv.yaml
Check the Elasticsearch cluster status
# kubectl get es -n deleteme
NAME        HEALTH    NODES   VERSION   PHASE             AGE
bselastic   unknown           8.1.2     ApplyingChanges   47s
Check all pods
# kubectl get pod -n deleteme
NAME                                               READY   STATUS    RESTARTS   AGE
bselastic-es-data-node-0                           0/1     Running   0          87s
bselastic-es-masters-0                             1/1     Running   0          87s
enterprise-search-bselastic-ent-54675f95f8-9sskf   0/1     Running   0          86s
The Elasticsearch cluster becomes green after 7+ minutes.
[root@1175014-kubemaster01 nilesh]# kubectl get es -n deleteme
NAME        HEALTH   NODES   VERSION   PHASE   AGE
bselastic   green    2       8.1.2     Ready   7m30s
Enterprise Search log
# kubectl -n deleteme logs -f enterprise-search-bselastic-ent-549bbcb9-rnhmc
Custom Enterprise Search configuration file detected, not overwriting it (any settings passed via environment will be ignored)
Found java executable in PATH
Java version detected: 11.0.14.1 (major version: 11)
Enterprise Search is starting...
[2022-04-25T16:34:22.282+00:00][7][2000][app-server][INFO]: Elastic Enterprise Search version=8.1.3, JRuby version=9.2.16.0, Ruby version=2.5.7, Rails version=5.2.6
[2022-04-25T16:34:23.862+00:00][7][2000][app-server][INFO]: Performing pre-flight checks for Elasticsearch running on https://bselastic-es-http.deleteme.svc:9200...
[2022-04-25T16:34:25.308+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Elasticsearch backend. Make sure it is running and healthy.
[2022-04-25T16:34:25.310+00:00][7][2000][app-server][INFO]: [pre-flight] Error: /usr/share/enterprise-search/lib/war/shared_togo/lib/shared_togo/elasticsearch_checks.class:187: Connection refused (Connection refused) (Faraday::ConnectionFailed)
[2022-04-25T16:34:31.353+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Elasticsearch backend. Make sure it is running and healthy.
[2022-04-25T16:34:31.355+00:00][7][2000][app-server][INFO]: [pre-flight] Error: /usr/share/enterprise-search/lib/war/shared_togo/lib/shared_togo/elasticsearch_checks.class:187: Connection refused (Connection refused) (Faraday::ConnectionFailed)
[2022-04-25T16:34:37.370+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Elasticsearch backend. Make sure it is running and healthy.
[2022-04-25T16:34:37.372+00:00][7][2000][app-server][INFO]: [pre-flight] Error: /usr/share/enterprise-search/lib/war/shared_togo/lib/shared_togo/elasticsearch_checks.class:187: Connection refused (Connection refused) (Faraday::ConnectionFailed)
[2022-04-25T16:34:43.384+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Elasticsearch backend. Make sure it is running and healthy.
[2022-04-25T16:34:43.386+00:00][7][2000][app-server][INFO]: [pre-flight] Error: /usr/share/enterprise-search/lib/war/shared_togo/lib/shared_togo/elasticsearch_checks.class:187: Connection refused (Connection refused) (Faraday::ConnectionFailed)
[2022-04-25T16:34:49.400+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Elasticsearch backend. Make sure it is running and healthy.
[2022-04-25T16:34:49.401+00:00][7][2000][app-server][INFO]: [pre-flight] Error: /usr/share/enterprise-search/lib/war/shared_togo/lib/shared_togo/elasticsearch_checks.class:187: Connection refused (Connection refused) (Faraday::ConnectionFailed)
[2022-04-25T16:37:56.290+00:00][7][2000][app-server][INFO]: [pre-flight] Elasticsearch cluster is ready
[2022-04-25T16:37:56.292+00:00][7][2000][app-server][INFO]: [pre-flight] Successfully connected to Elasticsearch
[2022-04-25T16:37:56.367+00:00][7][2000][app-server][INFO]: [pre-flight] Successfully loaded Elasticsearch plugin information for all nodes
[2022-04-25T16:37:56.381+00:00][7][2000][app-server][INFO]: [pre-flight] Elasticsearch running with an active basic license
[2022-04-25T16:37:56.423+00:00][7][2000][app-server][INFO]: [pre-flight] Elasticsearch API key service is enabled
[2022-04-25T16:37:56.446+00:00][7][2000][app-server][INFO]: [pre-flight] Elasticsearch will be used for authentication
[2022-04-25T16:37:56.447+00:00][7][2000][app-server][INFO]: Elasticsearch looks healthy and configured correctly to run Enterprise Search
[2022-04-25T16:37:56.452+00:00][7][2000][app-server][INFO]: Performing pre-flight checks for Kibana running on http://localhost:5601...
[2022-04-25T16:37:56.482+00:00][7][2000][app-server][WARN]: [pre-flight] Failed to connect to Kibana backend. Make sure it is running and healthy.
[2022-04-25T16:37:56.486+00:00][7][2000][app-server][ERROR]: Could not connect to Kibana backend after 0 seconds.
[2022-04-25T16:37:56.488+00:00][7][2000][app-server][WARN]: Enterprise Search is unable to connect to Kibana. Ensure it is running at http://localhost:5601 for user deleteme-enterprise-search-bselastic-ent-user.
[2022-04-25T16:37:59.344+00:00][7][2000][app-server][INFO]: Elastic APM agent is disabled
{"timestamp": "2022-04-25T16:38:05+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"timestamp": "2022-04-25T16:38:06+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"timestamp": "2022-04-25T16:38:16+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"timestamp": "2022-04-25T16:38:26+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"timestamp": "2022-04-25T16:38:36+00:00", "message": "readiness probe failed", "curl_rc": "7"}
[2022-04-25T16:38:43.880+00:00][7][2000][app-server][INFO]: [db_lock] [installation] Status: [Starting] Ensuring migrations tracking index exists
{"timestamp": "2022-04-25T16:38:45+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"timestamp": "2022-04-25T16:38:56+00:00", "message": "readiness probe failed", "curl_rc": "7"}
[2022-04-25T16:39:05.283+00:00][7][2000][app-server][INFO]: [db_lock] [installation] Status: [Finished] Ensuring migrations tracking index exists
[2022-04-25T16:39:05.782+00:00][7][2000][app-server][INFO]: [db_lock] [installation] Status: [Starting] Creating indices for 38 models
[2022-05-02T16:21:47.303+00:00][8][2000][es][DEBUG]: {
"request": {
"url": "https://bselastic-es-http.deleteme.svc:9200/.ent-search-actastic-oauth_applications_v2",
"method": "put",
"headers": {
"Authorization": "[FILTERED]",
"Content-Type": "application/json",
"x-elastic-product-origin": "enterprise-search",
"User-Agent": "Faraday v1.8.0"
},
"params": null,
"body": "{\"settings\":{\"index\":{\"hidden\":true,\"refresh_interval\":-1},\"number_of_shards\":1,\"auto_expand_replicas\":\"0-3\",\"priority\":250},\"mappings\":{\"dynamic\":\"strict\",\"properties\":{\"id\":{\"type\":\"keyword\"},\"created_at\":{\"type\":\"date\"},\"updated_at\":{\"type\":\"date\"},\"name\":{\"type\":\"keyword\"},\"uid\":{\"type\":\"keyword\"},\"secret\":{\"type\":\"keyword\"},\"redirect_uri\":{\"type\":\"keyword\"},\"scopes\":{\"type\":\"keyword\"},\"confidential\":{\"type\":\"boolean\"},\"app_type\":{\"type\":\"keyword\"}}},\"aliases\":{}}"
},
"exception": "/usr/share/enterprise-search/lib/war/lib/swiftype/es/client.class:28: Read timed out (Faraday::TimeoutError)\n",
"duration": 30042.3,
"stack": [
"lib/actastic/schema.class:172:in `create_index!'",
"lib/actastic/schema.class:195:in `create_index_and_mapping!'",
"shared_togo/lib/shared_togo.class:894:in `block in apply_actastic_migrations'",
"shared_togo/lib/shared_togo.class:892:in `block in each'",
"shared_togo/lib/shared_togo.class:892:in `block in apply_actastic_migrations'",
"lib/db_lock.class:182:in `with_status'",
"shared_togo/lib/shared_togo.class:891:in `apply_actastic_migrations'",
"shared_togo/lib/shared_togo.class:406:in `block in install!'",
"lib/db_lock.class:171:in `with_lock'",
"shared_togo/lib/shared_togo.class:399:in `install!'",
"config/application.class:102:in `block in Application'",
"config/environment.class:9:in `<main>'",
"config/environment.rb:1:in `<main>'",
"shared_togo/lib/shared_togo/cli/command.class:37:in `initialize'",
"shared_togo/lib/shared_togo/cli/command.class:10:in `run_and_exit'",
"shared_togo/lib/shared_togo/cli.class:143:in `run_supported_command'",
"shared_togo/lib/shared_togo/cli.class:125:in `run_command'",
"shared_togo/lib/shared_togo/cli.class:112:in `run!'",
"bin/enterprise-search-internal:15:in `<main>'"
]
}
[2022-04-25T16:55:21.340+00:00][7][2000][app-server][INFO]: [db_lock] [installation] Status: [Failed] Creating indices for 38 models: Error = Faraday::TimeoutError: Read timed out
Unexpected exception while running Enterprise Search:
Error: Read timed out at
Master node logs
# kubectl -n deleteme logs -f bselastic-es-masters-0
Skipping security auto configuration because the configuration file [/usr/share/elasticsearch/config/elasticsearch.yml] is missing or is not a regular file
{"#timestamp":"2022-04-25T16:55:11.051Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ent-search-actastic-search_relevance_suggestions-document_position_id-unique-constraint][0]]]).","previous.health":"YELLOW","reason":"shards started [[.ent-search-actastic-search_relevance_suggestions-document_position_id-unique-constraint][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[bselastic-es-masters-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"rnaZmz4kQwOBNbWau43wYA","elasticsearch.node.id":"YMyOM1umSL22ro86II6Ymw","elasticsearch.node.name":"bselastic-es-masters-0","elasticsearch.cluster.name":"bselastic"}
{"#timestamp":"2022-04-25T16:55:21.447Z", "log.level": "WARN", "message":"writing cluster state took [10525ms] which is above the warn threshold of [10s]; [skipped writing] global metadata, wrote metadata for [0] new indices and [1] existing indices, removed metadata for [0] indices and skipped [48] unchanged indices", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[bselastic-es-masters-0][generic][T#5]","log.logger":"org.elasticsearch.gateway.PersistedClusterStateService","elasticsearch.cluster.uuid":"rnaZmz4kQwOBNbWau43wYA","elasticsearch.node.id":"YMyOM1umSL22ro86II6Ymw","elasticsearch.node.name":"bselastic-es-masters-0","elasticsearch.cluster.name":"bselastic"}
{"#timestamp":"2022-04-25T16:55:21.448Z", "log.level": "INFO", "message":"after [10.3s] publication of cluster state version [226] is still waiting for {bselastic-es-masters-0}{YMyOM1umSL22ro86II6Ymw}{ljGkLdk-RAukc9NEJtQCVw}{192.168.88.213}{192.168.88.213:9300}{m}{k8s_node_name=1175027-kubeworker15.sb.rackspace.com, xpack.installed=true} [SENT_APPLY_COMMIT], {bselastic-es-data-node-0}{K88khDyfRwaGCBZwMKEaHA}{g9mXrT4WTumoj09W1OylYA}{192.168.88.214}{192.168.88.214:9300}{di}{k8s_node_name=1175027-kubeworker15.sb.rackspace.com, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[bselastic-es-masters-0][generic][T#1]","log.logger":"org.elasticsearch.cluster.coordination.Coordinator.CoordinatorPublication","elasticsearch.cluster.uuid":"rnaZmz4kQwOBNbWau43wYA","elasticsearch.node.id":"YMyOM1umSL22ro86II6Ymw","elasticsearch.node.name":"bselastic-es-masters-0","elasticsearch.cluster.name":"bselastic"}
Which attribute do we have to set in Enterprise Search to increase this timeout? Or is there a way to get debug logs from Enterprise Search?
You can try increasing the client's default timeout globally, for example with the Python Elasticsearch client:
from elasticsearch import Elasticsearch
es = Elasticsearch(timeout=30, max_retries=10, retry_on_timeout=True)
This gives the cluster more time to respond.
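If the goal is to change settings of Enterprise Search itself rather than a client, note that with ECK such settings normally go under spec.config of the EnterpriseSearch resource (which is rendered into enterprise-search.yml), not as container environment variables. A minimal sketch, reusing the resource names above; whether a particular key (for example a timeout or log level) is supported should be verified against the Enterprise Search configuration docs for your version:
---
apiVersion: enterprisesearch.k8s.elastic.co/v1
kind: EnterpriseSearch
metadata:
  name: enterprise-search-bselastic
spec:
  version: 8.1.3
  count: 1
  elasticsearchRef:
    name: bselastic
  config:
    # keys placed here end up in enterprise-search.yml
    allow_es_settings_modification: true
    log_level: debug   # assumption: confirm this key exists for 8.1 before relying on it
...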
I am running a single-node Consul (v1.8.4) on Ubuntu 18.04. The consul service is up, and I have the UI set to true (the default).
But when I try to access http://192.168.37.128:8500/ui I get:
This site can't be reached. 192.168.37.128 took too long to respond.
ui.json
{
  "addresses": {
    "http": "0.0.0.0"
  }
}
consul.service file:
[Unit]
Description=Consul
Documentation=https://www.consul.io/
[Service]
ExecStart=/usr/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind= -config-dir=/etc/consul.d/
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
systemctl status consul
● consul.service - Consul
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2020-10-04 19:19:08 CDT; 50min ago
Docs: https://www.consul.io/
Main PID: 9477 (consul)
Tasks: 9 (limit: 4980)
CGroup: /system.slice/consul.service
└─9477 /opt/consul/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=1
agent.server.raft: heartbeat timeout reached, starting election: last-leader=
agent.server.raft: entering candidate state: node="Node at 192.168.37.128:8300 [Candid
agent.server.raft: election won: tally=1
agent.server.raft: entering leader state: leader="Node at 192.168.37.128:8300 [Leader]
agent.server: cluster leadership acquired
agent.server: New leader elected: payload=vault
agent.leader: started routine: routine="federation state anti-entropy"
agent.leader: started routine: routine="federation state pruning"
agent.leader: started routine: routine="CA root pruning"
agent: Synced node info
Shows bind at 192.168.37.128:8300
The issue was the firewall; I had to open port 8500:
sudo ufw allow 8500/tcp
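To confirm the change, a quick check (assuming ufw and curl are available on the host):
sudo ufw status | grep 8500
curl -I http://192.168.37.128:8500/ui/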
I just got an Amazon EC2 instance and installed Erlang, Elixir and PostgreSQL.
I put a basic Phoenix app on it.
When I run mix phx.server
it starts on localhost: http://localhost:4000/
But I want to run it on the Amazon public IP.
So I put this in config/dev.exs:
http: [ip: {1, 2, 3, 4}, port: 4000]
After this I created a security group and allowed all traffic.
Now when I start the app using sudo mix phx.server
I get the error below:
Compiling 10 files (.ex)
Generated myapp_test app
[error] Failed to start Ranch listener myappTestWeb.Endpoint.HTTP in :ranch_tcp:listen([port: 4000, ip: {1, 2, 3, 4}]) for reason :eaddrnotavail (can't assign requested address)
[info] Application myapp_test exited: myappTest.Application.start(:normal, []) returned an error: shutdown: failed to start child: myappTestWeb.Endpoint
** (EXIT) shutdown: failed to start child: Phoenix.Endpoint.Handler
** (EXIT) shutdown: failed to start child: {:ranch_listener_sup, myappTestWeb.Endpoint.HTTP}
** (EXIT) shutdown: failed to start child: :ranch_acceptors_sup
** (EXIT) {:listen_error, myappTestWeb.Endpoint.HTTP, :eaddrnotavail}
[info] Application phoenix_ecto exited: :stopped
[info] Application ecto exited: :stopped
[info] Application poolboy exited: :stopped
[info] Application postgrex exited: :stopped
[info] Application decimal exited: :stopped
[info] Application db_connection exited: :stopped
[info] Application connection exited: :stopped
[info] Application cowboy exited: :stopped
[info] Application cowlib exited: :stopped
[info] Application ranch exited: :stopped
[info] Application runtime_tools exited: :stopped
=INFO REPORT==== 23-Jan-2018::10:48:23 ===
application: logger
exited: stopped
type: temporary
** (Mix) Could not start application myapp_test: myappTest.Application.start(:normal, []) returned an error: shutdown: failed to start child: myappTestWeb.Endpoint
** (EXIT) shutdown: failed to start child: Phoenix.Endpoint.Handler
** (EXIT) shutdown: failed to start child: {:ranch_listener_sup, myappTestWeb.Endpoint.HTTP}
** (EXIT) shutdown: failed to start child: :ranch_acceptors_sup
** (EXIT) {:listen_error, myappTestWeb.Endpoint.HTTP, :eaddrnotavail}
When I put the public IP in the browser, it also does not work.
Do I need to install Apache or another web server?
Or do I need to bind the Amazon public IP anywhere on the system?
Any insight on how to fix the issue will be greatly appreciated.
Thanks
Your best bet at this point is to start isolating what is failing. Once you can identify components that should be working and aren't, you'll be able to make your question more focused. Some troubleshooting ideas to get you started:
can you ping the ec2 public address from your machine?
does it have that address (ip address show from the ec2 terminal)?
can the ec2 machine ping out to an external ip, like google's dns (ping 8.8.8.8)?
use netcat to see if the port is truly open: sudo nc -l 80 (on the ec2 host) and nc <ec2-ip> 80 on your machine. Then you should be able to type in your machine (make sure you hit enter after some characters) and see it appear on the ec2 host.
remove the address from your cowboy config, and let it bind to 0.0.0.0 (the default), then see if you can reach it; a sketch of that config is below.
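For that last point, a minimal sketch of the endpoint configuration, assuming the app and endpoint names implied by the error output (myapp_test / MyappTestWeb.Endpoint); adjust them to your project:
# config/dev.exs
config :myapp_test, MyappTestWeb.Endpoint,
  # Bind to all interfaces. The instance only owns its private address,
  # so binding directly to the public IP fails with :eaddrnotavail.
  http: [ip: {0, 0, 0, 0}, port: 4000]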
I am trying to run the Consul image on a Mac, forwarding port 8500 for simple tests.
My command to run the image is:
docker run -it -p 8500:8500 consul agent -server -bootstrap 0.0.0.0
I do not use --net=host since it does not work on Mac, so I forward 8500 instead.
When I try to telnet from my Mac the connection gets immediately closed:
user$ telnet localhost 8500
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
Or when I try to add a new value I get:
consul kv put foo bar
Error! Failed writing data: Put http://127.0.0.1:8500/v1/kv/foo: dial tcp 127.0.0.1:8500: getsockopt: connection refused
What did I miss?
I have just tried what you posted, and it seems that port 8500 is open:
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS                                                                       NAMES
f4ac8a5233e2   consul    "docker-entrypoint..."   2 minutes ago   Up 2 minutes   8300-8302/tcp, 8301-8302/udp, 8600/tcp, 8600/udp, 0.0.0.0:8500->8500/tcp   sharp_knuth
And I get this:
Trying 0.0.0.0...
Connected to dev-consul
Escape character is '^]'.
Connection closed by foreign host.
However, it is running as you can see from the logs:
==> Starting Consul agent...
==> Consul agent running!
Version: 'v0.9.3'
Node ID: '27998add-58f9-e424-84a0-038db228629f'
Node name: '68bfdf141e7f'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2017/10/02 20:26:27 [DEBUG] Using random ID "27998add-58f9-e424-84a0-038db228629f" as node ID
2017/10/02 20:26:27 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:127.0.0.1:8300 Address:127.0.0.1:8300}]
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2017/10/02 20:26:27 [INFO] serf: EventMemberJoin: 68bfdf141e7f.dc1 127.0.0.1
2017/10/02 20:26:27 [INFO] serf: EventMemberJoin: 68bfdf141e7f 127.0.0.1
2017/10/02 20:26:27 [INFO] consul: Adding LAN server 68bfdf141e7f (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2017/10/02 20:26:27 [INFO] consul: Handled member-join event for server "68bfdf141e7f.dc1" in area "wan"
2017/10/02 20:26:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2017/10/02 20:26:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2017/10/02 20:26:27 [INFO] agent: Started HTTP server on [::]:8500
2017/10/02 20:26:27 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2017/10/02 20:26:27 [DEBUG] raft: Votes needed: 1
2017/10/02 20:26:27 [DEBUG] raft: Vote granted from 127.0.0.1:8300 in term 2. Tally: 1
2017/10/02 20:26:27 [INFO] raft: Election won. Tally: 1
2017/10/02 20:26:27 [INFO] raft: Node at 127.0.0.1:8300 [Leader] entering Leader state
2017/10/02 20:26:27 [INFO] consul: cluster leadership acquired
2017/10/02 20:26:27 [DEBUG] consul: Skipping self join check for "68bfdf141e7f" since the cluster is too small
2017/10/02 20:26:27 [INFO] consul: member '68bfdf141e7f' joined, marking health alive
2017/10/02 20:26:27 [INFO] consul: New leader elected: 68bfdf141e7f
2017/10/02 20:26:28 [INFO] agent: Synced node info
2017/10/02 20:27:27 [DEBUG] consul: Skipping self join check for "68bfdf141e7f" since the cluster is too small
2017/10/02 20:27:34 [DEBUG] agent: Node info in sync
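One way to double-check the HTTP API from the host (a sketch; substitute your own container name from docker ps):
curl http://localhost:8500/v1/status/leader
docker exec -it sharp_knuth consul members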
I am sure the answer is out there somewhere, but I cannot find or fix it after several tries. Here is the use case:
1. I have two EC2 instances belonging to the same VPC but with different security groups.
2. Both security groups have 22 and 80 open to the public, and All Traffic on all ports open for the CIDR block 10.20.0.0/16.
3. The internal IPs of the EC2 instances are 10.20.0.51 (server-1) and 10.20.0.202 (server-2).
4. I am using the following commands to run two dockerized Consul servers on them:
server-1 : docker run -it -p 8400:8400 -p 8500:8500 -p 8600:53/udp -p 8301:8301 -p 8300:8300 -h node1 progrium/consul -server -advertise 10.20.0.51 -bootstrap-expect 2
server-2 : docker run -it -p 8400:8400 -p 8500:8500 -p 8600:53/udp -p 8301:8301 -p 8300:8300 --name node2 -h node2 progrium/consul -server -advertise 10.20.0.202 -join 10.20.0.51
5. Both of them start, and for a second they recognise each other; the election happens and the first node is elected. But soon after that, server-2 starts saying "memberlist: Suspect node1 has failed, no acks received" and server-1 also says "memberlist: Suspect node2 has failed, no acks received".
This is what the logs look like for server-1:
2016/01/04 19:18:35 [INFO] serf: EventMemberJoin: node2 10.20.0.202
2016/01/04 19:18:35 [INFO] consul: adding server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:35 [INFO] consul: Attempting bootstrap with nodes: [10.20.0.51:8300 10.20.0.202:8300]
2016/01/04 19:18:35 [WARN] raft: Heartbeat timeout reached, starting election
2016/01/04 19:18:35 [INFO] raft: Node at 10.20.0.51:8300 [Candidate] entering Candidate state
2016/01/04 19:18:35 [WARN] raft: Remote peer 10.20.0.202:8300 does not have local node 10.20.0.51:8300 as a peer
2016/01/04 19:18:35 [INFO] raft: Election won. Tally: 2
2016/01/04 19:18:35 [INFO] raft: Node at 10.20.0.51:8300 [Leader] entering Leader state
2016/01/04 19:18:35 [INFO] consul: cluster leadership acquired
2016/01/04 19:18:35 [INFO] consul: New leader elected: node1
2016/01/04 19:18:35 [INFO] raft: pipelining replication to peer 10.20.0.202:8300
2016/01/04 19:18:35 [INFO] consul: member 'node1' joined, marking health alive
2016/01/04 19:18:35 [INFO] consul: member 'node2' joined, marking health alive
2016/01/04 19:18:37 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:37 [INFO] agent: Synced service 'consul'
2016/01/04 19:18:39 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:41 [INFO] memberlist: Suspect node2 has failed, no acks received
2016/01/04 19:18:42 [INFO] memberlist: Marking node2 as failed, suspect timeout reached
2016/01/04 19:18:42 [INFO] serf: EventMemberFailed: node2 10.20.0.202
2016/01/04 19:18:42 [INFO] consul: removing server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
And for server-2:
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node2 10.20.0.202
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node2.dc1 10.20.0.202
2016/01/04 19:18:10 [INFO] raft: Node at 10.20.0.202:8300 [Follower] entering Follower state
2016/01/04 19:18:10 [INFO] agent: (LAN) joining: [10.20.0.51]
2016/01/04 19:18:10 [INFO] consul: adding server node2 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:10 [INFO] consul: adding server node2.dc1 (Addr: 10.20.0.202:8300) (DC: dc1)
2016/01/04 19:18:10 [INFO] serf: EventMemberJoin: node1 10.20.0.51
2016/01/04 19:18:10 [INFO] agent: (LAN) joined: 1 Err: <nil>
2016/01/04 19:18:10 [ERR] agent: failed to sync remote state: No cluster leader
2016/01/04 19:18:10 [INFO] consul: adding server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:12 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:14 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:16 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:17 [INFO] memberlist: Marking node1 as failed, suspect timeout reached
2016/01/04 19:18:17 [INFO] serf: EventMemberFailed: node1 10.20.0.51
2016/01/04 19:18:17 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:17 [INFO] consul: removing server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:19 [INFO] serf: EventMemberJoin: node1 10.20.0.51
2016/01/04 19:18:19 [INFO] consul: adding server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:19 [INFO] consul: New leader elected: node1
2016/01/04 19:18:21 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:22 [INFO] agent: Synced service 'consul'
2016/01/04 19:18:23 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:25 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:26 [INFO] memberlist: Marking node1 as failed, suspect timeout reached
2016/01/04 19:18:26 [INFO] serf: EventMemberFailed: node1 10.20.0.51
2016/01/04 19:18:26 [INFO] consul: removing server node1 (Addr: 10.20.0.51:8300) (DC: dc1)
2016/01/04 19:18:26 [INFO] memberlist: Suspect node1 has failed, no acks received
2016/01/04 19:18:40 [INFO] serf: attempting reconnect to node1 10.20.0.51:8301
2016/01/04 19:18:40 [INFO] serf: EventMemberJoin: node1 10.20.0.51
What exactly am I doing wrong? All I want is to run two Consul containers on two EC2 instances and have them communicate without explicitly opening up the ports in the security group. (When I explicitly open them up, it works, of course!)
Can somebody please help?
Thanks