IBM Cloud Private 2.1.0.3 unable to access port 8443 after installation - ibm-cloud-private

I am currently facing an issue on ICP 2.1.0.3 where, after installation, all the pods are up and running but port 8443 is not listening, and the platform-ui container seems to have trouble connecting to 10.0.0.25, which is the ClusterIP of the icp-management-ingress service. ICP was installed on fresh VMs where both iptables and ufw are inactive.
Below is the log of the container after a restart.
root@icpmaster:/opt/ibm-cloud-private-ce-2.1.0.3/cluster# docker logs cbaaebabd4c9
[2019-03-06T04:12:38.429] [INFO] [platform-ui] [server] [pid 1] [env production] started.
[HPM] Proxy created: / -> https://icp-management-ingress:8443
[HPM] Proxy rewrite rule created: "^/catalog/api/proxy" ~> ""
[2019-03-06T04:13:33.455] [INFO] [platform-ui] [server] Starting express server.
[2019-03-06T04:13:33.653] [INFO] [platform-ui] [server] Platform UI listening on http port 3000.
[2019-03-06T04:13:46.373] [ERROR] [platform-ui] [service-watcher] Error making request: Error: connect ECONNREFUSED 10.0.0.25:8443
GET https://icp-management-ingress:8443/kubernetes/api/v1/services?labelSelector=inmenu%3Dtrue HTTP/1.1
Accept: application/json
Authorization: Bearer ***
Error: connect ECONNREFUSED 10.0.0.25:8443
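Not from the original post, but a quick way to check why a ClusterIP connection is refused is to confirm the Service has healthy endpoints (a sketch; the service name and namespace are taken from the logs and pod listing in this question):

```shell
# Sketch: verify the icp-management-ingress Service actually resolves to ready pods.
kubectl -n kube-system get svc icp-management-ingress
kubectl -n kube-system get endpoints icp-management-ingress
# An empty ENDPOINTS column means no ready backing pod, which would
# explain "connect ECONNREFUSED 10.0.0.25:8443".
```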
[Edit]
So I restarted kubelet and Docker. I then found entries in the kubelet's log related to cgroups, saying "no such file or directory". I wonder whether this is related to Docker; however, my Docker is at the supported version 17.12.1-ce.
kubelet's log (trimmed):
-- Logs begin at Tue 2019-03-05 22:52:15 +08, end at Thu 2019-03-07 21:36:31 +08. --
Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-overlay2- xxx -merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-overlay2- xxx -merged.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/blkio/system.slice/var-lib-docker-overlay2- xxx -merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/var-lib-docker-overlay2- xxx -merged.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/memory/system.slice/var-lib-docker-overlay2- xxx -merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/var-lib-docker-overlay2- xxx -merged.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/devices/system.slice/var-lib-docker-overlay2- xxx -merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/var-lib-docker-overlay2- xxx -merged.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-containers- xxx -shm.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-containers- xxx -shm.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/blkio/system.slice/var-lib-docker-containers- xxx -shm.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/var-lib-docker-containers- xxx -shm.mount: no such file or directory
Error while processing event ("/sys/fs/cgroup/memory/system.slice/var-lib-docker-containers- xxx -shm.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/var-lib-docker-containers- xxx -shm.mount: no such file or directory
Status of other pods
root@icpmaster:/opt# kubectl --kubeconfig=/var/lib/kubelet/kubelet-config get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
auth-apikeys-tj57w 1/1 Running 2 1d 10.1.9.5 10.113.64.6
auth-idp-c84cb 2/3 Running 15 1d 10.1.9.8 10.113.64.6
auth-pap-28tnl 1/1 Running 2 1d 10.1.9.32 10.113.64.6
auth-pdp-bwn9k 1/1 Running 2 1d 10.1.9.30 10.113.64.6
calico-kube-controllers-759f7fc556-bfnn8 1/1 Running 0 57m 10.113.64.6 10.113.64.6
calico-node-bdnrc 2/2 Running 46 1d 10.113.64.6 10.113.64.6
calico-node-h8jnd 2/2 Running 4 1d 10.113.64.8 10.113.64.8
catalog-ui-7ctqv 1/1 Running 2 1d 10.1.9.14 10.113.64.6
default-backend-7c6d6df9d5-j4pl9 1/1 Running 0 57m 10.1.9.22 10.113.64.6
heapster-5649f84695-vfjjw 2/2 Running 0 1h 10.1.9.4 10.113.64.6
helm-api-76c8d8bc7-8qjxf 2/2 Running 3 57m 10.1.9.2 10.113.64.6
helm-repo-7455d96-bg2td 1/1 Running 0 58m 10.1.9.19 10.113.64.6
icp-management-ingress-xgp95 1/1 Running 3 1d 10.1.9.62 10.113.64.6
icp-mongodb-0 1/1 Running 23 1d 10.1.9.10 10.113.64.6
image-manager-0 2/2 Running 6 1d 10.113.64.6 10.113.64.6
k8s-etcd-10.113.64.6 1/1 Running 2 1d 10.113.64.6 10.113.64.6
k8s-master-10.113.64.6 3/3 Running 6 1d 10.113.64.6 10.113.64.6
k8s-proxy-10.113.64.6 1/1 Running 2 1d 10.113.64.6 10.113.64.6
k8s-proxy-10.113.64.8 1/1 Running 3 1d 10.113.64.8 10.113.64.8
kube-dns-ltdb4 3/3 Running 37 1d 10.1.9.34 10.113.64.6
logging-elk-client-65745dcd68-b69wb 2/2 Running 0 1h 10.1.9.44 10.113.64.6
logging-elk-data-0 1/1 Running 0 56m 10.1.9.16 10.113.64.6
logging-elk-filebeat-ds-7cb78 1/1 Running 2 1d 10.1.214.67 10.113.64.8
logging-elk-filebeat-ds-vmfbk 1/1 Running 2 1d 10.1.9.36 10.113.64.6
logging-elk-logstash-76c548744b-n24c5 1/1 Running 0 1h 10.1.9.17 10.113.64.6
logging-elk-master-686fbdd984-kpt7s 1/1 Running 0 1h 10.1.9.56 10.113.64.6
mariadb-0 1/1 Running 8 1d 10.113.64.6 10.113.64.6
metrics-server-7f4fdb695f-7rsd5 1/1 Running 7 1h 10.1.9.20 10.113.64.6
nginx-ingress-controller-gjnnb 1/1 Running 9 1d 10.1.9.9 10.113.64.6
platform-api-dq4p8 1/1 Running 2 1d 10.1.9.7 10.113.64.6
platform-deploy-6kzds 1/1 Running 2 1d 10.1.9.11 10.113.64.6
platform-ui-6kzzn 0/1 Running 46 1d 10.1.9.35 10.113.64.6
rescheduler-g85d5 1/1 Running 2 1d 10.113.64.6 10.113.64.6
service-catalog-apiserver-rlfj5 1/1 Running 6 1d 10.1.9.28 10.113.64.6
service-catalog-controller-manager-5b654dc8b8-jfj64 1/1 Running 5 57m 10.1.9.15 10.113.64.6
tiller-deploy-c59888d97-l7rhk 1/1 Running 3 57m 10.113.64.6 10.113.64.6
unified-router-zmxwh 1/1 Running 2 1d 10.1.9.24 10.113.64.6
update-secrets-cpp6j 1/1 Running 10 57m 10.1.9.18 10.113.64.6

The auth-idp-c84cb pod is not fully running (2/3 instead of 3/3), and that is what causes the platform-ui container's failure to connect to 10.0.0.25:8443. This is also why the platform-ui pod is not fully running (platform-ui-6kzzn 0/1).
auth-idp pods often fail to start fully because the ICP cluster/environment does not have sufficient resources. Please reinstall your ICP cluster after increasing the resources to meet these hardware requirements: https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.3/supported_system_config/hardware_reqs.html
Pay particular attention to the number of CPUs, the amount of RAM, and the disk space.
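Before reinstalling, it may be worth confirming which auth-idp container is failing and why. A minimal diagnostic sketch (the pod and node names are taken from the listing above; the container name is an assumption, "kubectl describe" lists the real ones):

```shell
# Sketch: inspect the partially-ready auth-idp pod and its node.
kubectl -n kube-system describe pod auth-idp-c84cb
# Logs of the suspected failing container (name assumed; see describe output):
kubectl -n kube-system logs auth-idp-c84cb -c auth-idp
# Check whether the node itself is under memory/disk pressure:
kubectl describe node 10.113.64.6
```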

Related

http.host: 0.0.0.0 - ERROR lookup logstash : no such host

I'm deploying ELK on k8s but I'm getting an error from Filebeat.
kubectl describe pod filebeat-filebeat-rpjbg -n elk
Error:
Warning Unhealthy 8s (x5 over 48s) kubelet Readiness probe failed: logstash: logstash:5044...
connection...
parse host... OK
dns lookup... ERROR lookup logstash on 10.245.0.10:53: no such host
Could this part of the Logstash values.yaml be causing the error?
logstashConfig:
  logstash.yml: |
    http.host: 0.0.0.0
    xpack.monitoring.enabled: false
PODS:
NAME READY STATUS RESTARTS AGE
elasticsearch-master-0 1/1 Running 0 146m
filebeat-filebeat-rpjbg 0/1 Running 0 5m45s
filebeat-filebeat-v4fxz 0/1 Running 0 5m45s
filebeat-filebeat-zf5w7 0/1 Running 0 5m45s
logstash-logstash-0 1/1 Running 0 14m
logstash-logstash-1 1/1 Running 0 14m
logstash-logstash-2 1/1 Running 0 14m
SVC:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
elasticsearch-master ClusterIP 10.245.205.251 <none> 9200/TCP,9300/TCP 172m
elasticsearch-master-headless ClusterIP None <none> 9200/TCP,9300/TCP 172m
logstash-logstash ClusterIP 10.245.104.163 <none> 5044/TCP 16m
logstash-logstash-headless ClusterIP None <none> 9600/TCP 16m
elasticsearch - values.yaml
logstash - values.yaml
filebeat - values.yaml
Filebeat is trying to resolve "logstash", but you don't have a service with that name; you have "logstash-logstash". Either change the Filebeat config (lines 49 and 116 in the Filebeat values.yaml) or rename your Logstash service accordingly.
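A sketch of what the corrected part of the Filebeat values.yaml could look like (the filebeatConfig layout follows the Elastic Helm chart convention; adjust to your actual file):

```yaml
filebeatConfig:
  filebeat.yml: |
    output.logstash:
      # was "logstash:5044"; must match the Service name shown by "kubectl get svc"
      hosts: ["logstash-logstash:5044"]
```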

Hadoop Container failed even 100 percent completed

I have set up a small cluster with Hadoop 2.7, HBase 0.98 and Nutch 2.3.1. I have written a custom job that first combines docs of the same domain; after that, each URL of the domain is taken from a cache (i.e., a list), the corresponding key is used to fetch the object via datastore.get(url_key), and after the score is updated the object is written via context.write.
The job should complete after all docs are processed, but what I have observed is that each attempt fails due to a timeout even though its progress shows 100 percent complete. Here is the log:
attempt_1549963404554_0110_r_000001_1 100.00 FAILED reduce > reduce node2:8042 logs Thu Feb 21 20:50:43 +0500 2019 Fri Feb 22 02:11:44 +0500 2019 5hrs, 21mins, 0sec AttemptID:attempt_1549963404554_0110_r_000001_1 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
attempt_1549963404554_0110_r_000001_3 100.00 FAILED reduce > reduce node1:8042 logs Fri Feb 22 04:39:08 +0500 2019 Fri Feb 22 07:25:44 +0500 2019 2hrs, 46mins, 35sec AttemptID:attempt_1549963404554_0110_r_000001_3 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
attempt_1549963404554_0110_r_000002_0 100.00 FAILED reduce > reduce node3:8042 logs Thu Feb 21 12:38:45 +0500 2019 Thu Feb 21 22:50:13 +0500 2019 10hrs, 11mins, 28sec AttemptID:attempt_1549963404554_0110_r_000002_0 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
Why is an attempt that is 100.00 percent complete not marked as successful? Unfortunately, there is no error information other than the timeout in my case. How can I debug this problem?
My reducer is posted in another question:
Apache Nutch 2.3.1 map-reduce timeout occurred while updating the score
I have observed that, in the three attempts you posted, the execution times vary greatly. Please take a closer look at the job you are executing.
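To debug further, a reasonable first step (not part of the answer above) is to pull the full container logs for the failed attempts; the application ID below is derived from the attempt IDs in the question, and the jar/class names are placeholders:

```shell
# Sketch: fetch the aggregated logs for the failed reduce attempts.
yarn logs -applicationId application_1549963404554_0110
# If the reducer legitimately spends long stretches without writing output,
# raising the task timeout (in milliseconds) at submit time is one option
# (requires the job to use ToolRunner/GenericOptionsParser):
hadoop jar my-job.jar MyJobClass -Dmapreduce.task.timeout=3600000
```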

kubernetes windows worker node with calico can not deploy pods

I tried to use kubeadm.exe join to join a Windows worker node, but it did not work.
I then tried to follow this document: nwoodmsft/SDN/CalicoFelix.md. After that, the node status looks like this:
# node status
root@ysicing:~# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
win-o35a06j767t Ready <none> 1h v1.10.10 <none> Windows Server Standard 10.0.17134.1 docker://18.9.0
ysicing Ready master 4h v1.10.10 <none> Debian GNU/Linux 9 (stretch) 4.9.0-8-amd64 docker://17.3.3
pod status:
root@ysicing:~# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default demo-deployment-c96d5d97b-99h9s 0/1 ContainerCreating 0 5m <none> win-o35a06j767t
default demo-deployment-c96d5d97b-lq2jm 0/1 ContainerCreating 0 5m <none> win-o35a06j767t
default demo-deployment-c96d5d97b-zrc2k 1/1 Running 0 5m 192.168.0.3 ysicing
default iis-7f7dc9fbbb-xhccv 0/1 ContainerCreating 0 1h <none> win-o35a06j767t
kube-system calico-node-nr5mt 0/2 ContainerCreating 0 1h 192.168.1.2 win-o35a06j767t
kube-system calico-node-w6mls 2/2 Running 0 5h 172.16.0.169 ysicing
kube-system etcd-ysicing 1/1 Running 0 6h 172.16.0.169 ysicing
kube-system kube-apiserver-ysicing 1/1 Running 0 6h 172.16.0.169 ysicing
kube-system kube-controller-manager-ysicing 1/1 Running 0 6h 172.16.0.169 ysicing
kube-system kube-dns-86f4d74b45-dbcmb 3/3 Running 0 6h 192.168.0.2 ysicing
kube-system kube-proxy-wt6dn 1/1 Running 0 6h 172.16.0.169 ysicing
kube-system kube-proxy-z5jx8 0/1 ContainerCreating 0 1h 192.168.1.2 win-o35a06j767t
kube-system kube-scheduler-ysicing 1/1 Running 0 6h 172.16.0.169 ysicing
kube-proxy and Calico should not run as containers on a Windows node; under Windows, kube-proxy runs as kube-proxy.exe.
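For reference, this is roughly how kube-proxy is started natively on a Windows node (a sketch following the c:\k directory convention used by the Microsoft SDN guides; the flags are standard kube-proxy flags, and the paths and node name are assumptions based on this question):

```shell
# Sketch: run kube-proxy as a native Windows process, not a container.
c:\k\kube-proxy.exe --v=4 --proxy-mode=userspace --hostname-override=win-o35a06j767t --kubeconfig=c:\k\config
```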
calico pods err info:
Warning FailedCreatePodSandBox 2m (x1329 over 32m) kubelet, win-o35a06j767t Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-nr5mt": Error response from daemon: network host not found
demo.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: iis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: iis
    spec:
      nodeSelector:
        beta.kubernetes.io/os: windows
      containers:
      - name: iis
        image: microsoft/iis
        resources:
          limits:
            memory: "128Mi"
            cpu: 2
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: iis
  name: iis
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: iis
  type: NodePort
demo pod error logs:
(extra info:
{"SystemType":"Container","Name":"082e861a8720a84223111b3959a1e2cd26e4be3d0ffcb9eda35b2a09955d4081","Owner":"docker","VolumePath":"\\\\?\\Volume{e8dcfa1d-fbbe-4ef9-b849-5f02b1799a3f}","IgnoreFlushesDuringBoot":true,"LayerFolderPath":"C:\\ProgramData\\docker\\windowsfilter\\082e861a8720a84223111b3959a1e2cd26e4be3d0ffcb9eda35b2a09955d4081","Layers":[{"ID":"8c940e59-c455-597f-b4b2-ff055e33bc2a","Path":"C:\\ProgramData\\docker\\windowsfilter\\7f1a079916723fd228aa878db3bb1e37b50e508422f20be476871597fa53852d"},{"ID":"f72db42e-18f4-54da-98f1-0877e17a069f","Path":"C:\\ProgramData\\docker\\windowsfilter\\449dc4ee662760c0102fe0f388235a111bb709d30b6d9b6787fb26d1ee76c990"},{"ID":"40282350-4b8f-57a2-94e9-31bebb7ec0a9","Path":"C:\\ProgramData\\docker\\windowsfilter\\6ba0fa65b66c3b3134bba338e1f305d030e859133b03e2c80550c32348ba16c5"},{"ID":"f5a96576-2382-5cba-a12f-82ad7616de0f","Path":"C:\\ProgramData\\docker\\windowsfilter\\3b68fac2830f2110aa9eb1c057cf881ee96ce973a378b37e20b74e32c3d41ee0"}],"ProcessorWeight":2,"HostName":"iis-7f7dc9fbbb-xhccv","HvPartition":false})
Warning FailedCreatePodSandBox 14m (x680 over 29m) kubelet, win-o35a06j767t (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "iis-7f7dc9fbbb-xhccv": Error response from daemon: CreateComputeSystem 0b9ab5f3dd4a69464f756aeb0bd780763b38712e32e8c1318fdd17e531437b0f: The operating system of the container does not match the operating system of the host.
(extra info:{"SystemType":"Container","Name":"0b9ab5f3dd4a69464f756aeb0bd780763b38712e32e8c1318fdd17e531437b0f","Owner":"docker","VolumePath":"\\\\?\\Volume{e8dcfa1d-fbbe-4ef9-b849-5f02b1799a3f}","IgnoreFlushesDuringBoot":true,"LayerFolderPath":"C:\\ProgramData\\docker\\windowsfilter\\0b9ab5f3dd4a69464f756aeb0bd780763b38712e32e8c1318fdd17e531437b0f","Layers":[{"ID":"8c940e59-c455-597f-b4b2-ff055e33bc2a","Path":"C:\\ProgramData\\docker\\windowsfilter\\7f1a079916723fd228aa878db3bb1e37b50e508422f20be476871597fa53852d"},{"ID":"f72db42e-18f4-54da-98f1-0877e17a069f","Path":"C:\\ProgramData\\docker\\windowsfilter\\449dc4ee662760c0102fe0f388235a111bb709d30b6d9b6787fb26d1ee76c990"},{"ID":"40282350-4b8f-57a2-94e9-31bebb7ec0a9","Path":"C:\\ProgramData\\docker\\windowsfilter\\6ba0fa65b66c3b3134bba338e1f305d030e859133b03e2c80550c32348ba16c5"},{"ID":"f5a96576-2382-5cba-a12f-82ad7616de0f","Path":"C:\\ProgramData\\docker\\windowsfilter\\3b68fac2830f2110aa9eb1c057cf881ee96ce973a378b37e20b74e32c3d41ee0"}],"ProcessorWeight":2,"HostName":"iis-7f7dc9fbbb-xhccv","HvPartition":false})
Normal SandboxChanged 4m (x1083 over 29m) kubelet, win-o35a06j767t Pod sandbox changed, it will be killed and re-created.
config: "c:\k\"
The CNI directory is empty by default. I then added calico-felix.exe and the config file L2Bridge.conf.
I tried to google it; a CNI plugin is needed, but I could not find a Calico CNI for Windows.
What should I do in this situation? Build a Windows Calico CNI myself?
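One thing worth checking for the "operating system of the container does not match the operating system of the host" error above: Windows Server containers require the image's Windows build to match the host's build. A hedged diagnostic sketch (the exact tag name is an assumption; check the registry):

```shell
# Sketch: compare the image's Windows build with the host's build
# (the node listing above shows the host at 10.0.17134, i.e. version 1803).
docker image inspect microsoft/iis --format '{{.OsVersion}}'
# If the builds differ, pull a tag built for the host, for example:
docker pull microsoft/iis:windowsservercore-1803
```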

cam-bpd-ui pod doesn't start successfully after CAM fresh install

After a CAM 2.1.0.2 fresh install on ICP, I ran the following command:
kubectl -n services get pods
I noticed that the "cam-bpd-ui" pod didn't start. So I'm not able to log in to the Process Designer UI, and I'm getting the error: "Readiness probe failed: HTTP probe failed with statuscode: 404".
According to the ICP overview pane it is running and available. However, I see this in the logs:
"[Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos' doesn't exist
Version: '10.1.16-MariaDB-1~jessie' socket: '/var/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
2018-04-24 16:15:52 140411194034112 [Note] mysqld: ready for connections."
When checking the events in the cam-bpd-ui pod we see the following:
kubectl describe pod cam-bpd-ui-687764b5fc-qxjnp -n services
Name: cam-bpd-ui-687764b5fc-qxjnp
Namespace: services
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned cam-bpd-ui-687764b5fc-qxjnp to 10.190.155.237
Normal SuccessfulMountVolume 27m kubelet, 10.190.155.237 MountVolume.SetUp succeeded for volume "default-token-c8nq4"
Normal SuccessfulMountVolume 27m kubelet, 10.190.155.237 MountVolume.SetUp succeeded for volume "cam-logs-pv"
Normal SuccessfulMountVolume 27m kubelet, 10.190.155.237 MountVolume.SetUp succeeded for volume "cam-bpd-appdata-pv"
Normal Pulled 27m kubelet, 10.190.155.237 Container image "icp-dev.watsonplatform.net:8500/services/icam-busybox:2.1.0.2-x86_64" already present on machine
Normal Created 27m kubelet, 10.190.155.237 Created container
Normal Pulled 27m kubelet, 10.190.155.237 Container image "icp-dev.watsonplatform.net:8500/services/icam-bpd-ui:2.1.0.2-x86_64" already present on machine
Normal Started 27m kubelet, 10.190.155.237 Started container
Normal Created 27m kubelet, 10.190.155.237 Created container
Normal Started 27m kubelet, 10.190.155.237 Started container
Warning Unhealthy 26m (x2 over 26m) kubelet, 10.190.155.237 Readiness probe failed: Get http://10.1.45.36:8080/landscaper/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 12m (x3 over 12m) kubelet, 10.190.155.237 Back-off restarting failed container
Warning Unhealthy 2m (x129 over 26m) kubelet, 10.190.155.237 Readiness probe failed: HTTP probe failed with statuscode: 404
The Process Designer (BPD) needs to connect to MariaDB so that it can populate its database.
You need to be 100% sure that the db is functional. If it is not, BPD will not return the login page for you.
Some hints:
1) If you use NFS, ensure that /etc/exports contains /export *(rw,insecure,no_subtree_check,async,no_root_squash)
For more details about no_root_squash, see: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04
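A sketch of how that export is typically applied on the NFS server (the path and options are taken from the hint above; run as root):

```shell
# Sketch: add the export line and re-export without restarting the NFS server.
echo '/export *(rw,insecure,no_subtree_check,async,no_root_squash)' >> /etc/exports
exportfs -ra   # reload /etc/exports
exportfs -v    # verify the options are active
```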
As a workaround, you can set up your own database and configure BPD to use it.
Details can be found here:
https://www.ibm.com/support/knowledgecenter/SS4GSP_6.2.7/com.ibm.edt.doc/topics/install_database_mysql_bds.html

How to set up autosearch nodes in Elasticsearch 6.1

I have created a cluster of 5 nodes in ES 6.1. I was able to form the cluster when I added a line with the IP addresses of the other nodes to the configuration file elasticsearch.yml, as discovery.zen.ping.unicast.hosts. It looks like this:
discovery.zen.ping.unicast.hosts: ["10.206.81.241","10.206.81.238","10.206.81.237","10.206.81.239"]
When I have this line in my config file, everything works well.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.206.81.241 9 54 0 0.03 0.05 0.05 mi * master4
10.206.81.239 10 54 0 0.00 0.01 0.05 mi - master1
10.206.81.238 14 54 0 0.00 0.01 0.05 mi - master3
10.206.81.240 15 54 0 0.00 0.01 0.05 mi - master5
10.206.81.237 10 54 0 0.00 0.01 0.05 mi - master2
When I added discovery.zen.ping.multicast.enabled: true, Elasticsearch would not start.
I would like to have more nodes, and having to edit each configuration file separately, adding the new address every time, is not a proper way. So is there any way to set up ES 6 to find new nodes automatically?
EDIT:
journalctl -f output:
led 08 10:43:04 elk-prod3.user.dc.company.local polkitd[548]: Registered Authentication Agent for unix-process:23395:23676999 (system bus name :1.162 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8)
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Stopping Elasticsearch...
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Started Elasticsearch.
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Starting Elasticsearch...
led 08 10:43:04 elk-prod3.user.dc.company.local polkitd[548]: Unregistered Authentication Agent for unix-process:23395:23676999 (system bus name :1.162, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: elasticsearch.service: main process exited, code=exited, status=1/FAILURE
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: Unit elasticsearch.service entered failed state.
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: elasticsearch.service failed.
Basically you should have "stable" nodes; that is, you should have IPs which are always part of the cluster:
discovery.zen.ping.unicast.hosts: [MASTER_NODE_IP_OR_DNS, MASTER2_NODE_IP_OR_DNS, MASTER3_NODE_IP_OR_DNS]
Then, if you use autoscaling or add nodes, they must "talk" to those IPs to let the cluster know that they are joining.
You haven't mentioned anything about your network setup, so I can't say for sure what is wrong. But as I recall, unicast hosts are the recommended approach.
PS: If you are using Azure, there is a feature called VM scale sets; I modified the template to my needs. The idea is that by default I always use 3 nodes, and if my cluster is loaded, the scale set dynamically adds more nodes.
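Once new nodes come up pointing at those stable IPs, you can confirm they joined with the _cat API (a sketch; it assumes ES is reachable on localhost:9200 without authentication):

```shell
# Sketch: list cluster members and their roles, as in the question's output.
curl -s 'http://localhost:9200/_cat/nodes?v&h=ip,node.role,master,name'
```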
discovery.zen.ping.multicast has been removed from Elasticsearch; see: https://www.elastic.co/guide/en/elasticsearch/plugins/6.1/discovery-multicast.html
