Why does ES show an error log `readiness probe failed`?

I am deploying an Elasticsearch cluster on AWS EKS. Below is the k8s spec YAML file.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: datasource
spec:
  version: 7.14.0
  nodeSets:
  - name: node
    count: 3
    config:
      node.store.allow_mmap: true
      xpack.security.http.ssl.enabled: false
      xpack.security.transport.ssl.enabled: false
      xpack.security.enabled: false
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          readinessProbe:
            exec:
              command:
              - bash
              - -c
              - /mnt/elastic-internal/scripts/readiness-probe-script.sh
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 12
            successThreshold: 1
            timeoutSeconds: 12
          env:
          - name: READINESS_PROBE_TIMEOUT
            value: "30"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 1024Gi
After deploying, all three pods show this error:
{"type": "server", "timestamp": "2021-10-05T05:19:37,041Z", "level": "INFO", "component": "o.e.c.m.MetadataMappingService", "cluster.name": "datasource", "node.name": "datasource-es-node-0", "message": "[.kibana/g5_90XpHSI-y-I7MJfBZhQ] update_mapping [_doc]", "cluster.uuid": "xJ00drroT_CbJPfzi8jSAg", "node.id": "qmtgUZHbR4aTWsYaoIEDEA" }
{"type": "server", "timestamp": "2021-10-05T05:19:37,622Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "datasource", "node.name": "datasource-es-node-0", "message": "Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana][0]]]).", "cluster.uuid": "xJ00drroT_CbJPfzi8jSAg", "node.id": "qmtgUZHbR4aTWsYaoIEDEA" }
{"timestamp": "2021-10-05T05:19:40+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:19:45+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:19:50+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:19:55+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:20:00+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:20:05+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:20:10+00:00", "message": "readiness probe failed", "curl_rc": "35"}
{"timestamp": "2021-10-05T05:20:15+00:00", "message": "readiness probe failed", "curl_rc": "35"}
From the log above, the cluster health changes from [YELLOW] to [GREEN] first, and then the readiness probe failed errors begin. How can I solve this issue? Is it an Elasticsearch-related error or a Kubernetes-related one?

You can increase the probe timeout by declaring READINESS_PROBE_TIMEOUT in your spec like this:
...
env:
- name: READINESS_PROBE_TIMEOUT
  value: "30"
You can customize the readiness probe if necessary; the latest elasticsearch.k8s.elastic.co/v1 API spec is here. It is the same K8s PodTemplateSpec that you can use inside your Elasticsearch spec.
Update: curl exit code 35 refers to an SSL error. Here's a post regarding the probe script. Can you remove the following settings from your spec and re-run:
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: false
xpack.security.enabled: false
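For clarity, a sketch of what the nodeSet config would look like with those overrides dropped (everything else in the spec unchanged). With the defaults, ECK keeps TLS enabled on the HTTP layer, which appears to be what its readiness-probe script assumes; curl exit code 35 is an SSL handshake error, consistent with the script probing https while the HTTP layer serves plain http.
  nodeSets:
  - name: node
    count: 3
    config:
      node.store.allow_mmap: true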

Related

k8s elasticsearch pod running but not work with dynamic pvc

I use https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner and everything works.
But when I use Helm to install Elasticsearch, the pod shows as running but Elasticsearch does not work.
Steps I used to install Elasticsearch:
Create the certificate and username/password for Elasticsearch
Manually create the PVC / have the PVC created automatically by Elasticsearch's values.yaml
Elasticsearch shows as running, but I can't curl http://<service ip>:9200
If I change persistence.enabled from true to false, everything works, and curl http://<service ip>:9200 returns the default Elasticsearch response.
So I wonder if I misunderstand how PVCs are meant to be used. Here is my pvc.yaml and its status after I create it:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-master-elasticsearch-master-0
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 2Gi
  storageClassName: nfs-client
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
elasticsearch-master-elasticsearch-master-0 Bound pvc-55480402-c7d7-4a52-aa85-961f97ab7f82 2Gi RWO nfs-client 20m
My disk information for the node that runs ES:
10.10.1.134:/mnt/nfs/default-elasticsearch-master-elasticsearch-master-0-pvc-55480402-c7d7-4a52-aa85-961f97ab7f82 15G 1.8G 14G 12% /var/lib/kubelet/pods/b3696f5a-5d9b-4c00-943e-027c2dc7a86c/volumes/kubernetes.io~nfs/pvc-55480402-c7d7-4a52-aa85-961f97ab7f82
drwxrwxrwx 3 root root 19 Sep 19 17:25 pvc-55480402-c7d7-4a52-aa85-961f97ab7f82
My description of the Elasticsearch pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 21m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Normal Scheduled 21m default-scheduler Successfully assigned default/elasticsearch-master-0 to a136
Normal Pulled 21m kubelet Container image "docker.elastic.co/elasticsearch/elasticsearch:7.17.3" already present on machine
Normal Created 21m kubelet Created container configure-sysctl
Normal Started 21m kubelet Started container configure-sysctl
Normal Pulled 21m kubelet Container image "docker.elastic.co/elasticsearch/elasticsearch:7.17.3" already present on machine
Normal Created 21m kubelet Created container elasticsearch
Normal Started 21m kubelet Started container elasticsearch
My ES logs:
{"type": "server", "timestamp": "2022-09-19T09:25:49,509Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "loaded module [x-pack-voting-only-node]" }
{"type": "server", "timestamp": "2022-09-19T09:25:49,509Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "loaded module [x-pack-watcher]" }
{"type": "server", "timestamp": "2022-09-19T09:25:49,510Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "no plugins loaded" }
{"type": "deprecation.elasticsearch", "timestamp": "2022-09-19T09:25:49,515Z", "level": "CRITICAL", "component": "o.e.d.c.s.Settings", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "[node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.", "key": "node.ml", "category": "settings" }
{"type": "deprecation.elasticsearch", "timestamp": "2022-09-19T09:25:49,621Z", "level": "CRITICAL", "component": "o.e.d.c.s.Settings", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "[node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.", "key": "node.data", "category": "settings" }
curl when the persistence value is true:
curl -u username:password http://10.101.45.49:9200
curl: (7) Failed connect to 10.101.45.49:9200; Connection refused
curl when the persistence value is false:
curl -u username:password http://10.101.45.49:9200
{
"name" : "elasticsearch-master-0",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "WRR3PXhLS7GGHl5AaUz2DA",
"version" : {
"number" : "7.17.3",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "5ad023604c8d7416c9eb6c0eadb62b14e766caff",
"build_date" : "2022-04-19T08:11:19.070913226Z",
"build_snapshot" : false,
"lucene_version" : "8.11.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
My volumeClaimTemplate in the Elasticsearch values.yaml:
volumeClaimTemplate:
  storageClassName: "nfs-client"
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
It seems that ES can't work when it is backed by a dynamically provisioned PVC, but I don't know how to solve the problem. Thanks for any help.
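Not a definitive answer, but a short diagnostic sketch that may help narrow it down (it assumes the pod name elasticsearch-master-0 and the image's default data path /usr/share/elasticsearch/data; a common failure mode with NFS-backed data directories is that the exported directory is not writable by the elasticsearch user, uid 1000):
# Is the data directory mounted and writable by uid 1000?
kubectl exec elasticsearch-master-0 -- ls -ld /usr/share/elasticsearch/data
kubectl exec elasticsearch-master-0 -- touch /usr/share/elasticsearch/data/.write-test

# Look for permission / lock errors around startup
kubectl logs elasticsearch-master-0 | grep -iE 'denied|permission|failed to obtain node lock'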

Kubernetes pod name/service name as index name in Kibana using fluentd

Currently we have many services running on k8s, sending logs via fluent-bit to Elasticsearch through fluentd.
In fluentd we have hard-coded logstash_prefix xxx-logstash, so all logs are written to the same index. Now we want to index the data in Elasticsearch per pod name / service name.
From the JSON log documents in Kibana we can see there is a key PodName, but how do we use it in fluentd.conf? We are using Helm for the Elastic Stack deployment.
fluentd.conf
# See more details in https://github.com/uken/fluent-plugin-elasticsearch
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-output
data:
  fluentd.conf: |
    # Configure the logging level to error
    <system>
      log_level error
    </system>

    # Ignore fluentd's own events
    <label @FLUENT_LOG>
      <match fluent.**>
        @type null
      </match>
    </label>

    # TCP input to receive logs from the forwarders
    <source>
      @type forward
      bind 0.0.0.0
      port 24224
    </source>

    # HTTP input for the liveness and readiness probes
    <source>
      @type http
      bind 0.0.0.0
      port 9880
    </source>

    # Throw the healthcheck to the standard output instead of forwarding it
    <match fluentd.healthcheck>
      @type null
    </match>

    # Send the logs to Elasticsearch
    <match **>
      @type elasticsearch
      include_tag_key true
      host "{{ .Release.Name }}-es-http"
      port "9200"
      user "elastic"
      password "{{ (.Values.env.secret.password | b64dec) | indent 4 | trim }}"
      logstash_format true
      scheme https
      ssl_verify false
      logstash_prefix xxx-logstash
      logstash_prefix_separator -
      logstash_dateformat %Y.%m.%d
      <buffer>
        @type file
        path /opt/bitnami/fluentd/logs/buffers/logs.buffer
        flush_thread_count 2
        flush_interval 5s
      </buffer>
    </match>
Sample log document from Kibana:
{
"_index": "xxx-logstash-2022.08.19",
"_type": "_doc",
"_id": "N34ntYIBvWtHvFBZmz-L",
"_version": 1,
"_score": 1,
"_ignored": [
"message.keyword"
],
"_source": {
"FileName": "/app/logs/app.log",
"#timestamp": "2022-08-19T08:10:46.854Z",
"#version": "1",
"message": "[com.couchbase.endpoint][EndpointConnectionFailedEvent][1485us] Connect attempt 16569 failed because of : finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting. {\"circuitBreaker\":\"DISABLED\",\"coreId\":\"0x94bd86a800000002\",\"remote\":\"xxx-couchbase-cluster.couchbase:8091\",\"type\":\"MANAGER\"}",
"logger_name": "com.couchbase.endpoint",
"thread_name": "cb-events",
"level": "WARN",
"level_value": 30000,
"stack_trace": "com.couchbase.client.core.endpoint.BaseEndpoint$2: finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting.\n",
"PodName": "product-59b7f4b567-r52vn",
"Namespace": "designer-dev",
"tag": "tail.0"
},
"fields": {
"thread_name.keyword": [
"cb-events"
],
"level": [
"WARN"
],
"FileName": [
"/app/logs/app.log"
],
"stack_trace.keyword": [
"com.couchbase.client.core.endpoint.BaseEndpoint$2: finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting.\n"
],
"PodName.keyword": [
"product-59b7f4b567-r52vn"
],
"#version.keyword": [
"1"
],
"message": [
"[com.couchbase.endpoint][EndpointConnectionFailedEvent][1485us] Connect attempt 16569 failed because of : finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting. {\"circuitBreaker\":\"DISABLED\",\"coreId\":\"0x94bd86a800000002\",\"remote\":\"xxx-couchbase-cluster.couchbase:8091\",\"type\":\"MANAGER\"}"
],
"Namespace": [
"designer-dev"
],
"PodName": [
"product-59b7f4b567-r52vn"
],
"#timestamp": [
"2022-08-19T08:10:46.854Z"
],
"level.keyword": [
"WARN"
],
"thread_name": [
"cb-events"
],
"level_value": [
30000
],
"Namespace.keyword": [
"designer-dev"
],
"#version": [
"1"
],
"logger_name": [
"com.couchbase.endpoint"
],
"tag": [
"tail.0"
],
"stack_trace": [
"com.couchbase.client.core.endpoint.BaseEndpoint$2: finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting.\n"
],
"tag.keyword": [
"tail.0"
],
"FileName.keyword": [
"/app/logs/app.log"
],
"logger_name.keyword": [
"com.couchbase.endpoint"
]
},
"ignored_field_values": {
"message.keyword": [
"[com.couchbase.endpoint][EndpointConnectionFailedEvent][1485us] Connect attempt 16569 failed because of : finishConnect(..) failed: Connection refused: xxx-couchbase-cluster.couchbase/10.244.27.5:8091 - Check server ports and cluster encryption setting. {\"circuitBreaker\":\"DISABLED\",\"coreId\":\"0x94bd86a800000002\",\"remote\":\"xxx-couchbase-cluster.couchbase:8091\",\"type\":\"MANAGER\"}"
]
}
}
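One possible approach, sketched here rather than taken from a verified deployment: fluent-plugin-elasticsearch expands ${...} placeholders in logstash_prefix from record fields, provided the field is declared as a buffer chunk key. Using the PodName field from the document above, the match block could look roughly like this:
<match **>
  @type elasticsearch
  host "{{ .Release.Name }}-es-http"
  port "9200"
  user "elastic"
  password "{{ (.Values.env.secret.password | b64dec) | indent 4 | trim }}"
  scheme https
  ssl_verify false
  logstash_format true
  logstash_dateformat %Y.%m.%d
  # ${PodName} is filled in per chunk because PodName is listed as a chunk key below
  logstash_prefix xxx-${PodName}
  <buffer tag, PodName>
    @type file
    path /opt/bitnami/fluentd/logs/buffers/logs.buffer
    flush_thread_count 2
    flush_interval 5s
  </buffer>
</match>
Note that pod names include the ReplicaSet hash (e.g. product-59b7f4b567-r52vn), so this creates indices per pod rather than per service; if a per-service index is wanted, a record_transformer filter could first derive a stable service-name field from PodName and that field could be used as the chunk key instead.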

Beats can’t reach Elastic Service

I've been running my ECK (Elastic Cloud on Kubernetes) cluster for a couple of weeks with no issues. However, 3 days ago filebeat stopped being able to connect to my ES service. All pods are up and running (Elastic, Beats and Kibana).
Also, shelling into filebeats pods and connecting to the Elasticsearch service works just fine:
curl -k -u "user:$PASSWORD" https://quickstart-es-http.quickstart.svc:9200
{
"name" : "aegis-es-default-4",
"cluster_name" : "quickstart",
"cluster_uuid" : "",
"version" : {
"number" : "7.14.0",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "",
"build_date" : "",
"build_snapshot" : false,
"lucene_version" : "8.9.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
Yet the filebeats pod logs are producing the below error:
ERROR
[publisher_pipeline_output] pipeline/output.go:154
Failed to connect to backoff(elasticsearch(https://quickstart-es-http.quickstart.svc:9200)):
Connection marked as failed because the onConnect callback failed: could not connect to a compatible version of Elasticsearch:
503 Service Unavailable:
{
"error": {
"root_cause": [
{ "type": "master_not_discovered_exception", "reason": null }
],
"type": "master_not_discovered_exception",
"reason": null
},
"status": 503
}
I haven't made any changes so I think it's a case of authentication or SSL certificates needing updating?
My filebeats config looks like this:
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: quickstart
  namespace: quickstart
spec:
  type: filebeat
  version: 7.14.0
  elasticsearchRef:
    name: quickstart
  config:
    filebeat:
      modules:
      - module: gcp
        audit:
          enabled: true
          var.project_id: project_id
          var.topic: topic_name
          var.subcription: sub_name
          var.credentials_file: /usr/certs/credentials_file
          var.keep_original_message: false
        vpcflow:
          enabled: true
          var.project_id: project_id
          var.topic: topic_name
          var.subscription_name: sub_name
          var.credentials_file: /usr/certs/credentials_file
        firewall:
          enabled: true
          var.project_id: project_id
          var.topic: topic_name
          var.subscription_name: sub_name
          var.credentials_file: /usr/certs/credentials_file
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        containers:
        - name: filebeat
          volumeMounts:
          - name: varlogcontainers
            mountPath: /var/log/containers
          - name: varlogpods
            mountPath: /var/log/pods
          - name: varlibdockercontainers
            mountPath: /var/lib/docker/containers
          - name: credentials
            mountPath: /usr/certs
            readOnly: true
        volumes:
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: credentials
          secret:
            defaultMode: 420
            items:
            secretName: elastic-service-account
And it was working just fine; I haven't made any changes to this config that would make it lose access.
Update: I did a little more digging and found that there weren't enough resources available to assign a master node.
When I ran GET /_cat/master it returned the same 503 "no master" error. I added a new node pool and it started running normally.
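For anyone hitting the same thing, a hedged sketch of the checks involved (names follow the quickstart example above and will differ per deployment; ECK stores the built-in elastic user's password in the <cluster>-es-elastic-user secret):
# Password of the built-in elastic user, as created by the operator
PASSWORD=$(kubectl get secret quickstart-es-elastic-user -n quickstart \
  -o go-template='{{.data.elastic | base64decode}}')

# Who is the elected master, and what is the overall cluster health?
kubectl exec -n quickstart quickstart-es-default-0 -- \
  curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cat/master?v"
kubectl exec -n quickstart quickstart-es-default-0 -- \
  curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"

# Are any Elasticsearch pods stuck Pending because the node pool has no capacity?
kubectl get pods -n quickstart -o wide
kubectl describe pod -n quickstart quickstart-es-default-0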

Elastic search - SearchPhaseExecutionException: all shards failed

I have a single-node ES cluster.
I created a new index with 10 shards that is supposed to hold 1 TB of data.
I started reindexing part of the data into this new index and got a java.lang.OutOfMemoryError: Java heap space exception.
I restarted the Docker container and now I see the following.
What should I do?
Thanks.
{"type": "server", "timestamp": "2020-12-13T14:35:28,155Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "docker-cluster", "node.name": "607ed4606bec", "message": "path: /.kibana/_count, params: {index=.kibana}", "cluster.uuid": "zNFK_xhtTAuEfr6S_mcdSA", "node.id": "y9BuSdDNTXyo9X0b13fs8w" ,
"stacktrace": ["org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:582) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.9.3.jar:7.9.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.9.3.jar:7.9.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2020-12-13T14:35:42,170Z", "level": "INFO", "component": "o.e.c.m.MetadataIndexTemplateService", "cluster.name": "docker-cluster", "node.name": "607ed4606bec", "message": "adding template [.management-beats] for index patterns [.management-beats]", "cluster.uuid": "zNFK_xhtTAuEfr6S_mcdSA", "node.id": "y9BuSdDNTXyo9X0b13fs8w" }
{"type": "server", "timestamp": "2020-12-13T14:37:52,073Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "docker-cluster", "node.name": "607ed4606bec", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[entities][5]]]).", "cluster.uuid": "zNFK_xhtTAuEfr6S_mcdSA", "node.id": "y9BuSdDNTXyo9X0b13fs8w" }
You are hitting your JVM heap size limit. To solve the problem, increase the memory available to your Docker container (and the Elasticsearch heap size) and try again.
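A sketch of what that can look like for a single-node Docker setup (the 8g/4g values and the container name are illustrative; the heap is normally kept at no more than about half of the container's memory):
# Give the container more memory and size the ES heap to roughly half of it
docker run -d --name es-single-node \
  --memory=8g \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.9.3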

elasticsearch: no known master node, scheduling a retry

Can anyone help me with steps to fix/diagnose an Elastic cluster that has fallen over randomly with the following errors? Version 7.3.1.
elasticsearch | {"type": "server", "timestamp": "2019-12-06T09:30:49,585+0000", "level": "DEBUG", "component": "o.e.a.a.i.c.TransportCreateIndexAction", "cluster.name": "xxx", "node.name": "bex", "message": "no known master node, scheduling a retry" }
elasticsearch | {"type": "server", "timestamp": "2019-12-06T09:30:50,741+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "xxx", "node.name": "bex", "message": "path: /_bulk, params: {}" ,
elasticsearch | "stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized, SERVICE_UNAVAILABLE/2/no master];",
It has been running without any issues for ages.
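A hedged starting point for diagnosing the "no known master node" state (commands are illustrative and assume plain http on localhost:9200; with security enabled, credentials and https are needed):
# Which nodes does this node see, and is any of them master-eligible?
curl -s "http://localhost:9200/_cat/nodes?v"
curl -s "http://localhost:9200/_cluster/health?pretty"

# If these also return 503, check each node's elasticsearch.yml:
#   discovery.seed_hosts          must list reachable master-eligible nodes
#   cluster.initial_master_nodes  only used when bootstrapping a brand-new cluster
# and look in the logs for "master not discovered" to see which addresses
# the node is actually trying.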
