Prometheus: How to disable 1 rule for 1 specific job_name? - elasticsearch

I'm setting up Prometheus alerts (using elasticsearch_exporter) for 2 Elasticsearch clusters, 1 with 8 nodes and 1 with 3 nodes.
What I want is to send an alert when either cluster loses a node, but for now all rules apply to both clusters, so that's not possible.
prometheus.yml file
global:
  scrape_interval: 10s
rule_files:
  - alert.rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: cluster1
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
      - targets: ['xxx1:9114']
        labels:
          service: cluster1
  - job_name: cluster2
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
      - targets: ['xxx2:9114']
        labels:
          service: cluster2
alert.rules.yml file:
groups:
  - name: alert.rules
    rules:
      - alert: ElasticsearchLostNode
        expr: elasticsearch_cluster_health_number_of_nodes < 8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
          description: Number Healthy Nodes less than 8
...
Of course number_of_nodes < 8 will always be true for the small cluster, and if I set < 3, the alert will not trigger when the big cluster loses 1 node.
Is there a way to exempt 1 specific rule from 1 specific job_name, or to define rules A that apply only to job_name A and rules B that apply only to job_name B?

Yes, you can create one rule for each job in the alert.rules.yml file:
groups:
  - name: alert.rules
    rules:
      - alert: ElasticsearchLostNode1
        expr: elasticsearch_cluster_health_number_of_nodes{job="cluster1"} < 8
        ...
      - alert: ElasticsearchLostNode2
        expr: elasticsearch_cluster_health_number_of_nodes{job="cluster2"} < 3
        ...
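Alternatively, if you'd rather not hard-code the node counts, a single size-agnostic rule is possible (a sketch, not from the original answer): compare the current node count with its own value a few minutes earlier, so any drop fires the alert for whichever cluster is affected. Note the alert resolves once the lower count has persisted longer than the offset window.
- alert: ElasticsearchNodeCountDropped
  # fires when the node count is lower than it was 10 minutes ago,
  # for whichever job (cluster) the metric comes from
  expr: elasticsearch_cluster_health_number_of_nodes < (elasticsearch_cluster_health_number_of_nodes offset 10m)
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch lost a node (instance {{ $labels.instance }})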

Related

Metricbeat not showing pod cpu usage

I have a Kubernetes cluster with minikube from which I would like to collect metrics via Elasticsearch Metricbeat. Almost all the metrics are showing up except CPU usage, which is always 0, even when I'm running a pod with high CPU usage. Here are my YAMLs; what could I possibly be doing wrong?
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: metricbeat
  namespace: default
spec:
  type: metricbeat
  version: 8.1.0
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: kibana
  config:
    setup:
      kibana:
        path: "/kibana"
    metricbeat:
      autodiscover:
        providers:
          - hints:
              default_config: {}
              enabled: "true"
            node: ${NODE_NAME}
            type: kubernetes
      modules:
        - module: system
          period: 10s
          metricsets:
            - cpu
            - load
            - memory
            - network
            - process
            - process_summary
          process:
            include_top_n:
              by_cpu: 5
              by_memory: 5
          processes:
            - .*
        - module: system
          period: 1m
          metricsets:
            - filesystem
            - fsstat
          processors:
            - drop_event:
                when:
                  regexp:
                    system:
                      filesystem:
                        mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib)($|/)
        - module: kubernetes
          period: 10s
          node: ${NODE_NAME}
          hosts:
            - https://${NODE_NAME}:10250
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          ssl:
            verification_mode: none
          metricsets:
            - node
            - system
            - pod
            - container
            - volume
    processors:
      - add_cloud_metadata: {}
      - add_host_metadata: {}
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: metricbeat
        automountServiceAccountToken: true
        containers:
          - args:
              - -e
              - -c
              - /etc/beat.yml
              - -system.hostfs=/hostfs
            name: metricbeat
            volumeMounts:
              - mountPath: /hostfs/sys/fs/cgroup
                name: cgroup
              - mountPath: /var/run/docker.sock
                name: dockersock
              - mountPath: /hostfs/proc
                name: proc
            env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        terminationGracePeriodSeconds: 30
        volumes:
          - hostPath:
              path: /sys/fs/cgroup
            name: cgroup
          - hostPath:
              path: /var/run/docker.sock
            name: dockersock
          - hostPath:
              path: /proc
            name: proc
And here is the ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
  namespace: default
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - namespaces
      - events
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - statefulsets
      - deployments
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/stats
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

How to check Daemonset scheduled and available numbers are equal with ansible k8s_info module

How can we check that a DaemonSet's desired scheduled number and available number are equal with the Ansible k8s_info module?
DaemonSet status
status:
  currentNumberScheduled: 12
  desiredNumberScheduled: 12
  numberAvailable: 12
  numberMisscheduled: 0
  numberReady: 12
  observedGeneration: 2
  updatedNumberScheduled: 12
Something like below
- name: "Wait for DaemonSet to be active"
k8s_info:
api_key: "{{ k8s_api_key }}"
kind: DaemonSet
api_version: apps/v1
name: fluentd
namespace: openshift-logging
register: ds
until:
- ds.resources is defined
- ds | json_query('resources[*].status.desiredNumberScheduled == resources[*].status.numberAvailable )
retries: 60
delay: 10
Thanks
Raj
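One way the until condition could be written (an untested sketch; it assumes k8s_info returns exactly one DaemonSet, so the status fields can be compared directly instead of going through json_query):
- name: "Wait for DaemonSet to be active"
  k8s_info:
    api_key: "{{ k8s_api_key }}"
    kind: DaemonSet
    api_version: apps/v1
    name: fluentd
    namespace: openshift-logging
  register: ds
  until:
    - ds.resources is defined
    - ds.resources | length > 0
    - ds.resources[0].status.numberAvailable is defined
    # available must have caught up with desired
    - ds.resources[0].status.numberAvailable == ds.resources[0].status.desiredNumberScheduled
  retries: 60
  delay: 10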

Readiness and Liveness probes for elasticsearch 6.3.0 on Kubernetes failing

I am trying to set up an EFK stack on Kubernetes. The Elasticsearch version being used is 6.3.2. Everything works fine until I place the probes configuration in the deployment YAML file. I am getting the error below. This causes the pod to be declared unhealthy and eventually it gets restarted, which appears to be a false restart.
Warning Unhealthy 15s kubelet, aks-agentpool-23337112-0 Liveness probe failed: Get http://10.XXX.Y.ZZZ:9200/_cluster/health: dial tcp 10.XXX.Y.ZZZ:9200: connect: connection refused
I did try using telnet from a different container to the Elasticsearch pod with its IP and port and I was successful; only the kubelet on the node is unable to reach the pod's IP, causing the probes to fail.
Below is the snippet from the pod spec of the Kubernetes StatefulSet YAML. Any assistance on the resolution would be really helpful. I have spent quite a lot of time on this without any clue :(
PS: The stack is being set up on an AKS cluster
- name: es-data
  image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
  env:
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: CLUSTER_NAME
      value: myesdb
    - name: NODE_MASTER
      value: "false"
    - name: NODE_INGEST
      value: "false"
    - name: HTTP_ENABLE
      value: "true"
    - name: NODE_DATA
      value: "true"
    - name: DISCOVERY_SERVICE
      value: "elasticsearch-discovery"
    - name: NETWORK_HOST
      value: "_eth0:ipv4_"
    - name: ES_JAVA_OPTS
      value: -Xms512m -Xmx512m
    - name: PROCESSORS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  resources:
    requests:
      cpu: 0.25
    limits:
      cpu: 1
  ports:
    - containerPort: 9200
      name: http
    - containerPort: 9300
      name: transport
  livenessProbe:
    httpGet:
      port: http
      path: /_cluster/health
    initialDelaySeconds: 40
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /_cluster/health
      port: http
    initialDelaySeconds: 30
    timeoutSeconds: 10
The pods/containers run just fine without the probes in place. The expectation is that the probes should work when set in the deployment YAMLs and the pod should not get restarted.
The thing is that Elasticsearch itself has its own health statuses (red, yellow, green) and you need to account for that in your configuration.
Here is what I found in my own ES configuration, based on the official ES Helm chart:
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5
  exec:
    command:
      - sh
      - -c
      - |
        #!/usr/bin/env bash -e
        # If the node is starting up wait for the cluster to be green
        # Once it has started only check that the node itself is responding
        START_FILE=/tmp/.es_start_file
        http () {
          local path="${1}"
          if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
            BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
          else
            BASIC_AUTH=''
          fi
          curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
        }
        if [ -f "${START_FILE}" ]; then
          echo 'Elasticsearch is already running, lets check the node is healthy'
          http "/"
        else
          echo 'Waiting for elasticsearch cluster to become green'
          if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
            touch ${START_FILE}
            exit 0
          else
            echo 'Cluster is not yet green'
            exit 1
          fi
        fi
First, please check the logs using
kubectl logs <pod name> -n <namespacename>
You have to run the init container first and change the volume permissions.
You have to run the whole config as user 1000; also, before the Elasticsearch container starts, you have to change the volume permissions using an init container.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: elasticsearch
    component: elasticsearch
    release: elasticsearch
  name: elasticsearch
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch
      component: elasticsearch
      release: elasticsearch
  serviceName: elasticsearch
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: elasticsearch
        component: elasticsearch
        release: elasticsearch
    spec:
      containers:
        - env:
            - name: cluster.name
              value: <SET THIS>
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: -Xms512m -Xmx512m
            - name: bootstrap.memory_lock
              value: "false"
          image: elasticsearch:6.5.0
          imagePullPolicy: IfNotPresent
          name: elasticsearch
          ports:
            - containerPort: 9200
              name: http
              protocol: TCP
            - containerPort: 9300
              name: transport
              protocol: TCP
          resources:
            limits:
              cpu: 250m
              memory: 1Gi
            requests:
              cpu: 150m
              memory: 512Mi
          securityContext:
            privileged: true
            runAsUser: 1000
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/share/elasticsearch/data
              name: elasticsearch-data
      dnsPolicy: ClusterFirst
      initContainers:
        - command:
            - sh
            - -c
            # run the fixes as one shell script so every command actually executes
            - |
              chown -R 1000:1000 /usr/share/elasticsearch/data
              sysctl -w vm.max_map_count=262144
              chmod 777 /usr/share/elasticsearch/data
              chmod 777 /usr/share/elasticsearch/data/node
              chmod g+rwx /usr/share/elasticsearch/data
              chgrp 1000 /usr/share/elasticsearch/data
          image: busybox:1.29.2
          imagePullPolicy: IfNotPresent
          name: set-dir-owner
          resources: {}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/share/elasticsearch/data
              name: elasticsearch-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 10
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
    - metadata:
        creationTimestamp: null
        name: elasticsearch-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
Check out my YAML config above; you can use it. It's for a single node of Elasticsearch.
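If you still want probes on this single-node setup, note that a one-node cluster with replicated indices typically reports yellow rather than green, so a readiness check should not wait for green. A sketch, assuming HTTP on port 9200 without authentication:
readinessProbe:
  httpGet:
    # yellow is the best a single node can reach once indices with
    # replicas exist, so don't require green here
    path: /_cluster/health?wait_for_status=yellow&timeout=5s
    port: 9200
  initialDelaySeconds: 40
  periodSeconds: 10
  timeoutSeconds: 10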
The probe outlined in my answer works with 3-node discovery when Istio is present. If the livenessProbe is bad, k8s will restart the container without even allowing it to start properly. I use the internal Elastic port (for node-to-node communication) to test liveness. This port speaks TCP.
livenessProbe:
  tcpSocket:
    port: 9300
  initialDelaySeconds: 60 # it takes time for the jvm process to start up to the point when the discovery process starts
  timeoutSeconds: 10
- name: discovery.zen.minimum_master_nodes
  value: "2"
- name: discovery.zen.ping.unicast.hosts
  value: elastic

How to remove a specific block of lines from prometheus.yml file using ansible?

I have a prometheus.yml config file which has multiple k8s clusters configured for monitoring. Since the servers come and go, we need to remove the deleted servers from our prometheus.yml config file using Ansible.
I tried the following and it did not work.
https://docs.ansible.com/ansible/2.5/modules/blockinfile_module.html
- hosts: blocks
  tasks:
    - name: Removing a line using blockinfile
      blockinfile:
        dest: /home/mdtutorials2/block_output.txt
        marker: <!-- {mark} Adding IP address -->
        state: absent
prometheus.yml
# *************************************************************START-RAVVE*****************************************************************************************
# metrics for kubernetes scheduler and controller
- job_name: 'ravve.ntnxsherlock.com-scheduler-and-controller'
  scrape_interval: 5s
  static_configs:
    - targets: ['ip-172-31-12-14.us-east-2.compute.internal:10251']
      labels:
        customer: 'RAVVE'
# metrics from node exporter
- job_name: 'ravve.ntnxsherlock.com-nodes-exporter'
  scrape_interval: 5s
  static_configs:
    - targets: ['ip-172-31-12-14.us-east-2.compute.internal:9100']
      labels:
        customer: 'RAVVE'
    - targets: ['ip-172-31-13-200.us-east-2.compute.internal:9100']
      labels:
        customer: 'RAVVE'
# metrics from cadvisor
- job_name: 'ravve.ntnxsherlock.com-cadvisor'
  scrape_interval: 10s
  metrics_path: "/metrics/cadvisor"
  static_configs:
    - targets: ['ip-172-31-12-14.us-east-2.compute.internal:10255']
      labels:
        customer: 'RAVVE'
# metrics for default/kubernetes api's from the kubernetes master
- job_name: 'ravve.ntnxsherlock.com-apiservers'
  kubernetes_sd_configs:
    - role: endpoints
      api_server: https://ip-172-31-12-14.us-east-2.compute.internal
      tls_config:
        insecure_skip_verify: true
      basic_auth:
        username: admin
        password: XXXXXXXXXXXXXXXX
  scheme: https
  tls_config:
    insecure_skip_verify: true
  basic_auth:
    username: admin
    password: XXXXXXXXXXXXX
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https
# **************************************************************END-RAVVE*****************************************************************************************
Now I need to delete the lines between the start and end markers:
Start *************************************************************START-RAVVE*****************************************************************************************
End
# **************************************************************END-RAVVE*****************************************************************************************
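Since the existing START/END banners don't match blockinfile's marker format, one alternative is the replace module with a multiline regex that spans the whole block. A sketch; the file path and the exact banner text are assumptions based on the snippet above:
- name: Remove the RAVVE block from prometheus.yml
  replace:
    path: /etc/prometheus/prometheus.yml   # assumed location of the config
    # (?s) lets .*? cross newlines and (?m) anchors ^ to each banner line
    regexp: '(?ms)^# \*+START-RAVVE\*+\n.*?^# \*+END-RAVVE\*+\n'
    replace: ''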

Elasticsearch fails to start on AWS kubernetes cluster

I am running my Kubernetes cluster on AWS EKS, which runs Kubernetes 1.10.
I am following this guide to deploy Elasticsearch in my cluster:
elasticsearch Kubernetes
The first time I deployed it everything worked fine. Now, when I redeploy, it gives me the following error.
ERROR: [2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2018-08-24T18:07:28,448][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopping ...
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopped
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closing ...
[2018-08-24T18:07:28,555][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closed
Here is my deployment file.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: es-master
  labels:
    component: elasticsearch
    role: master
spec:
  replicas: 3
  template:
    metadata:
      labels:
        component: elasticsearch
        role: master
    spec:
      initContainers:
        - name: init-sysctl
          image: busybox:1.27.2
          command:
            - sysctl
            - -w
            - vm.max_map_count=262144
          securityContext:
            privileged: true
      containers:
        - name: es-master
          image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CLUSTER_NAME
              value: myesdb
            - name: NUMBER_OF_MASTERS
              value: "2"
            - name: NODE_MASTER
              value: "true"
            - name: NODE_INGEST
              value: "false"
            - name: NODE_DATA
              value: "false"
            - name: HTTP_ENABLE
              value: "false"
            - name: ES_JAVA_OPTS
              value: -Xms512m -Xmx512m
            - name: NETWORK_HOST
              value: "0.0.0.0"
            - name: PROCESSORS
              valueFrom:
                resourceFieldRef:
                  resource: limits.cpu
          resources:
            requests:
              cpu: 0.25
            limits:
              cpu: 1
          ports:
            - containerPort: 9300
              name: transport
          livenessProbe:
            tcpSocket:
              port: transport
            initialDelaySeconds: 20
            periodSeconds: 10
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - emptyDir:
            medium: ""
          name: "storage"
I have seen a lot of posts talking about increasing the value but I am not sure how to do it. Any help would be appreciated.
Just want to append to this issue:
If you create the EKS cluster with eksctl, then you can append this to the NodeGroup creation YAML:
preBootstrapCommand:
  - "sed -i -e 's/1024:4096/65536:65536/g' /etc/sysconfig/docker"
  - "systemctl restart docker"
This will solve the problem for a newly created cluster by fixing the Docker daemon config.
Update the default-ulimits parameter in the file '/etc/docker/daemon.json':
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Soft": 65536,
"Hard": 65536
}
}
and restart the Docker daemon.
This is the only thing that worked for me on EKS while setting up an EFK stack. Add this to your nodegroup creation YAML file under nodeGroups:, then create your nodegroup and schedule your ES pods on it.
preBootstrapCommands:
  - "sysctl -w vm.max_map_count=262144"
  - "systemctl restart docker"
