StackDriver Monitoring for Elasticsearch in GKE

I need to monitor Elasticsearch (2.4) installed on top of a k8s cluster. I have 2 clients, 3 masters, and several data nodes running in pods. Following the Stackdriver how-to and the post "Can I run Google Monitoring Agent inside a Kubernetes Pod?", I deployed the agent in its own pod. After all that, I have no Elasticsearch metrics in Stackdriver, only zeros.
Any suggestions are more than welcome.
This is my configuration:
Elastic service:
$ kubectl describe svc elasticsearch
Name: elasticsearch
Namespace: default
Labels: component=elasticsearch
role=client
Selector: component=elasticsearch,role=client
Type: NodePort
IP: <IP>
Port: http 9200/TCP
NodePort: http <PORT>/TCP
Endpoints: <IP>:9200,<IP>:9200
Session Affinity: None
No events.
Stackdriver deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: stackagent
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: monitoring
        role: stackdriver-agent
    spec:
      containers:
      - name: hslab-data-agent
        image: StackDriverAgent:version1
The StackDriverAgent:version1 Dockerfile:
FROM ubuntu
WORKDIR /stackdriver
RUN apt-get update
RUN apt-get install curl lsb-release libyajl2 -y
RUN apt-get clean
COPY ./stackdriver/run.sh run.sh
COPY ./stackdriver/elasticsearch.conf elasticsearch.conf
RUN chmod 755 ./run.sh
CMD ["./run.sh"]
run.sh:
#!/bin/bash
curl -O https://repo.stackdriver.com/stack-install.sh
chmod 755 stack-install.sh
bash stack-install.sh --write-gcm
cp ./elasticsearch.conf /opt/stackdriver/collectd/etc/collectd.d/
service stackdriver-agent restart
while true; do
    sleep 60
    agent_pid=$(cat /var/run/stackdriver-agent.pid 2>/dev/null)
    ps -p $agent_pid > /dev/null 2>&1
    if [ $? != 0 ]; then
        echo "Stackdriver agent pid not found!"
        break
    fi
done
elasticsearch.conf:
Taken from https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/elasticsearch.conf
# This is the monitoring configuration for Elasticsearch 1.0.x and later.
# Look for ELASTICSEARCH_HOST and ELASTICSEARCH_PORT to adjust your configuration file.
LoadPlugin curl_json
<Plugin "curl_json">
    # When using non-standard Elasticsearch configurations, replace the below with
    # <URL "http://ELASTICSEARCH_HOST:ELASTICSEARCH_PORT/_nodes/_local/stats/">
    # PREVIOUS LINE:
    # <URL "http://localhost:9200/_nodes/_local/stats/">
    <URL "http://elasticsearch:9200/_nodes/_local/stats/">
        Instance "elasticsearch"
        ....
Running state:
NAME READY STATUS RESTARTS AGE
esclient-4231471109-bd4tb 1/1 Running 0 23h
esclient-4231471109-k5pnw 1/1 Running 0 23h
esdata-1-2524076994-898r0 1/1 Running 0 23h
esdata-2-2426789420-zhz7j 1/1 Running 0 23h
esmaster-1-4205943399-zj2pn 1/1 Running 0 23h
esmaster-2-4248445829-pwq46 1/1 Running 0 23h
esmaster-3-3967126695-w0tp2 1/1 Running 0 23h
stackagent-3122989159-15vj1 1/1 Running 0 18h

The problem is with the API URL in the plugin configuration. The URL http://elasticsearch:9200/_nodes/_local/stats/ returns info only for the _local node, which here is a client node that holds no documents (compare the two endpoints with the curl check below).
In addition, the Stackdriver data will be collected under the k8s cluster node and not under the pod name.
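A quick check of the difference (a hedged sketch; it assumes the elasticsearch service from above and jq being available in the image):
# _local reports only the node that answered the request (a client here),
# while the unscoped endpoint lists every node in the cluster.
curl -s http://elasticsearch:9200/_nodes/_local/stats | jq '.nodes | keys'
curl -s http://elasticsearch:9200/_nodes/stats | jq -r '.nodes[].name'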
The partial solution is to set up a sidecar in each data node pod and patch the elasticsearch.conf query with the corresponding ES node name (a sketch follows the steps):
1. curl http://elasticsearch:9200/_nodes/stats
2. find the ES node name matching $(hostname)
3. patch the configuration to <URL "http://elasticsearch:9200/_nodes/<esnode_name>/stats/">
This will collect the information of the ES data node under the k8s data node name.
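A minimal sketch of that sidecar patch, assuming the elasticsearch service name from above and jq in the agent image (both are assumptions, adjust to your setup):
#!/bin/bash
# Resolve this pod's ES node name, then point the collectd curl_json
# plugin at the per-node stats endpoint instead of _local.
ES=http://elasticsearch:9200
NODE_NAME=$(curl -s "$ES/_nodes/stats" |
    jq -r --arg h "$(hostname)" '.nodes[] | select(.host == $h) | .name')
sed -i "s|_nodes/_local/stats|_nodes/${NODE_NAME}/stats|" \
    /opt/stackdriver/collectd/etc/collectd.d/elasticsearch.conf
service stackdriver-agent restart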

Related

Access Kafka installed on a minikube cluster on Windows 10

I'm trying to install Kafka with Strimzi on a local minikube cluster running on Windows 10, to test the impact of different parameters (especially the TLS configuration). Before moving to TLS, I'd simply like to connect to my cluster :)
Here is my yaml configuration:
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 2.3.0
    replicas: 1
    listeners:
      external:
        type: nodeport
        tls: false
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      log.message.format.version: "2.3"
    storage:
      type: persistent-claim
      size: 1Gi
  zookeeper:
    replicas: 1
    storage:
      type: persistent-claim
      size: 2Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
For the listener, I first started with plain: {}, but this only gives me services of type ClusterIP, not accessible from outside minikube (I really need to connect from outside).
I then moved to a listener of kind external.
You can find below the configuration of the cluster:
kubectl get all -n kafka
NAME READY STATUS RESTARTS AGE
pod/my-cluster-entity-operator-9657c9d79-8hknc 3/3 Running 0 17m
pod/my-cluster-kafka-0 2/2 Running 0 18m
pod/my-cluster-zookeeper-0 2/2 Running 0 18m
pod/strimzi-cluster-operator-f77b7d544-hq5pq 1/1 Running 0 5h22m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/my-cluster-kafka-0 NodePort 10.99.3.204 <none> 9094:30117/TCP 18m
service/my-cluster-kafka-bootstrap ClusterIP 10.106.176.111 <none> 9091/TCP 18m
service/my-cluster-kafka-brokers ClusterIP None <none> 9091/TCP 18m
service/my-cluster-kafka-external-bootstrap NodePort 10.109.235.156 <none> 9094:32372/TCP 18m
service/my-cluster-zookeeper-client ClusterIP 10.97.2.69 <none> 2181/TCP 18m
service/my-cluster-zookeeper-nodes ClusterIP None <none> 2181/TCP,2888/TCP,3888/TCP 18m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/my-cluster-entity-operator 1/1 1 1 17m
deployment.apps/strimzi-cluster-operator 1/1 1 1 5h22m
The IP address of the minikube cluster is 192.168.49.2 (given by minikube ip).
For now, is everything correct in my configuration? I cannot connect to the cluster with a producer (I get a timeout error when I try to publish data).
I tried to connect to 192.168.49.2:32372 and 192.168.49.2:30117 and I always get the same timeout error. I also tried to run
minikube service -n kafka my-cluster-kafka-external-bootstrap
and
minikube service -n kafka my-cluster-kafka-0
and I still get the same error.
What is wrong in what I'm trying to do?
Thanks!
OK, I got the answer.
I changed the type of the service to LoadBalancer and started minikube tunnel.
One other point: as I'm running this on Windows, I noticed that if I run everything using PowerShell it works; if I use another command-line tool (like Moba) it does not work. I can't explain this.
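For reference, a minimal sketch of that change under spec.kafka, assuming the same Strimzi v1beta1 schema as in the question (listener layout differs between Strimzi versions):
listeners:
  external:
    # LoadBalancer instead of NodePort; minikube tunnel provides the external IP.
    type: loadbalancer
    tls: false
Run minikube tunnel in a separate terminal so the LoadBalancer services receive an external IP, then point the producer at the external bootstrap service's EXTERNAL-IP on port 9094 (the default external listener port in this Strimzi version).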

k8s cron job runs multiple times

I have the following cronjob, which deletes pods in a specific namespace.
I run the job as-is, but it seems the job doesn't run every 20 minutes; it runs every few (2-3) minutes.
What I need is for the job to start every 20 minutes, delete the pods in the specified namespace, and then terminate. Any idea what could be wrong here?
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart
spec:
  schedule: "*/20 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          serviceAccountName: sa
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:1.22.3
            command:
            - /bin/sh
            - -c
            - kubectl get pods -o name | while read -r POD; do kubectl delete "$POD"; sleep 30; done
I'm really not sure why this happens... Maybe the deletion of the pods themselves is somehow the cause?
update
I tried the following, but no pods were deleted. Any idea?
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            name: restart
        spec:
          serviceAccountName: pod-exterminator
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:1.22.3
            command:
            - /bin/sh
            - -c
            - kubectl get pods -o name --selector name!=restart | while read -r POD; do kubectl delete "$POD"; sleep 10; done
This cronjob pod will delete itself at some point during execution, causing the job to fail and additionally resetting its back-off count.
The docs say:
The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
You need to apply an appropriate filter. Also note that you can delete all pods with a single command.
Add a label to spec.jobTemplate.spec.template.metadata that you can use for filtering.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            name: restart # label the pod
Then use this label to delete all pods that are not the cronjob pod.
kubectl delete pod --selector name!=restart
Since you state in the comments that you need a loop, a full working example may look like this.
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart
  namespace: sandbox
spec:
  schedule: "*/20 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            name: restart
        spec:
          serviceAccountName: restart
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:1.22.3
            command:
            - /bin/sh
            - -c
            - |
              kubectl get pods -o name --selector "name!=restart" |
              while read -r POD; do
                kubectl delete "$POD"
                sleep 30
              done
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: restart
  namespace: sandbox
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-management
  namespace: sandbox
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: restart-pod-management
  namespace: sandbox
subjects:
- kind: ServiceAccount
  name: restart
  namespace: sandbox
roleRef:
  kind: Role
  name: pod-management
  apiGroup: rbac.authorization.k8s.io
kubectl create namespace sandbox
kubectl config set-context --current --namespace sandbox
kubectl run pod1 --image busybox -- sleep infinity
kubectl run pod2 --image busybox -- sleep infinity
kubectl apply -f restart.yaml # the above file
Here you can see how the first pod is getting terminated.
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/pod1 1/1 Terminating 0 43s
pod/pod2 1/1 Running 0 39s
pod/restart-27432801-rrtvm 1/1 Running 0 16s
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
cronjob.batch/restart */1 * * * * False 1 17s 36s
NAME COMPLETIONS DURATION AGE
job.batch/restart-27432801 0/1 17s 17s
Note that this is actually slightly buggy, because between the time you read the pod list and the time you delete an individual pod from it, that pod may no longer exist. You could use the below to ignore those cases, since once a pod is gone you don't need to delete it.
kubectl delete "$POD" || true
That said, since you name your job restart, I assume the purpose of this is to restart the pods of some deployments. You could actually use a proper restart, leveraging Kubernetes update strategies.
kubectl rollout restart $(kubectl get deploy -o name)
With the default update strategy, this leads to new pods being created first and confirmed ready before the old ones are terminated.
$ kubectl rollout restart $(kubectl get deploy -o name)
NAME READY STATUS RESTARTS AGE
pod/app1-56f87fc665-mf9th 0/1 ContainerCreating 0 2s
pod/app1-5cbc776547-fh96w 1/1 Running 0 2m9s
pod/app2-7b9779f767-48kpd 0/1 ContainerCreating 0 2s
pod/app2-8d6454757-xj4zc 1/1 Running 0 2m9s
This also works with daemonsets.
$ kubectl rollout restart -h
Restart a resource.
Resource rollout will be restarted.
Examples:
# Restart a deployment
kubectl rollout restart deployment/nginx
# Restart a daemon set
kubectl rollout restart daemonset/abc

Record Kubernetes container resource utilization data

I'm doing a perf test for a web server deployed on an EKS cluster. I'm invoking the server using JMeter under different conditions (varying thread count, payload size, etc.).
I want to record Kubernetes perf data with timestamps so that I can analyze it alongside my JMeter output (JTL).
I have been digging through the internet for a way to record Kubernetes perf data, but I was unable to find a proper one.
Could experts please suggest a standard way to do this?
Note: I have a multi-container pod as well.
In line with @Jonas' comment:
This is the quickest way of installing Prometheus in your K8s cluster. I've added the details in this answer, as it was impossible to put the commands in a readable format in a comment.
Add the Bitnami Helm repo:
helm repo add bitnami https://charts.bitnami.com/bitnami
Install the Helm chart for Prometheus:
helm install my-release bitnami/kube-prometheus
Installation output would be:
C:\Users\ameena\Desktop\shine\Article\K8\promethus>helm install my-release bitnami/kube-prometheus
NAME: my-release
LAST DEPLOYED: Mon Apr 12 12:44:13 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
** Please be patient while the chart is being deployed **
Watch the Prometheus Operator Deployment status using the command:
kubectl get deploy -w --namespace default -l app.kubernetes.io/name=kube-prometheus-operator,app.kubernetes.io/instance=my-release
Watch the Prometheus StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-prometheus,app.kubernetes.io/instance=my-release
Prometheus can be accessed via port "9090" on the following DNS name from within your cluster:
my-release-kube-prometheus-prometheus.default.svc.cluster.local
To access Prometheus from outside the cluster execute the following commands:
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Watch the Alertmanager StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-alertmanager,app.kubernetes.io/instance=my-release
Alertmanager can be accessed via port "9093" on the following DNS name from within your cluster:
my-release-kube-prometheus-alertmanager.default.svc.cluster.local
To access Alertmanager from outside the cluster execute the following commands:
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-alertmanager 9093:9093
Follow the commands to forward the UI to localhost.
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Open the UI in browser: http://127.0.0.1:9090/classic/graph
Annotate the pods so that their metrics are scraped:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4 # Update the replicas from 2 to 4
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9102'
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In the UI, apply appropriate filters and start observing the crucial parameters such as memory, CPU, etc. The UI supports autocomplete, so it will not be that difficult to figure things out. For correlating with the JMeter JTL timestamps, the same data can also be pulled as a timestamped series from the Prometheus HTTP API, as sketched below.
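A hedged sketch of such an export (the cAdvisor metric name and the port-forward address from the notes above are assumptions about your setup; GNU date is used for the time range, so adjust on Windows):
# Pull a timestamped CPU series for the last 30 minutes at 15s resolution
# through the port-forwarded Prometheus API; each point comes back as a
# [unix_timestamp, value] pair, which lines up with the JMeter JTL timestamps.
curl -sG 'http://127.0.0.1:9090/api/v1/query_range' \
    --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="default"}[5m])' \
    --data-urlencode "start=$(date -u -d '-30 min' +%s)" \
    --data-urlencode "end=$(date -u +%s)" \
    --data-urlencode 'step=15'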
Regards

How can I diagnose why a k8s pod keeps restarting?

I deployed Elasticsearch to minikube with the configuration file below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch
spec:
  replicas: 1
  selector:
    matchLabels:
      name: elasticsearch
  template:
    metadata:
      labels:
        name: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: elasticsearch:7.10.1
        ports:
        - containerPort: 9200
        - containerPort: 9300
I run the command kubectl apply -f es.yml to deploy the elasticsearch cluster.
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
elasticsearch-fb9b44948-bchh2 1/1 Running 5 6m23s
The elasticsearch pod keeps restarting every few minutes. When I run the kubectl describe pod command, I can see these events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m11s default-scheduler Successfully assigned default/elasticsearch-fb9b44948-bchh2 to minikube
Normal Pulled 3m18s (x5 over 7m11s) kubelet Container image "elasticsearch:7.10.1" already present on machine
Normal Created 3m18s (x5 over 7m11s) kubelet Created container elasticsearch
Normal Started 3m18s (x5 over 7m10s) kubelet Started container elasticsearch
Warning BackOff 103s (x11 over 5m56s) kubelet Back-off restarting failed container
The last event is Back-off restarting failed container, but I don't know why it restarts the pod. Is there any way I can check why it keeps restarting?
The first step (kubectl describe pod) you've already done. As a next step I suggest checking the container logs: kubectl logs <pod_name>. 99% of the time you'll get the reason from the logs in this case (I bet on a bootstrap check failure).
When neither describe pod nor logs have anything about the error, I get into the container with exec: kubectl exec -it <pod_name> -c <container_name> sh. With this you'll get a shell inside the container (of course, only if there IS a shell binary in it) and you can use it to investigate the problem manually. Note that to keep a failing container alive, you may need to change command and args to something like this:
command:
- /bin/sh
- -c
args:
- cat /dev/stdout
Be sure to disable probes when doing this. A container may restart if its liveness probe fails; you will see that in kubectl describe pod if it happens. Since your snippet doesn't have any probes specified, you can skip this.
Checking the logs of the pod using kubectl logs podname gives a clue about what could go wrong:
ERROR: [2] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/docker-cluster.log
Check this post for a solution; the usual fix for these two checks is sketched below.
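A hedged sketch of that usual fix on a dev/minikube setup (single-node discovery plus raising vm.max_map_count via a privileged init container; both settings are dev-only assumptions, not production advice):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch
spec:
  replicas: 1
  selector:
    matchLabels:
      name: elasticsearch
  template:
    metadata:
      labels:
        name: elasticsearch
    spec:
      initContainers:
      - name: sysctl
        image: busybox
        # Satisfies bootstrap check [1] by raising the host's vm.max_map_count.
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      containers:
      - name: elasticsearch
        image: elasticsearch:7.10.1
        env:
        # Satisfies bootstrap check [2] by disabling production discovery.
        - name: discovery.type
          value: single-node
        ports:
        - containerPort: 9200
        - containerPort: 9300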

Kubernetes Ingress Controller on Vagrant

Is there anything special about running ingress controllers on Kubernetes CoreOS Vagrant Multi-Machine? I followed the example but when I run kubectl -f I do not get an address.
Example:
http://kubernetes.io/v1.1/docs/user-guide/ingress.html#single-service-ingress
Setup:
https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant.html
I looked at networking in kubernetes. Everything looks like it should run without further configuration.
My goal is to create a local testing environment before I build out a production platform. I'm thinking there's something about how they set up their VirtualBox networking. I'm about to dive into the CoreOS cloud config, but thought I would ask first.
UPDATE
Yes I'm running an ingress controller.
https://github.com/kubernetes/contrib/blob/master/Ingress/controllers/nginx-alpha/rc.yaml
It runs without giving an error. It's just that when I run kubectl -f I do not get an address. I'm thinking it's one of two things:
I have to do something extra in networking for the CoreOS-Kubernetes Vagrant multi-node setup.
It's running right, but I'm pointing my localhost to the wrong IP. I'm using a 172.17.4.x IP; I also have 10.0.0.x. I can access services through the 172.17.4.x address using a NodePort, but I can't get to my Ingress.
Here is the code:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-ingress
  labels:
    app: nginx-ingress
spec:
  replicas: 1
  selector:
    app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      containers:
      - image: gcr.io/google_containers/nginx-ingress:0.1
        imagePullPolicy: Always
        name: nginx
        ports:
        - containerPort: 80
          hostPort: 80
Update 2
Output of commands:
kubectl get pods
NAME READY STATUS RESTARTS AGE
echoheaders-kkja7 1/1 Running 0 24m
nginx-ingress-2wwnk 1/1 Running 0 25m
kubectl logs nginx-ingress-2wwnk --previous
Pod "nginx-ingress-2wwnk" in namespace "default": previous terminated container "nginx" not found
kubectl exec nginx-ingress-2wwnk -- cat /etc/nginx/nginx.conf
events {
    worker_connections 1024;
}
http {
}
I'm running an echoheaders service on a NodePort. When I enter the node IP and port in my browser, I get it just fine.
I restarted all nodes in VirtualBox too.
With a lot of help from the Kubernetes IRC and Slack, I fixed this a while back. If I remember correctly, I had the ingress service listening on a port that was already in use, I think by Vagrant. These commands really helped:
kubectl get pod <nginx-ingress pod> -o json
kubectl exec <nginx-ingress pod> -- cat /etc/nginx/nginx.conf
kubectl get pods -o wide
kubectl logs <nginx-ingress pod> --previous
