Unable to collect all kubernetes container/pod logs via fluentd/elasticsearch - elasticsearch

I'm new to the fluentd/elasticsearch stack and I'm trying to deploy it on kubernetes. While I've managed to do that, I'm having a problem where not all pod/container logs show up in elasticsearch (I'm using Kibana for data visualisation). In other words, I'm able to see logs from "default" kubernetes pods like weave-net and from elasticsearch-related pods (es-data, es-master, etc.), but not from the "custom" pods that I'm trying to deploy.
As a simple test, I've deployed redis in the same kube namespace where fluentd/elasticsearch resides, and the redis service/deployment looks like this:
---
apiVersion: v1
kind: Service
metadata:
  name: redis-master
  labels:
    app: redis
    role: master
    tier: backend
spec:
  ports:
  - port: 6379
    targetPort: 6379
  selector:
    app: redis
    role: master
    tier: backend
---
apiVersion: apps/v1 # for k8s versions before 1.9.0 use apps/v1beta2 and before 1.8.0 use extensions/v1beta1
kind: Deployment
metadata:
  name: redis-master
spec:
  selector:
    matchLabels:
      app: redis
      role: master
      tier: backend
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
        role: master
        tier: backend
    spec:
      containers:
      - name: master
        image: k8s.gcr.io/redis:e2e # or just image: redis
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
When I check the logs of the fluentd DaemonSet pods, I see the following:
2018-07-03 11:17:05 +0000 [info]: following tail of /var/log/containers/redis-master-585798d8ff-b5p5g_default_master-4c934d19a8e2b2d6143b662425fd8fc238df98433d1c0c32bf328c281ef593ad.log
which, if I understand correctly, indicates that fluentd is picking up the redis container logs. However, I'm unable to see any redis-related documents stored in elasticsearch.
This is what part of the fluentd configuration looks like (kubernetes.conf):
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  format json
  time_format %Y-%m-%dT%H:%M:%S.%NZ
</source>
and fluent.conf:
<match **>
  @type elasticsearch
  @id out_es
  log_level info
  include_tag_key true
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
  ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
  user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
  password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
  reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
  logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
  logstash_format true
  buffer_chunk_limit 2M
  buffer_queue_limit 32
  flush_interval 5s
  max_retry_wait 30
  disable_retry_limit
  num_threads 8
</match>
Any hint would be very helpful.
Thanks in advance.

I am using Fluent Bit for the same purpose and I ran into exactly the same problem a few days back. Fluent Bit is a lightweight version of fluentd, and what worked for me might work for you as well.
What was wrong with my Fluent Bit was the input configuration. For the tail plugin tailing large log files, there was some issue with log rotation. So I lowered my refresh_interval to something like 5 seconds (the interval at which the list of watched files is updated), and I lowered the mem_buf_limit to something like 5MB (the total size of logs Fluent Bit holds in memory before flushing to the output plugin).
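For reference, here is a rough sketch of what those settings look like in a Fluent Bit tail input (the Path, Parser and Tag values are just typical Kubernetes defaults I'm assuming, not necessarily what you have):
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On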
With these changes I was able to collect logs that were previously not being picked up, for reasons I still don't know.
I have filed this as an issue and will update my answer if I find out the reason.
Hope this helps in some way. Mainly, I suggest you tweak your input configuration and see whether it changes anything.

Related

How to configure Filebeat on ECK for kafka input?

I have Elasticsearch and Kibana running on Kubernetes, both created by ECK. Now I'm trying to add Filebeat and configure it to index data coming from a Kafka topic. This is my current configuration:
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: my-filebeat
  namespace: my-namespace
spec:
  type: filebeat
  version: 7.10.2
  elasticsearchRef:
    name: my-elastic
  kibanaRef:
    name: my-kibana
  config:
    filebeat.inputs:
    - type: kafka
      hosts:
        - host1:9092
        - host2:9092
        - host3:9092
      topics: ["my.topic"]
      group_id: "my_group_id"
      index: "my_index"
  deployment:
    podTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        containers:
        - name: filebeat
In the logs of the pod I can see entries like the following:
log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":2470,"time":{"ms":192}},"total":{"ticks":7760,"time":{"ms":367},"value":7760},"user":{"ticks":5290,"time":{"ms":175}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":13},"info":{"ephemeral_id":"5ce8521c-f237-4994-a02e-dd11dfd31b09","uptime":{"ms":181997}},"memstats":{"gc_next":23678528,"memory_alloc":15320760,"memory_total":459895768},"runtime":{"goroutines":106}},"filebeat":{"harvester":{"open_files":0,"running":0},"inputs":{"kafka":{"bytes_read":46510,"bytes_write":37226}}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":1.18,"15":0.77,"5":0.97,"norm":{"1":0.0738,"15":0.0481,"5":0.0606}}}}}}
And no error entries are there, so I assume the connection to Kafka works. Unfortunately, there is no data in the my_index index specified above. What am I doing wrong?
I guess you are not able to connect to the Elasticsearch cluster referenced in the output.
As per the docs, ECK secures the deployed Elasticsearch and stores its credentials in Kubernetes Secrets.
https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-beat-configuration.html
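As a sketch of what to check (assuming the Elasticsearch resource is named my-elastic as in your manifest, so the secret and service follow ECK's usual my-elastic-es-elastic-user / my-elastic-es-http naming):
# read the generated elastic user password from the ECK secret
kubectl get secret my-elastic-es-elastic-user -n my-namespace -o go-template='{{.data.elastic | base64decode}}'
# then, from a pod inside the cluster (or via port-forward), see whether anything was written to the index
curl -k -u "elastic:<password>" "https://my-elastic-es-http.my-namespace:9200/my_index/_search"
If that search fails or returns nothing, the output side is the problem rather than the Kafka input.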

Kubernetes Kibana operator failures and Nginx ingress timeouts

I just started implementing a Kubernetes cluster on an Azure Linux VM. I'm very new to all this. The cluster is running on a small VM (2 cores, 16 GB). I set up the ECK stack using their tutorial online, and an Nginx Ingress controller to expose it.
Most of the day, everything runs fine. I can access the Kibana dashboard, run Elastic queries, and Nginx is working. But about once each day, something happens that causes the Kibana Endpoint matching the Kibana Service to not have any IP address. As a result, the Service can't route correctly to the container. When this happens, the Kibana pod has a status of Running, but says that 0/1 are running. It never triggers any restarts, and as a result, the Kibana dashboard becomes inaccessible. I've tried reproducing this by shutting down the Docker container and force-killing the pod, but can't reliably reproduce it.
Looking at the logs on the Kibana pod, there are a bunch of errors due to timeouts. The Nginx logs say that it can't find the Endpoint for the Service. It looks like this could potentially be the source. Has anyone encountered this? Does anyone know a reliable way to prevent this?
This should probably be a separate question, but the other issue this causes is completely blocking all Nginx Ingress. Any new requests are not seen in the logs, and the logs completely stop after the message about not finding an endpoint. As a result, all URLs that Ingress is normally responsible for time out, and the whole cluster becomes externally unusable. This is fixed by deleting the Nginx controller pod, but the pod doesn't restart itself. Can someone explain why an issue like this would completely block Nginx? And why the Nginx pod can't detect this and restart?
Edit:
The Nginx logs end with this:
W1126 16:20:31.517113 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:20:34.848942 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:21:52.555873 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
Any further requests time out and do not appear in the logs.
I don't have logs for the kibana pod, but they were just consistent timeouts to the kibana service default/gwam-kb-http (same as in Nginx logs above). This caused the readiness probe to fail, and show 0/1 Running, but did not trigger a restart of the pod.
Kibana Endpoints when everything is normal
Name:         gwam-kb-http
Namespace:    default
Labels:       common.k8s.elastic.co/type=kibana
              kibana.k8s.elastic.co/name=gwam
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2020-11-26T16:27:20Z
Subsets:
  Addresses:          10.244.0.6
  NotReadyAddresses:  <none>
  Ports:
    Name   Port  Protocol
    ----   ----  --------
    https  5601  TCP
Events:  <none>
When I run into this issue, Addresses is empty, and the pod IP is under NotReadyAddresses
I'm using the very basic YAML from the ECK setup tutorial:
Elastic (no problems here)
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: gwam
spec:
  version: 7.10.0
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: elasticsearch
Kibana:
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: gwam
spec:
  version: 7.10.0
  count: 1
  elasticsearchRef:
    name: gwam
Ingress for the Kibana service:
kind: Ingress
apiVersion: extensions/v1beta1
metadata:
  name: nginx-ingress-secure-backend-no-rewrite
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.org/proxy-connect-timeout: "30s"
    nginx.org/proxy-read-timeout: "20s"
    nginx.org/proxy-send-timeout: "60s"
    nginx.org/client-max-body-size: "4m"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  tls:
  - hosts:
    - <internal company site>
    secretName: gwam-tls-secret
  rules:
  - host: <internal company site>
    http:
      paths:
      - path: /
        backend:
          serviceName: gwam-kb-http
          servicePort: 5601
Some more environment details:
Kubernetes version: 1.19.3
OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1031-azure x86_64)
edit 2:
Seems like I'm getting some kind of network error here. None of my pods can do a DNS lookup for kubernetes.default. All the networking pods are running, but after adding logs to CoreDNS, I'm seeing the following:
[ERROR] plugin/errors: 2 1699910358767628111.9001703618875455268. HINFO: read udp 10.244.0.69:35222->10.234.44.20:53: i/o timeout
I'm using Flannel for my network. Thinking of trying to reset and switch to Calico and increasing nf_conntrack_max as some answers suggest.
This ended up being a very simple mistake on my part. I thought it was a pod or DNS issue, but it was just a general network issue: IP forwarding was turned off. I turned it on with:
sysctl -w net.ipv4.ip_forward=1
And added net.ipv4.ip_forward=1 to /etc/sysctl.conf
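In case it helps anyone else, a short sketch of verifying the setting (plain sysctl usage, nothing cluster-specific):
# check the current value
sysctl net.ipv4.ip_forward
# reload /etc/sysctl.conf so the persisted setting takes effect without a reboot
sudo sysctl -p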

Access Elasticsearch from minikube/kubernetes

I have a spring boot application which is deployed in Kubernetes on my local Windows machine using minikube. I also have Elasticsearch running on my local machine (http://localhost:9200).
I want to call Elasticsearch REST endpoints from this spring boot app.
I tried solving this by creating a Service without a selector, but I'm not sure what I'm missing.
When accessing the spring boot app using http://#minikube_ip#:#Node_Port#, I get the error "No route to host".
I tried doing minikube ssh and executing the curl command from there as well, and I get the same error. Clearly I am missing something here.
application.yaml
elasticsearch:
  hosts:
    - http://my-es:80
  connectTimeout: 10000
  connectionRequestTimeout: 10000
  socketTimeout: 10000
  maxRetryTimeoutMillis: 60000
deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-es-app
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      run: kube-es-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: kube-es-app
    spec:
      containers:
      - image: elastic-search-app:latest
        imagePullPolicy: Never
        name: kube-es-app
        ports:
        - containerPort: 8080
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
---
kind: Service
apiVersion: v1
metadata:
  name: my-es
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9200
---
kind: Endpoints
apiVersion: v1
metadata:
  name: my-es
subsets:
  - addresses:
      - ip: <MY_LOCAL_MACHINE_IP>
    ports:
      - port: 9200
Commands I executed
docker build -t elastic-search-app .
kubectl create -f deployment.yaml
kubectl expose deployment/kube-es-app --type="NodePort" --port 8080
Can anyone help, please? I'm stuck.
If I've got the description right, the Windows machine should have a VirtualBox network adapter connected to the host-only network that the Minikube VM is connected to.
Minikube can access the host machine directly because both are on the same network.
Minikube is in charge of NAT-ing packets from Pods to the outside. What you need is to allow Elasticsearch to listen on the vbox interface (or all interfaces) and to open its port in the Windows firewall. Then Elasticsearch should be reachable via the IP address of the Windows machine on the host-only network.
Apart from that, you might create a service (if you need to go by name instead of IP) as discussed here:
Connect to local database from inside minikube cluster,
Minikube: Exposing mysql as a service on localhost.
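As a rough sketch of how to verify both halves of this (the network.host value and the placeholder IP are assumptions, adjust them to your setup):
# in elasticsearch.yml on the Windows host, make Elasticsearch listen on more than localhost
network.host: 0.0.0.0
# from the Windows host: confirm Elasticsearch answers on the host-only interface
curl http://<MY_LOCAL_MACHINE_IP>:9200
# from inside the cluster: confirm the selector-less Service routes to that Endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -v http://my-es:80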

How can I visualize elasticsearch metrics in prometheus, both installed in a GKE cluster?

I have a GKE cluster with this elasticsearch logging solution installed:
https://console.cloud.google.com/marketplace/details/google/elastic-gke-logging
And prometheus-operator installed via helm inside the same cluster.
I would like to configure a Grafana dashboard to visualize the metrics of my Elasticsearch.
I read that the Elastic application from GKE has the elastic_exporter installed: https://github.com/GoogleCloudPlatform/click-to-deploy/blob/master/k8s/elastic-gke-logging/README.md
But if I go to my Prometheus panel I don't see any metrics about Elasticsearch. I tried installing another elastic_exporter, but nothing.
Am I missing something? Have I forgotten something? Do I need to configure Prometheus to read from the elastic_exporter?
I can see the metrics when I port-forward to the elastic_exporter, but I don't see them inside the Prometheus panel.
# HELP elasticsearch_breakers_estimated_size_bytes Estimated size in bytes of breaker
# TYPE elasticsearch_breakers_estimated_size_bytes gauge
elasticsearch_breakers_estimated_size_bytes{breaker="accounting",cluster="elastic-gke-logging-1-cluster",es_client_node="true",es_data_node="true",es_ingest_node="true",es_master_node="true",host="10.50.2.54",name="elastic-gke-logging-1-elasticsearch-0"} 4.6637464e+07
elasticsearch_breakers_estimated_size_bytes{breaker="fielddata",cluster="elastic-gke-logging-1-cluster",es_client_node="true",es_data_node="true",es_ingest_node="true",es_master_node="true",host="10.50.2.54",name="elastic-gke-logging-1-elasticsearch-0"} 0
elasticsearch_breakers_estimated_size_bytes{breaker="in_flight_requests",cluster="elastic-gke-logging-1-cluster",es_client_node="true",es_data_node="true",es_ingest_node="true",es_master_node="true",host="10.50.2.54",name="elastic-gke-logging-1-elasticsearch-0"} 0
elasticsearch_breakers_estimated_size_bytes{breaker="parent",cluster="elastic-gke-logging-1-cluster",es_client_node="true",es_data_node="true",es_ingest_node="true",es_master_node="true",host="10.50.2.54",name="elastic-gke-logging-1-elasticsearch-0"} 4.6637464e+07
elasticsearch_breakers_estimated_size_bytes{breaker="request",cluster="elastic-gke-logging-1-cluster",es_client_node="true",es_data_node="true",es_ingest_node="true",es_master_node="true",host="10.50.2.54",name="elastic-gke-logging-1-elasticsearch-0"} 0
# HELP elasticsearch_breakers_limit_size_bytes Limit size in bytes for breaker
# TYPE elasticsearch_breakers_limit_size_bytes gauge
Thank you
You are probably missing a ServiceMonitor; this should work:
k apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
  labels:
    release: prom
  name: elasticsearch
spec:
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app: es-exporter
EOF
Your Elasticsearch Service must define a metrics port and have the label app: es-exporter, similar to this:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: es-exporter
    component: elasticsearch
  name: elasticsearch
spec:
  ports:
  - name: transport
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: metrics
    port: 9108
    protocol: TCP
    targetPort: 9108
  selector:
    component: elasticsearch
  type: ClusterIP
After that you should find the metrics in Prometheus; to confirm, you can always check the Status -> Targets tab in Prometheus.
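One more thing worth double-checking (this is an assumption about how the prometheus-operator helm chart is usually set up): the release: prom label on the ServiceMonitor has to match the serviceMonitorSelector of your Prometheus resource, otherwise the operator will ignore it. You can inspect the selector with:
kubectl get prometheus -o jsonpath='{.items[*].spec.serviceMonitorSelector}'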

How do I access this Kubernetes service via kubectl proxy?

I want to access my Grafana Kubernetes service via the kubectl proxy server, but for some reason it won't work even though I can make it work for other services. Given the below service definition, why is it not available on http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana?
grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: grafana
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
  - name: web
    port: 3000
    protocol: TCP
    nodePort: 30902
  selector:
    app: grafana
grafana-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: monitoring
  name: grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:4.1.1
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_SECURITY_ADMIN_USER
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: user
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: password
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/grafana-storage
        ports:
        - name: web
          containerPort: 3000
        resources:
          requests:
            memory: 100Mi
            cpu: 100m
          limits:
            memory: 200Mi
            cpu: 200m
      - name: grafana-watcher
        image: quay.io/coreos/grafana-watcher:v0.0.5
        args:
          - '--watch-dir=/var/grafana-dashboards'
          - '--grafana-url=http://localhost:3000'
        env:
        - name: GRAFANA_USER
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: user
        - name: GRAFANA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: password
        resources:
          requests:
            memory: "16Mi"
            cpu: "50m"
          limits:
            memory: "32Mi"
            cpu: "100m"
        volumeMounts:
        - name: grafana-dashboards
          mountPath: /var/grafana-dashboards
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-dashboards
        configMap:
          name: grafana-dashboards
The error I'm seeing when accessing the above URL is "no endpoints available for service "grafana"", error code 503.
With Kubernetes 1.10 the proxy URL should be slightly different, like this:
http://localhost:8080/api/v1/namespaces/default/services/SERVICE-NAME:PORT-NAME/proxy/
Ref: https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#manually-constructing-apiserver-proxy-urls
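Applied to the Grafana Service from this question (assuming the default kubectl proxy port 8001 and the port name web from the manifest above), that would be:
http://localhost:8001/api/v1/namespaces/monitoring/services/grafana:web/proxy/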
As Michael says, quite possibly your labels or namespaces are mismatched. However, in addition to that, keep in mind that even when you fix the endpoint, the URL you're after (http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana) might not work correctly.
Depending on your root_url and/or static_root_path Grafana configuration settings, when trying to log in you might see Grafana POST to http://localhost:8001/login and get a 404.
Try using kubectl port-forward instead:
kubectl -n monitoring port-forward [grafana-pod-name] 3000
then access grafana via http://localhost:3000/
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
The issue is that Grafana's port is named web, and as a result one needs to append :web to the kubectl proxy URL: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana:web.
An alternative is to not name the Grafana port at all, because then you don't have to append :web to the kubectl proxy URL for the service: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana. I went with this option in the end since it's easier.
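For clarity, this is a sketch of what the unnamed-port variant of the Service spec would look like (same values as the manifest above, only the port name dropped):
spec:
  type: NodePort
  ports:
  - port: 3000
    protocol: TCP
    nodePort: 30902
  selector:
    app: grafana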
There are a few factors that might be causing this issue.
The service expects to find one or more supporting endpoints, which it discovers through matching rules on the labels. If the labels don't align, then the service won't find endpoints, and the network gateway function performed by the service will result in 503.
The port declared by the Pod and the process within the container are misaligned from the --target-port expected by the service.
Either one of these might generate the error. Let's take a closer look.
First, kubectl describe the service:
$ kubectl describe svc grafana01-grafana-3000
Name:              grafana01-grafana-3000
Namespace:         default
Labels:            app=grafana01-grafana
                   chart=grafana-0.3.7
                   component=grafana
                   heritage=Tiller
                   release=grafana01
Annotations:       <none>
Selector:          app=grafana01-grafana,component=grafana,release=grafana01
Type:              NodePort
IP:                10.0.0.197
Port:              <unset>  3000/TCP
NodePort:          <unset>  30905/TCP
Endpoints:         10.1.45.69:3000
Session Affinity:  None
Events:            <none>
Notice that my grafana service has 1 endpoint listed (there could be multiple). The error in your example above indicates that you don't have any endpoints listed here.
Endpoints: 10.1.45.69:3000
Let's take a look next at the selectors. In the example above, you can see I have 3 selector labels on my service:
Selector: app=grafana01-grafana,component=grafana,release=grafana01
I'll kubectl describe my pods next:
$ kubectl describe pod grafana
Name:         grafana01-grafana-1843344063-vp30d
Namespace:    default
Node:         10.10.25.220/10.10.25.220
Start Time:   Fri, 14 Jul 2017 03:25:11 +0000
Labels:       app=grafana01-grafana
              component=grafana
              pod-template-hash=1843344063
              release=grafana01
...
Notice that the labels on the pod align correctly, hence my service finds pods that provide endpoints, which are then load-balanced by the service. Verify that this part of the chain isn't broken in your environment.
If you do find that the labels are correct, you may still have a disconnect in that the grafana process running within the container in the pod is listening on a different port than you expect.
$ kubectl describe pod grafana
Name:         grafana01-grafana-1843344063-vp30d
...
Containers:
  grafana:
    Container ID:   docker://69f11b7828c01c5c3b395c008d88e8640c5606f4d865107bf4b433628cc36c76
    Image:          grafana/grafana:latest
    Image ID:       docker-pullable://grafana/grafana@sha256:11690015c430f2b08955e28c0e8ce7ce1c5883edfc521b68f3fb288e85578d26
    Port:           3000/TCP
    State:          Running
      Started:      Fri, 14 Jul 2017 03:25:26 +0000
If, for some reason, the port listed under the container showed a different value, then the service would effectively be load balancing against an invalid endpoint.
For example, if it listed port 80:
Port: 80/TCP
Or was an empty value
Port:
Then even if your label selectors were correct, the service would never find a valid response from the pod and would remove the endpoint from the rotation.
I suspect your issue is the first problem above (mismatched label selectors).
If both the label selectors and ports align, then you might have a problem with the MTU setting between nodes. In some cases, if the MTU used by your networking layer (like calico) is larger than the MTU of the supporting network, then you'll never get a valid response from the endpoint. Typically, this last potential issue will manifest itself as a timeout rather than a 503 though.
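A quick way to check the first two factors in your own cluster (a sketch using the names from your manifests; adjust the namespace and service name if they differ):
# compare the Service selector with the labels on the pods it is supposed to match
kubectl -n monitoring get svc grafana -o jsonpath='{.spec.selector}'
kubectl -n monitoring get pods --show-labels
# confirm the Service actually has endpoints
kubectl -n monitoring get endpoints grafana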
Your Deployment may not have the label app: grafana, or may be in another namespace. Could you also post the Deployment definition?
