Grafana Loki 2.5.0: log retention period is not working

I upgraded my Loki deployment in Kubernetes from Loki 2.4.0 to Loki 2.5.0.
After the upgrade, old chunks are no longer deleted, so they fill up the filesystem.
It looks like the retention period is not working.
I am trying to configure a 7-day retention period for Loki.
Can someone please help me configure the retention period in Loki 2.5.0?
My configuration:
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 67108864
  grpc_server_max_send_msg_size: 67108864
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 15m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal
schema_config:
  configs:
    - from: 2021-02-01
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
      chunks:
        prefix: chunks_
        period: 24h
storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 72h
  ingestion_rate_mb: 32
  creation_grace_period: 30m
  max_entries_limit_per_query: 50000
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

Try configuring limits_config.retention_period: 168h. In recent Loki versions the table manager approach is deprecated; retention is enforced by the compactor, which reads the retention period from limits_config.
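For context, here is a minimal sketch of compactor-driven retention in Loki 2.5 (hedged: compactor retention requires the boltdb-shipper store, so the schema above would need to be migrated from plain boltdb first; paths are illustrative):

compactor:
  working_directory: /loki/compactor   # illustrative path
  shared_store: filesystem
  retention_enabled: true              # actually delete expired chunks
  retention_delete_delay: 2h           # grace period before deletion
limits_config:
  retention_period: 168h               # 7 days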

Related

Promtail: "error sending batch, will retry" status=500

When starting the Promtail client, it gives an error:
component=client host=loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (6780207 vs. 4194304)"
Promtail config:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: 'http://loki:3100/loki/api/v1/push'
scrape_configs:
  - job_name: server-log
    pipeline_stages:
    static_configs:
      - targets:
          - localhost
        labels:
          job: server-log
          __path__: /opt/log/*.log
          __path_exclude__: /opt/log/jck_*,*.log
I tried changing the limits and running Promtail with the following parameters:
/usr/local/bin/promtail-linux-amd64 -config.file=/etc/config-promtail.yml -server.grpc-max-recv-msg-size-bytes 16777216 -server.grpc-max-concurrent-streams 0 -server.grpc-max-send-msg-size-bytes 16777216 -limit.readline-rate-drop -client.batch-size-bytes 2048576
But judging by what I found, this is a gRPC message-size error: the batch being sent (6780207 bytes in the error) exceeds the default maximum message size of 4 MiB (4194304 bytes).
Changing the parameters on the server (Loki), not on the client, solved the problem:
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
limits_config:
  ingestion_rate_mb: 15
  ingestion_burst_size_mb: 30
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 20MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 744h
  max_query_length: 0h

Loki: Ingester high memory usage

Please advise on memory usage by the Loki ingester component.
I have the following setup: Loki distributed v2.6.1, installed through the official Helm chart in Kubernetes.
There are ~1000 Promtail clients, each generating a large load, about 5 million chunks in total.
loki_log_messages_total is about 235 million per day.
My problem is that the ingester uses about 140 GB of RAM per day, and memory consumption keeps increasing. I want to understand whether this is normal behavior or whether I can reduce memory usage through the config. I tried adjusting various parameters myself, in particular chunk_idle_period and max_chunk_age, but no matter what values I set, memory consumption stays at 100+ GB.
I also tried to reduce the number of labels on the client side.
Here is my config:
auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  frontend_address: loki-distributed-query-frontend:9095
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
ingester:
  autoforget_unhealthy: true
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  max_chunk_age: 15m
  max_transfer_retries: 0
  wal:
    enabled: false
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
limits_config:
  cardinality_limit: 500000
  enforce_metric_name: false
  ingestion_burst_size_mb: 300
  ingestion_rate_mb: 150
  max_cache_freshness_per_query: 10m
  max_entries_limit_per_query: 1000000
  max_global_streams_per_user: 5000000
  max_label_name_length: 1024
  max_label_names_per_series: 300
  max_label_value_length: 8096
  max_query_series: 250000
  per_stream_rate_limit: 150M
  per_stream_rate_limit_burst: 300M
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 72h
  split_queries_by_interval: 30m
memberlist:
  join_members:
    - loki-distributed-memberlist
querier:
  engine:
    timeout: 5m
  query_timeout: 5m
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        ttl: 24h
query_scheduler:
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
    - from: "2022-09-07"
      index:
        period: 24h
        prefix: loki_index_
      object_store: aws
      schema: v12
      store: boltdb-shipper
server:
  grpc_server_max_recv_msg_size: 167772160
  grpc_server_max_send_msg_size: 167772160
  http_listen_port: 3100
  http_server_idle_timeout: 300s
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
storage_config:
  aws:
    s3: https:/....
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /var/loki/boltdb_shipper/index
    cache_location: /var/loki/boltdb_shipper/cache
    shared_store: s3
    index_cache_validity: 5m
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
In the documentation I have not found any examples or guidance for heavy loads, so I decided to ask the community. I would be very grateful for any help.
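For reference, ingester memory grows roughly with the number of active streams times the per-stream chunk buffers, so the stream limit (max_global_streams_per_user is 5,000,000 in the config above) matters at least as much as the flush timers. A minimal sketch of the knobs usually tuned for this, with illustrative values rather than a recommendation for this workload:

ingester:
  chunk_idle_period: 30m       # how long an idle stream stays buffered before flushing
  max_chunk_age: 2h            # upper bound on how long any chunk stays in memory
  chunk_target_size: 1572864   # flush once a chunk reaches ~1.5 MB compressed
limits_config:
  max_global_streams_per_user: 100000   # active streams are the main RAM driver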

Producer configuration of KafkaMirrorMaker2 (Strimzi)

I'm using Kafka MirrorMaker2 to copy topic content from Kafka cluster A to Kafka cluster B.
I'm getting the following messages:
kafka.MirrorSourceConnector-0} flushing 197026 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask) [SourceTaskOffsetCommitter-1]
Where do I need to add the producer configuration?
I tried to put it in two places, but it doesn't seem to work:
kafkaMirrorMaker2:
  spec:
    config:
      batch.size: 9000
      offset.flush.timeout.ms: 3000
      producer.buffer.memory: 5000
  instances:
    - enabled: true
      name: mirror-maker
      replicas: 3
      enableMetrics: true
      targetCluster: target-kafka
      producer:
        config:
          producer.batch.size: 300000
          batch.size: 9000
          offset.flush.timeout.ms: 3000
          producer.buffer.memory: 5000
      clusters:
        source-kafka:
          bootstrapServers: sourcekafka:9092
        target-kafka:
          bootstrapServers: targetkafka:9092
      mirrors:
        - sourceCluster: source-kafka
          targetCluster: target-kafka
          topicsPattern: "mytopic"
          groupsPattern: "my-replicator"
          offset.flush.timeout.ms: 3000
          producer.buffer.memory: 5000
          sourceConnector:
            config:
              max.poll.record: 2000
              tasks.max: 9
              consumer.auto.offset.reset: latest
              offset.flush.timeout.ms: 3000
              consumer.request.timeout.ms: 15000
              producer.batch.size: 65536
          heartbeatConnector:
            config:
              tasks.max: 9
              heartbeats.topic.replication.factor: 3
          checkpointConnector:
            config:
              tasks.max: 9
              checkpoints.topic.replication.factor: 3
      resources:
        requests:
          memory: 2Gi
          cpu: 3
        limits:
          memory: 4Gi
          cpu: 3
Any idea where to place the producer configuration?
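A hedged sketch of where these settings usually go in a plain Strimzi KafkaMirrorMaker2 resource (field names from the v1beta2 CRD; values illustrative): worker-level producer settings go in the target cluster's config with a producer. prefix, and per-connector overrides use the producer.override. prefix, which requires connector.client.config.override.policy: All on the worker:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mirror-maker
spec:
  connectCluster: target-kafka
  clusters:
    - alias: source-kafka
      bootstrapServers: sourcekafka:9092
    - alias: target-kafka
      bootstrapServers: targetkafka:9092
      config:
        connector.client.config.override.policy: All
        producer.batch.size: 65536            # worker-level producer setting
        producer.buffer.memory: 33554432
  mirrors:
    - sourceCluster: source-kafka
      targetCluster: target-kafka
      topicsPattern: "mytopic"
      groupsPattern: "my-replicator"
      sourceConnector:
        config:
          tasks.max: 9
          producer.override.batch.size: 65536   # per-connector override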

Unable to get the metrics of a Spring Boot application in Prometheus Operator

I'm trying to get metrics from a Spring Boot application into my Prometheus Operator:
EKS: v1.18
kube-prometheus-stack:
  version: 12.12.1
  appVersion: 0.44.0
I checked, and the application is indeed exposing the metrics via the endpoint:
http://myloadbalancer/internal-gateway/actuator/prometheus
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.013852972596312008
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.0
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of major GC",cause="Allocation Failure",} 4.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Allocation Failure",} 0.922
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 235.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 2.584
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of major GC",cause="Allocation Failure",} 0.0
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.0
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 8.888016704E9
# HELP tomcat_sessions_active_current_sessions
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions 0.0
# HELP tomcat_sessions_alive_max_seconds
# TYPE tomcat_sessions_alive_max_seconds gauge
tomcat_sessions_alive_max_seconds 0.0
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.13497864E8
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
jvm_buffer_memory_used_bytes{id="direct",} 509649.0
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 1.0
# HELP tomcat_sessions_created_sessions_total
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total 0.0
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 8.5375192E7
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total 199.0
# HELP tomcat_sessions_active_max_sessions
# TYPE tomcat_sessions_active_max_sessions gauge
tomcat_sessions_active_max_sessions 0.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 66.0
# HELP logback_events_total Number of error level events that made it to the logs
# TYPE logback_events_total counter
logback_events_total{level="warn",} 2.0
logback_events_total{level="debug",} 0.0
logback_events_total{level="error",} 0.0
logback_events_total{level="trace",} 0.0
logback_events_total{level="info",} 443.0
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 5.36870912E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="mapped",} 0.0
jvm_buffer_count_buffers{id="direct",} 18.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
jvm_buffer_total_capacity_bytes{id="direct",} 509649.0
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="heap",id="Tenured Gen",} 1.4229504E8
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 2.9229056E7
jvm_memory_committed_bytes{area="heap",id="Eden Space",} 5.7081856E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 1.01359616E8
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 2555904.0
jvm_memory_committed_bytes{area="heap",id="Survivor Space",} 7077888.0
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 1.31072E7
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1.1599872E7
# HELP spring_kafka_listener_seconds_max Kafka Listener Timer
# TYPE spring_kafka_listener_seconds_max gauge
spring_kafka_listener_seconds_max{exception="ListenerExecutionFailedException",name="fgMessageConsumer-0",result="failure",} 0.0
spring_kafka_listener_seconds_max{exception="none",name="fgMessageConsumer-0",result="success",} 0.0
# HELP spring_kafka_listener_seconds Kafka Listener Timer
# TYPE spring_kafka_listener_seconds summary
spring_kafka_listener_seconds_count{exception="ListenerExecutionFailedException",name="fgMessageConsumer-0",result="failure",} 0.0
spring_kafka_listener_seconds_sum{exception="ListenerExecutionFailedException",name="fgMessageConsumer-0",result="failure",} 0.0
spring_kafka_listener_seconds_count{exception="none",name="fgMessageConsumer-0",result="success",} 9.0
spring_kafka_listener_seconds_sum{exception="none",name="fgMessageConsumer-0",result="success",} 16.017111464
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="heap",id="Tenured Gen",} 5.36870912E8
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 1.22912768E8
jvm_memory_max_bytes{area="heap",id="Eden Space",} 2.14827008E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 5828608.0
jvm_memory_max_bytes{area="heap",id="Survivor Space",} 2.6804224E7
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1.22916864E8
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="Tenured Gen",} 8.6654784E7
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 2.382144E7
jvm_memory_used_bytes{area="heap",id="Eden Space",} 7444976.0
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 9.7431448E7
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 1346432.0
jvm_memory_used_bytes{area="heap",id="Survivor Space",} 571600.0
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 1.1687056E7
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1.1500544E7
# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes 16917.0
# HELP tomcat_sessions_rejected_sessions_total
# TYPE tomcat_sessions_rejected_sessions_total counter
tomcat_sessions_rejected_sessions_total 0.0
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.616689221264E9
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 37.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 36.0
# HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
# TYPE system_load_average_1m gauge
system_load_average_1m 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 30.0
# HELP tomcat_sessions_expired_sessions_total
# TYPE tomcat_sessions_expired_sessions_total counter
tomcat_sessions_expired_sessions_total 0.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="runnable",} 10.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 17.0
jvm_threads_states_threads{state="timed-waiting",} 9.0
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP process_uptime_seconds The uptime of the Java virtual machine
# TYPE process_uptime_seconds gauge
process_uptime_seconds 45380.981
# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/health",} 6032.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/health",} 5.492759869
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/health",} 7.97605E-4
# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
So it's all good from this end.
This is my ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: internal-gateway-service-monitor
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: internal-gateway
  endpoints:
    - port: http
      path: '/actuator/prometheus'
      interval: 10s
      honorLabels: true
This is my Service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: perf4-backend
    meta.helm.sh/release-namespace: perf4
  creationTimestamp: "2021-03-23T13:00:47Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: internal-gateway
  namespace: perf4
  resourceVersion: "18659"
  selfLink: /api/v1/namespaces/perf4/services/internal-gateway
  uid: 75f89f23-d76e-4701-80f9-a029ce0f1153
spec:
  clusterIP: 172.20.105.66
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 31500
      port: 80
      protocol: TCP
      targetPort: 8070
  selector:
    app: internal-gateway
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
This is my Pod YAML (unnecessary fields removed):
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    kubernetes.io/psp: eks.privileged
  generateName: fg-internal-gateway-deployment-76cd98ccd8-
  labels:
    app: internal-gateway
    pod-template-hash: 76cd98ccd8
    version: "92095"
  name: fg-internal-gateway-deployment-76cd98ccd8-ksmgt
  namespace: perf4
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: fg-internal-gateway-deployment-76cd98ccd8
      uid: 69301225-d013-47e4-a126-b525f39ce608
  resourceVersion: "801092"
  selfLink: /api/v1/namespaces/perf4/pods/fg-internal-gateway-deployment-76cd98ccd8-ksmgt
  uid: 5fedee50-b572-4949-8055-9e58a7053b6a
spec:
  containers:
    - image:
      imagePullPolicy: Always
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /actuator/health
          port: 8070
          scheme: HTTP
        initialDelaySeconds: 140
        periodSeconds: 15
        successThreshold: 1
        timeoutSeconds: 1
      name: internal-gateway
      ports:
        - containerPort: 8070
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /actuator/health
          port: 8070
          scheme: HTTP
        initialDelaySeconds: 140
        periodSeconds: 15
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          cpu: "1"
          memory: 3Gi
        requests:
          cpu: "1"
          memory: 3Gi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-vcnjm
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName:
  nodeSelector:
    role: fgworkers
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
    - key: gated
      operator: Equal
      value: "true"
    - key: preprod
      operator: Equal
      value: "true"
    - key: staging
      operator: Equal
      value: "true"
    - key: fgworkers
      operator: Equal
      value: "true"
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - name: default-token-vcnjm
      secret:
        defaultMode: 420
        secretName: default-token-vcnjm
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2021-03-25T14:42:35Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2021-03-25T14:45:14Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2021-03-25T14:45:14Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2021-03-25T14:42:35Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - containerID:
      image:
      imageID:
      lastState: {}
      name: internal-gateway
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2021-03-25T14:42:41Z"
  hostIP:
  phase: Running
  podIP:
  podIPs:
    - ip:
  qosClass: Guaranteed
  startTime: "2021-03-25T14:42:35Z"
I used the label app: internal-gateway, the same as in my Pod spec.
However, the target does not show up in Prometheus. What can be the issue?
The problem is that the ServiceMonitor can't find your Service: the selector in the ServiceMonitor definition does not select the labels of the Service.
Solution: change the labels in the Service definition to match the matchLabels of your ServiceMonitor, like this:
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: perf4-backend
    meta.helm.sh/release-namespace: perf4
  creationTimestamp: "2021-03-23T13:00:47Z"
  labels:
    app: internal-gateway
Also make sure that the port name defined in the Service and in the ServiceMonitor is the same. I was having the same issue; once I used the same name, the target started showing up with the proper app label.
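A minimal sketch of that port-name match (fragments only; names illustrative): the ServiceMonitor endpoint references the Service port by name, not by number:

# Service
spec:
  ports:
    - name: http          # this name...
      port: 80
      targetPort: 8070
---
# ServiceMonitor
spec:
  endpoints:
    - port: http          # ...must match the Service port name
      path: /actuator/prometheus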

MiNiFi - NiFi Connection Failure: UnknownHostException, although the host can be reached via telnet from the machine where MiNiFi is running

I am running MiNiFi on a Linux box (a gateway server) behind my company's firewall. My NiFi is running on an AWS EC2 node (in standalone mode).
I am trying to send data from the gateway to the NiFi instance running on AWS EC2.
From the gateway, I am able to telnet to the EC2 node using the public DNS name and the remote port configured in the nifi.properties file:
nifi.properties
# Site to Site properties
nifi.remote.input.host=ec2-xxx.us-east-2.compute.amazonaws.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=1026
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs
Telnet connection from Gateway to NiFi
iot1#iothdp02:~/minifi/minifi-0.5.0/conf$ telnet ec2-xxx.us-east-2.compute.amazonaws.com 1026
Trying xx.xx.xx.xxx...
Connected to ec2-xxx.us-east-2.compute.amazonaws.com.
Escape character is '^]'.
The public DNS name resolves to the correct public IP of the EC2 node.
From the EC2 node, an nslookup on the public DNS name returns the private IP.
From the AWS documentation: "The public IP address is mapped to the primary private IP address through network address translation (NAT)."
Hence, I am not adding the public DNS name and the public IP to the /etc/hosts file on the EC2 node.
On the MiNiFi side, I am getting the error below:
minifi-app.log
iot1#iothdp02:~/minifi/minifi-0.5.0/logs$ cat minifi-app.log
2018-11-14 16:00:47,910 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:00:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:01:02,334 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 20 milliseconds (Stop-the-world time = 6 milliseconds, Clear Edit Logs time = 4 millis), max Transaction ID -1
2018-11-14 16:02:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:02:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:03:02,354 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 18 milliseconds (Stop-the-world time = 3 milliseconds, Clear Edit Logs time = 5 millis), max Transaction ID -1
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.a.n.r.util.SiteToSiteRestApiClient Failed to get controller from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi-api due to java.net.UnknownHostException: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.apache.nifi.controller.FlowController Unable to communicate with remote instance RemoteProcessGroup[http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi] due to org.apache.nifi.controller.exception.CommunicationsException: org.apache.nifi.controller.exception.CommunicationsException: Unable to communicate with Remote NiFi at URI http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi due to: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:05:02,380 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 25 milliseconds (Stop-the-world time = 8 milliseconds, Clear Edit Logs time = 6 millis), max Transaction ID -1
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:07:02,399 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with
MiNiFi config.yml
MiNiFi Config Version: 3
Flow Controller:
  name: Gateway-IDS_v0.1
  comment: "1. ConsumeMQTT - MiNiFi will consume mqtt messages in gateway\n2. Remote\
    \ Process Group will send messages to NiFi "
Core Properties:
  flow controller graceful shutdown period: 10 sec
  flow service write delay interval: 500 ms
  administrative yield duration: 30 sec
  bored yield duration: 10 millis
  max concurrent threads: 1
  variable registry properties: ''
FlowFile Repository:
  partitions: 256
  checkpoint interval: 2 mins
  always sync: false
  Swap:
    threshold: 20000
    in period: 5 sec
    in threads: 1
    out period: 5 sec
    out threads: 4
Content Repository:
  content claim max appendable size: 10 MB
  content claim max flow files: 100
  always sync: false
Provenance Repository:
  provenance rollover time: 1 min
  implementation: org.apache.nifi.provenance.MiNiFiPersistentProvenanceRepository
Component Status Repository:
  buffer size: 1440
  snapshot frequency: 1 min
Security Properties:
  keystore: ''
  keystore type: ''
  keystore password: ''
  key password: ''
  truststore: ''
  truststore type: ''
  truststore password: ''
  ssl protocol: ''
  Sensitive Props:
    key:
    algorithm: PBEWITHMD5AND256BITAES-CBC-OPENSSL
    provider: BC
Processors:
  - id: 6396f40f-118f-33f4-0000-000000000000
    name: ConsumeMQTT
    class: org.apache.nifi.processors.mqtt.ConsumeMQTT
    max concurrent tasks: 1
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 0 sec
    penalization period: 30 sec
    yield period: 1 sec
    run duration nanos: 0
    auto-terminated relationships list: []
    Properties:
      Broker URI: tcp://localhost:1883
      Client ID: nifi
      Connection Timeout (seconds): '30'
      Keep Alive Interval (seconds): '60'
      Last Will Message:
      Last Will QoS Level:
      Last Will Retain:
      Last Will Topic:
      MQTT Specification Version: '0'
      Max Queue Size: '10'
      Password:
      Quality of Service(QoS): '0'
      SSL Context Service:
      Session state: 'true'
      Topic Filter: MQTT
      Username:
Controller Services: []
Process Groups: []
Input Ports: []
Output Ports: []
Funnels: []
Connections:
  - id: f0007aa3-cf32-3593-0000-000000000000
    name: ConsumeMQTT/Message/85ebf198-0166-1000-5592-476a7ba47d2e
    source id: 6396f40f-118f-33f4-0000-000000000000
    source relationship names:
      - Message
    destination id: 85ebf198-0166-1000-5592-476a7ba47d2e
    max work queue size: 10000
    max work queue data size: 1 GB
    flowfile expiration: 0 sec
    queue prioritizer class: ''
Remote Process Groups:
  - id: c00d3132-375b-323f-0000-000000000000
    name: ''
    url: http://ec2-xxx.us-east-2.compute.amazonaws.com:9090
    comment: ''
    timeout: 30 sec
    yield period: 10 sec
    transport protocol: RAW
    proxy host: ''
    proxy port: ''
    proxy user: ''
    proxy password: ''
    local network interface: ''
    Input Ports:
      - id: 85ebf198-0166-1000-5592-476a7ba47d2e
        name: From MiNiFi
        comment: ''
        max concurrent tasks: 1
        use compression: false
        Properties:
          Port: 1026
          Host Name: ec2-xxx.us-east-2.compute.amazonaws.com
NiFi Properties Overrides: {}
Any pointers on how to troubleshoot this issue?
In the MiNiFi config.yml, I changed the URL under Remote Process Groups from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090 to http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi, and the connection succeeded.
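The corrected fragment of config.yml (only the url value changes):

Remote Process Groups:
  - id: c00d3132-375b-323f-0000-000000000000
    url: http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi   # note the /nifi suffix
    transport protocol: RAW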
