Promtail: "error sending batch, will retry" status=500 - loki

When starting the Promtail client, it reports the following error:
component=client host=loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (6780207 vs. 4194304)"
Promtail config:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: 'http://loki:3100/loki/api/v1/push'

scrape_configs:
  - job_name: server-log
    pipeline_stages:
    static_configs:
      - targets:
          - localhost
        labels:
          job: server-log
          __path__: /opt/log/*.log
          __path_exclude__: /opt/log/jck_*,*.log
I tried changing the limits on the server side and running Promtail with extra parameters:
/usr/local/bin/promtail-linux-amd64 -config.file=/etc/config-promtail.yml -server.grpc-max-recv-msg-size-bytes 16777216 -server.grpc-max-concurrent-streams 0 -server.grpc-max-send-msg-size-bytes 16777216 -limit.readline-rate-drop -client.batch-size-bytes 2048576
From what I found, though, this is a gRPC message-size error: the default maximum message size is 4 MB (4194304 bytes), and the batch being sent (6780207 bytes) exceeded it.

Changing the parameters on the server (Loki) rather than on the client solved the problem:
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608

limits_config:
  ingestion_rate_mb: 15
  ingestion_burst_size_mb: 30
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 20MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 744h
  max_query_length: 0h
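
As an alternative (or in addition), the batch size can also be capped on the Promtail side so that a batch never exceeds the server's gRPC limit. A minimal sketch using Promtail's standard client options; the values here are illustrative, not tested recommendations:

clients:
  - url: 'http://loki:3100/loki/api/v1/push'
    batchwait: 1s        # send a batch at least once per second
    batchsize: 1048576   # cap batches at ~1 MB (value is in bytes)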

Related

How to configure the retryOnResultPredicate in resilience4j?

I want to set failAfterMaxAttempts to true so that a MaxRetriesExceededException is thrown once the maximum number of retries is exhausted. According to the docs, failAfterMaxAttempts needs a predicate configured via retryOnResultPredicate. Can someone help with an example config? The retryOnResultPredicate should evaluate the response HTTP status.
resilience4j.retry:
  configs:
    default:
      maxAttempts: 3
      waitDuration: 100
      failAfterMaxAttempts: true
      retryOnResultPredicate:
      retryExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.util.concurrent.TimeoutException
        - java.io.IOException
      ignoreExceptions:
        - io.github.robwin.exception.BusinessException
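
Not from the original question, but for illustration: with resilience4j-spring-boot2 the predicate is usually wired by pointing a config key (resultPredicate, if I recall the schema correctly) at a class implementing java.util.function.Predicate. A sketch of a predicate that evaluates the HTTP status of a Spring ResponseEntity; the class and package names are hypothetical:

package com.example.retry;  // hypothetical package

import java.util.function.Predicate;
import org.springframework.http.ResponseEntity;

// Hypothetical predicate: treat any 5xx response as a retryable result.
public class HttpStatusRetryPredicate implements Predicate<ResponseEntity<?>> {
    @Override
    public boolean test(ResponseEntity<?> response) {
        // Returning true marks the result as a failure, triggering a retry;
        // with failAfterMaxAttempts: true, exhausting maxAttempts then throws
        // MaxRetriesExceededException.
        return response.getStatusCode().is5xxServerError();
    }
}

It would then be referenced from the YAML as resultPredicate: com.example.retry.HttpStatusRetryPredicate.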

Loki: Ingester high memory usage

Please advise on memory usage by the Loki ingester component.
I have the following setup: loki-distributed v2.6.1, installed via the official Helm chart in Kubernetes.
There are ~1000 Promtail client hosts, each generating a heavy load: about 5 million chunks (see screenshot below).
loki_log_messages_total is about 235 million per day.
My problem is that the ingester uses about 140 GB of RAM per day, and memory consumption keeps increasing. I want to understand whether this is normal behavior or whether I can reduce memory usage through the config. I tried adjusting various parameters myself, in particular chunk_idle_period and max_chunk_age, but no matter what values I set, memory consumption stays above 100 GB.
I also tried reducing the number of labels on the client side; the labels currently in use are as follows:
Here is my config:
auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  frontend_address: loki-distributed-query-frontend:9095
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
ingester:
  autoforget_unhealthy: true
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  max_chunk_age: 15m
  max_transfer_retries: 0
  wal:
    enabled: false
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
limits_config:
  cardinality_limit: 500000
  enforce_metric_name: false
  ingestion_burst_size_mb: 300
  ingestion_rate_mb: 150
  max_cache_freshness_per_query: 10m
  max_entries_limit_per_query: 1000000
  max_global_streams_per_user: 5000000
  max_label_name_length: 1024
  max_label_names_per_series: 300
  max_label_value_length: 8096
  max_query_series: 250000
  per_stream_rate_limit: 150M
  per_stream_rate_limit_burst: 300M
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 72h
  split_queries_by_interval: 30m
memberlist:
  join_members:
    - loki-distributed-memberlist
querier:
  engine:
    timeout: 5m
  query_timeout: 5m
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        ttl: 24h
query_scheduler:
  grpc_client_config:
    max_recv_msg_size: 167772160
    max_send_msg_size: 167772160
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
    - from: "2022-09-07"
      index:
        period: 24h
        prefix: loki_index_
      object_store: aws
      schema: v12
      store: boltdb-shipper
server:
  grpc_server_max_recv_msg_size: 167772160
  grpc_server_max_send_msg_size: 167772160
  http_listen_port: 3100
  http_server_idle_timeout: 300s
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
storage_config:
  aws:
    s3: https:/....
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /var/loki/boltdb_shipper/index
    cache_location: /var/loki/boltdb_shipper/cache
    shared_store: s3
    index_cache_validity: 5m
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
I have not found any examples or guidance for heavy loads in the documentation, so I decided to ask the community. I would be very grateful for any help.
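
Not an authoritative answer, but the knobs most often discussed for bounding ingester memory are chunk lifetime and the per-tenant stream limit, since every active stream holds its current chunks in memory. A sketch with illustrative (untested) values:

ingester:
  chunk_target_size: 1572864   # flush a chunk once it reaches ~1.5 MB
  chunk_idle_period: 5m        # flush streams that stop receiving entries
  max_chunk_age: 15m           # upper bound on how long a chunk stays in memory

limits_config:
  # max_global_streams_per_user: 5000000 permits an enormous in-memory working set;
  # lowering it (the value below is only an example) directly bounds ingester memory
  max_global_streams_per_user: 500000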

Grafana Loki 2.5.0 log retention period is not working

I upgraded my Loki deployment in Kubernetes from Loki 2.4.0 to Loki 2.5.0.
After upgrading, old chunks are no longer deleted, so they fill up the filesystem.
It looks like the retention period is not working.
I am trying to configure a 7-day retention period for Loki.
Can someone please help me configure the retention period in Loki 2.5.0?
My configuration:
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 67108864
  grpc_server_max_send_msg_size: 67108864
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 15m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal
schema_config:
  configs:
    - from: 2021-02-01
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
      chunks:
        prefix: chunks_
        period: 24h
storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 72h
  ingestion_rate_mb: 32
  creation_grace_period: 30m
  max_entries_limit_per_query: 50000
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
Try configuring limits_config.retention_period: 168h.
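
For context: since Loki 2.4, retention is normally handled by the compactor rather than the table manager, and compactor-driven retention is only supported with the boltdb-shipper store (the config above uses plain boltdb, so the schema would need migrating first). A sketch under that assumption:

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true    # without this the compactor only compacts, never deletes

limits_config:
  retention_period: 168h     # 7 days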

Producer configuration of KafkaMirrorMaker2 (Strimzi)

I'm using Kafka MirrorMaker 2 (via Strimzi) to copy topic content from Kafka cluster A to Kafka cluster B.
I'm seeing the following messages:
kafka.MirrorSourceConnector-0} flushing 197026 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask) [SourceTaskOffsetCommitter-1]
Where do I need to add the producer configuration?
I tried putting it in two places, but it doesn't seem to work:
kafkaMirrorMaker2:
  spec:
    config:
      batch.size: 9000
      offset.flush.timeout.ms: 3000
      producer.buffer.memory: 5000
  instances:
    - enabled: true
      name: mirror-maker
      replicas: 3
      enableMetrics: true
      targetCluster: target-kafka
      producer:
        config:
          producer.batch.size: 300000
          batch.size: 9000
          offset.flush.timeout.ms: 3000
          producer.buffer.memory: 5000
      clusters:
        source-kafka:
          bootstrapServers: sourcekafka:9092
        target-kafka:
          bootstrapServers: targetkafka:9092
      mirrors:
        - sourceCluster: source-kafka
          targetCluster: target-kafka
          topicsPattern: "mytopic"
          groupsPattern: "my-replicator"
          offset.flush.timeout.ms: 3000
          producer.buffer.memory: 5000
          sourceConnector:
            config:
              max.poll.record: 2000
              tasks.max: 9
              consumer.auto.offset.reset: latest
              offset.flush.timeout.ms: 3000
              consumer.request.timeout.ms: 15000
              producer.batch.size: 65536
          heartbeatConnector:
            config:
              tasks.max: 9
              heartbeats.topic.replication.factor: 3
          checkpointConnector:
            config:
              tasks.max: 9
              checkpoints.topic.replication.factor: 3
      resources:
        requests:
          memory: 2Gi
          cpu: 3
        limits:
          memory: 4Gi
          cpu: 3
Any idea where the producer configuration should go?
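
This isn't from the original manifest, but with Strimzi's KafkaMirrorMaker2 CR the usual pattern is: producer settings for a connector go into that connector's config with the producer.override. prefix, and the Connect worker (spec.config) must be configured to allow such overrides. A sketch assuming the standard Strimzi v1beta2 schema, reusing values from the question:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
spec:
  config:
    # allow per-connector producer/consumer overrides
    connector.client.config.override.policy: All
  mirrors:
    - sourceCluster: source-kafka
      targetCluster: target-kafka
      sourceConnector:
        config:
          # producer settings must carry the producer.override. prefix here
          producer.override.batch.size: 65536
          producer.override.buffer.memory: 5000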

MiNiFi - NiFi Connection Failure: Unknown Host Exception: Able to telnet host from the machine where MiNiFi is running

I am running MiNiFi on a Linux box (a gateway server) behind my company's firewall. My NiFi instance runs on an AWS EC2 node (in standalone mode).
I am trying to send data from the gateway to the NiFi instance on EC2.
From the gateway, I can telnet to the EC2 node using the public DNS name and the remote port configured in nifi.properties:
nifi.properties
# Site to Site properties
nifi.remote.input.host=ec2-xxx.us-east-2.compute.amazonaws.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=1026
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs
Telnet connection from Gateway to NiFi
iot1#iothdp02:~/minifi/minifi-0.5.0/conf$ telnet ec2-xxx.us-east-2.compute.amazonaws.com 1026
Trying xx.xx.xx.xxx...
Connected to ec2-xxx.us-east-2.compute.amazonaws.com.
Escape character is '^]'.
The public DNS name resolves to the correct public IP of the EC2 node.
From the EC2 node itself, nslookup on the public DNS name returns the private IP.
From the AWS documentation: "The public IP address is mapped to the primary private IP address through network address translation (NAT)."
Hence I am not adding the public DNS name and public IP to /etc/hosts on the EC2 node.
On the MiNiFi side, I am getting the error below:
minifi-app.log
iot1#iothdp02:~/minifi/minifi-0.5.0/logs$ cat minifi-app.log
2018-11-14 16:00:47,910 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:00:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:01:02,334 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 20 milliseconds (Stop-the-world time = 6 milliseconds, Clear Edit Logs time = 4 millis), max Transaction ID -1
2018-11-14 16:02:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:02:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:03:02,354 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 18 milliseconds (Stop-the-world time = 3 milliseconds, Clear Edit Logs time = 5 millis), max Transaction ID -1
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.a.n.r.util.SiteToSiteRestApiClient Failed to get controller from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi-api due to java.net.UnknownHostException: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.apache.nifi.controller.FlowController Unable to communicate with remote instance RemoteProcessGroup[http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi] due to org.apache.nifi.controller.exception.CommunicationsException: org.apache.nifi.controller.exception.CommunicationsException: Unable to communicate with Remote NiFi at URI http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi due to: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:05:02,380 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 25 milliseconds (Stop-the-world time = 8 milliseconds, Clear Edit Logs time = 6 millis), max Transaction ID -1
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:07:02,399 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with
MiNiFi config.yml
MiNiFi Config Version: 3
Flow Controller:
  name: Gateway-IDS_v0.1
  comment: "1. ConsumeMQTT - MiNiFi will consume mqtt messages in gateway\n2. Remote\
    \ Process Group will send messages to NiFi "
Core Properties:
  flow controller graceful shutdown period: 10 sec
  flow service write delay interval: 500 ms
  administrative yield duration: 30 sec
  bored yield duration: 10 millis
  max concurrent threads: 1
  variable registry properties: ''
FlowFile Repository:
  partitions: 256
  checkpoint interval: 2 mins
  always sync: false
  Swap:
    threshold: 20000
    in period: 5 sec
    in threads: 1
    out period: 5 sec
    out threads: 4
Content Repository:
  content claim max appendable size: 10 MB
  content claim max flow files: 100
  always sync: false
Provenance Repository:
  provenance rollover time: 1 min
  implementation: org.apache.nifi.provenance.MiNiFiPersistentProvenanceRepository
Component Status Repository:
  buffer size: 1440
  snapshot frequency: 1 min
Security Properties:
  keystore: ''
  keystore type: ''
  keystore password: ''
  key password: ''
  truststore: ''
  truststore type: ''
  truststore password: ''
  ssl protocol: ''
  Sensitive Props:
    key:
    algorithm: PBEWITHMD5AND256BITAES-CBC-OPENSSL
    provider: BC
Processors:
  - id: 6396f40f-118f-33f4-0000-000000000000
    name: ConsumeMQTT
    class: org.apache.nifi.processors.mqtt.ConsumeMQTT
    max concurrent tasks: 1
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 0 sec
    penalization period: 30 sec
    yield period: 1 sec
    run duration nanos: 0
    auto-terminated relationships list: []
    Properties:
      Broker URI: tcp://localhost:1883
      Client ID: nifi
      Connection Timeout (seconds): '30'
      Keep Alive Interval (seconds): '60'
      Last Will Message:
      Last Will QoS Level:
      Last Will Retain:
      Last Will Topic:
      MQTT Specification Version: '0'
      Max Queue Size: '10'
      Password:
      Quality of Service(QoS): '0'
      SSL Context Service:
      Session state: 'true'
      Topic Filter: MQTT
      Username:
Controller Services: []
Process Groups: []
Input Ports: []
Output Ports: []
Funnels: []
Connections:
  - id: f0007aa3-cf32-3593-0000-000000000000
    name: ConsumeMQTT/Message/85ebf198-0166-1000-5592-476a7ba47d2e
    source id: 6396f40f-118f-33f4-0000-000000000000
    source relationship names:
      - Message
    destination id: 85ebf198-0166-1000-5592-476a7ba47d2e
    max work queue size: 10000
    max work queue data size: 1 GB
    flowfile expiration: 0 sec
    queue prioritizer class: ''
Remote Process Groups:
  - id: c00d3132-375b-323f-0000-000000000000
    name: ''
    url: http://ec2-xxx.us-east-2.compute.amazonaws.com:9090
    comment: ''
    timeout: 30 sec
    yield period: 10 sec
    transport protocol: RAW
    proxy host: ''
    proxy port: ''
    proxy user: ''
    proxy password: ''
    local network interface: ''
    Input Ports:
      - id: 85ebf198-0166-1000-5592-476a7ba47d2e
        name: From MiNiFi
        comment: ''
        max concurrent tasks: 1
        use compression: false
        Properties:
          Port: 1026
          Host Name: ec2-xxx.us-east-2.compute.amazonaws.com
    Output Ports: []
NiFi Properties Overrides: {}
Any pointers on how to troubleshoot this issue?
The fix: in MiNiFi's config.yml, I changed the url under Remote Process Groups from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090 to http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi.
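
For reference, the corrected fragment of config.yml (only the url value changes, pointing at the /nifi context path so the Site-to-Site client can reach the REST API):

Remote Process Groups:
  - id: c00d3132-375b-323f-0000-000000000000
    url: http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi
    transport protocol: RAW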
