OpenSearch throws 429 errors when Fluent Bit outputs logs under heavy load

With the Fluent Bit configuration below we are getting errors from OpenSearch under heavy load.
[Graph: HTTP bulk requests to OpenSearch from Fluent Bit, with the 429 errors showing as spikes]
Fluent Bit config:
[INPUT]
    Name                    tail
    Tag                     kube.*
    Path                    /var/log/containers/*.log
    DB                      /var/log/flb_kube.db
    Mem_Buf_Limit           400M
    storage.type            filesystem
    Skip_Long_Lines         On
    Refresh_Interval        1
    Rotate_Wait             600

[OUTPUT]
    Name                    es
    Match                   kube.*
    Host                    ${ES_HOST}
    Port                    ${PORT}
    Buffer_Size             False
    AWS_Auth                Off
    AWS_Role_ARN            ${ES_ARN}
    AWS_External_ID         ${ES_IAMROLE}
    HTTP_User               ${ES_USER}
    HTTP_Passwd             ${ES_PASSWD}
    tls                     On
    tls.verify              Off
    Trace_Output            ${TRACE_OUTPUT}
    Trace_Error             On
    Replace_Dots            On
    Index                   fluentbit
    Type                    flb
    AWS_Region              ${AWS_REGION}
    Logstash_Format         On
    Logstash_Prefix         ${ES_LOGSTASHPREFIX}_app_log
    Logstash_DateFormat     %Y.%m.%d
    Retry_Limit             10
    storage.total_limit_size 1G
To resolve this we upgraded our OpenSearch instance type from r5.xlarge.search (4 nodes) to r5.2xlarge.search (3 nodes), but that didn't solve the issue.
We also increased the ES index refresh_interval to 60s, but that didn't help.
We read that Fluent Bit's output to ES can be controlled via buffering, so we decreased Mem_Buf_Limit to 400M; that didn't help either.
Can someone suggest anything else we can try, or point out what we are missing?

The issue here is not with Fluent Bit but with OpenSearch/Elasticsearch.
The HTTP 429 errors (es_request_rejected_exception) occur when more requests are sent to the cluster than its thread pools can handle. OpenSearch allocates its thread pools differently per task type, with search operations getting a larger share. The option to manually modify thread pool allocation is not available in versions 5.1 and later.
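To confirm that thread pool rejections are what you're seeing, you can watch the per-node rejection counters with the standard _cat API (the endpoint below is a placeholder for your domain):
curl -s 'https://YOUR_OPENSEARCH_ENDPOINT/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
A climbing rejected count on the write pool lines up with the 429 spikes.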
You can try to resolve this in a few ways:
1: Refresh rate (you already did that, and it didn't help).
2: Reduce the indexing speed: try sending logs at a longer interval than your current one (see the sketch after this list).
3: Upscale (you did, and it didn't work either).
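For point 2, one Fluent Bit-side knob is the engine flush interval in the [SERVICE] section; this is a sketch rather than part of the original configuration, and 5 (seconds) is an arbitrary value:
[SERVICE]
    Flush    5
Raising Flush makes Fluent Bit batch records for longer between bulk requests, lowering the request rate at the cost of delivery latency.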
You can get an idea from the following formulas for thread pool sizing:
Number of threads allocated for writes = number of virtual CPUs (your case)
Number of threads allocated for search = ((3 * number of virtual CPUs) / 2) + 1
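As a rough worked example (r5.2xlarge.search instances have 8 vCPUs each):
write threads per node  = 8
search threads per node = ((3 * 8) / 2) + 1 = 13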
So, I am guessing your issue here is a large number of shards! You can either decrease the number of shards per index, or, if you only have this issue once in a while under extra load, change the replica count to 0 for that period and set it back to the original afterwards (see the sketch below).
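A minimal sketch of that replica toggle using the index settings API; the endpoint and index pattern are placeholders for your domain and indices:
curl -X PUT 'https://YOUR_OPENSEARCH_ENDPOINT/fluentbit-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
Once the load spike has passed, restore the original replica count (1 here is an assumption):
curl -X PUT 'https://YOUR_OPENSEARCH_ENDPOINT/fluentbit-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'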
Check these two links to find out more about optimizing your ES domain.
indexing performance
Best practices

Related

Data too large ElasticSearch issue along with Readiness probe failed

We have set up an EFK stack for our project, and since yesterday Kibana seems to be down. When we initially troubleshot it, we found the following errors:
Readiness probe failed: Error: Got HTTP code 503 but expected a 200 & Readiness probe failed: Error: Got HTTP code 000 but expected a 200
Later we found the same issue with the Elasticsearch pod as well. Along with this, we found the following data-request-limit issue:
FATAL
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:admin/template/get] would be [1036909172/988.8mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes reserved: [116/116b], usages [request=0/0b, fielddata=420/420b, in_flight_requests=67310/65.7kb, model_inference=0/0b, eql_sequence=0/0b, accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:admin/template/get] would be [1036909172/988.8mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes reserved: [116/116b], usages [request=0/0b, fielddata=420/420b, in_flight_requests=67310/65.7kb, model_inference=0/0b, eql_sequence=0/0b, accounting=110294544/105.1mb]","status":429}
We have tried changing the REDINESS_PROBE_TIMEOUT, Initial Delay, Timeout, Probe Period, Success Threshold, and Failure Threshold. We also tried increasing the indices breaker limit, but the change isn't taking effect: the error still reports the old limits. We tried fixing the circuit_breaking_exception by adding ES_JAVA_OPTS values as well.
Nothing seems to be working; any help would be appreciated.
The same phenomenon occurred during our service operation. This issue is a memory shortage, so there are several ways to think it over:
Physical memory expansion (scale out)
Add equipment, since the available memory is insufficient.
Lower the load through monitoring
If circuit_breaking_exception keeps appearing in the logs, build a monitoring mechanism that lowers the load.
Set ES_JAVA_OPTS
You can cap the JVM memory usage, but it's meaningless if you don't have enough hardware memory (see the sketch below).
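For reference, a minimal sketch of setting the JVM heap via ES_JAVA_OPTS in a Kubernetes pod spec; the 2g values are placeholders and must fit within the node's physical memory:
env:
  - name: ES_JAVA_OPTS
    value: "-Xms2g -Xmx2g"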

Increase queue capacity in ElasticSearch

Elastic version 7.8
I'm getting an error when running this code for thousands of records:
var bulkIndexResponse = await _client.BulkAsync(i => i
    .Index(indexName)
    .IndexMany(bases));

if (!bulkIndexResponse.IsValid)
{
    throw bulkIndexResponse.OriginalException;
}
It eventually crashes with the following error:
Invalid NEST response built from a successful (200) low level call on POST: /indexname/_bulk
# Invalid Bulk items:
operation[1159]: index returned 429 _index: indexname _type: _doc _id: _version: 0 error: Type:
es_rejected_execution_exception Reason: "Could not perform enrichment, enrich coordination queue at
capacity [1024/1024]"
I would like to know how this enrich coordination queue capacity can be increased to accommodate continuous calls of BulkAsync with around a thousand records on each call.
You can check which thread pool is filling up with /_cat/thread_pool?v and increase the queue size (as ninja said) in elasticsearch.yml on each node.
But increasing the queue size raises heap consumption, which may in turn hurt performance.
When you get this error there are usually two possible reasons: first, you are sending bulk requests that are too large, so try decreasing the bulk size to 500 records or lower; second, you have a performance issue, so try to find and solve it; maybe you should add more nodes to your cluster.
Not sure what version you are on, but this enrich coordination queue seems to be the bulk queue, and you can increase its size (these settings are node-specific) by changing the elasticsearch.yml of that node (see the sketch below).
Refer to thread pools in ES for more info.
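A sketch of both steps, assuming ES 7.x; whether the enrich coordination queue is actually governed by the write pool's queue depends on your version, so treat the setting name as an assumption to verify against the thread pool docs:
# check which pools are rejecting work
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'
# then, in elasticsearch.yml on each node (2000 is an arbitrary example)
thread_pool.write.queue_size: 2000
A node restart is required for elasticsearch.yml changes to take effect.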

Kafka elasticsearch connector - 'Flush timeout expired with unflushed records:'

I have a strange problem with the Kafka -> Elasticsearch connector. The first time I started it everything was great: I received new data in Elasticsearch and checked it through the Kibana dashboard. But when I produced new data into Kafka using the same producer application and tried to start the connector one more time, I didn't get any new data in Elasticsearch.
Now I'm getting errors like these:
[2018-02-04 21:38:04,987] ERROR WorkerSinkTask{id=log-platform-elastic-0} Commit of offsets threw an unexpected exception for sequence number 14: null (org.apache.kafka.connect.runtime.WorkerSinkTask:233)
org.apache.kafka.connect.errors.ConnectException: Flush timeout expired with unflushed records: 15805
I'm using the following command to run the connector:
/usr/bin/connect-standalone /etc/schema-registry/connect-avro-standalone.properties log-platform-elastic.properties
connect-avro-standalone.properties:
bootstrap.servers=kafka-0.kafka-hs:9093,kafka-1.kafka-hs:9093,kafka-2.kafka-hs:9093
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
# producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
# consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
#rest.host.name=
rest.port=8084
#rest.advertised.host.name=
#rest.advertised.port=
plugin.path=/usr/share/java
and log-platform-elastic.properties:
name=log-platform-elastic
key.converter=org.apache.kafka.connect.storage.StringConverter
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=member_sync_log, order_history_sync_log # ... and many others
key.ignore=true
connection.url=http://elasticsearch:9200
type.name=log
I checked connectivity to the Kafka brokers, Elasticsearch, and the Schema Registry (the Schema Registry and the connector are on the same host at the moment) and all is fine. The Kafka brokers are running on port 9093 and I'm able to read data from the topics using kafka-avro-console-consumer.
I'll be grateful for any help on this!
Just update flush.timeout.ms to something bigger than 10000 (10 seconds, which is the default).
According to documentation:
flush.timeout.ms
The timeout in milliseconds to use for periodic
flushing, and when waiting for buffer space to be made available by
completed requests as records are added. If this timeout is exceeded
the task will fail.
Type: long Default: 10000 Importance: low
See documentation
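A minimal sketch of that change in log-platform-elastic.properties; 60000 is an arbitrary example value:
# allow up to 60 seconds for flushes before the task fails
flush.timeout.ms=60000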
We can optimize the Elasticsearch connector configuration to solve the issue. Please refer to the link below for the configuration options:
https://docs.confluent.io/current/connect/kafka-connect-elasticsearch/configuration_options.html
Below are the key parameters that control the message flow rate and can eventually help solve the issue:
flush.timeout.ms: increasing this might give flushes more breathing room
The timeout in milliseconds to use for periodic flushing, and when
waiting for buffer space to be made available by completed requests as
records are added. If this timeout is exceeded the task will fail.
max.buffered.records: try reducing the buffered-record limit
The maximum number of records each task will buffer before blocking
acceptance of more records. This config can be used to limit the
memory usage for each task
batch.size: Try reducing batch size
The number of records to process as a batch when writing to
Elasticsearch
tasks.max: the number of parallel threads (consumer instances); reduce or increase it. If Elasticsearch's bandwidth can't handle the load, reducing tasks may help.
Tuning the above parameters fixed my issue (see the combined sketch below).
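For reference, a combined sketch of these knobs in the sink connector's properties file; all values are illustrative placeholders, not recommendations:
# give periodic flushing more time
flush.timeout.ms=60000
# buffer fewer records per task to cap memory usage
max.buffered.records=10000
# write smaller bulks to Elasticsearch
batch.size=500
# fewer parallel writers if Elasticsearch can't keep up
tasks.max=1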

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception: rejected execution of
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed tasks = 407667]
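Reading the numbers in the error gives the capacity arithmetic:
max in-flight bulk tasks per node = pool size + queue capacity = 16 + 50 = 66
Any bulk task arriving beyond that is rejected, which is consistent with the reported queued tasks = 51 against a queue capacity of 50.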
My current setup:
Two nodes. One is the master (data: true, master: true) while the other is data-only (data: true, master: false). Both are EC2 i2.4xlarge instances (16 cores, 122GB RAM, 320GB instance storage), with 2 shards and 1 replica.
Those two nodes are fed by our aggregation server, which has 20 separate workers. Each worker makes bulk indexing requests to our ES cluster with 50 items per request. Each item is between 1000-4000 characters.
Current server setup: 4x client-facing servers -> aggregation server -> Elasticsearch.
Now, the issue is that this error only started occurring when we introduced the second node. Before, with one machine, we got a consistent indexing throughput of 20k requests per second. With two machines, once we hit the 10k mark (~20% CPU usage), we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator that produces random documents to be indexed. These documents are generally the same size but have random parameters. We use it for stress testing and stability checks. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. The interesting thing is that this way we are able to index around 40-45k items per second (~80% CPU usage) without getting the error. So it is really interesting why we get this error. Has anyone seen it or know what could be causing it?

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a Cassandra cluster on AWS/EC2 within a standard VPC footprint (Cassandra nodes on private subnets). Because this is AWS, there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster, and I am seeing things that I thought a cluster was supposed to prevent. Specifically, if a node reboots, some data goes temporarily missing until the node completes its reboot. If a node terminates, it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces, then interrogated the contents of those keyspaces as I brought down nodes (either through reboot or terminate). I'm just using cqlsh SELECT at consistency level ONE to interrogate the keyspaces/column families.
Note that even though I perform no writes to the cluster while doing the SELECTs, rows temporarily disappear during a reboot and can go permanently missing during a termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and QUORUM writes, all data should be written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am reading at consistency level ONE.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? (unlikely)
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families, so the data has had plenty of time (hours) to reach all the nodes as required by the replication factor. Plus, the test dataset is REALLY small (2 column families with about 300-1000 values each). In other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the EC2 instance is no longer on the network. The reason I say this is that if I log on to a node and just do a cassandra stop, I see no loss of data. But if I do the reboot or terminate, I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a network communication issue, in that the cluster expects, for example, 10.0.12.74 to be on the network after it has joined the cluster. If that IP suddenly becomes unreachable, whether due to reboot or termination, the timeouts start happening.
When I do a nodetool status under any of the three scenarios (cassandra stop, reboot, or terminate), the node's status shows up as DN, which is what you would expect. Eventually nodetool status returns to UN after a cassandra start or reboot, but after a termination it obviously stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes) and initial_token: (blank). I am really hoping this works, because all of our nodes run in autoscaling groups, so the thought that redistribution could be handled dynamically is appealing.
EC2 m1.large one seed and one non-seed node in each availability zone. (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstraped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data: with a replication factor of 3, a QUORUM write reaches only 2 replicas, and a ONE read can land on the third replica that missed it. QUORUM/QUORUM is guaranteed, because the read and write replica sets must then overlap in at least one node. So is ALL/ONE, but that will be intolerant to failure on write (see the cqlsh sketch below).
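If you want to re-run the test with matching consistency levels, a minimal cqlsh sketch (the keyspace and column family names are placeholders, and the CONSISTENCY shell command depends on your cqlsh version):
cqlsh> CONSISTENCY QUORUM;
cqlsh> SELECT * FROM my_keyspace.my_cf;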
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.
