ElasticSearch Timeout to Cancel Expensive Long-Running Queries - elasticsearch

In Elasticsearch 2.x, is there a way to cancel queries after a certain timeout to reduce server load, i.e. a circuit breaker based on request time?
Specifically, expensive queries (e.g. a large number of wildcards or complex aggregations) can effectively bring down a cluster, leaving it unable to service new requests. Is there a server-level config that times out these queries after a certain amount of time?
The timeout parameter in the request body is best effort and only bounds the time until you receive an HTTP response; the query is still in flight and will consume cluster resources until it finishes executing. The circuit breaker option https://www.elastic.co/guide/en/elasticsearch/reference/2.3/circuit-breaker.html is the behavior I would like to see, but based on a request timeout instead of the amount of heap used.

Related

Design Pattern - Spring KafkaListener processing 1 million records in 1 hour

My Spring Boot application is going to listen to 1 million records an hour from a Kafka broker. The entire processing logic for each message takes 1-1.5 seconds, including a database insert. The topic has 64 partitions, which is also the concurrency of my @KafkaListener.
My current code is only able to process 90 records in a minute in a lower environment where I am listening to around 50k records an hour. Below is the code and all other config parameters like max.poll.records etc are default values:
@KafkaListener(id="xyz-listener", concurrency="64", topics="my-topic")
public void listener(String record) {
// processing logic
}
I do get "it is likely that the consumer was kicked out of the group" 7-8 times an hour. I think both of these issues could be solved by isolating the listener method and processing each message on multiple threads, but I am not sure how to do that.
There are a few points to consider here. First, 64 consumers seems a bit too much for a single application to handle consistently.
Considering each poll by default fetches up to 500 records per consumer (max.poll.records), your app might be getting overloaded, and a consumer gets kicked out of the group if a single batch takes longer to process than the 5-minute default of max.poll.interval.ms.
So first, I'd consider scaling the application horizontally so that each instance handles a smaller number of partitions / threads.
A second way to increase throughput would be using a batch listener, and handling processing and DB insertions in batches as you can see in this answer.
Using both, you should be processing a sensible amount of work in parallel per app, and should be able to achieve your desired throughput.
Of course, you should load test each approach with different figures to have proper metrics.
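To put rough numbers on the points above, here is a back-of-the-envelope sketch using only the figures quoted in this thread (64 threads, 1-1.5 seconds per record, 500 records per poll, a 5-minute poll limit). It assumes each listener thread handles its records strictly one after another, which is a simplification, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate; class and method names are illustrative only.
class KafkaThroughputEstimate {

    // Records per hour that `threads` consumers can handle when each one
    // processes its records sequentially at `secondsPerRecord` each.
    static double recordsPerHour(int threads, double secondsPerRecord) {
        return threads * 3600.0 / secondsPerRecord;
    }

    // Time needed to work through one poll() batch record by record.
    static double batchSeconds(int maxPollRecords, double secondsPerRecord) {
        return maxPollRecords * secondsPerRecord;
    }

    public static void main(String[] args) {
        int threads = 64;                // listener concurrency from the question
        int maxPollRecords = 500;        // Kafka default for max.poll.records
        double maxPollIntervalSec = 300; // Kafka default for max.poll.interval.ms (5 minutes)

        for (double perRecord : new double[] {1.0, 1.5}) {
            System.out.printf(
                    "%.1f s/record: %.0f records/hour, %.0f s per batch (limit %.0f s)%n",
                    perRecord,
                    recordsPerHour(threads, perRecord),
                    batchSeconds(maxPollRecords, perRecord),
                    maxPollIntervalSec);
        }
    }
}
```

Under these assumptions, 64 sequential threads top out at roughly 150k-230k records an hour, well short of 1 million, and a full batch of 500 records takes 500-750 seconds, which is longer than the 5-minute limit. That would be consistent with the consumers being kicked out of the group, and it is why batching the work (or polling fewer records at a time) matters here.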
EDIT: Addressing your comment, if you want to achieve this throughput I wouldn't give up on batch processing just yet. If you do the DB operations row by row you'll need a lot more resources for the same performance.
If your rule engine doesn't do any I/O you can iterate each record from the batch through it without losing performance.
About data consistency, you can try some strategies. For example, you can have a lock to ensure that even through a rebalance only one instance will process a given batch of records at a given time - or perhaps there's a more idiomatic way of handling that in Kafka using the rebalance hooks.
With that in place, you can batch load all the information you need to filter out duplicated / outdated records when you receive the records, iterate each record through the rule engine in memory, and then batch persist all results, to then release the lock.
Of course, it's hard to come up with an ideal strategy without knowing more details about the process. The point is by doing that you should be able to handle around 10x more records within each instance, so I'd definitely give it a shot.
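As a minimal sketch of the "filter, run the rules in memory, then persist once" flow described above, in plain Java and without any Kafka or database code: the Message shape, its version field, and the ruleEngine and batchInsert parameters are all hypothetical stand-ins for whatever your process actually uses, and the locking part is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

class BatchProcessor {

    // Hypothetical record shape: a business key plus a version used to spot outdated copies.
    static final class Message {
        final String key;
        final long version;
        final String payload;

        Message(String key, long version, String payload) {
            this.key = key;
            this.version = version;
            this.payload = payload;
        }
    }

    // Keeps only the highest version seen for each key within the batch.
    static List<Message> dropOutdated(List<Message> batch) {
        Map<String, Message> latest = new LinkedHashMap<>();
        for (Message m : batch) {
            latest.merge(m.key, m, (old, cur) -> cur.version > old.version ? cur : old);
        }
        return new ArrayList<>(latest.values());
    }

    // Runs every surviving record through the in-memory rule engine,
    // then hands the whole result list to a single batch insert.
    static int process(List<Message> batch,
                       UnaryOperator<Message> ruleEngine,
                       Consumer<List<Message>> batchInsert) {
        List<Message> results = new ArrayList<>();
        for (Message m : dropOutdated(batch)) {
            results.add(ruleEngine.apply(m));
        }
        batchInsert.accept(results); // one round trip instead of one per record
        return results.size();
    }

    public static void main(String[] args) {
        List<Message> batch = Arrays.asList(
                new Message("a", 1, "old"),
                new Message("b", 1, "x"),
                new Message("a", 2, "new"));
        int written = process(batch, m -> m,
                rows -> System.out.println("inserting " + rows.size() + " rows"));
        System.out.println("processed " + written);
    }
}
```

The point of the shape is that the expensive I/O happens once per batch rather than once per record; the de-duplication rule shown here is only one possible choice.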

How can I constrain NiFi processor responses? (queue) Apache NiFi

When I request an API in Nifi, more than one response returns. And the content of these responses is the same. If I don't stop the processor, it keeps coming. I keep turning the processor on and off quickly. Is there a way to restrict this?
Can I have the API return only a certain number of times, no matter how many requests are sent? For example, return only 3 responses.
NiFi flows are intended to be always-on streams. If you go to the Scheduling tab of a processor's config, you'll see that, by default, it is scheduled to run continuously (a Run Schedule of 0 sec).
If you don't want this style of streaming behaviour, you need to change the Scheduling of the processor.
You can change it to only schedule the processor every X seconds, or you can change it to run based on a CRON expression.

What happens if I give a large value to server.tomcat.max-threads to handle load on my application?

There are 1000+ jobs running through our service in a day, with around 70-80 jobs starting at the same time and running in parallel.
To handle this, we thought that raising the server.tomcat.max-threads property of our Spring application to a large number should work, but I am not confident about the side effects of setting this property to a huge number like 800.
Can you please help here?
The default installation of Tomcat sets the maximum number of HTTP servicing threads at 200. Effectively, this means that the system can handle a maximum of 200 simultaneous HTTP requests. When the number of simultaneous HTTP requests exceeds this count, the unhandled requests are placed in a queue, and the requests in this queue are serviced as processing threads become available. This default queue length is 100. At these default settings, a large web load that can generate over 300 simultaneous requests will surpass the thread availability, resulting in service unavailable (HTTP 503).
More reference: https://docs.bmc.com/docs/brid91/en/tomcat-container-workload-configuration-825210082.html
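The "200 threads plus a queue of 100" arithmetic above can be tried out with a plain java.util.concurrent thread pool. This is only a simplified model of the description quoted above, not Tomcat's actual connector code, and the class and method names are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolModel {

    // Submits `requests` tasks that all block until released and
    // returns how many the pool accepted (running + queued).
    static int accepted(int maxThreads, int queueLength, int requests) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueLength));
        int ok = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a request that is still being served
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                    ok++;
                } catch (RejectedExecutionException e) {
                    // every thread is busy and the queue is full
                }
            }
        } finally {
            release.countDown(); // let the simulated requests finish
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // 200 worker threads, queue of 100, 350 simultaneous requests
        System.out.println(accepted(200, 100, 350)); // prints 300
    }
}
```

Raising the thread count moves that ceiling, but every extra thread also costs stack memory and scheduling overhead, so a value like 800 is worth load testing rather than assuming.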
How to run multiple servlets execution in parallel for Tomcat?
If this is a batch job like configuration, you can use spring batch.

Validate newly created server support the same load

We are creating a new hosted server for one of our APIs on managed containers (Kubernetes) and we're trying to validate that it can handle at least the same amount of traffic load requests.
We've started with one of the APIs, where we would need to handle at least 140k requests per minute, all endpoints combined.
To verify this, I created a simple JMeter test as follows:
-Test Plan
---Thread Group Endpoint1
-----HTTP Request -> a GET request with query params for /path1
---Thread Group Endpoint2
-----HTTP Request -> a GET request with query params for /path2
For a local test, I used the following setup:
Thread Groups Endpoint1 and Endpoint2 are set to 200 threads (users), ramp-up period of 1s, loop count = forever and duration 60s.
Using a Summary Report listener when running the test gets me a total of ~9300 # Samples.
Using this approach, is it safe to just increase the number of threads (users) for the Thread Groups until I reach the desired 140k requests per minute?
Note: I only used JMeter a little before, so I'm aware that the entire approach may be wrong, therefore any suggestions and steering to the right path are more than welcomed.
Your approach is viable as long as it represents real-life application usage. If the application has 2 endpoints with evenly distributed load, your setup is just fine. If there are more endpoints and some of them are used more than the others, consider defining the workload accordingly, either using different Thread Groups or another distribution mechanism such as the Throughput Controller.
Increasing the number of threads is also fine; however, consider increasing the load gradually, i.e. increase the ramp-up time so that your test has:
Arrivals phase
Time to hold the load
Ramp-down phase
This way you will be able to correlate various metrics like response time, throughput, number of errors, etc. with the increasing load. You will also be able to state how many threads/requests per second there were when the system reached its saturation point/breaking point, and whether it recovers when the load drops back.
Also make sure you're following JMeter Best Practices, as 2,300-2,500 requests per second is not something JMeter can support out of the box; you will need to do some tuning, at least increasing the JVM heap size allocated to JMeter.
You may not be able to achieve the desired 140k requests per minute using a single JMeter machine; in that case you'll need a Distributed Load Testing approach.
refer: http://jmeter.apache.org/usermanual/jmeter_distributed_testing_step_by_step.html
Also, keeping the ramp-up period at 1 second will cause a spike and an unrealistic load on the system, which will not give a proper result unless you've pre-warmed your server. You should increase the load gradually, following the real or estimated traffic pattern.
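For a rough idea of the gap between the local run and the target, the figures from the question can be plugged into a small calculation. This sketch assumes the ~9300 samples are the total across both Thread Groups (400 threads in all) and that per-thread throughput stays constant as threads are added, which is optimistic; the class and method names are illustrative.

```java
// Rough sizing estimate from the numbers in the question; names are illustrative.
class LoadTargetEstimate {

    static double requestsPerSecond(long requestsPerMinute) {
        return requestsPerMinute / 60.0;
    }

    // Threads needed to hit the target if each thread kept producing
    // samples at the rate observed in the smaller test run.
    static long threadsNeeded(long targetPerMinute, long observedSamples,
                              int observedThreads, int durationSeconds) {
        double perThreadPerMinute =
                (double) observedSamples / observedThreads * 60.0 / durationSeconds;
        return (long) Math.ceil(targetPerMinute / perThreadPerMinute);
    }

    public static void main(String[] args) {
        long target = 140_000; // requests per minute, from the question
        System.out.printf("target rate: %.0f requests/second%n", requestsPerSecond(target));
        System.out.println("threads needed (linear extrapolation): "
                + threadsNeeded(target, 9_300, 400, 60));
    }
}
```

With these inputs the linear extrapolation lands at roughly 6,000 threads, which is one reason tuning JMeter or distributing the test across machines is likely to be needed. In practice response times usually grow under load, so treat the figure as a lower bound.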

Elasticsearch timeout true but still get result

I'm setting the timeout of my search query to 10ms, so I'm expecting the Elasticsearch search query to time out after 10ms.
In the response, I do get "timed_out":true, but the query doesn't seem to time out. It still runs for a few hundred milliseconds.
Sample response:
{
"took": 460,
"timed_out": true,
....
Is this the expected behavior, or am I missing something here? My goal is to terminate the query if it's taking too long, so that it doesn't put load on the cluster.
What to expect from query timeout?
An Elasticsearch query run with a timeout set may return partial or empty results if the timeout expires. From the Elasticsearch Guide:
The timeout parameter tells shards how long they are allowed to
process data before returning a response to the coordinating node. If
there was not enough time to process all data, results for this shard
will be partial, even possibly empty.
The documentation of the Request Body Search parameters also tells this:
timeout
A search timeout, bounding the search request to be executed within
the specified time value and bail with the hits accumulated up to that
point when expired. Defaults to no timeout.
For further details please consult this page in the guide.
How to terminate queries that run too long?
It looks like Elasticsearch does not have a definitive answer, only several workarounds for particular cases. Here they are.
There is no way to fully protect the system from DoS attacks (as of 2015). Long-running queries can be limited with the timeout or terminate_after query parameters. terminate_after is like timeout, but it counts the number of documents per shard. Both of these parameters are more like recommendations to Elasticsearch, which means that some long-running queries (a script query, for instance) can still run past the desired maximum execution time.
Since then, the Task Management API was introduced, and monitoring and cancelling long-running tasks became possible. This means that you will have to write some additional code that checks the health of the cluster and cancels the tasks.
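A minimal sketch of the selection logic such a watchdog might use, assuming a version that exposes the Task Management API. It relies on the documented endpoints for listing search tasks (GET /_tasks?actions=*search&detailed, which reports running_time_in_nanos per task) and for cancelling one (POST /_tasks/<task_id>/_cancel). The HTTP calls and JSON parsing are left out, and the class name, method name, and sample task ids are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SlowSearchCanceller {

    // Endpoint that lists running search tasks together with their running time.
    static final String LIST_PATH = "/_tasks?actions=*search&detailed";

    // Given task id -> running_time_in_nanos (as parsed from the list response),
    // returns the cancel endpoint for every task that has exceeded `maxMillis`.
    static List<String> cancelPaths(Map<String, Long> runningNanosByTaskId, long maxMillis) {
        long limitNanos = maxMillis * 1_000_000L;
        List<String> paths = new ArrayList<>();
        for (Map.Entry<String, Long> task : runningNanosByTaskId.entrySet()) {
            if (task.getValue() > limitNanos) {
                paths.add("/_tasks/" + task.getKey() + "/_cancel");
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a parsed task list response.
        Map<String, Long> tasks = new LinkedHashMap<>();
        tasks.put("node1:101", 5_000_000_000L); // running for 5 s
        tasks.put("node1:102", 10_000_000L);    // running for 10 ms

        System.out.println("list with: GET " + LIST_PATH);
        for (String path : cancelPaths(tasks, 1_000)) {
            System.out.println("cancel with: POST " + path);
        }
    }
}
```

A scheduled job could run this every few seconds and POST to each returned path. Whether a cancelled search stops promptly depends on the Elasticsearch version, so it is worth verifying against your cluster.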
