How do I kill the thread of a search request on an Elasticsearch cluster? Is there an API to do this?

I built an Elasticsearch cluster holding a large amount of data, and clients can send search requests to it.
Sometimes the cluster takes a long time to handle a single request.
My question is: is there any API to kill a specific thread that is taking too long?

I wanted to follow up on this answer now that Elasticsearch 1.0.0 has been released. I am happy to announce that new functionality has been introduced that implements some protection for the heap, called the circuit breaker.
With the current implementation, the circuit breaker tries to anticipate how much data is going to be loaded into the field data cache, and if it's greater than the limit (80% by default) it will trip the circuit breaker and thereby kill your query.
There are two parameters for you to set if you want to modify them:
indices.fielddata.breaker.limit
indices.fielddata.breaker.overhead
The overhead is the constant that is used to estimate how much data will be loaded into the field cache; this is 1.03 by default.
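If you do want to change them, something like the following should work against the cluster update settings API; the host and the 70% value are just for illustration, and the settings can also live in elasticsearch.yml:

# Sketch: adjust the fielddata circuit breaker via the cluster update settings API.
# Assumes an Elasticsearch 1.x node reachable on localhost:9200; values are illustrative.
import requests

settings = {
    "persistent": {
        "indices.fielddata.breaker.limit": "70%",   # trip before 70% of heap instead of the 80% default
        "indices.fielddata.breaker.overhead": 1.03  # estimation multiplier (the default)
    }
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
print(resp.json())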
This is an exciting development for Elasticsearch and a feature I have been waiting on for months.
This is the pull request, if you are interested in seeing how it was implemented; thanks to dakrone for getting this done!
https://github.com/elasticsearch/elasticsearch/pull/4261
Hope this helps,
MatthewJ

Currently it is not possible to kill or stop long-running queries, but Elasticsearch is going to add a task management API to do this. The API is likely to arrive in Elasticsearch 5.0, probably in 2016 or later.
See Task management 1 and Task management 2.
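For reference, a rough sketch of how the task management API ended up looking in 5.x; the host and the node:task id below are placeholders:

# Sketch: list running search tasks and cancel one (Elasticsearch 5.x+ task management API).
# The host and the task id are placeholders for illustration.
import requests

# List currently running search tasks with details
tasks = requests.get("http://localhost:9200/_tasks",
                     params={"actions": "*search", "detailed": "true"}).json()
print(tasks)

# Cancel a specific task using the node:id pair taken from the listing above
requests.post("http://localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel")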

Related

Is there any way to check the relocation progress of shards in elasticsearch?

Yesterday I was adding a node to a production Elasticsearch cluster. Once I added it, I could use the /_cat/health API to check the number of relocating shards, and the /_cat/shards API to check which shards are being relocated. However, is there any way or API to check the live progress of shard/data movement to the newly added node? Suppose there is a 13GB shard and we've added a node to the ES cluster: can we check what percentage (or how many GBs, MBs or KBs) has moved so far, so that we can estimate how long the reallocation will take?
Can this be implemented on our own, or should we suggest it to Elasticsearch? If it can be implemented on our own, how should I proceed and what prerequisites do I need to know?
You have:
GET _cat/recovery?active_only=true&v
GET _cat/recovery?active_only=true&h=index,shard,source_node,target_node,bytes_percent,time
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html
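If you want something closer to a live progress view, you can poll that same endpoint; a rough sketch in Python, where the host and polling interval are assumptions and the cat API's JSON output is used so the columns come back as fields:

# Sketch: poll _cat/recovery for active recoveries and print per-shard progress.
# Assumes Elasticsearch on localhost:9200; cat APIs accept format=json.
import time
import requests

while True:
    rows = requests.get(
        "http://localhost:9200/_cat/recovery",
        params={"active_only": "true", "format": "json",
                "h": "index,shard,source_node,target_node,bytes_percent,time"}
    ).json()
    if not rows:
        print("no active relocations/recoveries")
        break
    for r in rows:
        print(f"{r['index']} shard {r['shard']}: {r['bytes_percent']} "
              f"({r['source_node']} -> {r['target_node']}, elapsed {r['time']})")
    time.sleep(10)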
Take a look at the Task Management API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html
The task management API returns information about tasks currently executing on one or more nodes in the cluster.
GET /_tasks
You can also see the reasons for the allocation using the allocation explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html
GET _cluster/allocation/explain

Is there a "best practice" in microservice development for versioning a database table?

A system is being implemented using microservices. In order to decrease interactions between microservices implemented "at the same level" in the architecture, some microservices will locally cache copies of tables managed by other services. The assumption is that the locally cached table (a) is frequently accessed in a "read mode" by the microservice, and (b) has relatively static content (i.e., more of a "lookup table" than transactional content).
The local caches will maintain synch using inter-service messaging. As the content should be fairly static, this should not be a significant issue/workload. However, on startup of a microservice, there is a possibility that the local cache has gone stale.
I'd like to implement some sort of rolling revision number on the source table, so that microservices with local caches can check this revision number to potentially avoid a re-synch event.
Is there a "best practice" to this approach? Or, a "better alternative", given that each microservice is backed by it's own database (i.e., no shared database)?
In my opinion you shouldn't be loading the data at startup; it might be a bit complicated to maintain a version.
Cache-Aside Pattern
Generally in a microservices architecture you consider the "cache-aside pattern". You don't build the cache up front but on demand. When you get a request you check the cache; if the entry isn't there, you update the cache with the latest value and return the response, and from then on it's always returned from the cache. The benefit is that you don't need to load everything up front. Say you have 200 records but the services only use 50 of them frequently; if you load everything, you are maintaining extra cache that may not be required.
Let the requests build the cache; it's a one-time DB hit. You can set an expiry on the cache, and incoming requests will build it again.
If you have data that is totally static (never ever changes) then this pattern may not be worth discussing, but if you have a lookup table that can change even once a week or month, then you should be using this pattern with a longer cache expiration time. Maintaining a version number could be costly. But it's really up to you how you want to implement it.
https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside
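To make the pattern concrete, here is a minimal cache-aside sketch; the TTL, the in-memory dict standing in for Redis/memcached, and fetch_row_from_db are all illustrative assumptions:

# Sketch of the cache-aside pattern: the cache is filled on demand, not at startup.
# A plain dict with timestamps stands in for Redis/memcached; fetch_row_from_db is hypothetical.
import time

CACHE_TTL_SECONDS = 3600  # long TTL for near-static lookup data
_cache = {}               # key -> (value, cached_at)

def fetch_row_from_db(key):
    # Placeholder for the real database/service call
    return {"id": key, "label": f"row-{key}"}

def get(key):
    hit = _cache.get(key)
    if hit is not None:
        value, cached_at = hit
        if time.time() - cached_at < CACHE_TTL_SECONDS:
            return value            # served from cache
    value = fetch_row_from_db(key)  # cache miss or expired: go to the source
    _cache[key] = (value, time.time())
    return value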
We ran into this same issue and have temporarily solved it by using a LastUpdated timestamp comparison (same concept as your VersionNumber). Every night (when our application tends to be slow) each service publishes a ServiceXLastUpdated message that includes the most recent timestamp when the data it owns was added/edited. Any other service that subscribes to this data processes the message, and if there's a mismatch it requests all rows "touched" since its last local update so that it can get back in sync.
For us, for now, this is okay, as new services don't tend to come online and be in use the same day. But our plan going forward is that any time a service starts up, it can publish a message for each subscribed service indicating its most recent cache update timestamp. If a "source" service sees the timestamp is not current, it can send updates to re-sync the data. This has the advantage of only sending the needed updates to the specific service(s) that need them, even though (at least for us) all subscribed services have access to the messages.
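A rough sketch of that handshake, with the message bus replaced by plain function calls and all names hypothetical:

# Sketch of the LastUpdated re-sync handshake between a subscriber and the source service.
# The message names, stores and transport are hypothetical stand-ins for a real message bus.
from datetime import datetime, timezone

source_rows = {  # rows owned by the source service, keyed by id
    1: {"label": "alpha", "updated_at": datetime(2023, 5, 1, tzinfo=timezone.utc)},
    2: {"label": "beta",  "updated_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
}

def source_last_updated():
    """What the source publishes in its nightly ServiceXLastUpdated message."""
    return max(r["updated_at"] for r in source_rows.values())

def rows_touched_since(ts):
    """What the source sends back when a subscriber reports a stale cache."""
    return {k: r for k, r in source_rows.items() if r["updated_at"] > ts}

def subscriber_on_startup(local_cache, local_last_updated):
    # If the published timestamp is newer than ours, request only the touched rows.
    if local_last_updated < source_last_updated():
        local_cache.update(rows_touched_since(local_last_updated))
    return local_cache

cache = subscriber_on_startup({}, datetime(2023, 5, 15, tzinfo=timezone.utc))
print(cache)  # only the rows touched since the local timestamp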
We started with using persistent queues, so if all instances of a microservice were down, the messages would just build up in its queue. There are two issues with this that led us to build something better:
1) It obviously doesn't solve the "first startup" scenario, as there is no queue for messages to build up in.
2) If ANYTHING goes wrong either in storing queued messages or processing them, you end up out of sync. If that happens, you still need a proactive mechanism like the one we have now to bring things back in sync. So it seemed worth going this route.
I wouldn't say our method is a "best practice" and if there is one I'm not aware of it. But, the way we're doing it (including planned future work) has so far proven simple to build, easy to understand and monitor, and robust in that it's extremely rare we get an event caused by out-of-sync local data.

How do I figure out indexing errors in Elasticsearch?

I am using ES 1.x and having trouble finding the errors while indexing some documents.
Some documents are not getting indexed, and all I see are the lines below in the ES logs.
stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
now throttling indexing: numMergesInFlight=4, maxNumMerges=3
I did a quick Google search and understood these messages at a high level, but I would like to understand the following:
Will ES retry the documents that were throttled?
Is there any way to know which documents were throttled by enabling some detailed logging, and if yes, in which classes?
I don't see any error message apart from the INFO logs above. Is there a way to enable verbose logging for indexing that shows what exactly is going on during indexing?
The throttling messages you see in the logs are not the issue. Throttling happens in the background so that Elasticsearch can protect against segment explosion. See the explanation here: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#segments-and-merging
The throttling does not drop messages; it just slows down indexing, which causes back pressure on the indexers and external queues.
When indexing fails you should get an error response for the index/bulk request. In order to tell what the issue is, you should inspect the responses ES provides for the index/bulk requests. Logs might not tell the full story, as that depends on the log level configuration, which is per module in ES.
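For example, with the bulk API the response carries an errors flag and a per-item status/error; a minimal sketch of inspecting it, where the host, index and document are illustrative:

# Sketch: send a bulk request and report any per-document failures from the response.
# Assumes Elasticsearch on localhost:9200; index/type names and the document are illustrative.
import json
import requests

actions = [
    {"index": {"_index": "myindex", "_type": "doc", "_id": "1"}},
    {"field": "value"},
]
body = "\n".join(json.dumps(a) for a in actions) + "\n"

resp = requests.post("http://localhost:9200/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"}).json()

if resp.get("errors"):
    for item in resp["items"]:
        result = item.get("index") or item.get("create") or {}
        if "error" in result:
            print("failed:", result.get("_id"), result["error"])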
Another option is that you are able to index, but the documents don't have the timestamp you think they do. Check _cat/indices to see if the doc count increases when you index; if it does, the indexed docs are there and you need to refine your searches.
Elasticsearch does not do retries, to the best of my knowledge; that is up to the client (though I haven't used 1.x in quite some time).
Logstash, for example, retries batches it gets 503 and 429 responses on, exactly for these kinds of reasons: https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/master/lib/logstash/outputs/elasticsearch.rb#L55

Logstash - ElasticSearch - Kibana :: delay of 10 or more seconds

I have a default "elasticsearch" stack (Ubuntu 14.04, ES v1.2) deployed, from Redis to Kibana.
I'm sending a #now value with the current date at the moment the event is sent, and Elasticsearch assigns its own #timestamp.
Well, if you calculate (#timestamp - #now) there is more than 10 seconds (sometimes even a minute) of lag/delay.
Is this normal behaviour? I haven't tuned my instance much, but I'm sending very few events and it doesn't look like a problem of performance/memory/IO.
Any hint is welcome.
You have at least 5 pieces of software along the way (you don't mention what shipper you're using).
First, make sure everything's "warm" when you're looking at results. logstash and elasticsearch are JVM-based, so there's all that overhead to worry about. I usually give them 2 minutes before I start measuring anything.
Secondly, look for buffer sizes, which could cause more impact in a low-volume environment like yours. Does your shipper send every message, or batch (logstash has a default of 50 documents per batch when used as a shipper to redis)? What about when reading from redis (default is 1, but can be changed)? What about sending from logstash to elasticsearch (default is 1,000 though it is also flushed every second)?
What about your hardware all along the chain? CPU utilization? RAM allocation? SSD vs. spinning disks? Network latency? Garbage collection?
How much filtering are you doing on the shipper or the indexer? A lot of bad regexps?
Or even the basics - are the clocks set identically?
[ I can already see the SO police suggesting that this should be a comment and not an answer. However, you'll notice specific things mentioned for the OP to research, and the lesson that there are a lot of knobs to turn. ]
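If you want to quantify the lag instead of eyeballing single events, you could pull the newest documents and diff the two timestamps yourself; a rough sketch, assuming the fields come through as @timestamp and now, and adjusting the field names, index pattern and host to your setup:

# Sketch: sample the newest documents and print the delay between the shipper-side
# timestamp ("now" here) and the indexed "@timestamp".
# Field names, index pattern and host are assumptions; adjust to your pipeline.
from dateutil import parser  # pip install python-dateutil
import requests

query = {"size": 20, "sort": [{"@timestamp": "desc"}]}
hits = requests.post("http://localhost:9200/logstash-*/_search",
                     json=query).json()["hits"]["hits"]

for h in hits:
    src = h["_source"]
    delay = parser.parse(src["@timestamp"]) - parser.parse(src["now"])
    print(h["_id"], delay.total_seconds(), "seconds")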

Replacing Nagios HTTP with custom (select/poll driven) daemon?

I have a Nagios configuration which is performing a number of tests on a few hundred nodes; one of these is a variant of check_http. It's not configured with --enable-embedded-perl (ePN), but we'll be changing that soon. Even with ePN enabled, I'm concerned about the model where each execution of this Perl HTTP+SSL check handles only a single target.
I'd like to write a simple select() (or poll() / epoll()) driven daemon which creates connections to multiple targets concurrently, reads the results and spits out results in a form that's useable to Nagios as if it were results from a passive check.
Is there a guide to how one could accomplish this? What's the interface or API for providing batched check updates to Nagios?
One hack I'm considering would be to have my daemon update a Redis store (with a key for each target and a short expiration time) and replace check_http with a very small, lightweight GET of the local Redis instance on that key (the GET would either return the actual results for Nagios, or a "(nil)" response which would be treated as if the HTTP connection had timed out).
However, I'm also a bit skeptical of my idea, since I'd think someone has already built something like this by now.
(BTW: I'm ready to be convinced to switch to something like Icinga or Zabbix or Zenoss or OpenNMS ... pretty much anything that will scale better).
As to whether or not to let Nagios handle the scheduling and checks, I'll leave that to you, as it varies depending on your version of Nagios (newer versions can run these checks concurrently) and why you want a separate daemon for it. Regarding Nagios versions, version 3 IIRC uses concurrent checks, and thus scales to larger node counts than you report.
However, I can answer the Redis route concept as I've done it with Postfix queue stats and TTFB tracking for web sites.
Setting up the check using Python with the curl and multiprocessing modules is fairly straightforward, as is dumping the results into Redis. I'd recommend an expiration no longer than (or possibly just less than) the check interval, both to keep the DB from growing and to avoid grabbing stale check results. If the currently running check hasn't completed and the Redis-to-Nagios check runs, pulling in the previous result, you can miss failed checks.
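A minimal sketch of that collector, using requests in place of pycurl; the URL list, key naming and 55-second TTL are assumptions, not recommendations:

# Sketch: check many URLs concurrently and drop each result into Redis with a short TTL.
# Uses requests + multiprocessing.Pool; key names, TTL and the URL list are illustrative.
import json
import time
from multiprocessing import Pool

import redis
import requests

TARGETS = ["http://host1.example.com/", "http://host2.example.com/"]
TTL_SECONDS = 55  # just under a hypothetical 60-second check interval

def check(url):
    start = time.time()
    try:
        r = requests.get(url, timeout=10)
        return url, {"status": r.status_code,
                     "ttfb": r.elapsed.total_seconds(),  # approximate time to first byte
                     "total": time.time() - start}
    except requests.RequestException as exc:
        return url, {"status": None, "error": str(exc)}

if __name__ == "__main__":
    conn = redis.Redis(host="localhost", port=6379)
    with Pool(processes=20) as pool:
        for url, result in pool.imap_unordered(check, TARGETS):
            conn.setex("httpcheck:" + url, TTL_SECONDS, json.dumps(result))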
For the Redis-to-Nagios check, a simple redis-cli+bash script or Python check that pulls the data for a given host and returns OK or otherwise, depending on your data, would run quickly enough.
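And a matching sketch of the Nagios-side check, mirroring the key naming from the collector above; exit codes follow the usual Nagios convention (0 OK, 2 CRITICAL):

#!/usr/bin/env python
# Sketch: Nagios plugin that reads the latest result for one target from Redis.
# A missing/expired key is treated like a timed-out check; names are illustrative.
import json
import sys

import redis

def main(url):
    raw = redis.Redis(host="localhost", port=6379).get("httpcheck:" + url)
    if raw is None:
        print(f"CRITICAL - no recent result for {url} (treated as timeout)")
        return 2
    result = json.loads(raw)
    if result.get("status") == 200:
        print(f"OK - {url} returned 200 in {result['total']:.3f}s | ttfb={result['ttfb']:.3f}s")
        return 0
    print(f"CRITICAL - {url} returned {result.get('status')} ({result.get('error', '')})")
    return 2

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))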
I'd recommend running the Redis instance on the Nagios check server to ensure minimum latency and to avoid a network issue causing false alerts on your checks. I would also recommend a Nagios check on your Redis instance and on the checking daemon. Make the check_http replacement check dependent on the Redis and http_check daemons running. Thus you have a dependency chain as follows:
Redis -> http_checkd -> http_check_replacement
This will prevent false alerts on http_check_replacement by identifying the problem. For example, if your redis_checkd dies you get alerted to that, not 200+ "failed http_check_replacement" ones.
Also, since your data in Redis is by definition transient, I would disable the disk persistence. No need to write to disk when the data is constantly rotating.
On a side note, I would recommend, if using libcurl, that you pull statistics from libcurl about how long it takes to get the connection open and how long the server took to respond (Time To First Byte - TTFB), and take advantage of Nagios's ability to store check statistics. You may well reach a time when having that data is really handy for troubleshooting and performance analysis.
I have a CLI tool I've written in C which does this and uploads the results into a local Redis instance. It is fast - barely more than the time to fetch the URL. I'm expecting it to be open sourced this week; I can add Nagios-style output to it fairly easily. In fact, I think I'll do that in the next week or two.
