Elasticsearch slowed down suddenly

I'm using Elasticsearch for my website. Everything was normal and I received responses within 60 ms at most, but it suddenly slowed down. Now responses take at least 200 ms.

It is likely that your web server is causing this rather than the Elasticsearch service itself. If you can, check the connection logs on your server to see if it is receiving a lot of requests. If you are hosting your Elasticsearch instance on a public-facing website, it's very possible that someone, or several people, are sending a lot of requests or queries to it, which could be causing it to slow down.
It may be a good idea to put your Elasticsearch behind Apache2 or something similar to protect against this. That way, you can limit requests to specific IPs and restrict the HTTP methods that can be called against the Elasticsearch cluster.

Related

How to limit Couchbase client from trying to connect to Couchbase server when it's down?

I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.
Problem is, when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to the server; from what I see, the class that does this is ConfigEndpoint and there's no limit to how many times it tries before giving up. This is flooding the logs with java.net.ConnectException: Connection refused errors. What I'd like, is for it to try a few times, and then stop.
Got any ideas that can help?
Edit:
Here's a sample app.
Steps to reproduce the problem:
svn export https://github.com/asarkar/spring/trunk/beer-demo.
From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to time out due to the failure to connect to Couchbase, but the Couchbase client is going to flood the console continuously.
The reason we choose to have the client continue connecting is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it pretty intelligently, I think, in that we do an exponential backoff and have tuneables so it's reasonable out of the box and can be adjusted to your environment.
As to what you're trying to do, one of the tuneables is related to retry. By adjusting the timeout value and the retry strategy, you can keep the client referenced by the application and simply fail fast if it can't service the request.
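For illustration, here is a rough sketch of what those timeout and retry tuneables look like with the 2.x Java SDK; treat the exact builder methods, hosts and values as assumptions and check the environment builder for your SDK version:

import java.util.concurrent.TimeUnit;

import com.couchbase.client.core.retry.FailFastRetryStrategy;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class FastFailCouchbase {
    public static void main(String[] args) {
        // Assumption: Couchbase Java SDK 2.x; method names may vary slightly by version.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .connectTimeout(TimeUnit.SECONDS.toMillis(5))  // give up on bootstrap/connect after 5s
                .kvTimeout(500)                                 // per-operation timeout in ms
                .retryStrategy(FailFastRetryStrategy.INSTANCE)  // surface errors instead of retrying internally
                .build();

        Cluster cluster = CouchbaseCluster.create(env, "cb-host"); // "cb-host" is a placeholder
        Bucket bucket = cluster.openBucket("beer-sample");
        // ... use the bucket; operations now fail fast when the cluster is unreachable
        cluster.disconnect();
        env.shutdown();
    }
}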
The other option is that we do have a way to let your application know what node would handle the request (or null if the bootstrap hasn't been done) and you can use this to implement circuit breaker like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.
All of that said, these are not the normal path as the intent is that your Couchbase Cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of data being accessed, with replica reads from exception handlers and other intentionally stale reads for you to buy into if you need them.
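As a rough illustration of the "replica reads from exception handlers" idea, assuming the 2.x Java SDK (the exception handling is deliberately simplified):

import java.util.List;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.ReplicaMode;
import com.couchbase.client.java.document.JsonDocument;

public class ReplicaReadFallback {
    // Sketch: fall back to a possibly stale replica copy when the active copy is unreachable.
    static JsonDocument getWithReplicaFallback(Bucket bucket, String id) {
        try {
            return bucket.get(id);
        } catch (RuntimeException primaryUnavailable) {
            // Intentionally stale read from the first replica that answers.
            List<JsonDocument> copies = bucket.getFromReplica(id, ReplicaMode.FIRST);
            return copies.isEmpty() ? null : copies.get(0);
        }
    }
}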
Hope that helps and glad to get any feedback on what you think we should do differently.
Solved this issue myself. The client I designed handles the following use cases:
The client startup must be resilient to CB failure/unavailability.
The client must not fail the request, but return a degraded response instead, if CB is not available (see the sketch below).
The client must reconnect should a CB failover happen.
I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.
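To make the degraded-response use case concrete, here is a minimal sketch using the async API and RxJava mentioned in the question; the 200 ms timeout and the fallback document are illustrative assumptions, not necessarily what the linked post does:

import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.AsyncBucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import rx.Observable;

public class DegradedResponse {
    // Sketch: never fail the caller; emit an empty "degraded" document if CB can't answer in time.
    static Observable<JsonDocument> getOrDegrade(AsyncBucket bucket, String id) {
        JsonDocument degraded = JsonDocument.create(id, JsonObject.create().put("degraded", true));
        return bucket.get(id)
                .timeout(200, TimeUnit.MILLISECONDS)   // don't let a down cluster block the request
                .onErrorReturn(err -> degraded);       // degrade instead of propagating the failure
    }
}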
Start a separate thread and keep calling ping from it every 10 or 20 seconds; once CB is down, the ping will start failing. Add a check like: if the ping fails 5-6 times in a row, close all the CB connections/resources.
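A rough sketch of that watchdog, assuming a 2.x Java SDK recent enough to have Bucket.ping() (on older SDKs a cheap get() against a known key would do instead):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;

public class CouchbaseWatchdog {
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Sketch: ping every 15s; after 5 consecutive failures, release CB resources.
    void start(Bucket bucket, Cluster cluster) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                bucket.ping();                 // assumption: SDK 2.5+; use a cheap get() on older SDKs
                consecutiveFailures.set(0);
            } catch (RuntimeException e) {
                if (consecutiveFailures.incrementAndGet() >= 5) {
                    cluster.disconnect();      // close all CB connections/resources
                    scheduler.shutdown();
                }
            }
        }, 15, 15, TimeUnit.SECONDS);
    }
}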

Timeouts with 503 response in Elasticsearch for part of the _cat API

I have a quite large Elasticsearch cluster with more than 100 nodes, and sometimes the cluster starts returning 504 HTTP codes and timing out for requests to the _cat API, in particular /_cat/indices and /_cat/shards. As a result, KOPF does not load, I guess because it calls this same API under the hood. This happens even when the cluster is green, and it is only resolved when I restart the cluster. Indexing and search, even from Kibana, work fine, as do other APIs like _cluster/health?level=shards and _cat/nodes.
I'm using Elasticsearch 1.7.1. Any idea why this might be happening? I know I have to upgrade the version, but I would like to understand what is going on here.
Note this question is similar to Elasticsearch Not Responding to Certain API Calls / Kibana and Head not loading, but that question hasn't been answered yet.

How to troubleshoot Kibana Time-Outs

I have been experiencing an issue where occasionally my Kibana stops working, citing a timeout trying to connect to Elasticsearch as the cause (I have Marvel installed); the error is something like "plugin:elasticsearch Request Timeout".
Usually these go away by the next day, and occasionally I have been able to regain access to my data by increasing the timeout in Kibana. However, I can't figure out how to troubleshoot this issue. I suspect that ES is storing some extremely large individual documents, but I cannot find them; there are just too many logs to dig through by hand.
My elasticsearch cluster is perfectly healthy (green on health check), even when kibana cannot access it.
Where can I possibly start to troubleshoot why we are getting timeouts here? When I expand the timeout window, Kibana comes back and everything works FINE.
Any tips on where to start searching would be enormously appreciated!!

Elastic Search Load Testing

I have a single-node Elasticsearch server running on EC2. I want to do some load testing using search requests with random search queries. I am using JMeter for load testing with two different approaches:
HTTP Client - When I test using this client with 10k/20k/50k requests, it works fine.
ES Transport Client - This works fine with approx. 2k requests.
Here are the steps I have followed:
Instantiate a client on every run and close it once the test has finished.
Once the client is instantiated, start the JMeter sampling and send the search requests.
After the run, stop the sampling.
I am getting NoNodeAvailableException after about 2k requests with the transport client.
The ES server is running with 3 GB of memory, and I have given 6 GB of memory to the load tester.
Please let me know if some config modification is required, or if I am not using the correct approach to test the load.
Thanks in Advance.
What kind of responses are you getting from the HTTP test? Have you verified you are getting valid responses for all 10k-50k requests? It might be that your cluster cannot take the load you're putting on it in either test. Since TransportClient is more intimately coupled to the ES server, you will explicitly see errors that come back from it, but if you're simply sending requests via HTTP without validating the responses, it's easy to miss any issues.
Although, before taking a stab in the dark like I just did, I would also check to see what kind of QPS you are getting using the HTTP method vs the TC method, what your CPU/memory look like throughout both tests, what the response times look like, etc. It helps to monitor the health of your system throughout the process to detect any symptoms that might help explain the cause.
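One more thing worth checking on the TransportClient side: creating a new client for every run (the first step in the question) is expensive to bootstrap, and churning clients under load is a common way to end up with NoNodeAvailableException. A rough sketch of a single shared client, assuming the 1.x-era Java API (the cluster name, host and index are placeholders; adjust for your ES version):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class SharedEsClient {
    // One client for the whole test run, not one per sampled request.
    private static final TransportClient CLIENT = new TransportClient(
            ImmutableSettings.settingsBuilder().put("cluster.name", "my-cluster").build())
            .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));

    static SearchResponse search(String term) {
        return CLIENT.prepareSearch("my-index")
                .setQuery(QueryBuilders.matchQuery("title", term))
                .execute()
                .actionGet();
    }

    static void shutdown() {
        CLIENT.close();   // close once, after the whole test has finished
    }
}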

How can I increase SSL performance with Elastic Beanstalk

I really like Elastic Beanstalk and managed to get my webapp (Spring MVC, Hibernate, ...) up and running using SSL on a Tomcat7 64-bit container.
A major concern to me is performance (I thought using the Amazon cloud would help here).
To benchmark my server performance I am using blitz.io (which uses the amazon cloud to have multiple clients access my webservice simultaneously).
My very first simple performance test already got me wondering:
I benchmarked a health check url (which basically just prints "I'm ok").
Without SSL: Looks fine.
13 Hits/s with a response time of 9ms
230 Hits/s with a response time of 8ms
With SSL: Not so fine.
13 Hits/s with a response time of 44ms (Ok, this should be a bit larger due to encryption overhead)
30 Hits/s with a response time of 3.6s!
Going higher left me with connection timeouts (timeout = 10s).
I tried using a larger EC2 instance in the background with essentially the same result.
If I am not mistaken, the load balancer in front of the EC2 instances serves as the endpoint for SSL encryption. How do I increase this performance?
Can this be done with elastic beanstalk? Or do I need to setup my own load balancer etc.?
I also did some tests using Heroku (albeit with a slightly different technology stack, Play! vs. Spring MVC). Here I also saw the increased response time, but it stayed mostly constant. I am assuming they are using quite performant SSL endpoints. How do I get that for Elastic Beanstalk?
It seems my testing method was flawed.
Amazon's Elastic Load Balancers seem to go up to 10k SSL requests per second.
See this great writeup:
http://blog.mattheworiordan.com/post/24620577877/part-2-how-elastic-are-amazon-elastic-load-balancers
SSL requires a handshake before a secure transmission channel is opened. Once the handshake, which involves several round trips, is done, the data is transmitted.
When you are just hitting a page with a load tester, it does the handshake for each and every hit; it does not reuse an already established session.
That's not how browsers behave. A browser will do the handshake once and then reuse the open encrypted session for all subsequent requests for a certain duration.
So I would not be too worried about the results. I suggest you try a tool like www.browsermob.com to see how long a full page with many images, JS, CSS, etc. takes to load over SSL vs non-SSL. That will be a fair comparison.
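If you want to see the connection-setup cost for yourself, something along these lines (plain JDK; the URL is a placeholder) compares a fresh connection per request against a reused keep-alive connection, which roughly mirrors the browser behaviour described above. Note that the JVM may still resume the TLS session between requests, so this actually understates the cost of a completely cold handshake:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TlsReuseDemo {
    // Placeholder URL; point this at your own HTTPS health-check endpoint.
    private static final String ENDPOINT = "https://example.com/";

    public static void main(String[] args) throws Exception {
        time("Connection: close (new TCP + TLS setup each time)", true);
        time("keep-alive (connection reused after the first request)", false);
    }

    private static void time(String label, boolean closeEachTime) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < 20; i++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
            if (closeEachTime) {
                conn.setRequestProperty("Connection", "close");
            }
            int code = conn.getResponseCode();   // actually performs the request
            try (InputStream body = code >= 400 ? conn.getErrorStream() : conn.getInputStream()) {
                if (body != null) {
                    while (body.read() != -1) { /* drain so keep-alive connections can be reused */ }
                }
            }
        }
        System.out.printf("%s: %d ms for 20 requests%n", label, (System.nanoTime() - start) / 1_000_000);
    }
}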
Helps?

Resources