How to Fix Read timed out in Elasticsearch

I used Elasticsearch-1.1.0 to index tweets.
The indexing process is okay.
Then I upgraded the version. Now I use Elasticsearch-1.3.2, and I get this message randomly:
Exception happened: Error raised when there was an exception while talking to ES.
ConnectionError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)) caused by: ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)).
Snapshot of the randomness:
Happened --33s-- Happened --27s-- Happened --22s-- Happened --10s-- Happened --39s-- Happened --25s-- Happened --36s-- Happened --38s-- Happened --19s-- Happened --09s-- Happened --33s-- Happened --16s-- Happened
--XXs-- = after XX seconds
Can someone point out how to fix the Read timed out problem?
Thank you very much.

It's hard to give a direct answer since the error you're seeing might be associated with the client you are using. However, a solution might be one of the following:
1. Increase the default timeout globally when you create the ES client by passing the timeout parameter. Example in Python:
es = Elasticsearch(timeout=30)
2. Set the timeout per request made by the client. Taken from the Elasticsearch Python docs:
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
The above will give the cluster some extra time to respond.

Try this:
es = Elasticsearch(timeout=30, max_retries=10, retry_on_timeout=True)
It may not fully avoid ReadTimeoutError, but it minimizes them.

Read timeouts can also happen when the query is large. For example, in my case of a pretty large ES index (> 3M documents), a search for a query with 30 words took around 2 seconds, while a search for a query with 400 words took over 18 seconds. So for a sufficiently large query even timeout=30 won't save you. An easy solution is to crop the query to a size that can be answered within the timeout.
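As a rough illustration only (not the asker's code; the 300-word cutoff and the "text" field name are assumptions you would tune for your own index):
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)
MAX_QUERY_WORDS = 300  # hypothetical cutoff; measure your own latency to pick a value

def cropped_search(index, query_text):
    # keep only the first MAX_QUERY_WORDS words so the match query stays cheap
    # enough to finish within the client timeout
    words = query_text.split()[:MAX_QUERY_WORDS]
    return es.search(index=index, body={"query": {"match": {"text": " ".join(words)}}})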

For what it's worth, I found that this seems to be related to a broken index state.
It's very difficult to reliably recreate this issue, but I've seen it several times; operations run as normal except certain ones which periodically seem to hang ES (specifically refreshing an index it seems).
Deleting an index (curl -XDELETE http://localhost:9200/foo) and reindexing from scratch fixed this for me.
I recommend periodically clearing and reindexing if you see this behaviour.
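With the Python client, a minimal sketch of that approach (assuming a hypothetical index named foo and that you can rebuild its documents from your original source):
from elasticsearch import Elasticsearch

es = Elasticsearch()
# drop the suspect index entirely; ignore 404 in case it is already gone
es.indices.delete(index="foo", ignore=[404])
# recreate it and reindex from the source of truth
es.indices.create(index="foo")
for i, doc in enumerate(load_source_documents()):  # load_source_documents is hypothetical
    es.index(index="foo", doc_type="tweet", id=i, body=doc)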

Increasing various timeout options may immediately resolve issues, but does not address the root cause.
Provided the Elasticsearch service is available and the indexes are healthy, try increasing the Java minimum and maximum heap sizes: see https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html .
TL;DR: edit /etc/elasticsearch/jvm.options and set -Xms1g and -Xmx1g.
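In that file the two heap settings sit on their own lines (1g is just the value from the TL;DR; keep the two equal and size them to your host's RAM):
-Xms1g
-Xmx1g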

You should also check that everything is fine with Elasticsearch itself. Some shards can be unavailable; here is a nice doc about possible reasons for unassigned shards: https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/
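From the Python client, a quick health check might look like this (a sketch; the same information is available via curl against _cluster/health and _cat/shards):
from elasticsearch import Elasticsearch

es = Elasticsearch()
# overall status ('green'/'yellow'/'red') plus the number of unassigned shards
print(es.cluster.health())
# per-shard view, useful for spotting UNASSIGNED shards and the index they belong to
print(es.cat.shards())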

Related

AWS Elasticsearch - delete_by_query: how to find the task id so code can continue when the delete is done, and when to use wait_for_completion

I'm using AWS Lambda to do a delete_by_query on an Elasticsearch index so I get rid of everything older than 7 days. That works, but I noticed that the count of the documents is the same before and after, so if I were to run a query in Elasticsearch I may not get correct results until the delete_by_query is completed.
I found this post (python 3.x - Right way to delete and then reindex ES documents - Stack Overflow) that states that it is "best to set wait_for_completion to False. In this case you'll get task details and will be able to track task progress." For one, I haven't found anything that states why this is the case, unless your delete takes 4 hours like that example.
I found code to determine if the delete_by_query is still running at this wonderful site here and tried:
es_client.tasks(detailed=True,actions="*/delete/byquery")
However, I'm getting the message that
'TasksClient' object is not callable.
I am not entirely sure if that is true or not, or if my syntax is incorrect and that is why it is not working. It doesn't make sense that I can't programmatically query tasks with Python if I can do it in the console and with curl.
If it is not good to set wait_for_completion to False, and I can't query this with Python, how am I to programmatically get any information about the task or an understanding as to whether I can go ahead with the analytical queries or whatever else I want to do that depends on this task being done?
Okay, I'm not entirely sure why you are getting that error, so I can't help with that in particular. But I noticed that the Python Elasticsearch documentation on how to get the task id from delete_by_query when wait_for_completion is set to False isn't very clear, so I'm going to provide this in case it helps.
from elasticsearch import Elasticsearch
es = Elasticsearch()
response = es.delete_by_query(index=someIndex, body=someQuery, wait_for_completion=False)
# get task id
print(response['task'])
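From there you can poll the task with the tasks API. Note that tasks is an attribute on the client, so the calls are es.tasks.get(...) / es.tasks.list(...) rather than es.tasks(...). A rough sketch:
task_id = response['task']
status = es.tasks.get(task_id=task_id)
# 'completed' flips to True once the delete_by_query has finished
print(status['completed'])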
Hope that helps!

Nodes loading, but Elasticsearch has 0 shards

I was testing out a 20-node cluster with default replicas and default sharding, and recently wanted to rename the cluster from the default "elasticsearch". So I updated the cluster name in the config and additionally renamed the data from
mylocation/data/OldName
to
mylocation/data/NewName
Which of course contain:
nodes/0
nodes/1
etc...
About a month later, I'm loading up my cluster again, and I see that while all 20 nodes come back online, it says 0 active shards, 0 primary shards, etc., where there should be several thousand. Status is green, nothing is initializing, nothing looks amiss except I have no data. I look in nodes/0 and I see that nodes/0/indices/ is well populated with my index names: the data is actually on the disk. But it seems there's nothing I can do to get it to actually load the shards. The config is using the correct -Des.path.data=mylocation/data/.
What could be wrong and how can I debug it? I'm fairly confident I ran this for a week after loading it, but it was some time ago and perhaps other things have changed. It just oddly doesn't seem to recognize any of the data it's pointing at, and it isn't giving me any kind of "I don't see your data" or "cannot read or write that data" error message.
Update
After it gets started it says:
Recovered [0] indices into cluster_state.
I googled this and it sounded like version compatibility. Checked my binaries and this did not appear to be an issue. I'm using 1.3.2 on all.
Update 2
One of the 20 nodes repeatedly fails with
ElasticsearchIllegalStateException[failed to obtain node lock, is the following location writable?]
It lists the correct data dir, which is writable. Should I delete the node lock? Some node.locks are 664 and some are 640 when the cluster is off. Is this normal or possibly the relic of an unclean shutdown?
Are some of these replicates? I have 40 nodes, 20 are 640 and 20 are 664.
Update 3
There are write locks in place at the lucene level. So
data/NewName/nodes/1/indices/indexname/4/index/write.lock
exists. Is this why moving shards fails? Can I safely delete each of these write locks, or is there shared state in the _state file that would lead to inconsistency?

Google Calendar API "The requested minimum modification time lies too far in the past." after just one day

My code fetches calendar events using service.events().list() with the following parameters:
timeMax: 2015-11-13T04:12:44.263000Z
timeMin: 2014-05-17T04:12:44.263000Z
updatedMin: 2014-11-12T14:56:20.395000Z # = yesterday
I know there's a limit on the updatedMin param that prevents it from being too far in the past, but lately I get the following error even when updatedMin is yesterday:
The requested minimum modification time lies too far in the past.
Everywhere this error is mentioned, they are talking about a limit that is approx. 20 days in the past, certainly not one day.
Any ideas what is causing this error?
@Tzach, I tried the above query in the API Explorer with the same values and it returned the results without any error unless it's greater than 20 days. As Luc said, it's better to switch to syncTokens, which saves bandwidth.
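If you do switch to sync tokens, a rough sketch with the Python client (assuming service is an authorized Calendar API client; the field names follow the events.list reference):
# first full sync: page through all events and keep the nextSyncToken from the last page
request = service.events().list(calendarId='primary')
response = None
while request is not None:
    response = request.execute()
    # process response.get('items', []) as needed
    request = service.events().list_next(request, response)
sync_token = response['nextSyncToken']

# later, incremental sync: only events changed since the last sync are returned
changes = service.events().list(calendarId='primary', syncToken=sync_token).execute()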

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
1. No indexing at all: ~45 sec to post 2000 documents
2. Indexing included, commit after each add: ~8 minutes (!) to post and index 2000 documents
3. Indexing included, commitWithin 1ms: ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = ...; // get doc
SolrServer solrConnection = ...; // get connection
solrConnection.add(doc);
solrConnection.commit(); // explicit hard commit after every add
Where as the code for point 3:
SolrInputDocument doc = ...; // get doc
SolrServer solrConnection = ...; // get connection
solrConnection.add(doc, 1); // commitWithin 1 ms; per the API documentation, no explicit commit is needed after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
commitWithin is a soft commit by default. Soft commits are very efficient at making the added documents immediately searchable. But! They are not on the disk yet; the documents are committed into RAM only. In this setup you would use the updateLog to make your Solr instance crash-tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a single hard commit, or have your autoCommit set to some reasonable value, like 10 min or 1 hour (depending on your user expectations).
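For reference, an autoCommit block along those lines in solrconfig.xml might look like this (the 10-minute value just mirrors the suggestion above; openSearcher=false keeps the periodic hard commit from also reopening searchers):
<autoCommit>
  <maxTime>600000</maxTime>        <!-- hard commit at most every 10 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>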

How do I absolutely ensure that a Phusion Passenger instance stays alive?

I'm having a problem where no matter what I try all Passenger instances are destroyed after an idle period (5 minutes, but sometimes longer). I've read the Passenger docs and related questions/answers on Stack Overflow.
My global config looks like this:
PassengerMaxPoolSize 6
PassengerMinInstances 1
PassengerPoolIdleTime 300
And my virtual config:
PassengerMinInstances 1
The above should ensure that at least one instance is kept alive after the idle timeout. I'd like to avoid setting PassengerPoolIdleTime to 0 as I'd like to clean up all but one idle instance.
I've also added the ruby binary to my CSF ignore list to prevent the long running process from being culled.
Is there somewhere else I should be looking?
Have you tried setting PassengerMinInstances to something other than 1, like 3, to see whether that works?
OK, I found the answer for you at this link: http://groups.google.com/group/phusion-passenger/browse_thread/thread/7557f8ef0ff000df/62f5c42aa1fe5f7e . Look at the last comment by the Phusion guy.
"Is there a way to ensure that I always have 10 processes up and running, and that each process only serves 500 requests before being shut down?"
"Not at this time. But the current behavior is such that the next time it determines that more processes need to be spawned it will make sure at least PassengerMinInstances processes exist."
I have to say their documentation doesn't seem to match the current behavior.
This seems to be quite a common problem for people running Apache on WHM/cPanel:
http://techiezdesk.wordpress.com/2011/01/08/apache-graceful-restart-requested-every-two-hours/
Enabling piped logging sorted the problem out for me.
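For reference, piped logging in the Apache config looks roughly like this (the rotatelogs path and 24-hour rotation interval are illustrative; adjust to your setup):
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/access_log 86400" combined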
