I have a scenario where multiple requests hit the ES server concurrently, resulting in 409 version conflicts. So I followed the official Elasticsearch documentation and started using the retry_on_conflict query parameter.
What are the side effects of using this parameter?
Will there be any data loss when using it?
Will both documents get merged?
Any suggestions would be appreciated.
Official documentation:
retry_on_conflict
(Optional, integer) Specify how many times the operation should be retried when a conflict occurs. Default: 0.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
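For reference, here is a minimal sketch of how the parameter is used on an update call, written in Python with requests; the cluster URL, index name, document id, and field are assumptions:

import requests

# Minimal sketch: partial update with retry_on_conflict.
# The cluster URL, index name, and document id are assumptions.
resp = requests.post(
    "http://localhost:9200/my-index/_update/1",
    params={"retry_on_conflict": 3},  # retry the update up to 3 times on a 409
    json={"doc": {"counter": 42}},
)
resp.raise_for_status()
print(resp.json()["result"])  # e.g. "updated"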
My existing system has some SQL search procedures that return data based on some filters. Now, to improve our searches, we have decided to use Elasticsearch for all of them. We are currently building a prototype.
Below is what I have done so far:
Denormalize all the data from my RDBMS and store it in Elasticsearch using Logstash.
Query data from Elasticsearch based on the parameters, using the Elasticsearch SQL API.
The main problem is pagination. Elasticsearch SQL supports a fetch_size parameter and returns a cursor for the next set of records.
A cursor is fine if you want to move to the next page of results, but if a user wants to jump from page 10 to page 100, how can we achieve that?
I also searched for offset and skip support in Elasticsearch SQL but could not find any references.
Has anyone faced such an issue? I would appreciate any help or suggestions.
I tried to follow the link https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-pagination.html
{
  "query": "SELECT client_clientid, clientpolicy_policyname FROM client_paged_list GROUP BY client_clientid, clientpolicy_policyname",
  "fetch_size": 5
}
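For context, paging proceeds by posting the cursor back: each response carries a cursor token for the next page, so there is no way to jump ahead, and reaching page 100 means walking every page in between. A minimal sketch in Python with requests (the endpoint URL assumes a recent version; the query is the one above):

import requests

# Sketch: walk ES SQL pages via the cursor. There is no offset/skip,
# so going from page 10 to page 100 means fetching every page between.
url = "http://localhost:9200/_sql"
page = requests.post(url, params={"format": "json"}, json={
    "query": "SELECT client_clientid, clientpolicy_policyname FROM client_paged_list GROUP BY client_clientid, clientpolicy_policyname",
    "fetch_size": 5,
}).json()
while page.get("rows"):
    for row in page["rows"]:
        print(row)
    if "cursor" not in page:
        break
    # post the cursor back to fetch the next page
    page = requests.post(url, params={"format": "json"},
                         json={"cursor": page["cursor"]}).json()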
I am trying to search and fetch documents from Elasticsearch, but in some cases I am not getting the updated documents. By updated I mean that we update the documents in Elasticsearch periodically, at an interval of 30 seconds, and the number of documents can range from 10,000 to 100,000. I am aware that updates are generally a slow process in Elasticsearch.
I suspect this is happening because Elasticsearch accepted the documents but they were not yet available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents becoming available for search? Is there a setting in Elasticsearch that can log more information about this?
Is there a setting in Elasticsearch that enables logging whenever a merge operation happens?
Any other suggestions to help optimize the performance?
Thanks in advance for your help.
By default the refresh_interval parameter is set to 1 second, so unless you have changed this parameter, each update will be searchable after at most 1 second.
If you want to make the results searchable as soon as you have performed the update operation, you can use the refresh parameter.
With refresh=wait_for, the endpoint will respond once a refresh has occurred. If you use refresh=true, a refresh operation will be triggered immediately. Be careful with refresh=true if you have many updates, since it can impact performance.
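For illustration, a minimal sketch of an update that waits for the refresh, in Python with requests; index name, document id, and field are assumptions:

import requests

# Sketch: update that only returns once the change is searchable.
# Index name "my-index" and id "1" are assumptions.
requests.post(
    "http://localhost:9200/my-index/_update/1",
    params={"refresh": "wait_for"},  # block until the next refresh
    json={"doc": {"status": "processed"}},
)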
I have used the Elasticsearch High Level REST Client to search an index and process the results, using the following code:
restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
However, the REST client uses GET to query the data, and I want to send this as a POST request to Elasticsearch. Any help on this would be highly appreciated.
After discussion (see comments), there was no need to force the High Level REST Client to use POST instead of GET, as behind the scenes it issues a GET with a body.
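For reference, the _search endpoint itself accepts POST with a body just as well as GET. A quick sketch in Python with requests (the index name is an assumption) shows the equivalent call outside the Java client:

import requests

# The _search endpoint accepts POST with a request body; "my-index" is assumed.
resp = requests.post(
    "http://localhost:9200/my-index/_search",
    json={"query": {"match_all": {}}},
)
print(resp.json()["hits"]["total"])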
We are using Elasticsearch version 5.4.1 in our production environments. The cluster setup is 3 data, 3 query, and 3 master nodes. Of late we have been observing a lot of slow queries on one particular data node, and the [index][shard] copies present on it are just replicas.
I don't see many deleted docs or memory issues that could directly cause the slowness.
Any pointers on how to go about the investigation here would be helpful.
Thanks!
Many things happen during one ES query. First, check the took field returned by Elasticsearch.
took – time in milliseconds for Elasticsearch to execute the search
However, the took field is only the time it took ES to process the query on its side. It doesn't include:
serializing the request into JSON on the client
sending the request over the network
deserializing the request from JSON on the server
serializing the response into JSON on the server
sending the response over the network
deserializing the response from JSON on the client
As such, I think you should try to identify the exact step that is slow.
Reference: Query timing: ‘took’ value and what I’m measuring
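One way to do that is to time the full round trip on the client and compare it with took. A minimal sketch in Python with requests (the index name is an assumption):

import time
import requests

# Sketch: compare end-to-end latency with the server-side "took".
# A large gap points at serialization or the network rather than ES itself.
start = time.monotonic()
resp = requests.post(
    "http://localhost:9200/my-index/_search",  # index name is an assumption
    json={"query": {"match_all": {}}},
)
elapsed_ms = (time.monotonic() - start) * 1000
print(f"client round-trip: {elapsed_ms:.0f} ms, server took: {resp.json()['took']} ms")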
I will be getting documents from a filtered query (quite a lot of documents). I will then immediately create an index from them (in Python, using requests to directly query the REST API), without any modification.
Is it possible to make this operation directly on the server, without the round-trip of data to the script and back?
Another question was similar in intent, and the only answer was to go via Logstash (equivalent to using my own code, though possibly more efficient).
Refer to http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
In short, what you need to do is:
1. Ensure you have _source enabled.
2. Use the scan and scroll API, passing your filtered query with search type scan.
3. Fetch the documents using the scroll ID.
4. Bulk index the results using the _source field, which returns the JSON that was used to index the data (a sketch follows the references below).
Refer:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
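A rough sketch of those steps in Python with requests (note the scan search type was removed in later versions; sorting on _doc is the modern equivalent; index names, query, and batch size are assumptions):

import json
import requests

ES = "http://localhost:9200"

# Sketch: scroll over the filtered query, then bulk index each batch's
# _source into the new index. Index names and the query are assumptions.
page = requests.post(
    f"{ES}/source-index/_search",
    params={"scroll": "2m"},
    json={"size": 500, "sort": ["_doc"], "query": {"match_all": {}}},
).json()

while page["hits"]["hits"]:
    # turn each hit's _source into a bulk "index" action for the new index
    bulk = ""
    for hit in page["hits"]["hits"]:
        bulk += json.dumps({"index": {"_index": "new-index", "_id": hit["_id"]}}) + "\n"
        bulk += json.dumps(hit["_source"]) + "\n"
    requests.post(f"{ES}/_bulk", data=bulk,
                  headers={"Content-Type": "application/x-ndjson"})
    # use the scroll id to fetch the next batch
    page = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "2m", "scroll_id": page["_scroll_id"]}).json()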
Elasticsearch 2.3 has an experimental _reindex feature that allows reindexing from a query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
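A minimal sketch of that API in Python with requests (the index names and the term filter are assumptions):

import requests

# Sketch: server-side reindex of only the documents matching a query.
# Source/dest index names and the term filter are assumptions.
resp = requests.post("http://localhost:9200/_reindex", json={
    "source": {
        "index": "source-index",
        "query": {"term": {"status": "published"}},
    },
    "dest": {"index": "new-index"},
})
print(resp.json())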