Loading all documents from ElasticSearch takes too long - ruby

In order to load all the documents indexed by Elasticsearch, I am using the following query through Tire.
def all
  max = total
  Tire.search 'my_documents' do
    query { all }
    size max
  end.results.map { |entry| entry.to_hash }
end
Here max (i.e. total) is the result of a count query returning the number of documents present. I have indexed about 10,000 documents, and currently the request takes too long.
I am aware that I should not query all documents like this. What is the best alternative here? Pagination? If so, on which metric would I base the number of documents per page?
I am also planning to grow the number of documents to 100,000 or even 1,000,000, and I don't yet see how this can scale.
I appreciate every comment.
Rationale: I am doing this because I run calculations over these data. Hence, I need all the data, run the computations, and save the results back into the documents.

Have a look at the scroll API, which is highly optimized for fetching a large number of results. It uses the scan search type and doesn't support sorting, but it lets you provide a query to filter the documents you want to fetch. Have a look at the reference documentation to learn more. Remember that the size you define in the request is per shard; that means that if you have 5 primary shards, setting a size of 10 would return 50 results per request.
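For illustration, a minimal sketch of that scan-and-scroll loop using the Java transport client of the same era (ES 1.x/2.x); the index name comes from the question, while the batch size and timeout are arbitrary choices, and the same pattern is available from Ruby clients:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

static void processAllDocuments(Client client) {
    // Open a scan search: no scoring or sorting, hits are streamed in batches.
    SearchResponse scrollResp = client.prepareSearch("my_documents")
            .setSearchType(SearchType.SCAN)
            .setScroll(new TimeValue(60000))   // keep the scroll context alive for 60s
            .setQuery(QueryBuilders.matchAllQuery())
            .setSize(100)                      // per shard: with 5 shards, up to 500 hits per batch
            .execute().actionGet();

    while (true) {
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute().actionGet();
        if (scrollResp.getHits().getHits().length == 0) {
            break; // no hits left: the scroll is exhausted
        }
        for (SearchHit hit : scrollResp.getHits()) {
            // run the computation over hit.getSourceAsMap() and write results back
        }
    }
}

Because the scroll works from a snapshot of the index, this also stays consistent while the computation writes results back into the documents.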

Related

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. I am hoping to do this on the database side. It seemed theoretically possible, since ES has to cache data related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size parameter and have millions of matching docs in your query, ES query performance will be very bad, and firing such queries frequently might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that is invalidated very frequently, you will not get the required performance benefits, so benchmark it first.
You are already on the correct path with the scroll API for iterating over millions of search results; the points below can improve things further (see the sketch after this list).
First get the count of search results; it is included in the default search response (as an exact or greater-or-equal value) and tells you how many results you have, so you can choose the size parameter for subsequent calls to check whether your ID is present.
Make sure you effectively utilize the filter context in your query, which is cached by ES by default.
Benchmark some of your heavy scroll API calls against your own data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
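One way to apply the filter-context point to the original question, sketched with the Java client, is to intersect the known set of IDs with the larger query in a single request instead of scrolling through all matches; the index name and the "larger query" criterion here are hypothetical, and this is an alternative to the count-then-scroll approach described above:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

static SearchResponse findMatchingIds(Client client, String[] candidateIds) {
    BoolQueryBuilder query = QueryBuilders.boolQuery()
            // the "larger query" goes in filter context, so it is unscored and cacheable
            .filter(QueryBuilders.termQuery("status", "published")) // hypothetical criterion
            // restrict to the IDs we care about
            .filter(QueryBuilders.idsQuery().addIds(candidateIds));
    return client.prepareSearch("my_index")    // hypothetical index name
            .setQuery(query)
            .setSize(candidateIds.length)      // at most this many documents can match
            .setFetchSource(false)             // we only need the IDs back
            .execute().actionGet();
}

Any ID that comes back in the hits is part of the larger query; any ID that is absent is not.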

ES returns only 10 records by default. How to get all records without using scroll API

When we query ES for records, it returns 10 records by default. How can I get all the records in the same query without using any scroll API?
There is an option to specify the size, but the size is not known in advance.
You can retrieve up to 10k results in one request (by setting "size": 10000). If you have fewer than 10k matching documents, you can paginate over them using a combination of the from/size parameters. If there are more, you will have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To be able to paginate over an unknown number of documents, you will have to get the total count from the first query's response.
Note that if there are concurrent changes to the data, the results of paginated retrieval may not be consistent (for example, if a document gets inserted or deleted while you are paginating). A scroll is consistent, since it is created from a "snapshot" of the index taken at query start time.
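A minimal sketch of that pagination with the Java client, assuming a hypothetical index name: the total is read from the first response, and the loop stops before from + size would exceed the default 10,000-result window.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

static void paginateAll(Client client) {
    final int pageSize = 1000;
    int from = 0;
    long total = Long.MAX_VALUE;  // replaced by the real count after the first page
    while (from < total && from + pageSize <= 10000) { // index.max_result_window default
        SearchResponse page = client.prepareSearch("my_index") // hypothetical index name
                .setQuery(QueryBuilders.matchAllQuery())
                .setFrom(from)
                .setSize(pageSize)
                .execute().actionGet();
        total = page.getHits().getTotalHits(); // total number of matching documents
        for (SearchHit hit : page.getHits()) {
            // process each hit here
        }
        from += pageSize;
    }
}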

Elasticsearch: optimising index storage and request response time

In a Kafka server I have N types of messages, one for each IoT application. I want to store these messages in Elasticsearch in different indexes. Do you know the best approach for this use case in order to get the lowest response time for requests concerning each message type?
Furthermore, it is advised to create an index per day, like this: "messageType-%{+YYYY.MM.dd}". Is this suitable for my use case?
Finally, concerning that scheme, if I have a request with a time range, for instance from 2016.06.01 to 2016.07.04, does Elasticsearch search directly in the indexes "messageType-2016.06.01", "messageType-2016.06.02", ..., "messageType-2016.07.04"?
Thanks in advance,
J
If you plan to purge docs after a certain time, creating indexes based on time is a good idea, because you can simply drop whole indexes after that time.
You can search against all indexes, or, preferably, you should specify the indexes you want to search against.
For example, you could search against /index1,index2/_search, where you determine index1 and index2 from the query's time range, or you can just hit /_search, which will search all indexes (slower).
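As a sketch with the Java client, one way to expand a time range into the concrete daily indexes it covers and search only those; the index-name pattern and date format come from the question, while the helper itself is an assumption:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

static SearchResponse searchRange(Client client, String messageType,
                                  LocalDate fromDate, LocalDate toDate) {
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy.MM.dd");
    // Expand the range into one index name per day, e.g. messageType-2016.06.01
    List<String> indexes = new ArrayList<>();
    for (LocalDate d = fromDate; !d.isAfter(toDate); d = d.plusDays(1)) {
        indexes.add(messageType + "-" + d.format(fmt));
    }
    // Search only the relevant daily indexes instead of hitting /_search
    return client.prepareSearch(indexes.toArray(new String[0]))
            .setQuery(QueryBuilders.matchAllQuery()) // plus your actual criteria
            .execute().actionGet();
}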

Elasticsearch caching a single field for quick response

I have a cluster of 10 nodes where I index about 100 million records daily, close to 6 billion records in total, and I am constantly loading data. Each record has about 75 fields associated with it. 99% of my queries use the same field; essentially select * from table where groupid = 'value'. The majority of these queries return about a hundred records.
My queries currently take about 30 seconds to run the first 2 times and then return in milliseconds. The problem is that every user searches for a different groupID, so their queries are going to be slow for the most part until they run them for the third time.
Is it possible to "cache" the groupid field so that I can get sub-second queries?
My current query looks like this (pseudo-query; I'm using a non-analyzed field, which I believe is better?):
query : {
  filtered : {
    filter : {
      "term" : { groupID : "valuex" }
    }
  }
}
I"ve researched and not sure how to go about this. I've looked into doc_values = yes and possibly field cache?
I do not care about scoring, aggregates. My only use case is to filter out records and only bringing back the 100 or so out of 5 billion that have the correct groupID.
We have about 64G Memory on each server.
Just looking for help on how to achieve optimal performance/caching? or anything else that would help.
I thought about routing but this would be difficult based on our groupid values.
thanks
Starting from Elasticsearch 2.0 we made some caching changes; the query cache now:
keeps track of the 256 most recently used queries
only caches those that appear 5 times or more
does not cache segments which have fewer than 10,000 documents or less than 3% of the documents of the index
I am wondering if you are hitting that last one.
Note that we did this because the file system cache is probably better than internal caching.
Could you try a bool query instead of a filtered query, by the way? filtered has been deprecated (and is removed in 5.0). See how it performs.
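For reference, the pseudo-query above rewritten as a bool query, sketched with the Java client; the term clause stays in filter context, so it remains unscored and cacheable:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

class GroupFilter {
    // Equivalent of filtered { filter { term } }: a bool query with a filter clause.
    static BoolQueryBuilder byGroup(String groupId) {
        return QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("groupID", groupId));
    }
}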

Elasticsearch query to return all records programmatically throws out of memory exception

I need to retrieve all records from Elasticsearch and do statistical analysis on the data. The number of records is not that high: 500,000. Each record has 7 columns, 5 of which are of type String (single-word values). So the size of the data does not seem big to me at all. I am getting an 'out of memory' exception when executing the following:
SearchResponse response = client.prepareSearch(indexFrom).setTypes(typeFrom)
        .setQuery(matchAllQuery())
        .setSize(SIZE)
        .execute().actionGet();
with SIZE = 500000.
Any help/suggestions?
I am setting Xmx10g.
Thanks.
-Vera
If you just need to retrieve all documents, unsorted, like this, you should use a scan and scroll search.
To sum up, it combines the use of:
a search of type scan, which disables sorting of results (thus saving some memory)
the scroll API, which is quite similar to a DB cursor, in that it lets you step through results in small batches.
I think it could solve your memory problem.
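Applied to the snippet from the question, the change is a sketch like this (same scan-and-scroll pattern as in the first answer; the 500-per-shard batch size and 60-second timeout are arbitrary choices):

import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;

SearchResponse response = client.prepareSearch(indexFrom).setTypes(typeFrom)
        .setSearchType(SearchType.SCAN)   // no sorting or scoring
        .setScroll(new TimeValue(60000))  // keep the cursor alive between batches
        .setQuery(matchAllQuery())
        .setSize(500)                     // a small batch per shard instead of 500000
        .execute().actionGet();

// Then loop: response = client.prepareSearchScroll(response.getScrollId())
//         .setScroll(new TimeValue(60000)).execute().actionGet();
// and process each batch of hits until one comes back empty.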
