I have to use sorting along with the from and size parameters in an Elasticsearch query.
I am querying Elasticsearch for records 0 to 100, then 101 to 200, then 201 to 300, and I have to sort the entire set by the salary field.
Will from/size support sorting across the whole set?
Will it sort the whole set and take 100 records at a time?
Thanks,
The sort happens first, and then from/size takes records 100-200 AFTER the whole set has been sorted.
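For illustration, a minimal sketch of that kind of paginated, sorted request (the index name, the descending order, and the use of Node's built-in fetch are assumptions; only the salary field comes from the question):

```typescript
// Sketch: fetch one page at a time from a hypothetical "employees" index.
// Elasticsearch sorts the full matching set first, then applies from/size to it,
// so every page comes from the same salary-sorted ordering.
// (Assumes Node 18+, ESM, and a local node on port 9200.)
async function fetchPage(from: number, size: number): Promise<unknown[]> {
  const res = await fetch("http://localhost:9200/employees/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: { match_all: {} },
      sort: [{ salary: { order: "desc" } }], // applied to the whole result set
      from, // offset into the sorted set
      size, // page size
    }),
  });
  const data = await res.json();
  return data.hits.hits;
}

// Records 0-99, 100-199 and 200-299 of the same sorted ordering.
const pages = await Promise.all([fetchPage(0, 100), fetchPage(100, 100), fetchPage(200, 100)]);
console.log(pages.flat().length); // 300 records in total
```

Note that deep pagination with large from values gets increasingly expensive; for paging far into a sorted set, search_after is the usual recommendation.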
We have over 100 million records stored in Elasticsearch.
The dataset is far too large to be fully loaded into our service's memory.
Each record has a field called amount. The search needs to find a set of records (sometimes over 10 thousand of them) whose amounts sum to a value equal or close to an input value.
Below is our current solution:
We merge the 100 million records into 4000 buckets using ES's bucket aggregations. Each bucket's amount is the sum of the amounts of the records it contains.
We load the 4000 buckets into our service and then solve the problem described above using the 4000 buckets.
The obvious disadvantage is the lack of accuracy. The difference between the sum of results we find and the input target is sometimes quite large.
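For concreteness, a sketch of what such a bucketing step could look like as a single aggregation request (the index name, the choice of a histogram aggregation, and the interval are assumptions; only the amount field and the ~4000-bucket target come from the description above):

```typescript
// Sketch: collapse the ~100M records into coarse buckets on "amount", summing the
// amounts per bucket, so only the buckets (not the records) are loaded into memory.
const res = await fetch("http://localhost:9200/transactions/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    size: 0, // we only want the aggregation, not individual hits
    aggs: {
      amount_buckets: {
        histogram: { field: "amount", interval: 1000 }, // tune the interval to land near 4000 buckets
        aggs: {
          bucket_sum: { sum: { field: "amount" } }, // total amount per bucket
        },
      },
    },
  }),
});
const body = await res.json();
// Each bucket looks like { key, doc_count, bucket_sum: { value } }.
const buckets = body.aggregations.amount_buckets.buckets;
console.log(buckets.length);
```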
We are three young engineers without much experience, and we need some guidance.
I am in real pain with Elastic App Search; the bulk indexing limit is 100 documents per request according to the docs:
https://www.elastic.co/guide/en/app-search/current/limits.html
I tried creating all the promises and then doing Promise.all(allPromises), but it fails to index everything. Even when it fails, the response still returns 200, and you have to loop over
res.data (the array of 100 documents) and check whether each one has an error field.
Is there any way to index a lot of documents fast? Indexing 1 million documents with a loop that awaits between every 100-document batch is extremely slow.
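For reference, a sketch of the per-batch request and error check being described (host, port, engine name, and API key are placeholders; the response shape, an array of entries each carrying an errors field, follows my reading of the App Search documents API):

```typescript
// Sketch: index one batch of up to 100 documents and collect per-document errors,
// since the HTTP status is 200 even when individual documents fail to index.
async function indexBatch(docs: object[]): Promise<string[]> {
  const res = await fetch(
    "http://localhost:3002/api/as/v1/engines/my-engine/documents", // placeholder host/engine
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer private-xxxxxxxx", // placeholder private API key
      },
      body: JSON.stringify(docs), // at most 100 documents per request
    }
  );
  // One entry per submitted document; a non-empty errors array means that document failed.
  const results: { id: string; errors: string[] }[] = await res.json();
  return results.filter((r) => r.errors.length > 0).map((r) => r.id); // ids of failed docs
}
```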
Unfortunately, the limit of 100 makes it a slow operation. We are indexing 1.1 million documents, and our solution was to slice that into ten segments and run ten processes in parallel to decrease the time it takes.
Since reading from the index is very quick, we have a separate job that validates that the information in App Search matches our source data and flags anything that is out of order. So I don't check for errors on a full import or update, which lets that part go as fast as possible. We only see around a 1% failure rate on bulk imports.
I should note that part of the reason for the second piece is we have on occasion found the search index can get out of alignment with the documents and fields we have, so validation seemed like a good idea.
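In outline, the slicing described above looks something like this (a sketch reusing the hypothetical indexBatch from earlier; the batch size of 100 is the documented limit, and the original setup uses ten separate OS processes, which this approximates with ten in-process workers):

```typescript
// Sketch: split the full document list into 100-document batches and spread the
// batches over a fixed number of concurrent workers instead of awaiting each
// batch one by one.
async function indexAll(allDocs: object[], concurrency = 10): Promise<void> {
  const batches: object[][] = [];
  for (let i = 0; i < allDocs.length; i += 100) {
    batches.push(allDocs.slice(i, i + 100));
  }
  const workers = Array.from({ length: concurrency }, (_, w) =>
    (async () => {
      // Worker w handles batches w, w + concurrency, w + 2*concurrency, ...
      for (let b = w; b < batches.length; b += concurrency) {
        await indexBatch(batches[b]); // failures are reconciled later by the validation job
      }
    })()
  );
  await Promise.all(workers);
}
```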
Following is the behaviour:
1. Each record is generated for an order.
2. Each record has some common data that is the same across all the records for an order.
3. The number of records for an order may range from 1k to 500k.
There are two ways:
either merge all the records for an order into a single document and index that in ES, or index one document per record in ES.
What should the approach be?
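To make the two options concrete, a rough sketch of the two document shapes (all field names here are invented for illustration):

```typescript
// Option A (sketch): one merged document per order, with all its records inside it.
// A single document may end up holding 1k-500k record entries.
const mergedOrderDoc = {
  order_id: "ORD-1",
  customer: "ACME", // common order data stored once
  records: [
    { record_id: "R-1", amount: 10 },
    { record_id: "R-2", amount: 25 },
    // ... potentially hundreds of thousands of entries
  ],
};

// Option B (sketch): one document per record, with the common order data
// denormalized onto every record document.
const perRecordDoc = {
  order_id: "ORD-1",
  customer: "ACME", // common order data repeated on each document
  record_id: "R-1",
  amount: 10,
};
```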
Is there a way to improve memory performance when using an elasticsearch percolator index?
I have created a separate index for my percolator. I have roughly 1 000 000 user-created saved searches (for email alerts). After creating this percolator index, my heap usage spiked to 100% and the server became unresponsive to any queries. I have somewhat limited resources and am not able to simply throw more RAM at the problem. The only solution was to delete the index containing my saved searches.
From what I have read the percolator index resides in-memory permanently. Is this entirely necessary? Is there a way to throttle this behaviour but still preserve the functionality? Is there a way to optimize my data/queries/index structure to circumvent this behaviour while still achieving the desired result?
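For context, on a recent Elasticsearch version the kind of setup being described looks roughly like this (index, field, and query contents are all invented; older releases, where percolator queries were registered under a special .percolator type and held in memory, used a different API):

```typescript
// Sketch: a dedicated index where each saved search is stored as a document with a
// "query" field of type percolator, then an incoming document is percolated against it.
// (Assumes Node 18+ fetch and a local node; all names are placeholders.)
const base = "http://localhost:9200";
const headers = { "Content-Type": "application/json" };

// 1. Create the percolator index, mapping both the query field and the document
//    fields that the saved searches will reference.
await fetch(`${base}/alert-queries`, {
  method: "PUT",
  headers,
  body: JSON.stringify({
    mappings: {
      properties: {
        query: { type: "percolator" },
        title: { type: "text" },
        location: { type: "keyword" },
      },
    },
  }),
});

// 2. Register one user's saved search (an email alert) as a stored query.
await fetch(`${base}/alert-queries/_doc/user-42-alert-1`, {
  method: "PUT",
  headers,
  body: JSON.stringify({
    query: {
      bool: {
        must: [{ match: { title: "java developer" } }],
        filter: [{ term: { location: "london" } }],
      },
    },
  }),
});

// 3. Percolate a new document to find which saved searches it would match.
const percolateRes = await fetch(`${base}/alert-queries/_search`, {
  method: "POST",
  headers,
  body: JSON.stringify({
    query: {
      percolate: {
        field: "query",
        document: { title: "Senior Java Developer", location: "london" },
      },
    },
  }),
});
console.log((await percolateRes.json()).hits.hits.map((h: any) => h._id));
```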
There is no resolution to this issue from an Elasticsearch point of view, nor is one likely. I have chatted with the Elasticsearch guys directly and their answer is: "throw more hardware at it".
I have, however, found a way to mitigate my usage of this feature. When I analyzed my saved-search data, I discovered that my searches consisted of around 100 000 unique keyword searches combined with various filter permutations, creating over 1 000 000 saved searches.
If I look at the filters, they are things like:
Location - 300+
Industry - 50+
etc...
Giving a solution space of:
100 000 * >300 * >50 * ... ~= > 1 500 000 000
However, if I decompose the searches and index the keyword searches and filters separately in the percolator index,
I end up with far fewer searches:
100 000 + >300 + >50 + ... ~= > 100 350
And those searches themselves are smaller and less complicated than the original searches.
Now I create a second (non-percolator) index listing all 1 000 000 saved searches and including the ids of the search components from the percolator index.
Then I percolate a document and do a second query filtering the searches against the keyword and filter percolator results.
I'm even able to preserve the relevance score as this is returned purely from the keyword searches.
This approach will significantly reduce my percolator index memory footprint while serving the same purpose.
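In outline, the two-step lookup would look something like the sketch below (untested; the index and field names are placeholders, and the terms_set query is just one way to require that all of a search's filter components matched):

```typescript
// Untested sketch of the decomposed lookup. The percolator index ("search-components")
// holds only the ~100k keyword queries and the individual filter queries; an ordinary
// index ("saved-searches") lists each of the 1M saved searches together with the ids
// of the components it is built from and how many filter components it has.
const headers = { "Content-Type": "application/json" };
const incomingDoc = { title: "java developer", location: "london" }; // placeholder document

// Step 1: percolate the document to find which components match.
// Keyword components keep their relevance scores here.
const compRes = await fetch("http://localhost:9200/search-components/_search", {
  method: "POST",
  headers,
  body: JSON.stringify({
    query: { percolate: { field: "query", document: incomingDoc } },
  }),
});
const matchedIds: string[] = (await compRes.json()).hits.hits.map((h: any) => h._id);

// Step 2: find saved searches whose keyword component matched and whose filter
// components all matched (terms_set compares the number of matching ids against
// the stored filter_component_count of each saved search).
const searchRes = await fetch("http://localhost:9200/saved-searches/_search", {
  method: "POST",
  headers,
  body: JSON.stringify({
    query: {
      bool: {
        filter: [
          { terms: { keyword_component_id: matchedIds } },
          {
            terms_set: {
              filter_component_ids: {
                terms: matchedIds,
                minimum_should_match_field: "filter_component_count",
              },
            },
          },
        ],
      },
    },
  }),
});
console.log((await searchRes.json()).hits.total);
```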
I would like to invite feedback on this approach (I haven't tried it yet but I will keep you posted).
Likewise if my approach is successful do you think it is worth a feature request?
I have an index with some 10 million records.
When I try to find the distinct values in one field (around 2 million of them), my Java application runs out of memory.
Can I implement scan and scroll on this aggregation to retrieve the same data in smaller parts?
Thanks
Check how much RAM you have allocated for Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success using the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big. A cardinality aggregation just counts the number of distinct values without returning the individual values.
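To make the distinction concrete, here are the two aggregations side by side (index and field names are assumptions):

```typescript
// Sketch: cardinality vs. terms aggregation on the same field. The cardinality
// aggregation returns only an (approximate) count of distinct values, while the
// terms aggregation returns the values themselves and would be huge at ~2M distincts.
const res = await fetch("http://localhost:9200/my-index/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    size: 0, // aggregations only, no hits
    aggs: {
      distinct_count: { cardinality: { field: "my_field" } }, // just a number
      top_values: { terms: { field: "my_field", size: 100 } }, // only the top 100 values
    },
  }),
});
const aggs = (await res.json()).aggregations;
console.log(aggs.distinct_count.value, aggs.top_values.buckets.length);
```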
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as a few smaller ones.