ElasticSearch terms and cardinality performance with high cardinality fields

TL;DR
My ElasticSearch query takes forever compared to the same query on SQL Server.
Am I doing something wrong? Is there any way to boost my query's performance?
Is it just one of those things an RDBMS does better than NoSQL?
Premise
Let's say I have a business that takes orders and delivers requested items.
I would like to know the average amount of unique items per order.
My orders data is arranged per item ordered - each order has one or more records containing Order ID, Item ID and more.
I have a one-node setup for development purposes
The results (performance-wise) are the same whether I have 4 GB heap space (on a 12 GB machine) or 16 GB heap space (on a 32 GB machine)
The index has billions of records, but the query filters it to about 300,000 records
The Order and Item IDs are of type keyword (textual by nature) and I have no way to change that.
In this particular case, the average unique item count is 1.65 - many orders contain only one unique item, others contain 2, and a few contain up to 25 unique items.
The Problem
Using ElasticSearch, I would have to use a Terms aggregation to group documents by order ID, a Cardinality aggregation to get the unique item count, and an Average Bucket aggregation to get the average item count per order.
This takes about 23 seconds on both my setups. The same query takes less than 2 seconds on the same dataset in SQL Server.
Additional Information
ElasticSearch Query
{
"size":0,
"query":{
"bool":{
"filter":[
{
...
}
]
}
},
"aggs":{
"OrdersBucket":{
"terms":{
"field":"orderID",
"execution_hint":"global_ordinals_hash",
"size":10000000
},
"aggs":{
"UniqueItems":{
"cardinality":{
"field":"itemID"
}
}
}
},
"AverageItemCount":{
"avg_bucket":{
"buckets_path":"OrdersBucket>UniqueItems"
}
}
}
}
At first my query threw an OutOfMemoryException which brought my server down.
Issuing the same request on my higher-RAM setup triggered the following circuit breaker exception:
[request] Data too large, data for [<reused_arrays>] would be
[14383258184/13.3gb], which is larger than the limit of
[10287002419/9.5gb]
The Elasticsearch GitHub repository has several (currently) open issues on this matter:
Cardinality aggregation should not reserve a fixed amount of memory per bucket #15892
global_ordinals execution mode for the terms aggregation has an adversarially impact on children aggregations that expect dense buckets #24788
Heap Explosion on even small cardinality queries in ES 5.3.1 / Kibana 5.3.1 #24359
All of which led me to use the execution hint "global_ordinals_hash", which allowed the query to complete successfully (albeit taking its time...)
Analogous SQL Query
SELECT AVG(CAST(uniqueCount.amount AS FLOAT)) FROM
( SELECT o.OrderID, COUNT(DISTINCT o.ItemID) AS amount
FROM Orders o
WHERE ...
GROUP BY o.OrderID
) uniqueCount
And this, as I've said, is very very fast.
orderID field mapping
{
"orderID":{
"full_name":"orderID",
"mapping":{
"orderID":{
"type":"keyword",
"boost":1,
"index":true,
"store":false,
"doc_values":true,
"term_vector":"no",
"norms":false,
"index_options":"docs",
"eager_global_ordinals":true,
"similarity":"BM25",
"fields":{
"autocomplete":{
"type":"text",
"boost":1,
"index":true,
"store":false,
"doc_values":false,
"term_vector":"no",
"norms":true,
"index_options":"positions",
"eager_global_ordinals":false,
"similarity":"BM25",
"analyzer":"autocomplete",
"search_analyzer":"standard",
"search_quote_analyzer":"standard",
"include_in_all":true,
"position_increment_gap":-1,
"fielddata":false
}
},
"null_value":null,
"include_in_all":true,
"ignore_above":2147483647,
"normalizer":null
}
}
}
}
I set eager_global_ordinals in an attempt to boost performance, but to no avail.
Sample Document
{
"_index": "81cec0acbca6423aa3c2feed5dbccd98",
"_type": "order",
"_id": "AVwpLZ7GK9DJVcpvrzss",
"_score": 0,
"_source": {
...
"orderID": "904044A",
"itemID": "23KN",
...
}
}
Irrelevant and undisclosable fields removed for brevity.
Sample Output
{
"OrdersBucket":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":"910117A",
"doc_count":16,
"UniqueItems":{
"value":16
}
},
{
"key":"910966A",
"doc_count":16,
"UniqueItems":{
"value":16
}
},
...
{
"key":"912815A",
"doc_count":1,
"UniqueItems":{
"value":1
}
},
{
"key":"912816A",
"doc_count":1,
"UniqueItems":{
"value":1
}
}
]
},
"AverageItemCount":{
"value":1.3975020363833832
}
}
Any help will be much appreciated :)

Update
Apparently SQL Server does a good job of caching those results.
Further investigation showed the initial (uncached) query takes about the same time as ElasticSearch.
I will look into why those results are not cached properly by ElasticSearch.
I also managed to convert the order ID to an integer, which dramatically enhanced performance (though SQL Server gained just as much).
Also, as advised by Mark Harwood on the Elastic Forum, specifying precision_threshold on the cardinality aggregation lowered memory consumption a lot!
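For reference, this is where precision_threshold would sit in the query above; the value of 1000 is only an illustration, not the value Mark suggested:
"aggs": {
  "UniqueItems": {
    "cardinality": {
      "field": "itemID",
      "precision_threshold": 1000
    }
  }
}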
So the answer is that for this particular kind of query, ES performs at least as well as SQL Server.

Related

Solution for runtime price calc for price sorting problem

We use the Nuxt.js framework for our frontend and built an API with Elasticsearch for searching hotels/accommodations.
The search makes two API calls: the first for availability and the second for price. The price we fetch is the per-night price; the total price is then calculated at runtime on the client:
2 nights searched: total_price = 2 * night_price
This works OK, but we cannot sort on total_price because it is a runtime value.
Any ideas how to solve this issue?
Our idea is to store ALL possible combinations a user could search for in ES, but that would be 100+ million documents.
remco
Did you try using Runtime Fields? The benefits are:
saving storage costs and increasing ingestion speed
immediately use it in search requests, aggregations, filtering, and sorting.
doesn’t increase the index size
So, you can define a field at search time:
GET my-index-000001/_search
{
  "runtime_mappings": {
    "total_price": {
      "type": "double",
      "script": {
        "source": "emit(doc['night_price'].value * params['multiplier'])",
        "params": {
          "multiplier": 2
        }
      }
    }
  },
  "sort": [
    {
      "total_price": {
        "order": "desc"
      }
    }
  ]
}
When sending the query, you set the value of the multiplier parameter (here, the number of nights searched). Note that the runtime field needs a numeric type so it can be sorted on, and the params block sits inside the script object.

Elasticsearch: Get sort index of the record

Let me describe my scenario with the real example.
I have a page where I need to show the list of companies sorted by a field "overallRank" and with a few filters (like companyType and employeeSize).
Now, it's easy to get the results from the ES index for the filter and then sort them by overallRank. But I also want to know the rank of the company among all the company data, not only in the filtered result.
For example, Amazon is the 3rd company in the US with companyType=Private, but the 5th company in the US if we remove the companyType filter. While showing results filtered by companyType, I want to know this overall ranking (i.e. 5th). Is it possible to include this field in the result somehow?
What I am currently doing is first getting the filtered result by companyType and location US, then getting the result sorted by location only. This second query gives the result by overall ranking in the location (where Amazon comes in 5th place). I then iterate over the first result and look up where each company sits in the second result to determine its overall ranking.
The problem with this approach is that the second query, used to determine the overall ranking across all company data, is very expensive because it has to retrieve around 60k results. With a batch size of 1000, it takes around 60 round trips to ES to get all the results into memory. It is both time- and space-consuming.
Can somebody please suggest a better way of doing this?
I think you can solve it using a filter aggregation combined with a top_hits aggregation.
As an example you can do something like:
{
"aggs": {
"filtered_companies_by_us": {
"filter": {
"term": {
"location": "US"
}
},
"aggs": {
"top_companies": {
"top_hits": {
"sort": [
{
"overallRank": {
"order": "desc"
}
}
],
"size": 5
}
}
}
}
}
}

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID COUNT
userid1 4
userid22 3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query will count each id and filter out the ids that occur fewer than 2 times, so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
"query":{
"match_all":{}
},
"aggs":{
"user_id":{
"terms":{
"field":"user_id",
"size":100000,
"min_doc_count":2
}
}
}
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts:
First, you need to know the maximum size to use for the aggregation.
Find the number of distinct userids via a cardinality aggregation:
GET index/type/_search
{
"size": 0,
"aggs": {
"maximum_match_counts": {
"cardinality": {
"field": "userid",
"precision_threshold": 100
}
}
}
}
Take the value of the maximum_match_counts aggregation.
Now you can get all duplicate userids:
GET index/type/_search
{
"size": 0,
"aggs": {
"userIds": {
"terms": {
"field": "userid",
"size": maximum_match_counts,
"min_doc_count": 2
}
}
}
}
If you go with the terms aggregation (Bharat's suggestion) and set the aggregation size to more than 10K, you will get a warning that this approach will throw an error in future releases.
Instead of using the terms aggregation you should go with the composite aggregation, which scans all of your documents via pagination (the after_key method), as sketched below.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
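A rough sketch of that approach, assuming the same index and userid field as above (the page size of 1000 is arbitrary; note the composite aggregation requires Elasticsearch 6.1 or later):
GET index/_search
{
  "size": 0,
  "aggs": {
    "userIds": {
      "composite": {
        "size": 1000,
        "sources": [
          { "userid": { "terms": { "field": "userid" } } }
        ]
      }
    }
  }
}
Each response includes an after_key; pass it back as "after" inside the composite block of the next request and repeat until no buckets come back, keeping any bucket whose doc_count is 2 or more.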

Aggregation to return all values, not do a group by

Can an aggregation return all values? Is there any way to do this with scripts?
{
"size": 0,
"_source":["docDescription","datasource"],
"query": {
"match_all":{}
},
"aggs":{
"projectNameMatchCount": {
"filter" : { "match": { "docDescription": ".ppt" } },
"aggs":{
"names":{
"terms":{"field":"_id"}
}
}
},
"datasourceSourceMatchCount": {
"filter" : { "match": { "datasource": "NGA" } }
}
}
}
In the aggregation projectNameMatchCount, I am applying a filter and calling another aggregation to return the values, but terms does a group by. I don't want a group by; all I want is to return the field values.
Aggregations are for grouping data sets together to drive a certain metric. If you want individual elements to be returned, you should run direct queries/filters instead. Aggregations are post-processes that run on the data set narrowed down by your query, and they are comparatively more expensive than queries/filters, so they should be avoided unless you need aggregated metrics.
Having said that, what I understood from your query is that you are using two aggregations: you want one to return some document IDs and the other to just return a count based on a different filter. It is possible to do so by making use of a top_hits aggregation within the filter aggregation in projectNameMatchCount, as sketched below. For more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
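A minimal sketch of that combination, reusing the filters from the question (the matching_docs name and the size of 10 are arbitrary choices here):
{
  "size": 0,
  "aggs": {
    "projectNameMatchCount": {
      "filter": { "match": { "docDescription": ".ppt" } },
      "aggs": {
        "matching_docs": {
          "top_hits": {
            "_source": ["docDescription", "datasource"],
            "size": 10
          }
        }
      }
    },
    "datasourceSourceMatchCount": {
      "filter": { "match": { "datasource": "NGA" } }
    }
  }
}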
But still, I believe you will benefit more by simply making two separate queries, in terms of total query time and the resources consumed on the Elasticsearch side: one with a query to return the IDs and the other with an aggregation to return the count of docs.

Elastic Search SUM of aggregated values

We are using Elasticsearch to get some statistics.
I need to:
1. Get average values for each group
2. Sum all these values
So far, step no. 1 was pretty straightforward. However, I really don't know how to sum all the values at the end. Is this possible? If yes, how?
Thanks for suggestions.
Here is my aggs query:
{
"query":{
"filtered":{
"query":{
"query_string":{
"analyze_wildcard":true,
"query":"*"
}
}
}
},
"aggs":{
"2":{
"terms":{
"field":"person",
"size":5000,
"order":{
"1":"desc"
}
},
"aggs":{
"1":{
"avg":{
"field":"company"
}
}
}
}
}
}
Aggregating over aggregation results is not yet supported in Elasticsearch. Apparently there is a concept called reducers (later released as pipeline aggregations) being developed for 2.0. I would suggest having a look at scripted metric aggregations: basically, you can create your own aggregation by controlling the collection and computation aspects yourself using scripts.
Alternatively, if possible you can precompute and store the average when indexing and then use the sum aggregation when querying.
Have a look at the following question for an example of this aggregation: Elasticsearch: Possible to process aggregation results?
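For what it's worth, once pipeline aggregations landed in Elasticsearch 2.0, the final sum can be taken directly with a sum_bucket aggregation over the per-person averages; a minimal sketch using the field names from the question:
{
  "size": 0,
  "aggs": {
    "per_person": {
      "terms": { "field": "person", "size": 5000 },
      "aggs": {
        "avg_company": {
          "avg": { "field": "company" }
        }
      }
    },
    "sum_of_averages": {
      "sum_bucket": {
        "buckets_path": "per_person>avg_company"
      }
    }
  }
}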
