Elastic and aggregations performance

Elastic and aggregations performance - elasticsearch

We have an index with approximately 1 million documents each representing a product for an e-commerce store. We are using aggregations to calculate buckets representing attribute values for each attribute of the product. If we do a search, which limits the resultset to say 2000 products, performance is great (Elastic returns the result in less than 10 milliseconds). However, if we do a matchall query to get all products and their corresponding aggregations, elastic takes more than 3 seconds to return a result. If we disable the aggregations, performance is blazing again. Thus, it seems as though performance, when using aggregations, is very dependent the size of the resultset (With a matchall query we get around 1000 buckets). Is there anything special we need to be aware of in Elastic, in order to have a matchall query performing similarly to a query, which returns 2000 products?
Before we started using Elastic, we have worked with Lucene, and built our own facet abstraction on top of Lucene to be able to handle the above scenario. In this case, we pre-calculated facets on index startup and represented each facet as a term with a corresponding bitset. When doing searches, we retrieved a bitset for the query in question and “AND’ed” it with the precalculated bitset for each facet we wanted to show in the given scenario. With this implementation the speed at which we were able to calculate facet results did not differ depending on the size of the resultset of a given query. Only the number of documents and the number of facets influenced the speed. With this implementation, we were able to calculate more than 10.000 facets (buckets) at each search request with an index of 1 million documents and still get the resultset and facet results in less than 100 milliseconds.
Can anyone tell if this will be possible to achieve with Elastic and any pointers to what we are doing wrong (We are currently running tests on a setup at found.no with 1 cluster having 4 2,5Ghz cores and 1GB ram. The Elastic index takes up around 3,5GB of disk space
Example query (we can save about 1/3 of the query time by not using nested aggregations):
{
"query": {
"match_all": {}
},
"aggs": {
"nested_aggs": {
"nested": {
"path": "specs"
},
"aggs": {
"combined": {
"filter": {
"match_all": { }
},
"aggs": {
"ul1": {
"terms": {
"field": "unspscNameLevel1",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"ul2": {
"terms": {
"field": "unspscNameLevel2",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"ul3": {
"terms": {
"field": "unspscNameLevel3",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"ul4": {
"terms": {
"field": "unspscNameLevel4",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"rl1": {
"terms": {
"field": "requirementSpecificationNameLevel1",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"rl2": {
"terms": {
"field": "requirementSpecificationNameLevel2",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"rl3": {
"terms": {
"field": "requirementSpecificationNameLevel3",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"t1": {
"terms": {
"field": "tilslutningskrav",
"size": 50
}
},
"t2": {
"terms": {
"field": "tildelingsform",
"size": 50
}
},
"t3": {
"terms": {
"field": "tildelingskriterie",
"size": 50
}
}
}
}
}
},
"nested_aggs2": {
"nested": {
"path": "specs.attributes"
},
"aggs": {
"combined": {
"filter": {
"match_all": { }
},
"aggs": {
"n4": {
"terms": {
"field": "4. niv. kategori",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"ffv": {
"terms": {
"field": "form forvaltningsopgave",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"fh": {
"terms": {
"field": "form hovedområde",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"fo": {
"terms": {
"field": "form opgaveområde",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"fs": {
"terms": {
"field": "form serviceområde",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
},
"s": {
"terms": {
"field": "styresystem",
"size": 50
} ,
"aggs": {
"parent_count": {
"reverse_nested": {}
}
}
}
}
}
}
},
"supplierCategory": {
"filter": {
"match_all": { }
},
"aggs": {
"sc": {
"terms": {
"field": "supplierCategory.raw",
"size": 50
}
},
"on": {
"terms": {
"field": "organisationName.raw",
"size": 50
}
},
"mn": {
"terms": {
"field": "manufacturerName.raw",
"size": 50
}
}
}
}
}
}

Related

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.

I want to make a suggestion to use bucket_path aggregation. As I know this aggregation needs to be run in sub-aggs of a histogram aggregation. As you have such field in your mapping so I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.

What should be the bucket path for nested term aggregation?

I want to do pipeline aggregation on my elasticsearch aggregation. Here is my query body
{
"aggs": {
"user_info": {
"terms": {
"field": "user_id"
},
"aggs": {
"product_info": {
"terms": {
"field": "product_id"
},
"aggs": {
"total_item_price": {
"sum": {
"field": "selling_price"
}
}
}
}
}
},
"price_percentile": {
"percentiles_bucket": {
"buckets_path": "user_info.product_info.total_item_price"
}
}
}
}
This is giving me error that
No aggregation found for path [user_info.product_info.total_item_price]
What should be the path for bucket if such nested aggregation is there? Or is it not possible to find percentiles for such bucket arrangement in elasticsearch.
P.S I am using elasticsearch 6.5

#jzzfs answer is also somewhat right. I approached it in a different way. I reversed my aggregations and it fulfilled my use case. But in general, you can't do nested bucket percentiles for now.
{
"aggs": {
"product_info": {
"terms": {
"field": "product_id"
},
"aggs": {
"user_info": {
"terms": {
"field": "user_id"
},
"aggs": {
"total_item_price": {
"sum": {
"field": "selling_price"
}
}
}
},
"pb": {
"percentiles_bucket": {
"buckets_path": "user_info>total_item_price"
}
}
}
}
}
}

First, don't use dots in the path -- use > instead:
GET stack/_search
{
"aggs": {
"user_info": {
"terms": {
"field": "user_id"
},
"aggs": {
"product_info": {
"terms": {
"field": "product_id"
},
"aggs": {
"total_item_price": {
"sum": {
"field": "selling_price"
}
}
}
}
}
},
"pb": {
"percentiles_bucket": {
"buckets_path": "user_info>product_info>total_item_price"
}
}
}
}
which yields "buckets_path must reference either a number value or a single value numeric metric aggregation, got: [Object[]] at aggregation [product_info]" so it's not gonna work.
Here are our options:
Aggregate globally but just under the bucketed product info (without the users):
GET stack/_search
{
"aggs": {
"product_info": {
"terms": {
"field": "product_id"
},
"aggs": {
"total_item_price": {
"sum": {
"field": "selling_price"
}
}
}
},
"pb": {
"percentiles_bucket": {
"buckets_path": "product_info>total_item_price"
}
}
}
}
Use filtered aggregations in order to mimic the original intent:
GET stack/_search
{
"aggs": {
"user_123": { <-- keeping the agg name consistent w/ the filter
"filter": {
"term": {
"user_id": 123 <-- actual filter
}
},
"aggs": {
"product_info": {
"terms": {
"field": "product_id"
},
"aggs": {
"total_item_price": {
"sum": {
"field": "selling_price"
}
}
}
},
"pb": {
"percentiles_bucket": {
"buckets_path": "product_info>total_item_price"
}
}
}
}
}
}
You can then have as many user_xyz subaggregations as you like -- provided you gather their IDs beforehand.

How to detect the number of days that a person passed in a city?

I have the following mapping in Elasticsearch:
PUT /traffic-data
{
"mappings": {
"traffic-entry": {
"_all": {
"enabled": false
},
"properties": {
"CameraId": {
"type":"keyword"
},
"VehiclePlateNumber": {
"type":"keyword"
},
"DateTime": {
"type":"date"
}
}
}
}
}
I want to calculate how many days per month has a vehicle stayed. A unique vehicle is identified by VehiclePlateNumber.
So, I want to get the result something like this:
VehiclePlaneNumber Month StayDays
111 1 5
222 1 1
...
How can I do it using Elasticsearch query?
This is what I tried:
GET traffic-data/_search?
{
"size": 0,
"aggs":{
"by_district":{
"terms": {
"field": "VehiclePlateNumber",
"size": 100000
},
"aggs": {
"by_month": {
"terms": {
"field": "DateTime",
"size": 12
}
}
}
}
}
}

You can do terms aggregation on Vehicle plate number then a terms sub agg on month then a sum sub agg on days.
Something like:
GET traffic-data/_search
{
"size": 0,
"aggs":{
"by_district":{
"terms": {
"field": "VehiclePlateNumber",
"size": 100000
},
"aggs": {
"by_month": {
"terms": {
"field": "DateTime",
"size": 12
},
"aggs": {
"days": {
"sum": {
"field": "days"
}
}
}
}
}
}
}
}
Month should be a scripted field but would be better to compute it at index time.
That should work.
Or you can use entity centric design and regularly index that value computed. See https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes

Count of all the results from terms aggregation bucket

GET /test*/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"test": {
"terms": {
"field": "name.keyword",
"min_doc_count": 10
}
}
}
}
the above query will return all the unique names which occurs more than 10 times.
I want the query to return the count of those unique names with occurrance more than 10 times.
I could not find a way to do find out the count. Can anybody help with it.
Thanks.

This will do the trick.
"count" will give you the count of unique names with occurrences more than 10 times.
GET /test*/_search?filter_path=aggregations.count
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"test": {
"terms": {
"field": "name.keyword",
"min_doc_count": 10
},
"aggs": {
"test2": {
"value_count": {
"field": "name.keyword"
}
},
"test3":{
"bucket_script": {
"buckets_path": "test2",
"script": "return 1"
}
}
}
},
"count":{
"sum_bucket": {
"buckets_path": "test>test3"
}
}
}
}
Let me know if this solves your problem.

For each country/colour/brand combination , find sum of number of items in elasticsearch

This is a portion of the data I have indexed in elasticsearch:
{
"country" : "India",
"colour" : "white",
"brand" : "sony"
"numberOfItems" : 3
}
I want to get the total sum of numberOfItems on a per country basis, per colour basis and per brand basis. Is there any way to do this in elasticsearch?

The following should land you straight to the answer.
Make sure you enable scripting before using it.
{
"aggs": {
"keys": {
"terms": {
"script": "doc['country'].value + doc['color'].value + doc['brand'].value"
},
"aggs": {
"keySum": {
"sum": {
"field": "numberOfItems"
}
}
}
}
}
}

To get a single result you may use sum aggregation applied to a filtered query with term (terms) filter, e.g.:
{
"query": {
"filtered": {
"filter": {
"term": {
"country": "India"
}
}
}
},
"aggs": {
"total_sum": {
"sum": {
"field": "numberOfItems"
}
}
}
}
To get statistics for all countries/colours/brands in a single pass over the data you may use the following query with 3 multi-bucket aggregations, each of them containing a single-bucket sum sub-aggregation:
{
"query": {
"match_all": {}
},
"aggs": {
"countries": {
"terms": {
"field": "country"
},
"aggs": {
"country_sum": {
"sum": {
"field": "numberOfItems"
}
}
}
},
"colours": {
"terms": {
"field": "colour"
},
"aggs": {
"colour_sum": {
"sum": {
"field": "numberOfItems"
}
}
}
},
"brands": {
"terms": {
"field": "brand"
},
"aggs": {
"brand_sum": {
"sum": {
"field": "numberOfItems"
}
}
}
}
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Elastic and aggregations performance - elasticsearch

Related

How to define percentage of result items with specific field in Elasticsearch query?

What should be the bucket path for nested term aggregation?

How to detect the number of days that a person passed in a city?

Count of all the results from terms aggregation bucket

For each country/colour/brand combination , find sum of number of items in elasticsearch

Categories

Resources