Multiple concurrent aggregations best practice - elasticsearch

I'm considering using Elasticsearch to act as the backend search engine for multi-filter utility. Per this requirement, a multiple aggregation queries will be run upon the cluster, while the expected response time is ~5 seconds.
Based on the details below, do you think this approach is valid for my use case?
If yes, what is the suggested cluster sizing?
For sure I'll have to increase default values for parameters such as index.mapping.total_fields.limit and index.mapping.nested_objects.limit.
It will be much appreciated to get some feedback on the approach suggested below, and ways to avoid common pitfalls.
Thanks in advance.
Details
Number of expected documents: ~50m
Number of unique fields values (facet_name + face_value): ~1B
Number of queries per second: ~50 per sec
Mappings:
{
"mappings": {
"properties": {
"customer_id": {
"type": "keyword"
},
"id": {
"type": "keyword"
},
"mi_score_join": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"mi_data": "customer_model"
}
},
"model_id": {
"type": "keyword"
},
"number_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "long"
}
}
},
"score": {
"type": "long"
},
"string_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "keyword"
}
}
}
}
}
}
An example for a document:
{
"id": 33421,
"string_facet":
[
{
"facet_value":"true",
"facet_name": "var_a"
},
{
"facet_value":"dummy_country",
"facet_name": "var_b"
},
{
"facet_value":"dummy_",
"facet_name": "var_c"
},
{
"facet_value":"https://dummy.com/",
"facet_name": "var_d"
}
,
{
"facet_value":"www.dummy.com",
"facet_name": "var_e"
}
,
{
"facet_value":"dummy",
"facet_name": "var_f"
}
],
"mi_score_join": "mi_data"
}
An example for an aggregation query to be run:
POST test_index/_search
{
"size":0,
"aggs": {
"facets": {
"nested": {
"path": "string_facet"
},
"aggs": {
"names": {
"terms": { "field": "string_facet.facet_name", "size":???},
"aggs": {
"values": {
"terms": { "field": "string_facet.facet_value" }
}
}
}
}
}
}
}
The "size": ??? will probably be the max cardinality of the whole terms values.
Filters may be added to the aggregations, based on the filters that already applied.

Related

Aggregate, sort and paginate on nested documents

I'm managing a product index, with product sales and other KPIs under a nested field.
Trying to sort based on nested aggregation, and paginate - with no success.
Below is a simplified version of my mapping, for the sake of the example -
{
"product_type":
{
"type": "keyword"
},
"family":
{
"type": "keyword"
},
"rootdomain":
{
"type": "keyword"
},
"kpis":
{
"type": "nested",
"properties":
{
"sales_1d":
{
"type": "float"
},
"timestamp":
{
"type": "date",
"format": "strict_date_optional_time_nanos"
},
"views_1d":
{
"type": "float"
}
}
}
}
My aggregation is similar to the one below-
{
"aggs": {
"group_by_family": {
"aggs": {
"nested_aggregation": {
"aggs": {
"range_filtered": {
"aggs": {
"sales_1d": {
"sum": {
"field": "kpis.sales_1d"
}
},
"views_1d": {
"sum": {
"field": "kpis.views_1d"
}
},
"reverse_nesting": {
"aggs": {
"docs": {
"top_hits": {
"size": 1,
"sort": [
{
"_id": {
"order": "asc"
}
}
],
"_source": {
"includes": [
"_id",
"family",
"rootdomain",
"product_type"
]
}
}
}
},
"reverse_nested": {}
}
},
"filter": {
"range": {
"kpis.timestamp": {
"format": "basic_date_time_no_millis",
"gte": "20220721T000000Z",
"lte": "20220918T235959Z"
}
}
}
}
},
"nested": {
"path": "kpis"
}
}
},
"terms": {
"field": "family",
"size": 10
}
}
},
"query": {
//some query to filter by product-type and rootdomain
},
"size": 0
}
I'm aware that I can add an order clause to term aggregation to order the aggregated results.
My target though is to paginate the aggregated results - meaning I want to retrieve and order
1-10 best-selling products, and later retrieve 11-20 best-selling products and so on.
I've tried using bucket sort under range_filtered but I'm getting an error -
class org.elasticsearch.search.aggregations.bucket.filter.InternalFilter cannot be cast to class org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
I'm not sure how to proceed from here, is this possible? if not, is there any workaround?
Thanks.

elasticsearch facet nested aggregation

Using elasticsearch 7.0.0.
I am following this link.
I have an index test_products with following mapping:
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"dynamic_templates": [
{
"search_result_data": {
"mapping": {
"type": "keyword"
},
"path_match": "search_result_data.*"
}
}
],
"properties": {
"search_data": {
"type": "nested",
"properties": {
"full_text": {
"type": "text"
},
"string_facet": {
"type": "nested",
"properties": {
"facet-name": {
"type": "keyword"
},
"facet-value": {
"type": "keyword"
}
}
}
}
}
}
}
}
And a document inserted with following format:
{
"search_result_data": {
"sku": "wheel-6075-90092",
"gtin": null,
"name": "Matte Black Wheel Fuel Ripper",
"preview_image": "abc.jg",
"url": "9836817354546538796",
"brand": "Fuel Off-Road"
},
"search_data":
{
"full_text": "Matte Black Wheel Fuel Ripper",
"string_facet": [
{
"facet-name": "category",
"facet-value": "Motor Vehicle Rims & Wheels"
},
{
"facet-name": "brand",
"facet-value": "Fuel Off-Road"
}
]
}
}
and one other document..
I am trying to aggregate on string_facet as mentioned in the link.
"aggregations": {
"agg_string_facet": {
"nested": {
"path": "string_facet"
},
"aggregations": {
"facet_name": {
"terms": {
"field": "string_facet.facet-name"
},
"aggregations": {
"facet_value": {
"terms": {
"field": "string_facet.facet-value"
}
}
}
}
}
}
}
But I get all (two) documents returned with :
"aggregations": {
"agg_string_facet": {
"doc_count": 0
}
}
What am I missing here?
Also why are the docs being returned as a response?
Documents are returned as a response because they match with your query. If you'd like them to disappear, you can set the "size" field to 0. By default, it's set to 10.
query{
...
},
"size" = 0
I read the docs and Facet aggregation has been removed. The recommendation is to use the Terms aggregation.
Now, for your question, you can go with two options:
If you'd like to get the unique values for each: facet-value and facet-name, you can do the following:
"aggs":{
"unique facet-values":{
"terms":{
"field": "facet-value.keyword",
"size": 30 #By default is 10, maximum recommended is 10,000
}
},
"unique facet-names":{
"terms":{
"field": "facet-name.keyword"
"size": 30 #By default is 10, maximum recommended is 10,000
}
}
}
If you'd like to get the unique combinations between facet-name and facet-value, you can use the Composite aggregation. If you choose this way, your aggs should look like this:
{
"aggs":{
"unique-facetvalue-and-facetname-combination":{
"composite":{
"size": 30, #By default is 10, maximum recommended is 10,000. No matter what size you choose, you can paginate.
"sources":[
{
"value":
{
"terms":{
"field": "facet-value.keyword"
}
}
},
{
"name":
{
"terms":{
"field": "facet-name.keyword"
}
}
}
]
}
}
}
}
The advantage of using Composite over Terms is that Composite lets you paginate your results with the After key. So your cluster's performance does not get affected.
Hope this is helpful! :D

Terms aggregation with nested wildcard path

Given the following nested object of nested objects
{
[...]
"nested_parent":{
"nested_child_1":{
"classifier":"one"
},
"nested_child_2":{
"classifier":"two"
},
"nested_child_3":{
"classifier":"two"
},
"nested_child_4":{
"classifier":"five"
},
"nested_child_5":{
"classifier":"six"
}
[...]
}
I'm wanting to aggregate on the wildcard-ish field nested_parent.*.classifier, along the lines of
{
"size": 0,
"aggs": {
"termsAgg": {
"nested": {
"path": "nested_parent.*"
},
"aggs": {
"termsAgg": {
"terms": {
"size": 1000,
"field": "nested_parent.*.classifier"
}
}
}
}
}
}
which does not seem to work -- possibly because the path and field are not defined clearly enough.
How can I aggregate on nested objects with dynamically created nested mappings which share most of their properties, including the classifier on which I intend to terms-aggregate?
Tdlr;
A bit late to the party.
I would suggest a different approach as I don't see a possible solution using wildcards.
My solution would involve using the copy_to to create a field that you will be able to access using aggregation.
Solution
The idea is to create a field that will store the values of all your classifiers.
Which you can be doing aggregation on.
PUT /54198251/
{
"mappings": {
"properties": {
"classifiers": {
"type": "keyword"
},
"parent": {
"type": "nested",
"properties": {
"child": {
"type": "nested",
"properties": {
"classifier": {
"type": "keyword",
"copy_to": "classifiers"
}
}
},
"child2": {
"type": "nested",
"properties": {
"classifier": {
"type": "keyword",
"copy_to": "classifiers"
}
}
}
}
}
}
}
}
POST /54198251/_doc
{
"parent": {
"child": {
"classifier": "c1"
},
"child2": {
"classifier": "c2"
}
}
}
GET /54198251/_search
{
"aggs": {
"classifiers": {
"terms": {
"field": "classifiers",
"size": 10
}
}
}
}
Will give you:
"aggregations": {
"classifiers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "c1",
"doc_count": 1
},
{
"key": "c2",
"doc_count": 1
}
]
}
}

Unwind in ElasticSearch

I am currently having the below index in ElasticSearch
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"type" : {
"type": "text",
"fielddata": true
},
"id" : {
"type": "text",
"fielddata": true
},
"nestedTypes": {
"type": "nested",
"properties": {
"nestedTypeId":{
"type": "integer"
},
"nestedType":{
"type": "text",
"fielddata": true
},
"isLead":{
"type": "boolean"
},
"share":{
"type": "float"
},
"amount":{
"type": "float"
}
}
}
}
}
}
}
I need the nested types to be displayed in a HTML table along with the id and type fields in each row.
I am trying to achieve something similar to unwind in MongoDB.
I have tried the reverse nested aggregation as below
GET my_index/_search
{
"size": 0,
"aggs": {
"NestedTypes": {
"nested": {
"path": "nestedTypes"
},
"aggs": {
"NestedType": {
"terms": {
"field": "nestedTypes.nestedType",
"order": {
"_key": "desc"
}
},
"aggs": {
"Details": {
"reverse_nested": {},
"aggs": {
"type": {
"terms": {
"field": "type"
}
},
"id": {
"terms": {
"field": "id"
}
}
}
}
}
}
}
}
}
}
But the above returns only one field from the nestedTypes, but I need all of them.
Also, I need sorting and pagination for this table. Could you please let me know how this can be achieved in ElasticSearch.
ElasticSearch does not support this operation out of the box. When a request was raised to implement the same in git, the below response was given:
We discussed it in Fixit Friday and agreed that we won't try to
implement it due to the fact that we can't think of a way to support
such operations efficiently.
The only ideas that we thought were reasonable boiled down to having
another index that stores the same data but flattened. Depending on
your use-case, you might be able to maintain those two views in
parallel or would only maintain the one you have today, then
materialize a flattened view of the data when you need it and throw it
away after you are done querying. In both cases, this requires
client-side logic.
The link to the request is here

Elasticsearch queries slow performance

We have setup elasticsearch cluster with 7 nodes. Each node having configuration like 16G RAM, 8 Core cpu, centos 6.
Elasticsearch Version : 1.3.0
Heap Memory is - 9000m
1 Master (Non data)
1 Capable master (Non data)
5 Data node
Having 10 indices, In which one index having 55 million documents [ 254Gi (508Gi with replica) ] size rest all indices having approx 20k documents.
Every 1 seconds there are 5-10 new documents are indexing.
But problem is search is bit slow. Almost taking average of 2000 ms to 5000 ms. Some queries are in 1 secs.
Mapping:
{
"my_index": {
"mappings": {
"product": {
"_id": {
"path": "product_refer_id"
},
"properties": {
"product_refer_id": {
"type": "string"
},
"body": {
"type": "string"
},
"cat": {
"type": "string"
},
"cat_score": {
"type": "float"
},
"compliant": {
"type": "string"
},
"created": {
"type": "integer"
},
"facets": {
"properties": {
"ItemsPerCategoryCount": {
"properties": {
"terms": {
"properties": {
"field": {
"type": "string"
},
"size": {
"type": "long"
}
}
}
}
}
}
},
"fields": {
"type": "string"
},
"from": {
"type": "string"
}
"id": {
"type": "string"
},
"image": {
"type": "string"
},
"lang": {
"type": "string"
},
"main_cat": {
"properties": {
"Technology": {
"type": "double"
}
}
},
"md5_product": {
"type": "string"
},
"post_created": {
"type": "long"
},
"query": {
"properties": {
"bool": {
"properties": {
"must": {
"properties": {
"query_string": {
"properties": {
"default_field": {
"type": "string"
},
"query": {
"type": "string"
}
}
},
"range": {
"properties": {
"main_cat.Technology": {
"properties": {
"gte": {
"type": "string"
}
}
},
"sub_cat.Technology.computers": {
"properties": {
"gte": {
"type": "string"
}
}
}
}
},
"term": {
"properties": {
"product.secondary_cat": {
"type": "string"
}
}
}
}
}
}
},
"match_all": {
"type": "object"
}
}
},
"secondary_cat": {
"type": "string"
},
"secondary_cat_score": {
"type": "float"
},
"size": {
"type": "long"
},
"sort": {
"properties": {
"_uid": {
"type": "string"
}
}
},
"sub_cat": {
"properties": {
"Technology": {
"properties": {
"audio": {
"type": "double"
},
"computers": {
"type": "double"
},
"gadgets": {
"type": "double"
},
"geekchic": {
"type": "double"
}
}
}
}
},
"title": {
"type": "string"
},
"product": {
"type": "string"
}
}
}
}
}
}
We are using Default Analyzer.
Any Suggestion? Does this configuration is not enough?
Looks like the indices can not fit into memory, so there will be some more disk I/O going on. Do you use SSDs? If not you should get some.
Besides this your nodes need more resources (memory, CPU) to handle that index size.
I am a little surprised about the sizes here: ~250 GB for "just" 55 million documents is huge and I don't see you are storing any bigger blobs there (I might be mistaken, its hard to see just from the mapping definition). Maybe you can consider to keep some data not analyzed in case you don't need to query it, but just retrieve it. That would reduce the index size.
Except this I have no other ideas, without knowing all the relevant infrastructure in more detail.
To add to Torsten Engelbrecht's answer, default analyzer might be part of the culprit. This analyzer will index every form of each word as a separate token, meaning that a single verb in a language with complex conjugation can be indexed a dozen times. Also, that degrades the quality of the search results. The same applies if your documents contain formatting information (HTML markup ?).
More, stop words are disabled by default, meaning that each "the", "a"... in english for instance will be indexed as well.
You should consider using localized analyzers (snowball analyzer maybe ?) and stop words for the language used in your documents in order to limit the inverted index size and this way, increase performance.
Also, consider making not_analyzed fields as md5, urls, ids, and other sorts of unsearchable fields.

Resources