Aggregating an index with parent-child runs forever - elasticsearch

I've recently decided to reindex an existing denormalized index into a new index with a parent-child relation.
I have around 14M parent docs, and each parent has up to 400 children (around 270M docs in total).
This is a simplified version of my mapping ->
{
"mappings": {
"_doc": {
"properties": {
"product_type": {
"type": "keyword"
},
"relation_type": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"product_data": [
"kpi",
"customer"
]
}
},
"rootdomain": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"rootdomain_sku": {
"type": "keyword",
"eager_global_ordinals": true
},
"sales_1d": {
"type": "float"
},
"sku": {
"type": "keyword",
"eager_global_ordinals": true
},
"timestamp": {
"type": "date",
"format": "strict_date_optional_time_nanos"
}
}
}
}
}
As you can see, I've used eager_global_ordinals on the join field to speed up search performance
(as I understand it, this moves part of the global ordinals computation for the join relation to index/refresh time instead of query time).
This migration process helped me reduce my index size from around 500GB to just 40GB.
It has a huge benefit for my use case since I update a lot of data daily.
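For context, indexing with the join field looks roughly like this - a sketch only, where the index name, document IDs and field values are illustrative; the important part is that every child document must be routed to its parent:

PUT my_index/_doc/product-1
{
  "product_type": "Rugs",
  "rootdomain": "some_domain",
  "rootdomain_sku": "some_domain_sku-1",
  "sku": "sku-1",
  "relation_type": "product_data"
}

PUT my_index/_doc/kpi-1?routing=product-1
{
  "timestamp": "2022-06-01T00:00:00.000Z",
  "sales_1d": 12.5,
  "relation_type": { "name": "kpi", "parent": "product-1" }
}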
My current testing environment is using a single node, and the index has only 1 primary shard.
When I try to run the following aggregation, it seems to run forever -
{
"aggs": {
"skus_sales": {
"aggs": {
"sales1": {
"children": {
"type": "kpi"
},
"aggs": {
"sales2": {
"filter": {
"range": {
"timestamp": {
"format": "basic_date_time_no_millis",
"gte": "20220601T000000Z",
"lte": "20220605T235959Z"
}
}
},
"aggs": {
"sales3": {
"sum": {
"field": "sales_1d"
}
}
}
}
}
}
},
"terms": {
"field": "rootdomain_sku",
"size": 10
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"rootdomain.keyword": "some_domain"
}
},
{
"term": {
"product_type": "Rugs"
}
}
]
}
}
}
I understand the cons of parent-child relations, but it seems like I'm doing something wrong.
I would expect to get some result, even after 15 minutes, but it seems to run forever.
I would love to get some help here,
Thanks.

It seems the issue is partly the single shard: by increasing the number of primary shards (1 -> 4) I've managed to gain some performance, but the aggregation still runs for a very(!) long time.
Since parent-child query performance does not seem to meet my requirements, I'm now trying nested objects instead: indexing/updating time will increase, but I should gain a search/aggregation performance boost.
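For reference, a minimal sketch of what the nested mapping might look like - the index name, the kpis field name, and the typeless (7.x-style) mapping syntax are illustrative; the field types are taken from the mapping above:

PUT products_nested
{
  "mappings": {
    "properties": {
      "product_type": { "type": "keyword" },
      "rootdomain": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "rootdomain_sku": { "type": "keyword" },
      "sku": { "type": "keyword" },
      "kpis": {
        "type": "nested",
        "properties": {
          "timestamp": { "type": "date", "format": "strict_date_optional_time_nanos" },
          "sales_1d": { "type": "float" }
        }
      }
    }
  }
}

The children/filter/sum part of the aggregation above would then become a nested aggregation on the kpis path instead of a children aggregation.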

Related

Multiple concurrent aggregations best practice

I'm considering using Elasticsearch as the backend search engine for a multi-filter utility. Per this requirement, multiple aggregation queries will be run against the cluster, with an expected response time of ~5 seconds.
Based on the details below, do you think this approach is valid for my use case?
If yes, what is the suggested cluster sizing?
For sure I'll have to increase default values for parameters such as index.mapping.total_fields.limit and index.mapping.nested_objects.limit.
It will be much appreciated to get some feedback on the approach suggested below, and ways to avoid common pitfalls.
Thanks in advance.
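For reference, both limits can be set when the index is created; a sketch with illustrative values:

PUT test_index
{
  "settings": {
    "index.mapping.total_fields.limit": 2000,
    "index.mapping.nested_objects.limit": 50000
  }
}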
Details
Number of expected documents: ~50m
Number of unique field values (facet_name + facet_value): ~1B
Number of queries per second: ~50
Mappings:
{
"mappings": {
"properties": {
"customer_id": {
"type": "keyword"
},
"id": {
"type": "keyword"
},
"mi_score_join": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"mi_data": "customer_model"
}
},
"model_id": {
"type": "keyword"
},
"number_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "long"
}
}
},
"score": {
"type": "long"
},
"string_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "keyword"
}
}
}
}
}
}
An example of a document:
{
"id": 33421,
"string_facet":
[
{
"facet_value":"true",
"facet_name": "var_a"
},
{
"facet_value":"dummy_country",
"facet_name": "var_b"
},
{
"facet_value":"dummy_",
"facet_name": "var_c"
},
{
"facet_value":"https://dummy.com/",
"facet_name": "var_d"
},
{
"facet_value":"www.dummy.com",
"facet_name": "var_e"
},
{
"facet_value":"dummy",
"facet_name": "var_f"
}
],
"mi_score_join": "mi_data"
}
An example of an aggregation query to be run:
POST test_index/_search
{
"size":0,
"aggs": {
"facets": {
"nested": {
"path": "string_facet"
},
"aggs": {
"names": {
"terms": { "field": "string_facet.facet_name", "size":???},
"aggs": {
"values": {
"terms": { "field": "string_facet.facet_value" }
}
}
}
}
}
}
}
The "size": ??? will probably be the max cardinality of the whole terms values.
Filters may be added to the aggregations, based on the filters that already applied.
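For example, once a user has picked a facet value, a follow-up query might combine a nested filter with the facet aggregation; a sketch only (the chosen facet name/value and the size of 100 are illustrative):

POST test_index/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "string_facet",
      "query": {
        "bool": {
          "filter": [
            { "term": { "string_facet.facet_name": "var_b" } },
            { "term": { "string_facet.facet_value": "dummy_country" } }
          ]
        }
      }
    }
  },
  "aggs": {
    "facets": {
      "nested": { "path": "string_facet" },
      "aggs": {
        "names": {
          "terms": { "field": "string_facet.facet_name", "size": 100 },
          "aggs": {
            "values": {
              "terms": { "field": "string_facet.facet_value" }
            }
          }
        }
      }
    }
  }
}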

Elasticsearch multiple index query

I have the following index, which stores the course details (I have truncated some attributes for brevity):
{
"settings": {
"index": {
"number_of_replicas": "1",
"number_of_shards": "1"
}
},
"aliases": {
"course": {
}
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"id": {
"type": "integer"
},
"max_per_user": {
"type": "integer"
}
}
}
}
Here max_per_user is the number of times a user can complete the course. A user is allowed through a course multiple times, but not more than max_per_user times for a given course.
I want to track user interactions with courses. I have created the following index to track interaction events; event_type_id represents a type of interaction:
{
"settings": {
"index": {
"number_of_replicas": "1",
"number_of_shards": "1"
}
},
"aliases": {
"course_events": {
}
},
"mappings": {
"properties": {
"user_progress": {
"dynamic": "true",
"properties": {
"current_count": {
"type": "integer"
},
"user_id": {
"type": "integer"
},
"events": {
"dynamic": "true",
"properties": {
"event_type_id": {
"type": "integer"
},
"event_timestamp": {
"type": "date",
"format": "strict_date_time"
}
}
}
}
},
"created_at": {
"type": "date",
"format": "strict_date_time"
},
"course_id": {
"type": "integer"
}
}
}
}
Where current_count is the number of times the user has gone through the complete course.
Now, when I run a search on the course index, I also want to be able to pass in a user_id and get only those courses where the current_count for that user is less than max_per_user for the course.
My search query for the course index is something like this (I have truncated some filters for brevity). This query is executed when a user searches for a course, so at the time of executing it I will have the user_id:
{
"sort": [
{
"id": "desc"
}
],
"query": {
"bool": {
"filter": [
{
"range": {
"end_date": {
"gte": "2020-09-28T12:27:55.884Z"
}
}
},
{
"range": {
"start_date": {
"lte": "2020-09-28T12:27:55.884Z"
}
}
}
],
"must": [
{
"term": {
"is_active": true
}
}
]
}
}
}
I am not sure how to construct my search query such that I am able to filter out courses where max_per_user has been achieved for a given user_id.
If I understood the question correctly, you want to find the courses where the max_per_user limit isn't exceeded. My answer is on that basis.
Considering your current schema, the way to find what you want is:
1. For the given user_id, find all the course_ids and their corresponding completion counts.
2. Using the data fetched in #1, find the courses for which the max_per_user limit is not exceeded.
Now comes the problem: in a relational database such a use case could be solved with a table join and a check, but Elasticsearch doesn't support joins across indices, so that can't be done here.
Poor solution with the current schema: for each course, check whether it is applicable or not. For N courses, the number of queries to Elasticsearch will be proportional to N.
Solution with the current schema: within the user-course-completion index (the second index you mentioned), track max_per_user as well, and use a simple script query like the one below to get the required course ids:
{
"size": 10,
"query": {
"script": {
"script": "doc['current_usage'].value < doc['max_per_user'].value && doc['u_id'].value == 1"
}
}
}
Here 1 is the user_id.
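The same filter written with the current script query syntax, passing the user id as a parameter instead of hard-coding it (a sketch; the field names follow the answer above):

GET course_events/_search
{
  "size": 10,
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['current_usage'].value < doc['max_per_user'].value && doc['u_id'].value == params.user_id",
            "params": { "user_id": 1 }
          }
        }
      }
    }
  }
}

Note that script queries are evaluated per matching document, so on a large events index this filter can get slow.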

Elastic Search sorting very slow on large datasets

Sorting in ES was very fast when I had less data, but once the data grew into GBs, sorting became very slow: normal fields sort in under 1 second, but fields with the mapping below take more than 10 seconds, and sometimes longer.
I am unable to figure out why that is. Can anyone help me with this?
Mapping:
"newFields": {
"type": "nested",
"properties": {
"group": { "type": "keyword" },
"fieldType": { "type": "keyword" },
"name": { "type": "keyword" },
"stringValue": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "sort_normalizer"
}
}
},
"longValue": {
"type": "long"
},
"doubleValue": {
"type": "float"
},
"booleanValue": {
"type": "boolean"
}
}
}
Query:
{
"index": "transactions-read",
"body": {
"query": {
"bool": { "filter": { "bool": { "must": [{ "match_all": {} }] } } }
},
"sort": [
{
"newFields.intValue": {
"order": "desc",
"nested": {
"path": "newFields",
"filter": { "match": { "newFields.name": "johndoe" } }
}
}
}
]
},
"from": 0,
"size": 50
}
So is there any way to make it faster? Or am I missing something here?
The nested datatype is known for poor performance, and on top of that you are sorting, which is again a costly operation. Please refer to this great Medium blog from the Gojek engineering team on their performance issues with nested docs.
They suggested some optimizations, which include changing the schema, but they did not cover infra-level optimizations such as tuning the JVM heap size and choosing suitable shard and replica counts, which are the backbone of Elasticsearch; it's worth checking and tuning these infra params as well.
A nested sort will always be slower than a non-nested sort, and as the number of nested documents in your index increases, sorting will unfortunately slow down further.
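If the fields you sort on are known in advance, one common workaround (a sketch of a schema change, not something from the answers above; the index and field names are illustrative) is to copy the value onto a top-level field at index time and sort on that, avoiding the nested sort entirely:

PUT transactions/_mapping
{
  "properties": {
    "johndoe_intValue": { "type": "long" }
  }
}

GET transactions-read/_search
{
  "from": 0,
  "size": 50,
  "query": { "match_all": {} },
  "sort": [
    { "johndoe_intValue": { "order": "desc" } }
  ]
}

The trade-off is extra work and extra fields at indexing time, which is usually acceptable when the same sort is run frequently.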

Unwind in ElasticSearch

I currently have the below index in ElasticSearch:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"type" : {
"type": "text",
"fielddata": true
},
"id" : {
"type": "text",
"fielddata": true
},
"nestedTypes": {
"type": "nested",
"properties": {
"nestedTypeId":{
"type": "integer"
},
"nestedType":{
"type": "text",
"fielddata": true
},
"isLead":{
"type": "boolean"
},
"share":{
"type": "float"
},
"amount":{
"type": "float"
}
}
}
}
}
}
}
I need the nested types to be displayed in an HTML table, along with the id and type fields in each row.
I am trying to achieve something similar to unwind in MongoDB.
I have tried the reverse nested aggregation as below
GET my_index/_search
{
"size": 0,
"aggs": {
"NestedTypes": {
"nested": {
"path": "nestedTypes"
},
"aggs": {
"NestedType": {
"terms": {
"field": "nestedTypes.nestedType",
"order": {
"_key": "desc"
}
},
"aggs": {
"Details": {
"reverse_nested": {},
"aggs": {
"type": {
"terms": {
"field": "type"
}
},
"id": {
"terms": {
"field": "id"
}
}
}
}
}
}
}
}
}
}
But the above returns only one field from nestedTypes, whereas I need all of them.
Also, I need sorting and pagination for this table. Could you please let me know how this can be achieved in ElasticSearch?
ElasticSearch does not support this operation out of the box. When a request was raised to implement it on GitHub, the response below was given:
We discussed it in Fixit Friday and agreed that we won't try to
implement it due to the fact that we can't think of a way to support
such operations efficiently.
The only ideas that we thought were reasonable boiled down to having
another index that stores the same data but flattened. Depending on
your use-case, you might be able to maintain those two views in
parallel or would only maintain the one you have today, then
materialize a flattened view of the data when you need it and throw it
away after you are done querying. In both cases, this requires
client-side logic.
The link to the request is here
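Following the suggestion in that response, the flattened companion index would hold one document per nestedTypes entry, with the parent's id and type copied onto it; a sketch with illustrative values:

PUT my_index_flat/doc/1-101
{
  "id": "1",
  "type": "typeA",
  "nestedTypeId": 101,
  "nestedType": "dummy_type",
  "isLead": true,
  "share": 0.5,
  "amount": 1000.0
}

Each such document maps to one row of the HTML table, so ordinary sort, from and size on this index give the sorting and pagination asked about.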

Elasticsearch aggregation performance takes a hit on relatively small dataset

We have a cluster of 3 Linux VMs (each machine has 2 cores, 8GB of RAM per core) where we have deployed an Elasticsearch 2.1.1 cluster, with default configuration. Store size is ~50GB for ~3M documents - so arguably fairly modest. We index documents ranging in size from tweets to blog posts. For each document, we extract "entities" (e.g., if the string "Barack Obama" appears in a document, we locate its character position and classify it into an entity type, in this case "person" or "statesman") from the text before indexing the document alongside its array of extracted entities.
Our mapping is as follows:
{
"mappings": {
"_default_": {
"_all": { "enabled": "false" },
"dynamic": false
},
"document": {
"properties": {
"body": { "type": "string", "index": "analyzed", "analyzer": "english" },
"timestamp": { "type": "date", "index":"not_analyzed" },
"author": {
"properties": {
"name": { "type": "string", "index": "not_analyzed" }
}
},
"entities": {
"type": "nested",
"include_in_parent": true,
"properties": {
"text": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string", "index": "analyzed", "analyzer": "path" },
"start": { "type": "integer", "index":"not_analyzed", "doc_values": false },
"stop": { "type": "integer", "index":"not_analyzed", "doc_values": false }
}
}
}
}
}
}
Path analyzer is used on the entity type field (entity types are based on some hierarchical taxonomy, so the type is represented as a path-like string). The only other analyzed field is the body of the document. For some reason that I could expand on if necessary, we have to index the entities as nested types, though we are still including them in the parent document.
There are on average ~10 entities extracted per document, so ~30M entities in total. The cardinality for the entities field is thus fairly high (~2M unique values).
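For reference, the path analyzer mentioned above is presumably built on the path_hierarchy tokenizer, so a type like person/statesman indexes both person and person/statesman; a minimal definition might look like this (an assumption - the original index settings are not shown):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "path": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}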
Our problem is that some of the aggregations we are doing are very slow (>30s). In particular, the following two aggregations:
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"terms": { "field": "entities.text", "size": 50 }
}
}
}
And the same one, just replacing 'terms' aggregation with 'significant_terms':
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"significant_terms": { "field": "entities.text", "size": 50 }
}
}
}
My questions:
Why are these aggregations prohibitively slow?
Is there something stupid/inefficient in the mapping strategy?
Does indexing the entities as a nested document while still keeping them in the parent document have an impact?
Is it simply that the cardinality of the entities field is just too big and Elasticsearch is not magic?
