Elasticsearch sorting very slow on large datasets

Sorting in Elasticsearch was very fast when I had little data, but once the data grew into the GBs, sorting became very slow: normal fields sort in under 1 second, but fields with the mapping below take more than 10 seconds, sometimes longer.
I am unable to figure out why. Can anyone help me with this?
Mapping:
"newFields": {
"type": "nested",
"properties": {
"group": { "type": "keyword" },
"fieldType": { "type": "keyword" },
"name": { "type": "keyword" },
"stringValue": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "sort_normalizer"
}
}
},
"longValue": {
"type": "long"
},
"doubleValue": {
"type": "float"
},
"booleanValue": {
"type": "boolean"
}
}
}
Query:
{
  "index": "transactions-read",
  "body": {
    "query": {
      "bool": { "filter": { "bool": { "must": [{ "match_all": {} }] } } }
    },
    "sort": [
      {
        "newFields.intValue": {
          "order": "desc",
          "nested": {
            "path": "newFields",
            "filter": { "match": { "newFields.name": "johndoe" } }
          }
        }
      }
    ]
  },
  "from": 0,
  "size": 50
}
So is there any way to make it faster? Or am I missing something here?

The nested datatype is known for poor performance, and on top of that you are sorting, which is another costly operation. Please refer to this great Medium blog by the Gojek engineering team on their performance issues with nested docs.
They suggest some optimizations, including changing the schema, but they do not cover infra-level optimizations such as tuning the JVM heap size and choosing a suitable number of shards and replicas, which are the backbone of Elasticsearch; these infra params are worth checking and tuning as well, as sketched below.
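For concreteness, a minimal sketch of those infra-level knobs (the numbers are placeholders, not recommendations). Heap is set in config/jvm.options, commonly to about half the machine's RAM and kept below ~32 GB:

# config/jvm.options
-Xms8g
-Xmx8g

The primary shard count is fixed at index creation, so changing it means creating a new index (the name below is hypothetical) and reindexing into it; the replica count can be changed on a live index:

PUT transactions-read-v2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}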

Nested sort will be slower than non-nested sort, and unfortunately, as the number of nested documents in your index grows, sorting slows down further.
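If the set of names you sort by is small and known, one schema change in the spirit of that blog post is to denormalize the sort value into a plain top-level field at index time, so the sort needs no nested context at all. A sketch, with a hypothetical index and field layout:

PUT transactions-read-flat
{
  "mappings": {
    "properties": {
      "sortValues": {
        "properties": {
          "johndoe": { "type": "long" }
        }
      }
    }
  }
}

The sort then becomes an ordinary, fast field sort:

"sort": [{ "sortValues.johndoe": { "order": "desc" } }]

The trade-off is one sub-field per name, so this only works when that set stays bounded.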

Related

Aggregating an index with parent-child runs forever

I've recently decided to attempt to reindex an existing denormalized index into a new index with a parent-child relation.
I have around 14M parent docs, and each parent has up to 400 children (around 270M docs in total).
This is a simplified version of my mapping ->
{
  "mappings": {
    "_doc": {
      "properties": {
        "product_type": { "type": "keyword" },
        "relation_type": {
          "type": "join",
          "eager_global_ordinals": true,
          "relations": {
            "product_data": ["kpi", "customer"]
          }
        },
        "rootdomain": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "rootdomain_sku": {
          "type": "keyword",
          "eager_global_ordinals": true
        },
        "sales_1d": { "type": "float" },
        "sku": {
          "type": "keyword",
          "eager_global_ordinals": true
        },
        "timestamp": {
          "type": "date",
          "format": "strict_date_optional_time_nanos"
        }
      }
    }
  }
}
As you can see, I've used eager_global_ordinals for the join relation to speed up search performance
(as I understand it, this moves some of the join computation on global ordinals from query time to indexing time).
This migration process helped me reduce my index size from around 500GB to just 40GB.
It has a huge benefit for my use case since I update a lot of data daily.
My current testing environment is using a single node, and the index has only 1 primary shard.
Trying to run the following aggregation, it seems to run forever -
{
  "aggs": {
    "skus_sales": {
      "terms": {
        "field": "rootdomain_sku",
        "size": 10
      },
      "aggs": {
        "sales1": {
          "children": { "type": "kpi" },
          "aggs": {
            "sales2": {
              "filter": {
                "range": {
                  "timestamp": {
                    "format": "basic_date_time_no_millis",
                    "gte": "20220601T000000Z",
                    "lte": "20220605T235959Z"
                  }
                }
              },
              "aggs": {
                "sales3": {
                  "sum": { "field": "sales_1d" }
                }
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        { "term": { "rootdomain.keyword": "some_domain" } },
        { "term": { "product_type": "Rugs" } }
      ]
    }
  }
}
I understand the cons of parent-child relations, but it seems like I'm doing something wrong.
I would expect to get some result, even after 15 minutes, but it seems to run forever.
I would love to get some help here. Thanks.
It seems the issue was using a single shard; by increasing the number of primary shards (1 -> 4) I've managed to gain some performance boost, but it still runs for a very(!) long time.
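For the record, since number_of_shards is fixed at index creation, going from 1 to 4 primaries means either reindexing into a new index or using the _split API on a write-blocked source index. A sketch (the target index name is hypothetical):

PUT my_index/_settings
{ "index.blocks.write": true }

POST my_index/_split/my_index_split
{ "settings": { "index.number_of_shards": 4 } }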
It seems parent-child query performance does not meet my requirements, so I'm trying to use nested objects instead -
updating/indexing time will increase, but I'll gain a search/aggregation performance boost.
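For anyone taking the same route, here is a sketch of what the nested variant of the mapping above could look like, assuming each parent document embeds its kpi children under a hypothetical kpis field:

{
  "mappings": {
    "properties": {
      "product_type": { "type": "keyword" },
      "rootdomain": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "rootdomain_sku": { "type": "keyword" },
      "sku": { "type": "keyword" },
      "kpis": {
        "type": "nested",
        "properties": {
          "timestamp": { "type": "date", "format": "strict_date_optional_time_nanos" },
          "sales_1d": { "type": "float" }
        }
      }
    }
  }
}

The children aggregation is then replaced by a nested aggregation on kpis, and each daily update rewrites the whole parent document, which is exactly the indexing cost mentioned above.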

Aggregation on terms and not on whole field

I have an index with products (ES 6.3), where some of the product names look like this: Tomato, Tomatosoup, Tomatojuice, etc. What I'm trying to achieve is, when I query for example the term Toma, to get an aggregation of the best-matching terms instead of the whole product names.
To achieve this, I have the following mapping:
{
  "name": {
    "type": "text",
    "analyzer": "custom-ngram",   // defined in the index settings
    "search_analyzer": "standard",
    "fields": {
      "suggestion": {
        "type": "text",
        "fielddata": true,
        "analyzer": "standard"
      }
    }
  }
}
and my query looks like this:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "tom",
          "fields": ["name^3", "description"]
        }
      }
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "name.suggestion",
        "include": "tom.*",
        "size": 10
      }
    }
  },
  "size": 0
}
Indeed, this works and gives me back what I need, but I have two concerns:
The usage of fielddata, which is discouraged by the ES docs
The usage of the include parameter to actually filter the aggregation buckets
Is this the right way to solve this problem, or is the approach completely wrong? Is there any best practice for this?
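One way to avoid fielddata entirely, sketched here with a hypothetical name_tokens field: pre-tokenize the name on the client (or in an ingest pipeline) into a multi-valued keyword field, which supports terms aggregations through doc_values:

{
  "name": {
    "type": "text",
    "analyzer": "custom-ngram",
    "search_analyzer": "standard"
  },
  "name_tokens": { "type": "keyword" }
}

A document is then indexed as, e.g., { "name": "Tomatosoup", "name_tokens": ["tomatosoup"] }, and the aggregation targets name_tokens. The include regex remains the intended way to narrow the buckets to the queried prefix, so that part of the query is fine.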

Unwind in ElasticSearch

I currently have the below index in Elasticsearch
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "type": {
          "type": "text",
          "fielddata": true
        },
        "id": {
          "type": "text",
          "fielddata": true
        },
        "nestedTypes": {
          "type": "nested",
          "properties": {
            "nestedTypeId": { "type": "integer" },
            "nestedType": {
              "type": "text",
              "fielddata": true
            },
            "isLead": { "type": "boolean" },
            "share": { "type": "float" },
            "amount": { "type": "float" }
          }
        }
      }
    }
  }
}
I need the nested types to be displayed in an HTML table, along with the id and type fields in each row.
I am trying to achieve something similar to unwind in MongoDB.
I have tried the reverse nested aggregation as below
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "NestedTypes": {
      "nested": { "path": "nestedTypes" },
      "aggs": {
        "NestedType": {
          "terms": {
            "field": "nestedTypes.nestedType",
            "order": { "_key": "desc" }
          },
          "aggs": {
            "Details": {
              "reverse_nested": {},
              "aggs": {
                "type": { "terms": { "field": "type" } },
                "id": { "terms": { "field": "id" } }
              }
            }
          }
        }
      }
    }
  }
}
But the above returns only one field from nestedTypes, and I need all of them.
Also, I need sorting and pagination for this table. Could you please let me know how this can be achieved in Elasticsearch?
Elasticsearch does not support this operation out of the box. When a request to implement it was raised on GitHub, the response below was given:
We discussed it in Fixit Friday and agreed that we won't try to
implement it due to the fact that we can't think of a way to support
such operations efficiently.
The only ideas that we thought were reasonable boiled down to having
another index that stores the same data but flattened. Depending on
your use-case, you might be able to maintain those two views in
parallel or would only maintain the one you have today, then
materialize a flattened view of the data when you need it and throw it
away after you are done querying. In both cases, this requires
client-side logic.
The link to the request is here
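To make the suggested flattened view concrete, here is a sketch with a hypothetical index name: every entry of nestedTypes becomes its own document that repeats the parent's id and type (mapped as keyword here, since the values are only sorted and aggregated, never full-text searched):

PUT my_index_flat
{
  "mappings": {
    "doc": {
      "properties": {
        "type": { "type": "keyword" },
        "id": { "type": "keyword" },
        "nestedTypeId": { "type": "integer" },
        "nestedType": { "type": "keyword" },
        "isLead": { "type": "boolean" },
        "share": { "type": "float" },
        "amount": { "type": "float" }
      }
    }
  }
}

A source document with N entries in nestedTypes is written as N flat documents, and the HTML table then becomes a plain search with sort, from, and size - no aggregations needed.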

Elasticsearch aggregation performance takes a hit on relatively small dataset

We have a cluster of 3 Linux VMs (each machine has 2 cores, 8GB of RAM per core) where we have deployed an Elasticsearch 2.1.1 cluster, with default configuration. Store size is ~50GB for ~3M documents - so arguably fairly modest. We index documents ranging in size from tweets to blog posts. For each document, we extract "entities" (e.g., if the string "Barack Obama" appears in a document, we locate its character position and classify it into an entity type, in this case "person" or "statesman") from the text before indexing the document alongside its array of extracted entities.
Our mapping is as follows:
{
  "mappings": {
    "_default_": {
      "_all": { "enabled": "false" },
      "dynamic": false
    },
    "document": {
      "properties": {
        "body": { "type": "string", "index": "analyzed", "analyzer": "english" },
        "timestamp": { "type": "date", "index": "not_analyzed" },
        "author": {
          "properties": {
            "name": { "type": "string", "index": "not_analyzed" }
          }
        },
        "entities": {
          "type": "nested",
          "include_in_parent": true,
          "properties": {
            "text": { "type": "string", "index": "not_analyzed" },
            "type": { "type": "string", "index": "analyzed", "analyzer": "path" },
            "start": { "type": "integer", "index": "not_analyzed", "doc_values": false },
            "stop": { "type": "integer", "index": "not_analyzed", "doc_values": false }
          }
        }
      }
    }
  }
}
The path analyzer is used on the entity type field (entity types are based on a hierarchical taxonomy, so the type is represented as a path-like string). The only other analyzed field is the body of the document. For some reason that I could expand on if necessary, we have to index the entities as a nested type, though we still include them in the parent document.
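For context, such a path analyzer is usually built on the path_hierarchy tokenizer; here is a sketch of the analysis settings assumed above (the analyzer name matches the mapping; the exact definition is an assumption):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "path": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}

With this, a type like "person/statesman" is indexed as the tokens "person" and "person/statesman", so queries can match at any level of the taxonomy.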
There are on average ~10 entities extracted per document, so ~30M entities in total. The cardinality for the entities field is thus fairly high (~2M unique values).
Our problem is that some of the aggregations we are doing are very slow (>30s). In particular, the following two aggregations:
{
  "query": {
    "bool": {
      "must": {
        "query": {
          // Some query
        }
      },
      "filter": {
        // Some filter
      }
    }
  },
  "aggs": {
    "aggData": {
      "terms": { "field": "entities.text", "size": 50 }
    }
  }
}
And the same one, just replacing 'terms' aggregation with 'significant_terms':
{
  "query": {
    "bool": {
      "must": {
        "query": {
          // Some query
        }
      },
      "filter": {
        // Some filter
      }
    }
  },
  "aggs": {
    "aggData": {
      "significant_terms": { "field": "entities.text", "size": 50 }
    }
  }
}
My questions:
Why are these aggregations prohibitively slow?
Is there something stupid/inefficient in the mapping strategy?
Does indexing the entities as a nested document while still keeping them in the parent document have an impact?
Is it simply that the cardinality of the entities field is just too big and Elasticsearch is not magic?

Elasticsearch queries slow performance

We have set up an Elasticsearch cluster with 7 nodes. Each node has 16G RAM, an 8-core CPU, and runs CentOS 6.
Elasticsearch version: 1.3.0
Heap memory: 9000m
1 master (non-data)
1 master-capable node (non-data)
5 data nodes
We have 10 indices; one index has 55 million documents [254Gi (508Gi with replica)], and the rest each have approx 20k documents.
Every second, 5-10 new documents are indexed.
The problem is that search is a bit slow, taking on average 2000 ms to 5000 ms; some queries run in about 1 sec.
Mapping:
{
  "my_index": {
    "mappings": {
      "product": {
        "_id": {
          "path": "product_refer_id"
        },
        "properties": {
          "product_refer_id": { "type": "string" },
          "body": { "type": "string" },
          "cat": { "type": "string" },
          "cat_score": { "type": "float" },
          "compliant": { "type": "string" },
          "created": { "type": "integer" },
          "facets": {
            "properties": {
              "ItemsPerCategoryCount": {
                "properties": {
                  "terms": {
                    "properties": {
                      "field": { "type": "string" },
                      "size": { "type": "long" }
                    }
                  }
                }
              }
            }
          },
          "fields": { "type": "string" },
          "from": { "type": "string" },
          "id": { "type": "string" },
          "image": { "type": "string" },
          "lang": { "type": "string" },
          "main_cat": {
            "properties": {
              "Technology": { "type": "double" }
            }
          },
          "md5_product": { "type": "string" },
          "post_created": { "type": "long" },
          "query": {
            "properties": {
              "bool": {
                "properties": {
                  "must": {
                    "properties": {
                      "query_string": {
                        "properties": {
                          "default_field": { "type": "string" },
                          "query": { "type": "string" }
                        }
                      },
                      "range": {
                        "properties": {
                          "main_cat.Technology": {
                            "properties": {
                              "gte": { "type": "string" }
                            }
                          },
                          "sub_cat.Technology.computers": {
                            "properties": {
                              "gte": { "type": "string" }
                            }
                          }
                        }
                      },
                      "term": {
                        "properties": {
                          "product.secondary_cat": { "type": "string" }
                        }
                      }
                    }
                  }
                }
              },
              "match_all": { "type": "object" }
            }
          },
          "secondary_cat": { "type": "string" },
          "secondary_cat_score": { "type": "float" },
          "size": { "type": "long" },
          "sort": {
            "properties": {
              "_uid": { "type": "string" }
            }
          },
          "sub_cat": {
            "properties": {
              "Technology": {
                "properties": {
                  "audio": { "type": "double" },
                  "computers": { "type": "double" },
                  "gadgets": { "type": "double" },
                  "geekchic": { "type": "double" }
                }
              }
            }
          },
          "title": { "type": "string" },
          "product": { "type": "string" }
        }
      }
    }
  }
}
We are using the default analyzer.
Any suggestions? Is this configuration not enough?
It looks like the indices cannot fit into memory, so there will be additional disk I/O. Do you use SSDs? If not, you should get some.
Besides this, your nodes need more resources (memory, CPU) to handle that index size.
I am a little surprised by the sizes here: ~250 GB for "just" 55 million documents is huge, and I don't see you storing any bigger blobs there (I might be mistaken; it's hard to tell from the mapping definition alone). Maybe you can consider keeping some data not analyzed if you never query it, only retrieve it. That would reduce the index size.
Beyond this I have no other ideas without knowing the relevant infrastructure in more detail.
To add to Torsten Engelbrecht's answer, the default analyzer might be part of the culprit. This analyzer indexes every form of each word as a separate token, meaning that a single verb in a language with complex conjugation can be indexed a dozen times. It also degrades the quality of the search results. The same applies if your documents contain formatting information (HTML markup?).
Moreover, stop-word removal is disabled by default, meaning that every "the", "a", and so on (in English, for instance) will be indexed as well.
You should consider using localized analyzers (the snowball analyzer, maybe?) and stop words for the language of your documents in order to limit the inverted-index size and thereby increase performance.
Also, consider making fields such as MD5 hashes, URLs, IDs, and other unsearchable fields not_analyzed, as sketched below.
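In ES 1.x mapping syntax, those two suggestions look roughly like this (a sketch; which fields to change is your call):

{
  "product": {
    "properties": {
      "body": { "type": "string", "analyzer": "snowball" },
      "md5_product": { "type": "string", "index": "not_analyzed" },
      "image": { "type": "string", "index": "not_analyzed" },
      "product_refer_id": { "type": "string", "index": "not_analyzed" }
    }
  }
}

The snowball analyzer stems terms and can be configured with a stopwords list, while not_analyzed fields are indexed as a single term each, which shrinks the index and skips analysis cost at indexing time.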
