Elasticsearch aggregation performance takes a hit on relatively small dataset - elasticsearch

We have a cluster of 3 Linux VMs (each machine has 2 cores, 8GB of RAM per core) where we have deployed an Elasticsearch 2.1.1 cluster, with default configuration. Store size is ~50GB for ~3M documents -so arguably fairly modest. We index documents ranging in size from tweets to blog posts. For each document, we extract "entities" (eg, if string "Barack Obama" appears in a document, we locate its character position and classify it into an entity type, in this case the type "person", or "statesman") from the text before indexing the document alongside its array of extracted entities.
Our mapping is as follows:
{
"mappings": {
"_default_": {
"_all": { "enabled": "false" },
"dynamic": false
},
"document": {
"properties": {
"body": { "type": "string", "index": "analyzed", "analyzer": "english" },
"timestamp": { "type": "date", "index":"not_analyzed" },
"author": {
"properties": {
"name": { "type": "string", "index": "not_analyzed" }
}
},
"entities": {
"type": "nested",
"include_in_parent": true,
"properties": {
"text": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string", "index": "analyzed", "analyzer": "path" },
"start": { "type": "integer", "index":"not_analyzed", "doc_values": false },
"stop": { "type": "integer", "index":"not_analyzed", "doc_values": false }
}
}
}
}
}
}
Path analyzer is used on the entity type field (entity types are based on some hierarchical taxonomy, so the type is represented as a path-like string). The only other analyzed field is the body of the document. For some reason that I could expand on if necessary, we have to index the entities as nested types, though we are still including them in the parent document.
There are on average ~10 entities extracted per document, so ~30M entities in total. The cardinality for the entities field is thus fairly high (~2M unique values).
Our problem is that some of the aggregations we are doing are very slow (>30s). In particular, the following two aggregations:
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
terms: { field: 'entities.text', size: 50 }
}
}
}
And the same one, just replacing 'terms' aggregation with 'significant_terms':
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
significant_terms: { field: 'entities.text', size: 50 }
}
}
}
My questions:
Why are these aggregations prohibitively slow?
Is there something stupid/inefficient in the mapping strategy?
Does indexing the entities as a nested document while still keeping them in the parent document have an impact?
Is it simply that the cardinality of the entities field is just too big and Elasticsearch is not magic?

Related

Elastic Search sorting very slow on large datasets

The sorting of data in ES was very fast when I had less data, but when the data increased into GBs then the sorting of the fields is very very slow, normal fields < 1 sec, but for the fields with the below mapping the sorting time is > 10 seconds and sometimes more.
I am unable to figure out why is that? can anyone help me with this?
Mapping:
"newFields": {
"type": "nested",
"properties": {
"group": { "type": "keyword" },
"fieldType": { "type": "keyword" },
"name": { "type": "keyword" },
"stringValue": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "sort_normalizer"
}
}
},
"longValue": {
"type": "long"
},
"doubleValue": {
"type": "float"
},
"booleanValue": {
"type": "boolean"
}
}
}
Query:
{
"index": "transactions-read",
"body": {
"query": {
"bool": { "filter": { "bool": { "must": [{ "match_all": {} }] } } }
},
"sort": [
{
"newFields.intValue": {
"order": "desc",
"nested": {
"path": "newFields",
"filter": { "match": { "newFields.name": "johndoe" } }
}
}
}
]
},
"from": 0,
"size": 50
}
So is there any way to make it faster? Or am I missing something here?
Nested datatype is known for bad performance and on top of it you are using sort which is again a costly operation Please refer this great medium blog of Gojek engineering team on their perf issues with nested docs.
They suggested some optimization which includes changing the schema as well but they have not covered the infra level optimization like tunning the JVM heap size and having the favourable shards and replicas which are backbones of elasticsearch and its worth checking and tunning these infra params as well.
Nested sort will be slower compared to non-nested sort. As the number of nested documents in your index increases - unfortunately, sort will slow down.

nested terms aggregation on object containing a string field

I like to run a nested terms aggregation on string field which is inside an object.
Usually, I use this query
"terms": {
"field": "fieldname.keyword"
}
to enable fielddata
But I am unable to do that for a nested document like this
{
"nested": {
"path": "objectField"
},
"aggs": {
"allmyaggs": {
"terms": {
"field": "objectField.fieldName.keyword"
}
}
}
}
The above query is just returning an empty buckets array
Is there a way this can be done without enabling field-data by default during index mapping.
Since that will take a large heap memory and I have already loaded a huge data without it
document mapping
{
"mappings": {
"properties": {
"productname": {
"type": "nested",
"properties": {
"productlineseqno": {
"type": "text"
},
"invoiceitemname": {
"type": "text"
},
"productlinename": {
"type": "text"
},
"productlinedescription": {
"type": "text"
},
"isprescribable": {
"type": "boolean"
},
"iscontrolleddrug": {
"type": "boolean"
}
}
}
sample document
{
"productname": [
{
"productlineseqno": "1.58",
"iscontrolleddrug": "false",
"productlinename": "Consultations",
"productlinedescription": "Consultations",
"isprescribable": "false",
"invoiceitemname": "invoice name"
}
]
}
Fixed
By changing the mapping to enable field data
Nested query is used to access nested fields similarly nested aggregation is needed to aggregation on nested fields
{
"aggs": {
"fieldname": {
"nested": {
"path": "objectField"
},
"aggs": {
"fields": {
"terms": {
"field": "objectField.fieldname.keyword",
"size": 10
}
}
}
}
}
}
EDIT1:
If you are searching for productname.invoiceitemname.keyword then it will give empty bucket as no field exists with that name.
You need to define your mapping like below
{
"mappings": {
"properties": {
"productname": {
"type": "nested",
"properties": {
"productlineseqno": {
"type": "text"
},
"invoiceitemname": {
"type": "text",
"fields":{ --> note
"keyword":{
"type":"keyword"
}
}
},
"productlinename": {
"type": "text"
},
"productlinedescription": {
"type": "text"
},
"isprescribable": {
"type": "boolean"
},
"iscontrolleddrug": {
"type": "boolean"
}
}
}
}
}
}
Fields
It is often useful to index the same field in different ways for
different purposes. This is the purpose of multi-fields. For instance,
a string field could be mapped as a text field for full-text search,
and as a keyword field for sorting or aggregations:
When mapping is not explicitly provided, keyword fields are created by default. If you are creating your own mapping(which you need to do for nested type), you need to provide keyword fields in mapping, wherever you intend to use them

Elasticsearch: Merge result of aggregation by bucket key

I've indexed entities in Elasticsearch, which occur in my documents. The mapping for the entities looks like the following:
"Entities": {
"properties": {
"EntFrequency": {
"type": "long"
},
"EntId": {
"type": "long"
},
"EntType": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"Entname": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
},
[...]
Furthermore, I use this aggregation query to determine the most-occurring entities:
GET cable/document/_search
{
"size" :0,
"query": {
"match_all": {}
},
"aggs" : {
"entities_agg" : {
"terms" : {
"field" : "Entities.EntId"
}
}
}
}
}
Response
"buckets": [
{
"key": 323644,
"doc_count": 231038
},
[...]
However, some of those entity mentions refer to the same entity e.g. "USA" and "United States" and I do know their ids. How do I merge the buckets and the counts of these duplicates in ES?
I cannot use a client-side solution since there are too many entities and retrieving all of them and merging would be probably too slow for my application. The knowledge about duplicates is acquired through runtime. Thus, I cannot use this knowledge for the initial creation of my ES index.
Thanks for your help and comments!

Elasticsearch Aggregation - Unable to perform aggregation to object

I have a mapping with an inner object as follows:
{
"mappings": {
"_all": {
"enabled": false
},
"properties": {
"foo": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"address": {
"type": "object",
"properties": {
"address": {
"type": "string"
},
"city": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
When I try the following aggregation it does not return any data:
post data:*/foo/_search?search_type=count
{
"query": {
"match_all": {}
},
"aggs": {
"unique": {
"cardinality": {
"field": "address.city"
}
}
}
}
When I try to put field city or address.city, aggregation returns zero but if i put foo.address.city it is then when i get the correct respond by elasticsearch. This also affects kibana behavior
Any ideas why this is happening? I saw there is a mapping refactoring that might affects this. I use elasticsearch version 1.7.1
To add on this if, I use the relative path in a search query as follows it works normally:
"query": {
"filtered": {
"filter": {
"term": {
"address.city": "london"
}
}
}
}
Seems its this same issue.
This is seen when the type name and field name is same.

Elasticsearch queries slow performance

We have setup elasticsearch cluster with 7 nodes. Each node having configuration like 16G RAM, 8 Core cpu, centos 6.
Elasticsearch Version : 1.3.0
Heap Memory is - 9000m
1 Master (Non data)
1 Capable master (Non data)
5 Data node
Having 10 indices, In which one index having 55 million documents [ 254Gi (508Gi with replica) ] size rest all indices having approx 20k documents.
Every 1 seconds there are 5-10 new documents are indexing.
But problem is search is bit slow. Almost taking average of 2000 ms to 5000 ms. Some queries are in 1 secs.
Mapping:
{
"my_index": {
"mappings": {
"product": {
"_id": {
"path": "product_refer_id"
},
"properties": {
"product_refer_id": {
"type": "string"
},
"body": {
"type": "string"
},
"cat": {
"type": "string"
},
"cat_score": {
"type": "float"
},
"compliant": {
"type": "string"
},
"created": {
"type": "integer"
},
"facets": {
"properties": {
"ItemsPerCategoryCount": {
"properties": {
"terms": {
"properties": {
"field": {
"type": "string"
},
"size": {
"type": "long"
}
}
}
}
}
}
},
"fields": {
"type": "string"
},
"from": {
"type": "string"
}
"id": {
"type": "string"
},
"image": {
"type": "string"
},
"lang": {
"type": "string"
},
"main_cat": {
"properties": {
"Technology": {
"type": "double"
}
}
},
"md5_product": {
"type": "string"
},
"post_created": {
"type": "long"
},
"query": {
"properties": {
"bool": {
"properties": {
"must": {
"properties": {
"query_string": {
"properties": {
"default_field": {
"type": "string"
},
"query": {
"type": "string"
}
}
},
"range": {
"properties": {
"main_cat.Technology": {
"properties": {
"gte": {
"type": "string"
}
}
},
"sub_cat.Technology.computers": {
"properties": {
"gte": {
"type": "string"
}
}
}
}
},
"term": {
"properties": {
"product.secondary_cat": {
"type": "string"
}
}
}
}
}
}
},
"match_all": {
"type": "object"
}
}
},
"secondary_cat": {
"type": "string"
},
"secondary_cat_score": {
"type": "float"
},
"size": {
"type": "long"
},
"sort": {
"properties": {
"_uid": {
"type": "string"
}
}
},
"sub_cat": {
"properties": {
"Technology": {
"properties": {
"audio": {
"type": "double"
},
"computers": {
"type": "double"
},
"gadgets": {
"type": "double"
},
"geekchic": {
"type": "double"
}
}
}
}
},
"title": {
"type": "string"
},
"product": {
"type": "string"
}
}
}
}
}
}
We are using Default Analyzer.
Any Suggestion? Does this configuration is not enough?
Looks like the indices can not fit into memory, so there will be some more disk I/O going on. Do you use SSDs? If not you should get some.
Besides this your nodes need more resources (memory, CPU) to handle that index size.
I am a little surprised about the sizes here: ~250 GB for "just" 55 million documents is huge and I don't see you are storing any bigger blobs there (I might be mistaken, its hard to see just from the mapping definition). Maybe you can consider to keep some data not analyzed in case you don't need to query it, but just retrieve it. That would reduce the index size.
Except this I have no other ideas, without knowing all the relevant infrastructure in more detail.
To add to Torsten Engelbrecht's answer, default analyzer might be part of the culprit. This analyzer will index every form of each word as a separate token, meaning that a single verb in a language with complex conjugation can be indexed a dozen times. Also, that degrades the quality of the search results. The same applies if your documents contain formatting information (HTML markup ?).
More, stop words are disabled by default, meaning that each "the", "a"... in english for instance will be indexed as well.
You should consider using localized analyzers (snowball analyzer maybe ?) and stop words for the language used in your documents in order to limit the inverted index size and this way, increase performance.
Also, consider making not_analyzed fields as md5, urls, ids, and other sorts of unsearchable fields.

Resources