Elasticsearch queries slow performance

We have set up an Elasticsearch cluster with 7 nodes. Each node has 16G RAM, an 8-core CPU, and runs CentOS 6.
Elasticsearch Version : 1.3.0
Heap Memory is - 9000m
1 Master (Non data)
1 Capable master (Non data)
5 Data node
There are 10 indices. One index has 55 million documents and is 254Gi in size (508Gi with replica); each of the others has approximately 20k documents.
About 5-10 new documents are indexed every second.
The problem is that search is a bit slow: queries take 2000 ms to 5000 ms on average, though some complete in about 1 second.
Mapping:
{
"my_index": {
"mappings": {
"product": {
"_id": {
"path": "product_refer_id"
},
"properties": {
"product_refer_id": {
"type": "string"
},
"body": {
"type": "string"
},
"cat": {
"type": "string"
},
"cat_score": {
"type": "float"
},
"compliant": {
"type": "string"
},
"created": {
"type": "integer"
},
"facets": {
"properties": {
"ItemsPerCategoryCount": {
"properties": {
"terms": {
"properties": {
"field": {
"type": "string"
},
"size": {
"type": "long"
}
}
}
}
}
}
},
"fields": {
"type": "string"
},
"from": {
"type": "string"
},
"id": {
"type": "string"
},
"image": {
"type": "string"
},
"lang": {
"type": "string"
},
"main_cat": {
"properties": {
"Technology": {
"type": "double"
}
}
},
"md5_product": {
"type": "string"
},
"post_created": {
"type": "long"
},
"query": {
"properties": {
"bool": {
"properties": {
"must": {
"properties": {
"query_string": {
"properties": {
"default_field": {
"type": "string"
},
"query": {
"type": "string"
}
}
},
"range": {
"properties": {
"main_cat.Technology": {
"properties": {
"gte": {
"type": "string"
}
}
},
"sub_cat.Technology.computers": {
"properties": {
"gte": {
"type": "string"
}
}
}
}
},
"term": {
"properties": {
"product.secondary_cat": {
"type": "string"
}
}
}
}
}
}
},
"match_all": {
"type": "object"
}
}
},
"secondary_cat": {
"type": "string"
},
"secondary_cat_score": {
"type": "float"
},
"size": {
"type": "long"
},
"sort": {
"properties": {
"_uid": {
"type": "string"
}
}
},
"sub_cat": {
"properties": {
"Technology": {
"properties": {
"audio": {
"type": "double"
},
"computers": {
"type": "double"
},
"gadgets": {
"type": "double"
},
"geekchic": {
"type": "double"
}
}
}
}
},
"title": {
"type": "string"
},
"product": {
"type": "string"
}
}
}
}
}
}
We are using the default analyzer.
Any suggestions? Is this configuration not enough?

It looks like the indices cannot fit into memory, so there will be more disk I/O going on. Do you use SSDs? If not, you should get some.
Besides this, your nodes need more resources (memory, CPU) to handle that index size.
I am a little surprised by the sizes here: ~250 GB for "just" 55 million documents is huge, and I don't see that you are storing any bigger blobs there (I might be mistaken; it's hard to tell from the mapping definition alone). Maybe you can consider keeping some data not analyzed if you don't need to query it, only retrieve it. That would reduce the index size.
Beyond this I have no other ideas without knowing the relevant infrastructure in more detail.
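To put rough numbers on the "cannot fit into memory" point, here is a back-of-the-envelope sketch using only the figures from the question. It assumes shards are spread evenly over the 5 data nodes and that all RAM not given to the heap is available as filesystem cache, which is optimistic.

```python
# Rough sizing check using only the figures stated in the question.
GIB = 1024 ** 3
MIB = 1024 ** 2

data_nodes = 5
index_size_with_replica = 508 * GIB   # "254Gi (508Gi with replica)"
ram_per_node = 16 * GIB               # "16G RAM" (assumed to mean GiB)
heap_per_node = 9000 * MIB            # "Heap Memory is - 9000m"

# Assumption: shards are spread evenly across the data nodes.
index_per_node = index_size_with_replica / data_nodes

# Upper bound on what the OS could use as filesystem cache on one node.
cache_per_node = ram_per_node - heap_per_node

cached_fraction = cache_per_node / index_per_node
print(f"index data per node:     {index_per_node / GIB:.1f} GiB")
print(f"RAM left for page cache: {cache_per_node / GIB:.1f} GiB")
print(f"fraction cacheable:      {cached_fraction:.1%}")
```

Under these assumptions only a small fraction of each node's index data can sit in the page cache at once, so most searches will touch disk.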

To add to Torsten Engelbrecht's answer, the default analyzer might be part of the culprit. This analyzer indexes every form of each word as a separate token, meaning that a single verb in a language with complex conjugation can be indexed a dozen times. That also degrades the quality of the search results. The same applies if your documents contain formatting information (HTML markup?).
Moreover, stop words are disabled by default, meaning that each "the", "a"... in English, for instance, will be indexed as well.
You should consider using localized analyzers (the snowball analyzer, maybe?) and stop words for the language used in your documents in order to limit the inverted index size and thereby increase performance.
Also, consider making fields such as md5 hashes, URLs, ids, and other fields that are never searched as text not_analyzed.
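As an illustration of these suggestions, a revised mapping fragment for the product type might look like the sketch below (ES 1.x string syntax, built as a Python dict so it can be printed as JSON). Which fields really need full-text search, and whether the documents are English, are assumptions here; the question has a lang field, so several languages may be involved. Changing the analyzer of an existing field requires reindexing.

```python
import json

# Hypothetical revised mapping fragment for the "product" type (ES 1.x syntax).
# Field choices are assumptions; adjust them to the queries actually run.
revised_properties = {
    # Full-text fields: a language analyzer stems words and drops stop words.
    "title": {"type": "string", "analyzer": "english"},
    "body": {"type": "string", "analyzer": "english"},
    # Exact-match fields: indexed as a single token, no analysis.
    "md5_product": {"type": "string", "index": "not_analyzed"},
    "product_refer_id": {"type": "string", "index": "not_analyzed"},
    # Retrieved but never searched: not indexed at all.
    "image": {"type": "string", "index": "no"},
}

mapping = {"product": {"properties": revised_properties}}
print(json.dumps(mapping, indent=2))
```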

Related

Multiple concurrent aggregations best practice

I'm considering using Elasticsearch as the backend search engine for a multi-filter utility. For this requirement, multiple aggregation queries will be run against the cluster, and the expected response time is ~5 seconds.
Based on the details below, do you think this approach is valid for my use case?
If yes, what is the suggested cluster sizing?
I will certainly have to increase the default values of parameters such as index.mapping.total_fields.limit and index.mapping.nested_objects.limit.
I would much appreciate feedback on the approach suggested below, and on ways to avoid common pitfalls.
Thanks in advance.
Details
Number of expected documents: ~50m
Number of unique field values (facet_name + facet_value): ~1B
Number of queries per second: ~50 per sec
Mappings:
{
"mappings": {
"properties": {
"customer_id": {
"type": "keyword"
},
"id": {
"type": "keyword"
},
"mi_score_join": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"mi_data": "customer_model"
}
},
"model_id": {
"type": "keyword"
},
"number_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "long"
}
}
},
"score": {
"type": "long"
},
"string_facet": {
"type": "nested",
"properties": {
"facet_name": {
"type": "keyword"
},
"facet_value": {
"type": "keyword"
}
}
}
}
}
}
An example for a document:
{
"id": 33421,
"string_facet":
[
{
"facet_value":"true",
"facet_name": "var_a"
},
{
"facet_value":"dummy_country",
"facet_name": "var_b"
},
{
"facet_value":"dummy_",
"facet_name": "var_c"
},
{
"facet_value":"https://dummy.com/",
"facet_name": "var_d"
},
{
"facet_value":"www.dummy.com",
"facet_name": "var_e"
},
{
"facet_value":"dummy",
"facet_name": "var_f"
}
],
"mi_score_join": "mi_data"
}
An example for an aggregation query to be run:
POST test_index/_search
{
"size":0,
"aggs": {
"facets": {
"nested": {
"path": "string_facet"
},
"aggs": {
"names": {
"terms": { "field": "string_facet.facet_name", "size":???},
"aggs": {
"values": {
"terms": { "field": "string_facet.facet_value" }
}
}
}
}
}
}
}
The "size": ??? will probably be the maximum cardinality across all the terms values.
Filters may be added to the aggregations, based on the filters that are already applied.
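One possible way to add such a filter is sketched below: a small Python helper that builds the request body above and optionally wraps the terms aggregation in a filter aggregation inside the nested context. The helper name, its parameters, and the idea of restricting by facet name are illustrative assumptions; the "size" stays a parameter because the question leaves it open.

```python
import json

def build_facet_aggs(names_size, facet_names=None):
    """Build the nested facet aggregation from the question.

    names_size  -- value for the "size" the question leaves open (hypothetical parameter)
    facet_names -- optional facet names to restrict the buckets to (hypothetical parameter)
    """
    names_agg = {
        "names": {
            "terms": {"field": "string_facet.facet_name", "size": names_size},
            "aggs": {
                "values": {"terms": {"field": "string_facet.facet_value"}}
            },
        }
    }
    if facet_names:
        # Wrap the terms aggregation in a filter aggregation that runs
        # inside the nested context, so it sees the nested facet objects.
        inner = {
            "selected": {
                "filter": {"terms": {"string_facet.facet_name": facet_names}},
                "aggs": names_agg,
            }
        }
    else:
        inner = names_agg
    return {
        "size": 0,
        "aggs": {"facets": {"nested": {"path": "string_facet"}, "aggs": inner}},
    }

body = build_facet_aggs(names_size=10, facet_names=["var_a", "var_b"])
print(json.dumps(body, indent=2))
```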

How to avoid index explosion in ElasticSearch

I have two docs from the same index that originally look like this (only _source value is shown here)
{
"id" : "3",
"name": "Foo",
"property":{
"schemaId":"guid_of_the_RGB_schema_defined_extenally",
"value":{
"R":255,
"G":100,
"B":20
}
}
}
{
"id" : "2",
"name": "Bar",
"property":{
"schemaId":"guid_of_the_HSL_schema_defined_extenally",
"value":{
"H":255,
"S":100,
"L":20
}
}
}
The schema (used for validation of value) is stored outside of ES since it has nothing to do with the indexing.
If I don't define a mapping, the value field will be treated as an object mapping, and its set of subfields will grow whenever a new subfield appears.
Currently, ElasticSearch supports Flattened mapping https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html to prevent this explosion in the index. However it has a limited support for searching for inner field due to its restriction: As with queries, there is no special support for numerics — all values in the JSON object are treated as keywords. When sorting, this implies that values are compared lexicographically.
I need to be able to query the index to find documents matching a given condition (e.g. B in the range [10,30]).
So far I have come up with a solution that structures my doc like this:
{
"id":4,
"name":"Boo",
"property":
{
"guid_of_the_normalized_RGB_schema_defined_extenally":
{
"R":0.1,
"G":0.2,
"B":0.5
}
}
}
Although this does not solve the mapping explosion, it mitigates some other issues.
My mapping for the field property will now look similar to this:
"property": {
"properties": {
"guid_of_the_RGB_schema_defined_extenally": {
"properties": {
"B": {
"type": "long"
},
"G": {
"type": "long"
},
"R": {
"type": "long"
}
}
},
"guid_of_the_normalized_RGB_schema_defined_extenally": {
"properties": {
"B": {
"type": "float"
},
"G": {
"type": "float"
},
"R": {
"type": "float"
}
}
},
"guid_of_the_HSL_schema_defined_extenally": {
"properties": {
"B": {
"type": "float"
},
"G": {
"type": "float"
},
"R": {
"type": "float"
}
}
}
}
}
This solves the case where fields have the same name but different data types.
Can someone suggest a solution that avoids the mapping explosion without suffering from the search limitations of the flattened type?
To avoid mapping explosion, the best solution is to normalize your data better.
You can set "dynamic": "strict" in your mapping; a document will then be rejected if it contains a field that is not already in the mapping.
You can still add new fields afterwards, but you have to add them explicitly to the mapping first.
You can also add a pipeline to clean up and normalize your data before ingestion.
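A minimal sketch of what a strict mapping for the original document shape could look like, together with a plain-Python helper that imitates the check strict mode performs. The field types are guesses from the sample documents, and unknown_fields is a hypothetical helper, not an Elasticsearch API.

```python
# Assumed strict mapping for the first document shape in the question.
strict_mapping = {
    "mappings": {
        "dynamic": "strict",  # inner objects inherit this setting
        "properties": {
            "id": {"type": "keyword"},
            "name": {"type": "keyword"},
            "property": {
                "properties": {
                    "schemaId": {"type": "keyword"},
                    "value": {
                        "properties": {
                            "R": {"type": "long"},
                            "G": {"type": "long"},
                            "B": {"type": "long"},
                        }
                    },
                }
            },
        },
    }
}

def unknown_fields(doc, properties, prefix=""):
    """Return dotted paths present in doc but absent from the mapping properties."""
    missing = []
    for key, value in doc.items():
        path = prefix + key
        if key not in properties:
            missing.append(path)
        elif isinstance(value, dict):
            sub = properties[key].get("properties", {})
            missing.extend(unknown_fields(value, sub, path + "."))
    return missing

hsl_doc = {
    "id": "2",
    "name": "Bar",
    "property": {
        "schemaId": "guid_of_the_HSL_schema_defined_extenally",
        "value": {"H": 255, "S": 100, "L": 20},
    },
}
rejected_because = unknown_fields(hsl_doc, strict_mapping["mappings"]["properties"])
print(rejected_because)
```

With this mapping the HSL document would be rejected until H, S and L are added to the mapping explicitly.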
If you don't want to, or cannot, reindex:
To keep your query simple even if you don't know the "middle" part of the key, you can use a multi_match query with a wildcard in the field name.
GET myindex/_search
{
"query": {
"multi_match": {
"query": 0.5,
"fields": ["property.*.B"]
}
}
}
But you will still not be able to sort it as you want.
For ordering on multiple 'unknown' field names without touching the data, you can use a script: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html
But maybe you could simplify the whole process by adding a dynamic template to your index.
PUT test/_mapping
{
"dynamic_templates": [
{
"unified_red": {
"path_match": "property.*.R",
"mapping": {
"type": "float",
"copy_to": "unified_color.R"
}
}
},
{
"unified_green": {
"path_match": "property.*.G",
"mapping": {
"type": "float",
"copy_to": "unified_color.G"
}
}
},
{
"unified_blue": {
"path_match": "property.*.B",
"mapping": {
"type": "float",
"copy_to": "unified_color.B"
}
}
}
],
"properties": {
"unified_color": {
"properties": {
"R": {
"type": "float"
},
"G": {
"type": "float"
},
"B": {
"type": "float"
}
}
}
}
}
Then you'll be able to query any value with the same query:
GET test/_search
{
"query": {
"range": {
"unified_color.B": {
"gte": 0.1,
"lte": 0.6
}
}
}
}
For already existing fields, you'll have to add the copy_to to the mapping yourself, and then run an _update_by_query to populate them.
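To see what the dynamic templates above do to a document, here is a plain-Python imitation (not Elasticsearch code): it matches each field path against the same path_match patterns and collects the values under the copy_to targets, then applies the range from the query. Using fnmatch for the wildcard matching is an approximation of Elasticsearch's pattern matching.

```python
from fnmatch import fnmatchcase

def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

# Same patterns and copy_to targets as the dynamic templates above.
TEMPLATES = {
    "property.*.R": "unified_color.R",
    "property.*.G": "unified_color.G",
    "property.*.B": "unified_color.B",
}

def unified_values(doc):
    """Collect values per copy_to target, imitating what the templates index."""
    out = {}
    for path, value in flatten(doc):
        for pattern, target in TEMPLATES.items():
            if fnmatchcase(path, pattern):
                out.setdefault(target, []).append(float(value))
    return out

doc = {
    "id": 4,
    "name": "Boo",
    "property": {
        "guid_of_the_normalized_RGB_schema_defined_extenally": {
            "R": 0.1, "G": 0.2, "B": 0.5
        }
    },
}
values = unified_values(doc)
# Same bounds as the range query on unified_color.B.
matches_range = any(0.1 <= v <= 0.6 for v in values.get("unified_color.B", []))
print(values, matches_range)
```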

Elastic Search sorting very slow on large datasets

Sorting in Elasticsearch was very fast when I had less data, but now that the data has grown into gigabytes, sorting on some fields is very slow. Sorting on normal fields takes < 1 second, but for fields with the mapping below it takes > 10 seconds, sometimes more.
I am unable to figure out why. Can anyone help me with this?
Mapping:
"newFields": {
"type": "nested",
"properties": {
"group": { "type": "keyword" },
"fieldType": { "type": "keyword" },
"name": { "type": "keyword" },
"stringValue": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "sort_normalizer"
}
}
},
"longValue": {
"type": "long"
},
"doubleValue": {
"type": "float"
},
"booleanValue": {
"type": "boolean"
}
}
}
Query:
{
"index": "transactions-read",
"body": {
"query": {
"bool": { "filter": { "bool": { "must": [{ "match_all": {} }] } } }
},
"sort": [
{
"newFields.intValue": {
"order": "desc",
"nested": {
"path": "newFields",
"filter": { "match": { "newFields.name": "johndoe" } }
}
}
}
]
},
"from": 0,
"size": 50
}
So is there any way to make it faster? Or am I missing something here?
The nested datatype is known for poor performance, and on top of that you are sorting, which is itself a costly operation. Please refer to the Gojek engineering team's Medium blog post on their performance issues with nested documents.
They suggest some optimizations, including schema changes, but they do not cover infrastructure-level optimization such as tuning the JVM heap size and choosing a suitable number of shards and replicas. These are the backbone of Elasticsearch, so it is worth checking and tuning them as well.
Nested sort is slower than non-nested sort, and unfortunately it slows down further as the number of nested documents in your index increases.

Nested query in ElasticSearch - two levels

I have the following mapping:
"c_index": {
"aliases": {},
"mappings": {
"an": {
"properties": {
"id": {
"type": "string"
},
"sm": {
"type": "nested",
"properties": {
"cr": {
"type": "nested",
"properties": {
"c": {
"type": "string"
},
"e": {
"type": "long"
},
"id": {
"type": "string"
},
"s": {
"type": "long"
}
}
},
"id": {
"type": "string"
}
}
}
}
}
}
}
And I need a query that gives me all the cr's when:
an.id == x and sm.id == y
I tried with:
{"query":{"bool":{"should":[{"terms": {"_id": ["x"]}},
{"nested":{"path": "sm","query":{
"match": {"sm.id":"y"}}}}]}}}
But it runs very slowly and returns more information than I need.
What's the most efficient way to do that? Thank you!
You don't need a nested query here. Also, use filter instead of should if you want to find documents matching all the queries. (The exception would be if you wanted the query to affect the score, as a match query does, which is not the case here; then you could use should with the minimum_should_match option.)
{
"query": {
"bool": {
"filter": [
{ "term": { "_id": "x" } },
{ "term": { "sm.id": "y" } }
]
}
}
}
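The difference between should and filter can be imitated in a few lines of plain Python (a toy model with made-up documents, not Elasticsearch code). With only should clauses, a document matches if at least one clause matches; with filter, every clause must match, which is why the original query returned more than expected.

```python
# Toy documents: an _id plus the set of sm.id values each one contains (made up).
docs = {
    "d1": {"_id": "x", "sm_ids": {"y", "z"}},
    "d2": {"_id": "x", "sm_ids": {"z"}},
    "d3": {"_id": "w", "sm_ids": {"y"}},
}

# The two conditions from the question: an.id == x and sm.id == y.
clauses = [
    lambda d: d["_id"] == "x",
    lambda d: "y" in d["sm_ids"],
]

# bool/should with no other clauses: at least one clause must match (OR).
should_hits = sorted(k for k, d in docs.items() if any(c(d) for c in clauses))

# bool/filter: every clause must match (AND), and no scoring is involved.
filter_hits = sorted(k for k, d in docs.items() if all(c(d) for c in clauses))

print("should:", should_hits)
print("filter:", filter_hits)
```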

Elasticsearch aggregation performance takes a hit on relatively small dataset

We have a cluster of 3 Linux VMs (each machine has 2 cores, 8GB of RAM per core) where we have deployed an Elasticsearch 2.1.1 cluster with the default configuration. Store size is ~50GB for ~3M documents, so arguably fairly modest. We index documents ranging in size from tweets to blog posts. For each document, we extract "entities" from the text (e.g., if the string "Barack Obama" appears in a document, we locate its character position and classify it into an entity type, in this case the type "person" or "statesman") before indexing the document alongside its array of extracted entities.
Our mapping is as follows:
{
"mappings": {
"_default_": {
"_all": { "enabled": "false" },
"dynamic": false
},
"document": {
"properties": {
"body": { "type": "string", "index": "analyzed", "analyzer": "english" },
"timestamp": { "type": "date", "index":"not_analyzed" },
"author": {
"properties": {
"name": { "type": "string", "index": "not_analyzed" }
}
},
"entities": {
"type": "nested",
"include_in_parent": true,
"properties": {
"text": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string", "index": "analyzed", "analyzer": "path" },
"start": { "type": "integer", "index":"not_analyzed", "doc_values": false },
"stop": { "type": "integer", "index":"not_analyzed", "doc_values": false }
}
}
}
}
}
}
Path analyzer is used on the entity type field (entity types are based on some hierarchical taxonomy, so the type is represented as a path-like string). The only other analyzed field is the body of the document. For some reason that I could expand on if necessary, we have to index the entities as nested types, though we are still including them in the parent document.
There are on average ~10 entities extracted per document, so ~30M entities in total. The cardinality for the entities field is thus fairly high (~2M unique values).
Our problem is that some of the aggregations we are doing are very slow (>30s). In particular, the following two aggregations:
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"terms": { "field": "entities.text", "size": 50 }
}
}
}
And the same one, just replacing 'terms' aggregation with 'significant_terms':
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"significant_terms": { "field": "entities.text", "size": 50 }
}
}
}
My questions:
Why are these aggregations prohibitively slow?
Is there something stupid/inefficient in the mapping strategy?
Does indexing the entities as a nested document while still keeping them in the parent document have an impact?
Is it simply that the cardinality of the entities field is just too big and Elasticsearch is not magic?

Resources