ElasticSearch 'range' query returns inappropriate results

Let's take this query:
{
  "timeout": 10000,
  "from": 0,
  "size": 21,
  "sort": [
    {
      "view_avg": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "price": {
              "from": 10,
              "to": 20
            }
          }
        },
        {
          "terms": {
            "category_ids": [
              16405
            ]
          }
        }
      ]
    }
  }
}
Running this query against my data set should return no results (all prices are in the 100s-1000s range). However, it returns results with prices such as:
"price": "1399.00"
"price": "1299.00"
"price": "1089.00"
And so on. Any ideas how I could modify the query so that it returns the correct results?

I'm 99% sure your mapping is wrong and price is declared as a string. Elasticsearch uses different Lucene range queries depending on the field type, as you can see in their documentation. The TermRangeQuery used for string fields behaves exactly like your output: it applies lexicographical ordering (i.e. "1100" falls between "10" and "20").
To test it you can try the following mapping/search:
PUT tests/
PUT tests/test/_mapping
{
  "test": {
    "_source" : { "enabled" : false },
    "_all" : { "enabled" : false },
    "properties" : {
      "num" : {
        "type" : "float", // <-- HERE IT'S A FLOAT
        "store" : "no",
        "index" : "not_analyzed"
      }
    }
  }
}
PUT tests/test/1
{
  "test" : {
    "num" : 100
  }
}
POST tests/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "num": {
              "from": 10,
              "to": 20
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
If you delete the index and recreate it, changing the num type to a string:
PUT tests/test/_mapping
{
  "test": {
    "_source" : { "enabled" : false },
    "_all" : { "enabled" : false },
    "properties" : {
      "num" : {
        "type" : "string", // <-- HERE IT'S A STRING
        "store" : "no",
        "index" : "not_analyzed"
      }
    }
  }
}
You'll see a different result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "tests",
        "_type": "test",
        "_id": "1",
        "_score": 1
      }
    ]
  }
}

price needs to be a numeric field for that range clause to work as expected; if it's a string, documents will match lexicographically. Make sure the mapping is correct: had the field been mapped as float, the query would have worked.
You can check the mapping of the index with GET /index_name/_mapping.
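For instance, if the mapping comes back looking like this, you've found the culprit (a sketch of the pre-2.x response shape; index and type names here are illustrative):
{
  "index_name": {
    "mappings": {
      "product": {
        "properties": {
          "price": {
            "type": "string"
          }
        }
      }
    }
  }
}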
If instead you had the following (with price still a string):
"range": {
"price": {
"from": 30,
"to": 40
}
}
that wouldn't return these docs, because the string "1" sorts before "3" and "4", even though numerically 30 is smaller than 1399.
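The fix is to re-map price as a numeric type, which requires a new index. A minimal sketch, assuming Elasticsearch 2.3+ where the _reindex API exists (index and type names are illustrative; on older versions you'd re-ingest from your source of truth instead):
# create a new index with a numeric price
PUT products_v2
{
  "mappings": {
    "product": {
      "properties": {
        "price": { "type": "float" }
      }
    }
  }
}
# copy the documents over
POST _reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products_v2" }
}
By default Elasticsearch coerces numeric strings like "1399.00" into a float field, so the existing documents reindex cleanly.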

Change field type in index without reindex

First, I had this index template
GET localhost:9200/_index_template/document
And this is the output:
{
  "index_templates": [
    {
      "name": "document",
      "index_template": {
        "index_patterns": [
          "v*-documents-*"
        ],
        "template": {
          "settings": {
            "index": {
              "number_of_shards": "1"
            }
          },
          "mappings": {
            "properties": {
              "firstOperationAtUtc": {
                "format": "epoch_millis",
                "ignore_malformed": true,
                "type": "date"
              },
              "firstOperationAtUtcDate": {
                "ignore_malformed": true,
                "type": "date"
              }
            }
          },
          "aliases": {
            "documents-": {}
          }
        },
        "composed_of": [],
        "priority": 501,
        "version": 1
      }
    }
  ]
}
And my data is indexed; for example:
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "exists": {
            "field": "firstOperationAtUtc"
          }
        }
      ]
    }
  }
}
The output is:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "v2-documents-2021-11-20",
        "_type": "_doc",
        "_id": "9b46d6fe78735274342d1bc539b084510000000455",
        "_score": 1.0,
        "_source": {
          "firstOperationAtUtc": 1556868952000,
          "firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
        }
      }
    ]
  }
}
Next, I need to update the mapping for the field firstOperationAtUtc and remove the epoch_millis format:
PUT localhost:9200/_template/document
{
  "index_patterns": [
    "v*-documents-*"
  ],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": "1"
      }
    },
    "mappings": {
      "properties": {
        "firstOperationAtUtc": {
          "ignore_malformed": true,
          "type": "date"
        },
        "firstOperationAtUtcDate": {
          "ignore_malformed": true,
          "type": "date"
        }
      }
    },
    "aliases": {
      "documents-": {}
    }
  },
  "version": 1
}
After that, if I run the previous request, I still get the indexed data.
But now I need to update the field firstOperationAtUtc, setting its data from firstOperationAtUtcDate:
POST localhost:9200/v2-documents-2021-11-20/_update_by_query
{
  "script": {
    "source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtc = ctx._source.firstOperationAtUtcDate }",
    "lang": "painless"
  },
  "query": {
    "match": {
      "_id": "9b46d6fe78735274342d1bc539b084510000000455"
    }
  }
}
After that, if I run the previous search again:
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "exists": {
            "field": "firstOperationAtUtc"
          }
        }
      ]
    }
  }
}
I get no indexed data:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
But if I search by id, I get the document with the modified data, although the field is now ignored:
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "terms": {
      "_id": [ "9b46d6fe78735274342d1bc539b084510000000455" ]
    }
  }
}
The output is:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "v2-documents-2021-11-20",
        "_type": "_doc",
        "_id": "9b46d6fe78735274342d1bc539b084510000000455",
        "_score": 1.0,
        "_ignored": [
          "firstOperationAtUtc"
        ],
        "_source": {
          "firstOperationAtUtc": "2019-05-03T13:35:52.000Z",
          "firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
        }
      }
    ]
  }
}
How can I fix the indexed data without reindexing? I have billions of documents in the index, and a reindex could cause huge downtime in production.
What you changed is the index template, not your index mapping. The index template is only applied when a new index matching the name pattern is created.
What you want to do is to modify the actual mapping of your index, like this:
PUT test/_mapping
{
  "properties": {
    "firstOperationAtUtc": {
      "ignore_malformed": true,
      "type": "date"
    }
  }
}
However, this won't be possible, and you will get the following error, which makes sense, as you cannot modify an existing field's mapping:
Mapper for [firstOperationAtUtc] conflicts with existing mapper:
Cannot update parameter [format] from [epoch_millis] to [strict_date_optional_time||epoch_millis]
The only reason your update by query seemed to work is that you have "ignore_malformed": true in your mapping. If you remove that parameter and run your update by query again, you'd see the following error:
"type" : "mapper_parsing_exception",
"reason" : "failed to parse field [firstOperationAtUtc] of type [date] in document with id '2'. Preview of field's value: '2019-05-03T13:35:52.000Z'",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "failed to parse date field [2019-05-03T13:35:52.000Z] with format [epoch_millis]",
"caused_by" : {
"type" : "date_time_parse_exception",
"reason" : "date_time_parse_exception: Failed to parse with all enclosed parsers"
}
}
So, to wrap it up, you have two options:
Create a new index with the right mapping and reindex your old index into it, but that doesn't seem like an option for you (see the alias-swap sketch at the end of this answer for how downtime is usually avoided).
Create a new field in your existing index mapping (e.g. firstOperationAtUtcTime) and discard the use of firstOperationAtUtc
The steps would be:
Modify the index template to add the new field
Modify the actual index mapping to add the new field
Run your update by query by modifying the script to write your new field
In short:
# 1. Modify your index template
# 2. Modify your actual index mapping
PUT v2-documents-2021-11-20/_mapping
{
  "properties": {
    "firstOperationAtUtcTime": {
      "ignore_malformed": true,
      "type": "date"
    }
  }
}
# 3. Run update by query again
POST v2-documents-2021-11-20/_update_by_query
{
  "script": {
    "source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtcTime = ctx._source.firstOperationAtUtcDate; ctx._source.remove('firstOperationAtUtc') }",
    "lang": "painless"
  },
  "query": {
    "match": {
      "_id": "9b46d6fe78735274342d1bc539b084510000000455"
    }
  }
}
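If option 1 ever becomes unavoidable, the usual way to avoid downtime is to reindex in the background and then switch an alias atomically, so clients never query a half-built index. A sketch (the v3-... index and the documents-current alias are illustrative names, not from the question):
# reindex into the new, correctly mapped index in the background
POST _reindex
{
  "source": { "index": "v2-documents-2021-11-20" },
  "dest": { "index": "v3-documents-2021-11-20" }
}
# then atomically swap the alias that clients query
POST _aliases
{
  "actions": [
    { "remove": { "index": "v2-documents-2021-11-20", "alias": "documents-current" } },
    { "add": { "index": "v3-documents-2021-11-20", "alias": "documents-current" } }
  ]
}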

Get elasticsearch to ignore diacritics and accents in search hits

I want to search data in Elasticsearch in different languages, and I expect the data to be retrieved whether or not there are diacritics or accents.
For example, I have this data:
POST ابجد/_doc/31
{
  "name": "def",
  "city": "Tulkarem"
}

POST ابجٌد/_doc/31
{
  "name": "def",
  "city": "Tulkarem"
}
PUT /abce
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_ascii_folding"]
        }
      },
      "filter" : {
        "my_ascii_folding" : {
          "type" : "asciifolding",
          "preserve_original" : true
        }
      }
    }
  }
}
The difference between the two indexes is the diacritic in the name.
Trying to get the data:
GET ابجد/_search
I need it to retrieve from both indices, but currently it is returning only this:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ابجد",
        "_id": "31",
        "_score": 1,
        "_source": {
          "name": "def",
          "city": "Tulkarem"
        }
      }
    ]
  }
}
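Note that analyzers (including asciifolding) apply to field values at index and search time, not to index names, so analysis settings alone won't make GET ابجد/_search reach the index named ابجٌد. One workaround, sketched here (the abjad-all alias is a hypothetical name), is to search both indices in one request or to group them under a shared alias:
# search both indices explicitly
GET ابجد,ابجٌد/_search

# or group them under one alias and search that
POST _aliases
{
  "actions": [
    { "add": { "index": "ابجد", "alias": "abjad-all" } },
    { "add": { "index": "ابجٌد", "alias": "abjad-all" } }
  ]
}
GET abjad-all/_search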

Pipeline aggregation with Date histogram doesn’t return expected result

I'm facing an issue using a pipeline aggregation with a date histogram.
I need to filter data from "2019-03-08T06:00:00Z" to "2019-03-09T10:00:00Z" and run a date histogram aggregation on it, then calculate the average of the values produced by a cardinality agg in each bucket:
{
  "size": 0,
  "query": {
    "bool" : {
      "filter": {
        "range" : {
          "recordTime" : {
            "gte" : "2019-03-08T06:00:00Z",
            "lte" : "2019-03-09T10:00:00Z"
          }
        }
      }
    }
  },
  "aggs" : {
    "events_per_bucket" : {
      "date_histogram" : {
        "field" : "eventTime",
        "interval" : "1h"
      },
      "aggs": {
        "cards_per_bucket": {
          "cardinality": {
            "field": "KANBAN_PKKEY.keyword"
          }
        }
      }
    },
    "avg_cards_per_bucket": {
      "avg_bucket": {
        "buckets_path": "events_per_bucket>cards_per_bucket.value"
      }
    }
  }
}
Result:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "events_per_bucket": {
      "buckets": [
        {
          "key_as_string": "2019-03-08T06:00:00.000Z",
          "key": 1552024800000,
          "doc_count": 1,
          "cards_per_bucket": {
            "value": 1
          }
        },
        {
          "key_as_string": "2019-03-08T07:00:00.000Z",
          "key": 1552028400000,
          "doc_count": 0,
          "cards_per_bucket": {
            "value": 0
          }
        },
        {
          "key_as_string": "2019-03-08T08:00:00.000Z",
          "key": 1552032000000,
          "doc_count": 1,
          "cards_per_bucket": {
            "value": 1
          }
        }
      ]
    },
    "avg_cards_per_bucket": {
      "value": 1
    }
  }
}
The problem is: why is the avg value 1? It should be 2/3 = 0.6667.
Why is the bucket with cardinality 0 ignored?
If I remove the cardinality agg and do the avg on doc_count (events_per_bucket>_count), it works fine.
The same thing happens for MAX, MIN, and SUM as well.
Any help would be appreciated!
Thank you.
You should tell the pipeline aggregation what to do in the case of gaps in your buckets, like your bucket with key 1552028400000 (its doc_count is 0, so its metric value is treated as a gap). By default, gaps are skipped, which is why the average comes out as (1 + 1) / 2 = 1. What you want instead is to replace the missing values with zero, giving (1 + 0 + 1) / 3 ≈ 0.667. This can be done by adding the gap_policy parameter to your pipeline aggregation:
...
"avg_cards_per_bucket": {
  "avg_bucket": {
    "buckets_path": "events_per_bucket>cards_per_bucket.value",
    "gap_policy": "insert_zeros"
  }
}
...
More details in the Elastic documentation.
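Putting it together, here is the full request from the question with only the gap_policy line added:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "recordTime": {
            "gte": "2019-03-08T06:00:00Z",
            "lte": "2019-03-09T10:00:00Z"
          }
        }
      }
    }
  },
  "aggs": {
    "events_per_bucket": {
      "date_histogram": {
        "field": "eventTime",
        "interval": "1h"
      },
      "aggs": {
        "cards_per_bucket": {
          "cardinality": {
            "field": "KANBAN_PKKEY.keyword"
          }
        }
      }
    },
    "avg_cards_per_bucket": {
      "avg_bucket": {
        "buckets_path": "events_per_bucket>cards_per_bucket.value",
        "gap_policy": "insert_zeros"
      }
    }
  }
}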

Elasticsearch query does not work with # value

When I execute a simple search query on an email address, it does not return anything unless I remove what follows the "#". Why?
I wish to run fuzzy and autocompletion queries on the emails.
ELASTICSEARCH INFO:
{
  "name" : "ZZZ",
  "cluster_name" : "YYY",
  "cluster_uuid" : "XXX",
  "version" : {
    "number" : "6.5.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "WWW",
    "build_date" : "2018-11-29T23:58:20.891072Z",
    "build_snapshot" : false,
    "lucene_version" : "7.5.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
MAPPING:
PUT users
{
  "mappings": {
    "_doc": {
      "properties": {
        "mail": { "type": "text" }
      }
    }
  }
}
ALL DATA:
[
  { "mail": "firstname.lastname#company.com" },
  { "mail": "john.doe#company.com" }
]
QUERY THAT WORKS:
The term query matches, even though the stored mail is "firstname.lastname#company.com" and not just "firstname.lastname"...
QUERY:
GET users/_search
{ "query": { "term": { "mail": "firstname.lastname" } }}
RETURN:
{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 6, "successful": 6, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 4.336203,
    "hits": [
      {
        "_index": "users",
        "_type": "_doc",
        "_id": "H1dQ4WgBypYasGfnnXXI",
        "_score": 4.336203,
        "_source": {
          "mail": "firstname.lastname#company.com"
        }
      }
    ]
  }
}
QUERY THAT DOES NOT WORK:
QUERY:
GET users/_search
{ "query": { "term": { "mail": "firstname.lastname#company.com" } }}
RETURN:
{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 6, "successful": 6, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
SOLUTION:
Change the mapping (reindex after the mapping change) to use the uax_url_email tokenizer for the mail field.
PUT users
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "mail": { "tokenizer": "uax_url_email" }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "mail": { "type": "text", "analyzer": "mail" }
      }
    }
  }
}
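As a quick sanity check, you can run the new analyzer through the _analyze API. Note the # in this question presumably stands in for a masked @; with a real address, uax_url_email should keep the whole email as a single <EMAIL> token:
GET users/_analyze
{
  "analyzer": "mail",
  "text": "firstname.lastname@company.com"
}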
If you don't specify another tokenizer for your indexed text field, it will use the standard tokenizer, which tokenizes on the # symbol [I don't have a source on this, but there's proof below].
If you use a term query rather than a match query, then that exact term will be searched for in the inverted index (see: elasticsearch match vs term query).
Your inverted index looks like this:
GET users/_analyze
{
  "text": "firstname.lastname#company.com"
}

{
  "tokens": [
    {
      "token": "firstname.lastname",
      "start_offset": 0,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "company.com",
      "start_offset": 19,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
To resolve this you could specify your own analyzer for the mail field, or you could use a match query, which analyzes the searched text in the same way the indexed text was analyzed.
GET users/_search
{
  "query": {
    "match": {
      "mail": "firstname.lastname#company.com"
    }
  }
}

Elasticsearch Histogram of visits

I'm quite new to Elasticsearch and I'm failing to build a histogram based on ranges of visit counts. I'm not even sure this kind of chart can be built with a single Elasticsearch query, but I have the feeling it could be possible with a pipeline aggregation or maybe a scripted aggregation.
Here is a test dataset with which I'm working:
PUT /test_histo
{ "settings": { "number_of_shards": 1 }}

PUT /test_histo/_mapping/visit
{
  "properties": {
    "user": { "type": "string" },
    "datevisit": { "type": "date" },
    "page": { "type": "string" }
  }
}
POST test_histo/visit/_bulk
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Jean","page":"productXX.hmtl","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Robert","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"media_center.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"media_center.html","datevisit":"2015-11-26"}
If we consider the ranges [1,2[, [2,3[, [3, inf.[
The expected result should be :
[1,2[ = 2
[2,3[ = 1
[3, inf.[ = 1
All my efforts to build a histogram showing customer visit frequency have so far been unsuccessful. I would be pleased to get a few tips, tricks, or ideas for solving my problem.
There are two ways you can do it.
The first is doing it in Elasticsearch, which requires a Scripted Metric Aggregation. You can read more about it here.
Your query would look like this:
{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "scripted_metric": {
            "init_script": "_agg['values'] = new java.util.HashMap();",
            "map_script": "if (_agg.values[doc['user'].value]==null) {_agg.values[doc['user'].value]=1} else {_agg.values[doc['user'].value]+=1;}",
            "combine_script": "someHashMap = new java.util.HashMap();for(x in _agg.values.keySet()) {value=_agg.values[x];if(value<3){key='[' + value +',' + (value + 1) + '[';}else{key='[' + value +',inf[';}; if(someHashMap[key]==null){someHashMap[key] = 1}else{someHashMap[key] += 1}}; return someHashMap;"
          }
        }
      }
    }
  }
}
where you can change the time period by setting the date_histogram's interval field to values like day, week, or month.
Your response would look like this:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "value": [
              {
                "[2,3[": 1,
                "[3,inf[": 1,
                "[1,2[": 2
              }
            ]
          }
        }
      ]
    }
  }
}
The second method is to do the work of the scripted_metric on the client side, using the result of a Terms Aggregation. You can read more about it here.
Your query will look like this:
GET test_histo/visit/_search
{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "terms": {
            "field": "user",
            "size": 10
          }
        }
      }
    }
  }
}
and the response will be:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              { "key": "john", "doc_count": 3 },
              { "key": "mary", "doc_count": 2 },
              { "key": "jean", "doc_count": 1 },
              { "key": "robert", "doc_count": 1 }
            ]
          }
        }
      ]
    }
  }
}
where, from the response, you can count how many users fall into each visit-count bucket for each period.
Have a look at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
If you want to show it in a polished, ready-made UI, use Kibana.
A query like this:
GET _search
{
  "query": {
    "match_all": {}
  },
  "aggs" : {
    "visits" : {
      "date_histogram" : {
        "field" : "datevisit",
        "interval" : "month"
      }
    }
  }
}
should give you a histogram. I don't have Elasticsearch here at the moment, so I might have some fat-fingered typos.
Then you could add query terms to only show the histogram for a specific page, or you could have an outer aggregation bucket which aggregates per page or user.
Something like this:
GET _search
{
  "query": {
    "match_all": {}
  },
  "aggs" : {
    "users" : {
      "terms" : {
        "field" : "user"
      },
      "aggs" : {
        "visits" : {
          "date_histogram" : {
            "field" : "datevisit",
            "interval" : "month"
          }
        }
      }
    }
  }
}
Have a look at this solution:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "periods": {
      "filters": {
        "filters": {
          "1-2": {
            "range": {
              "datevisit": {
                "gte": "2015-11-25",
                "lt": "2015-11-26"
              }
            }
          },
          "2-3": {
            "range": {
              "datevisit": {
                "gte": "2015-11-26",
                "lt": "2015-11-27"
              }
            }
          },
          "3-": {
            "range": {
              "datevisit": {
                "gte": "2015-11-27"
              }
            }
          }
        }
      },
      "aggs": {
        "users": {
          "terms": { "field": "user" }
        }
      }
    }
  }
}
Step by step:
Filters aggregation: you can define value ranges for the next aggregation; in this case we define 3 periods based on date range filters.
Nested users aggregation: this aggregation returns as many results as there are filters defined, so in this case you'll get 3 sets of values, one per date range filter.
You'll get a result like this:
{
  ...
  "aggregations" : {
    "periods" : {
      "buckets" : {
        "1-2" : {
          "users" : {
            "buckets" : [
              { "key" : XXX, "doc_count" : NNN },
              { "key" : YYY, "doc_count" : NNN }
            ]
          }
        },
        "2-3" : {
          "users" : {
            "buckets" : [
              { "key" : XXX1, "doc_count" : NNN1 },
              { "key" : YYY1, "doc_count" : NNN1 }
            ]
          }
        },
        "3-" : {
          "users" : {
            "buckets" : [
              { "key" : XXX2, "doc_count" : NNN2 },
              { "key" : YYY2, "doc_count" : NNN2 }
            ]
          }
        }
      }
    }
  }
}
Try it, and tell me if it works.
