Elasticsearch query with fuzziness AUTO not working as expected - elasticsearch

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6 that make for lengths:
0..2: Must match exactly
3..5: One edit allowed
>5: Two edits allowed
However, when I specify low and high distance arguments in the search query, the result is not what I expect.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
"fuzzy_test": {
"mappings": {
"_doc": {
"properties": {
"description": {
"type": "text"
},
"id": {
"type": "keyword"
}
}
}
}
}
}
Inserting a simple document:
{
"id": "1",
"description": "hello world"
}
And the following search query:
{
"size": 10,
"timeout": "30s",
"query": {
"match": {
"description": {
"query": "helqo",
"fuzziness": "AUTO:7,10"
}
}
}
}
I assumed that fuzziness: AUTO:7,10 would mean that for an input term of length <= 6, only documents with an exact match would be returned. However, here is the result of my query:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.23014566,
"hits": [
{
"_index": "fuzzy_test",
"_type": "_doc",
"_id": "OQtUu2oBABnEwrgM3Ejr",
"_score": 0.23014566,
"_source": {
"id": "1",
"description": "hello world"
}
}
]
}
}

This is strange, but it seems that this bug exists only in Elasticsearch 6.6.0. I've tried 6.4.2 and 6.6.2, and both of them work just fine.
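If you want to check which edit distance is actually being applied on a given cluster, one option (a sketch, reusing the index and field from the example above) is the validate API with explain=true, which returns the rewritten Lucene query and shows the ~ edit distance attached to each fuzzy term:
GET fuzzy_test/_validate/query?explain=true
{
  "query": {
    "match": {
      "description": {
        "query": "helqo",
        "fuzziness": "AUTO:7,10"
      }
    }
  }
}
On a version that honors AUTO:7,10, the explanation for the 5-character term helqo should contain no fuzzy expansion at all, so the document above would not be returned.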

Related

Understand Elasticsearch Multivalue Fields

I am trying to understand position_increment_gap as it is explained in the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/guide/current/_multivalue_fields_2.html
I created the same index as in the example and inserted a single document
PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith", "Justin Trudeau"]
}
Then I try a phrase query for Abraham Lincoln and it matches, as expected
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "names",
"_type": "doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"names": [
"john abraham",
"lincoln smith",
"justin trudeau"
]
}
}
]
}
}
The documentation explains that the match occurs because ES produces the token stream john abraham lincoln smith justin trudeau, and it recommends setting a position_increment_gap of 100 so that abraham lincoln will not match unless the query has a slop of 100.
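For reference, the query the documentation suggests for the default gap of 100 looks roughly like the sketch below; a slop larger than the gap lets the phrase span two of the array values:
GET /my_index/groups/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "Abraham Lincoln",
        "slop": 101
      }
    }
  }
}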
I changed the index to have a position_increment_gap of 1 as shown below:
PUT names
{
"mappings": {
"doc": {
"properties": {
"names": {
"type":"text",
"position_increment_gap": 1
}
}
}
}
}
If I'm understanding the documentation, using a gap of 1 should allow me to match "abraham smith". But it doesn't match. Nor does "abraham lincoln", "abraham justin", or "abraham trudeau". "lincoln smith", "john abraham" and "justin trudeau" all continue to match.
I must be misunderstanding the documentation.
Thanks for any suggestions.

GET query based on timestamp

I'm looking for help building a query that will retrieve the number of documents for a given time frame, for example the last 30 minutes.
The documents are syslogs like:
{
"_index": "logstash-2017.01.16",
"_type": "syslog",
"_id": "AVmnIUFGd2leAWt2KJSr",
"_score": 1,
"_source": {
"@timestamp": "2017-01-16T11:54:48.318Z",
"syslog_severity_code": 5,
"syslog_facility": "user-level",
"@version": "1",
"host": "10.0.0.1",
"syslog_facility_code": 1,
"message": "Test Syslog Message",
"type": "syslog",
"syslog_severity": "notice",
"tags": [
"_grokparsefailure"
]
}
}
My idea is to build this query into another script that will check for new items being added to ES.
Use Range Query:
GET index/type/_count
{
"query": {
"range": {
"#timestamp": {
"from": "now-30m",
"to" : "now"
}
}
}
}
This will give output like:
{
"count": 2,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
}
}
where count holds the number of documents matched.
Read more about the Range Query in the Elasticsearch reference documentation.
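If you need the matching documents themselves rather than just the count, the same range filter works in a _search request; the sketch below (the size and sort field are assumptions based on the sample document above) returns the newest documents first:
GET index/type/_search
{
  "size": 100,
  "sort": [
    { "@timestamp": { "order": "desc" } }
  ],
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-30m",
        "lte": "now"
      }
    }
  }
}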

Fuzziness behavior on a match_phrase query

A few days ago I ran into this "problem". I was running a match_phrase query against my index. Everything worked as expected until I did the same search with multi-word nouns (before, I was using single-word nouns, e.g. university). When I made one misspelling the search did not work (nothing was found), but if I removed a word (say, the one that was spelled correctly), the search worked (the document was found).
Here is the example I made:
The settings
PUT index1
{
"mappings": {
"myType": {
"properties": {
"field1": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST index1/myType/1
{
"field1": "Commercial Banks"
}
Case 1: Single noun search
GET index1/myType/_search
{
"query": {
"match": {
"field1": {
"type": "phrase",
"query": "comersial",
"fuzziness": "AUTO"
}
}
}
}
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.19178303,
"hits": [
{
"_index": "index1",
"_type": "myType",
"_id": "1",
"_score": 0.19178303,
"_source": {
"field1": "Commercial Banks"
}
}
]
}
}
Case 2: Multiple noun search
GET index1/myType/_search
{
"query": {
"match": {
"field1": {
"type": "phrase",
"query": "comersial banks",
"fuzziness": "AUTO"
}
}
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
So, in the second case, why am I not finding the document when performing the match_phrase query? Is there something I am missing?
These results make me doubt what I know.
Am I using fuzzy search incorrectly? I'm not sure whether this is a bug, or whether I'm the one who doesn't understand the behavior.
Many thanks in advance for reading my question. I hope you can help me with this.
Fuzziness is not supported in phrase queries.
Currently, ES is silent about it, i.e. it allows you to specify the parameter but doesn't warn you that it is not supported. A pull request (#18322) (related to issue #7764) exists that will remedy this problem. Once merged into ES 5, this query will error out.
In the breaking changes document for 5.0, we can see that this won't be supported:
The multi_match query will fail if fuzziness is used for cross_fields, phrase or phrase_prefix type. This parameter was undocumented and silently ignored before for these types of multi_match.
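If strict term ordering is not essential, one common workaround (a sketch using the example index above, no longer a phrase query) is to combine fuzziness with operator: and, so that every term of the query must fuzzily match somewhere in the field:
GET index1/myType/_search
{
  "query": {
    "match": {
      "field1": {
        "query": "comersial banks",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}
This should match the Commercial Banks document, since each term matches within the allowed number of edits, but it no longer requires the terms to appear next to each other.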

How to filter out elements from an array that don't match the query?

Stack Overflow won't let me post that much example code, so I put it in a gist: the index, its mapping, and a sample document that I insert into the newly created mapping. This is my query:
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
this is the unwanted result I get from previous query
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
and finally the wanted result, i.e. how I want the query response to look:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
}
]
}
}
]
}
}
What should the query look like to achieve the wanted result, where the array field is filtered down to the items that match the query? In other words, all other non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify the result terms matching the query, or possibly roll your own client-side masking using the information returned when you set "explain": true in your query.
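As an illustration of the highlighter route, here is a sketch based on the query above (the deprecated filtered wrapper is dropped; field names are as in the question). It does not remove anything from _source, but each hit gets a highlight section listing only the values that matched, which a client can use to filter the params array itself:
GET products/paramSuggestions/_search
{
  "size": 10,
  "query": {
    "match": {
      "paramName": {
        "query": "col",
        "operator": "and"
      }
    }
  },
  "highlight": {
    "fields": {
      "paramName": {}
    }
  }
}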

Does an empty field in a document take up space in elasticsearch?

Does an empty field in a document take up space in elasticsearch?
For example, in the case below, is the total amount of space used to store the document the same in Case A as in Case B (assuming the field "colors" is defined in the mapping).
Case A
{
"price": 1,
"colors": []
}
Case B
{
"price": 1
}
If you keep the default settings, the original document is stored in the _source field, so there will be a difference: the original document of case A is bigger than that of case B.
Otherwise, there should be no difference: for case A, no term is added to the index for the colors field, as it's empty.
You can use the _size field to see the size of the original document as indexed, which is the size of the _source field:
POST stack
{
"mappings":{
"features":{
"_size": {"enabled":true, "store":true},
"properties":{
"price":{
"type":"byte"
},
"colors":{
"type":"string"
}
}
}
}
}
PUT stack/features/1
{
"price": 1
}
PUT stack/features/2
{
"price": 1,
"colors": []
}
POST stack/features/_search
{
"fields": [
"_size"
]
}
The last query will output this result, which shows that document 2 takes more space than document 1:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "features",
"_id": "1",
"_score": 1,
"fields": {
"_size": 16
}
},
{
"_index": "stack",
"_type": "features",
"_id": "2",
"_score": 1,
"fields": {
"_size": 32
}
}
]
}
}
