Search returning different value in fuzzy query - Elasticsearch

I have an Elasticsearch fuzzy query as below:
GET /resume/candidate/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "Tam",
        "fuzziness": 2,
        "max_expansions": 50
      }
    }
  }
}
I have the names Tom, Roy, and Maxwell in my index. The name Tom is matched as expected, but the name Roy also gets returned. How is this happening?
The full names are:
Roy M Lovejoy III
Tom Atwell
Also, if I set the fuzziness to 1, I get no results at all. Shouldn't Tom still be matched, since only one character is different?
Mapping:
{
"resume": {
"aliases": {},
"mappings": {
"candidate": {
"properties": {
"name": {
"type": "text"
}
}
}
}
}
I also have an analyzer defined, but it is not used on the name field.
Analyzer:
"analysis": {
"analyzer": {
"case_insensitive": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}

"Tam" is not a fuzzy match with "roy", it is a match with the middle initial "m", which has an edit distance of 2.
The reason you are not getting a result on "tom" with an edit distance of 1, is because, while your indexed names are being analyzed, and thus lowercased, your query is not. You could lowercase your query, or you could use a fuzzy match query, which would be analyzed:
"query": {
"match": {
"name": {
"query": "Tam",
"fuzziness": "AUTO"
}
}
}
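For reference, a sketch of the first option as well, simply lowercasing the term in the original fuzzy query (index and field names taken from the question):
GET /resume/candidate/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "tam",
        "fuzziness": 1
      }
    }
  }
}
With the term lowercased, "tam" is within a single edit of the indexed term "tom", so a fuzziness of 1 is enough.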

Related

How can I get auto-suggestions for synonyms match in elasticsearch

I'm using the code below, and it does not give "curd" as an auto-suggestion when I type "cu".
But it does match the document with yogurt, which is correct.
How can I get both auto-complete for synonym words and document matches for the same?
PUT products
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonym_graph"
]
}
},
"filter": {
"synonym_graph": {
"type": "synonym_graph",
"synonyms": [
"yogurt, curd, dahi"
]
}
}
}
}
}
}
PUT products/_mapping
{
"properties": {
"description": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
POST products/_doc
{
"description": "yogurt"
}
GET products/_search
{
"query": {
"match": {
"description": "cu"
}
}
}
When you provide a list of synonyms in a synonym_graph filter it simply means that ES will treat any of the synonyms interchangeably. But when they're analyzed via the standard analyzer, only full-word tokens will be produced:
POST products/_analyze?filter_path=tokens.token
{
"text": "yogurt",
"field": "description"
}
yielding:
{
"tokens" : [
{
"token" : "curd"
},
{
"token" : "dahi"
},
{
"token" : "yogurt"
}
]
}
As such, a regular match query won't cut it here because the standard analyzer hasn't provided it with enough context in terms of matchable substrings (n-grams).
In the meantime you can replace match with match_phrase_prefix, which does exactly what you're after: it matches an ordered sequence of characters while taking the synonyms into account:
GET products/_search
{
"query": {
"match_phrase_prefix": {
"description": "cu"
}
}
}
But that, as the query name suggests, is only going to work for prefixes. If you fancy an autocomplete that suggests terms regardless of where the substring matches occur, have a look at my other answer where I talk about leveraging n-grams.
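For reference, a minimal sketch of that n-gram idea applied to this example: expand the synonyms at index time and then feed the tokens through an edge_ngram filter, while keeping a plain lowercase analyzer at search time so the typed prefix is not n-grammed again. All of the names below (products_autocomplete, autocomplete_index, and so on) are mine, not from the original thread:
PUT products_autocomplete
{
  "settings": {
    "analysis": {
      "filter": {
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "yogurt, curd, dahi"
          ]
        },
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "product_synonyms",
            "autocomplete_edge"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
With this index, a plain match query on description for "cu" should find the yogurt document, because "yogurt" is expanded to curd and dahi at index time and then broken into edge n-grams such as cu and cur.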

Could I combine wildcard and fulltext search in Elasticsearch?

For example, I have some titles in Elasticsearch like this:
gamexxx_nightmare,
gamexxx_little_guy
Then I input:
game => should find gamexxx_nightmare and gamexxx_little_guy
little guy => should find gamexxx_little_guy
First I think I will use a wildcard so that game matches gamexxx; the second one is a full-text search?
How do I combine them in one DSL?
While Jaspreet's answer is right, it doesn't combine both requirements in one query DSL, as the OP asked in the question: How to combine them in one DSL?
This is an enhancement of Jaspreet's solution: I am also not using the wildcard, and I even avoid the n-gram analyzer, which is costly (it increases the index size) and requires re-indexing if the requirements change.
A single search query combining both requirements can be written as below:
Index mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"replace_underscore" -->note this
]
}
},
"char_filter": {
"replace_underscore": {
"type": "mapping",
"mappings": [
"_ => \\u0020"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer" : "my_analyzer"
}
}
}
}
Index your sample docs
{
"title" : "gamexxx_little_guy"
}
And
{
"title" : "gamexxx_nightmare"
}
Single Search query
{
  "query": {
    "bool": {
      "must": [ --> note this
        {
          "bool": {
            "must": [
              {
                "prefix": {
                  "title": {
                    "value": "game"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "little guy"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Result
{
"_index": "so-46873023",
"_type": "_doc",
"_id": "2",
"_score": 2.2814486,
"_source": {
"title": "gamexxx_little_guy"
}
}
Important points:
The first part of the query is a prefix query, which matches game in both documents (this avoids a costly wildcard or regex).
The second part allows the full-text search; to enable it, I used a custom analyzer that replaces _ with whitespace, so you don't need expensive n-grams in the index and a simple match query fetches the results.
The query above returns results matching both criteria; you can change the top-level bool clause from must to should if you want to return documents matching either criterion, as sketched below.
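For reference, the either-criterion variant would look like this (only the outer clause changes; a sketch):
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "title": {
              "value": "game"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "little guy"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}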
NGrams have better performance than wildcards. For wildcards, all documents have to be scanned to see which ones match the pattern. Ngrams break a text into small tokens. For example, Quick Foxes will be stored as [ Qui, uic, ick, Fox, oxe, xes ], depending on the min_gram and max_gram size.
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
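To see the tokens this analyzer actually produces (the [ Qui, uic, ick, Fox, oxe, xes ] from the example above), you can run the _analyze API against the index; a quick check, not part of the original answer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Quick Foxes"
}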
Query
GET my_index/_search
{
"query": {
"match": {
"text": "little guy"
}
}
}
If you want to go with the wildcard only, then you can search on a not_analyzed (keyword) field. This will handle spaces between words:
"wildcard": {
"text.keyword": {
"value": "*gamexxx*"
}
}

Elastic search query from SQL statement with multiple WHERE clause

I need an Elasticsearch query based on the following SQL statement:
SELECT * FROM documents
WHERE (doc_name like "%test%" OR doc_type like "%test%" OR doc_desc like "%test%")
AND user_id = 1 AND doc_category = "Utilities"
It depends on your mapping, but you can start working with something like this:
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"user_id": 1
}
},
{
"term": {
"doc_category": "Utilities"
}
}
]
}
},
"query": {
"multi_match": {
"query": "test",
"fields": ["doc_name", "doc_type", "doc_desc"]
}
}
}
}
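Note that the filtered query was removed in Elasticsearch 5.0. On newer versions the same idea is expressed with a bool query, using filter clauses for the exact terms and must for the full-text part; a rough equivalent using the field names from the question:
"query": {
  "bool": {
    "filter": [
      {
        "term": {
          "user_id": 1
        }
      },
      {
        "term": {
          "doc_category": "Utilities"
        }
      }
    ],
    "must": [
      {
        "multi_match": {
          "query": "test",
          "fields": ["doc_name", "doc_type", "doc_desc"]
        }
      }
    ]
  }
}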
Adding to the answer given by jbasko: doing LIKE queries in Elasticsearch very much depends on the mapping of your document fields. For example, if you want the equivalent of LIKE '%test%' in Elasticsearch, you need to use the ngram tokenizer for it:
{
"settings": {
"analysis": {
"analyzer": {
"some_analyzer_name": {
"tokenizer": "some_tokenizer_name"
}
},
"tokenizer": {
"some_tokenizer_name": {
"type": "ngram",
"min_gram": <minimum number of characters>,
"max_gram": <maximum number of characters>,
"token_chars": [
"letter",
"digit"
]
}
}
}
...
and in the mapping of the fields use the analyzer:
"mapping":{
...
"doc_type" : {
"type" :"string",
"analyzer" : "some_analyzer_name"
},
...
"doc_type" : {
"type" :"string",
"analyzer" : "some_analyzer_name"
},
...
}
A short explanation about ngram: this tokenizer breaks the strings in doc_type and the other fields into small consecutive substrings, with lengths between the min_gram and max_gram you defined in the settings.
i.e. an ngram with
min_gram : 1
max_gram : 3
on the string "abcd".
You'll get a collection of terms: 'a', 'ab', 'abc', 'b', 'bc', 'bcd', 'c', 'cd', 'd'. These terms will be used by Elasticsearch to find the correct documents through its inverted index, using a match (or multi_match) query.
For further reading, look up mappings, the ngram tokenizer, and term vectors in the Elasticsearch documentation.
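If you want to see exactly which terms the tokenizer emits, you can run the _analyze API against the analyzer defined above. A sketch, assuming min_gram 1 and max_gram 3 as in the example; documents is just the table name from the SQL statement, reused here as an illustrative index name:
POST /documents/_analyze
{
  "analyzer": "some_analyzer_name",
  "text": "abcd"
}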

Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database, it is a string field. Examples:
000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874
The most common format is just an integer preceded by some number of zeros.
How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:
{
"query": {
"multi_match": {
"query": "65",
"fields": ["serviceOrderNo^2", "_all"],
}
}
}
One way of doing this is to use the Lucene-flavoured regular expression (regexp) query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
"query": {
"regexp":{
"serviceOrderNo": "[0]*65"
}
}
The query_string query also supports a small, more limited set of special regular-expression characters, as well as full Lucene regular expressions; that query would look like this:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html
"query": {
"query_string": {
"default_field": "serviceOrderNo",
"query": "0*65"
}
}
These are fairly simple expressions, both saying: match the character contained in the brackets [0] (i.e. the character 0) repeated an unlimited number of times (*), followed by 65.
If you have the ability to reindex, or haven't indexed your data yet, you can also make this easier on yourself by writing a custom analyzer. Right now, you are using the default analyzer for strings on your serviceOrderNo field. When you index "serviceOrderNo": "000000065", ES interprets this simply as 000000065.
Your custom analyzer could tokenize this field into both "000000065" and "65", using the same regular expression. The benefit of this is that the regex only runs once, at index time, instead of every time you run your query, because ES will search against both "000000065" and "65".
You can also check out the ES website documentation on analyzers.
"settings":{
"analysis": {
"filter":{
"trimZero": {
"type":"pattern_capture",
"patterns":"^0*([0-9]*$)"
}
},
"analyzer": {
"serviceOrderNo":{
"type":"custom",
"tokenizer":"standard",
"filter":"trimZero"
}
}
}
},
"mappings":{
"serviceorderdto": {
"properties":{
"serviceOrderNo":{
"type":"String",
"analyzer":"serviceOrderNo"
}
}
}
}
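As a quick check (not part of the original answer), applying these settings to an index and running _analyze should show both tokens, the original value and the zero-trimmed capture. The index name serviceorders below is just for illustration:
POST /serviceorders/_analyze
{
  "analyzer": "serviceOrderNo",
  "text": "000000065"
}
The response should contain the tokens "000000065" and "65", which is what lets a search for "65" match.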
One way to do this is to use an ngram token filter so that "12345" gets tokenized as:
[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 1234, 2345 ]
[ 12345 ]
When tokenized this way, "65" is a match for "000000065".
To set this up, create a new index that has a custom analyzer that uses an ngram filter:
POST /my-index
{
"mappings": {
"serviceorderdto": {
"properties": {
"serviceOrderNo": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Index some data.
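For example, a couple of hypothetical documents in the same format as the values in the question:
POST /my-index/serviceorderdto
{
  "serviceOrderNo": "000000055"
}
POST /my-index/serviceorderdto
{
  "serviceOrderNo": "WO0000042"
}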
Then run your query:
GET /my-index/_search
{
"query": {
"multi_match": {
"query": "55",
"fields": [
"serviceOrderNo^2",
"_all"
]
}
}
}

Why prefix returns documents without the specific prefix?

I want to return only documents whose name starts with "pizza". This is what I've done:
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name": "pizza"
}
}
}
}
}
But I've got these 3 documents:
{
"name": "Viana Pizza",
"city": "Mashhad",
"address": "Vakil abad",
"foods": ["Pizza"],
"salad": true,
"rate": 5.0
}
{
"name": "Pizza Pizza",
"city": "Mashhad",
"address": "Bahar st",
"foods": ["Pizza"],
"salad": true,
"rate": 8.5
}
{
"name": "Reza Pizza",
"city": "Tehran",
"address": "Vali Asr",
"foods": ["Pizza"],
"salad": true,
"rate": 7.5
}
As you can see, only one of them has "pizza" at the beginning of the name field.
What's wrong?
Probably the simplest explanation, given that you didn't provide the actual mapping, is that you have the "name" field as "string" and "analyzed" (the default), which means that "Reza Pizza" is transformed into the terms "reza" and "pizza".
Your filter matches against terms, not against entire fields, because ES analyzes the fields and forms terms when the standard mapping is used.
You need to either change your "name" field to "not_analyzed", or add another field that mirrors "name" but is "not_analyzed". Also, for the lowercase text "pizza" to work in this case, you need to create a custom analyzer.
Below you have the solution with the mirror field:
PUT /pizza
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
}
}
}
And in searching you need to use the mirror field:
GET /pizza/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name.raw": "pizza"
}
}
}
}
}
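To see why this works, you can run the custom analyzer over one of the names (a quick check, not part of the original answer):
GET /pizza/_analyze
{
  "analyzer": "my_keyword_lowercase_analyzer",
  "text": "Viana Pizza"
}
The whole value comes back as the single lowercased token viana pizza, which does not start with pizza, while Pizza Pizza becomes pizza pizza and therefore matches the prefix.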
This all comes down to Elasticsearch analyzers. Let's read the documentation on the prefix filter:
Filters documents that have fields containing terms with a specified prefix (not analyzed).
Here we can see that this filter matches terms, not the whole field value. When you index a document, ES splits your field values into terms using analyzers. The default analyzer splits the value on whitespace and lowercases the parts. So all three results have the term pizza in the name field, and that term perfectly matches the pizza prefix. If you want to match the field value as is, I'd suggest you map the name field as not_analyzed, as sketched below.
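A sketch of that alternative, using the pre-5.x string type that appears elsewhere in this thread (on current versions a keyword field plays the same role):
"name": {
  "type": "string",
  "index": "not_analyzed"
}
Keep in mind that an existing field cannot be switched to not_analyzed in place (you would have to reindex), and that a not_analyzed field is matched case-sensitively, so the prefix would have to be "Pizza" rather than "pizza", which is exactly why the first answer above adds a lowercase keyword analyzer instead.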