No match on document if the search string is longer than the search field - elasticsearch

I am searching for a title that is stored in a document as
"Police diaries : stefan zweig"
When I search for "Police", I get the result.
But when I search for "Policeman", I do not get the result.
Here is the query:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": [
              "title",
              omitted as irrelevant...
            ],
            "query": "Policeman",
            "fuzziness": "1.5",
            "prefix_length": "2"
          }
        }
      ],
      "must": {
        omitted as irrelevant...
      }
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
and here is the mapping
{
  "books": {
    "mappings": {
      "book": {
        "_all": {
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword"
              },
              "sort": {
                "type": "text",
                "analyzer": "to order in another language, (creates a string with symbols)",
                "fielddata": true
              }
            }
          }
        }
      }
    }
  }
}
It should be noted that I have documents with the title "some title"
which get hits if I search for "someone title".
I can't figure out why the police book is not showing up.

There are two parts to your question:
You want the title containing police to match when you search for policeman.
You want to know why documents with the title some title match the query someone title, and based on that you expect the first search to match as well.
Let me first explain why the second query matches and the first one doesn't, and then show how to make the first one work.
Your document containing some title produces the tokens below; you can verify this with the Analyze API.
POST /_analyze
{
  "text": "some title",
  "analyzer": "standard" --> default analyzer for text fields
}
Generated tokens
{
  "tokens": [
    {
      "token": "some",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "title",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Now, when you search for someone title, the match query is analyzed with the same analyzer that was used on the field at index time. It therefore produces the two tokens someone and title, and the match query matches on the title token, which is why that document appears in your search result. Searching for Policeman, by contrast, produces only the single token policeman, which does not match the indexed token police; even fuzzy matching cannot bridge the gap, because policeman is three edits away from police and Elasticsearch supports a maximum fuzziness (edit distance) of 2. You can use the Explain API to verify in detail how a query matches a given document.
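For example, here is a minimal sketch of an Explain API call (the index and type names come from your mapping; the document id 1 is an assumption for illustration):

GET /books/book/1/_explain
{
  "query": {
    "match": {
      "title": "someone title"
    }
  }
}

The response shows exactly which terms matched and how the score was computed.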
How to make the police title match when searching for policeman
You need to make use of the synonym token filter, as shown in the example below.
Index definition
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_filter"
          ],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": ["policeman => police"] --> note this
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "dialog": {
        "type": "text",
        "analyzer": "synonyms"
      }
    }
  }
}
Index sample doc
{
  "dialog": "police"
}
Search query having term policeman
{
  "query": {
    "match": {
      "dialog": {
        "query": "policeman"
      }
    }
  }
}
And search result
"hits": [
{
"_index": "so_syn",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"dialog": "police" --> note source has `police` only.
}
}
]
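You can verify the synonym expansion with the Analyze API (a minimal sketch, assuming the index name so_syn from the search result above):

POST /so_syn/_analyze
{
  "analyzer": "synonyms",
  "text": "policeman"
}

This should return a single token police with type SYNONYM, which is why the stored document matches.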

Related

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find words like 'food2u' by searching for the keyword 'food'. But I can only get results like 'Food Repo', 'Give Food', etc. The field's mapping is 'text', and this is my query:
GET api/_search
{
  "query": {
    "match": {
      "Name": {
        "query": "food"
      }
    }
  },
  "_source": {
    "includes": ["Name"]
  }
}
You are getting results like 'Food Repo' and 'Give Food' because a text field uses the standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo; similarly, Give Food gets tokenized into give and food.
But food2u gets tokenized into the single token food2u. Since there is no matching token ("food"), you will not get the food2u document.
You need to use the edge_ngram tokenizer to do a partial text match.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "name": "food2u"
}
Search Query:
{
  "query": {
    "match": {
      "name": "food"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
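To see why this matches, you can inspect the tokens the custom analyzer produces (a minimal sketch, assuming the index name 67552800 from the search result above):

POST /67552800/_analyze
{
  "analyzer": "my_analyzer",
  "text": "food2u"
}

With min_gram 4 and max_gram 10, this yields the tokens food, food2 and food2u, so the query term food matches the first of them.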
If you don't want to change the mapping, you can even use a wildcard query to return the matching documents
{
  "query": {
    "wildcard": {
      "Name": {
        "value": "food*"
      }
    }
  }
}
Or you can use query_string with a wildcard:
{
  "query": {
    "query_string": {
      "query": "food*",
      "fields": [
        "Name"
      ]
    }
  }
}

how to exclude search words in synonyms filter in elasticsearch

While adding table and tables as a synonym filter in Elasticsearch, I need to filter out the results for table fan. How can I achieve this in Elasticsearch?
Could we build a taxonomy of inclusion and exclusion list filters in the index settings rather than in run-time queries?
GET <indexName>/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "<fieldName>": {
              "query": "table fan", // <======= the operator below is applied between table (& synonyms) and fan (& synonyms)
              "operator": "AND"
            }
          }
        }
      ]
    }
  }
}
You can use the above query to exclude all documents containing both 'table' and 'fan' (and their corresponding synonyms).
Or:
If you want to combine multiple logical operators, e.g. "give me all the documents which contain neither "table fan" nor "ac"", you can use simple_query_string:
GET <indexName>/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "simple_query_string": {
            "query": "(table + fan) | ac", // <=== '+'='and', '|'='or', '-'='not'
            "fields": [
              "<fieldName>" // <==== multiple field names can be used; wildcards are also supported
            ]
          }
        }
      ]
    }
  }
}
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "table, tables"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym_filter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "synonym_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
Analyze API
POST /<indexName>/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "table fan"
}
The following tokens are generated:
{
  "tokens": [
    {
      "token": "table",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "tables",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "fan",
      "start_offset": 6,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
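Note that tables is emitted as a SYNONYM-type token at the same position (0) as table. This is why the must_not clauses in the query below exclude not only the literal table fan but also tables fan and tables and fan.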
Index Data:
{ "title": "table and fan" }
{ "title": "tables and fan" }
{ "title": "table fan" }
{ "title": "tables fan" }
{ "title": "table chair" }
Search Query:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "table"
        }
      },
      "filter": {
        "bool": {
          "must_not": [
            {
              "match_phrase": {
                "title": "table fan"
              }
            },
            {
              "match_phrase": {
                "title": "table and fan"
              }
            }
          ]
        }
      }
    }
  }
}
You can also use a match query in place of the match_phrase query:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "table"
        }
      },
      "filter": {
        "bool": {
          "must_not": [
            {
              "match": {
                "title": {
                  "query": "table fan",
                  "operator": "AND"
                }
              }
            }
          ]
        }
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "synonym",
"_type": "_doc",
"_id": "2",
"_score": 0.06783115,
"_source": {
"title": "table chair"
}
}
]
Update 1:
Could we build a taxonomy of inclusion and exclusion list filters in
settings rather than in run-time queries in Elasticsearch?
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Refer to the ES documentation on mapping to understand what a mapping is used to define.
Also refer to the documentation on dynamic templates, which allow you to define custom mappings that are applied to dynamically added fields.
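As a minimal sketch (my own illustration, not part of the original answer), a dynamic template could apply the synonym_analyzer defined above to every dynamically added string field, so the inclusion/exclusion vocabulary lives in the index definition rather than in each query:

PUT <indexName>
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": ["table, tables"]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "synonym_filter"]
          }
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings_with_synonyms": { // hypothetical template name
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "synonym_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    ]
  }
}

Any new string field indexed into this sketch would then get the synonym analysis automatically.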

ElasticSearch Keyword usage with a prefix search

I have a requirement to be able to search a sentence either complete or by prefix. The UI library (ReactiveSearch) I am using generates the query in this way:
"simple_query_string": {
"query": "\"Louis George Maurice Adolphe\"",
"fields": [
"field1",
"field2",
"field3"
],
"default_operator": "or"
}
I am expecting it to return results for e.g.
Louis George Maurice Adolphe (Roche)
but NOT records containing only partial terms like Louis or George.
Currently, I have the mapping below, but it only returns the record if I search with the complete string Louis George Maurice Adolphe (Roche), not with a prefix like Louis George Maurice Adolphe.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "space_remover": {
          "type": "mapping",
          "mappings": [
            "\\u0020=>"
          ]
        }
      },
      "normalizer": {
        "lower_case_normalizer": {
          "type": "custom",
          "char_filter": [
            "space_remover"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field3": {
          "type": "keyword",
          "normalizer": "lower_case_normalizer"
        }
      }
    }
  }
}
Any guidance on the above is appreciated. Thanks.
You are not using a prefix query, hence you are not getting results for prefix search terms. I used the same mapping and sample doc, but changed the search query, which gives the expected results.
Index mapping
{
  "settings": {
    "analysis": {
      "char_filter": {
        "space_remover": {
          "type": "mapping",
          "mappings": [
            "\\u0020=>"
          ]
        }
      },
      "normalizer": {
        "lower_case_normalizer": {
          "type": "custom",
          "char_filter": [
            "space_remover"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field3": {
        "type": "keyword",
        "normalizer": "lower_case_normalizer"
      }
    }
  }
}
Indexed sample doc
{
  "field3": "Louis George Maurice Adolphe (Roche)"
}
Search query
{
  "query": {
    "prefix": {
      "field3": {
        "value": "Louis George Maurice Adolphe"
      }
    }
  }
}
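Note that in recent Elasticsearch versions the field's normalizer is also applied to the value of term-level queries such as prefix, so Louis George Maurice Adolphe is normalized to louisgeorgemauriceadolphe, which is a prefix of the indexed token louisgeorgemauriceadolphe(roche).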
Search result
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"field3": "Louis George Maurice Adolphe (Roche)"
}
}
]
The underlying issue stems from the fact that you're applying a whitespace remover. What this practically means is that when you ingest your docs:
GET your_index_name/_analyze
{
  "text": "Louis George Maurice Adolphe (Roche)",
  "field": "field3"
}
they're indexed as
{
  "tokens": [
    {
      "token": "louisgeorgemauriceadolphe(roche)",
      "start_offset": 0,
      "end_offset": 36,
      "type": "word",
      "position": 0
    }
  ]
}
So if you intend to use simple_query_string, you may want to rethink your normalizers.
@Ninja's answer fails when you search for George Maurice Adolphe, i.e. when there is no prefix overlap.
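To see why, run that search string through the same normalizer (reusing the field-based _analyze call from above):

GET your_index_name/_analyze
{
  "text": "George Maurice Adolphe",
  "field": "field3"
}

This produces the single token georgemauriceadolphe, which is not a prefix of louisgeorgemauriceadolphe(roche), so the prefix query finds nothing.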

Return only exact matches (substrings) in full text search (elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed:
{title: "Joe Dirt"}
{title: "Meet Joe Black"}
{title: "Tomorrow Never Dies"}
and the search query is "I want to watch the movie Joe Dirt tomorrow"
I want to find results where the full title matches as a substring of the search query. If I use a straight match query, all of these documents will be returned because they all match one of the words. I really just want to return "Joe Dirt" because the title is an exact match substring of the search query.
Is that possible in elasticsearch?
Thanks!
One way to achieve this is as follows:
1) At index time, index the title using the keyword tokenizer.
2) At search time, use the shingle token filter to extract substrings from the query string and match them against the title.
Example:
Index Settings
put test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "substring"
          ]
        },
        "exact": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "substring": {
          "type": "shingle",
          "output_unigrams": true
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "exact"
            }
          }
        }
      }
    }
  }
}
Index Documents
put test/movie/1
{"title": "Joe Dirt"}
put test/movie/2
{"title": "Meet Joe Black"}
put test/movie/3
{"title": "Tomorrow Never Dies"}
Query
post test/_search
{
  "query": {
    "match": {
      "title.raw": {
        "analyzer": "substring",
        "query": "Joe Dirt tomorrow"
      }
    }
  }
}
Result:
"hits": {
  "total": 1,
  "max_score": 0.015511602,
  "hits": [
    {
      "_index": "test",
      "_type": "movie",
      "_id": "1",
      "_score": 0.015511602,
      "_source": {
        "title": "Joe Dirt"
      }
    }
  ]
}
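To see why only Joe Dirt matches, you can run the search phrase through the substring analyzer (a minimal sketch against the test index above):

post test/_analyze
{
  "analyzer": "substring",
  "text": "Joe Dirt tomorrow"
}

With the default shingle size of 2 and output_unigrams set to true, this yields the unigrams joe, dirt and tomorrow plus the shingles joe dirt and dirt tomorrow. Only joe dirt equals a keyword-analyzed title, so only that document matches.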

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. Also, I am using aggregations, so I cannot apply normal sorting after querying.
You can use the Bool Query to boost the items that start with the search query:
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": "cas" }
    }
  }
}
I'm assuming the values you gave are in the name field, and that the field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with a term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed with a script, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need: a mapping that uses term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions",
          "index_analyzer": "edgengram_analyzer",
          "search_analyzer": "keyword",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": {
          "min_gram": "2",
          "type": "edgeNGram",
          "max_gram": "30"
        }
      },
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "name_ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "term": {
                "text": {
                  "value": "cas"
                }
              }
            },
            "script_score": {
              "script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
            },
            "boost_mode": "sum"
          }
        }
      ]
    }
  }
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
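The boost works out as follows: the script adds (wordCount - position) / wordCount for the first occurrence of the term. For Casa Road, casa is at position 0 of 2 words, giving a boost of (2 - 0) / 2 = 1.0; for Jalan Casa it is at position 1, giving (2 - 1) / 2 = 0.5. That 0.5 difference is exactly the gap between the two scores above.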
