Match all partial words in one field - elasticsearch

I have this query:
{
"query": {
"match": {
"tag": {
"query": "john smith",
"operator": "and"
}
}
}
}
With the and operator I managed to return documents where the words "john" and "smith" must both be present in the tag field, in any position and any order. But I need to return documents where all partial words must be present in the tag field, like "joh" and "smit". I tried this:
{
"query": {
"match": {
"tag": {
"query": "*joh* *smit*",
"operator": "and"
}
}
}
}
but nothing is returned. How can I solve this?

You can use the edge_ngram token filter and a bool query with multiple must clauses (following your second example) to get the desired output.
Working example:
Index Def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram", --> note this
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
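To sanity-check the analyzer, you can run it through the _analyze API (so_partial is the index name from the search result below); it should emit the edge n-grams j, jo, joh, john, s, sm, smi, smit, smith:
POST so_partial/_analyze
{
"analyzer": "autocomplete",
"text": "john smith"
}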
Index two sample docs, one which should match and one which shouldn't:
{
"title" : "john bravo" --> show;dn't match
}
{
"title" : "john smith" --> should match
}
Bool search query with must clauses
{
"query": {
"bool": {
"must": [ --> this means both `jon` and `smit` match clause must match, replacement of your `and` operator.
{
"match": {
"title": "joh"
}
},
{
"match": {
"title": "smit"
}
}
]
}
}
}
Search result
"hits": [
{
"_index": "so_partial",
"_type": "_doc",
"_id": "1",
"_score": 1.2840209,
"_source": {
"title": "john smith"
}
}
]
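As a side note, because the field is indexed with edge n-grams and searched with the standard analyzer, your original single match query with the and operator should also work once you drop the wildcards (a sketch against the same index):
{
"query": {
"match": {
"title": {
"query": "joh smit",
"operator": "and"
}
}
}
}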

Related

Elasticsearch searching for exact tags

Let's say I have the following documents
doc1: "blue water"
doc2: "extra blue water"
doc3: "blue waters"
I'm looking for a way to handle the following scenarios
If a user searches for "blue water", I want them to receive doc1 and doc3 (meaning it ignores doc2, and an analyzer stems tokens so that doc3 still matches).
If I use query_string, for example, I receive doc2 as well as doc1 and doc3.
You can use a stemmer along with the percolate query.
Adding a working example with index mapping, index data, search query, and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [
"stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"tags": {
"type": "text",
"analyzer": "my_analyzer"
},
"query": {
"type": "percolator"
}
}
}
}
Index Data (each of these match_phrase queries is indexed as a separate percolator document):
{
"query": {
"match_phrase": {
"tags": {
"query": "blue waters",
"analyzer": "my_analyzer"
}
}
}
}
{
"query": {
"match_phrase": {
"tags": {
"query": "extra blue water",
"analyzer": "my_analyzer"
}
}
}
}
{
"query": {
"match_phrase": {
"tags": {
"query": "blue water",
"analyzer": "my_analyzer"
}
}
}
}
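The stemmer filter is what makes water and waters interchangeable here; you can verify this with the _analyze API (67671916 is the index name from the search result below), which should return the tokens blue and water:
GET 67671916/_analyze
{
"analyzer": "my_analyzer",
"text": "blue waters"
}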
Search Query:
{
"query": {
"percolate": {
"field": "query",
"document": {
"tags": "blue water"
}
}
}
}
Search Result:
"hits": [
{
"_index": "67671916",
"_type": "_doc",
"_id": "3",
"_score": 0.26152915,
"_source": {
"query": {
"match_phrase": {
"tags": {
"query": "blue waters",
"analyzer": "my_analyzer"
}
}
}
},
"fields": {
"_percolator_document_slot": [
0
]
}
},
{
"_index": "67671916",
"_type": "_doc",
"_id": "1",
"_score": 0.26152915,
"_source": {
"query": {
"match_phrase": {
"tags": {
"query": "blue water",
"analyzer": "my_analyzer"
}
}
}
},
"fields": {
"_percolator_document_slot": [
0
]
}
}
]
You could use a prefix search in this case. If you look for blue water, the prefix search will return doc1 and doc3.
For a prefix search:
{
"query": {
"prefix": {
"doc": "blue water"
}
}
}
Here the field name is doc and the prefix value is blue water.
You can have a look at this link.
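Note that prefix is a term-level query and is not analyzed, so against an analyzed text field it is compared to individual tokens rather than the whole phrase. To prefix-match the full tag you would typically target a keyword field; a sketch, assuming the question's tags field also has a tags.keyword sub-field:
{
"query": {
"prefix": {
"tags.keyword": {
"value": "blue water"
}
}
}
}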

Elasticsearch match an exact word in a field of type array

I have consulted a few articles and Stack Overflow questions, but I'm afraid I wasn't able to find anything like my scenario; forgive me if it's a duplicate.
Problem:
I want to match the word "Tennis" against a field that holds an array of sports: ["Football", "Tennis", "Table Tennis", "Basketball"]. The word should be an exact match.
Mapping:
"properties": {
"clubname": {
"type": "text"
},
"sports": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
DOC:
// DOC1
{
"clubname": "Arena 51",
"sports: ["Cricket","Football", "Tennis"]
}
// DOC2
{
"clubname": "Play You",
"sports: ["Cricket","Football", "Table Tennis"]
}
Query:
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [
{
"match": {
"sports": "tennis"
}
}
]
}
}
}
With this query I get both documents, which I believe is the correct behaviour. So how do I make Elasticsearch return only Doc1 when I search for just Tennis?
The answer given by @Opster works fine if you search for Tennis, but if you want a case-insensitive search, you need to create a custom normalizer for the index in the following way:
Index Mapping:
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"char_filter": [],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"sports": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
}
}
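You can check what the normalizer stores with the _analyze API (fd_cb2 is the index name from the search result below); it should return the single token tennis:
GET fd_cb2/_analyze
{
"normalizer": "lowercase_normalizer",
"text": "Tennis"
}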
Search query:
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [
{
"match": {
"sports": "tennis"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "fd_cb2",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"clubname": "Arena 51",
"sports": [
"Cricket",
"Football",
"Tennis"
]
}
}
]
For an exact match you can use the keyword sub-field:
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [
{
"term": {
"sports.keyword": "Tennis"
}
}
]
}
}
}
Note that this will be case sensitive.
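Alternatively, on Elasticsearch 7.10 and later the term query accepts a case_insensitive flag, which avoids reindexing with a normalizer (a sketch; check your cluster version first):
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [
{
"term": {
"sports.keyword": {
"value": "tennis",
"case_insensitive": true
}
}
}
]
}
}
}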

Elastic phrase prefix working, phrase isn't

I am trying to return all documents that contain a string in the userName and documentName fields.
Data:
{
"userName" : "johnwick",
"documentName": "john",
"office":{
"name":"my_office"
}
},
{
"userName" : "johnsnow",
"documentName": "snowy",
"office": {
"name":"Abraham deVilliers"
}
},
{
"userName" : "johnnybravo",
"documentName": "bravo",
"office": {
"name":"blabla"
}
},
{
"userName" : "moana",
"documentName": "disney",
"office": {
"name":"deVilliers"
}
},
{
"userName" : "stark",
"documentName": "marvel",
"office": {
"name":"blabla"
}
}
I can perform an exact string match with:
{
"_source": [ "userName", "documentName"],
"query": {
"multi_match": {
"query": "johnsnow",
"fields": [ "userName", "documentName"]
}
}
}
This successfully returns:
{
"userName" : "johnsnow",
"documentName": "snowy",
"office": {
"name":"Abraham deVilliers"
}
}
If I use type: phrase_prefix with john, I also successfully get 3 results back.
But then I try with:
{
"query": {
"multi_match": {
"query": "ohn", // <---- match all docs that contain 'ohn'
"type": "phrase_prefix"
"fields": [ "userName", "documentName"]
}
}
}
Zero results are returned.
What you are looking for is infix search, and you need an ngram token filter together with a separate search-time analyzer to achieve that.
Complete example with your sample data
Index mapping and setting
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "Ingram", --> note this
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10 --> this you can reduce based on your requirement.
},
"mappings": {
"properties": {
"userName": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"documentName": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
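To confirm that the infix token ohn actually ends up in the index (infix is the index name from the search result below), analyze a sample value; the output should include ohn among the generated n-grams:
POST infix/_analyze
{
"analyzer": "autocomplete",
"text": "johnwick"
}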
Index your sample docs and then use the same search query. I indexed only the first and last docs for brevity, and it returned the first doc:
"hits": [
{
"_index": "infix",
"_type": "_doc",
"_id": "1",
"_score": 5.7100673,
"_source": {
"userName": "johnwick",
"documentName": "john"
}
}
]
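With the n-grams in the index you no longer need phrase_prefix; a plain multi_match on the partial term is enough (a sketch reusing the question's fields):
{
"_source": [ "userName", "documentName"],
"query": {
"multi_match": {
"query": "ohn",
"fields": [ "userName", "documentName"]
}
}
}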

Highlight words with whitespace in Elasticsearch 7.6

I would like to use Elasticsearch highlight to obtain matched keywords found inside a text.
This is my settings/mappings
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _",
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": True
}
}
}
}
I am using a char_filter to search and highlight hyphenated words.
This is my document example:
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "DbBIxXEBL7VGAl98vIRl",
"_score": 1.0,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
}
}
and this is the query I use
{
"query": {
"query_string" : {
"query" : "\"New York\" OR \"Rome\" OR \"Milton-Freewater\"",
"default_field": "description"
}
},
"highlight" : {
"pre_tags" : ["<key>"],
"post_tags" : ["</key>"],
"fields" : {
"description" : {
"number_of_fragments" : 0
}
}
}
}
and this is the output I have
...
"hits": [
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "GrDNz3EBL7VGAl98EITg",
"_score": 0.72928625,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
},
"highlight": {
"description": [
"This is an example text containing some cities like <key>New</key> <key>York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}
]
...
Rome and Milton-Freewater are highlighted correctly; New York is not.
How can I have <key>New York</key> instead of <key>New</key> and <key>York</key>?
There is an open PR regarding this, but I'd suggest the following interim solution:
Add a term_vector setting
PUT test_tokenizer
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"term_vector": "with_positions_offsets",
"fielddata": true
}
}
}
}
Sync a doc
POST test_tokenizer/_doc
{"title":"Best places: New Mexico and Sedro-Woolley","description":"This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"}
Convert your query_string to a bunch of bool-should match_phrases inside the highlight_query and use type: fvh
GET test_tokenizer/_search
{
"query": {
"query_string": {
"query": "'New York' OR 'Rome' OR 'Milton-Freewater'",
"default_field": "description"
}
},
"highlight": {
"pre_tags": [
"<key>"
],
"post_tags": [
"</key>"
],
"fields": {
"description": {
"highlight_query": {
"bool": {
"should": [
{
"match_phrase": {
"description": "New York"
}
},
{
"match_phrase": {
"description": "Rome"
}
},
{
"match_phrase": {
"description": "Milton-Freewater"
}
}
]
}
},
"type": "fvh",
"number_of_fragments": 0
}
}
}
}
yielding
{
"highlight":{
"description":[
"This is an example text containing some cities like <key>New York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}

Implementing search using Elasticsearch

I am currently implementing Elasticsearch in my application. Please assume that "Hello World" is the data we need to search. Our requirement is to get the result by entering "h", "Hello World", or "Hello Worlds" as the keyword.
This is our current query.
{
"query": {
"wildcard" : {
"message" : {
"title" : "h*"
}
}
}
}
By using this we get the right result for the keyword "h". But we also need to get results in the case of small spelling mistakes.
You need to use the english analyzer, which stems tokens to their root form. More info can be found here.
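For example, the english analyzer reduces Worlds to world, which is why Hello Worlds can still match Hello World; you can see the stemming with the _analyze API, which should return the tokens hello and world:
GET _analyze
{
"analyzer": "english",
"text": "Hello Worlds"
}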
I implemented it with your example data, query, and expected results, using an edge n-gram analyzer and a match query.
Index Mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "english"
}
}
}
}
Index document
{
"title" : "Hello World"
}
Search query for h and its result
{
"query": {
"match": {
"title": "h"
}
}
}
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.42763555,
"_source": {
"title": "Hello World"
}
}
]
Search query for Hello Worlds; the same document comes back in the result
{
"query": {
"match": {
"title": "Hello worlds"
}
}
}
Result
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.8552711,
"_source": {
"title": "Hello World"
}
}
]
EdgeNGrams or NGrams perform better than wildcards. For a wildcard, all documents have to be scanned to see which ones match the pattern, whereas n-grams break the text into small tokens up front.
For example, Quick Foxes will be stored as [ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ] depending on the min_gram and max_gram sizes.
Fuzziness can be used to find similar terms:
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Query
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "hello worlds",
"fuzziness": 1
}
}
}
}
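To cover the small-spelling-mistakes requirement from the question, the same pattern tolerates typos; a sketch where helo (one edit away from hello) should still return the Hello World document:
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "helo world",
"fuzziness": "AUTO"
}
}
}
}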
