How to find the word 'food2u' by searching for 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find a word like 'food2u' by searching for the keyword 'food'. But I can only get results like 'Food Repo', 'Give Food', etc. The field's mapping is 'text', and this is my query:
GET api/_search
{
  "query": {
    "match": {
      "Name": {
        "query": "food"
      }
    }
  },
  "_source": {
    "includes": ["Name"]
  }
}

You are getting results like 'Food Repo' and 'Give Food' because a text field uses the standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo; similarly, Give Food gets tokenized into give and food.
But food2u gets tokenized into the single token food2u. Since there is no matching token ("food"), you will not get the food2u document.
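You can see this with the _analyze API (a quick check; the standard analyzer is built in, so no index setup is needed):
POST _analyze
{
  "analyzer": "standard",
  "text": "food2u"
}
This returns the single token food2u, which is why a match query for food finds nothing.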
You need to use the edge_ngram tokenizer to do a partial text match.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
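Once the index is created, you can verify what the analyzer emits (a quick check, assuming the index was created as 67552800, the name shown in the search result below):
POST /67552800/_analyze
{
  "analyzer": "my_analyzer",
  "text": "food2u"
}
This should return the tokens food, food2, and food2u, so the query term food now has a token to match.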
Index Data:
{
"name":"food2u"
}
Search Query:
{
  "query": {
    "match": {
      "name": "food"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
If you don't want to change the mapping, you can use a wildcard query to return the matching documents:
{
  "query": {
    "wildcard": {
      "Name": {
        "value": "food*"
      }
    }
  }
}
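This works even though Name is a text field: the standard analyzer indexed food2u as a single token, and the wildcard pattern food* matches that token.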
Or you can use a query_string query with a wildcard:
{
  "query": {
    "query_string": {
      "query": "food*",
      "fields": [
        "Name"
      ]
    }
  }
}

Related

How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. These documents have the field name. But now my documents can have several names, and the number of names a document can have is uncertain. A document can have only one name, or it can have 10 names.
The question is: how do I organize my index, documents, and query so that I can search for one document by different names?
For example, there's a document with the names "automobile", "automobil", and "自動車". Whenever I query one of these names, I should get this document. Can I create some kind of array of these names and build a query that searches each one? Or is there a more appropriate way to do this?
TL;DR
It feels like you are looking for synonyms.
Solution
In the following example I am creating an index with a specific text analyser.
This analyser handles automobile, automobil, and 自動車 as the same token.
PUT /74472994
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": ["automobile, automobil, 自動車"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "synonym"
      }
    }
  }
}
POST /74472994/_doc
{
"name": "automobile"
}
This allows me to perform the following requests:
GET /74472994/_search
{
  "query": {
    "match": {
      "name": "automobil"
    }
  }
}
GET /74472994/_search
{
  "query": {
    "match": {
      "name": "自動車"
    }
  }
}
And always get:
{
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.7198386,
    "hits": [
      {
        "_index": "74472994",
        "_id": "ROfyhoQBcn6Q8d0DlI_z",
        "_score": 1.7198386,
        "_source": {
          "name": "automobile"
        }
      }
    ]
  }
}
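You can also inspect how the analyser expands a value with the _analyze API (a quick sanity check against the same index):
GET /74472994/_analyze
{
  "analyzer": "synonym",
  "text": "automobil"
}
The token stream should contain all three synonyms at the same position, which is why any one of them matches the stored document.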

Search in Elasticsearch for a string containing the "not" keyword

I am using Elasticsearch on AWS (version 7.9) and I am trying to distinguish between two strings.
My main target is to split the search results into "Found" and "Not found".
The generic question is how to search for the "not" keyword.
You can see two example messages below.
"CachingServiceOne:Found in cache - Retrieve."
"CachingServiceThree:Not found in cache - Create new."
You can use the ngram tokenizer to search for "not" in the title field.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
"title":"CachingServiceThree:Not found in cache - Create new."
}
{
"title":"CachingServiceOne:Found in cache - Retrieve."
}
Search Query:
{
  "query": {
    "match": {
      "title": "Not"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67093372",
"_type": "_doc",
"_id": "2",
"_score": 0.6720003,
"_source": {
"title": "CachingServiceThree:Not found in cache - Create new."
}
}
]
Well, the problem is indeed the way the default analyzer works, not the fact that the word not could not be searched for. That is why I accepted the answer. But I would like to add another take, for the sake of simplicity.
The default analyzer does not split words on :.
That means we have to search for title:CachingServiceThree\:Not,
where title is the field name and : must be escaped as \:.
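You can confirm this with the _analyze API (a quick check; the standard tokenizer keeps letter:letter sequences together):
POST _analyze
{
  "analyzer": "standard",
  "text": "CachingServiceThree:Not found in cache - Create new."
}
The first token should come back as cachingservicethree:not rather than as two separate tokens.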
What did the trick was title:*\:Not and title:*\:Found using the KQL syntax.
Using the wildcard did the trick to fetch everything. I am wondering whether using an array of all the actual values would be quicker.
That translated through the Inspect panel into:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "fields": [
                    "title"
                  ],
                  "query": "*\\:Not"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

What types are best for an Elasticsearch "KEYWORDS" (like hashtags) field?

I want to make an Elasticsearch index for KEYWORDS, like hashtags,
and make a synonym filter for the keywords.
I think there are two ways of indexing keywords; the first is to use the keyword type:
{
  "settings": {
    "keywordField": {
      "type": "keyword"
    }
  }
}
If I make an index with League of Legends, maybe this:
{
  "keywordField": ["leagueoflegends", "league", "legends", "lol" /* synonym */]
}
Or the text type:
{
  "settings": {
    "keywordField": {
      "type": "text",
      "analyzer": "lowercase_and_whitespace_and_synonym_analyzer"
    }
  }
}
maybe this:
{
  "keywordField": ["league of legends"]  (synonym: lol => leagueoflegends)
}
If I use the _analyze API on this field, I expect "leagueoflegends", "league", "legends".
The search queries 'lol', 'league of legends', and 'League of Legends' all have to match this field.
Which practice is best?
Adding a working example with index data, mapping, search query, and search result. In the example below, I have taken two synonyms: lol and leagueoflegends.
Index Mapping:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "leagueoflegends, lol"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "filter": [
              "lowercase",
              "synonym_filter"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywordField": {
        "type": "text"
      }
    }
  }
}
Index Data:
{
"keywordField": ["leagueoflegends", "league", "legends"]
}
Search Query:
{
  "query": {
    "match": {
      "keywordField": {
        "query": "lol",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "66872989",
"_type": "_doc",
"_id": "1",
"_score": 0.19363807,
"_source": {
"keywordField": [
"leagueoflegends",
"league",
"legends"
]
}
}
]
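Note that the query passes the analyzer explicitly, since keywordField itself was mapped without one. Alternatively, you could bake the synonyms into the mapping as a search_analyzer so queries don't have to pass an analyzer at all (a sketch reusing the synonym_analyzer defined above):
{
  "mappings": {
    "properties": {
      "keywordField": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}
With this mapping, a plain match query for lol is expanded at search time and still matches leagueoflegends.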

Could I combine wildcard and fulltext search in Elasticsearch?

For example, I have some title data in Elasticsearch like this:
gamexxx_nightmare,
gamexxx_little_guy
Then I input:
game => should search out gamexxx_nightmare and gamexxx_little_guy
little guy => should search out gamexxx_little_guy
First, I think I will use a wildcard to make game match gamexxx; the second is a full-text search?
How do I combine them in one DSL?
While Jaspreet's answer is right, it doesn't combine both requirements in one query DSL, as the OP asked ("How to combine them in one DSL?").
This is an enhancement to Jaspreet's solution: I am not using the wildcard, and I am also avoiding the n-gram analyzer, which is costly (it increases the index size) and requires re-indexing if the requirement changes.
One search query that combines both requirements can be written as below:
Index mapping
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_underscore"  --> note this
          ]
        }
      },
      "char_filter": {
        "replace_underscore": {
          "type": "mapping",
          "mappings": [
            "_ => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index your sample docs
{
"title" : "gamexxx_little_guy"
}
And
{
"title" : "gamexxx_nightmare"
}
Single Search query
{
  "query": {
    "bool": {
      "must": [  --> note this
        {
          "bool": {
            "must": [
              {
                "prefix": {
                  "title": {
                    "value": "game"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "little guy"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Result
{
  "_index": "so-46873023",
  "_type": "_doc",
  "_id": "2",
  "_score": 2.2814486,
  "_source": {
    "title": "gamexxx_little_guy"
  }
}
Important points:
The first part of the query is a prefix query, which matches game in both documents. (This avoids a costly regex.)
The second part allows the full-text search. To enable it, I used a custom analyzer that replaces _ with whitespace, so you don't need expensive n-grams in the index and a simple match query fetches the results.
The query above returns results matching both criteria. You can change the top-level bool clause from must to should if you want to return results matching either criterion, as shown below.
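For example, the match-any variant would look like this (same two sub-queries; only the outer clause changes):
{
  "query": {
    "bool": {
      "should": [
        { "prefix": { "title": { "value": "game" } } },
        { "match": { "title": { "query": "little guy" } } }
      ],
      "minimum_should_match": 1
    }
  }
}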
NGrams have better performance than wildcards. For a wildcard, all documents have to be scanned to see which ones match the pattern. Ngrams break a text into small tokens. For example, Quick Foxes will be stored as [Qui, uic, ick, Fox, oxe, xes], depending on the min_gram and max_gram size.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Query
GET my_index/_search
{
  "query": {
    "match": {
      "text": "little guy"
    }
  }
}
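To see the trigrams the index actually stores, you can run a title through the analyzer (a quick check; with token_chars set to letter and digit, the underscores split the text before the 3-grams are built):
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "gamexxx_little_guy"
}
This should return tokens like gam, ame, mex, ..., lit, itt, ttl, tle, guy.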
If you want to go with the wildcard only, you can search on the not_analyzed (keyword) string. This will handle spaces between words:
"wildcard": {
"text.keyword": {
"value": "*gamexxx*"
}
}

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. Also, I am using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": "cas" }
    }
  }
}
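Wrapped in a complete search request (assuming the field is named name, as in the snippet above):
GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "name": "cas" }
      },
      "should": {
        "prefix": { "name": "cas" }
      }
    }
  }
}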
I'm assuming the values you gave are in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
This might be a bit more involved, but it should work. (Note that this answer targets an older Elasticsearch version: the string field type, index_analyzer, and inline Groovy scripting have since been removed.)
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that uses term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions",
          "index_analyzer": "edgengram_analyzer",
          "search_analyzer": "keyword",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": {
          "min_gram": "2",
          "type": "edgeNGram",
          "max_gram": "30"
        }
      },
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "name_ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "term": {
                "text": {
                  "value": "cas"
                }
              }
            },
            "script_score": {
              "script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
            },
            "boost_mode": "sum"
          }
        }
      ]
    }
  }
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
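To see why the two scores differ by exactly 0.5: the script adds (wordCount - position) / wordCount to the base term score. In "Casa Road", casa sits at position 0 of 2 words, contributing (2 - 0) / 2 = 1.0; in "Jalan Casa" it sits at position 1, contributing (2 - 1) / 2 = 0.5. Both documents share the same base score (0.3715843 here), so the document where the term appears earlier ranks first.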
