How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. Each document has a name field, but a document can now have several names, and the number of names is not fixed: one document may have a single name, another may have 10.
The question is how to organize my index, documents and query so that I can find one document by any of its names.
For example, there's a document with the names "automobile", "automobil" and "自動車". Whenever I query one of these names, I should get this document. Can I create a kind of array of these names and build a query that searches each one? Or is there a more appropriate way to do this?

TL;DR
It feels like you are looking for synonyms.
Solution
In the following example I create an index with a custom text analyzer.
This analyzer treats automobile, automobil and 自動車 as the same token.
PUT /74472994
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": ["synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [ "automobile, automobil, 自動車" ]
}
}
}
}
},
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "synonym"
}
}
}
}
POST /74472994/_doc
{
"name": "automobile"
}
This allows me to run the following requests:
GET /74472994/_search
{
"query": {
"match": {
"name": "automobil"
}
}
}
GET /74472994/_search
{
"query": {
"match": {
"name": "自動車"
}
}
}
Both requests return:
{
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7198386,
"hits": [
{
"_index": "74472994",
"_id": "ROfyhoQBcn6Q8d0DlI_z",
"_score": 1.7198386,
"_source": {
"name": "automobile"
}
}
]
}
}
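To see why both queries hit the same document, here is a minimal Python sketch of what a synonym filter does conceptually. It is a toy model, not Elasticsearch code: the function names are made up for illustration, and it assumes simple whitespace splitting, whereas the real standard tokenizer handles CJK text differently.

```python
# Toy model of a synonym token filter; names here are hypothetical.
SYNONYM_GROUPS = [{"automobile", "automobil", "自動車"}]

def expand(token):
    """Return the token together with every synonym from its group, if any."""
    for group in SYNONYM_GROUPS:
        if token in group:
            return set(group)
    return {token}

def analyze(text):
    """Split on whitespace (assumption), then expand each token."""
    tokens = set()
    for token in text.split():
        tokens |= expand(token)
    return tokens

def matches(query, document):
    """A match query hits when query and document share at least one token."""
    return bool(analyze(query) & analyze(document))

indexed = "automobile"
for query in ["automobil", "自動車", "bicycle"]:
    print(query, matches(query, indexed))
```

Synonyms defined this way are global to the index. If the names are specific to each document, indexing the name field as an array of strings may be the simpler option, since Elasticsearch fields accept arrays of values.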

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I expected it wouldn't be too hard to find resources on how to do this. But in the end, I came across n-grams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for doing this.
This is my current setup, but it is really bad because it matches so many unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user db and users should be able to search for each other, nothing too fancy.
If you want to search for exact usernames, you can use the term query.
A term query returns documents that contain an exact term in the given field. If you have not defined an explicit index mapping, dynamic mapping creates a .keyword sub-field, so query username.keyword rather than username. The keyword sub-field is not analyzed: the whole value is indexed as a single term, so the match is case-sensitive, unlike the text field, which goes through the standard analyzer.
There is no need to use an n-gram tokenizer if you want to search for the exact term.
Adding a working example with index data, index mapping, search query, and search result:
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
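The difference between the two field types can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (ASCII words, whitespace splitting), and the helper names are hypothetical, not part of any Elasticsearch client.

```python
def keyword_terms(value):
    """Toy keyword field: the whole value is stored as one unchanged term."""
    return {value}

def text_terms(value):
    """Toy standard analyzer for simple ASCII input: split and lowercase."""
    return {word.lower() for word in value.split()}

def term_query(indexed_terms, term):
    """A term query does not analyze its input; it needs an identical term."""
    return term in indexed_terms

def match_query(indexed_terms, query):
    """A match query analyzes its input before comparing terms."""
    return bool(text_terms(query) & indexed_terms)

for name in ["Jack", "John"]:
    print(name,
          term_query(keyword_terms(name), "Jack"),   # exact, case-sensitive
          term_query(keyword_terms(name), "jack"),   # wrong case: no hit
          match_query(text_terms(name), "JACK"))     # analyzed: case-insensitive
```

In this model a term query for "jack" misses the keyword value "Jack", which is the usual surprise with exact matching on usernames.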
Edit 1:
To match similar terms, you can use the fuzziness parameter with the match query:
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result will be
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]
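As a rough illustration of why "someting" finds "something", the sketch below computes a plain Levenshtein distance and applies the default AUTO thresholds (no edits for terms of 0 to 2 characters, one edit for 3 to 5, two edits for longer terms). It is a simplification: Elasticsearch by default also counts a transposition of two adjacent characters as a single edit, which this sketch does not model, and the function names are made up for the example.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def auto_max_edits(term):
    """Default AUTO thresholds, based on the length of the query term."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_match(query_term, indexed_term):
    """True when the indexed term is within the allowed number of edits."""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)

print(levenshtein("someting", "something"), fuzzy_match("someting", "something"))
```

Short usernames get little or no tolerance under AUTO, so a two-letter query only matches exactly.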

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find a word like 'food2u' by searching for the keyword 'food'. But I only get results like 'Food Repo', 'Give Food', etc. The field is mapped as 'text', and this is my query:
GET api/_search
{"query": {
"match": {
"Name": {
"query": "food"
}
}
},
"_source":{
"includes":["Name"]
}
}
You get results such as 'Food Repo' and 'Give Food' because a text field uses the standard analyzer when no analyzer is specified. Food Repo is tokenized into food and repo. Similarly, Give Food is tokenized into give and food.
But food2u is tokenized into the single token food2u. Since no token equals "food", the food2u document is not returned.
You need to use an edge_ngram tokenizer to do a partial (prefix) match.
Adding a working example with index data, mapping, search query, and search result:
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"food2u"
}
Search Query:
{
"query": {
"match": {
"name": "food"
}
}
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
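The effect of the edge_ngram settings above (min_gram 4, max_gram 10, letters and digits as token characters) can be approximated in Python. This is only a sketch of the tokenizer's behaviour with a hypothetical helper name, not the actual implementation.

```python
import re

def edge_ngrams(text, min_gram=4, max_gram=10):
    """Emit prefixes of each letter/digit run, from min_gram to max_gram chars."""
    grams = []
    for word in re.findall(r"[^\W_]+", text):   # runs of letters and digits
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:size])
    return grams

print(edge_ngrams("food2u"))   # tokens stored for the document
print(edge_ngrams("food"))     # tokens produced for the query
```

Because the same analyzer runs at search time, the query "food" becomes the token food, which is one of the prefixes indexed for food2u. Two consequences of this configuration follow from the sketch: a query shorter than min_gram (for example "foo") produces no tokens at all, and since the analyzer has no lowercase filter, matching is case-sensitive.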
If you don't want to change the mapping, you can instead use a wildcard query to return the matching documents:
{
"query": {
"wildcard": {
"Name": {
"value": "food*"
}
}
}
}
Alternatively, you can use a query_string query with a wildcard:
{
"query": {
"query_string": {
"query": "food*",
"fields": [
"Name"
]
}
}
}
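A wildcard pattern is compared against the terms stored in the index, which for a text field with the standard analyzer are lowercased words. The sketch below imitates that with Python's fnmatch module; the tokenization is a simplification and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

def standard_terms(text):
    """Rough stand-in for the standard analyzer: split and lowercase."""
    return [word.lower() for word in text.split()]

def wildcard_hits(pattern, documents):
    """Keep documents that have at least one term matching the pattern."""
    return [doc for doc in documents
            if any(fnmatchcase(term, pattern) for term in standard_terms(doc))]

names = ["food2u", "Food Repo", "Give Food", "Seafood"]
print(wildcard_hits("food*", names))
```

In this model "Seafood" is not returned, because the pattern food* only matches terms that begin with food.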

Search in Elasticsearch for a string containing the "not" keyword

I am using Elasticsearch on AWS (version 7.9) and I am trying to distinguish between two strings.
My main goal is to split the search results into "Found" and "Not found".
The general question is how to search for the "not" keyword.
Two example messages are shown below.
"CachingServiceOne:Found in cache - Retrieve."
"CachingServiceThree:Not found in cache - Create new."
You can use an ngram tokenizer to search for "Not" in the "title" field. Note that the analyzer below has no lowercase filter, so the search is case-sensitive.
Adding a working example with index data, mapping, search query, and search result:
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title":"CachingServiceThree:Not found in cache - Create new."
}
{
"title":"CachingServiceOne:Found in cache - Retrieve."
}
Search Query:
{
"query":{
"match":{
"title":"Not"
}
}
}
Search Result:
"hits": [
{
"_index": "67093372",
"_type": "_doc",
"_id": "2",
"_score": 0.6720003,
"_source": {
"title": "CachingServiceThree:Not found in cache - Create new."
}
}
]
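A small Python approximation of the ngram tokenizer above (min_gram 3, max_gram 5, letters and digits as token characters) shows which grams end up in the index. It is a sketch with a hypothetical helper name, not the real tokenizer.

```python
import re

def ngrams(text, min_gram=3, max_gram=5):
    """Emit every substring of 3 to 5 chars from each letter/digit run."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams

not_found = "CachingServiceThree:Not found in cache - Create new."
found = "CachingServiceOne:Found in cache - Retrieve."

print("Not" in ngrams(not_found), "Not" in ngrams(found))
print(sorted(ngrams("Found") & ngrams(not_found)))
```

In this model the gram Not exists only in the "Not found" message. The reverse query is less clean: a match query for "Found" is itself split into grams, and some of them (oun, und, ound) also occur in "found" inside the "Not found" message, so both documents would be expected to match.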
Well, the problem was indeed the way the default analyzer works, not that I could not search for the word "not". That is why I accepted the answer. But I would like to add another, simpler take.
The default analyzer does not split words on ":".
That means we have to search for title:CachingServiceThree\:Not,
where title is the field name and the ":" must be escaped as "\:".
What did the trick was title:*\:Not and title:*\:Found, using the KQL syntax.
The wildcard fetched everything. I wonder whether using an array of all the actual values would be quicker.
According to the Inspect panel, this translates into:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"query_string": {
"fields": [
"title"
],
"query": "*\\:Not"
}
}
],
"minimum_should_match": 1
}
}
]
}
}
}

how to switch on the elasticsearch stemming

I don't know how to turn on English word stemming in Elasticsearch. I am sorry, but I could not find a clear example of how to do that.
Here is what I did:
Creating the index
PUT /staff/list/ -d
{
"settings" : {
"analysis": {
"analyzer": {
"standard": {
"type": "standard"
}
}
}
}
}
Adding document
PUT /staff/list/jason
{
"Title" : "searches"
}
When I search for search
GET /staff/list/_search?q=search
no result appears.
What index settings do I need to make stemming work?
Many thanks in advance
Please note that the default Elasticsearch analyzer does not support stemming.
To support stemming, you can create a custom analyzer.
Here is how you do it:
Create the index and define an analyzer called my_analyzer
PUT /staff
{
"settings" : {
"analysis": {
"filter": {
"filter_snowball_en": {
"type": "snowball",
"language": "English"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"filter_snowball_en"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
Configure a mapping that assigns my_analyzer to the title field of the list type
PUT /staff/_mapping/list
{
"list": {
"properties": {
"title": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
Index documents
PUT /staff/list/jason
{
"title": "searches"
}
PUT /staff/list/debby
{
"title": "searched open"
}
Search and stemmed results
GET staff/list/_search
{
"query": {
"query_string": {
"query": "title:opened"
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "staff",
"_type": "list",
"_id": "debby",
"_score": 1,
"_source": {
"title": "searched open"
}
}]
}
}
As you can see in the search results, the debby document, which contains the term open, was returned although we were searching for opened.
Hope that helps.
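The reason stemming makes this work is that the indexed text and the query are both reduced to the same stems before comparison. The sketch below uses a deliberately crude suffix stripper as a stand-in; it is not the Snowball algorithm, the function names are hypothetical, and it only handles the example words.

```python
def toy_stem(token):
    """Strip a few common English suffixes; a crude stand-in for Snowball."""
    for suffix in ("es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def analyze(text):
    """Whitespace tokenizer, lowercase filter, then the toy stemmer."""
    return [toy_stem(token.lower()) for token in text.split()]

print(analyze("searches"))        # indexed for the jason document
print(analyze("searched open"))   # indexed for the debby document
print(analyze("opened"))          # produced for the query
```

Since the query term opened and the indexed word open both end up as open, the debby document matches.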
When you create the index, you are doing nothing (just re-declaring the standard analyzer).
The standard analyzer is the default that Elasticsearch uses, which doesn't stem any word.
You need to map the fields to their respective analyzers at your index creation (mapping documentation):
PUT /staff
{
"mappings": {
"list": {
"properties": {
"Title": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
I guess the english analyzer fits your case (it uses the standard tokenizer).

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas". On searching, both documents get the same score. I want the document in which casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram Analyzer. Also I am using aggregations so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
I'm assuming the values you gave are in the name field, and that this field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
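The must/should behaviour described above can be mimicked with a small scoring function. This is a sketch only: the scores of 1.0 per clause are arbitrary, the comparison is lowercased as an assumption (a prefix query on a non-analyzed field is case-sensitive in practice), and the function names are made up.

```python
def score(name, query):
    """Return None when the must clause fails, else must score plus should bonus."""
    value = name.lower()               # assumption: case-insensitive comparison
    if not any(word.startswith(query) for word in value.split()):
        return None                    # must clause failed: document excluded
    total = 1.0                        # must clause matched
    if value.startswith(query):
        total += 1.0                   # should clause matched: extra score
    return total

def search(names, query):
    """Rank the documents that satisfy the must clause by descending score."""
    hits = [(score(name, query), name) for name in names]
    hits = [hit for hit in hits if hit[0] is not None]
    return [name for _, name in sorted(hits, key=lambda hit: -hit[0])]

print(search(["Jalan Casa", "Casa Road", "Main Street"], "cas"))
```

Both Casa documents pass the must clause, but only "Casa Road" earns the should bonus, so it ranks first; "Main Street" is dropped entirely.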
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and, also, the number of terms from the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that is using term_vector set to with_positions, and edgeNGram and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
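The formula inside the script, (wordCount - position) / wordCount, can be checked by hand with a short Python sketch. The helper name is hypothetical, the text is split on whitespace instead of using term vectors, and returning 0.0 when nothing matches is an assumption.

```python
def position_score(text, prefix):
    """Score 1.0 for a match on the first word, decreasing for later words."""
    words = text.lower().split()
    for position, word in enumerate(words):
        if word.startswith(prefix):
            return (len(words) - position) / len(words)
    return 0.0   # assumption: no matching word contributes nothing

for text in ["Casa Road", "Jalan Casa"]:
    print(text, position_score(text, "cas"))
```

The two values differ by 0.5, which is the same gap as between the scores 1.3715843 and 0.8715843 in the results above, consistent with boost_mode sum adding the script value to the same base query score.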
