Find concatenated words in Elasticsearch - elasticsearch

Say I have indexed this data
song:{
title:"laser game"
}
but the user is searching for
lasergame
How would you go about mapping/indexing/querying for this?

This is kind of a tricky problem.
1) I guess the most effective way might be to use a compound word token filter, with a word list made up of words you think users might concatenate.
"settings": {
"analysis": {
"analyzer": {
"concatenate_split": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"myFilter"
]
}
},
"filter": {
"myFilter": {
"type": "dictionary_decompounder",
"word_list": [
"laser",
"game",
"lean",
"on",
"die",
"hard"
]
}
}
}
}
After applying the analyzer, lasergame will be split into laser and game, with lasergame kept as well, so the search will return results that contain any of those words.
2) Another approach could be to concatenate the whole title, using a pattern replace char filter that removes all the spaces.
{
"index" : {
"analysis" : {
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\\s+",
"replacement":""
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_pattern"]
}
}
}
}
}
You need to use multi-fields with this approach. With this pattern, laser game will be indexed as lasergame and your query will work.
The problem here is that laser game play will be indexed as lasergameplay, and a search for lasergame won't return anything, so you might want to consider a prefix query or a wildcard query for this.
3) This might not make sense, but you could also use a synonym filter if you think users often concatenate certain words.
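As a rough illustration of approaches 1) and 2), here is a small Python sketch that only imitates what the two analyzers do to the text. It is a simplified model, not Elasticsearch code: the function names decompound and strip_whitespace are made up for this example, and the real dictionary_decompounder has extra size limits that are ignored here.

```python
import re

# Word list copied from the dictionary_decompounder settings above.
WORD_LIST = {"laser", "game", "lean", "on", "die", "hard"}


def decompound(token, words=WORD_LIST, min_subword=2):
    """Return the original token followed by every dictionary word found inside it.

    Simplified model of a dictionary decompounder (assumption: no maximum
    subword size and no minimum word size are applied).
    """
    out = [token]
    for start in range(len(token)):
        for end in range(start + min_subword, len(token) + 1):
            if token[start:end] in words:
                out.append(token[start:end])
    return out


def strip_whitespace(text):
    """Imitate the pattern_replace char filter: delete every run of whitespace."""
    return re.sub(r"\s+", "", text)


# Approach 1: the query token is split into its dictionary parts.
print(decompound("lasergame"))  # ['lasergame', 'laser', 'game']

# Approach 2: the indexed title is glued together.
print(strip_whitespace("laser game"))  # lasergame

# The caveat from approach 2: a longer title no longer equals the query,
# but it still starts with it, which is why a prefix query can help.
indexed = strip_whitespace("laser game play")
print(indexed == "lasergame", indexed.startswith("lasergame"))  # False True
```

The last line shows why an exact term match fails for longer titles while a prefix-style match still succeeds.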
Hope this helps!

The easiest solution would be to use nGrams. That gives you a base to start from, which you can tweak to meet your needs. Here you go:
Mappings
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "nGram",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"sample": {
"properties": {
"myField": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
Test document
PUT /test/sample/1
{
"myField": "laser game"
}
Query
GET /test/_search
{
"query": {
"match": {
"myField": "lasergame"
}
}
}
Results
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2161999,
"hits": [
{
"_index": "test",
"_type": "sample",
"_id": "1",
"_score": 0.2161999,
"_source": {
"myField": "laser game"
}
}
]
}
}
This analyzer will create lots of n-grams in your index, such as la, las, lase ... gam, game, and so on, depending on the configured min_gram and max_gram. Both lasergame and laser game will produce many of the same tokens, so the query will find your document as you'd expect.
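To see why the two spellings overlap so heavily, here is a small Python sketch that computes character n-grams for both strings. It is only a model of the idea: char_ngrams is a made-up helper, and the gram sizes of 1 and 2 are an assumption about the tokenizer defaults, so adjust them to whatever your analyzer is configured with.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    """Return the set of lowercase character n-grams of the text.

    Assumption: grams run over the whole string, spaces included,
    which is a simplification of what an n-gram tokenizer emits.
    """
    text = text.lower()
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


indexed = char_ngrams("laser game")   # what the document produces
query = char_ngrams("lasergame")      # what the search text produces

shared = indexed & query
print(sorted(shared))
# Only the gram that spans the missing space is absent from the index.
print(sorted(query - indexed))  # ['rg']
```

Almost every query gram is present in the indexed set, which is why the match query finds the document, and also why n-grams tend to match many loosely related documents.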

Related

How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. These documents have the field name. But now, my documents are able to have several names. And the number of names a document can have is uncertain. A document can have only one name, or there can be 10 names of one document.
The question is, how to organize my index, document and query and then search for 1 document by different names?
For example, there's a document with the names "automobile", "automobil", "自動車". Whenever I query one of these names, I should get this document. Can I create a kind of array of these names and build a query to search for each one? Or is there a more appropriate way to do this?
Tldr;
It feels like you are looking for something like synonyms.
Solution
In the following example I am creating an index with a specific text analyzer.
This analyzer handles automobile, automobil and 自動車 as the same token.
PUT /74472994
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": ["synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [ "automobile, automobil, 自動車" ]
}
}
}
}
},
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "synonym"
}
}
}
}
POST /74472994/_doc
{
"name": "automobile"
}
This allows me to perform the following requests:
GET /74472994/_search
{
"query": {
"match": {
"name": "automobil"
}
}
}
GET /74472994/_search
{
"query": {
"match": {
"name": "自動車"
}
}
}
And always get:
{
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7198386,
"hits": [
{
"_index": "74472994",
"_id": "ROfyhoQBcn6Q8d0DlI_z",
"_score": 1.7198386,
"_source": {
"name": "automobile"
}
}
]
}
}
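The effect of the synonym rule can be pictured with a few lines of Python. This is only a sketch of the idea, under the assumption that every member of a synonym group is mapped to one shared form; the real filter may expand a term into all members of the group instead, and the names build_lookup, analyze and matches are made up for this example.

```python
# Synonym group copied from the index settings above.
SYNONYM_GROUPS = [["automobile", "automobil", "自動車"]]


def build_lookup(groups):
    """Map every term of a group to the group's first term."""
    lookup = {}
    for group in groups:
        canonical = group[0]
        for term in group:
            lookup[term] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYM_GROUPS)


def analyze(text):
    """Lowercase, split on whitespace, then replace synonyms (simplified)."""
    return [LOOKUP.get(token, token) for token in text.lower().split()]


def matches(query, document):
    """A document matches when it shares at least one analyzed token."""
    return bool(set(analyze(query)) & set(analyze(document)))


print(matches("automobil", "automobile"))  # True
print(matches("自動車", "automobile"))      # True
print(matches("bicycle", "automobile"))    # False
```

Because the same analysis runs on the document and on the query, any name in the group finds the document.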

Space handling in Elastic Search

If a document (say a merchant name) that I am searching for has no space in it and the user searches with a space added, the result won't show up in Elasticsearch. How can that be improved to get results?
For example:
Merchant name is "DeliBites"
If the user searches by typing "Deli Bites", the above merchant does not appear in the results. The merchant only appears in the suggestions when I have typed just "Deli", or "Deli" followed by a space, or "Deli."
Adding another option: you can also use n-grams (the example below uses an ngram token filter), which works in most cases and is simple to set up and use.
Working example on your data
Index definition
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index sample doc
{
"title" : "DeliBites"
}
Search query
{
"query": {
"match": {
"title": {
"query": "Deli Bites"
}
}
}
}
And search results
"hits": [
{
"_index": "65489013",
"_type": "_doc",
"_id": "1",
"_score": 0.95894027,
"_source": {
"title": "DeliBites"
}
}
]
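The reason this works is the asymmetry between the index analyzer and the search analyzer. Here is a small Python sketch of that idea; it is a model only, with made-up helper names (ngram_filter, index_tokens, search_tokens), and it splits on whitespace instead of using the real standard tokenizer.

```python
def ngram_filter(token, min_gram=1, max_gram=10):
    """Return all substrings of the token with length min_gram..max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def index_tokens(text):
    """Index side: lowercase each word, then expand it into n-grams."""
    tokens = set()
    for word in text.lower().split():
        tokens |= ngram_filter(word)
    return tokens


def search_tokens(text):
    """Search side: lowercase and split only, no n-grams."""
    return text.lower().split()


indexed = index_tokens("DeliBites")
hits = [token for token in search_tokens("Deli Bites") if token in indexed]
print(hits)  # ['deli', 'bites']
```

Both search words are substrings of the indexed name, so each one matches an indexed gram.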
I suggest using the synonym token filter.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
You should have a dictionary of all the words that you want to be searchable.
Something like this:
DeliBites => Deli Bites
ipod => i pod
Before implementing synonyms, be sure you understand all aspects of them.
https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms

elasticsearch - number of searches affects relevance?

I have the following mapping:
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
I've inserted two docs:
POST music/song
{
"song_field" : "Premeditiated murder"
}
POST music/song
{
"song_field" : "Premeditiated"
}
Here is the query:
POST music/song/_search
{
"size": 10,
"query": {
"match": {
"song_field": {
"query": "Premeditiated murd",
"fuzziness": 2
}
}
}
}
Response:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.78730416,
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUf6XK1ancUpEdFLdz8",
"_score": 0.78730416,
"_source": {
"song_field": "Premeditiated"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUfUbocancUpEdFLdUf",
"_score": 0.668494,
"_source": {
"song_field": "Premeditiated murder"
}
}
]
}
}
I have two questions:
Why is the score of Premeditiated higher? How can I get reasonable correction plus auto-complete?
Does searching for the same document over and over again affect the default ES score?
You get an unexpected result because sorting by relevance is unreliable for very small data sets spread over multiple shards. Relevance is calculated on each shard separately, and the results from each shard are then merged and returned, so your "Premeditiated" document ends up with a higher relevance on its shard. This is a common issue and is well described here: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
There are two ways to solve this issue:
1. Set number_of_shards to 1 when creating the index.
2. Add the following to your search request: search_type=dfs_query_then_fetch
With either of these options you will get the result you want.
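The per-shard effect can be shown with a little arithmetic. The Python sketch below assumes the classic Lucene TF-IDF inverse document frequency, 1 + ln(numDocs / (docFreq + 1)), which is an assumption about the scoring model in use, and the shard layout is hypothetical: it is not the asker's real data, just four tiny documents spread unevenly over two shards.

```python
import math


def classic_idf(num_docs, doc_freq):
    """Inverse document frequency as in classic Lucene TF-IDF scoring."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical layout: documents are spread unevenly over two shards.
shards = {
    "shard_0": ["premeditiated murder", "murder mystery", "mystery train"],
    "shard_1": ["premeditiated"],
}


def doc_freq(docs, term):
    """Count the documents that contain the term as a whole word."""
    return sum(term in doc.split() for doc in docs)


term = "premeditiated"

# Default behaviour: every shard uses only its own statistics.
local_idf = {
    name: classic_idf(len(docs), doc_freq(docs, term))
    for name, docs in shards.items()
}

# dfs_query_then_fetch (or a single shard): statistics cover all documents.
all_docs = [doc for docs in shards.values() for doc in docs]
global_idf = classic_idf(len(all_docs), doc_freq(all_docs, term))

for name, value in local_idf.items():
    print(name, round(value, 3))
print("global", round(global_idf, 3))
```

The same term gets a different weight on each shard, so scores from different shards are not directly comparable until the statistics are shared.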
Regarding your second question: scoring is calculated every time you search. Even if you search for the same document over and over again, the scoring is recalculated and the resulting _score is always the same. If you want to read more about how scoring works, read the "Controlling relevance" chapter: https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-relevance.html. You can always add the explain property to your query to see how the score was calculated: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-explain.html.
P.S
It's great that you provided your JSON, but there is a wrong field name inside the query: it should be song_field instead of song_field_1. Additionally, your response doesn't match the data stored in the type (look at the _source field in the response), but that doesn't matter here :P.

how to switch on the elasticsearch stemming

I don't know how to turn on Elasticsearch's English word stemming. I am sorry, but I couldn't find a clear example of how to do that.
Here is what I did
Creating the index
PUT /staff/list/ -d
{
"settings" : {
"analysis": {
"analyzer": {
"standard": {
"type": "standard"
}
}
}
}
}
Adding document
PUT /staff/list/jason
{
"Title" : "searches"
}
When I search for search:
GET /staff/list/_search?q=search
The result doesn't appear.
What index settings do I need to make stemming work?
Many thanks in advance
Please note that the default Elasticsearch analyzer does not support stemming.
In order to support stemming you may need to create a custom analyzer.
Here is how you do it:
Create the index and define an analyzer called my_analyzer
PUT /staff
{
"settings" : {
"analysis": {
"filter": {
"filter_snowball_en": {
"type": "snowball",
"language": "English"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"filter_snowball_en"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
Configure mapping that assigns my_analyzer to list type
PUT /staff/_mapping/list
{
"list": {
"properties": {
"title": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
Index documents
PUT /staff/list/jason
{
"title": "searches"
}
PUT /staff/list/debby
{
"title": "searched open"
}
Search and stemmed results
GET staff/list/_search
{
"query": {
"query_string": {
"query": "title:opened"
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "staff",
"_type": "list",
"_id": "debby",
"_score": 1,
"_source": {
"title": "open"
}
}]
}
}
As you can see in the search results, the debby document, which contains the term open, was returned although we were searching for opened.
Hope that helps.
When you create the index, you are effectively doing nothing (just re-declaring the standard analyzer).
The standard analyzer is the default that Elasticsearch uses, and it doesn't stem any words.
You need to map the fields to their respective analyzers at index creation (mapping documentation):
PUT /staff
{
"mappings": {
"list": {
"properties": {
"Title": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
I guess the english analyzer (which uses the standard tokenizer) fits your case.
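To make the idea of stemming concrete, here is a deliberately tiny Python sketch. It is a toy suffix stripper, not the Snowball algorithm and not what the english analyzer does internally; the suffix list and the names toy_stem and analyze are assumptions made for illustration only.

```python
# Toy suffix list, checked in order (assumption: far cruder than Snowball).
SUFFIXES = ("es", "ed", "s")


def toy_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Split on whitespace and stem every token."""
    return [toy_stem(token) for token in text.split()]


print(analyze("searches"))       # ['search']
print(analyze("searched open"))  # ['search', 'open']
print(analyze("opened"))         # ['open']
```

Since the same stemming runs at index time and at search time, opened and open end up as the same term, which is why the query for opened finds the debby document.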

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas". On searching, both documents have the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. I am also using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
I'm assuming the values you gave are in the name field, and that the field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
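The must/should behaviour described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not how Elasticsearch computes scores: the weights of 1.0 are arbitrary, values are lowercased before comparison, and bool_score is a made-up name.

```python
def bool_score(name, query, must_weight=1.0, should_weight=1.0):
    """Toy model of a bool query with one must and one should clause.

    must:   some word in the name starts with the query (models the match).
    should: the whole name starts with the query (models the prefix query).
    Returns None when the must clause fails, i.e. the document is excluded.
    """
    value = name.lower()
    if not any(word.startswith(query) for word in value.split()):
        return None
    score = must_weight
    if value.startswith(query):
        score += should_weight  # optional clause only adds to the score
    return score


print(bool_score("Casa Road", "cas"))    # 2.0
print(bool_score("Jalan Casa", "cas"))   # 1.0
print(bool_score("Main Street", "cas"))  # None
```

Both Casa documents pass the must clause, but only the one that begins with the query also collects the should clause's score.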
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that sets term_vector to with_positions, uses an edgeNGram analyzer, and has a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
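The formula inside the script, (wordCount - position) / wordCount, can be checked by hand with a short Python sketch. This mirrors only the arithmetic of the script: position_score is a made-up name, words are split on whitespace, and returning 0.0 when nothing matches is an assumption of this sketch.

```python
def position_score(text, prefix):
    """Score the first word that starts with the prefix by how early it appears.

    Uses the same formula as the script: (word_count - position) / word_count,
    with positions counted from 0.
    """
    words = text.lower().split()
    for position, word in enumerate(words):
        if word.startswith(prefix):
            return (len(words) - position) / len(words)
    return 0.0  # assumption: no bonus when the prefix is not found


print(position_score("Casa Road", "cas"))   # 1.0
print(position_score("Jalan Casa", "cas"))  # 0.5
```

The two values differ by 0.5, which is the same gap as between the two _score values in the results above (1.3715843 and 0.8715843), consistent with boost_mode sum adding the script value to the query score.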

Resources