I am trying to implement an autocomplete suggester for movie titles, somewhat similar to IMDb. Below is the mapping I used, and it gives decent results. I am using edge n-grams; are there any better alternatives?
But it has some flaws:
"war civil" and "civil war" give the same results, i.e. it does not give priority to movies whose words appear in the same order as in the query.
It gives no results when the space between words is omitted, e.g. "smoking barrels" gives good results, but "smokingbarrels" gives zero results.
What is wrong with the query and mapping below?
curl -XPUT "http://localhost:9200/movieindex" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {},
        "analyzer": {
          "edge_ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "tokenizer": "edge_ngram_tokenizer"
          },
          "edge_ngram_search_analyzer": {
            "tokenizer": "lowercase"
          }
        },
        "tokenizer": {
          "edge_ngram_tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 10,
            "token_chars": [
              "letter",
              "digit",
              "symbol"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "edgengram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "edge_ngram_search_analyzer"
            }
          },
          "analyzer": "standard"
        }
      }
    }
  }
}'
GET /movieindex/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title.edgengram": {
              "query": "smokingbarrels",
              "fuzziness": 1
            }
          }
        }
      ]
    }
  }
}
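A quick way to see what is going on is to compare what the index-time and search-time analyzers emit with the _analyze API (a diagnostic sketch; works on recent Elasticsearch versions):
GET /movieindex/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "smoking barrels"
}

GET /movieindex/_analyze
{
  "analyzer": "edge_ngram_search_analyzer",
  "text": "smokingbarrels"
}
The first call emits per-word prefixes only (sm, smo, ... up to 10 characters of each word; nothing spans the space, because whitespace is not in token_chars), while the second emits the single 14-character token smokingbarrels. No indexed gram equals that token, and fuzziness 1 allows only one edit, so the no-space query matches nothing. As for word order, a bool of match clauses scores terms independently; boosting exact order usually means adding something like a match_phrase clause on the plain title field as an extra should clause.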
I need to explain some weird behavior of a term query against an Elasticsearch index where the string value contains a number. The query is pretty simple:
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "address.street": "8 kvetna"
          }
        }
      ]
    }
  }
}
The problem is that the term 8 kvetna returns an empty result. I tried to _analyze it, and it produces regular tokens like 8, k, kv, kve.... Also, I am pretty sure the value 8 kvetna exists in the database.
Here is the mapping for the field:
{
  "settings": {
    "index": {
      "refresh_interval": "1m",
      "number_of_shards": "1",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": "1",
            "max_gram": "20"
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "asciifolding",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
          },
          "default": {
            "filter": [
              "lowercase",
              "asciifolding"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "_all": {
        "enabled": false
      },
      "properties": {
        "address": {
          "properties": {
            "city": {
              "type": "text",
              "analyzer": "autocomplete"
            },
            "street": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  }
}
What caused this weird result? I don't understand it. Thanks for any help.
Great start so far! Your only issue is that you're using a term query where you should use a match query. A term query is not analyzed: it looks for the literal term 8 kvetna in the index, and since the field is run through your autocomplete analyzer at index time, that exact token never exists. A match query analyzes the search text first, so its tokens line up with what was indexed. The following query will work:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {            <--- change this
            "address.street": "8 kvetna"
          }
        }
      ]
    }
  }
}
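If you want to see exactly which tokens the term query is being compared against, you can run the field's analyzer by hand (a diagnostic sketch; the index name is assumed here):
GET /your_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "8 kvetna"
}
This returns edge ngrams such as 8, k, kv, kve, ... kvetna, but never the single token 8 kvetna, which is why the unanalyzed term lookup can never match.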
I'm kind of new to Elasticsearch, but I would like to search for a partial match inside a word.
For example, if the indexed text is "helloworld", is it possible to find it by typing only "world"?
Right now it works perfectly for the "hello" case: Elasticsearch returns the suggestion helloworld for me.
Here is the code:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "word": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
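You can see why "world" never matches by running the analyzer directly (a diagnostic sketch; the index name is assumed, and the exact _analyze syntax depends on your Elasticsearch version):
GET /your_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "helloworld"
}
edge_ngram emits prefixes only (h, he, hel, ..., helloworld), so no token starting in the middle of the word is ever indexed; matching from inside a word generally needs a plain ngram filter instead of edge_ngram, like the one used in the answer further down this page.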
Can anyone give me a suggestion?
Settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "idx_analyzer_ngram": {
          "type": "custom",
          "filter": [
            "lowercase",
            "asciifolding",
            "edgengram_filter_1_32"
          ],
          "tokenizer": "ngram_alltokenchar_tokenizer_1_32"
        },
        "ngrm_srch_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "ngram_alltokenchar_tokenizer_1_32": {
          "token_chars": [
            "letter",
            "whitespace",
            "punctuation",
            "symbol",
            "digit"
          ],
          "min_gram": "1",
          "type": "nGram",
          "max_gram": "32"
        }
      }
    }
  }
}
Mappings:
{
  "properties": {
    "TITLE": {
      "type": "string",
      "fields": {
        "untouched": {
          "index": "not_analyzed",
          "type": "string"
        },
        "ngramanalyzed": {
          "search_analyzer": "ngrm_srch_analyzer",
          "index_analyzer": "idx_analyzer_ngram",
          "type": "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
Query:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "have some ha",
          "fields": [
            "TITLE.ngramanalyzed"
          ],
          "default_operator": "and"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "TITLE.ngramanalyzed": {}
    }
  }
}
I have a document indexed with TITLE have some happy meal. When I search for have some, I get proper highlights:
<em>have</em> <em>some</em> happy meal
As I type more, have some ha, the highlight results are not as expected:
<em>ha</em>ve <em>some</em> <em>ha</em>ppy meal
The word have gets only partially highlighted, as ha.
I would expect the highlighter to prefer the longest matching token: with ngrams of min size 1, ha is a match, but there is also a longer matching token of four characters (have), so have should be highlighted in full along with ha in happy.
I am not able to find any solution for this. Please suggest.
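One workaround that may help here (the same technique is suggested for a similar problem at the end of this page) is highlight_query, which lets the highlighter run a different query than the one used for matching. A hedged sketch, assuming the parent TITLE field keeps the default standard analyzer as in the mapping above:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "have some ha",
          "fields": [
            "TITLE.ngramanalyzed"
          ],
          "default_operator": "and"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "TITLE": {
        "highlight_query": {
          "match_phrase_prefix": {
            "TITLE": "have some ha"
          }
        }
      }
    }
  }
}
Highlighting then runs against the word-tokenized TITLE field, so whole words rather than 1-character grams get wrapped in <em> tags, and match_phrase_prefix treats the trailing ha as a prefix, so happy is matched too. Exact highlighter behavior varies across versions, so treat this as a starting point.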
Suppose I have the following document:
{title:"Sennheiser HD 800"}
I want all of these queries to return this document:
senn
heise
sennheise
sennheiser
sennheiser 800
sennheiser hd
hd
800 hd
hd ennheise
In short, I want to match one or more partial words.
In my mapping I am using this analyzer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
and this mapping:
{
  "title": {
    "type": "string",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lower_case_sort": {
        "type": "string",
        "analyzer": "case_insensitive_sort"
      }
    }
  }
}
and the query is a simple query_string query:
{
  "query": {
    "query_string": {
      "fields": [
        "title.lower_case_sort"
      ],
      "query": "*800 hd*"
    }
  }
}
For example, this query fails.
You need ngrams.
Here is a blog post I wrote up about it for Qbox:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
(Note that "index_analyzer" no longer works in ES 2.x; use "analyzer" instead; "search_analyzer" still works, though.)
Using this mapping (slightly modified from one in the blog post; I'll refer you there for an in-depth explanation):
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
index your document:
POST /test_index/doc/1
{
  "title": "Sennheiser HD 800"
}
and then any of your listed queries work, in the following form:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "heise hd 800",
        "operator": "and"
      }
    }
  }
}
If you only have a single term, then you don't need the "operator" part:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "hd"
    }
  }
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/a9accf67f1713ca99819f45ce0ac28adaea691a9
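If you want to sanity-check what actually gets indexed, you can also run the analyzer directly (a diagnostic sketch; the request-body form of _analyze works on ES 2.x and later):
GET /test_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Sennheiser HD 800"
}
You should see 2-to-20-character ngrams of each word (se, en, ..., heise, ..., hd, 80, 00, 800), which is why partial terms like heise and 800 hd can match.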
I guess the title of the topic already gives it away :D
I use edge_ngram and highlighting to build an autocomplete search. I have added fuzziness to the query to allow users to misspell their search, but it breaks the highlighting a bit.
When I type Sport, this is what I get:
<em>Spor</em>t
<em>Spor</em>t mécanique
<em>Spor</em>t nautique
I guess it's because it matches the token spor generated by the edge_ngram tokenizer.
The query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "sport",
              "operator": "and",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "name.raw": {
              "query": "sport"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
And the mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "partialAnalyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["asciifolding", "lowercase"]
        },
        "keywordAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["asciifolding", "lowercase"]
        },
        "searchAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "place": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "partialAnalyzer",
          "search_analyzer": "searchAnalyzer",
          "term_vector": "with_positions_offsets",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "keywordAnalyzer"
            }
          }
        }
      }
    }
  }
}
I tried adding another match clause without fuzziness to the query, hoping it would match the keyword before the fuzzy clause does, but it changed nothing:
"match": {
  "name": {
    "query": "sport",
    "operator": "and"
  }
}
Any idea how I can handle this?
Regards, Raphaël
You could do that with highlight_query, I guess. Try this in your highlight section:
"highlight": {
"fields": {
"name": {
"term_vector": "with_positions_offsets",
"highlight_query": {
"match": {
"name.raw": {
"query": "spotr",
"fuzziness": 2
}
}
}
}
}
}
I hope it helps.