Highlight part of word with ngram and whitespace analyzers - elasticsearch

I have an elasticsearch index with the following data:
"The A-Team" (as an example)
My index settings are:
"index": {
"number_of_shards": "1",
"provided_name": "tyh.tochniyot",
"creation_date": "1481039136127",
"analysis": {
"analyzer": {
"whitespace_analyzer": {
"type": "whitespace"
},
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": "3",
"max_gram": "7"
}
}
}
}
When I search for:
_search
{
"from": 0,
"size": 20,
"track_scores": true,
"highlight": {
"fields": {
"*": {
"fragment_size": 100,
"number_of_fragments": 10,
"require_field_match": false
}
}
},
"query": {
"match": {
"_all": {
"query": "Tea"
}
}
}
}
I expect to get the highlight result:
"highlight": {
"field": [
"The A-<em>Tea</em>m"
]
}
But I don't get any highlight at all.
The reason I am using the whitespace analyzer for search and the ngram analyzer for indexing is that I don't want the search phase to break up the word I am searching for: if I search for "Team", it would otherwise match "Tea", "eam", and "Team".
Thank you

The problem was that my analyzer and search analyzer were running on the _all field.
When I placed the analyzer attributes on the specific fields, the highlighting started working.
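For reference, a minimal sketch of that kind of field-level mapping, using the analyzers defined in the settings above (the field name title and the text field type are illustrative assumptions, not the original mapping):
"mappings": {
  "my_type": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      }
    }
  }
}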

Related

Elasticsearch Only Searching From Start

Currently, Elasticsearch is only searching through the mapped items from the beginning of the string instead of throughout the string.
I have a custom analyzer, as well as a custom edge ngram tokenizer.
I am currently using bool queries from within javascript to search the index.
Index
{
"homestead_dev_index": {
"aliases": {},
"mappings": {
"elasticprojectnode": {
"properties": {
"archived": {
"type": "boolean"
},
"id": {
"type": "text",
"analyzer": "full_name"
},
"name": {
"type": "text",
"analyzer": "full_name"
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "homestead_dev_index",
"creation_date": "1535439085947",
"analysis": {
"analyzer": {
"full_name": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "mytok"
}
},
"tokenizer": {
"mytok": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "10"
}
}
},
"number_of_replicas": "1",
"uuid": "iCa7qKJVRU-_MA8sCYIAXw",
"version": {
"created": "5060399"
}
}
}
}
}
Query Body
{
"query": {
"bool": {
"should": [
{ "match": { "name": this.searchString } },
{ "match": { "id": this.searchString } }
]
}
},
"highlight": {
"pre_tags": ["<b style='background-color:yellow'>"],
"post_tags": ["</b>"],
"fields": {
"name": {},
"id": {}
}
}
}
Example
If I have projects with the names "Road - Area 1", "Road - Area 2" and "Sub-area 5 - Road" and the user searches for "Road", only "Road - Area 1" and "Road - Area 2" will display with the word "Road" highlighted in yellow.
The code needs to pick up the final project as well.
I seem to have figured it out.
In the original description, I am using the edge_ngram tokenizer when I am supposed to be using the ngram tokenizer.
Found on: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html#_partial_word_tokenizers
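In other words, the only change needed in the settings above is the tokenizer type, keeping the same gram sizes:
"tokenizer": {
  "mytok": {
    "type": "ngram",
    "min_gram": "3",
    "max_gram": "10"
  }
}
With the ngram tokenizer, "Sub-area 5 - Road" also produces tokens from the middle of the string (e.g. "roa", "oad", "road" after lowercasing), so a search for "Road" matches it as well.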

Completion Suggester Not working as expected

{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"suggest": {
"type": "completion",
"analyzer": "autocomplete"
},
"hostname": {
"type": "text"
}
}
}
}
}
The above mapping is stored in Elasticsearch.
POST index/test
{
"hostname": "testing-01",
"suggest": [{"input": "testing-01"}]
}
POST index/test
{
"hostname": "testing-02",
"suggest": [{"input":"testing-02"}]
}
POST index/test
{
"hostname": "w1-testing-01",
"suggest": [{"input": "w1-testing-01"}]
}
POST index/test
{
"hostname": "w3-testing-01",
"suggest": [{"input": "w3-testing-01"}]
}
When there are 30 documents with hostnames starting with w1 as well as hostnames starting with w3, and the term "w3" is searched, I get all the w1 suggestions first and then the w3 ones.
Suggestion Query
{
"_source": {
"include": [
"text"
]
},
"suggest": {
"server-suggest": {
"text": "w1",
"completion": {
"field": "suggest",
"size": 10
}
}
}
}
I tried different analyzers, with the same issue.
Can somebody guide me?
It's a common trap. Because min_gram is 1, both w1-testing-01 and w3-testing-01 produce the token w. Since you only specified analyzer, the autocomplete analyzer also kicks in at search time, so searching suggestions for w3 produces the token w as well, which is why both w1-testing-01 and w3-testing-01 match.
The solution is to add a search_analyzer to your suggest field so that the autocomplete analyzer is used only at indexing time, not at search time (you can use standard, keyword, or whatever analyzer makes sense for your use case).
"mappings": {
"test": {
"properties": {
"suggest": {
"type": "completion",
"analyzer": "autocomplete",
"search_analyzer": "standard" <-- add this
},
"hostname": {
"type": "text"
}
}
}
}
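To verify, you can run the autocomplete analyzer against one of the hostnames and inspect the tokens it emits; with min_gram set to 1, the single-character token w shows up among them. A quick check (using the 5.x-style _analyze API; adjust the syntax for your version):
GET index/_analyze
{
  "analyzer": "autocomplete",
  "text": "w3-testing-01"
}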

Google style autocomplete & autocorrection with elasticsearch

I'm trying to achieve Google-style autocomplete & autocorrection with Elasticsearch.
Mappings:
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}
Docs:
POST music/song
{
"song_field" : "beautiful queen",
"suggest" : "beautiful queen"
}
POST music/song
{
"song_field" : "beautiful",
"suggest" : "beautiful"
}
I expect that when a user types "beaatiful q", they will get something like "beautiful queen" ("beaatiful" corrected to "beautiful" and "q" completed to "queen").
I've tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatiful q",
"completion": {
"field": "suggest"
}
}
},
"query": {
"match": {
"song_field": {
"query": "beaatiful q",
"fuzziness": 2
}
}
}
}
Unfortunately, the completion suggester doesn't allow any typos, so I get this response:
"suggest": {
"didYouMean": [
{
"text": "beaatiful q",
"offset": 0,
"length": 11,
"options": []
}
]
}
In addition, the search gave me these results ("beautiful" is ranked higher, although the user had started to write "queen"):
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUj4Y5NancUpEdFLeLo",
"_score": 0.51315063,
"_source": {
"song_field": "beautiful"
"suggest": "beautiful"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUj4XFAancUpEdFLeLn",
"_score": 0.32071912,
"_source": {
"song_field": "beautiful queen"
"suggest": "beautiful queen"
}
}
]
UPDATE !!!
I found out that I can use a fuzzy query with the completion suggester, but now I get no suggestions when querying ("beaatefal" is three edits away from "beautiful", and fuzziness supports at most an edit distance of 2):
POST music/song/_search
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatefal q",
"completion": {
"field": "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
}
I still expect "beautiful queen" as the suggestion response.
When you want to provide two or more words as search suggestions, I have found out (the hard way) that it's not worth it to use ngrams or edge ngrams in Elasticsearch.
Using the shingle token filter and a shingle analyzer will give you multi-word phrases, and if you couple that with match_phrase_prefix it should give you the functionality you're looking for (see the query sketch at the end of this answer).
Basically something like this:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
And don't forget to do your mapping:
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
Ngrams and edge ngrams will tokenize single characters, whereas the shingle analyzer and filter group letters into words and provide a much more efficient way of producing and searching for phrases. I spent a lot of time messing with the two above until I saw shingles mentioned and read up on them. Much better.
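A sketch of what the query side might look like under that approach (the index and field names follow the mapping above; the search text is a placeholder and max_expansions is optional tuning):
POST /my_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "some phr",
        "max_expansions": 50
      }
    }
  }
}
The title.shingles sub-field can then be used in an additional clause or a rescore to boost documents where the words actually occur as a phrase.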

Trying to form an Elasticsearch query for autocomplete

I've read a lot and it seems that using EdgeNGrams is a good way to go for implementing an autocomplete feature for search applications. I've already configured the EdgeNGrams in my settings for my index.
PUT /bigtestindex
{
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "stop", "kstem", "ngram" ]
}
},
"filter":{
"edgengram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"fields": {
"title.autocomplete": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
}
}
So if I have the EdgeNGram filter configured in my settings, how do I add it to the search query?
What I have so far is a match query with highlight:
GET /bigtestindex/doc/_search
{
"query": {
"match": {
"content": {
"query": "thing and another thing",
"operator": "and"
}
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"field": {
"_source.content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
How would I add autocomplete to the search query using EdgeNGrams configured in the settings for the index?
UPDATE
For the mapping, would it be ideal to do something like this:
"title": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
},
Or do I need to use multi_field type:
"title": {
"type": "multi_field",
"fields": {
"title": {
"type": "string"
},
"autocomplete": {
"analyzer": "autocomplete",
"type": "string",
"index": "not_analyzed"
}
}
},
I'm using ES 1.4.1 and want to use the title field for autocomplete purposes.
Short answer: you need to use it in a field mapping. As in:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"ngram"
]
}
},
"filter": {
"edgengram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
For a bit more discussion, see:
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
and
http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Also, I don't think you want the "highlight" section in your index definition; that belongs in the query.
EDIT: Upon trying out your code, there are a couple of problems with it. One was the highlight issue I already mentioned. Another is that you named your filter "edgengram", even though it is of type "ngram" rather than type "edgeNGram", but then you referenced the filter "ngram" in your analyzer, which will use the default ngram filter, which probably doesn't give you what you want. (Hint: you can use term vectors to figure out what your analyzer is doing to your documents; you probably want to turn them off in production, though.)
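For example, this is one way to inspect the tokens indexed for a single document (on ES 1.x the endpoint is _termvector; later versions renamed it to _termvectors):
GET /test_index/doc/1/_termvector?fields=content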
So what you actually want is probably something like this:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"edgengram_filter"
]
}
},
"filter": {
"edgengram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"content": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
When I indexed these two docs:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"content":"hello world"}
{"index":{"_id":2}}
{"content":"goodbye world"}
And ran this query (there was an error in your "highlight" block as well; it should have said "fields" rather than "field"):
POST /test_index/doc/_search
{
"query": {
"match": {
"content": {
"query": "good wor",
"operator": "and"
}
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"fields": {
"content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
I get back this response, which seems to be what you're looking for, if I understand you correctly:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2712221,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.2712221,
"_source": {
"content": "goodbye world"
},
"highlight": {
"content": [
"<em>goodbye</em> <em>world</em>"
]
}
}
]
}
}
Here is some code I used to test it out:
http://sense.qbox.io/gist/3092992993e0328f7c4ee80e768dd508a0bc053f

ElasticSearch search_analyzer applied but no result returned

I have a query that should search for lowercase terms.
I originally had just an index_analyzer with a lowercase filter, but I wanted to add a search_analyzer as well so I could do case-insensitive searches.
"analysis": {
"analyzer" : {
"DefaultAnalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
],
"char_filter": ["punctuation"]
},
"MyAnalyzer": {
"type": "custom",
"tokenizer": "first_letter",
"filter": [
"lowercase"
]
},
So I just thought to add the same analyzer as the search_analyzer in the mapping:
"index_analyzer": "DefaultAnalyzer",
"search_analyzer": "DefaultAnalyzer",
"dynamic" : false,
"_source": { "enabled": true },
"properties" : {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"store": true
},
"startletter": {
"type": "string",
"index_analyzer": "MyAnalyzer",
"search_analyzer": "MyAnalyzer",
"store": true
}
}
},
Doing it like that, if I manually query Elasticsearch with
curl -XGET host:9200/my-index/_analyze -d 'Test'
I see that the query term is correctly lowercased:
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 1
}
]
}
But when executing from code:
if I use an uppercase search term, ES returns zero hits (even though we saw that the search_analyzer is applied)
if I use a lowercase search term, ES returns the right number of hits (hundreds)
I would like to get the same results regardless of case.
In the code I'm just creating a query with a term filter, like this:
{
"filter": {
"term": {
"name.startletter": "O"
}
},
"size": 10000,
"query": {
"match_all": {}
}
}
What am I doing wrong? Why am I not getting any results?
The problem is that you are using a Term Filter. A Term Filter does not analyze the text being used:
Term Filter
Filters documents that have fields that contain a term (not analyzed).
Similar to term query, except that it acts as a filter.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-filter.html
Since it does not analyze, it does not use the analyzer that you have defined.
You generally want to use Term filters and queries with fields that are not analyzed. Change your filter type to something that will analyze during the query.
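For instance, wrapping a match query in a query filter keeps the structure of your request but runs the search analyzer against the input (a sketch in the same 1.x-style DSL as the question; other filter types that analyze would work too):
{
  "filter": {
    "query": {
      "match": {
        "name.startletter": "O"
      }
    }
  },
  "size": 10000,
  "query": {
    "match_all": {}
  }
}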
I think you are using MyAnalyzer to get the starting letter of the indexed text, but your analyzer doesn't work that way. I've done some tests and finally came up with a solution.
First, create the index and mapping (plus settings):
curl -XPUT "http://localhost:9200/t1" -d'
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"DefaultAnalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"MyAnalyzer": {
"type": "custom",
"tokenizer": "token_letter",
"filter": [
"one_token","lowercase"
]
}
},
"tokenizer": {
"token_letter": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "1",
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"one_token": {
"type": "limit",
"max_token_count": 1
}
}
}
}
},
"mappings": {
"t2": {
"index_analyzer": "DefaultAnalyzer",
"search_analyzer": "DefaultAnalyzer",
"dynamic": false,
"_source": {
"enabled": true
},
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"store": true
},
"startletter": {
"type": "string",
"index_analyzer": "MyAnalyzer",
"search_analyzer": "simple",
"store": true
}
}
}
}
}
}
}'
Now, index a document:
curl -XPUT "http://localhost:9200/t1/t2/1" -d'
{
"name" :"Oliver Khan"
}'
Now, here is the fun part: just a query and a facet to see what was indexed.
curl -XPOST "http://localhost:9200/t1/t2/_search" -d'
{
"filter": {
"term": {
"name.startletter": "O"
}
},
"size": 10000,
"query": {
"match_all": {}
},
"facets": {
"tf": {
"terms": {
"field": "name.startletter",
"size": 10
}
}
}
}'
This gives me the analyzed text as the facet output, so I can check whether the analyzer is working.
Hope this helps!
