Autocomplete in elastic search - elasticsearch

i am planning to make an elastic search based auto complete module for an e commerce website.i am using edge_ngram for suggestions.I am trying out this configuration.
**My index creation :**
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter","digit"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
**Inserting Data**
PUT my_index/doc/1
{
"title": "iphone s"
}
PUT my_index/doc/9
{
"title": "iphone ka"
}
PUT my_index/doc/11
{
"title": "iphone ka t"
}
PUT my_index/doc/15
{
"title": "iphone 6"
}
PUT my_index/doc/14
{
"title": "iphone 6 16GB"
}
PUT my_index/doc/3
{
"title": "iphone k"
}
POST my_index/_refresh
POST my_index/_analyze
{
"tokenizer": "autocomplete",
"text": "iphone 6"
}
POST my_index/_analyze
{
"analyzer": "pattern",
"text": "iphone 6"
}
**Autocomplete suggestions**
When i am trying to find out closets match to iphone 6.It is not showing correct result.
GET my_index/_search
{
"query": {
"match": {
"title": {
"query": "iphone 6",
"operator": "and"
}
}
}
}
**Above query yielding :**
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0.28582606,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_score": 0.28582606,
"_source": {
"title": "iphone s"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "9",
"_score": 0.25811607,
"_source": {
"title": "iphone ka"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "14",
"_score": 0.24257512,
"_source": {
"title": "iphone 6 16GB"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "3",
"_score": 0.19100356,
"_source": {
"title": "iphone k"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "15",
"_score": 0.1862728,
"_source": {
"title": "iphone 6"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "11",
"_score": 0.16358379,
"_source": {
"title": "iphone ka t"
}
},
{
"_index": "my_index",
"_type": "doc",
"_id": "2",
"_score": 0.15861572,
"_source": {
"title": "iphone 5 s"
}
}
]
}
}
But result should be :
{
"_index": "my_index",
"_type": "doc",
"_id": "15",
"_score": 1,
"_source": {
"title": "iphone 6"
}
}
Please let me know if i am missing something on this,I am new to this so not aware of any other method that may yield better results.

You are using autocomplete_search as your search_analyzer. If you look how your text is analyzed using search analyzer specified by you.
POST my_index/_analyze
{
"analyzer": "autocomplete_search",
"text": "iphone 6"
}
You will get
{
"tokens": [
{
"token": "iphone", ===> Only one token
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
}
]
}
Since all the documents have this (iphone) token in reverse index. So all the documents are returned
In case you want to match desired results, you can use the same analyzer used while indexing.
{
"query": {
"match": {
"title": {
"query": "iphone 6",
"operator": "and",
"analyzer" : "autocomplete"
}
}
}
}

Related

ElasticSearch wildcard highlighting with hyphen

I am having trouble with wildcard query. When i have some hyphen - it does not highlight anything after it. I played with highlight settings but did not found any solution yet. Is it normal behavior?
I am making some index:
PUT testhighlight
PUT testhighlight/_mapping/_doc
{
"properties": {
"title": {
"type": "text",
"term_vector": "with_positions_offsets"
},
"content": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
Then i create documents:
PUT testhighlight/_doc/1
{
"title": "1",
"content": "test-input"
}
PUT testhighlight/_doc/2
{
"title": "2",
"content": "test input"
}
PUT testhighlight/_doc/3
{
"title": "3",
"content": "testinput"
}
Then i execute this search request:
GET testhighlight/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [
"title",
"content"
],
"query": "test*"
}
}
]
}
},
"highlight": {
"fields": {
"content": {
"boundary_max_scan": 10,
"fragment_offset": 5,
"fragment_size": 250,
"type": "fvh",
"number_of_fragments": 5,
"order": "score",
"boundary_scanner": "word",
"post_tags": [
"</span>"
],
"pre_tags": [
"""<span class="highlight-search">"""
]
}
}
}
}
It returns these hits:
"hits": [
{
"_index": "testhighlight",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"title": "2",
"content": "test input"
},
"highlight": {
"content": [
"""<span class="highlight-search">test</span> input"""
]
}
},
{
"_index": "testhighlight",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"title": "1",
"content": "test-input"
},
"highlight": {
"content": [
"""<span class="highlight-search">test</span>-input"""
]
}
},
{
"_index": "testhighlight",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"title": "3",
"content": "testinput"
},
"highlight": {
"content": [
"""<span class="highlight-search">testinput</span>"""
]
}
}
]
It looks alright, but didn't highlighted the whole "test-input" in document with ID 1. Is there any way to do so?

Search results for term query not in alphabetical sort order

My results for the following term query gets rendered like this. But we would want the search results where "BC" appears after "Bar", since we are trying to perform a alphabetical search. What should be done to get this working
Adam
Buck
BC
Bar
Car
Far
NativeSearchQuery query = new NativeSearchQueryBuilder()
.withSourceFilter(new FetchSourceFilterBuilder().withIncludes().build())
.withQuery(QueryBuilders.termQuery("type", field))
.withSort(new FieldSortBuilder("name").order(SortOrder.ASC))
.withPageable(pageable).build();
To sort the result in alphabetical order you can define a normalizer with a lowercase filter, lowercase filter will ensure that all the letters are changed to lowercase before indexing the document and searching.
Modify your index mapping as
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
Indexed the same sample documents as given in the question.
Search Query:
{
"sort":{
"name":{
"order":"asc"
}
}
}
Search Result:
"hits": [
{
"_index": "66064809",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"name": "Adam"
},
"sort": [
"adam"
]
},
{
"_index": "66064809",
"_type": "_doc",
"_id": "4",
"_score": null,
"_source": {
"name": "Bar"
},
"sort": [
"bar"
]
},
{
"_index": "66064809",
"_type": "_doc",
"_id": "3",
"_score": null,
"_source": {
"name": "BC"
},
"sort": [
"bc"
]
},
{
"_index": "66064809",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"name": "Buck"
},
"sort": [
"buck"
]
},
{
"_index": "66064809",
"_type": "_doc",
"_id": "5",
"_score": null,
"_source": {
"name": "Car"
},
"sort": [
"car"
]
},
{
"_index": "66064809",
"_type": "_doc",
"_id": "6",
"_score": null,
"_source": {
"name": "Far"
},
"sort": [
"far"
]
}
]
}

Elasticsearch geo query with aggregation

I have an elasticsearch index containing user locations.
I need to perform aggregate query with geo bounding box using geohash grid, and for buckets that have documents count less than some value, i need to return all documents.
How can i do this?
Since you have not given any relevant information about the index which you have created and the user locations.
I am considering the below data:
index Def
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
Index Sample Doc
POST _bulk
{"index":{"_id":1}}
{"location":"52.37408,4.912350","name":"The golden dragon"}
{"index":{"_id":2}}
{"location":"52.369219,4.901618","name":"Burger King"}
{"index":{"_id":3}}
{"location":"52.371667,4.914722","name":"Wendys"}
{"index":{"_id":4}}
{"location":"51.222900,4.405200","name":"Taco Bell"}
{"index":{"_id":5}}
{"location":"48.861111,2.336389","name":"McDonalds"}
{"index":{"_id":6}}
{"location":"48.860000,2.327000","name":"KFC"}
According to your question:
When requesting detailed buckets a filter like geo_bounding_box
should be applied to narrow the subject area
To know more about this, you can refer to this official ES doc
Now, in order to filter data based on doc_count with aggregations, we can use bucket_selector pipeline aggregation.
From documentation
Pipeline aggregations work on the outputs produced from other
aggregations rather than from document sets, adding information to the
output tree.
So, the amount of work that need to be done to calculate doc_count will be the same.
Query
{
"aggs": {
"location": {
"filter": {
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 52.5225,
"lon": 4.5552
},
"bottom_right": {
"lat": 52.2291,
"lon": 5.2322
}
}
}
},
"aggs": {
"around_amsterdam": {
"geohash_grid": {
"field": "location",
"precision": 8
},
"aggs": {
"the_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count < 2"
}
}
}
}
}
}
}
}
Search Result
"hits": {
"total": {
"value": 6,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "restaurant",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"location": "52.37408,4.912350",
"name": "The golden dragon"
}
},
{
"_index": "restaurant",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"location": "52.369219,4.901618",
"name": "Burger King"
}
},
{
"_index": "restaurant",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"location": "52.371667,4.914722",
"name": "Wendys"
}
},
{
"_index": "restaurant",
"_type": "_doc",
"_id": "4",
"_score": 1.0,
"_source": {
"location": "51.222900,4.405200",
"name": "Taco Bell"
}
},
{
"_index": "restaurant",
"_type": "_doc",
"_id": "5",
"_score": 1.0,
"_source": {
"location": "48.861111,2.336389",
"name": "McDonalds"
}
},
{
"_index": "restaurant",
"_type": "_doc",
"_id": "6",
"_score": 1.0,
"_source": {
"location": "48.860000,2.327000",
"name": "KFC"
}
}
]
},
"aggregations": {
"location": {
"doc_count": 3,
"around_amsterdam": {
"buckets": [
{
"key": "u173zy3j",
"doc_count": 1
},
{
"key": "u173zvfz",
"doc_count": 1
},
{
"key": "u173zt90",
"doc_count": 1
}
]
}
}
}
}
It will filter out all the documents, whose count is less than 2 based on "params.the_doc_count < 2"

Edge n-gram suggestions and 'starts with' keyword in Elasticsearch

I'm trying to build a food search engine on Elasticsearch that should meet following use cases -
If the user searches for 'coff' then it should return all the documents with phrase 'coffee' in their name and the priority should be for food items that have 'coffee' at the starting of their name.
If the user searches for 'green tea' then it should give priority to the documents that have both the phrases 'green tea' instead of splitting 'green' and 'tea'
If the phrase does not exist in the 'name' then it should also search in the alias field.
To manage the first case, I've used the edge n-grams analyzer.
Mapping -
{
"settings": {
"index": {
"analysis": {
"filter": {},
"analyzer": {
"analyzer_keyword": {
"tokenizer": "standard",
"filter": "lowercase"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"alias": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"search_analyzer": "analyzer_keyword",
"analyzer": "edge_ngram_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This is the search query that I'm using but it's not exactly returning the relevant search results
{
"query": {
"multi_match": {
"query": "coffee",
"fields": ["name^2", "alias"]
}
}
}
There are over 1500 food items with 'coffee' in their name but the above query is only returning 2
{
"took": 745,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 8.657346,
"hits": [
{
"_index": "food-master",
"_type": "doc",
"_id": "a9uzinABb4g7LgmgoI1I",
"_score": 8.657346,
"_source": {
"id": 17463,
"name": "Rotiboy, coffee bun",
"alias": [
"Mexican Coffee Bun (Rotiboy)",
"Mexican coffee bun"
],
}
},
{
"_index": "food-master",
"_type": "doc",
"_id": "TNuzinABb4g7LgmgoFVI",
"_score": 7.0164866,
"_source": {
"id": 1344,
"name": "Coffee with sugar",
"alias": [
"Heart Friendly",
"Coffee With Sugar",
"Coffee With Milk and Sugar",
"Gluten Free",
"Hypertension Friendly"
],
}
}
]
}
}
In the mapping, if I remove the analyzer_keyword then it returns relevant results but the documents that start with 'coffee' are not prioritized
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1323,
"max_score": 57.561867,
"hits": [
{
"_index": "food-master-new",
"_type": "doc",
"_id": "nduzinABb4g7LgmgoINI",
"_score": 57.561867,
"_source": {
"name": "Egg Coffee",
"alias": [],
"id": 12609
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "dNuzinABb4g7LgmgoFVI",
"_score": 55.811295,
"_source": {
"name": "Coffee (Black)",
"alias": [
"Weight Loss",
"Diabetes Friendly",
"Gluten Free",
"Lactose Free",
"Heart Friendly",
"Hypertension Friendly"
],
"id": 1341
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "NduzinABb4g7LgmgoHxI",
"_score": 54.303185,
"_source": {
"name": "Brewed Coffee",
"alias": [
"StarBucks"
],
"id": 15679
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "ltuzinABb4g7LgmgoJJI",
"_score": 54.303185,
"_source": {
"name": "Coffee - Masala",
"alias": [],
"id": 11329
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "oduzinABb4g7LgmgoGpI",
"_score": 53.171227,
"_source": {
"name": "Coffee, German",
"alias": [],
"id": 12257
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "YNuzinABb4g7LgmgoFRI",
"_score": 52.929176,
"_source": {
"name": "Soy Milk Coffee",
"alias": [
"Gluten Free",
"Lactose Free",
"Weight Loss",
"Diabetes Friendly",
"Heart Friendly",
"Hypertension Friendly"
],
"id": 978
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "8duzinABb4g7LgmgoFRI",
"_score": 52.068523,
"_source": {
"name": "Cold Coffee (Soy Milk)",
"alias": [
"Soy Milk"
],
"id": 1097
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "tNuzinABb4g7LgmgoF9I",
"_score": 50.956154,
"_source": {
"name": "Coffee Frappe",
"alias": [],
"id": 3142
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "ZduzinABb4g7LgmgoF5I",
"_score": 49.810112,
"_source": {
"name": "Big Apple Coffee",
"alias": [],
"id": 3130
}
},
{
"_index": "food-master-new",
"_type": "doc",
"_id": "eduzinABb4g7LgmgoHtI",
"_score": 49.62197,
"_source": {
"name": "Mexican Coffee",
"alias": [],
"id": 13604
}
}
]
}
}
If I change the tokenizer to 'keyword' from 'standard' then I face the same problem and it also splits phrases into individual words - 'green tea' to 'green' and 'tea'
Any suggestions on what I might be getting wrong with respect to analyzers? I've tried all possible combinations but meeting all 3 scenarios with high accuracy is getting a little difficult.

Elasticsearch search for Turkish characters

I have some documents that i am indexing with elasticsearch. But some of the documents are written with upper case and Tukish characters are changed. For example "kürşat" is written as "KURSAT".
I want to find this document by searching "kürşat". How can i do that?
Thanks
Take a look at the asciifolding token filter.
Here is a small example for you to try out in Sense:
Index:
DELETE test
PUT test
{
"settings": {
"analysis": {
"filter": {
"my_ascii_folding": {
"type": "asciifolding",
"preserve_original": true
}
},
"analyzer": {
"turkish_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_ascii_folding"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "turkish_analyzer"
}
}
}
}
}
POST test/test/1
{
"name": "kürşat"
}
POST test/test/2
{
"name": "KURSAT"
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kursat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.30685282,
"_source": {
"name": "KURSAT"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.30685282,
"_source": {
"name": "kürşat"
}
}
]
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kürşat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.4339554,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.4339554,
"_source": {
"name": "kürşat"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.09001608,
"_source": {
"name": "KURSAT"
}
}
]
}
Now the 'preserve_original' flag will make sure that if a user types: 'kürşat', documents with that exact match will be ranked higher than documents that have 'kursat' (Notice the difference in scores for both query responses).
If you want the score to be equal, you can put the flag on false.
Hope I got your problem right!

Resources