Implementation of an autosuggestion feature for apps like Alibaba using Elasticsearch and Spring

I am trying to implement an "autosuggestion" feature in my application: when the user types a set of letters, she should see a list of suggestions for the given input. I would like the feature to work similarly to how Alibaba and comparable sites work.
I am using Elasticsearch and Java. Can anyone help me or give any suggestions on how to implement this functionality?

Suggestions in Elasticsearch can be implemented using 1) prefix match, 2) completion suggesters, or 3) edge n-grams.
In this approach I chose an nGram analyzer for autosuggestion: start by defining an nGram filter, then link the filter to a custom analyzer, and apply that analyzer to the fields we choose to provide suggestions on.
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
Once we have the settings in place, we assign the analyzer (nGram_analyzer) to the fields that take part in suggestions, typically as the index-time analyzer, keeping a plain analyzer (such as whitespace_analyzer) at search time so the query text is not n-grammed again.
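A minimal mapping sketch showing how to wire the two analyzers to a field (the type name "products" and field name "title" are placeholders, not from the original post; index_analyzer/search_analyzer follow the ES 1.x style of the linked documentation):
"mappings": {
"products": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}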
The next step is to write a query to extract suggestions, which would be as follows.
"query": {
"match": {
"_all": {
"query": "gates",
"operator": "and"(optional , needed when we have multiple words or a senetence)
}
}
}
The "operator": "and" setting is optional; it is needed when the input contains multiple words or a sentence. The material below helps to understand the ngram tokenizer in depth:
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/analysis-ngram-tokenizer.html
PS: each approach listed above has its own advantages and drawbacks.

Related

Can we apply a char_filter to a custom tokenizer in elasticsearch?

I have set up a custom analyser in Elasticsearch that uses an edge-ngram tokeniser, and I'm experimenting with filters and char_filters to refine the search experience.
I've been pointed to the excellent tool elyzer, which lets you test the effect your custom analyser has on a specific term, but it throws errors when I combine a custom analyser with a char_filter, specifically html_strip.
The error I get from elyzer is:
illegal_argument_exception', 'reason': 'Custom normalizer may not use
char filter [html_strip]'
I would like to know whether this is a legitimate error message or whether it represents a bug in the tool.
I've referred to the main documentation, and even their custom analyser example throws an error in elyzer:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
Command in elyzer:
elyzer --es "http://localhost:9200" --index my_index --analyzer my_custom_analyzer "Trinity Chapel <h1>[in fact King's Chapel]</h1>"
If it turns out that elyzer is at fault, could anyone point me to an alternative method of examining the tokens produced by my custom analysers, so that I can test the impact of each filter?
My custom analysers look a little bit like I've thrown the kitchen sink at them and I'd like a way to test and refactor:
PUT /objects
{
"settings" : {
"index" : {
"number_of_shards" : "5",
"analysis" : {
"analyzer" : {
"search_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"char_filter" : [
"html_strip"
],
"filter": [
"standard",
"apostrophe",
"lowercase",
"asciifolding",
"english_stop",
"english_stemmer"
]
},
"autocomplete": {
"type": "custom",
"tokenizer": "autocomplete",
"filter": [
"standard",
"lowercase",
"asciifolding",
"english_stop",
"english_stemmer"
]
},
"title_html_strip" : {
"filter" : [
"standard",
"lowercase"
],
"char_filter" : [
"html_strip"
],
"type" : "custom",
"tokenizer" : "standard"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
}
}
}
}
}
This bug is in elyzer. In order to show the state of the tokens at each step of the analysis process, elyzer performs an analyze query for each stage: first char filters, then tokenizer, and finally token filters.
The problem is that on the ES side, the analysis process has changed since they introduced normalizers (in a non-backward-compatible way). They assume that if there is no normalizer, no analyzer and no tokenizer in the request, but either a token filter or a char_filter, then the analyze request should behave like a normalizer.
In your case, elyzer will first perform a request for the html_strip character filter and ES will think it is about a normalizer, hence the error you're getting since html_strip is not a valid char_filter for normalizers.
Since I know Elyzer's developer (Doug Turnbull) pretty well, I've filed a bug already. We'll see what unfolds.
Alternative method of examining the tokens produced by my custom analysers:
The official documentation covers the _analyze API, which, together with the "explain": true flag, provides the information I need to scrutinise my custom analysers.
The following outputs the tokens at each filter stage:
GET objects/_analyze
{
"analyzer" : "search_autocomplete",
"explain" : true,
"text" : "Trinity Chapel [in fact <h1>King's Chapel</h1>]"
}

Custom sorting in Elasticsearch

I have some documents in Elasticsearch with a completion suggester. When I search for a value like Stack, the results are shown in the order given below:
Stack Overflow
Stack-Overflow
Stack
StackOver
StackOverflow
I want the result to be displayed in the order:
Stack
StackOver
StackOverflow
Stack Overflow
Stack-Overflow
i.e., the exact matches should come first, ahead of results containing spaces or special characters.
TIA
It all depends on the way you are analysing the string you are querying on. I would suggest applying more than one analyser to the same string field. Below is an example mapping of the "name" field over which you want the autocomplete/suggester feature:
"name": {
"type": "string",
"analyzer": "keyword_analyzer",
"fields": {
"name_ac": {
"type": "string",
"index_analyzer": "string_autocomplete_analyzer",
"search_analyzer": "keyword_analyzer"
}
}
}
Here, keyword_analyzer and string_autocomplete_analyzer are analysers defined in your index settings. Below is an example:
"keyword_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
"string_autocomplete_analyzer": {
"type": "custom",
"filter": [
"lowercase"
,
"autocomplete"
],
"tokenizer": "whitespace"
}
Here autocomplete is an analysis filter:
"autocomplete": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "10"
}
After having set this up, when searching Elasticsearch for auto-suggestions you can use multi-match queries instead of normal match queries, and provide boosts to the individual fields in the multi-match. Below is an example in Java:
QueryBuilders.multiMatchQuery(yourSearchString,"name^3","name_ac");
You may need to alter the boost (^3) as per your needs.
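The equivalent request body in JSON form would look roughly like this (a sketch; "Stack" stands for whatever the user has typed so far):
{
"query": {
"multi_match": {
"query": "Stack",
"fields": [ "name^3", "name_ac" ]
}
}
}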
If even this does not satisfy your requirements, you can look at adding one more analyser, which analyses the string based on the first word, and include that field in the multi-match (see the sketch after the filter definitions below). Below is an example of such an analyser:
"first_word_name_analyzer": {
"type": "custom",
"filter": [
"lowercase"
,
"whitespace_merge"
,
"edgengram"
],
"tokenizer": "keyword"
}
With these analysis filters:
"whitespace_merge": {
"pattern": "\s+",
"type": "pattern_replace",
"replacement": " "
},
"edgengram": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "32"
}
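Assuming the first-word analyser is applied to an additional sub-field, say name_fw (a placeholder name, not part of the original mapping), the multi-match could then include it with its own boost:
{
"query": {
"multi_match": {
"query": "Stack",
"fields": [ "name^3", "name_fw^2", "name_ac" ]
}
}
}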
You may have to experiment with the boost values in order to reach the optimal results for your requirements. Hope this helps.

Elasticsearch Optimizing Query

We are implementing a company-list search using Elasticsearch, but it is not returning what we expected.
Example companies:
Infosys technologies
Infosys technologies ltd
Infosys technologies pvt ltd
Infosys technologies Limited
Infosys technologies Private Limited
BAC Infosys ltd
Scenario:
When searching for the keyword "Infosys", it should return the "Infosys technologies" list.
When searching for the keyword "Infosys ltd", it should return the "Infosys technologies" list.
When searching for the keyword "BAC Infosys ltd", it should return the "BAC Infosys ltd" list.
The settings and mapping below are used:
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"keyword_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"companies": {
"properties": {
"company_name": {
"type": "string",
"store": "true",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "keyword_analyzer",
"null_value": "null"
}
}
}
}
}
Query:
{"query":
{
"bool": {
"must": [
{ "match": { "company_name": "Infosys technologies" }}
],
"minimum_should_match": "80%"
}
}
}
Please help me achieve this.
You are missing a few things, both in the search queries and in the mappings. Looking at your scenarios with your current mapping settings:
1) The results will also contain the "BAC Infosys ltd" document. You could switch to edge n-grams, but that will no longer let you match from the middle of a word.
2) It also depends on what kind of search you are building; you may be able to avoid the arrangement suggested in 1). Let's assume that, for all your scenarios, the "BAC Infosys ltd" document may appear in the results but ranked lower in the list. For this you can use proximity queries, with fuzziness turned on for spell checking.
The three scenarios alone don't explain the whole functionality and use cases of your search feature to me, but I think the proximity search offered by Elasticsearch can give you more flexibility to meet your cases (a rough sketch follows below).
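A sketch of what such a query could look like (the slop and fuzziness values are placeholders to tune, not recommendations): the fuzzy match finds candidates with some spelling tolerance, while the match_phrase clause ranks documents higher when the words appear close together.
{
"query": {
"bool": {
"must": [
{ "match": { "company_name": { "query": "Infosys ltd", "fuzziness": "AUTO" } } }
],
"should": [
{ "match_phrase": { "company_name": { "query": "Infosys ltd", "slop": 2 } } }
]
}
}
}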
Shingles could help:
https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html
For your case the nGram analyzer is not pertinent; it will hurt performance and the relevance score. Create a shingle filter and a custom analyzer with the standard tokenizer and a lowercase filter, for example as sketched below.
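A minimal sketch of such settings (the names shingle_filter and shingle_analyzer are placeholders):
"settings": {
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3,
"output_unigrams": true
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter"
]
}
}
}
}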
HtH,

Elasticsearch - get results for autocomplete only for start of words

I'm using Elasticsearch to give autosuggestions in a search bar, but I want it to match only the beginning of words. E.g.
doc_name_1 = "black bag"
doc_name_2 = "abla bag"
Case 1.
The search-bar string is part_string = "bla"; the query I'm currently using is:
query_body = {"query": {
"match": {
"_all": {
"query": part_string,
"operator": "and",
"type": "phrase_prefix"
}
}
}}
This query returns hits on both doc_name_1 and doc_name_2.
What I need is to get a hit only on doc_name_1, since doc_name_2 does not start the same way as the queried string.
I tried using "type": "phrase", but ES keeps matching "inside" the words in the docs. Is it possible to do this just by modifying the query, or the settings?
I'll share my ES settings:
{ "analysis":{
"filter":{
"nGram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram":20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}},
"analyzer":{
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter":[
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type":"custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}}}}
Use edge n-grams instead of n-grams. With plain n-grams you are breaking up the text from every position of each word and filling the inverted index with tokens that also match lookups from the middle of words. Switching the filter type to edge_ngram (a sketch follows below) keeps only tokens anchored at the start of each word.
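A minimal sketch of the change, keeping the same filter name so the rest of the analyzer definition stays untouched (token_chars is a tokenizer-level option, so it is omitted from the filter here):
"filter": {
"nGram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}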

How can I get results that don't fully match using ElasticSearch?

If a user types
jewelr
I want to get results for
jewelry
I am using a multi_match query.
You could use EdgeNGram tokenizer:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer/
Specify an index-time analyzer using this:
"analysis": {
"filter": {
"fulltext_ngrams": {
"side": "front",
"max_gram": 15,
"min_gram": 3,
"type": "edgeNGram"
}
},
"analyzer": {
"fulltext_index": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"asciifolding",
"fulltext_ngrams"
],
"type": "custom",
"tokenizer": "standard"
}
}
Then either set it as the default index analyzer, or specify it in the mapping for a specific field, for example:
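A minimal field-mapping sketch (the type name "products" and field name "title" are placeholders), applying the analyzer at index time while keeping a plain analyzer at search time:
"mappings": {
"products": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "fulltext_index",
"search_analyzer": "standard"
}
}
}
}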
When indexing a field with the value jewelry, with a 3/15 EdgeNGram, all of the following prefixes will be stored:
jew
jewe
jewel
jewelr
jewelry
Then a search for jewelr will get a match in that document.
