How to approach non-Latin characters in Elasticsearch autocompletion with Mongoosastic?

Autocompletion is working fine using es.search({size: 0, suggest: ...}) with a completion mapping on a field that can contain non-Latin diacritics (accented characters like â, ê, etc.).
I am creating the mappings using Mongoosastic. I need to be able to use something like asciifolding for the suggestions, or to add an additional field to the response.
I have these fields:
name, which is the one with diacritics.
nameSearch, which is the name latinized (no diacritics/accented characters).
What I need is to keep running completion suggestions over name but treat a the same as â (and vice versa).
In the response I need name, not nameSearch.

I stumbled on this problem again, this time without Mongoosastic. The answer is to include a settings field in the index-creation request (in Mongoosastic you can add it when using custom mappings):
settings: {
  analysis: {
    analyzer: {
      folding: {
        tokenizer: 'standard',
        filter: ['lowercase', 'custom_asciifolding'],
      },
    },
    filter: {
      custom_asciifolding: {
        type: 'asciifolding',
        preserve_original: true,
      },
    },
  },
}
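These settings only define the folding analyzer; the completion field still has to reference it. A minimal sketch of the corresponding mapping, following the field names from the question (the surrounding mapping shape is an assumption, not taken from the original code):

mappings: {
  properties: {
    name: {
      type: 'completion',
      // analyze suggestion input with folding: since preserve_original is set,
      // 'â' is indexed both as-is and as 'a', so accented and plain prefixes
      // both match, while the suggestion text returned is still `name`
      analyzer: 'folding',
      search_analyzer: 'folding',
    },
  },
}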

Related

Elasticsearch multiple suggestions with more advanced cases like matching prefix in the middle of a sentence

My use case: I have a search bar where the user can type a query. I want to show multiple types of search suggestions to the user in addition to a regular query suggestion; for example, company sector, company, and school suggestions.
This is currently implemented using completion suggesters and the following suggest request (this is code from our Ruby implementation, but I believe you should be able to understand it easily):
{
  _source: '',
  suggest: {
    text: query_from_the_user, # user query like "sec" to find "security" related matches
    'school_names': {
      completion: {
        field: 'school_names_suggest',
      },
    },
    'companies': {
      completion: {
        field: 'company_name.suggest',
      },
    },
    'sectors': {
      completion: {
        field: sector_field_based_on_current_language(I18n.locale),
        # uses 'company_sector.french.suggest' when the user browses in French
      },
    },
  },
}
Here are my mappings (this is written in Ruby, but I believe it shouldn't be too hard to mentally convert to the Elasticsearch JSON config):
indexes :company_name, type: 'text' do
  indexes :suggest, type: 'completion'
end
indexes :company_sector, type: 'object' do
  indexes :french, type: 'text' do
    indexes :suggest, type: 'completion'
  end
  indexes :english, type: 'text' do
    indexes :suggest, type: 'completion'
  end
end
indexes :school_names_suggest, type: 'completion'
# sample indexed JSON
{
  company_name: "Christian Dior Couture",
  company_sector: {
    english: 'Milk sector',
    french: 'Secteur laitier'
  },
  school_names_suggest: ['Télécom ParisTech', 'Ecole Centrale Paris']
}
The problem is that the suggester is not powerful enough: it cannot autocomplete from the middle of a sentence, and it cannot provide additional results after a perfect match. Here are some scenarios that I need my ES implementation to capture.
CASE 1 - Matching by prefix in the middle of a sentence
# documents
[{ company_name: "Christian Dior Couture" }]
# => A search term "Dior" should return this document because it matches by prefix on the second word
CASE 2 - Provide results even after a perfect match
# documents
[
  { company_name: "Crédit Agricole" },
  { company_name: "Crédit Agricole Pyrénées Gascogne" },
]
# => A search term "Crédit Agricole" should return both documents (with the current implementation it only returns "Crédit Agricole")
Can I implement this using suggesters in Elasticsearch? Or do I need to fall back to multiple searches that take advantage of the new search-as-you-type data type, using a query as mentioned in the docs?
I am using Elasticsearch 7.1 on AWS and the Ruby driver (gem elasticsearch-7.3.0).
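For reference, both cases map naturally onto the search-as-you-type data type mentioned in the question; note that it ships with Elasticsearch 7.2+, so staying on 7.1 rules it out. A minimal sketch of the mapping and query (the companies index name is an assumption):

# index name "companies" is hypothetical
PUT /companies
{
  "mappings": {
    "properties": {
      "company_name": { "type": "search_as_you_type" }
    }
  }
}

# "Dior" matches by prefix in the middle of "Christian Dior Couture" (case 1),
# and a complete value like "Crédit Agricole" still returns every document
# that contains it (case 2), because this is a regular query, not a suggester
GET /companies/_search
{
  "query": {
    "multi_match": {
      "query": "Dior",
      "type": "bool_prefix",
      "fields": [
        "company_name",
        "company_name._2gram",
        "company_name._3gram"
      ]
    }
  }
}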

What is the difference between a `text` field with a `keyword` analyzer and a `keyword` field in Elasticsearch?

properties: {
  keyword: {
    type: "keyword",
    fields: {
      text: { type: "text", analyzer: "keyword" }
    }
  }
}
If I create an index with this mapping, what is the difference between keyword and keyword.text?
Both behave the same. The keyword type/analyzer, per the documentation, accepts whatever text it is given and outputs the exact same text as a single term.
If the intention is an exact match, the keyword type should be preferred. If you need to customise matching (e.g. case-insensitive search), a custom analyzer can be used.
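One way to see that they behave alike is the _analyze API, which derives the analyzer from the field mapping; a sketch for the Kibana console, assuming the mapping above lives in a hypothetical index called test:

# index name "test" is hypothetical
GET /test/_analyze
{
  "field": "keyword",
  "text": "Foo Bar"
}

GET /test/_analyze
{
  "field": "keyword.text",
  "text": "Foo Bar"
}

# both requests return the single unchanged term "Foo Bar"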

How to add "context" to Elasticsearch suggestions

I'm building an enterprise social network.
I want to suggest people to add as friends, based on their title.
For example, the value can be: developer, blogger, singer, barber, bartender...
My users are saved in Elasticsearch; their titles are saved in the field 'title'.
The current mapping is:
title: {
  type: 'text',
  analyzer: 'autocomplete_analyzer',
  search_analyzer: 'autocomplete_analyzer_search'
}
and the query is:
should: [
  {
    match: {
      title: {
        query: user.title,
        minimum_should_match: '90%',
        boost: 2
      }
    }
  }
]
and the analyzer definitions are:
indexConfig: {
  settings: {
    analysis: {
      analyzer: {
        autocomplete_analyzer: {
          tokenizer: 'autocomplete_tokenizer',
          filter: ['lowercase', 'asciifolding']
        },
        autocomplete_analyzer_search: {
          tokenizer: 'lowercase',
          filter: ['asciifolding']
        },
        phrase_analyzer: {
          tokenizer: 'standard',
          filter: ['lowercase', 'asciifolding', 'fr_stop', 'fr_stemmer', 'en_stop', 'en_stemmer']
        },
        derivative_analyzer: {
          tokenizer: 'standard',
          filter: ['lowercase', 'asciifolding', 'derivative_filter', 'fr_stop', 'fr_stemmer', 'en_stop', 'en_stemmer']
        }
      },
      tokenizer: {
        autocomplete_tokenizer: {
          type: 'edge_ngram',
          min_gram: 2,
          max_gram: 20,
          token_chars: ['letter', 'digit']
        }
      },
      filter: {
        derivative_filter: {
          type: 'word_delimiter',
          generate_word_parts: true,
          catenate_words: true,
          catenate_numbers: true,
          catenate_all: true,
          split_on_case_change: true,
          preserve_original: true,
          split_on_numerics: true,
          stem_english_possessive: true
        },
        en_stop: {
          type: 'stop',
          stopwords: '_english_'
        },
        en_stemmer: {
          type: 'stemmer',
          language: 'light_english'
        },
        fr_stop: {
          type: 'stop',
          stopwords: '_french_'
        },
        fr_stemmer: {
          type: 'stemmer',
          language: 'light_french'
        }
      }
    }
  }
}
I tested it, and the relevance is very good, but not enough users are matched because of the '90%' criterion.
A quick and dirty solution is of course to lower this criterion to 50%.
However, if I do that, I suppose Elasticsearch will match titles based on the letters they share rather than on the actual proximity between titles.
For example, if my user is a 'barber', Elasticsearch might suggest 'bartender', because they have in common: b, a, r, e, r.
Hence, I have two questions:
1 - is my assumption correct?
2 - what can I do to make my title search more relevant?
The problem with your search is the following: it uses autocomplete_analyzer, which basically creates a huge index with a lot of n-grams.
For bartender those would be something like ba, bar, bart, etc.
As you can see, barber produces some similar n-grams, which is enough to make a match.
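You can see the overlap with the _analyze API (a sketch; the index name is hypothetical):

# index name "users" is hypothetical
GET /users/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "bartender"
}
# => ba, bar, bart, barte, barten, bartend, bartende, bartender

GET /users/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "barber"
}
# => ba, bar, barb, barbe, barber
# the shared n-grams "ba" and "bar" are enough to produce a match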
Regarding your questions: if you lower minimum_should_match you will get more results, but only because the matching procedure then allows partial matches.
To increase the relevancy, I would recommend using another analyzer, since this n-gram analyzer is usually suitable only for autosuggest functionality, which isn't your case. There are several choices, from keeping it simple with the keyword analyzer to the whitespace one.
What matters more is to construct the query properly. For example, if the user searches for a partial title, e.g. bar, you may use a prefix query. However, if you are only searching by full match (e.g. developer or bartender), it is more important to normalize the title field properly, e.g. with a lowercase analyzer and some stemming.
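A sketch of that split, assuming a lowercase-normalized keyword subfield title.exact is added to the mapping (it does not exist in the mapping above):

# title.exact is a hypothetical normalized keyword subfield
# partial input typed by the user, e.g. "bar"
GET /users/_search
{
  "query": {
    "prefix": { "title.exact": "bar" }
  }
}

# full title match, e.g. "developer" or "bartender"
GET /users/_search
{
  "query": {
    "term": { "title.exact": "bartender" }
  }
}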

Elasticsearch - Search by alphabet characters A-Z

I would like to ask how to filter specific data in Elasticsearch by a single character A-Z.
So, for example, I have this data:
Orange
Apple
Ancient
Axe
I would like to get all results which start with (not merely contain) a given character, for example "A". So the results are:
Apple
Ancient
Axe
I found here that I should create a new analyzer, analyzer_startswith, and set it up like this. What am I doing wrong? I am currently getting 0 results.
Elastica .yml config:
fos_elastica:
    clients:
        default: noImportantInfo
    indexes:
        bundleName:
            client: default
            finder: ~
            settings:
                index:
                    analysis:
                        analyzer:
                            analyzer_startswith:
                                type: custom
                                tokenizer: keyword
                                filter:
                                    - lowercase
            types:
                content:
                    properties:
                        id:
                            type: integer
                        elasticaPriority:
                            type: integer
                        title:
                            type: string
                            analyzer: another_custom_analyzer
                            fields:
                                raw:
                                    type: string
                                    index: not_analyzed
                        title_ngram:
                            type: string
                            analyzer: analyzer_startswith
                            property_path: title
Thank you
You could use the Prefix query for this; see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-prefix-query.html:
GET /_search
{
  "query": {
    "prefix": { "user": "ki" }
  }
}
Thanks, I used prefix and it's working.
I set the index to not_analyzed and used Prefix to find the first character of the string.
title_ngram:
    type: string
    property_path: title
    index: not_analyzed
Is there any other way to apply a standard search to my "title_ngram" now? Because I would like to search by a single character and also do full-text search on "title_ngram".
Try this one:
GET /content/_search
{
  "query": {
    "match": {
      "title": "A"
    }
  },
  "sort": "title.raw"
}
For more information, refer to the link below:
https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html
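If you need both behaviours in one request (the follow-up question above), the two can be combined in a bool query; a sketch using the fields from the mapping above:

GET /content/_search
{
  "query": {
    "bool": {
      "should": [
        { "prefix": { "title_ngram": "A" } },
        { "match":  { "title": "Apple" } }
      ],
      "minimum_should_match": 1
    }
  },
  "sort": "title.raw"
}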

Implementing an Elasticsearch custom filter for all queries

I'm trying to upgrade from Elasticsearch 0.90.3 to 2.0 and running into some issues. The people who originally set up and configured ES are no longer available, so I'm going about this with very little knowledge of how it works.
They configured ES 0.90.3 to use ElasticSearch-ServiceWrapper and Tire; beyond that there are only a couple of small configuration changes.
For the most part the upgrade went smoothly: I replaced the setup information in the cap deploy process to install ES 2.0 instead of 0.90.3, and the service comes up. However, I cannot get the partial matching that took place before to work. I need to set up a standard filter, applied to all searches, that searches all fields using partial matches. I've done tons of Google searches and this is the closest I can come up with, but it still isn't returning partial matches.
index:
    settings:
        analysis:
            filter:
                autocomplete_filter:
                    type: edge_ngram
                    min_gram: 2
                    max_gram: 32
            analyzer:
                autocomplete:
                    type: custom
                    tokenizer: standard
                    filter: [ lowercase, autocomplete_filter ]
    mappings:
        access_point_status:
            properties:
                text:
                    type: string
                    analyzer: autocomplete
                    search_analyzer: standard
I was hoping not to need to replace Tire, as that would make this upgrade much more involved, but if the problem lies within the queries and not the setup then I will go down that road. This is a sample query that is not returning the desired results:
curl -X GET 'http://localhost:9200/access_point_status/_search?from=0&size=100&pretty' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": {
              "query": "1925",
              "type": "phrase_prefix"
            }
          }
        }
      ]
    }
  },
  "sort": [ { "name": "asc" } ],
  "filter": { "term": { "domain": "domain_1" } },
  "size": 100,
  "from": 0
}'
Thanks
So I've found most of the issues. The indexes were being created by Tire and data_tables using a different mapping, which couldn't be overwritten once created.
I created these filters and then applied them to the fields:
index:
    analysis:
        filter:
            edge_ngram_filter:
                type: edge_ngram
                min_gram: 2
                max_gram: 32
                side: front
        analyzer:
            character_only:
                type: custom
                tokenizer: standard
                filter: [ lowercase, edge_ngram_filter ]
            special_character:
                type: custom
                tokenizer: keyword
                filter: [ lowercase, edge_ngram_filter ]
And I'm matching about 95% of the things I would hope to with:
curl -X GET 'http://localhost:9200/access_point_status/_search?from=0&size=100&pretty' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "_all": "bsap-"
          }
        }
      ]
    }
  },
  "sort": [
    { "name": "asc" }
  ],
  "filter": {
    "term": {
      "domain": "domain_1"
    }
  },
  "size": 100,
  "from": 0
}'
The only thing I'm missing is getting special characters to match; uppercase characters are not matched either. I've tried several types of queries; query_string doesn't seem to match any partials. Does anyone have thoughts on other queries?
I need to match things like MAC addresses, IPs, and combined text/number fields with -_,. as separators.
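One likely cause of the uppercase misses, for what it's worth: prefix queries are term-level and skip analysis, so an uppercase search term never passes through the lowercase filter the way the indexed terms did. A match query does analyze its input; a minimal sketch against the same index (only _all and the sample value come from the question, the rest is an assumption):

curl -X GET 'http://localhost:9200/access_point_status/_search?pretty' -d '
{
  "query": {
    "match": {
      "_all": "BSAP-"
    }
  }
}'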
