Elasticsearch: Scoring with Ngrams

I have a straightforward question. I have incorporated ngrams for partial matching. The implementation works well, but the score results aren't what I hoped for. I would like my score results to look something like this:
Ke: .1
Kev: .2
Kevi: .3
Kevin: .4
Instead, I am getting the following results, where the score is the same whenever there is a match on the field:
Ke: .4
Kev: .4
Kevi: .4
Kevin: .4
Settings:
settings: {
  analysis: {
    filter: {
      ngram_filter: {
        type: 'edge_ngram',
        min_gram: 2,
        max_gram: 15
      }
    },
    analyzer: {
      ngram_analyzer: {
        type: 'custom',
        tokenizer: 'standard',
        filter: [
          'lowercase',
          'ngram_filter'
        ]
      }
    }
  }
}
Mappings:
mappings: [{
  name: 'voter',
  _all: {
    'type': 'string',
    'analyzer': 'ngram_analyzer',
    'search_analyzer': 'standard'
  },
  properties: {
    last: {
      type: 'string',
      required: true,
      include_in_all: true,
      analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    },
    first: {
      type: 'string',
      required: true,
      include_in_all: true,
      analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    }
  }
}]
Query:
GET /user/_search
{
  "query": {
    "match": {
      "_all": {
        "query": "Ke",
        "operator": "and"
      }
    }
  }
}

You can solve this by using an edge_ngram tokenizer instead of an edge_ngram filter:
settings: {
  analysis: {
    tokenizer: {
      ngram_tokenizer: {
        type: 'edge_ngram',
        min_gram: 2,
        max_gram: 15
      }
    },
    analyzer: {
      ngram_analyzer: {
        type: 'custom',
        tokenizer: 'ngram_tokenizer',
        filter: [
          'lowercase'
        ]
      }
    }
  }
}
The reason is that the edge_ngram filter writes all the terms it generates for a given token at the same position (much like synonyms do), while the edge_ngram tokenizer creates tokens at different positions. The number of positions feeds into the length normalization, which in turn influences the score.
Note that this only works on pre-2.0 ES releases: there, a compound score is computed from the scores of all ngram tokens, whereas in ES 2.x only the matching token is scored.
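If you want to see the difference, the _analyze API shows the position of each token an analyzer emits. A minimal sketch, assuming the index is called user as in the query above and that ngram_analyzer is defined on it (on older releases you may need to pass analyzer and text as query parameters instead of a JSON body):
GET /user/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Kevin"
}
With the filter-based analyzer, ke, kev, kevi and kevin should all come back at position 0; with the tokenizer-based analyzer they should come back at increasing positions, which is what changes the length normalization and therefore the score.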

Related

Elasticsearch inaccurate score for symbol searching

I have two stores named face- and face+ store.
When I search for face+, I want the list of results to be:
face+ store
face-
But the results are:
face-
face+ store
My custom analyzer produces tokens like this:
face- to [face, face-]
face+ store to [face, face+, +store, store]
This is my query:
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer"
}
This is my mapping, in case it helps:
analysis: {
  analyzer: {
    custom_analyzer: {
      type: "custom",
      char_filter: ["ampersand", "sym_repl"],
      tokenizer: "whitespace",
      filter: ["lowercase", "asciifolding", "searchkick_index_shingle", "searchkick_stemmer", "del_sym"]
    }
  },
  char_filter: {
    # add a space around the matched symbols
    sym_repl: {
      type: "pattern_replace",
      pattern: '[.+\-*|\]\)\"##&!]',
      replacement: " $0 "
    }
  },
  filter: {
    # remove tokens that match the stopwords
    del_sym: {
      type: "stop",
      stopwords: ['.', '+', '-', '*', '|', ']', ')', '"', '#', '#', '&', '!']
    }
  }
}
mappings: {
  store: {
    properties: {
      name: {
        type: "keyword",
        fields: {
          analyzed: {
            type: "text",
            analyzer: "custom_analyzer"
          }
        }
      },
It's difficult to reproduce your issue since you are using Searchkick, but if you use minimum_should_match with a value of 2 and build the query properly, it will filter face- out of the search results, which is what you want.
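A minimal sketch of that idea, reusing the multi_match from the question (minimum_should_match is the only addition; since the query face+ is analyzed into face and face+, a document now has to contain both terms, so face- should drop out of the results):
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer",
  minimum_should_match: 2
}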

Partial search query in kuromoji

I have an issue when trying to do partial search using the kuromoji plugin.
When I index a full phrase, like ホワイトソックス, with an analyzer like:
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "filter": ["lowercase"],
  "text": "ホワイトソックス"
}
the word is properly split into ホワイト and ソックス, as it should be. I can search for both words separately, and that's correct.
But when the user hasn't typed the full phrase yet and the last character is missing (ホワイトソック), every kuromoji analyzer treats it as one word.
Because of that, the result is empty.
My question is: is there something I can do about it, either by indexing or by searching in a different fashion? I'm sure Japanese partial search exists, but I can't find the right settings.
Example index settings:
{
  analyzer: {
    ngram_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['lowercase', 'cjk_width', 'ngram_filter']
    },
    search_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['asciifolding']
    }
  },
  filter: {
    ngram_filter: {
      type: 'edge_ngram',
      min_gram: '1',
      max_gram: '20',
      preserve_original: true,
      token_chars: ['letter', 'digit']
    }
  },
  tokenizer: {
    search_tokenizer: {
      type: 'kuromoji_tokenizer',
      mode: 'search'
    }
  }
}
Search query:
query_string: {
  fields: [
    "..."
  ],
  query: "ホワイトソック",
  fuzziness: "0",
  default_operator: "AND",
  analyzer: "search_analyzer"
}
Any help appreciated!

Elasticsearch edgeNGram analyzer/tokenizer fuzzy query matching

We have an Accounts table in which we search for similar records using a fuzzy query with an edgeNGram analyzer on multiple fields. Our setup:
Settings
{
  settings: {
    analysis: {
      analyzer: {
        edge_n_gram_analyzer: {
          tokenizer: "whitespace",
          filter: ["lowercase", "ednge_gram_filter"]
        }
      },
      filter: {
        ednge_gram_filter: {
          type: "edgeNGram",
          min_gram: 2,
          max_gram: 10
        }
      }
    }
  }
}
Mapping
{
  mappings: {
    document_type: {
      properties: {
        uid: {
          type: "text",
          analyzer: "edge_n_gram_analyzer"
        },
        shop_name: {
          type: "text",
          analyzer: "edge_n_gram_analyzer"
        },
        seller_name: {
          type: "text",
          analyzer: "edge_n_gram_analyzer"
        },
        ...
        locale_id: {
          type: "integer"
        }
      }
    }
  }
}
Query
{
  body: {
    query: {
      bool: {
        must: [
          {
            bool: {
              should: [
                {
                  fuzzy: {
                    uid: {
                      value: "antonline",
                      boost: 1.0,
                      fuzziness: 2,
                      prefix_length: 0,
                      max_expansions: 100
                    }
                  }
                },
                {
                  fuzzy: {
                    seller_name: {
                      value: "antonline",
                      boost: 1.0,
                      fuzziness: 2,
                      prefix_length: 0,
                      max_expansions: 100
                    }
                  }
                },
                {
                  fuzzy: {
                    shop_name: {
                      value: "antonline",
                      boost: 1.0,
                      fuzziness: 2,
                      prefix_length: 0,
                      max_expansions: 100
                    }
                  }
                }
              ]
            }
          }
        ],
        must_not: [
          {
            term: {
              locale_id: {
                value: 7
              }
            }
          }
        ]
      }
    }
  }
}
The above example finds different variations of the string 'antonline', such as "antonline", "sanjonline", "tanonline", "kotonline", "htonline" and "awmonline". However, it doesn't match strings with punctuation, like antonline.com, or even antonlinecom without the dot. We tried different types of tokenizers, but nothing helps.
How can we achieve the search results we expect?
I resolved that issue by removing every character that matches this regex:
[.,'\"\-+:~\^!?*\\]
Do the removal both while building the index and while searching.
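One way to do that removal inside Elasticsearch rather than in application code is a pattern_replace char_filter; because the same analyzer runs at index time and at search time, the characters are stripped in both places. A minimal sketch, reusing the analyzer from the question (the strip_punct name is made up for the example, and the pattern is the regex from above):
settings: {
  analysis: {
    char_filter: {
      strip_punct: {
        type: "pattern_replace",
        pattern: "[.,'\"\-+:~\^!?*\\]",
        replacement: ""
      }
    },
    analyzer: {
      edge_n_gram_analyzer: {
        tokenizer: "whitespace",
        char_filter: ["strip_punct"],
        filter: ["lowercase", "ednge_gram_filter"]
      }
    }
  }
}
With that in place, antonline.com should be indexed as the single token antonlinecom, whose edge ngrams (up to max_gram 10) include antonline, so the fuzzy query above should match it again.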

ElasticSearch match multiple fields with different values

I can actually perform a simple match search with this:
query: {match: {searchable: {query:search}}}
This works well; my searchable field is analyzed in my mapping.
Now I want to perform a search on multiple fields: one string field, while all the others are numeric.
My mapping:
mappings dynamic: 'false' do
  indexes :searchable, analyzer: "custom_index_analyzer", search_analyzer: "custom_search_analyzer"
  indexes :year, type: "integer"
  indexes :country_id, type: "integer"
  indexes :region_id, type: "integer"
  indexes :appellation_id, type: "integer"
  indexes :category_id, type: "integer"
end

def as_indexed_json(options={})
  as_json(
    only: [:searchable, :year, :country_id, :region_id, :appellation_id, :category_id]
  )
end
I have tried this:
query: {
  filtered: {
    query: {
      match: {
        searchable: search
      }
    },
    filter: {
      term: {
        country_id: "1"
      },
      term: {
        region_id: "2"
      },
      term: {
        appellation_id: "34"
      }
    }
  }
},
sort: {
  _score: {
    order: :desc
  },
  year: {
    order: :desc,
    ignore_unmapped: true
  }
},
size: 100
It works, but it gives me 100 results in all cases for the appellation_id sent (34), even when the searchable field is very far from the searched text.
I have also tried a BOOL query:
self.search(query: {
  bool: {
    must: [
      {
        match: {
          country_id: "1"
        },
        match: {
          region_id: "2"
        },
        match: {
          appellation_id: "34"
        },
        match: {
          searchable: search
        }
      }
    ]
  }
},
sort: {
  _score: {
    order: :desc
  },
  year: {
    order: :desc,
    ignore_unmapped: true
  }
},
size: 100
)
But it gives me all results matching the searchable field and ignores the appellation_id I want.
My goal is to get the best results and performance: ask ES for all documents with country_id=X, region_id=Y and appellation_id=Z, then perform a match on that set of results with the searchable field, so that I don't get results too far from the searched text.
Thanks.
As you may know, the Elasticsearch match query returns results based on a relevance score. You can try a term query instead of match for an exact term match. Also, I think your bool query structure must look like this:
bool: {
  must: [
    { match: { country_id: "1" } },
    { match: { region_id: "2" } },
    { match: { appellation_id: "34" } },
    { match: { searchable: search } }
  ]
}
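Since the numeric ids are exact constraints rather than something to score on, another sketch worth trying (assuming your ES version's bool query supports a filter clause) keeps the match on searchable only and moves the ids into term clauses in filter context, so they restrict the result set without affecting the relevance score:
bool: {
  must: [
    { match: { searchable: search } }
  ],
  filter: [
    { term: { country_id: "1" } },
    { term: { region_id: "2" } },
    { term: { appellation_id: "34" } }
  ]
}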

Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field I only get results for parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...
Figured it out!
Store two fields, one for the text content (text) and a sub-field with the English-ified stem words (text.english).
Create a custom analyzer, based on the default English analyzer, that doesn't strip stop words.
Highlight both fields and check for each when displaying results to the user.
Here's my index configuration:
{
  mappings: {
    documents: {
      properties: {
        title: { type: 'string' },
        text: {
          type: 'string',
          term_vector: 'with_positions_offsets_payloads',
          fields: {
            english: {
              type: 'string',
              analyzer: 'english_nostop',
              term_vector: 'with_positions_offsets_payloads',
              store: true
            }
          }
        }
      }
    }
  },
  settings: {
    analysis: {
      filter: {
        english_stemmer: {
          type: 'stemmer',
          language: 'english'
        },
        english_possessive_stemmer: {
          type: 'stemmer',
          language: 'possessive_english'
        }
      },
      analyzer: {
        english_nostop: {
          tokenizer: 'standard',
          filter: [
            'english_possessive_stemmer',
            'lowercase',
            'english_stemmer'
          ]
        }
      }
    }
  }
}
And here's what a query looks like:
{
  query: {
    query_string: {
      query: <query>,
      fields: ['text.english'],
      analyzer: 'english_nostop'
    }
  },
  highlight: {
    fields: {
      'text.english': {},
      'text': {}
    }
  }
}
