Elasticsearch inaccurate score for symbol searching

I have two stores named face- and face+ store.
When I search for face+, I want the results to be ordered as:
face+ store
face-
But the actual results are:
face-
face+ store
My custom analyzer produces tokens like this:
face- → [face, face-]
face+ store → [face, face+, +store, store]
This is my query:
multi_match: {
query: keywords,
type: "best_fields",
fields: ['name.analyzed^10','name.word_middle^5'],
analyzer: "custom_analyzer",
}
This is my mapping, in case it helps:
analysis: {
analyzer: {
custom_analyzer: {
type: "custom",
char_filter: ["ampersand", "sym_repl"],
tokenizer: "whitespace",
filter: ["lowercase", "asciifolding", "searchkick_index_shingle", "searchkick_stemmer", "del_sym"]
}
},
char_filter: {
# adding a space between those patterns
sym_repl: {
type: "pattern_replace",
pattern: '[.+\-*|\]\)\"##&!]',
replacement: " $0 "
}
},
filter: {
# remove tokens that match the stopwords
del_sym: {
type: "stop",
stopwords: ['.', '+', '-', '*', '|', ']', ')', '"', '#', '#', '&', '!']
}
}
}
mappings: {
store: {
properties: {
name: {
type: "keyword",
fields: {
analyzed: {
type: "text",
analyzer: "custom_analyzer"
}
}
},

It's difficult to reproduce your issue, as you are using Searchkick. But if you use minimum_should_match with a value of 2 and build the query accordingly, it will filter face- out of the search results, which is what you want.
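As a rough sketch, reusing the fields from the question (the actual request Searchkick builds may differ), the query could look like this:
{
  "query": {
    "multi_match": {
      "query": "face+",
      "type": "best_fields",
      "fields": ["name.analyzed^10", "name.word_middle^5"],
      "analyzer": "custom_analyzer",
      "minimum_should_match": 2
    }
  }
}
Assuming the analyzer turns face+ into [face, face+] as described above, face- only matches the face token and is dropped, while face+ store matches both terms and is kept.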

Related

Partial search query in kuromoji

I have an issue when trying to do partial search using the kuromoji plugin.
When I index a full phrase, like ホワイトソックス, with an analyzer like:
{
"tokenizer": {
"type": "kuromoji_tokenizer",
"mode": "search"
},
"filter": ["lowercase"],
"text" : "ホワイトソックス"
}
then the word is properly split into ホワイト and ソックス as it should be; I can search for both words separately, and that's correct.
But when the user hasn't typed the full phrase yet and the last letter is missing (ホワイトソック), every kuromoji analyzer treats it as a single word.
Because of that, the result is empty.
My question is: is there something I can do about this, either by indexing or by searching in a different fashion? I'm sure Japanese partial search exists, but I can't find the right settings.
Example index settings:
{
analyzer: {
ngram_analyzer: {
tokenizer: 'search_tokenizer',
filter: ['lowercase', 'cjk_width', 'ngram_filter'],
},
search_analyzer: {
tokenizer: 'search_tokenizer',
filter: ['asciifolding'],
}
},
filter: {
ngram_filter: {
type: 'edge_ngram',
min_gram: '1',
max_gram: '20',
preserve_original: true,
token_chars: ['letter', 'digit']
}
},
tokenizer: {
search_tokenizer: {
type: 'kuromoji_tokenizer',
mode: 'search'
}
}
}
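For context, these settings only define the analyzers; a mapping that actually attaches them to a field would look roughly like this (the field name my_field is hypothetical, and on older versions the properties block sits under the document type):
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}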
Search query:
query_string: {
fields: [
"..."
],
query: "ホワイトソック",
fuzziness: "0",
default_operator: "AND",
analyzer: "search_analyzer"
}
Any help appreciated!

ElasticSearch match multiple fields with different values

I can actually perform a simple match search with this:
query: {match: {searchable: {query:search}}}
This works well; the searchable field is analyzed in my mapping.
Now I want to search on multiple fields: one string and the others numeric.
My mapping:
mappings dynamic: 'false' do
indexes :searchable, analyzer: "custom_index_analyzer", search_analyzer: "custom_search_analyzer"
indexes :year, type: "integer"
indexes :country_id, type: "integer"
indexes :region_id, type: "integer"
indexes :appellation_id, type: "integer"
indexes :category_id, type: "integer"
end
def as_indexed_json(options={})
as_json(
only: [:searchable, :year, :country_id, :region_id, :appellation_id, :category_id]
)
end
I have tried this:
query: {
filtered: {
query: {
match: {
searchable: search
}
},
filter: {
term: {
country_id: "1"
},
term: {
region_id: "2"
},
term: {
appellation_id: "34"
}
}
}
},
sort: {
_score: {
order: :desc
},
year: {
order: :desc,
ignore_unmapped: true
}
},
size:100
It runs, but it always gives me 100 results for the appellation_id sent (34), even when the searchable field is very far from the searched text.
I have also tried a BOOL query:
self.search(query: {
bool: {
must: [
{
match: {
country_id: "1"
},
match: {
region_id: "2"
},
match: {
appellation_id: "34"
},
match: {
searchable: search
}
}
]
}
},
sort: {
_score: {
order: :desc
},
year: {
order: :desc,
ignore_unmapped: true
}
},
size:100
)
But it gives me all results matching the searchable field and doesn't take the wanted appellation_id into account.
My goal is to get the best results and performance: ask ES for all documents with country_id=X, region_id=Y and appellation_id=Z, and then run the match on the searchable field against that subset, so I don't get results too far from the searched text.
Thanks.
As you may know, the Elasticsearch match query returns results based on a relevance score. You can try a term query instead of match for an exact match on the numeric fields. Also, I guess your bool query structure should look like this:
bool: {
must: [
{ match: {
country_id: "1"
}},
{match: {
region_id: "2"
}},
{match: {
appellation_id: "34"
}},
{match: {
searchable: search
}}
]
}
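If the numeric ids are meant as hard constraints rather than scoring signals, the same idea can be sketched with a non-scoring bool filter (ES 2.x and later; older releases use the filtered query from the question):
{
  "query": {
    "bool": {
      "must": [
        { "match": { "searchable": "some text" } }
      ],
      "filter": [
        { "term": { "country_id": 1 } },
        { "term": { "region_id": 2 } },
        { "term": { "appellation_id": 34 } }
      ]
    }
  }
}
Only documents that pass all three term filters are then scored by the match on searchable, which is the behaviour described in the question.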

Elasticsearch: Scoring with Ngrams

I have a straightforward question: I have incorporated ngrams for partial matching. The implementation works well, but the scores aren't what I hoped for. I would like the score results to look something like this:
Ke: .1
Kev: .2
Kevi: .3
Kevin: .4
Instead I am getting the following results, where the score is the same whenever the field matches:
Ke: .4
Kev: .4
Kevi: .4
Kevin: .4
Settings:
settings: {
analysis: {
filter: {
ngram_filter: {
type: 'edge_ngram',
min_gram: 2,
max_gram: 15
}
},
analyzer: {
ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: [
'lowercase',
'ngram_filter'
]
}
}
}
}
Mappings:
mappings: [{
name: 'voter',
_all: {
'type': 'string',
'analyzer': 'ngram_analyzer',
'search_analyzer': 'standard'
},
properties: {
last: {
type: 'string',
required : true,
include_in_all: true,
analyzer: 'ngram_analyzer',
search_analyzer: 'standard'
},
first: {
type: 'string',
required : true,
include_in_all: true,
analyzer: 'ngram_analyzer',
search_analyzer: 'standard'
},
}
}]
Query:
GET /user/_search
{
"query": {
"match": {
"_all": {
"query": "Ke",
"operator": "and"
}
}
}
}
You can solve that using an edgeNGram tokenizer instead of an edgeNGram filter:
settings: {
analysis: {
tokenizer: {
ngram_tokenizer: {
type: 'edge_ngram',
min_gram: 2,
max_gram: 15
}
},
analyzer: {
ngram_analyzer: {
type: 'custom',
tokenizer: 'ngram_tokenizer',
filter: [
'lowercase'
]
}
}
}
}
The reason is that the edgeNGram filter writes the terms for a given token at the same position (much like synonyms would), while the edgeNGram tokenizer creates tokens at different positions, which influences the length normalization and therefore the score.
Note that this works only on pre-2.0 ES releases, where a compound score is computed from all ngram token scores; in ES 2.x only the matching token is scored.
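If you want to see the difference yourself, the _analyze API shows the position each token gets (request-body form as on ES 5.x and later; older versions take these as query-string parameters), run against the index where ngram_analyzer is defined:
GET /user/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Kevin"
}
With the filter-based analyzer, ke, kev, kevi and kevin all come back at position 0; with the tokenizer-based one they get increasing positions.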

ElasticSearch "H & R Block" with partial word search

The requirements are to be able to search the following terms :
"H & R" to find "H & R Block".
I have managed to implement this requirement alone using word_delimiter, as mentioned in this answer: elasticsearch tokenize "H&R Blocks" as "H", "R", "H&R", "Blocks".
Using Ruby code:
{
char_filter: {
strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] },
},
filter: {
my_splitter: {
type: "word_delimiter",
preserve_original: true
}
},
analyzer: {
my_analyzer: {
char_filter: %w[strip_punctuation],
type: "custom",
tokenizer: "whitespace",
filter: %w[lowercase asciifolding my_splitter]
}
}
}
But we also want, in the same query, autocomplete or partial word matching, so that
"Ser", "Serv", "Servi", "Servic" and "Service" all find "Service" and "Services".
I have managed to implement this requirement alone, using ngram.
{
char_filter: {
strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
},
analyzer: {
my_analyzer: {
char_filter: %w[strip_punctuation],
tokenizer: "my_ngram",
filter: %w[lowercase asciifolding]
}
},
tokenizer: {
my_ngram: {
type: "nGram",
min_gram: "3",
max_gram: "10",
token_chars: %w[letter digit]
}
}
}
I just can't manage to implement them together. When I use ngram, short words are dropped, so "H & R" is left out. When I use word_delimiter, partial word searches stop working. Below is my latest attempt at merging both requirements; it supports partial word searches but not "H & R".
{
char_filter: {
strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
},
filter: {
my_splitter: {
type: "word_delimiter",
preserve_original: true
}
},
analyzer: {
my_analyzer: {
char_filter: %w[strip_punctuation],
type: "custom",
tokenizer: "my_tokenizer",
filter: %w[lowercase asciifolding my_splitter]
}
},
tokenizer: {
my_tokenizer: {
type: "nGram",
min_gram: "3",
max_gram: "10",
token_chars: %w[letter digit]
}
}
}
You can use a multi_field in your mapping to index the same field in multiple ways: keep your full-text search with the custom tokenizer on the default field, and create a special sub-field indexed for your autocompletion needs.
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Your query will need to be slightly different when performing the autocomplete as the field will be title.raw instead of just title.
Once the field is indexed in all the ways that make sense for your query, you can query the index with a boolean should query that matches both the tokenized version and the word-start version. Giving a larger boost to the clause matching complete words will likely put the direct hits on top.
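As a rough sketch of that combined query, assuming the word_delimiter analyzer is applied to title and a hypothetical ngram sub-field title.partial holds the autocomplete tokens:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "H & R", "boost": 3 } } },
        { "match": { "title.partial": { "query": "H & R" } } }
      ]
    }
  }
}
The boosted clause pushes complete-word matches like "H & R Block" to the top, while the partial sub-field still catches prefixes such as "Ser" or "Servi".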

Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field I only get results matching parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...
Figured it out!
Store two fields: one for the text content (text) and a sub-field with the English stemmed words (text.english).
Create a custom analyzer based on the default English analyzer, but one that doesn't strip stop words.
Highlight both fields and check each of them when displaying results to the user.
Here's my index configuration:
{
mappings: {
documents: {
properties: {
title: { type: 'string' },
text: {
type: 'string',
term_vector: 'with_positions_offsets_payloads',
fields: {
english: {
type: 'string',
analyzer: 'english_nostop',
term_vector: 'with_positions_offsets_payloads',
store: true
}
}
}
}
}
},
settings: {
analysis: {
filter: {
english_stemmer: {
type: 'stemmer',
language: 'english'
},
english_possessive_stemmer: {
type: 'stemmer',
language: 'possessive_english'
}
},
analyzer: {
english_nostop: {
tokenizer: 'standard',
filter: [
'english_possessive_stemmer',
'lowercase',
'english_stemmer'
]
}
}
}
}
}
And here's what a query looks like:
{
query: {
query_string: {
query: <query>,
fields: ['text.english'],
analyzer: 'english_nostop'
}
},
highlight: {
fields: {
'text.english': {},
'text': {}
}
},
}
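For the original "it is rare" vs "it is rarely" case, a phrase search against the stemmed sub-field (quoting the phrase inside query_string) should return both variants, since rarely stems to rare:
{
  "query": {
    "query_string": {
      "query": "\"it is rare\"",
      "fields": ["text.english"],
      "analyzer": "english_nostop"
    }
  }
}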
