Partial search query in kuromoji - elasticsearch

I have an issue when trying to do a partial search using the kuromoji plugin.
When I index a full sentence like ホワイトソックス with an analyzer like:
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "filter": ["lowercase"],
  "text": "ホワイトソックス"
}
then the word is properly split into ホワイト and ソックス, as it should be. I can search for both words separately, and that is correct.
But when the user has not typed the full sentence yet and the last letter is missing (ホワイトソック), every kuromoji analyzer treats it as one word.
Because of that, the result is empty.
My question is: is there something I can do about it, either by indexing or by searching this query in a different fashion? I'm sure Japanese partial search is possible, but I can't find the right settings.
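For reference, this is how the partial input can be run through the same tokenizer to confirm it stays a single token (a standalone _analyze request, no index required):
POST _analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "filter": ["lowercase"],
  "text": "ホワイトソック"
}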
Example index settings:
{
  analyzer: {
    ngram_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['lowercase', 'cjk_width', 'ngram_filter'],
    },
    search_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['asciifolding'],
    }
  },
  filter: {
    ngram_filter: {
      type: 'edge_ngram',
      min_gram: '1',
      max_gram: '20',
      preserve_original: true,
      token_chars: ['letter', 'digit']
    }
  },
  tokenizer: {
    search_tokenizer: {
      type: 'kuromoji_tokenizer',
      mode: 'search'
    }
  }
}
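These analyzers are attached to the searched field roughly like this (the field name name below is only a placeholder, not the real field):
mappings: {
  properties: {
    name: {
      type: 'text',
      analyzer: 'ngram_analyzer',
      search_analyzer: 'search_analyzer'
    }
  }
}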
Search query:
query_string: {
  fields: [
    "..."
  ],
  query: "ホワイトソック",
  fuzziness: "0",
  default_operator: "AND",
  analyzer: "search_analyzer"
}
Any help appreciated!

Related

Elasticsearch inaccurate score for symbol searching

I have two stores named face- and face+ store.
When I search for face+, I want the list of results to be:
face+ store
face-
But the results are:
face-
face+ store
My custom analyzer will produce tokens like this:
face- to [face, face-]
face+ store to [face, face+, +store, store]
This is my query:
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer",
}
This is my mapping, if it helps:
analysis: {
  analyzer: {
    custom_analyzer: {
      type: "custom",
      char_filter: ["ampersand", "sym_repl"],
      tokenizer: "whitespace",
      filter: ["lowercase", "asciifolding", "searchkick_index_shingle", "searchkick_stemmer", "del_sym"]
    }
  },
  char_filter: {
    # adding a space around those patterns
    sym_repl: {
      type: "pattern_replace",
      pattern: '[.+\-*|\]\)\"##&!]',
      replacement: " $0 "
    }
  },
  filter: {
    # remove tokens that match the stopwords
    del_sym: {
      type: "stop",
      stopwords: ['.', '+', '-', '*', '|', ']', ')', '"', '#', '#', '&', '!']
    }
  }
}
mappings: {
  store: {
    properties: {
      name: {
        type: "keyword",
        fields: {
          analyzed: {
            type: "text",
            analyzer: "custom_analyzer"
          }
        }
      },
It's difficult to reproduce your issue, as you are using Searchkick. But if you use minimum_should_match with a value of 2 and build a proper query, it will filter face- out of the search results, which is what you want.
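A minimal sketch of how that could look with the fields from the question (untested against Searchkick; multi_match passes minimum_should_match down to the per-field match queries):
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer",
  minimum_should_match: 2
}
With the query face+ producing the tokens [face, face+], the face- document matches only one of them and is dropped, while face+ store matches two and is kept.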

Elastic ngram prioritise whole words

I am trying to build an autocomplete with several million possible values. I have managed to do it with two different methods, match and ngram. The problem is that match requires the user to type whole words, and ngram returns poor results. Is there a way to return ngram results only if there are no match results?
Method 1: match
Returns very relevant results but requires the user to type a full word
//mapping
analyzer: {
  std_english: {
    type: 'standard',
    stopwords: '_english_',
  },
}
//search
query: {
  bool: {
    must: [
      { term: { semanticTag: type } },
      { match: { search } }
    ]
  }
}
Method 2: ngram
Returns poor matches
//mapping
analysis: {
  filter: {
    autocomplete_filter: {
      type: 'edge_ngram',
      min_gram: 1,
      max_gram: 20,
    },
  },
  analyzer: {
    autocomplete: {
      type: 'custom',
      tokenizer: 'standard',
      filter: ['lowercase', 'autocomplete_filter'],
    },
  },
}
//search
query: {
  bool: {
    must: [
      { term: { semanticTag: type } },
      {
        match: {
          term: {
            query: search,
            operator: 'and',
          }
        }
      }
    ]
  }
}
Try changing the query to something like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "semanticTag": "type"
          }
        },
        {
          "match_phrase_prefix": {
            "fieldName": {
              "query": "valueToSearch"
            }
          }
        }
      ]
    }
  }
}
You can use match_phrase_prefix. With it, the user does not need to type the whole word; whatever the user types is treated as the beginning of a phrase in the indexed field, and matching documents are returned.
Just a note: this will also match words in the middle of the indexed text, not only the first word.
For example, if one of the fields contains "lorem ipsum" and the user types "ips", you will get that document along with other documents containing words that start with "ips".
You can go with either the standard or a custom analyzer; you have to check which analyzer better suits your use case. Based on the information available in the question, the approach above works well with the standard analyzer.
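If whole-word matches should still rank above prefix-only matches, which is what the question title asks for, one option (shown only as a sketch, with fieldName and valueToSearch as placeholders) is to keep a plain match next to match_phrase_prefix in a bool should and boost it:
{
  "query": {
    "bool": {
      "must": [
        { "term": { "semanticTag": "type" } }
      ],
      "should": [
        { "match": { "fieldName": { "query": "valueToSearch", "boost": 10 } } },
        { "match_phrase_prefix": { "fieldName": { "query": "valueToSearch" } } }
      ],
      "minimum_should_match": 1
    }
  }
}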

Elastic search query using match_phrase_prefix and fuzziness at the same time?

I am new to Elasticsearch, so I am struggling a bit to find the optimal query for our data.
Imagine I want to match the following word "Handelsstandens Boldklub".
Currently, I'm using the following query:
{
  query: {
    bool: {
      should: [
        {
          match: {
            name: {
              query: query, slop: 5, type: "phrase_prefix"
            }
          }
        },
        {
          match: {
            name: {
              query: query,
              fuzziness: "AUTO",
              operator: "and"
            }
          }
        }
      ]
    }
  }
}
It currently lists the word if I search for "Hand", but if I search for "Handle" the word is no longer listed, since I made a typo. However, if I type all the way to "Handlesstandens" it is listed again, as the fuzziness catches the typo, but only once I have typed the whole word.
Is it somehow possible to use phrase_prefix and fuzziness at the same time, so that if I make a typo along the way, the word is still listed?
In this case, searching for "Handle" would still match "Handelsstandens Boldklub".
Or what other workarounds are there to achieve the above experience? I like the phrase_prefix matching, as it also supports sloppy matching (hence I can search for "Boldklub han" and it will list the result).
Or can the above be achieved by using the completion suggester?
Okay, so after investigating Elasticsearch further, I came to the conclusion that I should use ngrams.
Here is a really good explanation of what they do and how they work:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
Here are the settings and mapping I used (this is elasticsearch-rails syntax):
settings analysis: {
  filter: {
    ngram_filter: {
      type: "ngram",
      min_gram: "2",
      max_gram: "20"
    }
  },
  analyzer: {
    ngram_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["lowercase", "ngram_filter"]
    }
  }
} do
  mappings do
    indexes :name, type: "string", analyzer: "ngram_analyzer"
    indexes :country_id, type: "integer"
  end
end
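To see which fragments actually get indexed, the analyzer can be inspected with the _analyze API once the index exists (the index name clubs below is just a placeholder):
GET /clubs/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Handelsstandens"
}
Every 2- to 20-character substring of the lowercased word comes back as a token, so a query containing the typo "Handle" still shares several ngrams with "Handelsstandens" and the document is found.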
And the query (this query actually searches two different indexes at the same time):
{
  query: {
    bool: {
      should: [
        {
          bool: {
            must: [
              { match: { "club.country_id": country.id } },
              { match: { name: query } }
            ]
          }
        },
        {
          bool: {
            must: [
              { match: { country_id: country.id } },
              { match: { name: query } }
            ]
          }
        }
      ],
      minimum_should_match: 1
    }
  }
}
But basically you should just do a match or multi_match query, depending on how many fields you want to search.
I hope someone finds this helpful, as I was personally thinking too much in terms of fuzziness instead of ngrams (which I didn't know about before). This led me in the wrong direction.
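For example, searching several ngram-analyzed fields at once would look roughly like this (a sketch only; the second field name alt_name is a placeholder):
{
  query: {
    multi_match: {
      query: query,
      fields: ["name", "alt_name"]
    }
  }
}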

ElasticSearch "H & R Block" with partial word search

The requirements are to be able to search the following terms:
"H & R" to find "H & R Block".
I have managed to implement this requirement on its own using word_delimiter, as mentioned in this answer: elasticsearch tokenize "H&R Blocks" as "H", "R", "H&R", "Blocks".
Using Ruby code:
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] },
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "whitespace",
      filter: %w[lowercase asciifolding my_splitter]
    }
  }
}
But also, in the same query, we want autocomplete functionality or partial word matching, so
"Ser", "Serv", "Servi", "Servic" and "Service" all find "Service" and "Services".
I have managed to implement this requirement on its own, using ngram:
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      tokenizer: "my_ngram",
      filter: %w[lowercase asciifolding]
    }
  },
  tokenizer: {
    my_ngram: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
I just can't manage to implement them together. When I use ngram, short words are ignored, so "H & R" is left out. When I use word_delimiter, partial word searches stop working. Below is my latest attempt at merging both requirements; it supports partial word searches but not "H & R".
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "my_tokenizer",
      filter: %w[lowercase asciifolding my_splitter]
    }
  },
  tokenizer: {
    my_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
You can use multi-fields in your mapping to index the same field in multiple ways. You can use your full-text search with the custom tokenizer on the default field, and create a special indexing for your autocompletion needs.
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Your query will need to be slightly different when performing the autocomplete as the field will be title.raw instead of just title.
Once the field is indexed in all the ways that make sense for your query, you can query the index using a boolean "should" query, matching both the tokenized version and the word-start (ngram) version. A larger boost should likely be given to the query matching complete words, so that direct hits end up on top.
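A rough sketch of such a query, assuming sub-fields named title (word_delimiter analysis) and title.partial (ngram analysis), both of which are illustrative names only:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "H & R", "boost": 5 } } },
        { "match": { "title.partial": { "query": "H & R" } } }
      ],
      "minimum_should_match": 1
    }
  }
}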

Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase, and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field I only get results matching parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...
Figured it out!
Store two fields, one for the text content (text) and a sub-field with the English-ified stem words (text.english).
Create a custom analyzer, based on the default English analyzer, that doesn't strip stop words.
Highlight both fields and check for each when displaying results to the user.
Here's my index configuration:
{
  mappings: {
    documents: {
      properties: {
        title: { type: 'string' },
        text: {
          type: 'string',
          term_vector: 'with_positions_offsets_payloads',
          fields: {
            english: {
              type: 'string',
              analyzer: 'english_nostop',
              term_vector: 'with_positions_offsets_payloads',
              store: true
            }
          }
        }
      }
    }
  },
  settings: {
    analysis: {
      filter: {
        english_stemmer: {
          type: 'stemmer',
          language: 'english'
        },
        english_possessive_stemmer: {
          type: 'stemmer',
          language: 'possessive_english'
        }
      },
      analyzer: {
        english_nostop: {
          tokenizer: 'standard',
          filter: [
            'english_possessive_stemmer',
            'lowercase',
            'english_stemmer'
          ]
        }
      }
    }
  }
}
And here's what a query looks like:
{
  query: {
    query_string: {
      query: <query>,
      fields: ['text.english'],
      analyzer: 'english_nostop'
    }
  },
  highlight: {
    fields: {
      'text.english': {},
      'text': {}
    }
  }
}
