ElasticSearch "H & R Block" with partial word search - elasticsearch

The requirements are to be able to search for the following terms:
"H & R" should find "H & R Block".
I have managed to implement this requirement on its own using word_delimiter, as mentioned in this answer: elasticsearch tokenize "H&R Blocks" as "H", "R", "H&R", "Blocks".
Using Ruby code:
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "whitespace",
      filter: %w[lowercase asciifolding my_splitter]
    }
  }
}
But we also want, in the same query, autocomplete functionality or partial word matching, so that
"Ser", "Serv", "Servi", "Servic" and "Service" all find "Service" and "Services".
I have managed to implement this requirement on its own using ngram.
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      tokenizer: "my_ngram",
      filter: %w[lowercase asciifolding]
    }
  },
  tokenizer: {
    my_ngram: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
I just can't manage to implement them together. When I use ngram, short words are ignored, so "H & R" is left out. When I use word_delimiter, partial word searches stop working. Below is my latest attempt at merging both requirements; it supports partial word searches but not "H & R".
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "my_tokenizer",
      filter: %w[lowercase asciifolding my_splitter]
    }
  },
  tokenizer: {
    my_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}

You can use multi_field in your mapping to index the same field in multiple ways: keep your full-text search with the custom tokenizer on the default field, and create a special indexing for your autocompletion needs.
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Your query will need to be slightly different when performing the autocomplete, as the field will be title.raw instead of just title.
Once the field is indexed in all the ways that make sense for your query, you can query the index using a boolean "should" query, matching the tokenized version and the word-start query. A larger boost should probably be given to the first clause, which matches complete words, so that direct hits come out on top.
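For illustration, a minimal sketch of how the pieces could fit together, reusing the two analyzers from the question as sub-fields. The sub-field name autocomplete, the analyzer name my_ngram_analyzer (the nGram analyzer registered under its own name), the index name my_index and the boost value are illustrative assumptions, not part of the original answer:
"title": {
  "type": "string",
  "analyzer": "my_analyzer",
  "fields": {
    "raw": { "type": "string", "index": "not_analyzed" },
    "autocomplete": { "type": "string", "analyzer": "my_ngram_analyzer" }
  }
}
and a query along these lines:
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "H & R", "boost": 10 } } },
        { "match": { "title.autocomplete": { "query": "H & R" } } }
      ]
    }
  }
}
The boosted clause on the whole-word field pushes exact matches like "H & R Block" to the top, while the ngram sub-field keeps partial-word matches such as "Servic" working.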

Related

Elasticsearch inaccurate score for symbol searching

I have two stores named face- and face+ store.
When I search for face+, I want the list of results to be:
face+ store
face-
But the results are:
face-
face+ store
My custom analyzer produces tokens like this:
face- to [face, face-]
face+ store to [face, face+, +store, store]
This is my query:
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer",
}
This is my mapping, if it helps:
analysis: {
  analyzer: {
    custom_analyzer: {
      type: "custom",
      char_filter: ["ampersand", "sym_repl"],
      tokenizer: "whitespace",
      filter: ["lowercase", "asciifolding", "searchkick_index_shingle", "searchkick_stemmer", "del_sym"]
    }
  },
  char_filter: {
    # adding a space between those patterns
    sym_repl: {
      type: "pattern_replace",
      pattern: '[.+\-*|\]\)\"##&!]',
      replacement: " $0 "
    }
  },
  filter: {
    # remove tokens that match the stopwords
    del_sym: {
      type: "stop",
      stopwords: ['.', '+', '-', '*', '|', ']', ')', '"', '#', '#', '&', '!']
    }
  }
}
mappings: {
  store: {
    properties: {
      name: {
        type: "keyword",
        fields: {
          analyzed: {
            type: "text",
            analyzer: "custom_analyzer"
          }
        }
      },
It's difficult to reproduce your issue, as you are using Searchkick, but if you use minimum_should_match with a value of 2 and build a proper query, it will filter face- out of the search results, and that is what you want.
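A minimal sketch of what that could look like, reusing the multi_match from the question. Setting minimum_should_match: 2 requires both tokens produced from "face+" (per the question, face and face+) to match, so face-, which only yields the token face, should drop out:
multi_match: {
  query: keywords,
  type: "best_fields",
  fields: ['name.analyzed^10', 'name.word_middle^5'],
  analyzer: "custom_analyzer",
  minimum_should_match: 2
}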

Partial search query in kuromoji

I have an issue when trying to do partial search using the kuromoji plugin.
When I index a full sentence, like ホワイトソックス, with an analyzer like:
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "filter": ["lowercase"],
  "text": "ホワイトソックス"
}
then the word is properly split into ホワイト and ソックス as it should be, I can search for both words separately, and that's correct.
But when the user hasn't typed the full sentence yet and the last letter is missing (ホワイトソック), any kuromoji analyzer treats it as one word.
Because of that, the result is empty.
My question is: is there something I can do about it, either by indexing or by searching this query in a different fashion? I'm sure Japanese partial search exists, but I can't find the right settings.
Example index settings:
{
  analyzer: {
    ngram_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['lowercase', 'cjk_width', 'ngram_filter'],
    },
    search_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['asciifolding'],
    }
  },
  filter: {
    ngram_filter: {
      type: 'edge_ngram',
      min_gram: '1',
      max_gram: '20',
      preserve_original: true,
      token_chars: ['letter', 'digit']
    }
  },
  tokenizer: {
    search_tokenizer: {
      type: 'kuromoji_tokenizer',
      mode: 'search'
    }
  }
}
Search query:
query_string: {
  fields: [
    "..."
  ],
  query: "ホワイトソック",
  fuzziness: "0",
  default_operator: "AND",
  analyzer: "search_analyzer"
}
Any help appreciated!
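For reference, one way to see exactly what each analyzer emits for the partial input is the _analyze API; a minimal sketch, assuming the settings above are applied to an index named my_index:
GET /my_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "ホワイトソック"
}

GET /my_index/_analyze
{
  "analyzer": "search_analyzer",
  "text": "ホワイトソック"
}
Comparing the two token lists shows whether the mismatch comes from the index-time edge_ngram filter or from the search-time analyzer.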

Elastic Search Standard Analyzer + Synonyms

I am attempting to get synonyms added to our current way of searching for products.
Part of the mappings currently looks like this:
{
  "keywords": {
    "properties": {
      "modifiers": {
        "type": "string",
        "analyzer": "standard"
      },
      "nouns": {
        "type": "string",
        "analyzer": "standard"
      }
    }
  }
}
I am interested in using the synonym filter along with the standard analyzer. As per the document here from Elasticsearch, the right way to do that is to add
{
  'analysis': {
    'analyzer': {
      "synonym": {
        "tokenizer": 'standard',
        "filter": ['standard', 'lowercase', 'stop', 'external_synonym']
      }
    },
    'filter': {
      'external_synonym': {
        'type': 'synonym',
        'synonyms': synonyms
      }
    }
  }
}
into the mappings and use that as the analyzer in the snippet above. But this does not work, in the sense that the behaviour is quite different (even without adding any synonyms).
I want to preserve the relevancy behaviour (as provided by the standard analyzer) and just add the synonyms list.
Could someone please provide more information on how to exactly replicate the standard analyzer behaviour?
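For reference, the default standard analyzer is the standard tokenizer plus the lowercase filter, with stopword removal disabled, so the stop filter in the snippet above is one likely source of the changed behaviour. A sketch that mirrors the standard analyzer and only appends the synonym filter, keeping the names from the question, might look like this:
{
  'analysis': {
    'analyzer': {
      "synonym": {
        "type": "custom",
        "tokenizer": 'standard',
        "filter": ['lowercase', 'external_synonym']
      }
    },
    'filter': {
      'external_synonym': {
        'type': 'synonym',
        'synonyms': synonyms
      }
    }
  }
}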

elasticsearch: multifield mapping of multiple fields

I will have a document with multiple fields, let's say 'title', 'meta1', 'meta2', 'full_body'.
I want to index each of them in a few different ways (raw, stemming without stop-words, shingles, synonyms, etc.). Therefore I will have fields like title.stemming, title.shingles, meta1.stemming, meta1.shingles, etc.
Do I have to copy-paste the mapping definition for each field? Or is it possible to create one definition of all the ways of indexing/analysing and then apply it to each of the 4 top-level fields? If so, how?
mappings:
  my_type:
    properties:
      title:
        type: string
        fields:
          shingles:
            type: string
            analyzer: my_shingle_analyzer
          stemming:
            type: string
            analyzer: my_stemming_analyzer
      meta1:
        ... <-- do i have to repeat everything here?
      meta2:
        ... <-- and here?
      full_body:
        ... <-- and here?
In your case, you could use dynamic templates with the match_mapping_type setting so that you can apply the same setting to all your string fields:
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "fields": {
                "shingles": {
                  "type": "string",
                  "analyzer": "my_shingle_analyzer"
                },
                "stemming": {
                  "type": "string",
                  "analyzer": "my_stemming_analyzer"
                },
                ... other sub-fields and analyzers
              }
            }
          }
        }
      ]
    }
  }
}
As a result, whenever you index a string field, its mapping will be created according to the defined template. You can also use the match setting to restrict the mapping to specific field names only.
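For instance, a template entry that only applies to field names starting with meta (an illustrative pattern, not taken from the question) could look like this:
{
  "strings_meta": {
    "match_mapping_type": "string",
    "match": "meta*",
    "mapping": {
      "type": "string",
      "fields": {
        "shingles": { "type": "string", "analyzer": "my_shingle_analyzer" },
        "stemming": { "type": "string", "analyzer": "my_stemming_analyzer" }
      }
    }
  }
}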

Elasticsearch UTF-8 characters in search

I have an indexed record:
"žiema"
Elasticsearch settings:
index:
  cmpCategory: {type: string, analyzer: like_analyzer}
Analyzer
analysis:
  char_filter:
    lt_characters:
      type: mapping
      mappings: ["ą=>a","Ą=>a","č=>c","Č=>c","ę=>e","Ę=>e","ė=>e","Ė=>e","į=>i","Į=>i","š=>s","Š=>s","ų=>u","Ų=>u","ū=>u","ž=>z", "Ū=>u"]
  analyzer:
    like_analyzer:
      type: snowball
      tokenizer: standard
      filter: [lowercase, asciifolding]
      char_filter: [lt_characters]
What I want:
The keyword "žiema" should find the record "žiema", AND the keyword "ziema" should also find the record "žiema". How do I do that?
I tried the character replacement and the asciifolding filter.
What am I doing wrong?
You can try indexing your field twice like it is shown in the documentation.
PUT /my_index/_mapping/my_type
{
  "properties": {
    "cmpCategory": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type": "string",
          "analyzer": "like_analyzer"
        }
      }
    }
  }
}
This way the cmpCategory field is indexed with the standard analyzer, keeping the diacritics, and the cmpCategory.folded field is indexed without diacritics.
While searching, you query both fields as such:
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "žiema",
      "fields": [ "cmpCategory", "cmpCategory.folded" ]
    }
  }
}
Also, I'm not sure if the char_filter is necessary since the asciifolding filter already does that transformation.
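One way to check that is the _analyze API; a minimal sketch (the index name is illustrative):
GET /my_index/_analyze
{
  "analyzer": "like_analyzer",
  "text": "žiema"
}
If the returned tokens already have the diacritics folded when lt_characters is removed from the analyzer, the char_filter is indeed redundant.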
