Language analyzer doesn't find singular results - elasticsearch

I have a bunch of categories with translations in my category field. I have defined language analyzers for the fields in my index so I can search them. But it doesn't find the singular version of my words: wasmachine in titles.title-nl is the singular of wasmachines, but it is not found. What am I missing?
Demo document
"_source" : {
"google_id" : 2706,
"titles" : [
{
"title-en" : "laundry appliances",
"title-de" : "waschen & trocknen",
"title-fr" : "appareils de blanchisserie",
"title-nl" : "wasmachines"
}
]
}
The way I mapped them
PUT categories/_mapping/category
{
"dynamic": false,
"properties": {
"titles.title-nl": {
"type": "text",
"analyzer": "dutch"
},
"titles.title-en": {
"type": "text",
"analyzer": "english"
},
"titles.title-de": {
"type": "text",
"analyzer": "german"
},
"titles.title-fr": {
"type": "text",
"analyzer": "french"
}
}
}
The way I search for them
GET categories/_search
{
"size": 4,
"query": {
"multi_match": {
"query": "wasmachines",
"fields": ["titles.title-de","titles.title-en", "titles.title-fr", "titles.title-nl"]
}
}
}

The problem is that the default dutch analyzer doesn't know how to stem the word wasmachines; you will need to recreate your index with a custom analyzer that uses a stemmer_override.
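You can see the mismatch for yourself by comparing what the built-in dutch analyzer produces for the two forms (a quick check; the exact stems depend on the Snowball implementation in your Elasticsearch version):
GET _analyze
{
  "analyzer": "dutch",
  "text": ["wasmachine", "wasmachines"]
}
If the two words come out as different terms, the query token and the indexed token never line up, which is exactly the symptom described above.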
Looking at the elastic documentation, you can do the following to recreate the dutch analyzer and make the singular and plural forms end up as the same term: put wasmachine => wasmachines inside the rules for the stemmer_override filter.
PUT categories/
{
"settings": {
"analysis": {
"filter": {
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"dutch_keywords": {
"type": "keyword_marker",
"keywords": ["voorbeeld"]
},
"dutch_stemmer": {
"type": "stemmer",
"language": "dutch"
},
"dutch_override": {
"type": "stemmer_override",
"rules": [
"fiets=>fiets",
"bromfiets=>bromfiets",
"wasmachine=>wasmachines",
"ei=>eier",
"kind=>kinder"
]
}
},
"analyzer": {
"rebuilt_dutch": {
"tokenizer": "standard",
"filter": [
"lowercase",
"dutch_stop",
"dutch_keywords",
"dutch_override",
"dutch_stemmer"
]
}
}
}
}
}
You will also need to use that new analyzer in your mapping:
PUT categories/_mapping/category
{
"dynamic": false,
"properties": {
"titles.title-nl": {
"type": "text",
"analyzer": "rebuilt_dutch"
},
"titles.title-en": {
"type": "text",
"analyzer": "english"
},
"titles.title-de": {
"type": "text",
"analyzer": "german"
},
"titles.title-fr": {
"type": "text",
"analyzer": "french"
}
}
}
After that you will be able to search for wasmachine and get the documents that have wasmachines.
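For example, re-running the original search with the singular form (the same request as in the question, only the query string changed) should now return the demo document:
GET categories/_search
{
  "size": 4,
  "query": {
    "multi_match": {
      "query": "wasmachine",
      "fields": ["titles.title-de", "titles.title-en", "titles.title-fr", "titles.title-nl"]
    }
  }
}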

Related

Elasticsearch search_as_you_type field type with partial search

I recently updated my ngram implementation to use the search_as_you_type field type.
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-as-you-type.html
This worked great, but I noticed that partial search does not work.
If I search for the number 00060434 I get the desired result, but I would also like to be able to search for 60434 and have it return document 3.
Is there a way to do this with the search_as_you_type field type, or can I only do this with ngrams?
PUT searchasyoutype_example
{
"settings": {
"analysis": {
"analyzer": {
"englishAnalyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"ascii_folding"
]
}
},
"filter": {
"ascii_folding": {
"type": "asciifolding",
"preserve_original": true
}
}
}
},
"mappings": {
"properties": {
"number": {
"type": "search_as_you_type",
"analyzer": "englishAnalyzer"
},
"fullName": {
"type": "search_as_you_type",
"analyzer": "englishAnalyzer"
}
}
}
}
PUT searchasyoutype_example/_doc/1
{
"number" : "00069794",
"fullName": "Employee 1"
}
PUT searchasyoutype_example/_doc/2
{
"number" : "00059840",
"fullName": "Employee 2"
}
PUT searchasyoutype_example/_doc/3
{
"number" : "00060434",
"fullName": "Employee 3"
}
GET searchasyoutype_example/_search
{
"query": {
"multi_match": {
"query": "00060434",
"type": "bool_prefix",
"fields": [
"number",
"number._index_prefix",
"fullName",
"fullName._index_prefix"
]
}
}
}
I think you need to query on number, number._2gram and number._3gram, like below:
GET searchasyoutype_example/_search
{
"query": {
"multi_match": {
"query": "00060434",
"type": "bool_prefix",
"fields": [
"number",
"number._2gram",
"number._3gram",
]
}
}
}
search_as_you_type creates three sub-fields. You can read more about how it works in this article:
https://ashish.one/blogs/search-as-you-type/
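As a usage sketch of the suggestion above, the same multi_match with a shorter, purely illustrative prefix of document 3's number (00060) would look like this:
GET searchasyoutype_example/_search
{
  "query": {
    "multi_match": {
      "query": "00060",
      "type": "bool_prefix",
      "fields": [
        "number",
        "number._2gram",
        "number._3gram"
      ]
    }
  }
}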

How to create a custom reusable type in ElasticSearch?

My JSON for the Elasticsearch schema looks like this:
{
"mappings": {
"properties": {
"DESCRIPTION_FR": {
"type": "text",
"analyzer": "french"
},
"FEEDBACK_FR": {
"type": "text",
"analyzer": "french"
},
"SOURCE_FR": {
"type": "text",
"analyzer": "french"
}
}
}
}
There are hundreds of properties like this. Replicating a change across all of them with this approach is redundant and error-prone.
Is there a way in Elasticsearch 7.2 to define a custom data type and reuse it in the property mappings?
{
"settings": {
//definition of custom type "text_fr"
},
"mappings": {
"properties": {
"DESCRIPTION_FR": {
"type": "text_fr"
},
"FEEDBACK_FR": {
"type": "text_fr"
},
"SOURCE_FR": {
"type": "text_fr"
}
}
}
}
Yes! What you're after is dynamic mapping templates. More specifically the match feature.
Define the target field names with a leading wildcard:
PUT my_index
{
"mappings": {
"dynamic_templates": [
{
"is_french_text": {
"match_mapping_type": "*",
"match": "*_FR",
"mapping": {
"type": "text",
"analyzer": "french"
}
}
}
]
}
}
Insert a doc:
POST my_index/_doc
{
"DESCRIPTION_FR": "je",
"FEEDBACK_FR": "oui",
"SOURCE_FR": "je ne sais quoi"
}
Verify the dynamically generated mapping:
GET my_index/_mapping
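The response should show each *_FR field mapped as French-analyzed text, roughly like this (abridged; the exact output depends on your Elasticsearch version):
{
  "my_index": {
    "mappings": {
      "dynamic_templates": [ ... ],
      "properties": {
        "DESCRIPTION_FR": { "type": "text", "analyzer": "french" },
        "FEEDBACK_FR": { "type": "text", "analyzer": "french" },
        "SOURCE_FR": { "type": "text", "analyzer": "french" }
      }
    }
  }
}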

Elasticsearch replacing cross_fields with combined field and fuzzy

We have an index which was previously searching a few fields such as this:
"query":{
"bool":{
"filter":[
{
"term":{
"eventvisibility":"public"
}
}
],
"should":[
{
"multi_match":{
"query":"keyword",
"fields":[
"eventname",
"venue.name",
"venue.town"
],
"type":"cross_fields",
"minimum_should_match":"3<80%"
}
},
{
"match":{
"eventdescshort":{
"query":"keyword",
"minimum_should_match":"2<80%"
}
}
}
],
"minimum_should_match":1
}
}
This works, but it often fails due to spelling mistakes, etc., with letters left off the keyword or transposed.
So I was hoping to implement fuzzy searching. As this doesn't work with cross_fields, I created a new field in the index:
"mappings": {
"event": {
"properties": {
"basic_search": {
"type": "text",
"analyzer": "nameanalyzer"
},
"eventname":{
"type": "text",
"copy_to": "basic_search" ,
"fields": {
"raw": {
"type": "keyword"
}
},
"analyzer": "nameanalyzer"
},
"venue": {
"properties": {
"name": {
"type": "text",
"copy_to": "basic_search" ,
"fields": {
"raw": {
"type": "keyword"
}
},
"analyzer": "nameanalyzer"
},
...snip (all fields previously in cross_fields now have copy_to: basic_search) ...
}
And our analyzer is as follows:
"nameanalyzer": {
"filter": [
"lowercase",
"stop",
"english_possessive_stemmer",
"english_minimal_stemmer",
"synonym",
"asciifolding",
"word_delimiter"
],
"char_filter": "html_strip",
"type": "custom",
"tokenizer": "standard"
}
I've now run a test search, as follows:
{
"query": {
"fuzzy": {
"basic_search": {
"value": "carers fair"
}
}
}
}
However, this is not giving me any matches at all.
I just get:
"type": "MatchNoDocsQuery",
"description": "MatchNoDocsQuery(\"empty BooleanQuery\")",
I know I can't see the contents of the basic_search field in _source, so how can I debug and know why this isn't matching?
The fuzzy query doesn't analyze the query text before searching, so it should be avoided here.
Excerpt from the ES docs below:
fuzzy query: The elasticsearch fuzzy query type should generally be avoided. Acts much like a term query. Does not analyze the query text first.
Please try the query below:
{
"query":{
"match":{
"basic_search":{
"query":"carers fair",
"fuzziness":"AUTO"
}
}
}
}
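To answer the debugging part of the question: you can inspect what tokens actually end up in basic_search by running _analyze against that field (my_event_index below is a placeholder for your real index name):
GET my_event_index/_analyze
{
  "field": "basic_search",
  "text": "carers fair"
}
Comparing that output with the single unanalyzed term the fuzzy query was sent (carers fair) shows why it matched nothing: there is no single indexed term equal, or even close, to the whole phrase.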

Get exact match after doing mapping as not_analyzed

I have an Elasticsearch type that I mapped as below:
mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I am leaving the directory field analyzed because I want to be able to give only the last folder and still get results. But when I want to search for a specific directory I need to give the whole path, since the same folder name can occur under two different paths. The problem is that, because the field is analyzed, that search returns everything sharing a folder name instead of the specific directory I want.
In short, I want the field to act as both analyzed and not_analyzed. Is there a way to do that?
Let's say you have the following document indexed:
{
"directory": "/home/docs/public"
}
The standard analyzer is not enough in your case, as it will create the following terms while indexing:
[home, docs, public]
Note that it misses the [/home/docs/public] token - characters like "/" act as separators here.
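You can confirm this with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "/home/docs/public"
}
which returns only the terms home, docs and public, with no term covering the whole path.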
One solution could be to use the ngram tokenizer with the punctuation character class included in the token_chars list. Elasticsearch would then treat "/" as if it were a letter or digit, which allows searching with the following tokens:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
Index mapping:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 18,
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
Now both search queries:
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/docs/private"
}
}
}
}
}
and
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/home/docs/private"
}
}
}
}
}
will return the indexed document.
One thing you have to consider is the maximum token length specified in the "max_gram" setting. In the case of directory paths it may need to be longer.
An alternative solution is to use the whitespace tokenizer, which breaks the phrase into terms only on whitespace, combined with an ngram filter, using the following mapping:
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 4,
"max_gram": 20
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
Update the mapping of the directory field to contain a raw sub-field like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
And modify your query to use directory.raw, which will be treated as not_analyzed. Refer to this.
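A minimal sketch of such a query (jarindex is a placeholder for whatever index the jardata type lives in):
GET jarindex/_search
{
  "query": {
    "term": {
      "directory.raw": "/home/docs/public"
    }
  }
}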

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer with the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any results. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz and tried to search by sub-words again and it worked.
To try to understand what is going on, I checked the terms generated for my documents using the _termvector service. The result was identical for both the underscore-separated and the dash-separated sub-words, so I would expect the search results to be identical in both cases.
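The equivalent check with the _analyze API (my_index being a placeholder for the actual index name) would be:
GET my_index/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": ["abc_xyz", "abc-xyz"]
}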
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
