Elasticsearch: How to stem plurals ending in "es"?

For the EN language I have a custom analyser using the porter_stem. I want queries with the words "virus" and "viruses" to return the same results.
What I am finding is that porter stems virus->viru and viruses->virus. Consequently I get differing results.
How can I handle this?

You can achieve your use case, i.e., queries with the words "virus" and "viruses" returning the same results, by using the snowball token filter, which stems words to their root form.
Adding a working example with index data, mapping, search query, and search result.
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_snow"
          ]
        }
      },
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "English"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "desc": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Analyze API:
GET /_analyze
{
  "analyzer": "my_analyzer",
  "text": "viruses"
}
The following tokens are generated:
{
  "tokens": [
    {
      "token": "virus",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
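For comparison, you can reproduce the porter_stem behaviour described in the question directly in the Analyze API, with no index required (a minimal sketch using an inline filter chain):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "porter_stem"],
  "text": "virus viruses"
}
This emits viru and virus, so the two terms never reduce to the same root, which is why the results differ.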
Index Data:
{
  "desc": "viruses"
}
{
  "desc": "virus"
}
Search Query:
{
  "query": {
    "match": {
      "desc": {
        "query": "viruses"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "65707743",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"desc": "viruses"
}
},
{
"_index": "65707743",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"desc": "virus"
}
}
]

Related

matching multiple terms using match_phrase - Elasticsearch

I'm trying to fetch two documents that match the searched params; searching for each document separately works fine.
The query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "email": "elpaso"
          }
        },
        {
          "match_phrase": {
            "email": "walker"
          }
        }
      ]
    }
  }
}
I'm expecting to retrieve both documents that have these words in their email address field, but the query only returns the first one (elpaso).
Is this an issue related to the index mapping? I'm using type text for this field.
Is there any concept I am missing?
Index mapping:
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "email": {
        "type": "text"
      }
    }
  }
}
Sample data:
{
  "id": "4a43f351-7b62-42f2-9b32-9832465d271f",
  "name": "Walker, Gary (Mr.) .",
  "email": "walkergrym#mail.com"
}
{
  "id": "1fc18c05-da40-4607-a901-3d78c523cea6",
  "name": "Texas Chiropractic Association P.A.C.",
  "email": "txchiro#mail.com"
}
{
  "id": "9a2323f4-e008-45f0-9f7f-11a1f4439042",
  "name": "El Paso Energy Corp. PAC",
  "email": "elpaso#mail.com"
}
I also noticed that if I use elpaso and txchiro instead of walker, the query works as expected!
The issue happens when I use only part of the field; if I search by the exact, entire email address, everything works fine.
Is this expected from match_phrase?
You are not getting any result for walker because Elasticsearch uses the standard analyzer when no analyzer is specified, which tokenizes walkergrym#mail.com as follows:
GET /_analyze
{
  "analyzer": "standard",
  "text": "walkergrym#mail.com"
}
The following tokens are generated:
{
  "tokens": [
    {
      "token": "walkergrym",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "mail.com",
      "start_offset": 11,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Since there is no token walker, "walkergrym#mail.com" does not appear in your search result. For "txchiro#mail.com" the tokens generated are txchiro and mail.com, and for "elpaso#mail.com" the tokens are elpaso and mail.com.
You can use the edge_ngram tokenizer to achieve your required result.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 6,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}
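To see why the match now works, you can run the indexed email through the custom analyzer (a quick sketch; 66907434 is the index name taken from the search result below, so adjust it to your own):
GET /66907434/_analyze
{
  "analyzer": "my_analyzer",
  "text": "walkergrym#mail.com"
}
With min_gram 3 and max_gram 6 this should emit wal, walk, walke, walker, mai, mail, and com, so the term walker now exists in the index and the match query can find it.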
Search Query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "email": "elpaso"
          }
        },
        {
          "match": {
            "email": "walker"
          }
        }
      ]
    }
  }
}
Search Result:
"hits": [
{
"_index": "66907434",
"_type": "_doc",
"_id": "1",
"_score": 3.9233165,
"_source": {
"id": "4a43f351-7b62-42f2-9b32-9832465d271f",
"name": "Walker, Gary (Mr.) .",
"email": "walkergrym#mail.com"
}
},
{
"_index": "66907434",
"_type": "_doc",
"_id": "3",
"_score": 3.9233165,
"_source": {
"id": "9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name": "El Paso Energy Corp. PAC",
"email": "elpaso#mail.com"
}
}
]

Elasticsearch query to return part of words searched for

I would like to know how I can return "thanks" or "thanking" if I search for "thank".
Currently I have a multi_match query which returns only occurrences of "thank", like "thank you", but not "thanksgiving" or "thanks". I am using Elasticsearch 7.9.1.
query: {
  bool: {
    must: [
      {match: {accountId}},
      {
        multi_match: {
          query: "thank",
          type: "most_fields",
          fields: ["text", "address", "description", "notes", "name"],
        }
      }
    ],
    filter: {match: {type: "personaldetails"}}
  }
},
Also, is it possible to combine the multi_match query with a query_string on one of the fields (say description, where I would do a query_string search only on description and a phrase match on the other fields)?
{ "query": {
"query_string": {
"query": "(new york city) OR (big apple)",
"default_field": "content"
}
}
}
Any input is appreciated.
thanks
You can use the edge_ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, then emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 5,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "notes": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard" // note this
      }
    }
  }
}
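As a sanity check, the Analyze API shows what gets indexed for the longest document (a sketch; 65511630 is the index name taken from the search result below):
GET /65511630/_analyze
{
  "analyzer": "my_analyzer",
  "text": "thanksgiving"
}
With min_gram 5 and max_gram 20 this should emit thank, thanks, thanksg, and so on up to thanksgiving, which is why the standard-analyzed query term thank matches it.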
Index Data:
{
  "notes": "thank"
}
{
  "notes": "thank you"
}
{
  "notes": "thanks"
}
{
  "notes": "thanksgiving"
}
Search Query:
{
  "query": {
    "multi_match": {
      "query": "thank",
      "fields": [ "notes", "name" ]
    }
  }
}
Search Result:
"hits": [
{
"_index": "65511630",
"_type": "_doc",
"_id": "1",
"_score": 0.1448707,
"_source": {
"notes": "thank"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "3",
"_score": 0.1448707,
"_source": {
"notes": "thank you"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "2",
"_score": 0.12199639,
"_source": {
"notes": "thanks"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "4",
"_score": 0.06264679,
"_source": {
"notes": "thanksgiving"
}
}
]
To combine the multi_match query with a query_string, use the below query:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "thank",
          "fields": [
            "notes",
            "name"
          ]
        }
      },
      "should": {
        "query_string": {
          "query": "(new york city) OR (big apple)",
          "default_field": "content"
        }
      }
    }
  }
}
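Note that because the bool query has a must clause, the should clause here only affects scoring: documents matching just the query_string part, but not the multi_match, will not be returned.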

elasticsearch ignore accents on search

I have an Elasticsearch index with customer information.
I have some issues when searching for results with accents.
For example, I have {name: 'anais'} and {name: 'anaïs'}.
Running
GET /my-index/_search
{
  "size": 25,
  "query": {
    "match": {"name": "anaïs"}
  }
}
I would like to get both for this query; in this case I only get anaïs.
GET /my-index/_search
{
  "size": 25,
  "query": {
    "match": {"name": "anais"}
  }
}
I would like to get anais and anaïs; in this case I only get anais.
I tried adding an analyzer:
PUT /my-new-celebrity/_settings
{
  "analysis": {
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "asciifolding"
        ]
      }
    }
  }
}
But in this case, for both searches I only get anais.
It looks like you forgot to apply your custom default analyzer to your name field; below is a working example.
Index definition with mapping and settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "default" // note this
      }
    }
  }
}
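You can confirm the folding with the Analyze API (a quick sketch; substitute your own index name for my-index):
GET /my-index/_analyze
{
  "analyzer": "default",
  "text": "anaïs"
}
Both anaïs and anais should reduce to the single token anais, so either spelling matches both documents.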
Index sample docs
{
  "name": "anais"
}
{
  "name": "anaïs"
}
Search query, same as yours:
{
  "size": 25,
  "query": {
    "match": {
      "name": "anaïs"
    }
  }
}
And, as expected, both results are returned:
"hits": [
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"name": "anaïs"
}
},
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"name": "anais"
}
}
]

Elasticsearch - How to specify the same analyzer for search and index

I'm working on a Spanish search engine. (I don't speak Spanish.) Based on my research, the goal is more or less this:
1. Filter stopwords like "dos", "de", "la"...
2. Stem the words for both search and index, e.g. if you search "primera", then "primero" and "primer" should also show up.
My attempt:
es_analyzer = {
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "default_search": {
          "type": "spanish"
        },
        "rebuilt_spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
The problem:
When I use "type": "spanish" in "default_search", my query "primera" gets stemmed to "primer", which is correct. But even though I specified "spanish_stemmer" in the filter, the documents in the index aren't stemmed, so when I search for "primera" it only shows exact matches for "primer". Any suggestions on fixing this?
Potential fixes, but I haven't figured out the syntax:
1. Using the built-in "spanish" analyzer in the filter. What's the syntax?
2. Adding the Spanish stemmer and stopwords in "default_search". But I don't know how to use compound settings there.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "default_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "default_search"
      }
    }
  }
}
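To verify that index-time and search-time analysis now stem consistently, a quick Analyze API sketch (the index name my-index is hypothetical):
GET /my-index/_analyze
{
  "analyzer": "default_search",
  "text": "primera primero primer"
}
All three variants should reduce to the same stem, which is why they all match with the same score below.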
Index Data:
{
  "title": "primer"
}
{
  "title": "primera"
}
{
  "title": "primero"
}
Search Query:
{
  "query": {
    "match": {
      "title": "primer"
    }
  }
}
Search Result:
"hits": [
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "3",
"_score": 0.13353139,
"_source": {
"title": "primer"
}
},
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "1",
"_score": 0.13353139,
"_source": {
"title": "primera"
}
},
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "2",
"_score": 0.13353139,
"_source": {
"title": "primero"
}
}
]

Elasticsearch results are not as expected

I have a field indexed with a custom analyzer with the below configuration:
"COMPNAYNAME" : {
"type" : "text",
"analyzer" : "textAnalyzer"
}
"textAnalyzer" : {
"filter" : [
"lowercase"
],
"char_filter" : [ ],
"type" : "custom",
"tokenizer" : "ngram_tokenizer"
}
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "ngram",
"min_gram" : "2",
"max_gram" : "3"
}
}
When I search for the text "ikea", I get the below results.
Query:
GET company_info_test_1/_search
{
  "query": {
    "match": {
      "COMPNAYNAME": {"query": "ikea"}
    }
  }
}
Following are the results:
1. mikea
2. likeable
3. maaikeart
4. likeables
5. ikea b.v. <------
6. likeachef
7. ikea breda <------
8. bernikeart
9. ikea duiven
10. mikea media
I expect the exact match to be boosted above the rest of the results.
Could you please help me with the best way to index if I have to search with an exact match as well as with fuzziness?
Thanks in advance.
You can use the ngram tokenizer along with "search_analyzer": "standard". Refer to the search_analyzer documentation to know more about it.
As pointed out by @EvaldasBuinauskas, you can also use the edge_ngram tokenizer here if you want the tokens to be generated from the beginning only and not from the middle.
Adding a working example with index data, mapping, search query, and result
Index Data:
{ "title": "ikea b.v."}
{ "title" : "mikea" }
{ "title" : "maaikeart"}
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
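A quick look at what the analyzer produces for the search term explains the ordering (a sketch; normal is the index name taken from the result below):
GET /normal/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ikea"
}
This should emit ik, ike, ikea, ke, kea, and ea. Because the search analyzer is standard, the query stays as the single term ikea; a short field like "ikea b.v." produces fewer n-grams than a longer one like mikea, so its field-length norm is smaller and the exact match scores higher.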
Search Query:
{
  "query": {
    "match": {
      "title": "ikea"
    }
  }
}
Search Result:
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "4",
"_score": 0.1499838, <-- note this
"_source": {
"title": "ikea b.v."
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 0.13562363, <-- note this
"_source": {
"title": "mikea"
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "3",
"_score": 0.083597526,
"_source": {
"title": "maaikeart"
}
}
]
