I'm working on a Spanish search engine (I don't speak Spanish). Based on my research, the goal is roughly this:
1. Filter stopwords like "dos", "de", "la", ...
2. Stem the words for both search and index, e.g. if you search "primera", then "primero" and "primer" should also show up.
My attempt:
es_analyzer={
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_stemmer": {
"type": "stemmer",
"language": "spanish"
}
},
"analyzer": {
"default_search": {
"type": "spanish"
},
"rebuilt_spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_stemmer"
]
}
}
}
}
}
The problem:
When I use "type": "spanish" in "default_search", my query "primera" gets stemmed to "primer", which is correct. But even though I specified "spanish_stemmer" in the filter, the documents in the index aren't stemmed. As a result, when I search for "primera", it only shows exact matches for "primer". Any suggestions on fixing this?
Potential fixes, for which I haven't figured out the syntax:
Using built-in "spanish" analyzer in filter. What's the syntax?
Adding spanish stemmer and stopwords in "default_search". But I don't know how to use compound settings there.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_stemmer": {
"type": "stemmer",
"language": "spanish"
}
},
"analyzer": {
"default_search": {
"type":"custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_stemmer"
]
}
}
}
},
"mappings":{
"properties":{
"title":{
"type":"text",
"analyzer":"default_search"
}
}
}
}
Index Data:
{
"title": "primer"
}
{
"title": "primera"
}
{
"title": "primero"
}
Search Query:
{
"query":{
"match":{
"title":"primer"
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "3",
"_score": 0.13353139,
"_source": {
"title": "primer"
}
},
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "1",
"_score": 0.13353139,
"_source": {
"title": "primera"
}
},
{
"_index": "stof_64420517",
"_type": "_doc",
"_id": "2",
"_score": 0.13353139,
"_source": {
"title": "primero"
}
}
]
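One likely reason the original attempt did not stem at index time: in the settings shown in the question, "rebuilt_spanish" is defined but no field mapping refers to it, so documents would be indexed with the standard analyzer while searches go through "default_search". Below is a minimal sketch of one way to avoid that mismatch, assuming it is acceptable to register the custom chain under the analyzer name "default" so that it is used for both indexing and searching. The helper name spanish_index_body and the field name are hypothetical.

```python
import json


def spanish_index_body(field="title"):
    """Build an index body whose custom Spanish chain is the default analyzer.

    Assumption: naming the analyzer "default" makes it apply at index time
    and, with no separate "default_search", at search time as well.
    """
    analysis = {
        "filter": {
            "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
            "spanish_stemmer": {"type": "stemmer", "language": "spanish"},
        },
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "spanish_stop", "spanish_stemmer"],
            }
        },
    }
    return {
        "settings": {"analysis": analysis},
        # No per-field analyzer is needed: the field falls back to "default".
        "mappings": {"properties": {field: {"type": "text"}}},
    }


body = spanish_index_body()
body_json = json.dumps(body, indent=2)  # ready to send as the create-index body
```

The resulting JSON can be sent as the body of the create-index request. Because existing documents keep the tokens they were indexed with, they would need to be reindexed after an analyzer change.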
Related
For the EN language I have a custom analyser using the porter_stem. I want queries with the words "virus" and "viruses" to return the same results.
What I am finding is that porter stems virus->viru and viruses->virus. Consequently I get differing results.
How can I handle this?
You can achieve your use case, i.e. having queries with the words "virus" and "viruses" return the same results, by using the snowball token filter,
which stems both words to the same token ("virus").
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_snow"
]
}
},
"filter": {
"my_snow": {
"type": "snowball",
"language": "English"
}
}
}
},
"mappings": {
"properties": {
"desc": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Analyze API
GET /_analyze
{
"analyzer" : "my_analyzer",
"text" : "viruses"
}
The following tokens are generated:
{
"tokens": [
{
"token": "virus",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
}
]
}
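To check whether two words end up as the same indexed term, the tokens returned by the Analyze API can be compared directly. This is a small sketch, assuming the responses have already been fetched and parsed into dicts shaped like the output above. The helper names are hypothetical, and the two porter responses are reconstructed from the stems reported in the question (virus -> viru, viruses -> virus), not captured from a live cluster.

```python
def tokens_of(analyze_response):
    """Return the token strings from a parsed _analyze response, in order."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def same_terms(response_a, response_b):
    """True when both analyzed texts produce identical token sequences."""
    return tokens_of(response_a) == tokens_of(response_b)


# Reconstructed from the question: porter_stem gives different stems.
porter_virus = {"tokens": [{"token": "viru", "position": 0}]}
porter_viruses = {"tokens": [{"token": "virus", "position": 0}]}

# Copied from the Analyze API output above (snowball, "viruses").
snowball_viruses = {
    "tokens": [
        {
            "token": "virus",
            "start_offset": 0,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 0,
        }
    ]
}

porter_agrees = same_terms(porter_virus, porter_viruses)
```

Two words can only match each other through a match query if their analyzed tokens agree, so a False result here points to the stemmer as the cause.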
Index Data:
{
"desc":"viruses"
}
{
"desc":"virus"
}
Search Query:
{
"query": {
"match": {
"desc": {
"query": "viruses"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65707743",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"desc": "viruses"
}
},
{
"_index": "65707743",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"desc": "virus"
}
}
]
I would like to know how I can return "thanks" or "thanking" if I search for "thank".
Currently I have a multi-match query that returns only occurrences of "thank", such as "thank you", but not "thanksgiving" or "thanks". I am using Elasticsearch 7.9.1.
query: {
bool: {
must: [
{match: {accountId}},
{
multi_match: {
query: "thank",
type: "most_fields",
fields: ["text", "address", "description", "notes", "name"],
}
}
],
filter: {match: {type: "personaldetails"}}
}
},
Also, is it possible to combine the multi_match query with a query_string on one of the fields (say description, where I would do a query_string search only on description and a phrase match on the other fields)?
{ "query": {
"query_string": {
"query": "(new york city) OR (big apple)",
"default_field": "content"
}
}
}
Any input is appreciated.
thanks
You can use the edge_ngram tokenizer, which first breaks text down into
words whenever it encounters one of a list of specified characters,
then emits N-grams of each word where the start of the N-gram is
anchored to the beginning of the word.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 5,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"notes": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard" // note this
}
}
}
}
Index Data:
{
"notes":"thank"
}
{
"notes":"thank you"
}
{
"notes":"thanks"
}
{
"notes":"thanksgiving"
}
Search Query:
{
"query": {
"multi_match" : {
"query": "thank",
"fields": [ "notes", "name" ]
}
}
}
Search Result:
"hits": [
{
"_index": "65511630",
"_type": "_doc",
"_id": "1",
"_score": 0.1448707,
"_source": {
"notes": "thank"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "3",
"_score": 0.1448707,
"_source": {
"notes": "thank you"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "2",
"_score": 0.12199639,
"_source": {
"notes": "thanks"
}
},
{
"_index": "65511630",
"_type": "_doc",
"_id": "4",
"_score": 0.06264679,
"_source": {
"notes": "thanksgiving"
}
}
]
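To see why all four documents match, it can help to look at what the index-time analyzer emits. Here is a rough pure-Python sketch of edge n-gram tokenization using the min_gram and max_gram values from the mapping above. It only approximates the real tokenizer (words are split on anything that is not a letter or digit), and the function name is hypothetical.

```python
import re


def edge_ngrams(text, min_gram=5, max_gram=20):
    """Emit prefixes of each word, from min_gram up to max_gram characters."""
    tokens = []
    # Keep runs of letters/digits, mirroring token_chars: ["letter", "digit"].
    for word in re.findall(r"[^\W_]+", text):
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size])
    return tokens


docs = ["thank", "thank you", "thanks", "thanksgiving"]
indexed = {doc: edge_ngrams(doc) for doc in docs}
```

With these settings every document yields the token "thank", which is the term the standard-analyzed query looks up. "you" is shorter than min_gram and produces no tokens at all.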
To combine the multi_match query with a query_string query, use a bool query like the one below:
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "thank",
"fields": [
"notes",
"name"
]
}
},
"should": {
"query_string": {
"query": "(new york city) OR (big apple)",
"default_field": "content"
}
}
}
}
}
My challenge here is to create an autocomplete field (Django and ES) where I could search "apeni", "rua apen" or "roa apen" and get "rua apeninos" as the main (or only) option. I have already tried suggest and completion in ES, but both use prefixes (they don't work with "apen"). I tried wildcards as well, but couldn't use fuzziness (they don't work with "roa apeni" or "apini"). So now I am trying match with fuzziness.
But even when the query term is different, like "rua ape" or "rua apot", it returns the same two docs, with street_desc equal to "rua apeninos" and "rua apotribu", both with score 1.0.
Query:
{
"aggs":{
"addresses":{
"filters":{
"filters":{
"street":{
"match":{
"street_desc":{
"query":"rua ape",
"fuzziness":"AUTO",
"prefix_length":0,
"max_expansions":50
}
}
}
}
},
"aggs":{
"street_bucket":{
"significant_terms":{
"field":"street_desc.raw",
"size":3
}
}
}
}
},
"sort":[
{
"_score":{
"order":"desc"
}
}
]
}
Index:
{
"catalogs":{
"mappings":{
"properties":{
"street_desc":{
"type":"text",
"fields":{
"raw":{
"type":"keyword"
}
},
"analyzer":"suggest_analyzer"
}
}
}
}
}
Analyzer: (python)
suggest_analyzer = analyzer(
'suggest_analyzer',
tokenizer=tokenizer("lowercase"),
filter=[token_filter('stopbr', 'stop', stopwords="_brazilian_")],
language="brazilian",
char_filter=["html_strip"]
)
Adding an end to end working example, which I tested on all the given search terms.
Index-mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index sample docs
{
"title" : "rua apotribu"
}
{
"title" : "rua apeninos"
}
Search queries
{
"query": {
"match": {
"title": {
"query": "apeni",
"fuzziness":"AUTO"
}
}
}
}
And search result
"hits": [
{
"_index": "64881760",
"_type": "_doc",
"_id": "1",
"_score": 1.1026623,
"_source": {
"title": "rua apeninos"
}
}
]
With "apen" it also returns the expected result:
"hits": [
{
"_index": "64881760",
"_type": "_doc",
"_id": "1",
"_score": 2.517861,
"_source": {
"title": "rua apeninos"
}
}
]
And when the query terms are different, like "rua apot", it returns both docs, with a much higher score for "rua apotribu", as shown in the search result below.
"hits": [
{
"_index": "64881760",
"_type": "_doc",
"_id": "2",
"_score": 2.9289336,
"_source": {
"title": "rua apotribu"
}
},
{
"_index": "64881760",
"_type": "_doc",
"_id": "1",
"_score": 0.41107285,
"_source": {
"title": "rua apeninos"
}
}
]
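The matching behaviour above can be approximated offline. The sketch below is a simplification, not Elasticsearch's implementation: it uses plain Levenshtein distance (no transpositions), ignores scoring and max_expansions, and assumes the documented AUTO thresholds (terms of 1-2 characters must match exactly, 3-5 allow one edit, longer terms allow two). All function names are hypothetical.

```python
def index_tokens(text, min_gram=1, max_gram=10):
    """Lowercase, split on whitespace, and emit edge n-grams of each word."""
    tokens = set()
    for word in text.lower().split():
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.add(word[:size])
    return tokens


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]


def auto_fuzziness(term):
    """Allowed edits for fuzziness AUTO, by term length."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


def matches(query, doc):
    """True if any query term is within its allowed edits of an indexed token."""
    tokens = index_tokens(doc)
    return any(
        edit_distance(term, token) <= auto_fuzziness(term)
        for term in query.lower().split()
        for token in tokens
    )


hits_for_apeni = [d for d in ("rua apotribu", "rua apeninos") if matches("apeni", d)]
```

Under these assumptions "apeni" and "apen" reach only "rua apeninos", while "rua apot" reaches both documents because the term "rua" is shared. That agrees with the results shown above, where the shared term alone gives the lower score.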
I have an Elasticsearch index with customer information.
I have some issues finding results with accents.
For example, I have {name: 'anais'} and {name: 'anaïs'}.
Running
GET /my-index/_search
{
"size": 25,
"query": {
"match": {"name": "anaïs"}
}
}
I would like to get both for this query; in this case I only get anaïs.
GET /my-index/_search
{
"size": 25,
"query": {
"match": {"name": "anais"}
}
}
I would like to get anais and anaïs; in this case I only get anais.
I tried adding an analyzer:
PUT /my-new-celebrity/_settings
{
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
But in this case, for both searches I only get anais.
Looks like you forgot to apply your custom default analyzer to your name field. Below is a working example:
Index definition with mapping and settings
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings" : {
"properties" :{
"name" : {
"type" : "text",
"analyzer" : "default" // note this
}
}
}
}
Index sample docs
{
"name" : "anais"
}
{
"name" : "anaïs"
}
Search query same as yours
{
"size": 25,
"query": {
"match": {
"name": "anaïs"
}
}
}
And both results are returned, as expected:
"hits": [
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"name": "anaïs"
}
},
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"name": "anais"
}
}
]
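The effect of the lowercase and asciifolding filters can be approximated in plain Python to see why both names collapse to the same term. This is only a sketch: it strips combining marks after Unicode decomposition, which covers a case like "ï" but not every character the real asciifolding filter handles. The function name is hypothetical.

```python
import unicodedata


def fold(text):
    """Lowercase the text and drop diacritics, roughly like lowercase + asciifolding."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Combining marks (for example the diaeresis in "ï") are removed.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["anais", "anaïs"]
folded = [fold(name) for name in names]
```

Both names fold to "anais", so once the same analyzer is applied at index time and at search time, either spelling of the query finds both documents.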
I have a field indexed with a custom analyzer, with the configuration below:
"COMPNAYNAME" : {
"type" : "text",
"analyzer" : "textAnalyzer"
}
"textAnalyzer" : {
"filter" : [
"lowercase"
],
"char_filter" : [ ],
"type" : "custom",
"tokenizer" : "ngram_tokenizer"
}
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "ngram",
"min_gram" : "2",
"max_gram" : "3"
}
}
While I'm searching for the text "ikea" I'm getting the results below.
Query :
GET company_info_test_1/_search
{
"query": {
"match": {
"COMPNAYNAME": {"query": "ikea"}
}
}
}
Following are the results:
1. mikea
2. likeable
3. maaikeart
4. likeables
5. ikea b.v. <------
6. likeachef
7. ikea breda <------
8. bernikeart
9. ikea duiven
10. mikea media
I expect exact matches to be boosted above the rest of the results.
Could you please tell me the best way to index if I have to search with exact matches as well as with fuzziness?
Thanks in advance.
You can use the ngram tokenizer along with "search_analyzer": "standard". Refer to the search_analyzer documentation to know more.
As pointed out by @EvaldasBuinauskas, you can also use the edge_ngram tokenizer here if you want tokens to be generated only from the beginning of each word and not from the middle.
Adding a working example with index data, mapping, search query, and result
Index Data:
{ "title": "ikea b.v."}
{ "title" : "mikea" }
{ "title" : "maaikeart"}
Index Mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
Search Query:
{
"query": {
"match" : {
"title" : "ikea"
}
}
}
Search Result:
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "4",
"_score": 0.1499838, <-- note this
"_source": {
"title": "ikea b.v."
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 0.13562363, <-- note this
"_source": {
"title": "mikea"
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "3",
"_score": 0.083597526,
"_source": {
"title": "maaikeart"
}
}
]
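A rough way to reason about this ranking is to count the tokens each title produces at index time. This pure-Python sketch approximates the ngram tokenizer with the settings from the mapping above (min_gram 2, max_gram 10, letters and digits only). It is an illustration with a hypothetical function name and does not reproduce the actual scoring.

```python
import re


def ngrams(text, min_gram=2, max_gram=10):
    """Emit every substring of each word with length min_gram..max_gram."""
    tokens = []
    # Split on anything that is not a letter or digit, as token_chars does.
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(word):
                    break
                tokens.append(word[start:start + size])
    return tokens


titles = ["ikea b.v.", "mikea", "maaikeart"]
token_counts = {title: len(ngrams(title)) for title in titles}
```

All three titles contain the token "ikea", which is the single term the standard search analyzer produces for the query. The title "ikea b.v." yields the fewest tokens, and since relevance scoring normalizes by field length, that is consistent with it ranking first in the result above.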