Query Elasticsearch index for words with and without accent - elasticsearch

I query for the word "café" and get 20 articles. When I repeat the search with the word "cafe", I only get 3 articles. So I'm looking for a way to handle words with accented letters the same as words without accents.
My problem is also that I already have a filled index, so I have to modify an existing system. I'm using Elasticsearch 6.5.
I found some useful information and went through the following steps:
Setting up folding analyzer
curl -H "Content-Type: application/json" --user <user:pass> -XPUT http://localhost/test/_settings?pretty -d '{
  "analysis": {
    "analyzer": {
      "folding": {
        "tokenizer": "standard",
        "filter": [ "lowercase", "asciifolding" ]
      }
    }
  }
}'
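For intuition, the lowercase + asciifolding chain maps "Café" to "cafe". Here is a rough Python approximation of that behavior (not the actual Lucene filter, just a sketch of the idea using Unicode decomposition):

```python
import unicodedata

def fold(text: str) -> str:
    """Approximate lowercase + asciifolding: lowercase the text, then
    decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold("Café"))                  # -> cafe
print(fold("café") == fold("cafe"))  # -> True
```

Both "café" and "cafe" end up as the same token, which is why a query against the folded field matches both spellings.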
Modify existing mapping for the content field
curl -H "Content-Type: application/json" --user <user:pass> -XPUT http://localhost/test/mytype/_mapping -d '{
  "properties" : {
    "content" : {
      "type" : "text",
      "fields" : {
        "folded" : {
          "type" : "text",
          "analyzer" : "folding"
        }
      }
    }
  }
}'
Do the search
curl -H "Content-Type: application/json" --user <user:pass> -XGET http://localhost/test/_search -d '{
  "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "cafe"
          }
        }
      ]
    }
  },
  "size" : 10,
  "from" : 0
}'
But the result is the same as before: I only find the articles containing "cafe", not the articles containing "café" as well. Is there something I'm missing?

Great start! You have created a new analyzer and changed your mapping; however, you now also need to reindex your data in order to populate the new content.folded field.
You can do that very easily by calling the update by query endpoint, like this:
curl --user <user:pass> -XPOST http://localhost/test/_update_by_query

In your search query you should reference content.folded; the folding analyzer is assigned to content.folded, not to content.
After a mapping update you will have to reindex your data in order to apply the change.
For a step-by-step guide, see the Reindex API documentation.
A working example:
Mappings
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "folded": {
              "type": "text",
              "analyzer": "folding"
            }
          }
        }
      }
    }
  }
}
Inserting a few documents
POST my_index/_doc/1
{
  "content": "café"
}
POST my_index/_doc/2
{
  "content": "cafe"
}
Search Query
GET my_index/_search
{
  "query": {
    "match": {
      "content.folded": "cafe"
    }
  }
}
Results
"hits" : {
  "total" : {
    "value" : 2,
    "relation" : "eq"
  },
  "max_score" : 0.18232156,
  "hits" : [
    {
      "_index" : "my_index",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.18232156,
      "_source" : {
        "content" : "café"
      }
    },
    {
      "_index" : "my_index",
      "_type" : "_doc",
      "_id" : "2",
      "_score" : 0.18232156,
      "_source" : {
        "content" : "cafe"
      }
    }
  ]
}
Hope this helps

Related

Synonyms relevance issue in Elasticsearch

I am trying to configure synonyms in Elasticsearch and have done the sample configuration below, but I am not getting the expected relevance when I search the data.
Below is the index mapping configuration:
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Below is the sample data I have indexed:
POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }
Below is the query I am trying:
GET test_index/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
        "analyzer": "my_search_analyzer"
      }
    }
  }
}
Current Result:
"hits" : [
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 1.8185701,
    "_source" : {
      "my_field" : "A different brain storm"
    }
  },
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 1.4100728,
    "_source" : {
      "my_field" : "I had a storm in my brain"
    }
  },
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.90928507,
    "_source" : {
      "my_field" : "This is a brainstorm"
    }
  }
]
I expect the document that exactly matches the query to come out on top, and documents that match via synonyms to come in with a lower score.
So my expectation is that the document with the value "This is a brainstorm" should come in first position.
Could you please suggest how I can achieve this?
I have tried applying boosting and weights as well, but no luck.
Thanks in advance!
Elasticsearch "replaces" every instance of a synonym with all of its other synonyms, and does so at both index and search time (unless you provide a separate search_analyzer), so you lose the exact token. To keep this information, use a subfield with the standard analyzer and then use a multi_match query to match either the synonyms or the exact value, boosting the exact field.
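To illustrate what the synonym filter does at analysis time, here is a toy Python sketch (not the real Lucene filter, and the `SYNONYMS` table is just the groups from the question) that expands each token into its whole synonym group; after this step the exact original token is no longer distinguishable from its synonyms:

```python
# Synonym groups copied from the question's filter definition
SYNONYMS = [["mind", "brain"], ["brainstorm", "brain storm"]]

def expand(tokens):
    """Toy synonym expansion: every token that belongs to a group
    is replaced by all members of that group; multi-word synonyms
    are emitted as separate word tokens."""
    out = []
    for tok in tokens:
        group = next((g for g in SYNONYMS if tok in g), None)
        if group:
            for syn in group:
                out.extend(syn.split())
        else:
            out.append(tok)
    return out

print(expand(["brainstorm"]))  # -> ['brainstorm', 'brain', 'storm']
```

Because the query "brainstorm" turns into the tokens brainstorm, brain and storm, documents containing only "brain storm" match just as well as documents containing the literal word.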
I got an answer from the Elastic forum here. I have copied it below for quick reference.
Hello there,
Since you are indexing synonyms into your inverted index, brain, storm and brainstorm are all different tokens after the analyzer does its thing. So at query time Elasticsearch uses your analyzer to create the tokens brain, storm and brainstorm from your query and matches multiple tokens in documents 2 and 4; document 2 has fewer words, so tf/idf scores it higher of the two, while document 1 only matches brainstorm.
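The length effect described above can be seen with a toy score: with the same number of matched terms, the shorter field wins. This is a deliberately simplified sketch of the length-norm idea (real Lucene scoring is BM25 and considerably more involved):

```python
def toy_score(query_tokens, doc_tokens):
    """Highly simplified relevance: matched term count divided by the
    square root of the field length (the length-norm intuition)."""
    matches = sum(1 for t in doc_tokens if t in query_tokens)
    return matches / (len(doc_tokens) ** 0.5)

# Analyzed query after synonym expansion
q = ["brainstorm", "brain", "storm"]
doc2 = ["a", "different", "brain", "storm"]              # 4 tokens, 2 matches
doc4 = ["i", "had", "a", "storm", "in", "my", "brain"]   # 7 tokens, 2 matches

print(toy_score(q, doc2) > toy_score(q, doc4))  # -> True
```

Both documents match two terms, but document 2's shorter field gives it the higher score, mirroring the ordering in the question's result.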
You can also see what your analyzer does to your input with this:
POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "I had a storm in my brain"
}
I did some experimenting; your index analyzer should be my_analyzer:
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then you want to boost your exact matches, but you also want to get hits from the my_search_analyzer tokens as well, so I have changed your query a bit:
GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm"
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.3491273,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3491273,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      }
    ]
  }
}

Search by exact match in all fields in Elasticsearch

Let's say I have 3 documents, each of which contains only one field (but let's imagine that there are more, and we need to search through all fields).
1. Field value is "first second"
2. Field value is "second first"
3. Field value is "first second third"
Here is a script that can be used to create these 3 documents:
# drop the index completely, use with care!
curl -iX DELETE 'http://localhost:9200/test'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/one' -d '{"name":"first second"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/two' -d '{"name":"second first"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/three' -d '{"name":"first second third"}'
I need to find the only document (document 1) that has exactly "first second" text in one of its fields.
Here is what I tried.
A. Plain search:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "first second"
    }
  }
}'
returns all 3 documents
B. Quoting
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "\"first second\""
    }
  }
}'
gives 2 documents: 1 and 3, because both contain 'first second'.
Here https://stackoverflow.com/a/28024714/7637120 they suggest to use 'keyword' analyzer to analyze the fields when indexing, but I would like to avoid any customizations to the mapping.
Is it possible to avoid them and still only find document 1?
Yes, you can do that by declaring the name mapping type as keyword. The key to solving your problem is simple: declare name with type keyword and off you go.
To demonstrate, I have:
1) created a mapping with the keyword type for the name field
2) indexed the three documents
3) searched with a match query
mappings
PUT so_test16
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}
Indexing the documents
POST /so_test16/_doc
{
  "id": 1,
  "name": "first second"
}
POST /so_test16/_doc
{
  "id": 2,
  "name": "second first"
}
POST /so_test16/_doc
{
  "id": 3,
  "name": "first second third"
}
The query
GET /so_test16/_search
{
  "query": {
    "match": { "name": "first second" }
  }
}
and the result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "so_test16",
        "_type" : "_doc",
        "_id" : "m1KXx2sB4TH56W1hdTF9",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 1,
          "name" : "first second"
        }
      }
    ]
  }
}
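The difference between a keyword field and an analyzed text field can be sketched in plain Python: keyword compares the whole field value as a single token, while the standard analyzer breaks the value into word tokens that match in any order. A toy illustration (not ES internals):

```python
docs = {1: "first second", 2: "second first", 3: "first second third"}

def keyword_match(query, value):
    """keyword type: the entire field value is one token, so only an
    exact whole-value match succeeds."""
    return query == value

def analyzed_match(query, value):
    """text type with the standard analyzer (roughly): match if every
    query token occurs somewhere in the field, in any order."""
    return set(query.lower().split()) <= set(value.lower().split())

exact = [i for i, v in docs.items() if keyword_match("first second", v)]
loose = [i for i, v in docs.items() if analyzed_match("first second", v)]
print(exact)  # -> [1]
print(loose)  # -> [1, 2, 3]
```

This mirrors the results above: the keyword mapping finds only document 1, while the analyzed query matches all three.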
Adding a second solution
(if name is not a keyword type but a text type; the only extra thing needed here is fielddata: true on the name field)
Mappings
PUT so_test18
{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "id" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text",
          "fielddata": true
        }
      }
    }
  }
}
and the search query
GET /so_test18/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "name": "first second" } }
      ],
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['name'].values.length == 2"
          }
        }
      }
    }
  }
}
and the response
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.3971361,
    "hits" : [
      {
        "_index" : "so_test18",
        "_type" : "_doc",
        "_id" : "o1JryGsB4TH56W1hhzGT",
        "_score" : 0.3971361,
        "_source" : {
          "id" : 1,
          "name" : "first second"
        }
      }
    ]
  }
}
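The second solution combines a phrase match with a script that counts the field's tokens. A toy Python equivalent of that combined logic (the substring check is only a rough stand-in for match_phrase):

```python
docs = {1: "first second", 2: "second first", 3: "first second third"}

def phrase_match(phrase, value):
    """match_phrase, roughly: the query tokens appear adjacently
    and in order (approximated here with a substring test)."""
    return phrase.lower() in value.lower()

def exact_match(phrase, value):
    """Phrase match plus the token-count filter from the Painless
    script: the field must have exactly as many tokens as the query."""
    return phrase_match(phrase, value) and \
        len(value.split()) == len(phrase.split())

hits = [i for i, v in docs.items() if exact_match("first second", v)]
print(hits)  # -> [1]
```

Document 3 passes the phrase match but has three tokens instead of two, so the count filter removes it, leaving only document 1.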
In Elasticsearch 7.1.0, it seems that you can use the keyword analyzer even without creating a special mapping. At least I didn't create one, and the following query does what I need (presumably because query_string searches all fields by default, so the untokenized query term can match the name.keyword sub-field that dynamic mapping creates):
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "first second",
      "analyzer": "keyword"
    }
  }
}'

Elasticsearch prefix query not working on date

I have the following document in Elasticsearch, and I'd like to apply a prefix query on the logtime field, but nothing is returned.
{
  "_index" : "test",
  "_type" : "fluentd",
  "_id" : "6Cn38mMBMKvgU4HnnURh",
  "_score" : 1.0,
  "_source" : {
    "logtime" : "2018-06-11 03:08:02,117",
    "userid" : "",
    "payload" : "40",
    "qs" : "[['I have a dream, that everybody'], ['the'], ['steins']]"
  }
}
The prefix query is:
curl -X GET "localhost:9200/test/_search" -H 'Content-Type: application/json' -d'{
  "query": {
    "prefix": { "logtime": "2018-06-11" }
  }
}'
Could someone help? Thanks a lot.
You can use a Range Query in that case (note the field name must match your document, i.e. logtime):
{
  "query": {
    "range": {
      "logtime": {
        "gte": "2018-06-11",
        "lte": "2018-06-11",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
Hope it helps.
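The underlying reason the prefix query fails is that a date field is indexed as a numeric timestamp, not as the original string, so there is no string prefix to match against; a range comparison on the parsed value is the right tool. A small Python illustration of the distinction (hypothetical parsing, not ES internals):

```python
from datetime import datetime

raw = "2018-06-11 03:08:02,117"
# Elasticsearch stores dates as milliseconds since the epoch,
# not as this string, so "2018-06-11" is not a prefix of anything.
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f")

# A range check on the parsed value selects the whole day:
day_start = datetime(2018, 6, 11)
day_end = datetime(2018, 6, 12)
in_range = day_start <= ts < day_end
print(in_range)  # -> True
```

Prefix queries work on string terms (e.g. keyword fields); for date fields, range queries express "everything on this day" correctly.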

Using Email tokenizer in elasticsearch

I tried some examples from the Elasticsearch documentation and from Google, but nothing helped me figure this out.
The sample data I have is just a few blog posts. I am trying to find all posts with a given email address. When I use "email": "someone" I see all the posts matching someone, but when I change it to someone#gmail.com nothing shows up!
"hits": [
  {
    "_index": "blog",
    "_type": "post",
    "_id": "2",
    "_score": 1,
    "_source": {
      "user": "sreenath",
      "email": "someone#gmail.com",
      "postDate": "2011-12-12",
      "body": "Trying to figure out this",
      "title": "Elastic search testing"
    }
  }
]
When I use the GET query shown below, I see all posts matching someone#anything.com. But I want to change
{ "term" : { "email" : "someone" } } to { "term" : { "email" : "someone#gmail.com" } }
GET blog/post/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "and" : [
          {
            "term" : { "email" : "someone" }
          }
        ]
      }
    }
  }
}
I did a curl -XPUT with the following settings, but it did not help:
curl -XPUT localhost:9200/test/ -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : 1,
          "patterns" : [
            "([^#]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "#(.+)"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "email", "lowercase", "unique" ]
        }
      }
    }
  }
}
'
You have created a custom analyzer for email addresses but you are not using it. You need to declare the email field in your mapping type to actually use that analyzer, as shown below. Also make sure to create the right index with that analyzer, i.e. blog and not test:
curl -XPUT localhost:9200/blog/ -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : 1,
          "patterns" : [
            "([^#]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "#(.+)"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "email", "lowercase", "unique" ]
        }
      }
    }
  },
  "mappings": {          <--- add this
    "post": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
'
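To see why this analyzer makes both someone and someone#gmail.com searchable, here is a rough Python emulation of the pattern_capture filter above (using `[^\W\d_]+` in place of `\p{L}`, which Python's re module does not support, and skipping the lowercase/unique steps):

```python
import re

# Patterns from the settings above, with \p{L} approximated
PATTERNS = [r"([^#]+)", r"([^\W\d_]+)", r"(\d+)", r"#(.+)"]

def email_tokens(token):
    """Emulate pattern_capture with preserve_original: keep the
    original token and add every group captured by every pattern."""
    tokens = {token}  # preserve_original keeps the full address
    for pat in PATTERNS:
        for match in re.finditer(pat, token):
            tokens.add(match.group(1))
    return tokens

print(sorted(email_tokens("someone#gmail.com")))
```

The index ends up containing the whole address, the local part, the domain, and the individual words, so term queries on either "someone" or "someone#gmail.com" can match.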

ElasticSearch - searching different doc_types with the same field name but different analyzers

Let's say I make a simple ElasticSearch index:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "de_acronym": {
          "type": "mapping",
          "mappings": [".=>"]
        }
      },
      "analyzer": {
        "analyzer1": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": ["de_acronym"]
        }
      }
    }
  }
}'
And I make two doc_types that have the same property name but they are analyzed slightly differently from one another:
curl -XPUT 'http://localhost:9200/test/_mapping/docA' -d '{
  "docA": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "simple"
      }
    }
  }
}'
curl -XPUT 'http://localhost:9200/test/_mapping/docB' -d '{
  "docB": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "analyzer1"
      }
    }
  }
}'
Next, let's say I put a document in each doc_type with the same name:
curl -XPUT 'http://localhost:9200/test/docA/1' -d '{ "name" : "U.S. Army" }'
curl -XPUT 'http://localhost:9200/test/docB/1' -d '{ "name" : "U.S. Army" }'
Let's try to search for "U.S. Army" in both doc types at the same time:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.5,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docA",
      "_id" : "1",
      "_score" : 1.5,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I only get one result! I get the other result when I specify docB's analyzer:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army",
        "analyzer": "analyzer1"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docB",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I was under the impression that ES would search each doc_type with the appropriate analyzer. Is there a way to do this?
The ElasticSearch docs say that precedence for search analyzer goes:
1) The analyzer defined in the query itself, else
2) The analyzer defined in the field mapping, else
...
In this case, is ElasticSearch arbitrarily choosing which field mapping to use?
Take a look at this issue on GitHub, which seems to have started from this post in the ES Google Groups. I believe it answers your question:
if its in a filtered query, we can't infer it, so we simply pick one of those and use its analysis settings