Finding exact value with synonym applied in Elasticsearch

I have synonyms set up for certain fields; however, I would like them to be applied to fields that I map as not_analyzed.
For example, I have a field foo storing seksyen 10 which is not_analyzed, and a synonym entry is added for seksyen, section (to deal with mixed languages within a document).
"foo": {"type": "string", "index": "not_analyzed"}
Suppose a user issues the query
{"term": {"foo": "section 10"}}
and expects to match documents whose foo is seksyen 10 or section 10. However, with the current mapping I can't return seksyen 10 for that query. I am also using a filtered query here because I don't want these to be returned:
whatever seksyen 10
seksyen 10, something
whatever section 10 something
I just want synonym expansion to be applied to the query, without specifying it in the query itself. How should I do that?

First of all, using term will not do any analysis on the searched text, so you need a different type of query.
You can do it like the following:
{
  "mappings": {
    "test": {
      "properties": {
        "foo": {
          "type": "string",
          "index": "not_analyzed",
          "search_analyzer": "synonym"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": [
            "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "seksyen, section"
          ]
        }
      }
    }
  }
}
So, you define a search_analyzer to be used at search time only. And then you need to give up on the term filter, otherwise it will not work:
{
  "query": {
    "match": {
      "foo": "section"
    }
  }
}
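You can sanity-check the expansion with the _analyze API (ES 1.x query-parameter form shown here, assuming the settings above were applied to an index named test; ES 5+ takes a JSON body instead):
GET /test/_analyze?analyzer=synonym&text=section
This should return both section and seksyen as tokens at the same position, which is what lets the match query above find seksyen 10.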
The solution above works in ES 1.x. In ES 2.x, combining a search_analyzer with a not_analyzed field is no longer possible.
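If you need similar behaviour on ES 2.x, one possible workaround (a sketch, not an exact equivalent of not_analyzed) is to index the field with the built-in whitespace analyzer, keep the synonym analyzer for search time only, and query with match_phrase:
{
  "mappings": {
    "test": {
      "properties": {
        "foo": {
          "type": "string",
          "analyzer": "whitespace",
          "search_analyzer": "synonym"
        }
      }
    }
  }
}
{
  "query": {
    "match_phrase": {
      "foo": "section 10"
    }
  }
}
Be aware that, unlike a not_analyzed field, a phrase can also match inside a longer value, so whatever seksyen 10 would come back as well.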

Related

Multi-field search for synonym in the query string

It looks like Elasticsearch does not take field analyzers into account for multi-field searches using query_string when no field is specified.
Can this be configured for the index, or specified in the query?
Here is a hands-on example, given the files from a commit (spring-data-elasticsearch).
There is a test, SynonymRepositoryTests, which passes with the QueryBuilders.queryStringQuery("text:british") and QueryBuilders.queryStringQuery("british").analyzer("synonym_analyzer") queries.
Is it possible to make it pass with a QueryBuilders.queryStringQuery("british") query, without specifying the field and analyzer for the query?
You can query without specifying fields or analyzers. By default, the query_string query runs against the _all field, which is a combination of all fields and uses the standard analyzer, so QueryBuilders.queryStringQuery("british") will work.
You can exclude some fields from _all while creating the index, and you can also create a custom all-like field with the help of the copy_to functionality, as sketched below.
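For illustration, a copy_to based custom all field could look like the following (the field names and combined are placeholders of mine; synonym_analyzer is assumed to be defined in the index settings):
PUT text_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "string",
          "copy_to": "combined"
        },
        "tag": {
          "type": "string",
          "copy_to": "combined"
        },
        "combined": {
          "type": "string",
          "analyzer": "synonym_analyzer"
        }
      }
    }
  }
}
A query_string query with "default_field": "combined" would then go through the synonym analyzer.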
UPDATE
You would have to set your custom analyzer on the _all field while creating the index.
PUT text_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trim",
            "edge_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "_all": {
        "enabled": true,
        "analyzer": "prefix_analyzer" <---- your synonym analyzer
      },
      "properties": {
        "name": {
          "type": "string"
        },
        "tag": {
          "type": "string",
          "analyzer": "simple"
        }
      }
    }
  }
}
You can replace prefix_analyzer with your synonym_analyzer and then it should work.
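For completeness, a synonym_analyzer definition might look like this (the synonym pair below is a placeholder; the real entries would come from the test's synonyms file):
"settings": {
  "analysis": {
    "filter": {
      "synonym_filter": {
        "type": "synonym",
        "synonyms": [
          "british, english"
        ]
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_filter"
        ]
      }
    }
  }
}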

Case-insensitive Elasticsearch with uppercase or lowercase

I am working with Elasticsearch and I am facing a problem; if anybody can give me a hint, I will be really thankful.
I want to analyze a field, "name" or "description", which consists of different entries. For example, someone wants to search for Sara: whether they enter SARA, SAra, or sara, they should get Sara back.
Elasticsearch uses an analyzer which makes everything lowercase.
I want to make this case-insensitive: regardless of whether the user inputs an uppercase or lowercase name, they should get results.
I am using an ngram filter to search names, together with lowercase, which makes it case-insensitive. But I want to make sure that a person gets results even if they type uppercase or lowercase.
Is there any way to do this in Elasticsearch?
{"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 80
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
},
I am attaching the example.js file, which includes a JSON example, and a search.txt file to explain my problem. I hope my problem is clearer now.
This is the link to OneDrive where I keep both files:
https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc
Is there any specific reason you are using ngram? Elasticsearch uses the same analyzer on the query as well as on the text you index, unless a search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter:
I created an index with the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "typehere": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
Indexed two documents.
Doc 1:
PUT /test_index/test_mapping/1
{
  "name": "Sara Connor",
  "Description": "My real name is Sarah Connor."
}
Doc 2:
PUT /test_index/test_mapping/2
{
  "name": "John Connor",
  "Description": "I might save humanity someday."
}
Do a simple search:
POST /test_index/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
And I get back only the first document. I tried with "sara" and "Sara" as well, with the same results.
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
The analysis process is executed for full-text (analyzed) fields twice: first when the data is stored, and a second time when you search. It's worth noting that the input JSON is returned unchanged as the output of a search query; the analysis process is only used to build tokens for the inverted index. The key to your solution is the following two steps:
1. Create two analyzers: one with the ngram filter and a second one without it. You don't need to analyze the input search query with ngram, because you have the exact value that you want to search for.
2. Define the mappings correctly for your fields. There are two mapping parameters that let you specify analyzers: one is used at index time (analyzer) and the second at search time (search_analyzer). If you specify only analyzer, it is used at both index and search time.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
And your code should look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "index_store_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "index_store_ngram",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
POST /my_index/my_type/1
{
  "name": "Sara_11_01"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SaRa"
    }
  }
}
Edit 1: updated code for a new example provided in the question
This answer is in the context of Elasticsearch 7.14. So let me restate the question in another way:
Irrespective of the case used in the match query, you would like to be able to get back those documents that have been analyzed with:
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
Now, coming to the answer part:
It will not be possible to get the match query to return the docs that have been analyzed with the lowercase filter when the match query contains uppercase letters. The analysis that you have applied in the settings is used both while indexing and while searching data. Although it is also possible to apply different analyzers for indexing and searching, I do not see that helping your case. You would have to convert the match query value to lowercase before issuing the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra; the match value should be all lowercase, just as it is in your analyzer.

Why does the match_phrase_prefix query return wrong results with different phrase lengths?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and the index definition (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns one result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values as high as 1200, and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
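You can confirm this by running the search term through the analyzer yourself (ES 1.x/2.x query-parameter form shown; newer versions accept a JSON body):
GET /indexX/_analyze?analyzer=polish_with_synonym&text=grab
If the returned token is a stemmed form rather than grab, then the prefix built by match_phrase_prefix no longer matches the terms that were indexed for GRABARZ.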
Without going into detail on how to resolve this, please consider getting rid of the Polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess in the case of Polish names some kind of prefix query on a non-analyzed field would suffice, as sketched below.
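A minimal sketch of that idea (the raw sub-field name is my own choice): keep the polish analyzer on the main field and add a not_analyzed sub-field for prefix matching. Note that prefix queries are not analyzed, so the query text must match the stored case:
PUT /indexX
{
  "mappings": {
    "document": {
      "properties": {
        "surname": {
          "type": "string",
          "analyzer": "polish",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "GRAB"
    }
  }
}
Since the surnames are stored uppercase (GRABARZ), the prefix GRAB matches regardless of what the stemmer would have produced.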

Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database; it is a string field. Examples:
000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874
The most common format is just an integer preceded by some number of zeros.
How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:
{
  "query": {
    "multi_match": {
      "query": "65",
      "fields": ["serviceOrderNo^2", "_all"]
    }
  }
}
One way of doing this is using the Lucene-flavour regular expression query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
"query": {
"regexp":{
"serviceOrderNo": "[0]*65"
}
}
The query_string query also supports a small set of special characters and a more limited set of regular expression characters, as well as full Lucene regular expressions. That query would look like this:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html
"query": {
"query_string": {
"default_field": "serviceOrderNo",
"query": "0*65"
}
}
These are fairly simple regular expressions, both saying: match the character 0 (written as the character class [0] in the first form) repeated any number of times (*), followed by 65.
If you have the ability to reindex, or haven't indexed your data yet, you can also make this easier on yourself by writing a custom analyzer. Right now you are using the default analyzer for strings on your serviceOrderNo field, so when you index "serviceOrderNo": "000000065", ES interprets it simply as 000000065.
Your custom analyzer could tokenize this field into both "000000065" and "65", using the same regular expression. The benefit of this is that the regex only runs once at index time, instead of every time you run your query, because ES will search against both "000000065" and "65".
You can also check out the ES website documentation on Analyzers.
"settings":{
"analysis": {
"filter":{
"trimZero": {
"type":"pattern_capture",
"patterns":"^0*([0-9]*$)"
}
},
"analyzer": {
"serviceOrderNo":{
"type":"custom",
"tokenizer":"standard",
"filter":"trimZero"
}
}
}
},
"mappings":{
"serviceorderdto": {
"properties":{
"serviceOrderNo":{
"type":"String",
"analyzer":"serviceOrderNo"
}
}
}
}
One way to do this is to use an ngram token filter so that "12345" gets tokenized as:
[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 12345 ]
When tokenized this way, "65" is a match for "000000065".
To set this up, create a new index that has a custom analyzer that uses an ngram filter:
POST /my-index
{
  "mappings": {
    "serviceorderdto": {
      "properties": {
        "serviceOrderNo": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Index some data.
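For example (index and type names as above, document ID chosen arbitrarily):
PUT /my-index/serviceorderdto/1
{
  "serviceOrderNo": "000000065"
}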
Then run your query:
GET /my-index/_search
{
  "query": {
    "multi_match": {
      "query": "65",
      "fields": [
        "serviceOrderNo^2",
        "_all"
      ]
    }
  }
}

Why does a prefix query return documents without the specified prefix?

I want to return only documents whose name starts with "pizza". This is what I've done:
{
  "query": {
    "filtered": {
      "filter": {
        "prefix": {
          "name": "pizza"
        }
      }
    }
  }
}
But I've got these 3 documents:
{
  "name": "Viana Pizza",
  "city": "Mashhad",
  "address": "Vakil abad",
  "foods": ["Pizza"],
  "salad": true,
  "rate": 5.0
}
{
  "name": "Pizza Pizza",
  "city": "Mashhad",
  "address": "Bahar st",
  "foods": ["Pizza"],
  "salad": true,
  "rate": 8.5
}
{
  "name": "Reza Pizza",
  "city": "Tehran",
  "address": "Vali Asr",
  "foods": ["Pizza"],
  "salad": true,
  "rate": 7.5
}
As you can see, only one of them has "pizza" at the beginning of the name field.
What's wrong?
Probably the simplest explanation, given that you didn't provide the actual mapping, is that you have the "name" field as "string" and "analyzed" (the default), which means that "Reza Pizza" is transformed into the terms "reza" and "pizza".
And your filter matches against terms, not against entire fields, because ES analyzes the fields and builds terms from them when the standard mapping is used.
You need to either change your "name" field to "not_analyzed", or add another field to mirror "name" with that mirror field being "not_analyzed". Also, for the lowercase text "pizza" to work in this case, you need to create a custom analyzer.
Below you have the solution with the mirror field:
PUT /pizza
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "my_keyword_lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
And when searching you need to use the mirror field:
GET /pizza/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "prefix": {
          "name.raw": "pizza"
        }
      }
    }
  }
}
This is all about Elasticsearch analyzers. Let's read the documentation on the prefix filter:
Filters documents that have fields containing terms with a specified prefix (not analyzed).
Here we can see that this filter matches terms, not the whole field value. When you index a document, ES splits your field values into terms using analyzers. The default analyzer splits the value by whitespace and converts the parts to lowercase. So all three results have the term pizza in the name field, and the pizza term perfectly matches the pizza prefix. If you want to match the field value as a whole, I'd suggest mapping the name field as not_analyzed, as sketched below.
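A minimal sketch of that mapping (type name assumed). Keep in mind that with not_analyzed the whole value is a single term that keeps its original capitalization, so the prefix has to be Pizza, not pizza; the keyword-plus-lowercase analyzer from the previous answer is the way around that:
PUT /pizza
{
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
GET /pizza/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "prefix": {
          "name": "Pizza"
        }
      }
    }
  }
}
This returns only "Pizza Pizza", since it is the only name that literally starts with Pizza.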
