Latinize tokens at query time - elasticsearch

I need to latinize the query tokens that I use when querying (or filtering). I can do this at the application level, but I was wondering if Elasticsearch provides an out-of-the-box solution.
I'm using ES 1.7.5 (as a service)

By default, Elasticsearch uses the same analyzer at index time and at query time, but it is possible to specify a search_analyzer which will only be used at query time.
Let's take a look at the following example:
# First we define an analyzer called `latinize` which folds non-ASCII characters.
PUT books
{
  "settings": {
    "analysis": {
      "analyzer": {
        "latinize": {
          "tokenizer": "standard",
          "filter": ["asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "book": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",        # We use the standard analyzer at index time.
          "search_analyzer": "latinize"  # But we use the latinize analyzer at query time.
        }
      }
    }
  }
}
# Now let's create a document and search for it with a non latinized string.
POST books/book
{
  "name": "aaoaao"
}

POST books/_search
{
  "query": {
    "match": {
      "name": "ääöääö"
    }
  }
}
And bam! There is our document.
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "books",
        "_type": "book",
        "_id": "AVkIXdNyDpmDHTvI6Cp1",
        "_score": 0.30685282,
        "_source": {
          "name": "aaoaao"
        }
      }
    ]
  }
}
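To see the folding at work directly, you can run the `latinize` analyzer against the raw query string with the _analyze API (a quick sketch in the ES 1.x query-string form, assuming the books index above):

GET books/_analyze?analyzer=latinize&text=ääöääö

The single token it produces should be aaoaao, which is exactly the term the standard analyzer indexed for our document.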

Related

Elasticsearch query with fuzziness AUTO not working as expected

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6 that make for lengths:
0..2: Must match exactly
3..5: One edit allowed
>5: Two edits allowed
However, when I am trying to specify low and high distance arguments in the search query the result is not what I am expecting.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
  "fuzzy_test": {
    "mappings": {
      "_doc": {
        "properties": {
          "description": {
            "type": "text"
          },
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Inserting a simple document:
{
  "id": "1",
  "description": "hello world"
}
And the following search query:
{
  "size": 10,
  "timeout": "30s",
  "query": {
    "match": {
      "description": {
        "query": "helqo",
        "fuzziness": "AUTO:7,10"
      }
    }
  }
}
I assumed that fuzziness AUTO:7,10 would mean that for an input term of length <= 6 only documents with an exact match would be returned. However, here is the result of my query:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.23014566,
    "hits": [
      {
        "_index": "fuzzy_test",
        "_type": "_doc",
        "_id": "OQtUu2oBABnEwrgM3Ejr",
        "_score": 0.23014566,
        "_source": {
          "id": "1",
          "description": "hello world"
        }
      }
    ]
  }
}
This is strange, but it seems that this bug exists only in Elasticsearch 6.6.0. I've tried 6.4.2 and 6.6.2, and both of them work just fine.
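If you are stuck on 6.6.0, one workaround (just a sketch, not an official fix) is to compute the fuzziness on the client side instead of relying on AUTO, sending an explicit value per term length, e.g. fuzziness 0 for terms shorter than your low cutoff:

{
  "size": 10,
  "query": {
    "match": {
      "description": {
        "query": "helqo",
        "fuzziness": 0
      }
    }
  }
}

With an explicit fuzziness of 0, "helqo" no longer matches "hello world", which is the behavior AUTO:7,10 should have produced for a 5-character term.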

elasticsearch - number of searches affects relevance?

I have the following mapping:
POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        }
      }
    }
  }
}
I've inserted two docs:
POST music/song
{
  "song_field" : "Premeditiated murder"
}

POST music/song
{
  "song_field" : "Premeditiated"
}
Here is the query:
POST music/song/_search
{
  "size": 10,
  "query": {
    "match": {
      "song_field": {
        "query": "Premeditiated murd",
        "fuzziness": 2
      }
    }
  }
}
Response:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.78730416,
    "hits": [
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUf6XK1ancUpEdFLdz8",
        "_score": 0.78730416,
        "_source": {
          "song_field": "Premeditiated"
        }
      },
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUfUbocancUpEdFLdUf",
        "_score": 0.668494,
        "_source": {
          "song_field": "Premeditiated murder"
        }
      }
    ]
  }
}
I have two questions:
Why is "Premeditiated" scored higher? How can I get a reasonable correction + auto-complete?
Does searching for the same document over and over again affect the default ES score?
You get the wrong response because sorting by relevance is broken for very small sets of data when you have multiple shards. Relevance is calculated per shard, and the results from each shard are then merged and returned, so your "Premeditiated" document simply had the bigger relevance score on its shard. This is a common issue and is well described here: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
There are two ways to solve this issue (both shown below):
1. Set the number_of_shards option to 1 when defining the index mapping.
2. Add search_type=dfs_query_then_fetch to your search request.
After using one of the above options you will get the result you want.
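Concretely, a quick sketch against the music index above (for option 1 you would need to recreate the index and merge the single-shard setting with your existing analysis settings):

# Option 1: create the index with a single shard
PUT music
{
  "settings": {
    "number_of_shards": 1
  }
}

# Option 2: use global term statistics at search time
POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "query": {
    "match": {
      "song_field": {
        "query": "Premeditiated murd",
        "fuzziness": 2
      }
    }
  }
}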
Regarding your second question: scoring is calculated every time you search. Even if you search for the same document over and over again, the scoring is recalculated and the _score result is always the same. If you want to read more about how scoring works, read the "Controlling relevance" chapter: https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-relevance.html. You can always add the explain property to your query to see how the scoring was calculated: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-explain.html.
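For example (a minimal sketch), adding "explain": true to the request body returns a full scoring breakdown for every hit:

POST music/song/_search
{
  "explain": true,
  "query": {
    "match": {
      "song_field": "Premeditiated murd"
    }
  }
}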
P.S.
Great that you provided your JSONs, but there is a wrong field inside the query: it should be song_field instead of song_field_1. Additionally, your response doesn't fit the data stored inside the type (look at the _source field in the response), but it doesn't matter here :P

how to switch on Elasticsearch stemming

I don't know how to turn on English word stemming in Elasticsearch. I'm sorry, but I couldn't find a clear example of how to do it.
Here is what I did
Creating the index
PUT /staff/list/ -d
{
  "settings" : {
    "analysis": {
      "analyzer": {
        "standard": {
          "type": "standard"
        }
      }
    }
  }
}
Adding document
PUT /staff/list/jason
{
  "Title" : "searches"
}
When I search for search:
GET /staff/list/_search?q=search
The result doesn't appear.
What index settings do I need to make stemming work?
Many thanks in advance
Please note that the default Elasticsearch analyzer does not support stemming.
In order to support stemming you may need to create a custom analyzer.
Here is how you do it:
Create the index and define an analyzer called my_analyzer
PUT /staff
{
  "settings" : {
    "analysis": {
      "filter": {
        "filter_snowball_en": {
          "type": "snowball",
          "language": "English"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "filter_snowball_en"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
Configure a mapping that assigns my_analyzer to the list type
PUT /staff/_mapping/list
{
  "list": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index documents
PUT /staff/list/jason
{
  "title": "searches"
}

PUT /staff/list/debby
{
  "title": "searched open"
}
Search, with stemmed results
GET staff/list/_search
{
  "query": {
    "query_string": {
      "query": "title:opened"
    }
  }
}
Result
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "staff",
        "_type": "list",
        "_id": "debby",
        "_score": 1,
        "_source": {
          "title": "searched open"
        }
      }
    ]
  }
}
As you can see in the search results, the debby document, which contains the term open, was returned even though we were searching for opened.
Hope that helps.
When you create the index like that, you are doing nothing (just re-declaring the standard analyzer).
The standard analyzer is the default that Elasticsearch uses, and it doesn't stem any words.
You need to map the fields to their respective analyzers at index creation (see the mapping documentation):
PUT /staff -d
{
  "mappings": {
    "list": {
      "properties": {
        "Title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
I guess the english analyzer fits your case (it uses the standard tokenizer).
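You can verify the difference quickly with the _analyze API; both analyzers are built in, so no index settings are required (a sketch in the ES 1.x/2.x query-string form):

# The standard analyzer keeps the token as-is: searches
GET _analyze?analyzer=standard&text=searches

# The english analyzer stems it to: search
GET _analyze?analyzer=english&text=searches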

How to perform an exact match query on an analyzed field in Elasticsearch?

This is probably a very commonly asked question; however, the answers I've gotten so far aren't satisfactory.
Problem:
I have an ES index composed of nearly 100 fields. Most of the fields are of string type and set as analyzed. However, a query can be either partial (match) or exact (more like term). So, if my index contains a string field with the value super duper cool pizza, a partial query like duper super should match the document, while an exact query like cool pizza should not. On the other hand, Super Duper COOL PIzza should again match the document.
So far, the partial match part is easy: I used the AND operator in a match query. However, I can't get the other type to work.
I have looked into other posts related to this problem and this post contains the closest solution:
Elasticsearch exact matches on analyzed fields
Out of the three solutions, the first one feels very complex, as I have a lot of fields and I don't use the REST API; I am creating queries dynamically using QueryBuilders with NativeSearchQueryBuilder from their Java API. It also generates a lot of possible patterns, which I think will cause performance issues.
The second one is a much easier solution, but again, I would have to maintain a lot of (almost) redundant data, and I don't think term queries alone are ever going to solve my problem.
The last one has a problem, I think: it will not prevent super duper from being matched with super duper cool pizza, which is not the output I want.
So is there any other way I can achieve the goal? I can post some sample mappings if required to clarify the question further. I am already keeping the source as well (in case that can be used). Please feel free to suggest any improvements.
Thanks in advance.
[UPDATE]
Finally, I used multi_field, keeping a raw sub-field for exact queries. When I insert, I apply some custom modifications to the data, and during searching I apply the same modification routines to the input text. This part is not handled by Elasticsearch; if you want that, you have to design appropriate analyzers as well.
Index settings and mapping queries:
PUT test_index

POST test_index/_close

PUT test_index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard_uppercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  }
}

PUT test_index/doc/_mapping
{
  "doc": {
    "properties": {
      "text_field": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "analyzer": "standard_uppercase"
          }
        }
      }
    }
  }
}

POST test_index/_open
Inserting some sample data:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Exact query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "term": {
              "text_field.raw": "PIZZA"
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1.4054651,
        "_source": {
          "text_field": "pizza"
        }
      }
    ]
  }
}
Partial query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "match": {
              "text_field": {
                "query": "pizza",
                "operator": "AND",
                "type": "boolean"
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "_source": {
          "text_field": "pizza"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
PS: These are generated queries, which is why there are some redundant blocks; many other fields would be concatenated into the queries.
The sad part is, now I need to rewrite the whole mapping again :(
I think this will do what you want (or at least come as close as is possible), using the keyword tokenizer and lowercase token filter:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase_token_filter"]
        }
      },
      "filter": {
        "lowercase_token_filter": {
          "type": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text_field": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "lowercase": {
              "type": "string",
              "analyzer": "lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
I added a couple of docs for testing:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Notice we have the outer text_field set to be analyzed by the standard analyzer, then a sub-field raw that's not_analyzed (you may not want this one, I just added it for comparison), and another sub-field lowercase that creates tokens exactly the same as the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:
POST /test_index/_search
{
  "query": {
    "match": {
      "text_field.lowercase": "Super Duper COOL PIzza"
    }
  }
}
...
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
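If you do want to skip query-time analysis entirely, a term query against the lowercase sub-field works too; a small sketch (note that the caller has to lowercase the value itself, since term queries bypass the analyzer):

POST /test_index/_search
{
  "query": {
    "term": {
      "text_field.lowercase": "super duper cool pizza"
    }
  }
}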
It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):
POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "text_field_standard": {
      "terms": {
        "field": "text_field"
      }
    },
    "text_field_raw": {
      "terms": {
        "field": "text_field.raw"
      }
    },
    "text_field_lowercase": {
      "terms": {
        "field": "text_field.lowercase"
      }
    }
  }
}
...
{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "text_field_raw": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_lowercase": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_standard": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 2
        },
        {
          "key": "cool",
          "doc_count": 1
        },
        {
          "key": "duper",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "some",
          "doc_count": 1
        },
        {
          "key": "super",
          "doc_count": 1
        },
        {
          "key": "text",
          "doc_count": 1
        }
      ]
    }
  }
}
Here's the code I used to test this out:
http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1
If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
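In the spirit of that post, here is a minimal edge_ngram autocomplete sketch (the index name ngram_test and the autocomplete_* names are placeholders made up for illustration, not taken from the post):

PUT /ngram_test
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}

A field indexed with autocomplete_analyzer (and searched with a plain standard analyzer as the search_analyzer) will then match prefixes like piz against pizza.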

ElasticSearch - Match (email value) returns wrong records

I'm using a match query to search for a specific email, but the results are wrong: the match query brings back similar results. When an exact match exists, it shows up in the first rows, but when it doesn't, the query brings back results from the same domain.
Here is my query:
{
  "query": {
    "match" : {
      "email" : "placplac#xxx.net"
    }
  }
}
This email doesn't exist in my database, yet values like banana#xxx.net, ronyvon#xxx.net, etc. are returned.
How can I force the query to return results only when the value is exactly equal to the one in the query?
Thanks in advance.
You need to put "index": "not_analyzed" on the email field. That way, the only terms that are queried against are the exact values stored in that field (as opposed to the case with the standard analyzer, which is the default used if no analyzer is specified).
To illustrate, I set up a simple mapping with the email field not analyzed, and added two simple docs:
DELETE /test_index

PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT /test_index/doc/1
{"email": "placplac#xxx.net"}

PUT /test_index/doc/2
{"email": "placplac#nowhere.net"}
Now your match query will return only the document that matches the query exactly:
POST /test_index/_search
{
  "query": {
    "match" : {
      "email" : "placplac#xxx.net"
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "email": "placplac#xxx.net"
        }
      }
    ]
  }
}
Here is the code I used:
http://sense.qbox.io/gist/12763f63f2a75bf30ff956c25097b5955074508a
PS: What you actually probably want here is a term query or even a term filter, since you don't want any analysis of the query text. So maybe something like:
POST /test_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "email": "placplac#xxx.net"
        }
      }
    }
  }
}
