Elasticsearch fuzzy query

I am trying to make a fuzzy search that should work like this.
And I have my index like this:
{
  "test": {
    "aliases": {},
    "mappings": {
      "properties": {
        "first_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "last_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "test",
        "creation_date": "1617623285742",
        "number_of_replicas": "1",
        "uuid": "MxSWoxSoS6y6x5Jdt2AvMQ",
        "version": {
          "created": "7120099"
        }
      }
    }
  }
}
Inside that index there is one document:
{
  "first_name": "homo sapiens",
  "last_name": "moho"
}
I tried to query like this, but it doesn't work:
{
  "query": {
    "match": {
      "first_name": {
        "query": "hosan",
        "fuzziness": "AUTO:0,0"
      }
    }
  }
}
but if I search with "hoom", "homoo" or "homos" it works.
Can someone help me with this fuzzy search? Thanks!

With a query term of 5 characters (hosan), a fuzziness value of AUTO will only give you an edit distance of 1, which is not enough to get from hosan to homo. The maximum edit distance you can get with AUTO is 2, and you will only get that when your query term is longer than 5 characters. You can force a fuzziness value of 3 or 4 to attempt to achieve your desired results, but the reason the ES default maximum is 2 is that higher values can start yielding unexpected and unwieldy results. Note also that your other search examples (hoom, homoo, etc.) are matching only on the word homo. Match queries are OR queries by default, and will return results for any matched term.
Just for reference, AUTO gives you an edit distance of 0 for query terms of 1-2 characters, 1 for query terms of 3-5 characters, and 2 for query terms longer than 5 characters.
So I would bump up your fuzziness value by 1 until you get the result returned when searching on hosan, but only to prove out what I'm outlining here. I personally would not go above a fuzziness value of 2, maybe 3, in any production environment.
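As a rough sketch of bumping the fuzziness explicitly (the field name comes from the question; the value of 2 is just the highest value I would normally use, not a recommendation for this data):

```python
# Build a match-query body with an explicit fuzziness value instead of AUTO.
def fuzzy_match_query(field, term, fuzziness):
    return {
        "query": {
            "match": {
                field: {
                    "query": term,
                    # AUTO tops out at edit distance 2; explicit values
                    # above 2 tend to produce noisy matches.
                    "fuzziness": fuzziness,
                }
            }
        }
    }

body = fuzzy_match_query("first_name", "hosan", 2)
```

You would then pass `body` to the client, e.g. `es.search(index="test", body=body)`.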

After a lot of research about Elasticsearch and fuzzy search, I found that it is impossible to use fuzziness alone to get a result like "homo sapiens" from the search keyword "hosan". To solve this I need to combine the fuzzy query with a regexp query from Elasticsearch.
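A minimal sketch of that combination, assuming a bool "should" where either clause can match; the field name comes from the question, but the regexp pattern built from the term's first two letters is purely an illustrative assumption, not a general-purpose solution:

```python
# Combine a fuzzy match with a regexp query: a hit on either clause
# returns the document.
def fuzzy_plus_regexp(field, term):
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": term, "fuzziness": "AUTO"}}},
                    # Hypothetical pattern: any indexed token starting with
                    # the first two letters of the query term.
                    {"regexp": {field: {"value": term[:2] + ".*"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }

body = fuzzy_plus_regexp("first_name", "hosan")
```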

Related

Elasticsearch exact multiword (array) query for one field

I am trying to write a query where I have multiple exact search terms, let's say an array of strings
like
["Q4 Test WC Schüssel", "Q4_18 Bankerlampen", "MORE_SEARCHTERMS"]
I have an index with a property data.name, and I want to search for each of my array strings inside this ONE field for the exact value, and I want all entries back where one of my array strings matches.
{
  "mappings": {
    "_doc": {
      "country": {
        "type": "keyword"
      },
      "data": {
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          }
I thought this would be an easy task, but I am not sure whether I am using the wrong Google search terms for this problem to find an example query.
Use a terms query against the keyword sub-field (note the full path data.name.keyword, since name sits under data):
GET /_search
{
  "query": {
    "terms": {
      "data.name.keyword": ["Q4 Test WC Schüssel", "Q4_18 Bankerlampen", "MORE_SEARCHTERMS"]
    }
  }
}

Elasticsearch match string with spaces, columns, dashes exactly

I'm using Elasticsearch 6.8 and trying to write a query in a Python notebook. Here is the mapping used for the index I'm working with:
{
  "mapping": {
    "news": {
      "properties": {
        "dateCreated": {
          "type": "date",
          "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
        },
        "itemId": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "market": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "timeWindow": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}
I'm trying to search for an exact string like "[2020-08-16 10:00:00.0,2020-08-16 11:00:00.0]" in the "timeWindow" field (which is a "text" type, not a "date" field), and also select by market="en-us" (market is a "text" field too). This string has spaces, colons, commas, and a lot of whitespace characters, and I don't know how to write the right query.
At the moment I have this query:
res = es.search(index='my_index',
                doc_type='news',
                body={
                    'size': size,
                    'query': {
                        "bool": {
                            "must": [
                                {
                                    "simple_query_string": {
                                        "query": "[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]",
                                        "default_operator": "and",
                                        "minimum_should_match": "100%"
                                    }
                                },
                                {"match": {"market": "en-us"}}
                            ]
                        }
                    }
                })
The problem is that it doesn't match my "simple_query_string" for the timeWindow string exactly (I understand that this string gets tokenized and split into parts like "2020", "08", "17", "00", "01", etc., and each token is analyzed separately), and I'm getting different values for timeWindow that I want to exclude, like
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 00:05:00.0,2020-08-17 01:05:00.0]'
...
'[2020-08-17 00:50:00.0,2020-08-17 01:50:00.0]'
'[2020-08-17 00:55:00.0,2020-08-17 01:55:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]']
Is there a way to do what I want?
UPD (and answer):
My current query uses "term" and "timeWindow.keyword"; this combination allows me to do an exact search for a string with spaces and other whitespace characters:
res = es.search(index='msn_click_events', doc_type='news', body={
    'size': size,
    'query': {
        "bool": {
            "must": [
                {
                    "term": {
                        "timeWindow.keyword": tw
                    }
                },
                {"match": {"market": "en-us"}}
            ]
        }
    }
})
And this query selects only the right timeWindow values (strings):
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]'
'[2020-08-17 02:00:00.0,2020-08-17 03:00:00.0]'
...
'[2020-08-17 22:00:00.0,2020-08-17 23:00:00.0]'
'[2020-08-17 23:00:00.0,2020-08-18 00:00:00.0]']
On your timeWindow field you need a keyword (i.e. exact) search, but you are using a full-text query. As you defined this field as a text field, it gets analyzed at index time, as you already guessed correctly, hence you are not getting the correct results.
If you are using dynamic mapping, then a .keyword sub-field is generated for each text field in the mapping, so you can simply use timeWindow.keyword in your query and it will work.
If you have defined your mapping yourself, then you need to add a keyword field to store the timeWindow, reindex the data, and use that keyword field in the query to get the expected results.
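A sketch of that last option, assuming a multi-field update followed by an in-place update-by-query to repopulate the new sub-field (index and field names are taken from the question; the client calls are commented out since they need a live cluster):

```python
# Add a "keyword" sub-field to the existing timeWindow text field.
mapping_update = {
    "properties": {
        "timeWindow": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
        }
    }
}

# With an elasticsearch-py client:
# es.indices.put_mapping(index="my_index", body=mapping_update)
# Re-index existing docs in place so timeWindow.keyword gets populated:
# es.update_by_query(index="my_index", conflicts="proceed")
```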

Elasticsearch query to find documents with number of values of a term equal to a specified number

I have an Elasticsearch index "library" with below mapping:
{
  "mappings": {
    "book": {
      "properties": {
        "title": { "type": "text", "index": "not_analyzed" },
        "author": { "type": "text", "index": "not_analyzed" },
        "price": { "type": "integer" }
      }
    }
  }
}
Now I want to make a query to find all documents (books) where the number of author values is equal to 3, i.e. I want to make a query which will match:
curl -XGET "http://localhost:9200/library/_search?pretty=true" -d '{
  "query": {
    "match": {
      Number of values of term "author" = 3.
    }
  }
}'
Is there any way to make such a query without adding an extra term?
[I know the aggregation to find all possible values of a term in a search result, but I wasn't able to adapt that aggregation to the above criteria.]
Can't find a way to get exactly each author with 3 documents.
An aggregation will give you all possible values. But it also shows you the doc_count, and there we can find our way:
{
  "size": 0,
  "aggregations": {
    "authors": {
      "terms": {
        "field": "author",
        "min_doc_count": 3,
        "size": 5
      }
    }
  }
}
min_doc_count will keep only buckets with at least 3 documents.
size will give you only the first 5 buckets (remember that, by default, buckets are sorted by doc_count descending).
Now you can adjust size to get exactly those authors with 3 documents.
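If you need exactly 3 rather than at least 3, one option (a sketch, assuming the bucket_selector pipeline aggregation available in modern Elasticsearch versions; the field name comes from the question) is to filter the buckets by their doc_count:

```python
# Keep only author buckets whose doc_count is exactly 3, using a
# bucket_selector pipeline aggregation nested under the terms aggregation.
exactly_three = {
    "size": 0,
    "aggregations": {
        "authors": {
            # min_doc_count prunes buckets below 3 cheaply up front.
            "terms": {"field": "author", "min_doc_count": 3, "size": 100},
            "aggs": {
                "only_3": {
                    "bucket_selector": {
                        "buckets_path": {"count": "_count"},
                        "script": "params.count == 3",
                    }
                }
            },
        }
    },
}
```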

Aggregation in elastic search

I need help with aggregation in Elasticsearch. Is it possible to aggregate the values of a particular field as an array or list? This is more of a grouping; for example, instead of getting the result as
{"Book_Id":"102","Review_Text":"DescentRead"},{"Book_Id":"102","Review_Text":"For Kids."},{"Book_Id":"103","Review_Text":"Great"},{"Book_Id":"103","Review_Text":"Excellent"}
can i get all the reviews of each book as a list ?
[ { "Book_Id": "102", "Review_Text": [ "DescentRead", "For Kids." ] }, { "Book_Id": "103", "Review_Text": [ "Great", "Excellent" ] } ]
I tried some trials with aggs but was not able to get it. Any pointers would help!
Could an aggregation with top_hits work? The limitation is that you need to specify a maximum number of hits per aggregation (the example below returns the top 100 results per book ID, ordered by the review text), but apart from that you can run it as a normal query and specify which fields to return, how they should be sorted (to get the top hits), etc.
Example aggs query:
POST
http://myserver:9200/books/book/_search
{
  "size": 0,
  "aggs": {
    "BookReviews": {
      "terms": {
        "field": "Book_Id.keyword"
      },
      "aggs": {
        "top_reviews": {
          "top_hits": {
            "sort": [ { "Review_Text.keyword": { "order": "desc" } } ],
            "size": 100,
            "_source": {
              "includes": [ "Review_Text" ]
            }
          }
        }
      }
    }
  }
}
Note that for the aggregation names ("BookReviews" and "top_reviews") you can use any name you choose, and that same name will appear in the resulting aggregation tree. You can do multi-level aggregations on terms in your index, and include top hits on any level, typically for drill-down reporting or similar cases.
Mapping used:
{
  "books": {
    "mappings": {
      "book": {
        "properties": {
          "Book_Id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Review_Text": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
"size": 0 in the root node will omit any hits for the search and only return the aggs trees.
You can also add a normal "query": {} block on the same level as size and aggs if you need to filter the results before elastic starts aggregating.
Read more in the elasticsearch documentation pages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
(If you provide a more complete example dataset, we can give a more realistic example query, as there isn't a lot of data in the example for sorting or scoring the results)
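To turn the aggregation response back into the list-per-book shape the question asks for, a small client-side step is needed; a sketch, assuming the bucket structure produced by the aggs query above (the sample response here is illustrative, hand-built data):

```python
# Flatten a terms + top_hits aggregation response into {Book_Id: [review, ...]}.
def reviews_per_book(response):
    out = {}
    for bucket in response["aggregations"]["BookReviews"]["buckets"]:
        hits = bucket["top_reviews"]["hits"]["hits"]
        out[bucket["key"]] = [h["_source"]["Review_Text"] for h in hits]
    return out

# Hand-built sample mimicking the response shape:
sample = {
    "aggregations": {
        "BookReviews": {
            "buckets": [
                {
                    "key": "102",
                    "top_reviews": {
                        "hits": {
                            "hits": [
                                {"_source": {"Review_Text": "DescentRead"}},
                                {"_source": {"Review_Text": "For Kids."}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

print(reviews_per_book(sample))  # {'102': ['DescentRead', 'For Kids.']}
```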

case insensitive elasticsearch with uppercase or lowercase

I am working with Elasticsearch and I am facing a problem. If anybody gives me a hint, I will be really thankful.
I want to analyze a field "name" or "description" which consists of different entries, e.g. someone wants to search for Sara. Whether they enter SARA, SAra or sara, they should be able to get Sara.
Elasticsearch uses an analyzer which makes everything lowercase.
I want to implement it case-insensitively: regardless of whether the user inputs an uppercase or lowercase name, they should get results.
I am using an ngram filter to search names and a lowercase filter which makes it case-insensitive. But I want to make sure that a person gets results even if they enter uppercase or lowercase.
Is there any way to do this in Elasticsearch?
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 80
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "ngram_filter", "lowercase" ]
        },
I attach the example.js file which include json example and search.txt file to explain my problem . I hope my problem will be more clear now.
this is the link to onedrive where I kept both files.
https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc
Is there any specific reason you are using ngram? Elasticsearch uses the same analyzer on the "query" as well as the text you index, unless search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.
I created an index with the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "typehere": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
Indexed two documents
Doc 1
PUT /test_index/test_mapping/1
{
  "name": "Sara Connor",
  "Description": "My real name is Sarah Connor."
}
Doc 2
PUT /test_index/test_mapping/2
{
  "name": "John Connor",
  "Description": "I might save humanity someday."
}
Do a simple search:
POST /test_index/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
And I get back only the first document. I tried with "sara" and "Sara" also; same results.
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
The analysis process is executed twice for full-text (analyzed) fields: first when data is stored, and a second time when you search. It's worth saying that the input JSON will be returned in the same shape as an output from a search query; the analysis process is only used to create tokens for an inverted index. The key to your solution is the following steps:
1. Create two analyzers: one with the ngram filter, and a second analyzer without the ngram filter, because you don't need to analyze the input search query using ngram; you have an exact value that you want to search.
2. Define the mappings correctly for your fields. There are two fields in the mapping that allow you to specify analyzers: one is used for storage (analyzer) and the second is used for searching (search_analyzer). If you specify only the analyzer field, then the specified analyzer is used at both index and search time.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
And your code should look like that:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "index_store_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "index_store_ngram",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
POST /my_index/my_type/1
{
  "name": "Sara_11_01"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SaRa"
    }
  }
}
Edit 1: updated code for a new example provided in the question
This answer is in the context of Elasticsearch 7.14. So, let me re-phrase the ask of this question another way:
Irrespective of the actual case provided in the match query, you would like to be able to get those documents that have been analyzed with:
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
Now, coming to the answer part:
It will not be possible to get the match query to return the docs that have been analyzed with the lowercase filter when the match query contains uppercase letters. The analysis that you have applied in the settings is applied both while indexing and while searching data. Although it is also possible to apply different analyzers for indexing and searching, I do not see that helping your case. You would have to convert the match query value to lowercase before making the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra etc. The match param should be all lowercase, just as it is in your analyzer.
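That client-side normalisation is a one-liner; a sketch using the field name from the question:

```python
# Normalise user input to lowercase on the client before querying, so the
# query term matches tokens produced by the index-time lowercase filter.
def name_query(term):
    return {"query": {"match": {"name": term.lower()}}}

# Any casing the user types produces the same query body:
for raw in ("Sara", "SARA", "SaRa"):
    assert name_query(raw)["query"]["match"]["name"] == "sara"
```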