Fuzzy string matching using the Levenshtein algorithm in Elasticsearch

I have just started exploring Elasticsearch. I created a document as follows:
curl -XPUT "http://localhost:9200/cities/city/1" -d'
{
"name": "Saint Louis"
}'
I then tried to do a fuzzy search on the name field with a Levenshtein distance of 5, as follows:
curl -XGET "http://localhost:9200/_search" -d'
{
"query": {
"fuzzy": {
"name" : {
"value" : "St. Louis",
"fuzziness" : 5
}
}
}
}'
But it's not returning any match. I expected the Saint Louis record to be returned. How can I fix my query?
Thanks.

The problem with your query is that Elasticsearch allows a maximum edit distance of only 2 in fuzzy queries, and "St. Louis" is further than that from "Saint Louis".
In the case above, what you probably want is a synonym that maps St. to Saint, which would make the query match. Of course, this depends on your data, as St could also mean "street".
If you just want to test fuzzy searching, you could try this example:
curl -XGET "http://localhost:9200/_search" -d'
{
"query": {
"fuzzy": {
"name" : {
"value" : "Louiee",
"fuzziness" : 2
}
}
}
}'
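To check how far apart two strings are before choosing a fuzziness value, you can compute the edit distance yourself. The following is a minimal Python sketch of the classic dynamic-programming Levenshtein distance. The function name levenshtein is only illustrative, and this is not Elasticsearch's internal implementation, which compares the query value against individual indexed terms rather than whole field values.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] = distance between the already-processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("St. Louis", "Saint Louis"))  # 4, beyond the limit of 2
print(levenshtein("louiee", "louis"))           # 2, within the limit
```

Note that the comparison is case-sensitive here, so the second call uses lowercase input on both sides.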

Related

multi_match query returning no results in Elasticsearch

I am trying a multi_match query in ElasticSearch but the query is returning no results. The query is:
curl -XPOST "http://localhost:9200/smartjn/feed_details/_search" -d'
{
"query" : {
"multi_match" : {
"query" : "Dho*",
"fields" : [ "title", "wardname" ]
}
}
}'
{"took":11,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
I have a document whose wardname field starts with Dho:
{
_id: ObjectId("56f43c0344fc86e73b1170b0"),
title: "Constant road work",
approvalStatus: "approved",
subward: "56a6124244fc868a255fe3fe",
wardname: "Dhokali"
}
I'm not sure why it is not returning anything. Any help is greatly appreciated.
Thanks
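You need to use a phrase prefix query if you want to find values that start with a given string. multi_match does not treat * as a wildcard: with the default standard analyzer the asterisk is simply discarded, and the remaining term dho has to equal a whole indexed token, which dhokali is not. The asterisk is therefore not needed. Try the following query:
curl -XPOST "http://localhost:9200/smartjn/feed_details/_search" -d'
{
"query" : {
"multi_match" : {
"query" : "Dho",
"fields" : [ "title", "wardname" ],
"type": "phrase_prefix"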
}
}
}'
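As a rough illustration of the difference, the Python sketch below models a plain match as an exact comparison against analyzed tokens, and phrase_prefix as a phrase whose last term only needs to be a prefix. The helper names (analyze, match_any, phrase_prefix) are hypothetical, and analyze is a simplified ASCII-only stand-in for the standard analyzer, not Elasticsearch's actual code.

```python
import re


def analyze(text):
    """Simplified stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def match_any(query, field_value):
    """Plain match: at least one query token must equal a whole indexed token."""
    indexed = set(analyze(field_value))
    return any(tok in indexed for tok in analyze(query))


def phrase_prefix(query, field_value):
    """Phrase prefix: leading terms match as a phrase, the last term matches as a prefix."""
    terms = analyze(query)
    indexed = analyze(field_value)
    if not terms:
        return False
    n = len(terms)
    for start in range(len(indexed) - n + 1):
        window = indexed[start:start + n]
        if window[:-1] == terms[:-1] and window[-1].startswith(terms[-1]):
            return True
    return False


print(match_any("Dho*", "Dhokali"))      # False: "dho" is not a whole token
print(phrase_prefix("Dho*", "Dhokali"))  # True: "dhokali" starts with "dho"
```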

Elastic Search Percolate Boolean Queries

I am trying to match boolean queries stored in Elasticsearch using the Percolate API.
The index mapping is given below:
curl -XPUT 'localhost:9200/my-index' -d '{
"mappings": {
"taggers": {
"properties": {
"content": {
"type": "string"
}
}
}
}
}'
I am inserting records like this (the queries use boolean operators such as AND, OR and NOT, as in the example below):
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
"query" : {
"match" : {
"content" : "Audi AND BMW"
}
}
}'
And then I am posting a document to get matched queries.
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
"doc" : {
"content" : "I like audi very much"
}
}'
In the above case no queries should be returned, because the boolean query is "Audi AND BMW", but one is still returned. It seems the AND condition is being ignored. I am not able to figure out why this is not working for boolean queries.
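You need to register this query instead. A match query does not understand the AND operator: it treats it as the ordinary token and, and because match combines its tokens with OR by default, a document containing only audi still matches. query_string, on the other hand, does parse AND as an operator: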
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
"query" : {
"query_string" : {
"query" : "Audi AND BMW",
"default_field": "content"
}
}
}'
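The difference can be sketched in a few lines of Python. This is only a simplified model under stated assumptions: analyze is a rough ASCII-only stand-in for the standard analyzer, match_or models a match query with its default OR operator, and query_string_and handles nothing but flat "A AND B" chains, whereas the real query_string syntax is far richer. All three names are hypothetical.

```python
import re


def analyze(text):
    """Simplified stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def match_or(query, doc):
    """match with the default OR operator: any query token found in the doc is enough."""
    doc_tokens = set(analyze(doc))
    return any(tok in doc_tokens for tok in analyze(query))


def query_string_and(query, doc):
    """Flat 'A AND B AND ...' only: every operand must be present in the doc."""
    doc_tokens = set(analyze(doc))
    operands = [part.strip() for part in query.split(" AND ")]
    return all(
        all(tok in doc_tokens for tok in analyze(operand))
        for operand in operands
    )


doc = "I like audi very much"
print(match_or("Audi AND BMW", doc))          # True: "audi" alone is enough
print(query_string_and("Audi AND BMW", doc))  # False: "bmw" is missing
```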

Percolate not returning results as expected

We're trying to set up and use percolate, but we aren't quite getting results as expected.
First, I register a few queries:
curl -XPUT 'localhost:9200/index-234234/.percolator/query1' -d '{
"query" : {
"range" : {
"price" : { "gte": 100 }
}
}
}'
curl -XPUT 'localhost:9200/index-234234/.percolator/query2' -d '{
"query" : {
"range" : {
"price" : { "gte": 200 }
}
}
}'
Then, when I percolate a document with a price of 150, which should match only query1, it matches both queries:
curl -XGET 'localhost:9200/index-234234/message/_percolate' -d '{
"doc" : {
"price" : 150
}
}'
{"took":4,"_shards":{"total":5,"successful":5,"failed":0},"total":2,"matches":[{"_index":"index-234234","_id":"query1"},{"_index":"index-234234","_id":"query2"}]}
Any pointers as to why this is happening would be much appreciated.
The problem is that you are registering your percolator queries before setting up the mapping for the document. The percolator then has to register each query without a defined mapping for the price field, and this can be an issue, particularly for range queries.
You should start over by deleting the index and then running this mapping command first:
curl -XPOST localhost:9200/index-234234 -d '{
"mappings" : {
"message" : {
"properties" : {
"price" : {
"type" : "long"
}
}
}
}
}'
Then run your previous commands again (register the two percolator queries, then percolate one document). You will get the following correct response:
{"took":3,"_shards":{"total":5,"successful":5,"failed":0},"total":1,"matches":[{"_index":"index-234234","_id":"query1"}]}
You may find this discussion from a couple of years ago helpful:
http://grokbase.com/t/gg/elasticsearch/124x6hq4ev/range-query-in-percolate-not-working
Not a solution, but this works for me (I don't know why):
1. Register both percolator queries.
2. Run the _percolate request (returns your result: "total": 2).
3. Register both percolator queries again (both are now at version 2).
4. Run the _percolate request again (returns the correct result: "total": 1).

Is it possible to rank span_near queries with unique results higher than duplicate results?

Assume I have two documents that have a "catField" containing the following information:
Document one:
happy cat
sad cat
meh cat
Document two:
happy cat
happy cat
happy cat
I am attempting to write a query that fulfils two requirements:
Find any word with a length of at least three followed by the word "cat".
The query should also rank documents with more unique types of cats (document one) higher than those that have the same types of cats (document two).
Here is my initial solution that uses span_near with regexp that fulfils the first requirement:
"span_near": {
"clauses": [
{
"span_multi": {
"match": {
"regexp": {
"catField": "[a-z]{3,}"
}
}
}
},
{
"span_multi": {
"match": {
"regexp": {
"catField": "cat"
}
}
}
}
],
"slop": 0,
"in_order": true
}
This works great for finding documents with lists of cats, but it ranks Document one and Document two (above) the same. How can I fulfil the second requirement of ranking unique cat lists higher than non-unique ones?
So here is an approach using some indexing magic to get what you want. I'm not entirely certain of your requirements (since you are probably working with data more complicated than just "happy cat"), but it should get you started in the index-time direction.
This may or may not be the right approach for your setup. Depending on index size and query load, phrase queries/span queries/bool combinations may work better. Your requirements are tricky though, since they depend on order, size of preceding token, and number of variations.
The advantage of this is that much of your complex logic is baked into the index, gaining speed at query time. It does make your data a bit more rigid however.
curl -XDELETE localhost:9200/cats
curl -XPUT localhost:9200/cats -d '
{
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 0,
"index" : {
"analysis" : {
"analyzer" : {
"catalyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : ["cat_pattern", "unique", "cat_replace"]
}
},
"filter" : {
"cat_pattern" : {
"type" : "pattern_capture",
"preserve_original" : false,
"patterns" : [
"([a-z]{3,} cat)"
]
},
"cat_replace" : {
"type" : "pattern_replace",
"preserve_original" : false,
"pattern" : "([a-z]{3,} cat)",
"replacement" : "cat"
}
}
}
}
},
"mappings" : {
"cats" : {
"properties" : {
"catField" : {
"type" : "multi_field",
"fields": {
"catField" : {
"type": "string",
"analyzer": "standard"
},
"catalyzed" : {
"type": "string",
"index_analyzer": "catalyzer",
"search_analyzer" : "whitespace"
}
}
}
}
}
}
}'
First we create an index with some custom analysis. The analyzer uses the keyword tokenizer (which doesn't actually tokenize, it just emits the whole input as a single token). Then a pattern_capture filter finds every "cat" that is preceded by a word of at least three characters. A unique filter then gets rid of duplicates (e.g. "happy cat" three times in a row). Finally, a pattern_replace filter changes "happy cat" into just "cat".
The final tokens for a field will just be "cat", but there will be more occurrences of "cat" if there are multiple types of cats.
At search time, we can simply search for "cat" and the docs that mention "cat" more often are boosted higher. More mentions means more unique types due to our analysis, so we get the boosting behavior "for free".
I used a multi-field, so you can still query the original field (e.g. if you want to search for "happy cat").
Demonstration using the above mappings:
curl -XPOST localhost:9200/cats/cats/1 -d '
{
"catField" : ["sad cat", "happy cat", "meh cat"]
}'
curl -XPOST localhost:9200/cats/cats/2 -d '
{
"catField" : ["happy cat", "happy cat", "happy cat"]
}'
curl -XPOST localhost:9200/cats/cats/3 -d '
{
"catField" : ["a cat", "x cat", "y cat"]
}'
curl -XPOST localhost:9200/cats/cats/_search -d '
{
"query" : {
"match": {
"catField.catalyzed": "cat"
}
}
}'
Notice that the third document isn't returned by the search, since it doesn't have a cat that is preceded by a type of at least three characters.
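To make the effect of the analysis chain easier to see, here is a small Python sketch that models the behaviour described above: capture the matching phrases, drop duplicates, then replace each phrase with "cat". It is a simplified model, not a reproduction of the Lucene filters, and the names CAT_PATTERN and cat_tokens are made up for illustration. Values that do not match the pattern simply contribute no "cat" token in this model.

```python
import re

# Same idea as the cat_pattern filter: a word of at least three letters, then " cat".
CAT_PATTERN = re.compile(r"[a-z]{3,} cat")


def cat_tokens(values):
    """Return the 'cat' tokens this model would index for a list of field values."""
    captured = []
    for value in values:
        # Each value is treated as a single token (keyword tokenizer),
        # from which the matching phrases are captured.
        captured.extend(CAT_PATTERN.findall(value))
    # Model of the unique filter: drop repeated phrases, keeping first-seen order.
    unique = list(dict.fromkeys(captured))
    # Model of the pattern_replace filter: every remaining phrase becomes "cat".
    return [CAT_PATTERN.sub("cat", phrase) for phrase in unique]


print(len(cat_tokens(["sad cat", "happy cat", "meh cat"])))      # 3
print(len(cat_tokens(["happy cat", "happy cat", "happy cat"])))  # 1
print(len(cat_tokens(["a cat", "x cat", "y cat"])))              # 0
```

The more "cat" tokens a document ends up with, the more often the search term occurs in it, which is what pushes documents with more distinct cat types up the ranking.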

ElasticSearch has_parent query

I am experimenting with Elasticsearch parent/child with some simple examples from fun-with-elasticsearch-s-children-and-nested-documents/. I am able to query child elements by running the query in the blog
curl -XPOST localhost:9200/authors/bare_author/_search -d '{
However, I could not adapt the example to a has_parent query. Can someone please point out what I am doing wrong? I keep getting 0 results.
This is what I tried:
#Returns 0 hits
curl -XPOST localhost:9200/authors/book/_search -d '{
"query": {
"has_parent": {
"type": "bare_author",
"query" : {
"filtered": {
"query": { "match_all": {}},
"filter" : {"term": { "name": "Alastair Reynolds"}}
}
}
}
}
}'
#did not work either
curl -XPOST localhost:9200/authors/book/_search -d '{
"query": {
"has_parent" : {
"type" : "bare_author",
"query" : {
"term" : {
"name" : "Alastair Reynolds"
}
}
}
}
}'
This works with match, but it only matches on the first name:
#works but matches just first name
curl -XPOST localhost:9200/authors/book/_search -d '{
"query": {
"has_parent" : {
"type" : "bare_author",
"query" : {
"match" : {"name": "Alastair"}
}
}
}
}'
I suppose you are using the default mappings, so the name field is analyzed with the standard analyzer. The term query and term filter, on the other hand, do not analyze their input, so you are searching for the single token Alastair Reynolds while the index contains alastair and reynolds as two separate, lowercased tokens.
The match query returns results because its input is analyzed: it is lowercased and tokenized before matching. You can simply change your term query into a match query. It will find matches with multiple terms too, because the text is tokenized and a boolean query is generated from the resulting terms.
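The following Python sketch illustrates the point under simplified assumptions. analyze is a rough ASCII-only stand-in for the standard analyzer, and term_query and match_query are hypothetical helpers that model only the "analyzed or not" difference, not Elasticsearch's real query execution or scoring.

```python
import re


def analyze(text):
    """Simplified stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def term_query(value, indexed_tokens):
    """term: the value is looked up verbatim, with no analysis applied."""
    return value in indexed_tokens


def match_query(text, indexed_tokens):
    """match: the text is analyzed first, then the resulting terms are ORed together."""
    return any(tok in indexed_tokens for tok in analyze(text))


# What ends up in the index for the name field in this model.
indexed = set(analyze("Alastair Reynolds"))

print(sorted(indexed))                            # ['alastair', 'reynolds']
print(term_query("Alastair Reynolds", indexed))   # False: no such single token
print(term_query("alastair", indexed))            # True: exact lowercased token
print(match_query("Alastair Reynolds", indexed))  # True: analyzed before lookup
```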
