Fuzzy not functioning as expected (one term search, see example) - elasticsearch

Consider the following results from:
curl -XGET 'http://localhost:9200/megacorp/employee/_search' -d '{
"query": {
"match": {
"last_name": "Smith"
}
}
}'
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 0.30685282,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing on the weekends.",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.30685282,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
}
]
}
}
Now when I execute the following query:
curl -XGET 'http://localhost:9200/megacorp/employee/_search' -d '{
"query": {
"fuzzy": {
"last_name": {
"value": "Smitt",
"fuzziness": 1
}
}
}
}'
This returns NO results despite the Levenshtein distance of "Smith" and "Smitt" being 1. The same thing happens with a value of "Smit". If I use a fuzziness value of 2, I get results. What am I missing here?

I assume that the last_name field you are querying is an analyzed string. The indexed term will therefore be smith, not Smith.
Returns NO results despite the Levenshtein distance of "Smith" and
"Smitt" being 1.
The fuzzy query does not analyze the search term, so the Levenshtein distance here is actually 2, not 1:
Smitt -> Smith
Smith -> smith
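The two edits can be checked with a plain Levenshtein implementation. This is a minimal Python sketch for illustration only: the function name levenshtein is my own, it is not how Elasticsearch computes fuzziness internally, and it assumes the field's analyzer lowercases terms so that the indexed term is smith.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The fuzzy query compares the raw value with the indexed (lowercased) term.
print(levenshtein("Smitt", "Smith"))  # what the question assumed
print(levenshtein("Smitt", "smith"))  # what is actually compared
print(levenshtein("Smit", "smith"))
print(levenshtein("smitt", "smith"))  # lowercased query value
```

Under the same assumption, sending the value already lowercased (smitt) should also bring the distance back to 1 without touching the mapping.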
Try using this mapping, and your query with fuzziness = 1 will work:
PUT /megacorp/employee/_mapping
{
"employee":{
"properties":{
"last_name":{
"type":"string",
"index":"not_analyzed"
}
}
}
}
Hope this helps

Related

Understand Elasticsearch Multivalue Fields

I am trying to understand the position_increment_gap as it is explained on the Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/guide/current/_multivalue_fields_2.html
I created the same index as in the example and inserted a single document
PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith", "Justin Trudeau"]
}
Then I try a phrase query for Abraham Lincoln and it matches, as expected
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "names",
"_type": "doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"names": [
"john abraham",
"lincoln smith",
"justin trudeau"
]
}
}
]
}
}
The documentation explains that the match occurs because ES produces the tokens john abraham lincoln smith justin trudeau and it recommends inserting a position_increment_gap of 100 to avoid matching abraham lincoln unless I have a slop of 100.
I changed the index to have a position_increment_gap of 1 as shown below:
PUT names
{
"mappings": {
"doc": {
"properties": {
"names": {
"type":"text",
"position_increment_gap": 1
}
}
}
}
}
If I'm understanding the documentation, using a gap of 1 should allow me to match "abraham smith". But it doesn't match. Nor does "abraham lincoln", "abraham justin", or "abraham trudeau". "lincoln smith", "john abraham" and "justin trudeau" all continue to match.
I must be misunderstanding the documentation.
Thanks for any suggestions.
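One way to reason about this is to work out the token positions by hand. The Python sketch below is a simplified model, not Elasticsearch code: the helper names are hypothetical, tokenization is a plain lowercase whitespace split, and it assumes the gap is added on top of the normal increment of 1 between the last token of one value and the first token of the next. Under that assumption a larger gap pushes values further apart, so a gap of 1 would make "abraham lincoln" need a slop of 1 rather than make "abraham smith" match.

```python
def token_positions(values, gap):
    """Assign a position to every token across all values of a field.

    Assumption: before the first token of each later value, the position
    is advanced by `gap` in addition to the usual increment of 1.
    """
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            pos += 1
            positions[token] = pos
    return positions


def min_slop(positions, first, second):
    """Smallest slop for an in-order two-term phrase (simplified model)."""
    return positions[second] - positions[first] - 1


names = ["John Abraham", "Lincoln Smith", "Justin Trudeau"]
for gap in (0, 1, 100):
    p = token_positions(names, gap)
    print(gap,
          min_slop(p, "abraham", "lincoln"),
          min_slop(p, "abraham", "smith"))
```

In this model only a gap of 0 lets "abraham lincoln" match with no slop, and phrases inside a single value ("john abraham", "lincoln smith") always need a slop of 0, which is consistent with the behaviour described in the question.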

Query string with boost fields in Elastic Search

I am using Query String with Boost Fields in Elasticsearch 1.7. It works fine, but in some scenarios I am not getting the expected result.
Query:
{
"from": 0,
"size": 10,
"explain": true,
"query": {
"function_score": {
"query": {
"query_string": {
"query": "account and data",
"fields": [
"title^5",
"authors^4",
"year^5",
"topic^6"
],
"default_operator": "and",
"analyze_wildcard": true
}
},
"score_mode": "sum",
"boost_mode": "sum",
"max_boost": 100
}
}
}
Sample Data :
{
"took": 50,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 12.833213,
"hits": [
{
"_id": "19850",
"_score": 12.833213,
"_source": {
"ID": "19850",
"Year": "2010",
"Title": "account and data :..."
}
},
{
"_id": "16896",
"_score": 11.867042,
"_source": {
"ID": "16896",
"Year": "2014",
"Title": "effectivness of data..."
}
},
{
"_id": "59862",
"_score": 9.706333,
"_source": {
"ID": "59862",
"Year": "2007",
"Title": "best system to..."
}
},
{
"_id": "18501",
"_score": 9.685843,
"_source": {
"ID": "18501",
"Year": "2010",
"Title": "management of..."
}
}
]
}
}
I get the sample data above with this query, and that is as expected. But if I increase the weight of year to 100, I expect the 4th result to move to 3rd position and the 3rd result to 4th. I tried many things, but I don't know what I am doing wrong.
A boost only applies when the query matches the field you are boosting; it multiplies the score Elasticsearch computes for that field by the boost you defined. Your query looks for "account and data", which doesn't match any year, so the boost on year is never used.
Are you trying to take the year into account for ordering? If so, you can try adding field_value_factor to your query like this:
"query" : {
"function_score": {
"query": { <your query goes here> },
"field_value_factor": {
"field": "year"
}
}
}
This multiplies the score Elasticsearch computes by the year, so the year is taken into account without strictly ordering by it. You can read more about it here: https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
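To get a feel for how the year combines with the query score, here is a small Python sketch of the arithmetic. It is only an approximation under stated assumptions: the function names are hypothetical, year is a numeric field, the factor is 1 with no modifier, the query scores stay the same as in the sample data, and the max_boost of 100 from the original query (which would limit the function's contribution) is ignored.

```python
def combine(query_score, year, boost_mode="multiply"):
    """Combine a query score with a field_value_factor on year (simplified)."""
    if boost_mode == "multiply":
        return query_score * year
    if boost_mode == "sum":
        return query_score + year
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


def rank(hits, boost_mode):
    """Return document ids ordered by the combined score, highest first."""
    ordered = sorted(hits,
                     key=lambda h: combine(h[1], h[2], boost_mode),
                     reverse=True)
    return [doc_id for doc_id, _, _ in ordered]


# (id, query score, year) for the 3rd and 4th hits in the sample data.
hits = [("59862", 9.706333, 2007), ("18501", 9.685843, 2010)]

for mode in ("multiply", "sum"):
    print(mode, rank(hits, mode))
```

With these particular numbers, multiplying keeps the original order because a three-year difference is tiny relative to 2000, while adding lets the year outweigh the small gap between the query scores. Which boost_mode fits depends on how strongly the year should influence the ranking.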
You can always use the explain API to see how Elasticsearch computed the score and therefore why the results came back in that order: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

manipulate returned fields in elasticsearch

Is there a way to manipulate (for example concatenate) returned fields from a query?
This is how I created my index:
PUT /megacorp/employee/1
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
And this is how I query it:
GET /megacorp/employee/_search
{
"query": {"match_all": {}}
}
The response is this:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
}
]
}
}
That's all working fine.
What I want is to concatenate two fields from the _source and display it in the output as a new field.
first_name and last_name should be combined into a new field, "full_name". I can't figure out how to do that without creating a new field in my index. I have looked at "copy_to", but it requires you to explicitly set the store property in the mapping and to explicitly ask for the stored field in the query. The main downside is that even when you do both, first_name and last_name are returned comma-separated. I would like a single string: "John Smith".
GET /megacorp/employee/_search
{
"query": {"match_all": {}},
"script_fields": {
"combined": {
"script": "_source['first_name'] + ' ' + _source['last_name']"
}
}
}
And you need to enable dynamic scripting.
You can use script_fields to achieve that:
GET /megacorp/employee/_search
{
"query": {"match_all": {}},
"script_fields" : {
"full_name" : {
"script" : "[doc['first_name'].value, doc['last_name'].value].join(' ')"
}
}
}
You need to make sure to enable dynamic scripting in order for this to work.
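If enabling dynamic scripting is not an option, the same concatenation can be done on the client after the search returns. The following Python sketch assumes the response has already been parsed into a dict shaped like the one in the question; the function name full_names is hypothetical.

```python
def full_names(response):
    """Return (id, full_name) pairs built from each hit's _source."""
    result = []
    for hit in response["hits"]["hits"]:
        src = hit["_source"]
        result.append((hit["_id"], f"{src['first_name']} {src['last_name']}"))
    return result


# A trimmed-down copy of the response shown in the question.
response = {
    "hits": {
        "hits": [
            {"_id": "1",
             "_source": {"first_name": "John", "last_name": "Smith"}}
        ]
    }
}
print(full_names(response))
```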

Get specific fields from index in elasticsearch

I have an index in Elasticsearch.
Sample structure :
{
"Article": "Article7645674712",
"Genre": "Genre92231455",
"relationDesc": [
"Article",
"Genre"
],
"org": "user",
"dateCreated": {
"date": "08/05/2015",
"time": "16:22 IST"
},
"dateModified": "08/05/2015"
}
From this index I want to retrieve only selected fields: org and dateModified.
I want a result like this:
{
"took": 265,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 1,
"hits": [
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "3",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "08/05/2015"
}
}
},
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "4",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "10/05/2015"
}
}
}
]
}
}
How do I query Elasticsearch to get only specific fields?
You can retrieve only a specific set of fields in the result hits using the _source parameter like this:
curl -XGET 'localhost:9200/couchrecords/couchbaseDocument/_search?_source=org,dateModified'
Or in this format, adding a _source entry to whatever query you already have (match_all is just a placeholder):
curl -XPOST localhost:9200/couchrecords/couchbaseDocument/_search -d '{
"_source": ["doc.org", "doc.dateModified"],
"query": {
"match_all": {}
}
}'
That's easy. Consider any query of this format:
{
"query": {
...
}
}
You just need to add the fields parameter to your query, which in your case results in the following:
{
"query": {
...
},
"fields" : ["org","dateModified"]
}
Alternatively, use source filtering:
{
"_source" : ["org","dateModified"],
"query": {
...
}
}
See the Elasticsearch documentation on source filtering.
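To illustrate what a list of _source paths selects, here is a rough Python model of the filtering. It is an illustration only, not the Elasticsearch implementation: the function name filter_source is hypothetical, wildcards are not supported, and the sample document assumes the fields sit under a doc object, as in the expected result above.

```python
def filter_source(source, includes):
    """Keep only the dotted paths listed in includes (no wildcard support)."""
    result = {}
    for path in includes:
        keys = path.split(".")
        node = source
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                break  # path does not exist in this document
            node = node[key]
        else:
            # Path resolved: rebuild the same nesting in the result.
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = node
    return result


document = {
    "doc": {
        "Article": "Article7645674712",
        "org": "user",
        "dateModified": "08/05/2015",
    }
}
print(filter_source(document, ["doc.org", "doc.dateModified"]))
print(filter_source(document, ["org", "dateModified"]))
```

In this model the unprefixed paths select nothing when the fields live under doc, which is why the two forms shown in the answers (org versus doc.org) are not interchangeable; which one applies depends on how the documents are actually stored.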

Should I include spaces in fuzzy query fields?

I have this data:
name:
first: 'John'
last: 'Smith'
When I store it in ES, AFAICT it's better to make it one field. However, should this one field be:
name: 'John Smith'
or
name: 'JohnSmith'
?
I'm thinking that the query should be:
query:
match:
name:
query: searchTerm
fuzziness: 'AUTO'
operator: 'and'
Example search terms are what people might type in a search box, like
John
Jhon Smi
J Smith
Smith
etc.
You will probably want a combination of ngrams and a fuzzy match query. I wrote a blog post about ngrams for Qbox if you need a primer: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch. I'll swipe the starter code at the end of the post to illustrate what I mean here.
Also, I don't think it matters much whether you use two fields for name, or just one. If you have some other reason you want two fields, you may want to use the _all field in your query. For simplicity I'll just use a single field here.
Here is a mapping that will get you the partial-word matching you want, assuming you only care about tokens that start at the beginning of words (otherwise use ngrams instead of edge ngrams). There are lots of nuances to using ngrams, so I'll refer you to the documentation and my primer if you want more info.
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"edge_ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
One thing to note here, in particular: "min_gram": 1. This means that single-character tokens will be generated from indexed values. This will cast a pretty wide net when you query (lots of words begin with "j", for example), so you may get some unexpected results, especially when combined with fuzziness. But this is needed to get your "J Smith" query to work right. So there are some trade-offs to consider.
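To make the effect of min_gram concrete, here is a small Python sketch of what an edge ngram filter produces. It is an approximation, not the analyzer itself: the function name edge_ngrams is hypothetical, and a lowercase whitespace split stands in for the standard tokenizer plus lowercase filter.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Lowercase, split on whitespace, and emit the prefixes of each token."""
    grams = []
    for token in text.lower().split():
        longest = min(len(token), max_gram)
        for n in range(min_gram, longest + 1):
            grams.append(token[:n])
    return grams


print(edge_ngrams("John Smith"))
print(edge_ngrams("John Smith", min_gram=2))
print(edge_ngrams("Bob Jones"))
```

With min_gram set to 1 the single letters j and s are indexed, which is what lets a query like "J Smith" match. The grams for "Bob Jones" include jon, which is one edit away from john, so short grams combined with fuzziness can pull in names you did not intend.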
For illustration, I indexed four documents:
PUT /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"John Hancock"}
{"index":{"_id":2}}
{"name":"John Smith"}
{"index":{"_id":3}}
{"name":"Bob Smith"}
{"index":{"_id":4}}
{"name":"Bob Jones"}
Your query mostly works, with a couple of caveats.
POST /test_index/_search
{
"query": {
"match": {
"name": {
"query": "John",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
}
This query returns three documents, because of ngrams plus fuzziness:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.90169895,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.90169895,
"_source": {
"name": "John Hancock"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.90169895,
"_source": {
"name": "John Smith"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "4",
"_score": 0.6235822,
"_source": {
"name": "Bob Jones"
}
}
]
}
}
That may not be what you want. Also, "AUTO" doesn't work with the "Jhon Smi" query, because "Jhon" is an edit distance of 2 from "John", and "AUTO" uses an edit distance of 1 for strings of 3-5 characters (see the docs for more info). So I have to use this query instead:
POST /test_index/_search
{
"query": {
"match": {
"name": {
"query": "Jhon Smi",
"fuzziness": 2,
"operator": "and"
}
}
}
}
...
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4219328,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1.4219328,
"_source": {
"name": "John Smith"
}
}
]
}
}
The other queries work as expected. So this solution isn't perfect, but it will get you close.
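The "AUTO" thresholds mentioned above can be written down as a tiny function. This Python sketch follows the documented length ranges (0 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two); the function name auto_fuzziness is hypothetical, and it assumes the length is measured on each term after analysis.

```python
def auto_fuzziness(term: str) -> int:
    """Number of edits fuzziness AUTO allows for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


for term in ("j", "smi", "jhon", "smith", "hancock"):
    print(term, auto_fuzziness(term))
```

Since jhon and smi each get only one edit under "AUTO", a query that needs two edits on a short term has to set fuzziness explicitly, as in the query above.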
Here's all the code I used:
http://sense.qbox.io/gist/ba5a6741090fd40c1bb20f5d36f3513b4b55ac77
