ElasticSearch: Attempting to get spelling suggestions on proper names

Before I begin, let me just say that I'm no ElasticSearch expert, but I am currently tasked with tweaking some analyzers to get spelling suggestions working better in a couple of different situations. I've seen examples of people doing spelling suggestions on proper names, so I know it must be possible, but I've been at this for a couple of days now and I must be missing something, because ElasticSearch doesn't seem to recognize the name I'm looking for. Can you please help me figure this out? Thanks in advance!
Here's the analyzer I'm using for both indexing and search:
"full_text": {
"filter": [
"lowercase",
"asciifolding",
],
"type": "custom",
"tokenizer": "keyword"
},
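For context, a custom analyzer like this is typically wired up under the index settings and assigned to a field in the mapping. Here is a minimal sketch, not the actual settings from this question; my_index, thing, and my_field are simply the names that appear elsewhere in this post:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "full_text": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "thing": {
      "properties": {
        "my_field": {
          "type": "string",
          "analyzer": "full_text"
        }
      }
    }
  }
}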
The following query should demonstrate that the field is being tokenized into one long keyword, which is what I want:
{
  "query": {
    "match": {
      "_all": "combine 5"
    }
  },
  "script_fields": {
    "terms": {
      "script": "doc[field].values",
      "params": {
        "field": "my_field"
      }
    }
  }
}
...and it outputs something like this, which shows how the field is being tokenized. Looks good:
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 75,
"max_score": 0.58574116,
"hits": [
{
"_index": "my_index",
"_type": "thing",
"_id": "1",
"_score": 0.58574116,
"fields": {
"terms": [
[
"combine 5"
]
]
}
}
}
}
...but when I run a suggest query, it doesn't suggest that value, even though the input is only off by a space.
{
  "query": {
    "match": {
      "_all": "combine 5"
    }
  },
  "suggest": {
    "suggest-0": {
      "term": {
        "field": "_all",
        "size": 5
      },
      "text": "combine5"
    }
  }
}
Which returns a bunch of documents and this suggestion:
"suggest": {
"suggest-0": [
{
"text": "combine5",
"offset": 0,
"length": 8,
"options": [
{
"text": "combined",
"score": 0.875,
"freq": 15
},
{
"text": "combine",
"score": 0.85714287,
"freq": 17
}
]
}
]
}
Is there a way to get the words in a specific field to be suggested when suggesting against _all? Note that if I change the suggester to work just on the field that contains the text, it does suggest the value; it's only when I use _all that it fails.
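For reference, the field-specific variant that does return the suggestion looks roughly like this (a sketch, assuming the field is my_field as above):
{
  "suggest": {
    "suggest-0": {
      "term": {
        "field": "my_field",
        "size": 5
      },
      "text": "combine5"
    }
  }
}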

I'm not sure this qualifies as exactly the answer I was looking for, but I ended up solving this by adding a field to the document containing the keyword value I was looking for ("combine5"). Now that value is registered as a term, so if I suggest on that field, or on _all, the word is suggested. It's also found in queries against _all.
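A rough sketch of that workaround, assuming a hypothetical extra field called name_joined that simply holds the squashed value (with a plain string mapping the standard analyzer keeps combine5 as a single term, and the field is included in _all by default):
PUT my_index/_mapping/thing
{
  "properties": {
    "name_joined": {
      "type": "string"
    }
  }
}
PUT my_index/thing/1
{
  "my_field": "combine 5",
  "name_joined": "combine5"
}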

Related

Sort keyword field array within ElasticSearch document by relevance

I've got an ElasticSearch index that looks something like this:
{
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "string" },
        "tags": {
          "type": "keyword"
        }
      }
    }
  }
}
And data that looks something like this:
{ "title": "Something about Dogs", "tags": ["articles", "dogs"] },
{ "title": "Something about Cats", "tags": ["articles", "cats"] },
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
If I search for dog, I get the first and third documents, as I'd expect. And I can weight the search documents the way I like (in reality, I'm using a function_score query to weight on a bunch of fields irrelevant to this question).
What I'd like to do is sort the tags field so that the most relevant tags are returned first, without affecting the sort order of the documents themselves. So I'm hoping for a result like this:
{ "title": "Something about Dog Food", "tags": ["dogs", "dogfood", "articles"] }
Instead of what I get now:
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
The documentation on sort and function_score doesn't cover my case. Any help appreciated. Thanks!
You cannot sort the _source (your array of tags) of the documents by how well each value matches; the _source is returned as-is. One way of doing this is by using nested fields and inner_hits, which allows you to sort the matching nested documents.
My suggestion is to transform your tags into a nested field (I chose keyword there just for simplicity, but you can also use text with the analyzer of your choice):
PUT test
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "value": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
And use this kind of query:
GET test/_search
{
  "_source": {
    "exclude": "tags"
  },
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "dogs"
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "should": [
                  {
                    "match_all": {}
                  },
                  {
                    "match": {
                      "tags.value": "dogs"
                    }
                  }
                ]
              }
            },
            "inner_hits": {
              "sort": {
                "_score": "desc"
              }
            }
          }
        }
      ]
    }
  }
}
Here you try to match the nested tags.value field against the same text you try to match on title. Then, using inner_hits sorting, you can sort the nested values based on their inner score.
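Note that with this mapping each tag has to be indexed as an object with a value property. A sketch of how the dog-food document from the question might be re-indexed (the document ID is arbitrary):
PUT test/article/3
{
  "title": "Something about Dog Food",
  "tags": [
    { "value": "articles" },
    { "value": "dogs" },
    { "value": "dogfood" }
  ]
}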
#Val's suggestion is very good, but only as long as you are OK with determining your "relevant tags" by simple substring matching (i1.indexOf(params.search)). His solution's biggest advantage is that you don't have to change the mapping.
My solution's big advantage is that you are actually using Elasticsearch's real search capabilities to determine the "relevant" tags. The drawback is that you need a nested field instead of a plain keyword.
What you get from a search call are the source documents. The documents in the response are returned in exactly the same form as when you indexed them, which means that if you indexed ["articles", "dogs", "dogfood"], you'll always get that array in that unaltered form.
One way to get around this is to declare a script_field that applies a small script to sort your array and return the result of that sort.
What the script does is simply move the terms that contain the search term to the front of the list:
{
  "_source": ["title"],
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "sorted_tags": {
      "script": {
        "lang": "painless",
        "source": "return params._source.tags.stream().sorted((i1, i2) -> i1.indexOf(params.search) > -1 ? -1 : 1).collect(Collectors.toList())",
        "params": {
          "search": "dog"
        }
      }
    }
  }
}
This will return something like the following; as you can see, the sorted_tags array contains the terms in the order you expect.
{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "tests",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "Something about Dog Food"
        },
        "fields": {
          "sorted_tags": [
            "dogfood",
            "dogs",
            "articles"
          ]
        }
      }
    ]
  }
}

Elasticsearch - Show index-wide count for each returned result based from a given term

Firstly, I apologise if the terminology I use is incorrect, as I am learning Elasticsearch day by day and may use the wrong phrases.
After spending several days trying to figure this out and pulling my hair out, I seem to be hitting a brick wall every time.
I am trying to get Elasticsearch to provide a document count for each returned result. I will provide an example below.
{
  "suggest": {
    "text": "aberdeen",
    "city": {
      "completion": {
        "field": "city_suggest",
        "size": "2"
      }
    },
    "street": {
      "completion": {
        "field": "street_suggest",
        "size": "2"
      }
    }
  },
  "size": 0,
  "aggs": {
    "meta": {
      "filter": {
        "term": {
          "city.raw": "aberdeen"
        }
      },
      "aggs": {
        "name": {
          "terms": {
            "field": "city.raw"
          }
        }
      }
    }
  }
}
The above query returns the following results:
{
  "took": 37,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1870535,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "meta": {
      "doc_count": 119196,
      "name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Aberdeen",
            "doc_count": 119196
          }
        ]
      }
    }
  },
  "suggest": {
    "city": [
      {
        "text": "Aberdeen",
        "offset": 0,
        "length": 8,
        "options": [
          {
            "text": "Aberdeen",
            "score": 100
          }
        ]
      }
    ],
    "street": [
      {
        "text": "Aberdeen",
        "offset": 0,
        "length": 8,
        "options": [
          {
            "text": "Davidson House, Aberdeen, AB15",
            "score": 80
          },
          {
            "text": "Bruce House, Aberdeen, AB15",
            "score": 80
          }
        ]
      }
    ]
  }
}
The result I am trying to achieve is to have an overall document count for each returned result. For example, the returned street address "Davidson House, Aberdeen, AB15" would say how many documents in the index match this given address; this would be repeated for each result, and the same for the city, in a similar way to how the aggregated city currently shows the overall count:
{
  "key": "Aberdeen",
  "doc_count": 119196
}
Here is an example of something similar in production
The problem I believe I have faced with aggregations is that I do not know the values that are going to be returned; otherwise I could predefine them with aggregations, like I did for the city, and request the overall count of each given result that way.
To help illustrate, here is how I pictured the working results might look:
"suggest": {
"city": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Aberdeen",
"score": 100,
"total_addresses": 196152
}
]
}
],
"street": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Davidson House, Aberdeen, AB15",
"score": 80,
"total_addresses": 158
},
{
"text": "Bruce House, Aberdeen, AB15",
"score": 80,
"total_addresses": 30
}
]
}
]
}
In terms of the Elasticsearch version I am using: I have two dev servers running Elasticsearch 2.3 and 5.5, to see whether the newer version would make a difference. Unfortunately I came up short, so I have been using 2.3 in favour of 5.5.
Any help or advice would be greatly appreciated. Thanks all.
You need to divide your query in two. First use the suggest API to gather suggestions, then run the aggregation on the result. The drawback of this solution is that you pair a crazy-fast suggestion (less than a millisecond, if you're lucky) with a longer-running aggregation. If that's OK for you, this might be a good approach (a sketch of this two-step approach follows below).
Another idea might be to have a dedicated suggestion index with pre-aggregated data that contains such a count; this index gets recreated regularly in the background.
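Here is a minimal sketch of the two-step approach against a 2.3-style setup. The index name addresses and the street.raw sub-field are hypothetical (street.raw is modeled on the city.raw field from the question); street_suggest is taken from the question.
Step 1, gather the suggestions only (no hits):
POST /addresses/_search
{
  "size": 0,
  "suggest": {
    "text": "aberdeen",
    "street": {
      "completion": {
        "field": "street_suggest",
        "size": "2"
      }
    }
  }
}
Step 2, for each option text returned above, run a count (or a filtered aggregation) for that exact value, e.g. for "Davidson House, Aberdeen, AB15":
POST /addresses/_count
{
  "query": {
    "term": {
      "street.raw": "Davidson House, Aberdeen, AB15"
    }
  }
}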

How to perform an exact match query on an analyzed field in Elasticsearch?

This is probably a very commonly asked question; however, the answers I've got so far aren't satisfactory.
Problem:
I have an ES index that is composed of nearly 100 fields. Most of the fields are string type and set as analyzed. However, a query can be either partial (match) or exact (more like term). So, if my index contains a string field with the value super duper cool pizza, a partial query like duper super should match the document, but an exact query like cool pizza should not match it. On the other hand, Super Duper COOL PIzza should again match this document.
So far, the partial match part is easy; I used the AND operator in a match query. However, I can't get the other type done.
I have looked into other posts related to this problem and this post contains the closest solution:
Elasticsearch exact matches on analyzed fields
Out of the three solutions, the first one feels very complex, as I have a lot of fields and I do not use the REST API; I am creating queries dynamically using QueryBuilders with NativeSearchQueryBuilder from the Java API. It also generates a lot of possible patterns, which I think will cause performance issues.
The second one is a much easier solution, but again, I would have to maintain a lot more (almost) redundant data, and I don't think term queries are ever going to solve my problem.
The last one has a problem, I think: it will not prevent super duper from matching super duper cool pizza, which is not the output I want.
So is there any other way I can achieve the goal? I can post some sample mapping if required to clarify the question further. I am already keeping the source as well (in case that can be used). Please feel free to suggest any improvements too.
Thanks in advance.
[UPDATE]
Finally, I used multi_field, keeping a raw field for exact queries. When I insert, I apply some custom modifications to the data, and during searching I apply the same modification routines to the input text. This part is not handled by Elasticsearch; if you want that handled as well, you have to design appropriate analyzers.
Index settings and mapping queries:
PUT test_index
POST test_index/_close
PUT test_index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard_uppercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  }
}
PUT test_index/doc/_mapping
{
  "doc": {
    "properties": {
      "text_field": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "analyzer": "standard_uppercase"
          }
        }
      }
    }
  }
}
POST test_index/_open
Inserting some sample data:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Exact query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "term": {
              "text_field.raw": "PIZZA"
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1.4054651,
        "_source": {
          "text_field": "pizza"
        }
      }
    ]
  }
}
Partial query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "match": {
              "text_field": {
                "query": "pizza",
                "operator": "AND",
                "type": "boolean"
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "_source": {
          "text_field": "pizza"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
PS: These are generated queries, which is why there are some redundant blocks; in practice many other fields would be concatenated into the queries.
Sad part is, now I need to rewrite the whole mapping again :(
I think this will do what you want (or at least come as close as is possible), using the keyword tokenizer and lowercase token filter:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase_token_filter"]
        }
      },
      "filter": {
        "lowercase_token_filter": {
          "type": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text_field": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "lowercase": {
              "type": "string",
              "analyzer": "lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
I added a couple of docs for testing:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Notice we have the outer text_field set to be analyzed by the standard analyzer, then a sub-field raw that's not_analyzed (you may not want this one, I just added it for comparison), and another sub-field lowercase that creates tokens exactly the same as the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:
POST /test_index/_search
{
  "query": {
    "match": {
      "text_field.lowercase": "Super Duper COOL PIzza"
    }
  }
}
...
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
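For instance, an exact-match term query along those lines might look like the following (a sketch; the search string has to be lowercased by the caller, since term queries bypass analysis):
POST /test_index/_search
{
  "query": {
    "term": {
      "text_field.lowercase": "super duper cool pizza"
    }
  }
}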
It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):
POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "text_field_standard": {
      "terms": {
        "field": "text_field"
      }
    },
    "text_field_raw": {
      "terms": {
        "field": "text_field.raw"
      }
    },
    "text_field_lowercase": {
      "terms": {
        "field": "text_field.lowercase"
      }
    }
  }
}
...
{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "text_field_raw": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_lowercase": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_standard": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 2
        },
        {
          "key": "cool",
          "doc_count": 1
        },
        {
          "key": "duper",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "some",
          "doc_count": 1
        },
        {
          "key": "super",
          "doc_count": 1
        },
        {
          "key": "text",
          "doc_count": 1
        }
      ]
    }
  }
}
Here's the code I used to test this out:
http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1
If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch

Elastic Search fulltext search query and filters

I want to perform a full-text search, but I also want to use one or more possible filters. Here is the simplified structure of my document, when searching with /things/_search?q=*foo*:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "things",
        "_type": "thing",
        "_id": "63",
        "_score": 1,
        "fields": {
          "name": [
            "foo bar"
          ],
          "description": [
            "this is my description"
          ],
          "type": [
            "inanimate"
          ]
        }
      }
    ]
  }
}
This works well enough, but how do I combine filters with a query? Let's say I want to search for "foo" in an index with multiple documents, but I only want to get those with type == "inanimate".
This is my attempt so far:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*foo*"
        }
      },
      "filter": {
        "bool": {
          "must": {
            "term": { "type": "inanimate" }
          }
        }
      }
    }
  }
}
When I remove the filter part, it returns an accurate set of document hits. But with this filter-definition it does not return anything, even though I can manually verify that there are documents with type == "inanimate".
Since you have not defined an explicit mapping, the term query is looking for an exact match against the analyzed terms. You need to add "index": "not_analyzed" to the type field, and then your query will work.
This will give you the correct documents:
{
  "query": {
    "match": {
      "type": "inanimate"
    }
  }
}
but this is not the real solution; you need to do the explicit mapping, as I said.
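A minimal sketch of what that could look like, using the things index and thing type from the question (this is not the original poster's mapping; note that an existing field's mapping cannot be changed in place, so the index has to be recreated or the data reindexed):
PUT /things
{
  "mappings": {
    "thing": {
      "properties": {
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
After that, the original filtered query with a term filter on type should behave as expected:
POST /things/_search
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*foo*"
        }
      },
      "filter": {
        "term": { "type": "inanimate" }
      }
    }
  }
}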

Elasticsearch: query for multiple words across multiple fields (with prefix)

I'm trying to implement an auto-suggest control powered by an ES index. The index has multiple fields and I want to be able to query across multiple fields using the AND operator and allowing for partial matches (prefix only).
Just as an example, let's say I got 2 fields I want to query on: "colour" and "animal".
I would like to be able to fulfil queries like "duc", "duck", "purpl", "purple", "purple duck".
I managed to get all these working using multi_match() with AND operator.
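Presumably that working query looks something like the following (a sketch based on the description above; cross_fields is an assumption here, since with the default best_fields type the and operator would require all terms to appear in the same field):
{
  "query": {
    "multi_match": {
      "query": "purple duck",
      "type": "cross_fields",
      "fields": ["colour", "animal"],
      "operator": "and"
    }
  }
}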
What I don't seem to be able to do is match on queries like "purple duc", as multi_match doesn't allow for wildcards.
I've looked into match_phrase_prefix, but as I understand it, it doesn't span multiple fields.
I'm turning toward the implementation of a tokenizer: it feels like the solution may be there, so ultimately the questions are:
1) Can someone confirm there's no out-of-the-box function to do what I want? It feels like a common enough pattern that there could be something ready to use.
2) Can someone suggest a solution? Are tokenizers part of the solution?
I'm more than happy to be pointed in the right direction and do more research myself.
Obviously if someone has working solutions to share that would be awesome.
Thanks in advance
- F
I actually wrote a blog post about this a while back for Qbox, which you can find here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. (Unfortunately some of the links in the post are broken and can't easily be fixed at this point, but hopefully you'll get the idea.)
I'll refer you to the post for the details, but here is some code you can use to test it out quickly. Note that I'm using edge ngrams instead of full ngrams.
Also note in particular the use of the _all field, and the match query operator.
Okay, so here is the mapping:
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edgeNGram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edgeNGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "edgeNGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_all": {
        "enabled": true,
        "index_analyzer": "edgeNGram_analyzer",
        "search_analyzer": "standard"
      },
      "properties": {
        "field1": {
          "type": "string",
          "include_in_all": true
        },
        "field2": {
          "type": "string",
          "include_in_all": true
        }
      }
    }
  }
}
Now add a few documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"purple duck","field2":"brown fox"}
{"index":{"_id":2}}
{"field1":"slow purple duck","field2":"quick brown fox"}
{"index":{"_id":3}}
{"field1":"red turtle","field2":"quick rabbit"}
And this query seems to illustrate what you're wanting:
POST /test_index/_search
{
  "query": {
    "match": {
      "_all": {
        "query": "purp fo slo",
        "operator": "and"
      }
    }
  }
}
returning:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19930676,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 0.19930676,
        "_source": {
          "field1": "slow purple duck",
          "field2": "quick brown fox"
        }
      }
    ]
  }
}
Here is the code I used to test it out:
http://sense.qbox.io/gist/b87e426062f453d946d643c7fa3d5480cd8e26ec
