Completion Suggester in Elasticsearch on an Existing Field

In my elasticsearch index, I have indexed a bunch of jobs. For simplicity, let's just say they are a bunch of Job Titles. When people are typing a job title into my search engine, I want to "Auto Complete" with possible matches.
I've investigated the Completion Suggester here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
However, all the examples I've found involve creating a new field on your index and manually populating this field while indexing/rivering.
Is there any way to have a completion suggester on an existing field? Even if it means reindexing the data, that's fine. For example, when I want to keep the original not_analyzed text, I can do something like this in the mappings:
"JobTitle": {
"type": "string",
"fields": {
"Original": {
"type": "string",
"index": "not_analyzed"
}
}
}
Is this possible to do with the suggesters?
If not, is it possible to do a non-whitespace tokenizing/n-gram search instead to get these fields? While it would be slower, I assume that would work.

Okay, here is the easy way that (may or) may not scale, using prefix queries.
I'll create an index using the "fields" technique you mentioned, and some handy job description data I found here:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"experienced bra fitter", "desc":"I bet they had trouble finding candidates for this one."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PlayStation Brand Ambassador", "desc":"please report to your residence in the United States of Nintendo."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Eyebrow Threading", "desc":"I REALLY hope this has something to do with dolls."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Administraive/ Secretary", "desc":"ok, ok, we get it. It’s clear where you need help."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Finish Carpenter", "desc":"for when the Start Carpenter gets tired."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Helpdesk Technician # Pentagon", "desc":"“Uh, hello? I’m having a problem with this missile…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Nail Tech", "desc":"so nails can be pretty complicated…"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Remedy Engineer", "desc":"aren’t those called “doctors”?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Saltlick Cashier", "desc":"new trend in the equestrian industry. Ok, enough horsing around."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Molecular Biologist II", "desc":"when Molecular Biologist I gets promoted."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Breakfast Sandwich Maker", "desc":"we also got one of these recently."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Hotel Housekeepers", "desc":"why can’t they just say ‘hotelkeepers’?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Preschool Teacher #4065", "desc":"either that’s a really big school or they’ve got robot teachers."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"glacéau drop team", "desc":"for a new sport at the Winter Olympics: ice-water spilling."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PLUMMER/ELECTRICIAN", "desc":"get a dictionary/thesaurus first."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"DoodyCalls Technician", "desc":"they really shouldn’t put down janitors like that."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Golf Staff", "desc":"and here I thought they were called clubs."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Pressure Washers", "desc":"what’s next, heat cleaners?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Sandwich Artist", "desc":"another “Jesus in my food” wannabe."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Self Storage Manager", "desc":"this is for self storage?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Qualified Infant Caregiver", "desc":"too bad for all the unqualified caregivers on the list."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Ground Support", "desc":"but there’s just more dirt under there."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Gymboree Teacher", "desc":"the hardest part is not burning your hands sliding down the pole."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"COMMERCIAL space hunter", "desc":"so they did find animals further out in the cosmos? Who knew."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"JOB COACH", "desc":"if they’re unemployed when they get to you, what does that say about them?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"KIDS KAMP INSTRUCTOR!", "desc":"no spelling ability required."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"POOLS SUPERVISOR", "desc":"“yeah, they’re still wet…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"HOUSE MANAGER/TEEN SUPERVISOR", "desc":"see the dictionary under P, for Parent."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Licensed Seamless Gutter Contractor", "desc":"just sounds bad."}
Then I can easily run a prefix query:
POST /test_index/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "san"
      }
    }
  }
}
...
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "mcRfqtwzTyWE7ZNsKFvwEg",
        "_score": 1,
        "_source": {
          "title": "Breakfast Sandwich Maker",
          "desc": "we also got one of these recently."
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "fIYV0WOWRe6gfpYy_u2jlg",
        "_score": 1,
        "_source": {
          "title": "Sandwich Artist",
          "desc": "another “Jesus in my food” wannabe."
        }
      }
    ]
  }
}
Or if I want to be more careful about the matches I can use the un-analyzed field:
POST /test_index/_search
{
  "query": {
    "prefix": {
      "title.raw": {
        "value": "San"
      }
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "fIYV0WOWRe6gfpYy_u2jlg",
        "_score": 1,
        "_source": {
          "title": "Sandwich Artist",
          "desc": "another “Jesus in my food” wannabe."
        }
      }
    ]
  }
}
This is the easy way. Ngrams are a bit more involved, but not difficult. I'll add that in another answer in a bit.
Here's the code I used:
http://sense.qbox.io/gist/4e066d051d7dab5fe819264b0f4b26d958d115a9
EDIT: Ngram version
Borrowing the analyzers from this blog post (shameless plug), I can set up the index as follows:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
Notice that I use different analyzers for indexing and for searching; that's important, because if the search query were also broken up into ngrams we would probably get a lot more hits than we want.
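A quick way to see this concretely is the analyze API. This is just a sanity check against the index above (using the old query-parameter form of _analyze, which matches this version of ES):
GET /test_index/_analyze?analyzer=nGram_analyzer&text=Ground Support
GET /test_index/_analyze?analyzer=whitespace_analyzer&text=Ground Support
The first request should return ngrams like gr, gro, rou, ..., su, sup, supp, and so on, while the second should return just the two lowercased tokens ground and support. That's why a search for sup can match the indexed grams without itself being shredded into grams.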
Populating with the same dataset used above, I can query with a simple match query to get the results I expect:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "sup"
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.8631258,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4pcAOmPNSYupjz7lSes8jw",
        "_score": 1.8631258,
        "_source": {
          "title": "Ground Support",
          "desc": "but there’s just more dirt under there."
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "DVFOC6DsTa6eH_a-RtbUUw",
        "_score": 1.8631258,
        "_source": {
          "title": "POOLS SUPERVISOR",
          "desc": "“yeah, they’re still wet…”"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "klleY_bnQ4uFmCPF94sLOw",
        "_score": 1.4905007,
        "_source": {
          "title": "HOUSE MANAGER/TEEN SUPERVISOR",
          "desc": "see the dictionary under P, for Parent."
        }
      }
    ]
  }
}
Here's the code:
http://sense.qbox.io/gist/b0e77bb7f05a4527de5ab4345749c793f923794c

Related

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an edge ngram analyzer.
With a really simple query like that:
{
  "query": {
    "match": {
      "label.ngram": {
        "query": "Ab"
      }
    }
  }
}
I always get the documents "Abca", "Abcb", "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This is happening due to field-length normalization; to get the same score, you have to disable norms on the field. As the docs put it:
"Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query."
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "norms": false,
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{ "title": "Abca" }
{ "title": "Abcb" }
{ "title": "Abcc" }
{ "title": "Abc" }
Search Query:
{
  "query": {
    "match": {
      "title": {
        "query": "Ab"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by @ESCoder, disabling norms does even out the scores, but it isn't very useful if you actually want to rank your search results: it causes every matching document to get the same score, which hurts the relevance of your results badly.
You could instead tweak the document-length normalization parameter (b) of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried this with your dataset and my settings, but couldn't get it to work.
The second option, which will mostly work, is the one you suggested: store the length of each value in a separate field. You would have to populate it from your application, though, since the analysis process generates multiple tokens for the same field. That is extra overhead, so I would still prefer tweaking the similarity parameters.
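For reference, here is roughly where that similarity tweak lives. This is an untested sketch (the index name and similarity name are just illustrative, and the analysis block from the mapping above is omitted for brevity): BM25's b parameter controls how strongly field length is normalized, with 0 disabling it, 1 penalizing longer fields the most, and 0.75 as the default.
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "length_sensitive_bm25": {
          "type": "BM25",
          "k1": 1.2,
          "b": 1
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "length_sensitive_bm25"
      }
    }
  }
}
Whether pushing b toward 1 actually reorders the scores for these particular ngram'd values is something you'd have to verify against your own data.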

ngram matching gives same score to less relevant documents

I am searching for Bob Smith in my Elasticsearch index. The results Bob Smith and Bobbi Smith both come back in the response with the same score. I want Bob Smith to have a higher score so that it appears first in my result set. Why are the scores equivalent?
Here is my query
{
  "query": {
    "query_string": {
      "query": "Bob Smith",
      "fields": [
        "text_field"
      ]
    }
  }
}
Below are my index's settings. I am using the ngram token filter described here: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
{
  "contacts_5test": {
    "aliases": {},
    "mappings": {
      "properties": {
        "text_field": {
          "type": "text",
          "term_vector": "yes",
          "analyzer": "ngram_filter_analyzer"
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "contacts_5test",
        "creation_date": "1588987227997",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "nGram",
              "min_gram": "4",
              "max_gram": "4"
            }
          },
          "analyzer": {
            "ngram_filter_analyzer": {
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "HqOXu9bNRwCHSeK39WWlxw",
        "version": {
          "created": "7060199"
        }
      }
    }
  }
}
Here are the results from my query...
"hits": [
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "1",
"_score": 0.69795835,
"_source": {
"text_field": "Bob Smith"
}
},
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "2",
"_score": 0.69795835,
"_source": {
"text_field": "Bobbi Smith"
}
}
]
If I instead search for Bobbi Smith, elastic returns both documents, but with a higher score for Bobbi Smith. This makes more sense.
I was able to reproduce your issue, and the reason is your ngram_filter, which doesn't create any token for bob: the standard tokenizer produces the token bob, but it is then filtered out by your ngram_filter because you set min_gram to 4, so tokens shorter than 4 characters are dropped.
I also tried lowering min_gram to 3, which does create the token, but then both Bob Smith and Bobbi Smith produce the same bob token, so their scores are still identical.
When you search for Bobbi Smith, however, the exact token bobbi is present in only one document, hence the higher score.
Note: use the analyze API and the explain API to inspect the tokens that are generated and how they are matched; this will help you understand the issue and the explanation above in detail.
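For example, running something like this against the index above shows exactly which tokens survive the 4-gram filter:
GET /contacts_5test/_analyze
{
  "analyzer": "ngram_filter_analyzer",
  "text": "Bob Smith"
}
With min_gram and max_gram both set to 4, you should see tokens like smit and mith, but nothing at all for bob, which is the whole problem.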

How to perform an exact match query on an analyzed field in Elasticsearch?

This is probably a very commonly asked question; however, the answers I've gotten so far aren't satisfactory.
Problem:
I have an ES index composed of nearly 100 fields. Most of the fields are string type and set as analyzed. However, a query can be either partial (match) or exact (more like term). So, if my index contains a string field with the value super duper cool pizza, a partial query like duper super should match the document, but an exact query like cool pizza should not match it. On the other hand, the exact query Super Duper COOL PIzza should again match the document.
So far the partial-match part is easy: I used the AND operator in a match query. However, I can't get the exact-match part done.
I have looked into other posts related to this problem and this post contains the closest solution:
Elasticsearch exact matches on analyzed fields
Out of the three solutions, the first one feels very complex, as I have a lot of fields and I do not use the REST API; I am creating queries dynamically using QueryBuilders with NativeSearchQueryBuilder from their Java API. It also generates lots of possible patterns, which I think will cause performance issues.
The second one is a much easier solution, but again, I would have to maintain a lot more (almost) redundant data, and I don't think term queries are ever going to solve my problem.
The last one has a problem, I think: it will not prevent super duper from matching super duper cool pizza, which is not the output I want.
So, is there any other way I can achieve the goal? I can post some sample mappings if needed to clarify the question further. I am already keeping the source as well (in case that can be used). Please feel free to suggest any improvements.
Thanks in advance.
[UPDATE]
Finally, I used multi_field, keeping a raw field for exact queries. When I insert, I apply some custom modifications to the data, and during search I apply the same modification routines to the input text. This part is not handled by Elasticsearch; if you want to do that, you have to design appropriate analyzers as well.
Index settings and mapping queries:
PUT test_index
POST test_index/_close
PUT test_index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard_uppercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  }
}
PUT test_index/doc/_mapping
{
  "doc": {
    "properties": {
      "text_field": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "analyzer": "standard_uppercase"
          }
        }
      }
    }
  }
}
POST test_index/_open
Inserting some sample data:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Exact query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "term": {
              "text_field.raw": "PIZZA"
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1.4054651,
        "_source": {
          "text_field": "pizza"
        }
      }
    ]
  }
}
Partial query:
GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "match": {
              "text_field": {
                "query": "pizza",
                "operator": "AND",
                "type": "boolean"
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "_source": {
          "text_field": "pizza"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
PS: these are generated queries, which is why there are some redundant blocks; many other fields would be concatenated into the queries.
Sad part is, now I need to rewrite the whole mapping again :(
I think this will do what you want (or at least come as close as is possible), using the keyword tokenizer and lowercase token filter:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase_token_filter"]
        }
      },
      "filter": {
        "lowercase_token_filter": {
          "type": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text_field": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "lowercase": {
              "type": "string",
              "analyzer": "lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
I added a couple of docs for testing:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Notice that we have the outer text_field set to be analyzed by the standard analyzer, then a sub-field raw that's not_analyzed (you may not want this one; I just added it for comparison), and another sub-field lowercase that creates tokens exactly the same as the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:
POST /test_index/_search
{
  "query": {
    "match": {
      "text_field.lowercase": "Super Duper COOL PIzza"
    }
  }
}
...
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text_field": "super duper cool pizza"
        }
      }
    ]
  }
}
Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
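For instance, if you lowercase the input on the application side, a term query against the lowercase sub-field gives you an exact whole-value match with no analysis applied to your input (a sketch against the index above; it should return only document 1):
POST /test_index/_search
{
  "query": {
    "term": {
      "text_field.lowercase": "super duper cool pizza"
    }
  }
}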
It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):
POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "text_field_standard": {
      "terms": {
        "field": "text_field"
      }
    },
    "text_field_raw": {
      "terms": {
        "field": "text_field.raw"
      }
    },
    "text_field_lowercase": {
      "terms": {
        "field": "text_field.lowercase"
      }
    }
  }
}
...
{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "text_field_raw": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_lowercase": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 1
        },
        {
          "key": "some other text",
          "doc_count": 1
        },
        {
          "key": "super duper cool pizza",
          "doc_count": 1
        }
      ]
    },
    "text_field_standard": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "pizza",
          "doc_count": 2
        },
        {
          "key": "cool",
          "doc_count": 1
        },
        {
          "key": "duper",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "some",
          "doc_count": 1
        },
        {
          "key": "super",
          "doc_count": 1
        },
        {
          "key": "text",
          "doc_count": 1
        }
      ]
    }
  }
}
Here's the code I used to test this out:
http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1
If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch

Should I include spaces in fuzzy query fields?

I have this data:
name:
  first: 'John'
  last: 'Smith'
When I store it in ES, AFAICT it's better to make it one field. However, should this one field be:
name: 'John Smith'
or
name: 'JohnSmith'
?
I'm thinking that the query should be:
query:
  match:
    name:
      query: searchTerm
      fuzziness: 'AUTO'
      operator: 'and'
Example search terms are what people might type in a search box, like
John
Jhon Smi
J Smith
Smith
etc.
You will probably want a combination of ngrams and a fuzzy match query. I wrote a blog post about ngrams for Qbox if you need a primer: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch. I'll swipe the starter code at the end of the post to illustrate what I mean here.
Also, I don't think it matters much whether you use two fields for name, or just one. If you have some other reason you want two fields, you may want to use the _all field in your query. For simplicity I'll just use a single field here.
Here is a mapping that will get you the partial-word matching you want, assuming you only care about tokens that start at the beginning of words (otherwise use ngrams instead of edge ngrams). There are lots of nuances to using ngrams, so I'll refer you to the documentation and my primer if you want more info.
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "edge_ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
One thing to note here, in particular: "min_gram": 1. This means that single-character tokens will be generated from indexed values. This will cast a pretty wide net when you query (lots of words begin with "j", for example), so you may get some unexpected results, especially when combined with fuzziness. But this is needed to get your "J Smith" query to work right. So there are some trade-offs to consider.
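To make the trade-off concrete, you can inspect the generated tokens with the analyze API (using the query-parameter form that matches this version of ES):
GET /test_index/_analyze?analyzer=edge_ngram_analyzer&text=John Smith
This should produce something like j, jo, joh, john, s, sm, smi, smit, smith, so a one-letter query term like the j in "J Smith" has a token to match against.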
For illustration, I indexed four documents:
PUT /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"John Hancock"}
{"index":{"_id":2}}
{"name":"John Smith"}
{"index":{"_id":3}}
{"name":"Bob Smith"}
{"index":{"_id":4}}
{"name":"Bob Jones"}
Your query mostly works, with a couple of caveats.
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "John",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
this query returns three documents, because of ngrams plus fuzziness:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.90169895,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.90169895,
        "_source": {
          "name": "John Hancock"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 0.90169895,
        "_source": {
          "name": "John Smith"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4",
        "_score": 0.6235822,
        "_source": {
          "name": "Bob Jones"
        }
      }
    ]
  }
}
That may not be what you want. Also, "AUTO" doesn't work with the "Jhon Smi" query, because "Jhon" is an edit distance of 2 from "John", and "AUTO" uses an edit distance of 1 for strings of 3-5 characters (see the docs for more info). So I have to use this query instead:
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Jhon Smi",
        "fuzziness": 2,
        "operator": "and"
      }
    }
  }
}
...
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4219328,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1.4219328,
        "_source": {
          "name": "John Smith"
        }
      }
    ]
  }
}
The other queries work as expected. So this solution isn't perfect, but it will get you close.
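To illustrate one of them, the "J Smith" case should work as-is, thanks to those single-character edge ngrams (response omitted; it should match "John Smith"):
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "J Smith",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}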
Here's all the code I used:
http://sense.qbox.io/gist/ba5a6741090fd40c1bb20f5d36f3513b4b55ac77

Elasticsearch: Fetch Distinct Tags

I have documents of the following format:
{
  "_id": "1",
  "tags": ["guava", "apple", "mango", "banana", "gulmohar"]
}
{
  "_id": "2",
  "tags": ["orange", "guava", "mango shakes", "apple pie", "grammar"]
}
{
  "_id": "3",
  "tags": ["apple", "grapes", "water", "gulmohar", "water-melon", "green"]
}
Now I want to fetch the unique tag values across all documents' tags fields that start with the prefix g, so that these unique tags can be displayed by a tag suggester (the Stack Overflow tag box is an example).
For example, whenever the user types 'g':
"guava", "gulmohar", "grammar", "grapes" and "green" should be returned as the result.
I.e., the query should return the distinct tags with the prefix g*.
I looked everywhere, browsed the whole documentation, and searched the ES forum, but I didn't find any clue, much to my dismay.
I tried aggregations, but aggregations return counts for every word/token in the tags field; they do not return the unique list of tags starting with 'g'.
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"allow_leading_wildcard": false,
"fields": [
"tags"
],
"query": "g*",
"fuzziness":0
}
}
]
}
},
"filter": {
//some condition on other field...
}
}
},
"aggs": {
"distinct_tags": {
"terms": {
"field": "tags",
"size": 10
}
}
},
The result of the above includes counts for every tag (guava, apple, mango, ...), not just those starting with g.
Can someone please suggest the correct way to fetch all the distinct tags with a given prefix?
It's a bit of a hack, but this seems to accomplish what you want.
I created an index and added your docs:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"tags":["guava","apple","mango", "banana", "gulmohar"]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"tags": ["orange","guava", "mango shakes", "apple pie", "grammar"]}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"tags": ["guava","apple","grapes", "water", "grammar","gulmohar","water-melon", "green"]}
Then I used a combination of prefix query and highlighting as follows:
POST /test_index/_search
{
  "query": {
    "prefix": {
      "tags": {
        "value": "g"
      }
    }
  },
  "fields": [ ],
  "highlight": {
    "pre_tags": [""],
    "post_tags": [""],
    "fields": {
      "tags": {}
    }
  }
}
...
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "highlight": {
          "tags": [
            "guava",
            "gulmohar"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "highlight": {
          "tags": [
            "guava",
            "grammar"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "highlight": {
          "tags": [
            "guava",
            "grapes",
            "grammar",
            "gulmohar",
            "green"
          ]
        }
      }
    ]
  }
}
Here is the code I used:
http://sense.qbox.io/gist/c14675ee8bd3934389a6cb0c85ff57621a17bf11
What you're trying to do amounts to autocomplete, of course, and there are perhaps better ways of going about that than what I posted above (though they are a bit more involved). Here are a couple of blog posts we did about ways to set up autocomplete:
http://blog.qbox.io/quick-and-dirty-autocomplete-with-elasticsearch-completion-suggest
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
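As a side note, depending on your ES version you may also want to experiment with the terms aggregation's include parameter, which filters the buckets by a regex. An untested sketch against the index above:
POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_tags": {
      "terms": {
        "field": "tags",
        "include": "g.*",
        "size": 10
      }
    }
  }
}
One caveat: on an analyzed field the buckets are individual tokens (so "mango shakes" would show up as two tokens), which is exactly the behavior you ran into; a not_analyzed sub-field would give you whole tags as buckets.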
As per @Sloan Ahrens' advice, I did the following.
I updated the mapping:
"tags": {
"type": "completion",
"context": {
"filter_color": {
"type": "category",
"default": "",
"path": "fruits.color"
},
"filter_type": {
"type": "category",
"default": "",
"path": "fruits.type"
}
}
}
Reference: ES API Guide
Then I indexed these documents:
{
  "_id": "1",
  "tags": {"input": ["guava", "apple", "mango", "banana", "gulmohar"]},
  "fruits": {"color": "bar", "type": "alice"}
}
{
  "_id": "2",
  "tags": {"input": ["orange", "guava", "mango shakes", "apple pie", "grammar"]},
  "fruits": {"color": "foo", "type": "bob"}
}
{
  "_id": "3",
  "tags": {"input": ["apple", "grapes", "water", "gulmohar", "water-melon", "green"]},
  "fruits": {"color": "foo", "type": "alice"}
}
I didn't need to modify my original index much; I just wrapped the tags array in an input field. Then:
POST rescu1/_suggest?pretty
{
  "suggest": {
    "text": "g",
    "completion": {
      "field": "tags",
      "size": 10,
      "context": {
        "filter_color": "bar",
        "filter_type": "alice"
      }
    }
  }
}
gave me the desired output.
I accepted @Sloan Ahrens' answer, as his suggestions worked like a charm for me and he showed me the right direction.
