Add score to Elasticsearch completion suggester inputs

I need to implement an Elasticsearch completion suggester.
I have an index mapped like this:
{
"user": {
"properties": {
"username": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"email": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"name": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"name_suggest": {
"payloads": true,
"type": "completion"
}
}
}
}
I add documents to the index like this:
{
"doc": {
"id": 1,
"username": "jack",
"name": "Jack Nicholson",
"email": "nick#myemail.com",
"name_suggest": {
"input": [
"jack",
"Jack Nicholson",
"nick#myemail.com"
],
"payload": {
"id": 1,
"name": "Jack Nicholson",
"username": "jack",
"email": "nick#myemail.com"
},
"output": "Jack Nicholson (jack) - nick#myemail.com"
}
},
"doc_as_upsert": true
}
And I send this request to my_index/_suggest:
{
"user": {
"text": "jack",
"completion": {
"field": "name_suggest"
}
}
}
I get the resulting options that look like this:
[
{
"text": "John Smith",
"score": 1.0,
"payload": {
"id": 11,
"name": "John Smith",
"username": "jack",
"email": "john#myemail.com"
}
},
{
"text": "Jack Nickolson",
"score": 1.0,
"payload": {
"id": 1,
"name": "Jack Nickolson",
"username": "jack.n",
"email": "nickolson#myemail.com"
}
},
{
"text": "Jackson Jermaine",
"score": 1.0,
"payload": {
"id": 10,
"name": "Jackson Jermaine",
"username": "jermaine",
"email": "jermaine#myemail.com"
}
},
{
"text": "Tito Jackson",
"score": 1.0,
"payload": {
"id": 9,
"name": "Tito Jackson",
"username": "tito",
"email": "jackson#myemail.com"
}
},
{
"text": "Michael Jackson",
"score": 1.0,
"payload": {
"id": 6,
"name": "Michael Jackson",
"username": "michael_jackson",
"email": "jackson_michael#myemail.com"
}
}
]
This works fine, but I need the options sorted so that those whose username matches come first. I could sort them manually, but that would prevent me from using length and offset, and would be slower.
Is it possible to add a score to the individual inputs (not to the whole suggestion) and so affect the sorting? With my current approach it seems it is not.
A related question: is it possible to specify an array of fields in the input instead of an array of values, and so avoid the duplication? If so, would a score set on the fields be taken into account when ES generates suggestions?

You can add a score to your input with the weight option.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html#indexing
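As a rough illustration of the weight option, the sketch below builds a name_suggest value with one entry per input, each carrying its own weight. It assumes an Elasticsearch version whose completion field accepts an array of entries in this form (the 5.x and later documentation shows it); the helper name build_name_suggest and the weight values are made up for the example.

```python
def build_name_suggest(user, weights):
    """Return one completion entry per input, each with its own weight."""
    entries = []
    for field, weight in weights.items():
        entries.append({"input": user[field], "weight": weight})
    return entries


user = {"username": "jack", "name": "Jack Nicholson", "email": "nick#myemail.com"}

# Hypothetical weights: a username match should rank above name and email.
weights = {"username": 30, "name": 20, "email": 10}

name_suggest = build_name_suggest(user, weights)
```

The resulting list would be sent as the name_suggest field of the document, so that suggestions matched through the username input score higher than those matched through the name or email.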

Related

AWS OpenSearch/Elasticsearch wildcard search on full index

For example, below is one sample JSON document in which several fields have the value '1001'. I have many JSON documents like this. I want to search for a particular keyword such as '1001' across any field (including nested JSON fields). The documentation I have read always suggests naming the specific field to search. Is there a way to do this without knowing which field contains the search text?
URL: https://linuxhint.com/wildcard-query-elasticsearch/
{
"id": "1001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "1001" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "1001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}

RethinkDB: how can I pluck the result by a value at a particular array index?

Sample data:
[
{
"createdDate": 1508588333821,
"data": {
"image_extension": "png",
"name": "Golden",
"qty": 1,
"remark": "#296-2",
"status": "RETURN",
"owner": [
{
"name": "app1emaker",
"location": 1
},
{
"name": "simss92_lmao",
"location": 31
}
]
},
"deleted": false,
"docId": 307,
"docType": "product",
"id": "db0131f9-9359-4aa3-b6ed-cd9f3ff4aa3e",
"updatedDate": 1553155281691
},
{
"createdDate": 1508588333324,
"data": {
"image_extension": "png",
"name": "Golden",
"qty": 1,
"remark": "#296-2",
"status": "DISCARD",
"owner": [
{
"name": "At533",
"location": 7
},
{
"name": "madsimon",
"location": 64
},
{
"name": "boyboy96",
"location": 1
},
{
"name": "xinfengCN",
"location": 5
}
]
},
"deleted": false,
"docId": 308,
"docType": "product",
"id": "3790bdaa-5347-4ab0-8149-37332c23c6ea",
"updatedDate": 1554555231691
},
...
...
]
Given that, I would like to select only the element of data.owner at array index 0 (that is, data.owner[0]), which is
{
"name": "app1emaker",
"location": 1
}
and
{
"name": "At533",
"location": 7
}
in this case. My failed attempt is below.
r.db('carbon').table("items").pluck(['id', 'docId', 'createdDate',{data:{name: true, owner:[0]}}])
I saw that for some functions, such as orderBy, RethinkDB allows orderBy(r.row('data')('owner')(0)('name')) to access a nested object, but I have no idea how to do this with pluck. Could anyone give me some hints?
Thanks a lot.
pluck cannot do that, but you can fall back to map:
r.db("carbon").table("items").map(function(doc){
return {
"id": doc("id"),
"docId": doc("docId"),
"createdDate": doc("createdDate"),
"data": {
"name": doc("data")("name"),
"owner": doc("data")("owner")(0)
}
}
})
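To show the shape of what this map returns, here is a plain-Python sketch that applies the same projection to in-memory copies of the two sample rows. It does not use the RethinkDB driver; the project function is only an illustration, and the rows are trimmed to the fields involved.

```python
def project(doc):
    """Mirror the map() above: keep a few top-level fields and the first owner."""
    return {
        "id": doc["id"],
        "docId": doc["docId"],
        "createdDate": doc["createdDate"],
        "data": {
            "name": doc["data"]["name"],
            # Same idea as doc("data")("owner")(0): take the element at index 0.
            "owner": doc["data"]["owner"][0],
        },
    }


items = [
    {
        "id": "db0131f9-9359-4aa3-b6ed-cd9f3ff4aa3e",
        "docId": 307,
        "createdDate": 1508588333821,
        "data": {
            "name": "Golden",
            "owner": [
                {"name": "app1emaker", "location": 1},
                {"name": "simss92_lmao", "location": 31},
            ],
        },
    },
    {
        "id": "3790bdaa-5347-4ab0-8149-37332c23c6ea",
        "docId": 308,
        "createdDate": 1508588333324,
        "data": {
            "name": "Golden",
            "owner": [
                {"name": "At533", "location": 7},
                {"name": "madsimon", "location": 64},
            ],
        },
    },
]

projected = [project(doc) for doc in items]
```

As in the map version, a row whose owner array is empty would raise an error here, so such rows would need to be filtered out or given a default first.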

Word-oriented completion suggester (ElasticSearch 5.x)

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:
Completion suggester is document-oriented
Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.
In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.
Let's say we have this simple mapping:
{
"my-index": {
"mappings": {
"users": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple"
}
}
}
}
}
}
With a few test documents:
{
"_index": "my-index",
"_type": "users",
"_id": "1",
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"_index": "my-index",
"_type": "users",
"_id": "2",
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
And a by-the-book query:
POST /my-index/_suggest?pretty
{
"my-suggest" : {
"text" : "joh",
"completion" : {
"field" : "suggest"
}
}
}
Which yields the following results:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "1",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "2",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
]
}
]
}
In short, for a completion suggestion on the text "joh", two (2) documents were returned, both Johns, and both had the same value of the text property.
However, I would like to receive one (1) word. Something simple like this:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
"John"
]
}
]
}
Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.
Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?
EDIT:
As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:
Keeping the new index in sync.
Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".
To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
As hinted at in the comment, another way of achieving this without getting the duplicate documents is to create a sub-field containing edge n-grams of the field you want to complete on (here, an autocomplete field). First you define your mapping like this:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"completion_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"completion_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"completion_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 24
}
}
}
},
"mappings": {
"users": {
"properties": {
"autocomplete": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"completion": {
"type": "text",
"analyzer": "completion_analyzer",
"search_analyzer": "standard"
}
}
},
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
}
}
}
}
}
Then you index a few documents:
POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }
Then you can query for john d and get one result for John Doe and another one for John Deere:
{
"size": 0,
"query": {
"term": {
"autocomplete.completion": "john d"
}
},
"aggs": {
"suggestions": {
"terms": {
"field": "autocomplete.raw"
}
}
}
}
Results:
{
"aggregations": {
"suggestions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 1
},
{
"key": "John Deere",
"doc_count": 1
}
]
}
}
}
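To see why the term query for "john d" matches John Doe and John Deere but not Johnny Cash, the following sketch imitates completion_analyzer in plain Python: the keyword tokenizer keeps the whole value as one token, lowercase folds it, and edge_ngram emits its prefixes of length 1 to 24. It is only a local approximation of the analysis chain, not Elasticsearch code, and the function names are made up.

```python
def edge_ngrams(value, min_gram=1, max_gram=24):
    """Approximate keyword tokenizer + lowercase + edge_ngram filters."""
    token = value.lower()  # one token for the whole value, lowercased
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


names = ["John Doe", "John Deere", "Johnny Cash"]

# "Index" each name under all of its edge n-grams.
index = {name: set(edge_ngrams(name)) for name in names}


def suggest(term):
    """A term query is not analyzed, so the term must match an n-gram exactly."""
    return [name for name in names if term in index[name]]


matches_for_john_d = suggest("john d")
matches_for_joh = suggest("joh")
```

Because the term query is not analyzed, the search text has to be lowercased by the caller; "John D" with capitals would match nothing in this setup.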
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
An additional option, skip_duplicates, will be added in the next 6.x release.
From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:
POST music/_search?pretty
{
"suggest": {
"song-suggest" : {
"prefix" : "nor",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
}
}
}
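On versions without skip_duplicates, one possible fallback is to collapse duplicate suggestion texts in the client. The sketch below assumes a response shaped like the _suggest output shown in the question (a list of entries, each with an options list whose items have a text property); the helper name unique_suggestions is hypothetical.

```python
def unique_suggestions(response, suggest_name):
    """Collect option texts in order of appearance, dropping repeats."""
    seen = set()
    words = []
    for entry in response[suggest_name]:
        for option in entry["options"]:
            text = option["text"]
            if text not in seen:
                seen.add(text)
                words.append(text)
    return words


# Trimmed copy of the response from the question: two documents, same text.
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

words = unique_suggestions(response, "my-suggest")
```

This removes duplicates only after the fact, so a request for a given number of suggestions can come back with fewer unique words than requested.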
We face exactly the same problem. In Elasticsearch 2.4 the approach you describe worked fine for us, but now, as you say, the suggester has become document-based, while we, like you, are only interested in unique words, not in the documents.
The only 'solution' we could think of so far is to create a separate index just for the words on which we want to run suggestion queries, and to make sure somehow that identical words are indexed only once in it. You could then run the suggestion queries against this separate index. This is far from ideal, if only because we would then need to keep this index in sync with the other index that we need for our other queries.

Find distinct inner objects in Elasticsearch

We're trying to find distinct inner objects in Elasticsearch. This would be a minimum example for our case.
We're stuck with something like the following mapping (changing types or indices or adding new fields wouldn't be a problem, but the structure should remain as it is):
{
"building": {
"properties": {
"street": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"house number": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"city": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"people": {
"type": "object",
"store": "yes",
"index": "not_analyzed",
"properties": {
"firstName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"lastName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
}
}
}
}
}
}
Assuming we have this example data:
{
"buildings": [
{
"street": "Baker Street",
"house number": "221 B",
"city": "London",
"people": [
{
"firstName": "John",
"lastName": "Doe"
},
{
"firstName": "Jane",
"lastName": "Doe"
}
]
},
{
"street": "Baker Street",
"house number": "5",
"city": "London",
"people": [
{
"firstName": "John",
"lastName": "Doe"
}
]
},
{
"street": "Garden Street",
"house number": "1",
"city": "London",
"people": [
{
"firstName": "Jane",
"lastName": "Smith"
}
]
}
]
}
When we query for the street "Baker Street" (and whatever additional options needed), we expect to get the following list:
[
{
"firstName": "John",
"lastName": "Doe"
},
{
"firstName": "Jane",
"lastName": "Doe"
}
]
The format does not matter too much, but we should be able to parse the first and last name. However, as our actual data set is much larger, we need the entries to be distinct.
We are using Elasticsearch 1.7.
We finally solved our problem.
Our solution is (as we expected) a pre-calculated people_all field. But instead of using copy_to or transform we're just writing it as we are writing the other fields when importing our data. The field looks as follows:
"people": {
"type": "nested",
..
"properties": {
"firstName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"lastName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"people_all": {
"type": "string",
"index": "not_analyzed"
}
}
}
Please pay attention to the "index": "not_analyzed" setting on the people_all field. This is important for getting complete buckets. Without it, our example would return three buckets: "john", "jane" and "doe".
After writing this new field we can run an aggregation as follows:
{
"size": 0,
"query": {
"term": {
"street": "Baker Street"
}
},
"aggs": {
"people_distinct": {
"nested": {
"path": "people"
},
"aggs": {
"people_all_distinct": {
"terms": {
"field": "people.people_all",
"size": 0
}
}
}
}
}
}
And we get the following response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"people_distinct": {
"doc_count": 3,
"people_all_distinct": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 2
},
{
"key": "Jane Doe",
"doc_count": 1
}
]
}
}
}
}
From the buckets in the response we can now build the distinct people objects.
Please let us know if there is a better way to reach our goal.
Parsing the buckets is not an optimal solution; it would be nicer to have the fields firstName and lastName in each bucket.
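For illustration, turning the buckets back into people objects could look like the sketch below. It assumes people_all was written as the first name, a single space, then the last name (which matches the bucket keys shown), and that the response is keyed by the aggregation names used in the request; the function name is made up.

```python
def people_from_buckets(response):
    """Rebuild {firstName, lastName} objects from the terms buckets."""
    nested = response["aggregations"]["people_distinct"]
    buckets = nested["people_all_distinct"]["buckets"]
    people = []
    for bucket in buckets:
        # Assumption: the key is "<firstName> <lastName>", split at the first space.
        first_name, _, last_name = bucket["key"].partition(" ")
        people.append({"firstName": first_name, "lastName": last_name})
    return people


# Trimmed copy of the aggregation part of the response.
response = {
    "aggregations": {
        "people_distinct": {
            "doc_count": 3,
            "people_all_distinct": {
                "buckets": [
                    {"key": "John Doe", "doc_count": 2},
                    {"key": "Jane Doe", "doc_count": 1},
                ]
            },
        }
    }
}

people = people_from_buckets(response)
```

Splitting on a space is ambiguous for names that contain spaces themselves, so a separator that cannot occur in a name would be a safer choice when writing people_all.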
As suggested in the comment, your mapping of people should be of type nested rather than object, as object could give unexpected results. You also need to reindex your data after changing it.
As for the question, you need to aggregate the results of your query:
{
"query": {
"term": {
"street": "Baker Street"
}
},
"aggs": {
"distinct_people": {
"terms": {
"field": "people",
"size": 1000
}
}
}
}
Please note that I have set size to 1000 inside the aggregation; you might need a bigger number to get all distinct people, since ES returns only 10 buckets by default.
You could set the query size to 0, or use the parameter search_type=count, if you are interested only in the aggregated buckets.
You can read more about aggregations here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
I hope this helps! Let me know if this does not work out.

Sort nested object in Elasticsearch

I'm using the following mapping:
PUT /my_index
{
"mappings": {
"blogpost": {
"properties": {
"title": {"type": "string"},
"comments": {
"type": "nested",
"properties": {
"comment": { "type": "string" },
"date": { "type": "date" }
}
}
}
}
}
}
Example of document:
PUT /my_index/blogpost/1
{
"title": "Nest eggs",
"comments": [
{
"comment": "Great article",
"date": "2014-09-01"
},
{
"comment": "More like this please",
"date": "2014-10-22"
},
{
"comment": "Visit my website",
"date": "2014-07-02"
},
{
"comment": "Awesome",
"date": "2014-08-23"
}
]
}
My question is: how can I retrieve this document with the nested "comments" objects sorted by "date"? The desired result:
PUT /my_index/blogpost/1
{
"title": "Nest eggs",
"comments": [
{
"comment": "Visit my website",
"date": "2014-07-02"
},
{
"comment": "Awesome",
"date": "2014-08-23"
},
{
"comment": "Great article",
"date": "2014-09-01"
},
{
"comment": "More like this please",
"date": "2014-10-22"
}
]
}
You need to sort on the inner_hits to sort the nested objects. This will give you the desired output:
GET my_index/_search
{
"query": {
"nested": {
"path": "comments",
"query": {
"match_all": {}
},
"inner_hits": {
"sort": {
"comments.date": {
"order": "asc"
}
},
"size": 5
}
}
},
"_source": [
"title"
]
}
I am using source filtering to return only "title", because the comments are retrieved inside inner_hits, but you can skip that if you want.
size is 5 because the default value is 3 and we have 4 objects in the given example.
Hope this helps!
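As a sketch of how the sorted comments could be read back from such a response, the snippet below walks the inner_hits section of each hit. The response dictionary is hand-written and trimmed for the example; it assumes the inner hits are returned under the nested path name ("comments"), and the function name is made up.

```python
def posts_with_sorted_comments(response):
    """Pair each post title with its comments in inner_hits order."""
    posts = []
    for hit in response["hits"]["hits"]:
        inner = hit["inner_hits"]["comments"]["hits"]["hits"]
        posts.append({
            "title": hit["_source"]["title"],
            # The order comes from the sort given inside inner_hits.
            "comments": [comment["_source"] for comment in inner],
        })
    return posts


# Hand-written, trimmed response for the example document.
response = {
    "hits": {
        "hits": [
            {
                "_source": {"title": "Nest eggs"},
                "inner_hits": {
                    "comments": {
                        "hits": {
                            "hits": [
                                {"_source": {"comment": "Visit my website", "date": "2014-07-02"}},
                                {"_source": {"comment": "Awesome", "date": "2014-08-23"}},
                                {"_source": {"comment": "Great article", "date": "2014-09-01"}},
                                {"_source": {"comment": "More like this please", "date": "2014-10-22"}},
                            ]
                        }
                    }
                },
            }
        ]
    }
}

posts = posts_with_sorted_comments(response)
```

The function does no sorting of its own; it relies on Elasticsearch having ordered the inner hits by comments.date.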
