How to get results weighted by references from ElasticSearch - elasticsearch

I have a dataset consisting of Notes referencing other Notes.
{id:"aaa", note: "lorem ipsum", references: ["ccc"]},
{id:"bbb", note: "lorem ipsum", references: ["aaa","ccc"]},
{id:"ccc", note: "lorem ipsum", references: ["bbb"]},
I want Elasticsearch to use the references to weight the results, so in this case if I search for lorem I should get id "ccc" back first, since it is referenced most often. According to the docs, the Graph feature does exactly this, and I have also seen examples that do similar things, but none of them explain how this maps onto the index. So my question is: how does one set up an ES index that uses references (indices)?

Other answers gave some clues, but then #7379490 provided the answer in another channel:
There is no way of doing this directly in ES. There are two possible solutions:
Precalculate the number of references each document receives and pass it to ES as a new field on the document.
Or run a terms aggregation on the references field and use the bucket counts to sort the response. Note that aggs must sit at the top level of the request body, next to query, not inside it:
{
  "query": {
    "match": {
      "note": "lorem"
    }
  },
  "aggs": {
    "references": {
      "terms": {
        "field": "references.keyword",
        "order": { "_count": "desc" }
      }
    }
  }
}
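The aggregation comes back separately from the hits, so the reordering still has to happen on the client. Here is a minimal Python sketch of that step, assuming the response has the usual hits and aggregations shape and that each _source carries the id field. The function name and the sample response are hypothetical. Keep in mind that the buckets only count references made by documents that matched the query.

```python
def sort_hits_by_reference_count(response, id_field="id", agg_name="references"):
    """Order hits by how often their id appears in the terms buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    hits = response["hits"]["hits"]
    # sorted() is stable, so hits with equal counts keep Elasticsearch's order.
    return sorted(
        hits,
        key=lambda hit: counts.get(hit["_source"][id_field], 0),
        reverse=True,
    )


# Hypothetical response for the three notes from the question.
response = {
    "hits": {"hits": [
        {"_source": {"id": "aaa", "note": "lorem ipsum", "references": ["ccc"]}},
        {"_source": {"id": "bbb", "note": "lorem ipsum", "references": ["aaa", "ccc"]}},
        {"_source": {"id": "ccc", "note": "lorem ipsum", "references": ["bbb"]}},
    ]},
    "aggregations": {"references": {"buckets": [
        {"key": "ccc", "doc_count": 2},
        {"key": "aaa", "doc_count": 1},
        {"key": "bbb", "doc_count": 1},
    ]}},
}

ranked_ids = [hit["_source"]["id"] for hit in sort_hits_by_reference_count(response)]
```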

function_score can be used to give a higher score to documents with more values in the references field (that is, documents that reference more notes).
Mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"note": {
"type": "text"
},
"references": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Query (the script multiplies _score by the number of values in references.keyword):
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "note": "lorem"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score * doc['references.keyword'].length"
          }
        }
      ]
    }
  }
}
Result:
"hits" : [
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iJObRHIBXTOhZcHPaYks",
"_score" : 0.035661265,
"_source" : {
"id" : 2,
"note" : "lorem ipsum",
"references" : [
1,
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "h5ObRHIBXTOhZcHPYok2",
"_score" : 0.017830633,
"_source" : {
"id" : 1,
"note" : "lorem ipsum",
"references" : [
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iZObRHIBXTOhZcHPb4k4",
"_score" : 0.017830633,
"_source" : {
"id" : 3,
"note" : "lorem ipsum",
"references" : [
2
]
}
}
]

If your data contains referencesGiven and you need to search by referencesReceived, I would recommend a two-pass approach:
1. Insert all documents with an empty referencesReceived: [] field (or referencesReceivedCount: 0 if a count is enough).
2. For each document, for each item in referencesGiven, update the document that receives the reference.
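The two passes can also be done in memory before indexing. The following Python sketch assumes the notes fit in memory and have the id and references fields from the question. The function name and the referencesReceivedCount field name are illustrative, not part of any Elasticsearch API.

```python
from collections import Counter


def count_references_received(notes):
    """Return copies of the notes with a referencesReceivedCount field added."""
    received = Counter()
    # Pass 1: walk the outgoing references and tally them per target id.
    for note in notes:
        # Assumption: a note listing the same target twice counts once.
        for target in set(note["references"]):
            received[target] += 1
    # Pass 2: attach the tally to each note (0 when nobody references it).
    return [{**note, "referencesReceivedCount": received[note["id"]]} for note in notes]


notes = [
    {"id": "aaa", "note": "lorem ipsum", "references": ["ccc"]},
    {"id": "bbb", "note": "lorem ipsum", "references": ["aaa", "ccc"]},
    {"id": "ccc", "note": "lorem ipsum", "references": ["bbb"]},
]
enriched = count_references_received(notes)
received_by_id = {n["id"]: n["referencesReceivedCount"] for n in enriched}
```

The resulting count could then be indexed as a numeric field and used for sorting or scoring.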

Related

How to Order Completion Suggester with Fuzziness

When using a Completion Suggester with fuzziness defined, the suggestions are ordered alphabetically instead of by relevance. It seems that a number of characters equal to the fuzziness value is dropped from the end of the query term. This is not what I expected from reading the Completion Suggester fuzziness docs, which state:
Suggestions that share the longest prefix to the query prefix will be scored higher.
But that is not what happens. Here is a use case that demonstrates it:
PUT test/
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"title":{
"type":"keyword",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
POST test/_bulk
{ "index" : {"_id": "1"}}
{ "title": "HOLARAT" }
{ "index" : {"_id": "2"}}
{ "title": "HOLBROOK" }
{ "index" : {"_id": "3"}}
{ "title": "HOLCONNEN" }
{ "index" : {"_id": "4"}}
{ "title": "HOLDEN" }
{ "index" : {"_id": "5"}}
{ "title": "HOLLAND" }
The above creates an index and adds some data.
If a suggestion query is done on said data:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"title-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
It returns the first 3 results in alphabetical order of the last matching character, instead of the longest prefix (which would be HOLLAND):
{
...
"suggest" : {
"title-suggestion" : [
{
"text" : "HOLL",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "HOLARAT",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"title" : "HOLARAT"
}
},
{
"text" : "HOLBROOK",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"title" : "HOLBROOK"
}
},
{
"text" : "HOLCONNEN",
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.0,
"_source" : {
"title" : "HOLCONNEN"
}
}
]
}
]
}
}
If the size param is removed then we can see that the score is the same for all entries, instead of the longest prefix being higher as stated.
With this being the case, how can results from Completion Suggesters with Fuzziness defined be ordered with the longest prefix at the top?
This has been reported in the past and this behavior is actually by design.
What I usually do in this case is to send two suggest queries (similar to what has been suggested here), one for exact match and another for fuzzy match. If the exact match contains a suggestion, I use it, otherwise I resort to using the fuzzy ones.
With the suggest query below, you'll get HOLLAND as exact-suggestion and then the fuzzy matches in fuzzy-suggestion:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"fuzzy-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
},
"exact-suggestion": {
"completion": {
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
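The fallback between the two suggestions then happens on the client. Here is a small Python sketch of that step, assuming the response is already parsed into a dict and uses the suggestion names from the query above. The function name is hypothetical.

```python
def pick_suggestions(response, exact="exact-suggestion", fuzzy="fuzzy-suggestion"):
    """Prefer exact-prefix suggestions and fall back to the fuzzy ones."""
    suggest = response.get("suggest", {})

    def options(name):
        # Each suggestion holds a list of entries, each with its own options.
        return [opt["text"] for entry in suggest.get(name, []) for opt in entry["options"]]

    # An empty list is falsy, so `or` falls through to the fuzzy matches.
    return options(exact) or options(fuzzy)


# Hypothetical, trimmed-down response for the prefix "HOLL".
response = {
    "suggest": {
        "fuzzy-suggestion": [{"text": "HOLL", "options": [
            {"text": "HOLARAT"}, {"text": "HOLBROOK"}, {"text": "HOLCONNEN"},
        ]}],
        "exact-suggestion": [{"text": "HOLL", "options": [{"text": "HOLLAND"}]}],
    }
}
chosen = pick_suggestions(response)
```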

Getting only the most recent records from ElasticSearch

I have a user index in Elasticsearch where each user has a name, various other user-related information, and an indexedAt field that specifies when the user information was indexed. Whenever any information about a user changes, I create a new record for that user and store it. Therefore each user can have multiple records in the index.
Now I simply want to get only the most up-to-date information for the queried users.
For example, if I run the following query, it will return all of the records for John and Smith. But I want only the most recent record for each user.
{
"size": 10000,
"query": {
"bool": {
"should": [
{
"match_phrase": {
"name": "John"
}
},
{
"match_phrase": {
"name": "Smith"
}
}
]
}
},
"sort": [
{
"indexedAt": {
"order": "desc"
}
}
]
}
You can use field collapsing with inner_hits to get your answer:
GET /temp_index/_search
{
"size": 10,
"query": {
"bool": {
"should": [
{
"match_phrase": {
"name": "John"
}
},
{
"match_phrase": {
"name": "Smith"
}
}
]
}
},
"collapse": {
"field": "name.keyword",
"inner_hits": {
"name": "most_recent",
"size": 1,
"sort": [{"indexedAt": "desc"}]
}
}
}
This will give you a result similar to the one below:
{
"_index" : "temp_index",
"_type" : "_doc",
"_id" : "KSHBjnMBPr3VGlJjXe3d",
"_score" : 0.8266786,
"_source" : {
"name" : "John",
"indexedAt" : 1015
},
"fields" : {
"name.keyword" : [
"John"
]
},
"inner_hits" : {
"most_recent" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "temp_index",
"_type" : "_doc",
"_id" : "LyHBjnMBPr3VGlJji-24",
"_score" : null,
"_source" : {
"name" : "John",
"indexedAt" : 1050
},
"sort" : [
1050
]
}
]
}
}
}
},
You can access the inner_hits portion to get the document which was most recently indexed (i.e. with the largest indexedAt value)
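Pulling the most recent record out of each collapsed hit might look like the Python sketch below. It assumes the response is parsed into a dict and that the inner_hits block is named most_recent as in the query above. The function name and the sample response are hypothetical.

```python
def latest_per_group(response, inner_name="most_recent"):
    """Return the _source of the newest document for each collapsed group."""
    latest = []
    for hit in response["hits"]["hits"]:
        inner = hit["inner_hits"][inner_name]["hits"]["hits"]
        # inner_hits was requested with size 1 sorted by indexedAt desc,
        # so the first entry is the most recent record of the group.
        if inner:
            latest.append(inner[0]["_source"])
    return latest


# Hypothetical, trimmed-down response with a single collapsed group.
response = {"hits": {"hits": [
    {
        "_source": {"name": "John", "indexedAt": 1015},
        "inner_hits": {"most_recent": {"hits": {"hits": [
            {"_source": {"name": "John", "indexedAt": 1050}, "sort": [1050]},
        ]}}},
    },
]}}
latest = latest_per_group(response)
```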

Skip duplicates on field in a Elasticsearch search result

Is it possible to remove duplicates on a given field?
For example the following query:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"_source": [
"name_admin",
"parent_sku",
"sku"
],
"size": 2
}
is retrieving
"hits" : [
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central30603",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816401",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
},
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central156578",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816395",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
}
]
I'd like to skip duplicates on parent_sku so that I only get one result per parent_sku, as is possible with suggesters by setting "skip_duplicates": true.
I know I could achieve this with an aggregation, but I'd like to stick with a search, as my query is a bit more complicated and I'm using the scroll API, which doesn't work with aggregations.
Field collapsing should help here:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"collapse" : {
"field" : "parent_sku",
"inner_hits": {
"name": "parent",
"size": 1
}
},
"_source": false,
"size": 2
}
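The above query will return one document per parent_sku. Note, however, that collapse cannot be used in a scroll context, so it will not work together with the scroll API.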
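If the scroll API has to stay, one fallback is to drop the duplicates on the client while iterating over the pages. The Python sketch below assumes the hits are plain dicts with a _source that contains parent_sku and that the set of seen keys fits in memory. The function name is hypothetical.

```python
def skip_duplicates(hits, field="parent_sku"):
    """Yield only the first hit seen for each value of `field`."""
    seen = set()
    for hit in hits:
        # Hits without the field share the key None and form a single group.
        key = hit["_source"].get(field)
        if key in seen:
            continue
        seen.add(key)
        yield hit


hits = [
    {"_id": "central30603", "_source": {"parent_sku": "SSP57", "sku": "SSP57816401"}},
    {"_id": "central156578", "_source": {"parent_sku": "SSP57", "sku": "SSP57816395"}},
]
unique_ids = [hit["_id"] for hit in skip_duplicates(hits)]
```

Because the function is a generator, it can be fed an iterator that chains all scroll pages, so the seen set carries over from one page to the next.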

Filter nested objects in ElasticSearch 6.8.1

I couldn't find an answer on how to do a simple thing in Elasticsearch 6.8: I need to filter nested objects.
Index
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
}
},
"mappings": {
"human": {
"properties": {
"cats": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"breed": {
"type": "text"
},
"colors": {
"type": "integer"
}
}
},
"name": {
"type": "text"
}
}
}
}
}
Data
{
"name": "iridakos",
"cats": [
{
"colors": 1,
"name": "Irida",
"breed": "European Shorthair"
},
{
"colors": 2,
"name": "Phoebe",
"breed": "european"
},
{
"colors": 3,
"name": "Nino",
"breed": "Aegean"
}
]
}
Select the human with name="iridakos" and the cats whose breed contains 'European' (ignoring case).
Only two cats should be returned.
Many thanks for helping.
For nested datatypes, you need to use nested queries.
Elasticsearch always returns the entire document in the response. Note that with the nested datatype, every item in the list is indexed as a separate hidden document.
Hence, if you also want to know which nested items matched, in addition to getting the entire document, you need to use the inner_hits feature.
The query below should help you.
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "iridakos"
}
},
{
"nested": {
"path": "cats",
"query": {
"match": {
"cats.breed": "european"
}
},
"inner_hits": {}
}
}
]
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.74455214,
"hits" : [
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1", <--- The document that hit
"_score" : 0.74455214,
"_source" : {
"name" : "iridakos",
"cats" : [
{
"colors" : 1,
"name" : "Irida",
"breed" : "European Shorthair"
},
{
"colors" : 2,
"name" : "Phoebe",
"breed" : "european"
},
{
"colors" : 3,
"name" : "Nino",
"breed" : "Aegean"
}
]
},
"inner_hits" : { <---- Note this
"cats" : {
"hits" : {
"total" : {
"value" : 2, <---- Count of nested doc hits
"relation" : "eq"
},
"max_score" : 0.52354836,
"hits" : [
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "cats",
"offset" : 1
},
"_score" : 0.52354836,
"_source" : { <---- First Nested Document
"breed" : "european"
}
},
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "cats",
"offset" : 0
},
"_score" : 0.39019167,
"_source" : { <---- Second Document
"breed" : "European Shorthair"
}
}
]
}
}
}
}
]
}
}
Note the inner_hits section in the response: that is where you will find the nested documents that actually matched.
Hope this helps!
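To turn that response into the filtered list of cats, the _nested.offset of each inner hit can be mapped back to the cats array in _source. Here is a minimal Python sketch, assuming the hit is parsed into a dict shaped like the response above. The function name is hypothetical.

```python
def matched_nested(hit, path="cats"):
    """Return the full nested objects from _source that matched the nested query."""
    inner = hit["inner_hits"][path]["hits"]["hits"]
    # Inner hits arrive ordered by score; sort offsets to restore array order.
    offsets = sorted(h["_nested"]["offset"] for h in inner)
    return [hit["_source"][path][offset] for offset in offsets]


# Trimmed-down version of the hit shown above.
hit = {
    "_source": {
        "name": "iridakos",
        "cats": [
            {"colors": 1, "name": "Irida", "breed": "European Shorthair"},
            {"colors": 2, "name": "Phoebe", "breed": "european"},
            {"colors": 3, "name": "Nino", "breed": "Aegean"},
        ],
    },
    "inner_hits": {"cats": {"hits": {"hits": [
        {"_nested": {"field": "cats", "offset": 1}},
        {"_nested": {"field": "cats", "offset": 0}},
    ]}}},
}
matched_names = [cat["name"] for cat in matched_nested(hit)]
```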
You could use something like this:
{
"query": {
"bool": {
"must": [
{ "match": { "name": "iridakos" }},
{ "match": { "cats.breed": "European" }}
]
}
}
}
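To search on a cat's breed, you can use dot notation. Note that a plain match on cats.breed only finds documents when cats is mapped as a regular object field; with the nested mapping shown in the question, the clause has to be wrapped in a nested query.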

Is there any way to add the field in document but hide it from _source, also document should be analysed and searchable

I want to add a field to the document that is searchable but does not appear under _source when we do a get or search.
I have tried the index and store options, but it is not achievable with them.
It is more like _all or copy_to, but in my case the value is provided by me (not collected from other fields of the document).
I am looking for a mapping with which I can achieve the cases below.
When I put a document:
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
and do search
GET my_index/_search
output should be
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
}
also when I do the below search
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
it should return
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
Simply use source filtering to exclude the content field:
GET my_index/_search
{
"_source": {
"excludes": [ "content" ]
},
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
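If the exclusion has to be applied to many different queries, it can be added in one place on the client. This is a small Python sketch, assuming request bodies are built as dicts before being sent. The helper name is hypothetical and not part of any Elasticsearch client.

```python
import copy


def exclude_from_source(body, *fields):
    """Return a copy of a search body whose _source.excludes lists `fields`."""
    new_body = copy.deepcopy(body)  # leave the caller's dict untouched
    source = new_body.setdefault("_source", {})
    if not isinstance(source, dict):
        raise ValueError("expected _source to be absent or an includes/excludes object")
    excludes = source.setdefault("excludes", [])
    for field in fields:
        if field not in excludes:
            excludes.append(field)
    return new_body


query = {"query": {"query_string": {"default_field": "content", "query": "long content"}}}
search_body = exclude_from_source(query, "content")
```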
We can achieve this using the mapping below:
PUT my_index
{
"mappings": {
"_doc": {
"_source": {
"excludes": [
"content"
]
},
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
}
Add a document:
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
When you run a query that searches the 'content' field:
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
You will get the result with hits as below:
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"date" : "2015-01-01",
"title" : "Some short title"
}
}
]
}
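It hides the field 'content'. :)
So this is achieved with the help of the mapping, and you don't need to exclude the field in every get/search call. The trade-off is that the field is no longer kept in _source, so its original value cannot be retrieved later, and operations that rely on _source, such as reindexing, will not carry it over.
More reading on _source:
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/mapping-source-field.html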