Get only last version (custom field) of document when executing a search - elasticsearch

I am using the Java API for elasticsearch and I am trying to get only the last version (which is a custom field) of each document when executing a search.
For example :
{ id: 1, name: "John Greenwood", version: 1}
{ id: 1, name: "John Greenwood", version: 2}
{ id: 2, name: "John Underwood", version: 1}
While searching with John, I want this result:
{ id: 1, name: "John Greenwood", version: 2}
{ id: 2, name: "John Underwood", version: 1}
Apparently I am supposed to use aggregations, but I'm not sure how to use them with the Java API.
Also, how can I group the documents by ID? I only want the latest version for each ID.

TL;DR
Yes, you are on the right track.
You will want to aggregate on the id of each user, then get the top hit per group with regard to the version.
Solution
The first aggregation, per_id, groups the users by their id. Inside it, a sub-aggregation, lastest_version, selects the best hit with regard to the version: hits are sorted by version in descending order, and size: 1 keeps only the top one per group.
Note that a terms aggregation returns only the top 10 buckets by default, so set size on per_id if you expect more ids.
GET 74550367/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "per_id": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "lastest_version": {
          "top_hits": {
            "sort": [
              {
                "version": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}
To Reproduce
POST _bulk
{ "index": {"_index":"74550367"}}
{ "id": 1, "name": "John Greenwood", "version": 1}
{ "index": {"_index":"74550367"}}
{ "id": 1, "name": "John Greenwood", "version": 2}
{ "index": {"_index":"74550367"}}
{ "id": 2, "name": "John Underwood", "version": 1}
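Once the search returns, the latest document of each id sits inside the top_hits sub-aggregation of every bucket. A minimal sketch of pulling those documents out of the response, assuming the aggregation names per_id and lastest_version from the query above; the function name extract_latest and the hand-written sample_response are illustrative only:

```python
from typing import Any, Dict, List


def extract_latest(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the _source of the single top hit in every per_id bucket."""
    latest = []
    for bucket in response["aggregations"]["per_id"]["buckets"]:
        hits = bucket["lastest_version"]["hits"]["hits"]
        if hits:  # size: 1 in the query, so at most one hit per bucket
            latest.append(hits[0]["_source"])
    return latest


# Hand-written fragment in the shape Elasticsearch uses for this kind of
# aggregation (only the fields read above are included).
sample_response = {
    "aggregations": {
        "per_id": {
            "buckets": [
                {
                    "key": 1,
                    "doc_count": 2,
                    "lastest_version": {
                        "hits": {"hits": [
                            {"_source": {"id": 1, "name": "John Greenwood", "version": 2}}
                        ]}
                    },
                },
                {
                    "key": 2,
                    "doc_count": 1,
                    "lastest_version": {
                        "hits": {"hits": [
                            {"_source": {"id": 2, "name": "John Underwood", "version": 1}}
                        ]}
                    },
                },
            ]
        }
    }
}

for doc in extract_latest(sample_response):
    print(doc)
```

The same traversal applies in the Java API: walk the buckets of the terms aggregation and read the first hit of each nested top_hits result.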

Related

Search by results of previous search in elasticsearch

Is it possible to run a search using the results of another search? For example:
// index: A
{ "ID": 1, "status": "done" }
{ "ID": 2, "status": "processing" }
{ "ID": 3, "status": "done" }
{ "ID": 4, "status": "done" }
// index: B
{ "ID": 1, "user": 1, "value": 10 }
{ "ID": 1, "user": 2, "value": 3 }
{ "ID": 2, "user": 1,"value": 1 }
{ "ID": 3, "user": 1, "value": 3 }
{ "ID": 4, "user": 1, "value": 7 }
Q1: Search in index "A" status == "done" and return the ID
RES: 1,3,4
Q2: From the results in Q1 search value > 5 and return the ID
RES: 1,4
My current solution is to use two queries: download the results of Q1, then run a second search for Q2. But this is very complicated because there are 30k results.
To me this looks like a traditional combination of filters across two indexes, that is, a sort of join like the ones we have in relational databases. I'm not sure of the exact solution, but I recently used a plug-in for joins that might help: https://siren.io/siren-federate-20-0-introducing-a-scalable-inner-join-for-elasticsearch/
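If a plug-in is not an option, the two-query approach from the question can at least be made more manageable by sending the Q1 ids to index B in batches. A rough sketch of building the Q2 request bodies, assuming the field names ID and value from the example; the function name build_q2_bodies and the batch size of 10000 are arbitrary choices for illustration (the terms query has a per-index limit on the number of terms, index.max_terms_count, which I believe defaults to 65,536):

```python
from typing import Any, Dict, Iterable, List


def build_q2_bodies(ids: Iterable[int], chunk_size: int = 10000) -> List[Dict[str, Any]]:
    """Build one search body per batch of Q1 ids, filtering on value > 5."""
    id_list = list(ids)
    bodies = []
    for start in range(0, len(id_list), chunk_size):
        chunk = id_list[start:start + chunk_size]
        bodies.append({
            "_source": ["ID"],  # only the ID is needed in the result
            "query": {
                "bool": {
                    "filter": [
                        {"terms": {"ID": chunk}},         # ids returned by Q1
                        {"range": {"value": {"gt": 5}}},  # the Q2 condition
                    ]
                }
            },
        })
    return bodies


# Q1 returned 1, 3, 4 in the example; a tiny chunk size shows the batching.
for body in build_q2_bodies([1, 3, 4], chunk_size=2):
    print(body)
```

Each body would be sent to index B as a separate search, and the returned IDs merged on the client.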

Merge / flatten sub aggs into main agg

Is there a way in Elasticsearch to get the results back in a flattened form when there are multiple child/sub aggs?
For instance, I am currently trying to get back all product types and their status (online / offline).
This is what I end up with:
aggs
[
  { key: SuperProduct, doc_count:3, subagg:[
      {status:online, doc_count:1},
      {status:offline, doc_count:2}
    ]
  },
  { key: SuperProduct2, doc_count:10, subagg:[
      {status:online, doc_count:7},
      {status:offline, doc_count:3}
    ]
  }
]
Charting libraries tend to like it flattened, so I was wondering if Elasticsearch could provide it in this sort of manner:
[
{ products_key: 'SuperProduct', status_key:'online', doc_count:1},
{ products_key: 'SuperProduct', status_key:'offline', doc_count:2},
{ products_key: 'SuperProduct2', status_key:'online', doc_count:7},
{ products_key: 'SuperProduct2', status_key:'offline', doc_count:3}
]
Thanks
It is possible with a composite aggregation, which you can use to link two terms aggregations:
// POST /i/_search
{
  "size": 0,
  "aggregations": {
    "distribution": {
      "composite": {
        "sources": [
          {"product": {"terms": {"field": "product.keyword"}}},
          {"status": {"terms": {"field": "status.keyword"}}}
        ]
      }
    }
  }
}
This results in the following structure:
{
"aggregations": {
"distribution": {
"after_key": {
"product": "B",
"status": "online"
},
"buckets": [
{
"key": {
"product": "A",
"status": "offline"
},
"doc_count": 3
},
{
"key": {
"product": "A",
"status": "online"
},
"doc_count": 2
},
{
"key": {
"product": "B",
"status": "offline"
},
"doc_count": 1
},
{
"key": {
"product": "B",
"status": "online"
},
"doc_count": 4
}
]
}
}
}
If for any reason the composite aggregation doesn't fulfill your needs, you can create (via copy_to or by concatenation) or simulate (via scripted fields) a field that uniquely identifies the bucket. In our project we went with concatenation (partly because we needed to collapse on this field), e.g. {"bucket": "SuperProductA:online"}. This results in dirtier output (you'll have to decode that field back or use top hits to get the original values) but still does the job.
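The composite buckets still nest the two values under key, so a small reshaping step is needed to get the exact rows the question asks for. A minimal sketch, assuming the aggregation name distribution and the source names product and status from the query above; the function name flatten_composite and the trimmed sample_response are illustrative:

```python
from typing import Any, Dict, List


def flatten_composite(response: Dict[str, Any], agg_name: str = "distribution") -> List[Dict[str, Any]]:
    """Turn composite buckets into flat rows for a charting library."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        rows.append({
            "products_key": bucket["key"]["product"],
            "status_key": bucket["key"]["status"],
            "doc_count": bucket["doc_count"],
        })
    return rows


# Trimmed copy of the response structure shown above.
sample_response = {
    "aggregations": {
        "distribution": {
            "after_key": {"product": "A", "status": "online"},
            "buckets": [
                {"key": {"product": "A", "status": "offline"}, "doc_count": 3},
                {"key": {"product": "A", "status": "online"}, "doc_count": 2},
            ],
        }
    }
}

for row in flatten_composite(sample_response):
    print(row)
```

A composite aggregation returns 10 buckets per request by default; to fetch the rest, repeat the request with the returned after_key passed as the after parameter.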

Elastic search Average time difference Aggregate Query

I have documents in Elasticsearch, each of which looks something like this:
{
"id": "T12890ADSA12",
"status": "ENDED",
"type": "SAMPLE",
"updatedAt": "2020-05-29T18:18:08.483Z",
"events": [
{
"event": "STARTED",
"version": 1,
"timestamp": "2020-04-30T13:41:25.862Z"
},
{
"event": "INPROGRESS",
"version": 2,
"timestamp": "2020-05-14T17:03:09.137Z"
},
{
"event": "INPROGRESS",
"version": 3,
"timestamp": "2020-05-17T17:03:09.137Z"
},
{
"event": "ENDED",
"version": 4,
"timestamp": "2020-05-29T18:18:08.483Z"
}
],
"createdAt": "2020-04-30T13:41:25.862Z"
}
Now I want to write an Elasticsearch query that selects all documents of type "SAMPLE" and returns the average time between the STARTED and ENDED events across those documents, e.g. avg of (2020-05-29T18:18:08.483Z - 2020-04-30T13:41:25.862Z, ...). Assume that the STARTED and ENDED events each appear only once in the events array. Is there any way I can do that?
You can do something like this. The query selects the documents of type SAMPLE with status ENDED (to make sure there is an ENDED event). Then the avg aggregation uses a script to find the STARTED and ENDED timestamps and returns the number of whole days between them:
POST test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "status.keyword": "ENDED"
          }
        },
        {
          "term": {
            "type.keyword": "SAMPLE"
          }
        }
      ]
    }
  },
  "aggs": {
    "duration": {
      "avg": {
        "script": "Map findEvent(List events, String type) {return events.find(it -> it.event == type);} def started = Instant.parse(findEvent(params._source.events, 'STARTED').timestamp); def ended = Instant.parse(findEvent(params._source.events, 'ENDED').timestamp); return ChronoUnit.DAYS.between(started, ended);"
      }
    }
  }
}
Formatted for readability, the script looks like this:
Map findEvent(List events, String type) {
  return events.find(it -> it.event == type);
}
def started = Instant.parse(findEvent(params._source.events, 'STARTED').timestamp);
def ended = Instant.parse(findEvent(params._source.events, 'ENDED').timestamp);
return ChronoUnit.DAYS.between(started, ended);
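To sanity-check what the aggregation should return, the same arithmetic can be reproduced on the client. A small sketch that mirrors the script's logic (whole days between STARTED and ENDED, then the mean), assuming documents shaped like the one in the question; the function names and the second sample document are made up for illustration:

```python
from datetime import datetime
from typing import Any, Dict, List


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp ending in 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def duration_days(doc: Dict[str, Any]) -> int:
    """Whole days between the STARTED and ENDED events of one document."""
    # STARTED and ENDED are assumed to appear exactly once per document.
    by_event = {e["event"]: e["timestamp"] for e in doc["events"]}
    return (parse_ts(by_event["ENDED"]) - parse_ts(by_event["STARTED"])).days


def average_duration_days(docs: List[Dict[str, Any]]) -> float:
    durations = [duration_days(d) for d in docs]
    return sum(durations) / len(durations)


docs = [
    {  # the document from the question, reduced to the relevant events
        "events": [
            {"event": "STARTED", "timestamp": "2020-04-30T13:41:25.862Z"},
            {"event": "ENDED", "timestamp": "2020-05-29T18:18:08.483Z"},
        ]
    },
    {  # hypothetical second document
        "events": [
            {"event": "STARTED", "timestamp": "2020-01-01T00:00:00.000Z"},
            {"event": "ENDED", "timestamp": "2020-01-02T12:00:00.000Z"},
        ]
    },
]

print(duration_days(docs[0]))       # 29
print(average_duration_days(docs))  # 15.0
```

Like ChronoUnit.DAYS.between, this truncates partial days, so a duration of 29 days and a few hours counts as 29.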

ReferenceManyFields (One to Many Relationship)

I am working on a project where I need to create a one-to-many relationship: fetch the list of all records that reference a given id in another table, and display the selected data in a multi-select field (selectArrayInput). Please help me out with this; an example would be great.
Thanks in advance.
Example:
district
id name
1 A
2 B
3 C
block
id district_id name
1 1 ABC
2 1 XYZ
3 2 DEF
I am using https://github.com/Steams/ra-data-hasura-graphql hasura-graphql dataprovider for my application.
You're likely looking for "nested object queries" (see: https://hasura.io/docs/1.0/graphql/manual/queries/nested-object-queries.html#nested-object-queries)
An example...
query MyQuery {
  district(where: {id: {_eq: 1}}) {
    id
    name
    blocks {
      id
      name
    }
  }
}
result:
{
"data": {
"district": [
{
"id": 1,
"name": "A",
"blocks": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "XYZ"
}
]
}
]
}
}
Or...
query MyQuery2 {
  block(where: {district: {name: {_eq: "A"}}}) {
    id
    name
    district {
      id
      name
    }
  }
}
result:
{
"data": {
"block": [
{
"id": 1,
"name": "ABC",
"district": {
"id": 1,
"name": "A"
}
},
{
"id": 2,
"name": "XYZ",
"district": {
"id": 1,
"name": "A"
}
}
]
}
}
Setting up the tables this way... [screenshots of the blocks and districts table configuration not preserved]
Aside: I recommend using plural table names, "districts" and "blocks", as they are more standard.

Adding sort priority to documents matching certain condition

I am looking for a way to do this. I need to show all experts in the users mapping (experts are documents whose role field equals 3). While showing the experts, I need those having "Linkedin" in their social medias (social_medias is an array field in the users mapping) to come first, and those without "Army" afterwards. For example, I have these documents:
[
{
role: 3,
name: "David",
social_medias: ["Twitter", "Facebook"]
},
{
role: 3,
name: "James",
social_medias: ["Facebook", "Linkedin"]
},
{
role: 3,
name: "Michael",
social_medias: ["Linkedin", "Facebook"]
},
{
role: 3,
name: "Peter",
social_medias: ["Facebook"]
},
{
role: 3,
name: "John",
social_medias: ["Facebook", "Twitter"]
},
{
role: 2,
name: "Babu",
social_medias: ["Linkedin", "Facebook"]
}
]
So, I want to get documents with role 3 and while fetching it, documents having "Linkedin" in social media should come first. So, the output after query should be in this order:
[
{
role: 3,
name: "James",
social_medias: ["Facebook", "Linkedin"]
},
{
role: 3,
name: "Michael",
social_medias: ["Linkedin", "Facebook"]
},
{
role: 3,
name: "David",
social_medias: ["Twitter", "Facebook"]
},
{
role: 3,
name: "Peter",
social_medias: ["Facebook"]
},
{
role: 3,
name: "John",
social_medias: ["Facebook", "Twitter"]
}
]
I am trying function_score now. I can give a particular field more priority in function_score, but I can't figure out how to specify condition-based priority.
Why not let the default sorting in ES (sort by score) do the job for you, without custom ordering or custom scoring:
GET /my_index/media/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {"match": {"social_medias": "Linkedin"}},
            {"match_all": {}},
            {"query_string": {
              "default_field": "social_medias",
              "query": "NOT Army"
            }}
          ]
        }
      },
      "filter": {
        "term": {
          "role": "3"
        }
      }
    }
  }
}
The query above filters for "role":"3", and then the should clause basically says: if a document matches the social_medias field with the value Linkedin, give it a score based on that match. To also include all the other documents that don't match Linkedin, add another should clause for match_all. Now everything that matches match_all gets a score, and documents that also match Linkedin get an additional score, which makes them score higher and appear first in the list of results.
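The filtered query used above was removed in Elasticsearch 5.0, so on newer versions the same idea has to be written as a bool query with a filter clause. A sketch of that request body built as a Python dict, assuming the field names role and social_medias from the question; the function name experts_query is made up for illustration. Because the bool query has a filter clause, the should clause is optional and only contributes to the score, so match_all is no longer needed:

```python
import json
from typing import Any, Dict


def experts_query(role: int = 3, preferred: str = "Linkedin") -> Dict[str, Any]:
    """Filter on role; boost documents whose social_medias match `preferred`."""
    return {
        "query": {
            "bool": {
                # Filter context: restricts the result set, adds nothing to the score.
                "filter": [{"term": {"role": role}}],
                # Optional clause: matching documents get a higher score.
                "should": [{"match": {"social_medias": preferred}}],
            }
        }
    }


print(json.dumps(experts_query(), indent=2))
```

With the default sort by score, the experts that match Linkedin come first and the remaining experts follow.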
