Sort by date + show past results after upcoming results - elasticsearch

In Elastic I'd like to sort results by start_date ascending, but with past dates showing up after upcoming dates.
Example desired results:
[
{id: 5, start_date: '3000-01-01'},
{id: 7, start_date: '3001-01-01'},
{id: 8, start_date: '3002-01-01'},
{id: 1, start_date: '1990-01-01'},
{id: 4, start_date: '1991-01-01'},
{id: 3, start_date: '1992-01-01'},
]
Something like this would be possible in SQL:
ORDER BY (start_date > NOW()) DESC, start_date ASC
But I'm not sure how to accomplish this in Elastic. The only thing I can think of would be to set a boolean is_upcoming flag and reindex that every day.
Also, I may be limiting and paginating the number of search results, so fetching them in reverse start_date order and reordering them in my code isn't really doable.

It's perfectly possible using a sort script if your start_date is of type date and its format is yyyy-MM-dd (I found YYYY-... to not work properly).
GET future/_search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "return doc['start_date'].value.millis > params.now ? (doc['start_date'].value.millis - params.now) : Long.MAX_VALUE",
          "params": {
            "now": 1594637988236
          }
        },
        "order": "asc"
      }
    },
    {
      "start_date": {
        "order": "asc"
      }
    }
  ]
}
The parametrized now is needed for synchronization reasons as described here.
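To illustrate why now is passed as a parameter: the client computes a single timestamp and reuses it for every page of the same result set, so all shards sort against the same reference time. A minimal sketch using the Python client (the index name future and the choice of client are assumptions, not part of the original answer):

import time
from elasticsearch import Elasticsearch  # assumed client, purely for illustration

es = Elasticsearch()
now_ms = int(time.time() * 1000)  # computed once and reused for every page of this result set

body = {
    "sort": [
        {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "painless",
                    "source": "return doc['start_date'].value.millis > params.now ? (doc['start_date'].value.millis - params.now) : Long.MAX_VALUE",
                    "params": {"now": now_ms}
                },
                "order": "asc"
            }
        },
        {"start_date": {"order": "asc"}}
    ]
}
# Upcoming dates come first (soonest to furthest); past dates follow,
# oldest first via the secondary start_date sort.
resp = es.search(index="future", body=body)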

Related

Get only last version (custom field) of document when executing a search

I am using the Java API for elasticsearch and I am trying to get only the last version (which is a custom field) of each document when executing a search.
For example:
{ id: 1, name: "John Greenwood", version: 1}
{ id: 1, name: "John Greenwood", version: 2}
{ id: 2, name: "John Underwood", version: 1}
While searching for John, I want this result:
{ id: 1, name: "John Greenwood", version: 2}
{ id: 2, name: "John Underwood", version: 1}
Apparently I am supposed to use aggregations, but I'm not sure how to use them with the Java API.
Also, how can I group the documents by id? I only want the latest version for each id.
TL;DR:
Yes, you are on the right track.
You will want to aggregate on the id of each user, then take the top hit with regard to the version.
Solution
The first aggregation, per_id, groups documents by their id. Inside it we run a second aggregation, latest_version, which selects the top hit with regard to version; size: 1 returns just the top document per group.
GET 74550367/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "per_id": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "latest_version": {
          "top_hits": {
            "sort": [
              {
                "version": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}
To Reproduce
POST _bulk
{ "index": {"_index":"74550367"}}
{ "id": 1, "name": "John Greenwood", "version": 1}
{ "index": {"_index":"74550367"}}
{ "id": 1, "name": "John Greenwood", "version": 2}
{ "index": {"_index":"74550367"}}
{ "id": 2, "name": "John Underwood", "version": 1}

How to sort case-insensitively without changing the settings

My index name is data_new
Below is the code to insert documents into the index:
test = [
    {'id': 1, 'name': 'A', 'professor': ['Bill Cage', 'accounting']},
    {'id': 2, 'name': 'AB', 'professor': ['Gregg Payne', 'engineering']},
    {'id': 3, 'name': 'a', 'professor': ['Bill Cage', 'accounting']},
    {'id': 4, 'name': 'Tax Accounting 200', 'professor': ['Thomas Baszo', 'finance']},
    {'id': 5, 'name': 'Capital Markets 350', 'professor': ['Thomas Baszo', 'finance']},
    {'id': 6, 'name': 'Theatre 410', 'professor': ['Sebastian Hern', 'art']},
    {'id': 7, 'name': 'Accounting 101', 'professor': ['Thomas Baszo', 'finance']},
    {'id': 8, 'name': 'Marketing 101', 'professor': ['William Smith', 'finance']},
    {'id': 8, 'name': 'Anthropology 230', 'professor': ['Devin Cranford', 'history']},
    {'id': 10, 'name': 'Computer Science 101', 'professor': ['Gregg Payne', 'engineering']}]
from elasticsearch import Elasticsearch
import json

es = Elasticsearch()
es.indices.create(index='data_new', ignore=400)
for e in test:
    es.index(index="data_new", body=e, id=e['id'])
search = es.search(index="data_new", body={"from": 0, "size": 2, "query": {"match_all": {}}})
search['hits']['hits']
Right now the results come back in this order:
[{'id':1,'name': 'A'},
{ 'id':2, 'name': 'AB'},
{'id':3, 'name': 'a'}]
The expected order is:
[{'id':1,'name': 'A'},
{ 'id':3, 'name': 'a'},
{'id':2, 'name': 'AB'}]
For the input ["a", "b", "B", "C", "c", "A"] the result is ["A", "B", "C", "a", "b", "c"], but I want the output to be ["A", "a", "B", "b", "C", "c"].
Expected output: I need to sort the results by name only, case-insensitively; that is, normalise the name keyword and sort on it.
How do I modify search = es.search(index="data_new", body={"from" : 0, "size" : 2,"query": {"match_all": {}}}) to achieve this?
I have updated the code as below:
search = es.search(index="data_new", body={"size": 1000, "query": {"query_string": {"query": "A"}}, "sort": [{"name.keyword": {"order": "asc", "normalizer": "case_insensitive"}}]})
and I got this error:
RequestError: RequestError(400, 'x_content_parse_exception', '[1:41] [field_sort] unknown field [normalizer]')
In order to do this you will have to use a sort script that lowercases the field, e.g. doc['name.keyword'].value.toLowerCase():
https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html
You can find another post that talks about it:
Script-based sorting on Elasticsearch date field
And a good article with an example here:
https://qbox.io/blog/how-to-painless-scripting-in-elasticsearch
The code will look like this (not tested):
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "string",
      "order": "asc",
      "script": {
        "lang": "painless",
        "source": "doc['name.keyword'].value.toLowerCase()"
      }
    }
  }
}
Note: this is bad practice and you should do it only for a one-shot query. If you want your application to stay healthy, implement the solution suggested by saeednasehi instead.
You can also use index sorting for better performance; a sketch follows.
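A minimal sketch of the index-sorting idea, not tested: the index is created with a lowercase-normalized keyword field and index.sort settings so segments are kept pre-sorted on it. The index name data_new_sorted, the extra name_lower field, and the normalizer name are assumptions, not part of the original answers.

from elasticsearch import Elasticsearch  # illustration only, using the client from the question

es = Elasticsearch()
es.indices.create(
    index="data_new_sorted",  # assumed name; not part of the original question
    body={
        "settings": {
            "index": {"sort.field": "name_lower", "sort.order": "asc"},
            "analysis": {
                "normalizer": {
                    "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
                }
            }
        },
        "mappings": {
            "properties": {
                "name": {"type": "keyword", "copy_to": "name_lower"},
                # Sorting on this field is case-insensitive thanks to the normalizer.
                "name_lower": {"type": "keyword", "normalizer": "lowercase_normalizer"}
            }
        }
    }
)
# Search-time sorts on name_lower are then case-insensitive and match the on-disk order.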
In order to use a normalizer, you need to define it in your mapping; you cannot pass it as an argument in your search. In your case, you need two fields to sort on. I have done this by copying the data to a second field: the first field has a lowercase normalizer and the other one does not.
PUT /test_index/
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "name": {
          "normalizer": "myLowercase",
          "type": "keyword",
          "copy_to": [
            "name2"
          ]
        },
        "name2": {
          "type": "keyword"
        }
      }
    }
  }
}
And your query would be something like this:
GET test_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "name": {
        "order": "asc"
      }
    },
    {
      "name2": {
        "order": "asc"
      }
    }
  ]
}
This is the mapping and setting that you need for the name field in your index, and you should add your other fields to the mapping as well. Note that this is for Elasticsearch versions below 7; on Elasticsearch 7 you must remove the document type (named post here) from the mapping, as sketched below.
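For reference, on Elasticsearch 7+ the same mapping without the post document type might look like this (a sketch using the Python client from the question, not tested):

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index="test_index",
    body={
        "settings": {
            "analysis": {
                "normalizer": {
                    "myLowercase": {"type": "custom", "filter": ["lowercase"]}
                }
            }
        },
        "mappings": {  # no "post" document-type level on 7.x and later
            "properties": {
                "name": {"type": "keyword", "normalizer": "myLowercase", "copy_to": ["name2"]},
                "name2": {"type": "keyword"}
            }
        }
    }
)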

How do I sort buckets by Term Aggregation's nested doc_count?

I have an index, invoices, that I need to aggregate into yearly buckets then sort.
I have succeeded in using Bucket Sort to sort my buckets by simple sum values (revenue and tax). However, I am struggling to sort by more deeply nested doc_count values (status).
I want to order my buckets not only by revenue, but also by the number of docs with a status field equal to 1, 2, 3 etc...
The documents in my index look like this:
"_source": {
  "created_at": "2018-07-07T03:11:34.327Z",
  "status": 3,
  "revenue": 68.474,
  "tax": 6.85
}
I request my aggregations like this:
const params = {
  index: 'invoices',
  size: 0,
  body: {
    aggs: {
      sales: {
        date_histogram: {
          field: 'created_at',
          interval: 'year',
        },
        aggs: {
          total_revenue: { sum: { field: 'revenue' } },
          total_tax: { sum: { field: 'tax' } },
          statuses: {
            terms: {
              field: 'status',
            },
          },
          sales_bucket_sort: {
            bucket_sort: {
              sort: [{ total_revenue: { order: 'desc' } }],
            },
          },
        },
      },
    },
  },
}
The response (truncated) looks like this:
"aggregations": {
"sales": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000Z",
"key": 1451606400000,
"doc_count": 254,
"total_tax": {
"value": 735.53
},
"statuses": {
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 59
},
{
"key": 1,
"doc_count": 58
},
{
"key": 5,
"doc_count": 57
},
{
"key": 3,
"doc_count": 40
},
{
"key": 4,
"doc_count": 40
}
]
},
"total_revenue": {
"value": 7355.376005351543
}
},
]
}
}
I want to sort by key: 1, for example. Order the buckets according to which one has the greatest number of docs with a status value of 1. I tried to order my terms aggregation, then specify the desired key like this:
statuses: {
  terms: {
    field: 'status',
    order: { _key: 'asc' },
  },
},
sales_bucket_sort: {
  bucket_sort: {
    sort: [{ 'statuses.buckets[0]._doc_count': { order: 'desc' } }],
  },
},
However this did not work. It didn't error, it just doesn't seem to have any effect.
I noticed someone else on SO had a similar question many years ago, but I was hoping a better answer had emerged since then: Elasticsearch aggregation. Order by nested bucket doc_count
Thanks!
Never mind, I figured it out. I added a separate filter aggregation like this:
aggs: {
  total_revamnt: { sum: { field: 'revamnt' } },
  total_purchamnt: { sum: { field: 'purchamnt' } },
  approved_invoices: {
    filter: {
      term: {
        status: 1,
      },
    },
  },
Then I was able to bucket sort that value like this:
sales_bucket_sort: {
  bucket_sort: {
    sort: [{ 'approved_invoices>_count': { order: 'asc' } }],
  },
},
In case anyone comes across this issue again: as of Elasticsearch 7.10, this can also be done as follows:
sales_bucket_sort: {
  bucket_sort: {
    sort: [{ '_count': { order: 'asc' } }],
  },
}
With only _count specified, it will automatically take the doc_count and sort accordingly.
I believe this will just sort by the doc_count of the date_histogram buckets themselves, not by the nested status counts.
JP's answer works: create a filter aggregation on the target field/value, then sort by its count; the two pieces are combined below.
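Putting the two pieces together with the field names from the original question (a sketch, not tested; status == 1 stands in for the "approved" status being counted, and it is shown as a Python-client request purely for illustration):

from elasticsearch import Elasticsearch  # illustration only; the original question uses the JavaScript client

es = Elasticsearch()
body = {
    "size": 0,
    "aggs": {
        "sales": {
            # calendar_interval is the 7.x+ spelling; the question used interval: 'year'
            "date_histogram": {"field": "created_at", "calendar_interval": "year"},
            "aggs": {
                "total_revenue": {"sum": {"field": "revenue"}},
                "total_tax": {"sum": {"field": "tax"}},
                # Counts the docs in each yearly bucket whose status is 1 ("approved").
                "approved_invoices": {"filter": {"term": {"status": 1}}},
                "sales_bucket_sort": {
                    "bucket_sort": {
                        # Buckets with the most status == 1 documents come first.
                        "sort": [{"approved_invoices>_count": {"order": "desc"}}]
                    }
                }
            }
        }
    }
}
resp = es.search(index="invoices", body=body)
for bucket in resp["aggregations"]["sales"]["buckets"]:
    print(bucket["key_as_string"], bucket["approved_invoices"]["doc_count"])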

Adding additional fields to ElasticSearch terms aggregation

Indexed documents are like:
{
id: 1,
title: 'Blah',
...
platform: {id: 84, url: 'http://facebook.com', title: 'Facebook'}
...
}
What I want is to count and output stats by platform.
For counting, I can use terms aggregation with platform.id as a field to count:
aggs: {
platforms: {
terms: {field: 'platform.id'}
}
}
This way I receive stats as multiple buckets looking like {key: 8, doc_count: 162511}, as expected.
Now, can I somehow add platform.name and platform.url to those buckets (for pretty output of stats)? The best I've come up with looks like:
aggs: {
  platforms: {
    terms: {field: 'platform.id'},
    aggs: {
      name: {terms: {field: 'platform.name'}},
      url: {terms: {field: 'platform.url'}}
    }
  }
}
This does, in fact, work, and returns a pretty complicated structure in each bucket:
{
  key: 7,
  doc_count: 528568,
  url: {
    doc_count_error_upper_bound: 0,
    sum_other_doc_count: 0,
    buckets: [{key: "http://facebook.com", doc_count: 528568}]
  },
  name: {
    doc_count_error_upper_bound: 0,
    sum_other_doc_count: 0,
    buckets: [{key: "Facebook", doc_count: 528568}]
  }
},
Of course, the platform's name and url could be extracted from this structure (like bucket.url.buckets.first.key), but is there a cleaner and simpler way to do this?
It seems the clearest way to express the intent is a top_hits aggregation: "from each aggregated group select only one document", and then extract platform from it:
aggs: {
  platforms: {
    terms: {field: 'platform.id'},
    aggs: {
      platform: {top_hits: {size: 1, _source: {include: ['platform']}}}
    }
  }
}
This way, each bucket will look like:
{"key": 7,
"doc_count": 529939,
"platform": {
"hits": {
"hits": [{
"_source": {
"platform":
{"id": 7, "name": "Facebook", "url": "http://facebook.com"}
}
}]
}
},
}
This is kinda too deep (as usual with ES), but clean: bucket.platform.hits.hits.first._source.platform.
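On the client side, pulling the platform details out of each bucket is then only a few lines; a minimal sketch with the Python client (the index name my_index is an assumption):

from elasticsearch import Elasticsearch  # illustration only

es = Elasticsearch()
body = {
    "size": 0,
    "aggs": {
        "platforms": {
            "terms": {"field": "platform.id"},
            "aggs": {
                "platform": {"top_hits": {"size": 1, "_source": {"include": ["platform"]}}}
            }
        }
    }
}
resp = es.search(index="my_index", body=body)  # index name is an assumption

# Flatten the buckets into {platform_id: {"doc_count": ..., "name": ..., "url": ...}}.
stats = {}
for bucket in resp["aggregations"]["platforms"]["buckets"]:
    platform = bucket["platform"]["hits"]["hits"][0]["_source"]["platform"]
    stats[bucket["key"]] = {
        "doc_count": bucket["doc_count"],
        "name": platform["name"],
        "url": platform["url"]
    }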
If you don't necessarily need to get the value of platform.id, you could get away with a single aggregation instead using a script that concatenates the two fields name and url:
aggs: {
  platforms: {
    terms: {script: 'doc["platform.name"].value + "," + doc["platform.url"].value'}
  }
}

Sort by date in mongoose aggregation framework

I'm working on a Node.js + MongoDB project using mongoose. Now I have come across a question I don't know the answer to.
I am using the aggregation framework to get grouped results. The grouping is done on a date field with the time portion excluded, like "2013-02-06". The code looks like this:
MyModel.aggregate([
  {$match: {$and: [{created_date: {$gte: start_date}}, {created_date: {$lte: end_date}}]}},
  {$group: {
    _id: {
      year: {$year: "$created_at"},
      month: {$month: "$created_at"},
      day: {$dayOfMonth: "$created_at"}
    },
    count: {$sum: 1}
  }},
  {$project: {
    date: {
      year: "$_id.year",
      month: "$_id.month",
      day: "$_id.day"
    },
    count: 1,
    _id: 0
  }}
], callback);
The grouped results are perfect, except that they are not sorted. Here is an example of output:
[
  {
    count: 1,
    date: { year: 2013, month: 2, day: 7 }
  },
  {
    count: 1906,
    date: { year: 2013, month: 2, day: 4 }
  },
  {
    count: 1580,
    date: { year: 2013, month: 2, day: 5 }
  },
  {
    count: 640,
    date: { year: 2013, month: 2, day: 6 }
  }
]
I know the sorting is done by adding {$sort: val}. But I'm not sure what val should be so the results are sorted by date, since my grouping key is an object of three values constructing the date. Does anyone know how this could be accomplished?
EDIT
I have tried this and it worked :)
{$sort: {"date.year":1, "date.month":1, "date.day":1}}
It appears that this question has a very simple answer :) You just need to sort by multiple nested columns like this:
{$sort: {"date.year":1, "date.month":1, "date.day":1}}
I got stuck with the same problem, thanks for your answer.
But I found out that you can get the same result with less code:
MyModel.aggregate([
  {$match: {$and: [{created_date: {$gte: start_date}}, {created_date: {$lte: end_date}}]}},
  {$group: {
    _id: {
      year: {$year: "$created_at"},
      month: {$month: "$created_at"},
      day: {$dayOfMonth: "$created_at"}
    },
    count: {$sum: 1}
  }},
  {$project: {
    date: "$_id", // so this is the shorter way
    count: 1,
    _id: 0
  }},
  {$sort: {"date": 1}} // and this will sort based on your date
], callback);
This works if you are only sorting by date; if you had other columns to sort on, you would need to expand _id.
