Can't sort buckets based on specific fields of complex key - elasticsearch

New to OpenSearch and couldn't find an answer that worked for this use case. My query uses scripts to read document field values in a multi_terms aggregation, then aggregates them into buckets reflecting certain metrics. Each bucket key is an array of strings in the format ['val1', 'val2', 'val3'], with an associated key_as_string of 'val1|val2|val3'.
My goal is to sort these buckets after aggregation by any one of these three values. The problem is that I can't get sorting to work other than through a root "order" entry, which sorts by the entire key (I think). Here is the query:
aggregations: {
  plans: {
    multi_terms: {
      size: 10000,
      terms: [
        {
          script: "doc['plan.title.keyword'].value"
        },
        {
          script: "doc['plan.type.keyword'].value"
        },
        {
          script: "doc['plan.id.keyword'].value"
        }
      ],
      order: { _key: order } // This orders buckets by entire key?
    },
    aggregations: {
      completed: {
        filter: {
          term: { 'status.keyword': 'Completed' }
        }
      },
      in_progress: {
        filter: {
          term: { 'status.keyword': 'Started' }
        }
      },
      stopped: {
        filter: {
          term: { 'status.keyword': 'Stopped' }
        }
      },
      assigned: {
        filter: {
          term: { 'status.keyword': 'Assigned' }
        }
      },
      my_bucket: {
        bucket_sort: {
          sort: [{_key: {order: 'asc'}}] // Breaks sort
        }
      }
    }
  }
},
The query output is correct, but the bucket order is not, and I can't get it right. I've tried various ways of using bucket_sort, to no avail. It feels like there is an easy solution that I'm just not finding. My end goal is to sort the returned buckets by a specified index of the key.
Can anyone tell me what I'm doing wrong here?
Note: Using OpenSearch v2.3
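One way to get a per-field ordering, independent of how the server orders `_key`, is to sort the returned buckets on the client. The sketch below assumes each bucket has the array `key` shape described in the question. `sortBucketsByKeyIndex` and the sample buckets are hypothetical, made up for illustration, and are not part of OpenSearch.

```javascript
// Sort multi_terms buckets by one element of their array key.
// Assumes each bucket looks like { key: ['val1', 'val2', 'val3'], ... }.
function sortBucketsByKeyIndex(buckets, index, direction = 'asc') {
  const sign = direction === 'desc' ? -1 : 1;
  // Copy first so the original response array is not mutated.
  return [...buckets].sort(
    (a, b) => sign * String(a.key[index]).localeCompare(String(b.key[index]))
  );
}

// Hypothetical buckets, shaped like response.aggregations.plans.buckets.
const sampleBuckets = [
  { key: ['Plan B', 'basic', 'id-2'], key_as_string: 'Plan B|basic|id-2', doc_count: 3 },
  { key: ['Plan A', 'premium', 'id-3'], key_as_string: 'Plan A|premium|id-3', doc_count: 5 },
  { key: ['Plan C', 'advanced', 'id-1'], key_as_string: 'Plan C|advanced|id-1', doc_count: 1 },
];

// Sort by the second key element (plan type), ascending.
const byType = sortBucketsByKeyIndex(sampleBuckets, 1);
console.log(byType.map((b) => b.key_as_string));
```

This only orders the buckets that were actually returned, so it gives a complete ordering only when `size` is large enough to return every bucket.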

Related

MongoDB - Sorting on new fields created dynamically by an $addFields step in an aggregation pipeline

I'm trying to understand how to speed up a sort inside an aggregation, where the sort is on new fields created by an $addFields step. I've got a fairly elaborate pipeline, so I'll show only the part I'm interested in:
[
... other steps ...
{ '$addFields': { 'list.a_new_field': { ... } } },
{ '$addFields': { 'list.other_new_field': { '$sum': [ { '$max': '$list.a_new_field' } ] } } },
{ '$sort': { 'list.other_new_field': -1 } },
... other steps ...
]
The sort takes about 60s to compute, according to explain:
{ '$sort': { sortKey: { 'list.other_new_field': -1 } },
nReturned: 3053,
executionTimeMillisEstimate: 60667 } ]
The collection has 464 documents.
The problem is that I don't know how to index the sort, because it's on a new field. Is there any way to optimize the query without changing the logic of the pipeline?

Unable to sort aggregation bucket results in elasticsearch

I am trying to execute a query in elasticsearch to get a list of products with the largest sales change percentage. The aggregation results should be grouped by productId and sorted by salesChangePercent. I have searched around for a solution and tried approaches such as sorting elasticsearch top hits results, but I am not able to sort the aggregation buckets by salesChangePercent. The following query is the only one that works; however, it does not seem right to me, as I am using "max_salesChangePercent" to do the sorting.
Am I doing something wrong here? Is there a better or cleaner way to get the aggregation buckets sorted? I would really appreciate any help in improving the query.
GET product_sales/_search
{
  "size": 0,
  "query": {
    "range": {
      "salesChangePercent": { "gte": 50 }
    }
  },
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "productId",
        "order": {
          "max_salesChangePercent": "desc"
        }
      },
      "aggs": {
        "top-sales": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "productId",
                "productName",
                "salesChangePercent"
              ]
            }
          }
        },
        "max_salesChangePercent": {
          "max": {
            "field": "salesChangePercent"
          }
        }
      }
    }
  }
}

Sorting ElasticSearch query by multiple fields

I have some data that I'm trying to sort in a very specific order.
I've looked over a few questions here on SO and Elasticsearch sort on multiple queries was pretty helpful. From what I can tell I'm getting the data back in the correct order but it's not always the same data and appears to be very random as to what is returned from the query.
My question is, how do I get my data sorted correctly and get the expected data each time?
Example Data
[
{
id: 00,
...
current_outage: {
device_id: 00,
....
},
forecasted_outages: [
{
device_id: 00
}
]
},
{
id: 01,
...
current_outage: {
device_id: 01,
....
},
forecasted_outages: []
},
{
id: 02,
...
current_outage: null,
forecasted_outages: [
{
device_id: 02
}
]
},
{
id: 03,
...
current_outage: null,
forecasted_outages: []
},
]
Current Query
bool: {
should: [
{
constant_score: {
boost: 6,
filter: {
nested: {
path: 'current_outage',
query: {
exists: {
field: 'current_outage'
}
}
}
}
}
},
{
nested: {
path: 'forecasted_outages',
query: {
exists: {
field: 'forecasted_outages'
}
}
}
}
]
}
Just to reiterate, the above query returns the data in the format/sorted method I expect but it does NOT return the data that I expect each time. The returned data is very random as far as I can tell.
Sort Criteria:
First: Data with both current_outage and one or more forecasted_outages
Second: Data with only current_outage
Third: Data with only forecasted_outages
Edit
The data returning can be anything from zero to thousands of results depending on a user. The user has an option to paginate the data or return all of their relevant data.
Edit 2
The data returned will be anywhere from zero to 1,000 hits.
If the number of hits is greater than 10 (the default result size) and all documents have the same score (which may be the case here, since you are using constant_score), the data returned can differ from run to run, which makes it look random.
The reason is that search results from the different shards are merged until the hit count reaches 10, and the rest are ignored. So each run can return different results depending on how the shard results were merged.
Increasing the result size to include all the search results gives the same data on every run.
UPDATE
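Setting the shard count to 1 might help. Note that number_of_shards can only be set when an index is created, so for an existing index this means creating a new index with a single shard and reindexing into it.
PUT /twitter
{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  }
}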
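Another way to make the order repeatable, whatever the shard count, is to add an explicit tiebreaker to the sort so that documents with equal scores always come back in the same order. This is a minimal sketch, assuming `id` is mapped as a sortable field (for example numeric or keyword). `withTiebreaker` is a hypothetical helper, and the `match_all` query is only a placeholder for the bool query from the question.

```javascript
// Return a copy of a search body that sorts by score first and then by a
// tiebreaker field, so documents with equal scores have a stable order.
function withTiebreaker(searchBody, tiebreakerField) {
  return {
    ...searchBody,
    sort: [
      { _score: { order: 'desc' } },
      { [tiebreakerField]: { order: 'asc' } },
    ],
  };
}

// Placeholder query; size 1000 matches the upper bound given in "Edit 2".
const body = withTiebreaker({ size: 1000, query: { match_all: {} } }, 'id');
console.log(JSON.stringify(body, null, 2));
```

With the tiebreaker in place, paginated requests should also see a consistent order, since equal-score documents no longer depend on which shard answered first.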

ElasticSearch: get the bucket key inside bucket scripted_metric

I am trying to run a custom scripted_metric aggregation on my buckets in elasticsearch. Within the metric script, I want to access the key of the bucket that is being aggregated.
My documents in ES look like this:
{
user_id: 5,
data: {
5: 200,
8: 300
}
},
{
user_id: 8,
data: {
5: 889,
8: 22
}
}
My aggregation query looks like this:
aggs = {
  approvers: {
    terms: {
      field: 'user_id'
    },
    aggs: {
      new_metric: {
        scripted_metric: {
          map_script: `
            // IS IT POSSIBLE TO GET THE BUCKET KEY HERE?
            // The bucket key here would be the user_id
            // so i can do stuff like
            doc['data'][**_term**]....
          `
        }
      }
    }
  }
}
I had to do some digging and likely had the same difficulty you did in finding out how to retrieve parent values. The only thing I could find concerned a special "_count" value on the child agg, but nothing related to its parent bucket names/keys.
If it's not a strict requirement to use a scripted_metric child agg, I found a way that at least lets you access the bucket key within the parent. Maybe this can get you started in the direction of a solution:
aggs = {
  approvers: {
    terms: {
      field: 'user_id',
      script: '"There seems to be a magic value here: " + _value'
    }
  }
}
Sample adapted from this
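If the bucket key cannot be reached from inside the script, the same lookup can be done outside the aggregation. The sketch below is plain client-side JavaScript, not an Elasticsearch API. It assumes the metric wanted per bucket is the sum of `data[user_id]`, which the question does not spell out, and `sumDataByOwnKey` is a hypothetical helper.

```javascript
// Group documents by user_id and, for each group, sum the entry of `data`
// whose key equals that user_id (the "bucket key" in the question).
function sumDataByOwnKey(docs) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc.user_id;
    const value = doc.data[key] ?? 0; // missing entries count as 0
    totals.set(key, (totals.get(key) ?? 0) + value);
  }
  return totals;
}

// The two sample documents from the question.
const docs = [
  { user_id: 5, data: { 5: 200, 8: 300 } },
  { user_id: 8, data: { 5: 889, 8: 22 } },
];

console.log(sumDataByOwnKey(docs)); // Map(2) { 5 => 200, 8 => 22 }
```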

ElasticSearch query failing due to state codes "in" and "or" being reserved words

I'm querying for states using the state code as the query string, and "in" and "or" (Indiana and Oregon) are failing, presumably because they're reserved words.
I can confirm that the data exists in the index correctly, because when I run:
curl -XGET 'localhost:9200/state/_search?size=200&pretty=true' -d '{"query" : {"match_all" : {}}}' > out.txt
I can see the data there for both the working states and the non-working states. Plus, if I change the state code of a non-working state in CouchDB to something like XYZ, I can verify that the change makes it to ES by running the above command and searching for XYZ. So I know I'm looking at the right data and it's indexing fine.
The problem is the query. Right now, here's what my entire query object looks like:
var q = {
size: 0,
query: {
filtered: {
query: { term: { postcode: 'or' } },
filter: { term: { version: 2 } }
}
},
facets: {
version: { terms: { field: "version" } },
count : { statistical : { field : "latestValues.enroll" } }
}
};
If I run that query, I get no results. If I change the "or" out with "tn" or "tx" or "sc" etc., then it works fine.
I looked for a way to escape reserved words and found this link but it doesn't seem to work for me, when running the following query:
var q = {
size: 0,
query: {
filtered: {
query: { match_all: { } },
filter: { term: { version: 2, postcode: 'or' } }
}
},
facets: {
version: { terms: { field: "version" } },
count : { statistical : { field : "latestValues.enroll" } }
}
};
(Note that that query also works when changing out "or" with a non-reserved-word-state so I know it's not a problem with the query itself).
Any ideas?
This is not about "reserved" words, it's about stop words. You are using an analyzer that removes stop words (this was the default analyzer behavior until more recent versions of Elasticsearch).
You'll need to change the analyzer for the field, see here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html
This change will require reindexing, though.
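To illustrate the effect, the sketch below mimics an analyzer that lowercases text and drops English stop words. It is a simplified stand-in for illustration, not Elasticsearch's actual analyzer. `analyze` is a hypothetical helper, and `STOP_WORDS` is only a subset of the usual English stop word list.

```javascript
// A subset of common English stop words; note that it contains "in" and "or".
const STOP_WORDS = new Set([
  'a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in',
  'into', 'is', 'it', 'no', 'not', 'of', 'on', 'or', 'the', 'to', 'was',
  'will', 'with',
]);

// Lowercase, split on non-alphanumeric characters, then drop stop words.
function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !STOP_WORDS.has(token));
}

console.log(analyze('tn')); // [ 'tn' ]
console.log(analyze('or')); // []
console.log(analyze('in')); // []
```

Because no token is indexed for "or" or "in", a term query on those values has nothing to match. On Elasticsearch versions of that era, mapping the string field with `index: 'not_analyzed'` is one way to keep the state code as a single, unfiltered token.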
