Sorting ElasticSearch query by multiple fields - elasticsearch

I have some data that I'm trying to sort in a very specific order.
I've looked over a few questions here on SO, and "Elasticsearch sort on multiple queries" was pretty helpful. From what I can tell, the data comes back in the correct order, but it's not always the same data: which documents the query returns appears to be random.
My question is, how do I get my data sorted correctly and get the expected data each time?
Example Data
[
{
id: 00,
...
current_outage: {
device_id: 00,
....
},
forecasted_outages: [
{
device_id: 00
}
]
},
{
id: 01,
...
current_outage: {
device_id: 01,
....
},
forecasted_outages: []
},
{
id: 02,
...
current_outage: null,
forecasted_outages: [
{
device_id: 02
}
]
},
{
id: 03,
...
current_outage: null,
forecasted_outages: []
},
]
Current Query
bool: {
should: [
{
constant_score: {
boost: 6,
filter: {
nested: {
path: 'current_outage',
query: {
exists: {
field: 'current_outage'
}
}
}
}
}
},
{
nested: {
path: 'forecasted_outages',
query: {
exists: {
field: 'forecasted_outages'
}
}
}
}
]
}
To reiterate, the above query returns the data sorted the way I expect, but it does NOT return the same data each time. As far as I can tell, which documents come back is random.
Sort Criteria:
First: Data with both current_outage and one or more forecasted_outages
Second: Data with only current_outage
Third: Data with only forecasted_outages
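To make the expected ordering concrete, here is a minimal sketch in plain Python (not Elasticsearch) of the three criteria above as a sort key. The function names are hypothetical, and the tie-break on id is an assumption added only so the output is reproducible.

```python
def outage_rank(doc):
    """Lower rank sorts first, following the stated sort criteria."""
    has_current = doc.get("current_outage") is not None
    has_forecast = bool(doc.get("forecasted_outages"))
    if has_current and has_forecast:
        return 0  # First: both current and forecasted outages
    if has_current:
        return 1  # Second: only a current outage
    if has_forecast:
        return 2  # Third: only forecasted outages
    return 3      # Everything else


def sort_by_outage(docs):
    # Assumption: break ties on "id" so equal-rank documents
    # always come back in the same order.
    return sorted(docs, key=lambda d: (outage_rank(d), d["id"]))
```

Applied to the four example documents, this puts the document with both outage kinds first and the one with neither last.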
Edit
The data returning can be anything from zero to thousands of results depending on a user. The user has an option to paginate the data or return all of their relevant data.
Edit 2
The data returned will be anywhere from zero to 1,000 hits.

If the search returns more than 10 hits (the default result size) and all documents have the same score (which is possible in your case, since you use constant_score), the data returned can differ from run to run, which feels random.
The reason is that the search results are merged from different shards until the hit count reaches 10, and the rest of the results are ignored. So each run can return a different result depending on which shards were merged.
Increasing the result size to include all the search results can give you the same data on every run.
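As a rough sketch of the suggestion above, the request body can set an explicit size large enough for all hits (1,000 here, taken from the question's second edit). The secondary sort on id is an assumption that goes beyond this answer: it acts as a tie-breaker for equal scores and presumes id is a sortable field in the mapping. build_search_body is a hypothetical helper, not part of any client library.

```python
def build_search_body(query, size=1000):
    """Build a search request body that returns up to `size` hits."""
    return {
        "size": size,  # default is 10; raise it to cover all expected hits
        "query": query,
        "sort": [
            {"_score": "desc"},  # keep the boost-based ordering
            {"id": "asc"},       # assumption: tie-breaker on a sortable id field
        ],
    }
```

The returned dict can be serialized to JSON and sent as the body of a _search request.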
UPDATE
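Using a single shard might also help. Note that number_of_shards is a static setting: it can only be set when the index is created and cannot be changed afterwards, even on a closed index. For an existing index you would need to create a new index with this setting and reindex into it (or use the shrink API).
PUT /twitter
{
"settings" : {
"index" : {
"number_of_shards" : 1
}
}
}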

Related

Can't sort buckets based on specific fields of complex key

I'm new to OpenSearch and couldn't find an answer that works for this use case. Essentially, my query uses scripts to access document field values within a multi_terms aggregation, then aggregates them into buckets reflecting certain metrics. The bucket key is an array of strings in the format ['val1', 'val2', 'val3'] with an associated key_as_string of 'val1|val2|val3'.
My goal is to be able to sort these buckets after aggregation based on any of these 3 values. Problem is, I can't seem to get sorting to work outside of a root "order" entry that sorts by the entire key (I think). Query is here:
aggregations: {
plans: {
multi_terms: {
size: 10000,
terms: [
{
script: "doc['plan.title.keyword'].value"
},
{
script: "doc['plan.type.keyword'].value"
},
{
script: "doc['plan.id.keyword'].value"
}
],
order: { _key: order } // This orders buckets by entire key?
},
aggregations: {
completed: {
filter: {
term: { 'status.keyword': 'Completed' }
}
},
in_progress: {
filter: {
term: { 'status.keyword': 'Started' }
}
},
stopped: {
filter: {
term: { 'status.keyword': 'Stopped' }
}
},
assigned: {
filter: {
term: { 'status.keyword': 'Assigned' }
}
},
my_bucket: {
bucket_sort: {
sort: [{_key: {order: 'asc'}}] // Breaks sort
}
}
}
}
},
The output of the query is correct, but the order of buckets output is not and I can't seem to get it right. I've attempted various ways of implementing bucket_sort to no avail. Feels like there is an easy solution to this and I'm just not finding it. My end goal is to be able to sort the buckets returned by a specified index of the key.
Can anyone tell me what I'm doing wrong here?
Note: Using OpenSearch v2.3
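As a fallback while the server-side ordering is unresolved, the buckets can be re-sorted on the client by one element of the key. This is only a sketch: it assumes each bucket looks the way the question describes (a "key" array of strings), and sort_buckets_by_key_part is a hypothetical helper name.

```python
def sort_buckets_by_key_part(buckets, index, descending=False):
    """Sort multi_terms buckets by the key element at position `index`."""
    return sorted(buckets, key=lambda b: b["key"][index], reverse=descending)
```

For example, index=1 orders the buckets by the plan type and ignores title and id. Because Python's sort is stable, buckets with the same type keep their original relative order.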

Elasticsearch - get (unfiltered) aggregates for a (filtered) subset

I have an elasticsearch index containing "hit" documents (with fields like ip/timestamp/uri etc) which are populated from my nginx access logs.
I'm looking for a method of getting the total number of hits / ip - but for a subset of IPs, namely the ones that did a request today.
I know I can have a filtered aggregation by doing:
/search?size=0
{
'query': { 'bool': { 'must': [
{'range': { 'timestamp': { 'gte': $today}}},
{'query_string': {'query': 'status:200 OR status:404'}},
]}},
'aggregations': {'c': {'terms': {'field': 'ip', 'size': 99999}}}
}
but this only counts the hits made today. I want the total number of hits in the index, but only for IPs that have hits today. Is this possible?
-edit-
I've tried the global option but while
'aggregations': {'c': {'global': {}, 'aggs': {'c2': {'terms': {'field': 'remote_user', 'size': 99999}}}}}
returns counts for all IPs, it ignores my filter on timestamp (e.g. it includes IPs whose hits were a couple of days ago)
There is a way to achieve what you want in a single query, but since it involves scripting, performance might suffer depending on the volume of data you run it on.
The idea is to leverage the scripted_metric aggregation in order to build your own aggregation logic over the whole document set.
What we do below is pretty simple:
we don't give any query, so we consider the full document set
Map phase: we build a map of all IPs and for each
we count the total number of hits
we flag it if it had hits today AND with the given status (same as what you do in your query)
Reduce phase: we return the total hits count for each IP that was flagged as having hits today
Here is what the query looks like:
POST my-index/_search
{
"size": 0,
"aggs": {
"all_time_hits": {
"scripted_metric": {
"init_script": "state.ips = [:]",
"map_script": """
// initialize total hits count for each IP and increment
def ip = doc['ip.keyword'].value;
if (state.ips[ip] == null) {
state.ips[ip] = [
'total_hits': 0,
'hits_today': false
]
}
state.ips[ip].total_hits++;
// flag IP if:
// 1. it has hits today
// 2. the hit had one of the given statuses
def today = Instant.ofEpochMilli(new Date().getTime()).truncatedTo(ChronoUnit.DAYS);
def hitDate = doc['timestamp'].value.toInstant().truncatedTo(ChronoUnit.DAYS);
def hitToday = today.equals(hitDate);
def statusOk = params.statuses.indexOf((int) doc['status'].value) >= 0;
state.ips[ip].hits_today = state.ips[ip].hits_today || (hitToday && statusOk);
""",
"combine_script": "return state.ips;",
"reduce_script": """
def ips = [:];
for (state in states) {
for (ip in state.keySet()) {
// only consider IPs that had hits today
if (state[ip].hits_today) {
if (ips[ip] == null) {
ips[ip] = 0;
}
ips[ip] += state[ip].total_hits;
}
}
}
return ips;
""",
"params": {
"statuses": [200, 404]
}
}
}
}
}
And here is what the response looks like:
"aggregations" : {
"all_time_hits" : {
"value" : {
"123.123.123.125" : 1,
"123.123.123.123" : 4
}
}
}
I think that pretty much does what you expect.
The other option (more performant because no script) requires you to make two queries. First, a query with the date range and status check with a terms aggregation to retrieve all IPs that have hits today (like you do now), and then a second query where you filter on those IPs (using a terms query) over the whole index (no date range or status check) and get hits count for each of them using a terms aggregation.
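The two-query option could be sketched roughly as follows, with the request bodies built as plain Python dicts. The helper names are hypothetical, the field names (timestamp, status, ip) come from the question, and the status check uses a terms filter in place of the original query_string. Keep in mind that a terms query accepts a limited number of values (65,536 by default), so a very large IP list would need to be split into batches.

```python
def build_todays_ips_query(today, statuses=(200, 404), size=99999):
    """Query 1: IPs that have hits today with one of the given statuses."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": today}}},
            {"terms": {"status": list(statuses)}},
        ]}},
        "aggs": {"c": {"terms": {"field": "ip", "size": size}}},
    }


def extract_ips(response):
    """Pull the bucket keys (the IPs) out of the first response."""
    return [b["key"] for b in response["aggregations"]["c"]["buckets"]]


def build_all_time_query(ips, size=99999):
    """Query 2: total hits per IP over the whole index, for those IPs only."""
    return {
        "size": 0,
        "query": {"terms": {"ip": ips}},
        "aggs": {"c": {"terms": {"field": "ip", "size": size}}},
    }
```

The first body is sent to _search, the IPs are extracted from its response, and the second body then returns the all-time doc_count for each of those IPs.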
In the example you have shared you have a query and your documents are filtered according to that. But you want your aggregation to take all documents regardless of the query.
This is why the global option exists.
This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.
Sample query example:
{
"query": {
"match": { "type": "t-shirt" }
},
"aggs": {
"all_products": {
"global": {},
"aggs": {
"avg_price": { "avg": { "field": "price" } }
}
}
}
}

ElasticSearch: get the bucket key inside bucket scripted_metric

I'm trying to run a custom scripted_metric aggregation on my buckets in Elasticsearch. Within the metric script, I want to get access to the bucket key that it is aggregated on.
My documents in ES look like this:
{
user_id: 5,
data: {
5: 200,
8: 300
}
},
{
user_id: 8,
data: {
5: 889,
8: 22
}
}
My aggregation query looks like this:
aggs = {
approvers: {
terms: {
field: 'user_id'
},
aggs: {
new_metric: {
scripted_metric: {
map_script: `
// IS IT POSSIBLE TO GET THE BUCKET KEY HERE?
// The bucket key here would be the user_id
// so i can do stuff like
doc['data'][**_term**]....
`
}
}
}
}
}
I had to do some digging and likely ran into the same difficulty you did in finding a way to retrieve parent values. The only thing I could find was a special "_count" value on the child agg, but nothing related to its parent bucket names/keys.
If it's not a strict requirement to use a scripted_metric child agg, I found a way that at least lets you access the bucket key within the parent. Maybe this can get you started in the direction of a solution:
aggs = {
approvers: {
terms: {
field: 'user_id',
script: '"There seems to be a magic value here: " + _value'
}
}
}
Sample adapted from this

how to groupBY using spring data

Hi, I'm using Spring Data in my project and I'm trying to group by two fields. Here's the query:
@Query( "SELECT obj from Agence obj GROUP BY obj.secteur.nomSecteur,obj.nomAgence" )
Iterable<Agence> getSecteurAgenceByPc();
But it doesn't work for me. What I want is this result:
-Safi
  -CTM
    CZC1448YZN
    2UA13817KT
-Rabat
  -CTM
    CZC1349G1B
    2UA0490SVR
  -Agdal
    G3M4NOJ
-Essaouira
  -CTM
    CZC1221B85
  -Gare Routiere Municipale
    CZC145YL3
What I get is
{
"status": 0,
"data":
[
{
"secteur": "Safi",
"agence": "CTM"
},
{
"secteur": "Safi",
"agence": "Dep"
},
{
"secteur": "Rabat",
"agence": "Agdal"
},
{
"secteur": "Rabat",
"agence": "CTM"
},
{
"secteur": "Essaouira",
"agence": "CTM"
},
{
"secteur": "Essaouira",
"agence": "Gare Routiere Municipale"
}
]
}
What you want is not possible with JPQL.
What does Group By do?
It combines all rows that are identical in the columns of the GROUP BY clause into one row. Since it combines multiple rows into one, data in other columns can only be present in some combined fashion. For example, you can include MIN/MAX or AVG values, but never the original values.
Also, the result will always be a table, never a tree.
Also note: there is no duplicated data. Every combination of secteur and agence appears exactly once.
If you want a tree structure, you have to write some Java code for that.
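The grouping logic itself is small. Below is a sketch in Python, used only to illustrate the idea; in the Spring project the same loop would be written in Java over the query results. It assumes the flat rows are fetched without GROUP BY as (secteur, agence, code) tuples, where code stands for whatever leaf value the desired output lists (a hypothetical name, since the question does not name that field).

```python
from collections import defaultdict


def build_tree(rows):
    """Turn flat (secteur, agence, code) rows into secteur -> agence -> [codes]."""
    tree = defaultdict(lambda: defaultdict(list))
    for secteur, agence, code in rows:
        tree[secteur][agence].append(code)
    # Convert to plain dicts so the result prints and serializes cleanly.
    return {secteur: dict(agences) for secteur, agences in tree.items()}
```

Each secteur maps to its agences, and each agence to the list of codes, which matches the tree shape asked for in the question.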

Elasticsearch order by type

I'm searching an index with multiple types by simply using 'http://es:9200/products/_search?q=sony'. This returns a lot of hits of many different types. The hits array contains all the results, but not in the order I want: I want the 'television' type to always show before the rest. Is it possible at all to order by type?
You can achieve this by sorting on the pre-defined field _type. The query below sorts results in ascending order of document types.
POST <indexname>/_search
{
"sort": [
{
"_type": {
"order": "asc"
}
}
],
"query": {
<query goes here>
}
}
I do it by adding a numeric field _is_OF_TYPE to the indexed documents, set to 1 for docs of the given type. Then just sort on those fields in whatever order you want.
For example:
Document A:
{
_is_television: 1,
... some television props here ...
}
Document B:
{
_is_television: 1,
... another television props here ...
}
Document C:
{
_is_radio: 1,
... some radio props here ...
}
and so on...
Then in the Elasticsearch query:
POST radio,television,foo,bar,baz/_search
{
"sort": [
{"_is_television": {"unmapped_type" : "long"}}, // television goes first
{"_is_radio": {"unmapped_type" : "long"}}, // then radio
{"_is_another_type": {"unmapped_type" : "long"}} // ... and so on
]
}
The benefit of this solution is speed. You simply sort on numeric fields. No script sorting required.
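If the type priority changes often, the sort clause can be generated from an ordered list of type names. A minimal sketch, assuming the _is_<type> naming convention from this answer; build_type_priority_sort is a hypothetical helper.

```python
def build_type_priority_sort(types):
    """Build one sort entry per type flag, in priority order."""
    return [{f"_is_{t}": {"unmapped_type": "long"}} for t in types]
```

Calling build_type_priority_sort(["television", "radio"]) yields the first two sort entries of the query above.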
