Grouping Grouped results from elasticsearch - elasticsearch

I'm a super duper newb with elasticsearch
I have a bunch of products on my elasticsearch. Each elasticsearch record has title, pid, product_group, color, size, qty... etc, many more fields
Now when I'm doing my request, what I want to happen is for it to group the results by pid, and then inside the _group part of the response, I also want those grouped as well, by product_group.
So in other words, if I have
pid: 1, product_group: 1, size: 1
pid: 1, product_group: 1, size: 2
pid: 1, product_group: 2, size: 1
pid: 1, product_group: 2, size: 2
pid: 2, product_group: 3, size: 1
pid: 2, product_group: 3, size: 2
pid: 2, product_group: 4, size: 1
pid: 2, product_group: 4, size: 2
I would want my top level search array to have 2 results: 1 for pid1 and 1 for pid2, and then inside of each of those results, inside the _group part of the json, I would expect 2 results each: pid1 would have a result for product_group 1 and product_group 2, and pid2 would have a _group result for product_group 3 and product_group 4.
Is this possible?
At the moment, this is how I'm modifying my query to group it based on pid:
group: {field: "pid", collapse: true}
I don't really know if I want collapse to be true or false, and I don't know how, or if it's even possible, to do a second layer of grouping like I'm asking for. I would appreciate any help.

The most straightforward way is to go with child terms aggs:
{
"size": 0,
"aggs": {
"by_pid": {
"terms": {
"field": "pid"
},
"aggs": {
"by_group": {
"terms": {
"field": "product_group"
},
"aggs": {
"underlying_docs": {
"top_hits": {}
}
}
}
}
}
}
}
Note that the innermost aggs block (underlying_docs) is optional. I've put it there in case you'd like to know which docs ended up in which bucket.
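To turn the response of the nested aggregation into the two-level grouping the question asks for, walk aggregations.by_pid.buckets and, inside each bucket, by_group.buckets. Below is a minimal Ruby sketch, assuming the response has already been parsed into a Hash. The sample response is hand-written and trimmed for illustration (it is not real cluster output), and groups_by_pid is a hypothetical helper name.

```ruby
require 'json'

# Hand-written, trimmed sample of a response to the nested terms
# aggregation above (illustrative only, not real cluster output).
SAMPLE_RESPONSE = JSON.parse(<<~JSON)
  {
    "aggregations": {
      "by_pid": {
        "buckets": [
          { "key": 1, "doc_count": 4,
            "by_group": { "buckets": [
              { "key": 1, "doc_count": 2 },
              { "key": 2, "doc_count": 2 }
            ] } },
          { "key": 2, "doc_count": 4,
            "by_group": { "buckets": [
              { "key": 3, "doc_count": 2 },
              { "key": 4, "doc_count": 2 }
            ] } }
        ]
      }
    }
  }
JSON

# Returns { pid => [product_group, ...] } from the parsed response.
# Missing sections are treated as empty rather than raising.
def groups_by_pid(response)
  pid_buckets = response.dig('aggregations', 'by_pid', 'buckets') || []
  pid_buckets.to_h do |pid_bucket|
    group_buckets = pid_bucket.dig('by_group', 'buckets') || []
    [pid_bucket['key'], group_buckets.map { |bucket| bucket['key'] }]
  end
end

p groups_by_pid(SAMPLE_RESPONSE)
```

Keep in mind that a terms aggregation returns only the top 10 buckets by default, so with more pids or product groups than that you would need to raise size inside each terms block.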


Elastic search shards - total, successful, failed

This is the result
{
"_index": "vehicles",
"_id": "123",
"_version": 2,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 1,
"_primary_term": 1
}
for the query
PUT /vehicles/_doc/123
{
"make": "Honda",
"color": "Blue",
"HP": 250,
"milage": 24000,
"price": 19300.97
}
on Elasticsearch 8.
May I know:
Does the total number of shards (2) include the primary shard plus the replica shard?
The successful shards: I suppose that's the primary shard the PUT is written to. Can it be more than 1?
The failed shards: I suppose that's a failed primary shard?
As explained in the official documentation for the Index API response body:
_shards.total tells you how many shard copies (primaries + replicas) the index operation should be executed on
_shards.successful returns the number of shard copies the index operation succeeded on. Upon success, successful is at least 1, as in your case. Since write operations by default only wait for the primary shard to be active before proceeding, only 1 is returned. If you want to see 2, add wait_for_active_shards=all to your indexing request.
_shards.failed contains replication-related errors in the case an index operation failed on a replica shard. 0 indicates there were no failures.
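The three counters can be read together with a little arithmetic. Here is a minimal Ruby sketch, using the values from the response in the question. summarize_shards and the "pending" label are names chosen here for illustration, not Elasticsearch fields.

```ruby
# Summarises the _shards section of an index response.
# "pending" is a label chosen here (not an Elasticsearch field): copies that
# neither acknowledged the write nor reported a failure, e.g. a replica
# that was not active when the write happened.
def summarize_shards(shards)
  total = shards.fetch('total')
  successful = shards.fetch('successful')
  failed = shards.fetch('failed')
  {
    total: total,
    successful: successful,
    failed: failed,
    pending: total - successful - failed
  }
end

# Values copied from the response in the question.
summary = summarize_shards('total' => 2, 'successful' => 1, 'failed' => 0)
puts "written to #{summary[:successful]} of #{summary[:total]} copies, " \
     "#{summary[:failed]} failed, #{summary[:pending]} not written"
```

For the response above this reports one copy written, none failed, and one copy that did not take part in the write.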

Find sequences in time series data using Elasticsearch

I'm trying to find example Elasticsearch queries for returning sequences of events in a time series. My dataset is rainfall values at 10-minute intervals, and I want to find all storm events. A storm event would be considered continuous rainfall for more than 12 hours. This would equate to 72 consecutive records with a rainfall value greater than zero. I could do this in code, but to do so I'd have to page through thousands of records so I'm hoping for a query-based solution. A sample document is below.
I'm working in a University research group, so any solutions that involve premium tier licences are probably out due to budget.
Thanks!
{
"_index": "rabt-rainfall-2021.03.11",
"_type": "_doc",
"_id": "fS0EIngBfhLe-LSTQn4-",
"_version": 1,
"_score": null,
"_source": {
"@timestamp": "2021-03-11T16:00:07.637Z",
"current-rain-total": 8.13,
"rain-duration-in-mins": 10,
"last-recorded-time": "2021-03-11 15:54:59",
"rain-last-10-mins": 0,
"type": "rainfall",
"rain-rate-average": 0,
"@version": "1"
},
"fields": {
"@timestamp": [
"2021-03-11T16:00:07.637Z"
]
},
"sort": [
1615478407637
]
}
Update 1
Thanks to @Val, my current query is
GET /rabt-rainfall-*/_eql/search
{
"timestamp_field": "@timestamp",
"event_category_field": "type",
"size": 100,
"query": """
sequence
[ rainfall where "rain-last-10-mins" > 0 ]
[ rainfall where "rain-last-10-mins" > 0 ]
until [ rainfall where "rain-last-10-mins" == 0 ]
"""
}
Having a sequence query with only one rule causes a syntax error, hence the duplicate. The query as it is runs but doesn't return any documents.
Update 2
Results weren't being returned due to me not escaping the property names correctly. However, due to the two sequence rules I'm getting matches of length 2, not of arbitrary length until the stop clause is met.
GET /rabt-rainfall-*/_eql/search
{
"timestamp_field": "@timestamp",
"event_category_field": "type",
"size": 100,
"query": """
sequence
[ rainfall where `rain-last-10-mins` > 0 ]
[ rainfall where `rain-last-10-mins` > 0 ]
until [ rainfall where `rain-last-10-mins` == 0 ]
"""
}
This would definitely be a job for EQL, which allows you to return sequences of related data (ordered in time and matching some constraints):
GET /rabt-rainfall-2021.03.11/_eql/search?filter_path=-hits.events
{
"timestamp_field": "@timestamp",
"event_category_field": "type",
"size": 100,
"query": """
sequence with maxspan=12h
[ rainfall where `rain-last-10-mins` > 0 ]
until [ rainfall where `rain-last-10-mins` == 0 ]
"""
}
What the above query seeks to do is basically this:
get me the sequence of events of type rainfall
with rain-last-10-mins > 0
happening within a 12h window
up until rain-last-10-mins drops to 0
The until statement makes sure that the sequence "expires" as soon as an event has rain-last-10-mins: 0 within the given time window.
In the response, you're going to get the number of matching events in hits.total.value and if that number is 72 (because the time window is limited to 12h), then you know you have a matching sequence.
So your "storm" signal here is to detect whether the above query returns hits.total.value: 72 or lower.
Disclaimer: I haven't tested this, but in theory it should work the way I described.
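If the EQL route does not work out, the check the question describes doing "in code" is small once the readings have been fetched. Here is a minimal Ruby sketch of that client-side fallback, not of the EQL query. It assumes the readings are already sorted by time and evenly spaced at 10 minutes with no gaps. storm_runs and MIN_STORM_READINGS are hypothetical names, and the demo values are made up.

```ruby
# Minimum number of consecutive wet 10-minute readings that make a storm
# (12 hours / 10 minutes = 72, per the question).
MIN_STORM_READINGS = 72

# readings: array of hashes sorted by time, each with a 'rain-last-10-mins'
# value. Returns the runs (arrays of consecutive readings) with rain > 0
# that are at least min_length readings long.
def storm_runs(readings, min_length: MIN_STORM_READINGS)
  readings
    .chunk { |reading| reading['rain-last-10-mins'].to_f > 0 }
    .select { |wet, run| wet && run.length >= min_length }
    .map { |_wet, run| run }
end

# Made-up demo data with a lowered threshold so the example stays short.
values = [0, 1.2, 0.4, 0.8, 0, 0.2, 0]
readings = values.map { |value| { 'rain-last-10-mins' => value } }
runs = storm_runs(readings, min_length: 3)
puts "found #{runs.length} run(s) of at least 3 wet readings"
```

This still requires paging through the documents, so it only addresses the detection logic, not the volume concern raised in the question.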

How to force segments merge in Elasticsearch 5?

I'm working with Elasticsearch 5.2.2 and I would like to fully merge the segments of my index after an intensive indexing operation.
I'm using the following REST API in order to merge all the segments:
http://localhost:9200/my_index/_forcemerge
(I've tried also to add max_num_segments=1 in the POST request.)
And ES replies with:
{
"_shards": {
"total": 16,
"successful": 16,
"failed": 0
}
}
Note that my_index is composed of 16 shards.
But when I ask for node stats (http://localhost:9200/_nodes/stats) it replies with:
segments: {
count: 64,
[...]
}
So it seems that each shard is split into 4 segments (64/16 = 4). In fact, an "ls" on the data directory confirms that there are 4 segments per shard:
~# ls /var/lib/elasticsearch/nodes/0/indices/ym_5_99nQrmvTlR_2vicDA/0/index/
_0.cfe _0.cfs _0.si _1.cfe _1.cfs _1.si _2.cfe _2.cfs _2.si _5.cfe _5.cfs _5.si segments_6 write.lock
And no concurrent merges are running (http://localhost:9200/_nodes/stats):
merges: {
current: 0,
[...]
}
And all the force_merge requests have been completed (http://localhost:9200/_nodes/stats):
force_merge: {
threads: 1,
queue: 0,
active: 0,
rejected: 0,
largest: 1,
completed: 3
}
I didn't have this problem with ES 2.2.
Does anyone know how to fully merge these segments?
Thank you all!
I am not sure whether your problem has been solved, but I'm posting here so that other people know.
This looks like a bug; see the following issue. Sending an empty JSON body makes it work.
https://github.com/TravisTX/elasticsearch-head-chrome/issues/16
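As a rough illustration of "use an empty JSON body", here is a minimal Ruby sketch that builds the force-merge POST with a body of {} and the max_num_segments=1 parameter mentioned in the question. It only constructs the request; the sending step is left commented out because it needs a reachable cluster. build_forcemerge_request is a hypothetical helper, and the host and index name are the ones from the question.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a force-merge request with an empty JSON body.
# Returns the target URI and the prepared POST request.
def build_forcemerge_request(base_url, index, max_num_segments: 1)
  uri = URI("#{base_url}/#{index}/_forcemerge")
  uri.query = URI.encode_www_form(max_num_segments: max_num_segments)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = '{}'
  [uri, request]
end

uri, request = build_forcemerge_request('http://localhost:9200', 'my_index')
puts "#{request.method} #{uri} body=#{request.body}"

# To actually send it (requires a reachable cluster):
# response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
# puts response.body
```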

Manipulating Tree Structures for API endpoints

I am working with an RDBMS that contains a list of hierarchical objects stored like this:
Id Name ParentId
====================================
1 Food NULL
2 Drink NULL
3 Vegetables 1
4 Fruit 1
5 Liquor 2
6 Carrots 3
7 Onions 3
8 Strawberries 4
...
999 Celery 3
I do not know the specific reason why this was chosen, but it is fixed in so far as the rest of the system relies on fetching the structure in this form.
I want to expose this data via JSON using a RESTful API, and I wish to output this in the following format (array of arrays):
item:
{
id: 1, Description: "Food",
items: [
{
id: 3, Description: "Vegetables",
items: [ ... ]
},
{
id: 4, Description: "Fruit",
items: [ ... ]
}
]
},
item:
{
id: 2, Description: "Drink",
items: [ ... ]
}
What would be a sensible way of looping through the data and producing the desired output? I'm developing in C# but if there are libraries or examples for other languages I would be happy to re-implement where possible.
Thanks :)
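One common way to go from the flat Id/ParentId rows to the nested shape is a two-pass build: index every row by id, then attach each node to its parent's items list (or to the top level when ParentId is NULL). Here is a minimal sketch in Ruby rather than C#, assuming the rows have already been loaded into memory; the same idea ports to a Dictionary keyed by Id. build_tree and the row key names are choices made for this example, and only the rows shown in the question are included.

```ruby
require 'json'

# Rows as they would come back from the table in the question
# (only the rows shown there; the elided rows are left out).
ROWS = [
  { id: 1, name: 'Food', parent_id: nil },
  { id: 2, name: 'Drink', parent_id: nil },
  { id: 3, name: 'Vegetables', parent_id: 1 },
  { id: 4, name: 'Fruit', parent_id: 1 },
  { id: 5, name: 'Liquor', parent_id: 2 },
  { id: 6, name: 'Carrots', parent_id: 3 },
  { id: 7, name: 'Onions', parent_id: 3 },
  { id: 8, name: 'Strawberries', parent_id: 4 },
  { id: 999, name: 'Celery', parent_id: 3 }
].freeze

# Builds the nested structure in two passes: create one node per row,
# then attach each node to its parent's items (or to the root list).
# A row whose parent_id matches no row also ends up at the root.
def build_tree(rows)
  nodes = rows.to_h do |row|
    [row[:id], { id: row[:id], Description: row[:name], items: [] }]
  end
  roots = []
  rows.each do |row|
    node = nodes[row[:id]]
    parent = nodes[row[:parent_id]]
    (parent ? parent[:items] : roots) << node
  end
  roots
end

puts JSON.pretty_generate(build_tree(ROWS))
```

Both passes are linear in the number of rows, and the rows do not need to be sorted so that parents come before their children.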

find overlapping times in an array of hashes

I've got an array of classes, and I want to find where there may be a schedule overlap.
My array is something like this
[
{
id: 2,
start: "3:30",
length: 40,
break: 30,
num_attendees: 14
},
{
id: 3,
start: "3:40",
length: 60,
break: 40,
num_attendees: 4
},
{
id: 4,
start: "4:40",
length: 30,
break: 10,
num_attendees: 40
}
]
Pretty simple.
Now, I want to get an array where I add the start and the length, and then get the classes that overlap to notify the user that they have a conflict.
I know I can do a large for loop and compare that way, but I'm thinking there must be a nicer way to do this in Ruby, something like the following (ignore that we're not working in absolute minutes here; I've got that covered and just want to keep the example simple).
overlap = class_list.select{|a,b| if a.start+a.length>b.start return a,b end}
Any suggestions?
You can use Array#combination like this:
class_list.combination(2).select{|c1, c2|
# here check if c1 and c2 overlap
}
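To fill in the overlap check, here is a minimal, self-contained sketch built on Array#combination. It assumes each class occupies the half-open interval from start to start + length, ignores the break field, and treats the start times as "H:MM" strings converted to minutes. start_minutes, time_range and overlap? are helper names made up for this example.

```ruby
# Class list from the question, with start times normalised to "H:MM".
CLASS_LIST = [
  { id: 2, start: '3:30', length: 40, break: 30, num_attendees: 14 },
  { id: 3, start: '3:40', length: 60, break: 40, num_attendees: 4 },
  { id: 4, start: '4:40', length: 30, break: 10, num_attendees: 40 }
].freeze

# Converts "H:MM" (stray spaces tolerated) to minutes since midnight.
def start_minutes(klass)
  hours, minutes = klass[:start].delete(' ').split(':').map(&:to_i)
  hours * 60 + minutes
end

# Half-open interval [start, start + length); the break is ignored here.
def time_range(klass)
  start = start_minutes(klass)
  start...(start + klass[:length])
end

# Two half-open intervals overlap when each starts before the other ends.
def overlap?(c1, c2)
  r1 = time_range(c1)
  r2 = time_range(c2)
  r1.begin < r2.end && r2.begin < r1.end
end

conflicts = CLASS_LIST.combination(2).select { |c1, c2| overlap?(c1, c2) }
conflicts.each { |c1, c2| puts "classes #{c1[:id]} and #{c2[:id]} overlap" }
```

With the sample data, classes 2 and 3 are reported as overlapping. Class 4 starts exactly when class 3 ends, so under the half-open assumption it is not a conflict. If the break should count as occupied time, add klass[:break] to the end of the range.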
