Elasticsearch collapse by a field A, with top hit by field B, with results sorted by field C - elasticsearch

Suppose I have these documents in elasticsearch:
{"name":"alpha", "grp":1, "priority": 1}
{"name":"beta", "grp":1, "priority": 3}
{"name":"gamma", "grp":2, "priority": 5}
{"name":"zeta", "grp":2, "priority": 1}
I want to query my index and get a single hit per grp.
The hit for each grp must be the document with the highest priority value.
My overall query needs to return all fields and be sorted by name.
Sample output:
{"name":"beta", "grp":1, "priority": 3}
{"name":"gamma", "grp":2, "priority": 5}
Collapse doesn't seem to do the trick, because I would need the per-group selection to sort by priority rather than by name. The documentation says:
"The collapsing is done by selecting only the top sorted document per collapse key"
https://www.elastic.co/guide/en/elasticsearch/reference/current/collapse-search-results.html
I feel like there must be some combination of aggregations that will get the result I'm looking for, but I'm bashing my head into a wall. Please help me!?

There is no way to achieve this using collapse (yet). You can follow the current progress here: https://github.com/elastic/elasticsearch/issues/45646
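One possible workaround, sketched here under assumptions rather than taken from the answer above, is a terms aggregation on grp with a top_hits sub-aggregation sorted by priority, followed by sorting the winners by name on the client. The aggregation names by_grp and best, the max_groups parameter, and both helper functions are hypothetical, and the approach only fits when the number of groups is small enough to return in a single response.

```python
def build_request(max_groups=1000):
    """Request body: one bucket per grp, each holding its highest-priority document."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "by_grp": {  # hypothetical aggregation name
                "terms": {"field": "grp", "size": max_groups},
                "aggs": {
                    "best": {  # hypothetical aggregation name
                        "top_hits": {
                            "size": 1,
                            "sort": [{"priority": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def best_per_group_sorted_by_name(response):
    """Pull the single top hit out of each bucket, then sort by name on the client."""
    buckets = response["aggregations"]["by_grp"]["buckets"]
    docs = [b["best"]["hits"]["hits"][0]["_source"] for b in buckets]
    return sorted(docs, key=lambda d: d["name"])


# A hand-written response in the shape Elasticsearch returns for this request.
sample_response = {
    "aggregations": {
        "by_grp": {
            "buckets": [
                {"key": 2, "doc_count": 2, "best": {"hits": {"hits": [
                    {"_source": {"name": "gamma", "grp": 2, "priority": 5}}]}}},
                {"key": 1, "doc_count": 2, "best": {"hits": {"hits": [
                    {"_source": {"name": "beta", "grp": 1, "priority": 3}}]}}},
            ]
        }
    }
}

result = best_per_group_sorted_by_name(sample_response)
```

The final sort happens outside Elasticsearch, so paging over the sorted groups would also have to be handled by the client.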

Related

Elasticsearch filter on aggregation result (for search and aggregation)

Part of this question is related to: Elasticsearch filter on aggregation
Context
Let's say my Elasticsearch index contains some orders. Each order has a price field and an amount field. This results in an index that looks like this:
[
{
"docKey": "order01",
"user": "1",
"price": 8,
"amount": 20
},
{
"docKey": "order02",
"user": "1",
"price": 14,
"amount": 3
},
{
"docKey": "order03",
"user": "2",
"price": 5,
"amount": 1
},
{
"docKey": "order04",
"user": "2",
"price": 10,
"amount": 3
}
]
What I would like to do
I want to filter on values aggregated per user. I need this filter both for search and as the input to further aggregations. In this example, I would like to retrieve the documents of all users whose average order price is in the range 9-14.
User 1 has an average order price of 11, so we keep both of his orders.
User 2 has an average order price of 7.5, so neither of his orders is kept.
This was the easy part. After filtering down to user 1 only, I want to run more aggregations on the result.
For example: I want the distribution of the per-user average of the amount field across the buckets [0,10] and [10,20], for all users whose average order price is in the range 9-14.
The answer I expect is 0 in the bucket [0,10] and 1 in the bucket [10,20]. (Only user 1 is kept because of his average price. His average amount is 11.5, so he falls in the bucket [10,20].)
What I have tried
I have managed to build the filter that retrieves the users whose average order price is in the range 9-14. I did this by first doing a terms aggregation on the user field, then a sub-aggregation that is an avg aggregation on the price, then a bucket selector pipeline aggregation that checks whether the computed average price is between 9 and 14.
I have also managed to do the aggregation I wanted, but without the previous filter. I did exactly the same thing as for the filter, once per range, and then counted the number of results in each bucket.
I haven't found any way to apply another aggregation to the result of a bucket selector, so I could not first do the filter and then apply the range.
Also, these solutions are not elegant. I don't think they will scale, as a large part of the documents needs to be returned in the response and processed further (even if it's not going over the internet, I would prefer to avoid this, and I might be limited by the result size of an aggregation?).
I managed to find a solution, but it's not elegant and might scale poorly.
Make a terms aggregation on the user.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the price.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the amount.
Do a bucket selector pipeline aggregation that keeps only the buckets with avg_price in the range [9-14].
Do a bucket selector pipeline aggregation that keeps only the buckets with avg_amount in the range [0-10].
Do a "count" bucket script pipeline aggregation (with a script returning one).
Do a sum bucket pipeline aggregation that sums the count.
Repeat all the steps for each wanted range ([0-10], [10-20]).
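The steps above can be sketched as a request body built once per amount range, together with a plain-Python reference computation over the four sample orders to confirm the expected counts. This is a sketch under assumptions: the aggregation names (users, keep, one, matching_users) and both function names are hypothetical, the two bucket selectors are merged into one script, the amount ranges are treated as half-open so that 10 falls in only one bucket, and the request body has not been checked against a live cluster.

```python
from collections import defaultdict

orders = [
    {"docKey": "order01", "user": "1", "price": 8, "amount": 20},
    {"docKey": "order02", "user": "1", "price": 14, "amount": 3},
    {"docKey": "order03", "user": "2", "price": 5, "amount": 1},
    {"docKey": "order04", "user": "2", "price": 10, "amount": 3},
]


def build_request(price_range, amount_range, max_users=10000):
    """One request per amount range, mirroring the steps listed above."""
    (p_lo, p_hi), (a_lo, a_hi) = price_range, amount_range
    keep_script = (
        f"params.p >= {p_lo} && params.p <= {p_hi} "
        f"&& params.a >= {a_lo} && params.a < {a_hi}"
    )
    return {
        "size": 0,
        "aggs": {
            "users": {
                "terms": {"field": "user", "size": max_users},
                "aggs": {
                    "avg_price": {"avg": {"field": "price"}},
                    "avg_amount": {"avg": {"field": "amount"}},
                    # Drop user buckets whose averages fall outside the ranges.
                    "keep": {"bucket_selector": {
                        "buckets_path": {"p": "avg_price", "a": "avg_amount"},
                        "script": keep_script,
                    }},
                    # Emit the constant 1 for each surviving user bucket.
                    "one": {"bucket_script": {
                        "buckets_path": {"p": "avg_price"},
                        "script": "1",
                    }},
                },
            },
            # Sum the ones to get the number of surviving users.
            "matching_users": {"sum_bucket": {"buckets_path": "users>one"}},
        },
    }


def count_users(docs, price_range, amount_range):
    """Reference computation of the same count in plain Python."""
    by_user = defaultdict(list)
    for doc in docs:
        by_user[doc["user"]].append(doc)
    count = 0
    for user_docs in by_user.values():
        avg_price = sum(d["price"] for d in user_docs) / len(user_docs)
        avg_amount = sum(d["amount"] for d in user_docs) / len(user_docs)
        if (price_range[0] <= avg_price <= price_range[1]
                and amount_range[0] <= avg_amount < amount_range[1]):
            count += 1
    return count


low = count_users(orders, (9, 14), (0, 10))
high = count_users(orders, (9, 14), (10, 20))
```

On the sample data the reference computation gives 0 users for the first amount range and 1 for the second, matching the answer expected in the question.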

Detecting changes when comparing documents within an index in ElasticSearch

I'm using Elasticsearch to store website crawl data in one index. Docs look something like this:
{"crawl_id": 1, "url": "http://www.example.com", "status": 200}
{"crawl_id": 1, "url": "http://www.example.com/test", "status": 200}
{"crawl_id": 2, "url": "http://www.example.com", "status": 200}
{"crawl_id": 2, "url": "http://www.example.com/test", "status": 500}
How would I compare two different crawls? For instance,
I want to know which pages changed their status code from 200 to 500 in crawl_id 2 when I compare it with crawl_id 1.
I'd like to get the list of documents, but also aggregate on those results.
For instance: 1 page changed from 200 to 500.
Any ideas?
I would use parent/child documents for that: parents representing each URL, children representing each crawl event. Then I'd select parents by searching the children (I don't know whether this feature is still maintained or whether it has been renamed to the join data type).
I'd also have a look at document versions and see which approach fits my requirements better.
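As a point of comparison, here is a minimal client-side sketch of the diff itself. It is not the parent/child approach suggested above: it assumes the documents of both crawls have already been fetched and fit in memory, and the function names status_changes and summarize are hypothetical.

```python
from collections import Counter


def status_changes(old_docs, new_docs):
    """List URLs present in both crawls whose status code differs."""
    old_status = {d["url"]: d["status"] for d in old_docs}
    changes = []
    for doc in new_docs:
        before = old_status.get(doc["url"])
        if before is not None and before != doc["status"]:
            changes.append({"url": doc["url"], "from": before, "to": doc["status"]})
    return changes


def summarize(changes):
    """Count the changes per (old status, new status) pair."""
    return Counter((c["from"], c["to"]) for c in changes)


crawl_1 = [
    {"crawl_id": 1, "url": "http://www.example.com", "status": 200},
    {"crawl_id": 1, "url": "http://www.example.com/test", "status": 200},
]
crawl_2 = [
    {"crawl_id": 2, "url": "http://www.example.com", "status": 200},
    {"crawl_id": 2, "url": "http://www.example.com/test", "status": 500},
]

changes = status_changes(crawl_1, crawl_2)
summary = summarize(changes)
```

For the sample crawls, the only change is http://www.example.com/test going from 200 to 500, so the summary holds a single (200, 500) pair with a count of 1.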

Elastic - Search across object without key specification

I have an index with hundreds of millions of docs, and each of them has a "histogram" object with a value for each day:
"_source": {
"proxy": {
"histogram": {
"2017-11-20": 411,
"2017-11-21": 34,
"2017-11-22": 0,
"2017-11-23": 2,
"2017-11-24": 1,
"2017-11-25": 2692,
"2017-11-26": 11673
}
}
}
And I need one of two solutions:
1. Find docs where any value inside the histogram object is greater than XX
2. Find docs where the average of the values in the histogram object is greater than XX
For point 1, I can use a range query, but I must specify the exact field name (e.g. proxy.histogram.2017-11-20). The wildcard version (proxy.histogram.*) does not work.
For point 2, the only thing I found in ES is the avg aggregation, but I don't want to aggregate these fields after the query (because of the large amount of data). I only want to search for these docs.

Elasticsearch: auto increment integer field across two index

I need an auto-increment integer field across two indices.
Can Elasticsearch do this automatically, like a MySQL "auto increment" field in a table?
E.g. when putting some documents into two different indices:
POST /my_index_1/blogpost/
{
"title": "Foo Bar"
}
POST /my_index_2/blogpost/
{
"title": "Baz quux"
}
On retrieving them, I want:
GET /my_index_*/blogpost/
{
"uid" : 1,
"title": "Foo Bar"
},
{
"uid" : 2,
"title": "Baz quux"
}
No, ES does not have an auto-increment feature. Because it is a distributed system, figuring out the correct value for the counter is non-trivial, especially since (bulk) indexing tends to be heavily concurrent. You can typically max out CPUs on all nodes if you throw enough documents at it.
So your best option is to do this outside of ES, before you send the documents to it. Or even better, don't do this. If you need some kind of insertion order, a better option is to simply use a timestamp; timestamps are stored as numbers internally. You might still get duplicates, of course, if two documents are indexed in the same millisecond. A trick we've used to work around that is to offset documents indexed at the same time by 1 ms to ensure we keep the insertion order.
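The 1 ms offset trick could look roughly like the sketch below. The class name InsertionClock and its next_ms method are hypothetical, and the sketch assumes a single indexing process; several concurrent indexers would each need their own tie-breaking scheme.

```python
import time


class InsertionClock:
    """Hands out strictly increasing millisecond timestamps within one process."""

    def __init__(self):
        self._last_ms = 0

    def next_ms(self, now_ms=None):
        # Use the wall clock unless a time is supplied (handy for testing).
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        # If the clock has not advanced, bump the previous value by 1 ms.
        self._last_ms = max(now_ms, self._last_ms + 1)
        return self._last_ms


clock = InsertionClock()
# Three documents arriving within the same millisecond get distinct timestamps.
stamps = [clock.next_ms(1000) for _ in range(3)]
```

Each returned value would be stored in a date field on the document before it is sent to ES, and sorting on that field then reproduces the insertion order.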

How can I query/filter an elasticsearch index by an array of values?

I have an elasticsearch index with numeric category ids like this:
{
"id": "50958",
"name": "product name",
"description": "product description",
"upc": "00302590602108",
"categories": [
"26",
"39"
],
"price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of Elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.
As you already discovered, a termQuery would work. I would suggest a termFilter though, since filters are faster and cacheable.
Facets won't limit the results, but they are excellent tools. They count the hits for specific terms within your total results, and can be used for faceted navigation.
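As a rough illustration of matching any of several category ids, the sketch below builds a terms query placed in the filter clause of a bool query, which is how filtering is expressed in the query DSL of more recent Elasticsearch versions (facets were removed in favour of aggregations). The function name build_category_filter is hypothetical, and the ids are passed as strings because that is how the sample document stores them.

```python
def build_category_filter(category_ids):
    """Match documents whose categories field contains at least one of the ids."""
    return {
        "query": {
            "bool": {
                # Filter context: no scoring, and the result can be cached.
                "filter": [
                    {"terms": {"categories": [str(c) for c in category_ids]}}
                ]
            }
        }
    }


# A parent category id together with its children.
request_body = build_category_filter([26, 39, 41])
```

The sample product above has categories "26" and "39", so it would match this request.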

Resources