Using inner_hits inside aggregation script in Elasticsearch

I have a products index with a categories nested field that contains all categories the product is assigned to, plus their ancestors.
I need to write a query that returns the top products for a given list of category ids and group those products by their parent category.
Since a product can be in 2 or more categories, I need to show the product grouped only under the categories that matched the search query.
This is what I got so far:
GET products/_search
{
  "query": {
    "nested": {
      "path": "categories",
      "query": {
        "terms": {
          "categories.id": [1, 2, 3]
        }
      },
      "inner_hits": {}
    }
  },
  "aggs": {
    "category_group": {
      "terms": {
        "script": "...return the category to group by, depending on the matched nested category..."
      },
      "aggs": {
        "top_products": {
          "top_hits": {}
        }
      }
    }
  },
  "size": 0
}
The problem I hit is that the terms aggregation that builds the grouping category buckets needs to know which nested category matched the search query. It looks like that information is available in the inner_hits array in the results, but I need it within the terms aggregation script field.
So, I guess my question is - is that possible at all? Or is there any other way to get the desired result?
P.S: There is further complication that some categories cannot be used for grouping, and in such case, their parent category should be used, but this is something I can handle in the script.
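For what it's worth, inner_hits are computed in the fetch phase, after aggregations have already run, so they are not visible to an aggregation script. One possible workaround (a sketch, not a tested solution; the aggregation names by_matched_category, only_searched, category_group and product are made up here) is to move the grouping into a nested aggregation: bucket on categories.id, restrict the nested docs to the same ids as the query with a filter sub-aggregation, and step back to the parent product with reverse_nested before the top_hits:

GET products/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "categories",
      "query": {
        "terms": { "categories.id": [1, 2, 3] }
      }
    }
  },
  "aggs": {
    "by_matched_category": {
      "nested": { "path": "categories" },
      "aggs": {
        "only_searched": {
          "filter": {
            "terms": { "categories.id": [1, 2, 3] }
          },
          "aggs": {
            "category_group": {
              "terms": { "field": "categories.id" },
              "aggs": {
                "product": {
                  "reverse_nested": {},
                  "aggs": {
                    "top_products": { "top_hits": {} }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The reverse_nested step matters: the terms buckets contain nested category docs, and reverse_nested joins back to the root document so top_hits returns whole products. The remapping of non-groupable categories to their parents (the P.S.) would still have to happen client-side or in a script here.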

Related

ElasticSearch pagination on top of field collapse with inner hits

I'm new to Elasticsearch and a bit stuck on a query that I need to create.
I have an index with dynamic mapping containing products with the following structure:
{
  "category": "my-awesome-category",
  "brandId": "0a4fcfb1-03b1-45f3-81e8-6c645021b165",
  "productId": "85100e63-2582-4b50-aa54-35d2c32cb4fb",
  "productScore": 0.9992707531499413
}
On top of this index I need to have a paginated query, with n products per page, that needs to return the products in a category sorted by productScore, but with the following requirements:
- each page can only contain 3 products of the same brand
- a brand can have products in multiple pages
So far I was able to come up with a query using field collapse that returns the top 3 products of a brand, but:
- a brand will only have results on 1 page, so the 2nd requirement is not fulfilled.
- it is not guaranteed that the products are absolutely sorted by their productScore, because I'm basically sorting the brands by their top productScore; for instance, if a brand has 3 products where 1 has the best score and the other 2 have the lowest scores of all products, then all 3 of those products are going to be displayed on the same page.
My current query (for a page of 48 products) is:
{
  "_source": [],
  "from": 0, # this will increase by 16 for each page
  "size": 16,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "my-awesome-category"
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "brandId.keyword",
    "inner_hits": {
      "name": "brand_products_by_score",
      "size": 3,
      "_source": ["productId"],
      "sort": [
        { "productScore": "desc" }
      ]
    }
  },
  "sort": [
    { "productScore": "desc" }
  ]
}
Is it possible to have an ElasticSearch query on top of this index (or another holding the same data in different structure) that will give me the results that I need?

Elastic search multi index query

I am building an app where I need to match users based on several parameters. I have two Elasticsearch indices, one with the users' likes and dislikes, one with some metadata about each user.
/user_profile/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"]
}
/user_metadata/abc12345
{
  "userId": "abc12345",
  "seenBy": ["aaa123", "bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross reference them, but how do I do that? For example I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?
If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). Then, you can (1) issue a boolean query over both indices, to match documents from one index based on the likes field, and documents from the other index based on the seenBy field, (2) use the terms bucket aggregation to get the list of unique userIds that satisfy your conditions.
For example
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "likes": "chocolate"
          }
        },
        {
          "match": {
            "seenBy": "abc123"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      }
    }
  }
}
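As written, the terms aggregation lists userIds that matched either condition, so it does not by itself isolate users who like chocolate and were NOT seen by abc123. One possible refinement (a sketch, assuming default dynamic mappings; the sub-aggregation names liked, seen and only_unseen_likers are made up here) adds filter sub-aggregations and a bucket_selector that keeps only buckets with a likes match and no seenBy match:

GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "match": { "likes": "chocolate" } },
        { "match": { "seenBy": "abc123" } }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      },
      "aggs": {
        "liked": {
          "filter": { "match": { "likes": "chocolate" } }
        },
        "seen": {
          "filter": { "match": { "seenBy": "abc123" } }
        },
        "only_unseen_likers": {
          "bucket_selector": {
            "buckets_path": {
              "liked": "liked._count",
              "seen": "seen._count"
            },
            "script": "params.liked > 0 && params.seen == 0"
          }
        }
      }
    }
  }
}

The _count buckets_path reads each filter's doc_count, so a bucket survives only if the user's profile doc matched the likes clause and no metadata doc matched the seenBy clause.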

Limit The Number Of Results Processed By An Aggregation

I have a query with an aggregation. I want the aggregation to only operate on the top 500 hits returned by the query.
For example, let's say I have an index of comments. I want to query the top 500 matching comments and aggregate them based on the poster, so that I may answer the question: "Who are the top kitten and puppy posters?".
The query might look something like this:
POST comments/_search
{
  "query": {
    "query_string": {
      "query": "\"kittens\" OR \"puppies\"",
      "default_field": "body"
    }
  },
  "aggs": {
    "posters": {
      "terms": {
        "field": "poster"
      }
    }
  }
}
The problem with this is, as far as I know, the aggregation will operate on ALL returned results, not the top 500.
Things I've Already Tried/Considered:
- size at the query root only changes the number of hits returned by the query, but has no effect on the aggregation.
- size inside the terms aggregation only affects the total number of buckets to return.
- There used to be a limit filter in older versions that would limit the number of hits returned by a query (and therefore the number processed by the aggregation), but that was deprecated in favor of...
- terminate_after, which doesn't work because the results aren't sorted by score before being returned, so I couldn't get the top 500, just a set of 500.
Does anyone know how to limit the documents processed by an aggregation to only the top results?
EDIT: I'm using ES version 6.3
I think you are looking for the sampler aggregation. You will have to wrap your posters aggregation inside the sampler aggregation.
The shard_size parameter is the number of documents that will be considered for the sub-aggregation; in your case 500. Note that it applies per shard, so on an index with multiple shards the aggregation can see more than 500 documents in total.
{
  "query": {
    "query_string": {
      "query": "\"kittens\" OR \"puppies\"",
      "default_field": "body"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 500
      },
      "aggs": {
        "posters": {
          "terms": {
            "field": "poster"
          }
        }
      }
    }
  }
}

Sorting and filtering elastic search top hits docs

let's say I have a simple product index for docs like this:
{
  "product_name": "some_product",
  "category": "some_category",
  "price": "200",
  "sold_times": "5",
  "store": "store1"
}
and I want to get the most expensive product in each category, per store, among those that have been sold fewer than 3 times, and I want the results ordered by store, category and price.
I can use two terms aggregations and a top_hits aggregation to get the most expensive products per category and store, but how do I sort and filter these top-hits results? I really need to filter the results after the top_hits aggregation is performed, so a filter query is not the solution. How can I do this? Thanks
EDIT:
Long story short - I need elastic equivalent for SQL:
SELECT p.*
FROM products AS p
INNER JOIN (
  SELECT max(price) AS price, category, store
  FROM products
  GROUP BY category, store
) AS max_prices
  ON p.price = max_prices.price
 AND p.category = max_prices.category
 AND p.store = max_prices.store
WHERE p.sold_times < 3;
You could filter the search to only return products sold less than 3 times, then aggregate those by store and category, then finally apply a top hits aggregation to get the most expensive item in the category (for that store). Something like
{
  "size": 0,
  "query": {
    "range": {
      "sold_times": {
        "lt": 3
      }
    }
  },
  "aggs": {
    "store": {
      "terms": {
        "field": "store",
        "size": 10
      },
      "aggs": {
        "category": {
          "terms": {
            "field": "category",
            "size": 10
          },
          "aggs": {
            "most_expensive": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "price": {
                      "order": "desc"
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
Well, after some searching, I found a "possible" solution. I could use the bucket_selector aggregation together with a script that exposes the top-hits properties for filtering, and a similar approach for sorting using the bucket_sort aggregation (some info can be found here: How do I filter top_hits metric aggregation result [Elasticsearch]).
But I'm facing another issue with aggregations. Because there are a lot of categories, I want to use pagination (like "scroll" or "size and from" in a common search query), but that cannot be done easily with aggregations. There is a composite aggregation that could do something similar, but the resulting query would be so complicated that I decided to give up and do the grouping outside of Elasticsearch.
It is sad that there is no easy way to do such a common analytic query in Elasticsearch.
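For reference, the composite aggregation route mentioned above might look roughly like this (a sketch, untested; the aggregation name store_category is made up here). It pages through store/category pairs instead of using from/size:

GET products/_search
{
  "size": 0,
  "query": {
    "range": { "sold_times": { "lt": 3 } }
  },
  "aggs": {
    "store_category": {
      "composite": {
        "size": 10,
        "sources": [
          { "store": { "terms": { "field": "store" } } },
          { "category": { "terms": { "field": "category" } } }
        ]
      },
      "aggs": {
        "most_expensive": {
          "top_hits": {
            "size": 1,
            "sort": [ { "price": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}

Each response includes an after_key; sending it back in an "after" parameter of the composite aggregation fetches the next page of buckets.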

Elastic search, query based on another query

I want to perform a query and then use the results to perform another query
to be clear:
I want to perform a query and select the ids, and then use those ids to find some users, something like
SELECT * FROM posts WHERE user_id IN (SELECT id FROM users WHERE id IN (1,2,3,4,...))
in SQL
You can filter the results using a query_string query.
In the following example, I have filtered the results based on the result_type value,
and then grouped the values by that field.
Example:
GET .ml-anomalies-.write-high_request_time/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "result_type: model_plot OR result_type:bucket"
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "result_type",
        "size": 10
      }
    }
  }
}
You can try the same command in the following demo box
https://demo.elastic.co/app/kibana#/dev_tools/console?_g=()
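Note that the example above filters and groups within a single search; it doesn't chain one query's results into another. Elasticsearch has no server-side subquery, so the usual pattern for the SQL in the question is two round trips from the client (a sketch, using the posts/users index names and id/user_id fields from the question): first collect the ids, then feed them into a terms query:

# step 1: get the matching user ids
GET users/_search
{
  "_source": ["id"],
  "query": {
    "terms": { "id": [1, 2, 3, 4] }
  }
}

# step 2: use the collected ids to fetch the posts
GET posts/_search
{
  "query": {
    "terms": { "user_id": [1, 2, 3, 4] }
  }
}

If the id list happens to live in a single document, the terms lookup variant of the terms query can perform the second step server-side without the client shuttling ids around.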
