Elasticsearch pagination on top of field collapse with inner hits

I'm new to Elasticsearch and I'm a bit stuck on a query that I need to create.
I have an index with dynamic mapping containing products with the following structure:
{
"category": "my-awesome-category",
"brandId": "0a4fcfb1-03b1-45f3-81e8-6c645021b165",
"productId": "85100e63-2582-4b50-aa54-35d2c32cb4fb",
"productScore": 0.9992707531499413
}
On top of this index I need a paginated query, with n products per page, that returns the products in a category sorted by productScore, with the following requirements:
each page can contain at most 3 products of the same brand
a brand can have products on multiple pages
So far I was able to come up with a query using field collapse that returns the top 3 products of a brand, but:
a brand only has results on one page, so the second requirement is not fulfilled.
it is not guaranteed that the products are strictly sorted by their productScore, because I'm basically sorting the brands by their top productScore. For instance, if a brand has 3 products where 1 has the best score and the other 2 have the lowest scores of all products, then all 3 products are still displayed on the same page
My current query (for a page of 48 products) is:
{
"_source": [],
"from": 0, # this will increase by 16 for each page
"size": 16,
"query": {
"bool": {
"must": [
{
"match": {
"category": "my-awesome-category"
}
}
]
}
},
"collapse": {
"field": "brandId.keyword",
"inner_hits": {
"name": "brand_products_by_score",
"size": 3,
"_source": ["productId"],
"sort": [
{
"productScore": "desc"
}
]
}
},
"sort": [{
"productScore": "desc"
}]
}
Is it possible to have an ElasticSearch query on top of this index (or another holding the same data in different structure) that will give me the results that I need?
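If a single query turns out not to be enough, one fallback is to apply the per-brand cap on the client. The following is a minimal sketch, not an Elasticsearch feature: it assumes the category's products (or a large enough window of them) have already been fetched, and the function name `paginate_with_brand_cap` and the `(product_id, brand_id, score)` tuple layout are hypothetical choices made for illustration.

```python
from collections import Counter, deque


def paginate_with_brand_cap(products, page_size, max_per_brand=3):
    """Split products into pages sorted by score, capping each brand per page.

    products: iterable of (product_id, brand_id, score) tuples, in any order.
    Products that would exceed the cap are deferred to the next page.
    """
    remaining = deque(sorted(products, key=lambda p: p[2], reverse=True))
    pages = []
    while remaining:
        page, per_brand, deferred = [], Counter(), deque()
        while remaining and len(page) < page_size:
            product = remaining.popleft()
            brand = product[1]
            if per_brand[brand] < max_per_brand:
                page.append(product)
                per_brand[brand] += 1
            else:
                deferred.append(product)
        # Deferred products all score at least as high as what is left,
        # so putting them first keeps the remaining queue sorted.
        deferred.extend(remaining)
        remaining = deferred
        pages.append(page)
    return pages


sample = [
    ("a1", "A", 0.9), ("a2", "A", 0.8), ("a3", "A", 0.7), ("a4", "A", 0.6),
    ("b1", "B", 0.5), ("b2", "B", 0.4),
]
pages = paginate_with_brand_cap(sample, page_size=4)
for number, page in enumerate(pages, start=1):
    print(number, [product_id for product_id, _, _ in page])
```

Note that with this approach a page depends on what was deferred from the pages before it, so pages have to be computed in order (or cached) rather than fetched independently with `from`/`size`.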

Related

Fetch other document value while framing DSL query

I'm adding a script_score to my DSL query to boost results based on the price of a source product, something like this:
"functions":[
{
"script_score":{
"script":"if (doc['price'].size() > 0 && doc['price'].value == 1000.0) {15}"
}
}
],
Here "1000.0" is the price of a document that is not part of the response. To achieve this, I currently have to query twice: first to get the price of that document, and then to build the boosted query with that price. This is causing performance degradation.
There should be some way to fetch the value by document ID using Painless scripting, but I haven't been able to find it. It would be great if someone could help with this.
TIA!
Actually, there is a way. I've tried to reproduce your case below, I hope it's close enough to what you have.
Say, you have one index with prices, i.e. from where you first fetch the document with the price that should be boosted.
PUT prices/_doc/1
{
"price": 1000
}
Then, your main index is the one that contains the documents on which the second query runs, say it's for products and we have one product with price 1000 and another with price 500.
PUT products/_doc/1
{
"name": "product 1",
"price": 1000
}
PUT products/_doc/2
{
"name": "product 2",
"price": 500
}
Now, the query would look like this. In the function_score query, the terms lookup fetches the price (1000) from the document with ID 1 in the prices index, and every product with that price gets a weight of 15 (i.e. the hardcoded value in your script).
GET products/_search
{
"query": {
"function_score": {
"functions": [
{
"weight": 15,
"filter": {
"terms": {
"price": {
"index": "prices",
"id": "1",
"path": "price"
}
}
}
}
]
}
}
}
I hope this solves your problem.
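If the lookup document changes from request to request, the same body can be built programmatically. This is only a sketch that assembles the request as a plain dictionary. The helper name `build_price_boost_query` and its parameters are hypothetical, and sending the body to the cluster is left to whatever client you already use.

```python
import json


def build_price_boost_query(lookup_index, lookup_id, lookup_path, field, weight):
    """Build a function_score body that boosts documents whose `field`
    matches the value found at `lookup_path` in the looked-up document."""
    return {
        "query": {
            "function_score": {
                "functions": [
                    {
                        "weight": weight,
                        "filter": {
                            # Terms lookup: the value is read from another document
                            # at query time instead of being hardcoded here.
                            "terms": {
                                field: {
                                    "index": lookup_index,
                                    "id": lookup_id,
                                    "path": lookup_path,
                                }
                            }
                        },
                    }
                ]
            }
        }
    }


body = build_price_boost_query("prices", "1", "price", "price", 15)
print(json.dumps(body, indent=2))
```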

Apply filters on child level aggregations in Elasticsearch - performance issue

I am using a parent-child relationship for products and their availability/variants in stores. One product can have thousands of stores as children, and there are about 80k products. We have 3 shards and 3 powerful nodes.
In the main query, I have a filter to fetch all products that are available in a specific store, as shown below.
"filter": {
"has_child": {
"inner_hits": {
"size": 12
},
"query": {
"term": {
"store_id": "m2s3"
}
},
"type": "stores"
}
}
Now I want to do aggregation on color and size. Those fields are available within child documents. Please see query below which does the child level aggregation only for the specific store.
{
"aggregations": {
"color_agg": {
"children": {
"type": "stores"
},
"aggregations": {
"color_agg_sub": {
"filter": {
"terms": {
"store_id": [
"m2s3"
]
}
},
"aggregations": {
"color_agg_sub_sub": {
"terms": {
"field": "color.raw",
"size": 25
}
}
}
}
}
}
}
}
This works, but the performance is very bad: it takes about 1.4 seconds to execute. If I remove the child-level aggregation, the response comes back in less than 60 ms. It looks like it applies the child-level aggregation across all stores first and only then filters down to the specific store. Is there any way to filter the child documents first and then aggregate? That may help make it faster. Or if there is any other way to improve the performance, let me know.
Thanks!

Sorting and filtering elastic search top hits docs

Let's say I have a simple product index with docs like this:
{
"product_name": "some_product",
"category": "some_category",
"price": "200",
"sold_times": "5",
"store": "store1"
}
I want to get the most expensive products in each category and store that have been sold fewer than 3 times, ordered by store, category and price.
I can use two terms aggregations and a top hits aggregation to get the most expensive products per category and store, but how do I sort and filter these top hits results? I really need to filter the results after the top hits aggregation is performed, so a filter query is not the solution. How can I do this? Thx
EDIT:
Long story short - I need elastic equivalent for SQL:
SELECT p.*
FROM products AS p
INNER JOIN (
SELECT max(price) AS price, category, store
FROM products
GROUP BY category, store
) AS max_prices ON p.price = max_prices.price AND p.category = max_prices.category AND p.store = max_prices.store
WHERE p.sold_times < 3;
You could filter the search to only return products sold less than 3 times, then aggregate those by store and category, then finally apply a top hits aggregation to get the most expensive item in the category (for that store). Something like
{
"size": 0,
"query": {
"range": {
"sold_times": {
"lt": 3
}
}
},
"aggs": {
"store": {
"terms": {
"field": "store",
"size": 10
},
"aggs": {
"category": {
"terms": {
"field": "category",
"size": 10
},
"aggs": {
"most_expensive": {
"top_hits": {
"size": 1,
"sort": [
{
"price": {
"order": "desc"
}
}
]
}
}
}
}
}
}
}
}
Well, after some searching, I have found a "possible" solution. I could use a Bucket Selector aggregation together with a script that makes the top hits properties accessible for filtering, and a similar approach for sorting using a Bucket Sort aggregation (some info can be found here: How do I filter top_hits metric aggregation result [Elasticsearch])
But I'm facing another issue with aggregations. Because there are a lot of categories, I want to use pagination (like "scroll" or "size and from" in a regular search query), but that cannot be done easily with aggregations. There's a Composite Aggregation which could do something similar, but the resulting query would be so complicated that I decided to give it up and do the grouping outside of Elasticsearch.
It is sad that there is no easy way to do such a common analytic query in Elasticsearch.
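For reference, the grouping done outside of Elasticsearch can be fairly small. Below is a rough sketch of the SQL from the question in plain Python. It assumes the documents have already been fetched as dictionaries shaped like the sample document (with price and sold_times stored as strings), and the function name `top_priced_low_sellers` is made up for this example.

```python
def top_priced_low_sellers(products, max_sold=3):
    """Return products that hold the highest price in their (category, store)
    group and were sold fewer than max_sold times, ordered by store,
    category and price."""
    # Highest price per (category, store) over ALL products, like the SQL subquery.
    max_price = {}
    for p in products:
        key = (p["category"], p["store"])
        price = float(p["price"])
        if key not in max_price or price > max_price[key]:
            max_price[key] = price

    # Keep only the group winners, then apply the sold_times condition (the WHERE clause).
    winners = [
        p for p in products
        if float(p["price"]) == max_price[(p["category"], p["store"])]
        and int(p["sold_times"]) < max_sold
    ]
    return sorted(winners, key=lambda p: (p["store"], p["category"], float(p["price"])))


docs = [
    {"product_name": "a", "category": "c1", "price": "200", "sold_times": "5", "store": "s1"},
    {"product_name": "b", "category": "c1", "price": "150", "sold_times": "1", "store": "s1"},
    {"product_name": "c", "category": "c2", "price": "90", "sold_times": "2", "store": "s1"},
    {"product_name": "d", "category": "c1", "price": "300", "sold_times": "0", "store": "s0"},
]
result = top_priced_low_sellers(docs)
print([p["product_name"] for p in result])
```

The order of the two steps matters. Here product "b" is not returned, because "a" holds the maximum price in its group even though "a" itself is then filtered out. A query that filters on sold_times first and takes the top hit afterwards would return "b" for that group instead.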

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits whose text field values are too similar. My query is:
{
"size": 10,
"collapse": {
"field": "author_id"
},
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "avg",
"functions": [
{
//my custom query function
}
],
"query": {
"bool": {
"must_not": [
{
"term": {
"author_id": MY_ID
}
}
]
}
}
}
},
"aggs": {
"book_name_sample": {
"sampler": {
"shard_size": 10
},
"aggs": {
"frequent_words": {
"significant_text": {
"field": "book_name",
"filter_duplicate_text": true
}
}
}
}
}
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). Thing is, for some people, it returns books with names that are very similar (e.g. The Life of George Washington, Good Times with George Washington, Who was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a significant_text aggregation inside a sampler to find the words the hits have in common, and the query gives me something like:
...,
"aggregations": {
"book_name_sample": {
"doc_count": 10,
"frequent_words": {
"doc_count": 10,
"bg_count": 482626,
"buckets": [
{
"key": "George",
"doc_count": 3,
"score": 17.278715785140975,
"bg_count": 9718
},
{
"key": "Washington",
"doc_count": 3,
"score": 15.312204414323656,
"bg_count": 10919
}
]
}
}
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch, i.e. remove hits with book_name_sample doc_count less than X? I know I can do this in PHP or whatever language uses the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregation like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents so that the hit count is less than the requested size, is it possible to tell ES to go fetch the next top scoring hits so that hits count is filled out?
Why not use a top hits aggregation inside the aggregation to get the relevant documents that match each bucket? You can specify how many top hits you want, so this will give you a certain number of documents for each bucket.
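If the filtering ends up on the client after all (the fallback the question mentions), one simple option is to request more hits than needed and greedily drop titles that overlap too much with ones already kept. This is only a sketch under assumptions: each hit is taken to be a dictionary with a book_name key, the similarity measure is plain Jaccard overlap of lowercased words, and the names `jaccard`, `diversify_by_title` and the 0.2 threshold are arbitrary illustrative choices.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify_by_title(hits, size, max_similarity=0.2, field="book_name"):
    """Keep hits in score order, skipping any whose title is too similar
    to an already kept title, until `size` hits are collected."""
    kept, kept_tokens = [], []
    for hit in hits:
        tokens = set(hit[field].lower().split())
        if all(jaccard(tokens, other) <= max_similarity for other in kept_tokens):
            kept.append(hit)
            kept_tokens.append(tokens)
            if len(kept) == size:
                break
    return kept


hits = [
    {"book_name": "The Life of George Washington"},
    {"book_name": "Good Times with George Washington"},
    {"book_name": "Who was George Washington"},
    {"book_name": "Gardening Basics"},
]
diverse = diversify_by_title(hits, size=2)
print([h["book_name"] for h in diverse])
```

Requesting a larger size than the page actually needs gives the filter spare hits to fill the result back up after similar titles are dropped.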

Using inner_hits inside aggregation script in Elasticsearch

I have a products index with a categories nested field that contains all categories the product is assigned to, plus their ascendants.
I need to write a query that returns the top products for a given list of category ids and group those products by their parent category.
Since a product can be in 2 or more categories, I need to show the product grouped only under the categories that matched the search query.
This is what I got so far:
GET products/_search
{
"query": {
"nested": {
"path": "categories",
"query": {
"terms": {
"categories.id": [1,2,3]
}
},
"inner_hits": {}
}
},
"aggs": {
"category_group": {
"terms": {
"script": "...return the category to group by, depending on the matched nested category..."
},
"aggs": {
"top_products": {
"top_hits": {}
}
}
}
},
"size": 0
}
The problem I hit is that the terms aggregation that builds the grouping category buckets needs to know which nested category matched the search query. It looks like that information is available in the inner_hits array in the results, but I need it within the terms aggregation script field.
So, I guess my question is - is that possible at all? Or is there any other way to get the desired result?
P.S.: There is a further complication: some categories cannot be used for grouping, and in that case their parent category should be used, but this is something I can handle in the script.
