Elasticsearch: Product-variant-price modelling and query problem

I want to use Elasticsearch to improve performance on product search (duh) in an e-commerce solution. We have a data model where a product can have multiple variants, and each variant can have one or more prices (sometimes quite a substantial number of prices).
At query time the user chooses whether to return products or variants, and only one price should be returned: the lowest valid price (each price has a number of fields such as valid from/to dates and valid customer groups).
My first approach was to denormalize product/variants and keep the prices as nested fields, but this was quite slow and I had a few problems with sorting (on price, I think, but the exact details elude me right now).
The second approach was to denormalize completely, so that every product/variant/price combination is represented as its own document. This approach is much faster (obviously); I can aggregate on productId or variantId and get the lowest price, but the problem is that I cannot sort the aggregation buckets on non-numeric, non-aggregated fields.
Denormalized documents (productId and variantId are keyword fields, price is numeric, validFrom/validTo are dates and the rest is text):
[
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ccc",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Green mega-product",
    "variant_description": "Behold the awesomeness of the green magic mega-product",
    "color": [ "blue", "green" ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 399
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [ "blue", "green" ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 499
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [ "blue", "green" ],
    "brand": "DaBrand",
    "validFrom": "2019-06-05T00:00:00Z",
    "validTo": "2019-06-10T00:00:00Z",
    "price": 399
  }
]
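To make the model concrete, a mapping along these lines would match the description above (a simplified sketch, not the actual mapping; the keyword sub-field on brand is inferred from the brand.keyword sort used further down, and double for price is just a guess):
PUT /products
{
  "mappings": {
    "properties": {
      "productId": { "type": "keyword" },
      "variantId": { "type": "keyword" },
      "product_title": { "type": "text" },
      "product_description": { "type": "text" },
      "variant_title": { "type": "text" },
      "variant_description": { "type": "text" },
      "color": { "type": "text" },
      "brand": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "validFrom": { "type": "date" },
      "validTo": { "type": "date" },
      "price": { "type": "double" }
    }
  }
}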
Here is an example of a working query where I order the buckets by the aggregated price:
{
  "size": 1,
  "sort": {
    "product_name_text_en.keyword": "asc"
  },
  "query": {
    // All the query and filtering
  },
  "aggs": {
    "by_product_id": {
      "terms": {
        "field": "product_id_string",
        "order": {
          "min_price": "desc"
        }
      },
      "aggs": {
        "min_price": {
          "min": {
            "field": "price_decimal"
          }
        }
      }
    }
  }
}
However, using this approach I cannot find a way to sort the buckets on document fields. It is possible (I think) on numeric, boolean and date fields using bucket_sort, but I need to be able to sort on, for example, the brand or title field (which are text). If it were possible to order on a top_hits aggregation I would be home free, but as I understand from the docs that's unfortunately not possible (I've also tried it, just to make sure).
Can anyone guide me to a better solution? I don't mind doing the query in two steps, but to make sorting work that way I would likely need a few different "document types", such as Product, Variant, ProductPrice and VariantPrice, to use depending on the requested sort order. I'm not that far gone yet, so remodelling is definitely on the table. I've considered using join fields, but I'm not sure that would be performant.
Since the number of products, variants and prices can be significant (a million products is definitely on the table), I think I will have problems getting the ids from one query (for example filtering on brand and sorting on title) and then sending them into a get-best-price query.

I figured this out by accident while reading the docs for another case. It all became very simple when I found out about field collapsing. I feel like I should've known about this...
The index has the same model as in my initial question, but the query became much simpler:
{
  "size": 10,
  "query": {
    // filter/match stuff, including filtering valid prices.
  },
  "collapse": {
    "field": "productId",
    "inner_hits": {
      "name": "least_price",
      "collapse": {
        "field": "price"
      },
      "size": 1,
      "sort": [
        { "price": "asc" }
      ]
    }
  },
  "sort": [
    { "brand.keyword": "asc" }
  ]
}
To return variants instead of products, I just collapse on variantId instead.
The collapse is done on productId or variantId, and the least_price inner_hits returns the document with the lowest price among the documents matching my criteria (sorted ascending by price, taking the first hit). Works like a charm.
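For completeness, here is a sketch of the variant-level version, including one possible shape of the valid-price filter (the match and range clauses are only illustrative, and the customer-group filtering is left out):
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "match": { "product_title": "mega-product" } }
      ],
      "filter": [
        { "range": { "validFrom": { "lte": "now" } } },
        {
          "bool": {
            "should": [
              { "range": { "validTo": { "gte": "now" } } },
              { "bool": { "must_not": { "exists": { "field": "validTo" } } } }
            ]
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "variantId",
    "inner_hits": {
      "name": "least_price",
      "size": 1,
      "sort": [ { "price": "asc" } ]
    }
  },
  "sort": [
    { "brand.keyword": "asc" }
  ]
}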

Related

Elasticsearch Rank based on rarity of a field value

I'd like to know how I can rank items lower when their field values appear frequently among the results.
Say we have a result set like this:
"name": "Red T-Shirt"
"store": "Zara"
"name": "Yellow T-Shirt"
"store": "Zara"
"name": "Red T-Shirt"
"store": "Bershka"
"name": "Green T-Shirt"
"store": "Benetton"
I'd like to rank the documents so that documents containing frequently occurring field values ("store" in this case) are deboosted and appear lower in the results. This is to achieve a bit of variety, so that the search doesn't yield all of its top results from the same store. In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top, and the rest of the Zara T-Shirts should appear lower, after all the other unique stores.
So far I have tried researching aggregation buckets for sorting and script-based sorting, but without success.
Is it possible to achieve this inside of the search engine?
Many thanks in advance!
This is possible with a combination of the diversified sampler aggregation and the top hits aggregation, as I learned from the Elastic forum. I don't know what the performance implications are if it is used on a high-load production system. Here is a code example; use at your own risk:
{
  "query": {}, // whatever query
  "size": 0,   // since we don't use hits
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store"
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}
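If you want more than one document per store in the sample, the diversified_sampler also accepts a max_docs_per_value setting (it defaults to 1), for example:
"diversified_sampler": {
  "shard_size": 100,
  "field": "store",
  "max_docs_per_value": 2
}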

Significant Terms Aggregation of "flat" structures

I am currently trying to prototype a product recommendation system using the Elasticsearch significant terms aggregation. So far I haven't found a good example that deals with "flat" JSON structures of sales (here: the itemId) coming from a relational database, such as mine:
Document 1
{
"lineItemId": 1,
"lineNo": 1,
"itemId": 1,
"productId": 1234,
"userId": 4711,
"salesQuantity": 2,
"productPrice": 0.99,
"salesGross": 1.98,
"salesTimestamp": 1234567890
}
Document 2
{
"lineItemId": 1,
"lineNo": 2,
"itemId": 1,
"productId": 1235,
"userId": 4711,
"salesQuantity": 1,
"productPrice": 5.99,
"salesGross": 5.99,
"salesTimestamp": 1234567890
}
I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), and a sale can consist of one or more lineItems. What I would like to get is, say, the 5 most uncommonly common products that were bought in conjunction with the sale of one specific productId.
The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of
{
  "movie": [
    122, 185, 231, 292, 316, 329, 355, 356, 362, 364, 370,
    377, 420, 466, 480, 520, 539, 586, 588, 589, 594, 616
  ],
  "user": 1
}
so it's unfortunately not really useful to me. I'd be very glad for an example or a suggestion using my "flat" structures. Thanks a lot in advance.
It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).
There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.
(Note: This plugin uses Rivers which were deprecated in Elasticsearch 1.5, so I'd check with the authors about plans to support more recent versions of Elasticsearch before adopting this suggestion.)
I don't have the amount of data that you do, but try this:
First, get the list of itemIds for the bundles that contain the productId you want to find "stuff" for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}
Then, using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [ 1, 2, 3, 4, 5, 6, 7, 11 ]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}
If I understand correctly, you have a doc per order line item. What you want is a single doc per order. The order doc should have an array of productIds (or an array of line-item objects that each include a productId field).
That way, when you query for orders containing product X, the significant_terms aggregation should find that product Y is uncommonly common in those orders.
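As a rough sketch (the document shape and the field name productIds are placeholders, and the query uses a plain bool/term filter rather than the older filtered query shown above), the reshaped order document could look like this:
{
  "itemId": 1,
  "userId": 4711,
  "productIds": [ 1234, 1235 ],
  "salesTimestamp": 1234567890
}
and the query to find what else is uncommonly commonly bought together with product 1234:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "productIds": 1234 } }
      ]
    }
  },
  "aggs": {
    "also_bought": {
      "significant_terms": {
        "field": "productIds",
        "size": 5
      }
    }
  }
}
You would probably also want to exclude productId 1234 itself from the resulting buckets.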

Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

In relational db our data looks like this:
Company -> Department -> Office
Elasticsearch version of the same data (flattened):
[
  {
    "officeID": 123,
    "officeName": "office 1",
    "state": "CA",
    "department": {
      "departmentID": 456,
      "departmentName": "Department 1",
      "company": {
        "companyID": 789,
        "companyName": "Company 1"
      }
    }
  },
  {
    "officeID": 124,
    "officeName": "office 2",
    "state": "CA",
    "department": {
      "departmentID": 456,
      "departmentName": "Department 1",
      "company": {
        "companyID": 789,
        "companyName": "Company 1"
      }
    }
  }
]
We need to find departments (or companies) by providing office information (such as state).
For example, since all I need is the department info, I can specify it like this (we are using NEST):
searchDescriptor = searchDescriptor.Source(x => x.Include("department"));
and get all departments with qualifying offices.
The problem is that I am getting multiple "department" records with the same id (one for each office).
We are using paging and sorting.
Would it be possible to get paged and sorted distinct results?
I have spent a few days trying to find an answer (exploring options like facets, aggregations and top_hits), but so far the only working option I see is a manual one: get results from Elasticsearch, group the data manually and pass it to the client. The problem with this approach is obvious: every time I grab the next portion, I'll have to fetch X extra records just in case some of the records are duplicates; since I don't know X in advance (and the number of such records could be huge), I will be forced either to fetch lots of data unnecessarily on every search, or to hit our search engine several times until I get the required number of records.
So far I have been unable to achieve my goal using aggregations: all I am getting is a document count, but I want the actual data. When I try to use top_hits, I do get data, but those really are top hits (sorted by the number of offices per department, ignoring the sorting I specified in the query). Here is an example of the code I tried:
searchDescriptor = searchDescriptor.Aggregations(a => a
    .Terms("myunique", t => t
        .Field("department.departmentID")
        .Size(10)
        .Aggregations(x => x
            .TopHits("mytophits", y => y
                .Source(true)
                .Size(1)
                .Sort(k => k.OnField("department.departmentName").Ascending())
            )
        )
    )
);
Does anyone know if Elasticsearch can perform operations like distinct and return unique records?
Update:
I can get results using top_hits (see below), but in this case I won't be able to use paging (it looks like the Elasticsearch aggregations feature doesn't support paging), so I am back to square one...
{
  "from": 0,
  "size": 33,
  "explain": false,
  "sort": [
    {
      "departmentID": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "include": [
      "department"
    ]
  },
  "aggs": {
    "myunique": {
      "terms": {
        "field": "department.departmentID",
        "order": {
          "mytopscore": "desc"
        }
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 5,
            "_source": {
              "include": [
                "department.departmentID"
              ]
            }
          }
        },
        "mytopscore": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  },
  "query": {
    "wildcard": { "officeName": "some office*" }
  }
}

Adjusting Elasticsearch _score based on field value, relative to other matching document's field value

We're updating our search system from Solr to Elasticsearch. We've already improved lots of things, but something we haven't got right yet is boosting a document's (product's) score by the popularity of the product (it's an ecommerce website).
This is what we have currently (with lots of irrelevant bits stripped out):
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "renal dog food",
          "fields": [ "family_name^20", "parent_categories^2", "description^0.2", "product_suffixes^8", "facet_values^5" ],
          "operator": "and",
          "type": "best_fields",
          "tie_breaker": 0.3
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score * log1p(1 + doc['popularity_score'].value)"
          }
        }
      ],
      "score_mode": "sum"
    }
  },
  "sort": [
    { "_score": "desc" }
  ]
}
The popularity_score field contains the total number of orders containing the item in the last 6 weeks. Some items will never have been ordered and some will have had up to 30,000 orders (with potentially a lot more as we continue to grow the business). It's quite a big range.
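To put numbers on that range: assuming the log1p in the script is the natural-log Math.log1p, an item with no orders gets a multiplier of log1p(1 + 0) = ln 2 ≈ 0.69, while an item with 30,000 orders gets log1p(1 + 30000) ≈ 10.3, i.e. roughly a 15x spread in the boost for a 30,000x spread in popularity.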
The problem we have is that a document (product) might be a really good match text-wise but not very popular. We then have another, not-very-relevant product that only just matches the query, but because it is very popular it jumps up the list. What we are looking for is something that will take the popularity_score relative to the popularity_score of the other matching results, applying some form of normalisation, rather than using it as-is (log1p doesn't always seem to be enough). Does anyone have any suggestions or ideas?
Thank you!

Terms aggregation: consider only the prefix when aggregating

In my Elasticsearch documents I have users and a representation of their place in the organization, for instance:
The CEO is position 1
The ones directly under the CEO will be 1/1, 1/2, 1/3, and so on
The ones under 1/1 will be 1/1/1, 1/1/2, 1/1/3, etc.
I have an aggregation in which I want to aggregate by VP, so I want everybody under 1/1, 1/2, 1/3.
To do that I created a query like this one:
"aggs": {
"information": {
"terms":{
"field": "position",
"script": "_value.replaceAll('(1/1/[0/]*[1-9]).+', '$1')"
}
This would take the prefix and replace the value with the group captured by the regex, so everyone under the same branch would end up with the same position, and then I could aggregate on it. However, this has poor performance.
I was thinking about using something like this instead:
"aggs": {
"information": {
"terms":{
"field": "position",
"prefix": "1/1/.*'
}
So I would group everyone whose position starts with 1/1 (1/1/1/1, 1/1/1/2 and 1/1/1/3 would be one group; 1/1/2/1, 1/1/2/2 and 1/1/2/3 would be a second group; and so on).
Is it possible?
If you know beforehand how deep a level you want to run this aggregation at, you could simply store the levels in separate fields:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "3",
"level_3": "2",
"level_4": null
}
But this would require many nested terms aggregations to reproduce the hierarchy. This version would make one such aggregation sufficient:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "1/3",
"level_3": "1/3/2",
"level_4": null
}
It also gives you a simpler query filter if you want to focus on people under, for example, 1/1: filter on field level_2 and run a terms aggregation on field level_3, as sketched below.
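A sketch of that query, assuming the level_* fields are keyword fields:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level_2": "1/1" } }
      ]
    }
  },
  "aggs": {
    "by_level_3_branch": {
      "terms": { "field": "level_3" }
    }
  }
}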
If you don't know the maximum level of the hierarchy you can use nested documents like this, but then queries and aggregations get a bit more complex:
{
  "name": "Jack",
  "own_level": 4,
  "bosses": [
    {
      "level": 1,
      "id": "1"
    },
    {
      "level": 2,
      "id": "1/3"
    },
    {
      "level": 3,
      "id": "1/3/2"
    }
  ]
}
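For illustration, a rough sketch of "everyone under 1/1, grouped by their level-3 branch" with this shape (assuming bosses is mapped as a nested field and bosses.id as a keyword; this is not part of the original answer):
{
  "size": 0,
  "query": {
    "nested": {
      "path": "bosses",
      "query": { "term": { "bosses.id": "1/1" } }
    }
  },
  "aggs": {
    "bosses": {
      "nested": { "path": "bosses" },
      "aggs": {
        "level_3_only": {
          "filter": { "term": { "bosses.level": 3 } },
          "aggs": {
            "branch": { "terms": { "field": "bosses.id" } }
          }
        }
      }
    }
  }
}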
