Significant Terms Aggregation of "flat" structures - elasticsearch

I am currently trying to prototype a product recommendation system using the Elasticsearch Significant Terms aggregation. So far, I haven't found a good example that deals with "flat" JSON structures of sales (here: the itemId) coming from a relational database, such as mine:
Document 1
{
  "lineItemId": 1,
  "lineNo": 1,
  "itemId": 1,
  "productId": 1234,
  "userId": 4711,
  "salesQuantity": 2,
  "productPrice": 0.99,
  "salesGross": 1.98,
  "salesTimestamp": 1234567890
}
Document 2
{
  "lineItemId": 1,
  "lineNo": 2,
  "itemId": 1,
  "productId": 1235,
  "userId": 4711,
  "salesQuantity": 1,
  "productPrice": 5.99,
  "salesGross": 5.99,
  "salesTimestamp": 1234567890
}
I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), which can consist of one or more lineItems. What I would like to receive are, say, the 5 most uncommonly common products that were bought in conjunction with the sale of one specific productId.
The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of
{
  "movie": [
    122, 185, 231, 292, 316, 329, 355, 356, 362, 364, 370, 377, 420,
    466, 480, 520, 539, 586, 588, 589, 594, 616
  ],
  "user": 1
}
so it's unfortunately not really useful to me. I'd be very glad for an example or a suggestion using my "flat" structures. Thanks a lot in advance.

It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).
There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.
(Note: This plugin uses Rivers which were deprecated in Elasticsearch 1.5, so I'd check with the authors about plans to support more recent versions of Elasticsearch before adopting this suggestion.)

Since I don't have the amount of data that you do, try this:
First, get the list of itemIds for bundles that contain the productId you want to find "stuff" for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}
Then, using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [1, 2, 3, 4, 5, 6, 7, 11]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}

If I understand correctly you have one doc per order line item. What you want is a single doc per order. The order doc should have an array of productIds (or an array of line-item objects that each include a productId field).
That way, when you query for orders containing product X, the significant_terms aggregation should find that product Y is uncommonly common in those orders.
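For illustration, a minimal sketch of such a remodelled order document and the corresponding query, assuming an orders index with an order type and a flat productIds array (the index, type, and field names here are made up, so adjust them to your own mapping):
PUT /orders/order/1
{
  "itemId": 1,
  "userId": 4711,
  "productIds": [1234, 1235],
  "salesTimestamp": 1234567890
}

GET /orders/order/_search?search_type=count
{
  "query": {
    "term": {
      "productIds": 1234
    }
  },
  "aggs": {
    "related_products": {
      "significant_terms": {
        "field": "productIds",
        "size": 5
      }
    }
  }
}
Note that productId 1234 itself will show up as a highly significant term (it is in every matching order), so you will probably want to drop it from the results client-side.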

Related

Fetch other document value while framing DSL query

I'm adding a script_score to my DSL query to boost results based on the price of a source product, something like this:
functions":[
{
"script_score":{
"script":"if (doc['price'].size() > 0 && doc['price'].value == 1000.0) {15}"
}
}
],
Here "1000.0" is the price of one document that is not part of the response. To achieve this, I have to query 2 times, first to get the price of that document and then frame the query and add the price to the query and boost the results, which is causing performance degradation.
There should be some way to do this using painless scripting to get the value of an id, but I'm not able to get it. It would be great if someone can help on this.
TIA!
Actually, there is a way. I've tried to reproduce your case below; I hope it's close enough to what you have.
Say you have one index with prices, i.e. the one from which you first fetch the document whose price should be boosted.
PUT prices/_doc/1
{
  "price": 1000
}
Then, your main index is the one that contains the documents on which the second query runs; say it's for products, and we have one product with price 1000 and another with price 500.
PUT products/_doc/q
{
  "name": "product q",
  "price": 1000
}

PUT products/_doc/2
{
  "name": "product 2",
  "price": 500
}
Now, the query would look like this. In the function_score query we give a boost of 15 (i.e. the hardcoded value from your script) to documents whose price matches the price of the document with ID 1 in the prices index. The terms lookup will fetch the price of 1000 from that document and the products having that price will be boosted with a weight of 15.
GET products/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "weight": 15,
          "filter": {
            "terms": {
              "price": {
                "index": "prices",
                "id": "1",
                "path": "price"
              }
            }
          }
        }
      ]
    }
  }
}
I hope this solves your problem.

Sorting a set of results with pre-ordered items

I have a list of pre-ordered items (ordered by score ASC) like:
[{
  "id": "id2",
  "score": 1
}, {
  "id": "id12",
  "score": 1
}, {
  "id": "id8",
  "score": 1.4
}, {
  "id": "id9",
  "score": 1.4
}, {
  "id": "id14",
  "score": 1.75
}, {
  ...
}]
Let's say I have an Elasticsearch index with a massive number of items. Note that there's no "score" field in the indexed documents.
Now I want Elasticsearch to return only those items whose ids are in that list. OK, this part is easy. Where I'm stuck is sorting the results: they need to be sorted exactly as in my pre-ordered list above.
Any suggestions for how to achieve that?
I'm not a native English speaker, so sorry for my grammar and wording.
As of version 7.4, Elasticsearch introduced the pinned query, which promotes selected documents to rank higher than those matching a given query. In your case this search query should return what you want:
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8"],
      "organic": {
        other queries
      }
    }
  }
}
For more information you can check the official Elasticsearch documentation here.
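For illustration, a filled-in version of the above, where the organic part is just a simple match query (the title field and the search text are made up). The pinned ids are listed in your pre-ordered sequence, since pinned documents are ranked in the order their ids appear in the ids array:
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8", "id9", "id14"],
      "organic": {
        "match": {
          "title": "some search text"
        }
      }
    }
  }
}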

Elastic search multi index query

I am building an app where I need to match users based on several parameters. I have two Elasticsearch indexes: one with the user's likes and dislikes, and one with some metadata about the user.
/user_profile/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"]
}

/user_metadata/abc12345
{
  "userId": "abc12345",
  "seenBy": ["aaa123", "bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross reference them, but how do I do that? For example I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?
If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). Then you can (1) issue a boolean query over both indices to match documents from one index based on the likes field and documents from the other index based on the seenBy field, and (2) use a terms bucket aggregation to get the list of unique userIds that satisfy your conditions.
For example
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "likes": "chocolate"
          }
        },
        {
          "match": {
            "seenBy": "abc123"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      }
    }
  }
}
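If you do merge the indices as recommended above, the "likes chocolate and has NOT been seen by abc123" condition becomes a plain bool query. A sketch, assuming a combined users index that holds both fields in one document (the index name is made up):
PUT users/_doc/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"],
  "seenBy": ["aaa123", "bbb123", "ccc123"]
}

GET users/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "likes": "chocolate" } }
      ],
      "must_not": [
        { "match": { "seenBy": "abc123" } }
      ]
    }
  }
}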

Elasticsearch: Product-variant-price modelling and query problem

I want to use Elasticsearch to improve performance on product search (duh) in an e-commerce solution. We have a data model where a product can have multiple variants and each variant can have one or more prices (sometime quite a substantial number of prices).
The user chooses at query time whether products or variants should be returned, and only one price should be returned per hit (the lowest valid price; each price has a number of fields such as valid from/to and valid customer groups).
My first approach was to denormalize products/variants and keep the prices as nested fields, but this was quite slow and I had a few problems with sorting (I think on price, but the exact details elude me right now).
The second approach was to denormalize completely, so that every product/variant/price combination is represented as a document. This approach is much faster (obviously); I can aggregate on productId or variantId and get the lowest price, but the problem is that I cannot sort the aggregates on non-numeric or non-aggregated fields.
Denormalized documents (productId and variantId are keyword fields, price is numeric, validFrom/-To are dates, and the rest are text):
[
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ccc",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Green mega-product",
    "variant_description": "Behold the awesomeness of the green magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 399
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 499
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-05T00:00:00Z",
    "validTo": "2019-06-10T00:00:00Z",
    "price": 399
  }
]
An example of a working query where I sort on the aggregated price.
{
  "size": 1,
  "sort": {
    "product_name_text_en.keyword": "asc"
  },
  "query": {
    // All the query and filtering
  },
  "aggs": {
    "by_product_id": {
      "terms": {
        "field": "product_id_string",
        "order": {
          "min_price": "desc"
        }
      },
      "aggs": {
        "min_price": {
          "min": {
            "field": "price_decimal"
          }
        }
      }
    }
  }
}
However, using this approach I cannot find a way to sort on document fields. It is possible (I think) on numeric, boolean, and date fields using bucket_sort, but I need to be able to sort on, for example, the brand or title field (which are text). If it were possible to order on a top_hits aggregation I would be home free, but as I understand from the docs that's unfortunately not possible (I've also tried it just to make sure).
Can anyone guide me to a better solution? I don't mind doing the query in two steps, but to make that work for sorting I would likely need a few different "document types", like Product, Variant, ProductPrice and VariantPrice, to use depending on the requested sort order. I'm not that far gone, so remodelling is definitely on the table; I've considered using join fields, but I'm not sure that would be performant.
Since the number of products and variants (and prices) can be significant (a million products is definitely possible), I think I would have problems getting ids from one query (for example, filtering on brand and sorting on title) and then sending them into a get-best-price query.
I figured this out by accident when I was reading the docs for another case. It all became very simple when I found out about Field collapsing. I feel like I should've known about this...
The index has the same model as in my initial question, but the query became much simpler:
{
  "size": 10,
  "query": {
    // filter/match stuff, including filtering valid prices.
  },
  "collapse": {
    "field": "productId",
    "inner_hits": {
      "name": "least_price",
      "collapse": {
        "field": "price"
      },
      "size": 1,
      "sort": [
        {
          "price": "asc"
        }
      ]
    }
  },
  "sort": [
    {
      "brand.keyword": "asc"
    }
  ]
}
And to return variants instead of products, I just collapse on variantId instead.
The collapsing is based on productId or variantId, and the least_price inner_hits returns the document with the lowest price (sorted ascending by price and taking the first hit) among the documents matching my criteria. Works like a charm.

Elastic Search - Having clause equivalent

I have written a faceted query which returns the faceted results (equivalent to GROUP BY in the SQL world).
Now I would like to get only the faceted results where the count is greater than a particular number (equivalent to a HAVING clause in SQL).
Any suggestions?
Update: added the query.
I need only the locations where the count is greater than 5. For example, US has 7, UK has 5, and the rest have 3 each, so I want to return only US and UK in the result.
"facets":
{
"locations":
{
"terms":
{
"field": "location"
},
"facet_filter":
{
"terms": { "location": [ "US", "UK", "DE", "FR", "JP" ]}
}
}
}
HAVING clauses are not implemented in Elasticsearch yet. You have to handle that client-side.
See https://github.com/elastic/elasticsearch/issues/8110
There are plans to add it, but it has not been done as of May 2015.
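That said, for the narrow case of keeping only buckets above a minimum document count, the terms aggregation (the successor to facets) has a min_doc_count parameter. A sketch, assuming you can switch from facets to aggregations (the field name is taken from your example; 5 here would keep US and UK and drop the rest):
{
  "size": 0,
  "aggs": {
    "locations": {
      "terms": {
        "field": "location",
        "min_doc_count": 5
      }
    }
  }
}
This only covers minimum bucket counts; a general HAVING on arbitrary aggregated values still has to be handled client-side, as noted above.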
