Automatically merge / rollup data in Elasticsearch

Is there an easy way to create a new index from aggregated results of another index (and maybe merge them)?
I have a large index with products that are similar. They have a product ID to identify which products belong together, but each one has a different URL, price and title (which I want to preserve somehow in the merge so I can still search it).
So if I ingest 8 product lines, I would love to have them all roll up into 1 product with a nested array holding the similar-product data.
I tried the rollup API with the job below, but I couldn't get it going the way I wanted, and I'm getting the feeling that it is only for historical / log data. All my data has the same timestamp since I update all of it every morning.
PUT _xpack/rollup/job/product
{
  "index_pattern": "products",
  "rollup_index": "products_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "7d"
    },
    "terms": {
      "fields": [
        "product_id"
      ]
    }
  },
  "metrics": [
    {
      "field": "total_price",
      "metrics": [
        "min",
        "max",
        "sum"
      ]
    }
  ]
}
Thanks!

For now the rollup API is mainly intended to roll up numerical data over time, not to merge documents. In your case I would merge the documents at the application level and index one document per product with the "subdocuments" in a nested object.
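A minimal sketch of that application-level merge, using the Python elasticsearch client (index names products/products_merged and the url, price and title fields are assumptions based on the question, not anything prescribed by Elasticsearch): scan the source index, group hits by product_id, and bulk-index one merged document per product.
from collections import defaultdict

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

es = Elasticsearch()

# Group all source documents by their shared product ID.
groups = defaultdict(list)
for hit in scan(es, index="products", query={"query": {"match_all": {}}}):
    src = hit["_source"]
    groups[src["product_id"]].append({
        "url": src["url"],      # assumed field names
        "price": src["price"],
        "title": src["title"],
    })

# Index one merged document per product, with the similar products kept
# as an array under "offers". Map "offers" as a nested field so each
# offer's title stays individually searchable.
actions = (
    {
        "_index": "products_merged",
        "_id": product_id,
        "_source": {"product_id": product_id, "offers": offers},
    }
    for product_id, offers in groups.items()
)
bulk(es, actions)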

Related

elasticsearch - how to combine results from two indexes

I have CDR log entries in Elasticsearch in the format below. When this document is created, I won't yet have any info about the delivery_status field.
{
  msgId: "384573847",
  msgText: "Message text to be delivered",
  submit_status: true,
  ...
  delivery_status: //comes later
}
Later, when the delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations I can reach 3000 or more transactions/sec, but if I mix in updates the ingestion rate becomes very slow and crawls at 100 or fewer txns/sec.
So I am thinking that I could create another index like the one below, where I store the delivery status along with the msgId:
{
  msgId: 384573847,
  delivery_status: 0
}
With this approach, I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query the records by joining these indices? I have heard of aliases, but I could not fully understand the concept and whether it can be applied to my use case.
Thanks to anyone helping me out with suggestions.
As you mentioned, you can index the documents into separate indices and use the collapse functionality of Elasticsearch to retrieve both documents.
Consider that you have indexed documents into index2 and index3 and both share a common msgId field; then you can use the query below:
POST index2,index3/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
But again, you need to consider querying performance with a large data set. You can do some benchmarking to evaluate query performance and decide whether merging at index time or collapsing at query time works better for you.
Regarding aliases: in the query above we provide index2,index3 (comma separated) as the index names. If you use an alias instead, you can use a single unified name to query both indices.
You can add both indices to a single alias using the command below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index3",
        "alias": "order"
      }
    },
    {
      "add": {
        "index": "index2",
        "alias": "order"
      }
    }
  ]
}
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
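A minimal sketch (with the same assumed index names) of running that collapse query from the Python client and reading each msgId's documents out of the inner_hits section, so the CDR document and the delivery-status document come back together:
from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(index="index2,index3", body={
    "query": {"match_all": {}},
    "collapse": {
        "field": "msgId",
        "inner_hits": {"name": "most_recent", "size": 5}
    }
})

for hit in resp["hits"]["hits"]:
    msg_id = hit["fields"]["msgId"][0]
    related = hit["inner_hits"]["most_recent"]["hits"]["hits"]
    # 'related' holds every document sharing this msgId, i.e. the CDR
    # entry and (once indexed) its delivery-status counterpart.
    print(msg_id, [d["_source"] for d in related])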

Fetch other document value while framing DSL query

I'm adding a script_score to my DSL query to boost results based on the price of a source product, something like this:
"functions": [
  {
    "script_score": {
      "script": "if (doc['price'].size() > 0 && doc['price'].value == 1000.0) {15}"
    }
  }
],
Here "1000.0" is the price of one document that is not part of the response. To achieve this I have to query twice: first to get the price of that document, and then to build the query with that price added so it can boost the results, which causes performance degradation.
There should be some way to do this with Painless scripting, getting the value for a given id, but I'm not able to figure it out. It would be great if someone could help with this.
TIA!
Actually, there is a way. I've tried to reproduce your case below; I hope it's close enough to what you have.
Say you have one index with prices, i.e. the one from which you first fetch the document with the price that should be boosted:
PUT prices/_doc/1
{
  "price": 1000
}
Then your main index is the one containing the documents the second query runs on; say it's for products and we have one product with price 1000 and another with price 500.
PUT products/_doc/q
{
  "name": "product q",
  "price": 1000
}

PUT products/_doc/2
{
  "name": "product 2",
  "price": 500
}
Now, the query would look like this. In the function_score query we give a boost of 15 (i.e. the hardcoded value in your script) to documents whose price matches the price of the document with ID 1 in the prices index. The terms lookup query will fetch the price of 1000 from the specified document (i.e. with ID 1) and then boost the documents having that price with a weight of 15.
GET products/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "weight": 15,
          "filter": {
            "terms": {
              "price": {
                "index": "prices",
                "id": "1",
                "path": "price"
              }
            }
          }
        }
      ]
    }
  }
}
I hope this solves your problem.

Elasticsearch: Product-variant-price modelling and query problem

I want to use Elasticsearch to improve performance on product search (duh) in an e-commerce solution. We have a data model where a product can have multiple variants and each variant can have one or more prices (sometime quite a substantial number of prices).
The user chooses, at query time, whether to return products or variants, and only one price should be returned (the lowest valid price; each price has a number of fields such as valid from/to and valid customer groups).
My first approach was to denormalize products/variants and have prices as nested fields, but this was quite slow and I had a few problems sorting (I think on price, but the exact details elude me right now).
The second approach was to denormalize completely, so every product/variant/price combination is represented as a document. This approach is much faster (obviously); I can aggregate on productId or variantId and get the lowest price, but the problem is that I cannot sort the aggregates on non-numeric or non-aggregated fields.
Denormalized documents (productId and variantId are keyword fields, price is numeric, validFrom/-To are dates and the rest is text):
[
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ccc",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Green mega-product",
    "variant_description": "Behold the awesomeness of the green magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 399
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-01T00:00:00Z",
    "validTo": null,
    "price": 499
  },
  {
    "productId": "111-222-333",
    "variantId": "aaa-bbb-ddd",
    "product_title": "Mega-product",
    "product_description": "This awesome piece of magic will change your life",
    "variant_title": "Blue mega-product",
    "variant_description": "Behold the awesomeness of the blue magic mega-product",
    "color": [
      "blue",
      "green"
    ],
    "brand": "DaBrand",
    "validFrom": "2019-06-05T00:00:00Z",
    "validTo": "2019-06-10T00:00:00Z",
    "price": 399
  }
]
An example of a working query where I sort on the aggregated price:
{
  "size": 1,
  "sort": {
    "product_name_text_en.keyword": "asc"
  },
  "query": {
    // All the query and filtering
  },
  "aggs": {
    "by_product_id": {
      "terms": {
        "field": "product_id_string",
        "order": {
          "min_price": "desc"
        }
      },
      "aggs": {
        "min_price": {
          "min": {
            "field": "price_decimal"
          }
        }
      }
    }
  }
}
However, using this approach I cannot find a way to sort on document fields. It is possible (I think) on numeric, boolean and date fields using bucket_sort, but I need to be able to sort on, for example, the brand or title field (which are text). If it were possible to order on a top_hits aggregation I would be home free, but as I understand from the docs that's unfortunately not possible (I've also tried it just to make sure).
Can anyone guide me to a better solution? I don't mind if I have to do the query in two steps, but to make that work for sorting I would likely need a few different "document types", like Product, Variant, ProductPrice and VariantPrice, to use depending on the requested sort order. I'm not that far gone, so remodelling is definitely on the table; I've considered using join fields, but I'm not sure that would be performant.
Since the number of products and variants (and prices) can be significant - a million products is definitely on the table - I think I will have problems getting IDs from one query (for example filtering on brand and sorting on title) and then sending them into a get-best-price query.
I figured this out by accident when I was reading the docs for another case. It all became very simple when I found out about field collapsing. I feel like I should've known about this...
The index has the same model as in my initial question, but the query became much simpler:
{
  "size": 10,
  "query": {
    // filter/match stuff, including filtering valid prices.
  },
  "collapse": {
    "field": "productId",
    "inner_hits": {
      "name": "least_price",
      "collapse": {
        "field": "price"
      },
      "size": 1,
      "sort": [
        {
          "price": "asc"
        }
      ]
    }
  },
  "sort": [
    {
      "brand.keyword": "asc"
    }
  ]
}
And to return variants instead of products I just collapse on variantId.
The collapsing is based on productId or variantId, and the least_price inner_hits returns the document with the lowest price (sorted ascending by price, picking the first) among the documents matching my criteria. Works like a charm.

Elasticsearch query speed up with repeated used terms query filter

I need to find the number of times a single tag co-occurs with a fixed set of tags taken as a whole. I have 10,000 different single tags, and there are 10k tags inside the fixed set. I loop through all single tags in the context of the fixed set of tags with a fixed time range. I have a total of 1 billion documents in the index, spread over 20 shards.
Here is the Elasticsearch query (Elasticsearch 6.6.0):
es.search(index=index, size=0, body={
    "query": {
        "bool": {
            "filter": [
                {"range": {
                    "created_time": {
                        "gte": fixed_start_time,
                        "lte": fixed_end_time,
                        "format": "yyyy-MM-dd-HH"
                    }
                }},
                {"term": {"tags": dynamic_single_tag}},
                {"terms": {"tags": {
                    "index": "fixed_set_tags_list",
                    "id": 2,
                    "type": "twitter",
                    "path": "tag_list"
                }}}
            ]
        }
    },
    "aggs": {
        "by_month": {
            "date_histogram": {
                "field": "created_time",
                "interval": "month",
                "min_doc_count": 0,
                "extended_bounds": {
                    "min": two_month_start_time,
                    "max": start_month_start_time
                }
            }
        }
    }
})
My question: is there any solution that would let Elasticsearch cache the terms query for the fixed 10k set of tags and the time range filter, so the query can be sped up? The query above takes 1.5s for one single tag.
What you are seeing is normal behavior for Elasticsearch aggregations (actually, pretty good performance given that you have 1 billion documents).
There are a couple of options you may consider: using a batch of filter aggregations, re-indexing a subset of the documents, and downloading the data out of Elasticsearch to compute the co-occurrences offline.
But it is probably worth trying to send those 10K queries first and see whether Elasticsearch's built-in caching kicks in.
Let me explain each of these options in a bit more detail.
Using filter aggregation
First, let's outline what we are doing in the original ES query:
filter documents with created_time in a certain time window;
filter documents containing the desired tag dynamic_single_tag;
also filter documents that have at least one tag from the list fixed_set_tags_list;
count how many such documents there are per month in a certain time period.
The performance is a problem because we've got 10K tags to make such queries for.
What we can do here is move the filter on dynamic_single_tag from the query into the aggregations:
POST myindex/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { ... } }
      ]
    }
  },
  "aggs": {
    "by tag C": {
      "filter": {
        "term": {
          "tags": "C" <== here's the filter
        }
      },
      "aggs": {
        "by month": {
          "date_histogram": {
            "field": "created_time",
            "interval": "month",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2019-01-01",
              "max": "2019-02-01"
            }
          }
        }
      }
    }
  }
}
The result will look something like this:
"aggregations" : {
  "by tag C" : {
    "doc_count" : 2,
    "by month" : {
      "buckets" : [
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 0
        }
      ]
    }
  }
}
Now, if you are asking how this can help performance, here is the trick: you add more such filter aggregations, one for each tag: "by tag D", "by tag E", etc.
The improvement will come from doing "batch" requests, combining many initial requests into one. It might not be practical to put all 10K of them in one query, but even batches of 100 tags per query can be a game changer.
(Side note: roughly the same behavior can be achieved via a terms aggregation with the include filter parameter.)
This method of course requires getting your hands dirty and writing a slightly more complex query, but it will come in handy if you need to run such queries at random times with zero preparation; a sketch of building such a batched request follows.
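A rough sketch of that batching idea in the same Python style as the question (the build_batched_aggs helper, the batch contents and the myindex name are my own illustration, not an Elasticsearch API): generate one filter aggregation per tag and send them all in a single search request.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def build_batched_aggs(tags, start="2019-01-01", end="2019-02-01"):
    # One filter aggregation per tag, each with the same date_histogram inside.
    return {
        f"by tag {tag}": {
            "filter": {"term": {"tags": tag}},
            "aggs": {
                "by month": {
                    "date_histogram": {
                        "field": "created_time",
                        "interval": "month",
                        "min_doc_count": 0,
                        "extended_bounds": {"min": start, "max": end},
                    }
                }
            },
        }
        for tag in tags
    }

# e.g. 100 tags per request in practice; repeat for the next batch.
batch = ["C", "D", "E"]
body = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"terms": {"tags": {
            "index": "fixed_set_tags_list",
            "id": 2,
            "type": "twitter",
            "path": "tag_list",
        }}}
    ]}},
    "aggs": build_batched_aggs(batch),
}
response = es.search(index="myindex", body=body)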
Re-index the documents
The idea behind the second method is to reduce the set of documents beforehand via the reindex API. A reindex request might look like this:
POST _reindex
{
  "source": {
    "index": "myindex",
    "type": "_doc",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "created_time": {
                "gte": "fixed_start_time",
                "lte": "fixed_end_time",
                "format": "yyyy-MM-dd-HH"
              }
            }
          },
          {
            "terms": {
              "tags": {
                "index": "fixed_set_tags_list",
                "id": 2,
                "type": "twitter",
                "path": "tag_list"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "myindex_reduced"
  }
}
This query will create a new index, myindex_reduced, containing only the elements that satisfy the first 2 filter clauses.
At this point, the original query can be run without those 2 clauses.
The speed-up in this case will come from limiting the number of documents: the smaller it gets, the bigger the gain. So if fixed_set_tags_list leaves you with a small portion of the 1 billion documents, this is the option you should definitely try.
Downloading data and processing outside Elasticsearch
To be honest, this use case looks more like a job for pandas. If data analytics is your thing, I would suggest using the scroll API to extract the data to disk and then process it with an arbitrary script.
In Python it could be as simple as using the .scan() helper method of the elasticsearch library.
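For instance, a minimal sketch along those lines (index and field names are assumptions taken from the question, and the fixed tag set is assumed to be available on the client side): scroll over the time window once and count co-occurrences offline.
from collections import Counter

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

# Hypothetical: the ~10k tags of the fixed set, loaded from wherever they live.
fixed_tags = {"tagA", "tagB"}

query = {"query": {"range": {"created_time": {
    "gte": "2019-01-01", "lte": "2019-02-01", "format": "yyyy-MM-dd"
}}}}

co_occurrences = Counter()
for hit in scan(es, index="myindex", query=query, _source=["tags"]):
    tags = set(hit["_source"].get("tags", []))
    if tags & fixed_tags:                # document matches the fixed set
        for tag in tags - fixed_tags:    # count every co-occurring single tag
            co_occurrences[tag] += 1

print(co_occurrences.most_common(10))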
Why not try the brute force approach?
Elasticsearch will already try to help you with your query via the request cache. It is applied only to pure-aggregation queries (size: 0), so it should work in your case.
But it will not, because the content of the query will always be different (the whole JSON of the query is used as the caching key, and we have a new tag in every query). A different level of caching will come into play.
Elasticsearch heavily relies on the filesystem cache, which means that under the hood the more frequently accessed blocks of the filesystem will get cached (practically loaded into RAM). For the end user it means that "warming up" will come slowly and only with a volume of similar requests.
In your case, aggregations and filtering will occur on 2 fields: created_time and tags. This means that after doing maybe 10 or 100 requests with different tags, the response time will drop from 1.5s to something more bearable.
To demonstrate my point: in a Vegeta plot from my own study of Elasticsearch performance under the same heavy-aggregation query sent at a fixed RPS, the request initially took ~10s, and after 100 requests it diminished to a brilliant 200ms.
I would definitely suggest trying this "brute force" approach, because if it works it is good, and if it does not, it costs nothing.
Hope that helps!

Sorting and filtering elastic search top hits docs

Let's say I have a simple product index for docs like this:
{
  "product_name": "some_product",
  "category": "some_category",
  "price": "200",
  "sold_times": "5",
  "store": "store1"
}
and I want to get the most expensive products in their category and per store that have been sold fewer than 3 times, and I want them ordered by store, category and price.
I can use two terms aggregations and a top_hits aggregation to get the most expensive products in their category per store, but how do I sort and filter that top hits result? I really need to filter the results after the top_hits aggregation is performed, so a plain filter query is not the solution. How can I do this? Thanks.
EDIT:
Long story short - I need the Elasticsearch equivalent of this SQL:
SELECT p.*
FROM products AS p
INNER JOIN (
SELECT max(price) AS price, category, store
FROM products
GROUP BY category, store
) AS max_prices ON p.price = max_prices.price AND p.category = max_prices.category AND p.store = max_prices.store
WHERE p.sold_times < 3;
You could filter the search to only return products sold fewer than 3 times, then aggregate those by store and category, and finally apply a top_hits aggregation to get the most expensive item in the category (for that store). Something like:
{
  "size": 0,
  "query": {
    "range": {
      "sold_times": {
        "lt": 3
      }
    }
  },
  "aggs": {
    "store": {
      "terms": {
        "field": "store",
        "size": 10
      },
      "aggs": {
        "category": {
          "terms": {
            "field": "category",
            "size": 10
          },
          "aggs": {
            "most_expensive": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "price": {
                      "order": "desc"
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
Well, after some searching, I have found a "possible" solution. I could use the Bucket Selector aggregation together with some script that would make the top hits' properties accessible for filtering, and a similar approach for sorting using the Bucket Sort aggregation (some info can be found here: How do I filter top_hits metric aggregation result [Elasticsearch]).
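A rough sketch of that mechanism, in the same Python style as above (the field names, bucket sizes and the price threshold of 100 are made up; note that bucket_selector and bucket_sort can only work on numeric bucket-level metrics such as a max over price, not on arbitrary top_hits fields, which is part of why this gets unwieldy):
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "size": 0,
    "aggs": {
        "store": {
            "terms": {"field": "store", "size": 100},
            "aggs": {
                "category": {
                    "terms": {"field": "category", "size": 100},
                    "aggs": {
                        "max_price": {"max": {"field": "price"}},
                        # Keep only category buckets whose most expensive
                        # item costs at least 100.
                        "only_expensive": {
                            "bucket_selector": {
                                "buckets_path": {"maxPrice": "max_price"},
                                "script": "params.maxPrice >= 100"
                            }
                        },
                        # Sort the surviving buckets by that metric and
                        # page through them.
                        "paged": {
                            "bucket_sort": {
                                "sort": [{"max_price": {"order": "desc"}}],
                                "from": 0,
                                "size": 10
                            }
                        }
                    }
                }
            }
        }
    }
}
response = es.search(index="products", body=body)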
But I'm facing another issue with aggregations. Because there are a lot of categories, I want to use pagination (like "scroll" or "size and from" in a common search query), but that cannot be done easily with aggregations. There's a Composite aggregation which could do something similar, but the resulting query would be so complicated that it scared me off, so I decided to give it up and do the grouping outside of Elasticsearch.
It is sad that there is no easy way to do such a common analytic query in Elasticsearch.
