Elasticsearch past version document - elasticsearch

I want to keep the last two versions of each document in Elasticsearch.
For example, I first index product123:
PUT /products/_doc/product123
{ "name" : "toothPaste",
"price" : 10
}
Then I update product123 a second time:
PUT /products/_doc/product123
{
"name" : "toothPaste",
"price" : 12
}
When I query with the GET API, I get "price": 12 (the current version).
Is it possible to get "price": 10 (the previous version) from the same index?

The only way to do this in Elasticsearch is to manage it yourself, because an update to a document does not retain the previous version.
You could do this with separate documents, as MAZux mentioned above, or with separate fields in the same document, e.g. price and previous_price.
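A minimal sketch of the "different fields" option, assuming the application reads the current document before writing the new one. The helper name apply_update and the previous_ prefix are hypothetical; nothing here calls Elasticsearch, it only builds the body you would then PUT.

```python
def apply_update(current, new_fields, tracked=("price",)):
    """Build the document to index, copying each changed tracked field
    into previous_<field> so the last value survives the overwrite."""
    updated = dict(current or {})
    for field in tracked:
        if field in updated and field in new_fields and updated[field] != new_fields[field]:
            updated["previous_" + field] = updated[field]
    updated.update(new_fields)
    return updated


# First write: nothing to preserve yet.
first = apply_update(None, {"name": "toothPaste", "price": 10})
# Second write: the old price moves to previous_price.
second = apply_update(first, {"name": "toothPaste", "price": 12})
print(second)  # {'name': 'toothPaste', 'price': 12, 'previous_price': 10}
```

This keeps exactly one earlier value per tracked field; keeping more history than that would need separate documents.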

Related

Elastic Search - Sorting & Filtering on nested Documents

I am working on an e-commerce application. Catalog data is served by Elasticsearch.
I have product documents that are already indexed in Elasticsearch.
A document looks something like this (a few fields are excluded for readability):
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
}
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
The store object in the above document is where the ES query looks for the price and decides whether the item is in stock or out of stock.
I would like to add more child objects to store (data from multiple inventories). This can go beyond 150 child objects per product.
Eventually, a product document will look something like this, with data from multiple inventories mapped to a single document:
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 125.0,
"_id" : "1234_112",
"product_code" : "1234",
"warehouse_code" : 112,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 105.0,
"_id" : "1234_113",
"product_code" : "1234",
"warehouse_code" : 113,
"available_unit" : 100
}
... up to N stores
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
Functional requirements:
For any product, we should show the lowest price across all warehouses.
For example, if a product has 50 stores mapped to it, the Elasticsearch query should look into the nested objects and return the lowest price among those 50 stores where the item is available.
Performance should not degrade.
Challenges:
If we start storing that many stores for each product, the data volume will grow considerably. Will that be a problem?
What is the most efficient way to extract the lowest price from the nested documents?
How would facets work with nested documents? For example, if I apply a price range filter, ES might pick up data that was not shown earlier (it might match on a different store whose price falls in the range).
We are using a template to query ES, and the Elasticsearch version is 6.0.
Thanks in advance!
First, there are improvements to nested document search in version 7.x that are worth the upgrade.
As for version 6.x, there are too many factors for me to give you a concrete answer. It also seems you may be misunderstanding how nested documents work: they are not relational.
In particular, when you say that each product might have 50 stores mapped to it, that sounds like you are implying a relationship, which does not exist with nested documents. Instead, the values from those 50 stores are stored as nested documents under the parent document. Having 50 stores under a product or category does not sound concerning.
Elasticsearch has not really talked in terms of facets since the introduction of the aggregation framework. It's not that they don't exist; it's just not how they are discussed.
So let's try this. Elasticsearch optimizes search and query through a divide-and-conquer mechanism. The data is spread across a configurable number of shards, and each shard is responsible for searching its own data. Those shards can in turn be distributed across many machines, so that many CPUs and lots of memory are available for the search. So growing the data doesn't matter if you are willing to grow the cluster, because you can keep each machine doing the same amount of work as before.
Filters let Elasticsearch drastically reduce the data it has to look at, so adding more filters tends to improve performance, whereas on a relational database performance tends to decline as filters are added.
Now back to nested documents. Each nested object is indexed as a separate hidden document next to its parent, but instead of returning the nested document, a match maps back to the parent document ID. So your nested docs aren't stored inline with the rest of the document, though they are not truly separate either. That means the nested documents should have minimal impact on the performance of queries against the parent documents. But if your data grows beyond the capacity of your current system, you will still need to increase its size.
As to how you would query: use Elasticsearch aggregations. They will let you calculate your "facet" counts and identify the best prices. Aggregations are very powerful and very fast. There are well-documented caveats, but in general they work as you would expect.
In version 6.x, query string queries cannot reach the fields of a nested document, so a nested query in the full query DSL must be used instead.
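As a rough sketch of the aggregation approach described above, the following builds a search body that finds the lowest price among in-stock stores. It assumes the store field is mapped with type nested (the question does not show the mapping); the aggregation names stores, in_stock and lowest_price are made up for illustration. The body would be sent to the index's _search endpoint.

```python
import json


def lowest_in_stock_price_agg(path="store"):
    """Search body: step into the nested stores, keep only those with
    stock, and take the minimum default_price."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "stores": {
                "nested": {"path": path},
                "aggs": {
                    "in_stock": {
                        "filter": {"range": {path + ".available_unit": {"gt": 0}}},
                        "aggs": {
                            "lowest_price": {"min": {"field": path + ".default_price"}}
                        },
                    }
                },
            }
        },
    }


body = lowest_in_stock_price_agg()
print(json.dumps(body, indent=2))
```

As written, the minimum is taken over every product the query matches, so to get the lowest price of one product you would restrict the query to that product first.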
To recap:
Functional requirements:
For any product, we should show the lowest price across all warehouses. For example, if a product has 50 stores mapped to it, the Elasticsearch query should look into the nested objects and return the lowest price among those 50 stores where the item is available.
Yes, a nested aggregation will do this.
Performance should not degrade.
Performance will continue to depend on the ratio of the data size to the overall cluster size.
Challenges:
If we start storing that many stores for each product, the data volume will grow considerably. Will that be a problem?
No, this should not be a problem.
What is the most efficient way to extract the lowest price from the nested documents?
Elasticsearch aggregations.
How would facets work with nested documents? For example, if I apply a price range filter, ES might pick up data that was not shown earlier (it might match on a different store whose price falls in the range).
Yes, filtering works very well with aggregations. The aggregation is computed over the filtered data. In fact, you could have one aggregation for just the minimum price and, in the same query, a second aggregation over your price ranges, which gives you the count of documents that have a store within each price range; you could also add a sub-aggregation showing the stores under each price range.
We are using a template to query ES, and the Elasticsearch version is 6.0. Thanks in advance!
I know nothing about templates. The Elasticsearch API is so simple that I do not know why anyone uses additional tools on top of it; they add weight, increase complexity, and make key features unavailable whenever the wrapper author did not pass them through.
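The price-range "facet" mentioned in the recap could look roughly like the sketch below. It again assumes store is mapped as nested, and the bucket boundaries and aggregation names are invented for the example. Inside a nested aggregation the bucket counts are counts of stores, so a reverse_nested sub-aggregation is added to get the number of products per range.

```python
def price_range_facets(ranges, path="store"):
    """Search body: bucket nested stores by default_price and count
    the parent products that have a store in each bucket."""
    return {
        "size": 0,
        "aggs": {
            "stores": {
                "nested": {"path": path},
                "aggs": {
                    "price_ranges": {
                        "range": {"field": path + ".default_price", "ranges": ranges},
                        # Step back up to the parent documents to count products.
                        "aggs": {"products": {"reverse_nested": {}}},
                    }
                },
            }
        },
    }


# Hypothetical boundaries: below 110, 110 up to 120, 120 and above.
facets = price_range_facets([{"to": 110}, {"from": 110, "to": 120}, {"from": 120}])
```

In a range aggregation each bucket includes its from value and excludes its to value, so the three buckets above do not overlap.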

Elasticsearch 6.0 Removal of mapping types - Alternatives

Background
I am migrating my ES index to ES version 6. I am currently stuck because ES6 removed the use of the "_type" field.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is that with this structure I can easily remove all the documents of a specific user.
DELETE /myindex/user-123
(Deletes all the documents of a specific user with a single command)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the user ID as the index name. So my index will look like:
"user-123" -> "static-name" -> document
Deleting a user is done by deleting the index (instead of deleting the type, as in the previous implementation).
Questions:
My first worry is about the number of indexes and performance: is having around 1M indexes acceptable in terms of performance? Don't forget that I have to search them frequently.
Most of my users have only a small number of documents stored in ES. Does it make sense to hold a shard, which should be expensive, for fewer than 10 documents?
Does my data architecture sound reasonable to you?
Any other tips are welcome!
Thanks.
I would not have one index per user; it's a waste of resources, especially if there are only 10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
"actions" : [
{
"add" : {
"index" : "users",
"alias" : "user-123",
"filter" : { "term" : { "userId" : "123" } }
}
}
]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
Having that many indexes is definitely not a good approach. If your only concern is deleting multiple documents with a single command, you can use the Delete By Query API provided by Elasticsearch.
You can introduce a "subtype" attribute in all your documents, holding a per-user value such as "user-123". In your case, a document would look like this:
{
"attribute1":"value",
"subtype":"user-123"
}
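With that attribute in place, removing a user comes down to one delete-by-query request. The sketch below only builds the request; it assumes subtype is indexed as an exact-match (keyword) field so that a term query matches the whole value, and the index name myindex is taken from the question. The function name is hypothetical.

```python
import json


def delete_user_request(index, user_tag):
    """Return (method, path, body) for a delete-by-query that removes
    every document whose subtype equals user_tag."""
    path = "/{}/_delete_by_query".format(index)
    body = {"query": {"term": {"subtype": user_tag}}}
    return "POST", path, json.dumps(body)


method, path, body = delete_user_request("myindex", "user-123")
print(method, path)
print(body)
```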

Sum field and sort on Solr

I'm implementing a grouped search in Solr. I'm looking for a way of summing one field and sorting the results by this sum. I hope the following sample data makes it clearer.
{
[
{
"id" : 1,
"parent_id" : 22,
"valueToBeSummed": 3
},
{
"id" : 2,
"parent_id" : 22,
"valueToBeSummed": 1
},
{
"id" : 3,
"parent_id" : 33,
"valueToBeSummed": 1
},
{
"id" : 4,
"parent_id" : 5,
"valueToBeSummed": 21
}
]
}
If the search is made over this data I'd like to obtain
{
[
{
"numFound": 1,
"summedValue" : 21,
"parent_id" : 5
},
{
"numFound": 2,
"summedValue" : 4,
"parent_id" : 22
},
{
"numFound": 1,
"summedValue" : 1,
"parent_id" : 33
}
]
}
Do you have any advice on this?
Solr 5.1 and later (with further additions in 5.3) provide facet functions that solve this exact issue.
From Yonik's introduction of the feature:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
json.facet={
categories:{
type : terms,
field : cat,
sort : "x desc", // can also use sort:{x:desc}
facet:{
x : "avg(price)",
y : "sum(price)"
}
}
}
'
So my suggestion is to upgrade to the newest version of Solr. The most recent release is currently 5.2.1; be advised that some of the syntax in the linked post will only land in 5.3, the current release target.
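Adapted to the fields in the question, the facet request might look like the sketch below. The facet name parents and the statistic name summedValue are chosen for illustration, and the helper only builds the parameter string rather than calling Solr. The second function is a plain in-memory stand-in that shows the grouping, sum and ordering the facet is meant to produce.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode


def facet_params():
    """URL-encoded parameters for a terms facet on parent_id, sorted by
    the sum of valueToBeSummed in each bucket."""
    facet = {
        "parents": {
            "type": "terms",
            "field": "parent_id",
            "sort": "summedValue desc",
            "facet": {"summedValue": "sum(valueToBeSummed)"},
        }
    }
    return urlencode({"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)})


def sum_by_parent(docs):
    """Local reference: group by parent_id, sum, sort by sum descending."""
    totals = defaultdict(lambda: {"numFound": 0, "summedValue": 0})
    for doc in docs:
        group = totals[doc["parent_id"]]
        group["numFound"] += 1
        group["summedValue"] += doc["valueToBeSummed"]
    rows = [dict(parent_id=pid, **vals) for pid, vals in totals.items()]
    return sorted(rows, key=lambda r: r["summedValue"], reverse=True)


docs = [
    {"id": 1, "parent_id": 22, "valueToBeSummed": 3},
    {"id": 2, "parent_id": 22, "valueToBeSummed": 1},
    {"id": 3, "parent_id": 33, "valueToBeSummed": 1},
    {"id": 4, "parent_id": 5, "valueToBeSummed": 21},
]
params = facet_params()
grouped = sum_by_parent(docs)  # parent 5 (21), then 22 (4), then 33 (1)
```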
So you want to group your results on the field parent_id, sum the field valueToBeSummed inside each group, and then sort the groups by this new summed value. That is a very interesting use case...
Unfortunately, I don't think there is a built-in way of doing what you have asked.
There are function queries that you can use to sort, and there is also a group.func parameter, but they will not do what you have asked.
Have you already indexed this data, or are you still working out how to store it? If it's the latter, one possible way would be to add a summedvalue field to each document and calculate it when the document gets indexed. For example, given the sample documents in your question, the first document will be indexed as
{
"id" : 1,
"parent_id" : 22,
"valueToBeSummed": 3,
"summedvalue": 3,
"timestamp": current-timestamp
}
Before indexing the second document (id:2, parent_id:22), you run a Solr query to get the most recently indexed document with parent_id:22:
q=parent_id:22&sort=timestamp desc&rows=1
and add the summedvalue of id:1 to the valueToBeSummed of id:2.
So the next document will be indexed as
{
"id" : 2,
"parent_id" : 22,
"valueToBeSummed": 1,
"summedvalue": 4,
"timestamp": current-timestamp
}
and so on.
Once you have documents indexed this way, you can run a regular Solr query with &group=true&group.field=parent_id&sort=summedvalue desc.
Please do let us know how you decide to implement it. Like I said, it's a very interesting use case! :)
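The running-sum bookkeeping from this answer can be sketched as follows. To keep the example self-contained, an in-memory dict stands in for the "fetch the last indexed document with the same parent_id" Solr lookup, and the timestamp field is left out; the function name is hypothetical.

```python
def with_running_sums(docs):
    """Return copies of docs, in indexing order, each carrying the
    cumulative summedvalue for its parent_id so far."""
    running = {}  # parent_id -> sum of valueToBeSummed seen so far
    out = []
    for doc in docs:
        total = running.get(doc["parent_id"], 0) + doc["valueToBeSummed"]
        running[doc["parent_id"]] = total
        out.append(dict(doc, summedvalue=total))
    return out


indexed = with_running_sums([
    {"id": 1, "parent_id": 22, "valueToBeSummed": 3},
    {"id": 2, "parent_id": 22, "valueToBeSummed": 1},
])
# id 1 gets summedvalue 3, id 2 gets summedvalue 4
```

Only the most recently indexed document in each group carries the full total, which is why the final query sorts on summedvalue in descending order.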
You can use the query below:
select?q=*:*&stats=true&stats.field={!tag=piv1 sum=true}valueToBeSummed&facet=true&facet.pivot={!stats=piv1 facet.sort=index}parent_id&wt=json&indent=true
You need the Stats Component for this requirement. You can get more information here. The idea is to first define what you want stats on (here, valueToBeSummed) and then group on parent_id. We use facet.pivot for that.
Regarding the sort: when grouping, the default sort order is by the count in each group. We can sort by the value instead, which is what I have done above with facet.sort=index, so the result is sorted on parent_id, the field used for grouping. But your requirement is to sort on the sum of valueToBeSummed, which is different from the grouping attribute.
As of now I'm not sure whether we can achieve that, but I will look into it and let you know.
In short, you have the grouping and the sum from the query above; only the sort is pending.

Kibana: filter events for today

I'm using Kibana on top of logstash and I want to filter items in the index to today, or the last 24 hours is fine too.
So apparently this requires me to run a range query against the underlying ElasticSearch engine that would look like:
"range" : {
"timestamp" : {
"gte": "now-24h",
"lte": "now"
}
}
However - I can't put that in the filter box in Kibana 3:
This is a numeric range query and it doesn't work - but it shows the input box and the idea.
So my question: how can I create a filter that filters the events to a date range in Kibana 3?
Found it, it's in the top menu:
Clicking it generates the range filter as can be seen as the 2nd filter on the left.
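For reference, the range filter from the question can be written for either reading of "today". This is only a sketch of the Elasticsearch filter body, not of anything Kibana generates; the field name timestamp is taken from the question, and the helper name is made up. In Elasticsearch date math, now/d rounds down to the start of the current day.

```python
def time_range_filter(field="timestamp", today_only=False):
    """Range filter for the last 24 hours, or since midnight when
    today_only is set."""
    start = "now/d" if today_only else "now-24h"
    return {"range": {field: {"gte": start, "lte": "now"}}}


last_24h = time_range_filter()
today = time_range_filter(today_only=True)
```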

ElasticSearch - maintain original timestamp across versions

I have enabled automatic _timestamp on my indexes, but every time a document is updated during a _bulk request (or a regular update) the timestamp is also updated. This makes sense.
I want to know if there's a way to keep the original timestamp after an update, so that we only ever see the timestamp of version 1 no matter how many times the document is updated to a new version.
I have over 4 million documents and bulk update them in chunks of 1000, so I'd rather not iterate through every single item to compare timestamps.
Any tips?
For anyone who comes across this issue in the future: I ended up using a combination of a bulk update with a script to upsert a date. The date is only set when the document is first created and is left alone during updates. I'm not sure if this is the most elegant solution, but it works.
{ "update": {"_id" : "1"} }
{ "script": "", "upsert" : {"og_index_date" : 20130826} }
{ "update": {"_id" : "1"} }
{ "doc": {"field1" : "one", "field2": "two"}, "doc_as_upsert" : true }
Even though you're doubling your writes, it will maintain this date across versions.
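A small sketch of how the bulk payload for one document could be generated. The bulk body is newline-delimited JSON, one object per line with no separating commas and a trailing newline. The field names come from the answer above, the helper name is hypothetical, and the empty script is copied from the answer as-is; whether an empty script is accepted may depend on the Elasticsearch version.

```python
import json


def bulk_lines(doc_id, created_on, fields):
    """Two update actions for one document: the first sets og_index_date
    only if the document does not exist yet, the second applies the fields."""
    actions = [
        {"update": {"_id": doc_id}},
        {"script": "", "upsert": {"og_index_date": created_on}},
        {"update": {"_id": doc_id}},
        {"doc": fields, "doc_as_upsert": True},
    ]
    # json.dumps writes Python's True as JSON true.
    return "\n".join(json.dumps(a) for a in actions) + "\n"


payload = bulk_lines("1", 20130826, {"field1": "one", "field2": "two"})
print(payload)
```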
