Elastic Search document modeling for history - elasticsearch

I want to store products in Elasticsearch.
Each product has some fields (description, quantity, price, name), but the price and quantity can change every day.
How can I store this in Elasticsearch so that I can search any product across all of its past prices?
Should I have one document for the current field values and another document, with the product document as its parent, to which a daily task appends the date and the changed value in an array?

Unfortunately, there's no built-in way to keep a version history in Elasticsearch. The built-in versioning isn't designed for the retrieval of previous versions. You will need to control versioning at the application layer.
What we've ultimately elected to do is store all the old copies of the documents like this:
{
"unversioned_prop1": "prop1",
"unversioned_prop2": "prop2",
...
"versions": [
{
"version": "version_x",
"version_metadata": { ... },
"document": {
"versioned_prop3": "prop3",
"versioned_prop4": "prop4"
...
}
},
{ "version": "version_y", "document": { ... versioned props ... } },
...
],
"current": { ... current versioned props ... }
}
Unversioned Properties
Having the unversioned properties outside of the array is useful because you may want to update some properties for ALL versions of the document. Additionally, it ensures that search weights behave predictably.
It has the downside of requiring us to stitch some of the information back together in the application layer.
Current Version
Breaking out the current version into a separate property allows you to use search filtering to only return the most recent version of the document.
Version metadata
This includes any versioning information that you might want to search on, such as dates.
Search
You can search the versioned properties just like any other sub-properties, so a search ends up looking like this:
...
{
"match": {"versions.document.versioned_prop": "query string"}
}
This will search across ALL versions of the document, and return the combined document if there's a match.
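As a rough sketch of the two search modes described above (all versions versus current only), the helper below builds the request body as a plain Python dict. The function name `build_version_query` is made up for illustration; the property names `versions.document` and `current` come from the layout shown earlier, and `versioned_prop3` is just the placeholder property from that example.

```python
def build_version_query(field, text, current_only=False):
    """Return a search body matching `text` in one versioned property."""
    # "versions.document" and "current" are the property names used in the
    # document layout above; `field` is a property inside them.
    prefix = "current" if current_only else "versions.document"
    return {"query": {"match": {f"{prefix}.{field}": text}}}


# Search every stored version of each document.
all_versions = build_version_query("versioned_prop3", "query string")
# Search only the most recent version.
latest_only = build_version_query("versioned_prop3", "query string", current_only=True)
```

Either dict can be sent as the body of a normal search request with whatever client you use.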
Updates
When you need to create a new version, you can use a partial update to insert the new version into the array and update the current property.
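One possible shape for such a partial update is a scripted update body, sketched here as a Python dict. This assumes a release where Painless scripts are given under `source` (older releases used a different key), and it assumes the new version should be appended to `versions` as well as copied into `current`; the helper name `build_new_version_update` is hypothetical and not part of any client library.

```python
def build_new_version_update(version_id, metadata, new_props):
    """Return an Update API body that appends a version and replaces `current`."""
    entry = {
        "version": version_id,
        "version_metadata": metadata,
        "document": new_props,
    }
    return {
        "script": {
            "lang": "painless",
            # Assumption: the new version is stored in the array AND becomes current.
            "source": (
                "ctx._source.versions.add(params.entry); "
                "ctx._source.current = params.entry.document"
            ),
            "params": {"entry": entry},
        }
    }


body = build_new_version_update(
    "version_z", {"date": "2020-01-01"}, {"versioned_prop3": "new value"}
)
```

Passing the new version through `params` rather than formatting it into the script text keeps the script constant, so it can be compiled once and reused.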
Alternative
The major downside with this approach is that you can't easily filter down some of the search results based on things inside of versions - you will likely want to filter them on the application side.
If you need your documents to behave independently, you will likely need to index them independently. To achieve that you can include a "collection id" on all the versions. The collection ID is unique to the document, and is shared across all versions.
The collection ID approach ended up having too many issues, and we moved to the approach outlined above, and have had a much higher level of success.
As a side note, I personally wouldn't recommend that you use ElasticSearch as the primary storage of important records. Only do it if you can live with the occasional data loss.

First, you should not update the existing document with the new quantity/price.
I suggest that whenever the quantity or price changes, you insert a new document. Some fields will be duplicated, but each document will hold all the information about the product on a given date.
You can then retrieve all documents for that product, each with its own values (prices). Data is duplicated in this model, but I don't see that as an issue.
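A minimal sketch of this one-document-per-change model, using plain Python dicts. The field names `product_id` and `date`, the id format, and both helper names are assumptions made for illustration; the term query also assumes `product_id` is mapped as a keyword.

```python
from datetime import date


def build_snapshot(product_id, name, description, price, quantity, day):
    """Return (document id, document body) for one product on one day."""
    # Assumed id scheme: product id plus ISO date, so each day gets its own document.
    doc_id = f"{product_id}_{day.isoformat()}"
    body = {
        "product_id": product_id,
        "name": name,
        "description": description,
        "price": price,
        "quantity": quantity,
        "date": day.isoformat(),
    }
    return doc_id, body


def build_history_query(product_id):
    """Return a search body listing every snapshot of a product, oldest first."""
    return {
        "query": {"term": {"product_id": product_id}},
        "sort": [{"date": {"order": "asc"}}],
    }


doc_id, body = build_snapshot("p1", "Widget", "A small widget", 9.99, 40, date(2020, 1, 15))
history = build_history_query("p1")
```

Indexing the same id twice on one day would overwrite that day's snapshot, which may or may not be what you want if prices change more than once a day.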

Related

How to project a new field in response in ElasticSearch?

I am using Elasticsearch 6.2.
I have an index products with index_type productA having data with following structure:
{
"id": 1,
"parts": ["part1", "part2",...]
.....
.....
}
Now, at query time, I want to add or project a field parts_count to the response which simply represents the number of parts, i.e. the length of the parts array. If possible, I would also like to sort the documents of productA based on the generated field parts_count.
I have gone through most of the docs but haven't found a way to achieve this.
Note:
I don't want to update the mapping and add dynamic fields. I am not sure if Elasticsearch allows it. I just wanted to mention it.
Have you read about Script Fields and Script Based Sorting?
I think you should be able to achieve both with them, and neither requires any mapping updates.
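A hedged sketch of what such a request body might look like, built as a Python dict. It assumes Painless scripting as in 6.x, and it assumes that `parts` has a keyword sub-field named `parts.keyword` (the usual result of dynamic mapping for strings) so that the sort script can read doc values. Note that doc values may collapse duplicate entries, so the sort key could differ from the `_source` length if a part is repeated. The helper name is hypothetical.

```python
def build_parts_count_search(keyword_field="parts.keyword"):
    """Return a search body that projects parts_count and sorts by array size."""
    return {
        # Ask for _source explicitly so the original document comes back too.
        "_source": True,
        "script_fields": {
            "parts_count": {
                "script": {
                    "lang": "painless",
                    # Script fields can read the original JSON array from _source.
                    "source": "params['_source']['parts'].size()",
                }
            }
        },
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": "desc",
                    "script": {
                        "lang": "painless",
                        # Sort scripts read doc values, hence the keyword sub-field.
                        "source": f"doc['{keyword_field}'].size()",
                    },
                }
            }
        ],
    }


search_body = build_parts_count_search()
```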

ElasticSearch Index Modeling

I am new to Elasticsearch (you will figure that out after reading the question!) and I need help designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for companies in Russell 2000 Index. To define an index for these companies, I have the following mapping -
{
"mappings": {
"company": {
"_all": { "enabled": false },
"properties": {
"ticker": { "type": "text" },
"name": { "type": "text" },
"CEO": { "type": "text" },
"CEO_start_date": {"type": "date"},
"CEO_end_date": {"type": "date"}
}
}
}
}
When the CEO of a company changes, I want to update the end date of the existing document and add a new document with the start date.
My questions:
(1) For such a dataset, what is an ideal id scheme? Since I want to keep multiple documents, should I consider a (company_id + date) combination as the id?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example, depending on the analyzer, common words like "the" may be dropped and word endings like "ing" may be removed. Depending on how you want to search for names in your index, you may also want to switch that field to keyword. Also note that you have the option of indexing a field twice, as BOTH keyword and text, if you need to support both search methods.
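To illustrate the recommendation, here is a sketch of the `properties` section with `ticker` as a keyword and `name` indexed both ways, written as a Python dict. The sub-field name `raw` and the helper `ticker_lookup` are arbitrary choices for this example, not something the original schema defines.

```python
COMPANY_PROPERTIES = {
    # Exact-match field: suitable for term queries.
    "ticker": {"type": "keyword"},
    # Indexed twice: analyzed text for search-bar matching, plus a keyword
    # sub-field (hypothetical name "raw") for exact matches.
    "name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
    "CEO": {"type": "text"},
    "CEO_start_date": {"type": "date"},
    "CEO_end_date": {"type": "date"},
}


def ticker_lookup(ticker):
    """Return a search body doing an exact match on the keyword ticker field."""
    return {"query": {"term": {"ticker": ticker}}}


lookup = ticker_lookup("ABC")
```

With this mapping, an exact match on the company name would target `name.raw`, while full text search would target `name`.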
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get far more powerful read operations (searching by any field, full text search, etc.) but lose relational operations (joins). I also find that loading/updating data in ES is slower than in an RDBMS because of all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for you, then adding a new field where you combine the company ID and date would probably work fine.
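If you do go with explicit IDs, a CEO change could be sketched roughly as below. This assumes the ticker stands in for the company ID and that the CEO start date is the date part of the ID; both helper names and the sample values are made up for illustration.

```python
def make_doc_id(ticker, start_date):
    """Combine the company identifier and the CEO start date into one id."""
    return f"{ticker}_{start_date}"


def ceo_change(ticker, company_name, old_start_date, new_ceo, change_date):
    """Return the two writes needed when a CEO changes."""
    # 1) Partial update closing the previous CEO's document.
    close_old = (
        make_doc_id(ticker, old_start_date),
        {"doc": {"CEO_end_date": change_date}},
    )
    # 2) New document for the incoming CEO.
    open_new = (
        make_doc_id(ticker, change_date),
        {
            "ticker": ticker,
            "name": company_name,
            "CEO": new_ceo,
            "CEO_start_date": change_date,
        },
    )
    return close_old, open_new


close_old, open_new = ceo_change(
    "ABC", "Example Corp", "2015-03-01", "Jane Doe", "2020-06-15"
)
```

The first tuple is an id plus an Update API body, the second an id plus a full document to index.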
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
Article{
Id: 1,
Category: History,
Title: War stories,
//Comments could be pretty long and also be changed frequently
Comments: "Nice narration, Reminds me of the difficult Times, Tough Decisions",
Tags: "truth, reality, history", //Might change frequently
UserSpecifiedNotes:[
//The array may contain different users for different articles
{
userid: 20,
note: "Good for work"
},
{
userid: 22,
note: "Homework is due for work"
}
]
}
After going through different articles, I found that denormalizing the data is one way to handle it. But since the common fields could be pretty long and might change frequently, I would like to avoid repeating them. What other, better ways are there to represent and search this data? Parent-child? Inner object?
Currently, I would be dealing with a lot of inserts and updates and few searches. But whenever a search is done, it has to be very fast. I am using NEST (the .NET client) for Elasticsearch. The search query is expected to work as follows:
Input: searchString and a userID
Behavior: Return the Articles containing searchString in either the Title, Comments, Tags or the note for the given userID, sorted in order of relevance
In a normal scenario the main contents of the article will be changed very rarely whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for implementing parent-child relation.
With an inner object you still have to reindex all of the "main article" and "UserSpecifiedNotes"/comments every time a new note comes in. With a parent-child relation you will just be adding a new note.
With the details you have specified you can take the approach of 4 indices
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that, what needs to be kept in mind is your actual requirement. A parent-child relation will need more memory, and may slow down search performance a tiny bit. But indexing will be faster.
On the other hand, a nested object will increase your indexing time significantly, as you need to collect all the data related to an article before indexing. You can of course store everything and just add new data as an update. For simpler maintenance and ease of implementation, I would suggest using parent-child.
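A rough sketch of the parent-child idea, reduced to articles as parents and user notes as children and expressed as Python dicts. This assumes a release that supports the `join` field type (6.x onward); older releases express the relation differently. The join field name `relation`, the lower-case field names, and the helper names are assumptions for this example only.

```python
# Assumed mapping properties: one join field linking "article" (parent) to "note" (child).
ARTICLE_PROPERTIES = {
    "title": {"type": "text"},
    "comments": {"type": "text"},
    "tags": {"type": "text"},
    "userid": {"type": "integer"},
    "note": {"type": "text"},
    "relation": {"type": "join", "relations": {"article": "note"}},
}


def note_doc(article_id, user_id, note):
    """Return a child document; it must be indexed with routing set to article_id."""
    return {
        "userid": user_id,
        "note": note,
        "relation": {"name": "note", "parent": article_id},
    }


def build_search(search_string, user_id):
    """Match the article's own fields OR a note written by the given user."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": search_string,
                            "fields": ["title", "comments", "tags"],
                        }
                    },
                    {
                        "has_child": {
                            "type": "note",
                            "score_mode": "max",
                            "query": {
                                "bool": {
                                    "must": [{"match": {"note": search_string}}],
                                    "filter": [{"term": {"userid": user_id}}],
                                }
                            },
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        }
    }


child = note_doc("1", 20, "Good for work")
search = build_search("work", 20)
```

Adding a note then only means indexing one small child document; the parent article is left untouched.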

Updating filtered documents in elasticsearch

I want to know if there is a way to update elasticsearch documents after filtering them out.
Let's say I have a user collection with following documents:
[
{ "name":"u1","age":23},
{ "name":"u2","age":31},
{ "name":"u3","age":27},
{ "name":"u4","age":33}
]
Now what I need to do is update the names of all the users who have ages above 30.
After looking through a lot of documentation and searching Google for hours, including the following document
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_updating_documents.html
I couldn't find a way to do it. The docs require providing the id of the document, so that doesn't suit my need. Is there a way to do this sort of thing in Elasticsearch?
From the link you provided:
Note that as of this writing, updates can only be performed on a
single document at a time. In the future, Elasticsearch will provide
the ability to update multiple documents given a query condition (like
an SQL UPDATE-WHERE statement).
So, this is not supported at the moment. But you can consider taking a look at this plugin: https://github.com/yakaz/elasticsearch-action-updatebyquery/.
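For readers on later releases: as far as I know, Elasticsearch has since gained a built-in `_update_by_query` endpoint, which the quoted note anticipates. A sketch of a request body for the example above, as a Python dict, might look like this; the helper name and the replacement name value are made up, and the script syntax assumes Painless with the `source` key.

```python
def build_rename_older_users(min_age, new_name):
    """Return a body for POST <index>/_update_by_query renaming users older than min_age."""
    return {
        # Select only the documents to change: age strictly above the threshold.
        "query": {"range": {"age": {"gt": min_age}}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.name = params.new_name",
            "params": {"new_name": new_name},
        },
    }


update_body = build_rename_older_users(30, "senior_user")
```

With the sample data above, the range query would select u2 and u4 only.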

How to structure Elasticsearch indices/types?

How would you structure indices/types for an eshop application? Such an eshop would consist of domain objects like product, category, tag, manufacturer etc. The fulltext search results page should display intermixed list of all domain objects.
I can think of two options:
One index per whole application, every domain object as a type.
Every domain object has its own index, the type is the same - "item".
Which option will scale better?
Most of the "items" in the database are products. Some products aren't available yet or anymore. How to boost currently available products?
The fulltext should prefer to show categories/manufacturers on top of the page. How to boost certain types / objects from certain index?
For better performance, I suggest the first option:
1) "One index per whole application, every domain object as a type."
2) For example, create an index named "eshop" with types such as mobile, book, etc.
3) This lets you query according to the user's input. Consider a shopping website like Flipkart, where the user can search with a plain keyword.
4) You can then search in Elasticsearch by specifying only the index name. If the user applies some filter, such as mobile with a range of 1000-10000, you only need to search inside the mobile type, and filtering is easy in Elasticsearch. This reduces memory and CPU usage.
To boost available products, add a field called "available" to your documents and specify a boost value for available products when searching. Example:
{
"query": {
"term": {
"available": {
"value": true,
"boost": 1.5
}
}
}
}
For more on boosting, see:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html
http://jontai.me/blog/2013/01/advanced-scoring-in-elasticsearch/
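A term query on its own would restrict results to available products rather than just ranking them higher. One way to boost without filtering, sketched here as a Python dict, is to put the boosted term in the `should` clause of a `bool` query next to the user's full text query. The searched field names `name` and `description` and the helper name are assumptions for illustration.

```python
def build_boosted_search(text, fields=("name", "description"), boost=1.5):
    """Full text search where available products score higher but others still match."""
    return {
        "query": {
            "bool": {
                # The user's keyword search decides which documents match.
                "must": [{"multi_match": {"query": text, "fields": list(fields)}}],
                # Optional clause: only adds to the score of available products.
                "should": [{"term": {"available": {"value": True, "boost": boost}}}],
            }
        }
    }


boosted = build_boosted_search("phone")
```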

Resources