Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear on how the _type meta field is different from any regular field of an index in terms of storage/implementation. I do understand the "avoiding type gotchas" section of the docs.
For example, if I have 1 million records (say posts) and each post has a creation_date. How will things play out if one of my index types is creation_date itself (leading to ~ 1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way would my elasticsearch query performance be affected if I used creation_date as the index type instead of a generic type such as 'post'?

I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood

Related

Elasticsearch: Modeling product data with frequent updates

We're struggling with modeling our data in Elasticsearch, and decided to change it.
What we have today: single index to store product data, which holds data of 2 types -
[1] Some product data that changes rarely -
* `name, category, URL, product attributes(e.g. color,price) etc...`
[2] Product data that might change frequently for past documents, and is indexed on a daily level - [KPIs]
* `product-family, daily sales, daily price, daily views...`
Our requirements are -
Store product-related data (for millions of products)
Index KPIs on a daily level, and store those KPIs for a period of 2 years.
Update "product-family" on a daily level, for thousands of products. (no need to index it daily)
Query and aggregate the data with low latency, to display it in our UI. Aggregation examples -
Sum all product sales in the last 3 months, from category 'A' and sort by total sales.
Same as the above, but in-addition aggregate based on product-family field.
Keep efficient indexing rate.
Currently, we're storing everything on the same index, daily, meaning we store repetitive data such as name, category and URL over and over again. This approach is very problematic for multiple reasons-
We're holding duplicates for data of type [1], which hardly changes and causes the index to be very large.
When data of type [2] changes, specifically the product-family field (this happens daily), it requires updating tens of millions of documents (some from more than a year ago), which makes the system very slow and causes queries to time out.
Splitting this data into 2 different indices won't work for us, since we have to filter data of type [2] by data of type [1] (e.g. all sales from category 'A'). Moreover, we'd have to join that data somehow, and our backend server won't handle this load.
We're not sure how to model this data properly, our thoughts are -
Using parent-child relations - parent is product data of type [1] and children are KPIs of type [2]
Using nested fields to store KPIs (data of type [2]).
Both of these methods allow us to reduce the current index size by eliminating the duplicated data of type [1], and efficiently updating data of type [2] for very old documents.
Specifically, both methods allow us to store product-family for each product once in the parent/non-nested fields, which means we only need to update a single document per product (these updates are daily).
We think the parent-child relation is more suitable because we're adding KPIs on a daily level, which, per our understanding, will cause re-indexing of documents with new KPIs when using nested fields.
On the other hand, we're afraid that parent-child relations will increase query latency dramatically and make our UI very slow.
We're not sure what the proper way to model the data is, or whether our solutions are on the right path. We would appreciate any help, since we've been struggling with this for a long time.
First off, I would recommend against indexing data that changes frequently in Elasticsearch. It is not designed for this and you will get poor performance as well as encounter difficulties when cleaning up old data.
Elasticsearch is best used for immutable data (once you insert it, you don't modify it). For time based data, I would recommend inserting measurements once with their timestamp, in e.g. daily indices (see: index templates), and leaving them alone. Each measurement document would look something like
{"product_family": "widget", # keyword
"timestamp": "2022-08-23", # date
"sales": 798137,
"price": "and so on"}
This document would be inserted into the index yourindex_20220823.
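To keep the mappings of those daily indices consistent, an index template can be applied automatically to every index that matches the naming pattern. A rough sketch using the composable template API (ES 7.8+; the template name, pattern and field types are assumptions based on the sample document above):

PUT _index_template/yourindex_template
{
  "index_patterns": ["yourindex_*"],
  "template": {
    "mappings": {
      "properties": {
        "product_family": { "type": "keyword" },
        "timestamp": { "type": "date" },
        "sales": { "type": "long" },
        "price": { "type": "double" }
      }
    }
  }
}

Every new yourindex_YYYYMMDD index then picks up the same mapping at creation time.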
You can have Elasticsearch run roll-up jobs for aggregating historical data, and set up index lifecycle management so that indices older than your retention period get deleted. This is very fast, way faster than running delete-by-query requests to remove all documents whose insertion date is more than two years old.
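As an illustration, an ILM policy that simply drops indices once they pass the two-year retention could look roughly like this (the policy name is hypothetical; it gets attached to the daily indices via the index.lifecycle.name setting, e.g. in the index template):

PUT _ilm/policy/kpi_retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "730d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}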
Now, we have the issue of storing the product category metadata. As you might have found out, ES is better at denormalized data, but it does lead to repetition and you might find your index size blowing up.
For minimizing disk usage, the trick is to tweak individual field mappings (and no, you can't rely on dynamic mapping). You can avoid storing a lot of stuff in the inverted index. See https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html. I'd need to see your current mapping to check if there are any obvious gains to be made here.
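As a rough illustration of the kind of per-field tweaks that guide describes (field names are taken from the question; whether each tweak is safe depends on which fields you actually search, sort and aggregate on):

PUT products
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "text", "norms": false },
      "url": { "type": "keyword", "index": false, "doc_values": false },
      "color": { "type": "keyword" },
      "daily_views": { "type": "long", "index": false }
    }
  }
}

Here url is kept only in _source (neither searchable nor aggregatable), name skips scoring norms, daily_views remains aggregatable through doc values but is not searchable, and "dynamic": "strict" rejects unmapped fields instead of guessing their types.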
Lastly, a feature that I've never tried out is to move older data (again, having daily indices helps here) to slower storage modes. See cold/frozen storage tiers.

Why do mappings exist in Elasticsearch?

From what I read, Elasticsearch is dropping support for types.
So, as the examples say, indexes are similar to databases and documents are similar to rows of a relational database.
So now, everything is a top-level document, right?
Then what is the need for a mapping, if we can store all sorts of documents in an index with whatever schema we want it to have.
I want to understand if my concepts are incorrect anywhere.
Elasticsearch is not dropping support for mapping types, they are dropping support for multiple mapping types within a single index. That's a slight, yet very important, difference.
Having a proper index mapping in ES is as important as having a proper schema in any RDBMS, i.e. the main idea is to clearly define which type each field is and how you want your data to be analyzed, sliced and diced, etc.
Without an explicit mapping, it wouldn't be possible to do all of the above (and much more); ES would guess the type of your fields, and even though most of the time it gets it right, there are plenty of times where it is not exactly what you want/need.
For instance, some people store floating point values in string fields (see below); ES would detect that field as being text/keyword even though you want it to be double.
{
"myRatio": "0.3526472"
}
This is just one reason out of many why it is important to define your own mapping and not rely on ES guessing it for you.
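For example, an explicit mapping that forces that field to be a double would look like this (the index name is just a placeholder); ES will then coerce the string "0.3526472" to a number at index time instead of mapping the field as text:

PUT myindex
{
  "mappings": {
    "properties": {
      "myRatio": { "type": "double" }
    }
  }
}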

What should I know / be concerned about when creating an index with 30-40 or more columns?

As mentioned in the subject, I want to create an index with 30-40 or even more columns (mostly keyword and number).
What should I concern or know about this situation? Is it bad for performance? Is it bad for elasticsearch cluster stability?
For each document in Elasticsearch, there are some limitations to the number of fields and how they are organized.
You can check these limitations in the documentation (this might be different based on ES versions). These limitations can be changed and include the total number of fields that you can have (default to 1000) and the maximum depth for a field (default to 20).
Based on the documentation, defining too many fields might not be a good idea, especially if you have many documents:
Defining too many fields in an index is a condition that can lead to a mapping explosion, which can cause out of memory errors and difficult situations to recover from
Also, be aware of the dynamic fields that you put into the document. Every new field will add a new definition to the document mapping settings.
In your situation, considering the default maximum number of fields, which is 1000, having 40 fields (columns?) won't be a problem, unless you have too many inner objects that exceed other mapping limitations like index.mapping.nested_fields.limit or index.mapping.nested_objects.limit. Also, try to settle your document structure (mapping) before adding documents.
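If you ever do approach those limits, they are per-index settings that can be raised explicitly. A small sketch (the index name and value are placeholders; 40 fields won't need this):

PUT myindex/_settings
{
  "index.mapping.total_fields.limit": 2000
}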

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my App logs to Elastic (6.x) via FileBeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need to add the Date to the Index name. The reason for doing so was that when the time comes to delete old data, it is easier to delete an entire index by date rather than individual documents. Is this true?
If I should be following this recommendation of adding the Date to the Index Name, I'm curious what additional things I need to do to ensure seamless querying. By this I mean querying, especially in Kibana, e.g. over the past day, which would need to look at today's index as well as yesterday's.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data a day, and no more than a couple of 100K documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours, if any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For the most part I expect 3 nodes. Also, this is not critical, i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all, you need to take a step back and figure out whether you really need multiple indices or a single one (where you filter documents at query time using a date field for a particular date).
Some questions you should answer before taking such a decision:
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours, if any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indices (with the date as part of the index name) would be more beneficial.
You can delete the old indexes without affecting new ones.
If you have to change the mapping, you can do so with the new index without affecting the old ones, which is comparatively less overhead. With a single index, you would have to reindex all the documents, which takes a lot more time if the index is large. If this keeps happening every now and then, you would need to schedule such operations at times of minimal usage, which can harm productivity.
Searching across multiple indexes is still convenient.
Not entirely sure, but scaling is arguably easier with multiple indexes.
Disadvantages are:
Additional shards are created for each and every index, which can waste some storage space.
Overhead to maintain multiple indexes by monitoring/operations team.
At times can lead to over-creation of indexes.
If there are no mapping changes and the document insertion rate is low (in the hundreds or a few hundreds), it'd be better to use a single index.
The only correct way to figure out what's best is to have a cluster that closely resembles production, with data that also resembles production, try various configurations, and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp, i.e. setting it up so that I do not see or have to deal with the date named indexes?
Yes, there is. If you have indexes with names like logs-0001 and logs-0002, you can use logs-* as the index name when you query.
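For example, a query over the past day can target the wildcard pattern and let a date filter narrow things down (the index pattern and the @timestamp field are assumptions based on a typical Logstash setup):

GET logs-*/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1d/d" }
    }
  }
}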
Including a date in the index name is a very common pattern implemented by many Elasticsearch users. It helps with archiving/purging old indices, as you mentioned. You don't need to do anything additional to be able to query. Set up your index base name as an index pattern for your indices, e.g. logstash-*, and you can query on that index pattern in Kibana.

ElasticSearch multiple types with same mapping in single index

I am designing an e-Commerce site with multiple warehouses. All the warehouses have the same set of products.
I am using ElasticSearch for my search engine.
There are 40 fields in each ES document. 20 of them will differ in value per warehouse; the remaining 20 fields will contain the same values for all warehouses.
I want to use multiple types (1 type for each warehouse) in 1 index. All of the types will have same mappings. Please advise if my approach is correct for such scenario.
A few things are not clear to me:
Will the inverted index be created only once for all types in the same index?
If a new type (new warehouse) is added in the future, how will it be merged with the previously stored data?
How will it impact query time compared to using only one type in one index?
Given that all types are assigned to the same index, the inverted index will only be created once.
If a new type is added, its information is added to the existing inverted index as well - adding new terms to the index, adding pointers to existing terms in the index, adding data to doc values per new inserted document.
I honestly can't answer that one, though it is simple to test this in a proof of concept.
In my previous project, I worked on the same kind of setup, implementing a search engine with Elasticsearch for a multi-shop platform. In that case we had all shops in one type and applied the relevant per-shop filters when searching. That said, the approach of separating shop data by "_type" seems pretty clean to me; we went the other way only because my implementation could already cover it with filters at the time of the feature request.
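A rough sketch of that filter-per-shop approach, translated to the warehouse case (the warehouse_id field and values are hypothetical): keep one type, store the warehouse as a keyword field, and filter on it at query time.

GET products/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "wireless keyboard" } },
      "filter": { "term": { "warehouse_id": "warehouse-17" } }
    }
  }
}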
Cheers, Dominik
