How to calculate the size of an Elasticsearch node? - elasticsearch

How can I calculate the required size of the Elasticsearch node for my Shopware 6 instance when I know some KPIs?
For example:
KPI             Value
Customer        5,000
Products        10,000
SalesChannel    2
Languages       1
Categories      20
Is there a (rough) formula to calculate the number of documents or the required size of a node?

https://www.elastic.co/blog/benchmarking-and-sizing-your-elasticsearch-cluster-for-logs-and-metrics is relevant here.
However, what you have provided is only a logical sizing of the data; you will still need to figure out what it all amounts to once you start putting documents into Elasticsearch.
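As a very rough starting point, here is a back-of-the-envelope sketch that turns KPIs like those above into a document count and a primary index size. Everything in it is an assumption rather than an official Shopware formula: it presumes Shopware 6 creates one product document and one category document per indexed language, and that an average document weighs around 5 KB. Measure a small sample index to replace those guesses with real numbers.

# Hedged sizing sketch; all constants below are assumptions, not Shopware facts.
products, categories = 10_000, 20
languages = 1
avg_doc_kb = 5  # pure guess; index a sample and check the real average size

docs = (products + categories) * languages   # one document per entity per language (assumed)
index_mb = docs * avg_doc_kb / 1024

print(f"~{docs:,} documents, ~{index_mb:.0f} MB primary index size")
# -> ~10,020 documents, ~49 MB (before replicas and Lucene overhead)

With numbers this small, almost any single node will do; replicas, Lucene overhead, and query load matter far more than raw document count, which is exactly why benchmarking with real documents is the advice above.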

Related

Elasticsearch query over huge data

We have over 100 million records stored in Elasticsearch.
The dataset is too large to be fully loaded into our service's memory.
Each record has a field called amount. The search needs to find several target records (sometimes over 10 thousand of them) whose amounts sum to a value equal or close to an input value.
Below is our current solution:
We merge the 100 million records into 4,000 buckets using ES aggregations. Each bucket's amount is the sum of every record it contains.
We load the 4,000 buckets into our service, then solve the problem described above over those 4,000 buckets.
The obvious disadvantage is the lack of accuracy: the difference between the sum of the results we find and the input target is sometimes quite large.
We are three inexperienced developers and would appreciate some guidance.
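For reference, the bucketing step described above can be done in a single aggregation request. A minimal sketch, assuming an index named "records" and a numeric "amount" field (both names are illustrative), using a histogram aggregation with a sum sub-aggregation so each bucket carries the total amount of the records it contains:

import requests

# Each histogram bucket sums the "amount" of every record that falls into it;
# the interval is a placeholder and would be tuned to land near 4,000 buckets.
query = {
    "size": 0,
    "aggs": {
        "amount_buckets": {
            "histogram": {"field": "amount", "interval": 1000},
            "aggs": {"bucket_sum": {"sum": {"field": "amount"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/records/_search", json=query)
for bucket in resp.json()["aggregations"]["amount_buckets"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["bucket_sum"]["value"])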

Examples of 10 million plus Elastic clusters?

Unless I'm missing an obvious list that's provided somewhere, there doesn't seem to be a list that gives examples of large-ish Elastic clusters.
To answer this question I'd appreciate it if you could list a solution you know of and some brief details about it. Note that no organisational details need be shared unless these are already public.
Core info
Number of nodes (machines)
GB of index size
GB of source size
Number of documents / items (millions)
When the system was built (year)
Any of the following information would be appreciated as well:
Node layout / GB of memory in each node. Number of master nodes (generally smaller), number and layout of data nodes
Ingest and / or query performance (docs per second, queries per second)
Type of CPU - number of cores, year of manufacture, or actual CPU specifics
Any other relevant information or suggestions of types of additional info to request
And as always - many thanks for this, and I hope it helps all of us!
24 m4.2xlarge instances for data nodes,
separate master nodes and a separate monitoring cluster
multiple indices (~30 per day), 1-2 TB of data per day
700-1000M documents per day
It is continuously being built, changed, and optimized (since version 1.4)
hundreds of search requests per second, 10-30k documents indexed per second

How to split a multi-typed index to prepare for an ES upgrade (where multiple types are deprecated)

We are currently running a cluster with ES 2.3.2 that has one large index with the following properties:
762 GB (366 million docs)
25 data nodes; 3 master nodes; 3 client nodes
23 shards / 1 replica
This one index has 20+ types, each with a few common and many unique fields. I am redesigning the cluster with the following goals:
1) Remove multiple types in an index so that we can upgrade ES. Though multi-types are supported in v5, we want to do the work to prep for v6 now.
2) Break up the large index into more manageable smaller indexes
I have set up a new identical cluster. I modified the indexing so that I have one index per type. I allocated a shard count based on the relative size of the data with a minimum of 2, and a max of 5 shards. After indexing all of our data into this new cluster, I am finding that the same query against the new cluster is slower than against the old cluster.
I figured this was due to the explosion of shards (i.e. it was 23 primaries, and now it is 78). I closed all but one index (one with a shard count of 2), then ran a test where I targeted a single type against my old monolithic index and against the new single-typed index (using a homebrew tool to run requests in parallel and parse out the "took"). I find that if I do a "size: 0" query, my new cluster is faster. When I return 7 or 8 records, the two seem to be at parity. It then goes downhill: our default query returning 30 records is about twice as slow. I am guessing this is because there are fewer threads to do the actual retrieval in the smaller index with two shards vs. the large one with 23.
What is the recommendation for moving away from multi-typed indexes when the following is true:
- There are many types
- The types have very different mappings
- There is a huge variance in size per type, running from 4 MB to 154 GB
I am currently contemplating putting them all in one type with one massive mapping (I don't think there are any fields with the same name but different mappings), but that seems really ugly.
Any suggestions welcome,
Thanks,
~john
I don't know your data, but you can try to reduce the number of indexes in the following way.
Group the types that have similar mappings into one index. In that index, create a "type" property and maintain it yourself in your queries.
If every type has a completely different structure, I would put the smaller ones together in one index. After all, they lived that way before.
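To illustrate the first suggestion: a sketch of the hand-rolled "type" field, with made-up index, type, and field names. Every document carries its own type value, and queries filter on it explicitly instead of relying on the deprecated _type.

import requests

# Filter on the custom "type" field; "order" and "grouped_index" are illustrative.
query = {
    "query": {
        "bool": {
            "filter": [{"term": {"type": "order"}}],
            "must": [{"match": {"description": "elasticsearch"}}],
        }
    }
}
resp = requests.post("http://localhost:9200/grouped_index/_search", json=query)
print(resp.json()["hits"]["total"])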
Sharding is the way Elasticsearch scales, so it makes sense that you observe performance degradation for a network/IO-bound operation when it's executed against 2 shards vs. 23: it essentially means it was run on 2 nodes in parallel as opposed to 23.
If you want to split the index, you need to go over all of the types and identify the minimum number of shards each type needs for your target performance. It will depend on multiple factors, such as the number of documents, document size, and request/indexing patterns. Since the types vary significantly in size, the result will most likely be less balanced than your initial setup (2-5 shards): some of the indexes will need a higher number of shards, while some will do fine with fewer. For example, there is no need to split a 4 MB index (as in your example) into multiple shards, unless you expect it to grow significantly or have a high update rate and want to scale indexing; otherwise 1 shard is fine.
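As a concrete illustration of that last point, here is a tiny sketch of per-type shard sizing under the common (but not universal) rule of thumb of at most ~30 GB per shard; the type names are made up, while the sizes echo the question:

import math

TARGET_GB_PER_SHARD = 30  # rule of thumb, not a hard Elasticsearch limit
type_sizes_gb = {"tiny_type": 0.004, "mid_type": 40, "big_type": 154}  # illustrative names

for type_name, size_gb in type_sizes_gb.items():
    shards = max(1, math.ceil(size_gb / TARGET_GB_PER_SHARD))
    print(f"{type_name}: {shards} primary shard(s)")
# tiny_type: 1, mid_type: 2, big_type: 6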

ElasticSearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed in the same index (under different types). In order to improve the environment, we decided to index items by geohash code; our mantra was: no more than 30 GB per shard.
The current status is that we have more than 1,500 indices with 12 shards per index, and every item is inserted into one of those indices. The number of shards has surpassed 20,000, as you can understand...
Our indices are in the format <Base_Index_Name>_<geohash>
My question arises from performance problems which made me question our method. A simple count query of the form GET */_count takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data, and it is growing fast.
Actually, it depends on your usage. A query across all of the indices takes a long time because the query has to go to all of the shards and the results have to be merged afterwards; 20K shards are not an easy thing to query.
If your data is time-based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query hits, and it will take much less time.
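A sketch of that naming scheme, with a made-up base index name: encode the month into the index name at write time, then have queries target only the slice they need instead of fanning out across all 20K shards.

from datetime import date
import requests

base = "myindex"                       # illustrative base index name
month = date.today().strftime("%Y%m")  # e.g. "201602" in the answer's example
index = f"{base}{month}"

# Hits only the shards of one month's index instead of every shard in the cluster.
resp = requests.get(f"http://localhost:9200/{index}/_search")
print(resp.json()["hits"]["total"])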

Logstash + ElasticSearch + Kibana combine results from different fields in different documents

We have Apache logs analyzed by Elasticsearch (2.1.0) and Kibana (4.3.0).
Logs are parsed and shipped to Elasticsearch by Logstash running on the web servers and reading the Apache combined log format.
All works well, but now we need to analyze a more complicated pattern.
We have documents with a field "purchase_id" that has an integer value (like 130012, 130016, 133552, etc.).
We have OTHER documents that have an integer field "view_id" with the same values (like 130012, 130016, 133552, etc.).
The two fields never appear in the same document, because they are extracted from different URIs in the Apache log.
Our goal is to calculate and visualize the percentage of occurrences, in a given time frame, of values in "purchase_id" compared to values in "view_id".
For example, let's say we want to see the current purchase rate of item 130012. In the last 30 seconds it may appear 1,000 times in documents with the field "purchase_id", and in the same 30 seconds it may appear 40,000 times in documents with the field "view_id".
This is expected, because only a small fraction of the people exposed to a product actually buy it. I need to calculate and visualize that in the time frame there were 1,000 occurrences of "purchase_id" for item 130012 and 40,000 occurrences of "view_id" for item 130012, then divide 1,000 by 40,000 and multiply by 100%, so I get 2.5% visualized on the dashboard (for item 130012).
Of course I have many such purchase_id = view_id = (some number) pairs, so I need to calculate the percentage for all of them and display, let's say, the 20 with the highest percentage.
This will let me know the best-selling items relative to the advertising we invest in.
I would track this issue for Kibana.
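In the meantime, the ratio can be computed outside Kibana. A minimal sketch, assuming the documents live in a "logstash-*" index pattern and that purchase_id and view_id are indexed as exact (not analyzed) values; the two terms aggregations are joined client-side, since the counts live in different documents:

import requests

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30s"}}},  # the 30-second window from the example
    "aggs": {
        "purchases": {"terms": {"field": "purchase_id", "size": 1000}},
        "views": {"terms": {"field": "view_id", "size": 1000}},
    },
}

resp = requests.post("http://localhost:9200/logstash-*/_search", json=query).json()
views = {b["key"]: b["doc_count"] for b in resp["aggregations"]["views"]["buckets"]}

rates = []
for b in resp["aggregations"]["purchases"]["buckets"]:
    item, bought = b["key"], b["doc_count"]
    if views.get(item):
        rates.append((item, 100.0 * bought / views[item]))  # e.g. 1000 / 40000 -> 2.5%

# The 20 items with the highest purchase rate, as described in the question.
for item, pct in sorted(rates, key=lambda r: r[1], reverse=True)[:20]:
    print(item, f"{pct:.1f}%")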
