Index linear growth - Performance degradation - performance

We have 4 shards with 14GB index on each of them
Each shard has a master and 3 slaves (each of them with 32GB RAM)
We're expecting that the index size will grow to double or triple in near future.
So we thought of merging our indexes so that each shard holds a 28GB index, and we increased the RAM on each slave to 48GB.
We made these changes locally and tested by sending the same 10K realistic queries to a server with the 14GB index and a server with the 28GB index. We found that:
For server with 14GB index (48GB RAM): search time was 480ms, number of index hits: 3.8G
For server with 28GB index (48GB RAM): search time was 900ms, number of index hits: 7.2G
So we saw that having the whole index in RAM doesn't help in sustaining the performance in terms of search time. Search time increased linearly to double when the index size was doubled.
We were hoping to keep the 4-shard configuration, but it now looks like we have to add another shard, or another slave to each shard.
Is there any other way that we can configure our servers so that the performance isn't affected even when index size doubles or triples?
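A quick check of the numbers above, as a rough sketch: dividing search time by index hits gives almost the same cost per hit for both index sizes, which is consistent with query cost being driven by the number of index hits rather than by disk access. The variable names are illustrative; only the four measurements come from the test.

```python
# Measurements from the test above: (search time in ms, index hits in billions).
runs = {
    "14GB index": (480, 3.8),
    "28GB index": (900, 7.2),
}

# Cost per billion index hits for each configuration.
ms_per_ghit = {name: ms / hits for name, (ms, hits) in runs.items()}

# How much the search time and the hit count grew when the index doubled.
time_ratio = runs["28GB index"][0] / runs["14GB index"][0]
hits_ratio = runs["28GB index"][1] / runs["14GB index"][1]

for name, cost in ms_per_ghit.items():
    print(f"{name}: {cost:.1f} ms per billion hits")
print(f"time grew {time_ratio:.2f}x, hits grew {hits_ratio:.2f}x")
```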

I'd hate to say it depends, but it... depends.
The total size of your index on each shard is 14GB, which by itself doesn't mean much of anything to Solr. To get a real feel for performance, you need to know how unique the indexed terms are: a 14GB index containing the single word "cat" over and over again will be really quick.
Also, have you confirmed that you need the following features? Disabling them can boost performance considerably:
Schema
Stored Fields
Do you need stored fields? Removing them can greatly increase performance (you can safely have an entire index without any stored fields and rely completely on facets, pivots, and other features in Solr to drive a UX).
omitNorms
You can, in some instances, set this flag to true to reduce memory use in general and increase performance, at the cost of length normalization and index-time boosts on that field.
omitTermFreqAndPositions
Can be set to true so that term frequencies and positions are not indexed, which reduces memory use in general and increases performance. Phrase queries on the field stop working once positions are omitted.
System
Optimize Core/Index (Segment Count)
Index optimization is important when dealing with larger index sizes. Ensure each core is optimized and that, when you look at the core, its segment count is 1. What I found is that this plays a more important role as you increase the index size (it plays into OS-level file caching and the fact that it's easier to read one large file than multiple small files). And yes, that does say 171 million+ documents.
Term Index Interval/Frequency
Configuration of the term index interval (TIF) may be required (by default 256) if you have a field or multiple fields that contain very unique values (for example GUIDs/UUIDs or unique IDs in general). Typically, the lower the TIF the more memory you need; the higher the TIF the less memory you need, but the more disk seeks you may have.
Allocation of too much RAM
Solr works best with a good split between the OS-level disk cache and the RAM used for faceting. You'd be surprised: you can actually get better performance by tweaking other parameters to lower RAM usage and free up memory for the disk cache.

Related

How to determine what causes ES's query API instability

Normally, my ES query API takes less than 1s. But sometimes these queries get slow.
The cluster consists of three 32GB machines (16GB allocated to ES). The index has 20 primary shards with 1 replica each and about 303,000,000 documents, with 500GB of primary storage and 1TB of total storage.
Here's kibana's monitoring data:
Personally, I think it's the result of GC. I want to add machines, but I need to find a reason to convince my team lead.
Yes it could be a GC problem. But can you be more specific? What do you mean by slow?
Anyway, it seems the allocated heap is way too large for your needs. A collection kicks in when the heap reaches 12 GB (75% of 16 GB), and usage drops back to 5 GB every time, so each collection has a huge amount of garbage to clear.
You should try lowering the heap to around 10 GB and check the impact on performance, GC count, and GC duration.
I also recommend reading this article, https://www.elastic.co/blog/a-heap-of-trouble, especially the "Together We Can Prevent Forest Fires" part.
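To make the heap arithmetic above concrete, here is a deliberately simplified sketch. It assumes a collection starts at a fixed fraction of the heap (the 75% mentioned above) and that about 5 GB stays live after each cycle, as read off the graph. The function name and the model are hypothetical; real collector behaviour is more involved.

```python
def gc_profile(heap_gb, live_gb, trigger_fraction=0.75):
    """Return (heap usage in GB at which a collection starts, GB reclaimed per cycle).

    Simplified model: the collector runs when usage reaches
    trigger_fraction * heap_gb and usage drops back to live_gb afterwards.
    """
    trigger_gb = heap_gb * trigger_fraction
    reclaimed_gb = trigger_gb - live_gb
    return trigger_gb, reclaimed_gb


current = gc_profile(heap_gb=16, live_gb=5)    # heap size from the question
proposed = gc_profile(heap_gb=10, live_gb=5)   # heap size suggested above

print("16 GB heap: collect at %.1f GB, reclaim %.1f GB per cycle" % current)
print("10 GB heap: collect at %.1f GB, reclaim %.1f GB per cycle" % proposed)
```

Under these assumptions the smaller heap still leaves headroom above the 5 GB live set, while each cycle has far less garbage to clear.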

Is there a limit on the number of indexes that can be created on Elastic Search?

I'm using AWS-provided Elastic Search.
I have a signup page on my website, and on each signup a new index gets created for the new user (to be used later by their work-group). This means that the number of indexes is continuously growing (it has now reached around 4~5k).
My question is: is there a performance limit on the number of indexes? Is it safe (performance-wise) to keep creating new indexes dynamically with each new user?
Note: I haven't used AWS Elasticsearch, so this answer may not fully apply, because they have started using the Open Distro of Elasticsearch and have forked the main branch. But a lot of the principles should be the same. Also, this question doesn't have a definitive answer and depends on various factors, but I hope this answer will help the thought process.
One of the factors is the number of shards and replicas per index as that will contribute to the total number of shards per node. Each shard consumes some memory, so you will have to keep the number of shards limited per node so that they don't exceed maximum recommended 30GB heap space. As per this comment 600 to 1000 should be reasonable and you can scale your cluster according to that.
Also, you have to monitor the number of file descriptors and make sure that doesn't create any bottleneck for nodes to operate.
HTH!
If I'm not mistaken, the only limit is the disk space of your server, but if your index is growing too fast you should think about adding more replica servers. I recommend reading this page: Indexing Performance Tips
Indexes themselves have no limit, but shards do. The recommended number of shards per GB of JVM heap is 20 (you can check the heap on the Kibana stack monitoring tab), so with 5GB of JVM heap the recommended maximum is 100 shards.
Remember that one index can take from 1 to x shards (1 primary and x secondaries). People normally have 1 primary and 1 secondary; if that is your case, you would be able to create 50 indexes with those 5GB of heap.
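The rule of thumb above is easy to turn into a small calculator. This is only a sketch: the function names are made up for illustration, the 20-shards-per-GB figure is the one quoted above, and the last call assumes the asker's roughly 5,000 indexes each have 1 primary and 1 replica, which the question does not state.

```python
SHARDS_PER_GB_HEAP = 20  # rule of thumb quoted above


def max_shards(heap_gb):
    """Recommended upper bound on shards for a given amount of JVM heap."""
    return heap_gb * SHARDS_PER_GB_HEAP


def max_indexes(heap_gb, primaries=1, replicas=1):
    """How many indexes fit, where `replicas` is the number of copies per primary."""
    shards_per_index = primaries * (1 + replicas)
    return max_shards(heap_gb) // shards_per_index


def heap_needed_gb(index_count, primaries=1, replicas=1):
    """Total JVM heap (summed over all nodes) the rule of thumb asks for."""
    total_shards = index_count * primaries * (1 + replicas)
    return total_shards / SHARDS_PER_GB_HEAP


print(max_shards(5))          # 100
print(max_indexes(5))         # 50
print(heap_needed_gb(5000))   # 500.0
```

Seen this way, one index per user gets expensive quickly, since the heap requirement grows linearly with the number of signups.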

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time-based (for example, we create an index every 6 hours; this is configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices need to be searched.
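For illustration, the step of mapping a time range to the indices that cover it might look like the sketch below. The `events-` prefix and the `YYYY.MM.DD-HH` naming pattern are assumptions, since the question does not say how the indices are named; timestamps are assumed to be timezone-aware UTC.

```python
from datetime import datetime, timedelta, timezone


def indices_for_range(start, end, hours=6, prefix="events-"):
    """List the names of the time-bucketed indices that overlap [start, end).

    The naming scheme (prefix + bucket start time) is hypothetical.
    """
    bucket = timedelta(hours=hours)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Round the start down to the beginning of the bucket it falls into.
    current = start - (start - epoch) % bucket
    names = []
    while current < end:
        names.append(prefix + current.strftime("%Y.%m.%d-%H"))
        current += bucket
    return names


names = indices_for_range(
    datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
)
print(names)
```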
Now, if the input time range is large (let's say 6 months) and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all the documents into memory, which could drastically increase the heap usage (we have a limit on the heap size).
One way to deal with the above problem is to fetch the data index by index and sort it in our application, opening and closing indices accordingly. For example, only the latest 4 indices are kept open all the time, and the remaining indices are opened and closed as needed. I'm wondering if there is a better way to handle this problem.
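The index-by-index idea described above can be sketched as a lazy k-way merge. The sketch assumes each index is asked for its own hits already sorted by the same key, so the application only merges and never re-sorts everything; `merged_top_n` and the sample data are illustrative, not part of any Elasticsearch client.

```python
import heapq
from itertools import islice


def merged_top_n(per_index_hits, n, key):
    """Merge already-sorted per-index result lists and keep the first n hits.

    heapq.merge is lazy, so only as many hits as needed are pulled
    from each input.
    """
    return list(islice(heapq.merge(*per_index_hits, key=key), n))


# Hypothetical sorted results from three indices.
per_index = [
    [{"ts": 1}, {"ts": 4}],
    [{"ts": 2}, {"ts": 3}],
    [{"ts": 0}, {"ts": 5}],
]

top = merged_top_n(per_index, n=4, key=lambda hit: hit["ts"])
print([hit["ts"] for hit in top])
```

With this shape, each index only ever needs to return its own first n hits, which keeps the application-side memory bounded by n times the number of indices.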
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10GB. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size. Putting a limit in place will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads all of the documents in the index when it performs a sort so that means whatever limit you put should be big enough to load that index into memory.
See limiting field data cache size
Option 2
Doc Values
This means writing necessary meta data to disk at index time, so that means the "fielddata" required for sorting lives on disk and not in memory. It is not a huge amount slower than using in memory fielddata and in fact can alleviate problems with garbage collection as less data is loaded into memory. There are some limitations such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation

MongoDB insert performance with 2nd index

I'm trying to insert about 250 million documents that are each roughly 400 bytes into MongoDB 3.0 with WiredTiger. I need to search on only one short string key, _user_lower. Although I'm using WiredTiger now, which is much better than MMAPv1, I did use MMAPv1 first and had similar issues.
My server (a very cheap VPS) has:
250 GB magnetic disk
1 GB RAM
2 GB Swap
2.1 GHz single-core CPU
I know that this machine is really slow, and I'm asking it to do something a bit unrealistic. But I'm confused about how it started so fast with one index, and the second just ruined the performance:
I inserted all the data that I had at the time (about 250M rows) without any index except on _id. This performed very well, considering my awful hardware:
Approximately 5000 inserts per second (totally acceptable)
This rate was nearly constant for the 14 hours it took to complete
The index size on _id once complete was nearly 2.5GB. Note that this is more than double my physical RAM.
The RES of the process didn't exceed 450 MB according to mongostat.
No swapping
top seemed to indicate that CPU time wasn't all being spent waiting for the disk (so a significant amount was spent in userspace, presumably with WiredTiger in the snappy code)
Then I built a (non-unique) index on the only field I need to query by, _user_lower. This took 7.7 hours, which is fine since that's a one-time deal. The index ended up being 1.6 GB, which seems really low to me when compared to the _id index. The RES went up to about 750 MB.
Then, I downloaded a new data set to load. It was only 102 MB (238 K documents). I loaded it in the same way, using mongoimport, but this time:
Only 80 inserts per second (slower at times)
RES stayed at around 750 MB
top says almost 100% of the CPU was spent waiting for IO
Of course, load went through the roof.
I could understand a sizable performance hit, since that index has to be updated. But I didn't expect this much. I've read all over the place that my indexes should fit in RAM, but the performance was great during the initial insert, where the index quickly outgrew my memory.
Can I optimize the _user_lower index at all? I don't know what this would even mean, but maybe only index the first few characters? I'm definitely willing to halve the query performance in exchange for tripling the insert performance.
What accounts for the massive performance hit? How do I fix it without new hardware? I'm not really attached to MongoDB, so alternatives that don't have these performance characteristics are fine. I have an idea that just uses flat files which would probably work but I don't want to write all that code.
When adding new items to a collection, the database will have to keep the index up-to-date. Since the index in MongoDB is a B-Tree by default, that means it will have to insert an item in the tree. While that isn't a particularly expensive operation in the best case, it comes with two potential performance problems:
performance jitter: from time to time, the B-Tree bucket might be full, requiring a bucket split and hence a lot more operations than the 'simple' insert
the insert destination must be readily available
In this case, the latter is likely to cause trouble: because the insertion of a name hits a random node in the tree (i.e., the name insertion doesn't follow a pattern) and your RAM is smaller than the index, chances are high that the destination must be fetched from disk. Unfortunately, disk seeks are orders of magnitude slower than main memory references. If you're unlucky, the first location fetched points to another one that also requires a disk seek, so that a single insert needs multiple disk reads before MongoDB can even begin writing. That can take hundreds of milliseconds, or even seconds with spinning disks or some contention on typical IaaS infrastructure.
Because ObjectIds are generated monotonically (the timestamp is the most significant part), the insertion always happens at the end and it is possible to keep the destination largely in RAM. Performance jitter, i.e. problem 1 might still be an issue since a bucket split might require a disk seek, but it happens so rarely compared to the first case that it doesn't wreck average performance, which should explain the observed behavior.
Also, when the bucket is filled by a monotonically increasing value, MongoDB will split the bucket when it is 90% filled; with random insertion, splits will happen a lot earlier, at 50%, so the tree is a little more 'dense' in that case.
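The difference between the two insert patterns can be illustrated with a toy simulation. This is not how MongoDB or WiredTiger is implemented: it simply maps each key to a fixed-size "page" of the sorted key space, keeps a small LRU cache of pages standing in for RAM, and counts how often an insert needs a page that is not cached (a stand-in for a disk read). All names and sizes are made up.

```python
import random
from collections import OrderedDict


def count_page_misses(keys, keys_per_page, cache_pages):
    """Count inserts that touch an index page not present in the LRU cache."""
    cache = OrderedDict()
    misses = 0
    for key in keys:
        page = key // keys_per_page  # page holding this key in sorted order
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
        else:
            misses += 1  # would require a read from disk
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used page
    return misses


rng = random.Random(42)
n = 100_000
monotonic = range(n)                               # like ObjectId-ordered inserts
scattered = [rng.randrange(n) for _ in range(n)]   # like inserts keyed by user name

mono_misses = count_page_misses(monotonic, keys_per_page=100, cache_pages=50)
rand_misses = count_page_misses(scattered, keys_per_page=100, cache_pages=50)
print(mono_misses, rand_misses)
```

In this model the monotonic inserts miss once per page, while the scattered inserts miss on most operations because the cache holds only a small fraction of the pages.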

Elastic Search - Maximum Shard Size

While learning Elasticsearch, I came across the following questions and couldn't reach a final conclusion.
What is the maximum shard size for ElasticSearch?
How many shards can an index have? Is there any maximum limit?
After reading multiple articles and blogs and running my own load tests, I came to the conclusion that
number of shards and maximum size of each shard depends upon many factors like:
Size of the data inserted
Rate at which the data is inserted
Whether data retrieval / search is happening at the same time? If yes, what is the frequency of search? How many concurrent searches are done?
Server configuration details, like number of cores in CPU, hard disk size, Memory size etc
So, to find out the optimized size for each shard and optimized number of shards for a deployment, one good way is to run tests using various combinations of parameters & loads and arrive at a conclusion.
Simple: don't cross 4 billion documents. (Lucene's own hard limit is lower still, at about 2.1 billion documents per shard.)
Think about the 32-bit limit on the heap size (still relevant on 64-bit systems). ES recommends giving the heap half of the machine's memory, up to about 32 GB, even on 64-bit systems, because of how the JVM handles and optimizes memory. If you have more than 64 GB of memory, the remainder can be left to Lucene.
For further details : https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html and https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index .
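As a rough sketch of the heap guideline above: give the heap half of the RAM, cap it just under 32 GB, and leave the rest to the operating system's file cache for Lucene. The function name and the 31 GB cap are assumptions chosen to stay below the threshold; check the linked heap-sizing guide for your version.

```python
def suggested_heap_gb(ram_gb, cap_gb=31):
    """Return (heap size in GB, memory left for the OS file cache in GB).

    Half of RAM goes to the heap, capped at cap_gb (assumed to be just
    below the ~32 GB threshold mentioned above).
    """
    heap = min(ram_gb / 2, cap_gb)
    return heap, ram_gb - heap


for ram in (32, 64, 128):
    heap, file_cache = suggested_heap_gb(ram)
    print(f"{ram} GB RAM: heap {heap} GB, left for Lucene/OS cache {file_cache} GB")
```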
As others have said, the theoretical maximum is very large, however depending on your system, there can be practical limits.
I've found that shards start to become less performant around 150GB. I've had 50GB shards that perform reasonably well. In both cases, the shard was the only shard on the node, and the node had 54GB of system memory, with 31GB devoted to elasticsearch. At 50GB, I was getting results from relatively heavy-duty queries around 100ms, and at 150GB it was taking 500ms or longer.
I'm sure this depends on the mappings I've used, and a host of other factors, but perhaps it's useful if you're polling for datapoints.

Resources