Caching in elasticsearch - elasticsearch

In my index I indicated index.store.preload: ["*"]
In order to preload data into the file system cache. The entire index occupies 27 Gb, about 45 Gb is allocated to the cache in the system, and all this memory is full, it turns out that not only 27 Gb is crawled into the cache, but also something else. Is it possible to somehow find out how much the total index space in the cache will occupy? Also, I don’t understand the difference between the file system cache and the use of indices.fielddata.cache. Which one will be more practical for a faster search? Does it make sense to use both options?

The data on the disk has been compressed markedly before being synced. So there's always relatively apparent inflation after being loaded into memory. As the specific inflation rate is related to data type, field count or index options, it's fundamentally impossible to give an accurate estimation about the final size in the memory.
FieldData is docValue of Lucene, which is used to sort or agg. If you just want to search by single query, this cache is of little help. By comparison, file system cache is always used by elasticsearch and constructs the foundation of ES, both in query and filter context.

Related

Efficient way to search and sort data with elasticsearch as a datastore

We are using elasticsearch as a primary data store to save data and our indexing strategy is time based(for example, we create an index every 6 hours - configurable). The search-sort queries that come to our application contain time range; and based on input time range we calculate the indices need to be used for searching data.
Now, if the input time range is large - let's say 6 months, and we delegate the search-sort query to elasticsearch then elasticsearch will load all the documents into memory which could drastically increase the heap size(we have a limitation on the heap size).
One way to deal with the above problem is to get the data index by index and sort the data in our application ; indices are opened/closed accordignly; for example, only latest 4 indices are opened all the time and remaining indices are opened/closed based on the need. I'm wondering if there is any better way to handle the problem in hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or a specific size, for example 10Gb. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size. Putting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryException.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads all of the documents in the index when it performs a sort so that means whatever limit you put should be big enough to load that index into memory.
See limiting field data cache size
Option 2
Doc Values
This means writing necessary meta data to disk at index time, so that means the "fielddata" required for sorting lives on disk and not in memory. It is not a huge amount slower than using in memory fielddata and in fact can alleviate problems with garbage collection as less data is loaded into memory. There are some limitations such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation

What is the ideal bulk size formula in ElasticSearch?

I believe there should be a formula to calculate bulk indexing size in ElasticSearch. Probably followings are the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder If anyone know or use a mathematical formula. If not, how people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, try with 20 KiB, then with 10 KiB, ... dichotomy
Use bulk size in KiB (or equivalent), not document count !
Send data in bulk (no streaming), pass redundant info API url if you can
Remove superfluous whitespace in your data if possible
Disable search index updates, activate it back later
Round-robin across all your data nodes
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB, it also allows you to set a flush interval but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
I was searching about it and i found your question :)
i found this in elastic documentation
.. so i will investigate the size of my documents.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
In my case, I could not get more than 100,000 records to insert at a time. Started with 13 million, down to 500,000 and after no success, started on the other side, 1,000, then 10,000 then 100,000, my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for the bulk update. An important factor to consider in the bulk update is request data volume not only the no. of documents
An excerpt from link
How Big Is Too Big?
      The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
      Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
      It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually I'm facing some problems related to bulk API. There is one parameter that impact the bulk api. It's the number of index inside a bulk request.

Does this BIG monogdb storage cause low performance?

We have a mongodb with 336GB data on it.
Unfortunately there is only 8GB memory on that server.
Is it true to say that this will slow the db down, especially when I try to traverse the entire collection?
What can I do to improve performance?
To get things right, this isn't a "BIG" production setup; it is actually relatively small.
That aside:
Is it true to say that this will slow the db down, especially when I try to traverse the entire collection?
It is true yes. As you iterate the collection MongoDB will need to page in your data, this is true even if you have indexes on the collection.
The exception to this is when you use indexOnly cursors whereby all the data comes only from the index, including the returned document; these are otherwise known as covered queries.
The problem you have here is that your dataset is 42x greater than your RAM amount, assuming you are allowed to use all your RAM (this is not true of course, the OS and other programs will reserve amounts off for themselves). This means that if you expect to iterate the entire collection you will not be able to do it performantly, instead MongoDB could be page thrashing its allocated memory.
What can I do to improve performance?
Get a little more RAM.
You could also try a bit of sharding if getting too much RAM on that one server is a pain.
I would aim for about 20x more data than RAM, that shouldn't be too bad in most cases.
You should index your collection http://docs.mongodb.org/manual/applications/indexes/ to improve performance, but bear in mind that memory is utilised by mongodb when querying indexes so make sure each index you create can fit within the memory you have on your server.
You could also shard your collection but you will need more servers to do this. http://docs.mongodb.org/manual/sharding/
And I know it's obvious but get more memory - its cheap!
Mongodb uses memory-mapped files to map the data in to the systems virtual memory. If you try to access more data than the available memory of the system, the performance will be poor. You'll have to consider other options like sharding, indexing, increasing RAM etc. Indexing may improve the performance but not by much if done on a large data set, because indexes also need memory. A few references:
First 3 questions talk about memory-mapped files: http://docs.mongodb.org/manual/faq/storage/
On sharding: http://docs.mongodb.org/manual/faq/sharding/
Ensuring index fit into the RAM: http://docs.mongodb.org/manual/applications/indexes/#ensure-indexes-fit-ram
The other answers say either "have enough memory to fit your data" or "have enough memory for each index" or "have some multiple of your RAM in data". None of those are very effective nor very precise for capacity planning.
You need to know what your access patterns will be and then decide what indexes you will need to effectively be able to use your data. If all of your indexes fit in available RAM with some room to spare for most recently touched documents, then you should be okay.
When your working set (accessed data + indexes) cannot fit in RAM then your performance will be correlated more with disk access speed than anything else. Depending on how fast your disks are and on your throughput and latency requirements, it may work out okay or it may not.
While there is not enough information to say with certainty whether you will succeed or fail on this particular machine, you should be able to collect enough information to determine that for yourself by analyzing your indexing needs, etc.

Index linear growth - Performance degradation

We have 4 shards with 14GB index on each of them
Each shard has a master and 3 slaves (each of them with 32GB RAM)
We're expecting that the index size will grow to double or triple in near future.
So we thought of merging our indexes to 28GB index so that each shard has 28GB index and also increased our RAM on each slave to 48GB.
We made this changes locally and tested the server by sending same 10K realistic queries to each server with 14GB & 28GB index, we found that
For server with 14GB index (48GB RAM): search time was 480ms, number of index hits: 3.8G
For server with 28GB index (48GB RAM): search time was 900ms, number of index hits: 7.2G
So we saw that having the whole index in RAM doesn't help in sustaining the performance in terms of search time. Search time increased linearly to double when the index size was doubled.
We were thinking of keeping only 4 shards configuration but it looks like now we have to add another shard or another slave to each shard.
Is there any other way that we can configure our servers so that the performance isn't affected even when index size doubles or triples?
I'd hate to say it depends, but it... depends.
The total size of your index on each is 14GB, which basically doesn't mean much of anything to SOLR. To get a real feel for performance what is the uniqueness of the terms indexed? An index of 14GB worth of data with the single word "cat" in it over and over again will be really quick.
Also have you confirmed you need the following features, disabling them can boost performance large amounts:
Schema
Stored Fields
Do you need stored fields? Removing this can greatly increase performance (you can safely have an entire index without any stored fields and rely completely on facets, pivots, and other features in solr to drive a UX).
omitNorms
You can, in some instances, set this flag to false to reduce memory in general and increase performance.
omitTermFreqAndPositions
Can be turned off, reduced memory in general and increase in performance.
System
Optimize Core/Index (Segment Count)
Index optimization is important when dealing with larger index sizes. Ensure each core is optimized and that when you look at the core it says the segment count is = 1. What I found is that this play a more important role as you increase the index size (this plays into OS level file caching and the fact it's easier to read one large file, rather than multiple small files) And yes, that does say 171 million+ documents.
Term Index Interval/Frequency
Configuration of term index interval may be required (by default 256) if you have a field or multiple fields that contain very unique values (for example GUID/UUIDs or unique IDs in general). Typically, the lower the TIF the more memory you need, the higher the TIF the less memory you need but the more disk seeks you may have.
Allocation of too much Ram
Solr works best with a good split between OS level disk cache and RAM used when faceting, you'd be surprised that you could actually get better performance by tweaking other parameters which lower required ram usage and free up resources for disk.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly to store those fields in Solr compared to the 40 db roundtrips. Just make sure that you marked the field as "not indexed" (indexed = false) in your schema config and maybe also compressed (compressed = true) (however this will of course use some CPU when indexing and retrieving).
When marking a field as "not indexed" no analyzers will process the field when indexing making it stored much faster than a indexed field.
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.

Resources