What is the maximum Elasticsearch document size?

I read notes about Lucene being limited to 2Gb documents. Are there any additional limitations on the size of documents that can be indexed in Elasticsearch?

Lucene internally uses a byte buffer that is addressed with 32-bit integers, which by definition limits the size of a single document. So 2GB is the theoretical maximum.
In ElasticSearch:
There is a maximum HTTP request size in the ES source on GitHub, and it is checked against Integer.MAX_VALUE, i.e. 2^31-1 bytes. So, basically, 2GB is the maximum document size for bulk indexing over HTTP. On top of that, ES does not process an HTTP request until it has been received completely.
Good Practices:
Do not use a very large Java heap if you can help it: set it only as large as necessary (ideally no more than half of the machine's RAM) to hold the overall maximum working set size for your usage of Elasticsearch. This leaves the remaining (hopefully sizable) RAM for the OS to manage for IO caching.
On the client side, always use the bulk API, which indexes multiple documents in one request, and experiment with the right number of documents to send in each bulk request. The optimal size depends on many factors, but err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
For further study refer to these links:
Performance considerations for elasticsearch indexing
Document maximum size for bulk indexing over HTTP
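To illustrate the bulk-API advice above, here is a minimal sketch using the Python client; the node URL, index name and documents are placeholders, and elasticsearch-py's helpers module is assumed to be available:

    # Illustrative sketch only: bulk indexing with client-side concurrency.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    def actions():
        # One bulk action per document; keep each document well below
        # http.max_content_length (100MB by default).
        for i in range(100_000):
            yield {"_index": "my-index", "_source": {"id": i, "body": f"doc {i}"}}

    # parallel_bulk sends several bulk requests at once from client threads.
    # Experiment with chunk_size and thread_count; start small and increase.
    for ok, info in parallel_bulk(es, actions(), thread_count=4, chunk_size=1000):
        if not ok:
            print("failed:", info)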

I think things have changed slightly over the years with Elasticsearch. Per the 7.x documentation referenced here - General Recommendations:
Given that the default http.max_content_length is set to 100MB, Elasticsearch will refuse to index any document that is larger than that. You might decide to increase that particular setting, but Lucene still has a limit of about 2GB.
So it would seem that ES has a default limit of ~100MB per request, while Lucene's limit is about 2GB, as the other answer stated.
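As a rough, hypothetical illustration of what that 100MB default means in practice, you can estimate the serialized size of a document before sending it (the document below is a placeholder):

    import json

    MAX_CONTENT_LENGTH = 100 * 1024 * 1024  # http.max_content_length default: 100mb

    doc = {"title": "example", "body": "..." * 1_000_000}  # hypothetical document
    payload = json.dumps(doc).encode("utf-8")

    if len(payload) >= MAX_CONTENT_LENGTH:
        # The node will reject the request unless http.max_content_length is
        # raised, and even then Lucene caps a single document at about 2GB.
        print(f"document is {len(payload)} bytes - too large to index as-is")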

Related

Lucene and Elasticsearch going past the document limit

What happens when we try to ingest more documents into a 'Lucene' instance past its maximum limit of 2,147,483,519?
I read that as we approach 2 billion documents we start seeing performance degradation.
But does 'Lucene' just stop accepting new documents past its maximum limit?
Also, how does 'Elasticsearch' handle the same scenario for one of its shards when its document limit is reached?
Every Elasticsearch shard is a Lucene index under the hood, so this limit applies to each Elasticsearch shard as well, and based on this Lucene issue it looks like Lucene simply stops indexing further docs once the limit is reached.
Performance degradation depends on several factors, such as the size of the docs, the JVM heap allocated to the Elasticsearch process (~32 GB is the practical maximum), the available filesystem cache (which Lucene relies on), the number of CPUs, network bandwidth, and so on.
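If you want to keep an eye on how close your shards are to that limit, a sketch of my own (not from the answer) using the Python client and the _cat/shards API might look like this; the node URL and the 80% threshold are arbitrary:

    from elasticsearch import Elasticsearch

    LUCENE_MAX_DOCS = 2_147_483_519  # hard per-shard document limit

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    # h= picks the columns, format="json" returns a list of dicts.
    for shard in es.cat.shards(h="index,shard,docs", format="json"):
        docs = int(shard["docs"] or 0)  # "docs" can be empty for unassigned shards
        if docs > 0.8 * LUCENE_MAX_DOCS:  # arbitrary warning threshold
            print(f"{shard['index']} shard {shard['shard']}: {docs} docs "
                  f"({docs / LUCENE_MAX_DOCS:.0%} of the Lucene limit)")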

Elasticsearch - Physical size of a bulk request

We use the bulk API to index multiple docs. We try to control the batch size indirectly through various parameters, but I wanted to know whether there is any clean and recommended way to get the physical size of the batch prepared before sending the bulk index request to ES.
Note: Language - C# using NEST
TLDR
the entire bulk request has to be loaded into RAM by the node that receives it
beyond a certain size, performance no longer improves
that size is different for different hardware -- experiment to find yours
https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html
The entire bulk request needs to be loaded into memory by the node
that receives our request, so the bigger the request, the less memory
available for other requests. There is an optimal size of bulk
request. Above that size, performance no longer improves and may even
drop off. The optimal size, however, is not a fixed number. It depends
entirely on your hardware, your document size and complexity, and your
indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical
documents in batches of increasing size. When performance starts to
drop off, your batch size is too big. A good place to start is with
batches of 1,000 to 5,000 documents or, if your documents are very
large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk
requests. One thousand 1KB documents is very different from one
thousand 1MB documents. A good bulk size to start playing with is
around 5-15MB in size.
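The question asks about C#/NEST, but the underlying idea is client-agnostic: build the newline-delimited bulk body yourself and measure it before sending. A hypothetical Python sketch of that idea (index name and documents are placeholders):

    import json

    def bulk_body_size(docs, index="my-index"):
        """Return the NDJSON bulk payload and its size in bytes."""
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index}}))  # action line
            lines.append(json.dumps(doc))                           # source line
        body = "\n".join(lines) + "\n"  # a bulk body must end with a newline
        return body, len(body.encode("utf-8"))

    docs = [{"id": i, "text": "x" * 512} for i in range(1000)]  # placeholder docs
    body, size = bulk_body_size(docs)
    print(f"{len(docs)} docs -> {size / 1024 / 1024:.1f} MiB")  # aim for roughly 5-15MB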

Elastic Search - Maximum Shard Size

I came across this while learning Elasticsearch and couldn't reach a final conclusion.
What is the maximum shard size for ElasticSearch?
How many shards can an index have? Is there any maximum limit?
After reading multiple articles and blog posts and running my own load tests, I came to the conclusion that
the number of shards and the maximum size of each shard depend on many factors, such as:
Size of the data inserted
Rate at which the data is inserted
Whether data retrieval / search is happening at the same time? If yes, what is the frequency of search? How many concurrent searches are done?
Server configuration details, like number of cores in CPU, hard disk size, Memory size etc
So, to find the optimal size for each shard and the optimal number of shards for a deployment, one good way is to run tests using various combinations of parameters and loads and draw conclusions from the results.
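One way to structure such a test, sketched here with the Python client (8.x-style keyword arguments; index names and shard counts are arbitrary), is to create otherwise-identical indices that differ only in shard count and load the same data into each:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    for shards in (1, 2, 4, 8):  # arbitrary shard counts to compare
        es.indices.create(
            index=f"shard-test-{shards}",
            settings={"number_of_shards": shards, "number_of_replicas": 0},
        )
    # Bulk-load the same dataset into each index, then compare indexing
    # throughput and query latency as the per-shard size grows.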
Simple: don't cross Lucene's hard limit of roughly 2 billion documents per shard.
Also think about the heap-size limit: even on 64-bit systems, Elasticsearch recommends giving the JVM no more than half of the machine's memory and no more than about 32 GB, because of how the JVM handles memory (compressed object pointers). If you have more than 64 GB of memory, leave the rest to the OS filesystem cache, which Lucene uses heavily.
For further details: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html and https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index .
As others have said, the theoretical maximum is very large, however depending on your system, there can be practical limits.
I've found that shards start to become less performant around 150GB. I've had 50GB shards that perform reasonably well. In both cases, the shard was the only shard on the node, and the node had 54GB of system memory, with 31GB devoted to elasticsearch. At 50GB, I was getting results from relatively heavy-duty queries around 100ms, and at 150GB it was taking 500ms or longer.
I'm sure this depends on the mappings I've used, and a host of other factors, but perhaps it's useful if you're polling for datapoints.

What is the ideal bulk size formula in ElasticSearch?

I believe there should be a formula to calculate bulk indexing size in ElasticSearch. Probably the following are the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder if anyone knows of or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, try with 20 KiB, then with 10 KiB, ... dichotomy (binary search)
Use bulk size in KiB (or equivalent), not document count!
Send data in bulk (not streaming), and pass redundant info (such as the index name) in the API URL rather than in every action if you can
Remove superfluous whitespace in your data if possible
Disable index refresh during the load and turn it back on afterwards (see the sketch after this list)
Round-robin across all your data nodes
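As a sketch of the "disable index refresh" tip, assuming the 8.x Python client and a placeholder index name:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    # Turn off refresh before the bulk load...
    es.indices.put_settings(index="my-index",
                            settings={"index": {"refresh_interval": "-1"}})

    # ... run the bulk requests here ...

    # ...then restore the default interval and force one refresh.
    es.indices.put_settings(index="my-index",
                            settings={"index": {"refresh_interval": "1s"}})
    es.indices.refresh(index="my-index")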
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB; it also allows you to set a flush interval, but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
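If you are on Python rather than Java, the helpers module exposes comparable knobs; this is only an analogous sketch (placeholder node URL, index and documents), not the BulkProcessor itself:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    actions = (
        {"_index": "my-index", "_source": {"n": i}}  # placeholder documents
        for i in range(10_000)
    )

    # chunk_size caps actions per bulk request, max_chunk_bytes the payload size,
    # mirroring the 1000-action / 5MB defaults mentioned above.
    helpers.bulk(es, actions, chunk_size=1000, max_chunk_bytes=5 * 1024 * 1024)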
I was searching about this and I found your question :)
I found this in the Elastic documentation
... so I will investigate the size of my documents.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
In my case, I could not get more than 100,000 records to insert at a time. Started with 13 million, down to 500,000 and after no success, started on the other side, 1,000, then 10,000 then 100,000, my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for a bulk update. An important factor to consider in a bulk update is the request data volume, not just the number of documents.
An excerpt from the link:
How Big Is Too Big?
      The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
      Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
      It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually, I'm facing some problems related to the bulk API. There is one parameter that impacts it: the number of index actions inside a bulk request.

Index linear growth - Performance degradation

We have 4 shards with 14GB index on each of them
Each shard has a master and 3 slaves (each of them with 32GB RAM)
We're expecting that the index size will grow to double or triple in near future.
So we thought of merging our indexes to 28GB index so that each shard has 28GB index and also increased our RAM on each slave to 48GB.
We made these changes locally and tested the servers by sending the same 10K realistic queries to each server (one with the 14GB index, one with the 28GB index), and we found that:
For server with 14GB index (48GB RAM): search time was 480ms, number of index hits: 3.8G
For server with 28GB index (48GB RAM): search time was 900ms, number of index hits: 7.2G
So we saw that having the whole index in RAM doesn't help in sustaining the performance in terms of search time. Search time increased linearly to double when the index size was doubled.
We were thinking of keeping only 4 shards configuration but it looks like now we have to add another shard or another slave to each shard.
Is there any other way that we can configure our servers so that the performance isn't affected even when index size doubles or triples?
I'd hate to say it depends, but it... depends.
The total size of your index on each shard is 14GB, which by itself doesn't mean much of anything to Solr. To get a real feel for performance, ask how unique the indexed terms are: an index holding 14GB worth of data that is just the single word "cat" over and over again will be really quick.
Also, have you confirmed that you need the following features? Disabling them can boost performance by large amounts:
Schema
Stored Fields
Do you need stored fields? Removing them can greatly increase performance (you can safely have an entire index without any stored fields and rely completely on facets, pivots, and other features in Solr to drive a UX).
omitNorms
You can, in some instances, set this flag to true so that norms are omitted, which reduces memory usage in general and increases performance.
omitTermFreqAndPositions
Setting this to true omits term frequencies and positions, which reduces memory in general and increases performance (at the cost of phrase queries on that field).
System
Optimize Core/Index (Segment Count)
Index optimization is important when dealing with larger index sizes. Ensure each core is optimized and that, when you look at the core, the segment count is 1. I found that this plays a more important role as you increase the index size (it ties into OS-level file caching and the fact that it's easier to read one large file than multiple small files). And yes, that applies even to a core with 171 million+ documents.
Term Index Interval/Frequency
Configuring the term index interval may be required (the default is 256) if you have one or more fields that contain a very large number of unique values (for example GUIDs/UUIDs or unique IDs in general). Typically, the lower the TIF the more memory you need; the higher the TIF, the less memory you need but the more disk seeks you may have.
Allocating too much RAM
Solr works best with a good split between the OS-level disk cache and the RAM used for faceting; you'd be surprised that you can actually get better performance by tweaking other parameters that lower the required RAM usage and free up resources for the disk cache.
