Elasticsearch and Lucene document limit

The document count reported by the stats API in our Elasticsearch installation is about 700 million, while the actual document count from the count API is about 27 million. We understand that the difference comes from nested documents, which the stats API counts as well.
In the Lucene documentation we read that there is a hard limit of 2 billion documents per shard. Should I worry that Elasticsearch is about to hit the document limit, or should I monitor the figure from the count API instead?
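For reference, this is roughly how we compare the two figures with the Python client (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index = "my-index"  # placeholder

# Count API: top-level documents only (nested documents are not counted separately)
top_level = es.count(index=index)["count"]

# Stats API: Lucene-level count on the primaries, which includes nested documents
lucene_docs = es.indices.stats(index=index)["_all"]["primaries"]["docs"]["count"]

print("count API:", top_level)
print("stats API:", lucene_docs)
```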

Yes, there is a limit of 2 billion docs per shard, and it is a hard Lucene limit. It applies to the Lucene-level document count, so nested documents count toward it, which makes the stats figure the relevant one to watch per shard.
There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents.
You should consider scaling horizontally.
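Since the limit is per shard, a simple way to keep an eye on it is to look at per-shard document counts, for example via the cat shards API. A minimal sketch, assuming the Python client and a placeholder index name:

```python
from elasticsearch import Elasticsearch

# Hard per-shard Lucene limit (Integer.MAX_VALUE - 128)
LUCENE_MAX_DOCS = 2_147_483_519

es = Elasticsearch("http://localhost:9200")

# One row per shard copy: index name, shard number, primary/replica flag, doc count
for row in es.cat.shards(index="my-index", format="json", h="index,shard,prirep,docs"):
    docs = int(row["docs"] or 0)  # "docs" is None for unassigned shards
    pct = 100.0 * docs / LUCENE_MAX_DOCS
    print(f"{row['index']} shard {row['shard']} ({row['prirep']}): {docs} docs, {pct:.2f}% of the limit")
```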

Related

Scaling up Elasticsearch

We have a problem with scaling up Elasticsearch.
We have a simple structure with multiple indices; each index has 2 shards, 1 primary and 1 replica.
Right now we have ~3000 indices, which means 6000 shards, and we think we will hit shard limits soon. (We currently run 2 nodes with 32 GB of RAM and 4 cores each, peak usage 65%; we are working on moving to 3 nodes so each index can have 1 primary shard and 2 replicas.)
Refresh interval is set to 60s.
Some indices have 200 documents, others have 10 million; most have fewer than 200k.
The total number of documents is about 40 million (and it can grow quickly).
Our search requests hit multiple indices at the same time (we might search across 50 or 500 indices, and in the future we may need to be able to search across all of them).
Searching needs to be fast.
Currently we synchronize all documents daily via the bulk API, in chunks of 5000 documents (~7 MB), because our tests showed that works best: ~2.3 seconds per request of 5000 documents (~7 MB), done by 10 async workers (a rough sketch is at the end of this question).
Sometimes several workers hit the same index at the same time, and the bulk request then takes longer, up to ~12.5 seconds per request of 5000 documents (~7 MB).
The current synchronization process takes about 1 hour for the 40 million documents.
Documents are keyed by UUID (we use the UUIDs to fetch documents directly from Elasticsearch). Document values can change daily; sometimes we only change a synchronization_hash field, which determines which documents have changed. After the synchronization finishes, we delete the documents that still have an old synchronization_hash.
Another issue is that we think our data architecture is broken. We have ~300 clients (this number can grow), and each client is only allowed to search a subset of the indices (from 50 to 500). Indices can be shared between clients (client X has 50 indices, client Y has 70, and X and Y often need access to the same documents), which is why we store the data in separate indices, so we don't have to update every index where a given document is stored.
To increase indexing speed we are even considering moving to 4 nodes (with 2 primaries and 2 replicas per index), or to 4 nodes with only 2 shards per index (1 primary, 1 replica), but we need to test things to figure out what would work best for us. We might need to double the number of documents in the next few months.
What do you think we could change to increase indexing speed without reducing search speed?
What can be changed in our data architecture?
Is there another way our data could be organized that would allow fast searching and faster indexing?
I have tried many chunk sizes for the synchronization, but I haven't tried changing the architecture.
We are trying to achieve faster indexing without reducing search speed.
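For context, here is a rough sketch of what one of our synchronization workers does (the index name, the payload field, and the hash handling are simplified placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(docs, index_name):
    """Yield one bulk action per document, addressed by its UUID."""
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": doc["uuid"],
            "_source": {
                "payload": doc["payload"],
                "synchronization_hash": doc["synchronization_hash"],
            },
        }

def synchronize(docs, index_name, current_hash):
    # Bulk-index in chunks of 5000 documents, several chunks in flight at once
    for ok, item in helpers.parallel_bulk(
        es, actions(docs, index_name), thread_count=10, chunk_size=5000
    ):
        if not ok:
            print("failed:", item)

    # Afterwards, drop every document that was not touched in this run
    es.delete_by_query(
        index=index_name,
        query={"bool": {"must_not": {"term": {"synchronization_hash": current_hash}}}},
    )
```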

Search performance in full text query

In a 14-node, RAID 5-based Elasticsearch cluster, search performance does not improve no matter how simple the full-text query is.
There are two indices: one has 30 shards, 1 replica, 1.1 TB of data and about 110M documents; the other has 200 shards, 1 replica, 8.4 TB and about 663M documents. In both, the average query time does not seem to drop below 700 ms.
The query is basically a multi_match query over a few different fields.
What could be the reason?
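For reference, the query shape is roughly the following (field names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A plain full-text query over a few fields; "title", "body" and "tags" are placeholders
resp = es.search(
    index="my-index",
    query={
        "multi_match": {
            "query": "example search terms",
            "fields": ["title", "body", "tags"],
        }
    },
)
print(resp["took"], "ms,", resp["hits"]["total"]["value"], "hits")
```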

Lucene and Elasticsearch going past the document limit

What happens when we try to ingest more documents into a Lucene index past its maximum of 2,147,483,519?
I have read that as we approach 2 billion documents we start to see performance degradation.
But does Lucene simply stop accepting new documents past its maximum?
Also, how does Elasticsearch handle the same scenario when one of its shards reaches the document limit?
Every Elasticsearch shard is a Lucene index under the hood, so the limit applies to an Elasticsearch shard as well, and based on this Lucene issue it looks like Lucene simply stops indexing further docs.
Performance degradation depends on several factors, such as the size of the documents, the JVM heap allocated to the Elasticsearch process (~32 GB is the practical maximum), the available file system cache (which Lucene relies on), the number of CPUs, network bandwidth, etc.
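Assuming the rejection surfaces as a per-item bulk error (the way Elasticsearch reports other shard-level indexing failures), you can make such failures visible instead of letting them pass silently. A minimal sketch with the Python helpers and a placeholder index:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

docs = ({"_index": "my-index", "_id": str(i), "_source": {"value": i}} for i in range(1000))

# raise_on_error=False collects per-document failures instead of raising on the first one
success, errors = helpers.bulk(es, docs, raise_on_error=False)

print(f"indexed {success} documents, {len(errors)} failures")
for err in errors[:5]:
    # Each entry contains the action type and the error reason reported by the shard
    print(err)
```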

Elasticsearch Document Count Doesn't Reflect Indexing Rate

I'm indexing data from Spark into Elasticsearch, and according to Kibana I'm indexing at a rate of 6k/s on the primary shards. However, if you look at the Document Count graph in the lower right, you'll see that it doesn't increase proportionately. How can this index have only 1.3k documents when it's indexing at five times that per second?
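One thing worth checking is whether the writes are overwriting existing document IDs: the indexing rate counts every index operation, while the document count only grows for new IDs. A quick sketch with the Python client (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.indices.stats(index="my-index")["_all"]["primaries"]
print("index operations so far:", stats["indexing"]["index_total"])
print("live documents:         ", stats["docs"]["count"])
print("deleted / overwritten:  ", stats["docs"]["deleted"])
```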

Elasticsearch handling for max shard size

I have learnt that an ES shard is just a Lucene index, and that the maximum number of items in a Lucene index is Integer.MAX_VALUE - 128 (approximately 2 billion), but I could not find anywhere in the ES reference how this scenario is handled. Does ES fail, or does it assign another shard to documents with the same route?
Or is it something we need to plan for in advance when designing our indexing strategy?

Resources