Elasticsearch Document Count Doesn't Reflect Indexing Rate - elasticsearch

I'm indexing data from Spark into Elasticsearch, and according to Kibana, I'm indexing at a rate of 6k/s on the primary shards. However, if you look at the Document Count graph in the lower right, you'll see that it doesn't increase proportionately. How can this index have only 1.3k documents when it's indexing at roughly five times that per second?
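For what it's worth, one way to sanity-check what the dashboard shows is to query the count and stats APIs directly; a minimal sketch, assuming the index is called my-index (a placeholder):

    POST my-index/_refresh
    GET my-index/_count
    GET my-index/_stats/docs

The count API only reports top-level documents that are already visible to search (hence the refresh), while the stats figure counts Lucene documents, nested documents included, so the two can legitimately differ.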

Related

Search performance in full text query

In a 14-node, RAID 5-based Elasticsearch cluster, search performance does not improve no matter how simple the full-text query is.
There are two indices: one has 30 shards, 1 replica, 1.1 TB of data and about 110M documents; the other has 200 shards, 1 replica, 8.4 TB of data and about 663M documents. In both of them, the average query time does not seem to drop below 700 ms.
The query is basically a multi_match query across a few different fields.
What could be the reason?
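For reference, a minimal multi_match query of the kind described might look like this (index and field names are placeholders):

    GET my-index/_search
    {
      "query": {
        "multi_match": {
          "query": "quick brown fox",
          "fields": ["title", "description", "tags"]
        }
      }
    }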

Elasticsearch - Single Index vs Multiple Indexes

I have more than 4000 different fields in one of my indices, and that number can grow over time.
Elasticsearch sets a default limit of 1000 fields per index, and there must be some reason for that.
Now, I am thinking that I should not increase the limit set by Elasticsearch.
So I should break my single large index into multiple smaller indexes.
Before moving to multiple indexes, I have a few questions:
The number of smaller indexes could grow to 50. Would searching across all 50 indexes at a time slow searches down compared to a search on the single large index?
Is there really a need to break my single large index into multiple indexes because of a large number of fields?
When I use multiple smaller indexes, the total number of shards would increase drastically (more than 250 shards), since each index would have 5 shards (the default number, which I don't want to change). A search across these indexes would be a search across those 250 shards at once. Will this affect my search performance? Note: the number of shards might grow over time as well.
When I use a single large index that has only 5 shards and a large number of documents, won't this overload those 5 shards?
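For context, the 1,000-field default mentioned above is the index.mapping.total_fields.limit setting; it can be read back and, should you decide to keep one index anyway, raised per index. A minimal sketch (my-index is a placeholder):

    GET my-index/_settings?include_defaults=true

    PUT my-index/_settings
    {
      "index.mapping.total_fields.limit": 2000
    }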
It strongly depends on your infrastructure. If you run a single node with 50 shards, a query will take longer than it would with only 1 shard. If you have 50 nodes holding one shard each, it will most likely run faster than one node with 1 shard (if you have a big dataset). In the end, you have to test with real data to be sure.
When there is a massive number of fields, ES runs into performance problems and errors become more likely. The main problem is that every field has to be stored in the cluster state, which takes a toll on your master node(s). Also, in a lot of cases you end up working with very sparse data (90% of fields empty).
As a rule of thumb, one shard should contain between 30 GB and 50 GB of data. I would not worry too much about overloading shards in your use case; if anything, the opposite is true, since your shards will hold far less data than that.
I suggest testing your use case with fewer shards; go down to 1 shard, 1 replica for your index. The overhead of searching multiple shards (5 primaries, multiplied by replicas) and then combining the results again is massive compared to your small dataset.
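A minimal sketch of creating an index with that layout (the index name and values are only illustrative):

    PUT my-index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }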
Keep in mind that document_type behaviour has changed and will change further. Since 6.x you can only have one document_type per index, and in 7.x document_types are removed entirely. As the API listens at _doc, _doc is the suggested document_type to use in 6.x. Either move to one index per _type or introduce a new field that stores your type if you need the data in one index.
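If you keep everything in one index, storing the former _type as an ordinary field is straightforward; a minimal sketch, assuming default dynamic mapping (the index name, doc_type field and sample values are made up):

    PUT my-index/_doc/1
    {
      "doc_type": "order",
      "customer_id": "c-42",
      "amount": 99.5
    }

    GET my-index/_search
    {
      "query": {
        "term": { "doc_type.keyword": "order" }
      }
    }

The term query targets the doc_type.keyword sub-field that dynamic mapping creates for string values.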

ElasticSearch handling for max shard size

I learnt that an ES shard is essentially a Lucene index, and that the maximum number of items in a Lucene index is Integer.MAX_VALUE - 128 (approximately 2 billion), but I could not find anywhere in the ES reference how this scenario is handled. Does ES fail, or does it assign another shard to documents with the same route?
Or is it something that we need to plan for in advance while designing the indexing strategy?
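As far as I know, routing is purely hash-based and ES does not move documents to a different shard when one fills up, so this is something to plan for when choosing shard counts. One way to keep an eye on per-shard document counts is the cat shards API (my-index is a placeholder):

    GET _cat/shards/my-index?v&h=index,shard,prirep,docs,store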

Elasticsearch and Lucene document limit

The document count in our Elasticsearch installation from the stats API shows about 700 million, while the actual document count from the count API is about 27 million. We understand that this difference comes from the nested documents - the stats API counts them all.
In the Lucene documentation, we read that there is a hard limit of 2 billion documents per shard. Should I worry that Elasticsearch is about to hit the document limit? Or should I monitor the number from the count API?
Yes, there is a limit of 2 billion docs per shard, which is a hard Lucene limit.
There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents.
You should consider scaling horizontally.
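Since the limit is a Lucene-level limit, the number worth monitoring is the Lucene-level count that the stats API reports (nested documents included), not the count API figure; a quick sketch of pulling both, with my-index as a placeholder (if I recall correctly, adding level=shards to the stats request breaks the figure down per shard):

    GET my-index/_count
    GET my-index/_stats/docs?level=shards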

ElasticSearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed into the same index (under different types). In order to improve the environment, we decided to index items by geohash code, with the mantra of no more than 30 GB per shard.
The current status is that we have more than 1500 indices, 12 shards per index, and every item is inserted into one of those indices. As you can imagine, the number of shards has surpassed 20,000...
Our indices are in the format <Base_Index_Name>_<geohash>
My question arises from performance problems that made me question our method. A simple count query of the form GET */_count takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and it is growing fast.
Actually, it depends on your usage. A query against all of the indices takes a long time because it has to go to all of the shards and the results have to be merged afterwards. 20K shards is not an easy thing to query.
If your data is time-based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query hits, and it will take much less time.
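A sketch of what the narrower requests look like, using the made-up names from above; both only touch the shards of that month's indices rather than all 20K:

    GET *201602/_count

    GET indexname201602/_search
    {
      "query": { "match_all": {} }
    }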
