Reason behind “Index creation no longer defaults to five shard”

What was the reasoning behind "Index creation no longer defaults to five shard but one shard"?
So far, the assumption was: more shards = more scalability = more parallelism.
Isn't that change defeating the whole purpose of distributed systems like ES?

Yes, more shards can mean more scalability and more parallelism, but only when those shards can actually use multiple cores or multiple machines (data nodes) in the cluster.
This is a default config, created for basic workloads; advanced use cases obviously need more fine-tuning, which is the whole point of making it configurable. Designing the perfect Elasticsearch cluster is very difficult because it depends on many factors, so Elasticsearch provides default values that work for general use cases.
Either you start with a modest workload that gradually increases, or you start with a huge workload from the beginning (in which case you will have more shards anyway, to get the benefit listed in the first line; this is the advanced use case).
The first case is more common, and the beauty of Elasticsearch is that you can get started with little knowledge: the default settings work quite well for modest workloads, and often you don't have to change them or even understand them in detail.
Having many shards for a small number of documents under heavy search traffic caused problems (with the old default of 5 shards, a single search spawned 5 threads), and this is the common situation for most basic and modest applications out there.
So it makes sense to change the default to 1 shard, as that is the more common use case; beyond that, you need to go deeper to scale your cluster anyway, which requires fine-tuning Elasticsearch further.
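For workloads that do need more than the default, the shard count can be set explicitly when the index is created. Below is a minimal sketch of the request body, assuming the standard `number_of_shards` and `number_of_replicas` index settings. The helper name `index_settings` is hypothetical, and the body would be sent with a create-index request such as `PUT /my-index` (the index name is also just an illustration).

```python
import json


def index_settings(primary_shards=1, replicas=1):
    """Build the body of a create-index request (PUT /<index>)."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


if __name__ == "__main__":
    # Explicitly ask for the old default of five primary shards.
    print(json.dumps(index_settings(primary_shards=5), indent=2))
```

The number of primary shards is fixed at creation time for a given index, which is why it is worth deciding on it up front rather than relying on the default.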

Related

Elasticsearch - Sharding and Performance

I think I've finally gotten a grasp of the fundamental understanding of how to allocate shards for Elasticsearch. Please correct me if I'm wrong, this is what I've pieced together:
Ideally, there should only exist one shard per index, per node.
The only reason why we would ever want to configure more than
one shard IS to over-allocate for future growth (i.e. adding more
nodes to physically support the data).
Now, assuming what I have above is correct, I then wonder if there are any performance issues or differences if I only had one node with 1 shard versus one node with 5 shards. Can anyone enlighten me on this subject?

"The only reason why we would ever want to configure more than one shard IS to over-allocate for future growth (i.e. adding more nodes to physically support the data)."
Not necessarily so. Having more shards helps parallelise your queries and helps them finish faster, but beyond a certain point it becomes counterproductive: too many shards mean overhead in merging the individual shard responses and time spent queuing.
"one node with 1 shard versus one node with 5 shards"
It depends on what your use case is but you should see some performance gain for bigger queries, with 5 shards.

I believe it depends on the size of the shards. For instance, on the elastic website, they say the following:
"Querying lots of small shards will make the processing per shard
faster, but as many more tasks need to be queued up and processed in
sequence, it is not necessarily going to be faster than querying a
smaller number of larger shards. Having lots of small shards can also
reduce the query throughput if there are multiple concurrent queries."
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
In practice I have found that exploratory testing with realistic queries helps me determine more definitively how I should move forward with my architecture. It really depends on the use case. As was stated previously, however, there comes a point where you can "over-optimize", and it ends up cancelling out any noticeable gains you might otherwise have obtained.
To be succinct, one shard per index, per node is a fine practice. But if you find yourself needing more, then just assess your use case first and determine if additional shards are truly necessary.

ElasticSearch Analytical queries

I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.
For datasets of 1B-10B points, what limitations (if any) would ElasticSearch have, and would it even be feasible? For example, when building a feature set like that of Google Analytics on top of it.

Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.
With Elasticsearch there is no black-white answer to a question as open-ended as yours. The amount of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs etc etc. Two things are certain from the data you mentioned: you need dedicated master nodes and more importantly good client nodes and depending on queries and the concurrent searches count you will need more or less of them.
In Elasticsearch 5 the client node is called a coordinating node, but it has the same role. One limitation I can think of is the heap memory of such a coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB because of the longer garbage collection cycles of the JVM (the more memory there is to clean, the longer it takes and the longer the node is unusable). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.
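The heap ceiling can be written down as a small sizing rule. This is only a sketch: it assumes the commonly cited guidance of giving the heap at most half of the machine's RAM and combines it with the ~30GB ceiling mentioned above. The function name and the exact cap value are illustrative, not an official formula.

```python
HEAP_CAP_GB = 30  # approximate ceiling from the text above; an assumption, not an exact limit


def suggested_heap_gb(ram_gb):
    """Return a heap size in GB: half of the RAM, capped at HEAP_CAP_GB."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, HEAP_CAP_GB)


if __name__ == "__main__":
    for ram in (8, 64, 256):
        print(f"{ram} GB RAM -> {suggested_heap_gb(ram):g} GB heap")
```

Under these assumptions, adding RAM beyond roughly 60 GB per node no longer increases the usable heap, which is why the answer talks about adding more coordinating nodes rather than bigger ones.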
I said that you will most likely need coordinating nodes because heavy aggregations (probably the most used feature in an analytics platform) use CPU and memory in the final phase of a query, where the node gathers the results from all shards involved and performs a final sorting and aggregation. It will therefore need more memory than a normal data node would for aggregations alone.
I doubt, though, that a single aggregation will use that many GBs of memory, but it theoretically could if the query/aggregation is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use, you might need more or fewer coordinating nodes so that GC cycles are not too frequent.
Bottom line: I think this is possible, but it takes some common sense (see my comment about reckless aggregations) and load estimates that are as close to reality as possible.

Google Analytics Pros:
Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection
Google Analytics Cons:
Custom reporting is limited
Upgrading to Premium is expensive
Requires continual training
Samples data (reports on a subset) to cope with large data volumes
ElasticSearch Pros:
Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying
ElasticSearch Cons:
Not a relational database, therefore does not benefit from things like foreign-key constraints
Data consistency can be affected
No built-in authentication or authorization system

Elasticsearch shard allocation for small indices

I have an Elasticsearch setup with 192 active indices ranging from a few hundred MB to possibly 5 GB each. I read that for a Logstash use case with 1 GB indices you should only use 1 shard. The difference with my setup is that I will have more users (an estimated 100 at most) expecting a quick response time. I intend to have 1 replica for reliability.
Will having 1 shard per index still be appropriate for my use case?

In a word: yes.
The need to create multiple primary shards derives from the need to isolate documents, extreme counts (e.g., when you're in the billions of documents volume), or to improve write throughput (write documents across more places, thereby reducing individual burden).
In practice, you want to shard based on your use case, unless you're in one of those first two scenarios (isolation or extreme counts).
Are you read heavy?
Are you write heavy? (Less common, but it does happen)
If you're read heavy, as most use cases are, then having fewer shards will help you by limiting the request size (fewer places to look). Given that your shard sizes are also relatively small (I'd consider anything under 5 GB to be relatively small), you can easily get away with having a single primary shard and it should benefit your search performance by doing so.
Indexes that share the same mappings, but are also tiny ("few hundred MBs"), should likely be combined if you search across them. If they're independent, then it really makes no difference and the isolation sounds like good practice at the expense of slightly bloating your cluster state (with each index).
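To see what the choice means for the cluster as a whole, the shard count for the setup in the question (192 indices, 1 replica) can be worked out with simple arithmetic. This is a rough sketch; the function name is hypothetical, and it counts every replica copy as a shard of its own.

```python
def total_shards(indices, primaries_per_index, replicas):
    """Primary plus replica shard copies across all indices."""
    return indices * primaries_per_index * (1 + replicas)


if __name__ == "__main__":
    for primaries in (1, 5):
        print(primaries, "primary shard(s) per index ->",
              total_shards(192, primaries, 1), "shards in the cluster")
```

With one primary per index the cluster holds 384 shards; with five primaries per index it holds 1920, all of them for indices that are at most a few GB each.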

Have a look at this blog: https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index. He has a lot of good pointers to sharding and shard sizing.
However, the question you really should be asking yourself is: How easy is it to change? When it comes to sizing and scalability, the answer often is "it depends" - and the real question is: How quickly can you reconfigure?
This could mean, for example, that you design your application in a way that allows quick re-spooling of data into a new index, that you use aliases so that you can in fact change these things, that you know where your data lives (not just in Elastic, I hope), etc.
Building a system, from the start, so that you can quickly rebuild indices enables you to experiment with sizes and, more importantly, to change them as your needs change.
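One way to make such a rebuild cheap is to have the application talk to an alias and repoint it once the new index is ready. The sketch below builds the body for the `_aliases` endpoint (`POST /_aliases`), which applies its actions atomically. The helper name and the index and alias names are hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: the application only ever queries the alias "products".
    body = alias_swap_actions("products", "products-v1", "products-v2")
    print(json.dumps(body, indent=2))
```

Because searches go through the alias, the new index can be created with a different shard count, filled, and swapped in without changing the application.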

Elasticsearch with different java heaps, does it matter?

Say I've got 2 servers, one of which has -Xmx and -Xms set to 4G and the other to 2G.
Will ElasticSearch handle those performance differences in the balancing mode? Or will both the servers be called purely based on indices, resulting in a (much more) likely OOM for the latter than the former?
By the way, I've set the properties indices.fielddata.cache.size, indices.breaker.fielddata.limit, indices.breaker.request.limit, and indices.breaker.total.limit on both servers, as ElasticSearch suggests.
This is important, to me, because if it does, I'd have to change the index sharding on guessed index strain, which will be a hassle (if not impossible)

Elasticsearch treats every node as the same and balances the shards equally between them. This means that Elasticsearch won't readjust based on hardware to get you the optimal performance.
One thing to remember here is that a herd of bulls is only as fast as its slowest bull, and the same applies here. If the load is small enough that it does not exhaust the hardware of the 2 GB machine, you should not see any issue. Otherwise you should see a difference in memory-intensive operations like aggregations.

Which server parameters to tweak in Solr if I expect heavy writes and light reads?

I am facing scalability issues designing a new Solr cluster, and I need the master to be able to handle a relatively high rate of updates with almost no reads; those can be done via slaves.
My existing Solr instance is occupying a huge amount of RAM, in fact it started swapping at only 4.5mil docs. I am interested in making the footprint as little as possible in RAM, even if it affects search performance.
So, which Solr config values can I tweak in order to accomplish this?
Thank you.

It's hard to say without knowing the specifics of your environment (the schema, custom indexers, query functions, etc.), and what counts as a huge amount of RAM? But you could start by:
setting filterCache, queryResultCache and documentCache to 0 in solrconfig.xml. This will severely impact the performance of queries executed in Solr.
setting compression to true on the TextField and StrField types that you store, then setting compressThreshold to a low integer value. This will decrease the size of the documents at the cost of increased CPU usage. (See http://wiki.apache.org/solr/SchemaXml#head-73cdcd26354f1e31c6268b365023f21ee8796613 for more details.)
turning off all autowarming queries and not doing any read queries
making sure you commit often enough
Obviously these are all things to do on the master, not on the slaves.
