I'm having an odd issue in where I set up a DSE 4.0 cluster with 1 Cassandra node and 1 Solr node (using DseSimpleSnitch) and performance is great. If I add additional nodes to have 3 Cassandra nodes and 3 Solr nodes, then the performance of my Solr queries goes downhill dramatically. Anyone have any idea what I might be doing wrong? I have basically all default options for DSE and have tried wiping all data and recreating everything from scratch several times with the same result. I've also tried creating the keyspace with replication factors of 1 and 2 with the same results.
Maybe my use case is a bit odd but I'm using Solr for OLTP type queries(via SolrJ with binary writers/readers) which is why the performance is critical. With a very light workload of say 5 clients making very simple Solr queries the response times go up about 50% from a single Solr node to 3 Solr nodes with only a few hundred small documents seeded for my test(~25ms to ~50ms). The response times get about 2 to 3 times slower with 150 clients against 3 nodes compared to a single node. The response times for Cassandra are unchanged, its only the Solr queries that get slower.
Could there be something with my configuration causing this?
Solr queries need to fan out to cover the full range of keys for the column family. So, when you go from one node to three nodes, it should be no surprise that the total query time would rise to three times a query that can be satisfied with a single node.
You haven't mentioned the RF for the Search DC.
For more complex queries, the fan out would give a net reduction in query latency since only a fraction of the total query time would occur on each node, while for a small query the overhead of the fanout and aggregation of query results dwarfs the time to do the actual Solr core query.
Generally, Cassandra queries tend to be much simpler than Solr queries, so they are rarely comparable.
Problem solved. After noticing the documentation mentioning not to use virtual nodes for Solr nodes (and not saying why) I checked my configuration and noticed I was using virtual nodes. I changed my configuration to not use virtual nodes and the performance issue disappeared. I also upgraded from 4.0.0. to 4.0.2 at the same time but I'm pretty sure it was the virtual nodes causing the problem.
Related
We regularly encounter several issues that crop up from time to time with Elasticsearch. They seem to be as follows:
Out of disk space
Slow query evaluation time
Slow/throttled data write times
Timeouts on queries
There are various areas of an Elasticsearch cluster that can be configured:
Cluster disk space
Instance type/size
Num data nodes
Sharding
It can sometimes be confusing which areas of the cluster you should be tuning depending on the problems outlined above.
Increasing the ES cluster total disk space is easy enough. Boosting the ES instance type seems to help when we experience slow data write times and slow query response times. Implementing sharding seems to be best when one particular ES index is extremely large. But it's never quite clear when we should boost the number of data nodes vs boosting the instance size.
Is there any limit on how many indexes we can create in elastic search?
Can 100 000 indexes be created in Elasticsearch?
I have read that, maximum of 600-1000 indices can be created. Can it be scaled?
eg: I have a number of stores, and the store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but obviously, you don't want to create too many indices(too many depends on your cluster, nodes, size of indices etc), but in general, it's not advisable as it can have a server impact on cluster functioning and performance.
Please check loggly's blog and their first point is about proper provisioning and below is important relevant text from the same blog.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks #Silas, who pointed that from ES 2.X, cluster state updates are not that much costly(As the only diff is sent in update call). More info on this change can be found on this ES issue
I would appreciate if someone could suggest the optimal number of shards per ES node for optimal performance or provide any recommended way to arrive at the number of shards one should use, given the number of cores and memory foot print.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why would you want to have more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). That's kind of a big deal to recreate all that data, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, that means that you can scale out to up to 6 nodes at a later point easily without resharding.
There are three condition you consider before sharding..
Situation 1) You want to use elasticsearch with failover and high availability. Then you go for sharding.
In this case, you need to select number of shards according to number of nodes[ES instance] you want to use in production.
Consider you wanna give 3 nodes in production. Then you need to choose 1 primary shard and 2 replicas for every index. If you choose more shards than you need.
Situation 2) Your current server will hold the current data. But due to dynamic data increase future you may end up with no space on disk or your server cannot handle much data means, then you need to configure more no of shards like 2 or 3 shards (its up to your requirements) for each index. But there shouldn't any replica.
Situation 3) In this situation you the combined situation of situation 1 & 2. then you need to combine both configuration. Consider your data increased dynamically and also you need high availability and failover. Then you configure a index with 2 shards and 1 replica. Then you can share data among nodes and get an optimal performance..!
Note: Then query will be processed in each shard and perform mapreduce on results from all shards and return the result to us. So the map reduce process is expensive process. Minimum shards gives us optimal performance
If you are using only one node in production then, only one primary shards is optimal no of shards for each index.
Hope it helps..!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: The definitive guide to elasticsearch
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory dedicated to the heap means less available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Less shards is better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Less servers is better (the ideal is 1 server (with 1 shard)]):
The load on an index can only be split across nodes by sharding (A shard is enough to use all resources on a node). More shards allow to use more servers but more servers bring more overhead for data aggregation... There is no free lunch.
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards.
Note: Most people tend to overestimate their growth, try to be realistic. For instance, this 111GB example is already a BIG index. For comparison the stackoverflow index is 430 GB (2016) and it's a top 50 site worldwide, made entirely of written texts by millions of people.
Usage: Index by time
When there're too much data for a single index or it's getting too annoying to manage, the next thing is to split the index by time period.
The most extreme example is logging applications (logstach and graylog) which are using a new index every day.
The ideal configuration of 1-single-shard-per-index makes perfect sense in scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: Let's imagine a popular internet forum with monthly indices. 99% of requests are hitting the last index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: It's probably unnecessary optimization. A 99% hitrate is unlikely in the real world and the shard replica could distribute part of the read-only load anyway).
Usage: Going Exascale (just for the record)
ElasticSearch is magic. It's the easiest database to setup in cluster and it's one of the very few able to scale to many nodes (excluding Spanner ).
It's possible to go exascale with hundreds of elasticsearch nodes. There must be many indices and shards to spread the load on that many machines and that takes an appropriate sharding configuration (eventually adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might be also a good idea to have more than one primary shard per node, depends on use case. I have found out that bulk indexing was pretty slow, only one CPU core was used - so we had idle CPU power and very low IO, definitely hardware was not a bottleneck. Thread pool stats shown, that during indexing only one bulk thread was active. We have a lot of analyzers and complex tokenizer (decomposed analysis of German words). Increasing number of shards per node has resulted in more bulk threads being active (one per shard on node) and it has dramatically improved speed of indexing.
Number of primary shards and replicas depend upon following parameters:
No of Data Nodes: The replica shards for the given primary shard meant to be present on different data nodes, which means if there are 3 data Nodes: DN1, DN2, DN3 then if primary shard is in DN1 then the replica shard should be present in DN2 and/or DN3. Hence no of replicas should be less than total no of Data Nodes.
Capacity of each of the Data Nodes: Size of the shard cannot be more than the size of the data nodes hard disk and hence depending upon the expected size for the given index, no of primary shards should be defined.
Recovering mechanism in case of failure: If the data on the given index has quick recovering mechanism then 1 replica should be enough.
Performance requirement from the given index: As sharding helps in directing the client node to appropriate shard to improve the performance and hence depending upon the query parameter and size of the data belonging to that query parameter should be considered in defining the no of primary shards.
These are the ideal and basic guidelines to be followed, it should be optimized depending upon the actual use cases.
I have not tested this yet, but aws has a good articale about ES best practises. Look at Choosing Instance Types and Testing part.
Elastic.co recommends to:
[…] keep the number of shards per node below 20 per GB heap it has configured
I started creating my new search application. In my earlier application I used Apache solr. Now I want to know which better in terms of performance and usability.
Personally I want to know the performance benchmark of Elastic search and solr. If there are other alternatives suggestions are most welcome.
Disclaimer: I work at elasticsearch.com
I would just say: give elasticsearch a try. I think that after some hours (minutes?), you will have somehow an opinion.
Start 2 or 3 or 4 nodes, and you will see how things are rebalanced nicely.
About performance, I'd say that elasticsearch will give you a constant query throughput even if you are doing massive index operations.
I have used both quite a bit, and much prefer ElasticSearch. The API is more flexible and accessible. It is easier to get started with. Replication happens automatically by default. In general all the defaults are easier to work with. Everything generally works out of the box (safe defaults) and you only need to tune what you find needs to work better.
I have not worked much with SOLR 4, only with 3.x. Once I switched I never looked back, but I hear that there are many improvements in 4 with regards to replication and clustering that make it a usable competitor.
With regards to performance, I think that generally they are comparable as they both rely on Lucene. That is why there is a lack of valid benchmarks that make this general comparison. That said, there are certainly use cases where one will perform better than the other.
If you look at the trends of utilization while there are many more people currently using SOLR, it is in decline. That decline is very correlated to the increase in users of Elasticsearch which is very much on the rise. As Dadoonet said, give ElasticSearch a try, it won't take long and you won't want to use SOLR again.
UPDATE
I just spent two weeks on a client site consulting on a SOLR Cloud installation. I am now much more familiar with the updates to SOLR, and say quite confidently, I still prefer ElasticSearch, but it seems SOLR has some momentum again.
ElasticSearch, is hands down more elastic. That is, having an elastic cluster where nodes come and go, or even where you just need to add nodes is much much easier in ElasticSearch than SOLR. Anyone who tells you it is easy in SOLR, has not done it in ElasticSearch. ElasticSearch will automatically join a cluster and assume an active role in that cluster, taking over serving available shards and replicas. Over the last week I decommissioned a 2 node cluster, replacing it with two new nodes. I simply added the 2 new nodes, and one at a time, marked the other two nodes as non-data nodes. Once the shard migration completed I decommissioned the nodes. I had set minimum_master_nodes = 2 ((2/2)+1), and had no issue with split brain.
During the same week, I had to add a node to a SOLR cluster. The process was poorly documented, especially considering the changes from 4.1 to 4.3 and the mishmash of existing documentation, much of which says you can't even do it based on old versions of SOLR. I finally found documentation which clarified. It requires manually adding a core to the collection and then adding replicas to existing shards within the cluster. Finally you manually decommission the redundant shards on some other node. At some point this node may become master for one of those shards but not immediately.
With SOLR If you do not have sufficient shards to distribute, you can just add replicas or you can go through a shard split to create two new shards. Again this is a poorly documented feature, but is functionality that does not exist in ElasticSearch. You must split and then remove the original shard, something none of the documentation clearly explains.
SolrCloud has a couple other advantages as well if integrating with Hadoop. If you are indexing data in HDFS or HBase, there are now both Map-Reduce, and real time methods of ingesting data into SOLR. This provides some real power to your Big Data platform and allows you to do full text search over data that is otherwise barely accessible.
While you can index Hadoop data into ElasticSearch, the implementation is not as clean as the SolrCloud/Cloudera Search implementations. Having the MapReduce directly build the shards is a far superior solution with significant performance benefits. Reducers talking directly to a cluster works, but it is not the same. I do not know if anything similar to the Lily connector for HBase exists for ElasticSearch, if not I may look into writing one. This allows indexing directly from the HBase replication logs.
So in summary there are certainly situations where either is beneficial. If you are looking for tight integration with Hadoop, SOLR, ClouderaSearch specifically, is a good option. If you are looking for ease in managing an Elastic cluster, Elasticsearch will be a much better option. For me, I'll continue with my hacky Hadoop integrations to make it work with Elasticsearch, until something better emerges.
I have an Oracle Database with around 7 millions of records/day and I want to switch to MongoDB. (~300Gb)
To setup a POC, I'd like to know how many nodes I need? I think 2 replica of 3 node in 2 shard will be enough but I want to know your thinking about it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size..You need to know how much data does 7 million records/day amounts to. This is active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing..in your case, it is scalability..
but know your data well before doing so.
For your POC, keeping two shards means roughly 150GB on each.. If you have that much disk available, no problem.
Give some consideration to your sharding keys, what fields does it make sense for you to shared your data set on? This will impact on the decision of how many shards to deploy, verses the capacity of each shard. You might go with relatively few shards maybe two or three big deep shards if your data can be easily segmented into half or thirds, or several more lighter thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (eg. will your application requirements outgrow the resources of a single machine; how much of your data set will be active working set for queries, etc).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage .. for example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.