ELK Deployment: CPU, Memory, Disk - elasticsearch

I would like to deploy Elasticsearch, Logstash and Kibana on 3 different flavors; the characteristics of each one:
o 32 vCPU, 384 GB RAM and 900 GB HDD
I would like to monitor 100 servers, so approximately 33 servers per flavor.
Do you think it's a good idea to use this configuration? And is it a problem to use this huge amount of memory?
Another question how many nodes should I use?

Without details it's hard to give you general advice, but Elastic recommends never crossing roughly 31 GB of JVM heap. Here are the reasons why.
You should read the whole page; it explains why it is generally far better to have a lot of small/medium hosts instead of a few big ones.
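If it helps, here is a minimal sketch of that check (assuming the Python requests library, an unsecured cluster reachable on localhost:9200, and field names as exposed by recent versions of the nodes info API):

```python
# Sketch: report each node's max heap and whether compressed oops are in use.
import requests

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster

resp = requests.get(
    f"{ES_URL}/_nodes/jvm",
    params={"filter_path": "nodes.*.name,nodes.*.jvm.mem.heap_max_in_bytes,"
                           "nodes.*.jvm.using_compressed_ordinary_object_pointers"},
    timeout=10,
)
resp.raise_for_status()

for node in resp.json().get("nodes", {}).values():
    heap_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    oops = node["jvm"].get("using_compressed_ordinary_object_pointers", "unknown")
    print(f"{node['name']}: heap ~{heap_gb:.1f} GB, compressed oops: {oops}")
```

If the heap is over ~32 GB you will typically see compressed oops reported as false, which is the situation the linked page warns about.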
I also recommend reading this post; it will give you some insight into how to design an Elastic cluster, in particular the distinction between roles in a cluster and the differences in hardware each role needs.
For your question:
Another question how many nodes should I use?
There is no good answer without knowing the volume of data, the read/write patterns, etc.
And last, I highly doubt that using the same configuration for Kibana / Logstash / Elasticsearch hosts is a good idea. They just don't do the same sort of processing. You should start with a small configuration and grow it incrementally once you have real data.

Related

Elasticsearch configuration and best practices in production

I'm new to working with the ELK stack, and I'm dealing with 10 TB of data stored on physical servers. Is there a recommendation on how many data nodes and master nodes I should use, what the best practices are to configure our cluster to run smoothly in production, and whether there are other tools or technologies used alongside Elasticsearch to improve performance?
#ameur you can refer to these pages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
Regarding master nodes, you should have a minimum of 3 (go for 5 nodes if possible).
For data nodes, there are multiple factors involved, for example:
resources like RAM, CPU, and disk
throughput like queries per second, writes per second, etc.
So there is no straightforward answer; you will need to do some performance testing to get the right number.
Don't forget to read about sharding strategy: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
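As a rough illustration of that guide's advice, here is a back-of-the-envelope sketch (the 10 TB figure comes from the question above; the 10-50 GB per-shard target is only a commonly suggested range, not a hard rule):

```python
# Back-of-the-envelope sketch (not an official formula): estimate a primary
# shard count for ~10 TB of data, targeting roughly 10-50 GB per shard.
TOTAL_DATA_GB = 10 * 1024            # assumption: ~10 TB of indexed data
MIN_SHARD_GB, MAX_SHARD_GB = 10, 50  # commonly suggested shard size range

fewest = TOTAL_DATA_GB / MAX_SHARD_GB   # bigger shards -> fewer of them
most = TOTAL_DATA_GB / MIN_SHARD_GB     # smaller shards -> more of them
print(f"Roughly {fewest:.0f}-{most:.0f} primary shards, before replicas.")
```

The real number also depends on how the data is split across indices and time ranges, which is why performance testing is still needed.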

When to create or reuse an Elasticsearch cluster?

My team has been using a minimal Elasticsearch implementation for a year now, and we'd now like to additionally use ES for a new and totally different use case, using different and essentially unrelated data. While I have been reading about Clusters, Nodes, and ES in general, I do not intuitively understand whether or not we should create a new cluster for this, or add the data into our existing cluster. Where is a good place to look to better understand the factors involved in this decision? We're using ES hosted by Elastic Cloud, v5.2.x for the record.
If you have the resources available, it does not hurt to use the same cluster for multiple types of data/use cases.
If I were you I would just take a look at the Monitoring page to check out your usage statistics like storage, search rates, indexing rates, etc. to see if you have the resources available. If so, you don't really need to have separate clusters.
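If you prefer pulling those numbers directly instead of through the Monitoring UI, a minimal sketch against the cluster stats API might look like this (endpoint and credentials are placeholders; uses the Python requests library):

```python
# Sketch: read basic capacity numbers before adding a new use case to a cluster.
import requests

ES_URL = "https://my-cluster.example.com:9243"  # placeholder Elastic Cloud endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials

resp = requests.get(f"{ES_URL}/_cluster/stats", auth=AUTH, timeout=10)
resp.raise_for_status()
stats = resp.json()

print("Documents:", stats["indices"]["docs"]["count"])
print("Store size (GB):", round(stats["indices"]["store"]["size_in_bytes"] / 1024 ** 3, 1))
print("Data nodes:", stats["nodes"]["count"]["data"])
```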

ELK Stack and scaling

Bear with me here. I have spent the last week or so familiarising myself with the ELK Stack.
I have a working single box solution running the ELK stack, and I have the basics down on how to forward more than one type of log, and how to put them into different ES indexes.
This is all working pretty well, and I would like to expand operations.
My question is more how to scale the solution out to cover more data needs/requirements.
The current solution is handling a smaller subset of data, and working fine, but I would like to aggregate a lot more data. For example I am currently pushing message tracking logs from 4 mailbox servers, I want to do the same but for 40 mailbox servers, and much, much busier ones.
I would also like to push over IIS log files from the Client Access servers. There are 18 CAS servers, and around 30 minutes of IIS logs per server during peak time comes to roughly 120 MB, with almost 1 million records.
This volume of data would most likely collapse a single box running ELK.
I haven't really looked into it, but I read that ES allows for some form of clustering to add more instances. Does the same apply to Logstash as well? Should Kibana be run on more than one server, or on a different server from both Logstash and ES?
You will hit limits with Logstash if you're doing a lot of processing on the records: groks, conditionals, etc. Watch the CPU utilization of the machine for hints.
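If you want numbers rather than just top/htop, a small sketch like this (assuming Logstash 5.x or later with its monitoring API on the default port 9600) can sample the pipeline and process stats:

```python
# Sketch: poll the Logstash node stats API as a rough proxy for "watch the CPU".
import requests

LS_URL = "http://localhost:9600"  # assumption: default Logstash HTTP API address

stats = requests.get(f"{LS_URL}/_node/stats", timeout=10).json()
print("Process CPU %:", stats["process"]["cpu"]["percent"])
print("Events in / filtered / out:",
      stats["events"]["in"], stats["events"]["filtered"], stats["events"]["out"])
```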
For elasticsearch itself, it's about RAM and disk IO. Having more nodes in a cluster should provide both.
With two Elasticsearch nodes, you'll get redundancy (a copy on both machines). Add a third, and you can start to realize an I/O benefit (writing two copies to three machines spreads the I/O).
The ideal data node will have 64 GB of RAM on the machine, with 31 GB allocated to the Elasticsearch heap.
You'll probably want to add non-data nodes, which handle the routing of data to be indexed and the 'reduce' phase when running queries. Put two of them behind a load balancer.
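To make the "copy on both machines" point concrete, here is a minimal sketch (hypothetical index name, local unsecured cluster, Python requests) of creating an index with one replica per primary shard, so losing a node still leaves a full copy of the data:

```python
# Sketch: create an index whose shards each have one replica on another node.
import requests

ES_URL = "http://localhost:9200"  # assumption: local cluster
INDEX = "mailbox-logs"            # hypothetical index name

resp = requests.put(
    f"{ES_URL}/{INDEX}",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```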
As Alain mentioned, adding more ES nodes will improve performance (and give you redundancy).
On the Logstash front, we have two Logstash servers feeding into ES. At the moment we just direct different servers to log to the different Logstash servers, but we're likely to add an HAProxy layer in front to do this automatically, and again provide redundancy.
With Kibana, I wouldn't worry too much: as far as I'm aware most of the processing is done in the client browser, and what isn't is more dependent on the performance of the ES cluster.

Which is better Apache solr or Elastic search?

I started creating my new search application. In my earlier application I used Apache Solr. Now I want to know which is better in terms of performance and usability.
Personally, I want to know the performance benchmarks of Elasticsearch and Solr. If there are other alternatives, suggestions are most welcome.
Disclaimer: I work at elasticsearch.com
I would just say: give Elasticsearch a try. I think that after some hours (minutes?), you will already have formed an opinion.
Start 2 or 3 or 4 nodes, and you will see how things are rebalanced nicely.
About performance, I'd say that Elasticsearch will give you consistent query throughput even while you are doing massive indexing operations.
I have used both quite a bit, and much prefer ElasticSearch. The API is more flexible and accessible. It is easier to get started with. Replication happens automatically by default. In general all the defaults are easier to work with. Everything generally works out of the box (safe defaults) and you only need to tune what you find needs to work better.
I have not worked much with SOLR 4, only with 3.x. Once I switched I never looked back, but I hear that there are many improvements in 4 with regards to replication and clustering that make it a usable competitor.
With regards to performance, I think that generally they are comparable as they both rely on Lucene. That is why there is a lack of valid benchmarks that make this general comparison. That said, there are certainly use cases where one will perform better than the other.
If you look at the usage trends, while there are many more people currently using SOLR, it is in decline, and that decline correlates closely with the increase in Elasticsearch users, which is very much on the rise. As Dadoonet said, give ElasticSearch a try; it won't take long and you won't want to use SOLR again.
UPDATE
I just spent two weeks on a client site consulting on a SOLR Cloud installation. I am now much more familiar with the updates to SOLR, and can say quite confidently that I still prefer ElasticSearch, but it seems SOLR has some momentum again.
ElasticSearch is, hands down, more elastic. That is, having an elastic cluster where nodes come and go, or even where you just need to add nodes, is much, much easier in ElasticSearch than in SOLR. Anyone who tells you it is easy in SOLR has not done it in ElasticSearch. ElasticSearch will automatically join a cluster and assume an active role in it, taking over serving of available shards and replicas. Over the last week I decommissioned a 2-node cluster, replacing it with two new nodes. I simply added the 2 new nodes and, one at a time, marked the other two nodes as non-data nodes. Once the shard migration completed I decommissioned the nodes. I had set minimum_master_nodes = 2 ((2/2)+1), and had no issue with split brain.
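For anyone wanting to reproduce that kind of drain-then-decommission step, here is one possible sketch using shard allocation filtering (not necessarily the exact procedure the author used; cluster URL and node IP are placeholders):

```python
# Sketch: tell the cluster to move shards off a node, then watch relocation.
import requests

ES_URL = "http://localhost:9200"   # assumption: cluster endpoint
NODE_IP = "10.0.0.12"              # hypothetical IP of the node being retired

resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": NODE_IP}},
    timeout=10,
)
resp.raise_for_status()

health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
print("Relocating shards:", health["relocating_shards"])  # wait for 0, then shut down
```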
During the same week, I had to add a node to a SOLR cluster. The process was poorly documented, especially considering the changes from 4.1 to 4.3 and the mishmash of existing documentation, much of which says you can't even do it, based on old versions of SOLR. I finally found documentation which clarified the process. It requires manually adding a core to the collection and then adding replicas to existing shards within the cluster. Finally you manually decommission the redundant shards on some other node. At some point this node may become master for one of those shards, but not immediately.
With SOLR, if you do not have sufficient shards to distribute, you can just add replicas, or you can go through a shard split to create two new shards. Again this is a poorly documented feature, but it is functionality that does not exist in ElasticSearch. You must split and then remove the original shard, something none of the documentation clearly explains.
SolrCloud has a couple other advantages as well if integrating with Hadoop. If you are indexing data in HDFS or HBase, there are now both Map-Reduce, and real time methods of ingesting data into SOLR. This provides some real power to your Big Data platform and allows you to do full text search over data that is otherwise barely accessible.
While you can index Hadoop data into ElasticSearch, the implementation is not as clean as the SolrCloud/Cloudera Search implementations. Having MapReduce directly build the shards is a far superior solution with significant performance benefits. Reducers talking directly to a cluster works, but it is not the same. I do not know if anything similar to the Lily connector for HBase exists for ElasticSearch; if not, I may look into writing one. This allows indexing directly from the HBase replication logs.
So in summary, there are certainly situations where either is beneficial. If you are looking for tight integration with Hadoop, SOLR (Cloudera Search specifically) is a good option. If you are looking for ease in managing an elastic cluster, Elasticsearch will be a much better option. For me, I'll continue with my hacky Hadoop integrations to make it work with Elasticsearch, until something better emerges.

getting close to real-time with hadoop

I need some good references for using Hadoop for real-time systems, like searching with low response times. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?
You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).
Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.
Take a look at HBase, which provides a structured queriable datastore on top of HDFS, similar to Google's BigTable. http://hadoop.apache.org/hbase/
It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware to do just that: http://github.com/twitter/gizzard
Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
If you really, really need to do serious processing at query time, the way to do that is to run dedicated processes that do the specific kinds of computation you need, and use something like Thrift to send requests for computation and receive results back. Optimize them to have all the needed data in memory. The process that receives the query itself then does nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but isn't, because it is built for computing specific problems with pre-loaded data rather than being a generic computation model for arbitrary computing.
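Purely for illustration, here is a tiny scatter-gather sketch of that front-end process in Python, using plain HTTP instead of Thrift; the worker URLs and the /search endpoint are made up:

```python
# Sketch: fan a query out to in-memory compute nodes and merge their results.
import concurrent.futures
import requests

WORKERS = ["http://worker1:8080", "http://worker2:8080"]  # hypothetical compute nodes

def query_worker(url, query):
    # Each worker holds its partition of the data in memory and answers one piece.
    resp = requests.get(f"{url}/search", params={"q": query}, timeout=2)
    resp.raise_for_status()
    return resp.json()["hits"]

def scatter_gather(query):
    # The front-end process does little more than fan out and merge results.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        partials = pool.map(lambda url: query_worker(url, query), WORKERS)
    hits = [hit for part in partials for hit in part]
    return sorted(hits, key=lambda h: h.get("score", 0), reverse=True)
```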
Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.
FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, etc., etc.
This question is old, but it deserves an answer. Even if there are millions of documents, if they are not changing in real time (like FAQ docs), Lucene + SOLR for distribution should pretty much cover the need. HathiTrust indexes billions of documents using the same combination.
It is a completely different problem if the index is changing in real time. Even Lucene will have problems dealing with updating its index, and you have to look at real-time search engines. There have been some attempts at reworking Lucene for real time, and maybe they will work. You can also look at HSearch, a real-time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net
