Creating Elasticsearch cluster from three servers - elasticsearch

We have three physical servers. Each server has 2 CPUs (32 cores), 96 TB HDD, and 768 GB RAM. We would like to use these servers in an Elasticsearch cluster.
Each server will be located in a different data center, connecting each server using a private connection.
How can be optimize our configuration for high performance? Also, how should we best run Elasticsearch on these machines. For example, should we use virtualization to create multiple nodes per machine, or not?

As you have huge RAM(768) available on each physical server and according to ES documentation on heap setting it shouldn't cross 32 GB, so you will have to use virtualization to create multiple nodes per physical server for better ultization of your infra.
Apart from these there are various cluster settings and node settings which you can optimize but as you have not provided them, its difficult to provide recommendation on them.
Another thing to note is that you have huge RAM and disk but CPU is not in proportion to it, so if you can increase them as well, it would be good.

Related

What are the resource requirements to run Logstash in a k8s pod?

I was noticing that running a ELK stack on a Raspberry Pi running a Kubernetes Cluster. I noticed that it didnt have the resources to run all three containers. I was looking up that with Kubernetes you can put limits and requests on your resources CPU and Memory, and it got me thinking. What are the minimum requirements? To me, applications are greedy, so is there a way to cut down the requirements for Logstash, to emphasize resources for Elasticsearch?
Right now, I am running a Raspberry Pi 4, 4g RAM, 32G disk.
If I can put min and max requirements on the container it will better allow me manage the resources. The think though that I noticed is that there was no insight from what I can tell as to minimum requirements for the different containers.
https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html
The above link i believe tells me that the CPU consumption is greedy, but the default MEMORY for Elastic and Kibana 2Gi and 1Gi respectively. It mentioned nothing about logstash though, and whether or not there is a Minimum requirement for CPUs.
I wasnt sure if I should set each ELK container to 1CPU, 1Gi RAM, and I can try it to see if it functions, but since the concept of it throttling down makes me curious what the happy medium would be.
Logstash is not part of the Elastic Cloud, that is why there is no mention of it in the Elastic Cloud on Kubernetes documentation link that you shared.
Logstash is way more CPU bound than memory bound, but how much memory does it needs is completely dependent on your pipelines.
In Logstash the memory depends on the pipelines, the batch size, the filters used, the number of events per seconds, the queue type etc. If you are running a dev or lab environment I think that you can try to give Logstash 1 CPU and 512 MB of RAM and see if it feets your use case.
But I would say that 4GB is pretty small for a full stack since you need to have memory for the applications and still have some memory left for the sytems.

Does Mesos really treat all your resources as a single pool?

Mesos is advertised as a system that lets you program against your datacenter like it's a single pool of resources (See the Mesos Website). But is this really true that you don't need to consider the configuration of the individual machines? Using Mesos, can you request more resources for a task than are available on a single machine?
For example, if you have 10 machines each with 2 cores and 2g of RAM and 20g HD, can you really request 10 cores, 15g of RAM and 100g of disk space for a single task?
If so, how does this work? Is Mesos able to address memory across machines for you, and use other CPUs as local threads and create a single filesystem from a number of distributed nodes?
How does it accomplish this without suffering from the Fallacies of distributed computing, especially those related to network latency and transport cost?
According to this Mesos architecture you can't aggregate resources from different slaves (agents / machines) to use them for one task.
As you can see there is strict "taks per agent" situation
Also their example says pretty much same
Let’s walk through the events in the figure.
Agent 1 reports to the master that it has 4 CPUs and 4 GB of memory
free. The master then invokes the allocation policy module, which
tells it that framework 1 should be offered all available resources.
The master sends a resource offer describing what is available on
agent 1 to framework 1. The framework’s scheduler replies to the
master with information about two tasks to run on the agent, using <2
CPUs, 1 GB RAM> for the first task, and <1 CPUs, 2 GB RAM> for the
second task. Finally, the master sends the tasks to the agent, which
allocates appropriate resources to the framework’s executor, which in
turn launches the two tasks (depicted with dotted-line borders in the
figure). Because 1 CPU and 1 GB of RAM are still unallocated, the
allocation module may now offer them to framework 2.

VMWare ESXi, RHEL, LUKS and network latency

My company is running into a network performance problem that seemingly has all of the "experts" we're working with (VMWare support, RHEL support, our managed services hosting provider) stumped.
The issue is that network latency between our VMs (even VMs residing on the same physical host) increases--up to 100x or more!--with network throughput. For example, without any network load, latency (measured by ping) might be ~0.1ms. Start transferring a couple 100MB files, and latency grows to 1ms. Initiate a bunch (~20 or so) concurrent data transfers between two VMs, and the latency between the VMs can increase to upwards of 10ms.
This is a huge problem for us because we have application server VMs hosting processes that might issue 1 million or so queries against a database server (different VM) per hour. Adding a millisecond or two to each query therefore increases our runtime substantially--sometimes doubling or tripling our expected durations.
We've got what I would think is a pretty standard environment:
ESXi 6.0u2
4 Dell M620 blades with 2x Xeon E5-2650v2 processors and 128GB RAM
SolidFire SAN
And our base VM configuration consists of:
RHEL7, minimal install
Multiple LUNs configured for mount points at /boot, /, /var/log, /var/log/audit, /home, /tmp and swap
All partitions except /boot encrypted with LUKS (over LVM)
Our database server VMs are running Postgres 9.4.
We've already tried the following:
Change the virtual NIC from VMNETx3 to e1000 and back
Adjust RHEL ethernet stack settings
Using ESXi's "low latency" option for the VMs
Upgrading our hosts and vCenter from ESX 5.5 to 6.0u2
Creating bare-bones VMs (setup as above with LUKS, etc., but without any of our production services on them) for testing
Moving the datastore from the SSD SolidFire SAN to local (on-blade) spinning storage
None of these improved network latency. The only test that showed expected (non-deteriorating) latency is when we set up a second pair of bare-bones VMs without LUKS encryption. Unfortunately, we need fully encrypted partitions (for which we manage the keys) because we are dealing with regulated, sensitive data.
I don't see how LUKS--in and of itself--can be to blame here. Rather, I suspect that LUKS running with some combination of ESX, our hosting hardware, and/or our VM hardware configuration is to blame.
I performed a test in a much wimpier environment (MacBook Pro, i5, 8GB RAM, VMWare Fusion 6.0, Centos7 VMs configured similarly with LUKS on LVM and the same testing scripts) and was unable to reproduce the latency issue. Regardless of how much network traffic I sent between the VMs, latency remained steady at about 0.4ms. And this was on a laptop with a ton of the things going on!
Any pointers/tips/solutions will be greatly appreciated!
After much scrutiny and comparing the non-performing VMs against the performant VMs, we identified the issue as a bad selection for the advanced "Latency Sensitivity" setting.
For our poorly performing VMs, this was set to "Low". After changing the setting to "Normal" and restarting the VMs, latency dropped by ~100x and throughput (which we hadn't originally noticed was also a problem) increased by ~250x!

Running Hadoop in virtual environment

I would like to know whether I should expect problems when having Hadoop cluster on virtual instead of physical machines?
I'm mostly worried about using the same hard drive, I read that I should count for 1-2 containers per drive,but in my case only one drive will exist. Could that be a problem?
I think it depends upon how much size are you allocating for containers. Of course there would be limitation to number of containers if you have restriction to the memory.
I can highlight few points which can be considered while running hadoop cluster in virtual environment:
Network configuration in case of multi node cluster
Obvious the performance of application
Affect on scalability as limited resources if you are planning to run the cluster on host which has low configuration hardware

The best memory configuration for ElasticSearch

I have one linux server with 128G memory and 32 cpu cores. I would run an ElasticSearch instance on this server, the server is exclusively only for running ES. So how many memory I should configure for ES. How could I get the best performance of ES please. Is the server too luxurious for ES? Thanks!
I suggest you run two ES instances in each server. Since your linux server pretty powerful, if you set the ES memory as 60g or 80g it may encounter GC problem. Try to run two or three ES instances in one server and monitor the CPU and Memory usage, btw, change the http port of ES for running multiple nodes in one server.

Resources