Running Hadoop in a virtual environment

I would like to know whether I should expect problems when running a Hadoop cluster on virtual machines instead of physical ones.
I'm mostly worried about sharing the same hard drive. I read that I should plan for 1-2 containers per drive, but in my case only one drive will exist. Could that be a problem?

I think it depends on how much memory you allocate to each container. There would of course be a limit on the number of containers if memory is restricted; for example, if a VM gives YARN 8 GB and each container is allocated 2 GB, you can run at most 4 containers on that node.
A few points worth considering when running a Hadoop cluster in a virtual environment:
Network configuration, in the case of a multi-node cluster
The obvious performance impact on your applications
The effect on scalability, since resources are limited if you plan to run the cluster on a host with low-end hardware

Related

k8s tasks slowdown with no excess CPU or RAM usage

I have a small virtualised k8s cluster running on top of KVM on 2 physical machines. After deploying Ceph (a storage framework), all the k8s tasks like creating containers or starting containers became insufferably slow, like taking over a minute to get from creating to starting a container.
I checked the nodes for excess CPU or RAM usage; both the worker nodes and the master node are well below half of their assigned resources. I have about 10-20 pods running on each node at the moment.
I am not sure what to google, and given my level of k8s knowledge I am completely out of ideas. Anyone with similar experience who could point me in the right direction would be much appreciated!

Creating Elasticsearch cluster from three servers

We have three physical servers. Each server has 2 CPUs (32 cores), 96 TB HDD, and 768 GB RAM. We would like to use these servers in an Elasticsearch cluster.
Each server will be located in a different data center, with the servers connected over a private connection.
How can we optimize our configuration for high performance? Also, how should we best run Elasticsearch on these machines? For example, should we use virtualization to create multiple nodes per machine, or not?
Since you have a huge amount of RAM (768 GB) available on each physical server, and according to the ES documentation on heap settings the JVM heap shouldn't exceed 32 GB, you will have to use virtualization to create multiple nodes per physical server for better utilization of your infrastructure.
Apart from this, there are various cluster settings and node settings you could optimize, but as you have not provided them, it's difficult to make recommendations about them.
Another thing to note is that you have huge RAM and disk, but the CPU is not in proportion to them, so if you can increase that as well, it would be good.
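As a rough sketch of what multiple nodes per server could look like, here is a minimal Docker Compose example running two Elasticsearch nodes on one host; the image version, cluster name, and data paths are assumptions, and the heap is pinned just under the 32 GB threshold:

# docker-compose.yml - hypothetical two-nodes-on-one-host sketch
version: "3"
services:
  es-node-1:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0   # assumed version
    environment:
      - node.name=es-node-1
      - cluster.name=es-cluster                                   # placeholder name
      - discovery.seed_hosts=es-node-2
      - cluster.initial_master_nodes=es-node-1,es-node-2
      - "ES_JAVA_OPTS=-Xms31g -Xmx31g"   # keep heap below ~32 GB for compressed oops
    volumes:
      - /data/es1:/usr/share/elasticsearch/data                   # assumed host path
  es-node-2:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - node.name=es-node-2
      - cluster.name=es-cluster
      - discovery.seed_hosts=es-node-1
      - cluster.initial_master_nodes=es-node-1,es-node-2
      - "ES_JAVA_OPTS=-Xms31g -Xmx31g"
    volumes:
      - /data/es2:/usr/share/elasticsearch/data

The RAM not given to the heaps is left to the OS filesystem cache, which Elasticsearch relies on heavily.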

What are the resource requirements to run Logstash in a k8s pod?

I was trying to run an ELK stack on a Raspberry Pi running a Kubernetes cluster, and I noticed that it didn't have the resources to run all three containers. I read that with Kubernetes you can put limits and requests on your CPU and memory resources, and it got me thinking: what are the minimum requirements? To me, applications are greedy, so is there a way to cut down the requirements for Logstash, to leave more resources for Elasticsearch?
Right now, I am running a Raspberry Pi 4, 4g RAM, 32G disk.
If I can put min and max requirements on the containers, it will better allow me to manage the resources. The thing I noticed, though, is that from what I can tell there is no insight into the minimum requirements for the different containers.
https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html
The above link, I believe, tells me that the CPU consumption is greedy, but that the default memory for Elasticsearch and Kibana is 2Gi and 1Gi respectively. It mentions nothing about Logstash, though, or whether there is a minimum requirement for CPU.
I wasn't sure if I should set each ELK container to 1 CPU and 1Gi RAM. I can try that and see if it functions, but the prospect of it throttling makes me curious what the happy medium would be.
Logstash is not part of the Elastic Cloud, that is why there is no mention of it in the Elastic Cloud on Kubernetes documentation link that you shared.
Logstash is way more CPU bound than memory bound, but how much memory it needs is completely dependent on your pipelines.
In Logstash the memory depends on the pipelines, the batch size, the filters used, the number of events per second, the queue type, etc. If you are running a dev or lab environment, I think you can try to give Logstash 1 CPU and 512 MB of RAM and see if it fits your use case.
But I would say that 4 GB is pretty small for the full stack, since you need to have memory for the applications and still have some memory left for the system.
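As a starting point, the suggestion above could be expressed as Kubernetes resource settings roughly like this (the pod name, image version, and heap size are placeholder assumptions, not tested values):

# Hypothetical resource settings for a small lab Logstash pod
apiVersion: v1
kind: Pod
metadata:
  name: logstash                                        # placeholder name
spec:
  containers:
    - name: logstash
      image: docker.elastic.co/logstash/logstash:7.17.0 # assumed version
      env:
        - name: LS_JAVA_OPTS
          value: "-Xms256m -Xmx256m"  # keep the JVM heap below the container limit
      resources:
        requests:
          cpu: "500m"       # half a core guaranteed
          memory: "512Mi"   # the 512 MB suggested above
        limits:
          cpu: "1"          # cap at one core
          memory: "1Gi"     # headroom above the heap for off-heap usage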

What can be done when one node is very slow in a Hadoop cluster?

I have a Hadoop cluster of 5 nodes, and I have two concerns:
1) What can be done when one of the nodes is running or processing data very slowly (not stopped) compared to the other nodes?
2) I've set up log4j to capture logs, but how can I keep the logs of all nodes on the NameNode or on one main server?
Please suggest. Thanks!
To question one, it's not clear which service is slow... DataNode? NameNode? Maybe you need to increase the heap sizes of these processes, or the dataset you've stored is heavily skewed onto that server.
You would need to install monitoring software to capture IO, CPU, network, etc. metrics to really diagnose any hardware bottlenecks. From there, make sure that server is running the latest OS patches, has the latest drivers, and has a similar hardware profile to the other machines you're comparing against. Maybe the hard drive or NIC is failing, but without hardware diagnostic software it would be hard to know.
For question 2, you'd again need additional software, such as Elasticsearch, to centrally collect and index your logs from many systems.
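As one illustration of that approach, a log shipper such as Filebeat could run on every node and forward the log4j output to a central Elasticsearch instance; the log path and host below are assumptions to adapt to your setup:

# filebeat.yml - hypothetical per-node config shipping Hadoop logs to one server
filebeat.inputs:
  - type: log
    paths:
      - /var/log/hadoop/*.log          # assumed log4j output directory
output.elasticsearch:
  hosts: ["central-log-server:9200"]   # assumed central Elasticsearch host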

Mesos master configuration on DC/OS

I am in the process of creating a DC/OS cluster on AWS for running Kafka->Spark->Cassandra workloads.
I am interested in the minimum specification for a master node in a DC/OS environment. I see that DC/OS suggests m3.xlarge instances, but I do not know why I need 4 processors and 15 GB of RAM, when the master is only running the processes described at: https://docs.mesosphere.com/overview/architecture/
-> There is no actual data processing performed by the master.
I would maybe go with m3.large or r3.large instances.
Kindest regards,
Srdjan
DC/OS masters are not used to run any high-computation loads, but their memory usage tends to be pretty heavy, so a large instance is recommended.
There might be a way to use smaller instances and compensate for the missing memory using swap files, but straying from the supplier's recommendations should only be done after careful consideration of the potential consequences.
