Concerning health of Cloudera Hadoop hosts (network interface speed)

I am new to this area and have some problems with my Hadoop cluster.
I have fixed many health issues, but the health of my hosts is still "concerning" (still yellow, not "green", unfortunately). Could this be because my hosts are connected through an old switch with a speed of 100 Mbps? The network cards of almost all the servers support 1000 Mbps.
In the recommendation for resolving this error, it was advised to check the duplex settings. The hosts would have "concerning health" if they were all working in half-duplex mode, but I've checked, and they all run in full-duplex mode.
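If it helps, the negotiated speed and duplex can be checked on each Linux host without ethtool by reading sysfs. A minimal sketch (the interface name eth0 is an assumption, adjust to your hosts; the classification thresholds just mirror the reasoning above):

```python
from pathlib import Path

def link_health(speed_mbps: int, duplex: str) -> str:
    """Classify a NIC link the way the health check reasons about it:
    full duplex at gigabit speed is fine; 100 Mbps or half duplex
    is worth a warning."""
    if duplex != "full":
        return "concerning: half duplex"
    if speed_mbps < 1000:
        return "concerning: slow link (%d Mbps)" % speed_mbps
    return "good"

def check_interface(name: str = "eth0") -> str:
    """Read the negotiated speed/duplex from Linux sysfs (same data
    ethtool reports) and classify the link."""
    base = Path("/sys/class/net") / name
    speed = int((base / "speed").read_text().strip())
    duplex = (base / "duplex").read_text().strip()
    return link_health(speed, duplex)
```

Running `check_interface()` on each host would flag a 100 Mbps link even when it is full duplex, which matches your situation: the duplex check passes, but the switch caps the speed.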
[Screenshot of the Network Interface Speed warning]
The cluster installation options I chose (in case it's relevant):
1) Use Packages;
2) CDH5;
3) CDH 5.13.0.
P.S. How much does the "concerning health" of hosts affect their operation? Can I run complex tasks on them? So far I have only run the WordCount and Pi estimation examples, and they were successful.

+1 to #cricket_007's comments. If you consider the Reference Architecture for Deploying CDH 5.x on Red Hat OSP 11 as a typical deployment, you'll see that the ideal network configuration is 10 Gigabit Ethernet (10GbE).
Most of Cloudera Manager's warnings are meant for production-level clusters.

Cannot obtain cAdvisor container metrics on Windows Kubernetes nodes

I have configured a mixed-node Kubernetes cluster. Two worker nodes are Ubuntu Server 18.04.4 and two worker nodes are Windows Server 2019 Standard. I have deployed several Docker containers as deployments/pods to each set of worker nodes (.NET Core apps on Ubuntu and legacy WCF apps on Windows). Everything seems to work as advertised.
I am now at the point where I want to monitor the resources of the pods/containers. I have deployed Prometheus, kube-state-metrics, and metrics-server, and I have Prometheus scraping the nodes. For container metrics, the kubelet/cAdvisor returns everything I need from the Ubuntu nodes, such as container_cpu_usage_seconds_total, container_cpu_cfs_throttled_seconds_total, etc. But the kubelet/cAdvisor on the Windows nodes only gives me some basic information:
http://localhost:8001/api/v1/nodes/[WINDOWS_NODE]/proxy/metrics/cadvisor
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="",kernelVersion="10.0.17763.1012",osVersion="Windows Server 2019 Standard"} 1
# HELP container_scrape_error 1 if there was an error while getting container metrics, 0 otherwise
# TYPE container_scrape_error gauge
container_scrape_error 0
# HELP machine_cpu_cores Number of CPU cores on the machine.
# TYPE machine_cpu_cores gauge
machine_cpu_cores 2
# HELP machine_memory_bytes Amount of memory installed on the machine.
# TYPE machine_memory_bytes gauge
machine_memory_bytes 1.7179398144e+10
So while the cAdvisor on the Ubuntu nodes gives me everything I ever wanted about containers and more, the cAdvisor on the Windows nodes only gives me the above.
I have examined the PowerShell scripts that install/configure the kubelet on the Windows nodes, but I don't see how I might configure a switch or config file, if there is a magical setting I am missing that would enable container metrics to be published when the kubelet/cAdvisor is scraped. Any suggestions?
There is a metrics/resource/v1alpha1 endpoint, but it provides only 4 basic metrics.
Documentation
I think cAdvisor doesn't properly support Windows nodes; what you see is just an emulated interface with limited metrics.
Github issue
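One quick way to confirm how limited the Windows kubelet is, is to compare the metric names exposed by the two cAdvisor endpoints. A small sketch that extracts the names from the Prometheus text format shown above (the scrape itself, e.g. via kubectl proxy, is left out):

```python
def metric_names(exposition_text: str) -> set:
    """Collect distinct metric names from Prometheus text exposition:
    skip comment/blank lines; the name is everything before '{' or the
    first space."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split("{")[0].split(" ")[0])
    return names

# Sample taken from the Windows node output above.
windows_sample = """\
cadvisor_version_info{kernelVersion="10.0.17763.1012"} 1
container_scrape_error 0
machine_cpu_cores 2
machine_memory_bytes 1.7179398144e+10
"""

# Only machine-level series appear; none of the container_* usage
# series that the Ubuntu nodes report.
print(sorted(metric_names(windows_sample)))
```

Running the same function over the Ubuntu nodes' output and diffing the two sets makes the gap explicit, and gives you a concrete list to cite in the GitHub issue.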

Ambari scaling memory for all services

Initially I had two machines to setup hadoop, spark, hbase, kafka, zookeeper, MR2. Each of those machines had 16GB of RAM. I used Apache Ambari to setup the two machines with the above mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended memory settings for each service at install time, so a post-install change in memory will not scale anything up automatically. You would have to edit these settings manually for each service, and to do that, yes, you would need an understanding of how memory should be configured for each service. I don't know of any Ambari documentation that recommends memory configuration values for each service. I would suggest one of the following routes:
1) Look at each service's documentation (YARN, Oozie, Spark, etc.) and see what it recommends for memory-related parameter configurations.
2) Look at the Ambari code that calculates recommended values for these memory parameters, and use those equations to come up with new values that account for your increased memory.
I used this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, SmartSense is a must: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to define the cores, memory, disks, and whether we use HBase or not; the script then provides the memory settings for YARN and MapReduce.
[root@ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, there are formulas there to calculate it manually. I tried these settings and they worked for me.
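For reference, the core of that calculation can be sketched in a few lines of Python. This is a rough re-implementation of the yarn-utils.py logic from the HDP docs, not the official script, and it only handles RAM sizes that appear in the HDP reservation tables:

```python
import math

# (total RAM in GB) -> GB reserved for the OS stack / for HBase,
# per the HDP memory configuration tables.
RESERVED_STACK = {4: 1, 8: 2, 16: 2, 24: 4, 48: 6, 64: 8, 72: 8,
                  96: 12, 128: 24, 256: 32, 512: 64}
RESERVED_HBASE = {4: 1, 8: 1, 16: 2, 24: 4, 48: 8, 64: 8, 72: 8,
                  96: 16, 128: 24, 256: 32, 512: 64}

def yarn_sizing(cores, memory_gb, disks, hbase=True):
    """Reserve memory for the OS (and HBase if present), then split
    the remainder into equally sized YARN containers."""
    reserved = RESERVED_STACK[memory_gb] + (RESERVED_HBASE[memory_gb] if hbase else 0)
    usable_mb = (memory_gb - reserved) * 1024
    min_container = 2048 if memory_gb > 24 else 1024  # simplified threshold
    containers = int(min(2 * cores, math.ceil(1.8 * disks),
                         usable_mb / min_container))
    ram = max(min_container, usable_mb // containers)
    if ram > 4096:                    # round large containers down to 512 MB
        ram = (ram // 512) * 512
    return {
        "containers": containers,
        "yarn.scheduler.minimum-allocation-mb": ram,
        "yarn.nodemanager.resource.memory-mb": containers * ram,
        "mapreduce.map.java.opts": "-Xmx%dm" % int(0.8 * ram),
    }

# Same profile as the yarn-utils.py run above: 8 cores, 128 GB, 3 disks, HBase.
print(yarn_sizing(8, 128, 3, hbase=True))
```

With the profile from the run above (8 cores, 128 GB, 3 disks, HBase enabled), this reproduces the 6 containers of 13312 MB and the -Xmx10649m heap that yarn-utils.py printed.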

hadoop port configuration within a firewall

I have a customer for whom we manage a Hadoop installation. In the current setup, all the nodes in the cluster have all ports open to each other, but the customer is quite reluctant to keep all the ports open. Can anyone let me know whether a configuration is possible where we instruct Hadoop to use only a restricted set of ports?
My findings: I have been able to configure a test setup where I opened only the required ports as listed in the documentation
https://hadoop.apache.org/docs/r2.6.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
However, I still see that MR jobs are not executed in a distributed manner.
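One likely cause: hdfs-default.xml only covers the HDFS ports, but distributed MR execution also needs the YARN daemon and shuffle ports pinned and opened. A sketch of the extra properties (port numbers are the Apache/CDH defaults, and rm-host is a placeholder for your ResourceManager hostname; note in particular that yarn.nodemanager.address defaults to an ephemeral port, which a firewall cannot allow):

```xml
<!-- yarn-site.xml: pin the YARN daemon ports so the firewall can allow them -->
<property><name>yarn.resourcemanager.address</name><value>rm-host:8032</value></property>
<property><name>yarn.resourcemanager.scheduler.address</name><value>rm-host:8030</value></property>
<property><name>yarn.resourcemanager.resource-tracker.address</name><value>rm-host:8031</value></property>
<property><name>yarn.nodemanager.address</name><value>0.0.0.0:8041</value></property>
<property><name>yarn.nodemanager.localizer.address</name><value>0.0.0.0:8040</value></property>

<!-- mapred-site.xml: the shuffle port used between NodeManagers -->
<property><name>mapreduce.shuffle.port</name><value>13562</value></property>
```

With these pinned, the firewall rules only need the fixed HDFS ports plus this fixed YARN/shuffle set between cluster nodes.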

Multi-node hadoop cluster installation

Sorry if my question appears naïve. We are planning to use CDH 5.3.0 or 5.4.0, and we want to implement a multi-node cluster.
The example multi-node installations that I have seen/read on different blogs/resources have masters and slaves on different hosts.
However, we are constrained by the number of hosts. We have only 2 powerful hosts (32 cores, 400+ GB RAM), so if we decide to have the master on one and a slave on the other, we will end up with only one slave. My questions are:
Is it possible to have master and slave on the same host?
Can I have more than one slave node on a single host?
Also, does one need to pay to use Cloudera Manager, or is it open-source like the rest of the components?
If you can point me to some resource that would help me understand the above scenarios, it would be helpful.
Thanks for your help.
Regards,
V
Old question, but it has no correct answer yet:
Yes, it is possible to install master and worker services on a single host,
e.g. HDFS (NameNode and DataNode). You can even install a full Cloudera or Hortonworks installation with ALL services on a single host if it is powerful enough, but I would only recommend that for a POC or test cases.
If you use Cloudera or Hortonworks without virtualization, it is not possible to run multiple instances of the SAME worker service (e.g. DataNode) on the same host: 1 host, 1 worker instance. Anything else would not make sense.
Cloudera is a package of multiple open-source projects (Hadoop, Spark, ...) and other closed-source parts like Cloudera Manager and other enterprise features. But everything you need is free, even for commercial use, with the community license.
Right now (2017), Cloudera Navigator is the only big feature that is not part of the community edition.
Yes, you can configure both a NameNode and a DataNode on a single node.
You cannot have more than one DataNode on a single machine.
Cloudera's CDH is an open-source Hadoop distribution.

vSphere Cluster creation requirements

I've been searching around but haven't found a clear answer on this.
We're using VMware ESXi with vSphere to manage a handful of VMs (about 15 right now).
However, these are all spread over three separate machines. I'm looking for a way to cluster these together so their resources can be pooled or dynamically allocated. I found vSphere DRS Cluster information, but I'm having a really hard time finding out what I need to get that set up.
Does it require a separate vCenter license to hook into vSphere? And at that point, how do I create a datacenter to group all the server hosts together? Every tutorial I find already has 2+ host machines grouped together in the vSphere client, and I'm not sure how to go about achieving that.
If you just want to create a failover cluster, then you need VMware HA; VMware DRS is the option for dynamic resource allocation. To manage these two options, you need a vCenter Server. vCenter Server Foundation can manage up to 3 hosts (which covers your case). For more information about vCenter, see this link.
For VMware HA and DRS to work, you must have shared storage (NFS, iSCSI, or Fibre Channel). To learn how to create a VMware HA cluster using the vSphere Client (connected to vCenter Server), see this link.
VMware DRS is an option you enable after you have created the VMware HA cluster. See this link.