Default Metrics in Ganglia - metrics

Can I have the list of default metrics in Ganglia?
Example: CPU and memory are two of the default metrics collected by Ganglia. Is there a complete list of all the default metrics supported by Ganglia?
I could not find any documentation on this.

Ganglia provides the following groups of default system metrics, stored in RRD files in per-cluster directories:
CPU statistics - Reports processor-related statistics. Example: cpu user, cpu nice, cpu system, cpu io wait, etc.
Disk statistics - Reports mount point information. Example: total disk size per mount point, available disk size, etc.
Memory statistics - Reports a large amount of valuable information about the Linux system's memory (see the Linux /proc/meminfo documentation).
Network statistics - Reports network information; the data is available in the /proc/net/ directory (see the Linux /proc/net documentation).
NFS statistics - Reports NFS server and client statistics.
Network interface traffic - Reports interface statistics from /proc/net/dev.
All of the above statistics expose runtime system information that lives in the /proc virtual file system (see the Linux proc documentation).
NOTE: With the Python extension, the default Python metric modules are located in the /usr/lib64/ganglia/python_modules directory.
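To make the /proc connection concrete, here is a minimal Python sketch of how values like the memory and interface metrics above can be read from /proc-style text. It is not Ganglia's own code; the function names parse_meminfo and parse_net_dev are hypothetical, and the field positions assume the usual Linux layout of /proc/meminfo and /proc/net/dev.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {iface: (rx_bytes, tx_bytes)}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        iface, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Assumed layout: 8 receive counters followed by 8 transmit counters,
        # so index 0 is received bytes and index 8 is transmitted bytes.
        stats[iface.strip()] = (int(fields[0]), int(fields[8]))
    return stats
```

On a Linux host, passing the contents of /proc/meminfo to parse_meminfo and looking up "MemTotal" or "MemFree" gives the kind of raw values that sit behind the memory metrics.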

Related

list out hadoop yarn jobs that are using highest resources

I want to know how to list the jobs that are using the most memory and CPU. Is there a command that lists the jobs with the highest memory usage? For example, I want to know how many vcores and how much memory a particular job is using.
You should be able to gather this data partially from the YARN UI, but I think you'd be better off installing Prometheus node/process exporters, or similar agents, directly on each machine, so that they can gather the Linux process usage information themselves.
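For the part that is available from YARN itself, a minimal Python sketch follows. It assumes the ResourceManager REST API is reachable and that each running application reports allocatedMB and allocatedVCores fields; the helper names fetch_running_apps and top_apps, and the URL in the usage note, are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_running_apps(rm_url):
    """Fetch RUNNING applications from the ResourceManager REST API.

    rm_url is a placeholder for your ResourceManager base URL.
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING") as resp:
        payload = json.load(resp)
    # With no matching applications, "apps" may be null.
    return (payload.get("apps") or {}).get("app") or []


def top_apps(apps, key="allocatedMB", n=5):
    """Return the n applications with the largest value for the given field."""
    return sorted(apps, key=lambda a: a.get(key, 0), reverse=True)[:n]
```

For example, top_apps(fetch_running_apps("http://HOSTNAME:8088"), key="allocatedVCores") would rank running applications by allocated vcores, assuming the ResourceManager listens on that address.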

tracking memory usage or consumption for apache spark 2.0.2

I'm a beginner in Apache Spark and have installed a prebuilt distribution of Apache Spark with Hadoop. I want to measure memory consumption while running the PageRank example that ships with Spark. My cluster runs in standalone mode with 1 master and 4 workers (virtual machines).
I have tried external tools like Ganglia and Graphite, but they report memory usage at the resource or system level (more general). What I need is to track the behavior of memory (storage, execution) while the algorithm runs, that is, the memory usage for a given Spark application ID. Is there any way to write it to a text file for further analysis? Please help me with this. Thanks.

Why is the Hadoop job slower in cloud (with multi-node clustering) than on normal pc?

I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform is a bit slower than running the same job on a lower-capacity virtual machine. I am running my Hadoop job on a 3-node cluster (each node with 7.5 GB RAM and a 50 GB disk) in the cloud, which took 4 min 49 sec, while the same job took 3 min 20 sec on a single-node virtual machine (my PC) with 3 GB RAM and a 27 GB disk. Why is the result slower in the cloud with multi-node clustering than on a normal PC?
First of all:
It is not easy to answer without knowing the complete configuration and the type of job you're running.
Possible reasons are:
1. Misconfiguration
Open the ResourceManager web app at http://HOSTNAME:8080 and compare the available vcores and memory.
2. Job type
The job adds more overhead when it runs in parallel, so it is slower.
3. Hardware
The selected virtual hardware is slower than the local hardware, for example through low disk I/O and network overhead.
I would say it is something like 1. and 2.
For a more detailed answer, let me know:
the size and type of the job and how you run it
the Hadoop configuration
the cloud architecture
br
To be a bit more detailed, here are the numbers and facts that would help find the reason for the "slower" cloud environment:
Job type and size:
size of the data (1 MB or 1 TB?)
data format (XML, Parquet, ...)
what kind of processing (e.g. word count, format conversion, ML, ...)
and of course the options (executors and drivers) for your spark-submit or spark-shell
Hadoop configuration:
do you use a distribution (Hortonworks or Cloudera)?
Spark standalone or YARN mode?
how are the NodeManagers configured?

Optimal settings for apache spark based on the hardware

Is there a mapping/translation from the number of hardware systems, CPU cores, and their associated memory to the following spark-submit tunables?
executor-memory
executor-cores
num-executors
The application certainly has an influence on these tunables; I am, however, looking for a basic rule of thumb.
Apache Spark is running on YARN with HDFS in cluster mode.
Not all the hardware systems in the spark/hadoop yarn cluster have the same number of cpu cores or RAM.
There is no rule of thumb, but after considering
off-heap memory
the number of applications and other Hadoop daemons running
ResourceManager needs
HDFS I/O
etc.
you can derive a suitable configuration. Please check this URL
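As an illustration of how such a derivation might look, here is a minimal Python sketch. It is one possible starting point, not an official formula. Every default in it is an assumption to be tuned: 5 cores per executor, 1 core and 1 GB reserved per node for the OS and Hadoop daemons, one executor slot left for the YARN application master, and a 10% allowance for off-heap memory overhead. The function name suggest_executor_settings is hypothetical.

```python
def suggest_executor_settings(nodes, cores_per_executor=5, reserved_cores=1,
                              reserved_mem_gb=1, overhead_fraction=0.10):
    """Derive starting values for spark-submit from per-node hardware.

    nodes is a list of (cpu_cores, memory_gb) tuples, one per worker node.
    All defaults are illustrative assumptions, not fixed rules.
    """
    total_executors = 0
    mem_per_executor = None
    for cores, mem_gb in nodes:
        usable_cores = max(cores - reserved_cores, 0)
        usable_mem = max(mem_gb - reserved_mem_gb, 0)
        executors_here = usable_cores // cores_per_executor
        if executors_here == 0:
            continue  # node too small for even one executor of this size
        total_executors += executors_here
        per_executor = usable_mem / executors_here
        # Executors share one size, so the tightest node sets the memory limit.
        if mem_per_executor is None or per_executor < mem_per_executor:
            mem_per_executor = per_executor
    if total_executors == 0:
        raise ValueError("no node can host an executor with these settings")
    # Split the container between heap and off-heap overhead.
    heap_gb = int(mem_per_executor / (1 + overhead_fraction))
    if heap_gb < 1:
        raise ValueError("not enough memory per executor")
    return {
        "num-executors": max(total_executors - 1, 1),  # one slot for the AM
        "executor-cores": cores_per_executor,
        "executor-memory": f"{heap_gb}g",
    }
```

Because the nodes in the question are not uniform, the sketch sizes every executor for the most memory-constrained node; treat the result as a first guess to profile against, since the application itself still matters.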

Google cloud click to deploy hadoop

Why does the Google Cloud click-to-deploy Hadoop workflow require picking a size for the local persistent disk even if you plan to use the Hadoop connector for Cloud Storage? The default size is 500 GB. I was thinking that if it does need some disk, it should be much smaller. Is there a recommended persistent disk size when using the Cloud Storage connector with Hadoop in Google Cloud?
"Deploying Apache Hadoop on Google Cloud Platform
The Apache Hadoop framework supports distributed processing of large data sets across a clusters of computers.
Hadoop will be deployed in a single cluster. The default deployment creates 1 master VM instance and 2 worker VMs, each having 4 vCPUs, 15 GB of memory, and a 500-GB disk. A temporary deployment-coordinator VM instance is created to manage cluster setup.
The Hadoop cluster uses a Cloud Storage bucket as its default file system, accessed through Google Cloud Storage Connector. Visit Cloud Storage browser to find or create a bucket that you can use in your Hadoop deployment.
Apache Hadoop on Google Compute Engine
Click to Deploy Apache Hadoop
Apache Hadoop
ZONE
us-central1-a
WORKER NODE COUNT
CLOUD STORAGE BUCKET
Select a bucket
HADOOP VERSION
1.2.1
MASTER NODE DISK TYPE
Standard Persistent Disk
MASTER NODE DISK SIZE (GB)
WORKER NODE DISK TYPE
Standard Persistent Disk
WORKER NODE DISK SIZE (GB)
"
The three big uses of persistent disks (PDs) are:
Logs, both daemon and job (or container in YARN)
These can get quite large with debug logging turned on and can result in many writes per second
MapReduce shuffle
These can be large, but benefit more from higher IOPS and throughput
HDFS (image and data)
Due to the layout of directories, persistent disks will also be used for other items like job data (JARs, auxiliary data distributed with the application, etc), but those could just as easily use the boot PD.
Bigger persistent disks are almost always better due to the way GCE scales IOPS and throughput with disk size [1]. 500 GB is probably a good starting point for profiling your applications and uses. If you don't use HDFS, find that your applications don't log much, and don't spill to disk when shuffling, then a smaller disk can probably work well.
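The scaling argument can be turned into a small sizing calculation. The Python sketch below assumes performance grows linearly with provisioned size and ignores any per-instance caps. The per-GB rate is deliberately a required parameter, because the actual figures depend on the disk type and should be taken from the GCE documentation [1]; the function names are hypothetical.

```python
import math


def expected_iops(disk_gb, iops_per_gb):
    """Estimate IOPS for a disk, assuming linear scaling with size."""
    return disk_gb * iops_per_gb


def min_disk_gb_for(target_iops, iops_per_gb):
    """Smallest disk size (in whole GB) that reaches the target IOPS."""
    if iops_per_gb <= 0:
        raise ValueError("iops_per_gb must be positive")
    return math.ceil(target_iops / iops_per_gb)
```

For example, with a purely illustrative rate of 0.75 IOPS per GB, min_disk_gb_for(300, 0.75) gives 400, which shows why a shuffle-heavy job can justify a disk far larger than the data it stores.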
If you find that you actually don't want or need any persistent disk, then bdutil [2] also exists as a command line script that can create clusters with more configurability and customizability.
[1] https://cloud.google.com/developers/articles/compute-engine-disks-price-performance-and-persistence/
[2] https://cloud.google.com/hadoop/
