Is there any option to get memory usage & CPU utilization for each FlowFile running on a NiFi instance?
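NiFi does not report CPU or memory per FlowFile out of the box; the closest built-in data is JVM-level system diagnostics. A minimal sketch that polls the /nifi-api/system-diagnostics REST endpoint with Python — the host/port are placeholders, a secured instance additionally needs authentication, and the exact field names should be checked against your NiFi version's response:

```python
import requests  # assumes the 'requests' package is installed

# Placeholder URL: adjust host/port for your instance.
NIFI_DIAG = "http://localhost:8080/nifi-api/system-diagnostics"

resp = requests.get(NIFI_DIAG, timeout=10)
resp.raise_for_status()
diag = resp.json()["systemDiagnostics"]["aggregateSnapshot"]

# These are JVM-wide figures, not per-FlowFile:
print("Heap used:      ", diag["usedHeap"])
print("Heap max:       ", diag["maxHeap"])
print("Processor load: ", diag["processorLoadAverage"])
```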
I'm a beginner in Apache Spark and have installed a prebuilt distribution of Apache Spark with Hadoop. I am trying to get the memory consumption while running the PageRank example implemented within Spark. My cluster runs in standalone mode with 1 master and 4 workers (virtual machines).
I have tried external tools like Ganglia and Graphite, but they give the memory usage at the resource or system level (more general), while what I need is to track the behavior of the memory (storage, execution) while running the algorithm, i.e. the memory usage per Spark application ID. Is there any way to get it into a text file for further processing? Please help me with this, thanks.
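While an application is running, its driver exposes per-executor memory figures through Spark's REST API (on the driver UI port, 4040 by default). A minimal polling sketch, assuming the driver runs on localhost and that sampling each executor's memoryUsed/maxMemory is enough; it appends one line per sample to a text file:

```python
import time
import requests  # assumes the 'requests' package is installed

DRIVER = "http://localhost:4040"  # driver UI; adjust host if remote

# The first (usually only) application running on this driver.
app_id = requests.get(f"{DRIVER}/api/v1/applications").json()[0]["id"]

with open("memory_usage.txt", "a") as out:
    while True:
        execs = requests.get(
            f"{DRIVER}/api/v1/applications/{app_id}/executors").json()
        for e in execs:
            # memoryUsed = storage memory currently in use;
            # maxMemory = total storage memory available to the executor.
            out.write(f"{time.time()}\t{e['id']}\t"
                      f"{e['memoryUsed']}\t{e['maxMemory']}\n")
        out.flush()
        time.sleep(5)  # sample every 5 seconds while the job runs
```

The resulting tab-separated file can then be plotted or aggregated per application ID offline.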
We have a 4-node cluster with Hue 3.9.0 installed. The NameNode has 24 GB of RAM and each DataNode has 20 GB. We have several jobs running that consume resources:
15/24 GB (NameNode)
14/20 GB (DataNode)
13/20 GB (DataNode)
6/20 GB (DataNode)
We also run queries in Impala and Hive. From time to time those queries consume all available RAM (on the NameNode), which causes Hue to crash every time. When it happens, Cloudera Manager (CM) shows Hue's health as bad (process status: "Bad : This role's process exited. This role is supposed to be started.") while all the other services such as HBase, HDFS, Impala, Hive and so on have good health. After restarting the Hue service via CM it works fine again. How can we prevent Hue from crashing because of a lack of RAM?
I think what I am looking for is a way to reserve enough RAM for the Hue service, but all I could find so far were Impala configuration options set via the Hue configuration tab (with our current values in brackets):
Impala Daemon Memory Limit (mem_limit):
Impala Daemon Default Group (1701 MB)
Impala Daemon Group 1 (2665 MB)
Java Heap Size of Impala Llama ApplicationMaster in Bytes (2 GB)
But either way, running a sequence of queries (in parallel or one after another) eventually consumes all available RAM. It seems that RAM is not freed after a query is done. I would rather expect Impala and Hive to report that they don't have enough RAM to continue than to crash other services such as Hue.
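One per-query mitigation (a sketch, not a complete answer) is to cap memory per session with Impala's MEM_LIMIT query option, so a hungry query fails with a memory-limit error instead of exhausting the node. Assuming the impyla Python client and an impalad on the default HiveServer2 port 21050; hostname and table are placeholders:

```python
from impala.dbapi import connect  # assumes the 'impyla' package

# Placeholder host; 21050 is the default impalad HS2 port.
conn = connect(host="namenode.example.com", port=21050)
cur = conn.cursor()

# Cap this session's queries at 2 GB; a query needing more is
# rejected instead of starving co-located services like Hue.
cur.execute("SET MEM_LIMIT=2gb")
cur.execute("SELECT count(*) FROM some_table")  # placeholder query
print(cur.fetchall())
```

At the cluster level, Impala admission control and Cloudera Manager's Static Service Pools are the usual way to give Impala a fixed memory share so other roles keep theirs.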
I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform is a bit slower than running the same job on a lower-capacity virtual machine. I am running my Hadoop job on a 3-node cluster (each node with 7.5 GB RAM and a 50 GB disk) in the cloud, which took 4 min 49 sec, while the same job took 3 min 20 sec on a single-node virtual machine (my PC) with 3 GB RAM and a 27 GB disk. Why is the result slower in the cloud with a multi-node cluster than on a normal PC?
First of all:
it is not easy to answer without knowing the complete configuration and the type of job you're running.
possible reasons are:
misconfiguration
http://HOSTNAME:8088 (the default ResourceManager web UI port)
open the ResourceManager web UI and compare the available vcores and memory against your hardware (see the sketch below)
job type
some jobs add overhead when run in parallel, so they can be slower distributed than locally
hardware
the selected virtual hardware is slower than the local machine, e.g. due to slow disk I/O and network overhead
I would say it is most likely the first two (misconfiguration or job type).
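For the misconfiguration check, the ResourceManager exposes the same numbers over its REST API as in the web UI. A minimal Python sketch (the host is a placeholder; 8088 is only the default port):

```python
import requests  # assumes the 'requests' package is installed

RM = "http://resourcemanager.example.com:8088"  # placeholder host

metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]

# If these numbers are far below the hardware you provisioned,
# the NodeManagers are likely misconfigured.
print("Total memory (MB):", metrics["totalMB"])
print("Total vcores:     ", metrics["totalVirtualCores"])
print("Available MB:     ", metrics["availableMB"])
print("Available vcores: ", metrics["availableVirtualCores"])
```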
For a more detailed answer, let me know:
size and type of the job and how you run it.
Hadoop configuration
cloud architecture
Best regards
To be a bit more detailed, here are the numbers/facts that would help pin down the reason for the "slower" cloud environment:
job type & size:
size of the data (1 MB or 1 TB?)
data format (XML, Parquet, ...)
what kind of processing (e.g. wordcount, format conversion, ML, ...)
and of course the options (executors and driver) for your spark-submit or spark-shell (see the sketch after this list)
Hadoop Configuration:
do you use a distribution (Hortonworks or Cloudera)?
Spark standalone or YARN mode?
how are the NodeManagers configured?
I am trying to build clusters of different sizes, and that is why I want the formulas from which I can calculate the RAM, CPU & disk requirements of the NameNode, YARN & ResourceManager.
I also want to know how RAM, CPU & disk relate to each other.
You can use the Cloudera guide: Download
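The Hortonworks manuals also publish a rule of thumb for YARN container sizing that can be turned into a formula. A sketch of that calculation — the reserved-memory and minimum-container-size steps are simplifications of their published lookup tables, so treat the output as a starting point only:

```python
def yarn_sizing(total_ram_gb, cores, disks):
    """Rule-of-thumb YARN container sizing per worker node
    (after the Hortonworks companion-files calculation)."""
    # Reserve memory for the OS and Hadoop daemons (simplified:
    # the published table reserves roughly 1-8 GB by node size).
    reserved_gb = max(1, min(8, total_ram_gb // 8))
    available_gb = total_ram_gb - reserved_gb

    # Minimum container size grows with available RAM (simplified table).
    if available_gb < 4:
        min_container_gb = 0.25
    elif available_gb < 8:
        min_container_gb = 0.5
    elif available_gb < 24:
        min_container_gb = 1
    else:
        min_container_gb = 2

    containers = int(min(2 * cores, 1.8 * disks,
                         available_gb / min_container_gb))
    ram_per_container_gb = max(min_container_gb, available_gb / containers)
    return containers, ram_per_container_gb

# Example: a worker with 64 GB RAM, 16 cores, 8 data disks.
print(yarn_sizing(64, 16, 8))
```

The intuition behind the min(): CPU supports about 2 containers per core, disks support about 1.8 containers each before I/O saturates, and RAM caps the rest — which is also how RAM, CPU, and disk relate to each other.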
Is there a mapping/translation from the number of hardware systems, CPU cores, and their associated memory to the spark-submit tunables of:
executor-memory
executor-cores
num-executors
The application certainly has something to do with these tunables; I am, however, looking for a basic rule of thumb.
Apache Spark is running on YARN with HDFS in cluster mode.
Not all the hardware systems in the Spark/Hadoop YARN cluster have the same number of CPU cores or the same amount of RAM.
There is no single rule of thumb, but after considering
off-heap memory
the number of applications and other Hadoop daemons running
ResourceManager needs
HDFS I/O
etc.
you can derive a suitable configuration. Please check this URL.
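That said, a widely quoted starting point (popularized by Cloudera's Spark tuning posts, not an official formula) can be written down. A sketch — the 5-core ceiling, the 1 core/1 GB per-node reservation, and the ~10% memory-overhead deduction are all assumptions baked into that rule of thumb:

```python
def spark_executor_sizing(nodes, cores_per_node, ram_per_node_gb):
    """Starting-point values for --num-executors, --executor-cores,
    --executor-memory; always validate against your own workload."""
    executor_cores = 5  # ~5 cores per executor keeps HDFS I/O healthy
    # Leave 1 core and 1 GB per node for the OS and Hadoop daemons.
    execs_per_node = max(1, (cores_per_node - 1) // executor_cores)
    num_executors = nodes * execs_per_node - 1  # -1 for the YARN AM/driver

    mem_per_exec_gb = (ram_per_node_gb - 1) / execs_per_node
    # Deduct ~10% for spark.yarn.executor.memoryOverhead (min 384 MB).
    executor_memory_gb = int(mem_per_exec_gb * 0.9)
    return num_executors, executor_cores, executor_memory_gb

# On a heterogeneous cluster, size for the weakest node, since a
# single spark-submit requests uniform containers. Example: 4 nodes,
# 16 cores and 64 GB RAM each.
print(spark_executor_sizing(4, 16, 64))
```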