Limiting Hue resources with Static Service Pools cgroups - hadoop

I understand that you can limit Hadoop services via cgroups in Static Service Pools. I would like to limit the Hue service because it sometimes eats up all the memory we have on the edge node and hurts our loading processes.
However, I wasn't able to find Hue in the Static Service Pools configuration - it only gives me the options HDFS, Impala, YARN, and HBase.
Can the Hue setting be done here, or would I need to do it somewhere else?
Thank you.

In this case, you can try setting the cgroup memory soft limit on the Hue configuration page in Cloudera Manager, but I do not believe it will help much. It is a known issue in Hue caused by Python memory fragmentation. A few common operations in Hue can trigger it, such as downloading a large query result set (more than 10M) or using the HDFS file browser to browse a directory with a large number of files (1000+). Ask your users to refrain from these operations.
If the memory problem keeps happening, you can use the script at https://github.com/cloudera/hue/blob/master/tools/ops/hue_mem_cron.sh to set up a cron job. The script monitors the Hue process's memory usage and kills it if it uses too much. You then need to configure Cloudera Manager to restart Hue automatically.
Of course, killing Hue is not an ideal solution. What you can do is set up Hue HA with a load balancer in front of multiple Hue instances to alleviate the problem. You can follow the documentation at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_hag_hue_config.html to set it up.
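If you would rather not depend on that exact script, a minimal watchdog in the same spirit could look like the sketch below. It is an illustration, not the Cloudera script: the memory ceiling and the string used to match the Hue process are assumptions you would adapt, and it relies on Cloudera Manager auto-restarting Hue after the kill. Run it from cron, e.g. every few minutes.

#!/usr/bin/env python
# Illustrative watchdog (not hue_mem_cron.sh): kill Hue when its resident memory
# exceeds a threshold, and let Cloudera Manager restart it automatically.
import os
import signal
import subprocess

RSS_LIMIT_KB = 8 * 1024 * 1024      # assumed 8 GB ceiling; tune for your edge node
HUE_MARKER = "runcherrypyserver"    # assumption: substring identifying the Hue process

def hue_processes():
    # Yield (pid, rss_kb) for processes whose command line contains HUE_MARKER.
    out = subprocess.check_output(["ps", "-eo", "pid,rss,args"]).decode()
    for line in out.splitlines()[1:]:
        pid, rss, args = line.strip().split(None, 2)
        if HUE_MARKER in args:
            yield int(pid), int(rss)

for pid, rss_kb in hue_processes():
    if rss_kb > RSS_LIMIT_KB:
        os.kill(pid, signal.SIGKILL)  # Cloudera Manager should bring Hue back up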

You can enable the cgroup memory options (Cgroup Memory Soft Limit, Cgroup Memory Hard Limit) on the Hue configuration page as you see fit.
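For what it's worth, on hosts using cgroups v1 those two options correspond to the memory controller's soft and hard limit files. The sketch below only illustrates the underlying mechanism with made-up paths and values; Cloudera Manager normally writes these for you.

# Illustration only: the cgroup v1 files behind a memory soft/hard limit.
# The cgroup path and the limit values are hypothetical; CM manages the real ones.
CGROUP = "/sys/fs/cgroup/memory/hue_example"

def set_limit(filename, value_bytes):
    with open(CGROUP + "/" + filename, "w") as f:
        f.write(str(value_bytes))

set_limit("memory.soft_limit_in_bytes", 4 * 1024 ** 3)  # reclaimed first under memory pressure
set_limit("memory.limit_in_bytes", 6 * 1024 ** 3)       # hard ceiling enforced by the OOM killer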

Related

Why is a Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?

I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform (cloud) is a bit slower than running the same job on a lower-capacity virtual machine. I am running my Hadoop job on a 3-node cluster in the cloud (each node with 7.5 GB RAM and 50 GB disk), which took 4 min 49 sec, while the same job took 3 min 20 sec on a single-node virtual machine (my PC) with 3 GB RAM and 27 GB disk. Why is the job slower in the cloud with multi-node clustering than on a normal PC?
First of all: it is not easy to answer without knowing the complete configuration and the type of job you are running.
Possible reasons are:
1. Misconfiguration
Open the ResourceManager web UI (usually http://HOSTNAME:8088) and compare the available vcores and memory in both environments (a small script for this comparison is sketched at the end of this answer).
2. Job type
The job adds more overhead when running parallelized, so it is slower.
3. Hardware
The selected virtual hardware is slower than the local one, through low disk I/O and network overhead.
I would say it is something like 1. and 2.
For a more detailed answer, let me know:
the size and type of the job and how you run it
the Hadoop configuration
the cloud architecture
br
To be a bit more detailed, here are the numbers/facts that would help to find out the reason for the "slower" cloud environment:
Job type & size:
size of the data (1 MB or 1 TB?)
format (XML, Parquet, ...)
what kind of process (e.g. word count, format change, ML, ...)
and of course the options (executors and drivers) for your spark-submit or spark-shell
Hadoop configuration:
do you use a distribution (Hortonworks or Cloudera)?
Spark standalone or in YARN mode?
how are the NodeManagers configured?
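If you want to compare the two environments programmatically rather than eyeballing the web UI, a minimal sketch against the standard YARN ResourceManager REST endpoint /ws/v1/cluster/metrics could look like this (the hostname is a placeholder, and 8088 is the default ResourceManager web port, so adjust it if your cluster differs):

# Minimal sketch: pull cluster-wide vcore/memory figures from the ResourceManager
# REST API so the Dataproc cluster and the local VM can be compared side by side.
import json
import urllib.request

RM_HOST = "resourcemanager.example.com"  # placeholder: your ResourceManager host

url = "http://%s:8088/ws/v1/cluster/metrics" % RM_HOST
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)["clusterMetrics"]

print("active nodes:        ", metrics["activeNodes"])
print("total vcores:        ", metrics["totalVirtualCores"])
print("total memory (MB):   ", metrics["totalMB"])
print("available memory (MB):", metrics["availableMB"])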

Pig script runs fine on Sandbox but fails on a real cluster

Environments:
Hortonworks Sandbox running HDP 2.5
Hortonworks HDP 2.5 Hadoop cluster managed by Ambari
We are facing a tricky situation. We run a Pig script from the Hadoop tutorial. The script works with tiny data and runs fine on the Sandbox, but it fails on the real cluster, where it complains about insufficient memory for the container.
The message "container is running beyond physical memory limit" can be seen in the logs.
The tricky part is that the Sandbox has far less memory available than the real cluster (about 3 times less). Also, most memory settings in the Sandbox (MapReduce memory, YARN memory, YARN container sizes) allow much less memory than the corresponding settings in the real cluster. Still, it is sufficient for Pig in the Sandbox but not in the real cluster.
Another note: Hive queries doing a similar job work fine in both environments; they do not complain about memory.
Apparently there is some setting somewhere (within Environment 2) that makes Pig request too much memory. Can anybody please recommend which parameter should be modified to stop the Pig script from requesting so much memory?

Ambari scaling memory for all services

Initially I had two machines to set up Hadoop, Spark, HBase, Kafka, ZooKeeper, and MR2. Each of those machines had 16 GB of RAM. I used Apache Ambari to set up the two machines with the above-mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended memory settings for each service at install time, so a change in memory after installation will not scale them up. You would have to edit these settings manually for each service, and to do that you would indeed need an understanding of how memory should be configured for each service. I don't know of any Ambari documentation that recommends memory configuration values for each service, so I would suggest one of the following routes:
1) Take a look at each service's documentation (YARN, Oozie, Spark, etc.) and see what it recommends for memory-related parameter configurations.
2) Take a look at the Ambari code that calculates recommended values for these memory parameters and use those equations to come up with new values that account for your increased memory.
I used this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, SmartSense is a must: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to define the cores, memory, disks, and whether we use HBase or not; the script will then provide the memory settings for YARN and MapReduce.
[root@ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, there are formulas there to calculate it manually. I tried these settings and they worked for me.
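For reference, the manual calculation in that HDP document boils down to roughly the following sketch. It is a simplified rendering with an abbreviated OS/HBase reservation table, not the exact yarn-utils.py logic, so the numbers can differ slightly from the output above.

# Simplified rendering of the HDP memory-configuration formulas (approximate).
def yarn_memory_settings(cores, memory_gb, disks, hbase=False):
    total_mb = memory_gb * 1024
    # Abbreviated reservation table: up to N GB total -> reserve R GB for the OS
    # (and roughly the same again for HBase if it is co-located).
    reserved_gb = next(r for limit, r in [(8, 2), (16, 2), (24, 4), (48, 6), (64, 8),
                                          (96, 12), (128, 24), (256, 32), (float("inf"), 64)]
                       if memory_gb <= limit)
    reserved_mb = reserved_gb * 1024 * (2 if hbase else 1)
    available_mb = total_mb - reserved_mb
    min_container_mb = 2048 if memory_gb > 24 else 1024
    containers = int(min(2 * cores, 1.8 * disks, available_mb / min_container_mb))
    ram_per_container = max(min_container_mb, available_mb // containers)
    heap = "-Xmx%dm" % int(0.8 * ram_per_container)  # JVM heap ~80% of the container
    return {
        "yarn.nodemanager.resource.memory-mb": containers * ram_per_container,
        "yarn.scheduler.minimum-allocation-mb": ram_per_container,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_per_container,
        "mapreduce.map.memory.mb": ram_per_container,
        "mapreduce.map.java.opts": heap,
        "mapreduce.reduce.memory.mb": ram_per_container,
        "mapreduce.reduce.java.opts": heap,
        "yarn.app.mapreduce.am.resource.mb": ram_per_container,
        "yarn.app.mapreduce.am.command-opts": heap,
        "mapreduce.task.io.sort.mb": int(0.4 * ram_per_container),
    }

for key, value in yarn_memory_settings(cores=8, memory_gb=128, disks=3, hbase=True).items():
    print("%s=%s" % (key, value))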

Does Mesos Override Hadoop Memory Settings?

My company runs Hadoop on Mesos, and I'm new to Mesos. The current limiting factor of the Hadoop application I'm in charge of is the speed of the reducer tasks, so I was hoping to play around with the Mesos and Hadoop memory settings to speed up the reducers.
Unfortunately, I don't understand the relationship between the Hadoop memory settings and the Mesos memory configuration, and I suspect that Mesos may be overriding some of my Hadoop memory settings.
Is changing the value of mapreduce.reduce.java.opts or mapreduce.reduce.memory.mb (in /etc/hadoop/conf/mapred-site.xml) affected by Mesos? Does Mesos limit the amount of memory that I can allocate to the reducer?
If so, where are the config files in Mesos so I can change those settings?
Thanks!
9/30/2015 Update:
The file at https://github.com/mesos/hadoop/blob/master/configuration.md lists parameters that you can put in your mapred-site.xml file.
I'm still not sure how those parameters affect the memory-related Hadoop configuration parameters in mapred-site.xml.
The configuration is described in the respective GitHub repo, mesos/hadoop.

Should the HBase region server and the Hadoop data node be on the same machine?

Sorry, I don't have the resources to set up a cluster to test this; I'm just wondering:
Can I deploy the HBase region server on a separate machine, other than the Hadoop data node machine? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy the HBase region server and the Hadoop data node on different machines?
When putting some data into HBase, where is this data eventually stored, on the data node or the region server? I guess it's the data node, but then what are the StoreFile and HFile in the region server? Aren't they the physical files that store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad; that works against the data locality principle (if you want to know a little more about data locality, check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html).
Actual data will be stored in HDFS (on the DataNodes); RegionServers are responsible for serving and managing regions.
For more information about HBase architecture, please check this excellent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Do not ever try to set up a production environment without properly testing the platform first in a development environment.
To go into more detail about this answer:
"RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpret the term "alongside", so let's try to be even more precise:
What makes any physical server an "XYZ" server is that it's running a program called a daemon (think of an "eternally running background event-handling" program);
What makes a "file" server is that it's running a file-serving daemon;
What makes a "web" server is that it's running a web-serving daemon;
AND
What makes a "data node" server is that it's running the HDFS data-serving daemon;
What makes a "region" server then is that it's running the HBase region-serving daemon (program);
So, in all Hadoop distributions (e.g. Cloudera, MapR, Hortonworks, and others), the general best practice is that for HBase, the "RegionServers" are "co-located" with the "DataNode" servers.
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program)
and they're also running the HBase region-serving daemon (program) as well!
This way we ensure locality: the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, so HBase region servers (data nodes running the HBase daemon as well) must also do all their processing (putting/getting/scanning) on each data node containing the HFiles, which make up HRegions, which make up HTables, which make up HBases (Hadoop databases).
So, servers (VMs or physical, on Windows, Linux, ...) can run multiple daemons concurrently; in fact, they often run dozens of them.
