My company runs Hadoop on Mesos, and I'm new to Mesos. The current bottleneck of the Hadoop application I'm in charge of is the speed of the reducer tasks, so I was hoping to experiment with Mesos and Hadoop memory settings to speed up the reducers.
Unfortunately, I don't understand the relationship between Hadoop memory settings and Mesos memory configuration, and I suspect that Mesos may be overriding some of my Hadoop memory settings.
Are the values of mapreduce.reduce.java.opts or mapreduce.reduce.memory.mb (in /etc/hadoop/conf/mapred-site.xml) affected by Mesos? Does Mesos limit the amount of memory that I can allocate to the reducer?
If so, where are the Mesos config files, so I can change those settings?
Thanks!
9/30/2015 Update:
The file at https://github.com/mesos/hadoop/blob/master/configuration.md lists parameters that you can put in your mapred-site.xml file.
I'm still not sure how those parameters affect the memory-associated hadoop configuration parameters in mapred-site.xml.
The configuration is described in the respective GitHub repo mesos/hadoop.
Related
We have a 100-node Hadoop cluster. I wrote a Flink app that writes many files to HDFS using BucketingSink. When I run the Flink app on YARN, I find that all task managers are placed on the same NodeManager, which means all subtasks run on that one node. This opens many file descriptors on the DataNode of that busy node. (I think the Flink filesystem connector connects to the local DataNode preferentially.) This puts high pressure on that node, which easily causes the job to fail.
Any good idea to solve this problem? Thank you very much!
This sounds like a YARN scheduling problem. Please take a look at YARN's capacity scheduler, which allows you to schedule containers on nodes based on the available capacity. Moreover, you could tell YARN to also consider virtual cores for scheduling. This lets you define a resource dimension in addition to memory.
Initially I had two machines on which to set up Hadoop, Spark, HBase, Kafka, ZooKeeper, and MR2. Each of those machines had 16GB of RAM. I used Apache Ambari to set up the two machines with the above-mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended memory settings for each service at install time, so a change in memory after installation will not be picked up automatically. You would have to edit these settings manually for each service. To do that, yes, you would need an understanding of how memory should be configured for each service. I don't know of any Ambari documentation that recommends memory configuration values for each service. I would suggest one of the following routes:
1) Take a look at each service's documentation (YARN, Oozie, Spark, etc.) and see what it recommends for memory-related configuration parameters.
2) Take a look at the Ambari code that calculates recommended values for these memory parameters and use those equations to come up with new values that account for your increased memory.
I used this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, SmartSense is a must: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to specify the cores, memory, and disks, and whether or not we use HBase; the script then provides the memory settings for YARN and MapReduce.
[root@ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, the page has formulas for calculating the values manually. I tried these settings and they worked for me.
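The values in the output above are consistent with the sizing rules in the linked Hortonworks guide. Here is a minimal Python sketch that reproduces them. It is not the actual yarn-utils.py: the function name yarn_memory_settings is hypothetical, the 2048 MB minimum container size is an assumption (the guide's value for machines with more than 24 GB of RAM, as far as I can tell), and the 512 MB rounding and the 0.8 and 0.4 factors are inferred from the printed numbers.

```python
import math


def yarn_memory_settings(cores, disks, usable_mb, min_container_mb=2048):
    """Derive YARN/MapReduce memory settings from node resources.

    usable_mb is the memory left after reserving room for the OS and HBase
    (81920 MB in the output above). min_container_mb is an assumed default.
    """
    # Number of containers is bounded by CPU, disks, and memory.
    containers = int(min(2 * cores,
                         math.ceil(1.8 * disks),
                         usable_mb // min_container_mb))
    # Memory per container, rounded down to a multiple of 512 MB.
    container_mb = max(min_container_mb, usable_mb // containers)
    container_mb = (container_mb // 512) * 512
    # JVM heap is kept below the container size (factor inferred from the output).
    heap_opts = "-Xmx%dm" % int(0.8 * container_mb)
    return {
        "yarn.scheduler.minimum-allocation-mb": container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * container_mb,
        "yarn.nodemanager.resource.memory-mb": containers * container_mb,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": heap_opts,
        "mapreduce.reduce.memory.mb": container_mb,
        "mapreduce.reduce.java.opts": heap_opts,
        "yarn.app.mapreduce.am.resource.mb": container_mb,
        "yarn.app.mapreduce.am.command-opts": heap_opts,
        "mapreduce.task.io.sort.mb": int(0.4 * container_mb),
    }


settings = yarn_memory_settings(cores=8, disks=3, usable_mb=81920)
for key, value in settings.items():
    print("%s=%s" % (key, value))
```

For a different machine, change cores, disks, and usable_mb, and cross-check the result against the guide before applying it.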
I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured.
I receive the following error when running "spark-shell" from the shell (locally on the master), as well as when uploading a job through the web-GUI and the gcloud command line utility from my local machine:
15/11/08 21:27:16 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (38281+2679 MB) is above the max threshold (20480 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
I tried modifying the value in /etc/hadoop/conf/yarn-site.xml but it didn't change anything. I don't think it pulls the configuration from that file.
I've tried with multiple cluster combinations, at multiple sites (mainly Europe), and I only got this to work with the low memory version (4-cores, 15 gb memory).
I.e. this is only a problem on the nodes configured for memory higher than the yarn default allows.
Sorry about these issues you're running into! It looks like this is part of a known issue where certain memory settings end up computed based on the master machine's size rather than the worker machines' size, and we're hoping to fix this in an upcoming release soon.
There are two current workarounds:
1) Use a master machine type with memory either equal to or smaller than worker machine types.
2) Explicitly set spark.executor.memory and spark.executor.cores either using the --conf flag if running from an SSH connection like:
spark-shell --conf spark.executor.memory=4g --conf spark.executor.cores=2
or if running gcloud beta dataproc, use --properties:
gcloud beta dataproc jobs submit spark --properties spark.executor.memory=4g,spark.executor.cores=2
You can adjust the number of cores/memory per executor as necessary; it's fine to err on the side of smaller executors and let YARN pack lots of executors onto each worker, though you can save some per-executor overhead by setting spark.executor.memory to the full size available in each YARN container and spark.executor.cores to all the cores in each worker.
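To see why the original request was rejected and why a 4g executor fits, here is a rough Python sketch of the check behind the error message. The function names are hypothetical, and the overhead rule (7% of executor memory with a 384 MB floor) is an assumption inferred from the 38281+2679 figures in the error; the factor differs between Spark versions and can be overridden in the configuration.

```python
# Assumed overhead rule, inferred from the error message (38281 + 2679 MB).
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.07


def executor_overhead_mb(executor_mb):
    """Extra off-heap memory requested from YARN on top of the executor heap."""
    return max(MIN_OVERHEAD_MB, int(OVERHEAD_FACTOR * executor_mb))


def fits_in_yarn(executor_mb, max_allocation_mb):
    """True if heap plus overhead stays within yarn.scheduler.maximum-allocation-mb."""
    return executor_mb + executor_overhead_mb(executor_mb) <= max_allocation_mb


# The failing request from the error message, then the 4g workaround.
print(fits_in_yarn(38281, 20480))
print(fits_in_yarn(4096, 20480))
```

The same check can be used to pick the largest spark.executor.memory that still fits under the cluster's maximum allocation.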
EDIT: As of January 27th, new Dataproc clusters will now be configured correctly for any combination of master/worker machine types, as mentioned in the release notes.
I'm running hadoop on mesos, and I'm unsure how to configure memory for mesos. Specifically, a parameter like mapred.mesos.slot.mem (from https://github.com/mesos/hadoop/blob/master/configuration.md) would get configured in what file?
I know that other parameters in the configuration.md file can be placed in hadoop's mapred-site.xml file, but I don't know where to put other mesos configuration parameters. Any help would be appreciated.
Thanks.
I have a job (it handles 4 GB of data), and when I checked, both CPU usage and memory usage were under 10%.
Your job most likely doesn't need more than that. You could try stress-testing your cluster with TeraSort (included in the examples jar); if your nodes still show very low usage, there may be a problem with your configuration.
Hadoop comes with benchmark utilities to verify Hadoop cluster settings.
Check Hadoop TestDFSIO.