Configuring memory allocation in mesos - hadoop

I'm running hadoop on mesos, and I'm unsure how to configure memory for mesos. Specifically, in which file should a parameter like mapred.mesos.slot.mem (from https://github.com/mesos/hadoop/blob/master/configuration.md) be set?
I know that other parameters in the configuration.md file can be placed in hadoop's mapred-site.xml file, but I don't know where to put other mesos configuration parameters. Any help would be appreciated.
Thanks.

Related

Druid + Hadoop (for both uses, deep-store & indexing)

If I have a Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid's conf dir? http://druid.io/docs/latest/configuration/hadoop.html
The way I see it:
It looks like those -site.xml files are for the Hadoop server, and Druid only acts as a Hadoop client. So I don't think Druid needs the hdfs-site.xml.
Core-site.xml: OK, I get it. Druid needs to know the IP of the name node (hadoop).
Mapred-site.xml, partially. Druid needs to know the status of mapreduce jobs (I suppose it will delegate the indexing to Hadoop as an MR job). So it needs to communicate with the job trackers to see if the indexing is finished / failed / in progress. For that, it needs the URL of the Hadoop JT.
However, Druid does not need the property "mapreduce.cluster.local.dir", because it does not actively participate in the MR job.
Yarn-site.xml? Maybe it should stay, partially. At least for submitting a job (?).
What about HDFS-site.xml? I think this can be scrapped completely.
Capacity-scheduler.xml? It can go.
Please correct me if I'm wrong.
These questions / doubts arise because I'm quite new to hadoop. I have my hadoop setup running in pseudo-distributed mode. I also tested it with a javascript webhdfs library to write and read a file, and I have tried the sample MR jobs provided by the hadoop dist. So I guess my hadoop setup is fine. I'm just a bit unsure on the Druid side, partly because the doc is not very clear about it.
Btw, I have hadoop 2.7.2, while the hadoop-client libs used by Druid are still on 2.3.0.
Should I downgrade my hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thanks,
Raka
Please add mapred-site.xml, core-site.xml, hdfs-site.xml and yarn-site.xml to the classpath.
Also, you don't need to downgrade; Druid works well with 2.7.x.
As you can see in the doc, you can use multiple versions of hadoop.
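As a quick sanity check for the classpath advice above, here is a minimal Python sketch that reports which of the four files are missing from a given conf directory. The function name and the idea of checking a single directory are my own assumptions for illustration; it does not verify that the directory is actually on Druid's classpath.

```python
from pathlib import Path

# The four Hadoop client config files named in the answer above.
SITE_FILES = ("core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml")


def missing_site_files(conf_dir):
    """Return the expected *-site.xml file names that are absent from conf_dir."""
    conf = Path(conf_dir)
    return [name for name in SITE_FILES if not (conf / name).is_file()]
```

Call it with the directory you put on the classpath, for example missing_site_files("/path/to/druid/conf"); an empty list means all four files are present.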

Make spark environment for cluster

I made a spark application that analyzes file data. Since the input data can be large, it's not enough to run my application as standalone. With one more physical machine, how should I set up the architecture for it?
I'm considering using mesos as the cluster manager, but I'm pretty new to hdfs. Is there any way to do it without hdfs (for sharing file data)?
Spark supports a few cluster modes: Yarn, Mesos and Standalone. You may start with the Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To verify that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)
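If you end up scripting the worker step above, a small helper like the following can catch a mistyped master URL before anything is launched. This is only a sketch: the function name and the spark_home parameter are hypothetical, and it merely builds the start-slave.sh command from the steps above without running it.

```python
from urllib.parse import urlparse


def worker_start_command(master_url, spark_home="."):
    """Build the argv that starts one worker against a standalone master."""
    parsed = urlparse(master_url)
    # The master prints its address in the form spark://HOST:PORT.
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected spark://HOST:PORT, got {master_url!r}")
    return [f"{spark_home}/sbin/start-slave.sh", master_url]
```

On each worker machine you could then pass the returned list to subprocess.run(cmd, check=True).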

Does Mesos Overwrite Hadoop Memory Settings?

My company runs hadoop on mesos, and I’m new to mesos. The current bottleneck of the hadoop application I’m in charge of is the speed of the reducer tasks, so I was hoping to play around with mesos and hadoop memory settings to speed up the reducer.
Unfortunately, I don’t understand the relationship between hadoop memory settings and mesos memory configuration, and I suspect that mesos may be overriding some of my hadoop memory settings.
Does mesos affect the values I set for mapreduce.reduce.java.opts or mapreduce.reduce.memory.mb (in /etc/hadoop/conf/mapred-site.xml)? Does mesos limit the amount of memory that I can allocate to the reducer?
If so, where are the config files in mesos so I can change those settings?
Thanks!
9/30/2015 Update:
The file at https://github.com/mesos/hadoop/blob/master/configuration.md lists parameters that you can put in your mapred-site.xml file.
I'm still not sure how those parameters affect the memory-associated hadoop configuration parameters in mapred-site.xml.
The configuration is described in the respective GitHub repo mesos/hadoop.
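Going by the update above (the configuration.md parameters go in mapred-site.xml), here is a minimal Python sketch for setting such a property in place. The helper name set_property is hypothetical, and the value 1024 in the usage note is purely illustrative; check configuration.md for the units and defaults of each parameter.

```python
import xml.etree.ElementTree as ET


def set_property(site_xml_path, name, value):
    """Set (or add) a <property> in a Hadoop *-site.xml file, in place."""
    tree = ET.parse(site_xml_path)
    root = tree.getroot()  # expected to be the <configuration> element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already exists: overwrite its value.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = str(value)
            break
    else:
        # Property not found: append a new <property> block.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    tree.write(site_xml_path, encoding="utf-8", xml_declaration=True)
```

For example, set_property("/etc/hadoop/conf/mapred-site.xml", "mapred.mesos.slot.mem", 1024). Note that ElementTree drops XML comments and the stylesheet header when it rewrites the file, so try it on a copy first.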

Configure Hadoop to use S3 requester-pays-enabled

I'm using Hadoop (via Spark), and need to access S3N content which is requester-pays. Normally, this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet, I've set this and Spark / Hadoop are ignoring it. Perhaps I'm putting the jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop / Spark / JetS3t to access requester-pays buckets?
UPDATE: This is needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays. So, a crude workaround is to run from within EC2.
The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could use System.getProperty() before the file operation to check if the JVM where the code runs has loaded the right config. You could even use System.setProperty() to directly set it at that point instead of figuring out the config files.
Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")
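For PySpark users, the same "set it in code" approach might look like the sketch below. It only mirrors the credential-setting call from the answer above and does not by itself enable requester-pays. The helper names are hypothetical, reading the keys from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY is my assumption, and sc._jsc.hadoopConfiguration() is the commonly used (private) PySpark handle to the Hadoop configuration.

```python
import os


def s3n_settings_from_env(env=os.environ):
    """Build the fs.s3n.* credential settings from AWS-style environment variables."""
    return {
        "fs.s3n.awsAccessKeyId": env["AWS_ACCESS_KEY_ID"],
        "fs.s3n.awsSecretAccessKey": env["AWS_SECRET_ACCESS_KEY"],
    }


def apply_hadoop_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry and return the keys set."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
    return sorted(settings)


# Assumed PySpark usage, where `sc` is an existing SparkContext:
#   apply_hadoop_settings(sc._jsc.hadoopConfiguration(), s3n_settings_from_env())
```

Because this runs in the driver right before the file operation, it sidesteps the question of which JVM picked up which config file.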

Deploying hdfs core-site.xml with cloudera manager

I'm trying to add lzo support to my configuration files using the cloudera manager (CDH5b2).
If I add the io.compression.codecs to the service-wide hdfs configuration, and deploy the configuration file, /etc/hadoop/conf.cloudera.hdfs/core-site.xml now contains the new value.
However, because /etc/hadoop/conf.cloudera.yarn/core-site.xml has a higher priority (update-alternatives --display hadoop-conf), the hdfs core-site.xml values are not used when I start a MR job.
Obviously, I can simply modify the yarn core-site.xml file manually, but I don't understand how to deploy the hdfs core-site.xml file properly using cloudera manager.
There is a MapReduce Client Environment Safety Valve, also known as 'MapReduce Service Advanced Configuration Snippet (Safety Valve) for core-site.xml', found in the gui under mapreduce's configuration -> Service-Wide -> Advanced. It will allow you to add any value that doesn't fit elsewhere. (There is also one for core-site.xml.)
Having said that, details can be found on Cloudera's site at: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_cdh5_install.html
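To confirm which deployed file actually carries the new value after a redeploy, a small Python sketch like this can compare a property across several core-site.xml files. The function names are hypothetical; the two paths in the usage note are the ones from the question.

```python
import xml.etree.ElementTree as ET


def read_property(site_xml_path, name):
    """Return the value of a named property in a Hadoop *-site.xml file, or None."""
    root = ET.parse(site_xml_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def compare_property(paths, name):
    """Map each config file path to the value it holds for the property."""
    return {path: read_property(path, name) for path in paths}
```

For example, compare_property(["/etc/hadoop/conf.cloudera.hdfs/core-site.xml", "/etc/hadoop/conf.cloudera.yarn/core-site.xml"], "io.compression.codecs") shows at a glance whether the yarn copy picked up the codec list.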

Resources