Mesos DCOS Spark with different role - mesos

I installed the Spark service on Mesosphere with the following JSON config.
I was hoping that Spark would use the role slave_public.
{
  "service": {
    "name": "my-spark",
    "role": "slave_public"
  }
}
dcos package install --options=my-spark.json spark
But I see in /var/log/messages that the role * is being used and offers are rejected. The Marathon UI shows the framework in the "waiting" state.
Is it possible to override the default role?

Related

How to add Spark worker nodes on Cloudera with YARN

We have Cloudera 5.2 and the users would like to start using Spark to its full potential (in distributed mode so it can take advantage of data locality with HDFS). The service is already installed and shows up on the Cloudera Manager home page, but when clicking the service and then "Instances", it just shows a History Server role and, on other nodes, a Gateway role. From my understanding of Spark's architecture you have a master node and worker nodes (which live together with the HDFS DataNodes), so in Cloudera Manager I tried "Add role instances", but only a "Gateway" role is available. How do you add Spark's worker node (or executor) role to the hosts where you have HDFS DataNodes? Or is it unnecessary (I think so because with YARN, YARN takes charge of creating the executors and the application master)? And what about the master node? Do I need to configure anything so the users can use Spark in fully distributed mode?
The Master and Worker roles are part of the Spark (Standalone) service. You can either run Spark on YARN (in which case Master and Worker nodes are irrelevant) or run Spark (Standalone).
As you have started the Spark service instead of Spark (Standalone) in Cloudera Manager, Spark is already using YARN. In Cloudera Manager 5.2 and higher, there are two separate Spark services (Spark and Spark (Standalone)). The Spark service runs Spark as a YARN application with only gateway roles in addition to the Spark History Server role.
How do you add Spark's worker node (or executor) role to the hosts where you have HDFS DataNodes?
Not required. Only Gateway roles are required on these hosts.
Quoting from CM Documentation:
In Cloudera Manager Gateway roles take care of propagation of client configurations to the other hosts in your cluster. So, ensure that you assign the gateway roles to hosts in the cluster. If you do not have gateway roles, client configurations are not deployed.
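For example, once a host has the Gateway role and client configurations deployed, a job can be submitted to YARN from it roughly like this (a minimal sketch for the Spark 1.x shipped with CDH 5.2; the class name and jar path are placeholders, and newer Spark versions use --master yarn --deploy-mode cluster instead):
spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar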

Apache Drill - Slow Queries

I have the following storage plugin set up in Drill:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://hivemetastore.hostname.com:9083",
    "hive.metastore.sasl.enabled": "false"
  }
}
However, a simple
SELECT * FROM hive.table LIMIT 5;
...
5 rows selected (35.383 seconds)
0: jdbc:drill:>
is taking over 30 seconds to respond. What am I missing / where should I begin troubleshooting?
The Hive metastore server is on the same machine as Drill right now, and there are fewer than 20,000 records in the table.
Only MapR Drill on the MapR sandbox should use the sparse storage plugin config you are using. In the sandbox, things are configured under the covers.
EMBEDDED METASTORE SERVICE
Assuming you are using a Drill installation, not the sandbox, and you're using the embedded metastore service (the default), configProps needs to look something like this (per the docs):
"configProps": {
"hive.metastore.uris": "",
"javax.jdo.option.ConnectionURL": "jdbc:<database>://<host:port>/<metastore database>",
"hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
"fs.default.name": "file:///",
"hive.metastore.sasl.enabled": "false"
}
Remove the "hive.metastore.uris": "thrift://:" from your storage plugin config. That is for use with a remote hive metastore service.
The "javax.jdo.option.ConnectionURL" might be a MySQL database. The Hive metastore service provides access to the physical DB like MySQL. MySQL stores the metadata. The "fs.default.name" is the file system location where the data is located.
The embedded metastore configuration is for testing only, and not for use in production systems per the docs. For performance improvement, please configure the remote metastore. Also, check the compatibility of the version of Hive you are using. Open source Apache Drill 1.0 supports Hive 0.13. Drill 1.1 and later supports Hive 1.0.
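As a hypothetical filled-in example (the MySQL host, database name, and warehouse path below are placeholders, and the MySQL JDBC driver must be available to Drill), the full embedded-metastore plugin config could look like:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/hive_metastore",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}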
REMOTE METASTORE SERVICE
If you're using the remote metastore, the "fs.default.name" should point to the main control node. Point to a NameNode, for example. If you're using MapR Drill, "fs.default.name" should be maprfs:///. The MapR FileClient figures out CLDB locations from mapr-clusters.conf. Start the metastore service, which is installed on top of Hive as a separate package:
hive --service metastore
The remote metastore config should look something like this if you're using open source Apache Drill:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://mfs41.mystore:9083",
    "hive.metastore.sasl.enabled": "false",
    "fs.default.name": "maprfs://10.10.10.41/"
  }
}

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how master and slave nodes communicate, and the various configurations. I already checked the AWS EMR documentation but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, it will ask you to specify master and core nodes; the default settings provision one master and two core nodes for you. You can also specify which applications you want in the cluster (e.g., Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all the services. You can click on these services and access them via the web, or by SSHing into the master. For example, to access the Ambari interface, go to the service within EMR and click it; a new window will open with the Ambari monitoring interface.
Installing these applications is very easy; all you have to do is specify the services during cluster creation, as sketched below.
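For instance, with the AWS CLI the same thing can be sketched as follows (the release label, instance type/count, and key name are placeholders):
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles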
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, the Hadoop provided with EMR can be a few releases behind. You should check the EMR release notes for the detailed list of applications provided with each version. EMR takes care of application provisioning, setup, and configuration. Based on the EC2 instance type, the Hadoop (and other application) configuration will change. You can override the default settings using application configurations, as sketched below.
Other than this, the Hadoop you have on premises and the one on EMR should be the same.
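As a sketch of such an override, a configurations file (the classification and property here are only illustrative) can be passed at cluster creation with --configurations file://configurations.json:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.memory-mb": "12288"
    }
  }
]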

Does Mesos provide its cluster management UI as an OSS project?

I loved the DCOS demos on Azure. Now I wonder: having a private OpenStack-based cloud, how do I install Mesos with that UI manually? Is it possible, or is it a part of DCOS that they do not provide as an open-source product?
The DCOS Dashboard is pretty cool :-). Currently it is only available via the DCOS beta on AWS and Azure. There will be on-prem packages later on as well, potentially even a community edition. Feel free to contact/follow Mesosphere for updates.
Until then you can use the standard Mesos, Marathon, and Chronos UIs as Alex pointed out.
You can use the Mesos and Marathon web UIs; by default they are available on ports 5050 and 8080 respectively.

Scheduling A Job on AWS EC2

I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this functionality. I've considered the following:
1) Generate a request to the web server that triggers it to do this task
I don't like this approach because it ties up a server thread and uses cpu cycles on the host
2) Create a cron job on the machine the web server is running on to execute this task
Again, I don't like this approach because it takes cpu cycles away from the web server
3) Create another EC2 instance and set up a cron job to run the task
This solves the web server resource issues, but why pay for an additional EC2 instance to run a job for <5 minutes? Waste of money!
Are there any other options? Is this a job for ElasticMapReduce?
If I were in your shoes, I'd probably start by trying to run the cron job on the web server each night at low tide and monitor the resource usage to make sure it doesn't interfere with the web server.
If you find that it doesn't play nicely, or you have high standards for the elegance of your architecture (I can admire that), then you'll probably need to run a separate instance.
I agree that it seems like a waste to run an instance 24 hours a day for a job you only need to run once a night.
Here's one approach: the cron job on your primary machine (currently a web server) could fire up a new instance to run the task. It could pass in a user-data script that runs when the instance starts, and the instance could shut itself down when it completes the task (with instance-initiated-shutdown-behavior set to "terminate").
Unfortunately, this misses your desire to enforce separation of concerns, it gets complicated when you start scaling to multiple web servers, and it requires your web server to be alive in order for the job to run.
A couple months ago, I came up with a different approach to run an instance on a cron schedule, relying entirely on existing AWS features and with no requirement to have other servers running.
The basic idea is to use Amazon's Auto Scaling with a recurring action that scales the group from "0" to "1" at a specific time each night. The instance can terminate itself when the job is done, and the Auto Scaling can clean up much later to make sure it's terminated.
I've provided more details and a working example in this article:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
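A minimal sketch of that recurring scale-up with the current AWS CLI (the group name and schedule are placeholders; the instance's user-data script does the work and terminates the instance when done):
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name nightly-sitemap-asg \
  --scheduled-action-name nightly-run \
  --recurrence "0 2 * * *" \
  --min-size 0 \
  --max-size 1 \
  --desired-capacity 1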
Amazon has just released[1] new features for Elastic Beanstalk. You can now create a worker environment containing a cron.yaml that configures scheduled tasks calling a URL using cron syntax: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks
[1] http://aws.amazon.com/about-aws/whats-new/2015/02/17/aws-elastic-beanstalk-supports-environment-cloning-periodic-tasks-and-1-click-iam-role-creation/
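A minimal cron.yaml for such a worker environment might look like this (the task name, URL, and schedule below are placeholders; the worker daemon POSTs to that URL on the given schedule):
version: 1
cron:
 - name: "generate-sitemap"
   url: "/tasks/generate-sitemap"
   schedule: "0 2 * * *"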
Assuming you are running on a *nix version of EC2, I would suggest that you run it in cron using the nice command.
nice changes the priority of the job. You can make it a much lower priority, so if your webserver is busy, the cron job will have to wait for the CPU.
The higher the nice number, the lower the priority.
Nicenesses range from -20 (most favorable scheduling) to 19 (least favorable).
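For example, a crontab entry along these lines would run the job nightly at 02:00 with the lowest priority (the script path is a placeholder):
0 2 * * * nice -n 19 /home/ec2-user/generate_sitemap.sh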
AWS DataPipeline
You can use AWS Data Pipeline to schedule a task with a given period. The action can be any command when you configure your Pipeline with the ShellCommandActivity.
You can even use your existing EC2 instance to run the command: set up Task Runner on your EC2 instance and set the workerGroup field on the ShellCommandActivity (doc) in your DataPipeline:
{
  "pipelineId": "df-0937003356ZJEXAMPLE",
  "pipelineObjects": [
    {
      "id": "Schedule",
      "name": "Schedule",
      "fields": [
        { "key": "startDateTime", "stringValue": "2012-12-12T00:00:00" },
        { "key": "type", "stringValue": "Schedule" },
        { "key": "period", "stringValue": "1 hour" },
        { "key": "endDateTime", "stringValue": "2012-12-21T18:00:00" }
      ]
    },
    {
      "id": "DoSomething",
      "name": "DoSomething",
      "fields": [
        { "key": "type", "stringValue": "ShellCommandActivity" },
        { "key": "command", "stringValue": "echo hello" },
        { "key": "schedule", "refValue": "Schedule" },
        { "key": "workerGroup", "stringValue": "yourWorkerGroup" }
      ]
    }
  ]
}
Limits: Minimum scheduling interval is 15 minutes.
Pricing: About $1.00 per month.
You should consider CloudWatch Events and Lambda (http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html). You only pay for the actual runs. I assume the workers maintained by Elastic Beanstalk still cost some money even when they are idle.
Update: found this nice article (http://brianstempin.com/2016/02/29/replacing-the-cron-in-aws/)
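A rough sketch with the AWS CLI, assuming a Lambda function (hypothetically named generate-sitemap) already exists; the region and account ID are placeholders:
# Rule that fires every night at 02:00 UTC
aws events put-rule \
  --name nightly-sitemap \
  --schedule-expression "cron(0 2 * * ? *)"

# Allow CloudWatch Events to invoke the function
aws lambda add-permission \
  --function-name generate-sitemap \
  --statement-id nightly-sitemap-event \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/nightly-sitemap

# Attach the function as the rule's target
aws events put-targets \
  --rule nightly-sitemap \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:generate-sitemap"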
If this task can be accomplished with one machine, I recommend booting up an instance programmatically using the fog gem, written in Ruby.
After you start an instance, you can run a command via ssh. Once completed you can shutdown with fog as well.
Amazon EMR is also a good solution if your task can be written in a MapReduce manner. EMR will take care of starting/stopping instances. The elastic-mapreduce-ruby CLI tool can help you automate it.
You can use AWS OpsWorks to set up cron jobs for your application. For more information, read the AWS OpsWorks user guide. I found a page explaining how to set up cron jobs: http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-extend-cron.html
