Running Hadoop/Storm tasks on Apache Marathon - mesos

I recently came across Apache Mesos and successfully deployed my Storm topology over Mesos.
I want to try running Storm topology/Hadoop jobs over Apache Marathon (had issues running Storm directly on Apache Mesos using mesos-storm framework).
I couldn't find any tutorial/article that could list steps how to launch a Hadoop/Spark tasks from Apache Marathon.
It would be great if anyone could provide any help or information on this topic (possibly a Json job definition for Marathon for launching storm/hadoop job).
Thanks a lot

Thanks for your reply, I went ahead and deployed a Storm-Docker cluster on Apache Mesos with Marathon. For service discovery I used HAProxy. This setup allows services (nimbus or zookeeper etc) to talk to each other with the help of ports, so for example adding multiple instances for a service is not a problem since the cluster will find them using the ports and loadbalance the requests between all the instances of a service. Following is the GitHub project which has the Marathon recipes and Docker images: https://github.com/obaidsalikeen/storm-marathon

Marathon is intended for long-running services, so you could use it to start your JobTracker or Spark scheduler, but you're better off launching the actual batch jobs like Hadoop/Spark tasks on a batch framework like Chronos (https://github.com/airbnb/chronos). Marathon will restart tasks when the complete/fail, whereas Chronos (a distributed cron with dependencies) lets you set up scheduled jobs and complex workflows.
While a little outdated, the following tutorial gives a good example.
http://mesosphere.com/docs/tutorials/etl-pipelines-with-chronos-and-hadoop/

Related

Can I run mesos/marathon application at specific host?

I wanna use marathon as cluster monitoring and management. Bellow scenario is possible?
My Scenario
Cassandra 5EA was already deployed and are running.
Cassandra hosts are physical machine.
I want to run script that verifies healthness of cassandra each host. ex) cassandra process, disk usage, number of file, ..
If problem found at host, than run correcting script on that host. Script launched manually.
Each script can be run by marathon application. But I couldn't found run application on (specific) error host.
No restriction of adding machines and installing mesos components.
And if you know more suitable tool, please recommend!!
If you are not running Cassandra on Mesos I think Marathon is not the best choice. From your description, it looks like you need a monitoring tool (e.g., Nagios) rather than service Orchestration.
Please extend your question with more information. It's not clear what you are asking.

Airflow setup for high availability

How to deploy apache airflow (formally known as airbnb's airflow) scheduler in high availability?
I am not asking about the backend DB or RabbitMQ that should obviously be deployed in high availability configuration.
My main focus is the scheduler - is there something special needs to be done?
After a bit digging I found that it is not safe to run multiple schedulers simoultanously, this means that out of the box - airflow schedulers are not safe to use in high availablity environments.
The airflow team are planning to solve this issue by adding a lock mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they schedule the same dag instances which is not good).
This is described here:
https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
I did found a way to workaround this high availalbilty issue by wrapping the schedulers with my own code and use cluster tools for leader election (I personanlly use consul for this purpose). This way only the elected master is running the scheduler and when the master is down the slave replaces him.
Please consider this when u use airflow in high availabilty environments since out of the box, airflow scheduler is currently not suitable for this (unless you solve this issue yourself).
Edit - an alternative approach to the master slave solution is to use a cluster manager/scheduler to make sure that only one airflow scheduler instance is always available. This approach relies on the self healing abilities of the cluster manager u have. For example both mesos and nomad supports this kind of configuration (I presonally chose nomad for its simplicity).
My personal experience was to follow the instructions I found for some best practices; that is to restart the scheduler every 10 runs ( -N 10 ) and use this software when possible:
https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
I also use a DAG which pings a monitoring system to be sure that the scheduler has not gone away.
In my scenario, I have 2 schedulers (on 2 separate docker swarms), with the standby cluster scheduler turned off (using docker swarm service scale=0). I needed to make sure the primary scheduler had stopped fully before I started up the standby scheduler. What I found was that having 2 running schedulers (even for a brief time period) resulted in an occasional DAG scheduled to run on both clusters leading to duplicate reports generated from two different cluster zone.

Is spark standalone scheduler or Yarn scheduler better for a Cloudera 5.4 hadoop cluster?

In regards to being able to run machine learning jobs with Spark. Which is a better choice the Yarn scheduler or the Spark Standalone scheduler?
There is no difference when it comes to run the actual spark job.
Yarn/Mesos helps you to schedule resources if you have different spark applictions running and/or other components running in your cluster (which support Yarn/Mesos of course).
The Spark standalone cluster cannot manage resources. That is if you start a Spark application and it uses all the ressources, the second application will not find any resources left. That means you have to do this by yourself (e.g. adapting Spark config accordingly)

Provision to start group of applications on same Mesos slave

I have cluster of 3 Mesos slaves, where I have two applications: “redis” and “memcached”. Where redis depends on memcached and the requirement is both of the applications/services should start on same node instead of different slave nodes.
So I have created the application group and added the dependency properly in the JSON file. After launching the JSON file via “v2/groups” REST API, I observe that sometime both application group will start on same node but sometimes it will start on different slaves which breaks our requirement.
So intent/requirement is; if any application fails to start on a slave both the application should failover to other slave node. Also can I configure the JSON file to tell Marathon to start the application group on slave-1 (specific slave first) if it is available else start it on other slave in a cluster. Due to some reason if this application group will start on other slave can Marathon relaunch the application group to slave-1 if it is available to serve the request.
Thanks in advance for help.
Edit/Update (2):
Mesos, Marathon, and DC/OS support for PODs is available now:
DC/OS: https://dcos.io/docs/1.9/usage/pods/using-pods/
Mesos: https://github.com/apache/mesos/blob/master/docs/nested-container-and-task-group.md
Marathon: https://github.com/mesosphere/marathon/blob/master/docs/docs/pods.md
I assume you are talking about marathon apps.
Marathon application groups don't have any semantics concerning co-location on the same node and the same is the case for dependencies.
You seem to be looking for a Kubernetes like Pod abstraction in marathon, which is on the roadmap but not yet available (see update above :-)).
Hope this helps!
I think this should be possible (as a workaround) if you specify the correct app contraints within the group's JSON.
Have a look at the example request at
https://mesosphere.github.io/marathon/docs/generated/api.html#v2_groups_post
and the constraints syntax at
https://mesosphere.github.io/marathon/docs/constraints.html
e.g.
"constraints": [["hostname", "CLUSTER", "slave-1"]]
should do. Downside is that there will be no automatic failover to another slave that way. Still, I'd be curious why both apps need to specifically run on the same slave node...

What is best practice to monitor storm cluster with some UI tool like Cloudera Manager?

This time, we build a storm cluster and don't have a tool like cloudera manager to monitor the status of cluster except the storm ui, and send alert notice when the cluster is in bad status.
Please write linux scripts with some basic storm commands and zookeeper commands to ensure the health of your cluster.

Resources