What is the command to list all jobs in Apache Aurora scheduler? - apache-aurora

I have set up an Apache Aurora cluster and managed to schedule hello world tasks.
What command should I use to list/view the tasks in the cluster?

Apache Aurora client
Apache Aurora has a client. Among other things, it can be used to list jobs running on the cluster:
aurora job list cluster[/role[/env[/name]]]
Example
For example, to list all jobs scheduled in a cluster named clustertest, the command
aurora job list clustertest
returns the list of jobs:
clustertest/root/test/python-flask-api
clustertest/root/test/rabbitpy
clustertest/root/devel/python-hello-world
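The job key accepts partial prefixes, so the same command can be narrowed to a role or environment. As a sketch based on the key format and example output above (the cluster, role, and environment names are taken from that output), listing only the jobs in the test environment of the root role would look like:
aurora job list clustertest/root/test
which, given the output above, should return just the two test-environment jobs, clustertest/root/test/python-flask-api and clustertest/root/test/rabbitpy.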

Related

How can I setup Rstudio, sparklyR on an auto scale cluster managed by slurm?

I have an AWS HPC auto-scaling cluster managed by Slurm. I can submit jobs using sbatch; however, I want to use sparklyr on this cluster so that Slurm increases the cluster size based on the workload of the sparklyr code in the R script. Is this possible?
Hi Amir, is there a reason you are using Slurm here? sparklyr has better integration with Apache Spark, and it would be advisable to run it over a Spark cluster. You can follow this blog post for the steps to set this up with Amazon EMR, AWS's managed service for running Spark clusters: https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/

How to launch an EMR cluster in AWS Data Pipeline only after completion of an activity in the pipeline flow

Is it possible to launch an EMR cluster only after completion of one of my activities in the AWS Data Pipeline flow? The flow is:
Unload some data from Redshift (which might take an hour or more).
Start the EMR cluster.
Execute a Spark job in the EMR cluster.
Execute some other activity.
Terminate the cluster.
So I want dependencies like "Start EMR Cluster" depends on "Unload data from Redshift to S3", and "Terminate Cluster" depends on "Execute Spark job" and "Execute some other activity".
Can someone help me with this?
-Krish
You can do this by using a precondition. See the documentation for more details:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-preconditions.html
You can also model the unload as a copy operation and attach a precondition to the downstream objects, so that the dependency is expressed as a precondition: once it is satisfied, the EMR cluster is created.
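As a minimal sketch of what that could look like in a pipeline definition file (the object ids and the S3 path are hypothetical, and required fields such as the schedule are omitted; dependsOn, precondition, S3KeyExists, EmrCluster, and EmrActivity are standard Data Pipeline concepts), the EMR activity can depend on the Redshift unload and/or carry an S3KeyExists precondition on the unloaded output:
{
  "objects": [
    { "id": "RedshiftUnload", "type": "RedshiftCopyActivity" },
    { "id": "UnloadDone",     "type": "S3KeyExists",
      "s3Key": "s3://my-bucket/unload/_SUCCESS" },
    { "id": "MyEmrCluster",   "type": "EmrCluster" },
    { "id": "RunSparkJob",    "type": "EmrActivity",
      "runsOn":       { "ref": "MyEmrCluster" },
      "dependsOn":    { "ref": "RedshiftUnload" },
      "precondition": { "ref": "UnloadDone" } }
  ]
}
Because a resource such as the EmrCluster is only started when an activity that runs on it becomes runnable, the cluster should effectively be launched only after the unload has finished.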

Apache Airflow Distributed Processing

I am confused about the architecture of Apache Airflow.
As far as I know, when you execute an HQL or Sqoop statement in Oozie, Oozie directs the request to the data nodes.
I want to achieve the same thing in Apache Airflow: execute a shell script, HQL, or Sqoop command, and be sure that it is executed in a distributed fashion across the data nodes.
Airflow has different executor types. What should I do to run commands on different data nodes concurrently?
It seems you want to execute your tasks on distributed workers. In that case, consider using CeleryExecutor.
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery settings.
See: https://airflow.apache.org/configuration.html#scaling-out-with-celery
Oozie is tightly coupled to the Hadoop nodes, and all scripts need to be uploaded to HDFS, whereas Airflow with the CeleryExecutor has a more flexible architecture. With the CeleryExecutor, the same script or HQL can be executed concurrently on multiple nodes, or on specific nodes, by routing tasks to the right queues and having particular workers listen only to those queues.
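As a rough sketch of that setup (the queue names and broker URL are hypothetical, the exact config key names vary slightly between Airflow versions, and the CLI shown is the Airflow 1.x-era form; newer releases use "airflow celery worker"), you point airflow.cfg at the CeleryExecutor, start workers on the nodes that should run the work, and route tasks via the operator's queue argument:
# airflow.cfg
#   [core]    executor   = CeleryExecutor
#   [celery]  broker_url = redis://my-redis-host:6379/0
#
# On each node that should execute work, start a worker bound to the
# queues that node is meant to serve:
airflow worker --queues hive,sqoop     # node A: Hive/Sqoop work
airflow worker --queues shell          # node B: shell scripts
#
# In the DAG, route a task to a queue with the operator's queue argument,
# e.g. BashOperator(task_id='run_hql', bash_command='hive -f my.hql', queue='hive')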

How to add Spark worker nodes on cloudera with Yarn

We have Cloudera 5.2, and the users would like to start using Spark to its full potential (in distributed mode, so it can take advantage of data locality with HDFS). The service is already installed and shows as available on the Cloudera Manager home page, but when clicking the service and then "Instances", it only shows a History Server role and, on other nodes, a Gateway role.
From my understanding of Spark's architecture, you have a master node and worker nodes (which live together with the HDFS datanodes), so in Cloudera Manager I tried "Add Role Instances", but only the "Gateway" role is available. How do you add Spark's worker (or executor) role to the hosts where you have HDFS datanodes? Or is that unnecessary (I think so, because with YARN, YARN takes charge of creating the executors and the application master)? And what about the master node? Do I need to configure anything so the users can use Spark in its fully distributed mode?
The master and worker roles are part of the Spark Standalone service. You can either run Spark on YARN (in which case Master and Worker roles are irrelevant) or use Spark (Standalone).
As you have started the Spark service instead of Spark (Standalone) in Cloudera Manager, Spark is already using YARN. In Cloudera Manager 5.2 and higher, there are two separate Spark services (Spark and Spark (Standalone)). The Spark service runs Spark as a YARN application with only gateway roles in addition to the Spark History Server role.
How do you add Spark's worker (or executor) role to the hosts where you have HDFS datanodes?
Not required. Only Gateway roles are required on these hosts.
Quoting from CM Documentation:
In Cloudera Manager, Gateway roles take care of propagating client configurations to the other hosts in your cluster. So, ensure that you assign the Gateway roles to hosts in the cluster. If you do not have Gateway roles, client configurations are not deployed.
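To see this working end to end, a job can be submitted from any host holding a Gateway role; YARN then launches the application master and executors on the NodeManager/DataNode hosts. A minimal sketch (the jar path is hypothetical and depends on the CDH parcel layout, and the older Spark release bundled with CDH 5.2 may expect --master yarn-cluster instead of the two flags shown):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 100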

Running Hadoop/Storm tasks on Apache Marathon

I recently came across Apache Mesos and successfully deployed my Storm topology over Mesos.
I want to try running Storm topologies/Hadoop jobs over Apache Marathon (I had issues running Storm directly on Apache Mesos using the mesos-storm framework).
I couldn't find any tutorial/article that lists the steps to launch Hadoop/Spark tasks from Apache Marathon.
It would be great if anyone could provide help or information on this topic (possibly a JSON job definition for Marathon for launching a Storm/Hadoop job).
Thanks a lot
Thanks for your reply. I went ahead and deployed a Storm-on-Docker cluster on Apache Mesos with Marathon. For service discovery I used HAProxy. This setup allows services (Nimbus, ZooKeeper, etc.) to talk to each other via ports, so, for example, adding multiple instances of a service is not a problem, since the cluster will find them using the ports and load-balance requests between all the instances of a service. The following GitHub project has the Marathon recipes and Docker images: https://github.com/obaidsalikeen/storm-marathon
Marathon is intended for long-running services, so you could use it to start your JobTracker or Spark scheduler, but you're better off launching actual batch jobs like Hadoop/Spark tasks on a batch framework such as Chronos (https://github.com/airbnb/chronos). Marathon will restart tasks when they complete or fail, whereas Chronos (a distributed cron with dependency support) lets you set up scheduled jobs and complex workflows.
While a little outdated, the following tutorial gives a good example.
http://mesosphere.com/docs/tutorials/etl-pipelines-with-chronos-and-hadoop/
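For the JSON job definition the question asks about, a minimal Marathon app can be posted to the /v2/apps endpoint. The sketch below assumes a long-running service such as Nimbus (the hostname, command, and resource sizes are hypothetical); one-shot Hadoop/Spark batch jobs are better expressed as Chronos jobs, as noted above:
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
        "id": "/storm/nimbus",
        "cmd": "storm nimbus",
        "cpus": 1.0,
        "mem": 2048,
        "instances": 1
      }'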
