IPython parallel management tool for cluster/engines/tasks - parallel-processing

How can I manage an IPython parallel cluster?
I've successfully created a cluster with about 35 engines running on several machines with varying hardware/OS. After a few days of executing several tasks, I would like to be able to view the state of my cluster. More specifically, I would like to know:
Which engines are (still) running, and on what machines?
What resources did/do those engines consume (memory/CPU) on the hosts?
Which tasks are currently executing/have executed on which engine, and what was the result/status of those tasks (success/failure)?
Preferably I would like to use some kind of (web-based) user interface to answer those questions.
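For reference, much of this can be polled programmatically. A minimal sketch using the IPython.parallel (now ipyparallel) client API, assuming the controller is reachable from the client; the psutil-based resource probe is an extra assumption, not part of IPython itself:

```python
# Sketch: inspecting an IPython parallel cluster from the client side.
# Assumes a running controller; psutil must be installed on the engines.
from ipyparallel import Client
import socket

rc = Client()                      # connect to the controller
print("Registered engine ids:", rc.ids)

# Which engines run on which machines?
hostmap = rc[:].apply_async(socket.gethostname).get_dict()
print(hostmap)                     # {engine_id: hostname, ...}

# Rough per-engine resource usage, via psutil running inside each engine.
def usage():
    import os, psutil
    p = psutil.Process(os.getpid())
    return p.memory_info().rss, p.cpu_percent(interval=0.1)

print(rc[:].apply_async(usage).get_dict())

# Task accounting: pending/completed counts per engine, plus the hub's
# task database (works when the controller keeps task history).
print(rc.queue_status())
records = rc.db_query({}, keys=['msg_id', 'engine_uuid', 'completed'])
print(records[:5])
```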

Related

Manually launch OpenMPI jobs

I have a cluster that does not allow direct ssh access, but does permit me to submit commands through a proprietary interface.
Is there any way to either manually launch OpenMPI jobs, or documentation on how to write a custom launcher?
I don't think you can do it without breaking some kind of agreement.
I assume you have some kind of web-based interface that allows you to fill in certain fields and maybe upload data, or something similar. What this interface probably does is generate a request/file for a scheduler, most likely SGE or PBS. Direct access to the cluster is limited in order to:
organize task priorities and order
prevent users from hogging the machines
make it easier to launch complicated tasks requiring complicated machine(s) configuration
So you, effectively, want to go around the scheduler. I don't think you can, or that you should.
However, clusters usually have so-called head nodes that do allow SSH access. These nodes serve as a place to submit scheduler requests and, maybe, do small compilation/result processing (with very limited resources). Such a configuration eliminates the web interface but still keeps the scheduler, which is very important for a cluster used by many people concurrently.

Quartz with centralized scheduling and monitoring

We are trying to revamp our batch job scheduling and monitoring process over the entire enterprise. Currently all our batch jobs are scheduled using Unix crontab and are monitored using log files generated by shell scripts.
This process has a lot of disadvantages, and as the number of applications grows it gets really complicated.
Two copies of each application need to be deployed: one to the app server and one standalone (since business logic is shared between both). This complicates our build process too.
There is no easy-to-use web UI for us to see the status of jobs and manually re-run failed jobs remotely without getting onto the Unix box.
There is no failover or load-balanced batch processing.
So I was thinking of using Quartz (with our existing Spring apps) in our applications, deploying them to the app servers, and no longer relying on Unix crontab.
Is there a way I can write a centralized web application from where I can schedule and monitor jobs running on different quartz schedulers on different app servers?
P.S: I know quartzdesk.com is one solution, but I don't want to enable RMI on my JVM.
You could use a Spring Boot scheduler as an orchestrator and call REST APIs for the remote (or local, if you are small) execution. This way, as your app grows, you can easily leverage a load balancer.
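As a language-agnostic illustration of that orchestrator pattern (sketched here in Python rather than Spring Boot for brevity), the central piece just fires REST calls on a schedule and records the reported status. The endpoint paths and job names below are hypothetical; they must match whatever your app servers actually expose:

```python
# Sketch of the orchestrator pattern: one central scheduler triggers jobs
# on remote app servers over REST and records their status.
import time
import requests
import schedule  # pip install schedule

# Hypothetical job endpoints exposed by the app servers.
JOB_ENDPOINTS = {
    "nightly-billing": "http://app-server-1:8080/jobs/billing/run",
    "report-export":   "http://app-server-2:8080/jobs/reports/run",
}

def trigger(name, url):
    try:
        resp = requests.post(url, timeout=30)
        print(name, "->", resp.status_code, resp.text[:200])
    except requests.RequestException as exc:
        print(name, "FAILED:", exc)   # hook alerting/retry logic in here

schedule.every().day.at("02:00").do(trigger, "nightly-billing",
                                    JOB_ENDPOINTS["nightly-billing"])
schedule.every().hour.do(trigger, "report-export",
                         JOB_ENDPOINTS["report-export"])

while True:
    schedule.run_pending()
    time.sleep(30)
```

The same status endpoints then give you the centralized web view of job results, without RMI.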
If you have the possibility of using cloud services (like Amazon, Azure, or Google Cloud), this can be done easily using their own load balancers. They also support Docker and can take care of any utilization peaks.

deploy bolts/spout to a specific supervisor

We are running a Storm application using a single instance type on AWS and a single topology to run our system.
This is causing some resource limitation issues.
The way we want to address this is by splitting our IO-intensive bolts across a cluster of a few dozen t1.small machines (for example) and all our CPU-intensive bolts onto two large machines with lots of CPU and memory.
Basically, what I am asking is: is there a way to start all these supervisors and then deploy one topology that places the CPU-intensive bolts on the big machines and the IO-intensive bolts on the small machines?
You can implement a custom scheduler using the IScheduler interface.
See:
http://www.exogeni.net/2015/04/enabling-site-aware-scheduling-for-apache-storm-in-exogeni/
https://dcvan24.wordpress.com/2015/04/07/metadata-aware-custom-scheduler-in-storm/
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/DemoScheduler.java

Confused on Mesos Terminologies

I went through the video on the introduction of DCOS. It was good, but it left me somewhat confused about the classification of components in Mesosphere.
1) I get that DCOS is an ecosystem and Mesos is like a kernel. Please correct me if I am wrong. For example, it's like Ubuntu and the Linux kernel, I presume.
2) What is Marathon? Is it a service or a framework, or is it something else that falls in neither category? I am a bit confused about service vs. framework vs. application vs. task definitions in Mesosphere's context.
3) Can the services (Cassandra, HDFS, Kubernetes, etc.) that he launches in the video safely also be called frameworks?
4) From 3, are these "services" running as executors on the slaves?
5) What should the rails-app's type be here? Is it a task? So will it also have an executor?
6) Who makes the decision to autoscale the rails-app to more nodes when traffic increases, using Marathon?
1) I get that DCOS is an ecosystem and Mesos is like a kernel. Please correct me if I am wrong. For example, it's like Ubuntu and the Linux kernel, I presume.
Correct!
2) What is Marathon? Is it a service or a framework, or is it something else that falls in neither category? I am a bit confused about service vs. framework vs. application vs. task definitions in Mesosphere's context.
In Apache Mesos terminology, Marathon is a framework. Every framework consists of a framework scheduler and an executor; many frameworks reuse the standard executor rather than providing their own. An app is a Marathon-specific term, meaning the long-running task you launch through it. A task is the unit of execution, running on a Mesos agent (in an executor). In DC/OS (the product; Mesosphere is our company) we call frameworks in general services. Also, in the context of DC/OS, Marathon plays a special role: it acts as a sort of distributed init system, launching other services such as Spark or Kafka.
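To make the app/task distinction concrete, here is a hedged sketch of launching a long-running app through Marathon's REST API (the Marathon URL and app definition are placeholders); Marathon, as the framework scheduler, then keeps the resulting Mesos tasks running:

```python
# Sketch: submitting an "app" to Marathon via its REST API.
# Marathon turns this app definition into Mesos tasks on the agents.
# The URL below is a placeholder for your Marathon endpoint.
import requests

app = {
    "id": "/rails-app",
    "cmd": "bundle exec rails server -p $PORT",
    "cpus": 0.5,
    "mem": 512,
    "instances": 3,          # three Mesos tasks, spread over agents
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()
print(resp.json())
```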
3) Can the services (Cassandra, HDFS, Kubernetes, etc.) that he launches in the video safely also be called frameworks?
See above.
4) From 3, are these "services" running as executors on the slaves?
No. See above.
5) What should the rails-app's type be here? Is it a task? So will it also have an executor?
The Rails app may have one or more (Mesos) tasks running in executors on one or more agents.
6) Who makes the decision to autoscale the rails-app to more nodes when traffic increases, using Marathon?
Not nodes, but instances of the app. Also, as #air suggested, autoscaling with Marathon is simple; see also this autoscaling example.
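A hedged sketch of what such autoscaling boils down to: adjusting the app's instance count through Marathon's REST API based on some load signal (the load check and Marathon URL here are stand-ins):

```python
# Sketch: naive autoscaling by adjusting the Marathon app's instance count.
# get_current_load() is a stand-in for a real metric (CPU, queue depth, ...).
import requests

MARATHON = "http://marathon.example.com:8080"   # placeholder endpoint
APP_ID = "/rails-app"

def get_current_load():
    return 0.9   # placeholder metric in [0, 1]

app = requests.get(f"{MARATHON}/v2/apps{APP_ID}").json()["app"]
instances = app["instances"]

if get_current_load() > 0.8:
    # Scale out by one instance; Marathon launches the extra Mesos task.
    requests.put(f"{MARATHON}/v2/apps{APP_ID}",
                 json={"instances": instances + 1})
```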

MapReduce on AWS

Anybody played around with MapReduce on AWS yet? Any thoughts? How's the implementation?
It's easy to get started.
Here's a FAQ: http://aws.amazon.com/elasticmapreduce/faqs/
And here's the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
If you have an EC2 account already, you can enable MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.
I did the pre-packaged Word Count sample application, which returns a count of each word contained in about 20 MB of text. You can provision up to 20 instances to run concurrently, though I just used 2 instances and the job completed in about 3 minutes.
The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.
I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.
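The same kind of job flow can also be scripted instead of clicked through the console. A minimal sketch with boto3 (the bucket names, script locations, and instance types are placeholders; the mapper/reducer follow the streaming word-count pattern):

```python
# Sketch: launching a streaming word-count job flow on EMR with boto3.
# Bucket names and script locations are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="wordcount-example",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when done
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-files", "s3://my-bucket/wordSplitter.py",
                     "-mapper", "wordSplitter.py",
                     "-reducer", "aggregate",
                     "-input", "s3://my-bucket/input/",
                     "-output", "s3://my-bucket/output/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```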
Be aware that, since AWS charges for a full hour when an instance is created, and since the MapReduce instances are automatically terminated at the end of the job flow, the cost of multiple fast-running job flows can add up quickly.
For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run the job flow 3 more times, I'll be charged for 80 hours of machine time even though I only had 20 instances running for 1 hour.
You also have the possibility to run MapReduce (Hadoop) on AWS with StarCluster. This tool configures the cluster for you, and it has the advantage that you don't have to pay the extra Amazon Elastic MapReduce price (if you want to reduce your costs) and you can create your own image (AMI) with your tools (this can be useful if the tools can't be installed by a bootstrap script).
It is very convenient because you don't have to administer your own cluster. You just pay per use so I think it is a good idea if you have a job that needs to run once in a while. We are running Amazon MapReduce just once a month so, for our usage, it is worth it.
However, as far as I can tell, a drawback of Amazon MapReduce is that you can't tell which operating system is running, or even its version. This caused me problems running C++ code compiled with g++ 4.44; some of the OS images do not support the cURL library, etc.
If you don't need any special libraries for your use case, I would say go for it.
Good answer by MB.
To be clear: you can run Hadoop clusters in two ways:
1) Run it on Amazon EC2 instances. This means that you have to install it, configure it, terminate it, etc.
2) Run it using Elastic MapReduce (EMR): this is an automated way to run a Hadoop cluster on Amazon Web Services. You pay a little extra on top of the basic EC2 cost, but you don't need to manage anything: just upload your data, then your algorithm, then crunch. EMR will shut down the instances automatically once your jobs are finished.
Best,
Simone
EMR is the best way to use available resources, with very little added cost over EC2, and you will see how time-saving and easy it is. Most of the MapReduce implementations in the cloud use this model, e.g. Apache Hadoop on Windows Azure, Mortar Data, etc. I have worked with both Amazon EMR and Apache Hadoop on Windows Azure and found them incredible to use.
Also, depending on the type / duration of jobs you plan to run, you can use AWS spot instances with EMR to get better pricing.
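For instance, in a boto3 job-flow definition you can request spot capacity per instance group; a hedged fragment (the instance types and bid price are placeholders) that would replace the Instances= argument of a run_job_flow call like the one sketched earlier:

```python
# Fragment: requesting spot capacity for the core nodes of an EMR cluster.
# Drop this into the Instances= argument of run_job_flow.
Instances = {
    "InstanceGroups": [
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1,
         "Market": "ON_DEMAND"},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": 4,
         "Market": "SPOT", "BidPrice": "0.10"},   # max $/hour per instance
    ],
    "KeepJobFlowAliveWhenNoSteps": False,
}
```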
I am working with AWS EMR. It is pretty neat: once you start up a cluster and log into the master node, you can play around with the Hadoop directory structure and do pretty cool things. If you have an edu account, don't forget to apply for a research grant; they give up to $100 in free credits to use AWS.
AWS EMR is a good choice when you use S3 storage for your data.
It provides out-of-the-box integration with S3 for loading input files and writing processed files back.
In use cases where you need to run a job on demand, you are saved from the cost of running the whole cluster all the time; this really helps you save on instance-hours.
Leveraging the above advantage, one can use AWS Lambda to spawn event-driven clusters, as sketched below.
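A hedged sketch of that pattern: a Lambda handler, triggered by an S3 upload, that spins up a transient EMR cluster to process the new object (the role names, release label, and processing step are placeholders):

```python
# Sketch: an S3-triggered Lambda that launches a transient EMR cluster.
# Role names and the processing step are placeholders.
import boto3

def lambda_handler(event, context):
    # Standard S3 event notification structure.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name=f"process-{key}",
        ReleaseLabel="emr-6.15.0",
        Instances={"MasterInstanceType": "m5.xlarge",
                   "SlaveInstanceType": "m5.xlarge",
                   "InstanceCount": 3,
                   "KeepJobFlowAliveWhenNoSteps": False},  # self-terminating
        Steps=[{"Name": "process-new-object",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit",
                             f"s3://{bucket}/jobs/process.py",
                             f"s3://{bucket}/{key}"]}}],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```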
