How to use sbatch (SLURM) inside Docker on an AWS EC2 instance? - amazon-ec2

I am trying to get OpenFOAM to run on an AWS EC2 cluster using AWS parallelCluster.
One possibility is to compile OpenFOAM. Another is to use a docker container. I am trying to get the second option to work.
However, I am running into trouble understanding how I should orchestrate the various operations. Basically, what I need is:
copy an OpenFOAM case from S3 to the FSx file system on the master node
run the Docker container containing OpenFOAM
perform OpenFOAM operations, some of them using the cluster (running the computation in parallel being the most important one)
I want to put all of this into scripts to make it reproducible, but I am wondering how I should structure the scripts so that SLURM handles the parallel side of things.
My problem at the moment is that the master node shell knows commands such as sbatch, but when I launch Docker to access the OpenFOAM commands, it "forgets" sbatch.
How can I easily expose all SLURM-related commands (sbatch, ...) inside Docker? Is this the correct way to handle the problem?
Thanks for the support

For the first option, there is a workshop that walks you through it:
cfd-on-pcluster.
For the second option, I created a container workshop that uses HPC container runtimes: containers-on-pcluster.
I incorporated a section about GROMACS but I am happy to add OpenFOAM as well. I am using Spack to create the container images. While I only documented single-node runs, we can certainly add multi-node runs.
Running Docker via sbatch is not going to get you very far, because Docker is not a user-land runtime. For more info, see this FOSDEM21 talk about containers in HPC.
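For illustration, here is a rough sketch of what a batch script could look like once the OpenFOAM image has been converted for an HPC runtime such as Singularity/Apptainer. The bucket, case path, image name, task counts, and solver are placeholders, and it assumes the MPI inside the container is compatible with the host's Slurm/PMI setup:

    #!/bin/bash
    #SBATCH --job-name=openfoam-case
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=36
    #SBATCH --exclusive

    # Stage the case from S3 onto the shared FSx file system (paths are placeholders).
    aws s3 sync s3://my-bucket/my-case /fsx/my-case
    cd /fsx/my-case

    # Launch one containerized rank per Slurm task; "openfoam.sif" is a hypothetical image.
    # Assumes the case is already decomposed (decomposePar) to match the total task count.
    srun singularity exec openfoam.sif simpleFoam -parallel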
Cheers
Christian (full disclosure: AWS Developer Advocate HPC/Batch)

Related

Docker container running Mesos cluster and running other docker containers on cluster (using Marathon)

I'm just starting off with Mesos, Docker and Marathon, but I can't find this specific question answered anywhere.
I want to set up a Mesos cluster running on Docker - there are a couple of internet resources to do this, but then I want to run Docker containers on top of Mesos itself. This would then mean Docker containers running inside other Docker containers.
Is there a problem with this? It doesn't intuitively seem right somehow but would seem like it would be really handy to do so. Ideally I want to run Mesos cluster (with Marathon, Chronos etc.) and then run Hadoop within Docker containers on top of that. Is this possible or a standard way of doing things? Any other suggestions as to what good practice is would be appreciated.
Thanks
You should be able to run it, taking care of some issues when running the Mesos (with Docker) containers, like running in privileged mode. Take a look at jpetazzo/dind to see how you can install and run Docker in Docker. Then you can set up Mesos in that container to get one container with Mesos and Docker installed.
There are some references on the Internet similar to what you want to do. Check this article and this project, which I think you will find very interesting.
There are definitely people running Mesos in Docker containers, but you'll need to use privileged mode and set up some volumes if you want Mesos to access the outer Docker binary (see this thread).
Current biggest caveat: don't name your mesos-slave containers "mesos-*" or MESOS-2016 will bite you. See epic MESOS-2115 for other remaining issues related to running mesos-slave in Docker containers.
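As a rough illustration of the privileged-mode-plus-volumes setup mentioned above (the image name, mounts, and flags are placeholders, not a tested configuration), a containerized mesos-slave that drives the host's Docker daemon might be started like this:

    # Sketch only: run the slave privileged and share the host's Docker socket and binary.
    # The container name deliberately avoids the "mesos-*" prefix because of the caveat above.
    docker run -d --privileged \
      --name slave1 \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /usr/bin/docker:/usr/bin/docker \
      -v /sys:/sys \
      example/mesos-slave-image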

Run Go script inside Docker Container or cron job?

I have a Go application deployed with Docker. Other than running the main program, I want to run a periodic job to update my data.
Which is better?
Run the periodic job using concurrency (a channel/goroutine) inside the main program.
Use crontab to register the periodic job on the system, but I don't know how to do this inside Docker.
What is the best way to run a separate cron job, in the Dockerfile or in Docker?
Please help me. Thanks!
If you are developing the application and all you need is basic periodic execution of one "job", I would go ahead and implement it in your app. If things get more complicated, I would build on an image such as https://github.com/phusion/baseimage-docker, which brings support for managing multiple container processes (including cron).
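As a rough sketch of the second route (assuming an image based on phusion/baseimage-docker, where cron is started by the image's init system), dropping a job file into /etc/cron.d during the image build is enough to schedule the updater; the binary path, schedule, and file names are placeholders:

    # Sketch only, e.g. as RUN steps in the Dockerfile.
    # /app/update-data is a hypothetical path to the compiled Go updater.
    echo '*/15 * * * * root /app/update-data >> /var/log/update-data.log 2>&1' \
      > /etc/cron.d/update-data
    chmod 0644 /etc/cron.d/update-data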

Running a script on an AWS server

I have a script that I need to run once a day that requires a lot of memory. I would like to run it on a dedicated amazon box.
Is there some automated way to build a box, download all required software (like Ruby) and then run my script? After the script has run, I would like to shut down the box.
The two options I can think of are:
I am thinking about hacking EMR to do this. (My script is a mapper against an empty directory)
Chef - This seemed like too much for one simple script.
You can set up a new EC2 instance on startup using the official Ubuntu AMIs, the official Amazon Linux AMIs, or any other AMI that supports the concept of a user-data script.
Create a script (bash, Perl, Python, whatever) that starts with #!
Pass this script as the user-data when running the EC2 instance.
The script will automatically be run as root on the first boot.
Here's the article where I introduced the concept of a user-data script:
Automate EC2 Instance Setup with user-data Scripts
http://alestic.com/2009/06/ec2-user-data-scripts
Your user-data script can install the required software, configure it, install your work script, and set up a cron job that runs the work script once a day.
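A minimal sketch of such a user-data script (package names, URLs, and paths are placeholders, and the package manager will be apt-get or yum depending on the AMI):

    #!/bin/bash
    # Hypothetical user-data sketch for an Ubuntu AMI: runs as root on first boot.
    apt-get update && apt-get -y install ruby curl        # install the required software
    curl -o /usr/local/bin/daily-job.rb https://example.com/daily-job.rb   # fetch the work script
    chmod +x /usr/local/bin/daily-job.rb
    # Schedule the work script to run once a day at 03:00.
    echo '0 3 * * * root /usr/local/bin/daily-job.rb' > /etc/cron.d/daily-job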
ENHANCEMENT:
If the installation script doesn't take a long time to run (e.g., under an hour or a few), then you don't even have to run a single dedicated instance 24 hours a day. You can instead use an approach that lets AWS start an instance for you on a regular schedule.
Here's an article I wrote that provides details on this approach with sample commands:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
The general approach is to use Auto Scaling to start an instance with your user-data script on a regular schedule. Your job terminates the instance when it has completed. The key is to suspend Auto Scaling's normal desire to re-start instances that terminate, so that you don't pay for a running instance until the next time your job starts.
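A hedged sketch of that approach using the current AWS CLI (the linked article predates it and used the older Auto Scaling command-line tools); the group name and schedule are placeholders, and dropping the desired capacity back to zero is shown as one way to keep Auto Scaling from re-launching the instance after the job terminates it:

    # Assumes an Auto Scaling group "daily-job" already exists with min-size 0 and max-size 1.
    # 1. Ask Auto Scaling to bring up one instance every day at 03:00 UTC.
    aws autoscaling put-scheduled-update-group-action \
      --auto-scaling-group-name daily-job \
      --scheduled-action-name start-daily-job \
      --recurrence "0 3 * * *" \
      --desired-capacity 1
    # 2. At the end of the job, the instance removes itself and decrements the desired
    #    capacity back to zero so Auto Scaling does not launch a replacement.
    aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" \
      --should-decrement-desired-capacity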

Celery, Resque, or custom solution for processing jobs on machines in my cloud?

My company has thousands of server instances running application code - some instances run databases, others are serving web apps, still others run APIs or Hadoop jobs. All servers run Linux.
In this cloud, developers typically want to do one of two things to an instance:
Upgrade the version of the application running on that instance. Typically this involves a) tagging the code in the relevant subversion repository, b) building an RPM from that tag, and c) installing that RPM on the relevant application server. Note that this operation would touch four instances: the SVN server, the build host (where the build occurs), the YUM host (where the RPM is stored), and the instance running the application.
Today, a rollout of a new application version might be to 500 instances.
Run an arbitrary script on the instance. The script can be written in any language provided the interpreter exists on that instance. E.g. The UI developer wants to run his "check_memory.php" script which does x, y, z on the 10 UI instances and then restarts the webserver if some conditions are met.
What tools should I look at to help build this system? I've seen Celery and Resque and delayed_job, but they seem like they're built for moving through a lot of tasks. This system is under much less load - maybe on a big day a few hundred upgrade jobs might run, and a couple hundred executions of arbitrary scripts. Also, they don't support tasks written in arbitrary languages.
How should the central "job processor" communicate with the instances? SSH, message queues (which one), something else?
Thank you for your help.
NOTE: this cloud is proprietary, so EC2 tools are not an option.
I can think of two approaches:
Set up password-less SSH on the servers, have a file that contains the list of all machines in the cluster, and run your scripts directly using SSH, for example: ssh user@foo.com "ls -la" (see the sketch after this list). This is the same approach used by Hadoop's cluster startup and shutdown scripts. If you want to assign tasks dynamically, you can pick nodes at random.
Use something like Torque or Sun Grid Engine to manage your cluster.
The package installation can be wrapped inside a script, so you just need to solve the second problem, and use that solution to solve the first one :)
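A minimal sketch of the first (SSH) approach; hosts.txt, the user name, and the script name are placeholders, and key-based (password-less) SSH is assumed to be in place on every machine:

    #!/bin/bash
    # Stream a local script to each machine in hosts.txt (one hostname per line) and run it there.
    while read -r host; do
      ssh "user@$host" 'bash -s' < check_memory.sh &   # run in the background for parallelism
    done < hosts.txt
    wait   # block until every host has finished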

Setting up an init script in Amazon AMI

I want to run some scripts at boot of the instance (a server run with Node, and a Redis database).
Since the Amazon AMI is based on Debian, I thought I could use update-rc.d to manage scripts.
However, when I type update-rc.d, it says the command is not found.
What is the correct way of adding a service to the init script?
I know about CloudFront, but that is for the case when one wants to start up the instance for the first time and install some basic programs, right? In my case, I just want my instance to run some programs when it comes back up after a reboot.
The image that I am using is amzn-ami-2011.09.2.x86_64.ext4.
