Manually launch OpenMPI jobs

I have a cluster that does not allow direct ssh access, but does permit me to submit commands through a proprietary interface.
Is there any way to either manually launch OpenMPI jobs, or documentation on how to write a custom launcher?

I don't think you can do it without breaking some kind of agreement.
I assume you have some kind of web-based interface that lets you fill in certain fields and perhaps upload data, or something similar. What this interface most likely does is generate a request/file for a scheduler, typically SGE or PBS. Direct access to the cluster is limited in order to:
organize task priorities and order
prevent users from hogging the machines
make it easier to launch complex tasks that require complicated machine configurations
So, effectively, you want to go around the scheduler. I don't think you can, and I don't think you should.
However, clusters usually have so-called head nodes that do allow SSH access. These nodes serve as the place to submit scheduler requests and, perhaps, to do small compilation/result processing (with very limited resources). Such a configuration bypasses the web interface but still keeps the scheduler, which is essential for a cluster used by many people concurrently.
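For reference, a scheduler request on a PBS/Torque cluster is just a shell script with resource directives, submitted from the head node. A minimal sketch (job name, resource numbers, and program name are placeholders):

```bash
#!/bin/bash
#PBS -N my_mpi_job
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00

cd "$PBS_O_WORKDIR"
# An OpenMPI built with Torque (TM) support reads the allocated
# node list from the scheduler, so no -hostfile is needed here.
mpirun ./my_mpi_program
```

Submitted with qsub job.pbs; the scheduler then decides when and where it runs.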

Related

File sync between n web servers in cluster

There are n nodes in a web cluster. Files may be uploaded to any node and then must be distributed to every other node. This distribution does not have to happen in a transaction (in fact it must not; distributed transactions don't scale), and some latency is acceptable, although it must be minimal. Conflicts can be resolved arbitrarily (typically last write wins), provided that the resolution is also distributed to all nodes, so that eventually all nodes have the same set of files. Nodes can be added and removed dynamically without having to reconfigure existing nodes. There must be no single point of failure and no additional boxes required to solve this (such as RabbitMQ).
I am thinking along the lines of using consul.io for dynamic configuration so that each node can refer to consul to determine what other nodes are available and writing a daemon (Golang) that monitors the relevant folders and communicates with other nodes using ZeroMQ.
Feels like I would be re-inventing the wheel though. This is a common problem and I expect there are solutions available already that I don't know about? Or perhaps my approach is wrong and there is another way to solve this?
Yes, there has been some stuff going on with distributed synchronization lately:
You could use syncthing (open source) or BitTorrent Sync.
Syncthing is node-based, i.e. you add nodes to a cluster and choose which folders to synchronize.
BTSync is folder-based, i.e. you obtain a "secret" for a folder and can synchronize with everyone in the swarm for that folder.
From my experience, BTSync has better discovery and connectivity, but the whole synchronization process is closed source and nobody really knows what happens inside. Syncthing is written in Go, but sometimes has trouble discovering peers.
Both syncthing and BTSync use LAN discovery via broadcast and a tracker for discovery, AFAIK.
EDIT: Or, if you're really cool, use IPFS to host the latest version, IPNS to "name" it, and mount the IPNS name on the servers. You can set the IPFS bootstrap list to some of your own servers, which would even make you independent of external trackers. :)
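In case it helps, a rough sketch of what that IPFS/IPNS setup might look like (the bootstrap address, peer ID, and paths are all hypothetical):

```bash
ipfs init
# Become independent of external trackers: bootstrap only against
# your own servers (multiaddr and peer ID below are placeholders).
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/10.0.0.1/tcp/4001/ipfs/QmYourPeerIdHere

# On the publishing node: add the folder and point your IPNS name at it.
HASH=$(ipfs add -r -Q /srv/shared)
ipfs name publish "/ipfs/$HASH"

# On the consuming nodes: mount /ipfs and /ipns via FUSE and read
# the latest version through /ipns/<publisher-peer-id>.
ipfs mount
```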

Reliable scheduling with sidekiq

I'm building a monitoring service similar to Pingdom, but monitoring different aspects of a system, and I'm using sidekiq to queue the tasks, which is working well. What I need to do is schedule sending out pings every minute. Rather than using a cron-based system, which would require spinning up a new Ruby instance every minute, I have gone down the route of using sidetiq (notice the different spelling, with a "t"), which uses sidekiq's own queue to schedule future tasks. This feels like a neat solution; however, I am concerned it may not be the most reliable way of scheduling tasks. If there are issues with the system (as there inevitably will be at some point), will this method of scheduling tasks be less reliable than using a cron-based method, and why?
Thanks
You've given only a short description of your system's needs, but I'll try to guess how it could be:
In the first place, using sidekiq means that you'll also need an instance of redis, and that you'll need a way to monitor the sidekiq process and restart it in case of failure, and possibly the redis server as well.
A method based on cron tasks has fewer requirements and therefore far fewer ways to fail.
cron has been around for a long time; it's battle-tested and very reliable, but it has its drawbacks too.
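For comparison, the cron-based alternative is a one-line crontab entry; the ping script name here is hypothetical:

```bash
# Fires every minute; cron itself has no dependency on redis or sidekiq,
# at the cost of spawning a fresh process (e.g. a Ruby instance) each run.
* * * * * /usr/local/bin/send_pings >> /var/log/send_pings.log 2>&1
```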
That said, you can build a system with separate instances of redis in a master/slave configuration, use Redis Sentinel to implement failover in case the master fails, implement a monitoring/alerting system on this setup (you can use something super simple like http://contribsys.com/inspeqtor/ from the sidekiq author), and also start several instances of sidekiq on different machines.
With all of that, you can have a quite reliable system for running sidekiq with sidetiq.
Hope it helps

Ubuntu: Remote Logins (SSHD) - Kill Session & Jobs at Timeout

Server Scenario:
Ubuntu 12.04 LTS
Torque w/ Maui Scheduler
Hadoop
I am building a small cluster (10 nodes). The users will have the ability to ssh into any child node (LDAP auth), but this is really unnecessary, since all computation jobs they want to run can be submitted on the head node using Torque, Hadoop, or other resource managers tied to a scheduler to ensure priority and proper resource allocation throughout the nodes. Some users will have priority over others.
Problem:
You can't force a user to use a batch system like Torque. If they want to hog all the resources on one node or the head node, they can just run their script/code directly from their terminal/ssh session.
Solution:
My main users or "superusers" want me to set up a remote login timeout, which is what their current cluster uses to eliminate this problem. (I do not have access to that cluster, so I cannot grab its configuration.) I want to set up a 30-minute timeout on all remote sessions that are inactive (no keystrokes); if the session is running processes, I also want it killed along with all its job processes. This will stop people from bypassing an available batch system/scheduler.
Question:
How can I implement something like this?
Thanks for all the help!
I've mostly seen sysadmins solve this by not allowing ssh access to the nodes (often done using the pam module in TORQUE), but there are other techniques. One is to use pbstools. The reaver script can be set up to kill user processes that aren't part of jobs (or shouldn't be on those nodes). I believe it can also be configured to simply notify you. Some admins forcibly kill things, others educate users; that part is up to you.
Once you get people using jobs instead of ssh'ing directly, you may want to look into the cpuset feature in TORQUE as well. It can help you as you try to get users to use the amount of resources they request. Best of luck.
EDIT: noted that the pam module is one of the most common ways to restrict ssh access to the compute nodes.
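If you do want the idle-session timeout itself, one common approach (a sketch, not the configuration the asker's other cluster uses) is bash's TMOUT variable, set cluster-wide:

```bash
# /etc/profile.d/autologout.sh -- sourced by login shells
TMOUT=1800        # bash kills the shell after 30 minutes at an idle prompt
readonly TMOUT    # keep users from unsetting it
export TMOUT
```

Note that TMOUT only terminates the idle shell; processes the user left running survive, which is why it is usually paired with a process reaper such as the reaver script above.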

CPU bound/stateful distributed system design

I'm working on a web application frontend to a legacy system that involves a lot of CPU-bound background processing. The application is also stateful on the server side, and the domain objects need to be held in memory across the entire session as the user operates on them via the web-based interface. Think of it as something like a web UI front end to Photoshop, where each filter can take 20-30 seconds to execute on the server side, so the app still has to interact with the user in real time while they wait.
The main problem is that each instance of the server can only support around 4-8 instances of each "workspace" at once, and I need to support a few hundred concurrent users. I'm going to build this on Amazon EC2 to make use of the auto scaling functionality. So, to summarize, the system is:
A web application frontend to a legacy backend system
Tasks performed are CPU bound
Stateful; most calls will be some sort of RPC, and the user will make multiple actions that interact with the stateful objects held in server-side memory
Most tasks are semi-realtime: they have to execute for 20-30 seconds and return the results to the user in the same session
Uses Amazon AWS auto scaling
I'm wondering what is the best way to make a system like this distributed.
Obviously I will need a web server to interact with the browser and then send the CPU-bound tasks from the web server to a bunch of dedicated servers that do the background processing. The question is how to best hook the two tiers together for my specific needs.
I've been looking at message queue systems such as RabbitMQ, but these seem to be geared towards one-time tasks where any worker node can simply grab a job from a queue, execute it, and forget the state. My needs are a little different, since there could be multiple 'tasks' that need to be 'sticky'; for example, if step 1 is started on node 1, then step 2 for the same workspace has to go to the same worker process.
Another problem I see is that most worker queue systems seem to be geared towards background tasks that can be processed anytime, rather than the kind of system I'm dealing with, which has to provide user feedback.
My question is, is there an off the shelf solution for something like this that will allow me to easily build a system that can scale? Would love to hear your thoughts.
RabbitMQ has an RPC tutorial. I haven't used this pattern in particular, but I am running RabbitMQ on a couple of nodes and it can handle hundreds of connections and millions of messages. With a little work on monitoring, you can detect when there is more work to do than you have consumers for. Messages can also time out, so queues won't back up too greatly. To scale out capacity you can create multiple RabbitMQ nodes/clusters. You could have multiple rounds of RPC, so that after the first response you include the information required to get the second message to the correct destination.
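On the monitoring point, one quick way to spot a backlog is rabbitmqctl (the queue names in the output would be your own):

```bash
# Lists each queue with its message backlog and consumer count;
# a queue with many messages and zero consumers needs more workers.
rabbitmqctl list_queues name messages consumers
```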
0MQ has this as a basic pattern, which will fan out work as needed. I've only played with it, but it is simpler to code and possibly simpler to maintain (it doesn't need a broker, though devices can provide one). This may not handle stickiness by default, but it should be possible to write your own routing layer to handle it.
Don't discount HTTP for this, either. When you want request/reply, a strict throughput per backend node, and something that scales well, HTTP is well supported. With AWS you can easily put their ELB in front of an auto scaling group to provide the routing from frontend to backend. ELB supports sticky sessions as well.
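Enabling stickiness on a classic ELB is a couple of CLI calls; a sketch, with the load balancer and policy names being placeholders:

```bash
# Create a duration-based sticky-session policy (30 minutes here)...
aws elb create-lb-cookie-stickiness-policy \
  --load-balancer-name my-frontend-lb \
  --policy-name workspace-stickiness \
  --cookie-expiration-period 1800

# ...and attach it to the listener, so each session keeps hitting
# the same backend node that holds its in-memory workspace.
aws elb set-load-balancer-policies-of-listener \
  --load-balancer-name my-frontend-lb \
  --load-balancer-port 80 \
  --policy-names workspace-stickiness
```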
I'm a big fan of RabbitMQ but if this is the whole scope then HTTP would work nicely and have fewer moving parts in AWS than the other solutions.

How can I make my database records automatic

Is there any way I can make the records in my database automatic? E.g., I want a message to be sent to the helpdesk if a requested service is not attended to within 24 hours, without anyone clicking anything.
Technically, it depends on the database you are using. If the database supports it, you could set up a scheduled job to scan the records, identify late services, and email the helpdesk.
If the database doesn't support scheduled tasks, then you could set up a client job on a timer to do the same thing.
This is what application software is for.
When the application saves to the database, the application also sends an email.
The traditional approach to this is to schedule a job (there are too many ways[1] to do that for me to go into details without knowing your server operating system, DBMS, and how much control you have to install or schedule programs on the server).
Your scheduled job would regularly check the database for records that have not been attended, and then take the appropriate action such as emailing the support team.
[1] Just so that this is not left completely unanswered: some DBMSs (e.g., SQL Server) have built-in job scheduling facilities. You could run a Windows service on the server to do this. If not, you might consider running a Windows service on one of your own servers to access the website (a great way to waste bandwidth).
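As an illustration of the scheduled-job approach, a sketch of an hourly cron script (the database, table, column names, and addresses are all hypothetical):

```bash
#!/bin/sh
# Run hourly from cron:  0 * * * * /usr/local/bin/check_unattended.sh
# Selects requests still unattended after 24 hours and mails the helpdesk.
OVERDUE=$(mysql -N -e "
  SELECT id FROM service_requests
  WHERE attended = 0
    AND created_at < NOW() - INTERVAL 24 HOUR" helpdesk_db)
if [ -n "$OVERDUE" ]; then
  echo "Unattended service requests: $OVERDUE" \
    | mail -s "Overdue service requests" helpdesk@example.com
fi
```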
Use a scheduler like this one, found on the rufus site. You could program it to run, for instance, every hour, and have it do the job without human interaction.
I am in a Java shop myself and I've been using Quartz. It is quite good and usable if you can adjust to JRuby.
I've never liked database- or operating-system-based solutions, since you might not control them and often get asked to run in different environments.
Here's a very simple background job handler for Ruby:
codeforpeople.rubyforge.org/svn/bj/trunk/README
Easy to install and use. Fairly lightweight. It uses a SQL backend for managing concurrency. Runs on multiple machines simultaneously if you need it to.
