We aim to implement a distributed system on a cluster that will perform resource-intensive, image-based computing with heavy storage I/O. It has the following characteristics:
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
It is built around a job-task concept. A job may have one to 100,000 tasks.
A job, which is initiated by the user on the manager node, results in the creation of tasks on the compute nodes.
Tasks create other tasks on the fly.
Some tasks may run for minutes, while others may take many hours.
The tasks run according to a dependency hierarchy, which may be updated on the fly.
The job may be paused and resumed later.
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
The tasks tell their progress and result back to the manager.
The manager is aware of whether a task is alive or hung.
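The scheduling-related requirements above (per-task CPU, memory and disk requests, plus a dependency hierarchy) can be illustrated with a minimal sketch. Every name here (`TaskSchedulingSketch`, `Resources`, `Task`, `startable`) is hypothetical and made up for illustration; it is not the API of HPCS or of any framework mentioned in the answers, and it ignores progress reporting, pausing and liveness checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not the API of any real scheduler.
final class TaskSchedulingSketch {

    // Resources a task requests, or that a node still has free.
    static final class Resources {
        final int cores;
        final long memoryMb;
        final long diskMb;

        Resources(int cores, long memoryMb, long diskMb) {
            this.cores = cores;
            this.memoryMb = memoryMb;
            this.diskMb = diskMb;
        }

        // True if this request fits into the given free capacity.
        boolean fitsInto(Resources free) {
            return cores <= free.cores && memoryMb <= free.memoryMb && diskMb <= free.diskMb;
        }

        // Capacity left after granting the given request.
        Resources minus(Resources used) {
            return new Resources(cores - used.cores, memoryMb - used.memoryMb, diskMb - used.diskMb);
        }
    }

    // A task with its resource request and the ids of the tasks it depends on.
    static final class Task {
        final String id;
        final Resources needs;
        final Set<String> dependsOn;

        Task(String id, Resources needs, String... dependsOn) {
            this.id = id;
            this.needs = needs;
            this.dependsOn = new HashSet<>(Arrays.asList(dependsOn));
        }
    }

    // Picks, in list order, the pending tasks whose dependencies are all finished
    // and whose request still fits into what is left of the node's free resources.
    static List<String> startable(List<Task> pending, Set<String> finished, Resources free) {
        List<String> picked = new ArrayList<>();
        Resources left = free;
        for (Task t : pending) {
            boolean depsDone = finished.containsAll(t.dependsOn);
            if (depsDone && t.needs.fitsInto(left)) {
                picked.add(t.id);
                left = left.minus(t.needs);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Task> pending = Arrays.asList(
            new Task("resize", new Resources(2, 2048, 10_000), "load"),
            new Task("index", new Resources(4, 4096, 50_000), "load"),
            new Task("report", new Resources(1, 512, 1_000), "resize", "index"));
        Set<String> finished = new HashSet<>(Arrays.asList("load"));
        Resources node = new Resources(4, 8192, 100_000);
        // "resize" fits; "index" then lacks cores; "report" still waits on its dependencies.
        System.out.println(startable(pending, finished, node)); // prints [resize]
    }
}
```

Because `startable` is recomputed from the current task list, tasks or dependencies added on the fly are simply picked up on the next call.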
We found Windows HPC Server 2008 (HPCS) R2 conceptually very close to what we need. However, there are a few critical downsides:
Task creation slows down exponentially as the number of tasks grows. Submitting more than a few thousand tasks takes unbearably long.
A task cannot report its progress back to the manager; only a job can.
There is no communication with the task during its runtime, which makes it impossible to check if the task is running or may need restarting.
HPCS only knows nodes, CPU cores and memory as resource units. We can't introduce resource units of our own (like free disk space, custom hardware devices, etc).
Here's my question: does anybody know of, or have experience with, a distributed computing framework that could help us? We are using Windows.
I would take a look at the Condor high-throughput computing project. It supports Windows (as well as Linux and OS X) clients and servers, handles complex dependencies between tasks using DAGMan, and can suspend (and even move) tasks. I have experience with Condor-based systems that scale to thousands of machines across university campuses.
Platform LSF will do everything you need. It runs on Windows. It is commercial, and can be purchased with support.
Yes. 1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
Yes 2. It is built around job-task concept. A job may have one to 100,000 tasks.
Yes 3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node.
Yes 4. Tasks create other tasks on the fly.
Yes 5. Some tasks may run for minutes, while others may take many hours.
Yes 6. The tasks run according to a dependency hierarchy, which may be updated on the fly.
Yes 7. The job may be paused and resumed later.
Yes 8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
Yes 9. The tasks tell their progress and result back to the manager.
Yes 10. The manager is aware if the task is alive or hanged.
Have you looked at Beowulf? Lots of distributions to choose from, and lots of customization options. You ought to be able to find something to meet your needs...
I would recommend Beowulf because a Beowulf cluster behaves more like a single machine than like many workstations.
Give GridGain a try. It should make adding nodes at runtime very easy, and you can monitor and manage the cluster through JMX interfaces.
If you don't mind hosting your project in a cloud, you might want to have a look at Windows Azure / AppFabric. AFAIK it allows you to distribute your jobs via workflows, and you can dynamically add more worker machines to handle your jobs as the load increases.
You can definitely solve this sort of problem using DataSynapse GridServer.
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. Yes, a Broker can easily handle 2000 Engines.
It is built around job-task concept. A job may have one to 100,000 tasks. Yes, I have queued in excess of 250,000 tasks without issue. Eventually you will run out of memory.
A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
Tasks create other tasks on the fly. It can be done, although I would not recommend this sort of model
Some tasks may run for minutes, while others may take many hours. yes
The tasks run according to a dependency hierarchy, which may be updated on the fly. yes, but I would manage this outside of the grid computing infrastructure
The job may be paused and resumed later. yes
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. yes
The tasks tell their progress and result back to the manager. yes
The manager is aware if the task is alive or hanged. yes
Have you examined Sun Grid Engine? It's been a long time since I used it, and I never used it to its full capabilities, but this is my understanding:
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. yes
It is built around job-task concept. A job may have one to 100,000 tasks. not sure
A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
Tasks create other tasks on the fly. I think so?
Some tasks may run for minutes, while others may take many hours. yes
The tasks run according to a dependency hierarchy, which may be updated on the fly. not sure
The job may be paused and resumed later. not sure
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. pretty sure
The tasks tell their progress and result back to the manager. pretty sure
The manager is aware if the task is alive or hanged. yes
Related
I have a cluster that does not allow direct ssh access, but does permit me to submit commands through a proprietary interface.
Is there any way to either manually launch OpenMPI jobs, or documentation on how to write a custom launcher?
I don't think you can do it without breaking some kind of agreement.
I assume you have some kind of web-based interface, or something similar, that lets you fill in certain fields and maybe upload data. That interface probably generates a request or file for a scheduler, most likely SGE or PBS. Direct access to the cluster is limited in order to:
organize task priorities and order
prevent users from hogging the machines
make it easier to launch complicated tasks requiring complicated machine(s) configuration
So you effectively want to go around the scheduler. I don't think you can, or that you should.
However, clusters usually have so-called head nodes that allow SSH access. These nodes serve as a place to submit scheduler requests and perhaps to do small compilation or result-processing jobs (with very limited resources). Such a configuration would eliminate the web interface but still keep the scheduler, which is very important for a cluster used by many people concurrently.
I have two sets of Storm topologies in use today. One is up 24/7 and does its own work.
The other is deployed on demand and handles much bigger loads of data.
As of today, we have N supervisor instances, all on the same type of hardware (CPU/RAM). I'd like my on-demand topology to run on stronger hardware, but as far as I know there's no way to control which supervisor is assigned to which topology.
So if I can't control it, the 24/7 topology might be assigned one of the stronger workers.
Is there a way to do this? Any ideas?
Thanks in advance
Yes, you can control which topologies go where. This is the job of the scheduler.
You very likely want either the isolation scheduler or the resource aware scheduler. See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Storm-Scheduler.html and https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html.
The isolation scheduler lets you prevent Storm from running any other topologies on the machines you use to run the on demand topology. The resource aware scheduler would let you set the resource requirements for the on demand topology, and preferentially assign the strong machines to the on demand topology. See the priority section at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html#Topology-Priorities-and-Per-User-Resource.
I have a Mesos / Marathon system, and it is working well for the most part. There are upwards of 20 processes running, most of them using only part of a CPU. However, sometimes (especially during development), a process will spin up and start using as much CPU as is available. I can see on my system monitor that a CPU is pegged, but I can't tell which Marathon process is causing it.
Is there a monitor app showing CPU usage for marathon jobs? Something that shows it over time. This would also help with understanding scaling and CPU requirements. Tracking memory usage would be good, but secondary to CPU.
It seems that you haven't configured any isolation mechanism on your agent (slave) nodes. mesos-slave comes with an --isolation flag that defaults to posix/cpu,posix/mem, which means isolation at the process level (pretty much no isolation at all). Using cgroups/cpu,cgroups/mem isolation ensures that a task is killed by the kernel if it exceeds its memory limit. Memory is a hard constraint that can be easily enforced.
Restricting CPU is more complicated. If you have a machine that offers 8 CPU cores to Mesos and each of your tasks is set to require cpu=2.0, you'll be able to run at most 4 tasks there. That's easy, but at any given moment any of your 4 tasks may use all idle cores. If one of your jobs misbehaves, it might affect other jobs running on the same machine. To restrict CPU utilization, see the Completely Fair Scheduler (or the related question "How to understand CPU allocation in Mesos?" for more details).
Regarding monitoring, there are many possibilities; choose an option that suits your requirements. You can combine several of these solutions. Some are open source, others are enterprise-level (in random order):
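The arithmetic in the paragraph above can be written down as a small sketch. The class and method names are hypothetical. The 100 ms period is the usual Linux CFS default for `cpu.cfs_period_us`; the idea that a hard cap corresponds to a quota of `cpus × period` is an assumption about how CFS bandwidth limiting is typically configured, so check the Mesos agent documentation (for example, whether your version offers a flag such as `--cgroups_enable_cfs`) before relying on it.

```java
// Hypothetical helper names; only the arithmetic is taken from the answer above.
final class CpuAllocationSketch {

    // Number of tasks that fit on a node when each task reserves cpusPerTask cores.
    static int maxTasks(double offeredCpus, double cpusPerTask) {
        return (int) Math.floor(offeredCpus / cpusPerTask);
    }

    // CFS quota (microseconds of CPU time per period) that would cap a task at `cpus` cores.
    // Assumption: a hard cap is expressed as quota = cpus * period.
    static long cfsQuotaMicros(double cpus, long periodMicros) {
        return Math.round(cpus * periodMicros);
    }

    public static void main(String[] args) {
        System.out.println(maxTasks(8.0, 2.0));             // prints 4
        System.out.println(cfsQuotaMicros(2.0, 100_000L));  // prints 200000
    }
}
```

Without such a cap, the 2.0 CPUs act only as a reservation: a task may still burn idle cores, which is the misbehaving-job scenario described above.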
collectd for gathering stats, Graphite for storing, Grafana for visualization
Telegraf for gathering stats, InfluxDB for storing, Grafana for visualization
Prometheus for storing and gathering data, Grafana for visualization
Datadog for a cloud based monitoring solution
Sysdig platform for monitoring and deep insights
I just noticed that many Pig jobs on Hadoop are killed for the following reason: Container preempted by scheduler
Could someone explain what causes this, and whether I should (and am able to) do something about it?
Thanks!
If you have the fair scheduler and a number of different queues enabled, then higher-priority applications can terminate your jobs (in a preemptive fashion).
Hortonworks has a pretty good explanation with more details:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/preemption.html
Should you do anything about it? That depends on whether your application is within its SLAs and performing within expectations. A good general practice is to review your job's priority and the queue it's assigned to.
If your Hadoop cluster is used by many business units, the admins decide which queue each unit gets, and every queue has its own priority (also decided by the admins). If preemption is enabled at the scheduler level, higher-priority applications do not have to wait because lower-priority applications have taken up the available capacity. In that case, lower-priority tasks must release resources when the cluster has none free, so that higher-priority applications can run.
How would I determine the current server load? Do I need to use JMX here to get the cpu time, or is there another way to determine that or something similar?
I basically want to have background jobs run only when the server is idle. I will use Quartz to fire the job every 30 minutes, check the server load then proceed if it is low or halt if it is busy.
Once I can determine how to measure the load (cpu time, memory usage), I can measure these at various points to determine how I want to configure the server.
Walter
This is tricky to do in a portable way; it would likely depend considerably on your platform.
An alternative is to configure your Quartz jobs to run in low-priority threads. Quartz allows you to configure the thread factory, and if the server is busy, then the thread should be shuffled to the back of the pack until it can be run without getting in the way.
Also, if the load spikes in the middle of the job, then the VM will automatically throttle your batch job until the load drops again. It should be self-regulating, which you wouldn't get by manual introspection of the current load.
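A minimal sketch of the low-priority-thread idea, using only `java.util.concurrent` rather than Quartz itself. The class name `LowPriorityThreads` and the `batch` prefix are made up for illustration, and how such a factory would be plugged into Quartz is not shown here. Keep in mind that thread priority is only a hint to the operating system's scheduler, and some platforms or JVM configurations largely ignore it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; plain JDK code, not a Quartz API.
final class LowPriorityThreads {

    // Creates daemon threads named "<prefix>-<n>" that run at the lowest Java priority.
    static ThreadFactory lowPriorityFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setPriority(Thread.MIN_PRIORITY);
            t.setDaemon(true);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2, lowPriorityFactory("batch"));
        // Run one job and wait for it, so the output appears before the JVM exits.
        pool.submit(() -> {
            Thread current = Thread.currentThread();
            System.out.println(current.getName() + " priority=" + current.getPriority());
        }).get();
        pool.shutdown();
    }
}
```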
I think you've answered your own question. If you want a pure Java solution, then the best you can do is use the information returned by ThreadMXBean.
You can find out how many threads there are, how many processors the host machine has and how much time has been used by each thread, and calculate CPU load from that.
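The calculation described above could look roughly like the following sketch. `JvmCpuLoad` and its methods are hypothetical names; the `ThreadMXBean` calls are standard JDK API. Note the limitation: this only measures CPU time used by threads of the current JVM, so it says nothing about other processes on the server, and the one-second sampling window is an arbitrary choice.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper built on the standard java.lang.management API.
final class JvmCpuLoad {

    // Sum of CPU time (nanoseconds) over all live threads of this JVM.
    // getThreadCpuTime returns -1 for threads that are no longer alive; those are skipped.
    static long totalCpuNanos(ThreadMXBean bean) {
        long total = 0;
        for (long id : bean.getAllThreadIds()) {
            long cpu = bean.getThreadCpuTime(id);
            if (cpu > 0) {
                total += cpu;
            }
        }
        return total;
    }

    // Fraction of available CPU capacity used during the window, clamped to [0, 1].
    static double load(long cpuNanosDelta, long wallNanosDelta, int processors) {
        if (wallNanosDelta <= 0 || processors <= 0) {
            return 0.0;
        }
        double raw = (double) cpuNanosDelta / ((double) wallNanosDelta * processors);
        return Math.max(0.0, Math.min(1.0, raw));
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            System.out.println("Thread CPU time is not available on this JVM");
            return;
        }
        int processors = Runtime.getRuntime().availableProcessors();
        long cpuBefore = totalCpuNanos(bean);
        long wallBefore = System.nanoTime();
        Thread.sleep(1000); // sampling window
        long cpuDelta = totalCpuNanos(bean) - cpuBefore;
        long wallDelta = System.nanoTime() - wallBefore;
        System.out.println("JVM CPU load: " + load(cpuDelta, wallDelta, processors));
    }
}
```

A scheduled job could call `load` at its start and return early when the value is above a threshold of your choosing.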