Why are MapReduce attempts killed due to "Container preempted by scheduler"?

I just noticed that many of my Pig jobs on Hadoop are killed with the following reason: Container preempted by scheduler
Could someone explain to me what causes this, and whether I should (and am able to) do something about it?
Thanks!

If you have the fair scheduler and a number of different queues enabled, then higher-priority applications can terminate your jobs (in a preemptive fashion).
Hortonworks has a pretty good explanation with more details:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/preemption.html
Should you do anything about it? That depends on whether your application is meeting its SLAs and performing within expectations. As a general good practice, review your job's priority and the queue it's assigned to.
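For example, if your admins have set up a higher-priority queue that your jobs are allowed to use, you can submit to it explicitly. A minimal sketch in Java (the queue name "etl-high" is made up; ask your admins which queues actually exist on your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QueueAwareJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Submit to a specific YARN queue instead of the default one.
            // "etl-high" is a hypothetical queue name.
            conf.set("mapreduce.job.queuename", "etl-high");
            Job job = Job.getInstance(conf, "queue-aware-job");
            // ... set mapper/reducer classes, input/output paths, submit ...
        }
    }

Since the question is about Pig, the equivalent there is SET mapreduce.job.queuename 'etl-high'; at the top of the script.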

If your Hadoop cluster is shared by many business units, the admins typically define a queue for each unit, and every queue has its own priority (also decided by the admins). If preemption is enabled at the scheduler level, higher-priority applications do not have to wait just because lower-priority applications have taken up the available capacity. In that case, lower-priority tasks must release their resources if none are free on the cluster, so that the higher-priority applications can run.

Related

Deploy 2 different topologies on a single Nimbus with 2 different hardware profiles

I have 2 sets of Storm topologies in use today; one is up 24/7 and does its own work.
The other is deployed on demand and handles much bigger loads of data.
As of today, we have N supervisor instances, all with the same hardware (CPU/RAM). I'd like my on-demand topology to run on stronger hardware, but as far as I know there's no way to control which supervisor is assigned to which topology.
So if I can't control it, it's possible that the 24/7 topology would claim one of the stronger workers for itself.
Any ideas, if there is such a way?
Thanks in advance
Yes, you can control which topologies go where. This is the job of the scheduler.
You very likely want either the isolation scheduler or the resource aware scheduler. See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Storm-Scheduler.html and https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html.
The isolation scheduler lets you prevent Storm from running any other topologies on the machines you use to run the on-demand topology. The resource aware scheduler would let you set the resource requirements for the on-demand topology and preferentially assign the strong machines to it. See the priority section at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html#Topology-Priorities-and-Per-User-Resource.
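To make that concrete, with the resource aware scheduler the resource requirements and the topology priority are declared in the topology code itself. A minimal sketch (IngestSpout/CrunchBolt stand in for your own components, and all the numbers are illustrative assumptions):

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class OnDemandTopologyConfig {
        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            // Declare per-component resource needs so the resource aware
            // scheduler can place heavy components on capable supervisors.
            // IngestSpout and CrunchBolt are placeholders for your classes.
            builder.setSpout("ingest", new IngestSpout(), 4)
                   .setCPULoad(50.0)        // % of one core per executor
                   .setMemoryLoad(2048.0);  // on-heap MB per executor
            builder.setBolt("crunch", new CrunchBolt(), 8)
                   .shuffleGrouping("ingest")
                   .setCPULoad(100.0)
                   .setMemoryLoad(4096.0);
            return builder;
        }

        public static Config config() {
            Config conf = new Config();
            // Smaller number = higher priority when the cluster is saturated.
            conf.setTopologyPriority(10);
            return conf;
        }
    }

The isolation scheduler, by contrast, is configured cluster-side in storm.yaml (the isolation.scheduler.machines map), so it needs no topology code changes.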

AWS EMR Metric Server - Cluster Driver is throwing Insufficient Memory Error

This is in relation to my previous post (here) regarding the OOM I'm experiencing on a driver after running some Spark steps.
I have a cluster with 2 nodes in addition to the master, running the job in client mode. It's a small job that is not very memory-intensive.
I've paid particular attention to the Hadoop processes via htop; they are the user-generated ones and also the highest memory consumers. The main culprit is the amazon.emr.metric.server process, followed by the state pusher process.
As a test I killed the process, and the memory shown by Ganglia dropped quite drastically; I was then able to run 3-4 consecutive jobs before the OOM happened again. This behaviour repeats if I manually kill the process.
My question really is regarding the default behaviour of these processes and whether what I'm witnessing is the norm or whether something crazy is happening.

How do YARN applications estimate needed resources

I'm wondering how a YARN app (let's say a MapReduce job) estimates the resources (CPU, RAM) needed for a single mapper/reducer.
The question is too broad, but I'll try to give a direction for investigation. When a YARN application is executed, it requests some amount of resources from the Resource Manager. Resource management in YARN is implemented by means of pluggable schedulers; the two most commonly used are:
Fair scheduler
Capacity Scheduler
Schedulers define the rules used for estimating "slots" for an application. For some schedulers, "slots" are defined only by the memory required by the application (the Capacity Scheduler with DefaultResourceCalculator); others take the number of CPUs into account as well.
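For MapReduce in particular there is no real estimation step: each mapper/reducer container asks for whatever the job configuration specifies (falling back to the mapred-site.xml defaults), and the scheduler matches those requests against queue capacity. A hedged sketch of overriding the requests per job (the numbers are illustrative only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ResourceSizedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Each map container requests 2 GB / 1 vcore from the
            // ResourceManager, each reduce container 4 GB / 2 vcores.
            conf.setInt("mapreduce.map.memory.mb", 2048);
            conf.setInt("mapreduce.map.cpu.vcores", 1);
            conf.setInt("mapreduce.reduce.memory.mb", 4096);
            conf.setInt("mapreduce.reduce.cpu.vcores", 2);
            Job job = Job.getInstance(conf, "resource-sized-job");
            // ... set mapper/reducer classes, input/output paths, submit ...
        }
    }

Note that with the Capacity Scheduler's DefaultResourceCalculator the vcores settings are effectively ignored; only memory drives the allocation.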

Apache Mesos Schedulers and Executors by example

I am trying to understand how the various components of Mesos work together, and found this excellent tutorial containing an architectural overview.
I have a few concerns about this that aren't made clear (either in the article or in the official Mesos docs):
Where are the Schedulers running? Are there "Scheduler nodes" where only the Schedulers should be running?
If I was writing my own Mesos framework, what Scheduler functionality would I need to implement? Is it just a binary yes/no or accept/reject for Offers sent by the Master? Any concrete examples?
If I was writing my own Mesos framework, what Executor functionality would I need to implement? Any concrete examples?
What's a concrete example of a Task that would be sent to an Executor?
Are Executors "pinned" (permanently installed on) Slaves, or do they float around in an "on demand" type fashion, being installed and executed dynamically/on-the-fly?
Great questions!
I believe it would be really helpful to have a look at a sample framework such as Rendler. This will probably answer most of your questions and give you a feeling for the framework internals.
Let me now try to answer the questions that might still be open after that.
Scheduler Location
Schedulers do not run on any special nodes; keep in mind that schedulers can fail over as well (like any component in a distributed system).
Scheduler functionality
Have a look at Rendler or at the framework development guide.
Executor functionality/Task
I believe Rendler is a good example for understanding the Task/Executor relationship. Just start reading the README/description on the main GitHub page.
Executor pinning
Executors are started on a node when the first task requiring such an executor is sent to that node. After that, the executor will remain on that node.
Hope this helped!
To add to js84's excellent response,
Scheduler Location: Many users like to launch the schedulers via another framework like Marathon to ensure that if the scheduler or its node dies, then it can be restarted elsewhere.
Scheduler functionality: After registering with Mesos, your scheduler will start getting resource offers in the resourceOffers() callback, in which your scheduler should launch (at least) one task on a subset (or all) of the resources being offered. You'll probably also want to implement the statusUpdate() callback to handle task completion/failure. A minimal sketch of this flow appears at the end of this answer.
Note that you may not even need to implement your own scheduler if an existing framework like Marathon/Chronos/Aurora/Kubernetes could suffice.
Executor functionality: You usually don't need to create a custom executor if you just want to launch a Linux process or Docker container and know when it completes. You could just use the default mesos-executor (by specifying a CommandInfo directly in TaskInfo, instead of embedding it inside an ExecutorInfo). If, however, you want to build a custom executor, at minimum you need to implement launchTask(), and ideally also killTask().
Example Task: An example task could be a simple Linux command like sleep 1000 or echo "Hello World", or a Docker container (via ContainerInfo) like image : 'mysql'. Or, if you use a custom executor, then the executor defines what a task is and how to run it, so a task could instead be run as another thread in the executor's process, or just become an item in a queue in a single-threaded executor.
Executor pinning: The executor is distributed via CommandInfo URIs, just like any task binaries, so they do not need to be preinstalled on the nodes. Mesos will fetch and run it for you.
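To tie the scheduler and default-executor pieces together, here is a minimal, hedged sketch of a Java scheduler that accepts one offer and runs a single shell command. It uses the classic org.apache.mesos bindings; the task and class names are made up, and most error handling is omitted:

    import java.util.Collections;
    import java.util.List;
    import org.apache.mesos.Protos.*;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Launches one "Hello World" task via the default mesos-executor
    // (CommandInfo directly in TaskInfo, no custom ExecutorInfo).
    public class HelloScheduler implements Scheduler {
        private boolean launched = false;

        @Override
        public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
            for (Offer offer : offers) {
                if (launched) {
                    driver.declineOffer(offer.getId()); // reject: nothing left to run
                    continue;
                }
                TaskInfo task = TaskInfo.newBuilder()
                    .setName("hello-task")
                    .setTaskId(TaskID.newBuilder().setValue("hello-1"))
                    .setSlaveId(offer.getSlaveId())
                    .setCommand(CommandInfo.newBuilder().setValue("echo 'Hello World'"))
                    .addResources(Resource.newBuilder().setName("cpus")
                        .setType(Value.Type.SCALAR)
                        .setScalar(Value.Scalar.newBuilder().setValue(0.1)))
                    .addResources(Resource.newBuilder().setName("mem")
                        .setType(Value.Type.SCALAR)
                        .setScalar(Value.Scalar.newBuilder().setValue(32)))
                    .build();
                driver.launchTasks(Collections.singletonList(offer.getId()),
                                   Collections.singletonList(task));
                launched = true;
            }
        }

        @Override
        public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
            if (status.getState() == TaskState.TASK_FINISHED) {
                driver.stop(); // our single task is done
            }
        }

        // Remaining callbacks left empty for brevity.
        @Override public void registered(SchedulerDriver d, FrameworkID id, MasterInfo m) {}
        @Override public void reregistered(SchedulerDriver d, MasterInfo m) {}
        @Override public void offerRescinded(SchedulerDriver d, OfferID id) {}
        @Override public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] b) {}
        @Override public void disconnected(SchedulerDriver d) {}
        @Override public void slaveLost(SchedulerDriver d, SlaveID s) {}
        @Override public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int i) {}
        @Override public void error(SchedulerDriver d, String msg) {}
    }

You would hand this to a MesosSchedulerDriver together with a FrameworkInfo and call run(). A proper framework would also check that an offer actually contains enough cpus/mem before launching on it.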
Schedulers: a scheduler is essentially the strategy for accepting or rejecting offers. You can write your own scheduler or use an existing one like Chronos. In the scheduler you should evaluate the resources being offered and then either accept or reject them.
Scheduler functionality: for example, suppose you have a task that needs 8 CPUs to run, but the offer from Mesos is for only 6 CPUs, which won't serve the need; in this case you can reject the offer.
Executor functionality: the executor handles state-related information about your task. There is a set of APIs you need to implement, e.g. reporting the status of an assigned task on the Mesos slave, or the number of CPUs currently available on the slave where the executor is running.
Concrete example of an executor: Chronos.
Installed and executed dynamically/on-the-fly: this is not possible; you need to preconfigure the executors. However, you can replicate executors using autoscaling.

Please recommend an alternative to Microsoft HPC [closed]

We aim to implement a distributed system on a cluster, which will perform resource-consuming image-based computing with heavy storage I/O, with the following characteristics:
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
It is built around a job-task concept. A job may have one to 100,000 tasks.
A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node.
Tasks create other tasks on the fly.
Some tasks may run for minutes, while others may take many hours.
The tasks run according to a dependency hierarchy, which may be updated on the fly.
The job may be paused and resumed later.
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
The tasks tell their progress and result back to the manager.
The manager is aware if the task is alive or hung.
We found Windows HPC Server 2008 (HPCS) R2 very close in concept to what we need. However, there are a few critical downsides:
Task creation gets exponentially slower as the number of tasks increases; submitting more than a few thousand tasks takes an unbearable amount of time.
A task is unable to report its progress back to the manager; only the job can.
There is no communication with the task during its runtime, which makes it impossible to check if the task is running or may need restarting.
HPCS only knows nodes, CPU cores and memory as resource units. We can't introduce resource units of our own (like free disk space, custom hardware devices, etc).
Here's my question: does anybody know and/or had experience with a distributed computing framework which could help us? We are using Windows.
I would take a look at the Condor high-throughput computing project. It supports Windows (and Linux, and OS X) clients and servers, handles complex dependencies between tasks using DAGMan, and can suspend (and even move) tasks. I have experience with Condor-based systems that scale to thousands of machines across university campuses.
Platform LSF will do everything you need. It runs on Windows. It is commercial, and can be purchased with support.
1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. Yes.
2. It is built around a job-task concept. A job may have one to 100,000 tasks. Yes.
3. A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. Yes.
4. Tasks create other tasks on the fly. Yes.
5. Some tasks may run for minutes, while others may take many hours. Yes.
6. The tasks run according to a dependency hierarchy, which may be updated on the fly. Yes.
7. The job may be paused and resumed later. Yes.
8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. Yes.
9. The tasks tell their progress and result back to the manager. Yes.
10. The manager is aware if the task is alive or hung. Yes.
Have you looked at Beowulf? Lots of distributions to choose from, and lots of customization options. You ought to be able to find something to meet your needs...
I would recommend Beowulf because it behaves more like a single machine than like many workstations.
Give GridGain a try. It should make runtime addition of nodes very easy, and you can monitor/manage the cluster using JMX interfaces.
If you don't mind hosting your project in a cloud, you might want to have a look at Windows Azure / AppFabric. AFAIK it allows you to distribute your jobs via workflows, and you can dynamically add more worker machines to handle your jobs as the load increases.
You can definitely solve this sort of problem using Data Synapse Grid Server.
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. Yes, a Broker can easily handle 2000 Engines.
It is built around job-task concept. A job may have one to 100,000 tasks. Yes, I have queued in excess of 250,000 tasks without issue. Eventually you will run out of memory.
A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
Tasks create other tasks on the fly. It can be done, although I would not recommend this sort of model
Some tasks may run for minutes, while others may take many hours. yes
The tasks run according to a dependency hierarchy, which may be updated on the fly. yes, but I would manage this outside of the grid computing infrastructure
The job may be paused and resumed later. yes
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. yes
The tasks tell their progress and result back to the manager. yes
The manager is aware if the task is alive or hung. Yes.
Have you examined Sun Grid Engine? It's been a long time since I used it, and I never used it to its full capabilities, but this is my understanding.
There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable. yes
It is built around job-task concept. A job may have one to 100,000 tasks. not sure
A job, which is initiated by the user on the manager node, results in creation of tasks on the compute node. yes
Tasks create other tasks on the fly. I think so?
Some tasks may run for minutes, while others may take many hours. yes
The tasks run according to a dependency hierarchy, which may be updated on the fly. not sure
The job may be paused and resumed later. not sure
Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks. pretty sure
The tasks tell their progress and result back to the manager. pretty sure
The manager is aware if the task is alive or hung. Yes.
