What are the Apache Flink non-intensive vs intensive tasks? - parallel-processing

I have been learning Apache Flink.
In the section on Task Slots and Resources, the documentation discusses how subtasks are allocated to task slots within a TaskManager. It states:
"It is easier to get better resource utilization. Without slot sharing, the non-intensive source/map() subtasks would block as many resources as the resource intensive window subtasks."
What are non-intensive and intensive tasks, and what are their distinguishing characteristics?
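For concreteness, here is a minimal pipeline of the shape the quoted sentence describes (the socket source, host/port and operators are just an example I made up): the source and map subtasks mostly move records along, while the keyed window aggregation buffers state and does the heavy lifting.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SlotSharingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)                          // non-intensive: mostly waits on the network
               .map(line -> line.toLowerCase())                              // non-intensive: cheap per-record transformation
               .keyBy(word -> word)
               .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
               .reduce((a, b) -> a + b)                                      // intensive: buffers window state and aggregates it
               .print();

            env.execute("slot-sharing-example");
        }
    }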

Related

How do I find a marathon runaway process

I have a Mesos/Marathon system, and it is working well for the most part. There are upwards of 20 processes running, most of them using only part of a CPU. However, sometimes (especially during development), a process will spin up and start using as much CPU as is available. I can see on my system monitor that there is a pegged CPU, but I can't tell which Marathon process is causing it.
Is there a monitoring app that shows CPU usage for Marathon jobs over time? This would also help with understanding scaling and CPU requirements. Tracking memory usage would be good too, but that is secondary to CPU.
It seems that you haven't configured any isolation mechanism on your agent (slave) nodes. mesos-slave comes with an --isolation flag that defaults to posix/cpu,posix/mem, which means isolation at the process level (pretty much no isolation at all). Using cgroups/cpu,cgroups/mem isolation ensures that a given task will be killed by the kernel if it exceeds its memory limit. Memory is a hard constraint that can be easily enforced.
Restricting CPU is more complicated. If you have a machine that offers 8 CPU cores to Mesos and each of your tasks is set to require cpu=2.0, you'll be able to run at most 4 tasks there. That's easy, but at any given moment any of those 4 tasks might utilize all idle cores. If one of your jobs misbehaves, it can affect other jobs running on the same machine. For restricting CPU utilization, see the Completely Fair Scheduler (or the related question How to understand CPU allocation in Mesos? for more details).
Regarding monitoring, there are many possibilities available; choose an option that suits your requirements. You can combine several of these solutions; some are open source, others are enterprise-level (in no particular order):
collectd for gathering stats, Graphite for storing, Grafana for visualization
Telegraf for gathering stats, InfluxDB for storing, Grafana for visualization
Prometheus for storing and gathering data, Grafana for visualization
Datadog for a cloud based monitoring solution
Sysdig platform for monitoring and deep insights

deploy bolts/spout to a specific supervisor

We are running a Storm application using a single type of instance in AWS and a single topology to run our system.
This is causing some resource limitation issues.
The way we want to address this is by splitting our IO-intensive bolts across a cluster of a few dozen t1.small machines (for example) and putting all our CPU-intensive bolts on two large machines with lots of CPU and memory.
Basically, what I am asking is: is there a way to start all these supervisors and then deploy one topology such that the CPU-intensive bolts run on the big machines and the IO-intensive bolts run on the small machines?
You can implement a custom scheduler using the IScheduler interface; a skeleton is sketched after the links below.
See
http://www.exogeni.net/2015/04/enabling-site-aware-scheduling-for-apache-storm-in-exogeni/
https://dcvan24.wordpress.com/2015/04/07/metadata-aware-custom-scheduler-in-storm/
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/DemoScheduler.java
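For reference, here is a bare-bones sketch of such a scheduler, written against the 1.x-era backtype.storm.scheduler API used in the DemoScheduler link above (newer releases moved it to org.apache.storm.scheduler); the class name and the tagging scheme are hypothetical:

    import java.util.Map;

    import backtype.storm.scheduler.Cluster;
    import backtype.storm.scheduler.EvenScheduler;
    import backtype.storm.scheduler.IScheduler;
    import backtype.storm.scheduler.SupervisorDetails;
    import backtype.storm.scheduler.Topologies;

    public class TaggedScheduler implements IScheduler {

        @Override
        public void prepare(Map conf) {
            // nothing to set up for this sketch
        }

        @Override
        public void schedule(Topologies topologies, Cluster cluster) {
            for (SupervisorDetails supervisor : cluster.getSupervisors().values()) {
                // supervisor.scheduler.meta from each supervisor's storm.yaml shows up here,
                // e.g. {"tag": "cpu"} on the big boxes vs {"tag": "io"} on the t1.smalls.
                Object meta = supervisor.getSchedulerMeta();
                // ... select the supervisors whose tag matches the components you want to pin,
                // then call cluster.assign(slot, topologyId, executors) for their free slots.
            }
            // Let Storm's default scheduler place whatever is still unassigned.
            new EvenScheduler().schedule(topologies, cluster);
        }
    }

Each supervisor would advertise its class of hardware via supervisor.scheduler.meta in its storm.yaml, and nimbus would be pointed at the scheduler class via the storm.scheduler setting, as described in the linked posts.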

What does container/resource allocation mean in Hadoop and in Spark when running on Yarn?

As Spark runs in-memory, what does resource allocation mean in Spark when running on YARN, and how does it contrast with Hadoop's container allocation?
Just curious, since Hadoop's data and computations are on disk, whereas Spark's are in memory.
Hadoop is a framework capable of processing large data. It has two layers: a distributed file system layer called HDFS and a distributed processing layer. In Hadoop 2.x, the processing layer is architected in a generic way so that it can also be used for non-MapReduce applications.
For doing any processing, we need system resources such as memory, network, disk and CPU. The term container was introduced in Hadoop 2.x; in Hadoop 1.x, the equivalent term was slot. A container is an allocation or share of memory and CPU. YARN is a general resource management framework which enables efficient utilization of the resources on the cluster nodes through proper allocation and sharing.
In-memory processing means the data is completely loaded into memory and processed without writing intermediate data to disk. This is faster because the computation happens in memory without much disk I/O, but it needs more memory because the entire data set is loaded into memory.
Batch processing means the data is taken and processed in batches; intermediate results are stored on disk and supplied again to the next stage. This also needs memory and CPU for processing, but less than a fully in-memory processing system.
YARN's ResourceManager acts as the central resource allocator for applications such as MapReduce, Impala (with Llama), Spark (in YARN mode), etc. When we trigger a job, it requests the ResourceManager for the resources required for execution. The ResourceManager allocates resources based on availability, in the form of containers. A container is just an allocation of memory and CPU. One job may need multiple containers, and containers are allocated across the cluster depending on availability. The tasks are then executed inside the containers.
For example, when we submit a MapReduce job, an MR ApplicationMaster will be launched and it will negotiate with the ResourceManager for additional resources. Map and reduce tasks will be spawned in the allocated resources.
Similarly, when we submit a Spark job (YARN mode), a Spark ApplicationMaster will be launched and it will negotiate with the ResourceManager for additional resources. The RDD computations will then run in the allocated resources (the Spark executors).
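To make the "a container is just memory and CPU" point concrete, here is a rough sketch of an ApplicationMaster asking the ResourceManager for containers via the YARN AMRMClient API; the sizes are arbitrary, and the code assumes it is already running inside the AM container that YARN launched for the application:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ContainerRequestSketch {
        public static void main(String[] args) throws Exception {
            // The ApplicationMaster talks to the ResourceManager through AMRMClient.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new Configuration());
            rmClient.start();

            // Register this AM with the RM (host/port/tracking URL left empty here).
            rmClient.registerApplicationMaster("", 0, "");

            // A container is just a bundle of memory and vcores: ask for 2 GB / 1 core.
            Resource capability = Resource.newInstance(2048, 1);
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, Priority.newInstance(0)));

            // Allocated containers arrive asynchronously on the allocate() heartbeat:
            // rmClient.allocate(0.1f).getAllocatedContainers() ...
        }
    }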

How do YARN applications estimate needed resources

I'm wondering how a YARN application (let's say a MapReduce job) estimates the resources (CPU, RAM) needed for a single mapper/reducer.
The question is too broad, but I'll try to give a direction for investigation. When a YARN application is executed, it requests some amount of resources from the ResourceManager. Resource management in YARN is implemented by means of schedulers. YARN supports two schedulers:
Fair scheduler
Capacity Scheduler
Schedulers define the rules that are used for estimating "slots" for an application. For some schedulers, "slots" are defined only by the memory required by the application (the Capacity Scheduler with DefaultResourceCalculator). Others take the number of CPUs into account as well.
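One concrete detail worth adding: for MapReduce at least, the per-task container size is not estimated automatically; it is read from job configuration (or the mapred-site.xml defaults), and the scheduler only decides where and when a container of that size fits. A rough sketch with arbitrary values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ResourceConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Each map/reduce task runs in its own YARN container of this size.
            conf.setInt("mapreduce.map.memory.mb", 2048);     // 2 GB container per map task
            conf.setInt("mapreduce.map.cpu.vcores", 1);       // only matters if the scheduler counts CPU
            conf.setInt("mapreduce.reduce.memory.mb", 4096);  // 4 GB container per reduce task
            conf.setInt("mapreduce.reduce.cpu.vcores", 2);

            // The task JVM heap must fit inside the container, so it is configured separately.
            conf.set("mapreduce.map.java.opts", "-Xmx1638m");

            Job job = Job.getInstance(conf, "resource-config-sketch");
            // ... set mapper, reducer, input and output as usual, then job.waitForCompletion(true)
        }
    }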

What additional benefit does Yarn bring to the existing map reduce?

YARN differs in its infrastructure layer from the original MapReduce architecture in the following way:
In YARN, the JobTracker is split into two different daemons called the ResourceManager and the NodeManager (node specific). The ResourceManager only manages the allocation of resources to the different jobs, and it comprises a scheduler which just takes care of scheduling jobs without worrying about any monitoring or status updates. Different resources such as memory, CPU time, and network bandwidth are put into one unit called a resource container. Different ApplicationMasters run on different nodes; they talk to a number of these resource containers and accordingly update the NodeManager with the monitoring/status details.
I want to know how using this kind of approach increases performance from the MapReduce perspective. Also, if there is any definitive content on the motivation behind YARN and its benefits over the existing implementation of MapReduce, please point me to it.
Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.
YARN is more general than MR, and it should be possible to run other computing models like BSP besides MR. Prior to YARN, a separate cluster was required for MR, BSP and others. Now they can coexist in a single cluster, which leads to higher utilization of the cluster. Here are some of the applications ported to YARN.
From a MapReduce perspective, in legacy MR there are separate slots for Map and Reduce tasks, but in YARN there is no fixed purpose for a container. The same container can be used for a Map task, a Reduce task, a Hama BSP task or something else. This leads to better utilization.
Also, it makes it possible to run different versions of Hadoop in the same cluster, which is not possible with legacy MR and which makes maintenance easier.
Here are some of the additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.
FYI, it was a bit controversial to develop YARN instead of using one of the frameworks which had been doing something similar and had been running successfully for ages with the bugs ironed out.
I do not think that YARN will speed up the existing MR framework. Looking at the architecture, we can see that the system is now more modular, but modularity usually comes at the cost of some performance.
It can be claimed that YARN has nothing to do with MapReduce; MapReduce just became one of the YARN applications. You can see it as moving from an embedded program to an embedded OS with a program running within it.
At the same time, YARN opens the door for different MR implementations built on different frameworks. For example, if we assume that our dataset is smaller than the cluster memory, we can get much better performance. I think http://www.spark-project.org/ is one such example.
To summarize: YARN does not improve the existing MR, but it enables other MR implementations that are better in all respects.
All the above answers cover a lot of information; I am simplifying it as follows:
1. MapReduce is platform plus application in Hadoop 1.0, and only one of the applications in Hadoop 2.0; YARN is the platform in Hadoop 2.0 and does not exist in Hadoop 1.0.
2. MapReduce is a single-purpose system (it can run MapReduce jobs only); YARN is a multi-purpose system that can run MapReduce, Spark, Tez, Flink, BSP, MPP, MPI, Giraph etc. (general purpose).
3. MapReduce has JobTracker scalability limits, because resource management and job management are handled together; in YARN they are separated and managed by the RM+NM and by paradigm-specific AMs respectively.
4. MapReduce has inflexible resource management, i.e. fixed map/reduce slots; YARN has flexible resource management, i.e. containers.
5. MapReduce is not highly available; YARN offers high availability and reliability.
6. MapReduce scales out up to about 5,000 nodes; YARN scales out to 10,000+ nodes.
7. In MapReduce: Job -> tasks. In YARN: Application -> DAG of jobs -> tasks.
8. Classical MapReduce = MapReduce API + MapReduce framework + MapReduce system; YARN MapReduce = MapReduce API + MapReduce framework + YARN system. So MR programs written on Hadoop 1.0 also run on YARN without changing a single line of code, i.e. backward compatibility.
Let's look at the Hadoop 1.0 drawbacks which have been addressed by Hadoop 2.0 with the addition of YARN.
Issue of scalability: the JobTracker runs on a single machine even though you have thousands of nodes in the Hadoop cluster. Its responsibilities are resource management and job and task scheduling and monitoring. Since all these processes run on a single node, this model is not scalable.
Issue of availability (single point of failure): the JobTracker is a single point of failure.
Resource utilization: due to the predefined number of Map and Reduce task slots, resources are not utilized properly. When all the Map slots are busy, the Reduce slots sit idle and can't be used to process Map tasks.
Tight integration with the MapReduce framework: Hadoop 1.x can run MapReduce jobs only; support for other kinds of jobs does not exist.
The single JobTracker bottleneck has now been removed with the YARN architecture in Hadoop 2.x.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
Now the advantages of YARN:
Scalability issues have been resolved
No single point of failure. All components are highly available
Resource utilization has been improved, since flexible containers replace the fixed Map and Reduce slots
Non-MapReduce jobs can be submitted
It looks like this link might be what you're looking for: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
My understanding is that YARN is supposed to be more generic. You can create your own YARN applications that negotiate directly with the ResourceManager for resources (1), and MapReduce is just one of several ApplicationMasters that already exist (2).
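To give a flavour of what "your own YARN application" means in code, here is a rough, much-simplified sketch of submitting one through the YarnClient API; the jar, class names and resource sizes are invented for the example:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("my-custom-yarn-app");

            // Describe the container that will run our ApplicationMaster.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "java -cp my-app.jar my.pkg.MyApplicationMaster"));   // hypothetical AM command
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));               // 1 GB, 1 vcore for the AM

            // Hand the application over to the ResourceManager; the AM then negotiates
            // the rest of its containers itself (as in the AMRMClient sketch earlier).
            yarnClient.submitApplication(ctx);
        }
    }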
