I am trying to learn about the parallelism and scalability features offered by Storm and read the following article: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused about whether Storm supports data parallelism or task parallelism. What I could understand (I may be wrong) is that Storm supports task parallelism (since the degree of parallelism is restricted by the number of tasks in the topology). If this is the case, how can it be used for large-scale parallel data processing, which requires data parallelism?
Any help would be greatly appreciated. Thanks :)
Storm does not follow textbook terminology. In fact, Storm supports data, task, and pipelined parallelism.
If you have an operator and assign a parallelism larger than one (parallelism_hint), you get as many threads as specified by the parameter, each executing the same code on different data, i.e., you get data parallelism. You can further set the parameter number_of_tasks (which must be >= parallelism_hint) to split the input data into number_of_tasks partitions/substreams, i.e., more partitions than executors. Thus, some executor threads need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (only, perhaps, the concurrency). However, it allows the number of executors to be changed at runtime.
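A minimal sketch with Storm's TopologyBuilder (the MyBolt class and the component names are hypothetical):

    import org.apache.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();

    // parallelism_hint = 2: two executor threads run the same bolt code
    // on different substreams of the input (data parallelism).
    builder.setBolt("my-bolt", new MyBolt(), 2)
           // number_of_tasks = 4: the input is split into 4 partitions
           // (tasks), so each of the 2 executors serves 2 tasks; this
           // leaves headroom to rebalance up to 4 executors at runtime.
           .setNumTasks(4)
           .shuffleGrouping("my-spout");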
As you have multiple spouts and bolts in your topology, and all those spouts and bolts are executed in different threads and even on different machines, you have task parallelism here (not to be confused with Storm's usage of the term task!). As there are producer/consumer relationships between spouts/bolts, you also get pipeline parallelism here, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topologies at the same time.
I am confused about the number of tasks that can work in parallel in Flink.
Can someone explain to me:
What does parallelism mean in a distributed system, and how does it relate to Flink's terminology?
In Flink, does a parallelism of 2 mean that 2 tasks work in parallel?
In Flink, if 2 operators work separately but each has a parallelism of 1, does that count as parallel computation?
Is it true that for a KeyedStream, the maximum parallelism is the number of keys?
Is the current CEP engine in Flink able to work in more than 1 task?
Thank you.
Flink uses the term parallelism in a pretty standard way -- it refers to running multiple copies of the same computation simultaneously on multiple processors, but with different data. When we speak of parallelism with respect to Flink, it can apply to an operator that has parallel instances, or it can apply to a pipeline or job (composed of several operators).
In Flink it is possible for several operators to work separately and concurrently. E.g., in this job
source ---> map ---> sink
the source, map, and sink could all be running simultaneously in separate processors, but we wouldn't call that parallel computation. (Distributed, yes.)
In a typical Flink deployment, the number of task slots equals the parallelism of the job, and each slot is executing one complete parallel slice of the application. Each parallel instance of an operator chain will correspond to a task. So in the simple example above, the source, map, and sink can all be chained together and run in a single task. If you deploy this job with a parallelism of two, then there will be two tasks. But you could disable the chaining, and run each operator in its own task, in which case you'd be using six tasks to run the job with a parallelism of two.
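To make that concrete, here is a minimal sketch using Flink's DataStream API (the job is illustrative, and fromSequence assumes a reasonably recent Flink release):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ChainingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2);          // two parallel slices of the pipeline

            env.fromSequence(0, 999)        // source
               .map(n -> n * 2)             // map (chained with source and sink
                                            // by default: 2 tasks in total)
               //.disableChaining()         // uncomment to run map in its own
                                            // task: 3 operators x parallelism 2
               .print();                    // sink

            env.execute("chaining-example");
        }
    }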
Yes, with a KeyedStream, the number of distinct keys is an upper bound on the parallelism.
CEP can run in parallel if it is operating on a KeyedStream (in which case, the pattern matching is being done independently for each key).
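For instance, a sketch of keyed CEP (the SensorReading type and its fields are hypothetical; the pattern API is Flink's CEP library):

    import org.apache.flink.cep.CEP;
    import org.apache.flink.cep.PatternStream;
    import org.apache.flink.cep.pattern.Pattern;
    import org.apache.flink.cep.pattern.conditions.SimpleCondition;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // Hypothetical input: DataStream<SensorReading> readings.
    Pattern<SensorReading, ?> overheating = Pattern.<SensorReading>begin("high")
            .where(new SimpleCondition<SensorReading>() {
                @Override
                public boolean filter(SensorReading r) {
                    return r.temperature > 100.0;
                }
            });

    // Keying by sensor id lets the pattern matching run independently
    // per key, and hence in parallel across the keyed partitions.
    PatternStream<SensorReading> matches =
            CEP.pattern(readings.keyBy(r -> r.sensorId), overheating);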
I need functionality in Storm that I know (based on the docs) has not been implemented yet: adding more tasks at runtime, without having to start with a large initial number of tasks, because that might cause performance issues. After all, running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.
I know that the rebalance command can be used to add executors and worker processes at runtime, and there is a rule that #executors <= #tasks, which means the number of tasks has to be static at runtime. But I'm curious how hard it would be (if not impossible) to add this feature to Storm.
Is there a way to implement this functionality in Storm, or can it not be done at all? If there is a way, please give me a clue how to do it.
Not sure what you mean by "since those extra tasks run serially".
Tasks in Storm are used to exploit data parallelism. In theory, it's possible to add code to change the number of tasks at runtime. But it would be a huge change, and AFAIK there are no plans to add this feature.
Compare http://storm.apache.org/releases/1.0.3/Understanding-the-parallelism-of-a-Storm-topology.html
Because keys are assigned to tasks hash-based, changing the number of tasks would require rehashing all keys to new tasks. If an operator builds up key-based internal state, this state would need to be repartitioned by key and redistributed accordingly, too.
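A toy sketch of the issue (not Storm's actual implementation) -- a fields grouping effectively routes a key with a modulo hash, so the target task changes whenever the task count does:

    public class KeyRouting {
        // Simplified stand-in for how a fields grouping picks a task.
        static int route(Object key, int numTasks) {
            return Math.abs(key.hashCode() % numTasks);
        }

        public static void main(String[] args) {
            // The same key usually maps to a different task once numTasks
            // changes, stranding any per-key state held by the old task.
            System.out.println(route("user-42", 4));
            System.out.println(route("user-42", 5));
        }
    }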
I was trying to see what makes Apache Tez with Hive much faster than MapReduce with Hive.
I am not able to understand the DAG concept.
Does anyone have a good reference for understanding the architecture of Apache Tez?
The presentation from Hadoop Summit (slide 35) discusses how the DAG approach improves on the MapReduce paradigm:
http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
Essentially it allows higher-level tools (like Hive and Pig) to define their overall processing steps (aka workflow, aka Directed Acyclic Graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (Hive query, Pig job, etc.). Because the entire job's steps can be computed before execution time, the system can take advantage of caching intermediate job results in memory. Whereas in MapReduce, all intermediate data between MapReduce phases required writing to HDFS (disk), adding latency.
YARN also allows container reuse for Tez tasks, e.g., each server is chopped into multiple "containers" rather than "map" or "reduce" slots. For any given point in the job execution, this allows Tez to use the entire cluster for the map phases or the reduce phases as needed. Whereas in Hadoop v1, prior to YARN, the number of map slots (and reduce slots) was fixed/hard-coded at the platform level. Better utilization of all available cluster resources generally leads to faster processing.
Apache Tez represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.
Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance; Tez provides that framework. Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph:
… with vertices in the graph representing application logic and edges representing movement
of data. A rich dataflow definition API allows users to express complex query logic in an
intuitive manner and it is a natural fit for query plans produced by higher-level
declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as
a single Tez job that will run the entire computation. Expanding this logical graph into a
physical graph of tasks and executing it is taken care of by Tez.
The "Data Processing API in Apache Tez" blog post describes a simple Java API used to express a DAG of data processing. The API has three components:
• DAG: This defines the overall job. The user creates a DAG object for each data processing job.
• Vertex: This defines the user logic and the resources & environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.
• Edge: This defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
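A rough sketch of how the three pieces fit together (the processor classes are hypothetical, and edgeProperty is defined in the sketch after the edge-property list below):

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    // One DAG object for the whole job.
    DAG dag = DAG.create("word-count");

    // One Vertex per processing step, wrapping the user logic and its
    // requested parallelism (here 4 tokenizer tasks and 2 summer tasks).
    Vertex tokenizer = Vertex.create("tokenizer",
            ProcessorDescriptor.create(TokenProcessor.class.getName()), 4);
    Vertex summer = Vertex.create("summer",
            ProcessorDescriptor.create(SumProcessor.class.getName()), 2);

    // One Edge connecting producer to consumer; edgeProperty defines how
    // data is routed, scheduled, and stored (see below).
    dag.addVertex(tokenizer)
       .addVertex(summer)
       .addEdge(Edge.create(tokenizer, summer, edgeProperty));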
Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and define how to route data between the tasks. Tez also allows the parallelism of each vertex to be defined via user guidance, data size, and resources.
Data movement: Defines routing of data between tasks.
• One-To-One: Data from the ith producer task routes to the ith consumer task.
• Broadcast: Data from a producer task routes to all consumer tasks.
• Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
Scheduling: Defines when a consumer task is scheduled.
• Sequential: Consumer task may be scheduled after a producer task completes.
• Concurrent: Consumer task must be co-scheduled with a producer task.
Data source: Defines the lifetime/reliability of a task output.
• Persisted: Output will be available after the task exits. Output may be lost later on.
• Persisted-Reliable: Output is reliably stored and will always be available.
• Ephemeral: Output is available only while the producer task is running.
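Combining one option from each of those three axes yields the edge property. A sketch of a shuffle-style edge (the I/O descriptor class names come from Tez's runtime library):

    import org.apache.tez.dag.api.EdgeProperty;
    import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
    import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
    import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
    import org.apache.tez.dag.api.InputDescriptor;
    import org.apache.tez.dag.api.OutputDescriptor;

    // A shuffle-style edge: shards are scattered/gathered, the producer
    // output is persisted, and consumers start after producers finish.
    EdgeProperty edgeProperty = EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,
            DataSourceType.PERSISTED,
            SchedulingType.SEQUENTIAL,
            OutputDescriptor.create(
                    "org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput"),
            InputDescriptor.create(
                    "org.apache.tez.runtime.library.input.OrderedGroupedKVInput"));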
Additional details on Tez architecture are presented in this Apache Tez Design Doc.
I am not yet using Tez but I have read about it. I think the two main reasons that make Hive run faster over Tez are:
Tez shares data between MapReduce-style stages in memory when possible, avoiding the overhead of writing to/reading from HDFS.
With Tez you can run multiple map/reduce DAGs defined in Hive in one Tez session, without needing to start a new Application Master each time.
You can find a list of links that will help you to understand Tez better here: http://hortonworks.com/hadoop/tez/
Tez has a DAG (directed acyclic graph) architecture. A typical MapReduce job has the following steps:
Read data from file --> first disk access
Run mappers
Write map output --> second disk access
Run shuffle and sort --> read map output, third disk access
Write shuffle and sort output --> write sorted data for reducers --> fourth disk access
Run reducers, which read the sorted data --> fifth disk access
Write reducer output --> sixth disk access
Tez works very similarly to Spark:
Execute the plan, but with no need to read data from disk.
Once ready to do some calculations (similar to actions in Spark), get the data from disk, perform all steps, and produce output.
Only one read and one write.
Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disk). On top of that there is vectorization (processing a batch of rows instead of one row at a time). All this adds up to efficiencies in query time.
References:
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey
https://community.hortonworks.com/questions/83394/difference-between-mr-and-tez.html
The main difference between MR and Tez is that MR writes intermediate data to local disk, whereas Tez executes the mapper/reducer functionality in a single process per container, keeping intermediate data in memory. Moreover, Tez performs operations much like transformations and actions in Spark.
What is the benefit of using multiple tasks in an executor in a Storm topology? I mean, I couldn't understand whether, apart from doing multiple things, we can achieve any speed-up or parallelism.
Michael G. Noll wrote a great tutorial that should help you understand Storm parallelism.
Usually a topology runs one task per executor. However, since you cannot increase the number of tasks while a topology is running, you can declare multiple tasks per executor in order to scale up parallelism over time.
There is no specific use case for having multiple tasks per executor other than the possibility of increasing the topology's parallelism.
I'm trying to learn Twitter Storm by following the great article "Understanding the parallelism of a Storm topology".
However, I'm a bit confused by the concept of a "task". Is a task a running instance of a component (spout or bolt)? An executor having multiple tasks actually means the same component is executed multiple times by the executor, am I correct?
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what is contributed to the parallelism by an executor (thread) having multiple tasks? I think having multiple tasks in a thread, since a thread executes sequentially, only makes the thread a kind of "cached" resource, which avoids spawning a new thread for the next task run. Am I correct?
I may clear up this confusion by myself after taking more time to investigate, but you know, we both love Stack Overflow ;-)
Thanks in advance.
Disclaimer: I wrote the article you referenced in your question above.
However, I'm a bit confused by the concept of a "task". Is a task a running instance of a component (spout or bolt)? An executor having multiple tasks actually means the same component is executed multiple times by the executor, am I correct?
Yes, and yes.
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what is contributed to the parallelism by an executor (thread) having multiple tasks?
Running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.
As I wrote in the article, please note that:
The number of executor threads can be changed after the topology has been started (see storm rebalance command).
The number of tasks of a topology is static.
And by definition there is the invariant of #executors <= #tasks.
So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated you can then storm rebalance the topology to make full use of all 25 boxes without any downtime.
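As a sketch (component names in the style of the article; the GreenBolt class is hypothetical), you would over-provision tasks at submission time and later raise the executor count with storm rebalance:

    // Start with 4 executors but 8 tasks, leaving headroom to scale the
    // executor count up to 8 later without resubmitting the topology.
    builder.setBolt("green-bolt", new GreenBolt(), 4)
           .setNumTasks(8)
           .shuffleGrouping("blue-spout");

    // Later, once the new boxes are in the cluster (run from the shell):
    //   storm rebalance mytopology -n 10 -e green-bolt=8
    // -n sets the number of workers, -e the executors for a component.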
Another reason to run 2+ tasks per executor is for (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.
In practice, we normally run 1 task per executor.
PS: Note that Storm will actually spawn a few more threads behind the scenes. For instance, each executor has its own "send thread" that is responsible for handling outgoing tuples. There are also "system-level" background threads for e.g. acking tuples that run alongside "your" threads. IIRC the Storm UI counts those acking threads in addition to "your" threads.