I'm trying to understand slot sharing and parallelism in Flink using the WordCount example.
Say I need to run the word count job in Flink, with only one data source and only one sink.
In this case, can I make a design just like the image above? I mean, I set two sub-tasks on Source + map() and two sub-tasks on keyBy()/window()/apply(); in other words, I have two chains, A --- B --- Sink and C --- D --- Sink, so that I get better performance.
For example, suppose a data stream arrives: aaa, bbb, aaa. With the design above, the following might happen: aaa and bbb go into A --- B while the other aaa goes into C --- D, and at the Sink I finally get the result aaa: 2, bbb: 1. Am I right so far?
If so, I know that subtasks of the same task cannot share a slot, so does that mean A and C can't share a slot, and B and D can't share a slot? How do I assign the slots? Should I put A + B + Sink into one slot and C + D into another slot?
Slot sharing is enabled by default. With slot sharing enabled, the number of slots required is the same as the parallelism of the task with the highest parallelism (which is two in this case).
In this example the scheduler will put A + B + Sink into one slot, and C + D into another. This isn't something you normally need to configure or even give much thought to, as the defaults work well in most cases.
If you were to completely disable slot sharing, then this job would require 5 slots, one for each of A, B, C, D, and the sink. But disabling slot sharing is almost never a good idea. Just make sure each slot has sufficient resources to run all of the subtasks concurrently.
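For concreteness, here is a minimal sketch of such a job using the PyFlink DataStream API (the element values come from the question; everything else is illustrative, not the asker's actual code). With parallelism 2 and default slot sharing, it needs exactly 2 slots:

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)  # highest parallelism in the job => 2 slots suffice

words = env.from_collection(["aaa", "bbb", "aaa"], type_info=Types.STRING())

counts = (
    words
    .map(lambda w: (w, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
    .key_by(lambda t: t[0], key_type=Types.STRING())
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
)

counts.print()  # the sink
env.execute("wordcount-slot-sharing-example")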
I am using Apache Beam on GCP Dataflow. I want to use a PCollection multiple times, but I'm worried that it might recompute an expensive PCollection. I can't find a "materialize" or "cache" transform in the Apache Beam documentation.
import apache_beam as beam
# Set up a pipeline and read in a PCollection
p = beam.Pipeline()
input_data = p | beam.io.ReadFromText('input.txt')
reused_data = input_data | beam.Map(some_expensive_function)
# Write the outputs to different files; unique labels are needed because
# the same transform type is applied twice
reused_data | 'write output1' >> beam.io.WriteToText('output1.txt')
reused_data | 'write output2' >> beam.io.WriteToText('output2.txt')
# Run the pipeline
p.run()
What will happen here? Will it recompute my data or will it cache my data? What if I don't have enough memory on my machines?
In the pipeline as written, nothing will be cached or re-computed (modulo failure recovery). Though the details are left up to the runner, most runners do what is called fusion. In particular, what will happen in this case is roughly:
1. Get the first element from input.txt.
2. Apply some_expensive_function, resulting in some element X.
3. Write X to output1.txt.
4. Write X to output2.txt.
5. Go back to step 1 for the next element.
If there were other DoFns between steps 2 and 3/4, they would be applied element by element, and their outputs fully handled, before going back to step 1 to start on the next element. At no point is the full reused_data PCollection materialized; it's only materialized one element at a time (possibly in parallel across many workers, of course).
If for some reason fusion is not possible (this sometimes happens with conflicting resource constraints or side inputs), the intermediate data is implicitly materialized to disk rather than re-computed.
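There is no explicit "cache" transform in Beam, but if you ever want to break fusion and force the intermediate PCollection to be materialized, the usual idiom is beam.Reshuffle(). A sketch built on the question's pipeline (some_expensive_function remains the question's placeholder):

import apache_beam as beam

p = beam.Pipeline()
input_data = p | beam.io.ReadFromText('input.txt')

# Reshuffle acts as a fusion barrier: most runners materialize the data
# here, so the two writes consume the materialized result instead of
# being fused with the expensive Map.
reused_data = (
    input_data
    | beam.Map(some_expensive_function)
    | beam.Reshuffle()
)

reused_data | 'write output1' >> beam.io.WriteToText('output1.txt')
reused_data | 'write output2' >> beam.io.WriteToText('output2.txt')
p.run()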
In this case reused_data is computed once, and then the same PCollection and data will be written to the two GCS buckets.
            reused_data
            /         \
           /           \
Write GCS bucket 1   Write GCS bucket 2
Each sink traverses the reused_data PCollection to write the results to its Cloud Storage bucket.
If you have to apply expensive processing to your input PCollection, I recommend using the Dataflow runner instead of the DirectRunner on your local machine.
The Dataflow runner processes your data in parallel, autoscaling across multiple Compute Engine VMs if necessary.
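For example, here is a minimal sketch of pointing the same kind of pipeline at Dataflow; the project, region, and bucket names are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
)

with beam.Pipeline(options=options) as p:
    lines = p | beam.io.ReadFromText('gs://my-bucket/input.txt')
    lines | beam.io.WriteToText('gs://my-bucket/output')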
As Mazlum Tosun already answered, your PCollection reused_data is written twice. However, I want to point out that the PCollection may be passed along only as a pointer. Consequently, this might lead to incorrect behavior if you start to manipulate your PCollection in one of the branches of your pipeline.
For example, if you run this code (e.g., here)
import apache_beam as beam

class ManipulatePcoll(beam.DoFn):
    def process(self, element):
        element[1] = 55
        yield element

with beam.Pipeline() as pipeline:
    main = (
        pipeline
        | "init main" >> beam.Create([[1, 2, 3]])
    )

    # pipeline branch 1
    (
        main
        | "print original result" >> beam.Map(print)
    )

    # pipeline branch 2
    (
        main
        | beam.ParDo(ManipulatePcoll())
        | "print manipulated result" >> beam.Map(print)
    )
you get as a result
[1, 55, 3]
[1, 55, 3]
since main in branch 1 points to the same memory as in branch 2.
However, there are cases for which the PCollection is actually serialized and copied, e.g. when distributing data between different workers (see here for a full list).
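Note that the Beam model requires DoFns not to mutate their input elements, so the safe version of the DoFn above emits a copy instead:

class ManipulatePcoll(beam.DoFn):
    def process(self, element):
        out = list(element)  # copy, so the other branch keeps [1, 2, 3]
        out[1] = 55
        yield out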
Regarding your second question, allow me to cite the Beam documentation and programming guide:
A PCollection is a large, immutable “bag” of elements. There is no upper limit on how many elements a PCollection can contain; any given PCollection might fit in memory on a single machine, or it might represent a very large distributed data set backed by a persistent data store.
I have a pool of 10 employees who operate two machines. 7 of these employees can operate machine A better than the remaining 3, who are better at operating machine B.
Example: Worker XYZ works on machine A, where he performs better. On machine B, however, a worker is missing because all 3 workers there are overloaded. Worker XYZ should now leave his prioritized work on machine A and move to machine B. As soon as the workload there decreases again, worker XYZ should return to his prioritized machine A.
Does anyone have an idea how to implement this kind of prioritization? Ideally, the prioritization would be based on a quality matrix, so that it could take not just boolean values but also doubles.
I had two approaches so far:
1.) Create a single resource for each worker and run a function query for priority whenever a Seize block receives an agent, or
2.) Use the maintenance block with different priorities for different tasks and try to assign them to the individual employees.
Unfortunately, I'm not really getting anywhere with either approach. Can anyone help me? Thank you very much in advance.
I have a workflow constructed in Flink that consists of a custom source, a series of maps/flatmaps and a sink.
The run() method of my custom source iterates through the files stored in a folder and collects, through the collect() method of the context, the name and the contents of each file (I have a custom object that stores this info in two fields).
I then have a series of maps/flatmaps transforming these objects, which are then printed into files using a custom sink. The execution graph, as produced in Flink's Web UI, is the following:
I have a cluster of 2 workers, set up with 6 slots each (they both have 6 cores, too). I set the parallelism to 12. From the execution graph I see that the parallelism of the source is 1, while the rest of the workflow has parallelism 12.
When I run the workflow (I have around 15K files in the dedicated folder), I monitor the resources of my workers using htop. All the cores reach up to 100% utilisation most of the time, but roughly every 30 minutes or so, 8-10 of the cores go idle for about 2-3 minutes.
My questions are the following:
I understand that the source runs with parallelism 1, which I believe is normal when reading from local storage (my files are located in the same directory on each worker, as I don't know which worker will be selected to execute the source). Is it indeed normal? Could you please explain why this is the case?
The rest of my workflow executes with parallelism 12, which looks correct: checking the task managers' logs, I get prints from all the slots (e.g., .... [Flat Map -> Map -> Map -> Sink: Unnamed (3/12)] INFO ...., .... [Flat Map -> Map -> Map -> Sink: Unnamed (5/12)] INFO ...., etc.). What I don't understand, though, is this: if one slot is executing the source and I have 12 slots in my cluster, how is the rest of the workflow executed by 12 slots? Is one slot acting as both the source and one instance of the rest of the workflow? If yes, how are the resources for this specific slot allocated? Could someone explain the steps this workflow goes through? For example (this might be wrong):
Slot 1 reads files and forwards them to available slots (2 to 12)
Slot 1 forwards one file to itself and stops reading until it finishes its job
When done, slot 1 reads more files and forwards them to slots that became available
I believe what I describe above is wrong, but I give it as an example to better explain my question.
Why do the majority of the cores go idle every 30 minutes (more or less) for about 3 minutes?
To answer the specific question about parallelizing your read, I would do the following...
Implement your custom source by extending RichParallelSourceFunction (a plain RichSourceFunction doesn't implement ParallelSourceFunction and is therefore always run with parallelism 1).
In your open() method, call getRuntimeContext().getNumberOfParallelSubtasks() to get the total parallelism and call getRuntimeContext().getIndexOfThisSubtask() to get the index of the sub-task being initialized.
In your run() method, as you iterate over files, get the hashCode() of each file name, modulo the total parallelism. If this is equal to your sub-task's index, then you process it.
In this way you can spread the work out over 12 sub-tasks, without having sub-tasks try to process the same file.
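To make the selection rule concrete, here is the idea sketched in Python (illustration only; the actual implementation would live in the Java source function described above). Note that the hash must be stable across processes, which is why this sketch uses CRC32 rather than Python's per-run randomized hash():

import zlib

def should_process(file_name: str, subtask_index: int, num_subtasks: int) -> bool:
    # Each file name maps to exactly one subtask index, so every file
    # is claimed by exactly one of the parallel source instances.
    return zlib.crc32(file_name.encode('utf-8')) % num_subtasks == subtask_index

# With parallelism 12, subtask 3 processes only "its" share of the files:
files = ['doc-001.txt', 'doc-002.txt', 'doc-003.txt']
mine = [f for f in files if should_process(f, subtask_index=3, num_subtasks=12)]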
The single-consumer setup limits the overall throughput of your pipeline to the performance of that one consumer. Additionally, it introduces a heavy shuffle to all slots: all the data read by the consumer also gets serialized in the consumer's slot, which is an additional CPU load. In contrast, setting the consumer parallelism equal to the map/flat map parallelism would allow Flink to chain the source -> map operations and avoid the shuffle.
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. So, in your case, slot 1 has both the consumer and map/flat map tasks, while the other slots have only map/flat map tasks. See here for more details: https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/runtime.html#task-slots-and-resources. Also, you can view the instances of each subtask in the Web UI.
Do you have checkpointing enabled? If yes, and if the interval is 30 minutes, then that is probably when the state gets snapshotted.
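As a point of reference, in PyFlink a 30-minute checkpoint interval would have been configured along these lines (sketch; the interval is in milliseconds):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(30 * 60 * 1000)  # snapshot state every 30 minutes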
Consider that we are going to compute the average of a number of temperature sensors over a given period of time, and this computation will be done in parallel using an SPE. Usually, this computation is done by at least four UDFs:
map -> keyBy -> window -> aggregate
If my keyBy operator is responsible for getting the ID of each sensor and I have only 2 sensors, a parallelism of 2 is enough for my application (disclaimer: I don't want to consider how large the window is or whether the tuples fit in memory for now).
If I have 1000 sensors, it would be very nice to increase the parallelism. Let's say to 100 nodes.
But what if my parallelism is set to 100 and I am processing tuples of only 2 sensors? Will I have 98 idle nodes? Do Spark, Flink, or Storm know that they don't have to shuffle data to those 98 nodes?
The motivation for my question is this other question.
What kind of application and scenario can I implement to show that the current Stream Processing Engines (Storm, Flink, Spark) don't know how to optimize parallelism internally in order to shuffle less data across the network?
Can they predict any characteristics of the data volume or variety? Or of the resources under the hood?
Thanks
The whole point of keyBy() is to distribute items with the same key to the same operator instance. If you have 2 keys, your items are literally split into 2 groups, and your maximum effective parallelism for this stream is 2: items with key A will be sent to one operator instance, and items with key B to another.
Within Flink, if you want to just distribute the processing of your items amongst all of the parallel operators then you can use DataStream::shuffle().
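A minimal PyFlink sketch of the difference (the sensor data is illustrative):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(100)

readings = env.from_collection([('sensor-1', 20.0), ('sensor-2', 21.5)])

# key_by: all records with the same sensor ID go to the same subtask,
# so with 2 distinct keys at most 2 of the 100 subtasks receive data.
by_sensor = readings.key_by(lambda r: r[0])

# shuffle: records are distributed randomly across all 100 subtasks,
# with no key-based grouping guarantee.
spread = readings.shuffle()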
I have to write a weighted load balancing algorithm and I'm looking for some references. Is there any book you can suggest for understanding such algorithms?
Thanks!
A simple algorithm here isn't that complicated.
Let's say you have a list of servers with the following weights:
A 10
B 20
C 30
where a higher weight means the server can handle more traffic.
Just divide the amount of traffic sent to each server by its weight and sort smallest to largest; the server that comes out on top gets the user (a short code sketch follows the example below).
For example, let's say each server starts at 10 users. Then the order is going to be:
C - 10 / 30 = 0.33
B - 10 / 20 = 0.50
A - 10 / 10 = 1.00
Which means the next 5 requests will go to server C. The 6th request will go to either C or B. The 7th will go to whichever one didn't handle the 6th.
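Here is that rule as a minimal Python sketch (server names, weights, and starting loads taken from the example above; ties break in dictionary order here):

def pick_server(loads, weights):
    """Return the server with the smallest current load-to-weight ratio."""
    return min(weights, key=lambda s: loads[s] / weights[s])

weights = {'A': 10, 'B': 20, 'C': 30}
loads = {'A': 10, 'B': 10, 'C': 10}  # each server starts with 10 users

for request in range(7):
    server = pick_server(loads, weights)
    loads[server] += 1  # the chosen server takes the next user
    print(f'request {request + 1} -> {server}')  # C, C, C, C, C, B, C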
To complicate things, you might want the balancer to be more intelligent, in which case it needs to keep track of how many requests are currently being serviced by each server and decrement the count when a request is completely fulfilled.
Further complications include adding stickiness to sessions, which means the balancer has to inspect each request for the session id and keep track of where that session went last time.
On the whole, though, if you can, just buy a product from a company that already does this.
Tomcat's balancer app and the tutorial here serve as good starting points.