In Apache Beam, can PCollections be reused? Will they be recomputed, or can they be cached?

I am using Apache Beam on GCP Dataflow. I want to use a PCollection multiple times but I'm worried that it might recompute an expensive PCollection. I can't find a "materialize" or "cache" transform in Apache Beam documentation.
import apache_beam as beam
# Set up a pipeline and read in a PCollection
p = beam.Pipeline()
input_data = p | beam.io.ReadFromText('input.txt')
reused_data = input_data | beam.Map(some_expensive_function)
# Write the outputs to different files
reused_data | 'write output1' >> beam.io.WriteToText('output1.txt')
reused_data | 'write output2' >> beam.io.WriteToText('output2.txt')
# Run the pipeline
p.run()
What will happen here? Will it recompute my data or will it cache my data? What if I don't have enough memory on my machines?

In the pipeline as written, nothing will be cached or recomputed (modulo failure recovery). Though the details are left up to the runner, most runners perform what is called fusion. In particular, what will happen in this case is roughly:
1. Get the first element from input.txt.
2. Apply some_expensive_function, resulting in some element X.
3. Write X to output1.txt.
4. Write X to output2.txt.
5. Go back to step 1.
If there were other DoFns between steps 2 and 3/4, they would be applied, element by element, and their outputs fully handled before going back to step 1 to start on the next element. At no point is the full reused_data PCollection materialized; it is only materialized one element at a time (possibly in parallel across many workers, of course).
If for some reason fusion is not possible (this sometimes happens with conflicting resource constraints or side inputs), the intermediate data is implicitly materialized to disk rather than recomputed.
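If you did want an explicit materialization point (for example, to prevent fusion of the expensive step with the two writes), one option in the Python SDK is a Reshuffle. This is only a hedged sketch reusing the question's names; whether it actually helps depends on the runner:
import apache_beam as beam

p = beam.Pipeline()
input_data = p | beam.io.ReadFromText('input.txt')

# Reshuffle breaks fusion, so the output of some_expensive_function is
# checkpointed/materialized by the runner before both writes consume it.
reused_data = (
    input_data
    | beam.Map(some_expensive_function)
    | beam.Reshuffle()
)

reused_data | 'write output1' >> beam.io.WriteToText('output1.txt')
reused_data | 'write output2' >> beam.io.WriteToText('output2.txt')
p.run()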

In this case reused_data is computed once, and then the same PCollection and data are written to the two GCS buckets:
              reused_data
               /       \
              /         \
             /           \
Write GCS bucket 1   Write GCS bucket 2
Each sink will traverse the reused_data PCollection to write the result to its Cloud Storage bucket.
If you have to apply expensive processing to your input PCollection, I recommend using the Dataflow runner instead of the DirectRunner on your local machine.
The Dataflow runner processes your data in parallel, with autoscaling and multiple Compute Engine VMs if necessary.

As Mazlum Tosun already answered, your PCollection reused_data is written twice. However, I wanted to point out that the PCollection may only be passed along as a reference (a pointer). Consequently, this might lead to incorrect behavior if you start to manipulate your PCollection in one of the branches of your pipeline.
For example, if you run this code (e.g., here)
import apache_beam as beam

class ManipulatePcoll(beam.DoFn):
    def process(self, element):
        element[1] = 55  # mutates the input element in place
        yield element

with beam.Pipeline() as pipeline:
    main = (
        pipeline
        | "init main" >> beam.Create([[1, 2, 3]])
    )

    # pipeline branch 1
    (
        main
        | "print original result" >> beam.Map(print)
    )

    # pipeline branch 2
    (
        main
        | beam.ParDo(ManipulatePcoll())
        | "print manipulated result" >> beam.Map(print)
    )
you get as a result
[1, 55, 3]
[1, 55, 3]
since main in branch 1 points to the same memory as in branch 2.
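To avoid that pitfall, the Beam programming guide asks that a DoFn not mutate its input element; a minimal sketch of a safe variant is to emit a modified copy instead:
class ManipulatePcollSafely(beam.DoFn):
    def process(self, element):
        # Copy the element instead of mutating the shared input in place.
        modified = list(element)
        modified[1] = 55
        yield modified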
However, there are cases for which the PCollection is actually serialized and copied, e.g. when distributing data between different workers (see here for a full list).
Regarding your second question, let me cite the Beam documentation and programming guide:
A PCollection is a large, immutable “bag” of elements. There is no
upper limit on how many elements a PCollection can contain; any given
PCollection might fit in memory on a single machine, or it might
represent a very large distributed data set backed by a persistent
data store.

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input; it will actually receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. Input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, as checking the whole log (text) file every time it writes is certainly not a suitable solution. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input data can be really large, I don't think that's the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a huge bitmap of 10 billion bits in memory. However, this takes a lot of RAM: about 1.2 GiB. Moreover, since this data structure is big, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check whether a value has already been seen, check the corresponding bit in the bitmap and then set it (atomically if done in parallel).
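As a rough single-threaded Python sketch of that check-then-set logic (the 10-billion-bit size is taken from the figure above; a parallel/atomic variant would need lower-level primitives than plain Python offers):
NUM_BITS = 10_000_000_000              # bitmap size assumed from the figure above
bitmap = bytearray(NUM_BITS // 8)      # ~1.2 GiB, one bit per possible value

def seen_before(value):
    # Check the bit for this value, then set it.
    byte_index, bit = divmod(value, 8)
    mask = 1 << bit
    already_seen = bitmap[byte_index] & mask
    bitmap[byte_index] |= mask
    return bool(already_seen)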
If you know that your stream contains fewer than about a million integers, or that the stream of random integers is not uniformly distributed, you can use a hash-set data structure instead, as it stores the data more compactly (at least in the sequential case).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()                  # set of values seen so far
for value in inputStream:     # iterate over the stream values
    if value not in seen:     # O(1) lookup
        log.write(value)      # value not seen before: write it
        seen.add(value)       # O(1) insertion

Flink workflow parallelism with custom source

I have a workflow constructed in Flink that consists of a custom source, a series of maps/flatmaps and a sink.
The run() method of my custom source iterates through the files stored in a folder and collects, through the collect() method of the context, the name and the contents of each file (I have a custom object that stores this info in two fields).
I then have a series of maps/flatmaps transforming these objects, which are then written to files using a custom sink. The execution graph, as produced in Flink's Web UI, shows the source task followed by the chained Flat Map -> Map -> Map -> Sink task.
I have a cluster of 2 workers, set up to have 6 slots each (they both have 6 cores, too). I set the parallelism to 12. From the execution graph I see that the parallelism of the source is 1, while the rest of the workflow has parallelism 12.
When I run the workflow (I have around 15K files in the dedicated folder) I monitor the resources of my workers using htop. All the cores reach up to 100% utilisation most of the time, but roughly every 30 minutes or so, 8-10 of the cores become idle for about 2-3 minutes.
My questions are the following:
I understand that the source runs with parallelism 1, which I believe is normal when reading from local storage (my files are located in the same directory on each worker, as I don't know which worker will be selected to execute the source). Is it indeed normal? Could you please explain why this is the case?
The rest of my workflow is executed with parallelism 12, which looks correct, since by checking the task managers' logs I get prints from all the slots (e.g., .... [Flat Map -> Map -> Map -> Sink: Unnamed (**3/12**)] INFO ...., .... [Flat Map -> Map -> Map -> Sink: Unnamed (**5/12**)] INFO ...., etc.). What I don't understand, though, is: if one slot is executing the source and I have 12 slots in my cluster, how is the rest of the workflow executed by 12 slots? Is one slot acting for both the source and one instance of the rest of the workflow? If yes, how are the resources for this specific slot allocated? Could someone explain the steps taking place in this workflow? For example (this might be wrong):
Slot 1 reads files and forwards them to available slots (2 to 12)
Slot 1 forwards one file to itself and stops reading until it finishes its job
When done, slot 1 reads more files and forwards them to slots that became available
I believe what I describe above is wrong but I give it as an example to better explain my question
Why do I have this idle state for the majority of the cores roughly every 30 minutes, lasting about 3 minutes?
To answer the specific question about parallelizing your read, I would do the following...
Implement your custom source by extending the RichSourceFunction.
In your open() method, call getRuntimeContext().getNumberOfParallelSubtasks() to get the total parallelism and call getRuntimeContext().getIndexOfThisSubtask() to get the index of the sub-task being initialized.
In your run() method, as you iterate over files, get the hashCode() of each file name, modulo the total parallelism. If this is equal to your sub-task's index, then you process it.
In this way you can spread the work out over 12 sub-tasks, without having sub-tasks try to process the same file.
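The partitioning logic itself is tiny. Here is a hedged sketch of just the hash-modulo selection (shown in Python for illustration only; in the actual source you would use the Flink RichSourceFunction methods named above):
def files_for_subtask(all_files, subtask_index, num_subtasks):
    # Keep only the files whose name hashes to this sub-task's index,
    # so no two sub-tasks ever process the same file.
    return [f for f in all_files
            if hash(f) % num_subtasks == subtask_index]

# e.g. the sub-task with index 3 out of 12 parallel sub-tasks:
# my_files = files_for_subtask(all_files, 3, 12)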
The single-consumer setup limits the overall throughput of your pipeline to the performance of that one consumer. Additionally, it introduces a heavy shuffle to all slots: all the data read by the consumer also gets serialized on that consumer's slot, which is additional CPU load. In contrast, having the consumer parallelism equal to the map/flatmap parallelism would allow chaining the source -> map operations and avoiding the shuffle.
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, so long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. So, in your case slot 1 has both the consumer and map/flatmap tasks, while the other slots have only map/flatmap tasks. See here for more details: https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/runtime.html#task-slots-and-resources. Also, you can view the instances of each subtask in the Web UI.
Do you have checkpointing enabled? If yes, and the interval is 30 minutes, then that is probably when the state gets snapshotted.

K-way merge sort divided on multiple hosts

I have ~8000 files, with ~6TB data on disk. Each file contains a list of key value pairs, and I wish to consolidate those values into a single list of sorted key-value pairs (e.g. so if key A occurs in two files, the consolidated file contains the key A once and that key contains all values from the two files).
I have implemented this k-way merge for a single core on a single host in Python [gist -- see this thread for a nice intuitive overview of the procedure]. I now wish to distribute the work over multiple hosts that do not have shared memory but can have shared network access.
The key space that I need to sort is absolutely enormous, roughly 26^24, but the vast majority of keys are not present in the data (so it doesn't make sense to give each worker a set of keys with which to concern themselves).
Do others have any ideas how one could go about implementing a distributed k-way merge algorithm? This strikes me as entirely non-trivial, but there may be low hanging fruit that I'm not seeing. Any pointers others can offer would be greatly appreciated.
Notes
The compute setup is parameterizable. I'm working on two compute clusters, each of which will allow me to use ~10-1000 nodes concurrently, each with 12-24 cores and ~120GB RAM. The machines come online at some indeterminate time after they're requested. Network communication happens over TCP. Disks are SSDs with an AFS filesystem, and storage is abundant.
Additionally, I'm using a simple Python package big-read to read only n lines from each of the 8,000 files into RAM at any given time, so RAM management for an "external sort" is already tractable...
Highly related: K-way merge with stxxl.
A distributed sort/merge is very similar to a sort/merge on a single host. The basic idea is to split the files among the separate hosts. Have each host sort its individual files and then begin the merge operation that I described in Divide key value pairs into equal lists without access to key value counts. So each host has a priority queue containing the next item from each of the files that it sorted.
One of the hosts maintains a priority queue that contains the next item from each of the other hosts. It selects the first one from that queue, outputs it, and polls the host it came from for the next item, which it inserts into the priority queue and continues.
It's a priority queue of priority queues, distributed among multiple hosts. Graphically, it looks something like this:
    Host1             Host2              Host3               Host4
---------------------------------------------------------------------
F1 F2 F3 F4       F5 F6 F7 F8       F9 F10 F11 F12     F13 F14 F15 F16
 \  |  |  /        \  |  |  /         \  |  |  /          \  |  |  /
  ---------         ---------          ----------          ----------
     PQ1               PQ2                PQ3                  PQ4
       \                 \                /                    /
        \                 \              /                    /
         \                 \            /                    /
          \                 \          /                    /
           -----------------\        /----------------------
                              \    /
                               \  /
                                --
                             Master PQ
                          on primary host
Now, it's highly inefficient to be requesting a single item at a time from the individual hosts. The primary host could request, say, 1,000 items from each host and hold them in individual buffers. Whenever a host's buffer runs out, the primary host requests another buffer full from the host. That will reduce the amount of network traffic.
This also reduces I/O on the individual hosts: you never have to write the combined files to disk. You sort the individual files and write them to disk as described in my earlier answer, but then you begin the merge on the individual hosts and send the items to the primary host that does the big merge.
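As a rough Python sketch of the primary host's side of this (assuming each host exposes an iterator that yields its already-merged (key, values) pairs in key order; the buffering and networking are left out):
import heapq
import itertools

def merge_host_streams(host_streams, output_path):
    # host_streams: one iterator of (key, values) pairs per host, in key order.
    # heapq.merge keeps a small priority queue over the head of each stream --
    # the "priority queue of priority queues" described above.
    merged = heapq.merge(*host_streams, key=lambda kv: kv[0])
    with open(output_path, 'w') as out:
        # Consolidate runs of equal keys coming from different hosts.
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            values = [v for _, vs in group for v in vs]
            out.write(f"{key}\t{values}\n")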
This is an already solved problem. Most mapreduce frameworks, such as Hadoop, do a distributed sort under the hood. The best ones will come complete with logic to detect failing machines, take them out, and redo their work. (When you're working with large distributed systems at scale, compensating for machine failure is important.) Just find a good framework and use it, rather than re-inventing the wheel.
As for how they sort it, I understand that the standard approach is a mergesort. At first you hand out chunks of work that look like "Sort this block." Then you start handing out chunks of work that look like "Merge these chunks together." The tricky bit comes when the chunks to merge do not fit on a single computer. Then you need to take a group of chunks, figure out where to partition it, and merge the pieces. I am not positive how they accomplish that. My best off-the-cuff idea would be to take something like an every-thousandth-element subselection, sort it, partition it, and then tell each machine that holds the full data where to cut its dataset into chunks and whom to send the data to for the merging.
However it is done, you will eventually wind up with an ordered set of machines, each of which has an ordered section of data, and between them you have the full data all sorted.
IMPORTANT: When dealing with large distributed data sets, it is very important to avoid creating bottlenecks anywhere. Implicitly or explicitly. You start with distributed data. You process it in a distributed way. You wind up with distributed data. Always.
Does each of the 8000 files first need to be sorted by key, or are they already sorted by key? If they first need to be sorted, that initial phase will be CPU bound. This initial sort of the files can be done in parallel (and multi-threaded, e.g. with GNU sort). After this point, the process normally becomes file-I/O bound during the merge steps, but if the file I/O on the SSDs can be done independently, then the merge phases can also be done in parallel, using groups of SSDs. Eventually, a final merge to produce a single sorted file will be file-I/O bound, and there would be no advantage in attempting a parallel implementation of it.
If your compare method is not very complicated, the bottleneck is most likely the file I/O, and it will get worse when you do it over a network rather than on a local hard drive (but you can only be sure after profiling).
I would recommend:
Load the data in big chunks into RAM (as big as you can), use quicksort on each chunk to sort it in RAM, and write each sorted chunk as one file to disk.
Use your k-way merge to merge these big sorted files.
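A hedged sketch of that chunk-sort-then-merge approach (Python's built-in sort stands in for quicksort; line-oriented text files and the chunk size are assumptions, and the question's per-key value consolidation is omitted):
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000_000):
    # Pass 1: sort fixed-size chunks in RAM, writing each as its own temp file.
    chunk_files = []
    with open(input_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.TemporaryFile(mode='w+')
            tmp.writelines(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)
    # Pass 2: k-way merge of the sorted chunk files.
    with open(output_path, 'w') as out:
        out.writelines(heapq.merge(*chunk_files))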

Spark Time Series - Custom Group By 10 Minute Intervals: Improve Performance

We have time series data (timestamp in us since 1970 and integer data value):
# load data and cache it
df_cache = readInData() # read data from several files (paritioned by hour)
df_cache.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
df_cache.agg({"data": "max"}).collect()
# now data is cached
df_cache.show()
+--------------------+---------+
| time| data|
+--------------------+---------+
|1.448409599861109E15|1551.7468|
|1.448409599871109E15|1551.7463|
|1.448409599881109E15|1551.7468|
Now we want to calculate some non-trivial things on top of 10 Minute time windows using an external python library. In order to do so, we need to load the data of each time frame in memory, apply the external function and store the result. Therefore a User Defined Aggregate Function (UDAF) is not possible.
Now the problem is, when we apply the GroupBy to the RDD, it is very slow.
(df_cache.rdd
    .groupBy(lambda x: int(x.time / 600e6))  # create 10-minute groups
    .map(lambda x: 1)                        # do some calculations, e.g. external library
    .collect())                              # get results
For 120 million samples (100 Hz data) on two nodes with 6 GB RAM, this operation takes around 14 minutes. Spark details for the groupBy stage:
Total Time Across All Tasks: 1.2 h
Locality Level Summary: Process local: 8
Input Size / Records: 1835.0 MB / 12097
Shuffle Write: 1677.6 MB / 379
Shuffle Spill (Memory): 79.4 GB
Shuffle Spill (Disk): 1930.6 MB
If I use a simple python script and let it iterate over the input files, it takes way less time to finish.
How can this job be optimized in spark?
The groupBy is your bottleneck here: it needs to shuffle the data across all partitions, which is time-consuming and takes a hefty amount of memory, as you can see from your metrics.
The way to go here is to use the reduceByKey operation, chaining it as follows:
df_cache.rdd.map(lambda x: (int(x.time / 600e6), (x.time, x.data))).reduceByKey(lambda x, y: 1).collect()
The key takeaway here is that groupBy needs to shuffle all of your data across all partitions, whereas reduceByKey first reduces within each partition and then across all partitions, drastically reducing the size of the global shuffle. Notice how I organized the input into a (key, value) pair to take advantage of the reduceByKey operation.
As I mentioned in the comments, you might also want to try your program using Spark SQL's DataFrame abstraction, which can potentially give you an extra boost thanks to its optimizer.
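For the DataFrame route, a hedged sketch (assuming Spark 2.0+ and that time is microseconds since 1970, as in the sample; the max() is only a stand-in, since the external-library call would still need a UDF or collecting each window):
from pyspark.sql import functions as F

df = df_cache.withColumn(
    "ts", (F.col("time") / 1e6).cast("timestamp")   # microseconds -> timestamp
)

windowed = (
    df.groupBy(F.window("ts", "10 minutes"))        # 10-minute tumbling windows
      .agg(F.max("data").alias("max_data"))         # placeholder aggregation
)
windowed.show()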

Scalable seq -> groupby -> count

I have a very large unordered sequence of int64s - about O(1B) entries. I need to generate the frequency histogram of the elements, ie:
inSeq
|> Seq.groupBy (fun x->x)
|> Seq.map (fun (x,l) -> (x,Seq.length l))
Let's assume I have only, say 1GB of RAM to work with. The full resulting map won't fit into RAM (nor can I construct it on the fly in RAM). So, of course we're going to have to generate the result on disk. What are some performant ways for generating the result?
One approach I have tried is partitioning the range of input values and computing the counts within each partition via multiple passes over the data. This works fine but I wonder if I could accomplish it faster in a single pass.
One last note is that the frequencies are power-law distributed, i.e. most of the items in the list appear only once or twice, but a very small number of items might have counts over 100k or 1M. This suggests possibly maintaining some sort of LRU map where common items are held in RAM and uncommon items are dumped to disk.
F# is my preferred language but I'm ok working with something else to get the job done.
If you have enough disk space for a copy of the input data, then your multiple passes idea really requires only two. On the first pass, read an element x and append it to a temporary file hash(x) % k, where k is the number of shards (use just enough to make the second pass possible). On the second pass, for each temporary file, use main memory to compute the histogram of that file and append that histogram to the output. Relative to the size of your data, one gigabyte of main memory should be enough buffer space that the cost will be approximately the cost of reading and writing your data twice.
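A hedged Python sketch of that two-pass scheme (Python rather than F#, since the question allows other languages; the shard count, paths, and line-per-integer text format are assumptions):
import os
from collections import Counter

def histogram_two_pass(input_path, output_path, shard_dir="shards", k=256):
    os.makedirs(shard_dir, exist_ok=True)
    # Pass 1: append each value to the shard file chosen by hash(x) % k.
    shards = [open(os.path.join(shard_dir, f"shard-{i}.txt"), "w") for i in range(k)]
    with open(input_path) as f:
        for line in f:
            shards[hash(int(line)) % k].write(line)
    for s in shards:
        s.close()
    # Pass 2: each shard now fits in memory, so count it with a Counter.
    with open(output_path, "w") as out:
        for i in range(k):
            with open(os.path.join(shard_dir, f"shard-{i}.txt")) as s:
                counts = Counter(int(line) for line in s)
            for value, count in counts.items():
                out.write(f"{value}\t{count}\n")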
