I am running my own gRPC server that collects events from various data sources. The server is written in Go, and all the event sources send their events in a predefined format as a protobuf message.
What I want to do is to process all these events with Apache Beam in memory.
I looked through the docs of Apache Beam and couldn't find a sample that does something like I want. I'm not going to use Kafka, Flink or any other streaming platform, just process the messages in memory and output the results.
Can someone point me in the right direction to start coding a simple stream processing app?
OK, first of all, Apache Beam is not a data processing engine. It's an SDK that lets you create a unified pipeline and run it on different engines, such as Spark, Flink, Google Dataflow, etc. So, to run a Beam pipeline you need to use one of the supported data processing engines or the DirectRunner, which runs your pipeline locally but (!) has many limitations and was mostly developed for testing purposes.
Like every Beam pipeline, yours needs a source transform (bounded or unbounded) that reads data from your data source. I guess that in your case this will be your gRPC server, which should retransmit the collected events. For the source transform, you can either use one of the already implemented Beam IO transforms (IO connectors) or create your own, since Beam has no GrpcIO or anything similar for now.
Regarding processing data in memory, I'm not sure I fully understand what you mean. It will mostly depend on the data processing engine you use, since in the end your Beam pipeline is translated into, for example, a Spark or Flink pipeline (if you use SparkRunner or FlinkRunner, respectively) before it actually runs, and the engine then manages the pipeline workflow. Most modern engines do their best to keep all processed data in memory and flush it to disk only as a last resort.
Related
I already have a streaming pipeline written in Apache Beam. Previously I ran it on Google Dataflow, where it ran like a charm. Now, due to changing business needs, I need to run it with the Flink runner.
The Beam version currently in use is 2.38.0 and the Flink version is 1.14.5. I checked this and found that it is a supported and valid combination of versions.
The pipeline code is written with the Apache Beam SDK and uses multiple ParDos and PTransforms. The pipeline is somewhat complicated, as it involves a lot of intermediate operations (handled by these ParDos and PTransforms) between source and sink. The source in my case is an Azure Service Bus topic, which I read using JmsTopicIO reads. Up to this point everything works fine, i.e. the stream of data enters the pipeline and is processed normally. The problem occurs during load testing: I see many operators going into back pressure and eventually becoming unable to read and process messages from the topic, even though the CPU and memory usage of the Job and Task managers remains well under control.
Actual issue/problem/question: While troubleshooting these performance issues, I observed that Flink chains and groups these ParDos and PTransforms into operators by itself. With my implementation, I see that many heavy processing tasks get combined into the same operator, which slows down processing in all such operators. Also, the parallelism I have set (20 for now) is at the pipeline level, which means each operator runs with a parallelism of 20.
flinkOptions.setParallelism(20);
Question 1. Using the Apache Beam SDK or any Flink configuration, is there a way to control or manage the chaining and grouping of these ParDos/PTransforms into operators (through code or config), so that I can distribute the load uniformly myself?
Question 2. With an implementation using Apache Beam, how can I set the parallelism for each operator (not for the complete pipeline) based on its load? This way I could better allocate resources to the heavy computing operators (sets of tasks).
Please suggest answers to the above questions, along with any other pointers I could work on to improve Flink performance in my deployment. Just for reference, please note my pipeline.
I'm looking at performing simple operations at scale (data extraction from files with layout-aware processing). The engine for this has a large startup time, and the processing time for a single file is on the order of a few minutes. Using a NiFi cluster (16+ nodes) for such processing results in the cluster taking about 45 minutes to start up and become available (deployed in Kubernetes). I was looking to see whether Apache NiFi MiNiFi or Apache NiFi Stateless would be handy here to reduce the cluster startup time and also let me scale out the processing more easily as required. Which of the two would be a better fit? I understand that MiNiFi is meant more for data collection use cases, but was wondering whether it would still fit mine.
NiFi Stateless is basically an alternate runtime for NiFi flows. It is lightweight and doesn't persist data between restarts (hence the name).
I think the information listed here is quite comprehensive:
https://github.com/apache/nifi/blob/main/nifi-stateless/nifi-stateless-assembly/README.md
https://bryanbende.com/development/2021/11/10/apache-nifi-stateless
NiFi MiNiFi is essentially a headless NiFi (it doesn't support authoring flows) and has just a subset of NARs. On the other hand, it supports the C2 protocol and can download flows that way.
Based on your description, if you have a single source and destination and you are OK with state being lost between restarts, Stateless seems to be a better fit.
I'm currently using Google Dataflow with Python for batch processing. This works fine; however, I'm interested in getting a bit more speed out of my Dataflow jobs without having to deal with Java.
Using the Go SDK, I've implemented a simple pipeline that reads a series of 100-500 MB files from Google Cloud Storage (using textio.Read), does some aggregation and updates CloudSQL with the results. The number of files being read can range from dozens to hundreds.
When I run the pipeline, I can see from the logs that the files are being read serially instead of in parallel, so the job takes much longer. The same process executed with the Python SDK triggers autoscaling and runs multiple reads within minutes.
I've tried specifying the number of workers using --num_workers=, however, Dataflow scales the job down to one instance after a few minutes and from the logs no parallel reads take place in the time the instance was running.
Something similar happens if I remove the textio.Read and implement a custom DoFn for reading from GCS. The read process is still run serially.
I'm aware that the current Go SDK is experimental and lacks many features; however, I haven't found a direct reference to limitations with parallel processing here. Does the current incarnation of the Go SDK support parallel processing on Dataflow?
Thanks in advance
Managed to find an answer for this after actually creating my own IO package for the Go SDK.
Splittable DoFns are not yet available in the Go SDK. This key bit of functionality is what allows the Python and Java SDKs to perform IO operations in parallel, and thus much faster than the Go SDK at scale.
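To make the idea concrete, here is a minimal sketch of the kind of splitting a splittable DoFn enables: one large file is described as a byte range, and that range can be cut into sub-ranges that different workers read independently. This is plain Python for illustration, not the Beam API; `split_offset_range` and the 64 MB chunk size are hypothetical names and values, and a real text reader would also have to align each chunk to line boundaries.

```python
from typing import List, Tuple


def split_offset_range(start: int, end: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Cut the half-open byte range [start, end) into chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    offset = start
    while offset < end:
        stop = min(offset + chunk_size, end)
        ranges.append((offset, stop))
        offset = stop
    return ranges


if __name__ == "__main__":
    # Hypothetical 300 MB file split into 64 MB pieces, one per worker.
    file_size = 300 * 1024 * 1024
    for lo, hi in split_offset_range(0, file_size, 64 * 1024 * 1024):
        print(f"a worker could read bytes [{lo}, {hi})")
```

Without this kind of splitting, each file is a single indivisible unit of work, which matches the serial reads described in the question.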
Now (Go 1.16) it's built in:
https://pkg.go.dev/google.golang.org/api/dataflow/v1b3
I'm running a PySpark application in:
emr-5.8.0
Hadoop distribution:Amazon 2.7.3
Spark 2.2.0
I'm running on a very large cluster. The application reads a few input files from S3. One of these is loaded into memory and broadcast to all the nodes. The other is distributed to the disks of each node in the cluster using the SparkFiles functionality. The application works, but performance is slower than expected for larger jobs. Looking at the log files, I see the following warning repeated almost constantly:
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
It tends to happen after a message about accessing the file that was loaded into memory and broadcast. Is this warning something to worry about? How can I avoid it?
Google searching brings up several people dealing with this warning in native Hadoop applications, but I've found nothing about it in Spark or PySpark and can't figure out how those solutions would apply to me.
Thanks!
Ignore it.
The more recent versions of the AWS SDK always tell you off when you call abort() on the input stream, even when it's what you need to do when moving around a many-GB file. For small files, yes, reading to the EOF is the right thing to do, but with big files, no.
See: SDK repeatedly complaining "Not all bytes were read from the S3ObjectInputStream"
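The drain-or-abort trade-off described above can be sketched in a few lines of plain Python. This is an illustration only, not the S3A or AWS SDK implementation; `finish_read`, its return strings and the 64 KB threshold are hypothetical choices.

```python
import io
from typing import BinaryIO


def finish_read(stream: BinaryIO, remaining: int, drain_threshold: int = 64 * 1024) -> str:
    """Decide how to finish with a partially read stream.

    If only a little data is left, read it to EOF so the connection could be
    reused. If a lot is left, give up on it instead of downloading bytes that
    nobody needs (the caller would abort the connection at this point).
    """
    if remaining <= drain_threshold:
        while stream.read(8192):
            pass  # discard the leftover bytes
        return "drained"
    return "aborted"


if __name__ == "__main__":
    small = io.BytesIO(b"x" * 1000)
    small.read(100)  # the application only needed the first 100 bytes
    print(finish_read(small, remaining=900))

    # Pretend many GB are still unread on a large object.
    print(finish_read(io.BytesIO(b""), remaining=5 * 1024 ** 3))
```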
If you see this a lot and you are working with columnar data formats such as ORC and Parquet, switch the input streams from sequential to random IO by setting the property fs.s3a.experimental.input.fadvise to random. This stops the stream from ever trying to read the whole file; it reads only small blocks instead. That is very bad for full-file reads (including .gz files), but it transforms columnar IO.
Note: there's a small fix in S3A for Hadoop 3.x on the final close, HADOOP-14596. It's up to the EMR team whether to backport it or not.
I'll also add some text to the S3A troubleshooting docs. The ASF has never shipped a Hadoop release with this problem, but if people are playing mix-and-match with the AWS SDK (very brittle), then this may surface.
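As a rough sketch of how such a setting could reach a Spark job: Spark copies any `spark.hadoop.*` property into the Hadoop configuration, so the S3A option can be passed to `spark-submit` as a `--conf` argument. The Hadoop S3A documentation names the property `fs.s3a.experimental.input.fadvise`. The helper `fadvise_conf_args` and the script name `my_job.py` are hypothetical, and the setting only matters when the job reads through `s3a://` on a Hadoop version that supports it.

```python
from typing import List

# Policies documented for the S3A input stream.
VALID_POLICIES = ("normal", "sequential", "random")


def fadvise_conf_args(policy: str = "random") -> List[str]:
    """Build the spark-submit arguments that set the S3A input policy."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown fadvise policy: {policy!r}")
    return ["--conf", f"spark.hadoop.fs.s3a.experimental.input.fadvise={policy}"]


if __name__ == "__main__":
    # my_job.py is a placeholder for the actual PySpark application.
    command = ["spark-submit", *fadvise_conf_args("random"), "my_job.py"]
    print(" ".join(command))
```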
Note: This only applies to non-EMR installations as AWS doesn't offer s3a.
Before choosing to ignore the warnings or altering your input streams via settings per Steve Loughran's answer, make absolutely sure you're not using s3://bucket/path notation.
Starting with Spark 2, you should leverage the s3a protocol via s3a://bucket/path, which would likely address the warnings you're seeing (it did for us) and substantially boost the speed of S3 interactions. See this answer for detail on a breakdown of differences.
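If the paths are built in code, the switch can be as small as rewriting the scheme before handing the path to Spark. A minimal sketch, assuming the cluster actually has the s3a connector available; `to_s3a` is a hypothetical helper name, and paths with other schemes are returned unchanged.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an s3:// URI to s3a://, leaving every other URI untouched."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    # bucket/path mirrors the placeholder used in the text above.
    print(to_s3a("s3://bucket/path"))
```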
We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (Why is this?). Do we need to therefore have multiple Spark Clusters, or, allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!
In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that multiplicity, we need to configure each Spark Streaming job to address its own specifics: not only the batch interval, but also resources like memory and CPU, data partitioning, and the number of executing nodes (when the loads are network bound).
It follows that each Spark Streaming job becomes a separate job deployment on a Spark Cluster. That will also allow for monitoring and management of separate pipelines independently of each other and help in the further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.
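A minimal sketch of what one deployment per stream could look like: each job carries its own batch interval and resource settings and is turned into its own `spark-submit` command. `StreamJob`, `build_submit_command`, the job names, the script name `etl_job.py` and the `--batch-seconds` application argument are all hypothetical; the application itself would have to read that argument and pass it to its StreamingContext. Only `--name`, `--executor-memory` and `--total-executor-cores` are real `spark-submit` options.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StreamJob:
    name: str
    batch_seconds: int          # batch interval for this job's StreamingContext
    executor_memory: str        # e.g. "4g"
    total_executor_cores: int   # honoured on standalone and Mesos clusters


def build_submit_command(job: StreamJob, app: str) -> List[str]:
    """Build a separate spark-submit command for one streaming job."""
    return [
        "spark-submit",
        "--name", job.name,
        "--executor-memory", job.executor_memory,
        "--total-executor-cores", str(job.total_executor_cores),
        app,
        # Everything after the application file is passed to the application.
        "--batch-seconds", str(job.batch_seconds),
    ]


if __name__ == "__main__":
    jobs = [
        StreamJob("low-latency-stream", 2, "4g", 8),
        StreamJob("bulk-stream", 60, "8g", 4),
    ]
    for job in jobs:
        print(" ".join(build_submit_command(job, "etl_job.py")))
```

Because every job is submitted on its own, each one can be tuned, monitored and restarted without touching the others.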