Correct me if I'm wrong, but it looks like I will have trouble if I try to use Map 3 and Engine in the same project because they use the same package names?
Is there any way it can be done?
PS: Side note: it looks like the Engine master branch (at least the 'demo' module) cannot be compiled right now due to a dependency issue?
Yes. The open source version of Chronicle Engine depends on Chronicle Map 2 (and supports only Chronicle Map 2), which uses the same package names as Chronicle Map 3. So they might not work together without shading or package renaming, even if you use Chronicle Map 3 and Chronicle Engine separately, i.e. without doing replication or remote calls to Chronicle Map 3.
There is a Chronicle Engine version which supports Chronicle Map 3, but it is closed source, visit http://chronicle.software for details.
We have a Spring Boot app that consumes from a single topic and produces records to multiple topics.
We recently upgraded this app to Spring Boot 2.6.7 and updated the other dependencies in the Gradle project accordingly.
The app consumes and produces correctly, but it seems to create Kafka AdminClients repeatedly (thousands of them) and seems to be leaking memory (potentially because of this?), which ultimately leads to the instance crashing and not being able to keep up with lag.
Some Kafka-related dependency jars in External Libraries:
org.apache.kafka:kafka-clients:3.0.1
org.springframework.cloud:spring-cloud-stream:3.2.3
org.springframework.cloud:spring-cloud-stream-binder-kafka:3.2.3
org.springframework.cloud:spring-cloud-stream-binder-kafka-core:3.2.3
org.springframework.integration:spring-integration-kafka:5.5.11
org.springframework.kafka:spring-kafka:2.8.5
Is there a reason for this? Is some configuration missing?
So the AdminClient was not the issue. The problem was the default size (10) of the hash map that stores output channels.
I have set spring.cloud.stream.dynamic-destination-cache-size=30, since we actually have 17 output destinations in the app already.
With the default size of 10, this map ("StreamBridge.channelCache") keeps removing and re-adding entries repeatedly ("Once this size is reached, new destinations will trigger the removal of old destinations"), which triggers GC every now and then.
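To make the churn concrete, here is a minimal sketch of a size-bounded map that evicts its eldest entry, which is the behaviour the quoted documentation describes. It is not the actual StreamBridge code: the class name, the "topic-N" destination names and the insertion-order eviction policy are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChannelCacheChurn {

    /**
     * Cycles through `destinations` distinct names `rounds` times against a
     * cache bounded to `cacheSize` entries, and returns how many lookups had
     * to create (or re-create) an entry.
     */
    static int countCreations(final int cacheSize, int destinations, int rounds) {
        // Assumption: a bounded map that drops the eldest entry once full.
        Map<String, Object> cache = new LinkedHashMap<String, Object>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > cacheSize;
            }
        };

        int created = 0;
        for (int r = 0; r < rounds; r++) {
            for (int d = 0; d < destinations; d++) {
                String name = "topic-" + d; // hypothetical destination name
                if (!cache.containsKey(name)) {
                    cache.put(name, new Object()); // stands in for an output channel
                    created++;
                }
            }
        }
        return created;
    }

    public static void main(String[] args) {
        System.out.println("size 10: " + countCreations(10, 17, 3) + " creations");
        System.out.println("size 30: " + countCreations(30, 17, 3) + " creations");
    }
}
```

Under these assumptions, 17 destinations cycling through a cache of 10 never get a hit, so every send re-creates an entry, while a cache of 30 creates each entry exactly once.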
I already have a streaming pipeline written in Apache Beam. Earlier I was running it on Google Dataflow and it ran like a charm. Now, due to changing business needs, I need to run it using the Flink runner.
The Beam version currently in use is 2.38.0 and the Flink version is 1.14.5. I checked and found that this is a supported and valid combination of versions.
The pipeline code is written using the Apache Beam SDK and uses multiple ParDos and PTransforms. The pipeline is somewhat complicated, as it involves a lot of intermediate operations (handled by these ParDos and PTransforms) between source and sink. The source in my case is an Azure Service Bus topic, which I am reading using JmsTopicIO reads. Up to here everything works fine, i.e. the stream of data enters the pipeline and is processed normally. The problem occurs when load testing is performed. I see many operators going into back pressure and eventually they are not able to read and process messages from the topic, even though the CPU and memory usage of the Job and Task managers remains well under control.
Actual issue/problem/question: While troubleshooting these performance issues I observed that Flink is chaining and grouping these ParDos and PTransforms (by itself) into operators. With my implementation I see that many heavy processing tasks are getting combined into the same operator. This is causing slow processing in all such operators. Also, the parallelism I have set (20 for now) is at the pipeline level, which means each operator is running with a parallelism of 20.
flinkOptions.setParallelism(20);
Question 1. Using the Apache Beam SDK or any Flink configuration, is there any way I can control or manage the chaining and grouping of these ParDos/PTransforms into operators (through code or config), so that I can distribute the load uniformly myself?
Question 2. With an implementation using Apache Beam, how can I set the parallelism of each operator (not of the complete pipeline) based on the load on it? This way I would be able to better allocate resources to the heavy computing operators (sets of tasks).
Please suggest answers to the above questions, and also any other pointers I can work on to improve Flink performance in my deployment. Just for reference please note my pipeline.
I am running my own gRPC server collecting events coming from various data sources. The server is developed in Go and all the event sources send the events in a predefined format as a protobuf message.
What I want to do is to process all these events with Apache Beam in memory.
I looked through the Apache Beam docs and couldn't find a sample that does something like what I want. I'm not going to use Kafka, Flink or any other streaming platform, just process the messages in memory and output the results.
Can someone point me in the right direction to start coding a simple stream processing app?
OK, first of all, Apache Beam is not a data processing engine. It's an SDK that allows you to create a unified pipeline and run it on different engines, like Spark, Flink, Google Dataflow, etc. So, to run a Beam pipeline you would need to use one of the supported data processing engines or use DirectRunner, which will run your pipeline locally but (!) has many limitations and was mostly developed for testing purposes.
Like every pipeline in Beam, yours has to have a source transform (bounded or unbounded) which will read data from your data source. I guess that in your case it will be your gRPC server, which should retransmit the collected events. So, for the source transform, you can either use the already implemented Beam IO transforms (IO connectors) or create your own, since there is no GrpcIO or anything similar in Beam for now.
Regarding processing data in memory, I'm not sure that I fully understand what you mean. It will mostly depend on the data processing engine used, since in the end your Beam pipeline will be translated into, for example, a Spark or Flink pipeline (if you use SparkRunner or FlinkRunner respectively) before it actually runs, and then the data processing engine will manage the pipeline workflow. Most modern engines do their best to keep all processed data in memory and flush it to disk only as a last resort.
I'm currently using Google Dataflow with Python for batch processing. This works fine; however, I'm interested in getting a bit more speed out of my Dataflow jobs without having to deal with Java.
Using the Go SDK, I've implemented a simple pipeline that reads a series of 100-500 MB files from Google Storage (using textio.Read), does some aggregation and updates CloudSQL with the results. The number of files being read can range from dozens to hundreds.
When I run the pipeline, I can see from the logs that the files are being read serially instead of in parallel, and as a result the job takes much longer. The same process executed with the Python SDK triggers autoscaling and runs multiple reads within minutes.
I've tried specifying the number of workers using --num_workers=, however, Dataflow scales the job down to one instance after a few minutes, and from the logs no parallel reads take place while the instance is running.
Something similar happens if I remove the textio.Read and implement a custom DoFn for reading from GCS. The read process is still run serially.
I'm aware the current Go SDK is experimental and lacks many features; however, I haven't found a direct reference to limitations with parallel processing. Does the current incarnation of the Go SDK support parallel processing on Dataflow?
Thanks in advance
I managed to find an answer for this after actually creating my own IO package for the Go SDK.
Splittable DoFns are not yet available in the Go SDK. This key bit of functionality is what allows the Python and Java SDKs to perform IO operations in parallel and thus much faster than the Go SDK at scale.
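As a rough illustration of why splitting matters, the sketch below divides one large input into independent byte ranges that separate workers could read at the same time. This is not Beam's Splittable DoFn API; the class and method names are hypothetical, and a real text reader would also have to align ranges to line boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class RangeSplitSketch {

    /** Half-open byte range [start, end) within a single file. */
    static final class OffsetRange {
        final long start;
        final long end;

        OffsetRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Splits [0, totalBytes) into consecutive ranges of at most chunkBytes each. */
    static List<OffsetRange> split(long totalBytes, long chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<OffsetRange> ranges = new ArrayList<>();
        for (long start = 0; start < totalBytes; start += chunkBytes) {
            ranges.add(new OffsetRange(start, Math.min(start + chunkBytes, totalBytes)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Each printed range could be handed to a different worker.
        for (OffsetRange r : split(250, 100)) {
            System.out.println(r);
        }
    }
}
```

Without some way to split like this, a single file is one unit of work, so a runner has nothing within that file to spread across workers.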
Now (Go 1.16) it's built in:
https://pkg.go.dev/google.golang.org/api/dataflow/v1b3
In WebSphere MQ, the command level for a queue manager is 701. What does it actually specify?
WebSphere products use a "[version].[release].[modification].[Fix Pack]" naming convention. For example 7.0.1.6 is the current release specified down to the Fix Pack level.
Fix packs are limited to bug fixes and very limited non-disruptive functionality enhancements.
Modifications can include functionality enhancements but no API changes. For example the Multi-Instance Queue Manager was introduced in 7.0.1.
Releases may introduce significant new function and limited API changes but are highly forward and backward compatible within the same version.
Versions encapsulate a core set of functionality. Changes at this level may sacrifice some backward compatibility in trade for significant new functionality. For example, WMQ Pub/Sub was moved from Message Broker to base MQ in the V7 release.
Since administrative functionality does not change in Fix Packs but may change at the Modification level, compatibility with administrative tools is based on the queue manager's Command Level.
There is an old but still useful TechNote which described this when the numbering conventions were adopted for WMQ.
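The naming convention above can be sketched as a small parser. The class and method names are hypothetical; the only facts taken from the answer are the four dot-separated fields and the point that administrative functionality does not change between Fix Packs.

```java
class VrmfSketch {
    final int version;
    final int release;
    final int modification;
    final int fixPack;

    VrmfSketch(int version, int release, int modification, int fixPack) {
        this.version = version;
        this.release = release;
        this.modification = modification;
        this.fixPack = fixPack;
    }

    /** Parses a "[version].[release].[modification].[Fix Pack]" string such as "7.0.1.6". */
    static VrmfSketch parse(String text) {
        String[] parts = text.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected V.R.M.F, got: " + text);
        }
        return new VrmfSketch(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]));
    }

    /** True when two levels differ at most in their Fix Pack. */
    boolean sameModificationLevel(VrmfSketch other) {
        return version == other.version
                && release == other.release
                && modification == other.modification;
    }

    public static void main(String[] args) {
        VrmfSketch a = parse("7.0.1.6");
        VrmfSketch b = parse("7.0.1.3");
        System.out.println(a.sameModificationLevel(b));
    }
}
```

The sketch deliberately does not compute a command level from these fields; it only compares them.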
It displays the major version number of WMQ, e.g. 530, 600, 700, 701. Despite being 'only' a .0.1 increment, WMQ 7.0.1 gets a new major version number due to a number of internal changes (e.g. multi-instance QMs), although WMQ 6.0.1.x and 6.0.2.x were both CMDLEVEL 600.
Command level, although similar to the V.R.M.F, is not exactly the same thing. The command level is used to allow configuration applications to know what commands (and attributes within those commands) will be understood by the command server.
The first thing any configuration application should do is discover the PLATFORM and CMDLEVEL of the queue manager. Then that application can determine which commands/attributes it would be acceptable to send to that queue manager.
It is possible that CMDLEVEL could be increased in the service stream. Then the V.R.M.F. would not necessarily match the CMDLEVEL. This would happen if some new external attributes were added in the service stream, so queue managers without that patch would not understand them, but queue managers with the patch would. How does an application determine what to send? Well, the CMDLEVEL would determine that and so would have to be upped by the patch.
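A configuration application could apply this rule roughly as in the sketch below. The attribute names and their minimum command levels are made up for illustration and are not real MQ attributes, and the sketch assumes the queue manager's CMDLEVEL has already been discovered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CmdLevelGate {

    /**
     * Returns, in alphabetical order, the attributes whose minimum command
     * level is at or below the queue manager's CMDLEVEL.
     */
    static List<String> usableAttributes(int cmdLevel, Map<String, Integer> minLevels) {
        List<String> usable = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(minLevels).entrySet()) {
            if (e.getValue() <= cmdLevel) {
                usable.add(e.getKey());
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // Hypothetical attribute table: name -> first CMDLEVEL that understands it.
        Map<String, Integer> minLevels = new TreeMap<>();
        minLevels.put("ATTR_OLD", 600);
        minLevels.put("ATTR_NEW", 701);

        System.out.println(usableAttributes(700, minLevels));
        System.out.println(usableAttributes(701, minLevels));
    }
}
```

Keying the decision on CMDLEVEL rather than on the V.R.M.F string means the same check keeps working if a patch in the service stream raises the command level.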