Are there any libraries to provide diagnostics on TPL dataflow blocks? - task-parallel-library

We're using TPL Dataflow blocks, and we want to get diagnostic information on things like the number of messages queued in each block, what the targets of each block are, and maybe some metrics like messages per second.
The diagnostics seem fairly straightforward to implement, but I'd be interested to hear if there's anything out there that does something similar.
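For reference, the simple polling approach the question calls "straightforward" might look roughly like this: the standard blocks already expose `Count`/`InputCount` properties, so a monitor only has to sample them (the link setup and delays here are illustrative, not from any library):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Program
{
    static void Main()
    {
        var buffer = new BufferBlock<int>();
        var worker = new ActionBlock<int>(
            async i => await Task.Delay(10),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 16 });
        buffer.LinkTo(worker, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 100; i++) buffer.Post(i);

        // Snapshot of queue depths; a real monitor would sample these on a
        // timer and derive messages/second from the deltas between samples.
        Console.WriteLine($"buffer queued: {buffer.Count}");
        Console.WriteLine($"worker queued: {worker.InputCount}");

        buffer.Complete();
        worker.Completion.Wait();
    }
}
```

Tracking the targets of each block is harder, since the built-in API does not expose the link list; you would have to record links yourself at the point where you call LinkTo.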

Related

ZeroMQ is dropping messages

I am using ZeroMQ to communicate between multiple services. I ran into an issue where I was sending responses but they were never getting to the caller. I did a lot of debugging and couldn't figure out what was happening. I eventually reduced the message size (I was returning the results of a query) and the messages started coming in. Then I increased the memory size of my JVM and the original messages started coming back.
This leads me to believe that the messages were too big to fit into memory and ZeroMQ just dropped them. My question is, how can I properly debug this? Does ZeroMQ output any logs or memory dumps?
I am using the Java version of ZeroMQ.
Q : "...how can I properly debug this?"
Well, if you are aware of the native ZeroMQ API settings, in particular the buffering "mechanics" and the pair of hard cut-off limits { SNDHWM | RCVHWM }, you can do some trial-and-error testing to fine-tune these parameters.
Q : "Does ZeroMQ output any logs or memory dumps?"
Well, no, native ZeroMQ has deliberately never attempted to do so. The key priority of the ZeroMQ concept is almost linearly scalable performance, and the Zen-of-Zero philosophy excludes any operation that does not support achieving this within minimalistic low-latency envelopes.
That said, newer versions of the native API provide a socket-monitor facility. That may help you write your own internal socket-events analyser, if you need one.
My 11+ years with ZeroMQ have never got me into an unsolvable corner. Your best bet is to get the needed insight into the Context()-instance and Socket()-instance parameters, which configure the L3, protocol-dependent, and OS-related buffering attributes (some of these may not be present in the Java bindings, but the native API documents all the tweakable parameters of the ZeroMQ data-pumping engines).
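To make the HWM knobs concrete: the asker is on Java, but the same pair of socket options exists in every binding. As a hedged illustration using the C# NetMQ binding (property names per NetMQ's SocketOptions; the Java bindings expose equivalent setters):

```csharp
using System;
using NetMQ;
using NetMQ.Sockets;

class HwmDemo
{
    static void Main()
    {
        using var push = new PushSocket();
        // Hard cut-off: once this many messages are queued on the socket,
        // further sends block or messages get dropped (socket-type dependent).
        push.Options.SendHighWatermark = 50_000;
        push.Bind("tcp://*:5555");

        using var pull = new PullSocket();
        pull.Options.ReceiveHighWatermark = 50_000;
        pull.Connect("tcp://localhost:5555");

        push.SendFrame("hello");
        Console.WriteLine(pull.ReceiveFrameString());
    }
}
```

The watermark values above are arbitrary; the point is that silently dropped messages are usually a sign that one of these limits (or the OS-level buffers behind them) was hit, so raising them is the first experiment to run.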

Tasks vs. TPL Dataflow vs. Async/Await, which to use when?

I have read through quite a number of technical documents, either by members of the Microsoft team or by other authors, detailing the functionality of the new TPL Dataflow library, the async/await concurrency model, and the TPL. However, I have not really come across anything that clearly delineates which to use when. I am aware that each has its own place and applicability, but specifically I wonder about the following situation:
I have a data flow model that runs completely in-process. At the top sits a data generation component (A) which generates data and passes it on either via data flow block linkages or through raising events to a processing component (B). Some parts within (B) have to run synchronously while (A) massively benefits from parallelism as most of the processes are I/O or CPU bound (reading binary data from disk, then deserializing and sorting them). In the end the processing component (B) passes on transformed results to (C) for further usage.
I wonder specifically when to use tasks, async/await, and TPL data flow blocks in regards to the following:
Kicking off the data generation component (A). I clearly do not want to block the GUI/dashboard, so this process has to run on a different thread/task.
How to call methods within (A), (B), and (C) that are not directly involved in the data generation and processing but perform configuration work that may take several hundred milliseconds or even seconds to return. My hunch is that this is where async/await shines?
What I struggle with most is how best to design the message passing from one component to the next. TPL Dataflow looks very interesting, but it is sometimes too slow for my purpose (see the note on performance at the end). If not using TPL Dataflow, how do I achieve responsiveness and concurrency with in-process, inter-task data passing? For example, if I raise an event within a task, the subscribed event handler runs on the same task instead of being passed to another task, correct? In summary, how can component (A) go about its business after passing data on to component (B), while component (B) retrieves the data and focuses on processing it? Which concurrency model is best used here?
I implemented data flow blocks here, but is that truly the best approach?
I guess the points above boil down to my struggle with how to design and implement API-style components using standard practice. Should methods be designed async, data inputs as dataflow blocks, and data outputs as either dataflow blocks or events? What is the best approach in general? I am asking because most of the components mentioned above are supposed to work independently, so they can essentially be swapped out or altered internally without having to rewrite accessors and outputs.
Note on performance: I mentioned that TPL Dataflow blocks are sometimes slow. I deal with a high-throughput, low-latency type of application that targets disk I/O limits, and TPL Dataflow blocks often performed much slower than, for example, a synchronous processing unit. The issue is that I do not know how to embed the process in its own task or concurrency model to achieve something similar to what TPL Dataflow blocks already take care of, but without the overhead that comes with them.
It sounds like you have a "push" system. Plain async code only handles "pull" scenarios.
Your choice is between TPL Dataflow and Rx. I think TPL Dataflow is easier to learn, but since you've already tried it and it won't work for your situation, I would try Rx.
Rx comes at the problem from a very different perspective: it is centered around "streams of events" rather than TPL Dataflow's "mesh of actors". Recent versions of Rx are very async-friendly, so you can use async delegates at several points in your Rx pipeline.
Regarding your API design, both TPL Dataflow and Rx provide interfaces you should implement: IReceivableSourceBlock/ITargetBlock for TPL Dataflow, and IObservable/IObserver for Rx. You can just wire up the implementations to the endpoints of your internal mesh (TPL Dataflow) or query (Rx). That way, your components are just a "block" or "observable/observer/subject" that can be composed in other "meshes" or "queries".
Finally, for your async construction system, you just need to use the factory pattern. Your implementation can call Task.Run to do configuration on a thread pool thread.
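The block-interface suggestion above can be sketched with `DataflowBlock.Encapsulate`, which hides an internal mesh behind a single propagator block (the component name and the two-stage mesh here are made up for illustration):

```csharp
using System;
using System.Text;
using System.Threading.Tasks.Dataflow;

static class ProcessorComponent // illustrative name, not from the question
{
    // The internal mesh (deserialize -> format) is exposed as one
    // IPropagatorBlock: callers just see a block they can link to/from.
    public static IPropagatorBlock<byte[], string> Create()
    {
        var deserialize = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        var format = new TransformBlock<string, string>(s => s.ToUpperInvariant());

        deserialize.LinkTo(format,
            new DataflowLinkOptions { PropagateCompletion = true });

        return DataflowBlock.Encapsulate(deserialize, format);
    }
}

class Program
{
    static void Main()
    {
        var component = ProcessorComponent.Create();
        component.Post(Encoding.UTF8.GetBytes("hello"));
        Console.WriteLine(component.Receive()); // prints "HELLO"
    }
}
```

Swapping out the component's internals then never affects callers, because they only ever hold the encapsulated propagator.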
Just wanted to leave this here in case it helps someone get a feel for when to use Dataflow, because I was surprised by the TPL Dataflow performance. My scenario was the following:
Iterate through all the C# code files in a project (around 3,500 files)
Read all the lines of each file (an I/O operation)
Iterate through all the lines of each file and search them for certain strings
Return the files, and the lines within them, that contain the searched string
I thought this was a really good fit for TPL Dataflow, but when I simply spawned a new Task for each file I needed to open, and did all the logic inside that task, the code was faster.
From this, my conclusion was to go with an async/await/Task implementation by default, at least for simple jobs like this, and that TPL Dataflow is made for more complex situations, especially ones that need batching and other more "push"-based behavior, and where synchronization is more of an issue.
Edit: I have since done some more research on this and created a demo project, and the results are quite interesting: as the operations grow in number and complexity, TPL Dataflow becomes more efficient.
Here is the link to the repo.
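For context, the Task-per-file approach described above might look roughly like this (a simplified sketch; the class name and file-filtering logic are mine, not from the linked repo):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class Scanner
{
    // One Task per file: read its lines, keep the ones containing the needle.
    public static async Task<Dictionary<string, string[]>> FindAsync(
        IEnumerable<string> files, string needle)
    {
        var tasks = files.Select(async file =>
        {
            var lines = await File.ReadAllLinesAsync(file);
            return (file, hits: lines.Where(l => l.Contains(needle)).ToArray());
        });

        var results = await Task.WhenAll(tasks);
        return results.Where(r => r.hits.Length > 0)
                      .ToDictionary(r => r.file, r => r.hits);
    }
}
```

The Dataflow equivalent would replace the Select/WhenAll with a TransformBlock per stage, which adds per-message bookkeeping overhead; for a job this simple that overhead is what dominates.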

How to specify thread priority?

I need to run multiple threads on an embedded-linux target.
One of the threads requires a lot of resources, so I need it to run in the background at a low priority.
There will be times when the higher-priority threads have nothing to do. A typical Vala Thread.create call looks like this:
Thread.create<void*> (pProcessor->run, true);
Is there a way to specify the thread priority?
You can't use the threading stuff in GLib; you would have to use pthreads directly. There is some information on how to do that in C here. You would also need to create Vala bindings for the relevant functions, since nobody has done so yet (it's pretty easy... if you understand how Vala maps to C it would only take a couple of minutes).
If I were you, I would look into using a priority queue instead. If you don't feel like writing your own, Bump should already have everything you need (specifically, Semaphore and/or TaskQueue), or AsyncPriorityQueue if you would prefer to work at a lower level.

Is avoiding the T in ETL possible?

ETL is pretty commonplace. Data is out there somewhere, so you go get it. After you get it, it's probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is that you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (non-SQL), but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straightforward. You can just capture everything after sending "status", split on lines, and then split on colons or something. Pretty easy. It's almost like getting a badly formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000, but eventually I'll want to get rid of that line of equals signs; it doesn't really mean anything.
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme, the data will all be related in some way but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc).
When coding up a custom ETL application, a common pattern is the Provider model. This enables you to write a whole bunch of custom providers to load/query and then transform the data. All the providers implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will differ wildly depending on the data source being dealt with; the interface just gives you a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code for it to call.
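As an illustration of that provider pattern, using the printer from the question (the interface and class names here are made up, not from any framework, and the network call is stubbed):

```csharp
using System;
using System.Collections.Generic;

// Common contract every data source implements.
interface IDataProvider
{
    string RawQuery();                                  // talk to the device
    IDictionary<string, string> Transform(string raw);  // normalize the result
}

// Provider for the "status over port 9000" printer from the question.
class PrinterProvider : IDataProvider
{
    public string RawQuery() =>
        "===============\nhas_paper:true\njobs:0\nink:low"; // stubbed network call

    public IDictionary<string, string> Transform(string raw)
    {
        var result = new Dictionary<string, string>();
        foreach (var line in raw.Split('\n'))
        {
            var idx = line.IndexOf(':');
            if (idx > 0) // skips the "=====" banner and malformed lines
                result[line[..idx]] = line[(idx + 1)..];
        }
        return result;
    }
}
```

A host would then iterate its configured providers and call `Transform(RawQuery())` on each; an AtmProvider or VoicemailProvider would hide its interactive command sequence inside its own RawQuery.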
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, I would just write a new provider, which can sit in its very own assembly (DLL), so it can be shipped (or modified, upgraded, etc.) in isolation from any other providers I already have. Or, if I were using SSIS, I would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.

Are there generic implementations of ARIES or other ACID-ensuring algorithms?

I have an application which I'd like to carry out certain actions atomically.
However, these actions are occurring on remote storage, and connections may drop, actions may fail, etc.
There's plenty of literature on how to enforce ACID properties - write-ahead logging, ARIES, etc. - but are there any generic implementations of these available?
Essentially what I'm looking for is some library/API/etc. where I can provide some stable storage (e.g. a local hard disk) for logging, perform my particular actions (on unstable remote storage), and have this hypothetical helper code handle the bulk of the ACID bookkeeping.
Obviously I would need to provide my own custom code for rolling back certain things and so forth, but it seems like the high-level work of doing the logging, scanning the log, etc. could be generalized and wrapped in a library.
Does such a thing exist?
In C#, the System.IO.Log namespace has log-related helpers, which might be close to what you're looking for, though it won't help directly with isolation on its own. If you use LogRecordSequence, you get a pretty sophisticated log implementation underneath it.
Additionally, SQLite does all of this and is in the public domain. I imagine its storage engine would be somewhat separable, though you'd likely have to tear it apart yourself.
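The generic shape the question describes (log the intent, do the work, log the commit, replay on recovery) can be sketched like this. This is the write-ahead-log pattern only, not a production implementation; the record format and class name are made up:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Minimal write-ahead-log sketch: append an intent record before acting and
// a commit record after. On restart, any BEGIN without a matching COMMIT
// identifies an action that must be rolled back (or retried) by custom code.
class MiniWal
{
    private readonly string _logPath;
    public MiniWal(string logPath) => _logPath = logPath;

    public void Execute(string actionId, Action action)
    {
        Append($"BEGIN {actionId}");
        action();                      // the unreliable remote operation
        Append($"COMMIT {actionId}");
    }

    // Returns ids of actions that started but never committed.
    public IEnumerable<string> FindIncomplete()
    {
        var begun = new HashSet<string>();
        foreach (var line in File.ReadLines(_logPath))
        {
            var parts = line.Split(' ', 2);
            if (parts[0] == "BEGIN") begun.Add(parts[1]);
            else if (parts[0] == "COMMIT") begun.Remove(parts[1]);
        }
        return begun;
    }

    private void Append(string record)
    {
        using var s = new FileStream(_logPath, FileMode.Append,
                                     FileAccess.Write, FileShare.Read);
        using var w = new StreamWriter(s);
        w.WriteLine(record);
        w.Flush();
        s.Flush(true); // force to stable storage before proceeding
    }
}
```

The hard parts a real library would add on top are exactly the ones ARIES formalizes: idempotent redo/undo records instead of opaque action ids, checkpointing so the log doesn't grow forever, and fuzzy recovery.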