What are the use cases for TPL Dataflow over Reactive Extensions (Rx)?

I'm specifically looking at writing some signal processing algorithms in one or other, or maybe some combination of both of these.
Performance isn't a big concern, clarity of expressing intent is more important.
I'd be looking to implement the following 'Blocks' and compose them:
Filters (both FIR and IIR)
Phase Detectors
Integrators
Mixers
Function Generator
PLL (using the above as building blocks)
I get that Rx can be considered as 'Linq-to-streams', and TPL is an abstraction over concurrency.
I also get that Rx uses TPL internally to manage its asynchronous bits and that TPL Dataflow adds composability to TPL.
So both are asynchronous, both are composable, both are quite high level (Rx more so).
Where should each be used, both generally and in my Signal Processing items above?

It depends on what kind of primitives you're dealing with. Rx and TPL are much richer if you're using amplified types to push data, but if you're dealing with individual samples (such as an IObservable<byte>, ISourceBlock<float>, etc.) they might be tedious to work with.
Having recently implemented a Function Generator, FFT, and power spectra quantiser, among others, I started out with Rx (this wasn't a case for concurrency/parallelism, where TPL excels), but found that I spent more time trying to make it work in the Rx model. I eventually settled for System.IO.Stream.
It worked out well for me and was surprisingly composable. However, performance and avoiding GC were top of my list, so if you don't mind either, I'd suggest Rx. You can do some really cool things with reactive combinators.
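To make the idea of composing per-sample blocks a bit more concrete, here is a minimal sketch using Python generators as a stand-in for a pull-based stream. It is not Rx, TPL Dataflow, or System.IO.Stream, and every name in it (`function_generator`, `fir_filter`, `integrator`, `mixer`) is made up for illustration. It only shows how a few of the blocks listed in the question could be chained.

```python
import math
from itertools import islice

def function_generator(freq_hz, sample_rate_hz, amplitude=1.0):
    """Endless sine source (hypothetical block): yields one sample at a time."""
    n = 0
    while True:
        yield amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        n += 1

def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n-k]."""
    history = [0.0] * len(taps)
    for x in samples:
        history = [x] + history[:-1]  # newest sample first
        yield sum(t * h for t, h in zip(taps, history))

def integrator(samples):
    """Running sum of the input samples."""
    acc = 0.0
    for x in samples:
        acc += x
        yield acc

def mixer(a, b):
    """Multiply two sample streams element by element."""
    for x, y in zip(a, b):
        yield x * y

# Compose: sine source -> 2-tap moving average -> integrator.
source = function_generator(freq_hz=1.0, sample_rate_hz=8.0)
smoothed = fir_filter(source, taps=[0.5, 0.5])
first_samples = list(islice(integrator(smoothed), 8))
```

Each block only knows about the stream it consumes, which is roughly the property that makes either an observable pipeline or a chain of dataflow blocks readable for this kind of work.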

Related

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED -- if ch_ena is false it should always be off as the channel is disabled, otherwise it should either be on solidly or flashing according to pwm_d, according to pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to realise how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
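The claim that both expressions share one truth table is easy to check by brute force. The sketch below models the two VHDL expressions from the question as Python functions (binary values only, ignoring the non-binary std_logic states) and compares their outputs over all eight input combinations. The function names are made up for illustration.

```python
from itertools import product

def led_mux(pwm_d, ch_ena, pwm_ena):
    # (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena
    return (pwm_d and ch_ena) if pwm_ena else ch_ena

def led_logic(pwm_d, ch_ena, pwm_ena):
    # ch_ena and (pwm_d or not pwm_ena)
    return ch_ena and (pwm_d or not pwm_ena)

def truth_table(fn):
    """Outputs of fn for every combination of its three binary inputs."""
    return tuple(bool(fn(*bits)) for bits in product((False, True), repeat=3))

tables_match = truth_table(led_mux) == truth_table(led_logic)  # True
```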
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool specific, so the best way to find out what logic is generated is to try it with the specific tool, especially for large or timing-critical parts of the design, where the implementation is critical.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
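As a rough illustration of mapping everything down to 2-input muxes, the sketch below builds the LED expression from the question out of a hypothetical `mux2` primitive only, then checks it against the plain Boolean form. This is just one possible decomposition written by hand. It is not what any particular mapper is known to produce.

```python
from itertools import product

def mux2(sel, when_0, when_1):
    """Hypothetical 2-input mux primitive: picks when_1 if sel is set."""
    return when_1 if sel else when_0

def led_from_muxes(pwm_d, ch_ena, pwm_ena):
    # An AND gate is a mux with a constant 0 on its "select low" input.
    gated = mux2(pwm_d, False, ch_ena)   # pwm_d and ch_ena
    # The when-else from the question is itself a mux on pwm_ena.
    return mux2(pwm_ena, ch_ena, gated)

def led_reference(pwm_d, ch_ena, pwm_ena):
    return ch_ena and (pwm_d or not pwm_ena)

mux_mapping_ok = all(
    led_from_muxes(*bits) == led_reference(*bits)
    for bits in product((False, True), repeat=3)
)
```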
With regard to the question, I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later when you have forgotten what you meant them to do. I would lean toward the when-else simply from a code maintenance point of view. Even for this trivial example you have to think closely about what behavior the Boolean form describes, whereas the when-else describes the intended behavior directly in human-level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Booleans are better, particularly when doing bitwise operations across vectors in parallel which requires messy loops to do the same imperatively. It all depends on the context.

VHDL behavioural vs structural performance

I was wondering whether, in terms of "performance", there's some kind of difference between structural and behavioural VHDL. I know that nowadays it is more common to write behavioural rather than structural code, but since I'd like to understand the performance implications, I have been thinking that maybe there's some difference...
There is no hardware-related reason to prefer one form or the other.
It may be that one form leads to faster simulation than the other; I haven't seen any evidence for this in general, but then I haven't looked. It is true that after synthesis, the design is translated into a structural form, and post-synthesis simulation is slow, but this is due to the sheer size of the resulting structural version expressed as thousands of individual gates and their interconnections.
What matters more is the quality of synthesis results: It should be possible to write a design in both forms and have it synthesise to essentially the same hardware. And this seems to be generally true, in my experience.
Sometimes you will find synthesis tools have difficulty efficiently translating a construct (usually behavioural) but not as frequently as in the past.
What matters most (unless you are pushing the boundaries of speed or FPGA size) is clarity, leading to readability, reliability, efficiency, testability, maintainability and so on. If you can't understand it you can't see the inefficiencies or even test it properly.
Here, structural VHDL has a role to play at the top level : dividing a system into blocks like CPU, memory interface, FFT processor, UART, SPI and so on. Sometimes hierarchically, so you might want to divide the memory interface into refresh logic, error correction, address multiplexing, and so on.
But most blocks - for example, tasks that a single state machine can handle, are clearest and simplest when expressed behaviourally. So in a UART you might have two separate processes for TX and RX, while an SPI interface (which sends and receives on the same SPI clock) is probably best as a single behavioural process.

Efficient use of ALMs (Adaptive Logic Modules)?

I have a Verilog design that compiles to ~15K LEs on a Cyclone IV (EP4CE22F17C6N). When I compile the same code on a Cyclone V (5CEFA2F23C8N), it takes ~8500 ALMs. Based on Altera's own LE equivalency for the particular Cyclone V, this would be ~20K LEs. Now, I realize that the estimates are going to be highly dependent on the particular design, but a 33% increase in "effective" resource utilization seems like a lot.
So it makes me wonder if there are design tips/tricks/etc. for making more efficient use of ALMs. In particular, I'm looking for Verilog constructs that would improve the register density, fabric density, dense packing, etc.
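For reference, the 33% figure follows directly from the rounded numbers quoted in the question. The sketch below only reworks that arithmetic. The LE-per-ALM ratio it prints is the one implied by those numbers, not an official conversion factor.

```python
# Rounded figures quoted in the question.
le_cyclone_iv = 15_000        # LEs used on the Cyclone IV
alm_cyclone_v = 8_500         # ALMs used on the Cyclone V
le_equiv_cyclone_v = 20_000   # LE equivalent the poster derived for those ALMs

# Ratio implied by the poster's own numbers (an inference, not a datasheet value).
implied_le_per_alm = le_equiv_cyclone_v / alm_cyclone_v

# Relative growth in "effective" LE usage between the two devices.
relative_increase = le_equiv_cyclone_v / le_cyclone_iv - 1

print(f"implied LEs per ALM: {implied_le_per_alm:.2f}")
print(f"relative increase:   {relative_increase:.0%}")
```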
I would agree with the comments above that generally you shouldn't need to optimise, however it's always important to check that your code does map to the chosen architecture. Specifically:
Reset
Using the wrong kind of reset for your architecture can cause problems. It's also very easy to accidentally cause the synthesis tool to insert logic to emulate a clock-enable. For full details see this answer. For Altera you should be using an asynchronous reset which is synchronously de-asserted.
Priority of control signals
In Altera:
Asynchronous Clear, aclr—highest priority
Asynchronous Load, aload
Enable, ena
Synchronous Clear, sclr
Synchronous Load, sload
Data In, data—lowest priority
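One way to read the priority list is as a chain of early returns. The sketch below is a simplified behavioural model of a single register bit that follows the order above. The `adata` and `sdata` parameters (the values loaded by aload and sload) are names assumed for illustration, and the asynchronous controls are simply evaluated first rather than modelled as truly asynchronous.

```python
def next_q(q, *, aclr=False, aload=False, adata=False, ena=True,
           sclr=False, sload=False, sdata=False, data=False):
    """Register value after a clock edge, following the listed priority."""
    if aclr:          # asynchronous clear wins over everything
        return False
    if aload:         # asynchronous load
        return adata
    if not ena:       # disabled: hold, so sclr/sload/data are ignored
        return q
    if sclr:          # synchronous clear
        return False
    if sload:         # synchronous load
        return sdata
    return data       # plain data input has the lowest priority
```

If the HDL describes these controls in a different order (for example a synchronous clear that should act even when the enable is low), the register's built-in priority no longer fits and the tool has to add logic around it, which is the kind of hidden cost the answer is warning about.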
Latches
Easy to grep from the reports, but unless you're absolutely sure it's intentional, latches are generally bad mmmmkay.
Synthesis
There are many options available for tweaking the behaviour of the synthesis process. Here are a few that will affect your results:
ALM_REGISTER_PACKING_EFFORT
This option guides the Fitter when packing registers into ALMs.
MUX_RESTRUCTURE
Allows the Compiler to reduce the number of logic elements required to implement multiplexers in a design.
OPTIMIZATION_TECHNIQUE
Specifies the overall optimization goal for Analysis & Synthesis: attempt to maximize performance, minimize logic usage, or balance high performance with minimal logic usage.
Bear in mind that if your device isn't getting too full, the tool won't have much "incentive" to minimise logic utilisation unless you explicitly tell it to.

Implementations of algorithms for evaluating circuits

Consider the problem of circuit evaluation, where the input is a boolean circuit C and an input string x and you want to compute C(x). (Assume fan-in 2 if you like.)
This is a 'trivial' problem algorithmically, however it appears non-trivial to implement when C can be huge (think several million gates) and memory management becomes an issue.
There are several ways this problem can be approached, trading off memory, time, and disc access. But before going through all this work myself, does anyone know of any existing implementations of algorithms for this problem? It would be surprising to me if none exist...
For C/C++, the standard digital circuit design & simulation system for more than 10 years now is SystemC.
It is a library that allows you to design digital logic in C++. There is supporting software that allows you to do timing analysis and even generate a schematic netlist for C code.
I've only played with it a little before deciding that I was more comfortable with Verilog. But it is a mature piece of software with lots of industry support. Googling around will yield a lot of information including several tutorial pages.
It sounds like Binary Decision Diagrams could be used for your task? There are well-known algorithms (and implementations) of these which are very compact in terms of memory usage, given that they are designed to be used on huge state spaces.
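As a baseline for the problem posed in the question, here is a minimal sketch of the straightforward evaluation, assuming the circuit is already given as a topologically ordered list of fan-in-2 gates. The gate encoding `(op, a, b)` and the function name are made up for illustration. This is not SystemC or a BDD package.

```python
from operator import and_, or_, xor

OPS = {"AND": and_, "OR": or_, "XOR": xor}

def evaluate_circuit(num_inputs, gates, x):
    """Evaluate a circuit on input bits x and return the last wire's value.

    Wire i < num_inputs is input bit i; gate j drives wire num_inputs + j.
    Each gate is (op, a, b) with a and b earlier wire indices; NOT uses only a.
    """
    if len(x) != num_inputs:
        raise ValueError("input length does not match num_inputs")
    wires = bytearray(num_inputs + len(gates))  # one byte per wire
    wires[:num_inputs] = bytes(x)
    for j, (op, a, b) in enumerate(gates):
        if op == "NOT":
            wires[num_inputs + j] = 1 - wires[a]
        else:
            wires[num_inputs + j] = OPS[op](wires[a], wires[b])
    return wires[-1]

# XOR built from OR, AND and NOT: (x0 or x1) and not (x0 and x1).
xor_gates = [("OR", 0, 1), ("AND", 0, 1), ("NOT", 3, 3), ("AND", 2, 4)]
result = evaluate_circuit(2, xor_gates, [1, 0])
```

This version keeps one byte per wire in memory, so a few million gates needs only a few megabytes for the wire values. The gate list is read once and in order, so it could be streamed from disk instead of held in a Python list.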

Comparing FPGA with ASIC design

I have a fundamental question. I produced an FPGA image for a media application, and now I would like to compare my results to those of an ASIC implementation of the same algorithm in terms of performance and area. I have heard that such a comparison does not make sense, since it is somewhat like comparing apples and oranges. But I have also heard about the gate-equivalence metric; can't I use this for comparison purposes?
Thanks
As has been pointed out, gate equivalents are only a rough guesstimate and not all that accurate for determining area in an ASIC. There are different ways you can go about finding out how your design would perform (and cost) in an ASIC. You likely used an HDL (VHDL or Verilog) to implement your design. If you have access to a synthesis tool like Synopsys' Design Compiler (DC) you can use that with one of the supplied ASIC vendor libraries to determine area. You can also use it to generate a post-synthesis, gate-level netlist that you can use in simulation to determine performance. DC will also give you information about critical path timing, etc. that can be used to calculate performance as well.
However, DC is a very expensive product and you likely used FPGA vendor supplied tools to synthesize your HDL design. You could approach an ASIC vendor and ask them to analyze your design to determine size & performance (they would likely use DC - you'd have to be willing to hand your HDL over to them). They may be inclined to do this in order to win your business. But as has been pointed out ASIC NREs are very expensive, so unless you have a high-volume product it probably doesn't make sense to move your design to an ASIC.
The gate equivalence metric might get you to within an order of magnitude - if that's good enough for you? The problem is that a 4-input LUT can implement a single AND gate, or a complex 4-input function representing several gates. Or (in a Xilinx chip) it can be a shift register with 16 bits of memory in it. And it has a flipflop attached to its output (with attendant control signals and the like.... another few gates). And if you've used Block memory or the DSP blocks, they are even harder to quantify.
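To see why counting LUTs says little about gate count, it helps to treat a 4-input LUT as a 16-bit truth table. In the sketch below a lone 2-input AND and a 4-input parity function (three XOR gates if built discretely) each occupy exactly one LUT. The bit ordering used by `lut4_init` is a convention chosen for this example, not a vendor's INIT format.

```python
from itertools import product

def lut4_init(fn):
    """Pack a 4-input Boolean function into a 16-bit truth table.

    Bit i holds the output for input combination i, with the first
    argument as the most significant bit of i (a convention assumed here).
    """
    init = 0
    for index, bits in enumerate(product((0, 1), repeat=4)):
        if fn(*bits):
            init |= 1 << index
    return init

# One gate: a AND b, with the other two inputs unused.
and2_init = lut4_init(lambda a, b, c, d: a & b)

# Three gates' worth of logic: parity of all four inputs.
parity_init = lut4_init(lambda a, b, c, d: a ^ b ^ c ^ d)

print(f"AND2   -> {and2_init:#06x}")
print(f"parity -> {parity_init:#06x}")
```

Both functions cost the same single LUT, so the same LUT count can correspond to very different ASIC gate counts.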
When you say you want to compare performance and area, do you really mean "cost"? Is this a potential product with millions of units sold, or "just" a few 10s of thousands? ASIC NRE is big!
You can optimise your FPGA design for cost as well, which might be good enough, depending on your volumes. For example, an image-processing design done in a traditional fashion can be 10x bigger than one designed for seriously small FPGA usage, with similar application performance... if you know what you're doing :)

Resources