I am designing a video pixel data processing pipeline in VHDL which involves several steps including multiply and divide.
I want to keep signals synchronised so that I can e.g. maintain a sync signal and output it correctly at the end of the pipeline along with manipulated pixel data which has been through several processing stages.
I assume I want to use shift registers or something to delay signals by the right number of cycles so that the output is correct, but I'm looking for advice about good ways to design this, particularly as the number of pipeline stages for different signals may vary as I evolve the design.
Good question.
I'm not aware of a complete solution but here are two partial strategies...
Interconnecting components... It would be really nice if a component could export a generic whose value was its pipeline depth. Unfortunately it can't, and dedicating a port to this seems silly (though it's probably workable; as it would be an integer constant, it would disappear in synthesis).
Failing that, pass IN a generic indicating the budget for this module. Inside the module, assert (severity FAILURE) if the budget can't be met. (This assert is checkable at synth time, and at least Xilinx XST handles similar asserts.)
Make the budget a hard number, and either assert if it is not equal to the actual pipeline depth, or add pipe stages inside the module if the budget is too large and only assert if the budget is too small.
That way you are connecting predictable modules, and the top level can perform pipeline arithmetic to balance things (e.g. passing a computed constant value to a programmable delay line).
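The top-level "pipeline arithmetic" can be sketched in a few lines. This is a quick Python model rather than VHDL, purely to show the bookkeeping: the path names and depths are invented for illustration, and `check_budget` only mimics the in-module assert described above.

```python
def balance_delays(path_depths):
    """Return the extra delay (in clock cycles) each path needs so that
    every path ends up as deep as the deepest one."""
    if any(depth < 0 for depth in path_depths.values()):
        raise ValueError("pipeline depth cannot be negative")
    target = max(path_depths.values())
    return {name: target - depth for name, depth in path_depths.items()}


def check_budget(budget, actual_depth):
    """Mimic the in-module assert: fail if the budget is too small,
    otherwise return how many padding stages the module must add."""
    if budget < actual_depth:
        raise AssertionError(f"budget {budget} < actual depth {actual_depth}")
    return budget - actual_depth


# Hypothetical depths: the pixel path is the critical one,
# the sync flag passes through no processing at all.
depths = {"pixel": 7, "sync": 0, "alpha": 3}
extra = balance_delays(depths)
```

Here `extra["sync"]` is the constant you would hand to the programmable delay line on the sync path.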
Within a component... I use a single process, with registers represented as internal signals whose names reflect their pipe stage, exponent_1, exponent_2, exponent_3 and so on. Within the process, the first section describes all the actions for the first cycle, the second section describes the second cycle, and so on. Typically the "easier" paths may be copied verbatim to the next pipe stage, just to sync them with the critical path. The process is fairly organised and easy to maintain.
I might break a 32-bit multiply down into 16*16 chunks and pipeline the partial product additions. The control this gives USED to give better results than XST achieved on its own...
I know some people prefer variables within a process, and I use them for intermediate results in a pipe stage, but using signals I can describe the pipeline in its natural order (thanks to postponed assignment) whereas using variables, I would have to describe it backwards!
I create a package for each of my major processing blocks; one of the constants in there is the processing delay of that block. I can then connect that up to my general-purpose "delay-line" block, which has a generic for the number of cycles.
Keeping that constant in "sync" with the actual implementation is best done by a self-checking testbench.
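As a rough illustration of that self-check, here is a small Python model (not VHDL). `ScaleBlock`, `DelayLine`, `measure_latency` and `SCALE_BLOCK_DELAY` are all hypothetical names made up for this sketch; the idea is simply to push an impulse through the block, count the cycles until it emerges, and compare that with the published constant.

```python
from collections import deque

SCALE_BLOCK_DELAY = 3  # hypothetical constant, as it would sit in the block's package


class ScaleBlock:
    """Hypothetical processing block: doubles its input over three register stages."""

    def __init__(self):
        self.s1 = self.s2 = self.s3 = 0

    def step(self, x):
        """One clock edge: return the current output, then advance the stages."""
        out = self.s3
        # Update the last stage first so each stage takes the previous stage's old value.
        self.s3 = self.s2
        self.s2 = self.s1
        self.s1 = 2 * x
        return out


class DelayLine:
    """General-purpose delay line: 'cycles' back-to-back registers, reset to 0."""

    def __init__(self, cycles):
        self.regs = deque([0] * cycles)

    def step(self, value):
        self.regs.append(value)
        return self.regs.popleft()


def measure_latency(step, max_cycles=64):
    """Feed a single non-zero sample and count cycles until the output is non-zero."""
    for cycle in range(max_cycles):
        if step(1 if cycle == 0 else 0) != 0:
            return cycle
    raise RuntimeError("impulse never reached the output")


# Run the sync flag through a delay line sized by the constant.
block = ScaleBlock()
sync_delay = DelayLine(SCALE_BLOCK_DELAY)
aligned = []
for pixel, sync in [(5, 1), (6, 0), (7, 0), (0, 0), (0, 0), (0, 0)]:
    aligned.append((block.step(pixel), sync_delay.step(sync)))
```

If someone later adds a stage to the block without updating the constant, the latency check fails immediately instead of the sync flag silently drifting by one pixel.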
Something to consider is delay lines (i.e. back to back registers) vs FIFOs.
Consider a module X with a pipeline delay N. FIFOs work well when N is variable. The trick is remembering that you can only request new work when both the module and the FIFO can accept it. Ideally you size the FIFO so that it can contain the maximum number of items that X can work on concurrently, but sometimes that's not practical, for example if your calculation includes accesses to a distant memory.
Another option is integrating the side channel (i.e. the path that your sync flag is taking) into the module X rather than it going outside. If you do this then if any part of the calculation has to stall, you can also stall the side channel and the two stay in sync. You can do this because you're in a scope that has all the necessary signals in it. Then all signals, whether used in the calculation or not, appear at the output at the same time.
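The FIFO variant can be modelled roughly like this in Python. Everything here is an assumption for the sake of illustration: `VariableLatencyBlock` stands in for module X, it completes items in order, and the per-item latencies are made up. The point is the acceptance rule (new work only when both X and the FIFO have room) and that the sync flag is popped exactly when a result emerges.

```python
from collections import deque


class VariableLatencyBlock:
    """Hypothetical in-order block X whose per-item latency varies; it doubles each value."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inflight = deque()  # entries are [remaining_cycles, value]

    def can_accept(self):
        return len(self.inflight) < self.capacity

    def push(self, value, latency):
        self.inflight.append([latency, value])

    def step(self):
        """Advance one cycle; return a finished result, or None if nothing is ready."""
        for item in self.inflight:
            item[0] -= 1
        if self.inflight and self.inflight[0][0] <= 0:
            return 2 * self.inflight.popleft()[1]
        return None


def run(items, block_capacity=2, fifo_depth=2):
    """items: (value, sync_flag, latency) tuples. Returns (result, sync_flag) pairs."""
    block = VariableLatencyBlock(block_capacity)
    side_fifo = deque()
    pending = deque(items)
    outputs = []
    while pending or block.inflight:
        result = block.step()
        if result is not None:
            # A result emerged, so its sync flag is at the head of the side FIFO.
            outputs.append((result, side_fifo.popleft()))
        # Request new work only when BOTH the block and the FIFO can accept it.
        if pending and block.can_accept() and len(side_fifo) < fifo_depth:
            value, sync, latency = pending.popleft()
            block.push(value, latency)
            side_fifo.append(sync)
    return outputs


outputs = run([(10, "a", 3), (20, "b", 1), (30, "c", 5), (40, "d", 1)])
```

Each sync flag stays paired with its own result even though the individual latencies differ.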
This problem has been bothering me for a long time. Based on my understanding:
set_false_path is a timing constraint for paths that do not need to be optimized for timing. We can use it for a two-flop synchronizer, since the signal is not required to be captured within a limited time.
set_clock_groups saves us from defining too many false paths.
set_multicycle_path is used to relax the path requirement when the default worst-case requirement is too restrictive. We can set the setup/hold cycle counts to fix the timing. We can use it across clock domains.
set_max_skew / set_max_delay -datapath_only are used on asynchronous-FIFO-style crossings that convert the read/write pointers from binary to Gray code. It looks like set_max_skew helps control the skew between the multiple bits of the Gray code going to the double-flop synchronizers. Why do you need "-datapath_only"? Just using set_multicycle_path will also pass the timing check.
So in summary, all those methods can be used in an async FIFO, right?
And set_false_path is the simplest way: no need to worry about the MCP cycle count or max delay. I guess we use it only when the logic between the two FFs is "combinational"? Can we use it when there is sequential logic between the two cross-domain FFs?
If ignoring all timing calculations using FP is bad, when is a good time to use it? In theory I can replace all the FPs with MCPs.
What factors do you need to consider in order to choose the most suitable constraints?
Your post appears to contain the following four questions:
Question 1: So in summary, all those methods can be used in an async FIFO, right?
Question 2: And set_false_path is the simplest way: no need to worry about the MCP cycle count or max delay. I guess we use it only when the logic between the two FFs is "combinational"? Can we use it when there is sequential logic between the two cross-domain FFs?
Question 3: If ignoring all timing calculations using FP is bad, when is a good time to use it? In theory I can replace all the FPs with MCPs.
Question 4: What factors do you need to consider in order to choose the most suitable constraints?
Here are my answers to those four questions:
Answer 1: As shown in the figure below, with an asynchronous FIFO, data can arrive at arbitrary time intervals on the transmission side, and the receiving side pulls data out of the queue as it has the bandwidth to process it.
Therefore, yes, you can use all of those constraints/methods for an asynchronous FIFO.
Answer 2: Yes, set_false_path can be considered one of the simplest. And as the following figure shows, you are right: we use it when the logic between the two FFs is "combinational".
Furthermore, based on my understanding, we do not use it when there is sequential logic between them.
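As background for the pointer crossing inside such a FIFO: the pointers are converted to Gray code because consecutive values then differ in exactly one bit, so a synchronizer that samples mid-change sees either the old or the new pointer value. A small Python check of that property follows; the helper names and the depth of 16 are my own choices for illustration, not taken from any tool.

```python
def bin_to_gray(value):
    """Convert a non-negative binary count to its Gray-code equivalent."""
    return value ^ (value >> 1)


def gray_to_bin(gray):
    """Convert a Gray-code value back to the binary count."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


DEPTH = 16  # hypothetical FIFO depth (a power of two, so the wrap-around also changes one bit)
codes = [bin_to_gray(i) for i in range(DEPTH)]
steps = [bits_changed(codes[i], codes[(i + 1) % DEPTH]) for i in range(DEPTH)]
```

Every entry in `steps` is 1, including the wrap from the last pointer value back to zero. The skew constraint on the crossing is there to preserve that one-bit-at-a-time property as seen by the receiving clock domain.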
Answer 3: A false path is similar to the multicycle path in that it is not required to propagate signals within a single clock period. The difference is that a false path is not logically possible as dictated by the design. In other words, even though the timing analysis tool sees a physical path from one point to the other through a series of logic gates, it is not logically possible for a signal to propagate between those two points during normal operation. The main difference between a multicycle path with many available cycles (large n) versus a false path is that the multicycle path will still be checked against setup and hold requirements and will still be included in the timing analysis. It is possible for a multicycle path to still fail timing, but a false path will never have any associated timing violations.
Hence use a multicycle path in place of a false path constraint when:
your intent is only to relax the timing requirements on a synchronous path;
but
the path still must be timed, verified and optimized.
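To make the difference concrete, here is a deliberately simplified setup-slack calculation in Python. It assumes a single clock with no skew or uncertainty and ignores the hold check entirely; the function name and all numbers are invented for illustration, and a real timing tool does considerably more.

```python
def setup_slack_ps(period_ps, path_delay_ps, setup_ps, cycles=1, false_path=False):
    """Simplified setup slack for a same-clock path, in picoseconds.

    cycles=1 is the default single-cycle check; cycles=N models a setup
    multicycle of N. A false path is not timed at all, so None is returned.
    """
    if false_path:
        return None
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    return cycles * period_ps - (path_delay_ps + setup_ps)


# Hypothetical path: 5 ns clock, 7.2 ns of logic, 0.3 ns setup time.
default_slack = setup_slack_ps(5000, 7200, 300)                   # fails the one-cycle check
mcp2_slack = setup_slack_ps(5000, 7200, 300, cycles=2)            # passes with two cycles
slow_mcp2_slack = setup_slack_ps(5000, 11000, 300, cycles=2)      # still timed, still fails
fp_slack = setup_slack_ps(5000, 11000, 300, false_path=True)      # never reported
```

The third case is the one that matters: with a multicycle constraint the tool still tells you when the path gets too slow, whereas with a false path it stays silent no matter how long the path becomes.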
Answer 4:
This is a valid question, but it is too broad; it all depends on the underlying design. Most implementation tools for FPGA layout have plenty of optimization options, and obviously not all constraints are used by all steps in the compilation flow. Based on my experience, and citing Reference 1, the constraints that must be included in every design are all clock definitions, I/O delays, pin placements, and any relaxed constraints, including multicycle and false paths.
The following two references explain the use of constraints further:
Reference 1
Reference 2
I am not sure how to express my scenario using activity diagrams:
What I am trying to visualise is the fact that:
A message is received
Two independent and concurrent actions take place: logging of the message and processing the message
Logging always takes less time than processing
The first activity in the diagram is correct in the sense that the actions are independent but it does not relay the fact that logging is guaranteed to take less time than processing.
The second activity in the diagram is not correct because, even if logging completes before processing, it looks as though processing depended on the logging's finishing first and that does not represent the reality.
Here is a non-computer related example:
You are a novice in birdwatching, trying to make your first notes in your notebook about birds passing by
A flock of birds approaches, you try to recognise as many details as possible
You want to write down the details in your notebook, but wait: you begin to realise that your theoretical background does not work in practice. What should be a quick scribble actually amounts to nothing in the end, because you did not recognise anything
In the meantime, the birds majestically flew away without waiting for you; the activity is gone
Or maybe you did actually write it down; it took you only a moment and the birds are still nearby, slowly flying away, ending the activity again after some time
Or maybe you were in such awe that you just kept watching them, without taking any notes; they fly away, disappearing over the horizon, ending the activity
After a few hours, you have enough notes and you come home very happy - maybe you did not capture everything but this was enough to make you smile anyway
I can always add a comment to a diagram to express it all somehow but I wonder, is there a more structured way to express what I described in an activity diagram? If not an activity diagram then what kind of a diagram would be better suited in your opinion? Thank you.
Your first diagram assumes that the duration of logging is always shorter than processing:
If this assumption is correct, the upper flow reaches the flow-final node, and the remaining flows continue until the first reaches the activity-final node. Here, the processing continues and the activity ends when the processing ends. This is exactly what you want.
But if the execution ever deviated from this assumption and logging were delayed for any reason, then the end of the processing would reach the activity-final node, resulting in the immediate interruption of all other ongoing actions. So logging would not complete. Maybe it's not a problem for you, but in most cases audit expects logs to be complete.
A safer way would be to add a join node:
The advantage is that the activity does not depend on any assumptions. It will always work:
whenever the logging is faster, the token on that flow will wait at the join node; as soon as the processing is finished, the join can happen and the outgoing token (safely) reaches the end of the activity. This is exactly what you currently expect.
if the logging is exceptionally slower, no problem: the processing will be over, but the activity will wait for the logging to be completed.
This robust notation makes logging like Schroedinger's cat in its box: we don't have to know which action is longer or shorter. At the end of the activity, both actions are completed.
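If it helps to think in code, the fork/join version behaves roughly like starting two threads and joining both. This Python sketch is only an analogy (the function name and durations are made up): the activity ends once both actions are done, whichever happens to be slower.

```python
import threading
import time


def run_activity(log_seconds, process_seconds):
    """Fork two concurrent actions, then join: return only when both are complete."""
    completed = []
    lock = threading.Lock()

    def action(name, seconds):
        time.sleep(seconds)  # stand-in for the real work
        with lock:
            completed.append(name)

    threads = [
        threading.Thread(target=action, args=("log", log_seconds)),
        threading.Thread(target=action, args=("process", process_seconds)),
    ]
    for t in threads:   # fork node: both flows start
        t.start()
    for t in threads:   # join node: wait for a token on every incoming flow
        t.join()
    return completed


usual = run_activity(log_seconds=0.01, process_seconds=0.05)
unusual = run_activity(log_seconds=0.05, process_seconds=0.01)
```

In both runs the returned list contains both `"log"` and `"process"`; only the order in which they finished differs.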
Time in activity diagrams?
Activity diagrams are not really meant to express timing and duration. They are about the flow of control and the synchronization.
However, if time is important to you, you could:
visually make one activity shorter than the other. This is super-ambiguous and absolutely meaningless from a formal UML point of view, but it's intuitive when readers see the parallel flow (a kind of subliminal communication ;-) ).
add a comment note to express your assumption in plain English. This has the advantage of being very clear and unambiguous.
use UML duration constraints. These are often used in timing diagrams, sometimes in sequence diagrams, but in general not in activity diagrams (personally I have never seen it, but the UML specs don't exclude it either).
Time is something very general in the UML specs, and defined independently of the diagram. For example:
8.4.4.2: A Duration is a value of relative time given in an implementation specific textual format. Often a Duration is a non-negative integer expression representing the number of “time ticks” which may elapse during this duration.
8.5.1: An Interval is a range between two values, primarily for use in Constraints that assert that some other Element has a value in the given range. Intervals can be defined for any type of value, but they are especially useful for time and duration values as part of corresponding TimeConstraints and DurationConstraints.
In your case you have a duration observation for the processing (e.g. d), and a duration constraint for the logging (e.g. 0..d).
8.5.4.2: An IntervalConstraint is shown as an annotation of its constrainedElement. The general notation for Constraints may be used for an IntervalConstraint, with the specification Interval denoted textually (...).
Unfortunately little more is said. The only graphical examples are for messages in sequence diagrams (Fig 8.5 and 17.5) and for timing diagrams (Fig 17.28 to 17.30). Nevertheless, the notation could be extrapolated for activity diagrams, but it would be so unusual that I'd rather recommend the comment note.
There used to be a parameter called CL_DEVICE_MAX_COMPUTE_UNITS that can be queried in OpenCL by calling clGetDeviceInfo, which indicates the number of parallel compute units on the OpenCL device, as a single work-group executes on a single compute unit.
However, there doesn't seem to be a way to query that parameter in Vulkan.
Or am I missing something, as in it can actually be queried? Or do we usually choose a default value (such as 256) arbitrarily when the input size is indeterminate?
Vulkan has no way to ask that question. And that's probably for the best.
First, the concept of "compute unit" was not well defined even in OpenCL. So exactly what this value means is not well understood.
Second, if the question you really want to ask is "how many work groups can execute in parallel at any one time", then the answer may be shader-dependent. For example, if a piece of hardware can execute 32 work items on a single computation unit, it may be able to populate these 32 work items from distinct work groups. That is, your notion that "a single work-group executes on a single compute unit" is not necessarily true.
If a shader's work group size is 16, there's little to be lost by running two such work groups at the same time. Sure, different barrier usage may cause them to get split up, but it may not. It's probably better to take the chance that it'll work than to assume it won't.
And third... what exactly do you intend to do with that information? If you have X work groups to execute, issuing multiple dispatch commands in groups of CL_DEVICE_MAX_COMPUTE_UNITS isn't going to make this process go faster. And trying to interleave work groups from different compute tasks is going to be slower, due to having to reset pipelines or other state. It's better to throw the whole work at the GPU and let its scheduler sort out how to apply the work items to the work groups.
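In practice, "throwing the whole work at the GPU" mostly comes down to computing one work group count that covers all the input for a single dispatch. A minimal sketch of that arithmetic in Python (the function name is mine, and 256 is just the local size floated in the question, not a recommendation):

```python
def work_group_count(total_items, local_size):
    """Number of work groups needed so that every item gets an invocation (ceiling division)."""
    if total_items < 0 or local_size <= 0:
        raise ValueError("total_items must be >= 0 and local_size must be > 0")
    return (total_items + local_size - 1) // local_size


LOCAL_SIZE = 256  # assumed local size declared in the shader
groups_for_1000 = work_group_count(1000, LOCAL_SIZE)
groups_for_1025 = work_group_count(1025, LOCAL_SIZE)
```

The last work group is usually only partly filled, so the shader needs a bounds check against the real item count. The resulting count also has to stay within the device's `maxComputeWorkGroupCount` limit, which, unlike a compute unit count, is something Vulkan does let you query.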
The following truth table resulted from the circuit below. An SR (NOR) latch is used. I have tried several times to trace through the circuit to see how the truth table values are produced, but it's not working. Can someone explain to me what is going on? This circuit was introduced in conjunction with racing, although I am not sure if it has anything to do with it.
NOTE: "CLOCK" appears as a straight line to show how it's connected to everything. It is a normal clock that oscillates between 1 and 0 (this is how my instructor drew it).
Strictly, this does belong on EE. The other questions you've found are likely to be old - before EE was established.
You should look at the 1-to-0 transitions of the clock. When that occurs and only when that occurs, the value currently on S is transferred to Q.
The race condition appears when the clock signal is delayed, even by the tiny amount of copper track between real components. The actual waveform is not 1-0 or 0-1; it ramps between the two values. A tiny variation between two components, one seeing the transition at say 2.7V and the other at 2.5V, would mean that the first component moves the value from S to Q fractionally before the second, so when the second component decides to transfer the value, it may see the value after the transfer has occurred on the prior component. You therefore may have a race between the two. These delays can also be affected by supply-rail stability and temperature, so the whole arrangement can become unreliable if not carefully designed. The condition is often overcome by deliberately routing the clock so that it will arrive at the last component in the chain first, giving that end of the chain a head-start.
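A toy model of that race, in Python, for two stages in a chain. It is heavily simplified and every name and number in it is an assumption: time is in arbitrary integer units, the only delays modelled are the clock arrival time at each stage and the first stage's clock-to-output delay, and setup/hold windows and metastability are ignored.

```python
def shift_once(s_in, q_a, q_b, clk_arrival_a, clk_arrival_b, clk_to_q):
    """Return (new_q_a, new_q_b) after one active clock edge.

    Stage A captures s_in; stage B is meant to capture A's OLD output.
    """
    new_q_a = s_in
    a_output_changes_at = clk_arrival_a + clk_to_q
    if clk_arrival_b > a_output_changes_at:
        # Race lost: B's clock arrives after A's output has already changed,
        # so B captures the new value and the data skips a stage.
        new_q_b = new_q_a
    else:
        # Intended behaviour: B still sees A's previous value.
        new_q_b = q_a
    return new_q_a, new_q_b


ideal = shift_once(1, 0, 0, clk_arrival_a=0, clk_arrival_b=0, clk_to_q=2)
late_clock_at_b = shift_once(1, 0, 0, clk_arrival_a=0, clk_arrival_b=5, clk_to_q=2)
clock_reaches_b_first = shift_once(1, 0, 0, clk_arrival_a=3, clk_arrival_b=0, clk_to_q=2)
```

With the clock routed so that it reaches the last stage first, B has already captured the old value before A changes, which is the head-start described above. Shrinking `clk_to_q` (a faster part in position A) makes the race easier to lose.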
I've worked on systems where replacing a component with a faster version caused the circuit to stop working. The new component was working too fast for the remainder of the circuit - and you needed to deliberately select (or use factory-selected) slower versions.
On a related note, before hard drives became cheap, and floppy drives (you may need to google that) before them, it was common to use cassette tapes (even more likely you'd need to google those). Cheap and cheerful was best. If you used a professional quality recorder/player, you'd often get unusable results.
On a single ladder rung, how many outputs can you have? If you have more than one, would it be AND logic or OR logic? Series or parallel? I'm trying to make six lights flash using timer on-delay instructions with a closed input instruction. I will be using an Allen Bradley SLC 500 series PLC.
In a ControlLogix or CompactLogix PLC a ladder logic rung may have as many outputs (OTE) as you like, both at the right hand end of logic rung and even in the middle of a logic rung.
Each output is controlled only by the logic leading up to it. If you have multiple outputs at the same point in the rung, they will all have the output reflecting the logic condition from the rung start up to that point. This is a common method used to drive several outputs with the same signal at once.
If you have multiple outputs at different points in the rung, each will have outputs that correspond to the logic leading to that output. Logic downstream from an OTE acts as if the OTE wasn't present.
Now, you may have complex devices (e.g., Timer) controlled by logic within a rung.
Obviously, further logic that depends on the output of the complex device (e.g., Timer Done) will not be independent of the behaviour of the complex device. But just like OTEs, you may have lots of complex devices in a rung.
If you are programming an SLC500 then you cannot have an OTE in the middle of a rung. It must be at the very right hand side of the rung. You may however (and this is common practice) create a branch around the OTE and have another OTE (or OTU, or OTL, or any other output) on its own branch (again at the very right of the branch).
So using this method you can have as many OTEs as you like on any given rung. However, a best practice is to limit the number (to say 10 or 20 per rung) for readability and split them onto several rungs as necessary.
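To answer the AND/OR part of the question in a form you can play with, here is a tiny Python model of one rung scan. It is only a sketch under simple assumptions (all input contacts in series, all outputs on parallel branches at the right-hand end), and the tag names are invented.

```python
def scan_rung(input_contacts, output_names):
    """Evaluate one rung: series input contacts, then parallel output branches.

    Series contacts behave as AND. Every output branch sees the same rung
    condition, so the outputs are neither ANDed nor ORed with each other.
    """
    rung_true = all(input_contacts)
    return {name: rung_true for name in output_names}


lights = ["LIGHT_1", "LIGHT_2", "LIGHT_3", "LIGHT_4", "LIGHT_5", "LIGHT_6"]
all_on = scan_rung([True, True], lights)
all_off = scan_rung([True, False], lights)
```

All six lights follow the rung condition together. For lights that flash at different times you would give each its own timer-driven condition, typically on separate rungs.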