Combinational loops in HDLs

Combinational loops in HDLs - vhdl

This is just an experiment I'm trying to wrap my brain around.
So I've got two registers r1 r2 and two wires w1 w2. What I want is, if both r's are 1, both w's should be 1. If one r is 1, the corresponding w should be 1 and the other should be 0. If both r's are 0, w1 should be 1 and w2 should be 0.
11=>11
10=>10
01=>01
00=>10
The caveat is I want the assign for w1 not to include r2 directly, and vice versa. So, I've got (in Verilog for instance--a VHDL answer would be perfectly fine too)
assign w1 = r1 | !w2;
assign w2 = r2 | !w1;
Which is necessary but not sufficient. All the cases above are true, but 00=>01 is also true. In fact when r1=r2=0, it just creates a cycle of wires without a driver, so I think the result is non-deterministic.
Is there any way to get the result I'm looking for without including r2 in the assignment for w1, or vice versa? (And without introducing new variables). Basically just to ensure that in a wire-cycle, w1 is pulled high and w2 pulled low?

No, I think there is no clean way to do this without extra wires/signals and without your cross dependency.
By the way, your "cyclical wires" are commonly referred to as a combinational loops and it is a good practice to avoid these.
As for the simulation of a VHDL model with combinational loop, the result is deterministic provided the simulator converges to a stable point, ie no more signal value change. If signal values continuously change, then you are likely to reach your simulator's iteration limit. I don't know for Verilog but I assume it is deterministic as well.
As for synthesis, tools with either reject this construct and raise an error, or try to handle this, with a possible very bad impact on timing.
Again, even if your simulation is ok and your synthesis tool allows this, combinational loops should be avoided.

What you have currently is very similar to an SR latch, and as such it has a metastable condition (also known as a race condition).
From your truth table above though, it looks like w2 should be set only to r2.
assign w2 = r2;
That change should fix your race condition; though as expressed above, be wary of the large restrictions created by combinational logic.

Related

How do you logically formulate the problem of operating systems deadlocks?

It is a programming assignment in Prolog to write a program that takes in input of processes, resources, etc and either print out the safe order of execution or simply return false if there is no safe order of execution.
I am very new to Prolog and I just simply can't get myself to think in the way it is demanding me to think (which is, formulating the problem logically and letting the compiler figure out how to do the operations) and I am stuck just thinking procedurally. How would you formulate such a problem logically, in terms of predicates and whatnot?
The sample input goes as follows: a list of processes, a list of pairs of resource and the number of available instances of that resource and facts allocated(x,y) with x being a process and y being a list of resources allocated to x and finally requested(x,y) such that x is a process and y is a list of resources requested by x.
For the life of me I can't think of it in terms of anything but C++. I am too stuck. I don't even need code, just clarity.
edit: here's a sample input. I seriously just need to see what I need to do. I am completely clueless.
processes([p1, p2, p3, p4, p5]).
available_resources([[r1, 2], [r2, 2], [r3, 2], [r4, 1], [r5, 2]]).
allocated(p1, [r1, r2]).
allocated(p2, [r1, r3]).
allocated(p3, [r2]).
allocated(p4, [r3]).
allocated(p5, [r4]).
requested(p5, [r5]).

What you want to do is apply the "state search" approach.
Start with an initial state S0.
Apply a transformation to S0 according to allowed rules, giving S1. The rules must allowed only consistent states to be created. For example, the rules may not allow to generate new resources ex nihilo.
Check whether the new state S1 fulfills the condition of a "final state" or "solution state" permitting you to declare victory.
If not, apply a transformation to S1, according to allowed rules, giving S2.
etc.
Applying transformations may get you to generate a state from which no progress is possible but which is not a "final state" either. You are stuck. In that case, dump a few of the latest transformations, moving back to an earlier state, and try other transformations.
Through this you get a tree of states through the state space as you explore the different possibilites to reach one of the final states (or the single final state, depending on the problem).
What we need is:
A state description ;
A set of allowed state transformations (sometimes called "operators") ;
A criterium to decide whether we are blocked in a state ;
A criterium to decide whether we have found a final state ;
Maybe a heuristic to decide which state to try next. If the state space is small enough, we can try everything blindly, otherwise something like A* might be useful.
The exploration algorithm itself. Starting with an initial state, it applies operators, generating new states, backtracks if blocked, and terminates if a final state has been reached.
State description
A state (at any time t) is described by the following relevant information:
a number of processes
a number of resources, several of the same kind
information about which processes have allocated which resources
information about which processes have requested which resources
As with anything else in informatics, we need a data structure for the above.
The default data structure in Prolog is the term, a tree of symbols. The extremely useful list is only another representation of a particular tree. One has to a representation so that it speaks to the human and can still be manipulated easily by Prolog code. So how about a term like this:
[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)]
This expresses the fact that process p1 owns two resources r1 and one resource r2 and wants r3 next.
The full state is then a list of list specifying information about each process, for example:
[[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
Operator description
Prolog does not allow "mutable state", so an operator is a transformation from one state to another, rather than a patching of a state to represent some other state.
The fact that states are not modified in-place is of course very important because we (probably) want to retain the states already visited so as to be able to "back track" to an earlier state in case we are blocked.
I suppose the following operators may apply:
In state StateIn, process P requests resource R which it needs but can't obtain.
request(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
In state StateIn, process P obtains resource R which is free.
obtain(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
In state StateIn, process P frees resource R which is owns.
free(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
The code would be written such that if StateIn were
[[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
then free(StateIn, StateOut, p1, r2) would construct a StateOut
[[process(p1),owns(r1),owns(r1),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
which becomes the new current state in the search loop. Etc. Etc.
A criterium to decide whether we are blocked in the current state
Often being "blocked" means that no operators are applicable to the state because for none of the operators, valid preconditions hold.
In this case the criterium seems to be "the state implies a deadlock".
So a predicate check_deadlock(StateIn) needs to be written. It has to test the state description for any deadlock conditions (performing its own little search, in fact).
A criterium to decide whether we have found a final state
This is underspecified. What is a final state for this problem?
In any case, there must be a check_final(StateIn) predicate which returns true if StateIn is, indeed, a final state.
Note that the finality criterium may also be about the whole path from the start state to the current state. In that case: check_path([StartState,NextState,...,CurrentState]).
The exploration algorithm
This can be relatively short in Prolog as you get depth-first search & backtracking for free if you don't use specific heuristics and keep things primitive.
You are all set!

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is high-end intel i7-Haswell processor that supports 64 bit data type, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and f are 64 bits data type,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, How far can the compiler optimize the expression? for example, if A_64 becomes all zeroes (can happen with probability 2^{-64} ) Then we don't need to calculate the first part of eqn.2 then O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a c compiler to optimize such a way? We have to remember that the values are filled up at runtime and the boolean expressions have around 120 variables.
Is it possible for a for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined, now is it possible for a processor to pull out an operation out of the pipeline in if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.

Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). If you need a hard guarantee of course I can't give you that, if you want to be sure you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom, if an input is accidentally zero that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions, they still have to go through (most of) the front-end of the processor even if it is just to find out that they didn't need to be executed after all.

HLS Tool to Make Maths Simpler on FPGAs

I have a problem which is easier solved with a HLS tool than with writing down the raw VHDL / verilog. Currently I'm using a Xilinx Virtex-7 as I think this has been solved already by some other vendors.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a x b - c;
p2 <= p1 x d - e;
p3 <= p2 x f - g;
p4 <= p2 x p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?

There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be three DSP IP cores, and because of the different port widths, I'd have to generate this IP core 3 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP cores (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a : signed(7 downto 0);
signal b : signed(7 downto 0);
signal c : signed(7 downto 0);
signal p1 : signed(a'length+b'length downto 0); -- a times b produces a'length + b'length +1 (which also corresponds to (a times b) - c adding one bit).
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off timing and resource-wise to implement it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part, it how much error you can tolerate is dependent upon the algorithm and expected data input.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor Verilog for that matter). But you can track it fairly well and have bit widths update as necessary. But it will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handing those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C doe is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.

In the past I found flopoco to generate arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance it could generate a arithmetic core to compute something like a=3*sin²(x+pi/3). For these calculations allows you to specify the overall precision of the inputs/outputs (for floating point/fixed point) or the width of the inputs ( integer ). Execution frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial

Ensuring propagation is complete in VHDL without an explicit click

I am looking to build a VHDL circuit which responds to an input as fast as possible, meaning I don't have an explicit clock to clock signals in and out if I don't absolutely need one. However, I am also looking to avoid "bouncing" where one leg of a combinatorial block of logic finishes before another.
As an example, the expression B <= A xor not not A should clearly never assign true to B. However, in a real implementation, the two not gates introduce delays which permit the output of that expression to flicker after A changes but the not gates have not propagated that change. I'd like to "debounce" that circuit.
The usual, and obvious, solution is to clock the circuit, so that one never observes a transient value. However, for this circuit, I am looking to avoid a dependence on a clock signal, and only have a network of gates.
I'd like to have something like:
x <= not A -- line 1
y <= not x -- line 2
z <= A xor y -- line 3
B <= z -- line 4
such that I guarantee that line 4 occurs after line 3.
The tricky part is that I am not doing this in one block, as the exposition above might suggest. The true behavior of the circuit is defined by two or more separate components which are using signals to communicate. Thus once the signal chain propagates into my sub-circuit, I see nothing until the output changes, or doesn't change!
In effect, the final product I'm looking for is a procedure which can be "armed" by the inputs changing, and "triggered" by the sub-circuit announcing its outputs are fully changed. I'd like the result to be snynthesizable so that it responds to the implementation technology. If it's on a FPGA, it always has access to a clock, so it can use that to manage the debouncing logic. If it's being implemented as an ASIC, the signals would have to be delayed such that any procedure which sees the "triggered" signal is 100% confident that it is seeing updated ouputs from that circuit.
I see very few synthesizable approaches to such a procedural "A happens-before B" behavior. wait seems to be the "right" tool for the job, but is typically only synthesizable for use with explicit clock signals.

Theoretically, is comparison between 0 and 255 faster than 0 and 1?

From the point of view of very low level programming, how is performed the comparison between two numbers?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my vision as a human having learned basic programming, I could imagine the following algorithm about == implementation:
b = 0
while b < 8:
if first_number[b] != second_number[b]:
return False
b += 1
return True
Basically this is like comparing each bit step by step, and stop before the end if two bits are different.
Thus we note that the comparison stops at the first iteration compared 0 and 255, while it stops at the last if 0 and 1 are compared.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?

A comparison between integers is tipically implemented by the cpu as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), tipical implementation use a carry-lookahead circuit that allows the calculation of more result bits at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.

Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricky and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.

When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio