Verilog: Sequential Block Time - time

Why is the time for all instructions in a sequential (non-parallel) block the same? For example:
module abc;
  reg [31:0] r;
  initial
  begin
    r = 0;
    $display($time, " ", r);
    r = 1;
    $display($time, " ", r);
    r = r + 2;
    $display($time, " ", r);
    $finish;
  end
endmodule
Output:
0 x
0 0
0 2

Verilog is a language designed to describe models of hardware, and test code for exercising those models, that can be run in a simulator. (It was later re-purposed as a language for describing hardware to logic synthesis tools.)
$time refers not to the real world the simulator is running in but to the simulated world inside the simulator. Roughly speaking, time in the simulated world only moves forward when there is nothing left to do at the current time point.

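To make that concrete, here is a toy discrete-event scheduler in C++. It is only a sketch of the general technique (a time-ordered event queue, with the clock jumping straight to the next pending time), not Verilog's actual simulation engine, and all names in it are made up for illustration:

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// An event is a (time, action) pair; the queue is ordered by time.
struct Event {
    long long time;
    std::function<void(long long)> action;
    bool operator>(const Event& other) const { return time > other.time; }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue;

    auto print = [](long long t) { std::printf("executed at simulated time %lld\n", t); };

    // Three events at time 0 and one at time 5: all of the time-0 work runs
    // before the clock ever reads 5, no matter how long it takes on the host.
    queue.push({0, print});
    queue.push({0, print});
    queue.push({5, print});
    queue.push({0, print});

    long long now = 0;
    while (!queue.empty()) {
        Event e = queue.top();
        queue.pop();
        now = e.time;   // time only advances once the earlier slot is drained
        e.action(now);
    }
}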
A Verilog description of hardware consists of procedural blocks. These blocks are executed in a pseudo-parallel fashion relative to each other, while the code inside each block is simulated sequentially within the same time slot.
Such procedural blocks are always blocks, initial blocks and (in SystemVerilog) final blocks. You are testing an initial block. It is special and is executed, as the name suggests, at the very beginning of the simulation: all of its statements run sequentially, at time 0.
For always blocks the time will generally be non-zero, but it is still the same for all instructions executed in the same time slot.
If you want to see time differences in an initial block, you need to add delays, e.g.:
initial
begin
  r = 0;
  $display($time, " ", r);
  #1
  r = 1;
  $display($time, " ", r);
  #1
  r = r + 2;
  $display($time, " ", r);
  $finish;
end
In the above example I added two single-time-unit delays (#1). You should now see the time incrementing in your output. All instructions are still executed sequentially; each delay just suspends execution for one simulated time unit.
To see parallel behavior you would need a real hardware description with several always blocks, and you would need to simulate it for multiple cycles. Then you might notice that the order of prints between different always blocks varies, depending on the state of the simulation. However, even in this case the simulator will finish simulating all blocks for time 'a' before it starts simulating any block for a later time 'b'.

Related

Fischer's Mutual Exclusion Algorithm

Two processes are trying to enter their critical sections, executing the same program:
while true do begin
  // Noncritical section.
  L: if not(id = 0) then goto L;
  id := i;
  pause(delay);
  if not(id = i) then goto L;
  // Inside critical section.
  id := 0;
end
The constant i identifies the process (i.e., has the value 1 or 2), and id is a global variable, with value 0 initially. The statement pause(delay) delays the execution for delay time units. It is assumed that the assignment id := i; takes at most t time units.
It has been proved that for delay > t the algorithm is correct.
I have two questions:
1) Suppose both processes A and B pass the test at label L. Suppose that from this point on A is always chosen by the scheduler until it enters its critical section. Suppose that, while A is in its critical section, the scheduler dispatches process B; since B has already passed the test at label L, it can also enter its critical section. Where am I wrong?
2) Why is the algorithm not correct if delay == t?
Suppose that processes A and B reach label L at times t_A and t_B respectively (t_A < t_B), but the difference between these times is less than or equal to t (the worst-case assignment time). If it were larger than t, process B would stop at label L and wait until id = 0.
As a result, process B will still see id = 0 and assign its ID as well. But process A is not aware of this assignment yet. The only way for process A to learn about it is to wait for some time and re-check the value of id.
This waiting time must be larger than t. Why?
Let's consider two edge cases here.
Case 1: t_A = t_B; in other words, processes A and B reach label L at the same time. They both see id = 0 and hence both assign their IDs to it.
Let's assume that process A's assignment finishes in almost zero time while process B's assignment takes the worst-case time t. Then process A has to delay for more than t in order to see process B's update to the variable id. If delay is less than or equal to t, the update will not be visible and both processes will enter the critical section. This alone is sufficient to claim that delay has to be larger than t.
Case 2: t_B = t_A + t; in other words, process A reaches label L and starts an assignment that takes the worst-case time t; after that time, process B reaches label L, still sees id = 0 (because process A's assignment has not yet become visible) and assigns its own ID, again taking the worst-case time t. Here too, if process A's delay is less than or equal to t, it will not see process B's update.
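For reference, here is a minimal sketch of the algorithm in C++, assuming two threads, a std::atomic<int> standing in for the shared variable id, and a sleep standing in for pause(delay). The function names and the 10 ms figure are mine, not part of the original formulation:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// delay must exceed t, the worst-case time for a store to id to become
// visible to the other process; 10 ms is just an arbitrary stand-in here.
constexpr auto delay = std::chrono::milliseconds(10);

std::atomic<int> id{0};   // shared variable, 0 initially

void process(int i) {     // i identifies the process (1 or 2)
    for (int round = 0; round < 3; ++round) {
        // Noncritical section would go here.
        for (;;) {
            while (id.load() != 0) { }          // L: if not(id=0) then goto L
            id.store(i);                        // id := i (takes at most t)
            std::this_thread::sleep_for(delay); // pause(delay)
            if (id.load() == i) break;          // if not(id=i) then goto L
        }
        std::printf("process %d inside critical section\n", i);
        id.store(0);                            // id := 0
    }
}

int main() {
    std::thread a(process, 1), b(process, 2);
    a.join();
    b.join();
}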

Is it possible to remove the following !$OMP CRITICAL regions

I have a Fortran code that shows some very unsatisfactory performance due to some !$OMP CRITICAL regions. This question is actually about how the critical regions can be avoided, and whether they can be removed at all. In those critical regions I am updating counters and reading/writing values to an array.
i = 0
j = MAX/2
total = 0
!$OMP PARALLEL PRIVATE(x,N)
MAIN_LOOP: do
  !$OMP CRITICAL
  total = total + 1
  i = i + 1
  if (i > MAX) i = 1 ! if the counter is past the end, start from the beginning
  x = array(i)
  !$OMP END CRITICAL
  if (total > MAX_TOTAL) exit
  ! do some calculations here and get the value of the integer (N)
  ! store (N) copies of x back in the original array with some offset
  !$OMP CRITICAL
  do p = 1, N
    array(j) = x
    j = j + 1
    if (j > MAX) j = 1
  end do
  !$OMP END CRITICAL
end do MAIN_LOOP
!$OMP END PARALLEL
One simple thing that came to my mind is to eliminate the counter on total by using explicit dynamic loop scheduling.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
MAIN_LOOP: do total = 1, MAX_TOTAL
  ! do the calculation here
end do MAIN_LOOP
!$OMP END PARALLEL DO
I was also thinking of allocating a different portion of the array to each thread and using the thread ID to do the offsetting. This way each thread would have its own counter, stored in an array count_i(ID), with something of this sort:
! this time the size of the array is NUM_OMP_THREADS*MAX
ID = omp_get_thread_num()
x = array(ID + sum(count_i)) ! get the offset by summing up all values
count_i(ID) = count_i(ID) + 1
if (count_i(ID) > MAX) count_i(ID) = 1
This however will mess up the order and will not do the same as the original method. Moreover some empty space will be present, since the different threads will not be able to fill the entire range 1:MAX.
I would appreciate your help and ideas.
Your use of critical sections is a bit strange here. The motivation for using critical sections must be to avoid having an entry in the array being clobbered before it can be read. Your code does accomplish this, but only accidentally, by acting as barriers. Try replacing the critical stuff with OMP barriers, and you should still get the right result and the same horrible speed.
Since you always write to the array half its length away from where you read it, you can avoid critical sections by dividing the operation into one step which reads from the first half and writes to the second half, and vice versa. (Edit: after the question was edited, this is no longer true, so the approach below won't work.)
nhalf = size(array)/2
!$omp parallel do
do i = 1, nhalf
  array(i+nhalf) = f(array(i))
end do
!$omp parallel do
do i = 1, nhalf
  array(i) = f(array(i+nhalf))
end do
Here f(x) represents whatever calculation you want to do to the array values. It doesn't have to be a function if you don't want it to. If it isn't clear, this code first loops through the entries in the first half of the array in parallel. The first task may go through i=1,1+nproc,1+2*nproc, etc. while the second task goes through i=2,2+nproc,2+2*nproc, and so on. This can be done in parallel without any locking because there is no overlap between the part of the array that is read from and written to in this loop. The second loop only starts once every task has finished the first loop, so there is no clobbering between the loops.
Unlike in your code, here there is one i per thread, so there is no need for locking when updating it (the loop variable is automatically private).
This assumes that you only want to make one pass through the array. Otherwise you can just loop over these two loops:
do iouter = 1, (max_total+size(array)-1)/size(array)
  nleft = max_total-(iouter-1)*size(array)
  nhalf = size(array)/2
  !$omp parallel do
  do i = 1, min(nhalf,nleft)
    array(i+nhalf) = f(array(i))
  end do
  !$omp parallel do
  do i = 1, min(nhalf,nleft-nhalf)
    array(i) = f(array(i+nhalf))
  end do
end do
Edit: Your new example is confusing. I'm not sure what it's supposed to do. Depending on the value of N, the array values may end up being clobbered before they can be used. Is this intentional? It's hard to answer your question when it's not clear what you're trying to do. :/
I thought about this for a while and my feeling is that there is no good answer to this specific issue.
Indeed, your code seems, at first glance, like a good approach to the problem as stated (although I personally find the problem itself a bit strange). However, there are problems in your implementation:
What happens if for some reason one of the threads gets delayed in processing its iteration? Just imagine that the thread owning the very first index takes a while to process it (delayed by some third-party process getting in the way and taking the CPU time on the core where the thread was pinned/scheduled, for example) and is the last to finish... Then it will write its values back to the array in a completely different order than the sequential algorithm would have. Is that something you can accept in your algorithm?
Even without this sort of "extreme" delay, can you accept that the order in which the i indexes are distributed among threads is different from the order in which the j indexes are subsequently updated? If the thread owning i+1 finishes right before the one owning i, it will use index j instead of index j+n as it should have...
Again, I'm not sure I understand all the subtleties of your algorithm and how resilient it is to mis-ordering of iterations, but if ordering matters, then the approach is wrong. In this case, I guess that a proper parallelisation could be something like this (put in a subroutine to make it compilable):
subroutine loop(array, maxi, max_iteration)
  implicit none
  integer, intent(in) :: maxi, max_iteration
  real, intent(inout) :: array(maxi)
  real :: x
  integer :: iteration, i, j, n, p
  i = 0
  j = maxi/2
  !$omp parallel do ordered private(x, n, p) schedule(static,1)
  do iteration = 1, max_iteration
    !$omp ordered
    x = array(wrap_around(i, maxi))
    !$omp end ordered
    ! do some calculations here and get the value of the integer (n)
    !$omp ordered
    do p = 1, n
      array(wrap_around(j, maxi)) = x
    end do
    !$omp end ordered
  end do
  !$omp end parallel do
contains
  integer function wrap_around(i, maxi)
    implicit none
    integer, intent(in) :: maxi
    integer, intent(inout) :: i
    i = i+1
    if (i > maxi) i = 1
    wrap_around = i
  end function wrap_around
end subroutine loop
I hope this would work. However, unless the central part of the loop where n is retrieved does some serious computation, this won't be any faster than the sequential version.

Efficient way to skip code every X iterations?

I'm using GameMaker Studio and you can think of it as a giant loop.
I use a counter variable step to keep track of what frame it is.
I'd like to run some code only every Xth step for efficiency.
if step mod 60 == 0 {
}
Would run that block every 60 steps (or once per second at 60 fps).
My understanding is that modulus is a heavy operation, though, and with thousands of steps I imagine the computation can get out of hand. Is there a more efficient way to do this?
Perhaps involving a bitwise operator?
I know this can work for every other frame:
// Declare
counter = 0
// Step
counter = (counter + 1) & 1
if counter {
}
Or is the performance impact of modulus negligible at 60FPS even with large numbers?
In essence:
i := 0
WHILE i < n/4
  do rest of stuff × 4
  do stuff that you want to do one time in four
  Increment i
Do rest of stuff n mod 4 times
The variant of this that takes the modulus and switches on the remainder is called Duff's Device. Which is faster will depend on your architecture: on some chips integer division is cheap, while on other CPUs it might not even be a native instruction.
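As a concrete illustration, here is the scheme above in C++; do_common() and do_rare() are hypothetical stand-ins for "rest of stuff" and the one-time-in-four work:

#include <cstdio>

// Hypothetical stand-ins for the real work:
void do_common() { std::puts("common"); }  // "rest of stuff", every iteration
void do_rare()   { std::puts("rare");   }  // the one-time-in-four work

// n iterations of common work, with rare work once per four iterations.
void run(int n) {
    int i = 0;
    for (; i < n / 4; ++i) {
        do_common(); do_common(); do_common(); do_common();
        do_rare();
    }
    for (int k = 0; k < n % 4; ++k)   // n mod 4 leftover iterations
        do_common();
}

int main() { run(10); }  // expect 10 "common" lines and 2 "rare" lines

As a side note on the bitwise idea from the question: when the period is a power of two (say 64 rather than 60), step & 63 gives the same result as step % 64 for non-negative step, and compilers apply that rewrite automatically for unsigned operands.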
If you don’t have a loop counter per se, because it’s an event loop for example, you can always make one and reset it every four times in the if block where you execute your code:
i := 1
WHILE you loop
  do other stuff
  if i == 4
    do stuff
    i := 1
  else
    i := i + 1
Here’s an example of doing some stuff one time in two and stuff one time in three:
WHILE looping
  do stuff
  do stuff a second time
  do stuff B
  do stuff a third time
  do stuff C
  do stuff a fourth time
  do stuff B
  do stuff a fifth time
  do stuff a sixth time
  do stuff B
  do stuff C
Note that the stuff you do can include calling an event loop once.
Since this can get unwieldy, you can use template metaprogramming to write these loops for you in C++, something like:
constexpr unsigned a = 5, b = 7, LCM_A_B = 35;

void do_stuff_always();
void do_stuff_a();
void do_stuff_b();

template<unsigned N>
inline void do_stuff(void)
{
    do_stuff_always();
    if (N % a == 0)
        do_stuff_a(); // Since N is a compile-time constant, the compiler does not have to check this at runtime!
    if (N % b == 0)
        do_stuff_b();
    do_stuff<N-1>();
}

template<>
inline void do_stuff<1U>(void)
{
    do_stuff_always();
}

// inside your event loop:
while (sentinel)
    do_stuff<LCM_A_B>();
In general, though, if you want to know whether your optimizations are helping, profile.
The most important part of the answer: that test probably takes so little time, in context, that it isn't worth the ions moving around your brain to think about it.
If it only costs 1% it's almost certain there are bigger speedups you should be thinking about.
However, if the loop is fast, you could put in something like this:
if (--count < 0) {
    count = 59;
    // do your thing
}
In some hardware, that test comes down to a single instruction decrement-and-branch-if-negative.

How to optimize for time?

I'm trying to understand if there is a difference in speed when executing the following lines of code in a computer program:
(1) myarray[1] = 5; return myarray[1];
(2) myarray[0] = 5; return myarray[0];
(3) x = 5; return x;
(4) x = 5; y = x; return y;
(5) return 5;
From what I understand, arrays are basically pointers (variables that store the memory addresses of other variables). Therefore (1) and (2) should be the same speed, but slower than (3), (4) and (5).
(5) should be the fastest, (3) should be slower than (5) because there is an assignment, and (4) should be slower than (3) because there are two assignments that need to be handled.
Would this be right?
You don't give any context for what myarray, x and y are. Without that context, the question cannot be answered in any meaningful way: if the extra assignments have no observable side effects, they can simply be optimised away.
Basically, looking at speed optimisation at this elementary level is completely pointless. If you want to look at speed, you need code that is substantial enough for its execution time to be measured. You cannot measure the time of one or two simple statements on a modern processor.
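To make the point concrete, here are the five variants from the question as self-contained C++ functions (the function names are mine). With optimisation enabled, a mainstream compiler will typically compile every one of them down to the equivalent of return 5, because the intermediate stores have no observable effect, so the speed differences being reasoned about simply vanish:

#include <cstdio>

int v1() { int myarray[2]; myarray[1] = 5; return myarray[1]; }
int v2() { int myarray[2]; myarray[0] = 5; return myarray[0]; }
int v3() { int x = 5; return x; }
int v4() { int x = 5; int y = x; return y; }
int v5() { return 5; }

int main() {
    // All five return the same value; under optimisation they typically
    // produce identical machine code, leaving nothing to compare.
    std::printf("%d %d %d %d %d\n", v1(), v2(), v3(), v4(), v5());
}

Inspecting the generated assembly (e.g. with gcc -O2 -S) is a quick way to confirm this for your own compiler.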

PRAM if-then-else CREW/EREW

In my book of parallel algorithms there is the following pseudo-code for the PRAM model:
procedure PrefixSumPRAM( A, n ):
BEGIN
  b := new Array(2*n-1);
  b[1] := SumPRAM(A, n); // this will load A with the computation tree and return the sum
  for i := 1 to ( log2(n) - 1 ) do
  BEGIN
    for all procID where (2^i) <= procID <= ((2^(i+1))-1) do in parallel
    BEGIN
      if odd(procID) then
        b[ procID ] := b[ procID/2 ];
      else
        b[ procID ] := b[ procID/2 ] - a[ procID+1 ];
    END
  END
END
But... the PRAM model specifies that all processors must execute the same instruction on different data.
So is this program executable only on a CREW PRAM model?
Or is it executable on an EREW model, where the processors with odd IDs execute
b[procID] := b[procID/2];
while the processors with even IDs execute (for example) a NOP instruction?
In the PRAM model, there are an unbounded number of processors and a single memory. Although the processors operate in lock-step by executing one instruction per time step, each processor maintains its own state and can therefore execute the program in an arbitrary way according to the control flow.
In a CREW PRAM, two processors can read from the same memory location in the same time step, but only one processor may write to a given memory location in one step. In an EREW PRAM, concurrent reads are not allowed either.
