The following loop in fortran almost takes no time
j=0
do i=1,1000000000000000000
j=j+1
end do
print*,j
But I just don't understand, our cpu is about GHz, which means 10^9 cycle in a second, while the above loop cycle is way too much than 10^9, why it almost takes no time?
It seems that the values is not computed at compiled time. We can add outer loop, until
do m=1,1000000000
do i=1,1000000000000000000
j=j+1
end do
end do
print*,j
Now it takes a second on my computer
Edit
I am using windows, intel parallel studio 15, with no extra compilation option: simply ifort test.f90. Timing method is simple, just wait after I press Enter in command line to execute the .exe
don't know fortran, but if this would be C, the compiler could optimize the above code removing the loop altogether as the value of j can be computed at compile time.
So the above code would be reduced to
print 1000000000000000000
Your logic about cycles and instructions is flawed. Modern CPUs parallelize code on hardware level, even if the code is serial:
a cpu has more a few ALU who can compute arithmetic instructions in parallel
instructions are executed in a pipeline, so at any one point, different stages of consecutive instructions are executed in parallel.
So "max of one instruction per cycle" doesn't hold.
Also increment by one is one of the fastest instruction in the CPU.
Related
I ran into a problem for understanding the logic behind "the last warp loop unrolling" technique in Nvidia's parallel reduction tutorial available here.
In case of thread31 (for which tid=31), before unrolling the loop:
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if(tid < 32) becomes true for thread31 and the warpReduce function will be executed for it and therefore all these operations which wouldn't be executed in the unrolled loop version will be executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases, however the only result that matters at the end is sdata[0] and this behavioral difference does not affect the results calculation for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that will reach "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic is irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself, "We don’t need if (tid < s) because it doesn’t save any
work"). So the only consideration here is the once-per-step "extra" read of shared memory, one additional transaction, basically, per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculdation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter, at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those that are less then the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R2, R1
ST sdata[tid], R3
In english, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
now, lets revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point have no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.
I'm trying to write a code that will port openmp thread to a single gpu. I found very less case studies /codes on this.Since I`m not from computer science background.
I have less skills in programming.
This is how the basic idea look's like
And this is the code so far developed.
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel
There's a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a do while. Do while have indeterminant number to iterations therefor create a dependency in that exiting the loop depends on the previous iteration of the loop. You need to use a DO loop where the number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition if run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus the results of T_o is used as input to the next iteration creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o) " before the iteration loop and then after the loop "!$acc end data", so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop causing unnecessary data movement. Also add a kernels region around the max error reduction loop so this is offloaded as well.
In general, I prefer using MPI+OpenCC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent and you then have a one-to-one mapping of MPI rank to a device. Not that OpenMP can't work, but you then often need to manually decompose the domain. Also trying to manage multiple device memories and keep them in sync can be tricky. Plus with MPI, your code can also go across nodes rather than be limited to a single node.
In trying to optimise some code I find that using OpenMP linearly increases the time it takes to run. The representative section of code that I am trying to speed up is as follow:
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)
CALL SYSTEM_CLOCK(c1)
DO k=1,ntotal
CALL OMP_INIT_LOCK(locks(k))
END DO
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
DO k=1,niac
i = pair_i(k)
j = pair_j(k)
dvx(:,k) = vx(:,i)-vx(:,j)
CALL omp_set_lock(locks(i))
CALL DGER(dim,dim,-1.d0, (disp_nmh(:,j)-disp_nmh(:,i)),1, &
(dwdx_nor(dim+1:2*dim,k)*V_0(j)),1, particle_data(i)%def_grad,dim)
CALL DGER(dim,dim,-1.d0, (-dvx(:,k)),1, &
(dwdx_nor(dim+1:2*dim,k)*V_0(j)) ,1, particle_data(i)%vel_grad(1:dim,1:dim),dim)
CALL omp_unset_lock(locks(i))
CALL omp_set_lock(locks(j))
CALL DGER(dim,dim,-1.d0, (dvx(:,k)),1, &
(dwdx_nor(3*dim+1:4*dim,k)*V_0(i)) ,1, particle_data(j)%vel_grad(1:dim,1:dim),dim)
CALL DGER(dim,dim,-1.d0, (disp_nmh(:,i)-disp_nmh(:,j)),1, &
(dwdx_nor(3*dim+1:4*dim,k)*V_0(i)),1, particle_data(j)%def_grad,dim)
CALL omp_unset_lock(locks(j))
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK(c2)
t_el = t_el + (c2-c1)/rate
WRITE(*,*) "Wall time elapsed: ", t_el
Note that for the simulation I am testing k=14000 which I thought was a reasonable candidate for running in parallel. So far as I know I have to use the locks to ensure that threads which are given the same value of "i" (but a different value of "j") cannot access the same index of the arrays which are being written to at the same time. I cannot figure out if the version of BLAS (sudo apt-get install libblas-dev liblapack-dev) which I use is thread safe. I ran a simulation with 8 cores and got the same result as without OpenMP so I am guessing that it could be. BLAS is used, in this case, to calculate and sum the outer product of many 3x3 matrices.
Is the implementation of OpenMP above the best way to speed up this code? I know very little about OpenMP but my guesses are that:
the memory being all over the place ("i" is sequential but "j" is not)
the overhead in starting and closing down all the threads
the constant locking and unlocking
and maybe the small loop size (although I thought 14000 would be sufficient)
are significantly outweighing the performance benefits. Is this correct? Or can the code above be modified to get some performance gain?
EDIT
I should probably add that the code above is part of a time integration loop. Hopefully this explains why the elapsed time is summed.
Suppose I have a program that has an instruction to add two numbers and that operation takes 10 nanoseconds(constant, as enforced by the gate manufactures).
Now I have 3 different processors A, B and C(where A< B < C in terms of clock cycles). A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-14=7 ns.
If the above assumptions are correct then isn't this a contradiction to the regular assumption that processors with high clock cycles are faster. Here processor C which is the fastest actually takes 2 cycles and wastes 7ns whereas, the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles than A and B. It's not C's fault that the circuit is not fast enough. If you would implement the addition circuit in a way that it gives result in 1 ns, all processors will give the answer in 1 clock cycle (i.e. C will give you the answer in 7ns, B in 10ns and A in 15ns).
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
which of the three processors is fastest?
Answer: C is fastest. It's one cycle is finished in 7ns. It implies that it finishes 109/7 (~= 1.4 * 108) cycles in one second, compared to B which finishes 109/10 (= 108) cycles in one second, compared to A which finishes only 109/15 (~= 0.6 * 108) cycles in one second.
What does a ADD instruction mean, does it purely mean only and only ADD(with operands available at the registers) or does it mean getting
the operands, decoding the instruction and then actually adding the
numbers.
Getting the operands is done by MOV op. If you are trying to compare how fast ADD op is being done, it should be compared by time to perform ADD op only. If you, on the other hand want to find out how fast addition of two numbers is being done, then it will involve more operations than simple ADD. However, if it's helpful, the list of all Original 8086/8088 instructions is available on Wikipedia too.
Based on the above context to what add actually means, how many cycles does add take, one or more than one.
It will depend on the processor because each processor may have the adder differently implemented. There are many ways to generate addition of two numbers. Quoting Wikipedia again - A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates.
Also, there may be pipelining in the instructions which can result in parallelizing of the addition of the numbers resulting in huge time savings.
Why is clock cycle a standard since it can vary with processor to processor. Shouldn't nanosec be the standard. Atleast its fixed.
Clock cycle along with the processor speed can be the standard if you want to tell the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from it.
When you say the clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate that the time to ADD is x / y. Also, you can mention the time to perform ADD as z ns and you know the processor speed is same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z.
I'm no expert BUT I'd say ...
the regular assumption that processors with high clock cycles are faster FOR THE VAST MAJORITY OF OPERATIONS
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use Instruction pipelining, thus executing multiple operations in one clock cycle.
Also, I don't understand what you mean by 'wasting 5ns', the frequency determines the clock speed, thus the time it takes to execute 1 clock. Of course, cpu's can have to wait on I/O for example, but that holds for all cpu's.
Another important aspect of modern cpu's are the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, a L1 cache access will take around 2 while a normal memory access will take between 50 and 100 (and a harddisk access would take thousands..).
This is actually almost correct, except that on processor B taking 2 cycles means 14ns, so with 10ns being enough the next cycle starts 4ns after the result was already "stable" (though it is likely that you need some extra time if you chop it up, to latch the partial result). It's not that much of a contradiction, setting your frequency "too high" can require trade-offs like that. An other thing you might do it use more a different circuit or domino logic to get the actual latency of addition down to one cycle again. More likely, you wouldn't set addition at 2 cycles to begin with. It doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle, for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half cycle, in some sense giving the whole addition a latency of 1.5). It's not completely out of this world, though Netburst was over all, fairly mad - it had to do a lot of weird things to get the frequency up that high. But those half-cycles aren't very half (it wasn't, AFAIK, logic that advanced on every flank, it just used a clock multiplier), you could also see them as the real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7ns cycles (35 ns; the first three instruction overlap in execution) compared to 40ns for Processor B (assuming the add timing included pipelining overhead).) (Note: Your assumption 3 is incorrect, two cycles for Processor C would be 14ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.
My question is calculate how many CPU cycle takes to execute MOV A, 5 instruction. Describe each.
can anyone please explain me how this works. And the 5 is a value is it? Just explain me the main points.
As far as i know,
first,
-get the instruction from memory (one clock cycle)
-update instruction pointer(one clock cycle)
-decode the instruction to see what it does(one clock cycle)
i'm stuck after this.
I'm assuming you are talking about the x86 processors. This of course depends on the processor, but it usually takes 1 clock cycle from what i remember.
This is because instructions are executed in a pipeline. This means that while the processor is computing the result of an instruction, it is decoding the next and fetching the one before that, so that each part of the processor is busy doing something.
Usually the instructions that need data from memory or do complex calculations like multiplications or divisions take a longer time to execute.
You can also get the number of cycles with RDTSC https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf