I have a project of 188,200 lines (code size: 5.2 MB) that builds in 4.0 seconds.
Another project that is 4.8x larger (900,900 lines, code size 22 MB) compiles in 70.8 seconds (similar speed with the 32-bit and 64-bit compilers).
Since the Delphi compiler parses the code in a single pass, why is the compilation time not linear (or close to linear) in the amount of code?
70 seconds is quite far from the speed promised by Embarcadero (5 seconds).
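To put numbers on it, here is what linear scaling would predict from the figures above:

linear estimate: 4.8 x 4.0 s ≈ 19 s
observed:        70.8 s (≈ 17.7x the smaller project's build time)

So the larger project takes roughly 3.7x longer than a linear extrapolation would suggest.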
Using Delphi Sydney (that is, v10.4.1, for those like me who have lost track of which Delphi is which).
CPU: AMD FX, 4 GHz, 8 cores
RAM: 32 GB (fast RAM)
SSD: Samsung 840 EVO
No antivirus (except Windows Defender)
Everything compiles on one core. The other 7 cores are free.
Related
I first posted about this issue in this question:
Indexed branch overhead on X86 64 bit mode
I've since noticed this in a few other assembly programs, where align 16 has little effect or, in some cases, makes the situation worse. In my prior question I was also comparing alignment to even versus odd multiples of 16, with a significant difference in the case of small, tight loops.
The most recent example where I encountered this issue is a program that calculates pi to about 1 million digits, using a 4-term arctan series (a Machin-type formula) combined with multi-threading; it is a mini version of the approach used at Tokyo University in 2002 to calculate over 1 trillion digits:
http://www.super-computing.org/pi_current.html.en
The align directives had almost no effect on the compute time, but removing them decreased the conversion from fractional to decimal from 7.5 seconds to 6.8 seconds, a bit over a 9% decrease. Rearranging the compute code in some cases increased the time from 98 seconds to 109 seconds, about an 11% increase. The worst case, however, was my prior question, where the time for a tight loop increased by 36.5% depending on where the loop was located.
I'm wondering if this is specific to the Intel 3770K 3.5 GHz processor I'm running these tests on.
I have a Dell laptop with a 5th-gen i7 processor (1000 MHz), 4 logical cores and 2 physical cores, and 16 GB RAM. I have taken a course on high-performance computing, for which I have to draw graphs of speed-up.
Compared to a desktop machine (800 MHz, 5th-gen i5, 8 GB RAM) with 4 physical and 4 logical cores, for the same program my laptop takes ~3 seconds while the desktop machine takes around 12 seconds. Ideally, since the laptop is only 1.25 times faster, the time on my laptop should have been around 9 to 10 seconds.
This might not have been a problem if I had got a similar speed-up. But on my laptop, using 4 threads, the speed-up is nearly 1.3, while for the same number of threads the speed-up on my desktop is nearly 3.5. If my laptop were faster, it should also have shown that for the parallel program, but the parallel version was only ~1.3 times faster. What could be the reason?
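To make the comparison concrete (assuming the ~3 s and ~12 s figures above are the single-thread times):

laptop:  speed-up ≈ 1.3  →  parallel time ≈ 3 s / 1.3  ≈ 2.3 s
desktop: speed-up ≈ 3.5  →  parallel time ≈ 12 s / 3.5 ≈ 3.4 s

So the laptop is still faster in absolute terms, but it gains far less from the extra threads.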
I am using the PAPI library to tune and profile my application.
I want to know what PAPI_REF_CYC (reference clock cycles) actually means.
Thanks in advance,
Some modern CPUs, including Intel's and AMD's, are throttled.
This means that their clocks are not fixed but vary depending on the power management in effect: even if the CPU's advertised frequency is X GHz, more often than not it is not running at that frequency.
For a couple of real examples of this technology, see Intel Turbo Boost / AMD Turbo Core and Intel Enhanced SpeedStep / AMD Cool'n'Quiet.
Since the core clock can slow down or speed up, comparing two measurements taken in core clocks makes no sense.
Having snippet A run in 100 core clocks and snippet B in 200 core clocks means that B is slower in general (it takes double the work), but not necessarily that B took more time than A, since the units are different.
That's where the reference clock comes into play - it is uniform.
If snippet A runs in 100 ref clocks and snippet B runs in 200 ref clocks then B really took more time than A.
Converting reference clock ticks into time (e.g. seconds) is not that easy; each processor uses a different reference frequency, even among processors with the same brand name.
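As an illustration, a minimal sketch of reading both the core-cycle and reference-cycle counters with PAPI could look like this (error handling mostly omitted; the measured workload is a placeholder):

#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_REF_CYC };   /* core cycles, reference cycles */
    long long values[2];
    int eventset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 2);

    PAPI_start(eventset);
    /* ... the code you want to measure goes here ... */
    PAPI_stop(eventset, values);

    printf("core cycles: %lld, reference cycles: %lld\n", values[0], values[1]);
    return 0;
}

The ratio of the two gives a rough idea of how far the core clock deviated from the reference clock during the measured region.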
I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches also use those streams. The program does double-precision computations (and I suspect this is the culprit; however, cuBLAS reaches 75-85% GPU usage for multiplication of matrices of doubles). There are also reduction operations, but they are implemented via if (threadIdx.x < s) with s halving in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive: it starts on another piece of work as soon as the previous one has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% GPU load (and about 50% CPU load), if GPU-Z 1.9.0 is to be trusted. Memory controller load is 9-10%, bus interface load is 6%. This is with the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One of my guesses is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance", and the power supply is 700 W while the power consumption at max CPU and GPU TDP is about 550 W.
Another guess is that the double-precision speed is 1/12 of the single-precision speed, because there is 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z takes as 100% the situation where all single-precision and double-precision cores are used. However, the numbers do not quite match.
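For reference, the reduction pattern mentioned above (with s halving each iteration) looks roughly like this; it is a generic sketch, not my actual kernel:

__global__ void sumReduce(const double *in, double *out, int n)
{
    __shared__ double sdata[256];                 // one partial value per thread (assumes blockDim.x == 256)
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < (unsigned int)n) ? in[i] : 0.0;
    __syncthreads();

    // s halves each iteration; threads with tid >= s just wait at the barrier
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];               // one partial sum per block
}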
Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So, to be able to launch all 1536 threads on the 560 Ti, the following can be specified for a block size of 256:
__global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, GPU usage rose to 60% for me.
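As a side note, you can check how many registers a kernel actually uses by asking the compiler to print per-kernel resource usage (the file name here is just a placeholder):

nvcc -Xptxas -v my_kernels.cu

ptxas then reports the registers, shared memory and local memory used by each kernel, which makes it easy to verify that __launch_bounds__ had the intended effect.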
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute while a CPU thread is waiting on a GPU operation, rather than spin-waiting for the result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel launches themselves asynchronous and avoids excessive local memory allocations and deallocations.
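For context, here is a minimal sketch of how these flags combine with the one-stream-per-CPU-thread pattern from the question (illustrative only; the kernel and buffer names are placeholders, and real code should check every return code):

#include <cuda_runtime.h>

// Trivial placeholder kernel, standing in for the real double-precision work.
__global__ void scaleKernel(double *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

// Work done by each CPU thread: one stream per thread, async copies + kernel.
void workerThread(size_t n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    double *hostBuf = nullptr;                        // pinned host memory so the
    cudaMallocHost(&hostBuf, n * sizeof(double));     // async copies can actually overlap
    double *devBuf = nullptr;
    cudaMalloc(&devBuf, n * sizeof(double));

    for (size_t i = 0; i < n; ++i) hostBuf[i] = (double)i;

    cudaMemcpyAsync(devBuf, hostBuf, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(devBuf, n);
    cudaMemcpyAsync(hostBuf, devBuf, n * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);

    // With cudaDeviceScheduleYield set, this wait yields the CPU to other
    // threads instead of spin-waiting.
    cudaStreamSynchronize(stream);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
}

int main()
{
    // Must run before the CUDA context is created (i.e. before other runtime calls).
    cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost |
                       cudaDeviceLmemResizeToMax);

    workerThread(1 << 20);   // in the real program, several CPU threads each do this
    return 0;
}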
I have a large ISim design for Spartan-6 using about 6 of the Spartan-6 FPGA IP cores. It needs to run for a simulation time of 13 seconds, but at present it takes 40 seconds of wall-clock time to simulate 1 ms. During the 13 seconds it will also write 480,000 24-bit std_logic_vectors to a text file.
This equates to a running time of 144 hours (almost a week!) to run the entire simulation.
Is there a way, for example, of increasing the step size or turning off the settings for waveform plotting, or any other settings I can use to increase the simulation speed?
So far I have tried not plotting the waveform, but it doesn't seem to actually increase the speed.
Thanks very much
Yes, adding signals to the waveform slows every simulator down, but running such long simulations always creates GiB of data and takes hours or days.
You could check your code and:
improve sensitivity lists to reduce calculation cycles, and
enable the fast simulation mode that some IP cores offer via a generic parameter.
But in general there is only one solution: use another simulator, especially one with optimization (which can be disabled or restricted in free editions). E.g.:
GHDL - is open source and quite fast (see the example command sequence after this list)
QuestaSim / ModelSim
ModelSim, for example, is included for free in Altera Quartus Prime (WebPack) as the Starter Edition.
Active-HDL
The Active-HDL Student Edition is free to use. Alternatively, it's included in Lattice Diamond.
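For example, a minimal GHDL run for a design like this could look as follows (file and entity names are placeholders):

ghdl -a my_design.vhd my_design_tb.vhd
ghdl -e my_design_tb
ghdl -r my_design_tb --stop-time=13sec

Only add --wave=output.ghw (or --vcd=...) when you actually need a waveform, since dumping every signal slows the run down considerably.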
P.S. 40 sec for 1 ms (25 us of simulated time per second) is very fast. My integration simulations usually calculate 20 ns per second, so you are over 1000x faster. :)