Why does CUDA code run so much faster in NVIDIA Visual Profiler?

Why does CUDA code run so much faster in NVIDIA Visual Profiler? - performance

A piece of code that takes well over 1 minute on the command line was done in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is why? Is there something wrong with command line, or does Visual Profiler do something different and not really execute everything as on the command line?
I'm using CUBLAS, Thrust and cuRAND.
Incidentally, there's been a noticeable slowdown in compiled code on my machine very recently, even old code that previously ran quickly, hence I'm getting suspicious.
Update:
I have checked that the calculated output on command line and Visual Profiler is identical - i.e. all required code has been run in both cases.
GPU-shark indicated that my performance state was unchanged at P0 when I switched from command line to Visual Profiler.
However, GPU usage was reported at 0.0% when run with Visual Profiler, but went as high as 98% when run off command line.
Moreover, far less memory is used with Visual Profiler. When run off command line, task manager indicates usage of 650-700MB of memory (spikes at the first cudaFree(0) call). In Visual Profiler that figure goes down to ~100MB.

This is an old question, but I've just finished chasing the same issue (though the cause may not be the same).
Namely: my app achieved between 900 and 1100 frames (synchronous launches) per second when running under NVVP, but around 100-120 when running outside of the profiler.
The cause appears to be a status message I was printing to the console via cout. I had intended for this to only happen about once every 100-200 frames. Instead, it was printing the status message for every frame, and the console IO became the bottleneck.
By only printing the status message every 100 frames (though the optimal number here would depend on your application), the frame rate jumped back up to match what I was seeing in NVVP. Of course, this could also be handled in a separate CPU thread if that sort of overhead is unacceptable in your circumstances.
NVVP has to redirect stdout to its own internal buffer in order to capture the application's output (which it shows in its console tab). It appears that NVVP's mechanism for buffering or processing that output has significantly less overhead than allowing the operating system to handle it directly. It looks like NVVP is buffering everything, and displaying it in a separate thread, or just saving a bunch of output until some threshold is reached, when it adds that buffer to its own console tab.
So, my advice would be to disable any console IO, and see if or how that affects things.
(It didn't help that VS2012 refused to profile my CUDA app. It would have been nice to see that 80% of the execution time was spent on console IO.)
Hope this helps!

This should not happen. I've never seen anything like it; probably something in your setup.

It could be that some JIT-compile step is skipped by the profiler. This could explain the difference in memory usage. Try creating a fat binary?

Related

Rcpp in Rstudio, can't cache in memory when parallel if I don't open the cpp file in Rstudio

I met a wired problem but I wonder if I'm asking the correct question:
result = parLapply(cl, 1:4,
function(j,rho_list_needed,delta0_needed,
V_iter_s,Sigma_list_needed) {
rhoj = rho_list_needed[[j]]
delta0_in_cpp = delta0_needed
v = as.vector(V_iter_s[,,,j])
sigmaj = Sigma_list_needed[[j]]
sourceCpp('sample_Z.cpp')#first time complie slow,then cashed
return(Sample_Z(rhoj,delta0_in_cpp, v,sigmaj,A,Cmatrix))
},rho_list_needed,delta0_needed,
V_iter[[s]],Sigma_list_needed)
When I was testing my sample_Z.cpp with parallel through parLapply, the single calculation takes around 1 sec. By parallel, my 4 iterations takes around 1.2 secs, which is a big improvement compared to unparalleled version, which is 8 sec.
There's no problem at all when I run my program yesterday. Just now I noticed a bug and revised my program. To give my PC a fresh environment, I restarted my computer. When started to run my program, I only opened the .R file, and run. But it took 9 sec for that parallel, which used to be 1.2 sec. The 9 sec was after warming up my cores, i.e., already sourced the cpp before I time it.
I just don't know where is the bug. Then try to source the cpp file directly in my global merriment, and I found out that there was no caching at all. The second time took the same time as the first one.
But I accidentally opened the sample_Z.cpp in Rstudio, explicitly at the editor. And then, everything works correct now.
I don't know how to search this similar problem on google with what kind of key words and I don't know if opening the cpp file is a must, while I never known before.
Can anyone tell me what's the real issue? Thanks!

After restarting your PC, you probably had extra processes running which would have competed for CPU cores that slowed down your algorithm. The fact you're rebooting suggests to me you're not using Linux... but if you are, watch with top while starting your code, or equivalent for your platform.

How to debug potential CPU/RAM errors in Bash script on Linux

I have a relatively simple bash script that reads from a set of static input files, stores the input in bash variables and then does a bunch of processing over said input by calling out to external scripts (e.g. written in Python, Go, other bash scripts etc.) and using the intermediate results.
Lately I have been experiencing an intermittent problem where a single character seems to be getting altered somewhere during the processing which then causes subsequent errors. Specifically, a lot of the processing I'm doing involves slicing up a list of comma-separated records, and one of the values on each line is a unix timestamp, e.g. 1354245000.
What seems to be happening is that occasionally one of these values will get altered slightly, so I end up with a timestamp like 13542458=2 or 13542458>2 or 13542458;2 coming out of one of the intermediate scripts. This then subsequently gets fed into another script, which throws an exception when it tries to parse the value to an integer.
In the title of this question, I've suggested that this might be a potential CPU/RAM error. I know the general folly in thinking errors are caused by low level things like hardware/compilers etcetera, but the nature of this particular error makes me think it may be possible, for the following reasons:
The input files are the same on each invocation of the script, and the script only fails on some invocations.
I cannot think of any sources of randomness in the source code prior to where the script is breaking. It's basically just slicing and dicing csv input.
I cannot think of any sources of concurrency in the source code -- even the Go scripts aren't actually written to run anything concurrently.
This problem has only arisen in the last week or so. Prior to this time, this error would never occur.
While I haven't documented every erroneous character, they seem to often be quite close in the ASCII table to numeric values (=, >, ; etc). That said, I guess the Hamming distance between two characters quite far apart can be small also with changes to a high order bit.
The script often breaks at a different stage on different runs. i.e. I have a number of separate Python scripts, and sometimes it'll make it past one script and then the error will be induced in another. Other times it'll be induced on an earlier script.
What I'd like to know is, is there any methodical way to either confirm or rule out a hardware error for this problem? Or if it is a hardware problem, is it possibly undetectable by the operating system?
A bit of further info on the machine:
Linux 64-bit, Ubuntu 12.04
Intel i7 processor
16GB DDR3 RAM
I'm hoping someone can either point me to a reliable way to verify whether the hardware is to blame or otherwise a sound reason as to what else might be the cause.

Try booting into Memtest to check your memory.

While it is highly unlikely that it will be hardware, if you have exhausted you standard software debug as suggested by #OliCharlesworth, here is an outline of hardware error investigation:
(1) check your log area for any `MCE` logs (machine check exceptions).
If you find any in either your log area (syslog) or sometimes in
the present working dir or /dir -- you have a hardware failure.
(2) check your log area for disk errors. e.g:
smartd[3963]: Device: /dev/sda [SAT], 34 Currently unreadable (pending) sectors
(3) check your drive integrity, e.g.: (as root) # `smartctl -a /dev/sda` if any abnormality, run:
smartctl -t short /dev/sda (change drive as required)
(4) download/install/boot to [memtest86](http://www.memtest86.com/download.htm)
(run the complete test)
If your cpu/motherboard has thrown no mce's, you have no disk error, your drive tests OK with smartctl and you have no memory errors with memtest86, then recheck the software debugging. While additional hardware errors can still be present (bad capacitors, etc..) the likelihood at this point is software. Good luck.

Is there a way to keep the Xcode console log from overflowing and locking up the session?

I have a Mac app (that is a testbed for a phone app) that spews massive amounts of output into the console log. Mostly this is what I want, but sometimes I run large "batch" runs and the console log essentially fills up and Xcode locks up. The only way I've found to prevent this is to monitor the job and every 30 seconds or so press "Clear", hoping that I'm not so close to the end that I clear out the 50 or so final lines giving the results of the run.
Yes, I could go through the code and reduce the number of lines output, but there are several reasons (not purely based on laziness) for not doing that.
Does anyone know of a way to tell Xcode to maintain the console as a "rotating buffer" of sorts, clearing old stuff from time to time so that it doesn't fill up?

You could write your own rotating buffer implementation, and log to that rather than with printf.
Or if you don't want to replace all your printfs:
#define printf rotatingPrintf
Perhaps it would work to write a command-line tool that has a rotating buffer, then pipe the output of your app to that tool. You can launch GUI apps from the command line like this:
$ /Applications/Foo.app/Contents/MacOS/Foo

How to check Matplotlib's speed in Xcode and increase performance?

I'm running into some considerable speed bottlenecks with a Python-Matplotlib-Xcode combination. I know some immediate responses will probably ask "Why are you doing python stuff in Xcode, just man up and use vim" --> I like the organizing ability and the built in version control, it makes elements of my work easier to deal with.
Getting python to run in xcode in the first place was a bit more tricky than I had hoped, but its possible. Now I have the following scenario:
A master file, 'main.py' does all the import stuff for me and sets up some universal formatting to make all the figures (for eventual inclusion in my PhD thesis) nice and uniform. Afterwards it runs a series of execfile commands to generate whichever graphics I need. Two things I can think of right off the bat:
1) at the very beginning of main.py after I import all the normal python stuff you tend to need, I call a system script which checks whether a certain filesystem is mounted. I keep all my climate model data on there since my local hard drive is too small to deal with all of it at once. Python pauses itself and waits for the system to do its thing, but once the filesystem has been found, it keeps going. Usually this only needs to happen once in the morning when I get to work, or if the VPN server kicked me off for whatever reason. (Side question, it'd be cool to know if theres a trick to automate an VPN login to reconnect as soon as it notices its not connected)
2) I'm not sure how much xcode is using on its own. running the same program from terminal is (somewhat) faster. I've tried to be memory conscience and turn off stuff I don't need while running the python/xcode combination.
Also, python launches a little window whenever I call plt.show(), this in itself takes time, I've considered just saving them as quick png files and opening them with some other viewer, although I guess that would also have to somehow take time to open up. Given how often these graphics change as I add model runs or think of nicer ways of displaying the data, it'd be nice to not waste something on the order of 15 to 30 minutes (possibly more) out of the entire day twiddling my thumbs and waiting for a window to pop up.

Benchmark it!
import datetime
start = datetime.datetime.now()
# your plotting code
td = datetime.datetime.now() - start
print td.total_seconds() # requires python version >= 2.7
Run it in xcode and from the command line, see what the difference is.

I/O performance of multiple JVM (Windows 7 affected, Linux works)

I have a program that creates a file of about 50MB size. During the process the program frequently rewrites sections of the file and forces the changes to disk (in the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others (launched from the first) write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (in the order of 100 times the normal cost). That is, all write operations are equally slow, it is not just one that hang very long.
Interestingly, one way to help this is to insert delays (2 seconds) between the launching of JVMs. Without delay, writing is always slow, with delay, the writing is slow aboutr every second time or so.
I also found this Stackoverflow: How to unmap a file from memory mapped using FileChannel in java? which describes a problem for mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installes some expensive concurrency handler for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program from as a JUnit test harness from eclipse or from console works fine, it takes around 3 seconds.
Starting the program through an ant task (or through JUnit by kicking of a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (factor 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof from 34.1% (76+20 tics) to 97.3% (3587+2913+751 tics):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem, all test machines have at least 8GB, the process uses <200MB
rebooting the machine does not help
switching of virus-scanners and similar stuff has no affect
When the process is slow, there is virtually no CPU usage
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean(); List arguments = RuntimemxBean.getInputArguments();
I tested it on two machines with Windows7 64bit, Java 7u2, Java 6u26 and JRockit, the hardware of the machines differs, though, but the results are very similar.
I tested it also from outside Eclipse (command-line ant) but no difference there.
The whole program is written by myself, all it does is reading and writing to/from this file, no other libraries are used, especially no native libraries. -
And some scary facts that I just refuse to believe to make any sense:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!) such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK-jre and the JRE-jre work fine at least once. Overinstalling a JVM does not help. Neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)? -

Are you using local disks for all testing (as opposed to any network share) ?
Can you setup Windows with a ram drive to store the data ? When a JVM terminates, by default its file handles will have been closed but what you might be seeing is the flushing of the data to the disk. When you overwrite lots of data the previous version of data is discarded and may not cause disk IO. The act of closing the file might make windows kernel implicitly flush data to disk. So using a ram drive would allow you to confirm that their since disk IO time is removed from your stats.
Find a tool for windows that allows you to force the kernel to flush all buffers to disk, use this in between JVM runs, see how long that takes at the time.
But I would guess you are hitten some iteraction with the demands of the process and the demands of the kernel in attempting to manage disk block buffer cache. In linux there is a tool like "/sbin/blockdev --flushbufs" that can do this.
FWIW
"pwrite" is a Linux/Unix API for allowing concurrent writing to a file descriptor (which would be the best kernel syscall API to use for the JVM, I think Win32 API already has provision for the same kinds of usage to share a file handle between threads in a process, but since Sun have Unix heritige things get named after the Unix way). Google "pwrite(2)" for more info on this API.
"force" I would guess that is a file system sync, meaning the process is requesting the kernel to flush unwritten data (that is currently in disk block buffer cache) into the file on the disk (such as would be needed before you turned your computer off). This action will happen automatically over time, but transactional systems require to know when the data previously written (with pwrite) has actually hit the physical disk and is stored. Because some other disk IO is dependant on knowing that, such as with transactional checkpointing.

One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio