How to interpret NVIDIA Visual Profiler analysis/recommendations? - parallel-processing

I'm relatively new to CUDA and am currently working on a project to accelerate computer vision applications on embedded systems with an attached GPU (NVIDIA TX1). What I'm trying to do is select between two libraries: OpenCV and VisionWorks (which includes OpenVX).
Currently, I have written test code to run the Canny edge detection algorithm with each library, and the two implementations show different execution times (the VisionWorks implementation takes about 30~40% less time).
So, I wondered what the reason might be, and profiled the kernel that takes the most time in each implementation: 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra, which takes up 37.2% of the entire application, and 'edgesHysteresisLocal' from VisionWorks.
I followed the 'guided analysis' and the profiler suggested that both applications are latency-bound. Below are the captures of 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra and 'edgesHysteresisLocal' from VisionWorks.
OpenCV4Tegra - canny::edgesHysteresisLocalKernel
VisionWorks - edgesHysteresisLocal
So, my questions are:
From the analysis, what can I tell about the causes of the performance difference?
Moreover, when profiling CUDA applications in general, what is a good place to start? I mean, there are a bunch of metrics and it's very hard to tell what to look at.
Are there any educational materials on profiling CUDA applications in general? (I looked at many slides from NVIDIA, and I think they mostly give the definitions of the metrics, not where to start in general.)
-- By the way, as far as I know, NVIDIA doesn't provide the source code of VisionWorks or OpenCV4Tegra. Correct me if I'm wrong.
Thank you in advance for your answers.

1/ The shared memory usage is different between the two libraries; this is probably the cause of the performance divergence.
2/ Generally, I use three metrics to know whether my algorithm is well coded for CUDA devices (see the sketch after this list for checking them on your own kernels):
the memory usage of the kernels (bandwidth)
the number of registers used: is there register spilling or not?
the number of shared memory bank conflicts
3/ I think there are many resources on the internet for this.
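Regarding registers, spilling and shared memory: you can't query the closed-source library kernels directly, but for your own kernels a quick programmatic check is possible with the CUDA runtime (the same numbers are printed at compile time by 'nvcc -Xptxas -v', and the Visual Profiler reports them for any kernel, including the library ones). This is only a sketch; 'scaleKernel' below is a hypothetical stand-in for one of your kernels.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for one of your own kernels.
    __global__ void scaleKernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, scaleKernel);

        // Registers per thread: high counts reduce occupancy.
        printf("registers per thread : %d\n", attr.numRegs);
        // Statically allocated shared memory per block.
        printf("shared mem per block : %zu bytes\n", attr.sharedSizeBytes);
        // Local (off-chip, per-thread) memory: non-zero usually means register spilling.
        printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
        return 0;
    }

Bank conflicts can't be read from these attributes; for those, the shared memory efficiency metrics in the Visual Profiler are the place to look.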
Another thing:
If you just want to compare one library against the other in order to select the best one, why do you need to understand each implementation? (It's interesting, but not a prerequisite, is it?)
Why not measure algorithm performance by its execution time and the quality of the produced results according to a metric (false positives, average error on a known reference set, ...)? A sketch of such a comparison follows below.
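A minimal sketch of that kind of black-box comparison, assuming the two implementations are wrapped in functions you provide ('runOpenCVCanny' and 'runVisionWorksCanny' below are hypothetical stand-ins); CUDA events time whatever GPU work each library launches internally:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-ins: put the real calls of each library here,
    // e.g. cv::gpu::Canny(...) for OpenCV4Tegra and vxProcessGraph(...)
    // for the VisionWorks/OpenVX graph.
    void runOpenCVCanny()      { /* ... */ }
    void runVisionWorksCanny() { /* ... */ }

    template <typename F>
    float averageMs(F impl, int iterations = 100)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        impl();                       // warm-up run, not timed
        cudaEventRecord(start);
        for (int i = 0; i < iterations; ++i)
            impl();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / iterations;
    }

    int main()
    {
        printf("OpenCV4Tegra : %.3f ms\n", averageMs(runOpenCVCanny));
        printf("VisionWorks  : %.3f ms\n", averageMs(runVisionWorksCanny));
        // For quality, also diff the two edge maps (or compare each against
        // a known reference) so that a faster but different result is caught.
        return 0;
    }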

Related

Theoretically knowing how powerful a microcontroller you need to run your program?

With the vast array of microcontrollers out there, and even different levels of Arduinos each providing more power than the last, is there a mathematical way, or some way of knowing by analysis alone, how much processing power you need to run your program as designed, in order to choose the right micro?
Without just trial and error, i.e. without simply trying it and, if it is too slow, buying the next chip up.
I've had to do performance projections for computer systems that did not exist yet. Things like cycle time ratios can only give a very rough guide. Generally, I had to resort to simulation, the nearest I could get to measuring on actual hardware.
That said, you may be able to find numbers for benchmarks similar to your code that will at least give you a starting point.
I would not do it by working up one chip at a time - your code may have a problem that makes it too slow for any feasible chip. I would try to find a chip that is fast enough, and work down if it is much faster than needed.
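To make the "very rough guide" concrete, the cycle-ratio estimate amounts to back-of-envelope arithmetic like the following (every number here is hypothetical; substitute your own measurements or simulation results):

    #include <cstdio>

    int main()
    {
        // Hypothetical figures for illustration only.
        const double events_per_second = 1000.0;   // e.g. sensor samples to process per second
        const double cycles_per_event  = 20000.0;  // cycles per sample, measured or simulated
        const double safety_margin     = 4.0;      // rough estimates deserve a big margin

        const double required_hz = events_per_second * cycles_per_event * safety_margin;
        printf("budget: at least %.1f MHz of throughput on this core\n", required_hz / 1e6);

        // Caveat, as the answer above says: cycles_per_event rarely transfers
        // cleanly between architectures (different ISAs, memory wait states,
        // peripherals), which is why this is only a very rough guide.
        return 0;
    }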

Most effective method to use parallel computing on different architectures

I am planning to write something to take advantage of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so it would be great for me to take advantage of all that I have available at home.
I am using mainly OSX, so I need something that may work with that OS. I can code in objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning not to use the full power of my desktop but only some of the available cores (that way I can continue to use the desktop for other tasks that are not so resource-intensive). In particular I would love to use the graphics card's power (it is an ATI card, so no CUDA), since all I mainly do is spreadsheets, word processing and coding with Xcode, and the graphics card's resources are basically unused in that scenario.
Is there a specific set of libraries or APIs, among the aforementioned three, that would allow me to selectively route tasks and use resources on another machine without leaving control totally to the compiler? I've heard that GCD is great, but that it gives very limited control over where the blocks are executed, while MPI is on the other side of the spectrum... OpenCL seems to be in the middle.
Before diving into one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already successfully used parallel computing to achieve what I am trying to achieve.
Thanks in advance.
MPI is more for large-scale scientific computing with many processors and many nodes, etc., not for a weekend project. For what you describe I would suggest using OpenCL, or one of the more distributed frameworks from the AMQP family such as ZeroMQ or RabbitMQ, or a combination of OpenCL and AMQP. Even simpler, consider multithreading; I would suggest OpenMP for that (a minimal sketch follows below). I'm not sure whether you are looking for direct solvers or parallel functions, but many already exist for both GPUs and CPUs, which you can find on the web.
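To show what the OpenMP route looks like, here is a minimal sketch; 'simulateCell' is a hypothetical placeholder for one independent unit of your cellular simulation, and the thread count is capped so the desktop stays usable for other work, as described in the question.

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    // Hypothetical placeholder for one independent unit of work.
    double simulateCell(int i) { return 0.5 * i; }

    int main()
    {
        const int n = 1000000;
        std::vector<double> result(n);

        // Use only some of the available cores so the machine stays responsive.
        omp_set_num_threads(4);

        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            result[i] = simulateCell(i);

        printf("last value: %f\n", result[n - 1]);
        return 0;
    }

Compile with your compiler's OpenMP flag (e.g. -fopenmp).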
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built into OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straightforward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem, and either way you still want to use all the cores available on any of the given target machines, so the two challenges are both complementary and orthogonal).

Parallel processing on FPGA. How to start with?

I have a computationally intensive task which I implemented with CUDA, and now I want to make it even faster with FPGAs (if possible).
The system I want to implement is a series of computations, each similar to matrix multiplication in the sense of being parallel. It also has some non-parallel parts in between, and it works with large amounts of data.
Although I want it as fast as possible, I have enough time to learn and explore with FPGAs.
Here I'm asking for suggestions on how to start my path: which FPGA to choose, and where to learn about it. Any websites, online classes or books? I've decided to do this anyway, but your opinion on whether this will be faster on an FPGA or not would be helpful too.
The big wins from an FPGA over using a GPU come from:
Using non-standard word widths optimised to your application. This allows denser logic, which allows more parallel processing blocks.
Using your knowledge of the required accesses to external RAM to schedule them in hardware more efficiently than a general-purpose memory controller can.
The downside is getting data to and from the FPGA. Draw a data-transfer diagram before you start. Even if the FPGA provides infinite speedup, you might still find it's not worth the effort if there's loads of data to be shuffled to and fro!
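One way to sketch that data-transfer picture numerically, before committing to anything, is to compare the time spent shuffling data against the compute time you hope to win back. Everything in the following is a made-up illustration; the bandwidth, data volume and speedup are assumptions, not measurements:

    #include <cstdio>

    int main()
    {
        // Hypothetical figures; replace with your own estimates.
        const double bytes_per_batch  = 512.0 * 1024 * 1024;  // data in + out per batch
        const double link_bytes_per_s = 6.0e9;                // assumed effective host<->FPGA bandwidth
        const double compute_s_now    = 2.0;                  // current compute time per batch
        const double fpga_speedup     = 10.0;                 // optimistic guess

        const double transfer_s = bytes_per_batch / link_bytes_per_s;
        const double total_s    = compute_s_now / fpga_speedup + transfer_s;

        printf("transfer alone: %.3f s, total with FPGA: %.3f s (currently %.3f s)\n",
               transfer_s, total_s, compute_s_now);
        // If transfer_s dominates, even an infinite speedup on the FPGA cannot
        // pay for the shuffling, which is exactly the point made above.
        return 0;
    }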
It's likely you'll be wanting a PCI Express based board, which is (I imagine) a whole new learning curve before you get to do anything with the FPGA - but if you're up for it, it'll be a very interesting task!
In terms of choosing FPGAs, have a play with the software tools from the various vendors - at the learning stage that's much more important than the chips themselves. You won't find (at this early learning stage) a show-stopper feature in any of the various chips. Also take into account the availability of boards with your required interfaces on them, and any IP core you might need to do the high-speed interfacing (e.g. PCIe).
You can get a substantial speedup on most parallel problems with an FPGA.
However, in addition to implementing your computation on the FPGA, there's a lot of work involved in getting the data back and forth from the CPU/main memory. This will require implementation of (for example) a PCI Express endpoint in the FPGA logic (bus mastering for maximum speed) and custom drivers on the software side. Most operating systems will require those drivers to be developed in kernel mode.
And you can't just use the most straightforward approach for FPGA programming either. You're going to need to worry about pipelining and clock synchronization in order to maximize throughput.
In other words, it's a substantially difficult task even for engineers with years of FPGA experience. I strongly suggest you find someone to work with on this. Depending on how proprietary your project is, you might find skilled academics willing to work with you as long as you provide them with all materials and publication rights.
If you're determined to go it alone, you'll need some hardware. Many different companies offer FPGAs wired up as accelerators, for example http://www.nallatech.com/pci-express-cards.html
Depending on whether you choose a Xilinx or Altera FPGA, you'll find considerable sample code and tutorials for getting PCI express working.

Best practices for capturing and logging performance of software components

I am searching for good (preferably plug-and-play) solutions for performing diagnostics on software I am developing. The software I am working on has several components that require extensive computing resources, and so we're attempting to capture the performance of these components for two reasons: 1) estimate required computing resources and thus the costs of running the software, and 2) quantify what an "improvement" is for the component (i.e. if we modify the code and speed increases, then it's an improvement). Our application is composed of a search engine plus many other components, and understanding the speed of the search engine is also critical to the end-user.
It seems to be hard to search for a solution since I'm not sure how to properly define my problem. But what I've found so far seems to be basic error logging techniques. A solution whose purpose is to run statistics (e.g. statistical regressions) off of the data would be best. Maybe unit testing frameworks have built-in test timers, but we need to capture data from live runs of our application to account for the numerous different scenarios.
So really there are two questions:
1) Is there a predefined solution for these sorts of tests?
2) Is there any good reference for running statistical regressions on this kind of data? Let's say we captured the execution time of the script and the size of the input data (e.g. the query). We can regress time on data size to understand the effect of changing the data size on the execution time. But these sorts of regressions are tricky, since it's not clear what all of the relevant variables are. Any reference on analyzing performance data would be excellent, and would benefit many people, I believe!
Thanks
Matt
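For the regression part of question 2, the simplest useful model is an ordinary least-squares fit of execution time against input size from your live-run logs. A minimal sketch (the data points below are made up; in practice they would come from your captured logs):

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Fit time = a + b * size by ordinary least squares.
    std::pair<double, double> fitLine(const std::vector<double>& size,
                                      const std::vector<double>& time)
    {
        const double n = static_cast<double>(size.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t i = 0; i < size.size(); ++i) {
            sx  += size[i];
            sy  += time[i];
            sxx += size[i] * size[i];
            sxy += size[i] * time[i];
        }
        const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // seconds per unit of input
        const double a = (sy - b * sx) / n;                          // fixed overhead in seconds
        return std::make_pair(a, b);
    }

    int main()
    {
        // Made-up log entries: input size vs. measured execution time.
        const std::vector<double> size = {100, 200, 400, 800, 1600};
        const std::vector<double> time = {0.05, 0.09, 0.18, 0.33, 0.70};

        const std::pair<double, double> fit = fitLine(size, time);
        printf("overhead ~ %.3f s, cost ~ %.5f s per unit of input\n", fit.first, fit.second);
        // Tracking how these two numbers move between builds separates fixed
        // overhead from how the component scales with data size.
        return 0;
    }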
Big apps like these are going to be doing a lot of non-CPU processing, so to find optimization points you're going to need wall-clock-based, not CPU-based, sampling. gprof and some others only sample on CPU time, so they cannot see needless I/O or other system calls. If you do manage to find and remove CPU-intensive performance problems, the I/O-intensive ones will only become a larger fraction of the time.
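The distinction is easy to demonstrate: time a blocking wait (standing in for I/O or other system calls) with both a CPU clock and a wall clock.

    #include <chrono>
    #include <cstdio>
    #include <ctime>
    #include <thread>

    int main()
    {
        const std::clock_t cpu_start = std::clock();
        const auto wall_start = std::chrono::steady_clock::now();

        // Stand-in for I/O, network waits or other blocking system calls.
        std::this_thread::sleep_for(std::chrono::milliseconds(500));

        const double cpu_s  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
        const double wall_s = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - wall_start).count();

        // A CPU-time-only profiler attributes roughly zero time here, even
        // though the program spent half a second waiting.
        printf("cpu time : %.3f s\nwall time: %.3f s\n", cpu_s, wall_s);
        return 0;
    }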
Take a look at Zoom. It's a stack sampler that reports, by line of code, the percent of wall-clock time that line is on the stack. Any code point worth optimizing will probably be such a line. It also has a nice butterfly view for browsing the call graph. (You don't want the call graph as a whole. It will be a meaningless rat's nest.)

Site on OpenGL call performance

I'm searching for reliable data on the performance of OpenGL's functions. A site that could, for example:
...answer me how much more efficient using glInterleavedArrays is compared to a gl*Pointer based implementation with strides, or without them. If applicable, show the comparisons on NVIDIA vs. ATI cards vs. embedded systems.
...answer me how much of a boost is gained in using VBO's vs. non-buffered data in the cases of static, dynamic and stream data.
I'd like to find a site that has "no-bullshit" performance data, not just vague statements like "glInterleavedArrays are usually faster than direct gl*Pointer usage".
Is there such a dream site? Or at least somewhere where I can get answers to the aforementioned questions?
(Yes, I know that nothing will beat hand-profiling, but the fact that something works faster on my machine doesn't mean it's faster generally on all cards...)
It's more about application-level benchmarking than measuring the performance of individual features, but it might be possible to learn something from SPECviewperf, especially if it's possible to discover more about what OpenGL mode each benchmark uses to perform its rendering. The benchmark seems to include some options to tweak the usage of display lists, vertex arrays etc., but I don't think SPEC's published results go into any analysis of the effects of changing these from the defaults. They don't seem to have any VBO coverage yet.
