Use a GPU profiler (for example CodeXL) together with PyOpenCL for debugging

I have a complex PyOpenCL app with a lot of buffer creation, kernel templating, and so on. I want to profile the app on the GPU to see what the bottleneck is in my case.
Is it possible to use a GPU profiler, for example CodeXL, with a PyOpenCL app?
P.S. I know about event profiling, but it isn't enough.

Yes, it is possible. Look here: http://devgurus.amd.com/message/1282742
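For reference, here is a minimal sketch of the event-profiling baseline the question already mentions, using only documented PyOpenCL calls (a PROFILING_ENABLE command queue plus the event's profile timestamps); the kernel and buffer are made up for illustration. A whole-program GPU profiler such as CodeXL is typically pointed at the Python interpreter, with your script as the command-line argument, so the kernels enqueued this way also show up in its application trace.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
# Profiling-enabled queue: required for event timestamps, harmless otherwise.
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

src = """
__kernel void scale(__global float *a, const float factor) {
    int gid = get_global_id(0);
    a[gid] *= factor;
}
"""
prg = cl.Program(ctx, src).build()

host = np.arange(1 << 20, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host)

evt = prg.scale(queue, host.shape, None, buf, np.float32(2.0))
evt.wait()

# Event timestamps are reported in nanoseconds.
print("kernel time: %.3f ms" % ((evt.profile.end - evt.profile.start) * 1e-6))
```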

Related

How can I benchmark or profile an emulated embedded ARM platform?

I'm developing performance-sensitive code for an embedded platform. There are multiple ways to test for an embedded platform; I'm developing on a full Linux machine and using qemu-user in ARM mode as an emulator. I have full unit tests working, and now I want to address performance.
I'd like to profile or benchmark my code. Doing so directly in qemu-user is silly, because a fast op may be emulated slowly. But, in principle, qemu could tell me how many clock cycles were emulated to run a function. Even without a full model, or even a partial model, of caches, memory latency, etc., this would still be very useful.
Is there a way I can use qemu to give me some sense of how code A will perform versus code B? If not, is there another tool? (I recall Intel having some type of model that will tell you how fast a given piece of asm will execute.) In general, in the absence of an embedded platform with profiling tools, how can I benchmark and profile my code for ultimate performance?
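One direction that seems worth checking (a sketch, not a definitive answer): recent QEMU builds can load TCG plugins, and the bundled libinsn plugin reports how many guest instructions were executed, which gives a crude code-A-versus-code-B comparison, though not clock cycles. The plugin path, the log format being parsed, and the binary names below are all assumptions to verify against your QEMU version.

```python
import re
import subprocess
import tempfile

def count_guest_insns(binary, plugin="/usr/lib/qemu/plugins/libinsn.so"):
    """Run an ARM binary under qemu-user and return the guest instruction count."""
    with tempfile.NamedTemporaryFile(suffix=".log") as log:
        # -plugin loads the instruction-counting plugin, -d plugin enables its
        # output, -D redirects the log to a file (flags assumed from QEMU docs).
        subprocess.run(
            ["qemu-arm", "-plugin", plugin, "-d", "plugin", "-D", log.name, binary],
            check=True)
        text = open(log.name).read()
    # libinsn prints a line containing the total instruction count.
    match = re.search(r"insns:\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Hypothetical comparison of two candidate implementations:
# print(count_guest_insns("./code_a"), count_guest_insns("./code_b"))
```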

Profiling OpenGL ES in Windows

I'm trying to do some profiling of my OpenGL ES code. Something in my GPU pipeline (a shader, I believe) is causing a huge delay. Which is the best profiler I can use? Is this one a good option? Is there one I can use directly within Visual Studio?
If you have a GPU performance issue on iOS, the best approach is to use the Xcode tools to profile it directly on the device: run the app from Xcode, then do a frame capture to look at the timings for each draw call and the number of cycles used by each shader (more info here).
You can also profile on Windows if you are able to simulate your graphics pipeline in classic OpenGL in your Windows version, but this may not be a good idea, as the iPhone's GPU is very different from a classic desktop GPU, so the bottleneck might not be the same on Windows as on iOS.
To profile on Windows I would suggest using either Nvidia PerfKit (if you have an Nvidia card) or AMD's GPU PerfStudio (if you have an AMD card).
There is also RenderDoc, which is a nice tool, but I'm not sure it provides much profiling information (it is geared more toward debugging graphics issues than profiling).
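If you want to narrow down which draw call is slow before reaching for one of those tools, GPU timer queries are a lightweight option. The sketch below uses PyOpenGL purely for illustration, with desktop GL's GL_TIME_ELAPSED query (on OpenGL ES you would need the EXT_disjoint_timer_query extension instead); draw_scene is a placeholder for your own rendering code, and a current GL context is assumed.

```python
from OpenGL.GL import (GL_QUERY_RESULT, GL_TIME_ELAPSED, glBeginQuery,
                       glEndQuery, glGenQueries, glGetQueryObjectuiv)

def time_draw_call(draw_scene):
    """Return the GPU time (in ms) spent in the bracketed draw call(s)."""
    query = glGenQueries(1)
    glBeginQuery(GL_TIME_ELAPSED, query)
    draw_scene()                      # the draw call(s) under suspicion
    glEndQuery(GL_TIME_ELAPSED)
    # Reading GL_QUERY_RESULT stalls until the GPU finishes the bracketed work.
    # The 32-bit result is in nanoseconds, so it overflows after about 4 seconds.
    nanoseconds = glGetQueryObjectuiv(query, GL_QUERY_RESULT)
    return nanoseconds * 1e-6
```

Wrap each suspect pass in turn and compare the numbers; whichever pass dominates the frame is where the shader work is going.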

Application Compilation using Graphics Card

During the Microsoft Windows 10 Devices event, Panos Panay, whilst talking about the Surface Book's graphics, said the following:
It's for that coder, using the latest Visual Studio where they can compile using the GPU and CPU at the same time and not lose a minute (Video)
This could just be a throwaway comment, but given that it is possible to do CPU-type work on the GPU (CUDA?), I wondered if he was actually talking about a genuine way to make Visual Studio use both the CPU and the GPU to perform application compilation.
Looking online, I can't see an obvious answer. Is this possible?
If they use something like C++ AMP underneath, then that is exactly what it is designed to do: use the CPU and GPU together (heterogeneous computing).
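To make "heterogeneous computing" concrete, here is a minimal sketch in PyOpenCL (the framework from the question at the top of this page) that runs the same kernel on up to two OpenCL devices at once, e.g. a CPU device and a GPU device, each on part of the data. It says nothing about what Visual Studio or C++ AMP actually do internally; it only illustrates the CPU-plus-GPU idea, and the kernel is made up for illustration.

```python
import numpy as np
import pyopencl as cl

src = "__kernel void twice(__global float *a) { a[get_global_id(0)] *= 2; }"
data = np.arange(1 << 20, dtype=np.float32)

# Collect every OpenCL device in the system: CPUs and GPUs alike.
devices = [d for p in cl.get_platforms() for d in p.get_devices()]
n = min(2, len(devices))
chunks = np.array_split(data, n)

events, results = [], []
for dev, chunk in zip(devices[:n], chunks):
    ctx = cl.Context([dev])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, src).build()
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=chunk)
    prg.twice(queue, chunk.shape, None, buf)     # enqueue, don't wait yet
    out = np.empty_like(chunk)
    events.append(cl.enqueue_copy(queue, out, buf, is_blocking=False))
    results.append(out)

# The devices work concurrently; only now do we wait for the results.
for evt in events:
    evt.wait()
print(np.allclose(np.concatenate(results), data * 2))
```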

Emulate OpenGL on machine with standard VGA graphics

So, we've got a little graphical doohickey that needs to run in a server environment without a real video card. All it really needs is framebuffer objects and maybe some vector/font anti-aliasing. It will be slow, I know. It just needs to output single frames.
I see this post about how to force software rendering mode, but it seems to apply to machines that already have OpenGL-enabled cards (like Nvidia).
So, for fear of trying to install OpenGL on a machine three time zones away with a bunch of live production sites on it: has anybody tried this, or does anybody know how to "emulate" an OpenGL environment? Unfortunately our dev server HAS a video card, so I can't really show "what I've tried".
The relevant code is all in Cinder, but I think our actual OpenGL utilization is lightweight for this purpose.
This would run on Windows Server 2008 Standard.
I see Microsoft has a software implementation of OpenGL 1.1, but I can't seem to find one for 2.0.
Build/find some Mesa DLLs.
It will be slow.
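A note on verifying the result, under the assumption that "build/find some Mesa DLLs" means dropping a Mesa-built opengl32.dll (e.g. the llvmpipe software rasterizer) next to the executable: the GL renderer string tells you whether you actually got Mesa or fell back to Microsoft's "GDI Generic" OpenGL 1.1 implementation. The sketch below uses PyOpenGL just to show the calls; the same glGetString queries are available from Cinder's C++ side, and a current GL context is assumed to exist already.

```python
from OpenGL.GL import GL_RENDERER, GL_VENDOR, GL_VERSION, glGetString

def report_gl_implementation():
    """Print which OpenGL implementation the process actually picked up."""
    # Requires a current GL context (created by your framework, e.g. Cinder).
    for name, enum in (("vendor", GL_VENDOR),
                       ("renderer", GL_RENDERER),
                       ("version", GL_VERSION)):
        print(name + ":", glGetString(enum).decode())
```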

Is direct video card access possible? (No API)

I'm now a bit experienced with using OpenGL, which I started using because it is said to be the only way to invoke video card functions (besides DirectX, which I like less than OpenGL).
For programming (e.g. in C/C++), the OS provides many APIs, such as functions for printing. But these can be bypassed by coding in assembly language and calling much lower-level APIs (which gains speed), or by issuing CPU instructions directly.
So I started wondering why this wouldn't be possible with the video card. Why should an API like OpenGL or DirectX be needed? The process going on with those is:
API-call >
OS calls video card (with complex opcodes, I think) >
video card responds (in complex binary format) >
OS decodes this format and responds to the user (in the expected API format)
I believe this should decrease the speed of the rendering process.
So my question is:
Is there any possibility to bypass any graphical API (under Windows) and make direct calls to the video card?
Thanks,
Dennis
Using assembly or bypassing an API doesn't automatically make something faster; it is often slower, because you don't know what the folks who wrote the library know.
It is absolutely possible, yes. Those libraries are just processor instructions that poke and peek at registers and RAM, and you could just as easily poke and peek at registers and RAM yourself. The first problem is whether you can get that information; sure, you can look at the Linux drivers or other open-source resources. Second, much of the heavy lifting today is done in the graphics chip by dedicated logic or graphics processors, so the host is just a go-between and not necessarily the bottleneck, if there is a bottleneck at all. And yes, you can program the GPUs, depending on your video card/chip, etc.
You need to determine where the bottleneck really is, if there really is one: maybe the bus is your problem, maybe the operating system, or the compiler, or the hard disk, or the system memory, or the processor and architecture itself, or the caches, etc. At the same time, how will you ever learn to find these things unless you try?
I recommend getting rid of Windows completely: no operating system, go bare metal. Take the Linux and other open-source resources, plus anything you can get from the vendor, and get closer to the metal. You will also need a lot of information about the PCI/PCIe bus and bridges, the DMA controllers, and everything else in the path. If you don't want to go that low, then use Linux or BSD or some other command-line environment where it is well known how to take over the video system, and do so while retaining an operating system and a development environment (vi/emacs, gcc).
If that is all way too advanced, then I recommend dabbling in simple GPU routines to get a feel for how the video card works, at least at some level, and tackling this learning exercise one step at a time.
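As a first "dabbling" step that stops well short of bare metal, you can ask the driver what it exposes about each device through a compute API. Here is a small sketch using PyOpenCL (matching the question at the top of the page); any device-enumeration API would do, and the attributes printed are just a sample of what is available.

```python
import pyopencl as cl

# Walk every OpenCL platform/device the drivers expose and print a few of the
# hardware properties that give a feel for what the card can do.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name.strip(), "(" + cl.device_type.to_string(dev.type) + ")")
        print("  compute units :", dev.max_compute_units)
        print("  clock (MHz)   :", dev.max_clock_frequency)
        print("  global memory :", dev.global_mem_size // (1024 * 1024), "MiB")
        print("  local memory  :", dev.local_mem_size // 1024, "KiB")
```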
