How can I access the Intel CPU performance counters?

Is there any small tool that gives me access to the data gathered by the Intel CPU counters (like L1/L2 cache misses, branch prediction failures ... you know there are hundreds of them on modern Core2 CPUs)?
It must work on Windows (while being able to use it with Solaris, FreeBSD, Linux, MacOSX would of course be nice).

Check out the Intel PCM (Performance Counter Monitor) tool which does exactly what you want to do.
Link: https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Intel PCM provides a rich API that allows you to instrument your code. Furthermore, to date, PCM is the only tool to read uncore events too.

This thread seems a little old but if you're still interested, I wrote a howto recently on this topic using nothing more than rdmsr and wrmsr in Linux. It only deals with the performance counters on an Intel uncore for Westmere, but the process I described might help you figure out what you need if you haven't already. I'm sure Windows has some equivalent program or function call to RDMSR and WRMSR. The problem is you need to be ring 0 (kernel mode) to read MSRs. I have no idea how to do that in Windows. I won't be able to help with any Windows questions but may be able to answer some MSR-related questions if you have any. I'm by no means an expert though.
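As a rough sketch of what rdmsr does on Linux: the msr kernel driver exposes each core's registers as a device file, /dev/cpu/N/msr, where the file offset is the register address and each read returns 8 bytes. The helper name read_msr and its path parameter are illustrative assumptions, not part of any existing tool, and the sketch assumes the msr module is loaded and the process runs as root.

```python
import os
import struct


def read_msr(register, cpu=0, path=None):
    """Read one 64-bit model-specific register, roughly as rdmsr does.

    `path` is a hypothetical override added for illustration; by default
    the Linux msr device for the given core is used.
    """
    if path is None:
        path = f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        # The msr driver treats the file offset as the register address
        # and returns the register contents as 8 bytes.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at offset {register:#x}")
    # x86 is little-endian, so decode as an unsigned 64-bit LE integer.
    return struct.unpack("<Q", raw)[0]
```

For example, read_msr(0x10) would return the time-stamp counter (IA32_TIME_STAMP_COUNTER) of core 0. Programming a performance counter additionally needs writes to the event-select registers, which go through the same device file with a write at the register's offset.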

PAPI is a very promising lead; however, I believe they discontinued support for Windows (and therefore .NET C#) quite a few years ago.
On the Windows front, Visual Studio 2010 Premium comes with Performance Explorer. If you run any project or binary in instrumentation mode, you can get access to hardware events such as instructions retired.
The results can be somewhat mixed and inconsistent depending on external factors, but it integrates nicely with Visual Studio and you get detailed counts (average, maximum, total) at a per-method/module level.
Intel VTune Performance Analyzer also exposes these natively. I haven't played with this tool yet, but it might offer a more flexible API than what Visual Studio 2010 exposes.

You didn't say whether you are looking for an application or for a library.
For Windows there is Intel VTune, but this is not exactly a small tool. For Linux I have used oprofile, which works without kernel patches.

On OS X, Shark lets you get data from the PMCs. I'm not sure what's available on Windows other than Intel's tools (VTune, as mentioned by drhirsch).

Try this
http://icl.cs.utk.edu/papi/
It is a full library that allows you to read any CPU counter data, and it works on both Windows and Linux [and other OSes].

This thread looks pretty old. But still, all the above-mentioned counters are available in Intel PCM. These counters can be used through a Microsoft Perfmon plugin or a command-prompt interface. Intel PCM gives information such as L2 and L3 cache hit ratios, cache misses, etc.

Distribution of an application that uses OpenCL

I would like to distribute a Windows/Linux application that uses OpenCL, but I can't find the best way to do it.
For the moment my problems are only on Windows:
1- I'm using an Intel CPU; how can I support both Intel AND AMD (the CPUs of the final users)?
2- For distributing an application that uses Visual Studio DLLs, we have the Visual Studio Redistributable to manage this easily and to avoid a big installation of Visual Studio. Is there a package like this for OpenCL?
3- Finally, I don't know if I must provide OpenCL.dll or not (example of different point of view here)
I read several topics on the web about this problem without clear solution.
Thank you for your help.
1) You write to the OpenCL API and it works with whatever hardware your user has. Use the header for the lowest version you want to support (e.g., use cl.h from 1.1 if you want to target 1.1 and higher).
2) The OpenCL runtime is installed on the user's machine when they install a graphics driver. You don't need to (and should not) redistribute anything.
3) Please don't redistribute OpenCL.dll
The one problem you may need to deal with is if your user does not have any OpenCL installed on their machine. In this case, the call to clGetPlatformIDs will fail. There are various ways to deal with this, all platform specific. Dynamically linking to OpenCL.dll is one way, or running a helper process to test for OpenCL is another. An elegant solution on Windows is to delay load OpenCL.dll and hook that API to return 0 if the late binding fails.
1- I'm using Intel CPU, how can I manage Intel AND AMD (CPU of final users)
Are you talking about running OpenCL kernels on the CPU, or just host-side code while the kernels run on a GPU? If the former (on the CPU), your users will need to install their respective OpenCL CPU implementation. IIRC the Intel CPU implementation does not run on AMD CPUs (or at least that used to be the case; perhaps it's different now).
3- Finally, I don't know if I must provide OpenCL.dll
You don't have to, but you should, IMO. The way OpenCL works (usually), OpenCL.dll is just an ICD loader - a small library (a few dozen KB) that loads the actual OpenCL implementation(s) by looking into a few predefined places. It should be safe to include on Windows, and it simplifies your program logic - you can always build with OpenCL enabled, and if there's no OpenCL implementation installed, the loader will return CL_PLATFORM_NOT_FOUND_KHR - you just handle that error by asking user to install an OpenCL implementation, or fallback to non-OpenCL code path if you have it, whatever suits you more.
There's no need to complicate your life with delayed DLL loads or helper processes. In fact that's the entire point of the ICD concept - you don't need to look for the platforms and DLLs yourself, you let the ICD loader do it. It's pretty absurd to write helper code to load a helper library (ICD) which then loads the actual implementation DLLs...
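Both answers come down to the same runtime question: is there a usable OpenCL platform on this machine? A minimal sketch of that check, written here in Python with ctypes rather than in the application's own language, might look like the following. The function name count_opencl_platforms is an illustrative assumption; clGetPlatformIDs and the CL_PLATFORM_NOT_FOUND_KHR value (-1001) come from the OpenCL headers. It treats a missing loader library and a loader that finds no platform the same way.

```python
import ctypes
import ctypes.util
import sys

CL_SUCCESS = 0
CL_PLATFORM_NOT_FOUND_KHR = -1001  # returned by the ICD loader when no platform exists


def count_opencl_platforms():
    """Return the number of OpenCL platforms, or 0 if none can be used.

    Hypothetical helper for illustration only.
    """
    if sys.platform == "win32":
        # OpenCL entry points use the stdcall convention on 32-bit Windows.
        name, loader = "OpenCL.dll", ctypes.WinDLL
    else:
        name, loader = ctypes.util.find_library("OpenCL"), ctypes.CDLL
    if name is None:
        return 0  # no loader library on the search path
    try:
        get_platform_ids = loader(name).clGetPlatformIDs
    except (OSError, AttributeError):
        return 0  # library missing, not loadable, or lacking the symbol
    get_platform_ids.argtypes = [
        ctypes.c_uint32,                  # num_entries
        ctypes.c_void_p,                  # platforms (NULL: only count them)
        ctypes.POINTER(ctypes.c_uint32),  # num_platforms
    ]
    get_platform_ids.restype = ctypes.c_int32
    count = ctypes.c_uint32(0)
    status = get_platform_ids(0, None, ctypes.byref(count))
    if status != CL_SUCCESS:
        return 0  # includes CL_PLATFORM_NOT_FOUND_KHR
    return count.value
```

An application could call this once at startup and fall back to a non-OpenCL code path, or ask the user to install an OpenCL implementation, when it returns 0.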

What is the quantifiable benefit of 64 bit for Visual Studio Code over 32 bit

I'm not a hardware guy, but I know that Microsoft declined a request for a 64-bit version of Visual Studio, stating that a 64-bit version would not have good performance.
One difference between the two that seems obvious to me is the code base. Visual Studio began its life in 1997, so one would expect more baggage on the Visual Studio side and fewer opportunities for a very modern application architecture. Perhaps some parts were built to perform well on 32-bit and for some reason are not suitable for 64-bit? I don't know.
Visual Studio Code, on the other hand, is a modern Electron app, which means it is pretty much just compiled HTML, CSS and JavaScript. I'm betting that making a 64-bit version of Visual Studio Code meets few obstructions, and although the performance gain may not be truly noticeable, why not?
P.S.
I would still like to understand which areas may improve in performance and whether that improvement is negligible to a developer. Any additional info or fun facts you may know would be great. I would like to have as much info as possible, and I will update the question with any hard facts I uncover that are not mentioned.
The existence of 64-bit Visual Studio Code is largely a side-effect of the fact that the Node.js- and Chromium-based runtimes of Electron support both 32- and 64-bit architectures, not a primary design goal for the application. Microsoft developed VS Code with Electron, a framework used to build desktop applications with web technologies.
Because Electron already includes runtimes for both architectures (and for different operating systems), VS Code can provide both versions with little additional effort—Electron abstracts the differences between machines from the JavaScript code.
By contrast, Microsoft distributes much of Visual Studio as compiled binaries that contain machine-specific instructions, and the cost of rewriting and maintaining the source code for 64-bits historically outweighed any benefits. In general, a 64-bit program isn't noticeably faster to the end user than its 32-bit counterpart if it never exceeds the limitations of a 32-bit system. Visual Studio's IDE shell doesn't do much heavy-lifting—the bulk of the expensive processing in a typical workflow is performed by the integrated toolchains (compilers, etc.) which usually support 64-bit systems.
With this in mind, any benefits we may notice from running a 64-bit version of VS Code are similar to those we would see from using a 64-bit web browser. Most significantly, a 64-bit version can address more than 4 GB of memory, which may matter if we need to open a lot of files simultaneously or very large files, or if we use many heavy extensions. So—most important to us developers—the editor won't run out of memory when abused.
While this sounds like an insurance policy worth signing, even if we never hit those memory limits, remember that 64-bit applications generally consume more memory than their 32-bit counterparts. We may want to choose the 32-bit version if we desire a smaller memory footprint. Most developers may never hit that 4 GB wall.
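The 4 GB figure follows directly from the pointer width: 32 bits can name 2^32 distinct byte addresses. A small sketch of that arithmetic, with helper names that are purely illustrative:

```python
import struct

GIB = 2 ** 30  # bytes in one gibibyte


def address_space_gib(pointer_bits):
    """Theoretical address-space size, in GiB, for a given pointer width."""
    return 2 ** pointer_bits / GIB


def current_process_bits():
    """Pointer width of the running interpreter process: 32 or 64."""
    return struct.calcsize("P") * 8


print(address_space_gib(32))   # 4.0
print(current_process_bits())  # depends on the interpreter build
```

This is only the theoretical ceiling; the operating system reserves part of the address space, so a 32-bit process can usually use less. The same pointer width also explains the larger footprint mentioned above: every pointer in a 64-bit process takes 8 bytes instead of 4.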
In rare cases, we may need to choose either a 32-bit or 64-bit version if we use an extension that wraps native code like a DLL built for a specific architecture.
Any other consequences, positive or negative, that we experience from using a 64-bit version of VSCode depend on the versions of Electron's underlying runtime components and the operating system they run on. These characteristics change continuously as development progresses. For this reason, it's difficult to state in a general manner that the 32-bit or 64-bit versions outperform the other.
For example, the V8 JavaScript engine historically disabled some optimizations on 64-bit systems that are enabled today. Certain optimizations are only available when the operating system provides facilities for them.
Future 64-bit versions on Windows may take advantage of address space layout randomization for improved security (more bits in the address space increases entropy).
For most users, these nuances really don't matter. Choose a version that matches the architecture of your system, and consider switching only if you encounter problems. Updates to the editor will continue to bring optimizations for its underlying components. If resource usage is a big concern, you may not want to use a GUI editor in the first place.
I haven't worked much on Windows, but I have worked with x86, x64 and ARM (both 32-bit and 64-bit instruction set sizes) processors. In my experience, before writing code in 64-bit format we asked ourselves: do we really need 64-bit instructions? If our operation can be performed within 32 bits, why would we need another 32 bits?
Think of it like this: you have a processor with 64-bit address and data buses and 64-bit registers. Almost all of the instructions in your program require at most 32 bits. What will you do? Well, I think there are two ways:
1. Create a 64-bit version of your program and run all the 32-bit instructions on your 64-bit processor (wasting 32 bits of your processor in each instruction cycle, and filling the program counter with an address that is 4 bytes further ahead). Your application, which could have run in 256 MB of RAM, now requires 512 MB, so other programs or processes in RAM will suffer.
2. Keep the program format at 32-bit and combine two 32-bit instructions to be pushed into your 64-bit processor for execution.
Obviously, the second approach will run faster with the same resources.
But yes, if your program contains more instructions that really are 64-bit in size, e.g. processing 4K video (better on a 64-bit processor with a 64-bit instruction set) or performing floating-point operations with up to 15 decimal digits of precision, then it is better to create a 64-bit program file.
Long story short: try to write compact software and leverage the hardware as much as possible.
From what I have read so far Here, Here and Here, I came to know that most of the components of VS require only a 32-bit instruction size.
Hope it explains.
Thanks
4 years later, in 2021, you now have:
"Microsoft's Visual Studio 2022 is moving to 64-bit" from Mary Jo Foley
It references the official publication "Visual Studio 2022" from Amanda Silver, CVP of Product, Developer Division
Visual Studio 2022 is 64-bit
Visual Studio 2022 will be a 64-bit application, no longer limited to ~4gb of memory in the main devenv.exe process. With a 64-bit Visual Studio on Windows, you can open, edit, run, and debug even the biggest and most complex solutions without running out of memory.
While Visual Studio is going 64-bit, this doesn’t change the types or bitness of the applications you build with Visual Studio. Visual Studio will continue to be a great tool for building 32-bit apps.
I find it really satisfying to watch this video of Visual Studio scaling up to use the additional memory that’s available to a 64-bit process as it opens a solution with 1,600 projects and ~300k files.
Here’s to no more out-of-memory exceptions. 🎉

OpenCL distribution

I'm currently developing an OpenCL-application for a very heterogeneous set of computers (using JavaCL to be specific). In order to maximize performance I want to use a GPU if it's available otherwise I want to fall back to the CPU and use SIMD-instructions. My plan is to implement the OpenCL-code using vector-types because my understanding is that this allows CPUs to vectorize the instructions and use SIMD-instructions.
My question however is regarding which OpenCL-implementation to use. E.g. if the computer has a Nvidia GPU I assume it's best to use Nvidia's library but if no GPU is available I want to use Intel's library to use the SIMD-instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update
After testing the different OpenCL-drivers this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM, but it also didn't return any usable devices (I was using an Intel i7 CPU). When I compiled the OpenCL code offline it seemed to be able to do some auto-vectorization, so Intel's compiler seems quite nice.
Nvidia: refused to install their WHQL drivers because it claimed I didn't have an Nvidia card (that computer has a GeForce GT 330M). When I tried it on a different computer I managed to get all the way to creating a kernel, but at the first execution it crashed the drivers (the screen flickered for a while and Windows 7 said it had to restart the drivers). The second execution caused a blue screen of death.
AMD/ATI: refused to install the 32-bit SDK (I tried that since I will be using a 32-bit JVM), but the 64-bit SDK worked well. This is the only driver on which I've managed to execute the code (after a restart, because at first it gave a cryptic error message when compiling). However, it doesn't seem to be able to do any implicit vectorization, and since I don't have an ATI GPU I didn't get any performance increase compared to the Java implementation. If I use vector types I might see some improvement though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better off creating a JNI module with C code compiled to use SSE instructions.
First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Note that NVidia does not support CPUs via their driver; only AMD and Intel do. Also, targeting multiple devices (say 2 GPUs and a CPU) is a bit more difficult.
There is no API providing what you want. However, you can do the following:
I suggest you iterate over the platforms returned by clGetPlatformIDs and, for each one, query the number of devices (clGetDeviceIDs) and the device type of each device;
and pick the platform which has both types.
Then build a map in your code that maps each device type to the list of platforms supporting it, ordered in some manner.
Finally, just get the first item in the list for CL_DEVICE_TYPE_CPU and the first item in the list for CL_DEVICE_TYPE_GPU.
If both results are equal (platform_cpu == platform_gpu), then pick that platform and use it for both.
If there is a platform supporting both types, you will get a match as before, since the lists are ordered. Then you can also do load balancing on a single platform if you like, as Intel's implementation allows.
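The selection logic described in this answer can be sketched without touching OpenCL at all. In the Python sketch below, the input dictionary stands in for the results of clGetPlatformIDs and clGetDeviceIDs; the platform names, the "CPU"/"GPU" strings and both function names are hypothetical placeholders, not real API values.

```python
from collections import defaultdict


def platforms_by_type(platform_devices):
    """Map each device type to the platforms offering it, in input order."""
    by_type = defaultdict(list)
    for platform, device_types in platform_devices.items():
        for device_type in sorted(set(device_types)):
            by_type[device_type].append(platform)
    return by_type


def choose_platforms(platform_devices):
    """Return (cpu_platform, gpu_platform), preferring one platform for both."""
    by_type = platforms_by_type(platform_devices)
    cpu = by_type.get("CPU", [])
    gpu = by_type.get("GPU", [])
    both = [p for p in cpu if p in gpu]
    if both:
        return both[0], both[0]
    return (cpu[0] if cpu else None), (gpu[0] if gpu else None)


# Hypothetical query results: platform name -> device types it exposes.
found = {
    "platform_a": ["GPU"],
    "platform_b": ["CPU"],
    "platform_c": ["CPU", "GPU"],
}
print(choose_platforms(found))  # ('platform_c', 'platform_c')
```

If platform_c were absent, the same call would return ('platform_b', 'platform_a'), i.e. one platform per device type. The ordering of the lists is simply the order in which platforms were reported; a real application might instead sort by a tested preference.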
Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug :
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers

What debugger can you use on a DOS protected mode program?

I have a program written in CA-Clipper 5.2 and linked with Blinker 7. I recently learned how to compile it into protected mode in place of real mode. Now the real mode debugger won't work with the program, so I need another way to debug my code. The documentation for Blinker says to use "NuMega SoftICE" or "Periscope". I'm not familiar with those debuggers, and can't find much on them from Google. It sounds like SoftICE was turned into some type of hacking tool. Any suggestions on a way to debug my program?
NuMega was bought out, and SoftICE was killed (something like five years ago, if memory serves). It was a kernel debugger, which is a kind of tool some hackers (in either sense of the word) find useful, but wasn't really a hacking tool as such. (Silly trivia of the day: people who beta-tested the original version of SoftICE for Windows NT got a T-shirt that read: "...and they said it couldn't be done!").
Periscope is (was) an in-circuit emulator. It was a board with a plug to fit into your CPU socket, and a socket where you put the original CPU. It would then monitor all the traffic over the CPU bus, providing a lot of debugging capability that most software debuggers can't even hope to match. As CPU buses got faster, however, it got extremely expensive, and eventually got to the point that there was no market left. There was definitely a version for the 486 (I've used it), but I don't think there was ever a version for the Pentium or newer.
As to what you would use: the HX DOS Extender is probably the only DOS Extender still maintained. Their page lists debuggers that can be used with it. I certainly can't guarantee compatibility with the DOS extender you're using, but there's at least a chance one of them might work.
Try the Watcom debugger with this command-line startup: wd /tr=rsi
(rsi is the trap file for the Rational Systems DOS extender.)
Be sure to get the latest version: open-watcom-c-dos-1.9.7z
It has problems with searching,
but earlier versions do not work well.

Which is the best tool to test for Memory leak in Win32/COM application?

I'm looking for a tool which can monitor a running application (Win32/COM) for a long duration (1-3 days) and detect memory leaks if any. Any suggestions?
It is a .NET Windows application calling lots of unmanaged code.
You can try Memory Validator
iJeeves, the combination of BoundsChecker and .NET memory profiling should help you with your memory analysis. DevPartner Studio 10.5 ships February 4, 2011 with 64-bit application support. Depending on your application's raw memory footprint, you may run x86 build configurations with the error detection memory tracking analysis as long as you stay below the 2 GB overall process virtual address limit, or 3 GB if you link the exe with LARGE_ADDRESS_AWARE and run on an x64 OS with extra RAM. The x64 build configuration will let you go as high as your system RAM allows, at least until you start paging and performance grinds to a standstill. You can run BC error detection for your native code under the .NET process, but object leaks or held references in managed code require a second pass using the .NET memory profiler. We do not yet have a single-pass analysis that can handle mixed C++ and .NET code with full mixed stack traces, but we can handle managed code above the line, any PInvokes that cross the line, and all native activity below the line in two passes. Shameless plug: I work on the DevPartner team. The links above pointing to microfocus.com accurately resolve to DevPartner pages. Look for DPS 10.5 when it ships and pull down the eval to see if it meets your needs.
AQTime is nice, I used it several times and it helped me with some tricky bugs.
I used to use BoundsChecker, but nowadays I either use Microsoft's built-in CRT debug library or build my own.
If you're looking for a paid tool, then DevPartner is well worth using. It has memory leak detection for both managed and unmanaged code.
Application Verifier is free and from Microsoft. It detects memory leaks, double frees, overwrites and many other things. I use it all the time and it has helped me track down some nasty issues.
