ARM Cortex-A8: How to measure cache utilization? - memory-management

I have a Freescale i.MX515EVK, an ARM Cortex-A8/Ubuntu platform. Unfortunately, the Linux kernel on the board does not support some of the well-known profilers such as OProfile or Zoom Profiler (Zoom supports ARM processors, but internally it uses the OProfile driver), which give very detailed reports about cache utilization.
The Cortex-A8 has 32KB instruction and data caches and a 256KB L2 cache. Currently, while my image-processing algorithm is running, I'm completely blind to how they are being used.
Are there any other methods, besides using profilers, to find out cache hits and misses?

Install Valgrind (it supports ARM nowadays) and use the Cachegrind tool to check cache utilization. If you are running Ubuntu on the device, it should be as simple as sudo apt-get install valgrind. Valgrind can also help you simulate what would happen with different cache sizes.
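For example, a run that roughly models the A8's cache geometry (32KB 4-way L1 caches, a 256KB 8-way L2, 64-byte lines; ./my_algo below is just a placeholder for your own binary) could look like this:
valgrind --tool=cachegrind --I1=32768,4,64 --D1=32768,4,64 --LL=262144,8,64 ./my_algo
cg_annotate cachegrind.out.<pid>
(Older Valgrind versions call the last-level cache option --L2 instead of --LL.) cg_annotate then breaks the hit/miss counts down per function and source line.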

Related

Low hardware simulation for performance profiling

I need to optimize the app I'm working on, and I can't get reliable profiling data on my development machine. The app should run on low-end ARM hardware under QNX, but for logistical reasons I don't have access to the final hardware for profiling.
I've tried profiling on my development machine, but as you can imagine everything is so fast that I can't pinpoint the slow parts. I've created a Linux virtual machine with reduced memory and CPU core count, but it is still too fast compared to the final hardware.
Is it possible to reduce the CPU clock speed/RAM speed/disk speed in a virtual machine to simulate low-performance hardware, or is there any other way to get relevant profiling data on my development machine?
Considering the app processes several gigabytes of data, I assume disk access is a major bottleneck, and limiting disk speed might help.
I can use any (open source or commercially available) tool/approach that runs on Windows/Linux/macOS on a real or virtual machine.
This URL describes how to limit disk bandwidth on VirtualBox images. You could run a Linux VM on VirtualBox and use this method to limit disk access speeds, turn off disk caching using suggestions from this answer, and profile your application. Alternatively, you can download the QNX SDP, which comes with the option of a prebuilt x86_64 virtual machine image that can be run under VMware/VirtualBox/QEMU.
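For reference, the VBoxManage commands for such a bandwidth group look roughly like the following (the VM name, controller name, medium and the 10 MB/s limit are placeholders to adapt):
VBoxManage bandwidthctl "MyVM" add SlowDisk --type disk --limit 10M
VBoxManage storageattach "MyVM" --storagectl "SATA" --port 0 --device 0 --type hdd --medium slow.vdi --bandwidthgroup SlowDisk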
My previous experience with QNX on armv7 and x86_64 suggests that the devb-sdmmc driver can be a bottleneck when reading a lot of big files from flash storage. devb-sdmmc and io-blk often require fine-tuning; setting appropriate cache, block, and read-ahead sizes and other driver parameters helps improve disk access performance.

How does memory usage in Windows affect performance

I'm running Windows 10 with 4GB of DDR3-1066 on a second-generation Intel mobile i5.
I come from an OS X background mostly, and memory has always been a concern for me because I prefer to have many tabs open. I noticed that on OS X memory usage didn't relate that much to application performance as long as it wasn't fully saturated; on my iMac I can easily run at 80% memory usage and notice no lag or stuttering. On Windows, however, I'm finding memory to be the major bottleneck in my system. I understand that upgrading to 8 or 16GB of memory is the upgrade path for me, but I would love to understand why my system slows down noticeably when I saturate 80% of memory, while OS X seems to handle the same situation just fine. Is it a bandwidth limitation? I know that Windows NT and Darwin are completely different kernels, and I would love to be educated on exactly how that makes the same usage scenario behave so differently.
Thank you in advance.

How to measure L1, L2, L3 cache hits & misses in OSX

I have a C++ program and I would like to quantify its performance by checking the number of hits and misses against the CPU cache.
What's the best way to do it?
I tried using Intel's Performance Counter Monitor, but it uses an unsigned kernel extension, and those are disabled on Yosemite. I can obviously disable the check so that unsigned kexts load, but I would rather not go down that path.
Is there any other possible way that I'm unaware of?
You can enable unsigned kernel extensions on OS X (a reboot is required afterwards):
sudo nvram boot-args=kext-dev-mode=1
This enables developer mode on your machine, and you can then run Intel Performance Counter Monitor as long as it supports Mac OS X 10.10 (Yosemite) in general.
Don't forget to disable it again after you are done with testing (it is a security issue otherwise):
sudo nvram boot-args=kext-dev-mode=0
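With the kext loaded, a typical PCM invocation is something like the following (the binary name varies between versions, e.g. pcm.x in older releases and pcm in newer ones; the trailing 1 is the refresh interval in seconds):
sudo ./pcm 1
It prints the hardware counter values, including L2/L3 cache hit ratios, once per second.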
As far as I know, Intel's tool is far better than Cachegrind because it reads actual hardware counters instead of simulating a CPU and its cache characteristics in software.
You could, in principle, apply for a kext signing certificate, if you're an Apple developer programme member, and sign the kext yourself. But they generally don't hand them out for internal use, and recommend you enable kext-dev-mode or disable SIP (depending on version). Another good path would be to ask Intel to provide a signed version of their kext!

qemu vs qemu-kvm: some performance measurements

I conducted the following benchmark in qemu and qemu-kvm, with the following configuration:
CPU: AMD 4400 dual-core processor with SVM enabled, 2GB RAM
Host OS: openSUSE 11.3 with the latest patches, running KDE 4
Guest OS: FreeDOS
Emulated Memory: 256M
Network: Nil
Language: Turbo C 2.0
Benchmark Program: Count from 0000000 to 9999999, displaying the counter on the screen
by directly accessing screen memory (i.e. 0xB800:xxxx)
It only takes 6 sec when running in qemu.
But it takes 89 sec when running in qemu-kvm.
I ran the benchmark one by one, not in parallel.
I scratched my head the whole night but still have no idea why this happens. Could somebody give me some hints?
KVM uses QEMU as its device emulator; any device operation is simulated by the user-space QEMU process. When you write to 0xB8000 you are touching the emulated graphics display, which forces the guest to do a CPU `vmexit' from guest mode back into the KVM module, which in turn sends a device-emulation request to the user-space QEMU backend.
In contrast, QEMU without KVM does all of this inside a single process; apart from the usual system calls, there are far fewer CPU context switches. Meanwhile, your benchmark code is a simple loop that only needs to be translated into a code block once. That costs next to nothing compared to the vmexit and kernel-user communication on every iteration in the KVM case.
This should be the most probable cause.
Your benchmark is I/O-intensive, and all the I/O devices are actually the same for qemu and qemu-kvm; in QEMU's source code they live in hw/*.
That explains why qemu-kvm cannot be much faster than qemu here, but it does not by itself explain the slowdown. I have the following explanation, which I think is correct to a large extent.
"The qemu-kvm module uses the kvm kernel module in linux kernel. This runs the guest in x86 guest mode which causes a trap on every privileged instruction. On the contrary, qemu uses a very efficient TCG which translates the instructions it sees at the first time. I think that the high-cost of trap is showing up in your benchmarks." This ain't true for all io-devices though. Apache benchmark would run better on qemu-kvm because the library does the buffering and uses least number of privileged instructions to do the IO.
The reason is that too many VMEXITs take place.

Decreasing performance of dev machine to match end-user's specs

I have a web application, and my users are complaining about performance. I have been able to narrow it down to JavaScript issues in IE6, which I need to resolve. I have found the excellent dynaTrace AJAX tool, but my problem is that I don't see any issues on my dev machine.
The problem is that my users' computers are ancient, so timings which are barely noticeable on my machine are perhaps 3-5 times longer on theirs, and suddenly the problem is a lot larger. Is it possible somehow to degrade the performance of my dev machine, or preferably of a VM running on my dev machine, to the specs of my customers' computers?
I don't know of any virtualization solutions that can do this, but I do know that the computer/CPU emulator Bochs allows you to specify a limit on the number of emulated instructions per second, which you can use to simulate slower CPUs.
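For example, in a bochsrc configuration file the limit is the ips parameter on the cpu line; the value below is just an illustrative guess at a fairly slow machine:
cpu: count=1, ips=5000000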
I am not sure if you can cap the CPU, but in VirtualBox or Parallels you can cap the memory usage. I assume that if you only give it about 128MB it will be very slow. You can also limit the network throughput with a lot of tools. The only thing I am not sure about is the CPU; that's tricky. Curious to know what you find. :)
You could get a copy of VMWare Workstation and choke the CPU of your VM.
With most virtual PC software you can limit the amount of RAM, but you are not able to set the CPU to a slower speed because it does not emulate a CPU; it uses the host CPU directly.
You could go with emulation software like Bochs, which will let you set up an x86 processor environment.
You may try Fossil Toys
* PC Speed
PC CPU speed monitor / benchmark. With logging facility.
* Memory Load Test
Test application/operating system behaviour under low memory conditions.
* CPU Load Test
Test application/operating system behaviour under high CPU load conditions.
Although it doesn't simulate a specific CPU clock speed.
