I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments so that my kernel looks something like this:
__global__ void kernel(..., long long int *runtime)
{
long long int start = 0;
long long int stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));
/* Some code here */
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));
runtime[threadIdx.x] = stop - start;
...
}
The answer says to do a conversion as follows:
The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
For which I do:
for(long i = 0; i < size; i++)
{
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
}
Where 1.62 is the GPU Max Clock rate of my device. But the time I get in milliseconds does not look right because it suggests that each thread took minutes to complete. This cannot be correct as execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect or am I making a mistake somewhere? Thanks.
The correct conversion in your case is not GHz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
^^^^
but hertz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1620000000.0f)*1000.0);
^^^^^^^^^^^^^
In the dimensional analysis:
clock cycles / (clock cycles / second) = seconds
The first term is the clock-cycle measurement, the second term is the frequency of the GPU (in hertz, not GHz), and the third term is the desired measurement (seconds). You can convert to milliseconds by multiplying seconds by 1000.
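As a minimal sketch of that arithmetic (assuming the 1.62 GHz figure from the question really is the rate the clock ran at), the hard-coded conversion would look like this:
/* Hypothetical helper: convert clock64() ticks to milliseconds,
   assuming a fixed 1.62 GHz clock. */
double cycles_to_ms(long long cycles)
{
    const double clock_hz = 1.62e9;              /* 1.62 GHz expressed in Hz */
    return ((double)cycles / clock_hz) * 1000.0; /* seconds -> milliseconds */
}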
Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):
$ cat t1306.cu
#include <stdio.h>
const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;
__global__ void kernel(long long *clocks){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
long long start=clock64();
while (clock64() < start+delay_time);
if (idx < nthr) clocks[idx] = clock64()-start;
}
int main(){
int peak_clk = 1;
int device = 0;
long long *clock_data;
long long *host_data;
host_data = (long long *)malloc(nthr*sizeof(long long));
cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
err = cudaMalloc(&clock_data, nthr*sizeof(long long));
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms\n", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$
This uses cudaDeviceGetAttribute to get the clock rate, which is returned in kHz, allowing us to easily compute milliseconds in this case.
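Put another way, a clock of peak_clk kHz runs peak_clk cycles per millisecond, so dividing the cycle count by the kHz value yields milliseconds directly. As a tiny sketch (the helper name is mine):
double device_cycles_to_ms(long long cycles, int peak_clk_khz)
{
    /* peak_clk_khz kHz = peak_clk_khz * 1000 cycles per second
       = peak_clk_khz cycles per millisecond */
    return (double)cycles / (double)peak_clk_khz;
}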
In my experience, the above method works generally well on datacenter GPUs, which have the clock running at the reported rate (this may be affected by settings you make in nvidia-smi). Other GPUs, such as GeForce GPUs, may be running at (unpredictable) boost clocks that make this method inaccurate.
Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.
clock64 returns a value in graphics clock cycles. The graphics clock is dynamic so I would not recommend using a constant to try to convert to seconds. If you want to convert to wall time then the better option is to use globaltimer, which is a 64-bit clock register accessible as:
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
The unit is nanoseconds.
The default resolution is 32 ns, with an update every microsecond. The NVIDIA performance tools force the update to every 32 ns (i.e. 31.25 MHz). This clock is used by CUPTI as the start time when capturing a concurrent kernel trace.
Related
I have a program written in C++ on a Raspberry Pi 3. I mmap /dev/gpiomem to access the GPIO registers directly in user mode. Here is my function to write to output pins:
static uint32_t volatile *gpiopage; // initialized by mmap() of /dev/gpiomem
static uint32_t lastretries = 0;
void PhysLib::writegpio (uint32_t value)
{
uint32_t mask = 0xFFF000; // [11:00] input pins
// [23:12] output pins
// all the pins go through an inverter converting the 3.3V to 5V
gpiopage[GPIO_CLR0] = value;
gpiopage[GPIO_SET0] = ~ value;
// sometimes takes 1 or 2 retries to make sure signal gets out
uint32_t retries = 0;
while (true) {
uint32_t readback = ~ gpiopage[GPIO_LEV0];
uint32_t diff = (readback ^ value) & mask;
if (diff == 0) break;
if (++ retries > 1000) {
fprintf (stderr, "PhysLib::writegpio: wrote %08X mask %08X, readback %08X diff %08X\n",
value, mask, readback, diff);
abort ();
}
}
if (lastretries < retries) {
lastretries = retries;
printf ("PhysLib::writegpio: retries %u\n", retries);
}
}
Apparently there is some internal cache or something delaying the actual updating of the pins. So I'm wondering if there is some magic MRC or MCR or whatever that I can put so I don't have to read the pins to wait for the update to actually occur.
I'm quite sure this is happening because this code is part of a loop:
while (true) {
writegpio (0x800000); // set gpio pin 23
software timing loop for 1uS
writegpio (0); // clear gpio pin 23
software timing loop for 1uS
}
Sometimes Linux timeslices me during the software timing loops and I get a delay longer than 1 µs, which is OK for this project. Before I put in the code that loops until it reads back the updated bit, the voltage on the pin would sometimes stay high for longer than 1 µs and then low for correspondingly less than 1 µs, or vice versa, implying that the two timing loops did total 2 µs but the update of the actual pin was being delayed by the hardware. After inserting the corrective code, I always get at least 1 µs of high voltage and 1 µs of low voltage on each pass through the loop.
I am using perf in sampling mode to capture performance statistics of programs running on a multi-core NXP S32 platform running Linux 4.19.
Example configuration:
Core 0 - App0, Core 1 - App1, Core 2 - App2
Without sampling, i.e. at program level, App0 takes 6.9 seconds.
On sampling every 1 million cycles, App0 takes 6.3 sec.
On sampling every 2 million cycles, App0 takes 6.4 sec.
On sampling every 5 million cycles, App0 takes 6.5 sec.
On sampling every 100 million cycles, App0 takes 6.8 sec.
As you can see, with a higher sampling period (100 million cycles) App0 takes longer to finish execution.
Actually I would have expected the opposite: sampling every 1 million cycles should have resulted in the program taking more time to execute, due to the higher number of samples generated (perf overhead), compared to every 100 million cycles.
I am unable to explain this behavior; what do you think is causing it?
Any leads would be helpful.
P.S. On the Pi 3B the behavior is as expected, i.e. sampling every 1 million cycles results in a longer execution time than sampling every 100 million cycles.
UPDATE: I do not use perf from the command line; instead I make the perf_event_open system call with the following flags in struct perf_event_attr.
struct perf_event_attr hw_event;
pid_t pid = proccess_id; // measure the current process/thread
int cpu = -1; // measure on any cpu
unsigned long flags = 0;
int fd_current;
memset(&hw_event, 0, sizeof(struct perf_event_attr));
hw_event.type = event_type;
hw_event.size = sizeof(struct perf_event_attr);
hw_event.config = event;
if(group_fd == -1)
{
hw_event.sample_period = 2000000;
hw_event.sample_type = PERF_SAMPLE_READ;
hw_event.precise_ip = 1;
}
hw_event.disabled = 1; // off by default. specifies whether the counter starts out disabled or enabled.
hw_event.exclude_kernel = 0; // 0 = do not exclude events that happen in kernel space
hw_event.exclude_hv = 1; // excluding events that happen in the hypervisor
hw_event.pinned = pinned; // specifies the counter to be on the CPU if at all possible. applies only to hardware counters and only to group leaders.
hw_event.exclude_user = 0; // 0 = do not exclude events that happen in user space
hw_event.exclude_callchain_kernel = 0; // 0 = include kernel callchains
hw_event.exclude_callchain_user = 0; // 0 = include user callchains
hw_event.read_format = PERF_FORMAT_GROUP; // Allows all counter values in an event group to be read with one read
fd_current = syscall(__NR_perf_event_open, &hw_event, pid, cpu, group_fd, flags);
if (fd_current == -1) {
printf("Error opening leader %llx\n", hw_event.config);
exit(EXIT_FAILURE);
}
return fd_current;
So, I am using the LTC 2366, a 3 MSPS ADC, and with the code given below I was able to achieve a sampling rate of about 380 kSPS.
#include <stdio.h>
#include <time.h>
#include <bcm2835.h>
int main(int argc, char**argv) {
FILE *f_0 = fopen("adc_test.dat", "w");
clock_t start, end;
double time_taken;
if (!bcm2835_init()) {
return 1;
}
bcm2835_spi_begin();
bcm2835_spi_setBitOrder(BCM2835_SPI_BIT_ORDER_MSBFIRST);
bcm2835_spi_setDataMode(BCM2835_SPI_MODE0);
bcm2835_spi_setClockDivider(32);
bcm2835_spi_chipSelect(BCM2835_SPI_CS0);
bcm2835_spi_setChipSelectPolarity(BCM2835_SPI_CS0, LOW);
int i;
char buf_0[] = {0x01, (0x08|0)<<4, 0x00}; // really doesn't matter what this is
char readBuf_0[2];
start = clock();
for (i=0; i<380000; i++) {
bcm2835_spi_transfernb(buf_0, readBuf_0, 2);
fprintf(f_0, "%d\n", (readBuf_0[0]<<6) + (readBuf_0[1]>>2));
}
end = clock();
time_taken = ((double)(end-start)/CLOCKS_PER_SEC);
printf("%f", (double)(time_taken));
printf(" seconds \n");
bcm2835_spi_end();
bcm2835_close();
return 0;
}
This returns about 1 second every time.
When I used the exact same code with the LTC 2315, I still got a sampling rate of about 380 kSPS. How come? First of all, why is the 3 MSPS ADC giving me only 380 kSPS and not something like 2 MSPS? Second, when I change the ADC to something that's about 70% faster, I get the same sampling rate. Why is that? Is that the limit of the Pi? Is there any way of improving this to get at least 1 MSPS?
Thank you
I have tested the Raspberry Pi SPI a bit and found that it has some overhead. In my case I tried pyspi, where one byte seems to take at least 15 µs, with 75 µs between two words (see these captures). That's slower than what you measure, so good for you!
Increasing the SPI clock shortens the exchange itself, but not the overhead. Hence the total time hardly changes, because the overhead is the limiting factor. 380 kSPS means about 2.6 µs per sample, which may well be close to your overhead.
The easiest way to improve the ADC speed would be to use a parallel ADC instead of a serial one; that has the potential to increase the overall speed to 20 MSPS+.
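If you want to see how much of that ~2.6 µs is fixed overhead, one rough check (a sketch reusing the bcm2835 calls from the question; the transfer count, the clock_gettime timing, and the divider values are my own choices) is to time a batch of 2-byte transfers at two different clock dividers and compare:
#include <stdio.h>
#include <time.h>
#include <bcm2835.h>

/* Average microseconds per 2-byte SPI transfer over n transfers. */
static double us_per_transfer(int n)
{
    char tx[2] = {0x00, 0x00}, rx[2];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        bcm2835_spi_transfernb(tx, rx, 2);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3) / n;
}

int main(void)
{
    if (!bcm2835_init()) return 1;
    bcm2835_spi_begin();
    bcm2835_spi_setClockDivider(32);   /* same divider as in the question */
    printf("divider 32: %.2f us/transfer\n", us_per_transfer(100000));
    bcm2835_spi_setClockDivider(16);   /* twice the SPI clock */
    printf("divider 16: %.2f us/transfer\n", us_per_transfer(100000));
    bcm2835_spi_end();
    bcm2835_close();
    return 0;
}
If the divider-16 number is not close to half the divider-32 number, the fixed per-transfer cost (chip-select setup, driver calls) is what caps you at ~380 kSPS, not the ADC.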
My question is about understanding the Linux perf tool metrics. I did an optimisation related to prefetching/cache misses in my code, which is now faster. However, perf does not show me that (or, more likely, I do not understand what perf shows me).
Taking it back to where it all began: I did an investigation in order to speed up random memory access using prefetching.
Here is what my program does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
each value is a random index in the second buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger prefetch steps (see iterateOnPrefetchSteps in the code below)
At the end, I print the number of voluntary and involuntary CPU context switches
After my last tunings, my code is the following one:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <sched.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
#define PADDING 256
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + PADDING) * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
double sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
unsigned int index = indexBuffer[i];
unsigned int value = valueBuffer[index];
double s = sin(value);
sum += s;
}
return (sum);
}
unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
gettimeofday(&endTime, NULL);
printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
void testWithPrefetchStep(unsigned short prefetchStep)
{
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
int iterateOnPrefetchSteps()
{
printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
{
testWithPrefetchStep(prefetchStep);
}
}
void setCpuAffinity(int cpuId)
{
int pid=0;
cpu_set_t mask;
unsigned int len = sizeof(mask);
CPU_ZERO(&mask);
CPU_SET(cpuId,&mask);
sched_setaffinity(pid, len, &mask);
}
int main(int argc, char ** argv)
{
setCpuAffinity(7);
if (argc == 2)
{
testWithPrefetchStep(atoi(argv[1]));
}
else
{
iterateOnPrefetchSteps();
}
}
At the end of my previous Stack Overflow question I thought I had all the elements: in order to avoid cache misses I made my code prefetch data (using __builtin_prefetch) and my program was faster. Everything looked as normal as possible.
However, I wanted to study it using the Linux perf tool. So I launched a comparison between two executions of my program:
./TestPrefetch 0: doing so, the prefetch is inefficient because it is done on the data that is read just after (when the data is accessed, it cannot have been loaded into the CPU cache yet). Run duration: 21.346 seconds
./TestPrefetch 1: Here the prefetch is far more efficient because data is fetched one loop-iteration before it is read. Run duration: 12.624 seconds
The perf outputs are the following ones:
$ gcc -O3 TestPrefetch.c -o TestPrefetch -lm && for cpt in 0 1; do echo ; echo "### Step=$cpt" ; sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./TestPrefetch $cpt; done
### Step=0
prefetchStep = 0, Sum = -1107.523504 - Time: 21346278 micro-seconds = 21.346 seconds
Performance counter stats for './TestPrefetch 0':
24387,010283 task-clock (msec) # 1,000 CPUs utilized
97 274 163 155 cycles # 3,989 GHz
59 183 107 508 instructions # 0,61 insn per cycle
425 300 823 cache-references # 17,440 M/sec
249 261 530 cache-misses # 58,608 % of all cache refs
24,387790203 seconds time elapsed
### Step=1
prefetchStep = 1, Sum = -1107.523504 - Time: 12623665 micro-seconds = 12.624 seconds
Performance counter stats for './TestPrefetch 1':
15662,864719 task-clock (msec) # 1,000 CPUs utilized
62 159 134 934 cycles # 3,969 GHz
59 167 595 107 instructions # 0,95 insn per cycle
484 882 084 cache-references # 30,957 M/sec
321 873 952 cache-misses # 66,382 % of all cache refs
15,663437848 seconds time elapsed
Here, I have difficulties understanding why I am better off:
The number of cache-misses is almost the same (I even have a little bit more): I can't understand why, and (overall) if so, why am I faster?
what are the cache-references?
what are task-clock and cycles? Do they include the time waiting for data access in case of a cache miss?
I wouldn't trust the perf summary, as it's not very clear what each name represents and which perf counters they are programmed to follow. The default settings have also been known to count the wrong things (see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604).
What could happen here is that your cache-miss counter may also count the prefetch instructions (which may look like loads to the machine, especially as you descend the cache hierarchy). In that case, having more cache references (lookups) makes sense, and you would expect these requests to be misses (the whole point of a prefetch is to miss...).
Instead of relying on some ambiguous counter, find out the counter IDs and masks for your specific machine that represent demand-read lookups and misses, and see if they improved.
Edit: looking at your numbers again, I see an increase of ~50M accesses, but ~70M misses. It's possible that there are more misses due to cache thrashing done by the prefetches.
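For instance, perf's symbolic cache events can at least separate L1 demand loads from last-level traffic (a hedged example; these names still rely on perf's per-CPU event mapping, so raw event codes from your CPU's manual would be more precise):
for cpt in 0 1; do perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./TestPrefetch $cpt; done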
can't understand why and (overall) if so, why am I faster?
Because you run more instructions per unit of time. The old run:
0,61 insn per cycle
and the new one
0,95 insn per cycle
what are the cache-references?
They count how many times the cache was asked whether it contained the data you were loading/storing.
what are task-clock and cycles? Do they include the time waiting for data access in case of a cache miss?
Yes. But note that in today's processors there is no idle waiting for any of this. Instructions are executed out of order and data is usually prefetched; if the next instruction needs some data that is not ready, other instructions will get executed.
I recently made progress on my perf issues. I discovered a lot of new events, among which some are really interesting.
Regarding the current problem, the following event has to be considered: L1-icache-load-misses
When I monitor my test application with perf under the same conditions as previously, I get the following values for this event:
1 202 210 L1-icache-load-misses
against
530 127 L1-icache-load-misses
For the moment, I do not yet understand why cache-misses events are not impacted by prefetches while L1-icache-load-misses are...
I'm writing a small program in C, and I want to measure its performance.
I want to see how much time it runs on the processor and how many cache hits and misses it has made. Information about context switches and memory usage would be nice to have too.
The program takes less than a second to execute.
I like the information of /proc/[pid]/stat, but I don't know how to see it after the program has died/been killed.
Any ideas?
EDIT: I think Valgrind adds a lot of overhead. That's why I wanted a simple tool, like /proc/[pid]/stat, that is always there.
Use perf:
perf stat ./yourapp
See the kernel wiki perf tutorial for details. This uses the hardware performance counters of your CPU, so the overhead is very small.
Example from the wiki:
perf stat -B dd if=/dev/zero of=/dev/null count=1000000
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)
217 page-faults # 0.000 M/sec
3 CPU-migrations # 0.000 M/sec
83 context-switches # 0.000 M/sec
956.474238 task-clock-msecs # 0.999 CPUs
0.957617512 seconds time elapsed
There is no need to load a kernel module manually; on a modern Debian system (with the linux-base package) it should just work. With the perf record -a / perf report combo you can also do full-system profiling. Any application or library that has debugging symbols will show up with details in the report.
For visualization flame graphs seem to work well. (Update 2020: the hotspot UI has flame graphs integrated.)
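A typical flame-graph workflow looks like this (a sketch; it assumes Brendan Gregg's FlameGraph scripts, stackcollapse-perf.pl and flamegraph.pl, are on your PATH):
perf record -g ./yourapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg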
The best tool for you is called valgrind. It is capable of memory profiling, call-graph building and much more.
sudo apt-get install valgrind
valgrind ./yourapp
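For cache behaviour and call graphs specifically, the relevant Valgrind tools are cachegrind and callgrind, e.g.:
valgrind --tool=cachegrind ./yourapp   # simulated cache hit/miss counts
valgrind --tool=callgrind ./yourapp    # call graph and per-function costs (browse with kcachegrind)
Note that these tools run your program on a simulated CPU, so they add the kind of overhead mentioned in the question's edit.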
However, to obtain the time your program took to execute, you can use the time(1) Linux utility.
time ./yourapp
You can also use
/usr/bin/time -v YourProgram.exe
It will show you all this information:
/usr/bin/time -v ls
Command being timed: "ls"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 60%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4080
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 314
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
You can also use the -f flag to format the output to fit your needs.
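For example (these are standard GNU time format specifiers):
/usr/bin/time -f "elapsed: %e s, max RSS: %M kB, involuntary ctx switches: %c, voluntary: %w" ls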
Please be sure to call this program using its full path; otherwise the shell's built-in 'time' command will be invoked, and that's not what you need...
Hope this helps!
Linux perf_event_open system call with type = PERF_TYPE_HW_CACHE
perf is likely what OP wants as shown at https://stackoverflow.com/a/10114325/895245 but just for completeness, I'm going to show how to do this from inside a C program if you control the source code.
This method can allow for more precise measurements of a specific region of interest within the program. It can also get separate cache hit/miss counts for each different cache level. This syscall likely shares the same backend as perf.
This example is basically the same as Quick way to count number of instructions executed in a C program but with PERF_TYPE_HW_CACHE. By doing:
man perf_event_open
you can see that in this example we are counting only:
L1 data cache (PERF_COUNT_HW_CACHE_L1D)
reads (PERF_COUNT_HW_CACHE_OP_READ), not writes or prefetches
misses (PERF_COUNT_HW_CACHE_RESULT_MISS), not hits
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
char *chars, c;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
chars = malloc(n * sizeof(char));
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HW_CACHE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CACHE_L1D |
PERF_COUNT_HW_CACHE_OP_READ << 8 |
PERF_COUNT_HW_CACHE_RESULT_MISS << 16;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
/* Write the memory to ensure misses later. */
for (size_t i = 0; i < n; i++) {
chars[i] = 1;
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Read from memory. */
for (size_t i = 0; i < n; i++) {
c = chars[i];
}
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
free(chars);
}
With this, I get results increasing linearly like:
./main.out 100000
# 1565
./main.out 1000000
# 15632
./main.out 10000000
# 156641
From this we can estimate a cache line size of 100000/1565 ~ 63.9, which closely matches the value of 64 reported by getconf LEVEL1_DCACHE_LINESIZE on my computer, so I guess it is working.