Why is my program slow when accessing array elements at 4K offsets? - performance

I wrote a program to determine the cache and cache line size of my computer, but I got a result that I can't explain. Could anyone help me explain it?
Here is my program. access_array() traverses the array with different step sizes, and I measure the execution time for each step size.
// Program to calculate the L1 cache line size; compile with g++ -O1
#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;

#define ARRAY_SIZE (256 * 1024) // arbitrary array size; must be a power of two so the mask below works

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = (idx + steps) & (ARRAY_SIZE - 1); // with %, the latency would be too high to see the gap
    }
}

int main(int argc, char** argv)
{
    double cpu_us_used;
    struct timeval start, end;

    for (int step = 1; step <= ARRAY_SIZE; step *= 2)
    {
        char* arr = new char[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            arr[i] = 0;
        }

        gettimeofday(&start, NULL); // get start clock
        access_array(arr, step);
        gettimeofday(&end, NULL);   // get end clock

        cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        cout << step << " , " << cpu_us_used << endl;

        delete[] arr;
    }
    return 0;
}
Result
My question is:
From 64 to 512, I can't explain why the execution time is almost the same, and why there is linear growth from 1K to 4K.
Here are my assumptions.
For step = 1, every 64 iterations cause one cache line miss, and after 32K iterations the L1 cache is full, so we have an L1 collision & capacity miss every 64 iterations.
For step = 64, every iteration causes one cache line miss, and after 512 iterations the L1 cache is full, so we have an L1 collision & capacity miss every iteration.
As a result, there is a gap between step = 32 and 64.
By observing the first gap, I can conclude that the L1 cache line size is 64 bytes.
For step = 512, every iteration causes one cache line miss, and after 64 iterations sets 0, 8, 16, 24, 32, 40, 48, 56 of the L1 cache are full, so we have an L1 collision miss every iteration.
For step = 4K, every iteration causes one cache line miss, and after 8 iterations set 0 of the L1 cache is full, so we have an L1 collision miss every iteration.
For the 128 to 4K cases, they all suffer L1 collision misses; the only difference is that with a larger step, the collision misses start earlier.
The only idea I can come up with is that some other mechanism (maybe paging, the TLB, etc.) impacts the execution time.
Here is the cache size & CPU info of my workstation. By the way, I have run this program on my PC as well, and I got similar results.
Platform : Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 15728640
LEVEL3_CACHE_ASSOC 20
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0

This CPU probably has:
a hardware cache line prefetcher, that detects linear access patterns within the same physical 4 KiB page and prefetches them before an access is made. This stops prefetching at 4 KiB boundaries (because the physical address is likely to be very different and unknown).
a hardware TLB prefetcher, that detects linear access patterns in TLB usage and prefetches TLB entries.
From 1 to 16 the cache line prefetcher is doing its job, fetching cache lines before you access them, so execution time remains the same (unaffected by cache misses).
At 32, the cache line prefetcher starts to struggle (due to the "stop at 4 KiB page boundary" thing).
From 64 to 512 the TLB prefetcher is doing its job, fetching TLB entries before you access them, so execution time remains the same (unaffected by TLB misses).
From 512 to 4096 the TLB prefetcher is failing to keep up. The CPU stalls waiting for TLB info once every "4096/step" accesses, and these stalls cause "linear-ish" growth in execution time.
From 4096 to 131072, I'd like to assume that "new char[ARRAY_SIZE];" allocates so much space that the library and/or OS decided to give you 2 MiB pages and/or 1 GiB pages, eliminating some TLB misses and improving execution time as the number of pages being accessed decreases.
For "larger than 131072"; I'd assume you start to see the effects of "1 GiB page TLB miss".
Note that it's probably easier (and less error prone) to get the cache characteristics (size, associativity, how many logical CPUs are sharing it, ..) and cache line size from the CPUID instruction. The approach you're using is more suited to measuring cache latency (how long it takes to get data from one of the caches).
Also; to reduce TLB interference the OS might allow you to explicitly ask for 1 GiB pages (e.g. mmap(..., MAP_POPULATE | MAP_HUGE_1GB, ... ) on Linux); and you can "pre-warm" the TLB by doing a "touch then CLFLUSH" warm-up loop before you start measuring. The hardware cache-line prefetcher can be disabled via a flag in an MSR (if you have permission), or can be defeated by using a "random" (unpredictable) access pattern.
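For example, a minimal sketch of asking Linux for an explicit 1 GiB huge page (my own illustration, not from the original answer; it assumes 1 GiB huge pages have been reserved, e.g. via hugepagesz=1G on the kernel command line, and falls back to normal pages otherwise):

#include <sys/mman.h> // on older glibc you may also need <linux/mman.h> for MAP_HUGE_1GB
#include <cstdio>

int main() {
    const size_t size = 1UL << 30; // one 1 GiB page
    // Try to map an explicit 1 GiB huge page, pre-faulted (MAP_POPULATE) so the
    // page tables and TLB entries exist before any measurement starts.
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB | MAP_POPULATE,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap with MAP_HUGE_1GB failed, falling back to normal pages");
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
    }
    // ... run the timing loop over p here ...
    munmap(p, size);
    return 0;
}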

Finally I found the answer.
I tried setting the array size to 16 KiB and smaller, but it was still slow at step = 4 KiB.
I also tried changing the step sweep from doubling the step each iteration to increasing it by 1, and it still slowed down at step = 4 KiB.
Code
#define ARRAY_SIZE (4200)

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = idx + steps;
        if (idx >= ARRAY_SIZE)
            idx = 0;
    }
}
for (int step = 4090; step <= 4100; step++)
{
    char* arr = new char[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        arr[i] = 0;
    }

    gettimeofday(&start, NULL); // get start clock
    access_array(arr, step);
    gettimeofday(&end, NULL);   // get end clock

    cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    cout << step << " , " << cpu_us_used << endl;

    delete[] arr;
}
Result
4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , **51818**
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149
As a result, I suspect it is not related to any cache / TLB / prefetch mechanism.
After more googling of the relationship between performance and the magic number "4K", I found the 4K aliasing problem on Intel platforms, which slows down loads.
This occurs when a load is issued soon after a store and their memory addresses are offset by 4K. When this is processed in the pipeline, the issue of the load matches the previous store (the full address is not used at this point), so the pipeline tries to forward the result of the store and avoid doing the load (this is store forwarding). Later on, when the address of the load is fully resolved, it does not match the store, and so the load has to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but could be worse in certain situations, like with unaligned loads that span 2 cache lines.
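To see the effect in isolation, here is a minimal sketch I put together (it is not from the original post): it stores to one address and then immediately loads from an address at a chosen offset. When the low 12 bits of the two addresses match (e.g. an offset of exactly 4096), the false store-forwarding match described above should make the loop measurably slower than for nearby offsets:

#include <chrono>
#include <cstdio>

// Store to buf[0], then load from buf[offset]; volatile keeps both accesses in the loop.
static long run(volatile char* buf, int offset, long iters) {
    long sum = 0;
    for (long i = 0; i < iters; i++) {
        buf[0] = (char)i;      // store
        sum += buf[offset];    // load issued right after the store
    }
    return sum;
}

int main() {
    static char buf[8192 + 64];
    const long iters = 100000000;
    for (int offset : {4000, 4096, 4160}) {   // only 4096 aliases with the store to buf[0]
        auto t0 = std::chrono::steady_clock::now();
        volatile long sink = run(buf, offset, iters);
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("offset %5d : %lld us\n", offset, us);
    }
    return 0;
}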

how do we calculate the number of reads/misses of the cache in this code snippet?

Given this code snippet from the textbook I am currently studying: Randal E. Bryant, David R. O’Hallaron, Computer Systems: A Programmer’s Perspective, 3rd ed. (Pearson, 2016) (the global edition, so the book's exercises could be wrong):
for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}
for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
and this is the information given
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae. You are evaluating its cache performance on a
machine with a 2,048-byte direct-mapped data cache with 32-byte blocks (B = 32).
struct algae_position {
    int x;
    int y;
};

struct algae_position grid[32][32];
int total_x = 0, total_y = 0;
int i, j;
You should also assume the following:
sizeof(int) = 4.
grid begins at memory address 0.
The cache is initially empty.
The only memory accesses are to the entries of the array grid.
Variables i, j, total_x, and total_y are stored in registers.
The book gives the following questions as practice:
A. What is the total number of reads?
Answer given : 2048
B. What is the total number of reads that miss in the cache?
Answer given : 1024
C. What is the miss rate?
Answer given: 50%
I'm guessing for A the answer is derived from 32 * 32 * 2: 32 * 32 for the dimensions of the matrix and 2 because there are two separate loops for the x and y values. Is this correct? How should the total number of reads be counted?
How do we calculate the total number of misses that happen in the cache, and the miss rate? I read that the miss rate is (1 - hit rate).
Question A
You are correct about 32 x 32 x 2 reads.
Question B
The loops count down from 31 towards 0, but that doesn't matter for this question; the answer is the same for loops going from 0 to 31. Since that is a bit easier to explain, I'll assume increasing loop counters.
When you read grid[0][0], you'll get a cache miss. This will bring grid[0][0], grid[0][1], grid[0][2] and grid[0][3] into the cache. This is because each element is 2x4 = 8 bytes and the block size is 32. In other words: 32 / 8 = 4 grid elements in one block.
So the next cache miss is for grid[0][4] which again will bring the next 4 grid elements into the cache. And so on... like:
miss
hit
hit
hit
miss
hit
hit
hit
miss
hit
hit
hit
...
So in the first loop you simply have:
"Number of grid elements" divided by 4.
or
32 * 32 / 4 = 256
In general in the first loop:
Misses = NumberOfElements / (BlockSize / ElementSize)
so here:
Misses = 32*32 / (32 / 8) = 256
Since the cache size is only 2048 bytes and the whole grid is 32 x 32 x 8 = 8192 bytes, nothing read into the cache in the first loop will generate a cache hit in the second loop. In other words, both loops will have 256 misses.
So the total number of cache misses is 2 x 256 = 512.
Also notice that there seems to be a bug in the book.
Here:
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae.
^^^
Hmmm... 512 elements...
Here:
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
^^^^^^
hmmm... 32 x 32 is 1024
So the loops access 1024 elements, but the text says 512. Something is wrong in the book.
Question C
Miss rate = 512 misses / 2048 reads = 25 %
note:
Being very strict, we cannot say for sure that the element size is two times the integer size. The C standard allows structs to contain padding. So in principle there could be 8 bytes of padding in the struct (i.e. an element size of 16), and that would give the results that the book says.
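For anyone who wants to double-check the counting, here is a small direct-mapped cache simulation (my own sketch, assuming no struct padding and the 2,048-byte / 32-byte-block geometry above); it reproduces the 2,048 reads, 512 misses and 25% miss rate:

#include <cstdio>

int main() {
    const int BLOCK = 32, SETS = 2048 / BLOCK;   // 64 sets, direct-mapped
    long tags[SETS];
    for (int s = 0; s < SETS; s++) tags[s] = -1; // cache starts empty

    long reads = 0, misses = 0;
    auto access = [&](long addr) {
        long block = addr / BLOCK;
        int  set   = (int)(block % SETS);
        reads++;
        if (tags[set] != block) { misses++; tags[set] = block; }
    };

    // grid begins at address 0; each element is 8 bytes: x at offset 0, y at offset 4
    for (int i = 31; i >= 0; i--)
        for (int j = 31; j >= 0; j--)
            access((i * 32L + j) * 8 + 0);       // total_x += grid[i][j].x
    for (int i = 31; i >= 0; i--)
        for (int j = 31; j >= 0; j--)
            access((i * 32L + j) * 8 + 4);       // total_y += grid[i][j].y

    printf("reads=%ld misses=%ld miss rate=%.0f%%\n",
           reads, misses, 100.0 * misses / reads);
    return 0;
}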

Allocate maximum memory available

I am currently trying to write a program which is supposed to allocate the maximum available memory. I arrived at a solution which narrows down the range of potentially available memory until both bounds are equal (see listing).
void enforceMemoryLeakage(void** arrayOfAllocMemory)
{
    unsigned int maxMemory = 0x80000000;
    unsigned int minMemory = 0x50000000;
    unsigned int attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
    void* pAllocMemory;

    while ((maxMemory - minMemory) > 1)
    {
        pAllocMemory = malloc(attempAllocatedMemory);
        if (pAllocMemory != NULL)
        {
            minMemory = attempAllocatedMemory;
            attempAllocatedMemory += (maxMemory - minMemory) / 2;
            free(pAllocMemory);
        }
        else
        {
            maxMemory = attempAllocatedMemory;
            attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
        }
    }
    arrayOfAllocMemory[0] = malloc(maxMemory);

    void* pAllocAdditionalMemory = malloc(100);
    if (pAllocAdditionalMemory == NULL)
        std::cout << "Maximum memory: " << minMemory << "\n";
}
The code displayed above works fine. However, after the following is executed:
void* pAllocAdditionalMemory = malloc(100);
if (pAllocAdditionalMemory == NULL)
    std::cout << "Maximum memory: " << minMemory << "\n";
I would have expected that there is no further memory available, but that is not what happens, which brings me to my actual question: why does the approach shown above not work?
Best regards
Ratbald
You did not specify the OS or platform, so I am completely guessing here; read with extreme prejudice...
Assuming you do not have a bug in your binary search code, my bet is that you are facing memory fragmentation. While you are successfully allocating and freeing memory during execution, other processes can do the same, so your memory may become fragmented. Example:
the OS has a 2 MByte chunk of contiguous free memory
you allocate 1.5 MByte of it (0.5 MByte free)
some other process allocates 1 KByte (0.499 MByte free)
you free your 1.5 MByte (1.5 + 0.499 MByte free, in two fragments)
and attempt to allocate 1.75 MByte
but the OS does not have 1.75 MByte in a single contiguous chunk, so it fails; hence you allocate the 1.5 MByte again (0.499 MByte free)
you successfully allocate 1 KByte (0.498 MByte free)
Depending on the OS memory-management strategy, you sometimes do not even need another process to interfere in order to fragment memory...
There are, however, other possibilities not related to fragmentation. Under emulation or WOW64, the OS will not give you all of the available RAM, and there are also limits on the size of a single contiguous chunk. For example, Win32 will not allow a single allocation of more than ~1.25 GByte, but that does not mean there is only 1.25 GByte of free RAM...

Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while (...) {
    __m128i in = _mm_loadu_si128(inptr);
    __m128i out = in; // real code does more than this, but I've simplified it
    _mm_stream_si128(outptr, out);
    inptr  += 12;
    outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake machines,
and also stuck in vzeroupper instructions to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses (I put in checks to verify this). inptr is not aligned, by design. Commenting out the loads has no effect; the limiting instructions are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) why is there such a big difference between Haswell and Skylake when writing with the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, an E5-2680 v3 (Haswell) and an 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mysticial made me check that the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So I changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit: purposeful misalignment
I tried the test suggested by @HadiBrais. I purposely allocated my vector 64-byte aligned, then added 16 or 32 bytes inside the allocator so that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):
template <class T>
struct Mallocator {
    typedef T value_type;
    Mallocator() = default;

    template <class U>
    constexpr Mallocator(const Mallocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
        if (!p1) throw std::bad_alloc();
        p1 += 32; // misalign on purpose
        return reinterpret_cast<T*>(p1);
    }

    void deallocate(T* p, std::size_t) noexcept {
        uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
        p1 -= 32;
        std::free(p1);
    }
};

template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels / 32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
    // std::cout << std::hex << (void*) ptr << " " << (void*) tile << std::endl;
    __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
    __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
    __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
    __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

    __m128i tileVal0 = value0;
    __m128i tileVal1 = value1;
    __m128i tileVal2 = value2;
    __m128i tileVal3 = value3;

    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

    ptr    += diffBytes * 4;
    count  -= diffBytes * 4;
    tile   += diffPixels * 4;
    ipixel += diffPixels * 4;
    if (ipixel == 32)
    {
        // go to next tile
        ipixel = 0;
        tileIter++;
        tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
    }
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
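As an illustration of the "write the full line" advice, here is a sketch of what the store side could look like when each destination line is 64-byte aligned and written in one go (my own example, not the OP's code; dst64 is assumed to be 64-byte aligned):

#include <emmintrin.h>
#include <cstdint>

// Copy exactly one 64-byte cache line with non-temporal stores: all four 16-byte
// stores hit the same line back-to-back, so the line fill buffer holds a full line
// and can be handed off without partial-line handling.
void copy_line_nt(const uint8_t* src, uint8_t* dst64) {
    __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 0));
    __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
    __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
    __m128i v3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst64 + 0),  v0);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst64 + 16), v1);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst64 + 32), v2);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst64 + 48), v3);
}
// After the last streaming store of a buffer, _mm_sfence() orders the NT stores
// before any subsequent flag/pointer update that publishes the data.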

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <err.h>

int cpu_ms() {
    return (int)(clock() * 1000 / CLOCKS_PER_SEC);
}

int main(int argc, char** argv) {
    if (argc < 2) errx(EXIT_FAILURE, "provide the array size in KB on the command line");
    size_t size = atol(argv[1]) * 1024;
    unsigned char *p = malloc(size);
    if (!p) errx(EXIT_FAILURE, "malloc of %zu bytes failed", size);
    int fill = argv[2] ? argv[2][0] : 'x';
    memset(p, fill, size);

    int startms = cpu_ms();
    printf("allocated %zu bytes at %p and set it to %d in %d ms\n", size, p, fill, startms);

    // wait until 500ms has elapsed from start, so that perf gets the read phase
    while (cpu_ms() - startms < 500) {}
    startms = cpu_ms();

    // we start measuring with perf here
    unsigned char sum = 0;
    for (size_t off = 0; off < 64; off++) {
        for (size_t i = 0; i < size; i += 64) {
            sum += p[i + off];
        }
    }

    int delta = cpu_ms() - startms;
    printf("sum was %u in %d ms \n", sum, delta);

    return EXIT_SUCCESS;
}
This allocates an array of size bytes (which is passed in on the command line, in KiB), sets all bytes to the same value (the memset call), and finally loops over the array in a read-only manner, striding by one cache line (64 bytes), and repeats this 64 times so that each byte is accessed once.
If we turn prefetching off[1], we expect this to hit 100% in a given level of cache if size fits in that cache, and to mostly miss at that level otherwise.
I'm interested in two events, l2_lines_out.silent and l2_lines_out.non_silent (and also l2_trans.l2_wb - but the values end up identical to non_silent), which count the lines that are silently dropped from L2 and the lines that are not.
If we run this from 16 KiB up through 1 GiB, and measure these two events (plus l2_lines_in.all) for the final loop only, we get:
The y-axis here is the number of events, normalized to the number of accesses in the loop. For example, the 16 KiB test allocates a 16 KiB region, and makes 16,384 accesses to that region, and so a value of 0.5 means that on average 0.5 counts of the given event occurred per access.
The l2_lines_in.all behaves almost as we'd expect. It starts off around zero and when the size exceeds the L2 size it goes up to 1.0 and stays there: every access brings in a line.
The other two lines behave weirdly. In the region where the test fits in the L3 (but not in the L2), the evictions are nearly all silent. However, as soon as the region no longer fits in the L3 and spills into main memory, the evictions are all non-silent.
What explains this behavior? It's hard to understand why the nature of the L2 evictions would depend on whether the underlying region fits in the L3.
If you do stores instead of loads, almost everything is a non-silent writeback as expected, since the update value has to be propagated to the outer caches:
We can also take a look at what level of the cache the accesses are hitting in, using the mem_inst_retired.l1_hit and related events:
If you ignore the L1 hit counters, which seem impossibly high at a couple of points (more than 1 L1 hit per access?), the results look more or less as expected: mostly L2 hits when the region fits cleanly in L2, mostly L3 hits for the L3 region (up to 6 MiB on my CPU), and then misses to DRAM thereafter.
You can find the code on GitHub. The details on building and running can be found in the README file.
I observed this behavior on my Skylake client i7-6700HQ CPU. The same effect seems not to exist on Haswell[2]. On Skylake-X, the behavior is totally different, as expected, as the L3 cache design has changed to be something like a victim cache for the L2.
[1] You can do it on recent Intel with wrmsr -a 0x1a4 "$((2#1111))". In fact, the graph is almost exactly the same with prefetch on, so turning it off is mostly just to eliminate a confounding factor.
[2] See the comments for more details, but briefly: l2_lines_out.(non_)silent doesn't exist there, but l2_lines_out.demand_(clean|dirty) does, which seems to have a similar definition. More importantly, l2_trans.l2_wb, which mostly mirrors non_silent on Skylake, also exists on Haswell, appears to mirror demand_dirty, and it does not exhibit the effect on Haswell.

Array of 10000 having 16bit elements, find bits set (unlimited RAM) - Google interview

This was asked in my Google interview recently, and I offered an answer that involved bit shifting and was O(n), but she said this is not the fastest way to do it. I don't understand: is there a way to count the set bits without having to iterate over all of the bits?
Brute force: 10000 * 16 * 4 = 640,000 ops (shift, compare, increment and loop iteration for each of the 16 bits in a word).
Faster way:
We can build a table 00-FF -> number of bits set: 256 * 8 * 4 = 8192 ops.
I.e. we build a table where for each byte value we calculate the number of bits set.
Then for each 16-bit int we split it into its upper and lower bytes:
for (n in array) {
    byte lo = n & 0xFF; // lower 8 bits
    byte hi = n >> 8;   // upper 8 bits
    // simply add the number of bits set in the upper and lower parts
    // of each 16-bit number, using the pre-calculated table
    k += table[lo] + table[hi];
}
That is 60,000 ops in the iteration, i.e. 68,192 ops in total. It's still O(n), but with a smaller constant (~9 times less).
In other words, we calculate the number of bits set for every 8-bit value once, and then split each 16-bit number into two 8-bit halves and count their set bits using the pre-built table.
There's (almost) always a faster way. Read up about lookup tables.
I don't know what the correct answer was when this question was asked, but I believe the most sensible way to solve this today is to use the POPCNT instruction. Specifically, you should use the 64-bit version. Since we just want the total number of set bits, boundaries between 16-bit elements are of no interest to us. Since the 32-bit and 64-bit POPCNT instructions are equally fast, you should use the 64-bit version to count four elements' worth of bits per cycle.
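A sketch of that idea (my own example, not from the interview; it uses the _mm_popcnt_u64 intrinsic, so compile with -mpopcnt or -msse4.2, or substitute __builtin_popcountll):

#include <nmmintrin.h> // _mm_popcnt_u64
#include <cstdint>
#include <cstring>
#include <cstddef>

// Count all set bits in an array of 16-bit elements, 64 bits (4 elements) at a time.
uint64_t count_bits(const uint16_t* a, std::size_t n) {
    uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t w;
        std::memcpy(&w, a + i, sizeof w); // reinterpret 4 elements as one 64-bit word
        total += _mm_popcnt_u64(w);
    }
    for (; i < n; i++)                    // handle any leftover elements
        total += _mm_popcnt_u64(a[i]);
    return total;
}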
I just implemented it in Java:
import java.util.Random;

public class Main {

    static int array_size = 1024;
    static int[] array = new int[array_size];
    static int[] table = new int[256];
    static int total_bits_in_the_array = 0;

    private static void create_table() {
        // table[i] = number of bits set in the byte value i
        for (int i = 0; i < 256; i++) {
            int bits_set = 0;
            for (int z = 0; z < 8; z++) {
                bits_set += i >> z & 0x1;
            }
            table[i] = bits_set;
            //System.out.println("i = " + i + " bits_set = " + bits_set);
        }
    }

    public static void main(String args[]) {
        create_table();
        fill_array();
        parse_array();
        System.out.println("The amount of bits in the array is: " + total_bits_in_the_array);
    }

    private static void parse_array() {
        for (int i = 0; i < array.length; i++) {
            int current = array[i];
            int down = current & 0xff;        // lower byte
            int up = (current >> 8) & 0xff;   // upper byte
            total_bits_in_the_array += table[up] + table[down];
        }
    }

    private static void fill_array() {
        Random ran = new Random();
        for (int i = 0; i < array.length; i++) {
            array[i] = ran.nextInt(65536); // random 16-bit value
        }
    }
}
Also at https://github.com/leitao/bits-in-a-16-bits-integer-array/blob/master/Main.java
You can pre-calculate the bit counts for all byte values and then use that table for lookups. It is faster, if you make certain assumptions.
The number of operations (just computation, not reading the input) should be as follows:
Shift approach:
For each 16-bit word:
2 ops (shift, add) times 16 bits = 32 ops, 0 mem accesses; times 10,000 words = 320,000 ops + 0 mem accesses
Pre-calculation approach:
255 times 2 ops (shift, add) times 8 bits = 4,080 ops + 255 mem accesses (writing the table)
For each 16-bit word:
2 ops (compute addresses) + 2 mem accesses + 1 op (add the results); times 10,000 words = 30,000 ops + 20,000 mem accesses
Total of 34,080 ops + 20,255 mem accesses
So: many more memory accesses, but far fewer operations.
Thus, all else being equal, pre-calculation for 10,000 16-bit words is faster as long as a memory access costs no more than (320,000 - 34,080) / 20,255 ≈ 14.1 times as much as an operation.
That is probably true if you are alone on a dedicated core of a reasonably modern box, as the 255-byte table should fit into cache. If you start getting cache misses, the assumption might no longer hold.
Also, this math assumes pointer arithmetic and direct memory access as well as atomic operations and atomic memory access. Depending on your language of choice (and, apparently, based on previous answers, your choice of compiler switches), that assumption might not hold.
Finally, things get more interesting if you consider scalability: shifting can easily be parallelised across up to 10,000 cores, but pre-computation not necessarily. As the number of words goes up, however, the lookup gets more and more advantageous.
So, in short. Yes, pre-calculation is faster under pretty reasonable assumptions but no, it is not guaranteed to be faster.
