256 KB in cache size is really 256 KiB?

256 KB in cache size is really 256 KiB? - caching

So, I am confused regarding the whole KB /KiB concept.
I read in a datasheet that a specific L2 cache has 256KB capacity. From other sources I have read the size to be 256 kB.
Sometimes when people write KB, or kB, they mean KiB, and sometimes not. My limited knowledge about memory leads me to believe that cache sizes should be a power of two bytes.
In the context of cache size, is it more likely that the size of the memory is 256 000 bytes or 2^10*256= 262 144 bytes?
Edit: Not the actual datasheet, but as an example have a look at the L1 cache on this AMD processor.
http://en.wikipedia.org/wiki/File:AMD_A64_Opteron_arch.svg

You've got:
RAM (always comes in powers of 2 sizes)
"256" (a power of 2)
capital K in "KB" (the standard is lowercase k = 1000, so capital K often implies the non-standard units)
All 3 of these things imply binary sizes, so yes, it's safe to assume they meant 256 kilobinary bytes = 256 KiB = 256×1024 B = 262,144 bytes.
Yes, writing it as "KB" is non-standard and confusing and wrong, but it's unfortunately common, so you need to use context to figure out what is actually meant.

Read these: http://en.wikipedia.org/wiki/Mebibyte and http://en.wikipedia.org/wiki/Kibibyte
Technical documentation and code generally uses the power of two sizes because of the underlying binary system. But often... for historical reasons... the unit is written as MB instead of MiB, or KB/kB instead KiB.
If you see something looking like a power of two, like ..., 32, 64, 128, 256, 512, 1024, 2048, ... it is very likely to be a binary size.

Without a reference to the specific data sheet you're talking about, it's hard to make a judgement.
In general, when talking about hardware these are powers of 2, thus 262144.
The base-10 version is mostly used around disk storage, where mere mortals pay less attention to detail.

As far as I know, it is mainly the (Sales guys at) disk drive manufacturers that use kiB and MiB, real engineers use powers of 2 :-)

Related

times of the Fortran matmul function with different multiplication sizes

I have calculated the time spent by the Fortran's MATMUL function with different multiplication sizes (32 × 32, 64 × 64, ...) and I have questions about the results.
These are the results:
SIZE ----- TIME IN SECONDS
32 ----- 0,000071
64 ----- 0,000032
128 ----- 0,001889
256 ----- 0,010866
512 ----- 0,043
1024 ----- 0,336
2048 ----- 2,878
4096 ----- 51,932
8192 ----- 405,921856
I guess the times should increase by a factor of 8 (m * 2 * n * 2 * k * 2). I do not know if it should be like that. If so, who can tell why it is not like that?
In addition, we see an increase of a factor of 18 with multiplications of 2048 a
4096. Could someone tell me why?
I have measured the times with CALL CPU_TIME() from Fortran and with CALL DATE_AND_TIME() from Fortran and both give very similar results.
My processor is an AMD Phenom (tm) II X4 945 Processor with 4 cores

#Steve is correct, there are many factors that affect performance especially when data sizes are small. Thats why all of your results at and below 2048 are pretty much semi-random and essentially irrelevant. All or most of the data is likely in several layers of CPU cache. So flushing CPU threads and other hardware related events are making these results very skewed. If you run these tests again you will find different results at these small sizes.
So, when you go from 2048 to 4096 you get a major jump. All the data no longer fits into the CPU caches. The computer needs to load blocks of data from RAM into the CPU caches. This explains the large jump in time.
It is at these sizes and larger that the computer has to do more typical operations (load data, perform operations, save data to RAM) and this is the performance you will get as data gets even larger. This is also where performance becomes very consistent as data grows larger. Notice that going from 4096 to 8192 is very close to exactly 8 times longer. At this point, going to 16384 will take almost exactly 8 times 406 seconds.
Any size smaller than 4096 is not giving your computer enough work to accurately measure the performance.

There should be a factor 8 between each timing, and deviations are generally due to memory management like cache alignment and cache- vs array-size. For small arrays there might be a calling overhead to matmul(). A triple do-loop can be faster, at least with some optimization (try -O3 -march=native), and should work equally well for small sizes.

Why are Bytes talked about in powers of 2?

2^10 = 1KB,
2^20 = 1MB,
etc.
etc.
Except, a byte is 8 bits so I do not understand why we are using powers of 2 as an explanation. To talk about Bits in powers of 2 I can completely understand but with Bytes, I am totally lost. Many textbooks / online resources talk about it in this way, what am I missing here?
By the way, I understand 2^10 = 1024 which is approximately 10^3 = 1000. What I don't understand is why we justify the use prefixes and bytes using powers of 2.

I'll ask the question you're really asking: Why don't we just use powers of 10?
To which we'll respond: why should we use powers of 10? Because the lifeforms using the computers happen to have 10 fingers?
Computers break everything down to 1s and 0s.
1024 in binary = 10000000000 (2^10), which is a nice round number.
1000 in binary = 1111101000 (not an even power of 2).
If you are actually working with a computer at a low level (ie looking at the raw memory), it is much easier to think using numbers that get represented as round numbers in the way they are stored.

From your question, I think that you understand about powers of two and measuring bytes. If not, the other answers explain that.
Is your question is why not use bits rather than bytes since bits are truly binary?
The reason that memory, disk space, etc is described in bytes rather than bits has to do with the word addressability of early computers. The bit, nibble and byte came about as workable amounts of memory in simple computers. The first computers had actual wires that linked the various bits together. 8-bit addressability was a significant step forward.
Bytes instead of bits is just a historical convention. Networks measurements are in (mega) bits for similar historical reasons.
Wikipedia has some interesting details.

The reason is that you do not only use bytes to store numbers, but also to address memory bytes that store numbers (or even other addresses). With 1 Byte you have 256 possible addresses, so you can access 256 different bytes. Using only 200 bytes, for example, just because it is a rounder number would be a waste of address space.
This example assumes 8 bit addresses for simplification, usually you have 64 bit addresses in modern PCs.
By the way, in the context of hard drives, the amount of memory is often a round number, e.g. 1 TB, because they address memory space differently. Powers of 2 are used in most memory types, like RAM, flash drives/SSDs, cache memory. In these cases, they are sometimes rounded, e.g. 1024 KB as 1 MB.
There are actually 2 different names for powers of 2 and powers of 10. Powers of ten are known as kilo-bytes, mega-bytes, giga-bytes, while powers of two are called kibi-bytes, mebi-bytes and gibi-bytes. Most people just use the former ones in both cases.

Okay so I figured my own question out. 2^3 bits = 2^0 Bytes. So if we have 2^13 bits and want to convert it to bytes then 2^13 bits = x * 1Byte / (2^3 bits) = 2^10 bytes which is a kilobyte. Now with this conversion, it makes much more sense to me why they choose to represent Bytes in powers of 2.
We can do the same thing with powers of ten, 10^1 ones = 10^0 tens. Then if we want to convert 10^25 ones to tens we get 10^25 ones = x * (10^0 tens / 10^1 ones) = 10^24 tens as expected.

I am not sure if I get what you are exactly asking, but:
2^10 bits = 1KBits
2^10 bytes = 1KBytes = ((2^3)(2^10)Bits = 2^13 Bits
These are two different numbers of bits and you should not confuse them with eachother

I think the part that you are hung up on is the conversion from byte, to KB, to MB, etc. We all know the conversion, but let me clarify:
1024 bytes is a kilobyte. 1024 kilobytes is a megabyte, etc.
As far as the machines go, they don't care about this conversion! They just store it as x bytes. Honestly I'm not sure if it cares are bytes, and just deals with bits.
While I'm not entirely sure, I think the 1024 rate is an arbitrary choice made by some human. It's close to 1000 which is used in the metric system. I thought the same thing as you did, like "this has nothing to do with binary!". As one of the other answers says, it's nothing more than "easy to work with".

Why do bytes exist? Why don't we just use bits?

A byte consists of 8 bits on most systems.
A byte typically represents the smallest data type a programmer may use. Depending on language, the data types might be called char or byte.
There are some types of data (booleans, small integers, etc) that could be stored in fewer bits than a byte. Yet using less than a byte is not supported by any programming language I know of (natively).
Why does this minimum of using 8 bits to store data exist? Why do we even need bytes? Why don't computers just use increments of bits (1 or more bits) rather than increments of bytes (multiples of 8 bits)?
Just in case anyone asks: I'm not worried about it. I do not have any specific needs. I'm just curious.

because at the hardware level memory is naturally organized into addressable chunks. Small chunks means that you can have fine grained things like 4 bit numbers; large chunks allow for more efficient operation (typically a CPU moves things around in 'chunks' or multiple thereof). IN particular larger addressable chunks make for bigger address spaces. If I have chunks that are 1 bit then an address range of 1 - 500 only covers 500 bits whereas 500 8 bit chunks cover 4000 bits.
Note - it was not always 8 bits. I worked on a machine that thought in 6 bits. (good old octal)

Paper tape (~1950's) was 5 or 6 holes (bits) wide, maybe other widths.
Punched cards (the newer kind) were 12 rows of 80 columns.
1960s:
B-5000 - 48-bit "words" with 6-bit characters
CDC-6600 -- 60-bit words with 6-bit characters
IBM 7090 -- 36-bit words with 6-bit characters
There were 12-bit machines; etc.
1970-1980s, "micros" enter the picture:
Intel 4004 - 4-bit chunks
8008, 8086, Z80, 6502, etc - 8 bit chunks
68000 - 16-bit words, but still 8-bit bytes
486 - 32-bit words, but still 8-bit bytes
today - 64-bit words, but still 8-bit bytes
future - 128, etc, but still 8-bit bytes
Get the picture? Americans figured that characters could be stored in only 6 bits.
Then we discovered that there was more in the world than just English.
So we floundered around with 7-bit ascii and 8-bit EBCDIC.
Eventually, we decided that 8 bits was good enough for all the characters we would ever need. ("We" were not Chinese.)
The IBM-360 came out as the dominant machine in the '60s-70's; it was based on an 8-bit byte. (It sort of had 32-bit words, but that became less important than the all-mighty byte.
It seemed such a waste to use 8 bits when all you really needed 7 bits to store all the characters you ever needed.
IBM, in the mid-20th century "owned" the computer market with 70% of the hardware and software sales. With the 360 being their main machine, 8-bit bytes was the thing for all the competitors to copy.
Eventually, we realized that other languages existed and came up with Unicode/utf8 and its variants. But that's another story.

Good way for me to write something late on night!
Your points are perfectly valid, however, history will always be that insane intruder how would have ruined your plans long before you were born.
For the purposes of explanation, let's imagine a ficticious machine with an architecture of the name of Bitel(TM) Inside or something of the like. The Bitel specifications mandate that the Central Processing Unit (CPU, i.e, microprocessor) shall access memory in one-bit units. Now, let's say a given instance of a Bitel-operated machine has a memory unit holding 32 billion bits (our ficticious equivalent of a 4GB RAM unit).
Now, let's see why Bitel, Inc. got into bankruptcy:
The binary code of any given program would be gigantic (the compiler would have to manipulate every single bit!)
32-bit addresses would be (even more) limited to hold just 512MB of memory. 64-bit systems would be safe (for now...)
Memory accesses would be literally a deadlock. When the CPU has got all of those 48 bits it needs to process a single ADD instruction, the floppy would have already spinned for too long, and you know what happens next...
Who the **** really needs to optimize a single bit? (See previous bankruptcy justification).
If you need to handle single bits, learn to use bitwise operators!
Programmers would go crazy as both coffee and RAM get too expensive. At the moment, this is a perfect synonym of apocalypse.
The C standard is holy and sacred, and it mandates that the minimum addressable unit (i.e, char) shall be at least 8 bits wide.
8 is a perfect power of 2. (1 is another one, but meh...)

In my opinion, it's an issue of addressing. To access individual bits of data, you would need eight times as many addresses (adding 3 bits to each address) compared to using accessing individual bytes. The byte is generally going to be the smallest practical unit to hold a number in a program (with only 256 possible values).

Some CPUs use words to address memory instead of bytes. That's their natural data type, so 16 or 32 bits. If Intel CPUs did that it would be 64 bits.
8 bit bytes are traditional because the first popular home computers used 8 bits. 256 values are enough to do a lot of useful things, while 16 (4 bits) are not quite enough.
And, once a thing goes on for long enough it becomes terribly hard to change. This is also why your hard drive or SSD likely still pretends to use 512 byte blocks. Even though the disk hardware does not use a 512 byte block and the OS doesn't either. (Advanced Format drives have a software switch to disable 512 byte emulation but generally only servers with RAID controllers turn it off.)
Also, Intel/AMD CPUs have so much extra silicon doing so much extra decoding work that the slight difference in 8 bit vs 64 bit addressing does not add any noticeable overhead. The CPU's memory controller is certainly not using 8 bits. It pulls data into cache in long streams and the minimum size is the cache line, often 64 bytes aka 512 bits. Often RAM hardware is slow to start but fast to stream so the CPU reads kilobytes into L3 cache, much like how hard drives read an entire track into their caches because the drive head is already there so why not?

First of all, C and C++ do have native support for bit-fields.
#include <iostream>
struct S {
// will usually occupy 2 bytes:
// 3 bits: value of b1
// 2 bits: unused
// 6 bits: value of b2
// 2 bits: value of b3
// 3 bits: unused
unsigned char b1 : 3, : 2, b2 : 6, b3 : 2;
};
int main()
{
std::cout << sizeof(S) << '\n'; // usually prints 2
}
Probably an answer lies in performance and memory alignment, and the fact that (I reckon partly because byte is called char in C) byte is the smallest part of machine word that can hold a 7-bit ASCII. Text operations are common, so special type for plain text have its gain for programming language.

Why bytes?
What is so special about 8 bits that it deserves its own name?
Computers do process all data as bits, but they prefer to process bits in byte-sized groupings. Or to put it another way: a byte is how much a computer likes to "bite" at once.
The byte is also the smallest addressable unit of memory in most modern computers. A computer with byte-addressable memory can not store an individual piece of data that is smaller than a byte.
What's in a byte?
A byte represents different types of information depending on the context. It might represent a number, a letter, or a program instruction. It might even represent part of an audio recording or a pixel in an image.
Source

Page boundaries, implementing memory pool

I have decided to reinvent the wheel for a millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (could be smaller, or larger), and divide this area into 128 byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead between 8 and 16 bytes I guess. Might be some more, I've read posts talking about 60 bytes.
Question is, do I need to pay attention to not to have one of my 128 byte blocks across page boundaries?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many blocks of, say, 4000 bytes (to take into account HeapAlloc() overhead), and sub-divide this 4000 bytes into 128 byte blocks (4000 / 128 = 31 blocks, 128 bytes each) and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?

Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows pooling function ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is
allocated. Memory allocations of PAGE_SIZE or less do not cross page
boundaries. Memory allocations of less than PAGE_SIZE are not
necessarily page-aligned but are aligned to 8-byte boundaries in
32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the idea that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place. That is, have enough data in the same contiguous region of memory that you can take full advantage of cache locality.

I've found out that if I used _align_malloc(), I wouldn't need to worry wether spreading my sub-block to two pages would make any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.

How much memory do I need to have for 100 million records

How much memory do i need to load 100 million records in to memory. Suppose each record needs 7 bytes. Here is my calculation
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100 * 1,000,000 / 1000,000,000 = 0.7 GB
Do you see any problem with this calculation?

With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned onto specific boundaries. Details may vary depending on the compiler, but in general int's must begin at an address that is a multiple of 4, short's at a multiple of 2, char's can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order, the structure will take 7 bytes, but at the very minimum the next instance of the structure must begin at a 4-byte boundary, so you'll have 1 pad byte in the middle. I think, in fact, most C compilers require structs as a whole to begin at an 8-byte boundary, though in this case that doesn't matter.
Every time you allocate memory there's some overhead for allocation block. The compiler has to be able to keep track of how much memory was allocated and sometimes where the next block is. If you allocate 100,000,000 records as one big "new" or "malloc", then this overhead should be trivial. But if you allocate each one individually, then each record will have the overhead. Exactly how much that is depends on the compiler, but, let's see, one system I used I think it was 8 bytes per allocation. If that's the case, then here you'd need 16 bytes for each record: 8 bytes for block header, 7 for data, 1 for pad. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: Look up what the system call is to find out how much memory you're using, then check this value, allocate a million instances, check it again and see the difference.

If you really need just 7 bytes per structure, then you are almost right.
For memory measurements, we usually use the factor of 1024, so you would need
700 000 000 / 1024³ = 667,57 MiB = 0,652 GiB

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio