I probably won't pursue this but I had this idea of generating a procedural universe in the most memory efficient way possible.
Like in the game Elite, you could use a random number generator based on a seed, so each star system can be represented by a single seed number instead of lists of stats and other info. But if each star system is a 64-bit number, the Milky Way's roughly 100 billion stars would take 800 gigabytes of memory. And if you use only 8 bits per star system, you'll only have 256 unique star systems in your game. So my other idea was to have each star system represented by 8 bits, but simply grab the next 7 star systems' bytes in memory and use that combination to form a 64-bit number for the star system's seed. Obviously there would be 7 extra bytes at the end to account for the last star system in memory.
So is there any way to organize the values in these bytes such that every set of 8 bytes over the entire file covers all 64-bit values (hypothetically) with no repeats? Or is that impossible and I should just accept repeats? Or could I possibly use the address of the byte itself as part of the seed? And how would that work in C? If I have a file of 100 billion bytes, does it actually take up exactly 100 billion bytes in memory, or more, and how are the addresses for those bytes stored? Finally, is accessing large files like that (100 GB+) in a client-server relationship practical? Thank you.
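One possible direction, sketched here purely for illustration (the star_seed helper and its use of a SplitMix64-style mixer are my own assumptions, not something from the question): derive each system's 64-bit seed from its index alone, so nothing needs to be stored per star at all.

#include <stdint.h>
#include <stdio.h>

/* Derive a 64-bit seed from a star system's index alone, so no per-star
   storage is needed. The constants follow the SplitMix64 finalizer. */
static uint64_t star_seed(uint64_t star_index)
{
    uint64_t z = star_index + 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

int main(void)
{
    /* Any of the ~100 billion star indices maps to a reproducible 64-bit seed. */
    printf("seed for star 0:           %llu\n", (unsigned long long)star_seed(0));
    printf("seed for star 99999999999: %llu\n", (unsigned long long)star_seed(99999999999ULL));
    return 0;
}

The same idea works if the "index" is a byte's address or offset in a file: hashing the offset gives every position its own 64-bit value without needing the overlapping-bytes trick.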
2^10 = 1 KB,
2^20 = 1 MB,
etc., etc.
Except a byte is 8 bits, so I do not understand why we use powers of 2 as an explanation. Talking about bits in powers of 2 I can completely understand, but with bytes I am totally lost. Many textbooks and online resources talk about it this way; what am I missing here?
By the way, I understand that 2^10 = 1024, which is approximately 10^3 = 1000. What I don't understand is why we justify the use of prefixes and bytes using powers of 2.
I'll ask the question you're really asking: Why don't we just use powers of 10?
To which we'll respond: why should we use powers of 10? Because the lifeforms using the computers happen to have 10 fingers?
Computers break everything down to 1s and 0s.
1024 in binary = 10000000000 (2^10), which is a nice round number.
1000 in binary = 1111101000 (not an even power of 2).
If you are actually working with a computer at a low level (i.e. looking at raw memory), it is much easier to think in numbers that come out round in the form they are actually stored.
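As a small illustration, here is a minimal C sketch (the print_bits helper is made up for this example) that prints both values bit by bit and shows why 1024 is the "round" one to a machine:

#include <stdio.h>

/* Print the low 16 bits of a value, most significant bit first. */
static void print_bits(unsigned v)
{
    for (int i = 15; i >= 0; --i)
        putchar(((v >> i) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_bits(1024); /* 0000010000000000 -- a single 1 bit          */
    print_bits(1000); /* 0000001111101000 -- not a power of two       */
    return 0;
}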
From your question, I think that you understand about powers of two and measuring bytes. If not, the other answers explain that.
Is your question why we don't use bits rather than bytes, since bits are truly binary?
The reason that memory, disk space, etc. is described in bytes rather than bits has to do with the word addressability of early computers. The bit, nibble and byte came about as workable amounts of memory in simple computers. The first computers had actual wires linking the various bits together, and 8-bit addressability was a significant step forward.
Bytes instead of bits is just a historical convention. Network measurements are in (mega)bits for similar historical reasons.
Wikipedia has some interesting details.
The reason is that bytes are used not only to store numbers, but also to address the memory bytes that store numbers (or even other addresses). With one byte you have 256 possible addresses, so you can access 256 different bytes. Using only 200 of them, for example, just because 200 is a rounder decimal number, would waste address space.
This example assumes 8-bit addresses for simplicity; modern PCs usually have 64-bit addresses.
By the way, in the context of hard drives, the capacity is often a round decimal number, e.g. 1 TB, because they address storage differently. Powers of 2 are used in most other memory types, like RAM, flash drives/SSDs and cache memory. In those cases the figures are sometimes rounded, e.g. 1024 KB described as 1 MB.
There are actually two different sets of prefixes for powers of 10 and powers of 2. The powers of ten are called kilobytes, megabytes and gigabytes, while the powers of two are called kibibytes, mebibytes and gibibytes. Most people just use the former in both cases.
Okay, so I figured my own question out: 2^3 bits = 2^0 bytes. So if we have 2^13 bits and want to convert to bytes, then 2^13 bits * (1 byte / 2^3 bits) = 2^10 bytes, which is a kilobyte. With this conversion, it makes much more sense to me why bytes are expressed in powers of 2.
We can do the same thing with powers of ten: 10^1 ones = 10^0 tens. Then, to convert 10^25 ones to tens, 10^25 ones * (10^0 tens / 10^1 ones) = 10^24 tens, as expected.
I am not sure I get exactly what you are asking, but:
2^10 bits = 1 Kbit
2^10 bytes = 1 KByte = (2^3)(2^10) bits = 2^13 bits
These are two different numbers of bits and you should not confuse them with each other.
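To make the conversion concrete, here is a tiny C sketch (my own, just for illustration) that walks 2^13 bits down to kibibytes using shifts:

#include <stdio.h>

int main(void)
{
    unsigned long bits  = 1UL << 13;   /* 8192 bits                    */
    unsigned long bytes = bits >> 3;   /* divide by 2^3  -> 1024 bytes */
    unsigned long kib   = bytes >> 10; /* divide by 2^10 -> 1 KiB      */

    printf("%lu bits = %lu bytes = %lu KiB\n", bits, bytes, kib);
    return 0;
}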
I think the part that you are hung up on is the conversion from byte, to KB, to MB, etc. We all know the conversion, but let me clarify:
1024 bytes is a kilobyte. 1024 kilobytes is a megabyte, etc.
As far as the machines go, they don't care about this conversion! They just store it as x bytes. Honestly, I'm not sure the machine even cares about bytes rather than just dealing with bits.
While I'm not entirely sure, I think the factor of 1024 is an arbitrary choice made by some human. It's close to 1000, which is used in the metric system. I thought the same thing as you did, like "this has nothing to do with binary!" As one of the other answers says, it's nothing more than "easy to work with".
A byte consists of 8 bits on most systems.
A byte typically represents the smallest data type a programmer may use. Depending on language, the data types might be called char or byte.
There are some types of data (booleans, small integers, etc) that could be stored in fewer bits than a byte. Yet using less than a byte is not supported by any programming language I know of (natively).
Why does this minimum of using 8 bits to store data exist? Why do we even need bytes? Why don't computers just use increments of bits (1 or more bits) rather than increments of bytes (multiples of 8 bits)?
Just in case anyone asks: I'm not worried about it. I do not have any specific needs. I'm just curious.
Because at the hardware level memory is naturally organized into addressable chunks. Small chunks mean you can have fine-grained things like 4-bit numbers; large chunks allow for more efficient operation (typically a CPU moves things around in chunks, or multiples thereof). In particular, larger addressable chunks make for bigger address spaces: if my chunks are 1 bit, then an address range of 1-500 only covers 500 bits, whereas 500 8-bit chunks cover 4000 bits.
Note: it was not always 8 bits. I worked on a machine that thought in 6 bits (good old octal).
Paper tape (~1950's) was 5 or 6 holes (bits) wide, maybe other widths.
Punched cards (the newer kind) were 12 rows of 80 columns.
1960s:
B-5000 - 48-bit "words" with 6-bit characters
CDC-6600 - 60-bit words with 6-bit characters
IBM 7090 - 36-bit words with 6-bit characters
There were 12-bit machines; etc.
1970-1980s, "micros" enter the picture:
Intel 4004 - 4-bit chunks
8008, 8086, Z80, 6502, etc - 8 bit chunks
68000 - 16-bit words, but still 8-bit bytes
486 - 32-bit words, but still 8-bit bytes
today - 64-bit words, but still 8-bit bytes
future - 128, etc, but still 8-bit bytes
Get the picture? Americans figured that characters could be stored in only 6 bits.
Then we discovered that there was more in the world than just English.
So we floundered around with 7-bit ASCII and 8-bit EBCDIC.
Eventually, we decided that 8 bits was good enough for all the characters we would ever need. ("We" were not Chinese.)
The IBM 360 came out as the dominant machine in the '60s and '70s; it was based on an 8-bit byte. (It sort of had 32-bit words, but that became less important than the almighty byte.)
It seemed such a waste to use 8 bits when 7 bits could store all the characters you would ever need.
IBM, in the mid-20th century, "owned" the computer market with 70% of the hardware and software sales. With the 360 being their main machine, the 8-bit byte was the thing for all the competitors to copy.
Eventually, we realized that other languages existed and came up with Unicode/UTF-8 and its variants. But that's another story.
Good way for me to write something late at night!
Your points are perfectly valid; however, history will always be that insane intruder who ruined your plans long before you were born.
For the purposes of explanation, let's imagine a fictitious machine with an architecture by the name of Bitel(TM) Inside or something of the like. The Bitel specification mandates that the central processing unit (CPU, i.e. the microprocessor) shall access memory in one-bit units. Now, let's say a given instance of a Bitel-operated machine has a memory unit holding 32 billion bits (our fictitious equivalent of a 4 GB RAM module).
Now, let's see why Bitel, Inc. got into bankruptcy:
The binary code of any given program would be gigantic (the compiler would have to manipulate every single bit!)
32-bit addresses would be (even more) limited to hold just 512MB of memory. 64-bit systems would be safe (for now...)
Memory accesses would become a bottleneck. By the time the CPU has gathered all 48 bits it needs to process a single ADD instruction, the floppy would have already spun for too long, and you know what happens next...
Who the **** really needs to optimize a single bit? (See previous bankruptcy justification).
If you need to handle single bits, learn to use bitwise operators!
Programmers would go crazy as both coffee and RAM get too expensive. At the moment, this is a perfect synonym for apocalypse.
The C standard is holy and sacred, and it mandates that the minimum addressable unit (i.e, char) shall be at least 8 bits wide.
8 is a perfect power of 2. (1 is another one, but meh...)
In my opinion, it's an issue of addressing. To address individual bits of data, you would need eight times as many addresses (three more bits per address) compared to addressing individual bytes. The byte is generally going to be the smallest practical unit to hold a number in a program (with only 256 possible values).
Some CPUs use words to address memory instead of bytes. That's their natural data type, so 16 or 32 bits. If Intel CPUs did that it would be 64 bits.
8 bit bytes are traditional because the first popular home computers used 8 bits. 256 values are enough to do a lot of useful things, while 16 (4 bits) are not quite enough.
And, once a thing goes on for long enough, it becomes terribly hard to change. This is also why your hard drive or SSD likely still pretends to use 512-byte blocks, even though neither the disk hardware nor the OS actually works in 512-byte blocks. (Advanced Format drives have a software switch to disable 512-byte emulation, but generally only servers with RAID controllers turn it off.)
Also, Intel/AMD CPUs have so much extra silicon doing so much extra decoding work that the slight difference in 8 bit vs 64 bit addressing does not add any noticeable overhead. The CPU's memory controller is certainly not using 8 bits. It pulls data into cache in long streams and the minimum size is the cache line, often 64 bytes aka 512 bits. Often RAM hardware is slow to start but fast to stream so the CPU reads kilobytes into L3 cache, much like how hard drives read an entire track into their caches because the drive head is already there so why not?
First of all, C and C++ do have native support for bit-fields.
#include <iostream>

struct S {
    // will usually occupy 2 bytes:
    // 3 bits: value of b1
    // 2 bits: unused
    // 6 bits: value of b2
    // 2 bits: value of b3
    // 3 bits: unused
    unsigned char b1 : 3, : 2, b2 : 6, b3 : 2;
};

int main()
{
    std::cout << sizeof(S) << '\n'; // usually prints 2
}
Probably the answer lies in performance and memory alignment, and the fact that the byte (I reckon partly because it is called char in C) is the smallest part of a machine word that can hold a 7-bit ASCII character. Text operations are common, so a dedicated type for plain text is worthwhile for a programming language.
Why bytes?
What is so special about 8 bits that it deserves its own name?
Computers do process all data as bits, but they prefer to process bits in byte-sized groupings. Or to put it another way: a byte is how much a computer likes to "bite" at once.
The byte is also the smallest addressable unit of memory in most modern computers. A computer with byte-addressable memory can not store an individual piece of data that is smaller than a byte.
What's in a byte?
A byte represents different types of information depending on the context. It might represent a number, a letter, or a program instruction. It might even represent part of an audio recording or a pixel in an image.
I am compressing 8-bit bytes, and the algorithm works only if the number of unique byte values found in the data is 128 or less.
I take all the unique bytes. At the start I store a table containing each unique byte once. If there are 120 of them, I store 120 bytes.
Then, instead of storing each item in 8 bits, I store each item in 7 bits, one after another. Those 7 bits contain the item's position in the table.
Question: how can I avoid storing those 120 bytes at the start, by storing the possible tables in my code?
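For illustration, a rough C sketch of the packing step described above (put7 is a made-up helper; the table lookup and file I/O are omitted, and the output buffer must be zero-initialized):

#include <stdint.h>
#include <stddef.h>

/* Append the low 7 bits of `code` to the buffer `out`, most significant bit
   first, starting at bit offset *bitpos. `out` must be zero-initialized. */
static void put7(uint8_t *out, size_t *bitpos, uint8_t code)
{
    for (int i = 6; i >= 0; --i) {
        if ((code >> i) & 1)
            out[*bitpos / 8] |= (uint8_t)(0x80u >> (*bitpos % 8));
        ++*bitpos;
    }
}

Each item's 7-bit table index is passed as code, so n items occupy roughly 7n/8 bytes instead of n bytes.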
What you are trying to do is a special case of Huffman coding where you only consider which bytes are unique, not their frequency, so every byte gets a fixed-length code. You can do better: use the frequencies to assign variable-length codes with Huffman coding and get more compression.
But if you intend to use the same algorithm, then consider this: don't store the 120 bytes. Store 256 bits (32 bytes) where each 1 bit indicates that the corresponding byte value is present. That gives you all the information you need: from the bitmap you can see which values occur in the file and reconstruct the mapping table on the other side.
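A minimal sketch of that idea in C (the function names are mine): mark every byte value that occurs in a 256-bit bitmap, then let the decoder rebuild the same table by scanning the bitmap in value order, so both sides agree on each value's 7-bit index.

#include <stdint.h>
#include <stddef.h>

/* Set one bit per byte value that occurs in `data`.
   `bitmap` (32 bytes = 256 bits) must be zero-initialized. */
static void build_bitmap(const uint8_t *data, size_t len, uint8_t bitmap[32])
{
    for (size_t i = 0; i < len; ++i)
        bitmap[data[i] / 8] |= (uint8_t)(1u << (data[i] % 8));
}

/* Rebuild the table of present values from the bitmap; returns its size,
   or -1 if more than 128 distinct values exist (the scheme then no longer
   fits 7-bit codes). */
static int rebuild_table(const uint8_t bitmap[32], uint8_t table[128])
{
    int n = 0;
    for (int v = 0; v < 256; ++v) {
        if (bitmap[v / 8] & (1u << (v % 8))) {
            if (n == 128)
                return -1;
            table[n++] = (uint8_t)v;
        }
    }
    return n;
}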
I don't know your exact algorithm, but the point of such a scheme is probably that you cannot avoid it: the table has to be stored so that the shortened codes can be mapped back to the original byte values.
There is one way you could avoid writing those 120 bytes: when you know the contents of the table beforehand. For example, if you know that whatever you are going to send will only ever contain those particular byte values, you can make the table known on both sides and transmit everything except those 120 bytes.
How much memory do I need to load 100 million records into memory? Suppose each record needs 7 bytes. Here is my calculation:
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100,000,000 / 1,000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned on specific boundaries. Details vary by compiler, but in general ints must begin at an address that is a multiple of 4, shorts at a multiple of 2, and chars can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, arranged in that order, the fields themselves take 7 bytes, but the next instance of the structure must begin at a 4-byte boundary, so you get 1 byte of tail padding per record. (A struct as a whole is aligned to its most strictly aligned member, 4 bytes here, so in this case that is all the padding you get.)
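As a quick check, a small C sketch (the struct name is made up, not the asker's actual code) shows the padding directly with sizeof and offsetof:

#include <stdio.h>
#include <stddef.h>

struct record {
    int   a;   /* 4 bytes, must start on a 4-byte boundary */
    short b;   /* 2 bytes                                  */
    char  c;   /* 1 byte                                   */
};             /* + 1 byte of tail padding so each array
                  element starts on a 4-byte boundary      */

int main(void)
{
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));      /* typically 8, not 7 */
    printf("offsetof(c)           = %zu\n", offsetof(struct record, c)); /* typically 6        */
    return 0;
}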
Every time you allocate memory there is some overhead for the allocation block. The memory allocator has to keep track of how much memory was allocated and sometimes where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead is trivial. But if you allocate each record individually, each one carries that overhead. Exactly how much depends on the implementation, but on one system I used it was 8 bytes per allocation. In that case you'd need 16 bytes per record: 8 bytes for the block header, 7 for data, 1 for padding. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: Look up what the system call is to find out how much memory you're using, then check this value, allocate a million instances, check it again and see the difference.
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements, we usually use the factor of 1024, so you would need
700,000,000 / 1024² ≈ 667.57 MiB, or equivalently 700,000,000 / 1024³ ≈ 0.652 GiB
I have a query that returns around 6 million rows, which is too much to process in memory all at once.
Each query is returning a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF8.
How can I work out the max size of one of these tuples, and more generally, how can I approximate the size of a scala data-structure like this?
I've got 6 GB on the machine I'm using. However, the data is being read from the database using scala-query into Scala Lists.
Scala objects follow approximately the same rules as Java objects, so any information on those is accurate. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 bytes per pointer, though there may be less if the JVM is using compressed pointers, which it does by default now, I think.)
I'll assume a 64-bit machine without compressed pointers (worst case); then a Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to the nearest 8, or 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use the wrapped versions.) A String is 32 bytes, IIRC, plus the array for the data, which is 16 bytes plus 2 per character. java.sql.Timestamp needs to store a couple of Longs, I think, so that's 32 bytes. All told, it's on the order of 120 bytes plus two per character, which at ~20 characters is ~160 bytes.
Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
How much memory have you got at your disposal? 6 million instances of a triple is really not very much!
Each reference has an overhead of either 4 or 8 bytes, depending on whether you are running 32-bit or 64-bit (without compressed "oops"; compressed oops are the default in JDK 7 for heaps under 32 GB).
So your triple has 3 references (there may be extra ones due to specialisation, so you might get 4 refs), and your Timestamp is a wrapper (reference) around a long (8 bytes). Your Int will be specialized (i.e. an underlying int), so that makes another 4 bytes. The String is 20 x 2 bytes. So you basically have a worst case of well under 100 bytes per row; that's 10 rows per KB, 10,000 rows per MB. So you can comfortably process your 6 million rows in under 1 GB of heap.
Frankly, I think I've made a mistake here because we process daily several million rows of about twenty fields (including decimals, Strings etc) comfortably in this space.