What PostScript object is 8 bytes normally, but 9 bytes packed?

A packed array in PostScript is supposed to be a space-saving feature: objects are squeezed tightly in memory by omitting extraneous information. A null, for example, can be just a single byte because it carries no information. Booleans can be a single byte, too. Integers can be 5 bytes (or 3, if the number is small). Reference objects need the full 8 bytes that a normal object does. But the PostScript manual says that packed objects occupy 1 to 9 bytes!
When the PostScript language scanner encounters a procedure delimited by { … }, it creates either an array or a packed array, according to the current packing mode (see the description of the setpacking operator in Chapter 8). An array value occupies 8 bytes per element. A packed array value occupies 1 to 9 bytes per element, depending on each element's type and value; a typical average is 2.5 bytes per element. --PLRM 3ed, B.2. Virtual Memory Use, p. 742
So what object gets bigger when packed? And why? Hydrogen bonding??!

Any value that you can't represent in 7 data bytes or fewer will need 9 bytes.
The packed format starts with a byte that says how many data bytes follow, so any value that needs all 8 data bytes occupies 9 bytes in total, including the leading length byte.
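
To make the accounting concrete, here is a minimal Go sketch of that arithmetic. The leading tag/length byte is a simplified model for illustration, not the interpreter's actual encoding:

package main

import "fmt"

// packedSize follows the reasoning above: a leading tag/length byte says
// how many data bytes follow, so the packed size is 1 + payload. This is
// a simplified model, not the interpreter's real format.
func packedSize(payloadBytes int) int {
	return 1 + payloadBytes
}

func main() {
	fmt.Println("null:          ", packedSize(0)) // 1 byte: no payload at all
	fmt.Println("boolean:       ", packedSize(0)) // 1 byte: value fits in the tag
	fmt.Println("small integer: ", packedSize(2)) // 3 bytes
	fmt.Println("32-bit integer:", packedSize(4)) // 5 bytes
	fmt.Println("full object:   ", packedSize(8)) // 9 bytes: all 8 data bytes + tag
}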

Related

Why does an empty slice have 24 bytes?

I want to understand what happens when an empty slice is created with make([]int, 0). I wrote this code as a test:
emptySlice := make([]int, 0)
fmt.Println(len(emptySlice))
fmt.Println(cap(emptySlice))
fmt.Println(unsafe.Sizeof(emptySlice))
The length and capacity returned are obviously both 0, but the size of the slice is 24 bytes. Why?
24 bytes should be 3 int64s, right? An internal array occupying 24 bytes would be something like [3]int{}, so why does an empty slice take 24 bytes?
If you read the documentation for unsafe.Sizeof, it explains what's going on here:
The size does not include any memory possibly referenced by x. For instance, if x is a slice, Sizeof returns the size of the slice descriptor, not the size of the memory referenced by the slice.
All data types in Go are statically sized. Even though slices have a dynamic number of elements, that cannot be reflected in the size of the data type, because then it would not be static.
The "slice descriptor", as implied by the name, is all the data that describes a slice. That is what is actually stored in a slice variable.
Slices in Go have 3 attributes: the underlying array (memory address), the length of the slice (memory offset), and the capacity of the slice (memory offset). In a 64-bit application, memory addresses and offsets tend to be stored in 64-bit (8-byte) values. That is why you see a size of 24 (= 3 × 8) bytes.
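
A minimal sketch of that descriptor layout; the struct below mirrors the three words a slice variable actually holds (the field names are illustrative, not the runtime's own):

package main

import (
	"fmt"
	"unsafe"
)

// sliceHeader mirrors the slice descriptor: see reflect.SliceHeader
// for the (deprecated but documented) stdlib equivalent.
type sliceHeader struct {
	Data uintptr // address of the underlying array
	Len  int     // number of elements currently in use
	Cap  int     // number of elements the array can hold
}

func main() {
	fmt.Println(unsafe.Sizeof(sliceHeader{})) // 24 on 64-bit platforms
	fmt.Println(unsafe.Sizeof([]int{}))       // 24: the same three words
}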
unsafe.Sizeof is the size of the object in memory, exactly the same as sizeof in C and C++. See How to get memory size of variable?
A slice has a length, but it can also be resized, so the maximum it can grow to must be stored somewhere. Being resizable also means that it can't be a static array; it needs to store a pointer to some other (possibly dynamically allocated) array.
The whole thing means it needs to store { begin, end, last valid index } or { begin, size, capacity }. That's a tuple of 3 values, which means its in-memory representation is at least 3×8 bytes on 64-bit platforms, unless you want to limit the maximum size and capacity to much less than 2^64 bytes.
It's exactly the same situation for C++ types with the same dynamic-sizing capability: std::string and std::vector are also typically 24-byte types, although on some implementations 8 bytes of padding are added for alignment reasons, resulting in a 32-byte string type. See:
C++ sizeof Vector is 24?
Why is sizeof array(type string) 24 bytes with a single space element?
Why is sizeof(string) == 32?
In fact, Go's strings.Builder, which is the closest thing to C++'s std::string, has a size of 32 bytes. See demo.
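
You can check that 32-byte figure directly on a 64-bit platform:

package main

import (
	"fmt"
	"strings"
	"unsafe"
)

func main() {
	var b strings.Builder
	// A pointer (8 bytes, used by the Builder for copy detection) plus
	// a []byte descriptor (24 bytes) makes 32 bytes on 64-bit platforms.
	fmt.Println(unsafe.Sizeof(b)) // 32
}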

Addressing Size Regarding Bytes

Just to make sure, does every single address contain one byte? So say you had theoretical addresses FFF0 and FFFF: there are 16 values between these two addresses, which means between them they contain 16 bytes, or 8 x 16 bits? Every individual address is linked to a single byte?
Just to make sure, does every single address contain one byte?
...which means between them they contain 16 bytes, or 8 x 16 bits?
Every individual address is linked to a single byte?
Yes to all three questions.
This is why, with 32-bit addressing, you can only access 2^32 bytes = 4,294,967,296 bytes = 4 GiB*. Each addressable memory location gives access to 1 byte.
If we could access 2 bytes with one address, then that limit would have been 8 GiB. But the architecture of modern chips, and all software, would have to be modified to indicate whether they want both bytes or just the first or the second, so you'd need, say, 1 more bit for that. Guess what: if you had 33-bit machines, that's what we'd get... a maximum addressable space of 8 GiB, which is still effectively 1 byte per address. Workarounds do exist, but that's not related to your questions.
* GiB = Binary GigaBytes.
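
The arithmetic, spelled out as a trivial Go check of the numbers above:

package main

import "fmt"

func main() {
	const addrBits = 32
	bytesPerAddr := uint64(1) // one byte per address on byte-addressable machines
	total := bytesPerAddr << addrBits
	fmt.Println(total, "bytes =", total>>30, "GiB") // 4294967296 bytes = 4 GiB

	// The hypothetical 2-bytes-per-address machine from above:
	fmt.Println((total*2)>>30, "GiB") // 8 GiB
}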
Note that this is not related to types, where a char is 1 byte and an int is 4 bytes. Programming languages compensate for that when accessing the value of a variable stored at a location (or locations), and sizes are really reckoned in total bits rather than total bytes: an int is considered 32 bits rather than 4 bytes. When C fetches an int's value from memory, it fetches all 4 bytes, even though the address of the int refers to just one of them, the address of the first byte.
Yes. Addresses map to bytes 1 to 1, even if they expect you to work with a word size of two or four bytes at a time.
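
A quick Go illustration of both points: consecutive bytes sit at consecutive addresses, and a 4-byte int occupies 4 of them starting at its own address:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	var buf [4]byte
	for i := range buf {
		// Each element's address is exactly 1 higher than the previous:
		// every address names a single byte.
		fmt.Printf("buf[%d] at %p\n", i, &buf[i])
	}

	var n int32
	// The int's address is the address of its first byte; a load of n
	// reads 4 consecutive addresses starting there.
	fmt.Println(unsafe.Sizeof(n), "bytes at", &n)
}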

Cache calculating block offset and index

I've read several topics on this theme but I could not find the answer. My question is: how is the block offset calculated?
I want to know not the formula but the concept behind it. As I understand it, it is the number of positions within a block at which something can be stored. For example, if a block has 8 bytes of storage and has to store 2-byte addresses, is its block offset 2 bits? (So there are 4 positions at which an address can be stored.)
The block offset is simply calculated as log2(cache_line_size).
The reason is that all systems I know of are byte-addressable, so you need enough bits to index any byte in the block. Although most systems have a word size larger than a single byte, they still support offsets at single-byte granularity, even if that is not the common case.
So for the example you mentioned, an 8-byte block size with 2-byte words, you would still need 3 bits in order to address any byte. If you had a system that was not byte-addressable, then you could use just 2 bits for the block offset, since there are only 4 word positions. But in practice all systems that I know of are byte-addressable.
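
A one-liner captures it; the sketch below assumes the line size is a power of two:

package main

import (
	"fmt"
	"math/bits"
)

// offsetBits returns the number of block-offset bits for a byte-addressable
// cache: log2 of the line size in bytes (assumed a power of two).
func offsetBits(lineSize uint) int {
	return bits.TrailingZeros(lineSize)
}

func main() {
	fmt.Println(offsetBits(8))  // 3: enough to index any of the 8 bytes in the block
	fmt.Println(offsetBits(64)) // 6: a typical 64-byte cache line
}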

Improve this compression algorithm?

I am compressing 8-bit bytes, and the algorithm works only if the number of unique single bytes found in the data is 128 or fewer.
I take all the unique bytes. At the start I store a table containing each unique byte once. If there are 120 of them, I store 120 bytes.
Then, instead of storing each item in 8 bits of space, I store each item in 7 bits, one after another. Those 7 bits contain the item's position in the table (a sketch of this packing follows below).
Question: how can I avoid storing those 120 bytes at the start, by storing the possible tables in my code?
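
For concreteness, here is a rough Go sketch of the 7-bit packing step described above (pack7 is a hypothetical helper, assuming every index is below 128):

package main

import "fmt"

// pack7 packs 7-bit table indices tightly, MSB-first: 8 indices fit in 7 bytes.
func pack7(indices []byte) []byte {
	var out []byte
	var acc, nbits uint
	for _, idx := range indices {
		acc = acc<<7 | uint(idx&0x7F)
		nbits += 7
		for nbits >= 8 {
			nbits -= 8
			out = append(out, byte(acc>>nbits)) // emit the top 8 finished bits
		}
	}
	if nbits > 0 {
		out = append(out, byte(acc<<(8-nbits))) // zero-pad the final byte
	}
	return out
}

func main() {
	packed := pack7([]byte{5, 17, 99, 3})
	fmt.Println(len(packed), "bytes for 4 indices") // 4 * 7 = 28 bits -> 4 bytes
}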
What you are trying to do is a special case of Huffman coding where you consider only which bytes are unique, not their frequencies, hence giving each byte a fixed-length code. But you can do better: use the frequencies to give the bytes variable-length codes with Huffman coding and get more compression.
But if you intend to use the same algorithm, then consider this: don't store 120 bytes; store 256 bits (32 bytes), where a 1 indicates that the corresponding byte value is present. That gives you all the information you need: you use the bits to find which values occur in the file and reconstruct the mapping table, as sketched below.
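
A minimal sketch of that bitmap approach (the helper names are illustrative):

package main

import "fmt"

// buildBitmap marks each byte value present in data in a 256-bit (32-byte) set.
func buildBitmap(data []byte) [32]byte {
	var bm [32]byte
	for _, b := range data {
		bm[b/8] |= 1 << (b % 8)
	}
	return bm
}

// tableFromBitmap reconstructs the sorted table of unique bytes from the bitmap.
func tableFromBitmap(bm [32]byte) []byte {
	var table []byte
	for v := 0; v < 256; v++ {
		if bm[v/8]&(1<<(v%8)) != 0 {
			table = append(table, byte(v))
		}
	}
	return table
}

func main() {
	bm := buildBitmap([]byte("hello world"))
	fmt.Printf("%q\n", tableFromBitmap(bm)) // the unique bytes, sorted
}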
I don't know the exact algorithm, but probably the point of the compression algorithm is that you cannot: it has to store those values so that it can write a shorthand for all the other bytes in the data.
There is one way you could avoid writing those 120 bytes: when you know the contents of those bytes beforehand. For example, when you know that whatever you are going to send will only contain those bytes, you can make the table known on both sides in advance and store everything except those 120 bytes.

How can I approximate the size of a data structure in scala?

I have a query that returns me around 6 million rows, which is too big to process all at once in memory.
Each query is returning a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF8.
How can I work out the max size of one of these tuples, and more generally, how can I approximate the size of a scala data-structure like this?
I've got 6 GB on the machine I'm using. However, the data is being read from the database using scala-query into Scala Lists.
Scala objects follow approximately the same rules as Java objects, so any information on those is accurate. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 bytes per pointer, though there may be less if the JVM is using compressed pointers, which it does by default now, I think.)
I'll assume a 64-bit machine without compressed pointers (worst case); then a Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to the nearest 8, or 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use the wrapped versions.) String is 32 bytes, IIRC, plus the array for the data, which is 16 plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so that's 32 bytes. All told, it's on the order of 120 bytes plus 2 per character, which at ~20 characters is ~160 bytes.
Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
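
Re-running that arithmetic (the numbers are the answer's estimates above, for a 64-bit JVM without compressed pointers; this is bookkeeping, not a measurement):

package main

import "fmt"

func main() {
	tuple := 32        // ~12-byte header + two pointers (16) + Int (4), rounded up to 8
	stub := 8          // extra stub object for the non-specialized Int
	str := 32          // the String object itself
	chars := 16 + 2*20 // char[] header plus 2 bytes per character
	ts := 32           // java.sql.Timestamp: a couple of longs plus header
	fmt.Println(tuple+stub+str+chars+ts, "bytes") // 160 bytes per ~20-char row
}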
How much memory have you got at your disposal? 6 million instances of a triple is really not very much!
Each reference has an overhead of either 4 or 8 bytes, depending on whether you are running 32-bit or 64-bit (without compressed "oops", although those are the default in JDK 7 for heaps under 32 GB).
So your triple has 3 references (there may be extra ones due to specialisation, so you might get 4 refs); your Timestamp is a wrapper (reference) around a long (8 bytes); your Int will be specialized (i.e. an underlying int), so that makes another 4 bytes. The String is 20 × 2 bytes. So you basically have a worst case of well under 100 bytes per row; that's 10 rows per KB, 10,000 rows per MB. So you can comfortably process your 6 million rows in under 1 GB of heap.
Frankly, I think I've overestimated here, because we comfortably process several million rows daily, each with about twenty fields (including decimals, Strings, etc.), in this much space.

Resources