How is an array stored in memory? - memory-management

In an interest to delve deeper into how memory is allocated and stored, I have written an application that can scan memory address space, find a value, and write out a new value.
I developed a sample application with the end goal to be able to programatically locate my array, and overwrite it with a new sequence of numbers. In this situation, I created a single dimensional array, with 5 elements, e.g.
int[] array = new int[] {8,7,6,5,4};
I ran my application and searched for a sequence of the five numbers above. I was looking for any value that fell between 4 and 8, for a total of 5 numbers in a row. Unfortunately, my sequential numbers within the array matched hundreds of results, as the numbers 4 through 8, in no particular sequence happened to be next to each other, in memory, in many situations.
Is there any way to distinguish that a set of numbers within memory, represents an array, not simply integers that are next to each other? Is there any way of knowing that if I find a certain value, that the matching values proceeding it are that of an array?
I would assume that when I declare int[] array, its pointing at the first address of my array, which would provide some kind of meta-data to what existed in the array, e.g.
0x123456789 meta-data, 5 - 32 bit integers
0x123456789 + 32 "8"
0x123456789 + 64 "7"
0x123456789 + 96 "6"
0x123456789 + 128 "5"
0x123456789 + 160 "4"
Am I way off base?

Debug + Windows + Memory + Memory 1, set the Address field to "array". You'll see this when you switch the view to "4-byte Integer":
0x018416BC 6feb2c84 00000005 00000008 00000007 00000006 00000005 00000004
The first address is the address of the object in the garbage collected heap, plus the part of the object header that's at a negative offset (syncblk index). You cannot guess this value, the GC moves it around. The 2nd hex number is the 'type handle' for the array type (aka method table pointer). You cannot guess this value, type handles are created by the CLR on demand. The 3rd number is the array length. The rest of them are the array element values.
The odds of reliably finding this array back at runtime without a debugger are quite low. There isn't much point in trying.

Don't. Array is stored on the heap and subject to re-location due to garbage collection. You have to use fixed if you need to make sure memory is not moved in which can you can use but only very carefully.
If you are after high-performance arrays, use stackalloc and use your code scheme.

I don't know exactly but this article seems to suggest that you can get a pointer to your array, with which i would think you can determine the actual address.

Although I see you are using C# and, presumably, .NET, most of your question is in very general terms about memory. Keep mind that, in the most general sense, all memory is just bits whether that memory holds an array, strings, or code.
With that in mind, unless you can find tell-tale signs of your current platform's way of allocating different data types, there is no difference between memory that contains arrays, strings, or code.
Also, I wouldn't make any assumptions about if an array "points" to the first item in the array. Perhaps someone else can address this issue specifically, but I would assume some sort of header is involved.

Memory is not always stored contiguously. If you can ensure that it is, what you are asking is possible.

Related

Why can't ruby use most of the 2^X numbers as object ids?

ObjectSpace._id2ref gives us the object from the Ruby's Object Space, it has an object against id in sequence starting from 0, however, if we try to see object on id 4 it gives an error as
2.6.3 :121 > ObjectSpace._id2ref(4)
Traceback (most recent call last):
2: from (irb):121
1: from (irb):121:in `_id2ref'
RangeError (0x0000000000000004 is not id value)
Also, I figured that it's the same behaviour for 2^x values(except 1, 2, 8).
(0..10).each do |exp|
object_id = 2**exp
begin
puts "Number: #{object_id} : #{ObjectSpace._id2ref(object_id)}"
rescue Exception => e
puts "Number: #{object_id} : #{e.message}"
end
end
Number: 1 : 0
Number: 2 : 2.0
Number: 4 : 0x0000000000000004 is not id value
Number: 8 : nil
Number: 16 : 0x0000000000000010 is not id value
Number: 32 : 0x0000000000000020 is not id value
Number: 64 : 0x0000000000000040 is not id value
Number: 128 : 0x0000000000000080 is not symbol id value
Number: 256 : 0x0000000000000100 is not id value
Number: 512 : 0x0000000000000200 is not id value
Number: 1024 : 0x0000000000000400 is not id value
Why can't ruby use these specific numbers as object ids?
Also, what's different for (1,2,8)? and why error is different for 128?
First, it is very important to make a couple of things crystal clear:
There are exactly two guarantees Ruby makes about object IDs. These two guarantees are the only thing you are allowed to rely on. You must not make any assumptions about object IDs other than these two guarantees:
An object has the same ID for its entire lifetime.
No two objects have the same ID at the same time.
[Note: this means in particular that different objects can have the same ID at different times, i.e. that IDs can be recycled.]
An object ID is an opaque identifier. You must not make any assumptions about its structure or about any particular value.
Any particular implementation of object IDs is a private internal implementation detail of a specific version of a specific implementation running in a specific environment at a specific moment. There is no guarantee that the results will be the same with a different implementation. There is no guarantee that the results will be the same with a different version of the same implementation. There is no guarantee that the results will be the same with the same version of the same implementation running in a different environment. In fact, there is not even a guarantee that the results will be the same between two runs of the same code on the same version of the same implementation in the same environment.
ObjectSpace::_id2ref is an abomination. It should not even exist. It most certainly should not be used. It breaks object-orientation, it breaks encapsulation, it breaks safety.
Just as an example: unfortunately, you don't say which version of which implementation you are running in which environment. However, it looks like you are running YARV 2.6.3 in a 64-bit environment.
If you were to run that exact same code on the exact same version of YARV in a 32-bit environment, you would get different results. If you were to run that exact same code on an older version of YARV (pre-2.0) in the exact same environment, you would get different results.
Let's address the first, implicit, assumption which I think I see in your question. You seem to think that any ID should resolve to an object. It's easy to see that this cannot be true: there are infinitely many IDs, but for every run of a program, there are only finitely many objects, so there will always be infinitely many IDs which don't resolve to an object.
This already explains most of your results, namely the ones for 4, 16, 32, 64, 256, 512, and 1024.
So, with that out of the way, here's a high-level explanation of why there seems to some sort of structure to the IDs, and what that structure is. (But let me remind you again, that this explanation only applies to 64 bit systems, not to 32 bit, it only applies to YARV, it only applies to versions of YARV 2.0 or newer, and it is quite possible that it will no longer apply to YARV 3.0.)
In YARV, the developers made the decision that the object ID is the same thing as the memory address of the object header. This makes it easy to ensure the "rules" of object IDs: you can't have multiple objects at the same memory address at the same time, and an object will not change its memory address.
(Actually, it turns out that the second one is already a quite severe restriction: many modern high-performance garbage collectors depend on being able to move objects around in memory. This is not possible if you assume that object ID == memory address. Which means you will not be able to use any of those high-performance algorithms.)
On pretty much all modern machines, memory access is word-aligned. While it is possible to address individual bytes, that is generally slower or more awkward. So, we can basically assume that if we allocate memory, it will be aligned on a word-boundary. Which means that all memory addresses will be divisible by 8 on 64-bit systems and 4 on 32-bit systems, or in other words, that all memory addresses will end in 3 (64-bit) or 2 (32-bit) zero bits. Or, in other words: 87.5% (75%) of the address space are unused.
On the other hand, it would be quite a waste to represent Integers as a full-blown Ruby object:
They are immutable, which means we don't have to store any state.
They can't have instance variables, which means we don't have to store an instance variable table.
They can't have a singleton class, which means we don't have to store a __klass__ pointer.
They can't be extended.
And so on …
What this means, is that we can optimize the representation of Integers by not storing them as objects at all. All we need is some special case in the engine, so that if someone asks for the class of, say, 42, instead of trying to look at 42's __klass__ pointer, the engine "magically" knows to just return the Integer class.
Once we have that in place, we can do a really cool trick, which is actually as old as the very first LISP and Smalltalk VMs, and it is called a tagged pointer representation. Normally, the value of a variable is a pointer to the object (header), but with a tagged pointer representation we can store the value of the object inside the pointer to the object itself!
All we need to do is to have some sort of tag on the pointer that tells the engine that this is actually not a pointer but a value disguised as a pointer. In some older machines, especially those specifically designed for running high-level languages, pointers did have a tag field specifically for holding, e.g. type information or access control. Modern machines don't have that, but we have those unused bits we can (ab)use as tag bits.
And that is what YARV is doing: When the last bit of a pointer is 1, then it's not actually a pointer, it's an Integer. In particular, an Integer is encoded in YARV by shifting it one bit to the left and setting the last bit to 1. This allows us to encode a 63-bit Integer in a 64-bit pointer, and do native integer arithmetic at it with no object overhead and only a little bit of bit shifting overhead.
And if you think about what this encoding means:
shifting one bit to the left is equivalent to multiplying by two
setting the last bit to 1 is equivalent to incrementing by 1
Then you can explain the first pattern: a small Integer with value n is encoded as the "quasi-pointer" 2n + 1, and since "memory address" and object ID are the same in YARV (even though this is not actually a memory address, because there is no object which could have an address), it will have the object ID 2n + 1.
Integers that don't fit into 63 bit (31 bit), are allocated as objects like any other object. In different engines, these have different names, e.g. in the Smalltalk-80 VM, they are called SmallInts, in YARV, they are called Fixnums (and the ones that don't fit into a Fixnum are called Bignums). They actually used to be different subclasses of a fully-abstract Integer class in older versions of YARV, but this was considered a mistake. (It's really an internal optimization and should not be visible to the programmer.) In current versions of YARV, Fixnum and Bignum are aliases for Integer and using them gives a deprecation warning.
This explains your result for 1. If you had tried out ObjectSpace._id2ref(3), the result would have 1, then ObjectSpace._id2ref(5) would be 2, and so on.
And we still are using only 62.5% of the address space (on a 64-bit system)!
So, let's think about what else we might want to represent in this way.
YARV has a very similar optimization for Floats. Floating point numbers that fit into 62-bits are called flonums and are represented similar, with a tag of 10 at the end. (YARV does not use flonums on 32-bit platforms.)
This explains your result for ObjectSpace._id2ref(2). If you had tried ObjectSpace._id2ref(6), the result would have been -2.0.
And a similar trick is also played for Symbols. I won't explain it here in detail, because a) I don't actually fully know how it works, and b) it is slightly more complex, because the value being encoded isn't directly the Symbol value, rather it is an index into the Symbol table. However, that explains your result for 128.
Now, lastly, there is a completely different part of the address space that is also unused: the low addresses. On most modern Operating Systems, the low addresses are reserved for mapping the kernel memory directly into the user process in order to speed up the user space ↔︎ kernel space transition. Plus, there is another reason the very low addresses are kept free: in C, it is illegal to dereference a NULL pointer. Now, one way of implementing this, would be for the runtime to track all pointer dereferences and check whether they are dereferencing the NULL pointer. But there is an easier way: just give the NULL pointer an actual memory address, but one that is never allocated. That way, you don't have to do anything: if the code tries to dereference the pointer, the address doesn't exist, and the MMU will take care of raising an error. So, most C compilers compile the NULL pointer to the actual memory address 0, and in order to make sure that there is never any real data allocated at that address, they keep a whole area around address 0 free.
This means that the low addresses are never used, and we can (ab)use them to represent even more "interesting" objects. Now, YARV uses the very low addresses to represent the following objects:
false at address 0, which has the additional advantage that 0 is considered false in C.
nil at address 8 (4 in 32-bit).
true at address 20 (2 in 32-bit).
Qundef (a special internal value inside the engine that denotes an undefined value) at address 52 (6 in 32-bit).
And that explains your number 8.
This also means that your 4, 16, 32, 64, 256, 512, and 1024 will probably never resolve to an object, because they are in the low address range where the C library will simply never allocate memory.
As a closing remark, I want to repeat one last time that all of this is a private internal implementation detail of a specific version of YARV running in a specific environment. You must not rely on any of this, ever.
When flonums were introduced in YARV, and on some platforms nil no longer had object ID 4, this did break some code, and it did cause some confusion (as evidenced e.g. by questions on Stack Overflow), even though the YARV developers are allowed to change object IDs at will, because there are no guarantees being made about any particular ID values or the structure of IDs. Please, do not make the same mistake.

What does "run-time representation" means in programming languages"?

I'm taking a "Programming Languages: Design and Implementation" course and want to know what "run-time representation" means in programming languages?
To understand it, you need to remember that (contemporary) computers only know integral numbers (in various lengths: 1, 2, 4 or 8 bytes), IEEE floating point numbers (4 or 8 bytes) and memory addresses (pointers, 4 or 8 bytes).
Thus when you want to have a list, for example, you (or, at least the compiler writer of the language you are using) need to think about how lists will be represented (!) in memory at runtime.
One possible representation for elements of singly linked lists:
|________|________|
DataPtr NextPtr
The list node takes two adjacent pointer sized memory words, the first one points to the actual data, the second word points to the next list node.
There are two things to note here:
The representation is quite arbitrary. For example, we could switch the data and the next pointer, and this would be another representation that is as good as the former one.
Consider the run-time representation of a tuple or pair. It could be:
|________|________|
PtrFirst PtrSecond
that is, two memory words that hold pointers to the first and second component, respectively. Sounds familiar?
Well, how can we tell whether two subsequent words that hold pointers represent a pair or a list element? We can't! Many of our data abstraction will end up using the same run-time representation.

Efficient Algorithm for Parsing OpCodes

Let's say I'm writing a virtual machine. I read in the program data into an array of bytes. Now I need to loop through those bytes (instructions are two bytes) and instantiate a little class representing each instruction and it's arguments.
What would be a fast parsing approach? Here are the two way's I've thought of:
Logically branching by inspecting each bit from the left to the right until I narrowed it down to a particular op code. This would be like a binary search.
Inspecting some programs to come up with a list of opcodes ordered by frequency of use, and then checking the for the full opcode in that order.
Note: I will be using bit shifting and masking in C to check, not regexes or string comps or anything high-level like that.
You don't need to parse anything. If this is in C, you make a table of function pointers which has 256 entries in it, one for each possible byte value, then jump to the appropriate function based on the first byte value. If the second byte is significant then a switch statement can be used within the function to handle the second byte. This is how the original Visual Basic interpreter (versions 1-6) worked.

Array size too big - ruby

I am getting a 'ArgumentError: array size too big' message with the following code:
MAX_NUMBER = 600_000_000
my_array = Array.new(MAX_NUMBER)
Question. What is the max value that the Array.new function takes in Ruby?
An array with 500 million elements is 2 GiBytes in size, which – depending on the specific OS you are using – is typically the maximum that a process can address. In other words: your array is bigger than your address space.
So, the solutions are obvious: either make the array smaller (by, say, breaking it up in chunks) or make the address space bigger (in Linux, you can patch the kernel to get 3, 3.5 and even 4 GiByte of address space, and of course switching to a 64 bit OS and a 64 bit Ruby implementation(!) would also work).
Alternatively, you need to rethink your approach. Maybe use mmap instead of an array, or something like that. Maybe lazy-load only the parts you need.

Hashing of pointer values

Sometimes you need to take a hash function of a pointer; not the object the pointer points to, but the pointer itself. Lots of the time, folks just punt and use the pointer value as an integer, chop off some high bits to make it fit, maybe shift out known-zero bits at the bottom. Thing is, pointer values aren't necessarily well-distributed in the code space; in fact, if your allocator is doing its job, there's an excellent chance they're all clustered close together.
So, my question is, has anyone developed hash functions that are good for this? Take a 32- or 64-bit value that's maybe got 12 bits of entropy in it somewhere and spread it evenly across a 32-bit number space.
This page lists several methods that might be of use. One of them, due to Knuth, is a simple as multiplying (in 32 bits) by 2654435761, but "Bad hash results are produced if the keys vary in the upper bits." In the case of pointers, that's a rare enough situation.
Here are some more algorithms, including performance tests.
It seems that the magic words are "integer hashing".
They'll likely exhibit locality, yes - but in the lower bits, which means objects will be distributed through the hashtable. You'll only see collisions if a pointer's address is a multiple of the hashtable's length from another pointer.
If you know the lowest possible pointer address (which is often the case if you're working within a large buffer), just convert the pointer to an integer by subtracting the lowest possible pointer value; eg. that could be the buffer's base address.
-Remember: pointer subtracted from pointer equals an offset (integer).
So: Don't "chop off" bits; it's much better to convert to an offset.
This will result in that the offset value is much smaller than a pointer value.
It may help further to shift the pointer value right twice (eg. divide by 4) in some cases as well, before hashing it.
The problem with pointers is often that small blocks of memory is likely to be allocated on the same address (eg. a block being freed and another block is taking the freed block's place).
Why not just use an existing hash function?

Resources