How does Ruby differentiate VALUE with value and pointer? - ruby

For values such as true, nil, or small integers, Ruby does optimization. Instead of using VALUE pointer as a pointer, it directly uses VALUE to store data.
I wonder how Ruby makes a difference between these uses:
def foo(x)
...
with x that will be associated to VALUE. In low level terms, they are just a number. How can I tell whether or not a certain number is a pointer to an object? All that comes to my mind is to limit pointers to have the MSB set to 0, and direct values with MSB equal to 1. But this is just my guess. How is it done in Ruby?

There are many different implementations of Ruby. The Ruby Language Specification doesn't prescribe any particular internal representation for objects – why should it? It's an internal representation, after all!
For example, JRuby doesn't represent objects as C pointers at all, it represents them as Java objects. IronRuby represents them as .NET objects. Opal represents them as ECMAScript objects. MagLev represents them as Smalltalk objects.
However, there are indeed some implementations that use the strategy you describe. The now abandoned MRI did it that way, YARV and Rubinius also do it.
This is actually a very old trick, dating back to at least the 1960s. It's called a tagged pointer representation, and like the name suggests, you need to tag the pointer with some additional metadata in order to know whether or not it is actually a pointer to an object or an encoding of some other datatype.
Some CPUs have special tag bits specifically for that purpose. (For example, on the AS/400, the CPU doesn't even have pointers, it has 128bit object references, even though the original CPU was only 48bit wide, and the newer POWER-based CPUs 64 bit; the extra bits are used to encode all sorts of metadata like type, owner, access restrictions, etc.) Some CPUs have tag bits for other purposes that can be "abused" for this purpose. However, most modern mainstream CPUs don't have tag bits.
But, you can use a trick! On many modern CPUs, unaligned memory accesses (accessing an address that does not start at a word boundary) are really slow (on some, they aren't even possible at all), which means that on a 32bit CPU, all pointers that are realistically being used, end with two 00 bits and on 64 bit CPUs with three 000 bits. You can use these bits as tag bits: pointers that end with 00 are indeed pointers, pointers that end with 01, 10, or 11 are an encoding of some other data type.
In MRI, the pointers ending in 1 were used to encode 31/63 bit Fixnums. In YARV, they are used to encode 31/63 bit Fixnums, i.e. integers that are encoded as actual machine integers according to the formula 2n+1 (arithmetically speaking) or (n << 1) | 1 (as a bit pattern). On 64 bit platforms, YARV also uses pointers that end in 10 to encode 62 bit flonums using a similar scheme. (If you ever wondered why the object_id of a Fixnum in YARV is 2n+1, now you know: YARV uses the memory address for the object ID, and 2n+1 is the "memory address" of n.)
Now, what about nil, false and true? Well, there is no space for them in our current scheme. However, the very low memory addresses are usually reserved for the operating system kernel, which means that a pointer like 0 or 2 or 4 cannot realistically occur in a program. YARV uses that space to encode nil, false and true: false is encoded as 0 (which is convenient because that's also the encoding of false in C), nil is encoded as 0b1000 and true is encoded as 0b10100 (it used to be 0, 0b10 and 0b100 in older versions before the introduction of flonums).
Theoretically, there is a lot of space there to encode other objects as well, but YARV doesn't do that. Some Smalltalk or Lisp VMs, for example, encode ASCII or BMP Unicode character objects there, or some often used objects such as the empty list, empty array, or empty string.
There is still some piece missing, though: without an object header, with just the bare bit pattern, how can the VM access the class, the methods, the instance variables, etc.? Well, it can't. Those have to be special-cased and hardcoded into the VM. The VM simply has to know that a pointer ending in 1 is an encoded Fixnum and has to know that the class is Fixnum and the methods can be found there. And as for instance variables? Well, you could store them separately from the objects in a dictionary on the side. Or you go the Ruby route and simply disallow them altogether.

This answer is merely a distillation of #Jörg always-excellent treatise.
In MRI, true, false, nil and Fixnums are mapped to fixed object_id's; all other objects are assigned dynamically-generated values. The object_id for false is 0. For true and nil they are 20 and 8 (2 and 4 prior to v2.0), respectively. The integer i has object_id i*2+1. Dynamically-generated object_id's cannot be any of these values. Therefore, (in MRI) one can merely check to see if the object_id is one of these values to determine if the associated object has a fixed object_id.
Incidentally, objects can be obtained from their object_id's with the method ObjectSpace#_id2ref.
For more on this, see #sepp2k's answer here.

Related

Size of numeric data types in Ruby

I know that Ruby has Float for real, Fixnum and Bignum for int.
But what about sizes of this types?
a = 1.23 // size of one Float in bytes?
b = 1 // size of one Fixnum in bytes?
c = 2**65 // = size of one Bignum in bytes?
I am trying to find a standard or specification
I know that Ruby has Float for real, Fixnum and Bignum for int.
This is not true.
Float does not represent real numbers, it represents floating point numbers. In fact, in the general case, representing real numbers in a physical computer is impossible, as it requires unbounded memory.
Fixnum and Bignum are not part of Ruby. Ruby only has Integer. The Ruby Specification allows different implementations to have implementation-specific subclasses, but these are then specific to that particular implementation (e.g. YARV, Opal, TruffleRuby, Artichoke, JRuby, IronRuby, MRuby, etc.), they don't have anything to do with Ruby.
In fact, even knowing the implementation is not enough, you have to know the exact version. For example, YARV had Fixnum and Bignum as subclasses in the past, but now it doesn't anymore, it only has Integer. And even back when it had them, that was still not enough, because they actually had different sizes depending on the platform.
But what about sizes of this types?
a = 1.23 // size of one Float in bytes?
b = 1 // size of one Fixnum in bytes?
c = 2**65 // = size of one Bignum in bytes?
I am trying to find a standard or specification
Here is what the ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification has to say on the matter [bold emphasis mine]:
15.2.8 Integer
15.2.8.1 General description
Instances of the class Integer represent integers. The ranges of these integers are unbounded. However the actual values computable depend on resource limitations, and the behavior when the resource limits are exceeded is implementation-defined.
[…]
Subclasses of the class Integer may be defined as built-in classes. In this case:
The class Integer shall not have its direct instances. Instead of a direct instance of the class Integer, a direct instance of a subclass of the class Integer shall be created.
Instance methods of the class Integer need not be defined in the class Integeritself if the instance methods are defined in all subclasses of the class Integer.
For each subclass of the class Integer, the ranges of the values of its instances may be bounded.
The ISO Ruby Language Specification does not mandate any particular size or representation for Integers, nor does it specify any methods for querying this information.
15.2.9 Float
15.2.9.1 General description
Instances of the class Float represent floating-point numbers.
The precision of the value of an instance of the class Float is implementation-defined; however, if the underlying system of a conforming processor supports IEC 60559, the representation of an instance of the class Float shall be the 64-bit double format as specified in IEC 60559, 3.2.2.
The ISO Ruby Language Specification does not mandate any particular size or representation for Floats, unless the underlying system supports ISO/IEC 60559, in which case the representation must be as an ISO/IEC 60559 binary64 double float. The ISO Ruby Language Specification does not specify any methods for querying this information.
The ruby/spec does not say anything about the size or precision of Float. In fact, it is very careful to not say anything. For example, if you look at the spec for Float#prev_float, you can see that they are very careful to specify the behavior of Float#prev_float without ever referring to the actual precision.
The ruby/spec for Integer#size does say something about the sizes of specific machine integers in bytes. However, unfortunately, ruby/spec is kind of a mixture between specifications for the behavior of the Ruby Programming Language and the behavior of the YARV Ruby Implementation. I have the feeling that this spec is more like the latter.
For example, the cutoff point between fixnums and bignums in YARV is 231 on 32 bit platforms and 263 on 64 bit platforms, but on JRuby, it is 264 on both 32 bit and 64 bit platforms (and I think TruffleRuby is the same). So, 3000000000 will be a bignum on 32 bit YARV, but a fixnum on 64 bit YARV and JRuby, and 10000000000000000000 will be a bignum on both 32 bit and 64 bit YARV, but will be a fixnum even on 32 bit JRuby. For Opal, I think the cutoff point is different again, I think it is 253. Other implementations may not even distinguish between fixnums and bignums at all, or they may have three or more different kinds of integers.
Also, it is very important to remember that this spec only specifies the return value of the method Integer#size. Nowhere does it say that this is actually how an Integer must be represented in memory.
By the way, you may be confused why I am talking about fixnums and bignums in YARV, when I said above that Fixnum and Bignum have been removed from YARV. Well, the reason is that the separate classes have been removed, but the optimization is still there. Which is another thing: the ISO Ruby Language Specification says that you are allowed to have implementation-specific subclasses of Integer, but it doesn't say what those classes are for. Neither does the ISO Ruby Language Specification force implementors to have optimized implementations for their subclasses, nor does it force implementors to have subclasses for their implementation-specific optimizations.
Compare that to YARV's flonums which are an optimized representation of 62 bit Floats, but they don't have their own class.
So, in (not so) short: the ISO Ruby Language Specification does not say anything about the sizes of Integers, but it does say that if the underlying system supports ISO/IEC 60559, then the representation must be an ISO/IEC 60559 binary64 double float. It does, however, say nothing about the size or representation for cases where the underlying system does not support ISO/IEC 60559.
The ruby/spec is careful not to specify anything about the sizes of Floats, but it does specify the return value of the Integer#size method. It does, however, not specify that this return value must in any way correspond to how Integers are actually represented.
Here's what the RDoc for Integer#size in YARV has to say [bold emphasis mine]:
size → int
Returns the number of bytes in the machine representation of int (machine dependent).
So, it only says that it returns the number of bytes the Integer would have in the machine representation, but not that this is necessarily the way that it is actually represented. And it clearly states that the value is machine dependent.
Fixnum has the size method
2**65.size # => 256
1.size # => 8
Float - from this SO post and also what Stefan pointed out in the comments, they're 64 bits.
String has the bytesize method
"test".bytesize # 4

Why can't ruby use most of the 2^X numbers as object ids?

ObjectSpace._id2ref gives us the object from the Ruby's Object Space, it has an object against id in sequence starting from 0, however, if we try to see object on id 4 it gives an error as
2.6.3 :121 > ObjectSpace._id2ref(4)
Traceback (most recent call last):
2: from (irb):121
1: from (irb):121:in `_id2ref'
RangeError (0x0000000000000004 is not id value)
Also, I figured that it's the same behaviour for 2^x values(except 1, 2, 8).
(0..10).each do |exp|
object_id = 2**exp
begin
puts "Number: #{object_id} : #{ObjectSpace._id2ref(object_id)}"
rescue Exception => e
puts "Number: #{object_id} : #{e.message}"
end
end
Number: 1 : 0
Number: 2 : 2.0
Number: 4 : 0x0000000000000004 is not id value
Number: 8 : nil
Number: 16 : 0x0000000000000010 is not id value
Number: 32 : 0x0000000000000020 is not id value
Number: 64 : 0x0000000000000040 is not id value
Number: 128 : 0x0000000000000080 is not symbol id value
Number: 256 : 0x0000000000000100 is not id value
Number: 512 : 0x0000000000000200 is not id value
Number: 1024 : 0x0000000000000400 is not id value
Why can't ruby use these specific numbers as object ids?
Also, what's different for (1,2,8)? and why error is different for 128?
First, it is very important to make a couple of things crystal clear:
There are exactly two guarantees Ruby makes about object IDs. These two guarantees are the only thing you are allowed to rely on. You must not make any assumptions about object IDs other than these two guarantees:
An object has the same ID for its entire lifetime.
No two objects have the same ID at the same time.
[Note: this means in particular that different objects can have the same ID at different times, i.e. that IDs can be recycled.]
An object ID is an opaque identifier. You must not make any assumptions about its structure or about any particular value.
Any particular implementation of object IDs is a private internal implementation detail of a specific version of a specific implementation running in a specific environment at a specific moment. There is no guarantee that the results will be the same with a different implementation. There is no guarantee that the results will be the same with a different version of the same implementation. There is no guarantee that the results will be the same with the same version of the same implementation running in a different environment. In fact, there is not even a guarantee that the results will be the same between two runs of the same code on the same version of the same implementation in the same environment.
ObjectSpace::_id2ref is an abomination. It should not even exist. It most certainly should not be used. It breaks object-orientation, it breaks encapsulation, it breaks safety.
Just as an example: unfortunately, you don't say which version of which implementation you are running in which environment. However, it looks like you are running YARV 2.6.3 in a 64-bit environment.
If you were to run that exact same code on the exact same version of YARV in a 32-bit environment, you would get different results. If you were to run that exact same code on an older version of YARV (pre-2.0) in the exact same environment, you would get different results.
Let's address the first, implicit, assumption which I think I see in your question. You seem to think that any ID should resolve to an object. It's easy to see that this cannot be true: there are infinitely many IDs, but for every run of a program, there are only finitely many objects, so there will always be infinitely many IDs which don't resolve to an object.
This already explains most of your results, namely the ones for 4, 16, 32, 64, 256, 512, and 1024.
So, with that out of the way, here's a high-level explanation of why there seems to some sort of structure to the IDs, and what that structure is. (But let me remind you again, that this explanation only applies to 64 bit systems, not to 32 bit, it only applies to YARV, it only applies to versions of YARV 2.0 or newer, and it is quite possible that it will no longer apply to YARV 3.0.)
In YARV, the developers made the decision that the object ID is the same thing as the memory address of the object header. This makes it easy to ensure the "rules" of object IDs: you can't have multiple objects at the same memory address at the same time, and an object will not change its memory address.
(Actually, it turns out that the second one is already a quite severe restriction: many modern high-performance garbage collectors depend on being able to move objects around in memory. This is not possible if you assume that object ID == memory address. Which means you will not be able to use any of those high-performance algorithms.)
On pretty much all modern machines, memory access is word-aligned. While it is possible to address individual bytes, that is generally slower or more awkward. So, we can basically assume that if we allocate memory, it will be aligned on a word-boundary. Which means that all memory addresses will be divisible by 8 on 64-bit systems and 4 on 32-bit systems, or in other words, that all memory addresses will end in 3 (64-bit) or 2 (32-bit) zero bits. Or, in other words: 87.5% (75%) of the address space are unused.
On the other hand, it would be quite a waste to represent Integers as a full-blown Ruby object:
They are immutable, which means we don't have to store any state.
They can't have instance variables, which means we don't have to store an instance variable table.
They can't have a singleton class, which means we don't have to store a __klass__ pointer.
They can't be extended.
And so on …
What this means, is that we can optimize the representation of Integers by not storing them as objects at all. All we need is some special case in the engine, so that if someone asks for the class of, say, 42, instead of trying to look at 42's __klass__ pointer, the engine "magically" knows to just return the Integer class.
Once we have that in place, we can do a really cool trick, which is actually as old as the very first LISP and Smalltalk VMs, and it is called a tagged pointer representation. Normally, the value of a variable is a pointer to the object (header), but with a tagged pointer representation we can store the value of the object inside the pointer to the object itself!
All we need to do is to have some sort of tag on the pointer that tells the engine that this is actually not a pointer but a value disguised as a pointer. In some older machines, especially those specifically designed for running high-level languages, pointers did have a tag field specifically for holding, e.g. type information or access control. Modern machines don't have that, but we have those unused bits we can (ab)use as tag bits.
And that is what YARV is doing: When the last bit of a pointer is 1, then it's not actually a pointer, it's an Integer. In particular, an Integer is encoded in YARV by shifting it one bit to the left and setting the last bit to 1. This allows us to encode a 63-bit Integer in a 64-bit pointer, and do native integer arithmetic at it with no object overhead and only a little bit of bit shifting overhead.
And if you think about what this encoding means:
shifting one bit to the left is equivalent to multiplying by two
setting the last bit to 1 is equivalent to incrementing by 1
Then you can explain the first pattern: a small Integer with value n is encoded as the "quasi-pointer" 2n + 1, and since "memory address" and object ID are the same in YARV (even though this is not actually a memory address, because there is no object which could have an address), it will have the object ID 2n + 1.
Integers that don't fit into 63 bit (31 bit), are allocated as objects like any other object. In different engines, these have different names, e.g. in the Smalltalk-80 VM, they are called SmallInts, in YARV, they are called Fixnums (and the ones that don't fit into a Fixnum are called Bignums). They actually used to be different subclasses of a fully-abstract Integer class in older versions of YARV, but this was considered a mistake. (It's really an internal optimization and should not be visible to the programmer.) In current versions of YARV, Fixnum and Bignum are aliases for Integer and using them gives a deprecation warning.
This explains your result for 1. If you had tried out ObjectSpace._id2ref(3), the result would have 1, then ObjectSpace._id2ref(5) would be 2, and so on.
And we still are using only 62.5% of the address space (on a 64-bit system)!
So, let's think about what else we might want to represent in this way.
YARV has a very similar optimization for Floats. Floating point numbers that fit into 62-bits are called flonums and are represented similar, with a tag of 10 at the end. (YARV does not use flonums on 32-bit platforms.)
This explains your result for ObjectSpace._id2ref(2). If you had tried ObjectSpace._id2ref(6), the result would have been -2.0.
And a similar trick is also played for Symbols. I won't explain it here in detail, because a) I don't actually fully know how it works, and b) it is slightly more complex, because the value being encoded isn't directly the Symbol value, rather it is an index into the Symbol table. However, that explains your result for 128.
Now, lastly, there is a completely different part of the address space that is also unused: the low addresses. On most modern Operating Systems, the low addresses are reserved for mapping the kernel memory directly into the user process in order to speed up the user space ↔︎ kernel space transition. Plus, there is another reason the very low addresses are kept free: in C, it is illegal to dereference a NULL pointer. Now, one way of implementing this, would be for the runtime to track all pointer dereferences and check whether they are dereferencing the NULL pointer. But there is an easier way: just give the NULL pointer an actual memory address, but one that is never allocated. That way, you don't have to do anything: if the code tries to dereference the pointer, the address doesn't exist, and the MMU will take care of raising an error. So, most C compilers compile the NULL pointer to the actual memory address 0, and in order to make sure that there is never any real data allocated at that address, they keep a whole area around address 0 free.
This means that the low addresses are never used, and we can (ab)use them to represent even more "interesting" objects. Now, YARV uses the very low addresses to represent the following objects:
false at address 0, which has the additional advantage that 0 is considered false in C.
nil at address 8 (4 in 32-bit).
true at address 20 (2 in 32-bit).
Qundef (a special internal value inside the engine that denotes an undefined value) at address 52 (6 in 32-bit).
And that explains your number 8.
This also means that your 4, 16, 32, 64, 256, 512, and 1024 will probably never resolve to an object, because they are in the low address range where the C library will simply never allocate memory.
As a closing remark, I want to repeat one last time that all of this is a private internal implementation detail of a specific version of YARV running in a specific environment. You must not rely on any of this, ever.
When flonums were introduced in YARV, and on some platforms nil no longer had object ID 4, this did break some code, and it did cause some confusion (as evidenced e.g. by questions on Stack Overflow), even though the YARV developers are allowed to change object IDs at will, because there are no guarantees being made about any particular ID values or the structure of IDs. Please, do not make the same mistake.

Patterns on Fixnum object_ids in Ruby?

The object_id of 0 is 1, of 1 is 3, of 2 is 5.
Why is this pattern like this? What is behind the Fixnums that make to create that pattern of object_ids? I would expect that if 0 has id 1, 1 has id 2, 2 has id 3.. And so on.
What am I missing?
First things first: the only thing that the Ruby Language Specification guarantees about object_ids is that they are unique in space. That's it. They aren't even unique in time.
So, at any given time there can only be one object with a specific object_id at the same time, however, at different times, object_ids may be reused for different objects.
To be fully precise: what Ruby guarantees is that
object_id will be an Integer
no two objects will have the same object_id at the same time
an object will have the same object_id over its entire lifetime
What you are seeing is a side-effect of how object_id and Fixnums are implemented in YARV. This is a private internal implementation detail of YARV that is not guaranteed in any way. Other Ruby implementations may (and do) implement them differently, so this is not guaranteed to be true across Ruby implementations. It is not even guaranteed to be true across different versions of YARV, or even for the same version on different platforms.
And in fact, it actually did change quite recently, and it is different between 32-bit and 64-bit platforms.
In YARV, object_id is simply implemented as returning the memory address of the object. That's one piece of the puzzle.
Nut, why are the memory addresses of Fixnums so regular? Well, actually, in this case, they aren't memory addresses! YARV uses a special trick to encode some objects into pointers. There are some pointers which aren't actually being used, and so you can use them to encode certain things.
This is called a tagged pointer representation, and is a pretty common optimization trick used in many different interpreters, VMs and runtime systems for decades. Pretty much every Lisp implementation uses them, many Smalltalk VMs, many Ruby interpreters, and so on.
Usually, in those languages, you always pass around pointers to objects. An object itself consists of an object header, which contains object metadata (like the type of an object, its class(es), maybe access control restrictions or security annotations and so on), and then the actual object data itself. So, a simple integer would be represented as a pointer plus an object consisting of metadata and the actual integer. Even with a very compact representation, that's something like 6 Byte for a simple integer.
Also, you cannot pass such an integer object to the CPU to perform fast integer arithmetic. If you want to add two integers, you really only have two pointers, which point to the beginning of the object headers of the two integer objects you want to add. So, you first need to perform integer arithmetic on the first pointer to add the offset into the object to it where the integer data is stored. Then you have to dereference that address. Do the same again with the second integer. Now you have two integers you can actually ask the CPU to add. Of course, you need to now construct a new integer object to hold the result.
So, in order to perform one integer addition, you actually need to perform three integer additions plus two pointer dererefences plus one object construction. And you take up almost 20 bytes.
However, the trick is that with so-called immutable value types like integers, you usually don't need all the metadata in the object header: you can just leave all that stuff out, and simply synthesize it (which is VM-nerd-speak for "fake it"), when anyone cares to look. A fixnum will always have class Fixnum, there's no need to separately store that information. If someone uses reflection to figure out the class of a fixnum, you simply reply Fixnum and nobody will ever know that you didn't actually store that information in the object header and that in fact, there isn't even an object header (or an object).
So, the trick is to store the value of the object within the pointer to the object, effectively collapsing the two into one.
There are CPUs which actually have additional space within a pointer (so-called tag bits) that allow you to store extra information about the pointer within the pointer itself. Extra information like "this isn't actually a pointer, this is an integer". Examples include the Burroughs B5000, the various Lisp Machines or the AS/400. Unfortunately, most of the current mainstream CPUs don't have that feature.
However, there is a way out: most current mainstream CPUs work significantly slower when addresses aren't aligned on word boundaries. Some even don't support unaligned access at all.
What this means is that in practice, all pointers will be divisible by 4 (on a 32-bit system, 8 on a 64-bit system), which means they will always end with two (three on a 64-bit system) 0 bits. This allows us to distinguish between real pointers (that end in 00) and pointers which are actually integers in disguise (those that end with 1). And it still leaves us with all pointers that end in 10 free to do other stuff. Also, most modern operating systems reserve the very low addresses for themselves, which gives us another area to mess around with (pointers that start with, say, 24 0s and end with 00).
So, you can encode a 31-bit (or 63-bit) integer into a pointer, by simply shifting it 1 bit to the left and adding 1 to it. And you can perform very fast integer arithmetic with those, by simply shifting them appropriately (sometimes not even that is necessary).
What do we do with those other address spaces? Well, typical examples include encoding floats in the other large address space and a number of special objects like true, false, nil, the 127 ASCII characters, some commonly used short strings, the empty list, the empty object, the empty array and so on near the 0 address.
In YARV, integers are encoded the way I described above, false is encoded as address 0 (which just so happens also to be the representation of false in C), true as address 2 (which just so happens to be the C representation of true shifted by one bit) and nil as 4.
In YARV, the following bit patterns are used to encode certain special objects:
xxxx xxxx … xxxx xxx1 Fixnum
xxxx xxxx … xxxx xx10 flonum
0000 0000 … 0000 1100 Symbol
0000 0000 … 0000 0000 false
0000 0000 … 0000 1000 nil
0000 0000 … 0001 0100 true
0000 0000 … 0011 0100 undefined
Fixnums are 63-bit integers that fit into a single machine word, flonums are 62-bit Floats that fit into a single machine word. false, nil and true are what you would expect, undefined is a value that is only used inside the implementation but not exposed to the programmer.
Note that on 32-bit platforms, flonums aren't used (there's no point in using 30-bit Floats), and so the bit patterns are different. nil.object_id is 4 on 32-bit platforms, not 8 like on 64-bit platforms, for example.
So, there you have it:
certain small integers are encoded as pointers
pointers are used for object_ids
Therefore
certain small integers have predictable object_ids
For Fixnum i, the object_id is i * 2 + 1.
For object_id of 0, 2, 4, what are them? They are false, true, nil in ruby.

XChangeProperty for an atom property on a system where Atom is 64 bits

The X11 protocol defines an atom as a 32-bit integer, but on my system, the Atom type in is a typedef for unsigned long, which is a 64-bit integer. The manual for Xlib says that property types have a maximum size of 32 bits. There seems to be some conflict here. I can think of three possible solutions.
If Xlib treats properties of type XA_ATOM as a special case, then you can simply pass 32 for 'format' and an array of atoms for 'data'. This seems unclean and hackish, and I highly doubt that this is correct.
The manual for Xlib appears to be ancient. Since Atom is 64 bits long on my system, should I pass 64 for the 'format' parameter even though 64 is not listed as an allowed value?
Rather than an array of Atoms, should I pass an array of uint32_t values for the 'data' parameter? This seems like it would most likely be the correct solution to me, but this is not what they did in some sources I've looked up that use XChangeProperty, such as SDL.
SDL appears to use solution 1 when setting the _NET_WM_WINDOW_TYPE property, but I suspect that this may be a bug. On systems with little endian byte order (LSB first), this would appear to work if the property has only one element.
Has anyone else encountered this problem? Any help is appreciated.
For the property routines you always want to pass an array of 'long', 'short' or 'char'. This is always true independent of the actual bit width. So, even if your long or atom is 64 bits, it will be translated to 32 bits behind the scenes.
The format is the number of server side bits used, not client side. So, for format 8, you must pass a char array, for format 16, you always use a short array and for format 32 you always use a long array. This is completely independent of the actual lengths of short or long on a given machine. 32 bit values such as Atom or Window always are in a 'long'.
This may seem odd, but it is for a good reason, the C standard does not guarantee types exist that have exactly the same widths as on the server. For instance, a machine with no native 16 bit type. However a 'short' is guaranteed to have at least 16 bits and a long is guaranteed to have at least 32 bits. So by making the client API in terms of 'short' and 'long' you can both write portable code and always have room for the full X id in the C type.

Hashing of pointer values

Sometimes you need to take a hash function of a pointer; not the object the pointer points to, but the pointer itself. Lots of the time, folks just punt and use the pointer value as an integer, chop off some high bits to make it fit, maybe shift out known-zero bits at the bottom. Thing is, pointer values aren't necessarily well-distributed in the code space; in fact, if your allocator is doing its job, there's an excellent chance they're all clustered close together.
So, my question is, has anyone developed hash functions that are good for this? Take a 32- or 64-bit value that's maybe got 12 bits of entropy in it somewhere and spread it evenly across a 32-bit number space.
This page lists several methods that might be of use. One of them, due to Knuth, is a simple as multiplying (in 32 bits) by 2654435761, but "Bad hash results are produced if the keys vary in the upper bits." In the case of pointers, that's a rare enough situation.
Here are some more algorithms, including performance tests.
It seems that the magic words are "integer hashing".
They'll likely exhibit locality, yes - but in the lower bits, which means objects will be distributed through the hashtable. You'll only see collisions if a pointer's address is a multiple of the hashtable's length from another pointer.
If you know the lowest possible pointer address (which is often the case if you're working within a large buffer), just convert the pointer to an integer by subtracting the lowest possible pointer value; eg. that could be the buffer's base address.
-Remember: pointer subtracted from pointer equals an offset (integer).
So: Don't "chop off" bits; it's much better to convert to an offset.
This will result in that the offset value is much smaller than a pointer value.
It may help further to shift the pointer value right twice (eg. divide by 4) in some cases as well, before hashing it.
The problem with pointers is often that small blocks of memory is likely to be allocated on the same address (eg. a block being freed and another block is taking the freed block's place).
Why not just use an existing hash function?

Resources