In Ruby, can I make a reference to an array offset? - ruby

1. In Ruby, can I do something C-like, like this (with my made-up operator '&'):
a = [1,2,3,4] and b = &a[2], b => [3,4], and if I set b[0] = 99, a => [1,2,99,4]?
2. If the elements of an array are integers, does Ruby necessarily store them consecutively in a
contiguous part of memory? I'm guessing "no", that only addresses are stored, integers being
objects, like everything else in Ruby.
3. If the answer to #2 is "yes" (which I doubt), is there a way to efficiently shift blocks of
memory, as one can do in C, for example?

There is no such functionality built into Ruby (Ruby arrays are not built of cons cells, and taking the address is much lower level than Ruby operates), though honestly it would not be hard to write something like that.
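As a rough illustration of what "writing something like that" could look like, here is a minimal sketch of a write-through view onto an array. The class name ArrayView and its interface are made up for this example and are not part of Ruby; it only handles non-negative indices.

```ruby
# Hypothetical helper (not part of Ruby): a window onto another array
# that starts at a given offset and writes through to the original.
class ArrayView
  include Enumerable

  def initialize(array, offset)
    @array = array
    @offset = offset
  end

  # Number of elements visible through the view.
  def size
    [@array.size - @offset, 0].max
  end

  # Read from the backing array (non-negative indices only).
  def [](index)
    @array[@offset + index]
  end

  # Write to the backing array, so the change is visible in both.
  def []=(index, value)
    @array[@offset + index] = value
  end

  def each
    return to_enum(:each) unless block_given?
    size.times { |i| yield self[i] }
  end
end

a = [1, 2, 3, 4]
b = ArrayView.new(a, 2)
b.to_a     # => [3, 4]
b[0] = 99
a          # => [1, 2, 99, 4]
```

No elements are copied: the view only keeps a reference to the original array plus an offset, so both names always see the same data.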
To answer the second question: It wouldn't necessarily be a contiguous array of integers. MRI treats integers as immediate values (with the least significant bit as a flag indicating whether a word represents an integer or an object address), so it would probably store it that way. Other implementations do it their own way.

Related

What is Ruby's equivalent of Python's hash()?

Suppose I have an Array: ['a', 'b', 'c']. I want to record whether I have seen a particular array before.
I can put the array in a Set, but that is wasteful if I don't need to store the contents of the array, only that I have seen it before.
In Python, I could hash a tuple (i.e. hash(('a', 'b', 'c'))) and store the result in a set to achieve this. What is the way to do this in Ruby?
Ruby has #hash on most objects, including Array, but these values are not unique and will eventually collide.
For any serious use I'd strongly suggest using something like SHA2-256 or stronger as these are cryptographic hashes designed to minimize collisions.
For example:
require 'digest/sha2'
array = %w[ a b c ]
array.hash
# => 3218529217224510043 (this value varies between Ruby processes)
Digest::SHA2.hexdigest(array.inspect)
# => a 64-character hex string: the SHA-256 digest of '["a", "b", "c"]'
That value is going to be effectively unique. SHA2-256 collisions are extremely unlikely due to the sheer size of that hash: 256 bits vs. the 64-bit #hash value. That's not 4x stronger, it's 2^192 (roughly 6.2 octodecillion) times stronger. That number may as well be a "zillion" given how it has 57 zeroes in it.
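To connect this back to the original question, here is a minimal sketch of a "have I seen this array before?" check that stores only digests. The class name SeenArrays and the choice of inspect as the serialization are assumptions made for this example.

```ruby
require 'digest'
require 'set'

# Hypothetical helper: remembers arrays by their SHA-256 digest only,
# so the array contents themselves are never stored.
class SeenArrays
  def initialize
    @digests = Set.new
  end

  # Returns true if an array with the same inspect output was recorded
  # before; otherwise records it and returns false.
  def seen?(array)
    key = Digest::SHA256.digest(array.inspect)
    # Set#add? returns nil when the key was already present.
    @digests.add?(key).nil?
  end
end

tracker = SeenArrays.new
tracker.seen?(%w[a b c])  # => false (first time)
tracker.seen?(%w[a b c])  # => true
tracker.seen?(%w[ab c])   # => false (different array)
```

Each stored entry is a fixed 32 bytes regardless of how large the original array was.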

Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any "convenience methods" that I could use to try to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code, which takes in a string to hash, unpacks it into an array of Unicode code points, does computationally intensive math on each of those values, adds up all of the results, takes that sum modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
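If the exact output of the original function has to be preserved, one possible way to cut the cost is to reduce modulo 65,536 at every step instead of once at the end. The sketch below assumes Ruby 2.5 or later, where Integer#pow accepts a modulus as a second argument; the method names are made up for this example.

```ruby
MODULUS = 65_536

# The approach from the question, kept only for comparison.
def generate_hash_naive(string)
  sum = 0
  string.unpack('U*').each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  (sum % MODULUS).to_s(16)
end

# Same result, but intermediate values never grow beyond a few machine
# words because every power is computed modulo MODULUS.
def generate_hash_modular(string)
  sum = 0
  string.unpack('U*').each do |x|
    term = x.pow(2000, MODULUS) * (x + 2).pow(21, MODULUS) - (x + 5)**3
    # Ruby's % returns a non-negative result for a positive modulus.
    sum = (sum + term) % MODULUS
  end
  sum.to_s(16)
end

generate_hash_modular("hello world") == generate_hash_naive("hello world")  # => true
```

This works because addition, subtraction and multiplication all commute with taking the remainder, so reducing early does not change the final value.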
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
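For a fuller picture, here is a minimal sketch of how such a hash method might sit in a class so that instances work as Hash keys. The Point class and the multiplier 31 are illustrative choices only; Ruby truncates hash values that are too large before using them.

```ruby
# Hypothetical value object used only for illustration.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Hash lookups use eql?; objects that are eql? must have equal hashes.
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  # Cheap combination of the field hashes: a prime multiplier and an XOR.
  def hash
    (x.hash * 31) ^ y.hash
  end
end

lookup = { Point.new(1, 2) => :first }
lookup[Point.new(1, 2)]  # => :first
```

A common shorter idiom is to delegate to Array#hash, as in [x, y].hash, which leaves the mixing to Ruby.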
In your example, you are for some reason forcing the hash into the range of a 16-bit unsigned integer, when typically 32 bits are used, although if you are comparing on the Ruby side, this would be 31 bits due to how Ruby tags Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold still exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal with hardly any overhead, since comparing two symbols is the same as comparing two integers. There is a caveat to this method if you are comparing a vast number of strings. Before Ruby 2.2, once a symbol was created, it stayed alive for the life of the program: any additional strings equal to it returned the same symbol, but the memory of the symbol (just a few bytes) could not be reclaimed for as long as the program ran. Since Ruby 2.2, symbols created dynamically (for example with String#to_sym) can be garbage collected, but interning thousands and thousands of unique strings is still not free.
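The difference between comparing strings and comparing symbols can be seen in a short sketch (the variable names are arbitrary):

```ruby
s1 = ["block", "chain"].join("-")
s2 = "block-chain"

# Two String objects with equal contents: == has to compare the bytes.
strings_equal  = (s1 == s2)                    # => true
same_string    = s1.equal?(s2)                 # => false

# Equal strings always map to one and the same Symbol object,
# so the comparison is a plain identity check.
same_symbol    = s1.to_sym.equal?(s2.to_sym)   # => true
```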

How does Ruby differentiate VALUE with value and pointer?

For values such as true, nil, or small integers, Ruby performs an optimization. Instead of using VALUE as a pointer, it uses VALUE directly to store the data.
I wonder how Ruby makes a difference between these uses:
def foo(x)
...
with x that will be associated to VALUE. In low level terms, they are just a number. How can I tell whether or not a certain number is a pointer to an object? All that comes to my mind is to limit pointers to have the MSB set to 0, and direct values with MSB equal to 1. But this is just my guess. How is it done in Ruby?
There are many different implementations of Ruby. The Ruby Language Specification doesn't prescribe any particular internal representation for objects – why should it? It's an internal representation, after all!
For example, JRuby doesn't represent objects as C pointers at all, it represents them as Java objects. IronRuby represents them as .NET objects. Opal represents them as ECMAScript objects. MagLev represents them as Smalltalk objects.
However, there are indeed some implementations that use the strategy you describe. The now abandoned MRI did it that way, YARV and Rubinius also do it.
This is actually a very old trick, dating back to at least the 1960s. It's called a tagged pointer representation, and like the name suggests, you need to tag the pointer with some additional metadata in order to know whether or not it is actually a pointer to an object or an encoding of some other datatype.
Some CPUs have special tag bits specifically for that purpose. (For example, on the AS/400, the CPU doesn't even have pointers, it has 128bit object references, even though the original CPU was only 48bit wide, and the newer POWER-based CPUs 64 bit; the extra bits are used to encode all sorts of metadata like type, owner, access restrictions, etc.) Some CPUs have tag bits for other purposes that can be "abused" for this purpose. However, most modern mainstream CPUs don't have tag bits.
But, you can use a trick! On many modern CPUs, unaligned memory accesses (accessing an address that does not start at a word boundary) are really slow (on some, they aren't even possible at all), which means that on a 32bit CPU, all pointers that are realistically being used, end with two 00 bits and on 64 bit CPUs with three 000 bits. You can use these bits as tag bits: pointers that end with 00 are indeed pointers, pointers that end with 01, 10, or 11 are an encoding of some other data type.
In MRI, the pointers ending in 1 were used to encode 31/63 bit Fixnums. In YARV, they are used to encode 31/63 bit Fixnums, i.e. integers that are encoded as actual machine integers according to the formula 2n+1 (arithmetically speaking) or (n << 1) | 1 (as a bit pattern). On 64 bit platforms, YARV also uses pointers that end in 10 to encode 62 bit flonums using a similar scheme. (If you ever wondered why the object_id of a Fixnum in YARV is 2n+1, now you know: YARV uses the memory address for the object ID, and 2n+1 is the "memory address" of n.)
Now, what about nil, false and true? Well, there is no space for them in our current scheme. However, the very low memory addresses are usually reserved for the operating system kernel, which means that a pointer like 0 or 2 or 4 cannot realistically occur in a program. YARV uses that space to encode nil, false and true: false is encoded as 0 (which is convenient because that's also the encoding of false in C), nil is encoded as 0b1000 and true is encoded as 0b10100 (false, true and nil used to be 0, 0b10 and 0b100, respectively, in older versions before the introduction of flonums).
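The tagging scheme described so far can be modelled in a few lines. This is a simplified sketch of the 64-bit layout with flonums, written in Ruby purely for illustration; the real checks live in the VM's C code and cover more cases (symbols, for instance), and all names below are made up.

```ruby
# Special constants as described above (64-bit build with flonums).
FALSE_WORD = 0b00000
NIL_WORD   = 0b01000
TRUE_WORD  = 0b10100

# A Fixnum n is stored as the machine word 2n + 1.
def encode_fixnum(n)
  (n << 1) | 1
end

def decode_fixnum(word)
  word >> 1
end

# Decide what a raw machine word represents in this simplified model.
def classify(word)
  return :fixnum if (word & 0b1) == 0b1
  return :flonum if (word & 0b11) == 0b10
  case word
  when FALSE_WORD then :false
  when NIL_WORD   then :nil
  when TRUE_WORD  then :true
  else :pointer
  end
end

encode_fixnum(5)        # => 11
decode_fixnum(11)       # => 5
classify(11)            # => :fixnum
classify(NIL_WORD)      # => :nil
classify(0x7f00_1000)   # => :pointer (aligned address, low bits are 000)
```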
Theoretically, there is a lot of space there to encode other objects as well, but YARV doesn't do that. Some Smalltalk or Lisp VMs, for example, encode ASCII or BMP Unicode character objects there, or some often used objects such as the empty list, empty array, or empty string.
There is still some piece missing, though: without an object header, with just the bare bit pattern, how can the VM access the class, the methods, the instance variables, etc.? Well, it can't. Those have to be special-cased and hardcoded into the VM. The VM simply has to know that a pointer ending in 1 is an encoded Fixnum and has to know that the class is Fixnum and the methods can be found there. And as for instance variables? Well, you could store them separately from the objects in a dictionary on the side. Or you go the Ruby route and simply disallow them altogether.
This answer is merely a distillation of @Jörg's always-excellent treatise.
In MRI, true, false, nil and Fixnums are mapped to fixed object_id's; all other objects are assigned dynamically-generated values. The object_id for false is 0. For true and nil they are 20 and 8 (2 and 4 prior to v2.0), respectively. The integer i has object_id i*2+1. Dynamically-generated object_id's cannot be any of these values. Therefore, (in MRI) one can merely check to see if the object_id is one of these values to determine if the associated object has a fixed object_id.
Incidentally, objects can be obtained from their object_id's with the method ObjectSpace._id2ref.
For more on this, see @sepp2k's answer here.

How is an array stored in memory?

In an interest to delve deeper into how memory is allocated and stored, I have written an application that can scan memory address space, find a value, and write out a new value.
I developed a sample application with the end goal of being able to programmatically locate my array and overwrite it with a new sequence of numbers. In this situation, I created a single-dimensional array with 5 elements, e.g.
int[] array = new int[] {8,7,6,5,4};
I ran my application and searched for a sequence of the five numbers above. I was looking for any value that fell between 4 and 8, for a total of 5 numbers in a row. Unfortunately, my sequential numbers within the array matched hundreds of results, as the numbers 4 through 8, in no particular sequence happened to be next to each other, in memory, in many situations.
Is there any way to distinguish that a set of numbers within memory represents an array, not simply integers that are next to each other? Is there any way of knowing that if I find a certain value, the matching values following it are those of an array?
I would assume that when I declare int[] array, it's pointing at the first address of my array, which would provide some kind of meta-data about what exists in the array, e.g.
0x123456789 meta-data, 5 - 32 bit integers
0x123456789 + 32 "8"
0x123456789 + 64 "7"
0x123456789 + 96 "6"
0x123456789 + 128 "5"
0x123456789 + 160 "4"
Am I way off base?
Debug + Windows + Memory + Memory 1, set the Address field to "array". You'll see this when you switch the view to "4-byte Integer":
0x018416BC 6feb2c84 00000005 00000008 00000007 00000006 00000005 00000004
The first address is the address of the object in the garbage collected heap, plus the part of the object header that's at a negative offset (syncblk index). You cannot guess this value, the GC moves it around. The 2nd hex number is the 'type handle' for the array type (aka method table pointer). You cannot guess this value, type handles are created by the CLR on demand. The 3rd number is the array length. The rest of them are the array element values.
The odds of reliably finding this array back at runtime without a debugger are quite low. There isn't much point in trying.
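To make the layout above concrete, here is a small simulation, written in Ruby rather than C# and operating on a made-up byte buffer instead of real process memory. It shows why searching for the length word followed by the elements is more selective than searching for the elements alone; the extra words after the first row are invented filler.

```ruby
# Simulated memory as little-endian 32-bit words. The first seven words
# mimic the dump above (type handle, length, five elements); the rest is
# unrelated filler that also happens to contain values from 4 to 8.
words  = [0x6feb2c84, 5, 8, 7, 6, 5, 4, 7, 4, 8, 6, 5, 1, 8, 7, 6]
memory = words.pack('V*')

elements = [8, 7, 6, 5, 4]
# Length word followed by the element values, as in the dump.
pattern = ([elements.size] + elements).pack('V*')

offset = memory.index(pattern)  # => 4 (byte offset of the length word)
```

Even with this narrower pattern, the match only tells you where the data was at the moment of the scan; as the answer notes, the garbage collector is free to move it.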
Don't. The array is stored on the heap and is subject to relocation by the garbage collector. You have to use fixed if you need to make sure the memory is not moved, in which case you can use it, but only very carefully.
If you are after high-performance arrays, use stackalloc and use your code scheme.
I don't know exactly, but this article seems to suggest that you can get a pointer to your array, with which I would think you can determine the actual address.
Although I see you are using C# and, presumably, .NET, most of your question is in very general terms about memory. Keep in mind that, in the most general sense, all memory is just bits, whether that memory holds an array, strings, or code.
With that in mind, unless you can find tell-tale signs of your current platform's way of allocating different data types, there is no difference between memory that contains arrays, strings, or code.
Also, I wouldn't make any assumptions about if an array "points" to the first item in the array. Perhaps someone else can address this issue specifically, but I would assume some sort of header is involved.
Memory is not always stored contiguously. If you can ensure that it is, what you are asking is possible.

Array size too big - ruby

I am getting a 'ArgumentError: array size too big' message with the following code:
MAX_NUMBER = 600_000_000
my_array = Array.new(MAX_NUMBER)
Question. What is the max value that the Array.new function takes in Ruby?
An array with 500 million elements takes about 2 GB (500 million slots of 4 bytes each on a 32-bit build), which, depending on the specific OS you are using, is typically the maximum that a process can address. In other words: your array is bigger than your address space.
So, the solutions are obvious: either make the array smaller (by, say, breaking it up in chunks) or make the address space bigger (in Linux, you can patch the kernel to get 3, 3.5 and even 4 GiByte of address space, and of course switching to a 64 bit OS and a 64 bit Ruby implementation(!) would also work).
Alternatively, you need to rethink your approach. Maybe use mmap instead of an array, or something like that. Maybe lazy-load only the parts you need.
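The size estimate in this answer can be checked with a bit of arithmetic. The sketch below assumes a 32-bit Ruby build, where each array slot holds one 4-byte VALUE, and a 2 GiB limit on user address space; both constants are assumptions taken from the reasoning above, not values queried from the system.

```ruby
BYTES_PER_SLOT = 4            # assumption: 32-bit build, one VALUE per element
ADDRESS_LIMIT  = 2 * 1024**3  # assumption: 2 GiB of addressable space

# Bytes needed for the element slots alone (ignores the array header).
def array_bytes(elements)
  elements * BYTES_PER_SLOT
end

array_bytes(500_000_000)                  # => 2000000000
array_bytes(600_000_000)                  # => 2400000000
array_bytes(600_000_000) > ADDRESS_LIMIT  # => true
```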
