Array size too big - ruby

I am getting an 'ArgumentError: array size too big' message with the following code:
MAX_NUMBER = 600_000_000
my_array = Array.new(MAX_NUMBER)
Question: what is the max value that Array.new takes in Ruby?

An array with 600 million elements needs at least 4 bytes per element just for the object references, i.e. roughly 2.4 GB, which (depending on the specific OS you are using) is more than a 32-bit process can typically address. In other words: your array is bigger than your address space.
So, the solutions are obvious: either make the array smaller (by, say, breaking it up into chunks) or make the address space bigger (on Linux, you can patch the kernel to get 3, 3.5 and even 4 GiB of address space, and of course switching to a 64-bit OS and a 64-bit Ruby implementation(!) would also work).
Alternatively, you need to rethink your approach. Maybe use mmap instead of an array, or something like that. Maybe lazy-load only the parts you need.
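To illustrate the chunking idea, here is a minimal Ruby sketch. The chunk size and the process_chunk helper are made up for the example; substitute whatever work you actually do per slice:
MAX_NUMBER = 600_000_000
CHUNK_SIZE = 10_000_000

(0...MAX_NUMBER).step(CHUNK_SIZE) do |start|
  # Build only one slice at a time; each chunk can be garbage-collected
  # before the next one is created, so peak memory stays small.
  chunk = Array.new([CHUNK_SIZE, MAX_NUMBER - start].min) { |i| start + i }
  process_chunk(chunk)  # hypothetical helper standing in for your real work
end
An Enumerator::Lazy over the same range is the corresponding way to get the lazy-loading variant mentioned above.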

Related

A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1 GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen; my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing).
To do this I'm writing a sequence of 32-bit integers to the file as a background process while other code is running, like the following:
from struct import pack

with open('./FILE', 'rb+', buffering=0) as f:
    f.seek(0)
    counter = 1
    while counter < SIZE + 1:
        f.write(pack('>i', counter))
        counter += 1
Then, after I do some other stuff, it's very easy to see if we missed a write, since there will be a gap instead of the sequentially increasing sequence. This works well enough. My problem is that some data corruption cases might only be caught with random I/O (not sequential like this), based on how we track changes to files.
So what I need is a method for performing a single pass of random I/O over my 1 GB file, but I can't really store the order in memory, since 1 GB ≈ 250 million 4-byte integers. I considered chunking the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job, that would be awesome. Like this:
from struct import pack

def rand_index_generator():
    generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)  # pseudocode for the generator I'm looking for
    counter = 0
    while counter < MAX:
        counter += 1
        yield generator.next_index()

with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator():
        f.seek(4 * index)
        f.write(pack('>i', counter))
        counter += 1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?
Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield ctr % b
        ctr += a
Then, initialize it with your index size b and a value a which is coprime to b. a is easy to choose if b is a power of two, since a just needs to be odd to make sure it isn't divisible by 2. It is a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not as easily factored a number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want the next index you call index_gen.next(), and this is guaranteed to iterate over the numbers in [0, 2^28 - 1] in a semi-random-ish order depending on your choice of a.
There's really no point in picking an a larger than your index size, since the mod discards the excess anyway. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed, which is what I care about for simulating this write workload.
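For completeness, here is a minimal sketch of the matching verification pass (Python 2.7 syntax as in the question, assuming the same ./FILE and the rand_index_generator defined above; the constants must match the ones used during the write pass):
from struct import unpack

A, B = 1934919251, 2**28            # must be the same a and b used during the write pass
index_gen = rand_index_generator(A, B)

with open('./FILE', 'rb') as f:
    for counter in xrange(1, B + 1):
        index = index_gen.next()    # same deterministic sequence as the write pass
        f.seek(4 * index)
        value = unpack('>i', f.read(4))[0]
        if value != counter:
            print 'mismatch at index %d: expected %d, got %d' % (index, counter, value)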

Why can't ruby use most of the 2^X numbers as object ids?

ObjectSpace._id2ref gives us the object from Ruby's object space; it maps IDs to objects in a sequence starting from 0. However, if we try to look up the object with ID 4, it gives an error:
2.6.3 :121 > ObjectSpace._id2ref(4)
Traceback (most recent call last):
2: from (irb):121
1: from (irb):121:in `_id2ref'
RangeError (0x0000000000000004 is not id value)
Also, I figured that it's the same behaviour for 2^x values (except 1, 2, and 8).
(0..10).each do |exp|
  object_id = 2**exp
  begin
    puts "Number: #{object_id} : #{ObjectSpace._id2ref(object_id)}"
  rescue Exception => e
    puts "Number: #{object_id} : #{e.message}"
  end
end
Number: 1 : 0
Number: 2 : 2.0
Number: 4 : 0x0000000000000004 is not id value
Number: 8 : nil
Number: 16 : 0x0000000000000010 is not id value
Number: 32 : 0x0000000000000020 is not id value
Number: 64 : 0x0000000000000040 is not id value
Number: 128 : 0x0000000000000080 is not symbol id value
Number: 256 : 0x0000000000000100 is not id value
Number: 512 : 0x0000000000000200 is not id value
Number: 1024 : 0x0000000000000400 is not id value
Why can't ruby use these specific numbers as object ids?
Also, what's different about 1, 2, and 8? And why is the error different for 128?
First, it is very important to make a couple of things crystal clear:
There are exactly two guarantees Ruby makes about object IDs. These two guarantees are the only thing you are allowed to rely on. You must not make any assumptions about object IDs other than these two guarantees:
An object has the same ID for its entire lifetime.
No two objects have the same ID at the same time.
[Note: this means in particular that different objects can have the same ID at different times, i.e. that IDs can be recycled.]
An object ID is an opaque identifier. You must not make any assumptions about its structure or about any particular value.
Any particular implementation of object IDs is a private internal implementation detail of a specific version of a specific implementation running in a specific environment at a specific moment. There is no guarantee that the results will be the same with a different implementation. There is no guarantee that the results will be the same with a different version of the same implementation. There is no guarantee that the results will be the same with the same version of the same implementation running in a different environment. In fact, there is not even a guarantee that the results will be the same between two runs of the same code on the same version of the same implementation in the same environment.
ObjectSpace::_id2ref is an abomination. It should not even exist. It most certainly should not be used. It breaks object-orientation, it breaks encapsulation, it breaks safety.
Just as an example: unfortunately, you don't say which version of which implementation you are running in which environment. However, it looks like you are running YARV 2.6.3 in a 64-bit environment.
If you were to run that exact same code on the exact same version of YARV in a 32-bit environment, you would get different results. If you were to run that exact same code on an older version of YARV (pre-2.0) in the exact same environment, you would get different results.
Let's address the first, implicit, assumption which I think I see in your question. You seem to think that any ID should resolve to an object. It's easy to see that this cannot be true: there are infinitely many IDs, but for every run of a program, there are only finitely many objects, so there will always be infinitely many IDs which don't resolve to an object.
This already explains most of your results, namely the ones for 4, 16, 32, 64, 256, 512, and 1024.
So, with that out of the way, here's a high-level explanation of why there seems to be some sort of structure to the IDs, and what that structure is. (But let me remind you again that this explanation only applies to 64-bit systems, not to 32-bit, it only applies to YARV, it only applies to versions of YARV 2.0 or newer, and it is quite possible that it will no longer apply to YARV 3.0.)
In YARV, the developers made the decision that the object ID is the same thing as the memory address of the object header. This makes it easy to ensure the "rules" of object IDs: you can't have multiple objects at the same memory address at the same time, and an object will not change its memory address.
(Actually, it turns out that the second one is already a quite severe restriction: many modern high-performance garbage collectors depend on being able to move objects around in memory. This is not possible if you assume that object ID == memory address. Which means you will not be able to use any of those high-performance algorithms.)
On pretty much all modern machines, memory access is word-aligned. While it is possible to address individual bytes, that is generally slower or more awkward. So, we can basically assume that if we allocate memory, it will be aligned on a word boundary. Which means that all memory addresses will be divisible by 8 on 64-bit systems and by 4 on 32-bit systems, or in other words, that all memory addresses will end in 3 (64-bit) or 2 (32-bit) zero bits. Or, in other words: 87.5% (75% on 32-bit) of the address space is unused.
On the other hand, it would be quite a waste to represent Integers as a full-blown Ruby object:
They are immutable, which means we don't have to store any state.
They can't have instance variables, which means we don't have to store an instance variable table.
They can't have a singleton class, which means we don't have to store a __klass__ pointer.
They can't be extended.
And so on …
What this means, is that we can optimize the representation of Integers by not storing them as objects at all. All we need is some special case in the engine, so that if someone asks for the class of, say, 42, instead of trying to look at 42's __klass__ pointer, the engine "magically" knows to just return the Integer class.
Once we have that in place, we can do a really cool trick, which is actually as old as the very first LISP and Smalltalk VMs, and it is called a tagged pointer representation. Normally, the value of a variable is a pointer to the object (header), but with a tagged pointer representation we can store the value of the object inside the pointer to the object itself!
All we need to do is to have some sort of tag on the pointer that tells the engine that this is actually not a pointer but a value disguised as a pointer. In some older machines, especially those specifically designed for running high-level languages, pointers did have a tag field specifically for holding, e.g. type information or access control. Modern machines don't have that, but we have those unused bits we can (ab)use as tag bits.
And that is what YARV is doing: when the last bit of a pointer is 1, then it's not actually a pointer, it's an Integer. In particular, an Integer is encoded in YARV by shifting it one bit to the left and setting the last bit to 1. This allows us to encode a 63-bit Integer in a 64-bit pointer, and do native integer arithmetic on it with no object overhead and only a little bit of bit-shifting overhead.
And if you think about what this encoding means:
shifting one bit to the left is equivalent to multiplying by two
setting the last bit to 1 is equivalent to incrementing by 1
Then you can explain the first pattern: a small Integer with value n is encoded as the "quasi-pointer" 2n + 1, and since "memory address" and object ID are the same in YARV (even though this is not actually a memory address, because there is no object which could have an address), it will have the object ID 2n + 1.
Integers that don't fit into 63 bits (31 bits) are allocated as objects like any other object. The small, immediate ones have different names in different engines: in the Smalltalk-80 VM they are called SmallInts, in YARV they are called Fixnums (and the ones that don't fit into a Fixnum are called Bignums). They actually used to be different subclasses of a fully-abstract Integer class in older versions of YARV, but this was considered a mistake. (It's really an internal optimization and should not be visible to the programmer.) In current versions of YARV, Fixnum and Bignum are aliases for Integer and using them gives a deprecation warning.
This explains your result for 1. If you had tried ObjectSpace._id2ref(3), the result would have been 1, then ObjectSpace._id2ref(5) would be 2, and so on.
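To make the pattern concrete, here is a tiny snippet you can run in irb; keep in mind it only illustrates 64-bit YARV of roughly the version in the question (2.x), and proves nothing about other versions or implementations:
(0..5).each do |n|
  # On this build a small Integer n is encoded as the tagged value 2n + 1,
  # so 2n + 1 is both its object ID and the argument that _id2ref inverts.
  puts "#{n}.object_id = #{n.object_id}, _id2ref(#{2 * n + 1}) = #{ObjectSpace._id2ref(2 * n + 1)}"
end
# 0.object_id = 1, _id2ref(1) = 0
# 1.object_id = 3, _id2ref(3) = 1
# ...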
And we still are using only 62.5% of the address space (on a 64-bit system)!
So, let's think about what else we might want to represent in this way.
YARV has a very similar optimization for Floats. Floating-point numbers that fit into 62 bits are called flonums and are represented similarly, with a tag of 10 at the end. (YARV does not use flonums on 32-bit platforms.)
This explains your result for ObjectSpace._id2ref(2). If you had tried ObjectSpace._id2ref(6), the result would have been -2.0.
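Again, purely as an illustration of this particular 64-bit YARV build:
p 2.0.object_id            # => 2: the flonum encoding of 2.0 happens to be the value 2
p ObjectSpace._id2ref(2)   # => 2.0
p ObjectSpace._id2ref(6)   # => -2.0, as described above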
And a similar trick is also played for Symbols. I won't explain it here in detail, because a) I don't actually fully know how it works, and b) it is slightly more complex, because the value being encoded isn't directly the Symbol value, rather it is an index into the Symbol table. However, that explains your result for 128.
Now, lastly, there is a completely different part of the address space that is also unused: the low addresses. On most modern operating systems, part of the address space is reserved for mapping kernel memory directly into the user process in order to speed up the user space ↔︎ kernel space transition. Plus, there is another reason the very low addresses are kept free: in C, it is illegal to dereference a NULL pointer. Now, one way of implementing this would be for the runtime to track all pointer dereferences and check whether they are dereferencing the NULL pointer. But there is an easier way: just give the NULL pointer an actual memory address, but one that is never allocated. That way, you don't have to do anything: if the code tries to dereference the pointer, the address doesn't exist, and the MMU will take care of raising an error. So, most C compilers compile the NULL pointer to the actual memory address 0, and in order to make sure that there is never any real data allocated at that address, they keep a whole area around address 0 free.
This means that the low addresses are never used, and we can (ab)use them to represent even more "interesting" objects. Now, YARV uses the very low addresses to represent the following objects:
false at address 0, which has the additional advantage that 0 is considered false in C.
nil at address 8 (4 in 32-bit).
true at address 20 (2 in 32-bit).
Qundef (a special internal value inside the engine that denotes an undefined value) at address 52 (6 in 32-bit).
And that explains your number 8.
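You can see the same addresses directly (again, only on this specific 64-bit YARV build):
puts false.object_id   # => 0
puts nil.object_id     # => 8, which is why _id2ref(8) printed nil above
puts true.object_id    # => 20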
This also means that your 4, 16, 32, 64, 256, 512, and 1024 will probably never resolve to an object, because they are in the low address range where the C library will simply never allocate memory.
As a closing remark, I want to repeat one last time that all of this is a private internal implementation detail of a specific version of YARV running in a specific environment. You must not rely on any of this, ever.
When flonums were introduced in YARV, and on some platforms nil no longer had object ID 4, this did break some code, and it did cause some confusion (as evidenced e.g. by questions on Stack Overflow), even though the YARV developers are allowed to change object IDs at will, because there are no guarantees being made about any particular ID values or the structure of IDs. Please, do not make the same mistake.

Why are my nested for loops taking so long to compute?

I have code that generates all of the possible combinations of 4 integers between 0 and 36.
That is 37^4 = 1874161 combinations.
My code is written in MATLAB:
i = 0;
for a = 0:36
    for b = 0:36
        for c = 0:36
            for d = 0:36
                i = i + 1;
                combination(i,:) = [a,b,c,d];
            end
        end
    end
end
I've tested this with using the number 3 instead of the number 36 and it worked fine.
If there are 1874161 combinations, then with an overly cautious guess of 100 clock cycles to do the additions and write the values, on a 2.3 GHz PC this is:
1874161 * (1/2300000000) * 100 = 0.08148526086
a fraction of a second. But it has been running for about half an hour so far.
I did receive a warning that combination changes size every loop iteration and that I should consider preallocating it for speed, but that can't affect it this much, can it?
As @horchler suggested, you need to preallocate the target array.
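In MATLAB that is a one-line change before the loops, something like:
% Preallocate the full 1874161-by-4 result once, so that
% combination(i,:) = [a,b,c,d] writes in place instead of growing the array.
combination = zeros(37^4, 4);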
This is because, without preallocation, your program is no longer O(N^4). Each time you add a new row to the array it has to be resized: a new, bigger array is created (since MATLAB does not know how big the array will eventually be, it may grow it by only one row at a time), the old contents are copied into it, and the old array is deleted. So when the array holds 10 rows and you add the 11th, a copy of 10 rows is added to that iteration. Summed over all iterations, the copying alone costs roughly
1 + 2 + 3 + ... + N^4 ≈ ((N^4)^2)/2
operations, i.e. O(N^8) instead of O(N^4), which is massively more.
Also, as the array grows it crosses successive cache-size boundaries, so the reallocation and copying slow down even more as i increases past each cache level.
The only way around this without preallocation is to store the result in a linked list.
I'm not sure MATLAB has this option, but it would need one or two pointers per item (a 32/64-bit value each), which makes your structure 2+ times bigger.
If you need even more speed then there are further options (probably not for MATLAB):
use multi-threading, since filling the array is fully parallelisable
use memory block copies (rep movsd) or DMA, since the data repeats periodically
You can also consider computing the value from i on the fly instead of remembering the whole array; depending on the usage it can be faster in some cases, as sketched below.
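As a sketch of that last idea in MATLAB: since the loops simply count through all 4-digit base-37 numbers (with d changing fastest), row i can be decoded from i on the fly instead of being stored:
i = 123456;                % any index in 1..37^4
digits = zeros(1, 4);
r = i - 1;                 % the loops enumerate (i-1) in base 37, with a as the most significant digit
for k = 4:-1:1
    digits(k) = mod(r, 37);
    r = floor(r / 37);
end
% digits now equals the row [a, b, c, d] that combination(i,:) would hold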

How is an array stored in memory?

In an effort to delve deeper into how memory is allocated and stored, I have written an application that can scan memory address space, find a value, and write out a new value.
I developed a sample application with the end goal of being able to programmatically locate my array and overwrite it with a new sequence of numbers. In this situation, I created a single-dimensional array with 5 elements, e.g.
int[] array = new int[] {8,7,6,5,4};
I ran my application and searched for the sequence of five numbers above. I was looking for any value that fell between 4 and 8, for a total of 5 numbers in a row. Unfortunately, my sequential numbers within the array matched hundreds of results, as the numbers 4 through 8, in no particular sequence, happened to be next to each other in memory in many situations.
Is there any way to distinguish that a set of numbers within memory represents an array, and not simply integers that happen to be adjacent? Is there any way of knowing that, if I find a certain value, the matching values following it are those of an array?
I would assume that when I declare int[] array, it's pointing at the first address of my array, which would provide some kind of metadata about what exists in the array, e.g.
0x123456789 meta-data, 5 - 32 bit integers
0x123456789 + 32 "8"
0x123456789 + 64 "7"
0x123456789 + 96 "6"
0x123456789 + 128 "5"
0x123456789 + 160 "4"
Am I way off base?
Debug + Windows + Memory + Memory 1, set the Address field to "array". You'll see this when you switch the view to "4-byte Integer":
0x018416BC 6feb2c84 00000005 00000008 00000007 00000006 00000005 00000004
The first address is the address of the object in the garbage-collected heap (the part of the object header that sits at a negative offset, the syncblk index, is not shown). You cannot guess this value; the GC moves it around. The 2nd hex number is the 'type handle' for the array type (aka method table pointer). You cannot guess this value either; type handles are created by the CLR on demand. The 3rd number is the array length. The rest of them are the array element values.
The odds of reliably finding this array back at runtime without a debugger are quite low. There isn't much point in trying.
Don't. The array is stored on the heap and is subject to relocation due to garbage collection. You have to use fixed if you need to make sure the memory is not moved, which you can do, but only very carefully.
If you are after high-performance arrays, use stackalloc and use your code scheme.
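As a minimal sketch of the fixed route mentioned above (compile with unsafe code enabled; this only shows where the array sits while it is pinned, nothing more):
using System;

class PinDemo
{
    static unsafe void Main()
    {
        int[] array = new int[] { 8, 7, 6, 5, 4 };
        // fixed pins the array so the GC cannot relocate it while we hold the pointer.
        fixed (int* p = array)
        {
            Console.WriteLine("first element at 0x{0:X}", (long)p);
        }
        // Once the fixed block ends, the GC is free to move the array again.
    }
}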
I don't know exactly, but this article seems to suggest that you can get a pointer to your array, with which I would think you can determine the actual address.
Although I see you are using C# and, presumably, .NET, most of your question is in very general terms about memory. Keep in mind that, in the most general sense, all memory is just bits, whether that memory holds an array, strings, or code.
With that in mind, unless you can find tell-tale signs of your current platform's way of allocating different data types, there is no difference between memory that contains arrays, strings, or code.
Also, I wouldn't make any assumptions about whether an array "points" to the first item in the array. Perhaps someone else can address this issue specifically, but I would assume some sort of header is involved.
An array is not always stored contiguously in memory. If you can ensure that it is, what you are asking is possible.

In Ruby, can I make a reference to an array offset?

In Ruby, can I do something C-like, like this (with my made-up operator '&'):
1. a = [1,2,3,4] and b = &a[2], so that b => [3,4], and if I set b[0] = 99, then a => [1,2,99,4]?
2. If the elements of an array are integers, does Ruby necessarily store them consecutively in a contiguous part of memory? I'm guessing "no", that only addresses are stored, integers being objects, like everything else in Ruby.
3. If the answer to #2 is "yes" (which I doubt), is there a way to efficiently shift blocks of memory, as one can do in C, for example?
There is no such functionality built into Ruby (Ruby arrays are not built of cons cells, and taking the address is much lower level than Ruby operates), though honestly it would not be hard to write something like that.
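For instance, here is a minimal sketch of such a wrapper (the class name and methods are made up for illustration; writes go through to the original array, though it is nothing like a real C pointer):
# A tiny "view" onto a slice of an array: reads and writes are forwarded
# to the underlying array at the given offset.
class ArrayView
  def initialize(array, offset)
    @array  = array
    @offset = offset
  end

  def [](i)
    @array[@offset + i]
  end

  def []=(i, value)
    @array[@offset + i] = value
  end

  def to_a
    @array[@offset..-1]
  end
end

a = [1, 2, 3, 4]
b = ArrayView.new(a, 2)
b[0] = 99
p b.to_a   # => [99, 4]
p a        # => [1, 2, 99, 4]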
To answer the second question: it wouldn't necessarily be a contiguous block of raw machine integers. MRI treats small integers as immediate values (with the least significant bit as a flag indicating whether a word represents an integer or an object address), so the array's slots would hold those tagged values directly. Other implementations do it their own way.
