Is there a way to get the Ruby runtime to combine frozen identical objects into a single instance? - ruby

I have data in memory, especially strings, that have large numbers of duplicates. We're hitting the ceiling with memory sometimes and are trying to reduce our footprint. I thought that if I froze the strings, then the Ruby runtime would combine them into single objects in memory. So I thought that this code would return a lower number, ideally, 1, but it did not:
a = Array.new(1000) { 'foo'.dup.freeze } # create separate objects, but freeze them
sleep 5 # give the runtime some time to combine the objects
a.map(&:object_id).uniq.size # => 1000
I guess this makes sense, because if there was a reference to the duplicated object (e.g. object id #202), and all of the frozen strings are combined to use #200, then dereferencing #202 will fail. So maybe my question doesn't make sense.
I guess the best strategy for me to save memory might be to convert the strings to symbols. I am aware that they will never be garbage collected, but there would be a small enough number of them that this would not be a problem. Is there a better way?

You basically have the right idea, but in my opinion you have found a big gotcha in Ruby. You are correct that Ruby can dedup frozen strings to save memory, but in general frozen ≠ deduped!
tl;dr: the reason is that the two operations have different semantics. Always use String#-@ (unary minus) if you want a string deduped.
Recall that freeze is a method of Object, so it has to work with every class. In English, freeze is "make it so no further changes can be made to this object and also return the same object so that I can keep calling methods on it". In particular, it would be odd if x.freeze != x. Imagine if I had two arrays that I was modifying, then decided to freeze them. Would it make sense for the interpreter to then iterate through both arrays to see if their contents are equal and to decide to completely throw away one of them? That could be very expensive. So in general freeze does not promise this behavior and always returns the same object, just frozen.
Deduping works very differently because when you call -myStr you're actually saying "return the unique frozen version of this string in memory". In most cases the whole point is to get a different object than the one in myStr (so that the GC can clean up that string and only keep the frozen one).
Unfortunately, the distinction is muddled by the fact that if you call freeze on a string literal, Ruby will dedup it automatically! This is sensible because there's no way to get a reference to the original literal object, so the fact that the interpreter effectively lets x.freeze be a different object than x cannot be observed, and we might as well save some memory. But it might also give the impression that freeze guarantees deduping, when in fact it does not.
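You can see the difference directly (a quick sketch; the dedup behavior of String#-@ assumes Ruby 2.5+):
a = 'foo'.dup
b = 'foo'.dup
a.freeze.equal?(b.freeze)          # => false -- freeze returns each receiver itself, still two objects
(-a).equal?(-b)                    # => true  -- unary minus returns the single deduplicated frozen copy
'foo'.freeze.equal?('foo'.freeze)  # => true  -- freeze on a literal is special-cased and deduped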
This gotcha was discussed when string deduping was first introduced, so it is definitely an intentional design decision by the Ruby developers.

Related

Even when using the same randomseed in Lua, get different results?

I have a large, rather complicated procedural content generation Lua project. One thing I want to be able to do, for debugging purposes, is use a fixed random seed so that I can re-run the system & get the same results.
To that end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is, what other ways are there to influence the output of Lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16: monkey-patching math.random & math.randomseed shows that runs oftentimes (but not always) output the same sequence of random numbers, but still not the same results. So I guess the real question is now: what behavior in Lua is nondeterministic and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (This code does NOT use coroutines, so I don't think it's a threading / race condition issue.)
randomseed uses the C srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with the srandom or random function on your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
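A minimal sketch of that monkey-patching (the log file name is made up; run the program twice and diff the two logs):
local orig_randomseed, orig_random = math.randomseed, math.random
local log = assert(io.open("random_calls.log", "w"))

math.randomseed = function(seed)
  log:write(("randomseed(%s)\n%s\n"):format(tostring(seed), debug.traceback()))
  return orig_randomseed(seed)
end

math.random = function(...)
  local value = orig_random(...)
  log:write(("random(%s) -> %s\n"):format(table.concat({...}, ", "), tostring(value)))
  return value
end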
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output the result of a call to math.random. If this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc.).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (i.e. first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone were enough, you would have already found it, so a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays, test whether their elements match up, and return / print the index of the first mismatch if they do not (a sketch follows the checklist below). Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
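Something along these lines would do; this is a rough, shallow comparison sketch (the helper name is made up, and nested tables would need a recursive version):
local function first_mismatch(a, b)
  local n = math.max(#a, #b)
  for i = 1, n do
    if a[i] ~= b[i] then return i end  -- also catches length differences (value vs nil)
  end
  return nil  -- arrays match
end

-- verify the helper itself before trusting it
local original = { "a", "b", "c" }
local debug01  = { "a", "b", "c" }   -- exact copy
local debug02  = { "a", "c", "b" }   -- copy with two elements swapped
assert(first_mismatch(original, debug01) == nil)
assert(first_mismatch(original, debug02) == 2)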
I cannot stress enough the insanity of using an unverified (and thus, potentially bugged) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.

When to refactor string usage to a Symbol in Ruby

According to tryruby.org, a symbol uses a single memory allocation, and every later use points back to that same allocation, whereas storing multiple strings, even identical ones, creates multiple instances in memory. So, much as MP3 and other compression or optimization methods exploit repetition, what considerations go into refactoring from multiple strings to symbols to take advantage of that repetition? As soon as you have two duplicates? Only when you notice performance drops? A logarithmic calculation? Other considerations or viewpoints?
I am a programmer interested in learning strong, positive convention practices, which is why I am asking.
A symbol is basically an immutable, interned string. That means it can't be changed in place (e.g. by using gsub!) and it is guaranteed that two usages of the same symbol always return the same object:
"foo".object_id == "foo".object_id
# => false
:foo.object_id == :foo.object_id
# => true
Because of that guarantee, symbols written as literals in your code are never garbage collected; once such a symbol has been created, it is kept for the life of the process. (Since Ruby 2.2, symbols created dynamically, e.g. via String#to_sym, can be garbage collected.)
Generally, you should use symbols when you have a static string, or at least a limited number of them, such as keys in hashes or references to methods. Using symbols here ensures that you are always getting the same object back.
With ordinary strings, depending on how you compare them, it is possible to get different objects back: two or more strings can look the same but actually be distinct objects (see the example above).
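A quick illustration of the hash-key and method-reference cases (the values here are made up):
config = { timeout: 30, retries: 3 }  # symbol keys: every :timeout is the same object
config[:timeout]                      # => 30
[1, 2, 3].map(&:to_s)                 # => ["1", "2", "3"] -- a symbol naming the method to call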
When a value is required as input or output of the program, use strings.
When it is used only internally and belongs to a relatively small closed set, such as a set of flags, use symbols.

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format.
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
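For instance, in the question's host language the tagged union might look roughly like this (the type and constructor names are purely illustrative):
-- a rough sketch of a tagged Lisp value type in Haskell
data LispVal
  = LInt    Integer
  | LPair   LispVal LispVal
  | LVector [LispVal]
  | LSymbol String
  | LNil
  deriving (Show, Eq)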
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may make more sense to use the bottom (least significant) four bits rather than the top ones. Better yet, because aligned pointers (8-byte alignment on a 64-bit system) already have three meaningless bits at the bottom, you can simply co-opt those bits for your tag, as long as you dereference the actual address rather than the tagged one.
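A rough sketch of that low-bit scheme, assuming 8-byte-aligned heap objects so the three low bits are free for the tag (all names and tag values here are illustrative):
import Data.Bits ((.&.), shiftL)
import Data.Word (Word64)

tagMask, intTag :: Word64
tagMask = 0x7            -- the three low bits hold the tag
intTag  = 0x0            -- tag 0b000 marks a fixnum

toFixnum :: Integer -> Word64
toFixnum n = fromIntegral n `shiftL` 3   -- the value lives in the upper bits; the low bits stay 0b000

isFixnum :: Word64 -> Bool
isFixnum w = w .&. tagMask == intTag

-- the check described above for +: verify both tags before doing the raw addition
addChecked :: Word64 -> Word64 -> Word64
addChecked a b
  | isFixnum a && isFixnum b = a + b     -- the sum keeps the 0b000 tag because both values are shifted
  | otherwise                = error "+: expected two fixnums"
-- e.g. addChecked (toFixnum 2) (toFixnum 3) == toFixnum 5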
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 probably being the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.

If ruby encourages duck typing so much, why don't we have Hash.count instead of Hash.length?

This is something that really confuses me, it seems like time and time again I run into methods in ruby native data types that do the same thing (essentially), and yet have different names. If duck typing is so strongly encouraged by ruby and the ruby community, why aren't these methods named consistently across types?
You seem to imply that Hash does not have a count method and/or that other enumerables don't have a length method. That is not true.
count is a method defined in the Enumerable module and thus available on all enumerables. It differs from size and length in the following ways:
It (optionally) takes a block specifying which kind of elements to count.
It's available on all enumerables, not just those that keep track of their size; however, it runs in O(n) for those that don't (and always when given a block, of course).
length and size (which are synonyms) are methods defined on all enumerable classes that keep track of their size (including Hash). They differ from count in that they always return the length in O(1) time and don't take a block.
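For example, with a Hash:
h = { a: 1, b: 2, c: 3 }
h.length                              # => 3, read from the stored size in O(1) (h.size is the same)
h.count                               # => 3, via Enumerable
h.count { |_key, value| value.odd? }  # => 2 -- count can filter with a block; length/size cannot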
In summary: You can call length or size on any object that keeps track of its size and you can call count on any enumerable. So duck typing is not hampered in any way.

Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)

Every text I've read about Ruby symbols talks about the efficiency of symbols over strings. But, this isn't the 1970s. My computer can handle a little bit of extra garbage collection. Am I wrong? I have the latest and greatest Pentium dual core processor and 4 gigs of RAM. I think that should be enough to handle some Strings.
Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.
It's true, you don't need symbols so very badly for memory reasons. Your computer could undoubtedly handle all kinds of gnarly string handling.
But, in addition to being faster, symbols have the added advantage (especially with syntax highlighting) of screaming out visually: LOOK AT ME, I AM A KEY OF A KEY-VALUE PAIR. That's a good enough reason to use them for me.
There are other reasons too... and the performance gain on lots of them might be more than you realize, especially for something like comparison.
When comparing two ruby symbols, the interpreter is just comparing two object addresses. When comparing two strings, the interpreter has to compare every character one at a time. That kind of computation can add up if you're doing a lot of this.
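A rough way to see this yourself, as a sketch using the standard Benchmark module (iteration counts and exact numbers are arbitrary and will vary by machine and Ruby version):
require 'benchmark'

str_a = 'a' * 50
str_b = 'a' * 50
sym_a = :some_symbol
sym_b = :some_symbol

Benchmark.bm(12) do |x|
  x.report('string ==') { 1_000_000.times { str_a == str_b } }  # compares contents character by character
  x.report('symbol ==') { 1_000_000.times { sym_a == sym_b } }  # compares object identity only
end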
Symbols have their own performance problems though... they are never garbage collected.
It's worth reading this article:
http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol
It's nice that symbols are guaranteed unique; that can have some nice effects that you wouldn't get from String (such as the fact that two usages of the same symbol are always the exact same object).
Plus they have a different meaning and you would want to use them in different areas, but Ruby isn't too strict about that kind of stuff anyway, so I can understand your question.
Here's the real reason for the difference: strings are never the same. Every instance of a string is a separate object, even if the content is identical. And most operations on strings will make new string objects. Consider the following:
a = 'zowie'
b = 'zowie'
a == b #=> true
On the surface, it'd be easy to claim that a and b are the same. Most common sense operations will work as you'd expect. But:
a.object_id #=> 2152589920 (when I ran this in irb)
b.object_id #=> 2152572980
a.equal?(b) #=> false
They look the same, but they're different objects. Ruby had to allocate memory twice, perform the String#initialize method twice, etc. They're taking up two separate spots in memory. And hey! It gets even more fun when you try to modify them:
a += '' #=> 'zowie'
a.object_id #=> 2151845240
Here we add nothing to a and leave the content exactly the same -- but Ruby doesn't know that. It still allocates a whole new String object, reassigns the variable a to it, and the old String object sits around waiting for eventual garbage collection. Oh, and the empty '' string also gets a temporary String object allocated just for the duration of that line of code. Try it and see:
''.object_id #=> 2152710260
''.object_id #=> 2152694840
''.object_id #=> 2152681980
Are these object allocations fast on your slick multi-Gigahertz processor? Sure they are. Will they chew up much of your 4 GB of RAM? No they won't. But do it a few million times over, and it starts to add up. Most applications use temporary strings all over the place, and your code's probably full of string literals inside your methods and loops. Each of those string literals and such will allocate a new String object, every single time that line of code gets run. The real problem isn't even the memory waste; it's the time wasted when garbage collection gets triggered too frequently and your application starts hanging.
In contrast, take a look at symbols:
a = :zowie
b = :zowie
a.object_id #=> 456488
b.object_id #=> 456488
a == b #=> true
a.equal?(b) #=> true
Once the symbol :zowie gets made, it'll never make another one. Every time you refer to a given symbol, you're referring to the same object. There's no time or memory wasted on new allocations. This can also be a downside if you go too crazy with them: symbols are not collected the way ordinary strings are, so if you start creating countless symbols dynamically from user input you risk a memory leak (Ruby 2.2 and later can collect dynamically created symbols, but churning them out is still wasteful). But for simple literals in your code, like constant values or hash keys, they're just about perfect.
Does that help? It's not about what your application does once. It's about what it does millions of times.
One less character to type. That's all the justification I need to use them over strings for hash keys, etc.
