When to refactor string usage to a Symbol in Ruby

According to tryruby.org, a symbol is allocated once and every later use points to that single allocation, whereas storing multiple strings, even if they have the same contents, stores multiple instances in memory. So, much as with compression and other optimization techniques, what considerations go into refactoring from multiple strings to symbols to take advantage of the repetition? As soon as you have two duplicates? Only when you notice performance drops? A logarithmic calculation? Other considerations or viewpoints?
I am a programmer interested in learning strong, positive conventions; that is why I am asking.

A symbol is basically an immutable, interned string. That means it can't be changed in place (e.g. by using gsub!), and it is guaranteed that two usages of the same symbol always return the same object:
"foo".object_id == "foo".object_id
# => false
:foo.object_id == :foo.object_id
# => true
Because of that guarantee, symbols that appear in your source code are never garbage collected: once such a symbol has been "created", it is kept for the lifetime of the process. (Since Ruby 2.2, symbols created dynamically at runtime, e.g. via to_sym, can be collected.)
Generally, you should use symbols when you have a static string, or at least a limited number of them, such as hash keys or references to methods. Using symbols there ensures that you always get the same object back.
With ordinary strings, depending on how you compare them, you may get different objects back: two strings can look identical but actually be distinct objects (see the example above).

When a value is required by an input or output of the program, use a string.
When a value is used only internally and belongs to a relatively small, closed set, such as a set of flags, use a symbol.
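For example, a minimal sketch of that split (the method and hash below are made up for illustration):
# Symbols for a small, closed set of internal flags and for hash keys.
VALID_STATES = [:pending, :active, :done]

def set_state(state)
  raise ArgumentError, "unknown state #{state}" unless VALID_STATES.include?(state)
  @state = state
end

# Strings for data that crosses the program boundary (input/output).
config = { host: "example.com", port: 443 }
puts "Connecting to #{config[:host]}:#{config[:port]}"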

Related

Is there a way to get the Ruby runtime to combine frozen identical objects into a single instance?

I have data in memory, especially strings, that have large numbers of duplicates. We're hitting the ceiling with memory sometimes and are trying to reduce our footprint. I thought that if I froze the strings, then the Ruby runtime would combine them into single objects in memory. So I thought that this code would return a lower number, ideally, 1, but it did not:
a = Array.new(1000) { 'foo'.dup.freeze } # create separate objects, but freeze them
sleep 5 # give the runtime some time to combine the objects
a.map(&:object_id).uniq.size # => 1000
I guess this makes sense, because if there was a reference to the duplicated object (e.g. object id #202), and all of the frozen strings are combined to use #200, then dereferencing #202 will fail. So maybe my question doesn't make sense.
I guess the best strategy for me to save memory might be to convert the strings to symbols. I am aware that they will never be garbage collected, but there would be a small enough number of them that this would not be a problem. Is there a better way?
You basically have the right idea, but in my opinion you have found a big gotcha in Ruby. You are correct that Ruby can dedup frozen strings to save memory, but in general frozen ≠ deduped!
tl;dr: the two operations have different semantics. Always use String#-@ if you want a string deduped.
Recall that freeze is a method of Object, so it has to work with every class. In English, freeze is "make it so no further changes can be made to this object and also return the same object so that I can keep calling methods on it". In particular, it would be odd if x.freeze != x. Imagine if I had two arrays that I was modifying, then decided to freeze them. Would it make sense for the interpreter to then iterate through both arrays to see if their contents are equal and to decide to completely throw away one of them? That could be very expensive. So in general freeze does not promise this behavior and always returns the same object, just frozen.
Deduping works very differently because when you call -myStr you're actually saying "return the unique frozen version of this string in memory". In most cases the whole point is to get a different object than the one in myStr (so that the GC can clean up that string and only keep the frozen one).
Unfortunately, the distinction is muddled since if you call freeze on a string literal, Ruby will dedup it automatically! This is sensible because there's no way to get a reference to the original literal object; the fact that the interpreter is allowing x.freeze != x doesn't matter, so we might as well save some memory. But it might also give the impression that freeze does guarantee deduping, when in fact it does not.
This gotcha was discussed when string deduping was first introduced, so it is definitely an intentional design decision by the Ruby developers.
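A quick way to see the difference between freezing and deduplicating (a rough sketch; this reflects current CRuby behaviour):
a = 'foo'.dup.freeze
b = 'foo'.dup.freeze
a.equal?(b)                         # => false -- both frozen, still two objects

c = -('foo'.dup)
d = -('foo'.dup)
c.equal?(d)                         # => true -- String#-@ returns the deduplicated copy

'foo'.freeze.equal?('foo'.freeze)   # => true -- freeze on a literal is special-cased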

Does using enums instead of booleans really affect cache usage?

I saw a comment thread where it was suggested that enums should be used instead of booleans in general since it's clearer what the parameters do at the call site, and it's easier to refactor if you need to add a case.
Then someone else claimed that this was a terrible idea, since it would often create an unnecessary variable for each call, wasting compute resources and cache space.
Does this second claim hold up? My understanding is that booleans are usually stored not as single bits but as sets of bits with more than enough room for the extra options in a typical enum, so the same amount of data would need to be moved around either way.
If I understand correctly, an extra variable would be required if the enum has three or more options (one to store the enum and one for a derived boolean on each check), but in that case you actually needed the three options, so what can you do? And if you have exactly two enum options, couldn't a compiler just transform the enum into a boolean in the same register (assuming the enum value was specified as not having one 0 value and one non-zero value), and therefore not use any extra space?
One extra compare instruction, I suppose, but cache usage seems to be the much bigger deal for performance these days. And if you have an enum that's isomorphic to a boolean, you'll often let the enumerator values be assigned automatically anyway, so the compiler should be free to optimize it fully.
Do I understand this correctly, or am I missing something?

Why is it useful to have an atom type (like in Elixir or Erlang)?

According to http://elixir-lang.org/getting-started/basic-types.html#atoms:
Atoms are constants where their name is their own value. Some other languages call these symbols.
I wonder what the point of having an atom type is. Perhaps it helps when building a parser, or with macros? But how does it help the programmer in everyday use?
BTW: I have never used Elixir or Erlang; I just know the concept exists (also in kdb).
They're basically strings that can easily be tested for equality.
Consider a string. Conceptually, we generally want to think of strings as being equal if they have the same contents. For example, "dog" == "dog" but "dog" != "cat". However, to check the equality of strings, we have to check to see if each letter in one string is equal to the letter in the same position in another string, which means that we have to walk through each element of the string and check each character for equality. This becomes a bit more cumbersome if dealing with Unicode strings and having to consider different ways of composing identical characters (for example, the character é has two representations in UTF-8).
It would be much simpler if we stored identical strings at the same location in memory. Then, checking equality would be a simple pointer or index comparison.
As a consequence of storing identical strings in the same location in memory, we can also store one copy of each unique kind of string regardless of how many times it is used in the program, thus saving some memory for commonly-used strings as well.
At a higher level, using atoms also lets us think of strings the same way we think of other primitive data types like integers.
I think one of the most common usages in Erlang is to tag values and messages, with the benefit of fast comparison (pattern matching), as mipadi says.
For example, say you write a function that may fail depending on the parameters provided, the status of a server connection, or any other reason. A very common pattern is to return a tuple {ok,Value} in case of success and {error,Reason} in case of error. The calling function can then choose to handle only the success case by writing {ok,Value} = yourModule:yourFunction(Param...). Doing this makes it clear that you only consider the success case, you extract Value directly from the return value, it is fast, and you don't have to share any header with yourModule to decode the ok atom.
In messages you will often see things like {add,Key,Value}, {delete,Key}, {delete_all}, {replace,Key,Value}, {append,Key,Value}... These are explicit messages, with the same advantages as mentioned before: fast, readable, and no header to share.
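The same tagging idiom translates to Ruby's symbols; here's a minimal sketch (the fetch_user function and its return values are invented, and the case/in syntax assumes Ruby 3 pattern matching):
def fetch_user(id)
  # Pretend lookup; returns an Erlang-style tagged result.
  id == 1 ? [:ok, { name: "Ada" }] : [:error, :not_found]
end

case fetch_user(1)
in [:ok, user]      then puts "got #{user[:name]}"
in [:error, reason] then warn "failed: #{reason}"
end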
Atoms are constants whose value is their own name.
This concept is very useful in distributed systems, where constants might be defined differently on each system, while atoms are self-contained and need no shared definition.

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format?
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable address space, it may make more sense to use the bottom (least significant) bits rather than the top ones. Better yet, because aligned pointers already have meaningless zero bits at the bottom (two for 4-byte alignment, three for 8-byte), you can simply co-opt those bits for your tag, as long as you dereference the actual address rather than the tagged one.
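To make the tag-bits idea concrete, here's a small simulation in Ruby, using the three low bits mentioned above (tag values and helper names are invented; a real compiler would do this on machine words and pointers rather than Ruby integers):
TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1   # 0b111
TAG_INT  = 0b000                 # hypothetical tag for integers

def tag_int(n)
  (n << TAG_BITS) | TAG_INT
end

def tag_of(word)
  word & TAG_MASK
end

def untag_int(word)
  word >> TAG_BITS
end

def typed_add(a, b)
  # Check both tags before doing the raw addition, as described above.
  unless tag_of(a) == TAG_INT && tag_of(b) == TAG_INT
    raise TypeError, "typed_add expects tagged integers"
  end
  tag_int(untag_int(a) + untag_int(b))
end

untag_int(typed_add(tag_int(20), tag_int(22)))   # => 42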
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 probably being the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.

Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)?

Every text I've read about Ruby symbols talks about the efficiency of symbols over strings. But, this isn't the 1970s. My computer can handle a little bit of extra garbage collection. Am I wrong? I have the latest and greatest Pentium dual core processor and 4 gigs of RAM. I think that should be enough to handle some Strings.
Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.
It's true, you don't need tokens so very badly for memory reasons. Your computer could undoubtedly handle all kinds of gnarly string handling.
But, in addition to being faster, tokens have the added advantage (especially with context coloring) of screaming out visually: LOOK AT ME, I AM A KEY OF A KEY-VALUE PAIR. That's a good enough reason to use them for me.
There are other reasons too... and the performance gain on a lot of them might be bigger than you realize, especially for something like comparison.
When comparing two Ruby symbols, the interpreter just compares two object addresses. When comparing two strings, it has to compare every character one at a time. That kind of computation can add up if you're doing a lot of it.
Symbols have their own performance problems though... they are never garbage collected.
It's worth reading this article:
http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol
It's nice that symbols are guaranteed unique--that can have some nice effects that you wouldn't get from String (such as the guarantee that two references to the same symbol are always exactly the same object).
Plus they have a different meaning, and you would want to use them in different places, but Ruby isn't too strict about that kind of thing anyway, so I can understand your question.
Here's the real reason for the difference: two strings are never the same object, even if their contents are identical, and most operations on strings create new string objects. Consider the following:
a = 'zowie'
b = 'zowie'
a == b #=> true
On the surface, it'd be easy to claim that a and b are the same. Most common sense operations will work as you'd expect. But:
a.object_id #=> 2152589920 (when I ran this in irb)
b.object_id #=> 2152572980
a.equal?(b) #=> false
They look the same, but they're different objects. Ruby had to allocate memory twice, perform the String#initialize method twice, etc. They're taking up two separate spots in memory. And hey! It gets even more fun when you try to modify them:
a += '' #=> 'zowie'
a.object_id #=> 2151845240
Here we add nothing to a and leave the content exactly the same -- but Ruby doesn't know that. It still allocates a whole new String object, reassigns the variable a to it, and the old String object sits around waiting for eventual garbage collection. Oh, and the empty '' string also gets a temporary String object allocated just for the duration of that line of code. Try it and see:
''.object_id #=> 2152710260
''.object_id #=> 2152694840
''.object_id #=> 2152681980
Are these object allocations fast on your slick multi-Gigahertz processor? Sure they are. Will they chew up much of your 4 GB of RAM? No they won't. But do it a few million times over, and it starts to add up. Most applications use temporary strings all over the place, and your code's probably full of string literals inside your methods and loops. Each of those string literals and such will allocate a new String object, every single time that line of code gets run. The real problem isn't even the memory waste; it's the time wasted when garbage collection gets triggered too frequently and your application starts hanging.
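You can see that last point by comparing repeated calls to a method containing a string literal (a tiny sketch; run it without the frozen_string_literal magic comment):
def greeting
  'zowie'   # a brand-new String object is allocated on every call
end

greeting.equal?(greeting)   # => false -- two calls, two objects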
In contrast, take a look at symbols:
a = :zowie
b = :zowie
a.object_id #=> 456488
b.object_id #=> 456488
a == b #=> true
a.equal?(b) #=> true
Once the symbol :zowie gets made, it'll never be made again. Every time you refer to a given symbol, you're referring to the same object. There's no time or memory wasted on new allocations. This can also be a downside if you go too crazy with them -- symbols are never garbage collected (at least before Ruby 2.2, which added collection of dynamically created symbols), so if you start creating countless symbols dynamically from user input you risk a memory leak. But for simple literals in your code, like constant values or hash keys, they're just about perfect.
Does that help? It's not about what your application does once. It's about what it does millions of times.
One less character to type. That's all the justification I need to use them over strings for hash keys, etc.
