Freezing string literals in early Ruby versions - ruby

This question applies in particular to Ruby 1.9 and 2.1, where String literals can not be frozen automatically. In particular I am refering to this article, which suggests to freeze strings, so that repeated evaluation of the code does not create a new String object every time, which among other advantages is said to make the program perform better. As a concrete example, this article proposes the expression
("%09d".freeze % id).scan(/\d{3}/).join("/".freeze)
I want to use this concept in our project, and for testing purpose, I tried the following code:
3.times { x="abc".freeze; puts x.object_id }
In Ruby 2.3, this prints the same object ID every time. In JRuby 1.7, which corresponds on the language level to Ruby 1.9, it prints three different object IDs, although I have explicitly frozen the string.
Could somebody explain the reason for this, and how to use freeze properly in this situation?

In particular I am refering to this article, which suggests to freeze strings, so that repeated evaluation of the code does not create a new String object every time
That is not what Object#freeze does. As the name implies, it "freezes" the object, i.e. it disallows any further modification to the object's internal state. There is nothing in the documentation that even remotely suggests that Object#freeze performs some sort of de-duplication or interning.
You may be thinking of String#-#, but this does not exist in Ruby 2.1. It was only added in Ruby 2.3, and actually had different semantics then:
Ruby 2.3–2.4: returns self if self is already frozen, otherwise returns self.dup.freeze, i.e. a frozen duplicate of the string:
-str → str (frozen)
If the string is frozen, then return the string itself.
If the string is not frozen, then duplicate the string freeze it and return it.
Ruby 2.5+: returns self if self is already frozen, otherwise returns a frozen version of the string that is de-duplicated (i.e. it may be looked up in a cache of existing frozen strings and the existing version returned):
-str → str (frozen)
Returns a frozen, possibly pre-existing copy of the string.
The string will be deduplicated as long as it is not tainted, or has any instance variables set on it.
So, the article you linked to is wrong on three counts:
De-duplication is only performed for strings, not arbitrary objects.
De-duplication is not performed by freeze.
De-duplication is only performed by String#-# starting in Ruby 2.5.
There is also a fourth claim that is wrong in that article, although we can't really blame the author for that since the article is from 2016 and the decision was only changed in 2019: Ruby 3.0 will not have immutable string literals by default.
The one thing that is correct in that article is that the # frozen_string_literal: true pragma (or the corresponding command line option --enable-frozen-string-literal) will not only freeze all static string literals, it will also de-duplicate them.

Related

What happens when a method is used on an object created from a built in class?

I understand that classes are like mold from which you can create objects, and a class defines a number of methods and variables (class,instances,local...) inside of it.
Let's say we have a class like this:
class Person
def initialize (name,age)
#name = name
#age = age
end
def greeting
"#{#name} says hi to you!"
end
end
me = Person.new "John", 34
puts me.greeting
As I can understand, when we call Person.new we are creating an object of class Person and initializing some internal attributes for that object, which will be stored in the instance variables #name and #age. The variable me will then be a reference to this newly created object.
When we call me.greeting, what happens is that greeting method is called on the object referenced by me, and that method will use the instance variable #name that is directly tied/attached to that object.
Hence, when calling a method on an object you are actually "talking" to that object, inspecting and using its attributes that are stored in its instance variables. All good for now.
Let's say now that we have the string "hello". We created it using a string literal, just like: string = "hello".
My question is, when creating an object from a built in class (String, Array, Integer...), are we actually storing some information on some instance variables for that object during its creation?
My doubt arises because I can't understand what happens when we call something like string.upcase, how does the #upcase method "work" on string? I guess that in order to return the string in uppercase, the string object previously declared has some instance variables attached to, and the instances methods work on those variables?
Hence, when calling a method on an object you are actually "talking" to that object, inspecting and using its attributes that are stored in its instance variables. All good for now.
No, that is very much not what you are doing in an Object-Oriented Program. (Or really any well-designed program.)
What you are describing is a break of encapsulation, abstraction, and information hiding. You should never inspect and/or use another object's instance variables or any of its other private implementation details.
In Object-Orientation, all computation is performed by sending messages between objects. The only thing you can do is sending messages to objects and the only thing you can observe about an object is the responses to those messages.
Only the object itself can inspect and use its attributes and instance variables. No other object can, not even objects of the same type.
If you send an object a message and you get a response, the only thing you know is what is in that response. You don't know how the object created that response: did the object compute the answer on the fly? Was the answer already stored in an instance variable and the object just responded with that? Did the object delegate the problem to a different object? Did it print out the request, fax it to a temp agency in the Philippines, and have a worker compute the answer by hand with pen and paper? You don't know. You can't know. You mustn't know. That is at the very heart of Object-Orientation.
This is, BTW, exactly how messaging works in real-life. If you send someone a message asking "what is π²" and they answer with "9.8696044011", then you have no idea whether they computed this by hand, used a calculator, used their smart phone, looked it up, asked a friend, or hired someone to answer the question for them.
You can imagine objects as being little computers themselves: they have internal storage, RAM, HDD, SSD, etc. (instance variables), they have code running on them, the OS, the basic system libraries, etc. (methods), but one computer cannot read another computer's RAM (access its instance variables) or run its code (execute its methods). It can only send it a request over the network and look at the response.
So, in some sense, your question is meaningless: from the point of view of Object-Oriented Abstraction, is should be impossible to answer your question, because it should be impossible to know how an object is implemented internally.
It could use instance variables, or it could not. It could be implemented in Ruby, or it could be implemented in another programming language. It could be implemented as a standard Ruby object, or it could be implemented as some secret internal private part of the Ruby implementation.
In fact, it could even not exist at all! (For example, in many Ruby implementations small integers do not actually exist as objects at all. The Ruby implementation will just make it look like they do.)
My question is, when creating an object from a built in class (String, Array, Integer...), are we actually storing some information on some instance variables for that object during its creation?
[…] [W]hat happens when we call something like string.upcase, how does the #upcase method "work" on string? I guess that in order to return the string in uppercase, the string object previously declared has some instance variables attached to, and the instances methods work on those variables?
There is nothing in the Ruby Language Specification that says how the String#upcase method is implemented. The Ruby Language Specification only says what the result is, but it doesn't say anything about how the result is computed.
Note that this is not specific to Ruby. This is how pretty much every programming language works. The Specification says what the results should be, but the details of how to compute those results is left to the implementor. By leaving the decision about the internal implementation details up to the implementor, this frees the implementor to choose the most efficient, most performant implementation that makes sense for their particular implementation.
For example, in the Java platform, there are existing methods available for converting a string to upper case. Therefore, in an implementation like TruffleRuby, JRuby, or XRuby, which sits on top of the Java platform, it makes sense to just call the existing Java methods for converting strings to upper case. Why waste time implementing an algorithm for converting strings to upper case when somebody else has already done that for you? Likewise, in an implementation like IronRuby or Ruby.NET, which sit on top of the .NET platform, you can just use .NET's builtin methods for converting strings to upper case. In an implementation like Opal, you can just use ECMAScript's methods for converting strings to upper case. And so on.
Unfortunately, unlike many other programming languages, the Ruby Language Specification does not exist as a single document in a single place. Ruby does not have a single formal specification that defines what certain language constructs mean.
There are several resources, the sum of which can be considered kind of a specification for the Ruby programming language.
Some of these resources are:
The ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification – Note that the ISO Ruby Specification was written around 2009–2010 with the specific goal that all existing Ruby implementations at the time would easily be compliant. Since YARV only implements Ruby 1.9+ and MRI only implements Ruby 1.8 and lower, this means that the ISO Ruby Specification only contains features that are common to both Ruby 1.8 and Ruby 1.9. Also, the ISO Ruby Specification was specifically intended to be minimal and only contain the features that are absolutely required for writing Ruby programs. Because of that, it does for example only specify Strings very broadly (since they have changed significantly between Ruby 1.8 and Ruby 1.9). It obviously also does not specify features which were added after the ISO Ruby Specification was written, such as Ractors or Pattern Matching.
The Ruby Spec Suite aka ruby/spec – Note that the ruby/spec is unfortunately far from complete. However, I quite like it because it is written in Ruby instead of "ISO-standardese", which is much easier to read for a Rubyist, and it doubles as an executable conformance test suite.
The Ruby Programming Language by David Flanagan and Yukihiro 'matz' Matsumoto – This book was written by David Flanagan together with Ruby's creator matz to serve as a Language Reference for Ruby.
Programming Ruby by Dave Thomas, Andy Hunt, and Chad Fowler – This book was the first English book about Ruby and served as the standard introduction and description of Ruby for a long time. This book also first documented the Ruby core library and standard library, and the authors donated that documentation back to the community.
The Ruby Issue Tracking System, specifically, the Feature sub-tracker – However, please note that unfortunately, the community is really, really bad at distinguishing between Tickets about the Ruby Programming Language and Tickets about the YARV Ruby Implementation: they both get intermingled in the tracker.
The Meeting Logs of the Ruby Developer Meetings.
New features are often discussed on the mailing lists, in particular the ruby-core (English) and ruby-dev (Japanese) mailing lists.
The Ruby documentation – Again, be aware that this documentation is generated from the source code of YARV and does not distinguish between features of Ruby and features of YARV.
In the past, there were a couple of attempts of formalizing changes to the Ruby Specification, such as the Ruby Change Request (RCR) and Ruby Enhancement Proposal (REP) processes, both of which were unsuccessful.
If all else fails, you need to check the source code of the popular Ruby implementations to see what they actually do.
For example, this is what the ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification has to say about String#upcase:
15.2.10.5.42 String#upcase
upcase
Visibility: public
Behavior: The method returns a new direct instance of the class String which contains all the characters of the receiver, with all the lower-case characters replaced with the corresponding upper-case characters.
As you can see, there is no mention of instances variables or really any details at all about how the method is implemented. It only specifies the result.
If a Ruby implementor wants to use instance variables, they are allowed to use instances variables, if a Ruby implementor doesn't want to use instance variables, they are allowed to do that, too.
If you check the Ruby Spec Suite for String#upcase, you will find specifications like these (this is just an example, there are quite a few more):
describe "String#upcase" do
it "returns a copy of self with all lowercase letters upcased" do
"Hello".upcase.should == "HELLO"
"hello".upcase.should == "HELLO"
end
describe "full Unicode case mapping" do
it "works for all of Unicode with no option" do
"äöü".upcase.should == "ÄÖÜ"
end
it "updates string metadata" do
upcased = "aßet".upcase
upcased.should == "ASSET"
upcased.size.should == 5
upcased.bytesize.should == 5
upcased.ascii_only?.should be_true
end
end
end
Again, as you can see, the Spec only describes results but not mechanisms. And this is very much intentional.
The same is true for the Ruby-Doc documentation of String#upcase:
upcase(*options) → string
Returns a string containing the upcased characters in self:
s = 'Hello World!' # => "Hello World!"
s.upcase # => "HELLO WORLD!"
The casing may be affected by the given options; see Case Mapping.
There is no mention of any particular mechanism here, nor in the linked documentation about Unicode Case Mapping.
All of this only tells us how String#upcase is specified and documented, though. But how is it actually implemented? Well, lucky for us, the majority of Ruby implementations are Free and Open Source Software, or at the very least make their source code available for study.
In Rubinius, you can find the implementation of String#upcase in core/string.rb lines 819–822 and it looks like this:
def upcase
str = dup
str.upcase! || str
end
It just delegates the work to String#upcase!, so let's look at that next, it is implemented right next to String#upcase in core/string.rb lines 824–843 and looks something like this (simplified and abridged):
def upcase!
return if #num_bytes == 0
ctype = Rubinius::CType
i = 0
while i < #num_bytes
c = #data[i]
if ctype.islower(c)
#data[i] = ctype.toupper!(c)
end
i += 1
end
end
So, as you can see, this is indeed just standard Ruby code using instance variables like #num_bytes which holds the length of the String in platform bytes and #data which is an Array of platform bytes holding the actual content of the String. It uses two helper methods from the Rubinius::CType library (a library for manipulating individual characters as byte-sized integers). The "actual" conversion to upper case is done by Rubinius::CType::toupper!, which is implemented in core/ctype.rb and is extremely simple (to the point of being simplistic):
def self.toupper!(num)
num - 32
end
Another very simple example is the implementation of String#upcase in Opal, which you can find in opal/corelib/string.rb and looks like this:
def upcase
`self.toUpperCase()`
end
Opal is an implementation of Ruby for the ECMAScript platform. Opal cleverly overloads the Kernel#` method, which is normally used to spawn a sub shell (which doesn't exist in ECMAScript) and execute commands in the platform's native command language (which on the ECMAScript platform arguably is ECMAScript). In Opal, Kernel#` is instead used to inject arbitrary ECMAScript code into Ruby.
So, all that `self.toUpperCase()` does, is call the String.prototype.toUpperCase method on self, which does work because of how the String class is defined in Opal:
class ::String < `String`
In other words, Opal implements Ruby's String class by simply inheriting from ECMAScript's String "class" (really the String Constructor function) and is therefore able to very easily and elegantly reuse all the work that has been done implementing Strings in ECMAScript.
Another very simple example is TruffleRuby. Its implementation of String#upcase can be found in src/main/ruby/truffleruby/core/string.rb and looks like this:
def upcase(*options)
s = Primitive.dup_as_string_instance(self)
s.upcase!(*options)
s
end
Similar to Rubinius, String#upcase just delegates to String#upcase!, which is not surprising since TruffleRuby's core library was originally forked from Rubinius's. This is what String#upcase! looks like:
def upcase!(*options)
mapped_options = Truffle::StringOperations.validate_case_mapping_options(options, false)
Primitive.string_upcase! self, mapped_options
end
The Truffle::StringOperations::valdiate_case_mapping_options helper method is not terribly interesting, it is just used to implement the rather complex rules for what the Case Mapping Options that you can pass to the various String methods are allowed to look like. The actual "meat" of TruffleRuby's implementation of String#upcase! is just this: Primitive.string_upcase! self, mapped_options.
The syntax Primitive.some_name was agreed upon between the developers of multiple Ruby implementations as "magic" syntax within the core of the implementation itself to be able to call out from Ruby code into "primitives" or "intrinsics" that are provided by the runtime system, but are not necessarily implemented in Ruby.
In other words, all that Primitive.string_upcase! self, mapped_options tells us is "there is a magic function called string_upcase! defined somewhere deep in the bowels of TruffleRuby itself, which knows how to convert a string to upper case, but we are not supposed to know how it works".
If you are really curious, you can find the implementation of Primitive.string_upcase! in src/main/java/org/truffleruby/core/string/StringNodes.java. The code looks dauntingly long and complex, but all you really need to know is that the Truffle Language Implementation Framework is based on constructing Nodes for an AST-walking interpreter. Once you ignore all the machinery related to constructing the AST nodes, the code itself is actually fairly simple.
Once again, the implementors are relying on the fact that the Truffle Language Implementation Framework already comes with a powerful implementation of strings, which the TruffleRuby developers can simply reuse for their own strings.
By the way, this idea of "primitives" or "intrinsics" is an idea that is used in many programming language implementations. It is especially popular in the Smalltalk world. It allows you to write the definition of your methods in the language itself, which in turn allows features like reflection and tools like documentation generators and IDEs (e.g. for automatic code completion) to work without them having to understand a second language, but still have an efficient implementation in a separate language with privileged access to the internals of the implementation.
For example, because large parts of YARV are implemented in C instead of Ruby, but YARV is the implementation that the documentation on Ruby-Doc and Ruby-Lang is generated from, that means that the RDoc Ruby Documentation Generator actually needs to understand both Ruby and C. And you will notice that sometimes documentation for methods implemented in C is missing, incomplete, or corrupted. Similarly, trying to get information about methods implemented in C using Method#parameters sometimes returns non-sensical or useless results. This would not happen if YARV used something like Intrinsics instead of directly writing the methods in C.
JRuby implements String#upcase in several overloads of org.jruby.RubyString.upcase and String#upcase! in several overloads of org.jruby.RubyString.upcase_bang.
However, in the end, they all delegate to one specific overload of org.jruby.RubyString.upcase_bang defined in core/src/main/java/org/jruby/RubyString.java like this:
private IRubyObject upcase_bang(ThreadContext context, int flags) {
modifyAndKeepCodeRange();
Encoding enc = checkDummyEncoding();
if (((flags & Config.CASE_ASCII_ONLY) != 0 && (enc.isUTF8() || enc.maxLength() == 1)) ||
(flags & Config.CASE_FOLD_TURKISH_AZERI) == 0 && getCodeRange() == CR_7BIT) {
int s = value.getBegin();
int end = s + value.getRealSize();
byte[]bytes = value.getUnsafeBytes();
while (s < end) {
int c = bytes[s] & 0xff;
if (Encoding.isAscii(c) && 'a' <= c && c <= 'z') {
bytes[s] = (byte)('A' + (c - 'a'));
flags |= Config.CASE_MODIFIED;
}
s++;
}
} else {
flags = caseMap(context.runtime, flags, enc);
if ((flags & Config.CASE_MODIFIED) != 0) clearCodeRange();
}
return ((flags & Config.CASE_MODIFIED) != 0) ? this : context.nil;
}
As you can see, this is is a very low-level way of implementing it.
In MRuby, the implementation looks again very different. MRuby is designed to be light-weight, small, and easy to embed into a larger application. It is also designed to be used in small embedded systems such as robots, sensors, and IoT devices. Because of that, it is designed to be very modular: a lot of the parts of MRuby are optional and are distributed as "MGems". Even parts of the core language are optional and can be left out, such as support for the catch and throw keywords, big numbers, the Dir class, meta programming, eval, the Math module, IO and File, and so on.
If we want to find out where String#upcase is implemented, we have to follow a trail of breadcrumbs. We start with the mrb_str_upcase function in src/string.c which looks like this:
static mrb_value
mrb_str_upcase(mrb_state *mrb, mrb_value self)
{
mrb_value str;
str = mrb_str_dup(mrb, self);
mrb_str_upcase_bang(mrb, str);
return str;
}
This is a pattern we have already seen a couple of times: String#upcase just duplicates the String and then delegates to String#upcase!, which is implemented just above in mrb_str_upcase_bang:
static mrb_value
mrb_str_upcase_bang(mrb_state *mrb, mrb_value str)
{
struct RString *s = mrb_str_ptr(str);
char *p, *pend;
mrb_bool modify = FALSE;
mrb_str_modify_keep_ascii(mrb, s);
p = RSTRING_PTR(str);
pend = RSTRING_END(str);
while (p < pend) {
if (ISLOWER(*p)) {
*p = TOUPPER(*p);
modify = TRUE;
}
p++;
}
if (modify) return str;
return mrb_nil_value();
}
As you can see, there is a lot of mechanic in there to extract the underlying data structure from the Ruby String object, iterate over that data structure making sure to not run over the end, etc., but the real work of actually converting to uppercase is actually performed by the TOUPPER macro defined in include/mruby.h:
#define TOUPPER(c) (ISLOWER(c) ? ((c) & 0x5f) : (c))
There you have it! That's how String#upcase works "under the hood" in five different Ruby implementations: Rubinius, Opal, TruffleRuby, JRuby, and MRuby. And it will again be different in IronRuby, YARV, RubyMotion, Ruby.NET, XRuby, MagLev, MacRuby, tinyrb, MRI, IoRuby, or any of the other Ruby implementations of present, future, and past.
This shows you that there are many different ways of approaching how to implement something like String#upcase in a Ruby implementation. There are almost as many different approaches as there are implementations!
My question is, when creating an object from a built in class (String, Array, Integer...), are we actually storing some information on some instance variables for that object during its creation?
Yes, we are, basically:
string = "hello" is shorthand for string = String.new("hello")
take a look at the following:
https://ruby-doc.org/core-3.1.2/String.html#method-c-new (ruby 3)
https://ruby-doc.org/core-2.3.0/String.html#method-c-new (ruby 2)
What's the difference between String.new and a string literal in Ruby?
You can also check the following (to extend the functionalities of the class):
Extend Ruby String class with method to change the contents
So the short answer is:
Dealing with built in classes (String, Array, Integer, ...etc) is almost the same thing as we do in any other class we create

Ruby object to_s what is the encoding of the object id?

In Ruby, the to_s on an object includes an encoding of the object's id.
[2] pry(main)> shape = Shape.new(4,4)
=> #<Shape:0x00007fac5eb6afc8 #num_sides=4, #side_length=4>
In the documentation it says
Returns a string representing obj. The default to_s prints the object’s class and an encoding of the object id.
https://apidock.com/ruby/Object/to_s
In the example above, the encoding of the object id is 0x00007fac5eb6afc8.
In How does object_id assignment work? they explain
In MRI the object_id of an object is the same as the VALUE that represents the object on the C level.
So I compared to the object_id and it is not the same as the encoding of the object id.
[2] pry(main)> shape = Shape.new(4,4)
=> #<Shape:0x00007fac5eb6afc8 #num_sides=4, #side_length=4>
[3] pry(main)> shape.object_id
=> 70189150066660
What exactly is the encoding of the object id? It does not appear to be the object_id.
Think of the object_id, or __id__ as the "pointer" for the object. It is not technically a pointer, but does contain a unique value that can be used to retrieve the internal C VALUE.
There are patterns to the value it has for some data types, as you can see with its hexadecimal representation with to_s. I am will not go into all the details, as there are already numerous answers on SO explaining, and already linked from comments, but integers (up to a FIXNUM_MAX, have predictable values, and special constants like true, false, and nil will always have the same object_id in every run.
To put simply, it is nothing more than a number, and shown as a hexadecimal (base 16) value, not any actual "encoding" or cypher.
Going to expand upon this a bit more in light of your latest edits to the question. As you posted, the hexadecimal number you see in to_s is the value of the internal C VALUE of the object. VALUE is a C data type (unsigned, pointer size number) that every Ruby object is represented as in C code. As #Stefan pointed out in a comment, for non-integer types (I speak only for MRI version), it is twice the value of the object_id. Not that you probably care, but you can shift the bits of an integer to predict the value for those.
Therefore, using you example.
A value of 0x00007fac5eb6afc8 is simple hexadecimal notation for a number. It uses a base 16 counting system as opposed to the base 10 decimal system we are more used to in everyday life. It is simply a different way of looking at the same number.
So, using that logic.
a = 0x00007fac5eb6afc8
#=> 140378300133320 # Decimal representation
a /= 2 # Remember, non-integers are half of this value
#=> 70189150066660 # Your object_id
The best answer you can get is: You don't know, and you shouldn't need to.
Ruby guarantees exactly three things about object IDs:
An object has the same ID during its lifetime.
No two objects have the same ID at the same time.
IDs are integers.
In particular, this means that you cannot rely on a specific object having a specific ID (for example, nil having ID 8). It also means that IDs can be re-used. You should think of it as nothing but opaque identifier.
And, as you quoted, the default Object#to_s uses "some" encoding of the ID.
And that is all you know, and all you should ever rely on. In particular, you should never try to parse IDs or Object#to_s.
So, the ID part of Object#to_s is "some unspecified encoding" of the ID, which itself is "some opaque identifier".
Everything else is deliberately left unspecified, so that different implementations can make different choices that make sense for their specific needs. For example, it would be stupid to tie object IDs to memory addresses, because implementations like JRuby, Opal, IronPython, MagLev, and Topaz run on platforms where the concept of "memory address" doesn't even exist! And Rubinius uses a moving garbage collector, where objects can move around in memory and thus their address changes.

Converting Ruby symbols to integers

I find it surprising that Ruby symbols can be typecasted to integers without errors.
So :a.to_i is legal. I was wondering what is the significance of this integer, is it a unique value specific to that symbol?
You shouldn't do this, as Symbol#to_i was removed in Ruby 1.9 and is thus not compatible going forward. Regardless, the docs say this about it:
Returns an integer that is unique for each symbol within a particular execution of a program.
It is roughly equivalent to calling object_id on the symbol, as they both end up calling the C function SYM2ID().
In 1.8.x symbols were immediate objects. Their implementation was fast, and, most of the time, small. But with that came a security concern regarding the lack of garbage collection.
The #to_i and #to_int methods returned a unique integer and were related to the internal implementation.
Symbols-as-immediates and the implicit and explicit integer conversions have all been removed in 1.9.x. You can of course get the object_id. It's interesting that in 1.8.x to_i and object_id did not return the same number.

Why use symbols as hash keys in Ruby?

A lot of times people use symbols as keys in a Ruby hash.
What's the advantage over using a string?
E.g.:
hash[:name]
vs.
hash['name']
TL;DR:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Ruby Symbols are immutable (can't be changed), which makes looking something up much easier
Short(ish) answer:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Symbols in Ruby are basically "immutable strings" .. that means that they can not be changed, and it implies that the same symbol when referenced many times throughout your source code, is always stored as the same entity, e.g. has the same object id.
a = 'name'
a.object_id
=> 557720
b = 'name'
=> 557740
'name'.object_id
=> 1373460
'name'.object_id
=> 1373480 # !! different entity from the one above
# Ruby assumes any string can change at any point in time,
# therefore treating it as a separate entity
# versus:
:name.object_id
=> 71068
:name.object_id
=> 71068
# the symbol :name is a references to the same unique entity
Strings on the other hand are mutable, they can be changed anytime. This implies that Ruby needs to store each string you mention throughout your source code in it's separate entity, e.g. if you have a string "name" multiple times mentioned in your source code, Ruby needs to store these all in separate String objects, because they might change later on (that's the nature of a Ruby string).
If you use a string as a Hash key, Ruby needs to evaluate the string and look at it's contents (and compute a hash function on that) and compare the result against the (hashed) values of the keys which are already stored in the Hash.
If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)
Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol-table, which is never released.
Symbols are never garbage-collected.
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter (e.g. Ruby can run out of memory and blow up if you generate too many symbols programmatically).
Notes:
If you do string comparisons, Ruby can compare symbols just by comparing their object ids, without having to evaluate them. That's much faster than comparing strings, which need to be evaluated.
If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.
Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.
Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.
Long answers:
https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings
http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/
https://www.rubyguides.com/2016/01/ruby-mutability/
The reason is efficiency, with multiple gains over a String:
Symbols are immutable, so the question "what happens if the key changes?" doesn't need to be asked.
Strings are duplicated in your code and will typically take more space in memory.
Hash lookups must compute the hash of the keys to compare them. This is O(n) for Strings and constant for Symbols.
Moreover, Ruby 1.9 introduced a simplified syntax just for hash with symbols keys (e.g. h.merge(foo: 42, bar: 6)), and Ruby 2.0 has keyword arguments that work only for symbol keys.
Notes:
1) You might be surprised to learn that Ruby treats String keys differently than any other type. Indeed:
s = "foo"
h = {}
h[s] = "bar"
s.upcase!
h.rehash # must be called whenever a key changes!
h[s] # => nil, not "bar"
h.keys
h.keys.first.upcase! # => TypeError: can't modify frozen string
For string keys only, Ruby will use a frozen copy instead of the object itself.
2) The letters "b", "a", and "r" are stored only once for all occurrences of :bar in a program. Before Ruby 2.2, it was a bad idea to constantly create new Symbols that were never reused, as they would remain in the global Symbol lookup table forever. Ruby 2.2 will garbage collect them, so no worries.
3) Actually, computing the hash for a Symbol didn't take any time in Ruby 1.8.x, as the object ID was used directly:
:bar.object_id == :bar.hash # => true in Ruby 1.8.7
In Ruby 1.9.x, this has changed as hashes change from one session to another (including those of Symbols):
:bar.hash # => some number that will be different next time Ruby 1.9 is ran
Re: what's the advantage over using a string?
Styling: its the Ruby-way
(Very) slightly faster value look ups since hashing a symbol is equivalent to hashing an integer vs hashing a string.
Disadvantage: consumes a slot in the program's symbol table that is never released.
I'd be very interested in a follow-up regarding frozen strings introduced in Ruby 2.x.
When you deal with numerous strings coming from a text input (I'm thinking of HTTP params or payload, through Rack, for example), it's way easier to use strings everywhere.
When you deal with dozens of them but they never change (if they're your business "vocabulary"), I like to think that freezing them can make a difference. I haven't done any benchmark yet, but I guess it would be close the symbols performance.

Ruby and :symbols

I have just started using Ruby and I am reading "Programming Ruby 1.9 - The Pragmatic Programmer's Guide". I came across something called symbols, but as a PHP developer I don't understand what they do and what they are good for.
Can anyone help me out with this?
It's useful to think of symbols in terms of "the thing called." In other words, :banana is referring to "the thing called banana." They're used extensively in Ruby, mostly as Hash (associative array) keys.
They really are similar to strings, but behind the scenes, very different. One key difference is that only one of a particular symbol exists in memory. So if you refer to :banana 10 times in your code, only one instance of :banana is created and they all refer to that one. This also implies they're immutable.
Symbols are similar to string literals in the sense that share the same memory space, but it is important to remark they are not string equivalents.
In Ruby, when you type "this" and "this" you're using two different memory locations; by using symbols you'll use only one name during the program execution. So if you type :this in several places in your program, you'll be using only one.
From Symbol doc:
Symbol objects represent names and some strings inside the Ruby interpreter. They are generated using the :name and :"string" literals syntax, and by the various to_sym methods. The same Symbol object will be created for a given name or string for the duration of a program‘s execution, regardless of the context or meaning of that name. Thus if Fred is a constant in one context, a method in another, and a class in a third, the Symbol :Fred will be the same object in all three contexts.
So, you basically use it where you want to treat a string as a constant.
For instance, it is very common to use it with the attr_accessor method, to define getter/setter for an attribute.
class Person
attr_accessor :name
end
p = Person.new
p.name= "Oscar"
But this would do the same:
class DontDoThis
attr_accessor( "name" )
end
ddt = DontDoThis.new
ddt.name= "Dont do it"
Think of a Symbol as a either:
A method name that you plan to use later
A constant / enumeration that you want to store and compare against
For example:
s = "FooBar"
length = s.send(:length)
>>> 6
#AboutRuby has a good answer, using the terms "the thing called".
:banana is referring to "the thing
called banana."
He notes that you can refer to :banana many times in the code and its the same object-- even in different scopes or off in some weird library. :banana is the thing called banana, whatever that might mean when you use it.
They are used as
keys to arrays, so you look up :banana you only have one entry. In most languages if these are Strings you run the risk of having multiple Strings in memory with the text "banana" and not having the code detect they are the same
method/proc names. Most people are familiar with how C distinguishes a method from its call with parentheses: my_method vs. my_method(). In Ruby, since parentheses are optional, these both indicate a call to that method. The symbol, however, is convenient to use as a standin for methods (even though there really is no relationship between a symbol and a method).
enums (and other constants). Since they don't change they exhibit many of the properties of these features from other languages.

Resources