When to use symbols instead of strings in Ruby?

If there are at least two instances of the same string in my script, should I instead use a symbol?

TL;DR
A simple rule of thumb is to use symbols every time you need internal identifiers. For Ruby < 2.2 only use symbols when they aren't generated dynamically, to avoid memory leaks.
Full answer
The only reason not to use them for identifiers that are generated dynamically is because of memory concerns.
This question is very common because many programming languages don't have symbols, only strings, and thus strings are also used as identifiers in your code. You should be worrying about what symbols are meant to be, not only when you should use symbols. Symbols are meant to be identifiers. If you follow this philosophy, chances are that you will do things right.
There are several differences between the implementation of symbols and strings. The most important thing about symbols is that they are immutable. This means that they will never have their value changed. Because of this, symbols are instantiated faster than strings, and some operations, like comparing two symbols, are also faster.
The fact that a symbol is immutable allows Ruby to use the same object every time you reference the symbol, saving memory. So every time the interpreter reads :my_key it can take it from memory instead of instantiating it again. This is less expensive than initializing a new string every time.
You can get a list of all symbols that are already instantiated with the method Symbol.all_symbols:
symbols_count = Symbol.all_symbols.count # all_symbols is an array with all
# instantiated symbols.
a = :one
puts a.object_id
# prints 167778
a = :two
puts a.object_id
# prints 167858
a = :one
puts a.object_id
# prints 167778 again - the same object_id from the first time!
puts Symbol.all_symbols.count - symbols_count
# prints 2, the two objects we created.
For Ruby versions before 2.2, once a symbol is instantiated, its memory is never freed. The only way to free the memory is to restart the application. So symbols are also a major cause of memory leaks when used incorrectly. The simplest way to generate a memory leak is to call to_sym on user input: since this data always changes, a new portion of memory is used forever in the software instance. Ruby 2.2 introduced the symbol garbage collector, which frees dynamically generated symbols, so memory leaks caused by creating symbols dynamically are no longer a concern.
Answering your question:
Is it true I have to use a symbol instead of a string if there is at least two the same strings in my application or script?
If what you are looking for is an identifier to be used internally in your code, you should be using symbols. If you are printing output, you should go with strings, even if the text appears more than once and even if that means allocating two different objects in memory.
Here's the reasoning:
Printing the symbols will be slower than printing strings because they are cast to strings.
Having lots of different symbols will increase the overall memory usage of your application since (before Ruby 2.2) they are never deallocated. And you are never using all the strings from your code at the same time.
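To make the identifier/output split concrete, here is a minimal sketch. The names STATUS_LABELS and status_label are made up for illustration; the point is only that symbols name the states inside the code while strings carry the text that is shown to the user.

```ruby
# Hypothetical example: symbols identify states internally,
# strings hold the text that ends up in the output.
STATUS_LABELS = {
  pending:  "Waiting for review",
  approved: "Approved"
}.freeze

def status_label(status)
  # Fall back to the symbol's own name when no label is defined.
  STATUS_LABELS.fetch(status) { status.to_s }
end

puts status_label(:pending)
puts status_label(:archived)
```

The symbol never reaches the user directly; it is only converted to a string at the point where output is produced.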
Use case by @AlanDert
@AlanDert: if I use many times something like %input{type: :checkbox} in haml code, what should I use as checkbox?
Me: Yes.
@AlanDert: But to print out a symbol on html page, it should be converted to string, shouldn't it? what's the point of using it then?
What is the type of an input? An identifier of the type of input you want to use or something you want to show to the user?
It is true that it will become HTML code at some point, but at the moment you are writing that line of code, it is meant to be an identifier: it identifies what kind of input field you need. It is used over and over again in your code, always with the same "string" of characters as the identifier, and it won't generate a memory leak.
That said, why don't we evaluate the data to see if strings are faster?
This is a simple benchmark I created for this:
require 'benchmark'
require 'haml'
str = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: "checkbox"}').render
  end
end.total
sym = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: :checkbox}').render
  end
end.total
puts "String: " + str.to_s
puts "Symbol: " + sym.to_s
Three outputs:
# first time
String: 5.14
Symbol: 5.07
#second
String: 5.29
Symbol: 5.050000000000001
#third
String: 4.7700000000000005
Symbol: 4.68
So using symbols is actually a bit faster than using strings in these three runs. Why is that? It depends on the way HAML is implemented. I would need to dig into the HAML code to see, but if you keep using symbols for the concept of an identifier, your application will be faster and more reliable. When questions strike, benchmark it and get your answers.

Put simply, a symbol is a name, composed of characters, but immutable. A string, on the contrary, is an ordered container for characters, whose contents are allowed to change.

Here is a nice strings vs symbols benchmark I found at codecademy:
require 'benchmark'
string_AZ = Hash[("a".."z").to_a.zip((1..26).to_a)]
symbol_AZ = Hash[(:a..:z).to_a.zip((1..26).to_a)]
string_time = Benchmark.realtime do
  1_000_000.times { string_AZ["r"] }
end
symbol_time = Benchmark.realtime do
  1_000_000.times { symbol_AZ[:r] }
end
puts "String time: #{string_time} seconds."
puts "Symbol time: #{symbol_time} seconds."
The output is:
String time: 0.21983 seconds.
Symbol time: 0.087873 seconds.

Use symbols as hash key identifiers:
{key: "value"}
Keyword arguments, whose names are symbols, let you pass arguments to a method in any order:
def write(file:, data:, mode: "ascii")
  # removed for brevity
end
write(data: 123, file: "test.txt")
Freeze a value that should stay a string, to save memory:
label = 'My Label'.freeze

Related

Why is it important to create a method as a symbol?

I'm trying to understand the extent of what symbols do in Ruby. I understand that it is much faster and more efficient to use symbols as keys as opposed to strings, but how is it faster?
And from my understanding, when referencing methods it has to be represented as a symbol, :to_i as opposed to to_i. What is the purpose of this?
In Ruby, a symbol is just an immutable string:
"hello " + "world" #=> "hello world"
:hello_ + :world #=> NoMethodError: undefined method `+' for :hello_:Symbol
Being immutable makes symbols a safe and reliable reference, for example:
Object.methods #=> [:new, :allocate, :superclass, ...]
If Ruby were to use strings here, users would be able to modify the strings, thus ruining future calls of Object.methods. This could be fixed by making copies of the strings each time the method is called, but that would be a huge memory footprint.
In fact, since Ruby knows symbols are never going to be modified, it saves each symbol only once, no matter how many times you declare it:
"hello".object_id #=> 9504940
"hello".object_id #=> 9565300
:hello.object_id #=> 1167708
:hello.object_id #=> 1167708
This takes the memory-saving potential of symbols even further, allowing you to use symbol literals in your code anywhere and everywhere with little memory overhead.
So, the round-about answer to your question: symbols can't be modified, but they're safer and more memory efficient; therefore, you should use them whenever you have a string that you know shouldn't be modified.
Symbols are used as the keys to hashes because:
You should never modify the key of a hash while it's in the hash.
Hashes require literal referencing a lot, ie my_hash[:test], so it's more memory-efficient to use symbols.
As for method references: you can't reference a method directly, i.e. send(my_method()), because Ruby can't tell the difference between passing the method in and executing it. Strings could have been used here, but since a method's name never changes once defined, it makes more sense to represent the name as a symbol.
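A small sketch of what "referencing a method by its name" looks like in practice. The helper call_by_name is hypothetical; public_send, respond_to?, method and the &:name shorthand are standard Ruby.

```ruby
# Hypothetical helper: invoke a method whose name is given as a symbol.
def call_by_name(receiver, method_name)
  receiver.public_send(method_name)
end

p call_by_name("42", :to_i)    # the method name travels as plain data
p %w[1 2 3].map(&:to_i)        # Symbol#to_proc turns :to_i into a block
p "42".respond_to?(:to_i)      # ask about a method without calling it
p "42".method(:to_i).call      # a Method object, invoked explicitly
```

In every line the symbol only names the method. Nothing is executed until public_send, the block, or call actually runs it.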

What is the difference between being immutable and the fact that there can only be one instance of a Symbol?

I'm reading Eloquent Ruby, and am on Chapter 6 on Symbols. Some excerpts:
"There can only ever be one instance of any given symbol. If I mention :all twice in my code, it is always the same :all."
a = :all
b = :all
puts a.object_id, b.object_id # same objects
"Another aspect of symbols that makes them so well suited to their chosen career is that symbols are immutable - once you create that :all symbol, it will be :all until the end of time (or at least until your Ruby interpreter exits)"
What is the difference between being immutable and the fact that there can only be one instance of a symbol?
By the way, I would like to write the previous sentence more accurately: "What is the difference between a class being immutable and the fact that there can only be one instance of the class?" Is class the right word to insert there?
How would you even go about trying to mutate a symbol, they don't seem to hold values like other variables?
Immutable means that an object cannot be changed. In Ruby, symbols are immutable. To make a symbol mutable, you have to perform type conversion to a string, which is mutable.
a = :mystring
a = a.to_s
=> "mystring"
For proof that a symbol is immutable, you can call the frozen? method on it.
:mystring.frozen?
=> true
Note that a frozen object cannot be unfrozen: Ruby has no unfreeze method, for symbols or for strings. To get a mutable version of a frozen string, make a copy with dup.
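Here is a minimal sketch of how frozenness plays out for symbols and strings. The helper mutable_copy is a made-up name; frozen?, freeze and dup are core methods.

```ruby
# Hypothetical helper: return a string that is safe to modify,
# copying only when the argument is frozen.
def mutable_copy(string)
  string.frozen? ? string.dup : string
end

label = "mystring".freeze
copy  = mutable_copy(label)
copy << "!"                # modifies the copy only

p :mystring.frozen?        # symbols are always frozen
p label                    # the frozen original is untouched
p copy
```

The frozen original stays frozen for its whole lifetime. Mutability only comes back through a new object.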
For object ids
In Ruby (at least in versions before 2.7), the object_id of an object is derived from the VALUE that represents the object on the C level. For most objects, this points to a location in memory where the object data is stored, so it varies from run to run depending on where the system decided to allocate its memory.
Symbols have the same object id because they are meant to represent a SINGLE value.
To check this out, let's type to the console the same symbol multiple times.
:z.object_id
=> 636328
:z.object_id
=> 636328
:z.object_id
=> 636328
Now, let's try the same thing only with strings
"z".object_id
=> 21237740
"z".object_id
=> 24355380
As you can see, here we have two references to the string z, both of which are different objects. Thus, they have different object_ids.
This also means that symbols can save quite a bit of memory, especially if we are dealing with big data. Because symbols are the same object, it's faster to compare them than it is to compare strings. Strings require comparing the values instead of the object ids.
Your sentence is fine; you're not sure of the common phrase used to describe a class with only one instance. I'll explain that as I go along.
An object that is immutable cannot change through any operations done on it. This means that any operation that would change a symbol would generate a new one instead.
:foo.object_id # 1520028
:foo.upcase.object_id # 70209716662240
:foo.capitalize.object_id # 70209719120060
You can certainly write objects that are immutable, or make them immutable (with some caveats) via freeze, but you can always create a new instance of them.
f = "foo"
f.freeze
f1 = "foo"
puts f.object_id == f1.object_id # false
An object that only ever has one instance of itself is considered to be a singleton.
If there's only one instance of it, then you only store it in memory once.
If you attempt to create it, you only get the previously existing object back.
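The two points above can be checked with equal?, which compares object identity. This is a small sketch; same_object? is a hypothetical helper, and the last line assumes the file does not enable frozen string literals.

```ruby
# Hypothetical helper: true when both arguments are the very same object.
def same_object?(a, b)
  a.equal?(b)
end

p same_object?(:foo, "foo".to_sym)  # "creating" :foo again returns the existing symbol
p same_object?(:FOO, :foo.upcase)   # an operation on a symbol yields another symbol
p same_object?("foo", "foo")        # two string literals are two separate objects
```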

Why use symbols as hash keys in Ruby?

A lot of times people use symbols as keys in a Ruby hash.
What's the advantage over using a string?
E.g.:
hash[:name]
vs.
hash['name']
TL;DR:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Ruby Symbols are immutable (can't be changed), which makes looking something up much easier
Short(ish) answer:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Symbols in Ruby are basically "immutable strings". That means they cannot be changed, and it implies that the same symbol, when referenced many times throughout your source code, is always stored as the same entity, i.e. it has the same object id.
a = 'name'
a.object_id
=> 557720
b = 'name'
b.object_id
=> 557740
'name'.object_id
=> 1373460
'name'.object_id
=> 1373480 # !! different entity from the one above
# Ruby assumes any string can change at any point in time,
# therefore treating it as a separate entity
# versus:
:name.object_id
=> 71068
:name.object_id
=> 71068
# the symbol :name is a reference to the same unique entity
Strings, on the other hand, are mutable: they can be changed at any time. This implies that Ruby needs to store each string you mention throughout your source code in its own separate entity, e.g. if the string "name" is mentioned multiple times in your source code, Ruby needs to store each mention in a separate String object, because they might change later on (that's the nature of a Ruby string).
If you use a string as a Hash key, Ruby needs to evaluate the string and look at its contents (and compute a hash function on that) and compare the result against the (hashed) values of the keys which are already stored in the Hash.
If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)
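A minimal sketch of the lookup rules described above, using only core methods (hash, eql?, equal?). The method name lookup_demo is made up. It also shows a common stumbling block: a symbol key and a string key with the same characters are different keys.

```ruby
# Hypothetical demo: how string and symbol keys behave in a Hash.
def lookup_demo
  first  = "name"
  second = first.dup          # equal contents, different object
  table  = { first => 1 }

  {
    same_object:      first.equal?(second),       # false: two objects
    same_hash:        first.hash == second.hash,  # true: hash comes from the contents
    eql:              first.eql?(second),         # true: contents match
    found:            table[second],              # 1: lookup goes by contents
    symbol_vs_string: { name: 1 }["name"]         # nil: :name and "name" are different keys
  }
end

p lookup_demo
```

For strings the work is done on the characters, while for symbols the object itself is the key.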
Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol table, which (before Ruby 2.2) is never released.
Symbols are never garbage-collected (before Ruby 2.2).
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter (e.g. Ruby can run out of memory and blow up if you generate too many symbols programmatically).
Notes:
If you do string comparisons, Ruby can compare symbols just by comparing their object ids, without having to evaluate them. That's much faster than comparing strings, which need to be evaluated.
If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.
Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.
Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.
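The frozen-string behaviour from the last note can be observed with equal?. This is a sketch that assumes MRI (the reference implementation) 2.1 or later and a file without the frozen_string_literal magic comment; frozen_label and plain_label are hypothetical names.

```ruby
# Hypothetical helpers: one returns a frozen literal, one a plain literal.
def frozen_label
  "My Label".freeze
end

def plain_label
  "My Label"
end

p frozen_label.equal?(frozen_label)  # the frozen literal is the same object on every call
p plain_label.equal?(plain_label)    # the plain literal is a new object on every call
p frozen_label.frozen?
```

So a frozen literal gets the "stored once" property of a symbol while still being a String.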
Long answers:
https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings
http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/
https://www.rubyguides.com/2016/01/ruby-mutability/
The reason is efficiency, with multiple gains over a String:
Symbols are immutable, so the question "what happens if the key changes?" doesn't need to be asked.
Strings are duplicated in your code and will typically take more space in memory.
Hash lookups must compute the hash of the keys to compare them. This is O(n) in the length of the key for Strings and constant for Symbols.
Moreover, Ruby 1.9 introduced a simplified syntax just for hashes with symbol keys (e.g. h.merge(foo: 42, bar: 6)), and Ruby 2.0 has keyword arguments that work only with symbol keys.
Notes:
1) You might be surprised to learn that Ruby treats String keys differently than any other type. Indeed:
s = "foo"
h = {}
h[s] = "bar"
s.upcase!
h.rehash # must be called whenever a key changes!
h[s] # => nil, not "bar"
h.keys
h.keys.first.upcase! # => FrozenError: can't modify frozen String (a TypeError or RuntimeError before Ruby 2.5)
For string keys only, Ruby will use a frozen copy instead of the object itself.
2) The letters "b", "a", and "r" are stored only once for all occurrences of :bar in a program. Before Ruby 2.2, it was a bad idea to constantly create new Symbols that were never reused, as they would remain in the global Symbol lookup table forever. Ruby 2.2 will garbage collect them, so no worries.
3) Actually, computing the hash for a Symbol didn't take any time in Ruby 1.8.x, as the object ID was used directly:
:bar.object_id == :bar.hash # => true in Ruby 1.8.7
In Ruby 1.9.x, this has changed as hashes change from one session to another (including those of Symbols):
:bar.hash # => some number that will be different next time Ruby 1.9 is run
Re: what's the advantage over using a string?
Styling: it's the Ruby way
(Very) slightly faster value look ups since hashing a symbol is equivalent to hashing an integer vs hashing a string.
Disadvantage: consumes a slot in the program's symbol table that is never released.
I'd be very interested in a follow-up regarding frozen strings introduced in Ruby 2.x.
When you deal with numerous strings coming from a text input (I'm thinking of HTTP params or payload, through Rack, for example), it's way easier to use strings everywhere.
When you deal with dozens of them but they never change (if they're your business "vocabulary"), I like to think that freezing them can make a difference. I haven't done any benchmark yet, but I guess it would be close to symbol performance.
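For anyone who wants to check that guess, here is a rough benchmark sketch. NAME_KEY, ITERATIONS and lookup_timings are made-up names, and the numbers will depend on the Ruby version and machine, so no results are claimed here.

```ruby
require 'benchmark'

NAME_KEY   = "name".freeze   # a frozen string created once and reused
ITERATIONS = 200_000

# Hypothetical helper: time hash lookups with three kinds of keys.
def lookup_timings(iterations = ITERATIONS)
  by_symbol = { name: 1 }
  by_string = { "name" => 1 }

  {
    symbol:         Benchmark.realtime { iterations.times { by_symbol[:name] } },
    string_literal: Benchmark.realtime { iterations.times { by_string["name"] } },
    frozen_string:  Benchmark.realtime { iterations.times { by_string[NAME_KEY] } }
  }
end

lookup_timings.each do |label, seconds|
  puts "#{label}: #{seconds.round(4)} seconds"
end
```

Run it a few times before drawing conclusions, since a single run can easily be dominated by noise.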

Why is it not a good idea to dynamically create a lot of symbols in ruby (for versions before 2.2)?

What is the function of a symbol in Ruby? What's the difference between a string and a symbol?
Why is it not a good idea to dynamically create a lot of symbols?
Symbols are like strings but they are immutable - they can't be modified.
They are only put into memory once, making them very efficient to use for things like keys in hashes but they stay in memory until the program exits. This makes them a memory hog if you misuse them.
If you dynamically create lots of symbols, you are allocating a lot of memory that can't be freed until your program ends (edit: this is no longer the case since Ruby 2.2). You should only dynamically create symbols (using string.to_sym) if you know you will:
need to repeatedly access the symbol
not need to modify them
As I said earlier, they are useful for things like hashes - where you care more about the identity of the variable than its value. Symbols, when correctly used, are a readable and efficient way to pass around identity.
I will explain what I mean about the immutability of symbols RE your comment.
Strings are like arrays; they can be modified in place:
12:17:44 ~$ irb
irb(main):001:0> string = "Hello World!"
=> "Hello World!"
irb(main):002:0> string[5] = 'z'
=> "z"
irb(main):003:0> string
=> "HellozWorld!"
irb(main):004:0>
Symbols are more like numbers; they can't be edited in place:
irb(main):011:0> symbol = :Hello_World
=> :Hello_World
irb(main):012:0> symbol[5] = 'z'
NoMethodError: undefined method `[]=' for :Hello_World:Symbol
from (irb):12
from :0
A symbol is the same object and the same allocation of memory no matter where it is used:
>> :hello.object_id
=> 331068
>> a = :hello
=> :hello
>> a.object_id
=> 331068
>> b = :hello
=> :hello
>> b.object_id
=> 331068
>> a = "hello"
=> "hello"
>> a.object_id
=> 2149256980
>> b = "hello"
=> "hello"
>> b.object_id
=> 2149235120
>> b = "hell" + "o"
=> "hello"
Two strings which are 'the same' in that they contain the same characters may not reference the same memory, which can be inefficient if you're using strings for, say, hashes.
So, symbols can be useful for reducing memory overhead. However - they are a memory leak waiting to happen, because symbols cannot be garbage collected once created. Creating thousands and thousands of symbols would allocate the memory and not be recoverable. Yikes!
It can be particularly bad to create symbols from user input without validating the input against some kind of a white-list (for example, for query string parameters in RoR). If user input is converted to symbols without validation, a malicious user can cause your program to consume large amounts of memory that will never be garbage collected.
Bad (a symbol is created regardless of user input):
name = params[:name].to_sym
Good (a symbol is only created if the user input is allowed):
whitelist = ['allowed_value', 'another_allowed_value']
raise ArgumentError unless whitelist.include?(params[:name])
name = params[:name].to_sym
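A variation on the whitelist idea that never calls to_sym on user input at all: map the allowed strings to symbols up front and look the input up. This is only a sketch; ALLOWED_SORT_KEYS and sort_key_for are hypothetical names, and the input is assumed to arrive as a string (or nil).

```ruby
# Hypothetical whitelist: the only symbols that can ever come out of
# user input are the ones written here.
ALLOWED_SORT_KEYS = {
  "name"       => :name,
  "created_at" => :created_at
}.freeze

def sort_key_for(param)
  ALLOWED_SORT_KEYS.fetch(param) do
    raise ArgumentError, "unsupported sort key: #{param.inspect}"
  end
end

p sort_key_for("name")

begin
  sort_key_for("anything_else")
rescue ArgumentError => e
  puts e.message
end
```

Because the symbols are literals in the source, no new symbol is created no matter what the user sends.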
Starting with Ruby 2.2, symbols are automatically garbage collected, so this should not be an issue.
If you are using Ruby 2.2.0 or later, it should usually be OK to dynamically create a lot of symbols, because they will be garbage collected according to the Ruby 2.2.0-preview1 announcement, which has a link to more details about the new symbol GC. However, if you pass your dynamic symbols to some kind of code that converts it to an ID (an internal Ruby implementation concept used in the C source code), then in that case it will get pinned and never get garbage collected. I'm not sure how commonly that happens.
You can think of symbols as a name of something, and strings (roughly) as a sequence of characters. In many cases you could use either a symbol or a string, or you could use a mixture of the two. Symbols are immutable, which means they can't be changed after being created. The way symbols are implemented, it is very efficient to compare two symbols to see if they are equal, so using them as keys to hashes should be a little faster than using strings. Symbols lack a lot of the methods that strings have, such as start_with? (which Symbol only gained in Ruby 2.7), so you would have to use to_s to convert the symbol into a string before calling those methods.
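A short sketch of the to_s detour mentioned above. The helper names are made up; split is used as an example of a String method that Symbol does not define.

```ruby
# Hypothetical helpers: go through to_s to use String-only methods.
def symbol_starts_with?(symbol, prefix)
  symbol.to_s.start_with?(prefix)   # works on any Ruby version
end

def symbol_words(symbol)
  symbol.to_s.split("_")            # Symbol itself has no split
end

p symbol_starts_with?(:hello_world, "hello")
p symbol_words(:hello_world)
p :hello_world.respond_to?(:split)
```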
You can read more about symbols here in the documentation:
http://www.ruby-doc.org/core-2.1.3/Symbol.html

Why don't more projects use Ruby Symbols instead of Strings?

When I first started reading about and learning ruby, I read something about the power of ruby symbols over strings: symbols are stored in memory only once, while strings are stored in memory once per string, even if they are the same.
For instance: Rails' params Hash in the Controller has a bunch of keys as symbols:
params[:id] or
params[:title]...
But other decently sized projects such as Sinatra and Jekyll don't do that:
Jekyll:
post.data["title"] or
post.data["tags"]...
Sinatra:
params["id"] or
params["title"]...
This makes reading new code a little tricky, and makes it hard to transfer code around and to figure out why using symbols isn't working. There are many more examples of this and it's kind of confusing. Should we or shouldn't we be using symbols in this case? What are the advantages of symbols and should we be using them here?
In Ruby, after creating the AST, each symbol is represented as a unique integer. Having symbols as hash keys makes lookups a lot faster, as the main operation is comparison.
Symbols are not garbage collected AFAIK, so that might be a thing to watch out for, but except for that they really are great as hash keys.
One reason for the usage of strings may be the usage of YAML to define the values.
require 'yaml'
data = YAML.load(<<-data
one:
  title: one
  tag: 1
two:
  title: two
  tag: 2
data
) #-> {"one"=>{"title"=>"one", "tag"=>1}, "two"=>{"title"=>"two", "tag"=>2}}
You may use yaml to define symbol-keys:
require 'yaml'
data = YAML.load(<<-data
:one:
  :title: one
  :tag: 1
:two:
  :title: two
  :tag: 2
data
) #-> {:one=>{:title=>"one", :tag=>1}, :two=>{:title=>"two", :tag=>2}}
But in the YAML definition symbols look a bit strange; strings look more natural.
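If you want natural-looking YAML but symbol keys in your code, one option is to convert the keys after loading. The helper below is a hypothetical sketch, not something Jekyll or Sinatra provide, and since it calls to_sym on whatever keys the file contains it is best kept to files you trust.

```ruby
require 'yaml'

# Hypothetical helper: recursively convert string keys to symbols.
def deep_symbolize_keys(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), result|
      new_key = key.is_a?(String) ? key.to_sym : key
      result[new_key] = deep_symbolize_keys(inner)
    end
  when Array
    value.map { |item| deep_symbolize_keys(item) }
  else
    value
  end
end

data = YAML.load("one:\n  title: one\n  tag: 1\n")
p deep_symbolize_keys(data)
```

The values are left alone: "one" stays a string and 1 stays an integer; only the keys change type.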
Another reason for strings as keys: depending on the use case, it can be reasonable to sort by keys. In Ruby 1.8 you can't sort symbols without converting them to strings; since Ruby 1.9, Symbol implements <=>, so symbol keys can be sorted directly.
The main difference is that multiple symbols representing a single value are identical whereas this is not true with strings. For example:
irb(main):007:0> :test.object_id
=> 83618
irb(main):008:0> :test.object_id
=> 83618
irb(main):009:0> :test.object_id
=> 83618
3 references to the symbol :test, all the same object.
irb(main):010:0> "test".object_id
=> -605770378
irb(main):011:0> "test".object_id
=> -605779298
irb(main):012:0> "test".object_id
=> -605784948
3 references to the string "test", all different objects.
This means that using symbols can potentially save a good bit of memory, depending on the application. It is also faster to compare symbols for equality since they are the same object; comparing identical strings is much slower since the string values need to be compared instead of just the object ids.
I usually use strings for almost everything except things like hash keys, where I really want a unique identifier, not a string.
