Why is freezing a hash literal not the same as freezing a string literal? - ruby

I have been reading about ways to reduce memory usage in my Ruby/Rails app, and one thing that is mentioned is freezing objects. I have tried the code below (MRI, Ruby 2.3.3) and it does save memory, according to Activity Monitor, compared to not freezing the string.
pipeline = []
100_000.times { pipeline << 'hello world'.freeze }
However, if I try the same with a hash literal, it uses lots of memory, unless I assign the hash to a variable and freeze it before.
pipeline = []
100_000.times { pipeline << {hello: 'world'}.freeze } # Uses about 25MB
my_hash = {hello: 'world'}
my_hash.freeze
100_000.times { pipeline << my_hash} # This uses about 1MB
Can anyone explain why? I always thought the string case was a bit strange, because it looks like you're simply creating lots of different string objects, freezing each one separately, and adding lots of frozen objects to the array. Don't know why it works, but hey, it did. Now, the hash case is more in line with what I expected, but I don't know why it won't behave like the string.

It's probably the case that the Ruby optimizer can identify that string as being the same from one loop to the next, but it's unable to identify that hash as being identical so it makes new ones. In the second variant you literally use the same hash so the optimizer can handle it.
For proof, look at this:
pipeline = []
100_000.times { pipeline << 'hello world'.freeze }
pipeline.map(&:object_id).uniq.length
# => 1
That's an array of identical objects, one allocation only.
pipeline = []
100_000.times { pipeline << {hello: 'world'}.freeze }
pipeline.map(&:object_id).uniq.length
# => 100000
That's 100,000 different objects.

Can anyone explain why? I always thought the string case was a bit strange, because it looks like you're simply creating lots of different string objects, freezing each one separately, and adding lots of frozen objects to the array.
The expression form
'string literal'.freeze
is special-cased by the language. It not only freezes the string object, it also de-duplicates it (much like symbols).
Ruby does not evaluate the string literal and then send it the message freeze. Rather, the whole expression is treated as a single entity, a different form of string literal if you will.
In fact, the original proposal did introduce a different form of string literal like this:
'string literal'f
The proposal was changed to make it forwards-compatible: 'foo'f would be a syntax error, if you had to run your code in older versions of Ruby, whereas 'foo'.freeze just works the same way in older versions of Ruby, it only uses more memory.
Note: this means it only works for literals. Here, the string is de-duplicated:
'foo'.freeze
Here, it is not:
foo = 'foo'
foo.freeze
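One way to check this is to compare object identity with equal?. The following is only a sketch: the helper method names are made up, and the last part assumes Ruby 2.5 or later, where unary minus on a string (String#-@) returns a de-duplicated frozen string.

```ruby
# Hypothetical helper methods, used only to evaluate each expression twice.
def frozen_literal
  'foo'.freeze          # literal followed directly by .freeze
end

def frozen_variable
  foo = 'foo'           # ordinary literal: a new String on every call
  foo.freeze            # freezes that particular object, no de-duplication
end

literal_shared  = frozen_literal.equal?(frozen_literal)
variable_shared = frozen_variable.equal?(frozen_variable)

# Assumption: Ruby 2.5+. Unary minus de-duplicates a string built at runtime.
dynamic_a = -(%w[f o o].join)
dynamic_b = -(%w[f o o].join)
dynamic_shared = dynamic_a.equal?(dynamic_b)

puts "literal.freeze shared:  #{literal_shared}"
puts "variable.freeze shared: #{variable_shared}"
puts "-string shared:         #{dynamic_shared}"
```

This assumes the file does not enable frozen string literals; with that setting, plain literals are shared as well.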
Don't know why it works, but hey, it did.
Basically, it works, because the language specification says so.
Now, the hash case is more in line with what I expected, but I don't know why it won't behave like the string.
Again, it doesn't work, because the language specification only special-cases string literals.
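If the goal is a single shared hash, as in the question's second variant, the usual approach is to build and freeze the hash once and reuse it. A minimal sketch, where the constant name HELLO is made up for illustration:

```ruby
# Hypothetical constant: the hash is built and frozen exactly once.
HELLO = { hello: 'world' }.freeze

shared = []
1_000.times { shared << HELLO }                       # same object each time

separate = []
1_000.times { separate << { hello: 'world' }.freeze } # new hash each time

shared_ids   = shared.map(&:object_id).uniq.length
separate_ids = separate.map(&:object_id).uniq.length

puts "shared:   #{shared_ids} distinct object(s)"
puts "separate: #{separate_ids} distinct object(s)"
```

Keep in mind that freeze is shallow: the hash itself cannot be changed, but the values it contains are not frozen by this call.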


Ruby idiom to shovel into a string or nil? (e.g. shovel or assign / safe shovel)

I'd like to do this:
summary << reason
In my case, summary is a string, containing several sentences, and reason is one such sentence.
This works fine if the target already has a value, but sometimes summary can be nil. In that case, this raises:
NoMethodError: undefined method `<<' for nil:NilClass
So, I could write something like this:
if summary
  summary << reason
else
  summary = reason
end
This is cumbersome and ugly. I can hide it away in a new method like append(summary, reason), but I'm hoping there's a ruby idiom that can wrap this up concisely.
I've optimistically tried a few variants, without success:
summary += reason
summary &<< reason
In other scenarios, I might build an array of reasons (you can shovel into an empty array just fine), then finally join them into a summary...but that's not viable in my current project.
I also can't seed summary with an empty string (shoveling into an empty string also works fine), as other code depends on it being nil at times.
So, is there a "safe shovel" or simple "shovel or assign" idiom in Ruby, particularly for strings that might be nil?
I prefer @Oto Brglez's answer, but it inspired another solution that might be useful to someone:
summary = [summary, reason].join
This may or may not be easier to read, and probably is less performant. But it handles the nil summary problem without explicit alternation.
You can solve this with something like this; with the help of ||.
summary = (summary || '') + reason
Or like so with the help of ||= and <<:
(summary ||= '') << reason
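Here is a small sketch of the "shovel or assign" idea next to a non-mutating alternative. The variable names and sample sentences are made up, and ''.dup is used instead of '' only so the example also works when string literals are frozen.

```ruby
reason_one = 'Disk is full. '
reason_two = 'Backup failed.'

# 1. Shovel or assign: summary starts out nil and is created on first use.
summary = nil
(summary ||= ''.dup) << reason_one
(summary ||= ''.dup) << reason_two

# 2. Interpolation: nil interpolates as an empty string, but every step
#    builds a new string instead of appending in place.
other = nil
other = "#{other}#{reason_one}"
other = "#{other}#{reason_two}"

puts summary
puts other
```

Starting from a fresh empty string also means the first reason is never mutated by later shovels, which would happen if summary were simply assigned the reason object itself.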

How to reset value of local variable within loop?

I'd like to point out I tried quite extensively to find a solution for this and the closest I got was this. However I couldn't see how I could use map to solve my issue here. I'm brand new to Ruby so please bear that in mind.
Here's some code I'm playing with (simplified):
def base_word input
  input_char_array = input.split('') # split string to array of chars
  @file.split("\n").each do |dict_word|
    input_text = input_char_array
    dict_word.split('').each do |char|
      if input_text.include? char.downcase
        input_text.slice!(input_text.index(char))
      end
    end
  end
end
I need to reset the value of input_text back to the original value of input_char_array after each cycle, but from what I gather since Ruby is reference-based, the modifications I make with the line input_text.slice!(input_text.index(char)) are reflected back in the original reference, and I end up assigning input_text to an empty array fairly quickly as a result.
How do I mitigate that? As mentioned I've tried to use .map but maybe I haven't fully wrapped my head around how I ought to go about it.
You can get an independent reference by cloning the array. This, obviously, has some RAM usage implications.
input_text = input_char_array.dup
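To illustrate, here is a short sketch of the difference between a second name for the same array and a dup of it. The sample word is made up; the variable names follow the question.

```ruby
input_char_array = 'cat'.split('')

alias_ref = input_char_array        # same array, second name
copy      = input_char_array.dup    # new array with the same elements

alias_ref.slice!(0)                 # also changes input_char_array
copy.slice!(0)
copy.slice!(0)                      # only changes copy

puts input_char_array.inspect
puts copy.inspect
```

Note that dup is a shallow copy: the new array is independent, but the strings inside it are still the same objects.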
The Short and Quite Frankly Not Very Good Answer
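Using slice! modifies the array in place: it removes the element at the given index from input_text (and therefore from input_char_array, which is the same object) and returns the removed element.
If you use plain old slice instead, it only returns the element and leaves input_text untouched.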
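A quick way to see what each method does to the receiver, using a made-up sample array:

```ruby
letters = %w[r u b y]

picked = letters.slice(1)     # returns "u", letters is unchanged
after_slice = letters.dup     # snapshot for comparison

removed = letters.slice!(1)   # returns "u" and removes it from letters

puts picked, removed
puts after_slice.inspect
puts letters.inspect
```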
The Longer and Quite Frankly Much Better Answer
In Ruby, code nested four levels deep is often a smell. Let's refactor, and avoid the need to reset a loop at all.
Instead of splitting the file by newline, we'll use Ruby's built-in File class to read through the lines. Memoizing it (the ||= operator) may prevent it from reloading the file each time it's referenced, if we're running this more than once.
def dictionary
  # Memoize the lines so the file is read only once
  @dict ||= File.readlines('/path/to/dictionary').map(&:chomp)
end
We could also immediately make all the words lowercase when we open the file, since every character is downcased individually in the original example.
def downcased_dictionary
  # map (not each) so the downcased lines are what gets memoized
  @dict ||= File.readlines('/path/to/dictionary').map { |line| line.chomp.downcase }
end
Next, we'll use Ruby's built-in file and string functions, including #each_char, to do the comparisons and output the results. We don't need to convert any inputs into Arrays (at all!), because #include? works on strings, and #each_char iterates over the characters of a string.
We'll decompose the string-splitting into its own method, so the loop logic and string logic can be understood more clearly.
Lastly, by using #slice instead of #slice!, we don't overwrite input_text and entirely avoid the need to reset the variable later.
def base_word(input)
  input_text = input.to_s # Coerce in case it's not a string
  # Read through each line in the dictionary
  dictionary.each do |word|
    word.each_char { |char| slice_base_word(input_text, char) }
  end
end

def slice_base_word(input, char)
  input.slice(input.index(char)) if input.include?(char)
end

When to use symbols instead of strings in Ruby?

If there are at least two instances of the same string in my script, should I instead use a symbol?
TL;DR
A simple rule of thumb is to use symbols every time you need internal identifiers. For Ruby < 2.2 only use symbols when they aren't generated dynamically, to avoid memory leaks.
Full answer
The only reason not to use them for identifiers that are generated dynamically is because of memory concerns.
This question is very common because many programming languages don't have symbols, only strings, and thus strings are also used as identifiers in your code. You should be worrying about what symbols are meant to be, not only when you should use symbols. Symbols are meant to be identifiers. If you follow this philosophy, chances are that you will do things right.
There are several differences between the implementation of symbols and strings. The most important thing about symbols is that they are immutable. This means that they will never have their value changed. Because of this, symbols are instantiated faster than strings, and some operations, like comparing two symbols, are also faster.
The fact that a symbol is immutable allows Ruby to use the same object every time you reference the symbol, saving memory. So every time the interpreter reads :my_key it can take it from memory instead of instantiate it again. This is less expensive than initializing a new string every time.
You can get a list of all symbols that are already instantiated with the command Symbol.all_symbols:
symbols_count = Symbol.all_symbols.count # all_symbols is an array with all
# instantiated symbols.
a = :one
puts a.object_id
# prints 167778
a = :two
puts a.object_id
# prints 167858
a = :one
puts a.object_id
# prints 167778 again - the same object_id from the first time!
puts Symbol.all_symbols.count - symbols_count
# prints 2, the two objects we created.
For Ruby versions before 2.2, once a symbol is instantiated, its memory is never freed. The only way to free the memory is to restart the application. So symbols are also a major cause of memory leaks when used incorrectly. The simplest way to create a memory leak is to call to_sym on user input: since this data keeps changing, each new value occupies memory for the lifetime of the process. Ruby 2.2 introduced the symbol garbage collector, which frees dynamically generated symbols, so memory leaks from creating symbols dynamically are no longer a concern.
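One way to avoid calling to_sym on arbitrary user input, which matters most on Ruby versions before 2.2, is to map incoming strings onto a fixed set of symbols. This is only a sketch; the constant, the method name and the keys are hypothetical.

```ruby
# Hypothetical whitelist: only these strings ever become symbols.
SORT_KEYS = {
  'name'       => :name,
  'created_at' => :created_at
}.freeze

# Returns a known symbol, or the default for anything unexpected.
def sort_key_for(param, default = :name)
  SORT_KEYS.fetch(param.to_s, default)
end

puts sort_key_for('created_at').inspect
puts sort_key_for('something unexpected').inspect
```

Because unknown input falls back to a default, no new symbol is ever created from user data.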
Answering your question:
Is it true I have to use a symbol instead of a string if there is at least two the same strings in my application or script?
If what you are looking for is an identifier to be used internally at your code, you should be using symbols. If you are printing output, you should go with strings, even if it appears more than once, even allocating two different objects in memory.
Here's the reasoning:
Printing the symbols will be slower than printing strings because they are cast to strings.
Having lots of different symbols will increase the overall memory usage of your application since they are never deallocated. And you are never using all strings from your code at the same time.
Use case by @AlanDert
@AlanDert: if I use many times something like %input{type: :checkbox} in haml code, what should I use as checkbox?
Me: Yes.
@AlanDert: But to print out a symbol on html page, it should be converted to string, shouldn't it? what's the point of using it then?
What is the type of an input? An identifier of the type of input you want to use or something you want to show to the user?
It is true that it will become HTML code at some point, but at the moment you are writing that line of your code, it is meant to be an identifier - it identifies what kind of input field you need. Thus, it is used over and over again in your code, always has the same "string" of characters as the identifier, and won't generate a memory leak.
That said, why don't we evaluate the data to see if strings are faster?
This is a simple benchmark I created for this:
require 'benchmark'
require 'haml'
str = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: "checkbox"}').render
  end
end.total
sym = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: :checkbox}').render
  end
end.total
puts "String: " + str.to_s
puts "Symbol: " + sym.to_s
Three outputs:
# first time
String: 5.14
Symbol: 5.07
#second
String: 5.29
Symbol: 5.050000000000001
#third
String: 4.7700000000000005
Symbol: 4.68
So using symbols is actually a bit faster than using strings. Why is that? It depends on the way HAML is implemented. I would need to hack a bit on the HAML code to see, but if you keep using symbols as identifiers, your application will be faster and more reliable. When questions strike, benchmark it and get your answers.
Put simply, a symbol is a name, composed of characters, but immutable. A string, on the contrary, is an ordered container for characters, whose contents are allowed to change.
Here is a nice strings vs symbols benchmark I found at codecademy:
require 'benchmark'
string_AZ = Hash[("a".."z").to_a.zip((1..26).to_a)]
symbol_AZ = Hash[(:a..:z).to_a.zip((1..26).to_a)]
string_time = Benchmark.realtime do
  1_000_000.times { string_AZ["r"] }
end
symbol_time = Benchmark.realtime do
  1_000_000.times { symbol_AZ[:r] }
end
puts "String time: #{string_time} seconds."
puts "Symbol time: #{symbol_time} seconds."
The output is:
String time: 0.21983 seconds.
Symbol time: 0.087873 seconds.
use symbols as hash key identifiers
{key: "value"}
keyword arguments use symbol keys, which lets you pass the arguments in any order
def write(file:, data:, mode: "ascii")
  # removed for brevity
end
write(data: 123, file: "test.txt")
freeze to keep as a string and save memory
label = 'My Label'.freeze

I'd like an explanation of a behavior in Ruby that I ran across in the Koans

So is it just the shovel operator that modifies the original string? Why does this work, it looks like:
hi = original_string
is acting like some kind of a pointer? Can I get some insight as to when and how and why this behaves like this?
def test_the_shovel_operator_modifies_the_original_string
  original_string = "Hello, "
  hi = original_string
  there = "World"
  hi << there
  assert_equal "Hello, World", original_string
  # THINK ABOUT IT:
  #
  # Ruby programmers tend to favor the shovel operator (<<) over the
  # plus equals operator (+=) when building up strings. Why?
end
In Ruby, everything is a reference. If you do foo = bar, now foo and bar are two names for the same object.
If, however, you do foo = foo + bar (or, equivalently, foo += bar), foo now refers to a new object: one that is the result of the computation foo + bar.
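The two cases can be told apart with equal?, which compares object identity. A minimal sketch using the koan's strings; .dup is there only so the example also runs when string literals are frozen.

```ruby
original = 'Hello, '.dup   # .dup keeps the string mutable
hi = original              # second name for the same object

hi << 'World'              # mutates the shared object
same_after_shovel = hi.equal?(original)

hi += '!'                  # builds a new string and rebinds hi only
same_after_plus = hi.equal?(original)

puts original
puts hi
puts same_after_shovel, same_after_plus
```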
is acting like some kind of a pointer
It's called reference semantics. As in Python, Ruby's variables refer to values, rather than containing them. This is normal for dynamically typed languages, as it's much easier to implement the "values have type, variables don't" logic when the variable is always just a reference instead of something that has to magically change size to hold different types of values.
As for the actual koan, see Why is the shovel operator (<<) preferred over plus-equals (+=) when building a string in Ruby? .
A string is just a sequence of characters, and the << operator allows you to add more characters to this sequence. Some languages, like Java and C#, have immutable strings; others, like C++, have mutable strings. There isn't anything wrong with either; it's just a choice the language designers made.
In Java, when you need to create a large string by merging many smaller strings, you would first use a StringBuilder and then at the end build a real string out of it. In Ruby you can just keep on using << to add more characters to that string and that's it.
The main difference is that using << is much faster than "one_string + other_string" because the + operator generates a new string instead of appending to one_string.
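If you want to measure this yourself, a rough benchmark could look like the sketch below. The method names and the number of pieces are arbitrary choices, and the timings will vary by machine and Ruby version, so no expected numbers are given.

```ruby
require 'benchmark'

pieces = Array.new(10_000) { 'abc' }

def build_with_shovel(pieces)
  result = ''.dup
  pieces.each { |piece| result << piece }   # appends in place
  result
end

def build_with_plus(pieces)
  result = ''
  pieces.each { |piece| result += piece }   # allocates a new string each time
  result
end

shovel_result = nil
plus_result = nil
shovel_time = Benchmark.realtime { shovel_result = build_with_shovel(pieces) }
plus_time   = Benchmark.realtime { plus_result = build_with_plus(pieces) }

puts "<<: #{shovel_time} seconds"
puts "+=: #{plus_time} seconds"
```

Both methods produce the same string; only the number of intermediate objects differs.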

Why use symbols as hash keys in Ruby?

A lot of times people use symbols as keys in a Ruby hash.
What's the advantage over using a string?
E.g.:
hash[:name]
vs.
hash['name']
TL;DR:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Ruby Symbols are immutable (can't be changed), which makes looking something up much easier
Short(ish) answer:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Symbols in Ruby are basically "immutable strings" .. that means that they can not be changed, and it implies that the same symbol when referenced many times throughout your source code, is always stored as the same entity, e.g. has the same object id.
a = 'name'
a.object_id
=> 557720
b = 'name'
b.object_id
=> 557740
'name'.object_id
=> 1373460
'name'.object_id
=> 1373480 # !! different entity from the one above
# Ruby assumes any string can change at any point in time,
# therefore treating it as a separate entity
# versus:
:name.object_id
=> 71068
:name.object_id
=> 71068
# the symbol :name is a reference to the same unique entity
Strings on the other hand are mutable, they can be changed anytime. This implies that Ruby needs to store each string you mention throughout your source code in its own separate entity, e.g. if you have a string "name" multiple times mentioned in your source code, Ruby needs to store these all in separate String objects, because they might change later on (that's the nature of a Ruby string).
If you use a string as a Hash key, Ruby needs to evaluate the string and look at its contents (and compute a hash function on that) and compare the result against the (hashed) values of the keys which are already stored in the Hash.
If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)
Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol table, which was never released before Ruby 2.2.
Before Ruby 2.2, symbols were never garbage-collected; since 2.2, dynamically created symbols can be collected.
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter (e.g. Ruby can run out of memory and blow up if you generate too many symbols programmatically).
Notes:
If you do string comparisons, Ruby can compare symbols just by comparing their object ids, without having to evaluate them. That's much faster than comparing strings, which need to be evaluated.
If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.
Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.
Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.
Long answers:
https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings
http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/
https://www.rubyguides.com/2016/01/ruby-mutability/
The reason is efficiency, with multiple gains over a String:
Symbols are immutable, so the question "what happens if the key changes?" doesn't need to be asked.
Strings are duplicated in your code and will typically take more space in memory.
Hash lookups must compute the hash of the keys to compare them. This is O(n) for Strings and constant for Symbols.
Moreover, Ruby 1.9 introduced a simplified syntax just for hash with symbols keys (e.g. h.merge(foo: 42, bar: 6)), and Ruby 2.0 has keyword arguments that work only for symbol keys.
Notes:
1) You might be surprised to learn that Ruby treats String keys differently than any other type. Indeed:
s = "foo"
h = {}
h[s] = "bar"
s.upcase!
h.rehash # must be called whenever a key changes!
h[s] # => nil, not "bar"
h.keys # => ["foo"]
h.keys.first.upcase! # => TypeError: can't modify frozen string (RuntimeError on 1.9+, FrozenError on 2.5+)
For string keys only, Ruby will use a frozen copy instead of the object itself.
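This can be checked directly by comparing the stored key with the string that was passed in. A minimal sketch; the variable names are made up, and .dup makes sure the key starts out unfrozen (an already-frozen string is stored as-is rather than copied).

```ruby
key = 'foo'.dup            # unfrozen string
h = {}
h[key] = 'bar'

stored_key = h.keys.first
copied = !stored_key.equal?(key)   # the hash holds its own copy
frozen = stored_key.frozen?

key.upcase!                        # changes key, not the stored copy

puts copied, frozen
puts h['foo'].inspect
puts h[key].inspect
```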
2) The letters "b", "a", and "r" are stored only once for all occurrences of :bar in a program. Before Ruby 2.2, it was a bad idea to constantly create new Symbols that were never reused, as they would remain in the global Symbol lookup table forever. Ruby 2.2 will garbage collect them, so no worries.
3) Actually, computing the hash for a Symbol didn't take any time in Ruby 1.8.x, as the object ID was used directly:
:bar.object_id == :bar.hash # => true in Ruby 1.8.7
In Ruby 1.9.x, this has changed as hashes change from one session to another (including those of Symbols):
:bar.hash # => some number that will be different next time Ruby 1.9 is run
Re: what's the advantage over using a string?
Styling: it's the Ruby way
(Very) slightly faster value look ups since hashing a symbol is equivalent to hashing an integer vs hashing a string.
Disadvantage: consumes a slot in the program's symbol table that is never released.
I'd be very interested in a follow-up regarding frozen strings introduced in Ruby 2.x.
When you deal with numerous strings coming from a text input (I'm thinking of HTTP params or payload, through Rack, for example), it's way easier to use strings everywhere.
When you deal with dozens of them but they never change (if they're your business "vocabulary"), I like to think that freezing them can make a difference. I haven't done any benchmark yet, but I guess it would be close to symbol performance.
