When not to use to_sym in Ruby?

I have a large dataset from an analytics provider.
It arrives in JSON and I parse it into a hash, but due to the size of the set I'm ballooning to over a gig in memory usage. Almost everything starts as strings (a few values are numerical), and while of course the keys are duplicated many times, many of the values are repeated as well.
So I was thinking, why not symbolize all the (non-numerical) values, as well?
I've found some discussion of potential problems, but I figure it would be nice to have a comprehensive description for Ruby, since the problems seem dependent on the implementation of the interning process (what happens when you symbolize a string).
I found this talking about Java:
Is it good practice to use java.lang.String.intern()?
The interning process can be expensive
Interned strings are never de-allocated, resulting in a memory leak
(Except there's some contention on that last point.)
So, can anyone give a detailed explanation of when not to intern strings in Ruby?

When the set of things in question is open (i.e., dynamic, with no fixed inventory), you should not convert them into symbols. Each symbol created will never be garbage collected and will cause a memory leak.
When the set of things in question is closed (i.e., static, with a fixed inventory), it is better to convert them into symbols. Each symbol will be created only once and reused, which saves memory.
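For example, a minimal sketch of that distinction (the field names are hypothetical):

# Closed set: the provider's field names are a fixed inventory,
# so symbolizing the keys is safe and saves memory.
record = { 'country' => 'USA', 'referrer' => 'google' }
symbolized = record.map { |key, value| [key.to_sym, value] }.to_h
# => { country: 'USA', referrer: 'google' }

# Open set: user-supplied values have no fixed inventory, so symbolizing
# them grows the symbol table without bound (pre-2.2 Ruby never frees it).
# record.values.map(&:to_sym)  # don't do this with uncontrolled data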

The interning process can be expensive
There is always a tradeoff between memory and computing power, so try some best practices and benchmark to figure out what's right for you. A few suggestions I like to mention:
symbols are an excellent choice for a hash key
{name: "my name"}
Freeze strings to save memory; try to keep a small string pool
person[:country] = "USA".freeze
Have fun with Ruby GC tuning.
Interned strings are never de-allocated, resulting in a memory leak
Ruby 2.2 introduced garbage collection of symbols (those created dynamically at runtime), so this concern is largely no longer valid. However, overuse of frozen strings and symbols can still decrease performance.
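You can watch symbol garbage collection in action on Ruby 2.2+ by checking the symbol table size; a minimal sketch:

before = Symbol.all_symbols.size
100_000.times { |i| "dynamic_symbol_#{i}".to_sym }  # dynamically created symbols
GC.start
after = Symbol.all_symbols.size
# On Ruby >= 2.2 most of those 100,000 symbols are reclaimed,
# so `after` ends up close to `before`.
puts "before=#{before} after=#{after}"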

Related

How to avoid memory leaks in Ruby

I've noticed that my Ruby scripts, which run fine on small data, sometimes crash with out-of-memory errors when given a large data set to process. For example, I have a long-running script whose RAM usage grows by a hundred or so megabytes each minute until it crashes, once the input is large enough.
So, the question is: how do you avoid memory leaks in Ruby? What are the do's and don'ts? Any hints and tips to optimize Ruby memory usage for long-running scripts?
How can I make sure my Ruby scripts don't leak memory?
Thanks!
The quick fix for memory problems is often to sprinkle in calls to GC.start, which force-initiates the garbage collector. Sometimes Ruby gets very lazy about cleaning up garbage, and it can accumulate to a dangerous degree.
It's sometimes the case that you inadvertently create structures that are difficult to clean up: wildly inter-linked things that, when analyzed more deeply, are not actually retained. This makes life harder for the garbage collector. For example, a deep Hash-of-Hash structure with lots and lots of strings can take a lot more work to liberate than a simple Array.
If you're having memory problems you'll want to pay attention to how much garbage you're producing when doing operations. Look for ways of collapsing things to remove intermediate products. For instance, the classic case is this:
s = ''
10.times do |i|
  s += i.to_s
end
This creates a string of the form 01234... as a final product, but it also creates 10 other strings with the intermediate products. That's 11x as much garbage as this solution:
s = ''
10.times do |i|
  s << i.to_s
end
That creates a single string and appends to it repeatedly. Technically the to_s operation on a number also creates garbage, so keep in mind that conversions aren't free either. This is why you see symbols like :name used in Ruby quite frequently: you pay their cost once and once only, whereas every string "name" could be an independent object.
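If you want to measure this yourself, here is a minimal sketch using GC.stat (the :total_allocated_objects key exists on Ruby 2.2+):

def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

n = 10_000
puts allocations { s = ''; n.times { |i| s += i.to_s } }  # a new string per +=, plus each to_s
puts allocations { s = ''; n.times { |i| s << i.to_s } }  # only the to_s strings; roughly half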

Does C# 6 string interpolation use boxing like string.Format() does for its arguments?

I am asking this for performance's sake - lots of boxing means lots of heap allocations, which triggers more GC collections, which sometimes causes apps to freeze for a moment and annoys users.
All string interpolation does (at least in the common case) is to call string.Format().
Right now, calling string.Format() allocates quite a lot and not just due to boxing (for example, string.Format("{0:s} - {1:B}: The value is: {2:C2}", DateTime.UtcNow, Guid.NewGuid(), 3.50m) makes 13 allocations, only 3 of those due to boxing), though there is talk about improving that in the future.
Though as usual when it comes to performance, you generally should not just blindly write unreadable code everywhere because the readable version has known performance issues. Instead, limit the unreadable efficient code to the parts of your code that actually need it.

Using Ruby to convert a range to an array and running out of memory

I'm doing something that seems simple at first
Ruby Code:
(0...693530740).to_a
This results in
NoMemoryError: failed to allocate memory
Well, I'm stumped. Is there any way to change how much memory the Ruby interpreter can use? I don't see how. I have no way around this, as it has to exist in my code. Is there a better way? Is there perhaps a lower-level language that can do this? Any ideas are welcome. I have already tried using JRuby and giving the JVM 15 GB of memory, but no luck.
Thank you for reading this question
Why do you need to make an array with hundreds of millions of items? Do you really need to have all those items in memory at the same time, or can you process them one by one, allowing the ones which have already been processed to be garbage-collected? Something like this:
(0...693530740).each do |n|
  # do something
end
"I have no way around this as it has to exist in my code"
Alter your code so that you don't need it. You only have one piece of meaningful data here, the number 693530740, which fits just fine into 8 bytes. It is very unlikely that you really need to expand it into that huge array. Most of Ruby's array methods that you might think you need will have equivalents (using Range, or Enumerator) that work without needing to instantiate such a list of numbers.
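For instance, a minimal sketch of Range/Enumerator equivalents that never materialize the array (the operations are illustrative):

range = (0...693530740)

range.include?(42)   # membership test without an array
range.sum            # closed-form on Ruby 2.6+, no iteration needed
range.each_slice(10_000) do |batch|
  # process 10,000 numbers at a time; each batch becomes garbage afterwards
end
range.lazy.map { |n| n * 2 }.first(5)  # lazy chains avoid intermediate arrays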
If you have trouble seeing what kind of re-design would avoid the large array, then post a new question - here on Stack Overflow if the design can be outlined in a short description and a few lines of code, or perhaps on codereview.stackexchange.com if it is not possible to demonstrate your algorithm in a small piece of code.
So you want to allocate an array of ~693 million numbers, which is roughly 2.6 GB even at 4 bytes per entry. Note that this memory must be contiguous. Without knowing how much physical memory you have, I think this is the main reason you can't do it.

Storing data in a ruby C extension - terrible idea or not?

My team is working on an MMO server in Ruby, and we opted to start moving computationally intensive operations into a C extension. As part of that effort, we moved the actual data storage into C (using Data_Get_Struct and all that). So, for example, each Ruby "Zone" object has an associated "ZoneKernel::Zone" C struct where the actual binary data is stored.
Basically, I'm wondering if this is a terrible idea or not. I'm not super familiar with the ruby internals, but it seems like the data should be fine as long as the parent Zone stays in memory on the ruby side (thus preventing garbage collection of the C data).
One caveat is that we've been getting semi-regular "Stack Consistency Errors" that crash our server - this seems like potentially a related memory issue (instead of just your garden variety segfault) - if anyone has any knowledge of what that might be, I would appreciate that as well!
As stated in the documentation for the Data_Wrap_Struct(klass, mark, free, ptr) function:
The free argument is the function to free the pointer allocation. If
this is -1, the pointer will be just freed.
These mark / free functions are invoked during GC execution.
Your wrapped native structure will be automatically freed when its corresponding Ruby object is finalized. Until that happens, your data will not be freed unless you do so manually.
Writing C extensions doesn't guarantee a performance boost, but it almost always increases the complexity of your code. Profile your server in order to measure your performance gains, and develop your Zone class in pure Ruby if viable.
In general I like to keep any data that could change outside of my source. Loading it from YAML or a database means you can tweak the data to your heart's content without needing to recompile. Obviously, if your compile time and load times are fast then it's not as big an issue, but it's still a good idea to separate the two.
I favor YAML, because it's a standard format, so you could access the same file from any number of languages. You could load it directly into the C side, or into the Ruby side, depending on what seems faster/smarter.
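For the YAML route, a minimal sketch of the Ruby side (the file name and structure are hypothetical):

require 'yaml'

# zones.yml (hypothetical):
#   plains:
#     width: 512
#     height: 512
zones = YAML.load_file('zones.yml')
zones.each do |name, attrs|
  puts "#{name}: #{attrs['width']}x#{attrs['height']}"
end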

What are your strategies to keep the memory usage low?

Ruby is truly memory-hungry - but also worth every single bit.
What do you do to keep the memory usage low? Do you avoid big strings and use smaller arrays/hashes instead or is it no problem to concern about for you and let the garbage collector do the job?
Edit: I found a nice article about this topic here - old but still interesting.
I've found Phusion's Ruby Enterprise Edition (a fork of mainline Ruby with much-improved garbage collection) to make a dramatic difference in memory usage... Plus, they've made it extraordinarily easy to install (and to remove, if you find the need).
You can find out more and download it on their website.
I really don't think it matters all that much.
Making your code less readable in order to improve memory consumption is something you should only ever do if you need it. And by need, I mean have a specific case for the performance profile and specific metrics that indicate that any change will address the issue.
If you have an application where memory is going to be the limiting factor, then Ruby may not be the best choice. That said, I have found that my Rails apps generally consume about 40-60 MB of RAM per Mongrel instance. In the scheme of things, this isn't very much.
You might be able to run your application on the JVM with JRuby - the Ruby VM is currently not as advanced as the JVM for memory management and garbage collection. The 1.9 release is adding many improvements and there are alternative VM's under development as well.
Choose data structures that are efficient representations, scale well, and do what you need.
Use algorithms that work with efficient data structures rather than bloated but easier ones.
Look elsewhere: Ruby has a C bridge, and it's much easier to be memory-conscious in C than in Ruby.
Ruby developers are quite lucky since they don’t have to manage the memory themselves.
Be aware that Ruby allocates objects; for instance, something as simple as
100.times{ 'foo' }
allocates 100 string objects (strings are mutable and each version requires its own memory allocation).
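One mitigation, assuming Ruby 2.1+ where frozen string literals are deduplicated by the VM:

# The frozen literal is allocated once and reused on every iteration:
100.times { 'foo'.freeze }
# Or opt in for the whole file with the magic comment on its first line:
# # frozen_string_literal: true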
Make sure that if you are using a library allocating a lot of objects, that other alternatives are not available and your choice is worth paying the garbage collector cost. (you might not have a lot of requests/s or might not care for a few dozen ms per requests).
Creating a hash really allocates more than one object; for instance
{'joe' => 'male', 'jane' => 'female'}
doesn't allocate 1 object but 7: one hash and four string literals, plus two more strings, because Ruby dups (and freezes) string keys when they are inserted into a hash.
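You can sanity-check counts like that with GC.stat on Ruby 2.2+ (exact numbers vary a little between versions); a rough sketch:

before = GC.stat(:total_allocated_objects)
h = { 'joe' => 'male', 'jane' => 'female' }
after = GC.stat(:total_allocated_objects)
puts after - before  # roughly 7 on MRI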
If you can, use symbol keys, as they won't be garbage collected. However, because they won't be garbage collected, you want to make sure not to use totally dynamic keys (like converting a username to a symbol), otherwise you will 'leak' memory.
Example: Somewhere in your app, you apply a to_sym on an user’s name like :
hash[current_user.name.to_sym] = something
When you have hundreds of users that could be OK, but what happens if you have one million users? Here are the numbers:
ruby-1.9.2-head >
# Current memory usage : 6608K
# Now, add one million randomly generated short symbols
ruby-1.9.2-head > 1000000.times { (Time.now.to_f.to_s).to_sym }
# Current memory usage : 153M, even after a garbage collector run.
# Now, what if the symbols are 20x longer than that?
ruby-1.9.2-head > 1000000.times { (Time.now.to_f.to_s * 20).to_sym }
# Current memory usage : 501M
Be careful never to convert uncontrolled arguments to symbols (or validate the arguments first); this can easily lead to a denial of service.
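A minimal defensive sketch using an allowlist (the key names are hypothetical):

ALLOWED_KEYS = %w[name email country].freeze

def safe_symbolize(key)
  raise ArgumentError, "unknown key: #{key}" unless ALLOWED_KEYS.include?(key)
  key.to_sym
end

safe_symbolize('country')     # => :country
# safe_symbolize(user_input)  # raises instead of growing the symbol table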
Also, as a rule of thumb, avoid nesting loops and functions more than three levels deep; it makes maintenance difficult.
Here are some related links:
http://merbist.com
http://blog.monitis.com
When deploying a Rails/Rack webapp, use REE or some other copy-on-write friendly interpreter.
Tweak the garbage collector (see https://www.engineyard.com/blog/tuning-the-garbage-collector-with-ruby-1-9-2 for example)
Try to cut down the number of external libraries/gems you use since additional code uses memory.
If you have a part of your app that is really memory-intensive, it may be worth rewriting it as a C extension or replacing it with calls to other, faster, better-optimized programs (if you have to process vast amounts of text data, maybe you can replace that code with calls to grep, awk, sed, etc.)
I am not a Ruby developer, but I think some techniques and methods are true of any language:
Use the minimum size variable suitable for the job.
Destroy and close variables and connections when not in use.
However, if you have an object you will need to use many times, consider keeping it in scope.
In any loop that manipulates a big string, do the work on a smaller string and then append it to the bigger string (see the sketch after this list).
Use decent (try/catch/finally) error handling to make sure objects and connections are closed.
When dealing with data sets only return the minimum necessary
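Here is a minimal Ruby sketch of the smaller-string-buffer idea mentioned in the list above (sizes are illustrative):

big = String.new
buffer = String.new
lines = ['chunk'] * 100_000  # illustrative input

lines.each_with_index do |line, i|
  buffer << line             # do the work on the small string
  if (i + 1) % 1_000 == 0    # flush to the big string periodically
    big << buffer
    buffer.clear
  end
end
big << buffer                # flush the remainder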
Other than in extreme cases memory usage isn't something to worry about. The time you spend trying to reduce memory usage will buy a LOT of gigabytes.
Take a look at Small Memory Software - Patterns for Systems with Limited Memory. You don't specify what sort of memory constraint, but I assume RAM. While not Ruby-specific, I think you'll find some useful ideas in this book - the patterns cover RAM, ROM and secondary storage, and are divided into major techniques of small data structures, memory allocation, compression, secondary storage, and small architecture.
The only thing we've ever had which has actually been worth worrying about is RMagick.
The solution is to make sure you're using RMagick version 2 and to call Image#destroy! when you're done using your image.
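A minimal usage sketch (the file names are hypothetical):

require 'rmagick'  # 'RMagick' on older versions of the gem

img = Magick::Image.read('photo.jpg').first
img.write('copy.jpg')
img.destroy!  # release the pixel data now instead of waiting for GC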
Avoid code like this:
str = ''
veryLargeArray.each do |foo|
  str += foo
  # but str << foo is fine (read update below)
end
which will create each intermediate string value as a String object and then remove its only reference on the next iteration. This junks up the memory with tons of increasingly long strings that have to be garbage collected.
Instead, use Array#join:
str = veryLargeArray.join('')
This is implemented in C very efficiently and doesn't incur the String creation overhead.
UPDATE: Jonas is right in the comment below. My warning holds for += but not <<.
I'm pretty new at Ruby, but so far I haven't found it necessary to do anything special in this regard (that is, beyond what I just tend to do as a programmer generally). Maybe this is because memory is cheaper than the time it would take to seriously optimize for it (my Ruby code runs on machines with 4-12 GB of RAM). It might also be because the jobs I'm using it for are not long-running (i.e. it's going to depend on your application).
I'm using Python, but I guess the strategies are similar.
I try to use small functions/methods, so that local variables get automatically garbage collected when you return to the caller.
In larger functions/methods I explicitly delete large temporary objects (like lists) when they are no longer needed. Closing resources as early as possible might help too.
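A rough Ruby translation of that strategy (load_huge_rows is a hypothetical stand-in for real loading):

def load_huge_rows
  Array.new(1_000_000) { { amount: 1 } }  # stand-in for real data loading
end

def process_report
  rows = load_huge_rows
  total = rows.sum { |r| r[:amount] }
  rows = nil   # drop the only reference so GC can reclaim the array
  GC.start     # optional nudge in a long-running process
  total
end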
Something to keep in mind is the life cycle of your objects. If your objects are not passed around much, the garbage collector will eventually kick in and free them. However, if references linger, it may take several GC cycles before they are freed. This is particularly true in Ruby 1.8, where the garbage collector uses a poor implementation of the mark-and-sweep technique.
You may run into this situation when you apply some design patterns, like Decorator, that keep objects in memory for a long time. It may not be obvious when trying examples in isolation, but in real-world applications, where thousands of objects are created at the same time, the cost of memory growth will be significant.
When possible, use arrays instead of other data structures. Try not to use floats when integers will do.
Be careful when using gem/library methods. They may not be memory optimized. For example, the Ruby PG::Result class has a method 'values' which is not optimized. It will use a lot of extra memory. I have yet to report this.
Replacing the malloc(3) implementation with jemalloc will immediately decrease your memory consumption by up to 30%. I've created the 'jemalloc' gem to achieve this instantly.
'jemalloc' GEM: Inject jemalloc(3) into your Ruby app in 3 min
I try to keep arrays, lists, and datasets as small as possible. Individual objects do not matter much, as creation and garbage collection are pretty fast in most modern languages.
In cases where you have to read some sort of huge dataset from the database, make sure to read it in a forward-only manner and process it in little bits instead of loading everything into memory first.
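With ActiveRecord, for example, batched iteration gives you that forward-only behavior (assuming a Rails app with a hypothetical Event model):

# Loads 1,000 rows at a time instead of the whole table:
Event.find_each(batch_size: 1000) do |event|
  process(event)  # hypothetical per-row work
end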
Don't use a lot of symbols; they stay in memory until the process is killed, because (prior to Ruby 2.2) symbols never get garbage collected.
