I am new to the Ruby language.
From what I have read, I should use symbols instead of strings wherever possible.
Is that correct?
You don't HAVE TO do anything. However, it is advisable to use symbols instead of strings in cases where you will be using the same string literal over and over again. The reason is that only one copy of a given symbol is ever held in memory, whereas a string literal is created anew every time you use it, potentially wasting memory.
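For example, a quick check in irb makes the difference visible (a minimal sketch; results assume CRuby, and the exact object IDs vary per run):

"ruby".object_id == "ruby".object_id  # => false -- each literal allocates a new String
:ruby.object_id == :ruby.object_id    # => true  -- every :ruby refers to the same object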
Strings are mutable, whereas symbols are not. Symbols are also better performance-wise.
Read this article to better understand symbols, strings and their differences.
It's sort of up to you which one you use -- you can use a string anywhere you'd use a symbol, but not the other way around. Symbols do have a number of advantages, in a few cases.
Symbols give you a performance advantage because two symbols of the same name actually map to the same object in memory, whereas two strings with the same characters create different objects. Symbols are immutable and lightweight, which makes them ideal for elements that you won't be changing at runtime -- keys in a hash table, for example.
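A small sketch of the hash-key case (the hash contents are made up for illustration):

config = { host: "example.com", port: 443 }  # :host and :port are symbols
config[:host]   # => "example.com"
:host.frozen?   # => true -- a symbol can never be mutated
"host".frozen?  # => false -- a plain string literal can still be modified in place (unless frozen string literals are enabled)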
Here's a nice excerpt from the Ruby Newbie guide to symbols:
The granddaddy of all advantages is also the granddaddy of disadvantages: symbols can't be changed at runtime. If you need something that absolutely, positively must remain constant, and yet you don't want to use an identifier beginning with a capital letter (a constant), then symbols are what you need.

The big advantage of symbols is that they are immutable (can't be changed at runtime), and sometimes that's exactly what you want. Sometimes that's exactly what you don't want. Most usage of strings requires manipulation -- something you can't do (at least directly) with symbols.

Another disadvantage of symbols is that they don't have the String class's rich set of instance methods. The String class's instance methods make life easier. Much easier.
Symbols have two main advantages:
They are like interned strings, so they can improve performance.
They are syntactically distinct, which can make your code more readable, particularly for programs that use hash tables as complex built-in data structures (see the sketch below).
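Here is the readability sketch mentioned above; symbol keys stand out visually from string data (the structure is made up for illustration):

user = {
  name: "Ada",
  roles: [:admin, :editor],
  response_headers: { "X-Request-Id" => "abc123" }  # string keys kept for external data
}
user[:roles].include?(:admin)  # => true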
Related
I have a large dataset from an analytics provider.
It arrives in JSON and I parse it into a hash, but due to the size of the set I'm ballooning to over a gig in memory usage. Almost everything starts as strings (a few values are numerical), and while of course the keys are duplicated many times, many of the values are repeated as well.
So I was thinking, why not symbolize all the (non-numerical) values, as well?
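Roughly what I have in mind, as a sketch over a flat hash (the real data is nested, and raw here stands for the provider's JSON payload):

require 'json'

data = JSON.parse(raw, symbolize_names: true)               # keys become symbols
data.each { |k, v| data[k] = v.to_sym if v.is_a?(String) }  # symbolize the string values too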
I've found some discussion of potential problems, but I figure it would be nice to have a comprehensive description for Ruby, since the problems seem to depend on the implementation of the interning process (what happens when you symbolize a string).
I found this talking about Java:
Is it good practice to use java.lang.String.intern()?
The interning process can be expensive
Interned strings are never de-allocated, resulting in a memory leak
(Except there's some contention on that last point.)
So, can anyone give a detailed explanation of when not to intern strings in Ruby?
When the list of things in question is an open set (i.e., dynamic, with no fixed inventory), you should not convert them into symbols. Each symbol created will not be garbage collected and will cause a memory leak.
When the list of things in question is a closed set (i.e., static, with a fixed inventory), it is better to convert them into symbols. Each symbol will be created only once and then reused, which saves memory.
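A sketch of the difference (the params and registry names are hypothetical):

# Open set: arbitrary user-supplied keys -- every distinct value becomes a
# long-lived symbol (and, before Ruby 2.2, was never garbage collected).
params.each_key { |k| registry[k.to_sym] = true }  # risky if the keys are unbounded

# Closed set: a fixed inventory -- the same few symbols are simply reused.
VALID_STATES = %w[pending active archived]
state = raw_state.to_sym if VALID_STATES.include?(raw_state)  # only known values are interned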
The interning process can be expensive
There is always a tradeoff between memory and computing power, and we have to choose. So try some of the best practices out there and benchmark to figure out what's right for you. A few suggestions I like to mention:
Symbols are an excellent choice for hash keys:
{name: "my name"}
Freeze strings to save memory, and try to keep a small string pool (see the sketch after this list):
person[:country] = "USA".freeze
Have fun with Ruby GC tuning.
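As a sketch of the freeze suggestion: in CRuby 2.3+ with the frozen_string_literal magic comment, identical string literals in a file are deduplicated and share a single frozen object:

# frozen_string_literal: true
a = "USA"
b = "USA"
a.equal?(b)  # => true -- both names point at the same frozen String
a.frozen?    # => true

Without the magic comment (or an explicit .freeze), each occurrence of the literal allocates a new String.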
Interned strings are never de-allocated, resulting in a memory leak
Ruby 2.2 introduced symbol garbage collection, so this concern is no longer valid. However, overuse of frozen strings and symbols can decrease performance.
What is the difference between an "interned" and an "uninterned" symbol? Is it only Racket that has uninterned symbols, or do other dialects of Scheme or Lisp have them?
Interned symbols are eq? if and only if they have the same name. Uninterned symbols are not eq? to any other symbol, so they are a kind of unique token with an attached string. Interned symbols are the kind produced by the default reader. Uninterned symbols can be used as identifiers when generating code in a macro; such an identifier cannot be shadowed by any other identifier. Most Lisp dialects have this concept; in Scheme it is rarer, since hygienic macros are supposed to reduce its usefulness.
Common Lisp has uninterned symbols. As Juho's answer says, an uninterned symbol is guaranteed not to be equal to any other value.
Writing many Common Lisp-style macros correctly (particularly macros whose expansion requires introducing and binding new variables) requires uninterned symbols, because any interned symbol you use in a macro expansion might capture or shadow a binding at its expansion site.
Scheme's hygienic macro systems, on the other hand, do not have this problem, so a Scheme system does not need to provide uninterned symbols. Still, many of them do. Why? Several reasons:
Some Scheme systems offer a Common Lisp-style defmacro capability.
In others, the hygienic macro system's implementation may use uninterned symbols internally, but the concept of an uninterned symbol may be exposed.
Uninterned symbols can be useful in many programs that use s-expressions to represent a language, and transform this language into another s-expr language. These sorts of tasks often benefit from an ability to generate an identifier guaranteed to be new.
The other two excellent answers nevertheless fail to mention the virtue of interned values generally, which is that they can be compared in constant time. This typically means that these values are represented as pointers to a table without duplicates. In Racket, as of a few months ago[*], other values--floating point values and strings used as literals, for instance--will also be interned. In addition to allowing faster comparisons, I believe that this enables better compile-time optimizations, because these values can be compared for equality without running the code.
Are there other systems that do things like this? I bet there are.
[*] I'm sure someone will correct me if I'm wrong :).
The two languages where I have used symbols are Ruby and Erlang and I've always found them to be extremely useful.
Haskell does have algebraic datatypes, but I still think symbols would be mighty convenient. An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
The syntactic sugar for atoms can be minor - :something or <something> is an atom. All atoms are instances of a Type called Atom which derives Show and Eq. You can then use it for more descriptive error codes, for example
type ErrorCode = Atom
type Message = String
data Error = Error ErrorCode Message
loginError = Error :redirect "Please login first"
In this case :redirect is more efficient than using a string ("redirect") and easier to understand than an integer (404).
The benefit may seem minor, but I say it is worth adding atoms as a language feature (or at least a GHC extension).
So why have symbols not been added to the language? Or am I thinking about this the wrong way?
I agree with camccann's answer that it's probably missing mainly because it would have to be baked quite deeply into the implementation, and it is of too little use for that level of complication. In Erlang (and Prolog and Lisp), symbols (or atoms) usually serve as special markers and fill mostly the same role as a constructor. In Lisp, the dynamic environment includes the compiler, so it's partly also a (useful) compiler concept leaking into the runtime.
The problem is the following: symbol interning is impure (it modifies the symbol table). Because we never modify an existing object it is still referentially transparent, but if implemented naïvely it can lead to space leaks in the runtime. In fact, as currently implemented in Erlang, you can actually crash the VM by interning too many symbols/atoms (the current limit is 2^20, I think), because they can never get garbage collected. It's also difficult to implement in a concurrent setting without a huge lock around the symbol table.
Both problems can be (and have been) solved, however. For example, see Erlang EEP 20. I use this technique in the simple-atom package. It uses unsafePerformIO under the hood, but only in (hopefully) rare cases. It could still use some help from the GC to perform an optimisation similar to indirection shortening. It also uses quite a few IORefs internally which isn't too great for performance and memory usage.
In summary, it can be done but implementing it properly is non-trivial. Compiler writers always weigh the power of a feature against its implementation and maintenance efforts, and it seems like first-class symbols lose out on this one.
I think the simplest answer is that, of the things Lisp-style symbols (which is where both Ruby and Erlang got the idea, I believe) are used for, in Haskell most are either:
Already done in some other fashion--e.g. a data type with a bunch of nullary constructors, which also behave as "convenient names for integers".
Awkward to fit in--things that exist at the level of language syntax instead of being regular data usually have more type information associated with them, but symbols would have to either be distinct types from each other (nearly useless without some sort of lightweight ad-hoc sum type) or all the same type (in which case they're barely different from just using strings).
Also, keep in mind that Haskell itself is actually a very, very small language. Very little is "baked in", and of the things that are most are just syntactic sugar for other primitives. This is a bit less true if you include a bunch of GHC extensions, but GHC with -XAndTheKitchenSinkToo is not the same language as Haskell proper.
Also, Haskell is very amenable to pseudo-syntax and metaprogramming, so there's a lot you can do even without having it built in. Particularly if you get into TH and scary type metaprogramming and whatever else.
So what it mostly comes down to is that most of the practical utility of symbols is already available from other features, and the stuff that isn't available would be more difficult to add than it's worth.
Atoms aren't provided by the language, but can be implemented reasonably as a library:
http://hackage.haskell.org/package/simple-atom
There are a few other libs on hackage, but this one looks the most recent and well-maintained.
Haskell uses type constructors* instead of symbols so that the set of symbols a function can take is closed, and can be reasoned about by the type system. You could add symbols to the language, but it would put you in the same place that using strings would - you'd have to check all possible symbols against the few with known meanings at runtime, add error handling all over the place, etc. It'd be a big workaround for all the compile-time checking.
The main difference between strings and symbols is interning -- symbols are atomic and can be compared in constant time. Both are types with an essentially infinite number of distinct values, though, which goes against the grain of Haskell's practice of specifying arguments and results with finite types.
I'm more familiar with OCaml than Haskell, so "type constructor" may not be the right term. Things like None or Just 3.
An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
Use Enum instead.
data FileType = GZipped | BZipped | Plain
  deriving Enum

descr ft = ["compressed with gzip",
            "compressed with bzip2",
            "uncompressed"] !! fromEnum ft
Why does Ruby expose symbols for explicit use? Isn't that the sort of optimisation that's usually handled by the interpreter/compiler?
Part of the issue is that Ruby strings are mutable. Since every string Ruby allocates must be independent (it can't cache short/common ones), it's convenient to have a Symbol type to let the programmer have what are essentially immutable, memory-efficient strings.
Also, they share many characteristics with enums, but with less pain for the programmer.
Ruby symbols are used where other, similar languages would use string constants. Besides the performance benefit, they can be used to semantically distinguish between string data and a more abstract symbol. Being syntactically different, they can be clearly distinguished in code.
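For example (a sketch), the symbol marks an abstract state while the string carries display data:

order = { status: :shipped, note: "Left the warehouse on Friday" }

case order[:status]
when :shipped then puts "Done: #{order[:note]}"
when :pending then puts "Still waiting"
end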
Have a look at Ruby symbols post.
If a Ruby regular expression is matching against something that isn't a String, the to_str method is called on that object to get an actual String to match against. I want to avoid this behavior; I'd like to match regular expressions against objects that aren't Strings, but can be logically thought of as randomly accessible sequences of bytes, and all accesses to them are mediated through a byte_at() method (similar in spirit to Java's CharSequence.char_at() method).
For example, suppose I want to find the byte offset in an arbitrary file of an arbitrary regular expression; the expression might be multi-line, so I can't just read in a line at a time and look for a match in each line. If the file is very big, I can't fit it all in memory, so I can't just read it in as one big string. However, it would be simple enough to define a method that gets the nth byte of a file (with buffering and caching as needed for speed).
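For illustration, a buffered byte_at might look something like this (my own sketch; the class name, chunk size, and caching strategy are arbitrary):

class FileBytes
  CHUNK = 4096

  def initialize(path)
    @io = File.open(path, "rb")
    @chunks = {}  # chunk index => chunk data
  end

  # Return the byte (0..255) at offset n, or nil past the end of the file.
  def byte_at(n)
    idx = n / CHUNK
    chunk = @chunks[idx] ||= begin
      @io.seek(idx * CHUNK)
      @io.read(CHUNK)
    end
    chunk && chunk.getbyte(n % CHUNK)
  end
end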
Eventually, I'd like to build a fully featured rope class, like in Ruby Quiz #137, and I'd like to be able to use regular expressions on them without the performance loss of converting them to strings.
I don't want to get up to my elbows in the innards of Ruby's regular expression implementation, so any insight would be appreciated.
You can't. This wasn't supported in Ruby 1.8.x, probably because it's such an edge case; and in 1.9 it wouldn't even make sense. Ruby 1.9 doesn't map its strings to bytes in any user-serviceable fashion; instead it uses character code points, so that it can support the multitude of encodings that it accepts. And 1.9's new optimized regex engine, Oniguruma, is also built around the same concept of encodings and code points. Bytes just don't enter into the picture at this level.
I have a suspicion that what you're asking for is a case of premature optimization. For any reasonable Ruby object, implementing to_str shouldn't be a huge performance hurdle. If it is, then Ruby's probably the wrong tool for you, as it abstracts and insulates you from your raw data in all sorts of ways.
Your example of looking for a byte sequence in a large binary file isn't an ideal use case for Ruby -- you'd be better off using grep or some other Unix tool. If you need the results in your Ruby program, run it as a system process using backticks and process the output.
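For example, GNU grep can report byte offsets directly, and the result is easy to pull back into Ruby (a sketch; the file name and pattern are placeholders):

# -a: treat binary as text, -o: print only the match, -b: prefix each match with its byte offset
offsets = `grep -aob 'needle' big_file.bin`.lines.map { |line| line.split(":", 2).first.to_i }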