Why does Ruby expose symbols? - ruby

Why does Ruby expose symbols for explicit use? Isn't that the sort of optimisation that's usually handled by the interpreter/compiler?

Part of the issue is that Ruby strings are mutable. Since every string Ruby allocates must be independent (it can't cache short/common ones), it's convenient to have a Symbol type to let the programmer have what are essentially immutable, memory-efficient strings.
Also, they share many characteristics with enums, but with less pain for the programmer.

Ruby symbols are used in lieu of string constants in other similar languages. Besides the performance benefit, they can be used to semantically distinguish between string data and a more abstract symbol. Being syntactically different, they can clearly be distinguished in code.
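A small sketch of that distinction in ordinary Ruby (the hash and its keys are made up for illustration):

# Symbols name things; strings carry data.
order = { status: :shipped, note: "Leave at the back door" }
order[:status] #=> :shipped                   (an identifier, compared cheaply by identity)
order[:note]   #=> "Leave at the back door"   (actual text you might edit or display)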

Have a look at this Ruby symbols post.

Related

Can I change the encoding of a frozen String without copying it?

Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
I have a large, frozen String and I want to change its encoding. But I don't want to copy the whole String just to do that. For context, this is to pass values to a Google Protocol Buffer which has the bytes type and only accepts Encoding::ASCII_8BIT.
big_string.freeze

MyProtobuf::SomeMessage.new(
  # I would prefer not to have to copy the whole string just to
  # change the encoding.
  value: big_string.dup.force_encoding(Encoding::ASCII_8BIT)
)
It seems to work just fine for me (using MRI/YARV 1.9, 2.x, 3.x):
require 'objspace'
big_string = Random.bytes(1_000_000).force_encoding(Encoding::UTF_8)
big_string.encoding #=> #<Encoding:UTF-8>
big_string.bytesize #=> 1000000
ObjectSpace.memsize_of(big_string) #=> 1000041
dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)
dup_string.encoding #=> #<Encoding:ASCII-8BIT>
dup_string.bytesize #=> 1000000
ObjectSpace.memsize_of(dup_string) #=> 40
Those 40 bytes are the size to hold an object (RVALUE) in Ruby.
Note that instead of dup / force_encoding(Encoding::ASCII_8BIT) there's also b which returns a copy in binary encoding right away.
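For example, a short sketch continuing the session above (the memsize value is MRI-specific and may vary by version):

bin_string = big_string.b
bin_string.encoding                #=> #<Encoding:ASCII-8BIT>
ObjectSpace.memsize_of(bin_string) #=> 40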
For more in-depth information, here's a blog post from 2012 (Ruby 1.9) about copy-on-write / shared strings in Ruby:
Seeing double: how Ruby shares string values
From the author's book Ruby Under a Microscope: (p. 265)
Internally, both JRuby and MRI use an optimization called copy-on-write for strings and other data. This trick allows two identical string values to share the same data buffer, which saves both memory and time because Ruby avoids making separate copies of the same string data unnecessarily.
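Here is a quick way to watch that happen on MRI (a small sketch; exact sizes vary by version):

require 'objspace'
original = "a" * 1_000_000
copy = original.dup
ObjectSpace.memsize_of(copy) #=> 40 (still borrowing original's buffer)
copy << "!"                  # the first mutation forces a private copy
ObjectSpace.memsize_of(copy) #=> back above 1,000,000 (copy now owns its own buffer)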
Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
There is nothing in the Ruby Language Specification that prevents that. There is also nothing in the Ruby Language Specification that enforces that.
In general, the Ruby Language Specification tries to stay silent on all things related to memory management, space complexity, step complexity, or time complexity. This is not exclusive to the Ruby Language Specification, most Language Specifications try to leave the implementors as much leeway as possible. In other words, Language Specifications tend to specify Syntax and Semantics and leave the Pragmatics up to the implementor. (C++ is somewhat of an exception in that it specifies space and time complexity for the algorithms in the standard library.) Even C, which is typically thought of as a language which gives you full control over everything, doesn't actually specify things like memory layouts precisely – for example, due to the definition of the term width in the standard, a uint16_t is actually allowed to occupy more than 16 bits!
Every implementor is free to implement strings however they want, as long as they comply with the semantics defined in the Ruby Language Specification.
If I remember correctly, both Rubinius and TruffleRuby did, at one point, experiment with a String implementation based on Ropes. Chris Seaton, TruffleRuby's lead developer, wrote a paper about that implementation. However, I don't know if they are still using it. (I know TruffleRuby switched to Truffle Strings recently, and I am not sure what their underlying representation is … or whether they are even guaranteeing a specific underlying representation.)
There is a problem with the answer "you have to look at the specification", though: unfortunately, unlike many other programming languages, the Ruby Language Specification does not exist as a single document in a single place. Ruby does not have a single formal specification that defines what certain language constructs mean.
There are several resources, the sum of which can be considered kind of a specification for the Ruby programming language.
Some of these resources are:
The ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification – Note that the ISO Ruby Specification was written around 2009–2010 with the specific goal that all existing Ruby implementations at the time would easily be compliant. Since YARV and MacRuby only implement Ruby 1.9+, MRI only implements Ruby 1.8 and lower, and JRuby, XRuby, Ruby.NET, and IronRuby (at the time) only implemented a subset of Ruby 1.8, the ISO Ruby Specification only contains features that are common to both Ruby 1.8 and Ruby 1.9. Also, the ISO Ruby Specification was specifically intended to be minimal and to contain only the features that are absolutely required for writing Ruby programs. Because of that, it specifies Strings, for example, only very broadly (since they changed significantly between Ruby 1.8 and Ruby 1.9). It obviously also does not specify features which were added after the ISO Ruby Specification was written, such as Ractors or Pattern Matching.
The Ruby Spec Suite aka ruby/spec – Note that the ruby/spec is unfortunately far from complete. However, I quite like it because it is written in Ruby instead of "ISO-standardese", which is much easier to read for a Rubyist, and it doubles as an executable conformance test suite.
The Ruby Programming Language by David Flanagan and Yukihiro 'matz' Matsumoto – This book was written by David Flanagan together with Ruby's creator matz to serve as a Language Reference for Ruby.
Programming Ruby by Dave Thomas, Andy Hunt, and Chad Fowler – This book was the first English book about Ruby and served as the standard introduction and description of Ruby for a long time. This book also first documented the Ruby core library and standard library, and the authors donated that documentation back to the community.
The Ruby Issue Tracking System, specifically, the Feature sub-tracker – However, please note that unfortunately, the community is really, really bad at distinguishing between Tickets about the Ruby Programming Language and Tickets about the YARV Ruby Implementation: they both get intermingled in the tracker.
The Meeting Logs of the Ruby Developer Meetings. (Same problem: Ruby and YARV get intermingled.)
New features are often discussed on the mailing lists, in particular the ruby-core (English) and ruby-dev (Japanese) mailing lists. (Same problem again.)
The Ruby documentation – Again, be aware that this documentation is generated from the source code of YARV and does not distinguish between features of Ruby and features of YARV.
In the past, there were a couple of attempts of formalizing changes to the Ruby Specification, such as the Ruby Change Request (RCR) and Ruby Enhancement Proposal (REP) processes, both of which were unsuccessful.
If all else fails, you need to check the source code of the popular Ruby implementations to see what they actually do. Please note the plural: you have to look at multiple, ideally all, implementations to figure out what the consensus is. Only looking at one implementation cannot possibly tell you whether what you are looking at is an implementation quirk of this particular implementation or is a universally agreed-upon behavior of the Ruby Language.

How does the ruby interpreter tell when you're calling a method on an integer or when you're setting a float?

I had a look through the 0.methods output in irb and couldn't see what path the ruby interpreter would take when it was passed 0.15 as opposed to 0.to_s.
I've tried reading up on how ruby determines the difference between a floating point number being defined and a method being called on an integer but I haven't come to any conclusions.
The best guess I have is that because Ruby doesn't allow for a digit to lead a method name, it simply checks whether the character following the . is numeric or alphabetical.
I don't like guessing though, assumptions can lead to misunderstandings. Can someone clear this up for me?
How well can you read Yacc files? (Rhetorical question)
https://github.com/ruby/ruby/blob/trunk/parse.y#L7380 I believe this is where the Ruby parser handles floating point tokenisation.
Disclaimer: parse.y hurts my head.
As methods in Ruby cannot begin with a digit, it's pretty easy to determine that 6.foo is a method call and 6.12 is a Float.
You can distinguish the two with pretty simple regular-grammar rules, which is all a lexer needs to tokenize the source code.
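You can watch the lexer make exactly that decision with Ripper, which ships with MRI (a quick sketch; the exact output format differs slightly between Ruby versions):

require 'ripper'
Ripper.lex("0.15").map { |_, type, tok| [type, tok] }
#=> [[:on_float, "0.15"]]                                      (a single Float literal)
Ripper.lex("0.to_s").map { |_, type, tok| [type, tok] }
#=> [[:on_int, "0"], [:on_period, "."], [:on_ident, "to_s"]]   (Integer, dot, method name)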
I don't know for sure, but I think it is safe to assume that the two are distinguished by method names being unable to start with a number.
I don't see that it's an especially interesting or useful thing to know, and I think your curiosity is best directed elsewhere.

Can we just take a symbol literal as an "interned string"?

Does anyone know if there is a similar concept to Ruby's symbol literal in other popular languages? Can I consider it just as an "interned string"?
Yes, symbols (sometimes referred to as atoms in other languages) can be considered interned strings.
There's a ton of information on Ruby symbols here: Question - Understanding Symbols In Ruby
And, an afterthought, this question lists many examples of similar concepts in a couple languages:
Lisp and Erlang Atoms, Ruby and Scheme Symbols. How useful are they?
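One way to see the "interned" behaviour in plain Ruby (a small sketch; this assumes frozen string literals are not enabled):

:ruby.equal?(:ruby)   #=> true   (the same Symbol object every time)
"ruby".equal?("ruby") #=> false  (each String literal is a new object)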
Does anyone know if there is a similar concept to Ruby's symbol literal in other popular languages?
Sure, symbols in Ruby come from symbols in Smalltalk, which in turn gets them from Lisp. Scala also has symbols, and Erlang's atoms are similar. Erlang probably got them from Prolog.
Can I consider it just as an "interned string"?
You can consider it all sorts of things, but symbols are symbols. They aren't immutable strings or interned strings or whatever ... they are just symbols.

Why doesn't Haskell have symbols (a la ruby) / atoms (a la erlang)?

The two languages where I have used symbols are Ruby and Erlang and I've always found them to be extremely useful.
Haskell does have algebraic datatypes, but I still think symbols would be mighty convenient. An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
The syntactic sugar for atoms can be minor - :something or <something> is an atom. All atoms are instances of a Type called Atom which derives Show and Eq. You can then use it for more descriptive error codes, for example
type ErrorCode = Atom
type Message = String
data Error = Error ErrorCode Message
loginError = Error :redirect "Please login first"
In this case :redirect is more efficient than using a string ("redirect") and easier to understand than an integer (404).
The benefit may seem minor, but I say it is worth adding atoms as a language feature (or at least a GHC extension).
So why have symbols not been added to the language? Or am I thinking about this the wrong way?
I agree with camccann's answer that it's probably missing mainly because it would have to be baked quite deeply into the implementation and it is of too little use for this level of complication. In Erlang (and Prolog and Lisp) symbols (or atoms) usually serve as special markers and serve mostly the same notion as a constructor. In Lisp, the dynamic environment includes the compiler, so it's partly also a (useful) compiler concept leaking into the runtime.
The problem is the following: symbol interning is impure (it modifies the symbol table). Because we never modify an existing object, it is still referentially transparent; if implemented naïvely, however, it can lead to space leaks in the runtime. In fact, as currently implemented in Erlang you can actually crash the VM by interning too many symbols/atoms (current limit is 2^20, I think), because they can never get garbage collected. It's also difficult to implement in a concurrent setting without a huge lock around the symbol table.
Both problems can be (and have been) solved, however. For example, see Erlang EEP 20. I use this technique in the simple-atom package. It uses unsafePerformIO under the hood, but only in (hopefully) rare cases. It could still use some help from the GC to perform an optimisation similar to indirection shortening. It also uses quite a few IORefs internally which isn't too great for performance and memory usage.
In summary, it can be done but implementing it properly is non-trivial. Compiler writers always weigh the power of a feature against its implementation and maintenance efforts, and it seems like first-class symbols lose out on this one.
I think the simplest answer is that, of the things Lisp-style symbols (which is where both Ruby and Erlang got the idea, I believe) are used for, in Haskell most are either:
Already done in some other fashion--e.g. a data type with a bunch of nullary constructors, which also behave as "convenient names for integers".
Awkward to fit in--things that exist at the level of language syntax instead of being regular data usually have more type information associated with them, but symbols would have to either be distinct types from each other (nearly useless without some sort of lightweight ad-hoc sum type) or all the same type (in which case they're barely different from just using strings).
Also, keep in mind that Haskell itself is actually a very, very small language. Very little is "baked in", and of the things that are most are just syntactic sugar for other primitives. This is a bit less true if you include a bunch of GHC extensions, but GHC with -XAndTheKitchenSinkToo is not the same language as Haskell proper.
Also, Haskell is very amenable to pseudo-syntax and metaprogramming, so there's a lot you can do even without having it built in. Particularly if you get into TH and scary type metaprogramming and whatever else.
So what it mostly comes down to is that most of the practical utility of symbols is already available from other features, and the stuff that isn't available would be more difficult to add than it's worth.
Atoms aren't provided by the language, but can be implemented reasonably as a library:
http://hackage.haskell.org/package/simple-atom
There are a few other libs on hackage, but this one looks the most recent and well-maintained.
Haskell uses type constructors* instead of symbols so that the set of symbols a function can take is closed, and can be reasoned about by the type system. You could add symbols to the language, but it would put you in the same place that using strings would - you'd have to check all possible symbols against the few with known meanings at runtime, add error handling all over the place, etc. It'd be a big workaround for all the compile-time checking.
The main difference between strings and symbols is interning - symbols are atomic and can be compared in constant time. Both are types with an essentially infinite number of distinct values, though, which goes against the grain of Haskell's practice of specifying arguments and results with finite types.
* I'm more familiar with OCaml than Haskell, so "type constructor" may not be the right term. I mean things like None or Just 3.
An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
Use Enum instead.
data FileType = GZipped | BZipped | Plain
  deriving Enum

descr ft = [ "compressed with gzip"
           , "compressed with bzip2"
           , "uncompressed"
           ] !! fromEnum ft

Question about ruby symbols [duplicate]

Possible Duplicates:
Why don't more projects use Ruby Symbols instead of Strings?
What's the difference between a string and a symbol in Ruby?
I am new to the Ruby language.
From what I have read, I should use symbols instead of strings wherever possible.
Is that correct?
You don't HAVE TO do anything. However, it is advisable to use symbols instead of strings in cases where you will be using the same string literal over and over again. The reason is that only one instance of a given symbol is ever held in memory, while a string literal creates a new String object every time it is evaluated, potentially wasting memory.
Strings are mutable, whereas Symbols are not. Symbols are also better performance-wise.
Read this article to better understand symbols, strings and their differences.
It's sort of up to you which one you use -- you can use a string anywhere you'd use a symbol, but not the other way around. Symbols do have a number of advantages, in a few cases.
Symbols give you a performance advantage because two symbols of the same name actually map to the same object in memory, whereas two strings with the same characters create different objects. Symbols are immutable and lightweight, which makes them ideal for elements that you won't be changing around at runtime; keys in a hash table, for example.
Here's a nice excerpt from the Ruby Newbie guide to symbols:
The granddaddy of all advantages is also the granddaddy of advantages: symbols can't be changed at runtime. If you need something that absolutely, positively must remain constant, and yet you don't want to use an identifier beginning with a capital letter (a constant), then symbols are what you need.
The big advantage of symbols is that they are immutable (can't be changed at runtime), and sometimes that's exactly what you want.
Sometimes that's exactly what you don't want. Most usage of strings requires manipulation -- something you can't do (at least directly) with symbols.
Another disadvantage of symbols is they don't have the String class's rich set of instance methods. The String class's instance methods make life easier. Much easier.
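A small illustration of both points in plain Ruby (a sketch; convert with to_s when you need String's API):

greeting = "hello"
greeting << " world"      # Strings can be modified in place
greeting                  #=> "hello world"
:hello.frozen?            #=> true   (Symbols are always frozen)
:hello.respond_to?(:gsub) #=> false  (and lack most of String's rich API)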
Symbols have two main advantages:
They are like intern'ed strings, so they can improve performance.
They are syntactically distinct, which can make your code more readable, particularly for programs that use hash tables as complex built-in data structures.
