Matching regular expressions against non-Strings in Ruby without conversion

If a Ruby regular expression is matched against something that isn't a String, the to_str method is called on that object to get an actual String to match against. I want to avoid this behavior; I'd like to match regular expressions against objects that aren't Strings but can be logically thought of as randomly accessible sequences of bytes, with all accesses mediated through a byte_at() method (similar in spirit to Java's CharSequence.charAt() method).
For example, suppose I want to find the byte offset in an arbitrary file of an arbitrary regular expression; the expression might be multi-line, so I can't just read in a line at a time and look for a match in each line. If the file is very big, I can't fit it all in memory, so I can't just read it in as one big string. However, it would be simple enough to define a method that gets the nth byte of a file (with buffering and caching as needed for speed).
Eventually, I'd like to build a fully featured rope class, like in Ruby Quiz #137, and I'd like to be able to use regular expressions on them without the performance loss of converting them to strings.
I don't want to get up to my elbows in the innards of Ruby's regular expression implementation, so any insight would be appreciated.

You can't. This wasn't supported in Ruby 1.8.x, probably because it's such an edge case; and in 1.9 it wouldn't even make sense. Ruby 1.9 doesn't map its strings to bytes in any user-serviceable fashion; instead it uses character code points, so that it can support the multitude of encodings that it accepts. And 1.9's new optimized regex engine, Oniguruma, is also built around the same concept of encodings and code points. Bytes just don't enter into the picture at this level.
I have a suspicion that what you're asking for is a case of premature optimization. For any reasonable Ruby object, implementing to_str shouldn't be a huge performance hurdle. If it is, then Ruby's probably the wrong tool for you, as it abstracts and insulates you from your raw data in all sorts of ways.
Your example of looking for a byte sequence in a large binary file isn't an ideal use case for Ruby -- you'd be better off using grep or some other Unix tool. If you need the results in your Ruby program, run it as a system process using backticks and process the output.
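That grep suggestion can be sketched like this (assuming a Unix system with grep on the PATH; note that grep -b reports the byte offset of the matching line, not of the match itself, and a temp file stands in for the "very big file"):

```ruby
require 'shellwords'
require 'tempfile'

# Build a throwaway file so the sketch is self-contained.
file = Tempfile.new("haystack")
file.write("line one\nline two with needle\n")
file.close

# -a treats the file as text, -b prefixes each match with its byte offset,
# -m 1 stops after the first match. Shellwords.escape guards the arguments.
output = `grep -a -b -m 1 -- #{Shellwords.escape("needle")} #{Shellwords.escape(file.path)}`

# grep prints "OFFSET:matched line"; an empty result means no match.
offset = output.empty? ? nil : output.split(":", 2).first.to_i
file.unlink
```

With the sample data above, the match is on the second line, so offset comes back as 9 (the length of "line one\n").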

Related

How does the ruby interpreter tell when you're calling a method on an integer or when you're setting a float?

I had a look through the output of 0.methods in irb and couldn't see what path the Ruby interpreter would take when it was passed 0.15 as opposed to 0.to_s.
I've tried reading up on how Ruby determines the difference between a floating-point literal and a method being called on an integer, but I haven't come to any conclusions.
The best guess I have is that, because Ruby doesn't allow a digit to lead a method name, it simply checks whether the character following the . is numeric or alphabetical.
I don't like guessing, though; assumptions can lead to misunderstandings. Can someone clear this up for me?
How well can you read Yacc files? (Rhetorical question)
https://github.com/ruby/ruby/blob/trunk/parse.y#L7380 I believe this is where the Ruby parser handles floating point tokenisation.
Disclaimer: parse.y hurts my head.
As methods in Ruby cannot begin with numbers, it's pretty easy to determine that 6.foo is a method call and 6.12 is a Float.
You can distinguish the two with pretty simple regular grammar rules, which is all a lexer needs to tokenize the source code.
I don't know for sure, but I think it is safe to assume that the two are distinguished by method names being unable to start with a number.
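The lexer's decision can be observed directly with Ripper, the parser interface in Ruby's standard library (1.9+), without reading parse.y:

```ruby
require 'ripper'

# Ripper.lex tokenizes source without running it. "6.12" lexes as a single
# float token, while "6.foo" lexes as integer, period, identifier.
float_tokens  = Ripper.lex("6.12").map { |t| t[1] }
method_tokens = Ripper.lex("6.foo").map { |t| t[1] }

float_tokens   # => [:on_float]
method_tokens  # => [:on_int, :on_period, :on_ident]
```

So the distinction really is made at tokenization time, before any method lookup happens.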
I don't see that it's an especially interesting or useful thing to know, and I think your curiosity is best directed elsewhere.

How to get unicode character that is alphabetically next?

How do I get, in Ruby 1.8.7, the Unicode character that comes alphabetically right after a given character?
If you mean "next in the code page" then you can always hack around with the bytes and find out. You will probably end up falling into holes with no assigned characters if you go exploring the code page sequentially. This would mean "Unicode-abetically" if you can imagine such a term.
If you mean "alphabetically" then you're out of luck since that doesn't mean anything. The concept of alphabetic order varies considerably from one language to another and is sometimes even context-specific. Some languages don't even have a set order to their characters at all. This is the reason why some systems have a collation in addition to an encoding. The collation defines order, but often many letters are considered equivalent for the purposes of sorting, further complicating things.
Ruby 1.8.7 is also not aware of Unicode in general and treats every string as an 8-bit byte sequence with one-byte characters. Ruby 1.9 can parse multi-byte UTF-8 into separate characters and might make this exercise a lot easier.
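In Ruby 1.9+, the "next code point" reading is a one-liner (a sketch; note this is not the same as String#succ, which carries like an odometer, e.g. "az".succ is "ba"):

```ruby
# Next Unicode code point after a given character (Ruby 1.9+).
# This walks the code space, not any alphabet, so it can land on
# unassigned or non-letter code points.
def next_codepoint(ch)
  (ch.ord + 1).chr(Encoding::UTF_8)
end

next_codepoint("a")  # => "b"
next_codepoint("ä")  # => "å"   (U+00E4 -> U+00E5)
```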

Efficiency of each_char vs. getc (and analogous methods) in Ruby

How do iterator methods such as each_char, each_line, etc. compare to while-looped getc, gets, etc. for reading large files? Mainly, what is the overhead of each method, which one will use more memory, and which one will be faster?
Essentially, which will be better in terms of memory, overhead, and speed if file is a 100MB text file?
file.each_char do |ch|
  # process ch
end
vs
until file.eof?
  ch = file.getc
  # process ch
end
Or is there an even better method of doing this?
You can easily answer questions like this definitively using the Ruby Standard Library's benchmark package.
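A minimal sketch of such a benchmark, using a StringIO in place of a real 100 MB file so it is self-contained (scale the data up to match your real workload):

```ruby
require 'benchmark'
require 'stringio'

data = "some text on a line\n" * 20_000

each_char_count = 0
getc_count      = 0

Benchmark.bm(10) do |x|
  x.report("each_char:") do
    StringIO.new(data).each_char { |ch| each_char_count += 1 }
  end
  x.report("getc:") do
    io = StringIO.new(data)
    until io.eof?
      io.getc
      getc_count += 1
    end
  end
end
# Both approaches visit every character exactly once; only the timing differs.
```

On MRI, each_char tends to win because the iteration loop runs in C, but the benchmark output on your machine with your file is the real answer.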
Or is there an even better method of doing this?
I think so. While a C program might reasonably process a file one character at a time, Ruby has a towering edifice of String and Array functionality built-in, and that's all written in C so it runs quickly.
It may not seem efficient to split a line into words and then just use a few, or just count them, or whatever, but it's probably a lot faster than parsing that line one character at a time and it's more easily read and rewritten if necessary.
In general, I would suggest that a Ruby program leverage the library and do as much as possible working with objects that are as abstract as possible.

Question about ruby symbols [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Why don't more projects use Ruby Symbols instead of Strings?
What's the difference between a string and a symbol in Ruby?
I am new to the Ruby language.
From what I read, I should use symbols instead of strings where possible.
Is that correct?
You don't HAVE TO do anything. However, it is advisable to use symbols instead of strings in cases where you will be using the same string literal over and over again. The reason is that only one instance of a given symbol is ever held in memory, while a string literal is created anew each time you evaluate it, potentially wasting memory.
Strings are mutable, whereas Symbols are not. Symbols are better performance-wise.
Read this article to better understand symbols, strings and their differences.
It's sort of up to you which one you use -- you can use a string anywhere you'd use a symbol, but not the other way around. Symbols do have a number of advantages, in a few cases.
Symbols give you a performance advantage because two symbols of the same name actually map to the same object in memory, whereas two strings with the same characters create different objects. Symbols are immutable and lightweight, which makes them ideal for elements that you won't be changing around at runtime; keys in a hash table, for example.
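That identity claim is easy to verify in irb (a sketch; String.new is used here so the comparison isn't affected by frozen-string-literal settings):

```ruby
# Two symbols with the same name are the very same object...
a = :status
b = :status
a.equal?(b)     # => true

# ...while two equal strings are normally distinct objects.
s1 = String.new("status")
s2 = String.new("status")
s1 == s2        # => true  (same contents)
s1.equal?(s2)   # => false (different objects)
```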
Here's a nice excerpt from the Ruby Newbie guide to symbols:
The granddaddy of all advantages is also the granddaddy of advantages: symbols can't be changed at runtime. If you need something that absolutely, positively must remain constant, and yet you don't want to use an identifier beginning with a capital letter (a constant), then symbols are what you need.
The big advantage of symbols is that they are immutable (can't be changed at runtime), and sometimes that's exactly what you want.
Sometimes that's exactly what you don't want. Most usage of strings requires manipulation -- something you can't do (at least directly) with symbols.
Another disadvantage of symbols is that they don't have the String class's rich set of instance methods. The String class's instance methods make life easier. Much easier.
Symbols have two main advantages:
They are like intern'ed strings, so they can improve performance.
They are syntactically distinct, which can make your code more readable, particularly for programs that use hashtables as complex built-in data structures.
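The hashtable point in particular is the idiom you will see everywhere in Ruby code:

```ruby
# Symbols as hash keys: compact to write, fast to compare, and visually
# distinct from the string *data* stored in the values.
config = { :host => "localhost", :port => 8080 }

config[:port]   # => 8080
config.keys     # => [:host, :port]
```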

Why does Ruby expose symbols?

Why does Ruby expose symbols for explicit use? Isn't that the sort of optimisation that's usually handled by the interpreter/compiler?
Part of the issue is that Ruby strings are mutable. Since every string Ruby allocates must be independent (it can't cache short/common ones), it's convenient to have a Symbol type to let the programmer have what are essentially immutable, memory-efficient strings.
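The mutability difference that forces this design is easy to demonstrate (a sketch):

```ruby
# Strings mutate in place, so Ruby can't safely share one object
# between all occurrences of the same literal...
s = "cache"
s << "-key"
s                         # => "cache-key"

# ...whereas a symbol has no mutating methods at all.
:cache.respond_to?(:<<)   # => false
```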
Also, they share many characteristics with enums, but with less pain for the programmer.
Ruby symbols are used in lieu of string constants in other similar languages. Besides the performance benefit, they can be used to semantically distinguish between string data and a more abstract symbol. Being syntactically different, they can clearly be distinguished in code.
Have a look at this post on Ruby symbols.