String parsing optimization in Ruby

I am working on a parser that is currently far too slow for my needs (about 40x slower than I would like) and would like advice on how to speed it up. I have tried, and am currently using, a custom regex parser as well as a custom parser built on the StringScanner class. I've heard a lot of positive comments about Treetop, and have considered combining the regexes into one huge regex that would cover all matches, but I would like feedback from people with experience before I rewrite my parser yet again.
The basic rules of the strings that I am parsing are:
3 segments (BoL operators, message, EoL operators)
~6 BoL operators
BoL operators can be in any order
2 EoL operators
EoL operators can be in any order
Quantity of any specific operator can be 0, 1, or >1, but only 1 is used; the rest are removed and discarded
Operators in the 'message' section of the string are not captured / removed
Whitespace is allowed before & after operators but not required
Some BoL operators can have whitespace in the setting
My current regex parser works by running the string through a loop that checks for BoL or EoL operators one at a time and cuts them out, ending the loop when there are no more operators of the given type, like so:
loop {
  if input =~ /^\s+/ then input.gsub!(/^\s+/, '') end
  if input =~ /regex for operator_a/
    # sets operator_a
    input.gsub!(/regex for operator_a/, '')
  elsif input =~ /regex for operator_b/
    # sets operator_b
    input.gsub!(/regex for operator_b/, '')
  elsif input =~ /regex for operator_c/
    # sets operator_c
    # etc .. etc .. etc..
  else
    break
  end
}
My question is: what would be the best way to optimize this code? Treetop, another library/gem that I have not found yet, combining the loops into one huge regex, or something else?
Please restrict all answers and input to the Ruby language. I know that it is not the 'best' tool for this job, but it is the language that I use.
More specific grammar / examples, if that helps:
This is for parsing communication commands sent to a game by users; so far the only commands are say and whisper. The beginning-of-line operators accepted are ::{target}, :{adverb}, ={verb}, and #{direction of}. The end-of-line operators are {emoticon} (e.g. :D :( :) ), which sets the adverb if not already set, and end-of-line punctuation, which sets the verb if not already set.
The character ' is an alias for say, and sayto is an alias for say::
Examples:
':happy::my sword=as# my helm Bol command operators work.
{:action=>:say, :adverb=>"happily", :verb=>"ask", :direction=>"my helm", :message=>"Bol command operators work."}
say yep say works
{:action=>:say, :message=>" yep say works"}
sayto my sword yep sayto works as do EoL operators!:)
{:action=>:say, :target=>"my sword", :adverb=>"happily", :verb=>"say", :message=>"yep sayto works as do EoL operators!"}
whisper::my friend : happy Bol command operators work with whisper.
{:action=>:whisper, :target=>"my friend", :adverb=>"happily", :message=>"Bol command operators work with whisper."}
whisp:happy::tinkerbell and they work in a different order.
{:action=>:whisper, :adverb=>"happily", :target=>"tinkerbell", :message=>"and they work in a different order."}
':bash=exclaim::hammer BoL operators work in this order too.
{:action=>:say, :adverb=>"bashfully", :verb=>"exclaim", :target=>"hammer", :message=>"BoL operators work in this order too."}
sayto bells =say :sad #wontwork Bol > Eol and directed !work with directional? :)
{:action=>:say, :verb=>"say", :adverb=>"sadly", :direction=>"wontwork", :message=>"Bol > Eol and directed !work with directional?"}
'all EoL removed closest to end used and reinserted. !!??!?....... :) ? :(
{:action=>:say, :adverb=>"sadly", :verb=>"ask", :message=>"all EoL removed closest to end used and reinserted?"}

Maybe this syntax is useful in your case:
emoti_convert = { ":)" => "happily", ":(" => "sadly" }
re_emoti = Regexp.union(emoti_convert.keys)
str = "It does not work :(. Oh, it does :)!"
p str.gsub(re_emoti, emoti_convert)
#=> "It does not work sadly. Oh, it does happily!"
But if you are trying to define a grammar, this is not the way to go (agreeing with @Dave Newton's comments).
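If the goal is to avoid rescanning the string once per operator, one possibility is a single left-to-right pass with StringScanner from Ruby's standard library. The following is only a minimal sketch under simplifying assumptions that are not in the question: every operator value is a single word, the first occurrence of an operator wins, and the names BOL_PATTERN, OPERATOR_KEYS and parse_bol are made up for illustration.

```ruby
require 'strscan'

# Assumed, simplified grammar: an operator sigil followed by a one-word value.
# "::" is listed before ":" so the longer sigil is tried first.
BOL_PATTERN = /\s*(::|:|=|#)\s*(\w+)/

# Hypothetical mapping from sigil to result key.
OPERATOR_KEYS = { '::' => :target, ':' => :adverb, '=' => :verb, '#' => :direction }.freeze

def parse_bol(input)
  scanner = StringScanner.new(input)
  result = {}
  # Consume operators from the front until the next token is not an operator.
  while scanner.scan(BOL_PATTERN)
    key = OPERATOR_KEYS[scanner[1]]
    result[key] ||= scanner[2] # keep the first occurrence, discard repeats
  end
  # Whatever is left is treated as the message.
  result[:message] = scanner.rest.strip
  result
end

parsed = parse_bol(':happy::tinkerbell and they work in a different order.')
# parsed[:adverb] is "happy", parsed[:target] is "tinkerbell",
# parsed[:message] is "and they work in a different order."
```

Because the scanner only ever moves forward, each character is examined a small, bounded number of times, unlike the loop that re-tests every regex against the whole string on each iteration. Multi-word values such as "my sword" would need a rule for where a value ends, which the sketch deliberately leaves out.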

Related

Count the number of sentences in a paragraph using Ruby

I have gotten to the point where I can split and count sentences with simple end of sentence punctuation like ! ? .
However, I need it to work for complex sentences such as:
"Learning Ruby is a great endeavor!!!! Well, it can be difficult at times..."
Here you can see the punctuation repeats itself.
What I have so far, that works with simple sentences:
def count_sentences
  sentence_array = self.split(/[.?!]/)
  return sentence_array.count
end
Thank you!
It's pretty easy to adapt your code to be a little more forgiving:
def count_sentences
  self.split(/[.?!]+/).count
end
There's no need for the intermediate variable or return.
Note that empty strings will also be caught up in this, so you may want to filter those out:
test = "This is junk! There's a space at the end! "
That would return 3 with your code. Here's a fix for that:
def count_sentences
  self.split(/[.?!]+/).grep(/\S/).count
end
That will select only those strings that have at least one non-space character.
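To see why the filter matters, it can help to look at the intermediate arrays for the test string above. This is just an illustration; the variable names are arbitrary.

```ruby
text = "This is junk! There's a space at the end! "

# Splitting on runs of punctuation leaves a whitespace-only last piece.
pieces = text.split(/[.?!]+/)
# pieces is ["This is junk", " There's a space at the end", " "]

# grep(/\S/) keeps only the pieces containing a non-space character.
non_blank = pieces.grep(/\S/)
# non_blank is ["This is junk", " There's a space at the end"]

puts pieces.count    # prints 3
puts non_blank.count # prints 2
```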
class String
  def count_sentences
    scan(/[.!?]+(?=\s|\z)/).size
  end
end
str = "Learning Ruby is great!!!! The course cost $2.43... How much??!"
str.count_sentences
#=> 3
(?=\s|\z) is a positive lookahead, requiring the match to be immediately followed by a whitespace character or the end of the string.
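A quick way to see what the lookahead changes is to compare the matches with and without it on the same string (the variable names here are only for illustration):

```ruby
str = "Learning Ruby is great!!!! The course cost $2.43... How much??!"

# Without the lookahead, the decimal point in "$2.43" counts as a match.
without_lookahead = str.scan(/[.!?]+/)
# without_lookahead is ["!!!!", ".", "...", "??!"]

# With it, only punctuation followed by whitespace or end of string matches.
with_lookahead = str.scan(/[.!?]+(?=\s|\z)/)
# with_lookahead is ["!!!!", "...", "??!"]
```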
String#count might be easiest.
"Who will treat me to a beer? I bet, alexnewby will!".count('.!?')
Compared to tadman's solution, no intermediate array needs to be constructed. However, it yields incorrect results if, for instance, a run of periods or exclamation marks is found in the string:
"Now thinking .... Ah, that's it! This is what we have to do!!!".count('.!?')
=> 8
The question therefore is: do you need absolute, exact results, or just approximate ones (which might be sufficient if this is used for statistical analysis of, say, large printed texts)? If you need exact results, you need to define what is a sentence and what is not. Think about the following text: how many sentences are in it?
Louise jumped out of the ground floor window.
"Stop! Don't run away!", cried Andy. "I did not
want to eat your chocolate; you have to believe
me!" - and, after thinking for a moment, he
added: "If you come back, I'll buy you a new
one! Large one! With hazelnuts!".
BTW, even tadman's solution is not exact. It would give a count of five for the following single sentence:
The IP address of Mr. Sloopsteen's dishwasher is 192.168.101.108!

what is the elegant way to replace string without gsub?

I have some dynamic strings which contain an X character. X can appear in consecutive runs or scattered through the string. I want to replace those X with #.
For example, abXXcX12XX. I want ab#c#12#. That means multiple contiguous X have to be replaced by only one # and if only one X, then also by a single #.
I tried:
s = "aXX123Xc56XXX"
s.squeeze('X').gsub('X','#') # => "a#123#c56#"
Is there a more elegant or direct way to do the same operation?
I would use String#tr_s, as below:
Processes a copy of str as described under String#tr, then removes duplicate characters in regions that were affected by the translation.
s = "aXX123Xc56XXX"
s.tr_s('X','#') # => "a#123#c56#"
Not sure why you wouldn't use gsub here?
Regexes seem to work pretty well:
"aXX123Xc56XXX".gsub(/X+/, "#")
=> "a#123#c56#"
The reason this works is that /X+/ will match one or more of the X character, so multiple X in a row will generate only one match and be replaced by one #.
pry(main)> "aXX123Xc56XXX".gsub(/X+/, "#")
=> "a#123#c56#"
While both gsub and tr_s will accomplish the task, here's the compelling reason to use tr_s:
require 'fruity'
STRING = 'aXX123Xc56XXX' * 1000
compare do
  using_tr_s { STRING.tr_s('X', '#') }
  using_gsub { STRING.gsub(/X+/, '#') }
end
Which, after running on my laptop, results in:
Running each test 16 times. Test will take about 1 second.
using_tr_s is faster than using_gsub by 5x ± 0.1
The regular expression engine has seen a lot of speedups, but it'll still be beaten for non-anchored lookups. If there was a way to tell it to start at the start or end of the string we'd see it speed up greatly, and, for some search/replace actions, I've seen it outrun everything else. The pattern used is critical; poorly written ones cripple the engine, so be careful what you use.
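If the fruity gem is not available, a rough comparison can be made with nothing but core Ruby. This is a sketch, not a rigorous benchmark: the helper name time_it and the run count are arbitrary, and the timings will vary by machine and Ruby version, so no expected numbers are given.

```ruby
sample = 'aXX123Xc56XXX' * 1000

# Time a block over a number of runs using the monotonic clock.
def time_it(runs = 100)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

tr_s_time = time_it { sample.tr_s('X', '#') }
gsub_time = time_it { sample.gsub(/X+/, '#') }

# Both approaches should produce the same string.
same_result = sample.tr_s('X', '#') == sample.gsub(/X+/, '#')

puts format('tr_s: %.4fs  gsub: %.4fs  same result: %s', tr_s_time, gsub_time, same_result)
```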

Good style for splitting lengthy expressions over lines

If the following is not the best style, what is for the equivalent expression?
if (some_really_long_expression__________ && \
some_other_really_long_expression)
The line continuation feels ugly. But I'm having a hard time finding a better alternative.
The parser doesn't need the backslashes in cases where the continuation is unambiguous. For example, using Ruby 2.0:
if true &&
   true &&
   true
  puts true
end
#=> true
The following are some more-or-less random thoughts about the question of line length from someone who just plays with Ruby. Nor have I had any training as a software engineer, so consider yourself forewarned.
I find the problem of long lines is often more the number of characters than the number of operations. The former can be reduced by (drum-roll) shortening variable names and method names. The question, of course, is whether the application of a verbosity filter (aka babbling, prattling or jabbering filter) will make the code harder to comprehend. How often have you seen something fairly close to the following (without \)?
total_cuteness_rating = cats_dogs_and_pigs.map {|animal| \
  cuteness_calculation(animal)}.reduce {|cuteness_accumulator, \
  cuteness_per_animal| cuteness_accumulator + cuteness_per_animal}
Compare that with:
tot_cuteness = pets.map {|a| cuteness(a)}.reduce(&:+)
Firstly, I see no benefit of long names for local variables within a block (and rarely for local variables in a method). Here, isn't it perfectly obvious what a refers to in the calculation of tot_cuteness? How good a memory do you need to remember what a is when it is confined to a single line of code?
Secondly, whenever possible use the short form for enumerables followed by a block (e.g., reduce(&:+)). This allows us to comprehend what's going on in microseconds, here as soon as our eyes latch onto the +. The same goes for .to_i, .to_s or .to_f. True, reduce {|tot, e| tot + e} isn't much longer, but we're forcing the reader's brain to decode two variables as well as the operator, when + is really all it needs.
Another way to shorten lines is to avoid long chains of operations. That comes at a cost, however. As far as I'm concerned, the longer the chain, the better. It reduces the need for temporary variables, reduces the number of lines of code and--possibly of greatest importance--allows us to read across a line, as most humans are accustomed, rather than down the page. The above line of code reads, "To calculate total cuteness, calculate each pet's cuteness rating, then sum those ratings". How could it be more clear?
When chains are particularly long, they can be written over multiple lines without using the line-continuation character \:
array.each {|e| blah, blah, ..., blah
.map {|a| blah, blah, ..., blah
.reduce {|i| blah, blah, ..., blah }
}
}
That's no less clear than separate statements. I think this is frequently done in Rails.
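As a concrete, runnable variant of the same idea, a chain can be split with each call on its own line starting with a dot, which Ruby has accepted since 1.9. The pets data and the threshold below are hypothetical, chosen only to make the example self-contained.

```ruby
# Hypothetical sample data.
pets = [
  { name: 'cat', cuteness: 9 },
  { name: 'dog', cuteness: 8 },
  { name: 'pig', cuteness: 6 }
]

# Each step of the chain sits on its own line; no backslashes are needed
# because a line starting with "." continues the previous expression.
tot_cuteness = pets
  .map { |a| a[:cuteness] }
  .select { |c| c > 7 }
  .reduce(0, :+)

puts tot_cuteness # prints 17
```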
What about the use of abbreviations? Which of the following names is most clear?
number_of_dogs
number_dogs
nbr_dogs
n_dogs
I would argue the first three are equally clear, and the last no less clear if the writer consistently prefixes variable names with n_ when that means "number of". Same for tot_, and so on. Enough.
One approach is to encapsulate those expressions inside meaningful methods. And you might be able to break it into multiple methods that you can later reuse.
Other than that, it is hard to suggest anything with the little information you gave. You might be able to get rid of the if statement using command objects or something like that, but I can't tell if it makes sense for your code because you didn't show it.
Ismael's answer works really well in Ruby (and possibly in other languages too) for two reasons:
Ruby has very low overhead for creating methods due to the lack of type definitions
It allows you to decouple such logic for reuse or future adaptability and testing
Another option I'll toss out is create logic equations and store the result in a variable e.g.
# These are short logic equations testing x, but you can apply the same idea to longer expressions
number_gt_5 = x > 5
number_lt_20 = x < 20
number_eq_11 = x == 11
if (number_gt_5 && number_lt_20 && !number_eq_11)
  # do some stuff
end
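The earlier suggestion of encapsulating the expressions in meaningful methods could look roughly like this for the same three conditions. The Reading struct and its method names are hypothetical, introduced only to give the predicates somewhere to live.

```ruby
# Hypothetical wrapper around the value being tested.
Reading = Struct.new(:value) do
  def in_range?
    value > 5 && value < 20
  end

  def excluded?
    value == 11
  end

  # The long condition now reads as a single, named question.
  def processable?
    in_range? && !excluded?
  end
end

[4, 11, 12].each do |x|
  puts "#{x}: #{Reading.new(x).processable?}"
end
# prints "4: false", "11: false", "12: true"
```

Each predicate can then be tested on its own, and the if statement at the call site shrinks to a single method call.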

Avoiding Spaces in Keywords

I am designing a language. I am being troubled by what to call "else if". My language uses indentation for blocks, so I need a keyword for "else if".
Python uses "elif" (meh...) and Ruby uses "elsif" (yuck!). Personally, I hate to use abbreviations, so I don't want to use either of these. Instead, I am thinking of just using "else if", where an arbitrary number of spaces can appear between "else" and "if".
I've noticed that this doesn't occur very often in programming languages. C# has "yield return" as a keyword, but that's the only example I can think of.
Is there an implementation concern behind this? I've created my lex file and it accepts the keyword with no issues. I am worried there is something I haven't thought of.
As long as you don't allow inline comments/newlines, there's nothing wrong with multi-word keywords. The only thing is that else if might be confusing for your language users, tempting them to write else while or else for. You'll have a hard time explaining to them that your else if is a keyword and not two statements following each other.
Out of curiosity, why bother having an "else if" keyword? You can just have an "else" keyword and the next thing is an expression... an "if" expression.
<expression> := IF | somethingelse
IF <expression> THEN <expression> ELSE <expression>
The idea being that the if in "else if" is just the start of the expression after the "else" keyword.
A sample of what I mean, per the comments
if (x == 1)
  return 5
else return 4
if (x == 1)
  return 5
else if (x == 2)
  return 6
else
  return 7
The concept above is that the return 4 is no different from the if and its "arguments". As long as you allow the start of the next command on the same line (even if it's not considered good form), you can treat the if in else if as just the start of the next expression.
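The idea can be illustrated with a toy recursive-descent parser. This is a sketch over a made-up, whitespace-separated grammar (expr := "if" ATOM "then" expr "else" expr | ATOM), not the asker's language; parse_expr and expect are hypothetical names. There is no "else if" rule anywhere: it falls out of parsing an ordinary expression after else.

```ruby
# Consume one token and fail loudly if it is not the expected keyword.
def expect(tokens, word)
  actual = tokens.shift
  raise ArgumentError, "expected #{word}, got #{actual.inspect}" unless actual == word
end

# expr := "if" ATOM "then" expr "else" expr | ATOM
def parse_expr(tokens)
  if tokens.first == 'if'
    tokens.shift                      # consume "if"
    condition = tokens.shift
    expect(tokens, 'then')
    consequent = parse_expr(tokens)
    expect(tokens, 'else')
    alternative = parse_expr(tokens)  # may itself start with "if"
    [:if, condition, consequent, alternative]
  else
    tokens.shift                      # a plain atom
  end
end

tree = parse_expr('if a then 5 else if b then 6 else 7'.split)
# tree is [:if, "a", "5", [:if, "b", "6", "7"]]
```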

Why should you avoid the then keyword in Ruby?

It's mentioned in several Ruby style guides that you should "Never use then." Personally, I think the "then" keyword allows you to make code denser, which tends to be harder to read. Is there any other justification for this recommendation?
I almost never use the then keyword. However, there is one case where I believe it greatly improves readability. Consider the following multi conditional if statements.
Example A
if customer.jobs.present? && customer.jobs.last.date.present? && (Date.today - customer.jobs.last.date) <= 90
  puts 'Customer had a job recently'
end
Line length too long. Hard to read.
Example B
if customer.jobs.present? &&
    customer.jobs.last.date.present? &&
    (Date.today - customer.jobs.last.date) <= 90
  puts 'Customer had a job recently'
end
Where do the conditions end and where does the inner code begin? According to most Ruby style guides, you are to have one extra level of indentation for multi-line conditionals, but I still don't find it all that easy to read.
Example C
if customer.jobs.present? &&
    customer.jobs.last.date.present? &&
    (Date.today - customer.jobs.last.date) <= 90
then
  puts 'Customer had a job recently'
end
To me, Example C is by far the most clear. And it is the use of the then that does the trick.
If I remember correctly, then is just one of the delimiters to separate the condition from the true part (semicolon and newline being the others).
If you have an if statement that is a one-liner, you'd have to use one of the delimiters.
if (1==2) then puts "Math doesn't work" else puts "Math works!" end
For multi-line ifs, then is optional (newline works):
if (1==2)
  puts "Math doesn't work"
else
  puts "Math works!"
end
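The three delimiters mentioned above can be checked side by side. Since if is an expression in Ruby, each form below returns a value that can be assigned; the variable names are only for illustration.

```ruby
# "then" separates the condition from the body on a single line.
with_then = if 1 == 2 then "Math doesn't work" else "Math works!" end

# A semicolon does the same job.
with_semicolon = if 1 == 2; "Math doesn't work" else "Math works!" end

# So does a newline, in which case "then" is optional.
with_newline =
  if 1 == 2
    "Math doesn't work"
  else
    "Math works!"
  end

puts [with_then, with_semicolon, with_newline].uniq.inspect # prints ["Math works!"]
```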
Could you post a link to one of the style-guides that you mention...
I think "never use then" is wrong. Using too many non-alphabetic characters can make code as difficult to read as Perl or APL. Using a natural-language word often makes the programmer more comfortable. It depends on the balance between readability and compactness of the code. The ternary operator is occasionally convenient, but is ugly if misused.
