Perform subtraction within regular expression - ruby

I have the following RSpec output:
30 examples, 15 failures
I would like to subtract the second number from the first. I have this code:
def capture_passing_score(output)
  captures = output.match(/^(?<total>\d+)\s*examples,\s*(?<failed>\d+)\s*failures$/)
  captures[:total].to_i - captures[:failed].to_i
end
I am wondering if there is a way to do the calculation within a regular expression. Ideally, I'd avoid the second step in my code, and subtract the numbers within a regex. Performing mathematical operations may not be possible with Ruby's (or any) regex engine, but I couldn't find an answer either way. Is this possible?

Nope.
By every definition I have ever seen, regular expressions are about text processing: character-based pattern matching. Digits are a class of textual characters in a regex and do not represent their numerical values. While syntactic sugar may mask what is actually being done, you still need to convert the text to a numeric value to perform the subtraction.
WikiPedia
RubyDoc

If you know the format is going to remain consistent, you could do something like this:
output.scan(/\d+/).map(&:to_i).inject(:-)
It's not doing the subtraction via regex, but it does make it more concise.
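A minimal sketch of the two-step approach that also copes with output lacking a summary line. The method name passing_examples is made up for illustration; the regex is the one from the question.

```ruby
# Hypothetical helper: returns nil when no summary line is found.
def passing_examples(output)
  m = output.match(/^(?<total>\d+)\s*examples,\s*(?<failed>\d+)\s*failures$/)
  return nil unless m

  # The captures are Strings ("30", "15"), so convert before subtracting.
  m[:total].to_i - m[:failed].to_i
end

passing_examples("30 examples, 15 failures")  # => 15
passing_examples("no summary here")           # => nil
```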

Related

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regexp.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easy way of doing this.
This is in essence an algorithm compression problem. The simplest way to match a list of known strings is to use the Regexp.union factory method, but that just tries each string in turn; it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained (e.g. any example integer could always be replaced by \d+, any example space by \s+, etc.), then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+, then consume the match from the string and add \d+ to your regex. Then take the remainder of the string and look for the next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.
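The constrained approach described above could be sketched roughly as follows. The UNITS table, its ordering, and the fallback of escaping any unrecognised character literally are assumptions made for illustration, not part of the original answer.

```ruby
# Hypothetical table of "matchable units": [recogniser, regex source to emit].
# Order matters: the first recogniser that matches wins.
UNITS = [
  [/\A\d+/, '\d+'],
  [/\A\s+/, '\s+'],
  [/\A[A-Za-z]+/, '[A-Za-z]+']
].freeze

def find_pattern(str)
  parts = []
  rest = str
  until rest.empty?
    recogniser, emitted = UNITS.find { |rx, _| rx.match?(rest) }
    if recogniser
      parts << emitted
      rest = rest.sub(recogniser, '')    # consume the matched prefix
    else
      parts << Regexp.escape(rest[0])    # no unit matched: keep the character literally
      rest = rest[1..-1]
    end
  end
  Regexp.new("\\A#{parts.join}\\z")
end

find_pattern("(17 + 31)")  # => /\A\(\d+\s+\+\s+\d+\)\z/
```

The result is only as specific as the unit table allows, which is the limitation the answer points out.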

Ruby evaluate an expression with regex, no eval

Sorry if this is duplicated. I thought I'd reword my question a little bit.
How could I use regex to evaluate a mathematical expression? Without using the eval function.
Example expressions:
math1 = "1+1"
math2 = "3+2-1"
I would like it to work for a variable number of numbers in the expression like I showed in the example.
This is just a bad idea. Regexp is not a parser, nor an evaluator.
Use a grammar to describe your expressions. Parse it with a formal parser like the lovely ruby gem Treetop. Then evaluate the abstract syntax tree (AST) produced by the parser.
Gosh, Treetop's arithmetic example practically gives you the solution for free.
This is a little late, but I wrote a gem for evaluating arbitrary mathematical expressions (and it doesn't use eval internally): https://github.com/rubysolo/dentaku
For addition and subtraction, this should work
(?:(\d+)([-+]))+(\d+)
This means:
one or more digits, followed by exactly one plus or minus
the above can be repeated as many times as required (this is a non capturing group)
and then must end with one or more digits.
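Note that the intent is for each individual number and sign to be captured in groups 1..n. Be aware, though, that many engines (Ruby's included) keep only the last iteration of a repeated capturing group, so check what your engine actually returns.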
So to evaluate, you could take captures 1 and 3, applying the sign from capture 2. Then apply the sign from capture 4 (if it exists) with the previous result and the number from capture 5 (which must exist if capture 4 exists) and so on...
So to evaluate, in pseudocode:
i=1
result=capture(i)
loop while i <= (n-2) (where n is the capture count):
  If capture(i+1) == "-" // is subtraction
    result = result - capture(i+2)
  Else // is addition
    result = result + capture(i+2)
  End if
  i = i + 2
End while
This is only going to work for simple addition and subtraction like in the examples you provided, as it relies on left-to-right associativity. As others have suggested, you'll probably need to properly parse anything more complex, e.g. by building a tree of nodes that can then be evaluated in the correct (depth-first?) order.
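A rough Ruby rendering of the left-to-right loop above. In Ruby a repeated capturing group keeps only its last iteration, so this sketch uses the repeated pattern purely for validation and collects the tokens with String#scan instead. The method name evaluate_add_sub is invented for illustration.

```ruby
# Hypothetical helper: evaluates non-negative integers joined by + and - only.
def evaluate_add_sub(expr)
  unless expr.match?(/\A\d+(?:[-+]\d+)*\z/)
    raise ArgumentError, "unsupported expression: #{expr.inspect}"
  end

  # "3+2-1" -> ["3", "+", "2", "-", "1"]
  first, *rest = expr.scan(/[-+]|\d+/)

  # Walk the (operator, number) pairs left to right, as in the pseudocode.
  rest.each_slice(2).reduce(first.to_i) do |result, (op, number)|
    op == "-" ? result - number.to_i : result + number.to_i
  end
end

evaluate_add_sub("1+1")    # => 2
evaluate_add_sub("3+2-1")  # => 4
```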
This is really messy…
math2 = "12+3-4"
head, *tail = math2.scan(/(?<digits>\d+)(?<op>[\+\-\*\/])?/)
  .map { |(digits, op)| [digits.to_i, op] }
  .reverse
# Caution: the fold runs right to left, so "3-2-1" is evaluated as 3-(2-1).
tail.inject(head.first) { |sum, (digits, op)|
  op.nil? ? digits : digits.send(op, sum)
}
# => 11
You should really consider a parser though.

Ruby Regular expression too big / Multiple string match

I have 1,000,000 strings that I want to categorize. The way I do this is to bucket it if it contains a set of words or phrases. The set of words is about 10,000. Ideally I would be able to support regular expressions, but I am focused on making it run fast right now. Example phrases:
ford, porsche, mazda...
I really don't want to match each word against the strings one by one, so I decided to use regular expressions. Unfortunately, I am running into a regular expression issue:
Regexp.new("(a)"*253)
=> /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...
Regexp.new("(a)"*254)
RegexpError: regular expression too big: /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...
where a would be one of my words or phrases. Right now, I am planning on running 10,000 / 253 matches. I read that the length of the regex heavily impacts performance, but my regex match is really simple and the regexp is created very quickly. I would like to get around the limitation somehow, or use a better solution if anyone has any ideas. Thanks.
You might consider other mechanisms for recognizing 10k words.
Trie: Sometimes called a prefix tree, it is often used by spell checkers for doing word lookups. See Trie on Wikipedia.
DFA (deterministic finite automaton): A DFA is often created by the lexer in a compiler for recognizing the tokens of the language. A DFA runs very quickly. Simple regexes are often compiled into DFAs. See DFA on Wikipedia.
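To make the trie suggestion concrete, here is a minimal sketch using nested hashes. The class name PhraseTrie and its methods are illustrative, and it matches phrases as plain substrings (so "ford" would also be found inside "afford"); word-boundary handling is left out.

```ruby
# Minimal character trie built from nested Hashes; names are illustrative.
class PhraseTrie
  def initialize(phrases)
    @root = {}
    phrases.each do |phrase|
      node = @root
      phrase.each_char { |ch| node = (node[ch] ||= {}) }
      node[:end] = phrase   # marks that a complete phrase ends at this node
    end
  end

  # Returns every stored phrase that occurs somewhere in text.
  def matches(text)
    found = []
    chars = text.chars
    chars.each_index do |start|
      node = @root
      i = start
      # Follow the trie from this start position until no child matches.
      while i < chars.length && (node = node[chars[i]])
        found << node[:end] if node.key?(:end)
        i += 1
      end
    end
    found.uniq
  end
end

trie = PhraseTrie.new(%w[ford porsche mazda])
trie.matches("used porsche and ford for sale")  # => ["porsche", "ford"]
```

Each string is walked once per start position, so the cost depends on the text length and the longest phrase rather than on the number of phrases.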

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
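For what it's worth, here is a small sketch of how the signature might be used for the lookup the question describes. The signature method simply wraps the one-liner above (with a downcase added), and the word list is made up.

```ruby
# Hypothetical helper wrapping the one-liner: sort the letters, then
# run-length encode any run of three or more identical characters.
def signature(word)
  word.downcase.chars.sort.join.gsub(/(.)\1{2,}/) { |run| "#{run.length}#{run[0]}" }
end

# Made-up word list standing in for the database of words.
words = %w[listen silent enlist google mississippi]
by_signature = words.group_by { |w| signature(w) }

signature("mississippi")           # => "4impp4s"
by_signature[signature("tinsel")]  # => ["listen", "silent", "enlist"]
```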
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.

NSNumberFormatter: plusSign vs. positivePrefix

The NSNumberFormatter class has plusSign and minusSign properties, but also has positivePrefix and negativePrefix.
What is the difference between these two? If I specify both, which one comes first?
plusSign and minusSign are used for the mathematical addition and subtraction operators. positivePrefix, positiveSuffix, negativePrefix and negativeSuffix describe which characters or strings are used to display whether a certain numeric value is positive or negative.
To illustrate why they are different: most of the time, when a positive numeric value is displayed anywhere, you'll just see numbers, with no prefix or suffix. Negative numeric values have a minus in front of or behind them, or, in some styles of accounting, are just enclosed in brackets. Either way we'll still need a + and a - to express mathematical operations.
