Can anybody help me write a regular expression that finds all instances of the following in a long string:
type="array" count="x" total="y"
where x and y could be any numbers from 1 to 100.
The language is Ruby.
First, since we'll use the regex for a number twice, we'll save it as its own variable. Note that the number regex is comprised of three separate pieces: one-digit numbers, two-digit numbers, and three-digit numbers. This is a good rule of thumb to use when trying to make a regex to match a range of numbers. It's easy to get it wrong otherwise (allowing strings like "07").
Once you have the number regex, the rest is easy.
number = /[1-9]|[1-9][0-9]|100/
regex = /type="array" count="#{number}" total="#{number}"/
string.scan(regex)
This will return an array of matches.
long_string.scan(/type="array" count="(?:[1-9]\d?|100)" total="(?:[1-9]\d?|100)"/)
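As a quick check of both answers, here is a minimal sketch run against a made-up sample string. The constant names and the sample markup are assumptions for illustration only. It shows that the interpolated and the compact forms find the same matches and reject values such as "07" and "101".

```ruby
# Hypothetical names; the sample data is invented for this sketch.
NUMBER  = /[1-9]|[1-9][0-9]|100/
ATTRS   = /type="array" count="#{NUMBER}" total="#{NUMBER}"/
COMPACT = /type="array" count="(?:[1-9]\d?|100)" total="(?:[1-9]\d?|100)"/

sample = <<~XML
  <a type="array" count="7" total="100">
  <b type="array" count="07" total="5">
  <c type="array" count="42" total="101">
  <d type="array" count="100" total="1">
XML

# Interpolating a Regexp wraps it in a non-capturing group,
# so scan returns whole matches rather than captured groups.
matches = sample.scan(ATTRS)
compact_matches = sample.scan(COMPACT)
puts matches
```

Only the first and last lines of the sample match, because the closing quote right after each number forces the whole attribute value to fall in the 1 to 100 range.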
Related
I have the following RSpec output:
30 examples, 15 failures
I would like to subtract the second number from the first. I have this code:
def capture_passing_score(output)
captures = output.match(/^(?<total>\d+)\s*examples,\s*(?<failed>\d+)\s*failures$/)
captures[:total].to_i - captures[:failed].to_i
end
I am wondering if there is a way to do the calculation within a regular expression. Ideally, I'd avoid the second step in my code, and subtract the numbers within a regex. Performing mathematical operations may not be possible with Ruby's (or any) regex engine, but I couldn't find an answer either way. Is this possible?
Nope.
By every definition I have ever seen, Regular Expressions are about text processing. It is character based pattern matching. Numbers are a class of textual characters in Regex and do not represent their numerical values. While syntactic sugar may mask what is actually being done, you still need to convert the text to a numeric value to perform the subtraction.
WikiPedia
RubyDoc
If you know the format is going to remain consistent, you could do something like this:
output.scan(/\d+/).map(&:to_i).inject(:-)
It's not doing the subtraction via regex, but it does make it more concise.
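For completeness, a minimal sketch of the two-step approach with a guard for output that does not contain a summary line. The names SUMMARY_LINE and passing_count are hypothetical, and accepting the singular forms "example" and "failure" is an assumption about RSpec's output, not something stated in the question.

```ruby
# Hypothetical helper: the regex only extracts text; the arithmetic is done in Ruby.
SUMMARY_LINE = /^(?<total>\d+)\s*examples?,\s*(?<failed>\d+)\s*failures?$/

def passing_count(output)
  captures = SUMMARY_LINE.match(output)
  return nil unless captures # no summary line found

  captures[:total].to_i - captures[:failed].to_i
end

puts passing_count("Finished in 0.2 seconds\n30 examples, 15 failures\n")
```

Returning nil instead of raising NoMethodError on a failed match is a design choice for the sketch; raising an explicit error would work just as well.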
Problem: find all words that follow a pattern (independently from the actual symbols used to define the pattern).
Almost identical to what this site does: http://design215.com/toolbox/wordpattern.php
Enter patterns like: ABCCDE
This will find words like "bloody",
"kitten", and "valley". The above pattern will NOT find words like
"fennel" or "hippie" because that would require the pattern to be
ABCCBE.
Please note: I need a version of that algorithm that does find words like "fennel" or "hippie" even with an ABCCDE pattern.
To complicate things further, there is the possibility to add known characters anywhere in the searching pattern, for example: cBBX (where c is the known character) will yield cees, coof, cook, cool ...
What I've done so far: I found this answer (Pattern matching for strings independent from symbols) that solves my problem almost perfectly, but if I assign an integer to every word I need to compare, I will encounter two problems.
The first is the number of unique digits I can use. For example, if the pattern is XYZABCDEFG, the equivalent digit pattern will be 1 2 3 4 5 6 7 8 9 and then? 10? Consider that I would use the digit 0 to indicate a known character (for example, aBe --> 010 --> 10). Using hexadecimal digits will move the problem further, but will not solve it.
The second problem is the maximum length of the pattern: a long in Java holds at most 19 decimal digits, and I need no restriction on my patterns (although I don't think there exists a word with 20 different characters).
To solve those problems, I could store each digit of the pattern in an array, but then it becomes an array-to-array comparison instead of an integer comparison, thus taking a lot more time to compute.
As a side note: depending on the algorithm used, what data structure would be best suited for storing the dictionary? I was thinking about using a hash map, converting each word into its digit-pattern equivalent (assuming no known character) and using this number as a hash (of course, there would be a lot of collisions). That way, searching would first match the numeric pattern and then scan the results to find all the words that have the known characters in the right place (if present in the original search pattern).
Also, the dictionary is not static: words can be added and deleted.
EDIT:
This answer (https://stackoverflow.com/a/44604329/4452829) works fairly well and it's fast (testing for equal lengths before matching the patterns). The only problem is that I need a version of that algorithm that finds words like "fennel" or "hippie" even with an ABCCDE pattern.
I've already implemented a way to check for known characters.
EDIT 2:
Ok, by checking whether each character in the pattern is greater than or equal to the corresponding character in the current word (normalized as a temporary pattern) I am almost done: it correctly matches the search pattern ABCA with the word ABBA and it correctly ignores the word ABAC. The last remaining problem is that if (for example) the pattern is ABBA it will match the word ABAA, and that's not correct.
EDIT 3:
Meh, not pretty but it seems to be working (I'm using Python because it's fast to code with it). Also, the search pattern can be any sequence of symbols, using lowercase letters as fixed characters and everything else as wildcards; there is also no need to convert each word in an abstract pattern.
def match(pattern, fixed_chars, word):
    d = dict()
    if len(pattern) != len(word):
        return False
    if check_fixed_char(word, fixed_chars) is False:
        return False
    for i in range(0, len(pattern)):
        cp = pattern[i]
        cw = word[i]
        if cp in d:
            if d[cp] != cw:
                return False
        else:
            d[cp] = cw
        if cp > cw:
            return False
    return True
A long time ago I wrote a program for solving cryptograms which was based on the same concept (generating word patterns such that "kitten" and "valley" both map to "abccde").
My technique did involve generating a sort of index of words by pattern.
The core abstraction function looks like:
#!/usr/bin/env python
import string

def abstract(word):
    '''find abstract word pattern
    dog or cat -> abc, book or feel -> abbc
    '''
    al = list(string.ascii_lowercase)
    d = dict()  # maps each distinct character to its pattern letter
    for i in word:
        if i not in d:
            d[i] = al.pop(0)
    return ''.join([d[i] for i in word])
From there building our index is pretty easy. Assume we have a file like /usr/share/dict/words (commonly found on Unix-like systems including MacOS X and Linux):
#!/usr/bin/python
words_by_pattern = dict()
words = set()
with open('/usr/share/dict/words') as f:
    for each in f:
        words.add(each.strip().lower())
for each in sorted(words):
    pattern = abstract(each)
    if pattern not in words_by_pattern:
        words_by_pattern[pattern] = list()
    words_by_pattern[pattern].append(each)
... that takes less than two seconds on my laptop for about 234,000 "words" (Although you might want to use a more refined or constrained word list for your application).
Another interesting trick at this point is to find the patterns which are most unique (returns the fewest possible words). We can create a histogram of patterns thus:
histogram = [(len(words_by_pattern[x]),x) for x in words_by_pattern.keys()]
histogram.sort()
I find that this gives me:
8077 abcdef
7882 abcdefg
6373 abcde
6074 abcdefgh
3835 abcd
3765 abcdefghi
1794 abcdefghij
1175 abc
1159 abccde
925 abccdef
Note that abc, abcd, and abcde are all in the top ten. In other words the most common letter patterns for words include all of those with no repeats among 3 to 10 characters.
You can also look at the histogram of the histogram. In other words how many patterns only show one word: for example aabca only matches "eerie" and aabcb only matches "llama". There are over 48,000 patterns with only a single matching word and almost six thousand with just two words and so on.
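For readers following along in Ruby rather than Python, the same idea (abstract pattern, index by pattern, then a "histogram of the histogram") can be sketched as below. The method name mirrors the Python function; the variable names and the tiny word list are made up for illustration and stand in for a real dictionary file.

```ruby
# Map a word to its abstract letter pattern, e.g. "kitten" -> "abccde".
def abstract(word)
  letters = ("a".."z").each # enumerator handing out pattern letters in order
  mapping = Hash.new { |hash, char| hash[char] = letters.next }
  word.each_char.map { |char| mapping[char] }.join
end

# Invented sample data; a real run would read a word list instead.
words = %w[kitten valley bloody eerie llama dog cat]

# Index: pattern => list of words sharing that pattern.
words_by_pattern = words.group_by { |word| abstract(word) }

# Histogram of the histogram: how many patterns have exactly N words.
patterns_by_size = Hash.new(0)
words_by_pattern.each_value { |list| patterns_by_size[list.size] += 1 }

puts words_by_pattern.inspect
puts patterns_by_size.inspect
```

With this sample, "aabca" and "aabcb" each map to a single word, which is the kind of highly selective pattern the paragraph above describes.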
Note: I don't use digits; I use letters to create the pattern mappings.
I don't know if this helps with your project at all, but these are very simple snippets of code. (They're intentionally verbose.)
This can easily be achieved through using Regular Expressions.
For example, the below pattern matches any word that has ABCCDE pattern:
(?:([A-Za-z])(?!\1)([A-Za-z])(?!\1|\2)([A-Za-z])(?=\3)([A-Za-z])(?!\1|\2|\3)([A-Za-z])(?!\1|\2|\3|\5)([A-Za-z]))
And this one matches ABCCBE:
(?:([A-Za-z])(?!\1)([A-Za-z])(?!\1|\2)([A-Za-z])(?=\3)([A-Za-z])(?=\2)([A-Za-z])(?!\1|\2|\3|\5)([A-Za-z]))
To cover both of the above patterns, you can use:
(?:([A-Za-z])(?!\1)([A-Za-z])(?!\1|\2)([A-Za-z])(?=\3)([A-Za-z])(?(?=\2)|(?!\1|\2|\3))([A-Za-z])(?!\1|\2|\3|\5)([A-Za-z]))
Going this path, your challenge would be generating the above Regex pattern out of the alphabetic notation you used.
And please note that you may want to use the i Regex flag when using these if case-insensitivity is a requirement.
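One way to generate such a regex from the alphabetic notation is sketched below in Ruby. The method name pattern_to_regex is hypothetical, and the sketch makes a few assumptions: it handles only the strict interpretation (every new symbol must differ from all earlier ones), it anchors the pattern to the whole word, and it emits a plain backreference for a repeated symbol instead of the lookahead-plus-group form used above, so each distinct symbol gets exactly one group. It has only been considered for patterns with at most nine distinct symbols.

```ruby
# Build an anchored, case-insensitive Regexp from a pattern such as "ABCCDE".
def pattern_to_regex(pattern)
  group_of = {} # pattern symbol => capture group number
  source = +""
  pattern.each_char do |symbol|
    if group_of.key?(symbol)
      # Repeated symbol: must equal the letter captured earlier.
      source << "\\#{group_of[symbol]}"
    else
      # New symbol: must differ from every letter captured so far.
      unless group_of.empty?
        alternatives = group_of.values.map { |n| "\\#{n}" }.join("|")
        source << "(?!#{alternatives})"
      end
      group_of[symbol] = group_of.size + 1
      source << "([a-z])"
    end
  end
  Regexp.new("\\A#{source}\\z", Regexp::IGNORECASE)
end

regex = pattern_to_regex("ABCCDE")
%w[kitten valley bloody fennel hippie].each do |word|
  puts "#{word}: #{regex.match?(word)}"
end
```

Under the strict interpretation "fennel" and "hippie" are rejected; the looser matching the question asks for would mean dropping the negative lookaheads for the symbols that are allowed to coincide.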
For more Regex info, take a look at:
Look-around
Back-referencing
I just learned from a book about regular expressions in the Ruby language. I did Google it, but still got confused about {x} and {x,y}.
The book says:
{x}→Match x occurrences of the preceding character.
{x,y}→Match at least x occurrences and at most y occurrences.
Can anyone explain this better, or provide some examples?
Sure, look at these examples:
http://rubular.com/r/sARHv0vf72
http://rubular.com/r/730Zo6rIls
/a{4}/
is the short version for:
/aaaa/
It says: Match exactly 4 consecutive 'a' characters.
where
/a{2,4}/
says: Match at least 2, and at most 4 characters of 'a'.
it will match
/aa/
/aaa/
/aaaa/
and it won't match
/a/
/aaaaa/
/xxx/
Limiting Repetition is a good online tutorial for this.
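One detail worth checking in Ruby: the "won't match" cases above hold when the pattern is anchored to the whole string. Without anchors, /a{2,4}/ still finds a match inside a longer run of a's. A minimal sketch (variable names are just for illustration):

```ruby
# Anchored: the whole string must be 2 to 4 a's.
bounded = /\Aa{2,4}\z/
results = %w[a aa aaa aaaa aaaaa xxx].map { |s| [s, bounded.match?(s)] }.to_h
results.each { |string, matched| puts "#{string}: #{matched}" }

# Unanchored: the quantifier is greedy and takes the first four a's of "aaaaa".
partial = "aaaaa"[/a{2,4}/]
puts partial
```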
I highly recommend regexbuddy.com and very briefly, the regex below does what you refer to:
[0-9]{3}|\w{3}
The [ ] brackets define a character class, so [0-9] matches any single digit between 0 and 9. The { } with a 3 inside means match exactly three of those digits in a row. The | is an or (alternation). The \w is shorthand for any word character, and once again the {3} matches only runs of exactly 3.
If you go to RegexPal.com you can enter the code above and test it. I used the following data to test the expression:
909 steve kinzey
and the expression matched the 909, the 'ste', the 'kin' and the 'zey'. It did not match the 've' because it is only 2 word characters long and a word character does not span white space so it could not carry over to the second word.
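The same test can be reproduced in Ruby with String#scan, as a small sketch using the sample data from this answer:

```ruby
# Each match is either three digits or three word characters.
chunks = "909 steve kinzey".scan(/[0-9]{3}|\w{3}/)
puts chunks.inspect
```

The trailing "ve" of "steve" is left out because \w cannot cross the space to borrow a third character.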
Interval Expressions
GNU awk refers to these as "interval expressions" in the Regexp Operators section of its manual. It explains the expressions as follows:
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
The manual also includes these reference examples:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’, only.
wh{2,}y
Matches ‘whhy’ or ‘whhhy’, and so on.
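The three manual examples can be checked in Ruby as follows. This is a sketch with hypothetical variable names; the patterns are anchored with \A and \z so that "matches" means the whole string, which is how the examples read. Note that the interval applies only to the preceding h, not to "wh".

```ruby
exactly_three = /\Awh{3}y\z/   # {n}
three_to_five = /\Awh{3,5}y\z/ # {n,m}
two_or_more   = /\Awh{2,}y\z/  # {n,}

candidates = %w[why whhy whhhy whhhhy whhhhhy whhhhhhy]
candidates.each do |word|
  flags = [exactly_three, three_to_five, two_or_more].map { |re| re.match?(word) }
  puts "#{word}: #{flags.inspect}"
end
```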
See Also
Ruby's Regexp class.
Quantifiers section of Ruby's oniguruma engine.
Sorry if this is a duplicate. I thought I'd reword my question a little bit.
How could I use regex to evaluate a mathematical expression, without using the eval function?
Example expressions:
math1 = "1+1"
math2 = "3+2-1"
I would like it to work for a variable number of numbers in the expression like I showed in the example.
This is just a bad idea. Regexp is not a parser, nor an evaluator.
Use a grammar to describe your expressions. Parse it with a formal parser like the lovely ruby gem Treetop. Then evaluate the abstract syntax tree (AST) produced by the parser.
Gosh, Treetop's arithmetic example practically gives you the solution for free.
This is a little late, but I wrote a gem for evaluating arbitrary mathematical expressions (and it doesn't use eval internally): https://github.com/rubysolo/dentaku
For addition and subtraction, this should work
(?:(\d+)([-+]))+(\d+)
This means:
one or more digits, followed by exactly one plus or minus
the above can be repeated as many times as required (this is a non capturing group)
and then must end with one or more digits.
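Note that each individual number and sign is meant to be captured in groups 1..n. Be aware that most engines, Ruby included, keep only the last repetition of a repeated group, so in practice you may need to collect the tokens another way (for example with scan).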
So to evaluate, you could take captures 1 and 3, applying the sign from capture 2. Then apply the sign from capture 4 (if it exists) with the previous result and the number from capture 5 (which must exist if capture 4 exists) and so on...
So to evaluate, in pseudocode:
i = 1
result = capture(i)
loop while i <= (n-2) (where n is the capture count):
    If capture(i+1) == "-" // is subtraction
        result = result - capture(i+2)
    Else // is addition
        result = result + capture(i+2)
    End if
    i = i + 2
End while
This is only going to work for simple addition and subtraction like in the examples you provided, as it relies on left to right associativity. As others have suggested, you'll probably need to properly parse anything more complex, eg by building a tree of nodes that can then be evaluated in the correct (depth-first?) order.
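The loop described in this answer could be sketched in Ruby roughly as follows. The method name evaluate is hypothetical. Because Ruby keeps only the last repetition of a repeated capture group, this sketch validates the whole expression with one regex and then collects the tokens with scan instead of reading numbered captures. It supports unsigned integers with + and - only.

```ruby
# Evaluate expressions such as "3+2-1" strictly left to right.
def evaluate(expression)
  unless expression.match?(/\A\d+(?:[-+]\d+)*\z/)
    raise ArgumentError, "unsupported expression: #{expression.inspect}"
  end

  tokens = expression.scan(/\d+|[-+]/) # e.g. ["3", "+", "2", "-", "1"]
  result = tokens.first.to_i
  tokens.drop(1).each_slice(2) do |operator, number|
    result = operator == "-" ? result - number.to_i : result + number.to_i
  end
  result
end

puts evaluate("1+1")
puts evaluate("3+2-1")
```

Anything beyond this (precedence, parentheses, whitespace, negative literals) is outside what the sketch handles, which is where a real parser becomes the better tool.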
This is really messy…
math2 = "12+3-4"
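# Tokenize into [operator, number] pairs; the first number has no operator.
first, *rest = math2.scan(/(?<op>[+\-*\/])?(?<digits>\d+)/)
  .map { |(op, digits)| [op, digits.to_i] }
# Fold left to right so that "10-3-2" gives 5 rather than 9.
# Operator precedence is still ignored.
rest.inject(first.last) { |acc, (op, digits)| acc.send(op, digits) }
# => 11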
You should really consider a parser though.
Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word, that are actual english words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
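To tie the explanation together, here is a small sketch that wraps the one-liner in a method and uses it to group anagrams. The method name signature and the sample word list are assumptions for illustration; lowercasing is added because the answer notes the input should be lowercase.

```ruby
# Sort the letters, then replace any run of three or more identical
# letters with "<count><letter>", e.g. "iiii" -> "4i".
def signature(word)
  sorted = word.downcase.chars.sort.join
  sorted.gsub(/(.)\1{2,}/) { |run| "#{run.length}#{run[0]}" }
end

# Invented sample data: words with the same signature are anagrams.
words = %w[listen silent enlist google mississippi]
anagrams = words.group_by { |word| signature(word) }

puts signature("mississippi")
puts anagrams.inspect
```

Runs of two (like "pp") are left alone because "2p" would be no shorter than the original.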
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.