Matching repeated pattern in string - ruby

I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers, each with its possible a, b, c, or d suffix attached.
In the meantime, I've come up with this:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regexes if possible, and I'd prefer not to use Ruby's scan but to do it all in one regex.

You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
.+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
(?:…)* Find zero or more of what's in here, but don't capture them
\d+ One or more digits (0–9)
[a-z]* Zero or more lowercase letters
[,\s]+ One or more commas and/or whitespace characters
\d+ Followed by one or more digits
[a-z]* And zero or more lowercase letters
However, if you want to break the number up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The whole regex matches the whole string, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
Why do you prefer not to use scan? This is what it is made for.

If you want your 3rd example to work, you need to have the [a-d] change to include the e in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. Using the examples you gave I did some testing using Rubular.
Using some more groupings, you can put the repetition on those last few conditions (which seem to be pretty tricky). This way, the spacing and comma at the end get caught in the repetition after the initial space is consumed.

The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)
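A minimal sketch of that single-instance approach, assuming the sample data from the question. Each match is either a street name or one house number, and the block attaches every number to the most recently seen name. The pattern and the variable names are illustrative assumptions, not something the answers above prescribe.

```ruby
# Sample input taken from the question (three of the six lines).
txt = "Sokolov 19, 20, 23 ,25\nHertzl 80a,82b,84e,90\nAhad Ha'am 9 13 29"

streets = []
# Group 1: a street name (no digits or commas, ending in a non-space character).
# Group 2: a single house number with an optional one-letter suffix.
txt.scan(/([^\d,\n]*[^\d,\s])|(\d+[a-z]?)/) do |name, number|
  if name
    streets << [name, []]        # a name starts a new entry
  elsif streets.any?
    streets.last[1] << number    # a number belongs to the latest name
  end
end

p streets
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=>  ["Hertzl", ["80a", "82b", "84e", "90"]],
#=>  ["Ahad Ha'am", ["9", "13", "29"]]]
```

The association between names and numbers lives in the block, not in the regex, which is why this is still "scan plus a little code" rather than a single pattern.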

I want a regex that will find and match....
Do the street names also contain digits (0-9), other characters beside an apostrophe?
Are the street numbers based off arbitrary data? Is it always just an optional a, b, c, or d?
Are you needing a minimum and maximum limitation of string length?
Here are some possible options:
If you are unsure about what the street name contains, but know your street number pattern will be numbers with an optional letter, commas or spaces.
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If the street names contain only letters with optional apostrophes, and the street numbers contain numbers with an optional letter, separated by commas or spaces:
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If your street name and street number pattern are always consistent, you could simply use:
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
See working demo
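As a quick check, here is a sketch that runs the first pattern from the options above against some of the question's sample lines. The variable names `lines` and `parsed` are assumptions made for the illustration.

```ruby
# First option from the list above: lazy street name, then the numbers.
re = /^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/

lines = [
  "Sokolov 19, 20, 23 ,25",
  "Aba Hillel Silver 2,3,5,6,",
  "Weizman 8",
  "Ahad Ha'am 9 13 29"
]

# captures returns [street_name, numbers_string] for each matching line.
parsed = lines.map { |line| line.match(re)&.captures }
p parsed
#=> [["Sokolov", "19, 20, 23 ,25"],
#=>  ["Aba Hillel Silver", "2,3,5,6"],
#=>  ["Weizman", "8"],
#=>  ["Ahad Ha'am", "9 13 29"]]
```

One caveat worth knowing: this pattern only allows a letter suffix when a separator and another number follow it, so a line that ends in a suffixed number, such as "Hertzl 80a", does not match at all.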

Related

Match joined sentences by period with Regex

Using Ruby. Here's a sample text:
The ride costs E£4. It's worth having a torch to illuminate badly lit
areas.Most tombs described here are usually open to visitors. They are
listed in the order that they are found when entering the site. The
best source of information about the tombs, their decoration and
history is the Theban Mapping Project
(www.thebanmappingproject.com).Tomb of Ramses VII (KV 1) Near the main
entrance is the small, unfinished tomb of Ramses VII (1136-1129 BC).
Only 44.3m long - short for a royal tomb because of Ramses' sudden
death - it consists of a corridor, a burial chamber and an unfinished
third chamber.
I tried the following, but it matches together with the next capital letter:
/\.[A-Z]/ #=> matches .T instead of .
I want to:
match the period . in .Tomb only - any dot that is followed by a capital letter,
not match .3 in 44.3m,
not match .t or .c in www.thebanmappingproject.com.
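Have you tried a lookahead?
/\.(?=[A-Z])/
It matches any dot that is followed by a capital letter, without including the letter in the match. Ruby regex literals have no g modifier; use String#scan or String#gsub to process every occurrence.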
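A short sketch of what the lookahead buys you, assuming the goal is to repair the joined sentences: because the capital letter is not part of the match, gsub can replace just the dot. The shortened `text` below is a made-up excerpt in the spirit of the question's sample.

```ruby
text = "badly lit areas.Most tombs are open (www.thebanmappingproject.com).Tomb of Ramses VII is 44.3m long."

# Only dots immediately followed by an uppercase ASCII letter are replaced.
fixed = text.gsub(/\.(?=[A-Z])/, '. ')
puts fixed
# badly lit areas. Most tombs are open (www.thebanmappingproject.com). Tomb of Ramses VII is 44.3m long.
```

The dots in 44.3m and in the URL are left alone because they are followed by a digit or a lowercase letter.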
If text is your string,
text.scan(/(\.)[A-Z]/).flatten
#=> [".", "."]
returns what you've asked for, but is that really what you want? It might be preferable to write
text.scan(/\.[A-Z]/)
#=> [".M", ".T"]
or
text.gsub(/\.(?=[A-Z])/).with_object([]) { |_,a| a << Regexp.last_match.offset(0).first }
#=> [75, 342]
text[75, 20]
#=> ".Most tombs describe"
text[342, 20]
#=> ".Tomb of Ramses VII "
(20 is arbitrary).
The use of String#gsub here is interesting. I've used gsub because, without a block, it returns an enumerator, which I need to chain with Enumerator#with_object. The value returned by gsub is in fact discarded. Since String#scan without a block does not return an enumerator, to use it I'd have to write:
a = []
text.scan(/\.(?=[A-Z])/) { a << Regexp.last_match.offset(0).first }
a #=> [75, 342]
which would hardly be the end of the world.
You were very close. You just need parentheses to capture the dot. Ruby regex literals have no global modifier g, so use String#scan to match every dot followed by a capital letter, not just the first:
text.scan(/(\.)[A-Z]/)

How to use match in Ruby?

I'm trying to get the uppercase words from a text. How can I use .match() for this?
Example
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
and I need something like:
r = /[A-Z]/
puts r.match(text)
I have never used match, and I need a method that gets all uppercase words (acronyms).
If you only want acronyms, you can use something like:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
text.scan(/\b[A-Z]+\b/)
# => ["PS"]
It's important to match entire words, which is where \b helps, as it marks word boundaries.
The problem is when your text contains single, stand-alone capital letters:
text = "Pediatric stroke (PS) I U.S.A"
text.scan(/\b[A-Z]+\b/)
# => ["PS", "I", "U", "S", "A"]
At that point we need a bit more intelligence and foreknowledge of the text content being searched. The question is, are single-letter acronyms valid? If not, then a minor modification will help:
text.scan(/\b[A-Z]{2,}\b/)
# => ["PS"]
{2,} is explained in the Regexp documentation, so read that for more information.
I only want acronyms of the type "(ACRONYM)", in this case PS
It's not easy to tell what you want by your description. An acronym is defined as:
An acronym is an abbreviation used as a word which is formed from the initial components in a phrase or a word. Usually these components are individual letters (as in NATO or laser) or parts of words or names (as in Benelux).
according to Wikipedia. By that definition, lowercase, all caps and mixed case can be valid.
If you mean you only want all-caps text within parentheses, then you can easily modify the regex to honor that, but you'll fail on other acronyms you could encounter, either by missing ones you want or by capturing others you want to ignore.
text = "(PS) (CT/CAT scan)"
text.scan(/\([A-Z]+\)/) # => ["(PS)"]
text.scan(/\([A-Z]+\)/).map{ |s| s[1..-2] } # => ["PS"]
text.scan(/\(([A-Z]+)\)/) # => [["PS"]]
text.scan(/\(([A-Z]+)\)/).flatten # => ["PS"]
are various ways to grab the text, but this only opens a new can of worms when you look at "List of medical abbreviations" and "Medical Acronyms / Abbreviations".
Typically I'd have a table of the ones I'll accept, use a simple pattern to capture anything that looks like something I'd want, check to see if it's in the table then keep it or reject it. How to do that is for you to figure out as it's a completely different question and doesn't belong in this one.
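For what it's worth, the table-lookup idea could be sketched roughly as follows. The `KNOWN_ACRONYMS` set, the `accepted_acronyms` helper, and the sample sentence are all hypothetical; a real table would come from a curated source.

```ruby
require 'set'

# Hypothetical table of acronyms we are willing to accept.
KNOWN_ACRONYMS = Set['PS', 'CT', 'CAT', 'MRI'].freeze

# Capture anything that looks like an acronym, then keep only known ones.
def accepted_acronyms(text)
  candidates = text.scan(/\b[A-Z]{2,}\b/)
  candidates.select { |candidate| KNOWN_ACRONYMS.include?(candidate) }.uniq
end

sample = "Pediatric stroke (PS) was confirmed by CT/CAT scan; see the USA report (NB)"
p accepted_acronyms(sample)
#=> ["PS", "CT", "CAT"]
```

The loose pattern deliberately over-captures (USA and NB are candidates too), and the table decides what survives.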
Wrong function for the job. Use String#scan.
To get all words that start with uppercase, use String#scan with \b\p{Lu}\w*\b:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
puts text.scan(/\b\p{Lu}\w*\b/).flatten
See demo
String#match will only get you the first match, while String#scan will return all matches.
The regex \b\p{Lu}\w*\b matches:
\b - word boundary
\p{Lu} - an uppercase Unicode letter
\w* - 0 or more alphanumeric characters
\b - a trailing word boundary
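To make the difference between the two methods concrete, here is a small sketch using a shortened version of the question's text (the variable names are illustrative):

```ruby
text = "Pediatric stroke (PS) is a relatively rare disease"
re = /\b\p{Lu}\w*\b/

# match stops at the first hit and returns a MatchData (or nil).
first = text.match(re)
p first[0]
#=> "Pediatric"

# scan walks the whole string and returns every hit as an array.
all = text.scan(re)
p all
#=> ["Pediatric", "PS"]
```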
To only match linguistic words (made of letters) you can use
puts text.scan(/\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/).flatten
See another demo
Here, \p{Lu}\p{M}*+ matches any Unicode uppercase letter (even a decomposed one, as \p{M} matches the combining diacritics that follow the base letter) and (?>\p{L}\p{M}*+)* matches 0 or more letters.
To only get words in ALLCAPS, use
puts text.scan(/\b(?>\p{Lu}\p{M}*+)+\b/).flatten
See the 3rd demo
Yes, you can use String#match for this. It may not be the best way, but you didn't ask if it was. You'd have to do something like this:
text.split.map { |s| s.match(/[A-Z]\w*/) }.compact.map { |md| md[0] }
#=> ["Pediatric", "PS"]
If you knew in advance that text contained two words beginning with a capital letter, you could write:
text.match(/([A-Z]\w*).*?([A-Z]\w*)/)
[$1,$2]
#=> ["Pediatric", "PS"]
Note that using a regex is not your only option:
text.delete('.,!?()[]{}').split.select { |str| ('A'..'Z').cover?(str[0]) }
#=> ["Pediatric", "PS"]

In ruby, how do I use string.scan(/regex/) method for numbers from 1 to 12?

That's what I am doing:
c.scan(/[1-9]|1[0-2]/)
For some reason, it returns only numbers from 1 to 9, ignoring the second part. I tried experimenting a little bit, it seems that the method will search for 10-12 only if 1 is excluded from [1-9] part, e.g., c.scan(/[2-9]|1[0-2]/) will do. What is the reason?
P.S. I know that this method lacks lookbehinds and will search for numbers and "part of numbers" as well
Change the order of your patterns and add word boundaries if necessary.
c.scan(/\b(?:1[0-2]|[1-9])\b/)
The pattern before | is tried first, so here 10 through 12 are matched before the single-digit alternative gets a chance; the pattern after | then matches the remaining numbers from 1 to 9. Without anchoring, this would also match the 9 in 59, so I suggest putting the alternation inside a capturing or non-capturing group and adding the word boundary \b (which matches between a word character and a non-word character) before and after that group.
DEMO
| matches left to right, and the first part of the right side (1) is always matched by the left side. Reverse them:
c.scan(/1[0-2]|[1-9]/)
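The effect of the ordering is easiest to see side by side. A small sketch, with a made-up input string `c`:

```ruby
c = '3 10 12 7 59'

# Single-digit alternative first: "10" and "12" are eaten one digit at a time.
naive = c.scan(/[1-9]|1[0-2]/)
p naive
#=> ["3", "1", "1", "2", "7", "5", "9"]

# Two-digit alternative first: 10 and 12 survive, but 59 still leaks digits.
reordered = c.scan(/1[0-2]|[1-9]/)
p reordered
#=> ["3", "10", "12", "7", "5", "9"]

# Word boundaries reject numbers that are merely part of a longer number.
bounded = c.scan(/\b(?:1[0-2]|[1-9])\b/)
p bounded
#=> ["3", "10", "12", "7"]
```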
Here's another way you might consider extracting numbers between 1 and 12 (assuming that's what you want to do):
c = '14 0 11x 15 003 y12'
c.scan(/\d+/).map(&:to_i).select { |n| (1..12).cover?(n) }
#=> [11, 3, 12]
I've returned an array of integers, rather than strings, thinking that probably would be more useful, but if you want strings:
c.scan(/\d+/).map { |s| s.to_i.to_s }
.select { |s| ['10', '11', '12', *'1'..'9'].include?(s) }
#=> ["11", "3", "12"]
I see several advantages to this approach, versus using a single regex:
it's easy to understand;
the regex is simple;
it's easy to modify if the permissible values change; and
it can be broken into three pieces to facilitate testing.
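One way the three pieces might be split out for testing is sketched below. The method names `digit_runs`, `to_integers`, and `within` are hypothetical, chosen only for this illustration.

```ruby
# Step 1: pull out every run of digits as a string.
def digit_runs(str)
  str.scan(/\d+/)
end

# Step 2: convert the digit strings to integers (leading zeros disappear).
def to_integers(strings)
  strings.map(&:to_i)
end

# Step 3: keep only the values inside the permitted range.
def within(numbers, range)
  numbers.select { |n| range.cover?(n) }
end

c = '14 0 11x 15 003 y12'
result = within(to_integers(digit_runs(c)), 1..12)
p result
#=> [11, 3, 12]
```

Each step can now be exercised on its own, and changing the permissible values only touches the range passed to the last step.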

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts "result = #{a.to_a.inspect}"
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it matches, but the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
So as an other example, +3c+2p-c-4p, I would expect should create:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, most meta-characters lose their special meaning inside a character class, so you only need to escape very few characters (such as the closing square bracket ], the backslash \, and the dash -, unless you put the dash at the start or end of the class). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
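A small sketch that makes the fixed group count visible, using the reduced regex from above on inputs of different lengths (variable names are illustrative):

```ruby
re = /^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$/

single = "C".match(re).captures
short  = "+2P-C".match(re).captures
long   = "+2P-C-3P".match(re).captures

p single  #=> ["", "C", nil, nil]
p short   #=> ["+2", "P", "-", "C"]
p long    #=> ["+2", "P", "-3", "P"]
```

There are always exactly four captures. In the longest input the repeated group ran twice, and the middle term "-C" was overwritten by "-3P".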
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As Ruby 1.8 doesn't have look-behind (and later versions don't allow unbounded repetition inside one), I can't check the start of the string, so zerhyju+2P-C-3P is accepted (but only +2P-C-3P is captured), whereas +2P-C-3Pzertyuio is rejected.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).
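The two-regex approach could look roughly like this. The constant names `VALID` and `TERM` and the helper `parse_terms` are assumptions for the sketch; the validation pattern uses + rather than * so that an empty string is rejected, and the kind letter is upcased to match the output the question expects.

```ruby
VALID = /\A(?:[+-]?\d?[CP])+\z/i   # the whole string must be a run of terms
TERM  = /([+-]?\d?)([CP])/i        # one term: optional sign/digit, then C or P

# Returns an array of [coefficient, kind] pairs, or nil for invalid input.
def parse_terms(field)
  return nil unless field.match?(VALID)
  field.scan(TERM).map { |coefficient, kind| [coefficient, kind.upcase] }
end

p parse_terms("+2P-c-3p")
#=> [["+2", "P"], ["-", "C"], ["-3", "P"]]
p parse_terms("+3c+2p-c-4p")
#=> [["+3", "C"], ["+2", "P"], ["-", "C"], ["-4", "P"]]
p parse_terms("zerhyju+2P-C-3P")
#=> nil
```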

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
Scan groups all matches of the regexp into an array. The .{5} matches any 5 characters. If there are characters left at the end of the string, they will be matched by the .+. Join the array with your string
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left over.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Here is a solution that is adapted from the answer to a recent question:
class String
def in_groups_of(n, sep = ' ')
chars.each_slice(n).map(&:join).join(sep)
end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
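The parametric version can be wrapped in a small helper, sketched here with a hypothetical name `insert_every`. Applied to the string from the question, it reproduces the expected outcome, because the space is glued to the following letter and not counted on its own.

```ruby
# Inserts sep after every n non-whitespace characters.
# Whitespace sticks to the next non-whitespace character; trailing
# whitespace sticks to the last one; whitespace-only input is unchanged.
def insert_every(str, n, sep)
  str.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(n).map(&:join).join(sep)
end

puts insert_every('HelloWorld-Hello guys', 5, '<wbr>')
# Hello<wbr>World<wbr>-Hell<wbr>o guys
```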

Resources