Ruby regex split difficulties, close but not quite

I am having difficulty using a regex in Ruby to split a string on several delimiters. The delimiters are:
,
/
&
and
Each of these delimiters can have any amount of whitespace on either side, but each item can itself contain a space.
A good example that I've been testing against is the string 1, 2 /3 and 4 12
What I would like is something along the lines of "1, 2 /3 and 4 12".split(regex) => ["1", "2", "3", "4 12"]
The closest I've been able to get is /\s*,|\/|&|and \s*/ but this generates ["1", " 2 ", "3 ", "4 12"] instead of the desired result.
I realize this is very close and I could simply call strip on each item, but being so close and knowing it can be done is sort of driving me mad. Hopefully someone can help me keep the madness at bay.

/\s*,|\/|&|and \s*/
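This parses as /(\s*,)|\/|&|(and \s*)/. That is, the leading \s* applies only to the comma and the trailing \s* applies only to "and". You want:
/\s*(,|\/|&|and )\s*/
Or, to avoid capturing (with split, text matched by a capturing group is also returned in the result array, so this is the version to use here):
/\s*(?:,|\/|&|and )\s*/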
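As a quick check of the difference between the two forms, here is a minimal sketch using the sample string from the question. The variable names are only for illustration.

```ruby
input = "1, 2 /3 and 4 12"

# Capturing group: split also returns the text matched by the group.
with_capture = input.split(/\s*(,|\/|&|and )\s*/)

# Non-capturing group: only the items between delimiters are returned.
without_capture = input.split(/\s*(?:,|\/|&|and )\s*/)

p with_capture    #=> ["1", ",", "2", "/", "3", "and ", "4 12"]
p without_capture #=> ["1", "2", "3", "4 12"]
```

Note that "and " is matched anywhere it appears, so an item that itself ends in "and" followed by a space would also be split.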

Try .scan:
irb(main):030:0> "1, 2 /3 and 4 12".scan(/\d+(?:\s*\d+)*/)
=> ["1", "2", "3", "4 12"]

You can try:
(?:\s*)[,\/](?:\s*)|(?:\s*)and(?:\s*)
But as Nakilon suggested, you may have better luck with scan instead of split.

Related

Regular expression that only leaves the second number

I'm trying to make a regular expression that leaves only the second number.
{"1"=>"2", "3"=>"6"}
For example, that leaves only the numbers:
2, 6
I tried this:
(([0-9]+)[=>]*([0-9]+))
{"1"=>"2", "3"=>"6"}.values
will give you:
["2", "6"]
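If the data is actually a string rather than a Hash (an assumption, since the question does not say), a rough sketch with scan and a capture group could look like this:

```ruby
# Assumption: the input arrives as text, not as a Ruby Hash.
text = '{"1"=>"2", "3"=>"6"}'

# Capture only the digits that follow "=>"; scan returns one array per match,
# so flatten turns the result into a flat list of strings.
second_numbers = text.scan(/=>\s*"(\d+)"/).flatten

p second_numbers #=> ["2", "6"]
```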

How to create a regular expression to match all numbers without the numbers at the beginning and end of the string?

Example code:
test = '12asiudas8787hajshd986q756tgs87ta7d6-12js01'
test.scan(regexp)
As a result, I should get:
["8787", "986", "756", "87", "7", "6", "12"]
Like using a /\d+/ regexp, but without the numbers at the beginning and end of the string, in this case 12 and 01.
To match the numbers within the string use following regex.
Regex: (?<=[^\d])(\d+)(?=[^\d])
Explanation:
(?<=[^\d]) Ensures that the match is preceded by a non-digit character. Without this, the 12 at the beginning would be matched too, and we don't want that.
(\d+) Matches your number.
(?=[^\d]) Ensures that the last digit is followed by a non-digit character. Without this, the 01 at the end would be matched too.
P.S: Edited regex on Wiktor Stribiżew's advice
One can also use \D instead of [^\d]. I used [^\d] to make it clear.
test.scan(/(?<=\D)\d+(?=\D)/) # => ["8787", "986", "756", "87", "7", "6", "12"]
This should do:
test.scan(/(?<=[^\d])(\d+)(?=[^\d])/).flatten
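To see what the lookarounds change, and why the version with a capture group needs flatten, here is a small comparison on the string from the question (variable names are illustrative):

```ruby
test = '12asiudas8787hajshd986q756tgs87ta7d6-12js01'

# Plain \d+ also picks up the numbers at the very start and end.
plain = test.scan(/\d+/)

# Lookbehind and lookahead require a non-digit on both sides of the match.
inner = test.scan(/(?<=\D)\d+(?=\D)/)

# With a capture group, scan returns one array per match.
grouped = test.scan(/(?<=\D)(\d+)(?=\D)/)

p plain         #=> ["12", "8787", "986", "756", "87", "7", "6", "12", "01"]
p inner         #=> ["8787", "986", "756", "87", "7", "6", "12"]
p grouped.first #=> ["8787"]
```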

In ruby, how do I use string.scan(/regex/) method for numbers from 1 to 12?

That's what I am doing:
c.scan(/[1-9]|1[0-2]/)
For some reason, it returns only numbers from 1 to 9, ignoring the second part. I tried experimenting a little bit, it seems that the method will search for 10-12 only if 1 is excluded from [1-9] part, e.g., c.scan(/[2-9]|1[0-2]/) will do. What is the reason?
P.S. I know that this method lacks lookbehinds and will search for numbers and "part of numbers" as well
Change the order of your patterns and add word boundaries if necessary.
c.scan(/\b(?:1[0-2]|[1-9])\b/)
Alternation tries the pattern before | first, so here 1[0-2] gets the chance to match the numbers from 10 to 12. Only if that fails is the pattern after | tried, and it matches the remaining numbers from 1 to 9. Note that without boundaries this would also match the 9 in 59. So I suggest putting the alternation inside a capturing or non-capturing group and adding a word boundary \b (which matches between a word character and a non-word character) before and after that group.
| tries its alternatives left to right, and the first character of the right side (1) is always matched by the left side first. Reverse them:
c.scan(/1[0-2]|[1-9]/)
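A short sketch of how the order of the alternatives and the word boundaries change the result. The sample string here is made up for illustration:

```ruby
c = "1 5 10 12 59"

# Single-digit alternative first: it wins at every "1", so 10 and 12 are never seen whole.
short_first = c.scan(/[1-9]|1[0-2]/)

# Two-digit alternative first: 10 and 12 match, but 59 still yields 5 and 9.
long_first = c.scan(/1[0-2]|[1-9]/)

# Word boundaries reject digits that are part of a longer number.
bounded = c.scan(/\b(?:1[0-2]|[1-9])\b/)

p short_first #=> ["1", "5", "1", "1", "2", "5", "9"]
p long_first  #=> ["1", "5", "10", "12", "5", "9"]
p bounded     #=> ["1", "5", "10", "12"]
```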
Here's another way you might consider extracting numbers between 1 and 12 (assuming that's what you want to do):
c = '14 0 11x 15 003 y12'
c.scan(/\d+/).map(&:to_i).select { |n| (1..12).cover?(n) }
#=> [11, 3, 12]
I've returned an array of integers, rather than strings, thinking that probably would be more useful, but if you want strings:
c.scan(/\d+/).map { |s| s.to_i.to_s }
.select { |s| ['10', '11', '12', *'1'..'9'].include?(s) }
#=> ["11", "3", "12"]
I see several advantages to this approach, versus using a single regex:
it's easy to understand;
the regex is simple;
it's easy to modify if the permissible values change; and
it can be broken into three pieces to facilitate testing.

Matching repeated pattern in string

I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers with its possible a,b,c,d attached.
I've come up with this in the meantime:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regexes if possible, and I'd prefer not to use Ruby's scan but to do it all in one regex.
You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
  .+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
  (?:…)* Find zero or more of what's in here, but don't capture them
    \d+ One or more digits (0–9)
    [a-z]* Zero or more lowercase letters
    [,\s]+ One or more commas and/or whitespace characters
  \d+ Followed by one or more digits
  [a-z]* And zero or more lowercase letters
However, if you want to break the number up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The whole regex matches the entire run of numbers, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
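For contrast, a minimal sketch on the same text: writing the pattern for a single instance and letting scan do the repeating keeps every number, while the repeated group keeps only the last one. Variable names are illustrative.

```ruby
txt = "hello 11 2 3 44 5 6 77 world"

# A repeated capture group keeps only its final repetition.
last_only = txt.match(/((\d+) )+/)[2]

# One instance per match; scan repeats the match across the string.
all_numbers = txt.scan(/(\d+) /).flatten

p last_only   #=> "77"
p all_numbers #=> ["11", "2", "3", "44", "5", "6", "77"]
```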
Why do you prefer not to use scan? This is what it is made for.
If you want your 3rd example to work, you need to change the [a-d] range to include e. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. I did some testing on Rubular using the examples you gave.
Using some more groupings you can put the repetition on those last few conditions (which seem to be pretty tricky). This way the spacing and comma at the end get caught in the repetition after the initial space is consumed.
The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)
I want a regex that will find and match....
Do the street names also contain digits (0-9), other characters beside an apostrophe?
Are the street numbers based off arbitrary data? Is it always just an optional a, b, c, or d?
Are you needing a minimum and maximum limitation of string length?
Here are some possible options:
If you are unsure about what the street name contains, but know your street number pattern will be numbers with an optional letter, commas or spaces.
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
If the street names contain only letters with optional apostrophes and the street numbers contain numbers with an optional letter, comma.
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
If your street name and street number pattern are always consistent, you could easily do.
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
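Whichever pattern fits the data, the second capture is still a single string, so breaking it into individual numbers is a separate step. A minimal sketch using the last pattern; the STREET_LINE constant and the parse_line helper are hypothetical names, not part of the original answer:

```ruby
STREET_LINE = /^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/

# Hypothetical helper: returns [name, [numbers]], or nil if the line does not match.
def parse_line(line)
  m = line.strip.match(STREET_LINE)
  return nil unless m

  # Split the captured number list on runs of commas and/or whitespace.
  [m[1], m[2].split(/[,\s]+/)]
end

p parse_line("Sokolov 19, 20, 23 ,25")     #=> ["Sokolov", ["19", "20", "23", "25"]]
p parse_line("Aba Hillel Silver 2,3,5,6,") #=> ["Aba Hillel Silver", ["2", "3", "5", "6"]]
p parse_line("Ahad Ha'am 9 13 29")         #=> ["Ahad Ha'am", ["9", "13", "29"]]
```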

Allow alphanumeric and #, replace comma and separate on space

I want to do the following in a regex:
1. allow alphanumeric characters
2. allow the # character, and comma ','
3. replace the comma ',' with a space
4. split on space
sentence = "cool, fun, house234"
>> [cool, fun, house234]
This is a simple way to do it:
sentence.scan(/[a-z0-9#]+/i) #=> ["cool", "fun", "house234"]
Basically it's looking for character runs that contain a to z in upper and lower case, plus 0 to 9, and #, and returning those. Because comma and space aren't matching they're ignored.
You don't show an example using # but I added it 'cuz you said so.
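Since the question has no sample containing #, here is a small sketch with a made-up input to show that the character class keeps it:

```ruby
# Made-up input: the "#fun" token is an assumption for illustration.
sentence = "cool, #fun, house234"

# Runs of letters, digits and # are returned; commas and spaces are skipped.
tokens = sentence.scan(/[a-z0-9#]+/i)

p tokens #=> ["cool", "#fun", "house234"]
```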
You can do 1 and 2 with a regular expression, but not 3 and 4.
sentence = "cool, fun, house234"
sentence.gsub(',', ' ').split if sentence =~ /[0-9#,]/
=> [ "cool", "fun", "house234" ]
"cool, fun, house234".split(",")
=> ["cool", " fun", " house234"]
You can just pass the "," into the split method to split on the comma, no need to convert it to spaces.
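If the leading spaces in that result are unwanted, two possible ways to drop them are sketched below: strip each piece afterwards, or let the split pattern absorb the whitespace around the comma.

```ruby
sentence = "cool, fun, house234"

# Split on the comma, then strip surrounding whitespace from each piece.
stripped = sentence.split(",").map(&:strip)

# Or treat the comma plus any surrounding whitespace as the delimiter.
direct = sentence.split(/\s*,\s*/)

p stripped #=> ["cool", "fun", "house234"]
p direct   #=> ["cool", "fun", "house234"]
```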
Probably, what you wanted is this?
string.gsub(/[^\w, #]/, '').split(/[ ,]+/)
