Allow alphanumeric and #, replace comma and separate on space - ruby

I want to do the following in a regex:
1. allow alphanumeric characters
2. allow the # character, and comma ','
3. replace the comma ',' with a space
4. split on space
sentence = "cool, fun, house234"
>> [cool, fun, house234]

This is a simple way to do it:
sentence.scan(/[a-z0-9#]+/i) #=> ["cool", "fun", "house234"]
It looks for runs of characters made up of a to z (upper or lower case, thanks to the i flag), 0 to 9, and #, and returns those runs. Commas and spaces don't match, so they're skipped.
You don't show an example using # but I added it 'cuz you said so.
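To check the # case that the question doesn't illustrate, here is a minimal sketch. The helper name `tokens` and the second input are made up for illustration.

```ruby
# Hypothetical helper: return every run of letters, digits, and '#'.
def tokens(sentence)
  sentence.scan(/[a-z0-9#]+/i)
end

p tokens("cool, fun, house234")   #=> ["cool", "fun", "house234"]
p tokens("cool, #fun, house#234") #=> ["cool", "#fun", "house#234"]
```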

You can do 1 and 2 with a regular expression, but not 3 and 4.
sentence = "cool, fun, house234"
sentence.gsub(',', ' ').split if sentence =~ /[0-9#,]/
=> [ "cool", "fun", "house234" ]

"cool, fun, house234".split(",")
=> ["cool", " fun", " house234"]
You can just pass the "," into the split method to split on the comma, no need to convert it to spaces.
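Note that splitting on "," alone keeps the space after each comma, as the output above shows. A minimal sketch that drops it by splitting on the comma plus any whitespace that follows; the helper name `split_list` is an assumption for illustration:

```ruby
# Hypothetical helper: split on a comma and any whitespace right after it.
def split_list(sentence)
  sentence.split(/,\s*/)
end

p split_list("cool, fun, house234") #=> ["cool", "fun", "house234"]
p split_list("cool,fun,house234")   #=> ["cool", "fun", "house234"]
```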

Probably, what you wanted is this?
string.gsub(/[^\w, #]/, '').split(/[, ]+/)

Related

Regex to obfuscate substring of a repeating substring

Given a string like:
abc_1234 xyz def_123aa4a56
I want to replace parts of it so the output is:
abc_*******z def_*******56
The rules are:
abc_ and def_ act as delimiters: anything between one delimiter and the next belongs to the preceding delimiter.
The string between abc_ and def_ (and likewise the string after def_) should be replaced by *, except for the last 2 characters of that substring. In the above example, abc_1234 xyz (note the trailing space) got turned into abc_*******z
prefixes = %w|abc_ def_|
input = "Hello abc_111def_frg def_333World abc_444"
input.gsub(/(#{Regexp.union(prefixes)})../, "\\1**")
#⇒ "Hello abc_**1def_**g def_**3World abc_**4"
Is this what you are looking for?
str = "Hello abc_111def_frg def_333World abc_444"
str.scan(/(?<=abc_|def_)(?:[[:alpha:]]+|[[:digit:]]+)/)
# => ["111", "frg", "333", "444"]
I've assumed the string following "abc_" or "def_" is either all digits or all letters. It won't work if, for example, you wished to extract "a1b" from "abc_a1b cat". You need to better define the rules for what terminates the strings you want.
The regular expression reads: following the string "abc_" or "def_" (a positive lookbehind that is not part of the match), match a string of letters or a string of digits.
Given:
> s
=> "abc_1234 xyz def_123aa4a56"
You can do:
> s.gsub(/(?<=abc_|def_)(.*?)(..)(?=(?:abc_|def_|$))/) { "*" * $1.length + $2 }
=> "abc_*******z def_*******56"
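If the prefixes or the number of visible characters vary, the same idea could be wrapped in a method. This is only a sketch: `mask_segments` and its `keep:` parameter are hypothetical names, it captures the prefix instead of using a lookbehind, and it assumes every segment has at least `keep` characters.

```ruby
# Hypothetical helper: after each prefix, replace everything with '*'
# except the last `keep` characters before the next prefix (or the end).
def mask_segments(str, prefixes, keep: 2)
  prefix = Regexp.union(prefixes)
  # /m lets '.' match newlines too, so segments may span lines.
  str.gsub(/(#{prefix})(.*?)(.{#{keep}})(?=#{prefix}|\z)/m) do
    "#{$1}#{'*' * $2.length}#{$3}"
  end
end

p mask_segments("abc_1234 xyz def_123aa4a56", %w[abc_ def_])
#=> "abc_*******z def_*******56"
```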

gsub numbers and +

I'm saving a number with params[:number].gsub(/\D/,''), but I don't want to strip the plus symbol: +
For example if a user saves number +1 (516) 949-9508 it saves as 15169499508 but how can we preserve the + as +15169499508?
In Ruby \D is just an alias for [^0-9]. You may explicitly set [^0-9+]:
params[:number].gsub(/[^0-9+]/,'')
If you only want to keep a plus at the start of the string, use:
.gsub(/\A(\+)|\D+/, '\1')
Here, \A(\+) branch matches a literal plus at the start of the string. The second branch is your \D that matches all chars but digits, just with a + quantifier that matches 1 or more occurrences. The \1 backreference restores that initial plus symbol in the resulting string.
If you don't have any syntactic rules, delete would work just fine:
'+1 (516) 949-9508'.delete('^0-9+') #=> "+15169499508"
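To compare the two approaches side by side, here is a small sketch. `normalize_number` is a hypothetical helper name, and the second input (with a plus in the middle) is invented to show where the results differ.

```ruby
# Hypothetical helper: keep digits, plus a '+' only when it leads the string.
def normalize_number(raw)
  raw.gsub(/\A(\+)|\D+/, '\1')
end

p normalize_number("+1 (516) 949-9508") #=> "+15169499508"
p normalize_number("1 (516) 949+9508")  #=> "15169499508"

# delete keeps every '+', wherever it appears.
p "1 (516) 949+9508".delete("^0-9+")    #=> "1516949+9508"
```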

Lookbehind and lookahead regex

I have a string like this:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2
I need to capture Michele Di Santo, Libero Nigro, Wilma Russo but not the last one.
This regex matches almost what I need:
/(?<=::).*?(?=::)/
But it has a problem: it captures the third colon.
str.scan(/(?<=::).*?(?=::)/) #=> [":Michele Di Santo", ...]
As you can see, the first match has a colon at the beginning.
How can I fix this regex to avoid this third colon?
Don't use regex for this. All you need to do is split the input string on :::, take the second string from the resulting array, and split that on ::. Faster to code, faster to run, and easier to read than a regex version.
Edit: The code:
str.split(':::')[1].split('::')
Running on CodePad: http://codepad.org/1BNNwoh6
An expression to do that could be:
(?<=::)[^:].*?(?=::)
Although if the string to be searched is always in the form of "xxx:::A::B::C:::xxx" and you only care about A, B and C, consider using something more specific, and using the capture groups to get A, B and C:
:::(.+?)::(.+?)::(.+?):::
$1, $2 and $3 will contain the group matches.
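A minimal sketch of the capture-group version, assuming the record always holds exactly three names between the ::: markers. The helper name `authors` is made up for illustration.

```ruby
# Hypothetical helper: return the three captured names, or nil when the
# record does not have the xxx:::A::B::C:::xxx shape.
def authors(record)
  match = record.match(/:::(.+?)::(.+?)::(.+?):::/)
  match&.captures
end

str = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2'
p authors(str) #=> ["Michele Di Santo", "Libero Nigro", "Wilma Russo"]
```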
I'd use a simple split because the string is basically a CSV with colons instead of commas:
str = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2'
items = str.split(':')
str1, str2, str3 = items[3], items[5], items[7]
=> [
[0] "Michele Di Santo",
[1] "Libero Nigro",
[2] "Wilma Russo"
]
You could also use:
str1, str2, str3 = str.split(':').select{ |s| s > '' }[1, 3]
If it's possible to have quoted colons, use the CSV module and set your field delimiter to ':'.

Ruby Regex Match Between "foo" and "bar"

I have unfortunately wandered into a situation where I need regex using Ruby. Basically I want to match the text after the underscores and before the first parenthesis. So the end result would be 'table salt'.
_____ table salt (1) [F]
As usual I tried to fight this battle on my own and with rubular.com. I got the first part
^_____ (Match the beginning of the string with underscores ).
Then I got bolder,
^_____(.*?) ( Do the first part of the match, then give me any amount of words and letters after it )
Regex had had enough and put an end to that nonsense and crapped out. So I was wondering if anyone on stackoverflow knew or would have any hints on how to say my goal to the Ruby Regex parser.
EDIT: Thanks everyone, this is the pattern I ended up using after creating it with rubular.
ingredientNameRegex = /^_+([^(]*)/;
Everything got better once I took a deep breath, and thought about what I was trying to say.
str = "_____ table salt (1) [F]"
p str[ /_{3}\s(.+?)\s+\(/, 1 ]
#=> "table salt"
That says:
Find three underscores
and a whitespace character (\s)
and then one or more (+) of any character (.), but as little as possible (?), up until you find
one or more whitespace characters,
and then a literal (
The parens in the middle save that bit, and the 1 pulls it out.
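The same breakdown can be written into the pattern itself with Ruby's x (extended) flag, which ignores whitespace and allows comments inside the regex. A sketch; the constant and method names are assumptions for illustration.

```ruby
INGREDIENT_NAME = /
  _{3}\s  # three underscores, then one whitespace character
  (.+?)   # group 1: one or more characters, as few as possible
  \s+     # one or more whitespace characters
  \(      # a literal opening parenthesis
/x

# Hypothetical helper: return group 1, or nil when the line does not match.
def ingredient_name(line)
  line[INGREDIENT_NAME, 1]
end

p ingredient_name("_____ table salt (1) [F]") #=> "table salt"
```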
Try this: ^[_]+([^(]*)\(
It will match lines starting with one or more underscores followed by anything not equal to an opening parenthesis: http://rubular.com/r/vthpGpVr4y
Here's working regex:
str = "_____ table salt (1) [F]"
match = str.match(/_([^_]+?)\(/)
p match[1].strip # => "table salt"
You could use
^_____\s*([^(]+?)\s*\(
^_____ matches the five underscores at the beginning of the string
\s* matches zero or more whitespace characters
( grouping start
[^(]+ matches one or more characters that are not (
? makes the preceding + match the shortest possible string (non-greedy)
) grouping end
\s* matches zero or more whitespace characters
\( matches a literal (
"_____ table salt (1) [F]".gsub(/[_]\s(.+)\s\(/, ' >>>\1<<< ')
# => "____ >>>table salt<<< 1) [F]"
It seems to me the simplest regex to do what you want is:
/^_____ ([\w\s]+) /
That says:
leading underscores, space, then capture any combination of word chars or spaces, then another space.
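One caveat, shown as a sketch on a made-up input: [\w\s] only covers word characters and whitespace, so a name containing a hyphen would not match, while a pattern built on [^(] would. The hyphenated sample line, the constants, and `name_with` are all hypothetical.

```ruby
WORDS_ONLY  = /^_____ ([\w\s]+) /
UP_TO_PAREN = /^_____\s*([^(]+?)\s*\(/

# Hypothetical helper: return capture group 1 of `pattern`, or nil.
def name_with(pattern, line)
  line[pattern, 1]
end

plain      = "_____ table salt (1) [F]"
hyphenated = "_____ extra-virgin olive oil (2) [F]" # invented example

p name_with(WORDS_ONLY, plain)       #=> "table salt"
p name_with(UP_TO_PAREN, plain)      #=> "table salt"
p name_with(WORDS_ONLY, hyphenated)  #=> nil
p name_with(UP_TO_PAREN, hyphenated) #=> "extra-virgin olive oil"
```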

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
Scan groups all matches of the regexp into an array. The .{5} matches any 5 characters. If there are characters left at the end of the string, they will be matched by the .+. Join the array with your string
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left over.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
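Both techniques can take the chunk size and separator as parameters. A sketch under that assumption; `insert_every` and `insert_every_gsub` are hypothetical names, and the m flag is added so that newlines count as characters too.

```ruby
# Hypothetical helper: cut into chunks of up to n characters, then join.
def insert_every(str, n, sep)
  str.scan(/.{1,#{n}}/m).join(sep)
end

# Same result with gsub: append sep to each full chunk that has text after it.
def insert_every_gsub(str, n, sep)
  str.gsub(/.{#{n}}(?=.)/m) { |chunk| chunk + sep }
end

p insert_every('12345678901234567', 5, ':')      #=> "12345:67890:12345:67"
p insert_every_gsub('12345678901234567', 5, ':') #=> "12345:67890:12345:67"
```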
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Here is a solution that is adapted from the answer to a recent question:
class String
  def in_groups_of(n, sep = ' ')
    chars.each_slice(n).map(&:join).join(sep)
  end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
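Wrapped in a method, the parametric version appears to reproduce the expected outcome from the question. A sketch; `wbr_every` is a hypothetical name.

```ruby
# Hypothetical helper: insert sep after every n non-whitespace characters.
# Each scanned token is one non-whitespace character plus any whitespace
# before it; trailing whitespace sticks to the last token.
def wbr_every(str, n, sep = '<wbr>')
  str.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(n).map(&:join).join(sep)
end

p wbr_every('HelloWorld-Hello guys', 5)
#=> "Hello<wbr>World<wbr>-Hell<wbr>o guys"
```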
