Regex too Capture certain words at start of string Ruby - ruby

Looking for help in writing a regex for capturing whether a particular string starts with certain strings and capture the start and remaining string. E.g
Let's say the possible starts of strings are 'P', 'RO', 'RPX' and the sample string is 'PIXR' or 'ROXP' or 'RPX'.
I am looking to write a regex which captures the start and trailing part of string if it starts with the given possible strings e.g
'PIXRT' =~ // outputs 'P' and 'IXRT'
Not very conversant with regexes so any help is really appreciated.

You may use a regex with 2 capturing groups, one capturing the known values at the start and the rest will capture the rest of the string:
rx = /\A(RPX|RO|P)(.*)/m
"PIXRT".scan(rx)
# => [P, IXRT]
See the Ruby demo
Details:
\A - start of string
(RPX|RO|P) - one of the values that must be at the start of the string (mind the order of these alternatives: the longer ones come first!)
(.*) - any 0+ chars up to the end of the string (m modifier will make . match line breaks, too).

def split_after_start_string(str, *start_strings)
a = str.split(/(?<=\A#{start_strings.join('|')})/)
if a.size == 2
a
elsif start_strings.include?(str)
a << ''
else
nil
end
end
start_strings = %w| P RO RPX | #=> ["P", "RO", "RPX"]
split_after_start_string('PIXR', *start_strings) #=> ["P", "IXR"]
split_after_start_string('IPXR', *start_strings) #=> nil
split_after_start_string('ROXP', *start_strings) #=> ["RO", "XP"]
split_after_start_string('RPX', *start_strings) #=> ["RPX", ""]
The regex reads, "match one element of start_stringx at the beginning of the string in a positive lookbehind". For smart_strings in the examples, the regex is:
/(?<=\A#{start_strings.join('|')})/ #=> /(?<=\AP|RO|RPX)/

Related

How to find same characters in two random strings? (Ruby)

I am busy working through some problems I have found on the net and I feel like this should be simple but I am really struggling.
Say you have the string 'AbcDeFg' and the next string of 'HijKgLMnn', I want to be able to find the same characters in the string so in this case it would be 'g'.
Perhaps I wasn't giving enough information - I am doing Advent of Code and I am on day 3. I just need help with the first bit which is where you are given a string of characters - you have to split the characters in half and then compare the 2 strings. You basically have to get the common character between the two. This is what I currently have:
file_data = File.read('Day_3_task1.txt')
arr = file_data.split("\n")
finals = []
arr.each do |x|
len = x.length
divided_by_two = len / 2
second = x.slice!(divided_by_two..len).split('')
first = x.split('')
count = 0
(0..len).each do |z|
first.each do |y|
if y == second[count]
finals.push(y)
end
end
count += 1
end
end
finals = finals.uniq
Hope that helps in terms of clarity :)
Did you try to convert both strings to arrays with the String#char method and find the intersection of those arrays?
Like this:
string_one = 'AbcDeFg'.chars
string_two = 'HijKgLMnn'.chars
string_one & string_two # => ["g"]
One way to do that is to use the method String#scan with the regular expression
rgx = /(.)(?!.*\1.*_)(?=.*_.*\1)/
I'm not advocating this approach. I merely thought some readers might find it interesting.
Suppose
str1 = 'AbcDgeFg'
str2 = 'HijKgLMnbn'
Now form the string
str = "#{str1}_#{str2}"
#=> "AbcDeFg_HijKgLMnbn"
I've assumed the strings contain letters only, in which case they are separated in str with any character other than a letter. I've used an underscore. Naturally, if the strings could contain underscores a different separator would have to be used.
We then compute
str.scan(rgx).flatten
#=> ["b", "g"]
Array#flatten is needed because
str.scan(rgx)
#=>[["b"], ["g"]]
The regular expression can be written in free-spacing mode to make it self-documenting:
rgx =
/
(.) # match any character, same to capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1 # match the contents of capture group 1
.* # match zero or more characters
_ # match an underscore
) # end the negative lookahead
(?= # begin a positive lookahead
.* # match zero or more characters
_ # match an underscore
.* # match zero or more characters
\1 # match the contents of capture group 1
) # end the positive lookahead
/x # invoke free-spacing regex definition mode
Note that if a character appears more than once in str1 and at least once in str2 the negative lookahead ensures that only the last one in str1 is matched, to avoid returning duplicates.
Alternatively, one could write
str.gsub(rgx).to_a
The uses the (fourth) form of String#gsub which takes a single argument and no block and returns an enumerator.

Simple regex - ignoring certain characters

I'm trying to use the match method with an argument of a regex to select a valid phone number, by definition, any string with nine digits.
For example:
9347584987 is valid,
(456)322-3456 is valid,
(324)5688890 is valid.
But
(340)HelloWorld is NOT valid and
456748 is NOT valid.
So far, I'm able to use \d{9} to select the example string of 9 digit characters in a row, but I'm not sure how to specifically ignore any character, such as '-' or '(' or ')' in the middle of the sequence.
What kind of Regex could I use here?
Given:
nums=['9347584987','(456)322-3456','(324)5688890','(340)HelloWorld', '456748 is NOT valid']
You can split on a NON digit and rejoin to remove non digits:
> nums.map {|s| s.split(/\D/).join}
["9347584987", "4563223456", "3245688890", "340", "456748"]
Then filter on the length:
> nums.map {|s| s.split(/\D/).join}.select {|s| s.length==10}
["9347584987", "4563223456", "3245688890"]
Or, you can grab a group of numbers that look 'phony numbery' by using a regex to grab digits and common delimiters:
> nums.map {|s| s[/[\d\-()]+/]}
["9347584987", "(456)322-3456", "(324)5688890", "(340)", "456748"]
And then process that list as above.
That would delineate:
> '123 is NOT a valid area code for 456-7890'[/[\d\-()]+/]
=> "123" # no match
vs
> '123 is NOT a valid area code for 456-7890'.split(/\D/).join
=> "1234567890" # match
I suggest using one regular expression for each valid pattern rather than constructing a single regex. It would be easier to test and debug, and easier to maintain the code. If, for example, "123-456-7890" or 123-456-7890 x231" were in future deemed valid numbers, one need only add a single, simple regex for each to the array VALID_PATTERS below.
VALID_PATTERS = [/\A\d{10}\z/, /\A\(\d{3}\)\d{3}-\d{4}\z/, /\A\(\d{3}\)\d{7}\z/]
def valid?(str)
VALID_PATTERS.any? { |r| str.match?(r) }
end
ph_nbrs = %w| 9347584987 (456)322-3456 (324)5688890 (340)HelloWorld 456748 |
ph_nbrs.each { |s| puts "#{s.ljust(15)} \#=> #{valid?(s)}" }
9347584987 #=> true
(456)322-3456 #=> true
(324)5688890 #=> true
(340)HelloWorld #=> false
456748 #=> false
String#match? made its debut in Ruby v2.4. There are many alternatives, including str.match(r) and str =~ r.
"9347584987" =~ /(?:\d.*){9}/ #=> 0
"(456)322-3456" =~ /(?:\d.*){9}/ #=> 1
"(324)5688890" =~ /(?:\d.*){9}/ #=> 1
"(340)HelloWorld" =~ /(?:\d.*){9}/ #=> nil
"456748" =~ /(?:\d.*){9}/ #=> nil
Pattern: (Rubular Demo)
^\(?\d{3}\)?\d{3}-?\d{4}$ # this makes the expected symbols optional
This pattern will ensure that an opening ( at the start of the string is followed by 3 numbers the a closing ).
^(\(\d{3}\)|\d{3})\d{3}-?\d{4}$
On principle, though, I agree with melpomene in advising that you remove all non-digital characters, test for 9 character length, then store/handle the phone numbers in a single/reliable/basic format.

How to split a string without getting an empty string inserted in the array

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.
I want to split off either an "m" or an "f" character from the first part of a string assuming the next character is one or more numbers followed by optional space characters, followed by a string from an array I have.
I tried:
2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
=> ["-", " to "]
2.4.0 :008 > str = "M14-19"
=> "M14-19"
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
=> ["", "M", "19"]
Notice the extraneous "" element at the beginning of my array and also notice that the last expression is just "19" whereas I would want everything else in the string ("14-19").
How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?
I find match to be a bit more elegant when extracting characters from regular expressions in Ruby:
string = "M14-19"
string.match(/\A(?<m>[M|F])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# also can extract the symbols from match
extract_string = string.match(/\A(?<m>[M|F])(?<digits>\d{2}(-| to )\d{2})/)
[[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[M|F])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
TOKENS = ["-", " to "]
r = /
(?<=\A[mMfF]) # match the beginning of the string and then one
# of the 4 characters in a positive lookbehind
(?= # begin positive lookahead
\d+ # match one or more digits
[[:space:]]* # match zero or more spaces
(?:#{TOKENS.join('|')}) # match one of the tokens
) # close the positive lookahead
/x # free-spacing regex definition mode
(?:#{TOKENS.join('|')}) is replaced by (?:-| to ).
This can of course be written in the usual way.
r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/
When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.
"M14-19".split r
#=> ["M", "14-19"]
"M14 to 19".split r
#=> ["M", "14 to 19"]
"M14 To 19".split r
#=> ["M14 To 19"]
If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
You have a bug brewing in your code. Don't get in the habit of doing this:
#{Regexp.union(MY_SEPARATOR_TOKENS)}
You're setting yourself up with a very hard to debug problem.
Here's what's happening:
regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/
/(?-mix:a|b)/ is an embedded sub-pattern with its set of the regex flags m, i and x which are independent of the surrounding pattern's settings.
Consider this situation:
'CAT'[/#{regex}/i] # => nil
We'd expect that the regular expression i flag would match because it's ignoring case, but the sub-expression still only allows only lowercase, causing the match to fail.
Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:
'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"
See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.
Use something like
MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
puts s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]
See Ruby demo

Stuck in Abbreviation implementation to ruby string

I want to convert all the words(alphabetic) in the string to their abbreviations like i18n does. In other words I want to change "extraordinary" into "e11y" because there are 11 characters between the first and the last letter in "extraordinary". It works with a single word in the string. But how can I do the same for a multi-word string? And of course if a word is <= 4 there is no point to make an abbreviation from it.
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/, "#{x[0]}#{(x.length-2)}#{x[-1]}")
end
end
Test.assert_equals( Abbreviator.abbreviate("banana"), "b4a", Abbreviator.abbreviate("banana") )
Test.assert_equals( Abbreviator.abbreviate("double-barrel"), "d4e-b4l", Abbreviator.abbreviate("double-barrel") )
Test.assert_equals( Abbreviator.abbreviate("You, and I, should speak."), "You, and I, s4d s3k.", Abbreviator.abbreviate("You, and I, should speak.") )
Your mistake is that your second parameter is a substitution string operating on x (the original entire string) as a whole.
Instead of using the form of gsub where the second parameter is a substitution string, use the form of gsub where the second parameter is a block (listed, for example, third on this page). Now you are receiving each substring into your block and can operate on that substring individually.
def short_form(str)
str.gsub(/[[:alpha:]]{4,}/) { |s| "%s%d%s" % [s[0], s.size-2, s[-1]] }
end
The regex reads, "match four or more alphabetic characters".
short_form "abc" # => "abc"
short_form "a-b-c" #=> "a-b-c"
short_form "cats" #=> "c2s"
short_form "two-ponies-c" #=> "two-p4s-c"
short_form "Humpty-Dumpty, who sat on a wall, fell over"
#=> "H4y-D4y, who sat on a w2l, f2l o2r"
I would recommend something along the lines of this:
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/) do |word|
# Skip the word unless it's long enough
next word unless word.length > 4
# Do the same I18n conversion you do before
"#{word[0]}#{(word.length-2)}#{word[-1]}"
end
end
end
The accepted answer isn't bad, but it can be made a lot simpler by not matching words that are too short in the first place:
def abbreviate(str)
str.gsub(/([[:alpha:]])([[:alpha:]]{3,})([[:alpha:]])/i) { "#{$1}#{$2.size}#{$3}" }
end
abbreviate("You, and I, should speak.")
# => "You, and I, s4d s3k."
Alternatively, we can use lookbehind and lookahead, which makes the Regexp more complex but the substitution simpler:
def abbreviate(str)
str.gsub(/(?<=[[:alpha:]])[[:alpha:]]{3,}(?=[[:alpha:]])/i, &:size)
end

Finding the first duplicate character in the string Ruby

I am trying to call the first duplicate character in my string in Ruby.
I have defined an input string using gets.
How do I call the first duplicate character in the string?
This is my code so far.
string = "#{gets}"
print string
How do I call a character from this string?
Edit 1:
This is the code I have now where my output is coming out to me No duplicates 26 times. I think my if statement is wrongly written.
string "abcade"
puts string
for i in ('a'..'z')
if string =~ /(.)\1/
puts string.chars.group_by{|c| c}.find{|el| el[1].size >1}[0]
else
puts "no duplicates"
end
end
My second puts statement works but with the for and if loops, it returns no duplicates 26 times whatever the string is.
The following returns the index of the first duplicate character:
the_string =~ /(.)\1/
Example:
'1234556' =~ /(.)\1/
=> 4
To get the duplicate character itself, use $1:
$1
=> "5"
Example usage in an if statement:
if my_string =~ /(.)\1/
# found duplicate; potentially do something with $1
else
# there is no match
end
s.chars.map { |c| [c, s.count(c)] }.drop_while{|i| i[1] <= 1}.first[0]
With the refined form from Cary Swoveland :
s.each_char.find { |c| s.count(c) > 1 }
Below method might be useful to find the first word in a string
def firstRepeatedWord(string)
h_data = Hash.new(0)
string.split(" ").each{|x| h_data[x] +=1}
h_data.key(h_data.values.max)
end
I believe the question can be interpreted in either of two ways (neither involving the first pair of adjacent characters that are the same) and offer solutions to each.
Find the first character in the string that is preceded by the same character
I don't believe we can use a regex for this (but would love to be proved wrong). I would use the method suggested in a comment by #DaveNewton:
require 'set'
def first_repeat_char(str)
str.each_char.with_object(Set.new) { |c,s| return c unless s.add?(c) }
nil
end
first_repeat_char("abcdebf") #=> b
first_repeat_char("abcdcbe") #=> c
first_repeat_char("abcdefg") #=> nil
Find the first character in the string that appears more than once
r = /
(.) # match any character in capture group #1
.* # match any character zero of more times
? # do the preceding lazily
\K # forget everything matched so far
\1 # match the contents of capture group 1
/x
"abcdebf"[r] #=> b
"abccdeb"[r] #=> b
"abcdefg"[r] #=> nil
This regex is fine, but produces the warning, "regular expression has redundant nested repeat operator '*'". You can disregard the warning or suppress it by doing something clunky, like:
r = /([^#{0.chr}]).*?\K\1/
where ([^#{0.chr}]) means "match any character other than 0.chr in capture group 1".
Note that a positive lookbehind cannot be used here, as they cannot contain variable-length matches (i.e., .*).
You could probably make your string an array and use detect. This should return the first char where the count is > 1.
string.split("").detect {|x| string.count(x) > 1}
I'll use positive lookahead with String#[] method :
"abcccddde"[/(.)(?=\1)/] #=> c
As a variant:
str = "abcdeff"
p str.chars.group_by{|c| c}.find{|el| el[1].size > 1}[0]
prints "f"

Resources