Split specific string by regular expression - ruby

I am trying to get an array containing aaaaa, bbbbb and ccccc as the split output below.
a_string = "aaaaa[x]bbbbb,ccccc";
split_output = a_string.split(%r{[,|........]+})
What should I put in place of ........ ?

No need for a regex when it's just a literal:
irb(main):001:0> a_string = "aaaaa[x]bbbbb"
irb(main):002:0> a_string.split "[x]"
=> ["aaaaa", "bbbbb"]
If you want to split by "open bracket...anything...close bracket" then:
irb(main):003:0> a_string.split /\[.+?\]/
=> ["aaaaa", "bbbbb"]
Edit: I'm still not sure what your criteria are, but let's guess that what you are really doing is looking for runs of 2-or-more of the same character:
irb(main):001:0> a_string = "aaaaa[x]bbbbb,ccccc"
=> "aaaaa[x]bbbbb,ccccc"
irb(main):002:0> a_string.scan(/((.)\2+)/).map(&:first)
=> ["aaaaa", "bbbbb", "ccccc"]
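A short sketch of why the `map(&:first)` step is needed: when the pattern contains groups, `scan` returns one array of captures per match rather than the matched text. The helper name `runs_of_repeats` is only illustrative and not part of the answer.

```ruby
# Hypothetical helper wrapping the scan from the answer above.
def runs_of_repeats(str)
  # Group 1 is the whole run, group 2 is the single repeated character.
  str.scan(/((.)\2+)/).map(&:first)
end

a_string = "aaaaa[x]bbbbb,ccccc"

p a_string.scan(/((.)\2+)/)   # one [run, character] pair per match
p runs_of_repeats(a_string)   # only the runs themselves
```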
Edit 2: If you want to split by either of the literal strings "," or "[x]" then:
irb(main):003:0> a_string.split /,|\[x\]/
=> ["aaaaa", "bbbbb", "ccccc"]
The | part of the regular expression allows expressions on either side to match, and the backslashes are needed since otherwise the characters [ and ] have special meaning. (If you tried to split by /,|[x]/ then it would split on either a comma or an x character.)
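To see the difference the backslashes make, here is a small sketch comparing the escaped and unescaped patterns on the string from the question. It also shows `Regexp.union`, which escapes literal strings for you. The helper name `split_on_literals` is an assumption made for illustration.

```ruby
# Hypothetical helper: split on any of the given literal strings.
# Regexp.union escapes regex metacharacters in plain strings.
def split_on_literals(string, *separators)
  string.split(Regexp.union(*separators))
end

a_string = "aaaaa[x]bbbbb,ccccc"

p a_string.split(/,|\[x\]/)                # escaped: splits on "," or the literal "[x]"
p a_string.split(/,|[x]/)                  # unescaped: [x] is a character class matching "x"
p split_on_literals(a_string, ",", "[x]")  # same result as the escaped pattern
```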

no regex needed, just use "[x]"

Related

How to split a string without getting an empty string inserted in the array

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.
I want to split off either an "m" or an "f" character from the first part of a string assuming the next character is one or more numbers followed by optional space characters, followed by a string from an array I have.
I tried:
2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
=> ["-", " to "]
2.4.0 :008 > str = "M14-19"
=> "M14-19"
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
=> ["", "M", "19"]
Notice the extraneous "" element at the beginning of my array and also notice that the last expression is just "19" whereas I would want everything else in the string ("14-19").
How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?
I find match to be a bit more elegant when extracting characters from regular expressions in Ruby:
string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# also can extract the symbols from match
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
TOKENS = ["-", " to "]
r = /
(?<=\A[mMfF]) # match the beginning of the string and then one
# of the 4 characters in a positive lookbehind
(?= # begin positive lookahead
\d+ # match one or more digits
[[:space:]]* # match zero or more spaces
(?:#{TOKENS.join('|')}) # match one of the tokens
) # close the positive lookahead
/x # free-spacing regex definition mode
(?:#{TOKENS.join('|')}) is replaced by (?:-| to ).
This can of course be written in the usual way.
r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/
When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.
"M14-19".split r
#=> ["M", "14-19"]
"M14 to 19".split r
#=> ["M", "14 to 19"]
"M14 To 19".split r
#=> ["M14 To 19"]
If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
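A sketch of a case-insensitive variant, wrapped in a helper so the zero-width split is easy to try. It differs from the answer in two ways, both assumptions on my part: the tokens are passed through `Regexp.escape` as a precaution, and only the separator is made case-insensitive with an inline `(?i:...)` group while the lookbehind keeps `[mMfF]`. The names `SEPARATORS`, `TOKEN_PATTERN`, `PREFIX_SPLIT` and `split_prefix` are hypothetical.

```ruby
SEPARATORS = ["-", " to "].freeze  # same tokens as in the answer

# Regexp.escape guards against tokens that contain regex metacharacters.
TOKEN_PATTERN = SEPARATORS.map { |t| Regexp.escape(t) }.join("|")

# Zero-width split point: after a leading m/f, before digits plus a separator.
PREFIX_SPLIT = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?i:#{TOKEN_PATTERN}))/

def split_prefix(str)
  str.split(PREFIX_SPLIT)
end

p split_prefix("M14-19")
p split_prefix("f14 To 19")
p split_prefix("X14-19")   # no m/f prefix, so the string comes back unsplit
```

Because the pattern matches between two characters, nothing is consumed and no empty leading element appears.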
You have a bug brewing in your code. Don't get in the habit of doing this:
#{Regexp.union(MY_SEPARATOR_TOKENS)}
You're setting yourself up with a very hard to debug problem.
Here's what's happening:
regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/
/(?-mix:a|b)/ is an embedded sub-pattern with its own settings for the regex flags m, i and x, which are independent of the surrounding pattern's settings.
Consider this situation:
'CAT'[/#{regex}/i] # => nil
We'd expect the regular expression's i flag to make this match because it ignores case, but the sub-expression still only allows lowercase, causing the match to fail.
Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:
'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"
See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.
Use something like
MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]
See Ruby demo
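The two effects described in this answer (the leading empty string and the consumed digits) can be seen separately in a small sketch. The method names are illustrative, and the separator alternation is written out by hand as `(?:-| to )` rather than interpolated.

```ruby
# Consuming version from the question: digits and separator are part of the match.
def split_consuming(str)
  str.split(/^(m|f)\d+[[:space:]]*(?:-| to )/i)
end

# Lookahead version: digits and separator are only checked, not consumed.
def split_with_lookahead(str)
  str.split(/^(m|f)(?=\d+[[:space:]]*(?:-| to ))/i)
end

p split_consuming("M14-19")                        # leading "" plus the capture; "14-" is lost
p split_with_lookahead("M14-19")                   # "14-19" survives, leading "" is still there
p split_with_lookahead("M14-19").drop(1)           # drop the first element
p split_with_lookahead("M14-19").reject(&:empty?)  # or remove all empty elements
```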

Use ARGV[] argument vector to pass a regular expression in Ruby

I am trying to use gsub or sub on a regex passed through terminal to ARGV[].
Query in terminal: $ruby script.rb input.json "\[\{\"src\"\:\"
Input file first 2 lines:
[{
"src":"http://something.com",
"label":"FOO.jpg","name":"FOO",
"srcName":"FOO.jpg"
}]
[{
"src":"http://something123.com",
"label":"FOO123.jpg",
"name":"FOO123",
"srcName":"FOO123.jpg"
}]
script.rb:
dir = File.dirname(ARGV[0])
output = File.new(dir + "/output_" + Time.now.strftime("%H_%M_%S") + ".json", "w")
open(ARGV[0]).each do |x|
x = x.sub(ARGV[1], '')
output.puts(x) if !x.nil?
end
output.close
This is very basic stuff really, but I am not quite sure on how to do this. I tried:
Regexp.escape with this pattern: [{"src":".
Escaping the characters and not escaping.
Wrapping the pattern between quotes and not wrapping.
Meditate on this:
I wrote a little script containing:
puts ARGV[0].class
puts ARGV[1].class
and saved it to disk, then ran it using:
ruby ~/Desktop/tests/test.rb foo /abc/
which returned:
String
String
The documentation says:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by ‘d’, instead of a digit.
That means that what you pass in, though it looks like a regex, isn't one. It's a string, because ARGV can only return strings, since the command line can only contain strings.
When we pass a string into sub, Ruby recognizes it's not a regular expression, so it treats it as a literal string. Here's the difference in action:
'foo'.sub('/o/', '') # => "foo"
'foo'.sub(/o/, '') # => "fo"
The first can't find "/o/" in "foo" so nothing changes. The second can find /o/, though, and returns the result after replacing the first "o" (sub only replaces the first match; gsub would replace both).
Another way of looking at it is:
'foo'.match('/o/') # => nil
'foo'.match(/o/) # => #<MatchData "o">
where match finds nothing for the string but can find a hit for /o/.
And all that leads to what's happening in your code. Because sub is being passed a string, it's trying to do a literal match for the regex, and won't be able to find it. You need to change the code to:
sub(Regexp.new(ARGV[1]), '')
but that's not all that has to change. Regexp.new(...) will convert what's passed in into a regular expression, but if you're passing in '/o/' the resulting regular expression will be:
Regexp.new('/o/') # => /\/o\//
which is probably not what you want:
'foo'.match(/\/o\//) # => nil
Instead you want:
Regexp.new('o') # => /o/
'foo'.match(/o/) # => #<MatchData "o">
So, besides changing your code, you'll need to make sure that what you pass in is a valid expression, minus any leading and trailing /.
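One possible way to tolerate both forms on the command line is sketched below. The helper `pattern_from_arg` is hypothetical and not from the answer. It assumes that an argument starting and ending with "/" was meant as a delimited regex, which may be the wrong guess for patterns that legitimately begin and end with a slash.

```ruby
# Hypothetical helper: turn a command-line string into a Regexp,
# dropping one leading and one trailing "/" if both are present.
def pattern_from_arg(arg)
  delimited = arg.length >= 2 && arg.start_with?("/") && arg.end_with?("/")
  source = delimited ? arg[1..-2] : arg
  Regexp.new(source)   # raises RegexpError if the source is not a valid pattern
end

p pattern_from_arg("/o/")                  # the slashes are stripped
p pattern_from_arg("o")                    # already bare, used as is
p "foo".sub(pattern_from_arg("/o/"), "")   # first "o" removed
```

In the script from the question this would be called as `x.sub(pattern_from_arg(ARGV[1]), '')`.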
Based on this answer in the thread Convert a string to regular expression ruby, you should use
x = x.sub(/#{ARGV[1]}/,'')
I tested it with this file (test.rb):
puts "You should not see any number [0123456789].".gsub(/#{ARGV[0]}/,'')
I called the file like so:
ruby test.rb "\d+"
# => You should not see any number [].

regular expression to treat string as multiple lines in ruby

I am new to Ruby. I am stuck at a point where data needs to match a pattern. I was wondering if there is a regular expression option that makes Ruby treat a string as multiple lines.
I think you are looking for the m option. m will allow . to match a new line.
a = "this is my
string"
=> "this is my\nstring"
a
=> "this is my\nstring"
a.match /my.string/m
=> #<MatchData "my\nstring">
a.match /not my.string/m
=> nil
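A small sketch of what the m option changes and what it does not. In Ruby, `^` and `$` already match at line boundaries without any flag, so m only affects whether `.` matches a newline. The constant name `TEXT` is just for illustration.

```ruby
TEXT = "this is my\nstring"

p TEXT.match?(/my.string/)    # without m, "." does not match "\n"
p TEXT.match?(/my.string/m)   # with m, "." matches "\n" too
p TEXT.scan(/^\w+/)           # ^ matches at the start of every line, no flag needed
p TEXT.match?(/\Astring/)     # \A only matches at the very start of the string
```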

Couldn't understand why the Regexp option i got disabled in my code

I have just started playing with Ruby and I'm stuck on something. Is
there some trick to modify the casefold attribute of a Regexp object after
it's been instantiated?
The best idea I have tried is the following:
irb(main):001:0> a = Regexp.new('a')
=> /a/
irb(main):002:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
But none of the below seems to work:
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> nil
Something I don't understand is happening here. Where did the 'i' go on line
8?
irb(main):07:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
irb(main):08:0> aA.to_s
=> "(?-mix:a)"
irb(main):09:0>
I am using Ruby 1.9.3.
I am also unable to understand why the code below returns false:
/(?i:a)/.casefold? #=> false
As your console output shows, a.to_s includes the case sensitiveness as an option for your subexpression, so aA is being defined as
/(?-mix:a)/i
So you're asking Ruby for a regular expression that is case insensitive, but the only thing inside that case-insensitive regexp is a group in which case sensitivity has been turned on. The net effect is that 'a' is matched case sensitively.
Since the result of to_s is just the regular expression string itself - no delimiters or external flags - the flags are translated into the (?i:...) syntax that sets or clears them temporarily inside the expression itself. This lets you get a Regexp object back out via a simple Regexp.new(s) call that will match the same strings.
The wrapping, unfortunately, includes explicitly clearing the flags that are not set on the object. So your first regex gets stringified into something wrapped in (?-mix:...) - that is, the casefold option is explicitly turned off between the parentheses. Turning it back on for the object doesn't have any effect.
You can use a.source instead of a.to_s to get just the original expression, without the flag settings:
irb(main):001:0> a=/a/
=> /a/
irb(main):002:0> aA = Regexp.new(a.source, Regexp::IGNORECASE)
=> /a/i
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> 0
As Frederick already explains, calling to_s on a regex will add modifiers around it that ensure that its properties like case-sensitiveness are preserved. So if you insert a case-sensitive regex into a case-insensitive regex, the inserted part will still be case-sensitive. Likewise the modifiers given to Regexp.new will have no effect if the first argument is a regex or the result of calling to_s on one.
To solve this issue, call source on the regex instead of to_s. Unlike to_s, source simply returns the source of regex without adding anything:
aA = Regexp.new(a.source, Regexp::IGNORECASE)
I am also unable to understand why the code below returns false:
/(?i:a)/.casefold?
Because (?i:...) sets the i flag locally, not globally. It only applies to the part of the regex within the parentheses, not the whole regex. Of course in this case the whole regex is within the parentheses, but that doesn't matter as far as methods like casefold? are concerned.
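The points made in these answers can be checked side by side in a short sketch. The constant names are illustrative only.

```ruby
PLAIN   = /a/
WRAPPED = Regexp.new(PLAIN.to_s, Regexp::IGNORECASE)     # built from "(?-mix:a)"
CLEAN   = Regexp.new(PLAIN.source, Regexp::IGNORECASE)   # built from "a"
INLINE  = /(?i:a)/

p PLAIN.to_s             # the flags are baked into the string
p PLAIN.source           # just the pattern text
p WRAPPED.casefold?      # the object-level flag is set...
p WRAPPED.match?("A")    # ...but the inner group switches it off again
p CLEAN.match?("A")      # the flag applies to the whole pattern
p INLINE.casefold?       # only the object-level flag is reported
p INLINE.match?("A")     # the inline group still ignores case
```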

Replace single quote with backslash single quote

I have a very large string that needs to escape all the single quotes in it, so I can feed it to JavaScript without upsetting it.
I have no control over the external string, so I can't change the source data.
Example:
Cote d'Ivoir -> Cote d\'Ivoir
(the actual string is very long and contains many single quotes)
I'm trying to do this by using gsub on the string, but can't get this to work:
a = "Cote d'Ivoir"
a.gsub("'", "\\\'")
but this gives me:
=> "Cote dIvoirIvoir"
I also tried:
a.gsub("'", 92.chr + 39.chr)
but got the same result; I know it's something to do with regular expressions, but I never get those.
The %q delimiters come in handy here:
# %q(a string) is equivalent to a single-quoted string
puts "Cote d'Ivoir".gsub("'", %q(\\\')) #=> Cote d\'Ivoir
The problem is that \' in a gsub replacement means "part of the string after the match".
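A minimal sketch of that behaviour, assuming the same input as the question. The helper name `escape_single_quotes` is hypothetical. The block form is used in the helper because a block's return value is inserted as is, without backreference processing.

```ruby
# Hypothetical helper: put a backslash in front of every single quote.
def escape_single_quotes(str)
  str.gsub("'") { "\\'" }   # the block result is used literally
end

a = "Cote d'Ivoir"

puts a.gsub("'", "\\'")        # replacement is \' : inserts the text after the match
puts a.gsub("'", "\\\\'")      # replacement is \\' : a literal backslash, then the quote
puts escape_single_quotes(a)   # same output as the previous line
```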
You're probably best off using either the block syntax:
a = "Cote d'Ivoir"
a.gsub(/'/) {|s| "\\'"}
# => "Cote d\\'Ivoir"
or the Hash syntax:
a.gsub(/'/, {"'" => "\\'"})
There's also the hacky workaround:
a.gsub(/'/, '\#').gsub(/#/, "'")
# prepare a text file containing [ abcd\'efg ]
require "pathname"
backslashed_text = Pathname("/path/to/the/text/file.txt").readlines.first.strip
# puts backslashed_text => abcd\'efg
unslashed_text = "abcd'efg"
unslashed_text.gsub("'", Regexp.escape(%q|\'|)) == backslashed_text # true
# puts unslashed_text.gsub("'", Regexp.escape(%q|\'|)) => abcd\'efg

Resources