Ruby split by comma absorbing trailing space - ruby

I need to split a string into two variables. For example, the following would work fine:
first,second = "red,blue".split(',')
I would like to split user input, which might have an optional space after the comma. How do I write it so a space after the comma is absorbed? I need to correctly handle all these possibilities:
"red,blue" # first="red" second="blue"
"red, blue" # first="red" second="blue"
"red,dark blue" # first="red" second="dark blue"
"red, light blue" # first="red" second="light blue"

Just trim the resulting entries. How you do this depends on whether you want to support exactly one space after the comma, or whether you want to remove all leading whitespace (and maybe trailing whitespace too). If your goal is to get words, as your sample suggests, you should just remove all surrounding whitespace.
first,second = "red, blue".split(',').map(&:strip)

There is no regexp in your code - you split using a string, which makes a difference.
"red,blue".split(/\s*,\s*/) should work as you expect.

list.split(/, */)
This is a regular expression that works with or without a space after the comma.
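To compare against the four inputs from the question, here is a minimal sketch of the regexp approach. The method name split_pair is made up for illustration, and the limit of 2 is an assumption that only the first comma separates the two values.

```ruby
# Hypothetical helper: split user input into two values at the first comma,
# absorbing any whitespace around that comma.
def split_pair(input)
  # The limit of 2 keeps any later commas inside the second value.
  input.strip.split(/\s*,\s*/, 2)
end

samples = ["red,blue", "red, blue", "red,dark blue", "red, light blue"]

samples.each do |text|
  first, second = split_pair(text)
  puts "#{text.inspect} -> first=#{first.inspect} second=#{second.inspect}"
end
```

Spaces inside a value ("dark blue") are untouched because the pattern only matches whitespace that is adjacent to the comma.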

Related

Removing trailing newlines with regex in Ruby's 'String#scan'

I have a string, which contains a bunch of HTML documents, tagged with #name:
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"
I want to get an array of two-element arrays, each of which with a tag as the first element and the HTML document as the second:
[ ["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"] ]
In order to solve the problem, I crafted the following regular expression:
regex = /(#.+)\n+([^#]+)\n+/
and applied it in string.scan regex.
However, instead of the desired output, I get the following:
[ ["#one", "<html>\n</html>\n"], ["#two", "<html>\n</html>\n\n"] ]
There are trailing newline characters at the end of each document. It appears that only one newline character was removed from each document, while the others stayed in place.
How can the aforementioned regular expression be changed in order to remove all the trailing characters from the resulting documents?
Only the last \n was thrown away because the two capturing parts of your regex, .+ and [^#]+, are greedy: each captures as much as it can, and [^#]+ gives back only the single \n that the trailing \n+ needs for the match to succeed at all. It does not matter that they are followed by \n+. Remember that a regex is matched from left to right. If some substring (a sequence of \n in this case) can fit in either the preceding part or the following part of a regex, it goes to the preceding part.
More generally, I would suggest doing this:
string.split(/\s+(?=#)/).map{|s| s.strip.split(/\s+/, 2)}
# => [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
You can remove duplicated newlines first:
string.gsub(/\n+/, "\n").scan(regex)
=> [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
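If you would rather keep a single scan call, one possible variation (not taken from the answers above) is to make the document group lazy and require that the trailing newlines run up to the next tag or the end of the string. The name lazy_regex is just for this sketch, and it assumes, like the original regex, that documents never contain a # character.

```ruby
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"

# ([^#]+?) is lazy, so it stops as early as possible; the lookahead forces the
# following \n+ to swallow every newline before the next "#" or the string end.
lazy_regex = /(#.+)\n+([^#]+?)\n+(?=#|\z)/

pairs = string.scan(lazy_regex)
p pairs
```

Because the trailing newlines now belong to \n+ rather than to the capture, no separate strip or gsub pass is needed.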

Replace non-word characters, unless given sequence matches

I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like a case structure: the first alternative that matches consumes the corresponding string, and no further matching is done on it. Thus, ###.*?### will consume a marker (like ###hl###); nothing else will be matched inside it. We also match any sequence of word characters. If either of those is captured, we can just return it as-is ($1). If not, then we match any other character (i.e. not inside a marker, and not a word character) and replace it with "X".
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say your input never contains "\r"; we can then use that character as a temporary replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
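If you only want to protect the two exact delimiters rather than anything of the form ###...###, the alternation trick can be sketched with Regexp.union. The constant and variable names here are made up for illustration.

```ruby
# Assumed list of delimiters to keep verbatim; adjust to your own markers.
DELIMITERS = ["###hl###", "###endhl###"].freeze

str = "Jim-Bob's email ###hl###address###endhl### is: jb#example.com"

# Group 1 captures a whole delimiter or a run of word characters; anything
# else falls through to the single-character alternative and leaves $1 nil.
pattern = /(#{Regexp.union(DELIMITERS)}|\w+)|./m

masked = str.gsub(pattern) { $1 || "X" }
puts masked
```

Regexp.union escapes the strings for you, so delimiters containing regexp metacharacters are matched literally.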

How do I split names with a regular expression?

I am new to Ruby and regular expressions, and I am trying to figure out how to approach separating the following string of baseball players into first/last name combinations.
This is a sample string:
"JohnnyCuetoJ.J.PutzBrianMcCann"
This is the desired output:
Johnny Cueto
J.J. Putz
Brian McCann
I have figured out how to separate by capital letters which gets me close, but the outlier names like J.J. and McCann mess that pattern up. Anyone have ideas on the best way to approach this?
If you don't have to do it in one single gsub then it gets a bit easier.
string = "JohnnyCuetoJ.J.PutzBrianMcCann"
string.gsub!(/([A-Z][^A-Z]+)/, '\1 ') # separate by capital letters
string.gsub!(/(\.) ([A-Z]\.)/, '\1\2') # paste together "J. J." -> "J.J."
string.gsub!(/Mc /, 'Mc') # Remove the space in "Mc "
string.strip! # Remove the extra space after "Cann "
...and of course you can put this on a single line by chaining the gsub calls, but that will basically kill the readability of the code (but on the other hand, how readable is a block of regexen anyway?)
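To get from the spaced-out string to the one-name-per-line output the question asks for, the steps above can be wrapped up roughly as follows. This is only a sketch: the method name is hypothetical, and it assumes every player has exactly one first name and one last name.

```ruby
# Hypothetical helper built from the gsub steps above.
def split_player_names(run_together)
  spaced = run_together.gsub(/([A-Z][^A-Z]+)/, '\1 ')  # separate by capital letters
  spaced = spaced.gsub(/(\.) ([A-Z]\.)/, '\1\2')       # "J. J." -> "J.J."
  spaced = spaced.gsub(/Mc /, 'Mc')                    # "Mc Cann" -> "McCann"
  # Assumption: names come in first/last pairs, so group the words two by two.
  spaced.strip.split(' ').each_slice(2).map { |pair| pair.join(' ') }
end

names = split_player_names("JohnnyCuetoJ.J.PutzBrianMcCann")
puts names
```

Names that break the two-word assumption (a middle name, or prefixes other than "Mc") would need extra rules of their own.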

Match comma separated list with Ruby Regex

Given the following string, I'd like to match the elements of the list and parts of the rest after the colon:
foo,bar,baz:something
I.e. I am expecting the first three match groups to be "foo", "bar", "baz". No commas and no colon. The minimum number of elements is 1, and there can be arbitrarily many. Assume no whitespace and lower case.
I've tried this, which should work, but doesn't populate all the match groups for some reason:
^([a-z]+)(?:,([a-z]+))*:(something)
That matches foo in \1 and baz (or whatever the last element is) in \2. I don't understand why I don't get a match group for bar.
Any ideas?
EDIT: Ruby 1.9.3, if that matters.
EDIT2: Rubular link: http://rubular.com/r/pDhByoarbA
EDIT3: Add colon to the end, because I am not just trying to match the list. Sorry, oversimplified the problem.
This expression works for me: /(\w+)/i
If you want to do it with regex, how about this?
(?<=^|,)("[^"]*"|[^,]*)(?=,|$)
This matches comma-separated fields, including the possibility of commas appearing inside quoted strings like 123,"Yes, No".
More verbosely:
(?<=^|,) # Must be preceded by start-of-line or comma
(
"[^"]*"| # A quote, followed by a bunch of non-quotes, followed by quote, OR
[^,]* # OR anything until the next comma
)
(?=,|$) # Must end with comma or end-of-line
Usage would be with something like Python's re.findall(), which returns all non-overlapping matches in the string (working from left to right, if that matters.) Don't use it with your equivalent of re.search() or re.match() which only return the first match found.
(NOTE: This actually doesn't work in Python because the lookbehind (?<=^|,) isn't fixed width. Grr. Open to suggestions on this one.)
Edit: Use a non-capturing group to consume start-of-line or comma, instead of a lookbehind, and it works in Python.
>>> test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'
>>> m = re.findall('(?:^|,)("[^"]*"|[^,]*)(?=,|$)',test_str)
>>> m
['123', '456', '"String"', '"String, with, commas"',
'"Zero-width fields next"', '', '""', 'nyet', '123']
Edit 2: The Ruby equivalent of Python's re.findall(needle, haystack) is haystack.scan(needle).
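For reference, a rough Ruby translation of the Python session above might look like this. The variable names are only illustrative; note that scan returns one array per match when the pattern contains a capture group, hence the flatten.

```ruby
test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'

# Same pattern as the Python version: consume start-of-line or a comma,
# then capture either a quoted string or everything up to the next comma.
field_pattern = /(?:^|,)("[^"]*"|[^,]*)(?=,|$)/

fields = test_str.scan(field_pattern).flatten
p fields
```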
Maybe split will be better solution for this case?
'foo,bar,baz'.split(',')
=> ["foo", "bar", "baz"]
If I am interpreting your post correctly, you want everything separated by commas before the colon (:).
The appropriate regex for this would be:
[^\s:]*(,[^\s:]*)*(:.*)?
This should find everything you are looking for.
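As for why bar never shows up in the original attempt: a capture group inside a repeated (...)* keeps only the text of its last iteration, so \2 ends up holding baz. A small sketch of that behaviour, followed by a split-based alternative (variable names are just for illustration):

```ruby
line = "foo,bar,baz:something"

# The repeated group overwrites itself on every iteration, so only the
# last list element survives in group 2.
m = line.match(/^([a-z]+)(?:,([a-z]+))*:(something)/)
puts m[1]  # first element
puts m[2]  # last element only
puts m[3]

# Alternative: separate the list from the rest at the first colon,
# then split the list on commas.
list, rest = line.split(':', 2)
items = list.split(',')
p items
p rest
```

The split version handles any number of elements, which a fixed set of match groups cannot do.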

In Ruby, what's the easiest way to "chomp" at the start of a string instead of the end?

In Ruby, I sometimes need to remove the newline character at the beginning of a string. Currently I do the following. I want to know the best way to do this. Thanks.
s = "\naaaa\nbbbb"
s.sub!(/^\n?/, "")
lstrip seems to be what you want (assuming trailing white space should be kept):
>> s = "\naaaa\nbbbb" #=> "\naaaa\nbbbb"
>> s.lstrip #=> "aaaa\nbbbb"
From the docs:
Returns a copy of str with leading whitespace removed. See also
String#rstrip and String#strip.
http://ruby-doc.org/core-1.9.3/String.html#method-i-lstrip
strip will remove all leading and trailing whitespace
s = "\naaaa\nbbbb"
s.strip!
Little hack to chomp a leading newline:
str = "\nmy string"
chomped_str = str.reverse.chomp.reverse
To be perfectly accurate, chomp can delete not only a newline from the end of a string but also an arbitrary suffix passed as an argument.
If the latter functionality is sought, one can use:
"\naaaa\nbbbb".delete_prefix("\n") # double quotes, so that \n is a real newline
Unlike strip, this works for arbitrary characters, exactly like chomp.
So, just for a bit of clarification, there are three ways that you can go about this: sub, reverse.chomp.reverse and lstrip.
I'd recommend against sub because it's a bit less readable, and also because of how it works: it builds a new string from your old string. Plus you need a regular expression for something that's fairly simple.
So then you're down to reverse.chomp.reverse and lstrip. Most likely, you want lstrip because it's a bit faster, but keep in mind that the strip operations are not the same as the chomp operations. strip will remove all leading newlines and whitespace:
"\n aaa\nbbb".reverse.chomp.reverse # => " aaa\nbbb"
"\n aaa\nbbb".lstrip # => "aaa\nbbb"
If you want to make sure you only remove one character and that it's definitely a newline, use the reverse.chomp.reverse solution. If you consider all leading newlines and whitespace garbage, go with lstrip.
The one case I can think of for using regular expressions is when you have an unknown number of \r and \n characters at the beginning and want to trim them all without touching any other whitespace. You could use a loop and more String methods for trimming, but it would just be uglier. The performance implications don't really matter that much.
s.sub(/^[\n\r]*/, '')
This removes leading newlines (carriage returns and line feeds, as in chomp), not any whitespace.
Not sure if it's the best way but you could try:
s.reverse.chomp.reverse
if you want to leave the trailing newline (if it exists).
This should work for you: s.strip.
A way to do this for whitespace or non-whitespace characters is like this:
s = "\naaaa\nbbbb"
s.slice!("\n") # returns "\n" but s also has the first newline removed.
puts s # shows s has the first newline removed
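To see how the suggestions above differ, here is a small side-by-side sketch on a string that starts with a newline followed by a space. The hash and its keys are only for display; delete_prefix assumes Ruby 2.5 or newer, and \A is used instead of ^ so the pattern can only match at the very start of the string.

```ruby
s = "\n aaaa\nbbbb"

results = {
  lstrip:        s.lstrip,                  # drops all leading whitespace
  delete_prefix: s.delete_prefix("\n"),     # drops exactly one leading "\n"
  sub_anchor:    s.sub(/\A\n/, ""),         # same, via a regexp anchored at the start
  reverse_chomp: s.reverse.chomp.reverse    # the reverse/chomp hack
}

results.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Only lstrip also removes the space after the newline; the other three leave it in place.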

Resources