Complex requirements for string split around select commas - ruby

TL;DR
I need some help making a regex that will match any commas in a string that are side by side with unlimited white space around them and between them. The commas and their surrounding white space cannot be within matching single quotes or double quotes. I then need to capture the non-whitespace values from around those commas and count how many of those commas there are.
The values captured from around the commas will become their own values in the final array, while the commas that were counted will become nil values that are added to the final array.
Explanation of the problem:
This is a pretty complex problem so any help is greatly appreciated. I'm adding functionality to a library I've been using for a while now. I have this string that contains an array
"['d,og,f:asdf,:hello,",,\",,alsee',,,'ho,la', "-123,4,5.3", true, :good, false,,, "gr\'\'\'true,\',\'ee\"n", ":::testme", true]"
I would like to split this string only around select commas so that I have an array containing the following values
'd,og,f:asdf,:hello,",,\",,alsee'
nil
nil
'ho,la'
"-123,4,5.3"
true
:good
false
nil
nil
"gr\'\'\'true,\',\'ee\"n"
":::testme"
true
The nil values come from the side-by-side commas that are not contained in any string. I wrote the following regex to split the string above (I had already removed the start and end brackets):
/(?<=(?:['\"]|false|true|^|,)),(?=(?:\s*(?:(?::[\w]+)|(?:(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*'))|(?:false|true)))\s*(?:,|$)))/
This splits the string so I get these values:
(0) "'d,og,f:asdf,:hello,",,\",,alsee',,"
(1) "'ho,la'"
(2) " "-123,4,5.3""
(3) " true"
(4) " :good, false,,"
(5) " "gr\'\'\'true,\',\'ee\"n""
(6) " ":::testme""
(7) " true"
All the values are strings as can be seen by their surrounding double quotes. They will not all end up that way though. A true or false will be converted to a boolean. The values surrounded by internal quotes will end up as strings. Then a value preceded with a : will end up as a symbol.
There are problems with the values at index 0 and 4. Index 0 should be this:
(0.0) "'d,og,f:asdf,:hello,",,\",,alsee'"
(0.1) nil
(0.2) nil
As you can see, the two commas at the end are gone. They have become the two nil values you see above. Then the string starts at the first single quote and ends at the last single quote, signifying that this value in the array is a string.
Then index 4 (" :good, false,,") should be this:
(4.0) " :good"
(4.1) " false"
(4.2) nil
(4.3) nil
The two commas at the end have become nil. Then " false" is its own value, which will later be converted to a boolean, while " :good" is also its own value and will later be converted to a symbol.
To fix the problem with index 4 I have all the values run through a second regex. Here it is:
/^(\s*:(?:(?:[\w]+|\"[\s\S]+\"|'[\s\S]+')\s*)),([\s\S]*)$/
Instead of splitting this one I get the capture groups. It ends up returning this array for the value at index 4:
(4.0) " :good"
(4.1) " false,,"
That's what I wanted except for one problem. The value at index 4.1 (" false,,") has the two trailing commas which should be nil values in the array.

"['d,og,f:asdf,:hello,"
,,\
",,alsee',,,'ho,la', "
-123,4,5.3
", true, :good, false,,, "
gr\
'\'
I count 4 strings. 3 in double quotes and the last one in single quotes?
You say this is broken down into smaller strings by your regex. But what about the characters outside the 4 strings?
Sorry, it looks a bit of a mess.
Try putting it all in a here-document string and then breaking it down with a regex.

I finally figured it out myself. You can see how it fits in with the rest if you look at the description of the question above.
/^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/
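A minimal sketch of how the regex above could be used to turn one chunk into its value plus nil entries. The method name expand_chunk and the constant CHUNK are hypothetical, and the sketch assumes each chunk holds at most one value with optional leading or trailing commas, as in the question's examples.

```ruby
# Regex copied from the answer above. Group 1 holds any leading commas,
# group 3 the value (if present), group 4 any trailing commas.
CHUNK = /^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/

# Hypothetical helper: returns the value found in the chunk, surrounded by
# one nil per comma counted before and after it.
def expand_chunk(chunk)
  m = CHUNK.match(chunk)
  return [chunk] unless m            # leave unrecognised chunks untouched

  result = Array.new(m[1].count(","))  # nils for leading commas
  result << m[3] if m[3]               # the value itself, still a string
  result.concat(Array.new(m[4].count(",")))  # nils for trailing commas
end

false_chunk  = expand_chunk(" false,,")
string_chunk = expand_chunk("'ho,la'")
symbol_chunk = expand_chunk(" :good")
```

Converting the captured strings to booleans, symbols, or unquoted strings would still happen in a later step, as described in the question.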

Related

Ruby: %q with strings [case]

I encountered this line:
at = @seq.slice(@seq.length - 2, 2).count(%q[at])
where @seq is a string. I know how slice and %q work, but I don't get the idea of putting a variable at (which we define here) as an argument of [] after %q.
It is very verbose code.
@seq.length - 2 gives the index of the second-to-last character in @seq.
@seq.slice(@seq.length - 2, 2) gives the last two characters in @seq.
Applying count(%q[at]) to it returns the number of occurrences of characters in %q[at] (i.e., "at") in it, which counts "a" and "t". Since there are only two characters, it would be either 0, 1, or 2.
%q with paired delimiters are similar to the single quoted strings. In other words, %q[at], or %q!at!, or %q{at}, are all equivalent to 'at'.
%q[at]
# => "at"
P.S. %Q works similarly, but like double-quoted strings.
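A small sketch of the steps described above, using a made-up local variable seq in place of @seq. It shows that the at inside %q[at] is just the literal string 'at', not the variable being assigned.

```ruby
seq = "gattacat"                          # example data, not from the question

last_two = seq.slice(seq.length - 2, 2)   # last two characters; same as seq[-2, 2]
at = last_two.count(%q[at])               # counts characters that are "a" or "t"

same_literal = (%q[at] == 'at')           # %q[...] is a single-quoted string
one_match    = "ag".count(%q[at])         # only "a" is counted here
```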

Removing all whitespace from a string in Ruby

How can I remove all newlines and spaces from a string in Ruby?
For example, if we have a string:
"123\n12312313\n\n123 1231 1231 1"
It should become this:
"12312312313123123112311"
That is, all whitespaces should be removed.
You can use something like:
var_name.gsub!(/\s+/, '')
Or, if you want to return the changed string, instead of modifying the variable,
var_name.gsub(/\s+/, '')
This will also let you chain it with other methods (e.g. something_else = var_name.gsub(...).to_i to strip the whitespace and then convert the result to an integer). gsub! edits the string in place, so you'd have to write var_name.gsub!(...); something_else = var_name.to_i. Strictly speaking, as long as at least one change is made, gsub! returns the new version (i.e. the same thing gsub would return), but if the string happens to contain no whitespace, it returns nil and things will break. Because of that, I'd prefer gsub if you're chaining methods.
gsub works by replacing any matches of the first argument with the contents of the second argument. In this case, it matches any sequence of consecutive whitespace characters (or just a single one) with the regex /\s+/, then replaces those with an empty string. There's also a block form if you want to do some processing on the matched part, rather than just replacing directly; see String#gsub for more information about that.
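A short sketch of the difference in return values, using made-up sample strings. The .dup calls are only there so the example does not mutate string literals.

```ruby
spaced = "123\n12312313\n\n123 1231 1231 1"
clean  = "12345"

stripped     = spaced.gsub(/\s+/, '')      # new string, original untouched
still_string = clean.gsub(/\s+/, '')       # nothing to replace, still a string

bang_changed   = spaced.dup.gsub!(/\s+/, '')  # returns the modified string
bang_unchanged = clean.dup.gsub!(/\s+/, '')   # returns nil: no substitution made

chained = spaced.gsub(/\s+/, '').length    # safe to chain on gsub
```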
The Ruby docs for the class Regexp are a good starting point to learn more about regular expressions -- I've found that they're useful in a wide variety of situations where a couple of milliseconds here or there don't count and you don't need to match things that can be nested arbitrarily deeply.
As Gene suggested in his comment, you could also use tr:
var_name.tr(" \t\r\n", '')
It works in a similar way, but instead of matching a regex, it replaces every occurrence of the nth character of the first argument with the nth character of the second argument. If the second argument is shorter than the first, it is padded with its last character; if it is empty, as here, the listed characters are simply deleted. See String#tr for more information.
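A brief sketch of how tr treats its second argument, with made-up inputs. The "hippo" case is the example from the String#tr documentation.

```ruby
deleted = "a b\tc\n".tr(" \t\r\n", '')   # empty second argument: characters removed
mapped  = "hello".tr('el', 'ip')          # e -> i, l -> p
padded  = "hello".tr('el', 'x')           # 'x' is reused for both e and l
```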
You could also use String#delete:
str = "123\n12312313\n\n123 1231 1231 1"
str.delete "\s\n"
#=> "12312312313123123112311"
You could use String#delete! to modify str in place, but note that delete! returns nil if no change is made.
Alternatively you could scan the string for digits /\d+/ and join the result:
string = "123\n\n12312313\n\n123 1231 1231 1\n"
string.scan(/\d+/).join
#=> "12312312313123123112311"
Please note that this would also remove alphabetical characters, dashes, symbols, basically everything that is not a digit.

Splitting with empty space in Ruby [duplicate]

This question already has an answer here:
How do I avoid trailing empty items being removed when splitting strings?
(1 answer)
Closed 8 years ago.
In both Ruby and JavaScript I can write the expression " x ".split(/[ ]+/). In JavaScript I get the somewhat reasonable result ["", "x", ""], but in Ruby (2.0.0) I get ["", "x"], which I find quite counterintuitive. I have trouble understanding how regular expressions work in Ruby. Why don't I get the same result as in JavaScript, or just ["x"]?
From the String#split documentation, emphasis my own:
split(pattern=$;, [limit])
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is the default), str is split on whitespace as if ` ' were specified.
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
So if you were to use " x ".split(/[ ]+/, -1) you would get your expected result of ["", "x", ""]
*edited to reflect Wayne's comment
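The three behaviours described in the documentation excerpt can be compared side by side. This is a minimal sketch using the string from the question.

```ruby
padded = " x "

default_split   = padded.split(/[ ]+/)      # trailing empty field is dropped
keep_trailing   = padded.split(/[ ]+/, -1)  # negative limit keeps trailing empties
whitespace_mode = padded.split(" ")         # single-space string: leading whitespace ignored too
```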
I found this in the C code for String#split, almost right at the end:
if (NIL_P(limit) && lim == 0) {
    long len;
    while ((len = RARRAY_LEN(result)) > 0 &&
           (tmp = RARRAY_AREF(result, len-1), RSTRING_LEN(tmp) == 0))
        rb_ary_pop(result);
}
So it actually pops empty strings off the end of the result array before returning! It looks like the creators of Ruby didn't want String#split to return a bunch of empty strings.
Notice the check for NIL_P(limit) -- this accords exactly with what the documentation says, as @dax pointed out.

String gsub - Replace characters between two elements, but leave surrounding elements

Suppose I have the following string:
mystring = "start/abc123/end"
How can you splice out the abc123 with something else, while leaving the "start/" and "/end" elements intact?
I had the following to match for the pattern, but it replaces the entire string. I was hoping to just have it replace the abc123 with 123abc.
mystring.gsub(/start\/(.*)\/end/,"123abc") #=> "123abc"
Edit: The characters between the start & end elements can be any combination of alphanumeric characters, I changed my example to reflect this.
You can do it using this character class : [^\/] (all that is not a slash) and lookarounds
mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/,"7")
For your example, you could perhaps use:
mystring.gsub(/\/(.*?)\//,"/7/")
This matches the two slashes around the string you're replacing and puts them back in the substitution.
Alternatively, you could capture the pieces of the string you want to keep and interpolate them around your replacement. This turns out to be much more readable than lookaheads/lookbehinds:
irb(main):010:0> mystring.gsub(/(start)\/.*\/(end)/, "\\1/7/\\2")
=> "start/7/end"
\\1 and \\2 here refer to the numbered captures inside of your regular expression.
The problem is that you're replacing the entire matched string, "start/8/end", with "7". You need to include the matched characters you want to persist:
mystring.gsub(/start\/(.*)\/end/, "start/7/end")
Alternatively, just match the digits:
mystring.gsub(/\d+/, "7")
You can do this by grouping the start and end elements in the regular expression and then referring to these named groups in the substitution string with \\k<name>:
mystring.gsub(/(?<start>start\/).*(?<end>\/end)/, "\\k<start>7\\k<end>")
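The approaches from the answers above can be compared on the question's string. This is a sketch only; the replacement text "xyz" and the group names head and tail are arbitrary choices for illustration.

```ruby
mystring = "start/abc123/end"

# Lookarounds: only the middle part is matched, so only it is replaced.
with_lookarounds = mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/, "xyz")

# Numbered captures: keep the ends and put them back via \1 and \2.
with_numbered = mystring.gsub(/(start\/).*(\/end)/, '\1xyz\2')

# Named captures: the same idea, referenced with \k<name>.
with_named = mystring.gsub(/(?<head>start\/).*(?<tail>\/end)/, '\k<head>xyz\k<tail>')

# Block form: avoids backslash escapes in the replacement altogether.
with_block = mystring.gsub(/(start\/).*(\/end)/) { "#{$1}xyz#{$2}" }
```

The block form is also the safer choice when the replacement text begins with a digit, since a digit placed directly after \1 is easy to misread.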

ruby parametrized regular expression

I have a string like "{some|words|are|here}" or "{another|set|of|words}"
So in general the string consists of an opening curly bracket, words delimited by a pipe, and a closing curly bracket.
What is the most efficient way to get the selected word of that string ?
I would like do something like this:
@my_string = "{this|is|a|test|case}"
@my_string.get_column(0) # => "this"
@my_string.get_column(2) # => "a"
@my_string.get_column(4) # => "case"
What should the method get_column contain ?
So this is the solution I like right now:
class String
  def get_column(n)
    self =~ /\A\{(?:\w*\|){#{n}}(\w*)(?:\|\w*)*\}\Z/ && $1
  end
end
We use a regular expression to make sure that the string is of the correct format, while simultaneously grabbing the correct column.
Explanation of regex:
\A is the beginning of the string and \Z is the end, so this regex matches the entire string.
Since curly braces have a special meaning we escape them as \{ and \} to match the curly braces at the beginning and end of the string.
Next, we want to skip the first n columns - we don't care about them.
A previous column is some number of letters followed by a vertical bar, so we use the standard \w to match a word-like character (includes numbers and underscore, but why not) and * to match any number of them. Vertical bar has a special meaning, so we have to escape it as \|. Since we want to group this, we enclose it all inside non-capturing parens (?:\w*\|) (the ?: makes it non-capturing).
Now we have n of the previous columns, so we tell the regex to match the column pattern n times using the count regex - just put a number in curly braces after a pattern. We use standard string substitution, so we just put in {#{n}} to mean "match the previous pattern exactly n times".
The first non-skipped column after that is the one we care about, so we put that in capturing parens: (\w*)
Then we skip the rest of the columns, if any exist: (?:\|\w*)*.
Capturing the column puts it into $1, so we return that value if the regex matched. If not, we return nil, since this String has no nth column.
In general, if you wanted to have more than just words in your columns (like "{a phrase or two|don't forget about punctuation!|maybe some longer strings that have\na newline or two?}"), then just replace all the \w in the regex with [^|{}] so you can have each column contain anything except a curly-brace or a vertical bar.
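A sketch of the [^|{}] variant just described, written as a standalone method rather than a String extension. The method name column is hypothetical, and \z is used here instead of \Z so that a trailing newline is not accepted.

```ruby
# Returns the nth (0-based) column of a "{a|b|c}" string, or nil when the
# string is malformed or has no such column.
def column(str, n)
  m = /\A\{(?:[^|{}]*\|){#{n}}([^|{}]*)(?:\|[^|{}]*)*\}\z/.match(str)
  m && m[1]
end

words   = "{this|is|a|test|case}"
phrases = "{a phrase or two|don't forget about punctuation!}"

first   = column(words, 0)
third   = column(words, 2)
last    = column(words, 4)
missing = column(words, 5)      # only five columns exist
phrase  = column(phrases, 1)
```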
Here's my previous solution
class String
  def get_column(n)
    raise "not a column string" unless self =~ /\A\{\w*(?:\|\w*)*\}\Z/
    self[1 .. -2].split('|')[n]
  end
end
We use a similar regex to make sure the String contains a set of columns, or raise an error. Then we strip the curly braces from the front and back (using self[1 .. -2] to take the substring starting at the second character, index 1, and ending at the next-to-last character), split the columns using the pipe character (using .split('|') to create an array of columns), and then find the nth column (using standard Array lookup with [n]).
I just figured as long as I was using the regex to verify the string, I might as well use it to capture the column.
