ruby string split with terminal strings empty [duplicate] - ruby

This question already has an answer here:
Why does Ruby String#split not treat consecutive trailing delimiters as separate entities?
(1 answer)
Closed 9 years ago.
If I define a string with nulls
string = "a,b,,c,d,e,f,,"
then
string.split(',')
=> ["a", "b", "", "c", "d", "e", "f"]
The empty string between "b" and "c" is accounted for, but the two at the end have been lost. How can I split a string and preserve those trailing empty strings in the returned array?

You need to say:
string.split(',',-1)
to avoid omitting the trailing blanks.
See: Why does Ruby String#split not treat consecutive trailing delimiters as separate entities?
The second parameter is the "limit" parameter, documented at http://ruby-doc.org/core-2.0.0/String.html#method-i-split as follows:
If the "limit" parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
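To see the three cases from the quoted documentation side by side, here is a small sketch using the string from the question. The helper name split_variants is made up for illustration; only String#split itself comes from Ruby.

```ruby
# Hypothetical helper: collects the result of each kind of limit for comparison.
def split_variants(string, delimiter)
  {
    default:  string.split(delimiter),      # limit omitted: trailing empties dropped
    negative: string.split(delimiter, -1),  # negative limit: trailing empties kept
    limit_3:  string.split(delimiter, 3)    # positive limit: at most 3 fields
  }
end

variants = split_variants("a,b,,c,d,e,f,,", ",")
variants.each { |name, fields| puts "#{name}: #{fields.inspect}" }
# default:  ["a", "b", "", "c", "d", "e", "f"]
# negative: ["a", "b", "", "c", "d", "e", "f", "", ""]
# limit_3:  ["a", "b", ",c,d,e,f,,"]
```

Note that with a positive limit the remainder of the string, delimiters included, ends up in the last field.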

Related

Splitting into empty substrings

I get this result (notice that the first "" is for the preceding empty match):
"babab".split("b")
# => ["", "a", "a"]
By replacing "a" with an empty string in the input above as follows,
"bbb".split("b")
I expected to get the following result:
["", "", ""]
But in reality, I get:
[]
What is the logic behind this?
Logic is described in the documentation:
If the limit parameter is omitted, trailing null fields are suppressed.
Trailing empty fields are removed, but not leading ones.
If, by any chance, what you were asking is "yeah, but where's the logic in that?", then imagine we're parsing some CSV.
fname,sname,id,email,status
,,1,sergio@example.com,
We want the first two positions to remain empty (rather than be removed, which would make fname become 1 and sname become sergio@example.com).
We care less about trailing empty fields. Removed or kept, they don't shift data.
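A minimal sketch of that CSV argument, assuming a naive comma split is good enough for the data (real CSV should go through a proper parser). The parse_row helper is hypothetical.

```ruby
# Hypothetical helper: pairs header names with the fields of one row.
def parse_row(header_line, row_line)
  keys   = header_line.split(",")
  values = row_line.split(",")  # leading empties are kept, the trailing one is dropped
  keys.zip(values).to_h         # zip fills the missing trailing value with nil
end

record = parse_row("fname,sname,id,email,status", ",,1,sergio@example.com,")
p record
# {"fname"=>"", "sname"=>"", "id"=>"1", "email"=>"sergio@example.com", "status"=>nil}
```

The leading empty fields keep id and email in their columns; the dropped trailing field only shows up as nil instead of "".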

How does pack work in Ruby?

I am a tad confused about what I see here:
a = [ "a", "b", "c" ]
n = [ 65, 66, 67 ]
a.pack("A3A3A3") #=> "a  b  c  "
a.pack("a3a3a3") #=> "a\000\000b\000\000c\000\000"
n.pack("ccc") #=> "ABC"
From the docs:
Packs the contents of arr into a binary sequence according to the directives in aTemplateString (see the table below). Directives "A", "a", and "Z" may be followed by a count, which gives the width of the resulting field.
Here are the directives:
So we're using the A directive 3 times it seems? What does it mean to pack the string a into an arbitrary binary string (space padded, count is width?) Can you help me understand the output? Why are there so many 0s?
In the first case, you're printing "a" but padding its length to 3 with spaces, hence the two spaces to get the total length to 3.
In the second case, you're doing the same but padding with null bytes instead (ASCII value 0). Null bytes in Ruby are printed (and can be read) using the escape syntax \000 (this is one character), so \000\000 is actually just two null bytes.
The variable n is irrelevant, so you can ignore it.
In the pack statements, the bytes "a", "b" and "c" are concatenated ("packed") into a single string, with padding between them. The padding is such that the number of bytes (the width) taken up by the contents plus the padding equals the number provided.
So in the first pack statement, the "a" is padded with two spaces to make these three bytes: "a.." where I've put a . in place of the spaces to make it clear. That is concatenated with the "b" and the "c" similarly padded, to produce "a..b..c..".
In the second pack statement, null characters ('\000') are used instead of spaces. The \xxx notation (called an "escape sequence") means the byte with octal value xxx. It's used when there isn't a useful ASCII character (like 'a' or ' ') to show. A null character has no useful ASCII character, so the \xxx notation is used instead.
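If the escape sequences are hard to read, looking at the raw bytes may help. This is just a sketch; pad_fields is a made-up helper that builds the template string, while Array#pack, String#bytes and String#unpack are standard Ruby.

```ruby
# Hypothetical helper: packs every element into a fixed-width field.
# directive is "A" (space padded) or "a" (null padded).
def pad_fields(values, directive, width)
  values.pack("#{directive}#{width}" * values.size)
end

a = ["a", "b", "c"]
space_padded = pad_fields(a, "A", 3)
null_padded  = pad_fields(a, "a", 3)

p space_padded.bytes  # [97, 32, 32, 98, 32, 32, 99, 32, 32]  (32 is a space)
p null_padded.bytes   # [97, 0, 0, 98, 0, 0, 99, 0, 0]        (0 is the null byte)

# Unpacking with "A" strips the trailing padding again.
p space_padded.unpack("A3A3A3")  # ["a", "b", "c"]
```

Each field is exactly three bytes wide, so the packed string is nine bytes long in both cases.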

Split doesn't return empty string

Is there a way to obtain:
"[][][]".split('[]')
#=> ["", "", ""]
instead of
#=>[]
without having to write a function?
The behavior is surprising here because sometimes irb would respond as expected:
"[]a".split('[]')
#=>["", "a"]
From the docs:
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
And so:
"[][][]".split("[]", -1)
# => ["", "", "", ""]
This yields four empty strings rather than your three, but if you think about it it's the only result that makes sense. If you split ,,, on each comma you would expect to get four empty strings as well, since there's one empty item "before" the first comma and one "after" the last.
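The counting argument can be checked quickly: for a non-empty string, n delimiters give n + 1 fields once trailing empties are kept. The field_count and delimiter_count helpers below are hypothetical names used only for this sketch.

```ruby
# Hypothetical helpers for comparing delimiter and field counts.
def field_count(string, delimiter)
  string.split(delimiter, -1).length  # -1 keeps trailing empty fields
end

def delimiter_count(string, delimiter)
  string.scan(delimiter).length       # non-overlapping occurrences
end

[[",,,", ","], ["[][][]", "[]"], ["a,b", ","]].each do |string, delimiter|
  puts "#{string.inspect}: #{delimiter_count(string, delimiter)} delimiters, " \
       "#{field_count(string, delimiter)} fields"
end
# ",,,": 3 delimiters, 4 fields
# "[][][]": 3 delimiters, 4 fields
# "a,b": 1 delimiters, 2 fields
```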
String#split takes two arguments: a pattern to split on, and a limit to the number of results returned. In this case, limit can help us.
The documentation for String#split says:
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array).
The key phrase here is trailing null fields are suppressed, in other words, if you have extra, empty matches at the end of the string, they'll be dropped from the result unless you have set a limit.
Here's an example:
"[]a[][]".split("[]")
#=> ["", "a"]
You might expect to get ["", "a", "", ""], but because trailing null fields are suppressed, everything after the last non-empty match (the a) is dropped.
We could set a limit, and only get that many results:
"[]a[][]".split("[]", 3)
#=> ["", "a", "[]"]
In this case, since we've asked for 3 results, the last [] is ignored and forms part of the last result. This is useful when we know how many results we expect, but not so useful in your specific case.
Fortunately, the docs continue:
If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
In other words, we can pass a limit of -1, and get all the matches, even the trailing empty ones:
"[]a[][]".split('[]', -1)
#=> ["", "a", "", ""]
This even works when all the matches are empty:
"[][][]".split('[]', -1)
#=> ["", "", "", ""]

Splitting with empty space in Ruby [duplicate]

This question already has an answer here:
How do I avoid trailing empty items being removed when splitting strings?
(1 answer)
Closed 8 years ago.
In both Ruby and JavaScript I can write the expression " x ".split(/[ ]+/). In JavaScript I get the somewhat reasonable result ["", "x", ""], but in Ruby (2.0.0) I get ["", "x"], which is quite counterintuitive to me. I have trouble understanding how regular expressions work in Ruby. Why don't I get the same result as in JavaScript, or just ["x"]?
From the String#split documentation, emphasis my own:
split(pattern=$;, [limit])
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is the default), str is split on whitespace as if ` ' were specified.
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
So if you were to use " x ".split(/[ ]+/, -1) you would get your expected result of ["", "x", ""]
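For comparison, here is a short sketch of the variants discussed in the documentation excerpt, applied to the string from the question. The whitespace_splits helper is a made-up name.

```ruby
# Hypothetical helper: runs the different split styles on the same input.
def whitespace_splits(string)
  {
    regexp:          string.split(/[ ]+/),      # trailing empty field suppressed
    regexp_no_limit: string.split(/[ ]+/, -1),  # trailing empty field kept
    single_space:    string.split(" ")          # awk-style: leading whitespace ignored too
  }
end

whitespace_splits(" x ").each { |name, fields| puts "#{name}: #{fields.inspect}" }
# regexp: ["", "x"]
# regexp_no_limit: ["", "x", ""]
# single_space: ["x"]
```

So the JavaScript-like result needs the -1 limit, while the plain ["x"] result comes from splitting on a single-space string rather than a regexp.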
*edited to reflect Wayne's comment
I found this in the C code for String#split, almost right at the end:
if (NIL_P(limit) && lim == 0) {
    long len;
    while ((len = RARRAY_LEN(result)) > 0 &&
           (tmp = RARRAY_AREF(result, len-1), RSTRING_LEN(tmp) == 0))
        rb_ary_pop(result);
}
So it actually pops empty strings off the end of the result array before returning! It looks like the creators of Ruby didn't want String#split to return a bunch of empty strings.
Notice the check for NIL_P(limit) -- this accords exactly with what the documentation says, as @dax pointed out.
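A rough Ruby rendering of that C loop, as a sketch only: it is not how Ruby implements split internally beyond the popping step shown above, and split_dropping_trailing_empties is a hypothetical name.

```ruby
# Hypothetical re-creation of the default behaviour: split without suppression,
# then pop empty strings off the end, like the C loop does.
def split_dropping_trailing_empties(string, pattern)
  result = string.split(pattern, -1)
  result.pop while !result.empty? && result.last.empty?
  result
end

samples = [["a,b,,c,d,e,f,,", ","], ["bbb", "b"], ["[]a[][]", "[]"], [" x ", /[ ]+/]]
samples.each do |string, pattern|
  same = split_dropping_trailing_empties(string, pattern) == string.split(pattern)
  puts "#{string.inspect}: matches plain split? #{same}"
end
```

For these samples the helper agrees with a plain split, which is consistent with the trailing-field suppression being a final cleanup pass.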

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts "result = #{a.to_a.inspect}"
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it does match, but the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract the middle pattern "-C" into the array.
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P".
Does anyone know how I can modify my pattern to capture the middle matches?
As another example, for +3c+2p-c-4p I would expect to get:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding of how character classes work. This:
[C|P]
is wrong, unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, most meta-characters lose their special meaning inside a character class, so you only need to escape very few characters (namely, the closing square bracket ] and the dash -, unless you put it at the end of the class). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
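A quick way to see this, using the reduced regex from above with the /i flag from the question (the captures_for helper is a made-up name):

```ruby
# Hypothetical helper: returns the captured groups, or nil when there is no match.
def captures_for(input)
  match = /^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$/i.match(input)
  match && match.captures
end

p captures_for("+2P-c-3p")
# ["+2", "P", "-3", "p"]
# Groups 3 and 4 matched twice ("-", "c" and then "-3", "p");
# only the last occurrence is retained.

p captures_for("C")
# ["", "C", nil, nil]
# Still four groups; the ones that never took part in the match are nil.
```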
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. Ruby 1.8 doesn't have look-behind (Ruby 1.9 and later do), so I can't check the start of the string: zerhyju+2P-C-3P is accepted (but only +2P-C-3P is captured), whereas +2P-C-3Pzertyuio is rejected.
If you want to both capture and check that your string is valid, the best way (IMO) is to use two regexes: one to check the value, ^(?:[+-]?\d?[CP])*$, and a second one to capture, ([+-]?\d?)(C|P). (You could also use ([CP]) for the last part.)
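A minimal sketch of that two-regex approach, assuming the /i flag from the question and String#scan for the capturing step. The extract_terms helper is a hypothetical name, and returning nil for invalid input is just one possible convention.

```ruby
# Hypothetical helper: validates the whole field first, then collects every term.
def extract_terms(field)
  return nil unless field =~ /^(?:[+-]?\d?[CP])*$/i  # validity check
  field.scan(/([+-]?\d?)([CP])/i)                    # one [sign+digit, letter] pair per term
end

p extract_terms("+2P-c-3p")
# [["+2", "P"], ["-", "c"], ["-3", "p"]]
p extract_terms("+3c+2p-c-4p")
# [["+3", "c"], ["+2", "p"], ["-", "c"], ["-4", "p"]]
p extract_terms("+2P-C-3Pzertyuio")
# nil
```

Because scan returns one entry per term, the number of results grows with the input, which a single match with a fixed number of groups cannot do.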

Resources