I get this result (notice that the first "" is for the preceding empty match):
"babab".split("b")
# => ["", "a", "a"]
By replacing "a" with an empty string in the input above as follows,
"bbb".split("b")
I expected to get the following result:
["", "", ""]
But in reality, I get:
[]
What is the logic behind this?
Logic is described in the documentation:
If the limit parameter is omitted, trailing null fields are suppressed.
Trailing empty fields are removed, but not leading ones.
If, by any chance, what you were asking is "yeah, but where's the logic in that?", then imagine we're parsing some CSV.
fname,sname,id,email,status
,,1,sergio#example.com,
We want the first two position to remain empty (rather than be removed and have fname become 1 and sname - sergio#example.com).
We care less about trailing empty fields. Removed or kept, they don't shift data.
Related
I code a process to process bunch of text files and capture its name if any of 2000 literals exists in it (1 or many). So I'm thinking to combine that many values into one regex, do you think it's doable, I did test for 100 and looks like it's OK. Tx all
Code below depics my flow and sample code, just without looping.
# 1. read regex value list as file [alpha,fox, delta] # 2000 values
# 2. read file into s #5000 files
# 3. find if any of #1 values exists in each #2 file. *with regex tweaks to match format dbname.dob.table
s = '1 dbName.dbo.ALPHA 2 DBNAME.bcd.ALPHA 3 dbName..ALPHA 4 ALPHA 5x dbName.alphA 6x alpha.XX 7x ###dbName.###a.alpha --alpha
dbName..FOX dbName.dbo.DELTA clarity.aba..fox '
value1 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha)(?=\s|$)'
value2 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:fox)(?=\s|$)'
##...
value2000 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:delta)(?=\s|$)'
regex = /#{value1}|#{value2}|#{value2000}/i ## can I union 2000 regex's ???
puts 'reg1: ' + regex.to_s
puts 'result: ' + s.scan(regex).to_s
if s.scan(regex) then puts '...Match!!!d' end
Declaring 2000 variables is highly unnecessary; you should define all values in a single array, then somehow loop through them.
Also, the regular expression is highly repetitive - e.g. the use of (?:dbName\.[a-z]*\.) 2000 times. This can be simplified by grouping all of your values within the non-capture group as follows:
values = %w(alpha fox delta)
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{Regexp.union(values)})(?=\s|$)/
This is the result:
/(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:(?-mix:alpha|fox|delta))(?=\s|$)/
If you extend that values array to contain 2000 strings, the other code does not need to change.
Provided two conditions are met, I would do it as follows, which I think would be far more efficient than using a gigantic regular expression, which, by its nature, requires that a linear search of the "bad words" be performed for each word in the string, until a match is found or it is determined that there are no matches.
We are given a file whose path is contained in a variable fname and an array of bad words:
arr = ["alpha", "fox", "delta", "charlie", "mabel"]
The first condition that I spoke of above is that, by way of example, "ALPHA" and "Alpha" match "alpha", but "aLPha" does not (or some variant of that).
The second condition is that there is a regular expression with a capture group that would capture a bad word if a bad word were present at the given location in a match. For example:
regex = (?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)
Wherever there is a match, the capture group (\p{Alpha}+) would capture a string of one or more alphanumeric characters whose value is assigned to the global variable $1. We will then check to see if the value of $1 is a bad word. (The regular expression might have other capture groups as well, in which case we might be looking for $2 or $3, say, or a named capture group.)
If there were more than one such regular expression to check for, the code below could be executed for each of them until a match is found or it is determined that there are no more matches.
The first step is to convert the array of bad words to a set:
require 'set'
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
#=> #<Set: {"alpha", "Alpha", "ALPHA", "fox", "Fox", "FOX",
# "delta", "Delta", "DELTA", "charlie", "Charlie", "CHARLIE",
# "mabel", "Mabel", "MABEL"}>
This allows very fast word lookups--much faster than stepping through an array. We may then search the file as follows.
rv = IO.foreach(fname).any? do |line|
line.gsub(regex).any? { bad_words.include?($1) }
end
IO::foreach without a block is seen to return an enumerator. We can then chain that to any? to determine if there is a line that contains a match of the regular expression and the value of its capture group is contained in the set bad_words. If such a line is found the search terminates and true is returned; else, false is returned.
It is seen that String#gsub without a block returns an enumerator, which here I've chained to any?. This form of gsub has nothing to do with string replacements; it just generates matches. Those matches are passed to the block, but we are only interested in the contents of the capture group, which are held by $1. Hence the expression bad_words.include?($1).
Is there a way to obtain:
"[][][]".split('[]')
#=> ["", "", ""]
instead of
#=>[]
without having to write a function?
The behavior is surprising here because sometimes irb would respond as expected:
"[]a".split('[]')
#=>["", "a"]`
From the docs:
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
And so:
"[][][]".split("[]", -1)
# => ["", "", "", ""]
This yields four empty strings rather than your three, but if you think about it it's the only result that makes sense. If you split ,,, on each comma you would expect to get four empty strings as well, since there's one empty item "before" the first comma and one "after" the last.
String#split takes two arguments: a pattern to split on, and a limit to the number of results returned. In this case, limit can help us.
The documentation for String#split says:
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if *limit( is 1, the entire string is returned as the only entry in an array).
The key phrase here is trailing null fields are suppressed, in other words, if you have extra, empty matches at the end of the string, they'll be dropped from the result unless you have set a limit.
Here's an example:
"[]a[][]".split("[]")
#=> ["", "a"]
You might expect to get ["", "a", "", ""], but because trailing null fields are suppressed, everything after the last non-empty match (the a) is dropped.
We could set a limit, and only get that many results:
"[]a[][]".split("[]", 3)
#=> ["", "a", "[]"]
In this case, since we've asked for 3 results, the last [] is ignored and forms part of the last result. This is useful when we know how many results we expect, but not so useful in your specific case.
Fortunately, the docs continue:
If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
In other words, we can pass a limit of -1, and get all the matches, even the trailing empty ones:
"[]a[][]".split('[]', -1)
#=> ["", "a", "", ""]
This even works when all the matches are empty:
"[][][]".split('[]', -1)
#=> ["", "", "", ""]
This question already has an answer here:
How do I avoid trailing empty items being removed when splitting strings?
(1 answer)
Closed 8 years ago.
In both Ruby and JavaScript I can write expression " x ".split(/[ ]+/)
. In JavaScript I get somehow reasonable result ["", "x", ""], but in Ruby (2.0.0) i get ["", "x"], which is for me quite counterintuitive. I have problems to understand how regular expressions works in Ruby. Why don't I get the same result as in JavaScript or just ["x"]?
From string#split documentation, emphasis my own:
split(pattern=$;, [limit])
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is the default), str is split on whitespace as if ` ' were specified.
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
So if you were to use " x ".split(/[ ]+/, -1) you would get your expected result of ["", "x", ""]
*edited to reflect Wayne's comment
I found this in the C code for String#split, almost right at the end:
if (NIL_P(limit) && lim == 0) {
long len;
while ((len = RARRAY_LEN(result)) > 0 &&
(tmp = RARRAY_AREF(result, len-1), RSTRING_LEN(tmp) == 0))
rb_ary_pop(result);
}
So it actually pops empty strings off the end of the result array before returning! It looks like the creators of Ruby didn't want String#split to return a bunch of empty strings.
Notice the check for NIL_P(limit) -- this accords exactly with what the documentation says, as #dax pointed out.
This question already has an answer here:
Why does Ruby String#split not treat consecutive trailing delimiters as separate entities?
(1 answer)
Closed 9 years ago.
If I define a string with nulls
string = "a,b,,c,d,e,f,,"
then
string.split(',')
=> ["a", "b", "", "c", "d", "e", "f"]
The empty string between "b" and "c" is accounted for, but the two at the end have been lost. How can I split a string and preserve those trailing empty strings in the returned array?
You need to say:
string.split(',',-1)
to avoid omitting the trailing blanks.
per Why does Ruby String#split not treat consecutive trailing delimiters as separate entities?
The second parameter is the "limit" parameter, documented at http://ruby-doc.org/core-2.0.0/String.html#method-i-split as follows:
If the "limit" parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts result = #{a.to_a.inspect}
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it matches however, the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
So as an other example, +3c+2p-c-4p, I would expect should create:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, there are no meta-characters in a character class, so you only need to escape very few characters (namely, the closing square bracket ] and the dash -, unless you put it at the end of the group). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As ruby doesn't have look-behind I can't check the start of the string, so zerhyju+2P-C-3P is valid (but will only capture +2P-C-3P) whereas +2P-C-3Pzertyuio isn't valid.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).