Better way to tokenize string in Ruby? - ruby

I have a string of arbitrary characters. I would like to turn it into an array, where each character is in a single array element, EXCEPT of successive word-characters (\w+), which should end up together in one array element. Example:
ab.:u/87z
should become
['ab','.',':','u','/','87z']
My first approach went like this:
mystring.split(/\b/)
Of course this groups together non-word characters:
['ab','.:','u','/','87','z']
I can take them apart in a subsequent step, but I'm looking for a more elegant way. Next I tried these:
mystring.split(/(\w+|\W)/)
mystring.split(/(\b|\W)/)
Both return nearly the desired result, only that they also return array elements containing empty strings, so I have to write something like
mystring.split(/(\b|\W)/).reject(&:empty?)
Now my question: Is there a simpler way to do this?
UPDATE: I made a silly mistake when I explained my example. Of course '87' and 'z' should be together, i.e. '87z'. I fixed my example.

'ab.:u/87z'.scan(/\w+|./) #=>["ab", ".", ":", "u", "/", "87z"]
I'm not exactly sure what you want because you said word-characters (\w+) but split the 87 and z. If I'm correct, \w should match letters, digits and underscores. Hence the "87z".
'ab.:u/87z'.scan(/[A-Za-z]+|\d+|./) #=>["ab", ".", ":", "u", "/", "87", "z"]
You could always do this to achieve what you showed there though

Don't use split, use the scan method:
> "ab.:u/87z".scan(/\w+|\W/)
=> ["ab", ".", ":", "u", "/", "87z"]

Related

How to use gsubstitution with more letters

I've printed the code, wit ruby
string = "hahahah"
pring string.gsub("a","b")
How do I add more letter replacements into gsub?
string.gsub("a","b")("h","l") and string.gsub("a","b";"h","l")
didnt work...
*update I have tried this too but without any success .
letters = {
"a" => "l"
"b" => "n"
...
"z" => "f"
}
string = "hahahah"
print string.gsub(\/w\,letters)
You're overcomplicating. As with most method calls in Ruby, you can simply chain #gsub calls together, one after the other:
str = 'adfh'
print str.gsub("a","b").gsub("h","l") #=> 'bdfl'
What you're doing here is applying the second #gsub to the result of the first one.
Of course, that gets a bit long-winded if you do too many of them. So, when you find yourself stringing too many together, you'll want to look for a regex solution. Rubular is a great place to tinker with them.
The way to use your hash trick with #gsub and a regex expression is to provide a hash for all possible matches. This has the same result as the two #gsub calls:
print str.gsub(/[ah]/, {'a'=>'b', 'h'=>'l'}) #=> 'bdfl'
The regex matches either a or h (/[ah]/), and the hash is saying what to substitute for each of them.
All that said, str.tr('ah', 'bl') is the simplest way to solve your problem as specified, as some commenters have mentioned, so long as you are working with single letters. If you need to work with two or more characters per substitution, you'll need to use #gsub.

How use match in ruby?

Im trying to get the uppercase words from a text. How i can use .match() for this?
Example
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
and I need something like:
r = /[A-Z]/
puts r.match(text)
I never used match and i need a method that gets all uppercase words (Acronym).
If you only want acronyms, you can use something like:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
text.scan(/\b[A-Z]+\b/)
# => ["PS"]
It's important to match entire words, which is where \b helps, as it marks word boundaries.
The problem is when your text contains single, stand-alone capital letters:
text = "Pediatric stroke (PS) I U.S.A"
text.scan(/\b[A-Z]+\b/)
# => ["PS", "I", "U", "S", "A"]
At that point we need a bit more intelligence and foreknowledge of the text content being searched. The question is, are single-letter acronyms valid? If not, then a minor modification will help:
text.scan(/\b[A-Z]{2,}\b/)
# => ["PS"]
{2,} is explained in the Regexp documentation, so read that for more information.
i only want acronym type " (ACRONYM) ", in this case PS
It's not easy to tell what you want by your description. An acronym is defined as:
An acronym is an abbreviation used as a word which is formed from the initial components in a phrase or a word. Usually these components are individual letters (as in NATO or laser) or parts of words or names (as in Benelux).
according to Wikipedia. By that definition, lowercase, all caps and mixed case can be valid.
If, you mean you only want all-caps within parenthesis, then you can easily modify the regex to honor that, but you'll fail on other acronyms you could encounter, by either missing ones you should want, or by capturing others you should want to ignore.
text = "(PS) (CT/CAT scan)"
text.scan(/\([A-Z]+\)/) # => ["(PS)"]
text.scan(/\([A-Z]+\)/).map{ |s| s[1..-2] } # => ["PS"]
text.scan(/\(([A-Z]+)\)/) # => [["PS"]]
text.scan(/\(([A-Z]+)\)/).flatten # => ["PS"]
are varying ways grab the text but this only opens a new can of worms when you look at "List of medical abbreviations" and "Medical Acronyms / Abbreviations".
Typically I'd have a table of the ones I'll accept, use a simple pattern to capture anything that looks like something I'd want, check to see if it's in the table then keep it or reject it. How to do that is for you to figure out as it's a completely different question and doesn't belong in this one.
Wrong function for the job. Use String#scan.
To get all words that start with uppercase, use String#scan with \b\p{Lu}\w*\b:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
puts text.scan(/\b\p{Lu}\w*\b/).flatten
See demo
The String.match() will only get you the first match, while scan will return all matches.
The regex \b\p{Lu}\w*\b matches:
\b - word boundary
\p{Lu} - an uppercase Unicode letter
\w* - 0 or more alphanumeric characters
\b - a trailing word boundary
To only match linguistic words (made of letters) you can use
puts text.scan(/\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/).flatten
See another demo
Here, \p{Lu}\p{M}*+ matches any Unicode uppercase letter (even a precomposed one as \p{M} matches diacritics) and (?>\p{L}\p{M}*+)* matches 0 or more letters.
To only get words in ALLCAPS, use
puts text.scan(/\b(?>\p{Lu}\p{M}*+)+\b/).flatten
See the 3rd demo
Yes, you can use String#match for this. It may not be the best way, but you didn't ask if it was. You'd have to do something like this:
text.split.map { |s| s.match(/[A-Z]\w*/) }.compact.map { |md| md[0] }
#=> ["Pediatric", "PS"]
If you knew in advance that text contained two words beginning with a capital letter, you could write:
text.match(/([A-Z]\w*).*([A-Z]\w*)/)
[$1,$2]
#=> ["Pediatric", "PS"]
Note that using a regex is not your only option:
text.delete('.,!?()[]{}').split.select { |str| ('A'..'Z').cover?(str[0]) }
#=> ["Pediatric", "PS"]

ruby regex to match multiple occurrences of pattern

I am looking to build a ruby regex to match multiple occurrences of a pattern and return them in an array. The pattern is simply: [[.+]]. That is, two left brackets, one or more characters, followed by two right brackets.
This is what I have done:
str = "Some random text[[lead:first_name]] and more stuff [[client:last_name]]"
str.match(/\[\[(.+)\]\]/).captures
The regex above doesn't work because it returns this:
["lead:first_name]] and another [[client:last_name"]
When what I wanted was this:
["lead:first_name", "client:last_name"]
I thought if I used a noncapturing group that for sure it should solve the issue:
str.match(/(?:\[\[(.+)\]\])+/).captures
But the noncapturing group returns the same exact wrong output. Any idea on how I can resolve my issue?
The problem with your regex is that the .+ part is "greedy", meaning that if the regex matches both a smaller and larger part of the string, it will capture the larger part (more about greedy regexes).
In Ruby (and most regex syntaxes), you can qualify your + quantifier with a ? to make it non-greedy. So your regex would become /(?:\[\[(.+?)\]\])+/.
However, you'll notice this still doesn't work for what you want to do. The Ruby capture groups just don't work inside a repeating group. For your problem, you'll need to use scan:
"[[a]][[ab]][[abc]]".scan(/\[\[(.+?)\]\]/).flatten
=> ["a", "ab", "abc"]
Try this:
=> str.match(/\[\[(.*)\]\].*\[\[(.*)\]\]/).captures
=> ["lead:first_name", "client:last_name"]
With many occurrences:
=> str
=> "Some [[lead:first_name]] random text[[lead:first_name]] and more [[lead:first_name]] stuff [[client:last_name]]"
=> str.scan(/\[(\w+:\w+)\]/)
=> [["lead:first_name"], ["lead:first_name"], ["lead:first_name"], ["client:last_name"]]

Ruby Regex, Only One Capture (Very Simple!)

I guess this will be a silly mistake but for me, the following returns an array containing only "M". See this:
/(.)+?/.match("Many many characters!").captures
=> ["M"]
Why doesn't it return an array of every character? I must have missed something blatantly obvious because I can't see whats wrong with this?
Edit: Just realised, I don't need the +? but it still doesn't work without it.
Edit: Apologies! I will clarify: my goal is to allow users to enter a regular expression and styling and an input text file, wherever there is a match, the text will be surrounded with a html element and styling will be applied, I am not just splitting the string into characters, I only used the given regex because it was the simplest although that was stupid on my part. How do I get capture groups from scan() or is that not possible? I see that $1 contains "!" (last match?) and not any others.
Edit: Gosh, it really isn't my day. As injekt has informed me, the captures are stored in separate arrays. How do I get the offset of these captures from the original string? I would like to be able to get the offset of a captures then surround it with another string. Or is that what gsub is for? (I thought that only replaced the match, not a capture group)
Hopefully final edit: Right, let me just start this again :P
So, I have a string. The user will use a configuration file to enter a regular expression, then a style associated with each capture group. I need to be able to scan the entire string and get the start and finish or offset and size of each group match.
So if a user had configured ([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}) (email address) then I should be able to get:
[ ["elliotpotts", 0, 11],
["sample.", 12, 7],
["com", 19, 3] ]
from the string:
"elliotpotts#sample.com"
If that is not clear, there is simply something wrong with me :P. Thanks a lot so far guys, and thank you for being so patient!
Because your capture is only matching one single character. (.)+ is not the same as (.+)
>> /(.)+?/.match("Many many characters!").captures
=> ["M"]
>> /(.+)?/.match("Many many characters!").captures
=> ["Many many characters!"]
>> /(.+?)/.match("Many many characters!").captures
=> ["M"]
If you want to match every character recursively use String#scan or String#split if you don't care about capture groups
Using scan:
"Many many characters!".scan(/./)
#=> ["M", "a", "n", "y", " ", "m", "a", "n", "y", " ", "c", "h", "a", "r", "a", "c", "t", "e", "r", "s", "!"]
Note that other answer are using (.) whilst that's fine if you care about the capture group, it's a little pointless if you don't, otherwise it'll return EVERY CHARACTER in it's own separate Array, like this:
[["M"], ["a"], ["n"], ["y"], [" "], ["m"], ["a"], ["n"], ["y"], [" "], ["c"], ["h"], ["a"], ["r"], ["a"], ["c"], ["t"], ["e"], ["r"], ["s"], ["!"]]
Otherwise, just use split: "Many many characters!".split(' ')"
EDIT In reply to your edit:
reg = /([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})/
str = "elliotpotts#sample.com"
str.scan(reg).flatten.map { |capture| [capture, str.index(capture), capture.size] }
#=> [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]`
Oh, and you don't need scan, you're not really scanning so you dont need to traverse, at least not with the example you provided:
str.match(reg).captures.map { |capture| [capture, str.index(capture), capture.size] }
Will also work
Yes, something important was missed ;-)
(...) only introduces ONE capture group: the number of times the group matches is irrelevant as the index is determined only by the regular expression itself and not the input.
The key is a "global regular expression", which will apply the regular expression multiple times in order. In Ruby this is done with inverting from Regex#match to String#scan (many other languages have a "/g" regular expression modifier):
"Many many chara­cters!".sc­an(/(.)+?/­)
# but more simply (or see answers using String#split)
"Many many chara­cters!".sc­an(/(.)/­)
Happy coding
It's only returning one character because that's all you've asked it to match. You probably want to use scan instead:
str = "Many many characters!"
matches = str.scan(/(.)/)
The following code is from Get index of string scan results in ruby and modified for my liking.
[].tap {|results|
"abab".scan(/a/) {|capture|
results.push(([capture, Regexp::last_match.offset(0)]).flatten)
}
}
=> [["a", 0], ["a", 2]]

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts result = #{a.to_a.inspect}
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it matches however, the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
So as an other example, +3c+2p-c-4p, I would expect should create:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, there are no meta-characters in a character class, so you only need to escape very few characters (namely, the closing square bracket ] and the dash -, unless you put it at the end of the group). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As ruby doesn't have look-behind I can't check the start of the string, so zerhyju+2P-C-3P is valid (but will only capture +2P-C-3P) whereas +2P-C-3Pzertyuio isn't valid.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).

Resources