How to split an expression into an array? - ruby

I want to convert an expression string into an array of tokens. For the string (10+2)/33, the expected result is ['(', '10', '+', '2', ')', '/', '33']. There may be some spaces between the tokens, and the valid operators are +-*/().

You can use this pattern:
"(10+2)/33".split(/\b|(?!\d)/)
(Obviously, the goal here is to split the string, not to check if characters are allowed).
The idea is to use the fact that (, ), +, -, /, * are single characters that are not in the \w character class. So \b will match when a digit is followed by one of these characters and vice versa. (?!\d) (negative lookahead: not followed by a digit), since it is the second alternative, is like \B(?!\d) and will match between two signs.
If you want to deal with possible spaces, you only need to add \s* to each branch:
"(10+2)/33".split(/\s*\b\s*|\s*(?!\d)/)
Note that it may generate an empty item at the beginning.
To avoid the problem, you can use the scan method with a different pattern:
" ( 2 (10+2) / 33) ".scan(/\G\s*\K(?:\d+|\S)/)
Here \G ensures that all matches are contiguous from the start of the string, and \K discards everything to its left from the match result (the possible whitespace).

"(10 + 2) / 33".delete(" ").split(/(\D)/).reject(&:empty?)
# => ["(", "10", "+", "2", ")", "/", "33"]

You can also use the scan method.
> s = "(10+2)/33"
> s.scan(/\d+|[^\s\w]/)
=> ["(", "10", "+", "2", ")", "/", "33"]
> s.scan(/\d+|[^\s\d]/)
=> ["(", "10", "+", "2", ")", "/", "33"]
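The scan-based answers can be wrapped in a small helper. This is only a sketch: the method name tokenize and the choice to raise on unexpected characters are assumptions made for illustration, not something the question asks for.

```ruby
# Hypothetical helper: split an arithmetic expression into tokens.
TOKEN = /\d+|[-+*\/()]/

def tokenize(expression)
  tokens = expression.scan(TOKEN)
  # Anything that is neither a token nor whitespace is unexpected input.
  leftover = expression.gsub(TOKEN, "").gsub(/\s+/, "")
  raise ArgumentError, "unexpected characters: #{leftover.inspect}" unless leftover.empty?
  tokens
end

tokenize("(10 + 2) / 33")   # => ["(", "10", "+", "2", ")", "/", "33"]
```

Unlike the split-based patterns, this version never produces empty items, because scan only returns what the pattern matched.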

Related

Using regex to strip all characters and punctuation from a string except apostrophe

I attempted to let this method call:
alternate_words(". . . . don’t let this stop you")
return every other word in the string, stripped of punctuation except for '.
This is the method definition:
def alternate_words(sentence)
  sentence.gsub(/[^a-z0-9\s']/i, "").split(" ").delete_if.with_index { |word, index| index.odd? }
end
The result is:
["dont", "this", "you"]
The correct words are returned, but no ' is included. Changing the regex to:
/[^a-z0-9\s][']/i
returns
[".", ".", "don’t", "this", "you"]
Now, it correctly recognizes the apostrophe, but it incorrectly includes the periods. I don't understand why.
You may actually match words with apostrophes and hyphens with scan:
def alternate_words(sentence)
  sentence.scan(/[[:alnum:]]+(?:[’'-][[:alnum:]]+)*/).delete_if.with_index { |_, index|
    index.odd?
  }
end
p alternate_words(". . . . . don’t let this stop you")
# => ["don’t", "this", "you"]
The [[:alnum:]]+(?:[’'-][[:alnum:]]+)* pattern may be enclosed in word boundaries (\b) if you want to match only whole words.
Details:
[[:alnum:]]+ - 1 or more alphanumeric symbols
(?:[’'-][[:alnum:]]+)* - zero or more (due to *, replace with another quantifier as per requirements) occurrences of:
[’'-] - an apostrophe or a hyphen (the list may be adjusted)
[[:alnum:]]+ - 1 or more alphanumeric symbols.
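As for why the question's second regex kept the periods: as far as I can tell, the sentence contains a typographic apostrophe (’, U+2019) rather than the ASCII one, and [^a-z0-9\s]['] only matches a non-alphanumeric character followed by an ASCII apostrophe. The sketch below checks that reading; the variable names are made up for illustration.

```ruby
sentence = ". . . . don\u2019t let this stop you"   # \u2019 is the curly apostrophe

has_ascii_apostrophe = sentence.include?("'")        # => false
has_curly_apostrophe = sentence.include?("\u2019")   # => true

# The pattern never matches, so gsub returns the string unchanged
# and the periods survive as separate "words".
unchanged = sentence.gsub(/[^a-z0-9\s][']/i, "")
nothing_removed = (unchanged == sentence)            # => true

# Allowing both apostrophes inside the negated class keeps "don’t" intact.
cleaned = sentence.gsub(/[^a-z0-9\s'\u2019]/i, "")
words = cleaned.split.select.with_index { |_, index| index.even? }
# => ["don’t", "this", "you"]
```

This also explains the first result: the original character class only spared the ASCII apostrophe, so the curly one was stripped along with the periods.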

How to create a regular expression to match all numbers without the numbers at the beginning and end of the string?

Example code:
test = '12asiudas8787hajshd986q756tgs87ta7d6-12js01'
test.scan(regexp)
As a result, I should get:
["8787", "986", "756", "87", "7", "6", "12"]
Like using a /\d+/ regexp, but without the numbers at the beginning and end of the string, in this case 12 and 01.
To match the numbers within the string, use the following regex.
Regex: (?<=[^\d])(\d+)(?=[^\d])
Explanation:
(?<=[^\d]) ensures that the match is preceded by a non-digit character. Without this, the 12 at the beginning would be matched too, and we don't want that.
(\d+) matches your number.
(?=[^\d]) ensures that the last digit is followed by a non-digit character. Without this, the 01 at the end would be matched too.
P.S: Edited regex on Wiktor Stribiżew's advice
One can also use \D instead of [^\d]. I used [^\d] to make it clear.
test.scan(/(?<=\D)\d+(?=\D)/) # => ["8787", "986", "756", "87", "7", "6", "12"]
This should do:
test.scan(/(?<=[^\d])(\d+)(?=[^\d])/).flatten
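A few short inputs make the effect of the two lookarounds easier to see, and also show why the capture-group variant needs flatten. The sample strings are invented for illustration.

```ruby
pattern = /(?<=\D)\d+(?=\D)/

end_dropped   = "ab12cd34".scan(pattern)   # => ["12"]  ("34" touches the end)
start_dropped = "12ab34cd".scan(pattern)   # => ["34"]  ("12" touches the start)
only_digits   = "12".scan(pattern)         # => []

# With a capture group, scan returns one array per match, hence the flatten.
nested = "a1b22c".scan(/(?<=\D)(\d+)(?=\D)/)            # => [["1"], ["22"]]
flat   = "a1b22c".scan(/(?<=\D)(\d+)(?=\D)/).flatten    # => ["1", "22"]
```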

Splitting the content of brackets without separating the brackets ruby

I am currently working on a ruby program to calculate terms. It works perfectly fine except for one thing: brackets. I need to filter the content or at least, to put the content into an array, but I have tried for an hour to come up with a solution. Here is my code:
splitted = term.split(/\(+|\)+/)
I need an array instead of the brackets, for example:
"1-(2+3)" #=>["1", "-", ["2", "+", "3"]]
I already tried this:
/(\((?<=.*)\))/
but it returned:
Invalid pattern in look-behind.
Can someone help me with this?
UPDATE
I forgot to mention, that my program will split the term, I only need the content of the brackets to be an array.
If you need to keep track of the hierarchy of parentheses with arrays, you won't manage it just with regular expressions. You'll need to parse the string word by word, and keep a stack of expressions.
Pseudocode:
Expressions = new stack
Add new array on stack
while word in string:
    if word is "(": add new array on stack
    else if word is ")": remove the last array from the stack and add it to the (next) last array of the stack
    else: add the word to the last array of the stack
When exiting the loop, there should be only one array in the stack (if not, you have inconsistent opening/closing parentheses).
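The pseudocode above could be sketched in Ruby roughly as follows. The method name nest_tokens, the token pattern (digits and the operators +-*/) and the ArgumentError on unbalanced parentheses are assumptions for illustration; adapt them to however the term is already being split.

```ruby
# Hypothetical helper: "1-(2+3)" becomes ["1", "-", ["2", "+", "3"]].
def nest_tokens(term)
  stack = [[]]                              # bottom array holds the top-level tokens
  term.scan(/\d+|[-+*\/()]/).each do |token|
    case token
    when "("
      stack.push([])                        # open a new nesting level
    when ")"
      raise ArgumentError, "unmatched )" if stack.size == 1
      finished = stack.pop                  # close the current level
      stack.last << finished                # and append it to its parent
    else
      stack.last << token
    end
  end
  raise ArgumentError, "unmatched (" unless stack.size == 1
  stack.first
end

nest_tokens("1-(2+3)")       # => ["1", "-", ["2", "+", "3"]]
nest_tokens("2*((1+3)-4)")   # => ["2", "*", [["1", "+", "3"], "-", "4"]]
```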
Note: If your ultimate goal is to evaluate the expression, you could save time and parse the string in Postfix aka Reverse-Polish Notation.
Also consider using off-the-shelf libraries.
A solution depends on the pattern you expect between the parentheses, which you have not specified. (For example, for "(st12uv)" you might want ["st", "12", "uv"], ["st12", "uv"], ["st1", "2uv"] and so on). If, as in your example, it is a natural number followed by a +, followed by another natural number, you could do this:
str = "1-( 2+ 3)"
r = /
    \(\s*  # match a left parenthesis followed by >= 0 whitespace chars
    (\d+)  # match one or more digits in a capture group
    \s*    # match >= 0 whitespace chars
    (\+)   # match a plus sign in a capture group
    \s*    # match >= 0 whitespace chars
    (\d+)  # match one or more digits in a capture group
    \s*    # match >= 0 whitespace chars
    \)     # match a right parenthesis
    /x
str.scan(r).first
  #=> ["2", "+", "3"]
Suppose instead + could be +, -, * or /. Then you could change:
(\+)
to:
([-+*\/])
Note that, in a character class, + needn't be escaped and - needn't be escaped if it is the first or last character of the class (as in those cases it would not signify a range).
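With that substitution, the whole pattern might look like this. It is the same regex written on one line, and the second sample string is invented for illustration.

```ruby
# Same pattern as above, with the operator group widened to + - * /
r = /\(\s*(\d+)\s*([-+*\/])\s*(\d+)\s*\)/

sum      = "1-( 2+ 3)".scan(r).first    # => ["2", "+", "3"]
quotient = "4*(10 / 5)".scan(r).first   # => ["10", "/", "5"]
```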
Incidentally, you received the error message "Invalid pattern in look-behind" because Ruby's lookbehinds cannot contain variable-length matches (such as .*). With positive lookbehinds you can get around that by using \K instead. For example,
r = /
    \d+     # match one or more digits
    \K      # forget everything previously matched
    [a-z]+  # match one or more lowercase letters
    /x
"123abc"[r] #=> "abc"

Allow alphanumeric and #, replace comma and separate on space

I want to do the following in a regex:
1. allow alphanumeric characters
2. allow the # character, and comma ','
3. replace the comma ',' with a space
4. split on space
sentence = "cool, fun, house234"
>> [cool, fun, house234]
This is a simple way to do it:
sentence.scan(/[a-z0-9#]+/i) #=> ["cool", "fun", "house234"]
Basically, it looks for runs of characters made up of a to z in upper and lower case, 0 to 9, and #, and returns those. Because commas and spaces don't match, they're ignored.
You don't show an example using # but I added it 'cuz you said so.
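For completeness, here is what the same pattern does with an input that contains #. The sample sentences are made up, since the question does not include one.

```ruby
tags = "cool, #fun, house234".scan(/[a-z0-9#]+/i)
# => ["cool", "#fun", "house234"]

# Any other character simply acts as a separator.
parts = "a-b;c#d".scan(/[a-z0-9#]+/i)
# => ["a", "b", "c#d"]
```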
You can do 1 and 2 with a regular expression, but not 3 and 4.
sentence = "cool, fun, house234"
sentence.gsub(',', ' ').split if sentence =~ /[0-9#,]/
=> [ "cool", "fun", "house234" ]
"cool, fun, house234".split(",")
=> ["cool", " fun", " house234"]
You can just pass the "," into the split method to split on the comma, no need to convert it to spaces.
Probably, what you wanted is this?
string.gsub(/[^\w, #]/, '').split(/ *,? +/)

Regex to remove non letters

I'm trying to remove non-letters from a string. Would this do it:
c = o.replace(o.gsub!(/\W+/, ''))
Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')
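The difference between the two forms is easiest to see side by side. This is a small sketch with invented sample strings.

```ruby
o = "abc".dup                      # dup gives a mutable copy of the literal
no_change = o.gsub!(/\W+/, "")     # => nil, because nothing was replaced

o = "a b-c".dup
changed = o.gsub!(/\W+/, "")       # => "abc", and o itself is now "abc"

# The non-bang form always returns a string, changed or not.
c = "abc".gsub(/\W+/, "")          # => "abc"
```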
Remove anything that is not a letter:
> " sd 190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"
EDIT: as ikegami points out, this doesn't take into account accented characters, umlauts, and other similar characters. The solution to this problem will depend on what exactly you are referring to as "not a letter". Also, what your input will be.
Keep in mind that Ruby considers the underscore _ to be a word character, and digits are word characters too. So if you want to keep underscores (and digits) as well, this should do it:
string.gsub!(/\W+/, '')
Otherwise, you need to do this:
string.gsub!(/[^a-zA-Z]/, '')
That will work in most cases, except when o initially does not contain any non-letter, in which case gsub! will return nil.
If you just want a replaced string, it can be simpler:
c = o.gsub(/\W+/, '')
Using \W or \w to select or delete only letters won't work. \w means A-Z, a-z, 0-9, and "_":
irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
So, stripping using \W preserves digits and underscores.
If you want to match letters, use /[A-Za-z]+/, or the POSIX character class [:alpha:], i.e. /[[:alpha:]]+/, or /\p{ALPHA}/.
The final format is the Unicode property for 'A'..'Z' + 'a'..'z' in ASCII, and gets extended when dealing with Unicode, so if you have multibyte characters you should probably use that.
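A short comparison on a string with a non-ASCII letter shows the difference. The sample text is invented, and it is written with a \u escape so that the example does not depend on how the file is saved.

```ruby
text = "St\u00FCck 42_!"                       # "Stück 42_!"

ascii_only = text.gsub(/[^A-Za-z]/, "")        # => "Stck"  (the ü is lost)
posix      = text.gsub(/[^[:alpha:]]/, "")     # => "Stück"
property   = text.gsub(/\P{Alpha}/, "")        # => "Stück" (\P negates \p)
```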
Use Regexp.union to build one pattern that matches every allowed character:
allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")
I don't see what that o.replace is in there for. If you have a string:
string = 't = 4 6 ^'
And you do:
string.gsub!(/\W+/, '')
You get:
t46
If you want to get rid of the number characters too, you can do:
string.gsub!(/\W+|\d+/, '')
And you get:
t

Resources