Why is this negative look behind wrong? - ruby

def get_hashtags(post)
tags = []
post.scan(/(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/){|x,y| tags << y}
tags
end
Test.assert_equals(get_hashtags("two hashs##in middle of word#"), [])
#Expected: [], instead got: ["in"]
Should it not look behind to see if the match doesnt begin with a word or number? Why is it still accepting 'in' as a valid match?

You should use \K rather than a negative lookbehind. That allows you to simplify your regex considerably: no need for a pre-defined array, capture groups or a block.
\K means "discard everything matched so far". The key here is that variable-length matches can precede \K, whereas (in Ruby and most other languages) variable-length matches are not permitted in (negative or positive) lookbehinds.
r = /
[^0-9a-zA-Z#] # do not match any character in the character class
\#+ # match one or more pound signs
\K # discard everything matched so far
[a-zA-Z]+ # match one or more letters
/x # extended mode
Note # in \#+ need not be escaped if I weren't writing the regex in extended mode.
"two hashs##in middle of word#".scan r
#=> []
"two hashs&#in middle of word#".scan r
#=> ["in"]
"two hashs#in middle of word&#abc of another word.###def ".scan r
#=> ["abc", "def"]

Related

How to find same characters in two random strings? (Ruby)

I am busy working through some problems I have found on the net and I feel like this should be simple but I am really struggling.
Say you have the string 'AbcDeFg' and the next string of 'HijKgLMnn', I want to be able to find the same characters in the string so in this case it would be 'g'.
Perhaps I wasn't giving enough information - I am doing Advent of Code and I am on day 3. I just need help with the first bit which is where you are given a string of characters - you have to split the characters in half and then compare the 2 strings. You basically have to get the common character between the two. This is what I currently have:
file_data = File.read('Day_3_task1.txt')
arr = file_data.split("\n")
finals = []
arr.each do |x|
len = x.length
divided_by_two = len / 2
second = x.slice!(divided_by_two..len).split('')
first = x.split('')
count = 0
(0..len).each do |z|
first.each do |y|
if y == second[count]
finals.push(y)
end
end
count += 1
end
end
finals = finals.uniq
Hope that helps in terms of clarity :)
Did you try to convert both strings to arrays with the String#char method and find the intersection of those arrays?
Like this:
string_one = 'AbcDeFg'.chars
string_two = 'HijKgLMnn'.chars
string_one & string_two # => ["g"]
One way to do that is to use the method String#scan with the regular expression
rgx = /(.)(?!.*\1.*_)(?=.*_.*\1)/
I'm not advocating this approach. I merely thought some readers might find it interesting.
Suppose
str1 = 'AbcDgeFg'
str2 = 'HijKgLMnbn'
Now form the string
str = "#{str1}_#{str2}"
#=> "AbcDeFg_HijKgLMnbn"
I've assumed the strings contain letters only, in which case they are separated in str with any character other than a letter. I've used an underscore. Naturally, if the strings could contain underscores a different separator would have to be used.
We then compute
str.scan(rgx).flatten
#=> ["b", "g"]
Array#flatten is needed because
str.scan(rgx)
#=>[["b"], ["g"]]
The regular expression can be written in free-spacing mode to make it self-documenting:
rgx =
/
(.) # match any character, same to capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1 # match the contents of capture group 1
.* # match zero or more characters
_ # match an underscore
) # end the negative lookahead
(?= # begin a positive lookahead
.* # match zero or more characters
_ # match an underscore
.* # match zero or more characters
\1 # match the contents of capture group 1
) # end the positive lookahead
/x # invoke free-spacing regex definition mode
Note that if a character appears more than once in str1 and at least once in str2 the negative lookahead ensures that only the last one in str1 is matched, to avoid returning duplicates.
Alternatively, one could write
str.gsub(rgx).to_a
The uses the (fourth) form of String#gsub which takes a single argument and no block and returns an enumerator.

how to get formatted data from a array and convert it back to new array in ruby

I have a data array like below. I need to format it like shown
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Need to format like
["EC006", "ED009", "AX009"]
arr = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
To merely extract the strings of interest, assuming the data is formatted correctly, we may write the following.
arr.map { |s| s[/(?<=\[)[^\]]*/] }
#=> ["EC006", "ED009", "AX009"]
See String#[] and Demo
In the regular expression (?<=\[) is a positive lookbehind that asserts the previous character is '['. The ^ at the beginning of the character class [^\]] means that any character other than ']' must be matched. Appending the asterisk ([^\]]*) causes the character class to be matched zero or more times.
Alternatively, we could use the regular expression
/\[\K[^\]]*/
where \K causes the beginning of the match to be reset to the current string location and all previously-matched characters to be discarded from the match that is returned.
To confirm the correctness of the formatting as well, use
arr.map { |s| s[/\A[1-9]\d{3} \[\K[A-Z]{2}\d{3}(?=]\z)/] }
#=> ["EC006", "ED009", "AX009"]
Demo
Note that at the link I replaced \A and \z with ^ and $, respectively, in order to test the regex against multiple strings.
This regular expression can be broken down as follows.
\A # match beginning of string
[1-9] # match a digit other than zero
\d{3} # match 3 digits
[ ] # match one space
\[ # match '['
\K # reset start of match to current stringlocation and discard
# all characters previously matched from match that is returned
[A-Z]{2} # match 2 uppercase letters
\d{3} # match 3 digits
(?=]\z) # positive lookahead asserts following character is
# ']' and that character is at the end of the string
In the above I placed a space character in a character class ([ ]) merely to make it visible to the reader.
Input
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Code
p a.collect { |x| x[/\[(.*)\]/, 1] }
Output
["EC006", "ED009", "AX009"]

How to remove a certain character after substring in Ruby

I have a string with exclamation marks. I want to remove the exclamation marks at the end of the word, not the ones before a word. Assume there is no exclamation mark by itself/ not accompanied by a word. By word I mean [a..z], can be uppercased.
For example:
exclamation("Hello world!!!")
#=> ("Hello world")
exclamation("!!Hello !world!")
#=> ("!!Hello !world")
I have read How do I remove substring after a certain character in a string using Ruby? ; these two are close, but different.
def exclamation(s)
s.slice(0..(s.index(/\w/)))
end
# exclamation("Hola!") returns "Hol"
I have also tried s.gsub(/\w!+/, ''). Although it retains the '!' before word, it removes both the last letter and exclamation mark. exclamation("!Hola!!") #=> "!Hol".
How can I remove only the exclamation marks at the end?
If you don't want to use regex that sometimes difficult to understand use this:
def exclamation(sentence)
words = sentence.split
words_wo_exclams = words.map do |word|
word.split('').reverse.drop_while { |c| c == '!' }.reverse.join
end
words_wo_exclams.join(' ')
end
Although you haven't given a lot of test data, here's an example of something that might work:
def exclamation(string)
string.gsub(/(\w+)\!(?=\s|\z)/, '\1')
end
The \s|\z part means either a space or the end of the string, and (?=...) means to just peek ahead in the string but not actually match against it.
Note that this won't work in the case of things like "I'm mad!" where the exclamation mark is not adjacent to a space, but you could always add that as another potential end-of-word match.
"!!Hello !world!, world!! I say".gsub(r, '')
#=> "!!Hello !world, world! I say"
where
r = /
(?<=[[:alpha:]]) # match an uppercase or lowercase letter in a positive lookbehind
! # match an exclamation mark
/x # free-spacing regex definition mode
or
r = /
[[:alpha:]] # match an uppercase or lowercase letter
\K # discard match so far
! # match an exclamation mark
/x # free-spacing regex definition mode
If the above example should return "!!Hello !world, world I say", change ! to !+ in the regexes.

I need a regex to capture just the number from a string

I do not have access to the code, this is via an interface that only allows me to edit the regex that parses user responses. I need to extract the weight after users text, where they text things like:
wt 172.5
172.5 lbs
180
wt. 173.22
172,5
I need to capture the weight as a float field, but I want to restrict it to at most 1 decimal place. I tried using /(?<val>[\d+((\.|,)\d\d?)?]/ but it is only saving the first digit "1" in the field
Sometimes what seems most simple is not. I suggest using this regex:
r = /(?<=\A|\s)\d+(?:[.,]\d)?(?=\d|\s|\z)/
We can alternatively define the regex using extended or free-spacing mode (by adding the modifier x after the final /), which allows us to include documentation:
r = /
(?<=\A|\s) # match beginning of string or space in a positive lookbehind
\d+ # match one or more digits
(?:[.,]\d)? # optionally (? after non-capture group) match a . or , then a digit
(?=\d|\s|\z) # match a digit, space or the end of the string in a positive lookahead
/x
"wt 172.5"[r] #=> "172.5"
"172.5 lbs"[r] #=> "172.5"
"180"[r] #=> "180"
"wt. 173.22"[r] #=> "173.2"
"172,5"[r] #=> "172,5"
"A1 143.66"[r] #=> "143.6"
"A1 1.3.4 43.6"[r] #=> "43.6"
\d+(?:[,.]\d{1,2})?
Guess you wanted this .[] is character class,not what you think.Your character class captures just one out of all characters you have defined.
See demo.
https://regex101.com/r/eB8xU8/12

Ruby regular expression non capture group

I'm trying to grab id number from the string, say
id/number/2000GXZ2/ref=sr
using
(?:id\/number\/)([a-zA-Z0-9]{8})
for some reason non capture group is not worked, giving me:
id/number/2000GXZ2
As mentioned by others, non-capturing groups still count towards the overall match. If you don't want that part in your match use a lookbehind.
Rubular example
(?<=id\/number\/)([a-zA-Z0-9]{8})
(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text
Ruby Doc Regexp
Also, the capture group around the id number is unnecessary in this case.
You have:
str = "id/number/2000GXZ2/ref=sr"
r = /
(?:id\/number\/) # match string in a non-capture group
([a-zA-Z0-9]{8}) # match character in character class 8 times, in capture group 1
/x # extended/free-spacing regex definition mode
Then (using String#[]):
str[r]
#=> "id/number/2000GXZ2"
returns the entire match, as it should, not just the contents of capture group 1. There are a few ways to remedy this. Consider first ones that do not use a capture group.
#jacob.m suggested putting the first part in a positive lookbehind (modified slightly from his code):
r = /
(?<=id\/number\/) # match string in positive lookbehind
[[:alnum:]]{8} # match >= 1 alphameric characters
/x
str[r]
#=> "2000GXZ2"
An alternative is:
r = /
id\/number\/ # match string
\K # forget everything matched so far
[[:alnum:]]{8} # match 8 alphanumeric characters
/x
str[r]
#=> "2000GXZ2"
\K is especially useful when the match to forget is variable-length, as (in Ruby) positive lookbehinds do not work with variable-length matches.
With both of these approaches, if the part to be matched contains only numbers and capital letters, you may want to use [A-Z0-9]+ instead of [[:alnum:]] (though the latter includes Unicode letters, not just those from the English alphabet). In fact, if all the entries have the form of your example, you might be able to use:
r = /
\d # match a digit
[A-Z0-9]{7} # match >= 0 capital letters or digits
/x
str[r]
#=> "2000GXZ2"
The other line of approach is to keep your capture group. One simple way is:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
/x
str =~ r
str[r, 1] #=> "2000GXZ2"
Alternatively, you could use String#sub to replace the entire string with the contents of the capture group:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
.* # match the remainder of the string
/x
str.sub(r, '\1') #=> "2000GXZ2"
str.sub(r, "\\1") #=> "2000GXZ2"
str.sub(r) { $1 } #=> "2000GXZ2"
This is Ruby Regexp expected match consistency evilness. Some Regexp-style methods will return the global-match while others will return specified matches.
In this case, one method we can use to get the behavior you're looking for is scan.
I don't think anyone here actually mentions how to get your Regexp working as you originally intended, which was to get the capture-only match. To do that, you would use the scan method like so with your original pattern:
test_me.rb
test_string="id/number/2000GXZ2/ref=sr"
result = test_string.scan(/(?:id\/number\/)([a-zA-Z0-9]{8})/)
puts result
2000GXZ2
That said, replacing (?:) with (?<=) for non-capture groups for look-behinds will benefit you both when you use scan as well as other parts of ruby that use Regexps.

Resources