In my app, I have to monitor what users type so that I can keep bad words off the web site. For example, suppose all my bad words were in this array.
bad_words = ['bad', 'evil', 'terrible', 'villain', 'enemy']
If a user typed those, I would like them to be deleted. Here was one thing I tried.
bad_words.each {|word| string.gsub(word, '')}
Help is appreciated.
You can use a gem to do the cleanup:
https://github.com/tjackiw/obscenity
Including the gem gives you methods like:
Obscenity.configure { |config| config.blacklist = bad_words }
and then:
Obscenity.sanitize(string)
Here's one way:
bad_words = ['bad', 'evil', 'terrible', 'villain', 'enemy']
orig_str =
"Evil is embodied by a terrible villain named 'Bad' who plays badmitten"
no_bad_str = orig_str.gsub(/(?<=^|\W)\w+(?=\W|$)/) { |w|
(bad_words.include?(w.downcase)) ? '' : w }
#=> " is embodied by a   named '' who plays badmitten"
(?<=^|\W) is a positive lookbehind
(?=\W|$) is a positive lookahead
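As a rough alternative sketch (not the method used above), the two lookarounds can be approximated with \b word boundaries around a case-insensitive union of the list. The variable names pattern, cleaned and tidy are made up for illustration.

```ruby
bad_words = ['bad', 'evil', 'terrible', 'villain', 'enemy']
orig_str = "Evil is embodied by a terrible villain named 'Bad' who plays badmitten"

# Interpolate .source (not the Regexp itself) so the outer /i flag applies.
# \b keeps "badmitten" intact; /i catches "Evil" and "Bad".
pattern = /\b(?:#{Regexp.union(bad_words).source})\b/i

cleaned = orig_str.gsub(pattern, '')

# Optional: collapse the runs of spaces left behind by the removed words.
tidy = cleaned.squeeze(' ').strip

puts cleaned
puts tidy
```

Like the original, this still lets the run-together examples listed below slip through, because they contain no word boundary around the bad word.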
Can bad, evil and terrible words sneak by? Of course. Some examples for orig_str:
badbadbad
evilterribleenemy
eviloff
flyingevil
Good luck!
You can either do
bad_words.each {|word| string = string.gsub(word, '')}
or
bad_words.each {|word| string.gsub!(word, '')}
Either should work. The issue with your original was that gsub returns a new string rather than modifying the old one, unlike the two solutions proposed above.
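A small sketch of the difference, using a shortened word list and a made-up sample string:

```ruby
bad_words = ['bad', 'evil']
string = 'a bad and evil plan'.dup # dup so the literal is safely mutable

# gsub returns a new string each time; the result is discarded here.
bad_words.each { |word| string.gsub(word, '') }
after_gsub = string.dup

# gsub! edits the receiver in place.
bad_words.each { |word| string.gsub!(word, '') }
after_gsub_bang = string

puts after_gsub      # a bad and evil plan
puts after_gsub_bang # a  and  plan
```

Note that gsub! returns nil when nothing was replaced, so its return value should not be chained.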
You can use Regexp.union to create a regular expression containing all the words in your list:
bad_words = ['bad', 'evil', 'terrible', 'villain', 'enemy']
Regexp.union(bad_words)
# => /bad|evil|terrible|villain|enemy/
string.gsub(Regexp.union(bad_words), '')
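Two details that may be useful here, shown as a sketch with a made-up word list: Regexp.union escapes regex metacharacters in plain strings, and a case-insensitive version can be built from its source.

```ruby
words = ['a.b', 'c+', 'bad']

# Strings are escaped, so "." and "+" are matched literally.
union = Regexp.union(words)

# Rebuild the pattern with the IGNORECASE option.
insensitive = Regexp.new(union.source, Regexp::IGNORECASE)

result = 'BAD a.b axb'.gsub(insensitive, '')

puts union.source
puts result.inspect
```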
Here are my test cases.
Expected:
JUNKINFRONThttp://francium.tech should be http://francium.tech
JUNKINFRONThttp://francium.tech/http should be http://francium.tech/http
francium.tech/http should be francium.tech/http (unaffected)
Actual result:
http://francium.tech
francium.tech/http
http
I am trying to write a regex replace for this. I tried this,
text.sub(/.*http/,'http')
However, my second and third test cases fail because .* is greedy and matches up to the last http in the string. It would help if the answer could also handle case insensitivity.
2.5.0 :001 > url = 'francium.tech/http'
=> "francium.tech/http"
2.5.0 :002 > url.sub(/^.*?(?=http)/i,'')
=> "http"
As per my original comments, you can use the pattern as shown below. If you want a really small performance gain, you can remove one step in the regex by using the second pattern instead. If you're especially concerned with performance, the last one performs even quicker.
^.*?(?=https?://)
^.*?(?=https?:/{2})
^.*?(?=ht{2}ps?:/{2})
See code in use here
strings = [
"JUNKINFRONThttp://francium.tech",
"JUNKINFRONThttp://francium.tech/http",
"francium.tech/http"
]
strings.each { |s| puts s.sub(%r{^.*?(?=https?://)}, '') }
Outputs the following:
http://francium.tech
http://francium.tech/http
francium.tech/http
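The question also asked for case insensitivity. A minimal sketch, assuming the i flag is all that is needed; the method name strip_leading_junk is hypothetical:

```ruby
# Removes everything before the first "http://" or "https://" (any letter case).
def strip_leading_junk(text)
  # \A anchors at the start of the string; .*? is lazy, so it stops at the
  # first position where the lookahead succeeds.
  text.sub(%r{\A.*?(?=https?://)}i, '')
end

puts strip_leading_junk('JUNKINFRONThttp://francium.tech/http')
puts strip_leading_junk('junkHTTPS://francium.tech')
puts strip_leading_junk('francium.tech/http')
```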
I think this may solve your problem.
str1 = 'JUNKINFRONThttp://francium.tech'# should be http://francium.tech
str2 = 'JUNKINFRONThttp://francium.tech/http'# should be http://francium.tech/http
str3 = 'francium.tech/http' #should be francium.tech/http (unaffected)
str4 = 'JUNKINFRONThttps://francium.tech/http'# should be https://francium.tech/http
[str1, str2, str3, str4].each do |str|
puts str.gsub(/^.*(http|https):\/\//i, "\\1://")
end
Result:
http://francium.tech
http://francium.tech/http
francium.tech/http
https://francium.tech/http
When using a regex, make sure to match on unique strings like http:// or, better, http://[SOMETHING].[AT_LEAST_TWO_CHARS][MAYBE_A_SLASH] and so on...
This works for your given cases:
strings = ['JUNKINFRONThttp://francium.tech',
'JUNKINFRONThttp://francium.tech/http',
'francium.tech/http']
strings.each do |str|
puts str.sub(/^.*?(https?:\/{2})/, '\1') # with capturing group
puts str.sub(/^.*?(?=https?:\/{2})/, '') # with positive lookahead
end
By using a capturing group we can reuse it in the replacement; another method is to use a positive lookahead.
I have a string as given below,
./component/unit
and I need to split it to get component/unit, which I will use as a key when inserting into a hash.
I tried .split(/.\//).last, but it gives only unit rather than component/unit.
I think this should help you:
string = './component/unit'
string.split('./')
#=> ["", "component/unit"]
string.split('./').last
#=> "component/unit"
Your regex was almost fine:
split(/\.\//)
You need to escape both . (any character) and / (regex delimiter).
As an alternative, you could just remove the first './' substring:
'./component/unit'.sub('./','')
#=> "component/unit"
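To see why the unescaped dot matters, here is a short comparison. String#delete_prefix is one more option; it is not mentioned above and assumes Ruby 2.5 or later.

```ruby
path = './component/unit'

# Unescaped "." matches any character, so "t/" is also treated as a separator.
unescaped = path.split(/.\//)

# Escaped "." only matches a literal dot.
escaped = path.split(/\.\//)

# Strips the prefix only if the string starts with it.
without_prefix = path.delete_prefix('./')

p unescaped      # ["", "componen", "unit"]
p escaped        # ["", "component/unit"]
p without_prefix # "component/unit"
```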
All the other answers are fine, but I think you are not really dealing with a String here but with a URI or Pathname, so I would advise you to use these classes if you can. If so, please adjust the title, as it is not about do-it-yourself-regexes, but about proper use of the available libraries.
Link to the ruby doc:
https://docs.ruby-lang.org/en/2.1.0/URI.html
and
https://ruby-doc.org/stdlib-2.1.0/libdoc/pathname/rdoc/Pathname.html
An example with Pathname is:
require 'pathname'
pathname = Pathname.new('./component/unit')
puts pathname.cleanpath # prints component/unit
# pathname.cleanpath.to_s # => "component/unit"
Whether this is a good idea (and/or using URI would be cool too) also depends on what your real problem is, i.e. what you want to do with the extracted String. As stated, I doubt a bit that you are really interested in Strings.
Using a positive lookbehind, you could use this regex:
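Since the question mentions using the result as a hash key, here is a minimal sketch of that with Pathname. The hash name registry and the value :some_value are placeholders, not part of the question.

```ruby
require 'pathname'

raw = Pathname.new('./component/unit')

# cleanpath returns a Pathname; to_s gives a plain String usable as a hash key.
key = raw.cleanpath.to_s
registry = { key => :some_value }

# each_filename yields the individual path components.
parts = raw.cleanpath.each_filename.to_a

p key      # "component/unit"
p parts    # ["component", "unit"]
p registry
```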
reg = /(?<=\.\/)[\w+\/]+\w+\z/
Demo
str = './component'
str2 = './component/unit'
str3 = './component/unit/ruby'
str4 = './component/unit/ruby/regex'
[str, str2, str3, str4].each { |s| puts s[reg] }
#component
#component/unit
#component/unit/ruby
#component/unit/ruby/regex
Here is my array:
a = ['a','b','c', 'C!', 'D!']
I would like to select any upcase letters followed by the ! character and display them. I was trying:
puts a.select! {|i| i.upcase + "!"}
which gave me an empty result. Any help would be greatly appreciated.
puts a.grep(/[A-Z]!/)
will do.
Try the following:
a.select {|i| i =~ /[A-Z]!/}
Here's another way using the Regexp match method in Ruby.
a.select { |letter| /[A-Z]!/.match(letter) }
Also, one note: consider a more meaningful and contextually relevant variable name than "i" in a.select! {|i| i.upcase + "!"}. For example, I chose the name "letter", although there may be a more meaningful name. It's just a good naming practice that a lot of Ruby programmers tend to follow. Same thing applies to the array named a.
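A short sketch of why the original attempt came back empty, plus an anchored variant. The anchors \A and \z are an assumption about the intent (exactly one uppercase letter followed by !), and String#match? assumes Ruby 2.4 or later.

```ruby
letters = ['a', 'b', 'c', 'C!', 'D!']

# The block always returns a (truthy) String, so nothing is removed and
# select! returns nil to signal "no changes".
result_of_original = letters.dup.select! { |letter| letter.upcase + '!' }

# Anchored so that only a single uppercase letter followed by "!" qualifies.
shouted = letters.select { |letter| letter.match?(/\A[A-Z]!\z/) }

p result_of_original # nil
p shouted            # ["C!", "D!"]
```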
I'm writing a Rack app to split hostnames ending with certain suffixes.
For example, the hostname (and port) hello.world.lvh.me:3000 needs to be split into tokens hello.world, .lvh.me and :3000. Additionally, the prefix (hello.world), suffix (.lvh.me) and port (:3000) are all optional.
So far, I have a (Ruby) regex that looks like /(.*)(\.lvh\.me)(\:\d+)?/.
This successfully breaks the hostname into component parts but it falls down when one or more of the optional components is missing, e.g. for hello.world:3000 or lvh.me:3000 or even plain old hello.world.
I've tried adding ? to each group to make them optional (/(.*)?(\.lvh\.me)?(\:\d+)?/) but this invariably ends up with the first group, (.*), capturing the entire string and stopping there.
My gut feeling is that this is something which might be solved using lookaround but I'll admit this is a totally new realm of regex for me.
You can try with this pattern:
\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z
The lookahead (?=[^:]) checks that the string starts with a character other than : (in other words, that it is not the port alone). This means that at least hello.world or lvh.me is present.
The first group is optional and non-greedy (??), which means that it is matched only when needed.
\A and \z are anchors for the start and the end of the string (whereas ^ and $ anchor at the start and end of a line).
Note that in Ruby \d matches only ASCII digits by default (unlike [[:digit:]] or \p{Nd}), so \d and [0-9] are equivalent here; [0-9] simply makes the intent explicit.
Note too that \A(?=[^:])((?>[^l:\n.]+|\.|\Bl|l(?!vh\.me\b))*)((?:\.|\A)lvh\.me)?(:[0-9]+)?\z may be more performant.
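A minimal sketch of the first pattern in use, wrapped in a hypothetical helper named split_host. Groups that do not participate in the match come back as nil.

```ruby
# Returns [prefix, suffix, port], with nil for missing parts,
# or nil when the input does not match at all (e.g. a port alone).
def split_host(host)
  match = /\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z/.match(host)
  match && match.captures
end

p split_host('hello.world.lvh.me:3000') # ["hello.world", ".lvh.me", ":3000"]
p split_host('hello.world:3000')        # ["hello.world", nil, ":3000"]
p split_host('lvh.me:3000')             # [nil, "lvh.me", ":3000"]
p split_host('hello.world')             # ["hello.world", nil, nil]
p split_host(':3000')                   # nil
```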
Try ^(.*?)?(\.?lvh\.me)?(\:\d+)?$
I added:
a ? to the first group making the * non-greedy
^,$ to anchor it to the start and end.
a ? to the \. before lvh because you want to match lvh.me:3000 not .lvh.me:3000
A Tokenizing Answer
Just for fun, I decided to see if there was a relatively simple way to do what you wanted without a complicated regular expression. The only regular expressions I used were for splitting and validation.
This works for me with your provided corpus, and several variations.
str = 'hello.world.lvh.me:3000'
tokens = str.split /[.:]/
port = tokens.last =~ /\A\d+\z/ ? ?: + tokens.pop : ''
domain = sprintf '.%s.%s', *tokens.pop(2)
prefix = tokens.join ?.
You'll certainly need to check for empty strings in certain cases, but it seems like it might be more straightforward and/or flexible than a pure regex solution. I find it more readable, anyway. If you truly need a single regular expression, though, I'm sure one of the other answers will help you out.
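One way the empty-string checks might look, as a sketch rather than a drop-in replacement: it swaps the split-based tokenizing for rpartition and an explicit suffix check. The method name tokenize_host and the suffix parameter are hypothetical, and String#delete_suffix assumes Ruby 2.5 or later.

```ruby
# Returns [prefix, domain, port]; missing parts are empty strings.
def tokenize_host(host, suffix = 'lvh.me')
  # rpartition returns ["", "", host] when there is no colon.
  head, colon, tail = host.rpartition(':')
  name, port = colon.empty? ? [tail, ''] : [head, ":#{tail}"]

  if name == suffix
    ['', suffix, port]
  elsif name.end_with?(".#{suffix}")
    [name.delete_suffix(".#{suffix}"), ".#{suffix}", port]
  else
    [name, '', port]
  end
end

p tokenize_host('hello.world.lvh.me:3000') # ["hello.world", ".lvh.me", ":3000"]
p tokenize_host('hello.world:3000')        # ["hello.world", "", ":3000"]
p tokenize_host('lvh.me:3000')             # ["", "lvh.me", ":3000"]
p tokenize_host('hello.world')             # ["hello.world", "", ""]
```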
You could try splitting rather than matching,
irb(main):012:0> "hello.world.lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me", "3000"]
irb(main):013:0> "hello.world:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "3000"]
irb(main):014:0> "lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["lvh.me", "3000"]
irb(main):015:0> "hello.world".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world"]
irb(main):016:0> "hello.world.lvh.me".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me"]
Look, ma, no regex!
def split_up(str)
str.sub(':','.:')
.split('.')
.each_slice(2)
.map { |arr| arr.join('.') }
end
split_up("hello.world.lvh.me:3000") #=> ["hello.world", "lvh.me", ":3000"]
split_up("hello.world:3000") #=> ["hello.world", ":3000"]
split_up("hello.world.lvh.me") #=> ["hello.world", "lvh.me"]
split_up("hello.world") #=> ["hello.world"]
split_up("") #=> []
Steps:
str1 = "hello.world.lvh.me:3000" #=> "hello.world.lvh.me:3000"
str2 = str1.sub(':','.:') #=> "hello.world.lvh.me.:3000"
arr = str2.split('.') #=> ["hello", "world", "lvh", "me", ":3000"]
enum = arr.each_slice(2) #=> #<Enumerator: ["hello", "world", "lvh",
# "me", ":3000"]:each_slice(2)>
enum.to_a #=> [["hello", "world"], ["lvh", "me"],
# [":3000"]]
enum.map { |arr| arr.join('.') } #=> ["hello.world", "lvh.me", ":3000"]
I often remove substrings from strings by doing this:
"don't use bad words like".gsub("bad", "").gsub("words", "").gsub("like", "")
What's a more concise/better way of excising long lists of substrings from a string in Ruby?
I would go with nronas' answer, however people tend to forget about Regexp.union:
str = "don't use bad words like"
str.gsub(Regexp.union('bad', 'words', 'like'), '')
# or
str.gsub(Regexp.union(['bad', 'words', 'like']), '')
You can always use a regex when you are gsubbing :P, like:
str = "don't use bad words like"
str.gsub(/bad|words|like/, '')
I hope that helps
Edit2: Upon reflection, I think what I have below (or any solution that first breaks the string into an array of words) is really what you want. Suppose:
str = "Becky darned her socks before playing badmitten."
bad_words = ["bad", "darn", "socks"]
Which of the following would you want?
str.gsub(Regexp.union(*bad_words), '')
#=> "Becky ed her  before playing mitten."
or
(str.split - bad_words).join(' ')
#=> "Becky darned her before playing badmitten."
Alternatively,
bad_words.reduce(str.split) { |arr,bw| arr.delete(bw); arr }.join(' ')
#=> "Becky darned her before playing badmitten."
:2tidE
Edit1: I've come to my senses and purged my solution. It was much too elaborate (and inefficient) for such a simple problem. I've just left an observation. :1tidE
If you want to end up with just a single space between words, you need to take a different tack:
(str.split - bad_words).join(' ')
#=> "don't use"
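A regex-based sketch of the same idea, for cases where a bad word may be followed by punctuation (which str.split would keep attached to the word). The helper name remove_words is hypothetical.

```ruby
# Removes whole-word occurrences of each entry in words, then tidies spacing.
def remove_words(text, words)
  # \b restricts matches to whole words, so "darned" and "badmitten" survive.
  pattern = /\b(?:#{Regexp.union(words).source})\b/
  # squeeze(' ') collapses every run of spaces, including pre-existing ones.
  text.gsub(pattern, '').squeeze(' ').strip
end

puts remove_words("don't use bad words like", %w[bad words like])
puts remove_words('Becky darned her socks before playing badmitten.', %w[bad darn socks])
```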
I already suggested this to Cary, but it's here:
bad_words = %w[bad words like]
h = Hash.new{|h, k| k}.merge(bad_words.product(['']).to_h)
"don't use bad words like".gsub(/\w+/, h)