Regular expression help - ruby

I am currently doing a bunch of processing on a string using regular expressions with gsub() but I'm chaining them quite heavily which is starting to get messy. Can you help me construct a single regex for the following:
string.gsub(/\.com/,'').gsub(/\./,'').gsub(/&/,'and').gsub(' ','-').gsub("'",'').gsub(",",'').gsub(":",'').gsub("#39;",'').gsub("*",'').gsub("amp;",'')
Basically the above removes the following:
.com
.
,
:
*
switches '&' for 'and'
switches ' ' for '-'
switches ' for ''
Is there an easier way to do this?

You can combine the ones that remove characters:
string.gsub(/\.com|[.,:*]/,'')
The pipe | means "or". The right side of the or is a character class; it means "one of these characters".

A translation table is more scalable as you add more options:
translations = Hash.new
translations['.com'] = ''
translations['&'] = 'and'
...
translations.each{ |from, to| string.gsub from, to }

Building on Tim's answer:
You can pass a block to String.gsub, so you could combine them all, if you wanted:
string.gsub(/\.com|[.,:*& ']/) do |sub|
case(sub)
when '&'
'and'
when ' '
'-'
else
''
end
end
Or, building off echoback's answer, you could use a translation hash in the block (you may need to call translations.default = '' to get this working):
string.gsub(/\.com|[.,:*& ']/) {|sub| translations[sub]}
The biggest perk of using a block is only having one call to gsub (not the fastest function ever).
Hope this helps!

Related

Remove everything from trailing forward slash

Starting with this string:
file = "[/my/directory/file-*.log]"
I want to remove anything that comes after the trailing / and the closing square bracket, so the returned string is:
file = "[/my/directory/]"
I wondered if someone could recommend the safest way to do this.
I have been experimenting with chomp but I’m not getting anywhere, and gsub or sub doesn’t seem to fit either.
You can use File.dirname:
File.dirname("/my/directory/file-*.log")
=> "/my/directory"
If it's stuck inside the brackets, you can always write a custom replacement function that calls out to File.dirname:
def squaredir(file)
file.sub(/\[([^]]+)\]/) do |m|
'[%s]' % File.dirname($1)
end
end
Then you get this:
squaredir("[/my/directory/file-*.log]")
# => "[/my/directory]"
Here are three non-regex solutions:
file[0..file.rindex('/')] << ']'
file.sub(file[file.rindex('/')+1..-2], '')
"[#{File.dirname(file[1..-2])}]"
All return "[/my/directory/]".
file.split('/').take(file.split('/').size - 1).join('/') + ']'
=> "[/my/directory]"
Split the string into an array of strings separated by /
Take all the array elements except the last element
Join the strings together, re-inserting / between them
Add a trailing ]
file = "[/my/directory/file-*.log]"
file.sub(/^(.+)\/([^\/]+)\]/, '\1/]')
=> "[/my/directory/]"

Deleting all special characters from a string - ruby

I was doing the challenges from pythonchallenge writing code in ruby, specifically this one. It contains a really long string in page source with special characters. I was trying to find a way to delete them/check for the alphabetical chars.
I tried using scan method, but I think I might not use it properly. I also tried delete! like that:
a = "PAGE SOURCE CODE PASTED HERE"
a.delete! "!", "#" #and so on with special chars, does not work(?)
a
How can I do that?
Thanks
You can do this
a.gsub!(/[^0-9A-Za-z]/, '')
try with gsub
a.gsub!(/[!#%&"]/,'')
try the regexp on rubular.com
if you want something more general you can have a string with valid chars and remove what's not in there:
a.gsub!(/[^abcdefghijklmnopqrstuvwxyz ]/,'')
When you give multiple arguments to string#delete, it's the intersection of those arguments that is deleted. a.delete! "!", "#" deletes the intersections of the sets ! and # which means that nothing will be deleted and the method returns nil.
What you wanted to do is a.delete! "!#" with the characters to delete passed as a single string.
Since the challenge is asking to clean up the mess and find a message in it, I would go with a whitelist instead of deleting special characters. The delete method accepts ranges with - and negations with ^ (similar to a regex) so you can do something like this: a.delete! "^A-Za-z ".
You could also use regular expressions as shown by #arieljuod.
gsub is one of the most used Ruby methods in the wild.
specialname="Hello!#$#"
cleanedname = specialname.gsub(/[^a-zA-Z0-9\-]/,"")
I think a.gsub(/[^A-Za-z0-9 ]/, '') works better in this case. Otherwise, if you have a sentence, which typically should start with a capital letter, you will lose your capital letter. You would also lose any 1337 speak, or other possible crypts within the text.
Case in point:
phrase = "Joe can't tell between 'large' and large."
=> "Joe can't tell between 'large' and large."
phrase.gsub(/[^a-z ]/, '')
=> "oe cant tell between large and large"
phrase.gsub(/[^A-Za-z0-9 ]/, '')
=> "Joe cant tell between large and large"
phrase2 = "W3 a11 f10a7 d0wn h3r3!"
phrase2.gsub(/[^a-z ]/, '')
=> " a fa dwn hr"
phrase2.gsub(/[^A-Za-z0-9 ]/, '')
=> "W3 a11 f10a7 d0wn h3r3"
If you don't want to change the original string - i.e. to solve the challenge.
str.each_char do |letter|
if letter =~ /[a-z]/
p letter
end
end
You will have to write down your own string sanitize function, could easily use regex and the gsub method.
Atomic sample:
your_text.gsub!(/[!#\[;\]^%*\(\);\-_\/&\\|$\{#\}<>:`~"]/,'')
API sample:
Route: post 'api/sanitize_text', to: 'api#sanitize_text'
Controller:
def sanitize_text
return render_bad_request unless params[:text].present? && params[:text].present?
sanitized_text = params[:text].gsub!(/[!#\[;\]^%*\(\);\-_\/&\\|$\{#\}<>:`~"]/,'')
render_response( {safe_text: sanitized_text})
end
Then you call it
POST /api/sanitize_text?text=abcdefghijklmnopqrstuvwxyz123456<>$!#%23^%26*[]:;{}()`,.~'"\|/

Ruby gsub / regex with several arguments [duplicate]

This question already has answers here:
Match a string against multiple patterns
(2 answers)
Closed 8 years ago.
I'm new to ruby and I'm trying to solve a problem.
I'm parsing through several text field where I want to remove the header which has different values. It works fine when the header always is the same:
variable = variable.gsub(/(^Header_1:$)/, '')
But when I put in several arguments it doesn't work:
variable = variable.gsub(/(^Header_1$)/ || /(^Header_2$)/ || /(^Header_3$)/ || /(^Header_4$)/ || /^:$/, '')
You can use Regexp.union:
regex = Regexp.union(
/^Header_1/,
/^Header_2/,
/^Header_3/,
/^Header_4/,
/^:$/
)
variable.gsub(regex, '')
Please note that ^something$ will not work on strings containing something more than something :)
Cause ^ is for matching beginning of string and $ is for end of string.
So i intentionally removed $.
Also you do not need brackets when you only need to remove the matched string.
You can also use it like this:
headers = %w[Header_1 Header_2 Header_3]
regex = Regexp.union(*headers.map{|s| /^#{s}/}, /^\:$/, /etc/)
variable.gsub(regex, '')
And of course you can remove headers without explicitly define them.
Most likely there are a white space after headers?
If so, you can do it as simple as:
variable = "Header_1 something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
variable = "Header_BLAH something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
Just use a proper regexp:
variable.gsub(/^(Header_1|Header_2|Header_3|Header_4|:)$/, '')
If the header is always the same format of Header_n, where n is some integer value, then you can simplify your regex greatly:
/Header_\d+/
will find every one of these:
%w[Header_1 Header_2 Header_3].grep(/Header_\d+/)
[
[0] "Header_1",
[1] "Header_2",
[2] "Header_3"
]
Tweaking it to handle finding words, not substrings:
/^Header_\d+$/
or:
/\bHeader_\d+\b/
As mentioned, using Regexp.union is a good start, but, used blindly, can result in very slow or inefficient patterns, so think ahead and help out the engine by giving it useful sub-patterns to work with:
values = %w[foo bar]
/Header_(?:\d+|#{ values.join('|') })/
=> /Header_(?:\d+|foo|bar)/
Unfortunately, Ruby doesn't have the equivalent to Perl's Regexp::Assemble module, which can build highly optimized patterns from big lists of words. Search here on Stack Overflow for examples of what it can do. For instance:
use Regexp::Assemble;
my #values = ('Header_1', 'Header_2', 'foo', 'bar', 'Header_3');
my $ra = Regexp::Assemble->new;
foreach (#values) {
$ra->add($_);
}
print $ra->re, "\n";
=> (?-xism:(?:Header_[123]|bar|foo))

Why pipes are not deleted using "gsub" in Ruby?

I would like to delete from notes everything starting from the example_header. I tried to do:
example_header = <<-EXAMPLE
-----------------
---| Example |---
-----------------
EXAMPLE
notes = <<-HTML
Hello World
#{example_header}
Example Here
HTML
puts notes.gsub(Regexp.new(example_header + ".*", Regexp::MULTILINE), "")
but the output is:
Hello World
||
Why || isn't deleted?
The pipes in your regular expression are being interpreted as the alternation operator. Your regular expression will replace the following three strings:
"-----------------\n---"
" Example "
"---\n-----------------"
You can solve your problem by using Regexp.escape to escape the string when you use it in a regular expression (ideone):
puts notes.gsub(Regexp.new(Regexp.escape(example_header) + ".*",
Regexp::MULTILINE),
"")
You could also consider avoiding regular expressions and just using the ordinary string methods instead (ideone):
puts notes[0, notes.index(example_header)]
Pipes are part of regexp syntax (they mean "or"). You need to escape them with a backslash in order to have them count as actual characters to be matched.

Ruby: Escaping special characters in a string

I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\)/
string.gsub(pattern){|match|"\\" + match} # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
I have changed above function like this:
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
string.gsub(pattern){|match|"\\" + match}
end
This is working great for regex
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at escape_string / quote method in Mysql class here

Resources