gsub a special character - ruby

Hi,
I am trying to use gsub to remove the character ’. Be careful: it is not ' or `; I think it comes from Microsoft Word.
I really don't understand why I can't remove this character, because I can remove all the others.
When I use gsub like this:
pattern = /(\’|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
restring = string.gsub(pattern){|match|" " }
I get the error below:
syntax error, unexpected $end, expecting keyword_end
pattern = /(\’|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
^

When I ran your RegEx through Rubular's site, I got this;
I figured it was a UTF-8 issue, and after some additional searching on Stack Overflow, it seems to be pretty common in a Rails app to add # encoding: utf-8 to the top of your file.

You might add the following to your regex:
/\u2018|\u2019|\u201A/
which are some curly single quotes: ["‘", "’", "‚"].
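As a rough sketch of that suggestion, assuming the goal is still the one from the question (replace each listed character with a space): writing the curly quotes as \u escapes keeps the source file pure ASCII, so the parser never has to read a raw multibyte character. The method name strip_specials and the sample string are made up for illustration.

```ruby
# Hypothetical helper: replaces the characters listed in the question,
# plus the curly single quotes, with a space.
def strip_specials(string)
  # A character class instead of a long alternation; "-" goes last so it
  # is taken literally rather than as a range.
  specials = /[\u2018\u2019\u201A"'.*\/\\)$+(^?!~`-]/
  string.gsub(specials, " ")
end

sample = "don\u2019t (stop)"
puts strip_specials(sample)
```

A character class also avoids having to escape most of the punctuation one item at a time.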
In case you're interested, here is a simple method I've used before for cleaning up Word text (pieced together from a number of resources online):
# Inside a character class, | is a literal pipe rather than "or",
# so the classes below list the code points without it.
def replace(text)
  text.
    gsub(/[\u2018\u2019\u201A]/, "'").
    gsub(/[\u201C\u201D\u201E]/, "\"").
    gsub(/\u2026/, "...").
    gsub(/[\u2013\u2014]/, "-").
    gsub(/\u02C6/, "^").
    gsub(/\u2039/, "<").
    gsub(/\u203A/, ">").
    gsub(/[\u02DC\u00A0]/, " ")
end

Related

Delete character in an XML file using Ruby

I am working with Ruby, and I want to delete all the \ characters from my XML file.
Here is my XML file:
<w:numId w:val=\"2\"/></w:numPr></w:pPr><w:bookmarkStart w:id=\"0\" w:name=\"__DdeLink__0_226207805\"/><w:bookmarkEnd w:id=\"0\"/><w:r><w:rPr></w:rPr><w:t>Serve high quality food</w:t></w:r></w:p>, <w:p><w:pPr><w:pStyle w:val=\"style17\"/><w:numPr><w:ilvl w:val=\"0\"/><w:numId w:val=\"2\"/></w:numPr></w:pPr><w:bookmarkStart w:id=\"0\" w:name=\"__DdeLink__0_226207805\"/><w:bookmarkEnd w:id=\"0\"/>
There's actually no backslash character (\) in your file. The backslash in your example simply escapes the following double-quote and prevents it terminating the string and thereby resulting in a syntax error due to an unterminated double-quote.
What you see when you print that string in IRB is actually not the backslash as is, but the backslash in combination with the following double-quote as an indication that the double-quote is escaped. The idea is kind of hard to grasp when you encounter it the first time. Have a look at "Escape sequences".
In short, there is no backslash in your file, so you can't remove it.
Let me explain with an example:
> text = "This is sample text for escape character\""
#=> "This is sample text for escape character\""
Is equivalent to:
> text = 'This is sample text for escape character"'
#=> "This is sample text for escape character\""
The backslash (\) shown in that output only goes away if you remove the " it escapes:
> text.tr!('"', '')
#=> "This is sample text for escape character"
I hope this makes it clear.
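If it helps, here is a small sketch of the same point using a fragment shaped like the XML in the question (the fragment itself is just an illustrative sample). It compares what puts prints with what p / inspect prints, and checks directly whether the string contains a backslash.

```ruby
text = 'w:val="2"'        # nine characters, none of them a backslash

puts text                 # prints the raw contents: w:val="2"
p text                    # prints the inspected form: "w:val=\"2\""
puts text.include?("\\")  # prints false, there is no real backslash
puts text.count('"')      # prints 2, only the quotes are there
```

IRB shows return values in the inspected form, which is where the \" comes from.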
Thank you guys for your answers. Here is what I did, and it worked as I wanted:
text = ''
File.open("#{temp_dir}/plan_report_template/word/document.xml").each { |line|
text << line
}
open("#{temp_dir}/plan_report_template/word/document.xml", "w") { |file| file.write(text.gsub('\"', '"')) }

Reformatting dates

I'm trying to reformat German dates (e.g. 13.03.2011 to 2011-03-13).
This is my code:
str = "13.03.2011\n14:30\n\nHannover Scorpions\n\nDEG Metro Stars\n60\n2 - 3\n\n\n\n13.03.2011\n14:30\n\nThomas Sabo Ice Tigers\n\nKrefeld Pinguine\n60\n2 - 3\n\n\n\n"
str = str.gsub("/(\d{2}).(\d{2}).(\d{4})/", "/$3-$2-$1/")
The output is the same as the input. I also tried my code with and without the leading and trailing slashes, but I don't see a difference. Any hints?
I tried to store my regexes in variables like find = /(\d{2}).(\d{2}).(\d{4})/ and replace = /$3-$2-$1/, so my code looked like this:
str = "13.03.2011\n14:30\n\nHannover Scorpions\n\nDEG Metro Stars\n60\n2 - 3\n\n\n\n13.03.2011\n14:30\n\nThomas Sabo Ice Tigers\n\nKrefeld Pinguine\n60\n2 - 3\n\n\n\n"
find = /(\d{2}).(\d{2}).(\d{4})/
replace = /$3-$2-$1/
str = str.gsub(find, replace)
TypeError: no implicit conversion of Regexp into String
from (irb):4:in `gsub'
Any suggestions for this problem?
The first mistake is the regex delimiter. You do not need to write the regex as a string; just place it between delimiters such as //.
The second mistake is that you refer to the captured groups as $1. In the replacement string, write them as \\1 instead.
str = str.gsub(/(\d{2})\.(\d{2})\.(\d{4})/, "\\3-\\2-\\1")
Also, notice that I have escaped the . character as \., because in a regex an unescaped . means any character except \n.
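For what it's worth, here is a short sketch of two equivalent ways to write that replacement, using a shortened version of the string from the question. The variable names are only illustrative. In a single-quoted replacement the backreferences need just one backslash, and in the block form the globals $1 to $3 are available because the block runs once per match.

```ruby
german_date = /(\d{2})\.(\d{2})\.(\d{4})/

str = "13.03.2011\n14:30\n\nHannover Scorpions"

# Single-quoted replacement string: \3, \2 and \1 are backreferences.
with_backrefs = str.gsub(german_date, '\3-\2-\1')

# Block form: $1, $2 and $3 refer to the groups of the current match.
with_block = str.gsub(german_date) { "#{$3}-#{$2}-#{$1}" }

puts with_backrefs.lines.first
```

Writing "$3-$2-$1" directly as the second argument does not work, because the string is built before gsub has matched anything.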

Why is ruby warning me over this regex?

I want to remove a single * character and any white space from the start of a string.
This is the regex I have /^\*{1}(?:\s+)?/
Here's a Rubular link http://rubular.com/r/r5i4FpQdK2
However Ruby is throwing a warning when I try to use it.
001:0> regex = /^\*{1}(?:\s+)?/
warning: nested repeat operator + and ? was replaced with '*': /^\*{1}(?:\s+)?/
=> /^\*{1}(?:\s+)?/
The actual test still works
002:0> "* foo" =~ regex
=> 0
but I can't figure out what's causing the warning.
Any advice would be appreciated.
Instead of (?:\s+)? use (?:\s*) or just \s*
\s+ allows one or more whitespace characters and the following ? makes the group optional; together that is the same as zero or more whitespace characters, which is exactly what \s* means.
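A quick sketch of the simplified pattern in use, with a few made-up sample strings. The {1} quantifier is dropped as well, since a bare \* already matches exactly one asterisk.

```ruby
regex = /^\*\s*/   # one literal *, then any amount of whitespace

samples = ["* foo", "*foo", "*   foo", "foo"]
stripped = samples.map { |s| s.sub(regex, "") }

stripped.each { |s| puts s.inspect }
```

Note that in Ruby ^ anchors at the start of any line; if the string can contain newlines and only the very beginning should match, \A is the stricter anchor.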

Ruby: Escaping special characters in a string

I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
  pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
  string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it (I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
  pattern = /(\'|\"|\.|\*|\/|\-|\\)/
  string.gsub(pattern){|match|"\\" + match} # prefix each match with a backslash
end
I have changed the above function like this:
def self.escape_characters_in_string(string)
  pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
  string.gsub(pattern){|match|"\\" + match}
end
This works great for regex characters.
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
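To spell that out a little, here is a sketch with both forms side by side; the method names are made up. Two things appear to go wrong in the question's version: %r{ ... } without the x flag treats the spaces around the group as part of the pattern, and the replacement '\\\0' ends up as a literal backslash followed by the digit 0. In a replacement string, \\ stands for one backslash and \0 for the whole match, so the single-quoted literal needs five backslashes before the 0.

```ruby
# Block form: the replacement is ordinary Ruby, so no double escaping.
def escape_with_block(string)
  string.gsub(/['".*\/\\-]/) { |match| "\\" + match }
end

# Replacement-string form: '\\\\\0' is the three-backslash text \\\0,
# which gsub reads as "literal backslash, then the whole match".
def escape_with_backref(string)
  string.gsub(/['".*\/\\-]/, '\\\\\0')
end

start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
puts escape_with_block(start_string)
```

The block form is usually easier to read, which is probably why both answers above use it.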
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at the escape_string / quote methods in the Mysql class.

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses a bunch of Java-based HTML template files and replaces each period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using Ruby's regex support. So I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then create another replacement for the whole string with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second substring it should otherwise be finding. Stranger still, when testing this same pattern and the same test text on rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  tagMatch = tagRegex.match(htmlLine)
  if(tagMatch)
    matchesArray = tagMatch.to_a
    firstMatch = matchesArray[0]
    secondMatch = matchesArray[1]
    puts "First match: #{firstMatch} and second match #{secondMatch}"
    tagMatch.captures.each {|lineMatchCapture|
      puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
      #create a new regex using the match results; make sure to use auto escape method
      originalPatternString = Regexp.escape(lineMatchCapture)
      replacementRegex = Regexp.new(originalPatternString)
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    }
  end
}
I would think that I should get the first tag in matchData[0] and the second tag in matchData[1]. What I'm really doing, because I don't know how many matches I'll get within any given line, is matchData.to_a.each. In this case, matchData has two captures, but they're both the first tag match,
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, and why does the Rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for your next Ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, it improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
  if(oldMatches.size > 0)
    oldMatches.each_index do |index|
      arrayMatch = oldMatches[index]
      aMatch = arrayMatch[0]
      #create a new regex using the match results; make sure to use auto escape method
      replacementRegex = Regexp.new(Regexp.escape(aMatch))
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line, matching against the new escaped literal regex
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    end # I kind of still prefer the brackets...;-)
  end
end
Now, why does MatchData work the way it does? It seems like its behavior is a bug really, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
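On the "why": as far as I can tell it is not a bug. Regexp#match only ever returns the first match, and MatchData#to_a is laid out as [whole match, group 1, group 2, ...]. When a single capture group wraps the entire pattern, elements 0 and 1 hold the same text, which looks like the first result repeated. Below is a small sketch with a shortened, made-up test line; it also shows gsub with a block, which does the period replacement in one pass without building a second regex.

```ruby
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"
tag  = /(<webobject name=(?:\w+\.)+\w+>)/i

m = line.match(tag)
p m.to_a           # whole first match, then capture group 1 (same text)

all = line.scan(tag)
p all              # one inner array per match, because the pattern has a group

# gsub with a block visits every match, so the dots can be swapped in place.
fixed = line.gsub(tag) { |found| found.tr(".", "_") }
puts fixed
```

Without the outer parentheses in the pattern, scan returns a flat array of strings instead of an array of arrays.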
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').

Resources