Ruby gsub with string manipulation - ruby

I am new to Ruby and am writing an expression that replaces the string between XML tags with a hash of that value.
I did the following to replace it with the new password:
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,'New \0')
RESULT: <password>New check1</password> (EXPECTED)
What I actually want is the MD5 checksum of the value "New check1", like this:
<password>6aaf125b14c97b307c85fc6e681c410e</password>
I tried it in the following ways and none of them was successful (I have included the required libraries "require 'digest'").
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,Digest::MD5.hexdigest('\0'))
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,Digest::MD5.hexdigest '\0')
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/, "Digest::MD5.hexdigest \0")
Any help in achieving this is very much appreciated.

This will work:
require 'digest'
line = "<other>stuff</other><password>check1</password><more>more</more>"
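# A bare `pwd` is not defined in the replacement argument; use a block and read the named capture from $~
line.sub(/<password>(?<pwd>[^<]+)<\/password>/) { "<password>#{Digest::SHA2.hexdigest($~[:pwd])}</password>" }
=> "<other>stuff</other><password>8a859fd2a56cc37285bc3e307ef0d9fc1d2ec054ea3c7d0ec0ff547cbfacf8dd</password><more>more</more>"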
Make sure the input is processed one line at a time, and you'll probably want sub, not gsub.
P.S.: I agree with Tom Lord's comment. If your XML is not gargantuan in size, try to use an XML library to parse it, such as Ox or Nokogiri.
Different libraries have different advantages.

This is a variant of Tilo's answer.
require 'digest'
line = "<other>stuff</other><password>check1</password><more>more</more>"
r = /(?<=<password>).+?(?=<\/password>)/
line.sub(r) { |pwd| Digest::SHA2.hexdigest(pwd) }
#=> "<other>stuff</other><password>8a859fd2a56cc37285bc3e307ef0d9f
# c1d2ec054ea3c7d0ec0ff547cbfacf8dd</password><more>more</more>"
(I've displayed the returned string on two lines to make it readable without the need for horizontal scrolling.)
The regular expression reads, "match '<password>' in a positive lookbehind ((?<=...)), followed by one or more characters, lazily ('?'), followed by the string '</password>' in a positive lookahead ((?=...))".
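The attempts in the question fail for a shared reason: Ruby evaluates the replacement argument before gsub runs, so Digest::MD5.hexdigest('\0') hashes the two literal characters backslash and zero rather than the matched text. A minimal sketch of the block form applied to the question's MD5 case follows; the method name hash_password_tag is made up for illustration.

```ruby
require 'digest'

# Hypothetical helper: replaces the text inside <password>...</password>
# with the MD5 hex digest of "New " followed by that text.
def hash_password_tag(xml)
  xml.gsub(/(?<=<password>)[^<]+(?=<\/password>)/) do |pwd|
    # The block is evaluated once per match, so pwd is the matched text.
    Digest::MD5.hexdigest("New #{pwd}")
  end
end

puts hash_password_tag("<password>check1</password>")
```

Because the lookarounds do not consume the tags, only the password text is replaced and the surrounding markup is left alone.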

Related

Ruby Base64 check if it's encoded [duplicate]

I may receive either of these two strings:
base = Base64.encode64(File.open("/home/usr/Desktop/test", "rb").read)
=> "YQo=\n"
string = File.open("/home/usr/Desktop/test", "rb").read
=> "a\n"
What I have tried so far is to check the string with a regular expression, i.e. /([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==$)/, but this would be very heavy if the file is big.
I have also tried base.encoding.name and string.encoding.name, but both return the same value.
I have also seen this post and found a regular expression solution, but is there any other solution?
Any ideas? I just want to know whether the string is plain text or Base64-encoded text.
You can use something like this. It is not very performant, but it returns true only for strings that are exactly what strict Base64 encoding would produce:
require 'base64'
def base64?(value)
  value.is_a?(String) && Base64.strict_encode64(Base64.decode64(value)) == value
end
The use of strict_encode64 versus encode64 prevents Ruby from inadvertently inserting newlines if you have a long string. See this post for details.
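As a rough alternative sketch, Base64.strict_decode64 raises ArgumentError on input that is not strictly valid Base64, so the check can avoid re-encoding. The method name strict_base64? and the ignore_newlines option are assumptions made for this example. Two caveats apply to any such check: the output of encode64 (such as "YQo=\n") contains newlines and fails a strict test unless they are removed first, and ordinary text that happens to be valid Base64 (such as "abcd") will still pass.

```ruby
require 'base64'

# Hypothetical helper: true when value decodes cleanly as strict Base64.
# ignore_newlines: true strips the "\n" characters that Base64.encode64 inserts.
def strict_base64?(value, ignore_newlines: false)
  return false unless value.is_a?(String)

  candidate = ignore_newlines ? value.delete("\n") : value
  Base64.strict_decode64(candidate) # raises ArgumentError on invalid input
  true
rescue ArgumentError
  false
end

puts strict_base64?("YQo=")
puts strict_base64?("a\n")
puts strict_base64?("YQo=\n", ignore_newlines: true)
```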

Regex for series of four digits each up to 100

I'm trying to write a regex to validate a string so that it accepts only a series of four comma-separated numbers, each up to 100. Something like this would be valid:
20,30,40,50
and these invalid:
120,0,20,0
20,30,40,ss
invalid_string
Any thoughts?
They're used for CMYK colours. We just need to store them here, not use them.
Number Range and Subroutine
In Ruby 2+, for a compact regex, use this:
^([0-9]|[1-9][0-9]|100)(?:,\g<1>){3}$
Explanation
The ^ anchor asserts that we are at the beginning of a line (in Ruby, use \A instead to anchor to the beginning of the whole string)
The parentheses around ([0-9]|[1-9][0-9]|100) match a number from 0 to 100 and define subroutine #1
(?:,\g<1>) matches one comma and the expression defined by subroutine #1
The {3} quantifier repeats that three times
The $ anchor asserts that we are at the end of a line (in Ruby, use \z instead to anchor to the end of the whole string)
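A small sketch of the pattern in use, run against the examples from the question. The constant names are made up for illustration. It also shows the difference between the line anchors (^ and $) and the whole-string anchors (\A and \z) when the input contains a newline.

```ruby
# \A and \z anchor to the whole string; ^ and $ anchor to each line in Ruby.
CMYK_STRICT = /\A([0-9]|[1-9][0-9]|100)(?:,\g<1>){3}\z/
CMYK_LINE   = /^([0-9]|[1-9][0-9]|100)(?:,\g<1>){3}$/

["20,30,40,50", "120,0,20,0", "20,30,40,ss", "invalid_string"].each do |s|
  puts "#{s.inspect}: #{CMYK_STRICT.match?(s)}"
end

# A multi-line input: the first line alone satisfies the line-anchored version.
multi = "20,30,40,50\nsomething else"
puts CMYK_LINE.match?(multi)
puts CMYK_STRICT.match?(multi)
```

For validating user input, the \A...\z form is usually the safer choice, since it rejects anything before or after the four numbers.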
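I'd save myself the headache of using regex for a number-related problem. Also, the validation message will look awkward, so it's better to write your own:
validate :that_string_has_only_4_numbers_upto_100
def that_string_has_only_4_numbers_upto_100
  parts = str.to_s.split(',', -1)
  # The range needs parentheses: `1..100 === n` parses as `1..(100 === n)`.
  # 0 is allowed here on the assumption that a CMYK component can be 0.
  valid = parts.size == 4 && parts.all? { |n| n.match?(/\A\d+\z/) && (0..100).cover?(n.to_i) }
  errors.add(:str, 'is not valid.') unless valid
end
Unless you are a regex Jedi guru like @zx81 :p.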
Try this (note that it only accepts one- or two-digit numbers, so 100 itself is rejected):
^(?:\d{1,2},){3}\d{1,2}$

Check if a string contains a character in a unicode range (using Ruby)

I want to create a simple function in Ruby that will check if the given string contains any unicode characters in the ranges such as the following:
U+007B -- U+00BF
U+02B0 -- U+037F
U+2000 -- U+2BFF
How can I accomplish this? Google is coming up blank for me, all things about removing unicode characters or checking if a string contains unicode.
The easiest thing would probably be a regex using String#index, String#match, or even String#[]:
string.index(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string.match(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string[/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/]
All three will give you nil (which is falsey) if they don't find the pattern and non-nil (which will be truthy) if they do.
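Wrapped up as the "simple function" the question asks for, a minimal sketch could look like this. The names TARGET_RANGES and in_target_ranges? are made up for illustration, and String#match? is used because only a true/false answer is needed.

```ruby
# Character class covering the three ranges from the question.
TARGET_RANGES = /[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/

# Hypothetical helper: true if any character of string falls in one of the ranges.
def in_target_ranges?(string)
  string.match?(TARGET_RANGES)
end

puts in_target_ranges?("{ How are you ?}") # contains "{" (U+007B)
puts in_target_ranges?("hello")
puts in_target_ranges?("arrow \u2190")     # U+2190 lies in U+2000..U+2BFF
```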
I would do as below:
my_string = "{ How are you ?}"
puts my_string.chars.any? { |chr| ("\u007B".."\u00BF").include?(chr) }
#=> true

Ruby Regexp - Matching multiple result when within markup

I have the following string:
nothing to match
<-
this rocks should match as should this still and this rocks and still
->
should not match still or rocks
<- no matches here ->
And I want to find all matches of 'rocks' and 'still', but only when they are within <- and ->.
The purpose is to markup glossary words but be able to only mark them up in areas of text that are defined by the editor.
I currently have:
<-.*?(rocks|still).*?->
This unfortunately only matches the first 'rocks' and ignores all subsequent instances, as well as every 'still'.
I have this set up in Rubular.
The usage of this will be something like:
Regexp.new( '<-.*?(' + self.all.map{ |gt| gt.name }.join("|") + ').*?->', Regexp::IGNORECASE, Regexp::MULTILINE )
Thanks in advance for any help
There may be a way to do this with a single regex, but it will probably be simpler to just do it in two steps. First match all of the markups, and then search the markups for the glossary words:
text = <<END
nothing to match
<-
this rocks should match as should this still and this rocks and still
->
should not match still or rocks
<- no matches here ->
END
text.scan(/<-.*?->/m).each do |match|
  print match.scan(/rocks|still/), "\n"
end
Also, you should probably note that regex is only a good solution here if there is never any nested markup (<-...<-...->...->) and no escaped <- or -> whether it is inside or outside of a markup.
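Since the stated goal is to mark up glossary words, the same two-step idea can be sketched with nested gsub calls: the outer call visits each <- ... -> region and the inner call rewrites the words inside it. The method name mark_glossary and the square-bracket markup are assumptions for this example, and the same caveat about nested or escaped delimiters applies.

```ruby
# Hypothetical helper: wraps each glossary word in [ ] when it appears
# inside a <- ... -> region, and leaves text outside those regions untouched.
def mark_glossary(text, words)
  # Regexp.union escapes each word; rebuild it with the case-insensitive flag.
  word_re = Regexp.new(Regexp.union(words).source, Regexp::IGNORECASE)
  text.gsub(/<-.*?->/m) do |block|
    block.gsub(word_re) { |w| "[#{w}]" }
  end
end

text = <<END
nothing to match
<-
this rocks should match as should this still and this rocks and still
->
should not match still or rocks
<- no matches here ->
END

puts mark_glossary(text, %w[rocks still])
```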
Don't forget your Ruby string methods. Use them first before considering regular expressions:
$ ruby -0777 -ne '$_.split("->").each{|x| x.split("<-").each{|y| puts "#{y}" if (y[/rocks.*still/]) } }' file
In Ruby, it depends on what you want to do with the regexp. You're matching a regular expression against a string, so you'll be using String methods. Some of these act on all matches (e.g. gsub or scan); others act on only a single match (e.g. index, =~).
If you're working with any of the latter, you'll want a loop that calls the method again, starting from a later offset. For example:
# A method to print the indices of all matches
def print_match_indices(string, regex)
  # index searches forward from the given offset (rindex would search backward)
  i = string.index(regex, 0)
  while !i.nil? do
    puts i
    i = string.index(regex, i + 1)
  end
end
(Yes, you can use split first, but I expect that a regex loop like the foregoing would require fewer system resources.)
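As a point of comparison, the block form of String#scan can collect match positions without a manual loop, because Regexp.last_match is set for each match inside the block. This is a sketch with a made-up method name, match_indices. Note that scan reports non-overlapping matches only, whereas a loop that restarts one character after each hit can also find overlapping ones.

```ruby
# Hypothetical helper: returns the start index of every non-overlapping match.
def match_indices(string, regex)
  indices = []
  # Inside the block, Regexp.last_match refers to the current match.
  string.scan(regex) { indices << Regexp.last_match.begin(0) }
  indices
end

p match_indices("rocks and still rocks", /rocks|still/)
```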

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through a bunch of Java-based HTML template files to replace a period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using Ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/./, '_'), then create another replacement as a whole string with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second substring it should otherwise be finding. Stranger still, when testing this same pattern and the same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  tagMatch = tagRegex.match(htmlLine)
  if(tagMatch)
    matchesArray = tagMatch.to_a
    firstMatch = matchesArray[0]
    secondMatch = matchesArray[1]
    puts "First match: #{firstMatch} and second match #{secondMatch}"
    tagMatch.captures.each {|lineMatchCapture|
      puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
      #create a new regex using the match results; make sure to use auto escape method
      originalPatternString = Regexp.escape(lineMatchCapture)
      replacementRegex = Regexp.new(originalPatternString)
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    }
  end
}
I would think that I should get the first tag in matchData[0] and then the second tag in matchData[1]. What I'm really doing, because I don't know how many matches I'll get within any given line, is matchData.to_a.each. In this case, matchData has two captures, but they're both the first tag match,
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, and why does the Rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for your next Ruby questions:
newlines and spaces are your friends; you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}; it improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
  if(oldMatches.size > 0)
    oldMatches.each_index do |index|
      arrayMatch = oldMatches[index]
      aMatch = arrayMatch[0]
      #create a new regex using the match results; make sure to use auto escape method
      replacementRegex = Regexp.new(Regexp.escape(aMatch))
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line, matching against the new escaped literal regex
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    end # I kind of still prefer the brackets...;-)
  end
end
Now, why does MatchData work the way it does? Its behavior seems like a bug, really, and it is certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02.
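On that closing question, one reading (not confirmed anywhere in this thread) is that a MatchData object describes a single match: index 0 is the whole match and index 1 is capture group 1. Since the original pattern wraps the entire tag in one group, both entries hold the first tag, and match never goes looking for a second one. The sketch below illustrates this and shows a block-form gsub that performs the period-to-underscore replacement in one pass. The names TAG, SAMPLE and underscore_tag_names, and the shortened sample line, are made up for the example.

```ruby
TAG = /<webobject name=(?:\w+\.)+\w+>/i

# Hypothetical helper: replaces periods with underscores inside each tag only.
def underscore_tag_names(line)
  line.gsub(TAG) { |tag| tag.tr('.', '_') }
end

SAMPLE = "<WEBOBJECT NAME=admin.normalMode.x>text<WEBOBJECT NAME=admin.Second>more.text"

# Same shape as the pattern in the question: one capture group around the tag.
m = /(#{TAG.source})+/i.match(SAMPLE)
p m[0] == m[1]          # whole match and group 1 hold the same text
p m.captures.size       # one group, so one capture
p SAMPLE.scan(TAG).size # scan keeps searching after the first match

puts underscore_tag_names(SAMPLE)
```

With the block form there is no need to build a second, escaped regex from each match, because the replacement is computed from the matched text directly.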
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps, but I'll keep toiling at it. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').
