I have a string like this:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I want to extract the somename value from the string above. The values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change, but the syntax will stay the same as above.
I saw similar posts on this site, but I don't know how to use a regex in this case.
Any ideas on how to extract that value?
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!
First, look at what delimits that value: it has quotes, with a colon to the left and a comma to the right. However, the same layout is also used near the command and near the Pwd, so colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere else in the input data. This is just great! It means we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
now, we know:
where the part starts
and that after the "Name" text there's always a colon, a quote, and then the-interesting-data
and that there's always a quote after the interesting-data
let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
index returns the start position of the match, so it returns the position of the : and the position of the \. Since we want the text between them, we must add or subtract a few positions to move past the :\" at the beginning or to move back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
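For reference, here are the steps above put together as one runnable sketch. It assumes inputdata holds the text from the question with literal backslashes (hence the single-quoted literal, joined onto one line here); the variable marker is just a name introduced for illustration.

```ruby
# Sample input from the question. Single quotes keep every backslash literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

marker = '\"Name\"'                     # backslash, quote, Name, backslash, quote
estposition = inputdata.index(marker)
raise "well-known marker was not found in the input" unless estposition

colonquote = inputdata.index(':\"', estposition)  # position of the colon after the marker
datastart  = colonquote + 3                       # skip the colon, the backslash and the quote
lastquote  = inputdata.index('\"', datastart)     # position of the closing backslash
dataend    = lastquote - 1                        # last character of the value

value = inputdata[datastart..dataend]
puts value
```

If the marker is missing, the raise fires before any index arithmetic is attempted, so a nil position never reaches the slicing step.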
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says "a colon or a comma". Now, in parts you will get all the, well, parts that were produced by cutting the inputdata to pieces at every colon or comma.
If the layout never changes, then your interesting-data will always be the 13th element (index 12):
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there are 2 chars to cut from each side:
value = almostvalue[2..-3]
Ok, another way. Since regexes already showed up, let's try them. We know:
data is prefixed with \"Name\", then a colon and a backslash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a backslash-quote
The parts in regex syntax would be, respectively (every literal backslash has to be doubled):
\\"Name\\":\\"
[^\"]*
\\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
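If you would rather not rely on $1, String#[] accepts a regex plus a capture index and returns nil when nothing matches. A small sketch along those lines; the helper name field is made up for this example, and it assumes values never contain quotes or backslashes:

```ruby
# Sample input from the question. Single quotes keep every backslash literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# Hypothetical helper: returns the text stored under `key`, or nil if the key is absent.
# The character class [^"\\] stops at the first quote or backslash.
def field(input, key)
  pattern = /\\"#{Regexp.escape(key)}\\":\\"([^"\\]*)\\"/
  input[pattern, 1]
end

name = field(inputdata, "Name")
pwd  = field(inputdata, "Pwd")
puts name
puts pwd
```

The same helper works for Id as well, because colons inside the value are neither quotes nor backslashes.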
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colons were replaced with =>, and the backslashes before the quotes are gone. Oh, and the opening " is gone, and so is the \0" at the end. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about colons.. We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part.. But, look: all the other colons have always a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
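If eval makes you nervous, the same cleanup can feed a JSON parser instead, with no need to rewrite the colons. This is only a sketch: it assumes the wrapper is exactly one leading " and a trailing \0" as in the question's str1, and that no value contains an escaped quote of its own.

```ruby
require 'json'

# Sample input from the question. Single quotes keep every backslash literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

inner = inputdata[1..-4]          # drop the leading " and the trailing \0"
inner = inner.gsub('\"', '"')     # turn every backslash-quote into a plain quote

packeddata = JSON.parse(inner)    # parses data only, never executes code
value = packeddata['#Network']['data']['Name']
puts value
```

Unlike eval, JSON.parse raises JSON::ParserError on anything that is not valid JSON, so malformed or hostile input cannot run as Ruby.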
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data looked so similar to a Ruby hash because it is formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library, when asked to read the data, will read all of it, and only then can you ask "what was at the key xx/yy/zz", just like I showed with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then scanning with a regex or scanning manually for "known markers" may (not must) be much faster and thus preferable. But, still, much less convenient. Have fun.
I stumbled over this problem using the following simplified example:
line = searchstring.dup
line.gsub!(Regexp.escape(searchstring)) { '' }
My understanding was that, for every String stored in searchstring, the gsub! would leave line empty afterwards. Indeed, this is the case for many strings, but not for this one:
searchstring = "D "
line = searchstring.dup
line.gsub!(Regexp.escape(searchstring)) { '' }
p line
It turns out that line is printed as "D " afterwards, i.e. no replacement was performed.
This happens to any searchstring containing a space. Indeed, if I do a
p(Regexp.escape(searchstring))
for my example, I see "D\\ " being printed, while I would expect to get "D " instead. Is this a bug in the Ruby core library, or did I misuse the escape function?
Some background: In my concrete application, where this simplified example is derived from, I just want to do a literal string replacement inside a long string, in the following way:
REPLACEMENTS.each do |from, to|
  line.chomp!
  line.gsub!(Regexp.escape(from)) { to }
end
I'm using Regexp.escape just as a safety measure in case the string being replaced contains a regex metacharacter.
I'm using the Cygwin port of MRI Ruby 2.6.4.
line.gsub!(Regexp.escape(searchstring)) { '' }
My understanding was, that for every String stored in searchstring, the gsub! would cause that line is afterwards empty.
Your understanding is incorrect. The guarantee in the docs is
For any string, Regexp.new(Regexp.escape(str))=~str will be true.
This does hold for your example
Regexp.new(Regexp.escape("D "))=~"D " # => 0
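However, gsub! treats a String pattern as literal text rather than as a regular expression, so the escaped string has to be wrapped in Regexp.new. This is what your code should look like: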
line.gsub!(Regexp.new(Regexp.escape(searchstring))) { '' }
As for why this is the case, there used to be a bug where Regexp.escape would incorrectly handle space characters:
# in Ruby 1.8.4
Regexp.escape("D ") # => "D\\s"
My guess is they tried to keep the fix as simple as possible by replacing 's' with ' '. Technically this does add an unnecessary escape character but, again, that does not break the intended use of the method.
This happens to any searchstring containing a space. Indeed, if I do a
p(Regexp.escape(searchstring))
for my example, I see "D\\ " being printed, while I would expect to get "D " instead. Is this a bug in the Ruby core library, or did I misuse the escape function?
This looks to be a bug. In my opinion, whitespace is not a Regexp meta character, there is no need to escape it.
Some background: In my concrete application, where this simplified example is derived from, I just want to do a literal string replacement inside a long string […]
If you want to do literal string replacement, then don't use a Regexp. Just use a literal string:
line.gsub!(from, to)
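To see the three variants side by side, here is a short sketch using the "D " example from the question. The variable names are only for illustration.

```ruby
searchstring = "D "
escaped = Regexp.escape(searchstring)     # a String containing D, backslash, space

# 1. Escaped String as pattern: matched literally, so the backslash must be present too.
as_string = searchstring.dup
as_string.gsub!(escaped) { '' }           # nothing is replaced

# 2. Escaped String compiled into a Regexp: matches the original text.
as_regexp = searchstring.dup
as_regexp.gsub!(Regexp.new(escaped)) { '' }

# 3. Plain String as pattern: literal replacement, no escaping needed.
as_literal = searchstring.dup
as_literal.gsub!(searchstring) { '' }

p escaped
p as_string
p as_regexp
p as_literal
```

The block form is kept in variant 3 because a replacement passed as a String is still scanned for backreferences such as \1, while a block's return value is inserted as is.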
I need to check if the last character in a string is a digit, and if so, increment it.
I have a directory structure of /u01/app/oracle/... and that's where it goes off the rails. Sometimes it ends with the version number, sometimes it ends with dbhome_1 (or 2, or 3), and sometimes, I have to assume, it will take some other form. If it ends with dbhome_X, I need to parse that and bump that final digit, if it is a digit.
I use split to split the directory structure on '/', and use include? to check if the final element is something like "dbhome". As long as my directory structure ends with dbhome_X it seems to work. As I was testing, though, I tried a path that ended with dbhome, and found that my check for the last character being a digit didn't work.
db_home = '/u01/app/oracle/product/11.2.0/dbhome'
if db_home.split('/')[-1].include?('dbhome')
  homedir=db_home.split('/')[-1]
  if homedir[-1].to_i.is_a? Numeric
    homedir=homedir[0...-1]+(homedir[-1].to_i+1).to_s
    new_path="/"+db_home.split('/')[1...-1].join("/")+"/"+homedir.to_s
  end
else
  new_path=db_home+"/dbhome_1"
end
puts new_path
I did not expect the output to be /u01/app/oracle/product/11.2.0/dbhom1. It seems to have fallen into the if block that added 1 to the final character.
If I set the initial path to /u01/app/.../dbhome_1, I get the expected /u01/app/.../dbhome_2 as the output.
You could use a regular expression to make matching a tad bit easier:
if !!(db_home[/.*dbhome.*\z/]) ..
You could use a regex:
/[0-9]$/.match("How3").nil?
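For what it's worth, the original check always passes because String#to_i returns 0 for non-numeric text, and 0 is a Numeric. A small sketch of the difference; ends_with_digit? is a made-up helper name, and String#match? needs Ruby 2.4 or later:

```ruby
# to_i never fails: non-numeric text simply becomes 0, which is still Numeric.
letter_as_int = "e".to_i
always_true   = "e".to_i.is_a?(Numeric)

# Hypothetical helper: true only when the very last character is a digit.
def ends_with_digit?(str)
  str.match?(/\d\z/)
end

p letter_as_int
p always_true
p ends_with_digit?("dbhome_1")
p ends_with_digit?("dbhome")
```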
I need to check if the last character in a string is a digit, and if
so, increment it.
This is one option:
s = 'string9'
s[-1].then { |last| last.to_i.to_s == last ? [s[0..-2], last.to_i+1].join : s }
#=> "string10"
'/u01/app/11.2.0/dbhome'.sub(/\d\z/) { |s| s.succ }
#=> "/u01/app/11.2.0/dbhome"
'/u01/app/11.2.0/dbhome9'.sub(/\d\z/) { |s| s.succ }
#=> "/u01/app/11.2.0/dbhome10"
This is a starting point if you're running Ruby v2.6+:
fname = 'filename1'
fname[/\d+$/].then { |digits|
fname[/\d+$/] = digits.to_i.next.to_s if digits
}
fname # => "filename2"
And it's safe if the filename doesn't end with a digit:
fname = 'filename'
fname[/\d+$/].then { |digits|
fname[/\d+$/] = digits.to_i.next.to_s if digits
}
fname # => "filename"
I'm not sure if I like doing it that way better than the more traditional way which works with much older Rubies:
digits = fname[/\d+$/]
fname[/\d+$/] = digits.to_i.next.to_s if digits
except for the fact that digits gets stuck into the variable space after only being used once. There's probably worse things that happen in my code though.
This is taking advantage of String's [] and []= methods.
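Applied to the dbhome paths from the question, the same idea could look roughly like this. It is only a sketch: bump_db_home is a hypothetical name, and it assumes that only a number directly after dbhome_ should be incremented, so a path ending in a version number such as 11.2.0 is left alone.

```ruby
# Hypothetical helper: increments the trailing number of a dbhome_<n> path.
# The lookbehind keeps paths that end in something else (e.g. a version) unchanged.
def bump_db_home(path)
  path.sub(/(?<=dbhome_)\d+\z/) { |digits| digits.to_i.next.to_s }
end

puts bump_db_home('/u01/app/oracle/product/11.2.0/dbhome_1')
puts bump_db_home('/u01/app/oracle/product/11.2.0/dbhome_9')
puts bump_db_home('/u01/app/oracle/product/11.2.0/dbhome')
puts bump_db_home('/u01/app/oracle/product/11.2.0')
```

Matching \d+ instead of a single \d means dbhome_19 becomes dbhome_20 rather than dbhome_110.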
I'm using Ruby 2.2 and have a string that looks like this:
myvar = '{"myval1"=>"value1","mayval2"=>"value2"}'
How can I get this into a key-value pair and/or hash of some sort? When I do myvar['myval1'] I get back 'myval1', which isn't quite what I'm after. The answer's probably staring right at me but nothing's worked so far.
As I've seen times and times again - simply mentioning eval makes people instantly upset, even if it was a proper use case (which this is not).
So I'm going to go with another hate magnet - parsing nested structures with regexes.
Iteration (1) - a naive approach:
JSON.parse(myvar.gsub(/=>/, ':'))
Problem - will mess up your data if the string key/values contain =>.
Iteration (2) - even number of "s remaining mean you are not inside a string:
JSON.parse(myvar.gsub(/=>(?=(?:[^"]*"){2}*[^"]*$)/, ':'))
Problem - there might be a " inside a string, that is escaped with a backslash.
Iteration (3) - like iteration (2), but count only the " that are not escaped. A quote is escaped when it is preceded by an odd number of backslashes:
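To make the difference between iterations (1) and (2) concrete, here is a small sketch with a value that itself contains =>. The sample string is invented for the example, and the lookahead is written with explicit grouping and \z instead of $, which should behave the same on single-line input.

```ruby
require 'json'

# Invented sample: the first value contains "=>" inside the string.
tricky = '{"myval1"=>"a=>b","mayval2"=>"value2"}'

# Iteration (1): every => becomes a colon, including the one inside the value.
naive = JSON.parse(tricky.gsub(/=>/, ':'))

# Iteration (2): only replace => when an even number of quotes remains after it,
# i.e. when the => is not inside a string.
outside_strings = /=>(?=(?:(?:[^"]*"){2})*[^"]*\z)/
careful = JSON.parse(tricky.gsub(outside_strings, ':'))

p naive['myval1']
p careful['myval1']
```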
eq_gt_finder = /(?<non_quote>
(?:
[^"\\]|
\\{2}*\\.
)*
){0}
=>(?=
(?:
\g<non_quote>
"
\g<non_quote>
){2}*
$
)/x
JSON.parse(myvar.gsub(eq_gt_finder, ':'))
Q: Are you an infallible divine creature that is absolutely certain this will work 100% of the time?
A: Nope.
Q: Isn't this slow and unreadable as shit?
Q: Ok?
A: Yep.
You can change that string to valid JSON easily and use JSON.parse then:
require 'json'
myvar = '{"myval1"=>"value1","mayval2"=>"value2"}'
hash = JSON.parse(myvar.gsub(/=>/, ': '))
#=> { "myval1" => "value1", "mayval2" => "value2" }
hash['myval1']
#=> "value1"
Let's say I have the following string and I want the output below, without requiring csv.
this, "what I need", to, do, "i, want, this", to, work
this
what I need
to
do
i, want, this
to
work
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
"([^"]+)"|[^, ]+
The left side of the alternation | matches complete "quotes" and captures the contents to Group1. The right side matches characters that are neither commas nor spaces, and we know they are the right ones because they were not matched by the expression on the left.
Option 2: Allowing Multiple Words
In your input, all tokens are single words, but if you also want the regex to work for my cat scratches, "what I need", your dog barks, use this:
"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*
The only difference is the addition of (?:[ ]*[^, ]+)* which optionally adds spaces + characters, zero or more times.
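A quick sketch of Option 2 on the multi-word sentence mentioned above, using the same scan idiom as for the first regex (take Group 1 when it is set, otherwise the whole match):

```ruby
subject = 'my cat scratches, "what I need", your dog barks'
regex = /"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*/

tokens = []
# $1 is set only when the quoted alternative matched; otherwise use the whole match ($&).
subject.scan(regex) { tokens << ($1 || $&) }

tokens.each { |token| puts token }
```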
This program shows how to use the regex:
subject = 'this, "what I need", to, do, "i, want, this", to, work'
regex = /"([^"]+)"|[^, ]+/
# put Group 1 captures in an array
mymatches = []
subject.scan(regex) {|m|
$1.nil? ? mymatches << $& : mymatches << $1
}
mymatches.each { |x| puts x }
Output
this
what I need
to
do
i, want, this
to
work
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses a bunch of Java-based HTML template files to replace a period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip something up using Ruby's regex support. So, I create a Regexp.new... object, open a file, and read it in line by line, matching each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then build a pattern for the original match with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second substring it should otherwise be finding. Stranger still, when I test this same pattern and the same test text using rubular.com, it works as expected.
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  tagMatch = tagRegex.match(htmlLine)
  if(tagMatch)
    matchesArray = tagMatch.to_a
    firstMatch = matchesArray[0]
    secondMatch = matchesArray[1]
    puts "First match: #{firstMatch} and second match #{secondMatch}"
    tagMatch.captures.each {|lineMatchCapture|
      puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
      #create a new regex using the match results; make sure to use auto escape method
      originalPatternString = Regexp.escape(lineMatchCapture)
      replacementRegex = Regexp.new(originalPatternString)
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    }
  end
}
I would think that I should get the first tag in matchData[0], then the second tag in matchData[1]. What I'm really doing, because I don't know how many matches I'll get within any given line, is matchData.to_a.each. And in this case, matchData has two entries, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, and why does the rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for your next Ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
  lineCount += 1
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")
  oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
  if(oldMatches.size > 0)
    oldMatches.each_index do |index|
      arrayMatch = oldMatches[index]
      aMatch = arrayMatch[0]
      #create a new regex using the match results; make sure to use auto escape method
      replacementRegex = Regexp.new(Regexp.escape(aMatch))
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line, matching against the new escaped literal regex
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    end # I kind of still prefer the brackets...;-)
  end
end
Now, why does MatchData work the way it does? It seems like its behavior is a bug really, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
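On the MatchData question: as far as I can tell it is not a bug. A MatchData describes one match only, where index 0 is the whole match and index 1 is the first capture group, so with the whole pattern wrapped in a group both entries are the same tag. The sketch below shows that, and also that gsub with a block can do the whole replacement in one pass. The sample line is shortened from the test text above.

```ruby
line = '<WEBOBJECT NAME=admin.normalMode.someOtherPattern>text<WEBOBJECT NAME=admin.SecondLineMatch>more'

# One MatchData = one match: [0] is the whole match, [1] is capture group 1.
m = line.match(/(<webobject name=(?:\w+\.)+\w+>)/i)
first_only = m.to_a

# Without capture groups, scan returns every match as a plain string.
tag_regex = /<webobject name=(?:\w+\.)+\w+>/i
all_tags = line.scan(tag_regex)

# gsub with a block visits every match, so no second regex is needed.
fixed = line.gsub(tag_regex) { |tag| tag.tr('.', '_') }

p first_only
p all_tags
puts fixed
```

The block form also avoids building a new Regexp from each match with Regexp.escape.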
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps, but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').