How do I count a sub string using a regex in ruby? - ruby

I have a very large XML file which I load as a string.
My XML looks like this:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I convert the ID to a regex?
Here is what I have tried:
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks

I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
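If you do go the regex route anyway, a minimal sketch could look like the following. The method name count_changed_articles and the sample XML are made up for illustration. It also swaps [1-9]{7} for \d{7}, on the assumption that IDs may contain zeros (the 5705641 in the question does, so [1-9]{7} would miss it).

```ruby
# Hypothetical helper: counts <article> tags whose status is "Changed".
# Assumes the attributes always appear in exactly this order and spacing,
# which is the fragility the answer warns about.
def count_changed_articles(xml)
  # scan needs a Regexp here; a String argument would be matched literally.
  xml.scan(/article ID="\d{7}" contentstatus="Changed"/).length
end

sample = <<~XML
  <volume contentstatus="Unchanged">
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5705641" contentstatus="Changed"/>
  <article ID="5756262" contentstatus="Unchanged"/>
  </volume>
XML

puts count_changed_articles(sample)
```

Note that puts gets the Integer directly, with no to_s needed.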

If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
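As a rough illustration of counting with an XPath selector, here is a sketch using REXML, which normally ships with Ruby. In XPath, attributes are addressed with @. The helper name count_by_xpath and the sample document are assumptions for the example, and REXML insists on well-formed input, so the root element is closed here.

```ruby
require 'rexml/document'

# Hypothetical helper: counts the nodes matched by an XPath expression.
def count_by_xpath(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).size
end

sample = <<~XML
  <publication ID="7728">
    <volume>
      <article ID="5756261" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
      <article ID="5756263" contentstatus="Changed"/>
    </volume>
  </publication>
XML

puts count_by_xpath(sample, '//article[@contentstatus="Changed"]')
```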

Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex on HTML or XML is that the pattern breaks really easily if the XML changes, comes from different sources, or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.

Your current string is almost right. Remove the errant / from around the numbers, then compile it into a Regexp, because scan matches a plain String literally:
searchstr = Regexp.new('article ID=\"[1-9]{7}\" contentstatus=\"Changed\"')

Related

Minify XML with Ruby

Given an XML string:
xml = "<org><people> <person>Joe Shmoe</person> <person>Bo Bob</person>
<person>New Guy</person> </people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
How can I get rid of newlines and spaces between the tags, without affecting tag text, CDATA, etc?
Result should be:
xml = "<org><people><person>Joe Shmoe</person><person>Bo Bob</person><person>New Guy</person></people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
UPDATE:
This is what I've come up with so far. I just can't figure out how to have it ignore CDATA content...
xml.gsub(/>\s+</,"><")
Also, would much rather use an XML parser for this, as from what I hear regexing XML is a bad thing.
Yes! What you want is canonicalization!
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-canonicalize
The LibXML-Ruby gem can do this. Since the docs are shitty and don't even say what it does, here is the spec:
http://www.w3.org/TR/xml-c14n
This is used a lot in XML signing.
And yes! Using regular expressions on XML is bad.
BTW you can also print your xml object as a string, and set indentation:
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-to_s
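On the narrower question of making the gsub attempt skip CDATA, one possible sketch is to match CDATA sections first and put them back unchanged, and only drop whitespace that sits between a ">" and a "<". The helper name is hypothetical, and this is still regex on XML: it knows nothing about comments, processing instructions, or elements where whitespace-only text is meaningful.

```ruby
# Hypothetical helper: removes whitespace-only gaps between tags while
# copying CDATA sections through untouched.
def strip_inter_tag_whitespace(xml)
  pattern = /(<!\[CDATA\[.*?\]\]>)|(?<=>)\s+(?=<)/m
  # If the CDATA branch matched, $1 is set and is put back unchanged;
  # otherwise the match is a whitespace run between ">" and "<" and is dropped.
  xml.gsub(pattern) { $1 || "" }
end

xml = "<org><people> <person>Joe Shmoe</person> <person>Bo Bob</person>\n" \
      "<person>New Guy</person> </people><other><![CDATA[ This string might\n" \
      "have tags < > < > and stuff, don't touch this ]]></other></org>"

puts strip_inter_tag_whitespace(xml)
```

The lookarounds mean the ">" and "<" themselves are never consumed, so a CDATA section that directly follows a whitespace gap is still recognized and protected.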

How to use Regular Expression to insert text in between text?

I have an unusual scenario. A web application acts as a simulator: I send it data as XML, get XML back, and verify a few details in the response.
The XML I send contains a lot of details, and I need to insert into it a parameter that I have defined in my test. I can't work out how to put that parameter into the XML before sending it.
The XML structure looks like this:
id='12345'><version>1.3.4<</version><accno>1234567890</accno>add<address details</> ..........
In this XML structure I have parameterized <accno>1234567890</accno>, meaning that at the beginning of the script I declare accno='1234567890'.
Now I want to use accno as a parameter in the XML instead of the hard-coded value. Please suggest how to do this.
XML is not regular, but context-free. Use a proper parser like Nokogiri instead of regex. See RegEx match open tags except XHTML self-contained tags.
Posting as an answer, as requested.
I will say that editing XML with a regex is a bad idea,
but just to answer the direct question, use gsub, e.g.
str.gsub(/reg_match/, newstring)
A better way of doing it would be to use Hpricot.
Or you can use Ruby templates (ERB):
require 'erb'
require 'ostruct'
data = {:accno => "1234567890"}
variables = OpenStruct.new(data)
template = "<id='12345'><version>1.3.4</version><accno><%= accno%></accno>"
res = ERB.new(template).result(variables.instance_eval { binding })
puts res
First identify the pattern, then replace it using gsub!
xml_data.gsub!(pattern, replacement)
http://ruby-doc.org/docs/ProgrammingRuby/html/ref_c_string.html#String.gsub_oh
The fast way to do it is with gsub (like Rajkaran says). The right way to do it is rexml or some other xml library. Investment should be related to how much you will use this kind of thing in the future.
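To make the gsub suggestion concrete, here is a small sketch. The helper name set_accno is made up, and it assumes the document contains a simple <accno>...</accno> element with no nested markup and that the new value needs no XML escaping.

```ruby
# Hypothetical helper: replaces the text inside <accno>...</accno>.
def set_accno(xml, accno)
  # The block form keeps backslashes in accno from being read as
  # back-references in the replacement.
  xml.gsub(%r{<accno>.*?</accno>}m) { "<accno>#{accno}</accno>" }
end

accno = '1234567890'
xml = "<version>1.3.4</version><accno>0000000000</accno>"
puts set_accno(xml, accno)
```

gsub returns a new string; gsub! would modify xml in place instead.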

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
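As a sketch of the String#[] form with a capture group, the hypothetical helper below returns the captured title contents and then crudely strips any inner tags. The tag stripping is an assumption about what the asker wants, and it is exactly the kind of shortcut an HTML parser makes unnecessary.

```ruby
# Hypothetical helper: returns the text of the first <title>, or nil.
def title_text(html)
  # str[regex, 1] returns capture group 1 instead of the whole match.
  inner = html[%r{<title>(.*?)</title>}m, 1]
  return nil if inner.nil?
  # Crude tag removal; does not decode entities or handle odd markup.
  inner.gsub(/<[^>]+>/, "").strip
end

puts title_text('<title><h2>foobar</h2></title>')
```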

ruby regex links not already in anchor tag

I am using ruby 1.8.7. I am not using rails.
How do I find all the links which are not already in an anchor tag?
s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }
The output for the above string should be
www.b.com
www.c.com
I know "b" tag before www.a.com complicates the case but that's what I have to work with.
You are going to want to use a real XML parser (Nokogiri will do). Regexes are unsuitable for a task like this. Especially so in ruby 1.8.7 where negative look behind is not supported.
Dirty way to get rid of anchor tags. Doesn't work the way you want if they're nested. Also use a real parser ;-)
s.gsub(%r[<a\b.*?</a>]i, "")
=> " www.b.com <div>www.c.com</div> "
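Building on that, a rough sketch of the whole task is to drop the anchor elements first and then scan what is left for bare hosts. The helper name and the www. pattern are assumptions for this example. It shares the nested-anchor limitation and will also pick up trailing punctuation such as a final dot.

```ruby
# Hypothetical helper: lists www.* hosts that are not inside an <a> element.
def unlinked_hosts(html)
  # Remove every <a ...>...</a> block, including its contents.
  without_anchors = html.gsub(%r{<a\b.*?</a>}im, "")
  # Then collect anything that looks like a bare www host.
  without_anchors.scan(/\bwww\.[\w.-]+/)
end

s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }
puts unlinked_hosts(s)
```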

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is that it is not outputting the websites' links.
Those links actually have class=l, not class="l". By the way, to figure this out I added some logging to the script so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = URI.open(url).read # Kernel#open no longer opens URLs in Ruby 3.0+
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters of some sort that you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means that the .+ should try to match as short a string as possible.
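To try that regex without hitting the network, here is a small sketch run against a hard-coded snippet. The helper name extract_links and the sample markup are made up, and the markup is only a guess at what such a page might contain.

```ruby
# Hypothetical helper: pulls href values out of <a ... class=l> tags.
def extract_links(html)
  # scan returns one array per match when the regex has a capture group,
  # so flatten turns [["url1"], ["url2"]] into ["url1", "url2"].
  html.scan(/<a.+?href="(.+?)".+?class=l/).flatten
end

page = <<~HTML
  <a href="http://www.test.com/" class=l>Test</a>
  <a href="http://example.org/" class=l>Example</a>
HTML

extract_links(page).each { |link| puts "- #{link}" }
```

Because . does not match a newline by default, each match stays within a single line of the sample.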
To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only recognize the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)
