Why is my Ruby lookahead regex not working [duplicate] - ruby

I tested my regex in rubular.com and it works, but when I run the code it behaves differently.
I want to parse whole paragraphs out of some HTML code
Here is my regex
description = ad_page.body.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)
Here is some of the HTML source
<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>
The match begins where I need it to but then it keeps matching all the way to the end of the document.

Aside from the fact that you shouldn't parse HTML with regex, you want non-greedy matching:
/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m
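To see the difference, here is a minimal sketch on a made-up document. The `html` string and its trailing, unrelated span are assumptions for illustration, not the asker's real page. With the `/m` flag, the greedy `.+` runs to the last `</span>` in the input, while `.+?` stops at the first one.

```ruby
# Hypothetical sample: the target span followed by another, unrelated span.
html = <<~HTML
  <span id="preview-local-desc"> I want to pick up everything typed here.
  Paragraphs, everything.
  </span>
  <p>Footer</p><span>unrelated</span>
HTML

greedy = html.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)
lazy   = html.scan(/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m)

# The greedy match swallows the footer and runs to the last </span>.
puts greedy.first.include?('Footer')   # prints true

# The lazy match stops at the first </span>.
puts lazy.first.include?('Footer')     # prints false
```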

Parsing XML or HTML with a regex is marginally OK for trivial tasks, if you own or control the file's format. If you don't, then a simple change to the file could break your regex.
Using a parser will avoid that problem; I've parsed some horrible XML with Nokogiri and it didn't even notice. After writing an RSS aggregator that was handling 1000+ feeds I was hooked on using a parser.
require 'nokogiri'
html = '<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>'
doc = Nokogiri.HTML(html)
doc.at('span').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "
If there are multiple <span> tags you want:
doc.search('span').map(&:text)
# => [" I want to pick up everything typed here.\n Paragraphs, everything.\n "]
If there are multiple <span> tags and you only want this one:
doc.at('span#preview-local-desc').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "

Related

Ruby Regex to capture everything between two strings (inclusive)

I'm trying to sanitize some HTML and just remove a single tag (and I'd really like to avoid using nokogiri, etc). So I've got the following string appearing I want to get rid of:
<div class="the_class">Some junk here that's different every time</div>
This appears exactly once in my string, and I'd like to find a way to remove it. I've tried coming up with a regex to capture it all but I can't find one that works.
I've tried /<div class="the_class">(.*)<\/div>/m and that works, but it'll also match up to and including any further </div> tags in the document, which I don't want.
Any ideas on how to approach this?
I believe you're looking for a non-greedy regex, like this:
/<div class="the_class">(.*?)<\/div>/m
Note the added ?. Now, the capturing group will capture as little as possible (non-greedy), instead of as much as possible (greedy).
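Since the goal was to remove the tag rather than capture it, a short sketch with `String#sub` may help. The `html` string here is a made-up example, and the approach assumes the target div contains no nested div.

```ruby
# Hypothetical input: the unwanted div followed by a div that must be kept.
html = '<p>intro</p><div class="the_class">junk</div><div class="keep">content</div>'

# Non-greedy: the match ends at the first </div>, so only the target is removed.
cleaned = html.sub(/<div class="the_class">.*?<\/div>/m, '')
puts cleaned
# prints: <p>intro</p><div class="keep">content</div>

# Greedy, for comparison: the match runs to the last </div> and removes too much.
overcut = html.sub(/<div class="the_class">.*<\/div>/m, '')
puts overcut
# prints: <p>intro</p>
```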
Because it adds another dependency and slows my work down. Makes things more complicated. Plus, this solution is applicable to more than just HTML tags. My start and end strings can be anything.
I used to think the same way until I got a job writing spiders and web-site analytics, then writing a big RSS-aggregation system -- A parser was the only way out of that madness. Without it the work would never have been finished.
Yes, regex are good and useful, but there are dragons waiting for you. For instance, this common string will cause problems:
'<div class="the_class"><div class="inner_div">foo</div></div>'
The regex /<div class="the_class">(.*?)<\/div>/m will return:
"<div class=\"the_class\"><div class=\"inner_div\">foo</div>"
This malformed, but renderable HTML:
<div class="the_class"><div class="inner_div">foo
is even worse:
'<div class="the_class"><div class="inner_div">foo'[/<div class="the_class">(.*?)<\/div>/m]
=> nil
Whereas, a parser can deal with both:
require 'nokogiri'
[
'<div class="the_class"><div class="inner_div">foo</div></div>',
'<div class="the_class"><div class="inner_div">foo'
].each do |html|
doc = Nokogiri.HTML(html)
puts doc.at('div.the_class').text
end
Outputs:
foo
foo
Yes, your start and end strings could be anything, but there are well-recognized tools for parsing HTML/XML, and as your task grows the weaknesses in using regex will become more apparent.
And, yes, it's possible to have a parser fail. I've had to process RSS feeds that were so badly malformed the parser blew up, but a bit of pre-processing fixed the problem.

How do I count a sub string using a regex in ruby?

I have a very large XML file which I load as a string.
My XML looks like this:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
how can I convert the ID to a regex
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
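A minimal sketch of the difference, on a made-up excerpt modeled on the question's XML (the third ID is the one from the question's search string). It also shows that `[1-9]{7}` rejects IDs containing a 0, so `\d{7}` may be closer to what is wanted; that substitution is a suggestion, not part of the original answer.

```ruby
# Hypothetical excerpt modeled on the question's XML.
xml = <<~XML
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5705641" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
XML

# A String pattern is matched literally, so the bracket syntax never matches.
literal_count = xml.scan('article ID="[1-9]{7}" contentstatus="Changed"').length

# A Regexp pattern is interpreted, but [1-9] excludes the digit 0.
strict_count = xml.scan(/article ID="[1-9]{7}" contentstatus="Changed"/).length

# \d{7} accepts any seven digits, including IDs such as 5705641.
digit_count = xml.scan(/article ID="\d{7}" contentstatus="Changed"/).length

puts literal_count  # prints 0
puts strict_count   # prints 1
puts digit_count    # prints 2
```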
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
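A minimal sketch of counting with that selector. It uses REXML, which ships with standard Ruby installs (as a bundled gem since Ruby 3.0), instead of Nokogiri so that it runs without extra gems. The trimmed, well-formed sample XML is an assumption modeled on the question. Note that XPath addresses attributes with `@`.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample modeled on the question's XML.
xml = <<~XML
  <publication ID="7728" contentstatus="Unchanged">
    <volume contentstatus="Unchanged">
      <article ID="5756261" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
      <article ID="5756263" contentstatus="Changed"/>
    </volume>
  </publication>
XML

doc = REXML::Document.new(xml)

# Select every article element whose contentstatus attribute is "Changed".
changed = REXML::XPath.match(doc, '//article[@contentstatus="Changed"]')
changed_ids = changed.map { |node| node.attributes['ID'] }

puts changed.size   # prints 2
puts changed_ids    # prints 5756261 and 5756263, one per line
```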
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex on HTML or XML is that the regex breaks really easily if the XML changes, comes from different sources, or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser, like Nokogiri, can even do fixups if the XML is broken, in order to try to make sense of it.
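Your current string is almost right. Remove the errant / from around the numbers, and write the pattern as a Regexp literal rather than a String, because String#scan matches a String argument literally:
searchstr = /article ID="[1-9]{7}" contentstatus="Changed"/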

How do I have several haml lines appear on the same line?

I have the following haml:
%strong Asked by:
= link_to @user.full_name, user_path(@user)
.small= "(#{@question.created_at.strftime("%B %d, %Y")})"
This currently puts the link and the date on separate lines, when it should look like "link (date)", with the date in a span with class small.
Your code will generate something like this html:
<strong>Asked by:</strong>
User name
<div class='small'>April 26, 2011</div>
When you use something like .small (i.e. use the dot without specifying the element type) haml creates an implicit div. Since div elements are by default block level elements the date will be in a new block and so will appear on a new line. In order to get it to appear on the same line, you'll need an inline level element.
You could change the css for the "small" class to explicitly make it display inline, but html already provides an inline version of the div - the span, so you can change the last line from
.small= "(#{@question.created_at.strftime("%B %d, %Y")})"
to
%span.small= "(#{@question.created_at.strftime("%B %d, %Y")})"
which will give you
<strong>Asked by:</strong>
User name
<span class='small'>April 26, 2011</span>
which are all inline elements, so will appear as one line.
As for having it all on the same line in the haml, I don't think that's possible with plain haml syntax. Haml uses the indentation and whitespace in order to determine what to do, and having just one line means there's no indentation.
The haml FAQ says:
Expressing the structure of a document and expressing inline formatting are two very different problems. Haml is mostly designed for structure, so the best way to deal with formatting is to leave it to other languages that are designed for it.
You seem to be at the edge of what haml is intended for. You could write your html directly if you really wanted it all on one line:
<strong>Asked by:</strong> #{link_to @user.full_name, user_path(@user)} <span class="small">(#{@question.created_at.strftime("%B %d, %Y")})</span>
or perhaps you could create a helper that will generate the block for you.
To make it show up on the same line in the browser, use %span.small, as suggested above.
To put it all on one line in the HTML output, you will need to use the whitespace removal syntax in Haml. Please understand that newlines in the HTML output do not affect the arrangement of text in the browser.

xml tag with a dot in haml

I have a tag that contains a dot (.) that I want haml to preserve:
Haml:
%text
%text.resource
...
I would like Haml to expand to:
<text>
<text.resource>...
</text.resource>
</text>
but it keeps doing:
<text>
<text class="resource">...
</text>
</text>
Is there any easy way to "escape" "class" expansion in Haml?
HAML is made to generate HTML of various forms, but you can trick it to generate other things by being creative. Putting in what you want to get back out:
<text>
<text.resource>...
</text.resource>
</text>
will work, because if HAML sees a line that doesn't start with one of its reserved characters it'll output it as is. You can't indent though, or it will get mad.
From the docs:
Note that HTML tags are passed through unmodified as well. If you have some HTML you don’t want to convert to Haml, or you’re converting a file line-by-line, you can just include it as-is. For example:
%p
<div id="blah">Blah!</div>
is compiled to:
<p>
<div id="blah">Blah!</div>
</p>
You could do:
<text>
= " <text.resource>..."
= " </text.resource>"
</text>
if you insist on indentation:
>> <text>
>> <text.resource>...
>> </text.resource>
>> </text>
EDIT:
The OP says:
the problem I have is that the ellipsis (...) means that I have to add more haml code there (a bunch of xml tags that would be "children" of and therefore I need to "indent" the lines after the comments...
XML doesn't care about indentation; indentation is a for-human-eyes-only aesthetic. I'd worry more about being functionally and syntactically correct. If you absolutely have to have "pretty" XML, then consider running the HAML output through xmllint, or tidy with the xml flags set.
Or, abandon HAML because you're starting to abuse it, and use something like ERB and/or Erubis which is more free form and less caring about syntax, or go old-school and generate the XML via print and puts statements. If you insist on using HAML and having your indentation, then I'd suggest consulting with the HAML developers and see if they have a recommendation. There might be a HAML filter that would be of use, or some other way of forcing the indentation level inline.
My advice, as someone who's been doing this a long time and been there too many times, is: we, as software developers, can lose sight of the end-goal of being functional and spin off into some yak-shaving exercise, worrying about minutiae that don't accomplish anything real. Unless it's a specification that every indenting space is sacred, I'd worry more about getting correct XML and move on, then later return to it and see if it can be tweaked to perfection.

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is that it is not outputting the websites' links.
Those links actually have class=l, not class="l". To figure this out, I added some logging to the script so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the string that was actually there and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
# URI.open is needed on Ruby 3.0+, where plain open no longer fetches URLs
source = URI.open(url).read
puts "--- PAGE SOURCE ---"
puts source
# scan returns one array per match, holding the captured href
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
  puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters that you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means that the .+ should match as short a string as possible.
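The pattern can be tried offline on a small sample. The `source` string below is a made-up stand-in for the fetched page, using the unquoted class=l form described above. Because the regex has a capture group, `scan` returns an array of one-element arrays, so `flatten` is used to get plain strings.

```ruby
# Hypothetical markup standing in for the fetched page.
source = '<a href="http://www.test.com/" class=l>Test</a> ' \
         '<a href="http://www.example.org/" class=l>Example</a>'

# With a capture group, scan returns one array per match.
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
p links
# prints [["http://www.test.com/"], ["http://www.example.org/"]]

# flatten turns the nested arrays into plain strings for printing.
urls = links.flatten
urls.each { |link| puts "- #{link}" }
```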
To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only recognize the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)

Resources