Problem with Ruby Regular Expression - ruby

I have this HTML code, all on a single line:
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
Here is the easier-to-read, multi-line version (which I can't use):
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
And I'm trying to extract just the URLs with this regex:
/<h3 class="r"><a href="(.*)">(.*)<\/a>/
And it returns
www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com"
How can I make the match stop at the first " it finds?

Sigh. Regex and HTML are such awkward bedfellows:
require 'nokogiri'
html = %q{<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>}
doc = Nokogiri::HTML(html)
puts doc.css('a').map{ |a| a['href'] }
# >> www.google.com
# >> www.google.com
This will find them, whether they are deeply nested or all on one line.

The problem is that * is greedy: it matches as much as it can, so .* runs through to the last " on the line. Put a question mark after it (.*?) to make it non-greedy.
Working regex (tested on rubular)
href\=\"(.*?)\"
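As a minimal sketch of the difference, the snippet below runs a greedy and a non-greedy pattern against a one-line sample. The sample string, the constant names, and the helper extract_hrefs are assumptions made for illustration; the negated character class [^"]* is an alternative that the answers above do not mention.

```ruby
# Hypothetical one-line sample, modeled on the markup in the question.
SAMPLE_HTML = %q{<h3 class='r'><a href="www.google.com">first</a></h3><h3 class='r'><a href="www.example.com">second</a></h3>}

# Greedy: .* runs to the LAST double quote on the line.
GREEDY = /href="(.*)"/
# Non-greedy: .*? stops at the first double quote that lets the match succeed.
LAZY = /href="(.*?)"/
# Alternative: a negated character class can never cross a double quote.
NEGATED = /href="([^"]*)"/

def extract_hrefs(html, pattern)
  # scan returns one array of captures per match; flatten to a list of strings.
  html.scan(pattern).flatten
end

p extract_hrefs(SAMPLE_HTML, GREEDY)  # a single, over-long match
p extract_hrefs(SAMPLE_HTML, LAZY)    # => ["www.google.com", "www.example.com"]
p extract_hrefs(SAMPLE_HTML, NEGATED) # => ["www.google.com", "www.example.com"]
```

The negated class is often the safer choice of the two, because it states the stopping condition directly instead of relying on backtracking.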

Related

Why is the following Nokogiri/XPath code removing tags inside the node?

The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
  puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link

How do I replace a specific string with another string?

I have some content read from an XML file:
page_content = doc.xpath("/somenode/body").inner_text
This content holds some data:
<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>
As you can see, some of the content is wrapped with two pairs of double quotes.
My desired result is to replace the two pairs of double quotes with a single pair:
<p> Hello World, "How are you today"
<a href="www.hello.comm">Hello</a>
etc.
</p>
What I have tried is:
page_content.gsub!(/[""]/, '"')
page_content.gsub!("\"\"", '"')
This does not seem to do the job. Any suggestions on how I can obtain my desired result?
It's important to understand how a parser like Nokogiri works.
To help you, it tries to fix up damaged or malformed HTML and XML. Your HTML is malformed, so it is going to be fixed as Nokogiri parses it. However, that process can make Nokogiri mangle the HTML further. To avoid that, we sometimes have to preprocess the content before we hand it to Nokogiri, or we have to unravel it afterwards by replacing nodes.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>
EOT
That parses the HTML into a DOM.
doc.at('p').to_html
# => "<p> Hello World, \"\"How are you today\"\"\n<a href=\"\" www.hello.comm>Hello</a>\netc.\n</p>"
The text ""How are you today"" was processed without any mangling because it's a text node:
doc.at('p').child.class # => Nokogiri::XML::Text
doc.at('p').child.content # => " Hello World, \"\"How are you today\"\"\n"
That's easily fixed after parsing:
doc.at('p').child.content = doc.at('p').child.content.gsub('""', '"')
# => " Hello World, \"How are you today\"\n"
Fixing the attributes of the <a> tag is an entirely different story because, by that point, Nokogiri has already "fixed" the doubled quotes, leaving the markup wrong:
doc.at('a').to_html
# => "<a href=\"\" www.hello.comm>Hello</a>"
Notice that www.hello.comm has been promoted outside its containing quotes.
To fix this requires some preprocessing before handing the HTML to Nokogiri, OR to fix the node and replace the damaged one with the fixed one.
Here's the basis for preprocessing the <a> tag:
html = <<EOT
<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>
EOT
html.gsub(/href=""([^"]+)""/, 'href="\1"')
# => "<p> Hello World, \"\"How are you today\"\"\n<a href=\"www.hello.comm\">Hello</a>\netc.\n</p>\n"
If you go that route, don't get fancy. Write small, atomic changes, to avoid your pattern breaking if the HTML changes.
A more robust way (where "robust" is somewhat less than we'd normally get using a parser) is:
bad_a = doc.at('a')
fixed_a = bad_a.to_html.gsub(/""\s([^>]+)>/, '"\1">')
bad_a.replace(fixed_a)
doc.at('p')
# => #(Element:0x3fe4ce9de9e4 {
# name = "p",
# children = [
# #(Text " Hello World, \"How are you today\"\n"),
# #(Element:0x3fe4ce9e0fdc {
# name = "a",
# attributes = [
# #(Attr:0x3fe4ce9e0fa0 {
# name = "href",
# value = "www.hello.comm"
# })],
# children = [ #(Text "Hello")]
# }),
# #(Text "\netc.\n")]
# })
doc.at('p').to_html
# => "<p> Hello World, \"How are you today\"\n<a href=\"www.hello.comm\">Hello</a>\netc.\n</p>"
It's possible to use a blanket gsub to massage the text, but that's got a high risk of collateral damage in large/complicated documents. Imagine what would happen to a document if
html.gsub('""', '"')
was used when there are many tags containing empty strings like:
<input value="" name="foo"><input value="" name="bar">
The result of the search/replace would be:
<input value=" name="foo"><input value=" name="bar">
That hardly improves things, and instead would have horribly mangled the document further.
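To make the contrast concrete, here is a small sketch that applies both approaches to the same string. The sample input and the method names blanket_fix and targeted_fix are made up for illustration; the targeted pattern is the same href pattern used in the preprocessing step earlier.

```ruby
# Hypothetical input: one malformed href plus one legitimately empty attribute.
MIXED_HTML = %q{<a href=""www.hello.comm"">Hello</a><input value="" name="foo">}

# Blanket replacement: collapses every doubled quote, including valid empty values.
def blanket_fix(html)
  html.gsub('""', '"')
end

# Targeted replacement: only touches href values wrapped in doubled quotes.
def targeted_fix(html)
  html.gsub(/href=""([^"]+)""/, 'href="\1"')
end

puts blanket_fix(MIXED_HTML)
# <a href="www.hello.comm">Hello</a><input value=" name="foo">
puts targeted_fix(MIXED_HTML)
# <a href="www.hello.comm">Hello</a><input value="" name="foo">
```

The blanket version repairs the link but breaks the input tag, while the targeted version leaves the empty value attribute alone.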
Instead, it's better to surgically fix the problem. Back in the dark, early, pioneer days of the web, we used to see a huge amount of malformed content, and processing it with regular expressions was the normal plan of attack. Now, with parsers, we can usually avoid that, isolate the problem, and selectively fix exactly what we want. Looking at the code necessary to do so shows it doesn't take a lot to do it right.
page_content.gsub!('""', '"')
page_content.gsub!(/"{2}/, '"')
rubular.com
a='<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>'
a.gsub! '""', '"'
[19] pry(main)> puts a
<p> Hello World, "How are you today"
<a href="www.hello.comm">Hello</a>
etc.
</p>
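For what it's worth, the first attempt in the question cannot change anything: [""] is a character class, so it matches one double quote at a time and replaces each with an identical quote. The sketch below shows the difference on a made-up sample string; the method names are illustrative only.

```ruby
# Made-up sample string for illustration.
QUOTED_TEXT = 'Hello World, ""How are you today""'

# A character class matches ONE character; listing the quote twice adds nothing,
# so every quote is replaced by an identical quote and the text is unchanged.
def with_character_class(text)
  text.gsub(/[""]/, '"')
end

# A two-character literal (or the regex /"{2}/) matches the doubled quote as a unit.
def with_literal_pair(text)
  text.gsub('""', '"')
end

puts with_character_class(QUOTED_TEXT) # Hello World, ""How are you today""
puts with_literal_pair(QUOTED_TEXT)    # Hello World, "How are you today"
```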

How does Nokogiri handle unclosed HTML tags like <br>?

When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are special, not just regular XML tags, and ignore them when parsing the nodes? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean: "How are you?" is not the content of the first <br>, as it would be in XML.
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
You must parse this fragment using the HTML parser, as it is obviously not valid XML. With the HTML parser, Nokogiri behaves as you'd expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.
As far as I can remember from doing some HTML parsing last year it'll view them as separate.
EDIT: My bad. I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.

How do I grab this value from Nokogiri?

Say I have:
<div class="amt" id="displayFare-1_69-61-0" style="">
<div class="per">per person</div>
<div class="per" id="showTotalSubIndex-1_69-61-0" style="">Total $334</div>
$293
</div>
I want to grab just the $334. It will always have "Total $" but the id showTotalSubIndex... will be dynamic so I can't use that.
You can use a Nokogiri XPath expression to iterate over the text of all the div nodes
and check each string for the 'Total $' prefix, like this:
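require 'rubygems'
require 'nokogiri'
# File.open is explicit about reading a local file; a bare open is best avoided
doc = Nokogiri::XML.parse(File.open("test.xml"))
doc.xpath("//div/text()").each do |t|
  tmp = t.text.strip
  # 'Total $' is 7 characters long, so tmp[7..-1] is the amount without the dollar sign
  puts tmp[7..-1] if tmp.start_with?('Total $')
end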
Rather than finding the text:
html = Nokogiri::HTML(html)
html.css("div.amt").children[1].text.gsub(/^Total /, '')
I assume here that the HTML is structured in such a way that the second child of any div.amt tag is the value that you're after, and then we'll just grab the text of that and gsub it.
Both of these work:
require 'nokogiri'
doc = Nokogiri::XML(xml)
doc.search('//div[@id]/text()').select{ |n| n.text['Total'] }.first.text.split.last
and
doc.search('//div/text()').select{ |n| n.text['Total'] }.first.text.split.last
The difference is the first should run a bit faster if you know the div you're looking for always has an id.
If the ID always starts with "showTotalSubIndex" you could use:
doc.search('//div[starts-with(@id,"showTotalSubIndex")]').first.text.split.last
and if you know there's only going to be one in the document, you can use:
doc.at('//div[starts-with(@id,"showTotalSubIndex")]').text.split.last
EDIT:
Ryan posits the idea the XML structure might be consistent. If so:
doc.at('//div[2]').text[/(\$\d+)/, 1]
:-)
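The str[regex, 1] form in the last example returns the first capture group, or nil when nothing matches, which makes it handy once the node's text is in hand. A minimal sketch on a plain string follows; the sample text and the method name total_amount are assumptions for illustration, and anchoring on the "Total " prefix is a small variation on the pattern above.

```ruby
# Hypothetical text content, as it might look after extracting the outer div's text.
FARE_TEXT = "per person Total $334 $293"

def total_amount(text)
  # Anchor on the "Total " prefix so the other dollar amount is ignored.
  # str[regex, 1] returns the first capture group, or nil when nothing matches.
  text[/Total (\$\d+)/, 1]
end

p total_amount(FARE_TEXT)        # => "$334"
p total_amount("no fare listed") # => nil
```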

Getting portion of href attribute using hpricot

I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and returns the text following that until the next forward slash '/'.
So, given:
One
Two
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need regex, but you can use it. Here are two examples, one with regex and the other without, using Nokogiri, which should be compatible with Hpricot for your use and uses CSS accessors:
require 'nokogiri'
html = %q[
One
Two
]
doc = Nokogiri::HTML(html)
doc.css('a[href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = 'One'
s =~ /abc\/([^\/]*)/
return $1
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").each do |a|
  # Assumes Hpricot elements expose attributes via [], as Nokogiri's do.
  # Index 2, because the string starts with '/', which makes the first element empty.
  return a["href"].split("/")[2]
end
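Both ideas can be tried on plain strings before wiring them into a parser. In the sketch below the href values are hypothetical, since the original links are not shown above, and segment_after_abc is an illustrative helper name.

```ruby
# Hypothetical href values; the real links from the question are not available.
HREFS = ["/abc/12345/one.html", "/abc/67890/two.html", "/xyz/11111/other.html"]

def segment_after_abc(href)
  # Capture everything after "abc/" up to the next slash; nil if there is no match.
  href[%r{abc/([^/]+)}, 1]
end

# Regex approach: non-matching hrefs yield nil and are dropped by compact.
p HREFS.map { |h| segment_after_abc(h) }.compact # => ["12345", "67890"]

# Split approach: the leading "/" produces an empty first element, hence index 2.
p HREFS.first.split("/")[2] # => "12345"
```

The regex version also filters out links that do not contain abc/, whereas the split version assumes every href has the same shape.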
