Nokogiri/Mechanize xpath locator breaks when there is a stray start tag - ruby

I loaded a page using Mechanize:
url = 'http://www.blah.com'
agent = Mechanize.new
page = agent.get(url)
and tried to access an element using an XPath selector:
found = page.at('/html/body/table')
It returns nil because the HTML, which is out of my control, has an opening tag where it shouldn't be:
<html>
<body>
<tr>
<table>
. . .
The "stray start tag," as Firefox calls it, is ignored when the browser renders the page in real life (and Firefox gives me xpaths that ignore it), but Nokogiri can't see anything past that extra <tr>.
Is there any way to clean the HTML of hanging tags like this?

In your example it would be:
page.at '/html/body/tr/table'
But maybe it makes more sense to just do:
page.at 'table'

Use a less brittle XPath query?
found = page.at('//table')

You can clean this using Nokogiri easily:
require 'nokogiri'
html = '<html><body><tr><table><tr><td>foo</td></tr></table></tr></body></html>'
doc = Nokogiri::HTML(html)
inner_table = doc.at('//body/tr/table')
if (inner_table)
doc.at('body tr').replace(inner_table)
end
puts doc.to_html
With the result being:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr><td>foo</td></tr></table></body></html>
If your HTML is more complex, then find some sort of marker similar to the <body><tr><table> node-chain, and substitute it into the code above.
Note that I'm mixing both XPath and CSS accessors. I prefer CSS for their readability, but sometimes XPath makes it easier to get at something or is more self-documenting.
Also notice that I'm using both XPath and CSS with Nokogiri's at method. Though Nokogiri supports both at, at_css and at_xpath, I rely on at unless I need to explicitly tell Nokogiri that what I'm using as an accessor is CSS or XPath. It's a convenience thing. The same applies to Nokogiri's search method.

Related

element not supported by watir webdriver

So element_by_xpath has been removed from watir webdriver and I am wondering if there is something similar to this still in existence. Basically there is a path element (pie chart) that is clickable and I need to have a regression test set up to for it. The people who designed the website apparently thought it would be cool to have everything inside this thing to be custom tags and there is nothing supported by watir webdriver that I can point to, at least to my knowledge.
element_by_xpath was replaced with an :xpath locator, which is used like any other locator such as :text.
For example, say you have a custom 'asdf' element in your html:
<html>
<body>
<asdf>text</asdf>
</body>
</html>
Then you could locate the element via xpath like:
browser.element(:xpath => '//asdf').text
#=> "text"
The usual suggestion is to avoid xpath. If you are only using the xpath because of the tag name, you can use the :tag_name locator instead:
browser.element(:tag_name => 'asdf').text
#=> "text"

Parsing multiple elements with Mechanize

I have this in html:
<meta name='DC.creator' scheme='inventor' content='Chen Yonghong' />
<meta name='DC.creator' scheme='inventor' content='Chen Yuan' />
If I want to get first creator I can do it like this:
:author => page.at('meta[#name="DC.creator"]')[:content]
The question is, how do I get second one with mechanize selectors?
You can use:
page.search('meta[#name="DC.creator"]')[1][:content]
at is equivalent to search(...).first so using the same selector with search and grabbing the second element found will work as long as there are truly two tags that match. If not, you'll get an exception because you can't take the index of a nil value.
And, as a FYI, Mechanize uses Nokogiri internally to handle its HTML parsing and manipulation. Nokogiri supports both CSS and XPath selectors so you can use whichever makes it easier for you to find the tag or element you want. I lean toward CSS for readability, but use both. See the Nokogiri tutorials for more information about searching.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<meta name='DC.creator' scheme='inventor' content='Chen Yonghong' />
<meta name='DC.creator' scheme='inventor' content='Chen Yuan' />
EOT
doc.search('meta[#name="DC.creator"]')[1][:content]
=> "Chen Yuan"

Nokogiri grab only visible inner_text

Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text method, however that method counts a lot of JavaScript as visible text. The only text I want to capture is the visible text on the screen.
For example, in IRB if I do the following in Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.
Can I ignore JavaScript or is there a better way of doing this?
You could try:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.traverse{ |x|
if x.text? && x.text !~ /^\s*$/
puts x.text
end
}
I have not done much with Nokogiri, but I believe this should find/output all text nodes in the document that are not blanks. This at least seems to be ignoring the javascript and all the text I checked was visible on the page (though some of it in the dropdown menus).
You can ignore JavaScript and there is a better way. You're ignoring the power of Nokogiri. Badly.
Rather than provide you with the direct answer, it will do you well to learn to "fish" using Nokogiri.
In a document like:
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
I recommend starting with CSS accessors because they're generally more familiar to people:
doc = Nokogiri::HTML(var_containing_html) will parse and return the HTML DOM in doc.
doc.at('p') will return a Node, which basically points to the first <p> node.
doc.search('p') will return a NodeSet of all matching nodes, which acts like an array, in this case all <p> nodes.
doc.at('p').text will return the text inside a node.
doc.search('p').map{ |n| n.text } will return all the text in the <p> nodes as an array of text strings.
As your document gets more complex you need to drill down. Sometimes you can do it using a CSS accessor, such as 'body p' or something similar, and sometimes you need to use XPaths. I won't go into those but there are great tutorials and references out there.
Nokogiri's tutorials are very good. Go through them and they will reveal all you need to know.
In addition, there are many answers on Stack Overflow discussing this sort of problem. Check out the "Related" links on the right of the page.
Ignore the tags where JavaScript lives (<script>). While we’re at it, we should also ignore CSS (<styles>).
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.css('style').each(&:remove)
doc.css('script').each(&:remove)
puts doc.text
# Alternatively, for cleaner output:
# puts doc.text.split("\n").map(&:strip).reject(&:empty?)

How to get a mail address from HTML code with Nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking in regex but I don't know if it's the best solution.
Example code:
<html>
<title>Example</title>
<body>
This is an example text.
Mail to me
</body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using xpath.
The selector //a will select any a tags on the page, and you can specify the href attribute using # syntax, so //a/#href will give you the hrefs of all a tags on the page.
If there are a mix of possible a tags on the page with different urls types (e.g. http:// urls) you can use xpath functions to further narrow down the selected nodes. The selector
//a[starts-with(#href, \"mailto:\")]/#href
will give you the href nodes of all a tags that have a href attribute that starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'
selector = "//a[starts-with(#href, \"mailto:\")]/#href"
doc = Nokogiri::HTML.parse File.read 'my_file.html'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
With a test file that looks like this:
<html>
<title>Example</title>
<body>
This is an example text.
Mail to me
A Web link
<a>An empty anchor.</a>
</body>
</html>
this code outputs the desired example#example.com. addresses is an array of all the email addresses in mailto links in the document.
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML:Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
# assuming you have than one, do something with all your email fields here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) find text at the beginning of a parameter:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
blah
blah
EOT
doc.at('a[href^="mailto:"]')
.to_html # => "blah"
Nokogiri tries to track the jQuery extensions. I used to have a link to a change-notice or message from one of the maintainers talking about it but my mileage has varied.
See "CSS Attribute Selectors" for more information.
Try to get the whole html page and use regular expressions.

How do I find matching <pre> tags using a reqular expression?

I am trying to create a simple blog that has code inclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, code up to the first closing </pre> tag.
What I've come up with so far is the follow:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution thanks to your guys help:
require 'nokogiri'
module PostsHelper
def readMore(post)
doc = Nokogiri::HTML(post.message)
intro = doc.search("div[class='intro']")
result = Nokogiri::XML::DocumentFragment.parse(intro)
result << link_to("Read More", post_path(post))
result.to_html
end
end
Basically in my editor for the blog I wrap the blog preview in div class=intro
Thus, only the intro is displayed with read more added on to it.
This is not a job for regular expressions, but for a HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
You can capture all text upto the first closing pre tag by modifying your regular expression to,
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text by,
text.match(regex)[1]
which will return only the text upto the first closing pre tag.
Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do the thing, consider using some html parser, like Nokogiri.

Resources