No output from XPath using Nokogiri

require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://wwp.greenwichmeantime.com/"))
puts page.xpath(".//*[@id='offset']/span[1]").text
This should output the GMT time, but it outputs nothing. What could be the reason?

The element you want to find is hidden in an iframe, so it is not part of the HTML that Nokogiri receives for the outer page. The URL you want to open is http://wwp.greenwichmeantime.com/time/scripts/clock-8/runner.php?tz=gmt. You can get it from the src attribute of the iframe element on the original page.

Related

How to fix an "undefined method" when trying to scrape a website with Nokogiri

I want to get some data from the HMs website, using this scraper:
require 'nokogiri'
require 'open-uri'
require 'rmagick'
require 'mechanize'
product = "http://www2.hm.com/es_es/productpage.0250933004.html"
web = Nokogiri::HTML(open(product))
puts web.at_css('.product-item-headline').text
Nokogiri returns nil for each selector, so calling .text raises an "undefined method" error for NilClass. I don't know if this particular website has something that prevents scraping.
In the page's DOM, I can see there is a .product-item-headline class, and I can fetch the info in the JavaScript console, but I can't with Nokogiri.
I tried targeting the whole body text, and this is the only thing I get printed.
var callcoremetrix = function(){cmSetClientID(getCoremetricsClientId(), true, "msp.hm.com", "hm.com");};
Maybe some JavaScript is ruining my scrape?
One idea is to use IRB and go step by step:
irb
> require 'open-uri'
> html = open(product).read
Does the HTML contain the class name text?
> html =~ /product-item-headline/
=> 56099
Yes it does, and here's the line:
<h1 class="product-item-headline">
So try Nokogiri:
> require 'nokogiri'
web = Nokogiri::HTML(html)
=> success
Read the HTML text and try increasingly-broad queries related to your issue that take you nearer the top of the HTML, and see if they find results:
web.css("h1") # on line 2217 of the HTML
=> []
web.css(".product-detail-meta") # on line 2215
=> []
web.css(".wrapper") # on line 86
=> []
web.css("body") # on line 84
=> [#<Nokogiri::XML::Element …
This shows you there's a problem in the HTML. The parsing is disrupted between lines 84 and 86.
Let's guess that line 85 may be the issue: it is a <header> tag, and we happen to know that doesn't contain your target, so we can delete it. Save the HTML to a file, then use any text editor to delete the tag and all its contents, then re-parse.
Does it work now?
web.css("h1") # on line 359 of the HTML
=> []
Nope. So we repeat this process, cutting down the HTML.
I also like to cut down the HTML by removing pieces that I know don't contain my target, such as the <head> area, <footer> areas, <script> areas etc.
You may like to use an auto-indenting editor, because it can quickly show you that something is unbalanced with the HTML.
Eventually we find that the HTML has many incorrect tags, such as unclosed section tags.
You can solve this a variety of ways:
The pure way is to fix the unclosed section tags, any way you want.
The hack way is to narrow the HTML to the area you know you need, which is in the h1 tag.
Here's the hack way:
area = html.match(/<h1 class="product-item-headline\b.*?<\/h1>/m)[0]
web = Nokogiri::HTML(area)
puts web.at_css(".product-item-headline").text.strip
=> "Funda de cojín de jacquard"
Heads up that the hack way isn't truly HTML-savvy, and you can see that it will fail if the HTML page author changes to use a different tag, or uses another class name before the class name you want, etc.
The best long-term solution is to contact the author of the HTML page and show him how to validate the HTML. A good site for this is http://validator.w3.org/ -- when you validate your URL, the site shows 100 errors and 6 warnings, and explains each one and how to solve it.

watir webdriver get into iframe and get src of elements and click on link

With Watir-Webdriver I want to go change the focus to the iframe and get the link that is inside it.
Here is the html code
<iframe id="top_right" src="otherwebsite.com/need content src">
<img src="need this" />
So what I would like is to go into the iframe, get the src of it, capture the href and the src from the img element and in the end click on these elements retrieving the data.
This is my attempt using Ruby:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto 'somesite.com'
b.wait
f = b.frame(:id => 'top_right').link(:index => 1).click
I have got this far, but unfortunately I still get the following response:
in `assert_exists': unable to locate element, using {:index=>1, :tag_name=>"a"} (Watir::Exception::UnknownObjectException)
If anybody can help, it would be great. Thanks.
You are trying to click the second link (:index=>1) in the frame. Looks like the frame does not have two links. Try clicking the first link (:index=>0):
b.frame(:id => 'top_right').link(:index => 0).click

Nokogiri grab only visible inner_text

Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text method, however that method counts a lot of JavaScript as visible text. The only text I want to capture is the visible text on the screen.
For example, in IRB if I do the following in Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.
Can I ignore JavaScript or is there a better way of doing this?
You could try:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.traverse{ |x|
if x.text? && x.text !~ /^\s*$/
puts x.text
end
}
I have not done much with Nokogiri, but I believe this should find/output all text nodes in the document that are not blanks. This at least seems to be ignoring the javascript and all the text I checked was visible on the page (though some of it in the dropdown menus).
You can ignore JavaScript and there is a better way. You're ignoring the power of Nokogiri. Badly.
Rather than provide you with the direct answer, it will do you well to learn to "fish" using Nokogiri.
In a document like:
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
I recommend starting with CSS accessors because they're generally more familiar to people:
doc = Nokogiri::HTML(var_containing_html) will parse and return the HTML DOM in doc.
doc.at('p') will return a Node, which basically points to the first <p> node.
doc.search('p') will return a NodeSet of all matching nodes, which acts like an array, in this case all <p> nodes.
doc.at('p').text will return the text inside a node.
doc.search('p').map{ |n| n.text } will return all the text in the <p> nodes as an array of text strings.
As your document gets more complex you need to drill down. Sometimes you can do it using a CSS accessor, such as 'body p' or something similar, and sometimes you need to use XPaths. I won't go into those but there are great tutorials and references out there.
Nokogiri's tutorials are very good. Go through them and they will reveal all you need to know.
In addition, there are many answers on Stack Overflow discussing this sort of problem. Check out the "Related" links on the right of the page.
Ignore the tags where JavaScript lives (<script>). While we’re at it, we should also ignore CSS (<style>).
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.css('style').each(&:remove)
doc.css('script').each(&:remove)
puts doc.text
# Alternatively, for cleaner output:
# puts doc.text.split("\n").map(&:strip).reject(&:empty?)

How to get a mail address from HTML code with Nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking in regex but I don't know if it's the best solution.
Example code:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
</body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using xpath.
The selector //a will select any a tags on the page, and you can specify the href attribute using the @ syntax, so //a/@href will give you the hrefs of all a tags on the page.
If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector
//a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags that have an href attribute that starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'
selector = "//a[starts-with(@href, \"mailto:\")]/@href"
doc = Nokogiri::HTML.parse File.read 'my_file.html'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
With a test file that looks like this:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
A Web link
<a>An empty anchor.</a>
</body>
</html>
this code outputs the desired example@example.com. addresses is an array of all the email addresses in mailto links in the document.
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML::Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
# assuming you have more than one, do something with all your email fields here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) match text at the beginning of an attribute value:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
blah
blah
EOT
doc.at('a[href^="mailto:"]')
.to_html # => "blah"
Nokogiri tries to track the jQuery extensions. I used to have a link to a change-notice or message from one of the maintainers talking about it but my mileage has varied.
See "CSS Attribute Selectors" for more information.
Try to get the whole html page and use regular expressions.
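A rough sketch of that regular-expression approach, assuming the page has already been read into a string. The pattern is deliberately simplified and will not cover every valid address; the sample text and addresses are made up:

```ruby
# Hypothetical page content; in practice this would be the fetched HTML.
html = 'Contact: <a href="mailto:someone@example.com">Mail to me</a> or other@example.org.'

# Simplified pattern: local part, @, domain labels, and a final label of 2+ letters.
email_pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

found = html.scan(email_pattern).uniq
puts found
```

Unlike the selector-based answers, this also picks up addresses that appear as plain text outside any tag, at the cost of possible false positives.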

Getting all the domains a page depends on using Nokogiri

I'm trying to get all of the domains / IP addresses that a particular page depends on using Nokogiri. It can't be perfect because JavaScript can load dependencies dynamically, but I'm happy with a best effort at getting:
Image URLs <img src="..."
Javascript URLs <script src="..."
CSS and any CSS url(...) elements
Frames and IFrames
I'd also want to follow any CSS imports.
Any suggestions / help would be appreciated. The project is already using Anemone.
Here's what I have at the moment.
Anemone.crawl(site, :depth_limit => 1) do |anemone|
anemone.on_every_page do |page|
page.doc.xpath('//img').each do |link|
process_dependency(page, link[:src])
end
page.doc.xpath('//script').each do |link|
process_dependency(page, link[:src])
end
page.doc.xpath('//link').each do |link|
process_dependency(page, link[:href])
end
puts page.url
end
end
Code would be great but I'm really just after pointers e.g. I have now discovered that I should use a css parser like css_parser to parse out any CSS to find imports and URLs to images.
Get the content of the page, then you can extract an array of URIs from the page with
require 'uri'
URI.extract(page)
After that it's just a matter of using a regular expression to parse each link and extract the domain name.

Resources