I'm trying to get all of the domains/IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting:
Image URLs <img src="..."
JavaScript URLs <script src="..."
CSS and any CSS url(...) elements
Frames and IFrames
I'd also want to follow any CSS imports.
Any suggestions / help would be appreciated. The project is already using Anemone.
Here's what I have at the moment.
Anemone.crawl(site, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    page.doc.xpath('//img').each do |link|
      process_dependency(page, link[:src])
    end
    page.doc.xpath('//script').each do |link|
      process_dependency(page, link[:src])
    end
    page.doc.xpath('//link').each do |link|
      process_dependency(page, link[:href])
    end
    puts page.url
  end
end
Code would be great, but I'm really just after pointers. For example, I have now discovered that I should use a CSS parser like css_parser to parse out any CSS to find imports and URLs to images.
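For the css_parser side, here's a rough, untested sketch of what I'm imagining, run inside the on_every_page block above (the stylesheet URL is hypothetical; double-check the gem's current API):

require 'css_parser'

# :import => true asks css_parser to follow @import rules automatically.
parser = CssParser::Parser.new(:import => true)
parser.load_uri!('http://example.com/styles.css')

# Scan every declaration for url(...) references such as background images.
parser.each_rule_set do |rule_set, _media_types|
  rule_set.each_declaration do |property, value, _is_important|
    value.scan(/url\(['"]?([^'")]+)['"]?\)/) { |(url)| process_dependency(page, url) }
  end
end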
Get the content of the page, then you can extract an array of URIs from the page with
require 'uri'
URI.extract(page)
After that it's just a matter of using a regular expression to parse each link and extract the domain name.
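For example (the sample text is made up; note that URI.parse(...).host is sturdier than a hand-rolled regex for pulling out the domain):

require 'uri'

text = 'see http://cdn.example.com/a.png and https://js.example.com/app.js'

# Extract every absolute http(s) URI, then map each one to its host.
uris = URI.extract(text, %w[http https])
domains = uris.map { |u| URI.parse(u).host }.uniq
# => ["cdn.example.com", "js.example.com"]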
I use Markdown to write posts on my websites.
Sometimes I have to link to other pages on my website, other times to external sites. I would like to automatically add the attributes rel="nofollow noopener noreferrer" target="_blank" whenever I have an external Markdown link. In other words, the links
[External](www.google.com)
[Internal](/posts/another)
would translate to
<a href="www.google.com" rel="nofollow noopener noreferrer" target="_blank">External</a>
<a href="/posts/another">Internal</a>
Is this possible? How can I go about coding this?
I am using Kramdown, but I can also use other markdown engines.
I solved this with a Jekyll plugin. Not sure if there are better solutions.
require "kramdown"
module Kramdown
class Converter::Html
alias_method :super_convert_a, :convert_a
def convert_a el, indent
if ( el.attr["href"] =~ /^[A-Za-z0-9]+:/ )
el.attr["target"] = "_blank"
el.attr["rel"] = "nofollow noopener noreferrer"
end
super_convert_a el, indent
end
end
end
I use relative urls for my internal links.
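If it helps, a quick way to sanity-check the patch outside Jekyll (once the monkey-patch above has been loaded) is something like this, which should print roughly the HTML shown in the comments:

puts Kramdown::Document.new("[External](https://www.google.com)").to_html
# => <p><a href="https://www.google.com" target="_blank" rel="nofollow noopener noreferrer">External</a></p>

puts Kramdown::Document.new("[Internal](/posts/another)").to_html
# => <p><a href="/posts/another">Internal</a></p>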
I want to get some data from H&M's website, using this scraper:
require 'nokogiri'
require 'open-uri'
require 'rmagick'
require 'mechanize'
product = "http://www2.hm.com/es_es/productpage.0250933004.html"
web = Nokogiri::HTML(open(product))
puts web.at_css('.product-item-headline').text
Nokogiri returns nil for each selector, and calling a method on the result raises undefined method for nil:NilClass. I don't know if this particular website has something that prevents scraping.
In the page's DOM I can see there is a .product-item-headline class, and I can fetch the info in the JavaScript console, but I can't with Nokogiri.
I tried targeting the whole body text, and this is the only thing I get printed.
var callcoremetrix = function(){cmSetClientID(getCoremetricsClientId(), true, "msp.hm.com", "hm.com");};
Maybe some JavaScript is ruining my scrape?
One idea is to use IRB and go step by step:
irb
> require 'open-uri'
> html = open(product).read
Does the HTML contain the class name text?
> html =~ /product-item-headline/
=> 56099
Yes it does, and here's the line:
<h1 class="product-item-headline">
So try Nokogiri:
> require 'nokogiri'
> web = Nokogiri::HTML(html)
=> success
Read the HTML text and try increasingly-broad queries related to your issue that take you nearer the top of the HTML, and see if they find results:
web.css("h1") # on line 2217 of the HTML
=> []
web.css(".product-detail-meta") # on line 2215
=> []
web.css(".wrapper") # on line 86
=> []
web.css("body") # on line 84
=> [#<Nokogiri::XML::Element …
This shows you there's a problem in the HTML. The parsing is disrupted between lines 84 and 86.
Let's guess that line 85 may be the issue: it is a <header> tag, and we happen to know that doesn't contain your target, so we can delete it. Save the HTML to a file, then use any text editor to delete the tag and all its contents, then re-parse.
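If you'd rather not hand-edit, a rough programmatic equivalent (assuming the page has a single top-level <header> element) is:

# Strip the <header> element from the raw HTML string before re-parsing.
stripped = html.sub(%r{<header\b.*?</header>}m, '')
web = Nokogiri::HTML(stripped)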
Does it work now?
web.css("h1") # on line 359 of the HTML
=> []
Nope. So we repeat this process, cutting down the HTML.
I also like to cut down the HTML by removing pieces that I know don't contain my target, such as the <head> area, <footer> areas, <script> areas etc.
You may like to use an auto-indenting editor, because it can quickly show you that something is unbalanced with the HTML.
Eventually we find that the HTML has many incorrect tags, such as unclosed <section> tags.
You can solve this a variety of ways:
The pure way is to fix the unclosed <section> tags, any way you want.
The hack way is to narrow the HTML down to the area you know you need, which is in the <h1> tag.
Here's the hack way:
area = html.match(/<h1 class="product-item-headline\b.*?<\/h1>/m)[0]
web = Nokogiri::HTML(area)
puts web.at_css(".product-item-headline").text.strip
=> "Funda de cojín de jacquard"
Heads up that the hack way isn't truly HTML-savvy, and you can see that it will fail if the HTML page author changes to use a different tag, or uses another class name before the class name you want, etc.
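One more option worth knowing about: recent Nokogiri versions (1.12+) bundle an HTML5 parser with browser-grade error recovery, which may cope with the broken markup without any surgery:

require 'nokogiri'

# Nokogiri::HTML5 recovers from malformed HTML the way a browser does.
web = Nokogiri::HTML5(html)
puts web.at_css('.product-item-headline')&.text&.strip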
The best long-term solution is to contact the author of the HTML page and show them how to validate the HTML. A good site for this is http://validator.w3.org/. When you validate your URL, the site shows 100 errors and 6 warnings, and explains each one and how to solve it.
I loaded a page using Mechanize:
url = 'http://www.blah.com'
agent = Mechanize.new
page = agent.get(url)
and tried to access an element using an XPath selector:
found = page.at('/html/body/table')
It returns nil because the HTML, which is out of my control, has a stray opening tag where it shouldn't be:
<html>
  <body>
    <tr>
      <table>
      . . .
The "stray start tag," as Firefox calls it, is ignored when the browser renders the page in real life (and Firefox gives me xpaths that ignore it), but Nokogiri can't see anything past that extra <tr>.
Is there any way to clean the HTML of hanging tags like this?
In your example it would be:
page.at '/html/body/tr/table'
But maybe it makes more sense to just do:
page.at 'table'
Use a less brittle XPath query?
found = page.at('//table')
You can clean this using Nokogiri easily:
require 'nokogiri'

html = '<html><body><tr><table><tr><td>foo</td></tr></table></tr></body></html>'
doc = Nokogiri::HTML(html)

# If the stray <body><tr> wrapper is there, replace it with the table it contains.
inner_table = doc.at('//body/tr/table')
doc.at('body tr').replace(inner_table) if inner_table

puts doc.to_html
With the result being:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr><td>foo</td></tr></table></body></html>
If your HTML is more complex, then find some sort of marker similar to the <body><tr><table> node-chain, and substitute it into the code above.
Note that I'm mixing both XPath and CSS accessors. I prefer CSS for their readability, but sometimes XPath makes it easier to get at something or is more self-documenting.
Also notice that I'm using both XPath and CSS with Nokogiri's at method. Though Nokogiri supports at, at_css, and at_xpath, I rely on at unless I need to explicitly tell Nokogiri whether my accessor is CSS or XPath. It's a convenience thing. The same applies to Nokogiri's search method.
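For example, these three calls are interchangeable here, because at guesses whether it has been handed CSS or XPath:

doc.at('body tr')           # auto-detected as CSS
doc.at_css('body tr')       # explicitly CSS
doc.at_xpath('//body/tr')   # explicitly XPath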
Is there a better way to extract the visible text of a web page using Nokogiri? Currently I use the inner_text method; however, that method counts a lot of JavaScript as visible text. The only text I want to capture is the text visible on the screen.
For example, in IRB if I do the following in Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.
Can I ignore JavaScript or is there a better way of doing this?
You could try:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.traverse do |node|
  # Print only non-blank text nodes.
  if node.text? && node.text !~ /^\s*$/
    puts node.text
  end
end
I have not done much with Nokogiri, but I believe this should find and output all text nodes in the document that are not blank. It at least seems to ignore the JavaScript, and all the text I checked was visible on the page (though some of it was in dropdown menus).
You can ignore JavaScript and there is a better way. You're ignoring the power of Nokogiri. Badly.
Rather than provide you with the direct answer, it will do you well to learn to "fish" using Nokogiri.
In a document like:
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
I recommend starting with CSS accessors because they're generally more familiar to people:
doc = Nokogiri::HTML(var_containing_html) will parse and return the HTML DOM in doc.
doc.at('p') will return a Node, which basically points to the first <p> node.
doc.search('p') will return a NodeSet of all matching nodes, which acts like an array, in this case all <p> nodes.
doc.at('p').text will return the text inside a node.
doc.search('p').map{ |n| n.text } will return all the text in the <p> nodes as an array of text strings.
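Putting those together as a runnable sketch on the document above:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.at('p').text                     # => "foo"
doc.search('p').map { |n| n.text }   # => ["foo", "bar"]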
As your document gets more complex you need to drill down. Sometimes you can do it using a CSS accessor, such as 'body p' or something similar, and sometimes you need to use XPaths. I won't go into those but there are great tutorials and references out there.
Nokogiri's tutorials are very good. Go through them and they will reveal all you need to know.
In addition, there are many answers on Stack Overflow discussing this sort of problem. Check out the "Related" links on the right of the page.
Ignore the tags where JavaScript lives (<script>). While we're at it, we should also ignore CSS (<style>).
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.css('style').each(&:remove)
doc.css('script').each(&:remove)
puts doc.text
# Alternatively, for cleaner output:
# puts doc.text.split("\n").map(&:strip).reject(&:empty?)
How can I get the mail address from HTML code with Nokogiri? I'm thinking of a regex, but I don't know if that's the best solution.
Example code:
<html>
  <title>Example</title>
  <body>
    This is an example text.
    <a href="mailto:example@example.com">Mail to me</a>
  </body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using xpath.
The selector //a will select any a tags on the page, and you can select the href attribute using @ syntax, so //a/@href will give you the hrefs of all a tags on the page.
If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector
//a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags whose href attribute starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'

selector = '//a[starts-with(@href, "mailto:")]/@href'
doc = Nokogiri::HTML.parse(File.read('my_file.html'))
nodes = doc.xpath(selector)

# Each node is an href attribute; drop the leading "mailto:" (7 characters).
addresses = nodes.collect { |n| n.value[7..-1] }
puts addresses
With a test file that looks like this:
<html>
  <title>Example</title>
  <body>
    This is an example text.
    <a href="mailto:example@example.com">Mail to me</a>
    <a href="http://example.com/">A Web link</a>
    <a>An empty anchor.</a>
  </body>
</html>
this code outputs the desired example@example.com. addresses is an array of all the email addresses in mailto links in the document.
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML:Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
  # assuming you have more than one, do something with each email field here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) match text at the beginning of an attribute value:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a href="mailto:foo@example.com">blah</a>
<a href="http://example.com">blah</a>
EOT

doc.at('a[href^="mailto:"]').to_html
# => "<a href=\"mailto:foo@example.com\">blah</a>"
Nokogiri tries to track the jQuery extensions. I used to have a link to a change-notice or message from one of the maintainers talking about it, but I can no longer find it.
See "CSS Attribute Selectors" for more information.
Try getting the whole HTML page and using regular expressions.
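For instance, a crude sketch along those lines (fragile compared to a real parser, but quick for a one-off):

html = File.read('my_file.html')

# Grab whatever follows "mailto:" up to the closing quote or bracket of the href.
addresses = html.scan(/mailto:([^"'>]+)/).flatten.uniq
puts addresses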