Anemone Crawler skip_links_like not obeyed - ruby

I am using Anemone to crawl a massive site that, to make things worse, serves the same content under several language versions.
There is domain.com/ for the main language and domain.com/de/, domain.com/es/, etc. for the other languages, so I decided to exclude these in the crawl like so:
crawler = Anemone::Core.new('http://domain.com', opts = {skip_query_strings: true})
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
However, when I look at what is being crawled via a puts page.url inside the on_every_page do |page| block, I can see that it is still crawling all of the language variations.
I've even tried adding this:
crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}
to remove the language links from the list of pages considered next for crawling.
Any suggestions?

It turns out the skip_links_like method matches against the link path rather than the full URL, meaning you can only match on the part after the top-level domain. So instead of this:
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
I had to use this:
crawler.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
Or, just the regex difference:
Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
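
For reference, a minimal sketch of how the corrected pattern fits together, using the block-style Anemone.crawl entry point instead of Anemone::Core.new directly (domain.com is a placeholder):

require 'anemone'

Anemone.crawl('http://domain.com', skip_query_strings: true) do |anemone|
  # skip_links_like matches against paths, so anchor the pattern at the leading slash
  anemone.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

  anemone.on_every_page do |page|
    puts page.url # should no longer include the language-prefixed paths
  end
end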

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I can successfully scrape building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited by their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @key = key
    @link_key = link_key
  end

  def crawl_propshark_single
    agent = Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = "#{@key}"
    page = agent.submit(form)
    page = form.submit

    page.links.each do |link|
      if link.text.include?("#{@link_key}")
        if link.text.include?("PropertyShark")
          property_page = link.click
        else
          next
        end
        if property_page
          data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
          data_name  = property_page.css("div.cols").css("th")[4].text
          @result_hash["#{data_name}"] = data_value
        else
          next
        end
      end
    end

    return @result_hash
  end
end #endof: class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key, key_link)
puts spider.crawl_propshark_single
# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key,key_link)
puts spider.crawl_propshark_single
I get the following error, but in an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I loop over multiple addresses using the above code, I delay the process with a sleep 80 between addresses.
The first thing you should do, before anything else, is contact the website owner(s). Right now, your actions could be interpreted as anywhere between overly aggressive and illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular data. Either way, if you are going to depend on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural, human-like delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You might even consider using something like Selenium, which drives a real browser, instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page but some random error page. A simple retry may be all you need to get the data in question. When scraping, a poorly functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many web-scraping / data-extraction services that already exist. They are fairly inexpensive and already do everything discussed above, plus more.
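To make the retry suggestion concrete, here is a minimal sketch (the selector and back-off values are illustrative placeholders taken from the question, not a tested fix):

# Sketch: retry the page fetch a few times when the expected cell is missing.
def fetch_cell_with_retry(agent, url, attempts = 3)
  attempts.times do
    page = agent.get(url)
    cell = page.search("div.cols td.r_align")[4] # placeholder selector from the question
    return cell.text if cell                     # success: the data was there
    sleep rand(5..15)                            # brief, slightly randomized back-off
  end
  nil # give up and let the caller decide what to do
end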
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem, so let's focus on that line of code for a second.
Say the first time around your columns are columns = [1, 2, 3, 4, 5]; then columns[4] will return 5 (the element at index 4).
Now, for fun, let's assume the next time around your columns are columns = ['a', 'b', 'c', 'd']; then columns[4] will return nil, because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, which leads to nil.text and the error you are receiving.
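A defensive version of that lookup (a sketch; it uses Mechanize's search delegation to Nokogiri, which accepts CSS selectors) avoids the crash and lets you inspect the unexpected page instead:

# Sketch: guard against the missing fifth column instead of crashing.
cell = property_page.search("div.cols td.r_align")[4]
if cell
  data_value = cell.text
else
  warn "Expected column not found on #{property_page.uri}" # log it and move on
end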

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else and tried to edit it to work, and I'm failing. One of the big problems is that the original regex had a few "#"s in it, and I don't know whether that is a character I can escape or whether Ruby will just stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only want the a elements that contain an img, you can use the following:
"//a[img][@href]"
Or go even further and extract the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
is a pretty simple one that will grab anything beginning with http or https and ending with one of the listed file extensions. You should be able to figure out how to extend it; Rubular.com is good for experimenting with these.
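For instance, a quick sketch (building on the parse_html method above, not code from the original answer) that filters the collected hrefs with that pattern:

# Sketch: keep only the hrefs that look like image URLs.
image_uris = parse_html(html).grep(/^https?:\/\/.*\.(jpeg|jpg|gif|png)$/i)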
Regexps are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest a simple condition like:
IMAGE_EXTS = %w[gif jpg png]

if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]

def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end

writing a short script to process markdown links and handling multiple scans

I'd like to process just the links written in Markdown. I've looked at Redcarpet, which I'd be OK with using, but I really only want to support links, and it doesn't look like you can use it that way. So I think I'm going to write a little method using a regex, but...
assuming I have something like this:
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
tmp=str.scan(/\[.*\]\(.*\)/)
or if there is some way I could just gsub in place, turning [hope](http://www.github.com) into <a href='http://www.github.com'>hope</a>.
How would I get an array of the matched phrases? I was thinking once I get an array, I could just do a replace on the original string. Are there better / easier ways of achieving the same result?
I would actually stick with redcarpet. It includes a StripDown render class that will eliminate any markdown markup (essentially, rendering markdown as plain text). You can subclass it to reactivate the link method:
require 'redcarpet'
require 'redcarpet/render_strip'

module Redcarpet
  module Render
    class LinksOnly < StripDown
      def link(link, title, content)
        %{<a href="#{link}">#{content}</a>}
      end
    end
  end
end

str = "here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"

md = Redcarpet::Markdown.new(Redcarpet::Render::LinksOnly)
puts md.render(str)
# => here is my thing <a href="http://www.github.com">hope</a> and ...
This has the added benefit of letting you easily implement a few additional tags later (say, if you decide you want paragraph tags inserted for line breaks).
You could just do a replace.
Match this:
\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)
Replace with:
<a href="\2">\1</a>
In Ruby it should be something like:
str.gsub(/\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)/, '<a href="\2">\1</a>')
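As for the side question about getting an array of the matched phrases: with capture groups in the pattern, String#scan returns an array of [text, url] pairs. A quick sketch (not from the original answer) using the same pattern:

str = "here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
pairs = str.scan(/\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)/)
# => [["hope", "http://www.github.com"], ["hxxx", "http://www.some.com"]]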

DRY search every page of a site with nokogiri

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll also have to implement measures to avoid repeating work.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed those links into code similar to the above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I'll start collecting all the visited links as I visit them:
main_links.each do |ml|
  visited_links = [] # new array of what has been visited
  np = Nokogiri::HTML(open(page + ml)) # load the first main_link
  visited_links.push(ml) # push the page we're on
  np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq # grab all links on this page pointing to the current domain
  main_links.push(np_links).compact.uniq # remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(open(page_uri)) # Get the page
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri )           # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
Use recursion to keep crawling every link on every page, but bail out if you've already seen a page.
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search through your document, find the tags being used as buttons, and then parse the URL out of them.
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
nf.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward. The conditional in the map results in nil entries in the resulting array, so don't do that. Use select or reject to reduce the set of links that meet your criteria, and then use map to transform them. In your use here, pre-filtering using ^= in the CSS makes it even easier.
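For illustration, a short sketch (not from the original answer) of that select-then-map shape:

main_links = nf.css('a[href]')
               .map { |a| a['href'] }
               .select { |href| href.start_with?('/') } # keep only root-relative, same-site links
               .uniq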
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
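A minimal sketch of that idea, assuming the sqlite3 gem; the crawl.db file and links table are invented names for illustration:

require 'sqlite3'

# Sketch: a tiny on-disk ledger of URLs, with "href" kept unique.
db = SQLite3::Database.new('crawl.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS links (
    href    TEXT UNIQUE,
    visited INTEGER DEFAULT 0
  )
SQL

# INSERT OR IGNORE silently skips hrefs we've already recorded.
db.execute("INSERT OR IGNORE INTO links (href) VALUES (?)", ['http://example.com/about'])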
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link different if you start, then stop, then start again, or if you're not returning the cookies received from the site, so you have to return cookies, and figure out which query parameters are significant, and which are going to throw off your code. Keep the first and throw away the second when you store the links for later parsing.
Use an HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
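A rough sketch of the parallel-fetch half of that pipeline (urls and save_body are hypothetical placeholders for your data-store integration):

require 'typhoeus'

# Sketch: fetch a batch of URLs in parallel; storage/parsing happens elsewhere.
hydra = Typhoeus::Hydra.new(max_concurrency: 10)

urls.each do |url|                       # `urls` would come from your data-store
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    save_body(url, response.body)        # hypothetical persistence helper
  end
  hydra.queue(request)
end

hydra.run                                # blocks until every queued request finishes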
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed and then banned, at which point your crawl's throughput drops to zero.
It's a more complicated problem than you seem to realize. Using a crawling library along with Nokogiri is probably the way to go. Unless you're using Windows (like me), you might want to look into Anemone.

How to use Scrubyt properly to grab a URL from the XML output

I am by no means a master with Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on gets the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could fetch that page as well. The problem is I don't know how to grab the URL appropriately. This is my code:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)
The code prints the XML data appropriately (name and link), but how do I retrieve the link without the <link_url> tags that seem to get added to it? (When I print link_url, the tags are printed as well.) Could I do something as simple as fetch link_url, or is there a way of extracting the text from the XML content held in link_url?
This is some of the content printed by google_data.to_xml.write():
<root>
<link_title>
Ruby Programming Language
<link_url>http://ruby-lang.org/</link_url>
</link_title>
<link_title>
Download Ruby
<link_url>http://www.ruby-lang.org/en/downloads/</link_url>
</link_title>
<link_title>
Ruby - The Inspirational Weight Loss Journey on the Style Network ...
<link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
</link_title>
<link_title>
Ruby (programming language) - Wikipedia, the free encyclopedia
<link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
</link_title>
</root>
I'd think about alternatives. Scrubyt hasn't been updated in a while, and the forums have been shut down.
Mechanize can do what the Extractor does, Nokogiri can parse XML or HTML responses, and Builder can create XML (though it seems like you don't really want XML).
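To sketch what that alternative might look like (purely illustrative; Google's markup and the a.l class are taken from the question and will have changed):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/ncr')

form = page.forms.first
form['q'] = 'ruby'
results = agent.submit(form)

# Mechanize exposes the parsed Nokogiri document, so plain values come out directly.
results.parser.css("a.l").each do |a|
  puts a.text    # the result title
  puts a['href'] # the bare URL, no wrapping XML tags
end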
