I am trying to download a file from Watir but I don't want to use the loop sleep approach.
I would rather at the last moment of the interaction, recreate the session Watir has on the webpage and use another library, for example Typhoeus.
Typhoeus uses curl and can use cookies from a file, nonetheless, Watir generates a Hash Array, and if I ask to save it, it saves it as a YAML file.
Is there a faster way to convert it?
Another article in StackOverflow said that curl uses Mozilla style cookie files.
So, if your Watir instance is browser and the file to which you are going to write to is file you can do
browser.cookies.to_a.each do |ch|
terms = []
terms << ch[:domain]
terms << ch[:same_site].nil? ? 'FALSE' : 'TRUE'
terms << ch[:path]
terms << ch[:secure] ? 'TRUE' : 'FALSE'
terms << ch[:expires].to_i.to_s
terms << ch[:name]
terms << ch[:value]
file.puts terms.join("\t")
end
Then you can tell Typhoeus to use the contents of file to keep navigating using the same cookies.
Related
So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
html_doc = Nokogiri::HTML(html)
nodes = html_doc.xpath("//a[#href]")
nodes.inject([]) do |uris, node|
uris << node.attr('href').strip
end.uniq
end
I am current getting a bunch of links, most of which are images, but not all. I want to narrow down the links before downloading with a regex. So far, I haven't been able to come up with a Ruby-Friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else, and tried to edit it to work and I'm failing. One of the big problems I'm having is the original Regex I took had a few "#"'s in it, which I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/#href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
Is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend this one, Rubular.com is good for experimenting with these.
Regexp is a very powerful tool but - compared to simple string comparisons - they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
# ...
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
uris = []
Nokogiri::HTML(html).xpath("//a[#href]").each do |node|
uri = node.attr('href').strip
uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
end
uris.uniq
end
I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
visited_links = [] #new array of what is visted
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visted_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
files = %w[png jpeg jpg gif svg txt js css zip gz]
starting_uri = URI.parse(starting_at)
seen_pages = Set.new # Keep track of what we've seen
crawl_page = ->(page_uri) do # A re-usable mini-function
unless seen_pages.include?(page_uri)
seen_pages << page_uri # Record that we've seen this
begin
doc = Nokogiri.HTML(open(page_uri)) # Get the page
each_page.call(doc,page_uri) # Yield page and URI to the block
# Find all the links on the page
hrefs = doc.css('a[href]').map{ |a| a['href'] }
# Make these URIs, throwing out problem ones like mailto:
uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact
# Pare it down to only those pages that are on the same site
uris.select!{ |uri| uri.host == starting_uri.host }
# Throw out links to files (this could be more efficient with regex)
uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
# Remove #foo fragments so that sub-page links aren't differentiated
uris.each{ |uri| uri.fragment = nil }
# Recursively crawl the child URIs
uris.each{ |uri| crawl_page.call(uri) }
rescue OpenURI::HTTPError # Guard against 404s
warn "Skipping invalid link #{page_uri}"
end
end
end
crawl_page.call( starting_uri ) # Kick it all off!
end
crawl_site('http://phrogz.net/') do |page,uri|
# page here is a Nokogiri HTML document
# uri is a URI instance with the address of the page
puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
Use recursion to keep crawling every link on every page, but bailing out if you've already seen the page.
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search throughout your document and find tags being used as buttons, then parse out the URL.
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
links.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward. The conditional in the map results in nil entries in the resulting array, so don't do that. Use select or reject to reduce the set of links that meet your criteria, and then use map to transform them. In your use here, pre-filtering using ^= in the CSS makes it even easier.
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link different if you start, then stop, then start again, or if you're not returning the cookies received from the site, so you have to return cookies, and figure out which query parameters are significant, and which are going to throw off your code. Keep the first and throw away the second when you store the links for later parsing.
Use a HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel, and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed then banned. Your site will go to zero throughput at that point.
It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using windows (like me) you might want to look into Anemone.
Is it possible to open every link in certain div and collect values of opened fields alltogether in one file or at least terminal output?
I am trying to get list of coordinates from all markers visible on google map.
all_links = b.div(:id, "kmlfolders").links
all_links.each do |link|
b.link.click
b.link(:text, "Norādījumi").click
puts b.text_field(:title, "Galapunkta_adrese").value
end
Are there easier or more effective ways how to automatically collect coordinates from all markers?
Unless there is other data (alt tags? elements invoked via onhover?) in the HTML already that you could pick through, that does seem like the most practical way to iterate through the links, however from what I can see you are not actually making use of the 'link' object inside your loop. You'd need something more like this I think
all_links = b.div(:id, "kmlfolders").links
all_links.each do |thelink|
b.link(:href => thelink.href).click
b.link(:text, "Norādījumi").click
puts b.text_field(:title, "Galapunkta_adrese").value
end
Probably using their API is a lot more effective means to get what you want however, it's why folks make API's after all, and if one is available, then using it is almost always best. Using a test tool as a screen-scraper to gather the info is liable to be a lot harder in the long run than learning how to make some api calls and get the data that way.
for web based api's and Ruby I find the REST-CLIENT gem works great, other folks like HTTP-Party
As I'm not already familiar with Google API, I find it hard for me to dig into API for one particular need. Therefor I made short watir-webdriver script for collecting coordinates of markers on protected google map. Resulting file is used in python script that creates speedcam files for navigation devices.
In this case it's speedcam map maintained and updated by Latvian police, but this script can probably be used with any google map just by replacing url.
# encoding: utf-8
require "rubygems"
require "watir-webdriver"
#b = Watir::Browser.new :ff
#--------------------------------
#b.goto "http://maps.google.com/maps?source=s_q&f=q&hl=lv&geocode=&q=htt%2F%2Fmaps.google.com%2Fmaps%2Fms%3Fmsid%3D207561992958290099079.0004b731f1c645294488e%26msa%3D0%26output%3Dkml&aq=&sll=56.799934,24.5753&sspn=3.85093,8.64624&ie=UTF8&ll=56.799934,24.5753&spn=3.610137,9.887695&z=7&vpsrc=0&oi=map_misc&ct=api_logo"
#b.div(:id, "kmlfolders").wait_until_present
all_markers = #b.div(:id, "kmlfolders").divs(:class, "fdrlt")
#prev_coordinates = 1
puts "#{all_markers.length} speedcam markers detected"
File.open("list_of_coordinates.txt","w") do |outfile|
all_markers.each do |marker|
sleep 1
marker.click
sleep 1
description = #b.div(:id => "iw_kml").text
#b.span(:class, "actbar-text").click
sleep 2
coordinates = #b.text_field(:name, "daddr").value
redo if coordinates == #prev_coordinates
puts coordinates
outfile.puts coordinates
#prev_coordinates = coordinates
end
end
puts "Coordinates saved in file!"
#b.close
Works both on Mac OSX 10.7 and Windows7.
I'm having issues tidying up malformed XML code I'm getting back from the SEC's edgar database.
For some reason they have horribly formed xml. Tags that contain any sort of string aren't closed and it can actually contain other xml or html documents inside other tags. Normally I'd had this off to Tidy but that isn't being maintained.
I've tried using Nokogiri::XML::SAX::Parser but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag and then it doesn't fire off on any more of them. But it is spiting out the right characters.
class Filing < Nokogiri::XML::SAX::Document
def start_element name, attrs = []
puts "starting: #{name}"
end
def characters str
puts "chars: #{str}"
end
def end_element name
puts "ending: #{name}"
end
end
It seems like this would be the best option because I can simply have it ignore the other xml or html doc. Also it would make the most sense because some of these documents can get quite large so storing the whole dom in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser
Nokogiri's normal DOM mode is able to automatically fix-up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'
doc = ''
File.open('./test.xml') do |fi|
doc = Nokogiri::XML(fi)
end
puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr, and the code on stdout, so you can pipe it easily to another file.
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.
I would like to start with a little script that fetches the examination results of me and my friends from our university website.
I would like to pass it the roll number as the post parameter and work with the returned data,
I don't know how to create the post string.
It would be great if someone could tell me where to start, what are the things to learn, links to a tutorial would be most appreciated.
I don´t want someone to write code for me, just guidance on how to get started.
I've written a solution here just as a reference for whatever you might come up with. There are multiple ways of attacking this.
#fetch_scores.rb
require 'open-uri'
#define a constant named URL so if the results URL changes we don't
#need to replace a hardcoded URL everywhere.
URL = "http://www.nitt.edu/prm/ShowResult.html?¶m="
#checking the count of arguments passed to the script.
#it is only taking one, so let's show the user how to use
#the script
if ARGV.length != 1
puts "Usage: fetch_scores.rb student_name"
else
name = ARGV[0] #could drop the ARGV length check and add a default using ||
# or name = ARGV[0] || nikhil
results = open(URL + name).read
end
You might examine Nokogiri or Hpricot to better parse/format your results. Ruby is an "implicit return" language so if you happened to wonder why we didn't have a return statement that's because results will be returned by the script since it was last executed.
You could have a look at the net/http library, included as part of the standard library. See http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html for details, there are some examples on that page to get you started.
A very simple way to do this is to use the open-uri library and just put the query parameters in the URL query string:
require 'open-uri'
name = 'nikhil'
results = open("http://www.nitt.edu/prm/ShowResult.html?¶m=#{name}").read
results now contains the body text fetched from the URL.
If you are looking for something more ambitious, look at net/http and the httparty gem.