I have a Mechanize-based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images and so on.
Does anyone know of a library that will do this? Or is there another way of achieving the same outcome (the script runs much quicker the second time round)?
A good way of doing this type of thing is to use the (AWESOME) VCR gem.
Here's an example of how you would do it:
require 'vcr'
require 'mechanize'
# Setup VCR's configs. The cassette library directory is where
# all of your "recordings" are saved as YAML files.
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end
# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end
As you can see... VCR records the communication as a YAML file on the first run:
mario$ find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml
If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
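Alternatively, VCR's record modes let you add new interactions without deleting the file; a minimal sketch reusing the cassette above (:new_episodes records any request not already on the cassette and plays back the rest):
require 'vcr'
require 'mechanize'

VCR.use_cassette('google_homepage', record: :new_episodes) do
  a = Mechanize.new
  a.get('http://google.com/') # recorded only if not already on the cassette
end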
I'm not sure that caching the pages is going to help that much. What will help more is to have a record of previously visited URLs so you don't revisit them repeatedly. The page caching is moot because you should have already grabbed the important information when you saw the page the first time so all you need to do is check to see if you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.
I used to write analytical spiders using Perl's Mechanize, which Ruby's Mechanize is based on. Storing the previously visited URLs in some sort of cache, like a hash, was useful, but because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.
I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
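As a rough illustration of that idea, here is a minimal sketch of a SQLite-backed visited-URL store (the table and file names are mine, not from the original spiders):
require 'sqlite3'
require 'time'

db = SQLite3::Database.new('visited.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS visited (
    url     TEXT PRIMARY KEY,
    seen_at TEXT
  )
SQL

# True if the URL has been spidered before.
def visited?(db, url)
  !db.get_first_value('SELECT 1 FROM visited WHERE url = ?', url).nil?
end

# Record a URL; surviving a crash or restart is the point of going to disk.
def mark_visited(db, url)
  db.execute('INSERT OR IGNORE INTO visited (url, seen_at) VALUES (?, ?)',
             [url, Time.now.iso8601])
end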
Something else I'd recommend is using a YAML file for your app's configuration. Put every parameter that is likely to be changed during the app's run in there. Then write the app so it periodically checks that file's modification time and reloads it if there's been a change; a sketch of that pattern follows this paragraph. That way, you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex used to control which pages the app processed, I could fine-tune it without shutting down the app.
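A minimal sketch of that reload-on-change pattern (class and file names are illustrative):
require 'yaml'

class ReloadableConfig
  def initialize(path)
    @path = path
    @mtime = nil
    @data = {}
  end

  # Re-read the YAML file only when its modification time has changed.
  def data
    mtime = File.mtime(@path)
    if @mtime != mtime
      @data = YAML.load_file(@path)
      @mtime = mtime
    end
    @data
  end
end

config = ReloadableConfig.new('spider.yml')      # hypothetical file name
pattern = Regexp.new(config.data['page_regex'])  # hypothetical key; picks up edits on the fly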
If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.
# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]
# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)
I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages. e.g.:
# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
# for in-memory storage, you could use a Hash.
# or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)
  def get_cached(uri)
    cache_key = "_cache/#{uri}"
    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page
    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page
    end
  end
end
Which you could use like this:
require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'
storage = {}
foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")
pp storage
Which prints the following information:
D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
[#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
{"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"87",
"connection"=>"close",
"content-type"=>"text/plain"},
"Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
"200"],
"_cache/http://ifconfig.me/encoding"=>
[#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
{"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"42",
"connection"=>"close",
"content-type"=>"text/plain"},
"gzip,deflate,identity\n",
"200"]}
How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
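A minimal sketch of that idea, caching each page body in a file keyed by a digest of its URL (the directory and helper names are mine):
require 'mechanize'
require 'nokogiri'
require 'digest'
require 'fileutils'

CACHE_DIR = 'page_cache' # illustrative directory name
FileUtils.mkdir_p(CACHE_DIR)

# Returns the page body, hitting the network only on a cache miss.
def fetch_body(agent, url)
  path = File.join(CACHE_DIR, "#{Digest::SHA1.hexdigest(url)}.html")
  return File.read(path) if File.exist?(path)

  body = agent.get(url).body
  File.write(path, body)
  body
end

agent = Mechanize.new
doc = Nokogiri::HTML(fetch_body(agent, 'http://google.com/'))
# Tweak your parsing of doc as often as you like; reruns read from disk.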
I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although the answer of Layon Ferreira already states the problem, it does not provide the steps needed to load the data.
As already said in the linked answer, the data is loaded asynchronously. This means that the data is not present on the initial page and is loaded through the JavaScript engine executing code.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page to see a list of all requests made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This is caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7 you can strip the BOM manually, although set_encoding_by_bom is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
That page is using AJAX to load its data.
In that case you may use Watir to fetch the page using a real browser, as answered here: https://stackoverflow.com/a/13792540/2784833
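A rough sketch of the Watir route (it assumes a local Chrome install; the CSS selector is copied from the question):
require 'watir'
require 'nokogiri'

browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html'

# browser.html is the DOM after the JavaScript has run,
# so the asynchronously loaded values are present.
doc = Nokogiri::HTML(browser.html)
total_cases = doc.css('span.count')[0]&.text
browser.close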
Another way is to get the data from the API directly. You can see the other endpoints by checking the network tab in your browser console.
I replicated your code and found some errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your url variable's value, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data, you might want to use these APIs:
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here
I'm writing automated tests using Selenium WebDriver with Ruby, and I'm thinking of keeping element locators in one file and the actual code in another. For Ruby, I found the yaml gem, which allows you to store data and access it. Hence I stored elements in lib.yml and test code in test.rb as follows:
lib/lib.yml
homepage:
  frame: 'mainPage'
  email: 'loginPage-email'
  password: 'loginPage-password'
  login_button: 'btnLogin'
tests/test.rb
require 'selenium-webdriver'
require 'yaml'
driver = Selenium::WebDriver.for :firefox
driver.get 'http://www.abc.com'
config = YAML.load_file('./lib/lib.yml')
driver.switch_to.frame(config['homepage']['frame'])
email = driver.find_element(:id, config['homepage']['email'])
password = driver.find_element(:id, config['homepage']['password'])
email.clear
email.send_keys 'abc@gmail.com'
password.clear
password.send_keys 'password'
driver.find_element(:id, config['homepage']['login_button']).click
driver.quit
This way maintenance becomes easier. I just want to make sure whether doing so is a good approach or not, since I'm trying this for the first time and don't know what difficulties I'll run into if I choose it for a larger project.
I know that using the Page Object model we can achieve the same thing, but I don't know much about Page Objects. So should I avoid using the yaml gem and go directly for the page-object gem?
Also, can someone explain why using yaml would not be a good idea (if it's not)?
Note:
In the above code, config['homepage']['something'] is repetitive. I'll write a method to avoid that repetition, along the lines of the sketch below.
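For example, a hypothetical helper (the names are mine, and driver comes from the snippet above):
require 'yaml'

CONFIG = YAML.load_file('./lib/lib.yml')

# Hypothetical shorthand for the repeated CONFIG['homepage'][...] lookups.
def locator(key, section = 'homepage')
  CONFIG[section][key]
end

email = driver.find_element(:id, locator('email'))
password = driver.find_element(:id, locator('password'))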
Yeah, this definitely is useful. It keeps changes to a minimum when there is a UI change in the future; you always have just one place to edit. Is there any data you have to pass to your code? How are you storing the automation data passed to your tests? The only concern is that you might end up with too many yaml files, which could be difficult to keep track of.
In your specific case I don't see how this adds much value. Half of the settings (frame, login_button) won't change for your tests, so I suggest leaving them directly in the code where they are used. The HTML structure is not something that usually changes.
The other two values (email, password) seem like they might change when you want to try out different users (i.e. different cases). If you have one test with several example inputs, then I suggest a more readable solution such as Cucumber.
(I'd suggest using capybara anyway for testing browser interaction, as it abstracts away many details of the underlying driver)
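For reference, a rough Capybara equivalent of the question's snippet (an untested sketch; run_server is disabled since we're driving a remote site rather than a local app):
require 'capybara/dsl'
require 'selenium-webdriver'

Capybara.run_server = false
Capybara.default_driver = :selenium # Firefox, as in the question
include Capybara::DSL

visit 'http://www.abc.com'
within_frame 'mainPage' do
  fill_in 'loginPage-email', with: 'abc@gmail.com'
  fill_in 'loginPage-password', with: 'password'
  click_button 'btnLogin'
end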
Apart from that, yaml is usually the Ruby way of storing configuration.
I added one more step: declaring the locator (id, name, etc.) in the yaml itself. For example, I declared an env.rb which loads the environment from yaml files.
env.yml:
LOGIN:
  UserName: {id: UserName}
  Password: {id: Password}
  RememberME: {id: RememberMe}
  Submit: {xpath: "//input[@value='Log On']"}
Then added "pages\Login.rb"
# Loads all objects from yaml (instance variables, so other methods can use them)
def get_objects
  @username    = @browser.find_element($object_array['LOGIN']['UserName'])
  @password    = @browser.find_element($object_array['LOGIN']['Password'])
  @remember_me = @browser.find_element($object_array['LOGIN']['RememberME'])
  @submit      = @browser.find_element($object_array['LOGIN']['Submit'])
end

# Added methods in this class like:
def loginas(uname, pass)
  @username.send_keys uname
  @password.send_keys pass
  @remember_me.click
  @submit.click
end # loginas
Then I created the tests file Login_tests.rb:
lp = LoginPage.new(@browser)
lp.navigate
lp.loginas('SiteAdmin', 'password123')
This way your scripts are maintainable and, most importantly, you are free of any other external gem or dependency.
I have written a Jekyll plugin to display the number of pageviews on a page by calling the Google Analytics API using the garb gem. The only trouble with my approach is that it makes a call to the API for each page, slowing down build time and also potentially hitting the user call limits on the API.
It would be possible to return all the data in a single call and store it locally, and then look up the pageview count from each page, but my Jekyll/Ruby-fu isn't up to scratch. I do not know how to write the plugin to run once to get all the data and store it locally where my current function could then access it, rather than calling the API page by page.
Basically my code is written as a liquid block that can be put into my page layout:
class GoogleAnalytics < Liquid::Block
  def initialize(tag_name, markup, tokens)
    super # options that appear in block (between tag and endtag)
    @options = markup # optional options passed in by opening tag
  end

  def render(context)
    path = super

    # Read in credentials and authenticate
    cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
    Garb::Session.api_key = cred[:api_key]
    token = Garb::Session.login(cred[:username], cred[:password])
    profile = Garb::Management::Profile.all.detect { |p| p.web_property_id == cred[:ua] }

    # Place query; customize to modify results
    data = Exits.results(profile,
                         :filters => { :page_path.eql => path },
                         :start_date => Chronic.parse("2011-01-01"))
    data.first.pageviews
  end
end
The full version of my plugin is here.
How can I move all the calls to the API into some other function, make sure Jekyll runs it once at the start, and then adjust the tag above to read that local data?
EDIT: Looks like this can be done with a Generator that writes the data to a file; see the example on this branch. Now I just need to figure out how to subset the results: https://github.com/Sija/garb/issues/22
To store the data, I had to:
1. Write a Generator class (see Jekyll wiki plugins) to call the API.
2. Convert the data to a hash for easy lookup by path (see step 5):
   result = Hash[data.collect{|row| [row.page_path, [row.exits, row.pageviews]]}]
3. Write the data hash to a JSON file.
4. Read in the data from the file in my existing Liquid block class. Note that the block tag works from the _includes dir, while the generator works from the root directory.
5. Match the page path, which is easy once the data is converted to a hash:
   result[path][1]
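A compressed sketch of steps 1-3 (the class name and output path are illustrative, and fetch_all_results stands in for the single garb query):
require 'json'

module Jekyll
  class AnalyticsGenerator < Generator
    def generate(site)
      # One API call covering all pages (fetch_all_results is a stub here).
      data = fetch_all_results
      result = Hash[data.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]
      File.write('_analytics.json', result.to_json)
    end
  end
end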
Code for the full plugin, showing how to create the generator and write files, etc, here
And thanks to Sija on GitHub for help on this.
In the project I am working on, we use VCR to store cassettes for both local and external services. The local ones are microservices that are constantly modified, while the external ones are hardly ever modified.
For this reason, plus the fact that the external services take a long time to re-record, it makes sense for us to re-record just the local cassettes most of the time.
To solve that, we tried separating the cassettes into different folders (cassettes/localhost and cassettes/external/sample.com).
Then we came up with:
VCR.configure do |config|
  config.around_http_request do |request|
    host = URI(request.uri).host
    vcr_name = VCR.current_cassette.name
    folder = host
    folder = "external/#{folder}" if host != 'localhost'
    VCR.use_cassette("#{folder}/#{vcr_name}", &request)
  end

  [...]
end
But the problem is that we have some tests that need to make repeated requests (exactly the same request) where the server returns different results. With the code above, each HTTP call gets its own use_cassette block, so the first request is recorded and the second is a playback of the first, even when the response was expected to be different.
Then we tried a different approach using tags and nested cassettes:
RSpec.configure do |config|
  config.around(:each) do |spec|
    name = spec.metadata[:full_description]
    VCR.use_cassette "external/#{name}", tag: :external do
      VCR.use_cassette "local/#{name}", tag: :internal do
        spec.call
      end
    end
  end

  [...]
end
VCR.configure do |config|
  config.before_record(:external) do |i|
    i.ignore! if URI(i.request.uri).host == 'localhost'
  end

  config.before_record(:internal) do |i|
    i.ignore! if URI(i.request.uri).host != 'localhost'
  end

  [...]
end
But this doesn't work either. The outcome was that all localhost requests were recorded on the internal cassette, and the rest of the requests were ignored by VCR.
So do you have any suggestions for solving this?
Then we tried a different approach using tags and nested cassettes... But this doesn't work either.
Yeah, I didn't design cassette nesting with this kind of use case in mind. HTTP interactions are always recorded to the innermost cassette, but can be played back from any level of nesting (it tries the innermost cassette first, then searches up the parent chain). The main use case I had in mind for nesting cassettes was for cucumber: you may want to use a single cassette for an entire scenario, but then you may want to use a particular cassette for an individual step definition (i.e. for any scenario that uses that step). The inner cassette "takes over" while it is in use, but the outer cassette is still there to be available for when the inner cassette is ejected.
This is an interesting use case, though... if you think VCR should work that way, please consider opening a GitHub issue for it and we can discuss it more.
As for your original question: I think @operand's answer will work.
I think you want to look into using the :match_requests_on setting. Read the docs here: https://www.relishapp.com/myronmarston/vcr/v/2-3-0/docs/request-matching
This should allow you to record multiple requests to the same URL but have them replayed in sequence.
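For illustration, a minimal sketch (the cassette name and localhost URL are placeholders; :method and :uri are in fact VCR's default matchers):
# Assumes VCR is already configured as in the question.
VCR.use_cassette('repeated_requests', match_requests_on: [:method, :uri]) do
  # Identical requests are recorded as separate interactions on the first
  # run, then replayed in recorded order on subsequent runs.
  2.times { Net::HTTP.get(URI('http://localhost:3000/status')) }
end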
In addition to that, I think your method of splitting up the cassettes into different directories sounds good. One thing I've done in the past to force re-recording of specific cassettes is to just delete the cassettes themselves before rerunning the specs. In your case that should be easy, since you've separated them nicely.
Instead of or in addition to that, you could possibly use the :all record setting when you know it's a local request, and patch that into your configure block. Something like:
VCR.configure do |config|
  config.around_http_request do |request|
    host = URI(request.uri).host
    vcr_name = VCR.current_cassette.name
    folder = host
    if host != 'localhost'
      folder = "external/#{folder}"
      record_mode = :once
    else
      record_mode = :all
    end
    VCR.use_cassette("#{folder}/#{vcr_name}", :record => record_mode, &request)
  end

  [...]
end
Note, I haven't tested this, so please double check me on that. Of course, you'd also want to not use the :all record setting when you just want to play things back. Maybe you can develop a switch somehow when you invoke the tests.
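For example, a hypothetical environment-variable switch inside the around_http_request hook:
# Re-record local interactions only when RERECORD_LOCAL is set, e.g.
#   RERECORD_LOCAL=1 bundle exec rspec
record_mode = if host == 'localhost' && ENV['RERECORD_LOCAL']
                :all
              else
                :once
              end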
It's not a complete answer, but I hope this helps.
I am trying to use Ruby and Mechanize to parse data on Foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But when I run this code, the following error pops up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that these two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open Foursquare's page.
Has anyone had this problem before, where the field names change every time you open a webpage? How should I solve it?
Our project is about parsing the data on Foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
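A minimal sketch with foursquare2 (the credentials are placeholders; check the gem's README for the exact auth flow):
require 'foursquare2'

client = Foursquare2::Client.new(client_id: 'YOUR_CLIENT_ID',
                                 client_secret: 'YOUR_CLIENT_SECRET')

# Userless venue search; authenticated endpoints need an OAuth token instead.
venues = client.search_venues(ll: '40.7,-74', query: 'coffee')
venues.venues.each { |v| puts v.name }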
Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the name that changes on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However, like dwhalen said, using the REST API is probably a much better way. That's why it's there.