This might be a similar problem to my two earlier questions (see here and here), but I'm trying to use the _detail command to automatically click the link so I can scrape the details page for each individual event.
The code I'm using is:
require 'rubygems'
require 'scrubyt'
nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'
  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end
  next_page "Next Page", :limit => 20
end
nuffield_data.to_xml.write($stdout,1)
Is there any way to print out the URL that event_detail is trying to access? The error doesn't seem to include the URL that returned the 404.
Update: I think the link may be a relative link - could this be causing problems? Any ideas how to deal with that?
I had the same issue with relative links and fixed it like this: you have to set the :resolve param to the correct base URL.
event do
  title 'The Coast of Mayo'
  link_url
  event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
    dates "1-4 October"
    times "7:30pm"
  end
end
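To see why the base URL matters, Ruby's standard URI library can show how a relative link is resolved against the page it appears on. This is only an illustrative sketch: the helper name resolve_link and the relative href event_details.php?id=42 are made up for the example, and it does not show what scrubyt does internally.

```ruby
require 'uri'

# Resolves a (possibly relative) href against the URL of the page it was found on.
# The last path segment of the base is replaced by the relative path.
def resolve_link(base_url, href)
  URI.join(base_url, href).to_s
end

# The listing page from the question.
listing_url = 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

# Hypothetical relative href, as it might appear in the page's HTML.
puts resolve_link(listing_url, 'event_details.php?id=42')
```

If the scraper resolves the link against the wrong base, the resulting URL points somewhere that does not exist, which would explain a 404.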
sudo gem install ruby-debug
This will give you access to a nice Ruby debugger. Start the debugger by altering your script:
require 'rubygems'
require 'ruby-debug'
Debugger.start
Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)
require 'scrubyt'
nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'
  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end
  next_page "Next Page", :limit => 2
end
nuffield_data.to_xml.write($stdout,1)
Then find out where scrubyt is throwing an exception - in this case:
/Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'
Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:
  if @@current_doc_protocol == 'file'
    @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
  else
    @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
    store_host_name(self.get_current_doc_url) # in case we're on a new host
  end
rescue
  debugger
  self # the self is here because debugger doesn't like being at the end of a method
end
Now run the script again and you should be dropped into a debugger when the exception is raised. Just try typing this at the debug prompt to see what the offending URL is:
@@current_doc_url
You can also add a debugger statement anywhere in that method if you want to check what is going on - for example you may want to add one between line 51 and 52 of this method to check how the url that is being called changes and why.
This is basically how I figured out the answer to your previous questions.
Good luck.
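The core of the technique above is a method-level rescue clause that inspects state at the moment the exception is raised. As a rough, self-contained sketch of that pattern (TinyFetcher is a hypothetical stand-in, not scrubyt code, and it prints the URL with warn instead of opening a debugger):

```ruby
# Hypothetical stand-in for a fetcher that keeps the current URL in class-level state.
class TinyFetcher
  @@current_doc_url = nil

  def self.fetch(doc_url)
    @@current_doc_url = doc_url
    # Simulate a 404 for one particular URL so the rescue clause is exercised.
    raise IOError, '404 Not Found' if doc_url.end_with?('/missing')
    "contents of #{doc_url}"
  rescue StandardError => e
    # Where the answer calls `debugger`, this sketch reports the URL and re-raises.
    warn "fetch failed for #{@@current_doc_url.inspect}: #{e.message}"
    raise
  end
end

puts TinyFetcher.fetch('http://example.com/ok')

begin
  TinyFetcher.fetch('http://example.com/missing')
rescue IOError
  puts 'the original exception still propagates after being reported'
end
```

Re-raising keeps the script's behaviour unchanged apart from the extra line naming the URL.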
Sorry, I have no idea why this would be nil. Every time I have run this it returns a URL. The method self.fetch requires a URL, which you should be able to access as the local variable doc_url. If this also returns nil, maybe you should post the code where you have included the debugger call.
I've tried to access doc_url but that seems to also return nil. When I have access to my server (later in the day) I'll post the code with the debugging bit in it.
Using Watir, I've written a script to check that multiple links lead to the right pages, as shown below.
Links= ["Link", "Link1"]
Links.each do |LinkValue|
  @browser.link(:text => LinkValue).wait_until_present.click
  fail unless @browser.text.include?(LinkValue)
  @browser.back
end
What I am trying is:
maintaining Linktext in an array
iterating with each linktext
verify
navigate to the previous page to start verifying with next linktext.
But the script is not working. It is not executing after first value and also not navigating back.
The following script works for me:
require 'watir'
browser = Watir::Browser.new(:firefox) # :chrome also works
browser.goto 'https://www.google.com/'
browser.link(text: 'Gmail').wait_until_present.click
sleep(10)
browser.back
sleep(10)
You are calling Kernel#fail, which raises an exception whenever the unless condition isn't satisfied.
In this case, it looks like you are expecting that the destination page will contain the same link text that was clicked on the originating page. If that's not true, then the script will raise an exception and terminate.
Here's a contrived "working" example (which only "works" because the link text exists on both originating and destination pages):
require 'watir'
b = Watir::Browser.new :chrome
b.goto "http://www.iana.org/domains/reserved"
links = ["Overview", "Root Zone Management"]
links.each do |link|
  b.link(:text => link).click
  fail unless b.text.include? link
  b.back
end
b.close
Some observations:
I wouldn't use fail here. You should investigate a testing framework like Minitest or RSpec, which have assertion methods for validating application behavior.
In Ruby, variables (and methods and symbols) should be in snake_case.
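One way to avoid aborting on the first mismatch is to collect the failing links and report them at the end. The sketch below runs without a browser: failed_links and the fake_pages data are hypothetical, and the block stands in for "click the link, read the page text, go back".

```ruby
# Returns the link texts whose destination page did not contain the link text.
# The block receives a link text and must return the destination page's text.
def failed_links(link_texts)
  link_texts.reject do |link_text|
    page_text = yield(link_text)
    page_text.include?(link_text)
  end
end

# Made-up page contents keyed by link text, replacing a real browser.
fake_pages = {
  'Overview' => 'Overview of reserved domains',
  'Root Zone Management' => 'Page not found'
}

failures = failed_links(fake_pages.keys) { |link_text| fake_pages.fetch(link_text) }
puts "Links with unexpected destinations: #{failures.inspect}"
```

With Watir, the block body would click the link, capture the browser text, navigate back, and return the captured text. A test framework assertion can then check that the returned list is empty.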
I am trying to read, in RSpec 3.1, a cookie received after a get call.
I can see that it is returned, but last_response.cookies doesn't exist.
How can I read response's cookie?
it "doesn't signs in" do
get '/ui/pages/Home'
puts last_response.cookies
end
I know it has been a while, but I faced exactly this issue and, after some struggle, found an article here that shares an interesting approach. I couldn't find any native method that parses the cookies either, and this has worked fine for me.
Basically, place the code below in your spec/spec_helper.rb:
def cookies_from_response(response=last_response)
  Hash[response["Set-Cookie"].lines.map { |line|
    cookie = Rack::Test::Cookie.new(line.chomp)
    [cookie.name, cookie]
  }]
end
and you could use this to see the parsed hash:
puts cookies_from_response
For a cookie's value check, you could then use something like:
# Given your cookie name is 'foo' and the content is 'bar'
expect(cookies_from_response['foo'].value).to eq 'bar'
Hopefully this helps others facing similar issues.
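To make the parsing step concrete without Rack::Test, here is a simplified, standard-library-only sketch of the same idea. The helper name simple_cookie_values and the sample header are made up, and unlike Rack::Test::Cookie it keeps only each cookie's name and value, ignoring attributes such as path or HttpOnly.

```ruby
# Parses a raw Set-Cookie header value (one cookie per line) into a
# name => value Hash. Cookie attributes after the first ';' are dropped.
def simple_cookie_values(set_cookie_header)
  pairs = set_cookie_header.to_s.lines.map do |line|
    name_value = line.chomp.split(';').first.to_s
    name_value.strip.split('=', 2)
  end
  # Skip blank or malformed lines that did not yield a name and a value.
  pairs.select { |pair| pair.length == 2 }.to_h
end

header = "foo=bar; path=/\nsession=abc123; HttpOnly"
cookies = simple_cookie_values(header)
puts cookies['foo']
puts cookies['session']
```

In a spec, the header string would come from last_response["Set-Cookie"], as in the helper above.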
I'm trying to get the following code from GitHub (from what looks to be a dead topic) working on my Linux/Ubuntu install. I have been trying to scrape data from my company intranet using "mechanize" (see stack question for details). Since I haven't figured out a way around my login issue, I thought I would feed data from an Excel sheet as a workaround until I can figure out the mechanize route. However, I can't get the provided code to work on Linux because I'm getting the following error:
`kqueue=': kqueue is not supported on this platform (EventMachine::Unsupported)
If I'm understanding correctly from the information provided in the original source, the problem is that kqueue isn't supported in Linux. The OP states that inotify is an alternative but I've had no luck finding a similar example using it to display Excel in a widget.
Here is the code shown on GitHub; I would like help converting it to work on Linux:
require 'roo'
EM.kqueue = EM.kqueue?
file_path = "#{Dir.pwd}/spreadsheet.xls"
def fetch_spreadsheet_data(path)
  s = Roo::Excel.new(path)
  send_event('valuation', { current: s.cell(1, 2) })
end
module Handler
  def file_modified
    fetch_spreadsheet_data(path)
  end
end
fetch_spreadsheet_data(file_path)
EM.next_tick do
  EM.watch_file(file_path, Handler)
end
Okay, so I was able to get this working and to display my data on a Dashing Dashboard widget by doing the following:
First: I uploaded my spreadsheet.xls to the root directory of my dashboard.
Second: I replaced the /jobs/sample.rb code with:
#!/usr/bin/env ruby
require 'roo'
SCHEDULER.every '2s' do
  file_path = "#{Dir.pwd}/spreadsheet.xls"
  def fetch_spreadsheet_data(path)
    s = Roo::Excel.new(path)
    send_event('valuation', { current: s.cell('B',49) })
  end
  module Handler
    def file_modified
      fetch_spreadsheet_data(path)
    end
  end
  fetch_spreadsheet_data(file_path)
end
Third: Make sure the /widgets/number widget is in your dashboard (it is part of the sample install).
Fourth: Add the following code to your /dashboards/sample.erb file (also part of the sample install).
<li data-row="1" data-col="1" data-sizex="1" data-sizey="1">
<div data-id="valuation" data-view="Number" data-title="Current Valuation" data-prefix="$"></div>
</li>
I used this source to help me better understand how Roo works. I tested my widget by changing my values and re-uploading spreadsheet.xls to the server, and saw instant changes on my dashboard.
Hope this helps someone and I'm still looking for help to automate this process by scraping the data. Reference this if you can help.
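Because the job above re-reads the spreadsheet every two seconds whether or not it changed, one possible refinement is to check the file's modification time first. The sketch below uses only the standard library; FileChangeDetector is a hypothetical helper, not part of Dashing or Roo, and the temporary file stands in for spreadsheet.xls.

```ruby
require 'tempfile' # only needed for the demonstration at the bottom

# Remembers a file's last seen modification time and reports when it changes.
class FileChangeDetector
  def initialize(path)
    @path = path
    @last_mtime = nil
  end

  # True on the first call and whenever the mtime differs from the last call.
  def changed?
    mtime = File.mtime(@path)
    return false if mtime == @last_mtime

    @last_mtime = mtime
    true
  end
end

Tempfile.create('spreadsheet') do |file|
  detector = FileChangeDetector.new(file.path)
  puts detector.changed? # first check, nothing seen before
  puts detector.changed? # file untouched since the previous check
  File.utime(Time.now, Time.now + 60, file.path) # simulate a later save
  puts detector.changed? # modification time moved forward
end
```

Inside the scheduler block, the spreadsheet would then only be parsed when detector.changed? returns true.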
Thanks for sharing this code sample. I did not manage to make it work in my environment (Raspberry/Raspbian), but after some effort I managed to come up with something that works, at least for me ;)
I had never worked with Ruby before this week, so this code may be a bit crappy. Please accept my apologies.
-- Christophe
require 'roo'
require 'rubygems'
require 'rb-inotify'
# Implement INotify::Notifier.watch as described here:
# https://www.go4expert.com/articles/track-file-changes-ruby-inotify-t30264/
file_path = "#{Dir.pwd}/datasheet.csv"
def fetch_spreadsheet_data(path)
  s = Roo::CSV.new(path)
  send_event('csvdata', { value: s.cell(1, 1) })
end
SCHEDULER.every '5s' do
  notifier = INotify::Notifier.new
  notifier.watch(file_path, :modify) do |event|
    event.flags.each do |flag|
      ## convert to string
      flag = flag.to_s
      puts case flag
           when 'modify' then fetch_spreadsheet_data(file_path)
           end
    end
  end
  ## loop, wait for events from inotify
  notifier.process
end
I have to say I am new both to Ruby and to RSpec. Anyway I completed one RSpec script but after refactoring it failed. Here is the original working version:
describe Site do
  browser = Watir::Browser.new :ie
  site = Site.new(browser, "http://localhost:8080/site")
  it "can navigate to any page at the site" do
    site.pages_names.each do |page_name|
      site.goto(page_name)
      site.actual_page.name.should eq page_name
    end
  end
  browser.close
end
and here is the modified version. I wanted every page visited during the test to be reported:
describe Site do
  browser = Watir::Browser.new :ie
  site = Site.new(browser, "http://localhost:8080/site")
  site.pages_names.each do |page_name|
    it "can navigate to #{page_name}" do
      site.goto(page_name)
      site.actual_page.name.should eq page_name
    end
  end
  browser.close
end
The problem in the latter case is that site gets evaluated to nil within the code block associated with 'it' method.
But when I did this:
...
s = site
it "can navigate to #{page_name}" do
  s.goto(page_name)
  s.actual_page.name.should eq page_name
end
...
the nil problem was gone, but the tests failed with the reason "browser was closed".
Apparently I am missing some very basic Ruby knowledge, because the browser reference is not working correctly in the modified script. Where did I go wrong? What refactoring should be applied to make this work?
Thanks for your help!
It's important to understand that RSpec, like many ruby programs, has two runtime stages:
During the first stage, RSpec loads each of your spec files, and executes each of the describe and context blocks. During this stage, the execution of your code defines your examples, the hooks, etc. But your examples and hooks are NOT executed during this stage.
Once RSpec has finished loading the spec files (and all examples have been defined), it executes them.
So...trimming down your example to a simpler form, here's what you've got:
describe Site do
  browser = Watir::Browser.new :ie
  it 'does something with the browser' do
    # do something with the browser
  end
  browser.close
end
While visually it looks like the browser instance is instantiated, then used in the example, then closed, here's what's really happening:
The browser instance is instantiated
The example is defined (but not run)
The browser is closed
(Later, after all examples have been defined...) The example is run
As O.Powell's answer shows, you can close the browser in an after(:all) hook to delay the closing until after all examples in this example group have run. That said, I'd question if you really need the browser instance at example definition time. Generally you're best off lazily creating resources (such as the browser instance) when examples need them as they are running, rather than during the example definition phase.
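The ordering described above can be reproduced with a toy runner. The sketch below is not RSpec and does not reflect its internals; TinyRunner and its methods are hypothetical names that only mimic the two stages (define first, run later) and record the order of events.

```ruby
# A toy "spec runner" that records the order in which things happen.
class TinyRunner
  attr_reader :events

  def initialize
    @events = []
    @examples = []
  end

  # Definition stage: the block body runs immediately.
  def describe(&block)
    instance_eval(&block)
  end

  # Examples are only stored here; their bodies do not run yet.
  def it(name, &block)
    @events << "defined: #{name}"
    @examples << [name, block]
  end

  def log(message)
    @events << message
  end

  # Execution stage: the stored example bodies finally run.
  def run
    @examples.each do |name, block|
      @events << "running: #{name}"
      instance_eval(&block)
    end
  end
end

runner = TinyRunner.new
runner.describe do
  log 'browser opened'
  it('uses the browser') { log 'example body' }
  log 'browser closed'
end
runner.run
puts runner.events
```

The recorded events show 'browser closed' before 'running: uses the browser', which is the same ordering that makes the real spec fail.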
I replicated your code above using fake classes for Site and Watir. It worked perfectly. My only conclusion then is that the issue must lie with either one of the above classes. I noticed the Site instance only had to visit one page in your first working version, but has to visit multiple pages in the non working version. There may be an issue there involving the mutation happening inside the instance.
See if this makes a difference:
describe Site do
  uri = "http://localhost:8080/site"
  browser = Watir::Browser.new :ie
  pages_names = Site.new(browser, uri).pages_names
  before(:each) { @site = Site.new(browser, uri) }
  after(:all) { browser.close }
  pages_names.each do |page_name|
    it "can navigate to #{page_name}" do
      @site.goto(page_name)
      @site.actual_page.name.should eq page_name
    end
  end
end
I'm working on a crawl, but before I crawl an entire website, I would like to run a test on a small number of pages first. So I was thinking something like the code below would work, but I keep getting a NoMethodError:
Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    crawled_page.links.slice(0..10)
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
  end
end
NoMethodError (private method `select' called for true:TrueClass):
app/models/site.rb:14:in `crawl'
app/controllers/sites_controller.rb:96:in `block in crawl'
app/controllers/sites_controller.rb:95:in `crawl'
Basically I want a way to first crawl only 10 pages, but I don't seem to understand the basics here. Can someone help me out?
Thanks!!
Add this monkeypatch to your crawling file.
module Anemone
  class Core
    def kill_threads
      @tentacles.each { |thread|
        Thread.kill(thread) if thread.alive?
      }
    end
  end
end
Here is an example of how to use it. After adding the patch to your crawling file, call kill_threads from your anemone.on_every_page block once the page limit is reached:
@counter = 0
Anemone.crawl("http://stackoverflow.com", :obey_robots_txt => true) do |anemone|
  anemone.on_every_page do |page|
    @counter += 1
    if @counter > 10
      anemone.kill_threads
    end
  end
end
Source: https://github.com/chriskite/anemone/issues/24
So I found the :depth_limit param and that will be OK, but I would rather limit it by the number of links.
I found your question while I was googling for Anemone.
I had the same problem, and with Anemone what I did was this:
As soon as I reach the URL limit that I want, I raise an exception. The whole Anemone block is inside a begin/rescue block.
In your specific case I would take another approach: download the page that you want to parse and bind it to FakeWeb. I wrote a blog entry about it a long time ago; maybe it will be useful: http://blog.bigrails.com/scraper-guide.html
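The raise-and-rescue idea described above can be sketched without Anemone. Everything here is hypothetical (PageLimitReached, visit_with_limit, and the fake page list); in a real crawl the raise would sit inside the per-page callback and the begin/rescue would wrap the whole Anemone.crawl call.

```ruby
# Raised to abort the crawl once enough pages have been seen.
class PageLimitReached < StandardError; end

# Yields each page to the block and stops after `limit` pages by raising
# PageLimitReached inside the loop and rescuing it outside.
def visit_with_limit(pages, limit)
  visited = []
  begin
    pages.each do |page|
      raise PageLimitReached if visited.size >= limit

      yield page if block_given?
      visited << page
    end
  rescue PageLimitReached
    # Expected: this is how the loop is cut short.
  end
  visited
end

# Made-up URLs standing in for pages discovered by a crawler.
fake_pages = (1..50).map { |n| "http://example.com/page#{n}" }
visited = visit_with_limit(fake_pages, 10) { |url| puts "checking #{url}" }
puts "stopped after #{visited.size} pages"
```

Using a dedicated exception class keeps the rescue from swallowing unrelated errors raised while processing a page.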