anemone Ruby with focus_crawl

anemone Ruby with focus_crawl - ruby

I'm working to do a crawl, but before I crawl an entire website, I would like to shoot off a test, of to or so pages. So I was thinking something like below would work, but I keep getting a nomethoderror....
Anemone.crawl(self.url) do |anemone|
anemone.focus_crawl do |crawled_page|
crawled_page.links.slice(0..10)
page = pages.find_or_create_by_url(crawled_page.url)
logger.debug(page.inspect)
page.check_for_term(self.term, crawled_page.body)
end
end
NoMethodError (private method `select' called for true:TrueClass):
app/models/site.rb:14:in `crawl'
app/controllers/sites_controller.rb:96:in `block in crawl'
app/controllers/sites_controller.rb:95:in `crawl'
Basically I want to have a way to first craw only 10 pages, but I seem to be not understanding the basics here. Can someone help me out?
Thanks!!

Add this monkeypatch to your crawling file.
module Anemone
class Core
def kill_threads
#tentacles.each { |thread|
Thread.kill(thread) if thread.alive?
}
end
end
end
Here is an example of how to use it after you've added it to your crawling file.Then in the file which you are running your add this to your anemone.on_every_page method
#counter = 0
Anemone.crawl(http://stackoverflow.com, :obey_robots => true) do |anemone|
anemone.on_every_page do |page|
#counter+= 1
if #counter > 10
anemone.kill_threads
end
end
end
Source: https://github.com/chriskite/anemone/issues/24

So I found the :depth_limit param and that will be ok, but I would rather limit it to # of links.

i found your question while i was googling for anemone.
I had the same problem. And with Anemone, what i did was:
As soon as i reach the URL limit that i want, i raise an exception. The whole anemone block is inside a begin/rescue block.
In your case specific i would take another approach. I would download the page that you want to parse, and bind it to fakeweb. I wrote a blog entry about it, long time ago, maybe it would be useful: http://blog.bigrails.com/scraper-guide.html

Related

poltergeist doesn't seem to wait for phantomjs to load in capybara

I'm trying to get some rspec tests run using a mix of Capybara, Selenium, Capybara/webkit, and Poltergeist. I need it to run headless in certain cases and would rather not use xvfb to get webkit working. I am okay using selenium or poltergeist as the driver for phantomjs. The problem I am having is that my tests run fine with selenium and firefox or chrome but when I try phantomjs the elements always show as not found. After looking into it for a while and using page.save_screenshot in capybara I found out that the phantomjs browser wasn't loaded up when the driver told it to find elements so it wasn't returning anything. I was able to hack a fix to this in by editing the poltergeist source in <gem_path>/capybara/poltergeist/driver.rb as follows
def visit(url)
if #started
sleep_time = 0
else
sleep_time = 2
end
#started = true
browser.visit(url)
sleep sleep_time
end
This is obviously not an ideal solution for the problem and it doesn't work with selenium as the driver for phantomjs. Is there anyway I can tell the driver to wait for phantom to be ready?
UPDATE:
I was able to get it to run by changing where I included the Capybara::DSL. I added it to the RSpec.configure block as shown below.
RSpec.configure do |config|
config.include Capybara::DSL
I then passed the page object to all classes I created for interacting with the webpage ui.
An example class would now look like this
module LoginUI
require_relative 'webpage'
class LoginPage < WebPages::Pages
def initialize(page, values = {})
super(page)
end
def visit
browser.visit(login_url)
end
def login(username, password)
set_username(username)
set_password(password)
sign_in_button
end
def set_username(username)
edit = browser.find_element(#selectors[:login_edit])
edit.send_keys(username)
end
def set_password(password)
edit = browser.find_element(#selectors[:password_edit])
edit.send_keys(password)
end
def sign_in_button
browser.find_element(#selectors[:sign_in_button]).click
end
end
end
Webpage module looks like this
module WebPages
require_relative 'browser'
class Pages
def initialize(page)
#page = page
#browser = Browser::Browser.new
end
def browser
#browser
end
def sign_out
browser.visit(sign_out_url)
end
end
end
The Browser module looks like this
module Browser
class Browser
include Capybara::DSL
def refresh_page
page.evaluate_script("window.location.reload()")
end
def submit(locator)
find_element(locator).click
end
def find_element(hash)
page.find(hash.keys.first, hash.values.first)
end
def find_elements(hash)
page.find(hash.keys.first, hash.values.first, match: :first)
page.all(hash.keys.first, hash.values.first)
end
def current_url
return page.current_url
end
end
end
While this works I don't want to have to include the Capybara::DSL inside RSpec or have to include the page object in the classes. These classes have had some things removed for the example but show the general structure. Ideally I would like to have the Browser module include the Capybara::DSL and be able to handle all of the interaction with Capybara.

Your update completely changes the question so I'm adding a second answer. There is no need to include the Capybara::DSL in your RSpec configure if you don't call any Capybara methods from outside your Browser class, just as there is no need to pass 'page' to all your Pages classes if you limit all Capybara interaction to your Browser class. One thing to note is that the page method provided by Capybara::DSL is just an alias for Capybara.current_session so technically you could just always call that.
You don't show in your code how you're handling any assertions/expectations on the page content - so depending on how you're doing that you may need to include Capybara::RSpecMatchers in your RSpec config and/or your WebPages::Pages class.
Your example code has a couple of issues that immediately pop out, firstly your Browser#find_elements (assuming I'm reading your intention for having find first correctly) should probably just be
def find_elements(hash)
page.all(hash.keys.first, hash.values.first, minimum: 1)
end
Secondly, your LoginPage#login method should have an assertion/expectation on a visual change that indicates login succeeded as its final line (verify some message is displayed/logged in menu exists/ etc), to ensure the browser has received the auth cookies, etc before the tests move on. What that line looks like depends on exactly how you're architecting your expectations.
If this doesn't answer your question, please provide a concrete example of what exactly isn't working for you since none of the code you're showing indicates any need for Capybara::DSL to be included in either of the places you say you don't want it.

Capybara doesn't depend on visit having completed, instead the finders and matchers will retry up to a specified period of time until they succeed. You can increase this amount of time by increasing the value of Capybara.default_max_wait_time. The only methods that don't wait by default are first and all, but can be made to wait/retry by specifying any of the count options
first('.some_class', minimum: 1) # will wait up to Capybara.default_max_wait_time seconds for the element to exist on the page.
although you should always prefer find over first/all whenever possible
If increasing the maximum wait time doesn't solve your issue, add an example of a test that fails to your question.

Use with Excel data to display on Dashing Dashboard?

I'm trying to get an example of the following code from github that looks to be a dead topic for my Linux/Ubuntu install. I have been trying to scrape data from my company intranet using "mechanize" see stack question for details. Since I'm not smart enough to figure a way around my login issue I thought I would try and feed data from an excel sheet as a work around until I can figure out the mechanize route. Once again I'm not smart enough to get the provided code to work on Linux because I'm getting the following error:
`kqueue=': kqueue is not supported on this platform (EventMachine::Unsupported)
If I'm understanding correctly from the information provided in the original source, the problem is that kqueue isn't supported in Linux. The OP states that inotify is an alternative but I've had no luck finding a similar example using it to display Excel in a widget.
Here is the code that is shown on GitHub and would like help converting it to work on Linux:
require 'roo'
EM.kqueue = EM.kqueue?
file_path = "#{Dir.pwd}/spreadsheet.xls"
def fetch_spreadsheet_data(path)
s = Roo::Excel.new(path)
send_event('valuation', { current: s.cell(1, 2) })
end
module Handler
def file_modified
fetch_spreadsheet_data(path)
end
end
fetch_spreadsheet_data(file_path)
EM.next_tick do
EM.watch_file(file_path, Handler)
end

Okay, so I was able to get this working and to display my data on a Dashing Dashboard widget by doing the following:
First: I uploaded my spreadsheet.xls to the root directory of my dashboard.
Second: I replaced the /jobs/sample.rb code with:
#!/usr/bin/env ruby
require 'roo'
SCHEDULER.every '2s' do
file_path = "#{Dir.pwd}/spreadsheet.xls"
def fetch_spreadsheet_data(path)
s = Roo::Excel.new(path)
send_event('valuation', { current: s.cell('B',49) })
end
module Handler
def file_modified
fetch_spreadsheet_data(path)
end
end
fetch_spreadsheet_data(file_path)
end
Third: Make sure the /widgets/number is in your dashboard "this is part of the sample install".
Fourth: Add the following code to your /dashboards/sample.erb file "this is part of the sample install as well".
<li data-row="1" data-col="1" data-sizex="1" data-sizey="1">
<div data-id="valuation" data-view="Number" data-title="Current Valuation" data-prefix="$"></div>
</li>
I used this source to help me better understand how Roo works. I tested my widget by changing my values and re-uploading the spreadsheet.xls to server and seen instant changes on my dashboard.
Hope this helps someone and I'm still looking for help to automate this process by scraping the data. Reference this if you can help.

Thanks for sharing this code sample. I did not manage to make it work in my environment (Raspberry/Raspbian) but after some efforts I managed to come up something that works -- at least for me ;)
I had never worked with Ruby before this week, so this code may be a bit crappy. Please accept apologizes.
-- Christophe
require 'roo'
require 'rubygems'
require 'rb-inotify'
# Implement INotify::Notifier.watch as described here:
# https://www.go4expert.com/articles/track-file-changes-ruby-inotify-t30264/
file_path = "#{Dir.pwd}/datasheet.csv"
def fetch_spreadsheet_data(path)
s = Roo::CSV.new(path)
send_event('csvdata', { value: s.cell(1, 1) })
end
SCHEDULER.every '5s' do
notifier = INotify::Notifier.new
notifier.watch(file_path, :modify) do |event|
event.flags.each do |flag|
## convert to string
flag = flag.to_s
puts case flag
when 'modify' then fetch_spreadsheet_data(file_path)
end
end
end
## loop, wait for events from inotify
notifier.process
end

Data driven testing with ruby testunit

I have a very basic problem for which I am not able to find any solution.
So I am using Watir Webdriver with testunit to test my web application.
I have a test method which I would want to run against multiple set of test-data.
While I can surely use old loop tricks to run it but that would show as only 1 test ran which is not what I want.
I know in testng we have #dataprovider, I am looking for something similar in testunit.
Any help!!
Here is what I have so far:
[1,2].each do |p|
define_method :"test_that_#{p}_is_logged_in" do
# code to log in
end
end
This works fine. But my problem is how and where would I create data against which I can loop in. I am reading my data from excel say I have a list of hash which I get from excel something like
[{:name =>abc,:password => test},{:name =>def,:password => test}]
Current Code status:
class RubyTest < Test::Unit::TestCase
def setup
#excel_array = util.get_excel_map //This gives me an array of hash from excel
end
#excel_array.each do |p|
define_method :"test_that_#{p}_is_logged_in" do
//Code to check login
end
end
I am struggling to run the loop. I get an error saying "undefined method `each' for nil:NilClass (NoMethodError)" on class declaration line

You are wanting to do something like this:
require 'minitest/autorun'
describe 'Login' do
5.times do |number|
it "must allow user#{number} to login" do
assert true # replace this assert with your real code and validation
end
end
end
Of course, I am mixing spec and test/unit assert wording here, but in any case, where the assert is, you would place the assertion/expectation.
As this code stands, it will pass 5 times, and if you were to report in story form, it would be change by the user number for the appropriate test.
Where to get the data from, that is the part of the code that is missing, where you have yet to try and get errors.

Using SitePrism with Rspec and Capybara feature specs

I recently discovered SitePrism via the rubyweekly email.
It looks amazing. I can see its going to be the future.
The examples I have seen are mostly for cucumber steps.
I am trying to figure out how one would go about using SitePrism with rspec.
Assuming #home_page for the home page, and #login_page for the login_page
I can understand that
#home_page.load # => visit #home.expanded_url
however, the part I am not sure about, is if I think click on for example the "login" link, and the browser in Capybara goes to the login page - how I can then access an instance of the login page, without loading it.
#home_page = HomePage.new
#home_page.load
#home.login_link.click
# Here I know the login page should be loaded, so I can perhaps do
#login_page = LoginPage.new
#login_page.should be_displayed
#login_page.email_field.set("some#email.com")
#login_page.password_field.set("password")
#login_page.submit_button.click
etc...
That seems like it might work. So, when you know you are supposed to be on a specific page, you create an instance of that page, and somehow the capybara "page" context, as in page.find("a[href='/sessions/new']") is transferred to the last SitePrism object?
I just feel like I am missing something here.
I'll play around and see what I can figure out - just figured I might be missing something.
I am looking through the source, but if anyone has figured this out... feel free to share :)

What you've assumed turns out to be exactly how SitePrism works :) Though you may want to check the epilogue of the readme that explains how to save yourself from having to instantiate page objects all over your test code. Here's an example:
# our pages
class Home < SitePrism::Page
#...
end
class SearchResults < SitePrism::Page
#...
end
# here's the app class that represents our entire site:
class App
def home
Home.new
end
def results_page
SearchResults.new
end
end
# and here's how to use it:
#first line of the test...
#app = App.new
#app.home.load
#app.home.search_field.set "sausages"
#app.home.search_button.click
#app.results_page.should be_displayed

Scrubyt gives 404 Error when clicking link using _details method

This might be a similar problem to my earlier two questions - see here and here but I'm trying to use the _detail command to automatically click the link so I can scrape the details page for each individual event.
The code I'm using is:
require 'rubygems'
require 'scrubyt'
nuffield_data = Scrubyt::Extractor.define do
fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'
event do
title 'The Coast of Mayo'
link_url
event_detail do
dates "1-4 October"
times "7:30pm"
end
end
next_page "Next Page", :limit => 20
end
nuffield_data.to_xml.write($stdout,1)
Is there any way to print out the URL that using the event_detail is trying to access? The error doesn't seem to give me the URL that gave the 404.
Update: I think the link may be a relative link - could this be causing problems? Any ideas how to deal with that?

I had the same issue with relative links and fixed it like this... you have to set the :resolve param to the correct base url
event do
title 'The Coast of Mayo'
link_url
event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
dates "1-4 October"
times "7:30pm"
end
end

sudo gem install ruby-debug
This will give you access to a nice ruby debugger, start the debugger by altering your script:
require 'rubygems'
require 'ruby-debug'
Debugger.start
Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)
require 'scrubyt'
nuffield_data = Scrubyt::Extractor.define do
fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'
event do
title 'The Coast of Mayo'
link_url
event_detail do
dates "1-4 October"
times "7:30pm"
end
end
next_page "Next Page", :limit => 2
end
nuffield_data.to_xml.write($stdout,1)
Then find out where scrubyt is throwing an exception - in this case:
/Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'
Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:
if ##current_doc_protocol == 'file'
##hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(##current_doc_url).read))
else
##hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(##mechanize_doc.body))
store_host_name(self.get_current_doc_url) # in case we're on a new host
end
rescue
debugger
self # the self is here because debugger doesn't like being at the end of a method
end
Now run the script again and you should be dropped into a debugger when the exception is raised. Just try typing this a the debug prompt to see what the offending URL is:
##current_doc_url
You can also add a debugger statement anywhere in that method if you want to check what is going on - for example you may want to add one between line 51 and 52 of this method to check how the url that is being called changes and why.
This is basically how I figured out the answer to your previous questions.
Good luck.

Sorry I have no idea why this would be nil - every time I have run this it returns a url - the method self.fetch requires a URL which you should be able to access as the local variable doc_url. If this returns nil also may you should post the code where you have included the debugger call.

I've tried to access doc_url but that seems to also return nil. When I have access to my server (later in the day) I'll post the code with the debugging bit in it.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

anemone Ruby with focus_crawl - ruby

So I found the :depth_limit param and that will be ok, but I would rather limit it to # of links.

Related

poltergeist doesn't seem to wait for phantomjs to load in capybara

Use with Excel data to display on Dashing Dashboard?

Data driven testing with ruby testunit

Using SitePrism with Rspec and Capybara feature specs

Scrubyt gives 404 Error when clicking link using _details method

Categories

Resources