I use Mechanize to loop through a table, which is paginated.
I have a problem with a form that holds multiple submit inputs. The input tags are used for pagination and are generated dynamically. When I loop through the pages I need to scrape, I have to pick the right input, since only one of them will take me to the “next page”. The right tag can be identified by attributes such as name, class, or value. My problem, though, is that I can’t figure out how to tell Mechanize which one to use.
I tried this:
require 'mechanize'
require 'yaml'
url = "http://www.somewhere.com"
agent = Mechanize.new
page = agent.get(url)
loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(page.form_with(:action => /.*/).submits[3])
  else
    break
  end
end
I adapted this from this post: http://rubyforge.org/pipermail/mechanize-users/2008-November/000314.html. But as mentioned, the number of input tags changes, so picking a hardcoded index into submits is not a good idea.
What I would like to know is if there is a way like this:
loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(:name => /the_right_submit_button/)
  else
    break
  end
end
or something like that, maybe with a CSS or XPath selector.
I usually use form.button_with to select the right button to click:
form = results_page.forms[0]
results_page = form.submit(form.button_with(:name=>'ctl00$ContentBody$ResultsPager$NextButton'))
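Applied to the pagination loop from the question, that approach could look like the sketch below. The :name regex /next/i and the form criteria are assumptions; match them against the attributes the real “next page” button actually carries:

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://www.somewhere.com")

loop do
  puts "some content from site using nokogiri"
  form = page.form_with(:action => /.*/)
  break unless form
  # button_with accepts the same criteria as link_with, including
  # regexes, so no hardcoded index into the dynamic submits is needed.
  next_button = form.button_with(:name => /next/i)
  break unless next_button
  page = form.submit(next_button)
end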
I am struggling with mechanize. I wish to "click" on a set of links which can only be identified by their position (all links within div#content) or their href.
I have tried both of these identification methods above without success.
From the documentation, I could not figure out how to return a collection of links (for clicking) based on their position in the DOM, rather than by attributes directly on the link.
Secondly, the documentation suggests you can use :href to match a partial href,
page = agent.get('http://foo.com/').links_with(:href => "/something")
but the only way I can get it to return a link is by passing a fully qualified URL, e.g.
page = agent.get('http://foo.com/').links_with(:href => "http://foo.com/something/a")
This is not very useful if I want to return a collection of links with hrefs like:
http://foo.com/something/a
http://foo.com/something/b
http://foo.com/something/c
etc...
Am I doing something wrong? Do I have unrealistic expectations?
Part II
The value you pass to :href has to be an exact match by default. So passing "/something" would only match a link whose href is exactly "/something", and not "/something/a".
What you want to do is to pass in a regex so that it will match a substring within the href field. Like so:
page = agent.get('http://foo.com/').links_with(:href => %r{/something/})
edit:
Part I
In order to get it to select links only within a given element, add a Nokogiri-style search method into your chain. Like this:
page = agent.get('http://foo.com/').search("div#content").links_with(:href => %r{/something/})
Ok, that doesn't work, because after you do page = agent.get('http://foo.com/').search("div#content") you get a Nokogiri object back instead of a Mechanize one, so links_with won't work. However, you will be able to extract the links from the Nokogiri object using the css method. I would suggest something like:
page = agent.get('http://foo.com/').search("div#content").css("a")
If that doesn't work, I'd suggest checking out http://nokogiri.org/tutorials
The nth link:
page.links[n-1]
The first 5 links:
page.links[0..4]
links with 'something' in the href:
page.links_with :href => /something/
You can get Mechanize links using Nokogiri nodes. See the source code of the links() method.
# File lib/mechanize/page.rb, line 352
def links
  @links ||= %w{ a area }.map do |tag|
    search(tag).map do |node|
      Link.new(node, @mech, self)
    end
  end.flatten
end
So that means:
the_links = page.search("valid_selector").map do |node|
  Mechanize::Page::Link.new(node, agent, page)
end
This will give you the useful href, text and uri methods.
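For instance, applied to the div#content case from the earlier question (a sketch; the selector and URL come from that question):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://foo.com/')

# Build Mechanize links from only the anchors inside div#content.
content_links = page.search("div#content a").map do |node|
  Mechanize::Page::Link.new(node, agent, page)
end

content_links.each { |link| puts "#{link.text} -> #{link.href}" }

# They behave like ordinary Mechanize links, so they can be clicked:
next_page = content_links.first.click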
I'm trying to automate the process of searching for alternative telephone numbers using SayNoTo0870. Every time one searches for an alternative number or name, it brings up the '/companysearch.php' page.
This results page has no unique URL, so in my mind you can't just link to it directly.
What I'm hoping to do is use the code below to automate opening a browser, searching for a name/number, stripping out the HTML, and then providing the top 5 results. I've got the automation part down, but when trying to save the webpage using Hpricot, it only brings up the 'Sorry, nothing can be found' page, because I can't link directly to the search result page.
Here is my code thus far:
(I've removed comments to shorten it)
require 'rubygems'
require 'watir'
require 'hpricot'
require 'open-uri'
class OH870

  def searchName(name)
    browser = Watir::Browser.new
    browser.goto 'http://www.saynoto0870.com/search.php'
    browser.text_field(:name => 'search_name').set name
    browser.button(:name => 'submit').click
  end

  def searchNumber(number)
    browser = Watir::Browser.new
    browser.goto 'http://www.saynoto0870.com/search.php'
    browser.text_field(:name => 'number').set number
    browser.button(:name => 'submit').click
  end

  def loadNew(website)
    doc = Hpricot(open(website))
    puts(doc)
  end

  def strip_tags(website)
    stripped = website.gsub( %r{</?[^>]+?>}, '' )
    puts stripped
  end

end # class
class Main < OH870
  puts "What is the name of the place you want?"
  website = 'http://www.saynoto0870.com/companysearch.php'
  question = gets.chomp
  whichNumber = OH870.new
  whichNumber.searchName(question)
  #result = OH870.new
  #withoutTags = website.strip_tags
  #result.loadNew(withoutTags)
end
Now I'm not sure whether there's a way of asking Watir to follow through to the companysearch.php page and dump the results, without having to pass this page as a variable.
I wonder if anyone has any suggestions here?
With WATIR, minus the extraneous libraries, here's all it takes to accomplish what you've described (using the 'name' test case only). I've pulled it out of the function format since you already know how to do that, and this will be a clearer test case path.
require 'watir'
@browser = Watir::Browser.new :firefox    # open a browser called @browser
@browser.goto "http://(your search page here)"    # go to the search page
@browser.text_field(:name => 'name').value = "Awesome"    # fill in the 'name' field
@browser.button(:name => 'submit').click    # submit the form
If all goes well, we should now be looking at the search results. WATIR already knows it's on a new page - we don't have to specify a URL. In the case that the results are in a frame, we do need to access that frame before we can view its content. Let's pretend they're in a DIV element with an ID of "search_results":
results = @browser.div(:id => "search_results").text
resultsFrame = @browser.frame(:index => 1)    # in the case of a frame
results = resultsFrame.div(:id => "search_results").text
As you can see, you do not need to save the entire page to parse the results. They could be in table cells, they could be in a different div per line, or a new frame. All are easily accessible with WATIR to be stored in a variable, array, or immediately written to the console or log file.
@results = Array.new    # create an Array to store our results
@browser.divs.each do |div|    # for each div element on the page
  if div.id == "search_results"    # if the div ID equals "search_results"
    @results << div.text    # add it to our array named @results
  end
end
Now, if you just wanted the top 5 there are many ways to access them.
@results[0]    # first element
@results[0..4]    # first 5 elements
I'd also suggest you look into a few programming principles like DRY (Don't Repeat Yourself). Where your function definitions share code, like opening the browser and visiting the same URL, you can consolidate that:
def search(how, what)
  @browser = Watir::Browser.new :firefox
  @browser.goto "(that search url again)"
  @browser.text_field(:name => how).value = what
  # etc...
end

search("name", "Hilton")
search("number", "555555")
Since we know that the two available text_field names are "name" and "number", and those make good logical sense as a 'how', we can parameterize them and use a single function for both the Search by Name and Search by Number test cases. This is more efficient, as long as the test cases remain similar enough to be shared.
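A fuller sketch of that consolidated function might look like this; the field names 'search_name' and 'number' come from the question's code, while the results div and the return value are assumptions for illustration:

require 'watir'

# One search method for both test cases: 'how' is the text_field
# name ('search_name' or 'number'), 'what' is the value to search for.
def search(how, what)
  @browser ||= Watir::Browser.new :firefox
  @browser.goto 'http://www.saynoto0870.com/search.php'
  @browser.text_field(:name => how).set what
  @browser.button(:name => 'submit').click
  # Hypothetical: read the results element once the new page has loaded.
  @browser.div(:id => "search_results").text
end

puts search("search_name", "Hilton")
puts search("number", "08701234567")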
In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? Like:
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could do it like this:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) #Get the starting page
loop do
  # What you want to do on the page - e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # ...or store it somewhere
  if link = page.link_with(:text => "Continue") # As long as there is still a next-page link...
    page = link.click
  else # If no link left, break out of the loop
    break
  end
end
I know this is a very simple question, but I've been stuck for an hour and I just can't understand how this works.
I need to scrape some stuff from my school's library, so I need to insert 'CE' into a text field and then click on a link with the text 'Clasificación'. The resulting output is what I am going to work with. So here is my code.
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
url = 'http://biblio02.eld.edu.mx/janium-bin/busqueda_rapida.pl?Id=20110720161008#'
searchStr = 'CE'
agent = Mechanize.new
page = agent.get(url)
searchForm = page.form_with(:method => 'post')
searchForm['buscar'] = searchStr
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm,clasificacionLink)
When I run it, it gives me this error
janium.rb:31: undefined method `[]=' for nil:NilClass (NoMethodError)
Thanks!
I think your problem is actually on line 13, not 31, and I'll even tell you why I think that. Not only does your script not have 31 lines but, from the fine manual:
form_with(criteria)
Find a single form matching criteria.
There are several forms on that page that have method="post". Apparently Mechanize returns nil when it can't match the form_with criteria exactly, including the "single" part mentioned in the documentation; so if your criteria match more than one form, form_with returns nil instead of choosing one of the options, and you end up trying to do this:
nil['buscar'] = searchStr
But nil doesn't have a []= method so you get your NoMethodError.
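When form_with comes back nil like this, a quick way to find criteria that match exactly one form is to list what is actually on the page; a minimal sketch (url is the variable from the question's script):

# Print each form's name, method and action so you can pick
# criteria that identify exactly one of them.
agent = Mechanize.new
page = agent.get(url)
page.forms.each_with_index do |form, i|
  puts "#{i}: name=#{form.name.inspect} method=#{form.method} action=#{form.action}"
end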
If you use this:
searchForm = page.form_with(:name => 'forma')
you'll get past the first part as there is exactly one form with name="forma" on that page. Then you'll have trouble with this:
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm, clasificacionLink)
as Mechanize doesn't know what to do with JavaScript (at least mine doesn't). But if you use just this:
page = agent.submit(searchForm)
you'll get a page and then you can continue building and debugging your script.
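If you really need whatever set_index_and_submit('51') does, one hypothetical workaround is to replicate its effect by hand, assuming the JavaScript only stores the index in a hidden field before posting the form. The field name 'index' below is a guess; inspect the page source for the real one:

searchForm = page.form_with(:name => 'forma')
searchForm['buscar'] = searchStr
# Assumed hidden field that set_index_and_submit() would populate;
# check the form's HTML to find its actual name.
searchForm['index'] = '51'
page = agent.submit(searchForm)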
mu's answer sounds reasonable. I am not sure if this is strictly necessary, but you might also try to put brackets around searchStr, passing the value as an array.
searchForm['buscar'] = [searchStr]
I'm trying to use Scrubyt to get the details from this page: http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?section=events. I've managed to get the titles and detail URLs from the list, but I can't use next_page to get the scraper to go to the next page. I assume that's because I'm not using the correct pattern for the next-page link. I tried the string "Next Page", and I've also tried the XPath. Any other ideas?
The code is below:
require 'rubygems'
require 'scrubyt'
nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?section=events'
  event do
    title 'The Coast of Mayo'
    #url "href", :type => :attribute
    link_url
  end
  next_page "Next Page", :limit => 2
end
nuffield_data.to_xml.write($stdout,1)
Try this with a slightly different URL:
fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'
scrubyt seems to be having issues with the "?section=events" query string on the end of the URL.
When it looks for the next page it is trying to return this URL:
http://www.nuffieldtheatre.co.uk/cn/events/?pageNum_rsSearch=1&totalRows_rsSearch=39&section=events
instead of:
http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?pageNum_rsSearch=1&totalRows_rsSearch=39&section=events
Removing the query string on the end of the URL seems to fix this - you might want to file this as a bug.