Ruby - Mechanize: Select link by classname and other questions - ruby

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
page_links << ll
end
puts page_links.size
This works, but page_links contains more than just the search results. It also includes Google's own links such as Login, Pictures, ...
The result links have the style class "l". Is it possible to select only the links with class == "l"? How do I achieve this?
Is it possible to modify the "agent alias"? If I own a website that uses Google Analytics or something similar, which browser client will I see in GA when I visit my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.

In cases like yours I use Nokogiri's DOM search.
Here is your code, slightly rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
# 'h3.r > a.l' may be a more precise selector here
page.parser.css("a.l").each do |ll|
# page.parser is a Nokogiri::HTML::Document
page_links << ll
puts ll.text + "=>" + ll["href"]
end
puts page_links.size
This article is probably a good place to start:
getting-started-with-nokogiri
By the way, the samples in the article also deal with Google search ;)

You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
cls = ll.attributes.attributes['class']
page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link. That is why ll.attributes.attributes is needed to get at the actual class, and why there is a nil check before comparing the value to 'l'.
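A caveat on the exact comparison above: an HTML class attribute can hold several space-separated names, in which case an equality test against 'l' would miss a link marked class="l visited". The sketch below shows a token-based check in plain Ruby. The helper name has_class? is hypothetical, and the string (or nil) it receives stands in for whatever value the link's class attribute yields.

```ruby
# Hypothetical helper: true when `wanted` is one of the space-separated
# names in a class attribute value. `class_attr` may be nil when the
# link has no class attribute at all.
def has_class?(class_attr, wanted)
  class_attr.to_s.split.include?(wanted)
end

puts has_class?('l', 'l')          # exact match
puts has_class?('l visited', 'l')  # one of several classes
puts has_class?(nil, 'l')          # no class attribute
puts has_class?('link', 'l')       # 'l' only as a substring
```

Splitting on whitespace avoids both the nil check and false positives from class names that merely contain the letter.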
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method, which returns a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).

I believe the selector you are looking for is :dom_id, e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')

Related

Why is this ruby code returning a blank page instead of filling it up with user names?

I want to collect the names of users in a particular group, called Nature, in the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'
def getInitUser()
agent1 = Mechanize.new
number = 0
while number<=500
address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
logfile2 = File.new("Fotolog/Users.csv","a")
tryConut = 0
begin
page = agent1.get(address)
rescue
tryConut=tryConut+1
if tryConut<5
retry
end
return
end
arrayUsers= []
# search for the users
page.search("a[class=img_border_radius").map do |opt|
link = opt.attributes['href'].text
link = link.gsub("http://www.fotolog.com/","").gsub("/","")
arrayUsers << link
logfile2.print("#{link}\n")
end
number = number+100
end
return arrayUsers
end
arrayUsers = getInitUser()
arrayUsers.each do |user|
getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using. But from the inspect element, it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
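Without running the code against the site it is hard to say for certain what is going wrong, but one thing that stands out is the address string: single-quoted Ruby strings do not interpolate #{...}, so every request would go to the same literal URL. A minimal sketch of the difference follows. The method name participants_url is hypothetical, and the URL is the one from the question with the doubled http:// removed.

```ruby
# Hypothetical helper that builds one page URL per offset.
def participants_url(number)
  # Double quotes are required for #{...} to be interpolated.
  "http://www.fotolog.com/nature/participants/#{number}/"
end

number = 100
puts 'participants/#{number}/'   # single quotes: printed literally
puts participants_url(number)    # double quotes: number is substituted
```

The CSS selector in the question is also missing its closing bracket, which is worth checking separately.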

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have exactly the same text, "(hiring)", and then write a list of those links to a file, 'jobs.html'. (If it is a violation to publish these websites I will quickly take down the direct URL. I thought it might be useful as a reference for what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
<ol>
<li>#waywire</li>
<li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
<li>Adafruit Industries</li>
<li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
etc...
What I have tried:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select{|link| link.text == "(hiring)"}
results = hire_links.each{|link| puts link['href']}
begin
file = File.open("./jobs.html", "w")
file.write("#{results}")
rescue IOError => e
ensure
file.close unless file == nil
end
puts hire_links
end
find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file but it is in XML format? Not sure how to target just the value and create a link from that. Not sure where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
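The difference between each and map is easy to check without Nokogiri. In the sketch below, FakeLink is a hypothetical stand-in for a link element (a Struct happens to support link['href'] lookups), and the URLs are invented. Only the return values of each and map matter here.

```ruby
# Hypothetical stand-in for a parsed <a> element; Struct#[] accepts
# the member name as a string, so link['href'] works.
FakeLink = Struct.new(:href)

def hrefs_with_each(links)
  # each returns its receiver, so the block's values are discarded
  links.each { |link| link['href'] }
end

def hrefs_with_map(links)
  # map returns a new array built from the block's values
  links.map { |link| link['href'] }
end

links = [FakeLink.new('http://example.com/jobs'),
         FakeLink.new('http://example.org/careers')]

p hrefs_with_each(links).first.class  # still a link object
puts hrefs_with_map(links).join("\n") # one URL per line
```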
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(URI.open('http://nytm.org/made-in-nyc')) # URI.open (Ruby 2.5+); a bare open no longer fetches URLs on Ruby 3.0+
hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
results = hire_links.map { |link| link['href'] }
File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like:
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/) # parentheses escaped so they match literally
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
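One detail about regular expressions used as text criteria: unescaped parentheses form a capture group rather than matching literal brackets, so a pattern written as /(hiring)/ matches any text that merely contains the word hiring. A plain-Ruby sketch of the difference is below; the constant names, the matches? helper and the sample strings are all made up for illustration.

```ruby
LOOSE  = /(hiring)/                               # group: "hiring" anywhere
STRICT = /\A#{Regexp.escape('(hiring)')}\z/       # literal "(hiring)" only

# Hypothetical helper returning a plain true/false.
def matches?(pattern, text)
  !!(pattern =~ text)
end

['(hiring)', 'We are hiring soon', 'Adafruit Industries'].each do |text|
  puts "#{text.inspect}: loose=#{matches?(LOOSE, text)} strict=#{matches?(STRICT, text)}"
end
```

Regexp.escape saves escaping each metacharacter by hand when the text to match is known in advance.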

Ruby script with mechanize and nokogiri

So basically I asked a question earlier and got an answer that solved it, but I now realise I need more help: I have spent the last few hours trying to fix this but keep getting a "nil:NilClass" error. Basically I need to go through every show listed on this site (through each letter, and through each page for that letter) and get the following:
1. The show's title (this part I have done)
2. Then either copy the page URL for each show and add "/episodes/" to the end of it, or click the show, then the episode tab, and copy that URL.
This is what I have so far:
require 'mechanize'
shows = Array.new
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
agent.get letter_link[:href]
agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
agent.get next_page_link[:href]
agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
end
end
require 'pp'
pp shows
So an end result would look something like the following:
Title: Game of Thrones
URL: http://www.tv.com/shows/game-of-thrones/episodes/
I have tried everything (even writing it from scratch) but just can't seem to add the extra parts, so I was hoping someone here may be able to help me do so. Thanks
How about this:
require 'mechanize'
shows = {}
base_uri = "http://www.tv.com/"
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
agent.get letter_link[:href]
letter = letter_link.text.upcase
# + builds a new string each time; << would keep appending to base_uri itself
shows[letter] = agent.page.search('//li[@class="show"]/a').map{ |show_link| {show_link.text => base_uri + show_link[:href].to_s + 'episodes/'} }
while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
agent.get next_page_link[:href]
shows[letter] << agent.page.search('//li[@class="show"]/a').map{ |show_link| {show_link.text => base_uri + show_link[:href].to_s + 'episodes/'} }
end
shows[letter].flatten!
end
puts shows
This will create the following structure: Hash[letter] => Array[{ShowName => LinkToEpisodes}]
Example:
{"A"=>[{"A Show"=>"http://www.tv.com/shows/a-show/episodes/"},{"Another Show"=>"http://www.tv.com/shows/another-show/episodes/"},...],"B"=>[B SHOWS....],....}
Hope this helps.
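To get from that structure to the Title/URL layout asked for in the question, the nested hash can be walked as in this sketch. The sample data is invented to mirror the example above, and format_shows is a hypothetical helper name.

```ruby
# Turn Hash[letter] => Array[{name => url}] into "Title/URL" text blocks.
def format_shows(shows)
  shows.flat_map do |_letter, entries|
    entries.flat_map do |entry|
      entry.map { |title, url| "Title: #{title}\nURL: #{url}" }
    end
  end
end

sample = {
  'A' => [{ 'A Show' => 'http://www.tv.com/shows/a-show/episodes/' }],
  'G' => [{ 'Game of Thrones' => 'http://www.tv.com/shows/game-of-thrones/episodes/' }]
}

# One blank line between shows
puts format_shows(sample).join("\n\n")
```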

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is the example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
1. The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq")) # URI.open (Ruby 2.5+); a bare open no longer fetches URLs on Ruby 3.0+
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
2. The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much more than that.

Stripping out results from a website that doesn't have differing URLs

I'm trying to automate the process of searching for alternative telephone numbers using SayNoTo0870. Every time one searches for an alternate number or name it brings up the '/companysearch.php' page.
Clearly this page has no reference, and in my mind you can't just link to this page.
What I'm hoping to do is use the code below, to automate the opening of a browser, searching of a name/number, stripping out the HTML and then providing the top 5 results. I've got the automation part down, but clearly when trying to save the webpage using Hpricot it only brings up the 'Sorry nothing can be found page' because I can't link directly to the search result page.
Here is my code thus far:
(I've removed comments to shorten it)
require 'rubygems'
require 'watir'
require 'hpricot'
require 'open-uri'
class OH870
def searchName(name)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'search_name').set name
browser.button(:name => 'submit').click
end
def searchNumber(number)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'number').set number
browser.button(:name => 'submit').click
end
def loadNew(website)
doc = Hpricot(open(website))
puts(doc)
end
def strip_tags
stripped = website.gsub( %r{</?[^>]+?>}, '' )
puts stripped
end
end # class
class Main < OH870
puts "What is the name of the place you want?"
website = 'http://www.saynoto0870.com/companysearch.php'
question = gets.chomp
whichNumber = OH870.new
whichNumber.searchName(question)
#result = OH870.new
#withoutTags = website.strip_tags
#result.loadNew(withoutTags)
end
Now I'm not sure whether there's a way of asking Watir to follow through to the companysearch.php page and dump the results without having to pass this page as a variable.
I wonder if anyone has any suggestions here?
With WATIR, minus the extraneous libraries, here's all it takes to accomplish what you've described (using the 'name' test case only). I've pulled it out of the function format since you already know how to do that, and this will be a clearer test case path.
require 'watir'
@browser = Watir::Browser.new :firefox #open a browser called @browser
@browser.goto "http://(your search page here)" #go to the search page
@browser.text_field(:name => 'name').value = "Awesome" #fill in the 'name' field
@browser.button(:name => 'submit').click #submit the form
If all goes well, we should now be looking at the search results. WATIR already knows it's on a new page - we don't have to specify a URL. In the case that the results are in a frame, we do need to access that frame before we can view its content. Let's pretend they're in a DIV element with an ID of "search_results":
results = @browser.div(:id => "search_results").text
resultsFrame = @browser.frame(:index => 1) #in the case of a frame
results = resultsFrame.div(:id => "search_results").text
As you can see, you do not need to save the entire page to parse the results. They could be in table cells, they could be in a different div per line, or a new frame. All are easily accessible with WATIR to be stored in a variable, array, or immediately written to the console or log file.
@results = Array.new #create an Array to store our results
@browser.divs.each do |div| #for each div element on the page
if div.id == "search_results" #if the div ID equals "search_results"
@results << div.text #add it to our array named @results
end
end
Now, if you just wanted the top 5 there are many ways to access them.
@results[0] #first element
@results[0..4] #first 5 elements
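A few equivalent ways to take the first five entries, shown on a plain array. The contents of sample_results are invented and top_results is a hypothetical helper. All three forms return fewer items, rather than raising, when the array holds fewer than five.

```ruby
# Hypothetical helper: the first `count` results, or fewer if the
# array is shorter.
def top_results(results, count = 5)
  results.first(count)
end

sample_results = %w[r1 r2 r3 r4 r5 r6 r7]

p sample_results[0..4]         # range slice
p sample_results.take(5)       # Array#take
p top_results(sample_results)  # Array#first with a count
p top_results(%w[r1 r2])       # shorter array: returns what is there
```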
I'd also suggest you look into a few programming principles like DRY (Don't Repeat Yourself). In your function definitions where you see that they share code, like opening the browser and visiting the same URL - you can consolidate those:
def search(how, what)
@browser = Watir::Browser.new :firefox
@browser.goto "(that search url again)"
@browser.text_field(:name => how).value = what
etc...
end
search("name", "Hilton")
search("number", "555555")
Since we know that the two available text_field names are "name" and "number", and those make good logical sense as a 'how', we can parameterize them and use a single function for both the Search by Name and Search by Number test cases. This is more efficient, as long as the test cases remain similar enough to be shared.
