Mechanize scraping Google URLs - Ruby

I have a program that searches Google using a keyword or keywords passed as a parameter when the program is run:
example: pull_sites.rb "testing"
returns these sites >>>
https://en.wikipedia.org/wiki/Software_testing
http://en.wikipedia.org/wiki/Test_automation
http://www.istqb.org/about-istqb.html
http://softwaretestingfundamentals.com/test-plan/
https://en.wikipedia.org/wiki/Software_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:9qU2GDLzZzEJ:https://en.wikipedia.org/wiki/Software_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Test_strategy
https://en.wikipedia.org/wiki/Category:Software_testing
https://en.wikipedia.org/wiki/Test_automation
https://en.wikipedia.org/wiki/Portal:Software_testing
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ:https://en.wikipedia.org/wiki/Test%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Unit_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:G9V8uRLkPjIJ:https://en.wikipedia.org/wiki/Unit_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://testing.byu.edu/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:d9bGrCHr9fsJ:https://testing.byu.edu/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J:https://www.test.com/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://ddce.utexas.edu/disability/using-testing-accommodations/
http://blogs.vmware.com/virtualblocks/2015/07/06/vsan-vs-nutanix-head-to-head-performance-testing-part-4-exchange/
http://www.networkforgood.com/nonprofitblog/testing-101-4-steps-optimizing-your-fundraising-approach/
http://www.auslea.com/software-testing-training.html
http://academy.littletonpublicschools.net/Default.aspx%3Ftabid%3D12807%26articleType%3DArticleView%26articleId%3D2400
https://golang.org/pkg/testing/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:EALG7Jlm9eoJ:https://golang.org/pkg/testing/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J:http://www.speedtest.net/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydoJ:https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:pAzlNJl3YY4J:http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk
It works as expected but only scrapes the first page of Google results. Is it possible to search, say, pages 1-5?
Here's the source of the scrape:
def get_urls
  puts "Searching...".green
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}" # SEARCH is the parameter given when the program is run
  page = agent.submit(google_form, google_form.buttons.first)
  page.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      url = str_list[1]
      File.open("links.txt", "a+") { |s| s.puts(url) }
    end
  end
end

OK, if you are using Google Chrome or Firefox, open up the developer tools. They will help you identify the links you want to click programmatically. When you do a Google search and scroll to the bottom, you will see the page-number links. Using the developer tools, identify what class or id Google assigns to these page-number links, then use Mechanize's click method to follow them. For example, if the link is labelled "next" you can use something as simple as:
page2 = page1.link_with(:text => "next").click
I'm answering from my phone, so it may save you time to search for "click a link with Mechanize" for more details.
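Building on that idea, here's a minimal sketch of how the get_urls method from the question could keep following a pager link for five pages. The link text "Next" is an assumption; check the actual class, id, or text Google uses with your developer tools first.
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = ARGV[0] # the search term passed on the command line
page = agent.submit(google_form, google_form.buttons.first)

5.times do
  # Collect the result URLs from the current page, as in the original method
  page.links.each do |link|
    next unless link.href.to_s =~ /url.q/
    url = link.href.to_s.split(%r{=|&})[1]
    File.open("links.txt", "a+") { |s| s.puts(url) }
  end

  # Follow the pager link; "Next" is an assumption about the link text
  next_link = page.link_with(:text => 'Next')
  break unless next_link
  page = next_link.click
end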

That's a GET form, so it's much easier to just make the request yourself:
https://www.google.com/search?q=foo
https://www.google.com/search?q=foo&start=10
https://www.google.com/search?q=foo&start=20
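For example, here's a minimal sketch of the question's get_urls rewritten to fetch pages 1-5 directly via the start parameter (each result page holds 10 links, so start runs 0, 10, 20, 30, 40). The link-filtering logic is carried over from the question; CGI.escape is used here just to make the query URL-safe:
require 'mechanize'
require 'cgi'

def get_urls(search)
  agent = Mechanize.new
  (0..4).each do |page_number|
    # start=0 is the first page of results, start=10 the second, and so on
    page = agent.get("https://www.google.com/search?q=#{CGI.escape(search)}&start=#{page_number * 10}")
    page.links.each do |link|
      next unless link.href.to_s =~ /url.q/
      url = link.href.to_s.split(%r{=|&})[1]
      File.open("links.txt", "a+") { |s| s.puts(url) }
    end
  end
end

get_urls(ARGV[0]) # e.g. ruby pull_sites.rb "testing"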

Related

How to scrape a website with a search button

I have this website: codigos. If you look at it, it has a selection field on the left and a go button on the right, and I need to scrape some of the items on the left.
How can I tell Mechanize in Ruby to access that selection field, run the search, and scrape the results?
I've seen examples with login forms, but I don't know whether they really suit this case.
The <select> tag is contained within a <form> tag, so you need to locate the form and then you can set the option by passing the name of the select list and specifying the appropriate option:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://comext.aduana.cl:7001/codigos/')
form = page.forms.first
form["wlw-select_key:{actionForm.opcion}"] = "Aduana"
result_page = form.submit
result_page.uri #=> http://comext.aduana.cl:7001/codigos/buscar.do;jsessionid=2hGwYfzD76WKGfFbXJvmS2yq4K19VnZycJfH8hJMTzRFhln4pTy2!1794372623!-1405983655!8080!-1
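The result page is then an ordinary Mechanize page, so you can scrape it like any other. A rough sketch; the 'table tr td' selector is only a placeholder, since the real markup of the results table would need to be inspected:
result_page.search('table tr td').each do |cell|
  # 'table tr td' is a hypothetical selector -- replace it with whatever
  # element actually wraps the codes you want on the results page
  puts cell.text.strip
end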

Ruby Watir -- Trying to loop through links in cnn.com and click each one of them

I have created this method to loop through the links in a certain div on the web site. The purpose of the method is to collect the links, insert them into an array, and then click each one of them.
require 'watir-webdriver'
require 'watir-webdriver/wait'

site = Watir::Browser.new :chrome
url = "http://www.cnn.com/"
site.goto url
box = Array.new
container = site.div(class: "column zn__column--idx-1")
wanted_links = container.links
box << wanted_links
wanted_links.each do |link|
  link.click
  site.goto url
  site.div(id: "nav__plain-header").wait_until_present
end
site.close
So far it seems like I am only able to click the first link; then I get an error message stating this:
unable to locate element, using {:element=>#<Selenium::WebDriver::Element:0x634e0a5400fdfade id="0.06177683611003881-3">} (Watir::Exception::UnknownObjectException)
I am very new to Ruby. I appreciate any help. Thank you.
The problem is that once you navigate to another page, all of the element references (ie those in wanted_links) become stale. Even if you return to the same page, Watir/Selenium does not know it is the same page and does not know where the stored elements are.
If you are going to navigate away, you need to collect all of the data you need first. In this case, you just need the href values.
# Collect the href of each link
wanted_links = container.links.map(&:href)

# You have each page URL, so you can navigate directly without returning to the homepage
wanted_links.each do |link|
  site.goto link
end
In the event that the links do not directly navigate to a page (eg they execute JavaScript when clicked), you will need to collect enough data to re-locate the elements later. What you use as the locator will depend on what is known to be static/unique. As an example, I will assume that the link text is a good locator.
# Collect the text of each link
wanted_links = container.links.map(&:text)

# Iterate through the links, re-locating each one by its text before clicking
wanted_links.each do |link_text|
  container = site.div(class: "column zn__column--idx-1")
  container.link(text: link_text).click
  site.back
end

Ruby Mechanize: Programmatically Clicking a Link Without Knowing the Name of the Link

I am writing a Ruby script to search the web. Here is the code:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://www.example.com/')
example_page = page.link_with(:text => 'example').click
puts example_page.body
The code above works alright. The text 'example' (:text => 'example') has to be a link on the page for the code to work correctly. The problem, however, is that when I do a web search (Bing, Yahoo, Google, etc.), hundreds of links show up. How can I programmatically click a link without knowing its exact name? I want to be able to click a link if its name partly (or fully) matches a text I specify, or if it has a certain URL. Any help would be appreciated.
Mechanize's link matchers accept regular expressions:
page.link_with(text: /foo/).click
page.link_with(href: /foo/).click
Here are the Mechanize criteria that generally work for links and forms:
name: name_matcher
id: id_matcher
class: class_matcher
search: search_expression
xpath: xpath_expression
css: css_expression
action: action_matcher
...
If you're curious, here's the Mechanize ElementMatcher code
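To illustrate those criteria, here's a small sketch that matches links by href pattern instead of exact text; the example.com URL and the /download/ pattern are just placeholders, not from the question:
require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://www.example.com/')

# Click the first link whose href matches the pattern, if one exists
link = page.link_with(href: /download/)
link.click if link

# Or collect every matching link instead of clicking just one
page.links_with(href: /download/).each { |l| puts l.href }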

Clicking only on links within a specific menu?

I am trying to scrape a page and need to click on some links within a menu. If I use the search method, I am then stuck with a Nokogiri object, and therefore cannot use the click method.
agent.page.search('.right-menu').links_with(href: /^\/blabla\//).each do |link|
  region = link.click
end
The above tells me that links_with is not defined. How can I select links from a specific menu? Is there a way I can convert the object back into a Mechanize object?
You can try something like this:
agent.page.search('.youarehere > a').each do |a|
  # Build a Mechanize link from the Nokogiri node so it can be clicked
  link = Mechanize::Page::Link.new(a, agent, agent.page)
  region = link.click
end
Not the cleanest way to do it, I guess, but Mechanize does almost the same thing in its own source code: http://mechanize.rubyforge.org/Mechanize/Page.html#method-i-links
It would be a nice addition to the library, though, instead of having to go through this.

Stumped on clicking a link with Nokogiri and Mechanize

Perhaps I'm doing it wrong, or there's another, more efficient way. Here is my problem:
First, using Nokogiri, I open an HTML document and use CSS selectors to traverse it until I find the link I need to click.
Once I have the link, how do I use Mechanize to click it? According to the documentation, the click method accepts either a string or a Mechanize::Page::Link object.
I cannot use a string, since there could be hundreds of links with the same text; I only want Mechanize to click the link that Nokogiri found.
Any ideas?
After you have found the link node you need, you can create the Mechanize::Page::Link object manually, and click it afterwards:
agent = Mechanize.new
page = agent.get "http://google.com"
node = page.search(".//p[@class='posted']").first
Mechanize::Page::Link.new(node, agent, page).click
An easier way than @binarycode's option:
agent = Mechanize.new
page = agent.get "http://google.com"
page.link_with(:class => 'posted').click
It's simple: you don't need to use Mechanize's link_with().click.
You can just get the link's href and update your page variable.
Mechanize saves the current working site internally, so it is smart enough to follow local links.
Ex.:
agent = Mechanize.new
page = agent.get "http://somesite.com"
next_page_link = page.search('your exotic selectors here').first rescue nil # Nokogiri node
next_page_href = next_page_link['href'] rescue nil # '/local/link/file.html'
page = agent.get(next_page_href) if next_page_href # goes to 'http://somesite.com/local/link/file.html'
