How to scrape a website with a search button - ruby

I have this website: codigos. If you look at it, it has a selection field on the left and a go button on the right, and I need to scrape some of the items on the left.
But how can I tell Mechanize in Ruby to access that selection field, run the search, and scrape the results?
I've seen examples with login forms, but I don't know whether they really suit this case.

The <select> tag is contained within a <form> tag, so you need to locate the form; then you can set the option by indexing the form with the name of the select list and assigning the appropriate option:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://comext.aduana.cl:7001/codigos/')
form = page.forms.first
form["wlw-select_key:{actionForm.opcion}"] = "Aduana" # set the select list by its name attribute
result_page = form.submit
result_page.uri #=> http://comext.aduana.cl:7001/codigos/buscar.do;jsessionid=2hGwYfzD76WKGfFbXJvmS2yq4K19VnZycJfH8hJMTzRFhln4pTy2!1794372623!-1405983655!8080!-1

Related

Follow post form redirects using Ruby Mechanize

We're trying to follow POST forms that trigger redirects before showing their content, using Ruby Mechanize/Nokogiri. One example is the search form on
http://www.chewtonrose.co.uk/
... if you hit the "search" button in your browser, you get taken to
http://www.chewtonrose.co.uk/AdvancedSearch/tabid/4280/Default.aspx?view=tn
How could we set up Mechanize to return that second URL?
Is Mechanize even the right tool?
Yes, Mechanize is the right tool. I checked; in this case you will need to submit WITH the button:
agent = Mechanize.new
page = agent.get(url)              # the page with the search form
form = page.forms.first            # get the form (forms.first is a guess; pick yours)
button = form.buttons.first        # get the search button
page2 = agent.submit(form, button) # submit using that button
page2.uri # will show your 2nd URL

How to use Mechanize on a page with no form?

I am trying to write a website crawler with Mechanize, and I found that my target website is written in an SPA fashion; although there are a bunch of text fields and buttons, there is no form!
How can I use Mechanize to fill text fields and click buttons outside of forms?
I had the exact same problem you did. I ended up using 'capybara', 'launchy' and 'selenium-webdriver' to do what 'mechanize' would have done in a non-JavaScript environment.
Let's say agent is a Mechanize object and page is a Mechanize::Page.
You can do:
form = Mechanize::Form.new page.at('body'), agent
Now the form is initialized with all the fields and buttons on the page.
You will need to set the action and method yourself:
form.action = 'http://foo.com'
form.method = 'POST'
next_page = form.submit
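What form.submit then sends is just the form's field pairs, url-encoded into the request body. A stdlib sketch of the body it would build (the field names below are hypothetical, not taken from any real page):

```ruby
require 'uri'

# form.submit url-encodes the form's field pairs into the POST body;
# this builds the same body by hand for two hypothetical fields.
fields = { 'q' => 'ruby mechanize', 'page' => '1' }
body = URI.encode_www_form(fields) #=> "q=ruby+mechanize&page=1"
```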

Ruby Mechanize: Programmatically Clicking a Link Without Knowing the Name of the Link

I am writing a ruby script to search the web. Here is the code:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://www.example.com/')
example_page = page.link_with(:text => 'example').click
puts example_page.body
The code above works alright. The text 'example' (:text => 'example') has to be a link on the page for the code to work correctly. The problem, however, is that when I do a web search (Bing, Yahoo, Google, etc.), hundreds of links show up. How can I programmatically click a link without knowing the exact name of the link? I want to be able to click a link if its name partly (or fully) matches text that I specify, or if it has a certain URL. Any help would be appreciated.
Mechanize's link matchers accept regular expressions:
page.link_with(text: /foo/).click
page.link_with(href: /foo/).click
Here are the Mechanize criteria that generally work for links and forms:
name: name_matcher
id: id_matcher
class: class_matcher
search: search_expression
xpath: xpath_expression
css: css_expression
action: action_matcher
...
If you're curious, here's the Mechanize ElementMatcher code
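The matching rule itself is simple: a Regexp criterion is tested with =~, while a String criterion must match exactly, and links_with returns every match instead of just the first. A stand-in sketch of that logic using plain Structs (the real objects are Mechanize::Page::Link instances, so this is only an illustration of the semantics):

```ruby
# Stand-in for Mechanize's matcher semantics: Regexp criteria are
# tested with =~, String criteria must be equal. Link is a plain
# Struct here, not a real Mechanize::Page::Link.
Link = Struct.new(:text, :href)

def links_with(links, text: nil, href: nil)
  links.select do |link|
    [[text, link.text], [href, link.href]].all? do |criterion, value|
      criterion.nil? ||
        (criterion.is_a?(Regexp) ? criterion =~ value : criterion == value)
    end
  end
end

links = [
  Link.new('Example Domain', '/example'),
  Link.new('About',          '/about'),
  Link.new('Example Blog',   '/blog')
]

links_with(links, text: /Example/).map(&:href) #=> ["/example", "/blog"]
```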

Mechanize and invisible search form

I'm trying to perform a search on some website using Mechanize, but I can't submit the search form because Mechanize does not see any forms. page.form returns nil, and page = agent.get returns just {forms}> while I expect something like
<Mechanize::Form
{name "somename"}
{method "GET"}
{action "/search"}
Is it because the search form uses JavaScript? Is there any way to solve this? Or is the only way to give up on Mechanize and use something else?
It means there's no form on that page. The workaround is to request directly the page that the form submit would have led to.
In other words, when I type 'foo' into the search box and click the button, I get redirected to:
http://s.weibo.com/weibo/foo&Refer=index
So just get that page and do something with it.
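If the query needs escaping, a small helper keeps it safe. Note that the /weibo/<query>&Refer=index pattern is just what the address bar showed after one manual search, so treat it as an assumption about the site:

```ruby
require 'erb'

# Build the results URL directly, bypassing the JavaScript-only form.
# The path pattern is copied from the browser's address bar and may
# change; the query is percent-encoded for safety.
def weibo_search_url(query)
  "http://s.weibo.com/weibo/#{ERB::Util.url_encode(query)}&Refer=index"
end

weibo_search_url('foo') #=> "http://s.weibo.com/weibo/foo&Refer=index"

# Then fetch it as usual:
#   agent = Mechanize.new
#   results = agent.get(weibo_search_url('foo'))
```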

Stumped on clicking a link with Nokogiri and Mechanize

Perhaps I'm doing it wrong, or there's another, more efficient way. Here is my problem:
First, using Nokogiri, I open an HTML document and use CSS selectors to traverse it until I find the link I need to click.
Once I have the link, how do I use Mechanize to click it? According to the documentation, the click method accepts either a string or a Mechanize::Page::Link object.
I cannot use a string, since there could be hundreds of the same link; I only want Mechanize to click the link that was traversed by Nokogiri.
Any ideas?
After you have found the link node you need, you can create the Mechanize::Page::Link object manually, and click it afterwards:
agent = Mechanize.new
page = agent.get "http://google.com"
node = page.search(".//p[@class='posted']").first # XPath attributes use @class; take one node, not the NodeSet
Mechanize::Page::Link.new(node, agent, page).click
An easier way than #binarycode's option:
agent = Mechanize.new
page = agent.get "http://google.com"
page.link_with(:class => 'posted').click
That is simple: you don't need to use Mechanize's link_with().click.
You can just get the link's href and update your page variable.
Mechanize keeps track of the current site internally, so it is smart enough to follow relative links.
E.g.:
agent = Mechanize.new
page = agent.get "http://somesite.com"
next_page_link = page.search('your exotic selectors here').first rescue nil # Nokogiri node
next_page_href = next_page_link['href'] rescue nil # e.g. '/local/link/file.html'
page = agent.get(next_page_href) if next_page_href # goes to 'http://somesite.com/local/link/file.html'
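Under the hood, "following a local link" is plain URI resolution: the relative href is joined against the page you are currently on. A stdlib sketch of what Mechanize does for you here:

```ruby
require 'uri'

# Resolving a relative href against the current page, which is all
# "smart enough to follow local links" amounts to.
base     = 'http://somesite.com/some/page.html'
absolute = URI.join(base, '/local/link/file.html').to_s
absolute #=> "http://somesite.com/local/link/file.html"
```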
