How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website, enter a query into its search field, and then parse the results? For example, if I entered something into a search engine and then parsed the results page. I know how to use Nokogiri to find the webpage and open it. I am lost on how to put text into the search field and move forward to the results. Also, on the page I am actually searching I have to click an 'Enter' button; I can't simply hit enter to move forward. Thank you so much for your help.

Use Mechanize - a library used for automating interaction with websites.
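For the "type into the search field and submit" part of the question, a minimal Mechanize sketch might look like the following; the DuckDuckGo HTML endpoint and the CSS selector are assumptions, so adjust them to whatever site you are actually scraping:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('https://duckduckgo.com/html/')   # non-JavaScript search page (assumed)
form  = page.forms.first                            # assume the first form is the search form
form['q'] = 'ruby mechanize'                        # 'q' is the search field's name attribute
results = form.submit

# The result page is already parsed, so you can query it with CSS selectors.
results.search('a.result__a').each do |link|        # selector is an assumption for this site
  puts "#{link.text.strip} -> #{link['href']}"
end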

Something like Mechanize will work, but interacting with the front-end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably an HTTP GET or POST request with some associated params). You can do this with Firebug, or with Fiddler 2 on Windows. Then, once you know the parameters the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either have Mechanize go to duckduckgo.com, type text into the search box, and click submit, or you could just make a GET request to http://www.duckduckgo.com?q=search_term_here.
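If you go the direct-request route, something this small (using open-uri and the Nokogiri the question already mentions) should be enough; the CSS selector is an assumption:

require 'open-uri'
require 'nokogiri'
require 'cgi'

term = 'search_term_here'
html = OpenURI.open_uri("http://duckduckgo.com/html/?q=#{CGI.escape(term)}").read
doc  = Nokogiri::HTML(html)
puts doc.css('a.result__a').map(&:text)   # selector is an assumption; inspect the real markup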
You can use Mechanize for something like this but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
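For completeness, a minimal RestClient sketch (the URL and parameter names are placeholders):

require 'rest-client'
require 'nokogiri'

# RestClient appends :params to the query string on GET requests.
response = RestClient.get('http://example.com/search', params: { q: 'search term' })
doc = Nokogiri::HTML(response.body)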
Edit:
If you can determine the specific URL that the form submits to, say for example 'example.com/search', and you know the request is a POST (which it usually is if you are submitting a form), you could construct something like this with Mechanize:
require 'mechanize'

agent = Mechanize.new
# Keys are the form fields' 'name' attributes; values are what you would have typed or clicked.
agent.post 'http://example.com/search', {
  "_id0:Number"       => string_to_search_for,
  "_id0:submitButton" => "Enter"
}
Notice how the 'name' attribute of a form element becomes a key in the POST and its 'value' becomes the value. The text input gets its value directly from the text you would have entered. Normally this gets transformed into a request and submitted to the server when you push the submit button; here you are making the request directly. The result of the post should be some HTML that you can parse for the info you need.
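Continuing from the snippet above, the page Mechanize returns is already parsed (Nokogiri under the hood), so you can query it directly; the CSS selector here is a placeholder for whatever the real results markup uses:

page = agent.post 'http://example.com/search', { "_id0:Number" => string_to_search_for }
page.search('div.result').each do |result|   # 'div.result' is a hypothetical selector
  puts result.text.strip
end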

Related

Difficulties using UNIX cURL to scrape Ajax Wicket Information

I am instructed to write UNIX shell scripts that scrape certain websites. We use Fiddler to trace the HTTP requests, then we write the cURL commands accordingly. For the most part, scraping most websites seems to be fairly simple; however, I've run into a situation where I'm having difficulty capturing certain information.
I need to be somewhat generic in saying that I cannot provide the website address that I am actually looking at, but I can post some of the requests and responses to provide context.
Here's the situation:
The website starts with a search screen. You enter your search query and the website returns a list of results.
I need to choose the first result from the result page.
I need to capture EVERYTHING on the page from the first result.
Everything up until this point is working fine.
Here's the problem:
The page returned has hyperlinks that are Wicket Ajax links. When these links are clicked, a window pops up within the page - it is not actually a window like a pop-up created by JavaScript; it is more comparable to what you see when you 'compose a message' or 'poke' someone on Facebook (am I the only one who still does that?).
I need to capture the contents of that pop-up window. There are usually multiple Wicket links on a given page. Handling that should be easy enough with a loop, but I need to figure out the proper way to cURL those Wicket links first.
Here is the cURL command I'm currently using to attempt to scrape the Wicket links (I'm explicitly defining the referrer URL, the Accept header, and the Wicket-Ajax boolean, as these were the items sent in the headers when I traced the site). $link is the URL, which looks like this:
http://www.someDomainName.com/searches/?x=as56f1sa65df1&random=0.121345151
(The 'random' parameter, I believe, is populated by some JavaScript; I'm not sure if it's needed or even possible to recreate. I'm currently sending one of the random values that I received on one particular occasion.)
/bin/curl -v3 -b COOKIE -c COOKIE -H "Accept: text/xml" -H "Referer: $URL$x" -H "Wicket-Ajax: true" -sLf "$link"
Here is the response I get:
<ajax-response><redirect><![CDATA[home.page;jsessionid=6F45DF769D527B98DD1C7FFF3A0DF089]]></redirect>
</ajax-response>
I am expecting an XML document with actual content to be returned. Any insight into this issue would be greatly appreciated. Please let me know if you need more information.
Thanks,
Paul

Creating dynamic web forms on the client side

Is there an existing library that would do this?
I want to have code on the client side where the user chooses something, the client makes a call to the server, and the server sends back something like "for this option, you need to have a text field called foo and a select field called bar with the following options, with this one selected, etc."; the client side then builds the next part of the form from that information. If they choose a different option, a different set of fields and values is returned from the server and populated on the screen. It might also cascade: after the first selection we need a select field with some options, and then depending on what they select in that field, the next field might be another select field or a text input field.
Has anybody done anything like that? Is my best choice to have the AJAX call return some HTML that I just stuff into a div, or can I do it field by field and value by value?
If it matters, the back end is going to be written in Perl/MASON, and the front end will be using Javascript/JQuery/JQuery-UI.
I would use jQuery and submit AJAX calls to whatever backend system you choose. Have this backend system compute the necessary changes and return the info as JSON, then let jQuery parse it for you and append the necessary form elements. However, it seems like in a lot of use cases these decisions could be made on the client side without even talking to the server, just as we pre-validate form input before allowing posting to the server. I don't, however, have your requirements in front of me, so I am sure there is a reason you want to get the info back from the server.
P.S. Please do not return pure HTML from the back end to the client... ever.

Use of Mechanize

I want to get responses from websites that take a simple input, which is also reflected in a parameter of the URL. Is it better to simply get the result using conventional methods, for example OpenURI.open_uri(...) with some parameters set, or is it better to use Mechanize, extract the form, and get the result through submit?
The Mechanize page gives an example of extracting a form and submitting it to get the search results from Google. However, that much can be done simply with OpenURI.open_uri("http://www.google.com/search?q=...").read. Is there any reason I should prefer one approach over the other?
There are lots of sites where it turns out to be easiest to use Mechanize. If you need to log in and set a cookie before accessing the data, then Mechanize is a simple way of doing this. Similarly, if there are lots of hidden fields that need to be matched (such as a CSRF token), then fetching the page with Mechanize and then submitting it with the data filled out is often a more foolproof method than crafting the URL yourself.
If it is a simple URI, like Google's search pages, then manually constructing it may be simpler.
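As a rough sketch of the two approaches side by side (the Google form name below is an assumption; the URL-based call mirrors the one in the question):

require 'mechanize'
require 'open-uri'

# Form-based: let Mechanize find the search form, fill it in, and submit it.
agent = Mechanize.new
page  = agent.get('http://www.google.com/')
form  = page.form_with(name: 'f') || page.forms.first   # form name 'f' is an assumption
form['q'] = 'ruby mechanize'
results = form.submit

# URL-based: craft the query string yourself and fetch it directly.
html = OpenURI.open_uri('http://www.google.com/search?q=ruby+mechanize').read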

Redirect from current page to a new page

I am having trouble with some Ruby CGI.
I have a home page (index.cgi) which is a mix of HTML and Ruby, and has a login form in it.
On clicking the Submit button, the POST's action is the same page (index.cgi), at which point I check to make sure the user has entered data into the correct fields.
I have a counter which increases by 1 each time a field is left empty. If this counter is 0, I want to change the currently loaded page to something like contents.html.
With this I have:
if ( errorCount > 0 )
  # do nothing
else
  ....
end
What do I need to put where I have the ....?
Unfortunately I cannot use any frameworks as this is for University coursework, so have to use base Ruby.
As for using the CGI#header method as you have suggested, I have tried using it, but it is not working for me.
As mentioned, my page is index.cgi. It is made of a mixture of Ruby and HTML using "here doc" statements.
At the top of my code page I have my shebang line, followed by an HTML header statement.
I then do the CGI form validation part, and within this I have tried doing something like: print this.cgi( { 'Status' => '302 Moved', 'location' => 'http://localhost:10000/contents.html' } )
All that happens is that this line is printed at the top of the browser window, above my index.cgi page.
I hope this makes sense.
To redirect the browser to another URL you must output a 30X HTTP response that contains a Location: /foo/bar header. You can do that using the CGI#header method.
Instead of dealing with these details that you do not yet master, I suggest you use a simple framework such as Sinatra or, at least, write your script as a Rack-compatible application.
If you really need to use the bare CGI class, have a look at this simple example: https://github.com/tdtds/amazon-auth-proxy/blob/master/amazon-auth-proxy.cgi.
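For reference, a minimal sketch of the bare-CGI redirect, assuming errorCount is set by the validation code described in the question and that contents.html is served at the URL shown there:

#!/usr/bin/env ruby
require 'cgi'

cgi = CGI.new
# ... validate the form fields here, incrementing errorCount for each empty one ...

if errorCount > 0
  # re-render index.cgi with the error messages, as before
else
  # Print only the redirect headers; the browser then requests contents.html itself.
  print cgi.header('status'   => 'REDIRECT',   # sends "Status: 302 Moved"
                   'location' => 'http://localhost:10000/contents.html')
end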

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page, but the initial response has nothing in the body, as the content is loaded asynchronously, e.g. the results from a search on the Apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with Hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via JavaScript/AJAX to some other location and then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the JavaScript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in is the search results, you'd need to analyze what happens when you enter that page and type a search query. Some JavaScript in the page takes your query and calls (via XMLHttpRequest or similar AJAX techniques) some other script on Apple's server. That script is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or use some other way of seeing the actual requests a page and its JavaScript components send and/or receive. You'll see that, for the search page you referred to, it fetches two parts. First, the "featured" results come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck parsing these URLs with Hpricot.
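A rough sketch of fetching the first of those endpoints and parsing it with Hpricot (the element name in the search call is a placeholder; inspect the real XML response to pick the right tags):

require 'open-uri'
require 'hpricot'

url = 'http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk'
doc = Hpricot(open(url))

# 'result' is a hypothetical element name; check the actual response structure.
(doc / 'result').each do |item|
  puts item.inner_text.strip
end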
