Use of Mechanize - ruby

I want to get responses from websites that take a simple input, which is also reflected in a parameter of the URL. Is it better to simply get the result by using conventional methods, for example OpenURI.open_uri(...) with some parameter set, or is it better to use Mechanize, extract the form, and get the result through submit?
The Mechanize page gives an example of extracting a form and submitting it to get the search results from Google search. However, this much can be done simply with OpenURI.open_uri("http://www.google.com/search?q=...").read. Is there any reason I should prefer one way over the other?

There are lots of sites where it turns out to be easiest to use Mechanize. If you need to log in and set a cookie before accessing the data, then Mechanize is a simple way of doing this. Similarly, if there are lots of hidden fields that need to be matched (such as a CSRF token), then fetching the page with Mechanize and then submitting it with the data filled out is often a more foolproof method than crafting the URL yourself.
If it is a simple URI, like Google's search pages, then manually constructing it may be simpler.
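For a site where the query really is just a URL parameter, both approaches work. A minimal sketch of the two side by side (it assumes Google's search form is the first form on the page and that its query field is named 'q'):
require 'open-uri'
require 'mechanize'

# Plain URL approach: fine when the query maps directly onto a parameter.
html = OpenURI.open_uri("http://www.google.com/search?q=ruby+mechanize").read

# Mechanize approach: fetch the page, fill in the form, and submit it.
# Cookies and hidden fields are carried along automatically.
agent = Mechanize.new
page  = agent.get("http://www.google.com/")
form  = page.forms.first            # assumed: the search form is the first form
form.q = "ruby mechanize"           # assumed: the query field is named 'q'
results = agent.submit(form)
puts results.search("h3").map(&:text)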

Related

Ajax results filtering and URL parameters

I am building a results filtering page using AJAX requests. I would like to reflect the filters in the URL. For example: for price_from I want to add ?price_from=VAL to the URL.
I have a backend that is capable of rendering the page with URL parameters.
After some googling I found a Backbone.Router solution which has a hash fallback for IE versions that do not support the HTML5 History API.
I have a problem with settling on a good philosophy of routes. I have a set of filtering parameters (price_from, price_to, color, ...) and I would like to attach each parameter to one route.
Is it possible to chain the routes to match, for example, ?price_from=0&price_to=1&color=red? (The parameter order can change.)
That is: call all the routes at the same time and keep the IE backwards compatibility?
Your best bet would be to have a query portion of the URL rather than using GET parameters to denote the search criteria. For example:
Push state: /search/query/price_from=0&price_to=1&color=red
Hash based: #search/query/price_from=0&price_to=1&color=red
Your backend would of course need to change a bit to be able to parse the new URL structure.
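For example, a minimal sketch of that parsing (assuming a Ruby backend; the helper name here is hypothetical):
# Turn the filter segment of the new URL structure, e.g.
# "price_from=0&price_to=1&color=red", into a hash, regardless of parameter order.
def parse_filter_segment(segment)
  segment.split('&').each_with_object({}) do |pair, filters|
    key, value = pair.split('=', 2)
    filters[key] = value
  end
end
parse_filter_segment("price_from=0&price_to=1&color=red")
# => {"price_from"=>"0", "price_to"=>"1", "color"=>"red"}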

How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website, enter something into its search field, and then parse the results? For example, if I entered something into a search engine and then parsed the results page. I know how to use Nokogiri to find the webpage and open it. I am lost on how to input text into the search field and move forward to the results. Also, on the page that I am actually searching, I have to click an 'Enter' button; I can't simply hit enter to move forward. Thank you so much for your help.
Use Mechanize - a library used for automating interaction with websites.
Something like mechanize will work, but interacting with the front end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably an HTTP GET or POST request with some associated params). You can do this with Firebug, or Fiddler 2 for Windows. Then, once you know the parameters that the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either get mechanize to go to duckduckgo.com, input text into the search box, and click submit, or you could just create a GET request to http://www.duckduckgo.com?q=search_term_here.
You can use Mechanize for something like this but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
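For example, a minimal RestClient sketch (the endpoint and the 'q' parameter are placeholders, not the real site's):
require 'rest-client'
# Send a plain GET with query parameters; no cookie handling involved.
response = RestClient.get('http://example.com/search', params: { q: 'search term' })
puts response.body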
Edit:
If you can determine the specific URL that the form submits to, say, for example, 'example.com/search', and you knew the request was a POST (which it usually is if you are submitting a form), you could construct something like this with Mechanize:
require 'mechanize'

agent = Mechanize.new
string_to_search_for = "whatever you are searching for"
# Each form field's 'name' attribute becomes a hash key, and the text you
# would have typed (or the button's value) becomes the hash value.
result_page = agent.post('http://example.com/search', {
  "_id0:Number"       => string_to_search_for,
  "_id0:submitButton" => "Enter"
})
Notice how the 'name' attribute of a form element becomes a key in the POST data and its value becomes the hash value. The 'input' element gets its value directly from the text you would have entered. This gets transformed into a request and submitted to the server when you push the submit button (of course, in this case you are making the request directly). The result of the POST should be some HTML that you can parse for the info you need.

xss protection and html purifier

I am currently using the CodeIgniter framework, and looking to strengthen the XSS protection by using HTMLPurifier (http://htmlpurifier.org/).
Is my understanding correct that you want to 'clean' data on POST, so that it's purified before it's inserted into the database? Or do I run it before displaying in the view?
If so, do I want to run HTMLPurifier on every single POST that takes place? Since the app contains a lot of forms, I'd hate to have to selectively choose what gets cleaned and what doesn't - assuming that I can intercept all posts, is this the way to go? Of course, I validate some fields anyway (like email addresses, numeric fields, etc.).
Use $this->input->post() to get $_POST data. CodeIgniter filters it automatically if the global XSS filter is set to true.
See the docs: http://codeigniter.com/user_guide/libraries/input.html
Edit: to clarify
Yes you should filter before inserting into the DB and yes you should filter all user input.
A quick Google search, http://www.google.com/search?q=codeigniter+htmlpurifier, led to this page: http://codeigniter.com/wiki/htmlpurifier, which is a helper for HTMLPurifier. Regarding catching all $_POST data: you have to do something with the data, right? In your models, when you're doing that something, just make purify() part of that process:
$postdata = purify($_POST);

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page, but the initial response has nothing in the body, as the content is pumped in asynchronously, e.g. the results from a search on the Apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with Hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or some other way of seeing the actual requests a page and its JavaScript components send and/or receive. You'll see that, for the search page you referred to, it fetches two parts: First, the "featured" results that come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck fetching and parsing these URLs directly with Hpricot.
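For instance, a minimal Hpricot sketch against the first URL (the 'result' element name is an assumption; inspect the actual XML to find the real structure):
require 'open-uri'
require 'hpricot'
url = "http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk"
doc = Hpricot.XML(open(url).read)      # parse the raw XML response
(doc / "result").each do |result|      # assumed element name
  puts result.inner_text
end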

GET vs POST in Ajax

What is the difference between GET and POST for Ajax requests?
I don't see any difference between the two, except that when I use GET, the parameters are sent in the URL, which for me doesn't really make any difference, since all requests are made in the background and the user doesn't notice any difference.
edit:
What are PUT and DELETE methods used for?
GET is designed for getting data from the server. POST (and lesser-known friends PUT and DELETE) are designed for modifying data on the server.
A GET request should never cause data to be removed from an application. If you have a link you can click on with a GET to remove data, then Google spidering your site could click on all your "Delete" links.
The canonical answer can be found here, which quotes the HTML 2.0 spec:
If the processing of a form is idempotent (i.e. it has no lasting observable effect on the state of the world), then the form method should be GET. Many database searches have no visible side-effects and make ideal applications of query forms.
If the service associated with the processing of a form has side effects (for example, modification of a database or subscription to a service), the method should be POST.
In your AJAX call, you need to use whatever method your server supports. You should always design your server so that operations that modify data are called by POST/PUT/DELETE. Other comments have links to REST, which generally maps C/R/U/D to "POST or PUT"(Create)/GET(Read)/PUT(Update)/DELETE(Delete).
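As a rough sketch of that mapping using Ruby's standard library (the resource URL is hypothetical):
require 'net/http'
require 'uri'
base = URI("http://example.com/widgets")              # hypothetical REST resource
Net::HTTP.get_response(URI("#{base}/42"))             # Read: parameters travel in the URL
Net::HTTP.post_form(base, "name" => "sprocket")       # Create: data travels in the body
Net::HTTP.start(base.host, base.port) do |http|
  put = Net::HTTP::Put.new("#{base.path}/42")         # Update
  put.set_form_data("name" => "flange")
  http.request(put)
  http.request(Net::HTTP::Delete.new("#{base.path}/42"))  # Delete
end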
If you're sending large amounts of data, or sensitive data over HTTPS, you will want to use POST. If it's just a simple parameter, I would use GET.
GET requests have a limit to the amount of data that can be sent. I forget the exact number, but this can cause issues if you're sending anything substantial.
Basically, the difference between GET and POST is that in a GET request the parameters are passed in the URL, whereas in a POST the parameters are included in the message body.
Whether it's AJAX or not is irrelevant. It's about the action that you're taking. I'd recommend following the principles of REST, which have further provisions for updating, deleting, etc.
GET requests are easier to exploit in CSRF (cross-site request forgery) attacks. Namely, fake POST requests require JavaScript to be enabled on the user's side, while fake GET requests are still possible with just img or script tags.
Many web servers limit the length of the data that can be passed as part of the URL, so the GET request may break in odd ways that are hard to debug.
Also, most server software logs URLs in the access logs, so if you pass sensitive information (such as passwords) in a GET request, this will in all likelihood be written to disk in plaintext.
From a REST perspective, GET requests should have no side-effects -- they shouldn't modify data. So, if you're just GETting a resource by ID, this makes sense, but if you're committing changes to a resource, you should be using PUT, POST, or PATCH for the HTTP verb.
Both are used to send some data and receive some response using that data.
GET: gets information stored on the server, e.g. a search, a tweet, or a person's information. If you want to send information with a GET request, it goes in the URL, e.g. process.php?name=subroto.
So GET basically sends information through the URL. A URL cannot handle more than 2083 characters (the Internet Explorer limit), so sending something as long as a blog post that way is not possible.
POST: does the same thing as GET (sends data and gets a response), but is used for things like user registration, user login, sending big data, or posting a blog entry.
If you need to send sensitive information, or big data, use POST, since it does not go through the URL.
AJAX: $.get() and $.post() contain features that are subsets of $.ajax(), which has much more configuration.
The $.get() method is a kind of shorthand for $.ajax(). When using $.get(), instead of passing in an object, you pass in arguments. At a minimum, you'll need the first two arguments: the URL of the file you want to retrieve (e.g. 'test.txt') and a success callback.
Summary:
$.get( url [, data ] [, success ] [, dataType ] )
$.post( url [, data ] [, success ] [, dataType ] ) // for sending sensitive or large information
$.ajax( url [, settings ] ) // more configuration
First, some general information: use GET if you only read data; use POST if you change something in the database, text files, etc.
But the problem is that some browsers cache GET results. I had problems with AJAX requests in IE7, and it turned out the browser was caching the GET results. I rethought the flow and changed my requests to POST.
So, don't use GET if you don't want caching.
(Of course you can disable caching for GET operations, but I preferred not to.)
Personally, I prefer POST. I reserve GET for cases where I know the value being sent is limited to data I have "control" over, for example retrieving an item by id: "getitem?id=123", "deleteItem?id=123", ... For the other cases, where a user can fill in a form, I prefer POST.
As Ryan Smith said, it's better to use POST to send a large amount of data, and there are fewer worries about other languages/special characters (generally all major JavaScript frameworks shouldn't have any problems dealing with that, but I think POST is still less worrying).
From the REST perspective, in my opinion, you can use it with a new project (to keep consistency across the entire project).
Finally, some programs used on a network (URL loggers, e.g. to see whether employees waste time on unauthorized sites, proxies, or any other kind of tool) can intercept the query. Some will show in their reports the parameters you sent with GET, treating them like a different web page. But whether that is a problem in your situation changes from one project to another! ;)
The difference between GET and POST is the same whether you're using Ajax, HTML forms, or curl. The relevant definitions are those of the GET and POST methods in the HTTP/1.1 specification.
If you are passing any arguments with characters that can get messed up in the URL (such as spaces), use POST. Otherwise you can use GET.
Generally, if you're just passing a few tiny arguments you would use GET. But for passing user-submitted information such as blog entries or long text, it's good practice to use POST.
There are also certain frameworks that rely completely on segment-based URLs (such as site.com/products/133 rather than site.com/products.php?id=133), and these frameworks unset the GET variables for security. In such cases you would use POST all the time.
