Scraping pages with asynchronous responses with Hpricot - ruby

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.

When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or some other way of seeing the actual requests a page and its javascript components send and / or receive. You'll see that, for the search page you referred, it fetches two parts: First, the "featured" results that come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
These both are XML documents; you might have better luck parsing these URLs with Hpricot.

Related

Xpath syntax when scraping headlines from CNN homepage

I tried to scrape CNN homepage with scrapy.
I used the following xpath selectors, but all of them returned empty lists.
Current results : all of these returns []
"//strong"
"//h2"
"//span[#class='cd__headline-text']"
Expected results :
[Headline_1, Headline_2, Headline_3, ...]
Can someone help me figure out why?
Is CNN doing something to stop people from scraping headlines?
I use Scrapy.
In order to write XPath/CSS selector or any web page, first of all, check page source that whether the selectors which you are looking for exists or not. In the current case none of the above selectors are found in page source. They are getting page content in various requests, try checking the network and find appropriate requests for your case. You need to make those requests in your spider in order to scrape news from CNN.

Symfony2, doctrine2, session, best way to paginate search page

Let's imagine we have simple data and want to make pagination of it. It's not hard to do, simple _GET var with page number others doctrine with offset will allow us to do it in easy way, BUT How should it look like in search page? Let me explain.
For example we have simple route with /search url. Where we have form for our search. When use input string we user POST method on same page and will get result. Simple enough but if we add pagination here it become a problem with storing "inputed string".
If we store in session on search query it will be solution BUT... it's not. Why? User input search string - get result with pagination (here search string already in session) after that leave the page (or close browser, or left to another page). When he will return data from session will show him 'result of old query'...
So question is, what is the best practice for such situation? I want simple search query + pagination of it but if user left page - clear result.
Using POST instead of GET for search query is kinda unusual and not really safe. Since search query operations are read-only you should use GET to access/get the data. POST is used for updating or creating resources.
And how you will go back/forward in the pagination (using browser's buttons)? You always will be getting an alert box. AND you cannot share/bookmark the search query url.
BTW to answer your question, sessions and hidden input fields would be the way to go. You also can use a combination of get and post
When should I use GET or POST method? What's the difference between them?

Difficulties using UNIX cURL to scrape Ajax Wicket Information

I am instructed to use write UNIX shell scripts that scrape certain websites. We use fiddler to trace the HTTP requests, then we write the cURLs accordingly. For the most part, scraping most websites seem to be fairly simple, however I've ran into a situation where I'm having difficulties capturing certain information.
I need to be somewhat generic in saying that I cannot provide the website address that I am actually looking at, however I can post some of the requests and responses to provide context.
Here's the situation:
The website starts with a search screen. You enter your search query and the website returns a list of results.
I need to choose the first result from the result page.
I need to capture EVERYTHING on the page from the first result.
Everything up until this point is working fine
Here's the problem:
The page returned has hyperlinks that are wickets. When these links are pressed, a window pops up within the page - it is not actually a window like a pop up created by javascript, it is more comparable to what you see when you 'compose a message' or 'poke' someone on Facebook ( am I the only one who still does that? ).
I need to capture the contents of that pop up window. There are usually multiple wicket links on a given page. Handling that should be easy enough with a loop, but I need to figure out the proper way to cURL those wickets first.
Here is the cURL i'm currently using to attempt to scrape the wickets.
(I'm explicitly defining the referrer URL, Accept, and Wicket-Ajax boolean as these were the items that were sent in the header when I traced the site). Link is the URL which looks like this:
http://www.someDomainName.com/searches/?x=as56f1sa65df1&random=0.121345151
( the random I believe is populated with some javascript, not sure if that's needed or even possible to recreate. I'm currently sending one of the randoms that I received on one particular occasion. ).
/bin/curl -v3 -b COOKIE -c COOKIE -H "Accept: text/xml" -H "Referer: $URL$x" -H "Wicket-Ajax: true" -sLf "$link"
Here is the response I get:
<ajax-response><redirect><![CDATA[home.page;jsessionid=6F45DF769D527B98DD1C7FFF3A0DF089]]></redirect>
</ajax-response>
I am expecting an XML document with actual content to be returned. Any insight into this issue would be greatly appreciated. Please let me know if you need more information.
Thanks,
Paul

How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website and do a search in the search field and then parse the results? For example if I entered something into a search engine and then parsed the results page. I know how to use Nokogiri to find the webpage and open it. I am lost on how to input into the search field and moving forward to the results. Also on the page that I am actually searching I have to click on enter, I can't simply hit enter to move forward. Thank you so much for your help.
Use Mechanize - a library used for automating interaction with websites.
Something like mechanize will work, but interacting with the front end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably a HTTP GET or POST request with some associated params). You can do this with firebug or Fiddler 2 for windows. Then, once you know the parameters that the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either get mechanize to go to duckduckgo.com, input text into the search box, and click submit, or you could just create a GET request to http://www.duckduckgo.com?q=search_term_here.
You can use Mechanize for something like this but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
Edit:
If you can determine the specific URL that the form submits to, say for example 'example.com/search'; and you knew the request was a POST (which it usually is if you are submitting a form) you could construct something like this with mechanize:
agent = Mechanize.new
agent.post 'http://example.com/search', {
"_id0:Number" => string_to_search_for,
"_id0:submitButton" => "Enter"
}
Notice how the 'name' attribute of a form element becomes a key for the post and the 'value' element becomes the value. The 'input' element gets the value directly from the text you would have entered. This gets transformed into a request and submitted to the server when you push the submit button (of course in this case you are making the request directly). The result of the post should be some HTML that you can parse for the info you need.

Use of Mechanize

I want to get response from websites that take a simple input, which is also reflected in the parameter of the url. Is it better to simply get the result by using conventional methods, for example OpenURI.open_uri(...) with some parameter set, or it is better to use mechanize, extract the form, and get the result through submit?
The mechanize page gives an example of extracting a form and submitting it to get the search result from Google search. However, this much can be done simply as OpenURI.open_uri("http://www.google.com/search?q=...").read. Is there any reason I should try to use one way or the other?
There are lots of sites where it turns out to be easiest to use mechanize. If you need to log in, and set a cookie before accessing the data, then mechanize is a simple way of doing this. Similarly, if there are lots of hidden fields that need to be matched (such as CSRF token), then fetching the page using mechanize then submitting it with the data filled out is often a more foolproof method that crafting the URL yourself.
If it is a simple URI, like google's search pages, then manually constructing it may be simpler.

Resources