WatiN & Google Searches

I'm trying to use WatiN to parse Google search results.
However, WatiN is unable to find elements in the Google search results page. When I view the source, I can see why: the page is generated by JavaScript, so the search results are not sent over the wire as HTML.
However when I open up Firebug (in Firefox) I am able to parse the html that gets generated by the javascript.
Does anyone know how I can get Watin to do the same so I'm able to parse the results?
Thanks :)

Could you use the google search API instead?
http://code.google.com/apis/ajaxsearch/documentation/#fonje
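If you go the API route, that documentation describes a plain JSON interface you can hit without rendering any JavaScript at all. A rough sketch in JavaScript (Node 18+ or a browser); the endpoint and response field names are my recollection of that documentation, so treat them as assumptions:

// Rough sketch: query the AJAX Search API's JSON endpoint directly instead of
// scraping the rendered results page. Endpoint and field names are assumptions
// taken from the linked documentation.
const url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q='
          + encodeURIComponent('watin tutorial');

fetch(url)                                   // any HTTP client works; no browser needed
  .then(res => res.json())
  .then(data => {
    for (const r of data.responseData.results) {
      console.log(r.titleNoFormatting + ' -> ' + r.url);
    }
  });

The same request can of course be made from .NET with WebClient or HttpWebRequest plus a JSON parser, which sidesteps WatiN entirely.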

It may be a matter of timing. If JavaScript is generating the data, you may be checking for it before it has been written. Try running in debug mode, stepping through and waiting until you know the items exist in the page source before using WatiN to look for them.
I would also suggest posting code so we can see exactly what you're trying to do.

Related

IMPORTHTML or IMPORTXML to collect data from a site

I have made several attempts to collect the data in a table on the site below.
I've tried the straightforward uses of the two functions I mentioned, but without success.
I would like to know if anyone knows any other way to collect this data in Google Sheets.
Site Link:
https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad
The table you want to scrape is under JavaScript control, therefore it can't be scraped.
All you can get from that site into Google Sheets is:
=ARRAY_CONSTRAIN(IMPORTDATA(
"https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad&team_id=13217"); 10000; 10)
Because the page you are trying to scrape is rendered using JavaScript (i.e. the content you are looking to scrape is not in the markup), you will not be able to use a tool like Google Sheets.
However... you can easily scrape this by using a "headless browser". You essentially use a browser (without a UI) that renders your URL, JavaScript included, and once the page is loaded you query the data using XPath, etc.
Check out Puppeteer for an example of a JS framework that you can use for this task.
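For example, a minimal Puppeteer sketch (Node.js) might look like the following; the CSS selector for the squad table is a placeholder, so inspect the page in DevTools to find the real one:

// Minimal headless-browser sketch using Puppeteer (Node.js).
// The '#t_squad table tr' selector is an assumption; check the real markup.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(
    'https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad',
    { waitUntil: 'networkidle2' }  // wait until the JavaScript-driven content has loaded
  );

  // Pull every row of the (assumed) squad table as an array of cell texts.
  const rows = await page.$$eval('#t_squad table tr', trs =>
    trs.map(tr => [...tr.querySelectorAll('th, td')].map(cell => cell.innerText.trim()))
  );

  console.log(rows);
  await browser.close();
})();

From there you could write the rows out as CSV and pull that file into Google Sheets with IMPORTDATA.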

Google Suggest: how does it work?

How does Google Suggest work? How does it manage to update the web page on the client so quickly, based on information in a distant Google database? Why does the web page not look ‘jumpy’ if it is being frequently updated?
It uses AJAX.
As you type your query, it looks up the ten most-requested completions matching what you have typed so far. The suggestions come back as minified JSON and are written into an otherwise invisible DIV element. Fast, but still resource-intensive.
Try installing Firebug on Firefox, or use the developer console in Chrome: open the console, start typing "Youtube" or whatever you want, and you will see the minified JSON responses.
Good luck :D
In addition to the front-end handling others have talked about, which jQuery is a great example of, you might also be interested in how they approach the idea on the backend. Dr. Peter Norvig has written about how to create a spelling corrector, where similar approaches could be used to find close matches.
The whole page is not being updated. Only parts of it are, using AJAX (Asynchronous JavaScript and XML). AJAX requests can be made from JavaScript, and the page is updated when the response comes back.
A far more interesting question is how does Google actually search 10bn+ documents in a teeny tiny fraction of a second :)
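To make the partial-update point concrete, here is a bare-bones sketch of the pattern; the /suggest endpoint and the element ids are made up for illustration:

// Fire an asynchronous request as the user types and rewrite one element when
// the response arrives. The '/suggest' endpoint and element ids are hypothetical.
var input = document.getElementById('query');
var list  = document.getElementById('suggestions');

input.oninput = function () {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/suggest?q=' + encodeURIComponent(input.value), true); // true = asynchronous
  xhr.onload = function () {
    var words = JSON.parse(xhr.responseText);        // e.g. ["youtube", "youtube music", ...]
    list.innerHTML = words.map(function (w) {
      return '<li>' + w + '</li>';
    }).join('');
    // Only this list changes; the rest of the page is untouched, so nothing looks jumpy.
  };
  xhr.send();
};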

Is there a way to visually see if HtmlUnit is performing the correct commands?

Is there a way to visually see if HtmlUnit is performing the correct commands? I have a hard requirement to use HtmlUnit. I just don't know if it's filling out all the forms correctly.
HtmlUnit is designed to be a GUI-less browser; for your requirements you could consider tools such as WebDriver, Watir, or Selenium. If you are into Ruby, take a look at Celerity, which wraps HtmlUnit in a Watir-ish API; in fact Celerity is itself wrapped by Culerity, which integrates Celerity and Cucumber and might be of even more interest to you.
Yes, you can see the HTTP traffic by using a proxy like WebScarab or Fiddler.
Make sure of the following:
Set the proxy details on HtmlUnit via the constructor (I think it is on WebClient).
Make sure you either trust all the certificates or add the proxy's certificate to the truststore.
What do you mean by "correct commands"? HtmlUnit itself won't give you a running description of what it's doing, if that's what you mean. As suthasankar says, HtmlUnit is a headless browser (intentionally so) and will never give you the cool Watir experience of watching pages fly by.
Any time I've wanted to know what's happening during a test's execution, I have added logging statements at various points in the test code and then watched them in the console. You could send messages to any other monitoring system you prefer instead.
It wouldn't take much to then write wrappers around the "commands" you're interested in, like "getPage" and button clicks and form entries and the like.
It's not possible to view what HtmlUnit is doing unless you code logging and some sort of display yourself. I have done this in the past; it's helpful to a degree, but it's not really possible to get visual feedback on what HtmlUnit is doing. Even with logging, you can't know every single detail of what HtmlUnit is doing or where it goes wrong, so it's an extremely time-consuming task. I even resorted to outputting the current page being viewed, but this is pretty limited, as an HTML page can't tell you the actual "commands" HtmlUnit is executing on it.
Another approach would be to use Selenium, which executes your "commands" in a visual manner, so you can see where things go wrong instantly by watching it.

Mastering ajax data loading

It's not that I'm not familiar with the concept, but I'm wondering what the best approach is when you are creating applications that support an AJAX user experience.
I mostly use AJAX with jQuery, but also when I want to load information without a page refresh. As you probably know, the XMLHttpRequest object and its responseText property provide a nice and easy way to execute a script in the background and display the results in, e.g., a div.
A bit of a downside of this approach is that it's hard to see the actual generated source code. I often take a look at the source code to see if the expected parameters are correctly provided to, e.g., form elements.
So I'm curious: what is your approach for creating AJAX data-loading functionality? Is it just via the responseText property?
Having manually crafted XmlHttpRequest objects for a while, I now use jQuery for all my AJAX-ified stuff. It gives you much better control while coding less.
If you use Firefox, get yourself the Web Developer Toolbar. It has a cool feature called "View Generated Source", which shows the HTML the browser currently knows about for the document as it stands, so it includes the HTML sent back by your AJAX requests.
Also, I make it a rule to always tell the user you're loading something rather than relying on the browser to tell them (like Gmail's "Loading" text in the corner, for instance).
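To tie both points together (loading a fragment into a div and telling the user about it), here is a minimal jQuery sketch; the URL and element ids are made up:

// Load a server-rendered fragment into a div and show a loading hint meanwhile.
// '/search/results' and the element ids are hypothetical.
function loadResults(query) {
  $('#status').text('Loading...');                // tell the user, don't rely on the browser
  $.get('/search/results', { q: query })
    .done(function (html) {
      $('#results').html(html);                   // the responseText fragment ends up in the div
      $('#status').text('');
    })
    .fail(function () {
      $('#status').text('Could not load results.');
    });
}

Once the fragment is injected, "View Generated Source" (or the DOM inspector) is where you check that the parameters ended up on the right form elements.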

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page of the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL); instead they are JavaScript hyperlinks which cause a postback to the same page.
An example of the hyperlink:-
6
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click  # the pager link text is just the page number
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to run frequently and quietly in the background in a completely automated way, this may not be the best approach. On the other hand, if it's OK to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web developer tools) and look through the source to find where __doPostBack is defined (a simplified version is sketched after this list). Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP, you've already got the tools to find the definition of __doPostBack (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
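For reference, the definition you are looking for usually boils down to something like this (simplified; the exact form name and control IDs vary per page):

// Roughly what ASP.NET emits (simplified). The grid's pager links call this
// with the GridView's unique ID as the target and something like "Page$6" as
// the argument.
function __doPostBack(eventTarget, eventArgument) {
  var theForm = document.forms[0];
  theForm.__EVENTTARGET.value = eventTarget;       // which control "clicked"
  theForm.__EVENTARGUMENT.value = eventArgument;   // e.g. "Page$6"
  theForm.submit();                                // POST back to the same URL
}

So to fetch page 6 without a browser, your POST goes to the same URL as the page itself, and the form data needs to include __EVENTTARGET, __EVENTARGUMENT, and the hidden __VIEWSTATE (and, if present, __EVENTVALIDATION) values copied from the page you already fetched.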
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url); post takes the URL-encoded form data as a second argument.
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. Like Markus said, use something like Firebug or the Developer Tools in IE 8, and Fiddler, to watch the traffic. But honestly, this is a web form using the bloated GridView, so you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps up all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
