How would I accomplish this? Automated 3rd party page refresh + Alert - ruby

I'm faced with an interesting task:
Our transport guys have to monitor a 3rd-party webpage the entire day, clicking a button every 5 seconds to refresh the page and get available transport slots. The slots section is only updated when the button is clicked. When slots become available, the slot label changes from "0" to "1" or "2", depending on the number of open slots...
Is there any way of writing a script that would automatically click the button and raise an alert when that specific value on the page changes? Maybe some sort of UI testing framework that could automate this?
Any suggestions?

Pressing a button on such a webpage always boils down to an HTTP request, which you can make with plain Ruby's net/http. However, I guess there is some authentication going on, so cookies may have to be preserved. For such uses, Mechanize is a very nice library. It relies on Nokogiri, and the pages you get back are easy to scan for changes such as the number of open slots you need.
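As a rough sketch of that approach (the URL, the CSS selector, and the alerting step below are placeholders you would have to adapt to the real page, and any login step is left out):

require 'mechanize'

agent = Mechanize.new
last_count = 0

loop do
  # Re-request the page that the refresh button fetches (placeholder URL).
  page = agent.get('https://example.com/transport/slots')

  # Placeholder selector: whatever element holds the open-slot number.
  count = page.at('#open-slots').text.strip.to_i

  if count > last_count
    puts "ALERT: #{count} slot(s) now available!"
    # play a sound, send an email, etc.
  end

  last_count = count
  sleep 5   # poll roughly as often as the humans were clicking
end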
Without more detailed information about the pages you need to scrape, this is pretty much all the advice you can get.

Related

How to read content from reCAPTCHA protected site

My client needs data scraped from a website. I am planning to use php_curl. The problem is, the site is using Google reCAPTCHA. A few important data items are visible only when you click a "show this information" link; the reCAPTCHA then appears in a lightbox, vanishes, and the information is displayed.
I have checked the source HTML; the protected item is actually loaded only when someone clicks, and there is no way for me to automate this click. I have even tried to open the site in an iframe and then use JS to click it, but that fails because the two domains are different. I have also tried to use the Selenium standalone version, but its downloads are corrupt.
Unless there is a design flaw with the website, the reCAPTCHA will prevent you from scraping the material without human intervention.
Technically, your best bet is to employ humans to solve CAPTCHAs all day and write some software to automatically scrape the material each one protects as they solve it. A number of viable businesses have been created this way, where the data is valuable and there is a genuine public interest in opening the data set. (For example, I have heard that airlines use CAPTCHAs to prevent price-comparison sites from driving down the cost to the consumer, and I'd argue that in such a case there is an overwhelming public interest in defeating such defences.)
Morally, however, you would need to tell us what you are doing in order for us to advise you. It is possible your client is merely planning to steal other people's material and then attempt to monetise it for him/herself, even though they had no hand in creating it. That may breach some copyright laws, but moreover, they (and you) need to decide if the scraping is fair.
I faced the same problem and resolved it by clearing the cookies on the HTTP request (in the user agent), waiting for some time (Thread.Sleep), and then starting to scrape again. I am doing this in C#, not PHP, but applying the same logic may help you.
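The same idea translated into Ruby with Mechanize (the URL and timings are placeholders); starting a fresh agent is an easy way to throw the old cookies away:

require 'mechanize'

loop do
  agent = Mechanize.new          # fresh agent = empty cookie jar
  begin
    page = agent.get('https://example.com/protected-data')   # placeholder URL
    # ... extract what you need from `page` here ...
  rescue Mechanize::ResponseCodeError => e
    warn "Request failed (#{e.response_code}), backing off"
  end
  sleep 300                      # wait a while before scraping again
end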

How to automatically resubmit the form in Firefox on error

When a form POST submission (say, a login) fails, Firefox displays a "Try Again" message.
Is there any way to click this "Try Again" button automatically, or any setting in Firefox's about:config that will click it for me?
"Clicking" the Try Again button is relatively easy. There is an extension that does just that, and lets you set the number of seconds between retries.
The real rub here is that you want to "blindly" retry form POSTs. As we all know, just because you didn't get a response, that doesn't necessarily imply that nothing was changed on the server.
Re-submitting a login form sounds harmless enough, and usually is. But if you imagine forms that result in orders being placed or money being moved, it's easy to understand why browsers have implemented this kind of warning:
This is what you'll see if you enable an extension like TryAgain and a form post fails. It's the same behavior you'd get by pressing F5 yourself. The extension will dutifully try to POST again, but the browser is going to intervene with an alert, and refuse to send the POST until "Resend" is clicked.
This kind of safety feature does a fair amount to protect end-users and developers from poor implementations and network hiccups. However, it's really going to work against what you're trying to accomplish.
That said, if you could figure out a way to modify the extension to detect the alert and somehow click "Resend", you'd be in business. I can't say for sure that this is impossible, but it kind of looks that way, at least for now: this issue was marked as "won't fix", and this issue is still open.
Here is an extension for Firefox:
auto reload
But I would warn you: you could automatically resend sensitive data. Web browsers ask before resubmitting because they don't want any sensitive data to be submitted without the user's consent.

Scraping pages that do not seem to have URLs

I'm trying to scrape these listings and provide more exposure for these job listings on a site that belongs to a client of mine. The issue is that I need to be able to link to the specific job listing in order for the job seeker to apply. This is the page I'm trying to save listing links from.
It would be ideal if I could save an address for the job seeker to click on to see the original listing and then apply.
What is this website doing to not expose a URL for these pages?
Is it possible to obtain a listing-specific address?
If that's possible, how could I generate that address?
If I can't get a specific address I think I could get it so that the user clicks a link that triggers an internal script on my client's site which takes the listing ID and searches the site I found that listing on, and then redirects the user to that specific listing.
The downside to this is that the user will have to wait a little while depending on how far back the listing is on a directory. I could put some kind of progress bar with a pleasant "Searching for your listing! Thanks for being patient" message.
If I can avoid having to do this, though, that'd be great!
I'm using Nokogiri and Mechanize.
The page you refer to appears to be generated by an Oracle product, so one would think they'd be willing to construct a web form properly (and with reference to accessibility concerns). They haven't, so it occurs to me that either their engineer was having a bad day, or they are deliberately making it (slightly) harder to scrape.
The reason your browser shows no href when you hover over those links is that there isn't one. What the page does instead is to use JavaScript to capture the click event, populate a POST form with some hidden values, and call the submit method programmatically. This can cause problems with screen-readers and other accessibility devices, as well as causing problems with the way in which back buttons have to re-submit the page.
The good news is that constructions of this kind can usually be scraped by creating a form yourself, either using a real one on a third party page, or via a crawler library. If you post the right values to the target URI, reverse-engineered from examining the page's script, the resulting document should be the "linked" page you expect.
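For example, with Mechanize that reverse-engineered POST might look roughly like this (the action URL and field names below are made up; read the real ones out of the click handler's script):

require 'mechanize'

agent = Mechanize.new

# POST the values the page's JavaScript would have put into the hidden form.
listing = agent.post('https://example.com/jobs/ViewListing',   # action URL from the script
                     'listingId'   => '12345',                 # hidden fields the click
                     'searchState' => 'abc123')                # handler would populate

puts listing.title
File.write('listing.html', listing.body)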

Automate website log-in and form filling?

I'm trying to log in to a website and save an HTML page automatically (I want to be able to do this on a regular time interval). From the surface, this is a typical modern website where, if the user navigates directly to a "locked" URL, a log-in form pops up, and after logging in, the user is redirected to the intended page.
I gave mechanize a shot (http://wwwsearch.sourceforge.net/mechanize/) but it wasn't finding some form elements that are needed for login (hidden elements whose values are filled in by a JavaScript function that runs when the user clicks the "log in" button).
I played a bit with the "web browser" control in .NET but quickly lost interest because I couldn't even get it to submit a query on the Google page.
I don't care what the language is; I'll learn it to solve this problem. At a minimum it has to work in Windows.
A simple example, say, typing in a query into the Google search box would be a great bonus.
In my experience, the most reliable way is to use JavaScript. It works well in .NET. To test, browse to the following addresses one after another in Firefox or Internet Explorer:
http://www.google.com
javascript:function f(){document.forms[0]['q'].value='stackoverflow';}f();
javascript:document.forms[0].submit()
That performs a search for "stackoverflow" on Google. To do it in VB .Net using the webbrowser control, do this:
WebBrowser1.Navigate("http://www.google.com")
Do While WebBrowser1.IsBusy OrElse WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
Threading.Thread.Sleep(1000)
Application.DoEvents()
Loop
WebBrowser1.Navigate("javascript:function%20f(){document.forms[0]['q'].value='stackoverflow';}f();")
Threading.Thread.Sleep(2000) 'wait for javascript to run
WebBrowser1.Navigate("javascript:document.forms[0].submit()")
Threading.Thread.Sleep(2000) 'wait for javascript to run
Notice how the space in the URL is converted to %20. I'm not certain if this is necessary but it can't hurt. It is important that the first javascript be in a function. The calls to Sleep() are to wait for Google to load and also for the javascript stuff. The Do While Loop might run forever if the page fails to load so for automation purposes have a counter that will timeout after, say, 60 seconds.
Of course, for Google you can just navigate directly to www.google.com?q=stackoverflow but if your site has hidden input fields, etc, then this is the way to go. Only works for HTML sites - flash is a whole other matter.
If I understand you right, you want to log in to only one webpage, and that form always stays the same. You could either reverse-engineer the JavaScript, or debug it with a JavaScript debugger in the browser (e.g. Firebug for Firefox). Or you can fill in the form in your browser and look at the HTTP request with a network packet sniffer. Once you have all the form data required for the submit, you can do the same from your program (that's what I did the last time I had a very similar task). Don't forget to store all the cookie data the webserver sends back and send it with the next request, so you 'stay logged in'.
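A sketch of that workflow in Ruby using Mechanize, which keeps cookies between requests automatically (the URLs and field names, including the JavaScript-computed token, are placeholders taken from whatever you see in the sniffer):

require 'mechanize'

agent = Mechanize.new

# Replay the login POST exactly as the browser sent it, including the
# hidden fields that JavaScript filled in (values copied from the sniffer).
agent.post('https://example.com/login',
           'username' => 'me',
           'password' => 'secret',
           'jsToken'  => 'value-observed-in-the-sniffer')

# The agent now holds the session cookies, so the locked page is reachable.
page = agent.get('https://example.com/reports/today')
File.write('report.html', page.body)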
It's already being discussed here.
Basically, the gist is that you can use Selenium, an open-source web automation tool, which has API libraries available in various languages such as Java, Ruby, etc.
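A minimal Ruby sketch with the selenium-webdriver gem (the site URL and field names are placeholders) could look like this:

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.navigate.to 'https://example.com/login'             # placeholder URL

driver.find_element(name: 'username').send_keys('me')      # placeholder field names
driver.find_element(name: 'password').send_keys('secret')
driver.find_element(name: 'login').click

# The site's own JavaScript runs in a real browser, so hidden fields
# that scripts fill in are handled for you.
File.write('page.html', driver.page_source)
driver.quit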
Neoload can handle the form filling with authentication, assuming you don't want to collect data, just perform actions. It's a web stress tool, so it's not really meant to be used as a time-based service, but you COULD just leave it running.
I've used Ruby and Watir (a web app testing suite) for something similar, but it was a very small task (basically visiting URLs from a text file and downloading an image).
There's also an extension called iMacros that can do some automation, but I'm not personally familiar with it (just aware of it).
"I'm trying to log in to a website and save an HTML page automatically"
SAVEAS TYPE=HTM FOLDER=C: FILE=page.html
https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/?src=search
This command, played in the iMacros add-on, will save the page to the C: drive and name it page.html.
Also,
URL GOTO=www.website.com
Goes to the particular website you want to save. You can also use scripting in iMacros and target different websites in the macro.

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen-scrape a web page (created in ASP.NET) which uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page of the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks which cause a postback to the same page.
An example of the hyperlink:-
6
I recommend using Watir, a Ruby library designed for browser testing, since you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it constructs. Note that it might not be in the main page source, but in something the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP, you've already got the tools to find the definition of __doPostBack (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post, e.g. my_http.post(path, form_data) instead of my_http.get(path).
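To make that concrete, here is a rough Net::HTTP sketch of a GridView pager postback (the URL is a placeholder, and the control name and page argument have to be read out of the __doPostBack call on the real page):

require 'net/http'
require 'uri'
require 'nokogiri'

uri = URI('http://example.com/Listings.aspx')    # placeholder URL

# GET the page first to pick up the hidden ASP.NET state fields.
doc = Nokogiri::HTML(Net::HTTP.get(uri))
viewstate  = doc.at('input[name="__VIEWSTATE"]')['value']
validation = doc.at('input[name="__EVENTVALIDATION"]')['value']

# Then POST the same fields __doPostBack would have submitted.
response = Net::HTTP.post_form(uri,
  '__EVENTTARGET'     => 'GridView1',   # control name from the __doPostBack call
  '__EVENTARGUMENT'   => 'Page$2',      # typical GridView "go to page 2" argument
  '__VIEWSTATE'       => viewstate,
  '__EVENTVALIDATION' => validation)

puts response.body.length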
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. As Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a web form using the bloated GridView, so you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out which HTTP request to perform on your own.
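If you go the Mechanize route, one way to replay that postback is to drive the server form directly. A sketch (placeholder URL; the __EVENTTARGET and __EVENTARGUMENT values come from the __doPostBack call in the pager link):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/Listings.aspx')   # placeholder URL

form = page.forms.first                  # the single ASP.NET server-side form
# These hidden inputs are normally already present on an ASP.NET page;
# fill them with the values __doPostBack would have used.
form['__EVENTTARGET']   = 'GridView1'
form['__EVENTARGUMENT'] = 'Page$2'

page2 = form.submit                      # page 2 of the grid, ready for Nokogiri-style scraping
puts page2.search('table tr').size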
