Automated Web Scraping Issues - ruby

I am developing a rather large automation application that scrapes abandoned-property information from various state databases in order to find specific properties. I have already developed search scripts for about 8 state websites, using several forms of automation. I prefer to use something like Ruby's Mechanize library to perform the automation, because it is the most stable method I have come across so far. In some cases, though, I am unable to automate the scraping with Mechanize and must fall back to something like Watir (or, more specifically, the fork of Watir called Vapir). Vapir is needed specifically when a source requires JavaScript in order to be searched, since Mechanize only makes HTTP requests and does not interpret JS.
My problem is with Vapir automating an instance of Internet Explorer. In some cases, after prolonged searches (some of these searches are for lists of 4,000+ search terms), IE locks up. I assume it is an issue with the OLE engine. The error I receive is as follows:
    failed to create WIN32OLE object from `InternetExplorer.Application'
        HRESULT error code:0x80004005
          Unspecified error
I cannot find anything to resolve this issue.
My question: does anyone know of a solution or workaround for an automated OLE instance that locks up? To recover from the error, I currently have to kill all of the IE processes by hand and restart the automated search.
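Something like the following sketch is what I mean by "kill and restart", automated. It is untested, assumes a Windows environment where taskkill is available, and does not fix the underlying OLE problem; it just lets the search resume:

    require 'win32ole'

    def new_ie
      WIN32OLE.new('InternetExplorer.Application')
    rescue WIN32OLERuntimeError
      # OLE is wedged: force-kill every IE process, wait, and try once more
      system('taskkill /F /IM iexplore.exe')
      sleep 5
      WIN32OLE.new('InternetExplorer.Application')
    end

    ie = new_ie
    # ...drive the browser; recreating it every few hundred search terms
    # might also limit how often the lock-up occurs (untested)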
Alternatives that I am aware of are to automate Firefox through Vapir in the back-end (rather than IE), or possibly switch over to something like PhantomJS. Does anybody have an opinion on either of these options?

Is there a reason you are using Vapir? Why don't you try the watir gem (drives Internet Explorer) or the watir-webdriver gem (drives Internet Explorer, Firefox, Chrome and Opera)?
For installation see https://github.com/zeljkofilipin/watirbook/blob/master/installation/windows.md
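For a sense of what the switch would look like, here is a minimal watir-webdriver sketch (the URL and locators are placeholders):

    require 'watir-webdriver'

    browser = Watir::Browser.new :firefox                        # or :ie, :chrome
    browser.goto 'http://example.com/search'                     # placeholder URL
    browser.text_field(:name => 'q').set 'abandoned property'    # placeholder locator
    browser.button(:value => 'Search').click                     # placeholder locator
    puts browser.title
    browser.close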

Related

Watir Webdriver script generation

I'm currently working on writing a suite of test scripts using watir-webdriver. Is there something out there that would make script generation easier than inspecting the HTML directly and putting the script together by hand? Maybe something that captures user interactions with browser elements and then writes them out as a script.
I could just write them manually, but I may as well ask and see if there is a better way.
There are a couple of record-and-playback tools available for Selenium (like Selenium IDE), and several closed-source solutions as well. Most of the Selenium and Watir development communities actively discourage their use for writing test suites, as they create very brittle tests that are difficult to maintain over time.
Watir does allow you to locate elements based on text or regular expressions, which can make it easier to find many elements without looking at the HTML. In general, though, you, the tester, have a better idea of the structure of your website: which IDs are present, and which CSS selectors are unique on a page or unlikely to change with future site updates. A short sketch of both locator styles follows.
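For illustration (the link text here targets the example.com landing page; adjust for your site):

    require 'watir-webdriver'

    browser = Watir::Browser.new
    browser.goto 'http://example.com'

    # locate by exact link text
    browser.link(:text => 'More information...').click

    # or by regular expression, tolerant of small wording changes
    browser.back
    puts browser.link(:text => /more\s+information/i).exists?

    browser.close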

Browser activity simulator

For testing purposes, I'm looking for a tool that simulates browsing activity. I'm not looking for just an HTTP(S) traffic generator; I need to define browsing scenarios: for example, browse [x] links deep, jump randomly from page to page, or randomly fill and submit forms, maybe even generating some erroneous requests. It is important that all HTTP verbs (PUT, HEAD, DELETE, etc.) are supported; a command-line interface would be nice but is not required. Randomizable fields (IP address, User-Agent, etc.) would be a very big plus.
If no such tool exists, what are the recommended packages for scripting this in Ruby?
This is called "end-to-end" (e2e) web testing.
You may want to look at selenium, a technology that is able to take control of a browser and automate user browsing scenarios.
Selenium is usually used through some kind of control tool. Since you use Ruby, you may want to look at the selenium-webdriver gem; a minimal sketch follows.
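Here is a sketch of a random-walk scenario with selenium-webdriver (the start URL and walk depth are assumptions):

    require 'selenium-webdriver'

    driver = Selenium::WebDriver.for :firefox
    driver.navigate.to 'http://example.com'          # assumed start page

    3.times do                                       # assumed walk depth
      links = driver.find_elements(:tag_name, 'a').select(&:displayed?)
      break if links.empty?
      links.sample.click                             # jump to a random page
    end

    driver.quit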
If you want random interactions, I heard of a tool called gremlins
I suggest you look into capybara https://github.com/jnicklas/capybara
You can use capybara with the most common Ruby test frameworks: RSpec, Cucumber, Test::Unit...
It uses selenium by default, but you can also make it headless (no browser window opened) by using another driver such as capybara-webkit.
Check out the README; you'll find everything you need.
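Capybara can also be scripted outside of a test framework. Here is a hedged sketch of a randomized form-submission scenario (the host, path, field name and button label are all placeholders):

    require 'capybara'
    require 'capybara/dsl'

    Capybara.run_server     = false                  # we are driving a remote site
    Capybara.current_driver = :selenium              # JS-capable driver
    Capybara.app_host       = 'http://example.com'   # placeholder target

    include Capybara::DSL

    visit '/search'                                  # placeholder path
    fill_in 'q', :with => %w[foo bar baz].sample     # placeholder field name
    click_button 'Search'                            # placeholder button label
    puts current_url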

Watir Webdriver Detection

Is there a way to detect whether someone is conducting web testing using watir-webdriver on your site? I have read somewhere that it is fairly easy to detect watir/selenium, but I never managed to get more details about it.
I have tried User-Agent detection, but that is not very useful, since the User-Agent is easy to change.
Right, I will make my comments into an answer as requested.
I doubt it's possible. The idea of Selenium is to automate browsers by simulating the actions of real users, so you can't reliably detect it from the server side unless the simulation is imperfect (e.g. clicking unrealistically fast). If the Selenium code is deliberately written to pace itself like a real user, I'd say it will be very difficult to detect.
On the other hand, the User-Agent approach won't work if someone runs the tests in a common browser with its default UA, and even a non-default UA is trivial to spoof.
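To illustrate why the User-Agent approach is so fragile, here is a sketch of overriding the UA in watir-webdriver through a Firefox profile (the UA string is a placeholder):

    require 'watir-webdriver'

    profile = Selenium::WebDriver::Firefox::Profile.new
    # present the automated browser as an ordinary desktop Firefox
    profile['general.useragent.override'] =
      'Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0'

    browser = Watir::Browser.new :firefox, :profile => profile
    browser.goto 'http://example.com'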

Node.js or Ruby for Scraping

I am trying to make an application that requires a lot of data scraping from multiple websites. I tried scraping in Ruby, but gems such as Mechanize only seem to handle static pages, not dynamically generated content. I have a couple of questions about which of these languages, or any other language, I should use for this project (I am considering Node because quite a few elements of the application have to be real-time).
Is it possible to use Ruby and/or Node to scrape dynamic content? If so which tools specifically should be used?
If multiple users are going to be scraping from multiple sites, which language would you recommend using?
On a slightly unrelated note, is it possible to combine Node and Rails?
Thanks in advance!
You can use the capybara gem to scrape JavaScript-heavy sites from Ruby.
This has the advantage of being able to drive actual browsers such as Firefox, Chrome and IE through the selenium driver, or headless browsers such as WebKit (via capybara-webkit) or PhantomJS (via poltergeist).
When you use capybara, just be sure to pick a JavaScript-enabled driver, such as selenium or capybara-webkit. My driver of the day is poltergeist.
There are instructions for using capybara with remote sites in its README.
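A minimal sketch of scraping a JS-rendered page through capybara with the poltergeist driver (the URL and selector are placeholders):

    require 'capybara'
    require 'capybara/poltergeist'

    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, :js_errors => false)
    end
    Capybara.run_server = false

    session = Capybara::Session.new(:poltergeist)
    session.visit 'http://example.com'               # placeholder URL
    # find waits for the node to appear, so content rendered by JS is covered
    puts session.find('h1').text                     # placeholder selector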
Node vs. Ruby is a very open-ended question. My answer here suggests Ruby because that is where my experience and preference lie. "Combining" them could mean many things; they can be used in concert, each playing to its strengths.
When you say that Mechanize can't scrape dynamic content, what that really means is that it takes a little extra work to figure out which AJAX requests the page makes and then make them yourself. The upside is that once you do, you generally get a nice JSON response that is easy to deal with. Mechanize is also much faster than a full browser solution, so in my opinion it is usually worth the extra work; a sketch follows.
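For example, calling such an AJAX endpoint directly with Mechanize (the endpoint and JSON keys are hypothetical; in practice you would discover them in the browser's network tab):

    require 'mechanize'
    require 'json'

    agent = Mechanize.new
    # hypothetical endpoint observed in the browser's network tab
    response = agent.get('http://example.com/api/search?q=property&page=1')
    data = JSON.parse(response.body)

    data['results'].each do |result|                 # hypothetical structure
      puts result['address']
    end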
As far as Node goes, there's potential, and maybe once it has been around for a while some great libraries will become available, but I haven't seen anything yet that would make up for the Ruby things I would miss.

XUL Testing Tools

Are there any tools for testing XUL? I'm using YUI Test for testing XPCOM, but I can't find anything for XUL.
I am using UxU, which is a reworked version of MozUnit. Though I was initially intimidated by the fact that its documentation is half in Japanese, I've used it for a few hours and I already feel like recommending it!
Besides allowing you to test your JavaScript code within the same environment in which it will run, it also provides a number of convenient helper methods for remote-controlling the browser and automating functionality such as loading URLs, opening tabs, modifying preferences, and accessing files and local storage.
Definitely worth a try!
