Node.js or Ruby for Scraping - ruby

I am trying to make an application that requires a lot of data scraping from multiple websites. I tried scraping websites using Ruby but gems such as Mechanize only seem to scrape static pages and not dynamic content. I have a couple questions regarding which of these languages, or any other language, I should use for this project (I am considering using Node because quite a few elements in the application have to be in real time).
Is it possible to use Ruby and/or Node to scrape dynamic content? If so which tools specifically should be used?
If multiple users are going to be scraping from multiple sites, which language would you recommend using?
On a slightly unrelated note, is it possible to combine Node and Rails?
Thanks in advance!

You can use the capybara gem to scrape JavaScript-driven sites with Ruby.
This has the advantage of being able to use actual browsers such as Firefox, Chrome and IE through the selenium driver. Or you can use headless browsers such as webkit (via capybara-webkit) or phantomjs (via poltergeist).
When you use capybara, just be sure to use a javascript enabled driver, such as selenium or capybara-webkit. My driver of the day is poltergeist.
There are some instructions for how to use capybara with remote sites in their readme.
Node vs. Ruby is a very open-ended question. I am suggesting Ruby here because that is where my experience and preference lie. "Combining" them could mean many things; they can be used in concert, each playing to its strengths.

When you say that mechanize can't scrape dynamic content, you really mean that it's a little bit more work to figure out which ajax requests need to be made and make them. The other side of that is that once you do you generally get a nice json response that's easy to deal with. Mechanize is also much faster than a full browser solution so my opinion is that it's usually worth the extra work.
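As a rough sketch of that approach, the snippet below requests a JSON endpoint directly and parses the response with the standard library alone. The URL, the "items" key, and the field names are assumptions for illustration; the real ones would come from watching the site's AJAX traffic in the browser's network tab.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical endpoint: find the real one in the browser's network tab.
LISTINGS_URL = "https://example.com/api/listings?page=1"

# Fetch the raw JSON body that the page's own JavaScript would request.
def fetch_json(url)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  # Some sites use this header to tell AJAX calls from normal page loads.
  request["X-Requested-With"] = "XMLHttpRequest"
  request["Accept"] = "application/json"
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  raise "Unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Turn the JSON payload into plain Ruby hashes (the field names are assumed).
def parse_listings(json_body)
  JSON.parse(json_body).fetch("items", []).map do |item|
    { title: item["title"], price: item["price"] }
  end
end

# Usage (performs a real HTTP request, so it is left commented out):
# parse_listings(fetch_json(LISTINGS_URL))
```

Mechanize can issue the same request while keeping cookies and session state; Net::HTTP is used here only to keep the sketch dependency-free.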
As far as Node goes, there's potential, and maybe once it's been around for a while some great libraries will become available, but I haven't seen anything yet that would make up for the Ruby things I would miss.

Related

Watir Webdriver script generation

I'm currently working on writing a suite of test scripts using watir webdriver. Is there something out there that would make script generation easier than looking directly at the HTTP and manually putting the script together? Maybe something captures user interactions with the browser elements and then writes that to a script.
I could just write them manually, but I may as well ask and see if there is a better way.
There are a couple record and playback tools that are available for Selenium (like IDE), and several non-open source solutions as well. Most of the Selenium and Watir development communities actively discourage their usage for writing test suites as they create very brittle tests that are difficult to maintain over time.
Watir does allow you to locate elements based on text or regular expressions, which can make it easier to find many elements without looking at the html. In general, though, you the tester have a better idea of the structure of your website, what id elements are there, and what css elements are unique on a page, or unlikely to change with future site updates, etc.

Can I read webpage data using Ruby?

I am looking for a way to automate testing and web form filling, and I also want to extract data from web pages and store it permanently in our database. Is there any way to fulfil these requirements using Ruby? If so, please point me to Ruby modules that can help.
Yes, you can do all of these tasks using Ruby and some gems.
I recommend taking a look at the Nokogiri gem for data extraction:
https://github.com/sparklemotion/nokogiri
And Capybara gem for testing and automation of forms and stuff:
https://github.com/jnicklas/capybara
P.S.: Capybara gem does much more than just this, but it can be applied to your case too.
Since some web pages are not valid XML, you can also use regular expressions to fetch the data you want. Sometimes an XML-reader approach just fails.
Sample:
require 'open-uri'
# Kernel#open no longer handles URLs in Ruby 3.0+, so call URI.open
page_content = URI.open("http://your_page.com").read
# /m lets . match newlines; <body[^>]*> also matches a body tag with attributes
page_body = page_content[/<body[^>]*>(.*)<\/body>/im, 1]
# do whatever you want with it
As VBSlover said, capybara is useful to deal with browsing related stuff.
Doing this in an automated way every n minutes or the like is also possible with the whenever gem.
For handling Database-Storing there are plenty of very good gems out there.
Final answer: there is nothing you can't do with Ruby nowadays. Okay, maybe except writing some really (!) high-performance code / 3D-Engines.
Edit:
If you can tell me exactly what you want to do, I can suggest some matching gems.
"There is a gem for it" is usually a good rule of thumb. You can search rubygems.org for the keywords you need, or look at https://www.ruby-toolbox.com/ for categorized and ranked suggestions for your problem. :)
EDIT 2:
Have a look at http://watir.com/
Maybe just play around with it in a few small scripts to get a feeling for it and to see whether it is the solution for you.
Watir drives browsers the same way people do. It clicks links, fills
in forms, presses buttons. Watir also checks results, such as whether
expected text appears on the page.
Once it has clicked everything for you, just scrape the results (or whatever you need) from the web page, using an XML parser (Nokogiri would be a good choice) or some regexps.
Then store your data in your database. ActiveRecord comes to mind for this, but it may be overkill. Depending on your database, choose whatever adapter/connection gem you like (again: there are MANY).
If you want to do this every hour or so, just use the whenever gem (it manages a cron job for you) or simply write an infinite loop with sleep(x) in it. There is more than one way to do it. :)
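The sleep-loop variant could look roughly like the sketch below. The `run_every` helper and its `max_runs` parameter are made up for this illustration (the limit exists only so the loop can be stopped), and `scrape_and_store` in the usage comment stands in for your own code.

```ruby
# Run the given block repeatedly, sleeping `interval` seconds between runs.
# `max_runs` is an assumption added so the loop can be stopped; pass nil
# (the default) to loop forever.
def run_every(interval, max_runs: nil)
  runs = 0
  loop do
    begin
      yield
    rescue StandardError => e
      # One failed scrape should not kill the whole loop.
      warn "run failed: #{e.message}"
    end
    runs += 1
    break if max_runs && runs >= max_runs

    sleep(interval)
  end
  runs
end

# Usage: scrape once an hour (scrape_and_store is a hypothetical method).
# run_every(3600) { scrape_and_store }
```

A cron job, which is what whenever sets up, survives reboots and crashes of the script itself, so the loop is mainly useful for quick experiments.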
First of all, you need a proper operating system: use Linux, BSD, or macOS.
Windows works for some people, but it is a poor fit for a Ruby developer: too many libraries need C extensions, which are painful to compile under Cygwin.
I recommend installing a Ruby version manager so you can try out different Ruby versions. I prefer RVM, the Ruby Version Manager.
Install Ruby 1.9.3; it is the standard nowadays.
Through RubyGems, install the mechanize gem, which handles nearly all of the website automation you will need. It is modeled on Perl's WWW::Mechanize.
Nokogiri is also useful for parsing XML-like data such as (X)HTML, but remember that you need the libxml libraries installed on your system first.
Ah, according to your question:
Yes, you can read websites using ruby, for example read this webpage:
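require 'httpclient' # provided by the httpclient gem
http = HTTPClient.new
response = http.get "http://stackoverflow.com/questions/14235393/can-i-read-webpage-data-using-ruby"
puts response.body # the raw HTML of the page
Done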

Can you use ruby for web pages other than ruby on rails?

Is Ruby primarily only used in ruby on rails? Is it used on the server side for general work like php is? Also, I haven't seen a lot of hype about rails anymore. Is Ruby and/or RoR dead or fading away?
I ask because I was interested in RhoMobile for building mobile apps, but I didn't want to get into using an antiquated language.
Thanks.
Edit: Can I use Ruby for web pages if I don't want to use Rails? (I do not mean another framework. I mean like PHP.)
Regarding your question about Ruby and/or RoR dead or fading away, look at the job trends
There are many web-frameworks for Ruby, not just Rails, Sinatra being one of them.
You shouldn't be deciding to use a language or technology, because there is or there is not a hype around it.
If a product is able to solve your problems, then you should use it. I know people building stuff in Smalltalk nowadays (who would have thought, right?), because it's great and it works.
Take a look at Sinatra, for example.
Also, there are a lot of tools written entirely in Ruby.
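For the PHP-style case the question asks about, Ruby's standard library includes ERB, which embeds Ruby in HTML in much the same way. The sketch below only renders a page to a string; the template and the `render_page` helper are invented for illustration, and serving the result (through CGI or a Rack server, for example) is a separate step not shown here.

```ruby
require "erb"

# A PHP-style page: HTML with embedded Ruby between <% %> tags.
PAGE_TEMPLATE = <<~HTML
  <html>
    <body>
      <h1>Hello, <%= ERB::Util.html_escape(name) %>!</h1>
      <ul>
      <% items.each do |item| %>
        <li><%= ERB::Util.html_escape(item) %></li>
      <% end %>
      </ul>
    </body>
  </html>
HTML

# Render the template with the given values as local variables.
def render_page(name, items)
  ERB.new(PAGE_TEMPLATE).result_with_hash(name: name, items: items)
end

puts render_page("world", %w[one two])
```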

Automated Web Scraping Issues

I am developing a rather large automation application to scrape various abandoned property information from various state databases, in order to find specific properties. I have already developed search scripts for about 8 state websites, using various forms of automation. I prefer to use something like ruby's Mechanize library to perform the automation, because it is the most stable method I have come across so far. In some cases, I am unable to automate the scraping with Mechanize and must fall back to something like Watir (or, more specifically, the branch of Watir called Vapir). Vapir is needed specifically when a source requires javascript to be searched, since Mechanize only makes HTTP requests and does not deal with JS interpretation.
My problem is with Vapir automating an instance of Internet Explorer. In some cases, after prolonged searches (some of these searches are for lists of 4,000+ search terms), IE locks up. I assume it is an issue with the OLE engine. The error I receive is as follows:
failed to create WIN32OLE object from `InternetExplorer.Application' HRESULT error code:0x80004005 Unspecified error
I cannot find anything to resolve this issue.
My question is if anyone knows of any solution or work-around to an automated OLE instance that locks up? To fix the error, I have to manually kill all of the IE processes and restart the automated search.
Alternatives that I am aware of are to automate Firefox through Vapir in the back-end (rather than IE), or possibly switch over to something like PhantomJS. Does anybody have an opinion on either of these options?
Is there a reason you are using Vapir? Why don't you try watir (drives Internet Explorer) or watir-webdriver (drives Internet Explorer, Firefox, Chrome and Opera) gems?
For installation see https://github.com/zeljkofilipin/watirbook/blob/master/installation/windows.md

Ruby off the rails

Sometimes it feels that my company is the only company in the world using Ruby but not Ruby on Rails, to the point that Rails has almost become synonymous with Ruby.
I'm sure this isn't really true, but it'd be fun to hear some stories about non-Rails Ruby usage out there.
One of the huge benefits of Ruby is the ability to create DSLs very easily. Ruby allows you to create "business rules" in a natural language way that is usually easy enough for a business analyst to use. Many Ruby apps outside of web development exist for this purpose.
I highly recommend Googling "ruby dsl" for some excellent reading, but I would like to leave you with one post in particular. Russ Olsen wrote a two part blog post on DSLs. I saw him give a presentation on DSLs and it was very good. I highly recommend reading these posts.
I also found this excellent presentation on Ruby DSLs by Obie Fernandez. Highly recommended reading!
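To give a feel for what such a "business rules" DSL can look like, here is a minimal sketch built on `instance_eval`. The `RuleSet` class, the `discount` keyword, and the order fields are all made up for this example and are not taken from the posts mentioned above.

```ruby
# A tiny internal DSL for discount rules. All names here are illustrative.
class RuleSet
  Rule = Struct.new(:percent, :description, :condition)

  def initialize(&block)
    @rules = []
    instance_eval(&block) # evaluate the block with the DSL methods in scope
  end

  # DSL keyword: discount 10, "orders over 100" do |order| ... end
  def discount(percent, description, &condition)
    @rules << Rule.new(percent, description, condition)
  end

  # Return the largest discount whose condition matches the order, or 0.
  def discount_for(order)
    @rules.select { |rule| rule.condition.call(order) }
          .map(&:percent)
          .max || 0
  end
end

RULES = RuleSet.new do
  discount 10, "orders over 100" do |order|
    order[:total] > 100
  end

  discount 15, "returning customers" do |order|
    order[:returning]
  end
end

puts RULES.discount_for({ total: 150, returning: false }) # prints 10
```

The block passed to `RuleSet.new` is the part a business analyst would read or edit; everything above it is plumbing.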
I use Ruby extensively in my work, and none of it is Rails (or even web) based.
My domain is usually client-side Windows applications (wxRuby GUI) and scripts, automating Excel, Internet Explorer, SQL Server queries and report generation (win32ole COM automation). I also use the sqlite, pdf-writer, and gruff libraries for various data munging and graph generation tasks.
Rails' success has been great for Ruby, but I agree that Rails has received so much attention that Ruby's value beyond the web is often overlooked.
We are mainly a C++ shop, but we've found several areas where Ruby has proven quite useful. Here are a few:
Code Generation - Built several DSLs to generate C++/Java/C# code from single input files
Build Support
scripts to generate Makefiles for unix from Visual Studio Project Files
scripts for building projects and formatting the output for Cruise Control
scripts for running our unit tests and formatting the output for Cruise Control
scripts for manipulating Visual Studio projects and solutions from the command line
Integration Tests - We can crank out tests much quicker and cleaner using Ruby than C++
QA's entire testing suite is written in Ruby
Ruby is basically my go to tool for where it makes sense. And it makes sense in a lot of places.
Google Sketchup uses Ruby as an embedded scripting language. You can use it to perform all sorts of 3d modeling and import/export tasks. The scripting works with the free version and there's even decent documentation.
Ruby with a homebrew extension written in C++ does all the heavy pixel pushing for my photography processing. I was using Python+numpy, but when doing artsy stuff, Ruby is just more fun. Also, the relative lack (or lesser maturity) of good image processing libraries makes me feel less like I'm reinventing wheels. I am clueless about Rails, other than that I've heard of it, have a fuzzy idea of what it is, and actually have a book on it (unopened).
We use Watir (Ruby library) to test our .net web application.
Check out Shoes, a simple API for building GUIs in Ruby aimed at novice programmers.
Or you could use Ruby to make music ala Giles Bowkett's Archaeopteryx. This presentation by Giles about Archaeopteryx is one of the best presentations ever. I highly recommend it.
RubyCocoa and MacRuby. Possible to make full Cocoa-based GUI apps without Rails. And then you get to use Interface Builder, too.
I worked on a museum project last year that used a lot of Ruby. (http://ourspace.tepapa.com/home)
The part that I spent most of my time on was an interactive floor map. The Map on the floor has sensors so when people walk on it lights are triggered and displays in the wall show images or videos and audio tracks are played.
All the control code for this part of the exhibit is ruby. I wrote C interfaces with ruby wrappers to communicate with the floor sensors and the lighting controllers. The system queries a MYSQL database for the media files to be displayed and then tells computers in the walls to play the media via UDP.
It's the most reliable part of the entire exhibit.
Ruby was used for the other major part of the exhibit, the Wall though I didn't have much to do with that. Most of the graphics were prototyped in ruby using interfaces to OpenGL, a bit of Cocoa and a physics library before being ported to pure Obj-C.
Puppet and Chef: DevOps
I didn't see a mention of Puppet or Chef in the 30 answers that preceded my arrival. Ruby appears to dominate current work in cloud automation and is the base, extension, and templating language of these two big players. They are used primarily to distribute system and application configuration information for server arrays and for general IT workstation management.
The DevOps field is quite Ruby-aware. Today, Perl has a competitor. While a really simple script may often still be written directly for sh(1), a complex task now might be done in Ruby rather than Perl.
The only site I've done with Ruby at work is using Rails, but I'd like to try Merb.
Other than that I do a lot of little utility programs in Ruby - for instance an app that reads RSS feeds and imports new posts into a database.
It's fun, so I also write some dumb stuff just because it's so quick. Yesterday I wrote an app to play the Monty Hall problem 100,000 times to help a friend convince her professor that switching is the correct strategy.
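A simulation like that fits in a few lines. The version below is a reconstruction of the idea, not the original program, and the method names are my own. Staying should win about a third of the time and switching about two thirds.

```ruby
# Play one round; return true if the player ends up with the car.
def monty_hall_win?(switch, rng)
  doors = [0, 1, 2]
  car = rng.rand(3)
  pick = rng.rand(3)
  # The host opens a door that is neither the player's pick nor the car.
  opened = (doors - [pick, car]).sample(random: rng)
  # Switching means taking the one door that is neither picked nor opened.
  pick = (doors - [pick, opened]).first if switch
  pick == car
end

# Fraction of wins over `rounds` games for the given strategy.
def win_rate(switch, rounds: 100_000, rng: Random.new)
  wins = rounds.times.count { monty_hall_win?(switch, rng) }
  wins.to_f / rounds
end

puts format("stay:   %.3f", win_rate(false))
puts format("switch: %.3f", win_rate(true))
```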
I almost take insult that ruby is a rails thing. It is like back when CGI was the latest trend and everyone figured that if you knew perl you must be doing it only because you programmed CGI apps. Ruby is just a scripting language for me, although not as mature as python so I somewhat regret having to jump through some of its hoops and recent changes, I still like it and use it. Although I work in a java shop and therefore groovy is the ideal choice for a scripting language, I still use ruby at home and for throw away scripts that aren't needed to be shared at work.
I was considering getting into RoR because of all the buzz and how quick and simple it is said to be, but after looking over Rails I didn't see anything at all that was amazing, or even the least bit innovative or rapid about its development, compared to any other framework. The only benefit I saw was that I could code in Ruby, which would be nice, but initial setup, server maintenance and scaling are more difficult, thus offsetting the pleasure of coding in Ruby.
I created a presentation -- coincidentally named Off The Rails -- to discuss Rack-based web applications:
https://github.com/alexch/Off-The-Rails
The git repo includes slides in Markdown format and sample code (in the form of running applications and middleware). Here's the abstract:
Ruby on Rails is the most popular web application framework for Ruby. But it's not the only one! If you think Rails is too big, or too opinionated, or too anything, you might be happy to learn about the new generation of so-called microframeworks built on Rack. And since Rails 3 is itself a Rack app, you don't have to give up Rails to get the benefit of Sinatra routes or Grape APIs.
And here are some references:
This talk lives at https://github.com/alexch/off-the-rails
Yehuda's #10 Favorite Thing About Ruby
Rack
rack-test
rack-client
Sinatra
Grape
Vegas
Siesta
Rerun
Hope you find it useful!
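For readers who have not seen Rack, the core contract is small enough to sketch without the gem: an application is any object that responds to `call(env)` and returns a status, a headers hash, and a body that responds to `each`. The app and middleware below are illustrative only and are not taken from the talk's sample code.

```ruby
# A Rack application: takes the request environment hash and returns
# [status, headers, body], where body responds to `each`.
HELLO_APP = lambda do |env|
  name = env["QUERY_STRING"].to_s[/name=(\w+)/, 1] || "world"
  [200, { "content-type" => "text/plain" }, ["Hello, #{name}!\n"]]
end

# Middleware wraps another app; this one adds a response header.
class PoweredBy
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    [status, headers.merge("x-powered-by" => "plain Ruby"), body]
  end
end

APP = PoweredBy.new(HELLO_APP)

# Exercise the app with a hand-built env hash, so the rack gem is not needed.
env = { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/", "QUERY_STRING" => "name=Ruby" }
status, headers, body = APP.call(env)
puts status
puts headers["x-powered-by"]
puts body.join
```

With the rack gem installed, the same object could be served from a `config.ru` file containing `run APP`.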
I'm mostly a Web developer, and I learned Ruby to use Rails, but I like the language so much that I started developing a desktop Swing application in Ruby, using JRuby and Monkeybars. I'm competent in Java, but don't much like using it, and the Swing API is horrible, so putting Ruby on top has been a big win.
We mainly use rails, but we have plenty of other non-rails ruby things - for example a standalone authentication daemon thing for centralized authentication of users, and an 'image processing server' which runs arbitrary numbers of ruby processes to process images in parallel.
Oh, and don't forget good old Rake :-)
Ruby is also used for desktop applications, especially JRuby for developing Swing desktop applications.
I've used Ruby at work for
A data extractor, generating csv files from binary output.
A .ini file generator, turning a simple syntax into a repetitive .ini format.
A simple TCP/IP server, acting as stand-in for the customer's system during testing.
We use Ruby to implement our test automation software. This includes test framework and driver code for Selenium RC, WATIR and AutoIT.
Ruby is powerful enough to create comprehensive applications that can interface with Test tools like Selenium or WATIR, while at the same time reading from data files, interacting with a remote Windows UI and performing near transparent network communication. All while running on Windows or Linux.
The uncluttered syntax makes it ideal for new and inexperienced programmers to read, while its fully OO nature makes it easy for these same programmers to apply good (recently learned) OO techniques from the start.
The flexible nature of Ruby's syntax also makes the use and creation of DSLs much easier. This allows less-technical people to get involved, read, and possibly create their own tests.
I have used Ruby for code generation of C# and T-SQL stored procedures in a project with unstable requirements. The data model was encoded in a YAML file and .erb templates were used for the classes and stored procedures. It also allowed for a much more DRY solution than would have been possible with straight C#, as repetitive code could be factored out into a single method in the code generator.
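The YAML-plus-ERB approach can be sketched as follows. The YAML layout, the template, and the `generate_class` helper are assumptions made for this illustration; the original project's files are not shown in the answer.

```ruby
require "erb"
require "yaml"

# Hypothetical data model; the original project's YAML layout is not shown.
MODEL_YAML = <<~YAML
  class: Customer
  fields:
    - { name: Id, type: int }
    - { name: Name, type: string }
YAML

# ERB template for a C# class. With trim_mode "-", the <%- and -%> markers
# strip the whitespace and newline around control lines.
CLASS_TEMPLATE = <<~TEMPLATE
  public class <%= model["class"] %>
  {
  <%- model["fields"].each do |field| -%>
      public <%= field["type"] %> <%= field["name"] %> { get; set; }
  <%- end -%>
  }
TEMPLATE

# Load the model and render the C# source as a string.
def generate_class(yaml_text)
  model = YAML.safe_load(yaml_text)
  ERB.new(CLASS_TEMPLATE, trim_mode: "-").result_with_hash(model: model)
end

puts generate_class(MODEL_YAML)
```

When requirements change, only the YAML file needs editing; a second template over the same model could emit the matching stored procedures.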
Where I work, we use Ruby to do a number of different one-off type batch jobs. One example of that is a job that interacts with Amazon's S3 service. At the time, the Ruby S3 library was probably the easiest one out there for us to get up and running in a short amount of time.
I wrote an order processing expert system (see DSL answer as well), converted 100k lines of customer specific perl into about 10k lines of ruby handling dozens of customers. No web components at all, no Rails.
I am a WebDriver user. WebDriver uses Ruby, via Rake, to automate its build process. See http://code.google.com/p/webdriver/ for details.
Heh, great question.
I used Ruby to convert Excel spreadsheet airport facility data to sqlite3 for the android phone platform while making an app for pilots.
I use Ruby with Sinatra which is much simpler than Rails. I did use Rails but just found that it has turned into a bit of a monster, although Rails is still amazing compared to web frameworks available for Java.
The features of Ruby that I love most, however, are "eval" and "method_missing", which Rails actually uses, for example in ActiveRecord, so that you can use the amazing "find_by_<field_name>" queries.
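The mechanism behind such dynamic finders can be shown with an in-memory stand-in. This is a sketch of the idea only: `RecordSet` is a made-up class, and this is not how ActiveRecord itself is implemented.

```ruby
# An in-memory stand-in for a table, to show the mechanism only.
class RecordSet
  def initialize(rows)
    @rows = rows
  end

  # Handle calls such as find_by_name("Ada") by inspecting the method name.
  def method_missing(name, *args)
    field = name.to_s[/\Afind_by_(\w+)\z/, 1]
    return super unless field

    @rows.find { |row| row[field.to_sym] == args.first }
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_by_") || super
  end
end

people = RecordSet.new([
  { name: "Ada", city: "London" },
  { name: "Grace", city: "New York" }
])

p people.find_by_name("Grace")
p people.find_by_city("London")
```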
I used Ruby for a lot of back-end code simply because I was the only person who was tasked to do it and needed a nice clean language that allowed me to be very productive and write easy to maintain code. I find Ruby allows me to do that easier than Perl and Python. Other people's mileage might vary on that but it works well for me.
Besides that, I like how Sequel and Nokogiri work. I also used ActiveRecord for a while separately from Rails.
We use some Ruby for file manipulation but have not been able to incorporate rails yet.
I've used Ruby a lot professionally for quick scripts for things like shuffling files around. I'm the same way in that I was using Ruby first before touching Rails at all.
In Boulder there was an excellent group of Ruby users who met monthly. This point was made - that Ruby does have an existence beside its use in Rails. Plain Ruby users do exist, are begging for attention, have neat things to show, and can find each other at user group meetings.
They also had better pizza than the Python group, who also met on the same day of the month. You can only pick one...
While we do have several Rails apps at work, we also use Ruby for some fairly intensive non-web stuff.
We've got an SMS delivery daemon, which pulls messages from a queue and then delivers them, and credit card processing daemon which other apps can call out to, which makes sure there's a central audit trail.
