Can I read webpage data using Ruby?

I am looking for a way to automate testing and form filling, and I also want to extract web page data and store it in our database on a permanent basis. Is there any way to fulfil such requirements using Ruby? If so, please point me to the Ruby modules that can help.

Yes, you can do all of these tasks using Ruby and a few gems.
I recommend taking a look at the Nokogiri gem for data extraction:
https://github.com/sparklemotion/nokogiri
And the Capybara gem for testing and automating forms and the like:
https://github.com/jnicklas/capybara
P.S.: The Capybara gem does much more than just this, but it applies to your case too.
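To give a feel for the extraction side, here is a minimal Nokogiri sketch; the URL and the CSS selector are hypothetical placeholders:
require 'open-uri'
require 'nokogiri'

# Fetch a page and pull out elements via a CSS selector.
# "http://example.com/articles" and "h2.title" are placeholders.
doc = Nokogiri::HTML(URI.open("http://example.com/articles"))
doc.css("h2.title").each do |heading|
  puts heading.text.strip
end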

Since some web pages may not be valid XML, you can also use regular expressions to fetch the data you want from a page. Sometimes an XML-parser approach simply fails.
Sample:
require 'open-uri'
page_content = URI.open("http://your_page.com").read
# the /m flag lets . match newlines, since <body> usually spans several lines
page_body = page_content[/<body>(.*)<\/body>/im, 1]
# do whatever you want with it
As VBSlover said, Capybara is useful for dealing with browsing-related tasks.
Doing this in an automated way every n minutes or the like is also possible with the whenever gem.
For handling Database-Storing there are plenty of very good gems out there.
Final answer: there is nothing you can't do with Ruby nowadays. Okay, except maybe writing some really (!) high-performance code or 3D engines.
Edit:
If you can say exactly what you want to do, I may be able to suggest some matching gems.
Usually "there is a gem for it" is a good rule of thumb. You can browse rubygems.org for the keywords you need, or look at https://www.ruby-toolbox.com/ for some categorized/ranked suggestions for your problem. :)
EDIT 2:
Have a look at http://watir.com/
Maybe just play around with it in some small, painless scripts to get a feeling for it and to see whether it is the solution for you.
Watir drives browsers the same way people do. It clicks links, fills in forms, presses buttons. Watir also checks results, such as whether expected text appears on the page.
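A minimal sketch of what that looks like in code; the URL, field name, and expected text are hypothetical:
require 'watir'

browser = Watir::Browser.new :chrome               # assumes Chrome plus its driver is installed
browser.goto "http://example.com/search"           # hypothetical page
browser.text_field(name: "q").set "ruby scraping"  # hypothetical form field
browser.button(type: "submit").click
puts browser.text.include?("results")              # check that expected text appears
browser.close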
Once it has clicked through everything for you, just scrape the results (or whatever you need) from the web page, using an XML parser (Nokogiri would be a good choice) or some regexps.
Then stuff your data into your database. ActiveRecord comes to mind for this, but it may or may not be overkill. Depending on your database, choose whatever adapter/connection gem you like (again: there are MANY).
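As a rough idea, here is a minimal sketch of standalone ActiveRecord (no Rails); the SQLite database and the pages table are made up for the example:
require 'active_record'

# Connect to a local SQLite database (assumes the sqlite3 gem is installed).
ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: "scrape.db")

# Create a hypothetical table to hold scraped pages.
ActiveRecord::Schema.define do
  create_table :pages, force: true do |t|
    t.string :url
    t.text   :body
    t.timestamps
  end
end

class Page < ActiveRecord::Base; end

Page.create!(url: "http://example.com", body: "<html>...</html>")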
If you want to do this every hour or so, just use the whenever gem (it manages a cron job for you) or simply write an infinite loop with sleep(x) in it. There is more than one way to do it. :)
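And a minimal whenever schedule, assuming a hypothetical rake task named scrape that does the actual work:
# config/schedule.rb -- turned into a cron entry with `whenever --update-crontab`
every 1.hour do
  rake "scrape"   # hypothetical rake task that fetches and stores the data
end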

First of all, you need a proper operating system: use Linux, BSD, or macOS.
Windows will do for some people, but not for you as a Ruby developer; too many libraries need C extensions, which are a pain in the ass to compile under Cygwin.
I recommend installing a Ruby version manager so you can try out different Ruby versions; I prefer RVM, the Ruby Version Manager.
Install Ruby 1.9.3; it is the standard nowadays.
Through RubyGems, install the mechanize gem, which does pretty much all the website automation you will need. It is a successor of WWW::Mechanize from Perl.
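To give a feel for mechanize, here is a minimal form-automation sketch; the URL and the field names are hypothetical:
require 'mechanize'

agent = Mechanize.new
page  = agent.get("http://example.com/login")   # hypothetical login page
form  = page.forms.first
form["username"] = "me"                         # hypothetical field names
form["password"] = "secret"
result = agent.submit(form)
puts result.title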
Nokogiri would also be useful for parsing XML data such as (X)HTML, but remember that you need the libxml libraries installed on your system first.
Ah, according to your question:
Yes, you can read websites using Ruby; for example, read this very page (here using the httpclient gem):
require 'httpclient'   # gem install httpclient
http = HTTPClient.new
response = http.get "http://stackoverflow.com/questions/14235393/can-i-read-webpage-data-using-ruby"
puts response.body
Done

Related

Is there a way to know if my Ruby gem is being used?

I created a Ruby gem and I would like to know if people are using it after they download it.
There are a number of techniques you could write into your gem so that it sends you a notification whenever it's used, including information about the developer using the gem, the time it was used, the geographical location, (the code of the app they're developing) etc.
But there's a good reason why there's no common way of doing that.
On a personal note, if your gem were sending you notifications about its use, I would stop using it.
Gems aren't the same as applications - they are development tools. As such, it is expected that their code will perform the task for which they were designated and ONLY the task for which they were designated.
Good luck.

If I put my ruby code into a gem, is its source code secure?

It's my understanding that when I make a gem I'm compiling my Ruby code into some form of executable, right? Does this mean that unless someone used introspection techniques (which is an acceptable risk to me), my source code is secure?
A gem is not a compiled executable. It's not compiled at all. Ruby is interpreted. Creating a gem just bundles the necessary files together, much like a zip file or tar archive.
If you want your gem secure you should keep it out of rubygems.org. You can set up your own private gem server or you can just include your gem in projects that need it.
While it is possible to compile Ruby code into an executable or shared library using Ruby's C API, that has nothing to do with gems.
A gem is just a collection of Ruby code (which could be regular scripts or compiled libraries) in a nice package for use with the rubygems package manager. It makes no effort to hide/protect the code.
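To illustrate, a gemspec just lists plain files to package; nothing here compiles or obfuscates anything (this is a made-up, minimal example):
# my_gem.gemspec -- `gem build my_gem.gemspec` simply bundles the listed files as-is
Gem::Specification.new do |spec|
  spec.name    = "my_gem"
  spec.version = "0.1.0"
  spec.summary = "Example gem"
  spec.authors = ["Example Author"]
  spec.files   = Dir["lib/**/*.rb"]   # plain .rb source files, packaged unmodified
end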
I think
gem unpack
can extract your code. I've never tried it to see whether the result is "human-readable", but you can try it before publishing your gem ;)
It's my understanding that when I make a gem I'm compiling my ruby code into some form of executable, right?
No. It's just an archive with some metadata. The contents of the archive are exactly what you put into them.
Does this mean that unless someone used introspection techniques (which is an acceptable risk to me), my source code is secure?
This depends on what you mean by "secure", but is completely orthogonal to RubyGems.
If you mean that "it can't be stolen", then that is already guaranteed for you by copyright law. Unless you live in a really weird country, software is protected by copyright automatically from the moment you write it.
If you mean "cannot be reverse engineered", then that is impossible. If you want people to be able to run your program, then you must give it to them in a format that can be understood by the CPU. Humans are much smarter than computers, so, if the program can be understood by the CPU, then it can also be understood by a human.
There are two common ways around this, which I will call the "Nintendo way" and the "Google way".
The Nintendo way is to give the user the CPU as well as the program, therefore, the user's CPU doesn't have to understand it. However, that model is still flawed. As long as you give the user something, he can figure it out. In the end, it's all just maths and physics, which can be understood. And users are pretty clever. Note that, for example, most game consoles were not cracked by evil crime syndicates trying to steal the code or pirating games, no, they were cracked by students wanting to run Linux or BSD on their hardware.
The Google way is to give the user nothing. You type something in the search box, Google sends you back the results, but at no time does the software leave Google's datacenter.

Node.js or Ruby for Scraping

I am trying to make an application that requires a lot of data scraping from multiple websites. I tried scraping websites using Ruby but gems such as Mechanize only seem to scrape static pages and not dynamic content. I have a couple questions regarding which of these languages, or any other language, I should use for this project (I am considering using Node because quite a few elements in the application have to be in real time).
Is it possible to use Ruby and/or Node to scrape dynamic content? If so which tools specifically should be used?
If multiple users are going to be scraping from multiple sites, which language would you recommend using?
On a slightly unrelated note, is it possible to combine Node and Rails?
Thanks in advance!
You can use the Capybara gem to scrape JavaScript-heavy sites with Ruby.
This has the advantage of being able to use actual browsers such as Firefox, Chrome and IE through the selenium driver. Or you can use headless browsers such as webkit (via capybara-webkit) or phantomjs (via poltergeist).
When you use capybara, just be sure to use a javascript enabled driver, such as selenium or capybara-webkit. My driver of the day is poltergeist.
There are some instructions for how to use capybara with remote sites in their readme.
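A minimal sketch of using Capybara against a remote site outside of a test suite, along the lines of that readme; the site and link text are hypothetical:
require 'capybara'

Capybara.run_server = false                  # remote site, not a local Rack app

session = Capybara::Session.new(:selenium)   # JavaScript-capable driver (real browser)
session.visit "http://example.com/products"  # hypothetical page
session.click_link "Next"                    # hypothetical link text
puts session.html                            # rendered HTML, ready for Nokogiri or regexps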
Node vs. Ruby is a very open-ended question. My answer here suggests Ruby because that is where my experience and preference lie. "Combining" them could mean many things; they can be used in concert, each playing to its strengths.
When you say that mechanize can't scrape dynamic content, you really mean that it's a little more work to figure out which AJAX requests need to be made and make them. The other side of that is that once you do, you generally get a nice JSON response that's easy to deal with. Mechanize is also much faster than a full browser solution, so my opinion is that it's usually worth the extra work.
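For example, once you have found the AJAX endpoint in the browser's network inspector, hitting it with mechanize is straightforward; the endpoint and the field name here are hypothetical:
require 'mechanize'
require 'json'

agent = Mechanize.new
# hypothetical JSON endpoint discovered via the browser's network inspector
response = agent.get("http://example.com/api/items?page=1")
JSON.parse(response.body).each do |item|
  puts item["name"]   # hypothetical field
end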
As far as Node goes, there's potential, and maybe once it's been around for a while some great libraries will become available, but I haven't seen anything yet that would make up for the Ruby things I would miss.

safe browsing with ruby

Is there any usable Ruby code to interact with the Safe Browsing API from Google?
I searched on Google but didn't find any mature, solid code.
I have 3 points:
(0) I'd say that this looks alright, as does this.
(1) Having used quite a few ruby gems for various obscure things, I find bugs all the time. It helps the open source community and the world if you find a gem, fix a bug, and let the rest of the world benefit by submitting a pull request. Tests make the life of a contributor sooooooooooo much easier, and guarantee that your fix works, so use gems with extensive tests where possible, even if they are not mature and you half-expect them to fail.
(2) From experience, gems which have lots of objects encapsulating something can sometimes be counterproductive. This has tripped me up in the case of the Ruby mail gem and the tire gem (though that's not to say they are not good and incredibly useful gems). This applies to you if you only need to make one type of API call, say, and take a simple action. Using the simplest gem is sometimes advantageous, and for this purpose you might not need a gem at all! Just write a class that uses Net::HTTP to call the HTTP API: https://developers.google.com/safe-browsing/lookup_guide
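A minimal sketch of that no-gem approach, assuming the lookup-style GET endpoint and parameters described in the guide linked above (check the guide for the exact URL and parameter names):
require 'net/http'
require 'uri'

# Endpoint and parameters per Google's lookup guide; treat these as placeholders
# and verify them against the documentation.
uri = URI("https://sb-ssl.google.com/safebrowsing/api/lookup")
uri.query = URI.encode_www_form(
  client: "myapp",                 # your client name
  key:    ENV["GOOGLE_API_KEY"],   # your API key
  appver: "1.0",
  pver:   "3.1",
  url:    "http://example.com/"    # the URL to check
)

response = Net::HTTP.get_response(uri)
puts response.code                 # per the guide: 200 = on a list, 204 = not listed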

Anything better than ruby alchemy for extracting keywords?

I've currently written an algorithm in Ruby based on the arc90 readability code to extract an article from a web page.
Now that I have the article, I want to extract keywords and specific information from it (names, author, etc)
I heard Alchemy was a great ruby gem for doing this though it consumes a lot of resources. Are there any better gems I can use for this?
A fast, lightweight and easy-to-use gem for extracting keywords from longer content:
https://rubygems.org/gems/highscore
I use it in production; it works like a charm.
The question is a bit older, but I'll leave this here for others who come across it from Google.
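A minimal sketch of the highscore API as I remember it from the gem's readme (double-check the method names against its documentation):
require 'highscore'

# Score the words in a piece of text and print the top keywords with their weights.
text = Highscore::Content.new "long article text goes here..."
text.keywords.top(10).each do |keyword|
  puts "#{keyword.text} #{keyword.weight}"
end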
There is an OpenCalais gem which provides similar capability. In addition to entity extraction it can also detect events and relations between entities. It's not lightweight, though I couldn't tell if it's better or worse than Alchemy as I haven't used the Alchemy gem. Hope this helps.
