I need to test links inside a web application. I have looked at a couple tools (Xenu, various browser plugins, link-checker(ruby)). Nothing quite fits my needs which I will detail below.
I need to get past a login form
test needs to be rerun for different types of users (multiple sets of login credentials)
would like to automate this under a ci server (Jenkins)
the ability to spider the site
Does anyone have any ideas? Bonus if I can use Ruby to do this!
What you are asking for is beyond most of the test tools once you throw in the ability to spider the site.
That final requirement pushes you into the realm of hand-coding something. Using the Mechanize gem you could do all those things, but, you get to code a lot of the navigation of the site.
Mechanize uses Nokogiri internally, so it's easy to grab all links in a page, which you could store in a database to be checked by a different thread, or some subsequent code. That said, writing a spider is not hard if you're the owner of the pages you're hitting, because you can be pretty brutal about accessing the server and let the code run at full speed without worrying about being banned for excessive bandwidth use.
Related
I am playing a Flash-only game that uses AJAX to communicate with the server. The problem is that all the data is "drawn" and most of it not copy/pastable, so I end up retyping URLs and similar stuff from parts of it (i.e., from the chat).
I thought I'd make a simple page action extension for Chrome that would intercept all the AJAX communication between the game and the server, the way Developer tools can do it, and display only the data I'm interested in (parsing URLs and similar stuff is a no-brainer).
However, looking around the internet, I've found no info on how to do this. Many sites (including answers to some questions here) mention using Developer Tools (I'd prefer having a page action extension, simple enough to share with other players, but any other automation is welcome as well), some mention chrome.webRequest (which seems to be able to provide only the headers),...
I also thought of making a content script along the lines of this answer, but since I'm trying to read the data between a Flash applet (not a web page) and a server, I don't think injecting a JavaScript code is possible.
So, my question is: can this be done and, if yes, how?
In case anyone got the wrong idea, the aim of this is only to monitor the communication and extract the parts I'd want to be able to copy/paste, not change any data (i.e., the purpose is simplification of the game play, not cheating).
I know this is an open ended question, but hopefully it will get some good answers before the thread is locked...
I'm wondering what methods there are to programmatically check (language agnostic) if a website is online from a client perspective (assume you can't make changes to the site/server, but you can rely on certain behaviours of the site.)
The result of each method could stack to provide a measure of certainty that the site is up/down - that is, a method does not have to provide a definite indication if the site is up/down on its own.
Some common tests just to check 'upness' may be:
Ping the site (which in the case of shared hosting isn't very
indicative)
Send a http head/get request and check the status
Others I can think of to check that the site is up and functioning:
Check you received a well formed html response i.e. html to html
tags, if the site is experiencing trouble it may spit an error and
exit without writing the rest of the page (not all that reliable
though because the site may handle most errors in a better way)
Check certain content is or is not on the page, i.e. perhaps there is some content that is always present on your pages, or always present in the case of an error
Can anybody think of any other methods that could be used to help determine if a site is in fact up/down and functioning/not functioning correctly from within a program?
If your get request on a page that displays info from database comes back with status 200 and matching keywords are found, you can be pretty certain that your site is up and running.
And you don't really need to write your own script to do that. There are free services such as GotSiteMonitor, Pingdom, UptimeRobot etc. allows you to monitor your site.
Based your set of test on the unit tests priciple. It is normally used in programming to test classes, modules or other artefacts after changes have been made. You can use any of the available frameworks, so don't have to reinvent the wheel. You must describe (implement) tests to be run, in your case a typical test should request a url inside the page and then do some evaluations like:
call result (for example return code of curl execution)
http return code
http headers
response mime type
response size
response content (test against a regular expression)
This way you can add, remove and modify single tests without having to care about the framework, once you are up. You can also chain tests, so perform a login in one test and virtually click a button in subsequent test.
There are also tools to handle such test runs automatically including visualization of results, statistics and the like.
OK, it sounds like you want to test and monitor your website from a customer experience perspective rather than purely establishing if a server is up (using ping for example). An effective way to replicate the customer experience is to simulate tests against the site using one of the headless browser testing tools (phantomJS is great a great choice) as they will render the page fully (including images, CSS, JS etc.) giving you a real page load time. These tools also allow you to make assertions on all aspects of the HTML content and HTTP response.
pingdom recently started offering a (paid for) service to perform these exact types of checks for alongside their existing monitoring solution. The demo is worth looking at, their interface for writing the actual tests is very nice.
What's the best way to profile my RoR website http://www.karmabee.net?
I'm using the fb_graph GEM which is pretty slow, especially when retrieving friends lists. Twilio is also pretty slow when sending SMS texts.
So I'm not sure I could optimize those things. In any case, I need to figure out how to profile the site first.
Any ideas?
NewRelic: http://newrelic.com/ It looks into your rails app and tells you how much time each request spends on db queries, page rendering etc. From there you can drilldown to the bottleneck and work on optimizations.
http://www.webpagetest.org/ is handy for general page speed testing.
Chrome comes with the Audits tool(right click, inspect element -> Audits tab) which you can test any webpage's webpage performance and network utilization. Firefox has an addon YSlow does something similar.
Not sure how interaction with twilio can be profiled...
I really like request log analyzer, just do:
gem install request-log-analyzer
Then on your production box you can do something like:
request-log-analyzer log/production.log
It'll tell you all sorts of things like which controllers and actions are slow, etc, give it a try!
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Improve this question
I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.
Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.
I used to write spiders, page scrapers and site analyzers for my job, and still write them periodically to scratch some itch I get.
Ruby has some excellent gems to make it easy:
Nokogiri is my #1 choice for the HTML parser. I used to use Hpricot, but found some sites that made it explode in flames. I switched to Nokogiri afterwards and have been very happy with it. I regularly use it for parsing HTML, RDF/RSS/Atom and XML. Ox looks interesting too, so that might be another candidate, though I find searching the DOM a lot easier than trying to walk through a big hash, such as what is returned by Ox.
OpenURI is good as a simple HTTP client, but it can get in the way when you want to do more complex things or need to have multiple requests firing at once. I'd recommend looking at HTTPClient or Typhoeus with Hydra for modest to heavyweight jobs. Curb is good too, because it uses the cURL library, but the interface isn't as intuitive to me. It's worth looking at though. HTTPclient is also worth looking at, but I lean toward the previously mentioned ones.
Note: OpenURI has some flaws and vulnerabilities that can affect unsuspecting programmers so it's fallen out of favor somewhat. RestClient is a very worthy successor.
You'll need a backing database, and some way to talk to it. This isn't a task for Rails per se, but you could use ActiveRecord, detached from Rails, to talk to the database. I've done that a couple times and it works all right. Instead, I really like Sequel for my ORM. It's very flexible in how it lets you talk to the database, from using straight SQL to using Sequel's ability to programmatically build a query, to modeling the database and using migrations. Once you have the database built, you could use Rails to act as a front-end to the data though.
If you are going to navigate sites in any way beyond simply grabbing pages and following links, you'll want to look at Mechanize. It makes it easy to fill out forms and submit pages. As an added bonus, you can grab the content of a page as a Nokogiri HTML document and parse away using Nokogiri's multitude of tricks.
For massaging/mangling URLs I really like Addressable::URI. It's more full-featured than the built-in URI module. One thing that URI does that's nice is it has the URI#extract method to scan a string for URLs. If that string happened to be the body of a web page it would be an alternate way of locating links, but its downside is you'll also get links to images, videos, ads, etc., and you'll have to filter those out, probably resulting in more work than if you use a parser and look for <a> tags exclusively. For that matter, Mechanize also has the links method which returns all the links in a page, but you'll still have to filter them to determine whether you want to follow or ignore them.
If you think you'll need to deal with Javascript manipulated pages, or pages that get their content dynamically from AJAX, you should look into using one of the WATIR variants. There are flavors for the different browsers on different OSes, such as Firewatir, Safariwatir and Operawatir, so you'll have to figure out what works for you.
You do NOT want to rely on keeping your list of URLs to visit, or visited URLs, in memory. Design a database schema and store that information there. Spend some time up front designing the schema, thinking about what things you'll want to know as you collect links on a site. SQLite3, MySQL and Postgres are all excellent choices, depending on how big you think your database needs will be. One of my site analyzers was custom designed to help us recommend SEO changes for a Fortune 50 company. It ran for over three weeks covering about twenty different sites before we had enough data and stopped it. Imagine what would have happened if we had a power-outage and all that data went in the bit-bucket.
After all that you'll want to also make your code be aware of proper spidering etiquette: What are the key considerations when creating a web crawler?
I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on github https://github.com/felipecsl/wombat
It is still in an early stage but is already functional with basic functionality. More stuff will be added really soon.
So you want a good Ruby-based web crawler?
Try spider or anemone. Both have solid usage according to RubyGems download counts.
The other answers, so far, are detailed and helpful but they don't have a laser-like focus on the question, which asks for ruby libraries for web crawlers. It would seem that this distinction can get muddled: see my answer to "Crawling vs. Web-Scraping?"
Tin Man's comprehensive list is good but partly outdated for me.
Most websites my customers deal with are heavily AJAX/Javascript dependent.
I've been using Watir / watir-webdriver / selenium for a few years too, but the overhead of having to load up a hidden web browser on the backend to render that DOM stuff just isn't viable, let alone that all this time they still haven't implemented a useable "browser session reuse" to let new code execution reuse an old browser in memory for this purpose, shooting down tickets that might have worked their way up the API layers eventually. (refering to https://code.google.com/p/selenium/issues/detail?id=18 ) **
https://rubygems.org/gems/phantomjs
is what we're migrating new projects over to now, to let the necessary data get rendered without even any sort of invisible Xvfb memory & CPU heavy web browser.
** Alternative approaches also failed to pan out:
how to serialize an object using TCPServer inside?
Can a watir browser object be re-used in a later Ruby process?
If you don't want to write your own, then use any ordinary web crawler. There are dozens out there.
If you do want to write your own, then write your own. A web crawler isn't exactly a complicated activity, it consists of:
Downloading a website.
Locating URLs in that website, filtered however you dang well please.
For each URL in that website, repeat step 1.
Oh, and this seems to be a duplicate of "Web crawler in ruby".
My company has multiple vendors that all have their own websites. I am creating a website that acts as a dashboard where customers can access all of the vendor's sites. I wanted to know what is the best option for doing this?
Here's what I have so far:
Iframe
Can bring in the entire website
Seems secure enough (not sure if I'm missing any information on security issues for this)
Users can interact with the vendor's website through our site
Our website cannot fully interact with the vendor's website (Also may be missing info here)
Pulling in the content
Can bring in the entire website
Not very secure from what I hear (Some websites actually say that pulling another website in is a voilation of security and will alert the user of this or something similar...
Users can interact with their website through our site
Our website can fully interact with the vendor's website
Anyone have any other options...?
What are some of the downsides to bringing in a site with an iframe and is this really our only option for doing something like this?
Optimally, we would like to pull in their site to ours without using an iframe- What options do we have on this level? Is there anything better than an iframe?
Please add in as much information as you can about iframes, pulling content, security, and website interactions like this. Anything to add in is appreciated.
Thanks,
Matt
As far as "pulling content" is concerned I wouldn't advise it as it can break. All it takes is a simple HTML change on their end and your bot will break. Also, it's more work than you think to do this for one site, let alone the many that you speak of. However, there are 3rd party apps that can do this for you if you have the budget.
You could use an iframe/frames, however, many sites might try to bust out of them and it can ruin the user experience of the site within the frame.
My advice is to use the following HTML for each link in your dashboard.
Vendor Site Link
If you can have the sites that you are embedding add some client-side script, then you could use easyXSS. It allows for easy transferring of data, and also calling javascript methods across the domain boundry.
I would recommend iFrames. Whilst not the most glamorous of elements, many payment service providers use iFrames for the Verified by Visa/Mastercard Secure Code integration.