A crawler with smart support for sitemaps?

A crawler with smart support for sitemaps? - sitemap

I'm trying to crawl a few hundred sites regularly. I would like to do this in the most efficient and consistent way possible. To do so it seems like the correct approach would be to use sitemaps where ever they are found. So first of all I am looking for a crawler that recognizes sitemaps and knows how to use them. The second issue, is how smart the crawler is. While some sites maintain their sitemaps perfectly, many do not. Their sitemaps may be out of date or in a non-standard format. Each situation needs a different approach.
So the question is whether this has been tackled in the open source (or commercial world)? Are there projects that do this well or well enough? I looked at a few of the open source crawlers that I identified and couldn't find this level of site crawling intelligence. If the answer is no. Are there any other good resources on this problem?

Our open source crawler Norconex HTTP Collector does support sitemaps. I do not know about the non-standard formats you've encountered, but it detects sitemaps at different locations (website root, listed in robots.txt, user-provided). It supports sitemap sub-indices as well as compressed sitemaps. Have a try and give your feedback if you'd like to suggest new features. If you know your way around Java, you can also swap the default sitemap-resolving implementation with your own.

Related

What is a good open source package for building flexible spam detection on a large Rails site?

My site is getting larger and it's starting to attract a lot of spam through various channels. The site has a lot of different types of UGC (profiles, forums, blog comments, status updates, private messages, etc, etc). I have various mitigation efforts underway, which I hope to deploy in a blitzkrieg fashion to convince the spammers that we're not a worthwhile target. I have high confidence in what I'm doing functionality wise, but one missing piece is killing all the old spam all at once.
Here's what I have:
Large good/bad corpora (5-figure bad, 6 or 7-figure good). A lot of the spam has very reliable fingerprints, and the fact that I've sort of been ignoring it for 6 months helps :)
Large, modular Rails site deployed to AWS. It's not a huge traffic site, but we're running 8 instances with the beginnings of a SOA.
Ruby, Redis, Resque, MySQL, Varnish, Nginx, Unicorn, Chef, all on Gentoo
My requirements:
I want it to perform reasonably well given the volume of data (therefore I'm wary of a pure ruby solution).
I should be able to train multiple classifications to different types of content (419-scam vs botnet link spam)
I would like to be able to add manual factors based on our own detective work (pattern matching, IP reuse, etc)
Ultimately I want to construct a nice interface to be used with Ruby. If this requires getting my hands dirty in C or whatever, I can handle it, but I'll avoid it if I can.
I realize this is a long and vague question, but what I'm looking for primarily is just a list of good packages, and secondarily any random thoughts from someone who has built a similiar system about ways to approach it.

We looked for an acceptable open source solution and didn't find one.
If you come to the same conclusion and decide to consider proprietary anti-spam, check out the paid Akismet collaborative spam filtering service. We've had decent performance from it across a dozen medium sized sites. It integrates with rails through rack and rackismet.

Resources containing cross-language benchmarks?

What resources are available that use benchmarks for comparing programming languages?
I am interested in both
How quickly a program in a given language can execute a given benchmark?
How many lines of code are required in a given language to implement a given benchmark?
There is a long-standing web site called the Computer Language Benchmarks Game, originally created by Doug Bagley as the "Great Computer Language Shootout". (You can view a little history at Portland Patterns Repository.)
Is anyone aware of other resources that enable programmers to compare performance and size of programs written in different languages?

Alternatives
After a quick google search, I found a couple other sites where benchmarks for various languages have been done. Some other sites mention the programming language shootout site that is currently down.
There is a CPAN module for Perl that uses the same code found on that site.
Google has a directory where pages on this topic can be found. I have not found any yet that are as comprehensive as the page you speak of, but there are certainly other resources out there for comparisons.
Archived / Cached Page
If you're only seeking some information there, you can view archived pages of the site using the Wayback Machine or Google's cached version. Try searching Google with "site: shootout.alioth.debian.org" and click on the "Cached" links for the pages you find.
Find the Author?
Perhaps the best option is to try to contact the owner of the old site and find out what happened. The author mentioned in the BSD licence on this page is "Brent Fulgham". He may or may not be the one to contact.
Wait until Alioth is Fixed
As #ioguy found out, Debian's Alioth server that hosts the site in question is currently under maintenance. I would suggest subscribing to the debian-devel-announce mailing list for updates, and an idea of when it may be fully functional again.
If you find problems in the future, you can probably post to the debian-user list.

Each year there are two or three
isolated blog posts that claim to
compare performance and size of one
or two programs written in different
languages.
As a resource the blog posts fail for obvious
reasons, most obviously:
not updated with newer versions of the language implementation
not updated with better programs
Every couple of years someone
dissatisfied with something about
the benchmarks game (often some
detail about the code repository or
website technology) starts a project that will
fix everything they dislike about the benchmarks game.
As a resource the most obvious problem with those
projects is that they never seem to get
close to publishing performance
data.
Every year some group of programmers
campaigns to have language X
included in the benchmarks game,
while some other group demands that
some program is included (or
excluded).
Sadly, they rarely accept that among
the resources provided by the
benchmarks game are
scripts they can use to make and publish language performance
measurements
examples of which basic information (language version, build
commands, run commands, measurement
techniques, ...) is required to provide context for the measurements.
They rarely accept that they are
empowered to create what they wish
to see.

The benchmarks game website is now back to normal!
From Friday 20 May 2011 through Monday 23 May 2011, ALL alioth.debian.org subdomains were down - because the alioth admins were upgrading "in every way we can find: kernel, Debian release, FusionForge software, hardware, and so on."
In addition, making the benchmarks game website work again required:
installation of the GD library on the new server, for chart generation
basic information about changes to ssh use on the new servers
basic information about the project cvs repository on the new servers
basic information about the project /htdocs location on the new servers
replacement of the long deprecated
$HTTP_GET_VARS by $_GET in a couple
of dozen PHP scripts
Since the performance benchmark site
for Programming Languages (aka
Programming Language "Shootout" &
shootout.alioth.debian.org) is
permanently down ...
The original question was predicated on a false premise.

How to implement simple online management for a book library?

At my institution, we have a small library with 150 books and 50
users. We would like to use a simple online management system that
displays the books, lets users search and enter when they get and
return a book. (There is no librarian, the books are just in an
otherwise empty room.)
I'm not familiar with modern web content management systems. In the
old days, I would have just implemented a quick Perl/CGI script, but I
think there are better options nowadays?
What would be the simplest way to get/implement such a system? Django?
Ruby on Rails? Ideally, I'd like to just run it in my user account
without having to install database support etc.
Is it possible to do everything on one dynamic HTML page? What role
does AJAX play in such a system?

I suggest take a look at the available open source tools for libraries before deciding to build one from scratch:
http://www.libsuccess.org/index.php?title=Open_Source_Software#Great_Free.2FOpen_Source_Tools_for_Libraries
 
Another good resource in your research: http://www.oss4lib.org/
 
If you find an existing tool that fits the bill (or enough to make it worth extending), that will be important in guiding what platform/language/framework and techniques will be best to use.

If you want a quick and easy solution, you might want to consider using SQLite as the database backend, since it does not require any configuration or setup (except for the tables, of course).
If you have a machine standing around there, you could take a look at Qt/C++ or PyQt to create a simple user interface.
Pylons (there are lots of alternatives!) or any other web framework might do the job as well, but I guess it would be more work to create a web application than a quick and simple desktop application for this job.

This is quite a complicated question doesn't have a simple answer. The best I can do is point you in the direction of some resources to get you started:
Framework/CMS
Unfortunately, most frameworks require at least some minimal kind of db interaction. While this is not true for all, it would probably be easiest to steer clear of a framework, you probably don't need that much overhead anyway.
Javascript/AJAX
If you want things to happen without any seperate pageloads, then sure, you can use some ajax. However, you probably don't need anything this sophisiticated
How I Would Do It
If you really trusted your students enough to be diligent about checking in/out books, I think it would be easiest to just have a form on a webpage somewhere that they could enter the number of the book they are checking in/out. Then store the state of each book in a text file somewhere (you said you didn't want to use any db's), or even look into sqlite.
Again, you probably don't need all the overhead of a full framework/CMS. It would be fairly trivial to, as you said, write a quick script to handle the ISDN, ID, Title, Whatever of the book they are checking in/out.
Also, there are significantly easier languages to write scripts in these days than Perl and CGI. Try PHP, Ruby, or Java

Tracking Useful Information

What do the clever programmers here do to keep track of handy programming tricks and useful information they pick up over their many years of experience? Things like useful compiler arguments, IDE short-cuts, clever code snippets, etc.
I sometimes find myself frustrated when looking up something that I used to know a year or two ago. My IE favorites probably represent a good chunk of the Internet in the late 1990s, so clearly that isn't effective (at least for me). Or am I just getting old?
So.. what do you do?

Two Things I do:
I blog about it - this allows me to go back and search my own blog.
We use the code snippet feature in Visual Studio.
Cheers.

I use:
Google Notebook - I take notes for projects, books I'm reading, etc
Delicious + Firefox plug in - Every time I see a good page I mark it.
Windows Journal (in tablet pc) - When I need to draw something and then copy/cut/paste it. I have more distractions here, the web is always very close :)
Small Moleskine paper notebook - Its always with me.
Big paper notebook - When I need more space to write and less distractions.
Obviously these are for all useful information, not just for snippets or tips and tricks.

Why not set up a Wiki?
If you are on windows, i know that ScrewTurn wiki is pretty simple to deploy on a desktop/laptop. No database to fuss around with.

Blog about it.
One of the nice side-effects of blogging is that if you use a sensible categorization or tagging system, it's quite easy to search for stuff within your blog. The fact that you wrote about it also makes it easier to remember problems you have encountered before ("hey, I blogged about that!").
That's a great benefit aside from, of course, being able to share this information publicly so that others might be able to find your solution to a particular problem using Google.

A number of people I know swear by Google Notebook

I send them to my gmail account, that way I have them where ever I go, and they can be put into appropriate folders for later.

I second the blog about it technique...even Jeff said that's a major reason he blogs.
Also, regarding the wiki idea, if you set one up at work, be sure to encourage your coworkers to do the same. When someone finds something of interest they can just write a little "article" explaining what it is and how to do it... that way, not only are your own things easily available and quickly searchable, but you'll often find out things you never knew from other people in your group. That way it benefits everyone not just you.

I agree with emailing, the wiki and the blog. Emailing is the most useful. If you can't use GMail and you're on windows, install a desktop search utility (Windows search, Google Desktop, Copernic, etc)
I also like to jot it into a textfile and save it in my documents folder. Whatever desktop search utility you use will be able to find it easily. e.g.
//print spool stop.notes.txt
If the printer spooler stops, start it again by
- Services > Provision Networks > Restart Service
tags: printer provision no printer spooler cannot print remote desktop

Subscribe in Google Reader and then search later.

At my last place of work they wouldn't let me set up a wiki or anything - so I just made various word documents full of tips and instructions and gave that to my successor when I left.
Now though I'd use a private wiki, or maybe a blog.

For many years I've kept a Word doc named Knowledgebase.doc that contains all my notes with a decent table of contents. I like to keep everything in one searchable doc.
I use a sync tool to make sure the file is copied to all the machines I want it on.

I use TiddlyWiki stored in my DropBox account. Although, recently, Evernote is getting my atention; it has a really useful feature: you send a twitter direct message to evernote user (myen) and it adds a note with your message (a really quick way to add notes or URL's for post-processing). Imagine, you can use a command-line twitter client to create notes! (or any twitter client). I really like this feature.

Windows Help files - what are the options?

Back in the old days, Help was not trivial but possible: generate some funky .rtf file with special tags, run it through a compiler, and you got a WinHelp file (.hlp) that actually works really well.
Then, Microsoft decided that WinHelp was not hip and cool anymore and switched to CHM, up to the point they actually axed WinHelp from Vista.
Now, CHM maybe nice, but everyone that tried to open a .chm file on the Network will know the nice "Navigation to the webpage was canceled" screen that is caused by security restrictions.
While there are ways to make CHM work off the network, this is hardly a good choice, because when a user presses the Help Button he wants help and not have to make some funky settings.
Bottom Line: I find CHM absolutely unusable. But with WinHelp not being an option anymore either, I wonder what the alternatives are, especially when it comes to integrate with my Application (i.e. for WinHelp and CHM there are functions that allow you to directly jump to a topic)?
PDF has the disadvantage of requiring the Adobe Reader (or one of the more lightweight ones that not many people use). I could live with that seeing as this is kind of standard nowadays, but can you tell it reliably to jump to a given page/anchor?
HTML files seem to be the best choice, you then just have to deal with different browsers (CSS and stuff).
Edit: I am looking to create my own Help Files. As I am a fan of the "No Setup, Just Extract and Run" Philosophy, i had that problem many times in the past because many of my users will run it off the network, which causes exactly this problem.
So i am looking for a more robust and future-proof way to provide help to my users without having to code a different help system for each application i make.
CHM is a really nice format, but that Security Stuff makes it unusable, as a Help system is supposed to provide help to the user, not to generate even more problems.

HTML would be the next best choice, ONLY IF you would serve them from a public web server. If you tried to bundle it with your app, all the files (and images (and stylesheets (and ...) ) ) would make CHM look like a gift from gods.
That said, when actually bundled in the installation package, (instead of being served over the network), I found the CHM files to work nicely.
OTOH, another pitfall about CHM files: Even if you try to open a CHM file on a local disk, you may bump into the security block if you initially downloaded it from somewhere, because the file could be marked as "came from external source" when it was obtained.

I don't like the html option, and actually moved from plain HTML to CHM by compressing and indexing them. Even use them on a handful of non-Windows customers even.
It simply solved the constant little breakage of people putting it on the network (nesting depth limited, strange locking effects), antivirus that died in directories with 30000 html files, and 20 minutes decompression time while installing on an older system, browser safety zones and features, miscalculations of needed space in the installer etc.
And then I don't even include the people that start "correcting" them, 3rd party product with faulty "integration" attempts etc, complaints about slowliness (browser start-up)
We all had waited years for the problems to go away as OSes and hardware improved, but the problems kept recurring in a bedazzling number of varieties and enough was enough. We found chmlib, and decided we could forever use something based on this as escape with a simple external reader, if the OS provided ones stopped working and switched.
Meanwhile we also have an own compiler, so we are MS free future-proof. That doesn't mean we never will change (solutions with local web-servers seem favourite nowadays), but at least we have a choice.

Our software is both distributed locally to the clients and served from a network share. We opted for generating both a CHM file and a set of HTML files for serving from the network. Users starting the program locally use the CHM file, and users getting their program served from a network share has to use the HTML files.
We use Help and Manual and can thus easily produce both types of output from the same source project. The HTML files also contain searching capabilities and doesn't require a web server, so though it isn't an optimal solution, works fine.
So far all the single-file types for Windows seems broken in one way or another:
WinHelp - obsoleted
HtmlHelp (CHM) - obsoleted on Vista, doesn't work from network share, other than that works really nice
Microsoft Help 2 (HXS) - this seems to work right up until the point when it doesn't, corrupted indexes or similar, this is used by Visual Studio 2005 and above, as an example

If you don't want to use an installer and you don't want the user to perform any extra steps to allow CHM files over the network, why not fall back to WinHelp? Vista does not include WinHlp32.exe out of the box, but it is freely available as a download for both Vista and Server 2008.

It depends on how import the online documentation is to your product, a good documentation infrastructure can be complex to establish but once done it pays off. Here is how we do it -
Help source DITA compilant XML, stored in SCC (ClearCase).
Help editing XMetal
Help compilation, customized Open DITA Toolkit, with custom Perl/Java preprocessing
Help source cross references applications resources at compile time, .RC files etc
Help deliverables from single source, PDF, CHM, Eclipse Help, HTML.
Single source repository produces help for multiple products 10+ with thousands of shared topics.
From what you describe I would look at Eclipse Help, its not simple to integrate into .NET or MFC applications, you basically have to do the help mapping to resolve the request to a URL then fire the URL to Eclipse Help wrapper or a browser.

Is the question how to generate your own help files, or what is the best help file format?
Personally, I find CHM to be excellent. One of the first things I do when setting up a machine is to download the PHP Manual in CHM format (http://www.php.net/download-docs.php) and add a hotkey to it in Crimson Editor. So when I press F1 it loads the CHM and performs a search for the word my cursor is on (great for quick function reference).

If you are doing "just extract and run", you are going to run in security issues. This is especially true if you are users are running Vista (or later). is there a reason why you wanted to avoid packaging your applications inside an installer? Using an installer would alleviate the "external source" problem. You would be able to use .chm files without any problems.
We use InstallAware to create our install packages. It's not cheap, but is very good. If cost is your concern, WIX is open source and pretty robust. WIX does have a learning curve, but it's easy to work with.

PDF has the disadvantage of requiring the Adobe Reader
I use Foxit Reader on Windows at home and at work. A lot smaller and very quick to open. Very handy when you are wondering what exactly a80000326.pdf is and why it is clogging up your documents folder.

I think the solution we're going to end up going with for our application is hosting the help files ourselves. This gives us immediate access to the files and the ability to keep them up to date.
What I plan is to have the content loaded into a huge series of XML files, each one containing help for a specific item. This XML would contain links to other XML files. We would use XSLT to display the contents as necessary.
Depending on the licensing, we may build a client-specific XSLT file in order to tailor the look and feel to what they need. We may need to be able to only show help for particular versions of our product as well and that can be done by filtering out stuff in the XSLT.

I use a commercial package called AuthorIT that can generate a number of different formats, such as chm, html, pdf, word, windows help, xml, xhtml, and some others I have never heard of (does dita ring a bell?).
It is a content management system oriented towards the needs of technical documentation writers.
The advantage is that you can use and re-use the same content to build a set of guides, and then generate them in different formats.
So the bottom line relative to the question of choosing chm or html or whatever is that if you are using this you are not locked into a given format, but you can provide several among which the user can choose, and you can even add more formats as you go along, at no extra cost.
If you just have one guide to create it won't be worth your while, but if you have a documentation set to manage then it is the best to my knowledge. Their support is very helpful also.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio