Extracting a Hostname's TLD with a Regular Expression - ruby

Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.
Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:
http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat
I have two specific questions:
Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.
If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.
Here's my take on converting this file into a usable Regexp in Ruby:
TLD_SPEC = Regexp.new(
'[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
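].split(/\n/).collect do |line|
  # Drop // comments and trailing whitespace.
  line.sub(%r[//.*], '').sub(/\s+$/, '')
end.reject(&:empty?).collect do |s| # empty? avoids needing ActiveSupport's blank?
  # Turn a leading "*." wildcard rule into "any single label".
  Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
end.join('|') + ')$'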
)

You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.
From the docs:
Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds support for IRIs and URI templates. Additionally, it provides extensive support for URI templates.
With the recent opening of the new TLDs, it's going to be a nightmare for a while. The list of related questions shows how many people are trying to find a solution. The answer to "Regex to match Domain.CCTLD" recommends using a function to break the problem down into smaller steps, which is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.
That answer mentions the public TLD list. Using the information there, you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't, so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)
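As a rough sketch of that idea, assuming a tiny inline sample of rules instead of the full downloaded list. SAMPLE_RULES, build_suffix_regexp, and registered_domain are hypothetical names, and exception rules (lines starting with "!") are simply skipped, so this is not a complete implementation of the list's semantics:

```ruby
# Hypothetical sample in the same line format as the public list;
# a real run would read the downloaded file instead.
SAMPLE_RULES = <<~RULES
  // comments start with two slashes
  com
  uk
  co.uk
  *.ck
RULES

# Build one regexp that captures "label.suffix" at the end of a hostname.
def build_suffix_regexp(rules_text)
  patterns = rules_text.each_line
    .map { |line| line.sub(%r{//.*}, '').strip }
    .reject { |rule| rule.empty? || rule.start_with?('!') }
    .map do |rule|
      if rule.start_with?('*.')
        # "*.ck" means any single label followed by ".ck"
        /[^.]+\.#{Regexp.escape(rule[2..-1])}/
      else
        Regexp.new(Regexp.escape(rule))
      end
    end
  /(?:\A|\.)([^.]+\.#{Regexp.union(patterns)})\z/
end

SUFFIX_REGEXP = build_suffix_regexp(SAMPLE_RULES)

# Returns the registered domain, or nil when no rule matches.
def registered_domain(hostname, suffix_regexp = SUFFIX_REGEXP)
  hostname.downcase[suffix_regexp, 1]
end
```

Because the regexp engine returns the leftmost match, the longest matching suffix wins: registered_domain('www.example.co.uk') picks the co.uk rule rather than the shorter uk rule.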

There is another flat-file db here at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java
Perhaps you could combine the two and upload the result to somewhere like OData.org, GitHub, SourceForge, etc.

There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.

Related

How can I scrape, parse and crawl files in Ruby?

I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially an SQLite db for development, and then pushed up to another, more permanent, storage.
These files will be pulled in actually via HTTP from the remote server and I will have to crawl a bit to get some of it as they span folders and many files.
I was hoping to get some input on what the best tools and methods may be to accomplish this the "Ruby way", as well as to abstract out some of this. Otherwise, I'll probably tackle it much as I would in Perl, or with other approaches I've taken before.
I was thinking along the lines of using OpenURI to open each url, then if input is HTML collect links to crawl, otherwise process the data. I would use String.scan to break apart the file appropriately each time into a multi-dimensional array parsing each component based on the established formatting by the data provider. Upon completion, push the data into the database. Move on to next input file/uri. Rinse and repeat.
I figure I must be missing some libs that those with more experience would use to clean/quicken this process up dramatically and make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should too be considered.
Any input as to perhaps a better approach or libs to simplify this?
Your question focuses a lot on "low level" details -- parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking this, I did a quick search and found a nice tutorial Scraping a blog with Anemone and MongoDB.
I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages.
URI.extract makes it easy to find links in pages. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]
Simple, untested, logic to start might look like:
require "open-uri"
require "uri"
urls_to_scan = %w[
  http://www.example.com/page1
  http://www.example.com/page2
]
loop do
  break if urls_to_scan.empty?
  url = urls_to_scan.shift
  # URI.open is provided by open-uri; Kernel#open no longer fetches URLs
  # as of Ruby 3.0.
  html = URI.open(url).read
  # you probably want to do something to make sure the URLs are not
  # pointing outside the site you're walking.
  #
  # Something like:
  #
  # URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
  #
  new_urls = URI.extract(html)
  if new_urls.any?
    urls_to_scan += new_urls
  else
    # parse your file as data using the content in html
  end
end
Unless you own the site you're crawling, you want to be kind and gentle: Don't run as fast as possible because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.
There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS selectors to search for tags of interest. There are also some pretty powerful gems, such as typhoeus, that make it easy to grab pages.
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
I would start by taking a very close look at the Mechanize gem before firing up any basic open-uri code, because that functionality is built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty unusual (at least compared to JSON, XML, or HTML), I don't think you will get much use out of the built-in parser, but you could still take a look at it. It's called Nokogiri and is extremely smart as well. In the end, though, after crawling and fetching the resources, you will probably have to fall back on good old regular expressions.
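A minimal sketch of that last step, assuming the layout described in the question: header lines start with ":", remark and column-header lines start with "#", and everything else is a whitespace-separated data row. SAMPLE and parse_data_file are hypothetical names, and the sample content is invented purely for illustration:

```ruby
# Invented sample that follows the layout described in the question.
SAMPLE = <<~DATA
  :station ALPHA
  :units metric
  # remarks about the file
  # id temp note
  1001 12.5 ok
  1002 13.1 afp
DATA

# Split the text into headers, remarks, and rows of fields.
def parse_data_file(text)
  result = { headers: [], remarks: [], rows: [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    case line
    when /\A:(.*)/ then result[:headers] << $1.strip
    when /\A#(.*)/ then result[:remarks] << $1.strip
    else result[:rows] << line.split # split on runs of whitespace
    end
  end
  result
end

parsed = parse_data_file(SAMPLE)
```

Each entry in parsed[:rows] is an array of strings, which can then be split further or converted before being written to the database.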
Good luck!

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) to pull out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable. Eg. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you want an easy way of achieving this. I've recently dived into the world of NLP (Natural Language Processing) and text mining, and it's a daunting field, most of which went far above my head.
I did manage to code some functionality that resembles what you're looking for, though I did it in PHP. What I would suggest is that, if you want it tailored to your project (Wrangl), you do it yourself.
Use the Porter stemming algorithm, which I'm sure there will be Ruby code for.
Ruby Porter stemmer
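A minimal sketch of the do-it-yourself route, assuming plain word-frequency counting with a small hand-picked stop-word list. STOP_WORDS and extract_keywords are hypothetical names, and stemming is deliberately left out here; a Porter stemmer could be applied to each word before counting:

```ruby
# Small, hand-picked stop-word list (an assumption; extend as needed).
STOP_WORDS = %w[
  a an and are as at be by for from in is it of on or that the this to vs with
].freeze

# Return up to `limit` of the most frequent non-stop-words in `text`.
def extract_keywords(text, limit = 10)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z][a-z0-9']+/).each do |word|
    counts[word] += 1 unless STOP_WORDS.include?(word)
  end
  # Most frequent first; ties are broken alphabetically for stable output.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

keywords = extract_keywords(
  "DataMapper vs ActiveRecord: is DataMapper the better ORM for Sinatra?", 3
)
```

Everything runs in memory, so nothing needs to be written to disk.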
You can try the salsaAPI to automatically extract keywords and categorize the debates!

Localization best practices

I'm starting to modify my app, which uses all hardcoded strings for errors, GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
-Put all strings in resource (.rc) files.
-Define all strings in a file, once for each language. Use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally preferred?
Put all the strings in resource files. Once you've done that, there are several good translation packages available. One useful thing these packages do is allow you to get translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there's a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used book service; most of the more modern stuff seems to be on internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
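The key-plus-fallback idea is independent of the toolkit. A minimal sketch in Ruby, assuming an in-memory hash stands in for the per-locale resource files; TRANSLATIONS, fallback_chain, and translate are hypothetical names:

```ruby
# Stand-in for per-locale resource files: locale => { key => string }.
TRANSLATIONS = {
  'en'    => { 'file_not_found' => 'File not found', 'save' => 'Save' },
  'fr'    => { 'file_not_found' => 'Fichier introuvable', 'save' => 'Enregistrer' },
  'fr-CA' => { 'save' => 'Sauvegarder' }
}.freeze

# "fr-CA" falls back to "fr", then to the default locale.
def fallback_chain(locale, default = 'en')
  chain = [locale]
  chain << locale.split('-').first if locale.include?('-')
  chain << default
  chain.uniq
end

# Look the key up in each locale of the chain; return the key itself if
# no locale defines it, so missing strings are easy to spot.
def translate(key, locale)
  fallback_chain(locale).each do |candidate|
    text = TRANSLATIONS.dig(candidate, key)
    return text if text
  end
  key
end
```

With this layout a regional locale only has to define the strings that differ from its parent language.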
There's another approach: just put the strings in the source, wrapped in something like tr(" "), and use one of the tools that strips them out and converts them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or db keys).
It makes the source easier to read and search, instead of having to look up what IDS_MESSAGE34 means.
One problem with resource files, at least with Windows/MFC, is that you can't use the stringtable in dialogs. So you have some text in the stringtable and some in the dialog section, which you have to deal with separately.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper, and not just the text from the page, I second the Beautiful Soup suggestion. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
I'd guess that Google takes a two step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do it's probably an address if they don't it probably isn't. If you can use a map database in your code that will probably make your life easier.
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions to parse the HTML itself. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. That holds even if you then apply a regular expression to the contents of the elements BeautifulSoup grabs.
If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route your best bet is probably to check out the sourcecode of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess if a string is an address or not by an expression like "(number), (name) Street|Boulevard|Main", etc
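A minimal sketch of such a guess in Ruby, assuming US-style street addresses written as a number, one to three capitalized words, and a street type. STREET_TYPES, ADDRESS_GUESS, and find_address_candidates are hypothetical names, and the street-type list is a small illustrative sample, so anything it finds is only a candidate:

```ruby
# Small illustrative sample of street types (an assumption, not exhaustive).
STREET_TYPES = %w[Street St Avenue Ave Boulevard Blvd Road Rd Lane Ln Drive Dr].freeze

ADDRESS_GUESS = /
  \b\d{1,6}                                   # house number
  \s+
  (?:[A-Z][a-z]+\s+){1,3}                     # one to three capitalized words
  (?:#{Regexp.union(STREET_TYPES).source})\b  # street type
/x

# Return every substring of `text` that looks like a street address.
def find_address_candidates(text)
  text.scan(ADDRESS_GUESS)
end
```

Candidates found this way could then be checked against a map database or gazetteer to weed out false positives.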
You can consider looking into some firefox extensions which aim to map addresses found in text to see how they work
You can check this USA extraction example http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends upon your requirement.
For email and contact details, regex is more than enough.
For addresses, regex alone will not help. Think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information like paragraphs, get the contents by using tags.

HTTP response splitting

I'm trying to handle this possible exploit and wondering what the best way to do it is. Should I use Apache's Commons Validator, create a list of known allowed symbols, and use that?
From the wikipedia article:
The generic solution is to URL-encode strings before inclusion into HTTP headers such as Location or Set-Cookie.
Typical examples of sanitization include casting to integer, or aggressive regular expression replacement. It is worth noting that although this is not a PHP specific problem, the PHP interpreter contains protection against this attack since version 4.4.2 and 5.1.2.
Edit
I'm tied to using JSPs with Java actions!
There don't appear to be any JSP-based protections for this attack vector - many descriptions on the web assume asp or php, but this link describes a fairly platform-neutral way to approach the problem (jsp used as an incidental example in it).
Basically, your first step is to identify the potentially hazardous characters (CRs, LFs, etc.) and then to remove them. I'm afraid this is about as robust a solution as you can hope for!
Solution
Validate input. Remove CRs and LFs (and all other hazardous characters) before embedding data into any HTTP response headers, particularly when setting cookies and redirecting. It is possible to use third party products to defend against CR/LF injection, and to test for existence of such security holes before application deployment.
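Both ideas mentioned here (strip CR/LF, or URL-encode the value) are platform-neutral. A minimal sketch in Ruby, with hypothetical helper names strip_crlf and encode_header_value; for a JSP application the same logic would have to be ported to a Java helper that every action calls before setting a header:

```ruby
require 'uri'

# Remove carriage returns and line feeds so the value cannot end the
# current header line and start a new one.
def strip_crlf(value)
  value.to_s.gsub(/[\r\n]/, '')
end

# Alternative: percent-encode the value, which turns CR and LF into
# %0D and %0A instead of dropping them.
def encode_header_value(value)
  URI.encode_www_form_component(value.to_s)
end

malicious = "/home\r\nSet-Cookie: evil=1"
stripped  = strip_crlf(malicious)
encoded   = encode_header_value(malicious)
```

Putting this in one shared helper avoids repeating a regex in every action that receives input from the browser.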
Use PHP? ;)
According to Wikipedia and the PHP CHANGELOG, PHP's had protection against it in PHP4 since 4.4.2 and PHP5 since 5.1.2.
Only skimmed it -- but, this might help. His examples are written in JSP.
OK, but casting to an int is not much use when reading strings, and using a regex in every action that receives input from the browser could be messy. I'm looking for a more robust solution.
