I'm looping over a series of URLs and want to clean them up. I have the following code:
# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])
# Remove www
new_url = o_url.host.gsub('www.', '').strip
How can I extend this to remove the subdomains that exist in some URLs?
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
This is a tricky issue. Some top-level domains do not accept registrations at the second level.
Compare example.com and example.co.uk. If you simply stripped everything except the last two labels, you would end up with example.com and co.uk, and the latter is never what you want.
Firefox solves this by filtering by effective top-level domain, and they maintain a list of all these domains. More information at publicsuffix.org.
You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
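To make the effective-TLD idea concrete, here is a minimal Ruby sketch. The suffix list is a tiny hand-picked sample and the method name registrable_domain is made up for illustration. A real implementation would load the full list from publicsuffix.org, which also contains wildcard and exception rules that this sketch ignores.

```ruby
# Hypothetical, hard-coded sample of public suffixes. A real program would
# load the complete list published at publicsuffix.org instead.
SAMPLE_SUFFIXES = %w[com net nl uk co.uk org.uk k12.oh.us].freeze

# Returns the longest known suffix plus the one label in front of it,
# or nil when the host is a bare suffix or has no known suffix at all.
def registrable_domain(host, suffixes = SAMPLE_SUFFIXES)
  labels = host.downcase.split('.')
  return nil if suffixes.include?(labels.join('.'))

  # Drop one leading label at a time, so the longest suffix is tried first.
  (1...labels.size).each do |i|
    candidate = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if suffixes.include?(candidate)
  end
  nil
end

puts registrable_domain('www.example.com')         # example.com
puts registrable_domain('foo.bar.example.co.uk')   # example.co.uk
puts registrable_domain('mylocalschool.k12.oh.us') # mylocalschool.k12.oh.us
```

Trying the longest candidate first is what keeps co.uk from being mistaken for a registrable domain under uk.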
For posterity, here's an update from Oct 2014:
I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.
In combination with URI.parse for stripping protocol and paths, it works really well:
require 'uri'
require 'public_suffix'
PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"
The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).
Ready for a complex regular expression? :)
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip
Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) will collect subdomains, by matching one or more groups of characters followed by a dot (the +? quantifier is lazy, so it consumes only as many leading labels as the rest of the pattern requires). The empty alternation is needed in the case of no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ will collect the actual hostname and the TLD. It allows either for a one-part TLD (like .info, .com or .museum), or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).
I tested this expression on the following samples:
foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk
Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
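For anyone who wants to re-run the samples, a small harness along these lines should do. The regular expression is copied unchanged from the answer; the helper name strip_subdomains is made up for this illustration, and sub is used instead of gsub because the anchored pattern can match at most once.

```ruby
# The expression from the answer, unchanged.
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Hypothetical helper name, used only for this illustration.
def strip_subdomains(host, re)
  host.strip.sub(re, '\1')
end

samples = %w[
  foo.com www.foo.com bar.foo.com www.foo.ca
  www.foo.co.uk a.b.c.d.e.foo.com a.b.c.d.e.foo.co.uk
]

# Map every sample host to its stripped form and print the pairs.
results = samples.map { |host| [host, strip_subdomains(host, re)] }.to_h
results.each { |host, stripped| puts "#{host} => #{stripped}" }
```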
Something like:
def remove_subdomain(host)
  # Not complete: add every root domain you need to the regexp
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end
puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl
You still need to add every suffix you consider a root domain. For example, '.uk' on its own might be a root domain, but for a '.co.uk' host you probably want to keep the label just before '.co.uk'.
Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.
Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.
Why not just strip the .com or .co.uk and then split on '.' and get the last element?
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
Have to say it feels hacky. Are there any other domains like .co.uk?
I've wrestled with this a lot in writing various and sundry crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl . It's available for Ruby, JavaScript and Elixir.
The following Ruby code accesses Google Translate's API:
require 'google/api_client'
client = Google::APIClient.new
translate = client.discovered_api('translate','v2')
client.authorization.access_token = '123' # dummy
client.key = "<My Google API Key>"
response = client.execute(
:api_method => translate.translations.list,
:parameters => {
'format' => 'text',
'source' => 'eng',
'target' => 'fre',
'q' => 'This is the text to be translated'
}
)
The response comes back in JSON format.
This code works for text-to-be-translated (the q argument) of less than approximately 750 characters, but generates the following error for longer text:
Error 414 (Request-URI Too Large)!!1
I have Googled this problem and found the following pages:
https://developers.google.com/translate/
https://github.com/google/google-api-ruby-client
It seems that the Google API Client code is placing the call to the Google API
using a GET instead of a POST. As various systems place limits on the length of URLs,
this restricts the length of text that can be translated.
The form of the solution to this problem seems to be that I should instruct
the Google API interface to place the call to the Google API using a POST
instead of a GET. However, I so far haven't been able to figure out how to do this.
I see that there are a few other StackOverflow questions relating to this
issue in other languages. However, I so far haven't been able to figure out how
to apply it to my Ruby code.
Thanks very much Dave Sag. Your approach worked, and I have developed it
into a working example that other Stack Overflow users might find useful.
(Note: Stack Overflow won't let me post this as an answer for eight hours, so I'm
editing my question instead.) Here is the example:
#!/usr/local/rvm/rubies/ruby-1.9.3-p194/bin/ruby
#
# This is a self-contained Unix/Linux Ruby script.
#
# Replace the first line above with the location of your ruby executable. e.g.
#!/usr/bin/ruby
#
################################################################################
#
# Author : Ross Williams.
# Date : 1 January 2014.
# Version : 1.
# Licence : Public Domain. No warranty or liability.
#
# WARNING: This code is example code designed only to help Stack Overflow
# users solve a specific problem. It works, but it does not contain
# comprehensive error checking which would probably double the length
# of the code.
require 'uri'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'json'
################################################################################
#
# This method translates a string of up to 5K from one language to another using
# the Google Translate API.
#
# google_api_key
# This is a string of about 39 characters which identifies you
# as a user of the Google Translate API. At the time of writing,
# this API costs US$20/MB of source text. You can obtain a key
# by registering at:
# https://developers.google.com/translate/
# https://developers.google.com/translate/v2/getting_started
#
# source_language_code
# target_language_code
# These arguments identify the source and target language.
# Each of these arguments should be a string containing one
# of Google's two-letter language identification codes.
# A list of these codes can be found at:
# https://sites.google.com/site/opti365/translate_codes
#
# text_to_be_translated
# This is a string to be translated. Ruby provides excellent
# Unicode support and this string can contain non-ASCII characters.
# This string must not be longer than 5K characters long.
#
def google_translate(google_api_key,source_language_code,target_language_code,text_to_be_translated)
# Note: The 5K limit on text_to_be_translated is stated at:
# https://developers.google.com/translate/v2/using_rest?hl=ja
# It is not clear to me whether this is 5*1024 or 5*1000.
# The Google Translate API is served at this HTTPS address.
# See http://stackoverflow.com/questions/13152264/sending-http-post-request-in-ruby-by-nethttp
uri = URI.parse("https://www.googleapis.com/language/translate/v2")
request = Net::HTTP::Post.new(uri.path)
# The API accepts only the GET method. However, the GET method has URL length
# limitations so we have to use POST method instead. We need to use a trick for
# delivering the form fields using POST but having the API perceive it as a GET.
# Setting a key/value pair in request adds a field to the HTTP header to do this.
request['X-HTTP-Method-Override'] = 'GET'
# Load the arguments to the API into the POST form fields.
params = {
'access_token' => '123', # Dummy parameter seems to be required.
'key' => google_api_key,
'format' => 'text',
'source' => source_language_code,
'target' => target_language_code,
'q' => text_to_be_translated
}
request.set_form_data(params)
# Execute the request.
# It's important to use HTTPS, as the API key is being transmitted.
https = Net::HTTP.new(uri.host,uri.port)
https.use_ssl = true
response = https.request(request)
# The API returns a record in JSON format.
# See http://en.wikipedia.org/wiki/JSON
# See http://www.json.org/
json_text = response.body
# Parse the JSON record, yielding a nested hash/list data structure.
json_structure = JSON.parse(json_text)
# Navigate down into the data structure to get the result.
# This navigation was coded by examining the JSON text from an actual run.
data_hash = json_structure['data']
translations_list = data_hash['translations']
translation_hash = translations_list[0]
translated_text = translation_hash['translatedText']
return translated_text
end # def google_translate
################################################################################
google_api_key = '<INSERT YOUR GOOGLE TRANSLATE API KEY HERE>'
source_language_code = 'en' # English
target_language_code = 'fr' # French
# To test the code, I have chosen a sample text of about 3393 characters.
# This is large enough to exceed the GET URL length limit, but small enough
# not to exceed the Google Translate API length limit of 5K.
# Sample text is from http://www.gutenberg.org/cache/epub/2701/pg2701.txt
text_to_be_translated = <<END
Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
There now is your insular city of the Manhattoes, belted round by
wharves as Indian isles by coral reefs--commerce surrounds it with
her surf. Right and left, the streets take you waterward. Its extreme
downtown is the battery, where that noble mole is washed by waves, and
cooled by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.
Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears
Hook to Coenties Slip, and from thence, by Whitehall, northward. What
do you see?--Posted like silent sentinels all around the town, stand
thousands upon thousands of mortal men fixed in ocean reveries. Some
leaning against the spiles; some seated upon the pier-heads; some
looking over the bulwarks of ships from China; some high aloft in the
rigging, as if striving to get a still better seaward peep. But these
are all landsmen; of week days pent up in lath and plaster--tied to
counters, nailed to benches, clinched to desks. How then is this? Are
the green fields gone? What do they here?
But look! here come more crowds, pacing straight for the water, and
seemingly bound for a dive. Strange! Nothing will content them but the
extremest limit of the land; loitering under the shady lee of yonder
warehouses will not suffice. No. They must get just as nigh the water
as they possibly can without falling in. And there they stand--miles of
them--leagues. Inlanders all, they come from lanes and alleys, streets
and avenues--north, east, south, and west. Yet here they all unite.
Tell me, does the magnetic virtue of the needles of the compasses of all
those ships attract them thither?
Once more. Say you are in the country; in some high land of lakes. Take
almost any path you please, and ten to one it carries you down in a
dale, and leaves you there by a pool in the stream. There is magic
in it. Let the most absent-minded of men be plunged in his deepest
reveries--stand that man on his legs, set his feet a-going, and he will
infallibly lead you to water, if water there be in all that region.
Should you ever be athirst in the great American desert, try this
experiment, if your caravan happen to be supplied with a metaphysical
professor. Yes, as every one knows, meditation and water are wedded for
ever.
END
translated_text =
google_translate(google_api_key,
source_language_code,
target_language_code,
text_to_be_translated)
puts(translated_text)
################################################################################
I'd use the REST API directly rather than the whole Google API client, which seems to encompass everything; when I searched its source for 'translate', it turned up nothing.
Create a request, set the appropriate header as per these docs
Note: You can also use POST to invoke the API if you want to send more data in a single request. The q parameter in the POST body must be less than 5K characters. To use POST, you must use the X-HTTP-Method-Override header to tell the Translate API to treat the request as a GET (use X-HTTP-Method-Override: GET).
code (not tested) might look like
require 'uri'
require 'json'
require 'net/http'
require 'net/https'
uri = URI.parse('https://www.googleapis.com/language/translate/v2')
request = Net::HTTP::Post.new(uri, {'X-HTTP-Method-Override' => 'GET'})
params = {
  'format' => 'text',
  'source' => 'eng',
  'target' => 'fre',
  'q' => 'This is the text to be translated'
}
request.set_form_data(params)
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
response = https.request(request)
# response is a Net::HTTPResponse; the JSON text is in its body
parsed = JSON.parse(response.body)
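Once the body has been parsed, the translated text still has to be pulled out of the nested structure. The sketch below assumes the shape navigated in the asker's script (data, then translations, then the first entry's translatedText). The sample bodies and the method name extract_translation are invented for illustration, and Hash#dig needs Ruby 2.3 or later.

```ruby
require 'json'

# Returns the first translated string, or nil if the body has another shape
# (for example an error document).
def extract_translation(body)
  parsed = JSON.parse(body)
  parsed.dig('data', 'translations', 0, 'translatedText')
end

# Hypothetical response bodies, shaped like the structure described above.
ok_body    = '{"data":{"translations":[{"translatedText":"Bonjour"}]}}'
error_body = '{"error":{"code":400,"message":"Bad Request"}}'

puts extract_translation(ok_body)           # Bonjour
puts extract_translation(error_body).inspect # nil
```

Using dig means a missing key yields nil instead of raising NoMethodError partway down the chain, which is one cheap form of the error checking the asker's script leaves out.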
I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
visited_links = [] #new array of what is visited
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visited_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
files = %w[png jpeg jpg gif svg txt js css zip gz]
starting_uri = URI.parse(starting_at)
seen_pages = Set.new # Keep track of what we've seen
crawl_page = ->(page_uri) do # A re-usable mini-function
unless seen_pages.include?(page_uri)
seen_pages << page_uri # Record that we've seen this
begin
doc = Nokogiri.HTML(URI.open(page_uri)) # Get the page (URI.open is provided by open-uri)
each_page.call(doc,page_uri) # Yield page and URI to the block
# Find all the links on the page
hrefs = doc.css('a[href]').map{ |a| a['href'] }
# Make these URIs, throwing out problem ones like mailto:
uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact
# Pare it down to only those pages that are on the same site
uris.select!{ |uri| uri.host == starting_uri.host }
# Throw out links to files (this could be more efficient with regex)
uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
# Remove #foo fragments so that sub-page links aren't differentiated
uris.each{ |uri| uri.fragment = nil }
# Recursively crawl the child URIs
uris.each{ |uri| crawl_page.call(uri) }
rescue OpenURI::HTTPError # Guard against 404s
warn "Skipping invalid link #{page_uri}"
end
end
end
crawl_page.call( starting_uri ) # Kick it all off!
end
crawl_site('http://phrogz.net/') do |page,uri|
# page here is a Nokogiri HTML document
# uri is a URI instance with the address of the page
puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
Use recursion to keep crawling every link on every page, but bailing out if you've already seen the page.
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search throughout your document and find tags being used as buttons, then parse out the URL.
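As a rough illustration of the first point, URI.join from the standard library resolves every one of those link forms against the page they were found on, after which a host comparison keeps only the on-site ones. The page address and hrefs below are made-up examples.

```ruby
require 'uri'

# Hypothetical page address and hrefs, chosen only to show each form of link.
page  = URI.parse('http://example.com/a/b/page.html')
hrefs = ['/root', './x', '../c', 'other.html', 'http://elsewhere.example/']

# Resolve each href relative to the page it appeared on.
resolved = hrefs.map { |href| URI.join(page, href) }

# Keep only links that stay on the same host.
same_site = resolved.select { |uri| uri.host == page.host }.map(&:to_s)
puts same_site
# http://example.com/root
# http://example.com/a/b/x
# http://example.com/a/c
# http://example.com/a/b/other.html
```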
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
nf.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward. The conditional in the map results in nil entries in the resulting array, so don't do that. Use select or reject to reduce the set of links that meet your criteria, and then use map to transform them. In your use here, pre-filtering using ^= in the CSS makes it even easier.
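The same filter-then-transform shape works on plain arrays too. In this sketch the hrefs and the base address are invented sample data; the nil stands in for an anchor that has no href attribute.

```ruby
# Hypothetical href values as they might come back from a page.
hrefs = ['/about', 'http://other.example/', nil, '/about', '/contact']

# First reduce the list to the entries we care about...
local = hrefs.compact.select { |href| href.start_with?('/') }.uniq

# ...then transform what is left.
absolute = local.map { |href| "http://example.com#{href}" }
puts absolute
# http://example.com/about
# http://example.com/contact
```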
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link different if you start, then stop, then start again, or if you're not returning the cookies received from the site, so you have to return cookies, and figure out which query parameters are significant, and which are going to throw off your code. Keep the first and throw away the second when you store the links for later parsing.
Use an HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel, and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed, then banned. Your crawler's throughput will drop to zero at that point.
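For the point about session IDs and insignificant query parameters, one possible normalization step using only the standard URI library is sketched here. The parameter names in IGNORED_PARAMS and the method name normalize_url are assumptions for illustration; which parameters are safe to drop depends entirely on the site being crawled.

```ruby
require 'uri'

# Hypothetical names of parameters that only identify the visitor.
IGNORED_PARAMS = %w[sid sessionid PHPSESSID].freeze

# Returns a canonical string form of the URL: no fragment, no ignored
# parameters, and the remaining parameters in sorted order.
def normalize_url(url)
  uri = URI.parse(url)
  uri.fragment = nil
  if uri.query
    kept = URI.decode_www_form(uri.query)
              .reject { |key, _value| IGNORED_PARAMS.include?(key) }
              .sort
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

puts normalize_url('http://example.com/page?b=2&sid=abc&a=1#top')
# http://example.com/page?a=1&b=2
```

Storing the normalized string in the unique "href" column means two links that differ only by session ID or parameter order are treated as the same page.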
It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using Windows (like me) you might want to look into Anemone.
From any URL I want to extract its path.
For example:
URL: https://stackoverflow.com/questions/ask
Path: questions/ask
It shouldn't be difficult:
url[/(?:\w{2,}\/).+/]
But I think I use a wrong pattern for 'ignore this' ('?:' - doesn't work). What is the right way?
I would suggest you don't do this with a regular expression, and instead use the built in URI lib:
require 'uri'
uri = URI::parse('http://stackoverflow.com/questions/ask')
puts uri.path # results in: /questions/ask
It has a leading slash, but that's easy to deal with =)
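One way to deal with the slash, as a small sketch (the method name path_without_leading_slash is made up here): strip a single leading "/" from URI#path. Note that URI#path does not include the query string.

```ruby
require 'uri'

# Returns the path component of the URL without its leading slash.
def path_without_leading_slash(url)
  URI.parse(url).path.sub(%r{\A/}, '')
end

puts path_without_leading_slash('https://stackoverflow.com/questions/ask')
# questions/ask
puts path_without_leading_slash('https://stackoverflow.com/questions/ask?sort=new')
# questions/ask
```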
You can use regex in this case, which is faster than URI.parse:
s = 'http://stackoverflow.com/questions/ask'
s[s[/.*?\/\/[^\/]*\//].size..-1]
# => "questions/ask" (6.8 times faster)
s[/\/(?!.*\.).*/]
# => "/questions/ask" (9.9 times faster, but with an extra slash)
But if you don't care about speed, use URI as ctcherry showed; it is more readable.
The approach presented by ctcherry is perfectly correct, but I prefer to use request.fullpath instead of including the URI library in the code. Just call request.fullpath in your views or controllers. But be careful: if the URL has any GET parameters, they will be included too; in that case use split('?').first.
I have a ruby script that downloads URLs from an RSS server and then downloads the files at those URLs.
I need to split the URL into 2 components like so -
http://www.website.com/dir1/dir2/file.txt
--> 'www.website.com' and 'dir1/dir2/file.txt'
I'm struggling to come up with a way to do this. I've been playing with regular expressions but nothing has worked. How would others go about doing this?
Use the URI library.
require 'uri'
u = URI.parse("http://www.website.com/dir1/dir2/file.txt")
u.host
# => "www.website.com"
u.path
# => "/dir1/dir2/file.txt"
A simple way is to use split:
url.split('/')[2] # => "www.website.com"
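If both pieces are wanted from split alone, passing a limit keeps the rest of the path in one piece. This is only a sketch: it assumes the URL always starts with a scheme followed by "//", which is exactly the kind of assumption the URI library avoids.

```ruby
url = 'http://www.website.com/dir1/dir2/file.txt'

# With a limit of 4 the string is cut at the first three slashes only:
# "http:", "", "www.website.com", "dir1/dir2/file.txt"
_scheme, _empty, host, path = url.split('/', 4)

puts host # www.website.com
puts path # dir1/dir2/file.txt
```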
STILL NOT RESOLVED :( [Feb 11th]
I have a large text file full of random data and want to pull out all the email addresses from it.
I would like to do this in Ruby, with pseudo code like this:
monster_data_string = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)
Does anyone know what Ruby email regular expression I would use to accomplish this?
Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.
What I have already tried is:
monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
but I receive Ruby errors stating that "+" is an invalid character
Thanks in advance
Watch this...
require 'yaml'
content = File.read("content.txt")
r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)
If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:
/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i
For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
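To illustrate the difference between match and scan, here is a short sketch. It uses the Regular-Expressions.info style pattern with @ as the separator; the sample text is invented.

```ruby
# Simple address pattern in the style quoted from Regular-Expressions.info.
EMAIL_RE = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical sample text with one address repeated.
text = 'junk **joe@example.com** junk bob@example.org junk joe@example.com'

first_only = text.match(EMAIL_RE)[0]  # match stops at the first hit
all_unique = text.scan(EMAIL_RE).uniq # scan walks the whole string

puts first_only         # joe@example.com
puts all_unique.inspect # ["joe@example.com", "bob@example.org"]
```

Because the pattern has no capture groups, scan returns the matched strings themselves rather than arrays of groups.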
Given that it is not possible to parse every valid email address using a regexp you are left with two choices:
Make a regexp that matches as many valid email addresses as possible and live with the fact that some valid but rarely used forms of email address might get overlooked.
or
Make a regexp that matches anything that "might be" an email address and then live with the false positives.
I use the second approach to weed out obviously wrong email addresses when validating user sign-up email addresses on a web page.
Gleaned from Ruby Cookbook which has a very good section on email address validation:
valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/
Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
What kind of runtime error messages are you getting? Is it regarding the regexps as invalid, or is it breaking due to the target string being too large?
To try and help you get there (though not very elegantly, I admit):
I think the start and end anchors (^ and $) aren't helping. You may also want to filter the asterisks?:
irb(main):001:0> mds = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
=> "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
=> nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "joe@example.com" 1:"joe" 2:"example.com">
Even better,
require 'yaml'
content = "asfsfsdfsdfsf sfda **joe@example.com.au** sdfdsf cool_me@example.com.fr"
r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
will give you
---
- - joe
  - example
  - .com.au
- - cool_me
  - example
  - .com.fr