XPath: encode-for-uri() but keep characters allowed in IRIs unencoded

XPath has the function encode-for-uri() that makes a string safe for use in a URI path segment:
encode-for-uri('AC/DC') => AC%2FDC
But it also %-encodes international characters:
encode-for-uri('汉/语') => %E6%B1%89%2F%E8%AF%AD
This is indeed necessary for URIs, but it is not necessary for IRIs, which are allowed to include these characters.
Is there a way to achieve the effect of encode-for-uri() in XPath while keeping i18n characters unencoded? Like this:
???('汉/语') => 汉%2F语

Perhaps the iri-to-uri() function does what you are looking for.
However, it doesn't escape "/" - it's designed to operate on an entire IRI, not on a segment of the path.

Related

Ruby regexp /(\/.*?(?=\/|$)){2}/

My regexp behaves just like I want it to on http://regexr.com, but not like I want it in irb.
I'm trying to make a regular expression that will match the following:
A forward slash,
then 2 * any number of random characters (i.e. `.*`),
up to but not including another /
OR the end of the string (whichever comes first)
I'm sorry as that was probably unclear, but it's my best attempt at an English translation.
Here's my current attempt and hopefully that will give you a better idea of what I'm trying to do:
/(\/.*?(?=\/|$)){2}/
The usage scenario is I want to be able to take a path like /foo/bar/baz/bin/bash and shorten it to the level I'm at in the filesystem, in this case the second level (/foo/bar). I'm trying to do this using the command path.scan(-regex-).shift.
"The usage scenario is I want to be able to take a path like /foo/bar/baz/bin/bash and shorten it to the level I'm at in the filesystem, in this case the second level (/foo/bar)"
Ruby already has a class for handling paths, Pathname. You can use Pathname#relative_path_from to do what you want.
require 'pathname'
path = Pathname.new("/foo/bar/baz/bin/bash")
# Normally you'd use Pathname.getwd
cwd = Pathname.new("/foo/bar")
# baz/bin/bash
puts path.relative_path_from(cwd)
Regexes just invite problems, like assuming the path separator is /, not honoring escapes, and not dealing with extra / characters. For example, "//foo/bar//b\\/az/bin/bash". // is particularly common in code which joins together directories using paths.join("/") or "#{dir}/#{file}".
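If the goal is the first two levels of the path (/foo/bar) rather than the remainder, one possible approach is to let Pathname split the path and then rejoin the leading components. This is a minimal sketch, not part of the original answer: the helper name truncate_path and its depth parameter are made up for illustration, and it assumes an absolute POSIX-style path.

```ruby
require "pathname"

# Hypothetical helper: keep only the first `depth` components of an absolute path.
def truncate_path(path, depth)
  # each_filename yields the components without separators, so repeated
  # slashes such as "//foo//bar" do not produce empty pieces.
  parts = Pathname.new(path).each_filename.first(depth)
  File.join("/", *parts)
end

puts truncate_path("/foo/bar/baz/bin/bash", 2)  # /foo/bar
```

In real use the depth would presumably come from the current directory, for example by counting the components of Pathname.getwd.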
For completeness, the general way you match a single piece of a path is this.
%r{^(/[^/]+)}
That's the beginning of the string, a /, then 1 or more characters which are not /. Using [^/]+ means you don't have to try and match an optional / or end of string, a very useful technique. Using %r{} means less leaning toothpicks.
But this is only applicable to a canonicalized path. It will fail on //foo//b\\/ar/. You can try to fix up the regex to deal with that, or do your own canonicalization, but just use Pathname.
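For comparison, here is a sketch of how the original scan call behaves and how the [^/]+ pattern extends to two levels. When the pattern contains a capture group, String#scan returns the group rather than the whole match, and a repeated group keeps only its last repetition, which is probably why the attempt looked wrong in irb. The variable names are only for illustration, and the same caveat about canonicalized paths applies.

```ruby
path = "/foo/bar/baz/bin/bash"

# scan returns the capture group for each match, and a repeated group
# keeps only its final repetition.
scanned = path.scan(/(\/.*?(?=\/|$)){2}/)
p scanned      # => [["/bar"], ["/bin"]]

# A non-capturing group repeated twice, anchored at the start of the
# string, returns the whole two-level prefix through String#[].
first_two = path[%r{\A(?:/[^/]+){2}}]
p first_two    # => "/foo/bar"
```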

Is there a function to url encode dot ('.') in ruby

I need to send an email as a parameter in the query string.
None of the standard functions I have tried encodes the '.' (dot).
CGI.escape('my.fake@email.com')
=> "my.fake%40email.com"
URI.escape('my.fake@email.com')
=> "my.fake@email.com"
URI.encode('my.fake@email.com')
=> "my.fake@email.com"
ERB::Util.url_encode('my.fake@email.com')
=> "my.fake%40email.com"
I can easily write a function myself, but I just wanted to know whether there is already a method for this.
As tadman pointed out, you will have a problem with dots if the route is something like get "users/:email", because Rails treats dots as separators. As noted in the docs, you will then need to add a constraint like:
get "users/:email", to: "users#show", constraints: { email: /.*/ }
And with that you don't need to escape dots in the url.
You actually don't need to encode the dot. After the ? in the url, / and . don't have any specific meaning.
You can escape characters that wouldn't normally be escaped by providing a regex as the 2nd argument to URI.escape
URI.escape('a@b.c', /\./)
=> "a@b%2Ec"
(keep in mind that providing your own regex overrides the default, so now nothing but . will be encoded)
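Note that URI.escape and URI.encode were removed in Ruby 3.0, so the call above only works on older versions. If the dot really must be encoded, one option is to post-process the output of CGI.escape. This is a minimal sketch, and the helper name escape_with_dots is hypothetical.

```ruby
require "cgi"

# Hypothetical helper: percent-encode as usual, then encode dots as well.
def escape_with_dots(value)
  # CGI.escape leaves "." alone, so replace it afterwards.
  CGI.escape(value).gsub(".", "%2E")
end

encoded = escape_with_dots("my.fake@email.com")
puts encoded                # my%2Efake%40email%2Ecom
puts CGI.unescape(encoded)  # my.fake@email.com
```

Any standard decoder turns %2E back into a dot, so the receiving side needs no special handling.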

How to reversibly escape a URL in Ruby so that it can be saved to the file system

The use-case example is saving the contents of http://example.com as a filename on your computer, but with the unsafe characters (i.e. : and /) escaped.
The classic way is to use a regex to strip all non-alphanumeric-dash-underscore characters out, but then that makes it impossible to reverse the filename into a URL. Is there a way, possibly a combination of CGI.escape and another filter, to sanitize the filename for both Windows and *nix? Even if the tradeoff is a much longer filename?
edit:
Example with CGI.escape
CGI.escape 'http://www.example.com/Hey/whatsup/1 2 3.html#hash'
#=> "http%3A%2F%2Fwww.example.com%2FHey%2Fwhatsup%2F1+2+3.html%23hash"
A couple of things: are % signs completely safe in filenames? Unfortunately, CGI.escape doesn't convert spaces in a malformed URL to %20 on the first pass (it emits + instead), so I suppose any translation method would require changing all spaces to + with a gsub and then applying CGI.escape.
One of the ways is by "hashing" the filename. For example, the URL for this question is: https://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste. You could use the Ruby standard library's digest/md5 library to hash the name. Simple and elegant.
require "digest/md5"
foldername = "https://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste"
hashed_name = Digest::MD5.hexdigest(foldername) # => "5045cccd83a8d4d5c4fc01f7b4d8c502"
This works for the same reason MD5 hashing is used to check the integrity of downloads: for all practical purposes, the MD5 digest of a given string is always the same hex string.
However, I wouldn't call this "reversible". You need a custom way to look up the URL for each hash that gets generated, perhaps a .yml file with that data.
Update: as @the Tin Man suggests, a simple SQLite db would be much better than a .yml file when a large number of files need storing.
Here is how I would do it (adjust the regular expression as needed):
url = "http://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste"
# Keep letters, digits, and dashes; replace anything else with "_" plus its hex bytes.
filename = url.each_char.map { |x|
  x.match(/[a-zA-Z0-9-]/) ? x : "_#{x.unpack('H*')[0]}"
}.join
EDIT:
If the length of the resulting file name is a concern, then I would store the files in sub-directories with the same names as the URL path segments.
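The idea above can be made explicitly reversible by escaping byte by byte, so that every escape is exactly an underscore followed by two hex digits and the underscore itself is always escaped. This is a sketch under those assumptions; the method names escape_filename and unescape_filename are hypothetical.

```ruby
# Hypothetical helper: escape every byte outside [a-zA-Z0-9-] as "_xx".
def escape_filename(url)
  # Work on raw bytes so multi-byte characters become several "_xx" groups.
  url.b.gsub(/[^a-zA-Z0-9-]/) { |byte| format("_%02x", byte.ord) }
       .force_encoding(Encoding::UTF_8)
end

# Hypothetical inverse: turn each "_xx" group back into its byte.
def unescape_filename(name)
  name.b.gsub(/_([0-9a-f]{2})/) { $1.hex.chr }
       .force_encoding(Encoding::UTF_8)
end

url = "http://www.example.com/Hey/whatsup/1 2 3.html#hash"
name = escape_filename(url)
puts name
puts unescape_filename(name) == url  # true
```

The result contains only letters, digits, dashes, and underscores, at the cost of a longer name.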

How is Illegal char's URL working?

There are many sites (such as Stack Overflow) that have the title of the page in the URL.
I am looking for the algorithm they use to avoid illegal URL characters. (I don't want URL encoding; I want a replace/remove algorithm.)
For example, 'How is Illegal char's URL working?' will become 'How-is-Illegal-chars-URL-working'.
Thanks!
The algorithm to do this is generally called 'slugify', because it turns a string into a 'slug' to be used in a URL. Searching for that should give you plenty of useful implementations.
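As a rough illustration of what such a function does, here is a minimal slugify sketch in Ruby. It is not how Stack Overflow does it; the rules are assumptions chosen to reproduce the example in the question, and it only handles ASCII titles.

```ruby
# Hypothetical helper: turn a page title into a URL slug.
def slugify(title)
  title
    .delete("'")                   # drop apostrophes: "char's" -> "chars"
    .gsub(/[^A-Za-z0-9]+/, "-")    # collapse every other run into one hyphen
    .gsub(/\A-+|-+\z/, "")         # trim hyphens left at either end
end

puts slugify("How is Illegal char's URL working?")
# How-is-Illegal-chars-URL-working
```

Real implementations usually also lowercase the result and transliterate accented characters, which this sketch leaves out.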
No idea how SO does it, but I would just strip every non-alphanumeric character and replace spaces with underscores.
In Python:
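def cleanTitle(title):
    # Characters to keep; everything else is dropped.
    allowed = 'abcdefghijklmnopqrstuvwxyz1234567890_-+/<>,.=[]{}()\\|!@#$%^&'
    temp = ''
    # Lowercase the title and turn spaces into underscores before filtering.
    for character in title.lower().replace(' ', '_'):
        if character in allowed:
            temp += character
    return temp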
I see you are working in C#. I don't know C#, so you'll have to translate this code. I doubt it's hard to do, though.

Negating strings in Ruby regular expressions

I'm looking for a way to extract LinkedIn profile pages from lists of URLs using Ruby. Currently I am looping over the URLs and matching them against this regex:
/^http:\/\/.+\.linkedin.com\/(pub|in)/
However, the URLs of LinkedIn profile directory pages are as follows:
http://www.linkedin.com/pub/dir
, so I'm looking to avoid any links that have the pub/dir path in them. I know it's possible to negate character classes in Ruby regexes, such as [^abc] matching any character that isn't a, b, or c. Is there a way to do the same with strings, i.e. to match any sequence of characters besides "dir"?
You can use a negative lookahead. Something like
(pub(?!\/dir)|in)
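Applied to the full pattern from the question, that might look like the sketch below. The two profile URLs are made-up examples, and the dot before com is escaped here, which the original pattern did not do.

```ruby
# pub is accepted only when it is not immediately followed by /dir.
profile_url = %r{^http://.+\.linkedin\.com/(pub(?!/dir)|in)}

urls = [
  "http://www.linkedin.com/pub/dir",             # directory page, rejected
  "http://www.linkedin.com/pub/jane-doe/1/2/3",  # hypothetical profile
  "http://www.linkedin.com/in/janedoe"           # hypothetical profile
]

profiles = urls.select { |url| url.match?(profile_url) }
puts profiles
```

Note that (?!/dir) also rejects any path that merely starts with /dir, such as a hypothetical /pub/dirk; a stricter lookahead like (?!/dir(?:/|$)) would avoid that.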
