Question
Using only modules and functions built into vanilla Python 2.x (where x >= 7), and without rolling my own Python function to do so (see below), how do I translate characters to strings the way urllib.quote() does, but limited to a subset of the characters that urllib.quote() translates?
The example below happens to do URL encoding, but my question is more basic and general than just URL encoding: in the general case, I would like to translate arbitrarily specified characters into arbitrarily specified strings, not just those required for RFC-compliant URL encoding.
Example
In the example below, I am translating only the balanced paren characters (square brackets, curly braces, and parentheses), and not all of the other characters that need to be quoted in URL query strings. This is not web-related at all; it is for paren navigation of the output the Python script produces.
I thought I would be able to do this concisely and efficiently with something built into vanilla Python 2.x (vanilla meaning not requiring additional modules to be installed), such as string.translate. But string.translate requires the translation table to map single characters to single characters, whereas I need to map single characters to multiple-character strings.
So I wrote my own below:
def urlencode_parens(line):
    """URL-encode parens as urllib.quote would, but only for parens."""
    trans_table = {
        '(': '%28',
        ')': '%29',
        '{': '%7B',
        '}': '%7D',
        '[': '%5B',
        ']': '%5D',
    }
    retval = []
    for char in line:
        if char in trans_table:
            char = trans_table[char]
        retval.append(char)
    return "".join(retval)
Although rudimentary timing analysis showed the above to be blazingly fast, it bothers me that I had to code that up in the first place (because now I have to stash that away in my own set of personal modules and maintain it).
Things I tried
I investigated how to force urllib.quote() into only translating the above mentioned characters, but it seems it internally hardcodes the translation of those characters without any way to extend/customize it.
I can do this using re.sub(), but I would have to chain the substitutions, given the immutability of strings in Python. The result looked more like Lisp than Python (not that Lisp is bad, just that "When in Rome ... etc."). And given that regexp substitution probably involves repeated recompilation of the regexps, I gave up on that approach, suspecting it would be less performant than what I cooked up above.
Update #1: maketrans documentation string is misleading and/or incorrect
Looking at the string.maketrans documentation, I see that it gives no indication that the to argument is optional. Nor does it state anything about how the from argument can be used, as was helpfully shown in the answer https://stackoverflow.com/a/51481561/257924.
In Python 3 you can do this: if you pass only one argument to str.maketrans (a mapping dictionary), the replacement values can also be strings of more than one character:
trans_dict = {
    '(': '%28',
    ')': '%29',
    '{': '%7B',
    '}': '%7D',
    '[': '%5B',
    ']': '%5D',
}
trans_table = str.maketrans(trans_dict)
print('dict={} tuple=()'.translate(trans_table))
# dict=%7B%7D tuple=%28%29
In Python 2.7 you might try to use unicode.translate:
trans_dict = {
    '(': '%28',
    ')': '%29',
    '{': '%7B',
    '}': '%7D',
    '[': '%5B',
    ']': '%5D',
}
trans_table = {ord(char): unicode(repl) for char, repl in trans_dict.items()}
print(u'dict={} tuple=()'.translate(trans_table))
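# dict=%7B%7D tuple=%28%29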
Related
Basically, I want to check whether a string (main) starts with another string (sub), using both of the methods shown below. For example, the following is my code:
main = gets.chomp
sub = gets.chomp
p main.start_with? sub
p main[/^#{sub}/]
And, here is an example with I/O - Try it online!
If I enter simple strings, then both of them work exactly the same, but when I enter strings like "1\2" on stdin, I get errors in the Regexp variant, as seen in the TIO example.
I guess this is because the string passed into the second one isn't treated as raw. So I tried passing sub.dump into the second one - Try it online! - which gives me a nil result. How do I do this correctly?
As a general rule, you should never ever blindly execute inputs from untrusted sources.
Interpolating untrusted input into a Regexp is not quite as bad as interpolating it into, say, Kernel#eval, because the worst thing an attacker can do with a Regexp is to construct an Evil Regex to conduct a Regular expression Denial of Service (ReDoS) attack (see also the section on Performance in the Regexp documentation), whereas with eval, they could execute arbitrary code, including, but not limited to, deleting the entire file system, scanning memory for unencrypted passwords / credit card information / PII, and exfiltrating that via the network, etc.
However, it is still a bad idea. For example, when I say "the worst thing that can happen is a ReDoS", that assumes that there are no bugs in the Regexp implementation (Onigmo in the case of YARV, Joni in the case of JRuby and TruffleRuby, etc.). Ruby's Regexps are quite powerful, and thus Onigmo, Joni and co. are large and complex pieces of code that may very well have their own security holes which could be exploited by a specially crafted Regexp.
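To make the ReDoS risk concrete, here is the classic shape of an Evil Regex (a hypothetical illustration, not taken from the question):

# A nested quantifier: on a backtracking engine, a non-matching
# input of n 'a's can force on the order of 2**n backtracking steps.
evil = /\A(a+)+b\z/
input = "a" * 28 + "c"
# input =~ evil   # uncomment to watch this take a very long time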
You should properly sanitize and escape the user input before constructing the Regexp. Thankfully, the Ruby core library already contains a method which does exactly that: Regexp::escape. So, you could do something like this:
p main[/^#{Regexp.escape(sub)}/]
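For illustration, this is what Regexp.escape does with the "1\2" input from the question (the main value here is hypothetical):

sub = "1\\2"                          # the three characters 1, \, 2, as read from stdin
Regexp.escape(sub)                    #=> "1\\\\2" (the backslash is escaped)
"1\\2foo"[/^#{Regexp.escape(sub)}/]   #=> "1\\2"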
The reason your attempt at using String#dump didn't work is that String#dump is for representing a String the way you would have to write it as a String literal: it escapes String metacharacters, not Regexp metacharacters, and it includes the quote characters around the String that are needed for it to be recognized as a String literal. You can easily see that when you simply try it out:
sub.dump
#=> "\"1\\\\2\""
# equivalent to '"1\\2"'
So, that means that String#dump
includes the quotes (which you don't want),
escapes characters that don't need escaping in Regexp just because they need escaping in Strings (e.g. # or "), and
doesn't escape Regexp metacharacters that don't need escaping in Strings (e.g. [, ., ?, *, +, ^, -).
I have these 2 UTF-8 strings:
a = "N\u01b0\u0303"
b = "N\u1eef"
They look pretty different, but they are the same once they are rendered:
irb(main):039:0> puts "#{a} - #{b}"
Nữ - Nữ
The a version is the one I have stored in the DB. The b version is the one coming from the browser in a POST request. I don't know why the browser is sending a different combination of UTF-8 characters, and it doesn't always happen: I can't reproduce the issue in my dev environment, and it happens in production only in a percentage of the total requests.
The trouble is that when I try to compare the two, the comparison returns false:
irb(main):035:0> a == b
=> false
I've tried different things like forcing encoding:
irb(main):022:0> b.force_encoding("UTF-8") == a.force_encoding("UTF-8")
=> false
Another interesting fact is:
irb(main):005:0> a.chars
=> ["N", "ư", "̃"]
irb(main):006:0> b.chars
=> ["N", "ữ"]
How can I compare these kinds of strings?
This is an issue with Unicode equivalence.
The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character, which when rendered is combined with the previous character to produce the final glyph.
The b version of the string uses the character ữ (U+1EEF, LATIN SMALL LETTER U WITH HORN AND TILDE) which is a single character, and is equivalent to the previous combination, but uses a different byte sequence to represent it.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Current versions of Ruby have this built in (in earlier versions you needed to use a third party library).
So currently you have
a == b
which is false, but if you do
a.unicode_normalize == b.unicode_normalize
you should get true.
If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:
a.mb_chars.normalize == b.mb_chars.normalize
or perhaps something like:
ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)
If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:
UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)
(nfkc refers to the normalisation form; it is the same as the default in the other techniques.)
There are various different ways to normalise Unicode strings (i.e. whether you use the decomposed or combined versions), and this example just uses the default. I'll leave researching the differences to you.
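For instance, String#unicode_normalize accepts the normalisation form as an argument (:nfc is the default); a quick sketch with the strings from the question:

a = "N\u01b0\u0303"   # u-with-horn followed by a combining tilde
b = "N\u1eef"         # precomposed u-with-horn-and-tilde
a.unicode_normalize(:nfc) == b                          #=> true
a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)  #=> true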
You can see that these are distinct characters: first and second. In the first case, it is using a modifier, the "combining tilde".
Wikipedia has a section on this:
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
and
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.
It seems that Ruby supports this normalization, but only as of Ruby 2.2:
http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html
a = "N\u01b0\u0303".unicode_normalize
b = "N\u1eef".unicode_normalize
a == b # true
Alternatively, if you are using Ruby on Rails, there appears to be a built-in method for normalization.
Today I came across the following regular expression and wanted to know what Ruby would do with it:
> "#a" =~ /^[\W].*+$/
=> 0
> "1a" =~ /^[\W].*+$/
=> nil
In this instance, Ruby seems to be ignoring the + character. If that is incorrect, I'm not sure what it is doing with it. I'm guessing it's not being interpreted as a quantifier, since the * is not escaped and is being used as a quantifier. In Perl/Ruby regexes, sometimes when a character (e.g., -) is used in a context in which it cannot be interpreted as a special character, it is treated as a literal. But if that was happening in this case, I would expect the first match to fail, since there is no + in the lvalue string.
Is this a subtly correct use of the + character? Is the above behavior a bug? Am I missing something obvious?
Well, you can certainly use a + after a *. You can read a bit about it on this site. The + after the * is called a possessive quantifier.
What does it do? It prevents the * from backtracking.
Ordinarily, when you have something like .*c and use it to match abcde, the .* will first match the whole string (abcde), and since the regex then cannot match c after the .*, the engine will go back one character at a time to check if there is a match (this is backtracking).
Once it has backtracked to c, you will get the match abc from abcde.
Now, imagine that the engine has to backtrack a few hundred characters; if you have nested groups and multiple * (or + or the {m,n} form), you can quickly end up with thousands or even millions of steps to backtrack. This is called catastrophic backtracking.
This is where possessive quantifiers come in handy. They actually prevent any form of backtracking. In the above regex I mentioned, abcde will not be matched by .*+c. Once .*+ has consumed the whole string, it cannot backtrack and since there's no c at the end of the string, the match fails.
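You can see exactly that behaviour in irb:

"abcde" =~ /.*c/    #=> 0   (.* backtracks until c can match)
"abcde" =~ /.*+c/   #=> nil (.*+ gives nothing back, so c never matches)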
So, another possible use of possessive quantifiers is that they can improve the performance of some regexes, provided the engine supports them.
For your regex /^[\W].*+$/, I don't think the possessive quantifier provides any improvement (maybe a tiny one). And lastly, it might easily be rewritten as /^\W.*+$/.
Has anyone in this forum attempted to solve the ACM programming problem http://acm.mipt.ru/judge/problems.pl?browse=yes&problem=024? It is one of the simpler problems in ACM MIPT and the goal is to evaluate an expression consisting of +, -, * and parentheses. Despite the apparent simplicity, I haven't been able to get my solution accepted, apparently because one of the test case expressions has an operator not stated in the problem. I even added support for division ('/') but that too didn't help. Any idea on what other operator needs to be supported? FYI, my program removes all whitespaces from the input before processing so that spaces shouldn't be a problem. Anything not stated in the problem but needs to be taken care of?
You're being bitten by Ruby's handling of strings and characters.
curr_ch = @input[i]
gives you an integer: the ASCII code of the character at index i of the input.
curr_ch == '('
for example, compares that integer to the string "(", which of course fails. The regex matches also fail, because you pass them an integer where a string is expected.
Replacing all occurrences of some_var = @input[some_index] with some_var = @input[some_index...some_index+1] gives me a programme that seems to work (it works on the few test inputs I gave it). Probably someone who actually knows the quirks of Ruby can give you a better fix.
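For reference, a minimal sketch of the difference (assuming Ruby 1.8, where String#[] with an integer index returns the character code):

s = "(1+2)"
s[0]           #=> 40 in Ruby 1.8 (the ASCII code of "("), "(" in 1.9+
s[0] == '('    #=> false in Ruby 1.8, true in 1.9+
s[0, 1]        #=> "(" in both
s[0...1]       #=> "(" in both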
Many programming languages allow trailing commas in their grammar following the last item in a list. Supposedly this was done to simplify automatic code generation, which is understandable.
As an example, the following is a perfectly legal array initialization in Java (JLS 10.6 Array Initializers):
int[] a = { 1, 2, 3, };
I'm curious if anyone knows which language was first to allow trailing commas such as these. Apparently C had it as far back as 1985.
Also, if anybody knows other grammar "peculiarities" of modern programming languages, I'd be very interested in hearing about those also. I read that Perl and Python for example are even more liberal in allowing trailing commas in other parts of their grammar.
I'm not an expert on the commas, but I know that standard Pascal was very persnickety about semicolons being statement separators, not terminators. That meant you had to be very, very careful about where you put one if you didn't want to get yelled at by the compiler.
Later Pascal-esque languages (C, Modula-2, Ada, etc.) had their standards written to accept the odd extra semicolon without behaving like you'd just peed in the cake mix.
I just found out that the g77 Fortran compiler has the -fugly-comma Ugly Null Arguments flag, though it's a bit different (and, as the name implies, rather ugly).
The -fugly-comma option enables use of a single trailing comma to mean “pass an extra trailing null argument” in a list of actual arguments to an external procedure, and use of an empty list of arguments to such a procedure to mean “pass a single null argument”.
For example, CALL FOO(,) means “pass two null arguments”, rather than “pass one null argument”. Also, CALL BAR() means “pass one null argument”.
I'm not sure which version of the language this first appeared in, though.
[Does anybody know] other grammar "peculiarities" of modern programming languages?
One of my favorites, Modula-3, was designed in 1990 with Niklaus Wirth's blessing as the then-latest language in the "Pascal family". Does anyone else remember those awful fights about whether semicolon should be a separator or a terminator? In Modula-3, the choice is yours! The EBNF for a sequence of statements is
stmt ::= BEGIN [stmt {; stmt} [;]] END
Similarly, when writing alternatives in a CASE statement, Modula-3 let you use the vertical bar | as either a separator or a prefix. So you could write
CASE c OF
| 'a', 'e', 'i', 'o', 'u' => RETURN Char.Vowel
| 'y' => RETURN Char.Semivowel
ELSE RETURN Char.Consonant
END
or you could leave off the initial bar, perhaps because you prefer to write OF in that position.
I think what I liked as much as the design itself was the designers' awareness that there was a religious war going on and their persistence in finding a way to support both sides.
Let the programmer choose!
P.S. Objective Caml allows permissive use of | in case expressions whereas the earlier and closely related dialect Standard ML does not. As a result, case expressions are often uglier in Standard ML code.
EDIT: After seeing T.E.D.'s answer I checked the Modula-2 grammar, and he's correct: Modula-2 also supported semicolon as a terminator, through the device of the empty statement, which makes stuff like
x := x + 1;;;;;; RETURN x
legal. I suppose that's not a bad thing. Modula-2 didn't allow flexible use of the case separator |, however; that seems to have originated with Modula-3.
Something which has always galled me about C is that although it allows an extra trailing comma in an initializer list, it does not allow one in an enumerator list (for defining the literals of an enumeration type). This little inconsistency has bitten me in the ass more times than I care to admit. And for no reason!