To Understand More about Encoding U+30C9 vs U+30C8U+3099 - utf-8

ド(U+30C9) vs ド(U+30C8U+3099)
FYI, the situation is:
A user uploaded a file whose name contains ド(U+30C8 U+3099) to AWS S3 from a web app.
The website then sent a POST request containing the file name, without URL encoding, to an AWS Lambda function for further processing in Python. By the time the name arrived in Lambda it had become ド(U+30C9), and Python then failed to access the file stored in S3 because the Unicode sequences differ.
I think the solution would be to URL-encode on the frontend before sending the request and URL-decode with urllib.parse.unquote on the backend, so that both sides end up with the same Unicode sequence.
My questions are:
Would URL encoding solve this issue? I can't reproduce it, probably because I am on a different OS from the user's.
How exactly did it happen, given that both requests (uploading to S3 and sending the second request to Lambda) originated on the user's machine?
Thank you.

You are hitting a common case (perhaps more common with Latin scripts): canonical equivalence. Unicode requires canonically equivalent sequences to be handled in the same manner.
If you look in UnicodeData.txt you will find:
30C8;KATAKANA LETTER TO;Lo;0;L;;;;;N;;;;;
30C9;KATAKANA LETTER DO;Lo;0;L;30C8 3099;;;;N;;;;;
so, 30C9 is canonically equivalent to the sequence 30C8 3099.
Usually it is better to normalize Unicode strings to a common canonical form. Unfortunately there are two of them, NFC and NFD: Normalization Form Canonical Composition and Normalization Form Canonical Decomposition. Apple prefers the latter (and Unicode's original design leaned that way), while most other vendors prefer the former.
So do not trust web browsers to keep the same form. Also consider that input methods on the user's side may give you different variations (and with keyboards you may also get non-canonical forms, which should be normalized; this can happen with several combining characters).
So, on your backend, you should choose a normalization form and transform all input data into that form (or make sure that all search and comparison functions handle equivalent sequences correctly, but that requires normalization on every call, so it may be less efficient).
Python has unicodedata.normalize() in the standard library (see the unicodedata module) to normalize Unicode strings. In other languages you may need the ICU library. In any case, you should normalize Unicode strings.
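For example, a minimal sketch (Python 3, standard library only) showing that the two forms of ド only compare equal after normalization:

import unicodedata

composed = "\u30c9"          # ド  KATAKANA LETTER DO (U+30C9)
decomposed = "\u30c8\u3099"  # ト + combining voiced sound mark (U+30C8 U+3099)

print(composed == decomposed)                                # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: both become U+30C9
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: both become U+30C8 U+3099

Applied to the question above, normalizing the key to the same form on the upload path and in the Lambda handler makes the S3 lookups agree.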
Note: this has nothing to do with the encoding; it is built directly into Unicode's design. The reason is the requirement to be compatible with older encodings, which had both ways of describing the same characters.

Related

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII unit separator 31 (0x1F).
This nicely solves the problem of trying to guess which ASCII characters won't be used in the string values, and I don't need to worry about escaping or quoting values, etc.
However, reading about its history, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this special character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 characters of ASCII are carried over unchanged into the Unicode standard, including characters 0–31. The only reason you don't often see the special ASCII control characters used in strings is human-interface limitations: they don't visualize well (if at all) when displayed on screen or written to a file, and you can't easily type them on a keyboard either. They're also not allowed in unescaped form in various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
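As a rough sketch of the idea (Python here for brevity, though the question is about C++; the delimiter handling is the same in any language), including a guard for the case where a value already contains the separator:

US = "\x1f"  # ASCII 31, unit separator

def build_key(values):
    # refuse (or escape) any value that already contains the delimiter
    for v in values:
        if US in v:
            raise ValueError("value contains the unit separator; escape it first")
    return US.join(values)

key = build_key(["warehouse-7", "bin 42", "SKU-0001"])
print(key.split(US))   # ['warehouse-7', 'bin 42', 'SKU-0001']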
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So, consider the ramifications if you think your hash key could end up on a barcode.

HTML/XSS escape on input vs output

From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side HTML rendering, a JavaScript web app using AJAX/JSON, and a mobile app that receives JSON (and which may or may not have some webviews, which could be JavaScript apps or server-rendered HTML). So you have to deal with HTML escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to the db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why HTML escaping on page render is preferred.
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for an HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack &amp; Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get a hit on "Jack &amp; Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&amp;y=2.
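A small Python sketch of that second failure, assuming the URL was HTML-escaped on input and then stored:

from urllib.parse import urlparse, parse_qs

stored = "http://example.com/?x=1&amp;y=2"   # HTML-escaped on the way in, then saved
params = parse_qs(urlparse(stored).query)
print(params)           # {'x': ['1'], 'amp;y': ['2']} -- the 'y' key is gone
print(params.get("y"))  # None: the extraction silently fails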
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
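A minimal sketch of "escape exactly once, at the output that needs it", hand-rolled with Python's html.escape purely to show the shape (in practice an autoescaping template engine does this for you):

from html import escape

def render_profile_html(username):
    # the only place HTML escaping happens: the HTML output boundary
    return "<p>Hello, " + escape(username) + "!</p>"

name = "Jack & Jill <script>alert(1)</script>"   # stored raw in the database
print(render_profile_html(name))
# <p>Hello, Jack &amp; Jill &lt;script&gt;alert(1)&lt;/script&gt;!</p>
# a JSON API or an SMTP backend would apply its own escaping (or none at all) instead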
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often another layer of encoding is applied, e.g. you might need base64-encoded input at some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, remember to do it, and keep track of where you have done it.
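For instance, a contrived Python sketch with a hypothetical escape-on-input filter: HTML hidden inside base64 sails straight past it.

import base64
from html import escape

def escape_on_input(value):
    # naive "sanitize everything at the front door" filter
    return escape(value)

payload = base64.b64encode(b"<script>alert(1)</script>").decode("ascii")
stored = escape_on_input(payload)         # unchanged: base64 has no HTML metacharacters
print(stored)                             # PHNjcmlwdD5hbGVydCgxKTwvc2NyaXB0Pg==
print(base64.b64decode(stored).decode())  # <script>alert(1)</script> -- unescaped, back in play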
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
The original misconception
Do not confuse sanitization of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". That presentation logic is what the presentation layer deals with, and the View instances in particular. In a well-written MVC, Views are full-blown objects (contrary to what RoR would tell you) which, when applied in a web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing both the original and an XSS-safe "public" version is a waste. If you want to store sanitized output because it costs too much to sanitize it on every request, you are barking up the wrong tree: that is a case for a cache, not for polluting the database.

Unicode Normalization in Windows

I've been using "Unicode strings" in Windows for about as long as I've known about Unicode (i.e. since shortly after graduating). However, it has always mystified me that the Win32 API uses the term "Unicode" very loosely. In particular, the "Unicode" variant described by MSDN is UTF-16 (although the "wide char" terminology comes from the days when it was UCS-2, which is not the same thing). And it makes almost no mention of Unicode normalization.
MSDN has a few pages about Unicode and Unicode normalization forms, and functions to change the normalization form. The page on normalization even says:
Win32 and the .NET Framework support all four normalization forms.
However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.
Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?
Question 2: must the strings passed to Win32API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?
From the MSDN article Using Unicode Normalization to Represent Strings.
Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
Update: I've included some specific details relating to Question #2.
Regarding the file system, normalization is not required, based on the article Naming Files, Paths, and Namespaces.
There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
Regarding SQL Server, no normalization is required, nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside indexes, but I cannot find specific details on what that is. A SQL Server 2005 article states the same.
One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.
what normalization form is used by default for user input
Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.
Keyboard layouts tend towards NFC because in the pre-Unicode days they'd've usually been outputting a single byte character in the local code page for each keypress. However there are exceptions.
For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (eg circumflex â) and some are typed as a combining diacritical (eg grave à). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.
This isn't in normal form C (which would be ầ U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ U+0061,U+0302,U+0300).
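A quick check of those three representations with Python's standard unicodedata module:

import unicodedata as ud

typed = "\u00e2\u0300"              # what the keyboard produced: â + combining grave
nfc = ud.normalize("NFC", typed)
nfd = ud.normalize("NFD", typed)
print(typed == nfc, typed == nfd)        # False False: the typed form is neither NFC nor NFD
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+1EA7']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0302', 'U+0300']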
There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.
are the kernel and file system normalization-agnostic?
Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have files with the names ầ.txt, ầ.txt and ầ.txt in the same folder.
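A small illustrative sketch of that (Python; the file names are made up, and on some other platforms, e.g. HFS+ on macOS, the names may be normalized or collide rather than coexist):

import unicodedata

base = "\u00e2\u0300"   # mixed form, as typed
variants = {
    "mixed": base,
    "nfc": unicodedata.normalize("NFC", base),   # U+1EA7
    "nfd": unicodedata.normalize("NFD", base),   # U+0061 U+0302 U+0300
}
for label, stem in variants.items():
    with open(stem + ".txt", "w", encoding="utf-8") as f:
        f.write(label)
# on NTFS this leaves three distinct files that all display as ầ.txt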
First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:
But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

tcl utf-8 characters not displaying properly in ui

Objective: to have multi-language characters in the user ID in Enovia V6.
I am using UTF-8 encoding in a Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I literally see the saved information from the database.
When doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something in the Tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: KÃ¡tai-PÃ¡l
In UI I see name as: KÃ¡tai-PÃ¡l
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now the user name becomes: KÃ¡tai-PÃ¡l
In UI I see the name as "KÃ¡tai-PÃ¡l"
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
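As a side note, the byte-level effect behind the garbled display can be reproduced in a couple of lines (Python rather than Tcl here, purely for illustration), assuming the UTF-8 bytes end up being rendered as Latin-1/CP1252 somewhere between the database and the UI:

name = "Kátai-Pál"
utf8_bytes = name.encode("utf-8")      # roughly what [encoding convertto utf-8 ...] hands over
print(utf8_bytes.decode("latin-1"))    # KÃ¡tai-PÃ¡l -- the bytes shown as if they were characters
print(utf8_bytes.decode("utf-8"))      # Kátai-Pál  -- the same bytes, decoded with the right encoding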

What does Canonical Representation mean and its potential vulnerability to websites

I searched on Google for the meaning of canonical representation and turned up documents that are entirely too cryptic. Can anyone provide a quick explanation of canonical representation, and what are some typical vulnerabilities in websites to canonical representation attacks?
Canonicalisation is the process by which you take an input, such as a file name, or a string, and turn it into a standard representation.
For example if your web application only allows access to files under C:\websites\mydomain then typically any input referring to filenames is canonicalised to be a physical, direct path, rather than one which uses relative paths. If you wanted to open C:\websites\mydomain\example\example.txt one input into that function may be example\example.txt. It's hard to work out if this goes outside the boundaries of your web site, so the canonicalisation function would look at the application directory and change that relative path into a physical one, C:\websites\mydomain\example\example.txt. This is obviously easier to check as you simply do a string compare on the start of the file path.
For HTML inputs you take input like %20 and canonicalise it by decoding it, so it turns into a space. This is a good idea because the number of different ways of encoding is large; canonicalisation means you check only the decoded string, rather than trying to cover all the encoding variations.
Basically you are taking input which is logically equivalent and converting them to a standard form which you can then act upon.
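A rough Python sketch of both steps (the base directory and the helper name are made up for illustration):

import os
from urllib.parse import unquote

BASE = os.path.realpath(r"C:\websites\mydomain")

def canonical_open(user_supplied):
    # canonicalise: decode %xx escapes, then resolve to a physical, absolute path
    candidate = os.path.realpath(os.path.join(BASE, unquote(user_supplied)))
    # the check is now a simple comparison against the canonical base directory
    if os.path.commonpath([BASE, candidate]) != BASE:
        raise PermissionError("path escapes the web root: " + candidate)
    return open(candidate, "rb")

# canonical_open(r"example\example.txt")        -> opens C:\websites\mydomain\example\example.txt
# canonical_open(r"..\..\Windows\notepad.exe")  -> rejected
# canonical_open("example%5Cexample.txt")       -> %5C is decoded to '\' first, then checked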
The following explanation is from the "Application Security and Development STIG" found here:
3.11 Canonical Representation
Canonical representation issues arise when the name of a resource is used to control resource access. There are multiple methods of representing resource names on a computer system. An application relying solely on a resource name to control access may incorrectly make an access control decision if the name is specified in an unrecognized format.
For example, in Windows, notepad.exe may be represented by the following file and path name combinations:
C:\Windows\System32\notepad.exe
%SystemRoot%\System32\notepad.exe
\\?\C:\Windows\System32\notepad.exe
\\host\c$\Windows\system32\notepad.exe
An application attempting to restrict access to the file based solely on the file path and name may improperly grant or deny access. The same issue may apply to other named resources on a system, such as hard and soft links, URLs, pipes, shares, directories, device names, or data within files, if alternate encoding mechanisms are used with the data.
The following items may indicate potential canonical representation issues in an application:
• Access control decisions based upon a resource name.
• Failure to reduce a resource name to its canonical form before use.
In order to minimize canonical representation issues in the application, implement the following procedures:
• Do not rely solely on resource names to control access.
• If using resource names to control access, validate the names to ensure they are in the proper format; reject all names not fitting the known-good criteria.
• Use operating system-based access control mechanisms such as permissions and ACLs.
Canonicalisation means reducing the data received to its simplest form; it's used for input validation.
Canonical (I think) means that console input follows the terminal's typical line-editing behaviour; non-canonical means the input is non-standard and requires special handling, such as the input behaviour of vi on Linux.
