Why does Kickstarter apply "utf8=✓" to the query string?

I noticed that (for example) when searching on Kickstarter, it appends utf8=✓ to the query string, which shows up percent-encoded as utf8=%E2%9C%93.
Is it some fancy way of detecting whether the client supports UTF-8? I've never come across this trick before.

That parameter is added to force older browsers (IE 6/7) to encode the submitted parameters as UTF-8. The checkmark cannot be represented in Latin-1, so its mere presence forces those browsers to fall back to UTF-8 for the whole submission; Rails inserts it automatically as a hidden form field.
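
For reference, this is what the parameter looks like once percent-encoded; a quick sketch with Python's standard library (the term parameter name is just an example search field):

from urllib.parse import urlencode

# U+2713 (✓) cannot be represented in Latin-1, so its presence
# pushes old browsers to encode the whole form as UTF-8.
print(urlencode({"utf8": "\u2713", "term": "arduino"}))
# utf8=%E2%9C%93&term=arduino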


Understanding the encoding: U+30C9 vs U+30C8 U+3099

ド (U+30C9) vs ド (U+30C8 U+3099)
FYI, the situation is:
A user uploaded a file whose name contains ド (U+30C8 U+3099) to AWS S3 from a web app.
The website then sent a POST request containing the file name, without URL encoding, to an AWS Lambda function for further processing in Python. By the time the name arrived in Lambda it had become ド (U+30C9), and Python then failed to access the file stored in S3 because of the difference in Unicode representation.
I think the solution would be to URL-encode the name on the frontend before sending the request and to URL-decode it with urllib.parse.unquote on the backend, so both ends see the same code points.
My questions are:
Would URL encoding solve the issue? I can't reproduce it, probably because I am on a different OS from the user's.
How exactly did it happen, given that both requests (the upload to S3 and the second request to Lambda) were made from the user's machine?
Thank you.
You are hitting a common case (maybe more common with Latin scripts): canonical equivalence. Unicode requires canonically equivalent sequences to be handled in the same manner.
If you look in UnicodeData.txt you will find:
30C8;KATAKANA LETTER TO;Lo;0;L;;;;;N;;;;;
30C9;KATAKANA LETTER DO;Lo;0;L;30C8 3099;;;;N;;;;;
So U+30C9 is canonically equivalent to the sequence U+30C8 U+3099 (the sixth field of an entry is its canonical decomposition).
Usually it is better to normalize Unicode strings to a common canonical form. Unfortunately there are two of them, NFC and NFD: Normalization Form Canonical Composition and Normalization Form Canonical Decomposition. Apple prefers the latter (and Unicode's original design leaned the same way); most other vendors prefer the former.
So do not trust web browsers to keep the same form. Also consider that input methods on the user's side may give you different variants, and with keyboards you may also get non-canonical orderings, which should be normalized too (this can happen when several combining characters follow one base character).
So, on your backend, you should choose a normalization form and transform all input data into that form (or make sure that all search and comparison functions handle equivalent sequences correctly, but that requires normalizing on every call, so it may be less efficient).
Python has unicodedata.normalize() in the standard library (see the unicodedata module) to normalize Unicode strings. In other languages you may need the ICU library. In any case, you should normalize your Unicode strings.
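For example, with the exact sequences from the question:

import unicodedata

composed = "\u30c9"           # ド as one code point (KATAKANA LETTER DO)
decomposed = "\u30c8\u3099"   # ド as TO + combining voiced sound mark

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True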
Note: this has nothing to do with the encoding; it is built directly into Unicode's design. The reason is the requirement to stay compatible with older encodings, some of which had both ways to describe the same characters.

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII "unit separator" character, 31 (0x1F).
This nicely solves the problem of guessing which ASCII characters won't appear in the string values, and I don't need to worry about escaping or quoting values.
However, reading about its history, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
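
For concreteness, a minimal sketch of the scheme the question describes (Python used here for brevity; the original context is C++):

US = "\x1f"  # ASCII 31, the "unit separator" control character

def make_key(values):
    # The scheme only works if no value contains the separator itself.
    for v in values:
        if US in v:
            raise ValueError("value contains the unit separator")
    return US.join(values)

print(make_key(["alice", "2024-01-01", "EUR"]).split(US))
# ['alice', '2024-01-01', 'EUR']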
The lower 128 code points of ASCII are set in stone in the Unicode standard, and that includes the control characters 0 through 31. The only reason you don't see the special ASCII control characters used in strings very often is human-interface limitations: they do not visualize well (if at all) when displayed on screen or written to a file, and you can't easily type them on a keyboard either. They are also not allowed in unescaped form in various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
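For example, a minimal escape/unescape sketch (Python for brevity; the choice of DLE as the escape byte is arbitrary):

US, ESC = "\x1f", "\x10"  # unit separator, plus an arbitrary escape byte (DLE)

def join_fields(values):
    # Escape the escape byte first, then the separator, then join.
    return US.join(v.replace(ESC, ESC + ESC).replace(US, ESC + US) for v in values)

def split_fields(record):
    fields, cur, i = [], [], 0
    while i < len(record):
        c = record[i]
        if c == ESC:      # escaped byte: take the next character literally
            i += 1
            cur.append(record[i])
        elif c == US:     # unescaped separator: field boundary
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(c)
        i += 1
    fields.append("".join(cur))
    return fields

assert split_fields(join_fields(["a\x1fb", "c"])) == ["a\x1fb", "c"]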
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So consider the ramifications if you think your hash key could end up on a barcode.

Should JSON data contain formatted data?

When using JSON to populate a section of a page, I often find that the data needs special formatting, formatting that has to match what is already on the page, which is generated server-side.
A number might need to be formatted as a currency, given a special date format, or wrapped in extra markup when negative.
But where should this formatting take place? Doing it client-side means replicating all the formatting that happens server-side; doing it server-side and placing pre-formatted values in the JSON object means a less generic and reusable data set.
What is the recommended approach here?
The generic answer is to format data as late, and as close to the user, as possible (or perhaps "practical" is a better term).
Irritatingly, this means it's an "it depends" answer, and you've more or less already identified the compromise you're going to have to make: do you remove flexibility/portability by formatting server-side, or do you potentially introduce duplication by doing it client-side?
Personally I would tend towards client-side unless there's a very good reason not to, simply because we're back to formatting as close to the user as possible, though I would take care to make sure I'm applying the right formatting rules in the browser.
JSON supports the following basic types:
numbers,
strings,
booleans,
arrays,
objects,
and null (empty).
A currency amount is usually nothing more than a number formatted according to country-specific rules, and dates are not (yet) part of JSON at all.
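To illustrate the split (a sketch; the field names and values are made up): the server ships raw, culture-neutral values, and whatever sits closest to the user applies the locale rules. Here Python's locale module stands in for the client-side formatting layer:

import json
import locale
from datetime import date

# Server side: raw number, ISO 8601 date string.
payload = json.dumps({"amount": -1234.5, "due": date(2012, 7, 1).isoformat()})

# "Client" side: parse, then format as late as possible using locale rules.
locale.setlocale(locale.LC_ALL, "")   # needs a real (non-"C") locale for currency
data = json.loads(payload)
print(locale.currency(data["amount"], grouping=True))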
What is recommendable depends on what you do in your application and which JavaScript libraries you are already using. If you are already formatting a lot of data in your server-side code, add it there. If not, and you already have client-side classes that can cope with formatting (jQuery and MooTools have some capabilities), do it in the browser.
So either format them in the client or format them before sending them over - both solutions work.
If you want to delve deeper into this, I recommend the Wikipedia article about JSON.

Ruby/LDAP non-ASCII character support

It seems that LDAP requires strings with non-ASCII characters to be Base64-encoded (more precisely, the LDIF text format does). The way to mark a value as Base64-encoded is to add an extra colon to the attribute name, so that "cn: name" becomes "cn:: bmFtZQ==", with the value Base64-encoded (according to this site).
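For illustration, here is roughly what the double-colon convention amounts to; a simplified sketch of LDIF's rule (RFC 2849 has a few more cases), in Python rather than Ruby:

import base64

def ldif_line(attr, value):
    # Simplified: LDIF also requires Base64 for values that start with
    # a space, a colon, or '<', among other cases.
    try:
        value.encode("ascii")
        return f"{attr}: {value}"
    except UnicodeEncodeError:
        encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
        return f"{attr}:: {encoded}"

print(ldif_line("cn", "name"))   # cn: name
print(ldif_line("cn", "nämé"))   # cn:: bsOkbcOp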
Now, my question is: How do I tell Ruby LDAP to do this? I could not find that the documentation mentions anything about it, but perhaps it is supported.
What about the other LDAP libraries, such as Net::LDAP? Do they support operations with non-ASCII characters?
Update:
The test suite for Ruby/LDAP (0.9.7, ruby v. 1.8.6) includes tests for adding entries with foreign characters in the LDAP. They set $KCODE="UTF8". However, this seems to have no effect in my setup.
Non-ASCII characters are allowed in attribute values as long as the DN contains only ASCII characters, so I currently use a workaround with an ASCII-only uid. However, this does not feel optimal.
I solved the problem by switching to Net::LDAP (which, by the way, feels much nicer to use). This required me to upgrade to Ruby 1.8.7, though.

Filtering user input in PHP

I am wondering whether the combination of trim(), strip_tags() and addslashes() is enough to filter values coming in from $_GET and $_POST.
That depends what kind of validation you are wanting to perform.
Here are some basic examples:
If the data is going to be used in MySQL queries, make sure to use mysql_real_escape_string() on it instead of addslashes().
If it contains file paths, be sure to remove "../" sequences and block access to sensitive filenames.
If you are going to display the data on a web page, make sure to use htmlspecialchars() on it.
But the most important validation is accepting only the values you are expecting; in other words, only allow numeric values when you are expecting numbers, and so on.
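The whitelist idea in that last point, sketched in Python (the field name and pattern are made up):

import re

def parse_quantity(raw):
    # Accept only what we expect: one to four digits, nothing else.
    if not re.fullmatch(r"[0-9]{1,4}", raw):
        raise ValueError("invalid quantity")
    return int(raw)

print(parse_quantity("42"))   # 42; parse_quantity("42; DROP") raises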
Short answer: no.
Long answer: it depends.
Basically you can't say that a certain amount of filtering is or isn't sufficient without considering what you want to do with it. For example, the above will allow through "javascript:dostuff();", which might be OK or it might not if you happen to use one of those GET or POST values in the href attribute of a link.
Likewise you might have a rich text area where users can edit so stripping tags out of that doesn't exactly make sense.
I guess what I'm trying to say is that there is no simple set of steps for sanitizing your data such that you can cross it off and say "done". You always have to consider what that data is doing.
It depends highly on what you are going to use it for.
If you are going to display things as HTML, make absolutely sure you are properly specifying the encoding (e.g.: UTF-8). As long as you strip all tags, you should be fine.
For use in SQL queries, addslashes is not enough! If you use the mysqli library, for example, you want to look at mysqli::real_escape_string. For other DB libraries, use the designated escape function!
If you are going to use the string in JavaScript, addslashes will not be enough.
If you are paranoid about browser bugs, check out the OWASP Reform library
If you use the data in another context than HTML, other escaping techniques apply.
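The same context-dependence, sketched in Python (sqlite3 stands in for whichever database library you use): each output context gets its own escaping, and SQL escaping is best delegated to parameterized queries:

import html
import sqlite3

user_input = "<script>alert('x')</script> O'Brien"

# HTML context: escape markup-significant characters before rendering.
print(html.escape(user_input))

# SQL context: no manual escaping; let the driver bind the value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
print(conn.execute("SELECT name FROM users").fetchone()[0])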
