What does Canonical Representation mean and its potential vulnerability to websites

I searched Google for the meaning of canonical representation, but the documents I found were far too cryptic. Can anyone give a quick explanation of canonical representation, and of some typical website vulnerabilities to canonical representation attacks?

Canonicalisation is the process by which you take an input, such as a file name, or a string, and turn it into a standard representation.
For example, if your web application only allows access to files under C:\websites\mydomain, then any input referring to file names is typically canonicalised to a full physical path rather than a relative one. If you wanted to open C:\websites\mydomain\example\example.txt, one input to that function might be example\example.txt. It's hard to tell from the relative form whether this goes outside the boundaries of your web site, so the canonicalisation function looks at the application directory and turns the relative path into a physical one, C:\websites\mydomain\example\example.txt. That is much easier to check, as you simply do a string comparison on the start of the file path.
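As a minimal sketch of that check in Python (the function name resolve_inside and its parameters are made up for illustration, not part of any framework): the input is joined to the base directory, resolved to a physical path, and only then compared. Note that a plain prefix comparison would also accept C:\websites\mydomain2, so the sketch compares whole path components with os.path.commonpath instead.

```python
import os


def resolve_inside(base_dir, user_path):
    """Return the canonical path for user_path, or raise if it escapes base_dir.

    Hypothetical helper for illustration only.
    """
    # Canonicalise the base directory itself (resolves symlinks, "..", etc.).
    base = os.path.realpath(base_dir)
    # Canonicalise the requested path relative to the base directory.
    candidate = os.path.realpath(os.path.join(base, user_path))
    # Compare whole path components rather than raw string prefixes.
    # On Windows, commonpath itself raises ValueError for paths on different drives.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the allowed directory")
    return candidate
```

With a base of C:\websites\mydomain, an input of example\example.txt would resolve inside the base, while ..\other\file.txt would raise ValueError.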
For web inputs you take percent-encoded sequences like %20 and canonicalise them by decoding, so this one would turn into a space. This is a good idea because there are numerous ways of encoding the same value; canonicalisation means you check only the decoded string, rather than trying to cover all the encoding variations.
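A small Python sketch of the decoding step, using urllib.parse.unquote from the standard library. The helper name has_parent_reference is invented for this example, and the sketch only decodes once, so doubly encoded input such as %252e would need extra handling.

```python
from urllib.parse import unquote


def has_parent_reference(raw_path):
    """Check the decoded form of raw_path for a ".." path segment.

    Hypothetical helper: decodes percent-encoding once, then inspects segments.
    """
    decoded = unquote(raw_path)
    # Treat both slash styles as separators before splitting into segments.
    segments = decoded.replace("\\", "/").split("/")
    return ".." in segments


# Several encodings of the same input collapse to one decoded form.
variants = ["docs/../secret", "docs/%2e%2e/secret", "docs/..%2Fsecret"]
decoded_forms = {unquote(v) for v in variants}
```

Here decoded_forms ends up holding a single string, so one check on the decoded value covers all three spellings.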
Basically you are taking inputs which are logically equivalent and converting them to a standard form which you can then act upon.

The following explanation is from the "Application Security and Development STIG" found here:
3.11 Canonical Representation
Canonical representation issues arise when the name of a resource is used to control resource access. There are multiple methods of representing resource names on a computer system. An application relying solely on a resource name to control access may incorrectly make an access control decision if the name is specified in an unrecognized format.
For example, in Windows, notepad.exe may be represented by the following file and path name combinations:
C:\Windows\System32\notepad.exe
%SystemRoot%\System32\notepad.exe
\\?\C:\Windows\System32\notepad.exe
\\host\c$\Windows\system32\notepad.exe
An application attempting to restrict access to the file based solely on the file path and name may improperly grant or deny access. The same issue may apply to other named resources on a system, such as a hard- and soft-links, URL, pipe, share, directory, device name, or within data files, if alternate encoding mechanisms are used with the data.
The following items may indicate potential canonical representation issues in an application:
• Access control decisions based upon a resource name.
• Failure to reduce a resource name to its canonical form before use.
In order to minimize canonical representation issues in the application, implement the following procedures:
• Do not rely solely on resource names to control access.
• If using resource names to control access, validate the names to ensure they are in the proper format; reject all names not fitting the known-good criteria.
• Use operating system-based access control mechanisms such as permissions and ACLs.

Canonicalisation means reducing the data received to its simplest form; it's used for input validation.

Canonical (I think) means that console input is "typical behavior". Non-canonical means that input is non-standard and requires special knowledge, such as the input behavior of "vi" on linux.

Related

To Understand More about Encoding U+30C9 vs U+30C8U+3099

ド(U+30C9) vs ド(U+30C8U+3099)
FYI, the situation is:
- A user uploaded a file with a name containing ド(U+30C8U+3099) to AWS S3 from a web app.
- Then the website sent a POST request containing the file name, without URL encoding, to an AWS Lambda function for further processing using Python. By the time the name arrived in Lambda it had become ド(U+30C9). Python then failed to access the file stored in S3 because of the difference in Unicode.
I think the solution would be to URL-encode on the frontend before sending the request and to URL-decode using urllib.parse.unquote, so that both sides have the same Unicode.
My questions are:
- Would URL encoding solve that issue? I can't reproduce the issue, probably because I am on a different OS from the user's.
- How exactly did it happen, since both requests (uploading to S3 and sending the second request to Lambda) happened on the user's machine?
Thank you.

You are hitting a common case (maybe more common in Latin scripts): canonical equivalence. Unicode requires canonically equivalent sequences to be handled in the same manner.
If you look in UnicodeData.txt you will find:
30C8;KATAKANA LETTER TO;Lo;0;L;;;;;N;;;;;
30C9;KATAKANA LETTER DO;Lo;0;L;30C8 3099;;;;N;;;;;
So 30C9 is canonically equivalent to 30C8 3099.
Usually it is better to normalize Unicode strings to a common canonical form. Unfortunately there are two of them, NFC and NFD: Normalization Form Canonical Composition and Normalization Form Canonical Decomposition. Apple prefers the latter (and Unicode's original design preference is for this form), and most other vendors the former.
So do not trust web browsers to keep the same form. Also consider that input methods on the user's side may give you different variations (and with keyboards you may also get non-canonical forms which should be normalized; this can happen with several combining characters).
So, on your backend you should choose a normalization form and transform all input data into that form (or else make sure that all search and comparison functions handle equivalent sequences correctly, but this requires normalization on every call, so it may be less efficient).
Python has unicodedata.normalize() (in the standard library, see the unicodedata module) to normalize Unicode strings. In other languages you can use the ICU library. In any case, you should normalize Unicode strings.
Note: this has nothing to do with encoding; it is built directly into the Unicode design. The reason is the requirement to be compatible with older encodings, which had both ways of describing the same characters.
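A short Python sketch of the normalization step for the two sequences from the question. The helper name same_name and the choice of NFC are assumptions for illustration; what matters is that the backend picks one form and applies it to every incoming name before comparing or looking it up.

```python
import unicodedata

composed = "\u30C9"          # KATAKANA LETTER DO
decomposed = "\u30C8\u3099"  # KATAKANA LETTER TO + combining voiced sound mark

# The raw strings differ even though they are canonically equivalent.
raw_equal = composed == decomposed

# After normalizing to the same form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed


def same_name(a, b, form="NFC"):
    """Compare two names after normalizing both to one form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

In the scenario from the question, normalizing the file name on the backend, both when storing the key and when looking it up, would make the two requests agree regardless of which form the client sent.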

HL7 FHIR mark resources as anonymized

I am trying to map an existing domain into HL7 FHIR.
So far it has been pretty easy to find FHIR resources that more or less represent the same data and can be used for that purpose. But now I am running into a problem that I am not sure how to solve.
The existing domain allows data to be anonymized depending on the user's access level, e.g. a patient's name or address might be removed and marked as anonymized. Other data will be pseudonymised: for example, a birthdate in 1980 will be replaced with 01.01.1980, and an age of 37 will be replaced with a category of 30-40.
So I am unsure how to integrate that into the FHIR domain. I was thinking I could create an extension holding a boolean indicating whether a value was anonymized, and always replace or remove the original value. This might work, but I will run into big problems when the anonymized value is of a different type than the original value (e.g. an age is replaced by a range of values).
Is that even a valid approach? I thought this might be a common problem, but I could not find any examples where people described methods of marking data as altered. Unfortunately the documentation at http://build.fhir.org/extensibility-registry.html does not contain anything that would help my case.

You can use security labels for this purpose (Resource.meta.security). Take a look at REDACTED and SUBSETTED in the security label value set: https://www.hl7.org/fhir/valueset-security-labels.html
If you need to convey a data type other than the one allowed by the resource (e.g. wanting to convey a range rather than a birthdate), you'd need to use an extension. (Note that dates are valid even if you only include the year.)
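As a rough illustration, here is a Python sketch that builds a Patient resource as a plain dict, tags it with the REDACTED security label and carries only the year of birth. The helper name mark_redacted is invented, and the code system URI shown is an assumption based on the FHIR R4 terminology; check the value set page linked above for the URI that matches your FHIR version.

```python
import json

# Assumed code system URI for the REDACTED / SUBSETTED codes (verify for your FHIR version).
OBSERVATION_VALUE_SYSTEM = "http://terminology.hl7.org/CodeSystem/v3-ObservationValue"


def mark_redacted(resource):
    """Return a copy of resource with a REDACTED label in meta.security (hypothetical helper)."""
    tagged = dict(resource)
    meta = dict(tagged.get("meta", {}))
    labels = list(meta.get("security", []))
    labels.append({"system": OBSERVATION_VALUE_SYSTEM, "code": "REDACTED"})
    meta["security"] = labels
    tagged["meta"] = meta
    return tagged


# Name and address are simply left out; birthDate keeps only the year.
patient = mark_redacted({"resourceType": "Patient", "birthDate": "1980"})
patient_json = json.dumps(patient, indent=2)
```

A value whose type changes, such as an age replaced by a range, is not covered by this sketch and would still need an extension as described above.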

How does Windows interpret multiple VersionInfo Resources?

I am currently studying the VersionInfo Resource(s) for Windows.
It is kind of confusing that you can have multiple VS_VERSIONINFO/VS_FIXEDFILEINFO structures within a VS_VERSION_INFO resource.
As far as I understand it, you can have multiple RT_VERSION->VS_VERSION_INFO resources with different language ids (as shown in the picture).
These two language ids (0 and 1031) actually have a different VS_VERSIONINFO/VS_FIXEDFILEINFO each.
0 is a neutral language and seems to be prioritized over your actual local language id (which is 1031).
To me this seems to be kind of a mess and confusing.
How is it possible to have multiple VS_VERSIONINFO structures within a VS_VERSION_INFO resource, and what is the point? How does Windows interpret multiple resources and structures?
And how is it possible to get only one buffer when you call GetFileVersionInfo?
It all makes little sense to me and I can't find much documentation about it.

You have to distinguish between the textual information and the bare VS_FIXEDFILEINFO block. The fixed block exists only once; the text information is language dependent.
"Windows" does not prefer a specific one ;) What Explorer does is a different thing. It just shows the resource information, and in fact that is only the string information, not the information from the fixed version info.
When you call GetFileVersionInfo you get all language blocks! VerQueryValue is used to access the separate blocks.
The installer and other routines inside Windows only use the VS_FIXEDFILEINFO block. They don't care about any text blocks, and this block exists only once.
I assume that Explorer just shows the first text block and doesn't prefer a specific one either. To test this, use a text editor and swap the blocks in the resource file. But maybe the resource compiler reorders them.
To access the separate parts:
- VerQueryValue with "\" gives you the fixed version info block VS_FIXEDFILEINFO
- VerQueryValue with "\VarFileInfo\Translation" gives you a list of translations
- with "\StringFileInfo\langId_charset\keyname" you get the specific string parts
You can find this information on MSDN.
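The way the translation list feeds into the string lookup can be sketched in Python without calling the Windows API at all. The helper names below are invented; the sketch assumes the "\VarFileInfo\Translation" buffer is an array of little-endian 16-bit pairs (language id, code page) and that the string sub-block name is those two values written as four hex digits each.

```python
import struct


def parse_translations(raw):
    # Split the Translation buffer into (language id, code page) pairs.
    # Assumes the buffer length is a multiple of 4 bytes.
    return list(struct.iter_unpack("<HH", raw))


def string_value_path(lang_id, codepage, key):
    # Build the sub-block path to pass to VerQueryValue for one string value.
    return "\\StringFileInfo\\{:04x}{:04x}\\{}".format(lang_id, codepage, key)


# Example buffer holding one pair: language 1031 (0x0407), code page 1200 (0x04B0).
example_raw = bytes([0x07, 0x04, 0xB0, 0x04])
paths = [
    string_value_path(lang, cp, "FileVersion")
    for lang, cp in parse_translations(example_raw)
]
```

Each entry in paths would then be handed to VerQueryValue, one call per language block found in the resource.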

Is the ReplaceFile Windows API a convenience function only?

Is the ReplaceFile Windows API a convenience function only, or does it achieve anything beyond what could be coded using multiple calls to MoveFileEx?
I'm currently in a situation where I need to
- write a temporary file and then
- rename this temporary file to the original filename, possibly replacing the original file.
I thought about using MoveFileEx with MOVEFILE_REPLACE_EXISTING (since I don't need a backup or anything), but there is also the ReplaceFile API, and it is mentioned under Alternatives to TxF.
This got me thinking: Does ReplaceFile actually do anything special, or is it just a convenience wrapper for MoveFile(Ex)?

I think the key to this can be found in this line from the documentation (my emphasis):
The replacement file assumes the name of the replaced file and its identity.
When you use MoveFileEx, the replacement file has a different identity. Its creation date is not preserved, the creator is not preserved, any ACLs are not preserved and so on. Using ReplaceFile allows you to make it look as though you opened the file, and modified its contents.
The documentation says it like this:
Another advantage is that ReplaceFile not only copies the new file data, but also preserves the following attributes of the original file:
Creation time
Short file name
Object identifier
DACLs
Security resource attributes
Encryption
Compression
Named streams not already in the replacement file
For example, if the replacement file is encrypted, but the replaced file is not encrypted, the resulting file is not encrypted.
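For comparison, the plain write-then-rename approach from the question can be sketched in Python. The function name write_via_temp is made up; os.replace is the standard library call, which on Windows is implemented with MoveFileEx and MOVEFILE_REPLACE_EXISTING. As described above, this route gives the target the temporary file's identity, so attributes such as creation time and DACLs of the original are not carried over.

```python
import os
import tempfile


def write_via_temp(path, data):
    """Write data to a temp file next to path, then rename it over path.

    Hypothetical helper illustrating the rename approach, not ReplaceFile.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Replaces the destination if it already exists.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling write_via_temp twice on the same path leaves the content of the second call in place, with no leftover temporary file in the directory.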

Any app that wants to update a file by writing to a temp and doing the rename/rename/delete dance (handling all the various failure scenarios correctly), would have to change each time a new non-data attribute was added to the system. Rather than forcing all apps to change, they put in an API that is supposed to do this for you.
So you could "just do it yourself", but why? Do you correctly cover all the failure scenarios? Yes, MS may have a bug, but why try to reinvent the wheel?
NB, I have a number of issues with the programming model (better to do a "CreateUsingTemplate") but it's better than nothing.

Win32 File Name Comparison

Does anyone know what culture settings Win32 uses when dealing with case-insensitive files names?
Is this something that varies based on the user's culture, or are the casing rules that Win32 uses culture invariant?

An approximate answer is at "Comparing Unicode file names the right way".
Basically, the recommendation is to uppercase both strings (using CharUpper, CharUpperBuff, or LCMapString), then compare using a binary comparison (i.e. memcmp or wmemcmp, not CompareString with an invariant locale). The file system doesn't do Unicode normalization, and the case rules are not dependent on locale settings.
There are unfortunate ambiguous cases when dealing with characters whose casing rules have changed across different versions of Unicode, but it's about as good as you can do.

Comparing file names in native code and Don't compare filenames are a couple of good blog posts on this topic. The first has C/C++ code for OrdinalIgnoreCaseCompareStrings, and the second tells you how that doesn't always work for filenames and what to do to mitigate that.
Then there are the Unicode problems. While these new OrdinalIgnoreCase string comparison algorithms are great for your local NTFS drive, they might not yield the right answer on your FAT drive, or a network share.
So what's the answer? When possible, let the file system tell you. CreateFile can tell you if a given filename exists. Just pick the right creation disposition. If you need to compare two handles, you can often use GetFileInformationByHandle; look at dwVolumeSerialNumber/nFileIndexHigh/nFileIndexLow.
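The "let the file system tell you" idea can be sketched in Python with os.path.samefile, which compares the device and file identifiers reported by os.stat rather than the spelling of the names. On Windows, Python fills those fields from the volume serial number and file index. The wrapper name refers_to_same_file is invented for this example, and both paths must exist for the comparison to be possible.

```python
import os


def refers_to_same_file(path_a, path_b):
    """Ask the file system whether two names identify the same file.

    Hypothetical helper; returns False when either path does not exist.
    """
    try:
        # Compares st_dev and st_ino from os.stat, not the path strings.
        return os.path.samefile(path_a, path_b)
    except FileNotFoundError:
        return False
```

On a case-insensitive volume, two spellings that differ only in case would be reported as the same file, without the program having to reproduce the file system's casing rules.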

If you're using .NET, the official recommendation from Microsoft is to use StringComparison.OrdinalIgnoreCase for comparison and ToUpperInvariant for normalization (to be later compared using Ordinal comparison). This also applies to Registry keys and values, environment variables etc.
See New Recommendations for Using Strings in Microsoft .NET 2.0 for more details.
Note that while it's reliable on NTFS, it can fail with network shares, for example. See @SteveSteiner's answer and the links in his post for solutions.

According to Windows Driver Samples FastFAT and CDFS, it uses RtlUpcaseUnicodeString to convert a string to uppercase. According to a brief look in Ghidra, that uses an internal function named NLS_UPCASE, whose behavior is based on your current system codepage.
