I'm reading about the NTFS attribute types and have come to the $FILE_NAME attribute structure. Here it is:
Offset  Size  Description
~       ~     Standard Attribute Header
0x00    8     File reference to the parent directory
0x08    8     C Time - File Creation
0x10    8     A Time - File Altered
0x18    8     M Time - MFT Changed
0x20    8     R Time - File Read
0x28    8     Allocated size of the file
0x30    8     Real size of the file
0x38    4     Flags, e.g. Directory, compressed, hidden
0x3c    4     Used by EAs and Reparse
0x40    1     Filename length in characters (L)
0x41    1     Filename namespace
0x42    2L    File name in Unicode (not null terminated)
What is the "Filename namespace" at offset 0x41? I think I know a little about namespaces, but how can one be stored in just 1 byte? Can anyone clear this up for me? Thank you.
It describes the "traits" of a filename, i.e. length, allowable characters, etc. It is not a "string" in itself (like a C++/C#/etc. namespace).
I found a document here; frankly, I have no idea how reliable it is.
But anyway, it describes the namespaces as follows, which makes it quite obvious (see chapter 13.2):
0: POSIX
This is the largest namespace. It is case sensitive and allows all Unicode characters except for NULL (0) and Forward Slash '/'. The maximum name length is 255 characters. N.B. There are some characters, e.g. Colon ':', which are valid in NTFS, but Windows will not allow you to use.
1: Win32
Win32 is a subset of the POSIX namespace and is case insensitive. It uses all the Unicode characters, except: '"' '*' '/' ':' '<' '>' '?' '\' '|' N.B. Names cannot end with Dot '.', or Space ' '.
2: DOS
DOS is a subset of the Win32 namespace, allowing only 8 bit upper case characters, greater than Space ' ', and excluding: '"' '*' '+' ',' '/' ':' ';' '<' '=' '>' '?' '\'. N.B. Names must match the following pattern: 1 to 8 characters, then '.', then 1 to 3 characters.
3: Win32 & DOS
This namespace means that both the Win32 and the DOS filenames are identical and hence have been saved in this single filename record.
So the field can be one byte, because it just contains a number identifying the respective namespace in use.
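As a rough illustration of how that single byte is read, here is a minimal Python sketch. It assumes the attribute body (everything after the standard attribute header) is already in memory, that multi-byte fields are little-endian, and that the name is stored as UTF-16LE. The function name `parse_file_name_attr` and the sample bytes are made up for this example.

```python
import struct

# Namespace codes as listed in the quoted document.
NAMESPACES = {0: "POSIX", 1: "Win32", 2: "DOS", 3: "Win32 & DOS"}

def parse_file_name_attr(body):
    """Read name length, namespace and name from a $FILE_NAME attribute body."""
    # Offset 0x40: name length in characters; offset 0x41: namespace code.
    name_len, namespace = struct.unpack_from("<BB", body, 0x40)
    # The name starts at 0x42 and takes 2 bytes per character (assumed UTF-16LE).
    name = body[0x42:0x42 + 2 * name_len].decode("utf-16-le")
    label = NAMESPACES.get(namespace, "unknown ({})".format(namespace))
    return {"name": name, "namespace": label}

# Hypothetical sample: 0x40 zeroed bytes stand in for the earlier fields.
sample = bytes(0x40) + bytes([8, 3]) + "test.txt".encode("utf-16-le")
print(parse_file_name_attr(sample))
```

The namespace value is only a lookup key, which is why one byte is plenty.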
When I cat a file in bash I get the following:
$ cat /tmp/file
microsoft
When I view the same file in vim I get the following:
^@m^@i^@c^@r^@o^@s^@o^@f^@t^@
How can I identify and remove these "non-printable" characters? What does '^@' mean in Vim?
(Just a piece of background information: the file was created by base 64 decoding and cutting from the pssh header of an mpd file for Microsoft Playready)
What you see is Vim's visual representation of unprintable characters. It is explained at :help 'isprint':
Non-printable characters are displayed with two characters:
  0 -  31   "^@" - "^_"
 32 - 126   always single characters
127         "^?"
128 - 159   "~@" - "~_"
160 - 254   "| " - "|~"
255         "~?"
Therefore, ^@ stands for a null byte = 0x00. These (and other non-printable characters) can come from various sources, but in your case it's an ...
encoding issue
If you look closely at your output in Vim, every second byte is a null byte; in between are the expected characters. This is a clear indication that the file uses a multibyte encoding (UTF-16, big endian, no byte order mark, to be precise), that Vim did not detect it properly, and that it opened the file as latin1 or similar instead (whereas things worked out properly in the terminal).
To fix this, you can either explicitly specify the encoding:
:edit ++enc=utf-16 /tmp/file
Or tweak the 'fileencodings' option, so that Vim can automatically detect this. However, be aware that ambiguities (as in your case) make this prone to fail:
For an empty file or a file with only ASCII characters most encodings
will work and the first entry of 'fileencodings' will be used (except
"ucs-bom", which requires the BOM to be present).
That's why a byte order mark (BOM) is recommended for 16-bit encodings; but that assumes that you have control over the output encoding.
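The "every second byte is null" observation can also be checked outside Vim. The following Python sketch is only a heuristic for mostly-ASCII text without a BOM; the function name `guess_utf16_without_bom` and the 0.9 / 0.1 thresholds are arbitrary choices for this example, not part of any tool.

```python
def guess_utf16_without_bom(data):
    """Guess the UTF-16 byte order of mostly-ASCII text from its null bytes."""
    if len(data) < 2:
        return None
    even = data[0::2]  # bytes at positions 0, 2, 4, ...
    odd = data[1::2]   # bytes at positions 1, 3, 5, ...
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # Big endian puts the (zero) high byte first, little endian puts it second.
    if even_zero > 0.9 and odd_zero < 0.1:
        return "utf-16-be"
    if odd_zero > 0.9 and even_zero < 0.1:
        return "utf-16-le"
    return None

raw = "microsoft".encode("utf-16-be")
encoding = guess_utf16_without_bom(raw)
print(encoding, raw.decode(encoding))
```

Decoding with the guessed encoding yields the text without the null bytes; text that is mostly non-ASCII would need a different check.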
^@ is Vim's representation of a null byte. The ^ indicates a non-printable control character, with the following ASCII character indicating which control character it is.
^@ == 0 (NUL)
^A == 1
^B == 2
...
^H == 8
^K == 11
...
^Z == 26
^[ == 27
^\ == 28
^] == 29
^^ == 30
^_ == 31
^? == 127
9 and 10 aren't escaped because they are Tab and Line Feed respectively.
32 to 126 are printable ASCII characters (starting with Space).
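The mapping follows a simple rule: the character after the caret is the control code plus 64, with 127 shown as `^?`. Here is a small Python sketch of that rule for 7-bit values only; the helper names `caret` and `render` are invented for this example and it does not claim to reproduce Vim's display code.

```python
def caret(byte):
    """Caret notation for one 7-bit value, following the list above."""
    if not 0 <= byte <= 127:
        raise ValueError("only 7-bit values are handled in this sketch")
    if byte in (9, 10):
        return chr(byte)  # Tab and Line Feed are left as they are
    if byte < 32:
        return "^" + chr(byte + 64)  # 0 -> '@', 1 -> 'A', ..., 31 -> '_'
    if byte == 127:
        return "^?"
    return chr(byte)  # 32..126 are printable ASCII

def render(data):
    """Render a byte string with control characters in caret notation."""
    return "".join(caret(b) for b in data)

print(render(b"\x00m\x00i\x00c"))
```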
I tried to put a colon in the String of the filename of a filestream.
Is it true that one can't use a colon in a TFileStream in Delphi?
And if you can, then how?
EDIT: Thanks for all the downvotes. It deserves that. In retrospect I have asked a stupid question...
On Windows, which I presume is your platform, the colon is a reserved character and so not allowed in a filename. This is documented here:
File and Directory Names
Naming Conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
...
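A check against the reserved characters quoted above can be sketched in a few lines. This is written in Python rather than Delphi, covers only the nine characters listed here (the elided rules are not modelled), and the function name `reserved_chars_in` is made up for the example.

```python
# The nine reserved characters from the quoted list.
RESERVED = set('<>:"/\\|?*')

def reserved_chars_in(name):
    """Return the reserved characters found in a single file name component."""
    return sorted(set(name) & RESERVED)

print(reserved_chars_in("report:2024?.txt"))
print(reserved_chars_in("report-2024.txt"))
```

An empty result means the name passes this particular check only; it does not prove the name is valid in every other respect.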
I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have two versions of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo Grep Version is :
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or #filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo Grep and UTF-8?
Thanks in advance
Yes, there is a difference. Windows 7 uses UTF-16 little endian internally, not UTF-8; UTF-8 is the usual choice on Unix, Linux and Plan 9, to cite a few operating systems.
Jon Skeet explains:
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
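In an ANSI code page such as Windows-1252, each of ä, ö and ü is a single byte, whereas in UTF-8 each one is a two-byte sequence. A tool that compares single bytes, which Turbo Grep appears to do, therefore matches them in the ANSI file but not in the UTF-8 file.
If you use only ASCII, both encodings are usable, but for special characters such as ä, ö, ü the file has to be in an encoding the tool understands.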
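The byte-level difference between the two files can be seen with a short Python sketch. It assumes "ANSI" means Windows-1252 here and that the search compares raw bytes; both are assumptions about this setup, not documented Turbo Grep behavior.

```python
# Show how each umlaut is stored in the two encodings.
for ch in "äöü":
    print(ch, ch.encode("cp1252").hex(), ch.encode("utf-8").hex())

# A single-byte pattern (as typed in an ANSI console) against both files.
pattern = "ä".encode("cp1252")
ansi_line = "Bär".encode("cp1252")
utf8_line = "Bär".encode("utf-8")
print(pattern in ansi_line)
print(pattern in utf8_line)
```

In Windows-1252 each umlaut is one byte; in UTF-8 each is a two-byte sequence starting with 0xC3, so a single-byte pattern has nothing to match.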
Long story short:
+ I'm using ffmpeg to check the artist name of a MP3 file.
+ If the artist has asian characters in its name the output is UTF8.
+ If it just has ASCII characters the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is if the artist has for example a "ä" in the name it is ASCII, just not US-ASCII so "ä" is not valid UTF8 and is skipped.
How can I tell whether or not the output text file from ffmpeg is UTF8 or not? The application does not have any switches and I just think it's plain dumb not to always go with UTF8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot in advance, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mask) and choose the best Encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode code points. The issue of validity is not in the character itself; it is a question of how it has been encoded...
There are many systems which can encode Unicode code points; UTF-8 is one and UTF-16 is another... "ä" is quite legal in the UTF-8 system. Actually, all characters are valid, so long as the character has a Unicode code point.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode code point system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to a file with the same data which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (i.e. 128 characters).
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
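Since the question asked for something that also runs on Windows, the same two checks can be sketched in Python, which is available there. This relies on Python's strict UTF-8 decoder instead of a hand-written pattern; the function names `is_ascii` and `is_utf8` are chosen for this example only.

```python
def is_ascii(data):
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)

def is_utf8(data):
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "plain ASCII": b"artist",
    "UTF-8 umlaut": "ä".encode("utf-8"),
    "single-byte umlaut": b"\xe4",
}
for label, raw in samples.items():
    print(label, is_ascii(raw), is_utf8(raw))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in. Note that pure ASCII data passes both checks, which matches the point above that ASCII is a subset of UTF-8.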
Why can't we use certain special characters (?, <, ...) in a Windows file name?
The fundamental rules of the Windows naming conventions, which enable applications to create and process valid names for files and directories regardless of the file system, reserve certain characters.
The following characters are reserved:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Any other character in the current code page may be used in a name, including Unicode characters and characters in the extended character set (128–255).
Because they have special meanings in the file system:
C:*.? - get all files with single-letter extensions from the C drive
: \ * ? - all have special meanings
Some characters are reserved by the operating system; for example, ? is used as a wildcard and / as a path name component separator.