FindNextFile order NTFS - winapi

The FindNextFile WinAPI function is used to list the contents of directories. Microsoft's documentation states that the order is file-system dependent, although on NTFS the names are usually returned in alphabetical order:
The order in which this function returns the file names is dependent on the file system type. With the NTFS file system and CDFS file systems, the names are usually returned in alphabetical order. With FAT file systems, the names are usually returned in the order the files were written to the disk, which may or may not be in alphabetical order. However, as stated previously, these behaviors are not guaranteed.
My application needs a defined ordering of the objects in a directory. Because the majority of Windows users use NTFS, I would like to optimize my application for that case, so I use _wcsicmp to compare names. Most of the time this is correct and the results from FindNextFile are sorted according to _wcsicmp. Sometimes, however, the results are not sorted. I thought that was natural, because FindNextFile doesn't guarantee the order and I must sort anyway (in case of another file system). Then I noticed a strange pattern: the character '_' seems to be returned after letters. A folder containing (a.txt, b.txt, _.txt) is returned in the order a, b, _, whereas _wcsicmp sorts it as _, a, b. Tested on Windows 8.1. I ran some tests and this behavior is consistent.
Can someone explain which comparison criteria NTFS uses? Or why FindNextFile returns names out of alphabetical order?

Because NTFS sort rules are not as simple as plain alphabetical order. Here is an MSDN blog article that sheds some light on the problem:
Why do NTFS and Explorer disagree on filename sorting?
One reason for this can be that NTFS captures the case mapping table at the time the drive is formatted and continues to use that table, even if the OS's case mapping tables change subsequently.
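To make the '_' observation from the question concrete, here is a minimal sketch. It assumes (as I understand the linked article) that NTFS compares names code unit by code unit after upper-casing them, while _wcsicmp is documented to compare lower-cased characters. The helper names upcase_less, lowcase_less and sorted_by are made up for this illustration, and std::towupper only stands in for the volume's own case mapping table.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>
#include <vector>

// Compare after upper-casing each character. '_' (0x5F) lies above 'A'..'Z'
// (0x41..0x5A), so names starting with '_' sort after letters.
bool upcase_less(const std::wstring& a, const std::wstring& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](wchar_t x, wchar_t y) { return std::towupper(x) < std::towupper(y); });
}

// Compare after lower-casing each character. '_' (0x5F) lies below 'a'..'z'
// (0x61..0x7A), so names starting with '_' sort before letters.
bool lowcase_less(const std::wstring& a, const std::wstring& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](wchar_t x, wchar_t y) { return std::towlower(x) < std::towlower(y); });
}

// Return a sorted copy of `names` using the given ordering.
std::vector<std::wstring> sorted_by(
    std::vector<std::wstring> names,
    bool (*less)(const std::wstring&, const std::wstring&)) {
    std::sort(names.begin(), names.end(), less);
    return names;
}
```

Sorting {a.txt, b.txt, _.txt} with upcase_less reproduces the order the question reports from FindNextFile, and lowcase_less reproduces the _wcsicmp order. If you want your own sort to agree with NTFS for plain ASCII names, comparing upper-cased characters is the closer approximation.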

You can use CompareStringEx and set the SORT_DIGITSASNUMBERS flag.
The minimum system requirement for this function is Windows Vista; the SORT_DIGITSASNUMBERS flag itself requires Windows 7.
int result = CompareStringEx(LOCALE_NAME_USER_DEFAULT, SORT_DIGITSASNUMBERS /* 0x00000008 */,
lpString1, cchCount1, lpString2, cchCount2, NULL, NULL, 0);
The return value of this function is unusual: it returns 1, 2, or 3 (or 0 on failure):
#define CSTR_LESS_THAN 1 // string 1 less than string 2
#define CSTR_EQUAL 2 // string 1 equal to string 2
#define CSTR_GREATER_THAN 3 // string 1 greater than string 2
You can also try _wcsicoll on older systems. If I recall correctly, _wcsicoll comes closer to Windows's sort order, but it still does not match it exactly.
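If the 1/2/3 result is awkward in a sort callback, subtracting 2 from a nonzero result gives the usual C runtime convention (negative, zero, positive). A small sketch follows. The constants mirror the CSTR_* values quoted above under different names so that it compiles without windows.h, and to_crt_order is a hypothetical helper, not part of any API.

```cpp
// Mirrors of CSTR_LESS_THAN, CSTR_EQUAL and CSTR_GREATER_THAN (assumed names,
// chosen so they do not clash with the real macros).
constexpr int kCstrLessThan = 1;
constexpr int kCstrEqual = 2;
constexpr int kCstrGreaterThan = 3;

// Convert a CompareStringEx-style result to -1 / 0 / +1.
// Returns false when the input is 0, which signals that the call failed;
// in that case *order is left untouched.
bool to_crt_order(int cstr_result, int* order) {
    if (cstr_result == 0) {
        return false;
    }
    *order = cstr_result - kCstrEqual;
    return true;
}
```

The failure case is kept separate on purpose: treating 0 as "less than" would silently corrupt a sort.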

Related

Why is the EntryID Changing in VSTO? The MailItem is not moving folders

I'm writing some code in C# that matches a pattern from the subject and then ingests the email. To initialize my datastore, I go through the current Microsoft.Office.Interop.Outlook.Table.
while (!table.EndOfTable)
{
Row row = table.GetNextRow();
string entryId = row["EntryID"].ToString();
this.SaveInXML(entryId, row);
}
It seems pretty simple. Well, I also have an event (Application.ItemLoad) that I'm watching, too. I notice that in the event the MailItem's EntryID is completely different than the Table's EntryID. In fact, the string lengths are not even the same (See example below). Why is this? Shouldn't they be the same? The item has not moved folders, so I'd assume it's the same. Thank you, all.
Example code:
NameSpace ns = this.Folder.Application.GetNamespace("MAPI");
var mi = ns.GetItemFromID("EF0000003E65593F1D361C44AFBFA24E6F365D6E04782F00") as MailItem;
string entryId = mi.EntryID;
System.Diagnostics.Debug.WriteLine("EF0000003E65593F1D361C44AFBFA24E6F365D6E04782F00");
System.Diagnostics.Debug.WriteLine(entryId);
// Output Produced:
// EF0000003E65593F1D361C44AFBFA24E6F365D6E04782F00
// 000000003E65593F1D361C44AFBFA24E6F365D6E0700CC348F1AD97A224B9898503750437E4700000000010C0000CC348F1AD97A224B9898503750437E470000F59160590000
//
// Notice that the second WriteLine isn't even remotely close to the EntryID that I requested.
Entry identifiers come in two types: short-term and long-term.
Short-term entry identifiers are faster to construct, but their uniqueness is guaranteed only over the life of the current session on the current workstation.
Long-term entry identifiers have a more prolonged lifespan. Short-term entry identifiers are used primarily for rows in tables and entries in dialog boxes, whereas long-term entry identifiers are used for many objects such as messages, folders, and distribution lists.
Use the MailItem.EntryID property if you need to get a long-term entry identifier.
Entry identifiers cannot be compared directly because one object can be represented by two different binary values. Use the NameSpace.CompareEntryIDs method to determine whether two entry identifiers represent the same object.
As Eugene noted, there are two kinds of entry IDs: long-term and short-term. Even long-term entry IDs can differ depending on how the item was opened. Long-term entry IDs always start with "00000000". Short-term entry IDs can only be used in the current MAPI session and therefore should not be persisted for use across sessions.
You must treat entry IDs as black boxes and never compare them directly; always use Namespace.CompareEntryIDs.
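As a rough illustration of the prefix rule mentioned above (and only that rule), here is a sketch that decides whether a hex-encoded entry ID looks long-term and is therefore a candidate for persisting. looks_long_term is a made-up helper; it is not an Outlook or MAPI API, and it says nothing about whether two IDs refer to the same item. That still needs CompareEntryIDs.

```cpp
#include <string>

// Heuristic from the answer above: a long-term entry ID, written as a hex
// string, starts with "00000000". Anything else is treated as short-term.
bool looks_long_term(const std::string& entry_id_hex) {
    return entry_id_hex.compare(0, 8, "00000000") == 0;
}
```

Applied to the question's example, the ID read from the table row (starting with EF000000) would be classified as short-term, and the one returned by MailItem.EntryID as long-term.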

What is the number prepended to the Sublime Text "Goto Anything" search?

Whenever I use the Goto Anything search in Sublime Text and start typing to search the files in my current project I get a whole bunch of results based on Sublime Text's fuzzy-search algorithm, each prepended with a number.
I assume this is some sort of score for the search "strength", but I just wanted to confirm this. What is this number based on?
It seems like the numbers are indeed representative of match strength, as you assumed.
I noticed an odd effect when testing your hypothesis, and then proceeded to create the dummy files CustomCompletions.CustomCompletions & CustomCompletions ( a file with no extension ) for further comparison.
Here are the results:
As you can see,
CustomCompletions has the highest ranking with 1524
CustomCompletions.py & CustomCompletions.todo share a rank of 1507
CustomCompletions.CustomCompletions & CustomCompletions.sublime-settings share a rank of 1490
All of the remaining files, which contain additional text in the base name, continue to receive lower rankings.
What I found odd was that the 2nd & 3rd groups had different rankings, despite sharing a base file name that exactly matches the query.
I figured that it might be due to the number of characters in the file extension, so I tested that assumption by creating the following files:
CustomCompletions.a
CustomCompletions.ab
CustomCompletions.abc
CustomCompletions.abcd
CustomCompletions.abcde
CustomCompletions.abcdef
CustomCompletions.abcdefg
CustomCompletions.abcdefgh
CustomCompletions.abcdefghi
CustomCompletions.abcdefghij
CustomCompletions.1
CustomCompletions.12
CustomCompletions.123
CustomCompletions.1234
CustomCompletions.12345
CustomCompletions.123456
CustomCompletions.1234567
CustomCompletions.12345678
CustomCompletions.123456789
CustomCompletions.1234567890
But it turns out they all ranked at 1507, the same ranking as the 2nd group.
Because of that outcome, I am still unsure which criteria affect the ranking of files that share a base name exactly matching the Goto Anything query but have different file extensions.

What do the statements in MonetDB query plan explanations mean?

I am trying to understand the query plan of MonetDB.
Is there documentation anywhere that explains what each instruction stands for?
If not, can anybody tell me what the following return:
sql.projectdelta(X_15,X_23,X_25,r1_30,X_27)
and
sql.subdelta(X_246,X_4,X_10,X_247,X_249), for example?
In my query I am sorting the result by two attributes (e.g., by A,B). Can you tell me why the second sort has more parameters than the first?
(X_29,r1_36,r2_36) := algebra.subsort(X_28,false,false);
(X_33,r1_40,r2_40) := algebra.subsort(X_22,r1_36,r2_36,false,false);
Is algebra.subsort returning (oid, columnType) pairs, or just oid?
Thank you!!
Understanding output of the explain SQL statement requires knowledge of the MonetDB Assembly-like Language (MAL).
Concerning the functions sql.projectdelta, sql.subdelta, and algebra.subsort, you'll find their signatures and a (brief) description in the MonetDB lib folder. For example:
[MonetDB_install_folder]\MonetDB5\lib\monetdb5\sql.mal for all sql functions
[MonetDB_install_folder]\MonetDB5\lib\monetdb5\algebra.mal for all algebra functions
Concerning the different number of parameters for algebra.subsort :
(X_29,r1_36,r2_36) := algebra.subsort(X_28,false,false);
is described as :
Returns a copy of the BAT sorted on tail values, a BAT that specifies
how the input was reordered, and a BAT with group information.
The input and output are (must be) dense headed.
The order is descending if the reverse bit is set.
This is a stable sort if the stable bit is set.
(X_33,r1_40,r2_40) := algebra.subsort(X_22,r1_36,r2_36,false,false);
is described as:
Returns a copy of the BAT sorted on tail values, a BAT that specifies
how the input was reordered, and a BAT with group information.
The input and output are (must be) dense headed.
The order is descending if the reverse bit is set.
This is a stable sort if the stable bit is set.
MAL functions can be overloaded based on their return value. algebra.subsort can return 1, 2, or 3 values depending on what you're asking for. Check algebra.mal for the different possibilities.
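On why the second subsort takes two extra arguments: my reading is that r1_36 and r2_36 are the order and group outputs of the first sort, and the second call only reorders rows inside each existing group. The sketch below mimics that idea on plain vectors. It is not MonetDB code; SortStep, sort_first and sort_refine are names invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Output of one sort step, loosely analogous to the two extra BATs.
struct SortStep {
    std::vector<std::size_t> order;   // order[i] = input row placed at position i
    std::vector<std::size_t> groups;  // groups[i] = group id of position i
};

// First sort: order all rows by `col`; rows with equal keys form one group.
SortStep sort_first(const std::vector<int>& col) {
    SortStep s;
    s.order.resize(col.size());
    std::iota(s.order.begin(), s.order.end(), std::size_t{0});
    std::stable_sort(s.order.begin(), s.order.end(),
                     [&](std::size_t a, std::size_t b) { return col[a] < col[b]; });
    s.groups.resize(col.size());
    for (std::size_t i = 0, g = 0; i < s.order.size(); ++i) {
        if (i > 0 && col[s.order[i]] != col[s.order[i - 1]]) ++g;
        s.groups[i] = g;
    }
    return s;
}

// Refining sort: keep the previous groups in place and order rows by `col`
// only within each group, then split groups further where `col` differs.
SortStep sort_refine(const std::vector<int>& col, const SortStep& prev) {
    SortStep s;
    s.order = prev.order;
    std::size_t begin = 0;
    while (begin < s.order.size()) {
        std::size_t end = begin;
        while (end < s.order.size() && prev.groups[end] == prev.groups[begin]) ++end;
        std::stable_sort(s.order.begin() + begin, s.order.begin() + end,
                         [&](std::size_t a, std::size_t b) { return col[a] < col[b]; });
        begin = end;
    }
    s.groups.resize(col.size());
    for (std::size_t i = 0, g = 0; i < s.order.size(); ++i) {
        if (i > 0 && (prev.groups[i] != prev.groups[i - 1] ||
                      col[s.order[i]] != col[s.order[i - 1]])) ++g;
        s.groups[i] = g;
    }
    return s;
}
```

For ORDER BY A, B you would call sort_first on A and then sort_refine on B with the first result, which matches the shape of the two subsort calls in the plan: the first has no previous order or groups to refine, the second does.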

convert case of wide characters, given the LCID (Visual C++)

I have some existing Visual C++ code where I need to add the conversion of wide character strings to upper or lower case.
I know there are pitfalls to this (such as the Turkish "I"), but most of these can be ironed-out if you know the language. Fortunately in this area of code I know the LCID value (locale ID) which I guess is the same as knowing the language.
As LCID is a Windows type, is there a Windows function that will convert wide strings to upper or lower case?
The C runtime function _towupper_l() sounds like it would be ideal but it takes a _locale_t parameter instead of LCID, so I guess it's unsuitable unless there is a completely reliable way of converting an LCID to a _locale_t.
The function you're searching for is called LCMapString and it is part of the Windows NLS APIs. The LCMAP_UPPERCASE flag maps characters to uppercase, while the LCMAP_LOWERCASE maps characters to lowercase.
For applications targeting Windows Vista and later, there is an Ex variant (LCMapStringEx) that works on locale names instead of identifiers; Microsoft now recommends locale names over LCIDs.
In fact, in the CRT implementation provided with VS 2010 (and presumably other versions as well), functions such as _towupper_l ultimately end up calling LCMapString after they extract the locale ID (LCID) from the specified _locale_t.
If you're like me, and less familiar with the i18n APIs than you should be, you probably already know about the CharUpper, CharLower, CharUpperBuff, and CharLowerBuff family of functions. These have been the old standbys from the early days of Windows for altering the case of chars/strings, but as their documentation warns:
Note that CharXxx always maps uppercase I to lowercase I ("i"), even when the current language is Turkish or Azeri. If you need a function that is linguistically sensitive in this respect, call LCMapString.
What it neglects to mention is filled in by a couple of posts on Michael Kaplan's wonderful blog on internationalization issues: What does "linguistic casing" mean?, How best to alter case. The executive summary is that you achieve the same results as the CharXxx family of functions by calling LCMapString and not specifying the LCMAP_LINGUISTIC_CASING flag, whereas you can be linguistically sensitive by ensuring that you do specify the LCMAP_LINGUISTIC_CASING flag.
Sample code:
std::wstring test(L"Does my code pass the Turkey test?"); /* wide literal needs the L prefix */
if (!LCMapStringW(lcid, /* your LCID, defined elsewhere */
LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING,
test.c_str(), /* input string */
static_cast<int>(test.length()), /* length of input string */
&test[0], /* output buffer (can reuse input) */
static_cast<int>(test.length()))) /* length of output buffer (same as input) */
{
// Uh-oh! Something went wrong in the call to LCMapString, so you need to
// handle the error somehow here.
// A good start is calling GetLastError to determine the error code.
}
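To show what the LCMAP_LINGUISTIC_CASING flag changes for Turkish without calling the API, here is a hand-rolled sketch. It only handles basic Latin letters plus the two Turkish i characters, and upper_sketch and its turkic parameter are invented for this illustration; real code should call LCMapString as shown above.

```cpp
#include <string>

constexpr char32_t kCapitalIWithDot = 0x0130;  // U+0130, capital I with dot above
constexpr char32_t kSmallDotlessI = 0x0131;    // U+0131, small dotless i

// Upper-case basic Latin letters. With `turkic` set, apply the Turkish/Azeri
// rule instead: 'i' becomes U+0130 and U+0131 becomes 'I'. Without it, 'i'
// becomes plain 'I' and U+0131 is passed through unchanged (this sketch does
// not model what Windows does with it).
std::u32string upper_sketch(const std::u32string& in, bool turkic) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (turkic && c == U'i') {
            out.push_back(kCapitalIWithDot);
        } else if (turkic && c == kSmallDotlessI) {
            out.push_back(U'I');
        } else if (c >= U'a' && c <= U'z') {
            out.push_back(static_cast<char32_t>(c - (U'a' - U'A')));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With turkic set to false the result corresponds to what the CharXxx functions do for 'i'; with it set to true you get the linguistically sensitive mapping that the flag enables for a Turkish LCID.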

How does Windows determine/handle the DOS short name of any given file?

I have a folder with these files:
alongfilename1.txt <--- created first
alongfilename3.txt <--- created second
When I run DIR /x in command prompt, I see these short names assigned:
ALONGF~1.TXT alongfilename1.txt
ALONGF~2.TXT alongfilename3.txt
Now, if I add another file:
alongfilename1.txt
alongfilename2.txt <--- created third
alongfilename3.txt
I see this:
ALONGF~1.TXT alongfilename1.txt
ALONGF~3.TXT alongfilename2.txt
ALONGF~2.TXT alongfilename3.txt
Fine. It seems to be assigning the "~#" according to the date/time that I created the file. Is this correct?
Now, if I delete "alongfilename1.txt", the other two files keep their short names.
ALONGF~3.TXT alongfilename2.txt
ALONGF~2.TXT alongfilename3.txt
When will that ID (in this case, ~1) be released for use in another shortname. Will it ever?
Also, is it possible that a file on my machine has a short name of X, whereas the same file has a short name of Y on another machine? I'm particularly concerned for installations whose custom actions utilize DOS short names.
Thanks, guys.
If I were you, I would never rely on any version of any file system driver (be it Microsoft's, be it another OS's) to be consistent about the algorithm it uses to generate short file names. The exact behavior of the Microsoft Fastfat and NTFS drivers is not "officially" documented (except as somewhat high-level overviews) and is therefore not part of the API contract. What works today might not work tomorrow if you update the driver.
In addition, there is absolutely no requirement that short names contain tilde characters - see for example this post by Raymond Chen.
There's a treasure trove of info to be found about this topic in the MSDN blogs - for example:
Registry key to force Windows to use short filenames
NTFS curiosities (Part I): Short file names
Also, do not rely on the sole presence of alphanumeric characters. Look at the Linux VFAT driver, which says, for example, that any combination of uppercase letters, digits, and the following characters is valid: $ % ' ` - @ { } ~ ! # ( ) & _ ^. NTFS will operate in compatibility mode with that...
The short filename is created together with the file. The algorithm works roughly like this (usually, but see moocha's reply):
counter = 1
stripped_filename = strip_dots(strip_non_ascii_characters(filename))
shortfn = first_6_characters(stripped_filename)
while (file_exists(shortfn + "~" + counter + "." + extension)) {
    counter = counter + 1
    if counter gained a digit, shorten shortfn by 1 character
    /* e.g. if shortf~1.txt through shortf~9.txt are all taken, try short~10.txt next */
}
This means that once the file is created, it will keep its short name until it's deleted.
As soon as the file is deleted, the short name may be used again.
If you move the file somewhere else, it may get a new short name (e.g. you're moving c:\somefilewithlongname.txt ("c:\somefi~1.txt") to d:\stuff\somefilewithlongname.txt, if there's d:\stuff\somefileelse.txt ("d:\stuff\somefi~1.txt"), the short name of the moved file will be somefi~2.txt). It seems that the short name is only persistent within a given directory on a given machine.
So: the short filenames will be generated by the filesystem, usually by the method outlined above. It is better to assume that short filenames are not persistent, as c:\longfi~1.txt on one machine might be "c:\longfilename.txt", whereas on another it might be "c:\longfish_story.txt"; also, when a file is deleted, the short name is immediately available again.
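The loop outlined above can be sketched in C++ as follows. This is a simplified model of the described behavior, not what any real driver does: it always appends ~N (even when the long name would already fit 8.3), it ignores the other edge cases mentioned in this thread, and make_short_name, clean and the taken set are names made up for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Keep printable ASCII only, drop spaces and dots, and upper-case the rest.
static std::string clean(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c > 0x20 && c < 0x7F && c != '.') {
            out.push_back(static_cast<char>(std::toupper(c)));
        }
    }
    return out;
}

// Return an 8.3-style name for `long_name` that is not already in `taken`
// (the set of short names currently used in the same directory).
std::string make_short_name(const std::string& long_name,
                            const std::set<std::string>& taken) {
    const std::size_t dot = long_name.rfind('.');
    const std::string base = clean(long_name.substr(0, dot));
    const std::string ext = dot == std::string::npos
                                ? std::string()
                                : clean(long_name.substr(dot + 1)).substr(0, 3);
    for (unsigned counter = 1;; ++counter) {
        const std::string suffix = "~" + std::to_string(counter);
        // The base shrinks as the counter gains digits, so base + suffix
        // never exceeds 8 characters.
        const std::size_t keep = suffix.size() < 8 ? 8 - suffix.size() : 0;
        std::string candidate = base.substr(0, keep) + suffix;
        if (!ext.empty()) candidate += "." + ext;
        if (taken.count(candidate) == 0) return candidate;
    }
}
```

Feeding it the three files from the question in creation order (1, 3, 2) yields ALONGF~1.TXT, ALONGF~2.TXT and ALONGF~3.TXT respectively, which matches the DIR /x listing, and removing ALONGF~1.TXT from the set makes ~1 available to the next file created.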
I believe MS-DOS stores the association between the long and the short name in a per-directory file.
It does not depend on the date/time.
If you move your files to a new directory, the numbering resets and the algorithm mentioned by Piskvor is applied again.
In the new directory (after a move), you will get:
ALONGF~1.TXT alongfilename1.txt
ALONGF~2.TXT alongfilename2.txt
ALONGF~3.TXT alongfilename3.txt
even though alongfilename2.txt was originally created third.
This link describes how NTFS does it. I would guess it's still the same idea in more recent versions.
In Windows 2000, both FAT and NTFS use the Unicode character set for their names, which contain several forbidden characters that MS-DOS cannot read. To generate a short MS-DOS-readable file name, Windows 2000 deletes all of these characters from the LFN and removes any spaces. Because an MS-DOS-readable file name can have only one period, Windows 2000 also removes all extra periods from the file name. Next, Windows 2000 truncates the file name, if necessary, to six characters and appends a tilde (~) and a number. For example, each non-duplicate file name is appended with ~1. Duplicate file names end with ~2, then ~3, and so on. After the file names are truncated, the file name extensions are truncated to three or fewer characters. Finally, when displaying file names at the command line, Windows 2000 translates all characters in the file name and extension to uppercase.
When the files are provided by a network server which is running Samba, then the short names are generated by the server, and they do not follow a predictable pattern.
So it is not safe to assume that you can predict the form of the short name.
G:\>dir /x *.txt
Directory of G:\
08/25/2009 12:34 PM 1,848 S2XYYV~1.TXT strace_output.txt
03/01/2010 05:32 PM 325,428 TEY7IH~O.TXT tomcat-dump-march-1.txt
03/11/2010 12:01 AM 5,811 DI356A~S.TXT ddmget-output.txt
01/23/2009 01:03 PM 313,880 DLA94Q~K.TXT ddm-log-fn.txt
04/20/2010 07:42 PM 7,491 A50QZP~A.TXT april-20-2010.txt

Resources