Windows Explorer sort method - algorithm

I'm looking for an algorithm that sorts strings similar to the way files (and folders) are sorted in Windows Explorer. It seems that numeric values in strings are taken into account when sorted which results in something like
name 1, name 2, name 10
instead of
name 1, name 10, name 2
which you get with a regular string comparison.
I was about to start writing this myself but wanted to check if anyone had done this before and was willing to share some code or insights. The way I would approach this would be to add leading zeros to the numeric values in the name before comparing them. This would result in something like
name 00001, name 00010, name 00002
which when sorted with a regular string sort would give me the correct result.
Any ideas?
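The zero-padding idea can be sketched in a few lines. This is a Python illustration of the approach (the padding width of 5 is an arbitrary assumption; digit runs longer than the width would still mis-sort):

```python
import re

def pad_key(name, width=5):
    # Left-pad every run of digits so that a plain string comparison
    # orders "name 2" before "name 10"
    return re.sub(r"\d+", lambda m: m.group().zfill(width), name)

names = ["name 1", "name 10", "name 2"]
print(sorted(names, key=pad_key))  # ['name 1', 'name 2', 'name 10']
```

Using the padded string only as a sort key avoids having to rename anything.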

It's called "natural sort order". Jeff had a pretty extensive blog entry on it a while ago, which describes the difficulties you might overlook and has links to several implementations.

Explorer uses the API StrCmpLogicalW() for this kind of sorting (called 'natural sort order').
You don't need to write your own comparison function, just use the one that already exists.
A good explanation can be found here.

There is StrCmpLogicalW, but it's only available starting with Windows XP, and it is only implemented as a Unicode (wide-character) function.
Some background information:
http://www.siao2.com/2006/10/01/778990.aspx

The way I understood it, Windows Explorer sorts as per your second example - it has always irritated me hugely that the ordering comes out 1, 10, 2. That's why most apps which write lots of files (like batch apps) use fixed-length filenames with leading 0's or similar.
Your solution should work, but you'd need to be careful about where the numbers appear in the filename, and probably only use your approach if they are at the very end.

Have a look at
http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting
for some source code.

I also posted a related question with additional hints and pitfalls:
Sorting strings is much harder than you thought

I posted code (C#) and a description of the algorithm here:
Natural Sort Order in C#

This is an attempt to implement it in Java:
Java - Sort Strings like Windows Explorer
In short, it splits the two strings to compare into letter and digit parts, and compares those parts in a specific way to achieve this kind of sorting.
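The letter/digit splitting approach can be sketched like this (a Python illustration of the same idea, not the linked Java code):

```python
import re

def natural_key(s):
    # Split into alternating text and digit runs; digit runs compare as numbers.
    # re.split with a capturing group keeps the digit runs in the result,
    # with text always at even indices and digit runs always at odd indices,
    # so ints are never compared against strings.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", s)]

print(sorted(["name 10", "name 2", "name 1"], key=natural_key))  # ['name 1', 'name 2', 'name 10']
```

Unlike the zero-padding trick, this handles numbers of any length.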

Related

What locale/rule gives this string sort order?

I have subscribed to a cloud drive service. All was good until I needed to check that all folders in a long list had actually been uploaded. I realized that the name-based sort order on my local system is different from the one used by the remote cloud service and its web interface. So I tried to figure out the remote sort order, in order to then use it locally as well (before that, I had tried to find a setting for the remote system, to no avail). I am totally lost. So the question is:
What rules/locale sorts the following strings in this exact order?
T. J. Smith
T. Smith
T.J. Smith
Talya
T'Amya
Tamya
(I put Talya in there to show that the sort order is "ascending", because in any reasonable sort order Talya comes before Tamya.)
I have tried different ways to sort the list of strings in the hope that one would match the cloud service's own order. This is what I tried:
In Windows 10 (with my locale!) this list is sorted as:
In my Ubuntu Nautilus I get this:
And, finally, if I put those strings in a file (sortme.txt) and call "sort" from command line in Ubuntu I get the following:
(first LC_COLLATE=C)
(second, LC_COLLATE=en_US.UTF-8)
As you can see, none of these matches the desired ordering; in particular, none matches the order of the strings "Talya", "T'Amya", "Tamya".
I would be very thankful if anybody could help me sort this out :-)
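For what it's worth, one rule that does reproduce the listed order is a case-insensitive comparison that ignores apostrophes (but not periods or spaces), with the raw string as a tiebreaker. This is only a guess inferred from these six strings, not a confirmed description of the service; a Python sketch:

```python
def guessed_key(s):
    # Primary: lowercase with apostrophes stripped, so T'Amya ties with Tamya;
    # secondary: the raw string, which puts the apostrophe form first on a tie
    return (s.lower().replace("'", ""), s)

names = ["Tamya", "T.J. Smith", "Talya", "T. Smith", "T'Amya", "T. J. Smith"]
print(sorted(names, key=guessed_key))
```

This yields exactly the order listed in the question. Treating some punctuation as ignorable at the primary level is how Unicode-collation-style sorts often behave, though a standard collation would usually ignore periods and spaces too.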
Not sure whether I should delete this question altogether. I realized that the sorting algorithm I was trying to understand is just severely bugged: it has outright wrong behavior where it puts half of a list in increasing order and the other half in decreasing order. So this order is probably just the result of a bug.
Anyway, I resorted to using the cloud storage from another OS, with another interface, which just works as expected.

Search-For Utility Mainframe Algorithm

Can someone please give me some pointers on how the IBM mainframe Search-For Utility algorithm works?
How does it compare strings? What kind of matching algorithm does it use? How should I enter different strings in order to make as few comparisons as possible?
I am using the utility but I do not know how it works, and I believe I am not using it as well as I should.
Thank you very much for your help!
Think of it as a very dumb search.
It doesn't have the capacity to accept a regex or anything like that. I don't think anyone will be able to tell you exactly what algorithm is used.
Search-For uses the SuperC program to actually perform the search. What it appears to do is search line by line for a match to the string you provided. So if I do a search for:
'PIC 9(9)'
I am going to get back results for every line that has that string in it. The only way I could bring back less search results, would be to add more to that string. So maybe search for:
'PIC 9(9).', 'PIC 9(9) VALUE', 'PIC 9(9) COMP'
any of these 3 would return fewer results than the first search. But note that if the source wraps a statement across lines, like:
05 WS-SOME-VARIABLE PIC 9(9)
VALUE 123456.
a search for 'PIC 9(9) VALUE' will not return anything, but a search for 'PIC 9(9)' would.
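The line-by-line behavior described above can be mimicked in a few lines (a Python sketch of the idea, not SuperC itself):

```python
SOURCE = """\
       05  WS-SOME-VARIABLE        PIC 9(9)
           VALUE 123456.
"""

def search_for(text, needle):
    # A plain substring match applied to each line in isolation:
    # a string that spans a line break is never found
    return [line for line in text.splitlines() if needle in line]

print(len(search_for(SOURCE, "PIC 9(9)")))        # 1 hit
print(len(search_for(SOURCE, "PIC 9(9) VALUE")))  # 0 hits: the string spans two lines
```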
The more specific you are, the fewer search results you will get back. Depending on what you are looking for, you may be able to get better results by using Search-For in batch, or by using File-Aid instead. Every specific scenario is different, so without knowing exactly what you are searching for and what your requirement is, it's hard to tell you how to proceed.
You might consider IBM Developer for z/OS (IDz), which can do regular-expression-based searches. When the Remote Systems Explorer Daemon (RSED) is set up and running on the z/OS LPAR, you can do searches across a single PDS or groups of PDSs using IDz filters. Very powerful. It also searches in the background so you can do other tasks while it searches. The searches can be saved for future ease of reference.

Tips on writing a duplicate file finder in F#

I am new to programming and F# is my first .NET language as well as my first functional language. As a beginner's project, I would like to try implementing my own duplicate file finder, and I am looking for some tips on the F# tools that are relevant to my project. I apologise in advance if my question doesn't meet StackOverflow's standards: I will gladly make changes to it as required.
Here is the rough idea I have come up with: I will retrieve all files from a desired folder, read the file contents into byte arrays, and then use a hash table to store the byte arrays and remove duplicates. Could more experienced programmers tell me whether this is a good approach? What improvements can I make? Additionally, as asked earlier, what are the relevant F# tools to look at? MSDN has a huge list of libraries and namespaces and so on, and it is really overwhelming for a newbie like me.
Thank you warmly in advance for your help!
I'd recommend starting with a console application.
There are a couple of relevant .Net APIs:
System.IO.Directory.GetFiles
System.IO.Directory.EnumerateFiles
GetFiles returns an easy-to-use array of all file paths but blocks until all files are found, whereas EnumerateFiles lets you enumerate the files one by one and give feedback to the user.
For performance when finding duplicates, the file length can be used to find potential duplicates before comparing the data. Here you could use the Length property of System.IO.FileInfo.
If you create a sequence of tuples of file name and file length, you can use Seq.groupBy to group potential matches. Finally, for groups of two or more, you can read the files and compare the bytes to find exact duplicates.
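The size-then-content plan above looks like this in outline. This is a Python sketch rather than F#, and it compares by SHA-256 hash instead of raw bytes (collisions are astronomically unlikely, but a final byte-by-byte check could confirm):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    # Pass 1: group by size -- files of different sizes can't be duplicates
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    # Pass 2: hash only the files whose size matched at least one other file
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

The same two-pass shape translates directly to F# with Seq.groupBy.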

Is it possible to design our own algorithm to create unique GUIDs?

GUIDs are generated from a combination of numbers and letters, separated by hyphens,
e.g. {7B156C47-05BC-4eb9-900E-89966AD1430D}
In Visual Studio, we have the 'Create GUID' tool to create one. I expect the same can be done programmatically through Windows APIs.
How are GUIDs made to be unique? Why don't they use any special characters like # or ^?
Also, is it possible to design our own algorithm to create unique GUIDs?
Yes, it is possible; however, you shouldn't try to reinvent the wheel without a good reason to do so. GUIDs are made unique by including elements which (statistically speaking) are very unlikely to be the same in two distinct cases, such as:
Current timestamp
The uptime of the system
The MAC address of the system
A random number
...
Also, consider the privacy implications if you implement GUIDs from scratch, because they can contain the above data, which some people deem sensitive.
UUIDs are defined e.g. in http://www.faqs.org/rfcs/rfc4122.html. One problem with using your own algorithm to generate something like a UUID is that you could collide with other people's UUIDs. So you should definitely use somebody else's implementation if at all possible and, if not, write your own implementation of one of the standard algorithms.
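Most platforms ship RFC 4122 implementations, so "writing your own" usually means calling one line of the standard library. A Python illustration of two of the standard versions:

```python
import uuid

# Version 1: timestamp + clock sequence + node (usually the MAC address) --
# unique by construction, but leaks when and where it was generated
u1 = uuid.uuid1()

# Version 4: 122 random bits -- what most systems generate today
u4 = uuid.uuid4()

print(u1, "version", u1.version)
print(u4, "version", u4.version)
```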
Just to answer this part: why don't they use any special characters like # or ^?
A GUID is a 128-bit integer, so the common representation is simply 32 hexadecimal digits.
You could just as well write the same value as 128 ones and zeros.
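The different representations of the same 128-bit value, using the GUID from the question:

```python
import uuid

u = uuid.UUID("7B156C47-05BC-4eb9-900E-89966AD1430D")
print(u)                 # canonical hyphenated form
print(u.hex)             # the same value as 32 hex digits
print(u.int)             # the underlying 128-bit integer
print(f"{u.int:0128b}")  # ...or as 128 ones and zeros
```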
As for the rest of the question, wiki has good answers.
You should create new GUIDs by calling the API designed for it. In native Windows land it is CoCreateGuid; in .NET land it is System.Guid.NewGuid().
Yes, although it's easier to make use of someone else's implementation, e.g. Boost.Uuid.

Eliminating code duplication in a single file

Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
Thanks in advance.
Edit:
Thanks for all the great tools! I'll definitely check them out.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.
Check out Atomiq. It finds duplicate code that is a prime candidate for extracting to one location.
http://www.getatomiq.com/
If you're using Eclipse, you can use the copy paste detector (CPD) https://olex.openlogic.com/packages/cpd.
You don't say what language you are using, which is going to affect what tools you can use.
For Python there is CloneDigger. It also supports Java but I have not tried that. It can find code duplication both with a single file and between files, and gives you the result as a diff-like report in HTML.
See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different sequences of statements.
The CloneDR handles many languages, including Java (1.4, 1.5, 1.6) and C# up to C# 4.0. You can see sample clone-detection reports at the website, including one for C#.
Resharper does this automagically - it suggests when it thinks code should be extracted into a method, and will do the extraction for you
Check out PMD; once you have configured it (which is fairly simple), you can run its copy-paste detector to find duplicate code.
Anyone with some Office skills can do the following sequence in a minute:
use an ordinary formatter to unify the code style, preferably without line wrapping
feed the code text into Microsoft Excel as a single column
search and replace all double spaces with a single one, and make other similar replacements
sort the column
At this point the duplicated lines will already stand out. To go further:
add a comparator formula to a 2nd column and a counter to a 3rd
copy and paste values again, sort, and see the most repetitive lines
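The same spreadsheet recipe can be scripted. A rough sketch (it normalizes whitespace only, so duplicates with renamed variables will still be missed, which the dedicated tools above handle):

```python
import re
from collections import Counter

def repeated_lines(source, min_length=10, min_count=2):
    # Normalize whitespace so formatting differences don't hide repeats,
    # then count identical lines and list the most repetitive first
    normalized = (re.sub(r"\s+", " ", line).strip() for line in source.splitlines())
    counts = Counter(l for l in normalized if len(l) >= min_length)
    return [(line, n) for line, n in counts.most_common() if n >= min_count]
```

Short lines (below min_length) are skipped because trivial lines like closing braces repeat constantly without being real copy-paste.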
There is an analysis tool called Simian, which I haven't tried yet. Supposedly it can be run on any kind of text and points out duplicated items. It can be used via a command-line interface.
Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd
