What locale/rule gives this string sort order?

I have subscribed to a cloud drive service. All was good until I needed to check that every folder in a long list had actually been uploaded. I realized that the name-based sort order on my local system is different from the one used by the remote cloud service and its web interface. So I tried to figure out what the remote sort order is, so that I could use it locally as well (before that, I had looked for a setting on the remote system, to no avail). I am totally lost. So the question is:
What rules/locale sorts the following strings in this exact order?
T. J. Smith
T. Smith
T.J. Smith
Talya
T'Amya
Tamya
(I put Talya in there to show that the sort order is "ascending", because in any reasonable sort order Talya comes before Tamya.)
I have tried different ways to sort the list of strings in the hope that one would match the cloud service's own order. This is what I tried:
In Windows 10 (with my locale!) this list is sorted as:
In my Ubuntu Nautilus I get this:
And, finally, if I put those strings in a file (sortme.txt) and call "sort" from command line in Ubuntu I get the following:
(first LC_COLLATE=C)
(second, LC_COLLATE=en_US.UTF-8)
As you can see, none of these matches the desired ordering; in particular, none matches the order of the strings "Talya", "T'Amya", "Tamya".
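The trial-and-error approach described above can also be automated. The following is a minimal sketch, not a statement about what the cloud service does: it checks a few candidate sort keys against the target order. The three key functions are hypothetical candidates chosen for illustration, and the last one (ignore apostrophes, compare case-insensitively, break ties by the raw string) merely happens to reproduce these six strings.

```python
# Target order as listed in the question.
TARGET = [
    "T. J. Smith",
    "T. Smith",
    "T.J. Smith",
    "Talya",
    "T'Amya",
    "Tamya",
]

def codepoint_key(s):
    # Plain code point order, similar in spirit to LC_COLLATE=C.
    return s

def casefold_key(s):
    # Case-insensitive code point order.
    return s.casefold()

def ignore_apostrophe_key(s):
    # Hypothetical rule: drop apostrophes, compare case-insensitively,
    # and break ties with the raw string.
    return (s.replace("'", "").casefold(), s)

def reproduces_target(key):
    # Sort a reversed copy so that a key which does nothing cannot pass.
    return sorted(reversed(TARGET), key=key) == TARGET

for key in (codepoint_key, casefold_key, ignore_apostrophe_key):
    print(key.__name__, reproduces_target(key))
```

A match on six strings proves very little; a longer list of names from the service would be needed before trusting any candidate rule.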
I would be very thankful if anybody could help me sort this out :-)

Not sure whether I should delete this question altogether. I realized that the sorting algorithm I was trying to understand is just severely bugged: it has outright wrong behavior where it puts half of a list in increasing order and the other half in decreasing order. So this order is probably just the result of a bug.
Anyway, I resorted to using the cloud storage from another OS, with another interface, which just works as expected.

Related

Search-For Utility Mainframe Algorithm

Can someone please give me some pointers on how the IBM mainframe Search-For Utility algorithm works?
How does it compare strings? What kind of matching algorithm does it use? How should I enter different strings in order to make as few comparisons as possible?
I am using the utility but I do not know how it works, and I believe I am not using it as well as I should.
Thank you very much for your help!
Think of it as a very dumb search.
It doesn't have the capacity to enter a REGEX or anything like that. I don't think anyone will be able to tell you what algorithm is used.
Search-For uses the SuperC program to actually perform the search. What it appears to do is search line by line for a match to the string you provided. So if I do a search for:
'PIC 9(9)'
I am going to get back results for every line that has that string in it. The only way I could bring back fewer search results would be to add more to that string. So maybe search for:
'PIC 9(9).' 'PIC 9(9) VALUE' 'PIC 9(9) COMP'
Any of these three would return fewer results than the first search. However, if that string is broken across lines, like:
05 WS-SOME-VARIABLE PIC 9(9)
VALUE 123456.
a search for 'PIC 9(9) VALUE' will not return anything, but a search for 'PIC 9(9)' would.
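The line-by-line behavior described above can be illustrated with a short sketch. This is only a model of the observed behavior, assuming a plain substring test per line; it is not SuperC's actual algorithm, and `search_lines` is a hypothetical helper name.

```python
def search_lines(lines, needle):
    """Return (line_number, line) pairs for every line containing needle."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]

# The two-line declaration from the example above.
source = [
    "05 WS-SOME-VARIABLE PIC 9(9)",
    "VALUE 123456.",
]

# Found: the whole search string sits on line 1.
print(search_lines(source, "PIC 9(9)"))

# Not found: the search string would have to span lines 1 and 2.
print(search_lines(source, "PIC 9(9) VALUE"))
```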
The more specific you are, the fewer search results you will get back. Depending on what you are looking for, you may be able to get better results by using Search-For in batch, or using File-Aid instead. Every specific scenario is different. So without knowing exactly what you are searching for and what your requirement is, it's hard to tell you how to proceed.
You might consider IBM Developer for z, which can do regular-expression-based searches. When the Remote Systems Explorer Daemon (RSED) is set up and running on the z/OS LPAR, you can search across a single PDS or groups of PDSs using IDz filters. Very powerful. It also searches in the background, so you can do other tasks while it searches. The searches can be saved for future reference.

Sequential browsing of an elasticsearch index

I am building a system that uses Elasticsearch to store and retrieve library catalogue data. One thing I've been asked for is a browse interface.
Here's a definition of what this is:
The user does a search, for example "Author starts with" and they
supply "Smith"
The system puts them into the middle of a list of authors, at or near
the position of the first one that starts with "Smith", so they might
see:
Smart, Murray
Smart, Murray J.
Smeaton, Duncan
Smieliauskas, Wally
Smillie, John
Smith Milway, Katie <-- this being the first actual search result
Smith, A. M. C.
Smith, Andrew
Smith, Andrew M. C.
etc.
The one with the marker is the one actually searched for, but you can see the ones around it according to the sort order, including ones that don't actually match the query.
These will be paged, with about 20 results per page. If the user pages back, they head towards the start of the alphabet; if they page forwards, they go onward.
Each result shown will have a count beside it showing how many results (i.e. catalogue items) are associated with that author.
Clicking on a result takes you to everything by that author (this and everything beyond it is fairly easy and mostly implemented already.)
I'm wondering if anyone has any good ideas on how to approach this. At this stage, I don't care too much about handling searches that aren't "field starts with" searches, as exactly how that will be done is currently up in the air and I'll deal with it when the time comes.
Here's what I'm thinking, but there are serious issues with it:
All the fields that are going to be browsed are faceted
I get a list of all the facets for that field, search through it to find the starting point, and handle the paging manually in code.
This has the big problem that I might be fetching hundreds of thousands of terms and processing them, which won't be quick.
In retrospect, it's no different from loading all the values into their own index and fetching them all in sorted order.
I'm open to any options here, whether I can somehow jump into the middle of a large set of facets like the query "from" field, or if I should instead put everything into another index specifically for this purpose (though I don't know how I'd structure and query it), or something else.
From what I can see, my ideal solution would be that I can specify the facet field, tell ES that I want to start at the one that starts with "Smith", and it displays from around there, then I have the ability to say "go 20 back", but I'm not sure that this is possible.
You can see an example of the sort of thing I'm talking about in action here: http://hollisclassic.harvard.edu/ - put in Smith as "Author (last name first)", and it gives you a (terribly ugly looking) browse list.
Any thoughts?
On:
The one with the marker is the one actually searched for, but you can
see the ones around it according to the sort order, including ones
that don't actually match the query.
I had a similar requirement: "Show the user how many records we would have found if the search-conditions were more relaxed".
I solved this by doing two searches (one exact, one more relaxed), as the performance of ES is so good that doing one or two searches does not matter. The time gets eaten up in the displaying (in my case) and not in the search.
Still, you would need to merge these two results in your application to generate one list to display.
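The browse behavior described in the question (jump to the first term at or after the search prefix, then show the neighbours on both sides) can be sketched on an in-memory sorted list. This is only an illustration of the desired result under the assumption that the terms are already sorted in plain code point order; it does not use or imply any Elasticsearch API, and `browse` and its parameters are hypothetical names.

```python
from bisect import bisect_left

def browse(sorted_terms, prefix, before=5, after=4):
    """Return (window, marker): a slice of terms around the first term
    that sorts at or after prefix, and the marker's index in that slice."""
    position = bisect_left(sorted_terms, prefix)
    start = max(0, position - before)
    window = sorted_terms[start:position + after]
    return window, position - start

# Author list from the question, already in sorted order.
authors = [
    "Smart, Murray",
    "Smart, Murray J.",
    "Smeaton, Duncan",
    "Smieliauskas, Wally",
    "Smillie, John",
    "Smith Milway, Katie",
    "Smith, A. M. C.",
    "Smith, Andrew",
    "Smith, Andrew M. C.",
]

window, marker = browse(authors, "Smith")
for index, name in enumerate(window):
    print(name, "<--" if index == marker else "")
```

Paging back or forward then amounts to moving the slice boundaries by one page size, which is cheap only if the sorted terms can be addressed by position.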

Why are files returned by a For Each loop sorted, but not always?

I'm not sure if this is the correct place to post this question because I have a hunch that the behavior I witness will also be observed using other methods. But anyway, here it goes.
I have a VBscript that contains code like this:
For Each objFile In colFiles
...
Next
I've been running this code for quite some time on many different systems. I never bothered to order the files alphabetically. But today I found out by accident that the logic of my program depends on it. I ran the code on a new system (under Citrix) and the files were returned in a seemingly random order.
Does anybody know why Windows sometimes returns the files sorted alphabetically while sometimes it doesn't?
Added note: It might be relevant to note that the script as well as the input folder are on a network share (where my script outputs randomly ordered files).
Ordering is not supported for FileSystemObject. See KB 189751 http://support.microsoft.com/kb/189751/en-us
Also check out an answer on how to deal with that on SO Order of Files collection in FileSystemObject
The docs do not specify an ordering. Thus, you cannot depend on it to have an order. The Files property needs to ask the underlying file system for the files, and then gives them to you as is, without any processing. If that file system happens to return the files in order, that's great. If not, you'll have to sort them. Regardless of whether they come back in order, you should always sort them yourself if you expect a certain order, because the implementation may change tomorrow (as you've just witnessed).
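The same advice applies outside VBScript. As an illustration in Python (not the asker's language), `os.scandir` likewise yields entries in arbitrary order, so a minimal sketch of "always sort explicitly" could look like this. The helper name `list_files_sorted` and the case-insensitive ordering are assumptions made for the example.

```python
import os
import tempfile

def list_files_sorted(folder):
    """Return the names of the files in folder, sorted case-insensitively."""
    names = [entry.name for entry in os.scandir(folder) if entry.is_file()]
    # Never rely on the order the file system happens to return.
    return sorted(names, key=str.casefold)

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.txt", "A.txt", "c.txt"):
        open(os.path.join(folder, name), "w").close()
    demo_result = list_files_sorted(folder)

print(demo_result)
```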
It depends on what data structure you are looping through.
You will obviously get a different order if you use a foreach loop over an array versus a hash set, for example.
Personally, I don't know anything about VB. But it does work this way in C#.

LDAP Syntax/Semantics: Filter vs. Base DN?

This is probably pretty stupid, but I'm still green to LDAP. So I hope someone can lend me a hand.
I am using Apache Directory Studio to do my searches and I am confused about when I should be using a filter or when I should be breaking up my filter into two, using one part as the filter and the other as my search base.
Here's an example where I'm trying to filter out a group.
Filter: CN=JohnTestGroup,OU=TECH,DC=lab,DC=ing
Base: DC=lab,DC=ing
This yielded zero results. I realized that perhaps I am being redundant as part of the base is in the filter, so I got rid of that part in the filter.
Filter: CN=JohnTestGroup,OU=TECH
Base: DC=lab,DC=ing
This still did not yield anything. So I tried this:
Filter: CN=JohnTestGroup
Base: OU=TECH,DC=lab,DC=ing
I moved the OU parameter into the Base. This worked, but I don't understand why the first or second attempts didn't. Someone care to drop some knowledge on me?
This is probably a matter of syntax/semantics, so if anyone could point me to a resource, I'd be more than willing to read more about it.
Read about Scopes there: http://www.idevelopment.info/data/LDAP/LDAP_Resources/SEARCH_Setting_the_SCOPE_Parameter.shtml
If you set your search scope to SUBTREE, variants 2 and 3 (and possibly 1) should start to work, but searching by subtree is slower.
I think you are misunderstanding how the filter works. It is meant to be key=value pairings.
So (objectClass=iNetOrgPerson) as an example.
If you wish a filter to find a DN, then you pick an identifying characteristic like CN, and filter on (CN=JohnTestGroup) or perhaps (mail=John#mail.net).
The base tells the LDAP server where to start looking. As seriyPS notes in his/her answer, the SCOPE is the next question: how deep should the server search? A deeper search adds overhead and can cause performance issues. Subtree is the simplest conceptually: just keep looking from here down, until you run out of tree to look through.
That is why your last one works.
Now, if you want to find a specific object and you know its DN, you do an ENTRY scope query for the base of the specific DN.
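The interplay of base, scope, and filter can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not real LDAP client code: the directory contents, the `search` function, and the scope names "base", "one", and "sub" are made up for the example, the filter is reduced to a single attribute=value test, and DNs are split naively on commas (escaped commas are not handled).

```python
def parent(dn):
    # Everything after the first RDN; naive split, no escaped commas.
    return dn.split(",", 1)[1] if "," in dn else ""

def in_scope(dn, base, scope):
    dn, base = dn.lower(), base.lower()
    if scope == "base":   # only the base entry itself
        return dn == base
    if scope == "one":    # direct children of the base
        return parent(dn) == base
    if scope == "sub":    # the base and everything below it
        return dn == base or dn.endswith("," + base)
    raise ValueError(f"unknown scope: {scope}")

def search(directory, base, scope, attr, value):
    """Return the DNs in scope whose attribute equals value (case-insensitive)."""
    return [
        dn
        for dn, attrs in directory.items()
        if in_scope(dn, base, scope)
        and attrs.get(attr, "").lower() == value.lower()
    ]

GROUP_DN = "CN=JohnTestGroup,OU=TECH,DC=lab,DC=ing"
directory = {
    "DC=lab,DC=ing": {"dc": "lab"},
    "OU=TECH,DC=lab,DC=ing": {"ou": "TECH"},
    GROUP_DN: {"cn": "JohnTestGroup"},
}

# One level below DC=lab,DC=ing there is only the OU, so nothing matches.
print(search(directory, "DC=lab,DC=ing", "one", "cn", "JohnTestGroup"))
# Moving the OU into the base puts the group one level below the base.
print(search(directory, "OU=TECH,DC=lab,DC=ing", "one", "cn", "JohnTestGroup"))
# A subtree search from the top finds it too.
print(search(directory, "DC=lab,DC=ing", "sub", "cn", "JohnTestGroup"))
# A DN fragment is not a cn value, so this filter matches nothing.
print(search(directory, "DC=lab,DC=ing", "sub", "cn", "JohnTestGroup,OU=TECH"))
```

In this model the filter only ever compares an attribute with a value, while the position in the tree is expressed entirely through base and scope.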

Windows Explorer sort method

I'm looking for an algorithm that sorts strings similar to the way files (and folders) are sorted in Windows Explorer. It seems that numeric values in strings are taken into account when sorted which results in something like
name 1, name 2, name 10
instead of
name 1, name 10, name 2
which you get with a regular string comparison.
I was about to start writing this myself but wanted to check if anyone had done this before and was willing to share some code or insights. The way I would approach this would be to add leading zeros to the numeric values in the name before comparing them. This would result in something like
name 00001, name 00010, name 00002
which when sorted with a regular string sort would give me the correct result.
Any ideas?
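The zero-padding idea from the question can be sketched as a sort key. This is a minimal illustration, assuming no number in a name has more digits than the chosen width; `pad_numbers` and the default width of 5 are hypothetical choices.

```python
import re

def pad_numbers(name, width=5):
    """Left-pad every run of digits in name with zeros up to width."""
    return re.sub(r"\d+", lambda match: match.group().zfill(width), name)

names = ["name 1", "name 10", "name 2"]

# The padded keys are compared as ordinary strings; the original names are returned.
print(sorted(names, key=pad_numbers))
```

Numbers longer than the width would still sort incorrectly, which is one reason to compare the numeric parts as integers instead.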
It's called "natural sort order". Jeff had a pretty extensive blog entry on it a while ago, which describes the difficulties you might overlook and has links to several implementations.
Explorer uses the API StrCmpLogicalW() for this kind of sorting (called 'natural sort order').
You don't need to write your own comparison function, just use the one that already exists.
A good explanation can be found here.
There is StrCmpLogicalW, but it's only available starting with Windows XP and only implemented as Unicode.
Some background information:
http://www.siao2.com/2006/10/01/778990.aspx
The way I understood it, Windows Explorer sorts as per your second example - it's always irritated me hugely that the ordering comes out 1, 10, 2. That's why most apps which write lots of files (like batch apps) always use fixed length filenames with leading 0's or whatever.
Your solution should work, but you'd need to be careful where the numbers were in the filename, and probably only use your approach if they were at the very end.
Have a look at
http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting
for some source code.
I also posted a related question with additional hints and pitfalls:
Sorting strings is much harder than you thought
I posted code (C#) and a description of the algorithm here:
Natural Sort Order in C#
This is an attempt to implement it in Java:
Java - Sort Strings like Windows Explorer
In short, it splits the two strings to compare into letter and digit parts and compares these parts in a specific way to achieve this kind of sorting.
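The split-into-parts idea can be sketched in a few lines. This is a generic natural-sort key, not the linked Java code and not a reimplementation of StrCmpLogicalW; `natural_key` is a hypothetical name, and the case-insensitive comparison of the text parts is an assumption.

```python
import re

def natural_key(text):
    """Split text into alternating non-digit and digit parts.

    re.split with a capturing group always yields text parts at even
    indexes and digit runs at odd indexes, so keys from different strings
    compare str with str and int with int.
    """
    parts = re.split(r"(\d+)", text)
    return [
        int(part) if index % 2 else part.casefold()
        for index, part in enumerate(parts)
    ]

names = ["name 10", "name 2", "name 1", "file10.txt", "file2.txt"]
print(sorted(names, key=natural_key))
```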
