sorting and splitting in a DFSORT together?

Input file layout:
Position 01, length 10 - 10-digit account number
Position 53, length 01 - an indicator with values 'Y' or 'N'
Position 71, length 10 - timestamp
(The rest of the fields are insignificant for this sort.)
Sorting the input file while splitting it and eliminating duplicates can be done in two ways, but the two ways produce different results. I want to know why.
Case I: Splitting and eliminating duplicates in the same step.
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,
INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
OUTFIL FILES=02,
INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))
Case II: Splitting and eliminating duplicates in two different steps.
STEP:01
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
STEP:02
SORT FIELDS=COPY
OUTFIL FILES=01,
INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
OUTFIL FILES=02,
INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))
These two approaches produce different output. Do you see any difference between the two cases? Please clarify.

You are asking to sort on an Account Number (10 characters ascending) then on an Indicator (1 character ascending).
These two fields alone determine the key of the record; the Timestamp is not part of the sort key. Consequently, if there
are two or more records with the same key, the sort may place them in any (random) order, and there is no telling
in what order the Timestamp values will appear.
Keeping the above in mind, consider what happens when you have two records with the same key but different
Timestamp values. One of these Timestamp values meets the given INCLUDE criteria and the other one doesn't.
The SUM FIELDS=NONE parameter asks to remove duplicates based on the key. It does this by grouping
all of the records with the same key together and then selecting the last one in the group. Since the key
does not include the Timestamp, the chosen record is essentially random. Consequently, it is unpredictable
whether you get the record that meets the subsequent INCLUDE condition.
There are a couple of ways to fix this:
Add the Timestamp to the sort key. This might not work because it may leave multiple records for the same Account Number/Indicator pair; that is, it may break your duplicate-removal requirement.
Request a stable sort.
A stable sort causes records having the same sort key to maintain their same relative positions after the sort.
This will preserve the original order of the Timestamp values in your file for records with the same key. When duplicates are removed, DFSORT will choose the last record from the set of duplicates. This should bring the predictability you are looking for to the duplicate-elimination process. Specify
a stable sort by adding an OPTION EQUALS control card before the SORT card.
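Applied to the Case I step from the question, the control cards would look like this (a sketch; only the OPTION card is new, the rest is unchanged):
OPTION EQUALS
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,
INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
OUTFIL FILES=02,
INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))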
EDIT Comment: ...picks the VERY FIRST record
The book I based my original answer on clearly stated that the last record in a group of records with the same
key would be selected when SUM FIELDS=NONE is specified. However, it is always
best to consult the vendor's own manuals. IBM's DFSORT Application Programming Guide states only
that one record with each key will be selected. However,
it also has the following note:
The FIRST operand of ICETOOL's SELECT operator can be used to perform the same
function as SUM FIELDS=NONE with OPTION EQUALS. Additionally, SELECT's FIRSTDUP,
ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, and LAST operands can be
used to select records based on other criteria related to duplicate and non-duplicate
keys. SELECT's DISCARD(savedd) operand can be used to save the records discarded by
FIRST, FIRSTDUP, ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, or
LAST. See SELECT Operator for complete details on the SELECT operator.
Based on this information I would suggest using ICETOOL's SELECT operator to select the correct record.
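For instance, a minimal ICETOOL sketch of that suggestion (the DDNAMEs IN and OUT are placeholders, not from the question; FIRST, with SELECT's default EQUALS behaviour, keeps the first record of each Account Number/Indicator key):
SELECT FROM(IN) TO(OUT) ON(1,10,CH) ON(53,1,CH) FIRST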
Sorry for the misinformation.

The problem is as NealB identified.
The easiest thing to do is to "get rid of" the records you don't want by date before the SORT. The SORT will take less time. This assumes that SORTOUT is not required. If it is, you have to keep your INCLUDE= on the OUTFILs.
SELECT is a good option. SELECT uses OPTION EQUALS by default. The control cards below can be included in an xxxxCNTL dataset and actioned from the SELECT with USING(xxxx). SELECT gives you greater flexibility than SUM (you can get the last record, amongst other things).
The whole task sounds flawed. If there are records per account with different dates, I'd expect either the first date or the last date, or something else specific, to be required, not just whatever record happens to be hanging around at the end of the SUM.
OPTION EQUALS
INCLUDE COND=(71,10,CH,GT,&DATE2(-))
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,
INCLUDE=(53,01,CH,EQ,C'Y')
OUTFIL FILES=02,
INCLUDE=(53,01,CH,EQ,C'N')
Or, if the Y/N cover all records:
OUTFIL FILES=02,SAVE

Related

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that the i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what it means, and I've failed miserably. Can somebody please help?
Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
you sorted them:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store the original in a separate location, so the original sequence is lost. Moreover, the original order is random, so it would be problematic or impossible to restore.
This article gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD,4)
And after sorting we have:
(itemA,3), (itemB,2), (itemC,1), (itemD,4)
So we can easily restore the initial order by sorting on the additional field.
Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm is suggesting you to add an integer field, so it looks like this:
(some data type) data,
int position
And then, you fill the positions with the actual index. Something like this pseudocode:
for current: 0 to lastElement
array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (resort it) for whatever you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
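Here is a minimal Python sketch of the whole round trip, a concrete version of the pseudocode above (positions are zero-based here):
items = ["itemC", "itemB", "itemA", "itemD"]

# Decorate: the i-th record carries its original position i.
records = [(item, i) for i, item in enumerate(items)]

# Permute for some application (here: sort by item value).
records.sort(key=lambda rec: rec[0])
# records is now [('itemA', 2), ('itemB', 1), ('itemC', 0), ('itemD', 3)]

# Restore the initial order by sorting on the carried position.
records.sort(key=lambda rec: rec[1])
print([item for item, _ in records])  # ['itemC', 'itemB', 'itemA', 'itemD']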
Well, basically it's saying that you need some way to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's informative answer here).
Another option to invert the permutation would require fewer processing steps, but more memory. More specifically, each node in your input set would have an extra ID field, and all the elements in this input set are sorted based on this field. Once you apply the permutation, it's obvious that the IDs are no longer in sorted order. If you wish to invert the permutation, all you have to do is sort the list again based on this field.

Key assignment scheme for sorting rows in table

I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and then renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible (permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, and in the worst (most pathological) case allowing 25-50 such moves).
And the keys produced should be easily sorted, ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line-numbering problem when writing code in BASIC. What most people did in that situation was take an educated guess at how many lines might be inserted between two lines. That guess would then be the spacing between those lines. So if you think you might have 2000 inserts between two elements, you might give element1 a key of 2000 and element2 a key of 4000. Then, when you want to put an element between element1 and element2, you either naively split the difference (3000) or, if you have some intuition about how many elements would go on each side of element3, you weight it (i.e., 3500 instead of 3000).
Another alternative (it's really just the same thing with a different numbering system) is to use floating-point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5. Between 1.5 and 2 would be 1.75. Between 1.5 and 1.75 would be 1.625, etc.
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer keys rather than floating-point keys if you can help it.
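To make the gap-numbering idea concrete, here is a small Python sketch (the gap size and names are illustrative, not from the question):
GAP = 2000  # educated guess at how many inserts may land between two rows

def key_between(low, high):
    # Integer key for a row moved between two existing rows; returns
    # None when the gap is exhausted and the keys need renumbering.
    mid = (low + high) // 2
    return mid if low < mid < high else None

keys = [2000, 4000, 6000]                # initial keys, spaced GAP apart
keys.insert(2, key_between(4000, 6000))  # 5000 sorts between rows 2 and 3
Since each insert halves the available gap, a gap of 2000 tolerates about ten consecutive inserts between the same two rows before a renumbering ("normalizing") pass is needed.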
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key, its next node, and whatever other data you want. Insert items sequentially; when you need to put a new item in between, simply swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since you have order already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.

multicolumn index column order

I've been told, and read everywhere (though no one dared to explain why), that when composing an index on multiple columns I should put the most selective column first, for performance reasons.
Why is that?
Is it a myth?
I should put the most selective column first
According to Tom, column selectivity has no performance impact for queries that use all the columns in the index (it does affect Oracle's ability to compress the index).
it is not the first thing, it is not the most important thing. sure, it is something to consider but it is relatively far down there in the grand scheme of things.
In certain strange, very peculiar and abnormal cases (like the above with really utterly skewed data), the selectivity could easily matter HOWEVER, they are
a) pretty rare
b) truly dependent on the values used at runtime, as all skewed queries are
so in general, look at the questions you have, try to minimize the indexes you need based on that.
The number of distinct values in a column in a concatenated index is not relevant when considering
the position in the index.
However, these considerations should come second when deciding on index column order. More important is to ensure that the index can be useful to many queries, so the column order has to reflect the use of those columns (or the lack thereof) in the WHERE clauses of your queries (for the reason illustrated by AndreKR).
HOW YOU USE the index -- that is what is relevant when deciding.
All other things being equal, I would still put the most selective column first. It just feels right...
Update: Another quote from Tom (thanks to milan for finding it).
In Oracle 5 (yes, version 5!), there was an argument for placing the most selective columns first
in an index.
Since then, it is not true that putting the most discriminating entries first in the index
will make the index smaller or more efficient. It seems like it will, but it will not.
With index
key compression, there is a compelling argument to go the other way since it can make the index
smaller. However, it should be driven by how you use the index, as previously stated.
You can omit columns from right to left when using an index, i.e. when you have an index on col_a, col_b you can use it in WHERE col_a = x but you cannot use it in WHERE col_b = x.
Imagine a telephone book that is sorted by first name and then by last name.
At least in Europe and the US, first names have a much lower selectivity than last names, so looking up the first name wouldn't narrow the result set much, and there would still be many pages to check for the correct last name.
The ordering of the columns in the index should be determined by your queries, not by any selectivity considerations. If you have an index on (a,b,c), and most of your single-column queries are against column c, followed by a, then put them in the order of c,a,b in the index definition for the best efficiency. Oracle prefers to use the leading edge of the index for the query, but can use other columns in the index in a less efficient access path known as skip-scan.
The more selective your index is, the faster the search is.
Simply imagine a phone book: you can usually find someone quickly by last name. But if a lot of people share the same last name, you will spend more time looking for the person, checking the first name every time.
So you should put the most selective columns first to avoid this problem as much as possible.
Additionally, you should then make sure that your queries actually use these "selectivity criteria" correctly.

A good algorithm for generating an order number

As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to generate a random 9-digit number in the middle tier and save it in the DB. In the case of a collision, we regenerate a new number.
If the middle tier cannot check what "order numbers" already exist in the database, the best it can do is the equivalent of generating a random number. However, if you generate a random number constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousands of entries generated this way, the risk of collisions is material. What if the order number were sequential but in a disguised way, i.e. the next multiple of some large prime number modulo 1 billion; would that meet your requirements?
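For illustration, here is a Python sketch of that "disguised sequential" idea; the particular multiplier is an assumption, not from the answer above:
MODULUS = 10**9          # keeps order numbers under 10 digits
MULTIPLIER = 479001599   # a large prime (12! - 1), coprime to 10**9

def order_number(seq: int) -> int:
    # Because MULTIPLIER is coprime to MODULUS, seq * MULTIPLIER mod
    # MODULUS visits every value from 0 to 999999999 exactly once, so
    # the numbers stay unique for the first billion sequence values.
    return (seq * MULTIPLIER) % MODULUS

print(order_number(1))  # 479001599
print(order_number(2))  # 958003198
print(order_number(3))  # 437004797
Note that consecutive outputs still differ by a fixed amount modulo 1 billion, so this hides the sequence for optics only, which is all the requirement asks for.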
<Moan>OK, this sounds like a classic case of premature optimisation. You imagine a performance problem (oh my god, I have to access the - horror - database to get an order number! My, that might be slow) and end up with a convoluted mess of pseudo-random generators and a ton of duplicate-handling code.</Moan>
One simple practical answer is to run a sequence per customer. The real order number is then a composite of the customer number and the sequence number. You can easily retrieve the last sequence used when retrieving other stuff about your customer.
One simple option is to use the date and time, e.g. 0912012359, and if two orders are received in the same minute, simply increment the second order by a minute (it doesn't matter if the time is off; it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, e.g. when you started taking orders or some other arbitrary date. Again, with the duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
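A Python sketch of this scheme (the epoch is arbitrary, and the in-memory set stands in for whatever duplicate check you already have):
from datetime import datetime

EPOCH = datetime(2009, 1, 1)  # when you started taking orders, say
issued = set()                # stands in for your duplicate check

def order_number(now: datetime) -> int:
    n = int((now - EPOCH).total_seconds() // 60)  # minutes since EPOCH
    while n in issued:  # two orders in the same minute:
        n += 1          # increment the second one by a minute
    issued.add(n)
    return n

print(order_number(datetime(2009, 12, 1, 23, 59)))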
Maybe you could try generating some unique text using a Markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take the hash of some field of the order. This will not guarantee that it is unique from the order numbers of all of the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as a finite-field generator: a maximal-length sequence built from a primitive polynomial visits every non-zero value exactly once before repeating, giving numbers that are unique and non-sequential.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
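A hedged Python sketch of that layout (the six/three split and the hash function are assumptions; the sequence number must still be reserved on the back end, as noted above):
import hashlib

def order_number(seq: int) -> str:
    # Six sequentially-increasing digits plus three digits of hash.
    head = f"{seq:06d}"
    digest = hashlib.sha256(head.encode()).hexdigest()
    tail = f"{int(digest, 16) % 1000:03d}"
    return head + tail

print(order_number(123))  # '000123' followed by three hash digits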
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit

What is the benefit for a sort algorithm to be stable?

A sort is said to be stable if it maintains the relative order of elements with equal keys. I guess my question is really, what is the benefit of maintaining this relative order? Can someone give an example? Thanks.
It enables your sort to 'chain' through multiple conditions.
Say you have a table with first and last names in random order. If you sort by first name, and then by last name, the stable sorting algorithm will ensure people with the same last name are sorted by first name.
For example:
Smith, Alfred
Smith, Zed
Will be guaranteed to be in the correct order.
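In Python, whose built-in sort is guaranteed stable, the chaining looks like this:
people = [("Zed", "Smith"), ("Alfred", "Smith"), ("Peter", "Wilson")]

people.sort(key=lambda p: p[0])  # pass 1: sort by first name
people.sort(key=lambda p: p[1])  # pass 2: sort by last name

# Stability guarantees that people sharing a last name keep the
# first-name order established by pass 1:
print(people)  # [('Alfred', 'Smith'), ('Zed', 'Smith'), ('Peter', 'Wilson')]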
A sorting algorithm is stable if it preserves the order of duplicate keys.
OK, fine, but why should this be important? Well, the question of "stability" in a sorting algorithm arises when we wish to sort the same data more than once according to different keys.
Sometimes data items have multiple keys. For example, perhaps a (unique) primary key such as a social insurance number, or a student identification number, and one or more secondary keys, such as city of residence, or lab section. And we may very well want to sort such data according to more than one of the keys. The trouble is, if we sort the same data according to one key, and then according to a second key, the second key may destroy the ordering achieved by the first sort. But this will not happen if our second sort is a stable sort.
From Stable Sorting Algorithms
A priority queue is an example of this. Say you have this:
(1, "bob")
(3, "bill")
(1, "jane")
If you sort this from smallest to largest number, an unstable sort might do this.
(1, "jane")
(1, "bob")
(3, "bill")
...but then "jane" got ahead of "bob" even though it was supposed to be the other way around.
Generally, they are useful for sorting multiple entries in multiple steps.
Not all sorting is based upon the entire value. Consider a list of people. I may only want to sort them by their names, rather than all of their information. With a stable sorting algorithm, I know that if I have two people named "John Smith", then their relative order is going to be preserved.
Last      First     Phone
-----------------------------
Wilson    Peter     555-1212
Smith     John      123-4567
Smith     John      012-3456
Adams     Gabriel   533-5574
Since the two "John Smith"s are already "sorted" (they're in the order I want them), I won't want them to change positions. If I sort these items by last, then first with an unstable sorting algorithm, I could end up either with this:
Last      First     Phone
-----------------------------
Adams     Gabriel   533-5574
Smith     John      123-4567
Smith     John      012-3456
Wilson    Peter     555-1212
Which is what I want, or I could end up with this:
Last      First     Phone
-----------------------------
Adams     Gabriel   533-5574
Smith     John      012-3456
Smith     John      123-4567
Wilson    Peter     555-1212
(You see the two "John Smith"s have switched places). This is NOT what I want.
If I used a stable sorting algorithm, I would be guaranteed to get the first option, which is what I'm after.
An example:
Say you have a data structure that contains pairs of phone numbers and employees who called them. A number/employee record is added after each call. Some phone numbers may be called by several different employees.
Furthermore, say you want to sort the list by phone number and give a bonus to the first 2 people who called any given number.
If you sort with an unstable algorithm, you may not preserve the order of callers of a given number, and the wrong employees could be given the bonus.
A stable algorithm makes sure that the right 2 employees per phone number get the bonus.
It means that if you want to sort by Album AND by Track Number, you can click Track Number first, and it's sorted; then click Album Name, and the track numbers remain in the correct order within each album.
One case is when you want to sort by multiple keys. For example, to sort a list of first name / surname pairs, you might sort first by the first name, and then by the surname.
If your sort was not stable, then you would lose the benefit of the first sort.
The advantage of stable sorting for multiple keys is dubious: you can always use a comparison that compares all the keys at once. It's only an advantage if you're sorting one field at a time, as when clicking on a column heading; Joe Koberg gives a good example.
Any sort can be turned into a stable sort if you can afford to add a sequence number to the record, and use it as a tie-breaker when presented with equivalent keys.
The biggest advantage comes when the original order has some meaning in and of itself. I couldn't come up with a good example, but I see JeffH did so while I was thinking about it.
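A minimal Python illustration of the sequence-number tie-breaker (the data is made up):
pairs = [(2, "b"), (1, "x"), (2, "a"), (1, "y")]

# Decorate each record with its original position so the position
# breaks ties between equal keys; even an unstable underlying sort
# then behaves stably.
decorated = [(key, pos, val) for pos, (key, val) in enumerate(pairs)]
decorated.sort()
print([(key, val) for key, _, val in decorated])
# [(1, 'x'), (1, 'y'), (2, 'b'), (2, 'a')] -- equal keys keep input order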
Let's say you are sorting an input set which has two fields, and you only sort on the first. The '|' character divides the fields.
In the input set you have many entries, but, you have 3 entries that look like
.
.
.
AAA|towing
.
.
.
AAA|car rental
.
.
.
AAA|plumbing
.
.
.
Now, when you get done sorting, you expect all the records with AAA in them to be together.
A stable sort will give you:
.
.
.
AAA|towing
AAA|car rental
AAA|plumbing
.
.
.
i.e., the three records that had the same sort key, AAA, appear in the output in the same order as in the input. Note that they are not sorted on the second field, because you didn't sort on the second field of the record.
An unstable sort will give you:
.
.
.
AAA|plumbing
AAA|car rental
AAA|towing
.
.
.
Note that the records are still sorted only on the first field, and the order of the second field differs from the input order.
An unstable sort has the potential to be faster. A stable sort tends to mimic what non-computer-scientist/non-math folks have in mind when they sort something. That is, if you did an insertion sort with index cards, you would most likely have a stable sort.
You can't always compare all the fields at once. A couple of examples: (1) memory limits, where you are sorting a large disk file, and there isn't room for all the fields of all records in main memory; (2) Sorting a list of base class pointers, where some of the objects may be derived subclasses (you only have access to the base class fields).
Also, stable sorts have deterministic output given the same input, which can be important for debugging and testing.
