Advice on DB design Best Practices/Standard - Oracle - oracle

I'm designing the DB for a new app which is something I've done a thousand times, but in this occasion I suddenly start wondering on some aspects that I've never stopped before. Is there some standard/recommendation for the following things?
Whats the recommended data type for storing currencies (no financial operations, just displaying).
Recommended size for storing phone numbers (internationals)
Recommended minimum size for storing first names / last names (minimum meaning smallest maximum recommended size)
Recommended minimum size for storing comment blocks.(minimum meaning smallest maximum recommended size also)
I'm aware that every application has its own particular requirements to consider, but I feel that there must be something more specific than gut feeling and common sense.
Help, as always, will be deeply appreciated.

Whats the recommended data type for storing currencies
This depends on what kind of currency, and to what degree of accuracy.
If it's cents and dollars, rounded to the nearest cent, it's NUMBER(12,2) which allows you to store amounts between -999,999,999,999.99 and 999,999,999,999.99 - which for most currencies should be enough.
If you need to store intermediate results from, say, interest rate calculations, you may need more precision, e.g. NUMBER(15,5).
If you're talking Zimbabwean dollars, perhaps you should choose the maximum NUMBER instead :)
Recommended size for storing phone numbers (internationals)
VARCHAR2(30) should be sufficient. If it's too long your users will enter all sorts of rubbish data in there.
Recommended minimum size for storing first names / last names /
Recommended minimum size for storing comment blocks
These don't apply since you're in Oracle - use VARCHAR2, so you don't have to worry about minimum size. All you need to specify is the maximum size.

Currencies:
NUMBER(15,2), really depends on how big the numbers are that you expect to run into.
Phone numbers:
VARCHAR2(30), please don't hurt me if it should be larger - can't remember the length per se just that VARCHAR allows flexibility for formatting.
I don't see the point of looking at the minimum size if using VARCHAR2. The concerns for the physical model revolve around how much space the database will consume over time, assuming fields are maxed out.
Comment blocks:
Maximum of VARCHAR2(4000)

EDIFACT generally uses 35 as the size of a Name field and I'd copy that (and document that as a basis). Newer stuff tends to be defined in XML and doesn't normally go into field length definitions.
Alternatively the Canadian post office recommends no more than 40 characters per address line.
Note, that is characters and not bytes. Sizing should take into account multi-byte characters, but obviously not all names will be the maximum length. I've used ten characters per name as a broad approximation for sizing estimates but that could vary a lot between countries, ethnicities etc.

I know you were asking minimum size for comment blocks, but for large free-text areas you ought to consider using a CLOB value. Oracle is pretty smart about how these things are handled, how the data is stored, etc. You NEVER have to worry about size. In addition, you can usually pretend that they are VARCHAR2 columns for easy manipulation.

Related

How to generate a 4-digit validation code according to latitude & longitude information?

My application needs this feature:
User A can upload his location information and get ADD CODE which is
generated on server.
User B can input ADD CODE and also has to upload his location
information. Only when userA and userB is close enough and ADD CODE
is matched can they finally be friends.
Before calculating distance and comparing ADD CODE, I will check
whether their city number(unique for each city) is same. In other
words, I have to make sure that in each city, ADD CODE won't
conflict with another at the same time(or few minutes).
Of course, a 4-digit number won't satisfy all possibilities, but is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Assuming this sequence happens within a limited period of time (and not that an Add code is valid for all-time):
Your 4-digit number doesn't need to be globally unique, it only needs to be unique within this window of time. So, with that observation, maintain a table of Add codes, when they were issued and for what location. Generate them randomly ensuring they aren't already in the table. Periodically remove any Add codes that have expired.
Provided you never have more than 10,000 users simultaneously trying to connect with each other this will work.
If you need more than that consider allowing duplicates in the table but using the lat/long to ensure that the same Add code is never allocated to any point within 2x the max distance allowed for pairing.
Is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Yes. There are probably millions of possible ways to generate a 4-digit number (where almost all of them are awful and don't satisfy most of the requirements); and if you sort them in order of how much they satisfy, then one of them must satisfy the feature as much as possible.
The real question is, how awful and unsatisfactory would "as much as possible" be?
If you assume 4 decimal digits; then you're limited to 10000 locations or 10000 unique users. That's unlikely to be enough for anything good.
If you assume 4 hexadecimal digits; then you're limited to 65536 locations or 65536 unique users. That's better but still not enough.
So.. what if you used "base 1234567"? In this case a 4-digit number has 2323050529221952581345121 permutations. The surface of the Earth is about 510.1 million square kilometers, so this would be enough to encode a location very precisely (probably within a few meters?).

Determine if unique number has been seen from a range of numbers

I am trying to find the best, quickest and most efficient way to determine if a number has been seen in a range.
Example:
Key for record: Raffle Event 1 (database key)
Tickets Available: 1 - 1000000 (the range)
Ticket number 4 was turned in. Has it been turned in already for this event?
Ticket number 865401 was turned in. Has it been turned in already for this event?
I've thought about bit masks, storing data as buckets, etc. But none of these seem to be the answer I am trying to find. Maybe it does not exist.
We have 800,000 events, each event with 1 million tickets. We are currently storing the last number turned in, and anything lower is rejected. We want to have finer granularity, but need efficiency and storing each ticket would be impractical.
Data is stored using SQL
Any ideas?
EDIT
The best idea I've come up with so far is using a bitmap. Have 10 columns for each event. Each column stores 100,000 bits. This should allow for quick data retrieval, then just check if the bit is on or off. This should be about 1mb of storage per event, or 100k per column read.
I'm still searching for alternative ideas or recommendations.
You could use a bitmask if you anticipate your range to be moderate. Else you could try using a set structure. These can be implemented using binary search trees.
I went with a bitmap. I store 1/0 if the ticket was used or not. I split this bitmap into 16 buckets for storage. This was my magic number as it made just under 8K buckets, the perfect size for SQL paging.
Each bucket is null at first until needed. This saves space. This way each even is taking up zero space. And we only use 8K for each "62500" chunk of tickets. (8K)
It is efficient and does everything I need. I played around with compression to save space, but worst case for me was a few trillion records totaling 100GB if all tickets were accounted for (all buckets for each event). This might sound like a lot of space, but with today's cost its nearly negatable and not worth trying to compress the buckets of data.

Common strategies to deal with rounding errors in currency-intensive soft?

What is your advice on:
compensation of accumulated error in bulk math operations on collections of Money objects. How is this implemented in your production code for your locale?
theory behind rounding in accountancy.
any literature on topic.
I currently read Fowler. He mentions Money type, it's typcal structure (int, long, BigDecimal), but says nothing on strategies.
Older posts on money-rounding (here, and here) do not provide a details and formality I need.
Thoughts I found in the inet relate to "Round half even" as the best way to balance error.
Thanks for help.
There are many rounding issues when recording financial data.
First issue is ability to store and retrieve exact decimal numbers
most databases offer decimal data type on which you can specify the number of digits before and after decimal point (currencies vary in number of decimal digits, too, I've dealt with currencies with 0, 2, 3 decimal digits)
when dealing with this data and you want to avoid any unexpected rounding errors on the application side you can use BCD as generic approach, or you can use integers to represent any fixed decimal notation or mix your own
If this first issue is sorted out then no addition (or substraction) can introduce any rounding errors. Same goes for multiplication by integer.
The second issue, after you are able to store and retrieve data without loss of information, are expected rounding errors due to division (or multiplication by non integer).
For example if your currency format allows 2 decimals and you want to store transaction that records balances a debit of 10 to 3 equal pieces you can only store it like
10.00
-3.33
-3.33
-3.33
and
-0.01
(rounding error)
This is expected problem that will occur regardless of the data type storage choice and that needs to be taken care of if you want your accounts to balance. This situation is mainly introduced by division (or by multiplication by non integers that have many significant digits).
One way to deal with this is to verify if your data balances after such operations and recognize the allowed rounding difference as opposed to an error situation.
EDIT:
As for references to literature, this one seems interesting and not too long and concerns quite wide audience with interesting scenarios.
Use Banker's rounding. You round to the nearest two-penny.
http://www.xbeat.net/vbspeed/i_BankersRounding.htm
You can expand upon this to round toward the nearest two-penny instead. So 22.5 rounds to 22, but 23.5 rounds to 24. 23.1 and 22.9 both round to 23. However, the original banker's algorithm is more popular.
Never store money values in a double or float - use an int or long as there is no way to store 0.1 accurately in binary.
It all depends on the application. Hopefully there aren't too many situations where rounding is required. For example, transferring money from one account to another requires no rounding.
For situations where rounding is required, it doesn't really matter what you do as long as you pick a policy, communicate it, and stick to it. For instance, I believe the interest on my savings account rounds down to the nearest penny.
What you should do may well be informed by the conventions of the market or jurisdiction you are operating in. For example, pricing bonds in the Australian market requires that you round certain intermediate operations to 8 decimal places. The final price is quoted to a specific number of decimals (3 I think off the top of my head).
If you are dealing with an accounting app, I would expect the relevant accounting standards for your legal environment to possibly dictate this.
I've worked a bit (just a bit) with monetary amounts and I was extremely curious as to the strategy used in my company...
It turns out that we use double, but they've thought about it.
The thing is that the amounts we deal with are not that great (say less than 10k) and at most we need 3 digits after the decimal, for a total of 7 significant digits.
Since we are using 64bits software (and C++) the double type offers enough significant digits for the number of operations we carry on it :)
If you need more precision, there are algorithms to use (for example while adding multiple moneys) but personally I think the heart of the issue comes more from:
conversion from one money to another, which keeps changing of course
printing issues, with some moneys requiring no decimal, others requiring 2 at most, etc...
Perhaps could you expand on the operations you're doing ?

How can I determine the maximum row size just from the column datatype sizes?

How can I work out the maximum row size in a table, if I'm only given the datatype lengths (from all_tab_cols.data_length column) of the columns in the table (i.e. no statistics or ANALYZE)? There's extra complications in that this is an IOT, so there's index tree size to consider as well.
The formulas sizing indexes are involved -- they're generally focused on calculating rows per block to determine actual disk space required for N rows -- and the last time I can find that Oracle included this in their documentation was in Oracle 8.0. See here.
I'd recreate the table, generate some data with a Data Generator and actually measure it.
I generally wouldn't go further than 10s of MB of generated data as sufficient for a guide.
In some places, they tend to generate over-generous data length (eg standardizing on string sizes of 20, 50, 100 bytes/characters). Even on names you may allow 30 characters when most people are in the 5-10 length. As such estimates derived from field sizes, rather than actual lengths, will be VERY vague and you'll have a very large error margin.

What is the best format for a customer number, order number?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
A large international company deploys a new web and MOTO (Mail Order and Telephone Order) handling system. Among other things you are tasked to design format for both order and customer identification numbers.
What would be the best format in your opinion? Please list any assumptions and considerations.
Accepted Answer
Michael Haren's answer selected due to the most up votes, but please do read other answers and comments as they make Michael's answer more complete.
Go with all numbers or all letters. If you must mix it up, then make sure there are no ambiguous characters (Il1m, O0, etc.).
When displayed/printed, put spaces in every 3-4 characters but make sure your systems can handle inputs without the spaces.
Edit:
Another thing to consider is having a built in way to distinguish orders, customers, etc. e.g. customers always start with 10, orders always start with 20, vendors always start with 30, etc.
DON'T encode ANY mutable customer/order information into the numbers! And you have to assume that everything is mutable!
Some of the above suggestions include a region code. Companies can move. Your own company might reorganize and change its own definition of regions. Customer/company names can change as well.
Customer/order information belongs in the customer/order record. Not in the ID. You can modify the customer/order record later. IDs are generally written in stone.
Even just encoding the date on which the number was generated into the ID might seem safe, but that assumes that the date is never wrong on the systems generating the numbers. Again, this belongs in the record. Otherwise it can never be corrected.
Will more than one system be generating these numbers? If so, you have the potential for duplication if you use only date-based and/or sequential numbers.
Without knowing much about the company, I'd start down this path:
A one-character code identifying the type of number. C for customers, R for orders (don't use "O" as it could be confused with zero), etc.
An identifier of the system that generated the number. The length of this identifier depends on how many of these systems there will be.
A sequence number, unique to the system generating it. Just a counter.
A random number, to prevent guessable order/customer numbers. Make this as long as your paranoia requires.
A simple checksum. Not for security, but for error checking.
Breaking this up into segments makes it more human-readable as others have pointed out.
CX5-0000758-82314-12 is a possible number generated by this approach
. This consists of:
C: it's a customer number.
X5: the station that generated the number.
0000758: this is the 758th number generated by X5. We can generate 10 million before retiring this station ID or the station itself. Or don't pad with zeros and there's no limit.
82314: this was randomly generated and results in a 1/100,000 chance of guessing a customer ID.
12: checksum.
A primary advantage of using only numbers is that they can be entered much more efficiently using 10-key.
The length of that number should be as short as possible while still encompassing the entire entity space you expect to catalog with room to spare. This can be tricky and should be given a bit of thought. A little set theory can give you the number of unique keys you will have access to, given a group of elements.
It is natural when speaking, to break numbers up into sets of two to four digits. By inserting dashes in some pattern, you can "force" the customer to repeat them in a more efficient and unambiguous manner.
For instance, 323-23-5344, which, of course, is social security number format, helps to inform the speaker where to pause when vocalizing the number. It also provides a visual delineation when writing the number and makes it easy to compare when copying the number.
I second the recommendation that the ordering system masks the input correctly so that no dashes need to be entered at any time. This should be carried through to printed forms to provide a clear expectation of what should be entered. For instance, a printed box for each digit separated by printed dashes.
I disagree that too much information should be embedded in this number especially if those attributes might change. For instance, say we give "323" the meaning of "is a nice customer" but then they call in four times with an attitude. Are we then going to change their customer key to "324", "is a jerk"? What if they are in region 04 and move their company to region 05?
If that happens, your options will be to update that primary key throughout the database or live with the ambiguity that the information embedded in that key is no longer reliable, thus rendering all of the information embedded in the keys of questionable utility.
It is better to store attributes that may change as separate fields in the database and have the customer number be a unique, unchanging key for that customer.
To build on Daniel and Michael's questions: it's even better if the separated numbers MEAN something else. For example, I worked for a company where account numbers were like this:
xxxx-xxxx-xxxxxxxx
The first set of numbers represented the region and the second set represented the market within that region. Once you got used to knowing what numbers were from were, it made it really easy to tell what area an account was in without even having to look at the customer's account.
There are several assumptions that I make when answering this question; some are based on the fact that it is a large international organization, and some are based on the fact that the format is for two separate table types.
Assumptions based on the fact that it's an international organization:
It is probable that each region will need to operate independently -- that is, region A must be able to add customer numbers independently from region B
Each region probably uses a different language so to make the identifiers easily type-able by users around the world, it is best to stick to numbers and spaces only.
Assumptions based on the fact that there are two tables for which this format will be used:
This format may be used by more than the two tables listed, so it should be able to handle an arbitrarily large number of tables.
Experienced users should be able to know what type of identifier they are looking at based on information encoded into the identifier itself.
It would be nice if identifiers were globally unique within the entire system.
Considerations:
For a global company, identifiers can be very long if only numerics are used. We should attempt to limit the amount of extraneous information encoded into the identifier as much as possible.
Identifiers should be self-verifiable to a limited extent; that is a program should be able to detect a large percent of invalid identifiers without looking anything up at all. This implies a checksum.
Proposed format:
SSSS0RR0TTC
The format proposed is as simple as possible, but no simpler:
C The first (rightmost) character will be a checksum of all other characters in the identifier. A simple checksum will do. This will eliminate 90% of all typing errors. If it is decided that this is not enough, then this can be expanded to 2 digits which will eliminate 99% of all typing errors.
TT The next N digits represent the table type number. No table type number can contain the digit zero.
The next digit is a zero. This zero separates the table type number from the region number.
RR The next N digits are the region number. No region numbers can contain a zero.
The next digit is a zero. This zero separates the region from the sequence number.
SSSS The next N digits are the sequence number. This number can contain zeros.
Each set of four numbers are separated by spaces when printed or typed in by convention. Internally they are not separated, but this helps the user transfer them correctly.
Examples
Assuming:
Customer table type=1
Order table table type=2
Region code for US-Alabama=1
Region code for CA-Alberta=43
Region code for Ethopia=924
10 1013 - Customer #1 in Alabama (3 is the checksum: 1 +1 + 1)
10 1024 - Order #1 in Alabama
9259 0304 3016 - customer # 925903 in Alberta, Canada
20 3043 4092 4023 - order number 2030434 in Ethopia
Advantages of this approach:
90% of mistyped numbers will be caught
There are an unlimited number of table types
There are an unlimited number of regions
There are an unlimited number of sequential numbers for each table
Identifier numbers are globally unique to the system. This is important - a customer number cannot be mistaken for an order number and visa versa.
Each region can independently add sequence numbers without a global key
Disadvantages
Each identifier is at least six characters
table types numbers and region numbers cannot contain a zero because the zero is used to separate the sequence number from the region number from the table type number.
Make the number as long as necessary, but not any longer. Every time I pay my water bill, I have to enter my 20-digit customer number, and an 18-digit invoice number. Thankfully, a dash in my customer number separates it into two parts.
Do not depend on leading zeros. Having to figure out how many zeros are in my invoice number is extremely annoying. Take 000000000051415432 for example. Their system won't recognize just 51415432.
Group digits together. If you absolutely have to use long numbers, four-digit chunks should work well.
I would never use user information in IDs. Suppose you use the first letters of the customer's last name followed by some number: e.g. Thomsom could be customer THOM-0001.
Only, it appears you made a mistake, and the man's name is Tomson instead of Thomson. User data can be corrected, IDs should never be modifiable. So next time you look up Tomson under TOMS-... you can't find him. Same with other data, like a customer type. It can always change, the ID can't.
This is very basic to RDBMS.
Simply use counting numbers. For readability it's a good idea to insert separators such that you never have more than 4 successive digits: 9999-9999 is better than 999-99999. And don't make the number longer than necessary; people are much more annoyed by being reduced to a 20 digit number than just being reduced to a number.
There's a catch, though. Especially if you have a small business simple counters can give more away than you would appreciate. Say I order something from you, and the order number is 090145. Next month I order again, and the order number is 090171. Er.. 26 orders in a month? Same, I wouldn't feel comfortable to become customer 0006 in a business which has been active for 10 years.
The solution is simple: skip numbers. Don't use random numbers, because you still want them to be in sequence.
I would have my order numbers follow this format:
ddmmyyyy-####-####
Where ####-#### resets to zero at the beginning of every day. This makes it very easy to correlate orders with the date it was placed.
For customer IDs, I would mix capital letters and numbers, but as Michael said avoid commonly mistaken letters (0,o,L,1,5,s). This will give you 30 characters to deal with. If you use 20 characters, that will give you almost a 64 bit range of customer IDs -- pretty good for security. Make sure you use a secure random number generator when generating ID. As for how you display the format, it should be the following:
####-####-####-####-####
As Michael said again, make sure your system can deal with dashes, spaces, no spaces, or no dashes. (It should just strip all those characters from the input before validation.)
I hope that helps!
You may add a small checksum (using XOR for instance) to ensure (enhance) correctness of given ids.
If it's by mail, consider z-base-32 encoding. But here, with telephone orders, you may prefer decimal identification.
assuming that the creation of orders/customers is not centralized, or will not always be centralized, use a GUID
if the creation of orders/customers will always be centralized, an unsigned integer would be fine
there is no compelling reason for the order number of customer number to "mean" anything, and it is likely that any segmented number scheme invented will have to be overhauled down the road. Stick to something unique and meaningless.
EDIT: for MOTO, any multi-character alphabetical identifier will cause problems over the phone, so GUIDs are right out. Assuming multiple decentralized MOTO locations, assign each MOTO location a prefix (A, B, C, etc., or 01, 02, ...) and use an integer or big-integer for the customer and order IDs, e.g. 01-1 is the first order from MOTO location #1. Note that zero-padding is unnecessary, imposes an implicit digit limit to the numbers, and requires the customer to distinguish between six zeros and seven zeros when speaking the number. If you must use a padded fixed-length format, break the number up into groups of no more than 4 or 5 digits each.
ADDENDUM: the order number and the customer number do not have to be the primary keys of their respective tables, just unique indexed columns for lookup. You'll probably want to use something simpler/more efficient for the primary keys in the database.
We use leading zeroes for some of our references "numbers" where I work and I can't tell you how many wasted hours I've had over the last seven years forcing Excel to treat them as text. Don't do it.
Auto-incrementing integers are all well and good for computers, but they greatly reduce human beings ability to spot errors. How important that is will depend on your business. I work with property (housing) related data and our primary reference has the front door embedded in it. It's not elegant but it means that experienced admin staff can spot 90% of minor errors (when we get invoices, etc in) before they get near a database. But in an environment where you're not relying on that kind of process this argument is less compelling.
(Now, some folks have strongly warned about using meaningful data in references as it could be changed, and there's some truth in that, but you can be smart. You don't have to pick something obviously fickle like whether the person is married - you can anchor yourself on past events like a character representing the region they first opened a particular account. Even if you don't do that, have some kind of pattern to help communication with customers. I've worked in a number of call centres and people sometimes come to phone with every piece of documentation from birth certificate onwards as they desperately try to find their account/order/customer number. I don't think saying "It'll be a number between 1 and 100 trillion" would be very handy)
It's been said, but don't create enormously long references. We're busy people, we haven't got time to be keying in this crap over a phone system and making a mistake on digit 17 only to restart (again). Some of your customers may have disabilities and it's likely a growing number will be over 55+. Once again, watch out for the zeroes. You see purchase order numbers and the like with fourteen digits. How many orders do they think they're going to be placing?
If there's going to be any data aggregation outside of your network (and thus not connected to your database) - have some sort of check digit/regular expression pattern which your partners/suppliers can verify they've not made mistakes. One example of this is the UK's electrical supply numbering system (MPAN) is a good example of this - designed for people to maintain their own records without having to download the big list of every electricity meter in the universe to check they've not made a typo.
I would use numbers only since it is an international company. I would use spaces or dashes every 4-6 numbers to separate it. I would also keep the format separate for quick identification
Example:
000-00000-00000 - could be an customer number
00000-00000-00000-00000 - could be a order number
Stick to numbers (no chars or special stuff):
Can be easily input in an IVR flow
its international - No language hassles
No confusion in chars vs. numbers - O vs. 0, I vs. 1
As long as leading 0 is meaningless, you can store/manipulate them more effectively
I would use a completely numeric systems for both Order Number and Customer Number, this will allow you to avoid issues with other languages.
Avoid leading zeros, as this can cause issues with data entry and validation.
The number of digits for each will be dependent on your expected volume. You will always have a greater number of Order Numbers than Customer Numbers. A six digit Customer number starting at 100000 will still give you 899,999 customers. Add an additional 3-4 digits for the order number, will give you 999 to 9,999 orders per customer (more if you consider one off customers).
There is no need to build any sort of identification into your numbering sequence. You have other database fields to identify where a customer is from, etc. Do not overly complicate your system.
KISS (keep it simple stackoverflow)
I would suggest using 16 digit identifiers that when printed or shown to customers are formatted in the format of xxxx-xxxx-xxxx-xxxx but stored as numbers without the dashes in your system.
The reason for using this format is that it makes it easier for people reading out the number over to phone to read as they can do it in batches of 4 rather then trying to remember how much they have said already.
If you wish the first 4 digits can be used to identify the type of number, 1000 for customers, 2000 for suppliers, 3000 for orders, 4000 for invoices etc.
The second set can then by a year/month identifier if you wish to keep that sort of information encoded in the number itself, using a format of yymm so 1000-0903-xxxx-xxxx would be a customer entered in march 2009.
This then leaves you with 8 digits for the actual data itself.
I would consider the use of letters in the identifiers to be a very bad idea for any system that deals with telephones as the differences in accents and understand is so varied that people are bound to get upset at trying to get their identifier recognised by someone who cannot understand their accent properly.
An additional consideration to the format issue- in the code, create a separate class for OrderId and CustomerId. These classses are immutable, and validate their input to ensure that they are acceptable IDs. Also, no value could be and order ID and a customer ID.
The simplest approach would just be to have the backing values for OrderId be ints that start with 1, and CustomerIds be ints that start with 2, or something similar.
Wow - what a simple yet revealing question! And what a lot of contradictory answers. I think there are 3 obvious candidate answers here:
1) Use an autoincrementing long integer.
2) Use a GUID
3) Use a compound type that includes other information in the ID.
For simpler systems, and especially web based systems where all users are hitting a central database, (1) works well. It has the advantage that numbers stay as short and simple as possible, but no shorter, avoids alphabetic characters (you would be amazed how different the names for the same letters are in different countries - one countries E is another countries I). It does not differentiate the order ID from the customer ID intrinsically, but you could always prepend or append a "C" or "O" to each and silently drop them on entry?
It also does not have a checksum or error check.
For distributed systems where many software components need to create the numbers on the fly, without reference to a master database (2) is the only way to go. They have the advantage of being largely error checking, since the address space is so large, but by the same token, are too long and alphanumeric to comfortably read over the telephone.
As for (3) - embedding region information or today's date into the number - those are the sorts of ideas that experienced developers train themselves out of. Looks like a good idea at first, but always comes back to haunt you. Consider the case where a customer moves to a new state, or an order is manually rekeyed a week after originally issued? These items of information belong in related tables where they can be edited independantly of the ID which should represent the entities identity only.
To repeat: NEVER ENCODE BUSINESS DATA IN AN ID OR PRIMARY KEY - every time you do that you leave a time bomb for others to clean up one day.
Given that this is a centralised (phone based) system I would go with option (1) until a clear need arose to change. Simpler is usually better. Insert hyphens as others suggest and prepend or postpend a checksum and/or identifying letter if required.
First step: in an org sufficiently large to require such a system, there is an existing system that you're replacing. Continue the previous system's scheme, if possible. It makes a lot of things easier if you can access, even at a basic level, the data from the old system.
That said, there's often a good reason to change the scheme, particularly when it's coming from a legacy system. i find, though, that it's often helpful to formally rule out the old scheme before proceeding.
Second step: systems like this never exist in a vacuum. Is there already an organization-wide scheme for user and/or order IDs, such as in the accounting, inventory management, or CRM system? If so, consider adopting the existing schemes to make interoperability easier. Many large orgs have multiple ways to specify a single customer or order, and it just makes getting useful intelligence out of the data that much harder.
Third step: if the old system's scheme is too awful to continue and there's no other scheme to adopt, roll your own. In this case, look at the shortcomings of the original scheme, whatever they are, and correct them. The right answer will depend on the specific requirements of the application. The problem statement you've given us is too vague to speculate usefully on what the final form might look like.
I always stick with auto-increment numbers, and I always seed the sequence high enough so that they will all have a consistent number of digits - seems to be less confusing.
I also sometimes start an order number, say 6 digits, starting at 200,000 and customer numbers at 5 digits, starting at 10,000 which would for example give me 90,000 unique customer numbers and 800,000 unique order numbers to use, and you could always tell just by looking at it whether it was a customer number or an order number. (i.e. so if a customer rep was asking for a number over the phone it would immediately be obvious which was which)
I would not however build logic in the app that would depend on that, so even if it did roll over, the system wouldn't care.
The biggest issue here is to try not to overthink the problem.
Although I'm more experienced in e-commerce systems I think some of the points made in this post could be applied to mail order and telephone order systems.
For orders, an auto-increment integer works perfectly as the primary key in the database as well as the number that the customer will see on his/her invoice. There is absolutely no reason to create some overcomplicated algorithm for your numbers. If you want to tell which country/region they're from use a separate field in your database. Also if you are concerned about your competitors spying on you; let them! If your business revolves around spying on your competitors because you're not generating enough revenue then most likely your businessidea isn't good in the first place. Also if you wanted to fool your competitor you could just create your own script that will autocreate fake orders. If your e-commerce system is well designed then this won't be an issue.
Key stuff using an auto-increment integer:
All numbers/digits => easier to communicate, no ambiguities over the phone, works for all languages/cultures that use 0-9 as their numerical system
No extra coding
Looks nice on the invoice and it's the shortest possible number of digits a customer would ever need to spell out over the phone
Works for small AND large businesses
It's scalable
Serviceminded/Customerminded (What's best for the customer) (se bullets 1 and 3)
Simple
Whenever or whatever you're designing should always begin with what's best for the customer. At the end of the day they are the ones putting food on your table. A happy customer is a returning customer.
For me, my preferred is getting the combination of date + a counter for today's transaction. I was challenged to come up with only 5 digit order number. So with that, I come up with the following below:
I have to get the current date then
get the current counter for today's transaction then add 1.
I decided to use a counting larger than decimal(10), so i use base 16 for counting. So with that, if i will get the max of 5 digit out of hexadecimal(FFFFF) that will be 1,048,575 counts. By involving the date, I can say I can get 1,048,575 counts per day. So to make that count unique every day, I mixed the date by getting the sum of the following:
Current Year count starting from the year of the implementation which is 1
Current hour(max is 24)
Current day of the year(max is 365)
So with that, I will have a max 3 characters start for my counting. So that will be XXX + Todays current transaction. Example:
Current Date:
2014-12-31 01:22 PM
Implementation date: 2010
Running total for today's transaction: 100
Count: (5 + 13 + 365) + 101 = 383101
Order Number: AD-5D87D
AD there is just a custom order number prefix. So by the time i will be out of order number that will 1000000 years from the time of my implementation date.
Anyway, this is not a good solution if you think your transaction per day can be high as 1000000 counts.

Resources