Is 23,148,855,308,184,500 a magic number, or sheer chance?

News reports such as this one indicate that the above number may have arisen as a programming bug.
"A man in the United States popped out to his local petrol station to buy a pack of cigarettes - only to find his card charged $23,148,855,308,184,500. That is $23 quadrillion (£14 quadrillion) - many times the US national debt."
In hex it's $523DC2E199EBB4 which doesn't appear terribly interesting at first sight.
Anyone have any thoughts about what programming error would have caused this?

Add the cents to the number and you get 2314885530818450000, which in hexadecimal is 2020 2020 2020 1250.
Do you see the pattern? The first six bytes have been overwritten by spaces (hex 20, dec 32).
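The arithmetic is easy to check for yourself. A minimal Python sketch (the variable names are just illustrative):

```python
# The charge expressed in cents: $23,148,855,308,184,500.00
charge_cents = 2314885530818450000

# View the value as 8 big-endian bytes, i.e. a 64-bit integer.
raw = charge_cents.to_bytes(8, "big")

print(hex(charge_cents))               # 0x2020202020201250
print(raw[:6] == b" " * 6)             # True: the top six bytes are ASCII spaces
print(int.from_bytes(raw[6:], "big"))  # 4688, the value left in the low two bytes
```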

Hold on a second; there’s something fishy going on.
While the space-padded explanation certainly seems good, it may be (at least partly) specious.
VISA said that there were “fewer than 13,000” customers affected by the snafu with the Visa Buxx pre-paid cards. I’ve found news on several so far: Josh Muszynski in New Hampshire, Jason Bryan in Tennessee, Ron Seale in Texas, Karen Taylor’s teenage son in Bethel, and a teenage girl, Elizabeth Lewis, in Owatonna.
The thing is that all of them have the exact same charge: $23,148,855,308,184,500.00. If the problem was the space-padding, then how is it that all of them had the exact same $0x1250 ($46.88) charge? Two of them had purchased cigarettes at gas stations, another two had paid at restaurants, Lewis bought eggs and milk, the last one at a drug store. Do all these varied items happen to cost the same? $46.88 for a restaurant bill seems okay, but for a pack of cigarettes? for milk and eggs‽
The space-padding error makes sense, except it does not account for the 0x1250 constant. Why is it that all of them ended up with 0x2020 2020 2020 1250 instead of 0x2020 2020 2020 2020 or different numbers in the last WORD?
Hmmm, if only 13,000 customers were affected, it may be that somehow that exact, specific charge triggered the error. In that case, it is more than just a field error. If it was just the text field being interpreted as a 64-bit integer, then why didn’t other amounts cause it, thus affecting everyone and not just fewer than 13,000? Still, how is it that 13,000 people could have just happened to charge the exact same amount in the same week?
They say it’s a “temporary programming error”, and it may well be, but could it be a hacking thing? In that case, it probably would be a magic-number. In fact, it may be a combination of both: some hacker putting a 0x1250 automatic charge, that got combined with the space-padding error, causing one or both errors to be detected.
The Register thinks that the answer is indeed the padded-field error, but does not expand on why they are all the same, although one of the comments mentions the number possibly being rounded to the nearest $100 (unlikely since banks and banking software explicitly go to lengths to ensure precision).
(There is also a report of a similar, earlier error.)
Jason Bryant’s bill:
Elizabeth Lewis’s bill:
Ron Seale’s bill:
Josh Muszynski’s bill:

What happens when you make a purchase by card is that the software immediately goes online to ensure you have sufficient funds for the purchase, but only places a hold on the funds for the transaction. At the end of the working day the software then gathers all the transactions placed in the last 24hrs and submits them to the acquiring bank for processing.
The submission to the bank is known as settlement, and it's done by sending a plain text file in a very rigid format. (This was all developed decades ago, and the number of systems now using it makes it hard to modernise.)
Each transaction appears in the file as a line of text, and part of that is the transaction value. This field should be 11 numeric characters (zero-padded on the left-hand side) and will always hold the value in the smallest unit of the currency (in this case cents). 11 numeric characters cater well for values in any currency.
Looks like the payment processor in this case had made some changes to their submission software and erroneously replaced the zero padding with space padding. Quite how this got by a) service provider, b) acquiring bank and c) Visa without being picked up escapes me. The net value of that settlement file (13,000 high value transactions) would have been astronomical, and maybe that also was a contributing factor somewhere.
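To make the padding difference concrete, here is a rough Python sketch of that amount field. The function names are made up for illustration, and the real settlement format has many more fields than this:

```python
FIELD_WIDTH = 11  # width of the amount field described above


def amount_field(cents: int, pad: str = "0") -> str:
    """Right-justify the amount (in cents) in an 11-character field."""
    return str(cents).rjust(FIELD_WIDTH, pad)


def is_valid_amount_field(field: str) -> bool:
    """Accept only a field of exactly 11 ASCII digits."""
    return len(field) == FIELD_WIDTH and field.isascii() and field.isdigit()


good = amount_field(4688)           # zero padded, as the format expects
bad = amount_field(4688, pad=" ")   # space padded, as in the suspected bug

print(repr(good), is_valid_amount_field(good))
print(repr(bad), is_valid_amount_field(bad))
```

A check as simple as `is_valid_amount_field` at any point along the chain would have rejected the space-padded value before anyone tried to read it as a number.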

If you remove the trailing zero, this validates as a VISA card number. My guess is they swiped the card then manually entered the number, thinking the swipe had failed.
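"Validates" here presumably means the Luhn checksum that card numbers carry. A small Python sketch of that check follows (the function name is my own). Note that passing Luhn only means the check digit is consistent; it says nothing about the issuer, and Visa numbers normally start with 4, which this one does not.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


# The charge with its trailing zero removed.
print(luhn_valid("2314885530818450"))
```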

The ultimate mystery is still where 12 50 is coming from. They are the ASCII codes for Ctrl+R, P. Which happens to be the secret keystrokes you have to type to enter the validation code for QuickBooks.
Link: Where to enter Validation code
Quite a coincidence. I wonder what happens when you type these keys in the wrong place...

If you shift the 64-bit representation left by 8 bits (multiply by 256), you get a well-formed credit card number followed by 3 empty positions for those 3 extra security digits (all zeroes for some reason). There is only a 1 in 10 chance that a random number gives a well-formed CC number.
5926 1069 5889 5232 000

If you use the binary equivalent (1110101110110100) decode of the number 23148855308184500, you get K鑛, which is the Mandarin character for mining and ore. Kmine could mean "knowledge mine," or something like kmine Holdings Ltd. Perhaps there's a correlation between K(mine or ore) and Bank of America or Visa?


10 digit phone numbers... Is that enough for USA?

My app may be used anywhere in the USA, but will be used by local businesses serving their own areas.
As my project-in-development exists now (and I can change it) I'm storing only 10 digits. I'd like to think my software may be in use 10 years from now, although I can certainly release updates. But since the trend is for every person to have a phone instead of just one number per household, I understand the USA is running out of 10-digit phone numbers.
I know it may not seem so, but yes, I HAVE Googled and the answer I seek is still as clear as mud.
I read that there are locales within the USA (I don't know where) in which even within the same area code, a 1 and the area code must be dialed first. Other times, just the area code must be dialed, without the 1, even within the same area code.
MY QUESTION IS: To accommodate the whole USA and the foreseeable future, will I need to add an "optional 1" in front of the number, in the form of a check box or other device to distinguish those which need a 1 from those that don't? Is there another phone number schema coming in the future? Or putting it all more simply: Is 10 digits enough?
If you only want to store North American numbers, you'll be fine.
North American Numbering Plan
10 is the standard length in North America (includes Canada).
You should allow for 15 digits including the country code. You already need 12 to 14 digits (including country code) for many countries.
Store all numbers in E.164 format including the country code, without spaces or punctuation.
This will allow easy expansion internationally to other countries and also allow manipulation of numbers in the database if the length of numbers used in any country were to ever change.
There's talk that US numbers will become a digit longer some time in the next decade or two. You should plan for that now, not when you have tens of millions of numbers stored.
There's constant change in national number plans. If you know that area code 765 in country 980 is changing to area code 77 and all local numbers are having 88 added to the beginning it's a simple operation to make that change if all the numbers in the database include the country code.

Privacy and Anonymization "Algorithm"

I read this problem in a book (as an interview question) and wanted to discuss it in detail here. Kindly throw some light on it.
The problem is as follows:-
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help the researchers. The state spent time removing identifiers such as name, address and social security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data in multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization. I hope you are clear with the question.....
I have no experienced person, who can help me deal with such kind of problems. Kindly don't put votes to close this question..... As I would be helpless, if this happens...
Thanks & if any more explanation in question required, kindly shoot with the questions.
I just copy pasted part of your text, and stumbled upon this
This helps understanding your problem :
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a random database, and ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep in the database only the sex (M/F), then there is no way to find out who is who, because there are only two possible entries: M and F.
But, if you take the birthdate, then your total number of entries become more or less 2*365*80 ~=50.000. (I chose 80 years). Even if your database contain 500.000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with such entry, thus you can recognize him.
This is only a simplistic approach that relies on combinatorial stuff. If you're wanting something more complex, look for correlated information and PCA
Edit : Let's give an example. Let's suppose I'm working with medical things. If I keep only
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total number of category of 2*4*2*50*12*10 = 96.000 categories. Thus, if your database contains 200.000.000 entries (rough approximation of the number of inhabitants in the USA that are in your database) there is NO WAY you can identify someone.
This also implies that you do not give out any further information, no ZIP code, etc... With only the 6 information given, you can compute some nice statistics (do persons born in december live longer?) but there is no identification possible because 96.000 is very inferior to 200.000.000.
However, if you only have the database of the city you live in, which has for example 200.000 inhabitants, then you cannot guarantee anonymization, because 200.000 is "not much bigger" than 96.000. ("not much bigger" is a truly complex scientific term that requires knowledge of probabilities :P )
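If you want to experiment with this, a small Python sketch like the one below measures the "k" of a dataset: the size of the smallest group of records that share the same values in the columns you release. The records and column names are invented for the example.

```python
from collections import Counter
from math import prod


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[col] for col in quasi_identifiers) for r in records)
    return min(groups.values())


# Number of categories in the six-attribute example above.
categories = prod([2, 4, 2, 50, 12, 10])
print(categories)  # 96000

# Toy data: diagnosis is the sensitive value, the rest could identify someone.
records = [
    {"sex": "M", "birth_year": 1985, "diagnosis": "A"},
    {"sex": "M", "birth_year": 1985, "diagnosis": "B"},
    {"sex": "F", "birth_year": 1985, "diagnosis": "C"},
    {"sex": "F", "birth_year": 1990, "diagnosis": "D"},
]

print(k_anonymity(records, ["sex"]))                # 2
print(k_anonymity(records, ["sex", "birth_year"]))  # 1: someone is unique
```

A result of 1 means at least one person can be singled out by those columns alone, which is exactly what happened with ZIP code, birth date and sex in the Weld case.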
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one alone, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The jurisdiction for the document above is within the rules of the Canadian public health system, but they are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.

Is there pseudocode for UK address or phone number validation?

Do you have pseudocode for field validation of the following items in the UK? I am from the USA, so I only know the ones in the USA right now.
Address Line 1
Phone Number
Mobile Number (in case they have a special rule for this, which they might not)
Post Code
Address line 1: if you want to validate what the user entered freeform, you're probably hosed. There's huge variability. You can use the Postcode Address File (see below) to assist.
Typically, if you want a "standard" address, UK-oriented websites ask for the postcode, then prompt the user to choose the correct address from all addresses at that postcode
Phone and mobile numbers. See here http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom. A script to validate these (in several languages) can be found here: http://www.braemoor.co.uk/software/telnumbers.shtml
Post code format: http://en.wikipedia.org/wiki/UK_postcodes (contains a regular expression for validation, and refers to the Postcode Address File which lists valid addresses)
Address line 1 could be almost anything. There aren't always house numbers.
Phone numbers: the length of an area code varies. I wouldn't like to swear whether the full length is constant, but I suspect it's always at least 10 digits. Mobile numbers typically (IME) start with 07 whereas landlines typically start with 01 or 02. Special numbers (free, local rate etc) typically start with 08. I'll try to find a reference for this. (EDIT: Again, there's a good Wikipedia article.)
Wikipedia has a good article about UK postcodes, including regular expressions for them.
Perl has the Number::Phone module that can handle UK phone numbers. Royal Mail has services for address validation and list cleansing.
For UK phone numbers, your best bet (unfortunately) is probably to download the numbering plans from Ofcom. They're Excel spreadsheets with all relevant number ranges, split up into area codes, "geographical numbers", personal numbers, mobile numbers, assorted service numbers and the different pay-for numbers. They also have a mapping from number range to operator (the latter is probably NOT something you need, though).
As always with this sort of question, don't get hung up on over-validation. Check for the most likely grossly-malformed inputs then move on. Trying to keep track of what prefixes and lengths of phone number are in use in any particular locale is an enormous waste of your time. Letting a few, possibly-fixable mistakes through is better than losing customers by telling them they are ‘invalid’.
(The nearest to a standard addressing format you get in the UK is postcode plus house number. Even then there are always exceptions. )
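In that spirit, a deliberately lenient check might look like the Python sketch below. It only rejects grossly malformed input. The rule it uses (a leading 0 and 10 or 11 digits in total) is an assumption based on the answers above, not an official definition, and the function name is made up.

```python
import re


def plausible_uk_number(raw: str) -> bool:
    """Loose sanity check for a UK phone number; not a full validation."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+44"):
        # Convert international form to national form: +44 20... -> 020...
        digits = "0" + digits[2:].lstrip("0")
    return digits.startswith("0") and len(digits) in (10, 11)


print(plausible_uk_number("020 7946 0958"))
print(plausible_uk_number("+44 (0)20 7946 0958"))
print(plausible_uk_number("12345"))
```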
I've started the code (or at least a bunch of RegEx patterns) for Django forms validation for GB telephone numbers here.
With a basic explanation here.

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone. That's all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which defeats the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers does anyone have any input? Other numbers also welcome if you can give a reason why you think they're better than mine
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
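For reference, the point scheme described above could be written roughly like this in Python. The thresholds are the ones from the question; the function names and the choice of 30 and 182 days for "last month" and "last 6 months" are assumptions.

```python
from datetime import date


def recency_points(last_contact: date, today: date) -> int:
    """5 for same day, down to 1 for within roughly six months, else 0."""
    days = (today - last_contact).days
    for limit, points in ((0, 5), (3, 4), (7, 3), (30, 2), (182, 1)):
        if days <= limit:
            return points
    return 0


def frequency_points(times_contacted: int) -> int:
    """5 for 25+, 4 for 15+, 3 for 10+, 2 for 5+, 1 for 2+, else 0."""
    for threshold, points in ((25, 5), (15, 4), (10, 3), (5, 2), (2, 1)):
        if times_contacted >= threshold:
            return points
    return 0


def contact_points(last_contact: date, times_contacted: int, today: date) -> int:
    """Combined score used to sort auto-complete suggestions, highest first."""
    return recency_points(last_contact, today) + frequency_points(times_contacted)


today = date(2024, 1, 31)
print(contact_points(date(2024, 1, 29), 12, today))  # 4 + 3 = 7
```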
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. Which means, that you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- This gives us our statistical good performance.
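A rough Python sketch of the two heuristics is below. It uses a plain Python list instead of a linked list to keep the example short, and the class names are my own. The frequency version follows the tweak from the EDIT: a selected item moves to the front of the group with the same count.

```python
class MoveToFrontList:
    """Whenever an item is selected, it moves to the front."""

    def __init__(self, items=()):
        self.items = list(items)

    def access(self, item):
        if item in self.items:
            self.items.remove(item)
        self.items.insert(0, item)


class FrequencyList:
    """Items are kept ordered by how often they have been selected."""

    def __init__(self):
        self.items = []
        self.counts = {}

    def access(self, item):
        if item not in self.counts:
            self.counts[item] = 0
            self.items.append(item)
        self.counts[item] += 1
        # Bubble towards the front, past items with a lower or equal count.
        i = self.items.index(item)
        while i > 0 and self.counts[self.items[i - 1]] <= self.counts[item]:
            self.items[i - 1], self.items[i] = self.items[i], self.items[i - 1]
            i -= 1


mtf = MoveToFrontList(["a", "b", "c"])
mtf.access("c")
print(mtf.items)  # ['c', 'a', 'b']

freq = FrequencyList()
for address in ["a", "b", "b", "c"]:
    freq.access(address)
print(freq.items)  # ['b', 'c', 'a']
```

Only the accessed item is touched on each selection, which is the property the answer is after.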
This kind of thing seems similar to what is done by firefox when hinting what is the site you are typing for.
Unfortunately I don't know exactly how firefox does it, point system seems good as well, maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (this could be changed) will have 0 points. You could sort those by total NoM sent (since they are on the contact list :). These will be shown after contacts with points > 0.
It's just an idea, anyway it is to give different importance to the most and just mailed contacts.
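The formula above translates directly into a few lines of Python (the function and parameter names are invented for the example):

```python
def contact_score(sent_today: int, sent_last_week: int, sent_last_month: int) -> float:
    """NoM today + 1/2 * daily average over the week + 1/3 * daily average over the month."""
    return sent_today + (sent_last_week / 7) / 2 + (sent_last_month / 30) / 3


# Two mails today, seven in the last week, thirty in the last month.
print(contact_score(2, 7, 30))

# No mail in the last month: zero points, so fall back to the total count for ordering.
print(contact_score(0, 0, 0))
```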
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
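The answer doesn't quote the paper's formula, so the following Python sketch is only an illustration of the idea. It assumes each past use of a contact contributes a weight of (1/2)^(lambda * age), where age is how many time steps ago the use happened. The function name is made up.

```python
def combined_score(ages, lam: float) -> float:
    """Sum a decaying weight over every past use of a contact.

    ages: how long ago each use happened, in whole time steps (0 = just now).
    lam:  0 counts every use equally (frequency only); larger values favour
          recent uses, and at 1 a single newer use outweighs any number of
          older uses at distinct earlier steps.
    """
    return sum(0.5 ** (lam * age) for age in ages)


often_but_old = [1, 2, 3]   # used three times, none of them just now
once_but_new = [0]          # used once, just now

print(combined_score(often_but_old, 0.0), combined_score(once_but_new, 0.0))
print(combined_score(often_but_old, 1.0), combined_score(once_but_new, 1.0))
```

With lambda at 0 the frequent contact ranks first; with lambda at 1 the recent one does. Values in between trade the two off.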
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo#bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.
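To play with the increment and diminisher before touching the database, the same scheme can be simulated in memory. This Python sketch mirrors the two UPDATE statements above. The names are illustrative, and the 0.9 factor is written as 9/10 in integer arithmetic to avoid floating-point surprises.

```python
favor = {}  # address -> favor score, standing in for the addresslist table


def record_use(address: str, increment: int = 10) -> None:
    """Each use: bump the address's favor by the increment."""
    favor[address] = favor.get(address, 0) + increment


def decay_all(numerator: int = 9, denominator: int = 10) -> None:
    """Each interval: multiply every favor by 9/10, rounding down."""
    for address in favor:
        favor[address] = favor[address] * numerator // denominator


record_use("alice@example.com")
record_use("alice@example.com")
record_use("bob@example.com")
decay_all()
print(favor)  # {'alice@example.com': 18, 'bob@example.com': 9}

# Suggestions sorted by favor, highest first.
print(sorted(favor, key=favor.get, reverse=True))
```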

What is the best format for a customer number, order number?

A large international company deploys a new web and MOTO (Mail Order and Telephone Order) handling system. Among other things you are tasked to design format for both order and customer identification numbers.
What would be the best format in your opinion? Please list any assumptions and considerations.
Accepted Answer
Michael Haren's answer selected due to the most up votes, but please do read other answers and comments as they make Michael's answer more complete.
Go with all numbers or all letters. If you must mix it up, then make sure there are no ambiguous characters (Il1m, O0, etc.).
When displayed/printed, put spaces in every 3-4 characters but make sure your systems can handle inputs without the spaces.
Edit:
Another thing to consider is having a built in way to distinguish orders, customers, etc. e.g. customers always start with 10, orders always start with 20, vendors always start with 30, etc.
DON'T encode ANY mutable customer/order information into the numbers! And you have to assume that everything is mutable!
Some of the above suggestions include a region code. Companies can move. Your own company might reorganize and change its own definition of regions. Customer/company names can change as well.
Customer/order information belongs in the customer/order record. Not in the ID. You can modify the customer/order record later. IDs are generally written in stone.
Even just encoding the date on which the number was generated into the ID might seem safe, but that assumes that the date is never wrong on the systems generating the numbers. Again, this belongs in the record. Otherwise it can never be corrected.
Will more than one system be generating these numbers? If so, you have the potential for duplication if you use only date-based and/or sequential numbers.
Without knowing much about the company, I'd start down this path:
A one-character code identifying the type of number. C for customers, R for orders (don't use "O" as it could be confused with zero), etc.
An identifier of the system that generated the number. The length of this identifier depends on how many of these systems there will be.
A sequence number, unique to the system generating it. Just a counter.
A random number, to prevent guessable order/customer numbers. Make this as long as your paranoia requires.
A simple checksum. Not for security, but for error checking.
Breaking this up into segments makes it more human-readable as others have pointed out.
CX5-0000758-82314-12 is a possible number generated by this approach. This consists of:
C: it's a customer number.
X5: the station that generated the number.
0000758: this is the 758th number generated by X5. We can generate 10 million before retiring this station ID or the station itself. Or don't pad with zeros and there's no limit.
82314: this was randomly generated and results in a 1/100,000 chance of guessing a customer ID.
12: checksum.
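A minimal Python sketch of this layout is below. The answer doesn't say how the checksum is computed, so the sum-of-character-codes modulo 97 used here is purely an assumption for illustration (it will not reproduce the "12" in the example above). The function names are made up too.

```python
import secrets


def checksum(body: str) -> str:
    """Two-digit check value; an assumed scheme, for error checking only."""
    return f"{sum(ord(c) for c in body) % 97:02d}"


def make_id(kind: str, station: str, sequence: int) -> str:
    """Build TYPE+STATION - SEQUENCE - RANDOM - CHECKSUM."""
    random_part = f"{secrets.randbelow(100_000):05d}"  # 1 in 100,000 to guess
    parts = [kind + station, f"{sequence:07d}", random_part]
    return "-".join(parts + [checksum("".join(parts))])


def verify_id(identifier: str) -> bool:
    """Recompute the checksum and compare it with the last segment."""
    *parts, check = identifier.split("-")
    return checksum("".join(parts)) == check


customer_id = make_id("C", "X5", 758)
print(customer_id, verify_id(customer_id))
```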
A primary advantage of using only numbers is that they can be entered much more efficiently using 10-key.
The length of that number should be as short as possible while still encompassing the entire entity space you expect to catalog with room to spare. This can be tricky and should be given a bit of thought. A little set theory can give you the number of unique keys you will have access to, given a group of elements.
It is natural when speaking, to break numbers up into sets of two to four digits. By inserting dashes in some pattern, you can "force" the customer to repeat them in a more efficient and unambiguous manner.
For instance, 323-23-5344, which, of course, is social security number format, helps to inform the speaker where to pause when vocalizing the number. It also provides a visual delineation when writing the number and makes it easy to compare when copying the number.
I second the recommendation that the ordering system masks the input correctly so that no dashes need to be entered at any time. This should be carried through to printed forms to provide a clear expectation of what should be entered. For instance, a printed box for each digit separated by printed dashes.
I disagree that too much information should be embedded in this number especially if those attributes might change. For instance, say we give "323" the meaning of "is a nice customer" but then they call in four times with an attitude. Are we then going to change their customer key to "324", "is a jerk"? What if they are in region 04 and move their company to region 05?
If that happens, your options will be to update that primary key throughout the database or live with the ambiguity that the information embedded in that key is no longer reliable, thus rendering all of the information embedded in the keys of questionable utility.
It is better to store attributes that may change as separate fields in the database and have the customer number be a unique, unchanging key for that customer.
To build on Daniel and Michael's questions: it's even better if the separated numbers MEAN something else. For example, I worked for a company where account numbers were like this:
xxxx-xxxx-xxxxxxxx
The first set of numbers represented the region and the second set represented the market within that region. Once you got used to knowing what numbers were from where, it made it really easy to tell what area an account was in without even having to look at the customer's account.
There are several assumptions that I make when answering this question; some are based on the fact that it is a large international organization, and some are based on the fact that the format is for two separate table types.
Assumptions based on the fact that it's an international organization:
It is probable that each region will need to operate independently -- that is, region A must be able to add customer numbers independently from region B
Each region probably uses a different language so to make the identifiers easily type-able by users around the world, it is best to stick to numbers and spaces only.
Assumptions based on the fact that there are two tables for which this format will be used:
This format may be used by more than the two tables listed, so it should be able to handle an arbitrarily large number of tables.
Experienced users should be able to know what type of identifier they are looking at based on information encoded into the identifier itself.
It would be nice if identifiers were globally unique within the entire system.
Considerations:
For a global company, identifiers can be very long if only numerics are used. We should attempt to limit the amount of extraneous information encoded into the identifier as much as possible.
Identifiers should be self-verifiable to a limited extent; that is a program should be able to detect a large percent of invalid identifiers without looking anything up at all. This implies a checksum.
Proposed format:
SSSS0RR0TTC
The format proposed is as simple as possible, but no simpler:
C The first (rightmost) character will be a checksum of all other characters in the identifier. A simple checksum will do. This will eliminate 90% of all typing errors. If it is decided that this is not enough, then this can be expanded to 2 digits which will eliminate 99% of all typing errors.
TT The next N digits represent the table type number. No table type number can contain the digit zero.
The next digit is a zero. This zero separates the table type number from the region number.
RR The next N digits are the region number. No region numbers can contain a zero.
The next digit is a zero. This zero separates the region from the sequence number.
SSSS The next N digits are the sequence number. This number can contain zeros.
Each set of four numbers are separated by spaces when printed or typed in by convention. Internally they are not separated, but this helps the user transfer them correctly.
Examples
Assuming:
Customer table type=1
Order table type=2
Region code for US-Alabama=1
Region code for CA-Alberta=43
Region code for Ethiopia=924
10 1013 - Customer #1 in Alabama (3 is the checksum: 1 +1 + 1)
10 1024 - Order #1 in Alabama
9259 0304 3016 - customer # 925903 in Alberta, Canada
20 3043 4092 4023 - order number 2030434 in Ethiopia
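The examples above are consistent with a checksum that is the sum of all other digits modulo 10. Assuming that, the format can be encoded and decoded with a short Python sketch (function names are my own):

```python
def encode(sequence: int, region: int, table_type: int) -> str:
    """Build SSSS 0 RR 0 TT C and print it in groups of four from the right."""
    if "0" in str(region) or "0" in str(table_type):
        raise ValueError("region and table type must not contain the digit 0")
    body = f"{sequence}0{region}0{table_type}"
    digits = body + str(sum(int(d) for d in body) % 10)
    groups = []
    while digits:
        groups.insert(0, digits[-4:])
        digits = digits[:-4]
    return " ".join(groups)


def decode(identifier: str):
    """Return (sequence, region, table_type), checking the checksum first."""
    digits = identifier.replace(" ", "")
    body, check = digits[:-1], int(digits[-1])
    if sum(int(d) for d in body) % 10 != check:
        raise ValueError("checksum mismatch")
    # Parse from the right: the last zero precedes the table type,
    # the zero before that precedes the region.
    rest, table_type = body.rsplit("0", 1)
    sequence, region = rest.rsplit("0", 1)
    return int(sequence), int(region), int(table_type)


print(encode(1, 1, 1))             # 10 1013
print(encode(925903, 43, 1))       # 9259 0304 3016
print(decode("20 3043 4092 4023")) # (2030434, 924, 2)
```

One caveat: a plain digit sum catches a single mistyped digit but not two digits swapped, since the sum stays the same.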
Advantages of this approach:
90% of mistyped numbers will be caught
There are an unlimited number of table types
There are an unlimited number of regions
There are an unlimited number of sequential numbers for each table
Identifier numbers are globally unique to the system. This is important - a customer number cannot be mistaken for an order number and vice versa.
Each region can independently add sequence numbers without a global key
Disadvantages
Each identifier is at least six characters
Table type numbers and region numbers cannot contain a zero, because the zero is used to separate the sequence number from the region number and the region number from the table type number.
Make the number as long as necessary, but not any longer. Every time I pay my water bill, I have to enter my 20-digit customer number, and an 18-digit invoice number. Thankfully, a dash in my customer number separates it into two parts.
Do not depend on leading zeros. Having to figure out how many zeros are in my invoice number is extremely annoying. Take 000000000051415432 for example. Their system won't recognize just 51415432.
Group digits together. If you absolutely have to use long numbers, four-digit chunks should work well.
I would never use user information in IDs. Suppose you use the first letters of the customer's last name followed by some number: e.g. Thomson could be customer THOM-0001.
Only, it turns out you made a mistake, and the man's name is Tomson rather than Thomson. User data can be corrected; IDs should never be modified. So the next time you look up Tomson under TOMS-..., you can't find him. The same goes for other data, such as a customer type: it can always change, but the ID can't.
This is very basic to RDBMS.
Simply use counting numbers. For readability it's a good idea to insert separators such that you never have more than 4 successive digits: 9999-9999 is better than 999-99999. And don't make the number longer than necessary; people are much more annoyed by being reduced to a 20 digit number than just being reduced to a number.
There's a catch, though. Especially for a small business, simple counters can give away more than you would like. Say I order something from you, and the order number is 090145. Next month I order again, and the order number is 090171. Er... 26 orders in a month? Likewise, I wouldn't feel comfortable becoming customer 0006 of a business that has been active for 10 years.
The solution is simple: skip numbers. Don't use random numbers, because you still want them to be in sequence.
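One way to skip numbers while keeping them in sequence is to advance the counter by a random step each time. This is only a sketch: the class name, the starting value and the maximum gap are all made up, and a real system would have to persist the last issued number and increment it atomically.

```python
import secrets


class GappedCounter:
    """Issues increasing numbers with random gaps between them."""

    def __init__(self, start: int = 1000, max_gap: int = 20) -> None:
        self._last = start
        self._max_gap = max_gap

    def next_number(self) -> int:
        # Advance by a random step of 1..max_gap, so the numbers stay in
        # order but the difference between two of them reveals less.
        self._last += 1 + secrets.randbelow(self._max_gap)
        return self._last


orders = GappedCounter()
print([orders.next_number() for _ in range(5)])  # increasing, unevenly spaced
```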
I would have my order numbers follow this format:
ddmmyyyy-####-####
Where ####-#### resets to zero at the beginning of every day. This makes it very easy to correlate orders with the date they were placed.
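A rough sketch of that order number format follows. The class name is invented, the counter is kept in memory only, and whether the first order of the day should be 0000-0000 or 0000-0001 is not specified above; this version assumes 0000-0001.

```python
from datetime import date


class DailyOrderNumbers:
    """Produces ddmmyyyy-####-#### with a counter that restarts each day."""

    def __init__(self) -> None:
        self._day = None
        self._count = 0

    def next_number(self, today: date) -> str:
        if today != self._day:
            # First order of a new day: reset the counter.
            self._day, self._count = today, 0
        self._count += 1
        counter = f"{self._count:08d}"
        return f"{today:%d%m%Y}-{counter[:4]}-{counter[4:]}"


numbers = DailyOrderNumbers()
print(numbers.next_number(date(2009, 3, 5)))  # 05032009-0000-0001
print(numbers.next_number(date(2009, 3, 5)))  # 05032009-0000-0002
print(numbers.next_number(date(2009, 3, 6)))  # 06032009-0000-0001
```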
For customer IDs, I would mix capital letters and numbers, but as Michael said, avoid commonly mistaken characters (0, O, L, 1, 5, S). That leaves 30 characters to work with. If you use 20 characters, that gives you roughly a 98-bit range of customer IDs (30^20 is about 3.5 x 10^29), well beyond 64 bits and pretty good for security. Make sure you use a secure random number generator when generating IDs. As for how you display them, the format should be the following:
####-####-####-####-####
As Michael said again, make sure your system can deal with dashes, spaces, no spaces, or no dashes. (It should just strip all those characters from the input before validation.)
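Here is a minimal sketch of generating, displaying and cleaning up such an ID. The function names are made up, and the excluded characters are taken from the list above (treated as uppercase).

```python
import secrets
import string

# Digits and capital letters, minus the easily confused 0, O, L, 1, 5, S.
ALPHABET = "".join(
    c for c in string.digits + string.ascii_uppercase if c not in "0OL15S"
)


def new_customer_id(length: int = 20) -> str:
    """Pick each character with a cryptographically secure generator."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def display(raw: str) -> str:
    """Format as ####-####-####-####-#### for printing."""
    return "-".join(raw[i:i + 4] for i in range(0, len(raw), 4))


def normalize(typed: str) -> str:
    """Strip dashes and spaces and upper-case the input before validation."""
    return "".join(c for c in typed.upper() if c not in " -")


customer_id = new_customer_id()
print(display(customer_id))
print(normalize(display(customer_id)) == customer_id)  # True
```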
I hope that helps!
You may add a small checksum (using XOR for instance) to ensure (enhance) correctness of given ids.
If it's by mail, consider z-base-32 encoding. But here, with telephone orders, you may prefer decimal identification.
assuming that the creation of orders/customers is not centralized, or will not always be centralized, use a GUID
if the creation of orders/customers will always be centralized, an unsigned integer would be fine
there is no compelling reason for the order number or customer number to "mean" anything, and it is likely that any segmented number scheme invented will have to be overhauled down the road. Stick to something unique and meaningless.
EDIT: for MOTO, any multi-character alphabetical identifier will cause problems over the phone, so GUIDs are right out. Assuming multiple decentralized MOTO locations, assign each MOTO location a prefix (A, B, C, etc., or 01, 02, ...) and use an integer or big-integer for the customer and order IDs, e.g. 01-1 is the first order from MOTO location #1. Note that zero-padding is unnecessary, imposes an implicit digit limit to the numbers, and requires the customer to distinguish between six zeros and seven zeros when speaking the number. If you must use a padded fixed-length format, break the number up into groups of no more than 4 or 5 digits each.
ADDENDUM: the order number and the customer number do not have to be the primary keys of their respective tables, just unique indexed columns for lookup. You'll probably want to use something simpler/more efficient for the primary keys in the database.
We use leading zeroes for some of our reference "numbers" where I work and I can't tell you how many wasted hours I've had over the last seven years forcing Excel to treat them as text. Don't do it.
Auto-incrementing integers are all well and good for computers, but they greatly reduce human beings' ability to spot errors. How important that is will depend on your business. I work with property (housing) related data and our primary reference has the front door embedded in it. It's not elegant but it means that experienced admin staff can spot 90% of minor errors (when we get invoices, etc. in) before they get near a database. But in an environment where you're not relying on that kind of process this argument is less compelling.
(Now, some folks have strongly warned about using meaningful data in references as it could be changed, and there's some truth in that, but you can be smart. You don't have to pick something obviously fickle like whether the person is married - you can anchor yourself on past events like a character representing the region where they first opened a particular account. Even if you don't do that, have some kind of pattern to help communication with customers. I've worked in a number of call centres and people sometimes come to the phone with every piece of documentation from birth certificate onwards as they desperately try to find their account/order/customer number. I don't think saying "It'll be a number between 1 and 100 trillion" would be very handy.)
It's been said, but don't create enormously long references. We're busy people, we haven't got time to be keying in this crap over a phone system and making a mistake on digit 17 only to restart (again). Some of your customers may have disabilities and it's likely a growing number will be over 55. Once again, watch out for the zeroes. You see purchase order numbers and the like with fourteen digits. How many orders do they think they're going to be placing?
If there's going to be any data aggregation outside of your network (and thus not connected to your database) - have some sort of check digit/regular expression pattern with which your partners/suppliers can verify they've not made mistakes. The UK's electrical supply numbering system (MPAN) is a good example of this - designed for people to maintain their own records without having to download the big list of every electricity meter in the universe to check they've not made a typo.
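To illustrate the check digit idea, here is a sketch using the Luhn algorithm, the scheme used on payment card numbers. This is a stand-in chosen for familiarity, not the MPAN scheme mentioned above, and the function names are made up.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload from the right, doubling every second digit,
    # starting with the rightmost payload digit.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check a full number (payload plus check digit) offline."""
    digits = number.replace(" ", "").replace("-", "")
    if len(digits) < 2 or not digits.isdigit():
        return False
    return luhn_check_digit(digits[:-1]) == int(digits[-1])


print(luhn_check_digit("7992739871"))    # 3
print(luhn_is_valid("7992-7398-713"))    # True
print(luhn_is_valid("7992-7398-723"))    # False (one digit mistyped)
```

A partner can run the same check on their side without access to your database, which is the point being made above.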
I would use numbers only since it is an international company. I would use spaces or dashes every 4-6 numbers to separate it. I would also keep the format separate for quick identification
Example:
000-00000-00000 - could be a customer number
00000-00000-00000-00000 - could be an order number
Stick to numbers (no chars or special stuff):
Can be easily input in an IVR flow
It's international - no language hassles
No confusion in chars vs. numbers - O vs. 0, I vs. 1
As long as leading 0 is meaningless, you can store/manipulate them more effectively
I would use a completely numeric system for both Order Number and Customer Number; this will allow you to avoid issues with other languages.
Avoid leading zeros, as this can cause issues with data entry and validation.
The number of digits for each will depend on your expected volume. You will always have more Order Numbers than Customer Numbers. A six-digit Customer Number starting at 100000 still gives you 900,000 customers. Adding 3-4 digits for the order number gives you 999 to 9,999 orders per customer (more if you consider one-off customers).
There is no need to build any sort of identification into your numbering sequence. You have other database fields to identify where a customer is from, etc. Do not overly complicate your system.
KISS (keep it simple stackoverflow)
I would suggest using 16 digit identifiers that when printed or shown to customers are formatted in the format of xxxx-xxxx-xxxx-xxxx but stored as numbers without the dashes in your system.
The reason for using this format is that it makes the number easier to read out over the phone: people can do it in batches of 4 rather than trying to remember how much they have said already.
If you wish the first 4 digits can be used to identify the type of number, 1000 for customers, 2000 for suppliers, 3000 for orders, 4000 for invoices etc.
The second set can then be a year/month identifier if you wish to keep that sort of information encoded in the number itself, using a format of yymm, so 1000-0903-xxxx-xxxx would be a customer entered in March 2009.
This then leaves you with 8 digits for the actual data itself.
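The layout described above (type prefix, yymm, then 8 digits) might be sketched like this. The function names and the dictionary are made up for illustration; the prefixes are the ones suggested above.

```python
from datetime import date

# Type prefixes as suggested above.
TYPE_PREFIX = {"customer": 1000, "supplier": 2000, "order": 3000, "invoice": 4000}


def make_identifier(kind: str, created: date, serial: int) -> int:
    """Pack type prefix, yymm and an 8-digit serial into one 16-digit number."""
    if not 0 <= serial < 10**8:
        raise ValueError("serial must fit in 8 digits")
    yymm = created.year % 100 * 100 + created.month
    return (TYPE_PREFIX[kind] * 10**4 + yymm) * 10**8 + serial


def display(identifier: int) -> str:
    """Format as xxxx-xxxx-xxxx-xxxx for customers."""
    s = f"{identifier:016d}"
    return "-".join(s[i:i + 4] for i in range(0, 16, 4))


def parse(text: str) -> int:
    """Accept the number with or without dashes and spaces; store it as an integer."""
    return int(text.replace("-", "").replace(" ", ""))


ident = make_identifier("customer", date(2009, 3, 15), 42)
print(display(ident))                  # 1000-0903-0000-0042
print(parse(display(ident)) == ident)  # True
```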
I would consider the use of letters in the identifiers to be a very bad idea for any system that deals with telephones: accents and comprehension vary so much that people are bound to get upset trying to get their identifier recognised by someone who cannot understand their accent properly.
An additional consideration beyond the format issue: in the code, create separate classes for OrderId and CustomerId. These classes are immutable and validate their input to ensure that it is an acceptable ID. Also, no value should be valid as both an order ID and a customer ID.
The simplest approach would just be to have the backing values for OrderId be ints that start with 1, and CustomerIds be ints that start with 2, or something similar.
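In Python, for instance, this could be sketched with frozen dataclasses. The validation rule follows the "starts with 1" / "starts with 2" suggestion above; everything else is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderId:
    value: int

    def __post_init__(self) -> None:
        # Order IDs are positive integers whose first digit is 1.
        if not isinstance(self.value, int) or not str(self.value).startswith("1"):
            raise ValueError(f"not a valid order ID: {self.value!r}")


@dataclass(frozen=True)
class CustomerId:
    value: int

    def __post_init__(self) -> None:
        # Customer IDs are positive integers whose first digit is 2.
        if not isinstance(self.value, int) or not str(self.value).startswith("2"):
            raise ValueError(f"not a valid customer ID: {self.value!r}")


order = OrderId(1001)
customer = CustomerId(2001)
print(order, customer)
# OrderId(2001) raises ValueError, so a customer number can never be
# passed where an order number is expected without someone noticing.
```

Because the two types are distinct, a function that takes an OrderId documents (and, with a type checker, enforces) that it does not accept a CustomerId.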
Wow - what a simple yet revealing question! And what a lot of contradictory answers. I think there are 3 obvious candidate answers here:
1) Use an autoincrementing long integer.
2) Use a GUID
3) Use a compound type that includes other information in the ID.
For simpler systems, and especially web based systems where all users are hitting a central database, (1) works well. It has the advantage that numbers stay as short and simple as possible, but no shorter, and it avoids alphabetic characters (you would be amazed how different the names for the same letters are in different countries - one country's E is another country's I). It does not intrinsically differentiate the order ID from the customer ID, but you could always prepend or append a "C" or "O" to each and silently drop them on entry.
It also does not have a checksum or error check.
For distributed systems where many software components need to create the numbers on the fly, without reference to a master database, (2) is the only way to go. GUIDs have the advantage of being largely self-checking, since the address space is so large that a mistyped value will almost never match a real one, but by the same token they are too long and alphanumeric to read comfortably over the telephone.
As for (3) - embedding region information or today's date into the number - those are the sorts of ideas that experienced developers train themselves out of. It looks like a good idea at first, but always comes back to haunt you. Consider the case where a customer moves to a new state, or an order is manually rekeyed a week after it was originally issued. These items of information belong in related tables where they can be edited independently of the ID, which should represent the entity's identity only.
To repeat: NEVER ENCODE BUSINESS DATA IN AN ID OR PRIMARY KEY - every time you do that you leave a time bomb for others to clean up one day.
Given that this is a centralised (phone based) system I would go with option (1) until a clear need arose to change. Simpler is usually better. Insert hyphens as others suggest and prepend or postpend a checksum and/or identifying letter if required.
First step: in an org sufficiently large to require such a system, there is an existing system that you're replacing. Continue the previous system's scheme, if possible. It makes a lot of things easier if you can access, even at a basic level, the data from the old system.
That said, there's often a good reason to change the scheme, particularly when it's coming from a legacy system. I find, though, that it's often helpful to formally rule out the old scheme before proceeding.
Second step: systems like this never exist in a vacuum. Is there already an organization-wide scheme for user and/or order IDs, such as in the accounting, inventory management, or CRM system? If so, consider adopting the existing schemes to make interoperability easier. Many large orgs have multiple ways to specify a single customer or order, and it just makes getting useful intelligence out of the data that much harder.
Third step: if the old system's scheme is too awful to continue and there's no other scheme to adopt, roll your own. In this case, look at the shortcomings of the original scheme, whatever they are, and correct them. The right answer will depend on the specific requirements of the application. The problem statement you've given us is too vague to speculate usefully on what the final form might look like.
I always stick with auto-increment numbers, and I always seed the sequence high enough so that they will all have a consistent number of digits - seems to be less confusing.
I also sometimes start an order number, say 6 digits, starting at 200,000 and customer numbers at 5 digits, starting at 10,000 which would for example give me 90,000 unique customer numbers and 800,000 unique order numbers to use, and you could always tell just by looking at it whether it was a customer number or an order number. (i.e. so if a customer rep was asking for a number over the phone it would immediately be obvious which was which)
I would not however build logic in the app that would depend on that, so even if it did roll over, the system wouldn't care.
The biggest issue here is to try not to overthink the problem.
Although I'm more experienced in e-commerce systems I think some of the points made in this post could be applied to mail order and telephone order systems.
For orders, an auto-increment integer works perfectly as the primary key in the database as well as the number that the customer will see on his/her invoice. There is absolutely no reason to create some overcomplicated algorithm for your numbers. If you want to tell which country/region they're from, use a separate field in your database. Also, if you are concerned about your competitors spying on you: let them! If your business revolves around spying on your competitors because you're not generating enough revenue, then most likely your business idea isn't good in the first place. And if you wanted to fool your competitor, you could just create your own script that autocreates fake orders. If your e-commerce system is well designed then this won't be an issue.
Key stuff using an auto-increment integer:
All numbers/digits => easier to communicate, no ambiguities over the phone, works for all languages/cultures that use 0-9 as their numerical system
No extra coding
Looks nice on the invoice and it's the shortest possible number of digits a customer would ever need to spell out over the phone
Works for small AND large businesses
It's scalable
Service-minded/customer-minded (what's best for the customer) (see bullets 1 and 3)
Simple
Whenever or whatever you're designing should always begin with what's best for the customer. At the end of the day they are the ones putting food on your table. A happy customer is a returning customer.
My preferred approach combines the date with a counter of today's transactions. I was challenged to come up with an order number of only 5 digits, so I came up with the following:
Get the current date, then
get the current counter for today's transactions and add 1.
I decided to count in a base larger than decimal (base 10), so I use base 16. The largest 5-digit hexadecimal number (FFFFF) is 1,048,575, so by involving the date I can say I get up to 1,048,575 counts per day. To make the count differ from day to day, I mix in the date by taking the sum of the following:
Current year count, starting at 1 in the year of implementation
Current hour (max is 24)
Current day of the year (max is 365)
That gives a prefix of at most 3 digits for my count, so the count is XXX followed by today's current transaction number. Example:
Current Date:
2014-12-31 01:22 PM
Implementation date: 2010
Running total for today's transaction: 100
Count: (5 + 13 + 365) = 383, followed by 100 + 1 = 101, giving 383101
Order Number: AD-5D87D (383101 written in base 16 is 5D87D)
AD is just a custom order number prefix. By my estimate, I will not run out of order numbers until 1,000,000 years after my implementation date.
That said, this is not a good solution if you expect as many as 1,000,000 transactions per day.
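The worked example can be reproduced with a few lines of Python. The function and parameter names are made up; the date part and the counter are simply written one after the other in decimal (no padding, since none is described above) and the result is converted to hexadecimal.

```python
from datetime import datetime


def order_number(now: datetime, first_year: int, todays_count: int,
                 prefix: str = "AD") -> str:
    """Build PREFIX-HEX from the date sum followed by today's next counter."""
    year_count = now.year - first_year + 1      # implementation year counts as 1
    day_of_year = now.timetuple().tm_yday
    date_part = year_count + now.hour + day_of_year
    # Write the date sum and the next counter value side by side in decimal,
    # then convert the combined number to base 16.
    decimal = int(f"{date_part}{todays_count + 1}")
    return f"{prefix}-{decimal:X}"


print(order_number(datetime(2014, 12, 31, 13, 22), 2010, 100))  # AD-5D87D
```

One thing to keep in mind: because the date part is a sum, different combinations of hour and day can produce the same value, so the scheme relies on the counter to keep numbers apart.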

Resources