Privacy and Anonymization "Algorithm"

I read this problem in a book of interview questions and wanted to discuss it in detail here. Kindly shed some light on it.
The problem is as follows:
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address and social security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data, and by collating the data across multiple columns she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example data set, and see what it actually takes to do this anonymization. I hope the question is clear.
I have no experienced person who can help me with this kind of problem, so kindly don't vote to close this question.
Thanks, and if any more explanation of the question is required, please ask.

I just copy-pasted part of your text and stumbled upon this; it helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a sample database, and to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two possible values: M and F.
But if you also take the birthdate, then your total number of possible combinations becomes roughly 2*365*80 ≈ 58,000 (I chose 80 years of ages). Even if your database contains 500,000 people, there is a chance that one of them (say, a male born on 03/03/1985) is the ONLY one with such an entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more sophisticated, look into correlated information and PCA.
Edit: Let's give an example. Suppose I'm working with medical data. If I keep only
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total of 2*4*2*50*12*10 = 96,000 categories. Thus, if your database contains 200,000,000 entries (a rough approximation of the number of inhabitants of the USA), there is NO WAY you can identify someone.
This also implies that you do not give out any further information - no ZIP code, etc. With only the 6 attributes given, you can compute some nice statistics (do people born in December live longer?), but no identification is possible, because 96,000 is much smaller than 200,000,000.
However, if you only have the database of the city you live in, which has, for example, 200,000 inhabitants, then you cannot guarantee anonymization, because 200,000 is "not much bigger" than 96,000. ("Not much bigger" is a truly complex scientific term that requires knowledge of probability :P)
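To make the "at least k individuals" idea concrete, here is a minimal sketch in plain Python (the records and column names are hypothetical) that counts how many rows share each combination of quasi-identifiers; if the smallest group is below your chosen k, the data set is not k-anonymous for those columns.
    from collections import Counter
    # Hypothetical "anonymized" records; each dict is one row.
    records = [
        {"sex": "M", "zip": "02138", "birth_year": 1945, "diagnosis": "flu"},
        {"sex": "M", "zip": "02138", "birth_year": 1945, "diagnosis": "asthma"},
        {"sex": "F", "zip": "02139", "birth_year": 1970, "diagnosis": "flu"},
    ]
    def smallest_group(rows, quasi_identifiers):
        """Size of the smallest group of rows sharing the same quasi-identifier
        values; the data is k-anonymous for these columns iff this is >= k."""
        groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
        return min(groups.values())
    k = 5
    if smallest_group(records, ["sex", "zip", "birth_year"]) < k:
        print("Not k-anonymous: some combination maps to fewer than", k, "people")
The idea would be to keep aggregating (dropping ZIP digits, bucketing birth dates into age ranges, and so on) until a check like this passes for your chosen k.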

"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own example: find a data set on your own, "anonymize" it, and try to reconstruct it.

Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving it.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The jurisdiction for the document above is within the rules of the Canadian public health system, but they are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.

Related

How to optimize employee transport service?

Gurus,
I am in the process of writing some code to optimize employee transport for a corporation. I need all you experts' advice on how this can be achieved. Here is my scenario.
There are 100 pickup points all over the city from which employees need to be brought to the company in multiple vehicles. Each vehicle can carry, say, 4 or 6 employees. My objective is to write some code that will group people from nearby areas and bring them to the company. The master data will have addresses and their latitude/longitude. I want to build an algorithm that optimizes vehicle occupancy as well as distance and time. Could you please give some direction on how this can be achieved? I understand I may need to use the Google Maps or Directions API for this, but I'm looking for some hints/advice on the logic.
Some more inputs: these are company vehicles with drivers. Travel time should not be more than 1.5 hrs.
Thanks in advance.
Your problem description is a more complicated version of the travelling salesman problem. You can look it up and find different examples and how they are implemented.
One point needs to be clarified: will the vehicles be employee vehicles that are car-shared, or company vehicles with a driver?
You also need to define some time constraints. For example, 50 employees should have under 30 minutes of travel, 40 employees under 1 hour, and 10 employees under 1.5 hours.
You also need to define the travel time for each road depending on the time of day, because at different times there may or may not be traffic jams.
You also need to define groups within the employees: usually people in a company (admin clerks or the CEO) don't all commute at the same time; there can be a range of an hour or more.
Finally, don't forget to account for the roughly 10% of employees who will be 2 to 5 minutes late to their meeting point.
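If you just want something to experiment with, here is a rough sketch in Python under some heavy assumptions: made-up coordinates, straight-line distance instead of real road travel times from the Google Maps/Directions API, and none of the shift or lateness constraints mentioned above. It greedily fills each vehicle with the nearest unassigned pickup points and then orders each vehicle's stops nearest-neighbour style.
    import math
    # Hypothetical data: (employee_id, latitude, longitude) and the office location.
    pickups = [("e1", 12.97, 77.59), ("e2", 12.98, 77.60), ("e3", 12.90, 77.52),
               ("e4", 12.91, 77.53), ("e5", 12.99, 77.61), ("e6", 12.89, 77.51)]
    office = (12.93, 77.62)
    capacity = 4  # seats per vehicle
    def dist(a, b):
        # Straight-line proxy; replace with road distance/time from a maps API.
        return math.hypot(a[0] - b[0], a[1] - b[1])
    def group_pickups(points, cap):
        """Greedy grouping: seed each vehicle with the point farthest from the
        office, then fill the remaining seats with its nearest neighbours."""
        remaining = list(points)
        groups = []
        while remaining:
            seed = max(remaining, key=lambda p: dist((p[1], p[2]), office))
            remaining.remove(seed)
            remaining.sort(key=lambda p: dist((p[1], p[2]), (seed[1], seed[2])))
            groups.append([seed] + remaining[:cap - 1])
            remaining = remaining[cap - 1:]
        return groups
    def route(group):
        """Order one vehicle's stops with a nearest-neighbour heuristic."""
        stops, ordered = list(group), []
        pos = (stops[0][1], stops[0][2])
        while stops:
            nxt = min(stops, key=lambda p: dist((p[1], p[2]), pos))
            stops.remove(nxt)
            ordered.append(nxt[0])
            pos = (nxt[1], nxt[2])
        return ordered
    for vehicle in group_pickups(pickups, capacity):
        print(route(vehicle), "-> office")
A real implementation would treat this as a capacitated vehicle routing problem with time windows and hand it to a proper solver, but a greedy pass like this is enough to start playing with your own data.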

How to notice unusual news activity

Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are ways you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you can see whether a change falls substantially outside the usual variance.
You could also try some normalization -- one very simple one would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name            m      m0     δ      z
Steve Jobs      4950   4500   0.10   495
Steve Ballmer   400    300    0.33   132
Larry Ellison   50     10     4.00   200
Andy Nobody     50     40     0.25   12.5
Here, a 400% change for unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my 25% spike is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or a rolling average to find out whether a value is far away from the "expected" results.
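A minimal sketch of that scoring in Python (the counts are hypothetical; δ is computed exactly rather than rounded as in the table):
    # (name, current mentions m, previous mentions m0) -- hypothetical counts
    counts = [("Steve Jobs", 4950, 4500), ("Steve Ballmer", 400, 300),
              ("Larry Ellison", 50, 10), ("Andy Nobody", 50, 40)]
    def spike_score(m, m0):
        delta = (m - m0) / m0   # change since the last period
        return m * delta        # z = m * delta, as in the table above
    for name, m, m0 in sorted(counts, key=lambda r: -spike_score(r[1], r[2])):
        print(f"{name:15s} z = {spike_score(m, m0):7.1f}")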
Create a database and keep a history of stories with a timestamp. You then have a history of stories over time for each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
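A small sketch of that check in Python (the daily counts and the threshold X are made up):
    import statistics
    history = [12, 9, 15, 11, 10, 14, 13, 12]   # mentions per day (hypothetical)
    current = 31                                 # today's count
    X = 1.0                                      # tune per entity's volatility
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if current > mean + X * stdev:
        print(f"Unusual: {current} vs mean {mean:.1f}, threshold {mean + X * stdev:.1f}")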
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that mention them. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you then search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
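Here is a rough sketch of that weighted resampling in Python; the daily counts, the decay rate and the number of draws are all arbitrary assumptions, not anything prescribed above.
    import random
    daily_counts = [3, 0, 1, 4, 2, 0, 5, 1, 2, 3]   # oldest ... newest (hypothetical)
    today = 9                                        # today's observed count
    decay = 0.9                                      # weight multiplier per day of age
    draws = 10_000
    ages = range(len(daily_counts) - 1, -1, -1)      # age in days of each count
    weights = [decay ** age for age in ages]         # newer days sampled more often
    # Sample with replacement according to the recency weights.
    simulated = random.choices(daily_counts, weights=weights, k=draws)
    p = sum(1 for s in simulated if s >= today) / draws
    print(f"About {p:.1%} of resampled days are >= {today}; a small value means 'unusual'")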
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

Is 23,148,855,308,184,500 a magic number, or sheer chance?

News reports such as this one indicate that the above number may have arisen as a programming bug.
A man in the United States popped out to his local petrol station to buy a pack of cigarettes - only to find his card charged $23,148,855,308,184,500. That is $23 quadrillion (£14 quadrillion) - many times the US national debt.
In hex it's $523DC2E199EBB4 which doesn't appear terribly interesting at first sight.
Anyone have any thoughts about what programming error would have caused this?
Add the cents to the number and you get 2314885530818450000, which in hexadecimal is 2020 2020 2020 1250.
Do you see the pattern? The first six bytes have been overwritten by spaces (hex 20, dec 32).
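You can check the pattern yourself in a couple of lines of Python; the byte interpretation is just an illustration of the "text field read as a 64-bit integer" theory, not a claim about the actual system involved.
    import struct
    cents = 2314885530818450000          # $23,148,855,308,184,500.00 in cents
    print(hex(cents))                    # 0x2020202020201250
    # Six ASCII spaces (0x20) followed by the bytes 0x12 0x50, read as a
    # big-endian unsigned 64-bit integer, give exactly the same value.
    print(struct.unpack(">Q", b"      \x12\x50")[0] == cents)   # True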
Hold on a second; there’s something fishy going on.
While the space-padded explanation certainly seems good, it may be (at least partly) specious.
VISA said that there were “fewer than 13,000” customers affected by the snafu with the Visa Buxx pre-paid cards. I’ve found news stories on several so far: Josh Muszynski in New Hampshire, Jason Bryan in Tennessee, Ron Seale in Texas, Karen Taylor’s teenage son in Bethel, and a teenage girl, Elizabeth Lewis, in Owatonna.
The thing is that all of them have the exact same charge: $23,148,855,308,184,500.00. If the problem was the space-padding, then how is it that all of them had the exact same $0x1250 ($46.88) charge? Two of them had purchased cigarettes at gas stations, another two had paid at restaurants, Lewis bought eggs and milk, the last one at a drug store. Do all these varied items happen to cost the same? $46.88 for a restaurant bill seems okay, but for a pack of cigarettes? for milk and eggs‽
The space-padding error makes sense, except it does not account for the 0x1250 constant. Why is it that all of them ended up with 0x2020 2020 2020 1250 instead of 0x2020 2020 2020 2020 or different numbers in the last WORD?
Hmmm, if only 13,000 customers were affected, it may be that somehow that exact, specific charge triggered the error. In that case, it is more than just a field error. If it was just the text field being interpreted as a 64-bit integer, then why didn’t other amounts cause it, thus affecting everyone rather than just <13,000? Still, how is it that 13,000 people could have just happened to charge the exact same amount in the same week?
They say it’s a “temporary programming error”, and it may well be, but could it be a hacking thing? In that case, it probably would be a magic-number. In fact, it may be a combination of both: some hacker putting a 0x1250 automatic charge, that got combined with the space-padding error, causing one or both errors to be detected.
The Register thinks that the answer is indeed the padded-field error, but does not expand on why they are all the same, although one of the comments mentions the number possibly being rounded to the nearest $100 (unlikely since banks and banking software explicitly go to lengths to ensure precision).
(There is also a report of a similar, earlier error.)
(Images of the bills were attached here for Jason Bryant, Elizabeth Lewis, Ron Seale, and Josh Muszynski.)
What happens when you make a purchase by card is that the software immediately goes online to ensure you have sufficient funds for the purchase, but only places a hold on the funds for the transaction. At the end of the working day the software then gathers all the transactions placed in the last 24hrs and submits them to the acquiring bank for processing.
The submission to the bank is known as settlement, and it's done by sending a plain text file in a very rigid format. (This was all developed decades ago, and the number of systems now using it makes it hard to modernise.)
Each transaction appears in the file as a line of text, and part of that is the transaction value. This field should be 11 numeric characters (zero padded on the left hand side) and will always hold the value in lowest common denominator (in this case cents). 11 numeric characters caters well for values in any currency.
Looks like the payment processor in this case had made some changes to their submission software and erroneously replaced the zero padding with space padding. Quite how this got by a) service provider, b) acquiring bank and c) Visa without being picked up escapes me. The net value of that settlement file (13,000 high value transactions) would have been astronomical, and maybe that also was a contributing factor somewhere.
If you remove the trailing zero, this validates as a VISA card number. My guess is they swiped the card then manually entered the number, thinking the swipe had failed.
The ultimate mystery is still where 12 50 is coming from. They are the ASCII codes for Ctrl+R, P. Which happens to be the secret keystrokes you have to type to enter the validation code for QuickBooks.
Link: Where to enter Validation code
Quite a coincidence. I wonder what happens when you type these keys in the wrong place...
If you shift the 64-bit representation left by 8 bits (i.e., multiply by 256), you get a well-formed credit card number plus 3 empty positions for those 3 extra security digits (all zeroes for some reason). There is only a 1 in 10 chance that a random number gives a well-formed CC number.
5926 1069 5889 5232 000
If you decode the binary equivalent (1110101110110100) of the number 23148855308184500, you get K鑛, where 鑛 is the Chinese character for mining or ore. Kmine could mean "knowledge mine," or something like kmine Holdings Ltd. Perhaps there's a correlation between K(mine or ore) and Bank of America or Visa?

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which defeats the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is whether there are any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers, does anyone have any input? Other numbers are also welcome if you can give a reason why you think they're better than mine.
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self-organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. That means you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- this is what gives us our good statistical performance.
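As a minimal sketch of the frequency heuristic just described (a plain Python list standing in for the linked list, with "bubbling" done by swapping toward the front; the addresses are made up):
    class SelfOrganizingList:
        """Keeps (address, count) pairs; each selection bumps the count and
        bubbles the entry toward the front past entries with lower counts."""
        def __init__(self, addresses):
            self.items = [[addr, 0] for addr in addresses]
        def select(self, address):
            i = next(i for i, item in enumerate(self.items) if item[0] == address)
            self.items[i][1] += 1
            while i > 0 and self.items[i - 1][1] <= self.items[i][1]:
                self.items[i - 1], self.items[i] = self.items[i], self.items[i - 1]
                i -= 1
        def suggestions(self, prefix):
            return [addr for addr, _ in self.items if addr.startswith(prefix)]
    sol = SelfOrganizingList(["al@example.com", "bob@example.com", "ann@example.com"])
    for _ in range(3):
        sol.select("bob@example.com")
    sol.select("ann@example.com")
    print(sol.suggestions("a"))   # ann (selected most recently) now ranks ahead of al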
This kind of thing seems similar to what Firefox does when it suggests the site you are typing.
Unfortunately I don't know exactly how Firefox does it; a point system seems good as well, but maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (this could be changed) will have 0 points. You could sort those by total NoM sent (since they are on the contact list :). They would be shown after contacts with points > 0.
It's just an idea; the point, anyway, is to give different importance to the most frequently and the most recently mailed contacts.
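As a sketch, that scoring (with the proposed weights, and made-up contact data) is just:
    def score(sent_today, sent_last_week, sent_last_month):
        """NoM = number of mails; weights as proposed above."""
        return sent_today + 0.5 * sent_last_week / 7 + (1 / 3) * sent_last_month / 30
    # Hypothetical (today, last week, last month) counts per contact.
    contacts = {"alice@example.com": (2, 9, 20), "bob@example.com": (0, 1, 30)}
    ranked = sorted(contacts, key=lambda c: score(*contacts[c]), reverse=True)
    print(ranked)   # alice first: mailed recently and often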
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
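Here is a hedged sketch of one way to realize that idea in Python (not the paper's exact formulation): each past use of a contact contributes a value that decays with its age at a rate controlled by lambda, so lambda = 0 reduces to a plain use count (frequency) and a larger lambda makes only the most recent use matter (recency).
    def combined_score(use_times, now, lam):
        """use_times: timestamps (e.g. in days) of past selections of one contact.
        lam = 0 gives a plain use count (LFU-like); large lam favours recency (LRU-like)."""
        return sum(0.5 ** (lam * (now - t)) for t in use_times)
    now = 100
    history = {"old_but_frequent": [1, 2, 3, 4, 5, 6, 7, 8],
               "recent_but_rare":  [99]}
    for lam in (0.0, 0.5):
        ranking = sorted(history, key=lambda c: -combined_score(history[c], now, lam))
        print(lam, ranking)
With lam = 0 the frequently used contact ranks first; with lam = 0.5 the recently used one does.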
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo@bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.

What is the best format for a customer number, order number?

A large international company deploys a new web and MOTO (Mail Order and Telephone Order) handling system. Among other things you are tasked to design format for both order and customer identification numbers.
What would be the best format in your opinion? Please list any assumptions and considerations.
Accepted answer
Michael Haren's answer was selected because it had the most upvotes, but please do read the other answers and comments, as they make Michael's answer more complete.
Go with all numbers or all letters. If you must mix it up, then make sure there are no ambiguous characters (Il1m, O0, etc.).
When displayed/printed, put spaces in every 3-4 characters but make sure your systems can handle inputs without the spaces.
Edit:
Another thing to consider is having a built-in way to distinguish orders, customers, etc.: e.g. customer numbers always start with 10, order numbers always start with 20, vendor numbers always start with 30, etc.
DON'T encode ANY mutable customer/order information into the numbers! And you have to assume that everything is mutable!
Some of the above suggestions include a region code. Companies can move. Your own company might reorganize and change its own definition of regions. Customer/company names can change as well.
Customer/order information belongs in the customer/order record. Not in the ID. You can modify the customer/order record later. IDs are generally written in stone.
Even just encoding the date on which the number was generated into the ID might seem safe, but that assumes that the date is never wrong on the systems generating the numbers. Again, this belongs in the record. Otherwise it can never be corrected.
Will more than one system be generating these numbers? If so, you have the potential for duplication if you use only date-based and/or sequential numbers.
Without knowing much about the company, I'd start down this path:
A one-character code identifying the type of number. C for customers, R for orders (don't use "O" as it could be confused with zero), etc.
An identifier of the system that generated the number. The length of this identifier depends on how many of these systems there will be.
A sequence number, unique to the system generating it. Just a counter.
A random number, to prevent guessable order/customer numbers. Make this as long as your paranoia requires.
A simple checksum. Not for security, but for error checking.
Breaking this up into segments makes it more human-readable as others have pointed out.
CX5-0000758-82314-12 is a possible number generated by this approach. It consists of:
C: it's a customer number.
X5: the station that generated the number.
0000758: this is the 758th number generated by X5. We can generate 10 million before retiring this station ID or the station itself. Or don't pad with zeros and there's no limit.
82314: this was randomly generated and results in a 1/100,000 chance of guessing a customer ID.
12: checksum.
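A sketch of a generator for numbers of that shape in Python; the station code, the two-digit digit-sum checksum and the field widths are my own assumptions for illustration, since the answer deliberately leaves them open.
    import secrets
    from itertools import count
    def checksum(s):
        """Toy two-digit checksum for error catching only: digit sum mod 100."""
        return f"{sum(int(ch) for ch in s if ch.isdigit()) % 100:02d}"
    def make_id(kind, station, counter):
        seq = f"{next(counter):07d}"                # per-station sequence number
        rnd = f"{secrets.randbelow(100_000):05d}"   # 1-in-100,000 guessing odds
        body = f"{kind}{station}-{seq}-{rnd}"
        return f"{body}-{checksum(body)}"
    station_counter = count(1)
    print(make_id("C", "X5", station_counter))   # e.g. CX5-0000001-82314-12 style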
A primary advantage of using only numbers is that they can be entered much more efficiently using 10-key.
The length of that number should be as short as possible while still encompassing the entire entity space you expect to catalog, with room to spare. This can be tricky and should be given a bit of thought. A little set theory can give you the number of unique keys you will have access to, given a group of elements (with an alphabet of a symbols and n positions, you get a^n possible keys).
It is natural when speaking, to break numbers up into sets of two to four digits. By inserting dashes in some pattern, you can "force" the customer to repeat them in a more efficient and unambiguous manner.
For instance, 323-23-5344, which, of course, is social security number format, helps to inform the speaker where to pause when vocalizing the number. It also provides a visual delineation when writing the number and makes it easy to compare when copying the number.
I second the recommendation that the ordering system masks the input correctly so that no dashes need to be entered at any time. This should be carried through to printed forms to provide a clear expectation of what should be entered. For instance, a printed box for each digit separated by printed dashes.
I disagree that too much information should be embedded in this number especially if those attributes might change. For instance, say we give "323" the meaning of "is a nice customer" but then they call in four times with an attitude. Are we then going to change their customer key to "324", "is a jerk"? What if they are in region 04 and move their company to region 05?
If that happens, your options will be to update that primary key throughout the database or live with the ambiguity that the information embedded in that key is no longer reliable, thus rendering all of the information embedded in the keys of questionable utility.
It is better to store attributes that may change as separate fields in the database and have the customer number be a unique, unchanging key for that customer.
To build on Daniel and Michael's questions: it's even better if the separated numbers MEAN something else. For example, I worked for a company where account numbers were like this:
xxxx-xxxx-xxxxxxxx
The first set of numbers represented the region and the second set represented the market within that region. Once you got used to knowing which numbers were from where, it made it really easy to tell what area an account was in without even having to look at the customer's account.
There are several assumptions that I make when answering this question; some are based on the fact that it is a large international organization, and some are based on the fact that the format is for two separate table types.
Assumptions based on the fact that it's an international organization:
It is probable that each region will need to operate independently -- that is, region A must be able to add customer numbers independently from region B
Each region probably uses a different language so to make the identifiers easily type-able by users around the world, it is best to stick to numbers and spaces only.
Assumptions based on the fact that there are two tables for which this format will be used:
This format may be used by more than the two tables listed, so it should be able to handle an arbitrarily large number of tables.
Experienced users should be able to know what type of identifier they are looking at based on information encoded into the identifier itself.
It would be nice if identifiers were globally unique within the entire system.
Considerations:
For a global company, identifiers can be very long if only numerics are used. We should attempt to limit the amount of extraneous information encoded into the identifier as much as possible.
Identifiers should be self-verifiable to a limited extent; that is a program should be able to detect a large percent of invalid identifiers without looking anything up at all. This implies a checksum.
Proposed format:
SSSS0RR0TTC
The format proposed is as simple as possible, but no simpler:
C The first (rightmost) character will be a checksum of all other characters in the identifier. A simple checksum will do. This will eliminate 90% of all typing errors. If it is decided that this is not enough, then this can be expanded to 2 digits which will eliminate 99% of all typing errors.
TT The next N digits represent the table type number. No table type number can contain the digit zero.
The next digit is a zero. This zero separates the table type number from the region number.
RR The next N digits are the region number. No region numbers can contain a zero.
The next digit is a zero. This zero separates the region from the sequence number.
SSSS The next N digits are the sequence number. This number can contain zeros.
Each set of four numbers are separated by spaces when printed or typed in by convention. Internally they are not separated, but this helps the user transfer them correctly.
Examples
Assuming:
Customer table type=1
Order table table type=2
Region code for US-Alabama=1
Region code for CA-Alberta=43
Region code for Ethiopia=924
10 1013 - Customer #1 in Alabama (3 is the checksum: 1 + 1 + 1)
10 1024 - Order #1 in Alabama
9259 0304 3016 - customer # 925903 in Alberta, Canada
20 3043 4092 4023 - order number 2030434 in Ethiopia
Advantages of this approach:
90% of mistyped numbers will be caught
There are an unlimited number of table types
There are an unlimited number of regions
There are an unlimited number of sequential numbers for each table
Identifier numbers are globally unique to the system. This is important - a customer number cannot be mistaken for an order number and vice versa.
Each region can independently add sequence numbers without a global key
Disadvantages
Each identifier is at least six characters
Table type numbers and region numbers cannot contain a zero, because zero is used to separate the sequence number from the region number and the region number from the table type number.
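To make the format concrete, here is a small Python sketch that builds and checks identifiers in the SSSS0RR0TTC shape, assuming the checksum is simply the sum of all the other digits mod 10 (which is what makes the Alabama example above come out as 3):
    def build_id(sequence, region, table_type):
        """Neither region nor table_type may contain the digit 0 (it is the separator)."""
        body = f"{sequence}0{region}0{table_type}"
        check = sum(int(d) for d in body) % 10
        return f"{body}{check}"
    def is_valid(identifier):
        *body, check = identifier
        return sum(int(d) for d in body) % 10 == int(check)
    def pretty(identifier):
        """Group into blocks of four from the right, as typed/printed by convention."""
        rev = identifier[::-1]
        return " ".join(rev[i:i + 4] for i in range(0, len(rev), 4))[::-1]
    print(pretty(build_id(1, 1, 1)))          # 10 1013  (customer #1 in Alabama)
    print(pretty(build_id(2030434, 924, 2)))  # 20 3043 4092 4023  (order in Ethiopia)
    print(is_valid("101013"))                 # True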
Make the number as long as necessary, but not any longer. Every time I pay my water bill, I have to enter my 20-digit customer number, and an 18-digit invoice number. Thankfully, a dash in my customer number separates it into two parts.
Do not depend on leading zeros. Having to figure out how many zeros are in my invoice number is extremely annoying. Take 000000000051415432 for example. Their system won't recognize just 51415432.
Group digits together. If you absolutely have to use long numbers, four-digit chunks should work well.
I would never use user information in IDs. Suppose you use the first letters of the customer's last name followed by some number: e.g. Thomson could be customer THOM-0001.
Only, it appears you made a mistake, and the man's name is Tomson instead of Thomson. User data can be corrected; IDs should never be modifiable. So the next time you look up Tomson under TOMS-... you can't find him. The same goes for other data, like a customer type: it can always change, but the ID can't.
This is very basic RDBMS design.
Simply use counting numbers. For readability it's a good idea to insert separators such that you never have more than 4 successive digits: 9999-9999 is better than 999-99999. And don't make the number longer than necessary; people are much more annoyed at being reduced to a 20-digit number than at just being reduced to a number.
There's a catch, though. Especially if you have a small business, simple counters can give away more than you would like. Say I order something from you, and the order number is 090145. Next month I order again, and the order number is 090171. Er... 26 orders in a month? Likewise, I wouldn't feel comfortable becoming customer 0006 in a business that has been active for 10 years.
The solution is simple: skip numbers. Don't use random numbers, because you still want them to be in sequence.
I would have my order numbers follow this format:
ddmmyyyy-####-####
Where ####-#### resets to zero at the beginning of every day. This makes it very easy to correlate orders with the date they were placed.
For customer IDs, I would mix capital letters and numbers, but as Michael said, avoid commonly mistaken characters (0, o, L, 1, 5, s). This leaves you 30 characters to work with. If you use 20 characters, that will give you well over a 64-bit range of customer IDs -- pretty good for security. Make sure you use a secure random number generator when generating IDs. As for how you display the format, it should be the following:
####-####-####-####-####
As Michael said again, make sure your system can deal with dashes, spaces, no spaces, or no dashes. (It should just strip all those characters from the input before validation.)
I hope that helps!
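A sketch of generating such a customer ID in Python; the 30-character alphabet is my reading of the exclusions above, and the grouping/stripping helpers are just illustrations:
    import secrets
    # 26 letters + 10 digits minus the commonly mistaken 0, O, 1, 5, S and L = 30 chars.
    ALPHABET = "ABCDEFGHIJKMNPQRTUVWXYZ2346789"
    def customer_id(length=20, group=4):
        raw = "".join(secrets.choice(ALPHABET) for _ in range(length))
        return "-".join(raw[i:i + group] for i in range(0, length, group))
    def normalize(user_input):
        # Accept dashes, spaces or neither, as recommended above.
        return user_input.replace("-", "").replace(" ", "").upper()
    print(customer_id())   # e.g. 7KQW-2RNA-VH3C-9TXU-MJ4B (5 groups of 4)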
You may add a small checksum (using XOR, for instance) to enhance the error-checking of the given IDs.
If it's by mail, consider z-base-32 encoding. But here, with telephone orders, you may prefer decimal identification.
Assuming that the creation of orders/customers is not centralized, or will not always be centralized, use a GUID.
If the creation of orders/customers will always be centralized, an unsigned integer would be fine.
There is no compelling reason for the order number or customer number to "mean" anything, and it is likely that any segmented numbering scheme you invent will have to be overhauled down the road. Stick to something unique and meaningless.
EDIT: for MOTO, any multi-character alphabetical identifier will cause problems over the phone, so GUIDs are right out. Assuming multiple decentralized MOTO locations, assign each MOTO location a prefix (A, B, C, etc., or 01, 02, ...) and use an integer or big-integer for the customer and order IDs, e.g. 01-1 is the first order from MOTO location #1. Note that zero-padding is unnecessary, imposes an implicit digit limit to the numbers, and requires the customer to distinguish between six zeros and seven zeros when speaking the number. If you must use a padded fixed-length format, break the number up into groups of no more than 4 or 5 digits each.
ADDENDUM: the order number and the customer number do not have to be the primary keys of their respective tables, just unique indexed columns for lookup. You'll probably want to use something simpler/more efficient for the primary keys in the database.
We use leading zeroes for some of our references "numbers" where I work and I can't tell you how many wasted hours I've had over the last seven years forcing Excel to treat them as text. Don't do it.
Auto-incrementing integers are all well and good for computers, but they greatly reduce human beings' ability to spot errors. How important that is will depend on your business. I work with property (housing) related data, and our primary reference has the front-door number embedded in it. It's not elegant, but it means that experienced admin staff can spot 90% of minor errors (when we get invoices, etc. in) before they get near a database. But in an environment where you're not relying on that kind of process, this argument is less compelling.
(Now, some folks have strongly warned against using meaningful data in references as it could change, and there's some truth in that, but you can be smart. You don't have to pick something obviously fickle like whether the person is married - you can anchor yourself on past events, like a character representing the region in which they first opened a particular account. Even if you don't do that, have some kind of pattern to help communication with customers. I've worked in a number of call centres, and people sometimes come to the phone with every piece of documentation from their birth certificate onwards as they desperately try to find their account/order/customer number. I don't think saying "It'll be a number between 1 and 100 trillion" would be very helpful.)
It's been said, but don't create enormously long references. We're busy people; we haven't got time to be keying in this crap over a phone system and making a mistake on digit 17, only to have to restart (again). Some of your customers may have disabilities, and it's likely a growing number will be over 55. Once again, watch out for the zeroes. You see purchase order numbers and the like with fourteen digits. How many orders do they think they're going to be placing?
If there's going to be any data aggregation outside of your network (and thus not connected to your database), have some sort of check digit / regular expression pattern which your partners/suppliers can use to verify they've not made mistakes. The UK's electricity supply numbering system (MPAN) is a good example of this - designed so people can maintain their own records without having to download the big list of every electricity meter in the universe to check they've not made a typo.
I would use numbers only, since it is an international company. I would use spaces or dashes every 4-6 digits to separate them. I would also keep the formats distinct for quick identification.
Example:
000-00000-00000 - could be a customer number
00000-00000-00000-00000 - could be an order number
Stick to numbers (no chars or special stuff):
Can be easily input in an IVR flow
It's international - no language hassles
No confusion in chars vs. numbers - O vs. 0, I vs. 1
As long as leading 0 is meaningless, you can store/manipulate them more effectively
I would use a completely numeric system for both the order number and customer number; this will allow you to avoid issues with other languages.
Avoid leading zeros, as this can cause issues with data entry and validation.
The number of digits for each will be dependent on your expected volume. You will always have a greater number of Order Numbers than Customer Numbers. A six digit Customer number starting at 100000 will still give you 899,999 customers. Add an additional 3-4 digits for the order number, will give you 999 to 9,999 orders per customer (more if you consider one off customers).
There is no need to build any sort of identification into your numbering sequence. You have other database fields to identify where a customer is from, etc. Do not overly complicate your system.
KISS (keep it simple stackoverflow)
I would suggest using 16-digit identifiers that, when printed or shown to customers, are formatted as xxxx-xxxx-xxxx-xxxx, but stored as numbers without the dashes in your system.
The reason for using this format is that it makes it easier for people reading the number out over the phone, as they can do it in batches of 4 rather than trying to remember how much they have said already.
If you wish, the first 4 digits can be used to identify the type of number: 1000 for customers, 2000 for suppliers, 3000 for orders, 4000 for invoices, etc.
The second set can then be a year/month identifier if you wish to keep that sort of information encoded in the number itself, using a format of yymm, so 1000-0903-xxxx-xxxx would be a customer entered in March 2009.
This then leaves you with 8 digits for the actual data itself.
I would consider the use of letters in identifiers to be a very bad idea for any system that deals with telephones, as differences in accent and understanding are so varied that people are bound to get upset trying to get their identifier recognised by someone who cannot understand their accent properly.
An additional consideration on the format issue: in the code, create a separate class for OrderId and CustomerId. These classes are immutable and validate their input to ensure that they are acceptable IDs. Also, no value should be valid as both an order ID and a customer ID.
The simplest approach would just be to have the backing values for OrderIds be ints that start with 1, and CustomerIds be ints that start with 2, or something similar.
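A hedged sketch of that idea in Python (frozen dataclasses standing in for immutable value types, with the leading-digit rule suggested above; in a statically typed language you would get the same effect from distinct wrapper types):
    from dataclasses import dataclass
    @dataclass(frozen=True)
    class OrderId:
        value: int
        def __post_init__(self):
            if not str(self.value).startswith("1"):
                raise ValueError(f"not an order id: {self.value}")
    @dataclass(frozen=True)
    class CustomerId:
        value: int
        def __post_init__(self):
            if not str(self.value).startswith("2"):
                raise ValueError(f"not a customer id: {self.value}")
    order = OrderId(100042)
    customer = CustomerId(200017)
    # OrderId(200017) would raise, so an order id can never be passed off as a customer id.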
Wow - what a simple yet revealing question! And what a lot of contradictory answers. I think there are 3 obvious candidate answers here:
1) Use an autoincrementing long integer.
2) Use a GUID
3) Use a compound type that includes other information in the ID.
For simpler systems, and especially web-based systems where all users are hitting a central database, (1) works well. It has the advantage that numbers stay as short and simple as possible, but no shorter, and it avoids alphabetic characters (you would be amazed how different the names for the same letters are in different countries - one country's E is another country's I). It does not intrinsically differentiate the order ID from the customer ID, but you could always prepend or append a "C" or "O" to each and silently drop them on entry.
It also does not have a checksum or error check.
For distributed systems where many software components need to create the numbers on the fly, without reference to a master database, (2) is the only way to go. GUIDs have the advantage of being largely self-checking, since the address space is so large, but by the same token they are too long and alphanumeric to read comfortably over the telephone.
As for (3) - embedding region information or today's date into the number - those are the sorts of ideas that experienced developers train themselves out of. They look like a good idea at first, but always come back to haunt you. Consider the case where a customer moves to a new state, or an order is manually rekeyed a week after it was originally issued. These items of information belong in related tables where they can be edited independently of the ID, which should represent the entity's identity only.
To repeat: NEVER ENCODE BUSINESS DATA IN AN ID OR PRIMARY KEY - every time you do that you leave a time bomb for others to clean up one day.
Given that this is a centralised (phone-based) system, I would go with option (1) until a clear need arose to change. Simpler is usually better. Insert hyphens as others suggest, and prepend or append a checksum and/or an identifying letter if required.
First step: in an org sufficiently large to require such a system, there is an existing system that you're replacing. Continue the previous system's scheme, if possible. It makes a lot of things easier if you can access, even at a basic level, the data from the old system.
That said, there's often a good reason to change the scheme, particularly when it's coming from a legacy system. I find, though, that it's often helpful to formally rule out the old scheme before proceeding.
Second step: systems like this never exist in a vacuum. Is there already an organization-wide scheme for user and/or order IDs, such as in the accounting, inventory management, or CRM system? If so, consider adopting the existing schemes to make interoperability easier. Many large orgs have multiple ways to specify a single customer or order, and it just makes getting useful intelligence out of the data that much harder.
Third step: if the old system's scheme is too awful to continue and there's no other scheme to adopt, roll your own. In this case, look at the shortcomings of the original scheme, whatever they are, and correct them. The right answer will depend on the specific requirements of the application. The problem statement you've given us is too vague to speculate usefully on what the final form might look like.
I always stick with auto-increment numbers, and I always seed the sequence high enough so that they will all have a consistent number of digits - seems to be less confusing.
I also sometimes start order numbers at, say, 6 digits, beginning at 200,000, and customer numbers at 5 digits, beginning at 10,000, which would for example give me 90,000 unique customer numbers and 800,000 unique order numbers to use, and you can always tell just by looking at a number whether it is a customer number or an order number (i.e., if a customer rep is given a number over the phone, it is immediately obvious which is which).
I would not however build logic in the app that would depend on that, so even if it did roll over, the system wouldn't care.
The biggest issue here is to try not to overthink the problem.
Although I'm more experienced in e-commerce systems I think some of the points made in this post could be applied to mail order and telephone order systems.
For orders, an auto-incrementing integer works perfectly as the primary key in the database as well as the number that the customer will see on his/her invoice. There is absolutely no reason to create some overcomplicated algorithm for your numbers. If you want to tell which country/region they're from, use a separate field in your database. Also, if you are concerned about your competitors spying on you: let them! If your business revolves around spying on your competitors because you're not generating enough revenue, then most likely your business idea isn't good in the first place. And if you wanted to fool your competitor, you could just create your own script that auto-creates fake orders. If your e-commerce system is well designed, then this won't be an issue.
Key stuff using an auto-increment integer:
All numbers/digits => easier to communicate, no ambiguities over the phone, works for all languages/cultures that use 0-9 as their numerical system
No extra coding
Looks nice on the invoice and it's the shortest possible number of digits a customer would ever need to spell out over the phone
Works for small AND large businesses
It's scalable
Service-minded/customer-minded (what's best for the customer) (see bullets 1 and 3)
Simple
Whenever or whatever you're designing should always begin with what's best for the customer. At the end of the day they are the ones putting food on your table. A happy customer is a returning customer.
For me, my preferred approach is a combination of the date plus a counter for today's transactions. I was challenged to come up with an order number of only 5 digits. So with that, I came up with the following:
I get the current date, then
get the current counter for today's transactions and add 1.
I decided to use a base larger than decimal (base 10), so I use base 16 for counting. With that, the maximum 5-digit hexadecimal value (FFFFF) gives 1,048,575 counts. By involving the date, I can get up to 1,048,575 counts per day. To make that count unique every day, I mix in the date by taking the sum of the following:
The current year count, starting from the year of implementation (the first year counts as 1)
The current hour (max is 24)
The current day of the year (max is 365)
So with that, I will have at most a 3-digit prefix for my count, and the full count is that prefix followed by today's transaction counter. Example:
Current Date:
2014-12-31 01:22 PM
Implementation date: 2010
Running total for today's transaction: 100
Count: (5 + 13 + 365) = 383, followed by today's counter 101, giving 383101
Order Number: AD-5D87D (383101 in hexadecimal is 5D87D)
The AD there is just a custom order-number prefix. By the time I run out of order numbers, it will be about 1,000,000 years from my implementation date.
Anyway, this is not a good solution if you think your transactions per day can be as high as 1,000,000.
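For what it's worth, here is a small Python sketch of the scheme as I read it (the prefix, implementation year and example values are taken from the description above):
    from datetime import datetime
    PREFIX = "AD"               # arbitrary order-number prefix
    IMPLEMENTATION_YEAR = 2010  # year the system went live
    def order_number(now, todays_counter):
        """Prefix plus the hex of (year count + hour + day of year) concatenated
        with today's running transaction counter, as described above."""
        date_part = (now.year - IMPLEMENTATION_YEAR + 1) + now.hour + now.timetuple().tm_yday
        count = int(f"{date_part}{todays_counter + 1}")
        return f"{PREFIX}-{count:X}"
    print(order_number(datetime(2014, 12, 31, 13, 22), 100))   # AD-5D87D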
