Is there a way to generate a short random id, avoiding collisions, without hitting persistent storage? - performance

If you've used GoToMeeting, that's the type of ID I want. I'd like it to be random so that it obfuscates the number of items being tracked and short, so that it's easy to reference manually; UUIDs are way too long. I'd like to avoid hitting persistent storage merely for performance reasons, but I can't think of any other way to avoid collisions. Is 9 digits enough to do something time-based?
In response to questions:
I'm building a ticket-tracking application. This ID would be used as the primary key for a table, but it would be needed before the record is persisted which would result in an extra database call that I'd like to avoid if possible.
I'd like to keep it at a 9 digit int. I consider a UUID to be too long because people are going to have to reference the ID manually (via email, phone, etc.).
I'm thinking of using the time of generation somehow. Since time is always ticking on forward, it would continually limit the set of potential IDs, excluding those that had already been generated.

One way is to take a unique number or string (like a random UUID) then calculate a fixed-length digest (such as MD5 or SHA-1) and/or encode it in a higher base (like base64) to shorten it further.
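A minimal sketch of that idea in Python; the 9-character length and the base32 alphabet are my assumptions, and truncating the digest reintroduces a small collision risk (roughly 1 in 2^45 per pair at 9 base32 characters):

import base64, hashlib, uuid

# Start from a random UUID, hash it to a fixed-length digest, then re-encode
# the digest in a higher base and truncate to something short enough to read aloud.
raw = uuid.uuid4().bytes
digest = hashlib.sha1(raw).digest()
short_id = base64.b32encode(digest).decode("ascii").rstrip("=")[:9]
print(short_id)  # e.g. 'Q7K2M4ZXB' - uppercase letters and digits 2-7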

Git does something similar: it generates a SHA-1 hash for each commit (and other objects), and the user can then reference that hash manually to look up the commit. The trick is that the user doesn't have to enter the whole string to find the correct object; they only have to enter a prefix long enough that it doesn't collide with any other object currently in the repository. In general this only requires 5 or so hex digits for relatively large repositories.
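A rough illustration of that abbreviation trick, assuming you keep the set of full IDs you have already issued (the function and variable names here are made up for the example):

import hashlib

def shortest_unique_prefix(full_id, existing_ids, minimum=5):
    # Lengthen the abbreviation until it no longer collides with any other
    # known ID, which is essentially what git does for commit hashes.
    for length in range(minimum, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(other != full_id and other.startswith(prefix) for other in existing_ids):
            return prefix
    return full_id

ticket_id = hashlib.sha1(b"ticket title + description + timestamp").hexdigest()
print(shortest_unique_prefix(ticket_id, {ticket_id}))  # 5 hex digits until the set grows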

Related

What are alternatives for GUIDs for key generation when central server is not possible?

I am looking for an alternative to GUIDs for key generation in a distributed app. For example, suppose I have Bob, James, and Jack all running a bug-tracking application on their desktops, where they can do things like create bug tickets à la JIRA, Bugzilla, etc. When a ticket is created it is assigned a number such as T-1, T-2, T-3, T-4, and so on. Tickets need to have a stable ID and should be creatable without having to consult a central server.
I understand that this is what GUIDs are really good for, but in my case displaying a GUID in a UI is ugly: people can't just copy and paste it or discuss it on a phone call. I really want integers or some sort of short string that is easy to talk about and read in one glance.
Is there a way to use the bitcoin block chain as some sort of counter?
You may want to evaluate the approach taken by git. It uses a SHA-1 hash of the commit information, and then allows abbreviated IDs, which are much shorter and easier to read or transfer manually.
Given that the number of bugs in your tracker is not going to reach millions, that should be sufficient. Once it does, you'll just need a longer abbreviation.
There seems to be plenty of information around on how git calculates hash IDs and abbreviates them.
If I recall correctly how UUIDv1 works, it's "just" a combination of the MAC address, a very precise timestamp, and possibly an additional counter. Since your MAC address should be unique (unless you've fiddled with it) and there are only so many UUIDs one computer can generate within a nanosecond, the resulting ID will be unique.
This is a very general approach that knows nothing about your use case. If you implemented a version of it yourself, tailored to your specific use case, you could get much smaller IDs.
Assuming you can identify each node running the bug-tracking system with a simple and unique string - for instance "Bob", "James", "Jack" - and you can create unique, continuous integers within each node, you could combine the two and have IDs like "Bob-1", "James-12", ...
As you can see, there still has to be one central point that assigns the unique strings; however, depending on the number of nodes and how long they stay in the system, this could just as well be done by a human being.
An additional disadvantage (or advantage, depending on how you look at it) of this approach, as well as of UUIDv1, is that you'd know where each ticket was created, as well as the order of the tickets within one node.
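A sketch of that node-plus-counter scheme; the class and method names are illustrative, and in practice each node would have to persist its counter between runs:

import itertools

class NodeIdGenerator:
    # Each node gets a unique name assigned once (by a human or a small registry)
    # and numbers its own tickets with a local counter, so no central server
    # is consulted per ticket.
    def __init__(self, node_name):
        self.node_name = node_name
        self._counter = itertools.count(1)   # in practice, persisted locally

    def next_id(self):
        return f"{self.node_name}-{next(self._counter)}"

bob = NodeIdGenerator("Bob")
print(bob.next_id(), bob.next_id())  # Bob-1 Bob-2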

Techniques for data anonymization

I'm looking for a good way to anonymize data in my database while retaining the capability of aggregating / summarizing statistical information.
As an example, let's say I want to track clicks by IP address per hour but I don't actually want to store the IP address.
My first thought is to store only a hash (e.g. SHA-256) of the IP. However, I'm not sure this provides sufficient security. If an attacker got hold of our database and was determined to reverse our anonymization, they could generate a rainbow table of IPs and get back the real IP info fairly easily.
My next thought was to add a static prefix to the IP before hashing (e.g. 192.168.1.10 becomes MY_SECRET_STRING-192.168.1.10). Of course, if the attacker finds the static prefix then it is essentially useless.
I've been searching for sound solutions to this problem and I haven't found anything I really like so far. Are there any well known methods for anonymizing data like this?
If someone has access to your salt and database, I would say it's almost impossible (if not impossible) to keep them from building some sort of lookup table and "cracking" your hashes. The only option you have is to make their job hard/expensive.
Using a static salt is a bad idea, though, since the whole point of a salt is to prevent an attacker from generating a single rainbow table for all your records. Uniqueness is what makes a salt a good salt: its purpose is to make each hash unique regardless of whether the original content is the same as another record's (thus forcing an attacker to brute-force each row to figure out its content).
Also worth noting: salts don't need to be secret, so you can just store each salt in an additional column.
There are good articles about salting and hashing if you have any doubts about the topic.
The problem with the described approach is that, in the end, just like an attacker, you won't be able to tell which rows correspond to the same IP.
One potential solution I can see, if you really need to implement this, is having a table where you store the IPs plus click counts, and then every hour running a process that anonymizes the data by simply replacing all the IPs/hashes from the last hour with good random values. In the end this means you will only be able to group clicks per hour without knowing the actual IP, but please note two things:
Although an attacker will never be able to figure out past data, you will have one hour's worth of data that is not anonymized at any given time. This means an attacker could "spy" on you and store this information over time, which could become a much bigger problem than "we just leaked one hour's worth of data".
You won't be able to tell the same IP apart between hours. For example: if IP 127.0.0.1 did 3 clicks from 17:00 to 18:00 and the same IP did 6 clicks from 18:00 to 19:00, you wouldn't be able to tell that 127.0.0.1 did 9 clicks from 17:00 to 19:00.
Also, to make the hourly non-anonymized IPs a bit harder to crack, you could have a function that takes an IP, generates a unique salt, and caches that salt for that IP until the next hour, meaning that each IP would have its own unique salt every hour. This way the attacker would have to calculate a new rainbow table for each row every hour, and you could still figure out which IP row to increment or create.
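A minimal sketch of that per-IP, per-hour salt idea; the in-memory cache and the salt size are assumptions, and the cache itself has to be wiped when the hour rolls over, since it holds the link between IP and hash:

import hashlib, os, time

_salt_cache = {}   # {(ip, hour): salt} - must be discarded at the end of each hour

def hourly_ip_hash(ip):
    # Each IP gets its own random salt for the current hour, so repeat clicks
    # within the hour map to the same row, while an attacker has to
    # brute-force every row separately.
    hour = int(time.time() // 3600)
    salt = _salt_cache.setdefault((ip, hour), os.urandom(16))
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()

bucket = hourly_ip_hash("127.0.0.1")   # stable within the hour, unlinkable afterwards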
Why, yes, there are. The most well-known is called "salting". Basically, instead of adding a static string to all of the plain texts, you add a unique string to each one. This string is randomly or algorithmically generated and stored separately. It doesn't make a single hash any harder to crack, but it prevents the use of precomputed tables to crack multiple hashes. See the Wikipedia article on Salt (cryptography).
That being said, I think that a one-way hash of the IP is sufficient. An attacker would have to crack each IP address. No matter what method you use, once an IP is cracked then all of the records for that IP will be exposed. But cracking one IP doesn't help with any of the others.

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There are too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
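A sketch of deriving that resource name, assuming the query parameters (and optionally the client identity) are what makes a result set distinct; the endpoint shape in the comment is illustrative:

import hashlib, json

def result_set_name(params, client_id=""):
    # Canonicalize the query so the same query always maps to the same
    # resource, preserving idempotency of the initial GET.
    canonical = json.dumps(params, sort_keys=True) + client_id
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:16]

name = result_set_name({"hint": "popular", "count": 50}, client_id="client-42")
# The client then pages through a fixed ordering of that result set, e.g.:
# GET /api/article/sets/<name>?page=2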

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all the Associations that all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{association_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
You wrote the following at the end of your question:
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
I think that's your main problem. If you don't solve the main problem, you may be able to reduce the size of the transferred data by a factor of 10, for example, but you still won't have solved the main problem. Let's think about the question: why does so much data need to be sent to the client (the web browser)?
The data on the client side is needed to display some information to the user. No monitor is large enough to show 30,000 items on one page, and no user is able to grasp that much information. So I am sure that you display only a small part of the information; in that case you should send only the small part that you actually display.
You don't describe how the GUIDs will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing; in that case you only need to transfer the data for one association.
If you need to display the GUIDs directly, then you can't display all the information at once anyway, so you can send the information for one page only. When the user starts to scroll or clicks a "next page" button, you can send the next portion of data. In this way you can dramatically reduce the size of the transferred data.
If you have no possibility to redesign that part of the application, you can implement your original suggestion: by replacing a GUID like "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of the GUID from 34 characters to 3. If you additionally send an array of "GUID mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original data size from 30000*34 = 1,020,000 bytes (about 1 MB) to 300*39 + 30000*3 = 11,700 + 90,000 = 101,700 bytes (about 100 KB). So you can reduce the size of the data by roughly a factor of 10. Enabling compression of dynamic data on the web server can reduce the size further.
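A sketch of that substitution, assuming the payload is a list of association objects shaped like the one in the question (field names follow the question's example):

def index_guids(associations):
    # Build one lookup table of unique employee GUIDs and replace each GUID
    # in the association objects with its (much shorter) integer index.
    lookup, indexed = {}, []
    for assoc in associations:
        short_ids = [lookup.setdefault(guid, len(lookup) + 1) for guid in assoc["employees"]]
        indexed.append({"association_id": assoc["association_id"], "employees": short_ids})
    # Invert the table for the client: {1: guid1, 2: guid2, ...}
    mapping = {index: guid for guid, index in lookup.items()}
    return mapping, indexed

associations = [{"association_id": "A-GUID", "employees": ["E1-GUID", "E2-GUID"]},
                {"association_id": "B-GUID", "employees": ["E2-GUID"]}]
mapping, indexed = index_guids(associations)   # send both; the client resolves indexes locally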
In any case, you should examine why your page is so slow. If the program runs on a LAN, then transferring even 1 MB of data can be quick enough. Probably the page is slow while placing the data on the web page. I mean the following: if you modify some element on the page, the positions of all existing elements have to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page at once, you can improve the performance dramatically. You didn't post which technology you use in your web application, so I won't include any examples; if you use jQuery, for example, I could give an example that makes clearer what I mean.
The lookup index you propose is nothing other than a "custom" compression scheme. As amdmax stated, this will improve your performance if you have a lot of repeated GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg correctly states that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as well as possible (depending on the algorithm) until decompression.
To be sure, just generate the data uncompressed and compressed and compare the results. You can count the duplicate GUIDs to estimate how big your data block would be with the dictionary compression method. But I guess gzip will do better, because it can also compress syntactic elements like braces, colons, etc. inside your data object.
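A quick way to run that comparison, assuming the payload is the list of association objects from the question (this only measures sizes, not CPU cost):

import gzip, json
from collections import Counter

def size_report(associations):
    # Serialize the payload as-is, gzip it, and count duplicate GUIDs to get
    # a feel for what a hand-rolled dictionary scheme could save at best.
    raw = json.dumps(associations).encode("utf-8")
    guid_counts = Counter(g for a in associations for g in a["employees"])
    duplicates = sum(count - 1 for count in guid_counts.values())
    print(f"raw: {len(raw)} bytes, gzipped: {len(gzip.compress(raw))} bytes, "
          f"duplicate GUIDs: {duplicates}")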
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
Instead of GUIDs, which are 16 bytes long, you will get ints, which are 4 bytes long, and you will get a dictionary full of key-value pairs that associates each GUID with some int value, right?
It will decrease your transfer time when many objects share the same ID, but it will cost CPU time before the transfer to compress and after the transfer to decompress. So what amount of data do you transfer? Is it MB / GB / TB? And is there any good reason to compress it before sending?
I do not know how dynamic your data is, but I would:
on a first call, send two directories/dictionaries mapping short IDs to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2, ...} and {1: EmpGUID1, 2: EmpGUID2, ...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs.
on subsequent calls, just send the index of Employees per Association, { 1: [2,4,5], 3: [2,4], ... }, the key being the association's short ID and the values in the array being the short IDs of its employees. Given your description, building the reverse index (Employee to Associations) may give a better result size-wise (but with more processing).
Then it's all down to associative-array manipulation, which is straightforward in JS.
Again, if your data is (very) dynamic on the server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?

Algorithm for unique CD-KEY generation with validation

I am trying to create a unique CD-KEY to put in our product's box, just like a normal CD-KEY found in standard software boxes that users use to register the product.
However, we are not selling software; we are selling DNA collection kits for criminal and medical purposes. Users will receive a saliva collection kit by mail with the CD-KEY on it, and they will use that CD-KEY to create an account on our website and get their results. The results from the test will be linked to the CD-KEY. This is the only way we will have to link the results to the patients. It is therefore important that it does not fail :)
One of the requirements is that the list of CD-KEYs must be sufficiently "spread apart" so that there is no possibility of someone entering an incorrect CD-KEY and still having it approved for someone else's kit, thereby mixing up two kits. That could cost us thousands of dollars in liability.
For example, it cannot be an incremental sequence of numbers such as
00001
00002
00003
...
The reason is that if someone receives kit 00002 but registers it as 000003 by accident, then his results will be matched to someone else's. So it must be like credit card numbers: unless a valid sequence is entered, your chances of randomly hitting a valid number are 1 in a million...
Also, we are selling over 50,000 kits annually to various providers (who will generate their own CD-KEYs using our algorithm), so we cannot maintain a list of all previously issued CD-KEYs to check for duplicates. The algorithm must generate unique CD-KEYs.
We also require the ability to verify that a CD-KEY is valid using a quick check algorithm, so that we can inform the user if the code he enters is invalid. This rules out many hashing or MD5-based schemes, I believe. And it cannot be 128 bits, because who would take the time to type that out?
So far this is what I was thinking the final CD-KEY structure would look like
(4 char product code) - (4 char reseller code) - (12 char unique, verifiable CD-KEY)
Ex. 384A - GTLD - {4565 - FR54 - EDF3}
To ensure the uniqueness of the keys, I could include the current date (20090521) as part of the source. We won't generate unique keys more than once a week, so this value changes often enough for the purpose of a unique initial value.
What possible algorithm can I use to generate the unique keys?
Create the strings <providername>000001, <providername>000002, etc. or whatever and encrypt them with a public key, and that's your "CD-KEY" that the user enters. Decrypt the CD-KEY with the private key and validate that when decrypted you get a valid string with a valid provider name.
Credit card numbers use the Luhn algorithm; you might want to look at something similar to that.
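For reference, the Luhn check-digit calculation itself is short; this is the standard algorithm, not something specific to the poster's product:

def luhn_check_digit(partial_number):
    # Double every second digit counting from the right, subtract 9 from any
    # result over 9, then pick the digit that makes the total a multiple of 10.
    total = 0
    for position, char in enumerate(reversed(partial_number)):
        digit = int(char)
        if position % 2 == 0:   # these positions get doubled once the check digit is appended
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10

print(luhn_check_digit("7992739871"))  # 3, giving the valid number 79927398713

Note that a check digit only catches typing mistakes; the "1 in a million" spacing between valid keys still has to come from how the rest of the key is generated.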
I use SeriousBit Ellipter for software protection, but I don't see any reason you couldn't generate a group of unique keys each week and use the library to verify a key's validity when it is entered on your web site. You can also encode optional services into the key, allowing you to control how the sample is processed (if you have different service levels).
As it uses an encrypted method of key generation in the first place and it's relatively cheap, it's certainly worth a look, I would say.
I finally settled on a CD-KEY of this form:
<TIMESTAMP>-<incremented number>-<8 char MD5 hash>-<checksumdigit>
I used the mod-11 ISBN check digit algorithm.
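A sketch of that final format; exactly which fields feed the MD5 hash and the check digit, and the counter width, are my assumptions, since the post doesn't spell them out:

import hashlib, time

def mod11_check_digit(digits):
    # ISBN-10 style mod-11 check digit over a string of decimal digits;
    # 'X' stands in for a remainder of 10.
    total = sum((index + 1) * int(d) for index, d in enumerate(digits))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def make_key(counter):
    timestamp = time.strftime("%Y%m%d")
    serial = f"{counter:06d}"
    digest = hashlib.md5(f"{timestamp}-{serial}".encode("ascii")).hexdigest()[:8].upper()
    return f"{timestamp}-{serial}-{digest}-{mod11_check_digit(timestamp + serial)}"

print(make_key(1))   # a key shaped like <date>-<counter>-<8 hex chars>-<check digit>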
Generate a GUID and concatenate a random number to it. The GUID is guaranteed to be unique, and the random number will make it improbable to hit a code accidentally. Just don't modify the GUID in any way or you might compromise the uniqueness.
http://msdn.microsoft.com/en-us/library/aa475087.aspx
