Ruby UUID and uniqueness, for an ID


I need a 6 character alphanumeric ID for use in my rails app, which will be presented to users of the system and must be unique among all the object instances in my system. I don't expect more than a few thousand object instances, so 6 characters is far more than I really need.
At this point I'm using the UUIDTools gem in my Rails app to generate a uuid. Which of the UUIDTools generation methods should I use, and which end of the resulting uuid should I take the 6 characters from, to guarantee uniqueness?
For example, if I generate ef1cf087-95c9-4868-bd95-cea950a52b58, would I want to use ef1cf0 from the front of it, or a52b58 from the back end?
As a side question: am I going about this wrong? Is there a better way?

No way. A UUID is considered unique because it is very long, so it is practically impossible to generate the same UUID twice. If you trim it to 6 characters, you dramatically increase the possibility of a duplicate. You have to use either an incrementing ID or the full UUID.
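To put a rough number on that, the birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) estimates the chance of at least one duplicate among n random IDs drawn from N possible values. The sketch below assumes the 6 characters are uniformly random hex digits (as in a random, version 4 UUID); the method name is only for illustration.

```ruby
# Birthday approximation: probability of at least one collision when
# drawing n values uniformly at random from a space of size space_size.
def collision_probability(n, space_size)
  1.0 - Math.exp(-n * (n - 1) / (2.0 * space_size))
end

hex6 = 16**6 # 6 hex characters => 16,777,216 possible values
[1_000, 5_000].each do |n|
  puts format("%d ids: %.1f%% chance of a duplicate", n, 100 * collision_probability(n, hex6))
end
```

Under these assumptions the estimate is roughly 3% at a thousand objects and around one half at five thousand, which is why a truncated UUID needs an explicit uniqueness check.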
Only deterministic generation (id(x + 1) = id(x) + 1) can guarantee uniqueness. UUID doesn't guarantee it and 6 chars guarantee it even less.
Another option is to create an ID generation service. It would have a single method, getNewId, and keep enough state to provide unique IDs. (Simplest case: a counter.)
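A minimal sketch of the counter case, assuming a single process and an in-memory counter. The class name IdService and the Ruby-style method name get_new_id are hypothetical; a real app would persist the counter, for example by reusing the database's auto-increment primary key.

```ruby
# Hypothetical in-process ID service: a counter rendered as a fixed-width
# base-36 string. Uniqueness comes from the counter, not from randomness.
class IdService
  WIDTH = 6

  def initialize(start = 0)
    @counter = start
    @lock = Mutex.new
  end

  # Returns the next ID; raises once the 6-character space (36**6) is used up.
  def get_new_id
    @lock.synchronize do
      raise "ID space exhausted" if @counter >= 36**WIDTH
      id = @counter.to_s(36).rjust(WIDTH, "0")
      @counter += 1
      id
    end
  end
end

service = IdService.new
puts service.get_new_id
puts service.get_new_id
```

Six base-36 characters cover 36**6 (about 2.2 billion) values, far more than a few thousand objects need.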

When you say that incrementing the ID isn't an option, is that because you don't want users to see the scheme you're using, or because the generation must be stateless (i.e., you can't keep track of all IDs you've generated)?
If it's the former, then you can generate an ID, check to see if you've already used it, and if so, generate another new ID. (Seems pretty obvious so sorry if I'm on the wrong track.) You could do something like this:
while id = rand(2**256).to_s(36)[0..5]
  break unless Ids.exists?(id)
end
where Ids.exists?(id) is the does-this-already-exist method.

Related

Best way to architect unique identifier generation

I have an API in Rails that will handle a lot of requests, let's say millions a day. I want to assign a unique id to each of these requests. My scheme starts with uids of 3 letters/numbers and grows up to 9 letters/numbers, moving to the next length once the previous bracket is all taken.
One way I am doing it is to generate the uid in real time when the request comes in, so the app tries to find the first available uid and assigns it. But I have the impression that after a while this will hurt the performance of the app.
The second way I am considering is to generate all possible uids in advance, each with a [free/taken] flag, so that when a request comes in I assign the first free uid to it, which should be very fast if that field is indexed.
Any suggestions are much appreciated. Thank you

I would just generate a random string and keep regenerating it while it already exists. Before I get into that, let me just mention that SecureRandom.uuid is the best way to go. It generates random UUIDs, for which a collision is extremely improbable.
Anyway, here is a way to use your own custom random string generator that will only assign if it doesn't already exist:
def generate_random_uid
  begin
    uid = my_custom_random_string_method
    object.uid = uid
  end while ObjectModel.exists? uid: uid
  object.save
end
The begin ... end while block always runs its body once, setting uid to a random string from my_custom_random_string_method. The while condition then checks whether a record with that uid already exists. If one does, the body runs again, and this repeats until the condition returns false, meaning the uid is not yet taken. The object is then saved and the uid written to the db. Barring a concurrent request claiming the same uid between the check and the save, this ensures you only save the object with a uid that doesn't already exist in your db; a unique index on the column closes that gap.
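A variation on the same idea, sketched under assumptions: instead of checking first and saving afterwards, attempt the insert and retry when the store rejects a duplicate. UidStore and DuplicateUid are hypothetical in-memory stand-ins for a table with a unique index; in Rails the equivalent would be a unique index on uid plus rescuing ActiveRecord::RecordNotUnique.

```ruby
require "securerandom"
require "set"

# Hypothetical stand-in for a table with a unique index on uid.
class UidStore
  class DuplicateUid < StandardError; end

  def initialize
    @uids = Set.new
    @lock = Mutex.new
  end

  # Atomically records uid, or raises if it is already taken.
  def insert!(uid)
    @lock.synchronize do
      raise DuplicateUid, uid unless @uids.add?(uid)
    end
    uid
  end
end

# Generate-and-insert with retry: uniqueness is enforced by the store,
# so two concurrent callers cannot both claim the same uid.
def claim_uid(store, length: 8, max_attempts: 10)
  max_attempts.times do
    begin
      return store.insert!(SecureRandom.alphanumeric(length))
    rescue UidStore::DuplicateUid
      next # collision: try another random string
    end
  end
  raise "could not find a free uid after #{max_attempts} attempts"
end

store = UidStore.new
puts claim_uid(store)
```

The retry cap is a design choice: if collisions become frequent, the uid space is filling up and the length should grow rather than the loop spinning.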

How to create unique ID in format xx-123 on rails

is it possible to create some unique ID for articles on rails?
For example, first article will get ID - aa-001,
second - aa-002
...
article #999 - aa-999,
article #1000 - ab-001 and so on?
Thanks in advance for your help!

The following method gives the next id in the sequence, given the one before:
def next_id(id, limit = 3, separator = '-')
  # When the numeric part is all nines, bump the letter prefix and restart at 1
  if id[/[0-9]+\z/] == ?9 * limit
    "#{id[/\A[a-z]+/i].next}#{separator}#{?0 * (limit - 1)}1"
  else
    id.next
  end
end
> next_id("aa-009")
=> "aa-010"
> next_id("aa-999")
=> "ab-001"
The limit parameter specifies the number of digits. You can use as many prefix characters as you want.
Which means you could use it like this in your application:
> Post.last.special_id
=> "bc-999"
next_id(Post.last.special_id)
=> "bd-001"
However, I'm not sure I'd advise you to do it like this. Databases have smart mechanisms to avoid race conditions when entries are created concurrently. In Postgres, for example, sequences are safe under concurrency but do not guarantee gapless ids.
This approach has no such mechanism, which could lead to race conditions. However, if concurrent writes are extremely unlikely, such as when you are the only one writing articles, you could do it anyway. I'm not exactly sure what you want to use this for, but you might want to look into to_param.
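One way to sidestep the race condition is to let the database assign its usual integer id and derive the display code from it. This is a sketch of a different technique from next_id, and the method name article_code and its keyword parameters are hypothetical.

```ruby
# Maps a positive integer (e.g. an auto-increment primary key) to a code
# such as "aa-001". Each letter prefix covers the numbers 001..999.
def article_code(n, digits: 3, letters: 2)
  per_prefix = 10**digits - 1 # 999 codes per prefix (000 is unused)
  block, offset = (n - 1).divmod(per_prefix)
  raise ArgumentError, "id out of range" if n < 1 || block >= 26**letters

  # Render the block number in base 26 using a..z, left-padded with "a".
  prefix = Array.new(letters) { |i| ("a".ord + (block / 26**(letters - 1 - i)) % 26).chr }.join
  format("%s-%0#{digits}d", prefix, offset + 1)
end

puts article_code(1)    # aa-001
puts article_code(999)  # aa-999
puts article_code(1000) # ab-001
```

Because the code is a pure function of the id, it does not need to be stored, and gaps in the id sequence simply become gaps in the codes.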

You may want to look into the FriendlyId gem. There’s also a Railscast on this topic which covers a manual approach as well as the usage of FriendlyId.

Create Random Integer Based on Id in Ruby

I have a scenario where I need to generate 4 digit confirmation codes for individual orders. I don't want to just do random codes due to the off chance that two exact codes would be generated near the same time. Is there a way to use the id of each order and generate a 4 digit code from that? I know I am going to eventually have repetitive codes with this but it will be ok because they will not be generated around the same time.

Do you really need to base the code on the ID? Four digits only gives you ten thousand possible values so you could generate them all with a script and toss them in a database table. Then just pull a random one out of the database when you need it and put it back in when you're done with it.
Your code table would look like this:
code: The code
uuid: A UUID, a NULL value here indicates that this code is free.
Then, to grab a code, first generate a UUID, uuid, and do this:
update code_table
set uuid = ?
where code = (
    select code
    from code_table
    where uuid is null
    order by random()
    limit 1
)
-- Depending on how your database handles transactions
-- you might want to add "and uuid is null" to the outer
-- WHERE clause and loop until it works
(where ? would be your uuid) to reserve the code in a safe manner and then this:
select code
from code_table
where uuid = ?
(where ? is again your uuid) to pull the code out of the database.
Later on, someone will use the code for something and then you just:
update code_table
set uuid = null
where code = ?
(where code is the code) to release the code back into the pool.
You only have ten thousand possible codes, that's pretty small for a database even if you are using order by random().
A nice advantage of this approach is that you can easily see how many codes are free; this lets you automatically check the code pool every day/week/month/... and complain if the number of free codes falls below, say, 20% of the entire code space.
You have to track the in-use codes anyway if you want to avoid duplicates so why not manage it all in one place?
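The reserve/release cycle above can be sketched in Ruby with an in-memory hash standing in for code_table. The class CodePool and its method names are hypothetical, and a real deployment would keep the pool in the database as described.

```ruby
require "securerandom"

# In-memory stand-in for the code table: maps each 4-digit code to the
# token (UUID) that reserved it, or nil when the code is free.
class CodePool
  def initialize
    @owners = (0..9999).to_h { |n| [format("%04d", n), nil] }
    @lock = Mutex.new
  end

  # Reserves a random free code for the given token; nil if none are left.
  def reserve(token)
    @lock.synchronize do
      code = @owners.select { |_, owner| owner.nil? }.keys.sample
      @owners[code] = token if code
      code
    end
  end

  # Releases a code back into the pool (unknown codes are ignored).
  def release(code)
    @lock.synchronize { @owners[code] = nil if @owners.key?(code) }
  end

  # Number of codes currently free, useful for the periodic pool check.
  def free_count
    @lock.synchronize { @owners.count { |_, owner| owner.nil? } }
  end
end

pool = CodePool.new
code = pool.reserve(SecureRandom.uuid)
puts code
pool.release(code)
```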

If your order id has more than 4 digits, it is theoretically impossible without checking the generated value against an array of already generated values. You can do something like this:
# Mutex is part of Ruby core, so no require is needed
$confirmation_code_mutex = Mutex.new
$confirmation_codes_in_use = []
def generate_confirmation_code
  $confirmation_code_mutex.synchronize do
    # rand(9000) + 1000 covers every 4-digit code from 1000 to 9999
    nil while $confirmation_codes_in_use.include?(code = rand(9000) + 1000)
    $confirmation_codes_in_use << code
    return code
  end
end
Remember to clean up $confirmation_codes_in_use after using the code.

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using the UUID gem for Ruby, but I'm not sure if it's possible to limit the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] => b9386070
But I would like something even shorter, while knowing that it will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link

You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36-character string (32 hexadecimal digits and 4 hyphens). You cannot chop off the first 8 characters and expect the result to be unique.
Bitly, tinyurl et al. store links and generate a short code to represent each link. They do not reconstruct the URL from the code; they look it up in a data store and return the corresponding URL. These codes are not UUIDs.
Without knowing your application it is hard to advise on what method you should use. However, you could store whatever you are pointing at in a data store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o', 'i', 'l', etc.
EDIT
On further investigation, there is a Ruby base32 gem available that implements Douglas Crockford's Base 32 encoding.
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
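As a rough illustration of rebasing a numeric key, here is a hand-rolled sketch using Crockford's alphabet. It is not the gem's API: the method names are hypothetical, it emits uppercase rather than the lowercase letters mentioned above, and it skips Crockford's lenient decoding rules (such as reading 'O' as 0) and check symbols.

```ruby
# Crockford's Base32 alphabet: digits plus letters, omitting I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ".freeze

# Encodes a non-negative integer (e.g. a database key) as a short code.
def base32_encode(n)
  raise ArgumentError, "expected a non-negative integer" unless n.is_a?(Integer) && n >= 0
  return "0" if n.zero?

  out = +""
  while n > 0
    n, digit = n.divmod(32)
    out.prepend(CROCKFORD[digit])
  end
  out
end

# Decodes a code back to the integer key (case-insensitive).
def base32_decode(str)
  str.upcase.each_char.reduce(0) do |acc, ch|
    digit = CROCKFORD.index(ch) or raise ArgumentError, "invalid character #{ch.inspect}"
    acc * 32 + digit
  end
end

puts base32_encode(32**5 - 1) # largest key that fits in 5 characters
puts 32**5                    # number of keys a 5-character code can hold
```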

If you are working with numbers, you can use the built-in Ruby methods:
6175601989.to_s(30)
=> "8e45ttj"
To go back:
"8e45ttj".to_i(30)
=> 6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.

I found this to be short and reliable:
def create_uuid(prefix=nil)
  time = (Time.now.to_f * 10_000_000).to_i
  jitter = rand(10_000_000)
  key = "#{jitter}#{time}".to_i.to_s(36)
  [prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid if you need collisions to be vanishingly unlikely), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000

The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.

Creating an id from name and address data. Hash/Digest

My problem:
I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and be represented in a smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important - doesn't need to be one-way) and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?
In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?
>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734
Application Context (for those that are interested):
This will be used for a record linkage app. Given an input name and address we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match I need to generate this pseudo/generated/derived id from the search input (entity's name and address data). Every search record should result in an output record with either a real id (the actual database id resulting from a match/link) or this generated pseudo id. The pseudo id will be prefixed with a character (e.g. N) to differentiate it from a real id.

I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well-studied hashing algorithms, their length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones are generally less collision-prone than something you could come up with yourself.

Use a cryptographic hash for its collision resistance, not its other qualities
Use as many bytes from the hash as you want (truncate)
convert to alpha-numeric characters
You can also truncate the alpha-numeric string instead of the hash
An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.
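import base64, hashlib, re

N_HASH_CHARS = 11

def digest(name, address):
    # Hash the joined fields, base64-encode, then keep only alphanumerics
    raw = hashlib.md5((name + "|" + address).encode("utf-8")).digest()
    b64 = base64.b64encode(raw).decode("ascii")
    alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", b64)
    return alnum_hash[:N_HASH_CHARS]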
How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log2(62)). 11 characters give you 65.5 bits of entropy, so by the birthday bound a collision only becomes likely once you have around 2**32.7 users (about 7 billion).

A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.

I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:
Create a SHA1 hash from the data, and store it in a table with a serial number field.
Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.
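A minimal sketch of that lookup, with an in-memory hash standing in for the table. The class SerialRegistry and the method id_for are hypothetical names, and joining fields with "|" is an assumption about how the data is serialized before hashing.

```ruby
require "digest"

# In-memory stand-in for a table with columns (sha1, serial).
class SerialRegistry
  def initialize
    @serial_by_hash = {}
  end

  # Returns the serial for the given fields, assigning the next one if unseen.
  def id_for(*fields)
    key = Digest::SHA1.hexdigest(fields.join("|"))
    @serial_by_hash[key] ||= @serial_by_hash.size + 1
  end
end

registry = SerialRegistry.new
puts registry.id_for("Oscar", "Grouch", "1", "Sesame", "12345")
puts registry.id_for("Oscar", "Grouch", "1", "Sesame", "12345") # same serial again
```

The serial stays short however long the hash is; the hash only serves as the lookup key.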

I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?
So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).
If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.
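To illustrate the canonical-form step, here is a deliberately simple sketch. The helper names are hypothetical, and the normalization (trim, collapse whitespace, lowercase) is an assumption that covers only trivial differences; real address canonicalization needs much more, such as handling abbreviations and unit numbers.

```ruby
require "digest"

# Hypothetical normalization: trims, collapses internal whitespace and
# lowercases each field before joining them.
def canonical(fields)
  fields.map { |f| f.to_s.strip.gsub(/\s+/, " ").downcase }.join("|")
end

# Hashes the canonical form so trivially different inputs get the same digest.
def record_digest(*fields)
  Digest::SHA1.hexdigest(canonical(fields))
end

a = record_digest("Oscar", "Grouch", "1", "Sesame", "12345")
b = record_digest(" oscar ", "GROUCH", "1", "Sesame", "12345")
puts a == b
```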
EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.
Is that feasible in your situation?

Well, if there's more than one person at the same address with the same name, you're toast here (without adding code to detect this and add a discriminator of some kind).
But assuming that is not an issue, the street address and zip code portion of the full address is sufficient to guarantee uniqueness there, so adding enough data from the name should take care of the issue...
Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address or person encountered, and then use the autogenerated alpha-numeric key...
You could use AAAAA01 for first person at first address,
AAAAA02 for second person at first address,
AAAAB07 for the seventh resident at the second address, etc.
If you don't have any way to generate and maintain these entity-key mappings, then you need to use the full street address/zip and full name, or a hash value of the same, although the hash value approach has a small chance of generating duplicates...
