Encode array to fixed length string - algorithm

I'm trying to implement a similar function to PCPartPicker's list permalink function.
https://au.pcpartpicker.com/list/
Basically, it generates a permalink based on the items in the list. The key part is to generate a string that is:
unique
persistent
fixed length
I'm thinking about encoding an array that contains the product IDs, but I can't find the right way to implement it.
Base64 and similar schemes (like the Hashids library) can ensure the string is unique and persistent, but it ends up quite long when the array has many items.
Is there another way to encode the array, or another direction I could take to implement this function?
Thank you in advance.

You can't generate a unique fixed-length string that contains all the information of an arbitrary-length list: there are only finitely many strings of a given length, so beyond some list length there are more possible lists than strings.
Since your site has a database, you can generate a UUID and store the list in the DB under that UUID. To save space and effort, you can write it to the DB only when the user presses a "get permalink" button or something like that.
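A minimal sketch of that approach, assuming SQLite and a made-up table named permalinks with columns slug and items (these names, and the helper functions, are illustrative and not from the original answer):

```python
import json
import sqlite3
import uuid


def create_permalink(conn, product_ids):
    """Store the list under a fresh random UUID and return the UUID as the permalink slug."""
    slug = uuid.uuid4().hex  # always 32 hex characters, whatever the list size
    conn.execute(
        "INSERT INTO permalinks (slug, items) VALUES (?, ?)",
        (slug, json.dumps(product_ids)),
    )
    conn.commit()
    return slug


def load_permalink(conn, slug):
    """Return the stored list for a slug, or None if the slug is unknown."""
    row = conn.execute(
        "SELECT items FROM permalinks WHERE slug = ?", (slug,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Hypothetical usage: this would run when the user presses "get permalink".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permalinks (slug TEXT PRIMARY KEY, items TEXT NOT NULL)")
slug = create_permalink(conn, [101, 2057, 33])
print(slug, load_permalink(conn, slug))
```

The slug carries no information about the list itself; it is only a lookup key, which is why its length stays fixed.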

Related

Abbreviating a UUID

What would be a good way to abbreviate UUID for use in a button in a user interface when the id is all we know about the target?
GitHub seems to abbreviate commit ids by taking 7 characters from the beginning. For example b1310ce6bc3cc932ce5cdbe552712b5a3bdcb9e5 would appear in a button as b1310ce. While not perfect, this shorter version is sufficient to look unique in the context where it is displayed. I'm looking for a similar solution that would work for UUIDs. I'm wondering if some part of the UUID is more random than another.
The most straightforward option would be splitting at the dash and using the first part. The UUID 42e9992a-8324-471d-b7f3-109f6c7df99d would then be abbreviated as 42e9992a. All of the solutions I can come up with seem equally arbitrary. Perhaps there is some outside-the-box user interface design solution that I didn't think of.
Entropy of a UUID is highest in the first bits for UUID V1 and V2, and evenly distributed for V3, V4 and V5. So the first N characters are no worse than any other N-character subset.
For N=8, i.e. the group before the first dash, the odds of a collision within a list you could reasonably display on a single GUI screen are vanishingly small.
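To put a rough number on "vanishingly small", here is a sketch using the standard birthday approximation. It assumes the prefixes behave like uniformly random hex strings (reasonable for V4, less so for time-based versions); the function name is made up for illustration.

```python
import math


def collision_probability(n_items, hex_chars=8):
    """Approximate chance that at least two of n_items random prefixes coincide."""
    space = 16 ** hex_chars  # number of distinct prefixes of this length
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space))
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


for n in (50, 1000, 100_000):
    print(n, collision_probability(n))
```

With 8 hex characters there are 16^8 (about 4.3 billion) possible prefixes, so a screenful of items is far below the point where collisions become likely.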
The question is whether you want to show part of the UUID or only ensure that unique strings are presented as shorter unique strings. If you want to focus on the latter, which appears to be the goal you are suggesting in your opening paragraph:
(...) While not perfect this shorter version is sufficient to look unique in
the context where it is displayed. (...)
you can make use of hashing.
Hashing:
Hashing is the transformation of a string of characters into a usually
shorter fixed-length value or key that represents the original string.
Hashing is used to index and retrieve items in a database because it
is faster to find the item using the shorter hashed key than to find
it using the original value.
Hashing is very common and easy to use in many popular languages. A simple approach in Python:
import hashlib
import uuid
encoded_str = uuid.UUID('42e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid = hashlib.sha1(encoded_str).hexdigest()
hash_uuid[:10]
'b6e2a1c885'
As expected, a small change in the input produces a completely different hash, so distinct UUIDs still look distinct after abbreviation.
# Second digit is replaced with 3, rest of the string remains untouched
encoded_str_two = uuid.UUID('43e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid_two = hashlib.sha1(encoded_str_two).hexdigest()
hash_uuid_two[:10]
'406ec3f5ae'
After thinking about this for a while I realised that the short git commit hash is used as part of command line commands. Since this requirement does not exist for UUIDs and graphical user interfaces I simply decided to use ellipsis for the abbreviation. Like so 42e9992...

Encrypt or hash a string to decimal numbers in Ruby

In one of our signup processes, in a Ruby on Rails app, I want to send someone an email with a 6-digit numerical code that they need to copy or type into a box on the page.
How I thought I might do it is to have a string on the page, in a hidden field, say, which, on generating the email, gets combined with some secret key from our codebase, then hashed or encrypted to a series of decimal numbers. I then take the last six of these and put them in the email.
When the form is submitted, with the original string, and the 6 digits the user has typed in, I can repeat the process on the string and test that it produces the same six digits as the user entered.
The question is: how do I hash/encrypt a string to get a series of decimal numbers?  The original string can be anything (including a different set of decimal numbers), it just needs to be something that is randomly generated really.
Note: Do not use this as a security measure. You've stated that you intend to use this only as a kind of diy captcha system, and this answer is provided in that context.
OpenSSL provides good hash functions. Once you have the hash in hex, the to_i method takes a base argument, so converting to decimal is simple. Then convert back to a string, because that makes it easier to get the last characters.
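require 'openssl'
# Hash the input together with a random seed; the result is a hex string
hash = OpenSSL::Digest::SHA256.hexdigest('user@example.com' + Random.new_seed.to_s)
# Parse the hex as an integer, render it in decimal, and keep the last six digits
message = hash.to_i(16).to_s.chars.last(6).join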
You could then store any part of this along with an expiration time.
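The stateless variant described in the question (recompute the digits from the page string plus a secret and compare) could be sketched as follows. This swaps in a keyed HMAC instead of the hash-plus-random-seed used above, so nothing needs to be stored; SECRET_KEY, six_digit_code and code_matches are hypothetical names, and the same "not a security measure" caveat applies.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-app-secret"  # hypothetical; would live in app config


def six_digit_code(page_token):
    """Derive six decimal digits from the page token and the secret key."""
    digest = hmac.new(SECRET_KEY, page_token.encode("utf-8"), hashlib.sha256).hexdigest()
    # Hex -> integer -> decimal string, then keep the last six digits
    return str(int(digest, 16))[-6:]


def code_matches(page_token, submitted):
    """Recompute the code for the token and compare it with what the user typed."""
    expected = six_digit_code(page_token)
    return hmac.compare_digest(expected.encode("utf-8"), submitted.encode("utf-8"))


code = six_digit_code("some-random-page-string")
print(code, code_matches("some-random-page-string", code))
```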

Generate unique alpha-numeric object ID, like Parse

The title already says what I need. I want to generate the ID using either PHP or JavaScript.
I think the class name and some properties are used to build the objectId, but perhaps someone already knows how it's done and could share it here?
Parse Server generates the objectId. It is a randomly generated string 10 characters long. You can see their implementation at cryptoUtils.newObjectId(). From the code we can conclude that they are not enforcing uniqueness.
https://github.com/ParsePlatform/parse-server/blob/master/src/cryptoUtils.js
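A rough sketch of the same idea, a random 10-character alphanumeric string, is shown below. It is not Parse Server's code; the alphabet and the function name are assumptions made for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed 62-character alphabet


def new_object_id(size=10):
    """Return a random alphanumeric ID; uniqueness is likely but not enforced."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


print(new_object_id())
```

As with the approach described above, nothing here checks for duplicates; a unique index in the datastore would be needed to enforce that.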
Parse is probably using IDs generated by MongoDB. They are not random and can potentially be predicted:
A BSON ObjectID is a 12-byte value
consisting of a 4-byte timestamp
(seconds since epoch), a 3-byte
machine id, a 2-byte process id, and a
3-byte counter
http://www.mongodb.org/display/DOCS/Object+IDs
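To illustrate why such IDs are predictable, the sketch below splits a 24-character hex ObjectID into the fields from the quoted layout. The function name is made up, and the field boundaries follow the quote above; other MongoDB versions may lay out the middle bytes differently.

```python
import struct
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 12-byte ObjectID (as hex) into the fields of the quoted layout."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectID is exactly 12 bytes (24 hex characters)")
    (seconds,) = struct.unpack(">I", raw[:4])  # big-endian seconds since epoch
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),
        "pid": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Made-up example ID: zero timestamp, machine a1b2c3, pid 0x0d5e, counter 1
print(split_object_id("00000000a1b2c30d5e000001"))
```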

Should I just use string as attribute type instead of integer in core-data?

I have a text field which allows the user to input numbers. This is what I did:
[_textfield setText:[NSString stringWithFormat:@"%d", [_textfield.text intValue]]];
Basically, I convert the text in the text field to an integer, then convert it back to a string. This ensures the text is numbers only.
Now I need to store the text in _textfield in Core Data. I was wondering whether I should use string or integer as the attribute type.
I know integer is the more sensible option. But in this case, every time the view is loaded, I need to fetch this data and set it on the _textfield. If I use integer as the attribute type, I have to convert to string each time. I know what I need to do is simply:
_textfield.text = [numberFromCoreData stringValue];
I don't need to compare, sort or do any arithmetic computation with that number, so should I just use string as the attribute type?
Integer searching is significantly faster than string searching. That is the single most compelling reason to use numbers in your persistence layer. Numbers also can sort differently than strings.
For performance reasons I would never use a string when I know the value is always going to be an integer. Control the input, force it to only accept numbers and keep your data integrity.
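The point about sort order is language-independent; a small Python illustration (not Core Data code) of how the same values order differently as strings and as integers:

```python
values_as_strings = ["10", "9", "2"]
values_as_ints = [int(v) for v in values_as_strings]

# Strings compare character by character, so "10" sorts before "2"
sorted_strings = sorted(values_as_strings)
# Integers compare by numeric value
sorted_ints = sorted(values_as_ints)

print(sorted_strings)
print(sorted_ints)
```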
It depends on how you need to use that field. In almost every case, integer data should be stored as an integer type, but not always. You definitely want an integer type if you'll ever be using that field in a case where its numeric value counts in some way. That includes sorting (because it's a hell of a lot faster with numeric fields), comparing numeric values, or any kind of mathematical operation.
But there are exceptions. For example, in some cases fields which initially seem to be inherently numeric turn out not to be so. Take a "size" field which is normally an integer. On closer inspection it turns out that some sizes are specified as "8 - 10", "12 - 14", etc. This happened in one app I worked on a couple of years ago. In that case I ended up using two fields for the data: a numeric "sortSize" that could be used for sorting, and a string "displaySize" that included the full string.
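A sketch of the two-field idea in Python rather than Core Data. How sortSize was derived in that app is not stated; taking the first number in the display string is an assumption made here, and size_fields is a made-up helper.

```python
import re


def size_fields(display_size):
    """Build the numeric sort key and keep the original display string."""
    match = re.search(r"\d+", display_size)
    # Assumption: the first number in the string is the sort key
    sort_size = int(match.group()) if match else None
    return {"sortSize": sort_size, "displaySize": display_size}


rows = [size_fields(s) for s in ["12 - 14", "8 - 10", "9"]]
rows.sort(key=lambda row: row["sortSize"])
print([row["displaySize"] for row in rows])
```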
It's probably not what you want, but why don't you use the "Number Pad" keyboard type for your text field?
With that, you would be sure that you have only numbers in your text field.
Honestly I can't think of a compelling reason. Strings in general take up more storage space than integers, but in the modern world of computing this isn't much of an issue. If you aren't really pushing your processor too hard, I'd go with what is convenient.
At the most basic level, an integer is just a number, but for a string the computer needs to know where the string starts, where it ends, and what is in it, so it's a little bigger.

What is the length and characters in ids generated by elasticsearch?

The Elasticsearch documentation states:
The index operation can be executed without specifying the id. In such
a case, an id will be generated automatically.
But it does not provide any information about the properties of the ids.
What is the length (minimum/maximum)?
My guess is 22.
Which characters are used in the id?
My guess is [-_A-Za-z0-9]
Can the properties of the generated ids change at any time (is that part of the API)?
Auto-generated ids are random base64-encoded UUIDs. The base64 algorithm is used in URL-safe mode hence - and _ characters might be present in ids.
IDs auto-generated by Elasticsearch are exactly 20 characters long (not 22) and are encoded with the URL-safe base64 alphabet [-_A-Za-z0-9].
Read more in documentation: https://www.elastic.co/guide/en/elasticsearch/guide/master/index-doc.html#_autogenerating_ids
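Both lengths mentioned above are consistent with unpadded URL-safe base64 over different input sizes: 16 bytes (a full UUID) encode to 22 characters, while 15 bytes encode to exactly 20. Which input size a given Elasticsearch version uses is an assumption here, not something the answers state; urlsafe_id is a made-up helper.

```python
import base64
import os
import uuid


def urlsafe_id(raw):
    """Encode bytes with the URL-safe base64 alphabet and drop '=' padding."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


id_from_uuid = urlsafe_id(uuid.uuid4().bytes)  # 16 random bytes
id_from_15_bytes = urlsafe_id(os.urandom(15))  # 15 random bytes

print(len(id_from_uuid), id_from_uuid)
print(len(id_from_15_bytes), id_from_15_bytes)
```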
