Abbreviating a UUID - user-interface

What would be a good way to abbreviate UUID for use in a button in a user interface when the id is all we know about the target?
GitHub seems to abbreviate commit ids by taking 7 characters from the beginning. For example, b1310ce6bc3cc932ce5cdbe552712b5a3bdcb9e5 would appear in a button as b1310ce. While not perfect this shorter version is sufficient to look unique in the context where it is displayed. I'm looking for a similar solution that would work for UUIDs. I'm wondering whether some part of the UUID is more random than another.
The most straightforward option would be splitting at the dashes and using the first part. The UUID 42e9992a-8324-471d-b7f3-109f6c7df99d would then be abbreviated as 42e9992a. All of the solutions I can come up with seem equally arbitrary. Perhaps there is some outside-the-box user interface design solution that I didn't think of.

Entropy of a UUID is highest in the first bits for UUID V1 and V2, and evenly distributed for V3, V4 and V5. So the first N characters are no worse than any other N-character subset.
For N=8, i.e. the group before the first dash, the odds of a collision within a list you could reasonably display on a single GUI screen are vanishingly small.
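To put a rough number on "vanishingly small", here is a minimal sketch using the standard birthday approximation. It assumes the displayed prefixes behave like uniformly random hex strings (reasonable for V4, only approximately true for the other versions); the function name and parameters are made up for illustration.

```python
import math


def collision_probability(n_items, hex_chars=8):
    """Approximate chance that at least two of n_items random prefixes coincide."""
    space = 16 ** hex_chars  # number of distinct prefixes of this length
    # Birthday approximation: p = 1 - exp(-n(n-1) / (2 * space)).
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


# 100 ids on one screen, 8 hex characters each: roughly one in a million.
print(collision_probability(100))
```

Dropping to 7 characters, as GitHub does for commit ids, makes the prefix space 16 times smaller, so a collision among the same 100 ids becomes about 16 times more likely while still being rare.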

The question is whether you want to show part of the UUID or only ensure that unique strings are presented as shorter unique strings. If you want to focus on the latter, which appears to be the goal you are suggesting in your opening paragraph:
(...) While not perfect this shorter version is sufficient to look unique in
the context where it is displayed. (...)
you can make use of hashing.
Hashing:
Hashing is the transformation of a string of characters into a usually
shorter fixed-length value or key that represents the original string.
Hashing is used to index and retrieve items in a database because it
is faster to find the item using the shorter hashed key than to find
it using the original value.
Hashing is very common and easy to use in many popular languages. A simple approach in Python:
import hashlib
import uuid
encoded_str = uuid.UUID('42e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid = hashlib.sha1(encoded_str).hexdigest()
hash_uuid[:10]
'b6e2a1c885'
As expected, a small change in the input produces a completely different hash, so distinct UUIDs still look distinct.
# Second digit is replaced with 3, rest of the string remains untouched
encoded_str_two = uuid.UUID('43e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid_two = hashlib.sha1(encoded_str_two).hexdigest()
hash_uuid_two[:10]
'406ec3f5ae'

After thinking about this for a while I realised that the short git commit hash is used as part of command line commands. Since this requirement does not exist for UUIDs and graphical user interfaces I simply decided to use ellipsis for the abbreviation. Like so 42e9992...
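A minimal sketch of that ellipsis abbreviation, assuming the id arrives as a string. The helper name and its default length of 7 are illustrative choices, not part of any library.

```python
def abbreviate(identifier, length=7, ellipsis="..."):
    """Return the first `length` characters of an id, followed by an ellipsis."""
    compact = identifier.replace("-", "")  # ignore dashes so longer prefixes stay contiguous
    if len(compact) <= length:
        return compact  # nothing was cut off, so no ellipsis is needed
    return compact[:length] + ellipsis


print(abbreviate("42e9992a-8324-471d-b7f3-109f6c7df99d"))  # 42e9992...
```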


Is there a way to remove a word from a KeyedVectors vocab?

I need to remove an invalid word from the vocab of a "gensim.models.keyedvectors.Word2VecKeyedVectors".
I tried to remove it using del model.vocab[word]. If I print model.vocab the word has disappeared, but when I run model.most_similar with other words, the word that I deleted still appears as similar.
So how can I delete a word from model.vocab in a way that stops model.most_similar from returning it?
There's no existing method supporting the removal of individual words.
A quick-and-dirty workaround might be to note the index of the existing vector (in the underlying large vector array) at the same time as removing the vocab entry, and to change the string in the kv_model.index2entity list at that index to some plug value (say, '***DELETED***').
Then, after performing any most_similar(), discard any entries matching '***DELETED***'.
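The filtering step can be sketched without touching gensim internals. This is only an illustration: the wrapper name and its n_deleted parameter are hypothetical, and it assumes the most_similar callable you pass in accepts a word plus a topn keyword and returns (word, score) pairs.

```python
DELETED = "***DELETED***"  # the plug value written over removed entries


def most_similar_without_deleted(most_similar, word, topn=10, n_deleted=1):
    """Call most_similar and drop plug entries, still returning up to topn results."""
    # Ask for extra neighbours so that topn results remain after filtering.
    raw = most_similar(word, topn=topn + n_deleted)
    kept = [(w, score) for w, score in raw if w != DELETED]
    return kept[:topn]
```

With a real model you would pass something like kv_model.most_similar as the first argument, with n_deleted set to the number of entries you have plugged so far.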
Refer to:
How to remove a word completely from a Word2Vec model in gensim?
Possible method 1: I solved it by editing the text model file itself.
Possible method 2: Refer to @zsozso's answer (though I didn't get it to work).

Encode array to fixed length string

I'm trying to implement a similar function to PCPartPicker's list permalink function.
https://au.pcpartpicker.com/list/
Basically, it generates a permalink based on the items in the list. The key part is to generate a string which should be:
unique
persistent
fixed length
I'm thinking about encoding an array that contains the product ids, but can't find the right way to implement it.
Base64 and similar encodings (like the Hashids library) can ensure it's unique and persistent, but the result ends up quite long when the array has many items.
Is there another way to encode the array, or another direction I could take to implement this function?
Thank you in advance.
One can't generate a unique fixed-length string for a list of arbitrary length that also contains all of the list's information: beyond some list length, the contents simply can't fit.
Since your site has a database, you can generate a UUID and store the list in the DB along with that UUID. To save space and effort, you can save it to the DB only when the user presses a "get permalink" button or something like that.
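A minimal sketch of that approach, with an in-memory dict standing in for the database table. The class and method names are hypothetical; a real site would replace the dict with an insert and a lookup against its own storage.

```python
import uuid


class PermalinkStore:
    """Maps fixed-length random keys to saved product lists."""

    def __init__(self):
        self._lists = {}  # stand-in for a database table keyed by permalink id

    def create(self, product_ids):
        """Save a list and return its permalink key."""
        key = uuid.uuid4().hex  # always 32 hex characters, whatever the list length
        self._lists[key] = list(product_ids)  # copy so later edits don't change the saved list
        return key

    def resolve(self, key):
        """Return the saved list, or None if the key is unknown."""
        return self._lists.get(key)
```

The key carries no information about the list itself, which is why it can stay the same length for any number of items; the trade-off is that the list must be stored server-side.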

Can I alpha sort base32/64 encoded MD5 hashes?

I've got a massive file of hex encoded MD5 values that I'm using linux 'sort' utility to sort. The result is that the hashes come out in sequential order (which is what I need for the next stage of processing). E.g:
000001C35AE83CEFE245D255FFC4CE11
000003E4B110FE637E0B4172B386ACAC
000004AAD0EB3D896B654A960B0111FA
In the interest of speeding up the sort operation (and making the files smaller), I was considering encoding the data as base32 or base64.
The question is, would an alpha-sort of the base32/64 data get me the same result? My quick tests seem to indicate that it would work. For example, the above three hex strings correspond 1:1 to these base64 strings:
AAABw1roPO/iRdJV/8TOEQ==
AAAD5LEQ/mN+C0Fys4asrA==
AAAEqtDrPYlrZUqWCwER+g==
But I'm unsure as to the sort order when it comes to special characters used in Base64 like "/" and "+" and how those would be treated in the context of an alpha sort.
Note: I happen to be using the linux sort utility but the question still applies to other alpha-sorting tools. The tool used is not really part of the question.
I've since discovered that this isn't possible with the standard base32/64 implementations. There exists however a base32 variation called "base32hex" which preserves sort ordering, but there is no official "base64hex" equivalent.
Looks like that leaves creating a custom encoding like this.
EDIT:
This turned out to be trivial to solve. Simply encode in base64, then translate character by character with a custom table of characters that respects sort order.
Map from the standard MIME Base64 characters:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
To something like this:
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
Then sorting will work.
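A minimal Python sketch of that translation, using the two alphabets above. The function name is made up, and stripping the trailing "=" padding is an extra assumption that is safe here only because every MD5 digest has the same length. It also assumes the sorting tool compares raw bytes (for example, sort running in the C locale), since locale-aware collation may order these characters differently.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SORTABLE = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
TO_SORTABLE = str.maketrans(STANDARD, SORTABLE)


def encode_sortable(hex_digest):
    """Encode a hex digest so that byte-wise string order matches numeric order."""
    raw = bytes.fromhex(hex_digest)
    # Standard base64 first; padding is dropped because all inputs share one length.
    standard = base64.b64encode(raw).decode("ascii").rstrip("=")
    # Swap each character for the one at the same index in the ASCII-ordered alphabet.
    return standard.translate(TO_SORTABLE)


hashes = [
    "000004AAD0EB3D896B654A960B0111FA",
    "000001C35AE83CEFE245D255FFC4CE11",
    "000003E4B110FE637E0B4172B386ACAC",
]
for encoded in sorted(encode_sortable(h) for h in hashes):
    print(encoded)
```

Each 32-character hex digest becomes a 22-character string, and sorting the encoded strings yields the same order as sorting the original hex values.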

Should I just use string as attribute type instead of integer in core-data?

I have a text field which allows the user to input numbers. This is what I did:
[_textfield setText:[NSString stringWithFormat:@"%d", [_textfield.text intValue]]];
Basically, I convert the text in the text field to an integer, then convert it back to a string. This ensures the text is numbers only.
Now I need to store the text in _textfield in Core Data. I was wondering whether I should use string or integer as the attribute type.
I know integer is the more sensible option. But in this case, every time the view is loaded, I need to fetch this data and set it on the _textfield. If I use integer as the attribute type, I have to convert to string each time. I know all I need to do is simply:
_textfield.text = [numberFromCoreData stringValue];
I don't need to compare, sort or do any arithmetic computation with that number, so should I just use string as the attribute type?
Integer searching is significantly faster than string searching. That is the single most compelling reason to use numbers in your persistence layer. Numbers also can sort differently than strings.
For performance reasons I would never use a string when I know the value is always going to be an integer. Control the input, force it to only accept numbers and keep your data integrity.
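The point about sort order is language-independent, so here is a small illustration in Python rather than Objective-C. The sample values are made up; the same difference applies to any store that sorts a string attribute lexicographically.

```python
values_as_text = ["10", "9", "2", "100"]
values_as_numbers = [int(v) for v in values_as_text]

# Strings compare character by character, so "100" sorts before "2".
print(sorted(values_as_text))     # ['10', '100', '2', '9']

# Integers compare by magnitude.
print(sorted(values_as_numbers))  # [2, 9, 10, 100]
```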
It depends how you need to use that field. In almost every case, integer data should be stored as an integer type, but not always. You definitely want an integer type if you'll ever be using that field in a case where its numeric value counts in some way. That includes sorting (because it's a hell of a lot faster with numeric fields), comparing numeric values, or any kind of mathematical operation.
But there are exceptions. For example, in some cases fields which initially seem to be inherently numeric turn out not to be so. Take a "size" field which is normally an integer. On closer inspection it turns out that some sizes are specified as "8 - 10", "12 - 14", etc. This happened in one app I worked on a couple of years ago. In that case I ended up using two fields for the data: a numeric "sortSize" that could be used for sorting, and a string "displaySize" that included the full string.
It's probably not what you want but why don't you use a keyboard type "number Pad" for your textfield?
With that, you would be sure that you have only numbers into your textfield.
Honestly I can't think of a compelling reason. Strings in general take up more storage space than integers, but in the modern world of computing this isn't much of an issue. If you aren't really pushing your processor too hard, I'd go with what is convenient.
At the most basic level, an integer is just a number, but for a string the computer needs to know where the string starts, where it ends, and what is in it, so it's a little bigger.

How do you autocomplete names containing spaces?

I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can you use to deal with that? I am using a trie to store the names.
The only solution I could come up with at first was just saying that two-word names cannot be used (either run them together or put a dash in the middle). Another idea was to create a list of these kinds of names and have a separate loop to check the input. The other, and possibly best, idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. I was wondering if there is a better solution out there.
Edit: I realized I wasn't very clear on what I was asking. My problem isn't adding two-word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell whether they are typing in the rest of the first name or are now typing in the last name?
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
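A minimal sketch of the "just include spaces in the trie" suggestion, written in Python for brevity even though the question is about JavaScript. Indexing each name under every word-aligned suffix is an added assumption, so that a name can still be found by typing its last name first; the class and method names are illustrative. The branch-pruning idea above is not implemented here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.names = set()  # full names whose key ends exactly at this node


class NameTrie:
    def __init__(self):
        self.root = TrieNode()

    def _insert_key(self, key, full_name):
        node = self.root
        for ch in key:  # a space is stored like any other character
            node = node.children.setdefault(ch, TrieNode())
        node.names.add(full_name)

    def add(self, full_name):
        # Index the whole name and every word-aligned suffix, so that
        # "Mary Ann Smith" is reachable from "mary", "ann" and "smith".
        words = full_name.lower().split()
        for i in range(len(words)):
            self._insert_key(" ".join(words[i:]), full_name)

    def complete(self, prefix):
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []  # nothing starts with this prefix
        # Collect every name stored at or below the node the prefix leads to.
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found |= current.names
            stack.extend(current.children.values())
        return sorted(found)
```

Because the space is part of the key, typing "mary " narrows the candidates to names whose first word is exactly "mary", and the next character decides between a second first name and a last name without any special-casing.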
If you are using a multiline editor, I guess the best autocomplete items would be single words. So the first name, middle name and last name must be parsed and each added as a lookup item.
For a one-line textbox you can include whitespace (and the first name + space + middle name + space + last name pattern) in the search criteria.
