Generating nice looking BETA keys - algorithm

I built a web application that is going to launch a beta test soon. I would really like to hand out beta invites and keys that look nice.
e.g. A3E6-7C24-9876-235B
That is 16 hexadecimal digits, in four groups of four.
It looks like the typical beta key you might see.
My question is what is a standard way to generate something like this and make sure that it is unique and that it will not be easy for someone to guess a beta key and generate their own.
I have some ideas that would probably work for beta keys:
MD5 is secure enough for this, but it is long and ugly looking and could cause confusion between 0 and O, or 1 and l.
I could start off with a large hexadecimal number that is 16 digits in length. To prevent people from guessing what the next beta key might be increment the value by a random number each time. The range of numbers between 1111-1111-1111-1111 and eeee-eeee-eeee-eeee will have plenty of room to spare even if I am skipping large quantities of numbers.
I guess I am just wondering if there is a standard way for doing this that I am not finding with google. Is there a better way?

The canonical "unique identifying number" is a uuid. There are various forms - you can generate one from random numbers (version 4) or from a hash of some value (user's email + salt?) (versions 3 and 5), for example.
Libraries for java, python and a bunch more exist.
PS I have to add that when I read your question title I thought you were looking for something cool and different. You might consider using an "interesting" word list and combining words with hyphens to encode a number (based on hash of email + salt). That would be much more attractive imho: "your beta code is secret-wombat-cookie-ninja" (I'm sure I read an article describing an example, but I can't find it now).
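The word-list idea could be sketched in Python roughly as follows. This is only an illustration: the word list (WORDS), the server-side secret (SECRET_SALT) and the choice of HMAC-SHA256 as the keyed hash are all assumptions, not something from the article mentioned above. A real list would need far more words, since four words from a 16-word list carry only 16 bits.

```python
import hashlib
import hmac

# Hypothetical word list; a size that divides 256 keeps the byte-to-word mapping unbiased.
WORDS = ["secret", "wombat", "cookie", "ninja", "rocket", "pickle", "velvet", "falcon",
         "marble", "pepper", "tundra", "banjo", "comet", "walrus", "maple", "quartz"]
SECRET_SALT = b"change-me"  # hypothetical server-side secret, never sent to users


def word_code(email: str, n_words: int = 4) -> str:
    """Derive a repeatable word-based code from an email address."""
    digest = hmac.new(SECRET_SALT, email.lower().encode("utf-8"), hashlib.sha256).digest()
    # Use one digest byte per word, reduced modulo the list size.
    return "-".join(WORDS[b % len(WORDS)] for b in digest[:n_words])
```

Because the code is derived from the email and the secret, it can be re-checked at sign-up time without storing it.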

One way (C# but the code is simple enough to port to other languages):
private static readonly Random random = new Random(Guid.NewGuid().GetHashCode());
static void Main(string[] args)
{
string x = GenerateBetaString();
}
public static string GenerateBetaString()
{
const string alphabet = "ABCDEF0123456789";
string x = GenerateRandomString(16, alphabet);
return x.Substring(0, 4) + "-" + x.Substring(4, 4) + "-"
+ x.Substring(8, 4) + "-" + x.Substring(12, 4);
}
public static string GenerateRandomString(int length, string alphabet)
{
int maxlen = alphabet.Length;
StringBuilder randomChars = new StringBuilder(length);
for (int i = 0; i < length; i++)
{
randomChars.Append(alphabet[random.Next(0, maxlen)]);
}
return randomChars.ToString();
}
Output:
97A8-55E5-C6B8-959E
8C60-6597-B71D-5CAF
8E1B-B625-68ED-107B
A6B5-1D2E-8D77-EB99
5595-E8DC-3A47-0605
Doing it this way gives you precise control over the characters in the alphabet. If you need crypto-strength randomness (unlikely), use the crypto random class to generate random bytes (possibly mod the alphabet length).
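For comparison, here is a rough Python sketch of the same approach that draws its characters from the standard `secrets` module, so the keys come from a cryptographically strong source. The function names and the uniqueness loop are my own illustration, not part of the C# answer.

```python
import secrets

ALPHABET = "ABCDEF0123456789"


def generate_beta_key(groups: int = 4, group_len: int = 4) -> str:
    """Return a key such as 97A8-55E5-C6B8-959E using a CSPRNG."""
    chars = [secrets.choice(ALPHABET) for _ in range(groups * group_len)]
    return "-".join(
        "".join(chars[i:i + group_len]) for i in range(0, len(chars), group_len)
    )


def generate_unique_keys(count: int) -> set:
    """Keep drawing keys until `count` distinct ones have been collected."""
    keys = set()
    while len(keys) < count:
        keys.add(generate_beta_key())
    return keys
```

With 16 hex digits there are 2^64 possible keys, so duplicates are very unlikely, but the set makes uniqueness explicit.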

Computing power is cheap, take your idea of the MD5 and run an "aesthetic" of your own devising over the set. The code below generates 2000 unique keys almost instantaneously that do not have a 0,1,L,O character in them. Modify aesthetic to fit any additional criteria:
import hashlib
import random

def potential_key():
    x = random.random()
    m = hashlib.md5()
    m.update(str(x).encode("ascii"))  # hashlib needs bytes in Python 3
    s = m.hexdigest().upper()[:16]
    return "%s-%s-%s-%s" % (s[:4], s[4:8], s[8:12], s[12:])

def aesthetic(s):
    # Reject any key containing a character that is easy to misread.
    bad_chars = ["0", "1", "L", "O"]
    for b in bad_chars:
        if b in s:
            return False
    return True

key_set = set()
while len(key_set) < 2000:
    k = potential_key()
    if aesthetic(k):
        key_set.add(k)
print(key_set)
Example keys:
'4297-CAC6-9DA8-625A', '43DD-2ED4-E4F8-3E8D', '4A8D-D5EF-C7A3-E4D5',
'A68D-9986-4489-B66C', '9B23-6259-9832-9639', '2C36-FE65-EDDB-2CF7',
'BFB6-7769-4993-CD86', 'B4F4-E278-D672-3D2C', 'EEC4-3357-2EAB-96F5',
'6B69-C6DA-99C3-7B67', '9ED7-FED5-3CC6-D4C6', 'D3AA-AF48-6379-92EF', ...

Related

What's the best way to compress multiple values into deserializable value?

I'm implementing an openpeeps.com library for Flutter in which user can create their own peeps to use as an avatar within our product.
One of the reasons behind using peeps as avatar is that (in theory) it can be easily stored as a single value within a database.
A Peep within my library consists of up to 6 PeepAtoms:
class Peep {
final PeepAtom head;
final PeepAtom face;
final PeepAtom facialHair;
final PeepAtom? accessories;
final PeepAtom? body;
final PeepAtom? pose;
}
A PeepAtom is currently just a name identifying the underlying image file required to build a Peep:
class PeepAtom {
final String name;
}
How to get a hash?
What I'd like to do now is get a single value from a Peep (int or string) which I can store in a database. If I retrieve the data, I'd like to deconstruct the value into the unique atoms so I can render the appropriate atom images to display the Peep. While I'm not really looking to optimize for storage size, it would be nice if the bytesize would be small.
Since I'm normally not working with such stuff I don't have an idea what's the best option. These are my (naïve) ideas:
do a Peep.toJson and convert the output to base64. Likely inefficient due to a bunch of unnecessary characters.
do a PeepAtom.hashCode for each field within a Peep and upload this. As an array that would be 64 bit = 8 bytes * 6 (atoms). That's pretty OK, but not a single value.
since there are only a limited number of atoms in each category (less than 100), I could use bit shifts and ^ to put this into one int. However, I think this would not really work, because I'd need a unique identifier, and since I'm code-generating the PeepAtoms that would likely be quite complex.
Any better ideas/algorithms?
I'm not sure what you mean by "quite complex". It looks quite simple to pack your atoms into a double.
Note that this is in no way a "hash". A hash is a lossy operation. I presume that you want to recover the original data.
Based on your description, you need seven bits for each atom. They can range in 0..98 (since you said "less than 100"). A double has 52 bits of mantissa. Your six atoms need 42 bits, so they fit easily. For atoms that can be null, just give that a special unused 7-bit value, like 127.
Now just use multiply and add to combine them. Use modulo and divide to pull them back out. E.g.:
// head, face, ... stand for the atoms' integer indices (0..127)
double val = head.toDouble();
val = val * 128 + face;
val = val * 128 + facialHair;
...
To extract:
int pose = (val % 128).toInt();
val = (val / 128).floorToDouble();
int body = (val % 128).toInt();
val = (val / 128).floorToDouble();
...
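The same multiply-and-add packing, written out in full as a small Python sketch. It assumes each atom has already been mapped to an integer index in 0..126, with 127 reserved for a missing atom; the names pack, unpack and NULL_ATOM are made up for the example. Python integers are used instead of a double, which does not change the arithmetic.

```python
NULL_ATOM = 127  # reserved 7-bit value meaning "this optional atom is absent"


def pack(atoms):
    """Combine atom indices (or None) into one integer, 7 bits per atom."""
    value = 0
    for atom in atoms:
        if atom is not None and not 0 <= atom < NULL_ATOM:
            raise ValueError("atom index must be in 0..126")
        code = NULL_ATOM if atom is None else atom
        value = value * 128 + code
    return value


def unpack(value, count=6):
    """Recover the atom indices in their original order."""
    atoms = []
    for _ in range(count):
        code = value % 128
        atoms.append(None if code == NULL_ATOM else code)
        value //= 128
    atoms.reverse()  # the last atom packed is the first one extracted
    return atoms
```

Six atoms never exceed 42 bits, so the packed value is also safe to store in a double or a 64-bit integer column.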

Hashing a long integer ID into a smaller string

Here is the problem: I need to transform an ID (defined as a long integer) into a smaller alphanumeric identifier. The details are the following:
Each individual in the problem has a unique ID, a long integer of size 13 (something like 123123412341234).
I need to generate a smaller representation of this unique ID, an alphanumeric string, something like A1CB3X. The problem is that a length of 5 or 6 characters will not be enough to represent such a large integer.
The new ID (eg A1CB3X) should be valid in a context where we know that only a small number of individuals are present (less than 500). The new ID should be unique within that small set of individuals.
The new ID (eg A1CB3X) should be the result of a calculation made over the original ID. This means that taking the original ID elsewhere and applying the same calculation, we should get the same new ID (eg A1CB3X).
This calculation should occur when the individual is added to the set, meaning that not all individuals belonging to that set will be known at that time.
Any directions on how to solve such a problem?
Assuming that you don't need a formula that goes in both directions (which is impossible if you are reducing a 13-digit number to a 5 or 6-character alphanum string):
If you can have up to 6 alphanumeric characters, that gives you 36^6 = 2,176,782,336 possibilities, assuming only numbers and uppercase letters.
To map your larger 13-digit number onto this space, you can take a modulo of some prime number slightly smaller than that, for example 2,176,782,317, then encode it with base-36 encoding.
alphanum_id = base36encode(longnumber_id % 2176782317)
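For a set of 500, this gives you a
1 - (2176782317 P 500) / 2176782317^500 chance of a collision
(P is permutation), which is roughly 500 * 499 / (2 * 2176782317), or about 0.006%, assuming the IDs are spread evenly over the residues.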
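A minimal Python sketch of this modulo-then-base-36 scheme, together with a numeric check of the collision estimate. base36encode is written out by hand here, and short_id and collision_probability are names invented for the example; the probability assumes the IDs land uniformly on the residues.

```python
import math

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MODULUS = 2176782317  # the modulus suggested above, just below 36**6


def base36encode(n: int) -> str:
    """Encode a non-negative integer with digits 0-9 and A-Z."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


def short_id(long_id: int) -> str:
    """Map a long ID to a fixed-width 6-character identifier (not reversible)."""
    return base36encode(long_id % MODULUS).rjust(6, "0")


def collision_probability(set_size: int, space: int = MODULUS) -> float:
    """1 - P(space, set_size) / space**set_size, computed in log space."""
    log_no_collision = sum(math.log1p(-i / space) for i in range(set_size))
    return -math.expm1(log_no_collision)
```

Since a collision is still possible, the set should check for an existing short ID when an individual is added.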
Best option is to change the base to 62 using case sensitive characters
If you want it to be shorter, you can add unicode characters. See below.
Here is javascript code for you: https://jsfiddle.net/vewmdt85/1/
function compress(n) {
var symbols = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð'.split('');
var d = n;
var compressed = '';
while (d >= 1) {
compressed = symbols[(d - (symbols.length * Math.floor(d / symbols.length)))] + compressed;
d = Math.floor(d / symbols.length);
}
return compressed;
}
$('input').keyup(function() {
$('span').html(compress($(this).val()))
})
$('span').html(compress($('input').val()))
How about using some base-X conversion, for example 123123412341234 becomes 17N644R7CI in base-36 and 9999999999999 becomes 3JLXPT2PR?
If you need a mapping that works both directions, you can simply go for a larger base.
Meaning: using base 16, you can reduce the values 0 to 15 to a single character.
So, base36 is the "maximum" that allows for shorter strings (when 1-1 mapping is required)!
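A reversible base-X conversion can be sketched in Python like this. The alphabets and the function names are my own choices for the example; any alphabet without repeated characters works the same way.

```python
import string

BASE36 = string.digits + string.ascii_uppercase
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode(n: int, alphabet: str) -> str:
    """Write a non-negative integer in the base given by len(alphabet)."""
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return "".join(reversed(digits))


def decode(text: str, alphabet: str) -> int:
    """Inverse of encode: rebuild the integer digit by digit."""
    base = len(alphabet)
    n = 0
    for ch in text:
        n = n * base + alphabet.index(ch)
    return n
```

For the ID from the question, base 36 gives the 10-character string quoted above, while base 62 needs 8 characters, so a fully reversible mapping cannot get down to 5 or 6 characters with these alphabets.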

What does Text.hashCode() & Integer.MAX_VALUE mean?

I have recently been reading Hadoop: The Definitive Guide.
I have two questions:
1. I saw this piece of code in a custom Partitioner:
public class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
What does & Integer.MAX_VALUE do here? Why use the & operator?
2. I also want to write a custom Partitioner for IntWritable. Is it OK, and is it best, to use key.value % numPartitions directly?
Like I already wrote in the comments, it is used to keep the resulting integer positive.
Let's use a simple example using Strings:
String h = "Hello I'm negative!";
int hashCode = h.hashCode();
hashCode is negative with the value of -1937832979.
If you mod this with a positive number (>0) that denotes the number of partitions, the result is negative (or zero), never positive.
System.out.println(hashCode % 5); // yields -4
Since partitions can never be negative, you need to make sure the number is positive. Here a simple bit-twiddling trick comes into play: Integer.MAX_VALUE has all ones except for the sign bit (the most significant bit of Java's 32-bit two's-complement int), which is only 1 on negative numbers.
So if you have a negative number with the sign bit set, that bit is ANDed with the zero in Integer.MAX_VALUE, so it always comes out as zero and the result is non-negative.
You can make it more readable though:
return Math.abs(key.getFirst().hashCode() % numPartitions);
For example I have done that in Apache Hama's partitioner for arbitrary objects:
@Override
public int getPartition(K key, V value, int numTasks) {
    return Math.abs(key.hashCode() % numTasks);
}
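To see the arithmetic outside of Java, here is a small Python sketch that imitates the two variants. Python's own % already returns a non-negative result for a positive divisor, so java_remainder is a helper written for this example to reproduce Java's sign rule; the function names are not from Hadoop or Hama.

```python
def java_remainder(a: int, b: int) -> int:
    """Remainder as Java's % computes it: the sign follows the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def partition_masked(hash_code: int, num_partitions: int) -> int:
    """(hashCode & Integer.MAX_VALUE) % numPartitions: clear the sign bit first."""
    return (hash_code & 0x7FFFFFFF) % num_partitions


def partition_abs(hash_code: int, num_partitions: int) -> int:
    """Math.abs(hashCode % numPartitions): take the remainder first, then drop the sign."""
    return abs(java_remainder(hash_code, num_partitions))
```

Both variants always return a value in 0..numPartitions-1, but they do not always agree with each other: for a hash code of -1 and 5 partitions the masked form gives 2 and the abs form gives 1. That is harmless as long as one job uses the same variant throughout.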

Making a list of integers more human friendly

This is a bit of a side project I have taken on to solve a no-fix issue for work. Our system outputs a code to represent a combination of things on another thing. Some example codes are:
9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4
9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72
9-15-0-9-1-6-2-1-2-0-0-1-6-0-7
The max number in one of the slots I've seen so far is about 150 but they will likely go higher.
When the system was designed there was no requirement for what this code would look like. But now the client wants to be able to type it in by hand from a sheet of paper, something the code above isn't suited for. We've said we won't do anything about it, but it seems like a fun challenge to take on.
My question is where is a good place to start loss-less compressing this code? Obvious solutions such as store this code with a shorter key are not an option; our database is read only. I need to build a two way method to make this code more human friendly.
1) I agree that you definitely need a checksum - data entry errors are very common, unless you have really well trained staff and independent duplicate keying with automatic cross-checking.
2) I suggest http://en.wikipedia.org/wiki/Huffman_coding to turn your list of numbers into a stream of bits. To get the probabilities required for this, you need a decent sized sample of real data, so you can make a count, setting Ni to the number of times number i appears in the data. Then I suggest setting Pi = (Ni + 1) / (Sum_i (Ni + 1)) - which smooths the probabilities a bit. Also, with this method, if you see e.g. numbers 0-150 you could add a bit of slack by entering numbers 151-255 and setting them to Ni = 0. Another way round rare large numbers would be to add some sort of escape sequence.
3) Finding a way for people to type the resulting sequence of bits is really an applied psychology problem but here are some suggestions of ideas to pinch.
3a) Software licences - just encode six bits per character in some 64-character alphabet, but group characters in a way that makes it easier for people to keep place e.g. BC017-06777-14871-160C4
3b) UK car license plates. Use a change of alphabet to show people how to group characters e.g. ABCD0123EFGH4567IJKL...
3c) A really large alphabet - get yourself a list of 2^n words for some decent sized n and encode n bits as a word e.g. GREEN ENCHANTED LOGICIAN... -
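The Huffman step from point 2 can be sketched in Python as below, using the add-one smoothing Pi = (Ni + 1) / Sum_i (Ni + 1) so that numbers never seen in the sample still get a code. The function names and the string-of-bits representation are simplifications chosen for the example; a real implementation would pack the bits and then apply one of the alphabets from point 3.

```python
import heapq
from collections import Counter
from itertools import count


def build_code(samples, max_symbol):
    """Build a Huffman table {number: bit string} for numbers 0..max_symbol (max_symbol >= 1)."""
    counts = Counter(n for code in samples for n in code)
    tie = count()  # unique tie-breaker so the heap never has to compare tree nodes
    # Smoothed weight N_i + 1 for every symbol, seen or not.
    heap = [(counts[s] + 1, next(tie), s) for s in range(max_symbol + 1)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))
    table = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: an actual number
            table[node] = prefix

    walk(heap[0][2], "")
    return table


def encode(numbers, table):
    return "".join(table[n] for n in numbers)


def decode(bits, table):
    inverse = {v: k for k, v in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return out
```

Both sides must build the table from the same sample, so in practice the table would be generated once and shipped with the encoder and decoder.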
i worried about this problem a while back. it turns out that you can't do much better than base64 - trying to squeeze a few more bits per character isn't really worth the effort (once you get into "strange" numbers of bits encoding and decoding becomes more complex). but at the same time, you end up with something that's likely to have errors when entered (confusing a 0 with an O etc). one option is to choose a modified set of characters and letters (so it's still base 64, but, say, you substitute ">" for "0"). another is to add a checksum. again, for simplicity of implementation, i felt the checksum approach was better.
unfortunately i never got any further - things changed direction - so i can't offer code or a particular checksum choice.
ps i realised there's a missing step i didn't explain: i was going to compress the text into some binary form before encoding (using some standard compression algorithm). so to summarize: compress, add checksum, base64 encode; base 64 decode, check checksum, decompress.
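Since no code was offered for that pipeline, here is one possible Python sketch of compress, add checksum, base64 encode and the reverse. The specific choices (zlib for compression, CRC-32 as the checksum, the URL-safe base64 alphabet) are assumptions made for the example, not the choices the answer had in mind. Note that zlib adds some overhead, so very short codes can come out longer than the input.

```python
import base64
import zlib


def encode(code: str) -> str:
    """Compress, append a 4-byte CRC-32, then base64-encode."""
    payload = zlib.compress(code.encode("ascii"), 9)
    checksum = zlib.crc32(payload).to_bytes(4, "big")
    return base64.urlsafe_b64encode(payload + checksum).decode("ascii")


def decode(token: str) -> str:
    """Base64-decode, verify the CRC-32, then decompress."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    payload, checksum = raw[:-4], raw[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        raise ValueError("checksum mismatch, probably a typing error")
    return zlib.decompress(payload).decode("ascii")
```

The checksum is verified before decompressing, so a mistyped character is reported as a typing error rather than as a decompression failure.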
This is similar to what I have used in the past. There are certainly better ways of doing this, but I used this method because it was easy to mirror in Transact-SQL which was a requirement at the time. You could certainly modify this to incorporate Huffman encoding if the distribution of your id's is non-random, but it's probably unnecessary.
You didn't specify language, so this is in c#, but it should be very easy to transition to any language. In the lookup you'll see commonly confused characters are omitted. This should speed up entry. I also had the requirement to have a fixed length, but it would be easy for you to modify this.
static public class CodeGenerator
{
static Dictionary<int, char> _lookupTable = new Dictionary<int, char>();
static CodeGenerator()
{
PrepLookupTable();
}
private static void PrepLookupTable()
{
_lookupTable.Add(0,'3');
_lookupTable.Add(1,'2');
_lookupTable.Add(2,'5');
_lookupTable.Add(3,'4');
_lookupTable.Add(4,'7');
_lookupTable.Add(5,'6');
_lookupTable.Add(6,'9');
_lookupTable.Add(7,'8');
_lookupTable.Add(8,'W');
_lookupTable.Add(9,'Q');
_lookupTable.Add(10,'E');
_lookupTable.Add(11,'T');
_lookupTable.Add(12,'R');
_lookupTable.Add(13,'Y');
_lookupTable.Add(14,'U');
_lookupTable.Add(15,'A');
_lookupTable.Add(16,'P');
_lookupTable.Add(17,'D');
_lookupTable.Add(18,'S');
_lookupTable.Add(19,'G');
_lookupTable.Add(20,'F');
_lookupTable.Add(21,'J');
_lookupTable.Add(22,'H');
_lookupTable.Add(23,'K');
_lookupTable.Add(24,'L');
_lookupTable.Add(25,'Z');
_lookupTable.Add(26,'X');
_lookupTable.Add(27,'V');
_lookupTable.Add(28,'C');
_lookupTable.Add(29,'N');
_lookupTable.Add(30,'B');
}
public static bool TryPCodeDecrypt(string iPCode, out Int64 oDecryptedInt)
{
//Prep the result so we can exit without having to fiddle with it if we hit an error.
oDecryptedInt = 0;
if (iPCode.Length > 3)
{
Char[] Bits = iPCode.ToCharArray(0,iPCode.Length-2);
int CheckInt7 = 0;
int CheckInt3 = 0;
if (!int.TryParse(iPCode[iPCode.Length-1].ToString(),out CheckInt7) ||
!int.TryParse(iPCode[iPCode.Length-2].ToString(),out CheckInt3))
{
//Unsuccessful -- the last check ints are not integers.
return false;
}
//Adjust the CheckInts to the right values.
CheckInt3 -= 2;
CheckInt7 -= 2;
int COffset = iPCode.LastIndexOf('M')+1;
Int64 tempResult = 0;
int cBPos = 0;
while ((cBPos + COffset) < Bits.Length)
{
//Calculate the current position.
int cNum = 0;
foreach (int cKey in _lookupTable.Keys)
{
if (_lookupTable[cKey] == Bits[cBPos + COffset])
{
cNum = cKey;
}
}
tempResult += cNum * (Int64)Math.Pow((double)31, (double)(Bits.Length - (cBPos + COffset + 1)));
cBPos += 1;
}
if (tempResult % 7 == CheckInt7 && tempResult % 3 == CheckInt3)
{
oDecryptedInt = tempResult;
return true;
}
return false;
}
else
{
//Unsuccessful -- too short.
return false;
}
}
public static string PCodeEncrypt(int iIntToEncrypt, int iMinLength)
{
int Check7 = (iIntToEncrypt % 7) + 2;
int Check3 = (iIntToEncrypt % 3) + 2;
StringBuilder result = new StringBuilder();
result.Insert(0, Check7);
result.Insert(0, Check3);
int workingNum = iIntToEncrypt;
while (workingNum > 0)
{
result.Insert(0, _lookupTable[workingNum % 31]);
workingNum /= 31;
}
if (result.Length < iMinLength)
{
for (int i = result.Length + 1; i <= iMinLength; i++)
{
result.Insert(0, 'M');
}
}
return result.ToString();
}
}

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
Update: For those who think this question is just for curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and it is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
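The dynamic program described above, as a short Python sketch. It uses 0-based slices instead of the 1-based table S, and stores the winning split position directly so the words can be read back; is_word is whatever dictionary lookup you have, and the tiny word set in the usage note is only for illustration.

```python
def segment(text, is_word):
    """Return one split of text into dictionary words, or None if there is none."""
    n = len(text)
    # split_at[i] is the start index of the last word in a valid split of text[:i],
    # or None if text[:i] cannot be split. The empty prefix is trivially valid.
    split_at = [None] * (n + 1)
    split_at[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            if split_at[j] is not None and is_word(text[j:i]):
                split_at[i] = j
                break
    if split_at[n] is None:
        return None
    # Walk the stored split points backwards to recover the words.
    words = []
    i = n
    while i > 0:
        j = split_at[i]
        words.append(text[j:i])
        i = j
    return words[::-1]
```

For example, with the set {"string", "into", "words", "in", "to"} as dictionary, segment("stringintowords", dictionary.__contains__) yields the three words string, into, words, which can then be capitalized and joined.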
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detailed answers, from university-level lectures with pseudo-code algorithms, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
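A much simplified Python sketch of that idea: each word gets a probability, a segmentation is scored by the sum of the log probabilities of its words, and for every prefix only the best-scoring segmentation is kept. This is a unigram model rather than a full Markov model, and the probabilities in the usage note are invented for illustration.

```python
import math


def best_segmentation(text, word_prob, max_word_len=20):
    """Return the most probable split of text under a unigram word model, or None."""
    n = len(text)
    # best[i] = (log probability of the best split of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            p = word_prob.get(text[j:i])
            if p is None or best[j][0] == -math.inf:
                continue
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, j)
    if best[n][0] == -math.inf:
        return None
    # Follow the stored back-pointers from the end of the string.
    words = []
    i = n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]
```

With made-up probabilities such as {"windows": 0.004, "window": 0.003, "team": 0.005, "steam": 0.0005, "blog": 0.002}, an ambiguous string like "windowsteamblog" resolves to windows, team, blog because that split has the higher total score.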
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
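The count is easy to check by brute force. The following Python sketch (all_splittings is a name made up for the example) enumerates every way of choosing cut points and is only practical for short strings.

```python
from itertools import combinations


def all_splittings(text):
    """List every way to cut text into contiguous, non-empty pieces."""
    n = len(text)
    result = []
    for k in range(n):  # k = number of cut points, 0 .. n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([text[a:b] for a, b in zip(bounds, bounds[1:])])
    return result
```

For "cat" this produces the 4 splittings mentioned above, and for a 5-character string it produces 16.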
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
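A Python sketch of this lookup. Instead of a NumSpaces array it records, for every character of the preprocessed text, its position in the original text, which carries the same information; the function name is made up, and fetching the page and stripping HTML tags are left out.

```python
def find_original_phrase(needle, haystack):
    """Find needle in haystack ignoring whitespace and case; return the original text."""
    if not needle:
        return None
    stripped = []        # haystack lowercased, whitespace removed
    original_index = []  # original_index[k] = position in haystack of stripped[k]
    for pos, ch in enumerate(haystack):
        if ch.isspace():
            continue
        for c in ch.lower():  # lower() can expand a character, so map each piece
            stripped.append(c)
            original_index.append(pos)
    location = "".join(stripped).find(needle.lower())
    if location == -1:
        return None
    start = original_index[location]
    end = original_index[location + len(needle) - 1] + 1
    return haystack[start:end]
```

For the example above, searching for "isashort" in "This is a short phrase" returns "is a short".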
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove that part of the string and append the word to your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
public class Solution {
// should contain the list of all words, or you can use any other data structure (e.g. a Trie)
private HashSet<String> dictionary;
public String parse(String s) {
return parse(s, new HashMap<String, String>());
}
public String parse(String s, HashMap<String, String> map) {
if (map.containsKey(s)) {
return map.get(s);
}
if (dictionary.contains(s)) {
return s;
}
for (int left = 1; left < s.length(); left++) {
String leftSub = s.substring(0, left);
if (!dictionary.contains(leftSub)) {
continue;
}
String rightSub = s.substring(left);
String rightParsed = parse(rightSub, map);
if (rightParsed != null) {
String parsed = leftSub + " " + rightParsed;
map.put(s, parsed);
return parsed;
}
}
map.put(s, null);
return null;
}
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while (mainword != "") {
    bool matched = false;
    for (x = 0; x < validwords.length; x++) {
        if (mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            // drop the matched prefix and start again from the longest word
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            break;
        }
    }
    /**
     * remove the first character if none of the valid words match, then start again
     * you may need to add the first character to the result if you want to
     */
    if (!matched) {
        mainword = mainword.substring(1);
    }
}
string result = segments.join(" ");

Resources