What's the best way to compress multiple values into a deserializable value?

I'm implementing an openpeeps.com library for Flutter in which users can create their own peeps to use as avatars within our product.
One of the reasons for using peeps as avatars is that (in theory) a peep can easily be stored as a single value in a database.
A Peep within my library consists of up to 6 PeepAtoms:
class Peep {
  final PeepAtom head;
  final PeepAtom face;
  final PeepAtom facialHair;
  final PeepAtom? accessories;
  final PeepAtom? body;
  final PeepAtom? pose;
}
A PeepAtom is currently just a name identifying the underlying image file required to build a Peep:
class PeepAtom {
  final String name;
}
How to get a hash?
What I'd like to do now is get a single value from a Peep (int or string) which I can store in a database. When I retrieve the data, I'd like to deconstruct the value into its individual atoms so I can render the appropriate atom images to display the Peep. While I'm not really looking to optimize for storage size, it would be nice if the byte size were small.
Since I don't normally work with this kind of thing, I don't know what the best option is. These are my (naïve) ideas:
Do a Peep.toJson and convert the output to base64. Likely inefficient due to a bunch of unnecessary characters.
Do a PeepAtom.hashCode for each field within a Peep and upload this. As an array that would be 64 bits = 8 bytes per atom, times 6 atoms. That's pretty OK, but it is not a single value.
Since there is only a limited number of atoms in each category (less than 100), I could use bit shifts and ^ to put this into one int. However, I think this would not really work, because I'd need a unique identifier for each atom, and since I'm code-generating the PeepAtoms that would likely be quite complex.
Any better ideas/algorithms?

I'm not sure what you mean by "quite complex". It looks quite simple to pack your atoms into a double.
Note that this is in no way a "hash". A hash is a lossy operation. I presume that you want to recover the original data.
Based on your description, you need seven bits for each atom. Their indices can range over 0..98 (since you said "less than 100"). A double has 52 bits of mantissa. Your six atoms need 42 bits, so they fit easily. For atoms that can be null, just use a special, otherwise unused 7-bit value, like 127.
Now just use multiply and add to combine them. Use modulo and divide to pull them back out. E.g.:
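// head, face, facialHair, ... stand for each atom's integer index (0..127)
double val = head.toDouble();
val = val * 128 + face;
val = val * 128 + facialHair;
...
To extract:
int pose = (val % 128).toInt();
val = (val / 128).floorToDouble();
int body = (val % 128).toInt();
val = (val / 128).floorToDouble();
...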
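To check the arithmetic end to end, here is a minimal sketch of the same multiply-and-add packing in Python (used only for illustration; the library itself is Dart). The slot order, the names `pack` and `unpack`, and the `NULL_SLOT` value of 127 are assumptions, and each atom is assumed to have already been mapped to an integer index in 0..126.

```python
BASE = 128        # 7 bits per atom
NULL_SLOT = 127   # assumed marker for an absent optional atom
SLOTS = 6         # head, face, facialHair, accessories, body, pose


def pack(indices):
    """Combine six atom indices (None for a missing atom) into one integer."""
    if len(indices) != SLOTS:
        raise ValueError("expected exactly six slots")
    value = 0
    for index in indices:
        if index is not None and not 0 <= index < NULL_SLOT:
            raise ValueError("atom index out of range")
        slot = NULL_SLOT if index is None else index
        value = value * BASE + slot
    return value


def unpack(value):
    """Recover the six atom indices; slots come out last-first, so reverse."""
    indices = []
    for _ in range(SLOTS):
        value, slot = divmod(value, BASE)
        indices.append(None if slot == NULL_SLOT else slot)
    return indices[::-1]


packed = pack([3, 42, 98, None, 0, 7])
assert unpack(packed) == [3, 42, 98, None, 0, 7]
assert packed < 2 ** 42  # small enough to be stored exactly in a double
```

Mapping an atom name to its index (and back) still needs a stable, ordered list per category; this sketch does not cover that part.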

Related

Kotlin Convert long String letters to a numerical id

I'm trying to find a way to convert a long string ID like "T2hR8VAR4tNULoglmIbpAbyvdRi1y02rBX" to a numerical id.
I thought about getting the ASCII value of each character and then adding them up, but I don't think this is a good way, as different strings can produce the same result. For example, "ABC" and "BAC" will have the same result:
A = 10, B = 20, C = 50,
ABC = 10 + 20 + 50 = 80
BAC = 20 + 10 + 50 = 80
I also thought about getting each letter's ASCII code and then writing the numbers next to each other. For example, for "ABC":
ABC = 102050
This method won't work either, as a 20-letter String will result in a huge number. How can I solve this problem? Thank you in advance.
You can use the hashCode() function: "id".hashCode(). All objects implement a variant of this function.
From the documentation:
open fun hashCode(): Int
Returns a hash code value for the object. The general contract of hashCode is:
Whenever it is invoked on the same object more than once, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
If two objects are equal according to the equals() method, then calling the hashCode method on each of the two objects must produce the same integer result.
All platform objects implement it by default. There is always a possibility of duplicates if you have lots of ids.
If you use a JVM-based Kotlin environment, the hash will be produced by the String.hashCode() function from the JVM.
If you need to be 100% confident that there are no possible duplicates, and the input Strings can be up to 20 characters long, then you cannot store the IDs in a 64-bit Long. You will have to use BigInteger:
val id = BigInteger(stringId.toByteArray())
At that point, I question whether there is any point in converting the ID to a numerical format. The String itself can be the ID.
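As a rough illustration of both points (the digit-sum idea collides, and a lossless conversion needs an unbounded integer), here is a short Python sketch. Python integers are arbitrary-precision, so they play the role of BigInteger here; the helper names are made up for this example, and the input is assumed to be ASCII that does not start with a NUL character.

```python
def string_to_int(string_id: str) -> int:
    """Interpret the string's bytes as one big-endian unsigned integer."""
    return int.from_bytes(string_id.encode("utf-8"), "big")


def int_to_string(number: int) -> str:
    """Reverse string_to_int by turning the integer back into bytes."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode("utf-8")


# Summing character codes loses the order, so different IDs collide.
assert sum(map(ord, "ABC")) == sum(map(ord, "BAC"))

original = "T2hR8VAR4tNULoglmIbpAbyvdRi1y02rBX"
number = string_to_int(original)
assert int_to_string(number) == original  # lossless round trip
assert number.bit_length() > 64           # far too large for a 64-bit Long
```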

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case). The generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from using them:
some characters are disallowed, i.e. I, l and 1 are lookalikes, as are O and 0.
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;
    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);
        // BINARY makes the comparison case sensitive
        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }
    return $uuid;
}
Please spare me the critique, we live in an agile world and this implementation is functional and is quick to code.
With a small set of data it works beautifully. However if I have 10 million entries in the blacklist and try to create 1000 more it fails flat as it takes 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate it so it fits the length of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
Generate five of the digits from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1000 in this case) I know there wasn't a collision and I can commit the transaction. Otherwise I need to rollback the chunk and generate another 1000 codes.
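The chunked approach can be sketched as follows. This is only an illustration in Python with an in-memory SQLite database, not the original PHP/MySQL code: `INSERT OR IGNORE` stands in for MySQL's `INSERT IGNORE`, the table layout and function names are assumptions, and the alphabet is the one from the question.

```python
import secrets
import sqlite3

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz23456789"


def random_code(length=8):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def insert_chunk(conn, codes):
    """Insert all codes or none; return True when no code collided."""
    before = conn.total_changes
    conn.executemany("INSERT OR IGNORE INTO codes (uuid) VALUES (?)",
                     [(code,) for code in codes])
    if conn.total_changes - before == len(codes):
        conn.commit()      # every row was new
        return True
    conn.rollback()        # at least one row was ignored: discard the chunk
    return False


def generate_codes(conn, total, chunk_size=1000):
    created = 0
    while created < total:
        size = min(chunk_size, total - created)
        chunk = set()      # a set avoids duplicates inside the chunk itself
        while len(chunk) < size:
            chunk.add(random_code())
        if insert_chunk(conn, sorted(chunk)):
            created += size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (uuid TEXT PRIMARY KEY)")
conn.commit()
generate_codes(conn, 2500)
```

The unique index does the blacklist check, so the number of round trips drops from one per code to one per chunk. A chunk that hits a collision is simply regenerated, which stays cheap as long as the code space is much larger than the table.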

Hashing a long integer ID into a smaller string

Here is the problem: I need to transform an ID (defined as a long integer) into a smaller alphanumeric identifier. The details are the following:
Each individual in the problem has a unique ID, a long integer of size 13 (something like 123123412341234).
I need to generate a smaller representation of this unique ID, an alphanumeric string, something like A1CB3X. The problem is that a length of 5 or 6 characters will not be enough to represent such a large integer.
The new ID (e.g. A1CB3X) should be valid in a context where we know that only a small number of individuals are present (less than 500). The new ID should be unique within that small set of individuals.
The new ID (e.g. A1CB3X) should be the result of a calculation made over the original ID. This means that taking the original ID elsewhere and applying the same calculation, we should get the same new ID (e.g. A1CB3X).
This calculation should occur when the individual is added to the set, meaning that not all individuals belonging to that set will be known at that time.
Any directions on how to solve such a problem?
Assuming that you don't need a formula that goes in both directions (which is impossible if you are reducing a 13-digit number to a 5 or 6-character alphanum string):
If you can have up to 6 alphanumeric characters, that gives you 36^6 = 2,176,782,336 possibilities, assuming only numbers and uppercase letters.
To map your larger 13-digit number onto this space, you can take it modulo some prime number slightly smaller than that, for example 2,176,782,317, then encode the result with base-36 encoding.
alphanum_id = base36encode(longnumber_id % 2176782317)
For a set of 500, the chance of having no collision is
(2176782317 P 500) / 2176782317^500
(P is permutation), so the chance of at least one collision is 1 minus that, roughly 0.006% if the IDs spread evenly.
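A minimal Python sketch of this modulo-then-base-36 idea follows, together with a numeric check of the collision estimate. The function names are made up for this example, the modulus is the one quoted above, and the probability assumes the reduced IDs are spread uniformly. It computes the chance that at least two of the 500 IDs collide, that is, 1 minus the permutation ratio.

```python
import math

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MODULUS = 2176782317  # the modulus quoted above, just below 36**6


def base36encode(number):
    if number == 0:
        return "0"
    chars = []
    while number > 0:
        number, remainder = divmod(number, 36)
        chars.append(DIGITS[remainder])
    return "".join(reversed(chars))


def short_id(long_id):
    """Reduce a long ID to at most six base-36 characters (not reversible)."""
    return base36encode(long_id % MODULUS)


def collision_probability(slots, items):
    """1 - P(slots, items) / slots**items, summed in log space."""
    log_no_collision = sum(math.log1p(-i / slots) for i in range(items))
    return -math.expm1(log_no_collision)


print(short_id(123123412341234))
print(collision_probability(MODULUS, 500))  # about 5.7e-05
```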
Best option is to change the base to 62 using case sensitive characters
If you want it to be shorter, you can add unicode characters. See below.
Here is javascript code for you: https://jsfiddle.net/vewmdt85/1/
function compress(n) {
  var symbols = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð'.split('');
  var d = n;
  var compressed = '';
  while (d >= 1) {
    compressed = symbols[(d - (symbols.length * Math.floor(d / symbols.length)))] + compressed;
    d = Math.floor(d / symbols.length);
  }
  return compressed;
}
$('input').keyup(function() {
  $('span').html(compress($(this).val()))
})
$('span').html(compress($('input').val()))
How about using some base-X conversion, for example 123123412341234 becomes 17N644R7CI in base-36 and 9999999999999 becomes 3JLXPT2PR?
If you need a mapping that works both directions, you can simply go for a larger base.
Meaning: using base 16, you can reduce any value from 0 to 15 to a single character.
So, if you are limited to digits and case-insensitive letters, base 36 is the "maximum" that allows for shorter strings (when a 1-1 mapping is required)!

Generating nice looking BETA keys

I built a web application that is going to launch a beta test soon. I would really like to hand out beta invites and keys that look nice.
i.e. A3E6-7C24-9876-235B
That is 16 hexadecimal digits.
It looks like the typical beta key you might see.
My question is what is a standard way to generate something like this and make sure that it is unique and that it will not be easy for someone to guess a beta key and generate their own.
I have some ideas that would probably work for beta keys:
MD5 is secure enough for this, but it is long and ugly looking and could cause confusion between 0 and O, or 1 and l.
I could start off with a large hexadecimal number that is 16 digits in length. To prevent people from guessing what the next beta key might be increment the value by a random number each time. The range of numbers between 1111-1111-1111-1111 and eeee-eeee-eeee-eeee will have plenty of room to spare even if I am skipping large quantities of numbers.
I guess I am just wondering if there is a standard way for doing this that I am not finding with google. Is there a better way?
The canonical "unique identifying number" is a UUID. There are various forms: you can generate one from random numbers (version 4) or from a hash of some value (user's email + salt?) (versions 3 and 5), for example.
Libraries for java, python and a bunch more exist.
PS I have to add that when I read your question title I thought you were looking for something cool and different. You might consider using an "interesting" word list and combining words with hyphens to encode a number (based on hash of email + salt). That would be much more attractive imho: "your beta code is secret-wombat-cookie-ninja" (I'm sure I read an article describing an example, but I can't find it now).
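The word-list idea could look roughly like this in Python. Everything here is an assumption made for illustration: the 16-word list, the choice of SHA-256, and the function name. With only 16 words, four words carry just 16 bits, so a real list would need thousands of words to make keys hard to guess.

```python
import hashlib

# Hypothetical word list; a real one should be far longer.
WORDS = ["secret", "wombat", "cookie", "ninja", "maple", "rocket",
         "violet", "pepper", "falcon", "marble", "velvet", "harbor",
         "copper", "lantern", "meadow", "tundra"]


def word_key(email, salt, count=4):
    """Derive a hyphenated word key from a hash of salt + email."""
    digest = hashlib.sha256((salt + email).encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    picked = []
    for _ in range(count):
        # Peel off one base-len(WORDS) digit per word.
        number, index = divmod(number, len(WORDS))
        picked.append(WORDS[index])
    return "-".join(picked)


print(word_key("user@example.com", "keep-this-salt-private"))
```

Because the key is derived from the email and a private salt, it can be recomputed for verification instead of being stored.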
One way (C# but the code is simple enough to port to other languages):
using System;
using System.Text;

class Program
{
    private static readonly Random random = new Random(Guid.NewGuid().GetHashCode());

    static void Main(string[] args)
    {
        string x = GenerateBetaString();
        Console.WriteLine(x);
    }

    public static string GenerateBetaString()
    {
        const string alphabet = "ABCDEF0123456789";
        string x = GenerateRandomString(16, alphabet);
        // Split the 16 characters into four groups of four
        return x.Substring(0, 4) + "-" + x.Substring(4, 4) + "-"
            + x.Substring(8, 4) + "-" + x.Substring(12, 4);
    }

    public static string GenerateRandomString(int length, string alphabet)
    {
        int maxlen = alphabet.Length;
        StringBuilder randomChars = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            randomChars.Append(alphabet[random.Next(0, maxlen)]);
        }
        return randomChars.ToString();
    }
}
Output:
97A8-55E5-C6B8-959E
8C60-6597-B71D-5CAF
8E1B-B625-68ED-107B
A6B5-1D2E-8D77-EB99
5595-E8DC-3A47-0605
Doing it this way gives you precise control over the characters in the alphabet. If you need crypto-strength randomness (unlikely), use the crypto random class to generate random bytes (possibly taken modulo the alphabet length).
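If crypto-strength randomness is wanted, the same idea fits in a few lines of Python using the standard `secrets` module. This is a sketch rather than a port of the C# above; the function name and its parameters are assumptions.

```python
import secrets


def generate_beta_key(groups=4, group_size=4, alphabet="ABCDEF0123456789"):
    """Build a key such as XXXX-XXXX-XXXX-XXXX from a CSPRNG."""
    chars = [secrets.choice(alphabet) for _ in range(groups * group_size)]
    parts = ["".join(chars[i:i + group_size])
             for i in range(0, len(chars), group_size)]
    return "-".join(parts)


print(generate_beta_key())
# Drop look-alike characters by passing a different alphabet:
print(generate_beta_key(alphabet="ABCDEFGHJKMNPQRSTUVWXYZ23456789"))
```

Uniqueness is not guaranteed by randomness alone, so issued keys would still need to be stored and checked for duplicates.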
Computing power is cheap: take your idea of the MD5 and run an "aesthetic" filter of your own devising over the set. The code below generates 2000 unique keys almost instantaneously, none of which contains a 0, 1, L or O character. Modify aesthetic to fit any additional criteria:
import random, hashlib

def potential_key():
    x = random.random()
    m = hashlib.md5()
    m.update(str(x).encode())  # hashlib needs bytes in Python 3
    s = m.hexdigest().upper()[:16]
    return "%s-%s-%s-%s" % (s[:4], s[4:8], s[8:12], s[12:])

def aesthetic(s):
    # Reject keys that contain look-alike characters
    bad_chars = ["0", "1", "L", "O"]
    for b in bad_chars:
        if b in s:
            return False
    return True

key_set = set()
while len(key_set) < 2000:
    k = potential_key()
    if aesthetic(k):
        key_set.add(k)
print(key_set)
Example keys:
'4297-CAC6-9DA8-625A', '43DD-2ED4-E4F8-3E8D', '4A8D-D5EF-C7A3-E4D5',
'A68D-9986-4489-B66C', '9B23-6259-9832-9639', '2C36-FE65-EDDB-2CF7',
'BFB6-7769-4993-CD86', 'B4F4-E278-D672-3D2C', 'EEC4-3357-2EAB-96F5',
'6B69-C6DA-99C3-7B67', '9ED7-FED5-3CC6-D4C6', 'D3AA-AF48-6379-92EF', ...

Generating integer within range from unique string in ruby

I have code that should take a unique string (for example, "d86c52ec8b7e8a2ea315109627888fe6228d") from a client and return an integer greater than 2200000000 and less than 5800000000. It's important that the generated integer is not random: the same string should always give the same integer. What is the best way to generate it without using a DB?
Now it looks like this:
did = "d86c52ec8b7e8a2ea315109627888fe6228d"
min_cid = 2200000000
max_cid = 5800000000
cid = did.hash.abs.to_s.split.last(10).to_s.to_i
if cid < min_cid
  cid += min_cid
else
  while cid > max_cid
    cid -= 1000000000
  end
end
Here's the problem: your range of numbers has only 3.6x10^9 possible values, whereas your sample unique string (which looks like a hex integer with 36 digits) has 16^36 possible values (i.e. many more). So when mapping your string into your integer range there will be collisions.
The mapping function itself can be pretty straightforward, I would do something such as below (also, consider using only a part of the input string for integer conversion, e.g. the first seven digits, if performance becomes critical):
def my_hash(str, min, max)
  range = (max - min).abs
  (str.to_i(16) % range) + min
end
my_hash(did, min_cid, max_cid) # => 2461595789
[Edit] If you are using Ruby 1.8 and your adjusted range can be represented as a Fixnum, just use the hash value of the input string object instead of parsing it as a big integer. Note that this strategy might not be safe in Ruby 1.9 (per the comment by @DataWraith), as object hash values may be randomized between invocations of the interpreter, so you would not get the same hash number for the same input string after restarting your application:
def hash_range(obj, min, max)
  (obj.hash % (max-min).abs) + [min, max].min
end
hash_range(did, min_cid, max_cid) # => 3886226395
And, of course, you'll have to decide what to do about collisions. You'll likely have to persist a bucket of input strings which map to the same value and decide how to resolve the conflicts if you are looking up by the mapped value.
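For comparison, here is the modulo mapping as a small Python sketch. The function name and constants are assumptions for this example. Unlike the Ruby version above, it keeps the result strictly between the two bounds, which is what the question asks for.

```python
MIN_CID = 2_200_000_000
MAX_CID = 5_800_000_000


def map_to_range(hex_string, low=MIN_CID, high=MAX_CID):
    """Map a hex string deterministically to an integer with low < n < high."""
    span = high - low - 1            # number of integers strictly between
    return int(hex_string, 16) % span + low + 1


did = "d86c52ec8b7e8a2ea315109627888fe6228d"
cid = map_to_range(did)
assert MIN_CID < cid < MAX_CID
assert cid == map_to_range(did)      # same input, same output
```

Parsing the string as a number avoids the per-process hash randomization mentioned above, but it does nothing about collisions, which still have to be handled separately.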
You could generate a 32-bit CRC, drop one bit, and add the result to 2,200,000,000. That gives you a maximum value of about 4.35 billion.
Alternately you could use all 32 bits of the CRC, but when the result is too large, append a zero to the input string and recalculate, repeating until you get a value in range.
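Both CRC variants can be sketched in Python with `zlib.crc32`, which returns an unsigned 32-bit value. The function names are made up for this example, and in both variants the lower bound itself is a possible (if unlikely) result.

```python
import zlib

MIN_CID = 2_200_000_000
MAX_CID = 5_800_000_000


def crc_drop_bit(text):
    """Keep 31 bits of the CRC: result is in MIN_CID .. MIN_CID + 2**31 - 1."""
    crc = zlib.crc32(text.encode("utf-8"))
    return MIN_CID + (crc & 0x7FFFFFFF)


def crc_retry(text):
    """Use all 32 bits; append "0" and retry while the result is too large."""
    data = text.encode("utf-8")
    while True:
        candidate = MIN_CID + zlib.crc32(data)
        if candidate < MAX_CID:
            return candidate
        data += b"0"


did = "d86c52ec8b7e8a2ea315109627888fe6228d"
print(crc_drop_bit(did), crc_retry(did))
```

The first variant only uses the lower part of the allowed range; the second covers all of it at the cost of an occasional extra CRC computation.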
