Split test groups base on GUID - algorithm

Users in the system are identified by GUID, and with a new feature, I want to divide users into two groups - test and control.
Is there a easy way to split users into one of the two group with a 50/50 chance, based on their GUID?
e.g. If the nth character's ascii code is an odd -> test group, otherwise control group.
What about 70/30, or other ratio?
The reason I want to classify users base on GUID, is because later I can easily tell which users are in which group and compare the performance between two groups, without having to keep track of the group assignment - I simply need to calculate it again.

As Derek Li notes, the GUID's bits might be based on a timestamp, so you shouldn't use them directly.
The safest solution is to hash the GUID using a hash function like MurmurHash. This will produce a random number (but the same random number every time for any given GUID) which you can then use to do the split.
For example, you could do a 30/70 split like this:
function isInTestGroup(user) {
var hash = murmurHash(user.guid);
return (hash % 100) < 30;
}

If some character in the GUID has a 1 in 16 change of being one of the following characters: "0123456789ABCEDF", then perhaps you could test a scheme that determines placement by that character.
Say the last character of the guid called c has a 1/16 chance of being any hex digit:
for 50/50 distribution -> c <= 7 for group 1, c > 7 for group 2
for 70/30 c <= A for group 1, c > A for group 2
etc...

Related

I'm trying to add a value to individual characters

I'm working on a Scrabble assignment and I'm trying to assign values to letters. Like in Scrabble, A, E, I, O, U, L, N, S, T, R are all equal to 1. I had some help in figuring out how to add the score up once I assign values, but now I'm trying to figure out how to assign values. Is there a way to create one variable for all the values? That doesn't really make sense to me.
I was also thinking I could do an if-else statement. Like if the letter equals any of those letters, value = 1, else if the letter equals D or G, value = 2 and so on. There are 7 different scores so it's kind of annoying and not efficient, but I'm not really sure what a better way might be. I'm new to programming, a novice, so I'm looking for advice that takes my level into account.
I have started my program by reading words from a text file into an arraylist. I successfully printed the arraylist, so I know that part worked. Next I'm working on how to read each character of each word and assign a value. Last, I will figure out how to sort it.
it's me from the other question again. You can definitely do an if-statement, but if I'm not wrong Scrabble has 8 different values for letters, so you would need 8 “if”s and also since there are around 25 letters (depending on language) you would have to handle all 25 some way in the if-statements which would be quite clunky in my opinion.
I think the best option is to use a Hash-table. A hash-table is basically like a dictionary where you look up a key and get a value. So I would add each letter as a key and keep the corresponding value as the value. It would look like this:
//initialize empty hash map
Hashtable<String, Integer> letterScores = new Hashtable<>();
//now we can add values with "put"
letterScores.put("A",1)
letterScores.put("B",3)
letterScores.put("X",8)
//etc
To access an element from the hash table we can use the "get"-method.
//returns 1
letterScores.get("A")
So when looping through our word we would essentially get something like this to calculate the value of the word:
int sumValue = 0;
for(int i =0; i < word.length(); i++)}
sumValue += letterScores.get(word.charAt(i))
}
For each character we grab the value entry from the letterScores hash table where we have saved all our letter's corresponding values.

How to Randomly Assign to Groups of Different Sizes

Say I have a dataset and I want to assign observations to different groups, the size of groups determined by the data. For example, suppose that this is the data:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
*Create global with regions
global region NE NCntrl South West
*Count the number in each region
bys reg (pop): gen reg_N=_N
tab reg
There are four reg groups, all of different sizes. Now, I want to randomly assign observations to the four groups. This is accomplished below by generating a random number and then assigning observations to one of the groups based on the random number.
*Generate random number
set seed 1
gen random = runiform()
sort random
*Assign observations to number based on random sorting
egen reg_rand = seq(), from(1) to (4)
*Map number to region
gen reg_new = ""
global count 1
foreach i in $region {
replace reg_new = "`i'" if reg_rand==$count
global count = $count + 1
}
bys reg_new: gen reg_new_N = _N
tab reg_new
This is not what I want, though. Instead of using the seq() command, which creates groups of equal sizes (assuming N divided by number of groups is a whole number), I would like to randomly assign based on the size of the original groups. In this case, that is equivalent to reg_N. For example, there would be 12 observations that have a reg_new value of NCntrl.
I might have one solution similar to https://stats.idre.ucla.edu/stata/faq/how-can-i-randomly-assign-observations-to-groups-in-stata/. The idea would be to save the results of tab reg into a macro or matrix, and then use a loop and replace to cycle through the observations, which are sorted by a random number. Assume that there are many, many more groups than the four in this toy example. Is there a more reasonable way to accomplish this?
It looks like you want to shuffle around the values stored in a group variable across observations. You can do this by reducing the data to the group variable, sorting on a variable that contains random values and then using an unmatched merge to associate the random group identifiers to the original observations.
Assuming that the data example is stored in a file called "data_example.dta" and is currently loaded into memory, this would look like:
set seed 234
keep reg
rename reg reg_new
gen double u = runiform()
sort u reg_new
merge 1:1 _n using "data_example.dta", nogen
tab reg reg_new

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case) and the generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID but I have a specific requirement that prevents me from utilizing them
some characters are disallowed, i.e. I, l and 1 are lookalikes, and O and 0 as well
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
$chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
$uuid = null;
while (true) {
$uuid = substr(str_shuffle($chars), 0, 8);
if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first())) {
break;
}
}
return $uuid;
}
Please spare me the critique, we live in an agile world and this implementation is functional and is quick to code.
With a small set of data it works beautifully. However if I have 10 million entries in the blacklist and try to create 1000 more it fails flat as it takes 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate so it fits in the bandwidth of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(hash)
Generate five of the digits from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1000 in this case) I know there wasn't a collision and I can commit the transaction. Otherwise I need to rollback the chunk and generate another 1000 codes.

Hashing a long integer ID into a smaller string

Here is the problem, where I need to transform an ID (defined as a long integer) to a smaller alfanumeric identifier. The details are the following:
Each individual on the problem as an unique ID, a long integer of size 13 (something like 123123412341234).
I need to generate a smaller representation of this unique ID, a alfanumeric string, something like A1CB3X. The problem is that 5 or 6 character length will not be enough to represent such a large integer.
The new ID (eg A1CB3X) should be valid in a context where we know that only a small number of individuals are present (less than 500). The new ID should be unique within that small set of individuals.
The new ID (eg A1CB3X) should be the result of a calculation made over the original ID. This means that taking the original ID elsewhere and applying the same calculation, we should get the same new ID (eg A1CB3X).
This calculation should occur when the individual is added to the set, meaning that not all individuals belonging to that set will be know at that time.
Any directions on how to solve such a problem?
Assuming that you don't need a formula that goes in both directions (which is impossible if you are reducing a 13-digit number to a 5 or 6-character alphanum string):
If you can have up to 6 alphanumeric characters that gives you 366 = 2,176,782,336 possibilities, assuming only numbers and uppercase letters.
To map your larger 13-digit number onto this space, you can take a modulo of some prime number slightly smaller than that, for example 2,176,782,317, the encode it with base-36 encoding.
alphanum_id = base36encode(longnumber_id % 2176782317)
For a set of 500, this gives you a
2176782317P500 / 2176782317500 chance of a collision
(P is permutation)
Best option is to change the base to 62 using case sensitive characters
If you want it to be shorter, you can add unicode characters. See below.
Here is javascript code for you: https://jsfiddle.net/vewmdt85/1/
function compress(n) {
var symbols = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð'.split('');
var d = n;
var compressed = '';
while (d >= 1) {
compressed = symbols[(d - (symbols.length * Math.floor(d / symbols.length)))] + compressed;
d = Math.floor(d / symbols.length);
}
return compressed;
}
$('input').keyup(function() {
$('span').html(compress($(this).val()))
})
$('span').html(compress($('input').val()))
How about using some base-X conversion, for example 123123412341234 becomes 17N644R7CI in base-36 and 9999999999999 becomes 3JLXPT2PR?
If you need a mapping that works both directions, you can simply go for a larger base.
Meaning: using base 16, you can reduce 1 to 16 to a single character.
So, base36 is the "maximum" that allows for shorter strings (when 1-1 mapping is required)!

Merge sort for GPU

Im trying to implement a merge sort using opencl wrapper.
The problem is, each pass needs a different indexing algorithm for threads' memory access.
Some info about this:
First pass(numbers indicate elements and arrows indicate sorting)
0<-->1 2<--->3 4<--->5 6<--->7
group0 group1 group2 group3 ===>1 thread per group, N/2 groups total
Second pass(all parallel)
0<------>2 4<------>6
1<------->3 5<------->7
group0 group1 ========> 2 threads per group, N/4 groups
Next pass
0<--------------->4 8<------------------>12
1<--------------->5 9<------------------->13
2<--------------->6 10<---------------->14
3<--------------->7 11<--------------->15
group0 group1 ===>4 threads per group
but N/8 groups
So, an element of a sub-group cannot make any comparison between another group's element.
I cannot simply do
A[i]<---->A[i+1] or A[i]<---->A[i+4]
because these cover
A[1]<---->A[2] and A[4]<----->A[8]
which are wrong.
I needed a more complex indexing algorithm which has potential to use same number of threads for all passes.
Pass n: global id(i): 0,1, 2,3 4,5 , ... to compare id 0,1 4,5 8,9
looks like compare id1=(i/2)*4+i%2
Pass n+1: global id(i): 0,1,2,3, 4,5,6,7, ... to compare id 0,1,2,3, 8,9,10,11, 16,17
looks like compare id=(i/4)*8+i%4
Pass n+2: global id(i): 0,1,2,3,... 8,9,10,... to compare id 0,1,2,3,... 16,17,18,...
looks like compare id=(i/8)*16+i%8
compare id1 = ( i/( pow(2,passN) ) ) * pow(2,passN+1) + i%( pow(2, passN) )
compare id2 = compare id1 + pow(2,passN)
so, in the kernel string, can it be
int i=get_global_id(0);
int compareId1=( i/( pow(2,passN) ) ) * pow(2,passN+1) + i%( pow(2, passN) );
int compareId2=compareId1+pow(2,passN);
if(compareId1!=compareId2)
{
if(A[compareId1]>A[compareId2])
{
xorSwapIdiom(A,compareId1,compareId2,B);
}
else
{
streamThrough(A,compareId1,compareId2,B);
}
}
else
{
// this can happen only for the first pass
// needs a different kernel structure for that
}
but Im not sure.
Question: Can you give any directions about which memory access pattern would not leak while satisfying the "no compare between different groups" condition?
Needed to hard-reset my computer many times already, trying different algortihms(memory leaked, black screen, crash, restart), this one is the last and I fear it can crash entire OS.
I already tried a much simpler version with a decreasing number of threads per pass, had some bad performance.
Edit: I tried the upper code, it sorts reversely ordered arrays. But not randomized arrays.

Resources