Find most unique words, penalizing words in common - algorithm

Suppose I have n classes like:
A: this,is,a,test,of,the,salmon,system
B: i,like,to,test,the,flounder,system
C: to,test,a,salmon,is,like,to,test,the,iodine,system
I want to get the most unique words for each class, so something with a ranking that gives me
A: salmon
B: flounder
C: iodine, salmon
(as their first elements ; it can be a ranking of all words)
How do I do this? There will be hundreds of input classes each with tens of thousands of tokens.
I'm guessing this is essentially the sort of thing any search engine back-end does, but I'd like a fairly simple standalone thing.

Using a language like Python, you can write this efficiently in 8 lines. For hundreds of groups, each with tens of thousands of tokens, the running time sounds like it will take at most a few minutes (although I haven't tried this on actual input).
1. Create a hash-based dictionary mapping each word to the number of its occurrences.
2. Iterate over all groups, and all words in a group, and update this dictionary.
3. For each group,
a. If you need a total ranking, sort with the value in the dictionary as the criterion.
b. If you need the top k, use an order-statistics type of algorithm, again using the value in the dictionary as the criterion.
Steps 1 + 2 should have expected linear complexity in the total number of words.
Step 3 is n log(n) per group for total ranking, and linear in the total number of words otherwise.
Here is the Python code for the top k. Assume all_groups is a list of lists of strings, and that k = 10.
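from collections import Counter
import heapq
import operator
# Count every word across all groups.
c = Counter()
for g in all_groups:
    c.update(g)
# For each group, print the k words with the lowest overall counts.
for g in all_groups:
    print(heapq.nsmallest(k, [(w, c[w]) for w in g], key=operator.itemgetter(1)))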
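The total-ranking variant (step 3a) could be sketched as follows. This is only an illustration under assumptions: the function name rank_words and the dictionary layout of groups are made up here, and duplicates within a group are dropped before sorting. Note that with raw counts on the question's sample, words that occur only once overall (such as "this" or "of" in class A) rank ahead of "salmon".

```python
from collections import Counter

def rank_words(groups):
    """Map each group name to its distinct words, rarest overall first."""
    counts = Counter()
    for words in groups.values():
        counts.update(words)
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # sorted() is stable, so ties keep that order.
    return {
        name: sorted(dict.fromkeys(words), key=counts.__getitem__)
        for name, words in groups.items()
    }

groups = {
    "A": "this,is,a,test,of,the,salmon,system".split(","),
    "B": "i,like,to,test,the,flounder,system".split(","),
    "C": "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
}
ranking = rank_words(groups)
for name, words in ranking.items():
    print(name, words[:3])
```

Sorting is O(n log n) per group, which matches the cost stated for the total ranking.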

As I understand your question, you want the least-used word in each class, counted across all classes. Here is my solution:
var a = "this,is,a,test,of,the,salmon,system".split(","),
    b = "i,like,to,test,the,flounder,system".split(","),
    c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
    map = {},
    min,
    key,
    parse = function(stringArr) {
        var length = stringArr.length,
            i, count;
        for (i = 0; i < length; i++) {
            if (count = map[stringArr[i]]) {
                map[stringArr[i]] = count + 1;
            }
            else {
                map[stringArr[i]] = 1;
            }
        }
    },
    get = function(stringArr) {
        min = Infinity;
        stringArr.forEach((item) => {
            if (map[item] < min) {
                min = map[item];
                key = item;
            }
        });
        console.log(key);
    };
parse(a);
parse(b);
parse(c);
get(a);
get(b);
get(c);

Ignore the classes, go through all the words and make a frequency table.
Then, for each class select the word with the lowest frequency.
Example in Python (slightly unpythonic solution to maintain readability for non-Python users):
a = "this,is,a,test,of,the,salmon,system".split(",")
b = "i,like,to,test,the,flounder,system".split(",")
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(",")
freq = {}
for word in a + b + c:
    freq[word] = (freq[word] if word in freq else 0) + 1
print("a: ", min(a, key=lambda w: freq[w]))
print("b: ", min(b, key=lambda w: freq[w]))
print("c: ", min(c, key=lambda w: freq[w]))

Related

A simple Increasing Mathematical Algorithm

I tried searching for this. I'm sure this basic algorithm is everywhere on the internet, in CS textbooks, etc., but I cannot find the right words to search for it.
What I want this algorithm to do is write "A" and "B" with the limit always increasing by 2. I want it to write A 3 times, then B 5 times, then A 7 times, then B 9 times and so on. And I plan to have 100 elements in total.
Like: AAABBBBBAAAAAAABBBBBBBBB...
I only want to use a single "for loop" for the entire 100 elements starting from 1 to 100. And just direct/sort "A" and "B" through "if/else if/ else".
I'm just asking for the basic mathematical algorithm behind it, showing it through any programming language would be better or redirecting me to such topic would also be fine.
You can do something like this:
There might be shorter answers, but I find this one easy to understand.
Basically, you keep a bool variable that tells you whether it's A's turn or B's. Then we keep a variable switch that tells us when to switch between them. times is updated with the number of times we need to print the next character.
A_B = true
times = 3 // 3,5,7,9,...
switch = 3 // 3,8,15,24,...
for (i from 1 to 100)
    if (A_B)
        print 'A'
    else
        print 'B'
    if (i == switch)
        times += 2
        switch += times
        A_B = !A_B
Python:
from math import sqrt
for n in range(1, 101):
    print("BA"[int(sqrt(n)) % 2], end=" ")
The parity of the square roots of the integers follows that pattern. (Think that (n+1)²-n² = 2n+1.)
If you prefer to avoid the square root, it suffices to use an extra variable that tracks the integer square root (here r stays one above it) and keep it updated:
r = 1
for n in range(1, 101):
    if r * r <= n:
        r += 1
    print("AB"[r % 2], end=" ")
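To check the parity argument, here is a small sketch that builds the whole string and measures its run lengths. The helper name ab_pattern is made up for this illustration, and math.isqrt assumes Python 3.8 or later.

```python
from itertools import groupby
from math import isqrt

def ab_pattern(length):
    """Return the first `length` letters: 'A' where isqrt(n) is odd, 'B' where it is even."""
    return "".join("BA"[isqrt(n) % 2] for n in range(1, length + 1))

pattern = ab_pattern(100)
# Length of each consecutive run of identical letters.
run_lengths = [len(list(run)) for _, run in groupby(pattern)]
print(pattern)
print(run_lengths)
```

The runs grow by 2 because the integers n with isqrt(n) = k are exactly k², ..., (k+1)² - 1, which is 2k + 1 values. The final run is cut short when the 100-letter limit is reached.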
Here is a snippet you can test on this page. It is an example for about 500 letters in total; you can easily modify it for 100 letters. It is flexible enough that you can change the constants to produce many different strings in the same manner.
var toRepeat = ['A', 'B'];
var result='', j, i=3;
var sum=i;
var counter = 0;
while (sum < 500) {
    j = counter % 2;
    result = result + toRepeat[j].repeat(i);
    sum = sum + i;
    i = i + 2;
    counter++;
}
document.getElementById('hLetters').innerHTML=result;
console.log(result);
<div id="hLetters"></div>
If you want it to be exactly 500 / 100 letters, just use a substring function to trim off the extra letters from the end.
To get 100 groups of A and B with increasing length of 3, 5, 7 and so on, you can run this Python code:
''.join(('B' if i % 2 else 'A') * (2 * i + 3) for i in range(100))
The output is a string of 10200 characters.
If you want the output to have only 100 characters, you can use:
import math
''.join(('B' if math.ceil(math.sqrt(i)) % 2 else 'A') for i in range(2, 102))
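In JS you can start with something like this:
var $res = "";
var count2 = 0;
// i is the run length: 3, 5, 7, ...
for (var i = 3; $res.length < 100; i = i + 2) {
    var count = 0;
    // alternate between 'A' (65) and 'B' (66), stopping at 100 characters
    while (count < i && $res.length < 100) {
        $res = $res.concat(String.fromCharCode(65 + count2 % 2));
        count++;
    }
    count2++;
}
alert($res);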

Given a number, produce another random number that is the same every time and distinct from all other results

Basically, I would like help designing an algorithm that takes a given number and returns a random number that is unrelated to the first number. The stipulations are that a) the output number will always be the same for the same input number, and b) within a certain range (e.g. 1-100), all output numbers are distinct, i.e. no two different input numbers under 100 will give the same output number.
I know it's easy to do by creating an ordered list of numbers, shuffling them randomly, and then returning the input's index. But I want to know if it can be done without any caching at all. Perhaps with some kind of hashing algorithm? Mostly the reason for this is that if the range of possible outputs were much larger, say 10000000000, then it would be ludicrous to generate an entire range of numbers and then shuffle them randomly, if you were only going to get a few results out of it.
Doesn't matter what language it's done in, I just want to know if it's possible. I've been thinking about this problem for a long time and I can't think of a solution besides the one I've already come up with.
Edit: I just had another idea; it would be interesting to have another algorithm that returned the reverse of the first one. Whether or not that's possible would be an interesting challenge to explore.
This sounds like a non-repeating random number generator. There are several possible approaches to this.
As described in this article, we can generate them by selecting a prime number p that satisfies p % 4 = 3 and is large enough (greater than the maximum value in the output range), and then generating the numbers this way:
int randomNumberUnique(int range_len, int p, int x)
    if (x * 2 < p)
        return (x * x) % p
    else
        return p - (x * x) % p
This algorithm will cover all values in [0 , p) for an input in range [0 , p).
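A runnable sketch of the same idea in Python, for checking the claim on a small prime. The function name permute_qr and the choice p = 103 are assumptions made for this illustration; any prime with p % 4 == 3 should behave the same way.

```python
def permute_qr(x, p):
    """Map x in [0, p) to a distinct value in [0, p); p must be a prime with p % 4 == 3."""
    if not 0 <= x < p:
        raise ValueError("x must be in [0, p)")
    residue = (x * x) % p
    # The lower half maps to quadratic residues, the upper half to their negations.
    return residue if 2 * x < p else p - residue

p = 103  # prime, and 103 % 4 == 3
outputs = [permute_qr(x, p) for x in range(p)]
print(outputs[:10])
```

Because -1 is not a quadratic residue modulo such a prime, the two halves never collide, so the outputs form a permutation of 0..p-1.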
Here's an example in C#:
private void DoIt()
{
    const long m = 101;
    const long x = 387420489; // must be coprime to m
    var multInv = MultiplicativeInverse(x, m);
    var nums = new HashSet<long>();
    for (long i = 0; i < 100; ++i)
    {
        var encoded = i*x%m;
        var decoded = encoded*multInv%m;
        Console.WriteLine("{0} => {1} => {2}", i, encoded, decoded);
        if (!nums.Add(encoded))
        {
            Console.WriteLine("Duplicate");
        }
    }
}
private long MultiplicativeInverse(long x, long modulus)
{
    return ExtendedEuclideanDivision(x, modulus).Item1%modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
    if (a < 0)
    {
        var result = ExtendedEuclideanDivision(-a, b);
        return Tuple.Create(-result.Item1, result.Item2);
    }
    if (b < 0)
    {
        var result = ExtendedEuclideanDivision(a, -b);
        return Tuple.Create(result.Item1, -result.Item2);
    }
    if (b == 0)
    {
        return Tuple.Create(1L, 0L);
    }
    var q = a/b;
    var r = a%b;
    var rslt = ExtendedEuclideanDivision(b, r);
    var s = rslt.Item1;
    var t = rslt.Item2;
    return Tuple.Create(t, s - q*t);
}
That generates numbers in the range 0-100, from input in the range 0-100. Each input results in a unique output.
It also shows how to reverse the process, using the multiplicative inverse.
You can extend the range by increasing the value of m. x must be coprime with m.
Code cribbed from Eric Lippert's article, A practical use of multiplicative inverses, and a few of the previous articles in that series.
You cannot have completely unrelated numbers (particularly if you want the reverse as well).
There is the concept of the modular inverse of a number, but this works only if the range bound is a prime, e.g. 100 will not work; you would need 101 (a prime). This can provide you with a pseudo-random number if you want.
Here is the concept of modulo inverse:
If there are two numbers a and b, such that
(a * b) % p = 1
where p is any number, then
a and b are modular inverses of each other.
For this to be true, if we want to find the modular inverse of a with respect to a number p, then a and p must be co-prime, i.e. gcd(a, p) = 1
So, for all numbers in a range to have modular inverses, the range bound must be a prime number.
A few outputs for range bound 101 will be:
1 == 1
2 == 51
3 == 34
4 == 76
etc.
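These values can be reproduced with a short Python sketch. It assumes Python 3.8 or later, where the built-in pow accepts a negative exponent together with a modulus; the wrapper name mod_inverse is made up for this illustration.

```python
def mod_inverse(a, p):
    """Return b in [1, p) with (a * b) % p == 1; requires gcd(a, p) == 1."""
    return pow(a, -1, p)

p = 101
inverses = {k: mod_inverse(k, p) for k in range(1, p)}
for k in range(1, 5):
    print(k, "==", inverses[k])
```

Since every k in 1..100 is coprime to the prime 101, each one has an inverse, and the inverses are again a permutation of 1..100.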
EDIT:
Hey...actually you know, you can use the combined approach of modulo inverse and the method as defined by #Paul. Since every pair will be unique and all numbers will be covered, your random number can be:
random(k) = randomUniqueNumber(ModuloInverse(k), p) //this is Paul's function

Finding the index of a given permutation

I'm reading the numbers 0, 1, ..., (N - 1) one by one in some order. My goal is to find the lexicographic index of this given permutation, using only O(1) space.
This question was asked before, but all the algorithms I could find used O(N) space. I'm starting to think that it's not possible. But it would really help me a lot with reducing the number of allocations.
Considering the following data:
chars = [a, b, c, d]
perm = [c, d, a, b]
ids = get_indexes(perm, chars) = [2, 3, 0, 1]
A possible solution for permutation with repetitions goes as follows:
len = length(perm) (len = 4)
num_chars = length(chars) (num_chars = 4)
base = num_chars ^ len (base = 4 ^ 4 = 256)
base = base / num_chars (base = 256 / 4 = 64)
id = base * ids[0] (id = 64 * 2 = 128)
base = base / num_chars (base = 64 / 4 = 16)
id = id + (base * ids[1]) (id = 128 + (16 * 3) = 176)
base = base / num_chars (base = 16 / 4 = 4)
id = id + (base * ids[2]) (id = 176 + (4 * 0) = 176)
base = base / num_chars (base = 4 / 4 = 1)
id = id + (base * ids[3]) (id = 176 + (1 * 1) = 177)
Reverse process:
id = 177
(id / (4 ^ 3)) % 4 = (177 / 64) % 4 = 2 % 4 = 2 -> chars[2] -> c
(id / (4 ^ 2)) % 4 = (177 / 16) % 4 = 11 % 4 = 3 -> chars[3] -> d
(id / (4 ^ 1)) % 4 = (177 / 4) % 4 = 44 % 4 = 0 -> chars[0] -> a
(id / (4 ^ 0)) % 4 = (177 / 1) % 4 = 177 % 4 = 1 -> chars[1] -> b
The number of possible permutations is given by num_chars ^ num_perm_digits, having num_chars as the number of possible characters, and num_perm_digits as the number of digits in a permutation.
This requires O(1) in space, considering the initial list as a constant cost; and it requires O(N) in time, considering N as the number of digits your permutation will have.
Based on the steps above, you can do:
function identify_permutation(perm, chars) {
    for (i = 0; i < length(perm); i++) {
        ids[i] = get_index(perm[i], chars);
    }
    len = length(perm);
    num_chars = length(chars);
    index = 0;
    base = num_chars ^ len; // ^ denotes exponentiation here
    for (i = 0; i < len; i++) {
        base = base / num_chars;
        index += base * ids[i];
    }
    return index;
}
It's pseudocode, but it's quite easy to convert to any language (:
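For reference, one possible Python rendering of the forward and reverse process. The names encode and decode are invented here, and the forward direction uses Horner's rule, which yields the same number as the step-by-step division of base shown earlier.

```python
def encode(perm, chars):
    """Read perm as a number written in base len(chars), most significant digit first."""
    base = len(chars)
    index = 0
    for symbol in perm:
        index = index * base + chars.index(symbol)
    return index

def decode(index, chars, length):
    """Inverse of encode: recover `length` symbols from the index."""
    base = len(chars)
    symbols = []
    for _ in range(length):
        index, digit = divmod(index, base)
        symbols.append(chars[digit])
    return symbols[::-1]  # digits were produced least significant first

chars = ["a", "b", "c", "d"]
perm = ["c", "d", "a", "b"]
index = encode(perm, chars)
print(index, decode(index, chars, len(perm)))
```

Note that this numbering counts sequences with repetition, so the result is not the rank among the N! true permutations.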
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
1. Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
2. Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
3. Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
4. Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combination:
public void Test10Choose5()
{
    String S;
    int Loop;
    int N = 10; // Total number of elements in the set.
    int K = 5; // Total number of elements in each group.
    // Create the bin coeff object required to get all
    // the combos for this N choose K combination.
    BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
    int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
    // The KIndexes array specifies the indexes for a lexicographic element.
    int[] KIndexes = new int[K];
    StringBuilder SB = new StringBuilder();
    // Loop through all the combinations for this N choose K case.
    for (int Combo = 0; Combo < NumCombos; Combo++)
    {
        // Get the k-indexes for this combination.
        BC.GetKIndexes(Combo, KIndexes);
        // Verify that the KIndexes returned can be used to retrieve the
        // rank or lexicographic order of the KIndexes in the table.
        int Val = BC.GetIndex(true, KIndexes);
        if (Val != Combo)
        {
            S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
            Console.WriteLine(S);
        }
        SB.Remove(0, SB.Length);
        for (Loop = 0; Loop < K; Loop++)
        {
            SB.Append(KIndexes[Loop].ToString());
            if (Loop < K - 1)
                SB.Append(" ");
        }
        S = "KIndexes = " + SB.ToString();
        Console.WriteLine(S);
    }
}
You should be able to port this class over fairly easily to the language of your choice. You probably will not have to port over the generic part of the class to accomplish your goals. Depending on the number of combinations you are working with, you might need to use a bigger word size than 4 byte ints.
There is a java solution to this problem on geekviewpoint. It has a good explanation for why it's true and the code is easy to follow. http://www.geekviewpoint.com/java/numbers/permutation_index. It also has a unit test that runs the code with different inputs.
There are N! permutations, so representing the index takes about log2(N!) bits, which is at least N bits once N >= 4.
Here is a way to do it if you want to assume that arithmetic operations are constant time:
def permutationIndex(numbers):
    n=len(numbers)
    result=0
    j=0
    while j<n:
        # Determine factor, which is the number of possible permutations of
        # the remaining digits.
        i=1
        factor=1
        while i<n-j:
            factor*=i
            i+=1
        i=0
        # Determine index, which is how many previous digits there were at
        # the current position.
        index=numbers[j]
        while i<j:
            # Only the digits that weren't used so far are valid choices, so
            # the index gets reduced if the number at the current position
            # is greater than one of the previous digits.
            if numbers[i]<numbers[j]:
                index-=1
            i+=1
        # Update the result.
        result+=index*factor
        j+=1
    return result
I've purposefully written out certain calculations that could be done more simply using some Python built-in operations, but I wanted to make it more obvious that no extra non-constant amount of space was being used.
As maxim1000 noted, the number of bits required to represent the result will grow quickly as n increases, so eventually big integers will be required, which no longer have constant-time arithmetic, but I think this code addresses the spirit of your question.
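One way to gain confidence in this kind of ranking is to compare it against brute-force enumeration for small N. The sketch below is a compact restatement of the same idea using built-ins, not the answer's exact code; permutation_rank is a name invented for this check.

```python
from itertools import permutations
from math import factorial

def permutation_rank(perm):
    """Lexicographic rank of a permutation of 0..n-1."""
    n = len(perm)
    rank = 0
    for j, value in enumerate(perm):
        # Values smaller than perm[j] that were already used cannot appear here.
        used_smaller = sum(1 for i in range(j) if perm[i] < value)
        rank += (value - used_smaller) * factorial(n - 1 - j)
    return rank

# itertools.permutations yields tuples in lexicographic order for sorted input.
ranks = [permutation_rank(p) for p in permutations(range(5))]
print(ranks[:6], permutation_rank((2, 3, 0, 1)))
```

Like the original, this rescans earlier elements for each position, so it takes O(N²) time and needs the permutation to stay accessible.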
Nothing really new in the idea, but here is a fully matrix-based method with no explicit loop or recursion (using NumPy, but easy to adapt):
import numpy as np
import math
vfact = np.vectorize(math.factorial, otypes='O')
def perm_index(p):
    return np.dot( vfact(range(len(p)-1, -1, -1)),
                   p-np.sum(np.triu(p>np.vstack(p)), axis=0) )
I just wrote a program in Visual Basic that can directly calculate the index of any permutation, or the permutation corresponding to a given index, for up to 17 elements (this limit is due to my compiler approximating numbers above 17! in scientific notation).
If you are interested, I can send the program or publish it somewhere for download.
It works fine and can be useful for testing and comparing the output of your code.
I used the method of James D. McCaffrey called factoradic, and you can read about it here and also here (in the discussion at the end of the page).

Finding the closest number in a random set

Say I have a set of 10 random numbers between 0 and 100.
An operator also gives me a random number between 0 and 100.
Then I have to find the number in the set that is closest to the number the operator gave me.
example
set = {1,10,34,39,69,89,94,96,98,100}
operator number = 45
return = 39
And how do I translate this into code? (JavaScript or something)
If the set is ordered, do a binary search to find the value (or the two values) that are closest. Then decide which of the two is closest by subtracting each from the target and comparing the differences.
If the set is not ordered, just iterate through it (sorting it would itself take more than one pass), and for each member check whether its difference is smaller than the smallest difference you have seen so far; if it is, record it as the new smallest difference and that number as the new candidate answer.
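public int FindClosest(int targetVal, int[] set)
{
    // Start above any possible difference so the first element always becomes a candidate.
    int dif = int.MaxValue, cand = 0;
    foreach (int x in set)
        if (Math.Abs(x - targetVal) < dif)
        {
            dif = Math.Abs(x - targetVal);
            cand = x;
        }
    return cand;
}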
Given an array called input, create another array of the same size.
Each element of this new array is Math.abs(input[i] - operatorNumber).
Select the index of the minimum element (let's call it k).
Your answer is input[k].
NB
sorting is not needed
you can do it without the extra array
Sample implementation in JavaScript
function closestTo(number, set) {
    var closest = set[0];
    var prev = Math.abs(set[0] - number);
    for (var i = 1; i < set.length; i++) {
        var diff = Math.abs(set[i] - number);
        if (diff < prev) {
            prev = diff;
            closest = set[i];
        }
    }
    return closest;
}
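How about this:
1) Put the set into a binary search tree.
2) Insert the operator number into the tree.
3) Return whichever of the operator's in-order neighbours (predecessor or successor) is closer. The operator's parent is always one of these two, but it is not always the closer one.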
order the set
binary search for the input
if you end up between two elements, check the difference, and return the one with the smallest difference.
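The sorted-set approach might look like this in Python, using the standard bisect module. The function name closest_sorted is made up for this sketch, and on a tie it returns the lower of the two neighbours, which is an arbitrary choice.

```python
from bisect import bisect_left

def closest_sorted(sorted_values, target):
    """Return the element of an ascending list that is nearest to target."""
    if not sorted_values:
        raise ValueError("sorted_values must not be empty")
    pos = bisect_left(sorted_values, target)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_values[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda v: abs(v - target))

values = [1, 10, 34, 39, 69, 89, 94, 96, 98, 100]
print(closest_sorted(values, 45))
```

The search itself is O(log n); sorting first only pays off when many lookups are made against the same set.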
Someone tagged this question Mathematica, so here's a Mathematica answer:
set = {1,10,34,39,69,89,94,96,98,100};
opno = 45;
set[[Flatten[
Position[set - opno, i_ /; Abs[i] == Min[Abs[set - opno]]]]]]
It works when there are multiple elements of set equally distant from the operator number.
Python example:
#!/usr/bin/env python
import random
from operator import itemgetter
sample = random.sample(range(100), 10)
pivot = random.randint(0, 100)
print('sample: ', sample)
print('pivot:', pivot)
# Pair each index with its distance to the pivot, sort by distance,
# and take the index of the first (closest) entry.
print('closest:', sample[
    sorted(
        map(lambda i, e: (i, abs(e - pivot)), range(10), sample),
        key=itemgetter(1)
    )[0][0]])
# sample: [61, 2, 3, 85, 15, 18, 19, 8, 66, 4]
# pivot: 51
# closest: 61

Efficient algorithm to randomly select items with frequency

Given an array of $n$ word-frequency pairs:
[ $(w_0, f_0), (w_1, f_1), \ldots, (w_{n-1}, f_{n-1})$ ]
where $w_i$ is a word, $f_i$ is an integer frequency, and the sum of the frequencies is $\sum_i f_i = m$,
I want to use a pseudo-random number generator (pRNG) to select $p$ words $w_{j_0}, w_{j_1}, \ldots, w_{j_{p-1}}$ such that
the probability of selecting any word is proportional to its frequency:
$P(w_i = w_{j_k}) = P(i = j_k) = f_i / m$
(Note, this is selection with replacement, so the same word could be chosen every time.)
I've come up with three algorithms so far:
1. Create an array of size $m$, and populate it so the first $f_0$ entries are $w_0$, the next $f_1$ entries are $w_1$, and so on, so the last $f_{n-1}$ entries are $w_{n-1}$:
[ $w_0, \ldots, w_0, w_1, \ldots, w_1, \ldots, w_{n-1}, \ldots, w_{n-1}$ ]
Then use the pRNG to select $p$ indices in the range $0 \ldots m-1$, and report the words stored at those indices.
This takes $O(n + m + p)$ work, which isn't great, since $m$ can be much much larger than $n$.
2. Step through the input array once, computing
$m_i = \sum_{h \le i} f_h = m_{i-1} + f_i$
and after computing $m_i$, use the pRNG to generate a number $x_k$ in the range $0 \ldots m_i - 1$ for each $k$ in $0 \ldots p-1$
and select $w_i$ for $w_{j_k}$ (possibly replacing the current value of $w_{j_k}$) if $x_k < f_i$.
This requires $O(n + np)$ work.
3. Compute $m_i$ as in algorithm 2, and generate the following array of $n$ word-frequency-partial-sum triples:
[ $(w_0, f_0, m_0), (w_1, f_1, m_1), \ldots, (w_{n-1}, f_{n-1}, m_{n-1})$ ]
and then, for each $k$ in $0 \ldots p-1$, use the pRNG to generate a number $x_k$ in the range $0 \ldots m-1$, then do binary search on the array of triples to find the $i$ s.t. $m_i - f_i \le x_k < m_i$, and select $w_i$ for $w_{j_k}$.
This requires $O(n + p \log n)$ work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
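For concreteness, the third algorithm (partial sums plus binary search) could be sketched in Python as below. The function names and the sample word list are illustrative assumptions, and only the partial sums are stored rather than full triples.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pick_index(partial_sums, x):
    """Return the i whose interval [m_i - f_i, m_i) contains x."""
    return bisect_right(partial_sums, x)

def weighted_sample(pairs, p, rng=random):
    """Draw p words with replacement, each with probability f_i / m."""
    words = [word for word, _ in pairs]
    partial_sums = list(accumulate(freq for _, freq in pairs))
    m = partial_sums[-1]
    return [words[pick_index(partial_sums, rng.randrange(m))] for _ in range(p)]

pairs = [("Hero", 80), ("Monkey", 4), ("Shoe", 13), ("Highway", 3)]
print(weighted_sample(pairs, 7, random.Random(0)))
```

bisect_right returns the first position whose partial sum exceeds x, which is exactly the index satisfying the inequality in the description, and it skips words with zero frequency.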
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms
You could create the target array, then loop through the words, determining for each one the probability that it should be picked, and replace the words in the array according to a random number.
For the first word the probability would be $f_0/m_0$ (where $m_n = f_0 + \ldots + f_n$), i.e. 100%, so all positions in the target array would be filled with $w_0$.
For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.
Example code in C#:
public class WordFrequency {
    public string Word { get; private set; }
    public int Frequency { get; private set; }
    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }
}
WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};
int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
There are $n$ partitions, all of the same width $r$ s.t. $nr = m$.
Each partition contains two words in some ratio (which is stored with the partition).
For each word $w_i$, $f_i = \sum_{t \,:\, w_i \in t} r \times \mathrm{ratio}(t, w_i)$, summing over the partitions $t$ that contain $w_i$.
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from $0 \ldots n-1$ at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the $p$ selections can be done in $O(p)$ work, given such a partition.
The reason that such a partitioning exists is that there exists a word $w_i$ s.t. $f_i < r$ if and only if there exists a word $w_{i'}$ s.t. $f_{i'} > r$, since $r$ is the average of the frequencies.
Given such a pair $w_i$ and $w_{i'}$ we can replace them with a pseudo-word $w'_i$ of frequency $f'_i = r$ (that represents $w_i$ with probability $f_i/r$ and $w_{i'}$ with probability $1 - f_i/r$) and a new word $w'_{i'}$ of adjusted frequency $f'_{i'} = f_{i'} - (r - f_i)$ respectively. The average frequency of all the words will still be $r$, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency $r$ and is made of two words with frequency $\ne r$, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of $n$ pseudo-words which are the desired partition.
To construct this partition in O(n) time,
go through the list of the words once, constructing two lists:
one of words with frequency ≤ r
one of words with frequency > r
then pull a word from the first list
if its frequency = r, then make it into a one element partition
otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.
This actually still works if the number of partitions $q > n$ (you just have to prove it differently). If you want to make sure that $r$ is integral, and you can't easily find a factor $q$ of $m$ s.t. $q > n$, you can pad all the frequencies by a factor of $n$, so $f'_i = n f_i$, which updates $m' = mn$ and sets $r' = m$ when $q = n$.
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
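As a cross-check of the construction, here is a Python sketch along the same lines. The function names and the sample word list are assumptions made for this illustration. It assumes integer frequencies with a positive total, and scales them by n so the partition width m stays integral.

```python
import random

def build_alias_table(pairs):
    """Return (table, m); each entry is (word, other_word, share_of_word_out_of_m)."""
    n = len(pairs)
    m = sum(freq for _, freq in pairs)
    scaled = [[word, freq * n] for word, freq in pairs]
    small = [item for item in scaled if item[1] <= m]
    large = [item for item in scaled if item[1] > m]
    table = []
    while small:
        word, adj = small.pop()
        other = None
        if adj < m:
            # Borrow the missing mass from a word that has more than average.
            other, other_adj = large.pop()
            other_adj -= m - adj
            (small if other_adj <= m else large).append([other, other_adj])
        table.append((word, other, adj))
    return table, m

def alias_sample(table, m, p, rng=random):
    """Draw p words with replacement using O(1) work per draw."""
    result = []
    for _ in range(p):
        word, other, adj = table[rng.randrange(len(table))]
        result.append(word if rng.randrange(m) < adj else other)
    return result

pairs = [("Hero", 80), ("Monkey", 4), ("Shoe", 13), ("Highway", 3)]
table, m = build_alias_table(pairs)
print(alias_sample(table, m, 7, random.Random(0)))
```

Each loop iteration removes exactly one word and exactly m units of scaled mass, which is why the table ends up with n entries and the list of large words is empty at the end.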
In ruby:
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }
  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [ word, freq*n ]
  end.partition do |word,adj_freq|
    adj_freq <= m
  end
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
      other_word
    end
    [ word, other_word , adj_freq ]
  end
  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
