Related
I'm trying to solve DNA problem which is more of improved(?) version of LCS problem.
In the problem, there is string which is string and semi-substring which allows part of string to have one or no letter skipped. For example, for string "desktop", it has semi-substring {"destop", "dek", "stop", "skop","desk","top"}, all of which has one or no letter skipped.
Now, I am given two DNA strings consisting of {a,t,g,c}. I"m trying to find longest semi-substring, LSS. and if there is more than one LSS, print out the one in the fastest order.
For example, two dnas {attgcgtagcaatg, tctcaggtcgatagtgac} prints out "tctagcaatg"
and aaaattttcccc, cccgggggaatatca prints out "aattc"
I'm trying to use common LCS algorithm but cannot solve it with tables although I did solve the one with no letter skipped. Any advice?
This is a variation on the dynamic programming solution for LCS, written in Python.
First I'm building up a Suffix Tree for all the substrings that can be made from each string with the skip rule. Then I'm intersecting the suffix trees. Then I'm looking for the longest string that can be made from that intersection tree.
Please note that this is technically O(n^2). Its worst case is when both strings are the same character, repeated over and over again. Because you wind up with a lot of what logically is something like, "an 'l' at position 42 in the one string could have matched against position l at position 54 in the other". But in practice it will be O(n).
def find_subtree (text, max_skip=1):
tree = {}
tree_at_position = {}
def subtree_from_position (position):
if position not in tree_at_position:
this_tree = {}
if position < len(text):
char = text[position]
# Make sure that we've populated the further tree.
subtree_from_position(position + 1)
# If this char appeared later, include those possible matches.
if char in tree:
for char2, subtree in tree[char].iteritems():
this_tree[char2] = subtree
# And now update the new choices.
for skip in range(max_skip + 1, 0, -1):
if position + skip < len(text):
this_tree[text[position + skip]] = subtree_from_position(position + skip)
tree[char] = this_tree
tree_at_position[position] = this_tree
return tree_at_position[position]
subtree_from_position(0)
return tree
def find_longest_common_semistring (text1, text2):
tree1 = find_subtree(text1)
tree2 = find_subtree(text2)
answered = {}
def find_intersection (subtree1, subtree2):
unique = (id(subtree1), id(subtree2))
if unique not in answered:
answer = {}
for k, v in subtree1.iteritems():
if k in subtree2:
answer[k] = find_intersection(v, subtree2[k])
answered[unique] = answer
return answered[unique]
found_longest = {}
def find_longest (tree):
if id(tree) not in found_longest:
best_candidate = ''
for char, subtree in tree.iteritems():
candidate = char + find_longest(subtree)
if len(best_candidate) < len(candidate):
best_candidate = candidate
found_longest[id(tree)] = best_candidate
return found_longest[id(tree)]
intersection_tree = find_intersection(tree1, tree2)
return find_longest(intersection_tree)
print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))
Let g(c, rs, rt) represent the longest common semi-substring of strings, S and T, ending at rs and rt, where rs and rt are the ranked occurences of the character, c, in S and T, respectively, and K is the number of skips allowed. Then we can form a recursion which we would be obliged to perform on all pairs of c in S and T.
JavaScript code:
function f(S, T, K){
// mapS maps a char to indexes of its occurrences in S
// rsS maps the index in S to that char's rank (index) in mapS
const [mapS, rsS] = mapString(S)
const [mapT, rsT] = mapString(T)
// h is used to memoize g
const h = {}
function g(c, rs, rt){
if (rs < 0 || rt < 0)
return 0
if (h.hasOwnProperty([c, rs, rt]))
return h[[c, rs, rt]]
// (We are guaranteed to be on
// a match in this state.)
let best = [1, c]
let idxS = mapS[c][rs]
let idxT = mapT[c][rt]
if (idxS == 0 || idxT == 0)
return best
for (let i=idxS-1; i>=Math.max(0, idxS - 1 - K); i--){
for (let j=idxT-1; j>=Math.max(0, idxT - 1 - K); j--){
if (S[i] == T[j]){
const [len, str] = g(S[i], rsS[i], rsT[j])
if (len + 1 >= best[0])
best = [len + 1, str + c]
}
}
}
return h[[c, rs, rt]] = best
}
let best = [0, '']
for (let c of Object.keys(mapS)){
for (let i=0; i<(mapS[c]||[]).length; i++){
for (let j=0; j<(mapT[c]||[]).length; j++){
let [len, str] = g(c, i, j)
if (len > best[0])
best = [len, str]
}
}
}
return best
}
function mapString(s){
let map = {}
let rs = []
for (let i=0; i<s.length; i++){
if (!map[s[i]]){
map[s[i]] = [i]
rs.push(0)
} else {
map[s[i]].push(i)
rs.push(map[s[i]].length - 1)
}
}
return [map, rs]
}
console.log(f('attgcgtagcaatg', 'tctcaggtcgatagtgac', 1))
console.log(f('aaaattttcccc', 'cccgggggaatatca', 1))
console.log(f('abcade', 'axe', 1))
I am trying to load in a .obj-File and draw it with the help of glDrawElements.
Now, with glDrawArrays everything works perfectly, but it is - of course - inefficient.
The problem I have right now is, that an .obj-file uses multiple index-buffers (for each attribute) while OpenGL may only use one. So I need to map them accordingly.
There are a lot of pseudo-algorithms out there and I even found a C++ implementation. I do know quite a bit of C++ but strangely neither helped me with my implementation in Scala.
Let's see:
private def parseObj(path: String): Model =
{
val objSource: List[String] = Source.fromFile(path).getLines.toList
val positions: List[Vector3] = objSource.filter(_.startsWith("v ")).map(_.split(" ")).map(v => new Vector3(v(1).toFloat,v(2).toFloat,v(3).toFloat))//, 1.0f))
val normals: List[Vector4] = objSource.filter(_.startsWith("vn ")).map(_.split(" ")).map(v => new Vector4(v(1)toFloat,v(2).toFloat, v(3).toFloat, 0.0f))
val textureCoordinates: List[Vector2] = objSource.filter(_.startsWith("vt ")).map(_.split(" ")).map(v => new Vector2(v(1).toFloat, 1-v(2).toFloat)) // TODO 1-y because of blender
val faces: List[(Int, Int, Int)] = objSource.filter(_.startsWith("f ")).map(_.split(" ")).flatten.filterNot(_ == "f").map(_.split("/")).map(a => ((a(0).toInt, a(1).toInt, a(2).toInt)))
val vertices: List[Vertex] = for(face <- faces) yield(new Vertex(positions(face._1-1), textureCoordinates(face._2-1)))
val f: List[(Vector3, Vector2, Vector4)] = for(face <- faces) yield((positions(face._1-1), textureCoordinates(face._2-1), normals(face._3-1)))
println(f.mkString("\n"))
val indices: List[Int] = faces.map(f => f._1-1) // Wrong!
new Model(vertices.toArray, indices.toArray)
}
The val indices: List[Int] was my first naive approach and of course is wrong. But let's start at the top:
I load in the file and go through it. (I assume you know how an .obj-file is made up)
I read in the vertices, texture-coordinates and normals. Then I come to the faces.
Now, each face in my example has 3 values v_x, t_y, n_z defining the vertexAtIndexX, textureCoordAtIndexY, normalAtIndexZ. So each of these define one Vertex while a triple of these (or one line in the file) defines a Face/Polygon/Triangle.
in val vertices: List[Vertex] = for(face <- faces) yield(new Vertex(positions(face._1-1), textureCoordinates(face._2-1))) I actually try to create Vertices (a case-class that currently only holds positions and texture-coordinates and neglects normals for now)
The real problem is this line:
val indices: List[Int] = faces.map(f => f._1-1) // Wrong!
To get the real indices I basically need to do this instead of
val vertices: List[Vertex] = for(face <- faces) yield(new Vertex(positions(face._1-1), textureCoordinates(face._2-1)))
and
val indices: List[Int] = faces.map(f => f._1-1) // Wrong!
Pseudo-Code:
Iterate over all faces
Iterate over all vertices in a face
Check if we already have that combination of(position, texturecoordinate, normal) in our newVertices
if(true)
indices.put(indexOfCurrentVertex)
else
create a new Vertex from the face
store the new vertex in the vertex list
indices.put(indexOfNewVertex)
Yet I'm totally stuck. I've tried different things, but can't come up with a nice and clean solution that actually works.
Things like:
val f: List[(Vector3, Vector2, Vector4)] = for(face <- faces) yield((positions(face._1-1), textureCoordinates(face._2-1), normals(face._3-1)))
and trying to f.distinct are not working, because there is nothing to distinct, all the entries there are unique, which totally makes sense if I look at the file and yet that's what the pseudo-code tells me to check.
Of course then I would need to fill the indices accordingly (preferably in a one-liner and with a lot of functional beauty)
But I should try to find duplicates, so... I'm kind of baffled. I guess I mix up the different "vertices" and "positions" too much, with all the referencing.
So, am I thinking wrong, or is the algorithm/thinking right and I just need to implement this in nice, clean (and actually working) Scala code?
Please, enlighten me!
As per comments, I made a little update:
var index: Int = 0
val map: mutable.HashMap[(Int, Int, Int), Int] = new mutable.HashMap[(Int, Int, Int), Int].empty
val combinedIndices: ListBuffer[Int] = new ListBuffer[Int]
for(face <- faces)
{
val vID: Int = face._1-1
val nID: Int = face._2-1
val tID: Int = face._3-1
var combinedIndex: Int = -1
if(map.contains((vID, nID, tID)))
{
println("We have a duplicate, wow!")
combinedIndex = map.get((vID, nID, tID)).get
}
else
{
combinedIndex = index
map.put((vID, nID, tID), combinedIndex)
index += 1
}
combinedIndices += combinedIndex
}
where faces still is:
val faces: List[(Int, Int, Int)] = objSource.filter(_.startsWith("f ")).map(_.split(" ")).flatten.filterNot(_ == "f").map(_.split("/")).map(a => ((a(0).toInt, a(1).toInt, a(2).toInt)))
Fun fact I'm still not understanding it obviously, because that way I never ever get a duplicate!
Meaning that combinedIndices at the end just holds the natural numbers like:
ListBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...)
This is javascript (sorry not scala) but it's commented and shoul be fairly easy to convert.
// bow-tie
var objString = "v 0 0 0\nv 1 1 0\nv 1 -1 0\nv -1 1 0\nv -1 -1 0\n" +
"vt 0 .5\nvt 1 1\nvt 1 0\n" +
"vn 0 0 1\n" +
"f 1/1/1 2/2/1 3/3/1\nf 1/1/1 4/2/1 5/3/1";
// output indices should be [0, 1, 2, 0, 3, 4]
// parse the file
var lines = objString.split("\n");
var data = lines.map(function(line) { return line.split(" "); });
var v = [];
var t = [];
var n = [];
var f = [];
var indexMap = new Map(); // HashMap<face:string, index:integer>
var nextIndex = 0;
var vertices = [];
var indices = [];
// fill vertex, texture and normal arrays
data.filter(function(d) { return d[0] == "v"; }).forEach(function(d) { v.push([parseFloat(d[1]), parseFloat(d[2]), parseFloat(d[3])]); });
data.filter(function(d) { return d[0] == "vt"; }).forEach(function(d) { t.push([parseFloat(d[1]), parseFloat(d[2])]); });
data.filter(function(d) { return d[0] == "vn"; }).forEach(function(d) { n.push([parseFloat(d[1]), parseFloat(d[2]), parseFloat(d[3])]); });
//
console.log("V", v.toString());
console.log("T", t.toString());
console.log("N", n.toString());
// create vertices and indices arrays by parsing faces
data.filter(function(d) { return d[0] == "f"; }).forEach(function(d) {
var f1 = d[1].split("/").map(function(d) { return parseInt(d)-1; });
var f2 = d[2].split("/").map(function(d) { return parseInt(d)-1; });
var f3 = d[3].split("/").map(function(d) { return parseInt(d)-1; });
// 1
if(indexMap.has(d[1].toString())) {
indices.push(indexMap.get(d[1].toString()));
} else {
vertices = vertices.concat(v[f1[0]]).concat(t[f1[1]]).concat(n[f1[2]]);
indexMap.set(d[1].toString(), nextIndex);
indices.push(nextIndex++);
}
// 2
if(indexMap.has(d[2].toString())) {
indices.push(indexMap.get(d[2].toString()));
} else {
vertices = vertices.concat(v[f2[0]]).concat(t[f2[1]]).concat(n[f2[2]]);
indexMap.set(d[2].toString(), nextIndex);
indices.push(nextIndex++);
}
// 3
if(indexMap.has(d[3].toString())) {
indices.push(indexMap.get(d[3].toString()));
} else {
vertices = vertices.concat(v[f3[0]]).concat(t[f3[1]]).concat(n[f3[2]]);
indexMap.set(d[3].toString(), nextIndex);
indices.push(nextIndex++);
}
});
//
console.log("Vertices", vertices.toString());
console.log("Indices", indices.toString());
The output
V 0,0,0,1,1,0,1,-1,0,-1,1,0,-1,-1,0
T 0,0.5,1,1,1,0
N 0,0,1
Vertices 0,0,0,0,0.5,0,0,1,1,1,0,1,1,0,0,1,1,-1,0,1,0,0,0,1,-1,1,0,1,1,0,0,1,-1,-1,0,1,0,0,0,1
Indices 0,1,2,0,3,4
The JSFiddle http://jsfiddle.net/8q7jLvsq/2
The only thing I'm doing ~differently is using the string hat represents one of the parts of a face as the key into my indexMap (ex: "25/32/5").
EDIT JSFiddle http://jsfiddle.net/8q7jLvsq/2/ this version combines repeated values for vertex, texture and normal. This optimizes OBJ files that repeat the same values values making every face unique.
// bow-tie
var objString = "v 0 0 0\nv 1 1 0\nv 1 -1 0\nv 0 0 0\nv -1 1 0\nv -1 -1 0\n" +
"vt 0 .5\nvt 1 1\nvt 1 0\nvt 0 .5\nvt 1 1\nvt 1 0\n" +
"vn 0 0 1\nvn 0 0 1\nvn 0 0 1\nvn 0 0 1\nvn 0 0 1\nvn 0 0 1\n" +
"f 1/1/1 2/2/2 3/3/3\nf 4/4/4 5/5/5 6/6/6";
// output indices should be [0, 1, 2, 0, 3, 4]
// parse the file
var lines = objString.split("\n");
var data = lines.map(function(line) { return line.split(" "); });
var v = [];
var t = [];
var n = [];
var f = [];
var vIndexMap = new Map(); // map to earliest index in the list
var vtIndexMap = new Map();
var vnIndexMap = new Map();
var indexMap = new Map(); // HashMap<face:string, index:integer>
var nextIndex = 0;
var vertices = [];
var indices = [];
// fill vertex, texture and normal arrays
data.filter(function(d) { return d[0] == "v"; }).forEach(function(d, i) {
v[i] = [parseFloat(d[1]), parseFloat(d[2]), parseFloat(d[3])];
var key = [d[1], d[2], d[3]].toString();
if(!vIndexMap.has(key)) {
vIndexMap.set(key, i);
}
});
data.filter(function(d) { return d[0] == "vt"; }).forEach(function(d, i) {
t[i] = [parseFloat(d[1]), parseFloat(d[2])];
var key = [d[1], d[2]].toString();
if(!vtIndexMap.has(key)) {
vtIndexMap.set(key, i);
}
});
data.filter(function(d) { return d[0] == "vn"; }).forEach(function(d, i) {
n[i] = [parseFloat(d[1]), parseFloat(d[2]), parseFloat(d[3])];
var key = [d[1], d[2], d[3]].toString();
if(!vnIndexMap.has(key)) {
vnIndexMap.set(key, i);
}
});
//
console.log("V", v.toString());
console.log("T", t.toString());
console.log("N", n.toString());
// create vertices and indices arrays by parsing faces
data.filter(function(d) { return d[0] == "f"; }).forEach(function(d) {
var f1 = d[1].split("/").map(function(d, i) {
var index = parseInt(d)-1;
if(i == 0) index = vIndexMap.get(v[index].toString());
else if(i == 1) index = vtIndexMap.get(t[index].toString());
else if(i == 2) index = vnIndexMap.get(n[index].toString());
return index;
});
var f2 = d[2].split("/").map(function(d, i) {
var index = parseInt(d)-1;
if(i == 0) index = vIndexMap.get(v[index].toString());
else if(i == 1) index = vtIndexMap.get(t[index].toString());
else if(i == 2) index = vnIndexMap.get(n[index].toString());
return index;
});
var f3 = d[3].split("/").map(function(d, i) {
var index = parseInt(d)-1;
if(i == 0) index = vIndexMap.get(v[index].toString());
else if(i == 1) index = vtIndexMap.get(t[index].toString());
else if(i == 2) index = vnIndexMap.get(n[index].toString());
return index;
});
// 1
if(indexMap.has(f1.toString())) {
indices.push(indexMap.get(f1.toString()));
} else {
vertices = vertices.concat(v[f1[0]]).concat(t[f1[1]]).concat(n[f1[2]]);
indexMap.set(f1.toString(), nextIndex);
indices.push(nextIndex++);
}
// 2
if(indexMap.has(f2.toString())) {
indices.push(indexMap.get(f2.toString()));
} else {
vertices = vertices.concat(v[f2[0]]).concat(t[f2[1]]).concat(n[f2[2]]);
indexMap.set(f2.toString(), nextIndex);
indices.push(nextIndex++);
}
// 3
if(indexMap.has(f3.toString())) {
indices.push(indexMap.get(f3.toString()));
} else {
vertices = vertices.concat(v[f3[0]]).concat(t[f3[1]]).concat(n[f3[2]]);
indexMap.set(f3.toString(), nextIndex);
indices.push(nextIndex++);
}
});
//
console.log("Vertices", vertices.toString());
console.log("Indices", indices.toString());
I have written hash calculate function:
var hash = function (string) {
var h = 7;
var i = 0;
var letters = "acdegilmnoprstuw";
while (i < string.length) {
h = (h * 37 + letters.indexOf( string[i++] ));
}
return h;
};
Where string = "agdpeew" and result is 664804774844. But now I don't know how I can decipher hash. So, If my input is 664804774844, answer will agdpeew.
What algorithm can I use for this?
Maybe I can start with the division 664804774844 / 37 but how I can get letter indexes?
For short strings, you can start by expressing the number in base 37 - but why are you trying to do this? Most of the use cases for hash functions don't require you to invert the function, and many hash functions are designed for it to be difficult or impossible to invert the function, except by evaluating on input after input until you find one that produces the hash value you are looking for.
Below is the code written in Swift language, it has both encrypt & decrypt of the hash value
var letters = "acdegilmnoprstuw";
Hashing / Encrypt
func hash(s:String) -> Int{
var h = 7 as Int;
for (var i = 0; i < s.characters.count; i++) {
// Getting the character at index
let s2: Character = s[s.startIndex.advancedBy(i)];
// Getting index of string 'acdegilmnoprstuw'
let l : Int = letters.startIndex.distanceTo(letters.characters.indexOf(s2)!);
h = (h * 37 + l);
}
return h;
}
Unhashing / Decrypt
func unhash(hashValue:Int) -> String{
var h = hashValue
var unhashedString : String = ""
while(h > 37){
unhashedString.append(letters[letters.startIndex.advancedBy(h % 37)])
h = h / 37
}
return String(unhashedString.characters.reverse())
}
When drawing graphs using SI codes is pretty much what we want. Our y-axis values tend to be large currency values. eg: $10,411,504,201.20
Abbreviating this, at least in a US locale, this should translate to $10.4B.
But using d3.format's 's' type for SI codes this would display as $10.4G. This might be great for some locales and good when dealing with computer-based values (eg: processor speed, memory...), but not so with currency or other non-computer types of values.
Is there a way to get locale-specific functionality similar to SI-codes that would convert billions to B instead of G, etc...?
(I realize this is mostly an SI-codes thing and not specific to D3, but since I'm using D3 this seems the most appropriate tag.)
I prefer overriding d3.formatPrefix. Then you can just forget about replacing strings within your viz code. Simply execute the following code immediately after loading D3.js.
// Change D3's SI prefix to more business friendly units
// K = thousands
// M = millions
// B = billions
// T = trillion
// P = quadrillion
// E = quintillion
// small decimals are handled with e-n formatting.
var d3_formatPrefixes = ["e-24","e-21","e-18","e-15","e-12","e-9","e-6","e-3","","K","M","B","T","P","E","Z","Y"].map(d3_formatPrefix);
// Override d3's formatPrefix function
d3.formatPrefix = function(value, precision) {
var i = 0;
if (value) {
if (value < 0) {
value *= -1;
}
if (precision) {
value = d3.round(value, d3_format_precision(value, precision));
}
i = 1 + Math.floor(1e-12 + Math.log(value) / Math.LN10);
i = Math.max(-24, Math.min(24, Math.floor((i - 1) / 3) * 3));
}
return d3_formatPrefixes[8 + i / 3];
};
function d3_formatPrefix(d, i) {
var k = Math.pow(10, Math.abs(8 - i) * 3);
return {
scale: i > 8 ? function(d) { return d / k; } : function(d) { return d * k; },
symbol: d
};
}
function d3_format_precision(x, p) {
return p - (x ? Math.ceil(Math.log(x) / Math.LN10) : 1);
}
After running this code, try formatting a number with SI prefix:
d3.format(".3s")(1234567890) // 1.23B
You could augment this code pretty simply to support different locales by including locale-specific d3_formatPrefixes values in an object and then select the proper one that matches a locale you need.
I like the answer by #nross83
Just going to paste a variation that I think might be more robust.
Example:
import { formatLocale, formatSpecifier } from "d3";
const baseLocale = {
decimal: ".",
thousands: ",",
grouping: [3],
currency: ["$", ""],
};
// You can define your own si prefix abbr. here
const d3SiPrefixMap = {
y: "e-24",
z: "e-21",
a: "e-18",
f: "e-15",
p: "e-12",
n: "e-9",
µ: "e-6",
m: "e-3",
"": "",
k: "K",
M: "M",
G: "B",
T: "T",
P: "P",
E: "E",
Z: "Z",
Y: "Y",
};
const d3Format = (specifier: string) => {
const locale = formatLocale({ ...baseLocale });
const formattedSpecifier = formatSpecifier(specifier);
const valueFormatter = locale.format(specifier);
return (value: number) => {
const result = valueFormatter(value);
if (formattedSpecifier.type === "s") {
// modify the return value when using si-prefix.
const lastChar = result[result.length - 1];
if (Object.keys(d3SiPrefixMap).includes(lastChar)) {
return result.slice(0, -1) + d3SiPrefixMap[lastChar];
}
}
// return the default result from d3 format in case the format type is not set to `s` (si suffix)
return result;
};
}
And use it like the following:
const value = 1000000000;
const formattedValue = d3Format("~s")(value);
console.log({formattedValue}); // Outputs: {formattedValue: "1B"}
We used the formatSpecifier function from d3-format to check if the format type is s, i.e. si suffix, and only modify the return value in this case.
In the example above, I have not modified the actual d3 function. You can change the code accordingly if you want to do that for the viz stuff.
I hope this answer is helpful. Thank you :)
Given a Map of objects and designated proportions (let's say they add up to 100 to make it easy):
val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)
How can I generate a sequence such that for a subset of size n there are ~42% "A"s, ~32% "B"s and ~26% "C"s? (Obviously, small n will have larger errors).
(Work language is Scala, but I'm just asking for the algorithm.)
UPDATE: I resisted a random approach since, for instance, there's ~16% chance that the sequence would start with AA and ~11% chance it would start with BB and there would be very low odds that for n precisely == (sum of proportions) the distribution would be perfect. So, following #MvG's answer, I implemented as follows:
/**
Returns the key whose achieved proportions are most below desired proportions
*/
def next[T](proportions : Map[T, Double], achievedToDate : Map[T,Double]) : T = {
val proportionsSum = proportions.values.sum
val desiredPercentages = proportions.mapValues(v => v / proportionsSum)
//Initially no achieved percentages, so avoid / 0
val toDateTotal = if(achievedToDate.values.sum == 0.0){
1
}else{
achievedToDate.values.sum
}
val achievedPercentages = achievedToDate.mapValues(v => v / toDateTotal)
val gaps = achievedPercentages.map{ case (k, v) =>
val gap = desiredPercentages(k) - v
(k -> gap)
}
val maxUnder = gaps.values.toList.sortWith(_ > _).head
//println("Max gap is " + maxUnder)
val gapsForMaxUnder = gaps.mapValues{v => Math.abs(v - maxUnder) < Double.Epsilon }
val keysByHasMaxUnder = gapsForMaxUnder.map(_.swap)
keysByHasMaxUnder(true)
}
/**
Stream of most-fair next element
*/
def proportionalStream[T](proportions : Map[T, Double], toDate : Map[T, Double]) : Stream[T] = {
val nextS = next(proportions, toDate)
val tailToDate = toDate + (nextS -> (toDate(nextS) + 1.0))
Stream.cons(
nextS,
proportionalStream(proportions, tailToDate)
)
}
That when used, e.g., :
val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)
val none : Map[String,Double] = ss.mapValues(_ => 0.0)
val mySequence = (proportionalStream(ss, none) take 100).toList
println("Desired : " + ss)
println("Achieved : " + mySequence.groupBy(identity).mapValues(_.size))
mySequence.map(s => print(s))
println
produces :
Desired : Map(A -> 42.0, B -> 32.0, C -> 26.0)
Achieved : Map(C -> 26, A -> 42, B -> 32)
ABCABCABACBACABACBABACABCABACBACABABCABACABCABACBA
CABABCABACBACABACBABACABCABACBACABABCABACABCABACBA
For a deterministic approach, the most obvious solution would probably be this:
Keep track of the number of occurrences of each item in the sequence so far.
For the next item, choose that item for which the difference between intended and actual count (or proportion, if you prefer that) is maximal, but only if the intended count (resp. proportion) is greater than the actual one.
If there is a tie, break it in an arbitrary but deterministic way, e.g. choosing the alphabetically lowest item.
This approach would ensure an optimal adherence to the prescribed ratio for every prefix of the infinite sequence generated in this way.
Quick & dirty python proof of concept (don't expect any of the variable “names” to make any sense):
import sys
p = [0.42, 0.32, 0.26]
c = [0, 0, 0]
a = ['A', 'B', 'C']
n = 0
while n < 70*5:
n += 1
x = 0
s = n*p[0] - c[0]
for i in [1, 2]:
si = n*p[i] - c[i]
if si > s:
x = i
s = si
sys.stdout.write(a[x])
if n % 70 == 0:
sys.stdout.write('\n')
c[x] += 1
Generates
ABCABCABACABACBABCAABCABACBACABACBABCABACABACBACBAABCABCABACABACBABCAB
ACABACBACABACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACABACBABCABA
CABACBACBAABCABCABACABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABAC
ABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABACABACBABCABACABACBACB
AACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACBAACBABCABACABACBACBA
For every item of the sequence, compute a (pseudo-)random number r equidistributed between 0 (inclusive) and 100 (exclusive).
If 0 ≤ r < 42, take A
If 42 ≤ r < (42+32), take B
If (42+32) ≤ r < (42+32+26)=100, take C
The number of each entry in your subset is going to be the same as in your map, but with a scaling factor applied.
The scaling factor is n/100.
So if n was 50, you would have { Ax21, Bx16, Cx13 }.
Randomize the order to your liking.
The simplest "deterministic" [in terms of #elements of each category] solution [IMO] will be: add elements in predefined order, and then shuffle the resulting list.
First, add map(x)/100 * n elements from each element x chose how you handle integer arithmetics to avoid off by one element], and then shuffle the resulting list.
Shuffling a list is simple with fisher-yates shuffle, which is implemented in most languages: for example java has Collections.shuffle(), and C++ has random_shuffle()
In java, it will be as simple as:
int N = 107;
List<String> res = new ArrayList<String>();
for (Entry<String,Integer> e : map.entrySet()) { //map is predefined Map<String,Integer> for frequencies
for (int i = 0; i < Math.round(e.getValue()/100.0 * N); i++) {
res.add(e.getKey());
}
}
Collections.shuffle(res);
This is nondeterministic, but gives a distribution of values close to MvG's. It suffers from the problem that it could give AAA right at the start. I post it here for completeness' sake given how it proves my dissent with MvG was misplaced (and I don't expect any upvotes).
Now, if someone has an idea for an expand function that is deterministic and won't just duplicate MvG's method (rendering the calc function useless), I'm all ears!
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>ErikE's answer</title>
</head>
<body>
<div id="output"></div>
<script type="text/javascript">
if (!Array.each) {
Array.prototype.each = function(callback) {
var i, l = this.length;
for (i = 0; i < l; i += 1) {
callback(i, this[i]);
}
};
}
if (!Array.prototype.sum) {
Array.prototype.sum = function() {
var sum = 0;
this.each(function(i, val) {
sum += val;
});
return sum;
};
}
function expand(counts) {
var
result = "",
charlist = [],
l,
index;
counts.each(function(i, val) {
char = String.fromCharCode(i + 65);
for ( ; val > 0; val -= 1) {
charlist.push(char);
}
});
l = charlist.length;
for ( ; l > 0; l -= 1) {
index = Math.floor(Math.random() * l);
result += charlist[index];
charlist.splice(index, 1);
}
return result;
}
function calc(n, proportions) {
var percents = [],
counts = [],
errors = [],
fnmap = [],
errorSum,
worstIndex;
fnmap[1] = "min";
fnmap[-1] = "max";
proportions.each(function(i, val) {
percents[i] = val / proportions.sum() * n;
counts[i] = Math.round(percents[i]);
errors[i] = counts[i] - percents[i];
});
errorSum = counts.sum() - n;
while (errorSum != 0) {
adjust = errorSum < 0 ? 1 : -1;
worstIndex = errors.indexOf(Math[fnmap[adjust]].apply(0, errors));
counts[worstIndex] += adjust;
errors[worstIndex] = counts[worstIndex] - percents[worstIndex];
errorSum += adjust;
}
return expand(counts);
}
document.body.onload = function() {
document.getElementById('output').innerHTML = calc(99, [25.1, 24.9, 25.9, 24.1]);
};
</script>
</body>
</html>