Converting Map[Int, Double] to breeze.linalg.SparseVector - scala-breeze

I am new to the Breeze library and I would like to convert a Map[Int, Double] to breeze.linalg.SparseVector, and ideally without having to specify a fixed length of the SparseVector. I managed to achieve the goal with this clumsy code:
import breeze.linalg.{SparseVector => SBV}
val mySparseVector: SBV[Double] = new SBV[Double](Array.empty, Array.empty, 10000)
myMap foreach { e => mySparseVector(e._1) = e._2 }
Not only I have to specify a fixed length of 10,000, but the code runs in O(n), where n is the size of the map. Is there a better way?

You can use VectorBuilder. There's a (sadly) undocumented feature where if you tell it the length is -1, it will happily let you add things. You will have to (annoyingly) set the length before you construct the result...
val vb = new VectorBuilder(length = -1)
myMap foreach { e => vb.add(e._1, e._2) }
vb.length = myMap.keys.max + 1
vb.toSparseVector
(Your code is actually n^2 because SparseVector has to be sorted so you're repeatedly moving elements around in an array. VectorBuilder gives you n log n, which is the best you can do.)

Related

Need solution to rand_range providing only floats and not integers - GDScript

For a project, I need to make use of a random number generator to provide random numbers as part of a Fisher-Yates Shuffle. I am using both randomize() and randi()%1+50 methods to get my random numbers.
However doing it this way does not let my Fisher-Yates shuffler count down to n=0, because if it does, it kicks out an error:
Division By Zero in operator %
Pointing right at my generator range where the % resides. I can work around this by using rand_range(), however, because rand_range() returns a float and not an integer, I am forced to round the result, which sometimes results in duplicate numbers, which I cannot have.
ceil() and floor() are also out of the question for the same reason. This, of course, could all be solved if GDScript included something like randi_range() but I have seen on other forums that this has been a problem since 2014, and to which there still is no solution.
Q: Given that this is not an option, does anyone know a way to return a random number, within a range, that does not include 0 and is a positive integer that doesn't require the % operator to dictate the range?
Don't get me wrong, I love the Godot engine, and appreciate it for what it is, but sometimes the code requires a tad too much "wrestling" for my mental sanity to handle.
Any help would be greatly appreciated.
Thanks in advance...
Godot has a class RandomNumberGenerator that does implement randi_range().
Here's a FisherYates Shuffle to prove to myself that it should work:
extends Node2D
var deck = []
func init_deck():
var arr = []
for i in range(1,52):
arr.append(i)
return arr
# Called when the node enters the scene tree for the first time.
func _ready():
deck = init_deck()
var n = len(deck)
print(fy_shuffle(deck, n))
pass # Replace with function body.
#Fisher Yates Shuffle
func fy_shuffle (arr, n):
# Start from the last element and swap one by one.
var rng = RandomNumberGenerator.new()
for i in range(n-1,0,-1):
# Pick a random index from 0 to i
var j = rng.randi_range(0,i+1)
# Swap arr[i] with the element at random index
var t1 = arr[i]
var t2 = arr[j]
arr[i] = t2
arr[j] = t1
return arr
const START = 10
const STOP = 20
var random = RandomNumberGenerator.new()
random.randomize()
var n = random.randi_range(START, STOP)

Find most unique words, penalizing words in common

suppose I have n classes like:
A: this,is,a,test,of,the,salmon,system
B: i,like,to,test,the,flounder,system
C: to,test,a,salmon,is,like,to,test,the,iodine,system
I want to get the most unique words for each class, so something with a ranking that gives me
A: salmon
B: flounder
C: iodine, salmon
(as their first elements ; it can be a ranking of all words)
How do I do this? There will be hundreds of input classes each with tens of thousands of tokens.
I'm guessing this is essentially the sort of thing any search engine back-end does, but I'd like a fairly simple standalone thing.
Using a language like Python, you can write this efficiently in 8 lines. For hundreds of groups, each with tens of thousands of tokens, the running time sounds like it will take at most a few minutes (although I haven't tried this on actual input).
Create a hash-based dictionary mapping each word to the number of its occurrences.
Iterate over all groups, and all words in a group, and update this dictionary.
For each group,
a. If you need a total ranking, sort with the value in the dictionary as the critera
b. If you need the top k, use an order statistics type of algorithm again using the value in the dictionary as the criteria
Steps 1 + 2 should have expected linear complexity in the total number of words.
Step 3 is n log(n) per group for total ranking, and linear in the total number of words otherwise.
Here is the Python code for the top k. Assume all_groups is a list of lists of strings, and that k = 10.
from collections import Counter
import heapq
import operator
c = Counter()
for g in all_groups:
c.update(g)
for g in all_groups:
print heapq.nsmallest(k, [(w, c[w]) for w in g], key=operator.itemgetter(1))
What I understand from your question, I come to this solution as the least used words per class comparing with all the other classes.
var a = "this,is,a,test,of,the,salmon,system".split(","),
b = "i,like,to,test,the,flounder,system".split(","),
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
map = {},
min,
key,
parse = function(stringArr) {
var length = stringArr.length,
i,count;
for (i = 0; i< length; i++) {
if (count = map[stringArr[i]]) {
map[stringArr[i]] = count + 1;
}
else {
map[stringArr[i]] = 1;
}
}
},
get = function(stringArr) {
min = Infinity;
stringArr.forEach((item)=>{
if (map[item] < min) {
min = map[item];
key = item
}
});
console.log(key);
};
parse(a);
parse(b);
parse(c);
get(a);
get(b);
get(c);
Ignore the classes, go through all the words and make a frequency table.
Then, for each class select the word with the lowest frequency.
Example in Python (slightly unpythonic solution to maintain readability for non-Python users):
a = "this,is,a,test,of,the,salmon,system".split(",")
b = "i,like,to,test,the,flounder,system".split(",")
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(",")
freq = {}
for word in a + b + c:
freq[word] = (freq[word] if word in freq else 0) + 1
print("a: ", min(a, key=lambda w: freq[w]))
print("b: ", min(b, key=lambda w: freq[w]))
print("c: ", min(c, key=lambda w: freq[w]))

Efficiently randomly sampling List while maintaining order

I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.
Is there a clean and more efficient way of doing this?
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
Random
.shuffle(ls.indices.toList)
.take(sampleSize)
.sorted
.map(idx => ls(idx))
}
val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though
sampledList(sampleList, 5) // List(e, u, i, r, n)
EDIT:
It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.
If by
maintaining the order of the values
you understand to keeping the elements in the sample in the same order as in the ls list, then with a small modification to your original solution the performances can be greatly improved:
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}
This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).
Here is an (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.
import scala.util.Random
import scala.annotation.tailrec
def sampledList[T](ls: List[T], sampleSize: Int) = {
#tailrec
def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
require(listSize >= sampleSize,
s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
)
list match {
case hd :: tl =>
if (Random.nextInt(listSize) < sampleSize)
rec(tl, listSize-1, hd :: sample, sampleSize-1)
else rec(tl, listSize-1, sample, sampleSize)
case Nil =>
require(sampleSize == 0, // Should never happen
s"sampleSize must be zero at the end of processing, but got $sampleSize"
)
sample
}
}
rec(ls, ls.size, Nil, sampleSize).reverse
}
The above implementation simply iterates over the list and keeps (or not) the current element according to a probability which is designed to give the same chance to each element. My logic may have a flow, but at first blush it seems sound to me.
Here's another O(n) implementation that should have a uniform probability for each element:
implicit class SampleSeqOps[T](s: Seq[T]) {
def sample(n: Int, r: Random = Random): Seq[T] = {
assert(n >= 0 && n <= s.length)
val res = ListBuffer[T]()
val length = s.length
var samplesNeeded = n
for { (e, i) <- s.zipWithIndex } {
val p = samplesNeeded.toDouble / (length - i)
if (p >= r.nextDouble()) {
res += e
samplesNeeded -= 1
}
}
res.toSeq
}
}
I'm using it frequently with collections > 100'000 elements and the performance seems reasonable.
It's probably the same idea as in Régis Jean-Gilles's answer but I think the imperative solution is slightly more readable in this case.
Perhaps I don't quite understand, but since Lists are immutable you don't really need to worry about 'maintaining the order' since the original List is never touched. Wouldn't the following suffice?
def sampledList[T](ls: List[T], sampleSize: Int) =
Random.shuffle(ls).take(sampleSize)
While my previous answer has linear complexity, it does have the drawback of requiring two passes, the first one corresponding to the need to compute the length before doing anything else. Besides affecting the running time, we might want to sample a very large collection for which it is not practical nor efficient to load the whole collection in memory at once, in which case we'd like to be able to work with a simple iterator.
As it happens, we don't need to invent anything to fix this. There is simple and clever algorithm called reservoir sampling which does exactly this (building a sample as we iterate over a collection, all in one pass). With a minor modification we can also preserve the order, as required:
import scala.util.Random
def sampledList[T](ls: TraversableOnce[T], sampleSize: Int, preserveOrder: Boolean = false, rng: Random = new Random): Iterable[T] = {
val result = collection.mutable.Buffer.empty[(T, Int)]
for ((item, n) <- ls.toIterator.zipWithIndex) {
if (n < sampleSize) result += (item -> n)
else {
val s = rng.nextInt(n)
if (s < sampleSize) {
result(s) = (item -> n)
}
}
}
if (preserveOrder) {
result.sortBy(_._2).map(_._1)
}
else result.map(_._1)
}

Cycle using method in CoffeeScript

Given I have two objects lower and upper of same type and they return successive value using method succ (as in ruby) and can be compared using <.
In plain javascript I can write:
for (var i = lower; i <= upper; i = i.succ()) {
// …
}
Using prototype I can write shorter:
$R(lower, upper).each(function(i){
// …
}, this)
Using prototype in coffeescript I can write even shorter:
$R(lower, upper).each (i)->
# …
, this
But without prototype, I found only this way to do same thing:
i = lower
while i <= upper
# …
i = i.succ()
Is there anything shorter?
I think you are correct that
i = lower
while i < upper
# …
i = i.succ()
is the shortest way to write this without using a function. Of course, you could write such a function without using Prototype:
eachSucc = (lower, upper, func) ->
i = lower
while i < upper
func i
i = i.succ()
Then you can call it like so:
eachSucc lower, upper, (i) -> ...
How about:
while upper >= n = i.succ()
alert n
Try it here, for the example I used the following fixture:
upper = 3
lower = 0
counter = (l) ->
_ = l
-> _++
i = succ: counter(lower)
/me still wishing for widespread generator support in Javascript..

Algorithm Issue: letter combinations

I'm trying to write a piece of code that will do the following:
Take the numbers 0 to 9 and assign one or more letters to this number. For example:
0 = N,
1 = L,
2 = T,
3 = D,
4 = R,
5 = V or F,
6 = B or P,
7 = Z,
8 = H or CH or J,
9 = G
When I have a code like 0123, it's an easy job to encode it. It will obviously make up the code NLTD. When a number like 5,6 or 8 is introduced, things get different. A number like 051 would result in more than one possibility:
NVL and NFL
It should be obvious that this gets even "worse" with longer numbers that include several digits like 5,6 or 8.
Being pretty bad at mathematics, I have not yet been able to come up with a decent solution that will allow me to feed the program a bunch of numbers and have it spit out all the possible letter combinations. So I'd love some help with it, 'cause I can't seem to figure it out. Dug up some information about permutations and combinations, but no luck.
Thanks for any suggestions/clues. The language I need to write the code in is PHP, but any general hints would be highly appreciated.
Update:
Some more background: (and thanks a lot for the quick responses!)
The idea behind my question is to build a script that will help people to easily convert numbers they want to remember to words that are far more easily remembered. This is sometimes referred to as "pseudo-numerology".
I want the script to give me all the possible combinations that are then held against a database of stripped words. These stripped words just come from a dictionary and have all the letters I mentioned in my question stripped out of them. That way, the number to be encoded can usually easily be related to a one or more database records. And when that happens, you end up with a list of words that you can use to remember the number you wanted to remember.
It can be done easily recursively.
The idea is that to handle the whole code of size n, you must handle first the n - 1 digits.
Once you have all answers for n-1 digits, the answers for the whole are deduced by appending to them the correct(s) char(s) for the last one.
There's actually a much better solution than enumerating all the possible translations of a number and looking them up: Simply do the reverse computation on every word in your dictionary, and store the string of digits in another field. So if your mapping is:
0 = N,
1 = L,
2 = T,
3 = D,
4 = R,
5 = V or F,
6 = B or P,
7 = Z,
8 = H or CH or J,
9 = G
your reverse mapping is:
N = 0,
L = 1,
T = 2,
D = 3,
R = 4,
V = 5,
F = 5,
B = 6,
P = 6,
Z = 7,
H = 8,
J = 8,
G = 9
Note there's no mapping for 'ch', because the 'c' will be dropped, and the 'h' will be converted to 8 anyway.
Then, all you have to do is iterate through each letter in the dictionary word, output the appropriate digit if there's a match, and do nothing if there isn't.
Store all the generated digit strings as another field in the database. When you want to look something up, just perform a simple query for the number entered, instead of having to do tens (or hundreds, or thousands) of lookups of potential words.
The general structure you want to hold your number -> letter assignments is an array or arrays, similar to:
// 0 = N, 1 = L, 2 = T, 3 = D, 4 = R, 5 = V or F, 6 = B or P, 7 = Z,
// 8 = H or CH or J, 9 = G
$numberMap = new Array (
0 => new Array("N"),
1 => new Array("L"),
2 => new Array("T"),
3 => new Array("D"),
4 => new Array("R"),
5 => new Array("V", "F"),
6 => new Array("B", "P"),
7 => new Array("Z"),
8 => new Array("H", "CH", "J"),
9 => new Array("G"),
);
Then, a bit of recursive logic gives us a function similar to:
function GetEncoding($number) {
$ret = new Array();
for ($i = 0; $i < strlen($number); $i++) {
// We're just translating here, nothing special.
// $var + 0 is a cheap way of forcing a variable to be numeric
$ret[] = $numberMap[$number[$i]+0];
}
}
function PrintEncoding($enc, $string = "") {
// If we're at the end of the line, then print!
if (count($enc) === 0) {
print $string."\n";
return;
}
// Otherwise, soldier on through the possible values.
// Grab the next 'letter' and cycle through the possibilities for it.
foreach ($enc[0] as $letter) {
// And call this function again with it!
PrintEncoding(array_slice($enc, 1), $string.$letter);
}
}
Three cheers for recursion! This would be used via:
PrintEncoding(GetEncoding("052384"));
And if you really want it as an array, play with output buffering and explode using "\n" as your split string.
This kind of problem are usually resolved with recursion. In ruby, one (quick and dirty) solution would be
#values = Hash.new([])
#values["0"] = ["N"]
#values["1"] = ["L"]
#values["2"] = ["T"]
#values["3"] = ["D"]
#values["4"] = ["R"]
#values["5"] = ["V","F"]
#values["6"] = ["B","P"]
#values["7"] = ["Z"]
#values["8"] = ["H","CH","J"]
#values["9"] = ["G"]
def find_valid_combinations(buffer,number)
first_char = number.shift
#values[first_char].each do |key|
if(number.length == 0) then
puts buffer + key
else
find_valid_combinations(buffer + key,number.dup)
end
end
end
find_valid_combinations("",ARGV[0].split(""))
And if you run this from the command line you will get:
$ ruby r.rb 051
NVL
NFL
This is related to brute-force search and backtracking
Here is a recursive solution in Python.
#!/usr/bin/env/python
import sys
ENCODING = {'0':['N'],
'1':['L'],
'2':['T'],
'3':['D'],
'4':['R'],
'5':['V', 'F'],
'6':['B', 'P'],
'7':['Z'],
'8':['H', 'CH', 'J'],
'9':['G']
}
def decode(str):
if len(str) == 0:
return ''
elif len(str) == 1:
return ENCODING[str]
else:
result = []
for prefix in ENCODING[str[0]]:
result.extend([prefix + suffix for suffix in decode(str[1:])])
return result
if __name__ == '__main__':
print decode(sys.argv[1])
Example output:
$ ./demo 1
['L']
$ ./demo 051
['NVL', 'NFL']
$ ./demo 0518
['NVLH', 'NVLCH', 'NVLJ', 'NFLH', 'NFLCH', 'NFLJ']
Could you do the following:
Create a results array.
Create an item in the array with value ""
Loop through the numbers, say 051 analyzing each one individually.
Each time a 1 to 1 match between a number is found add the correct value to all items in the results array.
So "" becomes N.
Each time a 1 to many match is found, add new rows to the results array with one option, and update the existing results with the other option.
So N becomes NV and a new item is created NF
Then the last number is a 1 to 1 match so the items in the results array become
NVL and NFL
To produce the results loop through the results array, printing them, or whatever.
Let pn be a list of all possible letter combinations of a given number string s up to the nth digit.
Then, the following algorithm will generate pn+1:
digit = s[n+1];
foreach(letter l that digit maps to)
{
foreach(entry e in p(n))
{
newEntry = append l to e;
add newEntry to p(n+1);
}
}
The first iteration is somewhat of a special case, since p-1 is undefined. You can simply initialize p0 as the list of all possible characters for the first character.
So, your 051 example:
Iteration 0:
p(0) = {N}
Iteration 1:
digit = 5
foreach({V, F})
{
foreach(p(0) = {N})
{
newEntry = N + V or N + F
p(1) = {NV, NF}
}
}
Iteration 2:
digit = 1
foreach({L})
{
foreach(p(1) = {NV, NF})
{
newEntry = NV + L or NF + L
p(2) = {NVL, NFL}
}
}
The form you want is probably something like:
function combinations( $str ){
$l = len( $str );
$results = array( );
if ($l == 0) { return $results; }
if ($l == 1)
{
foreach( $codes[ $str[0] ] as $code )
{
$results[] = $code;
}
return $results;
}
$cur = $str[0];
$combs = combinations( substr( $str, 1, $l ) );
foreach ($codes[ $cur ] as $code)
{
foreach ($combs as $comb)
{
$results[] = $code.$comb;
}
}
return $results;}
This is ugly, pidgin-php so please verify it first. The basic idea is to generate every combination of the string from [1..n] and then prepend to the front of all those combinations each possible code for str[0]. Bear in mind that in the worst case this will have performance exponential in the length of your string, because that much ambiguity is actually present in your coding scheme.
The trick is not only to generate all possible letter combinations that match a given number, but to select the letter sequence that is most easy to remember. A suggestion would be to run the soundex algorithm on each of the sequence and try to match against an English language dictionary such as Wordnet to find the most 'real-word-sounding' sequences.

Resources