Efficient kNN algorithm

I'm trying to implement a kNN algorithm in R that operates on one-dimensional vectors, but one which differs from the standard one slightly: in case of a tie it takes the smaller element (the distance is just the absolute value of the difference between the attributes). More precisely, I'm trying to find the k numbers closest to a given number, and when there are ties I want the smaller number to be chosen.
Sounds simple, but my algorithm takes a couple of seconds to finish, whereas the one in the class package (knn) outputs an answer immediately (though it takes all elements in case of a tie, or random ones)... Mine is the following:
1. I sample a training sample and sort it increasingly.
2. I take an element (a number).
3. I search for the first position at which that element becomes less than some number in the training sample.
4. I take 2k+1 numbers from the training sample: k to the left of the position found in step 3 and k to the right (if there are fewer than k such numbers, I take as many as I can).
5. I calculate the distances of the chosen elements to the one taken in step 2 and sort them, together with the corresponding elements, increasingly (so that the elements and their distances are ordered increasingly).
6. I take the first k elements from the list created in step 5 (so that no two have the same distance).
But boy, it takes 6 or 7 seconds to finish... Do you have any ideas for an improvement? (It's not an R-specific question; it just happened that I did it in R.)
Edit. The code:
dec <- function(u, x, k) {
  ## u is the training sample, sorted increasingly
  ## x is an object for classification
  ## k is the knn parameter
  n <- length(u)
  knn <- list()
  div <- 0
  for (i in seq_along(u)) {
    if (x < u[i]) {
      div <- i  ## first position where x becomes smaller (was never set in the original)
      break
    }
  }
  if (div == 0) {
    ## x is at least as large as every training value: take the k largest
    distances <- array(0, dim = c(2, k))
    for (j in 1:k) {
      distances[1, j] <- u[n - j + 1]
      distances[2, j] <- abs(u[n - j + 1] - x)
    }
  } else {
    ## take up to k neighbours on each side, clamped to the array bounds
    end1 <- min(div + k, n)
    end2 <- max(div - k, 1)
    distances <- array(0, dim = c(2, end1 - end2 + 1))
    a <- 1
    for (j in u[end2:end1]) {
      distances[1, a] <- j
      distances[2, a] <- abs(j - x)
      a <- a + 1
    }
  }
  ## order by distance, breaking ties by the (smaller) element itself
  distances <- t(distances)
  distances <- distances[order(distances[, 2], distances[, 1]), ]
  distances <- t(distances)
  for (i in 1:k) {
    if (i > 1 && distances[1, i - 1] != distances[1, i])
      knn[i] <- distances[1, i]
  }
  ## and something later...
}

kNN in 1D is straightforward.
Sort the values increasingly. To perform a query, locate the value in the sorted sequence by dichotomic search. Then find the k closest values by stepping to the closest on either side (smaller or larger) k times.
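That answer can be sketched in Python (not R; `bisect` performs the dichotomic search, and the function name is mine), with ties resolved toward the smaller value as the question requires:

```python
import bisect

def knn_1d(u, x, k):
    """k nearest neighbours of x in the sorted list u; ties go to the smaller value."""
    hi = bisect.bisect_left(u, x)  # first index with u[hi] >= x
    lo = hi - 1
    out = []
    for _ in range(min(k, len(u))):
        # prefer the left candidate on a distance tie: it is the smaller element
        if hi >= len(u) or (lo >= 0 and x - u[lo] <= u[hi] - x):
            out.append(u[lo])
            lo -= 1
        else:
            out.append(u[hi])
            hi += 1
    return out

print(knn_1d([1, 2, 3, 4, 5], 3, 2))  # [3, 2]: 3 is exact, then 2 beats 4 on the tie
```

After the O(n log n) sort, each query costs O(log n + k), which is why this runs instantly where the quadratic scan does not.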

Related

Algorithm to form nested pairs from a number list

For example, given a number list, a sample output is:
0120383****2919
i.e. the pairs 1,1 2,2 3,3; the maximum number of pairs is 3.
How can an algorithm form the maximum number of nested pairs?
Similar to Longest Palindromic Subsequence, I guess. An O(n*n) solution exists; is that what you want?
The exact procedure is as follows. The problem can be solved using a recurrence relation:
T(i,j) = the length of the longest nested pairing in the sub-array [i,j].
The answer is then T(0,n-1), assuming the array has n elements indexed from 0 to n-1.
The recurrence:
if (i >= j)            T(i,j) = 0
if (arr[i] == arr[j])  T(i,j) = 2 + T(i+1,j-1)
else                   T(i,j) = max(T(i+1,j), T(i,j-1))
Now you can write either a recursive solution or a bottom-up DP for this recurrence. Note that while solving it you will also have to trace which segment gave the maximum answer; then you just traverse that segment and collect all matching pairs.
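That recurrence translates directly into a bottom-up DP; here is a sketch in Python (the function name is mine):

```python
def longest_nested(arr):
    """T(i, j) = length of the longest nested pairing within arr[i..j]."""
    n = len(arr)
    if n < 2:
        return 0
    T = [[0] * n for _ in range(n)]
    for span in range(2, n + 1):            # interval length, shortest first
        for i in range(n - span + 1):
            j = i + span - 1
            if arr[i] == arr[j]:
                T[i][j] = 2 + (T[i + 1][j - 1] if i + 1 <= j - 1 else 0)
            else:
                T[i][j] = max(T[i + 1][j], T[i][j - 1])
    return T[0][n - 1]

print(longest_nested([0, 1, 2, 0, 3, 8, 3, 2, 9, 1, 9]))  # 6: the pairs 1,1 2,2 3,3
```

The table fills in O(n*n) time and space, matching the complexity claimed above.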
Here is a working algorithm I wrote in R. While it works, it's overly verbose because I'm tired.
I can come back tomorrow and make it shorter if you need, but hopefully you can see the logic and make your own version in whatever language.
# Example data
num_list <- c(0,1,2,0,3,8,3,2,9,1,9)
# Declare empty vector for later
tmp <- numeric()
# Find out which numbers can be ruled out based on frequency
cnt <- as.data.frame(table(num_list))
# Keep pairs, fix data classes
for (i in unique(cnt$num_list)) {
  if (cnt$Freq[cnt$num_list == i] == 2) {
    tmp <- c(as.numeric(as.character(cnt$num_list[cnt$num_list == i])), tmp)
  }
}
num_list <- num_list[num_list %in% tmp]
# Figure out where the max (peak) number is, to cut the data
peak <- numeric()
for (i in 1:(length(num_list)-1)) {
  if (!is.na(num_list[i]) & num_list[i] == num_list[i+1]) {
    peak <- num_list[i]
  }
}
# Apply nesting filter to first half of data
drop <- numeric()
for (i in 1:(length(num_list)-1)) {
  if (!is.na(num_list[i]) & num_list[i] == peak) {
    break
  } else if (!is.na(num_list[i]) & num_list[i] > num_list[i+1]) {
    num_list[i+1] <- NA
  }
}
num_list <- num_list[!is.na(num_list)]
num_list <- num_list[!num_list %in% unique(num_list)[table(num_list) == 1]]
num_list.beg <- num_list[1:(max(which(num_list == peak)))]
num_list.end <- num_list[(max(which(num_list == peak))+1):length(num_list)]
# Apply nesting filter to second half of data
for (i in 1:(length(num_list.end)-1)) {
  if (!is.na(num_list.end[i]) & num_list.end[i] <= num_list.end[i+1]) {
    num_list.end[i+1] <- NA
  }
}
num_list.end <- num_list.end[!is.na(num_list.end)]
num_list <- c(num_list.beg, num_list.end)
# Sort them like you did in your desired output
sort(num_list)
# [1] 1 1 2 2 3 3

What's a more efficient implementation of this puzzle?

The puzzle
For every input number n (n < 10) there is an output number m such that:
m's first digit is n
m is an n digit number
every 2 digit sequence inside m must be a different prime number
The output should be m, where m is the smallest number that fulfils the conditions above. If there is no such number, the output should be -1.
Examples
n = 3 -> m = 311
n = 4 -> m = 4113 (note that this is not 4111 as that would be repeating 11)
n = 9 -> m = 971131737
My somewhat working solution
Here's my first stab at this, the "brute force" approach. I am looking for a more elegant solution as this is very inefficient as n grows larger.
public long GetM(int n)
{
    long start = n * (long)Math.Pow(10, n - 1);
    long end = (n + 1) * (long)Math.Pow(10, n - 1); // n-digit numbers whose first digit is n
    for (long x = start; x < end; x++)
    {
        long xCopy = x;
        bool allDigitsPrime = true;
        List<int> allPrimeNumbers = new List<int>();
        while (xCopy >= 10)
        {
            long lastDigitsLong = xCopy % 100;
            int lastDigits = (int)lastDigitsLong;
            bool lastDigitsSame = allPrimeNumbers.Count != 0 && allPrimeNumbers.Contains(lastDigits);
            if (!IsPrime(lastDigits) || lastDigitsSame)
            {
                allDigitsPrime = false;
                break;
            }
            xCopy /= 10;
            allPrimeNumbers.Add(lastDigits);
        }
        if (n != 1 && allDigitsPrime)
        {
            return x;
        }
    }
    return -1;
}
Initial thoughts on how this could be made more efficient
So, clearly the bottleneck here is traversing through the whole list of numbers that could fulfil this condition from n.... to (n+1).... . Instead of simply incrementing the number of every iteration of the loop, there must be some clever way of skipping numbers based on the requirement that the 2 digit sequences must be prime. For instance for n = 5, there is no point going through 50000 - 50999 (50 isn't prime), 51200 - 51299 (12 isn't prime), but I wasn't quite sure how this could be implemented or if it would be enough of an optimization to make the algorithm run for n=9.
Any ideas on this approach or a different optimization approach?
You don't have to try all numbers. You can instead use a different strategy, summed up as "try appending a digit".
Which digit? Well, a digit such that
it forms a prime together with your current last digit
the prime formed has not occurred in the number before
This should be done recursively (not iteratively), because you may run out of options and then you'd have to backtrack and try a different digit earlier in the number.
This is still an exponential time algorithm, but it avoids most of the search space because it never tries any numbers that don't fit the rule that every pair of adjacent digits must form a prime number.
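That recursive strategy can be sketched in Python (the function name and structure are mine, not from the answer). Trying digits in ascending order means the first complete number found is also the smallest:

```python
def smallest_m(n):
    """Smallest n-digit number starting with digit n whose adjacent digit
    pairs are all distinct two-digit primes; -1 if none exists."""
    if n == 1:
        return 1  # no two-digit sequences to constrain
    # trial division is enough to find primes below 100
    two_digit_primes = {p for p in range(10, 100)
                        if all(p % d for d in range(2, 10))}

    def extend(digits, used):
        if len(digits) == n:
            return int("".join(map(str, digits)))
        for d in range(10):  # ascending, so the first hit is the smallest
            p = digits[-1] * 10 + d
            if p in two_digit_primes and p not in used:
                m = extend(digits + [d], used | {p})
                if m is not None:
                    return m
        return None  # dead end: backtrack

    m = extend([n], set())
    return -1 if m is None else m

print(smallest_m(9))  # 971131737, matching the example
```

Since at most 21 two-digit primes exist and none may repeat, the search tree is tiny compared with scanning all n-digit numbers.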
Here's a possible solution in R, using recursion. It would be interesting to build a tree of all the possible paths.
# For every input number n (n < 10)
# there is an output number m such that:
# m's first digit is n
# m is an n digit number
# every 2 digit sequence inside m must be a different prime number
# Need to select the smallest m that meets the criteria
library('numbers')
mNumHelper <- function(cn, n, pr, cm=NULL) {
  if (cn == 1) {
    if (n == 1) {
      return(1)
    }
    firstDigit <- n
  } else {
    firstDigit <- mod(cm, 10)  ## last digit of cm so far: the next prime must start with it
  }
  possibleNextNumbers <- pr[floor(pr/10) == firstDigit]
  nPossible <- length(possibleNextNumbers)
  if (nPossible == 1) {
    nextPrime <- possibleNextNumbers
  } else {
    nextPrime <- min(possibleNextNumbers)
  }
  pr <- pr[which(pr != nextPrime)]
  if (is.null(cm)) {
    cm <- nextPrime
  } else {
    cm <- cm * 10 + mod(nextPrime, 10)
  }
  cn <- cn + 1
  if (cn < n) {
    cm <- mNumHelper(cn, n, pr, cm)
  }
  return(cm)
}
mNum <- function(n) {
  pr <- Primes(10, 100)
  mNumHelper(1, n, pr)
}
for (i in seq(1, 9)) {
  print(paste('i', i, 'm', mNum(i)))
}
Sample output
[1] "i 1 m 1"
[1] "i 2 m 23"
[1] "i 3 m 311"
[1] "i 4 m 4113"
[1] "i 5 m 53113"
[1] "i 6 m 611317"
[1] "i 7 m 7113173"
[1] "i 8 m 83113717"
[1] "i 9 m 971131737"
Solution updated to select the smallest prime from the set of available primes, and remove bad path check since it's not required.
I just made a list of the two-digit prime numbers, then solved the problem by hand; it took only a few minutes. Not every problem requires a computer!

Scala fast generation of upper triangular matrix coordinates

As a first attempt consider
for (a <- 1 to 5; b <- 1 to 5; if a < b) yield (a,b)
which gives
Vector((1,2), (1,3), (1,4), (1,5),
(2,3), (2,4), (2,5),
(3,4), (3,5),
(4,5))
Only half of the values for b have effect, hence
for (a <- 1 to 5; b <- a+1 to 5) yield (a,b)
also delivers the same upper triangular matrix coordinates.
The question, though, is whether there are faster approaches to generating this vector of coordinates.
Many thanks.
The best you can do is stick everything in an Array and create the elements in a while loop (or recursively) to avoid any overhead from the generic machinery of for. (Actually, you'd be even faster with two arrays, one for each index.)
val a = {
  val temp = new Array[(Int, Int)](5*4/2)
  var k = 0
  var i = 1
  while (i <= 5) {
    var j = i+1
    while (j <= 5) {
      temp(k) = (i,j)
      j += 1
      k += 1
    }
    i += 1
  }
  temp
}
But you shouldn't go to all this trouble unless you have good reason to believe that your other method isn't working adequately.
You've titled this "parallel processing", but you're probably going to tax your memory subsystem so heavily that parallelization isn't going to help you much. But of course you can always split up some of the lines onto different processors. You need something way, way larger than 5 for that to be a good idea.
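For reference, the same upper-triangular coordinate set is a stock one-liner in Python (a sketch outside the Scala question, assuming the same 1-based indices):

```python
import itertools

# all (a, b) with 1 <= a < b <= 5, in the same order as the Scala for-comprehension
pairs = list(itertools.combinations(range(1, 6), 2))
print(pairs[:4])   # [(1, 2), (1, 3), (1, 4), (1, 5)]
print(len(pairs))  # 10, i.e. 5*4/2
```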

Optimal way of iterating through a list of lists to maximise unique outputs

I have got a list of lists where the content is a vector of characters. For example:
yoda <- list(a=list(c("A","B","C"), c("B","C","D")), b=list(c("D","C"), c("B","C","D","E","F")))
This is a much shorter version than what I am actually working on: in my case there are 11 list members, each having about 12 sublists. For each of the list members I need to pick one sub-member list, e.g. one list for "a" and one list for "b". I would like to find which combination of sublists gives the greatest number of unique values; in this simple example it would be the first sublist in "a" and the second sublist in "b", giving a final answer of:
c("A","B","C","D","E","F")
At the moment I have just got a huge number of nested loops and it seems to be taking for ever. Here is the poor bit of code:
res <- list()
for (a in 1:length(extra.pats[[1]])) {
 for (b in 1:length(extra.pats[[2]])) {
  for (c in 1:length(extra.pats[[3]])) {
   for (d in 1:length(extra.pats[[4]])) {
    for (e in 1:length(extra.pats[[5]])) {
     for (f in 1:length(extra.pats[[6]])) {
      for (g in 1:length(extra.pats[[7]])) {
       for (h in 1:length(extra.pats[[8]])) {
        for (i in 1:length(extra.pats[[9]])) {
         for (j in 1:length(extra.pats[[10]])) {
          for (k in 1:length(extra.pats[[11]])) {
           ## note: unique() takes a single vector, hence the c(...)
           res[[paste(a,b,c,d,e,f,g,h,i,j,k, sep="_")]] <-
            unique(c(extra.pats[[1]][[a]], extra.pats[[2]][[b]], extra.pats[[3]][[c]],
                     extra.pats[[4]][[d]], extra.pats[[5]][[e]], extra.pats[[6]][[f]],
                     extra.pats[[7]][[g]], extra.pats[[8]][[h]], extra.pats[[9]][[i]],
                     extra.pats[[10]][[j]], extra.pats[[11]][[k]]))
          }
         }
        }
       }
      }
     }
    }
   }
  }
 }
}
If anyone has got any ideas how to do this properly that would be great.
Here's a proposal:
# create all possible combinations
comb <- expand.grid(yoda)
# find unique values for each combination
uni <- lapply(seq(nrow(comb)), function(x) unique(unlist(comb[x, ])))
# count the unique values
len <- lapply(uni, length)
# extract longest combination
uni[which.max(len)]
[[1]]
[1] "A" "B" "C" "D" "E" "F"
Your current problem dimensions prohibit an exhaustive search. Here is an example of a suboptimal algorithm. While simple, maybe you'll find that it gives you "good enough" results.
The algorithm goes as follows:
Look at your first list: pick the item with the highest number of unique values.
Look at the second list: pick the item that brings the highest number of new unique values in addition to what you already selected in step 1.
repeat until you have reached the end of your list.
The code:
good.cover <- function(top.list) {
  selection <- vector("list", length(top.list))
  num.new.unique <- function(x, y) length(setdiff(y, x))
  for (i in seq_along(top.list)) {
    score <- sapply(top.list[[i]], num.new.unique, x = unlist(selection))
    selection[[i]] <- top.list[[i]][which.max(score)]
  }
  selection
}
Let's make up some data:
items.universe <- apply(expand.grid(list(LETTERS, 0:9)), 1, paste, collapse = "")
random.length <- function()sample(3:6, 1)
random.sample <- function(i)sample(items.universe, random.length())
random.list <- function(i)lapply(letters[1:12], random.sample)
initial.list <- lapply(1:11, random.list)
Now run it:
system.time(final.list <- good.cover(initial.list))
# user system elapsed
# 0.004 0.000 0.004
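The same greedy heuristic can be sketched in Python (names are mine; `good_cover` mirrors the R function above):

```python
def good_cover(top_list):
    """From each inner list, greedily pick the sublist adding the most new unique items."""
    selected = []
    covered = set()
    for sublists in top_list:
        best = max(sublists, key=lambda s: len(set(s) - covered))
        selected.append(best)
        covered |= set(best)
    return selected

yoda = [[["A", "B", "C"], ["B", "C", "D"]],
        [["D", "C"], ["B", "C", "D", "E", "F"]]]
print(sorted(set().union(*map(set, good_cover(yoda)))))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Like the R version, this is a heuristic: it runs in linear time over the sublists but does not guarantee the optimal combination.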

Generate Random(a, b) making calls to Random(0, 1)

There is a known Random(0,1) function; it is a uniform random function, meaning it returns 0 or 1, each with probability 50%. Implement Random(a, b) making calls only to Random(0,1).
What I thought of so far: put the range a..b in a 0-based array, so I have indices 0, 1, 2, ..., b-a,
then call RANDOM(0,1) b-a times and sum the results as the generated index, and return that element.
However, since there is no answer in the book, I don't know if this way is correct or the best. How can I prove that the probability of returning each element is exactly the same, namely 1/(b-a+1)?
And what is the right/better way to do this?
If your RANDOM(0, 1) returns either 0 or 1, each with probability 0.5, then you can generate bits until you have enough to represent the number (b-a+1) in binary. This gives you a random number in a slightly too large range: you can test and repeat if it fails. Something like this (in Python):
import math

def rand_pow2(bit_count):
    """Return a random number with the given number of bits."""
    result = 0
    for i in range(bit_count):
        result = 2 * result + RANDOM(0, 1)
    return result

def random_range(a, b):
    """Return a random integer in the closed interval [a, b]."""
    bit_count = int(math.ceil(math.log2(b - a + 1)))
    while True:
        r = rand_pow2(bit_count)
        if a + r <= b:
            return a + r
When you sum random numbers, the result is no longer evenly distributed: it looks like a Gaussian. Look up the central limit theorem or read any probability book or article. Just as flipping a coin 100 times is highly unlikely to give 100 heads; it's likely to give close to 50 heads and 50 tails.
Your inclination to map the range down to 0..b-a first is correct; however, you cannot do it by summing bits as you stated (that gives a binomial, not a uniform, distribution). Write m = b-a+1 and find the smallest exponent e with m <= 2^e. Then find the largest multiple of m that is at most 2^e, say k*m. Finally, generate e bits with RANDOM(0,1) and read them as the base-2 expansion of some number x; if x < k*m, return x modulo m, otherwise try again. The program looks something like this (note that ^ is XOR in C, so powers of two are written with shifts):
int random_m(int m) {  /* uniform on 0 .. m-1 */
    /* find the smallest e such that m <= 2^e */
    int e = 0;
    while (m > (1 << e)) {
        ++e;
    }
    /* find the largest k such that k*m <= 2^e */
    int k = (1 << e) / m;
    while (1) {
        /* generate a random e-bit number in base 2 */
        int x = 0;
        for (int i = 0; i < e; ++i) {
            x = x*2 + RANDOM(0, 1);
        }
        /* if x isn't too large, return x modulo m */
        if (x < k * m)
            return (x % m);
    }
}
Now you can simply add a to the result to get uniformly distributed numbers between a and b.
Divide and conquer can help generate a random number in the range [a,b] using random(0,1). The idea:
if a is equal to b, then the random number is a
find the mid of the range [a,b]
generate random(0,1)
if the above is 0, return a random number in the range [a,mid] using recursion
else return a random number in the range [mid+1,b] using recursion
One caveat: when the range size b-a+1 is not a power of two, the two halves are not always equal in size, so the result is only approximately uniform.
The working 'C' code is as follows.
int random(int a, int b)
{
    if (a == b)
        return a;
    int c = RANDOM(0, 1);  /* returns 0 or 1 with probability 0.5 */
    int mid = a + (b - a) / 2;
    if (c == 0)
        return random(a, mid);
    else
        return random(mid + 1, b);
}
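The distribution this recursion produces can be checked exactly. The sketch below (mine, in Python) computes each value's probability and shows that for a range of size 3 the result is biased, while a power-of-two range comes out uniform:

```python
def dc_probs(a, b):
    """Exact probability of each value under the divide-and-conquer scheme."""
    if a == b:
        return {a: 1.0}
    mid = a + (b - a) // 2
    out = {}
    # each half is entered with probability 0.5, then recurses independently
    for half in (dc_probs(a, mid), dc_probs(mid + 1, b)):
        for v, p in half.items():
            out[v] = out.get(v, 0.0) + 0.5 * p
    return out

print(dc_probs(0, 2))  # {0: 0.25, 1: 0.25, 2: 0.5} -- not uniform
print(dc_probs(0, 3))  # {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25} -- uniform
```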
If you have an RNG that returns {0, 1} with equal probability, you can easily create an RNG that returns numbers in {0, ..., 2^n - 1} with equal probability.
To do this you just use your original RNG n times and read off a binary number like 0010110111. Each of the numbers from 0 to 2^n - 1 is equally likely.
Now it is easy to get an RNG from a to b where b - a + 1 = 2^n: you just build the previous RNG and add a to its output.
The last question is what to do when b - a + 1 is not a power of two.
The good thing is that you have to do almost nothing extra, by relying on the rejection sampling technique: if you have an RNG over a big set and need an element from a subset of it, you can keep drawing elements from the bigger set and discarding them until one lands in your subset.
So all you do is find the first n such that b - a + 1 <= 2^n, then sample with rejection until you pick an element smaller than b - a + 1. Then you just add a.
