SVM training performance

I'm using LIBSVM to train a simple SVM over the MNIST dataset, which contains 60,000 training examples. However, I have a serious performance issue: training seems to be endless (after a few hours I had to kill it by hand, because it wasn't responding). My code is very simple; I just call ovrtrain on the dataset without specifying any kernel or special constants:
function features = readFeatures(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 4, "int32", 0, "ieee-be");
  if header(1) ~= 2051
    fprintf("Wrong magic number!");
  end
  M = header(2);
  rows = header(3);
  columns = header(4);
  features = fread(fid, [M, rows*columns], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

function labels = readLabels(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 2, "int32", 0, "ieee-be");
  if header(1) ~= 2049
    fprintf("Wrong magic number!");
  end
  M = header(2);
  labels = fread(fid, [M, 1], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction
labels = readLabels("train-labels.idx1-ubyte");
features = readFeatures("train-images.idx3-ubyte");
model = ovrtrain(labels, features, "-t 0"); % doesn't respond...
My question: is this normal? I'm running it on Ubuntu in a virtual machine. Should I just wait longer?

I don't know whether you've already found your answer, but let me tell you what I suspect about your situation. 60,000 examples is not a lot for a powerful trainer like LibSVM. Currently I am working with a training set of 6,000 examples and it takes 3 to 5 seconds to train. However, parameter selection matters, and that is probably what is taking so long. If the number of unique features in your data set is very high, then every example will contain lots of zero values for the features it doesn't have. If the tool scales your training set, most of those zeros will probably be scaled to some non-zero value, leaving you with an enormous number of unique, non-zero features for each and every example. That makes it very, very hard for an SVM tool to dig in and extract efficient parameter values.
Long story short, if you have done enough research on SVM tools and understand what I mean, either assign parameter values in the training command before executing it, or find a way to decrease the number of unique features. If you haven't, go download the latest version of LibSVM and read the README files as well as the FAQ on the tool's website.
If none of these is the case, then sorry for taking your time :) Good luck.

It might be an issue of convergence, given the characteristics of your data.
Check the kernel you have as the default selection and change it. Also check the stopping criterion of the package. Additionally, if you are looking for a faster implementation, check MSVMpack, which is a parallel implementation of SVM.
Finally, feature selection is desirable in your case. You can end up with a good feature subset of almost half of what you have. In addition, you only need a portion of the data for training, e.g. 60-70% is sufficient.

First of all, 60k is a lot of data for training. Training that much data with a linear kernel will take a very long time unless you have serious computing power. Also, you have selected a linear kernel function (degree 1). It's better to use a Gaussian or higher-degree polynomial kernel (a degree-4 polynomial used with the same dataset showed good training accuracy). Try adding the LIBSVM options for -c (cost), -m (memory cache size) and -e (epsilon, the tolerance of the termination criterion, default 0.001). First run 1000 samples with a Gaussian or degree-4 polynomial kernel and compare the accuracy.
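To illustrate the "first try a small subsample with a Gaussian kernel" advice, here is a minimal Python sketch using scikit-learn as a stand-in for LIBSVM (the question's Octave/ovrtrain setup is different; fetch_openml and the parameter values here are assumptions for illustration only):

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

# load MNIST (70,000 examples); this download can take a while
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# take a random 1000-example subset for a quick baseline
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1000, replace=False)
X_small, y_small = X[idx] / 255.0, y[idx]  # scale pixels to [0, 1]

# RBF (Gaussian) kernel; cache_size and tol roughly mirror LIBSVM's -m and -e
clf = SVC(kernel="rbf", C=1.0, cache_size=500, tol=1e-3)
clf.fit(X_small, y_small)
print("training accuracy:", clf.score(X_small, y_small))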

Related

How to limit the number of digits of a function's result in Julia

My function has to work with very big numbers, so I used things like big() in parts of my code. Unfortunately this gives me a result that is more precise than I need (in other words, it's slowing the entire code down).
This is what the result looks like:
ΔE = 0.08298347005140644564908076516066986088852555871299296314640532293721884964540988
If possible I would like to limit the result to 4 digits
ΔE = 0.0829
If performance is a concern, probably the best way to do this is with https://github.com/dzhang314/MultiFloats.jl, e.g.
using MultiFloats
x = Float64x4(2.0)
# Calculations performed on x will have Float64x4 precision subsequently...
MultiFloats.jl appears to be the fastest package around at present for such calculations, and will let you choose from precision levels between Float64x2 and Float64x8. In any event, this will be dramatically faster than the BigFloats used in the example above.

My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?

I'm training a Doc2Vec model using the below code, where tagged_data is a list of TaggedDocument instances I set up before:
max_epochs = 40

model = Doc2Vec(alpha=0.025,
                min_alpha=0.001)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.001
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")
When I later check the model results, they're not good. What might have gone wrong?
Do not call .train() multiple times in your own loop that tries to do alpha arithmetic.
It's unnecessary, and it's error-prone.
Specifically, in the above code, decrementing the original 0.025 alpha by 0.001 forty times results in a final alpha of (0.025 - 40*0.001) = -0.015, which would also have been negative for many of the training epochs. But a negative alpha learning-rate is nonsensical: it essentially asks the model to nudge its predictions a little bit in the wrong direction, rather than a little bit in the right direction, on every bulk training update. (Further, since model.iter is by default 5, the above code actually performs 40 * 5 training passes – 200 – which probably isn't the conscious intent. But that will just confuse readers of the code & slow training, not totally sabotage results, like the alpha mishandling.)
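To make the arithmetic concrete, here is a trivial standalone sketch of what the loop above does to alpha (not part of the original code):

alpha = 0.025
for epoch in range(40):
    alpha -= 0.001  # same decrement as in the question's loop
print(alpha)  # roughly -0.015; alpha is already negative for the later epochs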
There are other variants of this error that are common as well. If the alpha were instead decremented by 0.0001, the 40 decrements would only reduce the final alpha to 0.021 – whereas the proper practice for this style of SGD (Stochastic Gradient Descent) with linear learning-rate decay is for the value to end "very close to 0.000". If users start tinkering with max_epochs – it is, after all, a parameter pulled out on top! – but don't also adjust the decrement every time, they are likely to far-undershoot or far-overshoot 0.000.
So don't use this pattern.
Unfortunately, many bad online examples have copied this anti-pattern from each other, and make serious errors in their own epochs and alpha handling. Please don't copy their error, and please let their authors know they're misleading people wherever this problem appears.
The above code can be improved with the much-simpler replacement:
max_epochs = 40
model = Doc2Vec() # of course, if non-default parameters needed, use them here
# most users won't need to change alpha/min_alpha at all
# but many will want to use more than default `epochs=5`
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)
model.save("d2v.model")
Here, the .train() method will do exactly the requested number of epochs, smoothly reducing the internal effective alpha from its default starting value to near-zero. (It's rare to need to change the starting alpha, but even if you wanted to, just setting a new non-default value at initial model-creation is enough.)
Also note that later calls to infer_vector() will reuse the epochs specified at the time of model creation. If nothing is specified, the default epochs=5 will be used, which is often smaller than is best for training or inference. So if you find a larger number of epochs (such as 10, 20 or more) is better for training, remember to also use at least the same number of epochs for inference. (.infer_vector() takes an optional epochs parameter which can override any value set at model construction.)
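For example, a minimal inference sketch (assuming the simplified model above has been trained; the token list is made up):

# tokens of a new, unseen document
new_doc_words = "some new text to vectorize".split()

# use at least as many epochs as were used for training
vector = model.infer_vector(new_doc_words, epochs=max_epochs)
print(vector[:5])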

Single-community detection algorithm

Here is a description of my dataset:
A large social network composed of Twitter accounts: followers of a few very large related accounts, followers of those followers, and followers of these followers, cleaned at every iteration for bot accounts, private accounts, etc.
Total nodes: around 500,000
Total connections: 95 million
4 nodes have more than 3 million connections
567 nodes have more than 100,000 connections
Half of the dataset has 3 or fewer connections
This said, I want to clean this network in order to get the "best" single community out of the raw initial graph, before any further clustering into sub-communities. Keep in mind these few facts:
Due to the way the data is collected, I know there is one large community common to most of these nodes that is more cohesive than the network as a whole.
I would like to get an optimal single sub-network of the initial network, getting rid of all the nodes that don't belong to the largest possible common community.
Further study will consist in splitting this community into several communities, following the general community-detection literature, but that is not what I want to do here.
I have used community-detection algorithms such as Louvain or modularity optimization (on a smaller subsample for the second one, which is too computationally expensive), but the goal of these algorithms is to find the best split, while my goal is in some ways to find the best merge.
The main problem can be summarized by this idea: I was considering the following algorithm. Start with the large network; remove the "weakest" node at every iteration; continue while the modularity of the whole improves. But this will lead to a very tiny community at the end.
Do you have any directions on where to look? A way to change the methodology of an existing algorithm? Or even a paper related to this issue, even if fairly different?
Thank you.
Here you can try several approaches. The size of your network is challenging; not all community detection methods are capable of running in reasonable time on such a large network. You could try the methods that have adjustable parameters, and empirically find out how those parameters affect their resolution. At certain values you can expect the core network to be covered by one cluster. For example, there are the walktrap and spinglass methods in igraph. If you change the number of steps in walktrap, you can observe a change in the size of the largest community:
g <- barabasi.game(n = 10000, m = 2)

steps <- seq(1, 10, 1)
steps <- c(steps, seq(11, 200, 10))

w <- list()
ccount <- NULL
clargest <- NULL

for(s in steps){
    cat(paste('Running walktrap with steps =', s, '\n'))
    w0 <- walktrap.community(g, steps = s)
    ccount <- c(ccount, length(levels(as.factor(w0$membership))))
    clargest <- c(clargest, max(tapply(w0$membership, w0$membership, length)))
    w[[s]] <- w0
}

plot(ccount ~ steps,
     xlab = 'Number of steps',
     ylab = 'Number of communities',
     main = 'Walktrap with increasing number of steps')

plot(clargest ~ steps,
     xlab = 'Number of steps',
     ylab = 'Size of largest community',
     main = 'Walktrap with increasing number of steps')
Similarly with changing the gamma parameter for spinglass:
gamma <- c(0.05, 0.1, 0.2, 0.5, 0.7, 0.85, 1.0, 1.2, 1.5, 1.8, 2.0, 2.5,
           3.0, 3.5, 5.0, 10.0, 20.0, 50.0, 100.0, 500.0)

sg <- list()
sgsize <- NULL

for(gm in gamma){
    cat(paste('Running spinglass with gamma =', gm, '\n'))
    sg0 <- spinglass.community(g, vertex = 1, gamma = gm)
    sgsize <- c(sgsize, length(sg0$community))
    sg[[as.character(gm)]] <- sg0
}

plot(sgsize ~ log10(gamma),
     xlab = 'Gamma (log)',
     ylab = 'Size of the community',
     main = 'Spinglass with increasing value of gamma',
     xaxt = 'n')
Another method, infomap, is, according to its description, designed exactly for problems like yours. You may want to use not the igraph implementation but the original, as the latter gives more freedom in setting parameters. There are also Python implementations, but I don't know how flexible those are.
You can also try the moduland method family. Here you can choose between four landscape-building methods: nodeland, linkland, perturland and edgeweight; and two hill-detection methods: total_hill and proportional_hill; in addition, for perturland you can set a parameter x. Please read the papers for more information. As I mentioned in a comment, you could inspect the affinities and set a threshold to select your core network. These methods have no Python interface, but you can simply export a text file, call the binaries via subprocess, and read their output back into Python.
For an overview of a large number of other methods, see here from page 52 (this is no longer up to date, but it is comprehensive).
Another idea is that you could run a number of methods and, by comparing their results, find the core network as a large partition delimited by cluster boundaries that stay constant across the different methods. It is also a question of how exact a solution you need. Considering your data is quite noisy, you can likely expect thousands of nodes to be misclassified by any method. For comparing different clusterings, you could use something like normalized mutual information, which is implemented in igraph (see more here).
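As a small illustration of comparing two clusterings with normalized mutual information, here is a sketch in Python using python-igraph (the random Barabasi graph is just a stand-in for your Twitter network, and the parameter values are arbitrary):

import igraph as ig

# toy stand-in graph; replace with your real network
g = ig.Graph.Barabasi(n=10000, m=2)

# two different community detection runs
walktrap = g.community_walktrap(steps=4).as_clustering()
infomap = g.community_infomap()

# normalized mutual information between the two membership vectors
nmi = ig.compare_communities(walktrap.membership, infomap.membership, method="nmi")
print("NMI between walktrap and infomap:", nmi)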

I need help optimizing this compression algorithm I came up with on my own

I tried coming up with a compression algorithm. I know a little bit about compression theory, so I am aware that this scheme I have come up with may very well never achieve compression at all.
Currently it works only for strings with no consecutive repeating letters/digits/symbols. Once it is properly established, I hope to extrapolate it to binary data, etc. But first the algorithm:
Assuming there are only 4 letters a, b, c, d, we create a matrix/array corresponding to the letters. Whenever a letter is encountered, the corresponding index is incremented so that the index of the last letter encountered is always the largest. We increment an index by 2 if it was originally zero. If it was not originally zero, then we increment it by 2 plus the largest element currently in the matrix (equivalently, the second-largest element after the update). An example to clarify:
Array = [a,b,c,d]
Initial state = [0,0,0,0]
Letter = a
New state = [2,0,0,0]
Letter = b
New state = [2,4,0,0]
... (letters c and d are handled the same way) ...
New state = [2,4,6,8]
Letter = a
New state = [12,4,6,8]
// Explanation for the above state: 12, because Largest - Second Largest - 2 = Old value (12 - 8 - 2 = 2)
Letter = d
New state = [12,4,6,22]
and so on...
Decompression is just this logic in reverse.
A rudimentary implementation of compression (in Python):
(This function is very rudimentary, so it is not the best kind of code... I know. I can optimize it once I get the core algorithm right.)
import copy

def compress(text):
    matrix = [0]*95  # we are concerned with 95 printable chars for now
    for i in text:
        temp = copy.deepcopy(matrix)
        temp.sort()
        largest = temp[-1]
        if matrix[ord(i)-32] == 0:
            matrix[ord(i)-32] = largest+2
        else:
            matrix[ord(i)-32] = largest+matrix[ord(i)-32]+2
    return matrix
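For completeness, here is one possible sketch of the reverse (decompression) logic, under the same assumptions as above (no consecutive repeats, 95 printable characters starting at space). It relies on the fact that each new value equals the old value plus the previous maximum plus 2, so the previous maximum is always the second-largest entry; this is my reading of the scheme, not code from the original author:

def decompress(matrix):
    matrix = list(matrix)  # work on a copy
    chars = []
    while any(matrix):
        # the largest value marks the most recently encountered character
        i = max(range(len(matrix)), key=lambda k: matrix[k])
        second = max(v for k, v in enumerate(matrix) if k != i)
        chars.append(chr(i + 32))
        # undo the update: new = old + previous_max + 2
        matrix[i] = matrix[i] - second - 2
    return ''.join(reversed(chars))

print(decompress(compress("abcda")))  # should print "abcda"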
The returned matrix is then used for decompression. Now comes the tricky part:
I can't really call this compression at all, because each number in the matrix generated by the function is of the order of 10**200 for a string of length 50,000. So storing the matrix actually takes more space than storing the original string. I know... totally useless. But I had hoped, prior to doing all this, that I could use the mathematical properties of the matrix to represent it effectively in some kind of mathematical shorthand. I have tried many possibilities and failed. Some things that I tried:
The rank of the matrix. Failed because it is not unique.
Representing it with the mod function. Failed because either the quotient or the remainder ends up just as large as the original numbers.
Storing each integer as a generator using pickle.
Storing the matrix as a bitmap file, but the integers are too large to be stored as color codes.
Let me reiterate that the algorithm could be optimized, e.g. instead of adding 2 we could add 1 and proceed, but that doesn't really result in any compression either. The same goes for the code; minor optimizations can come later. First I want to improve the main algorithm.
Furthermore, it is very likely that this product of a mediocre and idle mind like mine will never achieve compression at all. In that case, I would like your help and ideas on what it could be useful for.
TL;DR: Check the code above, which implements a compression algorithm. The compressed result is longer than the original string. Can this be fixed? If yes, how?
PS: I have the entire code on my PC. I will create a repo on GitHub and upload it in some time.
Compression is essentially a predictive process. Look for patterns in the input and use them to encode the more likely next character(s) more efficiently than the less likely. I can't see anything in your algorithm that tries to build a predictive model.

Time delay estimation between two audio signals

I have two audio recordings of the same signal made by 2 different microphones (for example, in WAV format), but one of them is recorded with a delay of, for example, several seconds.
It's easy to identify such a delay visually when viewing these signals in some kind of waveform viewer, i.e. just by spotting the first visible peak in each signal and checking that they have the same shape:
[waveform screenshot omitted; source: greycat.ru]
But how do I do it programmatically, i.e. find out what this delay (t) is? The two digitized signals are slightly different (because the microphones are different, were at different positions, due to ADC setups, etc.).
I've dug around a bit and found out that this problem is usually called "time-delay estimation", and there are myriad approaches to it; here, for example, is one of them.
But are there any simple and ready-made solutions, such as a command-line utility, a library, or a straightforward algorithm?
Conclusion: I found no simple implementation and wrote a simple command-line utility myself - available at https://bitbucket.org/GreyCat/calc-sound-delay (GPLv3-licensed). It implements the very simple search-for-maximum algorithm described on Wikipedia.
The technique you're looking for is called cross-correlation. It's a very simple, if somewhat compute-intensive, technique that can be used for solving various problems, including measuring the time difference (aka lag) between two similar signals (the signals do not need to be identical).
If you have a reasonable idea of your lag value (or at least the range of lag values to expect), you can reduce the total amount of computation considerably. Ditto if you can put a definite limit on how much accuracy you need.
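A minimal sketch of the idea in Python, assuming two mono WAV files at the same sample rate (the file names are placeholders); it uses scipy's cross-correlation and picks the lag with the highest correlation:

import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, correlation_lags

rate_a, a = wavfile.read("mic_a.wav")  # placeholder file names
rate_b, b = wavfile.read("mic_b.wav")
assert rate_a == rate_b

a = a.astype(np.float64)
b = b.astype(np.float64)

corr = correlate(a, b, mode="full")
lags = correlation_lags(len(a), len(b), mode="full")
lag = lags[np.argmax(corr)]

# the sign of the lag tells which recording starts later; check the
# convention against a known example before relying on it
print("estimated delay: %.3f s" % (lag / rate_a))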
Having had the same problem, and having had no success finding a tool to sync the start of video/audio recordings automatically, I decided to make syncstart (github).
It is a command-line tool. The basic code behind it is this:
import numpy as np
from scipy import fft
from scipy.io import wavfile

# in1, in2 are the paths of the two WAV files to compare
r1, s1 = wavfile.read(in1)
r2, s2 = wavfile.read(in2)
assert r1 == r2, "syncstart normalizes using ffmpeg"
fs = r1

ls1 = len(s1)
ls2 = len(s2)
padsize = ls1 + ls2 + 1
padsize = 2 ** (int(np.log(padsize) / np.log(2)) + 1)

s1pad = np.zeros(padsize)
s1pad[:ls1] = s1
s2pad = np.zeros(padsize)
s2pad[:ls2] = s2

# circular cross-correlation via FFT
corr = fft.ifft(fft.fft(s1pad) * np.conj(fft.fft(s2pad)))
ca = np.absolute(corr)
xmax = np.argmax(ca)

if xmax > padsize // 2:
    file, offset = in2, (padsize - xmax) / fs
else:
    file, offset = in1, xmax / fs
A very straightforward thing to do is just to check whether the peaks exceed some threshold; the time between the high peak on line A and the high peak on line B is probably your delay. Just try tinkering a bit with the thresholds, and if the graphs are usually as clear as the picture you posted, you should be fine.
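A tiny sketch of that threshold approach (again assuming mono WAVs at a common sample rate; the file names are placeholders and the threshold value is something you'd tune by hand):

import numpy as np
from scipy.io import wavfile

rate_a, a = wavfile.read("mic_a.wav")  # placeholder file names
rate_b, b = wavfile.read("mic_b.wav")

threshold = 10000  # tune to your recordings' amplitude scale
first_a = np.argmax(np.abs(a) > threshold)  # index of first sample above threshold
first_b = np.argmax(np.abs(b) > threshold)

print("estimated delay: %.3f s" % ((first_b - first_a) / rate_a))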

Resources