Theano Classification Task always gives 50% validation error and test error? - sentiment-analysis

I am doing a text classification experiment with Theano's DBN (Deep Belief Network) and SDA (Stacked Denoising Autoencoder) examples.
I have produced a feature/label dataset in the same format as Theano's MNIST dataset and changed the function parameters of those examples to match my data (2 output classes instead of 10, and the number of input features adapted to my dataset).
Every time I run the experiments (both DBN and SDA) I get exactly 50% validation error and test error.
Do you have any idea what I'm doing wrong? I have simply produced a dataset from the Movie Review Dataset in the MNIST dataset format and pickled it.
My DBN code is the same code you can find at http://www.deeplearning.net/tutorial/DBN.html, and my SDA code is the same code you can find at http://www.deeplearning.net/tutorial/SdA.html.
The only difference is that I use my own dataset instead of the MNIST digit recognition dataset. My dataset consists of Bag-of-Words features from the Movie Review Dataset, which of course has a different number of features and output classes, so I have only made small modifications to the function parameters (number of inputs and number of output classes).
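For reference, here is a minimal sketch of the three-way (train, valid, test) pickle layout that the tutorial's load_data() reads from mnist.pkl.gz; the file name, split sizes, and random placeholder features below are illustrative only, not the poster's actual preprocessing:

import gzip
import pickle                      # the Theano tutorials use cPickle on Python 2
import numpy as np

n_features = 5000                  # e.g. the Bag-of-Words vocabulary size

def make_split(n_examples):
    # placeholder features and labels; real code would fill in the BoW vectors
    x = np.random.rand(n_examples, n_features).astype('float32')
    y = np.random.randint(0, 2, size=n_examples).astype('int64')   # 2 classes
    return x, y

train_set, valid_set, test_set = make_split(1400), make_split(200), make_split(200)

with gzip.open('movie_review.pkl.gz', 'wb') as f:
    pickle.dump((train_set, valid_set, test_set), f)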
The code runs beautifully but the results are always 50%.
This is a sample output:
Pre-training layer 2, epoch 77, cost -11.8415031463
Pre-training layer 2, epoch 78, cost -11.8225591118
Pre-training layer 2, epoch 79, cost -11.8309999005
Pre-training layer 2, epoch 80, cost -11.8362189546
Pre-training layer 2, epoch 81, cost -11.8251214285
Pre-training layer 2, epoch 82, cost -11.8333494168
Pre-training layer 2, epoch 83, cost -11.8564580976
Pre-training layer 2, epoch 84, cost -11.8243052414
Pre-training layer 2, epoch 85, cost -11.8373403275
Pre-training layer 2, epoch 86, cost -11.8341470443
Pre-training layer 2, epoch 87, cost -11.8272021013
Pre-training layer 2, epoch 88, cost -11.8403720434
Pre-training layer 2, epoch 89, cost -11.8393612003
Pre-training layer 2, epoch 90, cost -11.828745041
Pre-training layer 2, epoch 91, cost -11.8300890796
Pre-training layer 2, epoch 92, cost -11.8209189065
Pre-training layer 2, epoch 93, cost -11.8263340225
Pre-training layer 2, epoch 94, cost -11.8348454378
Pre-training layer 2, epoch 95, cost -11.8288419285
Pre-training layer 2, epoch 96, cost -11.8366522357
Pre-training layer 2, epoch 97, cost -11.840142131
Pre-training layer 2, epoch 98, cost -11.8334445128
Pre-training layer 2, epoch 99, cost -11.8523094141
The pretraining code for file DBN_MovieReview.py ran for 430.33m
... getting the finetuning functions
... finetuning the model
epoch 1, minibatch 140/140, validation error 50.000000 %
epoch 1, minibatch 140/140, test error of best model 50.000000 %
epoch 2, minibatch 140/140, validation error 50.000000 %
epoch 3, minibatch 140/140, validation error 50.000000 %
epoch 4, minibatch 140/140, validation error 50.000000 %
Optimization complete with best validation score of 50.000000 %,with test performance 50.000000 %
The fine tuning code for file DBN_MovieReview.py ran for 5.48m
I ran both SDA and DBN with two different feature sets, and I got this exact 50% error on all four experiments.

I asked the same question in Theano's users group, and the answer was that the feature values should be between 0 and 1.
So I used a normalizer to rescale the feature values, and that solved the problem.
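For example, a simple column-wise min-max rescaling of the Bag-of-Words features into [0, 1] (just a sketch; which scaler to use is a separate choice):

import numpy as np

def minmax_fit(x):
    # learn per-feature minimum and range on the training set
    x = np.asarray(x, dtype='float32')
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0    # avoid division by zero for constant features
    return col_min, col_range

def minmax_apply(x, col_min, col_range):
    # rescale features into [0, 1] using the training-set statistics
    return (np.asarray(x, dtype='float32') - col_min) / col_range

# usage on the feature matrices built earlier (train_x, valid_x, test_x):
col_min, col_range = minmax_fit(train_x)
train_x = minmax_apply(train_x, col_min, col_range)
valid_x = minmax_apply(valid_x, col_min, col_range)
test_x = minmax_apply(test_x, col_min, col_range)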

I had the same problem. I think it was caused by overshooting.
So I decreased the learning rate from 0.1 to 0.013 and increased the number of epochs.
Then it worked.
But I'm not sure your problem is the same.
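A toy sketch of the overshooting effect this answer refers to (plain gradient descent on f(w) = w^2, unrelated to the actual DBN fine-tuning code):

def gradient_descent(lr, steps=20, w=5.0):
    # minimize f(w) = w^2, whose gradient is 2*w
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(gradient_descent(1.1))   # update factor 1 - 2*1.1 = -1.2, so w oscillates and blows up
print(gradient_descent(0.1))   # update factor 0.8, so w shrinks steadily towards the minimum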

Related

Numeric Values in C4.5 algorithm

Threshold value Z:
– The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}.
– Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
– It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. C4.5 instead chooses the smaller value vi of each interval {vi, vi+1} as the threshold, rather than the midpoint itself.
I just want to know if I got this right.
Let's say I have:
{65, 70, 75, 78, 80, 85, 90, 95, 96}.
I must examine m-1 candidate splits to find the optimal threshold, i.e. the candidates
{65, 70, 75, 78, 80, 85, 90, 95}.
For each split (e.g. <65 and >=65, <70 and >=70, and so on) I must calculate
the gain ratio and choose the split that gives me the highest gain. Am I right?
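For concreteness, a small sketch of scanning the m-1 candidate splits on a numeric attribute (information gain is shown for brevity; C4.5 itself ranks candidate splits by gain ratio, and the class labels in the last line are made up for illustration):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # sort the samples on the numeric attribute and try the m-1 candidate splits
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_t, best_gain = None, -1.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # equal values cannot be separated
        left, right = ys[:i], ys[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - remainder
        if gain > best_gain:
            # C4.5 reports the smaller endpoint of the interval, not the midpoint
            best_t, best_gain = xs[i - 1], gain
    return best_t, best_gain

# hypothetical labels for the 9 values in the question
print(best_threshold([65, 70, 75, 78, 80, 85, 90, 95, 96],
                     ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes']))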

Color image compression, steganalysis and inter-color correlations

I am trying to implement an LSB embedding steganalysis algorithm.
The database of images consists of 24-bit bmp color images (a few thousand).
Almost all steganalysis of LSB embedding steganography research work focuses on grayscale images.
I want to try to use the inter-color correlations within the image for hidden-message detection, rather than just treating the image as three separate grayscale channels.
I searched works on this topic and found a very simple algorithm (I can't provide the link).
It uses data compression for detecting hidden messages.
In short:
It states that the more information is hidden in the file, the larger its compressed archive becomes (for data compression methods such as rar, nanzip, png, etc.), because data compression algorithms exploit inter-color correlations.
There was no reference given for these statements. I don't know how data compression algorithms work; intuitively I agree with them, but I want to know for sure whether they are true, for example for the gzip or zip algorithms.
I don't know how any of these compression algorithms work, but I'll take a general stab at it.
Compression works by removing redundancy and generally uncompressed images have lots of it, because neighbouring pixels are correlated. That is, the colour in an area of an image tends to be similar because colour changes are slow and smooth.
However, the bit stream you embed tends to have no pattern and represents more of a random distribution of 1s and 0s. Random/noisy data have fewer patterns to take advantage of for compression. So, while an image can be compressed quite nicely, the compression won't reduce the file size as much after you hide your information, i.e., make the pixels more random.
To demonstrate this, let's assume a basic and probably naive compression algorithm which detects when a pixel value, P, is repeated N times and stores it as N, P. For example:
200, 200, 200, 200, 200, 200, 200, 201, 201, 201, 201 # uncompressed
7, 200, 4, 201 # compressed
Now, embed 1 bit in each of these pixels. The probability of switching the bit from 0 to 1 or vice versa is 50%, so approximately half of them will change. You could end up with something like this:
200, 200, 201, 200, 201, 201, 200, 200, 201, 200, 201 # uncompressed
2, 200, 1, 201, 1, 200, 2, 201, 2, 200, 1, 201, 1, 200, 1, 201 # compressed
Notice how much more data the compressed form requires compared to the original case. For comparison, embed 1 bit in only the first two pixels:
201, 200, 200, 200, 200, 200, 200, 201, 201, 201, 201 # uncompressed
1, 201, 6, 200, 4, 201 # compressed
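A short sketch of this toy run-length scheme and the effect of randomising the LSBs (the pixel run is the one from the example above; the "message" is just random bits):

import random

def rle(pixels):
    # naive run-length coder: a list of [count, value] runs
    runs = []
    for p in pixels:
        if runs and runs[-1][1] == p:
            runs[-1][0] += 1
        else:
            runs.append([1, p])
    return runs

def embed_random_lsb(pixels):
    # replace each pixel's least significant bit with a random message bit
    return [(p & ~1) | random.randint(0, 1) for p in pixels]

pixels = [200] * 7 + [201] * 4
print(len(rle(pixels)))                    # 2 runs: compresses well
print(len(rle(embed_random_lsb(pixels))))  # typically many more runs after embedding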

How to compute the variances in Expectation Maximization with n dimensions?

I have been reviewing Expectation Maximization (EM) in research papers such as this one:
http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf
There are some points I have not figured out. For example, what happens if we have many dimensions for each datapoint?
For example, I have the following dataset with 6 datapoints and 4 dimensions:
D1 D2 D3 D4
5, 19, 72, 5
6, 18, 14, 1
7, 22, 29, 4
3, 22, 51, 1
2, 21, 89, 2
1, 12, 28, 1
Does it mean that for computing the expectation step I need to compute 4 standard deviations (one for each dimension)?
Do I also have to compute the variance for each cluster, assuming k=3 (I do not know if it is necessary based on the formula from the paper...), or just the variances for each dimension (4 attributes)?
Usually, you use a covariance matrix, which also includes the variances.
But it really depends on your chosen model. The simplest model does not use variances at all.
A more complex model has a single variance value, the average variance over all dimensions.
Next, you can have a separate variance for each dimension independently; and last but not least a full covariance matrix. That is probably the most flexible GMM in popular use.
Depending on your implementation, there can be many more.
From R's mclust documentation:
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = ellipsoidal multivariate normal

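To make these model choices concrete, here is a sketch using scikit-learn's GaussianMixture rather than mclust or the paper's own EM (the covariance_type values map only roughly onto the families above: 'spherical' gives one variance per component, 'diag' a separate variance per dimension and component, 'tied' one shared full covariance, 'full' a full covariance per component):

import numpy as np
from sklearn.mixture import GaussianMixture

# the 6 datapoints with 4 dimensions from the question
X = np.array([[5, 19, 72, 5],
              [6, 18, 14, 1],
              [7, 22, 29, 4],
              [3, 22, 51, 1],
              [2, 21, 89, 2],
              [1, 12, 28, 1]], dtype=float)

for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # the shape of covariances_ shows how many parameters each family estimates
    print(cov_type, np.shape(gmm.covariances_))

The 'diag' case corresponds to the setup asked about in the question: one variance per dimension, estimated separately for each of the k=3 clusters.
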
Word2Vec: Effect of window size used

I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use is at most 2. I am trying to understand what the implications of such a small window size are for the quality of the learned model, so that I can tell whether my model has learnt something meaningful or not. I tried training a word2vec model on 5-grams, but it appears the learnt model does not capture semantics etc. very well.
I am using the following test to evaluate the accuracy of model:
https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
I used gensim.Word2Vec to train a model, and here is a snippet of my accuracy scores (using a window size of 2):
[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
{'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
{'correct': 0, 'incorrect': 86, 'section': 'currency'},
{'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
{'correct': 123, 'incorrect': 183, 'section': 'family'},
{'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
{'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
{'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
{'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
{'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
{'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
{'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
{'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
{'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
{'correct': 835, 'incorrect': 9973, 'section': 'total'}]
I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and got poor accuracy as well:
Sample output:
capital-common-countries:
ACCURACY TOP1: 19.37 % (98 / 506)
Total accuracy: 19.37 % Semantic accuracy: 19.37 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 10.26 % (149 / 1452)
Total accuracy: 12.61 % Semantic accuracy: 12.61 % Syntactic accuracy: -nan %
currency:
ACCURACY TOP1: 6.34 % (17 / 268)
Total accuracy: 11.86 % Semantic accuracy: 11.86 % Syntactic accuracy: -nan %
city-in-state:
ACCURACY TOP1: 11.78 % (185 / 1571)
Total accuracy: 11.83 % Semantic accuracy: 11.83 % Syntactic accuracy: -nan %
family:
ACCURACY TOP1: 57.19 % (175 / 306)
Total accuracy: 15.21 % Semantic accuracy: 15.21 % Syntactic accuracy: -nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 6.48 % (49 / 756)
Total accuracy: 13.85 % Semantic accuracy: 15.21 % Syntactic accuracy: 6.48 %
gram2-opposite:
ACCURACY TOP1: 17.97 % (55 / 306)
Total accuracy: 14.09 % Semantic accuracy: 15.21 % Syntactic accuracy: 9.79 %
gram3-comparative:
ACCURACY TOP1: 34.68 % (437 / 1260)
Total accuracy: 18.13 % Semantic accuracy: 15.21 % Syntactic accuracy: 23.30 %
gram4-superlative:
ACCURACY TOP1: 14.82 % (75 / 506)
Total accuracy: 17.89 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.78 %
gram5-present-participle:
ACCURACY TOP1: 19.96 % (198 / 992)
Total accuracy: 18.15 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.31 %
gram6-nationality-adjective:
ACCURACY TOP1: 35.81 % (491 / 1371)
Total accuracy: 20.76 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.14 %
gram7-past-tense:
ACCURACY TOP1: 19.67 % (262 / 1332)
Total accuracy: 20.62 % Semantic accuracy: 15.21 % Syntactic accuracy: 24.02 %
gram8-plural:
ACCURACY TOP1: 35.38 % (351 / 992)
Total accuracy: 21.88 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.52 %
gram9-plural-verbs:
ACCURACY TOP1: 20.00 % (130 / 650)
Total accuracy: 21.78 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.08 %
Questions seen / total: 12268 19544 62.77 %
However, the word2vec site claims it is possible to obtain an accuracy of ~60% on these tasks.
Hence I would like to gain some insight into the effect of hyperparameters such as window size, and how they affect the quality of the learnt models.
Very low scores on the analogy-questions are more likely due to limitations in the amount or quality of your training data, rather than mistuned parameters. (If your training phrases are really only 5 words each, they may not capture the same rich relations as can be discovered from datasets with full sentences.)
You could use a window of 5 on your phrases – the training code trims the window to what's available on either side – but then every word of each phrase affects all of the other words. That might be OK: one of the Google word2vec papers ("Distributed Representations of Words and Phrases and their Compositionality", https://arxiv.org/abs/1310.4546) mentions that to get the best accuracy on one of their phrase tasks, they used "the entire sentence for the context". (On the other hand, on one English corpus of short messages, I found a window size of just 2 created the vectors that scored best on the analogies-evaluation, so larger isn't necessarily better.)
A paper by Levy & Goldberg, "Dependency-Based Word Embeddings", speaks a bit about the qualitative effect of window-size:
https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf
They find:
Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about the word itself: what other words are functionally similar? (Their own extension, the dependency-based embeddings, seems best at finding most-similar words, synonyms, or obvious alternatives that could drop in as replacements for the original word.)
To your question: "I am trying to understand what the implications of such a small window size are on the quality of the learned model".
Take, for example, "stackoverflow great website for programmers", a phrase with 5 words (suppose we keep the stop words "great" and "for" here).
If the window size is 2, then the vector of the word "stackoverflow" is directly affected by the words "great" and "website"; if the window size is 5, "stackoverflow" can also be directly affected by two more words, "for" and "programmers". 'Affected' here means it will pull the vectors of the two words closer together.
So it depends on the material you are using for training: if a window size of 2 can capture the context of a word but 5 is chosen, it will decrease the quality of the learnt model, and vice versa.
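A minimal gensim sketch of the comparison being discussed (the corpus path is a placeholder, and the parameter names follow gensim 4.x; older releases used size and iter instead of vector_size and epochs):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# one short phrase (e.g. a 5-gram) per line in the corpus file
sentences = LineSentence('phrases.txt')        # placeholder path

for window in (2, 5):
    model = Word2Vec(sentences, vector_size=100, window=window,
                     min_count=5, sg=1, epochs=5)
    score, _sections = model.wv.evaluate_word_analogies('questions-words.txt')
    print('window=%d  analogy accuracy=%.3f' % (window, score))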

Validating fractal dimension computation in Mathematica

I've written an implementation of the standard box-counting algorithm for determining the fractal dimension of an image or a set in Mathematica, and I'm trying to validate it. I've generated a Sierpinski triangle matrix using the CellularAutomaton function, and computed its fractal dimension to be 1.58496 with a statistical error of about 10^-15. This matches the expected value of log(3)/log(2) = 1.58496 incredibly well.
The problem arises when I try to test my algorithm against a randomly-generated matrix. The fractal dimension in this case should be exactly 2, but I get about 1.994, with a statistical error of about 0.004. Hence, my box-counting algorithm seems to work perfectly fine for the Sierpinski triangle, but not quite so well for the random distribution. Any ideas why not?
Code below:
sierpinski512 = CellularAutomaton[90, {{1}, 0}, 512];
ArrayPlot[sierpinski512]
d512 = FractalDimension[sierpinski512, {512, 256, 128, 64, 32, 16, 8, 4, 2}]
rtable = Table[Round[RandomReal[]], {i, 1, 512}, {j, 1, 1024}];
ArrayPlot[rtable]
drand = FractalDimension[rtable, {512, 256, 128, 64, 32, 16, 8, 4, 2}]
I can post the FractalDimension code if anybody really needs it, but I think the problem (if any) lies not with the FractalDimension algorithm but with the rtable I'm generating above.
I have studied this problem a little empirically in consultation with a well-known physicist, and we believe that the fractal dimension of a random point process goes to 2 (in the limit, I think) as the number of points grows large. I can't provide an exact definition of "large" but it can't be less than a few thousand points. So, I think you should expect to get D < 2 unless the number of points is quite large, theoretically, large enough to tile the plane.
I would be grateful for your FractalDimension code!
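For comparison, a minimal box-counting sketch in Python rather than the poster's Mathematica FractalDimension code (it assumes each box size divides the matrix dimensions):

import numpy as np

def box_count(matrix, box_sizes):
    # number of boxes of each size containing at least one nonzero cell
    counts = []
    for s in box_sizes:
        rows, cols = matrix.shape[0] // s, matrix.shape[1] // s
        blocks = matrix[:rows * s, :cols * s].reshape(rows, s, cols, s)
        counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))
    return counts

def fractal_dimension(matrix, box_sizes):
    # slope of log N(s) versus log(1/s)
    counts = box_count(matrix, box_sizes)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(box_sizes)), np.log(counts), 1)
    return slope

rand = (np.random.rand(512, 1024) > 0.5).astype(int)
print(fractal_dimension(rand, [512, 256, 128, 64, 32, 16, 8, 4, 2]))   # close to, but below, 2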
