How to bootstrap if sample size is very large

Normally, when we want to calculate the variance of the sample median, we would use the bootstrap. But if the sample size is very large, should we still bootstrap the whole sample? If not, is there a better method to use?
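For concreteness, here is a minimal sketch (in Java, purely illustrative, not an answer to the "better method" question) of the plain bootstrap estimate being described: resample with replacement B times, recompute the median each time, and take the variance of those medians. Re-sorting the full sample for every replicate is exactly what becomes expensive when n is huge.

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Sketch: estimate the variance of the sample median with the ordinary
 * bootstrap (resample with replacement, recompute the median each time).
 */
public class BootstrapMedianVariance {

    public static double estimate(double[] sample, int replicates, long seed) {
        Random rng = new Random(seed);
        int n = sample.length;
        double[] medians = new double[replicates];
        double[] resample = new double[n];
        for (int b = 0; b < replicates; b++) {
            for (int i = 0; i < n; i++) {
                resample[i] = sample[rng.nextInt(n)];   // draw with replacement
            }
            double[] sorted = resample.clone();
            Arrays.sort(sorted);                        // O(n log n) per replicate
            medians[b] = (n % 2 == 1)
                    ? sorted[n / 2]
                    : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
        }
        // Sample variance of the bootstrap medians.
        double mean = 0;
        for (double m : medians) mean += m;
        mean /= replicates;
        double var = 0;
        for (double m : medians) var += (m - mean) * (m - mean);
        return var / (replicates - 1);
    }
}
```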

Related

"Snapping" the Results of t-sne to a Regular Grid - Scalability Issues

I'm trying to use t-SNE to arrange images based on their visual similarity, similar to this cool example for emojis (source).
But the output of t-SNE is just a "point cloud", while my goal is to display the images in a regular, near-square, dense grid. So I need to somehow convert the output of t-SNE to (x, y) locations on a grid.
So far, I've followed the suggestion in this great blog post: I formulated it as a linear assignment problem to find the best embedding into a regular grid, and I'm pleased with the results.
My problem is that the "snapping to grid" stage turns out to be a huge bottleneck, and I need my method to scale well to a large number of images (10K). To solve the linear assignment problem I'm using a Java implementation of the Jonker-Volgenant algorithm, whose time complexity is O(n^3). So while t-SNE is O(n log n) and can scale well up to 10K images, the grid-alignment part can only deal with up to 2K images.
Potential Solutions, as I see it:
Randomly sample 2K images out of the total 10K
Divide the 10K images into 5 groups and create 5 maps. This is problematic because there's a "chicken and egg" problem: how do I do the division well?
Trade accuracy for performance: solve the linear assignment problem approximately in near-linear time. I want to try this, but I couldn't find any existing implementations to use (see the sketch after this question).
Implement the "snap to grid" part in a different, more efficient way.
I'm working with Java, but solutions in C++ are also good. I'm guessing I'm not the first to try this. Any suggestions? Thoughts?
Thanks!
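Not an authoritative answer, but as a rough sketch of option 3 above (trading exactness for speed), one simple approximation is to skip the exact Jonker-Volgenant solve and assign greedily: score every (point, cell) pair by squared distance, sort, and take the cheapest still-available pairs. The sketch assumes the t-SNE coordinates have been rescaled to the unit square and that the number of images is at most side * side; for 10K images you would also want to restrict each point to its nearby cells instead of materializing the full pair list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Greedy approximation to the "snap to grid" assignment.
 * points[i] = {x, y} are t-SNE coordinates scaled to [0, 1];
 * the grid has side x side cells, numbered row-major.
 */
public class GreedyGridSnap {

    private static final class Candidate {
        final int point, cell;
        final double cost;
        Candidate(int point, int cell, double cost) {
            this.point = point; this.cell = cell; this.cost = cost;
        }
    }

    /** Returns assignment[i] = row-major index of the grid cell for point i. */
    public static int[] assign(double[][] points, int side) {
        int n = points.length;                 // must be <= side * side
        int cells = side * side;
        List<Candidate> candidates = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int c = 0; c < cells; c++) {
                double cx = (c % side + 0.5) / side;   // cell centre x
                double cy = (c / side + 0.5) / side;   // cell centre y
                double dx = points[i][0] - cx, dy = points[i][1] - cy;
                candidates.add(new Candidate(i, c, dx * dx + dy * dy));
            }
        }
        candidates.sort(Comparator.comparingDouble((Candidate cand) -> cand.cost));

        int[] assignment = new int[n];
        boolean[] pointDone = new boolean[n];
        boolean[] cellTaken = new boolean[cells];
        int assigned = 0;
        for (Candidate cand : candidates) {            // cheapest pairs first
            if (assigned == n) break;
            if (pointDone[cand.point] || cellTaken[cand.cell]) continue;
            assignment[cand.point] = cand.cell;
            pointDone[cand.point] = true;
            cellTaken[cand.cell] = true;
            assigned++;
        }
        return assignment;
    }
}
```

The result is not the optimal assignment, but it usually preserves the neighborhood structure well enough for a visual layout and avoids the O(n^3) solve.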

Is it good to do cross-validation with the exact same dataset that is used in the training phase?

I am using the Weka API to test the performance of some algorithms. If I want to divide the dataset as follows:
70% for training
10% for validation
20% for testing
For the validation phase, should I use cross-validation on the fresh 10% split? Or is it better to apply cross-validation on the 70% of the data that has already been used for training? And why?
It is actually very problem specific, but in general it depends on the size of the dataset. If you have a big dataset, then even a subsample is representative, so you can split everything once into train/valid/test and just run a typical optimization and testing routine. On the other hand, if you have a rather small dataset (~1000 samples), then both testing and validation require CV (or another technique, like the 0.632 bootstrap). It is all about the statistical significance of the obtained error estimates. If the data is small, you need to generate multiple experiments (CV) to get a reasonable estimator; if you have 100,000 samples, then even 10% should be enough to use as a valid estimator of the error.
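To make that routine concrete, here is a minimal Weka sketch (the file name, classifier, and random seeds are placeholders): split once into 70/10/20, cross-validate only on the training portion for model selection, and keep the validation and test portions untouched until later.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/**
 * Sketch: 70/10/20 train/validation/test split, with 10-fold
 * cross-validation run on the training portion only.
 */
public class SplitAndValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42));                      // shuffle once

        int n = data.numInstances();
        int trainSize = (int) Math.round(n * 0.70);
        int validSize = (int) Math.round(n * 0.10);
        Instances train = new Instances(data, 0, trainSize);
        Instances valid = new Instances(data, trainSize, validSize);
        Instances test  = new Instances(data, trainSize + validSize,
                                        n - trainSize - validSize);

        // Cross-validate on the training data to compare/tune models ...
        J48 tree = new J48();
        Evaluation cv = new Evaluation(train);
        cv.crossValidateModel(tree, train, 10, new Random(1));
        System.out.println("CV accuracy on train: " + cv.pctCorrect());

        // ... then fit on the full training set and check on held-out data.
        tree.buildClassifier(train);
        Evaluation heldOut = new Evaluation(train);
        heldOut.evaluateModel(tree, valid);
        System.out.println("Validation accuracy: " + heldOut.pctCorrect());
        // 'test' stays untouched until the final model is chosen.
    }
}
```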

Reduce time needed for SIFT Brute-force Matcher Algorithm in OpenCV

I've been wondering how the time taken by the SIFT brute-force matcher (for object identification) can be reduced. My program iterates over some training sets of images and returns the maximum number of matches found in each case. Is there a way (other than resizing the images to a smaller scale) to shorten this time? I'm concerned that the larger my database becomes, the more a huge identification time will render it useless.
Thanks in advance
PS I use Python Bindings for OpenCV 2.4.9
Why don't you try the FLANN matcher? That's much faster than brute force.
http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_feature_homography/py_feature_homography.html#py-feature-homography
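The question uses the Python bindings, but the idea is the same in any binding; here is a rough sketch with the OpenCV Java bindings (package locations differ slightly between OpenCV versions), assuming the SIFT descriptors have already been computed elsewhere as CV_32F Mats, which is what FLANN expects.

```java
import java.util.ArrayList;
import java.util.List;
import org.opencv.core.DMatch;          // lives in org.opencv.features2d in OpenCV 2.4
import org.opencv.core.Mat;
import org.opencv.core.MatOfDMatch;
import org.opencv.features2d.DescriptorMatcher;

/**
 * Sketch: match SIFT descriptors with a FLANN-based matcher instead of
 * brute force. Assumes descriptors are CV_32F Mats and that the OpenCV
 * native library has been loaded at startup
 * (System.loadLibrary(Core.NATIVE_LIBRARY_NAME)).
 */
public class FlannMatching {

    public static int countGoodMatches(Mat queryDescriptors, Mat trainDescriptors) {
        DescriptorMatcher matcher =
                DescriptorMatcher.create(DescriptorMatcher.FLANNBASED);

        List<MatOfDMatch> knnMatches = new ArrayList<>();
        matcher.knnMatch(queryDescriptors, trainDescriptors, knnMatches, 2);

        // Lowe's ratio test: keep matches clearly better than the runner-up.
        int good = 0;
        for (MatOfDMatch pair : knnMatches) {
            DMatch[] m = pair.toArray();
            if (m.length >= 2 && m[0].distance < 0.75f * m[1].distance) {
                good++;
            }
        }
        return good;
    }
}
```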

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary, so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it reaches a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system; compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is the best you can do, you can try a simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result will be representative.
For more specific ideas on how to efficiently use the characteristics of your data, it would help to describe a bit what data the packets contain and how they are generated (as you have done in the comments: they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
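As an illustration of such a special-purpose compressor (a sketch only, not a drop-in library), delta-coding an int array and writing the deltas as zigzag varints is about as simple as it gets, and it exploits exactly the "successive values are close" structure described above:

```java
import java.io.ByteArrayOutputStream;

/**
 * Sketch of a tiny special-purpose encoder: delta-code an int array and
 * write the deltas as zigzag varints. Works well when successive values
 * are close together (e.g. a time series); it is not a general compressor.
 */
public class DeltaVarintEncoder {

    public static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int v : values) {
            int delta = v - previous;                    // small if values are correlated
            previous = v;
            int zigzag = (delta << 1) ^ (delta >> 31);   // map signed -> unsigned
            while ((zigzag & ~0x7F) != 0) {              // 7 bits per byte, MSB = "more"
                out.write((zigzag & 0x7F) | 0x80);
                zigzag >>>= 7;
            }
            out.write(zigzag);
        }
        return out.toByteArray();
    }
}
```

A longs version is analogous (shift by 63 for the zigzag step), and the uncompressed-fallback flag from the previous paragraph can be prepended to the output.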
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4-byte int or, probably, an 8-byte long). This method encodes the length of your number, followed by the number itself (probably not what you want). If you get many small values, it might make things smaller; if you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
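A quick way to sanity-check that claim is to compute the Elias gamma code length, which is 2*floor(log2(v)) + 1 bits for a value v >= 1; a small sketch:

```java
/**
 * Sketch: bit length of the Elias gamma code for a positive value,
 * illustrating why large values roughly double in size.
 */
public class EliasGammaLength {

    public static int bits(long value) {                   // value must be >= 1
        int n = 63 - Long.numberOfLeadingZeros(value);     // floor(log2(value))
        return 2 * n + 1;                                  // n zero bits + (n+1)-bit binary part
    }

    public static void main(String[] args) {
        System.out.println(bits(1));                       // 1 bit
        System.out.println(bits(1000));                    // 19 bits
        System.out.println(bits(Long.MAX_VALUE));          // 125 bits vs. 64 bits raw
    }
}
```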
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure whether the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
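A rough sketch of that Java route, with a placeholder dictionary (in practice the dictionary should be built from representative packets and shared by sender and receiver ahead of time):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/**
 * Sketch: deflate a small packet with a preset dictionary agreed on by
 * sender and receiver. DICTIONARY here is a placeholder.
 */
public class PresetDictionaryDeflate {

    private static final byte[] DICTIONARY =
            "put common byte patterns from real packets here".getBytes(StandardCharsets.US_ASCII);

    public static byte[] compress(byte[] packet) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(Deflater.FILTERED);        // suited to small numeric data
        deflater.setDictionary(DICTIONARY);
        deflater.setInput(packet);
        deflater.finish();
        byte[] buffer = new byte[packet.length + 128];  // ample for ~2k packets
        int length = deflater.deflate(buffer);
        deflater.end();
        byte[] result = new byte[length];
        System.arraycopy(buffer, 0, result, 0, length);
        return result;
    }

    public static byte[] decompress(byte[] compressed, int originalLength) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        int n = inflater.inflate(out);
        if (n == 0 && inflater.needsDictionary()) {     // inflater asks for the dictionary
            inflater.setDictionary(DICTIONARY);
            inflater.inflate(out);
        }
        inflater.end();
        return out;
    }
}
```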

Calculating a histogram on streaming data - Online histogram calculation

I am looking for an algorithm to generate a histogram over a large amount of streaming data; the max and min are not known in advance, but the standard deviation and the mean are within a particular range.
I appreciate your ideas.
Cheers,
I just found one solution: Sec. 2.2, "On-line Histogram Building", of the paper "A Streaming Parallel Decision Tree Algorithm". The algorithm is implemented by the NumericHistogram class in the Hive project:
A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
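For reference, a bare-bones sketch of that heuristic (not the Hive code itself): keep at most maxBins (centroid, count) pairs, and whenever a new value pushes the histogram over budget, merge the two adjacent bins whose centroids are closest.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the Ben-Haim / Tom-Tov streaming histogram update rule:
 * keep at most maxBins (centroid, count) pairs, merging the two
 * closest adjacent bins when the budget is exceeded.
 */
public class StreamingHistogram {

    private static final class Bin {
        double centroid;
        long count;
        Bin(double centroid, long count) { this.centroid = centroid; this.count = count; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>();   // kept sorted by centroid

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double value) {
        // Insert a singleton bin at the right position (a binary search would also do).
        int i = 0;
        while (i < bins.size() && bins.get(i).centroid < value) i++;
        bins.add(i, new Bin(value, 1));

        if (bins.size() > maxBins) {
            // Find the pair of adjacent bins with the smallest centroid gap.
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int j = 0; j + 1 < bins.size(); j++) {
                double gap = bins.get(j + 1).centroid - bins.get(j).centroid;
                if (gap < bestGap) { bestGap = gap; best = j; }
            }
            // Merge them: weighted-average centroid, summed count.
            Bin a = bins.get(best), b = bins.get(best + 1);
            long total = a.count + b.count;
            a.centroid = (a.centroid * a.count + b.centroid * b.count) / total;
            a.count = total;
            bins.remove(best + 1);
        }
    }
}
```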
I use a package called "GoHistogram" which provides two streaming approximation histograms (NumericHistogram and Weighted Numeric Histogram). It is implemented in Golang (https://code.google.com). Here is the link:
https://github.com/VividCortex/gohistogram
Standard deviation and mean do not matter for a histogram. Simply choose your resolution and draw each bar as high as the number of hits in its range. This will, of course, get more expensive with a higher resolution. You can try adjusting the resolution by fitting the existing data to a normal curve (or whatever model you like) and using its standard deviation to choose a reasonable granularity.
Edit: Read it wrong the first time around. If you know the approximate standard deviation, you can choose reasonable sizes for your histogram groups from the get-go. Just compare every new entry to your current min and max and adjust your range accordingly.
