Building the dataset for the Random Forest training procedure

I want to use the bagging (short for bootstrap aggregating) technique to train a random forest classifier. I read a description of this learning technique, but I have not figured out how to organize the dataset initially.
Currently I load all the positive examples first and the negative ones immediately after. Moreover, there are fewer than half as many positive examples as negative ones, so when sampling from the dataset uniformly, the probability of obtaining a negative example is greater than that of obtaining a positive one.
How should I build the initial dataset?
Should I shuffle the initial dataset containing positive and negative examples?

Bagging consists of using bootstrap samples to train the different predictors and aggregating their results. See that description for the full details, but in short: you need to sample from your data with replacement (i.e. if you have N elements numbered 1 through N, pick K random integers between 1 and N, and take those K elements as a training set), usually making each sample the same size as the original dataset (i.e. K = N).
One more thing you should probably bear in mind: random forests are more than just bootstrap aggregation over the original data; there is also a random selection of a subset of the features to consider at each split of each individual tree.
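For concreteness, here is a minimal sketch of the bootstrap sampling step in NumPy; the array shapes and the 30/70 class split are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real dataset (hypothetical shapes).
X = rng.normal(size=(100, 5))          # 100 examples, 5 features
y = np.r_[np.ones(30), np.zeros(70)]   # positives loaded first, then negatives

def bootstrap_sample(X, y, rng):
    """One bootstrap sample: draw K = N row indices uniformly with replacement."""
    n = len(y)
    idx = rng.integers(0, n, size=n)   # duplicates are expected and fine
    return X[idx], y[idx]

# Because every index is equally likely on each draw, the original ordering
# (all positives first, then all negatives) does not bias the sample.
X_boot, y_boot = bootstrap_sample(X, y, rng)
```

Shuffling beforehand is therefore unnecessary for the bagging step itself; the class imbalance, however, carries over into each bootstrap sample on average, so if that is a concern you would address it separately (e.g. with class weights or a stratified bootstrap). Libraries such as scikit-learn's RandomForestClassifier perform both the bootstrapping and the per-split feature subsampling (its max_features parameter) for you.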

Related

Pointwise vs. pairwise learning-to-rank on data with binary relevance values

I have two questions about the differences between pointwise and pairwise learning-to-rank algorithms on data with binary relevance values (0s and 1s). Suppose the loss function for a pairwise algorithm counts the number of times an entry with label 0 gets ranked before an entry with label 1, and that for a pointwise algorithm measures the overall difference between the estimated relevance values and the actual relevance values.
So my questions are: 1) theoretically, will the two groups of algorithms perform significantly differently? 2) will a pairwise algorithm degrade to a pointwise algorithm in such a setting?
thanks!
In pointwise estimation, the errors across rows in your data (rows with items and users, where you want to rank items within each user/query) are assumed to be independent, somewhat like normally distributed errors. In pairwise estimation, the loss function often used is cross-entropy: a relative measure of accurately classifying 1s as 1s and 0s as 0s within each informative pair (i.e. a pair in which one item is better than the other).
So chances are that the pairwise approach will learn better than the pointwise one.
The only exception I could see is a business scenario in which users click items without evaluating or comparing them against one another, per se. This is highly unlikely, though.
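To make the contrast concrete, here is a small sketch of the two loss functions described in the question, computed for a single query with made-up scores:

```python
import numpy as np

# Hypothetical scores and binary relevance labels for one query.
scores = np.array([0.9, 0.2, 0.7, 0.1])
labels = np.array([1,   0,   1,   0  ])

# Pointwise loss (as described in the question): overall difference between
# estimated and actual relevance values, here measured as squared error.
pointwise_loss = np.sum((scores - labels) ** 2)

# Pairwise loss (as described in the question): number of (relevant,
# non-relevant) pairs where the non-relevant item is scored at least as high.
pos, neg = scores[labels == 1], scores[labels == 0]
pairwise_loss = sum(int(sn >= sp) for sp in pos for sn in neg)

print(pointwise_loss, pairwise_loss)   # ~0.15 and 0 for these scores
```

The pointwise loss keeps shrinking as the scores approach the labels themselves, while the pairwise loss is already zero as soon as every 1-labelled item outranks every 0-labelled item, which is exactly the ordering behaviour a ranker is judged on.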

Puzzle: for a boolean matrix, find a permutation of rows and columns that allows decomposition into a minimal set of covering rectangles

Suppose I know an algorithm that partitions a boolean matrix into a minimal set of disjoint rectangles covering all "ones" ("trues").
The task is to find a permutation of rows and columns of the matrix, such that a matrix built by shuffling the columns and rows according to the permutations can be partitioned into a minimal set of rectangles.
For illustration, one can think about the problem this way:
Suppose I have a set of objects and a set of properties. Each object can have any number of (distinct) properties. The task is to summarize (report) this mapping using the least amount of sentences. Each sentence has a form "<list of objects> have properties <list of properties>".
I know I can brute-force the solution by applying all permutations and running the algorithm on each one, but the time complexity explodes exponentially, making this approach impractical for matrices bigger than 15×15.
I know I can simplify the matrices before running the algorithm by removing duplicated rows and columns.
This problem feels like it is NP-hard, and there might be no fast (polynomial-time) solution. If that is so, I'd be interested to learn about some approximate solutions.
This is isomorphic to reducing logic circuits, given the full set of inputs (features) and the required truth table (which rows have which feature). You can solve the problem with classic Boolean algebra. The process is called logic optimization.
When I was in school, we drew Karnaugh maps on the board and traced colored boundaries to form our rectangles. However, it sounds as if you have something larger than one would handle on a board; try the Quine-McCluskey (QM) algorithm and the heuristics cited there for a solution that is "good enough" for many applications.
My solution so far:
First let us acknowledge, that the problem is symmetric with respect to swapping rows with columns (features with objects).
Let us represent the problem with the binary matrix, where rows are objects and columns are features and ones in the matrix represent matched pairs (object, feature).
My idea so far is to run two steps in sequence until there are no 1s left in the matrix:
Heuristically find a good unshuffling permutation of rows and columns on which to run the 2D maximal-rectangle search.
Find the maximal rectangle, save it to the answer list and zero all 1s belonging to it.
Maximal rectangle problem
This can simply be any of the implementations of the maximal rectangle problem found on the net, for instance https://www.geeksforgeeks.org/maximum-size-rectangle-binary-sub-matrix-1s/
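For reference, here is a sketch of that standard histogram-based approach in Python; it also returns the rectangle's coordinates so that step 2 above can zero it out:

```python
def max_rectangle_of_ones(mat):
    """Return (area, top, left, bottom, right) of the largest all-1s rectangle.

    For every row, treat the run of consecutive 1s above each cell as a bar
    height and solve 'largest rectangle in a histogram' with a stack.
    """
    n_rows, n_cols = len(mat), len(mat[0])
    heights = [0] * n_cols
    best = (0, -1, -1, -1, -1)                 # area, top, left, bottom, right

    for r in range(n_rows):
        for c in range(n_cols):                # update the histogram for row r
            heights[c] = heights[c] + 1 if mat[r][c] else 0

        stack = []                             # indices of increasing bar heights
        for c in range(n_cols + 1):
            h = heights[c] if c < n_cols else 0
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = top_h * (c - left)
                if area > best[0]:
                    best = (area, r - top_h + 1, left, r, c - 1)
            stack.append(c)
    return best

grid = [[0, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 0, 0]]
print(max_rectangle_of_ones(grid))   # (8, 1, 0, 2, 3): the 2x4 block of 1s
```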
Unshuffling the rows (and columns)
Unshuffling the rows is independent of unshuffling the columns, and both tasks can be run separately (concurrently). Let us assume I am looking for the unshuffling permutation of the columns.
Also, it is worth noting that unshuffling a matrix should yield the same result if we swap ones with zeroes.
Build a distance matrix of the columns. The distance between two columns is defined as the Manhattan distance between the two columns represented numerically (i.e. 0 for the absence of a relationship between object and feature, 1 for its presence).
Run hierarchical clustering using the distance matrix. The complexity is O(n^2), as I believe single linkage should be good enough.
The leaf order returned by the hierarchical clustering is the unshuffling permutation.
The algorithm works well enough for my use cases. An implementation in R can be found at https://github.com/adamryczkowski/rectpartitions
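For anyone who prefers Python, here is a sketch of the unshuffling step using SciPy; it is my own translation of the three steps above, not the linked R code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

def column_unshuffle_order(mat):
    """Order the columns of mat by single-linkage clustering on Manhattan distances."""
    d = pdist(mat.T, metric="cityblock")   # step 1: pairwise column distances
    tree = linkage(d, method="single")     # step 2: hierarchical clustering
    return leaves_list(tree)               # step 3: leaf order = permutation

rng = np.random.default_rng(0)
mat = (rng.random((8, 10)) < 0.4).astype(float)   # hypothetical object x feature matrix
col_order = column_unshuffle_order(mat)
row_order = column_unshuffle_order(mat.T)         # same routine, transposed input
unshuffled = mat[np.ix_(row_order, col_order)]
```

leaves_list returns the order in which the leaves appear in the dendrogram, which is what serves here as the unshuffling permutation.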

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look up the precomputed answer that corresponds to the histogram with the minimum Earth Mover's Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
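To illustrate the lookup I have in mind, here is a sketch of the nearest-exemplar search by one-dimensional Earth Mover's Distance, using SciPy's wasserstein_distance and made-up stand-in data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def nearest_precomputed(query_hist, stored_hists, bin_positions):
    """Index of the stored histogram with minimum 1-D Earth Mover's Distance."""
    dists = [wasserstein_distance(bin_positions, bin_positions,
                                  u_weights=query_hist, v_weights=h)
             for h in stored_hists]
    return int(np.argmin(dists))

bins = np.linspace(0.0, 1.0, 20)                                    # 20 bin positions
stored = np.random.default_rng(0).dirichlet(np.ones(20), size=100)  # ~100 stored shapes
query = np.full(20, 1.0 / 20)                                       # a flat input histogram
best_index = nearest_precomputed(query, stored, bins)
```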
Finding N points in M-space that form an optimally spread-out set is more or less equivalent to hypersphere packing (1, 2), and in general answers are not known for M > 10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space is spanned by whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
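A sketch of that idea, assuming scikit-learn is available and using Dirichlet draws as stand-in training data (swap in histograms logged from the real system when you have them):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 20-entry vectors that sum to 1, standing in for representative histograms.
hists = rng.dirichlet(np.ones(20), size=50_000)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(hists)
exemplars = kmeans.cluster_centers_                 # ~100 representative histograms
exemplars /= exemplars.sum(axis=1, keepdims=True)   # renormalise so each sums to 1
# Run the expensive computation once per exemplar and store the resulting pairs.
```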
I second jwpat7's answer, but my very naive approach was to treat the count in each histogram bin as a y value, take the x values as simply 0..1 in 20 steps, and then obtain parameters a, b, c that describe x vs. y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be shaped that way, or use whatever model you think is most descriptive, and generate however many histograms you need for the covering. I used 27 because three values per parameter for three parameters gives 3*3*3 = 27.
For a more comprehensive covering, like 100, you would have to choose the number of values for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) approach wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still give the same shape, it would probably be better to store ratios of them for comparison instead, e.g. for my 3 parameters, b/a and b/c.
Here is an 81-histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3).
Edit: Since you said your histograms were described by arrays of about 20 elements, I figured fitting the parameters would be very fast.
Edit 2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
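A sketch of this parameter-sweep covering, plus the cubic fit for an incoming histogram; the shift-and-normalise step is my own addition, since a raw cubic is not guaranteed to be a valid histogram:

```python
import itertools
import numpy as np

x = np.linspace(0.0, 1.0, 20)            # bin positions, as in the answer

# 27 covering shapes: every (a, b, c) from linspace(-1, 1, 3), evaluated as
# y = a*x^3 + b*x^2 + c*x (no constant term, per edit 2).
param_values = np.linspace(-1.0, 1.0, 3)
covering = []
for a, b, c in itertools.product(param_values, repeat=3):
    y = a * x**3 + b * x**2 + c * x
    y = np.clip(y - y.min(), 0.0, None)  # assumption: shift so all bins are non-negative
    covering.append(y / y.sum() if y.sum() else np.full_like(x, 1.0 / len(x)))

def cubic_shape(hist):
    """Fit y ~ a*x^3 + b*x^2 + c*x + d to a 20-bin histogram, return (a, b, c)."""
    a, b, c, _ = np.polyfit(x, hist, deg=3)   # drop the constant term d
    return a, b, c
```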

How to pick base samples deterministically in the particle filter algorithm?

The particle filter algorithm is known for its use in tracking objects in a video sequence: at each iteration, the algorithm generates hypotheses (or samples) about the motion of the object. In order to generate a new hypothesis, the first step of the condensation algorithm involves selecting a sample. The example provided on this web page shows an implementation of the selection step that uses binary search to pick a base sample; the comment accompanying the pick_base_sample() function explains that
The use of this routine makes Condensation O(N log N) where N is the number of samples. It is probably better to pick base samples deterministically, since then the algorithm is O(N) and probably marginally more efficient, but this routine is kept here for conceptual simplicity and because it maps better to the published literature.
What does it mean to pick base samples deterministically?
How can base samples be picked deterministically?
The condensation algorithm makes use of multiple samples to represent the current estimated state; each sample has an associated weight (estimating the probability that the sample is correct).
The selection step chooses N samples from this set (with replacement, so the same sample can appear multiple times).
To explain the selection step, imagine drawing the samples as a series of line segments. Let the width of each line segment equal the weight of that sample.
For example, suppose we had samples A (weight 0.1), B (weight 0.3), and C (weight 0.6).
We would draw:
ABBBCCCCCC
The normal random selection process involves drawing samples by picking a random point along this line and seeing which sample appears at that position. The perceived problem with this approach is that it takes O(log N) operations to work out which sample appears at a particular location when using a tree data structure to hold the weights. (Although in practice I would not expect this to be the main processing bottleneck in an implementation.)
An alternative deterministic (basically think "repeatable" and "not involving random numbers") approach is to simply choose samples by picking N regularly spaced points along the same line. The advantage of this is that the algorithm to do this takes time O(N) instead of O(NlogN).
(The deterministic algorithm is to loop over all the samples keeping track of the total weight seen so far. Whenever the total weight reaches the next regularly spaced point you collect a new sample. This only requires a single pass over the samples so is O(N).)
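Here is a sketch of that single-pass selection, often called systematic resampling; a fixed offset of 0.5 makes it fully repeatable, whereas many implementations use one random offset per iteration instead:

```python
import numpy as np

def systematic_resample(weights, offset=0.5):
    """Pick N sample indices at N regularly spaced points along the
    cumulative-weight 'line', in a single O(N) pass over the samples."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    points = (offset + np.arange(n)) / n     # N regularly spaced points in [0, 1)
    indices = np.empty(n, dtype=int)
    total, i = 0.0, 0
    for j, wj in enumerate(w):               # single pass, running weight total
        total += wj
        while i < n and points[i] < total:
            indices[i] = j
            i += 1
    indices[i:] = n - 1                      # guard against floating-point round-off
    return indices

# The A/B/C example above: weights 0.1, 0.3, 0.6 -> picks B, C, C.
print(systematic_resample([0.1, 0.3, 0.6]))  # [1 2 2]
```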

conjugate of an integer partition

Is the conjugate of an integer partition, selected at random from the set of all partitions of n, also a uniform random sample? My results suggest yes, which is encouraging for the sake of quickly generating random partitions of n that are of length s, but I can't explain why that should or shouldn't be the case.
By the way, my results are based on: 1) generating all partitions of a small n (< 70) with a specific length s; 2) calculating the variance of each partition as a macrostate descriptor; and 3) comparing the kernel density curve of the variance across the entire feasible set (all partitions of n with length s) against small random samples (i.e. < 500 randomly generated partitions of n whose lengths either match s or whose conjugates' lengths match s). Kernel density curves for the random samples closely match the curve for the entire feasible set (i.e. all partitions of n matching s). This visually illustrates that random samples, the majority of which are conjugate partitions, capture the distribution of variance among partitions of the n- and s-based feasible set. I just can't explain why it should work as it appears to; that's the downfall of making a creative leap.
Note: Many other procedures for producing random samples yield a clearly biased sample (i.e. a differently shaped and highly non-overlapping kernel density curve).
Yes. Conjugation is a bijective operation, so each partition maps to a unique conjugate, which in turn maps back to the original partition. Therefore, there can't be any bias introduced by taking the conjugate of a partition selected uniformly at random.
I don't think this helps you generate fixed-length partitions at random, though - you should probably adapt Nijenhuis & Wilf's algorithm to do this correctly. This shouldn't be very hard, since the number of partitions of n into k parts can be computed easily, and the random generation algorithm really only depends on this.
Knuth includes an exercise (47) on generating random partitions in section 7.2.4.1 of TAOCP volume 4A. This would be an excellent starting point for an efficient algorithm to generate fixed length partitions uniformly at random.
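For reference, here is a small sketch of conjugation and of the count of partitions of n into exactly k parts, using the standard recurrence p(n, k) = p(n-1, k-1) + p(n-k, k):

```python
from functools import lru_cache

def conjugate(partition):
    """Conjugate of a partition given as a non-increasing list of parts:
    entry i of the conjugate is the number of original parts greater than i."""
    return [sum(1 for p in partition if p > i) for i in range(partition[0])]

@lru_cache(maxsize=None)
def p_n_k(n, k):
    """Number of partitions of n into exactly k parts."""
    if k <= 0 or n < k:
        return 0
    if k == n:
        return 1
    return p_n_k(n - 1, k - 1) + p_n_k(n - k, k)

print(conjugate([4, 2, 1]))   # [3, 2, 1, 1]; conjugating again gives [4, 2, 1] back
print(p_n_k(10, 3))           # 8 partitions of 10 into exactly 3 parts
```

Conjugation swaps a partition's length with its largest part, which is why sampling partitions whose conjugates have length s is a convenient route to fixed-length samples.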
