I can't find the detailed description of how bin mapping is constructed in lightgbm paper.
I have several questions about bin mapping.
Is it static or dynamic? That is, during the growth of nodes, does the bin mapping change?
Is the number of bins the same for each feature dimension? For example, for a one-hot feature, is the number of bins equal to 2?
For a real-valued feature, are the split points of the bins uniformly distributed? Or is there some principle for finding the split points of the bins?
1: Bins are a form of preprocessing: each variable is converted to discrete values before the optimization. The mapping is specific to your training data and does not change.
2: There is a parameter you can tune to set the maximum number of bins. But of course, if your feature only has 5 distinct values, there will be only 5 bins. So you can have a different number of bins per feature.
3: The split points for the bins are not chosen by equal width; they are chosen by frequency: if you set 100 bins, the split points will be chosen such that each bin contains approximately 1% of all your training points (it could be more or less, depending on whether you have repeated values). This process is similar to the pandas qcut function.
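To make the frequency idea concrete, here is a minimal illustration using pandas.qcut (just an analogy for the binning, not LightGBM's actual implementation; the random data is a placeholder):

```python
import numpy as np
import pandas as pd

# Placeholder feature: 1000 random real values
x = pd.Series(np.random.randn(1000))

# 10 bins chosen by frequency: each bin holds roughly 10% of the points
codes, edges = pd.qcut(x, q=10, retbins=True, labels=False)

print(edges)                 # the split points (not equally spaced)
print(codes.value_counts())  # roughly equal counts per bin
```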
Hope I have covered your questions.
I have a spreadsheet of data in which the first 12 rows of the leftmost column contain 12 names in descending alphabetical order, and the first 12 columns of the topmost row contain the same names in alphabetical order (left to right). These names are the names of people who ranked something, and the value in each cell is the Kendall's tau similarity coefficient between the rankings of the name in that cell's row and the name in that cell's column. How can I use Constrained K-Means Clustering to find the similarity between these names?
K-means clustering does not work on similarity matrices.
It needs vector data in Euclidean space in order to compute the means (hence the name). It cannot maximize similarities; instead, it minimizes the sum of squared coordinate differences.
Also, your question is off-topic, as it is not a programming question; you only want to use an existing program.
Since your data is so tiny that it fits on a single screen, I suggest you simply brute-force test all possible solutions. Then it's trivial to add your constraints (skip candidates that don't meet your size requirements). Even without constraints, if you want 4 clusters you have far fewer than 4^11 possibilities, i.e. about 4 million, minus plenty of redundant permutations, minus all those where clusters are too small or too large.
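For illustration, a rough sketch of that brute-force search (the 12x12 Kendall's tau matrix sim, the cluster count, and the size limits below are placeholders; it maximizes total within-cluster similarity and can take a few minutes in pure Python):

```python
import itertools
import numpy as np

def best_partition(sim, k=4, min_size=2, max_size=4):
    """Exhaustively search cluster assignments for a small similarity matrix."""
    n = sim.shape[0]
    best_score, best_labels = -np.inf, None
    # Fix the first item's label to 0 to skip redundant permutations,
    # leaving k**(n-1) candidate assignments.
    for rest in itertools.product(range(k), repeat=n - 1):
        labels = (0,) + rest
        sizes = np.bincount(labels, minlength=k)
        if sizes.min() < min_size or sizes.max() > max_size:
            continue  # violates the size constraints
        # Score: sum of similarities within clusters
        score = sum(sim[i, j]
                    for i in range(n) for j in range(i + 1, n)
                    if labels[i] == labels[j])
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```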
I have a csv file with the following format:
thing1_id, thing2_id, similarity
The similarity is between 50 and 100. I've filtered out all pairs with similarity less than 50, but I do have the full set where the lowest is around 25. There are duplicate comparisons at the moment, i.e. thing1-thing2 is a separate entry from thing2-thing1.
I'm interested in writing a program that will take in a similarity threshold (s) and minimum number of items per set (n), and give me all sets of size n or greater with things that are all at least s% similar to all other elements in that set.
I was thinking a graph might be the best data structure to do this? Where each thing is a node, and the similarity is a weighted edge. I'm not too sure where to go from here without taking way too much memory. This is a set of around 400 things.
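For what it's worth, a minimal sketch of that graph framing (using networkx; the header handling and column order are assumptions): if you keep only edges with similarity >= s, every qualifying set is a clique, so maximal-clique enumeration gives the candidates, and the duplicate thing1-thing2 / thing2-thing1 rows collapse automatically in an undirected graph.

```python
import csv
import networkx as nx

def similar_groups(path, s, n):
    G = nx.Graph()  # undirected, so duplicate pair rows are merged
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumes a header row like the one above
        for thing1, thing2, sim in reader:
            if float(sim) >= s:
                G.add_edge(thing1.strip(), thing2.strip())
    # Every set whose members are all pairwise >= s% similar is a clique;
    # find_cliques enumerates the maximal ones.
    return [set(c) for c in nx.find_cliques(G) if len(c) >= n]
```

At ~400 nodes the graph itself is tiny memory-wise; the potentially expensive part is clique enumeration if the thresholded graph is very dense.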
I am trying to use the KNN algorithm from Spark 2.2.0. I am wondering how I should set the bucket length. The record count and number of features vary, so I think it is better to set the length based on some condition. How should I set the bucket length for better performance? I rescaled all the features in the vector to the range 0 to 1.
Also, is there any way to guarantee that the KNN algorithm returns a minimum number of elements? I found that sometimes the number of elements inside the bucket is smaller than the queried k, and I might want at least one or two neighbors as a result.
Thanks~
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH
According to the Scaladocs:
"If input vectors are normalized, 1-10 times of pow(numRecords, -1/inputDim) would be a reasonable value"
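A hedged sketch of applying that heuristic with PySpark's BucketedRandomProjectionLSH (df with a normalized "features" column and the query vector key are assumed placeholders):

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH

num_records = df.count()
input_dim = df.first()["features"].size
base = pow(num_records, -1.0 / input_dim)
bucket_length = 2 * base  # per the docs, try values between 1x and 10x of base

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=bucket_length,
                                  numHashTables=3)
model = brp.fit(df)

# approxNearestNeighbors can return fewer than the requested k rows;
# increasing numHashTables and/or bucketLength makes that less likely,
# at the cost of more computation.
neighbors = model.approxNearestNeighbors(df, key, numNearestNeighbors=5)
```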
There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
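A rough sketch of that lookup, assuming the table is a list of (histogram, result) pairs and using SciPy's 1-D Wasserstein distance, which coincides with the Earth Mover's Distance for histograms over a shared set of bins:

```python
import numpy as np
from scipy.stats import wasserstein_distance

BIN_POSITIONS = np.arange(20)  # all histograms share the same 20 bins

def lookup(table, query_hist):
    """table: list of (histogram, precomputed_result) pairs."""
    best_hist, best_result = min(
        table,
        key=lambda pair: wasserstein_distance(
            BIN_POSITIONS, BIN_POSITIONS,
            u_weights=pair[0], v_weights=query_hist))
    return best_result
```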
Finding N points in M-space that are optimally spread out is more or less equivalent to hypersphere packing (1, 2), and in general optimal arrangements are not known for M > 10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
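A hedged sketch of that idea, assuming you can generate a batch of histograms together with their expensive results (histograms and expensive_results below are placeholders), using scikit-learn's PCA plus a simple regressor:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# histograms: shape (n_samples, 20); expensive_results: shape (n_samples,)
pca = PCA(n_components=5)            # M = 5 component weights, as an example
Z = pca.fit_transform(histograms)
reg = LinearRegression().fit(Z, expensive_results)

# At run time: predict an approximate answer from a new histogram's weights.
new_hist = np.asarray(new_histogram).reshape(1, -1)
approx = reg.predict(pca.transform(new_hist))
```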
Building on jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space is spanned by whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
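A minimal sketch of that clustering step, assuming random Dirichlet draws as a stand-in for representative histograms (replace them with real data when you can):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder for representative 20-bin histograms that sum to 1
samples = rng.dirichlet(alpha=np.ones(20), size=100_000)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(samples)
exemplars = kmeans.cluster_centers_
exemplars /= exemplars.sum(axis=1, keepdims=True)  # keep them valid histograms

# Precompute the expensive result once per exemplar and store the pairs.
```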
I second jwpat7's answer, but my very naive approach was to treat the count of items in each histogram bin as a y value, to take the x values as just 0..1 in 20 steps, and then to fit parameters a, b, c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data is often better represented that way, or whatever model you think is most descriptive, and generate however many histograms you need for the covering. I used 27 because three values per parameter for three parameters gives 3*3*3 = 27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) approach wouldn't give a whole number of values per parameter, but for 3 parameters 4*5*5 would work.
Since the actual values of the parameters can vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters, b/a and b/c.
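A sketch of the cubic covering described above (the clipping and normalization choices are my assumptions; they just turn each fitted shape back into a non-negative histogram that sums to 1):

```python
import itertools
import numpy as np

x = np.linspace(0, 1, 20)                 # 20 bins, x in 0..1
covering = []
for a, b, c in itertools.product(np.linspace(-1, 1, 3), repeat=3):
    y = a * x**3 + b * x**2 + c * x       # cubic shape, no constant term
    y = np.clip(y - y.min(), 1e-9, None)  # shift to be non-negative
    covering.append(y / y.sum())          # normalize to sum to 1

# len(covering) == 27: three values per parameter for three parameters
```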
Here is an 81-histogram "covering" using a quartic model, again with parameters chosen from linspace(-1, 1, 3).
edit: Since you said your histograms are described by arrays of ~20 elements, I figured fitting the parameters would be very fast.
edit2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
I need to use the bagging technique (short for bootstrap aggregating) in order to train a random forest classifier. I read the description of this learning technique here, but I have not figured out how to organize the dataset initially.
Currently I load all the positive examples first and the negative ones immediately after. Moreover, there are fewer than half as many positive examples as negative ones, so when sampling uniformly from the dataset, the probability of obtaining a negative example is greater than that of obtaining a positive one.
How should I build the initial dataset?
Should I shuffle the initial dataset containing positive and negative examples?
Bagging depends on using bootstrap samples to train the different predictors, and aggregating their results. See the above link for the full details, but in short: you need to sample from your data with repetition (i.e. if you have N elements numbered 1 through N, pick K random integers between 1 and N, with repetition, and take the elements at those K positions as a training set), usually creating samples of the same size as the original dataset (i.e. K = N).
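A minimal sketch of drawing those bootstrap samples with NumPy (X and y are placeholders for your feature matrix and labels; because the indices are drawn uniformly at random with replacement, it makes no difference whether the original dataset is shuffled or has all positives before all negatives):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    n = len(y)
    idx = rng.integers(0, n, size=n)  # K = N draws, with repetition
    return X[idx], y[idx]

rng = np.random.default_rng(42)
samples = [bootstrap_sample(X, y, rng) for _ in range(100)]  # one per tree
```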
One more thing you should probably bear in mind - random forests are more than just bootstrap aggregation over the original data - there is also a random selection of a subset of the features considered at each split of each individual tree.
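In practice, scikit-learn's RandomForestClassifier handles both steps for you: bootstrap samples per tree and a random feature subset at each split. A sketch (the class_weight setting is just one option for dealing with your positive/negative imbalance):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,
                             bootstrap=True,           # bagging over the rows
                             max_features="sqrt",      # random feature subset per split
                             class_weight="balanced",  # compensates for imbalance
                             random_state=0)
clf.fit(X, y)  # X, y as above; no prior shuffling needed
```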