Random Decision Forest implementation with C# - random

Hi, I'm trying to use ALGLIB to build a random decision forest (RDF). Unfortunately, each of my training samples has more than two variables. I should use the function below, but my training data has 7 variables, and I think the ALGLIB implementation only handles 2 variables. How can I use it with training samples that have 7 variables?
public static void alglib.dfbuildrandomdecisionforestx1(
double[,] xy,
int npoints,
int nvars,
int nclasses,
int ntrees,
int nrndvars,
double r,
out int info,
out decisionforest df,
out dfreport rep)
Thanks in advance. I'm not insisting on ALGLIB: if another library suits my training set and implements random decision forests, I can use that instead.

You should put all your training samples into a two-dimensional array of size [npoints, nvars+1], where npoints is the number of training examples, nvars is the number of variables (7 in your case), and the last column (the +1) holds the class label.
You can find more information on the parameters next to the function definition.
For more information on the layout, read the documentation on the dataset format.
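The layout described above can be sketched in Python as follows. This is only an illustration of the [npoints, nvars+1] table, not a call into ALGLIB; the helper name build_xy and the sample values are made up for the example.

```python
# Hypothetical helper (not part of ALGLIB): builds an [npoints, nvars + 1]
# table in which the last column of every row is the class label.
def build_xy(samples, labels):
    if len(samples) != len(labels):
        raise ValueError("need exactly one label per sample")
    nvars = len(samples[0])
    xy = []
    for features, label in zip(samples, labels):
        if len(features) != nvars:
            raise ValueError("every sample must have the same number of variables")
        # The feature columns come first, the class label goes last.
        xy.append([float(v) for v in features] + [float(label)])
    return xy


# Three made-up samples with 7 variables each and two classes (0 and 1).
samples = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
    [2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7],
]
labels = [0, 1, 0]

xy = build_xy(samples, labels)
npoints = len(xy)        # number of training examples
nvars = len(xy[0]) - 1   # number of variables, excluding the label column
```

Here npoints and nvars are derived from the table itself, which helps keep them consistent with what is passed to the training function.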

Related

What's the best way to compress multiple values into deserializable value?

I'm implementing an openpeeps.com library for Flutter in which users can create their own peeps to use as avatars within our product.
One of the reasons for using peeps as avatars is that (in theory) a peep can easily be stored as a single value in a database.
A Peep in my library consists of up to 6 PeepAtoms:
class Peep {
final PeepAtom head;
final PeepAtom face;
final PeepAtom facialHair;
final PeepAtom? accessories;
final PeepAtom? body;
final PeepAtom? pose;
}
A PeepAtom is currently just a name identifying the underlying image file required to build a Peep:
class PeepAtom {
final String name;
}
How to get a hash?
What I'd like to do now is get a single value from a Peep (int or string) which I can store in a database. If I retrieve the data, I'd like to deconstruct the value into the unique atoms so I can render the appropriate atom images to display the Peep. While I'm not really looking to optimize for storage size, it would be nice if the bytesize would be small.
Since I'm normally not working with such stuff I don't have an idea what's the best option. These are my (naïve) ideas:
Do a Peep.toJson and convert the output to base64. Likely inefficient due to a bunch of unnecessary characters.
Do a PeepAtom.hashCode for each field within a Peep and upload this. As an array that would be 64 bits = 8 bytes * 6 (atoms). That's pretty OK, but it is not a single value.
Since there is only a limited number of atoms in each category (less than 100), I could use bit shifts and ^ to put this into one int. However, I think this would not really work, because I'd need a unique identifier for each atom, and since I'm code-generating the PeepAtoms, that would likely be quite complex.
Any better ideas/algorithms?
I'm not sure what you mean by "quite complex". It looks quite simple to pack your atoms into a double.
Note that this is in no way a "hash". A hash is a lossy operation, and I presume that you want to recover the original data.
Based on your description, you need seven bits for each atom. Their indices can range over 0..99 (since you said "less than 100"), which fits within the 0..127 range of seven bits. A double has 52 bits of mantissa. Your six atoms need 42 bits, so they fit easily. For atoms that can be null, just use a special unused 7-bit value, like 127.
Now just use multiply and add to combine them. Use modulo and divide to pull them back out. E.g.:
double val = head;
val = val * 128 + face;
val = val * 128 + facialHair;
...
To extract:
int pose = (val % 128).toInt();
val = (val / 128).floorToDouble();
int body = (val % 128).toInt();
val = (val / 128).floorToDouble();
...
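The same multiply-and-add scheme, with its inverse, can be sketched in Python as a round trip. This assumes each atom has already been mapped to an integer index in 0..99; the function names and the example indices are hypothetical, and the value 127 marks a missing atom as suggested above.

```python
BASE = 128        # 7 bits per atom
NULL_ATOM = 127   # unused 7-bit value that stands for a missing (null) atom


def pack(indices):
    """Pack atom indices (ints below 127, or None) into one integer.

    The first atom ends up in the highest 7 bits.
    """
    value = 0
    for index in indices:
        if index is None:
            slot = NULL_ATOM
        elif 0 <= index < NULL_ATOM:
            slot = index
        else:
            raise ValueError("atom index out of range")
        value = value * BASE + slot
    return value


def unpack(value, count=6):
    """Recover the atom indices from a packed value, in the original order."""
    slots = []
    for _ in range(count):
        slot = value % BASE          # lowest 7 bits hold the last atom
        slots.append(None if slot == NULL_ATOM else slot)
        value //= BASE
    slots.reverse()
    return slots


# head, face, facialHair, accessories, body, pose (made-up indices).
peep = [12, 99, 0, None, 45, None]
packed = pack(peep)
restored = unpack(packed)
```

Six atoms at 7 bits each stay below 2^42, so the packed value also fits in the mantissa of a double without loss.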

How to average several columns at once in Scalding?

As the final step on some computations with Scalding I want to compute several averages of the columns in a pipe. But the following code doesn't work
myPipe.groupAll { _average('col1,'col2, 'col3) }
Is there any way to compute such functions sum, max, average without doing several passes? I'm concerned about performance but maybe Scalding is smart enough to detect that programmatically.
This question was answered in the cascading-user forum. I am leaving the answer here for reference:
myPipe.groupAll { _.average('col1).average('col2).average('col3) }
You can compute size (aka count), average, and standard deviation in one go using the function below.
// Find the count of boys vs. girls, their mean age and standard deviation.
// The new pipe contains "sex", "count", "meanAge" and "stdevAge" fields.
val demographics = people.groupBy('sex) { _.sizeAveStdev('age -> ('count, 'meanAge, 'stdevAge) ) }
Finding the max would require another pass, though.
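As a language-neutral illustration of why count, mean and standard deviation can share one pass, here is a minimal Python sketch that keeps only a count, a sum and a sum of squares per column. It does not use Scalding; the function name column_stats is made up, rows are assumed to be non-empty dictionaries, and the population (not sample) standard deviation is an assumption of this sketch.

```python
import math


def column_stats(rows, columns):
    """Single pass over rows, accumulating count, sum and sum of squares per column."""
    acc = {c: [0, 0.0, 0.0] for c in columns}
    for row in rows:
        for c in columns:
            x = row[c]
            a = acc[c]
            a[0] += 1        # count
            a[1] += x        # running sum
            a[2] += x * x    # running sum of squares
    stats = {}
    for c, (n, s, ss) in acc.items():
        mean = s / n
        # Population variance; clamped at 0 to absorb rounding error.
        variance = max(ss / n - mean * mean, 0.0)
        stats[c] = {"count": n, "mean": mean, "stdev": math.sqrt(variance)}
    return stats


rows = [
    {"col1": 1.0, "col2": 10.0, "col3": 5.0},
    {"col1": 2.0, "col2": 20.0, "col3": 5.0},
    {"col1": 3.0, "col2": 30.0, "col3": 5.0},
]
stats = column_stats(rows, ["col1", "col2", "col3"])
```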

How to choose all possible combinations?

Let's assume that we have the list of loans user has like below:
loan1
loan2
loan3
...
loan10
And we have the function which can accept from 2 to 10 loans:
function(loans).
For ex., the following is possible:
function(loan1, loan2)
function(loan1, loan3)
function(loan1, loan4)
function(loan1, loan2, loan3)
function(loan1, loan2, loan4)
function(loan1, loan2, loan3, loan4, loan5, loan6, loan7, loan8, loan9, loan10)
How to write the code to pass all possible combinations to that function?
RosettaCode has implementations for generating combinations in many languages; pick the one you need.
Here's how we could do it in Ruby:
loans= ['loan1','loan2', ... , 'loan10']
def my_function(loans)
  # Array#combination(k) yields every k-element combination; cover sizes 2..n.
  (2..loans.length).each do |k|
    loans.combination(k).each do |combination|
      # do something with this combination
    end
  end
end
To call :
my_function(loans);
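For comparison, the same enumeration can be sketched in Python with the standard library's itertools.combinations. The generator name all_loan_combinations is made up for this example, and the loans are plain strings standing in for real loan objects.

```python
from itertools import combinations


def all_loan_combinations(loans, min_size=2):
    """Yield every combination of loans, from min_size items up to all of them."""
    for size in range(min_size, len(loans) + 1):
        for combo in combinations(loans, size):
            yield combo


loans = [f"loan{i}" for i in range(1, 11)]   # loan1 .. loan10

# Each yielded tuple can be passed on to the function that accepts 2 to 10 loans.
first = next(all_loan_combinations(loans))
count = sum(1 for _ in all_loan_combinations(loans))
```

For 10 loans this visits 2^10 - 1 - 10 = 1013 combinations, since the empty set and the 10 single-loan sets are excluded.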
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it might be faster than the link you have found.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to the language of your choice.
To solve your problem, you might want to write a new loans function that takes as input an array of loan objects and works on those objects with the BinCoeff class. In C#, to obtain the array of loans for each unique combination, something like the following example code could be used:
void LoanCombinations(Loan[] Loans)
{
    // The Loans array contains all of the loan objects that need
    // to be handled.
    int N = Loans.Length;
    // Loop through all possible group sizes.
    // Start with 2 loan objects, then 3, 4, and so forth, up to all N.
    for (int K = 2; K <= N; K++)
    {
        // Create the bin coeff object required to get all
        // the combos for this N choose K combination.
        BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
        int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
        int[] KIndexes = new int[K];
        // Loop through all the combinations for this N choose K.
        for (int Combo = 0; Combo < NumCombos; Combo++)
        {
            // Get the k-indexes for this combination, which in this case
            // are the indexes to each loan in Loans.
            BC.GetKIndexes(Combo, KIndexes);
            // Create a new array of Loan objects that correspond to
            // this combination group.
            Loan[] ComboLoans = new Loan[K];
            for (int Loop = 0; Loop < K; Loop++)
                ComboLoans[Loop] = Loans[KIndexes[Loop]];
            // Call the ProcessLoans function with the loans to be processed.
            ProcessLoans(ComboLoans);
        }
    }
}
I have not tested the above code, but in general it should solve your problem.

In Stata, how do I manipulate matrix elements by their name?

In Stata, after a regression I know it is possible to call the elements of stored results by name. For example, if I want to manipulate the coefficient on the variable precip, I just type _b[precip]. My question is how do I do the same after the tabstat command? For example, say I want to multiply the coefficient on precip by the sample mean of precip:
reg --variables in regression--
tabstat --variables in regression--
mat X=r(StatTotal)
mat Y=_b[precip]*X[1,precip]
Ah, if only it were that simple. But alas, in the last line X[1, precip] is invalid syntax. Oddly, Stata does recognize display X[1, precip]. And Stata would know what I'm trying to do if instead of precip I used the column number where precip appears in the X vector. If I were just doing this operation once, no problem. But I need to do this operation several times (for several different model specifications) and for several variables which change position in the vector from one model to the next, so I cannot just use the column number.
I am not yet sure I understand exactly what you want to do, but here's my attempt to reproduce what you are doing:
sysuse auto, clear
regress price mpg foreign weight
tabstat mpg foreign weight, save
matrix X = r(StatTotal)
matrix Y = _b[mpg]*X[1, colnumb(X, "mpg") ]
If you need to put this into a loop, that's doable, too:
matrix bb = e(b)
local explvar : colnames bb
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix Y_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix Y_`x' = _b[`x']
}
}
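The idea behind the loop above, looking each value up by column name rather than by position, can be sketched in Python as follows. The numbers are made up and merely stand in for the regression coefficients and the row of means; the function name scaled_by_mean is hypothetical.

```python
# Made-up values: coefficients keyed by variable name, and a row of sample
# means whose column order is given separately by column_names.
coefficients = {"mpg": -50.0, "foreign": 3600.0, "weight": 3.5, "_cons": -5800.0}
column_names = ["mpg", "foreign", "weight"]
means_row = [21.3, 0.3, 3019.5]


def scaled_by_mean(coefficients, column_names, means_row):
    """Multiply each coefficient by the mean of its variable, found by name."""
    position = {name: i for i, name in enumerate(column_names)}
    result = {}
    for name, b in coefficients.items():
        if name == "_cons":
            # The constant has no column of its own in the means row.
            result[name] = b
        else:
            result[name] = b * means_row[position[name]]
    return result


result = scaled_by_mean(coefficients, column_names, means_row)
```

Because the position is resolved from the name each time, the result does not depend on where a variable sits in the row for a given model specification.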
You'd probably want to put this into a program that you will call after each regression model estimation call, e.g.:
program define reg2mat
syntax , prefix(name)
if "`e(cmd)'" != "regress" {
// this will intentionally produce an error
regress
}
tempname bb
matrix `bb' = e(b)
local explvar : colnames `bb'
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix `prefix'_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix `prefix'_`x' = _b[`x']
}
}
end // of reg2mat
In many respects this is not ideal, as it manipulates (global) matrices in Stata's memory; most of the time that is a bad idea, as programs should only manipulate objects local to them.
I suspect that what you want to do is addressed, in one way or another, by the all-powerful margins command, by an appropriate predict, or by matrix score (which is the low-level version of predict). Attributing the effects to a variable only makes sense when your regressors are orthogonal, which only happens in carefully designed and conducted experiments.

algorithm to get US zip codes from gis x,y coordinates

I have a database of many tens of thousands of events that occurred at specific geographic locations within the United States. The data include x,y coordinates for each event, encoded using the NAD83 reference system. I want to write or use an algorithm to reliably get the US zip code associated with each NAD83 x,y coordinate.
I do not yet have zip code definitions using the NAD83 reference system. And I have never done this kind of programming before. But it just seems like it would be intuitively simple to find out whether a given x,y coordinate is located within a geometric shape of a US zip code defined using the same NAD83 reference system.
Can anyone help me with the following:
1.) Where do I get reliable US Zip Code definitions in the NAD83 reference system format?
2.) Where can I find example code for an algorithm to find the zip code given an x,y coordinate?
Any links you can send to instructional articles/tutorials, example code, and NAD83 zip code boundary definition data would be really helpful. I am doing google searches, but I figured that people on this site might be able to give me more of an expert's guide.
I code in Java every day. But, if the code you provide is not written in java, I could take code written in another language and adapt it to java for my purposes. I do not have database software installed in my computer because I just use csv or text files as inputs into my java applications. If you have some database that you suggest I use, I would need links to instructions for how to get the data into a format that I can import into a programming language such as java.
Finally, the street addresses in my dataset do not include zip codes, and the street addresses are written haphazardly, so that it would be very difficult to try to clean the address data up enough to try to get zip codes from the addresses. I can isolate the data to several adjacent cities, in perhaps a couple hundred zip codes, but I think that the NAD83 x,y coordinates are my best shot at deriving the zip code in which each event in my dataset occurred. I want to link my resulting zip code by zip code analyses with other data that I get about each zip code from sources like the US Census, etc.
Thank you in advance to anyone who is willing to help.
You can use GeoTools in Java. Here is an example that searches for a point in a shapefile.
// projection/datum in SR-ORG:7169 (GCS NAD83)
File shapeFile = new File("zt08_d00.shp");
FileDataStore store = FileDataStoreFinder.getDataStore(shapeFile);
SimpleFeatureSource featureSource = store.getFeatureSource();
// Boulder, CO
Filter filter = CQL.toFilter("CONTAINS(the_geom, POINT(-105.292778 40.019444))");
SimpleFeatureCollection features = featureSource.getFeatures(filter);
// A feature collection is read through its own iterator, which must be closed.
SimpleFeatureIterator it = features.features();
try {
    while (it.hasNext()) {
        SimpleFeature f = it.next();
        System.out.println(f.getAttribute("NAME"));
    }
} finally {
    it.close();
}
I grabbed a shapefile from the U.S. Census Bureau's collection of 5-Digit ZIP Code Tabulation Areas from the 2000 Census. I just used a single file for the state of Colorado. You would need to merge these into a single FeatureSource. Running this outputs 80302 for Boulder, CO.
GeoTools also allows you to convert between projections if needed. Luckily these shapefiles are already in NAD83.
I don't know where to get the ZIP code definitions, but I think you can find the ZIP codes of each state with a Google search.
As for question (2), first you'll need the geographic information, i.e. the boundary of each area. Then you just go through all the points (x,y) and determine which polygon each one is in.
Here is some sample code; it was written for SGU124.
#include <map>
#include <cstdio>
#include <cstring>
#include <algorithm>
#define MAXN 10005
using namespace std;
struct pnt{
int x,y;
};
struct seg{
pnt a,b;
} s[MAXN];
int n;
pnt p;
int h[MAXN<<1];
int k[MAXN<<1];
void work(){
int i,x,y,c = 0;
memset(h,0,sizeof(h));
memset(k,0,sizeof(k));
for (i=0;i<n;i++){
if (s[i].a.x<=p.x && p.x<=s[i].b.x && s[i].a.y<=p.y && p.y<=s[i].b.y){
printf("BORDER\n");
return;
}
if (s[i].a.x==s[i].b.x){
x = s[i].a.x;
y = p.y - p.x + x;
if (x<=p.x && s[i].a.y<=y && y<=s[i].b.y){
h[x+MAXN] = 1;
if (y==s[i].a.y) k[x+MAXN] |= 1;
else if (y==s[i].b.y) k[x+MAXN] |= 2;
}
}
else{
y = s[i].a.y;
x = p.x - p.y + y;
if (x<=p.x && s[i].a.x<=x && x<=s[i].b.x){
//printf("%d %d %d %d\n",s[i].a.x,s[i].a.y,s[i].b.x,s[i].b.y);
h[x+MAXN] = 1;
if (x==s[i].a.x) k[x+MAXN] |= 4;
else if (x==s[i].b.x) k[x+MAXN] |= 8;
}
}
}
for (i=p.x;i>=-10000;i--){
//if (h[i+MAXN]>0) printf("# %d %d\n",i,k[i+MAXN]);
if (k[i+MAXN]!=9 && k[i+MAXN]!=6) c += h[i+MAXN];
}
//printf("p # %d %d ",p.x,p.y);
if (c%2) printf("INSIDE\n");
else printf("OUTSIDE\n");
}
int main(){
freopen("sgu124.in","r",stdin);
int i;
while (~scanf("%d",&n)){
for (i=0;i<n;i++){
scanf("%d%d",&s[i].a.x,&s[i].a.y);
scanf("%d%d",&s[i].b.x,&s[i].b.y);
if (s[i].a.x>s[i].b.x || s[i].a.y>s[i].b.y) swap(s[i].a,s[i].b);
}
scanf("%d%d",&p.x,&p.y);
work();
//break;
}
return 0;
}
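The point-in-polygon step described in this answer can also be sketched with the common even-odd ray casting test. The Python version below is a minimal illustration and not a port of the SGU124 code above; the zone names and square polygons are made up, real ZIP boundaries would come from a shapefile, and points that lie exactly on a boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(x, y, zones):
    """Return the name of the first zone whose polygon contains the point."""
    for name, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Two hypothetical, adjacent square zones.
zones = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
```

With a few hundred polygons a linear scan like find_zone is usually adequate; a bounding-box check per polygon is a cheap way to skip most of them first.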
You mentioned that you have addresses that you might be able to use. In that case, an address verification service will allow you to programmatically find the ZIP codes based on the address and city/state. Even if poorly formatted, the address data could likely get you to 90 or 95% of your goal, leaving the remainder to either clean up and reprocess or resolve using the coordinates.
SmartyStreets will take an uploaded CSV file with your data and perform address validation (correct and standardize the address) and then verify the addresses using data from the USPS. One unique feature of SmartyStreets is that they don't charge anything for bad addresses. This would allow you to format and process various permutations of each address (to try to account for the haphazard data) and only pay for it if a positive match is resolved.
In the interest of full disclosure, I am the founder of SmartyStreets. We provide street address verification.

Resources