I am using the Stanford NLP text classifier (ColumnDataClassifier) from my Java code. I have two main questions.
1) How do I print more detailed evaluation information, such as a confusion matrix?
2) My code already does the pre-processing and extracts numeric features (vectors) for terms, such as binary features or TF-IDF values. How can I use those features to train and test the classifier?
I answered a related question here. ColumnDataClassifier does not have an option to output the metrics as a confusion matrix. However, if you look at the code in ColumnDataClassifier.java you can see where the TP, FP, TN, and FN counts are printed to the console. That is where the raw values you need are computed. They could feed a method that aggregates them into a confusion matrix and outputs it after the run, but you would have to write that code yourself.
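As an illustration of that aggregation step (done outside the Stanford API, in Python), suppose your Java driver writes one gold/predicted label pair per line to a hypothetical predictions.tsv; scikit-learn can then build the matrix and per-class metrics:
from sklearn.metrics import confusion_matrix, classification_report

# predictions.tsv is a placeholder: one "gold<TAB>predicted" pair per line,
# written out by your own evaluation code.
gold, pred = [], []
with open("predictions.tsv") as f:
    for line in f:
        g, p = line.rstrip("\n").split("\t")
        gold.append(g)
        pred.append(p)

labels = sorted(set(gold) | set(pred))
print(confusion_matrix(gold, pred, labels=labels))
print(classification_report(gold, pred, labels=labels))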
The wiki has an example of how to use numerical features with ColumnDataClassifier. If you use numerical features, take a look at these options from the API that allow you to apply some transformations (a small properties sketch follows the list):
realValued (boolean, default false): Treat this column as real-valued and do not perform any transform on the feature value.
logTransform (boolean, default false): Treat this column as real-valued and use the log of the value as the feature value.
logitTransform (boolean, default false): Treat this column as real-valued and use the logit of the value as the feature value.
sqrtTransform (boolean, default false): Treat this column as real-valued and use the square root of the value as the feature value.
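For illustration only, a properties file along these lines would treat column 1 as real-valued and log-transform column 2 (the file names are placeholders; check the ColumnDataClassifier documentation for the full option set):
# per-column options use the column index as a prefix
trainFile=train.tsv
testFile=test.tsv
1.realValued=true
2.logTransform=true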
I want to use the k-prototypes algorithm (a k-means-style clustering algorithm for mixed data: numerical and categorical) for a clustering problem.
The algorithm handles the categorical values without numerical encoding, so I don't need to encode them to numerical values.
My question is: do we need to standardize the numerical columns before applying k-prototypes?
For example, I have the following columns: age (float), salary (float), gender (object), city (object), profession (object).
Do I need to apply standardization like this?
from sklearn.preprocessing import StandardScaler
scaled_X = StandardScaler().fit_transform(X[['salary', 'age']])
X[['salary', 'age']] = scaled_X
But I think that standardization has no value if it is not applied to all columns, because its goal is to put all variables on the same scale, not just some of them.
So in this case, we would not need to apply it.
I hope I explained my question well. Thank you.
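For what it's worth, here is a minimal sketch of how k-prototypes is usually wired up in Python with the kmodes package, standardizing only the numeric columns first (the cluster count and column indices are placeholders; this only shows where the scaling decision enters, it does not settle it):
from sklearn.preprocessing import StandardScaler
from kmodes.kprototypes import KPrototypes

# Scale only the numeric columns; k-prototypes handles the categorical
# columns with a matching-based dissimilarity, so they stay as strings.
X[['salary', 'age']] = StandardScaler().fit_transform(X[['salary', 'age']])

# categorical= takes the positional indices of the categorical columns
# (here assumed to be gender, city, profession).
kp = KPrototypes(n_clusters=4, init='Cao')
clusters = kp.fit_predict(X.values, categorical=[2, 3, 4])
Note that KPrototypes also exposes a gamma parameter that weights the categorical dissimilarity against the numeric distance, so rescaling the numeric columns changes that balance, which is essentially what this question is about.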
"The lasso method requires initial standardization of the regressors,
so that the penalization scheme is fair to all regressors. For
categorical regressors, one codes the regressor with dummy variables
and then standardizes the dummy variables" (p. 394).
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model.
Statistics in medicine, 16(4), 385-395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf
H2O:
Similar to package ‘glmnet,’ the h2o.glm function includes a ‘standardize’ parameter that is true by default. However, if predictors are stored as factors within the input H2OFrame, H2O does not appear to standardize the automatically encoded factor variables (i.e., the resultant dummy or one-hot vectors). I've confirmed this experimentally, but references to this decision also show up in the source code:
For instance, method denormalizeBeta (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." It also looks like means (variable _normSub) and standard deviations (inverse of variable _normMul) are only calculated for the numerical variables, and not the categorical variables, in the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599).
glmnet:
In contrast, package 'glmnet' seems to expect categorical variables to be dummy-coded prior to fitting a model, using a function like model.matrix. The dummy variables are then standardized along with the continuous variables. It seems like the only way to avoid this would be to pre-standardize the continuous predictors only, concatenate them with the dummy variables, and then run glmnet with standardize=FALSE.
Statistical Considerations:
For a dummy variable or one-hot vector with mean p (the proportion of TRUE values), the population SD is sqrt(p(1 − p)). The SD reaches its maximum of σ = 0.5 when the proportions of TRUE and FALSE values are equal (p = 0.5), and in that balanced case the sample SD (s) approaches 0.5 as n → ∞. Thus, if continuous predictors are standardized to have SD = 1, but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD for imbalanced dummy variables.
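To make the scale gap concrete, here is a small numpy sketch (my own illustrative numbers, not taken from any of the sources cited here) comparing a standardized continuous predictor with unstandardized dummies:
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(10, 3, n)
x_std = (x - x.mean()) / x.std()       # standardized continuous predictor: SD = 1

balanced = rng.binomial(1, 0.5, n)     # dummy with p = 0.5:  SD ~ 0.5
imbalanced = rng.binomial(1, 0.05, n)  # dummy with p = 0.05: SD ~ 0.22

print(x_std.std(), balanced.std(), imbalanced.std())
# roughly 1.0, 0.5, 0.22 -- the continuous predictor has at least twice the SD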
It seems like this could be a problem for regularization (LASSO, ridge, elastic net), because the scale/variance of predictors is expected to be equal so that the regularization penalty (λ) applies evenly across predictors. If two predictors A and B have the same standardized effect size, but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B. This means that, if left unstandardized, the regularization penalty will erroneously be more severe to A than B. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, it seems like this could lead to systematic over-penalization of categorical predictors.
A commonly expressed concern is that standardizing dummy variables removes their normal interpretation. To avoid this issue, while still placing continuous and categorical predictors on an equal footing, Gelman (2008) suggested standardizing continuous predictors by dividing by 2 SD, rather than 1, resulting in standardized predictors with SD = 0.5. However, it seems like this would still be biased for class-imbalanced dummy variables, for which the SD might be substantially less than 0.5.
Gelman, A. (2008). Scaling regression inputs by dividing by two
standard deviations. Statistics in medicine, 27(15), 2865-2873.
http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
Question:
Is H2O's approach of not standardizing one-hot vectors for regularized regression correct? Could this lead to a bias toward over-penalizing dummy variables / one-hot vectors? Or has Tibshirani's (1997) recommendation since been revised for some reason?
Personally, I would rather keep the binary features untouched and apply MinMaxScaler to the numeric features, mapping them to [0, 1], instead of standardization. This puts the numeric features on a standard deviation scale similar to that of the binary features.
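A sketch of what I mean, assuming a pandas DataFrame X in which num_cols are the continuous columns (placeholder names) and the dummy / one-hot columns are already 0/1:
from sklearn.preprocessing import MinMaxScaler

num_cols = ['salary', 'age']                             # placeholder column names
X[num_cols] = MinMaxScaler().fit_transform(X[num_cols])
# binary / one-hot columns are left untouched; they are already in [0, 1]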
I would like to use machine learning techniques such as Naive Bayes and SVM in Weka to identify species using DNA sequence data.
The issue is that I have to convert the DNA sequences into numerical vectors.
My sequences look like this:
------------------------------------------------G
------------------------------------------GGAGATG
------------------------------------------GGAGATG
------------------------------------------GGAGATG
TTATTAATTCGAGCAGAATTAGGAAATCCTGGATCTTTAATTGGTGATG
----------------------------------------------ATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG
-----------------------TCAACCTGGGGCCCTACTCGGAGACG
----TAATCCGAGCAGAATTAAGCCAACCTGGCGCCCTACTAGGGGATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATG
TTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATA
How can I do this? Any suggestions for other programs, besides Weka, for doing ML with DNA sequences?
This answer makes use of R.
You can use R's Biostrings package for this.
Install package first:
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biostrings"))
Convert a character string to a DNAString:
dna1 <- DNAString("------------------------------------------------G------------------------------------------GGAGATG")
Alternatively,
dna2 <- DNAStringSet(c("ACGT", "GTCA", "GCTA"))
alphabetFrequency(dna1)
letterFrequency(dna1, "GC")
....
Then (if you must) you can call Weka functions from R, e.g. Naive Bayes with NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes"); NB(colx ~ . , data = mydata), or convert your data as you wish and/or export to other types of files that Weka understands; the foreign::write.arff() function comes to mind. But I wouldn't use Weka for this.
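If you would rather stay outside R and Weka entirely, k-mer counts are a common way to turn such sequences into numeric vectors; here is a minimal scikit-learn sketch (the sequences shown are placeholders, and alignment gaps are stripped first):
from sklearn.feature_extraction.text import CountVectorizer

seqs = ["TTATTAATTCGAGCAGAATTAGGAAATCCTGG",   # placeholder sequences
        "CTATTAATTCGAGCTGAGCTAAGCCAGCCCGG"]
seqs = [s.replace("-", "") for s in seqs]      # drop alignment gaps

# character 3-grams (k-mers of length 3) as count features
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
X = vec.fit_transform(seqs)
print(vec.get_feature_names_out())
print(X.toarray())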
Needless to say, you can also simply enter these sequences into a website that performs a BLAST search and get likely species candidates.
For CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG I get mitochondrial DNA from "banded rock lizard" (Petrosaurus mearnsi) with 91% probability.
My question is: can I generate a random number in Uppaal?
I would like to generate a number from a range of values. Moreover, I would like to generate not just integers but double values as well.
For example: a double in [7.25, 18.3].
I found this question, which discusses the same thing, and I tried it.
However, I got this error: syntax error unexpected T_SELECT.
It doesn't work. I'm pretty new to the Uppaal world, so I would appreciate any help you can provide.
Regards,
This is a common and often misunderstood question about Uppaal.
Simple answer:
double val; // declaration
val = random(18.3-7.25)+7.25; // use in update, works in SMC (Uppaal v4.1)
Verbose answer:
Uppaal supports symbolic analysis as well as statistical analysis, and the treatment and possibilities are radically different, so one has to decide first what kind of analysis is needed. Usually one starts with simple symbolic analysis and then augments it with stochastic features; sometimes the stochastic behavior also needs to be checked symbolically.
In symbolic analysis (queries A[], A<>, E<>, E[], etc.), random is synonymous with non-deterministic, i.e. if the model contains some "random" behavior, then verification should check all of it anyway. Therefore such behavior is modelled as non-deterministic choices between edges. It is easy to set up a set of edges over an integer range by using a select statement on the edge, where a temporary variable is declared and its value can be used in guards, synchronizations, and updates. Symbolic analysis supports only integer data types (no floating point types like double) and continuous ranges over clocks (specified by constraints in guards and invariants).
Statistical analysis (via Monte-Carlo simulations, queries like Pr[...](<> p), E[...](max: var), simulate, etc) supports double types and floating point functions like sin, cos, sqrt, random(MAX) (uniform distribution over [0, MAX)), random_normal(mean, dev) etc. in addition to int data types. Clock variables can also be treated as floating point type, except that their derivative is set to 1 by default (can be changed in the invariants which allow ODEs -- ordinary differential equations).
It is possible to create models with floating point operations (including random) and still apply symbolic analysis provided that the floating point variables do not influence/constrain the model behavior, and act merely as a cost function over the state space. Here are systematic rules to achieve this:
a) the clocks used in ODEs must be declared of hybrid clock type.
b) hybrid clock and double type variables cannot appear in guard and invariant constraints. Only ODEs are allowed over the hybrid clocks in the invariant.
C program source code can be parsed according to the C grammar (described by a CFG) and turned into an AST. I am wondering whether a tool exists that does the reverse: first randomly generate many ASTs, which contain tokens that have no concrete string values, just token types, according to the CFG, and then generate the concrete tokens according to the tokens' regular-expression definitions.
I imagine the first step looks like an iterative replacement of non-terminals, performed randomly and limited to a certain number of iterations. The second step is just randomly generating strings that match the regular expressions.
Is there any tool that can do this?
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and a language prettyprinter (e.g., an AST-to-text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
Pick a nonterminal N in the tree;
Expand by adding children for the right hand side of any rule
whose left-hand side matches the nonterminal N
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until all the nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your chosen size.
Then prettyprint the result. It's the prettyprinter part that is hard to get right.
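As an illustration of the expansion loop (not the prettyprinter), here is a small Python sketch; the grammar and the random terminal generators are made up, whitespace-joining the leaves stands in for the prettyprinter, and the depth cap is a crude substitute for real backtracking:
import random

# Toy grammar: each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "expr":   [["term", "+", "expr"], ["term"]],
    "term":   [["factor", "*", "term"], ["factor"]],
    "factor": [["(", "expr", ")"], ["NUMBER"], ["IDENT"]],
}

def random_terminal(token_type):
    # Generate concrete text for value-carrying terminals (step two of the question).
    if token_type == "NUMBER":
        return str(random.randint(0, 99))
    if token_type == "IDENT":
        return random.choice("xyz") + str(random.randint(0, 9))
    return token_type  # literal terminals such as "+", "(", ")"

def generate(symbol, depth=0, max_depth=8):
    # Expand nonterminals randomly; past max_depth, take the shortest alternative
    # so the recursion is forced toward termination.
    if symbol not in GRAMMAR:
        return random_terminal(symbol)
    alternatives = GRAMMAR[symbol]
    rhs = min(alternatives, key=len) if depth >= max_depth else random.choice(alternatives)
    return " ".join(generate(s, depth + 1, max_depth) for s in rhs)

print(generate("expr"))  # e.g. "x3 * 42 + ( 7 )"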
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
The problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar). You have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices.
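A toy example of that computation (my own, just to illustrate the point): take the one-nonterminal grammar where S → a is chosen with probability p and S → S S with probability 1 − p. The expected output length E then satisfies
E = p·1 + (1 − p)·2E, so E = p / (2p − 1),
which is finite only for p > 1/2. If the recursive production is at least as likely as the terminating one, the expected length diverges, which is why the production weights have to be controlled (e.g., by down-weighting long or recursive right-hand sides).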
There is a lot more on this subject in:
Noam Chomsky and Marcel-Paul Schützenberger, "The Algebraic Theory of Context-Free Languages", pp. 118–161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963).
(See the Wikipedia entry on the Chomsky–Schützenberger enumeration theorem.)