How to handle nominal data with Weka J48 - binary-tree

When I ran Weka's J48 with the binary split option, the following decision tree was built:
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
The input explanatory variable is a single nominal attribute made by concatenating the question id and the answer id.
There is one nominal value per transaction.
I'm wondering why the tree grows on only one side.
Is it caused by my data set, by my table definition, or by the way binary splits work?
I'd like the tree to have nodes on both sides.
If you know of such an option, please show me.
Sample data:
usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
C,13,3
C,23,3
C,33,3
D,11,4
D,22,4
D,31,4
E,11,1
E,23,1
E,31,1
F,12,2
F,22,2
F,33,2
G,13,3
G,22,3
G,32,3
H,12,4
H,21,4
H,33,4

There's no error in the tree that was built, and no option would really change it. With binary splits on a nominal attribute, each node tests a single value against all the others, so with only one attribute the tree can only grow down one side. If your question is related to your Akinator project, please reformat your data so that all questions (e.g. 11, 21, 31) are on the same instance/line, with the answer as the target class.
PS: if you import this data as CSV, Weka will treat the values as numeric (not as nominal). You should add a non-digit character (e.g. #1, #2, #3...) so that Weka treats them as nominal.
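The reshaping suggested above can be sketched in Python. This is only an illustration, assuming the first digit of qa is the question id and the second is the answer id; the function name long_to_wide and the column names q1, q2, q3 are made up for this example. Values get the '#' prefix so a CSV import would read them as nominal.

```python
import csv
import io

# A few rows of the sample data from the question (long format).
LONG_CSV = """usr,qa,class
A,11,1
A,21,1
A,31,1
D,11,4
D,22,4
D,31,4
"""

def long_to_wide(text):
    """Collect all answers of one user on a single row (one instance per user)."""
    rows = {}
    for rec in csv.DictReader(io.StringIO(text)):
        # Assumption: qa = question id digit followed by answer id digit.
        question, answer = rec["qa"][0], rec["qa"][1]
        row = rows.setdefault(rec["usr"], {"class": "#" + rec["class"]})
        row["q" + question] = "#" + answer
    return rows

wide = long_to_wide(LONG_CSV)
for usr, row in wide.items():
    print(usr, row["q1"], row["q2"], row["q3"], row["class"], sep=",")
```

With one column per question, a binary split can choose among several attributes at each node, which is what allows both branches to keep splitting.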


Creating balanced bootstrap resamples in caret

I'm using caret to compare models for a classification problem with nested CV: V-fold in the outer loop and bootstrap (500 replicates) in the inner loop. I get this warning after training knn:
Warning: There were missing values in resampled performance measures.
I believe this happens because some resamples have zero items of the class of interest in the holdout sample, yielding NA for Sensitivity and ROC. My question is: is there any way to ensure that items from this class are present in every bootstrap resample, similar to what the createDataPartition function does? (I believe this is also called a stratified bootstrap.)
If not, how should we proceed? (In terms of comparing model performance on the same resamples.)
Thanks!
I couldn't find a way to do this within caret, but here is a workaround using the rsample package. The idea is to compute the resamples beforehand, convert them to caret format, and pass them to trainControl via the index and indexOut arguments.
library(rsample)
library(caret)
indices <- bootstraps(train, times = 50, strata = "class_of_interest")
indices <- rsample2caret(indices)
train_control <- trainControl(method = "boot", number = 50, index = indices$index, indexOut = indices$indexOut)
Hope this helps.
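To make clear what a stratified bootstrap does, here is a minimal sketch in plain Python. It is not the rsample implementation; the function name stratified_bootstrap and the toy labels are assumptions for illustration. Each class is resampled with replacement separately, so every analysis set keeps the original class counts, and the rows never drawn form the holdout set.

```python
import random
from collections import defaultdict

def stratified_bootstrap(labels, times, seed=0):
    """Return a list of (index, index_out) pairs, one per resample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    resamples = []
    for _ in range(times):
        index = []
        # Resample within each class so class proportions are preserved.
        for rows in by_class.values():
            index.extend(rng.choices(rows, k=len(rows)))
        drawn = set(index)
        # Out-of-bag rows: everything that was never drawn.
        index_out = [i for i in range(len(labels)) if i not in drawn]
        resamples.append((sorted(index), index_out))
    return resamples

labels = ["pos"] * 5 + ["neg"] * 20
for index, index_out in stratified_bootstrap(labels, times=3, seed=1):
    print(len(index), len(index_out))
```

Note that stratifying fixes the class counts in the analysis set. For a very small class, the out-of-bag set can still occasionally contain none of its rows, so it is worth checking the holdout sets before training.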

Correlating multiple dynamic values

How can I capture the values of important id and ValueType?
I have tried using web_reg_save_param_regexp (but unfortunately I don't fully understand how the function works).
I have also tried using web_reg_save_param (with the help of offset and length).
Unfortunately, I still cannot get accurate values, because some values change in length, especially the total amount values, which change dynamically on every run.
<important id=\"insertsomevalueshere\" record=\"1\" nucTotal=\"NUC609.40\"><total amount=\"68.75\" currency=\"USD\"/><total amount=\"609.40\" currency=\"USD\"/><out avgsomecost=\"540.65\" ValueType=\"insertsomevalueshere\" containsawesomeness=\"1\" Score=\"-97961\" somedatatype=\"1\" typeofData=\"VAL\" web=\"1\">
Put these lines of code before the line of code which does your web request:
web_reg_save_param_regexp("ParamName=importantid","Regexp=<important id=\\\"(.*?)\\\"",LAST);
web_reg_save_param_regexp("ParamName=ValueType","Regexp= ValueType=\\\"(.*?)\\\"",LAST);
You will then have two stored parameters, 'importantid' and 'ValueType', which you can reference as {importantid} and {ValueType} in later requests.
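To see why these expressions keep working when other values change length, the same patterns can be tried outside LoadRunner. The sketch below uses Python's re module only as a test bench; the values ID-001 and TYPE-9 are hypothetical stand-ins for the placeholders in the question, and the response body is shortened. The pattern anchors on the text just before the value and captures lazily up to the next escaped quote, so offsets never matter.

```python
import re

# Shortened response body; the backslash before each quote is literal,
# as in the question. ID-001 and TYPE-9 are hypothetical values.
body = (r'<important id=\"ID-001\" record=\"1\" nucTotal=\"NUC609.40\">'
        r'<total amount=\"68.75\" currency=\"USD\"/>'
        r'<out avgsomecost=\"540.65\" ValueType=\"TYPE-9\" web=\"1\">')

# Same regular expressions as in the answer, after C string unescaping:
# \\ matches the literal backslash, (.*?) captures up to the next \".
IMPORTANT_ID = re.compile(r'<important id=\\"(.*?)\\"')
VALUE_TYPE = re.compile(r' ValueType=\\"(.*?)\\"')

def capture(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(capture(IMPORTANT_ID, body))
print(capture(VALUE_TYPE, body))
```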
Dynamic number of elements to correlate? Your path for resubmission is through web_custom_request(). You will need to build the string you need dynamically with the name:value pairs for all of the data which needs to be included.
This path will place a premium on your string manipulation skills in the language of the tool. The default path is through C, but you have other language options if your skills are more refined in another language.

Logic to compare rows in pig

I need logic for the scenario below, to be implemented in a Pig script. Can anyone suggest how to do this?
The input contains a column groupName in which some values are placeholders such as others and unknown. Each placeholder needs to be replaced by the groupName of the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
Output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown is to be replaced with the group of 125 (sale0001). 130,others is to be replaced with the group of 129 (sale0002). 132,unknown, 133,unknown and 134,others are to be replaced with the group of 131 (casc0004).
--Edit--
I tried the lead function in Pig, but it can only compare a fixed number n of rows at a time, so it cannot solve this completely.
Here is another approach that works, but I am looking for a more optimized one:
Cogroup for the same data set (like Dataset and Dataset_self)
-Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
-Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then ( id, group) else (id_self,group)
-Foreach (group id){
ordered = order by id,diff,group;
limited = ordered limit 1;
generate limited ;
}
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to look back into the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, since you have lead or lag functions readily available that could be used to do exactly what you need.
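Once all records go through a single ordered pass (the single-reducer case described above), the logic itself is a simple carry-forward. Here is a minimal Python sketch of that pass, not a Pig solution; the function name fill_from_previous is made up for this example.

```python
PLACEHOLDERS = {"unknown", "others"}

def fill_from_previous(records):
    """Replace placeholder group names with the last real group name seen.

    records: iterable of (id, groupName) pairs. A placeholder that has no
    earlier real value is filled with None.
    """
    filled = []
    last_known = None
    for rec_id, group in sorted(records, key=lambda rec: rec[0]):
        if group in PLACEHOLDERS:
            group = last_known      # carry the previous real value forward
        else:
            last_known = group      # remember the latest real value
        filled.append((rec_id, group))
    return filled

rows = [(125, "sale0001"), (126, "unknown"), (127, "nave9876"),
        (131, "casc0004"), (132, "unknown"), (133, "unknown"), (134, "others")]
for rec_id, group in fill_from_previous(rows):
    print(rec_id, group, sep=",")
```

The pass depends on seeing the rows in id order, which is exactly the property that is lost when the input is split across nodes.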

R: Which heatmap/image to get row-sorted plot without any dendrogram?

Which package is best for a heatmap/image that sorts on rows only and doesn't show any dendrogram or other visual clutter (just a 2D colored grid with automatic named labels on both axes)? I don't need fancy clustering beyond basic numeric sorting. The data is a 39x10 table of numerics in the range (0,0.21) which I want to visualize.
I searched SO (see this) and the R sites, and tried a few out. Check out R Graphical Manual to see an excellent searchable list of screenshots and corresponding packages.
The range of packages is confusing - which one is the preferred heatmap (like ggplot2 is for most other plotting)? Here is what I found out so far:
base::image - bad, no name labels on axes, no sorting/clustering
base::heatmap - options are far less intelligible than the following:
pheatmap::pheatmap - fantastic but can't seem to turn off the dendrograms? (any hacks?)
ggplot2 people use geom_tile, as Andrie points out
gplots::heatmap.2 , ref - seems to be favored by biotech people, but way overkill for my purposes. (no relation to ggplot* or Prof Wickham)
plotrix::color2D.matplot also exists
base::heatmap is annoying, even with args heatmap(..., Colv=NA, keep.dendro=FALSE) it still plots the unwanted dendrogram on rows.
For now I'm going with pheatmap(..., cluster_cols=FALSE, cluster_rows=FALSE) and manually presorting my table, like this guy: Order of rows in heatmap?
Addendum: to display the value inside each cell, see: display a matrix, including the values, as a heatmap . I didn't need that but it's nice-to-have.
With pheatmap you can use options treeheight_row and treeheight_col and set these to 0.
Just another option you have not mentioned: the bipartite package, which is as simple as you describe:
library(bipartite)
mat<-matrix(c(1,2,3,1,2,3,1,2,3),byrow=TRUE,nrow=3)
rownames(mat)<-c("a","b","c")
colnames(mat)<-c("a","b","c")
visweb(mat,type="nested")

data structure to support lookup based on full key or part of key

I need to be able to look up entries based on the full key or on part of the key.
For example, I might store keys like (10,20,30,40), (11,12,30,40) and (12,20,30,40).
I want to be able to search for 10,20,30,40 or for 20,30,40.
What is the best data structure for achieving this, where best means fastest lookup time?
Our programming language is Java. Any pointers to open source projects would be appreciated.
Thanks in advance.
If those were the actual numbers I'd be working with, I'd use an array where a given index contains an array of all records that contain the index. If the actual numbers were larger, I'd use a hash table employed the same way.
So the structure would look like (empty indexes elided, in the case of the array implementation):
10 => ((10,20,30,40)),
11 => ((11,12,30,40)),
12 => ((11,12,30,40), (12,20,30,40)),
20 => ((10,20,30,40), (12,20,30,40)),
30 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
40 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
It's not clear to me whether your searches are inclusive (OR-based) or exclusive (AND-based), but either way you look up the record groups for each element of the search set; for the inclusive search you find their union, and for the exclusive search you find their intersection.
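The structure above is an inverted index from key element to records. A minimal sketch in Python follows (the same idea maps onto a Java HashMap of sets); the function names build_index and search and the mode argument are assumptions for illustration.

```python
from collections import defaultdict

def build_index(records):
    """Map each key element to the set of records that contain it."""
    index = defaultdict(set)
    for record in records:
        for element in record:
            index[element].add(record)
    return index

def search(index, elements, mode="and"):
    """AND-based search intersects the record groups, OR-based search unions them."""
    groups = [index.get(element, set()) for element in elements]
    if not groups:
        return set()
    if mode == "and":
        return set.intersection(*groups)
    return set.union(*groups)

records = [(10, 20, 30, 40), (11, 12, 30, 40), (12, 20, 30, 40)]
index = build_index(records)
print(sorted(search(index, (20, 30, 40))))        # records containing 20, 30 and 40
print(sorted(search(index, (10, 11), mode="or"))) # records containing 10 or 11
```

This version ignores the position of an element inside the key. If position matters, indexing on (position, element) pairs instead of bare elements would be one way to extend it.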
Since you seem to care about retrieval time over other concerns (such as space), I suggest you use a hash table and enter each item several times, once per subkey. So you'd call put("10,20,30,40", mydata), then put("20,30,40", mydata), and so on (of course this would be wrapped in a method; you're not going to call put manually that many times).
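The "once per subkey" idea can be sketched as follows, assuming that a partial key is always a trailing part of the full key (as in the 20,30,40 example). The helper name put_with_suffixes is hypothetical. Because several full keys can share a suffix, each table entry holds a list of values rather than a single one.

```python
def put_with_suffixes(table, key, value):
    """Store value under the full key and under every trailing part of it."""
    for start in range(len(key)):
        table.setdefault(key[start:], []).append(value)

table = {}
put_with_suffixes(table, (10, 20, 30, 40), "record A")
put_with_suffixes(table, (11, 12, 30, 40), "record B")
put_with_suffixes(table, (12, 20, 30, 40), "record C")

print(table[(10, 20, 30, 40)])   # full-key lookup
print(table[(20, 30, 40)])       # partial-key lookup
```

Each lookup is a single hash access, at the cost of storing one entry per suffix of every key.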
Use a tree structure. Here is an open source project that might help ... written in Java :-)
http://suggesttree.sourceforge.net/
