SystemML Decision Tree - "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10" - algorithm

I am trying to run a decision tree on the SystemML standalone version on Windows (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/decision-tree.dml), but I keep receiving the error "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10. THIS NODE IS DECLARED AS LEAF!". It seems the code is not computing any split, although I am able to build a tree via R. Has anyone used this algorithm before and has some tips on how to solve the error?
Thank you

This message generally indicates that a split on the best categorical or scale features would not give any additional gain.
I would recommend that you:
1. Investigate the computed gain (best_cat_gain, best_scale_gain).
2. Double-check that the metadata (num_cat_features, num_scale_features) is correctly recognised.
You could simply put additional print statements into the script to do that. In case the meta data is invalid, you might want to check that the optional input R has the right layout as described in the header of the script.
If this does not help, please share the input arguments, format of input data, etc and we'll have a closer look.

H2O document question for stopping_tolerance, score_each_iteration, score_tree_interval, etc

I have the following questions that still confuse me after reading the H2O documentation. Can someone explain them?
1. For stopping_tolerance = 0.001, let's use AUC as an example, with a current AUC of 0.8. Does that mean the AUC needs to increase to 0.8 + 0.001, or to 0.8*(1+0.1%)?
2. For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". What exactly is the definition of an "iteration": each tree, each grid search, each fold of K-fold cross-validation, or something else?
3. Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I only use one of them to make the grid search repeatable?
4. Is there any difference between putting 'stopping_metric', 'stopping_tolerance', 'stopping_rounds' in H2OGradientBoostingEstimator versus in the search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
1. 0.001 is the same as 0.1%. For AUC, since bigger is better, you will want to see an increase of at least .001 after a specified number of scoring rounds.
2. You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page. So let's stick to answering this question with respect to individual models and not grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow or use my_model.plot() (for the Python API). For GBM and DRF this will be ntrees, but since different algorithms have different aspects that change, the word "iteration" is used because it is more generic.
3. Did you test this out? What did you find when you did? Take a look at the scoring history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval (I would recommend trying to understand these parameters at the individual model level before you use grid search).
4. Yes. In one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
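To make the two readings from the first question concrete, here is a small arithmetic sketch. The function names are made up for illustration and are not part of the H2O API, and this is not H2O's stopping logic; it only shows how far apart the absolute and the relative interpretation are for a current AUC of 0.8.

```python
def absolute_target(current, tolerance):
    """Score needed if the tolerance is read as an absolute increase."""
    return current + tolerance


def relative_target(current, tolerance):
    """Score needed if the tolerance is read as a relative increase."""
    return current * (1 + tolerance)


current_auc = 0.8
tolerance = 0.001  # i.e. 0.1%

print("absolute reading:", round(absolute_target(current_auc, tolerance), 6))
print("relative reading:", round(relative_target(current_auc, tolerance), 6))
```

For scores below 1 the relative reading is always the smaller hurdle, so which one applies matters most when the metric is far from 1.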

How to find ALL optimal local alignments using Smith-Waterman?

If I got this right then it is possible that there is more than one max value in the local alignment matrix. So in order to get all optimal local alignments, instead of only one, I would have to find the location of all these maximum values in the matrix and trace each of them back individually, right?
Example:
XGTCXXGTCX
 |||
AGTCA
XGTCXXGTCX
      |||
     AGTCA
There is no such thing as ALL optimal alignments. There should only be one optimal alignment. I guess there could be multiple paths for the same alignment but they would have the same overall score and it doesn't look like that's the kind of question you are asking.
What the diagram in your post shows is multiple (primer?) hits. In such a case, I run Smith-Waterman once and get the optimal alignment. Then I generate a new alignment in which the subject sequence has been trimmed to include only the downstream sequence. The advantage of this approach is that I don't have to modify any S-W code or dig into the internals of third-party code.
So it would look like this:
Alignment 1
XGTCXXGTCX
 |||
AGTCA
Delete upstream subject sequence:
XGTCXXGTCX => XGTCX
Alignment 2
XGTCX
 |||
AGTCA
The only tricky part is you have to keep track of how many bases have been deleted from the alignment so you can correctly adjust the match coordinates.
I know this post is pretty old by now, but since I found it, other people might also find it while looking for help, and in my opinion the correct answer has not been given yet. So:
Clearly, there can be MULTIPLE optimal local alignments. You've just shown an example of that. Yet there is EXACTLY ONE optimal local alignment SCORE. In the original paper that presented the Smith-Waterman algorithm, Smith and Waterman already indicate how to find the second best alignment, third best alignment, and so on.
Here's a reprint where you can read about it (for your problem, check page 196):
https://pdfs.semanticscholar.org/40c5/441aad96b366996e6af163ca9473a19bb9ad.pdf
So (in contrast to other answers on here), the Smith-Waterman algorithm also gives second best local alignments and so on.
Just look for the second best score in your scoring matrix that is not associated with your best local alignment (in your case there will be several entries with the same best score), do the usual backtracking, and you have solved your problem. :)
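A minimal sketch of the idea from the question (find every cell holding the maximum score and trace each one back), assuming a simple scoring scheme of +1 for a match, -1 for a mismatch and -1 per gap. The function names and the scoring values are my own choices for illustration. Each maximal cell is traced back along one path only, so equally good paths ending in the same cell are not enumerated.

```python
def smith_waterman_all(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, alignments), one alignment per cell holding best_score."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    if best == 0:
        return 0, []
    alignments = [
        _traceback(H, a, b, i, j, match, mismatch, gap)
        for i in range(1, rows)
        for j in range(1, cols)
        if H[i][j] == best
    ]
    return best, alignments


def _traceback(H, a, b, i, j, match, mismatch, gap):
    """Follow one path back from cell (i, j) until a zero cell is reached."""
    end_a, end_b = i, j
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:      # diagonal: (mis)match
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:        # up: gap in b
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                     # left: gap in a
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    # (start_a, end_a, start_b, end_b) are 0-based, end-exclusive coordinates
    return (i, end_a, j, end_b, "".join(reversed(top)), "".join(reversed(bottom)))


best, hits = smith_waterman_all("XGTCXXGTCX", "AGTCA")
print(best)
for hit in hits:
    print(hit)
```

With the sequences from the question this reports the same score for two hits, one starting at subject position 1 and one at position 6.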

Need some help understanding this problem about maximizing graph connectivity

I was wondering if someone could help me understand this problem. I prepared a small diagram because it is much easier to explain it visually.
Diagram: http://img179.imageshack.us/img179/4315/pon.jpg
Problem I am trying to solve:
1. Constructing the dependency graph
Given the connectivity of the graph and a metric that determines how well a node depends on the other, order the dependencies. For instance, I could put in a few rules saying that
node 3 depends on node 4
node 2 depends on node 3
node 3 depends on node 5
But because the final rule is not "valuable" (again based on the same metric), I will not add the rule to my system.
2. Execute the request order
Once I have built a dependency graph, execute the list in an order that maximizes the final connectivity. I am not sure if this is really a problem, but I have a feeling that more than one order might exist, in which case the best order has to be chosen.
First and foremost, I am wondering if I constructed the problem correctly and if I should be aware of any corner cases. Secondly, is there a closely related algorithm that I can look at? Currently, I am thinking of something like Feedback Arc Set or the Secretary Problem but I am a little confused at the moment. Any suggestions?
PS: I am a little confused about the problem myself so please don't flame on me for that. If any clarifications are needed, I will try to update the question.
It looks like you are trying to determine an ordering on requests you send to nodes with dependencies (or "partial ordering" for google) between nodes.
If you google "partial order dependency graph", you get a link to here, which should give you enough information to figure out a good solution.
In general, you want to sort the nodes in such a way that nodes come after their dependencies; AKA topological sort.
I'm a bit confused by your ordering constraints vs. the graphs that you picture: nothing matches up. That said, it sounds like you have soft ordering constraints (A should come before B, but doesn't have to) with costs for violating the constraint. An optimal algorithm for scheduling that is NP-hard, but I bet you could get a pretty good schedule using a DFS biased towards large-weight edges, then deleting all the back edges.
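A rough sketch of that heuristic, under my own assumptions: an edge (u, v) with weight w means "u should come before v", edges are explored heaviest first, and the DFS starts from the nodes with the largest total outgoing weight (that starting rule is my addition, not part of the suggestion above). Edges that point back to a node still on the DFS stack are dropped, and the reverse postorder of what remains is the schedule.

```python
def order_with_soft_constraints(nodes, edges):
    """edges: dict mapping (u, v) -> weight, meaning 'u should come before v'.

    Returns (ordering, dropped_edges).
    """
    out = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        out[u].append((w, v))
    for n in out:
        out[n].sort(key=lambda wv: -wv[0])   # heaviest edges first

    WHITE, GRAY, BLACK = 0, 1, 2             # unvisited, on the stack, finished
    color = {n: WHITE for n in nodes}
    postorder, dropped = [], []

    def visit(u):
        color[u] = GRAY
        for _, v in out[u]:
            if color[v] == GRAY:             # back edge: would create a cycle
                dropped.append((u, v))
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK
        postorder.append(u)

    for n in sorted(nodes, key=lambda n: -sum(w for w, _ in out[n])):
        if color[n] == WHITE:
            visit(n)
    return postorder[::-1], dropped


nodes = ["A", "B", "C"]
edges = {("A", "B"): 5, ("B", "C"): 4, ("C", "A"): 1}
print(order_with_soft_constraints(nodes, edges))
```

This is recursive, so it is only meant for small graphs; it gives no optimality guarantee, it just tends to sacrifice light edges rather than heavy ones.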
If you know in advance the dependencies of each node, you can easily build layers.
It's amusing, but I faced the very same problem when organizing... the compilation of the different modules of my application :)
The idea is simple:
def buildLayers(nodes):
    layers = []
    n = nodes[:]  # copy the list
    while n:
        layer = _buildRec(layers, n)
        if not layer:
            raise RuntimeError('Cyclic Dependency')
        for l in layer:
            n.remove(l)
        layers.append(layer)
    return layers

def _buildRec(layers, nodes):
    """Build the next layer by selecting nodes whose dependencies
    already appear in `layers`
    """
    done = [node for layer in layers for node in layer]  # flatten the layers
    result = []
    for n in nodes:
        # keep n only if every one of its dependencies is already placed
        if all(d in done for d in n.dependencies):
            result.append(n)
    return result
Then you can pop the layers one at a time, and each time you'll be able to send the request to each of the nodes of this layer in parallel.
If you keep a set of the already selected nodes and the dependencies are also represented as a set the check is more efficient. Other implementations would use event propagations to avoid all those nested loops...
Notice that in the worst case this is O(n^3), but I only had some thirty components and they are not THAT related :p
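The set-based variant mentioned above could look roughly like this. It is a sketch under the assumption that the dependencies are given as a dict mapping each node to the set of nodes it depends on (that input shape and the name build_layers are mine); every dependency must itself be a key of the dict, otherwise it is reported as a cycle.

```python
def build_layers(dependencies):
    """dependencies: dict mapping node -> set of nodes it depends on."""
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    placed = set()      # nodes already assigned to a layer
    layers = []
    while remaining:
        # a node is ready once all of its dependencies are already placed
        layer = sorted(n for n, deps in remaining.items() if deps <= placed)
        if not layer:
            raise RuntimeError('Cyclic Dependency')
        for n in layer:
            del remaining[n]
        placed.update(layer)
        layers.append(layer)
    return layers


# Rules from the question: node 3 depends on node 4, node 2 depends on node 3.
print(build_layers({2: {3}, 3: {4}, 4: set(), 5: set()}))
```

The subset test `deps <= placed` replaces the nested membership loops, so each pass over the remaining nodes is proportional to the total size of their dependency sets.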

Divide by Zero Display Values

What is the best way (most intuitive to users) or best practice for displaying the results of a divide by 0 error when doing reporting? Within the report, I capture this error, however, when displaying it on a human readable report; I am not sure how to note this.
An example would be something like Weight / Revenue ratio. For a given terminal, on a given day, there may have been no revenue, but some shipments (which would have weight) may have been shipped.
The current reports that I am looking at, handle this by placing a 0 in the column, however, this could be misleading, as this is not technically true.
Another thought would be to leave it blank; however, it would be unknown to the user why the field was left blank.
I also considered the standard Excel error, #DIV/0!; however, this tends to make the report look cluttered.
I am curious what others have done in the past for this situation.
On ours, we use either a blank space or "NaN" (for Not a Number) and sometimes an asterisk "*" depending on what the end user prefers. (We give them a choice in the planning stage.)
I've used a single "-" in the past, especially when doing Excel work. From a best-practices point of view, I would suggest "0*" with a "*This division has no revenue" note at the bottom.
If clutter is a concern, how about an error symbol instead? If color is available, a red "X" could work. If usually black and white, perhaps an "E". Include a legend in the header and footer to indicate what the symbol symbolizes.
We've got two different policies for that sort of case, depending on the context: either "N/A" or "Error".
The best practice depends on what the divide by zero means in context.
The purpose of any report is communication. To the business side, nothing is communicated by NaN, or #DIV/0. They need to know what's actually happening.
If there's a legitimate reason for the value to be zero, it means the calculated metric is irrelevant. You point out that sometimes, revenue is legitimately zero, and it's reasonable to show something like N/A (which, by the way, should be familiar to just about everyone on the business side - it's a very common abbreviation).
However, if there's no legitimate reason, then it's an error, and should either be shown as such or excluded altogether. In your situation, weight also might be zero, but let's pretend it's not - that a weight of zero means there's an error in the source data. In that situation, your choice is either to drop that item (day, whatever) altogether from the report, or to show it with something that marks it as an error (like "Error").
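As a small sketch of that rule for the Weight / Revenue example: zero revenue is treated as legitimate and shown as "N/A", while a zero (or missing) weight is treated as a source-data error, following the "let's pretend" assumption above. The function name, the two-decimal format and the handling of missing values are my own choices.

```python
def format_weight_per_revenue(weight, revenue, digits=2):
    """Return the text to display in the Weight / Revenue column."""
    if weight is None or revenue is None or weight == 0:
        return "Error"          # assumed rule: zero weight means bad source data
    if revenue == 0:
        return "N/A"            # legitimately no revenue: the ratio is not applicable
    return f"{weight / revenue:.{digits}f}"


print(format_weight_per_revenue(120.0, 48.0))
print(format_weight_per_revenue(120.0, 0))
print(format_weight_per_revenue(0, 48.0))
```

Keeping the decision in one function makes it easy to swap "N/A" for whichever label the report's audience agreed on.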
Other options you might like to consider are
N/A - not applicable
N/R - no result
NRP - no result possible
and similar, assuming that your target audience are not programmers.
You should only have to explain the meaning of N/A to each user 5 or 6 times before they start to remember it.
You could try one of the following:
E/0
Err/0
I like the 2nd one because IMO it actually reads, in only five characters: Error Divide by Zero

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel where you type 3 rows with a certain pattern and drag the column all the way down, and Excel tries to continue the pattern for you.
For example
Type...
test-1
test-2
test-3
Excel will continue it with:
test-4
test-5
test-n...
Same works for some other patterns such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
test-blue-somethingelse
test-yellow-somethingelse
test-red-somethingelse
Now, based on these entries, I want to say that the pattern is:
test-[DYNAMIC]-somethingelse
Continuing the [DYNAMIC] part with other colours is a whole other deal; I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this in a large pool of entries. Assume that you have 10,000 strings with this kind of pattern, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.
As soon as you consider finding the dynamic parts of patterns of the form <const1><dynamic1><const2><dynamic2>... without any other assumptions, you need to find the longest common subsequence (LCS) of the sample strings you have provided. For example, if I have test-123-abc and test-48953-defg, then the constant parts found by the LCS would be test- and - (strictly speaking, a character-level LCS of this pair also picks up the shared digit 3). The dynamic parts are then the gaps between the pieces of the LCS. You could then look up your dynamic part in an appropriate data structure.
The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.
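For two strings, the LCS idea can be sketched as below, using the textbook dynamic-programming table. The function names and the [DYNAMIC] marker format are my own. On the pair from the example above, the character-level LCS is test-3- because the digit 3 happens to occur in both numbers, which is a small taste of the accuracy problems mentioned.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # L[i][j] = length of the LCS of a[i:] and b[j:]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def template(a, b, marker="[DYNAMIC]"):
    """Constant parts come from the LCS; every gap between them becomes a marker."""
    out, prev_i, prev_j = [], -1, -1
    for i, j in lcs_pairs(a, b):
        if i != prev_i + 1 or j != prev_j + 1:   # something was skipped in a or b
            out.append(marker)
        out.append(a[i])
        prev_i, prev_j = i, j
    if prev_i != len(a) - 1 or prev_j != len(b) - 1:
        out.append(marker)                        # trailing dynamic part
    return "".join(out)


a, b = "test-123-abc", "test-48953-defg"
print("".join(a[i] for i, _ in lcs_pairs(a, b)))
print(template(a, b))
```

This is quadratic in the string lengths for one pair; extending it to many strings at once is where the cost described above comes from.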
Finding [dynamic] isn't that big a deal; you can do it with 2 strings: start at the beginning and stop where they first differ, do the same from the end, and voila, you've got your [dynamic].
Something like this (pseudocode, kinda):
String s1 = "asdf-1-jkl";
String s2 = "asdf-2-jkl";
// s1I/s2I: first index at which the strings differ, scanning from the start
int s1I = 0, s2I = 0;
String dyn1, dyn2;
for (; s1I < s1.length() && s2I < s2.length(); s1I++, s2I++)
    if (s1.charAt(s1I) != s2.charAt(s2I))
        break;
// s1E/s2E: exclusive end of the differing part, scanning from the end
// and never moving back past the common prefix
int s1E = s1.length(), s2E = s2.length();
for (; s1E > s1I && s2E > s2I; s1E--, s2E--)
    if (s1.charAt(s1E - 1) != s2.charAt(s2E - 1))
        break;
dyn1 = s1.substring(s1I, s1E);
dyn2 = s2.substring(s2I, s2E);
About your 10k data set: you would need to call this (or maybe a slightly more optimized version) for each combination to figure out your pattern (10k x 10k calls), and then sort the results by pattern (i.e. save the beginning and the ending and sort by these fields).
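A sketch of that pairwise pass, assuming a pattern is identified by its shared beginning and ending. The function names are mine, and pairs that share neither a prefix nor a suffix are simply ignored, which is my own simplification. It is quadratic in the number of strings, exactly as described above.

```python
from collections import defaultdict
from itertools import combinations


def common_affixes(s1, s2):
    """Return the (prefix, suffix) shared by s1 and s2, without overlapping."""
    limit = min(len(s1), len(s2))
    p = 0
    while p < limit and s1[p] == s2[p]:
        p += 1
    s = 0
    while s < limit - p and s1[len(s1) - 1 - s] == s2[len(s2) - 1 - s]:
        s += 1
    return s1[:p], s1[len(s1) - s:]


def group_by_pattern(strings):
    """Map each (prefix, suffix) pattern to the set of strings that exhibit it."""
    groups = defaultdict(set)
    for a, b in combinations(strings, 2):
        prefix, suffix = common_affixes(a, b)
        if prefix or suffix:            # skip pairs with nothing in common
            groups[(prefix, suffix)].update((a, b))
    return groups


samples = [
    "test-blue-somethingelse",
    "test-yellow-somethingelse",
    "test-red-somethingelse",
    "other-1-x",
    "other-2-x",
]
for (prefix, suffix), members in sorted(group_by_pattern(samples).items()):
    print(f"{prefix}[DYNAMIC]{suffix}: {sorted(members)}")
```

On real data, two members of the same group can share a longer prefix by accident (for example two colours starting with the same letter), so the resulting keys may still need merging.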
I think what you need is to compute something like the Levenshtein distance to find the groups of similar strings, and then, within each group, identify the dynamic part with a typical diff-like algorithm.
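A minimal sketch of the first half of that suggestion: the standard two-row Levenshtein computation, plus a naive greedy grouping that compares each string with the first member of every existing group. The grouping rule and the max_distance threshold are my own assumptions, chosen only to keep the example short.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a                     # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def group_similar(strings, max_distance):
    """Greedy grouping: join the first group whose first member is close enough."""
    groups = []
    for s in strings:
        for group in groups:
            if levenshtein(s, group[0]) <= max_distance:
                group.append(s)
                break
        else:
            groups.append([s])
    return groups


print(group_similar(["test-1", "test-2", "test-3", "other-a", "other-b"], max_distance=1))
```

A fixed threshold works poorly when the dynamic parts vary a lot in length, so in practice the distance might need to be normalised by string length.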
Google docs might be better than excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets. For instance, in the example you gave it would recognise blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel, so it would stand a better chance of continuing the pattern.
