ELKI and ARFF files - arff

I compare my results against a reference result base, but I keep getting different results than the result base, even though I use the same data.
I wonder why it is hard to get the same results. Perhaps it is because they invoked ELKI from a Java program, while I do it in the GUI and with ARFF files, which may be problematic and not fully supported at the moment.
My question is: would the result of an ROC AUC curve computed by ELKI vary if I invoked it from a Java program instead of from the GUI, as I do now? I would like to get precise results and to know that I am doing it right.

Results from the MiniGUI are precise.
The GUI is an assistant for building a command line, but that doesn't introduce any imprecision. It may add some performance cost, e.g. when -verbose is used, and the visualization may cause memory problems after the algorithm has finished.
Obviously, the input format (CSV or ARFF) shouldn't have any impact on the outcome, unless you introduce incorrect additional columns, e.g. an id column that should not be used for analysis...

Related

Is it possible to apply a machine learning algorithm to predict failures in large HPC systems based on years of systematic data collection?

The categories (columns) of the provided CSV dataset look like the following:
DATE | Hardware Identifier | What Failed | Description of Failure | Action Taken
The complete data can be easily downloaded from the Dropbox service using this link: data.csv
The data is very systematic; the input is consistent and nicely structured. It comes from a Computer Failure Data Repository; additional details can be found at USENIX under this link: PNNL
About the data:
There are a little over 2800 entries of single failure events that were collected over 4 years. Each event is described by the exact date and time when it took place, which node in the system failed, and which hardware component of that node failed.
About the system:
The system consists of 980 nodes performing heavy calculations for the Molecular Science Computing Facility. Each node is designated by its own unique ID.
My question:
Is it possible to apply any meaningful machine learning technique to such a dataset that would, in the end, be capable of predicting future failures in the system?
For example, would it be possible to train the ML algorithm on the provided dataset in order to predict either:
What node might fail soon (based on Hardware Identifier field)
What (node-piece of hardware) combination might fail soon (based on Hardware Identifier and either What failed or Description of Failure field)
What kind of failure might occur next anywhere in the system (based on What Failed field)
To me, this sounds like a huge classification problem. For example, in the case of (node, piece of hardware that failed), there are several thousand different possibilities (classes). Bearing in mind that there are only a little over 2800 single failure events in the table, I don't feel this would work.
I am also confused about how I should feed the data into the algorithm. Should the only input to the algorithm be the DATE field (converted to a numeric, linearly growing time)? That doesn't seem right. Is it possible to somehow feed the algorithm the time variable combined with some history of recent failure events? Should I restructure the data to feed the algorithm the time variable plus a failure history (limited, for example, to the last 30 days, or the whole failure history of the system)?
May I hear your opinion? Is it possible to train an algorithm on this dataset that could predict any of the above-mentioned failure events (e.g. which node will fail next) given some input about the system (I can only think of time as an input for now, but that sounds wrong)?
Since I am just starting to get involved with the ML algorithms, my thinking on the topic is probably very narrow and limited, so please feel free to suggest if you feel I should take a completely different approach on this.
Before we go on, remember that these failures are generally considered fairly random, so any results you get will likely be fairly unreliable.
The main problem to consider is that you have very little data compared to the number of nodes, slightly fewer than 3 failures per node on average. This means you have to use incredibly simple models, which would not give you much advantage over a random guess, just to have any certainty in your variables (a per-node mean time between failures would not have a determinable error, if it can be calculated at all). I would probably treat each node as a separate data point and then train a tree-based algorithm to try to predict when the last failure in the node's sequence of failures occurs, but that also means it would only be applicable to a subset of the database. This might be able to vaguely predict whether a node will fail in the near future and what type of failure it would most likely be, but the prediction would likely be fairly close to the estimate of the mean time to failure and the most common failure across all nodes.
If you want meaningful results, you will need some attributes of the nodes that you can do the machine learning on, such as their hardware components and when they were installed, and then use those as input to the classification. Since the problem will likely behave fairly randomly, you would get more information from solving the regression problem instead of the classification problem: you can still get good precision from a probabilistic model even though the classification itself would be highly uncertain.
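To make the "restructure the data" and "regression instead of classification" ideas concrete, here is a minimal sketch. It assumes the column names from the question (DATE, Hardware Identifier) and treats "days until this node's next failure" as the regression target; the file name, feature choice, and model are illustrative assumptions, not a recipe.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Column names are taken from the question; adjust them to match the real CSV.
df = pd.read_csv("data.csv", parse_dates=["DATE"]).sort_values("DATE")
df = df.rename(columns={"DATE": "date", "Hardware Identifier": "node"})

rows = []
for node, g in df.groupby("node"):
    times = g["date"].reset_index(drop=True)
    # A node needs at least three failures to yield one (history, target) pair.
    for i in range(1, len(times) - 1):
        rows.append({
            "node": node,
            "days_since_prev": (times[i] - times[i - 1]).days,
            "failures_so_far": i,
            # Regression target: days until this node's next failure.
            "days_to_next": (times[i + 1] - times[i]).days,
        })
feat = pd.DataFrame(rows)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(feat[["days_since_prev", "failures_so_far"]], feat["days_to_next"])
```

Note that this only uses the failure history itself as input; as argued above, node attributes (hardware configuration, installation date) would add far more signal than the timestamps alone.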

Data dependency and consistency

I'm developing a quite large (for me) ruby script for engineering calculations. The script creates a few objects that are interconnected in a hierarchical fashion.
For example one object (Inp) contains the input parameters for a set of simulations. Other objects (SimA, SimB, SimC) are used to actually perform the simulations and each of them may generate one or more output objects (OutA, OutB, OutC) that contain the results and produce the actual files used for the visualization or analysis by other objects and so on.
The first time I perform and complete all the simulations all the objects will be fully defined and I will have a series or files that represent the outputs for the user.
Now suppose that the user needs to change one of the attributes of Inp. Depending on which attribute has been modified, some simulations will have to be re-run and some OutX objects will be rendered invalid; otherwise consistency would be lost, as the outputs would no longer correspond to the inputs.
I would like to know whether there is a design pattern that would facilitate this process. I was also wondering whether some sort of graph could be used to visually represent the various dependencies between objects in a clear way.
From what I have been reading (this question is a year old) I think that the Ruby Observable class could be used for this purpose. Every time a parent object changes, it should send a message to its children so that they can update their state.
Is this the recommended approach?
I hope this makes the question a bit clearer.
I'm not sure that I fully understand your question, but the problem of stages which depend on results of previous stages, which in turn depend on results from previous stages, which themselves depend on results from previous stages, where every one of those stages can fail or take an arbitrary amount of time, is as old as programming itself and has been solved a number of times.
Tools which do this are typically called "build tools", because this is a problem that often occurs when building complex software systems, but they are in no way limited to building software. A more fitting term would be "dependency-oriented programming". Examples include make, ant, or Ruby's own rake.
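To make the "dependency-oriented" idea concrete, here is a minimal sketch, written in Python purely for illustration; the same structure maps directly onto Ruby objects with Observable, or onto rake tasks. Inp, SimA, and OutA are the hypothetical objects from the question.

```python
# Each node knows its dependencies and recomputes only when marked stale.
class Node:
    def __init__(self, name, deps=(), compute=lambda: None):
        self.name, self.deps, self.compute = name, list(deps), compute
        self.stale, self.result = True, None

    def invalidate(self, all_nodes):
        # Mark this node and everything that transitively depends on it as stale.
        self.stale = True
        for n in all_nodes:
            if self in n.deps and not n.stale:
                n.invalidate(all_nodes)

    def value(self):
        if self.stale:
            for d in self.deps:
                d.value()            # bring the inputs up to date first
            self.result = self.compute()
            self.stale = False
        return self.result

inp   = Node("Inp",   compute=lambda: {"mesh_size": 10})
sim_a = Node("SimA",  deps=[inp],   compute=lambda: inp.value()["mesh_size"] * 2)
out_a = Node("OutA",  deps=[sim_a], compute=lambda: f"result={sim_a.value()}")
nodes = [inp, sim_a, out_a]

print(out_a.value())    # first run: everything is computed
inp.invalidate(nodes)   # the user edits Inp -> SimA and OutA become stale
print(out_a.value())    # second run: only the affected chain is recomputed
```

The invalidation step is exactly what the Observer/Observable approach gives you; the recompute-on-demand step is what build tools like rake give you.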

My Algorithm only fails for large values - How do I debug this?

I'm working on transcribing as3delaunay to Objective-C. For the most part, the entire algorithm works and creates graphs exactly as they should be. However, for large values (thousands of points), the algorithm mostly works, but creates some incorrect graphs.
I've been going back through and checking the most obvious places for error, and I haven't been able to find anything. For smaller values I ran the original algorithm and saved its output to JSON files. I then read that output into my own tests (tests with 3 or 4 points only) and debugged until the output matched; I checked the output of the two algorithms line for line and found the discrepancies. But I can't feasibly do that for 1000 points.
Answers don't need to be specific to my situation (although suggesting tools I can use would be excellent).
How can I debug algorithms that only fail for large values?
If you are transcribing an existing algorithm to Objective-C, do you have a working original in some other language? In that case, I would be inclined to put in print statements in both versions and debug the first discrepancy (the first, because later discrepancies could be knock-on errors).
I think it is very likely that the program also makes mistakes for smaller graphs, but more rarely. My first step would in fact be to use the working original (or some other means) to run a large number of automatically checked test runs on small graphs, hoping to find the bug on some more manageable input size.
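As an illustration of such automatically checked runs, here is a hedged sketch of a differential-testing harness; ./voronoi_as3 and ./voronoi_objc are hypothetical command-line wrappers around the two implementations that read a JSON list of points on stdin and print their graph as JSON on stdout.

```python
import json
import random
import subprocess

def run(cmd, points):
    # Feed the points to one implementation and parse its JSON output.
    out = subprocess.run(cmd, input=json.dumps(points), capture_output=True,
                         text=True, check=True).stdout
    return json.loads(out)

random.seed(1)
for trial in range(1000):
    n = random.randint(3, 20)                      # keep inputs small and checkable
    points = [[random.uniform(0, 100), random.uniform(0, 100)] for _ in range(n)]
    a = run(["./voronoi_as3"], points)
    b = run(["./voronoi_objc"], points)
    if a != b:
        print(f"trial {trial}: first discrepancy with {n} points")
        print(json.dumps(points))                  # save the failing case for replay
        break
```

If a small failing case turns up this way, it can be debugged line for line just as you already did for 3 or 4 points.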
Find the threshold
If it works for 3 or 4 items, but not for 1000, then there's probably some threshold in between. Use a binary search to find that threshold.
The threshold itself may be a clue. For example, maybe it corresponds to a magic value in the algorithm or to some other value you wouldn't expect to be correlated. For example, perhaps it's a problem when the number of items exceeds the number of pixels in the x direction of the chart you're trying to draw. The clue might be enough to help you solve the problem. If not, it may give you a clue as to how to force the problem to happen with a smaller value (e.g., debug it with a very narrow chart area).
The threshold may be smaller than you think, and may be directly debuggable.
If the threshold is a big value, like 1000, perhaps you can set a conditional breakpoint to skip right to iteration 999 and then single-step from there.
There may not be a definite threshold, which suggests that it's not the magnitude of the input size, but some other property you should be looking at (e.g., powers of 10 don't work, but everything else does).
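A minimal sketch of that threshold search, assuming a hypothetical fails(n) helper that runs the algorithm on n random points and checks the result, and assuming failures are roughly monotone in n (which, as noted above, may not hold):

```python
def find_threshold(fails, lo=4, hi=1000):
    """Binary-search for the smallest n where fails(n) is True."""
    if not fails(hi):
        return None                    # cannot reproduce even at the large size
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid                   # failure reproduces: threshold is at or below mid
        else:
            lo = mid + 1               # works here: threshold is above mid
    return lo

# threshold = find_threshold(lambda n: run_and_check(random_points(n)))
```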
Decompose the problem and write unit tests
This can be tedious but is often extremely valuable--not just for the current issue, but for the future. Convince yourself that each individual piece works in isolation.
Re-visit recent changes
If it used to work and now it doesn't, look at the most recent changes first. Source control tools are very useful in helping you remember what has changed recently.
Remove code and add it back piece by piece
Comment out as much code as you can and still get some kind of reasonable output (even if that output doesn't meet all the requirements). For example, instead of using a complicated rounding function, just truncate values. Comment out code that adds decorative touches. Put assert(false) in any special case handlers you don't think should be activated for the test data.
Now verify that output, and slowly add back the functionality you removed, one baby step at a time. Test thoroughly at each step.
Profile the code
Profiling is usually for optimization, but it can sometimes give you insight into code, especially when the data size is too large for single-stepping through the debugger. I like to use line or statement counts. Is the loop body executing the number of times you expect? Or twice as often? Or not at all? How about the then and else clauses of those if statements? Logic bugs often become very obvious with this type of profiling.
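As one way to collect such counts, here is a rough sketch using a Python trace hook (the standard library's `python -m trace --count` does something similar); suspect_loop is just a stand-in for whatever loop you are suspicious of.

```python
import sys
from collections import Counter

line_counts = Counter()

def tracer(frame, event, arg):
    # Count every executed line, keyed by (function name, line number).
    if event == "line":
        line_counts[(frame.f_code.co_name, frame.f_lineno)] += 1
    return tracer

def suspect_loop(points):
    total = 0
    for p in points:
        if p % 10 == 0:          # does this branch run as often as you expect?
            total += p
    return total

sys.settrace(tracer)
suspect_loop(range(1000))
sys.settrace(None)

for (func, lineno), n in sorted(line_counts.items()):
    print(f"{func}:{lineno} executed {n} times")
```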

Tools for Feature Extraction from Binary Data of Images

I am working on a project where I have image files that have been malformed (fuzzed, i.e. their image data has been altered). When rendered on various platforms, these files lead to a warning/crash/pass report from the platform.
I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue what feature set/patterns I can identify from it, because visually these images could be anything. (I need to be able to derive a feature set from the binary data.)
I need some advice on the tools/methods I could use for automatic feature extraction from this binary data, i.e. feature sets I can use with unsupervised learning algorithms such as Kohonen's SOM.
I am new to this, any help would be great!
I do not think this is feasible.
The problem is that these are old exploits, and training on them will not tell you much about future exploits, because this is an extremely unbalanced problem: no exploit uses the same mechanism as another. So even if you generate multiple files of the same type, you will in the end likely have only a single relevant training case for each exploit.
Nevertheless, what you need to do is extract features from the file metadata. That is where the exploits are, not in the actual image. As such, parsing the files is itself the problematic area, and your detection tool may become vulnerable to exactly such an exploit.
Since the image data may be compressed, a naive binary feature approach will not work either.
You probably don't want to look at the actual pixel data at all, since the corruption almost certainly lies in the file header with its different "chunks" (this example is for PNG; other formats differ in the details but work in the same way):
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
It should be straightforward to choose features: write a program that reads all the header information from the file, notes whether any of it is missing, and uses this information as features. That will still be much smaller than the unnecessary raw image data.
Oh, and always start out with simpler algorithms, like PCA together with k-means, and only bring out the big guns if they fail.
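As a rough sketch of what that could look like, the snippet below pulls header-level features out of PNG chunks and then runs PCA plus k-means; the images/ directory, the particular chunk list, and the feature choices are illustrative assumptions only.

```python
import struct
from pathlib import Path
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def chunk_features(path, known=("IHDR", "PLTE", "IDAT", "IEND", "tEXt", "tIME", "gAMA")):
    """Crude header-level features: count and total size per known chunk type,
    plus a bad-signature flag and a count of unknown chunk types."""
    data = Path(path).read_bytes()
    counts = {k: 0 for k in known}
    sizes = {k: 0 for k in known}
    unknown = 0
    bad_sig = 0 if data.startswith(PNG_SIG) else 1
    pos = len(PNG_SIG)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        name = ctype.decode("latin-1", errors="replace")
        if name in counts:
            counts[name] += 1
            sizes[name] += length
        else:
            unknown += 1
        pos += 8 + length + 4          # chunk header + payload + CRC
        if name == "IEND":
            break
    return [bad_sig, unknown] + [counts[k] for k in known] + [sizes[k] for k in known]

files = sorted(Path("images").glob("*.png"))         # hypothetical sample directory
X = np.array([chunk_features(f) for f in files], dtype=float)
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)    # standardize before PCA
Z = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(Z)
```

Keep in mind the caveat above: the feature extractor itself parses untrusted files, so it should be written defensively.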

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "latent variables". You can then use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb from random starts to promising solutions for this problem.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
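A minimal sketch of steps 1-4, treating the problem as a two-class latent-variable model fitted with EM; the toy data matrix, the random starts, and the smoothing constants are illustrative assumptions.

```python
import numpy as np

def em_true_vs_bogus(X, n_iter=100):
    """
    X: (n_events, n_sources) 0/1 matrix; X[i, s] = 1 if source s lists event i.
    Latent z_i in {true, bogus}. Each source s has
      p[s] = P(source reports event | event is true)
      q[s] = P(source reports event | event is bogus)
    pi = prior probability that a candidate event is true.
    Returns the per-event posterior P(true | reports) and the fitted parameters.
    """
    n_events, n_sources = X.shape
    rng = np.random.default_rng(0)
    pi = 0.5
    p = rng.uniform(0.6, 0.9, n_sources)   # random start: true events usually reported
    q = rng.uniform(0.1, 0.4, n_sources)   # bogus events sometimes reported
    for _ in range(n_iter):
        # E-step: posterior responsibility that each event is true.
        log_true = np.log(pi) + X @ np.log(p) + (1 - X) @ np.log(1 - p)
        log_bogus = np.log(1 - pi) + X @ np.log(q) + (1 - X) @ np.log(1 - q)
        m = np.maximum(log_true, log_bogus)
        r = np.exp(log_true - m) / (np.exp(log_true - m) + np.exp(log_bogus - m))
        # M-step: re-estimate the parameters from the soft labels (with smoothing).
        pi = r.mean()
        p = (r @ X + 1e-6) / (r.sum() + 2e-6)
        q = ((1 - r) @ X + 1e-6) / ((1 - r).sum() + 2e-6)
    return r, pi, p, q

# Hypothetical usage: rows = candidate events after matching, columns = listing sources.
X = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
posterior, pi, p, q = em_true_vs_bogus(X)
print(np.round(posterior, 2))
```

Note that EM only finds a local optimum and the "true"/"bogus" labels can swap between runs, so restart from several random seeds as suggested above.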
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data, or to data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August" and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but keep a list with both dates and whatever probabilities you think appropriate...)
beware of assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
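As one very simple way to maintain such a probability, here is a hedged sketch that treats each source as an independent noisy witness with a reliability weight and updates in log-odds space (a naive-Bayes-style combination; the numbers are made up for illustration).

```python
import math

def combine(prior, source_reports):
    """
    prior: initial probability that the event detail is correct.
    source_reports: list of (confirms: bool, reliability: float in (0.5, 1)).
    Each source shifts the log-odds by its likelihood ratio.
    """
    logit = math.log(prior / (1 - prior))
    for confirms, rel in source_reports:
        lr = rel / (1 - rel)                 # likelihood ratio for this source
        logit += math.log(lr) if confirms else -math.log(lr)
    return 1 / (1 + math.exp(-logit))

# e.g. one very reliable listing, one shaky user submission, one dissenting source
print(combine(0.5, [(True, 0.95), (True, 0.6), (False, 0.7)]))
```

The independence assumption is exactly what the first bullet above warns about, so sources suspected of copying each other should share a single reliability weight rather than count twice.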
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
