I'm developing a fairly large (for me) Ruby script for engineering calculations. The script creates a few objects that are interconnected in a hierarchical fashion.
For example, one object (Inp) contains the input parameters for a set of simulations. Other objects (SimA, SimB, SimC) are used to actually perform the simulations, and each of them may generate one or more output objects (OutA, OutB, OutC) that contain the results and produce the actual files used for visualization or analysis by other objects, and so on.
The first time I perform and complete all the simulations, all the objects will be fully defined and I will have a series of files that represent the outputs for the user.
Now suppose that the user needs to change one of the attributes of Inp. Depending on which attribute has been modified, some simulations will have to be re-run and some OutX objects will be rendered invalid; otherwise consistency would be lost, as the outputs would no longer correspond to the inputs.
I would like to know whether there is a design pattern that would facilitate this process. I was also wondering whether some sort of graph could be used to visually represent the various dependencies between objects in a clear way.
From what I have been reading (this question is a year old), I think that Ruby's Observable module could be used for this purpose. Every time a parent object changes, it would send a message to its children so that they can update their state.
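Something like the minimal sketch below is what I have in mind (the class names come from my description above, and the attribute names are only placeholders):

```ruby
require 'observer'

class Inp
  include Observable

  def set_param(name, value)
    @params ||= {}
    @params[name] = value
    changed                      # mark this object as modified
    notify_observers(name)       # tell dependent objects which attribute changed
  end
end

class SimA
  def initialize(inp)
    inp.add_observer(self)
    @stale = false
  end

  # Called by notify_observers on every registered observer
  def update(changed_param)
    # Only some attributes invalidate this simulation (names are made up)
    @stale = true if [:mesh_size, :load_case].include?(changed_param)
  end

  def stale?
    @stale
  end
end
```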
Is this the recommended approach?
I hope this makes the question a bit clearer.
I'm not sure that I fully understand your question, but the problem of stages that depend on the results of previous stages, which in turn depend on the results of stages before them, where every one of those stages can fail or take an arbitrary amount of time, is as old as programming itself and has been solved a number of times.
Tools which do this are typically called "build tools", because this is a problem that often occurs when building complex software systems, but they are in no way limited to building software. A more fitting term would be "dependency-oriented programming". Examples include make, ant, or Ruby's own rake.
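For example, rake's file tasks re-run a step only when one of its prerequisites is newer than its output, which is essentially the invalidation behaviour you describe. A minimal sketch (the file and script names are made up):

```ruby
# Rakefile
file 'sim_a_results.dat' => ['input_params.yml'] do
  sh 'ruby run_sim_a.rb input_params.yml sim_a_results.dat'
end

file 'report.pdf' => ['sim_a_results.dat'] do
  sh 'ruby make_report.rb sim_a_results.dat report.pdf'
end

task default: 'report.pdf'
```

Running `rake` after editing input_params.yml rebuilds sim_a_results.dat and then report.pdf; if nothing has changed, nothing is re-run.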
Related
In OpenMDAO, is there any way to get analytics about the execution of the nonlinear solvers within a coupled model (containing multiple cycles and subcycles), such as the number of iterations within each of the cycles and the execution time?
Though there is no specific functionality to get this exact data, you should be able to get the information you need from case recorder data, which includes iteration counts and timestamps. So you'd have to do a bit of analysis on the first/last case of a specific run of a solver to compute the run times. Iteration counts should be very straightforward.
This question seems closely related to another one, recently posted, which did identify a bug in OpenMDAO (Issue #2453). Until that bug is fixed, you'll need to use the case names to separate out which cases belong to which cycles, since you can currently only add recorders to the components/groups and not to the nested solvers themselves. But the naming of the cases should still allow you to pull out the data you need.
The provided CSV dataset's columns look like the following:
DATE | Hardware Identifier | What Failed | Description of Failure | Action Taken
The complete data can be easily downloaded from the Dropbox service using this link: data.csv
The data is very systematic; the input is consistent and nicely structured. It comes from a Computer Failure Data Repository. Additional details can be found via this link at USENIX: PNNL
About the data:
There are a little over 2,800 entries of single failure events that were collected over 4 years. Each event is described by the exact date and time when the event took place, which node in the system failed, and which hardware component of that node failed.
About the system:
The system consists of 980 nodes performing heavy calculations for the Molecular Science Computing Facility. Each node is designated by its own unique ID.
My question:
Is it possible to apply any meaningful machine learning technique to such a dataset that would, in the end, be capable of predicting future failures in the system?
For example, would it be possible to train the ML algorithm on the provided dataset in order to predict either:
What node might fail soon (based on Hardware Identifier field)
What (node-piece of hardware) combination might fail soon (based on Hardware Identifier and either What failed or Description of Failure field)
What kind of failure might occur next anywhere in the system (based on What Failed field)
To me, this sounds like a huge classification problem. For example, in the case of (node, piece of hardware that failed), there are several thousand different possibilities (classes). Bearing in mind that there are only a little over 2,800 single failure events described in the table, I don't feel like this would work.
Also, I am confused about how I should feed the data into the algorithm. Should the only input to the algorithm be the DATE field (converted to a numeric, linearly growing time)? That doesn't seem right. Is it possible to somehow feed the algorithm the time variable combined with some history of recent failure events? Should I restructure the data to feed the algorithm the time variable plus a failure history (perhaps limited to, for example, the last 30 days, or perhaps the whole failure history of the system)?
May I hear your opinion? Is it possible to train an algorithm on this dataset that could predict any of the above-mentioned failure events (e.g. what node will fail next) given some input describing the state of the system (I can only think of time as an input for now, but that sounds wrong)?
Since I am just starting to get involved with the ML algorithms, my thinking on the topic is probably very narrow and limited, so please feel free to suggest if you feel I should take a completely different approach on this.
Before we go on, remember that these failures are generally considered fairly random, so any results you get will likely be fairly unreliable.
The main problem to consider is that you have very little data compared to the number of nodes, slightly less than 3 failures per node on average, which means that you have to use incredibly simple models, which would not give you much advantage over a random guess, for you to even have any certainty in your variables (a separate mean time between failures per node would not have a determinable error, if it is even calculable). For this I would probably treat each node as a separate test point and then train a tree-based algorithm to try to predict when the last failure in the node's sequence of failures is, but that also means it would only be applicable to a subset of the database. This might be able to vaguely predict whether the node will fail in the near future and what type of failure it would most likely be, but it will likely be fairly close to the estimate of the mean time to failure and the most common failure across all nodes.
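As a rough illustration of the restructuring step only (the column names are taken from the question, the labelling rule is my own assumption, and no actual model fitting is shown):

```ruby
require 'csv'
require 'date'

# Build one row per failure event, per node: days since that node's previous
# failure, how many failures the node has had so far, and whether this is the
# last failure recorded for the node (a crude stand-in for the target).
rows   = CSV.read('data.csv', headers: true)
events = rows.map { |r| { node: r['Hardware Identifier'], date: Date.parse(r['DATE']), what: r['What Failed'] } }

samples = events.group_by { |e| e[:node] }.flat_map do |node, evs|
  evs = evs.sort_by { |e| e[:date] }
  evs.each_with_index.map do |e, i|
    {
      node:            node,
      failures_so_far: i,
      days_since_last: i.zero? ? nil : (e[:date] - evs[i - 1][:date]).to_i,
      what_failed:     e[:what],
      is_last_failure: (i == evs.size - 1)   # the label suggested above
    }
  end
end
```

These rows could then be fed to whatever tree-based learner you prefer.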
If you want some meaningful results, you will need some attributes of the nodes to do the machine learning on, such as the hardware components and when they were installed, and then use those as inputs to the classification. Since the problem will likely behave fairly randomly, you would get more information from solving the regression problem instead of the classification problem, since you can still get good precision from a probabilistic model even though the classification itself would be highly uncertain.
I'm new to behavior-driven development and I can't find any examples or guidelines that parallel my current problem.
My current project involves a massive 3D grid with an arbitrary number of slots in each of the discrete cells. The entities stored in these slots have their own slots and, thus, an arbitrary nesting of entities can exist. The final implementation of the object(s) used will need to be backed by some kind of persistent data store, which complicates the API a bit (i.e. using words like load/store instead of get/set and making sure that modifying returned items doesn't modify the corresponding items in the data store itself). Don't worry, my first implementation will simply exist in-memory, but the API is what I'm supposed to be defining behavior against, so the actual implementation doesn't matter right now.
The thing I'm stuck on is the fact that BDD literature focuses on the interactions between objects and how mock objects can help with that. That doesn't seem to apply at all here. My abstract data store's only real "behavior" involves loading and storing data from entities outside those represented by the programming language itself; I can't define or test those behaviors since they're implementation-dependent.
So what can I define/test? The natural alternative is state. Store something. Make sure it loads. Modify the thing I loaded and make sure that, after reloading, it comes back unmodified. Etc. But I'm under the impression that this is a common pitfall for new BDD developers, so I'm wondering if there's a better way that avoids it.
If I do take the state-testing route, a couple other questions arise. Obviously I can test an empty grid first, then an empty entity at one location, but what next? Two entities in different locations? Two entities in the same location? Nested entities? How deep should I test the nesting? Do I test the Cartesian product of these non-exclusive cases, i.e. two entities in the same location AND with nested entities each? The list goes on forever and I wouldn't know where to stop.
The difference between TDD and BDD is about language. Specifically, BDD focuses on describing function/object/system behavior in order to improve design and test readability.
Often when we think about behavior we think in terms of object interaction and collaboration and therefore need mocks to unit test. However, there is nothing wrong with an object whose behavior is to modify the state of a grid, if that is appropriate. State or mock based testing can be used in TDD/BDD alike.
However, for testing complex data structures, you should use matchers (e.g. Hamcrest in Java) to test only the part of the state you are interested in. You should also consider whether you can decompose the complex data into objects that collaborate (but only if that makes sense from an algorithmic/design standpoint).
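For instance, in a Ruby BDD framework such as RSpec (the Grid API below is invented purely for illustration), a matcher lets the example assert only the fields that matter to the behaviour under test:

```ruby
require 'rspec'

RSpec.describe 'storing an entity in the grid' do
  it 'loads the entity back with the fields we stored' do
    grid = Grid.new                                  # hypothetical in-memory implementation
    grid.store([1, 2, 3], name: 'pump', slots: [])

    # Assert only the part of the state this example cares about,
    # not the entire returned structure.
    expect(grid.load([1, 2, 3])).to include(name: 'pump')
  end
end
```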
I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
1) For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
2) Let Na = # of unique object names encountered in A, and similarly Nb for the # of objects in B.
3) Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
4) For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B. (A sketch of this step follows the list.)
5) Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
6) Find discrepancies between the objects in A and B based on where the sequences of hash values diverge, for any objects with divergent sequences.
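The snapshot/fingerprint idea in step 4 might look roughly like this in Ruby (the tracing hook is hypothetical; in practice you would call snapshot wherever the instrumented program makes an assignment):

```ruby
require 'digest'

class HashTrace
  def initialize(object_names)
    @names = object_names
    @last  = {}        # last observed fingerprint per object (LOCF)
    @rows  = []        # one row per trigger, i.e. per assignment
  end

  # bindings: a Hash of object name => current value at this trigger
  def snapshot(bindings)
    bindings.each { |name, value| @last[name] = fingerprint(value) }
    # Objects not yet observed stay nil, which plays the role of the NULL entry
    @rows << @names.map { |name| @last[name] }
  end

  attr_reader :rows

  private

  def fingerprint(value)
    # Marshal gives a byte representation for most plain Ruby objects;
    # anything non-marshalable would need a custom serializer.
    Digest::SHA256.hexdigest(Marshal.dump(value))
  end
end
```

The rows recorded for program A and program B can then be aligned and compared, as in steps 5 and 6.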
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, numerical precision may cause hash values to diverge even when the underlying discrepancies are roughly at the machine-tolerance level and therefore insignificant.
First: What is the name for this type of testing method and these concepts? The answer need not be the method above, but should reflect the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing execution traces; it mentions "code comparison", which is related to my interest, though I'm concerned more with the data (i.e. the objects) than with the actual code that produces them. I've just skimmed it, but will review it more carefully for methodology. More importantly, it suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area, security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and the monitoring of remote systems (e.g. space probes), where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and so on. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code and data agreement, but methods for assessing the fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easy to find, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of dataflow programming appears to be related, and thus debugging of dataflow programs may be helpful. This paper from 1981 gives several useful high-level ideas. Although it's hard to translate these into immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values, in the intermediate processing or in the output (not just changes in execution, as one would examine with control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.
I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "latent variables". Because of them, you can use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb towards promising solutions for this problem from random starts. A rough sketch of the E- and M-steps follows after this list.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
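As a very simplified sketch of steps 2-4 (the model here is a two-component mixture rather than a full HMM, and all names, initial values, and the input format are assumptions): given a matrix `reports`, where `reports[i][s]` is 1 if source s listed candidate event i and 0 otherwise, EM alternates between soft-labelling events and re-estimating the per-source probabilities:

```ruby
def em_true_vs_bogus(reports, iterations: 50)
  n_events  = reports.size
  n_sources = reports.first.size

  pi = 0.5                              # P(an event is true)
  p  = Array.new(n_sources, 0.8)        # P(source reports event | event is true)
  q  = Array.new(n_sources, 0.2)        # P(source reports event | event is bogus)
  gamma = Array.new(n_events, 0.5)      # posterior P(true) per event

  iterations.times do
    # E-step: responsibility that each event is "true" under the current parameters
    gamma = reports.map do |row|
      lik_true  = pi       * row.each_with_index.reduce(1.0) { |acc, (r, s)| acc * (r == 1 ? p[s] : 1 - p[s]) }
      lik_bogus = (1 - pi) * row.each_with_index.reduce(1.0) { |acc, (r, s)| acc * (r == 1 ? q[s] : 1 - q[s]) }
      lik_true / (lik_true + lik_bogus)
    end

    # M-step: re-estimate the parameters from the soft labels
    pi     = gamma.sum / n_events
    w_true = gamma.sum
    n_sources.times do |s|
      p[s] = gamma.each_with_index.sum { |g, i| g * reports[i][s] } / w_true
      q[s] = gamma.each_with_index.sum { |g, i| (1 - g) * reports[i][s] } / (n_events - w_true)
    end
  end

  { prior: pi, report_if_true: p, report_if_bogus: q, posteriors: gamma }
end
```

Events whose posterior ends up near 1.0 are the ones the reliable sources agree on; running from several random starts, as suggested above, guards against poor local optima.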
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event's details being correct, and adjust the probabilities as you encounter each source (a minimal sketch of that basic idea follows after the list below). There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or to data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but keep a list with both dates and whatever probabilities you think appropriate...)
beware of assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
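A minimal sketch of the probability idea above, assuming each source has an (invented) reliability score and that sources report independently, which is exactly the assumption the first bullet warns about when sources copy each other:

```ruby
# Reliability values and source names are made up for illustration.
SOURCE_RELIABILITY = { 'venue_site' => 0.98, 'lastfm' => 0.60, 'random_blog' => 0.30 }

# Probability that an event is real, given which sources list it:
# 1 minus the probability that every reporting source is wrong.
def confidence(reporting_sources)
  p_all_wrong = reporting_sources.reduce(1.0) do |acc, src|
    acc * (1.0 - SOURCE_RELIABILITY.fetch(src, 0.5))
  end
  1.0 - p_all_wrong
end

confidence(%w[lastfm random_blog])   # => 0.72
```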
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
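For the last point, even a tiny normalise-and-look-up step goes a long way (the venue list and normalisation rules here are made up):

```ruby
KNOWN_VENUES = ['Red Rocks Amphitheatre', 'The Fillmore', 'Royal Albert Hall']

# Strip punctuation and case so near-identical strings compare equal.
def normalise(name)
  name.downcase.gsub(/[^a-z0-9 ]/, '').squeeze(' ').strip
end

def known_venue?(candidate)
  KNOWN_VENUES.any? { |v| normalise(v) == normalise(candidate) }
end

known_venue?('the fillmore!')   # => true
```

A fuzzier comparison (edit distance, token overlap) could replace the exact match once the simple version proves too strict.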