OpenMDAO - information on cycles - time

In OpenMDAO, is there any way to get the analytics about the execution of the nonlinear solvers within a coupled model (containing multiple cycles and subcycles), such as the number of iterations within each of the cycles, and execution time?

Though there is no specific functionality to get this exact data, you should be able to get the information you need from the case recorder data, which includes iteration counts and timestamps. You would have to do a bit of analysis on the first and last cases of a specific solver run to compute the run times; the iteration counts themselves should be very straightforward.
This question seems closely related to another recently posted one, which identified a bug in OpenMDAO (Issue #2453). Until that bug is fixed, you'll need to use the case names to separate out which cases belong to which cycles, since you can currently only add recorders to components and groups, not to the nested solvers themselves. The naming of the cases should still allow you to pull out the data you need.
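As a rough illustration, here is a minimal post-processing sketch along those lines, assuming an SqliteRecorder was attached to the model and wrote a file named cases.sql (both the file name and the grouping-by-case-name heuristic are assumptions, not a fixed OpenMDAO recipe):

```python
import openmdao.api as om

# Assumes a recorder was attached when the problem was set up, e.g.
#   prob.model.add_recorder(om.SqliteRecorder('cases.sql'))
# The file name 'cases.sql' is just an example.
cr = om.CaseReader('cases.sql')

# Group cases by the solver run they came from, using the iteration
# coordinate embedded in each case name, e.g.
# 'rank0:Driver|0|root._solve_nonlinear|0|cycle._solve_nonlinear|0|NonlinearBlockGS|3'
runs = {}
for name in cr.list_cases():
    case = cr.get_case(name)
    solver_run = name.rsplit('|', 1)[0]   # strip the trailing iteration index
    runs.setdefault(solver_run, []).append(case.timestamp)

for solver_run, stamps in runs.items():
    print(solver_run)
    print('  iterations :', len(stamps))
    print('  elapsed (s): %.3f' % (max(stamps) - min(stamps)))
```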

Related

Is it possible to apply a Machine Learning algorithm to predict failures in large HPC systems based on years of systematic data collection?

The columns of the provided CSV dataset look like the following:
DATE | Hardware Identifier | What Failed | Description of Failure | Action Taken
The complete data can be easily downloaded from the Dropbox service using this link: data.csv
The data is very systematic, the input is very consistent and nicely structured. This data comes from a Computer Failure Data Repository. Additional details can be found on this link at USENIX: PNNL
About the data:
There are a little over 2,800 entries of single failure events that were collected over 4 years. Each event is described by the exact date and time when it took place, which node in the system failed, and which hardware component of that node failed.
About the system:
The system consists of 980 nodes performing heavy calculations for the Molecular Science Computing Facility. Each node is designated by its own unique ID.
My question:
Is it possible to apply any meaningful Machine Learning technique to such a dataset that would, in the end, be capable of predicting future failures in the system?
For example, would it be possible to train the ML algorithm on the provided dataset in order to predict either:
What node might fail soon (based on Hardware Identifier field)
What (node-piece of hardware) combination might fail soon (based on Hardware Identifier and either What failed or Description of Failure field)
What kind of failure might occur next anywhere in the system (based on What Failed field)
To me, this sounds like a huge classification problem. For example, in the case of (node, piece of hardware that failed), there are several thousand different possibilities (classes). Keeping in mind that there are only a little over 2,800 single failure events described in the table, I don't feel like this would work.
Also, I am confused about how I should feed the data into the algorithm. Should the only input to the algorithm be the DATE field (converted to a numeric, linearly growing time)? That doesn't seem right. Is it possible to somehow feed the algorithm the time variable combined with some history of recent failure events? Should I restructure the data to feed the algorithm the time variable plus a failure history (which might be limited, for example, to the last 30 days, or might cover the whole failure history of the system)?
May I hear your opinion? Is it possible to train an algorithm from this dataset that could predict any of the above-mentioned failure events (e.g., which node will fail next) given some input about the system (I can only think of time as an input for now, but that sounds wrong)?
Since I am just starting to get involved with the ML algorithms, my thinking on the topic is probably very narrow and limited, so please feel free to suggest if you feel I should take a completely different approach on this.
Before we go on, remember that these failures are generally considered fairly random, so any results you get will likely be fairly unreliable.
The main problem to consider is that you have very little data compared to the number of nodes, slightly less than 3 failures per node on average. This means you have to use some incredibly simple models, which would not give you much advantage over a random guess, just to have any certainty in your variables (a separate mean time between failures per node would not have a determinable error, if it is even calculable). For this I would probably treat each node as a separate data point and then train a tree-based algorithm to try to predict when the last failure in a node's sequence of failures occurs, but that also means it would only be applicable to a subset of the database. This might be able to vaguely predict whether a node will fail in the near future and what type of failure it would most likely be, but it will likely be fairly close to the estimate of mean time to failure and most common failure across all nodes.
If you want some meaningful results, you will need some attributes of the nodes to do the machine learning on, such as which hardware components they have and when those were installed, and then have that as input to the classification. Since the problem will likely behave fairly randomly, you would get more information from trying to solve the regression problem instead of the classification problem, since you can still get good precision on a probabilistic model even though the classification itself would be highly uncertain.
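As a concrete (and heavily simplified) illustration of that restructuring, the sketch below turns the event log into per-node "time until next failure" regression examples with pandas and scikit-learn; the column names follow the ones listed in the question, and data.csv is assumed to be the downloaded file:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# 'data.csv' and the column names follow the question; adjust to the real file.
df = pd.read_csv('data.csv', parse_dates=['DATE'])
df = df.sort_values(['Hardware Identifier', 'DATE'])

# Target: days until the same node fails again (NaN for each node's last event).
df['next_failure'] = df.groupby('Hardware Identifier')['DATE'].shift(-1)
df['days_to_next'] = (df['next_failure'] - df['DATE']).dt.days

# Very simple illustrative features built from each node's failure history.
df['days_since_prev'] = df.groupby('Hardware Identifier')['DATE'].diff().dt.days
df['failures_so_far'] = df.groupby('Hardware Identifier').cumcount()

labeled = df['days_to_next'].notna()
features = df.loc[labeled, ['days_since_prev', 'failures_so_far']].fillna(0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features, df.loc[labeled, 'days_to_next'])
```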

Determinism in tensorflow gradient updates?

So I have a very simple NN script written in TensorFlow, and I am having a hard time trying to track down where some "randomness" is coming from.
I have recorded the weights, gradients, and logits of my network as I train, and for the first iteration it is clear that everything starts off the same. I have a SEED value both for how data is read in and a SEED value for initializing the weights of the net. Those I never change.
My problem is that on, say, the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, like 1e-6 or so). However, over time this of course leads to non-repeatable behaviour.
What might the cause of this be? I don't know where any possible source of randomness might be coming from...
Thanks
There's a good chance you could get deterministic results if you run your network on the CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners that you get from ops like tf.batch), and a single well-defined operation order. Also, using inter_op_parallelism_threads=1 may help in some scenarios.
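For concreteness, a minimal sketch of that single-threaded CPU setup, assuming the TF 1.x graph-mode APIs used in the question (the seed value is arbitrary):

```python
import os
import tensorflow as tf

# Hide the GPU so all ops run on the CPU.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Fix the graph-level seed in addition to any op-level seeds.
tf.set_random_seed(42)

# Single-threaded execution in both the intra-op and inter-op pools.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

with tf.Session(config=config) as sess:
    # ... build and run the graph as usual ...
    pass
```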
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, there's the tf.add_n op, which doesn't say anything about the order in which it sums the values, and different orders can produce different results.
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get exactly the same numbers on reruns is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (i.e., the same number of misclassifications) even though the exact parameter values may be slightly different.
The TensorFlow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.
This post discusses a workaround to avoid using reduce_sum (i.e., taking the dot product of any vector with a vector of all 1's is the same as reduce_sum).
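A small sketch of that workaround in TF 1.x style, expressing the sum over the batch dimension as a matmul with a vector of ones (the 128-column placeholder is just an example shape):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 128])   # e.g. per-example values

# Potentially non-deterministic reduction on the GPU:
s_reduce = tf.reduce_sum(x, axis=0)

# The same sum expressed as a matmul with a vector of ones,
# which goes through a GEMM kernel instead of the reduction kernel.
ones = tf.ones(tf.stack([1, tf.shape(x)[0]]))
s_matmul = tf.reshape(tf.matmul(ones, x), [-1])
```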
I have faced the same problem.
The working solution for me was to:
1. Use tf.set_random_seed(1) so that all tf functions have the same seed on every new run.
2. Train the model on the CPU rather than the GPU, to avoid non-deterministic GPU operations due to precision.

Data dependency and consistency

I'm developing a quite large (for me) ruby script for engineering calculations. The script creates a few objects that are interconnected in a hierarchical fashion.
For example one object (Inp) contains the input parameters for a set of simulations. Other objects (SimA, SimB, SimC) are used to actually perform the simulations and each of them may generate one or more output objects (OutA, OutB, OutC) that contain the results and produce the actual files used for the visualization or analysis by other objects and so on.
The first time I perform and complete all the simulations, all the objects will be fully defined and I will have a series of files that represent the outputs for the user.
Now suppose that the user needs to change one of the attributes of Inp. Depending on which attribute has been modified, some simulations will have to be re-run and some objects OutX will be rendered invalid; otherwise consistency would be lost, as the outputs would no longer correspond to the inputs.
I would like to know whether there is a design pattern that would facilitate this process. I was also wondering whether some sort of graph could be used to visually represent the various dependencies between objects in a clear way.
From what I have been reading (this question is a year old) I think that the Ruby Observable class could be used for this purpose. Every time a parent object changes, it should send a message to its children so that they can update their state.
Is this the recommended approach?
I hope this makes the question a bit clearer.
I'm not sure that I fully understand your question, but the problem of stages which depend on results of previous stages, which in turn depend on results from previous stages, which themselves depend on results from previous stages, where every one of those stages can fail or take an arbitrary amount of time, is as old as programming itself and has been solved a number of times.
Tools which do this are typically called "build tools", because this is a problem that often occurs when building complex software systems, but they are in no way limited to building software. A more fitting term would be "dependency-oriented programming". Examples include make, ant, or Ruby's own rake.
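The question is about Ruby, but the dependency-oriented idea is language-agnostic. Here is a hedged Python sketch of the core mechanism (Node, invalidate and update are illustrative names, not an existing library): a graph of values where changing an input marks everything downstream as stale, so only the affected simulations are re-run, build-tool style.

```python
class Node:
    """A value in the pipeline (Inp, SimA, OutA, ...) with explicit dependencies."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)
        self.dependents = []
        self.stale = True          # nothing has been computed yet
        for dep in self.deps:
            dep.dependents.append(self)

    def invalidate(self):
        """Mark this node and everything downstream as needing a re-run."""
        if not self.stale:
            self.stale = True
            for child in self.dependents:
                child.invalidate()

    def update(self, run):
        """Re-run this node (and its stale dependencies) only if needed."""
        if self.stale:
            for dep in self.deps:
                dep.update(run)
            run(self)
            self.stale = False


inp = Node('Inp')
sim_a = Node('SimA', deps=[inp])
out_a = Node('OutA', deps=[sim_a])

out_a.update(lambda n: print('running', n.name))      # first full run
inp.invalidate()                                       # user edits an Inp attribute
out_a.update(lambda n: print('re-running', n.name))    # everything downstream of Inp re-runs
```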

Creating the N most accurate sparklines for M sets of data

I recently constructed a simple name popularity tool (http://names.yafla.com) that allows users to select names and investigate their popularity over time and by state. This is just a fun project and has no commercial or professional value, but it scratched a curiosity itch.
One improvement I would like to add is the display of simple sparklines beside each name in the select list, showing the normalized national popularity trends since 1910.
Doing an image request for every single name -- where hypothetically I've preconstructed the sparklines for every possible variant -- would slow the interface too much and yield a lot of unnecessary traffic as users quickly scroll and filter past hundreds of names they aren't interested in. Building sprites with sparklines for sets of names is a possibility, but again, with tens of thousands of names, the user's cache would end up burdened with a lot of unnecessary information.
My goal is absolutely tuned minimalism.
This got me contemplating the interesting challenge of taking M sets of data (occurrences over time) and distilling them into the N most representative sparklines. For this purpose they don't have to be exact, but they should be a general match, and I could tune N to yield a certain accuracy figure.
Essentially a form of sparkline lossy compression.
I feel like this most certainly is a solved problem, but can't find or resolve the heuristics that would yield the algorithms that would shorten the path.
What you describe seems to be cluster analysis - e.g. shoving that into Wikipedia will give you a starting point. Particular methods for cluster analysis include k-means and single linkage. A related topic is Latent Class Analysis.
If you do this, another option is to look at the clusters that come out, give them descriptive names, and then display the cluster names rather than inaccurate sparklines - or I guess you could draw not just a single line in the sparkline, but two or more lines showing the range of popularities seen within that cluster.
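A hedged sketch of that clustering idea, assuming the popularity series have already been loaded into an M x T array (rows are names, columns are years since 1910); the random placeholder data and the choice of N = 64 are only there to make the example self-contained:

```python
import numpy as np
from sklearn.cluster import KMeans

# popularity: M x T array of yearly counts per name (random placeholder here).
rng = np.random.default_rng(0)
popularity = rng.random((10000, 110))

# Normalize each series so clustering compares curve shapes, not magnitudes.
peaks = popularity.max(axis=1, keepdims=True)
peaks[peaks == 0] = 1.0
shapes = popularity / peaks

N = 64   # number of representative sparklines to actually render
km = KMeans(n_clusters=N, n_init=10, random_state=0).fit(shapes)

# km.cluster_centers_ are the N sparklines; km.labels_[i] says which one
# to show next to name i. The mean squared distance to the assigned
# centroid is one tunable "accuracy number" for choosing N.
mean_error = km.inertia_ / len(shapes)
print(mean_error)
```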

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision tree. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasibility of it, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very, very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know for each state in your database is its game-theoretic value, i.e. whether it is a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits to encode this information per state.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check if the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate using min/max the game-theoretic value of the original node and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left over to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X" or "O". This naive analysis gives you 3^9 = 2^14.26, i.e. 15 bits per state, so you would have an array of 2^16 bits.
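A hedged sketch of that scheme, using the tic-tac-toe numbers as an example; TwoBitTable, encode_state, successors and terminal_value are illustrative names for whatever compact encoding and move generator the actual game provides:

```python
UNKNOWN, LOSS, WIN, DRAW = 0b00, 0b01, 0b10, 0b11   # the codes from the note above

class TwoBitTable:
    """Packed array holding one 2-bit game-theoretic value per encoded state."""

    def __init__(self, bits_per_state):
        self.data = bytearray((2 ** bits_per_state * 2 + 7) // 8)

    def get(self, index):
        bit = index * 2
        return (self.data[bit // 8] >> (bit % 8)) & 0b11

    def set(self, index, value):
        bit = index * 2
        self.data[bit // 8] &= ~(0b11 << (bit % 8)) & 0xFF
        self.data[bit // 8] |= value << (bit % 8)

def solve(state, table, encode_state, successors, terminal_value):
    """Recursive min/max over successors; encode_state, successors and
    terminal_value are assumed helpers for the specific game."""
    idx = encode_state(state)
    cached = table.get(idx)
    if cached != UNKNOWN:
        return cached
    term = terminal_value(state)          # WIN/LOSS/DRAW for finished games, else None
    if term is not None:
        table.set(idx, term)
        return term
    # A child's LOSS (for the side to move there) is our WIN, and vice versa.
    child_values = [solve(s, table, encode_state, successors, terminal_value)
                    for s in successors(state)]
    if LOSS in child_values:
        value = WIN
    elif DRAW in child_values:
        value = DRAW
    else:
        value = LOSS
    table.set(idx, value)
    return value

table = TwoBitTable(bits_per_state=15)    # ~2^16 bits for the tic-tac-toe example
```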
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the queue. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
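A hedged sketch of that worker loop; the queue and db handles stand in for whatever queue service (RabbitMQ, SQS, ...) and database you end up using, and all the helper methods on them are assumptions:

```python
def worker(queue, db, successors, is_end_state, score):
    """One client: pull a state, expand it, and push new work.
    queue and db are assumed wrappers around the real services."""
    while True:
        state = queue.get()                    # blocks until an unexplored state arrives
        if db.seen(state):                     # multiple paths can reach the same position
            continue
        db.mark_seen(state)
        if is_end_state(state):
            db.record_score(state, score(state))
            db.propagate_to_parents(state)     # minimax update back up the graph
        else:
            for child in successors(state):
                db.add_edge(state, child)
                if not db.seen(child):
                    queue.put(child)
```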
The first thing that popped into my mind is the Linda style of a shared 'whiteboard', where different processes can consume 'problems' from the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.
