Can someone confirm the statefulness of metrics in tf.keras (TensorFlow 2)?

I am new to TensorFlow and am trying to re-implement some of the usual metrics, such as accuracy and F1, on a per-class basis: if my data has three classes A, B, and C, I want the accuracy, F1, etc. for A, B, and C separately. My model and training scripts are almost entirely defined in tf.keras. But since I'm a little late to the party, I'm not sure whether the metrics (both built-in ones such as tf.keras.metrics.TruePositives and subclasses of the general tf.keras.metrics.Metric) are stateful, i.e. whether the metric accumulates over batches rather than being averaged per batch. I have read and re-read too many GitHub and Stack Overflow issues about Keras metrics being non-stateful in the past, so I can't seem to wrap my head around whether I can trust tf.keras.metrics.Metric in TensorFlow 2 and start subclassing it, or whether I should do something else.
Any insight would be great. (A definitive pointer to a release note where this is written explicitly would be a plus!)
Thanks,
Tanmoy
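For concreteness, what I have in mind is subclassing along these lines (a rough sketch; the class name and label handling are my own, and it assumes integer labels and per-class probability predictions). The whole point is that it relies on update_state() accumulating into the add_weight variable across batches until reset_states() is called, which is exactly the behaviour I'm asking about:

```python
import tensorflow as tf

class PerClassTruePositives(tf.keras.metrics.Metric):
    """Counts true positives for a single class, accumulated across batches."""

    def __init__(self, class_id, name="per_class_tp", **kwargs):
        super().__init__(name=name, **kwargs)
        self.class_id = class_id
        # add_weight creates the state variable that persists between batches
        self.tp = self.add_weight(name="tp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # y_true: integer labels; y_pred: per-class probabilities
        y_true = tf.reshape(tf.cast(y_true, tf.int64), [-1])
        y_pred = tf.argmax(y_pred, axis=-1)
        hit = tf.logical_and(tf.equal(y_true, self.class_id),
                             tf.equal(y_pred, self.class_id))
        self.tp.assign_add(tf.reduce_sum(tf.cast(hit, tf.float32)))

    def result(self):
        return self.tp

    def reset_states(self):
        self.tp.assign(0.0)
```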

Related

Kedro Data Modelling

We are struggling to model our data correctly for use in Kedro. We are using the recommended Raw/Int/Prm/Ft/Mst layering but are struggling with some of the concepts, e.g.:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the INT layer, or should it always pass through Primary?
I appreciate there are no hard and fast rules with data modelling, but these are big modelling decisions, and any guidance or best practice on Kedro modelling would be really helpful; I can find just one table defining the layers in the Kedro docs.
If anyone can offer any further advice or blogs/docs talking about Kedro Data Modelling, that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data. This is not because the kedro default is the right structure for them but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets to fit the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. those formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. to find the one-hot encoded most common car colour each day.
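To make the car-colour example concrete, here is a rough pandas sketch (the column names, cleaning map, and data are invented for illustration): the primary step only standardises values, while the feature step does a non-trivial aggregation.

```python
import pandas as pd

# raw layer: one row per car observed, with messy colour strings
raw = pd.DataFrame({
    "day": ["mon", "mon", "mon", "tue", "tue"],
    "colour": ["rde", "red", "crimson", "blue", "blue"],
})

# primary layer: clean the values, no complex calculation
colour_map = {"rde": "red", "crimson": "red"}
prm = raw.assign(colour=raw["colour"].replace(colour_map))

# feature layer: a non-trivial aggregation, e.g. the most
# common colour per day, one-hot encoded
most_common = prm.groupby("day")["colour"].agg(lambda s: s.mode()[0])
ft = pd.get_dummies(most_common)
```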
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer, or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is already in a shape you can build features from, then it's probably in the primary layer already; in that case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node in layer L should all be in the same layer, which can be either L or L-1
The output datasets for a node in layer L should all be in the same layer, which can be either L or L+1
If anyone can offer any further advice or blogs/docs talking about Kedro Data Modelling, that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article better explaining these concepts; it has just been published on Towards Data Science.

What's a simple system to model and compare journeys?

Let's say I would like to build a system modelling the behaviour of visitors to a city.
For argument's sake, the city has 5 places of interest: A, B, C, D, and E. All are equally likely to be the first place to visit, and all are within easy reach of one another.
I am interested in drawing conclusions resembling the following:
"Users who visit C commonly go on to visit B."
"Users who visit A hardly ever go on to visit D."
"Users who visit B are equally likely to visit C and E."
My problems as I understand them are as follows:
I don't know anything about graph theory. (But I am prepared to read up on it).
I'm uncertain of the best way to store this kind of data. If not an SQL DB, what?
What sort of operations am I going to be performing on the data I end up with? Could I use a general-purpose language like Ruby?
Thank you for any guidance.
The type of storage obviously depends on the sort of data you have. If it's just what you describe here then you can represent each journey as a string:
ABCB
DCDE
...
These fit well in a database, but of course such a list can be stored by any means, whatever is most easily available to you. You probably don't even need the entire list; an accumulated version might be sufficient, where you store each string exactly once, along with its count:
ABDC 177
DEA 2996
...
For such a table a database is appropriate, but it's still simple enough to be stored in a plain file.
For examining the data you don't need graph theory; rather, read up on statistics and machine learning. The first thing you will want to analyze is the correlation between the various places. You can do that with simple string operations, e.g. counting the substring "AD" to find out how often people go from A to D. Regarding the language: you want to calculate and visualize correlations, so pick something where that kind of thing isn't too hard. This could be something specialized like Matlab or R, or something more general like Python with Matplotlib and scikit-learn. I don't know about Ruby.
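The substring-counting idea can be sketched in a few lines of Python (the journey strings below are invented): count adjacent pairs of places, then normalise per starting place to get "users who visit X go on to visit Y" frequencies.

```python
from collections import Counter

journeys = ["ABCB", "DCDE", "ABD", "CB"]  # example data

# count each adjacent pair of visits, e.g. "AB" means A then B
transitions = Counter()
for j in journeys:
    for a, b in zip(j, j[1:]):
        transitions[a + b] += 1

# normalise per starting place to get conditional frequencies
totals = Counter()
for pair, n in transitions.items():
    totals[pair[0]] += n

# probs["AB"] is 1.0 here: every visit to A was followed by B
probs = {pair: n / totals[pair[0]] for pair, n in transitions.items()}
```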

How to visualize profile files graphically?

I'm developing Go 1.2 on Windows 8.1 64 bit. I had many issues getting the go pprof tool to work properly such as memory addresses being displayed instead of actual function names.
However, I found profile, which seems to do a great job of producing profile files that work with the pprof tool. My question is, how do I use those profile files for graphical visualization?
You can try go tool pprof /path/to/program profile.prof to fix the problem of memory addresses being shown instead of function names.
If you want a graphical visualization, type web at the pprof interactive prompt.
If your goal is to see pretty but basically meaningless pictures, go for visualization as @Specode suggested.
If your goal is speed, then I recommend you forget visualization.
Visualization does not tell you what you need to fix.
This method does tell you what to fix.
You can do it quite effectively in GDB.
EDIT in response to @BurntSushi5:
Here are my "gripes with graphs" :)
In the first place, they are super easy to fool.
For example, suppose A1 spends all its time calling C2, and A2 spends all its time calling C1.
Then suppose a new routine B is inserted, such that when A1 calls B, B calls C2, and when A2 calls B, B calls C1.
The graph loses the information that every time C2 is called, A1 is above it on the stack, and vice-versa.
For another example, suppose every call to C is from A.
Then suppose instead A "dispatches" to a bunch of functions B1, B2, ..., each of which calls C.
The graph loses the information that every call to C comes through A.
Now to the graph that was linked:
It places great emphasis on self time, making giant boxes, when inclusive time is far more important. (In fact, the whole reason gprof was invented was because self time was about as useful as a clock with only a second-hand.) They could at least have scaled the boxes by inclusive time.
It says nothing about the lines of code that the calls come from, or that are spending the self time. It's based on the assumption that all functions should be small. Maybe that's true, and maybe not, but is it a good enough reason for the profile output to be unhelpful?
It is chock-full of little boxes that don't matter because their time is insignificant. All they do is take up gobs of real estate and distract you.
There's nothing in there about I/O. The profiler from which the graph came apparently embodies that the only I/O is necessary I/O, so there's no need to profile it (even if it takes 90% of the time). In big programs, it's really easy for I/O to be done that isn't really necessary, taking a big fraction of time, and so-called "CPU profilers" have the prejudice that it doesn't even exist.
There doesn't seem to be any instance of recursion in that graph, but recursion is common, and useful, and graphs have difficulty displaying it with meaningful measurements.
Just pointing out that, if a small number of stack samples are taken, roughly half of them would look like this:
blah-blah-couldn't-read-it
blah-blah-couldn't-read-it
blah-blah-couldn't-read-it
fragbag.(*structureAtoms).BestStructureFragment
structure.RMSDMem
... a couple other routines
The other half of the samples are doing something else, equally informative.
Since each stack sample shows you the lines of code where the calls come from, you're actually being told why the time is being spent.
(Activities that don't take much time have very small likelihood of being sampled, which is good, because you don't care about those.)
Now I don't know this code, but the graph gives me a strong suspicion that, like a lot of code I see, the devil's in the data structure.

Separation and pattern matching techniques

I am new to Artificial Neural Networks.
I am interested in an application like this:
I have a significantly large set of objects. Each object has six properties, denoted by P1–P6. Each property has a value which is a symbolic value. In other words, in my example P1–P6 can have a value from the set {A, B, C, D, E, F}. They are not numeric. (Suppose A,B,C,D,E,F are colours; then you will understand my idea.)
Now, there is another property R that I am interested in. Suppose
R = {G1, G2, G3, G4, G5}
I need to train a system on a large set of P1–P6 values and the relevant R. Then I want to do the following:
1. Given an object whose P1–P6 values I know, find its R (the group the object belongs to).
2. Given a desired R, find what pattern I need to have in P1–P6. For example, given that R = G2, I need to figure out any pattern in P1–P6.
My questions are:
What are the theories/technologies/techniques I should read and learn in order to implement 1 and 2, respectively?
What are the tools/libraries you can recommend to get this simulated/implemented/tested?
The way you described your problem, you need to look up various machine learning techniques. If it were me, I would try and read about k-NN (k Nearest Neighbours) for the classification. When I say classification, I mean getting the R if you know P1-P6. It is a really simple technique and should be helpful here.
As for the other way around, what you basically need is a representative sample of your population. This is I think not so usual, but you could try something like a k-means Clustering. Clustering methods usually determine the class of an object (property R) by themselves, but k-means Clustering is cool in this situation because you need to give it the number of object classes (e.g. different possible values of R), and in the end you get one representative sample.
You definitely shouldn't go for any really complex techniques (like neural networks) in my opinion since your data doesn't have a precise numerical interpretation and the values can't be interpreted gradually.
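To illustrate the k-NN idea on symbolic data like this, here is a library-free sketch using Hamming distance (the number of properties on which two objects differ) as the distance measure; the training data is invented:

```python
from collections import Counter

# training data: (P1..P6 values, R group) - invented examples
train = [
    (("A", "A", "B", "C", "A", "B"), "G1"),
    (("A", "A", "B", "C", "B", "B"), "G1"),
    (("F", "E", "D", "D", "E", "F"), "G2"),
    (("F", "E", "D", "C", "E", "F"), "G2"),
]

def hamming(p, q):
    """Number of properties on which two objects differ."""
    return sum(a != b for a, b in zip(p, q))

def knn_predict(x, k=3):
    """Majority vote among the k nearest training objects."""
    nearest = sorted(train, key=lambda item: hamming(x, item[0]))[:k]
    votes = Counter(r for _, r in nearest)
    return votes.most_common(1)[0][0]
```

Tools like Orange or Weka do essentially this for you, but the algorithm itself is small enough to prototype directly.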
The recommended tools really depend on your base programming language. There's a great tool called Orange, which is Python-based; it's my tool of choice for these kinds of things (especially since it is really easy to connect your Python modules with C/C++). If you prefer Java, there's a quite similar tool called Weka that you could use. I think Weka is a little better documented, but I don't like Java so I've never tried it out.
Both of these tools have a graphical clickable interface where you could just load your data and get the classification done, play with the parameters and check what kind of output you get using different techniques and different set-ups. Once you decide that you got the results you need (or if you just don't like graphical interfaces) you can also use both of them as libraries of a kind when programming (Python for Orange and Java for Weka) and make the classification a part of a bigger project.
If you look through the documentation of Orange or Weka, I think it will give you a few ideas about what you could actually do with the data you have and when you know a few techniques that seem interesting to you and applicable to the data, maybe you could get more quality comments and info on a few specific methods here than when just searching for a general advice.
You should check out classification algorithms (a subsection of artificial intelligence), especially the nearest neighbor-algorithms. Your problem may be solved by different techniques, which all have different advantages and disadvantages.
However, I do not know of any method in artificial intelligence which allows a two-way classification (in other words, one that implements both of your prerequisites 1 and 2 simultaneously). As all you want so far is a bidirectional mapping P1..P6 <=> R, I would suggest just using a mapping table instead of an artificial intelligence algorithm. An AI approach would work well if you don't know exactly how your samples are categorized under A..E in P1..P6.
If you insist on using an AI for it, I'd suggest first looking at a perceptron. A perceptron consists of input, intermediate and output neurons. For your example, you'd have the input neurons P1a..P1e, P2a..P2e, ... and five output neurons R1..R5. After training, you should be able to input P1..P6 and get the appropriate R1..R5 as output.
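A minimal sketch of such a perceptron, assuming the full symbol set A..F (so 36 one-hot inputs) and no hidden layer; this is my own illustration of the idea, not a recommended production setup:

```python
SYMBOLS = "ABCDEF"
GROUPS = ["G1", "G2", "G3", "G4", "G5"]

def one_hot(props):
    """Encode P1..P6 as 36 binary inputs (P1a..P1f, P2a..P2f, ...)."""
    x = []
    for p in props:
        x.extend(1.0 if s == p else 0.0 for s in SYMBOLS)
    return x

class Perceptron:
    """One output neuron per group, trained with the perceptron rule."""

    def __init__(self, n_in=36):
        self.w = {g: [0.0] * (n_in + 1) for g in GROUPS}  # +1 for bias

    def score(self, g, x):
        w = self.w[g]
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))

    def predict(self, props):
        x = one_hot(props)
        return max(GROUPS, key=lambda g: self.score(g, x))

    def train(self, data, epochs=20, lr=0.1):
        for _ in range(epochs):
            for props, r in data:
                x = one_hot(props)
                for g in GROUPS:
                    target = 1.0 if g == r else 0.0
                    out = 1.0 if self.score(g, x) > 0 else 0.0
                    err = target - out
                    # nudge weights toward the target for this neuron
                    for i, xi in enumerate(x):
                        self.w[g][i] += lr * err * xi
                    self.w[g][-1] += lr * err
```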
As for frameworks and technologies, I only know of the Business Intelligence suite for Visual Studio, although there are a lot of other frameworks for AI out there. Since I have not used any of them (I have always coded these things myself in C/C++), I can't recommend any.
It seems like a typical classification problem. If you really have a lot of data, have a look at Apache Mahout, which provides distributed implementations of machine learning algorithms. If you need something less complex for prototyping, TimBL is a nice alternative.

What are the risks and ramifications of changing the document validation criteria in a running Couch database?

To take the simplest possible example:
Start with an empty database.
Add a document
Add a design document with validation function that rejects everything
Replicate that database.
To ask a concrete question to begin with, one with an answer that I hope can be given very quickly by pointing me to the right URL: is the result of this replication defined by some rule, for example that the documents are always replicated in the order they were saved, or does the successful replication of the first document depend on whether the design document happened to arrive at the destination first? In the quick experiment I did, both documents did get successfully validated, but I'm trying to find out whether that outcome is defined in a spec somewhere or is implementation-dependent.
To ask a followup question that's more handwavey and may not have a single answer: what else can happen, and what sorts of solutions have emerged to manage those problems? It's obviously possible for different servers to simultaneously (and I use that word hesitantly) have different versions of a validation function. I suppose the validators could be backwards compatible, where every new version adds a case to a switch statement that looks up, say, a schema_version attribute of the document. Then if a version 2 document arrives at a server where the version 3 validator is the gatekeeper, it'll be allowed in. If a version 3 document arrives at a version 2 validator, it's a bit more tricky; it presumably depends on whether strictness or leniency is an appropriate default for the application. But can either of those things even happen, or do the replication rules ensure that, even if servers are going up and down, updates and deletes are being done all over the place, and replication connections are intermittent and indirect, a document will never arrive on a given server before its appropriate validation function, and a validation function will never arrive too late to handle one of the documents it was supposed to check?
I could well be overcomplicating this or missing out on some Zen insight, but painful experience has taught me that I'm not clever enough to predict what sorts of states concurrent systems can get themselves into.
EDIT:
As Marcello says in a comment, updates on individual servers have sequence numbers, and replication applies the updates in sequence number order. I had a vague idea that that was the case, but I'm still fuzzy on the details. I'm trying to find the simplest possible model that will give me an idea about what can and can't happen in a complex CouchDB system.
Suppose I take the state of server A that's started off empty and has three document writes made to it. So its state can be represented as the following string:
A1,A2,A3
Suppose server B also has three writes: B1,B2,B3
We replicate A to B, so the state of B is now: B1,B2,B3,A1,A2,A3. Although presumably the A updates have taken a sequence number on entering B, so the state is now: B1, B2, B3, B4(A1), B5(A2), B6(A3).
If I understand correctly, the replicator also makes a record of the fact that everything up to A3 has been replicated to B, and it happens to store this record as part of B's internal state, but I'm wondering if this is an implementation detail that can be disregarded in the simple model.
If you apply those rules, the A updates and the B updates would stay in order on any server they were replicated to. Perhaps the only way they could get out of order is if you did something like replicating A to B, deleting A1 on A and A2 on B, replicating A to C, then replicating B to C, leaving a state on C of: A2, A3, B1, B2, B3, B4(A1).
Is this making any sense at all? Maybe strings aren't the right way of visualising it; maybe it's better to think of, I don't know, a bunch of queues (servers) in an airport, with airport staff (replicators) moving people from queue to queue according to certain rules, and put yourself into the mind of someone trying to skip the queue, i.e. somehow get into a queue ahead of someone who's before them in their current queue. That has the advantage of personalising the model, but we probably don't want to replicate people in airports.
Or maybe there's some way of explaining it as a Towers of Hanoi type game, although with FIFO queues instead of LIFO stacks.
It's a model I'm hoping to find - absolutely precise as far as behavior is concerned, all irrelevant implementation details stripped away, and using whatever metaphor or imagery makes it easiest to intuit.
The basic use case is simple. CouchDB uses sequence numbers to index database changes and to ask what changes need to be replicated. Order is implicit in this algorithm and what you fear should not happen. As a side note, the replication process only copies the last revision of a document, but this does not change anything about order.
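The sequence-number mechanism can be captured in a toy Python model (my own simulation, not CouchDB code): each server keeps an ordered log of updates and a per-source checkpoint, and replication pulls the updates past the checkpoint in the source's sequence order.

```python
class Server:
    """Toy model: an ordered update log plus per-source checkpoints."""

    def __init__(self, name):
        self.name = name
        self.log = []          # updates in local sequence order
        self.checkpoints = {}  # source name -> how far we have replicated

    def write(self, update):
        self.log.append(update)

    def replicate_from(self, source):
        # pull only updates past the last checkpoint, in sequence order
        start = self.checkpoints.get(source.name, 0)
        for update in source.log[start:]:
            self.log.append(update)
        self.checkpoints[source.name] = len(source.log)

a, b = Server("A"), Server("B")
for u in ["A1", "A2", "A3"]:
    a.write(u)
for u in ["B1", "B2", "B3"]:
    b.write(u)

b.replicate_from(a)
# b.log is now B1, B2, B3, A1, A2, A3: A's updates arrive in A's order
```

Note that real CouchDB replicates only the latest revision of each document rather than every update, as the answer above says, but that does not change the ordering behaviour this toy illustrates.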
