pymc3 how to code multi-state discrete Bayes net CPT? - pymc

I'm trying to build a simple Bayesian network, where rain and sprinkler are the parents of wetgrass, but rain and sprinkler each have three (fuzzy-logic type rather than the usual two boolean) states, and wetgrass has two states (true/false). I can't find anywhere in the pymc3 docs what syntax to use to describe the CPTs for this -- I'm trying the following based on 2-state examples, but it's not generalizing to three states the way I thought it would. Can anyone show the correct way to do this? (And also for the more general case where wetgrass has three states too.)
rain = mc.Categorical('rain', p=np.array([0.5, 0., 0.5]))
sprinker = mc.Categorical('sprinkler', p=np.array([0.33, 0.33, 0.34]))
wetgrass = mc.Categorical('wetgrass',
    mc.math.switch(rain,
        mc.math.switch(sprinker, 10, 1, -4),
        mc.math.switch(sprinker, -20, 1, 3),
        mc.math.switch(sprinker, -5, 1, -0.5)))
This gives an error at the wetgrass definition:
Wrong number of inputs for Switch.make_node (got 4((, , , )), expected 3)
As I understand it, switch is a theano function similar to the ternary (a ? b : c) in a C program, which only does a two-way comparison. It may be possible to set up the CPT using a whole load of binary switches like this, but I really want to just give a 3D matrix CPT as the input, as in BNT and other Bayes net libraries. Is this currently possible?

You can code a three-way switch using two individual switches:
tt.switch(sprinker == 0,
    10,
    tt.switch(sprinker == 1, 1, -4))
But in general it is probably better to index into a table:
table = tt.constant(np.array([[...], [...]]))
value = table[rain, sprinker]
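For the concrete rain/sprinkler/wetgrass case, a minimal sketch of the table approach (untested against any particular PyMC3 version, and with made-up CPT numbers purely for illustration) might look like this:
import numpy as np
import pymc3 as mc
import theano.tensor as tt

# cpt[r, s] = P(wetgrass = [False, True] | rain=r, sprinkler=s)
cpt = tt.constant(np.array([
    [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]],
    [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]],
    [[0.5, 0.5], [0.2, 0.8], [0.1, 0.9]],
]))

with mc.Model():
    rain = mc.Categorical('rain', p=np.array([0.5, 0., 0.5]))
    sprinker = mc.Categorical('sprinkler', p=np.array([0.33, 0.33, 0.34]))
    # index the CPT with the parent variables to get the child's probability vector
    wetgrass = mc.Categorical('wetgrass', p=cpt[rain, sprinker])
For three wetgrass states, the same pattern applies with a (3, 3, 3) table whose last axis sums to 1.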

Related

Is there a way to use range with Z3ints in z3py?

I'm relatively new to Z3 and experimenting with it in Python. I've coded a program which returns the order in which different actions are performed, each represented with a number. Z3 returns an integer representing the second at which the action starts.
Now I want to look at the model and see if there is an instance of time where nothing happens. To do this I made a list with only 0's, and I want to set the indices corresponding to the times where each action is being executed to 1. For instance, if an action starts at the 5th second and takes 8 seconds to execute, indices 5 to 12 would be set to 1. Doing this for all the actions and then looking for 0's in the list would hopefully give me the instances where nothing happens.
The problem is: I would like to write something like this for coding the problem
list_for_check = [0]*total_time
m = s.model()
for action in actions:
    for index in range(m.evaluate(action.number), m.evaluate(action.number) + action.time_it_takes):
        list_for_check[index] = 1
But I get the error:
'IntNumRef' object cannot be interpreted as an integer
I've understood that Z3 isn't returning normal ints or bools in their models, but writing
if m.evaluate(action.boolean):
works, so I'm assuming the if test is overloaded in some way, but this doesn't seem to be the case with range. So my question is: is there a way to use range with Z3 ints? Or is there another way to do this?
The problem might also be that action.time_it_takes is a plain integer and adding a Z3 int to a "normal" int doesn't work (this is done in the second argument of range).
I've also tried using int(m.evaluate(action.number)), but it doesn't work.
Thanks in advance :)
When you call evaluate it returns an IntNumRef, which is z3's internal representation of an integer number. You need to call its as_long() method to convert it to a Python number. Here's an example:
from z3 import *

s = Solver()
a = Int('a')
s.add(a > 4)
s.add(a < 7)

if s.check() == sat:
    m = s.model()
    print("a is %s" % m.evaluate(a))
    print("Iterating from a to a+5:")
    av = m.evaluate(a).as_long()
    for index in range(av, av + 5):
        print(index)
When I run this, I get:
a is 5
Iterating from a to a+5:
5
6
7
8
9
which is exactly what you're trying to achieve.
The method as_long() is defined here. Note that there are similar conversion functions from bit-vectors and rationals as well. You can search the z3py api using the interface at: https://z3prover.github.io/api/html/namespacez3py.html
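For instance, a small sketch of the analogous conversions (illustrative only; the variables and constraints here are made up):
from z3 import *

x = BitVec('x', 8)
q = Real('q')
s = Solver()
s.add(x == 42, q == RealVal(3) / 2)

if s.check() == sat:
    m = s.model()
    print(m.evaluate(x).as_long())      # bit-vector value as a Python int
    print(m.evaluate(q).as_fraction())  # rational value as a Python Fraction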

Even an image in data set used to train is giving opposite values when making prediction

I am new to ML and TensorFlow. I am trying to build a CNN to classify good images versus corrupted images, similar to the rock paper scissors tutorial in TensorFlow, except with only two categories.
The Colab Notebook
Model Architecture
train_generator = training_datagen.flow_from_directory(
    TRAINING_DIR,
    target_size=(150, 150),
    class_mode='categorical'
)

validation_generator = validation_datagen.flow_from_directory(
    VALIDATION_DIR,
    target_size=(150, 150),
    class_mode='categorical'
)

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 150x150 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The third convolution
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The fourth convolution
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

model.summary()

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit_generator(train_generator, epochs=25, validation_data=validation_generator, verbose=1)

model.save("rps.h5")
The only changes I made were the input shape, from (150,150,1) to (150,150,3), and the last layer's output, from 3 neurons to 2. The training consistently gave accuracy above 90 for a data set of 600 images in each class. But when I make a prediction using the code in the tutorial, it gives me highly wrong values even for images from the data set.
PREDICTION
Original code in TensorFlow tutorial
for file in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150, 3))  # changed target_size to (150, 150, 3) from (150, 150)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
I changed target_size from (150, 150) to (150, 150, 3) in the belief that, since my input is a 3-channel image, the target size should include the channel dimension.
Result
It gives very wrong values, [0,1][0,1], even for images which are in the dataset.
But when I changed the code to this
for file in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150, 3))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x /= 255.
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
In this case the values come out like:
[[9.9999774e-01 2.2242968e-06]]
[[9.9999785e-01 2.1864464e-06]]
[[9.9999785e-01 2.1641024e-06]]
There are one or two errors, but the results are largely correct.
So my question: even though the last activation is softmax, why are the values now coming out as decimals? Is there any logical mistake in the way I am making predictions? I tried binary also, but couldn't find much difference.
Please note -
When you change the output classes from 2 to 3, you are asking the model to categorise into 3 classes. This would contradict your problem statement, which separates good and corrupted images, i.e. 2 output classes (a binary problem). I think it should be reversed, from 3 to 2, if I have understood the question correctly.
Second, the output you are getting is perfectly correct: neural network models output probabilities instead of absolute class values like 0 or 1. The probability tells you how likely the input is to belong to, say, class 0 or class 1.
Also, as mentioned above by @BBloggsbott, you just have to use np.argmax on the output array, which gives you the index of the most probable class.
Hope this helps.
Thanks.
Softmax returns a probability distribution over the vector it gets as input. So the fact that you are getting decimal values is not a problem. If you want to find the exact class each image belongs to, try using the argmax function on the predictions.
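A small illustration of that (using the first prediction above plus one made-up row for contrast):
import numpy as np

preds = np.array([[9.9999774e-01, 2.2242968e-06],   # first prediction from above
                  [1.2e-03, 9.988e-01]])            # made-up example
print(np.argmax(preds, axis=1))                     # -> [0 1], the predicted class per image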

EasyPredictModelWrapper giving wrong prediction

public BinomialModelPrediction predictBinomial(RowData data) throws PredictException {
    double[] preds = this.preamble(ModelCategory.Binomial, data);
    BinomialModelPrediction p = new BinomialModelPrediction();
    double d = preds[0];
    p.labelIndex = (int)d;
    String[] domainValues = this.m.getDomainValues(this.m.getResponseIdx());
    p.label = domainValues[p.labelIndex];
    p.classProbabilities = new double[this.m.getNumResponseClasses()];
    System.arraycopy(preds, 1, p.classProbabilities, 0, p.classProbabilities.length);
    if (this.m.calibrateClassProbabilities(preds)) {
        p.calibratedClassProbabilities = new double[this.m.getNumResponseClasses()];
        System.arraycopy(preds, 1, p.calibratedClassProbabilities, 0, p.calibratedClassProbabilities.length);
    }
    return p;
}
Eg: classProbabilities = [0.82333, 0.276666]
labelIndex = 1
label = true
domainValues = [false,true]
What does this labelIndex signify, and is the order of the classProbabilities the same as the order of the domainValues? If the order is the same, then the probability of false is 0.82333 and the probability of true is 0.276666, so why is the labelIndex showing as 1 and the label as true?
Please help me to figure out this issue.
Like Tom commented, the prediction is not "wrong". You can infer from this that the threshold H2O has chosen is less than 0.276666. You probably have imbalanced training data, otherwise H2O would not have picked a low threshold for classifying a predicted value of 0.276666 as a 1. Does your training set include fewer examples of the positive class than the negative class?
If you don't like that threshold for whatever reason, then you can manually create your own. Just make sure you know how to properly evaluate the effect of using different thresholds on the performance of your model, otherwise I'd recommend just using the default threshold.
The name, "classProbabilities" is a misnomer. These are not actual probabilities, they are predicted values, though people often use the terms interchangeably. Binary classification algorithms produce "predicted values" that look like probabilities when they're between 0 and 1, but unless a calibration process is performed, they are not going to represent the probabilities. Calibration is not necessarily a straight-forward process and there are many techniques. Here's some more info about calibration methods for imbalanced data. In H2O, you can perform calibration using Platt scaling using the calibrate_model option. But this is probably not really necessary to what you're trying to do.
The proper way to use the raw output from a binary classification model is to look only at the predicted value for the positive class (you can simply ignore the predicted value for the negative class). Then you choose a threshold which suits your needs, or you can use the default threshold in H2O, which is chosen to maximize the F1 score. Some other software will use a hardcoded threshold of 0.5, but that will be a terrible choice if you don't have a roughly even balance of positive and negative examples in your training data. If you have only a few positive examples in your training data, then the best threshold will be something much lower than 0.5.
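A minimal sketch of that manual-threshold idea (plain Python; the threshold values are illustrative, not anything H2O provides here):
def classify(p_positive, threshold):
    # predict the positive class only when its predicted value clears the threshold
    return 1 if p_positive >= threshold else 0

print(classify(0.276666, 0.5))  # -> 0 with a hardcoded 0.5 threshold
print(classify(0.276666, 0.2))  # -> 1 with a lower threshold (H2O's default here is evidently below 0.276666)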

Boost Confidence of Overlapping Observations In Apache Spark

I'm fairly new to scala/spark, so forgive me if my question is elementary but I've searched everywhere and can't find the answer.
Problem
I'm trying to boost the confidence scores of a bunch of network router observations (observations of probable router types at different network junctions).
I have a type NetblockObservation that combines a device type seen on the network with an associated netblock and a confidence. The confidence is the confidence that we accurately identified the device we saw.
case class NetblockObservation(
    device_type: String,
    ip_start: Long,
    ip_end: Long,
    confidence: Double
)
If the confidence is above some threshold thresh, then I want that observation to be in the returned dataset. If it's below thresh, it should not be.
In addition, if I have two observations with the same device_type and one contains the other, the containee should have its confidence increased by the confidence of the container.
Example
Let's say I have 3 Netblock Observations
// 0.0.0.0/28
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 15, confidence_score: .4)
// 0.0.0.0/29
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 7, confidence_score: .4)
// 0.0.0.0/30
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: .4)
With a confidence threshold of 1, I would expect to have a single output of NetblockObservation(device_type: "x", ip_start: 0, ip_end: 4, confidence_score: 1.2)
Explanation: I am allowed to add the confidence scores of NetblockObservations together when one is contained in the other and they have the same device_type.
I was allowed to add the confidence score of the 0.0.0.0/29 to the confidence of the 0.0.0.0/30 because it's contained within it.
I was not allowed to add the confidence score of 0.0.0.0/30 to the 0.0.0.0/29 because 0.0.0.0/29 is not contained within 0.0.0.0/30.
My (pitiful) Attempt
Failure reason: Too slow / never completed
I attempted to implement this while simultaneously learning scala/spark so I'm not sure if it's the idea or the implementation which is wrong. I think it would eventually work but after an hour, it hadn't completed on a dataset of size 300,000 (small compared to production scale) so I gave up on it.
The idea is to find the largest netblock and separate the data into netblocks which are contained by it and netblocks which are not. The netblocks which are not contained are recursively passed back into the same function. If the largest netblock has a confidence_score of 1, the entire contained dataset is disregarded and the largest is added to the return dataset. If the confidence_score is less than 1, then its confidence_score is added to everything in the contained dataset and that group is recursively passed back to the same function. Eventually, you should only be left with the data which has a confidence_score greater than 1. This algorithm also has the issue of not taking device_type into account.
def handleDataset(largestInNetData: Option[NetblockObservation], netData: RDD[NetblockObservation]): RDD[NetblockObservation] = {
  if (netData.isEmpty) spark.sparkContext.emptyRDD else largestInNetData match {
    case Some(largest) =>
      val grouped = netData.groupBy(item =>
        if (item.ip_start >= largest.ip_start && item.ip_end <= largest.ip_end) largestInNetData
        else None)
      def lookup(k: Option[NetblockObservation]) = grouped.filter(_._1 == k).flatMap(_._2)
      val nos = handleDataset(None, lookup(None))
      // Threshold is assumed to be 1
      val next = if (largest.confidence_score >= 1) spark.sparkContext.parallelize(Seq(largest)) else
        handleDataset(None, lookup(largestInNetData)
          .filter(x => x != largest)
          .map(x => x.copy(confidence_score = x.confidence_score + largest.confidence_score)))
      nos ++ next
    case None =>
      val largest = netData.reduce((a: NetblockObservation, b: NetblockObservation) => if ((a.ip_end - a.ip_start) > (b.ip_end - b.ip_start)) a else b)
      handleDataset(Option(largest), netData)
  }
}
It is a fairly involved bit of code, so here is a general algorithm that I hope will help (there is a rough sketch of the idea after the list):
1. Forget about Spark for a moment and write a Scala function, probably in the companion object for NetblockObservation, that takes a collection of them and returns the subset of that collection that is contained. You should unit test the heck out of this function, and again this is pure Scala.
2. Moving now to Spark. Do a groupBy on your RDD[NetblockObservation] with device_type as the key, producing essentially a map of String to Iterable[NetblockObservation].
3. Filter out all the entries in the map that have a value of size 1 and a confidence below thresh.
4. For the entries that remain, apply your function from the first step to the collections of NetblockObservations with a mapValues.
5. Do a reduceByKey or similar to simply add up the confidence_scores of the contained values.
6. Enjoy a refreshing beverage.
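Purely as a hedged sketch of the overall idea (in PySpark rather than Scala, and simplified so that the containment boosting happens in one pure function per device_type group; `observations` and `thresh` are assumed to already exist):
def boost_contained(group):
    # For each observation, add the confidence of every other observation in the
    # same device_type group whose netblock contains it.
    obs = list(group)
    boosted = []
    for (dev, start, end, conf) in obs:
        extra = sum(c for (d, s, e, c) in obs
                    if (s, e) != (start, end) and s <= start and end <= e)
        boosted.append((dev, start, end, conf + extra))
    return boosted

result = (observations                        # RDD of (device_type, ip_start, ip_end, confidence)
          .groupBy(lambda o: o[0])            # group by device_type
          .flatMapValues(boost_contained)     # boost containees within each group
          .values()
          .filter(lambda o: o[3] >= thresh))  # keep only the confident observations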

For-loop/ simple data extraction and comparison in R

In this example dataset I have created a column called 'Var'. This is the result I would like from the code. The pseudo-code to give Var is like this: for each ID_Survey, compare the Distance values in sequence; if the difference between sequential Distances is 10, then Var=1, otherwise Var=0. Var should be 1 for both elements of the pair whose difference is 10.
#Generate data
ID_Survey=rep(seq(1,3,1),each=4)
Distance= c(0,25,30,40,50,160,170,190,200,210,1000,1010)
Var= c(0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1);
TestData=data.frame(cbind(ID_Survey,Distance,Var))
TestData
I can use a simple for-loop like this, which nearly works, but it trips up when moving between ID_Survey groups.
for(i in 1:(nrow(TestData)-1)){
  TestData$Var2[i]=(TestData$Distance[i+1]==TestData$Distance[i]+10)
}
I need to incorporate the above into a function which splits the data.frame into groups based on ID_Survey. I'm trying to build something like the following...
New6=do.call(rbind, by(TestData, list(TestData$ID_Survey),
    FUN=function(x)
        for (i in nrow(x)){        # loop must build an argument then return it.
            # conditional statements in here.
            return(x[i,])          # loop appears to return 1st argument only.
        }))
... but I can't get the for-loop to operate inside the by-statement.
Any guidance much appreciated. Many thanks.
Using data.table's .SD handles separating and collating the chunks of the data.frame (as defined by ID_Survey) after each chunk has been passed to a function. No doubt someone else will have a more elegant solution, but this seems to do the job:
library(data.table)

ComPair=function(a,b){V=ifelse(a==b-10,TRUE,FALSE);return(V)}

TestFunction=function(FData){
  if(nrow(FData)>1){
    for(i in 1:(nrow(FData)-1)){
      V=ComPair(FData$Distance[i],FData$Distance[i+1])
      if(V==1){ FData$Var2[i]=V; FData$Var2[i+1]=V }
    }
  }
  return(FData)
}

TestData_dt=data.table(TestData)
TestData2=TestData_dt[,TestFunction(.SD),ID_Survey]
TestData2
