Variable types in TensorFlow - tensorflow-datasets

I have a model with several variable types.
boolean flag: 0 or 1
positive float value: strictly greater than zero and known max < 1000
integer: 1 < value < 12
categorical input: "AA","AB",.., "ZZ" - only about 100 values are observed
Integer score as output value
cvs file looks like
"bool","pos_float","int_val","category_name","output_score"
0,1.234,9,"CD",2
1,6.836,5,"KF",6
0,903.836,10,"AZ",4
.....
import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
training_data_df = pd.read_csv("data_training.csv", dtype=float)
X_training = training_data_df.drop('output_score', axis=1).values
Y_training = training_data_df[['output_score']].values
test_data_df = pd.read_csv("data_test.csv", dtype=float)
X_testing = test_data_df.drop('output_score', axis=1).values
Y_testing = test_data_df[['output_score']].values
X_scaler = MinMaxScaler(feature_range=(0, 1))
Y_scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_training = X_scaler.fit_transform(X_training)
Y_scaled_training = Y_scaler.fit_transform(Y_training)
X_scaled_testing = X_scaler.transform(X_testing)
Y_scaled_testing = Y_scaler.transform(Y_testing)
Code above treats each variable as float and scales variables to (0,1). How to tell tensorflow that a variable is an integer? How to treat categorical variables?

For the categorical variables, you're going to need to transform them into a numeric representation, either by a one-hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) or via a hashing trick (https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f).
Essentially, you need to transform those strings into 1/0 boolean values for each feature category.
However, certain model, such as tree-based models like Random Forests and Gradient Boosted Trees, CAN handle multiple categories, so they simply need to be converted to a numeric-category type (you can retain the string values as labels).

Related

Sorting array of struct in Julia

Suppose if I have the following in Julia:
mutable struct emptys
begin_time::Dict{Float64,Float64}; finish_time::Dict{Float64,Float64}; Revenue::Float64
end
population = [emptys(Dict(),Dict(),-Inf) for i in 1:n_pop] #n_pop is a large positive integer value.
for ind in 1:n_pop
r = rand()
append!(population[ind].Revenue, r)
append!(population[ind].begin_time, Dict(r=>cld(r^2,rand())))
append!(population[ind].finish_time, Dict(r=>r^3/rand()))
end
Now I want to sort this population based on the Revenue value. Is there any way for Julia to achieve this? If I were to do it in Python it would be something like this:
sorted(population, key = lambda x: x.Revenue) # The population in Python can be prepared using https://pypi.org/project/ypstruct/ library.
Please help.
There is a whole range of sorting functions in Julia. The key functions are sort (corresponding to Python's sorted) and sort! (corresponding to Python's list.sort).
And as in Python, they have a couple of keyword arguments, one of which is by, corresponding to key.
Hence the translation of
sorted(population, key = lambda x: x.Revenue)
would be
getrevenue(e::emptys) = e.Revenue
sort(population, by=getrevenue)
Or e -> e.Revenue, but having a getter function is good style anyway.

How does keras pass y_pred to loss object/function via model.compile

If I have a function that defines the triple loss (which expects a y_true & y_pred as input parameters), and I "reference or call it" via the following:
model.compile(optimizer="rmsprop", loss=triplet_loss, metrics=[accuracy])
Hows does the y_pred get passed to the triplet_loss function?
For example the triplet_loss function may be:
def triplet_loss(y_true, y_pred, alpha = 0.2):
"""
Implementation of the triplet loss function
Arguments:
y_true -- true labels, required when you define a loss in Keras,
y_pred -- python list containing three objects:
"""
anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
# distance between the anchor and the positive
pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,positive)))
# distance between the anchor and the negative
neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,negative)))
# compute loss
basic_loss = pos_dist-neg_dist+alpha
loss = tf.maximum(basic_loss,0.0)
return loss
Thanks Jon
I did a little bit of poking through the keras source code. In the Model() class:
First they modify the function a bit to take into account weights:
self.loss_functions = loss_functions
weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]
A bit later during training they map their outputs (predictions) to their targets (labels) and call the loss function to get the output_loss. Here y_true and y_pred are passed into your function.
y_true = self.targets[i]
y_pred = self.outputs[i]
weighted_loss = weighted_losses[i]
sample_weight = sample_weights[i]
mask = masks[i]
loss_weight = loss_weights_list[i]
with K.name_scope(self.output_names[i] + '_loss'):
output_loss = weighted_loss(y_true, y_pred,
sample_weight, mask)

Cannot convert object of type 'list' to a numeric value

I am making a pyomo model, where i want to use random numbers for my two dimensional parameters. I put a small python script for random numbers that looks exactly what i wanted to see for my two dimensional parameter. I am getting a TypeError: Cannot convert object of type 'list'(value =[[....]] to a numeric value. in my objective function. Below is my objective function and random numbers script.
model.obj = Objective(expr=sum(model.C[v,l] * model.T[v,l] for v in model.V for l in model.L) + \
sum(model.OC[o,l] * model.D[o,l] for o in model.O for l in model.L), sense=minimize)
import random
C = [[] for i in range(7)]
for i in range(7):
for j in range(5):
C[i]+= [random.randint(100,500)]
model.C = Param(model.V, model.L, initialize=C)
Please let me know if someone can help fixing this.
You should initialize your parameter using a function instead of a nested list
def init_c(m, i, j):
return random.randint(100,500)
model.c = Param(model.V, model.L, initialize=init_c)

Passing & returning a list/array as a parameter/ return type to a UDF in Redshift

I have a bunch of metrics that consume the entire list of float values of a column(think a series of order value on which I a doing some outlier analysis, hence needing the entire array of values) .
Can I pass the entire list as a parameter ? It would be too much data munging, if I were to do this in python entirely. Thoughts ?
# Redshift UDF - the red part is invalid signature & needs a fill
create function Median_absolute_deviation(y <Pass a list, but how? >,threshold float)
--INPUTS:
--a list of order values, -- a threshold
RETURNS <return a list, but how? >
STABLE
AS $
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
return modified_z_score > threshold
$LANGUAGE plpythonu
I can pass the m = np.median(y) from another function (using select statement on the DB) - but again calculating abs_dev & left_mad & right_mad needs the entire series.
Can I use anyelement data type here ? AWS Reference : http://docs.aws.amazon.com/redshift/latest/dg/udf-data-types.html
This is what I tried . Also, I would like to return the value of that column if flag was "0" - but I guess I can do it on 2nd pass ?
create or replace function Median_absolute_deviation(y anyelement ,thresh int)
--INPUTS:
--a list of order values, -- a threshold
-- I tried both float & anyelement return type, but same error
RETURNS float
--OUTPUT:
-- returns the value of order amount if not outlier, else returns 0
STABLE
AS $$
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
flag= 1 if (modified_z_score > thresh ) else 0
return flag
$$LANGUAGE plpythonu
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5;
An error occurred when executing the SQL command:
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5
ERROR: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
Detail:
-----------------------------------------------
error: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 47544645
location: udf_client.cpp:298
process: query6_41 [pid=24744]
-----------------------------------------------
Execution time: 0.73s
1 statement failed.
My end goal is populating tableau views using these computations made via UDF's(the end goal) - so I need something that can interact with tableau and do computations on the fly using a function. Suggestions ?
Redshift only supports scalar UDFs for the time being, which means that you basically CANNOT pass a list as a parameter.
That being said, you can be creative and pass it as a string of numbers separated with a special character and then reconvert it to a list in your udf eg.:
list = [1, 2, 3.5] can be passed as
string_list = "1|2|3.5"
For this to work you need to pre-decide the precision of your numbers and the maximum size of your list, so as to define a varchar of the appropriate length.
It is not the best practice, but it will work.

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above script computes Fst distances on 1000 Genomes population data using Weir and Cokerham's 1984 formula. This formula uses 3 variance components, namely a,b,c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (eg., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Fst for each locus (row) therefore is (without hierarchical design), from res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
--update per your comment--
I do this in Pandas, but any language would do. It's a text replacement exercise. Just get your .012 file into a dataframe and convert as below. I read in row by row into numpy b/c I have tons of snps, but read_csv would work, too.
import pandas as pd
import numpy as np
z12_data = []
for i, line in enumerate(open(z12_file)):
line = line.strip()
line = [int(x) for x in line.split("\t")]
z12_data.append(np.array(line))
if i % 10 == 0:
print i
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)
z12_df.columns = pd.Series(z12_df.columns)-1
hierf_trans = {0:11, 1:12, 2:22, -1:'NA'}
def apply_hierf_trans(series):
return [hierf_trans[x] if x in hierf_trans else x for x in series]
hierf = df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then, you'd read that file hierfstat.txt into R, these are your loci. You'd need to specify your levels in your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele, in this case. May be what you want, maybe not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.

Resources