How to apply an if condition in a GPU DataFrame (cuDF) to filter the DataFrame? - rapids

I'd like to filter a cuDF data frame based on a column value, and then create a new column based on a condition specified. Basically, how can I apply the following in cuDF?
df.loc[df.column_name condition, 'new column name'] = 'value if condition is met'

Given that Pandas statement, the cuDF equivalent is:
# value to be replaced in series
value = 'value if condition is met'
# condition to qualify for replacement
mask = df.column_name condition
# https://docs.rapids.ai/api/cudf/stable/
df['new column name'] = df.masked_assign(value, mask)
Applied Example
"""explanation:
>> if there is no pool, pool_sqft should be 0
"""
# value to be replaced in series
value = 0
# condition to qualify for replacement
mask = df['pool_count']==0
# https://docs.rapids.ai/api/cudf/stable/
df['pool_sqft'] = df.masked_assign(value, mask)

While masked_assign works for certain conditions, applymap is syntactically better and functionally similar to the Pandas API.
Also, @ashwin-srinath mentions that __setitem__() is coming in the 0.9 release, so you'll just be able to do df[condition] = value. masked_assign might go away in favor of __setitem__(), as masked_assign is not a Pandas API function.
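For reference, here is a minimal Pandas sketch of the behavior the cuDF calls above are meant to reproduce. The column values are made up, and the name pool_sqft_clean is only illustrative. Whether a given cuDF release accepts the same calls unchanged is an assumption to check against the cuDF docs for your version.

```python
import pandas as pd

# Made-up sample data: two houses have no pool but a stale pool_sqft value.
df = pd.DataFrame({
    "pool_count": [0, 1, 0, 2],
    "pool_sqft": [120.0, 450.0, 80.0, 600.0],
})

# Condition to qualify for replacement.
mask = df["pool_count"] == 0

# New column: take pool_sqft, but use 0 wherever the mask is True.
df["pool_sqft_clean"] = df["pool_sqft"].mask(mask, 0)

# In-place variant, equivalent to the df.loc[...] statement in the question.
df.loc[mask, "pool_sqft"] = 0

print(df)
```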

You can also use .query()
Example:
expr = "(a == 2) or (b == 3)"
filtered_df = df.query(expr)
where a and b are the names of the columns in the dataframe.
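A small sketch of the same expression in Pandas, whose query() syntax the answer above mirrors. The data is made up, and it is an assumption that your cuDF version evaluates this exact expression the same way.

```python
import pandas as pd

# Made-up data with the two columns named in the expression.
df = pd.DataFrame({"a": [1, 2, 3, 2], "b": [3, 0, 4, 3]})

expr = "(a == 2) or (b == 3)"
filtered_df = df.query(expr)

# Rows where a is 2 or b is 3 are kept; the original index labels survive.
print(filtered_df)
```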

Related

Variable types in TensorFlow

I have a model with several variable types.
boolean flag: 0 or 1
positive float value: strictly greater than zero and known max < 1000
integer: 1 < value < 12
categorical input: "AA","AB",.., "ZZ" - only about 100 values are observed
Integer score as output value
The CSV file looks like:
"bool","pos_float","int_val","category_name","output_score"
0,1.234,9,"CD",2
1,6.836,5,"KF",6
0,903.836,10,"AZ",4
.....
import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
training_data_df = pd.read_csv("data_training.csv", dtype=float)
X_training = training_data_df.drop('output_score', axis=1).values
Y_training = training_data_df[['output_score']].values
test_data_df = pd.read_csv("data_test.csv", dtype=float)
X_testing = test_data_df.drop('output_score', axis=1).values
Y_testing = test_data_df[['output_score']].values
X_scaler = MinMaxScaler(feature_range=(0, 1))
Y_scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_training = X_scaler.fit_transform(X_training)
Y_scaled_training = Y_scaler.fit_transform(Y_training)
X_scaled_testing = X_scaler.transform(X_testing)
Y_scaled_testing = Y_scaler.transform(Y_testing)
The code above treats each variable as a float and scales all variables to (0,1). How do I tell TensorFlow that a variable is an integer? How should categorical variables be treated?
For the categorical variables, you're going to need to transform them into a numeric representation, either by a one-hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) or via a hashing trick (https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f).
Essentially, you need to transform those strings into 1/0 boolean values for each feature category.
However, certain models, such as tree-based models like Random Forests and Gradient Boosted Trees, CAN handle multiple categories, so the values simply need to be converted to a numeric-category type (you can retain the string values as labels).
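As a rough sketch of the one-hot approach, the snippet below encodes the sample rows from the question with pandas.get_dummies. Scaling pos_float and int_val by the bounds stated in the question (max below 1000, values between 1 and 12) is an assumption made here for illustration, not something the answer prescribes.

```python
import io
import pandas as pd

# The three sample rows from the question.
csv_text = '''"bool","pos_float","int_val","category_name","output_score"
0,1.234,9,"CD",2
1,6.836,5,"KF",6
0,903.836,10,"AZ",4
'''

df = pd.read_csv(io.StringIO(csv_text))
features = df.drop(columns="output_score")

# One 0/1 column per observed category; the string column itself is dropped.
encoded = pd.get_dummies(features, columns=["category_name"], dtype=float)

# Assumed scaling: use the known value ranges instead of fitting a scaler.
encoded["pos_float"] = encoded["pos_float"] / 1000.0
encoded["int_val"] = (encoded["int_val"] - 1) / (12 - 1)

# Plain float matrix that can be fed to a model.
X = encoded.to_numpy(dtype="float32")
print(encoded.columns.tolist())
print(X.shape)
```

Categories that appear only in the test file would produce different columns, so in practice the encoded test frame needs to be aligned to the training columns (for example with reindex and a fill value of 0).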

Octave: matrix multiplication over a group

I'd like to simply compute multiplication of two matrices.
But instead of real numbers I'd like to use elements of a finite group in the matrix.
Namely, I want to use elements of F4={0,1,x,1+x} (so I only have 4 possible elements). In this group, addition and multiplication are well-defined, and the relations x^2=1+x, 1+1=0 and x+x=0 hold.
Since I'm a beginner at programming in Octave, I have no idea how to compute operations with something other than real numbers.
My idea was that if it's possible to define some operations on a certain set of elements (here F4), then maybe it's possible to use these operations when multiplying matrices.
I think the most efficient way to do arithmetic with a finite group of possible values and non-standard addition and multiplication is by table lookup.
Table lookup requires matrices to be encoded such that the elements are indices into the list of group elements. And since indexing starts at 1, you'll need to represent {0,1,x,x+1} as {1,2,3,4}.
But aside from the awkward mapping of 1=0, 2=1, things are quite straightforward with table lookup. This is some example code I cooked up. It seems to work, but I might have made a mistake (and I might have misunderstood the exact arithmetic rules):
function out = group_mtimes(lhs,rhs)
  [I,K] = size(lhs);
  [K2,J] = size(rhs);
  if K~=K2, error('Inner dimensions must agree'), end
  out = zeros(I,J);
  for j=1:J
    for i=1:I
      v = 1; % index 1 encodes the element 0, the additive identity
      for k=1:K
        v = group_scalar_add(v, group_scalar_times(lhs(i,k),rhs(k,j)));
      end
      out(i,j) = v;
    end
  end
  disp('lhs = ')
  group_print(lhs)
  disp('rhs = ')
  group_print(rhs)
  disp('lhs * rhs = ')
  group_print(out)
end

function group_print(in)
  names = {'0','1','x','1+x'};
  disp(names(in)) % Quick-and-dirty, can be done much better!
end

function out = group_scalar_add(lhs,rhs)
  % Addition table; rows and columns are ordered 0, 1, x, 1+x
  table = [
    1,2,3,4
    2,1,4,3
    3,4,1,2
    4,3,2,1
  ];
  out = table(lhs,rhs);
end

function out = group_scalar_times(lhs,rhs)
  % Multiplication table; rows and columns are ordered 0, 1, x, 1+x
  table = [
    1,1,1,1
    1,2,3,4
    1,3,4,2
    1,4,2,3
  ];
  out = table(lhs,rhs);
end
For example:
>> lhs=[1,2,3,4;2,3,1,4]';
>> rhs=[2,3;4,1];
>> group_mtimes(lhs,rhs);
lhs =
'0' '1'
'1' 'x'
'x' '0'
'1+x' '1+x'
rhs =
'1' 'x'
'1+x' '0'
lhs * rhs =
'1+x' '0'
'0' 'x'
'x' '1+x'
'1' '1'
There is no input checking in this code. If the input contains a 5, you'll get an index out-of-range error.
As I mentioned in a comment, you could make a class that encapsulates arrays of this type. You could then overload plus, times and mtimes (for operators +, .* and *, respectively), as well as disp to write out the values properly. You would define the constructor so that objects of this class always have valid values, which would prevent lookup table indexing errors. Such a class would make working with these functions a lot simpler.
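To illustrate the class idea, here is a minimal sketch in Python rather than Octave. The class name F4Matrix and its layout are hypothetical, and it encodes {0,1,x,1+x} as 0..3 (0-based) instead of the 1..4 used above. The lookup tables are the same ones as in the Octave code, shifted down by one.

```python
# Lookup tables for F4 with the encoding 0 -> 0, 1 -> 1, 2 -> x, 3 -> 1+x.
ADD = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]
MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]
NAMES = ["0", "1", "x", "1+x"]


class F4Matrix:
    """Hypothetical wrapper: a matrix whose entries are codes 0..3 for F4."""

    def __init__(self, rows):
        rows = [list(r) for r in rows]
        if not rows or not rows[0] or any(len(r) != len(rows[0]) for r in rows):
            raise ValueError("rows must be non-empty and of equal length")
        # Validating here means the table lookups below can never go out of range.
        if any(not isinstance(v, int) or not 0 <= v <= 3 for r in rows for v in r):
            raise ValueError("entries must be integer codes 0..3")
        self.rows = rows

    def __add__(self, other):
        # Element-wise addition, the counterpart of overloading plus.
        if len(self.rows) != len(other.rows) or len(self.rows[0]) != len(other.rows[0]):
            raise ValueError("shapes must agree")
        return F4Matrix([[ADD[a][b] for a, b in zip(ra, rb)]
                         for ra, rb in zip(self.rows, other.rows)])

    def __matmul__(self, other):
        # Matrix product, the counterpart of overloading mtimes.
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions must agree")
        out = []
        for row in self.rows:
            out_row = []
            for j in range(len(other.rows[0])):
                acc = 0  # code 0 is the additive identity
                for k, a in enumerate(row):
                    acc = ADD[acc][MUL[a][other.rows[k][j]]]
                out_row.append(acc)
            out.append(out_row)
        return F4Matrix(out)

    def __str__(self):
        return "\n".join(" ".join(NAMES[v] for v in r) for r in self.rows)


# Same matrices as the Octave example, shifted to the 0-based encoding.
lhs = F4Matrix([[0, 1], [1, 2], [2, 0], [3, 3]])
rhs = F4Matrix([[1, 2], [3, 0]])
product = lhs @ rhs
print(product)
```

Because invalid codes are rejected in the constructor, every object that reaches the arithmetic methods is safe to use as a table index.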
For the special case of Galois fields of even characteristic, such as F4, you can use the functions provided by the communications package from Octave Forge:
Functions reference: Galois Fields of Even Characteristic
Galois fields of odd characteristic are not implemented yet:
Functions reference: Galois Fields of Odd Characteristic

How to save Julia for loop returns in an array or dataframe?

I am trying to apply a function over each row of a DataFrame as the code shows.
using RDatasets
iris = dataset("datasets", "iris")
function mean_n_var(x)
mean1=mean([x[1], x[2], x[3], x[4]])
var1=var([x[1], x[2], x[3], x[4]])
rst=[mean1, var1]
return rst
end
mean_n_var([2,4,5,6])
for row in eachrow(iris[1:4])
println(mean_n_var(convert(Array, row)))
end
However, instead of printing results, I'd like to save them in an array or another DataFrame.
Thanks in advance.
I thought it worth mentioning a few more options beyond what was already mentioned.
I assume you want a Matrix or a DataFrame. There are several possible approaches.
The first is the most direct way to get a Matrix:
mean_n_var(a) = [mean(a), var(a)]
hcat((mean_n_var(Array(x)) for x in eachrow(iris[1:4]))...) # rows
vcat((mean_n_var(Array(x)).' for x in eachrow(iris[1:4]))...) # cols
Another possible approach is vectorized, e.g.:
mat_iris = Matrix(iris[1:4])
mat = hcat(mean(mat_iris, 2), var(mat_iris, 2))
df = DataFrame([vec(f(mat_iris, 2)) for f in [mean,var]], [:mean, :var])
DataFrame(mat) # this constructor also accepts variable names on master but is not released yet

Passing & returning a list/array as a parameter/ return type to a UDF in Redshift

I have a bunch of metrics that consume the entire list of float values of a column (think of a series of order values on which I am doing some outlier analysis, hence needing the entire array of values).
Can I pass the entire list as a parameter? It would be too much data munging if I were to do this entirely in Python. Thoughts?
# Redshift UDF - the red part is invalid signature & needs a fill
create function Median_absolute_deviation(y <Pass a list, but how? >,threshold float)
--INPUTS:
--a list of order values, -- a threshold
RETURNS <return a list, but how? >
STABLE
AS $$
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
return modified_z_score > threshold
$$ LANGUAGE plpythonu
I can pass m = np.median(y) in from another function (using a select statement on the DB), but calculating abs_dev, left_mad and right_mad still needs the entire series.
Can I use the anyelement data type here? AWS reference: http://docs.aws.amazon.com/redshift/latest/dg/udf-data-types.html
This is what I tried. Also, I would like to return the value of that column if the flag was "0", but I guess I can do that in a second pass?
create or replace function Median_absolute_deviation(y anyelement ,thresh int)
--INPUTS:
--a list of order values, -- a threshold
-- I tried both float & anyelement return type, but same error
RETURNS float
--OUTPUT:
-- returns the value of order amount if not outlier, else returns 0
STABLE
AS $$
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
flag= 1 if (modified_z_score > thresh ) else 0
return flag
$$LANGUAGE plpythonu
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5;
An error occurred when executing the SQL command:
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5
ERROR: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
Detail:
-----------------------------------------------
error: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 47544645
location: udf_client.cpp:298
process: query6_41 [pid=24744]
-----------------------------------------------
Execution time: 0.73s
1 statement failed.
My end goal is populating Tableau views using these computations made via UDFs, so I need something that can interact with Tableau and do computations on the fly using a function. Suggestions?
Redshift only supports scalar UDFs for the time being, which means that you basically CANNOT pass a list as a parameter.
That being said, you can be creative and pass it as a string of numbers separated by a special character, then convert it back to a list inside your UDF, e.g.:
list = [1, 2, 3.5] can be passed as
string_list = "1|2|3.5"
For this to work you need to pre-decide the precision of your numbers and the maximum size of your list, so as to define a varchar of the appropriate length.
It is not the best practice, but it will work.
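A minimal sketch of what the Python body of such a UDF could do with the packed string. The function name mad_outlier_flags, the "|" separator default and the choice to return the flags as another packed string are all assumptions for illustration. The arithmetic follows the question's code, but uses the standard library instead of numpy so that it runs anywhere.

```python
from statistics import median


def mad_outlier_flags(packed, threshold, sep="|"):
    """Unpack 'v1|v2|...', flag outliers, and return the flags as '0|1|...'."""
    values = [float(tok) for tok in packed.split(sep) if tok]
    if not values:
        return ""

    m = median(values)
    abs_dev = [abs(v - m) for v in values]

    # Separate MADs for the values at or below and at or above the median,
    # as in the question's code.
    left_mad = median(d for v, d in zip(values, abs_dev) if v <= m)
    right_mad = median(d for v, d in zip(values, abs_dev) if v >= m)

    flags = []
    for v, d in zip(values, abs_dev):
        if v == m:
            score = 0.0
        else:
            mad = left_mad if v < m else right_mad
            # Assumption: a zero MAD with a non-zero deviation counts as an outlier.
            score = 0.6745 * d / mad if mad else float("inf")
        flags.append("1" if score > threshold else "0")
    return sep.join(flags)


result = mad_outlier_flags("10|11|12|13|100", 3)
print(result)
```

In Redshift, the argument and return value would both be declared as varchar, which is why the maximum list size and number precision have to be decided up front.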

Using AddExpression / MathExpression in Weka

I am working on a very basic WEKA assignment, and I'm trying to use WEKA to preprocess data from the GUI (most current version). I am trying to do very basic if statements and mathematical statements in the expression box when double clicking on MathExpression and I haven't had any success. For example I want to do
if (a5 == 2 || a5 == 0) then y = 1; else y = 0
Many different variations of this haven't worked for me and I'm also unclear on how to refer to "y" or if it needs a reference within the line.
Another example is -abs(log(a7)-3), which I wasn't able to work out either. Any ideas about how to make these statements work?
From the javadoc of MathExpression:
The 'A' letter refers to the value of the attribute being processed. Other attribute values (numeric only) can be accessed through the variables A1, A2, A3, ...
Your filter applies to all attributes of your dataset. If I load the iris dataset and apply the following filter:
weka.filters.unsupervised.attribute.MathExpression -E log(A)
the values of the sepallength attribute change as follows:
          Before Filter   After Filter
Minimum   4.3             1.459
Maximum   7.9             2.067
Mean      5.843           1.755
StdDev    0.828           0.141
Also, if you look at the javadoc, there is no if-else construct, but there is an ifelse function. Therefore you should write something like
ifelse ( (A == 2 || A == 0), 1,0 )
Note that this filter applies to all attributes. If you want to change only one attribute, based on the values of other attributes, then you need to use the "Ignore range" option and use A1, A2, ... to refer to the other attribute values.
If you need to add a new attribute, use AddExpression:
An instance filter that creates a new attribute by applying a mathematical expression to existing attributes.
