Passing and returning a list/array as a parameter or return type of a UDF in Redshift

I have a bunch of metrics that consume the entire list of float values in a column (think of a series of order values on which I am doing some outlier analysis, hence needing the entire array of values).
Can I pass the entire list as a parameter? It would be too much data munging if I were to do this entirely in Python. Thoughts?
-- Redshift UDF: the <...> placeholders are an invalid signature and need filling in
create function Median_absolute_deviation(y <Pass a list, but how? >,threshold float)
--INPUTS:
--a list of order values, -- a threshold
RETURNS <return a list, but how? >
STABLE
AS $$
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
return modified_z_score > threshold
$$ LANGUAGE plpythonu
I can pass m = np.median(y) in from another function (using a select statement on the DB), but calculating abs_dev, left_mad and right_mad still needs the entire series.
Can I use the anyelement data type here? AWS reference: http://docs.aws.amazon.com/redshift/latest/dg/udf-data-types.html
This is what I tried. Also, I would like to return the value of that column if the flag is 0, but I guess I can do that in a second pass?
create or replace function Median_absolute_deviation(y anyelement ,thresh int)
--INPUTS:
--a list of order values, -- a threshold
-- I tried both float & anyelement return type, but same error
RETURNS float
--OUTPUT:
-- returns the value of order amount if not outlier, else returns 0
STABLE
AS $$
import numpy as np
m = np.median(y)
abs_dev = np.abs(y - m)
left_mad = np.median(abs_dev[y<=m])
right_mad = np.median(abs_dev[y>=m])
y_mad = np.zeros(len(y))
y_mad[y < m] = left_mad
y_mad[y > m] = right_mad
modified_z_score = 0.6745 * abs_dev / y_mad
modified_z_score[y == m] = 0
flag= 1 if (modified_z_score > thresh ) else 0
return flag
$$LANGUAGE plpythonu
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5;
An error occurred when executing the SQL command:
select Median_absolute_deviation(price,3) from my_table where price >0 limit 5
ERROR: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
Detail:
-----------------------------------------------
error: IndexError: invalid index to scalar variable.. Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 47544645
location: udf_client.cpp:298
process: query6_41 [pid=24744]
-----------------------------------------------
Execution time: 0.73s
1 statement failed.
My end goal is to populate Tableau views using computations made via UDFs, so I need something that can interact with Tableau and do computations on the fly using a function. Suggestions?

Redshift only supports scalar UDFs for the time being, which means that you basically CANNOT pass a list as a parameter.
That being said, you can be creative and pass it as a string of numbers separated by a special character, then convert it back to a list inside your UDF, e.g.:
list = [1, 2, 3.5] can be passed as
string_list = "1|2|3.5"
For this to work you need to decide in advance the precision of your numbers and the maximum size of your list, so that you can define a varchar of the appropriate length.
It is not best practice, but it will work.
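As a rough illustration of that workaround, here is a standalone Python sketch of the "unpack the string, then score" step. It is not a Redshift UDF as written: the names parse_values and double_mad_flags and the "|" separator are assumptions made for this example, and it uses the Python 3 statistics module instead of numpy. The scoring follows the formula from the question.

```python
from statistics import median


def parse_values(packed, sep="|"):
    """Turn a delimited string such as "1|2|3.5" back into a list of floats."""
    return [float(tok) for tok in packed.split(sep) if tok.strip()]


def double_mad_flags(values, threshold):
    """Flag outliers using separate left/right median absolute deviations.

    Returns one boolean per input value (True means outlier).
    """
    m = median(values)
    abs_dev = [abs(v - m) for v in values]
    left_mad = median([d for v, d in zip(values, abs_dev) if v <= m])
    right_mad = median([d for v, d in zip(values, abs_dev) if v >= m])

    flags = []
    for v, d in zip(values, abs_dev):
        if v == m:
            # The question sets the score to 0 at the median itself.
            flags.append(False)
            continue
        mad = left_mad if v < m else right_mad
        if mad == 0:
            # A zero MAD with a non-zero deviation is an infinite score.
            flags.append(True)
        else:
            flags.append(0.6745 * d / mad > threshold)
    return flags


values = parse_values("10|11|12|13|14|100")
flags = double_mad_flags(values, 3.5)
print(flags)  # [False, False, False, False, False, True]
```

Inside an actual UDF the same parsing step would come first, and the result would have to be returned as a scalar, for example as another delimited string.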


Access Values of Cell Array deeply nested within Structure Array

I have a nested structure_array/cell_array/structure_array of character values which is the result of a web query which returns a converted JSON object, the needed numeric value(s) of which I can access in loops thus:
for ix = 1 : size( S.orderBook.buckets , 2 )
if ( str2double( S.orderBook.buckets{ ix }.price ) >= str2double( S.orderBook.price ) )
mid_ix = ix ;
break ;
endif
endfor
The above loop gets the index, mid_ix, of the cell in the middle of the region of interest, and
orderbook_begin_ix = mid_ix - 20 ; orderbook_end_ix = mid_ix + 20 ;
jj = 0 ;
for ix = orderbook_begin_ix : orderbook_end_ix
jj = jj + 1 ;
new_orderbook_data( 1 , jj ) = str2double( S.orderBook.buckets{ ix }.longCountPercent ) ;
endfor
this second loop fills the pre-initialised matrix, new_orderbook_data, with the values of interest.
However, I was wondering whether there is a quicker/more elegant way of getting these values? At the moment, as can be seen above, I am having to run a "look up" for loop that encloses an "if statement" to get in the ballpark of the required numeric values, and then run a second for loop in the region of the ballpark to extract these required values.
Note: cross posted at Octave forum
I think I have solved this by using the syntax below:
prices = cellfun( @str2double , { [ S.orderBook.buckets{:} ].price } ) ;
which gives me a matrix "prices" to which I can further apply vectorised code.
Explanation:
the { : } expands the cell array into a comma separated list of its elements,
the enclosing [ ] concatenates this list into a structure array,
the [ ].price extracts just the prices, which are then put back into a cell array by the outermost enclosing { },
the string values are then converted to numeric by applying cellfun with str2double to this prices cell array,
and the results are finally assigned to the "prices" matrix.
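For comparison only, the same "extract everything at once, then work on the numeric vector" idea can be sketched in Python. This is an analogue, not Octave code: the function name window_around_price and the sample data are made up, and the field names price and longCountPercent are taken from the question. Unlike the loops in the question, this sketch clamps the window to the ends of the list.

```python
def window_around_price(order_book, half_width=20):
    """Return longCountPercent values in a window around the current price."""
    buckets = order_book["buckets"]
    # Convert all bucket prices in one pass instead of inside a search loop.
    prices = [float(b["price"]) for b in buckets]
    current = float(order_book["price"])
    # Index of the first bucket whose price is at or above the current price.
    # Raises StopIteration if no such bucket exists.
    mid_ix = next(i for i, p in enumerate(prices) if p >= current)
    lo = max(mid_ix - half_width, 0)
    hi = min(mid_ix + half_width, len(buckets) - 1)
    return [float(b["longCountPercent"]) for b in buckets[lo:hi + 1]]


# Made-up sample shaped like the decoded JSON described in the question.
sample = {
    "price": "1.25",
    "buckets": [
        {"price": "1.0", "longCountPercent": "0.1"},
        {"price": "1.1", "longCountPercent": "0.2"},
        {"price": "1.2", "longCountPercent": "0.3"},
        {"price": "1.3", "longCountPercent": "0.4"},
        {"price": "1.4", "longCountPercent": "0.5"},
    ],
}
window = window_around_price(sample, half_width=1)
print(window)  # [0.3, 0.4, 0.5]
```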

Why does my "random key code" create the same key every once in a while?

I have the following code, from which I extract randomPW for my db.
I need this string of random characters in order to use it as a primary key in my Db. The problem is that I'm getting quite a lot of duplicates when I execute this code more than once, or if I use a loop to extract (for example) 100 keys at once.
If I reload the page in order to insert the keys one by one, the same problem occurs: every 50-80 reloads there is a duplicate. What's wrong with my code?
<%
Function RandomPW(myLength)
Const minLength = 6
Const maxLength = 20
Dim X, Y, strPW
If myLength = 0 Then
Randomize
myLength = Int((maxLength * Rnd) + minLength)
End If
For X = 1 To myLength
Y = Int((3 * Rnd) + 1) '(1) Numeric, (2) Uppercase, (3) Lowercase
Select Case Y
Case 1
'Numeric character
Randomize
strPW = strPW & CHR(Int((9 * Rnd) + 48))
Case 2
'Uppercase character
Randomize
strPW = strPW & CHR(Int((25 * Rnd) + 65))
Case 3
'Lowercase character
Randomize
strPW = strPW & CHR(Int((25 * Rnd) + 97))
End Select
Next
RandomPW = strPW
End Function
%>
I expect my code to produce a string that does not repeat every now and then.
I need this string of random characters in order to use it as a primary key in my Db.
In this case I would recommend using Scriptlet.TypeLib:
Function RandomPW(myLength)
Set TypeLib = CreateObject("Scriptlet.TypeLib")
If myLength < Len(TypeLib.Guid) Then
RandomPW = Left(TypeLib.Guid, myLength)
Else
RandomPW = TypeLib.Guid
End If
End Function
Randomize is not supposed to be used more than once, unless you want to make sure you are creating fake, repeatable randomness. Per docs, helpfully linked by Lankymart (emphasis mine):
Randomize uses number to initialize the Rnd function's random-number generator, giving it a new seed value. If you omit number, the value returned by the system timer is used as the new seed value.
The system timer referred to above is in seconds, which means that calls to Randomize in quick succession reseed the generator with the same value, so the Rnd calls that follow yield the same values.
It would likely help you immensely to remove the repeated calls to Randomize, keeping at most one.
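The reseeding effect is not specific to VBScript. As a small sketch in Python (an analogue, with the seed value 12345 and the helper name draw_after_reseed chosen arbitrarily for illustration), reseeding with the same value before drawing repeats the output, while seeding a generator once and continuing to draw does not:

```python
import random


def draw_after_reseed(seed, n=20):
    """Reseed the shared generator, then draw n values."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


# Reseeding with the same value before each batch repeats the batch exactly,
# which is what happens when the timer-based seed has not changed yet.
first = draw_after_reseed(12345)
second = draw_after_reseed(12345)
print(first == second)  # True

# Seeding once and drawing repeatedly keeps advancing the sequence.
rng = random.Random(12345)
batch_a = [rng.random() for _ in range(20)]
batch_b = [rng.random() for _ in range(20)]
print(batch_a == batch_b)  # False
```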

How to apply if condition in GPU DataFrame- cuDF to filter the DataFrame?

I'd like to filter a cuDF data frame based on a column value, and then create a new column based on a condition specified. Basically, how can I apply the following in cuDF?
df.loc[df.column_name condition, 'new column name'] = 'value if condition is met'
Given Pandas in cuDF
# value to be replaced in series
value = 'value if condition is met'
# condition to qualify for replacement
mask = df.column_name condition
# https://docs.rapids.ai/api/cudf/stable/
df['new column name'] = df.masked_assign(value, mask)
Applied Example
"""explanation:
>> if there is no pool, pool_sqft should be 0
"""
# value to be replaced in series
value = 0
# condition to qualify for replacement
mask = df_train['pool_count']==0
# https://docs.rapids.ai/api/cudf/stable/
df['pool_sqft'] = df.masked_assign(value, mask)
While masked_assign works for certain conditions, applymap is syntactically better and functionally similar to the Pandas API.
Also, @ashwin-srinath mentions that __setitem__() is coming in the 0.9 release, so you'll just be able to do df[condition] = value. masked_assign might be going away in favor of just __setitem__(), as masked_assign is not a Pandas API function.
You can also use .query()
Example:
expr = "(a == 2) or (b == 3)"
filtered_df = df.query(expr)
where a and b are the names of the columns in the dataframe.

How to create the equivalent of a HashMap<Int, Int[]> in Lua

I would like to have a simple data structure in lua resembling a Java HashMap equivalent.
The purpose of this is to maintain a unique key 'userID' mapped to a pair of values which get constantly updated, for example:
'77777', {254, 24992}
Any suggestions as to how can I achieve this?
-- Individual Aggregations
local dictionary = ?
-- Other Vars
local sumCount = 0
local sumSize = 0
local matches = redis.call('KEYS', query)
for _,key in ipairs(matches) do
local val = redis.call('GET', key)
local count, size = val:match('([^:]+):([^:]+)')
topUsers(string.sub(key, 11, 15), sumCount, sumSize)
-- Global Count and Size for the Query
sumCount = sumCount + tonumber(count)
sumSize = sumSize + tonumber(size)
end
local result = string.format('%s:%s', sumCount, sumSize)
return result;
-- Users Total Data Aggregations
function topUsers()
-- Do sums for each user
end
Assuming that dictionary is what you are asking about:
local dictionary = {
['77777'] = {254, 24992},
['88888'] = {253, 24991},
['99999'] = {252, 24990},
}
The tricky part is that the key is a string that can't be used as a Lua variable name, so you must surround each key with []. I can't find a clear description of the rule for this in the Lua 5.1 reference manual, but the Lua wiki says that only if a key "consists of underscores, letters, and numbers, but doesn't start with a number" can the [] be omitted when it is defined in the above manner; otherwise the square brackets are required.
Just use a Lua table indexed by userID whose values are Lua tables with two entries:
T['77777']={254, 24992}
This is a possible implementation of the solution.
local usersTable = {}
function topUsers(key, count, size)
if usersTable[key] then
usersTable[key][1] = usersTable[key][1] + count
usersTable[key][2] = usersTable[key][2] + size
else
usersTable[key] = {count, size}
end
end
function printTable(t)
for key,value in pairs(t) do
print(key, value[1], value[2])
end
end

Granger Test (Python) Error message - TypeError: unsupported operand type(s) for -: 'str' and 'int'

I am trying to run a granger causality test on two currency pairs but I seem to get this error message in Shell whenever I try and test it. Can anyone please advise?
I am very new to programming and need this to run an analysis for my project. In shell, I am putting -
import ats15
ats15.grangertest('EURUSD', 'EURGBP', 8)
What is going wrong? I have copied the script below.
Thanks in advance.
def grangertest(Y, X, maxlag):
    """
    Performs a Granger causality test on variables (vectors) Y and X.
    The null hypothesis is: Does X cause Y ?
    Returned value: pvalue, F, df1, df2, RSSr, RSSu
    """
    # ols and stats are not defined in this snippet; they are assumed to be
    # imported elsewhere in ats15.
    # Create linear model involving Y lags only.
    n = len(Y)
    if n != len(X):
        raise ValueError("grangertest: incompatible Y,X vectors")
    M = [[0] * maxlag for i in range(n - maxlag)]
    for i in range(maxlag, n):
        for j in range(1, maxlag + 1):
            M[i - maxlag][j - 1] = Y[i - j]
    fit = ols(M, Y[maxlag:])
    RSSr = fit.RSS
    # Create linear model including X lags.
    for i in range(maxlag, n):
        xlagged = [X[i - j] for j in range(1, maxlag + 1)]
        M[i - maxlag].extend(xlagged)
    fit = ols(M, Y[maxlag:])
    RSSu = fit.RSS
    df1 = maxlag
    df2 = n - 2 * maxlag - 1
    F = ((RSSr - RSSu) / df1) / (RSSu / df2)
    pvalue = 1.0 - stats.f.cdf(F, df1, df2)
    return pvalue, F, df1, df2, RSSr, RSSu
You didn't post the full traceback, but this error message:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
means what it says. There's an operator, - (subtraction), and it doesn't know how to subtract an integer from a string. Why would strings be involved? Well, you're calling the function with:
ats15.grangertest('EURUSD', 'EURGBP', 8)
and so you're giving grangertest two strings and an integer. But it seems like grangertest expects
def grangertest(Y,X,maxlag):
two sequences (lists, arrays, whatever) of numbers to use as Y and X, not strings. If EURUSD and EURGBP are names you've given to lists beforehand, then you don't need the quotes:
ats15.grangertest(EURUSD, EURGBP, 8)
but if not, then you should pass grangertest the lists under whatever name you've called them.
The input to the grangertest function must be two lists of numbers. grangertest doesn't know anything about currencies, so passing it currency strings won't work.
You have to fetch the exchange rate data somehow so that you can pass it to grangertest. If EURUSD and EURGBP are variables, then you don't put quotes around them when you pass them to a function (e.g. ats15.grangertest(EURUSD, EURGBP, 8)).
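To make the difference concrete, here is a minimal sketch that reuses only the lag-matrix step from the question. The helper name build_lag_matrix and the rates dictionary with its numbers are made up for illustration; real exchange rate data would have to be loaded from wherever you keep it.

```python
def build_lag_matrix(y, maxlag):
    """Rows of lagged values, mirroring the first loop in grangertest."""
    return [[y[i - j] for j in range(1, maxlag + 1)]
            for i in range(maxlag, len(y))]


# Hypothetical, made-up series keyed by currency pair name.
rates = {
    "EURUSD": [1.10, 1.11, 1.09, 1.12, 1.13],
    "EURGBP": [0.85, 0.86, 0.84, 0.87, 0.88],
}

# Passing the list of numbers gives rows of numbers.
good = build_lag_matrix(rates["EURUSD"], 2)

# Passing the name itself "works" at first, because a string is also a
# sequence, but the rows now hold single characters such as 'U' and 'E'.
bad = build_lag_matrix("EURUSD", 2)

# Any arithmetic on those characters then fails with the reported error.
try:
    bad[0][0] - 1
except TypeError as exc:
    message = str(exc)
    print(message)
```

The call from the question would then look like ats15.grangertest(rates["EURUSD"], rates["EURGBP"], 8), assuming the series have been loaded into such a dictionary first.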
