How to groupby with custom function in python cuDF? - rapids

I am new to using GPUs for data manipulation and have been struggling to replicate some functions in cuDF. For instance, I want to get the mode value for each group in the dataset. In pandas this is easily done with a custom function:
df = pd.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                   'value': [10, 10, 30, 20, 20, 10, 30]})
| group | value |
| ----- | ----- |
| 1 | 10 |
| 2 | 10 |
| 2 | 30 |
| 1 | 20 |
| 3 | 20 |
| 1 | 10 |
| 2 | 30 |
def get_mode(customer):
    freq = {}
    for category in customer:
        freq[category] = freq.get(category, 0) + 1
    key = max(freq, key=freq.get)
    return [key, freq[key]]
df.groupby('group').agg(get_mode)
| group | value |
| ----- | ----- |
| 1 | 10 |
| 2 | 30 |
| 3 | 20 |
However, I just can't seem to replicate the same functionality in cuDF. There does seem to be a way to write grouped UDFs, and I have found some examples, but it doesn't work in my case. For example, this is the function I tried to use with cuDF:
from numba import cuda
import numpy as np

def get_mode(group, mode):
    print(group)
    freq = {}
    for i in range(cuda.threadIdx.x, len(group), cuda.blockDim.x):
        category = group[i]
        freq[category] = freq.get(category, 0) + 1
    mode = max(freq, key=freq.get)
    max_freq = freq[mode]

df.groupby('group').apply_grouped(get_mode, incols=['group'],
                                  outcols=dict(mode=np.float64))
Can someone please help me understand what is going wrong here, and how to fix it? Attempting to run the code above throws the following error (hopefully I managed to put it under the spoiler):
Error code
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_get at 0x7fa8f0500710>) found for signature:
>>> impl_get(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))
There are 2 candidate implementations:
- Of which 1 did not match due to:
Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
With argument(s): '(DictType[undefined,undefined]<iv=None>, int64, int64)':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type DictType[undefined,undefined]<iv=None>
During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
def impl(dct, key, default=None):
castedkey = _cast(key, keyty)
^
raised from /opt/conda/lib/python3.7/site-packages/numba/core/typeinfer.py:1086
- Of which 1 did not match due to:
Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
With argument(s): '(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type DictType[undefined,undefined]<iv={}>
During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
def impl(dct, key, default=None):
castedkey = _cast(key, keyty)
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.DictType'>, 'get') for DictType[undefined,undefined]<iv={}>)
During: typing of call at /tmp/ipykernel_33/2595976848.py (6)
File "../../tmp/ipykernel_33/2595976848.py", line 6:
<source missing, REPL/exec in use?>
During: resolving callee type: type(<numba.cuda.compiler.Dispatcher object at 0x7fa8afe49520>)
During: typing of call at <string> (10)
File "<string>", line 10:
<source missing, REPL/exec in use?>

cuDF builds on top of Numba's CUDA target to enable UDFs. This doesn't support using a dictionary in the UDF, but your use case can be expressed with built-in operations in either pandas or cuDF by combining value_counts and drop_duplicates.
import pandas as pd

df = pd.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    df
    .value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
)
print(out[["group", "value"]])
group value
4 3 20
1 2 30
0 1 10
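The same chain should carry over to the GPU by swapping pandas for cuDF. A minimal sketch, assuming your cuDF version is recent enough to provide DataFrame.value_counts and Series.reset_index(name=...) (older releases may lack them):

import cudf

# Same mode-per-group trick as above, on the GPU. Assumes a recent cuDF
# release that mirrors this part of the pandas API.
gdf = cudf.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    gdf
    .value_counts()                                    # count each (group, value) pair
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)  # most frequent value first per group
    .drop_duplicates(subset="group", keep="first")     # keep the mode row per group
)
print(out[["group", "value"]])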

This is probably not the answer you are looking for, but I found a workaround for mode. It isn't the best way, it doesn't use the GPU, and it can be quite slow.
import pandas as pd
import cudf

df = cudf.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                     'value': [10, 10, 30, 20, 20, 10, 30]})
df.to_pandas().groupby('group').agg({'value': pd.Series.mode})

Related

How to create structure of table for best query answers and performance in Cassandra?

I have the following situation: I need to redesign a Cassandra keyspace by restructuring multiple tables. The existing structure is the following:
token#cqlsh> describe series;
CREATE TABLE series (
    type text,
    name text,
    as_of timestamp,
    data text,
    hash text,
    PRIMARY KEY ((type, name, as_of))
)
Instead of having the following structure:
 type | name   | as_of                           | data  | hash
------+--------+---------------------------------+-------+------------------------------------------------------------------
 OP   | LU_STC | 2022-09-30 00:00:00.000000+0000 | {...} | cf4e383a370416ebbdec8a2f2ca4e28982cd9f871a9f540e4f4f6c8807a7015a
where the data field contains more than 50k entries:
{
"type":"OP",
"name":"LU_STC",
"as_of":"2022-09-30",
"data":[
{
"year":2022,
"month":9,
"day":30,
"hour":0,
"quarter":3,
"week":39,
"wk_year":2022,
"is_peak":0,
"value":4.689399994443761
},
...
50k entries
...
{
"year":2022,
"month":9,
"day":30,
"hour":1,
"quarter":3,
"week":39,
"wk_year":2022,
"is_peak":0,
"value":12.761399943614606
}
],
"hash":"cf4e383a370416ebbdec8a2f2ca4e28982cd9f871a9f540e4f4f6c8807a7015a"
}
I want a table (series_test) where each entry of the data field is written as its own row, with the following structure:
token#cqlsh> select * from series_test where name = 'DE_STC' and as_of = '2022-09-30' and date <= '2022-10-01 00:00:00.000000+0000' ALLOW FILTERING;
name | as_of | date | commodity | type | is_peak | quarter | value | week | wk_year
--------+---------------------------------+---------------------------------+-----------+------+---------+---------+---------+------+---------
DE_STC | 2022-09-30 00:00:00.000000+0000 | 2022-09-30 00:00:00.000000+0000 | power | OP | False | 3 | -1.0377 | 39 | 2022
DE_STC | 2022-09-30 00:00:00.000000+0000 | 2022-10-01 00:00:00.000000+0000 | power | OP | False | 4 | 31.0728 | 39 | 2022
The question is: how can I build the partition key so that it is unique, while still being able to query sets of rows based on the date column (timestamp) to get records only for a given period?
Is the following structure correct?
token#cqlsh> describe series_test;
CREATE TABLE series_test (
    name text,
    as_of timestamp,
    date timestamp,
    commodity text,
    type text,
    is_peak boolean,
    quarter int,
    value float,
    week int,
    wk_year int,
    PRIMARY KEY ((name, as_of, date), commodity, type)
)

Eloquent how to get Sum of a column separated by 2 has many relationships

I am trying to get the sum of a column of another table that is 2 hasMany relationships away. Ideally I would like to use eloquent and not use raw queries or joins.
Table structure
factions table, example rows:
| id |
| -- |
| 1 |
| 2 |
| 3 |
islands table, example rows:
| id | faction_id |
| -- | ---------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 1 |
| 5 | 2 |
| 6 | 3 |
shrines table (points is an integer), example rows:
| id | island_id | points |
| -- | --------- | ------ |
| 1 | 1 | 200 |
| 2 | 2 | 100 |
| 3 | 3 | 50 |
| 4 | 4 | 75 |
| 5 | 5 | 200 |
| 6 | 6 | 100 |
| 7 | 1 | 25 |
| 8 | 2 | 40 |
| 9 | 3 | 50 |
Relationships:
Faction hasMany Island
Island hasMany shrines
I would like to get all Factions together with the sum of the points that effectively belong to each faction, where the result is something like:
$factions = [
    ['id' => 1, 'points' => 300],
    ['id' => 2, 'points' => 340],
    ['id' => 3, 'points' => 200],
    ...
]
so points is the sum of the shrines.points belonging to the islands that belong to the faction.
Any help is appreciated. I've found a few posts about similar problems, but the problem isn't exactly the same and the solutions aren't what I am looking for.
If you add a hasManyThrough relationship on your Faction model, you should be able to use an accessor to do it:
Faction.php
protected $appends = ['points'];

public function shrines()
{
    return $this->hasManyThrough(Shrine::class, Island::class);
}

public function getPointsAttribute()
{
    return $this->shrines()->sum('points');
}

Impala substr can't get utf8 character correctly

I am new to ETL and was assigned a task of sanitizing some sensitive information before giving the data to a client.
I am using HUE web client with Impala.
What I want to do is:
For example, a column contains info like '京客隆(三里屯店)', and I need to transform it into something like '京XXX店)'.
My query is:
select '京客隆(三里屯店)', concat(substr('京客隆(三里屯店)', 1, 3), 'XXX', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -6, 6));
But I get gibberish in the output:
'京客隆(三里屯店)' | concat(substr('京客隆(三里屯店)', 1, 3), 'xxx', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') - 6, 6))
京客隆(三里屯店) | 京XXX�店�
The problem is that :
select '京客隆(三里屯店)', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -3 , 3);
output: 京客隆(三里屯店) ��
doesn't get the correct characters. Why is that? I pasted the string into a Python shell, and I can get the correct characters if I take only the last 3 bytes.
It turns out that I misunderstood the function substr.
substr(STRING a, INT start [, INT len]):
Despite the name, it operates on bytes, taking bytes starting from (and including) byte position INT start. My string '京客隆(三里屯店)' is 27 bytes long in total, since each UTF-8 character here takes 3 bytes. To take the last 3 bytes, which form the closing ), I need to write:
substr('京客隆(三里屯店)', 27 - 2, 3)
It then takes bytes 25, 26 and 27 and displays the character ) correctly.
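As a sanity check, the same byte arithmetic can be reproduced in plain Python (an illustration of UTF-8 byte indexing, not Impala itself; it assumes the parentheses are the full-width CJK ones, so every character takes 3 bytes):

# The string from the question: 9 characters, each 3 bytes in UTF-8.
s = '京客隆(三里屯店)'
b = s.encode('utf-8')
print(len(b))                    # 27
# Impala's 1-based substr(s, 27 - 2, 3) covers bytes 25..27,
# i.e. Python's b[24:27].
print(b[24:27].decode('utf-8'))  # ')'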
Update:
I was told to use:
SELECT regexp_replace('京客隆(三里屯店)', '(.)(.*)(.{2})', '\\1***\\3');
which works like a charm :P

How to use VowpalWabbit python framework learn for multi line example?

from vowpalwabbit import pyvw
vw = pyvw.vw("--cb 3 --epsilon 0.2 --quiet")
input = "2:-5:0.2 | Anna"
vw.learn(input)
input = "3:-20:0.2 | Anna \n 2:-20:0.2 | Anna \n 1:-20:0.2 | Anna"
vw.learn([vw.example(string) for string in input.split('\n')])
print(vw.predict(" | Anna"))
This piece of code throws an error:
RuntimeError Traceback (most recent call last)
<ipython-input-7-e8693ac0708c> in <module>()
4 vw.learn(input)
5 input = "3:-20:0.2 | Anna \n 2:-20:0.2 | Anna \n 1:-20:0.2 | Anna"
----> 6 vw.learn([vw.example(string) for string in input.split('\n')])
7
8 vw.learn(input)
/usr/local/lib/python3.6/dist-packages/vowpalwabbit/pyvw.py in learn(self, ec)
168 pylibvw.vw.learn(self, ec)
169 elif isinstance(ec, list):
--> 170 pylibvw.vw.learn_multi(self,ec)
171 else:
172 raise TypeError('expecting string or example object as ec argument for learn, got %s' % type(ec))
RuntimeError: This reduction does not support multi-line example.
Why am I getting this error? What is the correct syntax for learning from a multi-line example?
The issue is that the reduction you are using, --cb, is a single-line reduction, so passing multi-line examples does not make sense in this case. This can be seen from the error:
RuntimeError: This reduction does not support multi-line example.
You can read more about --cb here: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Logged-Contextual-Bandit-Example
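If you actually need multi-line contextual-bandit examples, the ADF variants (--cb_adf, --cb_explore_adf) are the multi-line reductions. A sketch under that assumption, reusing the question's example-list pattern; the feature names here are made up for illustration:

from vowpalwabbit import pyvw

# --cb_adf is a multi-line reduction: one shared context line plus one
# line per action; the chosen action is labelled "action:cost:probability".
vw = pyvw.vw("--cb_adf --quiet")

train_lines = ["shared | user=Anna",
               "0:-5:0.2 | item=a",  # chosen action, cost -5, probability 0.2
               "| item=b",
               "| item=c"]
vw.learn([vw.example(line) for line in train_lines])

test_lines = ["shared | user=Anna",
              "| item=a",
              "| item=b",
              "| item=c"]
print(vw.predict([vw.example(line) for line in test_lines]))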

Calculate features at multiple training windows in Featuretools

I have a table with customers and transactions. Is there a way to get features filtered for the last 3/6/9/12 months? I would like to automatically generate features such as:
number of trans in last 3 months
....
number of trans in last 12 months
average trans in last 3 months
...
average trans in last 12 months
I've tried using training_window=["1 month", "3 months"], but it does not seem to return multiple features for each window.
Example:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
window_features = ft.dfs(entityset=es,
                         target_entity="customers",
                         training_window=["1 hour", "1 day"],
                         features_only=True)
window_features
Do I have to do individual windows separately and then merge the results?
As you mentioned, in Featuretools 0.2.1 you have to build the feature matrices individually for each training window and then merge the results. With your example, you would do that as follows:
import pandas as pd
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)

cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})

features = ft.dfs(entityset=es,
                  target_entity="customers",
                  agg_primitives=['count'],
                  trans_primitives=[],
                  features_only=True)

fm_1 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1h',
                                   verbose=True)

fm_2 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1d',
                                   verbose=True)

new_df = fm_1.reset_index()
new_df = new_df.merge(fm_2.reset_index(), on="customer_id", suffixes=("_1h", "_1d"))
Then, the new dataframe will look like:
customer_id COUNT(sessions)_1h COUNT(transactions)_1h COUNT(sessions)_1d COUNT(transactions)_1d
1 1 17 3 43
2 3 36 3 36
3 0 0 1 25
4 0 0 0 0
5 1 15 2 29
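To cover several windows like the 3/6/9/12 months from the question, the same pattern generalizes to a loop. A sketch reusing the variables above (the demo data only spans a day, so it keeps the hour/day windows; on real data you would substitute "3 months", "6 months", etc., assuming your Featuretools version accepts those window strings):

import functools

# One feature matrix per training window, columns suffixed by window,
# then merged into a single frame keyed on customer_id.
windows = ["1 hour", "1 day"]  # e.g. ["3 months", ..., "12 months"] on real data
matrices = []
for window in windows:
    fm = ft.calculate_feature_matrix(features,
                                     entityset=es,
                                     cutoff_time=cutoff_times,
                                     training_window=window)
    suffix = "_" + window.replace(" ", "_")
    matrices.append(fm.add_suffix(suffix).reset_index())

merged = functools.reduce(
    lambda left, right: left.merge(right, on="customer_id"), matrices)
print(merged)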
