Convert cuDF data frame column to 1 or 0 for “true”/“false” values - rapids

I am using the RAPIDS (0.9 release) Docker container. How can I do the following with RAPIDS cuDF?
df['new_column'] = df['column_name'] > condition
df[['new_column']] *= 1

You can do this in the same way as with pandas.
import cudf
df = cudf.DataFrame({'a':[0,1,2,3,4]})
df['new'] = df['a'] >= 3
df['new'] = df['new'].astype('int') # could use int8, int32, or int64
# could also do (df['a'] >= 3).astype('int')
df
   a  new
0  0    0
1  1    0
2  2    0
3  3    1
4  4    1

Related

Assign specific color to seaborn heatmap

I'm trying to make a heatmap using seaborn, but I'm stuck on changing the color for specific values. Suppose the value 0 should be white and the value 1 should be grey, with the palette provided by cmap used for everything above that.
I tried using a mask, but got confused.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
df = pd.read_csv('/home/test.csv', index_col=0)
fig, ax = plt.subplots()
sns.heatmap(df, cmap="Reds", vmin=0, vmax=15)
plt.show()
This is the sample data:
TAG A B C D E F G H I J
TAG_1 1 0 0 5 0 7 1 1 0 10
TAG_2 0 1 0 6 0 6 0 0 0 7
TAG_3 0 1 0 2 0 4 0 0 1 4
TAG_4 0 0 0 3 1 3 0 0 0 10
TAG_5 1 0 1 5 0 2 1 1 0 11
TAG_6 0 0 0 0 0 0 0 0 0 12
TAG_7 0 1 0 0 1 0 0 0 0 0
TAG_8 0 0 0 1 0 0 1 0 1 0
TAG_9 0 0 1 0 0 0 0 0 0 0
TAG_10 0 0 0 0 0 0 0 0 0 0
df.set_index('TAG', inplace=True) tells seaborn that the tags should be used as tags, not as data.
The 'binary' colormap goes smoothly from white for the lower values to dark black for the highest. Playing with vmin and vmax, setting vmin=0 and vmax to a value between 1.5 and about 5, value 0 will be white and 1 will be any desired type of gray.
To set a mask, the dataframe should be converted to a 2D numpy array and be of type float.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from io import StringIO
data_str = StringIO('''TAG A B C D E F G H I J
TAG_1 1 0 0 5 0 7 1 1 0 10
TAG_2 0 1 0 6 0 6 0 0 0 7
TAG_3 0 1 0 2 0 4 0 0 1 4
TAG_4 0 0 0 3 1 3 0 0 0 10
TAG_5 1 0 1 5 0 2 1 1 0 11
TAG_6 0 0 0 0 0 0 0 0 0 12
TAG_7 0 1 0 0 1 0 0 0 0 0
TAG_8 0 0 0 1 0 0 1 0 1 0
TAG_9 0 0 1 0 0 0 0 0 0 0
TAG_10 0 0 0 0 0 0 0 0 0 0''')
df = pd.read_csv(data_str, sep=r'\s+')  # whitespace-separated; delim_whitespace=True is deprecated in recent pandas
df.set_index('TAG', inplace=True)
values = df.to_numpy(dtype=float)
ax = sns.heatmap(values, cmap='Reds', vmin=0, vmax=15, square=True)
sns.heatmap(values, xticklabels=df.columns, yticklabels=df.index,
cmap=plt.get_cmap('binary'), vmin=0, vmax=2, mask=values > 1, cbar=False, ax=ax)
plt.show()
Alternatively, a custom colormap could be created. That way the colorbar will also show the adapted colors.
from matplotlib.colors import LinearSegmentedColormap
cmap_reds = plt.get_cmap('Reds')
num_colors = 15
colors = ['white', 'grey'] + [cmap_reds(i / num_colors) for i in range(2, num_colors)]
cmap = LinearSegmentedColormap.from_list('', colors, num_colors)
ax = sns.heatmap(df, cmap=cmap, vmin=0, vmax=num_colors, square=True, cbar=False)
cbar = plt.colorbar(ax.collections[0], ticks=range(num_colors + 1))
plt.show()

how can I create an incidence matrix in Julia

I would like to create an incidence matrix.
I have a file with 3 columns, like:
id x y
A 22 2
B 4 21
C 21 360
D 26 2
E 22 58
F 2 347
And I want a matrix like (without col and row names):
2 4 21 22 26 58 347 360
A 1 0 0 1 0 0 0 0
B 0 1 1 0 0 0 0 0
C 0 0 1 0 0 0 0 1
D 1 0 0 0 1 0 0 0
E 0 0 0 1 0 1 0 0
F 1 0 0 0 0 0 1 0
I have started the code like:
haps = readdlm("File.txt",header=true)
hap1_2 = map(Int64,haps[1][:,2:end])
ID = (haps[1][:,1])
dic1 = Dict()
for (i in 1:21)
dic1[ID[i]] = hap1_2[i,:]
end
X=[zeros(21,22)]; #the original file has 21 rows and 22 columns
X1 = hcat(ID,X)
The problem now is that I don't know how to fill the matrix with 1s in the specific columns as in the example above.
I'm also not sure if I'm on the right track.
Any suggestions would be appreciated.
Thanks!
NamedArrays is a neat package which allows naming both rows and columns and seems to fit the bill for this problem. Suppose the data is in data.csv, here is one method to go about it (install NamedArrays with Pkg.add("NamedArrays")):
data,header = readcsv("data.csv",header=true);
# get the column names by looking at unique values in columns
cols = unique(vec([(header[j+1],data[i,j+1]) for i in 1:size(data,1),j=1:2]))
# row names from ID column
rows = data[:,1]
using NamedArrays
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
# now stamp in the 1s in the right places
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],(header[c],data[r,c])] = 1 ; end
Now we have (note I transposed narr for better printout):
julia> narr'
10x6 NamedArray{Int64,2}:
attr ╲ id │ A B C D E F
──────────┼─────────────────
("x",22) │ 1 0 0 0 1 0
("x",4) │ 0 1 0 0 0 0
("x",21) │ 0 0 1 0 0 0
("x",26) │ 0 0 0 1 0 0
("x",2) │ 0 0 0 0 0 1
("y",2) │ 1 0 0 1 0 0
("y",21) │ 0 1 0 0 0 0
("y",360) │ 0 0 1 0 0 0
("y",58) │ 0 0 0 0 1 0
("y",347) │ 0 0 0 0 0 1
But, if DataFrames are necessary, similar tricks should apply.
---------- UPDATE ----------
In case the column of a value should be ignored i.e. x=2 and y=2 should both set a 1 on column for value 2, then the code becomes:
using NamedArrays
data,header = readcsv("data.csv",header=true);
rows = data[:,1]
cols = map(string,sort(unique(vec(data[:,2:end]))))
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],string(data[r,c])] = 1 ; end
giving:
julia> narr
6x8 NamedArray{Int64,2}:
id ╲ attr │ 2 4 21 22 26 58 347 360
──────────┼───────────────────────────────────────
A │ 1 0 0 1 0 0 0 0
B │ 0 1 1 0 0 0 0 0
C │ 0 0 1 0 0 0 0 1
D │ 1 0 0 0 1 0 0 0
E │ 0 0 0 1 0 1 0 0
F │ 1 0 0 0 0 0 1 0
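For readers without NamedArrays, the same idea (sorted unique values become the columns, then each row stamps a 1 into the columns of its values) can be sketched in plain Python. This is only an illustration under stated assumptions: the records are typed in by hand from the sample data in the question, and the function name incidence_matrix is hypothetical, not part of any package.

```python
# Rows as (id, x, y) tuples, copied from the sample data in the question.
records = [
    ("A", 22, 2),
    ("B", 4, 21),
    ("C", 21, 360),
    ("D", 26, 2),
    ("E", 22, 58),
    ("F", 2, 347),
]


def incidence_matrix(records):
    """Return (ids, column_values, matrix) with one 0/1 row per record."""
    # Sorted unique values across all attribute columns become the columns.
    values = sorted({v for _, *attrs in records for v in attrs})
    col_of = {v: j for j, v in enumerate(values)}
    ids = [rec[0] for rec in records]
    matrix = [[0] * len(values) for _ in records]
    # Stamp a 1 into the column of every value that appears in the row.
    for i, (_, *attrs) in enumerate(records):
        for v in attrs:
            matrix[i][col_of[v]] = 1
    return ids, values, matrix


ids, values, matrix = incidence_matrix(records)
print(" ", *values)
for name, row in zip(ids, matrix):
    print(name, *row)
```

As in the update above, x=2 and y=2 share a single column here because the column a value came from is ignored.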
Here is a slight variation on something that I use for creating sparse matrices out of categorical variables for regression analyses. The function includes a variety of comments and options to suit it to your needs. Note: as written, it treats the appearances of "2" and "21" in x and y as separate. It is far less elegant in naming and appearance than the nice response from Dan Getz. The main advantage here is that it works with sparse matrices so if your data is huge, this will be helpful in reducing storage space and computation time.
function OneHot(x::Array, header::Bool)
UniqueVals = unique(x)
Val_to_Idx = [Val => Idx for (Idx, Val) in enumerate(unique(x))] ## create a dictionary that maps unique values in the input array to column positions in the new sparse matrix.
ColIdx = convert(Array{Int64}, [Val_to_Idx[Val] for Val in x])
MySparse = sparse(collect(1:length(x)), ColIdx, ones(Int32, length(x)))
if header
return [UniqueVals' ; MySparse] ## note: this won't be sparse
## alternatively use return (MySparse, UniqueVals) to get a tuple, second element is the header which you can then feed to something to name the columns or do whatever else with
else
return MySparse ## use MySparse[:, 2:end] to drop a value (which you would want to do for categorical variables in a regression)
end
end
x = [22, 4, 21, 26, 22, 2];
y = [2, 21, 360, 2, 58, 347];
Incidence = [OneHot(x, true) OneHot(y, true)]
7x10 Array{Int64,2}:
22 4 21 26 2 2 21 360 58 347
1 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 1

Count the frequency of matrix values including 0

I have a vector
A = [ 1 1 1 2 2 3 6 8 9 9 ]
I would like to write a loop that counts the frequencies of the values in my vector within a range I choose. This should include values that have a frequency of 0.
For example, if I chose the range of 1:9 my results would be
3 2 1 0 0 1 0 1 2
If I picked 1:11 the result would be
3 2 1 0 0 1 0 1 2 0 0
Is this possible? Ideally I would have to do this for giant matrices and vectors, so the fastest way to calculate this would be appreciated.
Here's an alternative suggestion to histcounts, which appears to be ~8x faster on Matlab 2015b:
A = [ 1 1 1 2 2 3 6 8 9 9 ];
maxRange = 11;
N = accumarray(A(:), 1, [maxRange,1])';
N =
3 2 1 0 0 1 0 1 2 0 0
Comparing the speed:
K>> tic; for i = 1:100000, N1 = accumarray(A(:), 1, [maxRange,1])'; end; toc;
Elapsed time is 0.537597 seconds.
K>> tic; for i = 1:100000, N2 = histcounts(A,1:maxRange+1); end; toc;
Elapsed time is 4.333394 seconds.
K>> isequal(N1, N2)
ans =
1
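The accumulation idea behind accumarray (use each value as an index and add 1 at that position) can be sketched in plain Python as follows. This is a minimal sketch rather than a port of the MATLAB function: the name count_in_range is hypothetical, and silently ignoring values outside 1..max_value is an assumption made here (accumarray itself would not accept them with a fixed output size).

```python
def count_in_range(values, max_value):
    """Count occurrences of 1..max_value in values, keeping zero counts."""
    counts = [0] * max_value
    for v in values:
        # Assumption: values outside the chosen range are skipped.
        if 1 <= v <= max_value:
            counts[v - 1] += 1  # value v lands in slot v-1 (0-based lists)
    return counts


A = [1, 1, 1, 2, 2, 3, 6, 8, 9, 9]
print(count_in_range(A, 9))
print(count_in_range(A, 11))
```

The work is a single pass over the data plus one list of length max_value, which is why this style of counting scales well to large inputs.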
As per the loop request, here's a looped version, which should not be too slow since the latest engine overhaul:
A = [ 1 1 1 2 2 3 6 8 9 9 ];
maxRange = 11; %// your range
output = zeros(1,maxRange); %// initialise output
for ii = 1:maxRange
tmp = A==ii; %// temporary storage
output(ii) = sum(tmp(:)); %// find the number of occurrences
end
which would result in
output =
3 2 1 0 0 1 0 1 2 0 0
Faster and not-looping would be @beaker's suggestion to use histcounts:
[N,edges] = histcounts(A,1:maxRange+1);
N =
3 2 1 0 0 1 0 1 2 0 0
where the +1 makes sure the last entry is included as well.
Assuming the input A to be a sorted array and the range starts from 1 and goes until some value greater than or equal to the largest element in A, here's an approach using diff and find -
%// Inputs
A = [2 4 4 4 8 9 11 11 11 12]; %// Modified for variety
maxN = 13;
idx = [0 find(diff(A)>0) numel(A)]+1;
out = zeros(1,maxN); %// OR for better performance : out(maxN) = 0;
out(A(idx(1:end-1))) = diff(idx);
Output -
out =
0 1 0 3 0 0 0 1 1 0 3 1 0
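The same run-length idea for sorted input (find where each run of equal values starts, then take the run lengths) could look like this in Python, using itertools.groupby in place of diff and find. The function name counts_from_sorted is hypothetical, and the sketch relies on the same assumptions as above: the input is sorted and every value lies between 1 and max_value.

```python
from itertools import groupby


def counts_from_sorted(sorted_values, max_value):
    """Frequency of 1..max_value in an already sorted sequence."""
    counts = [0] * max_value
    # groupby yields one (value, run) pair per run of consecutive equal values.
    for value, run in groupby(sorted_values):
        counts[value - 1] = sum(1 for _ in run)  # length of the run
    return counts


A = [2, 4, 4, 4, 8, 9, 11, 11, 11, 12]
print(counts_from_sorted(A, 13))
```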
This can be done very easily with bsxfun.
Let the data be
A = [ 1 1 1 2 2 3 6 8 9 9 ]; %// data
B = 1:9; %// possible values
Then
result = sum(bsxfun(@eq, A(:), B(:).'), 1);
gives
result =
3 2 1 0 0 1 0 1 2

MATLAB - Combine two binary image by comparing 3 x 3 patch (sub-matrix)

Hello, I want to combine two binary images of the same size (111x111) in MATLAB. First I want to divide each image into 3 x 3 patches (a 37 x 37 grid of sub-matrices), and then apply two conditions:
1. If the 3 x 3 patch from image 2 is all white (1), then the result patch equals the image 1 patch. Example:
image 1 patch:    image 2 patch:    result:
1 1 0             1 1 1             1 1 0
1 0 1             1 1 1             1 0 1
1 1 1             1 1 1             1 1 1
2. Otherwise, I want to keep the center value of the 3 x 3 patch (index (2,2)) from image 1 and take the other values from image 2. Example:
image 1 patch:    image 2 patch:    result:
0 0 0             1 0 1             1 0 1
0 0 0             1 1 0             1 0 0
0 0 0             1 0 1             1 0 1
Then do this for the whole image and combine all the 3 x 3 patches into the result image (111x111 again).
My Code so far (Using mat2cell):
clear;
clc;
I1 = imread('image1.bmp');
I2 = imread('image2.bmp');
TI1 = im2bw(I1); %Thresholding I1
TI2 = im2bw(I2); %Thresholding I2
%Mat2cell patch
cellTI1 = mat2cell(TI1, 3*ones(size(TI1,1)/3,1), 3*ones(size(TI1,2)/3,1))
cellTI2= mat2cell(TI2, 3*ones(size(TI2,1)/3,1), 3*ones(size(TI2,2)/3,1))
% Im Confused with the loop
result1 = ones(37,37);
for i=1:3
for j=1:3
for m=1:37
for n=1:37
if TI2{m,n} == [1 1 1;
1 1 1;
1 1 1]
result1 = TI1(m,n);
else
result1 = [TI2{1,1}(1,1) TI2{1,1}(1,2) TI2{1,1}(1,3);
TI2{1,1}(2,1) TI1{1,1}(2,2) TI2{1,1}(3,2);
TI2{1,1}(3,1) TI2{1,1}(3,2) TI2{1,1}(3,3)];
end
end
end
Sorry for my bad English,
Thanks
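The two rules from the question can be sketched in plain Python on nested lists, which may help to see the loop structure before translating it back to MATLAB. This is an illustrative sketch, not the asker's code: the function name combine_patches is hypothetical, images are assumed to be lists of 0/1 rows, and both dimensions are assumed to be multiples of the patch size (true for 111 with 3 x 3 patches).

```python
def combine_patches(img1, img2, size=3):
    """Combine two binary images patch by patch using the two rules above."""
    rows, cols = len(img1), len(img1[0])
    result = [row[:] for row in img2]  # start from a copy of image 2
    mid = size // 2                    # offset of the patch center
    for r0 in range(0, rows, size):
        for c0 in range(0, cols, size):
            all_white = all(
                img2[r][c] == 1
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size)
            )
            if all_white:
                # Rule 1: image 2 patch is all white, copy the image 1 patch.
                for r in range(r0, r0 + size):
                    result[r][c0:c0 + size] = img1[r][c0:c0 + size]
            else:
                # Rule 2: keep image 2, but take the center from image 1.
                result[r0 + mid][c0 + mid] = img1[r0 + mid][c0 + mid]
    return result


# The two example patches from the question, placed side by side (3 x 6).
img1 = [[1, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0]]
img2 = [[1, 1, 1, 1, 0, 1],
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 0, 1]]
for row in combine_patches(img1, img2):
    print(*row)
```

Starting from a copy of image 2 means the second rule only has to overwrite a single element per patch, so no explicit 3 x 3 matrix has to be assembled by hand.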

Count the number of rows between each instance of a value in a matrix

Assume the following matrix:
myMatrix = [
1 0 1
1 0 0
1 1 1
1 1 1
0 1 1
0 0 0
0 0 0
0 1 0
1 0 0
0 0 0
0 0 0
0 0 1
0 0 1
0 0 1
];
Given the above (and treating each column independently), I'm trying to create a matrix that will contain the number of rows since the last value of 1 has "shown up". For example, in the first column, the first four values would become 0 since there are 0 rows between each of those rows and the previous value of 1.
Row 5 would become 1, row 6 = 2, row 7 = 3, row 8 = 4. Since row 9 contains a 1, it would become 0 and the count starts again with row 10. The final matrix should look like this:
FinalMatrix = [
0 1 0
0 2 1
0 0 0
0 0 0
1 0 0
2 1 1
3 2 2
4 0 3
0 1 4
1 2 5
2 3 6
3 4 0
4 5 0
5 6 0
];
What is a good way of accomplishing something like this?
EDIT: I'm currently using the following code:
[numRow,numCol] = size(myMatrix);
oneColumn = 1:numRow;
FinalMatrix = repmat(oneColumn',1,numCol);
toSubtract = zeros(numRow,numCol);
for m=1:numCol
rowsWithOnes = find(myMatrix(:,m));
for mm=1:length(rowsWithOnes);
toSubtract(rowsWithOnes(mm):end,m) = rowsWithOnes(mm);
end
end
FinalMatrix = FinalMatrix - toSubtract;
which runs about 5 times faster than the bsxfun solution posted over many trials and data sets (which are about 1500 x 2500 in size). Can the code above be optimized?
For a single column you could do this:
col = 1; %// desired column
vals = bsxfun(@minus, 1:size(myMatrix,1), find(myMatrix(:,col)));
vals(vals<0) = inf;
result = min(vals, [], 1).';
Result for first column:
result =
0
0
0
0
1
2
3
4
0
1
2
3
4
5
find + diff + cumsum based approach -
offset_array = zeros(size(myMatrix));
for k1 = 1:size(myMatrix,2)
a = myMatrix(:,k1);
widths = diff(find(diff([1 ; a])~=0));
idx = find(diff(a)==1)+1;
offset_array(idx(idx<=numel(a)),k1) = widths(1:2:end);
end
FinalMatrix1 = cumsum(double(myMatrix==0) - offset_array);
Benchmarking
The benchmarking code for comparing the above mentioned approach against the one in the question is listed here -
clear all
myMatrix = round(rand(1500,2500)); %// create random input array
for k = 1:50000
tic(); elapsed = toc(); %// Warm up tic/toc
end
disp('------------- With FIND+DIFF+CUMSUM based approach') %//'#
tic
offset_array = zeros(size(myMatrix));
for k1 = 1:size(myMatrix,2)
a = myMatrix(:,k1);
widths = diff(find(diff([1 ; a])~=0));
idx = find(diff(a)==1)+1;
offset_array(idx(idx<=numel(a)),k1) = widths(1:2:end);
end
FinalMatrix1 = cumsum(double(myMatrix==0) - offset_array);
toc
clear FinalMatrix1 offset_array idx widths a
disp('------------- With original approach') %//'#
tic
[numRow,numCol] = size(myMatrix);
oneColumn = 1:numRow;
FinalMatrix = repmat(oneColumn',1,numCol); %//'#
toSubtract = zeros(numRow,numCol);
for m=1:numCol
rowsWithOnes = find(myMatrix(:,m));
for mm=1:length(rowsWithOnes);
toSubtract(rowsWithOnes(mm):end,m) = rowsWithOnes(mm);
end
end
FinalMatrix = FinalMatrix - toSubtract;
toc
The results I got were -
------------- With FIND+DIFF+CUMSUM based approach
Elapsed time is 0.311115 seconds.
------------- With original approach
Elapsed time is 7.587798 seconds.
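For reference, the quantity being computed can also be written as a simple running counter per column: reset to 0 on a 1, otherwise add 1. The Python sketch below is meant as a readable specification for checking results, not as a fast implementation. The function name rows_since_last_one is hypothetical, and rows before the first 1 in a column are counted from the top, which matches the FinalMatrix given in the question.

```python
def rows_since_last_one(matrix):
    """Per column, count rows since the most recent 1 (0 on rows holding a 1)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        gap = 0  # rows before the first 1 are counted from the top
        for r in range(n_rows):
            gap = 0 if matrix[r][c] == 1 else gap + 1
            result[r][c] = gap
    return result


my_matrix = [
    [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 1, 1], [0, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0],
    [0, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1],
]
for row in rows_since_last_one(my_matrix):
    print(*row)
```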
