How to sort pandas dataframe non-lexical? - sorting

To sort the following DataFrame by credit, I use the sort_values() function (I've also tried sort()):
df.sort_values('credit', ascending=False, inplace=True)
The problem is that credits are sorted like below:
i credit m reg_date b id
----------------------------------------------------------------------
238 0 4600000.00 0 2014-04-14 False 102214
127 0 4600000.00 0 2014-12-30 False 159479
13 0 16800000.00 0 2015-01-12 False 163503
248 0 16720000.00 0 2012-11-11 False 5116
Because ascending is False, 4600000.00 comes before the other credits, which means the column is being sorted lexically. That is not what I want: I want to sort by numeric value, so in the sample above 16800000.00 and 16720000.00 should come before 4600000.00. How can I sort this DataFrame non-lexically?
EDIT-1:
Data is more than that and can contain:
120 0 16708000.00 0 2013-12-17 False 51433
248 0 16720000.00 0 2012-11-11 False 5116
13 0 16800000.00 0 2015-01-12 False 163503
21 0 4634000.00 0 2014-12-29 False 159239
136 0 4650000.00 0 2012-11-07 False 4701
.. ... ... ... ... ... ...
231 0 7715000.00 0 2014-02-15 False 83936
182 0 7750000.00 0 2015-07-13 False 201584

You could convert the column to float, sort it on its own, and then use the sorted index to reorder the rows of the original DataFrame.
In your case:
import pandas as pd
from io import StringIO  # Python 3 location of StringIO
text = """136 0 4650000.00 0 2012-11-07 False 4701
231 0 7715000.00 0 2014-02-15 False 83936
13 0 16800000.00 0 2015-01-12 False 163503
120 0 16708000.00 0 2013-12-17 False 51433
248 0 16720000.00 0 2012-11-11 False 5116
21 0 4634000.00 0 2014-12-29 False 159239
182 0 7750000.00 0 2015-07-13 False 201584
"""
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(text), sep=r'\s+',
                 header=None, index_col=0,
                 names=['i', 'credit', 'm', 'reg_date', 'b', 'id'])
# sort the credit column as float, then reorder the rows by the sorted index
print(df.loc[df.credit.astype(float).sort_values(ascending=False).index])
i credit m reg_date b id
13 0 16800000.0 0 2015-01-12 False 163503
248 0 16720000.0 0 2012-11-11 False 5116
120 0 16708000.0 0 2013-12-17 False 51433
182 0 7750000.0 0 2015-07-13 False 201584
231 0 7715000.0 0 2014-02-15 False 83936
136 0 4650000.0 0 2012-11-07 False 4701
21 0 4634000.0 0 2014-12-29 False 159239
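If the credit column really is stored as strings, another option is to convert it before sorting. The following is a minimal sketch under that assumption; the three-row frame and the variable names are made up for illustration, and the key= argument needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical miniature of the question's data, with credit stored as strings.
df = pd.DataFrame(
    {"credit": ["4600000.00", "16800000.00", "16720000.00"],
     "id": [102214, 163503, 5116]},
    index=[238, 13, 248],
)

# Sorting strings compares character by character, so "4..." beats "16...".
lexical = df.sort_values("credit", ascending=False)

# Option 1: sort by the numeric value but keep the column as strings.
by_key = df.sort_values("credit", key=pd.to_numeric, ascending=False)

# Option 2: convert the column once, then sort as usual.
numeric = df.assign(credit=pd.to_numeric(df["credit"])).sort_values(
    "credit", ascending=False
)

print(list(lexical.index))  # [238, 13, 248]
print(list(numeric.index))  # [13, 248, 238]
```

Converting once (option 2) is usually the simpler choice when the column is needed as a number elsewhere too.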


elasticsearch disk usage / indexes size

I use Elasticsearch, and this is what _cat/allocation/ returns:
shards disk.indices disk.used disk.avail disk.total disk.percent
10 4.9mb 51.4gb 956.3gb 1007.8gb 5
10 4.7mb 51.5gb 956.2gb 1007.8gb 5
disk.used is over 50 GB.
Using _cat/shards:
index shard prirep state docs store
cs-card-logs_20180712-001 4 p STARTED 724 572.8kb
cs-card-logs_20180712-001 4 r STARTED 724 539.7kb
cs-card-logs_20180712-001 3 r STARTED 673 997.8kb
cs-card-logs_20180712-001 3 p STARTED 673 969.8kb
cs-card-logs_20180712-001 2 p STARTED 699 1mb
cs-card-logs_20180712-001 2 r STARTED 699 556.9kb
cs-card-logs_20180712-001 1 r STARTED 670 1mb
cs-card-logs_20180712-001 1 p STARTED 670 546.7kb
cs-card-logs_20180712-001 0 p STARTED 722 1013.1kb
cs-card-logs_20180712-001 0 r STARTED 722 1020.8kb
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open read_me 5 1 0 0 1.5kb 795b
green open cs-card-logs_20180712-001 5 1 3106 0 4.8mb 2.4mb
The store size is under 5 MB.
Using /_cat/segments/:
index shard prirep segment generation docs.count docs.deleted size size.memory committed searchable version compound
cs-card-logs_20180712-001 0 p _5u 210 245 0 209.7kb 45308 true true 5.5.2 false
cs-card-logs_20180712-001 0 p _5v 211 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 0 p _5w 212 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 0 r _5u 210 243 0 207.8kb 45243 true true 5.5.2 false
cs-card-logs_20180712-001 0 r _5v 211 2 0 10.4kb 8095 true true 5.5.2 true
cs-card-logs_20180712-001 0 r _5w 212 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 0 r _5x 213 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _50 180 188 0 178.4kb 44552 true true 5.5.2 false
cs-card-logs_20180712-001 1 r _51 181 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _52 182 2 0 10.4kb 8095 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _53 183 1 0 4.4kb 3262 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _54 184 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _55 185 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _56 186 1 0 4.4kb 3262 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _57 187 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 1 r _58 188 2 0 8.3kb 6826 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _50 180 189 0 178.7kb 44568 true true 5.5.2 false
cs-card-logs_20180712-001 1 p _51 181 2 0 10.4kb 8095 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _52 182 1 0 4.4kb 3262 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _53 183 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _54 184 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _55 185 1 0 4.4kb 3262 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _56 186 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 1 p _57 187 2 0 8.3kb 6826 true true 5.5.2 true
cs-card-logs_20180712-001 2 p _64 220 240 0 209.8kb 45900 true true 5.5.2 false
cs-card-logs_20180712-001 2 p _65 221 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 2 r _64 220 238 0 209.8kb 45873 true true 5.5.2 false
cs-card-logs_20180712-001 2 r _65 221 1 0 4.8kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 2 r _66 222 1 0 4.4kb 3262 true true 5.5.2 true
cs-card-logs_20180712-001 2 r _67 223 1 0 5kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 3 r _5u 210 226 0 207.1kb 45876 true true 5.5.2 false
cs-card-logs_20180712-001 3 r _5v 211 1 0 6.5kb 5269 true true 5.5.2 true
cs-card-logs_20180712-001 3 r _5w 212 2 0 39.5kb 27250 true true 5.5.2 true
cs-card-logs_20180712-001 3 p _5u 210 223 0 205.6kb 45812 true true 5.5.2 false
cs-card-logs_20180712-001 3 p _5v 211 2 0 10.4kb 8095 true true 5.5.2 true
cs-card-logs_20180712-001 3 p _5w 212 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 3 p _5x 213 1 0 6.5kb 5269 true true 5.5.2 true
cs-card-logs_20180712-001 3 p _5y 214 2 0 39.5kb 27250 true true 5.5.2 true
cs-card-logs_20180712-001 4 p _64 220 240 0 207kb 45498 true true 5.5.2 false
cs-card-logs_20180712-001 4 p _65 221 1 0 4.8kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 4 p _66 222 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 4 p _67 223 1 0 5.1kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 4 p _68 224 1 0 6.7kb 5397 true true 5.5.2 true
cs-card-logs_20180712-001 4 p _69 225 2 0 40.2kb 27796 true true 5.5.2 true
cs-card-logs_20180712-001 4 r _64 220 240 0 207.1kb 45506 true true 5.5.2 false
cs-card-logs_20180712-001 4 r _65 221 1 0 4.8kb 3872 true true 5.5.2 true
cs-card-logs_20180712-001 4 r _66 222 1 0 5.2kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 4 r _67 223 1 0 5.1kb 4113 true true 5.5.2 true
cs-card-logs_20180712-001 4 r _68 224 1 0 6.7kb 5397 true true 5.5.2 true
cs-card-logs_20180712-001 4 r _69 225 2 0 40.2kb 27796 true true 5.5.2 true
I can't figure out why my disk usage is so high.
What can I do to find the reason for this disk.used value?
How can I check what is taking up that much space?
Can someone help me?
Thanks
The figure reported in the disk.used column is the total disk space used, i.e. including everything outside of ES.
The space used by ES is in the disk.indices column. This column was added to give more insight into ES vs. non-ES disk usage.
So to find out what is taking up the disk space, you can run the du command at the root of your filesystem; whatever it turns out to be, it is not ES.
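As a rough illustration of the du idea, here is a minimal Python sketch that sums file sizes per top-level directory and compares them with the filesystem total. The function names are hypothetical helpers, not part of Elasticsearch or any library, and files that cannot be read are simply skipped.

```python
import os
import shutil


def directory_size(path):
    """Sum the sizes (in bytes) of regular files below path, like a rough `du -s`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # unreadable or vanished file: ignore it
    return total


def largest_subdirectories(path, top=10):
    """Return (size, path) pairs for the biggest direct subdirectories of path."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((directory_size(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


def filesystem_used(path="/"):
    """Bytes used on the filesystem holding path (what disk.used reflects)."""
    return shutil.disk_usage(path).used
```

Calling largest_subdirectories("/") on the node (a full scan can take a while and may need elevated permissions) should show which directories account for the gap between disk.used and disk.indices.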

Assign specific color to seaborn heatmap

I'm trying to make a heatmap using seaborn, but I got stuck changing the color for specific values. Suppose the value 0 should be white and the value 1 should be grey, and anything above that uses the palette provided by cmap.
I tried to use a mask, but got confused.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
df = pd.read_csv('/home/test.csv', index_col=0)
fig, ax = plt.subplots()
sns.heatmap(df, cmap="Reds", vmin=0, vmax=15)
plt.show()
This is the sample data:
TAG A B C D E F G H I J
TAG_1 1 0 0 5 0 7 1 1 0 10
TAG_2 0 1 0 6 0 6 0 0 0 7
TAG_3 0 1 0 2 0 4 0 0 1 4
TAG_4 0 0 0 3 1 3 0 0 0 10
TAG_5 1 0 1 5 0 2 1 1 0 11
TAG_6 0 0 0 0 0 0 0 0 0 12
TAG_7 0 1 0 0 1 0 0 0 0 0
TAG_8 0 0 0 1 0 0 1 0 1 0
TAG_9 0 0 1 0 0 0 0 0 0 0
TAG_10 0 0 0 0 0 0 0 0 0 0
df.set_index('TAG', inplace=True) tells seaborn that the tags should be used as row labels, not as data.
The 'binary' colormap goes smoothly from white for the lower values to dark black for the highest. Playing with vmin and vmax, setting vmin=0 and vmax to a value between 1.5 and about 5, value 0 will be white and 1 will be any desired type of gray.
To set a mask, the dataframe should be converted to a 2D numpy array and be of type float.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from io import StringIO
data_str = StringIO('''TAG A B C D E F G H I J
TAG_1 1 0 0 5 0 7 1 1 0 10
TAG_2 0 1 0 6 0 6 0 0 0 7
TAG_3 0 1 0 2 0 4 0 0 1 4
TAG_4 0 0 0 3 1 3 0 0 0 10
TAG_5 1 0 1 5 0 2 1 1 0 11
TAG_6 0 0 0 0 0 0 0 0 0 12
TAG_7 0 1 0 0 1 0 0 0 0 0
TAG_8 0 0 0 1 0 0 1 0 1 0
TAG_9 0 0 1 0 0 0 0 0 0 0
TAG_10 0 0 0 0 0 0 0 0 0 0''')
df = pd.read_csv(data_str, sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
df.set_index('TAG', inplace=True)
values = df.to_numpy(dtype=float)
ax = sns.heatmap(values, cmap='Reds', vmin=0, vmax=15, square=True)
sns.heatmap(values, xticklabels=df.columns, yticklabels=df.index,
cmap=plt.get_cmap('binary'), vmin=0, vmax=2, mask=values > 1, cbar=False, ax=ax)
plt.show()
Alternatively, a custom colormap could be created. That way the colorbar will also show the adapted colors.
from matplotlib.colors import LinearSegmentedColormap
cmap_reds = plt.get_cmap('Reds')
num_colors = 15
colors = ['white', 'grey'] + [cmap_reds(i / num_colors) for i in range(2, num_colors)]
cmap = LinearSegmentedColormap.from_list('', colors, num_colors)
ax = sns.heatmap(df, cmap=cmap, vmin=0, vmax=num_colors, square=True, cbar=False)
cbar = plt.colorbar(ax.collections[0], ticks=range(num_colors + 1))
plt.show()
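A variation on the custom colormap, sketched here with plain matplotlib: a ListedColormap combined with a BoundaryNorm gives every integer from 0 to 15 exactly one color, so 0 is white and 1 is grey with no interpolation. The small values array and the headless Agg backend are assumptions made for the example, not part of the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import BoundaryNorm, ListedColormap

vmax = 15  # same upper limit as in the question
reds = plt.get_cmap("Reds")
# One color per integer value: 0 -> white, 1 -> grey, 2..15 -> shades of red.
colors = ["white", "grey"] + [reds(v / vmax) for v in range(2, vmax + 1)]
cmap = ListedColormap(colors)
# Bin edges at -0.5, 0.5, ..., 15.5 so each integer falls in its own bin.
norm = BoundaryNorm(np.arange(-0.5, vmax + 1.5, 1.0), cmap.N)

# Made-up sample values, only to have something to draw.
values = np.array([[1, 0, 0, 5], [0, 1, 0, 6], [0, 0, 12, 15]])
fig, ax = plt.subplots()
mesh = ax.pcolormesh(values, cmap=cmap, norm=norm)
fig.colorbar(mesh, ax=ax, ticks=range(vmax + 1))
plt.close(fig)
```

The same cmap and norm objects can probably be passed to sns.heatmap as cmap=cmap, norm=norm, since seaborn forwards extra keyword arguments to pcolormesh, but that is worth checking against the seaborn version in use.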

Finding islands of ones with zeros boundary

I am trying to find islands of numbers in a matrix.
By an island, I mean a rectangular area where ones are connected with each other either horizontally, vertically or diagonally including the boundary layer of zeros
Suppose I have this matrix:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
By boundary layer, I mean row 2 and 7, and column 3 and 10 for island#1.
This is shown below:
I want the row and column indices of the islands. So for the above matrix, the desired output is:
isl{1}= {[2 3 4 5 6 7]; % row indices of island#1
[3 4 5 6 7 8 9 10]} % column indices of island#1
isl{2}= {[2 3 4 5 6 7]; % row indices of island#2
[12 13 14 15 16 17]}; % column indices of island#2
isl{3} ={[9 10 11 12]; % row indices of island#3
[2 3 4 5 6 7 8 9 10 11];} % column indices of island#3
It doesn't matter which island is detected first.
I know that [r,c] = find(matrix) gives the row and column indices of the ones, but I have no idea how to detect which ones are connected, since they can be connected horizontally, vertically, or diagonally.
Any ideas on how to deal with this problem?
You should look at the BoundingBox and ConvexHull stats returned by regionprops:
a = imread('circlesBrightDark.png');
bw = a < 100;
s = regionprops('table',bw,'BoundingBox','ConvexHull')
https://www.mathworks.com/help/images/ref/regionprops.html
Finding the connected components and their bounding boxes is the easy part. The more difficult part is merging the bounding boxes into islands.
Bounding Boxes
First the easy part.
function bBoxes = getIslandBoxes(lMap)
% find bounding box of each candidate island
% lMap is a logical matrix containing zero or more connected components
bw = bwlabel(lMap); % label connected components in logical matrix
bBoxes = struct2cell(regionprops(bw, 'BoundingBox')); % get bounding boxes
bBoxes = cellfun(@round, bBoxes, 'UniformOutput', false); % round values
end
The values are rounded because the bounding box returned by regionprops lies on the grid lines just outside its component rather than on the cell centers, and we need integer values to use as subscripts into the matrix. For example, a component that looks like this:
0 0 0
0 1 0
0 0 0
will have a bounding box of
[ 1.5000 1.5000 1.0000 1.0000 ]
which we round to
[ 2 2 1 1]
Merging
Now the hard part. First, the merge condition:
We merge bounding box b2 into bounding box b1 if b2 and the island of b1 (including the boundary layer) have a non-null intersection.
This condition ensures that bounding boxes are merged when one component is wholly or partially inside the bounding box of another, but it also catches the edge cases when a bounding box is within the zero boundary of another. Once all of the bounding boxes are merged, they are guaranteed to have a boundary of all zeros (or border the edge of the matrix), otherwise the nonzero value in its boundary would have been merged.
Since merging involves deleting the merged bounding box, the loops are done backwards so that we don't end up indexing non-existent array elements.
Unfortunately, making one pass through the array comparing each element to all the others is insufficient to catch all cases. To signal that all of the possible bounding boxes have been merged into islands, we use a flag called anyMerged and loop until we get through one complete iteration without merging anything.
function mBoxes = mergeBoxes(bBoxes)
% find bounding boxes that intersect, and merge them
mBoxes = bBoxes;
% merge bounding boxes that overlap
anyMerged = true; % flag to show when we've finished
while (anyMerged)
anyMerged = false; % no boxes merged on this iteration so far...
for box1 = numel(mBoxes):-1:2
for box2 = box1-1:-1:1
% if intersection between bounding boxes is > 0, merge
% the size of box1 is increased by 1 on all sides...
% this is so that components that lie within the borders
% of another component, but not inside the bounding box,
% are merged
if (rectint(mBoxes{box1} + [-1 -1 2 2], mBoxes{box2}) > 0)
coords1 = rect2corners(mBoxes{box1});
coords2 = rect2corners(mBoxes{box2});
minX = min(coords1(1), coords2(1));
minY = min(coords1(2), coords2(2));
maxX = max(coords1(3), coords2(3));
maxY = max(coords1(4), coords2(4));
mBoxes{box2} = [minX, minY, maxX-minX+1, maxY-minY+1]; % merge
mBoxes(box1) = []; % delete redundant bounding box
anyMerged = true; % bounding boxes merged: loop again
break;
end
end
end
end
end
The merge function uses a small utility function that converts rectangles with the format [x y width height] to a vector of subscripts for the top-left, bottom-right corners [x1 y1 x2 y2]. (This was actually used in another function to check that an island had a zero border, but as discussed above, this check is unnecessary.)
function corners = rect2corners(rect)
% change from rect = x, y, width, height
% to corners = x1, y1, x2, y2
corners = [rect(1), ...
rect(2), ...
rect(1) + rect(3) - 1, ...
rect(2) + rect(4) - 1];
end
Output Formatting and Driver Function
The return value from mergeBoxes is a cell array of rectangle objects. If you find this format useful, you can stop here, but it's easy to get to the format requested with ranges of rows and columns for each island:
function rRanges = rect2range(bBoxes, mSize)
% convert rect = x, y, width, height to
% range = y:y+height-1; x:x+width-1
% and expand range by 1 in all 4 directions to include zero border,
% making sure to stay within borders of original matrix
rangeFun = @(rect) {max(rect(2)-1,1):min(rect(2)+rect(4),mSize(1));...
max(rect(1)-1,1):min(rect(1)+rect(3),mSize(2))};
rRanges = cellfun(rangeFun, bBoxes, 'UniformOutput', false);
end
All that's left is a main function to tie all of the others together and we're done.
function theIslands = getIslandRects(m)
% get rectangle around each component in map
lMap = logical(m);
% get the bounding boxes of candidate islands
bBoxes = getIslandBoxes(lMap);
% merge bounding boxes that overlap
bBoxes = mergeBoxes(bBoxes);
% convert bounding boxes to row/column ranges
theIslands = rect2range(bBoxes, size(lMap));
end
Here's a run using the sample matrix given in the question:
M =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> getIslandRects(M)
ans =
{
[1,1] =
{
[1,1] =
9 10 11 12
[2,1] =
2 3 4 5 6 7 8 9 10 11
}
[1,2] =
{
[1,1] =
2 3 4 5 6 7
[2,1] =
3 4 5 6 7 8 9 10
}
[1,3] =
{
[1,1] =
2 3 4 5 6 7
[2,1] =
12 13 14 15 16 17
}
}
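For readers without the Image Processing Toolbox, the label-then-merge idea can be sketched in plain Python. This is an illustrative re-implementation under the same merge condition (a box is merged when it overlaps another box grown by one cell), not a translation of the MATLAB functions; all names are hypothetical.

```python
from collections import deque

ROWS = ["00000000000000000", "00000000000000000", "00000011000000001",
        "00011101100011110", "00000010100001111", "00010101100010000",
        "00000000000000000", "00000000000000000", "00000000000000000",
        "00010101110000000", "00101111100000000", "00000000000000000"]
M = [[int(ch) for ch in row] for row in ROWS]  # the matrix from the question


def bounding_boxes(grid):
    """Return (r0, c0, r1, c1), 0-based inclusive, per 8-connected component."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:  # breadth-first flood fill over the 8 neighbours
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


def touches(a, b):
    """True if box b overlaps box a grown by one cell on every side."""
    return (a[0] - 1 <= b[2] and b[0] <= a[2] + 1
            and a[1] - 1 <= b[3] and b[1] <= a[3] + 1)


def merge_boxes(boxes):
    """Merge touching boxes until a full pass makes no change."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if touches(a, b):
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


def islands(grid):
    """Return 1-based (row indices, column indices), zero border included."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r0, c0, r1, c1 in merge_boxes(bounding_boxes(grid)):
        row_idx = list(range(max(r0 - 1, 0) + 1, min(r1 + 1, rows - 1) + 2))
        col_idx = list(range(max(c0 - 1, 0) + 1, min(c1 + 1, cols - 1) + 2))
        result.append((row_idx, col_idx))
    return result
```

On the sample matrix, islands(M) gives rows 2 to 7 with columns 3 to 10, rows 2 to 7 with columns 12 to 17, and rows 9 to 12 with columns 2 to 11, which matches the run above.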
Quite easy!
Just use bwboundaries to get the boundaries of each blob. You can then take the min and max in the x and y directions of each boundary to build your box.
Use image dilation and regionprops
mat = [...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1;
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0;
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1;
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0;
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
mat=logical(mat);
dil_mat=imdilate(mat,true(2,2)); %here we make bridges to 1 px away ones
l_mat=bwlabel(dil_mat,8);
bb = regionprops(l_mat,'BoundingBox');
bb = struct2cell(bb); bb = cellfun(@(x) fix(x), bb, 'un',0);
isl = cellfun(@(x) {max(1,x(2)):min(x(2)+x(4),size(mat,1)),...
max(1,x(1)):min(x(1)+x(3),size(mat,2))},bb,'un',0);

how can I create an incidence matrix in Julia

I would like to create an incidence matrix.
I have a file with 3 columns, like:
id x y
A 22 2
B 4 21
C 21 360
D 26 2
E 22 58
F 2 347
And I want a matrix like (without col and row names):
2 4 21 22 26 58 347 360
A 1 0 0 1 0 0 0 0
B 0 1 1 0 0 0 0 0
C 0 0 1 0 0 0 0 1
D 1 0 0 0 1 0 0 0
E 0 0 0 1 0 1 0 0
F 1 0 0 0 0 0 1 0
I have started the code like:
haps = readdlm("File.txt",header=true)
hap1_2 = map(Int64,haps[1][:,2:end])
ID = (haps[1][:,1])
dic1 = Dict()
for (i in 1:21)
dic1[ID[i]] = hap1_2[i,:]
end
X=[zeros(21,22)]; #the original file has 21 rows and 22 columns
X1 = hcat(ID,X)
The problem now is that I don't know how to fill the matrix with 1s in the specific columns as in the example above.
I'm also not sure if I'm on the right way.
Any suggestion that could help me??
Thanks!
NamedArrays is a neat package which allows naming both rows and columns and seems to fit the bill for this problem. Suppose the data is in data.csv, here is one method to go about it (install NamedArrays with Pkg.add("NamedArrays")):
data,header = readcsv("data.csv",header=true);
# get the column names by looking at unique values in columns
cols = unique(vec([(header[j+1],data[i,j+1]) for i in 1:size(data,1),j=1:2]))
# row names from ID column
rows = data[:,1]
using NamedArrays
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
# now stamp in the 1s in the right places
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],(header[c],data[r,c])] = 1 ; end
Now we have (note I transposed narr for better printout):
julia> narr'
10x6 NamedArray{Int64,2}:
attr ╲ id │ A B C D E F
──────────┼─────────────────
("x",22) │ 1 0 0 0 1 0
("x",4) │ 0 1 0 0 0 0
("x",21) │ 0 0 1 0 0 0
("x",26) │ 0 0 0 1 0 0
("x",2) │ 0 0 0 0 0 1
("y",2) │ 1 0 0 1 0 0
("y",21) │ 0 1 0 0 0 0
("y",360) │ 0 0 1 0 0 0
("y",58) │ 0 0 0 0 1 0
("y",347) │ 0 0 0 0 0 1
But, if DataFrames are necessary, similar tricks should apply.
---------- UPDATE ----------
In case the column of a value should be ignored i.e. x=2 and y=2 should both set a 1 on column for value 2, then the code becomes:
using NamedArrays
data,header = readcsv("data.csv",header=true);
rows = data[:,1]
cols = map(string,sort(unique(vec(data[:,2:end]))))
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],string(data[r,c])] = 1 ; end
giving:
julia> narr
6x8 NamedArray{Int64,2}:
id ╲ attr │ 2 4 21 22 26 58 347 360
──────────┼───────────────────────────────────────
A │ 1 0 0 1 0 0 0 0
B │ 0 1 1 0 0 0 0 0
C │ 0 0 1 0 0 0 0 1
D │ 1 0 0 0 1 0 0 0
E │ 0 0 0 1 0 1 0 0
F │ 1 0 0 0 0 0 1 0
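The value-based variant can also be sketched without any package. The snippet below is written in plain Python purely to show the bookkeeping (sorted unique values become columns, each row gets a 1 per value); the hard-coded records stand in for the file contents and the variable names are made up.

```python
# Stand-in for the three-column file from the question: (id, x, y).
records = [("A", 22, 2), ("B", 4, 21), ("C", 21, 360),
           ("D", 26, 2), ("E", 22, 58), ("F", 2, 347)]

# Columns are the sorted unique values, ignoring whether they came from x or y.
values = sorted({v for _, x, y in records for v in (x, y)})
column_of = {value: j for j, value in enumerate(values)}

# Start from all zeros and stamp a 1 for every value seen in a row.
incidence = [[0] * len(values) for _ in records]
for i, (_, x, y) in enumerate(records):
    incidence[i][column_of[x]] = 1
    incidence[i][column_of[y]] = 1

print(values)        # [2, 4, 21, 22, 26, 58, 347, 360]
print(incidence[0])  # [1, 0, 0, 1, 0, 0, 0, 0]  (row A)
```

The dictionary from value to column position plays the same role as the column names in the NamedArray.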
Here is a slight variation on something that I use for creating sparse matrices out of categorical variables for regression analyses. The function includes a variety of comments and options to suit it to your needs. Note: as written, it treats the appearances of "2" and "21" in x and y as separate. It is far less elegant in naming and appearance than the nice response from Dan Getz. The main advantage here is that it works with sparse matrices so if your data is huge, this will be helpful in reducing storage space and computation time.
function OneHot(x::Array, header::Bool)
UniqueVals = unique(x)
Val_to_Idx = Dict(Val => Idx for (Idx, Val) in enumerate(UniqueVals)) ## create a dictionary that maps unique values in the input array to column positions in the new sparse matrix.
ColIdx = convert(Array{Int64}, [Val_to_Idx[Val] for Val in x])
MySparse = sparse(collect(1:length(x)), ColIdx, ones(Int32, length(x)))
if header
return [UniqueVals' ; MySparse] ## note: this won't be sparse
## alternatively use return (MySparse, UniqueVals) to get a tuple, second element is the header which you can then feed to something to name the columns or do whatever else with
else
return MySparse ## use MySparse[:, 2:end] to drop a value (which you would want to do for categorical variables in a regression)
end
end
x = [22, 4, 21, 26, 22, 2];
y = [2, 21, 360, 2, 58, 347];
Incidence = [OneHot(x, true) OneHot(y, true)]
7x10 Array{Int64,2}:
22 4 21 26 2 2 21 360 58 347
1 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 1

Fastest way to find the sign of different square

Given an image I and two matrices m_1 and m_2 (the same size as I), the function f is defined as:
Because my design only needs the sign of f, the function f can be rewritten as follows:
I think the second formula is faster than the first because:
It can ignore the square term.
It can compute the sign directly, instead of the two steps in the first equation: compute f, then check its sign.
Do you agree with me? Do you have another, faster formula for f?
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
The output f is
f =[1 1 1 1 1
1 1 -1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1]
I implemented the first way, but I have not finished the second way in MATLAB. Could you help me check the second way and compare the two?
UPDATE: To make the question clearer, I am adding the code from chepyle and Divakar. Note that both give the same result as f above.
function compare()
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
function f=first_way()
f=sign((I-m1).^2-(I-m2).^2);
f(f==0)=1;
end
function f= second_way()
f = double(abs(I-m1) >= abs(I-m2));
f(f==0) = -1;
end
function f= third_way()
v1=abs(I-m1);
v2=abs(I-m2);
f= int8(v1>v2) + -1*int8(v1<v2); % need to convert to int from logical
f(f==0) = 1;
end
disp(['First way : ' num2str(timeit(@first_way))])
disp(['Second way: ' num2str(timeit(@second_way))])
disp(['Third way : ' num2str(timeit(@third_way))])
end
First way : 1.2897e-05
Second way: 1.9381e-05
Third way : 2.0077e-05
This seems to be comparable and might be a wee bit faster at times than the original approach -
f = sign(abs(I-m1) - abs(I-m2)) + sign(abs(m1-m2)) + ...
sign(abs(2*I-m1-m2)) - 1 -sign(abs(2*I-m1-m2) + abs(m1-m2))
Benchmarking Code
%// Create random inputs
N = 5000;
I = randi(1000,N,N);
m1 = randi(1000,N,N);
m2 = randi(1000,N,N);
num_iter = 20; %// Number of iterations for all approaches
%// Warm up tic/toc.
for k = 1:100000
tic(); elapsed = toc();
end
disp('------------------------- With Original Approach')
tic
for iter = 1:num_iter
out1 = sign((I-m1).^2-(I-m2).^2);
out1(out1==0)=-1;
end
toc, clear out1
disp('------------------------- With Proposed Approach')
tic
for iter = 1:num_iter
out2 = sign(abs(I-m1) - abs(I-m2)) + sign(abs(m1-m2)) + ...
sign(abs(2*I-m1-m2)) - 1 -sign(abs(2*I-m1-m2) + abs(m1-m2));
end
toc
Results
------------------------- With Original Approach
Elapsed time is 1.751966 seconds.
------------------------- With Proposed Approach
Elapsed time is 1.681263 seconds.
The second formula has an accuracy problem, but for the sake of comparison, here is how I would implement it in MATLAB, along with a third approach that avoids squaring and the sign() function, in line with your intent. Note that MATLAB's matrix and sign functions are well optimized, so the second and third approaches are both slower.
function compare()
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
function f=first_way()
f=sign((I-m1).^2-(I-m2).^2);
end
function f= second_way()
v1=(I-m1);
v2=(I-m2);
f= int8(v1<=0 & v2>0) + -1* int8(v1>0 & v2<=0);
end
function f= third_way()
v1=abs(I-m1);
v2=abs(I-m2);
f= int8(v1>v2) + -1*int8(v1<v2); % need to convert to int from logical
end
disp(['First way : ' num2str(timeit(@first_way))])
disp(['Second way: ' num2str(timeit(@second_way))])
disp(['Third way : ' num2str(timeit(@third_way))])
end
The output:
First way : 9.4226e-06
Second way: 1.2247e-05
Third way : 1.1546e-05
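One more way to see why the square can be dropped: by the difference of squares, (I-m1)^2 - (I-m2)^2 = (m2-m1)(2I-m1-m2), so the sign is the product of the signs of the two factors, and it also equals the sign of |I-m1| - |I-m2|. The NumPy sketch below checks this on the data from the question, mapping ties to 1 as in the question's first_way; the function names are made up for illustration, and no timing claim is intended.

```python
import numpy as np

I = np.array([[16, 23, 11, 42, 10],
              [11, 21, 22, 24, 30],
              [16, 22, 154, 155, 156],
              [25, 28, 145, 151, 156],
              [11, 38, 147, 144, 153]])
m1 = np.array([[0, 0, 0, 0, 0],
               [0, 0, 22, 11, 0],
               [0, 23, 34, 56, 0],
               [0, 56, 0, 0, 0],
               [0, 11, 0, 0, 0]])
m2 = np.array([[0, 0, 0, 0, 0],
               [0, 0, 12, 11, 0],
               [0, 22, 111, 156, 0],
               [0, 32, 0, 0, 0],
               [0, 12, 0, 0, 0]])


def sign_by_squares(I, m1, m2):
    """Reference: sign of the difference of squares, ties mapped to 1."""
    f = np.sign((I - m1) ** 2 - (I - m2) ** 2)
    f[f == 0] = 1
    return f


def sign_by_factors(I, m1, m2):
    """Same sign via (m2 - m1) * (2I - m1 - m2), without squaring or abs."""
    f = np.sign(m2 - m1) * np.sign(2 * I - m1 - m2)
    f[f == 0] = 1
    return f


def sign_by_abs(I, m1, m2):
    """Same sign via a comparison of absolute differences."""
    return np.where(np.abs(I - m1) >= np.abs(I - m2), 1, -1)


f = sign_by_squares(I, m1, m2)
print(np.argwhere(f == -1))  # [[1 2]] -> the single -1 at row 2, column 3 (1-based)
```

All three functions return the f shown in the question for this input; which one is fastest depends on the array library and sizes, so it is worth timing on real data.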
