Transforming data from a row to a column in Python pandas

In a DataFrame:

DATA1  DATA2  DATA3  DATA4  DATA5  DATA6  DATA7
-----------------------------------------------
   12     13     14     15     16     17     11

how do I transpose the data to the following?

In rows
-------
12
13
14
15
16
17
11

df.T will transpose your DataFrame
In [9]: import pandas as pd

In [10]: df = pd.DataFrame([{'data1': 12, 'data2': 13, 'data3': 14}])

In [11]: df
Out[11]:
   data1  data2  data3
0     12     13     14

In [12]: df.T
Out[12]:
        0
data1  12
data2  13
data3  14
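For the exact frame in the question, a short sketch (the column names are taken from the table above; df.iloc[0] is an alternative that returns the row as a Series):

import pandas as pd

# One-row frame matching the question's table.
df = pd.DataFrame([[12, 13, 14, 15, 16, 17, 11]],
                  columns=['DATA1', 'DATA2', 'DATA3', 'DATA4',
                           'DATA5', 'DATA6', 'DATA7'])

print(df.T)        # values become a single column, indexed by the old column names
print(df.iloc[0])  # or: pull the first row out as a Series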

Related

Algorithm to find best option

I have multiple key:value pairs and need to find the best possible option.
Product  Price  Product  Price  Product  Price
label1   11     label2   12     label3   13
label4   14     label5   15     label6   16
I need to enumerate all possible options, starting from the first column, and find the best solution, like in this example:
Product  Price  Product  Price  Product  Price  SUM  Result
label1   11     label2   12     label3   13     36
label1   11     label2   12     label6   16     39
label1   11     label5   15     label3   13     39
label1   11     label5   15     label6   16     42
label4   14     label5   15     label6   16     45   label4-label5-label6
label4   14     label5   15     label3   13     42
label4   14     label2   12     label3   13     39
label4   14     label2   12     label6   16     42
Any language is fine; I just need to understand the algorithm.
It seems like you are trying to maximize the sum and there is no constraint other than having only one item from each column. In that case: just take the item with the maximum value from each column.
Pseudo code:
res = []
for each col in cols:
    item = getMaxItem(col)
    res.push(item)

getMaxItem(col):
    item = col[0]
    for each i in col:
        if i.price > item.price:
            item = i
    return item
This would be an O(n) solution, with n being the number of items in all the columns.
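A minimal runnable sketch of this idea in Python, with the column data hardcoded from the example table above:

# Pick the maximum-priced item from each column; the columns are
# hardcoded from the example table.
columns = [
    [('label1', 11), ('label4', 14)],
    [('label2', 12), ('label5', 15)],
    [('label3', 13), ('label6', 16)],
]

best = [max(col, key=lambda item: item[1]) for col in columns]
total = sum(price for _, price in best)
print('-'.join(label for label, _ in best), '=', total)
# prints: label4-label5-label6 = 45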

I tried 2 styles of nested loops in Golang, but they produce different output

I have this quiz where you should produce output like the listing below. I searched YouTube tutorials for "for golang", and they explain that Go has 2 styles of for loop.
1
21
11
12
13
14
22
11
12
13
14
23
11
12
13
14
24
11
12
13
14
2
21
11
12
13
14
22
11
12
13
14
23
11
12
13
14
24
11
12
13
14
3
21
11
12
13
14
22
11
12
13
14
23
11
12
13
14
24
11
12
13
14
4
21
11
12
13
14
22
11
12
13
14
23
11
12
13
14
24
11
12
13
14
5
21
11
12
13
14
22
11
12
13
14
23
11
12
13
14
24
11
12
13
14
It should be output vertically, not horizontally, so I made 3 variables, i = 1, j = 21, and k = 11, and used for loops to increase the values automatically. The 1st style worked, but the 2nd style somehow gives different output.
YT video: https://www.youtube.com/watch?v=jZ-llP_yKNo (at 5:28 he explains that for has 2 styles).
1st style :
for i := 1; i <= 5; i++ {
    fmt.Println(i)
    for j := 21; j <= 24; j++ {
        println(j)
        for k := 11; k <= 14; k++ {
            fmt.Println(k)
        }
    }
}
2nd style :
i := 1
j := 21
k := 11
for i <= 5 {
    fmt.Println(i)
    i++
    for j <= 24 {
        println(j)
        j++
        for k <= 14 {
            fmt.Println(k)
            k++
        }
    }
}
It's not about the syntax but about your logic.
In the 1st style, with for i := ..., every pass of the outer loop resets the inner counters to their initial state: j always starts again at 21 and k at 11, so the inner loops run many times.
In contrast, in the 2nd style you initialize j and k once, before entering the loops. So on the second pass of the i loop, j and k still hold their final values, 25 and 15 respectively, and the inner loops never run again.
There are multiple options for printing the output in Go.
fmt.Println appends a newline at the end.
fmt.Printf prints the content as is.
For more details, read the documentation.
for i := 1; i <= 5; i++ {
    fmt.Printf("%v ", i)
    for j := 21; j <= 24; j++ {
        fmt.Printf("%v ", j)
        for k := 11; k <= 14; k++ {
            fmt.Printf("%v ", k)
        }
    }
}
Output
1 21 11 12 13 14 22 11 12 13 14 23 11 12 13 14 24 11 12 13 14 2 21 11 12 13 14 22 11 12 13 14 23 11 12 13 14 24 11 12 13 14 3 21 11 12 13 14 22 11 12 13 14 23 11 12 13 14 24 11 12 13 14 4 21 11 12 13 14 22 11 12 13 14 23 11 12 13 14 24 11 12 13 14 5 21 11 12 13 14 22 11 12 13 14 23 11 12 13 14 24 11 12 13 14
To add a new line use the \n escape sequence.
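For the 2nd style itself, the fix is to re-initialize the inner counters inside the outer loops, so each pass starts them over. A minimal runnable sketch (same bounds as the quiz, using fmt.Println, which appends the newline that makes the output vertical):

package main

import "fmt"

func main() {
    i := 1
    for i <= 5 {
        fmt.Println(i)
        i++
        j := 21 // re-initialized on every pass of the outer loop
        for j <= 24 {
            fmt.Println(j)
            j++
            k := 11 // re-initialized on every pass of the middle loop
            for k <= 14 {
                fmt.Println(k)
                k++
            }
        }
    }
}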

The Traveling Salesman algorithm bug

I have tried to make an algorithm solving the traveling salesman problem as follows:
% main function:
[siz, ~] = size(table);
done(1:siz) = false;
done(1) = true;
[dist, path] = bruteForce(table, done, 1);

% function bruteForce:
function [distance, path] = bruteForce(table, done, index)
    size = length(done);
    dmin = inf;
    distance = 0;
    path = [];
    % finding minimum distance
    for i = 1:size
        if ~done(i)
            done(i) = true;
            % iterating through all nodes using recursion
            [d, p] = bruteForce(table, done, i);
            if (d < dmin)
                dmin = d;
                path = [i p];
                distance = dmin + table(i, index);
            end
            % freeing the node again
            done(i) = false;
        end
    end
    if distance == 0
        distance = table(1, index);
        path = 1;
    end
end
Unfortunately, for the following matrix:
B = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
Instead of getting the expected result:
1-8-5-4-10-6-3-7-2-11-9-1 = 253km
I get:
1-8-11-3-4-6-10-5-9-2-7-1 = 271km
Could you help me find the bug?
If brute force is a must and speed is no issue, then just use the perms function for the number of cities. This allows for an easy implementation:
table = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
[siz, ~] = size(table);
[bp, b] = bruteForce(table, siz)

function [bestpath, best] = bruteForce(table, siz)
    p = perms(1:siz);
    [r, c] = size(p);
    best = inf;
    for i = 1:r
        path = p(i, :);
        dist = distCalculatorReturn(table, path);
        if dist < best
            best = dist;
            bestpath = path;
        end
    end
    bestpath = [bestpath, bestpath(1)];
end

function [totaldist] = distCalculatorReturn(distMatrix, proposedPath)
    dist = 0;
    i = 1;
    while i ~= length(proposedPath)
        dist = dist + distMatrix(proposedPath(i), proposedPath(i+1));
        i = i + 1;
    end
    dist = dist + distMatrix(proposedPath(1), proposedPath(end));
    totaldist = dist;
end
This yields the answer you are looking for. However, if you are only solving problems of this size, why not apply standard simulated annealing? That gives much faster solution times and should handle this problem size consistently:
table = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
[path, dist] = tsp(table, length(table))

function [path, dist] = tsp(D, n)
    L = 40*n;
    epsi = 1e-9;
    x = randperm(n);
    fx = distCalculatorReturn(D, x);
    T = 1000000;
    while T > epsi
        for i = 1:L
            num1 = 1 + floor(rand*n);
            num2 = 1 + floor(rand*n);
            while num1 == num2
                num1 = 1 + floor(rand*n);
            end
            y = x;
            swap1 = y(num1);
            y(num1) = y(num2);
            y(num2) = swap1;
            fy = distCalculatorReturn(D, y);
            if fy < fx
                x = y;
                fx = fy;
            elseif rand < exp(-(fy - fx)/T)
                x = y;
                fx = fy;
            end
        end
        T = 0.9*T;
    end
    path = [x, x(1)];
    dist = fx;
end
Your code does not compute the distance for each possible path (as the name bruteForce suggests). Instead, it always starts at node 1 and from there always goes to the node closest to the current one. As your example shows, that does not necessarily lead to the overall shortest path. You need to go through all possible paths to be sure you find the optimum.
Here is my go at your problem:
% distance matrix
B = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
% compute all possible paths assuming we always start at node 1
nNodes = size(B, 1);
paths = perms(2:nNodes);
nPaths = size(paths, 1);
paths = [ones(nPaths, 1) paths ones(nPaths, 1)]; % start and finish tour at node 1

% with a random start point:
% paths = perms(1:nNodes);
% paths = [perms(1:nNodes) paths(:,1)];

% compute overall distance for each path
distance = inf;
for idx = 1:nPaths
    from = paths(idx, 1:end-1);
    to = paths(idx, 2:end);
    d = sum(diag(B(from, to)));
    if d < distance
        distance = d;
        optPath = paths(idx, :);
    end
end
This leads to the following result:
optPath = [1 9 11 2 7 3 6 10 4 5 8 1]
distance = 253
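As a quick cross-check of that result, here is a short Python sketch that recomputes the length of the reported tour over the same matrix (converting the 1-based city numbers to 0-based indices):

# Distance matrix B from the question, as a list of rows.
B = [
    [0, 29, 20, 21, 16, 31, 100, 12, 4, 31, 18],
    [29, 0, 15, 29, 28, 40, 72, 21, 29, 41, 12],
    [20, 15, 0, 15, 14, 25, 81, 9, 23, 27, 13],
    [21, 29, 15, 0, 4, 12, 92, 12, 25, 13, 25],
    [16, 28, 14, 4, 0, 16, 94, 9, 20, 16, 22],
    [31, 40, 25, 12, 16, 0, 95, 24, 36, 3, 37],
    [100, 72, 81, 92, 94, 95, 0, 90, 101, 99, 84],
    [12, 21, 9, 12, 9, 24, 90, 0, 15, 25, 13],
    [4, 29, 23, 25, 20, 36, 101, 15, 0, 35, 18],
    [31, 41, 27, 13, 16, 3, 99, 25, 35, 0, 38],
    [18, 12, 13, 25, 22, 37, 84, 13, 18, 38, 0],
]

tour = [1, 9, 11, 2, 7, 3, 6, 10, 4, 5, 8, 1]  # optPath from above
print(sum(B[a - 1][b - 1] for a, b in zip(tour, tour[1:])))  # 253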

How to speed up Pandas multilevel dataframe shift by group?

I am trying to shift a Pandas DataFrame's column data by group on the first index level. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000     -0.5505  0.7445 -0.3645
       B001      0.9129 -1.0473 -0.5478
       B002      0.8016  0.0292  0.9002
       B003      2.0744 -0.2942 -0.7117
A0001  B000      0.7064  0.9636  0.2805
       B001      0.4763  0.2741 -1.2437
       B002      1.1563  0.0525 -0.7603
       B003     -0.4334  0.2510 -0.0105
A0002  B000     -0.6443  0.1723  0.2657
       B001      1.0719  0.0538 -0.0641
       B002      0.6787 -0.3386  0.6757
       B003     -0.3940 -1.2927  0.3892
A0003  B000     -0.5862 -0.6320  0.6196
       B001     -0.1129 -0.9774  0.7112
       B002      0.6303 -1.2849 -0.4777
       B003      0.5046 -0.4717 -0.2133
A0004  B000      1.6420 -0.9441  1.7167
       B001      0.1487  0.1239  0.6848
       B002      0.6139 -1.9085 -1.9508
       B003      0.3408 -1.3891  0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000         NaN     NaN     NaN
       B001     -0.5505  0.7445 -0.3645
       B002      0.9129 -1.0473 -0.5478
       B003      0.8016  0.0292  0.9002
A0001  B000         NaN     NaN     NaN
       B001      0.7064  0.9636  0.2805
       B002      0.4763  0.2741 -1.2437
       B003      1.1563  0.0525 -0.7603
A0002  B000         NaN     NaN     NaN
       B001     -0.6443  0.1723  0.2657
       B002      1.0719  0.0538 -0.0641
       B003      0.6787 -0.3386  0.6757
A0003  B000         NaN     NaN     NaN
       B001     -0.5862 -0.6320  0.6196
       B002     -0.1129 -0.9774  0.7112
       B003      0.6303 -1.2849 -0.4777
A0004  B000         NaN     NaN     NaN
       B001      1.6420 -0.9441  1.7167
       B002      0.1487  0.1239  0.6848
       B003      0.6139 -1.9085 -1.9508
The mul_df() code is attached here : How to speed up Pandas multilevel dataframe sum?
Now I want to run grp.shift(1) on a big DataFrame:
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23s is too slow. How to speed it up ?
(My computer configuration: Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))
How about shifting the whole DataFrame object and then setting the first row of every group to NaN?
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
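A self-contained sketch of that idea, using a small hand-made two-level frame in place of mul_df (which is not reproduced here):

import numpy as np
import pandas as pd

# A small stand-in for mul_df(...): two groups of three rows each.
idx = pd.MultiIndex.from_product(
    [['A0000', 'A0001'], ['B000', 'B001', 'B002']],
    names=['STK_ID', 'RPT_Date'])
df = pd.DataFrame({'COL000': range(6), 'COL001': range(10, 16)}, index=idx)

# Shift the whole frame once, then blank the first row of every group.
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

# Should match the slow per-group version.
assert dfs.equals(df.groupby(level=0).shift(1))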
The problem is that the shift operation is not Cython-optimized, so it involves a callback to Python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
I added an issue for this: https://github.com/pydata/pandas/issues/4095
A similar question, with an added answer that works for shifts in either direction and magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
Code (including test setup) is:
#
# the function to use in apply
#
def replace_shift_overlap(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)

#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap, 'tmpShift', shiftBy, np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 15, 30, 45, 43, 67, 22, 12, 14, 54],
                   'B': [13, 23, 18, 33, 48, 1, 7, 56, 66, 45, 32],
                   'C': [17, 27, 22, 37, 52, 77, 34, 21, 22, 90, 8],
                   'D': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na=np.nan):
    '''df: dataframe
    col: column to be shifted
    groupcol: group variable
    shift_n: how much to shift
    fill_na: value used to fill NaN, default is np.nan
    '''
    rowno = list(df.groupby(groupcol).size().cumsum())
    lagged_col = df[col].shift(shift_n)
    na_rows = [i for i in range(shift_n)]
    for i in rowno:
        if i == rowno[len(rowno) - 1]:
            continue
        else:
            new = [i + j for j in range(shift_n)]
            na_rows.extend(new)
    na_rows = list(set(na_rows))
    na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
    lagged_col.iloc[na_rows] = fill_na
    return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0

Update DataTable from another table with LINQ

I have 2 DataTables that look like this:
DataTable 1:
cheie_primara  cheie_secundara  judet  localitate
1              11               A
2              22               B
3              33               C
4              44               D
5              55               A
6              66               B
7              77               C
8              88               D
9              99               A
DataTable 2:
ID_CP  BAN  JUDET  LOCALITATE  ADRESA
1      11   A      aa          random
2      22   B      ss          random
3      33   C      ee          random
4      44   D      xx          random
5      55   A      rr          random
6      66   B      aa          random
7      77   C      ss          random
8      88   D      ee          random
9      99   A      xx          random
and I want to update DataTable 1's ["LOCALITATE"] field using the matching keys DataTable1["cheie_primara"] and DataTable2["ID_CP"].
Like this:
cheie_primara  cheie_secundara  judet  localitate
1              11               A      aa
2              22               B      ss
3              33               C      ee
4              44               D      xx
5              55               A      rr
6              66               B      aa
7              77               C      ss
8              88               D      ee
9              99               A      xx
Is there a LINQ method to update DataTable1?
Thanks!
This is working:
DataTable1.AsEnumerable()
    .Join(DataTable2.AsEnumerable(),
          dt1_Row => dt1_Row.ItemArray[0],
          dt2_Row => dt2_Row.ItemArray[0],
          (dt1_Row, dt2_Row) => new { dt1_Row, dt2_Row })
    .ToList()
    .ForEach(o => o.dt1_Row.SetField(3, o.dt2_Row.ItemArray[3]));
If you want to use LINQ, here's how I'd go about it:
var a = (from d1 in DataTable1
         join d2 in DataTable2 on d1.cheie_primara equals d2.ID_CP
         select new { d1, d2.LOCALITATE }).ToList();
a.ForEach(b => b.d1.localitate = b.LOCALITATE);
