Pandas pivot table Nested Sorting - sorting

Given this data frame and pivot table:
import numpy as np
import pandas as pd
df=pd.DataFrame({'A':['x','y','z','x','y','z'],
'B':['one','one','one','two','two','two'],
'C':[7,5,3,4,1,6]})
df
A B C
0 x one 7
1 y one 5
2 z one 3
3 x two 4
4 y two 1
5 z two 6
table = pd.pivot_table(df, index=['A', 'B'],aggfunc=np.sum)
table
A B
x one 7
two 4
y one 5
two 1
z one 3
two 6
Name: C, dtype: int64
I want to sort the pivot table so that the order of 'A' is z, x, y and, within each value of 'A', the rows of 'B' are ordered by the values of column 'C' in descending order.
Like this:
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64
Thanks in advance!

I don't believe there is an easy way to accomplish your objective. The following solution first sorts your table in descending order based on the values of column C. It then concatenates the slice for each value of A in your desired order.
order = ['z', 'x', 'y']
table = table.reset_index().sort_values('C', ascending=False)
>>> pd.concat([table.loc[table.A == val, :].set_index(['A', 'B']) for val in order])
C
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1

If you can read in column A as categorical data, then it becomes much more straightforward. Setting your categories as list('zxy') and specifying ordered=True uses your custom ordering.
You can read in your data using something similar to:
'A':pd.Categorical(['x','y','z','x','y','z'], list('zxy'), ordered=True)
Alternatively, you can read in the data as you currently are, then use astype to convert A to categorical:
df['A'] = df['A'].astype(pd.CategoricalDtype(categories=list('zxy'), ordered=True))
Once A is categorical, you can pivot the same as before, and then sort with:
table = table.sort_values(ascending=False).sort_index(level=0, sort_remaining=False)
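A minimal sketch of the categorical route on a current pandas version may help here. It assumes the pivot is flattened with reset_index before sorting, which is my own simplification rather than something the answer above prescribes; the names flat and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'y', 'z', 'x', 'y', 'z'],
    'B': ['one', 'one', 'one', 'two', 'two', 'two'],
    'C': [7, 5, 3, 4, 1, 6],
})

# Pivot as in the question, then flatten so A and C are ordinary columns.
table = pd.pivot_table(df, index=['A', 'B'], values='C', aggfunc='sum')
flat = table.reset_index()

# An ordered categorical makes 'z' < 'x' < 'y' for sorting purposes.
flat['A'] = pd.Categorical(flat['A'], categories=['z', 'x', 'y'], ordered=True)

# Sort by the custom A order first, then by C descending within each A.
result = (flat.sort_values(['A', 'C'], ascending=[True, False])
              .set_index(['A', 'B'])['C'])
print(result)
```

Sorting on both columns in one sort_values call avoids relying on the stability of a second sort over the index.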

Solution
custom_order = ['z', 'x', 'y']
kwargs = dict(axis=0, level=0, drop_level=False)
new_table = pd.concat(
[table.xs(idx_v, **kwargs).sort_values(ascending=False) for idx_v in custom_order]
)
Alternate one liner
pd.concat([table.xs(i, drop_level=False).sort_values(ascending=False) for i in list('zxy')])
Explanation
custom_order is your desired order.
kwargs is a convenient way to improve readability (in my opinion). Key elements to note, axis=0 and level=0 might be important for you if you want to leverage this further. However, those are also the default values and can be left out.
drop_level=False is the key argument here and is necessary to keep the idx_v we are taking a xs of such that the pd.concat puts it all together in the way we'd like.
I use a list comprehension in almost the exact same manner as Alexander within the pd.concat call.
Demonstration
print(new_table)
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64

Related

matlab matrix indexing of multiple columns

Say I have a IxJ matrix of values,
V= [1,4;2,5;3,6];
and a IxR matrix X of indexes,
X = [1 2 1 ; 1 2 2 ; 2 1 2];
I want to get a matrix Vx that is IxR such that for each row i, I want to read R times a (potentially) different column of V, which are given by the numbers in each corresponding column in X.
Vx(i,r) = V(i,X(i,r)).
For instance in this case it would be
Vx = [1,4,1;2,5,5;6,3,6];
Any help to do this fast, (without any looping) is much appreciated!
What you want is to use vectorization for speed, which is one of MATLAB's major strengths. You need a matrix (index in the following code) whose elements are linear indices used to pick values out of the source matrix (V in your case). The first two lines of code do exactly what sub2ind does, turning subscripts into linear indices; I wrote it out this way so the logic of the index conversion is clear.
[m,n] = ndgrid(1:size(X,1),1:size(X,2));
index = m + (X-1)*size(X,1);
Vx = V(index);
You can use bsxfun for an efficient solution -
N = size(V,1)
Vx = V(bsxfun(@plus,[1:N]',(X-1)*N))
Sample run -
>> V
V =
1 4
2 5
3 6
>> X
X =
1 2 1
1 2 2
2 1 2
>> N = size(V,1);
Vx = V(bsxfun(@plus,[1:N]',(X-1)*N))
Vx =
1 4 1
2 5 5
6 3 6
Another method would be to use repmat combined with sub2ind. sub2ind takes row and column locations and returns column-major linear indices that you can use to vectorize access into a matrix. Specifically, you want to build a 2D matrix of row indices and a 2D matrix of column indices, both the same size as X, where the column indices are exactly X and the row indices are constant along each row. Concretely, the first row of the row-index matrix will be all 1s, the next row all 2s, and so on. To build this row matrix, first generate a column vector that goes from 1 up to the number of rows in X, then replicate it for as many columns as there are in X. With this new matrix and X, use sub2ind to generate column-major linear indices and finally index V to produce the matrix Vx:
subs = repmat((1:size(X,1)).', [1 size(X,2)]); %'
ind = sub2ind(size(V), subs, X); % linear indices must be computed against the size of V, the matrix being indexed
Vx = V(ind);
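For readers who want to check the index arithmetic outside MATLAB, here is a rough NumPy sketch of the same idea. It is an illustration under my own assumptions (0-based indexing, so X is shifted by one), not part of any of the answers above.

```python
import numpy as np

V = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 2, 1], [1, 2, 2], [2, 1, 2]])  # 1-based columns, as in the question

# Row numbers as a column vector; broadcasting pairs row i with every X[i, r].
rows = np.arange(V.shape[0])[:, None]
Vx = V[rows, X - 1]

# The same result through explicit column-major linear indices,
# mirroring m + (X-1)*size(V,1) from the MATLAB answers.
linear = rows + (X - 1) * V.shape[0]
Vx_linear = V.ravel(order='F')[linear]

print(Vx)
```

Both routes give the matrix requested in the question, which is a quick way to confirm that the linear-index formula is right.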

What is the fast way to calculate this summation in MATLAB?

So I have the following constraints:
How to write this in MATLAB in an efficient way? The inputs are x_mn, M, and N. The set B={1,...,N} and the set U={1,...,M}
I did it like this (because I write x as the following vector)
x=[x_11, x_12, ..., x_1N, x_21, x_22, ..., x_M1, x_M2, ..., x_MN]:
%# first constraint
function R1 = constraint_1(M, N)
ee = eye(N);
R1 = zeros(N, N*M);
for m = 1:M
R1(:, (m-1)*N+1:m*N) = ee;
end
end
%# second constraint
function R2 = constraint_2(M, N)
ee = ones(1, N);
R2 = zeros(M, N*M);
for m = 1:M
R2(m, (m-1)*N+1:m*N) = ee;
end
end
By the above code I will get a matrix A=[R1; R2] with 0-1 and I will have A*x<=1.
For example, M=N=2, I will have something like this:
And, I will create a function test(x) which returns true or false according to x.
I would like to get some help from you and optimize my code.
You should place your x_mn values in a matrix. After that, you can sum in each dimension to get what you want. Looking at your constraints, you will place these values in an M x N matrix, where M is the amount of rows and N is the amount of columns.
You can certainly place your values in a vector and construct your summations in the way you intended earlier, but you would have to write for loops to properly subset the proper elements in each iteration, which is very inefficient. Instead, use a matrix, and use sum to sum over the dimensions you want.
For example, let's say your values of x_mn ranged from 1 to 20. B is in the set from 1 to 5 and U is in the set from 1 to 4. As such:
X = vec2mat(1:20, 5)
X =
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
vec2mat takes a vector and reshapes it into a matrix. You specify the number of columns you want as the second element, and it will create the right amount of rows to ensure that a proper matrix is built. In this case, I want 5 columns, so this should create a 4 x 5 matrix.
The first constraint can be achieved by doing:
first = sum(X,1)
first =
34 38 42 46 50
sum works for vectors as well as matrices. If you have a matrix supplied to sum, you can specify a second parameter that tells you in what direction you wish to sum. In this case, specifying 1 will sum over all of the rows for each column. It works in the first dimension, which is the rows.
What this is doing is it is summing over all possible values in the set B over all values of U, which is what we are exactly doing here. You are simply summing every single column individually.
The second constraint can be achieved by doing:
second = sum(X,2)
second =
15
40
65
90
Here we specify 2 as the second parameter so that we can sum over all of the columns for each row. The second dimension goes over the columns. What this is doing is it is summing over all possible values in the set U over all values of B. Basically, you are simply summing every single row individually.
BTW, your code is not achieving what you think it's achieving. All you're doing is simply replicating the identity matrix a set number of times over groups of columns in your matrix. You are actually not performing any summations as per the constraint. What you are doing is you are simply ensuring that this matrix will have the conditions you specified at the beginning of your post to be enforced. These are the ideal matrices that are required to satisfy the constraints.
Now, if you want to check to see if the first condition or second condition is satisfied, you can do:
%// First condition satisfied?
firstSatisfied = all(first <= 1);
%// Second condition satisfied
secondSatisfied = all(second <= 1);
This will check every element of first or second and see if the resulting sums after you do the above code that I just showed are all <= 1. If they all satisfy this constraint, we will have true. Else, we have false.
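As a cross-check, the sums and the test can be sketched in NumPy. This is only an illustration: the function names are hypothetical, and constraint_matrix is my own vectorized restatement of the loops in the question (using tile and kron), included to show that A*x produces the same column and row sums.

```python
import numpy as np

def constraint_sums(X):
    """Return (column sums, row sums) of an M x N matrix."""
    X = np.asarray(X)
    return X.sum(axis=0), X.sum(axis=1)

def constraints_satisfied(X):
    """True when every column sum and every row sum is <= 1."""
    first, second = constraint_sums(X)
    return bool((first <= 1).all() and (second <= 1).all())

def constraint_matrix(M, N):
    """A = [R1; R2] for x stored row by row as x_11, ..., x_1N, x_21, ..."""
    R1 = np.tile(np.eye(N), (1, M))            # identity repeated M times
    R2 = np.kron(np.eye(M), np.ones((1, N)))   # one block of ones per row of X
    return np.vstack([R1, R2])

X = np.arange(1, 21).reshape(4, 5)
first, second = constraint_sums(X)
A = constraint_matrix(4, 5)
print(first, second, constraints_satisfied(X))
```

Multiplying A by the flattened X reproduces the two sum vectors stacked together, so the matrix form and the sum form describe the same constraints.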
Please let me know if you need anything further.

Efficient database lookup based on input where not all digits are significant

I would like to do a database lookup based on a 10 digit numeric value where only the first n digits are significant. Assume that there is no way in advance to determine n by looking at the value.
For example, I receive the value 5432154321. The corresponding entry (if it exists) might have key 54 or 543215 or any value based on n being somewhere between 1 and 10 inclusive.
Is there any efficient approach to matching on such a string short of simply trying all 10 possibilities?
Some background
The value is from a barcode scan. The barcodes are EAN13 restricted circulation numbers so they have the following structure:
02[1234567890]C
where C is a check sum. The 10 digits in between the 02 and the check sum consist of an item identifier followed by an item measure. There might be a check digit after the item identifier.
Since I can't depend on the data to adhere to any single standard, I would like to be able to define on an ad-hoc basis, how particular barcodes are structured which means that the portion of the 10 digit number that I extract, can be any length between 1 and 10.
Just a few ideas here:
1)
Maybe store these numbers in reversed form in your DB.
If you have N = 54321 you store it as N = 12345 in the DB.
Say N is the name of the column you stored it in.
When you read K = 5432154321, reverse this one too,
you get K1 = 1234512345, now check the DB column N
(whose value is let's say P), if K1 % 10^s == P,
where s = floor(log10(P)) + 1.
Note: floor(log10(P)) + 1 is a formula for
the count of digits of the number P > 0
(the logarithm must be base 10, not natural).
You may also store this digit count in the DB
as a precomputed value, so that
you don't need to compute it each time.
2) As this 1) is kind of sick (but maybe best of the 3 ideas here),
maybe you just store them in a string column and check it with
'like operator'. But this is trivial, you probably considered it
already.
3) Or ... you store the numbers reversed, but you also
store all their residues mod 10^k for k=1...10.
col1, col2,..., col10
Then you can compare numbers almost directly,
the check will be something like
N % 10 == col1
or
N % 100 == col2
or
...
(N % 10^10) == col10.
Still not very elegant though (and not quite sure
if applicable to your case).
I decided to check my idea 1).
So here is an example
(I did it in SQL Server).
insert into numbers
(number, cnt_dig)
values
(1234, 1 + floor(log10(1234)))
insert into numbers
(number, cnt_dig)
values
(51234, 1 + floor(log10(51234)))
insert into numbers
(number, cnt_dig)
values
(7812334, 1 + floor(log10(7812334)))
select * From numbers
/*
Now we have this in our table:
id number cnt_dig
4 1234 4
5 51234 5
6 7812334 7
*/
-- Note that the actual numbers stored here
-- are the reversed ones: 4321, 43215, 4332187.
-- So far so good.
-- Now we read say K = 433218799 on the input
-- We reverse it and we get K1 = 997812334
declare @K1 bigint
set @K1 = 997812334
select * From numbers
where
@K1 % power(10, cnt_dig) = number
-- So from the last 3 queries,
-- we get this row:
-- id number cnt_dig
-- 6 7812334 7
--
-- meaning we have a match
-- i.e. the actual number 433218799
-- was matched successfully with the
-- actual number (from the DB) 4332187.
So this idea 1) doesn't seem that bad after all.
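The same check can be tried with Python's built-in sqlite3 module. This is a sketch under two assumptions of mine: the table stores a precomputed modulus (10 to the power of the digit count) instead of cnt_dig, because power() is not available in every SQLite build, and keys are passed as digit strings so that zeros lost by reversing can be restored. The helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (number INTEGER, modulus INTEGER)")

def store_key(key):
    """Store a key (digit string) reversed, with modulus = 10 ** number_of_digits."""
    conn.execute("INSERT INTO numbers VALUES (?, ?)",
                 (int(key[::-1]), 10 ** len(key)))

def lookup(scanned):
    """Return every stored key that is a prefix of the scanned digit string."""
    k1 = int(scanned[::-1])
    rows = conn.execute(
        "SELECT number, modulus FROM numbers WHERE ? % modulus = number",
        (k1,),
    ).fetchall()
    # Pad back to the original digit count, then undo the reversal.
    return [str(n).zfill(len(str(m)) - 1)[::-1] for n, m in rows]

for key in ("4321", "43215", "4332187"):
    store_key(key)

print(lookup("433218799"))
```

Taking the digit count from the original key rather than from the reversed number matters for keys that end in 0, since the reversed integer would otherwise lose a digit.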

Leads to find tables correlation

Have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of using Apriori in some way, but it doesn't seem too practical. What do you think?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
for each column c1 in table1
    for each column c2 in table2
        if approximately_isomorphic(c1, c2) then
            emit (c1, c2)

approximately_isomorphic(c1, c2)
    hmap = hash()
    for i = 1 to min(|c1|, |c2|) do
        hmap[(c1[i], c2[i])] = true    // one entry per distinct linking
    if |hmap| - unique_count(c1) < error_margin then return true
    else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
Example on your input:
ID & anything : perfect isomorphism since all of ID are unique
Opt1 & ID : 4 mappings and 3 unique values; not a perfect
isomorphism, but not too far away.
Opt1 & Opt1 : ditto above
Opt1 & Type : 3 mappings & 3 unique values, perfect isomorphism
Opt2 & ID : 4 mappings & 3 unique values, not a perfect
isomorphism, but not too far away
Opt2 & Opt2 : ditto above
Opt2 & Type : ditto above
Type & anything: perfect isomorphism since all of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
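A small Python sketch of this procedure on the two example tables follows. It reflects my reading of the pseudocode: a linking is a distinct (c1, c2) pair, and both directions are checked with an error margin of 1, which is an arbitrary choice for illustration. The function and variable names are hypothetical.

```python
def mapping_excess(c1, c2):
    """Distinct (c1, c2) linkings minus distinct c1 values among them."""
    pairs = set(zip(c1, c2))            # zip stops at the shorter column
    firsts = {a for a, _ in pairs}
    return len(pairs) - len(firsts)

def approximately_isomorphic(c1, c2, error_margin=1):
    return mapping_excess(c1, c2) < error_margin

table_a = {"ID": [1, 2, 3, 4], "Opt1": list("ABCC"),
           "Opt2": list("ZYZK"), "Type": [10, 20, 30, 40]}
table_b = {"ID": [1, 2, 3, 4], "Opt1": list("ZZXZ"),
           "Type": [57, 99, 3000, 3000]}

# Keep only column pairs that pass the check in both directions.
candidates = [
    (name_a, name_b)
    for name_a, col_a in table_a.items()
    for name_b, col_b in table_b.items()
    if approximately_isomorphic(col_a, col_b)
    and approximately_isomorphic(col_b, col_a)
]
print(candidates)
```

On this input the two-way check keeps Opt1 of TableA against Type of TableB, but it also keeps pairs where both columns are entirely unique, such as ID against ID, so those trivial matches would still need to be filtered or ranked separately.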

octave matrix for loop performance

I am new to Octave. I have two matrices, A and B, each with more than five columns (variables). I need to compare a particular column of A with the corresponding column of B, and wherever an element in column one of A equals an element in B, use the third column of B to compute certain values. I currently do this in Octave with a for loop, but because the matrices are very large it takes a long time to process a single day of data, and I have to do this for a whole year. Please suggest an alternative that reduces the computation time.
Thank you in advance.
Thanks for your quick response -hfs
Continuing the same problem:
Thank you, but this works only when the elements in the same row of both matrices are equal. For example, my matrices look like this:
A=[1 2 3;4 5 6;7 8 9;6 9 1]
B=[1 2 4; 4 2 6; 7 5 8;3 8 4]
Here the first row of A matches the first row of B in both column 1 and column 2, so I can take the third element of that row of B. In the second row, column 1 matches in A and B but column 2 differs; in that case the code should search the other rows for a match and print the element in the third column. I am doing this with a for loop, which is very slow because of the large dimensions. In my actual problem the loop is written as follows:
for k=1:37651
  for j=1:26018
    if (s(k,1:2)==l(j,1:2))
      z=sin((90-s(k,3))*pi/180), break, end
  end
end
I want an alternative way to do this which should be faster than this.
You should work with complete matrices or vectors whenever possible. You should try commands and inspect intermediate results in the interactive shell to see how they fit together.
A(:,1)
selects the first column of a matrix. You can compare matrices/vectors and the result is a matrix/vector of 0/1 again:
> A(:,1) == B(:,1)
ans =
1
1
0
If you assign the result you can use it again to index into matrices:
I = A(:,1) == B(:,1)
B(I, 3)
This selects the third column of B of those rows where the first column of A and B is equal.
I hope this gets you started.
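For the follow-up case, where a row of one matrix may match any row of the other on the first two columns, one way to avoid the nested loop is a hash lookup. The Python sketch below is only an illustration of that idea under my own assumptions (rows given as lists, the value computed from the third column of the first matrix as in the posted loop); in Octave, ismember with the 'rows' option should play a similar role.

```python
import math

def matched_values(s, l):
    """For each row of s whose first two columns appear in some row of l,
    return {row index: sin((90 - s[k][2]) degrees)}."""
    keys = {(row[0], row[1]) for row in l}       # built once, O(len(l))
    return {
        k: math.sin(math.radians(90 - row[2]))
        for k, row in enumerate(s)
        if (row[0], row[1]) in keys              # O(1) membership test
    }

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [6, 9, 1]]
B = [[1, 2, 4], [4, 2, 6], [7, 5, 8], [3, 8, 4]]
z = matched_values(A, B)
print(z)
```

This does work proportional to the number of rows in both matrices combined, instead of their product as in the double loop.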
