Tableau LOD to find median - filter

I have some data:
Inst  Dest_Group  Dest Cipn1  N
I1    C           a           43
I1    F           a           63
I1    U           a           54
I1    C           b           96
I1    F           b           3
I1    U           b           78
I1    C           c           12
I1    F           c           65
I1    U           c           49
I2    C           a           3
I2    F           a           47
etc...
My worksheet is set up so that [Dest Cipn1] is on rows and [Dest Group] is on columns, and each cell displays [value] as a bar. [value] = {include [Inst] : sum([N])} / {fixed [Inst] : sum([N])}
This worksheet is filtered on [Inst] = I1. I would like to add a reference line that shows the median value for each bar (cell) across all the [Inst]. (In the end I will add a band that displays 25th - 75th percentile but I figured working with the median would be simpler first).
I thought this would work, but it doesn't: [AllInstMedian] = {fixed [Inst],[Dest Group], [Dest Cipn1] : Sum([N])} / {fixed [Inst] : Sum([N])}
Any suggestions? I'm attaching a sample workbook here in the hope that it helps.
This is cross-posted here
Thank you

Steve Mayer answered in a comment on the Tableau link posted in the question. I ended up using a Lookup trick to copy [Inst] and then used table calculations (window_percentile) for the 25th and 75th percentiles.

Related

A problem similar to the travelling salesman problem

My dataset has several features, named A, B, C, D, E, F, G and H.
These features are correlated with one another:
Features Correlation
----------------------
A B 70
A C 78
B C 96
A G 93
.
.
.
Therefore, I would like to group similar features together so that each group can be represented by a single feature, something like this:
Seed   Group        Correlation Avg
-----------------------------------
A      D & G        (98 + 93) / 2 = 95.5
B      F & C & E    (85 + 96 + 79) / 3 ≈ 86.7
..
..
..
H      -            -
This way, all closely correlated features end up in the same group.
Another view of the problem:
There are multiple cities in the country (City A, B, C, D, ..., H).
Each city has a connection to other cities:
Cities Connection %
----------------------
A B 70
A C 78
B C 96
A G 93
.
.
.
We would like to hire area managers so that cities with close connections can be served by the same area manager.
We want to find the optimal number of area managers and where each of them should reside:
Office Area   Other Served Areas   Connection Avg
------------------------------------------------------
A             D & G                (98 + 93) / 2 = 95.5
B             F & C & E            (85 + 96 + 79) / 3 ≈ 86.7
..
..
..
H             -                    -
I am looking for a method to split these features/cities in an optimal way, so that most features/cities are covered with a minimum number of links/area managers.
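One possible starting point, offered only as a sketch and not as an optimal method, is a greedy heuristic: treat every correlation at or above some threshold as a "strong" link, repeatedly pick as seed the feature that covers the most still-uncovered features, and finally attach each remaining feature to the seed it correlates with most strongly. The function name `group_by_seed` and the threshold of 75 are assumptions made for illustration; the pair values are read off the example tables above (for instance, 98 is taken to be the A-D correlation).

```python
from collections import defaultdict

def group_by_seed(nodes, corr, threshold):
    """Greedy grouping: pick seeds, then attach every other node to its best seed."""
    # Keep only the "strong" links (correlation at or above the threshold).
    strong = defaultdict(dict)
    for (u, v), c in corr.items():
        if c >= threshold:
            strong[u][v] = c
            strong[v][u] = c

    uncovered = set(nodes)
    seeds = []
    while uncovered:
        # Pick the uncovered node that covers the most uncovered nodes
        # (itself plus its strong neighbours); ties go to the first name.
        best = max(sorted(uncovered),
                   key=lambda n: len(({n} | set(strong[n])) & uncovered))
        seeds.append(best)
        uncovered -= {best} | set(strong[best])

    # Attach each non-seed node to the seed it correlates with most strongly.
    groups = {s: [] for s in seeds}
    for n in sorted(set(nodes) - set(seeds)):
        best_seed = max(seeds, key=lambda s: strong[n].get(s, -1))
        groups[best_seed].append(n)
    return groups

# Pairs taken from the example tables; the threshold is an assumed cut-off.
corr = {("A", "B"): 70, ("A", "C"): 78, ("B", "C"): 96, ("A", "G"): 93,
        ("A", "D"): 98, ("B", "F"): 85, ("B", "E"): 79}
groups = group_by_seed("ABCDEFGH", corr, threshold=75)
print(groups)  # {'A': ['D', 'G'], 'B': ['C', 'E', 'F'], 'H': []}
```

A greedy choice like this is not guaranteed to give the minimum number of seeds, and the result depends heavily on the chosen threshold.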

How to count number of occurrences in a sorted text file

I have a sorted text file with the following format:
Company1 Company2 Date TransactionAmount
A B 1/1/19 20000
A B 1/4/19 200000
A B 1/19/19 324
A C 2/1/19 3456
A C 2/1/19 663633
A D 1/6/19 3632
B C 1/9/19 84335
B C 1/23/19 253
B C 1/13/19 850
B D 1/1/19 234
B D 1/8/19 635
C D 1/9/19 749
C D 1/10/19 203200
Ultimately I want a Python dictionary so that each pair maps to a list containing the number of transactions and the total amount of all transactions. For instance, (A,B) would map to [3,220324].
The file has ~250,000 lines in this format and each pair may have 1 transaction up to ~10 or so transactions. There are also tens of thousands of pairs of companies.
Here's the only way I've thought of implementing it.
my_dict = {}
# Read every line, skipping the header row.
with open("my_file.txt") as f:
    rows = f.readlines()[1:]
for row in rows:
    fields = row.split()
    pair = (fields[0], fields[1])
    amt = int(fields[3])
    if pair in my_dict:
        # The list is mutated in place, so no reassignment is needed.
        my_dict[pair][0] += 1
        my_dict[pair][1] += amt
    else:
        my_dict[pair] = [1, amt]
I feel like there is a faster way to do this. Any ideas?
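As a hedged sketch of one alternative, the same count-and-total mapping can be built with `collections.defaultdict`, iterating over the lines one at a time instead of loading the whole file first. The function name `summarize` is made up for illustration, and the sample lines are copied from the data above; whether this is measurably faster on the real file is untested.

```python
from collections import defaultdict

def summarize(lines):
    """Map (company1, company2) -> [transaction count, total amount]."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        company1, company2, _date, amount = line.split()
        entry = totals[(company1, company2)]
        entry[0] += 1            # one more transaction for this pair
        entry[1] += int(amount)  # running total for this pair
    return dict(totals)

# A few rows from the question (header row already skipped).
sample = [
    "A B 1/1/19 20000",
    "A B 1/4/19 200000",
    "A B 1/19/19 324",
    "A C 2/1/19 3456",
    "A C 2/1/19 663633",
    "A D 1/6/19 3632",
]
result = summarize(sample)
print(result[("A", "B")])  # [3, 220324]
```

With a real file, the lines could be fed in lazily, for example by opening the file, skipping the header with `next(f)`, and passing `f` itself to `summarize`.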

Combining every column-combination of an arbitrary number of matrices

I'm trying to figure out a way to do a certain "reduction".
I have a varying number of matrices of varying size, e.g.
1 2 2 2 5 6...70 70
3 7 8 9 7 7...88 89
1 3 4
2 7 7
3 8 8
9 9 9
.
.
44 49 49 49 49 49 49
50 50 50 50 50 50 50
87 87 88 89 90 91 92
What I need to do (and I hope that I'm explaining this clearly enough) is to combine every possible
combination of columns from these matrices. This means that one combined column might be
1
3
1
2
3
9
.
.
.
44
50
87
Which would reduce down to
1
2
3
9
.
.
.
44
50
87
The reason I'm doing this is that I need to find the smallest unique combined column.
What am I trying to accomplish
For those interested, I'm trying to find the smallest set of gene knockouts
to disable reactions. Here, every matrix represents a reaction, and the columns represent the indices of
the genes that would disable that reaction.
The method may be as brute force as needed, as these matrices rarely become overwhelmingly large,
and the reaction combinations won't be long either
The problem
I can't (as far as I know) create a for loop with an arbitrary number of iterators, and the number of
matrices (reactions to disable) is arbitrary.
Clarification
If I have matrices A,B,C with columns a1,a2...b1,b2...c1...cn what I need
are the columns [a1 b1 c1], [a1, b1, c2], ..., [a1 b1 cn] ... [an bn cn]
Solution
Courtesy of Michael Ohlrogge below.
Extension of his answer, for completeness
His solution ends with
MyProd = product(Array_of_ColGroups...)
Which gets the job done
And picking up where he left off
collection = collect(MyProd); #MyProd is an iterator
merged_cols = Array[] # the rows of 'collection' are arrays of arrays
for (i,v) in enumerate(collection)
    # I apologize for this line
    push!(merged_cols, sort!(unique(vcat(v...))))
end
# find all lengths so I can find which is the minimum
lengths = map(x -> length(x), merged_cols);
loc_of_shortest = find(broadcast((x,y) -> length(x) == y, merged_cols,minimum(lengths)))
best_gene_combos = merged_cols[loc_of_shortest]
tl;dr - complete solution:
# example matrices
a = rand(1:50, 8,4); b = rand(1:50, 10,5); c = rand(1:50, 12,4);
Matrices = [a,b,c];
toJagged(x) = [x[:,i] for i in 1:size(x,2)];
JaggedMatrices = [toJagged(x) for x in Matrices];
Combined = [unique(i) for i in JaggedMatrices[1]];
for n in 2:length(JaggedMatrices)
    Combined = [unique([i;j]) for i in Combined, j in JaggedMatrices[n]];
end
Lengths = [length(s) for s in Combined];
Minima = findin(Lengths, min(Lengths...));
SubscriptsArray = ind2sub(size(Lengths), Minima);
ComboTuples = [((i[j] for i in SubscriptsArray)...) for j in 1:length(Minima)]
Explanation:
Assume you have matrix a and b
a = rand(1:50, 8,4);
b = rand(1:50, 10,5);
Express them as a jagged array, columns first
A = [a[:,i] for i in 1:size(a,2)];
B = [b[:,i] for i in 1:size(b,2)];
Concatenate rows for all column combinations using a list comprehension; remove duplicates on the spot:
Combined = [unique([i;j]) for i in A, j in B];
You now have all column combinations of a and b, as concatenated rows with duplicates removed. Find the lengths easily:
Lengths = [length(s) for s in Combined];
If you have more than two matrices, perform this process iteratively in a for loop, e.g. by using the Combined matrix in place of a. e.g. if you have a matrix c:
c = rand(1:50, 12,4);
C = [c[:,i] for i in 1:size(c,2)];
Combined = [unique([i;j]) for i in Combined, j in C];
Once you have the Lengths array as a multidimensional array (as many dimensions as input matrices, where the size of each dimension is the number of columns in each matrix), you can find the column combinations that correspond to the lowest value (there may well be more than one combination), via a simple ind2sub operation:
Minima = findin(Lengths, min(Lengths...));
SubscriptsArray = ind2sub(size(Lengths), Minima)
(e.g. for a randomized run with 3 input matrices, I happened to get 5 results with the minimal length of 19. The result of ind2sub was ([4,4,3,4,4],[3,3,4,5,3],[1,3,3,3,4]).)
You can convert this further to a list of "Column Combination" tuples with a (somewhat ugly) list comprehension:
ComboTuples = [((i[j] for i in SubscriptsArray)...) for j in 1:length(Minima)]
# results in:
# 5-element Array{Tuple{Int64,Int64,Int64},1}:
# (4,3,1)
# (4,3,3)
# (3,4,3)
# (4,5,3)
# (4,3,4)
Ok, let's see if I understand this. You've got n matrices and want all combinations with one column from each of the n matrices? If so, how about the product() (for Cartesian product) from the Iterators package?
using Iterators
n = 3
Array_of_Arrays = [rand(3,3) for idx = 1:n] ## arbitrary representation of your set of arrays.
Array_of_ColGroups = Array(Array, length(Array_of_Arrays))
for (idx, MyArray) in enumerate(Array_of_Arrays)
    Array_of_ColGroups[idx] = [MyArray[:,jdx] for jdx in 1:size(MyArray,2)]
end
MyProd = product(Array_of_ColGroups...)
This will create an iterator object which you can then loop over to consider the specific combinations of columns.
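For comparison, here is a rough Python sketch of the same Cartesian-product idea using `itertools.product`: take one column from each matrix, merge the columns into a set, and keep the smallest merged sets. The function name `smallest_gene_sets` is hypothetical; matrix `b` is the second example matrix from the question, while matrix `a` is a made-up two-column matrix, because the first example matrix is elided in the question.

```python
from itertools import product

def smallest_gene_sets(matrices):
    """Return the smallest merged columns, taking one column per matrix.

    Each matrix is a list of rows; each column is one candidate gene set.
    """
    # zip(*m) transposes the rows of m into columns.
    column_groups = [[frozenset(col) for col in zip(*m)] for m in matrices]
    # One column from every matrix, merged with duplicates removed.
    merged = [frozenset().union(*combo) for combo in product(*column_groups)]
    shortest = min(len(s) for s in merged)
    return sorted({tuple(sorted(s)) for s in merged if len(s) == shortest})

a = [[1, 2],
     [3, 7]]            # hypothetical stand-in for the elided first matrix
b = [[1, 3, 4],
     [2, 7, 7],
     [3, 8, 8],
     [9, 9, 9]]         # second example matrix from the question
best = smallest_gene_sets([a, b])
print(best)  # [(1, 2, 3, 9)]
```

Because `product` accepts any number of iterables, the number of matrices does not need to be known in advance, which is the point of the answer above.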

Faster way to decrease some items in a vector in Matlab

I'm looking for a faster way to decrease the value of certain numbers in a vector in Matlab. For example, I have this vector:
Vector a=[1 21 35 44 45 67 77 83 93 100]
Then I have to remove the elements 35, 45, 77, so:
RemoveVector b=[3,5,7]
RemoveElements c=[35,45,77]
After removing the elements, the result should be:
Vector=[1 21 43 65 80 90 97]
Note that besides removing the elements, every remaining element is decreased by 1 for each removed element that precedes it. I have this code in Matlab:
a(:,b) = [];
b = fliplr(b);
for i=1:size(a,2)
    for j=1:size(c,2)
        if(a(1,i)>=c(1,j))
            a(1,i) = a(1,i) -1;
        end
    end
end
But it is too slow (m0 = 2.8*10^-3 seconds). Is there a faster algorithm? I believe matrix operations could be faster and more elegant.
@Geoff has a good overall approach, but the adjustment can be done in O(n) rather than O(n*k):
adjustment = zeros(size(a));
adjustment(b(:)) = 1;
a = a - cumsum(adjustment);
a(b(:)) = [];
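To see why the cumulative sum works, here is a small Python sketch of the same idea (a translation for illustration, not Matlab code): flag the removed positions with 1, take the running sum of the flags so each element knows how many removals precede or coincide with it, subtract, and then drop the flagged positions. The function name `remove_and_shift` is made up, and the indices are 0-based, so Matlab's [3 5 7] becomes [2, 4, 6].

```python
from itertools import accumulate

def remove_and_shift(values, remove_idx):
    """Drop the given positions and lower each survivor by the number of
    removed positions that come before it (single pass, O(n))."""
    removed = set(remove_idx)
    flags = [1 if i in removed else 0 for i in range(len(values))]
    shifts = accumulate(flags)  # running count of removals so far
    return [v - s
            for i, (v, s) in enumerate(zip(values, shifts))
            if i not in removed]

a = [1, 21, 35, 44, 45, 67, 77, 83, 93, 100]
b = [2, 4, 6]  # 0-based equivalents of Matlab indices 3, 5, 7
shifted = remove_and_shift(a, b)
print(shifted)  # [1, 21, 43, 65, 80, 90, 97]
```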
I think prior to removing the elements from a whose indices are given in b, the code could do all the decrementing first
% copy a
c = a;
% iterate over each index in b
for k=1:length(b)
    % for all elements in c that follow the index in b (so b(k)+1…end)
    % subtract one
    c(b(k)+1:end) = c(b(k)+1:end) - 1;
end
% now remove the elements that correspond to the indices in b
c(b) = [];
Try the above and see what happens!
Thanks so much to Geoff and Ben for your answers. I've tested both answers this way:
tic
a=[1 21 35 44 45 67 77 83 93 100];
b=[3 5 7];
%Code by Geoff
c = a;
for k=1:length(b)
    % for all elements in c that follow the index in b (so b(k)+1…end)
    % subtract one
    c(b(k)+1:end) = c(b(k)+1:end) - 1;
end
c(b) = [];
m1 = toc;
and
tic
a=[1 21 35 44 45 67 77 83 93 100];
b=[3 5 7];
%Code by Ben
adjustment = zeros(size(a));
adjustment(b(:)) = 1;
a = a - cumsum(adjustment);
a(b(:)) = [];
m2 = toc;
The results on my machine were m1 = 1.2648*10^-4 seconds and m2 = 7.426*10^-5 seconds, so the second code is faster. My original code gives m0 = 2.8*10^-3 seconds.

Algorithm for sequence calculation

I'm looking for a hint to an algorithm or pseudo code which helps me calculate sequences.
It's kind of permutations, but not exactly as it's not fixed length.
The output sequence should look something like this:
A
B
C
D
AA
BA
CA
DA
AB
BB
CB
DB
AC
BC
CC
DC
AD
BD
CD
DD
AAA
BAA
CAA
DAA
...
Every character above represents actually an integer, which gets incremented from a minimum to a maximum.
I do not know the depth when I start, so just using multiple nested for loops won't work.
It's late here in Germany and I just can't wrap my head around this. Pretty sure that it can be done with for loops and recursion, but I have currently no clue on how to get started.
Any ideas?
EDIT: B-typo corrected.
It looks like you're taking all combinations of four distinct digits of length 1, 2, 3, etc., allowing repeats.
So start with length 1: { A, B, C, D }
To get length 2, prepend A, B, C, D in turn to every member of length 1. (16 elements)
To get length 3, prepend A, B, C, D in turn to every member of length 2. (64 elements)
To get length 4, prepend A, B, C, D in turn to every member of length 3. (256 elements)
And so on.
If you have more or fewer digits, the same method will work. It gets a little trickier if you allow, say, A to equal B, but that doesn't look like what you're doing now.
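The prepend method described above can be sketched in Python roughly as follows. The generator name `sequences` and the explicit `max_length` cut-off are assumptions made for illustration; note that this version stores one full level at a time.

```python
def sequences(symbols, max_length):
    """Yield all strings of length 1..max_length, shortest first, with the
    first character varying fastest (A, B, C, D, AA, BA, CA, DA, AB, ...)."""
    level = list(symbols)
    for length in range(1, max_length + 1):
        yield from level
        if length < max_length:
            # Prepend every symbol in turn to each member of the previous level.
            level = [s + member for member in level for s in symbols]

seq = list(sequences("ABCD", 2))
print(seq[:8])   # ['A', 'B', 'C', 'D', 'AA', 'BA', 'CA', 'DA']
print(len(seq))  # 20
```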
Based on the comments from the OP, here's a way to do the sequence without storing the list.
Use an odometer analogy. This only requires keeping track of indices. Each time the first member of the sequence cycles around, increment the one to the right. If this is the first time that that member of the sequence has cycled around, then add a member to the sequence.
The increments will need to be cascaded. This is the equivalent of going from 99,999 to 100,000 miles (the comma is the thousands marker).
If you have a thousand integers that you need to cycle through, then pretend you're looking at an odometer in base 1000 rather than base 10 as above.
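A minimal Python sketch of the odometer idea, assuming integer terms between `min_val` and `max_val` and an explicit `max_depth` at which to stop (both the generator name `odometer` and the stopping rule are assumptions, since the question does not say when the sequence ends). Only the current list of terms is kept in memory.

```python
def odometer(min_val, max_val, max_depth):
    """Yield tuples of 1..max_depth terms; the first term cycles fastest.

    Assumes min_val <= max_val and max_depth >= 1.
    """
    digits = [min_val]
    while True:
        yield tuple(digits)
        # Cascade: every leading term that is at its maximum rolls over.
        pos = 0
        while pos < len(digits) and digits[pos] == max_val:
            digits[pos] = min_val
            pos += 1
        if pos == len(digits):
            # Every term rolled over, so the sequence grows by one term.
            if len(digits) == max_depth:
                return
            digits.append(min_val)
        else:
            digits[pos] += 1

steps = list(odometer(0, 1, 2))
print(steps)  # [(0,), (1,), (0, 0), (1, 0), (0, 1), (1, 1)]
```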
Your sequence looks more like A^(n-1) x A^T, where A is a matrix and A^T is its transpose.
A = [A,B,C,D]
A^T x A^(n-1) for n = 1:
sequence = A,B,C,D
A^T x A^(n-1) for n = 2:
sequence = AA,BA,CA,DA,AB,BB,CB,DB,AC,BC,CC,DC,AD,BD,CD,DD
You can take any matrix multiplication code like this and adapt it to implement what you wish.
You have 4 elements, so you are simply looping over the numbers in reversed base-4 notation. Say A=0, B=1, C=2, D=3:
first loop from 0 to 3 on 1 digit
second loop from 00 to 33 on 2 digits
and so on
The columns below are: i, reversed i, output using A,B,C,D digits
loop on 1 digit
0 0 A
1 1 B
2 2 C
3 3 D
loop on 2 digits
00 00 AA
01 10 BA
02 20 CA
03 30 DA
10 01 AB
11 11 BB
12 21 CB
13 31 DB
20 02 AC
21 12 BC
22 22 CC
...
The algorithm is pretty obvious. You could also take a look at Algorithm L (lexicographic t-combination generation) in Fascicle 3a of TAOCP by D. Knuth.
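The reversed base-4 counting in the table above could be sketched in Python like this. The generator name `reversed_base_sequence` and the `max_length` parameter are assumptions made for illustration; the base is simply the number of symbols.

```python
def reversed_base_sequence(symbols, max_length):
    """Count in base len(symbols) with 1..max_length digits and emit each
    number with its digits reversed (least significant digit first)."""
    base = len(symbols)
    for length in range(1, max_length + 1):
        for i in range(base ** length):
            digits = []
            n = i
            for _ in range(length):
                n, d = divmod(n, base)  # peel off the lowest digit
                digits.append(symbols[d])
            yield "".join(digits)

out = list(reversed_base_sequence("ABCD", 2))
print(out[:6])  # ['A', 'B', 'C', 'D', 'AA', 'BA']
```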
How about:
Private Sub DoIt(minVal As Integer, maxVal As Integer, maxDepth As Integer)
    If maxVal < minVal OrElse maxDepth <= 0 Then
        Debug.WriteLine("no results!")
        Return
    End If
    Debug.WriteLine("results:")
    Dim resultList As New List(Of Integer)(maxDepth)
    ' initialize with the 1st result: this makes processing the remainder easy to write.
    resultList.Add(minVal)
    Dim depthIndex As Integer = 0
    Debug.WriteLine(CStr(minVal))
    Do
        ' find the term to be increased
        Dim indexOfTermToIncrease As Integer = 0
        While resultList(indexOfTermToIncrease) = maxVal
            resultList(indexOfTermToIncrease) = minVal
            indexOfTermToIncrease += 1
            If indexOfTermToIncrease > depthIndex Then
                depthIndex += 1
                If depthIndex = maxDepth Then
                    Return
                End If
                resultList.Add(minVal - 1)
                Exit While
            End If
        End While
        ' increase the term that was identified
        resultList(indexOfTermToIncrease) += 1
        ' output
        For d As Integer = 0 To depthIndex
            Debug.Write(CStr(resultList(d)) + " ")
        Next
        Debug.WriteLine("")
    Loop
End Sub
Would that be adequate? It doesn't take much memory and is relatively fast (apart from writing to the output...).
