Python: break up dataframe (one row per entry in column, instead of multiple entries in column) - performance

I have a solution to a problem, that to my despair is somewhat slow, and I am seeking advice on how to speed up my solution (by adding vectorization or other clever methods). I have a dataframe that looks like this:
toy = pd.DataFrame([[1,'cv','c,d,e'],[2,'search','a,b,c,d,e'],[3,'cv','d']],
columns=['id','ch','kw'])
Output is:
The task is to break up kw column into one (replicated) row per comma-separated entry in each string. Thus, what I wish to achieve is:
My initial solution is the following:
data = pd.DataFrame()
for x in toy.itertuples():
    id = x.id; ch = x.ch; keys = x.kw.split(",")
    data = data.append([[id, ch, x] for x in keys], ignore_index=True)
data.columns = ['id','ch','kw']
Problem is: it is slow for larger dataframes. My hope is that someone has encountered a similar problem before, and knows how to optimize my solution. I'm using python 3.4.x and pandas 0.19+ if that is of importance.
Thank you!

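You can use str.split to turn each string into a list, then str.len to get the length of each list.
Finally, build a new DataFrame with the constructor, using numpy.repeat for the id and ch columns and numpy.concatenate for the split values: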
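import numpy as np
import pandas as pd

cols = toy.columns
# split each kw string into a list and record how many entries it has
splitted = toy['kw'].str.split(',')
l = splitted.str.len()
# repeat id and ch once per entry and flatten the lists of entries
toy = pd.DataFrame({'id':np.repeat(toy['id'], l),
                    'ch':np.repeat(toy['ch'], l),
                    'kw':np.concatenate(splitted)})
# restore the original column order (reindex_axis is gone from recent pandas)
toy = toy[cols]
print (toy)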
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
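To make the repeat-and-concatenate idea concrete without pandas, here is a minimal pure-Python sketch of the same reshaping. The helper name `explode_rows` is hypothetical, and rows are plain tuples rather than a DataFrame, so this only illustrates the logic and says nothing about performance.

```python
from itertools import chain, repeat

def explode_rows(rows, sep=","):
    """Return one (id, ch, kw) row per separated entry in the kw field."""
    # Step 1: split every kw string into a list (the str.split step).
    split_kw = [kw.split(sep) for _, _, kw in rows]
    # Step 2: repeat id and ch once per entry (the numpy.repeat step).
    ids = chain.from_iterable(repeat(r[0], len(k)) for r, k in zip(rows, split_kw))
    chs = chain.from_iterable(repeat(r[1], len(k)) for r, k in zip(rows, split_kw))
    # Step 3: flatten the lists of entries (the numpy.concatenate step).
    kws = chain.from_iterable(split_kw)
    return list(zip(ids, chs, kws))

toy_rows = [(1, "cv", "c,d,e"), (2, "search", "a,b,c,d,e"), (3, "cv", "d")]
exploded = explode_rows(toy_rows)
for row in exploded:
    print(row)
```

Each input row contributes as many output rows as it has comma-separated entries, which is the count stored in `l` in the pandas version.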


How to extract optimization problem matrices A,b,c using JuMP in Julia

I create an optimization model in Julia-JuMP using the symbolic variables and constraints e.g. below
using JuMP
using CPLEX
# model
Mod = Model(CPLEX.Optimizer)
# sets
I = 1:2;
# Variables
x = @variable( Mod , [I] , base_name = "x" )
y = @variable( Mod , [I] , base_name = "y" )
# constraints
Con1 = @constraint( Mod , [i in I] , 2 * x[i] + 3 * y[i] <= 100 )
# objective
ObjFun = @objective( Mod , Max , sum( x[i] + 2 * y[i] for i in I) ) ;
# solve
optimize!(Mod)
I guess JuMP creates the problem in the form minimize c'*x subject to A*x <= b before it is passed to the solver CPLEX. I want to extract the matrices A, b, c. In the above example I would expect something like:
A
2×4 Array{Int64,2}:
2 0 3 0
0 2 0 3
b
2-element Array{Int64,1}:
100
100
c
4-element Array{Int64,1}:
1
1
2
2
In MATLAB the function prob2struct can do this https://www.mathworks.com/help/optim/ug/optim.problemdef.optimizationproblem.prob2struct.html
Is there a JuMP function that can do this?
This is not easily possible as far as I am aware.
The problem is stored in the underlying MathOptInterface (MOI) specific data structures. For example, constraints are always stored as MOI.AbstractFunction - in - MOI.AbstractSet. The same is true for the MOI.ObjectiveFunction. (see MOI documentation: https://jump.dev/MathOptInterface.jl/dev/apimanual/#Functions-1)
You can, however, try to recompute the objective function terms and the constraints in matrix-vector form.
For example, assuming you still have your JuMP.Model Mod, you can examine the objective function closer by typing:
using MathOptInterface
const MOI = MathOptInterface
# this only works if you have a linear objective function (the model has a ScalarAffineFunction as its objective)
obj = MOI.get(Mod, MOI.ObjectiveFunction{MOI.ScalarAffineFunction{Float64}}())
# take a look at the terms
obj.terms
# from this you could extract your vector c
c = zeros(4)
for term in obj.terms
c[term.variable_index.value] = term.coefficient
end
@show(c)
This indeed gives c = [1.; 1.; 2.; 2.].
You can do something similar for the underlying MOI.constraints.
# list all the constraints present in the model
cons = MOI.get(Mod, MOI.ListOfConstraints())
@show(cons)
In this case we only have one type of constraint, i.e. (MOI.ScalarAffineFunction{Float64} in MOI.LessThan{Float64})
# get the constraint indices for this combination of F(unction) in S(et)
F = cons[1][1]
S = cons[1][2]
ci = MOI.get(Mod, MOI.ListOfConstraintIndices{F,S}())
You get two constraint indices (stored in the array ci), because there are two constraints for this combination F - in - S.
Let's examine the first one of them closer:
ci1 = ci[1]
# to get the function and set corresponding to this constraint (index):
moi_backend = backend(Mod)
f = MOI.get(moi_backend, MOI.ConstraintFunction(), ci1)
f is again of type MOI.ScalarAffineFunction which corresponds to one row a1 in your A = [a1; ...; am] matrix. The row is given by:
a1 = zeros(4)
for term in f.terms
a1[term.variable_index.value] = term.coefficient
end
@show(a1) # gives [2.0 0 3.0 0] (the first row of your A matrix)
To get the corresponding first entry b1 of your b = [b1; ...; bm] vector, you have to look at the constraint set of that same constraint index ci1:
s = MOI.get(moi_backend, MOI.ConstraintSet(), ci1)
@show(s) # MathOptInterface.LessThan{Float64}(100.0)
b1 = s.upper
I hope this gives you some intuition on how the data is stored in MathOptInterface format.
You would have to do this for all constraints and all constraint types and stack them as rows in your constraint matrix A and vector b.
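The stacking step can be sketched independently of the MOI API. The following Python snippet is only an illustration of the bookkeeping: `assemble_rows` and its input format (a list of term lists plus upper bounds, with 1-based variable indices) are hypothetical stand-ins for what you would collect from the constraint functions and sets.

```python
def assemble_rows(constraints, num_vars):
    """Stack sparse constraint rows into a dense matrix A and a vector b.

    `constraints` is a hypothetical list of (terms, upper_bound) pairs, where
    `terms` is a list of (variable_index, coefficient) with 1-based indices.
    """
    A, b = [], []
    for terms, upper in constraints:
        row = [0.0] * num_vars
        for var_index, coeff in terms:
            row[var_index - 1] += coeff  # accumulate in case an index repeats
        A.append(row)
        b.append(upper)
    return A, b

# The two constraints 2*x[i] + 3*y[i] <= 100, variables ordered x1, x2, y1, y2.
constraints = [([(1, 2.0), (3, 3.0)], 100.0), ([(2, 2.0), (4, 3.0)], 100.0)]
A, b = assemble_rows(constraints, num_vars=4)
print(A)
print(b)
```

Accumulating with `+=` rather than assigning is a deliberate choice here: if a variable appears twice in one term list, its coefficients are summed instead of overwritten.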
Use the following lines:
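using Pkg
Pkg.add("NLPModelsJuMP")
using NLPModelsJuMP
# "model" is the JuMP model you created before, with variables and constraints (and optionally the objective function) attached to it
nlp = MathOptNLPModel(model)
x = zeros(nlp.meta.nvar)
c = NLPModelsJuMP.grad(nlp, x) # gradient of the objective, which for a linear objective is the cost vector c
A = Matrix(NLPModelsJuMP.jac(nlp, x)) # Jacobian of the constraints, which for linear constraints is the matrix A
b = nlp.meta.ucon # upper bounds of the constraints, i.e. the right-hand side b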
I didn't try it myself. But the MathProgBase package seems to be able to provide A, b, and c in matrix form.

find all indices of multiple value pairs in a matrix

Suppose I have a matrix A, containing possible value pairs and a matrix B, containing all value pairs:
A = [1,1;2,2;3,3];
B = [1,1;3,4;2,2;1,1];
I would like to create a matrix C that contains all pairs that are allowed by A (i.e. C = [1,1;2,2;1,1]).
Using C = ismember(A,B,'rows') will only show the first occurrence of 1,1, but I need both.
Currently I use a for-loop to create C, which looks like:
TFtot = false(size(B,1),1);
for i = 1:size(A,1)
TF1 = A(i,1) == B(:,1) & A(i,2) == B(:,2);
TFtot = TF1 | TFtot;
end
C = B(TFtot,:);
I would like to create a faster approach, because this loop currently greatly slows down the algorithm.
You're pretty close. You just need to swap B and A, then use this output to index into B:
L = ismember(B, A, 'rows');
C = B(L,:);
In this particular case, ismember outputs a logical vector with as many rows as B, where the ith value tells you whether the ith row of B was found somewhere in A (logical 1) or not (logical 0).
You want to select those rows of B that are seen in A, so you simply use the output of ismember to index into B, extracting the matching rows and all of the columns.
We get for C:
>> C
C =
1 1
2 2
1 1
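For readers who want to see the row-membership idea spelled out, here is a small Python sketch of the same logic. It is not MATLAB's implementation; `rows_in` is a hypothetical helper that hashes the allowed pairs once and then tests every row of B against that set.

```python
def rows_in(B, A):
    """Return (mask, C): mask[i] is True when row i of B occurs in A."""
    allowed = {tuple(row) for row in A}      # hash each allowed pair once
    mask = [tuple(row) in allowed for row in B]
    C = [row for row, keep in zip(B, mask) if keep]
    return mask, C

A = [[1, 1], [2, 2], [3, 3]]
B = [[1, 1], [3, 4], [2, 2], [1, 1]]
mask, C = rows_in(B, A)
print(mask)
print(C)
```

The mask has one entry per row of B, which is why duplicates in B such as the second 1,1 are kept.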
Here's an alternative using bsxfun:
C = B(all(any(bsxfun(@eq, B, permute(A, [3 2 1])),3),2),:);
Or you could use pdist2 (Statistics Toolbox):
B(any(~pdist2(A,B),1),:);
Using matrix-multiplication based euclidean distance calculations -
Bt = B.'; %//'
[m,n] = size(A);
dists = [A.^2 ones(size(A)) -2*A ]*[ones(size(Bt)) ; Bt.^2 ; Bt];
C = B(any(dists==0,1),:);
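The matrix product above relies on the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b. A minimal Python sketch of that identity, using plain lists and a hypothetical helper `squared_distances`, may make the three blocks of the product easier to follow:

```python
def squared_distances(A, B):
    """dists[i][j] = ||A[i] - B[j]||^2 via ||a||^2 + ||b||^2 - 2*a.b."""
    return [
        [
            sum(a * a for a in ra) + sum(b * b for b in rb)
            - 2 * sum(a * b for a, b in zip(ra, rb))
            for rb in B
        ]
        for ra in A
    ]

A = [[1, 1], [2, 2], [3, 3]]
B = [[1, 1], [3, 4], [2, 2], [1, 1]]
dists = squared_distances(A, B)
# Keep every row of B whose distance to some row of A is exactly zero.
C = [rb for j, rb in enumerate(B) if any(row[j] == 0 for row in dists)]
print(C)
```

Testing for an exact zero is safe here because the entries are integers; with floating-point data a small tolerance would be needed instead.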

The Movie Scheduling _Problem_

Currently I'm reading "The Algorithm Design Manual" by Skiena (well, beginning to read)
He asks a problem he calls the "Movie Scheduling Problem":
Problem: Movie Scheduling Problem
Input: A set I of n intervals on the line.
Output: What is the largest subset of mutually non-overlapping intervals which can
be selected from I?
Example: (Each dashed line is a movie, you want to find a set with the highest quantity of movies)
----a---
-----b---- -----c--- ---d---
-----e--- -------f---
--g-- --h--
The algorithm I thought of to solve it was like this:
I could throw out the "worst offender" (intersects with the most other movies) until there are no worst offenders (zero intersections). The only problem I see is that if there is a tie (say two different movies each intersect with 3 other movies) could it matter which one I throw out?
Basically I'm wondering how I go about turning the idea into "math" and how to prove it correct/incorrect.
The algorithm is incorrect. Let's consider the following example:
Counterexample
        |----F----|      |-----G------|
     |-------D-------|  |--------E--------|
|-----A------| |------B------| |------C-------|
You can see that there is a solution of size at least 3 because you can pick A, B and C.
Firstly, let's count, for each interval the number of intersections:
A = 2 [F, D]
B = 4 [D, F, E, G]
C = 2 [E, G]
D = 3 [A, B, F]
E = 3 [B, C, G]
F = 3 [A, B, D]
G = 3 [B, C, E]
Now consider a run of your algorithm. In the first step we delete B because it intersects with the most intervals, and we get:
        |----F----|      |-----G------|
     |-------D-------|  |--------E--------|
|-----A------|                 |------C-------|
It's easy to see that now from {A, D, F} you can choose only one, because each pair intersects. The same case with {G, E, C}, so after deleting B, you can choose at most one from {A, D, F} and at most one from {G, E, C}, to get the total of 2, which is smaller than the size of {A, B, C}.
The conclusion is that after deleting B, which intersects with the most intervals, you can no longer get the maximum number of non-intersecting movies.
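The counterexample can be checked mechanically. The sketch below assigns hypothetical coordinates to A..G (chosen only so that the intersection lists match the ones given above), runs the "worst offender" rule, and compares the result with a brute-force optimum.

```python
from itertools import combinations

# Hypothetical (start, end) coordinates reproducing the intersection lists.
INTERVALS = {
    "A": (0, 13), "B": (15, 29), "C": (31, 46), "D": (5, 21),
    "E": (24, 42), "F": (8, 18), "G": (25, 38),
}

def overlaps(p, q):
    return p[0] < q[1] and q[0] < p[1]

def worst_offender(intervals):
    """Repeatedly drop the interval with the most overlaps; return survivors."""
    alive = dict(intervals)
    while True:
        counts = {
            n: sum(overlaps(alive[n], alive[m]) for m in alive if m != n)
            for n in alive
        }
        worst = max(counts, key=counts.get)
        if counts[worst] == 0:
            return sorted(alive)
        del alive[worst]

def optimum_size(intervals):
    """Size of the largest mutually non-overlapping subset (brute force)."""
    names = list(intervals)
    for k in range(len(names), 0, -1):
        for subset in combinations(names, k):
            pairs = combinations(subset, 2)
            if not any(overlaps(intervals[a], intervals[b]) for a, b in pairs):
                return k
    return 0

survivors = worst_offender(INTERVALS)
best = optimum_size(INTERVALS)
print(survivors, best)
```

Whichever way ties are broken after B is removed, only one interval from each of {A, D, F} and {C, E, G} can survive, so the heuristic ends with two movies while the optimum is three.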
Correct solution
The problem is very well known and one solution is to pick the interval which ends first, delete all intervals intersecting with it and continue until there are no intervals to examine. This is an example of a greedy method and you can find or develop a proof that it's correct.
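A minimal sketch of the earliest-end greedy rule follows. It assumes intervals are (start, end) pairs and that two movies do not conflict when one starts exactly as the other ends; the sample coordinates are hypothetical.

```python
def max_nonoverlapping(intervals):
    """Greedy by earliest end time; intervals are (start, end) pairs."""
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:      # compatible with everything chosen so far
            chosen.append((start, end))
            last_end = end
    return chosen

movies = [(0, 13), (15, 29), (31, 46), (5, 21), (24, 42), (8, 18), (25, 38)]
chosen = max_nonoverlapping(movies)
print(chosen)
```

Sorting by end time and skipping anything that starts before the last chosen end is equivalent to "pick the interval that ends first and delete everything intersecting it", without explicitly deleting intervals.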
This looks like a dynamic programming problem to me:
Define the following functions:
sched(t) = best schedule starting at time t
next(t) = set of movies that start next after time t
len(m) = length of movie m
next returns a set because there may be more than one movie that starts at the same time.
then sched should be defined as follows:
sched(t) = max { 1 + sched(t + len(m)), sched(t+1) } where m in next(t)
This recursive function selects a movie m from next(t) and compares the largest possible sets that either include or don't include m.
Invoke sched with the time of your first movie and you will get the size of the optimal set. Getting the optimal set itself just requires a little extra logic to remember which movies you select at each invocation.
I think this recursive (as opposed to iterative) algorithm runs in O(n^2) if you use memoization, where n is the number of movies.
It's correct, though I'd have to consult my algorithms textbook to give you an explicit proof; hopefully it makes intuitive sense why the algorithm is correct.
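Here is one way the recurrence could be written with memoization. It is a variant, not a literal transcription: it indexes subproblems by movie (sorted by start time) rather than by time step, so "sched(t+1)" becomes "skip this movie" and "sched(t + len(m))" becomes "jump to the first movie starting after this one ends". The function name and the sample data are hypothetical.

```python
from bisect import bisect_left
from functools import lru_cache

def best_schedule_size(movies):
    """Size of the largest set of non-overlapping (start, end) movies."""
    movies = sorted(movies)                  # sort by start time
    starts = [s for s, _ in movies]

    @lru_cache(maxsize=None)
    def sched(i):
        if i == len(movies):
            return 0
        skip = sched(i + 1)                  # do not watch movie i
        # index of the first movie that starts at or after movie i ends
        nxt = bisect_left(starts, movies[i][1])
        take = 1 + sched(nxt)                # watch movie i
        return max(take, skip)

    return sched(0)

movies = [(0, 13), (15, 29), (31, 46), (5, 21), (24, 42), (8, 18), (25, 38)]
size = best_schedule_size(movies)
print(size)
```

With memoization there is one subproblem per movie, and each does a binary search, so this variant needs no assumption about integer time steps.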
# go through the database and create a 2-D matrix indexed a..h by a..h. Set each
# element of the matrix to 1 if the row index movie overlaps the column index movie.
mtx = []
for i in range(8):
    column = []
    for j in range(8):
        column.append(0)
    mtx.append(column)
# b <> e
mtx[1][4] = 1
mtx[4][1] = 1
# e <> g
mtx[4][6] = 1
mtx[6][4] = 1
# e <> c
mtx[4][2] = 1
mtx[2][4] = 1
# c <> a
mtx[2][0] = 1
mtx[0][2] = 1
# c <> f
mtx[2][5] = 1
mtx[5][2] = 1
# c <> g
mtx[2][6] = 1
mtx[6][2] = 1
# c <> h
mtx[2][7] = 1
mtx[7][2] = 1
# d <> f
mtx[3][5] = 1
mtx[5][3] = 1
# a <> f
mtx[0][5] = 1
mtx[5][0] = 1
# a <> d
mtx[0][3] = 1
mtx[3][0] = 1
# a <> h
mtx[0][7] = 1
mtx[7][0] = 1
# e <> h
mtx[4][7] = 1
mtx[7][4] = 1
# print out constraints
for line in mtx:
    print(line)
# keep track of which movies are still allowed
allowed = set(range(8))
# loop through in greedy fashion, picking movie that throws out the least
# number of other movies at each step; stop once every movie has been
# either watched or thrown out
while allowed:
    best_col = None
    best_lost = set()
    best = 8 # score if movie overlapped with every other (upper bound)
    # each step, only try movies still allowed
    for col in allowed:
        lost = set()
        for row in range(8):
            # keep track of other movies eliminated by this selection
            if mtx[row][col] == 1:
                lost.add(row)
        # this was the best of all the allowed choices so far
        if len(lost) < best:
            best_col = col
            best_lost = lost
            best = len(lost)
    # there was a valid selection, process
    if best_col is not None:
        print('watch movie:', chr(best_col + ord('a')))
        for row in best_lost:
            # now eliminate the other movies you can't now watch
            if row in allowed:
                print('throwing out:', chr(row + ord('a')))
                allowed.remove(row)
        # also throw out this movie from the allowed list (can't watch twice)
        allowed.remove(best_col)
# this is just a greedy algorithm, not guaranteed optimal!
# you could also iterate through all possible combinations of movies
# and simply eliminate all illegal possibilities (brute force search)
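The brute-force search mentioned in the closing comment could look roughly like this. It is a sketch that assumes the overlap pairs are exactly the ones set in the matrix above (including the pair stored at indices 4 and 7, i.e. e and h); `brute_force_best` is a hypothetical name, and the search is exponential in the number of movies.

```python
from itertools import combinations

# Overlap pairs copied from the matrix assignments above.
OVERLAPS = [
    ("b", "e"), ("e", "g"), ("e", "c"), ("c", "a"), ("c", "f"), ("c", "g"),
    ("c", "h"), ("d", "f"), ("a", "f"), ("a", "d"), ("a", "h"), ("e", "h"),
]

def brute_force_best(names, overlaps):
    """Try subsets from largest to smallest; return the first legal one."""
    conflict = {frozenset(pair) for pair in overlaps}
    for k in range(len(names), 0, -1):
        for subset in combinations(names, k):
            pairs = combinations(subset, 2)
            if all(frozenset(p) not in conflict for p in pairs):
                return list(subset)
    return []

best_set = brute_force_best("abcdefgh", OVERLAPS)
print(best_set)
```

For eight movies this checks at most 255 subsets, so it is a convenient way to validate the greedy output on small inputs.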

What is an efficient way to convert sets to a column index in R?

Overview
Given a large data frame, A (5,000,000+ rows), with string row names and a list of disjoint sets, B (20,000+ sets), where each set consists of row names from A, what is the best way to create a vector representing the sets in B via a unique value?
Illustration
Below is an example illustrating this problem:
# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+
The desired result would be:
# An index of NA represents that the row is not part of any set in B.
> A[,"index", drop = F]
        d index
4655297 A     1
3328423 A     1
2911946 A     2
2829484 A     2
3871770 A    NA
2702914 A    NA
2581677 A    NA
4106410 A    NA
3755846 A    NA
3177816 A     1
Naive Attempt
Something like this can be achieved using the following method.
n <- 0
A$index <- NA
lapply(B, function(x){
n <<- n + 1
A[x, "index"] <<- n
})
Problem
However this is unreasonably slow (several hours) due to indexing A multiple times and is not very R-esque or elegant.
How can the desired result be generated in a quick and efficient manner?
Here is a suggestion using base that isn't too bad when compared to your current method.
Sample data:
A <- data.frame(d = rep("A", 5e6),
set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)
Base method:
system.time({
A$index <- NA
A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})
# user system elapsed
# 15.30 0.19 15.50
Check:
identical(A$set, A$index)
# TRUE
For anything faster, I suppose data.table will come in handy.
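The underlying idea, building the whole index in one pass instead of indexing A once per set, can be sketched in a few lines of Python. This is only an illustration of the approach, not R code; `set_index` is a hypothetical helper and `None` stands in for NA.

```python
def set_index(row_names, sets):
    """Map each row name to the 1-based number of the set containing it."""
    lookup = {}
    for number, members in enumerate(sets, start=1):
        for name in members:
            lookup[name] = number    # sets are disjoint, so no overwrites
    # None plays the role of NA for rows that belong to no set.
    return [lookup.get(name) for name in row_names]

row_names = ["4655297", "3328423", "2911946", "2829484", "3871770", "3177816"]
B = [["4655297", "3177816", "3328423"], ["2911946", "2829484"]]
index = set_index(row_names, B)
print(index)
```

The work is proportional to the total number of set members plus the number of rows, which mirrors what the single vectorised assignment achieves.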

scala version of swap algorithm for null models

The problem I am having is with trying to find an efficient way to find swappable elements in a matrix in order to implement a swap algorithm for null model creation.
The matrix consists of 0's and 1's and the idea is that elements can be switched between columns so that the row and column totals of the matrix remain the same.
For example, given the following matrix:
c1 c2 c3 c4
r1 0 1 0 0 = 1
r2 1 0 0 1 = 2
r3 0 0 0 0 = 0
r4 1 1 1 1 = 4
------------
2 2 1 2
columns c2 and c4 in r1 and r2 can each be swapped in such a way that totals are not altered i.e.:
c1 c2 c3 c4
r1 0 0 0 1 = 1
r2 1 1 0 0 = 2
r3 0 0 0 0 = 0
r4 1 1 1 1 = 4
------------
2 2 1 2
This all needs to be done randomly so as not to introduce any bias.
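As a small illustration of why such a swap preserves the totals, here is a Python sketch (not the Scala code from this question) of a single checked swap. The helper `try_swap` is hypothetical; it swaps only when the four cells form a 1,0 / 0,1 or 0,1 / 1,0 checkerboard.

```python
def try_swap(matrix, r1, r2, c1, c2):
    """Swap a 2x2 checkerboard in place; return True if a swap happened."""
    a, b = matrix[r1][c1], matrix[r1][c2]
    c, d = matrix[r2][c1], matrix[r2][c2]
    # Only (1,0 / 0,1) and (0,1 / 1,0) keep every row and column total.
    if a == d and b == c and a != b:
        matrix[r1][c1], matrix[r1][c2] = b, a
        matrix[r2][c1], matrix[r2][c2] = d, c
        return True
    return False

matrix = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
before = ([sum(r) for r in matrix], [sum(col) for col in zip(*matrix)])
swapped = try_swap(matrix, 0, 1, 1, 3)   # rows r1, r2 and columns c2, c4
after = ([sum(r) for r in matrix], [sum(col) for col in zip(*matrix)])
print(swapped, matrix, before == after)
```

Each affected row loses a 1 in one column and gains it in the other, and each affected column does the same across the two rows, so all totals stay fixed.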
I have one solution that works. I randomly select a row and two columns. If they yield a 10 or 01 pattern then I randomly select another row and check the same columns to see if they yield the opposite pattern. If either check fails I start over and select a new element.
This method works but I only "hit" the correct patterns about 10% of the time. In a large matrix or in one with few 1's in the rows I waste a lot of time "missing". I figured that there had to be a more intelligent way of choosing elements in the matrix but still doing it randomly.
The code for the working method is:
def isSwappable(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
val indices = getRowAndColIndices(matrix)
(matrix(indices._1._1)(indices._2._1), matrix(indices._1._1)(indices._2._2)) match {
case (1, 0) => {
if (matrix(indices._1._2)(indices._2._1) == 0 & matrix(indices._1._2)(indices._2._2) == 1) {
indices
}
else {
isSwappable(matrix)
}
}
case (0, 1) => {
if (matrix(indices._1._2)(indices._2._1) == 1 & matrix(indices._1._2)(indices._2._2) == 0) {
indices
}
else {
isSwappable(matrix)
}
}
case _ => {
isSwappable(matrix)
}
}
}
def getRowAndColIndices(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
(getNextIndex(rnd.nextInt(matrix.size), matrix.size), getNextIndex(rnd.nextInt(matrix(0).size), matrix(0).size))
}
def getNextIndex(i: Int, constraint: Int): Tuple2[Int, Int] = {
val newIndex = rnd.nextInt(constraint)
newIndex match {
case `i` => getNextIndex(i, constraint)
case _ => (i, newIndex)
}
}
I figured a more efficient way to handle this was to remove any rows that could not be used (all 1's or 0's) and then choose an element randomly. From there I could filter out any columns in the row that had the same value and then choose from the remaining columns.
Once the first row and column are chosen I then filter out the rows that can not provide the required pattern and then choose from the remaining rows.
This works for the most part but the problem that I can't figure out how to deal with is what happens when there are no columns or rows to choose from? I don't want to loop infinitely trying to find the pattern I need and I need a way of starting over if I do get an empty list of rows or columns to choose from.
The code that I have so far that sort of works (until I get an empty list) is:
def getInformativeRowIndices(matrix: Matrix) = (
matrix
.zipWithIndex
.filter(_._1.distinct.size > 1)
.map(_._2)
.toList
)
def getRowsWithOppositeValueInColumn(col: Int, value: Int, matrix: Matrix) = (
matrix
.zipWithIndex
.filter(_._1(col) != value)
.map(_._2)
.toList
)
def getColsWithOppositeValueInSameRow(row: Int, value: Int, matrix: Matrix) = (
matrix(row)
.zipWithIndex
.filter(_._1 != value)
.map(_._2)
.toList
)
def process(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
val row1Indices = getInformativeRowIndices(matrix)
if (row1Indices.isEmpty) sys.error("No informative rows")
val row1 = row1Indices(rnd.nextInt(row1Indices.size))
val col1 = rnd.nextInt(matrix(0).size)
val colIndices = getColsWithOppositeValueInSameRow(row1, matrix(row1)(col1), matrix)
if (colIndices.isEmpty) process(matrix)
val col2 = colIndices(rnd.nextInt(colIndices.size))
val row2Indices = getRowsWithOppositeValueInColumn(col1, matrix(row1)(col1), matrix)
.intersect(getRowsWithOppositeValueInColumn(col2, matrix(row1)(col2), matrix))
println(row2Indices)
if (row2Indices.isEmpty) process(matrix)
val row2 = row2Indices(rnd.nextInt(row2Indices.size))
((row1, row2), (col1, col2))
}
I think the recursive methods are wrong and don't really work here. Also, I am really just trying to improve the speed of cell selection so any ideas or suggestions would be greatly appreciated.
EDIT:
I have had a chance to play with this a little more and have come up with another solution, but it does not seem to be much faster than just randomly choosing cells in the matrix. Also, I should add that the matrix needs to be swapped about 30000 times in succession in order for it to be considered randomised, and I need to generate 5000 random matrices for each test, of which I have at least another 5000 to do, so performance is kind of important.
The current solution (besides random cell selection) is:
Randomly select 2 rows from the matrix
subtract one row from the other and put it in an Array
if the new Array contains both a 1 and -1 then we can swap
The logic of the subtraction looks like this:
0 1 0 0
- 1 0 0 1
---------------
-1 1 0 -1
The method that does this looks like this:
def findSwaps(matrix: Matrix, iterations: Int): Boolean = {
var result = false
val mtxLength = matrix.length
val row1 = rnd.nextInt(mtxLength)
val row2 = getNextIndex(row1, mtxLength)
val difference = subRows(matrix(row1), matrix(row2))
if (difference.min == -1 & difference.max == 1) {
val zeroOne = difference.zipWithIndex.filter(_._1 == -1).map(_._2)
val oneZero = difference.zipWithIndex.filter(_._1 == 1).map(_._2)
val col1 = zeroOne(rnd.nextInt(zeroOne.length))
val col2 = oneZero(rnd.nextInt(oneZero.length))
swap(matrix, row1, row2, col1, col2)
result = true
}
result
}
The matrix row subtraction looks like this:
def subRows(a: Array[Int], b: Array[Int]): Array[Int] = (a, b).zipped.map(_ - _)
And the actual swap looks like this:
def swap(matrix: Matrix, row1: Int, row2: Int, col1: Int, col2: Int) = {
val temp = (matrix(row1)(col1), matrix(row1)(col2))
matrix(row1)(col1) = matrix(row2)(col1)
matrix(row1)(col2) = matrix(row2)(col2)
matrix(row2)(col1) = temp._1
matrix(row2)(col2) = temp._2
matrix
}
This works much better than before in that I get between 80% and 90% success for an attempted swap (it was only about 10% with the random cell selection). However, it is still taking about 2.5 minutes to generate 1000 randomised matrices.
Any ideas on how to improve the speed?
I'm going to assume the matrices are big so that storage of the order of (matrix size squared) is not viable (for reasons of either speed or memory).
If you have a sparse matrix, you can enter the index of each 1 in each column in a set (here I show the compact way to do things, but you may wish to iterate with while loops for speed):
val mtx = Array(Array(0,1,0,0),Array(1,0,0,1),Array(0,0,0,0),Array(1,1,1,1))
val cols = mtx.transpose.map(x => x.zipWithIndex.filter(_._1==1).map(_._2).toSet)
Now for each column, a later column contains compatible pairs (at least one) if and only if both of the following two sets are nonempty:
def xorish(a: Set[Int], b: Set[Int]) = (a--b, b--a)
So the answer will involve computing these sets and testing whether they're both nonempty.
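The same computation can be sketched in Python to show what the two sets contain for the example matrix. This is an illustrative translation, with `column_sets` as a hypothetical helper name.

```python
def column_sets(matrix):
    """For each column, the set of row indices that hold a 1."""
    return [{r for r, v in enumerate(col) if v == 1} for col in zip(*matrix)]

def xorish(a, b):
    """Rows with a 1 only in the first column, and only in the second."""
    return a - b, b - a

mtx = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
cols = column_sets(mtx)
only_c2, only_c4 = xorish(cols[1], cols[3])   # columns c2 and c4
same_a, same_b = xorish(cols[0], cols[3])     # columns c1 and c4 are identical
print(only_c2, only_c4, same_a, same_b)
```

For c2 and c4 both sets are nonempty (rows r1 and r2), so a swap exists; for the identical columns c1 and c4 both sets are empty, so none does.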
Now the question is what you mean by "sample randomly". Randomly sampling single 1,0 pairs is not the same as randomly sampling possible swaps. To see this, consider the following:
1 0 1 0
1 0 1 0
1 0 1 0
0 1 1 0
0 1 1 0
0 1 0 1
The two columns on the left have nine possible swaps. The two on the right have only five possible swaps. But if you are looking for (1,0) patterns, you will sample only three times on the left vs. five on the right; if you are looking for either (1,0) or (0,1), you will sample six and six, which again distorts the probabilities. The only way to fix this is either to not be clever, and randomly sample a second time (which in the first case will work out with a usable swap 3/5 of the time, while only 1/5 in the second), or to basically compute every possible pair for swapping (or at least how many pairs there are) and select from that predefined set.
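The counts quoted above can be verified directly: the number of swaps for a column pair is the number of (1,0) rows times the number of (0,1) rows. A short Python check, with `count_swaps` as a hypothetical helper:

```python
def count_swaps(col_a, col_b):
    """Number of row pairs forming a swappable checkerboard in two columns."""
    one_zero = sum(1 for a, b in zip(col_a, col_b) if (a, b) == (1, 0))
    zero_one = sum(1 for a, b in zip(col_a, col_b) if (a, b) == (0, 1))
    return one_zero * zero_one

matrix = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
columns = list(zip(*matrix))
left = count_swaps(columns[0], columns[1])
right = count_swaps(columns[2], columns[3])
print(left, right)
```

The left pair has 3 x 3 = 9 swaps and the right pair 5 x 1 = 5, even though the right pair has more (1,0) rows, which is the source of the sampling bias described above.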
If we want to do the latter, we note that for each pair of nonidentical columns, we can compute the two sets to swap among, and we know the sizes and the product is the total number of possibilities. In order to avoid instantiating all the possibilities, we can create
val poss = {
for (i<-cols.indices; j <- (i+1) until cols.length) yield
(i, j, (cols(i)--cols(j)).toArray, (cols(j)--cols(i)).toArray)
}.filter{ case (_,_,a,b) => a.length>0 && b.length>0 }
and then count how many there are:
val cuml = poss.map{ case (_,_,a,b) => a.size*b.size }.scanLeft(0)(_ + _).toArray
Now to pick a number at random, we pick a number between 0 and cuml.last and pick out which bucket this is and which item within the bucket:
def pickItem(cuml: Array[Int], poss: Seq[(Int,Int,Array[Int],Array[Int])]) = {
val n = util.Random.nextInt(cuml.last)
val k = {
val i = java.util.Arrays.binarySearch(cuml,n)
if (i<0) -i-2 else i
}
val j = n - cuml(k)
val bucket = poss(k)
(
bucket._1, bucket._2,
bucket._3(j % bucket._3.size), bucket._4(j / bucket._3.size)
)
}
This ends up returning (c1,c2,r1,r2) selected randomly.
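The index arithmetic in pickItem can be isolated in a few lines. The Python sketch below uses hypothetical helpers `locate` and `unpack`: the first maps a flat index to a bucket and an offset via the cumulative sizes, the second turns the offset into one pair of rows.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(n, sizes):
    """Map a flat index n to (bucket, offset) given the bucket sizes."""
    cuml = [0] + list(accumulate(sizes))   # same role as cuml above
    k = bisect_right(cuml, n) - 1          # last k with cuml[k] <= n
    return k, n - cuml[k]

def unpack(offset, rows_a, rows_b):
    """Turn an offset inside a bucket into one (row_a, row_b) pair."""
    return rows_a[offset % len(rows_a)], rows_b[offset // len(rows_a)]

sizes = [9, 5]                             # e.g. two column pairs
located = [locate(n, sizes) for n in (0, 8, 9, 13)]
pairs = {unpack(o, [0, 1, 2], [3, 4, 5]) for o in range(9)}
print(located, len(pairs))
```

Because every flat index below the total maps to a distinct (bucket, pair) combination, drawing the index uniformly draws a swap uniformly.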
Now that you have the coordinates, you can create the new matrix however you wish. (Most efficient is probably to do an in-place swap of the entries, and then swap back when you want to try again.)
Note that this is only sensible for a large number of independent swaps from the same starting matrix. If you instead want to do this iteratively and maintain independence, you are probably best off doing this randomly after all unless the matrices are extremely sparse, at which point it's worth simply storing the matrices in some standard sparse matrix format (i.e. by index of nonzero entries) and doing your manipulation on those (probably with mutable sets and an update strategy, since the consequences of a single swap are confined to about n of the entries in an n*n matrix).
