Does mlogit from statsmodels expect a wide format?

Discrete Choice Analysis with Python.
Generally, there are two formats for representing regression data:
long format
wide format
Long format has one row for each potential option, plus a Y column that is 0 or 1 depending on whether that option was chosen. Wide format has only one row per person (survey respondent); the Y columns hold the features of the selected option and the X columns hold the features of all the product alternatives.
Example Long
person answer Y ~ x1 x2
1 1 0 green large
1 1 1 red large
1 2 1 green small
...
Example Wide
y1 y2 ~ x11 x12 x21 x22
green large green large red large
green small green small red small
...
Is my description correct?
Does statsmodels' mlogit use the wide format described here?
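To make the two layouts concrete, here is a minimal sketch that collapses long rows into one row per choice situation. The data, the "alt" column and the suffixed column names (x1_1, x2_1, ...) are made up for illustration. As far as I can tell from the statsmodels documentation, MNLogit takes a one-dimensional endog holding the chosen category with one row per observation, which is closer to the collapsed layout than to the long one, but treat that as something to verify against your version.

```python
from collections import defaultdict

# Made-up long-format rows: one row per (person, answer, alternative).
# "alt" numbers the alternatives inside one choice situation (hypothetical column).
long_rows = [
    {"person": 1, "answer": 1, "alt": 1, "y": 0, "x1": "green", "x2": "large"},
    {"person": 1, "answer": 1, "alt": 2, "y": 1, "x1": "red", "x2": "large"},
    {"person": 1, "answer": 2, "alt": 1, "y": 1, "x1": "green", "x2": "small"},
    {"person": 1, "answer": 2, "alt": 2, "y": 0, "x1": "red", "x2": "small"},
]


def long_to_wide(rows):
    """Collapse long rows into one row per (person, answer) choice situation."""
    grouped = defaultdict(dict)
    for row in rows:
        wide = grouped[(row["person"], row["answer"])]
        # Attribute columns get the alternative number as a suffix: x1_1, x2_1, ...
        wide[f"x1_{row['alt']}"] = row["x1"]
        wide[f"x2_{row['alt']}"] = row["x2"]
        if row["y"] == 1:
            wide["choice"] = row["alt"]  # single response column
    return [
        {"person": person, "answer": answer, **cols}
        for (person, answer), cols in sorted(grouped.items())
    ]


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```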

Related

Where is the midpoint of random-float 1?

The NetLogo dictionary says:
"If number is positive, reports a random floating point number greater than or equal to 0 but strictly less than number."
random-float 1
will generate a number greater than or equal to 0 but less than 1. To evenly split the results, is the proper split
if x < 0.5
or
if x <= 0.5
My guess is that the distance from 0 to just before 0.5 is equal to the distance from 0.5 to just before 1.0, so that x < 0.5 is the correct answer.
I just tested it to see how many decimal places a normal random-float 1 goes to, and I got:
show random-float 1
0.24664519166881826
The odds of actually landing on 0.50000000000000000 rather than 0.50000000000000001 are incredibly low, and I would not worry about whether you use x < 0.5 or x <= 0.5. If you really want it to be even, you could use
set blah .5
while [blah = .5] [
set blah random-float 1 ]
to make it re-roll a number if it truly lands on 0.5. Or you can use one-of to select one of 2 possible outcomes.
Perhaps a developer will chime in with more explicit technical advice.
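The counting argument in the question can be checked on a small grid. This sketch assumes the generator picks uniformly from evenly spaced values k / 2**bits in [0, 1), which is how many generators work; whether NetLogo's does is an assumption here, not something taken from its documentation.

```python
from fractions import Fraction


def split_counts(bits):
    """Count grid points k / 2**bits in [0, 1) below 0.5 and at or above 0.5."""
    total = 2 ** bits
    below = sum(1 for k in range(total) if Fraction(k, total) < Fraction(1, 2))
    at_or_above = total - below
    return below, at_or_above


below, at_or_above = split_counts(8)
print(below, at_or_above)  # 128 128
```

Under that assumption x < 0.5 splits the values exactly in half, while x <= 0.5 moves a single grid point to the other side.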

What kind of algorithm is used to generate a square matrix?

I need to generate a matrix and fill it with numbers and inactive cells so that the sums of all columns and rows are equal. I know about magic squares and sudoku, but this is different. Can you help me, please? What kind of algorithm do I need to generate this matrix?
E.g.
X = 0 = inactive block
Matrix ( 4x4 )
0 8 4 X | 12
2 0 8 2 | 12
10 1 X 1 | 12
0 3 X 9 | 12
____________|
12 12 12 12
Other example:
Matrix ( 5x5 )
0 2 2 3 5 | 12
2 4 0 5 1 | 12
8 2 0 2 0 | 12
0 4 2 0 6 | 12
2 0 8 2 0 | 12
______________|
12 12 12 12 12
The result can be any other number; it is not always 12. I used 12 only because it made the example easier for me. The matrix does not have to be symmetrical.
Note: This is not a magic square, and it is not sudoku either.
Conclusion:
1) I need to build this box and fill it with numbers and inactive blocks.
2) The matrix is always square (3x3, 4x4, 5x5, NxN, ...).
3) When I fill a cell that is not a block, I can use numbers with one, two or three digits.
4) The sums of all sides must be equal.
5) In the example above, X is a block. A block is a cell the player cannot use.
6) An inactive block can be treated as 0, so it does not affect the sum.
7) There is no restriction on how many inactive blocks there are.
8) The numbers used to fill the cells can be repeated. There is no restriction.
9) The matrix is always square and may have different dimensions (see 2).
Thanks, guys, for your help. Sorry that the problem is incomplete and that my English is bad, but that's all.
In terms of algorithms, I would approach it as a system of linear equations. You can write the box as a matrix of variables:
x11 x12 x13 x14
x21 x22 x23 x24
x31 x32 x33 x34
x41 x42 x43 x44
Then you would make the equations as:
row1 = row2 (x11 + x12 + x13 + x14 = x21 + x22 + x23 + x24)
row1 = row3 (...)
row1 = row4
row1 = col1
row1 = col2
row1 = col3
row1 = col4
For N = 4, you would have 16 variables and 7 equations, so the solution has a number of degrees of freedom (at least 9, as pointed out by @JamesMcLeod; exactly 9 if the common sum is fixed in advance, as stated by @Chris, and one more if the sum itself is left free). You can therefore generate every matrix satisfying the restrictions just by assigning values to the free parameters. In the resulting matrix, you could mark every cell with 0 as an inactive cell.
To do this, however, you would need a library or software package that can solve underdetermined systems of linear equations (several math packages can do this, but right now only Maple comes to mind).
P.S.: I've just read that the numbers must have one, two or three digits (and be positive, too?). To address this, you could just "take care" when choosing the values for the free parameters once the system of equations is solved, or you could add inequalities to the problem, like:
x11 < 1000
x11 >= 0 (if values must be positive)
x12 < 1000
(...)
But then it would be a linear programming problem. You may approach it like this too.
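If you only need some valid matrix rather than a parametrisation of all of them, there is a constructive shortcut that avoids a solver. This is a different technique from the linear system above: it adds up permutation matrices scaled by positive integer weights, so every row and column receives each weight exactly once. The function name and parameters are hypothetical, and the sketch does not reach every possible solution or enforce a digit limit beyond keeping entries at or below the target.

```python
import random


def equal_sum_matrix(n, target, terms=4, seed=None):
    """Build an n x n matrix of non-negative integers with all row/column sums == target."""
    rng = random.Random(seed)
    terms = max(1, min(terms, target))
    # Split target into `terms` positive integer weights.
    cuts = sorted(rng.sample(range(1, target), terms - 1))
    weights = [b - a for a, b in zip([0] + cuts, cuts + [target])]
    matrix = [[0] * n for _ in range(n)]
    for weight in weights:
        # A random permutation places the weight once per row and once per column.
        perm = list(range(n))
        rng.shuffle(perm)
        for row, col in enumerate(perm):
            matrix[row][col] += weight
    return matrix


grid = equal_sum_matrix(4, 12, seed=1)
for row in grid:
    # Cells left at 0 are shown as inactive blocks.
    print(" ".join("X" if v == 0 else str(v) for v in row), "|", sum(row))
```

Fewer terms leave more zeros (inactive blocks); more terms fill the grid more densely.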
P.P.S.: You can also make simple cases with diagonal matrices:
7 X X X
X 7 X X
X X 7 X
X X X 7
But I guess you already knew that...
Edit: Thanks James McLeod and Chris for your corrections.
Do you fill the matrix with random numbers? You need a function that takes a one-dimensional vector as its argument and verifies that the sum of its elements is 12; you can then call the same function on the columns (with a loop) from your main program.
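A minimal sketch of such a check, run on the 4x4 example from the question. The function name is hypothetical, and inactive blocks are counted as 0 as the question allows.

```python
def common_sum(grid, inactive="X"):
    """Return the shared row/column sum, or None if the sums differ."""
    values = [[0 if cell == inactive else cell for cell in row] for row in grid]
    row_sums = [sum(row) for row in values]
    col_sums = [sum(col) for col in zip(*values)]
    sums = set(row_sums + col_sums)
    return sums.pop() if len(sums) == 1 else None


example = [
    [0, 8, 4, "X"],
    [2, 0, 8, 2],
    [10, 1, "X", 1],
    [0, 3, "X", 9],
]
print(common_sum(example))  # 12
```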

convert image to matrix with specific values MATLAB

I have the image linked below, which I need to convert into a binary matrix. I need the green beads to be one value (0) and the silver beads another (1). I've tried converting it to black and white using various scalars, but the shadows create problems. Either the shadows need to be associated with the surrounding color or they need to become invisible, as below:
If shadows = 0, green = 1, silver =2
1 2 1 1
0 1 2 2
2 0 0 1
Would become
1 2 1 1
1 2 2
2 1
http://i1373.photobucket.com/albums/ag390/jmangler1/7-11GreenBB250_zpsb583a772.png
Take a look at image segmentation with MATLAB.
They also have a nice app for playing around with different techniques.
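The idea of labelling by color rather than brightness can be sketched outside MATLAB as well. The example below is in Python, classifies single RGB pixels by hue and then drops the shadow labels from each row as in the question's example. The thresholds, the hue window for green and all names are guesses for illustration; they were not tuned on the linked image.

```python
import colorsys

SHADOW, GREEN, SILVER = 0, 1, 2


def classify_pixel(rgb, dark=0.25, min_saturation=0.3):
    """Label one (r, g, b) pixel with 0-255 channels; thresholds are guesses."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb))
    if v < dark:
        return SHADOW
    # Hue is in [0, 1); green sits around 1/3.
    if s >= min_saturation and 0.2 <= h <= 0.45:
        return GREEN
    return SILVER


def drop_shadows(label_rows):
    """Remove shadow labels from every row, as in the question's example."""
    return [[label for label in row if label != SHADOW] for row in label_rows]


labels = [[1, 2, 1, 1], [0, 1, 2, 2], [2, 0, 0, 1]]
print(drop_shadows(labels))  # [[1, 2, 1, 1], [1, 2, 2], [2, 1]]
```

Going from pixel labels to one value per bead (for example by sampling at bead centres) is a separate step that this sketch does not cover.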

Minimum Tile Ordering

Minimizing Tile Re-ordering Problem:
Suppose I had the following symmetric 9x9 matrix, N^2 interactions between N particles:
(1,2) (2,9) (4,5) (4,6) (5,8) (7,8),
These interactions are symmetric, which implies that the following also exist:
(2,1) (9,2) (5,4) (6,4) (8,5) (8,7),
In my problem, suppose they are arranged in matrix form, where only the upper triangle is shown:
t            0       1       2      (tiles)
     #     1 2 3   4 5 6   7 8 9
     1  [  0 1 0   0 0 0   0 0 0  ]
0    2  [  x 0 0   0 0 0   0 0 1  ]
     3  [  x x 0   0 0 0   0 0 0  ]
     4  [  x x x   0 1 1   0 0 0  ]
1    5  [  x x x   x 0 0   0 1 0  ]
     6  [  x x x   x x 0   0 0 0  ]
     7  [  x x x   x x x   0 1 0  ]
2    8  [  x x x   x x x   x 0 0  ]
     9  [  x x x   x x x   x x 0  ]  (x's denote symmetric pair)
I have some operation that's computed in 3x3 tiles, and any 3x3 that contains at least a single 1 must be computed entirely. The above example requires at least 5 tiles: (0,0), (0,2), (1,1), (1,2), (2,2)
However, if I swap the 3rd and 9th columns (along with the rows, since it's a symmetric matrix) by permuting my input:
t            0       1       2
     #     1 2 9   4 5 6   7 8 3
     1  [  0 1 0   0 0 0   0 0 0  ]
0    2  [  x 0 1   0 0 0   0 0 0  ]
     9  [  x x 0   0 0 0   0 0 0  ]
     4  [  x x x   0 1 1   0 0 0  ]
1    5  [  x x x   x 0 0   0 1 0  ]
     6  [  x x x   x x 0   0 0 0  ]
     7  [  x x x   x x x   0 1 0  ]
2    8  [  x x x   x x x   x 0 0  ]
     3  [  x x x   x x x   x x 0  ]  (x's denote symmetric pair)
Now I only need to compute 4 tiles: (0,0), (1,1), (1,2), (2,2).
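The tile counts in both examples can be reproduced with a few lines, which also gives an objective function for comparing candidate orderings. The function name is hypothetical; it counts tiles in the upper triangle only, as in the examples above.

```python
def count_tiles(pairs, order, tile):
    """Count upper-triangle tile blocks that contain at least one interaction.

    pairs: iterable of (i, j) particle labels; order: labels in matrix order.
    """
    position = {label: index for index, label in enumerate(order)}
    tiles = set()
    for i, j in pairs:
        # Map both particles to their tile index and keep the pair ordered.
        a, b = sorted((position[i] // tile, position[j] // tile))
        tiles.add((a, b))
    return len(tiles)


pairs = [(1, 2), (2, 9), (4, 5), (4, 6), (5, 8), (7, 8)]
print(count_tiles(pairs, [1, 2, 3, 4, 5, 6, 7, 8, 9], 3))  # 5
print(count_tiles(pairs, [1, 2, 9, 4, 5, 6, 7, 8, 3], 3))  # 4
```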
The General Problem:
Given an NxN sparse matrix, find a reordering that minimizes the number of TxT tiles that must be computed. Suppose that N is a multiple of T. An optimal, but infeasible, solution can be found by trying out the N! permutations of the input ordering.
For heuristics, I've tried bandwidth minimization routines (such as Reverse Cuthill-McKee) and Tim Davis' AMD routines, so far to no avail. I don't think diagonalization is the right approach here.
Here's a sample starting matrix:
http://proteneer.com/misc/out2.dat
[Images not preserved: results for the Hilbert curve, RCM, and Morton curve orderings]
There are several well-known options you can try (some of them you have, but still):
(Reverse) Cuthill-McKee reduces the matrix bandwidth, keeping the entries close to the diagonal.
Approximate Minimum Degree - a light-weight fill-reducing reordering.
fill-reducing reordering for sparse LU/LL' decomposition (METIS, SCOTCH) - quite computationally heavy.
space filling curve reordering (something along these lines)
quad-trees for 2D or oct-trees for 3D problems - you assign the particles to quads/octants and later number them according to the quad/octant id, similar to space filling curves in a sense.
Self Avoiding Walk is used on structured grids to traverse the grid points in such order that all points are only visited once
a lot of research in blocking of the sparse matrix entries has been done in the context of Sparse Matrix-Vector multiplication. Many of the researchers have tried to find good reordering for that purpose (I do not have the perfect overview on that subject, but have a look at e.g. this paper)
All of those tend to find structure in your matrix and in some sense group the non-zero entries. Since you say you deal with particles, it means that your connectivity graph is in some sense 'local' because of spatial locality of the particle interactions. In this case these methods should be of good use.
Of course, they do not provide the exact solution to the problem :) But they are commonly used in exactly such cases because they yield very good reorderings in practice. What do you mean when you say the methods you tried failed? Do you expect to find the optimum solution? Surely they improve the situation compared to a random matrix ordering.
Edit: Let me briefly go through a few pictures. I have created a 3D structured Cartesian mesh composed of 20-node brick elements. I matched the size of the mesh so that it is similar to yours (~1000 nodes). The number of non-zero entries per row is also not too far off (51-81 in my case, 59-81 in yours, although the distributions are very different). The pictures below show RCM and METIS reorderings for a non-periodic mesh (left) and for a mesh with complete x-y-z periodicity (right):
Next picture shows the same matrix reordered using METIS and fill-reducing reordering
The difference is striking - bad impact of periodicity is clear. Now your matrix reordered with RCM and METIS
WOW. You have a problem :) First of all, I think there is something wrong with your RCM, because mine looks different ;) Also, I am certain that you cannot conclude anything general and meaningful about any reordering based on this particular matrix. This is because your system size is very small (less than roughly 10x10x10 points), and you seem to have relatively long-range interactions between your particles. Hence, introducing periodicity into such a small system has a much stronger bad effect on reordering than is seen in my structured case.
I would start the search for a good reordering by turning off periodicity. Once you have a reordering that satisfies you, introduce periodic interactions. In the system you showed there is almost nothing but periodicity, because it is very small and because your interactions are fairly long-range, at least compared to my mesh. In much larger systems periodicity will have a smaller effect on the center of the model.
Smaller, but still negative. Maybe you could change your approach to periodicity? Instead of including periodic connectivities explicitly in the matrix, construct and reorder a matrix without those and introduce explicit equations binding the periodic particles together, e.g.:
V_particle1 = V_particle100
or in other words
V_particle1 - V_particle100 = 0
and add those equations at the end of your matrix. This is the method of Lagrange multipliers. Here is how it looks for my system:
You keep the reordering of the non-periodic system and the periodic connectivities are localized in a block at the end of the matrix. Of course, you can use it for any other reorderings.
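Written out, and assuming the matrix is used to solve a linear system A V = b (the thread does not say what the matrix is used for, so this is an assumption), the binding equations become C V = 0, where each row of C holds +1 and -1 in the columns of one periodic pair. The symbols A, C, b and λ are introduced here for illustration. Appending the constraints with multipliers λ gives the bordered system:

```latex
\begin{bmatrix} A & C^{T} \\ C & 0 \end{bmatrix}
\begin{bmatrix} V \\ \lambda \end{bmatrix}
=
\begin{bmatrix} b \\ 0 \end{bmatrix}
```

The block A keeps the ordering of the non-periodic system, the periodic couplings sit in the border C, and the lower-right block is zero.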
The next idea is you start with a reordered non-periodic system and explicitly eliminate matrix rows for the periodic nodes by adding them into the rows they are mapped onto. You should of course also eliminate the columns.
Whether you can use these depends on what you do with your matrix. Lagrange multipliers, for example, introduce zeros on the diagonal, and not all solvers like that.
Anyway, this is very interesting research. I think that the specifics of your problem (as I understand it: irregularly placed particles in 3D, with fairly long-range interactions) make it very difficult to group the matrix entries. But I am very curious what you end up doing. Please let me know!
You can look for a data structure like a kd-tree, R-tree or quadtree, or a space filling curve. A space filling curve in particular can help because it reduces the dimension and also reorders the tiles, and thus can add some new information to the grid. With a 9x9 grid it's probably good to look into Peano curves. The Z-order (Morton) curve is better for power-of-2 grids.
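As a concrete instance of the space filling curve idea, here is a minimal sketch of a Z-order (Morton) index in 2D, used to sort particles by the grid cell they fall in. It assumes particle positions have already been binned to non-negative integer cell coordinates; the function names are hypothetical, and a 3D version would interleave three coordinates instead of two.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into a Z-order index."""
    index = 0
    for bit in range(bits):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index


def morton_order(points):
    """Return particle indices sorted along the Z-order curve of their grid cells."""
    return sorted(range(len(points)), key=lambda k: morton_index(*points[k]))


cells = [(1, 1), (0, 0), (2, 0), (0, 1), (1, 0)]
print(morton_order(cells))  # [1, 4, 3, 0, 2]
```

The returned list can be used directly as the input permutation for the matrix.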

Special scheduling Algorithm (pattern expansion)

Question
Do you think genetic algorithms are worth trying out for the problem below, or will I hit local-minima issues?
I think some aspects of the problem are well suited to a generator / fitness-function style setup. (If you've botched a similar project, I would love to hear from you so that I don't repeat the mistake.)
Thank you for any tips on how to structure things and get this right.
The problem
I'm searching for a good scheduling algorithm to use for the following real-world problem.
I have a sequence with 15 slots like this (The digits may vary from 0 to 20) :
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
(And there are in total 10 different sequences of this type)
Each sequence needs to expand into an array, where each slot can take 1 position.
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
The constraints on the matrix are:
[row-wise, i.e. horizontally] The number of ones placed, must either be 11 or 111
[row-wise] The distance between two sequences of 1 needs to be a minimum of 00
The sum of each column should match the original array.
The number of rows in the matrix should be optimized.
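The first three constraints can be checked mechanically, which is useful as a feasibility test inside any search method. This is a minimal sketch with hypothetical function names; it reads the gap rule as "at least two zeros between two runs of ones" and does not address minimizing the number of rows.

```python
from itertools import groupby


def row_is_valid(row):
    """Check one 0/1 row: runs of 1s have length 2 or 3, with >= 2 zeros between runs."""
    runs = [(value, len(list(group))) for value, group in groupby(row)]
    for index, (value, length) in enumerate(runs):
        if value == 1 and length not in (2, 3):
            return False
        # A zero run strictly inside the row separates two runs of ones.
        interior_gap = value == 0 and 0 < index < len(runs) - 1
        if interior_gap and length < 2:
            return False
    return True


def expansion_is_valid(rows, sequence):
    """Check every row and that the column sums reproduce the original sequence."""
    column_sums = [sum(column) for column in zip(*rows)]
    return all(row_is_valid(row) for row in rows) and column_sums == list(sequence)


top = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
bottom = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
print(expansion_is_valid([top, top, bottom, bottom], [2] * 15))  # True
```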
The array then needs to allocate one of 4 different matrices, which may have a different number of rows:
A, B, C, D
A, B, C and D are real-world departments. The load needs to be placed reasonably fair during the course of a 10-day period, not to interfere with other department goals.
Each of the matrices is compared with the expansions of the 10 different original sequences, so you have:
A1, A2, A3, A4, A5, A6, A7, A8, A9, A10
B1, B2, B3, B4, B5, B6, B7, B8, B9, B10
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Certain spots on these may be reserved (Not sure if I should make it just reserved/not reserved or function-based). The reserved spots might be meetings and other events
The sum of each row (for instance all the A's) should be approximately the same within 2%. i.e. sum(A1 through A10) should be approximately the same as (B1 through B10) etc.
The number of rows can vary, so you have for instance:
A1: 5 rows
A2: 5 rows
A3: 1 row, where that single row could for instance be:
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
etc..
Sub problem*
I'd be very happy to solve only part of the problem. For instance, being able to input:
1 1 2 3 4 2 2 3 4 2 2 3 3 2 3
And get an appropriate array of sequences of 1's and 0's, with the number of rows minimized, following the constraints above.
Sub-problem solution attempt
Well, here's an idea. This solution is not based on using a genetic algorithm, but some ideas could be used in going in that direction.
Basis vectors
First of all, you should generate what I think of as the basis vectors. For instance, if your sequence were 3 numbers long rather than 15, the basis vectors would be:
v1 = [1 1 0]
v2 = [0 1 1]
v3 = [1 1 1]
Any solution for sequence length 3 would be a linear combination of these three vectors using only non-negative integer coefficients. In other words, the general solution would be
a*v1 + b*v2 + c*v3
where a, b and c are non-negative integers. For the sequence [1 2 1], the solution is a = 1, b = 1, c = 0. What you first want to do is find all of the possible basis vectors of length 15. From my rough calculations I think that there are somewhere between 300 and 400 basis vectors of length 15. I can give you some tips on generating them if you want.
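One way to generate the basis vectors is to build rows left to right, at each step either placing a zero or, when allowed, a run of two or three ones. This is a sketch under the assumption that a basis vector is any non-empty row satisfying the question's row constraints (so one row may hold several runs); the function name is hypothetical.

```python
def basis_vectors(n):
    """Enumerate non-empty rows of length n with runs of 2 or 3 ones and gaps >= 2."""
    rows = []

    def extend(prefix, zeros_since_run):
        # zeros_since_run is None until the first run of ones has been placed.
        if len(prefix) == n:
            if any(prefix):
                rows.append(tuple(prefix))
            return
        next_zeros = None if zeros_since_run is None else zeros_since_run + 1
        extend(prefix + [0], next_zeros)
        if zeros_since_run is None or zeros_since_run >= 2:
            for run in (2, 3):
                if len(prefix) + run <= n:
                    extend(prefix + [1] * run, 0)

    extend([], None)
    return rows


print(basis_vectors(3))  # [(0, 1, 1), (1, 1, 0), (1, 1, 1)]
```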
Finding solutions
Now, what you want to do is sort these basis vectors by their sums/magnitudes. Then, in searching for your solution, you start with the basis vectors which have the largest sums. We start with the vectors that have the largest sums because they lead to fewer total rows. We also have an array, veccoefs, which contains an entry for the linear coefficient of each basis vector. At the beginning of the search, all the veccoefs are 0.
So we take the first basis vector (the one with the largest sum/magnitude) and subtract this vector from the sequence until we either create an unsolvable result ( having a 0 1 0 in it for instance) or any of the numbers in the result is negative. We store the number of times we subtract the vector in veccoefs. We use the result after subtracting the basis vector from the sequence as the sequence for the next basis vector. If there are only zeros left in the result, then we stop the loop.
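The loop described above can be sketched as follows. The solvability test here is only a necessary condition (no negative entry and no isolated non-zero slot such as 0 1 0), which is my reading of "unsolvable result"; the function names are hypothetical, and the greedy pass may stop with a non-zero leftover, which is returned so the caller can tell.

```python
def solvable_looking(residual):
    """Necessary check only: no negative entry and no isolated non-zero slot."""
    if any(value < 0 for value in residual):
        return False
    padded = [0] + list(residual) + [0]
    return not any(
        padded[i] > 0 and padded[i - 1] == 0 and padded[i + 1] == 0
        for i in range(1, len(padded) - 1)
    )


def greedy_decompose(sequence, basis):
    """Subtract basis vectors, largest sum first, while the residual still looks solvable."""
    residual = list(sequence)
    veccoefs = {}
    for vector in sorted(basis, key=sum, reverse=True):
        while any(residual):
            candidate = [r - v for r, v in zip(residual, vector)]
            if not solvable_looking(candidate):
                break
            residual = candidate
            veccoefs[vector] = veccoefs.get(vector, 0) + 1
    return veccoefs, residual


basis = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]  # v1, v2, v3 from the text
coefs, leftover = greedy_decompose([1, 2, 1], basis)
print(coefs, leftover)  # {(1, 1, 0): 1, (0, 1, 1): 1} [0, 0, 0]
```

For [1 2 1] the vector with the largest sum, [1 1 1], is rejected because it would leave 0 1 0, and the pass then picks v1 and v2 once each.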
I'm not sure of the efficiency/accuracy of this method, but it might at least give you some ideas.
Other possible solutions
Another idea for solving this is to use the basis vectors and form the problem as an optimization/least squares problem. You form a matrix of the basis vectors such that the basic problem will be minimizing Sum[(Ax - b)^2] where A is the matrix of basis vectors, b is the input sequence, and x are the basis vector coefficients. However, you also want to minimize the number of rows, so you can add a term like x^T*x to the minimization function where x^T is the transpose of x. The hard part in my opinion is finding differentiable terms to add that will encourage integer vector coefficients. If you can think of a way to do that, then optimization could very well be a good way to do this.
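Written as a formula, the objective described above would look roughly like this, where the columns of A are the basis vectors and λ is a weight I am introducing to balance the two terms (the text does not name one). The non-negativity constraint is also my addition, following from the coefficients being counts of rows:

```latex
\min_{x \ge 0} \; \lVert A x - b \rVert_2^2 + \lambda \, x^{T} x
```

The integrality requirement on x is not captured by this expression, which is the difficulty mentioned above.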
Also, you might consider a Metropolis-type Monte Carlo solution. You would choose randomly whether to add a vector, remove a vector, or substitute a vector at each step. The vector to be added/removed/substituted would be chosen randomly. The probability of this change being accepted would be a ratio of the suitabilities of the solutions before the change and after the change. The suitability could be equal to the difference between the current solution and the sequence, squared and summed, minus the number of rows/basis vectors involved in the solution. You would need to put in appropriate constants for the various terms to try to get the acceptance rate around 50%. I kind of doubt that this will work very well, but I thought that you should still consider it when looking for possible solutions.
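A minimal sketch of such a sampler follows. It departs from the description in two labeled ways: it uses the standard Metropolis acceptance rule exp(-(increase in cost) / temperature) instead of a plain ratio of suitabilities, and it only adds or removes vectors (no substitution move). The cost weights, temperature and function names are guesses for illustration.

```python
import math
import random


def cost(counts, basis, sequence, row_weight=0.1):
    """Squared column mismatch plus a small penalty per row used (weights are guesses)."""
    columns = [sum(c * v[k] for c, v in zip(counts, basis)) for k in range(len(sequence))]
    mismatch = sum((col - target) ** 2 for col, target in zip(columns, sequence))
    return mismatch + row_weight * sum(counts)


def metropolis(sequence, basis, steps=2000, temperature=0.5, seed=0):
    """Random add/remove moves on the coefficient vector; returns the best state seen."""
    rng = random.Random(seed)
    counts = [0] * len(basis)
    current = cost(counts, basis, sequence)
    best_counts, best = counts[:], current
    for _ in range(steps):
        proposal = counts[:]
        k = rng.randrange(len(basis))
        # Randomly add or remove one copy of a randomly chosen basis vector.
        proposal[k] = max(0, proposal[k] + rng.choice((-1, 1)))
        new = cost(proposal, basis, sequence)
        # Always accept improvements, sometimes accept worse moves.
        if new <= current or rng.random() < math.exp((current - new) / temperature):
            counts, current = proposal, new
            if current < best:
                best_counts, best = counts[:], current
    return best_counts, best


basis = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
best_counts, best_cost = metropolis([1, 2, 1], basis)
print(best_counts, best_cost)
```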
GA can be applied to this problem, but it won't be a 5-minute task. You need to put several things together without knowing which implementation of each of them is best.
So:
Solution representation - how will you represent a possible solution? Using a matrix seems to be the most straightforward. Using a collection of one-dimensional arrays is also possible.
But you have some constraints, so maybe the SuperGene concept is worth considering?
You must use proper mutation/crossover operators for the given gene representation.
How will you enforce constraints on solutions? By destroying those that are not valid? What if they contain valuable information? Maybe let them stay in the population but add a penalty to their fitness, so they will contribute to offspring but won't go into the next generations?
Anyway, I think that GA can be applied to this problem. Is it worth it? Usually GAs are not the best algorithm, but they are a decent one if others fail. I would go with GA just because it would be the most fun, but I would look for an alternative solution (just in case).
P.S. Personal insight: I was solving N Queens Problem, for 70 < N < 100 (board NxN, N queens). Algorithm was working fine for lower N (maybe it was trying all combination?), but with N in this range, I couldn't find proper solution. Fitness quickly jumped to about 90% of max, but in the end there were always two queens conflicting. But it was very naive implementation.
