Evaluating the distribution of words in a grid

I'm creating a word search and am trying to calculate the quality of the generated puzzles by verifying that the word set is "distributed evenly" throughout the grid. For example, placing the words consecutively and filling the grid row by row is not particularly interesting, because there will be clusters and the user will quickly notice a pattern.
How can I measure how 'evenly distributed' the words are?
What I'd like to do is write a program that takes a word search as input and outputs a score that evaluates the 'quality' of the puzzle. I'm wondering if anyone has seen a similar problem and could refer me to some resources. Perhaps there is some concept in statistics that might help? Thanks.

The basic problem is the distribution of lines in a square or rectangle. You can either do this geometrically or using integer arrays. I will try the integer arrays here.
Let M be a matrix of your puzzle,
A B C D
E F G H
I J K L
M N O P
Let "EFGH" and "CGKO" be words in the puzzle. Then create a matrix that holds, for each cell, the number of words the cell belongs to:
0 0 1 0
1 1 2 1
0 0 1 0
0 0 1 0
Apply a rule: each cell's new value is the sum of its 4-way neighbours (up, down, left, right); if the cell's original value is 2 or higher, multiply that sum by the original value.
0 0 1 0        1 2 2 2
1 1 2 1   -\   1 3 8 2
0 0 1 0   -/   1 2 3 2
0 0 1 0        0 1 1 1
And sum up all values in the rows and columns of the matrix:
1 2 2 2 = 7
1 3 8 2 = 14
1 2 3 2 = 8
0 1 1 1 = 3
| | | |
3 8 14 7
Then calculate the average of both result sets:
(7 + 14 + 8 + 3) / 4 = 32 / 4 = 8
(3 + 8 + 14 + 7) / 4 = 32 / 4 = 8
And calculate the average difference to the average of each result set:
 3 <-> 8 = 5       7 <-> 8 = 1
 8 <-> 8 = 0      14 <-> 8 = 6
14 <-> 8 = 6       8 <-> 8 = 0
 7 <-> 8 = 1       3 <-> 8 = 5
___avg            ___avg
3                 3
And multiply them together:
3 * 3 = 9
Treat the result as a distribution score. You might need to tweak it a little to make it work better, but this should calculate distribution scores quite nicely.
Here is an example of a bad distribution:
1 0 0 0        1 1 0 0 = 2
1 0 0 0   -\   2 1 0 0 = 3
1 0 0 0   -/   2 1 0 0 = 3
1 0 0 0        1 1 0 0 = 2
               | | | |
               6 4 0 0
Row sums:    avg 2.5, avg difference to avg 0.5
Column sums: avg 2.5, avg difference to avg 2.5
Score: 0.5 * 2.5 = 1.25
Edit: calc. errors fixed.
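
The scoring steps above can be sketched in a few lines of Python. This is only an illustration of the procedure as described, not a tuned metric; the function name `distribution_score` and the list-of-lists input format are assumptions made for the example.

```python
def distribution_score(counts):
    """Score a grid of per-cell word-membership counts (list of lists)."""
    rows, cols = len(counts), len(counts[0])

    def neighbour_sum(r, c):
        # Sum of the 4-way neighbours that lie inside the grid.
        total = 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += counts[rr][cc]
        return total

    # Multiply by the original value only where it is 2 or higher.
    spread = [
        [neighbour_sum(r, c) * (counts[r][c] if counts[r][c] >= 2 else 1)
         for c in range(cols)]
        for r in range(rows)
    ]

    def mean_abs_deviation(values):
        mean = sum(values) / len(values)
        return sum(abs(v - mean) for v in values) / len(values)

    row_sums = [sum(row) for row in spread]
    col_sums = [sum(col) for col in zip(*spread)]
    return mean_abs_deviation(row_sums) * mean_abs_deviation(col_sums)


crossing_words = [[0, 0, 1, 0], [1, 1, 2, 1], [0, 0, 1, 0], [0, 0, 1, 0]]
single_column = [[1, 0, 0, 0]] * 4
print(distribution_score(crossing_words))
print(distribution_score(single_column))
```

With this rule the grid with two crossing words scores higher than the grid with a single word packed into the first column.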

Related

Picking out exactly one value from each row and column of a matrix

This is not exactly a question about code, but I need some help with the logic of the algorithm.
Given an NxN matrix which has at least one zero value in each row and column, how would you choose N zeros so that there is exactly one chosen zero in each row and each column? For example:
0 4 6 0 2
0 8 9 5 0
4 0 9 8 5
0 8 0 1 3
8 6 0 1 3
Clearly, you first have to choose the zeros that are singular on each row or column. I am not sure about the case when there is an equal number of zeros on several rows and columns. How would I pick the optimal values so that no line or column is left out?

This is the problem of finding a maximum cardinality matching in a bipartite graph: the rows represent one set of vertices u_1, u_2, ..., u_N, the columns the other set v_1, v_2, ..., v_N, and there is an edge u_i -- v_j whenever there is a 0 at matrix position (i, j).
It can be solved using maximum flow algorithms such as Ford-Fulkerson in O(N^3) time, or with the more specialised Hopcroft-Karp algorithm in O(N^2.5) time. In fact these algorithms solve a slightly more general problem: they find a largest-possible set of (row, column) pairs, with no row or column used twice, such that each pair has a 0 in the matrix. (In your case, you happen to know that there is a solution with N such pairs; this is obviously the best possible.)
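
As a rough illustration, here is a minimal Python sketch of bipartite matching by repeated augmenting paths. It is the simple textbook approach rather than Hopcroft-Karp, and the names `match_zeros` and `try_assign` are made up for this example.

```python
def match_zeros(matrix):
    """Return sorted (row, col) pairs, one zero per row and column where possible."""
    n = len(matrix)
    col_owner = [None] * n  # col_owner[c] is the row currently matched to column c

    def try_assign(row, seen):
        # Try to give `row` a zero column, re-routing earlier rows if needed.
        for col in range(n):
            if matrix[row][col] == 0 and col not in seen:
                seen.add(col)
                if col_owner[col] is None or try_assign(col_owner[col], seen):
                    col_owner[col] = row
                    return True
        return False

    for row in range(n):
        try_assign(row, set())
    return sorted((row, col) for col, row in enumerate(col_owner) if row is not None)


example = [
    [0, 4, 6, 0, 2],
    [0, 8, 9, 5, 0],
    [4, 0, 9, 8, 5],
    [0, 8, 0, 1, 3],
    [8, 6, 0, 1, 3],
]
print(match_zeros(example))  # 0-based (row, column) pairs
```

If fewer than N pairs come back, no full selection exists for that matrix.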

Select the row with the least number of zeros.
Among the zeros in that row, pick the one whose column has the least number of zeros.
Mark that row and column in some way (maybe remove all zeros from them after storing the index of the selected zero? This one is up to you).
The marked rows and columns are skipped in the next iteration.
Repeat until all unmarked rows and columns are traversed, or until the solution can't be extended any further.
So for the sample problem, this is how the solution can be visualized ( < and ^ represent marked rows and columns ):
0 4 6 0 2
0 8 9 5 0
4 0 9 8 5
0 8 0 1 3
8 6 0 1 3 // Row with least zeros, and last one to be accessed
Iteration 1:
0 4 6 0 2
0 8 9 5 0
4 0 9 8 5
0 8 0 1 3
8 6 0 1 3 <
_ _ ^ _ _
Iteration 2:
0 4 6 0 2
0 8 9 5 0
4 0 9 8 5 <
0 8 0 1 3
8 6 0 1 3 <
_ ^ ^ _ _
Iteration 3:
0 4 6 0 2
0 8 9 5 0 <
4 0 9 8 5 <
0 8 0 1 3
8 6 0 1 3 <
_ ^ ^ _ ^
Iteration 4:
0 4 6 0 2 <
0 8 9 5 0 <
4 0 9 8 5 <
0 8 0 1 3
8 6 0 1 3 <
_ ^ ^ ^ ^
Iteration 5:
0 4 6 0 2 <
0 8 9 5 0 <
4 0 9 8 5 <
0 8 0 1 3 <
8 6 0 1 3 <
^ ^ ^ ^ ^
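
The greedy procedure traced above could be sketched in Python as follows. This is one possible reading of the steps: it counts only zeros in columns that are still unmarked, so the order of picks can differ from the trace. The name `greedy_zeros` is an assumption, and the function returns None when the heuristic gets stuck.

```python
def greedy_zeros(matrix):
    """Greedy pick of one zero per row and column; None if it gets stuck."""
    n = len(matrix)
    free_rows, free_cols = set(range(n)), set(range(n))
    chosen = []

    def zeros_in_row(r):
        # Zeros of row r that sit in columns not yet marked.
        return [c for c in sorted(free_cols) if matrix[r][c] == 0]

    while free_rows:
        # Unmarked row with the fewest usable zeros (lowest index wins ties).
        row = min(sorted(free_rows), key=lambda r: len(zeros_in_row(r)))
        candidates = zeros_in_row(row)
        if not candidates:
            return None  # no zero left for this row
        # Among its zeros, take the column with the fewest zeros in unmarked rows.
        col = min(candidates,
                  key=lambda c: sum(1 for r in free_rows if matrix[r][c] == 0))
        chosen.append((row, col))
        free_rows.remove(row)
        free_cols.remove(col)
    return sorted(chosen)


example = [
    [0, 4, 6, 0, 2],
    [0, 8, 9, 5, 0],
    [4, 0, 9, 8, 5],
    [0, 8, 0, 1, 3],
    [8, 6, 0, 1, 3],
]
print(greedy_zeros(example))  # 0-based (row, column) pairs
```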

Understanding Spark MLlib LDA input format

I am trying to implement LDA using Spark MLlib.
But I am having difficulty understanding the input format. I was able to run the sample implementation, which takes its input from a file that contains only numbers, as shown:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
I followed
http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
I understand the output format of this as explained here.
My use case is very simple: I have one data file with some sentences.
I want to convert this file into a corpus so that I can pass it to org.apache.spark.mllib.clustering.LDA.run().
My question is what the numbers in the input represent, which then go through zipWithIndex and are passed to LDA. Does the number 1, wherever it appears, represent the same word, or is it some kind of count?

First you need to convert your sentences into vectors.
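import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One document per line, split into words
val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split(" ").toSeq)
// Hash each word to a vector index and count its occurrences
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
// Reweight the term counts by inverse document frequency
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// LDA expects (documentId, vector) pairs
val corpus = tfidf.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)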
Read more about TF_IDF vectorization here
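
As for what the numbers mean: in the sample file that the linked documentation loads, each line is one document and each column is the number of times one vocabulary term occurs in it, so the values are counts rather than word ids. The plain-Python sketch below shows how sentences turn into such count vectors. It does not use Spark, and `count_vectors` is a hypothetical helper written only for illustration.

```python
def count_vectors(sentences):
    """Turn sentences into (vocabulary, per-document term-count vectors)."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in tokenized:
        vector = [0] * len(vocabulary)
        for word in doc:
            vector[index[word]] += 1  # column = word, value = occurrences
        vectors.append(vector)
    return vocabulary, vectors


vocabulary, vectors = count_vectors(["the cat sat", "the dog sat on the cat"])
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
for row in vectors:
    print(row)
```

Each printed row plays the role of one line in the sample input file.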

Finding all subsets of a multiset

Suppose I have a bag which contains 6 balls (3 white and 3 black). I want to find all possible subsets of a given length, disregarding the order. In the case above, there are only 4 combinations of 3 balls I can draw from the bag:
2 white and 1 black
2 black and 1 white
3 white
3 black
I already found a library in my language of choice that does exactly this, but I find it slow for greater numbers. For example, with a bag containing 15 white, 1 black, 1 blue, 1 red, 1 yellow and 1 green, there are only 32 combinations of 10 balls, but it takes 30 seconds to yield the result.
Is there an efficient algorithm which can find all those combinations that I could implement myself? Maybe this problem is not as trivial as I first thought...
Note: I'm not even sure of the right technical terms to express this, so feel free to correct the title of my post.

You can do significantly better than a general choose algorithm. The key insight is to treat each color of balls at the same time, rather than each of those balls one by one.
I created an un-optimized implementation of this algorithm in Python that correctly counts the 32 results of your test case in milliseconds:
def multiset_choose(items_multiset, choose_items):
    if choose_items == 0:
        return 1  # always one way to choose zero items
    elif choose_items < 0:
        return 0  # always no ways to choose less than zero items
    elif not items_multiset:
        return 0  # always no ways to choose some items from a set of no items
    elif choose_items > sum(item[1] for item in items_multiset):
        return 0  # always no ways to choose more items than are in the multiset
    current_item_name, current_item_number = items_multiset[0]
    max_current_items = min([choose_items, current_item_number])
    return sum(
        multiset_choose(items_multiset[1:], choose_items - c)
        for c in range(0, max_current_items + 1)
    )
And the tests:
print(multiset_choose([("white", 3), ("black", 3)], 3))
# output: 4
print(multiset_choose([("white", 15), ("black", 1), ("blue", 1), ("red", 1), ("yellow", 1), ("green", 1)], 10))
# output: 32

No, you don't need to search through all possible alternatives. A simple recursive algorithm (like the one given by @recursive) will give you the answer. If you are looking for a function that actually outputs all of the combinations, rather than just how many there are, here is a version written in R. I don't know what language you are working in, but it should be pretty straightforward to translate this into anything, although the code might be longer, since R is good at this kind of thing.
allCombos <- function(len,               ## number of items to sample
                      x,                 ## array of quantities of balls, by color
                      names=1:length(x)  ## names of the colors (defaults to "1","2",...)
                      ){
  if(length(x)==0)
    return(c())
  r <- c()
  for(i in max(0,len-sum(x[-1])):min(x[1],len))
    r <- rbind(r,cbind(i,allCombos(len-i,x[-1])))
  colnames(r) <- names
  r
}
Here's the output:
> allCombos(3,c(3,3),c("white","black"))
white black
[1,] 0 3
[2,] 1 2
[3,] 2 1
[4,] 3 0
> allCombos(10,c(15,1,1,1,1,1),c("white","black","blue","red","yellow","green"))
white black blue red yellow green
[1,] 5 1 1 1 1 1
[2,] 6 0 1 1 1 1
[3,] 6 1 0 1 1 1
[4,] 6 1 1 0 1 1
[5,] 6 1 1 1 0 1
[6,] 6 1 1 1 1 0
[7,] 7 0 0 1 1 1
[8,] 7 0 1 0 1 1
[9,] 7 0 1 1 0 1
[10,] 7 0 1 1 1 0
[11,] 7 1 0 0 1 1
[12,] 7 1 0 1 0 1
[13,] 7 1 0 1 1 0
[14,] 7 1 1 0 0 1
[15,] 7 1 1 0 1 0
[16,] 7 1 1 1 0 0
[17,] 8 0 0 0 1 1
[18,] 8 0 0 1 0 1
[19,] 8 0 0 1 1 0
[20,] 8 0 1 0 0 1
[21,] 8 0 1 0 1 0
[22,] 8 0 1 1 0 0
[23,] 8 1 0 0 0 1
[24,] 8 1 0 0 1 0
[25,] 8 1 0 1 0 0
[26,] 8 1 1 0 0 0
[27,] 9 0 0 0 0 1
[28,] 9 0 0 0 1 0
[29,] 9 0 0 1 0 0
[30,] 9 0 1 0 0 0
[31,] 9 1 0 0 0 0
[32,] 10 0 0 0 0 0
>
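
For readers who prefer Python, the same enumeration can be sketched as a generator. This is a loose translation of the idea for illustration, not the author's code; the name `multiset_combinations` and the tuple-of-counts output format are assumptions.

```python
def multiset_combinations(counts, k):
    """Yield tuples giving how many items of each colour are taken, k in total."""
    if not counts:
        if k == 0:
            yield ()
        return
    rest = counts[1:]
    # Take at least enough of the first colour that the rest can still
    # make up the remainder, and at most what is available or needed.
    lowest = max(0, k - sum(rest))
    highest = min(counts[0], k)
    for take in range(lowest, highest + 1):
        for tail in multiset_combinations(rest, k - take):
            yield (take,) + tail


print(list(multiset_combinations([3, 3], 3)))
# [(0, 3), (1, 2), (2, 1), (3, 0)]
print(len(list(multiset_combinations([15, 1, 1, 1, 1, 1], 10))))
# 32
```

Because the lower bound prunes branches that cannot reach the requested size, every recursive call contributes at least one result.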

Find row and column number of eight neighbors conditionally in Matlab

I have a 6 * 6 matrix
A=
3 8 8 8 8 8
4 6 1 0 7 -1
9 7 0 2 6 -1
7 0 0 5 4 4
4 -1 0 2 8 1
1 -1 0 8 3 9
I am interested in finding the row and column numbers of neighbors, starting from A(4,4)=5. But they will be linked to A(4,4) as neighbors only if A(4,4) has the element 4 on its right, 6 on its left, 2 on top, 8 on the bottom, 1 on the top left diagonal, 3 on the top right diagonal, 7 on the bottom left diagonal and 9 on the bottom right diagonal. To be more clear, A(4,4) will have neighbors if the neighbors surround A(4,4) as follows:
1 2 3;
6 5 4;
7 8 9;
And this will continue as each neighbor is found.
Also, 0 and -1 will be ignored. In the end I want to have these cells' row and column numbers as shown in the figure below. Is there any way to visualize this network as well? This is a sample only; I really have a huge matrix.

A = [3 8 8 8 8 8;
4 6 1 0 7 -1;
9 7 0 2 6 -1;
7 0 0 5 4 4;
4 -1 0 2 8 1;
1 -1 0 8 3 9];
test = [1 2 3;
6 5 4;
7 8 9];
%//Pad A with zeros on each side so that comparing with test never overruns the boundaries
%//BTW if you have the image processing toolbox you can use the padarray() function to handle this
P = zeros(size(A) + 2);
P(2:end-1, 2:end-1) = A;
current = zeros(size(A) + 2);
past = zeros(size(A) + 2);
%//Initial state (starting point)
current(5,5) = 1; %//This is A(4,4) but shifted up 1 because of the padding
condition = 1;
while sum(condition(:)) > 0
    %//get the coordinates of any new values added to current
    [x, y] = find(current - past);
    %//update past to the last iteration's current
    past = current;
    %//loop through all the coordinates returned by find above
    for ii = 1:numel(x)
        %//Make coord vectors that represent the current coordinate plus its 8 immediate neighbours.
        %//Note that this is why we padded the sides in the beginning, so if we hit a coordinate on an edge, we can still get 8 neighbours for it!
        xcoords = x(ii)-1:x(ii)+1;
        ycoords = y(ii)-1:y(ii)+1;
        %//Update current based on comparing the coord and its neighbours against the test matrix, be sure to keep the past found points hence the OR
        current(xcoords, ycoords) = (P(xcoords, ycoords) == test) | current(xcoords, ycoords);
    end
    %//The stopping condition is when current == past
    condition = current - past;
end
%//Strip off the padded sides
FinalAnswer = current(2:end-1, 2:end-1)
[R, C] = find(FinalAnswer);
coords = [R C] %//This line is unnecessary, it just prints out the results at the end for you.
OK cool you got very close, so here is the final solution with the loops. It runs in about 0.002 seconds so it's pretty quick I think. The output is
FinalAnswer =
0 0 0 0 0 0
0 1 1 0 0 0
0 1 0 1 0 0
1 0 0 1 1 1
0 0 0 0 1 0
0 0 0 0 0 1
coords =
4 1
2 2
3 2
2 3
3 4
4 4
4 5
5 5
4 6
6 6
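
The same propagation can be written as a breadth-first search, which avoids rescanning the whole matrix on every pass. The Python sketch below is an illustrative alternative to the MATLAB loop, not a translation of it; `linked_cells` is a made-up name and the coordinates are 0-based.

```python
from collections import deque


def linked_cells(grid, start, pattern):
    """Return sorted 0-based (row, col) cells reachable from start.

    A neighbour is linked when its value equals the entry of the 3x3
    pattern at the same offset from the centre.
    """
    rows, cols = len(grid), len(grid[0])
    found = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= rr < rows and 0 <= cc < cols):
                    continue
                if (rr, cc) not in found and grid[rr][cc] == pattern[dr + 1][dc + 1]:
                    found.add((rr, cc))
                    queue.append((rr, cc))
    return sorted(found)


A = [
    [3, 8, 8, 8, 8, 8],
    [4, 6, 1, 0, 7, -1],
    [9, 7, 0, 2, 6, -1],
    [7, 0, 0, 5, 4, 4],
    [4, -1, 0, 2, 8, 1],
    [1, -1, 0, 8, 3, 9],
]
test = [[1, 2, 3], [6, 5, 4], [7, 8, 9]]
# A(4,4) in MATLAB indexing is (3, 3) here; add 1 to compare with the output above.
print([(r + 1, c + 1) for r, c in linked_cells(A, (3, 3), test)])
```

Values 0 and -1 never match because they do not occur in the pattern, so no special case is needed for them.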

Adding zeros between every 2 elements of a matrix in matlab/octave

I am interested in how I can add rows and columns of zeros to a matrix so that it looks like this:
                   1 0 2 0 3
1 2 3              0 0 0 0 0
2 3 4      =>      2 0 3 0 4
5 4 3              0 0 0 0 0
                   5 0 4 0 3
Actually I am interested in how I can do this efficiently, because walking the matrix and adding zeros takes a lot of time if you work with a big matrix.
Update:
Thank you very much.
Now I'm trying to replace the zeroes with the sum of their neighbors:
                   1 0 2 0 3        1 3 2 5 3
1 2 3              0 0 0 0 0        3 8 5 12 ... and so on
2 3 4      =>      2 0 3 0 4   =>
5 4 3              0 0 0 0 0
                   5 0 4 0 3
As you can see, I'm considering all 8 neighbors of an element, but again, using for loops and walking the matrix slows me down quite a bit. Is there a faster way?

Let your little matrix be called m1. Then:
m2 = zeros(5)
m2(1:2:end,1:2:end) = m1(:,:)
Obviously this is hard-wired to your example, I'll leave it to you to generalise.

Here are two ways to do part 2 of the question. The first does the shifts explicitly, and the second uses conv2. The second way should be faster.
M=[1 2 3; 2 3 4 ; 5 4 3];
% this matrix (M expanded) has zeros inserted, but also an extra row and column of zeros
Mex = kron(M,[1 0 ; 0 0 ]);
% The sum matrix is built from shifts of the original matrix
Msum = Mex + circshift(Mex,[1 0]) + ...
circshift(Mex,[-1 0]) +...
circshift(Mex,[0 -1]) + ...
circshift(Mex,[0 1]) + ...
circshift(Mex,[1 1]) + ...
circshift(Mex,[-1 1]) + ...
circshift(Mex,[1 -1]) + ...
circshift(Mex,[-1 -1]);
% trim the extra line
Msum = Msum(1:end-1,1:end-1)
% another version, a bit more fancy:
MexTrimmed = Mex(1:end-1,1:end-1);
MsumV2 = conv2(MexTrimmed,ones(3),'same')
Output:
Msum =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
MsumV2 =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
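
To make the two steps concrete outside MATLAB, here is a small plain-Python sketch that inserts the zeros and then replaces every cell with the sum of its 3x3 neighbourhood, which is what the conv2 call with ones(3) computes. It is meant for checking results on small inputs, not for speed, and the function names are assumptions.

```python
def insert_zeros(m):
    """Place the entries of m on every second row and column of a zero grid."""
    rows, cols = len(m), len(m[0])
    out = [[0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for r in range(rows):
        for c in range(cols):
            out[2 * r][2 * c] = m[r][c]
    return out


def neighbourhood_sums(grid):
    """Sum each cell with its (up to) 8 neighbours, clipping at the borders."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


M = [[1, 2, 3], [2, 3, 4], [5, 4, 3]]
for row in neighbourhood_sums(insert_zeros(M)):
    print(row)
```

Including the centre cell in the sum is harmless here: the inserted cells are zero, and the original cells keep their value because all of their neighbours are zero.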
