Find top documents which matched the query of words - algorithm

In this problem we have 1,000,000 documents.
Documents have:
-Text (contains a lot of words)
-Date
-DocId
.. and so on
We also have a query which has some words (max 1000).
The problem is to first find the intersection between each document and the query, and then find the top K documents which have the largest number of matched words.
For Example:
D1 - w1, w2, w3, w4, ... wn
D2 - w2, w4, w5, x2
D3 - a1, a2, w1, x1, x2
Q(w1,a1,w4,w5,x1,w5,w6)
so now doing the intersection of queries and docs
D1 - w1,w4,w5,w6 - 4 match
D2 - w4,w5 - 2 match
D3 - a1,x1,w1 - 3 match
So top 2 Docs are D1 and D3
I have tried to put words to document mapping in a 2d matrix.
     D1  D2  D3
w1    1       1
w2    1   1
w3    1
.
.
.
a1            1
a2            1
x1            1
x2        1   1
From this matrix, I tried to find numbers but the interviewer was not happy.
Please help guys !!

If you have to program it yourself, you'd probably build a hash table with the 1000 words, then go through the documents and check all words for matches. Keep a list of the k best matches around and update it after each document.
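A minimal sketch of the hash-table approach described above, assuming each document is available as a list of words keyed by its DocId. The function name `top_k_documents` and the choice of n = 6 for D1 are illustrative assumptions, not part of the original question. A min-heap of size k plays the role of the "list of the k best matches".

```python
import heapq

def top_k_documents(documents, query_words, k):
    """Return the k (match_count, doc_id) pairs with the most distinct query words."""
    query = set(query_words)          # hash set of the (at most 1000) query words
    heap = []                         # min-heap holding the k best documents so far
    for doc_id, words in documents.items():
        matches = len(query.intersection(words))   # distinct query words in this doc
        entry = (matches, doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)          # evict the current weakest match
    return sorted(heap, reverse=True)

# Example from the question, assuming n = 6 for D1 (hypothetical choice).
docs = {
    "D1": ["w1", "w2", "w3", "w4", "w5", "w6"],
    "D2": ["w2", "w4", "w5", "x2"],
    "D3": ["a1", "a2", "w1", "x1", "x2"],
}
query = ["w1", "a1", "w4", "w5", "x1", "w5", "w6"]
best = top_k_documents(docs, query, 2)
```

Each document is scanned once, and the heap keeps memory bounded by k regardless of how many documents there are.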
In real life, I would stuff the documents into a PostgreSQL database, create a full text search index on the text and run an SQL query containing the search words. Why reinvent the wheel?

Related

Truth table of f(x1,x2,x3,x4) function from given two (4-1) multiplexers

Given two (4-1) multiplexers
How can I get the truth table of f(x1,x2,x3,x4) function??
A 4-1 multiplexer has the following general truth-table:
A1 A0 Y
0 0 I0
0 1 I1
1 0 I2
1 1 I3
The two control inputs A0 and A1 select which of the four inputs is switched through to the output.
To get your question solved, start with the left-hand multiplexer and write a truth-table for it.
In a second step write the overall truth-table by plugging in the intermediate signal values in the general table shown above.
The resulting truth-table has four input columns X1, X2, X3, X4.
There is one output column Y. Rather than using intermediate truth-tables you could figure out the output value for each of the 16 input combinations.
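The enumeration of all 16 input combinations can be sketched as follows. The actual wiring of the two multiplexers is not given in the text, so `example_f` below is a purely hypothetical wiring used only to show the mechanics; `mux4` follows the general truth table above.

```python
from itertools import product

def mux4(a1, a0, i0, i1, i2, i3):
    """4-1 multiplexer: A1, A0 select which of I0..I3 reaches the output."""
    return (i0, i1, i2, i3)[2 * a1 + a0]

def truth_table(f):
    """List of ((x1, x2, x3, x4), f(x1, x2, x3, x4)) for all 16 combinations."""
    return [(bits, f(*bits)) for bits in product((0, 1), repeat=4)]

# HYPOTHETICAL wiring, not taken from the question: the left multiplexer is
# selected by x1, x2 and its output feeds two data inputs of the right one.
def example_f(x1, x2, x3, x4):
    inner = mux4(x1, x2, 0, 1, 1, 0)              # intermediate signal
    return mux4(x3, x4, inner, 0, 1, 1 - inner)   # overall output Y

for bits, y in truth_table(example_f):
    print(*bits, y)
```

Replacing `example_f` with the real wiring gives the requested table directly.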

A feature ranking algorithm

if I have the following partitions or subsets with the corresponding scores as follows:
{X1,X2} with score C1
{X2,X3} with score C2
{X3,X4} with score C3
{X4,X1} with score C4
I want to write an algorithm that will rank the Xs based on the corresponding score of the subset they appeared in.
one way for example will be to do the following:
X1 = (C1 + C4)/2
X2 = (C1 + C2)/2
X3 = (C2 + C3)/2
X4 = (C3 + C4)/2
and then sort the results.
is there a more efficient or better ideas to do the ranking?
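For reference, the averaging scheme above can be sketched like this. The function name `rank_features` and the numeric scores are made up for illustration.

```python
from collections import defaultdict

def rank_features(scored_subsets):
    """Rank features by the mean score of the subsets they appear in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for features, score in scored_subsets:
        for feature in features:
            totals[feature] += score
            counts[feature] += 1
    averages = {f: totals[f] / counts[f] for f in totals}
    # Highest average first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Illustrative scores: C1 = 4, C2 = 2, C3 = 1, C4 = 3 (assumed values).
subsets = [
    (("X1", "X2"), 4.0),
    (("X2", "X3"), 2.0),
    (("X3", "X4"), 1.0),
    (("X4", "X1"), 3.0),
]
ranking = rank_features(subsets)
```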
If you think that the score of a set is the sum of the scores of each object, you can write your equations in matrix form as:
C = M * X
where C is a vector of length 4 with components C1, C2, C3, C4, M is the matrix (in your case, as I understand this may vary)
1 1 0 0
0 1 1 0
0 0 1 1
1 0 0 1
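and X is the unknown. You can then use Gaussian elimination to determine X and then get the ranking as you suggested. Note that this particular M is singular (row 1 - row 2 + row 3 - row 4 = 0), so with exactly these four subsets X is not uniquely determined; you would need additional subsets or an extra constraint to pin it down.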

What is the best way to distribute n forms in c categories between u users? [closed]

I have asked this question in cstheory too
I have a form distribution problem. There are n forms in c categories (each form belongs to exactly one category), and there are u users, each of whom can receive forms from at least one category (possibly more than one).
The goal is to distribute forms between users so that each user receives the same number of forms. I would prefer to use the categories equally.
For example:
If categories are:
C1 : 20 forms
C2 : 3 forms
C3 : 8 forms
C4 : 2 forms
And users are:
U1 with access to C1 and C2
U2 with access to C2
U3 with access to C3
U4 with access to C1 and C3
U5 with access to C2 and C4
The answer should be:
U1: 1 x C1 + 1 x C2 | 2 x C1 (preferred)
U2: 2 x C2
U3: 2 x C3
U4: 1 x C1 + 1 x C3 | 2 x C1 (preferred) | 2 x C3
U5: 2 x C4
And 23 forms remain.
Do you have any suggestions on how I can write such an algorithm?
There could be a second question, in which some categories have a SHOULD CONTRIBUTE option. If it is set, all remaining forms in that category are distributed between the users who have access to it. For example, if C1 has this option enabled, the answer should be:
U1: 1 x C1 + 1 x C2 + 9 C1
U2: 2 x C2
U3: 2 x C3
U4: 2 x C3 (to minimize remaining forms in C3 category) + 10 C1
U5: 2 x C4
and remaining forms would be 0 in C1, 0 in C2, 4 in C3 and 0 in C4.
I think it's a kind of bin packing problem, but I am not sure and I don't know how to solve it! :(
Note: The above answers are not necessarily the best answers, they are just what I think!
It seems to me that if you fix a number N of forms per user and ask the question "can we give N forms to each user?", then you can turn this into a maximum flow problem (http://en.wikipedia.org/wiki/Maximum_flow_problem), where each user can receive flow/forms from their subset of categories, and there is an outflow of capacity N from each user. Also, if you can solve this problem for N you can solve it for all lesser values of N.
So you could solve the first problem by running max-flow lg (maximum N) times, using a binary chop to find out what the best possible value of N is. Since you can solve it by max flow, you can also solve it by linear programming. Doing it this way, perhaps just for the critical value of N, might allow you to favour some assignments over others, or perhaps to see where there are neighbouring feasible solutions, and then see if you can mix them to use categories equally.
Example: Create a source and link it to each of the categories Ci, with the capacity of the link being the number of forms available in that category, so C1 gets a link from the source of capacity 20. Create links between categories and users wherever the user has access to the category, each with the same capacity as that category's link from the source, so U1 gets links to C1 and C2, but U2 only gets a link to C2. Now create links of capacity N from each user to a single sink. If there is an assignment of forms to users that gives every user N forms, then the maximum flow will fill every link from user to sink, and you can look at the flows between categories and users to see how to assign forms.
You could start off with N = 3, because user 2 only has access to a total of 3 forms, so the answer can't be greater than that. That won't work, because you have said the right answer has N = 2, so the max flow won't fill all the N = 3 capacity links. So your program tries again at 3/2 = 1 (integer division) and finds a solution: you have provided a solution for N = 2, so there must be one for N = 1. Now the program knows there is a solution for N = 1 but not one for N = 3, so it tries halfway between at N = (1 + 3) / 2 = 2 and finds your solution. There is a solution for N = 2 but not for N = 3, so the N = 2 solution is the best you can do.
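A rough sketch of the max-flow plus binary-chop idea, using a plain Edmonds-Karp implementation from the standard library only. The names `max_flow`, `feasible` and `best_n` are illustrative, and the binary search here runs over the whole range 0..upper bound rather than following the exact probe order described above.

```python
from collections import defaultdict, deque

def max_flow(cap, source, sink):
    """Edmonds-Karp on a nested-dict residual graph (cap is modified in place)."""
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:       # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        push, v = float("inf"), sink              # bottleneck along the path
        while parent[v] is not None:
            push = min(push, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:              # update residual capacities
            u = parent[v]
            cap[u][v] -= push
            cap[v][u] += push
            v = u
        total += push

def feasible(forms, access, n):
    """Can every user receive exactly n forms?"""
    cap = defaultdict(lambda: defaultdict(int))
    for cat, count in forms.items():
        cap["source"][("cat", cat)] = count
    for user, cats in access.items():
        for cat in cats:
            cap[("cat", cat)][("user", user)] = forms[cat]
        cap[("user", user)]["sink"] = n
    return max_flow(cap, "source", "sink") == n * len(access)

def best_n(forms, access):
    """Largest n such that every user can receive n forms (binary search)."""
    lo = 0
    hi = min(sum(forms[c] for c in cats) for cats in access.values())
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(forms, access, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

forms = {"C1": 20, "C2": 3, "C3": 8, "C4": 2}
access = {"U1": ["C1", "C2"], "U2": ["C2"], "U3": ["C3"],
          "U4": ["C1", "C3"], "U5": ["C2", "C4"]}
print(best_n(forms, access))
```

This only answers how many forms each user can get; the preference for using categories equally would need the linear-programming refinement mentioned above.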

Deciphering the key

Alice invents a key (s1, s2, s3, ... , sk). Bob makes a guess (g1, g2, g3, ... , gk). He is awarded one point for each si = gi.
Each si is an integer in the range 0<=si<=11.
Given q guesses with their scores bi
(g1, g2, g3, ... , gk) b1
(g1, g2, g3, ... , gk) b2
.
.
.
(g1, g2, g3, ... , gk) bq
Can you state whether a key is possible? Given 0<=si<=11, 1<=k<=11, 1<=q<=8.
For Example
2 2 1 1 2
1 1 2 2 1
For the guess 2 2 1 1 the score is 2
For the guess 1 1 2 2 the score is 1
A key is possible, for example 2 1 1 3, which gives the desired scores. Hence the answer is YES
Another Example
1 2 3 4 4
4 3 2 1 1
For the guess 1 2 3 4 the score is 4
For the guess 4 3 2 1 the score is 1
This has no key which gives the desired scores hence answer is NO
I tried the brute force approach of generating all n^k keys, where n is the size of the range of si, but it gave a Time Limit Exceeded error.
It's an interview puzzle. I have seen variants of this question but was not able to solve them. Can you tell me what I should read for this type of question?
I don't know the best solution to this problem, but if you did a recursive search of the possible solution space, pruning branches which could not possibly lead to a solution, it would be much faster than trying all (n^k) keys.
Take your example:
1 2 3 4 -> 4
4 3 2 1 -> 1
The 3 possible values for s1 (the first key position) which could be significant are: 1, 4, and "neither 1 nor 4". Choose one of them, and then recursively look at the possible values for s2. Choose one, and recursively look at the possible values for s3, etc.
As you search, keep track of a cumulative score for each of the q guesses. Whenever you choose a value for a position, increment the cumulative scores for all the guesses which have the same number in that position. Keep these cumulative scores on a stack (so you can back up).
When you reach a point where no solution is possible, back up and continue searching a different path. If you back all the way up to s1 and no more paths are left to search, then the answer is NO. If you find a solution, then the answer is YES.
When to stop searching a path and back up:
If the cumulative score of one of the guesses exceeds the given score
If the cumulative score of one of the guesses is less than the given score minus the number of levels left in the search tree (before you hit the bottom)
This approach could still be very slow, especially if "k" was large. But again, it will be far faster than generating (n^k) keys.
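The pruned recursive search can be sketched as follows. The function name `key_exists` is illustrative; at each position it tries every value that appears in some guess plus a single representative for "none of the guessed values", and it applies the two stopping rules listed above.

```python
def key_exists(guesses, scores, num_values=12):
    """Return True if some key gives every guess exactly its stated score."""
    k = len(guesses[0])

    def search(pos, cumulative):
        remaining = k - pos
        for got, wanted in zip(cumulative, scores):
            # Prune: score already too high, or too low to ever catch up.
            if got > wanted or got + remaining < wanted:
                return False
        if pos == k:
            return True          # the checks above guarantee got == wanted here
        column = [g[pos] for g in guesses]
        candidates = set(column)
        # One representative value that matches no guess at this position.
        other = next((v for v in range(num_values) if v not in candidates), None)
        options = list(candidates) + ([other] if other is not None else [])
        for value in options:
            updated = [c + (g == value) for c, g in zip(cumulative, column)]
            if search(pos + 1, updated):
                return True
        return False

    return search(0, [0] * len(guesses))

print(key_exists([[2, 2, 1, 1], [1, 1, 2, 2]], [2, 1]))   # first example
print(key_exists([[1, 2, 3, 4], [4, 3, 2, 1]], [4, 1]))   # second example
```

Passing the cumulative scores as a fresh list per call replaces the explicit stack: backing up is just returning from the recursive call.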

Special scheduling Algorithm (pattern expansion)

Question
Do you think genetic algorithms are worth trying out for the problem below, or will I hit local-minima issues?
I think some aspects of the problem may be a good fit for a generator / fitness-function style setup. (If you've botched a similar project I would love to hear from you, so that I don't do something similar.)
Thank you for any tips on how to structure things and get this right.
The problem
I'm searching for a good scheduling algorithm to use for the following real-world problem.
I have a sequence with 15 slots like this (The digits may vary from 0 to 20) :
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
(And there are in total 10 different sequences of this type)
Each sequence needs to expand into an array, where each slot can take 1 position.
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
The constraints on the matrix are:
[row-wise, i.e. horizontally] The number of ones placed, must either be 11 or 111
[row-wise] The distance between two sequences of 1 needs to be a minimum of 00
The sum of each column should match the original array.
The number of rows in the matrix should be optimized.
The array then needs to be allocated to one of 4 different matrices, which may have different numbers of rows:
A, B, C, D
A, B, C and D are real-world departments. The load needs to be placed reasonably fair during the course of a 10-day period, not to interfere with other department goals.
Each of the matrix is compared with expansion of 10 different original sequences so you have:
A1, A2, A3, A4, A5, A6, A7, A8, A9, A10
B1, B2, B3, B4, B5, B6, B7, B8, B9, B10
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Certain spots on these may be reserved (Not sure if I should make it just reserved/not reserved or function-based). The reserved spots might be meetings and other events
The sum of each row (for instance all the A's) should be approximately the same within 2%. i.e. sum(A1 through A10) should be approximately the same as (B1 through B10) etc.
The number of rows can vary, so you have for instance:
A1: 5 rows
A2: 5 rows
A3: 1 row, where that single row could for instance be:
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
etc..
Sub-problem
I'd be very happy to solve only part of the problem. For instance, being able to input:
1 1 2 3 4 2 2 3 4 2 2 3 3 2 3
and get an appropriate array of sequences of 1's and 0's, with the number of rows minimized, following the constraints above.
Sub-problem solution attempt
Well, here's an idea. This solution is not based on using a genetic algorithm, but some ideas could be used in going in that direction.
Basis vectors
First of all, you should generate what I think of as the basis vectors. For instance, if your sequence were 3 numbers long rather than 15, the basis vectors would be:
v1 = [1 1 0]
v2 = [0 1 1]
v3 = [1 1 1]
Any solution for sequence length 3 would be a linear combination of these three vectors using only non-negative integers. In other words, the general solution would be
a*v1 + b*v2 + c*v3
where a, b and c are non-negative integers. For the sequence [1 2 1], the solution is a = 1, b = 1, c = 0. What you first want to do is find all of the possible basis vectors of length 15. From my rough calculations I think that there are somewhere between 300 and 400 basis vectors of length 15. I can give you some tips towards generating them if you want.
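One way to generate the basis vectors is plain brute force over all 0/1 rows of the given length, keeping those that satisfy the row constraints as I read them: every run of 1's has length 2 or 3, and two runs are separated by at least two 0's. The names `is_valid_row` and `basis_vectors` are illustrative, and the reading of the constraints is an assumption.

```python
import re
from itertools import product

def is_valid_row(bits):
    """True if the row is non-empty and obeys the run and gap constraints."""
    row = "".join(str(b) for b in bits)
    runs = re.findall(r"1+", row)
    if not runs:
        return False                              # the all-zero row is not a basis vector
    if any(len(run) not in (2, 3) for run in runs):
        return False                              # runs must be 11 or 111
    gaps = re.findall(r"(?<=1)0+(?=1)", row)      # zeros strictly between two runs
    return all(len(gap) >= 2 for gap in gaps)

def basis_vectors(length):
    """All valid rows of the given length, as tuples of 0/1."""
    return [bits for bits in product((0, 1), repeat=length) if is_valid_row(bits)]

print(basis_vectors(3))
print(len(basis_vectors(15)))
```

For length 15 this checks only 2^15 candidate rows, so brute force is cheap enough here.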
Finding solutions
Now, what you want to do is sort these basis vectors by their sums/magnitudes. Then in searching for your solution, you start with the basis vectors which have the largest sums, because they lead to fewer total rows. We also have an array, veccoefs, which contains an entry for the linear coefficient of each basis vector. At the beginning of the search, all the veccoefs are 0.
So we take the first basis vector (the one with the largest sum/magnitude) and subtract this vector from the sequence until we either create an unsolvable result ( having a 0 1 0 in it for instance) or any of the numbers in the result is negative. We store the number of times we subtract the vector in veccoefs. We use the result after subtracting the basis vector from the sequence as the sequence for the next basis vector. If there are only zeros left in the result, then we stop the loop.
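A small sketch of this greedy subtraction, with the basis passed in explicitly. The "unsolvable" test is interpreted here as a positive entry with zeros (or the border) on both sides, which is my reading of the "0 1 0" remark; `greedy_decompose` and `has_isolated_entry` are illustrative names. The greedy pass is not guaranteed to find a decomposition, so it also returns whatever residual is left.

```python
def has_isolated_entry(seq):
    """True if some positive entry has zeros (or the border) on both sides."""
    padded = [0] + list(seq) + [0]
    return any(
        padded[i] > 0 and padded[i - 1] == 0 and padded[i + 1] == 0
        for i in range(1, len(padded) - 1)
    )

def greedy_decompose(sequence, basis):
    """Subtract basis vectors, largest sum first; return (coefficients, residual)."""
    residual = list(sequence)
    coefs = {}
    for vec in sorted(basis, key=sum, reverse=True):
        if not any(vec):
            continue                      # skip an all-zero vector to avoid looping forever
        count = 0
        while True:
            trial = [r - v for r, v in zip(residual, vec)]
            if min(trial) < 0 or has_isolated_entry(trial):
                break                     # this subtraction would make things unsolvable
            residual = trial
            count += 1
        if count:
            coefs[tuple(vec)] = count
        if not any(residual):
            break                         # only zeros left, done
    return coefs, residual

basis3 = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
print(greedy_decompose([1, 2, 1], basis3))
print(greedy_decompose([2, 2, 2], basis3))
```

A non-zero residual at the end means the greedy order failed for that input, and a backtracking or optimization approach would be needed instead.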
I'm not sure of the efficiency/accuracy of this method, but it might at least give you some ideas.
Other possible solutions
Another idea for solving this is to use the basis vectors and form the problem as an optimization/least squares problem. You form a matrix of the basis vectors such that the basic problem will be minimizing Sum[(Ax - b)^2] where A is the matrix of basis vectors, b is the input sequence, and x are the basis vector coefficients. However, you also want to minimize the number of rows, so you can add a term like x^T*x to the minimization function where x^T is the transpose of x. The hard part in my opinion is finding differentiable terms to add that will encourage integer vector coefficients. If you can think of a way to do that, then optimization could very well be a good way to do this.
Also, you might consider a Metropolis-type Monte Carlo solution. You would choose randomly whether to add a vector, remove a vector, or substitute a vector at each step. The vector to be added/removed/substituted would be chosen randomly. The probability of this change being accepted would be a ratio of the suitabilities of the solutions before the change and after the change. The suitability could be equal to the difference between the current solution and the sequence, squared and summed, minus the number of rows/basis vectors involved in the solution. You would need to put in appropriate constants for the various terms to try to get the acceptance rate around 50%. I kind of doubt that this will work very well, but I thought that you should still consider it when looking for possible solutions.
A GA can be applied to this problem, but it won't be a 5-minute task. You need to put several things together, without knowing which implementation of each of them is best.
So:
Solution representation - how will you represent a possible solution? Using a matrix seems to be the most straightforward. Using a collection of one-dimensional arrays is also possible.
But you have some constraints, so maybe the SuperGene concept is worth considering?
You must use proper mutation/crossover operators for the given gene representation.
How will you enforce constraints on solutions? Destroying those that are not proper? What if they contain valuable information? Maybe let them stay in the population but add some penalty to their fitness, so they will contribute to offspring, but won't go into the next generations?
Anyway, I think that a GA can be applied to this problem. Is it worth it? Usually GAs are not the best algorithm, but they are a decent algorithm if others fail. I would go with a GA, just because it would be the most fun, but I would look for an alternative solution (just in case).
P.S. Personal insight: I was solving N Queens Problem, for 70 < N < 100 (board NxN, N queens). Algorithm was working fine for lower N (maybe it was trying all combination?), but with N in this range, I couldn't find proper solution. Fitness quickly jumped to about 90% of max, but in the end there were always two queens conflicting. But it was very naive implementation.

Resources