I have a list of words from which I have to select n words such that the number of different/unique letters across them is kept to a minimum. I have a feeling there's already a well-known algorithm for this, but I'm not able to find it. Could you point me to an algorithm that can be used to solve this?
An example below to illustrate what I mean by unique letters
Say I have the list of words HELL, HELP and FAIL, and I have to select 2 words from them.
If I select HELL and HELP, the number of unique letters among them = 4
If I select HELL and FAIL, the number of unique letters among them = 6
If I select HELP and FAIL, the number of unique letters among them = 7
The algorithm should select HELL and HELP.
For my use-case, I expect there to be lists of about 15 words from which about 9 words would have to be selected.
An optimal solution can be found using a MIP (Mixed-Integer Programming) model (or a similar type of model).
Let i be the set of letters and w be the set of words. Furthermore, define binary variables:
x(w) = 1 if word w is selected, 0 otherwise
y(i) = 1 if letter i is selected, 0 otherwise
Then we can write:
min sum(i, y(i))
subject to
sum(w, x(w)) = K
y(i) >= x(w) for each letter i in word w
x, y in {0,1}
When I solve this I see the following.
The data is organized as:
---- 8 SET i letters
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T
U, V, W, X, Y, Z
---- 8 SET w words
HELL, HELP, FAIL
---- 8 SET map
A E F H I L P
HELL YES YES YES
HELP YES YES YES YES
FAIL YES YES YES YES
---- 8 PARAMETER k = 2.000 number of words to select
And the solution looks like:
---- 29 VARIABLE x.L select word
HELL 1.000, HELP 1.000
---- 29 VARIABLE y.L selected letters
E 1.000, H 1.000, L 1.000, P 1.000
---- 29 VARIABLE z.L = 4.000 objective
MIP solvers are readily available.
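For the sizes mentioned in the question (about 9 words out of about 15, i.e. at most a few thousand combinations), plain brute force also works. A minimal sketch in Python (an exhaustive check, not the MIP model above):

```python
from itertools import combinations

def min_unique_letters(words, k):
    """Pick k words minimizing the number of distinct letters used across them."""
    best = None
    for combo in combinations(words, k):
        letters = set("".join(combo))
        if best is None or len(letters) < len(best[1]):
            best = (combo, letters)
    return best

combo, letters = min_unique_letters(["HELL", "HELP", "FAIL"], 2)
print(combo, sorted(letters))  # ('HELL', 'HELP') ['E', 'H', 'L', 'P']
```

This reproduces the example from the question: HELL and HELP are chosen, with 4 unique letters.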
I have made a predicate schedule(A,B,C) that returns the possible permutations as lists A, B, C through backtracking:
| ?- schedule(A,B,C).
A = [im204,im212,im217]
B = [im209,im214,im218]
C = [im210,im216] ? ;
A = [im204,im212,im218]
B = [im209,im214,im217]
C = [im210,im216] ? ;
A = [im204,im212,im216]
B = [im209,im214,im218]
C = [im210,im217] ?
I also have a predicate score_schedule(A,B,C,S) which computes a score S (don't mind what the score means) from the lists A, B, C:
| ?- score_schedule([im204,im209,im212],[im210,im214,im216],[im217,im218],S).
S = 578
In my new predicate
all_schedule_scores(A,B,C,S):-
    schedule(A,B,C),
    score_schedule(A,B,C,S).
it returns the possible permutations along with their scores:
| ?- all_schedule_scores(A,B,C,S).
A = [im204,im212,im217]
B = [im209,im214,im218]
C = [im210,im216]
S = 342 ? ;
A = [im204,im212,im218]
B = [im209,im214,im217]
C = [im210,im216]
S = 371 ? ;
A = [im204,im212,im216]
B = [im209,im214,im218]
C = [im210,im217]
S = 294 ?
I was wondering if there is a way to return only the permutations with the maximum score (or to not return any permutation whose score isn't the maximum).
It's not clear what Prolog implementation you're using. Here's a solution that uses setof/3 (which orders its results low to high):
max_scored(MaxA, MaxB, MaxC, MaxS) :-
    setof((S,A,B,C), all_schedule_scores(A,B,C,S), AllScoresLowToHigh),
    reverse(AllScoresLowToHigh, [(MaxS,MaxA,MaxB,MaxC)|_]).
Sorting uses a natural ordering, so (S1,A1,B1,C1) is considered greater than (S2,A2,B2,C2) if S1 is greater than S2.
This solution finds just a single maximum result. If multiple results share the maximum score, I'll leave that as an exercise for you: you would just need to take elements from the front of the second argument to reverse/2 for as long as they have that same score.
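That tie-handling step is easier to see outside Prolog; here is a sketch in Python, using the (score, A, B, C) tuples from the sample run in the question:

```python
def keep_max_scored(solutions):
    """Keep every solution tuple whose score equals the maximum score.
    Each solution is (score, a, b, c), like the query results above."""
    best = max(score for score, *_ in solutions)
    return [sol for sol in solutions if sol[0] == best]

solutions = [
    (342, ["im204", "im212", "im217"], ["im209", "im214", "im218"], ["im210", "im216"]),
    (371, ["im204", "im212", "im218"], ["im209", "im214", "im217"], ["im210", "im216"]),
    (294, ["im204", "im212", "im216"], ["im209", "im214", "im218"], ["im210", "im217"]),
]
print(keep_max_scored(solutions))  # only the 371 solution survives
```

If two solutions tie for the maximum, both are returned, which is exactly the behavior the exercise above asks for.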
If your Prolog supports it, you could use library(aggregate):
max_scored_schedule(ScA,ScB,ScC,Score) :-
    aggregate(max(S,[A,B,C]), (schedule(A,B,C),score_schedule(A,B,C,S)), max(Score,[ScA,ScB,ScC])).
Tested with the hardcoded data you provided:
?- max_scored_schedule(A,B,C,S).
A = [im204, im209, im212],
B = [im210, im214, im216],
C = [im217, im218],
S = 578.
As you can see, it's just a matter of ordering the arguments properly...
Edit:
library(solution_sequences) allows for an SQL-like query that should solve your problem:
?- order_by([desc(S)], group_by(S, (A,B,C), (schedule(A,B,C),score_schedule(A,B,C,S)), G)).
but the straightforward answer by @lurker is neat (+1), and doesn't require you to port another library to your Prolog.
I'm working with Prolog and I need to handle huge numerical values (I know, Prolog was not originally designed to handle numbers). I'm using ECLiPSe 6.1, and the documentation of some built-in predicates, such as fd_global:ordered_sum/2, says:
Any input variables which do not already have finite bounds will be given default bounds of -10000000 to 10000000
How can I handle values greater than 10000000? (In general, not necessarily with ECLiPSe.)
If you use library(ic), then generally variables get infinite bounds by default, when used in the basic constraints:
?- lib(ic).
Yes (0.13s cpu)
?- sum([X,Y,Z]) #= 0.
X = X{-1.0Inf .. 1.0Inf}
Y = Y{-1.0Inf .. 1.0Inf}
Z = Z{-1.0Inf .. 1.0Inf}
There is 1 delayed goal.
Yes (0.00s cpu)
However, the algorithms in some of the global constraint implementations cannot handle infinite bounds, and therefore impose the default bounds you mention:
?- ic_global:ordered_sum([X,Y,Z], 0).
X = X{-10000000 .. 0}
Y = Y{-5000000 .. 5000000}
Z = Z{0 .. 10000000}
There are 5 delayed goals.
Yes (0.06s cpu)
To avoid this, you can initialize the variables with larger finite bounds before invoking the global constraint:
?- [X,Y,Z] :: -1000000000000000..1000000000000000, ic_global:ordered_sum([X,Y,Z], 0).
X = X{-1000000000000000 .. 0}
Y = Y{-500000000000000 .. 500000000000000}
Z = Z{0 .. 1000000000000000}
There are 5 delayed goals.
Yes (0.00s cpu)
I am given 2 DFAs. * denotes final states and -> denotes the initial state, defined over the alphabet {a, b}.
1) ->A with a goes to A. -> A with b goes to *B. *B with a goes to *B. *B with b goes to ->A.
The regular expression for this is clearly:
E = a* b(a* + (a* ba* ba*)*)
And the language that it accepts is L1 = {w over {a,b} | w is b preceded by any number of a's followed by any number of a's, or w is b preceded by any number of a's followed by any number of bb's with any number of a's in the middle (middle of bb), end, or beginning.}
2) ->* A with b goes to ->* A. ->*A with a goes to *B. B with b goes to -> A. *B with a goes to C. C with a goes to C. C with b goes to C.
Note: A is both final and initial state. B is final state.
Now the regular expression that I get for this is:
E = b* ((ab) * + a(b b* a)*)
Finally the language that this DFA accepts is:
L2 = {w over {a, b} | w is n 1's followed by either k 01's, or a followed by m 11^r 0's, where n, k, m, r >= 0}
Now the question is: is there a cleaner way to represent the languages L1 and L2? Because this does seem ugly. Thanks in advance.
E = a* b(a* + (a* ba* ba*)*)
= a*ba* + a*b(a* ba* ba*)*
= a*ba* + a*b(a*ba*ba*)*a*
= a*b(a*ba*ba*)*a*
= a*b(a*ba*b)*a*
This is the language of all strings of a and b containing an odd number of bs. This might be most compactly denoted symbolically as {w in {a,b}* | #b(w) = 1 (mod 2)}.
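As a sanity check, this characterization can be verified mechanically: simulate the first DFA from the question and compare its verdict against the odd-number-of-b's condition over all short strings (a Python sketch):

```python
from itertools import product

def dfa1_accepts(w):
    """First DFA from the question: start in A, accept in B;
    'b' toggles between A and B, 'a' stays put."""
    state = 'A'
    for ch in w:
        if ch == 'b':
            state = 'B' if state == 'A' else 'A'
    return state == 'B'

# L1 should be exactly the strings with an odd number of b's
assert all(dfa1_accepts(''.join(t)) == (t.count('b') % 2 == 1)
           for n in range(9) for t in product('ab', repeat=n))
```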
For the second one: the only way to get to state B is to see an a in A, and the only way to get to C from outside C is to see an a in B. C is a dead state, and the only way to reach it is to see aa starting in A. That is: if you ever see two a's in a row, the string is not in the language; the language is the set of all strings over a and b not containing the substring aa. This might be most compactly denoted symbolically as {(a+b)*aa(a+b)*}^c, where ^c means "complement".
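The same kind of mechanical check works here: simulate the second DFA and compare against the no-aa-substring condition (a Python sketch):

```python
from itertools import product

def dfa2_accepts(w):
    """Second DFA from the question: A (initial, accepting), B (accepting),
    C (dead). a: A->B, B->C; b: A->A, B->A; C loops on everything."""
    state = 'A'
    for ch in w:
        if state == 'A':
            state = 'B' if ch == 'a' else 'A'
        elif state == 'B':
            state = 'C' if ch == 'a' else 'A'
        # state 'C' is absorbing
    return state in ('A', 'B')

# L2 should be exactly the strings not containing the substring "aa"
assert all(dfa2_accepts(''.join(t)) == ('aa' not in ''.join(t))
           for n in range(9) for t in product('ab', repeat=n))
```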
Say I have 5 collections that contain a bunch of strings (hundreds of lines).
Now I want to extract the minimum number of lines from each of these collections to uniquely identify that one collection.
So if I have
Collection 1:
A
B
C
Collection 2:
B
B
C
Collection 3:
C
C
C
Then collection 1 would be identified by A.
Collection 2 would be identified by BC or BB.
Collection 3 would be identified by CC.
Is there an algorithm already out there that does this kind of thing? What is it called?
Thanks,
Wesley
If the order is not important, I would sort all lists (collections).
Then you could check whether all 5 start with the same element, and group them by their first element.
To start, single characters instead of strings/lines:
T A L U D
N I O S A D
R A B E
T A U C
D A N E B
Sorted internally:
A D U L T
A D O N I S
A B E R
A C U T
A B E N D
Sorted:
A B E N D
A B E R
A C U T
A D U L T
A D O N I S
Grouped (2):
(A B) E N D
(A B) E R
(A C) U T # identified by 2 elements
(A D) U L T
(A D) O N I S
Rest grouped by 3 elements:
(A C) U T # identified by 2 elements
(A B E) N D
(A B E) R
(A D U) L T # only ADU...
(A D O) N I S # only ADO...
Rest grouped by 4 elements:
(A C) U T # AC..
(A D U) L T # ADU...
(A D O) N I S # ADO...
(A B E N) D
(A B E R)
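The grouping above amounts to finding, for each word, the shortest prefix it shares with no other word. A small Python sketch of that step, using the example words above:

```python
def shortest_unique_prefixes(words):
    """For each word, find the shortest prefix shared with no other word
    (mirrors the parenthesized groups in the example above)."""
    result = {}
    for w in words:
        others = [o for o in words if o != w]
        for k in range(1, len(w) + 1):
            if not any(o.startswith(w[:k]) for o in others):
                result[w] = w[:k]
                break
        else:
            result[w] = w  # w is itself a prefix of another word
    return result

print(shortest_unique_prefixes(["ABEND", "ABER", "ACUT", "ADULT", "ADONIS"]))
# {'ABEND': 'ABEN', 'ABER': 'ABER', 'ACUT': 'AC', 'ADULT': 'ADU', 'ADONIS': 'ADO'}
```

This matches the groups derived by hand: AC after 2 elements, ADU/ADO after 3, ABEN/ABER after 4.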
This is an easy problem to solve. You have one multiset (collection 1) (it is a "multiset" because the same element can occur multiple times), and then a number of other multisets (collections 2..N), and you want to find a minimum-size sub-multiset of collection 1 that does not occur in any of the other collections (2..N).
It is an easy problem to solve because it can be handled with simple set theory. I'll explain this first without multisets, i.e. assuming that every line can occur only once in any given set, and then explain how it works with multisets.
Let's call your collection 1 set S and the other collections sets X1 .. XN. Now, keeping in mind that for now the sets do not have multiple instances of any item, it is obvious that any singleton set { a } with a in S and a ∉ Xi distinguishes S from Xi. So it is enough to calculate the set differences S - X1, ..., S - XN and then pick a minimum-size set R such that R shares an element with every one of these difference sets. This is the SET COVER combinatorial optimization problem, which is NP-complete but which, for your small problem (5 collections), can be handled easily by brute force.
Now, when the sets are actually multisets, this only changes in that the distinguishing "singleton" sets are actually multisets containing 1 or more copies of the same element, and thus they have different costs. You can still calculate the set differences as above (you subtract element counts), but now your SET COVER combinatorial optimization part has to take into account the fact that the distinguishing elements can be multisets and not singletons. Here's an illustration of how it works for your problem when we solve for collection 3:
S = {{ c, c, c }}
X1 = {{ a, b, c }}
X2 = {{ b, b, c }}
S - X1 distinguishers: {{ c, c }}
S - X2 distinguishers: {{ c, c }}
Minimum multiset covering a distinguisher for every set: {{ c, c }}
And here how it works for calculating for collection 1:
S = {{ a, b, c }}
X1 = {{ b, b, c }}
X2 = {{ c, c, c }}
S - X1 distinguishers: {{ a }}
S - X2 distinguishers: {{ a }}, {{ b }}
Minimum multiset covering a distinguisher for every set: {{ a }}
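The whole procedure fits in a few lines of Python. This sketch brute-forces sub-multisets smallest-first, which is fine at the question's scale (element names taken from the example above):

```python
from collections import Counter
from itertools import combinations

def min_identifier(target, others):
    """Smallest sub-multiset of `target` that does not fit inside any of the
    other collections. Brute force, smallest sub-multisets first."""
    items = list(target)
    for size in range(1, len(items) + 1):
        seen = set()
        for combo in combinations(items, size):
            key = tuple(sorted(combo))
            if key in seen:  # skip duplicate sub-multisets
                continue
            seen.add(key)
            sub = Counter(combo)
            # sub distinguishes target iff it exceeds the element counts
            # of every other collection in at least one element
            if all(any(sub[e] > Counter(o)[e] for e in sub) for o in others):
                return key
    return None

print(min_identifier(['a', 'b', 'c'], [['b', 'b', 'c'], ['c', 'c', 'c']]))  # ('a',)
print(min_identifier(['c', 'c', 'c'], [['a', 'b', 'c'], ['b', 'b', 'c']]))  # ('c', 'c')
```

This reproduces the worked examples: collection 1 is identified by a alone, and collection 3 by cc.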