Algorithm for exclusion of numbers - algorithm

You are given a integer N which fits in long(less than 2^63-1) and 50 other integers. Your task is to find how many numbers from 1 to N contain none of the 50 numbers as its substring?
This question is from an interview.

You didn't specify a base, but I'll assume decimal without loss of generality.
First, recognize that this is a string problem, not a numeric one.
Build a finite automaton A to recognize the 50 integers as substrings of other strings. E.g., the two integers 44 and 3 are recognized as substrings by the RE
^.*(44|3).*$
Build a finite automaton B to recognize all numbers less than N. E.g., to recognize 1 through 27 (inclusive) in decimal, that can be achieved by compiling the RE
^([1-9]|1[0-9]|2[0-7])$
Compute the intersection C of the automata A and B, which is in turn an FA.
Use a dynamic programming algorithm to compute the size of the language recognized by C. Subtract that from the size of the language recognized by A, computed by the same algorithm.
(I am not claiming that this is asymptotically optimal, but it was a fast enough to solve lots of Project Euler problems :)

This is only an explanation of what larsmans already wrote. If you like this answer, please vote him up in addition.
A finite automaton, FA, is just a set of states, and rules saying that if you are in state S and the next character you are fed is c then you transition to state T. Two of the states are special. One means, "start here" and the other means "I successfully matched". One of the characters is special, and means "the string just ended". So you take a string and a finite automaton, start in the starting state, keep feeding characters into the machine and changing states. You fail to match if you give any state unexpected input. You succeed in matching if you ever reach the state "I successfully matched".
Now there is a well-known algorithm for converting a regular expression into a finite automaton that matches a string if and only if that regular expression matches. (If you've read about regular expressions, this is how DFA engines work.) To illustrate I'll use the pattern ^.*(44|3).*$ which means "start of the string, any number of characters, followed by either 44 or 3, followed by any number of characters, followed by the end of the string."
First let's label all of the positions we can be in in the regular expression when we're looking for the next character: ^A.*(4B4|3)C.*$
The states of our regular expression engine will be subsets of those positions, and the special state matched. The result of a state transition will be the set of states we could get to if we were at that position, and saw a particular character. Our starting position is at the start of the RE, which is {A}. Here are the states that can be reached:
S1: {A} # start
S2: {A, B}
S3: {A, C}
S4: {A, B, C}
S5: matched
Here are the transition rules:
S1:
3: S3
4: S2
end of string: FAIL
any other char: S1
S2:
3: S3
4: S3
end of string: FAIL
any other char: S1
S3:
4: S4
end of string: S5 (match)
any other char: S3
S4:
end of string: S5 (match)
any other char: S4
Now if you take any string, start that in state S1, and follow the rules, you'll match that pattern. The process can be long and tedious, but luckily can be automated. My guess is that larsmans has automated it for his own use. (Technical note, the expansion from "positions in the RE" to "sets of positions you might possibly be in" can be done either up front, as here, or at run time. For most REs it is better to do it up front, as here. But a tiny fraction of pathological examples will wind up with a very large number of states, and it can be better to do those at run-time.)
We can do this with any regular expression. For instance ^([1-9]|1[0-9]|2[0-7])$ can get labeled: ^A([1-9]|1B[0-9]|2C[0-7])D$ and we get the states:
T1: {A}
T2: {D}
T3: {B, D}
T4: {C, D}
and transitions:
T1:
1: T3
2: T4
3-9: T2
any other char: FAIL
T2:
end of string: MATCH
any other char: FAIL
T3:
0-9: T2
end of string: MATCH
any other char: FAIL
T4:
0-7: T2
end of string: MATCH
any other char: FAIL
OK, so we know what a regular expression is, what a finite automaton, and how they relate. What is the intersection of two finite automata? It is just a finite automaton that matches when both finite automata individually match, and otherwise fails to match. It is easy to construct, its set of states is just the set of pairs of a state in the one, and a state in the other. Its transition rule is to just apply the transition rule for each member independently, if either fails the whole does, if both match they both do.
For the above pair, let's actually execute the intersection on the number 13. We start in state (S1, T1)
state: (S1, T1) next char: 1
state: (S1, T3) next char: 3
state: (S3, T2) next char: end of string
state: (matched, matched) -> matched
And then on the number 14.
state: (S1, T1) next char: 1
state: (S1, T3) next char: 4
state: (S2, T2) next char: end of string
state: (FAIL, matched) -> FAIL
Now we come to the whole point of this. Given that final finite automata, we can use dynamic programming to figure out how many strings there are that match it. Here is that calculation:
0 chars:
(S1, T1): 1
-> (S1, T3): 1 # 1
-> (S1, T4): 1 # 2
-> (S3, T2): 1 # 3
-> (S2, T2): 1 # 4
-> (S1, T2): 5 # 5-9
1 chars:
(S1: T2): 5 # dead end
(S1, T3): 1
-> (S1, T2): 8 # 0-2, 5-9
-> (S2, T2): 1 # 3
-> (S3, T2): 1 # 4
(S1, T4): 1
-> (S1, T2): 6 # 0-2, 5-7
-> (S2, T2): 1 # 3
-> (S3, T2): 1 # 4
(S2, T2): 1 # dead end
(S3, T2): 1
-> match: 1 # end of string
2 chars:
(S1, T2): 14 # dead end
(S2, T2): 2 # dead end
(S3, T2): 2
-> match 2 # end of string
match: 1
-> match 1 # carry through the count
3 chars:
match: 3
OK, that's a lot of work, but we found that there are 3 strings that match both of those rules simultaneously. And we did it in a way that is automatable and scaleable to much larger numbers.
Of course the question we were originally posed was how many matched the second but not the first. Well we know that 27 match the second rule, 3 match both, so 24 must match the second rule but not the first.
As I said before, this is just larsmans solution elucidated. If you learned something, upvote him, vote for his answer. If this material sounds interesting, go buy a book like Progamming Language Pragmatics and learn a lot more about finite automata, parsing, compilation, and the like. It is a very good skillset to have, and far too many programmers don't.

Related

1^3^n for n>=1 turing machine

I want to make a turing machine which accept strings of 1's of length power of 3. 111, 111111111, 111111111111111111111111111, so on. But I am unable to make algorithm for this. So far I am able to make machine which accepts length of multiple of 3. Kindly help me
As with any programming task, your goal is to break the task into smaller problems you can solve, solve those, and put the pieces together to answer the harder problem. There are lots of ways to do this, generally, and you just have to find one that makes sense to you. Then, and only then, should you worry about getting a "better", "faster" program.
What makes a number a power of three?
the number one is a power of three (3^0)
if a number is a power of three, three times that number is a power of three (x = 3^k => 3x = 3^(k+1)).
We can "reverse" the direction of 2 above to give a recursive, rather than an inductive, definition: a number is a power of three if it is divisible by three and the number divided by three is a power of three: (3 | x && x / 3 = 3^k => x = 3^(k+1)).
This suggests a Turing machine that works as follows:
Check to see if the tape has a single one. If so, halt-accept.
Otherwise, divide by three, reset the tape head to the beginning, and start over.
How can we divide by three? Well, we can count a one and then erase two ones after it; provided that we end up erasing twice as many ones as we counted, we will have exactly one third the original number of ones. If we run out of ones to erase, however, we know the number isn't divisible by three, and we can halt-reject.
This is our design. Now is the time for implementation. We will have two phases: the first phase will check for the case of a single one; the other phase will divide by three and reset the tape head. When we divide, we will erase ones by introducing a new tape symbol, B, which we can distinguish from the blank cell #. This will be important so we can remember where our input begins and ends.
Q T | Q' T' D
----------+-----------------
// phase one: check for 3^0
----------+-----------------
q0 # | h_r # S // empty tape, not a power of three
q0 1 | q1 1 R // see the required one
q0 B | h_r B S // unreachable configuration; invalid tape
q1 # | h_a # L // saw a single one; 1 = 3^0, accept
q1 1 | q2 1 L // saw another one; must divide by 3
q1 B | q1 B R // ignore previously erased ones
----------+-----------------
// prepare for phase two
----------+-----------------
q2 # | q3 # R // reached beginning of tape
q2 1 | q2 1 L // ignore tape and move left
q2 B | q2 B L // ignore tape and move left
----------------------------
// phase two: divide by 3
----------------------------
q3 # | q6 # L // dividing completed successfully
q3 1 | q4 1 R // see the 1st one
q3 B | q3 B R // ignore previously erased ones
q4 # | h_r # S // dividing fails; missing 2nd one
q4 1 | q5 B R // see the 2nd one
q4 B | q4 B R // ignore previously erased ones
q5 # | h_r # S // dividing fails; missing 3rd one
q5 1 | q3 B R // see the 3rd one
q5 B | q5 B R // ignore previously erased one
----------+-----------------
// prepare for phase one
----------+-----------------
q6 # | q0 # R // reached beginning of tape
q6 1 | q6 1 L // ignore tape and move left
q6 B | q6 B L // ignore tape and move left
There might be some bugs in that but I think the idea should be basically sound.
Check the length of the input (111, 111111111,...) say strLen.
Check that the result of log(strLen) to base 3 is equal to an interger.
i.e,in code:
bool is_integer(float k)
{
return std::floor(k) == k;
}
if(is_integer(log(strLen)/log(3))){
//Then accept the string as it's length is a power of 3.
}

Odd DFA for Language

Can somebody please help me draw a NFA that accepts this language:
{ w | the length of w is 6k + 1 for some k ≥ 0 }
I have been stuck on this problem for several hours now. I do not understand where the k comes into play and how it is used in the diagram...
{ w | the length of w is 6k + 1 for some k ≥ 0 }
We can use the Myhill-Nerode theorem to constructively produce a provably minimal DFA for this language. This is a useful exercise. First, a definition:
Two strings w and x are indistinguishable with respect to a language L iff: (1) for every string y such that wy is in L, xy is in L; (2) for every string z such that xz is in L, wz is in L.
The insight in Myhill-Nerode is that if two strings are indistinguishable w.r.t. a regular language, then a minimal DFA for that language will see to it that the machine ends up in the same state for either string. Indistinguishability is reflexive, symmetric and transitive so we can define equivalence classes on it. Those equivalence classes correspond directly to the set of states in the minimal DFA. Now, to find the equivalence classes for our language. We consider strings of increasing length and see for each one whether it's indistinguishable from any of the strings before it:
e, the empty string, has no strings before it. We need a state q0 to correspond to the equivalence class this string belongs to. The set of strings that can come after e to reach a string in L is L itself; also written c(c^6)*
c, any string of length one, has only e before it. These are not, however, indistinguishable; we can add e to c to get ce = c, a string in L, but we cannot add e to e to get a string in L, since e is not in L. We therefore need a new state q1 for the equivalence class to which c belongs. The set of strings that can come after c to reach a string in L is (c^6)*.
It turns out we need a new state q2 here; the set of strings that take cc to a string in L is ccccc(c^6)*. Show this.
It turns out we need a new state q3 here; the set of strings that take ccc to a string in L is cccc(c^6)*. Show this.
It turns out we need a new state q4 here; the set of strings that take cccc to a string in L is ccc(c^6)*. Show this.
It turns out we need a new state q5 here; the set of strings that take ccccc to a string in L is cc(c^6)*. Show this.
Consider the string cccccc. What strings take us to a string in L? Well, c does. So does c followed by any string of length 6. Interestingly, this is the same as L itself. And we already have an equivalence class for that: e could also be followed by any string in L to get a string in L. cccccc and e are indistinguishable. What's more: since all strings of length 6 are indistinguishable from shorter strings, we no longer need to keep checking longer strings. Our DFA is guaranteed to have one the states q0 - q5 we have already identified. What's more, the work we've done above defines the transitions we need in our DFA, the initial state and the accepting states as well:
The DFA will have a transition on symbol c from state q to state q' if x is a string in the equivalence class corresponding to q and xc is a string in the equivalence class corresponding to q';
The initial state will be the state corresponding to the equivalence class to which e, the empty string, belongs;
A state q is accepting if any string (hence all strings) belonging to the equivalence class corresponding to the language is in the language; alternatively, if the set of strings that take strings in the equivalence class to a string in L includes e, the empty string.
We may use the notes above to write the DFA in tabular form:
q x q'
-- -- --
q0 c q1 // e + c = c
q1 c q2 // c + c = cc
q2 c q3 // cc + c = ccc
q3 c q4 // ccc + c = cccc
q4 c q5 // cccc + c = ccccc
q5 c q0 // ccccc + c = cccccc ~ e
We have q0 as the initial state and the only accepting state is q1.
Here's a NFA which goes 6 states forward then if there is one more character it stops on the final state. Otherwise it loops back non-deterministcally to the start and past the final state.
(Start) S1 -> S2 -> S3 -> S5 -> S6 -> S7 (Final State) -> S8 - (loop forever)
^ |
^ v |_|
|________________________| (non deterministically)

Regular expression to NFA [A-E]|[A-E] [A-E] [A-E]|[A-E][A-E][A-E] [A-E]

I have this regular expression
[A-E]|[A-E]{3}|[A-E]{4}
[A-E]|[A-E] [A-E] [A-E]|[A-E][A-E][A-E] [A-E]
it recognizes strings of A,B, ABC, BCD, BCDE, etc.
I want to construct the NFA but have no idea if i am correct
I have done this
or this
Which one is correct ?
my [A-E] NFA is
The minimal DFA is the following
0 -> 1 -> 2 -> 3 -> 4
with every transition arch signed by [A-E] and with final states = {1,3,4}
In fact this DFA is equivalent with both your NFA.
Nevertheless I find the second to be clearer.

PLC check byte unknown

I'm trying to understand how a old machine (PLC) generates a check byte in its data exchange, but i can't figure what and how is done or what kind of algorithm is using.
I have a very sparse documentation about the machine and i already try some algorithms like normal crc, ccitt crc, xmodem crc type... and no one is given the right result.
The message is formed like this:
M*NNNNNNwwSSdd
where:
M* - is fixed
NNNNNN - N is a number or a space
ww - w is a number too or a space
SS - S is a char or a space
dd - d a number or a space
Some of the examples generate the following byte check (where de byte '×' is realy the space char ' ', i use this char only to be easier to identify the number of spaces):
a:
M*614976××××12 -> a
M*615138×××××× -> a
b:
M*615028××××12 -> b
M*615108×××××× -> b
c:
M*614933×××××× -> c
M*614956××××12 -> c
d:
M*614934×××××× -> d
M*614951××××12 -> d
e:
M*614942×××××× -> e
M*615079×××××× -> e
f:
M*614719××××12 -> f
M*614936×××××× -> f
g:
M*614718××××12 -> g
M*614937×××××× -> g
h:
M*614727×××××× -> h
M*614980××××12 -> h
i:
M*614734××××12 -> i
M*614939×××××× -> i
M*×××××××××××× -> i
z:
M*××××××××SC12 -> z
j:
M*××××××××××12 -> j
y:
M*××××××××SC×× -> y
There are more combinations but these ones are enough.
Another particularity is that the check byte result exists only in a defined range - from char 0x60 to 0x7F and no more (the current solution is working because i loop in this range until the machine gives me an ok)
So my question is, do you know how this check byte is calculated? can you point me some simpler algorithms to calculate the integrity of data in PLC machines, it must be simpler that the result byte check is only one char.
Thanks
It seems to me that if I xor together all the characters in the message, treating them as ascii and replacing your odd quasi-x with space, and then xor in 0xe, I get the character in the checksum. At the very least I suggest that you construct a table showing the xor of all the characters in the message, and the checksum character, written out as hex. Something like this is quite plausible considering the block check described in www.bttautomatyka.com.pl/pdf/HA466357.pdf
(I had actually written a mod-2 equation solver and was going to look for a 5-bit CRC, when this popped out!)

stable matching

I'm reading a algorithm book in my lesure time. Here is a question I have my own answer but not quite sure. What's your opinion? Thanks!
Question:
There are 2 television network company, let as assume A and B, each are planning the TV progame schedule in n time slots of a day. Each of them are putting their n programs in those slots. While each program have a rate based on the popularity in the past year, and these rate are distinct to each other. The company win a slot when its show have a higher rate then its opponent's. Is there a schedule matching that A made a schedule S and B made a schedule T, and (S, T) is stable that neither network can unilaterally change its own schedule and win more time slots.
There is no stable matching, unless one station has all its programs contiguous in the ratings (ie. the other station has no program which is rated better than one program on the first station but worse than another on the first station).
Proof
Suppose a station can improve its score, and the result is a stable matching. But then the other station could improve its own score by reversing the rearrangement. So it wasn't a stable matching. Contradiction.
Therefore a stable matching can not be reached by a station improving its score.
Therefore a stable matching can't be made worse (for either station), because then the lower state could be improved to a stable matching (which I just showed was not allowed).
Therefore every program rearrangement of a stable matching must give equal scores to both stations.
The only program sets which can't have scores changed by rearrangement are the ones where one of the stations' programs are contiguous in the ratings. (Proof left to reader)
Solution in Haskell:
hasStable :: Ord a => [a] -> [a] -> Bool
hasStable x y =
score x y + score y x == 0
-- score is number of slots we win minus number of slots they win
-- in our best possible response schedule
score :: Ord a => [a] -> [a] -> Integer
score others mine =
scoreSorted (revSort others) (revSort mine)
where
-- revSort is sort from biggest to smallest
revSort = reverse . sort
scoreSorted :: Ord a => [a] -> [a] -> Integer
scoreSorted (o:os) (m:ms)
| m > o =
-- our best show is better then theirs
-- we use it to beat theirs and move on
1 + score os ms
| otherwise =
-- their best show is better then ours
-- we match it with our worst show
-- so we could make use of our big guns
-1 + score os (m : ms)
scoreSorted _ _ = 0 -- there are no shows left
> hasStable [5,3] [4,2]
False
> hasStable [5,2] [3,4]
True
My own answer is no stable matching. Supposing there are only 2 time slots.
A have program p1(5.0) p2(3.0);
B have program p3(4.0) p4(2.0);
The schedule of A includes:
S1: p1, p2
S2: p2, p1
The schedule of B includes:
T1: p3, p4
T2: p4, p3
So the matching include:
(S1, T1)(S1, T2)(S2, T1)(S2, T2)
while the results are
(S1, T1) - (p1, p3)(p2, p4) 2:0 - not stable, becuase B can change its schedule to T2 and the result is : (S1, T2) - (p1, p4)(p2, p3) 1:0
Vise versa and so does the other matching.
Let each TV channel having 2 shows.
TV-1:
Show1 has a rating of 20 pts.
show2 has a rating of 40 pts.
TV-2:
Show1 has a rating of 30 pts.
Show2 has a rating of 50 pts.
Then it clearly shows the matching is unstable.

Resources