I've run into a peculiar problem which I don't seem to be able to wrap my head around. I'll get right into it.
The problem is matching a set of cards to a set of rules.
It is possible to define a set of rules as a string, composed of comma-separated tuples of <suit>:<value>. For example, H:4,S:1 should match a Four of Hearts and an Ace of Spades. It is also possible to wildcard: *:* matches any card, D:* matches any card in the diamond suit, and *:2 matches a Two in any suit. Rules can be combined with commas: *:*,*:*,H:4 would match a set of cards if it held two arbitrary cards and a Four of Hearts.
So far so good. A parser for this is easy and straightforward to write. Here comes the tricky part.
To make it easy to compose these rules, two more constructions can be used for suit and value. These are < (legal for suit and value) and +n (legal only for value), where n is a number. < means "the same as the previous match" and +n means "n higher than the previous match". An example:
*:*, <:*, *:<
Means: match any card, then match a card with the same suit as the first match, next match another card with the same value as the second match. This hand would match:
H:4,H:8,C:8
Because the Four of Hearts and the Eight of Hearts are the same suit, while the Eight of Hearts and the Eight of Clubs are the same value.
It is allowed to have more cards as long as all rules match (so, adding C:10 to the above hand would still match the rule).
My first approach at solving this was basically to take the set of cards to be matched and attempt to apply the first rule to it. If it matched, I moved on to the next rule and attempted to match it against the set of cards, and so on, until either all rules were matched or I found a rule that didn't match. This approach has (at least) one flaw. Consider the example above, *:*,<:*,*:<, but with the cards in this order: H:8,C:8,H:4.
It would match H:8 for the first rule. Matched: H:8
Next it attempts to find one with the same suit (Hearts). There is a Four of Hearts. Matched: H:8, H:4
Moving on, it wants to find a card with the same value (Four), and fails.
I don't want the order of the set of cards to have any impact on the result, as it does in the above example. I could sort the set of cards if I could think of any great strategy that worked well with any set of rules.
I have no knowledge of the quantity of cards or the number of rules, so a brute-force approach is not feasible.
Thank you for reading this far, I am grateful for any tip or insight.
Your problem is actually an ordering problem. Here's a simplified version of it:
given an input sequence of numbers and a pattern, reorder them so that they fit the pattern. The pattern can contain "*", meaning "any number", and ">", meaning "bigger than the previous number".
For example, given the pattern [* * > >] and the sequence [10 10 2 1], such an ordering exists and it is [10 1 2 10]. Some inputs might give no orderings, others exactly one, while yet others give many (think of the input [10 10 2 1] and the pattern [* * * *]).
I'd say that once you have the solution for this simplified problem, switching to your problem is just a matter of adding another dimension and some operators. Sorry for not being of more help :/ .
Later edit: keep in mind that if the allowed suit symbols are finite (e.g. 4) and so are the allowed values (e.g. 9), things might get easier.
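A minimal backtracking sketch for this simplified problem, in Python (the function name and structure are mine; it returns one valid ordering or None, and is exponential in the worst case):

```python
def order_to_fit(pattern, numbers):
    """Try to order `numbers` so they satisfy `pattern`, where '*'
    matches any number and '>' means strictly bigger than the
    previously placed number."""
    def backtrack(slot, remaining, placed):
        if slot == len(pattern):
            return placed
        prev = placed[-1] if placed else None
        tried = set()
        for i, n in enumerate(remaining):
            if n in tried:  # identical candidates lead to identical subtrees
                continue
            tried.add(n)
            if pattern[slot] == '>' and not (prev is not None and n > prev):
                continue
            result = backtrack(slot + 1, remaining[:i] + remaining[i + 1:],
                               placed + [n])
            if result is not None:
                return result
        return None

    return backtrack(0, list(numbers), [])

print(order_to_fit(['*', '*', '>', '>'], [10, 10, 2, 1]))  # [10, 1, 2, 10]
```

Adding the card problem's suit dimension and the < and +n operators then amounts to extending the per-slot check against the previously placed card.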
I have a system that is confined to two alphanumeric characters. Some simple math shows that we get 1,296 combinations if we use all possible permutations of 0-9 and a-z. Lowercase letters cannot be distinguished from uppercase, and special characters (including a blank character) cannot be used.
Is there any creative mapping, perhaps to an external reference, to create a way to take this two character field significantly beyond 1,296 combinations?
Examples of identifiers would be `00`, `OO`, `AZ`, `Z4`, etc.
Thanks!
I'm afraid not, no more than you could get a 3-bit number to represent more than 8 different values. If you're interested in the details, you can look up information theory or Kolmogorov complexity. Essentially, with only 1,296 combinations you can only label 1,296 possible pieces of information.
As an example, consider if you had 1,297 things. All of those two-letter combinations would take up the first 1,296, so what combination would be associated with the next one? It would have to be a repeat of something you had used earlier.
Shor also has some good material on this, and the implications of that sort of thing form the basis for a lot of file compression systems.
You could maybe squeeze out one more combination if you cheat and allow a 'null' value to represent a different possibility, but that's not totally relevant to the idea of the question.
If you are restricted to two characters taken from an alphabet of 36, then you are limited to 36² distinct symbols, that's it.
More context is required to find workarounds, like stealing bits elsewhere, using symbols in pairs, breaking the case limitation, or exploiting the history of transactions...
The precise meaning of "a system that is confined to two alphanumeric characters" needs to be known to be able to suggest a workaround. Is that a space constraint? Do you need the restriction to 2 chars for efficiency? Does it need to work with other code that accepts or generates 2 char indexes?
If you have up to 1,295 identifiers that are used often, and some others that occur only occasionally, you could choose an identifier, e.g. "ZZ", to indicate that another identifier is following. So "00" through to "ZY" would be 1,295 simple 2-char identifiers, and "ZZ00" through to "ZZZZ" would be a further 1,296 combined 4-char identifiers. (Or "ZZ0000" through to "ZZZZZZ" for a further 1296*1296 identifiers ...)
This could work for space constraints. For efficiency, it depends on whether the additional check to see if the identifier is "ZZ" is too expensive or not.
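A minimal sketch of this escape-prefix scheme, assuming a 36-character alphabet and ZZ as the escape (the helper names and exact numbering are mine):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    """Map a non-negative integer to an identifier: 0..1294 get the
    plain 2-char forms "00".."ZY"; the next 1,296 get "ZZ" + 2 chars."""
    if n < 1295:
        return ALPHABET[n // 36] + ALPHABET[n % 36]
    n -= 1295
    if n < 1296:
        return "ZZ" + ALPHABET[n // 36] + ALPHABET[n % 36]
    raise ValueError("out of range for the 2- and 4-char forms")

def decode(s):
    if s[:2] == "ZZ":
        return 1295 + ALPHABET.index(s[2]) * 36 + ALPHABET.index(s[3])
    return ALPHABET.index(s[0]) * 36 + ALPHABET.index(s[1])

print(encode(0), encode(1294), encode(1295), encode(2590))
# -> 00 ZY ZZ00 ZZZZ
```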
I have to find a phone number in a string, with these conditions:
Starts with 0
Has 10 or 11 digits (0-9)
Has at most two "-" characters (not at the start or end)
Examples: 01234567890, 01-234567890, 03-1234-12345.
This is my regex, but it does not work:
/\d+{10,11}|(\d+\-\d+){11,12}|(\d+\-\d+\-\d+){12,13}/
It is a bit tricky. First, your regexp kind of has the right idea: given that the length changes with the number of dashes, we need to check each case separately. (There might be a better way, but I can't think of one.) However, (\d+-\d+){11,12} does not mean "length of 11-12", but "11-12 repetitions of \d+-\d+", giving you way more than 11-12 characters. Even if it were correct, because of the order of the disjunction, you would not be able to match 0123456789-1, because 10 digits would be found first, and ten digits followed by a dash and another digit would not even be checked.
If you were trying to validate the whole string, it would have been easier, as you could use the anchors ^ and $ to mark the ends. Without them, it is a little trickier:
(?=[\d-]{13,14}(?![\d-]))0\d+-\d+-\d+(?![\d-])|(?=[\d-]{12,13}(?!-|\d))0\d+-\d+(?![\d-])|\d{10,11}
The first part, (?=[\d-]{13,14}(?![\d-]))0\d+-\d+-\d+(?![\d-]), checks for the two-dash pattern. (?=[\d-]{13,14}(?![\d-])) checks whether you have 13-14 digit-or-dash characters after which you don't have a digit nor a dash. After making sure there is such a region, we make sure there are exactly two dashes in between digits (and making sure the whole thing is, again, not followed by a digit-or-dash - this anchor synchronises the condition in our lookahead and in the main pattern).
The second part, (?=[\d-]{12,13}(?!-|\d))0\d+-\d+(?![\d-]), is analogous, checking for one-dash matches. The third part, \d{10,11}, is trivially simple, and finds no-dash matches.
All of this is under the assumption that sawa's needling is on point: that 0123456789- is not a match. If it is, you will need to change some pluses into stars.
Rubular
EDIT: The Rubular pattern still has the wrong \d{11,12} for the dashless case; I can't be bothered to generate another Rubular :P
EDIT2: Thought of a better way.
(?=(?:\d-?){10,11}(?![\d-]))\d+(-\d+){0,2}(?![\d-])
Make sure there are 10-11 digits, and make sure there are 0-2 dashes. The anchor idea is the same as in the previous one.
Rubular.
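Rubular tests Ruby regexes, but the constructs used here (lookaheads, negative lookaheads) behave the same in Python's re module, so a quick sanity check could look like this (note that the pattern, as given, does not itself enforce the leading 0):

```python
import re

# The EDIT2 pattern from above, verbatim.
pattern = re.compile(r'(?=(?:\d-?){10,11}(?![\d-]))\d+(-\d+){0,2}(?![\d-])')

for s in ["01234567890", "01-234567890", "03-1234-12345", "0123456789-"]:
    m = pattern.search(s)
    print(s, "->", m.group(0) if m else "no match")
# The three examples match in full; the trailing-dash case does not.
```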
I'm writing a program that performs word declension for the Polish language. In this language, stems can vary in some cases (because of palatalization, mobile/fleeting e, and other effects).
For example, we have the word "karzeł", which is the basic dictionary form. Its stem is also 'karzeł'. But the genitive form of this word is "karła", and its stem is "karł". We can see here that the 'e' disappeared and 'rz' changed to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store only the basic form of the stem in the dictionary ('karzeł' and 'uzd'), and when my program encounters the stem 'karł' or 'uździ', it should find the proper basic stem. Alternations take place only at the end of the stem and affect at most 4 of its letters.
Are there any algorithms that could do that? Levenshtein distance treats all letters equally, so if I type the word 'barzeł', the distance to the stem 'karzeł' will be less than to the stem 'karł'.
I also thought about neural networks, but I'm not sure how to encode words (give each stem variation a different id?).
Another idea is to write an algorithm which applies something like reversed alternations, creating a set of possible stems, and tries to find them in the dictionary (sketched below).
I would like to highlight that I only want to store the basic form of the stem and compute everything else on the fly.
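For that last idea, a minimal sketch, assuming a hand-maintained table of reversed alternations (the two entries here are just the examples from above):

```python
# Hypothetical table: surface ending -> possible basic-stem endings.
REVERSED_ALTERNATIONS = {
    "rł": ["rzeł"],    # karł  -> karzeł
    "ździ": ["zd"],    # uździ -> uzd
}

def candidate_stems(surface):
    """Undo known alternations at the end of a surface stem (which
    affect at most 4 letters) and return all candidate basic stems."""
    candidates = {surface}  # the stem may already be in basic form
    for k in range(1, 5):
        for basic in REVERSED_ALTERNATIONS.get(surface[-k:], []):
            candidates.add(surface[:-k] + basic)
    return candidates

print(candidate_stems("karł"))   # {'karł', 'karzeł'}
print(candidate_stems("uździ"))  # {'uździ', 'uzd'}
```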
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized. And it should be. Let me put it this way: think of Levenshtein not as an algorithm in and of itself, but as a solution to a specific error model. First, he suggests a model which says that when you are typing a word, every letter can be either dropped or replaced by another one due to some random process (fingers not pressing the right keys). Then, his algorithm is just a generator of maximum-likelihood solutions under this model. The more errors you allow, the smaller the probability of that sequence of errors actually happening, and the bigger the score.
You (implicitly) state a very different hypothesis, though. That Polish stems may have certain flexibility at the end (some linguistic process that you do not fully understand within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly, and what you have is not a stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary, including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it, for example by giving the biggest/prohibitive cost to changes of letters at the beginning of the word and reducing it as you go towards the end.
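A sketch of that last suggestion: a Levenshtein variant with position-dependent costs, near-prohibitive at the beginning of the word and cheap within the last few letters (the 10.0/1.0 weights and the tail length are assumptions to be tuned):

```python
def weighted_levenshtein(a, b, tail=4):
    """Edit distance where edits near the start of `a` cost much more
    than edits within its last `tail` letters (where the alternations
    happen)."""
    n, m = len(a), len(b)

    def cost(i):
        # Cheap in the tail of `a`, expensive everywhere before it.
        return 1.0 if i >= n - tail else 10.0

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(i - 1)      # delete a[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(0)          # insert before a[0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(i - 1)
            d[i][j] = min(d[i - 1][j] + cost(i - 1),    # deletion
                          d[i][j - 1] + cost(i - 1),    # insertion
                          d[i - 1][j - 1] + sub)        # substitution/match
    return d[n][m]

print(weighted_levenshtein("karzeł", "barzeł"))  # 10.0: edit at the front
print(weighted_levenshtein("karzeł", "karł"))    #  2.0: edits in the tail
```

With these weights, 'karł' is now closer to 'karzeł' than 'barzeł' is, reversing the ordering that plain Levenshtein gives in the question's example.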
I have a set of pairs of character strings, e.g.:
abba - aba,
haha - aha,
baa - ba,
exb - esp,
xa - za
The second (right) string in the pair is somewhat similar to the first (left) string.
That is, a character from the first string can be represented by nothing, itself or a character from a small set of characters.
There's no simple rule for this character-to-character mapping, although there are some patterns.
Given several thousands of such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?
The solution can be approximate, working correctly for, say, 80-95% of the strings.
Would you recommend to use some kind of a genetic algorithm? If so, how?
If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. And if you had such tables, you could align the characters using dynamic time warping (http://en.wikipedia.org/wiki/Dynamic_time_warping). One approach is therefore to guess an alignment (e.g. one-for-one, just as a starting point, or just align the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that there is some measure of optimality or probability that such passes increase, climbing to a local maximum.
There is probably some way of doing this by building a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not choose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.
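Not DTW proper, but here is a first-pass sketch of the "work out tables" step, using Python's difflib as a stand-in for the aligner (the function name is mine):

```python
from collections import Counter
import difflib

def alignment_counts(pairs):
    """Align each (left, right) pair and count which left chunks map to
    which right chunks (or to nothing); the counts form a crude
    translation table that a DTW pass could then refine."""
    counts = Counter()
    for left, right in pairs:
        sm = difflib.SequenceMatcher(None, left, right)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                for ch in left[i1:i2]:
                    counts[(ch, ch)] += 1
            elif op == "replace":
                counts[(left[i1:i2], right[j1:j2])] += 1
            elif op == "delete":
                counts[(left[i1:i2], "")] += 1
            elif op == "insert":
                counts[("", right[j1:j2])] += 1
    return counts

pairs = [("abba", "aba"), ("haha", "aha"), ("baa", "ba"),
         ("exb", "esp"), ("xa", "za")]
print(alignment_counts(pairs).most_common(5))
```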
You could use text-to-speech to create sound waves, then compare the sound waves with others and match them by percentage.
This is my theory of how Google has such an advanced spell checker.
I am looking for an existing path-truncation algorithm (similar to what the Win32 static control does with SS_PATHELLIPSIS) for a set of paths, one that focuses on the distinct elements.
For example, if my paths are like this:
Unit with X/Test 3V/
Unit with X/Test 4V/
Unit with X/Test 5V/
Unit without X/Test 3V/
Unit without X/Test 6V/
Unit without X/2nd Test 6V/
When not enough display space is available, they should be truncated to something like this:
...with X/...3V/
...with X/...4V/
...with X/...5V/
...without X/...3V/
...without X/...6V/
...without X/2nd ...6V/
(Assuming that an ellipsis generally is shorter than three letters).
This is just an example of a rather simple, ideal case (e.g. they'd all end up at different lengths now, and I wouldn't know how to create a good suggestion when a path "Thingie/Long Test/" is added to the pool).
There is no given structure to the path elements; they are assigned by the user, but items will often have similar segments. It should work for proportional fonts, so the algorithm should take a measure function (and not call it too heavily) or generate a suggestion list.
Data-wise, a typical use case would contain 2..4 path segments and 20 elements per segment.
I am looking for previous attempts in that direction, and whether that's solvable with a sensible amount of code or dependencies.
I'm assuming you're asking mainly about how to deal with the set of folder names extracted from the same level of hierarchy, since splitting by rows and path separators and aggregating by hierarchy depth is simple.
Your problem reminds me a lot of the longest common substring problem, with the differences that:
You're interested in many substrings, not just one.
You care about order.
These may appear substantial, but if you examine the dynamic-programming solution in the article you can see that it revolves around creating a table of "character collisions" and then looking for the longest diagonal in this table. I think that you could instead enumerate all diagonals in the table by the order in which they appear, and then for each path replace, by order, all appearances of these strings with ellipses.
Enforcing a minimal substring length of 2 will return a result similar to what you've outlined in your question.
It does seem like it requires some tinkering with the algorithm (for example, ensuring a certain substring is first in all strings), and then you need to invoke it over your entire set... I hope this at least gives you a possible direction.
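For reference, here is the dynamic-programming table the article describes, reduced to the two-string case (extending it to enumerate all sufficiently long diagonals across the whole set is the tinkering part):

```python
def longest_common_substring(a, b):
    """table[i][j] holds the length of the common suffix of a[:i] and
    b[:j]; the longest diagonal run of matches is the answer."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return a[best_end - best_len:best_end]

print(longest_common_substring("Unit with X", "Unit without X"))  # "Unit with"
```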
Well, the "natural number" ordering part is actually easy: simply replace all numbers with formatted numbers that have enough leading zeroes, e.g. Test 9V -> Test 000009V and Test 12B -> Test 000012B. These are now sortable by standard methods.
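As a sort key, the padding trick is a one-liner in most languages; a Python sketch (the width of 6 is an arbitrary assumption):

```python
import re

def natural_key(s):
    # Pad every digit run to a fixed width so a plain string sort
    # orders "Test 9V" before "Test 12B".
    return re.sub(r'\d+', lambda m: m.group(0).zfill(6), s)

print(sorted(["Test 12B", "Test 9V"], key=natural_key))
# ['Test 9V', 'Test 12B']
```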
As for the actual ellipsizing: unless this is actually a huge system, I'd just add a manual ellipsizing "list" (of regexes, for flexibility and pain) that would turn certain words into ellipses. This does require continuous work, but coming up with the algorithm eats your time too; there are myriads of corner cases.
I'd probably try a "floodfill" approach. Arrange the first level of directories as you would a bitmap, where every letter is a pixel. Iterate over all characters that occur in the directory names. For each of them, "paint" that same character, then "paint" the next character from the first string such that it follows this previous character (and so on). Then select the longest painted string that you find.
Example (if prefixed with *, it's painted)
Foo
BarFoo
*Foo
Bar*Foo
*F*oo
Bar*F*oo
...
note that:
*ofoo
b*oo
*o*foo
b*oo
.. painting of the first 'o' stops, since there are no continuing characters.
of*oo
b*oo
...
And then you get to the second "o", and it will find a substring of at least 2.
So you will have to iterate over most possible character instances (one optimization is to stop in each string at position Length-n, where n is the length of the longest common substring found so far). But then there is yet another problem (here with "Beta Beta"):
| <- visibility cutout
Alfa Beta Gamma Delta 1
Alfa Beta Gamma Delta 2
Alfa Beta Beta 1
Alfa Beta Beta 2
Beta Beta 1
Beta Beta 2
Beta Beta 3
Beta Beta 4
What do you want to do? Cut Alfa Beta Gamma Delta or Alfa Beta or Beta Beta or Beta?
This is a bit rambling, but might be entertaining :).