Algorithm for planning the creation of an object from dependencies

I am an alchemist. I can make things out of other things according to my recipe book. For instance:
2 lead + 1 bismuth -> 1 carbon
1 oxygen + 5 hydrogen + 3 nitrogen -> 2 carbon
5 carbon + 5 titanium -> 1 gold
...etc.
My recipe book contains thousands of recipes, each of which consumes some discrete amount of one or more inputs and produces a discrete amount of one output. Being a lazy alchemist, I don't want to remember all my recipes. I want to write a computer program to solve this problem for me. The input to the program is a description of what I want, like "2 gold", and a description of what I have in stock, like "5 titanium, 6 lead, 3 bismuth, 2 carbon, 1 gold". The output should be either "cannot be made" or a sequence of instructions for creating the thing. For the example given here, the output could be:
make 2 carbon out of 4 lead + 2 bismuth
make 1 gold out of 4 carbon + 4 titanium
Then, combined with the 1 gold I already have, I have the 2 gold I wanted.
One last note: the recipes are weighted; e.g. I prefer to make carbon out of lead and bismuth if I can.
Is there an elegant way to formulate and solve this problem? A naive recursive solution looks tempting, but I can think of recipe sets that would cause it to do an exponential amount of work.
(And, as a followup, someday my research might uncover a circular set of recipes---maybe I can make 1 hydrogen out of 1 helium and 1 helium out of 1 hydrogen---and I would like to be able to handle this case as well.)

The problem is NP-hard.
Given an instance of CNF-SAT, prepare alchemical tables with reagents for
each variable
each literal
each clause (unsatisfied version)
each clause (satisfied version)
the output.
The reactions are
variable to large supply of corresponding positive literal
variable to large supply of corresponding negative literal
clause (unsatisfied version) and satisfying literal to clause (satisfied version)
all clauses (satisfied versions) to the output.
The question is whether we can make the output given one of each variable and one of each clause (unsatisfied version).
This problem is related to the problem of determining reachability of vector addition systems/Petri nets; my reduction is based in part on reductions that appeared in that literature.
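To make the reduction concrete, here is a minimal Python sketch of how the alchemical tables could be generated from a CNF formula. The function name `cnf_to_alchemy`, the reagent names, and the DIMACS-style integer encoding of literals are my own assumptions for illustration; "large supply" is taken to be the number of clauses, which is enough for one literal to satisfy every clause.

```python
from collections import Counter

def cnf_to_alchemy(clauses):
    """Build recipes and stock from a CNF formula.

    clauses: list of clauses, each a list of nonzero ints
             (k means variable k, -k means its negation).
    Returns (recipes, stock, goal), where each recipe is
    (inputs Counter, output reagent name, output amount).
    """
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    supply = len(clauses)  # assumed "large supply": one literal per clause at most

    def lit_name(lit):
        return f"lit+{lit}" if lit > 0 else f"lit-{-lit}"

    recipes = []
    # Each variable can be turned into positive OR negative literals, not both,
    # because the stock holds only one unit of the variable.
    for v in variables:
        recipes.append((Counter({f"var{v}": 1}), lit_name(v), supply))
        recipes.append((Counter({f"var{v}": 1}), lit_name(-v), supply))
    # An unsatisfied clause plus any one of its literals gives the satisfied clause.
    for i, clause in enumerate(clauses):
        for lit in clause:
            inputs = Counter({f"unsat{i}": 1, lit_name(lit): 1})
            recipes.append((inputs, f"sat{i}", 1))
    # All satisfied clauses together make the output.
    recipes.append((Counter({f"sat{i}": 1 for i in range(len(clauses))}), "output", 1))

    stock = Counter({f"var{v}": 1 for v in variables})
    stock.update({f"unsat{i}": 1 for i in range(len(clauses))})
    return recipes, stock, ("output", 1)

recipes, stock, goal = cnf_to_alchemy([[1, -2], [2]])
```

The output is makeable from the stock exactly when the formula is satisfiable, so any planner that always answers correctly also solves CNF-SAT.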

Related

Constraint Satisfaction Problems with solvers VS Systematic Search

We have to solve a difficult problem where we need to check a lot of complex rules from multiple sources against a system in order to decide whether the system satisfies those rules or how it should be changed to satisfy them.
We initially started using constraint satisfaction problem (CSP) algorithms (with Choco) to solve it, but since the number of rules and variables will be smaller than anticipated, we are now considering building a list of all possible configurations in a database and running multiple queries based on the rules to filter this list and find the solutions that way.
Are there limitations or disadvantages to doing a systematic search compared with using a CSP solver for a reasonable number of rules and variables? Will it impact performance significantly? Will it restrict the kinds of constraints we can implement?
As an example:
Imagine a much larger number of variables, much larger (but always discrete) domains, and more rules (some of them much more complex). Instead of describing the problem as:
x in (1,6,9)
y in (2,7)
z in (1,6)
y = x + 1
z = x if x < 5 OR z = y if x > 5
and giving it to a solver, we would build a table:
X | Y | Z
1 2 1
6 2 1
9 2 1
1 7 1
6 7 1
9 7 1
1 2 6
6 2 6
9 2 6
1 7 6
6 7 6
9 7 6
And use queries like these (this is just an example to help understand; in reality we would use SPARQL against a semantic database):
SELECT X, Y, Z WHERE Y = X + 1
INTERSECT
SELECT X, Y, Z WHERE (Z = X AND X < 5) OR (Z = Y AND X > 5)
CSP allows you to combine deterministic generation of values (through the rules) with heuristic search. The beauty happens when you customize both of those for your problem. The rules are just one part. Equally important is the choice of the search algorithm/generator. You can cull a lot of the search space.
While I cannot make predictions about the performance of your SQL solution, I must say that it strikes me as somewhat roundabout. One specific problem will happen if your rules change - you may find that you have to start over from scratch. Also, the RDBMS will fully generate all of the subqueries, which may explode.
I'd suggest implementing a working prototype with CSP, and one with SQL, for a simple subset of your requirements. You will then get a good feeling for what works and what does not. Be sure to think about long-term maintenance as well.
Full disclaimer: my last contact with CSP was decades ago in university as part of my master's (I implemented a CSP search engine not unlike choco, of course a bit more rudimentary, and researched a bit on that topic). But the field will certainly have evolved since then.
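To illustrate the difference on the toy example from the question, here is a small Python sketch (not Choco, and not how a real solver is implemented). The function names are made up for illustration. `brute_force` mirrors the table approach: it generates every combination and filters afterwards. `backtrack` checks `y = x + 1` as soon as x and y are assigned, so whole branches are discarded before z is ever enumerated.

```python
from itertools import product

# Domains and rules taken from the example in the question.
DOMAINS = {"x": (1, 6, 9), "y": (2, 7), "z": (1, 6)}

def rule_y(x, y):
    return y == x + 1

def rule_z(x, y, z):
    return (z == x and x < 5) or (z == y and x > 5)

def brute_force():
    """Build the full table, then filter it (the database approach)."""
    table = product(DOMAINS["x"], DOMAINS["y"], DOMAINS["z"])
    return [(x, y, z) for x, y, z in table if rule_y(x, y) and rule_z(x, y, z)]

def backtrack():
    """Assign variables one at a time and prune as early as a rule can be checked."""
    solutions = []
    for x in DOMAINS["x"]:
        for y in DOMAINS["y"]:
            if not rule_y(x, y):
                continue  # pruned: no value of z is tried for this (x, y)
            for z in DOMAINS["z"]:
                if rule_z(x, y, z):
                    solutions.append((x, y, z))
    return solutions

print(brute_force())
print(backtrack())
```

Both functions return the same solutions; the difference only matters once the domains grow and the full table becomes too large to materialise.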

ZPL - Code 128 Understanding better how to use Subsets B and C

I've been working with ZPL (a little) for a few days, so I'm sorry if these questions look stupid.
I have to build a Code 128 bar code and I finally realized that I need to make it as short as possible.
My main question is: is it possible to switch to subset C and then back to B for just 2 digits? I read the documentation and subset C will read digit pairs from 00 to 99, so in theory it should work; in practice, will it be worth it?
Basically, when I translate a bar code with Zebra Designer and print it to a file, it doesn't bother to switch to subset C for just a couple of digits.
This is the text I need to see in the bar code:
AB1C234D567890123456
By the documentation I read, I would build something like this:
FD>:AB1C>523>64D>5567890123456
Instead Zebra Designer does:
FD>:AB1C234D>5567890123456
So the other question is, will the bar code be the same length? Actually, will mine be shorter? [I don't have a printer with me at the moment]
Last question:
Let's say I don't want to spend much time scripting this up, will the following work ok, or will it make the bar code larger?
AB1C>523>64D>556>578>590>512>534>556
So I can just build a very simple script which checks two characters at a time and, if they're both digits, adds >5 to the string.
Thank you :)
Ah, some nice loose terminology. Do you mean couple="exactly 2" or couple="a few"?
Changing from one subset to another takes one code element, so for exactly 2 digits, you'd need one element to change and one to represent the 2 digits in subset C. On the other hand, staying with your original subset would take 2 elements - so no, it's not worth the change.
Further, if you were to change to C for 2 digits and then back to your original, the change would actually be costly - C(12)B = 3 elements whereas 12 would only be 2.
If you repeat the exercise for 4 digits, then switching to C would generate C(12)(34) = 3 elements against 4 to stay with what you have; or C(12)(34)B = 4 elements if you switch and change back, or 4 elements if you stick - so no gain.
With 6 or more successive numerics, then you gain regardless of whether or not you switch back.
So overall,
2-digit terminal : No difference
2-digit other : code is longer
4-digit terminal : code is shorter
4-digit other : no difference
more than 4 digits : code is shorter.
And an ODD number of digits would need to be output in code A or B for the first digit and then the above table applies to the remainder.
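As a rough illustration of the rules above, here is a Python sketch that builds the field data, switching to subset C only where the table says it shortens the code. It assumes the invocation codes used in the question (`>:` to start in subset B, `>5` to switch to C, `>6` to switch back to B), always starts in subset B, and assumes the text contains no `>` characters; the function name is made up.

```python
import re

def zpl_code128_field(text):
    """Return ^BC field data for text, starting in subset B.

    A digit run is moved to subset C only when that makes the code shorter:
    at least 4 paired digits at the end of the data, or at least 6 elsewhere.
    For an odd-length run the first digit stays in subset B.
    """
    out = [">:"]  # start in subset B
    pos = 0
    for match in re.finditer(r"[0-9]+", text):
        run = match.group()
        terminal = match.end() == len(text)
        paired = len(run) - (len(run) % 2)  # digits that can be encoded as pairs
        threshold = 4 if terminal else 6
        if paired < threshold:
            continue  # not worth switching; leave the run in subset B
        lead = len(run) - paired  # 1 for an odd run, else 0
        out.append(text[pos:match.start() + lead])
        out.append(">5" + run[lead:])
        if not terminal:
            out.append(">6")  # switch back to subset B
        pos = match.end()
    out.append(text[pos:])
    return "".join(out)

print(zpl_code128_field("AB1C234D567890123456"))
```

For the string in the question this produces the same field data as Zebra Designer, since neither "1" nor "234" is long enough to justify a switch.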
This may not be the answer you're looking for, but specifying A (Automatic Mode) as the final parameter to the ^BC command will make the printer do this for you.
Example:
^XA
^FO100,100
^BY3
^BCN,100,N,N,A
^FD0123456789^FS
^XZ

What methods can I use to analyse and guess 4-bit checksum algorithm?

[Background Story]
I am working with a 5-year-old user identification system, and I am trying to add IDs to the database. The problem I have is that the system that reads the ID numbers requires some sort of checksum, and no-one working here now has ever worked with it, so no-one knows how it works.
I have access to the list of existing IDs, which already have correct checksums. Also, as the checksum only has 16 possible values, I can create any ID I want and run it through the authentication system up to 16 times until I get the correct checksum (but this is quite time-consuming).
[Question]
What methods can I use to help guess the checksum algorithm used for some data?
I have tried a few simple methods such as XORing and summing, but these have not worked.
So my question is: if I have data (in hexadecimal) like this:
data checksum
00029921 1
00013481 B
00026001 3
00004541 8
What methods can I use to work out what sort of checksum is used?
i.e. should I try sequential numbers such as 00029921,00029922,00029923,... or 00029911,00029921,00029931,... If I do this what patterns should I look for in the changing checksum?
Similarly, would comparing swapped digits tell me anything useful about the checksum?
i.e. 00013481 and 00031481
Is there anything else that could tell me something useful? What about inverting one bit, or maybe one hex digit?
I am assuming that this will be a common checksum algorithm, but I don't know where to start in testing it.
I have read the following links, but I am not sure if I can apply any of this to my case, as I don't think mine is a CRC.
stackoverflow.com/questions/149617/how-could-i-guess-a-checksum-algorithm
stackoverflow.com/questions/2896753/find-the-algorithm-that-generates-the-checksum
cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html
[ANSWER]
I have now downloaded a much larger list of data, and it turned out to be simpler than I was expecting, but for completeness, here is what I did.
data:
00024901 A
00024911 B
00024921 C
00024931 D
00042811 A
00042871 0
00042881 1
00042891 2
00042901 A
00042921 C
00042961 0
00042971 1
00042981 2
00043021 4
00043031 5
00043041 6
00043051 7
00043061 8
00043071 9
00043081 A
00043101 3
00043111 4
00043121 5
00043141 7
00043151 8
00043161 9
00043171 A
00044291 E
From these, I could see that when just one digit was increased by some amount, the checksum increased by the same amount, as in:
00024901 A
00024911 B
Also, two digits swapped did not change the checksum:
00024901 A
00042901 A
This means that the polynomial value (for these two positions at least) must be the same
Finally, the checksum for 00000000 was A, so I calculated the sum of digits plus A mod 16:
checksum = ( (Σ x_i) + 0xA ) mod 16
And this matched for all the values I had. Just to check that there was nothing sneaky going on with the first 3 digits that never changed in my data, I made up and tested some numbers as Eric suggested, and those all worked with this too!
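For reference, the formula can be written as a few lines of Python. This is only a sketch of the formula above, with the function name made up, checked against some of the IDs listed earlier.

```python
def checksum(id_hex):
    """Sum of the hex digits plus 0xA, modulo 16, as one uppercase hex digit."""
    total = sum(int(ch, 16) for ch in id_hex) + 0xA
    return format(total % 16, "X")

# A few of the known (data, checksum) pairs from the lists above.
known = {
    "00029921": "1",
    "00013481": "B",
    "00024901": "A",
    "00042871": "0",
    "00044291": "E",
}
for data, expected in known.items():
    assert checksum(data) == expected
```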
Many checksums I've seen use simple weighted values based on the position of the digits. For example, if the weights are 3,5,7 the checksum might be 3*c[0] + 5*c[1] + 7*c[2], then mod 10 for the result. (In your case, mod 16, since you have a 4-bit checksum.)
To check if this might be the case, I suggest that you feed some simple values into your system to get an answer:
1000000 = ?
0100000 = ?
0010000 = ?
... etc. If there are simple weights based on position, this may reveal it. Even if the algorithm is something different, feeding in nice, simple values and looking for patterns may be enlightening. As Matti suggested, you/we will likely need to see more samples before decoding the pattern.
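A sketch of that probing idea in Python, assuming the checksum has the form (offset + sum of weight * digit) mod 16. The `oracle` argument is a hypothetical stand-in for whatever tells you the correct checksum of an ID (in practice, up to 16 tries against the real system); `fake_oracle` exists only so the example runs.

```python
def recover_weights(oracle, length, modulus=16):
    """Probe with all zeros, then with a single 1 in each position.

    Under the assumed linear form, the all-zero probe gives the offset and
    each single-1 probe gives offset + weight for that position.
    """
    offset = oracle("0" * length)
    weights = []
    for i in range(length):
        probe = "0" * i + "1" + "0" * (length - i - 1)
        weights.append((oracle(probe) - offset) % modulus)
    return offset, weights

def predict(data, offset, weights, modulus=16):
    """Checksum predicted by the recovered offset and weights."""
    return (offset + sum(w * int(ch, 16) for w, ch in zip(weights, data))) % modulus

def fake_oracle(data):
    # Stand-in for the real system, for demonstration only.
    return (sum(int(ch, 16) for ch in data) + 0xA) % 16

offset, weights = recover_weights(fake_oracle, 8)
print(offset, weights)
```

If `predict` disagrees with known samples, the checksum is not a simple weighted sum and the probes themselves become the next set of patterns to examine.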

Aggregating automatically-generated feature vectors

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider; it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try matching the rules in sequence. If one matches we return the first result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller as it has to be human maintained, so I'd like to shrink rule sets so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is: not only would the algorithm identify the relationship between A and B, but would also deduce that C is noise (not important for the rule)
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.
Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.
Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
-Al.
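As a small illustration of the merging step behind Quine-McCluskey-style reduction, generalised to multi-valued features, here is a Python sketch for the rules of a single result. Two rules that agree on every column but one are combined by taking the union of the values in that column. The function name and the representation (a rule as a tuple of frozensets) are assumptions for illustration, and this greedy pass is not guaranteed to find the smallest possible rule set.

```python
def merge_rules(rows):
    """Greedily merge rules that differ in exactly one column.

    rows: iterable of equal-length tuples, one value per feature.
    Returns a set of rules, each a tuple of frozensets of allowed values.
    Every merge is exact: the merged rule matches the same subjects as
    the rules it replaces.
    """
    rules = {tuple(frozenset([value]) for value in row) for row in rows}
    if not rules:
        return rules
    width = len(next(iter(rules)))
    changed = True
    while changed:
        changed = False
        for col in range(width):
            # Group rules by everything except this column.
            groups = {}
            for rule in rules:
                key = rule[:col] + rule[col + 1:]
                groups.setdefault(key, []).append(rule[col])
            merged = set()
            for key, values in groups.items():
                union = frozenset().union(*values)
                merged.add(key[:col] + (union,) + key[col:])
            if len(merged) < len(rules):
                changed = True
            rules = merged
    return rules

rows = [(1, 2, "a"), (1, 2, "b"), (2, 4, "a"), (2, 4, "b")]
for rule in merge_rules(rows):
    print([sorted(values) for values in rule])
```

A column whose merged value set equals the feature's whole domain can then be displayed as `*`, which covers the "C is noise" case from the question.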
You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you get an arbitrary degree of confidence by checking a gargantuan validation set).

intelligent path truncation/ellipsis for display

I am looking for an existing path truncation algorithm (similar to what the Win32 static control does with SS_PATHELLIPSIS) for a set of paths that should focus on the distinct elements.
For example, if my paths are like this:
Unit with X/Test 3V/
Unit with X/Test 4V/
Unit with X/Test 5V/
Unit without X/Test 3V/
Unit without X/Test 6V/
Unit without X/2nd Test 6V/
When not enough display space is available, they should be truncated to something like this:
...with X/...3V/
...with X/...4V/
...with X/...5V/
...without X/...3V/
...without X/...6V/
...without X/2nd ...6V/
(Assuming that an ellipsis generally is shorter than three letters).
This is just an example of a rather simple, ideal case (e.g. they'd all end up at different lengths now, and I wouldn't know how to create a good suggestion when a path "Thingie/Long Test/" is added to the pool).
There is no given structure for the path elements; they are assigned by the user, but often items will have similar segments. It should work for proportional fonts, so the algorithm should take a measure function (and not call it too heavily) or generate a suggestion list.
Data-wise, a typical use case would contain 2..4 path segments and 20 elements per segment.
I am looking for previous attempts in that direction, and whether that's solvable with a sensible amount of code or dependencies.
I'm assuming you're asking mainly about how to deal with the set of folder names extracted from the same level of hierarchy, since splitting by rows and path separators and aggregating by hierarchy depth is simple.
Your problem reminds me a lot of the longest common substring problem, with the differences that:
You're interested in many substrings, not just one.
You care about order.
These may appear substantial, but if you examine the dynamic-programming solution in the article you can see that it revolves around creating a table of "character collisions" and then looking for the longest diagonal in this table. I think that you could instead enumerate all diagonals in the table by the order in which they appear, and then for each path replace, by order, all appearances of these strings with ellipses.
Enforcing a minimal substring length of 2 will return a result similar to what you've outlined in your question.
It does seem like it requires some tinkering with the algorithm (for example, ensuring a certain substring is first in all strings), and then you need to invoke it over your entire set... I hope this at least gives you a possible direction.
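For reference, here is a minimal Python sketch of the standard dynamic-programming table for the longest common substring of two strings. It finds only the single longest diagonal; enumerating all diagonals of length 2 or more, as suggested above, would be an extension of the same table. The function name is my own.

```python
def longest_common_substring(a, b):
    """Return the longest substring that occurs in both a and b.

    cur[j] holds the length of the common run ending at a[i-1] and b[j-1],
    i.e. the length of the diagonal through that cell of the table.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the diagonal
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

print(longest_common_substring("Unit with X", "Unit without X"))
print(longest_common_substring("Test 3V", "Test 4V"))
```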
Well, the "natural number" ordering part is actually easy: simply replace every number with a zero-padded version that has enough leading zeroes, e.g. Test 9V -> Test 000009V and Test 12B -> Test 000012B. These are now sortable by standard methods.
For the actual ellipsizing: unless this is actually a huge system, I'd just add a manual ellipsizing "list" (of regexes, for flexibility and pain) that turns certain words into ellipses. This does require continuous work, but coming up with the algorithm eats your time too; there are myriad corner cases.
I'd probably try a "floodfill" approach. Arrange the first level of directories as you would a bitmap, where every letter is a pixel. Iterate over all characters that appear in the directory names. For each one, "paint" every occurrence of that character, then "paint" the next character from the first string wherever it follows the previously painted character (and so on). Then select the longest painted string that you find.
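The zero-padding trick can be sketched in Python as a sort key. The width of 6 digits matches the example above and is an assumption; numbers longer than that would need a larger width.

```python
import re

def natural_key(name, width=6):
    """Pad every run of digits with leading zeroes so plain string sorting works."""
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), name)

names = ["Test 12B", "Test 9V", "Test 100V"]
print(sorted(names, key=natural_key))
```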
Example (if prefixed with *, it's painted)
Foo
BarFoo
*Foo
Bar*Foo
*F*oo
Bar*F*oo
...
note that:
*ofoo
b*oo
*o*foo
b*oo
.. painting of first 'o' stops since there are no continuing characters.
of*oo
b*oo
...
And then you get to the second "o" and it will find a substring of at least 2.
So you will have to iterate over most possible character instances (one optimization is to stop in each string at position Length-n, where n is the length of the longest common substring found so far). But then there is yet another problem (here with "Beta Beta"):
| <- visibility cutout
Alfa Beta Gamma Delta 1
Alfa Beta Gamma Delta 2
Alfa Beta Beta 1
Alfa Beta Beta 2
Beta Beta 1
Beta Beta 2
Beta Beta 3
Beta Beta 4
What do you want to do? Cut Alfa Beta Gamma Delta or Alfa Beta or Beta Beta or Beta?
This is a bit rambling, but might be entertaining :).
