Aggregating automatically-generated feature vectors - algorithm

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider, it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try matching the rules in sequence. If one matches we return the first result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller as it has to be human maintained, so I'd like to shrink rule sets so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is: not only would the algorithm identify the relationship between A and B, but would also deduce that C is noise (not important for the rule)
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.

Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.

Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
-Al.

You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you get an arbitrary degree of confidence by checking a gargantuan validation set).

Related

Constraint Satisfaction Problems with solvers VS Systematic Search

We have to solve a difficult problem where we need to check a lot of complex rules from multiple sources against a system in order to decide if the system satisfy those rules or how it should be changed to satisfy them.
We initially started using Constraint Satisfaction Problems algorithms (using Choco) to try to solve it but since the number of rules and variables would be smaller than anticipated, we are looking to build a list of all possibles configurations on a database and using multiple requests based on the rules to filter this list and find the solutions this way.
Is there limitations or disadvantages of doing a systematic search compared to using a CSP solver algorithms for a reasonable number of rules and variables? Will it impact performances significantly? Will it reduce the kind of constraints we can implement?
As examples :
You have to imagine it with a much bigger number of variables, much bigger domains of definition (but always discrete) and bigger number of rules (and some much more complex) but instead of describing the problem as :
x in (1,6,9)
y in (2,7)
z in (1,6)
y = x + 1
z = x if x < 5 OR z = y if x > 5
And giving it to a solver we would build a table :
X | Y | Z
1 2 1
6 2 1
9 2 1
1 7 1
6 7 1
9 7 1
1 2 6
6 2 6
9 2 6
1 7 6
6 7 6
9 7 6
And use queries like (this is just an example to help understand, actually we would use SPARQL against a semantic database) :
SELECT X, Y, Z WHERE Y = X + 1
INTERSECT
SELECT X, Y, Z WHERE (Z = X AND X < 5) OR (Z = Y AND X > 5)
CSP allows you to combine deterministic generation of values (through the rules) with heuristic search. The beauty happens when you customize both of those for your problem. The rules are just one part. Equally important is the choice of the search algorithm/generator. You can cull a lot of the search space.
While I cannot make predictions about the performance of your SQL solution, I must say that it strikes me as somewhat roundabout. One specific problem will happen if your rules change - you may find that you have to start over from scratch. Also, the RDBMS will fully generate all of the subqueries, which may explode.
I'd suggest to implement a working prototype with CSP, and one with SQL, for a simple subset of your requirements. You then will get a good feeling what works and what does not. Be sure to think about long term maintenance as well.
Full disclaimer: my last contact with CSP was decades ago in university as part of my master's (I implemented a CSP search engine not unlike choco, of course a bit more rudimentary, and researched a bit on that topic). But the field will certainly have evolved since then.

What is the most optimal arrangement of if statements?

If I have for example:
if(a<50){A}
if(a<20){B}
if(a<10){C}
For this set, where the number represents the percentage of time a certain statement will run, and they all not mutually exclusive, in what way should I arrange my if statements in order to make the program as efficient as possible?
What I want to achieve is, if a = 5 I only want C to run, if I get 15, I only want B to run.
I am well aware of how to create this, what I want is the most efficient way to do it. Assume A,B,C... take the same computing. Also no side effects are ignored since if I run C, it will overwrite what A and B did.
Thank you
Your post is self contradictory. You say the cases are:
not mutually exclusive
But then you say:
if a = 5 I only want C to run, if I get 15, I only want B to run
That is mutually exclusive, which you can achieve thusly:
if(a<10){C} // do C and only C if a < 10
else if(a<20){B} // do B and only B if a >= 10 && a < 20
else if(a<50){A} // do A and only A if a >= 20 && a < 50
This is micro-optimisation. You're comparing the performance of a minuscule portion of code in isolation. Any decent compiler will optimise this extensively. Even if you could improve on the performance it is likely that you'll have wasted far more time on the exercise than you will ever recover in improved performance. As with all optimisations the only way to be sure is to profile the different options. This is not a worthwhile exercise for this code.

Algorithm for planning the creation of an object from dependencies

I am an alchemist. I can make things out of other things according to my recipe book. For instance:
2 lead + 1 bismuth -> 1 carbon
1 oxygen + 5 hydrogen + 3 nitrogen -> 2 carbon
5 carbon + 5 titanium -> 1 gold
...etc.
My recipe book contains thousands of recipes, each of which consumes some discrete amount of one or more inputs and produces a discrete amount of one output. Being a lazy alchemist, I don't want to remember all my recipes. I want to write a computer program to solve this problem for me. The input to the program is a description of what I want, like "2 gold", and a description of what I have in stock, like "5 titanium, 6 lead, 3 bismuth, 2 carbon, 1 gold". The output should be either "cannot be made" or a sequence of instructions for creating the thing. For the example given here, the output could be:
make 2 carbon out of 4 lead + 2 bismuth
make 1 gold out of 4 carbon + 4 titanium
Then, combined with the 1 gold I already have, I have the 2 gold I wanted.
One last note: the recipes are weighted; e.g. I prefer to make carbon out of lead and bismuth if I can.
Is there an elegant way to formulate and solve this problem? A naive recursive solution looks tempting, but I can think of recipe sets that would cause it to do an exponential amount of work.
(And, as a followup, someday my research might uncover a circular set of recipes---maybe I can make 1 hydrogen out of 1 helium and 1 helium out of 1 hydrogen---and I would like to be able to handle this case as well.)
The problem is NP-hard.
Given an instance of CNF-SAT, prepare alchemical tables with reagents for
each variable
each literal
each clause (unsatisfied version)
each clause (satisfied version)
the output.
The reactions are
variable to large supply of corresponding positive literal
variable to large supply of corresponding negative literal
clause (unsatisfied version) and satisfying literal to clause (satisfied version)
all clauses (satisfied versions) to the output.
The question is whether we can make the output given one of each variable and one of each clause (unsatisfied version).
This problem is related to the problem of determining reachability of vector addition systems/Petri nets; my reduction is based in part on reductions that appeared in that literature.

2D PHP Pattern Creator Altgorithm

I'm looking for an algorithm to help me build 2D patterns based on rules. The idea is that I could write a script using a given site of parameters, and it would return a random, 2-dimensional sequence up to a given length.
My plan is to use this to generate image patterns based on rules. Things like image fractals or sprites for game levels could possibly use this.
For example, lets say that you can use A, B, C, & D to create the pattern. The rule is that C and A can never be next to each other, and that D always follows C. Next, lets say I want a pattern of size 4x4. The result might be the following which respects all the rules.
A B C D
B B B B
C D B B
C D C D
Are there any existing libraries that can do calculations like this? Are there any mathematical formulas I can read-up on?
While pretty inefficient concering runtime, backtracking is an often used algorithm for such a problem.
It follows a simple pattern, and if written correctly, you can easily replace a rule set into it.
Define your rule data structures; i.e., define the set of operations that the rules can encapsulate, and define the available cross-referencing that can be done. Once you've done this, you should have a clearer view of what type of algorithms to use to apply these rules to a potential result set.
Supposing that your rules are restricted to "type X is allowed to have type Y immediately to its left/right/top/bottom" you potentially have situations where generating possible patterns is computationally difficult. Take a look at Wang Tiles (a good source is the book Tilings and Patterns by Grunbaum and Shephard) and you'll see that with the states sets of rules you might define sets of Wang Tiles. Appropriate sets of these are Turing Complete.
For small rectangles, or your sets of rules, this may only be of academic interest. As mentioned elsewhere a backtracking approach might be appropriate for your ruleset - in which case you may want to consider appropriate heuristics for the order in which new components are added to your grid. Again, depending on your rulesets, other approaches might work. E.g. if your ruleset admits many solutions you might get a long way by randomly allocating many items to the grid before attempting to fill in remaining gaps.

Best way to calculate the result of a formula?

I currently have an application which can contain 100s of user defined formulae. Currently, I use reverse polish notation to perform the calculations (pushing values and variables on to a stack, then popping them off the stack and evaluating). What would be the best way to start parallelizing this process? Should I be looking at a functional language?
The calculations are performed on arrays of numbers so for example a simple A+B could actually mean 100s of additions. I'm currently using Delphi, but this is not a requirement going forward. I'll use the tool most suited to the job. Formulae may also be dependent on each other So we may have one formula C=A+B and a second one D=C+A for example.
Let's assume your formulae (equations) are not cyclic, as otherwise you cannot "just" evaluate them. If you have vectorized equations like A = B + C where A, B and C are arrays, let's conceptually split them into equations on the components, so that if the array size is 5, this equation is split into
a1 = b1 + c1
a2 = b2 + c2
...
a5 = b5 + c5
Now assuming this, you have a large set of equations on simple quantities (whether integer, rational or something else).
If you have two equations E and F, let's say that F depends_on E if the right-hand side of F mentions the left-hand side of E, for example
E: a = b + c
F: q = 2*a + y
Now to get towards how to calculate this, you could always use randomized iteration to solve this (this is just an intermediate step in the explanation), following this algorithm:
1 while (there is at least one equation which has not been computed yet)
2 select one such pending equation E so that:
3 for every equation D such that E depends_on D:
4 D has been already computed
5 calculate the left-hand side of E
This process terminates with the correct answer regardless on how you make your selections on line // 2. Now the cool thing is that it also parallelizes easily. You can run it in an arbitrary number of threads! What you need is a concurrency-safe queue which holds those equations whose prerequisites (those the equations depend on) have been computed but which have not been computed themselves yet. Every thread pops out (thread-safely) one equation from this queue at a time, calculates the answer, and then checks if there are now new equations so that all their prerequisites have been computed, and then adds those equations (thread-safely) to the work queue. Done.
Without knowing more, I would suggest taking a SIMD style approach if possible. That is, create threads to compute all formulas for a single data set. Trying to divide the computation of formulas to parallelise them wouldn't yield much speed improvement as the logic required to be able to split up the computations into discrete units suitable for threading would be hard to write and harder to get right, the overhead would cancel out any speed gains. It would also suffer quickly from diminishing returns.
Now, if you've got a set of formulas that are applied to many sets of data then the parallelisation becomes easier and would scale better. Each thread does all computations for one set of data. Create one thread per CPU core and set its affinity to each core. Each thread instantiates one instance of the formula evaluation code. Create a supervisor which loads a single data set and passes it an idle thread. If no threads are idle, wait for the first thread to finish processing its data. When all data sets are processed and all threads have finished, then exit. Using this method, there's no advantage to having more threads than there are cores on the CPU as thread switching is slow and will have a negative effect on overall speed.
If you've only got one data set then it is not a trivial task. It would require parsing the evaluation tree for branches without dependencies on other branches and farming those branches to separate threads running on each core and waiting for the results. You then get problems synchronizing the data and ensuring data coherency.

Resources