We have to solve a difficult problem: we need to check a large number of complex rules from multiple sources against a system in order to decide whether the system satisfies those rules, or how it should be changed to satisfy them.
We initially started with Constraint Satisfaction Problem algorithms (using Choco), but since the number of rules and variables turned out to be smaller than anticipated, we are now considering building a list of all possible configurations in a database and using multiple queries based on the rules to filter this list and find the solutions that way.
Are there limitations or disadvantages to doing a systematic search compared to using a CSP solver, for a reasonable number of rules and variables? Will it impact performance significantly? Will it reduce the kinds of constraints we can implement?
As an example:
You have to imagine it with many more variables, much bigger (but always discrete) domains, and many more rules (some of them much more complex), but instead of describing the problem as:
x in (1,6,9)
y in (2,7)
z in (1,6)
y = x + 1
z = x if x < 5 OR z = y if x > 5
and giving it to a solver, we would build a table:
X | Y | Z
1 | 2 | 1
6 | 2 | 1
9 | 2 | 1
1 | 7 | 1
6 | 7 | 1
9 | 7 | 1
1 | 2 | 6
6 | 2 | 6
9 | 2 | 6
1 | 7 | 6
6 | 7 | 6
9 | 7 | 6
And use queries like this (this is just an example to help understanding; in practice we would use SPARQL against a semantic database):
SELECT X, Y, Z WHERE Y = X + 1
INTERSECT
SELECT X, Y, Z WHERE (Z = X AND X < 5) OR (Z = Y AND X > 5)
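To make the trade-off concrete, here is a minimal Python sketch (an illustration only, with plain in-memory filtering rather than the actual SPARQL setup) of the enumerate-then-filter approach on the toy example above:

import itertools

# Toy domains from the example above.
domains = {'x': (1, 6, 9), 'y': (2, 7), 'z': (1, 6)}

# Build the full table of candidate configurations (the Cartesian product),
# then filter it with predicates that mirror the rules / queries.
table = [dict(zip(domains, values))
         for values in itertools.product(*domains.values())]

rules = [
    lambda r: r['y'] == r['x'] + 1,
    lambda r: (r['z'] == r['x'] and r['x'] < 5) or (r['z'] == r['y'] and r['x'] > 5),
]

solutions = [r for r in table if all(rule(r) for rule in rules)]
print(solutions)  # [{'x': 1, 'y': 2, 'z': 1}]

The table here has only 12 rows, but in general it grows as the product of the domain sizes, which is exactly the blow-up a solver's propagation and search heuristics try to avoid.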
CSP allows you to combine deterministic generation of values (through the rules) with heuristic search. The beauty happens when you customize both of those for your problem. The rules are just one part. Equally important is the choice of the search algorithm/generator. You can cull a lot of the search space.
While I cannot make predictions about the performance of your SQL solution, I must say that it strikes me as somewhat roundabout. One specific problem will arise if your rules change: you may find that you have to start over from scratch. Also, the RDBMS will fully materialize all of the subqueries, which may explode in size.
I'd suggest implementing a working prototype with CSP and one with SQL, for a simple subset of your requirements. You will then get a good feeling for what works and what does not. Be sure to think about long-term maintenance as well.
Full disclaimer: my last contact with CSP was decades ago at university, as part of my master's (I implemented a CSP search engine not unlike Choco, though of course a bit more rudimentary, and did a bit of research on the topic). But the field will certainly have evolved since then.
I have got a data-management problem. I have a database where "EDSS.1", "EDSS.2", ... represent a numeric variable, scaled from 0 to 10 in steps of 0.5, where higher numbers stand for higher disability. For each EDSS variable, I have a corresponding "VISITDATE.1", "VISITDATE.2", ...
Now I am interested in assessing the CONFIRMED DISABILITY PROGRESSION (CDP), which is an increase of 1 point on the EDSS. To make things more difficult, this increase needs to be confirmed at the subsequent visit (e.g. EDSS.3), which has to be at least 6 months later (that is, VISITDATE.3 - VISITDATE.2 > 6 months).
To do so, I am creating a nested ifelse statement, as shown below.
library(dplyr)
library(lubridate)
# Attempt: flag CDP with the date of the confirming visit, otherwise 0.
prova <- prova %>%
  mutate(CDP = ifelse(EDSS.2 > EDSS.1 & EDSS.3 >= EDSS.2 &
                        difftime(VISITDATE.3, VISITDATE.2, units = "weeks") > 48,
                      print(ymd(VISITDATE.2)), 0))
However, I am facing the following main problems:
How can I return the VISITDATE of interest instead of 1 or 0?
How can I extend my code to EDSS.2, EDSS.3, and so on? I am interested in finding all the confirmed disability progressions (CDPs).
Many thanks to everyone who finds the time to answer me.
I want to use multiple variables to predict multiple targets. Note that multiple targets here does not mean multi-label.
Let's go for an example like this:
# In this example, x1, x2, x3 are used to predict y1, y2
import pandas as pd
pd.DataFrame({'x1': [1, 2, 3], 'x2': [2, 3, 4], 'x3': [1, 1, 1], 'y1': [1, 2, 1], 'y2': [2, 3, 3]})
   x1  x2  x3  y1  y2
0   1   2   1   1   2
1   2   3   1   2   3
2   3   4   1   1   3
In my limited experience with data mining, I have found two solutions that might help:
Build two xgboost models to predict y1 and y2 respectively.
Use a fully-connected layer to map [x1, x2, x3] to [y1, y2], which seems like a promising solution.
I wanted to know whether it is good practice to do that, and what would be a better way to predict multiple targets?
Regardless of your approach, two outputs mean you need two functions. I hope it's clear that a layer producing two outputs is equivalent to two layers each producing one output.
The only thing worth taking into account here (only relevant for deeper models) is whether you want to build intermediate representations of your input that are shared for predicting both outputs, i.e. x → h1 → h2 → .. → hN, hN → y1, hN → y2. Doing so would force your hN representation to act as a task-agnostic, multi-purpose encoder, while avoiding the redundancy of having two models learn the same thing.
For shallow architectures, such as the single-layer one you described, this is meaningless.
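As a rough illustration of the shared-encoder idea (my own sketch, assuming TensorFlow/Keras is available and reusing the tiny DataFrame from the question as training data, so the numbers are placeholders only):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.array([[1, 2, 1], [2, 3, 1], [3, 4, 1]], dtype="float32")
y1 = np.array([1, 2, 1], dtype="float32")
y2 = np.array([2, 3, 3], dtype="float32")

inputs = keras.Input(shape=(3,))
h = layers.Dense(8, activation="relu")(inputs)  # shared representation hN
out1 = layers.Dense(1, name="y1")(h)            # head for y1
out2 = layers.Dense(1, name="y2")(h)            # head for y2

model = keras.Model(inputs, [out1, out2])
model.compile(optimizer="adam", loss="mse")
model.fit(X, [y1, y2], epochs=10, verbose=0)

Dropping the shared layer and training the two heads separately roughly recovers option 1 from the question (two independent models).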
I am an alchemist. I can make things out of other things according to my recipe book. For instance:
2 lead + 1 bismuth -> 1 carbon
1 oxygen + 5 hydrogen + 3 nitrogen -> 2 carbon
5 carbon + 5 titanium -> 1 gold
...etc.
My recipe book contains thousands of recipes, each of which consumes some discrete amount of one or more inputs and produces a discrete amount of one output. Being a lazy alchemist, I don't want to remember all my recipes. I want to write a computer program to solve this problem for me. The input to the program is a description of what I want, like "2 gold", and a description of what I have in stock, like "5 titanium, 6 lead, 3 bismuth, 2 carbon, 1 gold". The output should be either "cannot be made" or a sequence of instructions for creating the thing. For the example given here, the output could be:
make 3 carbon out of 6 lead + 3 bismuth
make 1 gold out of 5 carbon + 5 titanium
Then, combined with the 1 gold I already have, I have the 2 gold I wanted.
One last note: the recipes are weighted; e.g. I prefer to make carbon out of lead and bismuth if I can.
Is there an elegant way to formulate and solve this problem? A naive recursive solution looks tempting, but I can think of recipe sets that would cause it to do an exponential amount of work.
(And, as a followup, someday my research might uncover a circular set of recipes---maybe I can make 1 hydrogen out of 1 helium and 1 helium out of 1 hydrogen---and I would like to be able to handle this case as well.)
The problem is NP-hard.
Given an instance of CNF-SAT, prepare alchemical reagents for:
each variable
each literal
each clause (unsatisfied version)
each clause (satisfied version)
the output.
The reactions are
variable to large supply of corresponding positive literal
variable to large supply of corresponding negative literal
clause (unsatisfied version) and satisfying literal to clause (satisfied version)
all clauses (satisfied versions) to the output.
The question is whether we can make the output given one of each variable and one of each clause (unsatisfied version).
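For concreteness, here is a small Python sketch of the reduction described above (my own illustration; the function and reagent names are made up). Given a CNF formula as a DIMACS-style list of clauses of signed variable indices, it emits the recipe book and the starting stock:

def cnf_to_recipes(clauses, num_vars, big=100):
    recipes = []  # each entry: (inputs dict, (output amount, output reagent))
    for v in range(1, num_vars + 1):
        # spending a variable commits it to one truth value: it yields a large
        # supply of either its positive or its negative literal
        recipes.append(({f"var{v}": 1}, (big, f"lit+{v}")))
        recipes.append(({f"var{v}": 1}, (big, f"lit-{v}")))
    for j, clause in enumerate(clauses):
        for lit in clause:
            lit_name = f"lit{'+' if lit > 0 else '-'}{abs(lit)}"
            # an unsatisfied clause plus one satisfying literal becomes satisfied
            recipes.append(({f"clause{j}_unsat": 1, lit_name: 1},
                            (1, f"clause{j}_sat")))
    # all satisfied clauses combine into the output
    recipes.append(({f"clause{j}_sat": 1 for j in range(len(clauses))},
                    (1, "output")))
    stock = {f"var{v}": 1 for v in range(1, num_vars + 1)}
    stock.update({f"clause{j}_unsat": 1 for j in range(len(clauses))})
    return recipes, stock

# Example: (x1 OR x2) AND (NOT x1 OR x2)
recipes, stock = cnf_to_recipes([[1, 2], [-1, 2]], num_vars=2)

Asking whether "1 output" can be made from this stock is then exactly asking whether the formula is satisfiable.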
This problem is related to the problem of determining reachability of vector addition systems/Petri nets; my reduction is based in part on reductions that appeared in that literature.
Given (3AC) in base 14, convert it into base 7.
A simple approach is to first convert 3AC into base 10 and then into base 7, which results in 2105.
I was just wondering whether there exists any direct way of converting from base 14 to base 7?
As others have said, there is no straightforward technique, because 14 is not a power of 7.
However, you don't need to go through base 10. One approach is to write routines that perform base-7 arithmetic (specifically addition and multiplication by small constants), and then use them to process each base-14 digit in turn: multiply it by the relevant power of 14 and add it to an accumulator, all in base 7.
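A minimal Python sketch of this idea (my own; it keeps the accumulator as a list of base-7 digits and folds in each base-14 digit Horner-style, which is equivalent to multiplying by the relevant power of 14, so no native big integer is ever formed):

def to_base7(base14_digits):
    # base14_digits: most-significant-first list of ints 0..13
    acc = [0]                        # accumulator as base-7 digits, least significant first
    for d in base14_digits:
        # acc = acc * 14 + d, carried out digit by digit in base 7
        carry = d
        for i in range(len(acc)):
            t = acc[i] * 14 + carry
            acc[i] = t % 7
            carry = t // 7
        while carry:
            acc.append(carry % 7)
            carry //= 7
    return acc[::-1]                 # most significant first

print(to_base7([3, 10, 12]))         # 3AC in base 14 -> [2, 1, 0, 5]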
I have found one approach.
There is no need to go through base 10 and then to base 7. It can be done using this formula:
If a number X is represented in base 14 as
X = a(n) a(n-1) a(n-2) ... a(0)
then in base 7 we can write it as
X=.....rqp
where, writing t(i) for the running total at position i and using integer division for the carries:
t(0) = (2^0)a(0),             p = t(0) % 7
t(1) = (2^1)a(1) + t(0)/7,    q = t(1) % 7
t(2) = (2^2)a(2) + t(1)/7,    r = t(2) % 7
..........
t(n) = (2^n)a(n) + t(n-1)/7,  nth digit = t(n) % 7
(and the carries keep going past a(n), because a number in base 14 needs more digits in base 7).
The logic is simple, just based on properties of bases and on the fact that 14 is twice 7, so 14^i = 2^i * 7^i; otherwise it would have been a tedious task.
E.g., here we are given 3AC.
C =12;
so last digit is (2^0 * 12)%7 = 5
A=10
next digit is (2^1 * 10 + 12/7)%7 = (20+1)%7=21%7=0
next is 3;
next digit is (2^2 * 3 + 21/7)%7 = (12+3)%7=15%7=1
next is nothing(0);
next digit is (2^3 * 0 + 15/7)%7 = (0+2)%7=2%7=2
Hence, in base 7 the number will be 2105. This method may seem confusing and difficult, but with a little practice it comes in very handy for solving similar problems. Also, even if the number is very long, like 287AC23B362, we don't have to unnecessarily compute the base-10 form, which takes at least some extra time; we can compute the base-7 form directly.
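A short Python sketch of this method (my own; digit lists are most-significant-first and the example digits 3, 10, 12 stand for 3AC):

def base14_to_base7(base14_digits):
    digits7 = []
    carry = 0
    # scan from the least significant digit: t(i) = 2^i * a(i) + carry
    for i, a in enumerate(reversed(base14_digits)):
        t = (2 ** i) * a + carry
        digits7.append(t % 7)
        carry = t // 7
    while carry:                      # flush the remaining carry into extra digits
        digits7.append(carry % 7)
        carry //= 7
    return digits7[::-1]

print(base14_to_base7([3, 10, 12]))   # 3AC -> [2, 1, 0, 5]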
No, there's not really an easy way to do as you wish because 14 is not a power of 7.
The only tricks that I know of for something like this (e.g. easily going from hex to binary) require that one base be a power of the other.
Link gives a reasonably clear answer. In short, it's a bit of a pain with the methods I know.
I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider; it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try the rules in sequence; the first rule that matches determines the result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller, as it has to be maintained by humans, so I'd like to shrink it so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is: not only would the algorithm identify the relationship between A and B, but it would also deduce that C is noise (not important for the rule).
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.
Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.
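As a rough sketch of the decision-tree route in Python rather than Weka (scikit-learn's DecisionTreeClassifier plays a role similar to J48; the tiny rule table below is invented purely for illustration):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows in the spirit of the question's examples.
rows = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [2, 2, 2, 4, 4],
    'C': ['b', 'c', 'e', 'a', 'b'],
    'Result': ['X', 'X', 'X', 'Y', 'Y'],
})

X = pd.get_dummies(rows[['A', 'B', 'C']], columns=['C'])  # one-hot the discrete feature
clf = DecisionTreeClassifier(max_depth=3).fit(X, rows['Result'])
print(export_text(clf, feature_names=list(X.columns)))    # human-readable splits

The printed tree can then be read back as candidate compressed rules, with untouched features playing the role of the "*" wildcard.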
Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
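For the minimization step itself, here is a hedged Python illustration (SymPy's SOPform performs Quine-McCluskey-style two-level minimization; the toy encoding below is invented to mirror the "C is noise" example):

from sympy import symbols
from sympy.logic import SOPform

# Toy encoding: a means A==2 (else A==1), b means B==4 (else B==2), c means C=='b'.
a, b, c = symbols('a b c')
minterms = [[0, 0, 0], [0, 0, 1],    # A=1, B=2, any C -> X
            [1, 1, 0], [1, 1, 1]]    # A=2, B=4, any C -> X
print(SOPform([a, b, c], minterms))  # e.g. (a & b) | (~a & ~b): c has dropped out

The real problem has multi-valued features and ~25 million rows, so an industrial-strength minimizer (or the Karnaugh-map bookkeeping described above) would be needed, but the principle is the same.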
You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you get an arbitrary degree of confidence by checking a gargantuan validation set).