Pattern recognition algorithms - algorithm

In the past I had to develop a program which acted as a rule evaluator. You had an antecedent and some consecuents (actions) so if the antecedent evaled to true the actions where performed.
At that time I used a modified version of the RETE algorithm (there are three versions of RETE only the first being public) for the antecedent pattern matching. We're talking about a big system here with million of operations per rule and some operators "repeated" in several rules.
It's possible I'll have to implement it all over again in other language and, even though I'm experienced in RETE, does anyone know of other pattern matching algorithms? Any suggestions or should I keep using RETE?

The TREAT algorithm is similar to RETE, but doesn't record partial matches. As a result, it may use less memory than RETE in certain situations. Also, if you modify a significant number of the known facts, then TREAT can be much faster because you don't have to spend time on retractions.
There's also RETE* which balances between RETE and TREAT by saving some join node state depending on how much memory you want to use. So you still save some assertion time, but also get memory and retraction time savings depending on how you tune your system.
You may also want to check out LEAPS, which uses a lazy evaluation scheme and incorporates elements of both RETE and TREAT.
I only have personal experience with RETE, but it seems like RETE* or LEAPS are the better, more flexible choices.

Related

XGBOOST/lLightgbm over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, that may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is somewhat possible that none of the possible predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would have likely been a rather bad one. For this work, we use the R implementation of XGBoost/lightgbm.
We have been having difficulties tuning the models. Specifically when running cross validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see figure below for nrounds=600,000 from xgboost). This is observed even when increasing the learning rate (eta), or when adding some regularization parameters (e.g. max_delta_step, lamda, alpha, gamma, even at high values for these).
As expected, the CV test score is always lower than the train score, but continuous to improve without ever showing a clear sign of over fitting. This is true regardless of the evaluation metrics that is used (example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is also observed when using a grid search to find the optimal value of tree depth (max_depth). CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of over fitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
Are there situations in which over fitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyper parameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, owing that we don’t know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees is not the optimal method for this, are there others to come to mind? Again, part of the reason we chose to use boosted trees was to enable the identification of higher (i.e. more than 2) order interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
welcome to Stackoverflow!
In the absence of some code and representative data it is not easy to make other than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem where one is trying to predict rare events usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.

What is the time complexity of Rete Algorithm?

Rete Algorithm is an efficient pattern matching algorithm that compares a large collection of patterns to a large collection of objects. It is also used in one of the expert system shell that I am exploring right now: is drools.
What is the time complexity of the algorithm, based on the number of rules I have?
Here is a link for Rete Algorithm: http://www.balasubramanyamlanka.com/rete-algorithm/
Also for Drools: https://drools.org/
Estimating the complexity of RETE is a non-trivial problem.
Firstly, you cannot use the number of rules as a dimension. What you should look at are the single constraints or matches the rules have. You can see a rule as a collection of constraints grouped together. This is all what RETE reasons about.
Once you have a rough estimate of the amount of constraints your rule base has, you will need to look at those which are inter-dependent. Inter-dependent constraints are the most complex matches and are pretty similar in concept as JOINS in SQL queries. Their complexity varies based on their nature as well as the state of your working memory.
Then you will need to look at the size of your working memory. The amount of facts you assert within a RETE based expert system strongly influence its performance.
Lastly, you need to consider the engine conflict resolution strategy. If you have several conflicting rules, it might take a lot of time to figure out in which order to execute them.
Regarding RETE performance, there is a very good PhD dissertation I'd suggest you to look at. The author is Robert B. Doorenbos and the title is "Production matching for large learning systems".

Multiple short rules pattern matching algorithm

As the title advances, we would like to get some advice on the fastest algorithm available for pattern matching with the following constrains:
Long dictionary: 256
Short but not fixed length rules (from 1 to 3 or 4 bytes depth at most)
Small (150) number of rules (if 3 bytes) or moderate (~1K) if 4
Better performance than current AC-DFA used in Snort or than AC-DFA-Split again used by Snort
Software based (recent COTS systems like E3 of E5)
Ideally would like to employ some SIMD / SSE stuff due to the fact that currently they are 128 bit wide and in near future they will be 256 in opposition to CPU's 64
We started this project by prefiltering Snort AC with algorithm shown on Sigmatch paper but sadly the results have not been that impressive (~12% improvement when compiling with GCC but none with ICC)
Afterwards we tried to exploit new pattern matching capabilities present in SSE 4.2 through IPP libraries but no performance gain at all (guess doing it directly in machine code would be better but for sure more complex)
So back to the original idea. Right now we are working along the lines of Head Body Segmentation AC but are aware unless we replace the proposed AC-DFA for the head side will be very hard to get improved performance, but at least would be able to support much more rules without a significant performance drop
We are aware using bit parallelism ideas use a lot of memory for long patterns but precisely the problem scope has been reduce to 3 or 4 bytes long at most thus making them a feasible alternative
We have found Nedtries in particular but would like to know what do you guys think or if there are better alternatives
Ideally the source code would be in C and under an open source license.
IMHO, our idea was to search for something that moved 1 byte at a time to cope with different sizes but do so very efficiently by taking advantage of most parallelism possible by using SIMD / SSE and also trying to be the less branchy as possible
I don't know if doing this in a bit wise manner or byte wise
Back to a proper keyboard :D
In essence, most algorithms are not correctly exploiting current hardware capabilities nor limitations. They are very cache inneficient, very branchy not to say they dont exploit capabilities now present in COTS CPUs that allow you to have certain level of paralelism (SIMD, SSE, ...)
This is preciselly what we are seeking for, an algorithm (or an implementation of an already existing algorithm) that properly considers all that, with the advantag of not trying to cover all rule lengths, just short ones
For example, I have seen some papers on NFAs claming that this days their performance could be on pair to DFAs with much less memory requirements due to proper cache efficiency, enhanced paralelism, etc
Please take a look at:
http://www.slideshare.net/bouma2
Support of 1 and 2 bytes is similar to what Baxter wrote above. Nevertheless, it would help if you could provide the number of single-byte and double-byte strings you expect to be in the DB, and the kind of traffic you are expecting to process (Internet, corporate etc.) - after all, too many single-byte strings may end up in a match for every byte. The idea of Bouma2 is to allow the incorporation of occurrence statistics into the preprocessing stage, thereby reducing the false-positives rate.
It sounds like you are already using hi-performance pattern matching. Unless you have some clever new algorithm, or can point to some statistical bias in the data or your rules, its going to be hard to speed up the raw algorithms.
You might consider treating pairs of characters as pattern match elements. This will make the branching factor of the state machine huge but you presumably don't care about RAM. This might buy you a factor of two.
When running out of steam algorithmically, people often resort to careful hand coding in assembler including clever use of the SSE instructions. A trick that might be helpful to handle unique sequences whereever found is to do a series of comparisons against the elements and forming a boolean result by anding/oring rather than conditional branching, because branches are expensive. The SSE instructions might be helpful here, although their alignment requirements might force you to replicate them 4 or 8 times.
If the strings you are searching are long, you might distribute subsets of rules to seperate CPUs (threads). Partitioning the rules might be tricky.

Initial Genetic Programming Parameters

I did a little GP (note:very little) work in college and have been playing around with it recently. My question is in regards to the intial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work
better when you have a noisy fitness
function, I think this is because the growth
of sub-groups in the population over successive generations acts
to give more sampling of
the fitness function. I typically use
100 for less noisy/deterministic functions, 1000+
for noisy.
For number of generations it is best to measure improvements in the
fitness function and stop when it
meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out, if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an
implementation without explicit
limits but penalise or eliminate
programs that run too long (which is probably
what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem
to matter all that much. As long as
you have a comprehensive set of mutations, any reasonably balanced
distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be
solved with another layer of
indirection (except for too many
layers of indirection.)
-David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data variating parameters on a very simple problem and link given operators and parameters values (such as mutation rates, etc) to given results in function of population size etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
talking from my (limited) experience, if you decide to simplify the problem and use a fixed way to implement crossover, selection, and just play with population size and mutation rate (implemented in a given way) trying to come up with general results you'll soon realize that too many variables are still into play because at the end of the day the number of generations after which statistically you will get a decent result (whatever way you wanna define decent) still obviously depend primarily on the problem you're solving and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of effect of given GA parameters!).
It is certainly possible to draft a set of guidelines - as the (rare but good) literature proves - but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somehow an equivalent way (which more often than not means you're ealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient whereas Koza and others often don't deem if worthy to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

What can be parameters other than time and space while analyzing certain algorithms?

I was interested to know about parameters other than space and time during analysing the effectiveness of an algorithms. For example, we can focus on the effective trap function while developing encryption algorithms. What other things can you think of ?
First and foremost there's correctness. Make sure your algorithm always works, no matter what the input. Even for input that the algorithm is not designed to handle, you should print an error mesage, not crash the entire application. If you use greedy algorithms, make sure they truly work in every case, not just a few cases you tried by hand.
Then there's practical efficiency. An O(N2) algorithm can be a lot faster than an O(N) algorithm in practice. Do actual tests and don't rely on theoretical results too much.
Then there's ease of implementation. You usually don't need the best intro sort implementation to sort an array of 100 integers once, so don't bother.
Look for worst cases in your algorithms and if possible, try to avoid them. If you have a generally fast algorithm but with a very bad worst case, consider detecting that worst case and solving it using another algorithm that is generally slower but better for that single case.
Consider space and time tradeoffs. If you can afford the memory in order to get better speeds, there's probably no reason not to do it, especially if you really need the speed. If you can't afford the memory but can afford to be slower, do that.
If you can, use existing libraries. Don't roll your own multiprecision library if you can use GMP for example. For C++, stuff like boost and even the STL containers and algorithms have been worked on for years by an army of people and are most likely better than you can do alone.
Stability (sorting) - Does the algorithm maintain the relative order of equal elements?
Numeric Stability - Is the algorithm prone to error when very large or small real numbers are used?
Correctness - Does the algorithm always give the correct answer? If not, what is the margin of error?
Generality - Does the algorithm work in many situation (e.g. with many different data types)?
Compactness - Is the program for the algorithm concise?
Parallelizability - How well does performance scale when the number of concurrent threads of execution are increased?
Cache Awareness - Is the algorithm designed to maximize use of the computer's cache?
Cache Obliviousness - Is the algorithm tuned for particulary cache-sizes / cache-line-sizes or does it perform well regardless of the parameters of the cache?
Complexity. 2 algorithms being the same in all other respects, the one that's much simpler is going to be a much better candidate for future customization and use.
Ease of parallelization. Depending on your use case, it might not make any difference or, on the other hand, make the algorithm useless because it can't use 10000 cores.
Stability - some algorithms may "blow up" with certain test conditions, e.g. take an inordinately long time to execute, or use an inordinately large amount of memory, or perhaps not even terminate.
For algorithms that perform floating point operations, the accumulation of round-off error is often a consideration.
Power consumption, for embedded algorithms (think smartcards).
One important parameter that is frequently measure in the analysis of algorithms is that of Cache hits and cache misses. While this is a very implementation and architecture dependent issue, it is possible to generalise somewhat. One particularly interesting property of the algorithm is being Cache-oblivious, which means that the algorithm will use the cache optimally on multiple machines with different cache sizes and structures without modification.
Time and space are the big ones, and they seem so plain and definitive, whereby they should often be qualified (1). The fact that the OP uses the word "parameter" rather than say "criteria" or "properties" is somewhat indicative of this (as if a big O value on time and on space was sufficient to frame the underlying algorithm).
Other criteria include:
domain of applicability
complexity
mathematical tractability
definitiveness of outcome
ease of tuning (may be tied to "complexity" and "tactability" afore mentioned)
ability of running the algorithm in a parallel fashion
(1) "qualified": As hinted in other answers, a -technically- O(n^2) algorithm may be found to be faster than say an O(n) algorithm, in 90% of the cases (which, btw, may turn out to be 100% of the practical cases)
worst case and best case are also interesting, especially when linked to some conditions in the input. if your input data shows some properties, an algorithm, by taking advantage of this property, may perform better that another algorithm which performs the same task but does not use that property.
for example, many sorting algorithm perform very efficiently when input are partially ordered in a specific way which minimizes the number of operations the algorithm has to execute.
(if your input is mostly sorted, an insertion sort will fit nicely, while you would never use that algorithm otherwise)
If we're talking about algorithms in general, then (in the real world) you might have to think about CPU/filesystem(read/write operations)/bandwidth usage.
True they are way down there in the list of things you need worry about these days, but given a massive enough volume of data and cheap enough infrastructure you might have to tweak your code to ease up on one or the other.
What you are interested aren’t parameters, rather they are intrinsic properties of an algorithm.
Anyway, another property you might be interested in, and analyse an algorithm for, concerns heuristics (or rather, approximation algorithms), i.e. algorithms which don’t find an exact solution but rather one that is (hopefully) good enough.
You can analyze how far a solution is from the theoretical optimal solution in the worst case. For example, an existing algorithm (forgot which one) approximates the optimal travelling salesman tour by a factor of two, i.e. in the worst case it’s twice as long as the optimal tour.
Another metric concerns randomized algorithms where randomization is used to prevent unwanted worst-case behaviours. One example is randomized quicksort; quicksort has a worst-case running time of O(n2) which we want to avoid. By shuffling the array beforehand we can avoid the worst-case (i.e. an already sorted array) with a very high probability. Just how high this probability is can be important to know; this is another intrinsic property of the algorithm that can be analyzed using stochastic.
For numeric algorithms, there's also the property of continuity: that is, whether if you change input slightly, output also changes only slightly. See also Continuity analysis of programs on Lambda The Ultimate for a discussion and a link to an academical paper.
For lazy languages, there's also strictness: f is called strict if f _|_ = _|_ (where _|_ denotes the bottom (in the sense of domain theory), a computation that can't produce a result due to non-termination, errors etc.), otherwise it is non-strict. For example, the function \x -> 5 is non-strict, because (\x -> 5) _|_ = 5, whereas \x -> x + 1 is strict.
Another property is determinicity: whether the result of the algorithm (or its other properties, such as running time or space consumption) depends solely on its input.
All these things in the other answers about the quality of various algorithms are important and should be considered.
But time and space are two things that vary at some rate compared to the size of the input (n). So what else can vary according to n?
There are several that are related to I/O. For example, the number of writes to a disk is an important one, which may not be directly shown by space and time estimates alone. This becomes particularly important with flash memory, where the number of writes to the same memory location is the significant metric in some algorithms.
Another I/O metric would be "chattiness". A networking protocol might send shorter messages more often adding up to the same space and time as another networking protocol, but some aspect of the system (perhaps billing?) might make minimizing either the size or number of the messages desireable.
And that brings us to Cost, which is a very important algorithmic consideration sometimes. The cost of an algorithm may be affected by both space and time in different amounts (consider the separate costing of server storage space and gigabits of data transfer), but the cost is the thing that you wish to minimize overall, so it may have its own big-O estimations.

Resources