String search algorithms

String search algorithms - algorithm

I have a project on benchmarking String Matching Algorithms and I would like to know if there is a standard for every algorithm so that I would be able to get fair results with my experimentation. I am planning to use java's system.nanotime in getting the running time of every algorithm. Any comment or reactions regarding my problem is very much appreciated. Thanks!

I am not entirely sure what you're asking. However, I am guessing you are asking how to get the most realistic results. You need to run your algorithm hundreds, or even thousands of iterations to get an average. It is also very important to turn off any caching that your language may do, and don't reuse objects, unless it is part of your algorithm.

I am not entirely sure what you're asking. However, another interpretation of what you are asking can be answered by trying to work out how a given algorithm performs as you increase the size of the problem. Using raw time to compare algorithms at a given string size does not necessarily allow for accurate comparison. Instead, you could try each algorithm with different string sizes and see how the algorithm behaves as string size varies.
And Mark's advice is good too. So you are running repeated trials for many different string lengths to get a picture of how one algorithm works, then repeating that for the next algorithm.

Again, it's not clear what you're asking, but here's another thought in addition to what Tony and Mark said:
Be very careful about testing only "real" input or only "random" input. Some algorithms are tuned to do well on typical input (searching for a word in English text), while others are tuned for working well on pathologically hard cases. You'll need a huge mix of possible inputs of all different types and sizes to do a truly good benchmark.

Related

What is the difference between Genetic Algorithm and Iterated Local Search Algorithm?

I'm basically trying to use Genetic Algorithm or Iterated Local Search Algorithm to get an optimal solution for a question.Can someone please explain what is the basic difference between these two algorithms and is there any situations where one of them is better than the other?

Let me start from the second question. I believe that there is no way to determine a better algorithm for a given problem without any trials and tests. The behavior of an algorithm heavily depends on problem's properties. If we are talking about complex problems with hundreds and thousands of variables, it's just too difficult to predict anything. I'm not talking about your engineer's intuition, some deep problem understanding, previous experience, etc, they are not really measurable.
The main difference between global and local search is quite straightforward - local search considers just one or a few of possible solutions at a single point of time and it tries to improve them with some modifications. Thus, each iteration it considers just a small portion of a search space (=local neighboorhood). Global search tries to take into account whole problem with all its parameters at the same time. For example, PSO samples huge amount of candidates and tries to move all of them into the global optimum's direction using some simple formula.

Decoding Permutated English Strings

A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.

This is a basic transposition cipher. My question above was simply to determine if it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these will tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to place an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The overarching goal for an algorithm's quality isn't to decipher the text, as it can be done by better than brute force methods, nor is it simply to be fast, but it should eliminate possibilities absolutely as fast as possible.
Since you can use multiple strings simultaneously, attempting to create words from the rarest characters is going to allow you to test dictionary attacks in parallel. Finding the correct placement of the rarest terms in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch with genetic algorithms. These are meant to advance the research cred of people working in GA, as these are not really optimal in practice. Instead, you should look at some basic optimizatin methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)

first, you'd need a test for the correct ordering. something fairly simple like being able to break the majority of texts into words using a dictionary ordered by frequency of use without backtracking.
one you have that, you can play with various approaches. two i would try are:
using a genetic algorithm, with scoring based on 2 and 3-letter tuples (which you can either get from somewhere or generate yourself). the hard part of genetic algorithms is finding a good description of the process that can be fragmented and recomposed. i would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "dna" is read). also, you might need to extend the scoring with something that gets you closer to "real" text near the end - something like the length over which the verification algorithm runs, or complete words found.
using a graph approach. you would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. i'm not sure how you'd handle reaching the end of the string and restarting, though. perhaps 10 sentences is sufficient to identify with strong probability good starting candidates (from letter frequency) - wouldn't surprise me.
this is a nice problem :o) i suspect 10 sentences is a strong constraint (for every step you have a good chance of common letter pairs in several strings - you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs) so i think the graph approach would be most efficient.

Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input, and put them in most-common order. Matching most-counted to most-counted, translated the cypher text back into an attempted plain text. It will be close to right, but likely not exactly. By hand, iteratively tune your permutation until plain text emerges (typically few iterations are needed.)
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.

First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenburg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
Second you need a way to produce permutations. One approach, known as hill-climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop, otherwise keep the old permutation and loop. Stop when the number of improvements slows below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings at a different point in the random number generator.
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
By the way, it's "permuted" rather than "permutated."

Sorting Rotated Strings of Burrows Wheeler Transform (BWT) Algorithm

I'm currently working with BWT for fun. :-)
I've learn the BWT and I think BWT isn't complicated theoretically. But, until now I don't know how actually to sort the rotated strings in real implementation.
Should I hold all rotated strings first in to an array so that I can sort them using naive sorting algorithm like Bubble Sort, Selection, or others? Someone tell me that it's bad practice, since saving N elements in to an array need more times.
So, How can I sort my rotated strings while I'm rotating the strings?
Anyone who can answer this, will be very appreciated!
Thank You In Advance!
Thompson

Not quite an answer, but when I implemented a BWT algorithm for a client I used the code presented here as a base.
One item of historical note, it appeared the C qsort was a lot quicker than the C++ std::sort algorithm. Someone on CodeGuru suggested using the std::stable_sort and that pushed the performance up to where the C qsort was. This was in VC6.
Also run tests to find the ideal length of string - the sorting is not linear. I was writing a compression routine for a transmission protocol so the compression had to be enough to pay for itself. If memory serves me right that worked about 4kb on a 733MHz machine.

BWT is a pretty easy to implement, but its weakness is it get slower as the data to be compressed get larger.
I've been do bit a quick analysis to this algorithm and the result is (correct me if i'm wrong) it takes O(n^2) in the worst case, but can achieve constant time in the best case.
It turns out that, the much consume time for BWT is when sorting the rotated strings. Improving the sorting now day seems to be a hot issues for those who love playing with algorithms. :-)
Okay, when you are encoding the data using BWT, the first thing you should do is to place an unique character to the data. It's used to tell the encoder terminated the sorting process when it founds this character. For example: say that you want to compress string "BANANA", and the steps:
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
Rotated strings: $BANANA
Sorting the string with an unique character (EOF) like "$BANANA" will be faster then not using any unique character.
I've been post a poor article about this algorithm...
http://philipstel.wordpress.com/2010/02/10/discussion-of-burrows-wheeler-transform-algorithm/

Detect duplicated/similar text among large datasets?

I have a large database with thousands records. Every time a user post his information I need to know if there is already the same/similar record. Are there any algorithms or open source implementations to solve this problem?
We're using Chinese, and what 'similar' means is the records have most identical content, might be 80%-100% are the same. Each record will not be too big, about 2k-6k bytes

http://d3s.mff.cuni.cz/~holub/sw/shash/
http://matpalm.com/resemblance/simhash/

This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your database the first time, then quartic/cubic to add a record,) so it doesn't scale well, unfortunately there isn't a much better answer that I can think of right now.
The algorithm is called the Ratcliff-Obershelp algorithm, It's implemented in python's difflib. The algorithm itself is cubic time worst case and quadratic expected. Then you have to do that for each possible pair of records, which is quadratic. When adding a record, of course, this is only linear.
EDIT: Sorry, I misread the documentation, difflib is quadratic only, rather than cubic. Use it rather than the other algorithm.

Look at shngle-min-hash techniques. Here is a presentation that could help you.

One approach I have used to do something similar is to construct a search index in the usual based on word statistics and then use the new item as if it was a search against that index - if the score for the top item in the search is too high then the new item is too similar. No doubt some of the standard text search libraries could be used for this although if it is only a few thousands of records it is pretty trivial to build your own.

Initial Genetic Programming Parameters

I did a little GP (note:very little) work in college and have been playing around with it recently. My question is in regards to the intial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?

You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work
better when you have a noisy fitness
function, I think this is because the growth
of sub-groups in the population over successive generations acts
to give more sampling of
the fitness function. I typically use
100 for less noisy/deterministic functions, 1000+
for noisy.
For number of generations it is best to measure improvements in the
fitness function and stop when it
meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out, if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an
implementation without explicit
limits but penalise or eliminate
programs that run too long (which is probably
what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem
to matter all that much. As long as
you have a comprehensive set of mutations, any reasonably balanced
distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.

Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be
solved with another layer of
indirection (except for too many
layers of indirection.)
-David J. Wheeler

When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data variating parameters on a very simple problem and link given operators and parameters values (such as mutation rates, etc) to given results in function of population size etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
talking from my (limited) experience, if you decide to simplify the problem and use a fixed way to implement crossover, selection, and just play with population size and mutation rate (implemented in a given way) trying to come up with general results you'll soon realize that too many variables are still into play because at the end of the day the number of generations after which statistically you will get a decent result (whatever way you wanna define decent) still obviously depend primarily on the problem you're solving and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of effect of given GA parameters!).
It is certainly possible to draft a set of guidelines - as the (rare but good) literature proves - but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somehow an equivalent way (which more often than not means you're ealing with a very similar problem).

Take a look at Koza's voluminous tomes on these matters.

There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient whereas Koza and others often don't deem if worthy to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio