Calculating regex efficiency - ruby

How would you go about calculating / finding the number of operations a regex takes to match over a given string? I'd like to develop a program that would allow you to rank regexs in order of efficiency.
Also, is it possible to break out of a regex if the number of operations exceeds a given threshold? I'm hoping to turn this into a web app, so I don't want users entering regexes that could potentially kill the server (if that's even possible).
Many thanks.
Edit: Just to clarify, I'm referring to the superset of plain regexes that includes backtracking (which is therefore non-linear).

The way to find out how many operations it will take to parse a given string is to parse it and count the number of operations. You could do somewhat limited static analysis, but a definitive answer would be tantamount to solving the halting problem.
Trying to rank expressions for any input is even more complex. Take the expression A[0-9]+:
The string "A999" will match, and take roughly O(n) time.
The string "B943" will immediately fail, taking O(1) time.
A regular expression parser is fundamentally just a program. It is almost always not possible to say one program is faster than another in general, only for specific input.
You could try to use static analysis based on some understanding of what the input might be. For example, an expression which can immediately eliminate a large portion of the common inputs might be faster than one which doesn't. I would say that the only way to do this is to also accept a dataset of input strings with a distribution similar to the strings being matched, and either do benchmarks [easy] or analysis [hard] using that data.
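The benchmarking route could be sketched as follows in Python. This is only an illustration under assumptions: the sample inputs and the three patterns are made up for the example, timing is wall-clock via `time.perf_counter`, and the resulting ranking holds only for this particular input set.

```python
import re
import time


def time_pattern(pattern, inputs, repeats=3):
    """Return the best-of-`repeats` wall time to run `pattern` over all inputs."""
    compiled = re.compile(pattern)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in inputs:
            compiled.search(text)
        best = min(best, time.perf_counter() - start)
    return best


def rank_patterns(patterns, inputs):
    """Sort patterns from fastest to slowest on this particular input set."""
    timings = [(time_pattern(p, inputs), p) for p in patterns]
    return [(p, t) for t, p in sorted(timings)]


# Hypothetical sample: the two strings from the answer plus one input
# that triggers heavy backtracking in nested quantifiers.
sample_inputs = ["A999", "B943", "a" * 18 + "b"]
ranking = rank_patterns([r"(a+)+$", r"a+$", r"A[0-9]+"], sample_inputs)
for pattern, seconds in ranking:
    print(f"{pattern!r}: {seconds:.6f}s")
```

Swapping in a different `sample_inputs` list can change the order completely, which is the point made above: the ranking is a property of the expression and the data together.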

Related

What concepts or algorithms exist for parallelizing parsers?

It seems easy to parallelize parsers for large amounts of input data that is already given in a split format, e.g. a large list of individual database entries, or is easy to split by a fast preprocessing step, e.g. parsing the grammatical structure of sentences in large texts.
A bit harder seems to be parallel parsing that already requires quite some effort to locate sub-structures in a given input. Common programming language code looks like a good example. In languages like Haskell, that use layout/indentation for separating individual definitions, you could probably check the number of leading spaces of each line after you've found the start of a new definition, skip all lines until you find another definition and pass each skipped chunk to another thread for full parsing.
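The indentation-based splitting idea could look roughly like this Python sketch. It assumes that every line starting in column 0 begins a new definition, which ignores comments, type signatures and other layout subtleties; `parse_chunk` is a hypothetical stand-in for a real parser.

```python
from concurrent.futures import ThreadPoolExecutor


def split_top_level(source):
    """Split source into chunks, starting a new chunk at each unindented line."""
    chunks, current = [], []
    for line in source.splitlines():
        starts_definition = bool(line) and not line[0].isspace()
        if starts_definition and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def parse_chunk(chunk):
    """Hypothetical stand-in for a real parser: returns the first token of the chunk."""
    return chunk.split()[0]


source = "f x =\n  x + 1\n\ng y =\n  y * 2\nh = 3"
chunks = split_top_level(source)
with ThreadPoolExecutor() as pool:
    parsed = list(pool.map(parse_chunk, chunks))
print(parsed)
```

In CPython a thread pool will not speed up CPU-bound parsing; a process pool would be the usual substitute. The sketch is only meant to show the cheap pre-pass followed by independent per-chunk work.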
When it comes to languages like C, JavaScript etc., that use balanced braces to define scopes, the amount of work for doing the preprocessing would be much higher. You'd need to go through the whole input, thereby counting braces, taking care of text inside string literals and so on. Even worse with languages like XML, where you also need to keep track of tag names in the opening/closing tags.
I found a parallel version of the CYK parsing algorithm that seems to work for all context-free grammars. But I'm curious what other general concepts/algorithms exist that make it possible to parallelize parsers, including such things as the brace counting described above, which would only work for a limited set of languages. This question is not about specific implementations but the ideas such implementations are based on.
I think you will find McKeeman's 1982 paper on Parallel LR Parsing quite interesting, as it appears to be practical and applies to a broad class of grammars.
The basic scheme is standard LR parsing. What is clever is that the (presumably long) input is divided into roughly N equal-sized chunks (for N processors), and each chunk is parsed separately. Because the starting point for a chunk may (must!) be in the middle of some production, McKeeman's individual parsers, unlike classic LR parsers, start with all possible left contexts (requiring that the LR state machine be augmented) to determine which LR items apply to the chunk. (It shouldn't take very many tokens before an individual parser has determined which states really apply, so this isn't very inefficient.) Then the results of all the parsers are stitched together.
He sort of ducks the problem of partitioning the input in the middle of a token. (You can imagine an arbitrarily big string literal containing text that looks like code, to fool the parser that starts in the middle.) What appears to happen is that the parser runs into an error and abandons its parse; the parser to its left takes up the slack. One can imagine the chunk splitter using a little bit of smarts to mostly avoid this.
He goes on to demonstrate a real parser in which speedups are obtained.
Clever, indeed.

Decoding Permutated English Strings

A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.
This is a basic transposition cipher. My question above was simply to determine if it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these will tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to place an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The measure of a good algorithm here isn't merely that it deciphers the text, since that can be done with methods better than brute force, nor simply that it is fast: it should eliminate possibilities as quickly as possible.
Since you can use multiple strings simultaneously, attempting to create words from the rarest characters is going to allow you to test dictionary attacks in parallel. Finding the correct placement of the rarest terms in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch with genetic algorithms. These are meant to advance the research cred of people working in GA, as these are not really optimal in practice. Instead, you should look at some basic optimization methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)
first, you'd need a test for the correct ordering. something fairly simple like being able to break the majority of texts into words using a dictionary ordered by frequency of use without backtracking.
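A test of this kind might be sketched like this in Python. The function name `greedy_coverage` and the tiny word list are assumptions for illustration; a real check would use a large dictionary sorted by frequency of use.

```python
def greedy_coverage(text, words_by_frequency):
    """Fraction of `text` covered by greedy, no-backtracking word matches."""
    covered = 0
    i = 0
    while i < len(text):
        for word in words_by_frequency:
            if text.startswith(word, i):
                covered += len(word)
                i += len(word)
                break
        else:
            i += 1  # no word fits here; skip one character
    return covered / len(text) if text else 0.0


# Hypothetical three-word dictionary, most frequent first.
print(greedy_coverage("thecatsat", ["the", "cat", "sat"]))
print(greedy_coverage("xqzxqz", ["the", "cat", "sat"]))
```

A candidate ordering would count as plausible when the coverage is high. Because the first matching word always wins, very short frequent words can hide longer ones, so the threshold needs tuning.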
once you have that, you can play with various approaches. two i would try are:
using a genetic algorithm, with scoring based on 2 and 3-letter tuples (which you can either get from somewhere or generate yourself). the hard part of genetic algorithms is finding a good description of the process that can be fragmented and recomposed. i would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "dna" is read). also, you might need to extend the scoring with something that gets you closer to "real" text near the end - something like the length over which the verification algorithm runs, or complete words found.
using a graph approach. you would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. i'm not sure how you'd handle reaching the end of the string and restarting, though. perhaps 10 sentences is sufficient to identify with strong probability good starting candidates (from letter frequency) - wouldn't surprise me.
this is a nice problem :o) i suspect 10 sentences is a strong constraint (for every step you have a good chance of common letter pairs in several strings - you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs) so i think the graph approach would be most efficient.
Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input, and put them in most-common order. Matching most-counted to most-counted, translate the ciphertext back into an attempted plain text. It will be close to right, but likely not exactly. By hand, iteratively tune your permutation until plain text emerges (typically few iterations are needed).
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.
First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenberg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
Second you need a way to produce permutations. One approach, known as hill-climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop, otherwise keep the old permutation and loop. Stop when the number of improvements slows below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings at a different point in the random number generator.
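The scoring function and the swap loop might be sketched as below in Python. Several things are assumptions made for the example: the triplet counts come from whatever reference text you pass in, positions are 0-based, the loop runs for a fixed number of steps rather than stopping when improvements slow down, and the user-review step is left out.

```python
import random
from collections import Counter


def trigram_counts(corpus):
    """Count letter triplets in a reference text (letters only, lower-cased)."""
    letters = "".join(ch for ch in corpus.lower() if ch.isalpha())
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))


def score(strings, counts):
    """Sum the reference counts of every triplet in every string."""
    return sum(counts[s[i:i + 3]] for s in strings for i in range(len(s) - 2))


def swap_columns(strings, i, j):
    """Swap positions i and j in all strings at once."""
    swapped = []
    for s in strings:
        chars = list(s)
        chars[i], chars[j] = chars[j], chars[i]
        swapped.append("".join(chars))
    return swapped


def hill_climb(strings, counts, steps=2000, seed=0):
    """Keep a random column swap only when it raises the score."""
    rng = random.Random(seed)
    best, best_score = list(strings), score(strings, counts)
    length = len(strings[0])
    for _ in range(steps):
        i, j = rng.randrange(length), rng.randrange(length)
        candidate = swap_columns(best, i, j)
        candidate_score = score(candidate, counts)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Restarting with a different `seed` corresponds to starting again at a different point in the random number generator.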
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
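The acceptance rule that distinguishes simulated annealing can be sketched in a few lines of Python. The exponential acceptance probability is the commonly used Metropolis rule; the geometric cooling schedule and its rate are assumed choices, not something the answer specifies.

```python
import math
import random


def accept(delta, temperature, rng):
    """Decide whether to keep a candidate whose score changed by `delta`."""
    if delta >= 0:
        return True  # improvements (and ties) are always kept
    # A worse candidate is kept with probability exp(delta / temperature).
    return rng.random() < math.exp(delta / temperature)


def cooled(temperature, rate=0.999):
    """Geometric cooling schedule (the rate is an assumed tuning parameter)."""
    return temperature * rate
```

At a high temperature almost any candidate is kept; as the temperature drops, the rule behaves more and more like plain hill-climbing.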
By the way, it's "permuted" rather than "permutated."

find repeated word in infinite stream of words

You are given an infinite supply of words, arriving one by one. The words can be huge, and their length is unknown. How will you find out whether a new word is a repeat, and what data structure will you use to store the words? This question was asked in an interview. Please help me verify my answer.
Normally use a hash-table to keep track of the count of each word. Since you only have to answer whether the words are duplicated, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.
If the question is related to big data, like how to write a search engine for Google, your answer may need to relate to MapReduce or similar distributed techniques (which are rooted, to some extent, in the same hash table techniques described above).
As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost-efficiently and still be sure to find new words. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems because the words are too big, you can make it more efficient by producing a directed acyclic word graph (DAWG) from the words in order to reduce common suffixes as well as prefixes.
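A minimal trie for repeat detection might look like this in Python. It is a sketch using nested dictionaries; the class and method names are made up for the example, and the DAWG compression step is not shown.

```python
class Trie:
    """Set of words stored as nested dicts, one level per character."""

    _END = object()  # marker key meaning "a word ends here"

    def __init__(self):
        self.root = {}

    def add(self, word):
        """Insert `word`; return True if it was already present."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        seen = Trie._END in node
        node[Trie._END] = True
        return seen


trie = Trie()
for word in ["car", "cart", "car"]:
    print(word, trie.add(word))
```

Because `add` consumes the word one character at a time, the same walk would also work if a very long word were fed in as a stream of characters.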
If all you need to do is efficiently detect if each word is one you've seen before, a Bloom filter is one nice option. It's kind of like a set and a hash table combined in one, and therefore can result in false positives -- for this reason they are sometimes adapted to use additional techniques to reduce that risk. The advantage of Bloom filters is that they are very space efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again, you can only tell whether you've seen them or not.
There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.
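For illustration, a small Bloom filter could be sketched as follows in Python. The bit-array size, the number of hash functions, and the trick of salting SHA-256 to derive several hashes are assumptions chosen for brevity, not tuned values.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        """Derive `num_hashes` bit positions by salting SHA-256 with an index."""
        data = word.encode("utf-8")
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def seen_before(self, word):
        """Record `word`; return True if possibly seen, False if definitely new."""
        present = True
        for pos in self._positions(word):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                present = False
                self.bits[byte] |= mask
        return present
```

A `False` result is always correct; a `True` result may occasionally be a false positive, and the rate grows as more words are added to a fixed-size filter.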

String representations: improvements over ropes?

I want a representation for strings with fast concatenation and editing operations. I have read the paper "Ropes: an Alternative to Strings", but have there been any significant improvements in this area since 1995?
EDIT: One possibility I've considered before is using a 2-3 finger tree with strings as leaves, but I have not done a detailed analysis of this; this gives amortized constant-time addition/deletion on ends and logarithmic (in the number of chunks of the smaller string) concatenation, as opposed to vice versa for ropes.
This is an old question! I wonder if anyone reads this. But still, it's intriguing.
In your comments, you say you look for:
Faster asymptotics, or constant factors, or less memory use
Well, ropes have O(1) concatenation (before any rebalancing) and O(n) iteration. You can't do better than that. Substrings and indexing are obviously going to be more costly. But most use cases for large documents don't require editing or random access. If you only concatenate at the end, a 1D vector/list of strings could improve the insertion time constant. I used to use this in JavaScript because it had such slow string concatenation.
It is said that memory representation is less efficient than using strings.
I doubt that: If you work in a language that has garbage collection, the rope allows you to use the same string fragment instance in multiple places. In a rope that represents an HTML document, there will be many DIV, SPAN and LINK elements. This might even happen automatically, assuming these tags are compile-time constants and you add them to the rope directly. Even for such short phrases, the rope document will reduce in size significantly, to the same order of magnitude as the original string. Longer strings will produce a net gain.
If you also make the tree elements read-only, you can create subropes (longer phrases expressed as ropes) that occur multiple times or are shared across rope-based strings. The downside of this sharing is that such shared rope sections can't be changed: to edit them, or to balance the tree, you need to copy the object graph. But that does not matter if you mostly concatenate and iterate. In a web server, you can keep a subrope that represents the CSS stylesheet declaration that is shared across all HTML documents served by that server.
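The sharing idea can be sketched with a tiny rope in Python. This is a simplified illustration: nodes are treated as read-only by convention only, there is no balancing, and the HTML fragments are made-up examples.

```python
class Rope:
    """Rope node: either a leaf holding a string or a concatenation of two ropes."""

    __slots__ = ("left", "right", "text", "length")

    def __init__(self, text="", left=None, right=None):
        self.text, self.left, self.right = text, left, right
        self.length = len(text) if left is None else left.length + right.length

    def __add__(self, other):
        """O(1) concatenation: allocate one node that points at both operands."""
        return Rope(left=self, right=other)

    def leaves(self):
        """Yield leaf strings left to right using an explicit stack."""
        stack = [self]
        while stack:
            node = stack.pop()
            if node.left is None:
                yield node.text
            else:
                stack.append(node.right)
                stack.append(node.left)

    def __str__(self):
        return "".join(self.leaves())


# Hypothetical shared fragment reused by two documents.
css = Rope("<link rel='stylesheet' href='site.css'>")
page_a = Rope("<html>") + css + Rope("<body>A</body></html>")
page_b = Rope("<html>") + css + Rope("<body>B</body></html>")
print(str(page_a))
```

Both pages point at the same `css` node, so the fragment is stored once. Editing it in place would change both documents, which is why shared sections have to be copied before modification.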

String search algorithms

I have a project on benchmarking string matching algorithms, and I would like to know if there is a standard for every algorithm so that I can get fair results from my experiments. I am planning to use Java's System.nanoTime() to measure the running time of every algorithm. Any comments or reactions regarding my problem are very much appreciated. Thanks!
I am not entirely sure what you're asking. However, I am guessing you are asking how to get the most realistic results. You need to run your algorithm hundreds, or even thousands of iterations to get an average. It is also very important to turn off any caching that your language may do, and don't reuse objects, unless it is part of your algorithm.
I am not entirely sure what you're asking. However, another interpretation of what you are asking can be answered by trying to work out how a given algorithm performs as you increase the size of the problem. Using raw time to compare algorithms at a given string size does not necessarily allow for accurate comparison. Instead, you could try each algorithm with different string sizes and see how the algorithm behaves as string size varies.
And Mark's advice is good too. So you are running repeated trials for many different string lengths to get a picture of how one algorithm works, then repeating that for the next algorithm.
Again, it's not clear what you're asking, but here's another thought in addition to what Tony and Mark said:
Be very careful about testing only "real" input or only "random" input. Some algorithms are tuned to do well on typical input (searching for a word in English text), while others are tuned for working well on pathologically hard cases. You'll need a huge mix of possible inputs of all different types and sizes to do a truly good benchmark.
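Putting the advice from these answers together, a benchmarking harness might be sketched like this. It is written in Python rather than Java for brevity, with `time.perf_counter_ns` playing the role of a nanosecond timer. The random two-letter text, the never-matching pattern, the sizes and the trial count are all assumptions; a fuller benchmark would add real-world and pathological inputs as well.

```python
import random
import statistics
import time


def naive_search(text, pattern):
    """Brute-force substring search; returns the first index or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1


def benchmark(search, sizes, trials=10, seed=0):
    """Median nanoseconds per call for each text size, on random 'ab' text."""
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        text = "".join(rng.choice("ab") for _ in range(size))
        pattern = "a" * 8 + "c"  # never matches, so the whole text is scanned
        samples = []
        for _ in range(trials):
            start = time.perf_counter_ns()
            search(text, pattern)
            samples.append(time.perf_counter_ns() - start)
        results[size] = statistics.median(samples)
    return results


for name, fn in [("naive", naive_search), ("str.find", str.find)]:
    print(name, benchmark(fn, sizes=[1_000, 10_000]))
```

Reporting the median over repeated trials, per input size, shows how each algorithm scales instead of giving a single raw time.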
