Data structure for NFA representation

In my lexical analyzer generator I use the McNaughton-Yamada algorithm for NFA construction, and one of its properties is that a transition from state I to state J is labelled with the character at position J.
So each node of the NFA can be represented simply as a list of the next possible states.
Which data structure is best suited for storing this kind of data? It must provide fast lookup of all possible next states and use little space; insertion time is not so important.

My understanding is that you want to encode a graph, where the nodes are states and the edges are transitions, and where every edge is labelled with a character. Is that correct?
The dull but practical answer is to have an object for each state, and to encode the transitions in some little structure in that object.
The simplest one would be an array, indexed by character code: that's as fast as it gets, but not naturally space-efficient. You can make it more space efficient by using a sort of offset, truncated array: store only the part of the array which contains transitions, along with the start and end indices of that part. When looking up a character in it, check that its code is within the bounds; if it isn't, treat it as a null edge (or an edge back to the start state or whatever), and if it is, fetch the element at index (character code - start). Does that make sense?
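For illustration, here is a minimal C++ sketch of that offset/truncated array idea; the names (State, transition) are mine rather than anything from your generator, and -1 stands for "no edge":

    #include <vector>

    struct State {
        int lo = 0, hi = -1;        // character codes actually covered: [lo, hi]
        std::vector<int> next;      // next[c - lo] = target state, or -1 for no edge
    };

    // Returns the target state for character c, or -1 if there is no edge.
    inline int transition(const State& s, unsigned char c) {
        if (c < s.lo || c > s.hi) return -1;   // outside the stored slice
        return s.next[c - s.lo];
    }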
A more complex option would be a little hashtable, which would be more compact but slightly slower. I would suggest closed hashing, because collision lists will use too much memory; linear probing should be enough. You could look into using perfect hashing (look it up), which takes a lot of time to generate the table but then gives collision-free lookup. The generation process is quite complex, though.
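A hedged sketch of what such a per-state probing table could look like (illustrative names, and it assumes the table is sized so it never fills up completely):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct ProbingTable {
        // (character, target state); target == -1 marks an empty slot.
        std::vector<std::pair<unsigned char, int>> slots;

        explicit ProbingTable(std::size_t capacity) : slots(capacity, {0, -1}) {}

        void insert(unsigned char c, int target) {
            std::size_t i = c % slots.size();
            while (slots[i].second != -1) i = (i + 1) % slots.size();  // linear probing
            slots[i] = {c, target};
        }

        int lookup(unsigned char c) const {
            std::size_t i = c % slots.size();
            while (slots[i].second != -1) {
                if (slots[i].first == c) return slots[i].second;
                i = (i + 1) % slots.size();
            }
            return -1;                                                  // no edge on c
        }
    };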
A clever approach is to use both arrays and hashtables, and to pick one or the other based on the number of edges: if the compacted array would be more than, say, a third full, use it, but if not, use a hashtable.
Now, something a bit more radical you could do would be to use arrays, but to overlap them: if they're sparse, they'll have lots of holes in them, and if you're clever, you can arrange them so that the entries in each array line up with holes in the others. That will give you fast lookups, but also excellent memory efficiency. You will need some scheme for distinguishing when a lookup has found something from when it's found an empty slot holding some other state's transition, but I'm sure you can think of something.
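To make the overlap concrete, here is a rough sketch of the usual way this is done with shared next/check tables, where check[] records which state owns each slot (all names are illustrative):

    #include <cstddef>
    #include <vector>

    struct CompressedTable {
        std::vector<int> base;     // base[state] = offset of that state's (virtual) row
        std::vector<int> next;     // next[base[state] + c] = target state
        std::vector<int> check;    // check[base[state] + c] = owning state, or -1

        // Returns the target for (state, c), or -1 if the slot is a hole
        // or belongs to some other state's row.
        int transition(int state, unsigned char c) const {
            std::size_t i = static_cast<std::size_t>(base[state]) + c;
            if (i >= next.size() || check[i] != state) return -1;
            return next[i];
        }
    };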

Related

Generate or find a shortest text given list of words

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate will say whether a continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check whether some cumulative score, based on the ratio of words from your list that appear in the section and penalised by the number of words from outside the list, is greater than some expected value.
What my mind races to when I see something like this is the sliding window technique, described in this article. I don't know if it's a good article since I only scanned it, but it seems decent. The technique is also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left one and a right one. When looking for the shortest continuous fragment of a larger text that satisfies a condition, and given that whenever the condition holds for a fragment it also holds for any larger fragment containing it, you advance the right pointer as long as the condition is unmet; once it is met, you advance the left pointer until the condition stops being met. This repeats until one or both pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indices. In Java these data structures are called HashMap and TreeSet; in C++ they're unordered_map and set. The keys of the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear in the length of the key, which we can treat as constant since most words are fewer than ~10 characters long, and checking how many values in a set fall between two given values is logarithmic relative to the size of the set. So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word exists in the given list or not can also be achieved with a hashmap (or a hashset).
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
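As a concrete (and simplified) illustration, here is a sketch of the caterpillar loop for one particular predicate: find the shortest fragment containing at least `required` distinct words from the list (required >= 1 assumed). It keeps only per-window counts rather than the full index sets described above; all names are illustrative.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // Returns [begin, end) word indices of the shortest fragment containing at
    // least `required` distinct list words, or {0, 0} if no fragment qualifies.
    std::pair<std::size_t, std::size_t> shortestFragment(
            const std::vector<std::string>& text,
            const std::unordered_set<std::string>& list,
            std::size_t required) {
        std::unordered_map<std::string, std::size_t> counts;  // list words in the window
        std::size_t distinct = 0, bestLen = text.size() + 1, bestBegin = 0, bestEnd = 0;
        std::size_t left = 0;
        for (std::size_t right = 0; right < text.size(); ++right) {
            const std::string& w = text[right];
            if (list.count(w) && ++counts[w] == 1) ++distinct;      // count went 0 -> 1
            while (distinct >= required) {                          // condition met: shrink
                if (right - left + 1 < bestLen) {
                    bestLen = right - left + 1;
                    bestBegin = left;
                    bestEnd = right + 1;
                }
                const std::string& v = text[left++];
                if (list.count(v) && --counts[v] == 0) --distinct;  // count went 1 -> 0
            }
        }
        return {bestBegin, bestEnd};
    }

The same skeleton extends to the index-set bookkeeping and the good/bad word counters above; only the condition and the per-step updates change.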
It is now up to you to perhaps find a better heuristic: some other statistics you can easily maintain and check in your predicate. It is important, though, that whenever a fragment meets your condition, a bigger fragment containing it must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.

Gomoku Board representation

I'm working on a Gomoku game and I need an efficient data structure to store the board's state.
I've thought about storing it in a 2D array, but I'm sure that there is a more efficient way.
Thanks
In terms of time efficiency, since I believe you'll mainly be doing index lookups, an array would be pretty much the best choice - it supports this lookup in constant time, with a low constant factor.
In terms of space efficiency:
Each square can be either empty or populated by either player, so there are at most 3 possibilities. For maximum space efficiency, we could store the entire board in a base-3 representation, but, since a computer works in binary, we'd need to process the entire board to determine the value of a given square (thus a simple index lookup would take time proportional to the size of the board; if time really isn't a problem, you could consider this). Instead, I'd recommend using 2 bits per square, which lets us indicate one of 4 possibilities (the 4th being unused).
Many languages have some sort of bitset implementation, allowing you to work with an array of bits, which would be perfect for the above.
You'd also just want a single bitset (not 2D) as there's usually a bit of memory overhead involved in working with 2D structures. The conversion from 2D to 1D is simple - we could convert the 2D index to 1D with either x*height + y or y*width + x.
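A minimal sketch of that layout in C++, assuming 2 bits per square with 0 = empty, 1 = player one, 2 = player two (the class and method names are mine):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class Board {
    public:
        Board(int width, int height)
            : width_(width), bits_((width * height * 2 + 7) / 8, 0) {}

        int get(int x, int y) const {
            std::size_t bit = static_cast<std::size_t>(y * width_ + x) * 2;
            return (bits_[bit / 8] >> (bit % 8)) & 0x3;
        }

        void set(int x, int y, int value) {   // value in {0, 1, 2}
            std::size_t bit = static_cast<std::size_t>(y * width_ + x) * 2;
            bits_[bit / 8] = static_cast<std::uint8_t>(
                (bits_[bit / 8] & ~(0x3 << (bit % 8))) | ((value & 0x3) << (bit % 8)));
        }

    private:
        int width_;
        std::vector<std::uint8_t> bits_;      // 2 bits per square, packed
    };

A standard 15x15 board then fits in 57 bytes.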
Although I'd recommend first being sure that you need to perform this optimization - I believe Gomoku boards are typically small, so even a bulky representation would work perfectly (although some AI techniques make many copies of the board, so, if you're doing that, a minimal representation would make sense).

Data structure for non overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. It is described in detail in Martin Erwig, "Diets for Fat Sets", Journal of Functional Programming, Vol. 8, No. 6, pp. 627-632, 1998. It is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation on Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which rather than keeping only an interval at each node keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published as it no longer seems to exist anywhere on the Internets, at least not under that name.
You probably are looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use a binary search tree for this, in which each node has two data fields: min_val and max_val.
During insertion, you just need an extra merging step that checks whether the left child, parent, and right child form one contiguous sequence, so that they can be combined into a single node. This takes O(log n) time.
Other operations like deletion and lookup also take O(log n) time as usual, but deletion needs some special care.
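If you just need something workable in an imperative language, here is a hedged sketch of the same idea on top of a balanced tree (std::map keyed by interval start). It is not the diet structure from the paper, and removal is omitted, but insert coalesces touching ranges as described:

    #include <algorithm>
    #include <iterator>
    #include <map>

    class IntervalSet {
    public:
        bool contains(long x) const {
            auto it = ranges_.upper_bound(x);        // first interval starting after x
            if (it == ranges_.begin()) return false;
            --it;                                    // interval starting at or before x
            return x <= it->second;
        }

        void insert(long lo, long hi) {              // add the closed range [lo, hi]
            auto it = ranges_.upper_bound(hi + 1);   // intervals past here cannot touch
            while (it != ranges_.begin()) {
                auto prev = std::prev(it);
                if (prev->second + 1 < lo) break;    // gap before [lo, hi]: stop merging
                lo = std::min(lo, prev->first);      // absorb the touching interval
                hi = std::max(hi, prev->second);
                it = ranges_.erase(prev);
            }
            ranges_[lo] = hi;
        }

    private:
        std::map<long, long> ranges_;                // start -> end, disjoint, non-touching
    };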

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allows for the quick, and concurrent, querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible such that its y is greater than or equal to some given value
Find the tuple whose x is as small as possible such that its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional dataset.
Solutions that are not acceptable include a sequential scan of the data, as this could be far too slow.
I think the most appropriate data structures for your requirements are the R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). An R-tree is similar to a B+-tree, but also allows multidimensional queries.
Another relevant data structure is the priority search tree. It is good for queries like your examples 1-3, but not very efficient if you need frequent updates or an on-disk store. For details see this paper or the book "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
    Create a result set of rectangles.
    For each rectangle in your set:
        If the rectangle is a single point, you are done; it is what you are looking for.
        Otherwise, divide the rectangle in two (specify one additional bit of the z-curve):
            If both halves contain a point:
                If one half is better, add that half to your result set of rectangles.
                Otherwise, add both halves to your result set of rectangles.
            Otherwise (only one half contains a point), add that half to your result set of rectangles.
    Search your result set of rectangles.
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
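For reference, building the z-order key itself is just bit interleaving. Here is a sketch assuming 32-bit coordinates for brevity (your 64-bit pairs would need a 128-bit key, e.g. two 64-bit words):

    #include <cstdint>

    // Spread the 32 bits of v so they occupy the even bit positions of a 64-bit word.
    std::uint64_t spreadBits(std::uint32_t v) {
        std::uint64_t x = v;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
        x = (x | (x << 2))  & 0x3333333333333333ULL;
        x = (x | (x << 1))  & 0x5555555555555555ULL;
        return x;
    }

    // Interleave x and y bit by bit; contiguous key ranges then correspond to
    // the nested rectangles described above.
    std::uint64_t mortonKey(std::uint32_t x, std::uint32_t y) {
        return spreadBits(x) | (spreadBits(y) << 1);
    }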
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally; once a leaf is reached, it contains a pointer to the root of the next dimension's B+ tree. Everything in that second B+ tree belongs to the same x value.
The original plan was to store only the unique values for each dimension, each along with its count. This employs a very simple compression algorithm (if you can even call it that) while still allowing the entire data set to be represented. This 'linked' dimension scheme also allows extra dimensions to be added later, as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
    log_b(card(x)) + log_b(card(y))
where b is the base of each B+ tree and card(x) is the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
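A toy in-memory analogue of that linked-dimension idea, using nested ordered maps in place of on-disk B+ trees (purely illustrative, not the author's implementation): a point operation descends one tree per dimension, matching the log_b(card(x)) + log_b(card(y)) estimate above.

    #include <cstdint>
    #include <map>

    using YTree = std::map<std::int64_t, std::uint64_t>;  // y -> count of (x, y) tuples
    using XTree = std::map<std::int64_t, YTree>;          // x -> tree over the y dimension

    void insertTuple(XTree& data, std::int64_t x, std::int64_t y) {
        ++data[x][y];                                      // one descent per dimension
    }

    bool containsTuple(const XTree& data, std::int64_t x, std::int64_t y) {
        auto xi = data.find(x);                            // search the x dimension
        return xi != data.end() && xi->second.count(y) > 0;  // then the y dimension
    }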
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each of which is a pair of a key and a value. Every key and value is a variable-length sequence of bytes; both binary data and character strings can be used as keys and values. There is no concept of data tables or data types. Records are organized in a hash table, a B+ tree, or a fixed-length array.
Tokyo Cabinet is written in the C language and provided with APIs for C, Perl, Ruby, Java, and Lua. It is available on platforms with an API conforming to C99 and POSIX, and it is free software licensed under the GNU Lesser General Public License.
It might be easy for you to embed.

Best Data Structure to Store Large Amounts of Data with Dynamic and Non-unique Keys?

Basically, I have a large number of C structs to keep track of, that are essentially:
    struct Data {
        int key;
        ... // More data
    };
I need to periodically access lots (hundreds) of these, and they must be sorted from lowest to highest key values. The keys are not unique and they will be changed over the course of the program. To make matters even more interesting, the majority of the structures will be culled (based on criteria completely unrelated to the key values) from the pool right before being sorted, but I still need to keep references to them.
I've looked into using a binary search tree to store them, but the keys are not guaranteed to be unique and I'm not entirely sure how to restructure the tree once a key is changed or how to cull specific structures.
To recap in case that was unclear above, I need to:
Store a large number of structures with non-unique and dynamic keys.
Cull a large percentage of the structures (but not free them entirely because different structures are culled each time).
Sort the remaining structures from highest to lowest key value.
What data structure/algorithms would you use to solve this problem? The method needs to be as fast and/or memory efficient as possible, since this is a real-time application.
EDIT: The culling is done by iterating over all of the objects and making a decision for each one. The keys change between the culling/sorting runs. I should have stated that they don't change a lot, but they do change, and they can change multiple times between the culling/sorting runs. (If it helps, the key for each structure is actually a z-order for a Sprite. They need to be sorted before each drawing loop so the Sprites with lower z-orders are drawn first.)
Just stick 'em all in a big array.
When the time comes to do the cull and sort, start by doing the sort. Do an insertion sort. That's right - nothing clever, just an insertion sort.
After the sort, go through the sorted array, and for each object, make the culling decision, then immediately output the object if it isn't culled.
This is about as memory-efficient as it gets. It should also require very little computation: there's no bookkeeping on updates between cull/sort passes, and the sort will be cheap, because insertion sort is adaptive and, for an almost-sorted array like this, it will be almost O(n). The one thing it doesn't give you is cache locality: there will be two separate passes over the array, one for the sort and one for the cull/output.
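A minimal sketch of that sort-then-cull pass; `shouldCull` and `draw` are hypothetical stand-ins for your culling criterion and drawing code:

    #include <cstddef>
    #include <vector>

    struct Data {
        int key;          // z-order
        // ... more data
    };

    bool shouldCull(const Data&) { return false; }  // stub: your culling criterion goes here
    void draw(const Data&) {}                       // stub: your drawing code goes here

    // Adaptive on nearly-sorted input: close to O(n) when keys have barely changed.
    void insertionSortByKey(std::vector<Data>& items) {
        for (std::size_t i = 1; i < items.size(); ++i) {
            Data tmp = items[i];
            std::size_t j = i;
            while (j > 0 && items[j - 1].key > tmp.key) {
                items[j] = items[j - 1];            // shift larger keys one slot right
                --j;
            }
            items[j] = tmp;
        }
    }

    void sortCullAndDraw(std::vector<Data>& items) {
        insertionSortByKey(items);
        for (const Data& d : items) {
            if (!shouldCull(d)) draw(d);            // cull decision, then output
        }
    }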
If you demand more cleverness, then instead of an insertion sort, you could use another adaptive, in-place sort that's faster. Timsort and smoothsort are good candidates; both are utterly fiendish to implement.
The big alternative to this is to sort only the unculled objects, using a secondary, temporary list of such objects which you sort (or keep in a binary tree or whatever). But the thing is, if the keys don't change that much, then the win you get from using an adaptive sort on an almost-sorted array will (I reckon!) outweigh the win you would get from sorting a smaller dataset. It's O(n) vs O(n log n).
The general solution to this type of problem is to use a balanced search tree (e.g. AVL tree, red-black tree, B-tree), which guarantees O(log n) time (almost constant, but not quite) for insertion, deletion, and lookup, where n is the number of items currently stored in the tree. Guaranteeing no key is stored in the tree twice is quite trivial, and is done automatically by many implementations.
If you're working in C++, you could try using std::map<int, yourtype>. If in C, find or implement some simple binary search tree code, and see if it's fast enough.
However, if you use such a tree and find it's too slow, you could look into some more fine-tuned approaches. One might be to put your structs in one big array, radix sort by the integer key, cull on it, then re-sort per pass. Another approach might be to use a Patricia tree.
