implementing a basic search engine with prefix tree - algorithm

The problem is the implementing a prefix tree (Trie) in functional language without using any storage and iterative method.
I am trying to solve this problem. How should I approach this problem ? Can you give me exact algorithm or link which shows already implemented one in any functional language?
Why I am trying to do => creating a simple search engine with an feature of
adding word to tree
searching a word in tree
deleting a word in tree
Why I want to use functional language => I want improve my problem-solving ability a bit further.
NOTE : Since it is my hobby project, I will first implement basic features.
EDIT:
i.) What I mean about "without using storage" => I don't want use variable storage ( ex int a ), reference to a variable, array . I want calculate the result by recursively then showing result to the screen.
ii.) I have wrote some line but then I have erased because what I wrote is made me angry. Sorry for not showing my effort.

Take a look at haskell's Data.IntMap. It is purely functional implementation of
Patricia trie and it's source is quite readable.
bytestring-trie package extends this approach to ByteStrings
There is accompanying paper Fast Mergeable Integer Maps which is also readable and through. It describes implementation step-by-step: from binary tries to big-endian patricia trees.
Here is little extract from the paper.
At its simplest, a binary trie is a complete binary tree of depth
equal to the number of bits in the keys, where each leaf is either
empty, indicating that the corresponding key is unbound, or full, in
which case it contains the data to which the corresponding key is
bound. This style of trie might be represented in Standard ML as
datatype 'a Dict =
Empty
| Lf of 'a
| Br of 'a Dict * 'a Dict
To lookup a value in a binary trie, we simply read the bits of the
key, going left or right as directed, until we reach a leaf.
fun lookup (k, Empty) = NONE
| lookup (k, Lf x) = SOME x
| lookup (k, Br (t0,t1)) =
if even k then lookup (k div 2, t0)
else lookup (k div 2, t1)

The key point in immutable data structure implementations is sharing of both data and structure. To update an object you should create new version of it with the most possible number of shared nodes. Concretely for tries following approach may be used.
Consider such a trie (from Wikipedia):
Imagine that you haven't added word "inn" yet, but you already have word "in". To add "inn" you have to create new instance of the whole trie with "inn" added. However, you are not forced to copy the whole thing - you can create only new instance of the root node (this without label) and the right banch. New root node will point to new right banch, but to old other branches, so with each update most of the structure is shared with the previous state.
However, your keys may be quite long, so recreating the whole branch each time is still both time and space consuming. To lessen this effect, you may share structure inside one node too. Normally each node is a vector or map of all possible outcomes (e.g. in a picture node with label "te" has 3 outcomes - "a", "d" and "n"). There are plenty of implementations for immutable maps (Scala, Clojure, see their repositories for more examples) and Clojure also has excellent implementation of an immutable vector (which is actually a tree).
All operations on creating, updating and searching resulting tries may be implemented recursively without any mutable state.

Related

What abstract data type is this?

Is the following a common data type (i.e. does it have a name)?
Its unique characteristic is, unlike a regular Set, that it contains the "universe" on initialisation with O(C) memory overhead, and a max memory overhead of O(N/2) (which only occurs when you remove every-other element):
> s = new Structure(701)
s = Structure(0-700)
> s.remove(100)
s = Structure(0-99, 101-700)
> s.add(100)
s = Structure(0-700)
> s.remove(200)
s = Structure(0-199, 201-700)
> s.remove(202)
s = Structure(0-199, 201, 203-700)
> s.removeAll()
s = Structure()
Does something like this have a standard name?
I've used this many times in the past and seen it used in things like plane-sweep algorithms for polygon clipping.
Sometimes the abstract data type it represents is just a set, and the data structure is an optimization. I use this for representing the set of matching characters given by a regex expression like [^a-zA-z0-9.-], for example, and to perform intersection, union, and other operations on those sets.
This sort of data structure is implemented on top of some other ordered set or map structure, by simply storing the keys where membership in the set changes instead of the keys in the set itself. In all the other cases where I've seen this sort of thing done, the authors refer to that underlying structure instead of giving a name to the concept itself.
I like the idea of having a name for it, though, since as I said I've used it myself many times. Maybe I would call it an "in & out set" in honor of the hamburger chain I liked the best back when I ate hamburgers.
It's a Compressed Bit Set or Compressed Bitmap.
A Bit Set or Bitmap is a set specifically designed for storing Integers. Most languages offer standard implementations of these. They typically work by assigning a 1 to the Nth bit in an internal array of Integers where N is the number you're adding to the set. 0 indicates the value is not present. The memory usage for these types of Bit Sets is dictated by the largest number you store.
A Compressed Bit Set is one that compacts ranges of 0s and 1s.
In this case, the question demonstrates a type of compression called "run-length-encoding" (thank you #Ralf Kleberhoff), so it is specifically a Run-length Encoded Bitmap.
Common implementations of Compressed Bitmaps (from newest-to-oldest) are:
Roaring Bitmaps (only one to provide "good random access")
EWAH
WAH
Oracle BBC

Constant-time list concatenation in OCaml

Is it possible to implement constant-time list concatenation in OCaml?
I imagine an approach where we deal directly with memory and concatenate lists by pointing the end of the first list to the beginning of the second list. Essentially, we're creating some type of linked-list like object.
With the normal list type, no, you can't. The algorithm you gave is exactly the one implemented ... but you still have to actually find the end of the first list...
There are various methods to implement constant time concatenation (see Okazaki for fancy details). I will just give you names of ocaml libraries that implement it: BatSeq, BatLazyList (both in batteries), sequence, gen, Core.Sequence.
Pretty sure there is a diff-list implementation somewhere too.
Lists are already (singly) linked lists. But list nodes are immutable. So you change any node's pointer to point to anything different. In order to concatenate two lists you must therefore copy all the nodes in the first list.

Build Heap function

In my university notes the pseudocode of Build Heap is written almost like this (Only difference were parenthesis I have brackets):
And I searched on the internet and there are several like this:
But shouldn't be something like that?
BuildHeap(A) {
heapsize <- length[A]
for i <- floor(length[A]/2) downto 1
Heapify(A,i)
}
Why they writing heap_size[A] = length[A]?
If you have many heaps, A, B, C. And only one variable heap-size, How will you remember the sizes of all the heaps? You will have an attribute heap-size for all the heaps.
In many pseudocode the attributes of an object O are written as Attriubute[O] or Attribute(O) , (Sometimes they are also written as O.attribute ).
The first example assumes that you are storing the heap size of a particular heap as an attribute of the heap.
The second example might be storing the heap size in a local variable which gets its value from the length attribute (Length[A]) of the heap.
Here is a text about pseudocode from Introduction To Algorithms:
Compound data are typically organized into objects, which are comprised of attributes or fields . A particular field is accessed using the field name followed by the name of its object in square brackets. For example, we treat an array as an object with the attribute length indicating how many elements it contains. To specify the number of elements in an array A, we write length[A]. Although we use square brackets for both array indexing and object attributes, it will usually be clear from the context which interpretation is intended

Data structure for multi-language dictionary?

One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).
Say you want to build some data structure(s) to implement a multi-language dictionary for let's say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question).
The aim is to store the working vocabulary of each language (i.e. 25,000 words for English etc.) Proper nouns excluded. Not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems.
Also your choice on how we encode and handle accents, diphthongs and language-specific special characters e.g. maybe where possible we transliterate things (e.g. Romanize German ß as 'ss', then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.
Task definition: Whatever data structure(s) you use, you must do both of the following:
The main operation in lookup is to quickly get an indication 'Yes this is a valid word in languages A,B and F but not C,D or E'. So, if N=40 languages, your structure quickly returns 40 Booleans.
The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/... It could be language-specific or language-independent e.g. a shared definition of pizza)
And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time not important. The compactness constraint excludes memory-wasteful approaches like "keep a separate hash for each word" or "keep a separate for each language, and each word within that language".
So:
What are the possible data structures, how do they rank on the
lookup speed/compactness curve?
Do you have a unified structure for all N languages, or partition e.g. the Germanic languages into one sub-structure, Slavic into
another etc? or just N separate structures (which would allow you to
Huffman-encode )?
What representation do you use for characters, accents and language-specific special characters?
Ideally, give link to algorithm or code, esp. Python or else C. -
(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: "Estimation of English and non-English Language Use on the WWW" - Grefenstette & Nioche. And one list of multi-language dictionaries)
Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).
Languages to include:
Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh
Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what's achievable.
I will not win points here, but some things.
A multi-language dictionary is a large and time-consuming undertaking. You did not talk in detail about the exact uses for which your dictionary is intended: statistical probably, not translating, not grammatical, .... Different usages require different data to be collected, for instance classifying "went" as passed tense.
First formulate your first requirements in a document, and with a programmed interface prototype. Asking data structures before algorithmic conception I see often for complex business logic. One would then start out wrong, risking feature creep. Or premature optimisation, like that romanisation, which might have no advantage, and bar bidrectiveness.
Maybe you can start with some active projects like Reta Vortaro; its XML might not be efficient, but give you some ideas for organisation. There are several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (#smci) as belonging to the same (major) entry. You want to take the already programmed stemmers; they often are well-tested and already applied in electronic dictionaries. My advise would be to research those projects without losing to much energy, impetus, to them; just enough to collect ideas and see where they might be used.
The data structures one can think up, are IMHO of secondary importance. I would first collect all in a well defined database, and then generate the software used data structures. You can then compare and measure alternatives. And it might be for a developer the most interesting part, creating a beautiful data structure & algorithm.
An answer
Requirement:
Map of word to list of [language, definition reference].
List of definitions.
Several words can have the same definition, hence the need for a definition reference.
The definition could consist of a language bound definition (grammatical properties, declinations), and/or a language indepedendant definition (description of the notion).
One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).
Remarks
As single words are handled, this does not consider that an occuring text is in general mono-lingual. As a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.
So a over-general abstract data structure would be:
Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
Map<String /*Language/Locale/""*/, List<Definition>> definitions;
class Definition {
String content;
}
class DefinitionEntry {
String language;
Ref<Definition> definition;
}
The concrete data structure:
The wordDefinitions are best served with an optimised hash map.
Please let me add:
I did come up with a concrete data structure at last. I started with the following.
Guava's MultiMap is, what we have here, but Trove's collections with primitive types is what one needs, if using a compact binary representation in core.
One would do something like:
import gnu.trove.map.*;
/**
* Map of word to DefinitionEntry.
* Key: word.
* Value: offset in byte array wordDefinitionEntries,
* 0 serves as null, which implies a dummy byte (data version?)
* in the byte arrary at [0].
*/
TObjectIntMap<String> wordDefinitions = TObjectIntHashMap<String>();
byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.
void walkEntries(String word) {
int value = wordDefinitions.get(word);
if (value == 0)
return;
DataInputStream in = new DataInputStream(
new ByteArrayInputStream(wordDefinitionEntries));
in.skipBytes(value);
int entriesCount = in.readShort();
for (int entryno = 0; entryno < entriesCount; ++entryno) {
int language = in.readByte();
walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
}
}
I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.
A data structure that's often used for fast, compact representations of language is a minimum-state DFA for the language. You could construct this by creating a trie for the language (which is itself an automaton for recognizing strings in the language), then using of the canonical algorithms for constructing a minimum-state DFA for the language. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words. You would just start at the start state and follow the labeled transitions for each of the letters. Each state could encode (perhaps) a 40-bit value encoding for each language whether or not the state corresponds to a word in that language.
Because different languages use different alphabets, it might a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words using the Latin and Cyrillic alphabets, the state transitions for states representing Greek words would probably all be to the dead state on Latin letters, while the transitions for Greek characters for Latin words would also be to the dead state. Having multiple automata for each of these alphabets thus could eliminate a huge amount of wasted space.
A common solution to this in the field of NLP is finite automata. See http://www.stanford.edu/~laurik/fsmbook/home.html.
Easy.
Construct a minimal, perfect hash function for your data (union of all dictionaries, construct the hash offline), and enjoy O(1) lookup for the rest of eternity.
Takes advantage of the fact your keys are known statically. Doesn't care about your accents and so on (normalize them prior to hashing if you want).
I had a similar (but not exactly) task: implement a four-way mapping for sets, e.g. A, B, C, D
Each item x has it's projections in all of the sets, x.A, x.B, x.C, x.D;
The task was: for each item encountered, determine which set it belongs to and find its projections in other sets.
Using languages analogy: for any word, identify its language and find all translations to other languages.
However: in my case a word can be uniquely identified as belonging to one language only, so no false friends such as burro in Spanish is donkey in English, whereas burro in Italian is butter in English (see also https://www.daytranslations.com/blog/different-meanings/)
I implemented the following solution:
Four maps/dictionaries matching the entry to its unique id (integer)
AtoI[x.A] = BtoI[x.B] = CtoI[x.C] = DtoI[x.D] = i
Four maps/dictionaries matching the unique id to the corresponding language
ItoA[i] = x.A;
ItoB[i] = x.B;
ItoC[i] = x.C;
ItoD[i] = x.D;
For each encounter x, I need to do 4 searches at worst to get its id (each search is O(log(N))); then 3 access operations, each O(log(N)). All in all, O(log(N)).
I have not implemented this, but I don't see why hash sets cannot be used for either set of dictionaries to make it O(1).
Going back to your problem:
Given N concepts in M languages (so N*M words in total)
My approach adapts in the following way:
M lookup hashsets, that give you integer id for every language (or None/null if the word does not exist in the language).
Overlapped case is covered by the fact that lookups for different languages will yield different ids.
For each word, you do M*O(1) lookups in the hash sets corresponding to languages, yielding K<=M ids, identifying the word as belonging to K languages;
for each id you need to do (M-1)*O(1) lookups in actual dictionaries, mapping K ids to M-1 translations each)
In total, O(MKM) which I think is not bad, given your M=40 and K will be much smaller than M in most cases (K=1 for quite a lot of words).
As for storage: NM words + NM integers for the id-to-word dictionaries, and the same amount for reverse lookups (word-to-id);

Huffman compression algorithm

I've implemented file compression using huffman's algorithm, but the problem I have is that to enable decompression of the compressed file, the coding tree used, or the codes itself should be written to the file too. The question is: how do i do that? What is the best way to write the coding tree at the beggining of the compressed file?
There's a pretty standard implementation of Huffman Coding in the Basic Compression Library (BCL), including a recursive function that writes the tree out to a file. Look at huffman.C. It just writes out the leaves in order so the decoder can reconstruct the same tree.
BCL is also nice because there are some other pretty straightforward compression algorithm pieces in there, too. It's quite handy if you need to roll your own algorithm.
First off, have you considered using a standard compression Stream (like GZipStream in .net) ?
About how/where to write your data, you can manipulate a Streams position with Seek (even reserve space that way). If you know the size of the Tree ahead of time you can start writing after that position. But you may want to position the coding tree after the actual data, and just make sure you know where that starts. Ie, reserve a little space in front, write the compressed data, record the position, write the tree, go to the front and write out the position.
Assuming you compress on 8-bit symbols (i.e. bytes) and the algorithm is non-adaptive, the simplest way would be to store not the tree but the distribution of the values. For example by storing how often you found byte 0, how often byte 1, ..., how often byte 255. Then when reading back the file you can re-assemble the tree. This is the simplest solution, but requires the most storage space (e.g. to cover large files, you would need 4 bytes per value, i.e. 1kb).
You could optimize this by not storing exactly how often each byte was found in the file, but instead normalizing the values to 0..255 (0 = found least, ...), in which case you would only need to save 256 bytes. Re-assembling of the tree based on these values would result in the same tree. (This is not going to work as pointed out by Edmund and in question 759707 - see there for further links and answers to your question)
P.S.: And as Henk said, using seek() allows you to keep space at the beginning of the file to store the values in later.
Most implementations are using canonical huffman encoding. You have only to store the symbol lengths in a compact way. Hier an implementation: shcodec.
Another way is using a semi-static huffman encoding (periodic rescale), then you have not to store any tree.
Instead of writing the code tree to the file, write how often each character was found, so the decompression program can generate the same tree.
The most naive solution would be to parse the compression tree in pre-order and write the 256 values in the header of your file.
Since every node in a huffman tree is either a branch with two children, or a leaf, you can use a single bit to represent each node unambiguously. For a leaf, follow immediately with the 8 bits for that node.
e.g. for this tree:
/\
/\ A
B /\
C D
You could store 001[B]01[C]1[D]1[A]
(Turns out this is exactly what happens in the huffman.c example posted earlier, but not how it was described above).
it is better to send the frequencies of characters and build the tree at the receiving end. This data will be of constant size for a fixed alphabet. I guess this must be serializable and put in the file. Sending the tree depends on its implementation, for what I have tried, an array based approach leads to more memory left unused for the tree since, the tree may not be a balanced tree most of the time. If the tree was balanced then array representation would have generated the best option.
Harisankar Krishna swamy
Did you try adaptive Huffman coding? From first look it seems the tree need not be sent at all, but more work to optimize and synchronize the tress.

Resources