How to translate the AST generated by ANTLR to its source code - antlr3

I am writing a simple parser and rewriter tool for PL/SQL. I have completed the parser and can get the AST, but now I have two problems:
How can I get certain nodes of the AST so that I can change their values?
After changing the nodes, how can I regenerate the SQL from the updated AST?
Does the ANTLR AST provide an interface for doing this?
Example SQL: select a,b from t where a=2
After parsing the SQL and getting the AST, I want to change the SQL into
select fun(a),b from t where a = fun1(2);
BTW, I generate the AST with the C target of ANTLR.
Thank you for any suggestions!

See my SO answer on how to regenerate source code from an AST.
It's a lot more work than you might think.
ANTLR provides some help in the form of string templates, but you may find these a mixed blessing: they can generate code text, but they generate precisely what is in the template, whereas you may want to regenerate the code according to its original layout, which the layout of the string template will override.

The following code will walk the AST and print all AST nodes to stderr.
The same tree walker is the basis for a tree transformer that can replace tree nodes.
Allocate new tree nodes with: (pANTLR3_BASE_TREE)(psr->adaptor->nilNode(psr->adaptor));
Delete AST nodes with: parentASTnode->deleteChild(parentASTnode, nodeIndex);
[deleteChild does not free the deleted nodes]
Replace nodes with: parentASTnode->replaceChildren(parentASTnode, nStartChildIndex, nStopChildIndex, newASTnode);
[You cannot insert nodes in the middle of an AST tree level; you can only replace nodes or add to the end of the parent node's child list.]
// C++ (std::string) on top of the ANTLR3 C runtime; needs <string>, <cstring> and <cstdio>
void printTree(pANTLR3_BASE_TREE t, int indent)
{
    if (t == NULL)
        return;

    // One space of indentation per tree level
    std::string ind(indent, ' ');
    int children = t->getChildCount(t);

    for (int i = 0; i < children; i++)
    {
        pANTLR3_BASE_TREE child = (pANTLR3_BASE_TREE)(t->getChild(t, i));
        const char *tokenText = (const char *)child->toString(child)->chars;
        fprintf(stderr, "%s%s\n", ind.c_str(), tokenText);
        // Compare the string contents; == on char pointers only compares addresses
        if (strcmp(tokenText, "<EOF>") == 0)
            break;
        printTree(child, indent + 1);
    }
}
// Run the parser
pANTLR3_BASE_TREE langAST = (psr->start_rule(psr)).tree;
// Print the AST
printTree(langAST, 0);
// Get the Parser Errors
int nErrors = psr->pParser->rec->state->errorCount;
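To make the walk / replace / regenerate idea concrete without depending on the ANTLR runtime, here is a minimal Python sketch. The Node class, the node labels (SELECT, COLUMNS, CALL) and the rewrite and to_sql functions are all assumptions made up for this illustration; none of them is ANTLR API. It rewrites the example query from the question and prints it back with one formatting rule per node kind, which also shows why the original layout is lost: a=2 comes back as a = 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical AST node, not the ANTLR runtime's tree type."""
    text: str
    children: List["Node"] = field(default_factory=list)


def rewrite(node: Node, parent: Optional[Node] = None, index: int = 0) -> Node:
    """Return a new tree in which selected nodes are wrapped in CALL nodes."""
    if parent is not None and parent.text == "COLUMNS" and node.text == "a":
        return Node("CALL", [Node("fun"), node])    # a -> fun(a) in the select list
    if parent is not None and parent.text == "=" and index == 1:
        return Node("CALL", [Node("fun1"), node])   # right-hand side -> fun1(...)
    return Node(node.text, [rewrite(c, node, i) for i, c in enumerate(node.children)])


def to_sql(node: Node) -> str:
    """Regenerate text from the tree, one printing rule per node kind."""
    kids = [to_sql(c) for c in node.children]
    if node.text == "SELECT":
        cols, table, cond = kids
        return f"select {cols} from {table} where {cond}"
    if node.text == "COLUMNS":
        return ",".join(kids)
    if node.text == "=":
        return f"{kids[0]} = {kids[1]}"
    if node.text == "CALL":
        return f"{kids[0]}({kids[1]})"
    return node.text  # leaf: identifier or literal


# Hand-built tree for: select a,b from t where a=2
tree = Node("SELECT", [
    Node("COLUMNS", [Node("a"), Node("b")]),
    Node("t"),
    Node("=", [Node("a"), Node("2")]),
])
print(to_sql(tree))
print(to_sql(rewrite(tree)))
```

With a real grammar the printing rules grow with the number of node kinds, which is where most of the work mentioned above goes.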

Related

Algorithm / data structure for resolving nested interpolated values in this example?

I am working on a compiler and one aspect currently is how to wait for interpolated variable names to be resolved. So I am wondering how to take a nested interpolated variable string and build some sort of simple data model/schema for unwrapping the evaluated string so to speak. Let me demonstrate.
Say we have a string like this:
foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}
That has 1, 2, and 3 levels of nested interpolations in it. So essentially it should resolve something like this:
wait for x, y, one, two, and c to resolve.
when both x and y resolve, then resolve a{x}-{y} immediately.
when both one and two resolve, resolve baz{one}-{two}.
when a{x}-{y}, baz{one}-{two}, and c all resolve, then finally resolve the whole expression.
I am shaky on my understanding of the logic flow for handling something like this, wondering if you could help solidify/clarify the general algorithm (high level pseudocode or something like that). Mainly just looking for how I would structure the data model and algorithm so as to progressively evaluate when the pieces are ready.
I'm starting out trying and it's not clear what to do next:
{
dependencies: [
{
path: [x]
},
{
path: [y]
}
],
parent: {
dependency: a{x}-{y} // interpolated term
parent: {
dependencies: [
{
}
]
}
}
}
Some sort of tree is probably necessary, but I am having trouble figuring out what it might look like, wondering if you could shed some light on that with some pseudocode (or JavaScript even).
watch the leaf nodes at first
then, when the children of a node are completed, propagate upward to resolving the next parent node. This would mean once x and y are done, it could resolve a{x}-{y}, but then wait until the other nodes are ready before doing the final top-level evaluation.
You can just simulate it by sending "events" to the system theoretically, like:
ready('y')
ready('c')
ready('x')
ready('a{x}-{y}')
function ready(variable) {
if ()
}
...actually that may not work, not sure how to handle the interpolated nodes in a hacky way like that. But even a high level description of how to solve this would be helpful.
export type SiteDependencyObserverParentType = {
observer: SiteDependencyObserverType
remaining: number
}
export type SiteDependencyObserverType = {
children: Array<SiteDependencyObserverType>
node: LinkNodeType
parent?: SiteDependencyObserverParentType
path: Array<string>
}
(What I'm currently thinking, some TypeScript)
Here is an approach in JavaScript:
Parse the input string to create a Node instance for each {} term, and create parent-child dependencies between the nodes.
Collect the leaf Nodes of this tree as the tree is being constructed: group these leaf nodes by their identifier. Note that the same identifier could occur multiple times in the input string, leading to multiple Nodes. If a variable x is resolved, then all Nodes with that name (the group) will be resolved.
Each node has a resolve method to set its final value
Each node has a notify method that any of its child nodes can call in order to notify it that the child has been resolved with a value. This may (or may not yet) lead to a cascading call of resolve.
In a demo, a timer is set up that at every tick will resolve a randomly picked variable to some number
I think that in your example, foo, and a might be functions that need to be called, but I didn't elaborate on that, and just considered them as literal text that does not need further treatment. It should not be difficult to extend the algorithm with such function-calling features.
class Node {
constructor(parent) {
this.source = ""; // The slice of the input string that maps to this node
this.texts = []; // Literal text that's not part of interpolation
this.children = []; // Node instances corresponding to interpolation
this.parent = parent; // Link to parent that should get notified when this node resolves
this.value = undefined; // Not yet resolved
}
isResolved() {
return this.value !== undefined;
}
resolve(value) {
if (this.isResolved()) return; // A node is not allowed to resolve twice: ignore
console.log(`Resolving "${this.source}" to "${value}"`);
this.value = value;
if (this.parent) this.parent.notify();
}
notify() {
// Check if all dependencies have been resolved
let value = "";
for (let i = 0; i < this.children.length; i++) {
const child = this.children[i];
if (!child.isResolved()) { // Not ready yet
console.log(`"${this.source}" is getting notified, but not all dependencies are ready yet`);
return;
}
value += this.texts[i] + child.value;
}
console.log(`"${this.source}" is getting notified, and all dependencies are ready:`);
this.resolve(value + this.texts.at(-1));
}
}
function makeTree(s) {
const leaves = {}; // nodes keyed by atomic names (like "x" "y" in the example)
const tokens = s.split(/([{}])/);
let i = 0; // Index in s
function dfs(parent=null) {
const node = new Node(parent);
const start = i;
while (tokens.length) {
const token = tokens.shift();
i += token.length;
if (token == "}") break;
if (token == "{") {
node.children.push(dfs(node));
} else {
node.texts.push(token);
}
}
node.source = s.slice(start, i - (tokens.length ? 1 : 0));
if (node.children.length == 0) { // It's a leaf
const label = node.texts[0];
leaves[label] ??= []; // Define as empty array if not yet defined
leaves[label].push(node);
}
return node;
}
dfs();
return leaves;
}
// ------------------- DEMO --------------------
let s = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
const leaves = makeTree(s);
// Create a random order in which to resolve the atomic variables:
function shuffle(array) {
for (var i = array.length - 1; i > 0; i--) {
var j = Math.floor(Math.random() * (i + 1));
[array[j], array[i]] = [array[i], array[j]];
}
return array;
}
const names = shuffle(Object.keys(leaves));
// Use a timer to resolve the variables one by one in the given random order
let index = 0;
function resolveRandomVariable() {
if (index >= names.length) return; // all done
console.log("\n---------------- timer tick --------------");
const name = names[index++];
console.log(`Variable ${name} gets a value: "${index}". Calling resolve() on the connected node instance(s):`);
for (const node of leaves[name]) node.resolve(index);
setTimeout(resolveRandomVariable, 1000);
}
setTimeout(resolveRandomVariable, 1000);
Your idea of building a dependency tree is a good one.
Still, I tried to find the simplest possible solution.
It already works, but many optimizations are possible; take it as a proof of concept.
The underlying idea is to produce a list of strings that you can read in order, where each element is something you need to resolve. Each element may be required to resolve something that comes later in the list, and therefore the overall expression. Once you have resolved all the chunks, you have all the pieces needed to resolve the original expression.
It's written in Java; I hope it's understandable.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

public class StackOverflow {

    public static void main(String[] args) {
        String exp = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
        List<String> chunks = expToChunks(exp);
        // Reverse the list so that inner chunks come before the chunks that contain them
        Collections.reverse(chunks);
        System.out.println(chunks);
        // output -> [c, two, one, baz{one}-{two}, y, x, a{x}-{y}]
    }

    public static List<String> expToChunks(String exp) {
        List<String> chunks = new ArrayList<>();
        // Find the first opening brace and its matching closing brace
        int begin = exp.indexOf("{") + 1;
        int numberOfParenthesys = 1;
        int end = -1;
        for (int i = begin; i < exp.length(); i++) {
            char c = exp.charAt(i);
            if (c == '{') numberOfParenthesys++;
            if (c == '}') numberOfParenthesys--;
            if (numberOfParenthesys == 0) {
                end = i;
                break;
            }
        }
        // This condition ends the recursion when no chunk was found
        if (begin > 0 && begin < exp.length() && end > 0) {
            // Add the chunk to the final list
            String substring = exp.substring(begin, end);
            chunks.add(substring);
            // Remove the chunk just handled from the expression
            String newExp = exp.replace("{" + substring + "}", "");
            // Recurse into the chunk to find its inner chunks
            chunks.addAll(Objects.requireNonNull(expToChunks(substring)));
            // Then process the chunks in the rest of the expression
            chunks.addAll(Objects.requireNonNull(expToChunks(newExp)));
        }
        return chunks;
    }
}
Some details on the code:
The following piece finds the begin and end indexes of the first outermost chunk of the expression. The underlying idea is that in a valid expression the number of opening braces must equal the number of closing braces, and the running count of opening (+1) and closing (-1) braces can never become negative.
So once that simple loop finds the count to be 0, it has also found the end of the first chunk.
int begin = exp.indexOf("{") + 1;
int numberOfParenthesys = 1;
int end = -1;
for(int i = begin; i < exp.length(); i++) {
char c = exp.charAt(i);
if (c == '{') numberOfParenthesys ++;
if (c == '}') numberOfParenthesys --;
if (numberOfParenthesys == 0) {
end = i;
break;
}
}
The if condition validates the begin and end indexes and stops the recursion when no more chunks can be found in the remaining expression.
if(begin > 0 && begin < exp.length() && end > 0) {
...
}
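The same brace counting can also be done in a single pass with an explicit stack of open-brace positions, which avoids the repeated substring and replace calls. This is a different technique from the recursive Java code, sketched here in Python; the function name chunks_innermost_first is made up for the illustration. Every {...} term is emitted when its closing brace is seen, so each term appears after all the terms nested inside it.

```python
def chunks_innermost_first(expr: str) -> list:
    """List every {...} term so that each one comes after the terms nested in it."""
    order = []
    starts = []  # indexes of the '{' characters that are still open
    for i, ch in enumerate(expr):
        if ch == "{":
            starts.append(i)
        elif ch == "}":
            if not starts:
                raise ValueError(f"unmatched '}}' at index {i}")
            begin = starts.pop()
            order.append(expr[begin + 1:i])
    if starts:
        raise ValueError(f"unmatched '{{' at index {starts[-1]}")
    return order


print(chunks_innermost_first("foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}"))
```

Unlike the replace-based version, this one keeps repeated terms as separate entries, because it works on positions rather than on text.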

Insert and delete in a multi level sorted linked list

2->7->8->11
|
13->16->17->21
|
22->23->27->29
|
30->32
A sorted linked list is given as above, where each node has two pointers, next and down. The down pointer of the first node of each row points to the first node of the next row. Each row has 4 elements, except the last one, which can have <= 4 elements. The first element of each row is greater than the last element of the previous row. We need to design and code insertion of a new value at the correct place, and deletion. I could not solve this problem.
The structure representation and pseudocode for the add operation are as follows.
The delete operation can be implemented recursively, using the add operation as an example.
typedef struct sibling{
int data;
struct sibling *nxt;
} t_sibling;
typedef struct children {
struct sibling *sibling;
struct children *nxt;
} t_children;
add_element(t_children **head, int newdata)
{
t_children *walk_down = *head;
t_children *parent = NULL;
while (walk_down != NULL) {
if(parent == NULL && Compare newdata < head of current walk_down->sibling) {
// Code comes here when we add 1 to above mentioned list example
newdata is added at the head of walk_down->sibling
sibling_list_count++;
if (sibling_list_count > 4) {
taildata = delete_end from tail of walk_down->sibling
add_element(&walk_down, taildata)
}
break;
}
else if(newdata < head of current walk_down->sibling) {
if (Compare newdata > tail of parent sibling) {
// Code comes here when we add 12 to above mentioned list
newdata is added at the head of walk_down->sibling
if (sibling_list_count > 4) {
taildata = delete_end from tail of walk_down->sibling
add_element(&walk_down, taildata)
}
}
else {
// Code comes here when we add 6 to above mentioned list
newdata is added to the appropriate location of parent of sibling
Since above step disturbs the <= 4 property we
taildata = delete_end from tail of parent->sibling
add_element(&walk_down, taildata)
}
break;
}
parent = walk_down;
walk_down = walk_down->nxt;
}
}
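For comparison, here is a deliberately simple Python sketch that keeps the invariants (sorted order, 4 values per row except the last) for both insert and delete. It does not implement the cascading scheme from the pseudocode above; it flattens the structure into a list, changes the list, and relinks the rows. That costs O(n) per operation, but cascading an overflow through the rows touches every later row as well. The Node layout (next, down) follows the question; the helper names build, rows, insert and delete are assumptions of this sketch.

```python
from bisect import insort

ROW_SIZE = 4  # every row except possibly the last holds exactly this many values


class Node:
    def __init__(self, data):
        self.data = data
        self.next = None  # next node in the same row
        self.down = None  # set only on the first node of a row


def rows(head):
    """Read the structure back as a list of rows (lists of values)."""
    out = []
    row = head
    while row is not None:
        node, values = row, []
        while node is not None:
            values.append(node.data)
            node = node.next
        out.append(values)
        row = row.down
    return out


def build(values):
    """Link sorted values into rows of ROW_SIZE and return the head node."""
    head = prev_row = None
    for i in range(0, len(values), ROW_SIZE):
        row_head = prev = None
        for v in values[i:i + ROW_SIZE]:
            node = Node(v)
            if prev is None:
                row_head = node
            else:
                prev.next = node
            prev = node
        if prev_row is None:
            head = row_head
        else:
            prev_row.down = row_head
        prev_row = row_head
    return head


def insert(head, value):
    values = [v for row in rows(head) for v in row]
    insort(values, value)  # keeps the flat list sorted
    return build(values)


def delete(head, value):
    values = [v for row in rows(head) for v in row]
    if value in values:
        values.remove(value)
    return build(values)


head = build([2, 7, 8, 11, 13, 16, 17, 21, 22, 23, 27, 29, 30, 32])
print(rows(insert(head, 12)))
print(rows(delete(head, 13)))
```

Both operations return a new head, so the caller has to keep the returned value.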

Implementing an Iterative Single Stack Binary Tree Copy Function

As a thought exercise I am trying to implement an iterative tree (binary or binary search tree) copy function.
It is my understanding that it can be achieved trivially:
with a single stack
without using a wrapper (that contains references to the copy and original nodes)
without a node having a reference to its parent (would a parent reference in a node run counter to a true definition of a tree [which I believe is a DAG]?)
I have written different implementations that meet the inverse of the above constraints but I am uncertain how to approach the problem with the constraints.
I did not see anything in Algorithms 4/e and have not seen anything online (beyond statements of how trivial it is). I considered using the concepts from in order and post order of a current/previous var but I did not see a way to track accurately when popping the stack. I also briefly considered a hash map but I feel this is still just extra storage like the extra stack.
Any help in understanding the concepts/idioms behind the approach that I am not seeing is gratefully received.
Thanks in advance.
Edit:
Some people asked what I've tried so far. Here is the two-stack solution, which I believe is the one that is supposed to convert most trivially into a single-stack version.
It's written in C++. I am new to the language (but not to programming) and am teaching myself using C++ Primer 5/e (Lippman, Lajoie, Moo) [C++11] and the internet. If any of the code is wrong from a language perspective, please let me know (although I'm aware Code Review Stack Exchange is the place for an actual review).
I have a template Node that is used by other parts of the code.
template<typename T>
struct Node;
typedef Node<std::string> tree_node;
typedef std::shared_ptr<tree_node> shared_ptr_node;
template<typename T>
struct Node final {
public:
const T value;
const shared_ptr_node &left = m_left;
const shared_ptr_node &right = m_right;
Node(const T value, const shared_ptr_node left = nullptr, const shared_ptr_node right = nullptr) : value(value), m_left(left), m_right (right) {}
void updateLeft(const shared_ptr_node node) {
m_left = node;
}
void updateRight(const shared_ptr_node node) {
m_right = node;
}
private:
shared_ptr_node m_left;
shared_ptr_node m_right;
};
And then the 2 stack implementation.
shared_ptr_node iterativeCopy2Stacks(const shared_ptr_node &node) {
const shared_ptr_node newRoot = std::make_shared<tree_node>(node->value);
std::stack<shared_ptr_node> s; // container element types must not be const
s.push(node);
std::stack<shared_ptr_node> copyS; // container element types must not be const
copyS.push(newRoot);
shared_ptr_node original = nullptr;
shared_ptr_node copy = nullptr;
while (!s.empty()) {
original = s.top();
s.pop();
copy = copyS.top();
copyS.pop();
if (original->right) {
s.push(original->right);
copy->updateRight(std::make_shared<tree_node>(original->right->value));
copyS.push(copy->right);
}
if (original->left) {
s.push(original->left);
copy->updateLeft(std::make_shared<tree_node>(original->left->value));
copyS.push(copy->left);
}
}
return newRoot;
}
I'm not fluent in C++, so you'll have to settle for pseudocode:
node copy(treenode n):
if n == null
return null
node tmp = clone(n) //no deep clone!!!
stack s
s.push(tmp)
while !s.empty():
node n = s.pop()
if n.left != null:
n.left = clone(n.left)
s.push(n.left)
if n.right != null:
n.right = clone(n.right)
s.push(n.right)
return tmp
Note that clone(node) is not a deep clone. The basic idea is to start with a shallow clone of the root, then iterate over all children of that node and replace them (they are still references to the original nodes) with shallow copies, then replace those nodes' children, and so on. This algorithm traverses the tree in DFS order. If you prefer BFS (for whatever reason), just replace the stack with a queue. Another advantage of this code: with a few minor changes it works for arbitrary trees.
A recursive version of this algorithm (in case you prefer recursive code over my horrible prose):
node copyRec(node n):
if n.left != null:
n.left = clone(n.left)
copyRec(n.left)
if n.right != null:
n.right = clone(n.right)
copyRec(n.right)
return n
node copy(node n):
return copyRec(clone(n))
EDIT:
If you want to have a look at working code, I've created an implementation in Python.
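A minimal Python sketch of the single-stack pseudocode follows. It is written for this page and is not the implementation the answer links to; the Node class and the name copy_tree are assumptions of the sketch. The stack only ever holds nodes of the copy, and each of them still points at original children until it is popped and its children are replaced by shallow clones.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def shallow_clone(node):
    """New node with the same value whose children are still the original ones."""
    return Node(node.value, node.left, node.right)


def copy_tree(root):
    if root is None:
        return None
    new_root = shallow_clone(root)
    stack = [new_root]  # holds copied nodes whose children are not copied yet
    while stack:
        node = stack.pop()
        if node.left is not None:
            node.left = shallow_clone(node.left)
            stack.append(node.left)
        if node.right is not None:
            node.right = shallow_clone(node.right)
            stack.append(node.right)
    return new_root


original = Node("d", Node("b", Node("a"), Node("c")), Node("e"))
duplicate = copy_tree(original)
print(duplicate.left.right.value)
```

Swapping the list for collections.deque and popping from the left gives the BFS variant mentioned above.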

How to terminate a while loop when an Xpath query returns a null reference html agility pack

I'm trying to loop through every row of a variable-length table on a webpage (http://www.oddschecker.com/golf/the-masters/winner) and extract some data.
The problem is I can't seem to catch the null reference and terminate the loop without it throwing an exception!
int i = 1;
bool test = string.IsNullOrEmpty(doc.DocumentNode.SelectNodes(String.Format("//*[@id='t1']/tr[{0}]/td[3]/a[2]", i))[0].InnerText);
while (test != true)
{
    string name = doc.DocumentNode.SelectNodes(String.Format("//*[@id='t1']/tr[{0}]/td[3]/a[2]", i))[0].InnerText;
    //extract data
    i++;
}
try-catch statements don't catch it either:
bool test = false;
try
{
string golfersName = doc.DocumentNode.SelectNodes(String.Format("//*[@id='t1']/tr[{0}]/td[3]/a[2]", i))[0].InnerText;
}
catch
{
test = true;
}
while (test != true)
{
...
The code logic is a bit off. In the original code, test is evaluated only once, before the loop, so if it is false the loop condition never changes and the loop cannot terminate normally. It seems that you want to do the check in every loop iteration instead of only once at the beginning.
Anyway, there is a better way. You can select all relevant nodes without specifying each <tr> index, and use foreach to loop through the node set:
var nodes = doc.DocumentNode.SelectNodes("//*[@id='t1']/tr/td[3]/a[2]");
foreach(HtmlNode node in nodes)
{
string name = node.InnerText;
//extract data
}
Or use a for loop instead of foreach, if the index of each node is needed for the "extract data" step:
for(i=1; i<=nodes.Count; i++)
{
//array index starts from 0, unlike XPath element index
string name = nodes[i-1].InnerText;
//extract data
}
Side note: To query a single element you can use SelectSingleNode("...") instead of SelectNodes("...")[0]. Both methods return null if no nodes match the XPath expression, so you can check the returned value itself, rather than its InnerText property, to avoid the exception:
var node = doc.DocumentNode.SelectSingleNode("...");
if(node != null)
{
//do something
}

Build trie faster

I'm making a mobile app which needs thousands of fast string lookups and prefix checks. To speed this up, I made a trie out of my word list, which has about 180,000 words.
Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) takes about 10 seconds currently on my phone, which is really slow.
Here's the code that builds the trie.
public SimpleTrie makeTrie(String file) throws Exception {
String line;
SimpleTrie trie = new SimpleTrie();
BufferedReader br = new BufferedReader(new FileReader(file));
while( (line = br.readLine()) != null) {
trie.insert(line);
}
br.close();
return trie;
}
The insert method, which runs in O(length of key):
public void insert(String key) {
TrieNode crawler = root;
for(int level=0 ; level < key.length() ; level++) {
int index = key.charAt(level) - 'A';
if(crawler.children[index] == null) {
crawler.children[index] = getNode();
}
crawler = crawler.children[index];
}
crawler.valid = true;
}
I'm looking for intuitive methods to build the trie faster. Maybe I build the trie just once on my laptop, store it somehow to the disk, and load it from a file in the phone? But I don't know how to implement this.
Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?
Any suggestions are appreciated. Thanks in advance.
EDIT
Someone suggested using Java Serialization. I tried it, but it was very slow with this code:
public void serializeTrie(SimpleTrie trie, String file) {
try {
ObjectOutput out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
out.writeObject(trie);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
public SimpleTrie deserializeTrie(String file) {
try {
ObjectInput in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)));
SimpleTrie trie = (SimpleTrie)in.readObject();
in.close();
return trie;
} catch (IOException | ClassNotFoundException e) {
e.printStackTrace();
return null;
}
}
Can the code above be made faster?
My trie: http://pastebin.com/QkFisi09
Word list: http://www.isc.ro/lists/twl06.zip
Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand
Double-Array tries are very fast to save/load because all data is stored in linear arrays. They are also very fast to lookup, but the insertions can be costly. I bet there is a Java implementation somewhere.
Also, if your data is static (i.e. you don't update it on phone) consider DAFSA for your task. It is one of the most efficient data structures for storing words (must be better than "standard" tries and radix tries both for size and for speed, better than succinct tries for speed, often better than succinct tries for size). There is a good C++ implementation: dawgdic - you can use it to build DAFSA from command line and then use a Java reader for the resulting data structure (example implementation is here).
You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.
public class SimpleTrie {
public class TrieNode {
boolean valid;
int[] children;
}
private TrieNode[] nodes;
private int numberOfNodes;
private TrieNode getNode() {
TrieNode t = nodes[++numberOfNodes];
return t;
}
}
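To show how the index-based layout makes saving and loading cheap, here is a small Python sketch of the same idea. The class name FlatTrie, its methods and the byte layout are assumptions of this sketch, not part of the answer's Java code, and it assumes words made of the letters A-Z only. Child links are stored in one flat integer array (node k owns slots k*26 to k*26+25, with -1 meaning no child), so the whole trie can be written out and read back as raw bytes without rebuilding it.

```python
from array import array

ALPHABET = 26  # assumes upper-case words, A-Z only


class FlatTrie:
    def __init__(self):
        self.children = array("i", [-1] * ALPHABET)  # slots of the root (node 0)
        self.valid = bytearray([0])                  # one flag per node

    def insert(self, word):
        node = 0
        for ch in word:
            slot = node * ALPHABET + (ord(ch) - ord("A"))
            if self.children[slot] == -1:
                self.children[slot] = len(self.valid)  # index of the new node
                self.children.extend([-1] * ALPHABET)
                self.valid.append(0)
            node = self.children[slot]
        self.valid[node] = 1

    def _walk(self, key):
        node = 0
        for ch in key:
            node = self.children[node * ALPHABET + (ord(ch) - ord("A"))]
            if node == -1:
                return -1
        return node

    def contains(self, word):
        node = self._walk(word)
        return node != -1 and self.valid[node] == 1

    def has_prefix(self, prefix):
        return self._walk(prefix) != -1

    def to_bytes(self):
        # Layout: 4-byte node count, valid flags, child slots (native byte order)
        return len(self.valid).to_bytes(4, "little") + bytes(self.valid) + self.children.tobytes()

    @classmethod
    def from_bytes(cls, data):
        count = int.from_bytes(data[:4], "little")
        trie = cls.__new__(cls)  # skip __init__, every field is filled from data
        trie.valid = bytearray(data[4:4 + count])
        trie.children = array("i")
        trie.children.frombytes(data[4 + count:])
        return trie


trie = FlatTrie()
for w in ["TREE", "TRIE", "TOO"]:
    trie.insert(w)
loaded = FlatTrie.from_bytes(trie.to_bytes())
print(loaded.contains("TRIE"), loaded.has_prefix("TR"), loaded.contains("TR"))
```

The child slots are written in the machine's native byte order, so a file produced on one architecture would need a byte swap before being loaded on one with a different endianness.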
Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.
Prefix look-up example:
Compare method:
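private static int compare(String string, String prefix) {
    int n = Math.min(string.length(), prefix.length());
    for (int i = 0; i < n; i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s != p) {
            // -1: prefix sorts before string, 1: prefix sorts after string
            return (p < s) ? -1 : 1;
        }
    }
    // string matches so far but is shorter than prefix, so it sorts before prefix
    // (returning "not found" here would abort the search too early)
    if (string.length() < prefix.length()) return 1;
    return 0;
}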
Finds an occurrence of the prefix in the array and returns its location (Integer.MIN_VALUE or Integer.MAX_VALUE means not found)
private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    // Empty range: without this guard the recursion can index past the array
    if (start > end) return Integer.MAX_VALUE;
    if (start == end) {
        String lastValue = strings[start]; // start==end
        if (compare(lastValue, prefix) == 0)
            return start; // start==end
        return Integer.MAX_VALUE;
    }
    int low = start;
    int high = end + 1; // zero indexed, so add one.
    int middle = low + ((high - low) / 2);
    String middleValue = strings[middle];
    int comp = compare(middleValue, prefix);
    if (comp == Integer.MIN_VALUE) return comp;
    if (comp == 0)
        return middle;
    if (comp > 0)
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}
Gets a String array and prefix, prints out occurrences of prefix in array
private static boolean testPrefix(String[] strings, String prefix) {
int i = recursiveFind(strings, prefix, 0, strings.length-1);
if (i==Integer.MAX_VALUE || i==Integer.MIN_VALUE) {
// not found
return false;
}
// Found an occurrence, now search up and down for other occurrences
int up = i+1;
int down = i;
while (down>=0) {
String string = strings[down];
if (compare(string,prefix)==0) {
System.out.println(string);
} else {
break;
}
down--;
}
while (up<strings.length) {
String string = strings[up];
if (compare(string,prefix)==0) {
System.out.println(string);
} else {
break;
}
up++;
}
return true;
}
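For reference, the same sorted-array prefix lookup is very short when the standard library provides the binary search. A minimal Python sketch, where the function name with_prefix is made up for the illustration: bisect_left finds the first word that is not smaller than the prefix, and all words sharing the prefix form one contiguous block starting there.

```python
from bisect import bisect_left


def with_prefix(sorted_words, prefix):
    """Return all words in a sorted list that start with prefix."""
    lo = bisect_left(sorted_words, prefix)  # first word >= prefix
    hi = lo
    while hi < len(sorted_words) and sorted_words[hi].startswith(prefix):
        hi += 1
    return sorted_words[lo:hi]


words = sorted(["TREE", "TRIE", "TOO", "TRY"])
print(with_prefix(words, "TR"))
```

A prefix check is then just a test on whether the word at index lo exists and starts with the prefix.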
Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".
For example, the input
TREE \b\bIE \b\b\bOO \b\b\b
gives the word list
TREE
TRIE
TOO
On your desktop, construct the trie using whichever method you like and then serialize it with the following recursive algorithm (pseudocode).
serialize(node):
if node is valid: put(' ')
for letter in A-Z:
if node has a child under letter:
put(letter)
serialize(child)
put('\b')
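Here is a Python sketch of both directions of this format, assuming a dictionary-based TrieNode class that is made up for the illustration (the question's trie uses a fixed child array instead). deserialize follows the stack description literally; serialize is the recursive pseudocode with children visited in alphabetical order.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.valid = False


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.valid = True


def serialize(root):
    parts = []

    def visit(node):
        if node.valid:
            parts.append(" ")
        for letter in sorted(node.children):
            parts.append(letter)
            visit(node.children[letter])
            parts.append("\b")

    visit(root)
    return "".join(parts)


def deserialize(data):
    root = TrieNode()
    stack = [root]
    for ch in data:
        if ch == " ":
            stack[-1].valid = True        # mark the node on top of the stack
        elif ch == "\b":
            stack.pop()                   # done with this node
        else:
            child = TrieNode()            # a letter: new child, then descend
            stack[-1].children[ch] = child
            stack.append(child)
    return root


def words(node, prefix=""):
    if node.valid:
        yield prefix
    for letter in sorted(node.children):
        yield from words(node.children[letter], prefix + letter)


example = deserialize("TREE \b\bIE \b\b\bOO \b\b\b")
print(list(words(example)))
```

Because the reader never searches for an existing child, loading costs one node allocation per letter in the input and nothing else.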
This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.
I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:
#include <string>
#include <fstream>
#define USE_NODE_POOL
#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif
struct Node {
void insert(const std::string &s) { insert_helper(s, 0); }
void insert_helper(const std::string &s, int idx) {
if (idx >= s.length()) return;
int char_idx = s[idx] - 'A';
if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
children[char_idx] = &node_pool[node_pool_idx++];
#else
children[char_idx] = new Node();
#endif
}
children[char_idx]->insert_helper(s, idx + 1);
}
Node *children[26] = {};
};
int main() {
#ifdef USE_NODE_POOL
node_pool = new Node[400000];
#endif
Node n;
std::ifstream fin("TWL06.txt");
std::string word;
while (fin >> word) n.insert(word);
}
Tries that preallocate space for all possible children (256) waste a huge amount of space. You are making your cache cry. Store the pointers to children in a resizable data structure.
Some tries will optimize by having one node to represent a long string, and break that string up only when needed.
Instead of a simple file, you can use a database such as SQLite and a nested set or Celko tree to store the trie. You can also build a faster and smaller trie (fewer nodes) by using a ternary search trie.
I don't like the idea of addressing nodes by index in an array, but only because it requires one more addition (index to pointer). But with an array of preallocated nodes you may save some time on allocation and initialization. You can also save a lot of space by reserving the first 26 indices for leaf nodes, so you won't need to allocate and initialize 180000 leaf nodes.
Also, with indices you will be able to read the prepared node array from disk in binary format. This should be several times faster. But I'm not sure how to do this in your language. Is this Java?
If you know that your source vocabulary is sorted, you may also save some time by comparing a prefix of the current string with the previous one, e.g. the first 4 characters. If they are equal, you can start your
for(int level=0 ; level < key.length() ; level++) {
loop from the 5-th level.
Is it space inefficient or time inefficient? If you are rolling a plain trie, then space may be part of the problem when dealing with a mobile device. Check out Patricia/radix tries, especially if you are using it as a prefix look-up tool.
Trie:
http://en.wikipedia.org/wiki/Trie
Patricia/Radix trie:
http://en.wikipedia.org/wiki/Radix_tree
You didn't mention a language but here are two implementations of prefix tries in Java.
Regular trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java
Patricia/Radix (space-efficient) trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java
Generally speaking, avoid creating a lot of objects from scratch in Java; it is slow and has a massive memory overhead. It is better to implement your own pooling class for memory management that allocates, e.g., half a million entries in one go.
Also, serialization is too slow for large lexicons. Use a binary read to populate array-based representations proposed above quickly.
