Trie implementation with wildcard values - algorithm

I'm implementing an algorithm to do directory matching. I'm given a set of valid paths that can include wildcards (denoted by "X"). When I pass in an input, I need to know whether that input matches one of the paths in my valid set. I'm running into a problem with wildcards when an input value that should be matched by a wildcard also happens to match another valid literal value. Here is an example:
Set of valid paths:
/guest
/guest/X
/X/guest
/member
/member/friends
/member/X
/member/X/friends
Example input:
/member/garbage/friends
/member/friends/friends
These should both return true, but only the first one does. The first case works because "garbage" does not match any other possible path, but there is a wildcard option at that point, so the search takes the wildcard and moves on; it then sees "friends" and knows it has a match. The second case fails because the search sees the first "friends" and goes down a different branch of my trie, not the wildcard branch: since there is a valid path with "friends" in that position, it assumes that is the right one. When it then sees "friends" again, there are no valid paths with "friends" from that point in the trie, so it returns false.
My question is: how can I get it to recognize the other valid path even when it goes down the wrong branch in the trie? My search code and an example trie diagram are below.
The search algorithm for my trie is as follows:
private TrieNode searchNode(String input) {
    Map<String, TrieNode> children = root.children;
    TrieNode t = null;
    // break up input into individual tokens
    String[] tokenizedLine = input.split("/");
    for (int i = 0; i < tokenizedLine.length; i++) {
        String current = tokenizedLine[i];
        if (children.containsKey(current)) {
            // exact match for this token
            t = children.get(current);
            children = t.children;
        } else if (children.containsKey("X")) {
            // fall back to the wildcard branch
            t = children.get("X");
            children = t.children;
        } else {
            // no exact match and no wildcard: dead end
            return null;
        }
    }
    return t;
}
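(The TrieNode class itself isn't shown in the question; for reading the search code above, something like this minimal shape is assumed:)
// Hypothetical TrieNode consistent with the search code above; the real class
// may differ (for example, it may also track whether a node ends a valid path).
class TrieNode {
    Map<String, TrieNode> children = new HashMap<String, TrieNode>();
    boolean isEndOfPath; // marks that a valid path ends at this node
}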
An image of the trie that would be built with my sample path set:
When I input /member/friends/friends, my algorithm goes down the highlighted path and stops because it does not see another "friends" afterwards. How can I get it to treat the first "friends" as a wildcard value instead, so that it continues and matches the second "friends" after the wildcard?
Thanks for any help!

Figured it out. I implemented some backtracking to keep track of the last node where it saw two possible paths. If it finds a dead end on one path, it goes back to the last point where it saw two possible paths and tries the other. The new algorithm looks like this:
private TrieNode searchNode(String input) {
    Map<String, TrieNode> children = root.children;
    TrieNode t = null;
    // Variables for the backtracking step
    TrieNode alternativeWildcardNode = null;
    int wildcardIndex = 0;
    Map<String, TrieNode> wildcardChildren = root.children;
    // break up input into individual tokens
    String[] tokenizedLine = input.split("/");
    for (int i = 0; i < tokenizedLine.length; i++) {
        String current = tokenizedLine[i];
        if (children.containsKey(current) && children.containsKey("X")) {
            // both an exact match and a wildcard exist here: remember the
            // wildcard branch in case we need to backtrack to it later
            alternativeWildcardNode = children.get("X");
            wildcardIndex = i;
            wildcardChildren = alternativeWildcardNode.children;
            t = children.get(current);
            children = t.children;
        } else if (children.containsKey(current)) {
            t = children.get(current);
            children = t.children;
        } else if (children.containsKey("X")) {
            t = children.get("X");
            children = t.children;
        } else if (alternativeWildcardNode != null) {
            // dead end, but we previously skipped a wildcard option:
            // reset the state to that wildcard branch and resume from there
            t = alternativeWildcardNode;
            alternativeWildcardNode = null;
            i = wildcardIndex;
            children = wildcardChildren;
        } else {
            return null;
        }
    }
    return t;
}
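One thing worth noting: this remembers only the most recent branching point, so it backtracks a single level. A fully general alternative (a sketch, not the poster's code, and returning a boolean rather than the matched node to keep it short) is to search recursively and try both the exact branch and the wildcard branch:
// Hedged sketch of a recursive search that explores both branches.
// Assumes a TrieNode with a Map<String, TrieNode> children field and an
// isEndOfPath flag marking where valid paths end (the original class is not shown).
private boolean matches(TrieNode node, String[] tokens, int i) {
    if (node == null) {
        return false;
    }
    if (i == tokens.length) {
        return node.isEndOfPath; // every token consumed at a valid endpoint
    }
    // Try the exact child first; if that subtree fails, fall back to the wildcard child.
    return matches(node.children.get(tokens[i]), tokens, i + 1)
        || matches(node.children.get("X"), tokens, i + 1);
}
// Usage sketch (skipping any leading empty token produced by split("/")):
// String[] tokens = input.split("/");
// int start = (tokens.length > 0 && tokens[0].isEmpty()) ? 1 : 0;
// boolean ok = matches(root, tokens, start);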

Related

Difference between one pass (scan) and two pass (scan)

I had an interview a day ago.
The interviewer told me to "write a program to add a node at the end of a linked list".
I gave him a solution, but he told me to implement it in one pass (one scan).
Can anybody explain to me what "one pass" means, and how to tell whether a program runs in one pass or two passes?
Here is my code:
public void atLast(int new_data)
{
    Node new_node = new Node(new_data);
    if (head == null)
    {
        head = new_node; // the list was empty: the new node becomes the head
        return;
    }
    new_node.next = null;
    Node last = head;
    while (last.next != null)
    {
        last = last.next; // walk to the last node
    }
    last.next = new_node;
    return;
}
If that is the code you gave, the interviewer must have misread it, because it is a single pass.
In your case a "pass" is your while loop. It could also be done with recursion, a for loop, or any other construct that goes through the elements of the list.
In your code you run through the list of Nodes once and insert the element at the end. This is done in one loop, making it a single pass.
Now consider a case with two passes. Say, for example, you were asked to remove the element with the largest value and wrote something similar to this:
Node temp_node = head; // added: the original snippet assumed this was already set
int index = 0;
int count = 0;
int max = 0;
while (temp_node != null)
{
    if (temp_node.data > max)
    {
        index = count; // remember the position of the maximum
        max = temp_node.data;
    }
    count++;
    temp_node = temp_node.next;
}
for (int i = 0; i < count; i++)
{
    if (i == index)
    {
        // Functionality to remove the node at this position.
    }
}
The first pass (while) detects the Node which has the maximum value. The second pass (for) removes this Node by looping through all the elements again until the correct one is found.
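For comparison, the same removal can be done in a single pass by remembering the node just before the current maximum while scanning (a sketch, not the answerer's code; it assumes the same Node class with data and next fields and a head reference):
// Hedged sketch: remove the node with the largest value in one pass.
Node prev = null;      // node before the current position
Node maxNode = null;   // node with the largest value seen so far
Node prevOfMax = null; // node just before maxNode (null if maxNode is the head)
Node cur = head;
while (cur != null) {
    if (maxNode == null || cur.data > maxNode.data) {
        maxNode = cur;
        prevOfMax = prev;
    }
    prev = cur;
    cur = cur.next;
}
if (maxNode != null) {
    if (prevOfMax == null) {
        head = maxNode.next;           // the maximum was the head
    } else {
        prevOfMax.next = maxNode.next; // unlink the maximum
    }
}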
I'd imagine "two passes" here means that you iterated through the whole list twice in your code. You shouldn't need to do that to add a new node.

I want to create a generic list from a string

I want to create a generic list from a string.
The input string is (a(b,c,u),d,e(f),g(),h,i(j(k,l,m(n))),r)
My class is:
public class Node
{
    public string Name; // method name
    // public decimal Time; // time spent in method
    public List<Node> Children;
}
Child nodes are written inside parentheses ().
Example: a is a parent node and b, c, u are its child nodes, saved in its List<Node>; in the same way, i has j as a child, and j has k, l, and m as its children.
It is similar to a tree:
<.>
|---<a>
| |--- b
| |--- c
| |--- u
|--- d
|---<e>
| |--- f
|---<g>
|--- h
|---<i>
| |---<j>
| | |--- k
| | |--- l
| | |---<m>
| | | |--- n
|--- r
The end result for your Node data structure will look similar to a tree. To achieve what you want you could use a recursive function.
Here is an example of such a function (with comments):
//Recursive function that creates a list of Node objects and their
//children from an input string
//Returns: List of Nodes
//Param string input: the input string used to generate the list
//Param int index: the index to start looping through the string at
public static List<Node> CreateNodeList(string input, int index)
{
    List<Node> nodeList = new List<Node>();
    Node currentNode = new Node();
    StringBuilder nameBuilder = new StringBuilder();
    //Start looping through the characters in the string at the
    //specified index
    for (int i = index; i < input.Length; i++)
    {
        switch (input[i])
        {
            //If we see an open bracket we need to use the enclosed
            //list to create the current node's children.
            //We do this by recursively calling this function
            //(passing the input string and the next index as
            //parameters) and setting the Children property to
            //the return value
            case '(':
                currentNode.Children = CreateNodeList(input, i + 1);
                //Skip ahead to the ')' that matches this '(' by counting
                //nesting depth, so nested groups like i(j(k,l,m(n))) work
                int depth = 1;
                while (depth > 0)
                {
                    i++;
                    if (input[i] == '(') depth++;
                    else if (input[i] == ')') depth--;
                }
                break;
            //If we see a closing bracket we create a new Node based
            //on the name built so far, add it to the list, and then
            //return the list (skipping empty groups such as "()")
            case ')':
                if (nameBuilder.Length > 0 || currentNode.Children != null)
                {
                    currentNode.Name = nameBuilder.ToString();
                    nodeList.Add(currentNode);
                }
                return nodeList;
            //If we see a comma we finish the current node based
            //on the name built so far and add it to the list
            case ',':
                currentNode.Name = nameBuilder.ToString();
                nodeList.Add(currentNode);
                nameBuilder.Clear();
                currentNode = new Node();
                break;
            //Any other character must be part of the name of a node,
            //so we append it to our string builder and continue looping
            default:
                nameBuilder.Append(input[i]);
                break;
        }
    }
    //We will probably never reach here since your input string
    //usually ends in ')', but just in case we finish by returning
    //the list
    return nodeList;
}
//An example of how to use this recursive function
public static void Main()
{
    //Your input string
    string input = "(a(b,c,u),d,e(f),g(),h,i(j(k,l,m(n))),r)";
    //Call our function with the input string and 1 as arguments
    //We use 1 to skip the first '(' which is at index 0
    List<Node> nodeList = CreateNodeList(input, 1);
    //Do stuff with the list here
}
This function collects characters for node names, creating a new node and adding it to the list every time it sees a ',' or ')' (returning the list when it sees a ')'). It also populates a node's children when it sees a '(' character by recursively calling the function and using its return value. The one major downside is that you have to keep track of the index you're on and skip past each bracketed group after recursing into it.
This function was written freehand, but it's meant to be very close to C# (you didn't specify a language, so I hope this helps).
I hope this is what you're looking for :)

All anagrams in a File

Source : Microsoft Interview Question
We are given a file containing words. We need to determine all the anagrams present in it.
Can someone suggest the most efficient algorithm to do this?
The only way I know of is sorting all the words and then checking.
It would be good to know more about the data before suggesting an algorithm, but let's just assume that the words are English and all in a single case.
Let's assign each letter a prime number from 2 to 101. For each word we can compute its "anagram number" by multiplying the primes corresponding to its letters; anagrams end up with the same number.
Let's declare a dictionary of {number, list} pairs, and one list to collect the resulting anagrams into.
Then we can collect anagrams in two steps: simply traverse the file and put each word into the dictionary list matching its "anagram number"; then traverse the map and, for every list with length greater than 1, store its contents in a single big anagram list.
UPDATE:
from functools import reduce
import operator

words = ["thore", "ganamar", "notanagram", "anagram", "other"]

letter_code = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17, 'h': 19,
               'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47,
               'p': 53, 'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79,
               'w': 83, 'x': 89, 'y': 97, 'z': 101}

def evaluate(word):
    # product of the primes for each letter: identical for all anagrams of a word
    return reduce(operator.mul, [letter_code[letter] for letter in word])

anagram_map = {}
anagram_list = []
for word in words:
    anagram_number = evaluate(word)
    if anagram_number in anagram_map:
        anagram_map[anagram_number] += [word]
    else:
        anagram_map[anagram_number] = [word]
    if len(anagram_map[anagram_number]) == 2:
        # second word with this number: both it and the first one are anagrams
        anagram_list += anagram_map[anagram_number]
    elif len(anagram_map[anagram_number]) > 2:
        anagram_list += [word]

print(anagram_list)
Of course the implementation can be optimized further. For instance, you don't really need a map of anagram lists, just counters would do fine. But I guess the code illustrates the idea best as it is.
You can use tries. A trie (derived from "retrieval") is a multi-way search tree used for pattern matching. Its most common use is in spell checkers, but I think it can help in your case.
Have a look at this link http://ww0.java4.datastructures.net/handouts/Tries.pdf
I just did this one not too long ago, in a different way:
split the file content into an array of words
create a HashMap that maps a key string to a linked list of strings
for each word in the array, sort the letters in the word and use that as the key to a linked list of anagrams
public static HashMap<String, LinkedList<String>> allAnagrams2(String s) {
    // keep only letters and whitespace, then split into words
    String[] input = s.toLowerCase().replaceAll("[^a-z\\s]", "").split("\\s+");
    HashMap<String, LinkedList<String>> hm = new HashMap<String, LinkedList<String>>();
    for (int i = 0; i < input.length; i++) {
        String current = input[i];
        char[] chars = current.toCharArray();
        Arrays.sort(chars);
        String key = new String(chars); // the sorted letters identify the anagram group
        LinkedList<String> ll = hm.containsKey(key) ? hm.get(key) : new LinkedList<String>();
        ll.add(current);
        if (!hm.containsKey(key))
            hm.put(key, ll);
    }
    return hm; // every list with more than one entry is a group of anagrams
}
A slightly different approach from the one above, returning the set of words that have at least one anagram instead.
public static HashSet<String> anagrams(String[] list) {
    // maps the sorted letters of a word to the first word seen with those letters
    HashMap<String, String> hm = new HashMap<String, String>();
    HashSet<String> anagrams = new HashSet<String>();
    for (int i = 0; i < list.length; i++) {
        char[] chars = list[i].toCharArray();
        Arrays.sort(chars);
        String k = new String(chars); // note: chars.toString() would not give the letters
        if (hm.containsKey(k)) {
            anagrams.add(list[i]);
            anagrams.add(hm.get(k)); // the earlier word with the same letters is an anagram too
        } else {
            hm.put(k, list[i]);
        }
    }
    return anagrams;
}

Algorithm to quickly traverse a large binary file

I have a problem to solve involving reading large files, and I have a general idea how to approach it, but would like to see if there might be a better way.
The problem is the following: I have several huge disk files (64 GB each) filled with records of 2.5 KB each (around 25,000,000 records total). Each record has, among other fields, a timestamp and an isValid flag indicating whether the timestamp is valid or not. When the user enters a timespan, I need to return all records whose timestamp is within the specified range.
The layout of the data is such that, for all records marked "Valid", the timestamp monotonically increases. Invalid records should not be considered at all. So, this is what the file generally looks like (although the ranges are far larger):
a[0] = { Time=11, IsValid = true };
a[1] = { Time=12, IsValid = true };
a[2] = { Time=13, IsValid = true };
a[3] = { Time=401, IsValid = false }; // <-- should be ignored
a[4] = { Time=570, IsValid = false }; // <-- should be ignored
a[5] = { Time=16, IsValid = true };
a[6] = { Time=23, IsValid = true }; // <-- time-to-index offset changed
a[7] = { Time=24, IsValid = true };
a[8] = { Time=25, IsValid = true };
a[9] = { Time=26, IsValid = true };
a[10] = { Time=40, IsValid = true }; // <-- time-to-index offset changed
a[11] = { Time=41, IsValid = true };
a[12] = { Time=700, IsValid = false }; // <-- should be ignored
a[13] = { Time=43, IsValid = true };
If the offset between a timestamp and its index were constant, seeking the first record would be an O(1) operation (I would simply jump to the index). Since it isn't, I am looking for a different way to (quickly) find this information.
One way might be a modified binary search, but I am not completely sure how to handle larger blocks of invalid records. I suppose I could also create an "index" to speed up lookups, but since there will be many large files like this, and the extracted data size will be much smaller than the entire file, I don't want to traverse each of these files, record by record, to generate the index. I am wondering whether a binary search would also help while building the index.
Not to mention that I'm not sure what would be the best structure for the index. Balanced binary tree?
You can use a modified binary search. The idea is to do the usual binary search to figure out the lower bound and the upper bound, and then return the valid entries in between.
The modification handles the case where the current entry is invalid: you then scan outward to find the two nearest points that hold a valid entry.
E.g. if the midpoint is 3,
a[0] = { Time=11, IsValid = true };
a[1] = { Time=12, IsValid = true };
a[2] = { Time=401, IsValid = false };
a[3] = { Time=570, IsValid = false }; // <-- Mid point.
a[4] = { Time=571, IsValid = false };
a[5] = { Time=16, IsValid = true };
a[6] = { Time=23, IsValid = true };
In the above case the scan returns the two points a[1] and a[5]. Based on their timestamps, the algorithm then decides whether to binary search the lower half or the upper half.
It's times like this that using someone else's database code starts to look like a good idea.
Anyway, you need to probe about until you find the start of the valid data and then read until you hit the end.
Start by taking pot shots and moving the markers accordingly, the same as a normal binary search, except that when you hit an invalid record you begin a search for a valid record; just reading forward from the guess is as good as anything.
It's probably worthwhile running a maintenance task over the files to replace the invalid timestamps with valid ones, or perhaps maintaining an external index.
You may bring some randomness into the binary search. In practice, randomized algorithms perform well on large data sets.
It does sound like a modified binary search can be a good solution. If large blocks of invalid records are a problem, you can handle them by skipping blocks of exponentially increasing size, e.g. 1, 2, 4, 8, .... If this makes you overshoot the end of the current bracket, step back to the end of the bracket and skip backwards in steps of 1, 2, 4, 8, ... to find a valid record reasonably close to the center.
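A rough sketch of that modified binary search in Java (an illustration, not code from the thread; the Record class, the records array, and in-memory access are stand-ins for however the 64 GB files are actually read, e.g. seeking record by record):
// Hedged sketch: find the index of the first valid record with time >= target.
// If the probe lands on an invalid record, scan outward in exponentially
// growing steps until a valid one is found, then continue the binary search.
static int lowerBound(Record[] records, long target) {
    int lo = 0, hi = records.length; // current search window [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int probe = nearestValid(records, mid, lo, hi);
        if (probe < 0) {
            break; // no valid record left anywhere in the window
        }
        if (records[probe].time < target) {
            lo = probe + 1;
        } else {
            hi = probe;
        }
    }
    return lo; // caller should still skip forward over any invalid records from here
}

// Look around 'mid' for a valid record inside [lo, hi), stepping 0, 1, 2, 4, 8, ...
static int nearestValid(Record[] records, int mid, int lo, int hi) {
    for (int step = 0; ; step = (step == 0 ? 1 : step * 2)) {
        int right = mid + step;
        int left = mid - step;
        if (right >= hi && left < lo) {
            return -1; // both directions exhausted
        }
        if (right < hi && records[right].isValid) {
            return right;
        }
        if (left >= lo && records[left].isValid) {
            return left;
        }
    }
}

// Assumed record layout for the sketch
static class Record {
    long time;
    boolean isValid;
}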

How to eliminate duplicate filename in hadoop mapreduce?

I want to eliminate duplicate filenames in the output of my Hadoop MapReduce inverted index program. For example, the output currently looks like
things : doc1,doc1,doc1,doc2
but I want it to be
things : doc1,doc2
Well, you want to remove duplicates that were mapped, i.e. you want to reduce the intermediate value list to an output list with no duplicates. My best bet would be to simply collect the Iterator<Text> in the reduce() method into a Java Set and iterate over that, changing:
while (values.hasNext()) {
    if (!first)
        toReturn.append(", ");
    first = false;
    toReturn.append(values.next().toString());
}
To something like:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
    // copy the value: Hadoop may reuse the same Text instance between calls to next()
    valueSet.add(new Text(values.next()));
}
for (Text value : valueSet) {
    if (!first) {
        toReturn.append(", ");
    }
    first = false;
    toReturn.append(value.toString());
}
Unfortunately I do not know of any better (more concise) way of converting an Iterator to a Set.
This should have a smaller time complexity than orange's solution but a higher memory consumption.
Edit: a bit shorter:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
    Text next = new Text(values.next()); // copy in case Hadoop reuses the instance
    if (!valueSet.contains(next)) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(next.toString());
        valueSet.add(next);
    }
}
contains (just like add) should be constant time, so the whole thing should be O(n) now.
To do this with the minimal amount of code change, just add an if-statement that checks to see if the thing you are about to append is already in toReturn:
if (!first)
    toReturn.append(", ");
first = false;
toReturn.append(values.next().toString());
gets changed to
String v = values.next().toString();
if (toReturn.indexOf(v) == -1) { // indexOf returns -1 if it is not there
    if (!first) {
        toReturn.append(", ");
    }
    toReturn.append(v);
    first = false;
}
The above solution is a bit slow because it has to traverse the entire string every time to see if that string is already there. Likely the best way to do this is to use a HashSet to collect the items and then combine the values in the HashSet into the final output string.
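A minimal sketch of that HashSet approach (an illustration in the same old-API Iterator<Text>/StringBuilder style as the snippets above; key and output are assumed to be the usual reduce() parameters):
// Hedged sketch: collect unique document names, then build the output string once.
Set<String> docs = new LinkedHashSet<String>(); // keeps first-seen order
while (values.hasNext()) {
    docs.add(values.next().toString()); // toString() copies, so instance reuse is harmless
}
StringBuilder toReturn = new StringBuilder();
for (String doc : docs) {
    if (toReturn.length() > 0) {
        toReturn.append(", ");
    }
    toReturn.append(doc);
}
output.collect(key, new Text(toReturn.toString()));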
