Reading a Balanced Binary Tree in Order - binary-tree

I'm working on a project for my data structures class that asks me to read a text file and put each line into a balanced binary tree. It is my understanding that this structure will look like the following:
    1
   / \
  2   3
 / \ / \
4  5 6  7
1 representing the first line, 2 the second, and so on.
If I want to read this in order, how do I go about that?
The way I see it, if I use the order (node, left, right) I would get 1, 2, 4, 5, 3, 6, 7.
Is the only way to do this to assign an integer to each string that records which line it is, and then sort the tree to look like:
    4
   / \
  2   6
 / \ / \
1  3 5  7

From what I understand, you want to try a breadth-first search to read a tree in level order.
http://en.wikipedia.org/wiki/Breadth-first_traversal
That wiki explains a good way to go about implementing a level-order traversal. Hope that helps.
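For illustration, here is a rough sketch of a level-order traversal in Java; the Node class and its line field are assumptions about how the tree might be stored, not part of the question.

import java.util.ArrayDeque;
import java.util.Queue;

class Node {
    String line;          // the text of one line from the file
    Node left, right;
    Node(String line) { this.line = line; }
}

public class LevelOrder {
    // Visits nodes level by level, which recovers the original line order
    // when the tree was filled top-to-bottom, left-to-right.
    static void levelOrder(Node root) {
        if (root == null) return;
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            System.out.println(current.line);
            if (current.left != null) queue.add(current.left);
            if (current.right != null) queue.add(current.right);
        }
    }
}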

You can also use two queues to print the tree level by level, with each level printed on its own line; a sketch follows.
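A minimal sketch of that two-queue idea, reusing the hypothetical Node class from the sketch above:

import java.util.ArrayDeque;
import java.util.Queue;

public class LevelPrinter {
    // Prints one tree level per output line by swapping between two queues:
    // 'current' holds the level being printed, 'next' collects its children.
    static void printByLevel(Node root) {
        Queue<Node> current = new ArrayDeque<>();
        Queue<Node> next = new ArrayDeque<>();
        if (root != null) current.add(root);
        while (!current.isEmpty()) {
            StringBuilder line = new StringBuilder();
            while (!current.isEmpty()) {
                Node n = current.poll();
                line.append(n.line).append(' ');
                if (n.left != null) next.add(n.left);
                if (n.right != null) next.add(n.right);
            }
            System.out.println(line.toString().trim());
            Queue<Node> tmp = current; current = next; next = tmp; // swap queues
        }
    }
}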

Looking for a sorting algorithm

I am looking for a sorting algorithm to help me in my work. My objective is the following: after receiving an input of this kind:
5 4
1 2
2 3
3 4
4 5
The first number on the first line tells me how many ids I have, and the second number tells me how many connections there are. The following lines give the connections and tell me that the first id comes before the second one; for example, 1 comes before 2, 2 comes before 3, and so on. And if an impossible situation occurs:
3 2
1 2
2 3
3 1
or
2 2
1 2
2 1
I want to be able to send an error message.
Is there an algorithm that already does this, or can you give me some guidelines on how to start my work? I do not want your code, just some help/tips/advice. Thanks in advance for your time.
From your description, I think you are probably looking for topological sorting.
This is based on the assumption that the 'impossible situation' occurs when one connection suggests that A comes before B but some other chain of connections suggests that B comes before A.
Link for topological sort:
Topological Sorting
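As a rough illustration (not taken from the question), here is a sketch of Kahn's algorithm in Java, which produces an order and detects the impossible (cyclic) case; the method names and the 1-based ids are assumptions:

import java.util.*;

public class TopoSort {
    // Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    // If not every node can be removed, the constraints contain a cycle
    // (the "impossible situation" from the question).
    static List<Integer> topologicalOrder(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        int[] indegree = new int[n + 1];            // ids are 1-based
        for (int i = 0; i <= n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            indegree[e[1]]++;
        }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int id = 1; id <= n; id++) if (indegree[id] == 0) ready.add(id);
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int id = ready.poll();
            order.add(id);
            for (int next : adj.get(id))
                if (--indegree[next] == 0) ready.add(next);
        }
        if (order.size() != n) throw new IllegalStateException("cycle detected");
        return order;
    }

    public static void main(String[] args) {
        // The example from the question: 5 ids, constraints 1<2, 2<3, 3<4, 4<5.
        int[][] edges = {{1, 2}, {2, 3}, {3, 4}, {4, 5}};
        System.out.println(topologicalOrder(5, edges)); // [1, 2, 3, 4, 5]
    }
}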

Computing the Dot Product for calculating proximity

I have already asked a similar question at Calculating Word Proximity in an inverted Index.
However I felt that the question was too general and not refined enough. So here goes.
I have a List which contains the locations of tokens in a document. For each token it looks like:
public List<int> hitLocation;
Let's say the document is
Java programming language has a name similar to java island in Indonesia however
local language in java bears no resemblance to the programming language called java.
and the query is
java island language
So say I lock on to the Java hit list and attempt to directly calculate the distance between the Java hit list, the Island hit list and the Language hit list.
Now the first problem is that there are 4 occurrences of the java token in the sentence. Which one do I select? Assume I select the first one.
I go on to the island token list and, after comparing, find that it is adjacent to the second occurrence of java. So I change my selection and lock onto the second occurrence of java.
Proceeding to the third token, language, I find that it is situated at quite a distance from my current selection; however, it is quite near the first java occurrence.
So you see the dilemma: if I now revert back to the original selection, i.e. the first occurrence of java, the distance to the second token "island" increases; and if I stay with my current selection, the sheer distance to the second occurrence of the token "language" ruins the relevance.
Previously there was the suggestion of a dot product; however, I am at a loss on how to proceed with that option.
Any other solution would also be welcome.
I understand that this question is quite detailed. However, I have searched long and hard and haven't found any question like this on the topic.
I feel that if this question is answered it will be a great addition to the community and will make anybody designing anything related to relevance quite happy.
Thank you.
You seem to be using the hit lists a little differently than how they are intended to be used (at least given my understanding).
Typically people compare hit lists returned by different documents. This is how they rank one document as being "more relevant" than a different document.
That said, if you want to find all locations of some multi-word phrase like "java island" given the locations of the words "java" and "island" you would...
Get a list of locations for "java"
Get a list of locations for "island"
Sort both lists
Iterate through both lists at the same time. You start by getting the first entry of both lists. Now test this pair of entries, i.e., if these entries are "off by one" you have found one instance of "java island" (or perhaps "island java"). Get the next entry in the list that currently shows the minimum value. Test this new pair of entries. Repeat.
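A compact sketch of that merge-style walk in Java (assuming the hit lists are sorted lists of integer token positions; the method name is just for illustration):

import java.util.List;

public class PhraseFinder {
    // Walks two sorted position lists in step and reports positions where the
    // second word appears exactly one token after the first ("java island").
    static void findAdjacent(List<Integer> first, List<Integer> second) {
        int i = 0, j = 0;
        while (i < first.size() && j < second.size()) {
            int a = first.get(i), b = second.get(j);
            if (b - a == 1) {                 // "off by one": phrase match
                System.out.println("phrase at positions " + a + "," + b);
                i++; j++;
            } else if (a < b) {
                i++;                          // advance the list with the smaller value
            } else {
                j++;
            }
        }
    }
}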
BTW -- The dot product is more useful when comparing 2 different documents.
Well, since you explicitly ask about the dot product suggestion, I'll try to explain a little more formally what I had in mind. Keep in mind that it's not very efficient, as it might convert the complexity from being based on the lengths of the hit lists into something based on the length of the text (unless there's some trick to cut that).
My initial thought was to convert each hit list into a series of binary values of the text's length, high where there's a hit and low otherwise.
For example, java would look like:
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
But since you want proximity, convert each occurrence into a pyramid, for example:
3 2 1 0 0 0 1 2 3 2 1 0 0 0 1 2 3 2 0 0 0 0 0 1 2 3
The same way for island:
0 0 0 0 0 0 0 1 2 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Now a dot product would give you some sort of proximity "score" between the two vectors, since it accumulates all the locations where two words are close (the closer the better). Java and island can be said to have a mutual score of 16. For a higher threshold you could stretch the pyramid further, or play with the shape.
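A small sketch of this pyramid-and-dot-product idea in Java (the function names and the height parameter are assumptions, not part of the original suggestion):

public class ProximityScore {
    // Builds a "pyramid" vector of the given text length: each hit position
    // contributes values height, height-1, ... spreading out on both sides.
    static int[] pyramid(int textLength, int[] hits, int height) {
        int[] v = new int[textLength];
        for (int hit : hits) {
            for (int offset = -(height - 1); offset <= height - 1; offset++) {
                int pos = hit + offset;
                if (pos < 0 || pos >= textLength) continue;
                v[pos] = Math.max(v[pos], height - Math.abs(offset));
            }
        }
        return v;
    }

    // Dot product of two pyramid vectors: a rough proximity score that grows
    // when the two words occur close to each other anywhere in the text.
    static long score(int[] a, int[] b) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += (long) a[i] * b[i];
        return sum;
    }
}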
Now, here you add another requirement that this method isn't very suited for: you also want to catch the exact location of highest proximity. That isn't very well defined IMHO; what if word1 matches word2 (at some level) at position1, but word2 matches word3 at the same level at position2 - which location would you want?
Also, keep in mind that this method is O(text_length * words^2); that might be fine in some cases, but very bad in others (if you're searching the Bible, for example).

Find minimum number of moves for Tower of London task

I am looking for a solution for a task similar to the Tower of Hanoi task; however, this is different from Hanoi because the disks are not constrained by size. The Tower of London task I am creating has 8 disks, instead of the traditional 3 or 5 (as shown in the Wikipedia link). I am using PEBL software, which is "programmed primarily in C++ (although you do not need to know C++ to use PEBL), but also uses flex and bison (GNU versions of lex and yacc) to handle parsing."
Here is a video of what the task looks like in action: http://www.youtube.com/watch?v=IiBJ94HRpeM&noredirect=1
*Each disk is a number, e.g. blue disk = 1, red disk = 2, etc.
   1    \
   2     ----\
   3     ----/        3  1
   4  5 /          2  4  5
=========          =========
The left side consists of the disks you have to move, to match the right side. There are 3 columns.
So if I am making it with 8 disks, I would create a trial to look like this:
   1    \
   2     ----\        7  8
6  3  8  ----/     3  6  1
7  4  5 /          2  4  5
=========          =========
How do I figure out the minimum number of moves needed for the left side to look like the right side? I don't need to use PEBL to code this, but I need to know it, since I am calculating how close to the minimum a person gets on each trial.
The principle is easy, and it's called breadth-first search:
Each state has a certain number of successor states (defined by the moves possible).
1. You start out with a set of states that contains the initial state, and a step number of 0.
2. If the end state is in the set of states, return the step number.
3. Increment the step number.
4. Rebuild the set of states by replacing the current states with each of their successor states.
5. Go to 2.
So, in each step, compute the successor states of your currently available states and look if you reached the target state.
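As a rough sketch (with an assumed string encoding of states and a caller-supplied successor function), the search could look like this in Java; remembering visited states in a map is a stronger form of the predecessor optimisation mentioned below:

import java.util.*;

public class ShortestSolution {
    // Generic breadth-first search over puzzle states. The state encoding and
    // the successors function are assumptions: here a state is simply a
    // string describing the three columns, e.g. "123|4|5".
    static int minMoves(String start, String goal,
                        java.util.function.Function<String, List<String>> successors) {
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String state = queue.poll();
            int d = depth.get(state);
            if (state.equals(goal)) return d;            // first time seen = fewest moves
            for (String next : successors.apply(state)) {
                if (!depth.containsKey(next)) {          // skip already-visited states
                    depth.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return -1; // goal unreachable
    }
}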
BUT, be warned, this can take a while and eat up a lot of memory!
You can optimize a bit in our case, since you can leave out the predecessor state.
Still, you will have 5 possible moves in most states. Which means you will have 5^N states to consider after N steps.
For example, your second example will need 10 moves, if I don't err. This will give you about 10 million states. Most contemporary computers will not be able to search beyond depth 15.
I think that an algorithm to find a solution would be easy and fast, but we have no proof this solution would be the shortest one.

Wikipedia pages co-edit graph extraction using Hadoop

I am trying to build the graph of Wikipedia co-edited pages using Hadoop. The raw data contains the list of edits, i.e. it has one row per edit telling who edited what:
# revisionId pageId userId
1 1 10
2 1 11
3 2 10
4 3 10
5 4 11
I want to extract a graph in which each node is a page, and there is a link between two pages if at least one editor edited both pages (the same editor). For the above example, the output would be:
# edges: pageId1,pageId2
1,2
1,3
1,4
2,3
I am far from being an expert in Map/Reduce, but I think this has to be done in two jobs:
The first job extracts the list of edited pages for each user.
# userId pageId1,pageId2,...
10 1,2,3
11 1,4
The second job takes the output above, and simply generates all pairs of pages that each user edited (these pages have thus been edited by the same user, and will therefore be linked in the graph). As a bonus, we can actually count how many users co-edited each page, to get the weight of each edge.
# pageId1,pageID2 weight
1,2 1
1,3 1
1,4 1
2,3 1
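For illustration, the map phase of this second job boils down to something like the following sketch (plain Java rather than actual Hadoop code; emit() stands in for context.write()):

import java.util.List;

public class PairsMapper {
    // Map phase of the second job (pairs approach): for one user's page list,
    // emit every unordered page pair with a count of 1. Page ids are assumed
    // to be sorted so that each pair has a canonical form.
    static void map(long userId, List<Long> pages) {
        for (int i = 0; i < pages.size(); i++) {
            for (int j = i + 1; j < pages.size(); j++) {
                emit(pages.get(i) + "," + pages.get(j), 1);   // quadratic in pages per user
            }
        }
    }

    static void emit(String pagePair, int count) {
        System.out.println(pagePair + "\t" + count);
    }
}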
I implemented this using Hadoop, and it works. The problem is that the map phase of the second job is really slow (actually, the first 30% is OK, but then it slows down quite a lot). The reason I came up with is that, because some users have edited many pages, the mapper has to generate a lot of these pairs as output. Hadoop thus has to spill to disk, rendering the whole thing pretty slow.
My questions are thus the following:
For those of you who have more experience than I with Hadoop: am I doing it wrong? Is there a simpler way to extract this graph?
Can disk spills be the reason why the map phase of the second job is pretty slow? How can I avoid this?
As a side note, this runs fine with a small sample of the edits. It only gets slow with GBs of data.
Apparently, this is a common problem known as combinations/cross-correlation/co-occurrences, and there are two patterns to solve it using Map/Reduce: Pairs and Stripes:
Map Reduce Design Patterns :- Pairs & Stripes
MapReduce Patterns, Algorithms, and Use Cases (Cross-correlation section)
Pairs and Stripes
The approach I presented in my question is the pairs approach, which usually generates much more data. The stripes approach benefits more from a combiner, and gave better results in my case.
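For comparison with the pairs sketch above, here is what a stripes-style map step might look like (again plain Java rather than Hadoop code, with emit() standing in for the framework call):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StripesMapper {
    // Stripes variant of the same map phase: for each page, emit one "stripe"
    // (a map from neighbouring page to count) instead of many single pairs.
    // A combiner/reducer can then merge stripes per page by adding counts.
    static void map(long userId, List<Long> pages) {
        for (long page : pages) {
            Map<Long, Integer> stripe = new HashMap<>();
            for (long other : pages) {
                if (other != page) stripe.merge(other, 1, Integer::sum);
            }
            emit(page, stripe);   // one record per page, not one per pair
        }
    }

    static void emit(long page, Map<Long, Integer> stripe) {
        System.out.println(page + "\t" + stripe);
    }
}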

Finding common sub trees in a tree

Consider the conceptual diagram below, which is for demonstration purposes only.
Abc     Foo
   \    /  \
    \  /    Foo2
     Bar       \
    /   \      Foo3
 Bar2   Bar3      \
  /  \            Foo4
 X    Y
In the above tree, there is a unique "path", Foo->Bar->Bar2->X. This path is distinct from Abc->Bar->Bar2->X. Obviously this information is lost in the above representation, but assume I have all the individual unique paths stored.
They do, however, share part of the path: "Bar->Bar2->X".
The purpose of the algorithm I'm trying to either find or implement is to aggregate this information so that I do not have to store individual paths. But more importantly, I'm trying to find all these common paths and give them weights. So, for example, in the above case, I could condense the information about "Bar->Bar2->X" and say it happened 2 times. Obviously I'd require it to work for all cases.
And yes, the ultimate idea is to be able to quickly ask the question "Show me all the distinct paths from Foo". In this example there is only 1, Foo->Bar->Bar2->X. Foo->Bar->Bar2->Y and Foo->Bar->Bar3 do not exist. The diagram is for viewing purposes only.
Any ideas?
So this is just a starting point that I hope others will help me fill in, but I would think of the paths as strings and look at the common sub-paths as the common-substring problem, which has been studied quite a bit in the past. Off the top of my head, I might invert each path/string and then build a trie structure from those, because then, by counting the number of keys below a given node, you can see how many times that ending path gets used... There is probably a better and more efficient way, but that should work. Anyone else have ideas on treating them as strings?
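A small sketch of that reversed-path trie in Java (class and method names are just for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SuffixPathTrie {
    // A node of a trie built over *reversed* paths, so that paths sharing the
    // same ending ("Bar -> Bar2 -> X") share a prefix in the trie. 'count'
    // records how many stored paths pass through this node.
    Map<String, SuffixPathTrie> children = new HashMap<>();
    int count = 0;

    void insert(List<String> path) {
        SuffixPathTrie node = this;
        for (int i = path.size() - 1; i >= 0; i--) {       // walk the path backwards
            node = node.children.computeIfAbsent(path.get(i), k -> new SuffixPathTrie());
            node.count++;
        }
    }

    // How often a given path ending occurs, e.g. ["Bar", "Bar2", "X"] -> 2
    // once both Foo->Bar->Bar2->X and Abc->Bar->Bar2->X have been inserted.
    int endingCount(List<String> ending) {
        SuffixPathTrie node = this;
        for (int i = ending.size() - 1; i >= 0; i--) {
            node = node.children.get(ending.get(i));
            if (node == null) return 0;
        }
        return node.count;
    }
}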
You could store each unique path separately. To answer questions such as “who does Foo call”, you could create an index in the form of a hash table.
As an alternative, you could try using a DAWG, but I'm not sure how much it would help in your case.
