The picture below shows the different binary trees that can be made with 3 nodes.
But why is the following case not included in the number of cases?
Is it the same as the third case from the left in the picture above? If so, I think the parent-child relationships would be different.
I'd appreciate it if you could tell me how to determine whether two binary trees are equal or different.
You mean 3 nodes. Your case is not included because all the pictured cases use A as the root node: it is easier to demonstrate the possible shapes by always using the same elements in the same order, i.e. A as root, then B -> C at the top and, symmetrically, C -> B at the bottom.
Using B or C as the root, you could achieve the same number of variations. This number can be computed using the following recurrence:
F(n) = F(0)*F(n-1) + F(1)*F(n-2) + ... + F(n-1)*F(0)
with:
F(0) = 1 (the empty tree counts as one variation)
F(1) = 1
F(2) = 2
In this case, n = 3, so F(0)*F(2) + F(1)*F(1) + F(2)*F(0) = 2 + 1 + 2 = 5
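For what it's worth, here is a minimal Python sketch of that recurrence (these are the Catalan numbers):

def num_trees(n):
    # F(n) = F(0)*F(n-1) + F(1)*F(n-2) + ... + F(n-1)*F(0), with F(0) = 1
    f = [1]
    for k in range(1, n + 1):
        f.append(sum(f[i] * f[k - 1 - i] for i in range(k)))
    return f[n]

print(num_trees(3))  # 5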
So in your picture, only one row is actually relevant if that is supposed to be a binary search tree. Also note that there is a difference between a binary tree and a binary search tree. From the first article below:
As we know, the BST is an ordered data structure that allows no duplicate values. However, Binary Tree allows values to be repeated twice or more. Furthermore, Binary Tree is unordered.
References
https://www.baeldung.com/cs/calculate-number-different-bst
https://en.wikipedia.org/wiki/Binary_tree#Using_graph_theory_concepts
https://encyclopediaofmath.org/wiki/Binary_tree
I happen to be implementing the binary search in Python, but the question has more to do with the binary search structure in general.
Let's assume I have about one thousand eligible candidates I am searching through using binary search, taking the classic approach of bisecting the sorted dataset and repeating this process to narrow down the eligible set to iterate over. The candidates are just strings of names (first-last format, e.g. "Peter Jackson"). I initially sort the set alphabetically and then proceed with bisection using something like this:
def find(names, query):
    # names must be sorted; returns the matching name (lower-cased) or None
    q = query.lower()
    hi = len(names)
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        midval = names[mid].lower()
        if midval < q:
            lo = mid + 1
        elif midval > q:
            hi = mid
        else:
            return midval
    return None
This code is adapted from here: https://stackoverflow.com/a/212413/215608
Here's the thing: the above procedure assumes a single exact match, or no result at all. What if the query were merely for "Peter", but there are several Peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never get so small as to exclude eligible results. The bisection process would have to cease and cede to something like a regex or plain string match in order to return all the Peters.
I'm not so much asking how to accomplish this as what this type of search is called: what is a binary search with a delimited criterion for "bin size" called? Something that conditionally bisects the dataset and, once the criterion is fulfilled, falls back to some other form of string matching to ensure that there can effectively be a trailing wildcard on the query (so that a search for "Peter" will return "Peter Jackson" and "Peter Edwards").
Hopefully I've been clear about what I mean. I realize that in the typical DB scenario the names might be separated; this is just intended as a proof of concept.
I've not come across this type of two-stage search before, so I don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.
Let's say you've run the first stage and have found no match.
You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.
Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.
All of this can be done using O(log n) comparisons.
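For illustration, here is a minimal sketch of that idea (the function names are mine, and I ignore the case-folding from the question for brevity); the logic mirrors bisect_left/bisect_right but compares only x[:len(k)]:

def prefix_left(names, key):
    # leftmost index whose element's prefix is >= key
    lo, hi = 0, len(names)
    while lo < hi:
        mid = (lo + hi) // 2
        if names[mid][:len(key)] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_right(names, key):
    # leftmost index whose element's prefix is > key
    lo, hi = 0, len(names)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < names[mid][:len(key)]:
            hi = mid
        else:
            lo = mid + 1
    return lo

names = sorted(["Peter Edwards", "Peter Jackson", "Paul Smith", "Quincy Adams"])
lo, hi = prefix_left(names, "Peter"), prefix_right(names, "Peter")
print(names[lo:hi])  # ['Peter Edwards', 'Peter Jackson']

Both searches are logarithmic, so the whole range lookup stays O(log n).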
In your binary search you either hit an exact match OR land in the area where the match would be.
So in your case you need to get the upper and lower boundaries (hi and lo, as you call them) of the area that would include the Peters, and return all the intermediate strings.
But if you aim for something like showing the next words of a given word (autocomplete), you should look into Tries instead of BSTs.
Quick question: if a is a regular expression, then is it true that a* = (a*)*?
Is (a*)* a valid expression? If it is, can anyone explain why it is the same as a*? I apologize for asking here, but I couldn't find anything via Google.
Yes, a* = (a*)*; they are the same. Both generate the same language: strings of any number of a's, including the empty string.
L(a*) = {ε, a, aa, aaa, ...} = L((a*)*)
Is (a*)* a valid expression?
Yes, it is a valid regular expression (I see you missed the regex tag). Any regular language (RL) can be represented by a regular expression (RE); an RE is an algebraic way of representing an RL.
Why is it the same?
* means repetition any number of times (including 0 times).
a* means 0, 1, 2, or any number of a's.
(a*)* means repeating any strings from the set a* any number of times (including 0 times).
Because L(a*) contains every string made up of a's, it is a superset of every set of strings of a's, and L((a*)*) is that same set.
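Not a proof, but a quick empirical check in Python: enumerate every string over {a, b} up to length 6 and confirm that both patterns accept exactly the same ones.

import re
from itertools import product

p1 = re.compile(r"a*")
p2 = re.compile(r"(a*)*")

same = all(
    bool(p1.fullmatch(s)) == bool(p2.fullmatch(s))
    for n in range(7)
    for s in ("".join(t) for t in product("ab", repeat=n))
)
print(same)  # True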
I want to preface this by saying that this is a homework assignment.
I am given a set of Q binary input variables that will be used to classify output of Y which is also binary.
The first part of the question is: at most, how many examples do I need to enumerate all possible combinations of Q? I currently think that, since it asks for "at most", I will need Q, as it is possible that all values up to Q-1 are the same (for instance 1) and the item at Q is 0.
The second part of the question is: at most how many leaf nodes can the tree have given Z examples?
My current answer is that at most the tree would have 2 leaf nodes, one representing true and one representing false since it is dealing with binary inputs and binary outputs.
Is this the correct way of examining this problem or am I generalizing my answers too deeply?
Edit
After looking at Cameron's response, I would now turn my first answer into 2^Q; to build on his example of Q = 3, I would get 2^3 = 8 (2*2*2). Please correct me if that is incorrect thinking.
Edit #2
For the second part of the question, it appears as though it should be (2^Q) * Z, or, to provide an example, (2^3) * 3 = 8 * 3 = 24 leaf nodes. To recap: if I have 3 binary inputs, I would initially take 2^3 and get 8; now I want to go over 3 examples, therefore I should get 8 * 3 = 24.
Edit #3
In hindsight it seems that no matter how many examples I use, the number of leaf nodes should never increase, as it is determined on a per-tree basis.
I'd suggest you approach the problem by working out small example cases by hand.
For the first part, choose a small value for Q, say 3, and write down all possible combinations of Q. Then you can figure out how many examples you need. Increase Q and do it again.
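For example, a quick way to write the combinations down in Python (just one possible tool):

from itertools import product

Q = 3  # a small example value
combos = list(product([0, 1], repeat=Q))
print(len(combos))  # 8, i.e. 2**Q
for c in combos:
    print(c)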
For the second part of your question, pick a small Z and run the decision tree algorithm by hand. See how many leaves you get. Then pick another Z and see if/how it changes. Try generating different examples (with the same Z) and see if you can change the number of leaves.
I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I'm not sure how much time you have, or how frequently you have to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and abandoning them for 'Spice' and 'Jet', where both lookups would yield a good combined result.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
In Python, the code is something like this:

from collections import deque

def parse_words(compound_word, dictionary, max_word_length=20):
    # One vertex per index in the string, plus one past the end.
    n = len(compound_word)
    # For each legal dictionary word that is a substring, record an edge
    # from the vertex before its first letter to the one after its last.
    edges = [[] for _ in range(n + 1)]
    for index in range(n):
        for length in range(1, min(n - index, max_word_length) + 1):
            potential_word = compound_word[index:index + length]
            if potential_word in dictionary:
                edges[index].append(index + length)
    # All edges have weight 1, so breadth-first search finds the path
    # with the fewest edges, i.e. the split into the fewest words.
    prev = {0: None}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        if v == n:
            break
        for w in edges[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    if n not in prev:
        return None  # the string cannot be split into dictionary words
    # Change the edges on the path back into words.
    result = []
    v = n
    while prev[v] is not None:
        result.append(compound_word[prev[v]:v])
        v = prev[v]
    result.reverse()
    return result
I, obviously, haven't tested this code exhaustively, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge-creation loops are O(M * N), where N is the length of the compound word and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind; I think its runtime is O(N^2 * log(N)), since the maximum number of possible edges is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (ones that come to mind at present; I know there are more):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together, but it is not a compound word; the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some Python code to try out to make the point. Get yourself a dictionary on disk and have a go:

def opendict(dictionary=r"g:\words\words(3).txt"):
    # Load the word list into a set, one word per line.
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            # Try every split point that leaves at least 4 letters per side.
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses',):
                        print(word, left, right)
It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
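A minimal Trie sketch of that lookup (the class and function names here are my own):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False  # the end-of-word terminator

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def terminators_hit(root, s):
    # Walk s through the trie, recording each end-of-word marker we pass.
    node, hits = root, []
    for i, ch in enumerate(s):
        if ch not in node.children:
            break
        node = node.children[ch]
        if node.is_word:
            hits.append(s[:i + 1])  # a dictionary word ends here
    return hits

root = TrieNode()
for w in ("spice", "jet"):
    insert(root, w)
print(terminators_hit(root, "spicejet"))  # ['spice']; then look up "jet" separately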
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings.
So, write a function to generate all of those (slide across your string using strides of 2, 3, 4, ..., len(yourstring) - 1), and then simply check each of them against a set or hash table.
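A rough sketch of that check (the toy dictionary here is made up):

def substrings(s, min_len=2):
    # Every substring of s with length from min_len up to len(s) - 1.
    return [s[i:i + k]
            for k in range(min_len, len(s))
            for i in range(len(s) - k + 1)]

dictionary = {"spice", "ice", "jet"}  # stand-in for a real word set
print([sub for sub in substrings("spicejet") if sub in dictionary])
# ['ice', 'jet', 'spice']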
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# Python sketch; assumes is_word(s) tests membership in your word set
def splitword(word):
    # word is a string of length n
    n = len(word)
    for i in range(1, n + 1):
        head = word[:i]  # first i characters
        tail = word[i:]  # everything else
        if is_word(head):
            if i == n:
                return [head]  # the whole word is valid; return it as a 1-element list
            rest = splitword(tail)
            if rest:  # check whether we successfully split the tail into words
                return [head] + rest
    return []  # no successful split found, and 'word' is not a word
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy, and the code is available on SourceForge:
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
Distinguishing between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used; no linguistic problem is independent of the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
1. Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).
2. Create a result list of objects, each storing: the remaining word and the list of matched words.
3. Fill the result list with the words to split as the remaining words.
4. Walk through the result array and the dictionary concurrently, always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5. Any time you find a match, i.e. a word to split that starts with a dictionary word, replace it in the result list with the matching dictionary word and the remaining part. You have to take into account possible multiples.
6. Any time the remaining part is empty, you have found a final result.
7. Any time you don't find a match on the "left side" (in other words, every time you advance the result pointer because of no match), delete the corresponding result item. This word has no matches and can't be split.
8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this list is empty (go to point 4).
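Here is a rough Python sketch of the idea (all names are my own). It keeps (remaining word, matched words) pairs and repeatedly expands them against the dictionary; for brevity it checks prefixes per item with bisect rather than performing the strict single merge pass described above:

from bisect import bisect_left

def split_all(words_to_split, dictionary):
    dict_sorted = sorted(dictionary)
    # Each partial result pairs the remaining text with the words matched so far.
    partials = [(w, []) for w in sorted(words_to_split)]
    finals = []
    while partials:
        next_partials = []
        for remaining, matched in partials:
            # Scan the dictionary words that could be prefixes of `remaining`.
            i = bisect_left(dict_sorted, remaining[:1])
            while i < len(dict_sorted) and dict_sorted[i][:1] == remaining[:1]:
                w = dict_sorted[i]
                if remaining.startswith(w):
                    rest = remaining[len(w):]
                    if rest:  # partial match; keep expanding next round
                        next_partials.append((rest, matched + [w]))
                    else:  # the remaining part is empty: a final result
                        finals.append(matched + [w])
                i += 1
            # Items with no match at all are simply not carried forward.
        partials = sorted(next_partials)
    return finals

print(split_all(["spicejet"], ["spice", "spic", "jet", "ejet"]))
# [['spic', 'ejet'], ['spice', 'jet']]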