Binary search implementation with actors in scala? - algorithm

So I have such problem in Scala, I need to implement binary search with the help of actors, with no loops and recursion, preferably with concurrency between actors. Of course it has no sense, but the problem is as follows. I think that it will be nice to have one actor-coordinator,which coordinates the work of others. So input data is sorted array and the key for searching. Output - index of a key. Do you have some ideas how it possible to implement?
Thanks in advance.

I'm not sure how you could have concurrency for binary search, as every step of the algorithm needs the result of the last one.
You could do "n-ary" search: Split the array in n parts and let every actor compare the value at the boundaries of the sub-arrays. You don't even have to wait for all answers, as soon as you got the two actors with different comparision result, you could start the next round recursively for the subarray you found.

Related

Algorithm for searching dictionary in both direction in O(log n) <= speed < O(n)

I am trying to implement an (Language1 - Language2) dictionary.
I want to make an search algorithm in both direction that speed is faster than O(n)
For example, if (hello, hola) is one pair,
SEARCH_SPANISH_BY_ENGLISH (hello) = "hola", and
SEARCH_ENGLISH_BY_SPANISH (hola) = "hello"
If you have an idea for an algorithm, can you tell me how to set up an dictionary and implement an search algorithm? It seems like I have to use an divide and conquer but I am not sure how. Thanks.
The side of dictionary should be singular, which means I cannot build both English-Spanish and Spanish-English dictionary.
You can have two dictionary as the search time will remain constant. However, if you still wish to have two way dictionary, then Jon Skeet's version is here:
Getting key of value of a generic Dictionary?
Any balanced tree, or sorted array, or hashtable data structure will give you better than O(n) when searching.
For the bidirectionality, you can just have two data structures, one for English-Español the other for Español-English. That doesn't necessarily mean two complete dictionaries, just that you need a search index of some description for either direction.
However, since you appear to have a one-to-one mapping of words, you can just maintain one dictionary as persistent and build the reverse one at run time.

Simple mapreduce pairwise comparison

I am learning mapreduce. I want to implement a naive nearest neighbor search-- complexity O(n^2). To do this, I expect to use nested loops to iterate over the input items. The inner loop compares two items and writes out the distance between them.
I think what I need to do is pass all the items in the input split into the mapper. I do not know how to do this. If I use a TextInputFormat, what will the the context's getCurrentValue() method return? All lines in all input files, or something else?
How about NLineFormat? Will the split size be set to N?
Advice is welcome. I'm not ready to dive into academic papers on the subject.
...
Thanks for the comments. Here are my updates:
Each input item is a feature vector of nominal values. The distance between two items is just the number of corresponding fields with different values.
The output is going to be something simple: item#1_ID, item#2_ID, distance
I'm just testing a sample of 500 items so it runs quickly. I would not use this approach on a large real-life data set. There a slide deck out there an approximate nearest-neighbor matching in mapreduce. If I take this project further, I'll probably follow that approach.

Find the number occurences of each word in a document?

I was asked the question in an interview. The interviewer told me to assume that there exists a function say getNextWord() to return the next word in a given document. My task was to design a data structure to implement the task, and give an algorithm that constructs a list of all words with their frequencies.
Being from a C++ background, my answer was to create a multimap of string and then insert all words in it, and later display the count of it. I was however told later, to do this in a more generic way. By generic he meant that he didn't want me to use a library feature. Also I guess a multimap is implemented internally as a 2-3 tree or so, so for the multimap solution to be generic I would need to code the 2-3 tree as well.
Although tries did come to mind, implementing one during an interview was out of question for me. So, I just wanted to know if there are better ways of achieving it? Or is there a way to implement it in a smooth manner using tries?
Any histogram based algorithm would be both effient and generic in here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a Map<String,Integer>
Iterate the document once (with your nextDoc() method), while maintaining your histogram.
The best implementation for this interface, in terms of big O notations - would probably be to use a trie, and in each leaf node - add the counter of occurances.
Getting the actual (word,number) pairs from the trie will be done by a simple DFS on the trie.
This solution gives you O(n * |S|) time complexity, where |S| is the average size for a string.
The insertion algorithm for each word:
Each time you add a new word: check if it already exists, if it does - increase the counter, else - add the word to the dictionary with a counter value of 1.
I'd try to implement a B-Tree (or smth quite similar) to store all the words. Therefore I could easily find a next word, if already have it and increase associated counter in the node. Or just insert a new one.
The time complexity in that case would be: O(nlogn), where n is all words count and logn is a Big-Oh for such kind of tree.
I think the simplest solution would be a a Trie. O(N) is given in this case (both for insertion and getting the count). Just store the count in an additional space at every node.
Basically each node in the tree contains 26 links to 26 possible children (1 for each letter) + 1 counter (for words the are terminated in the current node)
.
Just look at the link for a graphic image of a trie.

Grouping items in an array?

Hey guys, if I have an array that looks like [A,B,C,A,B,C,A,C,B] (random order), and I wish to arrange it into [A,A,A,B,B,B,C,C,C] (each group is together), and the only operations allowed are:
1)query the i-th item of the array
2)swap two items in the array.
How to design an algorithm that does the job in O(n)?
Thanks!
Sort algorithms aren't something you design fresh (i.e. first step of your development process) anymore; you should research known sort algorithms and see what meets your needs.
(It is of course possible you might really require your own new sort algorithm, but usually that has different—and highly-specific—requirements.)
If this isn't your first step (but I don't think that's the case), it would be helpful to know what you've already tried and how it failed you.
This is actually just counting sort.
Scan the array once, count the number of As, Bs, Cs—that should give you an idea. This becomes like bucket sort—not quite but along those lines. The count of As Bs and Cs should give you an idea about where the string of As, Bs and Cs belongs.

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").
My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much - it will easily fit in memory, so there's nothing to worry. In C++ you can use the standard STL std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link. A Pictorial representation on how different algorithm works. This will give you an hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list with the number of entries equal to the highest frequency from the first step. The index of each slot in this list will hold a list of the words with the frequency count equal to the index. So words that occur 3 times in the document will go in list[3] for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than O(nlogn) required by quicksort.
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closure make the solution much easy, like LINQ as Eric mentioned.
For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:
read the entire file into a buffer.
parse the buffer and build a token vector with
struct token { char *term, int termlen; }
term is a pointer to the word in the buffer.
sort the table by term (lexicographical order).
set entrynum = 0, iterate through the term vector,
when term is new, store it in a vector :
struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number, otherwise increment frequency.
sort the vector by frequency in descending order.
You can also try implementing digital trees also known as Trie. Here is the link

Resources