Simple mapreduce pairwise comparison - hadoop

I am learning MapReduce. I want to implement a naive nearest-neighbor search, which is O(n^2). To do this, I expect to use nested loops to iterate over the input items; the inner loop compares two items and writes out the distance between them.
I think what I need to do is pass all the items in the input split into the mapper. I do not know how to do this. If I use a TextInputFormat, what will the context's getCurrentValue() method return? All lines in all input files, or something else?
How about NLineInputFormat? Will the split size be set to N lines?
Advice is welcome. I'm not ready to dive into academic papers on the subject.
...
Thanks for the comments. Here are my updates:
Each input item is a feature vector of nominal values. The distance between two items is just the number of corresponding fields with different values.
The output is going to be something simple: item#1_ID, item#2_ID, distance
I'm just testing on a sample of 500 items so it runs quickly. I would not use this approach on a large real-life data set. There's a slide deck out there on approximate nearest-neighbor matching in MapReduce. If I take this project further, I'll probably follow that approach.
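For reference, one common way to get at "all the items in the input split" is to buffer the records in map() and do the nested loop in cleanup(); with TextInputFormat, each call to map() receives a single line, not all lines. Below is a minimal sketch of that pattern, assuming comma-separated lines whose first field is the item ID (that format is my assumption, not something stated in the question):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers every record of the split, then emits pairwise distances in cleanup().
// Only pairs within the same split are compared, which matches the question.
public class PairwiseDistanceMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final List<String[]> items = new ArrayList<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        items.add(line.toString().split(","));   // assumed format: id,field1,field2,...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                String[] a = items.get(i), b = items.get(j);
                int distance = 0;
                for (int k = 1; k < a.length; k++) {          // count differing nominal fields
                    if (!a[k].equals(b[k])) distance++;
                }
                // output: "item#1_ID,item#2_ID" -> distance
                context.write(new Text(a[0] + "," + b[0]), new IntWritable(distance));
            }
        }
    }
}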

Related

Algorithm: Understand when two line charts are similar

I am trying to develop a script capable of understanding when two line charts are similar (they have a similar direction or similar values).
For instance suppose I have two arrays:
array1 = [0,1,2,3,4,5,6,7,8,9,10];
array2 = [2,3,4,5,6,7,8,8,10,11,12];
As you can see, they both grow and their values are quite similar.
At the moment I have found a perfectly working solution using a DTW algorithm.
The problem is that DTW's "training" part is very fast (I just have to store a lot of line charts), but it has a heavy prediction part because it compares the last line chart with all the others in memory.
So my question is: is it possible to move the computational cost into the training part in order to have a faster prediction?
For example creating a search tree or something like that?
And if it is possible, according to which specific value can I cluster the information?
Do you have any advice or useful links?
It is often possible by mapping the objects from your domain to a linear space. For example, you can see how that works for word embeddings in natural languages (word2vec tutorial, skip to "Visualizing the Learned Embeddings"). In this setting, similarity between objects is defined by a distance in the linear space, which is very fast to compute.
How complex the mapping should be in your case depends greatly on your data: how diverse the charts are and what kind of similarity you wish to capture.
In your example with two vectors, it's possible to compute a single value: the slope of the regression line. This will probably work if your charts are "somewhat linear" in nature. If you'd like to capture sinusoidal patterns as well, you can try to normalize the time series by subtracting the first value. Again, in your particular example it'll show a perfect fit.
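As a toy illustration of that single-value idea, here is a rough sketch in plain Java (the class and method names are mine) that maps each series to its least-squares slope and compares the slopes:

public class SlopeEmbedding {
    // Least-squares slope of y[0..n-1] against x = 0, 1, ..., n-1.
    static double slope(double[] y) {
        int n = y.length;
        double meanX = (n - 1) / 2.0, meanY = 0;
        for (double v : y) meanY += v;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (y[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] array1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] array2 = {2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12};
        // Similar slopes -> similar overall direction (both are close to 1.0 here).
        System.out.println(slope(array1) + " vs " + slope(array2));
    }
}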
Bottom line: complexity of the mapping is determined by the complexity of the data.
If they always have the same length, Pearson correlation should be much more appropriate, and much faster.
If you standardize your vectors, Pearson correlation is a monotone function of Euclidean distance, so you can use any multidimensional search tree for further acceleration.
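A small sketch of the Pearson route for two equal-length series (plain Java, names are mine):

public class PearsonSimilarity {
    // Pearson correlation of two equal-length series: 1.0 means a perfect linear
    // relationship, 0 means no linear relationship, -1.0 means opposite trends.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] array1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] array2 = {2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12};
        System.out.println(pearson(array1, array2));   // close to 1.0 for these arrays
    }
}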

Finding matches in two bags

This was one of the interview questions, and I was wondering whether I did it right and whether my O(n log n) solution is the most efficient one.
Given a bag of nuts and a bag of bolts, each having a different size within a bag but exactly one match in the other bag, give a fast algorithm to find all matches.
Since there is always a match between the two bags, I said there will be the same number of nuts and bolts. Let that number be n.
I would first sort the components in each bag by weight; this is possible because they all have different sizes within a bag. Using merge sort, this will have O(n log n) time complexity.
Next, it is just an easy process of matching the components in the two bags from lightest to heaviest.
I want to know if this is the right solution, and also if there is any other interesting way to solve this problem.
There is an O(n log n) solution to this problem that works even if you can't compare two nuts or two bolts directly, i.e. without using a scale, or when the differences in weight are too small to observe.
It works by applying a small variation of randomised quicksort as follows:
Pick a random bolt b.
Compare bolt b with all the nuts, partitioning the nuts into those of size less than b and greater than b.
Now we must partition the bolts into two halves as well, and we can't compare bolt to bolt. But at this point we know which nut n matches b, so we compare n to all the bolts, partitioning them into those of size less than n and greater than n.
The rest of the problem follows directly from randomised quicksort by applying the same steps to the lessThan and greaterThan partitions of nuts and bolts.
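Here is a rough sketch of that variation in Java, with integer sizes standing in for the physical "try the nut on the bolt" comparison (the arrays and names are my own):

import java.util.Arrays;
import java.util.Random;

public class NutsAndBolts {
    private static final Random RAND = new Random();

    // Rearranges nuts[] and bolts[] so that nuts[i] matches bolts[i] for every i.
    // Elements of one array are only ever compared against elements of the other.
    static void match(int[] nuts, int[] bolts, int lo, int hi) {
        if (lo >= hi) return;
        // Pick a random bolt, partition the nuts around it; p is the matching nut's index.
        int p = partition(nuts, lo, hi, bolts[lo + RAND.nextInt(hi - lo + 1)]);
        // Partition the bolts around that matching nut.
        partition(bolts, lo, hi, nuts[p]);
        match(nuts, bolts, lo, p - 1);
        match(nuts, bolts, p + 1, hi);
    }

    // Lomuto-style partition around a pivot taken from the *other* array.
    // Returns the final index of the element equal to the pivot.
    static int partition(int[] a, int lo, int hi, int pivot) {
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i++, j);
            } else if (a[j] == pivot) {
                swap(a, j--, hi);   // park the exact match at the end, re-examine a[j]
            }
        }
        swap(a, i, hi);             // put the matching element into its final position
        return i;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        int[] nuts = {5, 1, 4, 2, 3};
        int[] bolts = {4, 3, 5, 1, 2};
        match(nuts, bolts, 0, nuts.length - 1);
        System.out.println(Arrays.toString(nuts));    // [1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(bolts));   // [1, 2, 3, 4, 5]
    }
}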
The standard solution is to put the bag of nuts into a hash (or HashMap, or dictionary, or whatever your language of choice calls it), and then walk the other bag using hash lookups to find the matches.
Average performance of this algorithm will be O(n), with better constants than your sort/binary-search variation.
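A minimal sketch of the hash approach, assuming each part's size can actually be read off and used as a key (an assumption the comments below question):

import java.util.HashMap;
import java.util.Map;

public class HashMatch {
    // match[j] = index of the nut that fits bolt j.
    // Assumes sizes are exact, distinct within a bag, and usable as hash keys.
    static int[] match(int[] nutSizes, int[] boltSizes) {
        Map<Integer, Integer> nutIndexBySize = new HashMap<>();
        for (int i = 0; i < nutSizes.length; i++) {
            nutIndexBySize.put(nutSizes[i], i);              // one pass over the bag of nuts
        }
        int[] match = new int[boltSizes.length];
        for (int j = 0; j < boltSizes.length; j++) {
            match[j] = nutIndexBySize.get(boltSizes[j]);     // expected O(1) lookup per bolt
        }
        return match;
    }
}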
Assuming you have no exact measurements of size, and you can only tell whether something is too large, too small, or exactly right by trying it, this needs the extra step of deciding how you are sorting. That is probably by taking a random nut, comparing it with each bolt, and sorting the bolts into smaller and bigger piles, then taking a bolt and sorting the nuts with that (a modified quicksort). Keep in mind you'll have to deal with the problem of how to choose a nut or bolt to compare against as the piles get smaller, to make sure you're using a reasonable pivot.
Since the problem doesn't tell you that the nuts and bolts are labeled with sizes, or that you have some way of measuring the nuts and bolts' sizes, it would be difficult to use a hash table to solve this, or even a regular comparison sort.
That looks like the best solution to me, complexity-wise.
An alternative would be to sort only one of the bags, then use binary search to match each bolt to a nut (removing the pair from the sorted list). Still O(n log n), but the second part is slightly cheaper: since we remove each pair after a match, the worst case is log n + log(n-1) + ... + log 1 = log(n!), which is still on the order of n log n but with a smaller constant.
Take two extra bags, say a left bag and a right bag. Take a nut and compare it with each bolt: if the bolt is smaller, place it in the left bag; if larger, in the right bag; if it fits, set the pair aside. Now take the next nut and compare it with the previously taken nut: if it is smaller, we only need to check the bolts in the left bag, otherwise the ones in the right bag. Continue in this fashion.

Binary search implementation with actors in scala?

I have the following problem in Scala: I need to implement binary search with the help of actors, with no loops and no recursion, preferably with concurrency between the actors. Of course it makes little sense, but that is the problem as given. I think it would be nice to have one coordinator actor that coordinates the work of the others. The input is a sorted array and the key to search for; the output is the index of the key. Do you have any ideas on how to implement this?
Thanks in advance.
I'm not sure how you could have concurrency for binary search, as every step of the algorithm needs the result of the last one.
You could do an "n-ary" search: split the array into n parts and let each actor compare the values at the boundaries of its sub-array. You don't even have to wait for all the answers; as soon as you find two adjacent actors with different comparison results, you can start the next round recursively on the sub-array you found.
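The question asks for Scala actors, but purely to illustrate the n-ary partitioning idea described above, here is a rough sketch in plain Java with a thread pool standing in for the actors (the structure is the point, not the concurrency model; all names are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NArySearch {
    static final int PARTS = 4;                                    // "actors" per round
    static final ExecutorService POOL = Executors.newFixedThreadPool(PARTS);

    // Returns an index of key in the sorted array a, or -1 if it is absent.
    static int search(int[] a, int key, int lo, int hi) throws Exception {
        if (lo > hi) return -1;
        if (hi - lo + 1 <= PARTS) {                                // tiny range: just scan it
            for (int i = lo; i <= hi; i++) if (a[i] == key) return i;
            return -1;
        }
        int len = (hi - lo + 1 + PARTS - 1) / PARTS;               // sub-range length
        List<Future<int[]>> answers = new ArrayList<>();
        for (int start = lo; start <= hi; start += len) {
            final int s = start, e = Math.min(start + len - 1, hi);
            // Each worker only reports whether its sub-range could contain the key.
            answers.add(POOL.submit(() ->
                    (a[s] <= key && key <= a[e]) ? new int[]{s, e} : null));
        }
        for (Future<int[]> f : answers) {
            int[] range = f.get();
            if (range != null) return search(a, key, range[0], range[1]);
        }
        return -1;                                                 // no sub-range can hold the key
    }

    public static void main(String[] args) throws Exception {
        int[] a = {1, 3, 4, 7, 9, 11, 15, 20, 22, 31, 40};
        System.out.println(search(a, 15, 0, a.length - 1));        // prints 6
        POOL.shutdown();
    }
}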

Grouping items in an array?

Hey guys, if I have an array that looks like [A,B,C,A,B,C,A,C,B] (random order), and I wish to arrange it into [A,A,A,B,B,B,C,C,C] (each group is together), and the only operations allowed are:
1) query the i-th item of the array
2) swap two items in the array.
How to design an algorithm that does the job in O(n)?
Thanks!
Sort algorithms aren't something you design fresh (i.e. first step of your development process) anymore; you should research known sort algorithms and see what meets your needs.
(It is of course possible you might really require your own new sort algorithm, but usually that has different—and highly-specific—requirements.)
If this isn't your first step (but I don't think that's the case), it would be helpful to know what you've already tried and how it failed you.
This is actually just counting sort.
Scan the array once and count the number of As, Bs, and Cs. This is like bucket sort, not quite, but along those lines: the counts tell you where the runs of As, Bs, and Cs belong, and a second pass can swap each item into its range.
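A sketch of that counting idea in Java, assuming the items are single uppercase letters (my simplification), using only reads and swaps:

import java.util.Arrays;

public class GroupItems {
    // Groups equal items together in O(n): count each value, derive the start of
    // every group's region, then swap each misplaced item into its own region.
    static void group(char[] a) {
        int[] count = new int[26];
        for (char c : a) count[c - 'A']++;

        int[] next = new int[26];                 // next free slot inside each group's region
        int[] end = new int[26];                  // one past the last slot of each region
        for (int g = 0, pos = 0; g < 26; g++) {
            next[g] = pos;
            pos += count[g];
            end[g] = pos;
        }

        for (int g = 0; g < 26; g++) {
            while (next[g] < end[g]) {
                int item = a[next[g]] - 'A';
                if (item == g) {
                    next[g]++;                    // already in the right region
                } else {
                    swap(a, next[g], next[item]); // send it to its own region
                    next[item]++;
                }
            }
        }
    }

    static void swap(char[] a, int i, int j) { char t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        char[] a = {'A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'B'};
        group(a);
        System.out.println(Arrays.toString(a));   // [A, A, A, B, B, B, C, C, C]
    }
}

Every swap moves one item into its final region, so the total work stays linear.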

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project; I won't bother to summarize it here, but this section of the project is to take a very large document of text (a minimum of around 50,000 words, not all unique) and output each unique word in order from most used to least used (probably the top three will be "a", "an", and "the").
My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much: it will easily fit in memory, so there's nothing to worry about. In C++ you can use the standard std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
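The answer above is phrased in C++ terms; here is the same idea as a rough Java sketch (the whitespace tokenization and sample text are my simplifications):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequency {
    public static void main(String[] args) {
        String text = "the cat and the dog and the bird";      // stand-in for the document

        // 1) word -> count
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }

        // 2) copy the keys out and sort them by their counts, descending
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));

        for (String w : words) {
            System.out.println(w + " " + counts.get(w));        // "the 3", "and 2", ...
        }
    }
}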
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link below: a pictorial representation of how the different algorithms work. This will give you a hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list with the number of entries equal to the highest frequency from the first step. Each slot in this list holds a list of the words whose frequency count equals the slot's index, so words that occur 3 times in the document will go in list[3], for example. Iterate through the hash map and insert each word into the appropriate slot. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than the O(n log n) required by quicksort.
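A rough sketch of those three steps in Java, assuming the word counts are already in a map (names are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FrequencyBuckets {
    // Orders words from most to least frequent in O(n) by bucketing on frequency.
    static List<String> byFrequencyDescending(Map<String, Integer> counts) {
        // Step 1 equivalent: the highest frequency (tracked while parsing in the answer above).
        int maxFreq = 0;
        for (int c : counts.values()) maxFreq = Math.max(maxFreq, c);

        // Step 2: buckets.get(f) holds all words that occur exactly f times.
        List<List<String>> buckets = new ArrayList<>();
        for (int f = 0; f <= maxFreq; f++) buckets.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            buckets.get(e.getValue()).add(e.getKey());
        }

        // Step 3: walk the buckets in reverse for most-frequent-first order.
        List<String> ordered = new ArrayList<>();
        for (int f = maxFreq; f >= 1; f--) ordered.addAll(buckets.get(f));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("the", 3, "and", 2, "cat", 1, "dog", 1);
        System.out.println(byFrequencyDescending(counts));   // [the, and, cat, dog] (ties in any order)
    }
}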
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closures make the solution much easier, like LINQ, as Eric mentioned.
For large sets you can use what is known as "sort-based indexing" in information retrieval, but for 50,000 words you can do the following:
Read the entire file into a buffer.
Parse the buffer and build a token vector with
struct token { char *term; int termlen; };
where term is a pointer to the word in the buffer.
Sort the table by term (lexicographical order).
Set entrynum = 0 and iterate through the term vector; when a term is new, store it in a vector of
struct { char *term; int frequency; };
at index entrynum, set frequency to 1, and increment entrynum; otherwise increment frequency.
Sort this vector by frequency in descending order.
You can also try implementing digital trees, also known as tries. Here is the link.
