How to efficiently find top-k elements? - hadoop

I have a big sequence file storing the tf-idf values for documents. Each line represents a document and the columns hold the tf-idf value of each term (so each row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The naive solution is to loop through all the columns for each row in the mapper and pick the top-k, but as the file grows bigger I don't think this is a good solution. Is there a better way to do that in Hadoop?

1. In every mapper calculate a top K (this is the local top K for each map).
2. Spawn a single reducer; the local top K from all mappers will flow to this one reducer, and hence the global top K will be evaluated. A minimal streaming sketch of this pattern follows the analogy below.
Think of the problem as:
1. You have been given the results of X horse races.
2. You need to find the top N fastest horses. You only need to compare the top N from each race to determine the overall top N.
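A minimal Hadoop streaming sketch of that pattern in Python, assuming tab-separated "term<TAB>tfidf" input lines and k = 10 (both assumptions, not from the question). Because both stages simply keep the k largest items they see, the same script can serve as the mapper and as the single reducer:
# topk_streaming.py - keep the K largest (tfidf, term) pairs seen on stdin.
import heapq
import sys

K = 10  # hypothetical k, not given in the question
heap = []  # min-heap holding the current top-K
for line in sys.stdin:
    term, tfidf = line.rstrip("\n").split("\t")
    item = (float(tfidf), term)
    if len(heap) < K:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        heapq.heapreplace(heap, item)

for tfidf, term in sorted(heap, reverse=True):
    print("%s\t%s" % (term, tfidf))
With Hadoop streaming you would pass this script as both the mapper and the reducer and request a single reduce task (e.g. -D mapred.reduce.tasks=1).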

Related

Select and filter algorithm

I would like to select the top n values from a dataset, but ignore elements based on what I have already selected - i.e., given a set of points (x,y), I would like to select the top 100 values of x (which are all distinct) but not select any points such that y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.
Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.
You can do this in O(n log k) time, where n is the number of values in the dataset and k is the number of top values you'd like to get.
Store the values you wish to exclude in a hash table.
Make an empty min-heap.
Iterate over all of the values and for each value:
If it is in the hash table, skip it.
If the heap contains fewer than k values, add it to the heap.
If the heap already contains k values and the value you're looking at is greater than the smallest member of the min-heap, pop that smallest value and add the new one.
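A small in-memory sketch of those steps in Python, treating the set of y values to exclude as given up front (the point format is an assumption):
import heapq

def top_k_points(points, k, excluded_ys):
    # points: iterable of (x, y) pairs; excluded_ys: set of y values to skip
    heap = []  # min-heap on x: heap[0] is the smallest x among the current top-k
    for x, y in points:
        if y in excluded_ys:       # skip excluded values
            continue
        if len(heap) < k:          # heap not full yet
            heapq.heappush(heap, (x, y))
        elif x > heap[0][0]:       # better than the current smallest
            heapq.heapreplace(heap, (x, y))
    return sorted(heap, reverse=True)

# usage
print(top_k_points([(5, 'a'), (9, 'b'), (7, 'c'), (3, 'd')], 2, {'c'}))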
I will share my thoughts. Since the author still has not specified the scope of the data to be processed, I will assume it is too large to be handled by a single machine, and I will also assume the author is familiar with Hadoop.
So I would suggest using MapReduce as follows:
Mappers simply emit pairs (x, y).
Combiners select the k pairs with the largest values of x (k = 100 in this case), maintaining the y's seen so far in a hash set to avoid duplicates, then emit the k pairs found (see the sketch after this list).
There should be only one reducer in this job, since it has to receive all pairs from the combiners and finalize the job by selecting the k pairs one last time. The reducer's implementation is identical to the combiner's.
The number of combiners should be chosen with the memory needed to select the top k pairs out of the incoming data in mind, since whichever method is used (sorting, a heap, or anything else) runs in memory, as does the hash set of unique y's.
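A rough sketch of the combiner/reducer logic just described, assuming mapper output lines of the form "x<TAB>y" and k = 100; for simplicity it keeps the best x per distinct y in a dict and then takes the k largest, rather than a size-bounded heap:
# topk_unique_y.py - usable as both the combiner and the single reducer:
# keep, for each distinct y, the largest x seen, then emit the K best pairs.
import sys

K = 100  # k = 100 in this case
best_x_for_y = {}
for line in sys.stdin:
    x, y = line.rstrip("\n").split("\t")
    x = float(x)
    if y not in best_x_for_y or x > best_x_for_y[y]:
        best_x_for_y[y] = x

for y, x in sorted(best_x_for_y.items(), key=lambda kv: kv[1], reverse=True)[:K]:
    print("%s\t%s" % (x, y))
Note that this keeps one entry per distinct y in memory, which is exactly the memory consideration raised above.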

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and a record of where these people are; there are exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(A, B) every time person A co-occurs with person B in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then, for example in Python:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for S in sets:  # each S is one record, i.e. a set of person ids
    for i, j in combinations(sorted(S), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
If you have M sets, each of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore, the same algorithm can readily be implemented in a distributed way using MapReduce. For the count you just have to emit a key of (i, j) and a value of 1, and in the reduce you sum them (a streaming sketch follows). Actually generating the list of intersecting sets is similar.
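For instance, a Hadoop streaming sketch of that counting job, assuming each input line is a comma-separated list of person ids as in the example records above (the threshold value is an illustrative assumption):
# pair_count_mapper.py - emit one line "i,j<TAB>1" per co-occurring pair
import sys
from itertools import combinations

for line in sys.stdin:
    people = sorted(set(line.strip().split(",")))
    for i, j in combinations(people, 2):
        print("%s,%s\t1" % (i, j))

# pair_count_reducer.py - sum counts per pair; emit pairs reaching THRESHOLD
import sys

THRESHOLD = 3  # hypothetical threshold
current, total = None, 0
for line in sys.stdin:
    pair, count = line.rstrip("\n").split("\t")
    if pair != current:
        if current is not None and total >= THRESHOLD:
            print("%s\t%d" % (current, total))
        current, total = pair, 0
    total += int(count)
if current is not None and total >= THRESHOLD:
    print("%s\t%d" % (current, total))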
The relevant concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be used for your case, restricted to itemsets of size 2.
Moreover, for finding association rules with specific support conditions (which for your case is the threshold value), LSH and MinHash can be used too.
You could use sampling to speed it up, e.g. only count each pair occurrence with probability 1/50. That will give you roughly a 50x speed-up. Then double-check any pairs whose sampled count gets close enough to 1/50th of the threshold.
To double-check a pair, you can either go through the whole list again, or you can do it more efficiently with some clever reverse indexing built as you go: e.g. encode each person's row indices into 64-bit integers, use binary search / merge-sort style techniques to decide which 64-bit integers to compare, and use bit operations to compare them for matches. Other things to look up are inverted indexes and binary indexed trees / Fenwick trees.
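A minimal single-machine sketch of that sampling idea (the 1/50 rate and the slack factor for "close enough" are illustrative assumptions):
import random
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, threshold, rate=1.0 / 50, slack=0.8):
    # First pass: count each pair occurrence only with probability `rate`.
    # Return pairs whose sampled count is near threshold * rate, so a second
    # pass can re-count them exactly.
    sampled = defaultdict(int)
    for record in records:  # record: iterable of person ids
        for pair in combinations(sorted(set(record)), 2):
            if random.random() < rate:
                sampled[pair] += 1
    cutoff = slack * threshold * rate
    return [pair for pair, c in sampled.items() if c >= cutoff]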

Randomly Partition versus Partition then Shuffle

Given a set of n data points generated from the same distribution, I want to "randomly partition" the set into k groups, where each contains n / k points randomly chosen from the original data set.
Alternatively, I can first divide the input data set into k contiguous chunks, where the first chunk contains 1, ..., n/k, and the second chunk contains n/k+1, ..., 2n/k, and so on. Then I "shuffle" the data points within each partition.
Are these two approaches always equivalent, given that the data set is generated from the same distribution? If not, what assumptions do we need for these two approaches to produce the same results?
Obviously they are not equivalent; the second restricts the values that can go in each partition, while the first does not.
If by "results" you mean what is done with these partitions, that would be wholly dependent on just what that is, which you provide no hint to.

Computing median in map reduce

Can someone explain the computation of the median/quantiles in MapReduce?
My understanding of Datafu's median is that the 'n' mappers sort the data and send it to "1" reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value).
Is my understanding correct? If so, does this approach scale for massive amounts of data, as I can clearly see the one single reducer struggling to do the final task?
Thanks
Trying to find the median (middle number) in a series is going to require that 1 reducer is passed the entire range of numbers to determine which is the 'middle' value.
Depending on the range and uniqueness of values in your input set, you could introduce a combiner to output the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sorted value / frequency pairs to identify the median (a small combiner sketch follows).
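For illustration, a minimal streaming-style combiner along those lines, assuming one numeric value per input line:
# value_freq_combiner.py - collapse raw values into "value<TAB>frequency" lines
# so far fewer records reach the single reducer.
import sys
from collections import Counter

freq = Counter()
for line in sys.stdin:
    freq[line.strip()] += 1

for value, count in sorted(freq.items()):
    print("%s\t%d" % (value, count))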
Another way you could scale this (again, if you know the range and rough distribution of values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 1, and so on). This will however require a secondary job to examine the reducer outputs and perform the final median calculation (knowing, for example, the number of keys in each reducer, you can calculate which reducer's output will contain the median, and at which offset).
Do you really need the exact median and quantiles?
A lot of the time, you are better off with just getting approximate values, and working with them, in particular if you use this for e.g. data partitioning.
In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time); here is a rough outline of the strategy:
Have a mapper for each partition compute the desired quantiles, and output them to a new data set. This data set should be several orders of magnitude smaller (unless you ask for too many quantiles!)
Within this data set, compute the quantiles again, similar to "median of medians". These are your initial estimates.
Repartition the data according to these quantiles (or even additional partitions obtained this way). The goal is that in the end, the true quantile is guaranteed to be in one partition, and there should be at most one of the desired quantiles in each partition
Within each of the partitions, perform a QuickSelect (in O(n)) to find the true quantile.
Each of the steps is in linear time. The most costly step is part 3, as it will require the whole data set to be redistributed, so it generates O(n) network traffic.
You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say you want to find the global median. You can't find it easily in a linear process, but you can probably narrow it down to 1/k-th of the data set when it is split into k partitions. So instead of having each node report its median, have each node additionally report the objects at (k-1)/(2k) and (k+1)/(2k). This should allow you to narrow down significantly the range of values where the true median must lie. Then, in the next step, each node can send those objects that are within the desired range to a single master node, which chooses the median within this range only.
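As a toy illustration of steps 1-2 (per-partition summaries, then an initial estimate from them), a sketch rather than the actual Hadoop implementation:
import numpy as np

def initial_median_estimate(partitions):
    # partitions: list of 1-D arrays, one per mapper/partition.
    # Each "mapper" reports its local median; the initial estimate is the
    # median of those summaries ("median of medians" style).
    summaries = [np.median(part) for part in partitions]
    return np.median(summaries)

# usage: three partitions drawn from a larger data set
rng = np.random.default_rng(1)
parts = [rng.normal(size=1000) for _ in range(3)]
print(initial_median_estimate(parts))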
O((n log n)/p) to sort it then O(1) to get the median.
Yes... you can get O(n/p), but you can't use the out-of-the-box sort functionality in Hadoop. I would just sort and take the center item unless you can justify the 2-20 hours of development time to code the parallel kth-largest algorithm.
In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:
Calculate frequencies of values in your dataset (Word Count job, basically)
Identity mapper + a reducer which calculates the median based on <value, frequency> pairs
Job 1 will drastically reduce the amount of data and can be executed fully in parallel. The reducer of job 2 will only have to process n items (n = the cardinality of your value set) instead of all the values, as with the naive approach.
Below is an example reducer for job 2. It's a Python script that can be used directly in Hadoop streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.
import sys

# Store in memory a mapping of a value to the (1-based, inclusive) range of
# indexes it occupies in a sorted list of all values.
# Assumes the incoming "value<TAB>frequency" lines arrive sorted by value
# in ascending numeric order.
item_to_index_range = []
total_count = 0
for line in sys.stdin:
    item, count = line.strip().split("\t", 1)
    new_total_count = total_count + int(count)
    item_to_index_range.append((int(item), (total_count + 1, new_total_count)))
    total_count = new_total_count

# Calculate index(es) of the middle item(s)
middle_items_indexes = [total_count // 2 + 1]
if total_count % 2 == 0:
    middle_items_indexes.append(total_count // 2)

# Retrieve the middle item(s)
middle_items = []
for i in middle_items_indexes:
    for item, (first, last) in item_to_index_range:
        if first <= i <= last:
            middle_items.append(item)
            break

print(sum(middle_items) / float(len(middle_items)))
This answer builds on a suggestion initially coming from the answer of Chris White. That answer suggests using a combiner as a means to calculate the frequencies of values. However, in MapReduce, combiners are not guaranteed to always be executed. This has some side effects:
the reducer will first have to compute the final <value, frequency> pairs and then calculate the median.
in the worst-case scenario, combiners will never be executed and the reducer will still have to struggle to process all the individual values.

Choosing number of clusters in k means

I want to cluster a large sample of data, and for that I am using the kmeans function in MATLAB. The problem is that it returns a matrix with all the data assigned to the number of clusters I specify.
How can I know which number of clusters is optimal?
I thought that if I got an equal number of elements in each cluster that would be optimal, but this never happens; rather, it will go on clustering the data for any number I put in.
Please help...
From what I read, I think an answer to this could be: in k-means we are trying to partition the data according to the means, so theoretically the best partitioning would be one where each partition has an equal number of data points.
I used k-means++, which is a better algorithm than plain k-means because it does not initialise with purely random centers, and iterated over the number of partitions until the partition sizes were almost equal (a small sketch of this loop follows). This was an approximate criterion: for 3 clusters I got sizes 2180, 729, 1219 and for 4 I was getting 30, 2422, 1556, 120, so I chose 3 as my final answer.
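A small Python sketch of that loop, using scikit-learn's KMeans with its default k-means++ initialisation as a stand-in for the MATLAB call; the "spread" score is just one possible way to quantify "almost equal sizes":
import numpy as np
from sklearn.cluster import KMeans

def most_balanced_k(data, k_values):
    # Return the k whose cluster sizes are closest to equal.
    best_k, best_spread = None, float("inf")
    for k in k_values:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=0).fit_predict(data)
        sizes = np.bincount(labels, minlength=k)
        spread = sizes.max() - sizes.min()  # 0 means perfectly balanced
        if spread < best_spread:
            best_k, best_spread = k, spread
    return best_k

# usage
rng = np.random.default_rng(0)
data = rng.normal(size=(4000, 2))
print(most_balanced_k(data, range(2, 6)))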
