I'd like to perform the expensive operation of cross product across two data sets in Hadoop using Java MapReduce.
For example, I have records from data set A and data set B, and I'd like each record in data set A to be matched up to each record in data set B in the output. I realize that the output size of this would be |A| * |B|, but want to do it anyways.
I see that Pig has CROSS but am unaware of how it is implemented at a high-level. Perhaps I will go take a look at the source code.
Not looking for any code, just want to know at a high-level how I should approach this problem.
I have done something similar when looking at document similarity (comparing a document to every other document) and ended up with a custom input format that splits up the two datasets and then ensured there was a 'split' for each subset of data.
So your splits would look like (each merging two sets of 10 records, outputting 100 records)
A(1-10) x B(1-10)
A(11-20) x B(1-10)
A(21-30) x B(1-10)
A(1-10) x B(11-20)
A(11-20) x B(11-20)
A(21-30) x B(11-20)
A(1-10) x B(21-30)
A(11-20) x B(21-30)
A(21-30) x B(21-30)
I don't remember how performant it was though, but had a document set in the size order of thousands to compare against one another (on an 8 node dev cluster), with millions of cross products calculated.
I could also make improvements to the algorithm as some documents would never score well against others (if they had too much temporal time between them for example), and generate better splits as a result.
Related
In KNN like algorithm we need to load model Data into cache for predicting the records.
Here is the example for KNN.
So if the model will be a large file say1 or 2 GB we will be able to load them into Distributed cache.
Example:
Inorder to predict 1 otcome, we need to find the distnce between that single record with all the records in model result and find the min distance. So we need to get the model result in our hands. And if it is large file it cannot be loaded into Distributed cache for finding distance.
The one way is to split/partition the model Result into some files and perform the distance calculation for all records in that file and then find the min ditance and max occurance of classlabel and predict the outcome.
How can we parttion the file and perform the operation on these partition ?
ie 1 record <Distance> file1,file2,....filen
2nd record <Distance> file1,file2,...filen
This is what came to my thought.
Is there any further way.
Any pointers would help me.
I think the way you partitionin the data mainly depends on your data itself.
Being that you have a model with a bunch of rows, and that you want to find the k closes ones to the data on your input, the trivial solution is to compare them one by one. This can be slow because of going through 1-2GB of data millions of times (I assume you have large numbers of records that you want to classify, otherwise you don't need hadoop).
That is why you need to prune your model efficiently (your partitioning) so that you can compare only those rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.
Additional tricks that you can use to fish out performance are:
Pre-sorting the input data so that the input items that will be compared from the same partition come together. Again depends on the data you operate on.
Use random access indexed files (like Hadoop's Map files) to find the data faster and cache it.
In the end it may actually be easier for your model to be stored in lucene index, so you can achieve effects of partitioning by looking up the index. Pre-sorting the data is still helpful there.
I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).
Challenges:
Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.
This is what my assumption is (I have zero practial experience of MapReduce/NoSQL technologies)
Each search item from user will maintain set of attributes that will be stored and eventually processed.
As well as maintaining list of searches by time stamp, region of search, category etc.
Example:
Searching for Kurt Cobain term:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate frequency of search term ?
In other words, given a large data set, how do they find top 10 frequent items in distributed scale-able manner ?
Well... finding out the top K terms is not really a big problem. One of the key ideas in this fields have been the idea of "stream processing", i.e., to perform the operation in a single pass of the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count-sketches" (you can read up more here: count min sketch page on wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1 with uniform probability
h2(x) = 0/1 each with probability 0.5
You then do A[h1[x]] += h2[x]. The key observation is that since each value randomly hashes to +/-1, E[ A[h1[x]] * h2[x] ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever)
But many of the tools and research is out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real-time
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research
How to design such a system is challenging with no correct answer, but the tools and research are available to get you started
Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab delimited rows and the second is call "MatchSet" which has 10 million tab delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now Match(row1,row2) is written in psuedo-code so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
for each row2 in Base
{
var type = Match(row1,row2);
switch(type)
{
//do something based on type
}
}
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hardtime getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records so 200 million/n nodes * 10 million iterations per node. Is that that most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite a small if the problem is indeed a join). As an example, consider the way CDDB computes the 32-bit ID for any music CD CDDB1 calculation. Sometimes, a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
Check the Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing
with MapReduce'. I haven't gone in detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd like apply to both the single stream and parallel versions. Examples include blocking so that you need to do fewer than 200M * 10M computations or precomputing constant portions of the algorithm for the 10M match set.
Currently I am loooking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have parameters "calling party", "called party", "call duration" and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers but occasionaly a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate the which records (row number) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although the fact is not programming related I would like to emphasize that due to law and respect for user privacy all the informations that could possibly reveal the user identity have been hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max (call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Althought it may be tempting to create a two dimensional array for (calling_party, called_party), that will he a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution as High Performance Mark says. For starters, it's be nice to know the count of unique phone numbers, the count of unique phone number pairs, and, the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2d? Should be easy to indentify indirect links - e.g. X is near {A, B, C} all of whom are near Y so X is sorta near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.
How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now I am not trying to ask how to use random number generators, I am asking a statistical question, I know how to generate random numbers, but my question is how do I ensure that this the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows or 'real' data has been inserted (3000/20 = 150). However I do not want it to be as accurate as that as I do not want the control rows to be identifiable simply based on their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they get close to each other if you do it really random :)
But What I would do is:
You have N rows of real data and x of control data
To get an index of a row you should insert i-th control row, I'd use: N/(x+1) * i + r, where r is some random number, diffrent for each of the control rows, small compared to N/x. Choose any way of determining r, it can be either gaussian or even flat distribution. i is an index of the control row, so it's 1<=i<x
This way you can be sure that you avoid condensation of your control rows in one single place. Also you can be sure that they won't be in regular distances from each other.
Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there.
for (int i=0; i<numberOfExistingRows; i++)
{
int r = random();
if (r > 0.5)
{
InsertRandomData();
}
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with example than with english)
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows you'd insert one at each 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a gaussian deviate with mean set to the real data size divided by the control data size, the former of which could be estimated if necessary, rather than measured or assumed known. Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate. Smaller stddev means a more leptokurtic distribution means tighter adherence to uniform spacing. Larger stdev means a more platykurtic distribution and looser adherence to uniform spacing.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well (simply: keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data. If the index just before this was off the end, generate data at the end.
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject whose title is "Simulation".
Also: in some circumstance non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outline above still applies using half the mean of whatever distribution seems right.