pair-wise aggregation of input lines in Hadoop - hadoop

A bunch of driving cars produce traces (sequences of ordered positions)
car_id order_id position
car1 0 (x0,y0)
car1 1 (x1,y1)
car1 2 (x2,y2)
car2 0 (x0,y0)
car2 1 (x1,y1)
car2 2 (x2,y2)
car2 3 (x3,y3)
car2 4 (x4,y4)
car3 0 (x0,y0)
I would like to compute the distance (path length) driven by the cars.
At the core, I need to process all records line by line pair-wise. If the
car_id of the previous line is the same as the current one then I need to
compute the distance to the previous position and add it up to the aggregated
value. If the car_id of the previous line is different from the current line
then I need to output the aggregate for the previous car_id, and initialize the
aggregate of the current car_id with zero.
How should the architecture of the hadoop program look like? Is it possible to
archieve the following:
Solution (1):
(a) Every mapper computes the aggregated distance of the trace (per physical
block)
(b) Every mapper aggregates the distances further in case the trace was split
among multiple blocks and nodes
Comment: this solution requires to know whether I am on the last record (line)
of the block. Is this information available at all?
Solution (2)
(a) The mappers read the data line by line (do no computations) and send the
data to the reducer based on the car_id.
(b) The reducers sort the data for individual car_ids based on order_id,
computes the distances, and aggregates them
Comment: high network load due to laziness of mappers
Solution (3)
(a) implement a custom reader to read define a logical record to be the whole
trace of one car
(b) each mapper computes the distances and the aggregate
(c) reducer is not really needed as everything is done by the mapper
Comment: high main memory costs as the whole trace needs to be loaded into main
memory (although only two lines are used at a time).

I would go with Solution (2), since it is the cleanest to implement and reuse.
You certainly want to sort based on car_id AND order_id, so you can compute the distances on the fly without loading them all up into memory.
Your concern about high network usage is valid, however, you can pre-aggregate the distances in a combiner.
How would that look like, let's take some pseudo-code:
Mapper:
foreach record:
emit((car_id, order_id), (x,y))
Combiner:
if(prev_order_id + 1 == order_id): // subsequent measures
// compute distance and emit that as the last possible order
emit ((car_id, MAX_VALUE), distance(prev, cur))
else:
// send to the reducer, since it is probably crossing block boundaries
emit((car_id, order_id), (x,y))
The reducer then has two main parts:
compute the sum over subsequent measures, like the combiner did
sum over all existing sums, tagged with order_id = MAX_VALUE
That's already best-effort what you can get from a network usage POV.
From a software POV, better use Spark- your logic will be five lines instead of 100 across three class files.
For your other question:
this solution requires to know whether I am on the last record (line)
of the block. Is this information available at all?
Hadoop only guarantees that it is not splitting through records when reading, it may very well be that your record is already touching two different blocks underneath. The way to find that out is basically to rewrite your input format to make this information available to your mappers, or even better- take your logic into account when splitting blocks.

Related

Parallel top ten algorithm for distributed data

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.
For example: Suppose there are only 3 computers and we need the top two most visited URLs.
Computer A: url1, url2, url1, url3
Computer B: url4, url2, url1, url1
Computer C: url3, url4, url1, url3
url1 appears 5 times in all logs
url2 2
url3 3
url4 2
So the answer is url1, url3
The log files are too large to fit in RAM and copy them by network. As I understand, it is important also to make the computation parallel and use all given computers.
How would you solve it?
This is a pretty standard problem for which there is a well-known solution. You simply sort the log files on each computer by URL and then merge them through a priority queue of size k (the number of items you want) on the "master" computer. This technique has been around since the 1960s, and is still in use today (although slightly modified) in the form of MapReduce.
On each computer, extract the URL and the count from the log file, and sort by URL. Because the log files are larger than will fit into memory, you need to do an on-disk merge. That entails reading a chunk of the log file, sorting by URL, writing the chunk to disk. Reading the next chunk, sorting, writing to disk, etc. At some point, you have M log file chunks, each sorted. You can then do an M-way merge. But instead of writing items to disk, you present them, in sorted order (sorted by URL, that is), to the "master".
Each machine sorts its own log.
The "master" computer merges the data from the separate computers and does the top K selection. This is actually two problems, but can be combined into one.
The master creates two priority queues: one for the merge, and one for the top K selection. The first is of size N, where N is the number of computers it's merging data from. The second is of size K: the number of items you want to select. I use a min heap for this, as it's easy and reasonably fast.
To set up the merge queue, initialize the queue and get the first item from each of the "worker" computers. In the pseudo-code below, "get lowest item from merge queue" means getting the root item from the merge queue and then getting the next item from whichever working computer presented that item. So if the queue contains [1, 2, 3], and the items came from computers B, C, A (in that order), then taking the lowest item would mean getting the next item from computer B and adding it to the priority queue.
The master then does the following:
working = get lowest item from merge queue
while (items left to merge)
{
temp = get lowest item from merge queue
while (temp.url == working.url)
{
working.count += temp.count
temp = get lowest item from merge queue
}
// Now have merged counts for one url.
if (topK.Count < desired_count)
{
// topK queue doesn't have enough items yet.
// so add this one.
topK.Add(working);
}
else if (topK.Peek().count < working.count)
{
// the count for this url is larger
// than the smallest item on the heap
// replace smallest on the heap with this one
topK.RemoveRoot()
topK.Add(working)
}
working = temp;
}
// Here you need to check the last item:
if (topK.Peek().count < working.count)
{
// the count for this url is larger
// than the smallest item on the heap
// replace smallest on the heap with this one
topK.RemoveRoot()
topK.Add(working)
}
At this point, the topK queue has the K items with the highest counts.
So each computer has to do a merge sort, which is O(n log n), where n is the number of items in that computer's log. The merge on the master is O(n), where n is the sum of all the items from the individual computers. Picking the top k items is O(n log k), where n is the number of unique urls.
The sorts are done in parallel, of course, with each computer preparing its own sorted list. But the "merge" part of the sort is done at the same time the master computer is merging, so there is some coordination, and all machines are involved at that stage.
Given the scale of the log files and the generic nature of the question, this is quite a difficult problem to solve. I do not think that there is one best algorithm for all situations. It depends on the nature of the contents of the log files. For example, take the corner case that all URLs are all unique in all log files. In that case, basically any solution will take a long time to draw that conclusion (if it even gets that far...), and there is not even an answer to your question because there is no top-ten.
I do not have a watertight algorithm that I can present, but I would explore a solution that uses histograms of hash values of the URLs as opposed to the URLs themselves. These histograms can be calculated by means of one-pass file reads, so it can deal with arbitrary size log files. In pseudo-code, I would go for something like this:
Use a hash function with a limited target space (say 10,000, note that colliding hash-values are expected) to calculate the hash value of each item in the log file and count how many times each of the has value occurs. Communicate the resulting histogram to a server (although it is probably also possible to avoid a central server at all by multicasting the result to every other node -- but I will stick with the more obvious server-approach here)
The server should merge the histograms and communicate the result back. Depending on the distribution of the URLs, there might be a number of clearly visible peaks already, containing the top-visited URLs.
Each of the nodes should then focus on the peaks in the histogram. It should go trough its log file again, use an additional hash function (again with a limited target space) to calculate a new hash-histogram for those URLs that have their first hash value in one of the peaks (the number of peaks to focus on would be a parameter to be tuned in the algorithm, depending on the distribution of the URLs), and calculate a second histogram with the new hash values. The result should be communicated to the server.
The server should merge the results again and analyse the new histogram versus the original histogram. Depending on clearly visible peaks, it might be able to draw conclusions about the two hash values of the top ten URLs already. Or it might have to instruct the machines to calculate more hash values with the second hash function, and probably after that go through a third pass of hash-calculations with yet another hash function. This has to continue until a conclusion can be drawn from the collective group of histograms what the hash values of the peak URLs are, and then the nodes can identify the different URLs from that.
Note that this mechanism will require tuning and optimization with regard to several aspects of the algorithm and hash-functions. It will also need orchestration by the server as to which calculations should be done at any time. It probably will also need to set some boundaries in order to conclude when no conclusion can be drawn, in other words when the "spectrum" of URL hash values is too flat to make it worth the effort to continue calculations.
This approach should work well if there is a clear distribution in the URLs though. I suspect that, practically speaking, the question only makes sense in that case anyway.
Assuming the conditions below are true:
You need the top n urls of m hosts.
You can't store the files in RAM
There is a master node
I would take the approach below:
Each node reads a portion of the file (ie. MAX urls, where MAX can be, let's say, 1000 urls) and keeps an array arr[MAX]={url,hits}.
When a node has read MAX urls off the file, it sends the list to the master node, and restarts reads until MAX urls is reached again.
When a node reaches the EOF, he sends the remaining list of urls and an EOF flag to the master node.
When the master node receives a list of urls, it compares it with its last list of urls and generates a new, updated one.
When the master node receives the EOF flag from every node and finishes reading his own file, the top n urls of the last version of his list are the ones we're looking for.
Or
A different approach that would release the master from doing all the job could be:
Every node reads its file and stores an array same as above, reading until EOF.
When EOF, the node will send the first url of the list and the number of hits to the master.
When the master has collected the first url and number of hits for each node, it generates a list. If the master node has less than n urls, it will ask the nodes to send the second one and so on. Until the master has the n urls sorted.
Pre-processing: Each computer system processes complete log file and prepares Unique URLs list with count against them.
Getting top URLs:
Calculate URL counts at each computer system
Collating process at a central system(Virtual)
Send URLs with count to a central processing unit one by one in DESC order(i.e from top most)
At central system collate incoming URL details
Repeat until sum of all the counts from incoming URLs is less than count of Tenth URL in the master list. A vital step to be absolutely certain
PS: You'll have top ten URLs across systems not necessarily in that order. To get the actual order you can reverse collation. For a given URL on top ten get individual count from dist-computers and form final order.
On each node count the number of occurrences of URL.
Then use a sharding function to distribute the url to another node which owns the key for URL. Now each node will have unique keys.
On Each node then again reduce to get the number occurrences of URLs and then find the top N URLs. Finally send only top N urls to master node which will find the top N URls among K*N items where K is number of node.
Eg: K=3
N1 - > url1,url2,url3,url1,url2
N2 - > url2,url4,url1,url5,url2
N3 - > url1,url4,url3,url1,url3
Step 1: Count the occurrence per url in each node.
N1 -> (url1,2),(url2,2),(url3,1)
N2 -> (url4,1),(url2,2),(url5,1),(url1,1)
N3 -> (url1,2),(url3,2),(url4,1)
Step 2: Sharding use hash function(for simplicity, let it be url number % K)
N1 -> (url1,2),(url1,1),(url1,2),(url4,1),(url4,1)
N2 -> (url2,2),(url2,2),(url5,1)
N3 -> (url3,2),(url3,1)
Step 4: Find the number of occurrences of each key within the node again.
N1 -> (url1,5),(url4,2)
N2 -> (url2,4),(url5,1)
N3 -> (url3,3)
Step 5: Send only top N to master. Let N=1
Master -> (url1,5),(url2,4),(url3,3)
Sort the result and get top 1 item which is url1
Step 1 is called map side reduce and it is done to avoid huge shuffle which will occur in Step2.
The below description is the idea for the solution. it is not a pseudocode.
Consider you have a collection of systems.
1.for each A: Collections(systems)
1.1) Run a daemonA in each computer which probes on the log file for changes.
1.2) When a change is noticed, wakeup AnalyzerThreadA
1.3) If AnalyzerThreadA finds a URL using some regex, then update localHashMapA with count++.
(key = URL, value = count ).
2) Push topTen entries of localHashMapA to ComputerA where AnalyzeAll daemon will be running.
The above step will be the last step in each system, which will push topTen entries to a master system, say for example: computerA.
3) AnalyzeAll running in computerA will resolve duplicates and update count in masterHashMap of URLs.
4) Print the topTen from masterHashMap.

MapReduce - how do I calculate relative values (average, top k and so)?

I'm looking for a way to calculate "global" or "relative" values during a MapReduce process - an average, sum, top etc. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know who are the workers who earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out.
If I have all values sent into a single reducer, it has that global view, but then I loose concurrency, and it seems awkward. Is there a better way?
(The framework I'd like to use is Google's, but I'm trying to figure out the technique - no framework specific tricks please)
My first thought is to do something like this:
MAP: Use some dummy value as the key, maybe the empty string for efficiency, and create class that holds both a salary and an employee ID. In each Mapper, create an array that holds 10 elements. Fill it up with the first ten salaries you see, sorted (so location 0 is the highest salary, location 9 is the 10th highest). For every salary after that, see if it is in the top ten and if it is, insert it in the correct location and then move the lower salaries down, as appropriate.
Combiner/Reducer: merge sort the lists. I'd basically do the same thing as in the mapper by creating a ten element array and then loop over all the arrays that match the key, merging them in according to the same comparison/replace/move down sequence as in the mapper
If you run this with one reducer, it should ensure that the top 10 salaries are output.
I don't see a way to do this while using more than one reducer. If you use a combiner, then the reducer should only have to merge a ten-element array for each node that ran mappers (which should be manageable unless you're running on thousands of nodes).
[Edit: I misunderstood. Update for the Top 10% ]
For doing something that relates to the "total" there is no other way than determining the total first and then do the calculations.
So the "Top 10% salaries" could be done roughly as follows:
Determine total:
MAP: Identity
REDUCE: Aggregate the info of all records that go through and create a new "special" record with the "total". Note that you would like to scale
This can also be done by letting the MAP output 2 records (data, total) and the reducer then only touches the "total" records by aggregating.
Use total:
MAP: Prepare for SecondarySort
SHUFFLE/SORT: Sort the records so that the records with the "total" are first into the reducer.
REDUCE: Depending on your implementation the reducer may get a bunch of these total records (aggregate them) and for all subsequent records determine where they are in relation to everything else.
The biggest question on this kind of processing is: Will is scale?
Remember you are breaking the biggest "must have" for scaling out: Independent chunks of information. This makes them dependent around the "total" value.
I expect a technically different way of making the "total" value available to the second step is essential in making this work on "big data".
The "Hadoop - The definitive Guide" book by Tom White has a very good chapter on Secondary Sort.
I would do something like this
The mapper will use an UUID as part of the key, created in the setup() method of the mapper. The mapper emits as key, the UUID appended with either 0 or the salary. The mapper accumulates the count and total.
In the cleanup() method, the mapper emits UUID appended with 0 as the key and the count and total as the value. In the map() method, the mapper emits the UUID appended with salary as the key and salary as the value.
Since the keys are sorted, the first call to combiner will have the count and total as the value. The combiner could store them as class members. We could also find out what 10% of total count is and save that as well as class member (call it top). We initialize a list and save it as a class member.
Subsequent calls to combiner will contain the salary as the value, arriving in sorted order. We add the value to the list and at the same time increment a counter. When the counter reaches the value top, we don't store any more values in our list. We ignore values in rest of the combiner calls.
In the combiner cleanup(), we do the emit. The combiner will emit only the UUID as the key. The value will contain count and total followed by the top 10% of the values. So the output of the combiner will have partial results, based on the subset of the data that passed through the mapper.
The reducer will be called as many times as the number of mappers in this case, because each mapper/combiner emits only one key.
The reducer will accumulate the counts, totals and the top 10% values in the reduce() method. In the cleanup() method, the average is calulated. The top 10% is also calculated in the cleanup() method from the aggregation of top 10% arriving in each call of the reducer. This is basically a merge sort.
The reducer cleanup() method could do multiple emits, so that average is in the first row, followed by the top 10% of salaries in the subsequent rows.
Finally, to ensure that final aggregate statistics are global, you have to set the number of reducers to one.
Since there is data accumulation and sorting in the reducer, although on partial data set, there may be memory issues.

What is the best way to analyse a large dataset with similar records?

Currently I am loooking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have parameters "calling party", "called party", "call duration" and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers but occasionaly a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate the which records (row number) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although the fact is not programming related I would like to emphasize that due to law and respect for user privacy all the informations that could possibly reveal the user identity have been hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max (call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Althought it may be tempting to create a two dimensional array for (calling_party, called_party), that will he a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution as High Performance Mark says. For starters, it's be nice to know the count of unique phone numbers, the count of unique phone number pairs, and, the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2d? Should be easy to indentify indirect links - e.g. X is near {A, B, C} all of whom are near Y so X is sorta near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.

How to compute the absolute minimum amount of changes to convert one sortorder into another?

Goal
How to encode the data that describes how to re-order a static list from a one order to another order using the minimum amount of bytes possible?
Original Motivation
Originally this problem arose while working on a problem relaying sensor data using expensive satellite communication. A device had a list of about 1,000 sensors they were monitoring. The sensor list couldn't change. Each sensor had a unique id. All data was being logged internally for eventual analysis, the only thing that end users needed daily was which sensor fired in which order.
The entire project was scrapped, but this problem seems too interesting to ignore. Also previously I talked about "swaps" because I was thinking of the sorting algorithm, but really it's the overall order that's important, the steps required to arrive at that order probably wouldn't matter.
How the data was ordered
In SQL terms you could think of it like this.
**Initial Load**
create table sensor ( id int, last_detected datetime, other stuff )
-- fill table with ids of all sensors for this location
Day 0: Select ID from Sensor order by id
(initially data is sorted by the sensor.id because its a known value)
Day 1: Select ID from Sensor order by last_detected
Day 2: Select ID from Sensor order by last_detected
Day 3: Select ID from Sensor order by last_detected
Assumptions
The starting list and ending list is composed of the exact same set of items
Each sensor has a unique id (32 bit integer)
The size of the list will be approximately 1,000 items
Each sensors may fire multiple times per minute or not at all for days
Only the change in ID sort order needs to be relayed.
Computation resources for figuring optimal solutions is cheap / unlimited
Data costs are expensive, roughly a dollar per kilobyte.
Data could only be sent as whole byte (octet) increments
The Day 0 order is known by the sender and receiver to start with
For now assume the system functions perfectly and no error checking is required
As I said the project/hardware is no more so this is now just an academic problem.
The Challenge!
Define an Encoder
Given A. Day N sort order
Given B. Day N + 1 sort order
Return C. a collection of bytes that describe how to convert A to B using the least number of bytes possible
Define a Decoder
Given A. Day N sort order
Given B. a collection of bytes
Return C. Day N + 1 sort order
Have fun everyone.
As an academic problem, one approach would be to look at Algorithm P section 3.3.2 of Vol II of Knuth's the art of computer programming, which maps every permutation on N objects into an integer between 0 and N!-1. If every possible permutation is equally likely at any time, then the best you can do is to compute and transmit this (multi-precision) integer. In practice, giving each sensor a 10-bit number and then packing those 10 bit numbers up so you have e.g. 4 numbers packed into each chunk of 5 bytes would do almost as well.
Schemes based on diff or off the shelf compression make use of knowledge that not all permutations are equally likely. You may have knowledge of this based on the equipment, or you could see if this is case by looking at previous data. Fine if it works. In some cases with sensors and satellites you might want to worry about rare exceptions where you get worst case behaviour of your compression scheme and you suddenly have more data to transmit than you bargained for.

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now I am not trying to ask how to use random number generators, I am asking a statistical question, I know how to generate random numbers, but my question is how do I ensure that this the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows or 'real' data has been inserted (3000/20 = 150). However I do not want it to be as accurate as that as I do not want the control rows to be identifiable simply based on their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they get close to each other if you do it really random :)
But What I would do is:
You have N rows of real data and x of control data
To get an index of a row you should insert i-th control row, I'd use: N/(x+1) * i + r, where r is some random number, diffrent for each of the control rows, small compared to N/x. Choose any way of determining r, it can be either gaussian or even flat distribution. i is an index of the control row, so it's 1<=i<x
This way you can be sure that you avoid condensation of your control rows in one single place. Also you can be sure that they won't be in regular distances from each other.
Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there.
for (int i=0; i<numberOfExistingRows; i++)
{
int r = random();
if (r > 0.5)
{
InsertRandomData();
}
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with example than with english)
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows you'd insert one at each 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a gaussian deviate with mean set to the real data size divided by the control data size, the former of which could be estimated if necessary, rather than measured or assumed known. Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate. Smaller stddev means a more leptokurtic distribution means tighter adherence to uniform spacing. Larger stdev means a more platykurtic distribution and looser adherence to uniform spacing.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well (simply: keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data. If the index just before this was off the end, generate data at the end.
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject whose title is "Simulation".
Also: in some circumstance non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outline above still applies using half the mean of whatever distribution seems right.

Resources