Design an algorithm to do queries within a log file

I was recently asked this question in an interview:
You are given a log file in which each entry holds a customer id and the id of a page visited by that customer. Given such log files for 3 consecutive days, design an algorithm to find the customers who visited the site on exactly 2 of the 3 days and visited at least 3 distinct pages. Discuss the design and its space complexity.
I thought of this approach:
1) Sort the 3 log files.
2) Look at the top of each file: find matching customer ids and count the number of different days on which that customer visited. If the same customer visited on all 3 days, discard that customer and move on to the next customer in each file. Otherwise, if the customer visited on exactly 2 different days, keep storing their page ids in a set. Finally, check whether this set has at least 3 distinct pages.
But I am not sure whether this approach is the most efficient. Also, what approach would scale well if I have to find customers who visited on k out of K days and visited exactly N distinct pages? I thought that k-d trees (2-d trees) may help because they can handle multiple queries.
So, what would be the best data structure and algorithm to handle these kinds of queries?

Make a list of (Id,day,page) tuples. Sort lexicographically. Now each customer's accesses are consecutive in the list, and the questions are easy to answer with a single scan. The time complexity is that of sorting the data - O(n log n).
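A minimal sketch of that sort-then-scan idea, assuming the three logs have already been read into (customer, day, page) tuples that fit in memory. The function name and the sample data are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def qualifying_customers(records, days_required=2, min_pages=3):
    """records: iterable of (customer_id, day, page_id) tuples."""
    result = []
    # Sorting lexicographically makes each customer's accesses consecutive.
    for customer, group in groupby(sorted(records), key=itemgetter(0)):
        days, pages = set(), set()
        for _, day, page in group:
            days.add(day)
            pages.add(page)
        if len(days) == days_required and len(pages) >= min_pages:
            result.append(customer)
    return result

sample = [
    ("alice", 1, "p1"), ("alice", 2, "p2"), ("alice", 2, "p3"),  # 2 days, 3 pages
    ("bob", 1, "p1"), ("bob", 2, "p2"), ("bob", 3, "p3"),        # 3 days
    ("carol", 1, "p1"), ("carol", 3, "p1"), ("carol", 3, "p2"),  # 2 days, 2 pages
]
print(qualifying_customers(sample))  # ['alice']
```

Only one customer's state is live at a time during the scan, so the extra space beyond the sorted list is small.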

I think I'd approach it using two HashMaps. The first hash would map from customer id to a Set containing the days visited. The second hash would map from customer id to a Set of pages visited. After passing through the log files once, you'll have populated both hashes. You'll then have to pass through the first hash once to determine who visited on exactly two days (using the Set size to do that). For every customer who passes the first test, look up the customer id in the second hash and see who visited at least 3 distinct pages (again using the Set size).
This would be fairly quick but require a good amount of memory.
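Roughly, the two-hash version could look like the sketch below. It assumes the records are (customer, day, page) tuples; the names are illustrative only.

```python
from collections import defaultdict

def customers_by_hashing(records, days_required=2, min_pages=3):
    days_seen = defaultdict(set)   # customer id -> set of days visited
    pages_seen = defaultdict(set)  # customer id -> set of pages visited
    for customer, day, page in records:
        days_seen[customer].add(day)
        pages_seen[customer].add(page)
    # Second pass over the first hash, then a lookup in the second one.
    return sorted(
        customer
        for customer, days in days_seen.items()
        if len(days) == days_required and len(pages_seen[customer]) >= min_pages
    )

log = [
    (7, 1, "home"), (7, 1, "cart"), (7, 3, "pay"),  # 2 days, 3 pages
    (8, 1, "home"), (8, 2, "home"), (8, 3, "home"),  # 3 days, 1 page
    (9, 2, "home"), (9, 2, "cart"), (9, 2, "pay"),   # 1 day, 3 pages
]
print(customers_by_hashing(log))  # [7]
```

The memory cost is one day set and one page set per customer, which is what makes this approach heavier than a sorted scan.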

You only need to remember minimal state per client:
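struct client_state {
    unsigned day1 :1;   /* set once the client is seen on day 1 */
    unsigned day2 :1;
    unsigned day3 :1;
    unsigned npage :2;  /* number of distinct pages seen, capped at 3 */
    page_id pages[2];   /* the first two distinct page ids */
};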
For every {client_id, day, page_id} tuple you have to perform the following steps:
0) lookup client record (see below)
1) switch(day) {
case day1: day1=1; break;
case day2: day2=1; break;
case day3: day3=1; break;
}
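2) switch(npage) {
case 3: break;
case 2: if (pages[1] == page_id) break; /* else fall through */
case 1: if (pages[0] == page_id) break; /* else fall through */
case 0: if (npage < 2) pages[npage] = page_id; /* only the first two ids are stored */
npage++; break;
}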
If everything fits in memory, the state can be kept in a hash table with client_id as the key. A second, final pass is needed: inspect all state records and filter on {day1,day2,day3} (exactly two set) and (npage==3). The complexity is two sequential passes, O(N).
Otherwise, the file has to be sorted, O(N log N), and the sorted file can be used to sequentially update the state record (we only need one!). Filtering can be done at end-of-group.

This is a data base question.
You have 2 entities: Customer(ID) and Page(ID).
You also have 1 relationship between them that we will call Visited with a date attribute.
But if we must...
Use a Set to store the different Page ID's.
Use a Dictionary with the key as Customer ID and value will have 2 properties:
1) A Set of Page ID's.
2) A Set of Date's.
This model gives you access to each Customer in O(log n), and gives both the number of different Dates on which that Customer visited and the number of Pages in O(1).

Related

Using hashes to round-robin objects between buckets?

We have an incoming queue of support cases from customers.
Each support case has at least the following fields:
Case Number (Unique numeric ID)
Creation Time (Timestamp)
Title (Text String)
Description (Text String)
Other fields...
We'd like to split these into three distinct buckets, in a repeatable way.
For example, the first case that comes in goes to queue A, the second to queue B, the third to queue C etc.
It doesn't necessarily need to be in that order, but the distribution needs to be equal (or close to equal).
The case number is monotonically increasing but they are not sequential (that is, there will be gaps - e.g. case 10005, then case 10400, then case 10405 etc.). The reason is that the case numbers are shared among several categories, but we are only looking at a single specific category of cases.
We don't want to have to maintain a lookup table - but rather I was thinking of generating some kind of hash based on case number + creation time, for example, and then doing a modulus 3 on it?
Question
Does the above approach look sane? Any comments?
What sort of hashing algorithm, and on what fields should I do it across, in order to get a good distribution for the modulus?
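For reference, a minimal sketch of the hash-then-modulus idea described above. It assumes the case number alone is used as the key (it is already unique, so the creation time adds nothing) and uses SHA-256 only because it is stable across runs and machines. The queue names and the function name are made up.

```python
import hashlib

QUEUES = ("A", "B", "C")

def queue_for(case_number):
    # A stable digest of the case number; Python's built-in hash() is
    # avoided because it is not guaranteed to be stable across processes.
    digest = hashlib.sha256(str(case_number).encode("ascii")).digest()
    return QUEUES[int.from_bytes(digest[:8], "big") % len(QUEUES)]

# Case numbers with gaps, as in the question.
buckets = [queue_for(n) for n in range(10005, 10005 + 7 * 3000, 7)]
print({q: buckets.count(q) for q in QUEUES})
```

A hash spreads cases evenly only in the statistical sense. It is repeatable without a lookup table, but it is not a strict A, B, C rotation.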

Parallel top ten algorithm for distributed data

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.
For example: Suppose there are only 3 computers and we need the top two most visited URLs.
Computer A: url1, url2, url1, url3
Computer B: url4, url2, url1, url1
Computer C: url3, url4, url1, url3
url1 appears 5 times in all logs
url2 2
url3 3
url4 2
So the answer is url1, url3
The log files are too large to fit in RAM or to copy over the network. As I understand it, it is also important to make the computation parallel and use all the given computers.
How would you solve it?
This is a pretty standard problem for which there is a well-known solution. You simply sort the log files on each computer by URL and then merge them through a priority queue of size k (the number of items you want) on the "master" computer. This technique has been around since the 1960s, and is still in use today (although slightly modified) in the form of MapReduce.
On each computer, extract the URL and the count from the log file, and sort by URL. Because the log files are larger than will fit into memory, you need to do an on-disk merge. That entails reading a chunk of the log file, sorting by URL, writing the chunk to disk. Reading the next chunk, sorting, writing to disk, etc. At some point, you have M log file chunks, each sorted. You can then do an M-way merge. But instead of writing items to disk, you present them, in sorted order (sorted by URL, that is), to the "master".
Each machine sorts its own log.
The "master" computer merges the data from the separate computers and does the top K selection. This is actually two problems, but can be combined into one.
The master creates two priority queues: one for the merge, and one for the top K selection. The first is of size N, where N is the number of computers it's merging data from. The second is of size K: the number of items you want to select. I use a min heap for this, as it's easy and reasonably fast.
To set up the merge queue, initialize the queue and get the first item from each of the "worker" computers. In the pseudo-code below, "get lowest item from merge queue" means getting the root item from the merge queue and then getting the next item from whichever working computer presented that item. So if the queue contains [1, 2, 3], and the items came from computers B, C, A (in that order), then taking the lowest item would mean getting the next item from computer B and adding it to the priority queue.
The master then does the following:
// "get lowest item from merge queue" is assumed to return null
// once every worker has been exhausted.
working = get lowest item from merge queue
while (working != null)
{
    temp = get lowest item from merge queue
    while (temp != null && temp.url == working.url)
    {
        working.count += temp.count
        temp = get lowest item from merge queue
    }
    // Now have merged counts for one url.
    if (topK.Count < desired_count)
    {
        // topK queue doesn't have enough items yet,
        // so add this one.
        topK.Add(working)
    }
    else if (topK.Peek().count < working.count)
    {
        // the count for this url is larger
        // than the smallest item on the heap:
        // replace smallest on the heap with this one
        topK.RemoveRoot()
        topK.Add(working)
    }
    // The last url is handled by the final iteration of this loop.
    working = temp
}
At this point, the topK queue has the K items with the highest counts.
So each computer has to do a merge sort, which is O(n log n), where n is the number of items in that computer's log. The merge on the master is O(n log N), where n is the sum of all the items from the individual computers and N is the number of computers feeding the merge queue (effectively O(n) for a small, fixed N). Picking the top k items is O(n log k), where n is the number of unique urls.
The sorts are done in parallel, of course, with each computer preparing its own sorted list. But the "merge" part of the sort is done at the same time the master computer is merging, so there is some coordination, and all machines are involved at that stage.
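As a small single-process illustration of the master's job, here is a sketch that assumes each worker is represented by an in-memory list of (url, count) pairs already sorted by URL. In a real setup these would be streams read over the network, and the function name is made up.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def top_k_urls(sorted_streams, k):
    """sorted_streams: one iterable per worker of (url, count), sorted by url."""
    merged = heapq.merge(*sorted_streams)  # N-way merge over the workers
    top = []                               # min-heap of (count, url), size <= k
    for url, group in groupby(merged, key=itemgetter(0)):
        total = sum(count for _, count in group)  # merged count for one url
        if len(top) < k:
            heapq.heappush(top, (total, url))
        elif top[0][0] < total:
            heapq.heapreplace(top, (total, url))  # evict the smallest
    return sorted(top, reverse=True)

# Per-computer counts from the example in the question.
a = [("url1", 2), ("url2", 1), ("url3", 1)]
b = [("url1", 2), ("url2", 1), ("url4", 1)]
c = [("url1", 1), ("url3", 2), ("url4", 1)]
print(top_k_urls([a, b, c], 2))  # [(5, 'url1'), (3, 'url3')]
```

`heapq.merge` plays the role of the merge queue and the `top` list is the size-k min-heap, so the master never holds more than one item per worker plus k results.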
Given the scale of the log files and the generic nature of the question, this is quite a difficult problem to solve. I do not think that there is one best algorithm for all situations; it depends on the nature of the contents of the log files. For example, take the corner case in which all URLs are unique across all log files. In that case, basically any solution will take a long time to draw that conclusion (if it even gets that far...), and there is not even an answer to your question because there is no top ten.
I do not have a watertight algorithm that I can present, but I would explore a solution that uses histograms of hash values of the URLs as opposed to the URLs themselves. These histograms can be calculated by means of one-pass file reads, so it can deal with arbitrary size log files. In pseudo-code, I would go for something like this:
Use a hash function with a limited target space (say 10,000; note that colliding hash values are expected) to calculate the hash value of each item in the log file, and count how many times each hash value occurs. Communicate the resulting histogram to a server (it is probably also possible to avoid a central server altogether by multicasting the result to every other node, but I will stick with the more obvious server approach here).
The server should merge the histograms and communicate the result back. Depending on the distribution of the URLs, there might be a number of clearly visible peaks already, containing the top-visited URLs.
Each of the nodes should then focus on the peaks in the histogram. It should go through its log file again and, for those URLs whose first hash value falls in one of the peaks (the number of peaks to focus on would be a parameter to be tuned in the algorithm, depending on the distribution of the URLs), use an additional hash function (again with a limited target space) to calculate a second histogram with the new hash values. The result should be communicated to the server.
The server should merge the results again and analyse the new histogram versus the original histogram. Depending on clearly visible peaks, it might be able to draw conclusions about the two hash values of the top ten URLs already. Or it might have to instruct the machines to calculate more hash values with the second hash function, and probably after that go through a third pass of hash-calculations with yet another hash function. This has to continue until a conclusion can be drawn from the collective group of histograms what the hash values of the peak URLs are, and then the nodes can identify the different URLs from that.
Note that this mechanism will require tuning and optimization with regard to several aspects of the algorithm and hash-functions. It will also need orchestration by the server as to which calculations should be done at any time. It probably will also need to set some boundaries in order to conclude when no conclusion can be drawn, in other words when the "spectrum" of URL hash values is too flat to make it worth the effort to continue calculations.
This approach should work well if there is a clear distribution in the URLs though. I suspect that, practically speaking, the question only makes sense in that case anyway.
Assuming the conditions below are true:
You need the top n urls of m hosts.
You can't store the files in RAM
There is a master node
I would take the approach below:
Each node reads a portion of the file (i.e. MAX urls, where MAX can be, let's say, 1000 urls) and keeps an array arr[MAX]={url,hits}.
When a node has read MAX urls off the file, it sends the list to the master node and resumes reading until MAX urls is reached again.
When a node reaches EOF, it sends the remaining list of urls and an EOF flag to the master node.
When the master node receives a list of urls, it compares it with its last list of urls and generates a new, updated one.
When the master node has received the EOF flag from every node and has finished reading its own file, the top n urls of the last version of its list are the ones we're looking for.
Or
A different approach that would release the master from doing all the job could be:
Every node reads its file and stores an array same as above, reading until EOF.
When EOF, the node will send the first url of the list and the number of hits to the master.
When the master has collected the first url and number of hits from each node, it generates a list. If the master node has fewer than n urls, it asks the nodes to send the second one, and so on, until the master has the n urls sorted.
Pre-processing: Each computer system processes complete log file and prepares Unique URLs list with count against them.
Getting top URLs:
Calculate URL counts at each computer system
Collating process at a central system(Virtual)
Send URLs with count to a central processing unit one by one in DESC order(i.e from top most)
At central system collate incoming URL details
Repeat until the sum of all the counts from the incoming URLs is less than the count of the tenth URL in the master list. This is a vital step to be absolutely certain of the result.
PS: You'll have top ten URLs across systems not necessarily in that order. To get the actual order you can reverse collation. For a given URL on top ten get individual count from dist-computers and form final order.
On each node, count the number of occurrences of each URL.
Then use a sharding function to send each URL to the node that owns the key for that URL. Each node now holds a disjoint set of keys.
On each node, reduce again to get the total number of occurrences of each URL, and then find the top N URLs. Finally, send only the top N URLs to the master node, which finds the top N URLs among K*N items, where K is the number of nodes.
Eg: K=3
N1 - > url1,url2,url3,url1,url2
N2 - > url2,url4,url1,url5,url2
N3 - > url1,url4,url3,url1,url3
Step 1: Count the occurrence per url in each node.
N1 -> (url1,2),(url2,2),(url3,1)
N2 -> (url4,1),(url2,2),(url5,1),(url1,1)
N3 -> (url1,2),(url3,2),(url4,1)
Step 2: Sharding use hash function(for simplicity, let it be url number % K)
N1 -> (url1,2),(url1,1),(url1,2),(url4,1),(url4,1)
N2 -> (url2,2),(url2,2),(url5,1)
N3 -> (url3,2),(url3,1)
Step 3: Find the number of occurrences of each key within the node again.
N1 -> (url1,5),(url4,2)
N2 -> (url2,4),(url5,1)
N3 -> (url3,3)
Step 4: Send only top N to master. Let N=1
Master -> (url1,5),(url2,4),(url3,3)
Sort the result and get top 1 item which is url1
Step 1 is called a map-side reduce; it is done to avoid the huge shuffle that would otherwise occur in Step 2.
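The worked example above can be replayed in a single process as a rough sketch. The nodes are simulated with plain lists, the toy sharding rule "url number % K" is taken from the example, and the function names are made up.

```python
from collections import Counter

def shard_of(url, k):
    # Toy sharding rule from the example: url number % K.
    return int(url[3:]) % k

def top_n_sharded(node_logs, n):
    k = len(node_logs)
    # Map-side reduce: count occurrences locally on each node.
    local_counts = [Counter(log) for log in node_logs]
    # Shuffle: send every (url, count) to the node that owns the url,
    # summing the partial counts there.
    shards = [Counter() for _ in range(k)]
    for counts in local_counts:
        for url, count in counts.items():
            shards[shard_of(url, k)][url] += count
    # Each node sends only its own top n to the master.
    candidates = [item for shard in shards for item in shard.most_common(n)]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:n]

logs = [
    ["url1", "url2", "url3", "url1", "url2"],
    ["url2", "url4", "url1", "url5", "url2"],
    ["url1", "url4", "url3", "url1", "url3"],
]
print(top_n_sharded(logs, 1))  # [('url1', 5)]
```

Because each URL lives on exactly one shard after the shuffle, a node's local top n is computed from complete counts, which is why sending only n items per node is enough.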
The description below is the idea for the solution; it is not pseudocode.
Consider you have a collection of systems.
1.for each A: Collections(systems)
1.1) Run a daemonA in each computer which probes on the log file for changes.
1.2) When a change is noticed, wakeup AnalyzerThreadA
1.3) If AnalyzerThreadA finds a URL using some regex, then update localHashMapA with count++.
(key = URL, value = count ).
2) Push topTen entries of localHashMapA to ComputerA where AnalyzeAll daemon will be running.
The above step will be the last step in each system, which will push topTen entries to a master system, say for example: computerA.
3) AnalyzeAll running in computerA will resolve duplicates and update count in masterHashMap of URLs.
4) Print the topTen from masterHashMap.

Have/Want List Matching Algorithm

I am implementing an item trading system on a high-traffic site. I have a large number of users that each maintain a HAVE list and a WANT list for a number of specific items. I am looking for an algorithm that will allow me to efficiently suggest trading partners based on your HAVEs and WANTs matched with theirs. Ideally I want to find partners with the highest mutual trading potential (i.e. I have a ton of things you want, you have a ton of things I want). I don't need to find the global highest-potential pair (which sounds hard), just find the highest-potential pairs for a given user (or even just some high-potential pairs, not the global max).
Example:
User 1 HAS A,C WANTS B,D
User 2 HAS D WANTS A
User 3 HAS A,B,D WANTS C
User 1 goes to the site and clicks a button that says
"Find Trading Partners" and the top-ranked result is
User 3, followed by User 2.
An additional source of complexity is that the items have different values, and I want to match on the highest valued trade possible, rather than on the most number of matches between two traders. So in the example above, if all items are worth 1, but A and D are both worth 10, User 1 now gets matched with User 2 above User 3.
A naive way to do this would be to compute the max trade value between the user looking for partners and every other user in the database. I'm thinking that with some lookup tables on the right things I might be able to do better. I've tried googling around, since this seems like a classical problem, but I don't know the name for it.
Can anyone recommend a good approach to solving this problem? I've seen sites like the Magic Online Trading League that seem to solve it in realtime.
You could do this in O(n*k^2) (n is the number of people, k is the average number of items they have/want) by keeping hash tables (or, in a database, indexes) of all the people who have and want given items, then giving scores for all the people who have items the current user wants, and want items the current user has. Display the top 10 or 20 scores.
[Edit] Example of how this would be implemented in SQL:
-- Get score for @userid wants
SELECT UserHas.UserID, SUM(Items.Weight) AS Score
FROM UserWants
INNER JOIN UserHas ON UserWants.ItemID = UserHas.ItemID
INNER JOIN Items ON Items.ItemID = UserWants.ItemID
WHERE UserWants.UserID = @userid
GROUP BY UserWants.UserID, UserHas.UserID
This gives you a list of other users and their score, based on what items they have that the current user wants. Do the same for items the current user has the others want, then combine them somehow (add the scores or whatever you want) and grab the top 10.
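The same scoring can be sketched outside a database with two inverted indexes. This assumes HAVE and WANT lists are held as dicts of sets, that the two directional scores are simply added, and that items without an explicit value are worth 1. All names are illustrative.

```python
from collections import defaultdict

def rank_partners(user, has, wants, value):
    # Inverted indexes: item -> users who have it / users who want it.
    # In a live system these would be maintained incrementally.
    who_has, who_wants = defaultdict(set), defaultdict(set)
    for u, items in has.items():
        for item in items:
            who_has[item].add(u)
    for u, items in wants.items():
        for item in items:
            who_wants[item].add(u)

    scores = defaultdict(int)
    for item in wants[user]:           # things the user wants that others have
        for other in who_has[item] - {user}:
            scores[other] += value.get(item, 1)
    for item in has[user]:             # things the user has that others want
        for other in who_wants[item] - {user}:
            scores[other] += value.get(item, 1)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

has = {1: {"A", "C"}, 2: {"D"}, 3: {"A", "B", "D"}}
wants = {1: {"B", "D"}, 2: {"A"}, 3: {"C"}}
print(rank_partners(1, has, wants, {}))                  # [(3, 3), (2, 2)]
print(rank_partners(1, has, wants, {"A": 10, "D": 10}))  # [(2, 20), (3, 12)]
```

With all items worth 1, User 3 ranks first. With A and D worth 10, User 2 moves ahead, matching the example in the question.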
This problem looks pretty similar to the stable roommates problem. I don't see anything wrong with the SQL implementation that got the highest votes, but as someone else suggested, this is like a dating/matchmaking problem along the lines of the stable marriage problem, except that here all the participants are in one pool.
The second wikipedia entry also has a link to a practical solution in javascript which could be useful
You could maintain a per-item list (as a complement to the per-user list). Item search is then spot on. Now you can allow yourself a brute-force search for the most valuable pair by checking the most valuable items first. If you want a more complex (arguably faster) search, you could introduce sets of items that often come together as meta-items, and look for them first.
Okay, what about this:
There are basically giant "Pools"
Each "pool" contains "sections." Each "Pool" is dedicated to people who own a specific item. Each section is for people who own that item, and want another.
What I mean:
Pool A (For those requesting A)
--Section B (For those requesting A that have B)
--Section C (For those requesting A that have C, even if they also have B)
Pool B
--Section A
--Section B
Pool C
--Section A
--Section C
Each section is filled with people.
"Deals" would consist of one "Requested" item, and a "Pack," you're willing to give any or all of the items up to get the item you requested.
Every "Deal" is calculated per-pool.... if you want a given item, you go to the pools of the items you'd be willing to give, and find the Section which belongs to the item you are requesting.
Likewise, your deal is placed in the pools. So you can immediately find all of the applicable people, because you know EXACTLY which pools, and EXACTLY which sections to search in, no sorting necessary once they've entered the system.
And, then, age would have priority, older deals would be picked, rather than new ones.
Let's assume you can hash your items, or at least sort them. Assume your goal is to find the best result for a given user, on request, as in your original example. (Optimizing trading partners to maximize overall trade value is a different question.)
This would be fast. O(log n) for each insertion operation. Worst case O(n) for suggesting trading partners, but you bound this by processing time.
1) You're already maintaining a list of items per user.
2) Give each user a score equal to the sum of the values of the items they have.
3) Maintain a list of user-HAVES and user-WANTS per item (@Dialecticus), sorted by user score. (You can sort on demand, or keep the lists sorted dynamically every time a user changes their HAVE list.)
4) When a user user1 requests suggested trade partners:
4.1) Iterate over their items item in order by value.
4.2) Iterate over the user-HAVES user2 for each item, in order by user score.
4.2.1) Compute trade value for user1 trades-with user2.
4.2.2) Remember best trade so far.
4.2.3) Keep hash of users processed so far to avoid recomputing value for a user multiple times.
4.3) Terminate when you run out of processing time (your real-time guarantee).
Sorting by item value and user score is the approximation that makes this fast. I'm not sure how sub-optimal it would be, though. There are certainly easy examples where this would fail to find the best trade if you don't run it to completion. In practice, it seems like it might be good enough. In the limit, you can make it optimal by letting it run until it exhausts the lists in step 4.1 and 4.2. There's extra memory cost associated with the inverted lists, but you didn't say you were memory constrained. And generally, if you want speed, it's not uncommon to trade-off space to get it.
I mark item by letter and user by number.
m - number of items in all have/want lists (have or want, not have and want)
x - number of users.
For each user you have list of his wants and haves. Left line is want list, right is have list (both will be sorted so we can use binary search).
1 - ABBCDE FFFGH
2 - CFGGH BE
3 - AEEGH BBDF
For each pair of users you generate two values and store them somewhere; you generate them only once and then keep them up to date. Sorting the first table and generating the second is O(m*x*log(m/x)) + O(log(m)) and will require O(x^2) extra memory. These values are: how many items the first user would get and how many the other would get (if you want, you can weight these values by multiplying them by the value of each particular item).
1-2 : 1 - 3 (user 1 gets 1) - (user 2 gets 3)
1-3 : 3 - 2
2-3 : 1 - 1
You also compute and store best trader for each user. After you've generated this helpful data you can quickly query.
Adding/Removing item - O(m*log(m/x)) (You loop through the user's have/want list, do a binary search on the have/want list of every other user, and update the data)
Finding best connection - O(1) or O(x) (Depends on whether result stored in cache is correct or needs to be updated. You loop through user's pairs and do whatever you want with data to return to user the best connection)
By m/x I estimate number of items in single user's want/have list.
In this algorithm I'm assuming that all data isn't stored in Database (I don't know if binary search is possible with Databases) and that inserting/removing item into list is O(1).
PS. Sorry for bad english and I hope I've computed it all correctly and that it is working because I also need it.
Of course you could always separate the system into three categories: "Wants," "Haves," and "Open Offers." So let's say User1 has Item A, User2 has Items B & C and is trading those for Item A, but User1 still wants Item D, and User2 wants Item E. So User1 (assuming he's the trade "owner") puts in a request, or want, for Item D and Item E; thus the offer stands and goes on the "Open Offers" list. If it isn't accepted or edited within two or so days, it's automatically cancelled. So User3 is looking for Item F and Item G, and searches the "Have" list for Items F & G, which are split between User1 & User2. He realizes that User1 and User2's open offer includes requests for Items D & E, which he has. So he chooses to "join" the operation, and it's accepted on their terms, trading and swapping the items among them.
Let's say User1 now wants Item H. He simply searches the "Have" list for the item, and among the results, he finds that User4 will trade Item H for Item I, which User1 happens to have. They trade, all is well.
Just make it BC only. That solves all problems.

How can I create a unique 7-digit code for an entity?

When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
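Pick a 7-digit prime number A and a big prime number B (different from A), and
long long nth_unique_7_digit_code(long long n) {
    /* 64-bit arithmetic so that n * B does not overflow (assuming B fits in 32 bits) */
    return (n * B) % A;
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static long long current_code = B;
long long get_next_unique_code() {
    current_code = (B * current_code) % A;
    return current_code;
}
This second sequence cycles through at most A-1 codes (exactly A-1 if B is a primitive root modulo A).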
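To see the multiply-and-mod idea in action, here is a small sketch. The constants are assumptions chosen for illustration: 9,999,991 as the 7-digit prime A and 2^31 - 1 as the large prime B. Codes are zero-padded to seven characters, so some start with 0.

```python
A = 9_999_991      # assumed choice for the 7-digit prime A
B = 2_147_483_647  # assumed choice for the large prime B (2**31 - 1)

def nth_code(n):
    # n * B mod A is distinct for every n in 0..A-1 because gcd(A, B) == 1.
    return f"{(n * B) % A:07d}"

codes = [nth_code(n) for n in range(1, 10_001)]
print(codes[:3], len(set(codes)))
```

Python integers do not overflow, so the 64-bit concern in the C version does not arise here.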
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
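A minimal sketch of that generate-and-check loop. It assumes the set `used` stands in for whatever table or unique index already stores the issued codes, and it draws from 1000000 to 9999999 so that every code has exactly seven digits with no leading zero.

```python
import random

def new_code(used, rng=random):
    """Return a fresh 7-digit code and record it in `used`."""
    while True:
        code = str(rng.randrange(10**6, 10**7))
        if code not in used:  # collision check; retry on the rare hit
            used.add(code)
            return code

rng = random.Random(0)  # seeded only to make the demo repeatable
used = set()
issued = [new_code(used, rng) for _ in range(10_000)]
print(issued[:3], len(used))
```

With a real database the check and the insert should happen atomically (for example via a unique constraint), otherwise two concurrent requests could pick the same code.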
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
I would suggest using a GUID instead of a 7-digit code, as it will be more unique and you don't have to worry about generating them, since .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself, plus the feature that you can save its state somewhere. So what you do is run the generator once, use the ID, and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
You have <10,000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
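For instance, the interleaved layout aAbBcCd could be produced as in the sketch below. The function names are made up, and the sequence number is assumed to come from some persistent counter.

```python
import random

def mixed_code(seq, rng=random):
    """Interleave a 4-digit sequence number with 3 random digits (aAbBcCd)."""
    if not 0 <= seq < 10_000:
        raise ValueError("sequence number must fit in 4 digits")
    a = f"{seq:04d}"
    r = f"{rng.randrange(1000):03d}"
    return a[0] + r[0] + a[1] + r[1] + a[2] + r[2] + a[3]

def sequence_of(code):
    # The sequence digits sit at the even positions of the code.
    return int(code[0] + code[2] + code[4] + code[6])

code = mixed_code(1234)
print(code, sequence_of(code))
```

Uniqueness comes entirely from the sequence digits. The random digits only make neighbouring codes look unrelated.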
I would try to use an LFSR (linear feedback shift register). The code is really simple and you can find examples everywhere (e.g. on Wikipedia), and even though it's not cryptographically secure, it looks very random. The implementation will also be very fast, since it mainly uses shift operations.
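As a rough sketch, a 23-bit register is a convenient size here, since every nonzero 23-bit state is below 8,388,608 and so fits in seven digits. The tap positions (23 and 18, the PRBS23 polynomial) are taken from standard maximal-length tap tables and are an assumption worth verifying before relying on them. The names are illustrative.

```python
from itertools import islice

def lfsr23(state):
    """Advance a 23-bit Fibonacci LFSR (taps 23 and 18) by one step."""
    bit = (state ^ (state >> 5)) & 1   # XOR of the two tapped bits
    return (state >> 1) | (bit << 22)  # shift right, feed the new bit in on top

def lfsr_codes(seed=1):
    """Yield zero-padded 7-digit codes; the seed must be nonzero."""
    state = seed
    while True:
        state = lfsr23(state)
        yield f"{state:07d}"

first = list(islice(lfsr_codes(), 10_000))
print(first[:3], len(set(first)))
```

Only the current state has to be stored between calls, which avoids both the lookup table and the collision check.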
With only thousands of items in the database, your original idea seems sound. Checking the existence of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
The probability of a collision is very low.
For instance, say you have 10^4 users and 10^7 possible IDs.
The probability that you pick a used ID 10 times in a row is then (10^-3)^10 = 10^-30.
This chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.

What is a quad linked list?

I'm currently working on implementing a list-type structure at work, and I need it to be crazy efficient. In my search for efficient data structures I stumbled across a patent for a quad linked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, the internet was very secretive about the whole thing, and Google didn't produce much in terms of usable results. The only explanation I got was the patent description that stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So therefore I turn to you all in the hope that you can explain to me what this quad linked list really is, as I know that not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?
I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)
I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
|                       |
|----------->-----------|
 -----------<-----------
Now I can traverse the list more rapidly, especially if I choose the extra link targets carefully (i.e. ensure they always go back/forward half of the offset of the item they point from in the list length). I then find the rough interval I want with these links, and use the normal links to find the item.
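As a rough illustration of that idea, here is a sketch of a sorted doubly linked list where every `stride`-th node also carries a forward "skip" link. This is only my guess at the general shape, not the patented structure: the names `Node`, `linkNodes` and `findKey`, and the fixed `stride`, are all assumptions. For brevity it only adds forward skip links, and with a fixed stride the search is a coarse scan followed by a fine scan rather than a true binary search.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int key = 0;
    Node* next = nullptr;  // ordinary forward link
    Node* prev = nullptr;  // ordinary backward link
    Node* skip = nullptr;  // extra link: jumps `stride` nodes ahead
};

// Links `nodes` in place. The keys must already be in ascending order, and
// the vector must not be resized afterwards (the links point into it).
void linkNodes(std::vector<Node>& nodes, std::size_t stride) {
    assert(stride > 0);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (i + 1 < nodes.size()) nodes[i].next = &nodes[i + 1];
        if (i > 0) nodes[i].prev = &nodes[i - 1];
        if (i % stride == 0 && i + stride < nodes.size()) {
            nodes[i].skip = &nodes[i + stride];
        }
    }
}

// Returns the node holding `target`, or nullptr. If `hops` is given, it
// receives the number of links that were followed.
const Node* findKey(const Node* head, int target, int* hops = nullptr) {
    const Node* cur = head;
    int count = 0;
    // Coarse phase: take skip links while they do not overshoot the target.
    while (cur && cur->skip && cur->skip->key <= target) {
        cur = cur->skip;
        ++count;
    }
    // Fine phase: ordinary links inside the interval found above.
    while (cur && cur->key < target) {
        cur = cur->next;
        ++count;
    }
    if (hops) *hops = count;
    return (cur && cur->key == target) ? cur : nullptr;
}
```

With 100 nodes and a stride of 10, reaching the node at index 75 follows 12 links (7 skips plus 5 ordinary ones) instead of 75.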
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.
I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
#include <string>
struct Customer {
    // Normal doubly-linked list.
    Customer *nextCustomer;
    Customer *prevCustomer;
    // Links for traversal in firstName order.
    std::string firstName;
    Customer *nextByFirstName;
    Customer *prevByFirstName;
    // Links for traversal in lastName order.
    std::string lastName;
    Customer *nextByLastName;
    Customer *prevByLastName;
};
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
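To show where the upkeep cost comes from, here is a minimal sketch of inserting into two orderings at once. The helper names `spliceSorted`, `CustomerIndex` and `walk` are hypothetical, and I've left out the plain insertion-order links to keep it short.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Customer {
    std::string firstName;
    std::string lastName;
    Customer* nextByFirstName = nullptr;
    Customer* prevByFirstName = nullptr;
    Customer* nextByLastName = nullptr;
    Customer* prevByLastName = nullptr;
};

// Splices `node` into one sorted doubly-linked ordering. The ordering is
// chosen through pointers-to-members, so the same code serves both keys.
void spliceSorted(Customer*& head, Customer* node,
                  std::string Customer::*key,
                  Customer* Customer::*next,
                  Customer* Customer::*prev) {
    Customer* before = nullptr;
    Customer* after = head;
    while (after && after->*key <= node->*key) {
        before = after;
        after = after->*next;
    }
    node->*prev = before;
    node->*next = after;
    if (before) before->*next = node; else head = node;
    if (after) after->*prev = node;
}

struct CustomerIndex {
    Customer* headByFirstName = nullptr;
    Customer* headByLastName = nullptr;

    // Every insert has to update both orderings: that is the upkeep cost.
    void insert(Customer* c) {
        assert(c != nullptr);
        spliceSorted(headByFirstName, c, &Customer::firstName,
                     &Customer::nextByFirstName, &Customer::prevByFirstName);
        spliceSorted(headByLastName, c, &Customer::lastName,
                     &Customer::nextByLastName, &Customer::prevByLastName);
    }
};

// Collects one field while walking one ordering from its head.
std::vector<std::string> walk(const Customer* head,
                              std::string Customer::*field,
                              Customer* Customer::*next) {
    std::vector<std::string> out;
    for (const Customer* c = head; c; c = c->*next) out.push_back(c->*field);
    return out;
}
```

Each insert here is a linear scan per ordering, which is the "expensive to keep up to date" part. Once the links are in place, walking either ordering is a plain pointer chase.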
Of course, this could be something completely different.
My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, i.e. sorted according to FieldX or FieldY:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).
One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining, by linear search, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
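For what it's worth, steps (c) through (e) can be sketched in a few lines. This is only my reading of the claim, not code from the patent: `Record`, `buildFieldPointers` and `findByFixedId` are made-up names, and a `std::list` stands in for the successor/predecessor link pointers of one ordering. Steps (f) and (g) would repeat the same thing over a second ordering by the variable ID field.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

struct Record {
    int fixedId;
    int variableId;
};

using RecordList = std::list<Record>;  // doubly linked: successor + predecessor
using Iter = RecordList::const_iterator;

// Step (c): pointers to every Nth record of a list sorted by fixedId.
std::vector<Iter> buildFieldPointers(const RecordList& sorted, std::size_t n) {
    assert(n > 0);
    std::vector<Iter> pointers;
    std::size_t i = 0;
    for (Iter it = sorted.begin(); it != sorted.end(); ++it, ++i) {
        if (i % n == 0) pointers.push_back(it);
    }
    return pointers;
}

// Steps (d) and (e): binary search the field pointers to bracket the record,
// then scan that range linearly. Returns sorted.end() if nothing matches.
Iter findByFixedId(const RecordList& sorted,
                   const std::vector<Iter>& pointers, int id) {
    // First field pointer whose record has fixedId greater than id.
    auto upper = std::upper_bound(
        pointers.begin(), pointers.end(), id,
        [](int value, Iter p) { return value < p->fixedId; });
    if (upper == pointers.begin()) return sorted.end();  // id precedes all records
    Iter first = *(upper - 1);
    Iter last = (upper == pointers.end()) ? sorted.end() : *upper;
    for (Iter it = first; it != last; ++it) {
        if (it->fixedId == id) return it;
    }
    return sorted.end();
}
```

The binary search costs O(log(n/N)) comparisons and the linear scan at most N steps, so N trades the size of the pointer array against the length of the final scan.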
The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.

Resources