What algorithm would allow building optimal "groups" of terms?

I have a table of data and I want to pull specific records. The records are indicated in various, nigh-random ways (how isn't important), but I want to be able to identify them using 11 specific terms. Essentially, I'm being given a lot of queries against non-indexed fields and having to rewrite them using specific indexed fields -- except thanks to an Enterprisey System it's not as simple as that: the data has to be packaged in a certain way that avoids directly touching SQL.
It might be easier to give an example in 2 dimensions, although the problem itself uses 11 dimensions (and that number will probably change):
  123
 +---+
A|X O|
B| X |
C|X O|
 +---+
If I wanted to group all the X's in the above grid, I could say: A1 and B2 and C1. Better would be (A,C)1 and B2. Even better would be (A,B,C)(1,2) -- empty spaces can be included or excluded for this problem, they don't matter. What's important is keeping the number of groups down, getting all the Xs and avoiding all the Os.
To give a hint on sizing, the actual problem will generally deal with anywhere between 100 and 5000 "good" records. It is also not necessary to have The Ideal Answer -- a Pretty Good answer would suffice.

This sounds a lot like Karnaugh maps, with X=true, 0=false, and blank="don't care".
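For the 2D example, here is a rough greedy sketch of that idea (Java; the grid literal, the class name and the whole box-growing heuristic are my own illustration, not the full Karnaugh/Quine-McCluskey procedure): grow an axis-aligned group around each uncovered X by adding whole rows or columns as long as no O falls inside, then keep whatever groups were needed to cover every X. It aims for a Pretty Good answer rather than The Ideal one.

import java.util.*;

public class GroupSketch {
    // rows A, B, C and columns 1, 2, 3 from the example grid
    static final String[] GRID = { "X O", " X ", "X O" };

    public static void main(String[] args) {
        List<int[]> xs = new ArrayList<>();
        for (int r = 0; r < GRID.length; r++)
            for (int c = 0; c < GRID[r].length(); c++)
                if (GRID[r].charAt(c) == 'X') xs.add(new int[]{r, c});

        Set<String> covered = new HashSet<>();
        for (int[] seed : xs) {
            if (covered.contains(seed[0] + "," + seed[1])) continue;
            Set<Integer> rows = new TreeSet<>(List.of(seed[0]));
            Set<Integer> cols = new TreeSet<>(List.of(seed[1]));
            // widen the group one whole row/column at a time
            for (int r = 0; r < GRID.length; r++) tryAdd(rows, cols, r, true);
            for (int c = 0; c < GRID[0].length(); c++) tryAdd(rows, cols, c, false);
            for (int r : rows) for (int c : cols) covered.add(r + "," + c);
            System.out.println("group: rows=" + rows + " cols=" + cols);
        }
    }

    // add a row (or column) only if the enlarged box still contains no O
    static void tryAdd(Set<Integer> rows, Set<Integer> cols, int v, boolean isRow) {
        Set<Integer> rs = new TreeSet<>(rows), cs = new TreeSet<>(cols);
        (isRow ? rs : cs).add(v);
        for (int r : rs) for (int c : cs)
            if (GRID[r].charAt(c) == 'O') return;   // would swallow an O
        rows.addAll(rs);
        cols.addAll(cs);
    }
}

On the example grid this prints a single group covering rows 0..2 and columns 0..1, i.e. the (A,B,C)(1,2) grouping; for the real problem you would expand along each of the 11 fields in the same way.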

Related

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For this sort of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criteria, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
As I've understood, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to have the entries sorted by each possible xi (i = 1, 2, ...) in separate sets, and to have a separate 'world' for each combination of the Boolean criteria A, B, ...
The search works as follows:
Select the proper 'world' by the combination of Boolean criteria.
For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log N)).
Select the i where the population is lowest.
While iterating through the instances in the lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i).
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
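A minimal sketch of this idea (Java; the Row record, its field names and the two-Boolean/two-numeric shape are all assumptions for illustration). It keeps one pre-sorted 'world' per Boolean combination, binary-searches into the x1 range, and filters the remaining bounds while iterating; choosing the column with the smallest range population, as suggested above, is left out for brevity.

import java.util.*;

public class CriteriaLookup {
    record Row(boolean a, boolean b, double x1, double x2) {}

    // one "world" per Boolean combination, rows inside sorted by x1
    private final Map<String, List<Row>> byWorld = new HashMap<>();

    public CriteriaLookup(List<Row> rows) {
        for (Row r : rows)
            byWorld.computeIfAbsent(r.a() + "/" + r.b(), k -> new ArrayList<>()).add(r);
        for (List<Row> world : byWorld.values())
            world.sort(Comparator.comparingDouble(Row::x1));
    }

    // rows with matching Booleans, x1 in [lo1, hi1) and x2 in [lo2, hi2)
    public List<Row> query(boolean a, boolean b,
                           double lo1, double hi1, double lo2, double hi2) {
        List<Row> world = byWorld.getOrDefault(a + "/" + b, List.of());
        List<Row> out = new ArrayList<>();
        int i = lowerBound(world, lo1);                      // O(log N) jump to the range
        for (; i < world.size() && world.get(i).x1() < hi1; i++)
            if (world.get(i).x2() >= lo2 && world.get(i).x2() < hi2)
                out.add(world.get(i));                       // filter the other bounds
        return out;
    }

    // first index whose x1 is >= key
    private static int lowerBound(List<Row> rows, double key) {
        int lo = 0, hi = rows.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid).x1() < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}

For tables of a few hundred rows this avoids the full scan you describe; whether the setup cost pays off depends on how many queries hit each table between rebuilds.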

Most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant to detect two different person names which are actually the same (different because of a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimplified example. Are there programmers and computer scientists who have worked out this issue in more detail?
I can suggest an information-retrieval technique of doing so, but it requires a large collection of documents in order to work properly.
Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.
Once you get a name (Obaama, for example): retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Let this set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it will most likely be close to 1 for w=Obama [2].
You can manually label a set of examples and find the threshold value above which words are expected to be merged.
Using a standard lexicographical similarity technique you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log - find correlations between searched words: if obaama is correlated with obama in the query log, they are connected.
[1]: You can improve performance by first doing the 2nd (lexicographical) filter, and checking only candidates that are "similar enough" lexicographically.
[2]: Usually a normalization is also applied, because more frequent words are more likely to appear in the same documents as any word, regardless of whether they are related.
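A small sketch of the lexicographical pre-filter from note [1] (Java; the class name and the threshold of 2 are arbitrary illustrations). Note that edit distance alone would still happily merge Obama and Ibama, which is exactly why the co-occurrence score above is needed on top of it.

public class NameFilter {
    // classic two-row Levenshtein distance
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t;   // swap rows
        }
        return prev[b.length()];
    }

    // cheap pre-filter: only pairs this close are candidates for merging
    static boolean plausibleMisspelling(String a, String b) {
        return levenshtein(a.toLowerCase(), b.toLowerCase()) <= 2;
    }

    public static void main(String[] args) {
        System.out.println(plausibleMisspelling("Obaama", "Obama"));  // true
        System.out.println(plausibleMisspelling("Obama", "Barack"));  // false
    }
}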
You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Synchronize two ordered lists

We have two offline systems that normally can not communicate with each other. Both systems maintain the same ordered list of items. Only rarely will they be able to communicate with each other to synchronize the list.
Items are marked with a modification timestamp to detect edits. Items are identified by UUIDs to avoid conflicts when inserting new items (as opposed to using auto-incrementing integers). When synchronizing new UUIDs are detected and copied to the other system. Likewise for deletions.
The above data structure is fine for an unordered list, but how can we handle ordering? If we added an integer "rank", that would need renumbering when inserting a new item (thus requiring synchronizing all successor items due to only 1 insertion). Alternatively, we could use fractional ranks (use the average of the ranks of the predecessor and successor item), but that doesn't seem like a robust solution as it will quickly run into accuracy problems when many new items are inserted.
We also considered implementing this as a doubly linked list, with each item holding the UUIDs of its predecessor and successor items. However, that would still require synchronizing 3 items when 1 new item is inserted (or synchronizing the 2 remaining items when 1 item is deleted).
Preferably, we would like to use a data structure or algorithm where only the newly inserted item needs to be synchronized. Does such a data structure exist?
Edit: we need to be able to handle moving an existing item to a different position too!
There is really no problem with the interpolated rank approach. Just define your own numbering system based on variable length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point is to the left of the first digit.
The only inconvenience of this system is that the minimum possible key is 0 given by the empty bit vector. Therefore you use this only if you're positive the associated item will forever be the first list element. Normally, just give the first item the key 1. That's equivalent to 1/2, so random insertions in the range (0..1) will tend to minimize bit usage. To interpolate an item before and after,
01 < newly interpolated = 1/4
1
11 < newly interpolated = 3/4
To interpolate again:
001 < newly interpolated = 1/8
01
011 < newly interpolated = 3/8
1
101 < newly interpolated = 5/8
11
111 < newly interpolated = 7/8
Note that if you wish you can omit storing the final 1! All keys (except 0, which you won't normally use) end in 1, so storing it is superfluous.
Comparison of binary fractions is a lot like lexical comparison: 0<1 and the first bit difference in a left-to-right scan tells you which is less. If no differences occur, i.e. one vector is a strict prefix of the other, then the shorter one is smaller.
With these rules it's pretty simple to come up with an algorithm that accepts two bit vectors and computes a result that's roughly (or exactly in some cases) between them. Just add the bit strings, and shift right 1, dropping unnecessary trailing bits, i.e. take the average of the two to split the range between.
In the example above, if deletions had left us with:
01
111
and we need to interpolate these, add 01(0) and 111 to obtain 1.001, then shift to get 1001. This works fine as an interpolant. But note the final 1 unnecessarily makes it longer than either of the operands. An easy optimization is to drop the final 1 bit along with trailing zeros to get simply 1. Sure enough, 1 is about half way between, as we'd hope.
Of course if you do many inserts in the same location (think e.g. of successive inserts at the start of the list), the bit vectors will get long. This is exactly the same phenomenon as inserting at the same point in a binary tree. It grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for 14 you'd use the sequence above.
Addition
Though I haven't tried it, the Postgres bit string type seems to suffice for the keys I've described. The thing I'd need to verify is that the collation order is correct.
Also, the same reasoning works just fine with base-k digits for any k>=2. The first item gets key k/2. There is also a simple optimization that prevents the very common cases of appending and prepending elements at the end and front respectively from causing keys of length O(n). It maintains O(log n) for those cases (though inserting at the same place internally can still produce O(p) keys after p insertions). I'll let you work that out. With k=256, you can use indefinite length byte strings. In SQL, I believe you'd want varbinary(max). SQL provides the correct lexicographic sort order. Implementation of the interpolation ops is easy if you have a BigInteger package similar to Java's. If you like human-readable data, you can convert the byte strings to e.g. hex strings (0-9a-f) and store those. Then normal UTF8 string sort order is correct.
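For what it's worth, here is a small sketch of the "add and shift" interpolation described above (Java; the class and method names are mine, and the "drop the final 1" optimization is left out). Keys are bit strings read as binary fractions with the point to the left of the first bit.

import java.math.BigInteger;

public class FractionalRank {
    // midpoint between two keys given as bit strings, e.g. between("01", "111") -> "1001"
    static String between(String lo, String hi) {
        int len = Math.max(lo.length(), hi.length()) + 1;    // one extra bit for the carry
        BigInteger sum = new BigInteger(pad(lo, len), 2)
                .add(new BigInteger(pad(hi, len), 2));
        // reading the sum as a fraction over one extra bit is the "shift right by 1"
        String bits = sum.toString(2);
        while (bits.length() < len + 1) bits = "0" + bits;
        return bits.replaceFirst("0+$", "");                 // drop trailing zeros
    }

    // right-pad with zeros so both fractions are over the same power of two
    static String pad(String bits, int len) {
        StringBuilder sb = new StringBuilder(bits);
        while (sb.length() < len) sb.append('0');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(between("01", "111"));   // 1001, i.e. 9/16, as in the worked example
        System.out.println(between("", "1"));       // 01, i.e. 1/4, a key before the first item
    }
}

Sorting these keys with the prefix rule from the answer (or as Postgres bit strings / SQL varbinary) gives the list order, and only the newly inserted item's key needs to be shipped.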
You can add two fields to each item - 'creation timestamp' and 'inserted after' (containing the id of the item after which the new item was inserted). Once you synchronize a list, send all the new items. That information is enough for you to be able to construct the list on the other side.
With the list of newly added items received, do this (on the receiving end): sort by creation timestamp, then go one by one, and use the 'inserted after' field to add the new item in the appropriate place.
You may face trouble if an item A is added, then B is added after A, then A is removed. If this can happen, you will need to sync A as well (basically syncing the operations that took place on the list since the last sync, and not just the content of the current list). It's basically a form of log-shipping.
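A rough sketch of that replay on the receiving end (Java; the Item record and its field names are invented for illustration): sort the received items by creation timestamp, then splice each one in right after its 'inserted after' item, falling back to the front of the list if the predecessor is missing or null.

import java.util.*;

public class InsertAfterSync {
    record Item(UUID id, UUID insertedAfter, long createdAt, String value) {}

    static void apply(List<Item> list, List<Item> received) {
        received.sort(Comparator.comparingLong(Item::createdAt));   // oldest first
        for (Item it : received) {
            int pos = 0;                                            // fallback: front of the list
            for (int i = 0; i < list.size(); i++)
                if (list.get(i).id().equals(it.insertedAfter())) { pos = i + 1; break; }
            list.add(pos, it);
        }
    }
}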
You could have a look at "lenses", which is a bidirectional programming concept.
For instance, your problem seems to be solved by "matching lenses", described in this paper.
I think the data structure that is appropriate here is an order statistic tree. In an order statistic tree you also maintain the size of each subtree along with the other data; the size field makes it easy to find an element by rank, which is what you need. All operations like rank, delete, change position and insert are O(log n).
I think you can try a kind of transactional approach here. For example, you do not delete items physically but mark them for deletion, and commit changes only during synchronization. I'm not absolutely sure which data type you should choose; it depends on which operations you want to be more performant (insertions, deletions, search or iteration).
Say we have the following initial state on both systems:
|1| |2|
--- ---
|A| |A|
|B| |B|
|C| |C|
|D| |D|
After that, the first system marks element B for deletion and the second system inserts element BC between B and C:
|1         |    |2           |
------------    --------------
|A         |    |A           |
|B[deleted]|    |B           |
|C         |    |BC[inserted]|
|D         |    |C           |
                |D           |
Both systems continue processing, taking local changes into account: System 1 ignores element B and System 2 treats element BC as a normal element.
When synchronization occurs:
As I understand it, each system receives the list snapshot from the other system and both systems freeze processing until the synchronization is finished.
Each system then iterates sequentially through the received snapshot and the local list and writes changes to the local list (resolving possible conflicts according to the modification timestamp). After that the 'transaction is committed': all local changes are finally applied and the information about them is erased.
For example for system one:
|1 pre-sync|                |2-SNAPSHOT  |    |1 result|
------------                --------------    ----------
|A         |  <the same>    |A           |    |A       |
|B[deleted]|  <delete B>    |B           |
              <insert BC>   |BC[inserted]|    |BC      |
|C         |  <same>        |C           |    |C       |
|D         |  <same>        |D           |    |D       |
Systems wake up and continue processing.
Items are sorted by insertion order; moving can be implemented as a simultaneous deletion and insertion. Also, I think it would be possible not to transfer the whole list snapshot but only the list of items that were actually modified.
I think, broadly, Operational Transformation could be related to the problem you are describing here. For instance, consider the problem of Real-Time Collaborative text editing.
We essentially have a sorted list of items (words) which needs to be kept synchronized, and which could be added/modified/deleted at random within the list. The only major difference I see is in the periodicity of modifications to the list (you say it does not happen often).
Operational Transformation happens to be a well-studied field. I could find this blog article giving pointers and an introduction. Plus, for all the problems Google Wave had, they actually made significant advancements to the domain of Operational Transformation. Check this out. There is quite a bit of literature available on this subject. Look at this Stack Overflow thread, and at Differential Synchronisation.
Another parallel that struck me was the data structure used in Text Editors - Ropes.
So if you have a log of operations, let's say "Index 5 deleted", "Index 6 modified to ABC", "Index 8 inserted", what you might now have to do is transmit the log of changes from System A to System B, and then reconstruct the operations sequentially on the other side.
The other "pragmatic engineer" choice would be to simply reconstruct the entire list on System B when System A changes. Depending on the actual frequency and size of changes, this might not be as bad as it sounds.
I have tentatively solved a similar problem by including a PrecedingItemID (which can be null if the item is the top/root of the ordered list) on each item, and then having a sort of local cache that keeps a list of all items in sorted order (this is purely for efficiency—so you don't have to recursively query for or build the list based on PrecedingItemIDs every time there is a re-ordering on the local client). Then when it comes time to sync I do the slightly more expensive operation of looking for cases where two items are requesting the same PrecedingItemID. In those cases, I simply order by creation time (or however you want to reconcile which one wins and comes first), put the second (or others) behind it, and move on ordering the list. I then store this new ordering in the local ordering cache and go on using that until the next sync (just making sure to keep the PrecedingItemID updated as you go).
I haven't unit tested this approach yet—so I'm not 100% sure I'm not missing some problematic conflict scenario—but it appears at least conceptually to handle my needs, which sound similar to those of the OP.

Which is more efficient - Computing results using a function in real time or reading the results directly from a database?

Let us take this example scenario:
There exists a really complex function that involves mathematical square roots and cube roots (which are slower to process) to compute its output. As an example, let us assume the function accepts two parameters a and b and the input range for both the values a and b are well-defined. Let us assume the input values a and b can range from 0 to 100.
So essentially fn(a,b) can be either computed in real time or its results can be pre-filled in a database and fetched as and when required.
Method 1: Compute in realtime
function fn(a, b) {
  result = compute_using_cuberoots(a, b)
  return result
}
Method 2: Fetch the function result from a database
We have a database pre-filled with the input values mapped to the corresponding result:
a | b | result
0 | 0 | 12.4
1 | 0 | 14.8
2 | 0 | 18.6
. | . | .
. | . | .
100 | 100 | 1230.1
And we can
function fn(a, b) {
  result = fetch_from_db(a, b)
  return result
}
My question:
Which method would you advocate and why? Why do you think one method is more efficient than the other?
I believe this is a scenario that most of us will face at some point during our programming life and hence this question.
Thank you.
Question Background (might not be relevant)
Example: In scenarios like image processing, it is possible to come across such situations more often, where the range of values for the input (R,G,B) is known (0-255) and the mathematical computation of square roots and cube roots adds too much time for the server requests to be completed.
Suppose, for example, you're building an app like Instagram - the time taken to process an image sent to the server by the user and the time taken to return the processed image must be kept minimal for an optimal user experience. In such situations, it is important to minimize the time taken to process the image. Worse yet, scalability problems are introduced when the number of such processing requests grows large.
Hence it is necessary to choose between one of the methods described above that will also be the most optimal method in such situations.
More details on my situation (if required):
Framework: Ruby on Rails, Database: MongodB
I wouldn't advocate either method, I'd test them both (if I thought they were both reasonable) and get some data.
Having written that, I'll rise to the bait: given the relative speed of computation vs I/O I would expect computation to be faster than retrieving the function values from a database. I'll acknowledge the possibility (and no more) that in some special cases an in-memory database will be able to outperform (re-)computation, but as a general rule, no.
"More efficient" is a fuzzy term. "Faster" is more concrete.
If you're talking about a few million rows in a SQL database table, then selecting a single row might well be faster than calculating the result. On commodity hardware, using an untuned server, I can usually return a single row from an indexed table of millions of rows in just a few tenths of a millisecond. But I'd think hard before installing a dbms server and building a database only for this one purpose.
To make "faster" a little less concrete, when you're talking about user experience, and within certain limits, actual speed is less important than apparent speed. The right kind of feedback at the right time makes people either feel like things are running fast, or at least makes them feel like waiting just a little bit is not a big deal. For details about exactly how to do that, I'd look at User Experience on the Stack Exchange network.
The good thing is that it's pretty simple to test both ways. For speed testing just this particular issue, you don't even need to store the right values in the database. You just need to have the right keys and indexes. I'd consider doing that if calculating the right values is going to take all day.
You should probably test over an extended period of time. I'd expect there to be more variation in speed from the dbms. I don't know how much variation you should expect, though.
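As a crude way to "test them both and get some data", a sketch like the following (Java; fn is a stand-in for the real function, and a HashMap stands in for the pre-filled store, so this is the lookup side's best case with no database round trip at all) at least shows the shape of the experiment:

import java.util.HashMap;
import java.util.Map;

public class ComputeVsLookup {
    // stand-in for the "really complex" function over a, b in 0..100
    static double fn(int a, int b) {
        return Math.cbrt(a * 1000.0 + b) + Math.sqrt(a + b + 1.0);
    }

    public static void main(String[] args) {
        Map<Integer, Double> table = new HashMap<>();
        for (int a = 0; a <= 100; a++)
            for (int b = 0; b <= 100; b++)
                table.put(a * 101 + b, fn(a, b));            // pre-fill all 101 x 101 inputs

        long t0 = System.nanoTime();
        double s1 = 0;
        for (int i = 0; i < 1_000_000; i++) s1 += fn(i % 101, (i / 101) % 101);
        long t1 = System.nanoTime();
        double s2 = 0;
        for (int i = 0; i < 1_000_000; i++) s2 += table.get((i % 101) * 101 + (i / 101) % 101);
        long t2 = System.nanoTime();

        System.out.printf("compute: %d ms, lookup: %d ms (checksums %.1f / %.1f)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}

This is not a rigorous benchmark (no warm-up, no JIT control), and it says nothing about a real MongoDB or SQL round trip; it only answers whether the computation itself is slow enough to be worth caching at all.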
Computing results beforehand and reading them from a table can be a good solution if the inputs are fixed values. Computing in real time and caching the results for an optimal time can be a good solution if the inputs vary between situations.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" Donald Knuth
I'd consider using a hash as a combination of calculating and storing. With the really complex function represented as a**b:
lazy = Hash.new{|h,(a,b)|h[[a,b]] = a**b}
lazy[[4,4]]
p lazy #=> {[4, 4]=>256}
I'd think about storing the values in the code itself:
class MyCalc
  RESULTS = [
    [12.4, 14.8, 18.6, ...],
    ...
    [..., 1230.1]
  ]
  def self.fn(a, b)
    RESULTS[a][b]
  end
end
MyCalc.fn(0,1) #=> 14.8

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel where, when you type 3 rows with a certain pattern and drag the column all the way down, Excel tries to continue the pattern for you.
For example
Type...
test-1
test-2
test-3
Excel will continue it with:
test-4
test-5
test-n...
Same works for some other patterns such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
test-blue-somethingelse
test-yellow-somethingelse
test-red-somethingelse
Now based on these entries I want to say that the pattern is:
test-[DYNAMIC]-something
Continuing the [DYNAMIC] with other colours is a whole other deal; I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large pool of entries. Assume that you have 10,000 strings with this kind of pattern, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important, but I'm planning to implement this in .NET; any hint about the algorithms to use would be quite helpful.
As soon as you start considering finding the dynamic parts of patterns of the form <const1><dynamic1><const2><dynamic2>... without any other assumptions, you need to find the longest common subsequence (LCS) of the sample strings you have provided. For example, if I have test-123-abc and test-48953-defg, then the LCS would be test- and -. The dynamic parts would then be the gaps between the pieces of the LCS. You could then look up your dynamic part in an appropriate data structure.
The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.
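A minimal two-string LCS sketch of this (Java; LcsSketch is just an illustrative name): the characters the backtrack matches form the constant skeleton, and everything skipped between two matched characters is a [DYNAMIC] gap.

public class LcsSketch {
    // standard DP table + backtrack for the LCS of two strings
    static String lcs(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        StringBuilder out = new StringBuilder();
        for (int i = a.length(), j = b.length(); i > 0 && j > 0; ) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) { out.append(a.charAt(i - 1)); i--; j--; }
            else if (dp[i - 1][j] >= dp[i][j - 1]) i--;
            else j--;
        }
        return out.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("test-123-abc", "test-48953-defg"));   // prints "test-3-"
    }
}

Note that on the example pair the character-level LCS also picks up the shared '3', so it prints test-3- rather than the intuitive test- and -; in practice you would group similar strings first or post-process short matched runs, as the caveats above suggest.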
Finding [dynamic] isn't that big of a deal; you can do that with 2 strings - just start at the beginning and stop when they stop being equal, do the same from the end, and voila - you've got your [dynamic].
something like (pseudocode - kinda):
String s1 = "asdf-1-jkl";
String s2 = "asdf-2-jkl";
// scan the common prefix
int start = 0;
while (start < s1.length() && start < s2.length()
        && s1.charAt(start) == s2.charAt(start))
    start++;
// scan the common suffix, stopping before it runs into the prefix
int e1 = s1.length() - 1, e2 = s2.length() - 1;
while (e1 >= start && e2 >= start && s1.charAt(e1) == s2.charAt(e2)) {
    e1--;
    e2--;
}
String dyn1 = s1.substring(start, e1 + 1); // "1"
String dyn2 = s2.substring(start, e2 + 1); // "2"
About your 10k data set: you would need to call this (or maybe a slightly more optimized version) for each combination to figure out your pattern (10k x 10k calls), and then sort the results by pattern (i.e. save the beginning and the ending and sort by these fields).
I think what you need is to compute something like the Levenshtein distance to find groups of similar strings, and then, within each group of similar strings, identify the dynamic part with a typical diff-like algorithm.
Google Docs might be better than Excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets - for example, in the example you gave it would recognise blue, red, yellow... as part of the set 'colours'. It has far more complete pattern recognition than Excel, so it would stand a better chance of continuing the pattern.

Resources