An efficient way to find matching items in N lists?

Given a number of lists of items, find the lists with matching items.
The brute force pseudo-code for this problem looks like:
foreach list L
    foreach item I in list L
        foreach list L2 such that L2 != L
            foreach item I2 in L2
                if I == I2
                    return new 3-tuple(L, L2, I) // not important for the algorithm
I can think of a number of different ways of going about this - creating a list of lists and removing each candidate list after searching the others for example - but I'm wondering if there is a better algorithm for this?
I'm using Java, if that makes a difference to your implementation.
Thanks

Create a Map<Item, List<List<Item>>>.
Iterate through every item in every list.
Each time you touch an item, add the current list to that item's entry in the Map.
You now have a Map entry for each item that tells you which lists that item appears in.
This algorithm is roughly O(M), where M is the total number of items across all the lists (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).
Caveat: I am comparing number of comparisons, not memory use. If your lists are huge and full of mostly non-duplicated items, the map that my method creates might become too big.

As per your comment, you want a MultiMap implementation. A multimap is like a Map, but it can map each key to multiple values. Store each item and a reference to all the lists that contain it:
Map<Object, List>
Of course you should use a type-safe key instead of Object, and a type-safe List as the value. What you are trying to do is called an Inverted Index.
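A minimal Java sketch of that inverted index, assuming String items (the method name buildIndex is made up for illustration):

import java.util.*;

static Map<String, List<List<String>>> buildIndex(List<List<String>> lists) {
    Map<String, List<List<String>>> index = new HashMap<>();
    for (List<String> list : lists) {
        for (String item : list) {
            // create the entry the first time an item is seen, then record this list
            index.computeIfAbsent(item, k -> new ArrayList<>()).add(list);
        }
    }
    return index;
}

Any entry whose value holds more than one list is an item that appears in several lists, which is exactly the match the question asks for.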

I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.
I refer below to a "set", where I am thinking of something like a C++ std::set; in Java, a HashSet or TreeSet would do. Any storage scheme that permits rapid lookup (tree, hash table, whatever) works.
Comparing three lists: L0, L1 and L2.
Read L0, placing each element in a set: S0.
Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
Discard S0.
Read L2, keeping items that match an element of S1 and discarding others.
Update
Just realised that the question was for "n" lists, not three. However the extension should be obvious. (I hope)
Update 2
Some untested C++ code to illustrate the algorithm
#include <string>
#include <vector>
#include <set>
#include <cassert>

typedef std::vector<std::string> strlist_t;

strlist_t GetMatches(std::vector<strlist_t> vLists)
{
    assert(vLists.size() > 1);
    std::set<std::string> s0, s1;
    std::set<std::string> *pOld = &s1;
    std::set<std::string> *pNew = &s0;

    // unconditionally load first list as "new"
    s0.insert(vLists[0].begin(), vLists[0].end());
    for (size_t i = 1; i < vLists.size(); ++i)
    {
        // swap recently read "new" to "old" for comparison with the next list
        std::swap(pOld, pNew);
        pNew->clear();

        // only keep new elements if they are matched in the old list
        for (size_t j = 0; j < vLists[i].size(); ++j)
        {
            if (pOld->end() != pOld->find(vLists[i][j]))
            {
                // found match
                pNew->insert(vLists[i][j]);
            }
        }
    }
    return strlist_t(pNew->begin(), pNew->end());
}

You can use a trie, modified to record what lists each node belongs to.
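A rough Java sketch of that trie idea, assuming String items and integer list ids (all names here are illustrative, not from the original answer):

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    Set<Integer> listIds = new HashSet<>(); // lists whose item ends at this node
}

class ItemTrie {
    private final TrieNode root = new TrieNode();

    // Adds item for listId; returns the ids of other lists that already contain it.
    Set<Integer> add(String item, int listId) {
        TrieNode node = root;
        for (char ch : item.toCharArray()) {
            node = node.children.computeIfAbsent(ch, c -> new TrieNode());
        }
        Set<Integer> matches = new HashSet<>(node.listIds);
        matches.remove(listId);
        node.listIds.add(listId);
        return matches;
    }
}

Compared with a plain HashMap, the trie additionally lets items that share prefixes share storage.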


Data Structure with fast access to nth of elements satisfying condition

I'm filling a stack/vector (a dynamically sized container with fast random access by index with insertion only at the end) with composite data (a struct, class, tuple…). For a specific attribute with a small set of possible values, I will want to access the nth of all elements in the stack where this attribute satisfies a condition. To achieve this, additional information can be stored along each composite or in a separate data structure.
Note that the vector is large and that the compared attribute has a small value range but is compared to a set of allowed values. Also the attributes aren't distributed evenly throughout composites in the vector.
A naïve O(n) approach, in C++-style pseudocode. How can I improve on this:
enum Fruit { apple, orange, banana, potato };

struct c {
    Fruit a;
    Data d;
};

// Let's assume v has a length of many thousand and that the distribution of
// fruits is *not* completely random, e.g. that maybe potato only rarely occurs
// or that bananas tend to come in packs
c getFruit(const vector<c>& v, const set<Fruit>& s, int n) {
    int counter = 0;
    // iterate over all of v's indices
    for (size_t i = 0; i < v.size(); i += 1) {
        if (s.count(v[i].a)) {
            if (n == counter) {
                return v[i];
            }
            counter += 1;
        }
    }
    throw out_of_range("fewer than n+1 matching elements");
}
// note: The attribute is compared to a set (arbitrary combination of fruits)!
getFruit(largeVector, set{apple, orange, potato}, 15234)
Another approach would be to create a vector for each possible set of fruits which would be super fast O(1) but not so memory efficient.
(Although I do have to implement this now, I'm really just asking out of curiosity because my data is small enough to just go with the naïve approach.)
Any argument for why there doesn't seem to be a more efficient way is very much appreciated as well.
Edit: It should be noted that new elements may be appended between queries for indices using the algorithm in question so any caches have to grow with the vector and both growing the vector and this filtered access should be fast.
For each index of the vector, store the preceding number of each fruit.
Then you can do a binary search to find the first index where the sum of the desired fruit counts is sufficient.
If you don't want to use that much memory, then store the counts in separate arrays, and only store them for every 16th index or so in the main array. Your binary search will then get you within 16 positions of the desired answer, and you can do a linear scan from there.
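A Java sketch of the prefix-count idea (illustrative only; the fruit is encoded as an int, and a new prefix row is appended whenever an element is appended, so the structure grows with the vector):

import java.util.*;

// prefix[i][f] = how many of the first i elements carry fruit f,
// so prefix has size+1 rows for a vector of size elements
static int find(int[][] prefix, int size, Set<Integer> wantedFruits, int n) {
    int lo = 1, hi = size; // smallest row where the combined count exceeds n
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        int count = 0;
        for (int f : wantedFruits) count += prefix[mid][f];
        if (count > n) hi = mid; else lo = mid + 1;
    }
    return lo - 1; // index of the (n+1)-th matching element
}

The counts are non-decreasing in i, so the binary search is valid; the caller must first check that the total count in the last row exceeds n.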

Linked List - Remove numbers from a specified range

I have a linked list that contains numbers from 0 to 1 and my task is to remove numbers from a given range (x, y) from this list. Do you have any idea how to solve that problem in a reasonable complexity?
Let's first think about how a LinkedList is structured.
Each element in a (doubly) linked list has a pointer to the next (and the previous) element. The Java class LinkedList is for example a doubly-linked list.
In such a list there is no direct access like "give me the index of element B". We only have a head reference (pointing at the start of the list) and a tail reference (pointing at the end). To find the element B, we need to start at head (or tail) and walk through the list, following the next (or prev) pointers, until we find element B.
So, back to your question: there is no efficient way to remove the elements of range (x, y) from a LinkedList. This can only be done efficiently in sorted structures like a PriorityQueue or a sorted ArrayList (where binary search yields O(log n)), or in one with direct access to elements, like a HashSet.
Here is a code snippet in Java that solves your task for LinkedList, however, as stated, it is not efficient and has a running time of O(n) (we need to take a look at each element in order to find out which elements need to be deleted):
LinkedList<Integer> list = ...
// Inclusive lower bound
int lowerBound = ...
// Exclusive upper bound
int upperBound = ...

ListIterator<Integer> listIter = list.listIterator();
while (listIter.hasNext()) {
    int value = listIter.next();
    // Check if the value is inside the bounds
    if (value >= lowerBound && value < upperBound) {
        // Remove the element from the list using the iterator,
        // which prevents ConcurrentModificationException
        listIter.remove();
    }
}
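For contrast, a sketch of the efficient route mentioned above, assuming the values can live in a sorted set instead of a LinkedList (and that they are distinct, since a set drops duplicates):

TreeSet<Double> values = new TreeSet<>(list);
// subSet(x, true, y, false) is a live view of the range [x, y);
// clearing the view removes exactly those elements from the backing tree.
values.subSet(x, true, y, false).clear();

Locating the range boundaries costs O(log n), and only the removed elements are touched after that.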
If you think about it, LinkedList has no efficient get-by-index. You can only start from the head and work your way to the tail, or vice versa, so the complexity of this is O(n).

Does Enumerable.OrderBy order the complete list if only the first element is requested

If a sequence is ordered and you only ask for the first element of the ordered sequence, is OrderBy smart enough not to order the complete sequence?
IEnumerable<MyClass> myItems = ...
MyClass maxItem = myItems.OrderBy(item => item.Id).FirstOrDefault();
So if only the first element is asked for, only the item with the minimum value needs to be found and yielded as the first element of the sequence. When the next element is asked for, the item with the minimum value of the remaining sequence is yielded, etc.
Or is the complete sequence completely ordered if you only want the first element?
Addition
Apparently the question is unclear. Let's give an example.
The Sort function could do the following:
Create a linked list containing all the elements.
As long as the linked list contains elements:
    Take the first element of the linked list as the smallest.
    Scan the rest of the linked list once to find any smaller elements.
    Remove the smallest element from the linked list.
    Yield return the smallest element.
Code:
public static IEnumerable<TSource> Sort<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));
    IComparer<TKey> comparer = Comparer<TKey>.Default;

    // create a linked list with key/value pairs of TKey and TSource
    var keyValuePairs = source
        .Select(item => new KeyValuePair<TKey, TSource>(keySelector(item), item));
    var itemsToSort = new LinkedList<KeyValuePair<TKey, TSource>>(keyValuePairs);

    while (itemsToSort.Any())
    {   // there are still items in the list
        // select the first element as the smallest one
        var smallest = itemsToSort.First();

        // scan the rest of the linked list to find the smallest one
        foreach (var element in itemsToSort.Skip(1))
        {
            if (comparer.Compare(element.Key, smallest.Key) < 0)
            {   // element.Key is smaller than smallest.Key: element becomes the smallest
                smallest = element;
            }
        }

        // remove the smallest element from the linked list and return its value:
        itemsToSort.Remove(smallest);
        yield return smallest.Value;
    }
}
Suppose I have the following sequence of integers:
{4, 8, 3, 1, 7}
At the first iteration the iterator internally creates a linked list of key/value pairs and assigns the first element of the list as smallest
Linked List = 4 - 8 - 3 - 1 - 7
Smallest = 4
The linked list is scanned once to see if there is a smaller one.
Linked List = 4 - 8 - 3 - 1 - 7
Smallest = 1
The smallest is removed from the linked list and yield return:
Linked List = 4 - 8 - 3 - 7
return 1
The second iteration the same is done with the shorter linked list
Linked List = 4 - 8 - 3 - 7
smallest = 4
Again the linked list is scanned once to find the smallest one
Linked List = 4 - 8 - 3 - 7
smallest = 3
Remove the smallest from the linked list and return the smallest
Linked List = 4 - 8 - 7
return 3
It's easy to see that if you only ask for first element in the sorted list, the list is scanned only once. Every iteration the list to scan becomes smaller.
Back to my original question:
I understand that if you only want the first element, you have to scan the list at least once. If you don't ask for a second element, the rest of the list is not ordered.
Is the sort used by Enumerable.OrderBy smart enough that it doesn't sort the remainder of the list if you only ask for the first ordered item?
It depends on the version.
In the framework versions (4.0, 4.5, etc.) then:
The entire source is loaded into a buffer.
A map of keys is produced (so that key production happens only once per element).
A map of integers is produced and then sorted according to those keys (using a map means swap operations have cheaper copies if the source elements are large value types).
The FirstOrDefault attempts to obtain the first item according to this mapping by using MoveNext and Current on the resulting object. Either it finds one, or (if the buffer is empty because the source was empty) returns default(TSource).
In .NET Core, then:
The FirstOrDefault operation on the IOrderedEnumerable scans through the source. If there are no elements it returns default(TSource); otherwise it holds onto the first element found and the key produced for it, compares that key with each subsequent element's key, and replaces the held-onto value and key whenever a later element compares as lower.
The held-onto value will be the same element as the Framework version would have found by first sorting, so it is returned.
This means that in the Framework version myItems.OrderBy(item => item.Id).FirstOrDefault() is O(n log n) time complexity (worst case O(n²)) and O(n) space complexity, but in the .NET Core version it is O(n) time complexity and O(1) space complexity.
The main difference here is that in .NET Core FirstOrDefault() has knowledge of how the results of OrderBy (and ThenBy etc.) differ from other possible sources and has code to handle it*, while in the framework version it does not.
Both scan the entire sequence (you can't know the last element in myItems isn't the first by the sorting rules until you've examined it) but they differ in the mechanism and efficiency after that point.
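For illustration only, here is the shape of that .NET Core fast path, sketched in Java rather than the actual library code: finding the first element of an ordering is just a single-pass minimum-by-key scan.

import java.util.function.Function;

// Equivalent of OrderBy(keySelector).FirstOrDefault() done in one pass:
// O(n) time, O(1) extra space.
static <T, K extends Comparable<K>> T firstByKey(Iterable<T> items,
                                                 Function<T, K> keySelector) {
    T best = null;
    K bestKey = null;
    for (T item : items) {
        K key = keySelector.apply(item);
        if (best == null || key.compareTo(bestKey) < 0) {
            best = item;    // hold onto the smallest-keyed element seen so far
            bestKey = key;
        }
    }
    return best; // null when the source was empty (the "default" case)
}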
When the next element is asked, the item with the minimum value of the remaining sequence is ordered etc.
If the next element is asked for, then not only would the sorting be done again, it would have to be done again, as the contents of myItems could have changed in the meantime.
If you were trying to obtain it with myItems.OrderBy(item => item.Id).ElementAtOrDefault(i) then the framework version would find the element by first doing a sort (O(n log n)) and then a scan (O(n) relative to i), while the .NET Core version would find it with a quickselect: O(n), though the constant factors are bigger than for FirstOrDefault(), and it can degrade to O(n²) in the same cases that sorting can, so it's a slower O(n) than that (it's smart enough to turn ElementAtOrDefault(0) into FirstOrDefault() for this reason). Both versions also use O(n) space (unless .NET Core can turn it into FirstOrDefault()).
If you were finding the first few values with myItems.OrderBy(item => item.Id).Take(k) then the Framework version would again do a sort (O(n log n)) and then put a limit on the subsequent enumeration of the results so that it stopped returning elements after k were obtained. The .NET Core version would do a partial sort, not bothering to sort elements it realised were always going to come after the portion taken, which is O(n + k log k) time complexity. .NET Core would also do a single partial sort for combinations of Take and Skip, reducing the amount of sorting necessary further.
In theory the sorting of just OrderBy(cmp) could be lazier as per:
Load the elements into the buffer.
Do a sort, probably favouring the "left" partition as partitioning is happening.
Yield elements as soon as it is found that they are the next to enumerate.
This would improve time-to-first-result (low time-to-first-result is often a nice feature of other Linq operations), and would particularly benefit consumers who may stop working on the result part way through. However, it adds extra constant costs to the sorting operation, and either prevents picking the next partition to work on in a way that reduces the amount of recursion (an important optimisation of partition-based sorting) or else would often not actually yield anything until near the end anyway (making the exercise rather pointless). It would also make the sorting much more complicated. When I experimented with this approach, the pay-offs in some cases didn't justify the costs in others, especially as it seemed likely to hurt more people than it benefited.
*Strictly speaking, the results of several linq operations have knowledge of how to find the first element in a way that is optimised for each of them, and FirstOrDefault() knows how to detect any of those cases.
If a sequence is ordered ...
That is fine but not a property of IEnumerable so OrderBy can never 'know' this directly.
There are precedents for this though, Count() will check at runtime if its IEnumerable<> source is actually pointing at a List and then take a shortcut to the Count property.
Likewise, OrderBy could look to see if it's called on a SortedList or something but there is no clear marker interface and those collections are used far too infrequently to make this worth the effort.
There are other ways to optimize this: .OrderBy().First() could conceivably map to a .Min(), but again, nobody has bothered till now as far as I know. See Jon's answer.
No, it's not. How can it know that the list is in order without iterating through the entire list?
Here's a simple test:
void Main()
{
    Console.WriteLine(OrderedEnumerable().OrderBy(x => x).First());
}

public IEnumerable<int> OrderedEnumerable()
{
    Console.WriteLine(1);
    yield return 1;
    Console.WriteLine(2);
    yield return 2;
    Console.WriteLine(3);
    yield return 3;
}
This, as expected, outputs:
1
2
3
1
If you look at the reference source and follow the classes you will see that all keys will be computed and then a quick sort algorithm will sort the index table according to the keys.
So the sequence is read once, all the keys are computed, then an index is sorted according to the keys and then you get your first output.
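To make the mechanism concrete, here is the buffered approach in Java for illustration (not the actual .NET reference source): compute each key once, then sort an index table according to those keys.

import java.util.*;
import java.util.function.Function;

// Compute one key per element, then sort an array of indices by key;
// the element buffer itself is never moved around.
static <T, K extends Comparable<K>> List<T> orderBy(List<T> source,
                                                    Function<T, K> keySelector) {
    List<K> keys = new ArrayList<>(source.size());
    for (T item : source) keys.add(keySelector.apply(item));
    Integer[] index = new Integer[source.size()];
    for (int i = 0; i < index.length; i++) index[i] = i;
    Arrays.sort(index, Comparator.comparing(keys::get)); // sort indices, not elements
    List<T> result = new ArrayList<>(source.size());
    for (int i : index) result.add(source.get(i));
    return result;
}

Only after the whole index table is sorted does the first output become available, which is why the test above prints all three source lines before the first result.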

Abstract in-place merge for an efficient merge sort

I am reading about merge sort in Algorithms in C++ by Robert Sedgewick and have the following questions.
static void mergeAB(ITEM[] c, int cl, ITEM[] a, int al, int ar, ITEM[] b, int bl, int br)
{
    int i = al, j = bl;
    for (int k = cl; k < cl+ar-al+br-bl+1; k++)
    {
        if (i > ar) { c[k] = b[j++]; continue; }
        if (j > br) { c[k] = a[i++]; continue; }
        c[k] = less(a[i], b[j]) ? a[i++] : b[j++];
    }
}
The characteristic of the basic merge that is worthy of note is that
the inner loop includes two tests to determine whether the ends of the
two input arrays have been reached. Of course, these two tests usually
fail, and the situation thus cries out for the use of sentinel keys to
allow the tests to be removed. That is, if elements with a key value
larger than those of all the other keys are added to the ends of the a
and aux arrays, the tests can be removed, because when the a (b) array
is exhausted, the sentinel causes the next elements for the c array to
be taken from the b (a) array until the merge is complete.
However, it is not always easy to use sentinels, either because it
might not be easy to know the largest key value or because space might
not be available conveniently.
For merging, there is a simple remedy. The method is based on the
following idea: Given that we are resigned to copying the arrays to
implement the in-place abstraction, we simply put the second array in
reverse order when it is copied (at no extra cost), so that its
associated index moves from right to left. This arrangement leads to
the largest element—in whichever array it is—serving as sentinel for
the other array.
My questions on above text
What does the statement "when the a (b) array is exhausted" mean? What is 'a (b)' here?
Why does the author mention that it is not always easy to know the largest key value, and how is space related to determining the largest key?
What does the author mean by "Given that we are resigned to copying the arrays"? What does "resigned" mean in this context?
Could someone explain, with a simple example, the idea that is mentioned as the simple remedy?
"When the a (b) array is exhausted" is a shorthand for "When either the a array or the b array is exhausted".
The interface is dealing with sub-arrays of a bigger array, so you can't simply go writing beyond the ends of the arrays.
The code copies the data from the two arrays into one other array. Since this copy is inevitable, being "resigned to copying the arrays" means we reluctantly accept that inevitability.
Tricky...that's going to take some time to work out what is meant.
Tangentially: That's probably not the way I'd write the loop. I'd be inclined to use:
int i = al, j = bl;
int k = cl;    // k must outlive the loop for the trailing copies below
for (; i <= ar && j <= br; k++)
{
    if (a[i] < b[j])
        c[k] = a[i++];
    else
        c[k] = b[j++];
}
while (i <= ar)
    c[k++] = a[i++];
while (j <= br)
    c[k++] = b[j++];
One of the two trailing loops does nothing. The revised main merge loop has 3 tests per iteration versus 4 tests per iteration in the original algorithm. I've not formally measured it, but the simpler merge loop is likely to be quicker than the original single-loop version.
The first three questions are almost best suited for English Language Learners.
a(b) and b(a)
Sometimes parentheses are used to state two parallel phrases at once:
when a (b) is exhausted we copy elements from b (a)
means:
when a is exhausted we copy elements from b,
when b is exhausted we copy elements from a
What is difficult about sentinels
Two annoying things about sentinels are
sometimes your array data may potentially contain every possible value, so there is no value you can use as a sentinel that is guaranteed to be bigger than all the values in the array
to use a sentinel instead of checking the index to see if you are done with an array requires that you have room for one extra space in the array to store the sentinel
Resigning
We programmers are never happy to copy (or move) things around; leaving them where they already are is, if possible, better (because we are lazy).
In this version of the merge sort we have already given up on trying not to copy things around... we are resigned to it.
Given that we must copy, we can copy things in the opposite order if we like (and of course use the copy in opposite order) because that is free(*).
(*) is free at this level of abstraction, the cost on some real CPU may be high. As almost always in the performance area YMMV.
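A small Java sketch of that remedy (untested; the index conventions are mine, not the book's): copy the second half in reverse while filling the auxiliary array, so the largest remaining element, whichever half it sits in, acts as the sentinel for the other half.

// Merge a[lo..mid] with a[mid+1..hi] through the auxiliary array aux.
static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
    // copy the first half as-is
    for (int k = lo; k <= mid; k++)
        aux[k] = a[k];
    // copy the second half in reverse order (at no extra cost)
    for (int k = mid + 1; k <= hi; k++)
        aux[k] = a[hi - (k - mid - 1)];
    // i moves right, j moves left; no end-of-range tests are needed, because
    // once one half is exhausted its index sits on the other half's largest
    // element, which blocks it until the merge completes
    int i = lo, j = hi;
    for (int k = lo; k <= hi; k++)
        a[k] = (aux[j] < aux[i]) ? aux[j--] : aux[i++];
}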

Algorithm to match roots between two string lists

The problem:
I am using a watch service to monitor a directory for input so I can fire an event once I have two (semi-)matching input files. The problem I have is: given two lists, each containing strings that may differ, how can I find matching roots between the lists as they occur?
The filename structure looks like this:
<companyname>-<ordernum><postfix>.csv
so for example:
list1 could contain:
mycomp-1234.csv
mycomp-4567.csv
newcomp-7891.csv
oldcomp-3376.csv
list2 could contain:
mycomp-2232_items.csv
newcomp-13123_items.csv
oldcomp-87078777_items.csv
mycomp-1234_items.csv
I want to find a match, and fire the event, as soon as one occurs between the lists. A match is any filename, less the postfix; i.e. mycomp-1234 would return a match for both lists.
What I'm looking for
I'm looking to find the most efficient manner to do this. I know I can iterate over each list comparing values, but I am sure there is a more efficient way to do this.
I do not need code, I'd rather learn this by myself, so a push in the right direction is perfect. If your fingers make you write code, please write pseudo code so it can benefit as many languages as possible.
And no, this is not homework. For those of you intensely curious folk this is to perform EDI transformations from csv to X12 EDI files.
Sort the lists alphabetically then compare the values and step forward in the list that has the smaller value. If the lists have any elements in common the values will match.
A side-by-side comparison of the two sorted lists:
Collections.sort(list1);
Collections.sort(list2);

int i1 = 0;
int i2 = 0;
while (i1 < list1.size() && i2 < list2.size()) {
    String name1 = list1.get(i1);
    String name2 = list2.get(i2);
    // split "mycomp-1234_items.csv" into {"mycomp", "1234", "items", "csv"}
    String[] parts1 = name1.split("[-_.]");
    String[] parts2 = name2.split("[-_.]");
    if (parts1.length < 3) {
        ++i1;
        continue;
    }
    if (parts2.length < 3) {
        ++i2;
        continue;
    }
    // compare company names first, then order numbers
    int cmp = parts1[0].compareTo(parts2[0]);
    if (cmp == 0) {
        cmp = parts1[1].compareTo(parts2[1]);
    }
    if (cmp < 0) {
        ++i1;
        continue;
    }
    if (cmp > 0) {
        ++i2;
        continue;
    }
    // Found match:
    ...
    ++i1;
    ++i2;
}
An online method: Maintain a binary search tree containing all the current filenames. Use as keys the relevant bits of the filenames. For example, the key for either newcomp-7891.csv or newcomp-7891_items.csv is newcomp-7891. Each time the watch service reports a directory event, you can delete disused names and attempt to add new names to the tree. If a key is already in the tree, fire your desired event.
A hash table can be used similarly, if the hash implementation supports deletion of keys when filenames are removed.
The question asks for "the most efficient manner to do this". Note that this method is far more efficient than sorting the lists from scratch each time a directory event occurs. An event with k additions and deletions takes O(k·lg n) time when the dataset has n entries, so over a period in which the average tree size is n and m additions/deletions occur, it will do O(m·lg n) work in total. By contrast, the sort-each-time methods suggested in other answers do O(u·n·lg n) work over u directory events, which is much more.
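A hash-based Java sketch of the online idea (the class and method names, and the postfix-stripping rule, are illustrative): map each root to the first filename seen for it, and fire as soon as a second filename arrives with the same root.

import java.util.*;

class RootMatcher {
    private final Map<String, String> seen = new HashMap<>(); // root -> first filename

    // Strip an optional _postfix and the .csv extension:
    // "mycomp-1234_items.csv" -> "mycomp-1234", "mycomp-1234.csv" -> "mycomp-1234"
    private static String root(String filename) {
        return filename.replaceAll("(_[^.]*)?\\.csv$", "");
    }

    // Returns the previously seen filename with the same root, or null if none yet.
    String add(String filename) {
        return seen.putIfAbsent(root(filename), filename);
    }

    // Call when the watch service reports a deleted file.
    void remove(String filename) {
        seen.remove(root(filename), filename);
    }
}

Each event costs O(1) expected time per filename. Note this fires on any second file with the same root, so if one list can itself contain duplicates you would also track which list each filename came from.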
